Between March 10 and April 15 of this year, four things shipped that will reshape on-call rotations for the rest of the decade.
Azure SRE Agent went GA. AWS DevOps Agent went GA three weeks later. They started talking to each other through a shared MCP server on April 1. And on April 15, Microsoft moved the whole thing to per-agent-hour token billing.
Microsoft published the headline numbers: 35,000 incidents mitigated, 1,300 agents deployed, 20,000 engineering hours saved. The slide deck made the rounds. CTOs forwarded it to platform leads with a one-line ask: "Why aren't we doing this?"
Here's the awkward truth nobody is putting on a slide: that 35,000-incident number is Microsoft running Microsoft's services on Microsoft's infrastructure. It's a real result. It is not a benchmark for what your team is going to get on day one.
This is the post we wish someone had handed us in late March. A clear-eyed read of what shipped, what the AAU bill will look like, the four-rung autonomy ladder you actually have to choose from, and a 90-day rollout plan that won't get rolled back by your security team in week three.
1. What actually shipped between March 10 and April 15
Five weeks. Four announcements. Close enough together that it was easy to miss one; big enough that all of them matter.
March 10 — Azure SRE Agent GA. Public preview opened in late 2025; GA brought role-based access control to actionable autonomy, full audit logging, and an out-of-the-box approval workflow that integrates with Microsoft change-management policy. The same release announced 1,300 internal agents and the 35,000-incident track record. Ecolab, the cleaning and water-treatment giant, was named as the headline external customer.
March 31 — AWS DevOps Agent GA. AWS shipped a more cautious analogue: agent-led investigation across CloudWatch, X-Ray, Config, and CloudTrail, with execution gated behind explicit IAM permissions and SSM Run Command runbooks. AWS's framing is frontier autonomy — the model is allowed to act in narrowly-scoped, pre-approved domains.
April 1 — Cross-cloud MCP integration. The one that didn't get the press release it deserved. AWS published an MCP server for Bedrock, CloudWatch, and EKS. Azure SRE Agent connected to it on day one. Suddenly an alert fires in EKS, Azure SRE Agent runs a CloudWatch query through the AWS MCP server, the AWS DevOps Agent picks up the investigation thread, and a human gets one consolidated incident summary instead of two pages from two clouds. First-of-its-kind hyperscaler-to-hyperscaler agent handoff.
April 15 — AAU token billing. The most controversial change of the four. Azure SRE Agent left fixed-hourly pricing behind and moved to a metered model: 4 Azure Action Units (AAUs) per agent-hour as the always-on cost, plus per-million-tokens active-flow charges priced per model. Predictable in theory; surprising the first month if you didn't realise what always-on means for a fleet of agents.
Together these aren't four separate news items. They're a single architectural shift: the agent layer is now multi-cloud, metered, and yours to govern.
2. The 35,000-incident dataset everyone is misreading
Read the Microsoft GA blog post twice and you'll notice what's missing. There's no breakdown by incident severity. No mean-time-to-resolution by service tier. No false-positive rate for autonomous actions. The 35,000 number is real — Microsoft is running 1,300 SRE Agent instances inside Microsoft. But the population isn't your population.
The dataset that should drive your boardroom conversation isn't Microsoft's internal one. It's the Ecolab case study that landed the same week.
Ecolab — global cleaning and water-treatment giant, not a tech company — connected SRE Agent to roughly 550 Azure resources hosting their largest monolithic application. Daily performance alerts in their NOC dropped from 30–40 to under 10. The agent didn't fix most of them. It correlated, classified, and dismissed the duplicates and false positives that the alerting rules couldn't disambiguate.
That is the realistic first-quarter outcome. Not autonomous remediation — correlated triage. If your incident topology looks anything like Ecolab's (monolith plus a tail of supporting services on Azure SQL, Service Bus, Application Gateway), this is the result you should plan against. If you're running a 200-microservice mesh on AKS with eight different observability vendors, calibrate down. Microsoft's 35,000 is the ceiling for what an extremely experienced SRE Agent operator running on a homogeneous Azure-native estate can extract. Ecolab's 30→10 is the floor — and the more honest baseline.
3. The four-rung autonomy ladder you actually have to choose from
Every vendor has its own naming scheme. Azure has assistive and automated. AWS DevOps Agent has frontier autonomy. Rootly published a Levels 0–3 maturity model that's becoming the de-facto reference. PagerDuty introduced the term Virtual Responder in escalation policies and reserved Fully Autonomous Responder for H2 2026.
Cut through the marketing and there are four rungs. Every team has to pick which one they're on — per service, per runbook.
Rung 0 — Read-only. The agent is tail-following your logs, traces, and metrics. It produces incident summaries, suggests probable causes, drafts runbooks. It cannot push a button. Blast radius: zero. Skipping this rung is how you end up rolled back by audit in week three.
Rung 1 — Advised. The agent proposes a specific change — restart this pod, scale this ASG, rotate this cert. A human clicks approve. The action is logged with the agent's reasoning attached. This is where Azure SRE Agent's assistive mode lands. Blast radius: one service. Rollback: ≤5 minutes via the same agent.
Rung 2 — Approved. The agent autonomously executes from a pre-approved playbook library. Each playbook has been written, reviewed, and signed off by humans. The agent picks which one to run; it cannot improvise. Pre-approved actions: pod restarts, log truncation, expired-cert rotation, deploy rollback to last-good-build. Blast radius: one cluster or one tenant. Rollback: documented, automatic on probe failure.
Rung 3 — Guardrailed-autonomous. The agent self-decides and self-executes within hard guardrails: an explicit allowlist of action verbs, blast-radius caps (n pods or n%, whichever is smaller), automatic rollback triggers on a SLO probe, and a kill-switch wired to PagerDuty. AWS calls this frontier autonomy; Azure calls it automated; Rootly calls it Level 3. Practically, you're going to ship one or two runbooks at this rung in 2026 — not your whole catalogue.
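What do those rung-3 guardrails look like written down? A minimal sketch follows; every name in it (GuardrailPolicy, allowed_verbs, and so on) is illustrative rather than any vendor's API, but the deny-by-default shape is the part worth copying. The SLO-probe rollback trigger is omitted here because it runs after execution, not before.

```python
from dataclasses import dataclass, field

# Illustrative only: these names model the rung-3 guardrails described
# above, not an Azure or AWS API.

@dataclass
class ProposedAction:
    verb: str          # e.g. "restart_pod", "rollback_deploy"
    target_pods: int   # pods this action would touch
    fleet_size: int    # total pods in the affected deployment

@dataclass
class GuardrailPolicy:
    allowed_verbs: set[str] = field(
        default_factory=lambda: {"restart_pod", "rotate_cert", "rollback_deploy"}
    )
    max_pods: int = 5            # absolute blast-radius cap
    max_fleet_pct: float = 0.10  # relative cap: 10% of the fleet
    kill_switch_engaged: bool = False  # wired to a PagerDuty button in practice

    def authorize(self, action: ProposedAction) -> tuple[bool, str]:
        """Return (allowed, reason). Deny by default."""
        if self.kill_switch_engaged:
            return False, "kill switch engaged"
        if action.verb not in self.allowed_verbs:
            return False, f"verb {action.verb!r} not on the allowlist"
        # Blast radius: n pods or n% of the fleet, whichever is smaller.
        cap = min(self.max_pods, int(action.fleet_size * self.max_fleet_pct))
        if action.target_pods > max(cap, 1):
            return False, f"blast radius {action.target_pods} exceeds cap {cap}"
        return True, "within guardrails"

policy = GuardrailPolicy()
print(policy.authorize(ProposedAction("restart_pod", target_pods=2, fleet_size=40)))
# (True, 'within guardrails')
```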
The mistake teams make is assuming the rungs are sequential by vendor. They aren't. They're sequential by runbook. You should be running rung 0 globally, rung 1 on your top twenty alerts, rung 2 on three pre-approved playbooks, and rung 3 on exactly one — at least until your audit team has seen a clean quarter.
4. MCP is the real story
The April 1 cross-cloud MCP announcement got buried under the GA news, and that's a strategic error. Here's why it matters more than either Azure or AWS GA on its own.
Until now, every "AI for ops" demo has had the same hidden assumption: one cloud, one observability vendor, one CI system. The moment you have three of each — which describes basically every enterprise — the agent has to either (a) be re-trained on every tool's API, or (b) speak a shared protocol.
MCP is option (b). And on April 1, Anthropic's protocol officially became the wire format for hyperscaler-grade agent interop. Azure SRE Agent uses it to query AWS resources. AWS DevOps Agent uses it to query Datadog and PagerDuty. Datadog's Bits AI SRE uses it to read your GitHub source. PagerDuty's Spring 2026 release listed Anthropic, Cursor, LangChain, and 30+ other partners shipping MCP integrations.
The architectural implication: don't pick a winner among Azure SRE Agent, AWS DevOps Agent, Datadog Bits, or Rootly. Pick the MCP servers your existing tools expose, and let the agent layer commoditise over the next 18 months. The agent that mattered in 2024 is not going to be the agent that matters in 2027. The MCP server you ship for your internal Kubernetes platform might still be there.
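To make "speak a shared protocol" concrete, here is what the client side looks like with the official MCP Python SDK: connect to a server over stdio, discover its tools, call one. The server package name and the get_active_alarms tool are assumptions for illustration; discover the real tool names with list_tools rather than hard-coding them.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command for a CloudWatch-style MCP server;
# substitute the actual AWS-published server and its documented tools.
server = StdioServerParameters(
    command="uvx",
    args=["awslabs.cloudwatch-mcp-server@latest"],
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover what the server exposes instead of hard-coding it.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            # Invoke a tool by name with JSON arguments (name assumed here).
            result = await session.call_tool(
                "get_active_alarms", arguments={"region": "us-east-1"}
            )
            print(result.content)

asyncio.run(main())
```

Every tool behind that session looks the same to the agent, which is exactly why the agent layer commoditises and the MCP server does not.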
5. What the AAU bill will actually look like
The April 15 billing flip is the conversation your CFO wants you to have. Here's the math nobody at Microsoft will quote you in the demo.
Baseline. Each running SRE Agent consumes 4 AAUs per hour as the always-on cost — the cost of having the agent connected to your subscription, not the cost of work it does. AAUs are priced regionally; for most US regions, plan on roughly $0.10 per AAU. That's $0.40/hour, $9.60/day, $292/month per agent — before the agent does anything.
Active flow. On top of always-on, every action the agent reasons through is metered as tokens against the underlying model (Claude, GPT-4o, or Microsoft's own Phi line, depending on the runbook). Active-flow rates run from $3 to $15 per million tokens. A typical multi-step investigation burns 50K–200K tokens; a passive triage event maybe 10K. At those rates a major investigation costs $0.15–$3 in tokens: real money at volume, but rounding error next to the always-on line.
Worked example. A team running 10 SRE Agents (one per critical service tier) at always-on plus 200 incidents per month at an average 75K active tokens:
- Always-on: 10 × $292 = $2,920/month
- Active flow: 200 × 75K tokens × $5/M = $75/month
- Total: $2,995/month — call it $3K.
Tractable. The trap is the always-on number scaling with your service catalogue. Twenty agents = $5,840/mo before you've routed a single page. Fifty = $14,600. The Settings → AAU caps panel inside Azure SRE Agent will throttle agents past a monthly allocation; configure it the day you go GA, not the week after the bill arrives.
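If you'd rather not re-derive those numbers in a meeting, the whole cost model fits in one function. The defaults below are this section's assumptions (roughly $0.10 per AAU, a $5/M blended token rate), not published list prices; swap in your region's actuals.

```python
HOURS_PER_MONTH = 730  # the standard monthly-hours convention

def monthly_agent_cost(
    agents: int,
    incidents_per_month: int,
    avg_tokens_per_incident: float,
    aau_rate_usd: float = 0.10,           # assumed US-region AAU price
    aau_per_agent_hour: int = 4,          # always-on burn per agent
    token_rate_per_million: float = 5.0,  # assumed blended model rate
) -> dict[str, float]:
    always_on = agents * aau_per_agent_hour * aau_rate_usd * HOURS_PER_MONTH
    active_flow = (incidents_per_month * avg_tokens_per_incident / 1e6
                   * token_rate_per_million)
    return {"always_on": always_on, "active_flow": active_flow,
            "total": always_on + active_flow}

# Reproduces the worked example: 10 agents, 200 incidents at 75K tokens.
print(monthly_agent_cost(10, 200, 75_000))
# {'always_on': 2920.0, 'active_flow': 75.0, 'total': 2995.0}
```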
6. Three failure modes you won't see in the demo
Three risks the marketing won't surface. None of them are deal-breakers; all of them are scope items for your readiness work.
Cognitive drift. The Cloud Security Alliance's research on agentic systems lands on a stark observation: agents don't fail suddenly; they drift over time. Model versions update. Tool descriptions in MCP change. The agent's effective behaviour on a runbook in month six is not the agent's behaviour on the runbook in month one. Your governance has to include re-validation of every approved playbook on a calendar — at minimum quarterly, ideally per major model version.
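One lightweight pattern for that calendar: keep a set of golden incidents with the playbook a human signed off on, replay them through the agent after every model or MCP-server update, and fail the check on any divergence. A sketch, with the vendor-specific agent invocation stubbed out as a plain callable:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenIncident:
    incident_id: str
    alert_payload: dict
    expected_playbook: str  # what a human approved in month one

def revalidate(
    goldens: list[GoldenIncident],
    run_agent: Callable[[dict], str],  # vendor-specific; returns chosen playbook
) -> list[str]:
    """Return drift findings; empty means the agent still agrees with the
    decisions humans approved. Run quarterly and on every model bump."""
    findings = []
    for g in goldens:
        chosen = run_agent(g.alert_payload)
        if chosen != g.expected_playbook:
            findings.append(
                f"{g.incident_id}: expected {g.expected_playbook!r}, got {chosen!r}"
            )
    return findings

# Example with a stubbed agent that has drifted on one incident type.
goldens = [
    GoldenIncident("INC-101", {"alert": "disk_pressure"}, "truncate_logs"),
    GoldenIncident("INC-102", {"alert": "cert_expiry"}, "rotate_cert"),
]
drifted = lambda p: {"disk_pressure": "truncate_logs",
                     "cert_expiry": "restart_pod"}[p["alert"]]
print(revalidate(goldens, drifted))
# ["INC-102: expected 'rotate_cert', got 'restart_pod'"]
```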
The 44% missing-execution-data gap. ECI Research's 2025 builder survey found that only 56% of teams have telemetry of sufficient granularity for an agent to reason about the runtime: JVM pressure, GC pauses, Thanos query latency, network path metrics. The other 44% will get plausible-sounding agent answers grounded only in the data the agent can see, which means confidently wrong remediations. Audit your tail telemetry before you turn rung 2 on.
Agent washing. Gartner's June 2025 estimate: of the thousands of vendors who relabelled themselves agentic in the past 18 months, roughly 130 are doing actual agentic work. The gap between an LLM wrapper around your existing dashboard and a genuine reasoning loop with tool use, memory, and rollback is enormous. Demand a live demo on your data, not a recorded one. If the vendor can't run the demo against a sandbox of your actual stack within two weeks, they're a wrapper.
7. A 90-day rollout that survives an audit
If you're starting from zero today, this is the calendar that gives you a defensible position by the end of Q3 — with actual production value at every milestone.
Weeks 1–4 — Rung 0, non-prod tier.
- Connect SRE Agent to your dev/staging subscription only.
- Connect MCP servers for monitoring (Azure Monitor or Datadog), source control (GitHub or Azure DevOps), and your runbook repo (Confluence or Notion). No production write access.
- Use the agent for incident summaries, RCA narratives, and runbook draft generation. Score every output against the human-written post-mortem.
- Exit criteria: 4 consecutive weeks of agent-generated incident summaries with zero factual errors when checked against the human-written post-mortems (a sketch of this gate check follows below).
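The gate itself is easy to automate once reviewers record their scores somewhere structured. A sketch of the check, assuming a simple per-review record (field names are illustrative):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class SummaryReview:
    reviewed_on: date
    incident_id: str
    factual_errors: int  # counted by the human reviewer against the post-mortem

def weeks_clean(reviews: list[SummaryReview], weeks_required: int = 4) -> bool:
    """Rung-0 exit gate: True once each of the trailing `weeks_required`
    calendar weeks contains at least one review and zero factual errors."""
    if not reviews:
        return False
    latest = max(r.reviewed_on for r in reviews)
    for w in range(weeks_required):
        start = latest - timedelta(days=7 * (w + 1))
        end = latest - timedelta(days=7 * w)
        week = [r for r in reviews if start < r.reviewed_on <= end]
        if not week or any(r.factual_errors for r in week):
            return False
    return True

reviews = [SummaryReview(date(2026, 5, 1) + timedelta(days=3 * i), f"INC-{i}", 0)
           for i in range(10)]
print(weeks_clean(reviews))  # True: four trailing weeks, all clean
```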
Weeks 5–8 — Rung 1, advised mode in production.
- Promote SRE Agent to your prod subscription, but only on rung 1: every action is a proposal that requires a human approval click.
- Define and ship the audit log: agent reasoning + human reviewer + outcome, immutable, queryable (a minimal record sketch follows after this list).
- Pick your top three alerting categories (likely: pod evictions, memory pressure, certificate expiry) and run them through the advised flow.
- Exit criteria: 50 advised actions executed, ≥80% reviewer approval rate, zero production incidents caused by agent-proposed actions.
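Hash-chained JSONL is a cheap way to get tamper evidence on that log before you invest in a proper append-only store. A minimal sketch of one advised-action record; the field names are illustrative, not the product's schema:

```python
import hashlib
import json
from datetime import datetime, timezone

AUDIT_LOG = "agent_audit.jsonl"

def append_audit_record(action: str, reasoning: str, reviewer: str, outcome: str) -> None:
    """Append one advised-action record, hash-chained to the previous line so
    later tampering breaks the chain. 'Immutable' here means tamper-evident;
    pair with append-only storage for the real thing."""
    try:
        with open(AUDIT_LOG, "rb") as f:
            prev_hash = hashlib.sha256(f.readlines()[-1]).hexdigest()
    except (FileNotFoundError, IndexError):
        prev_hash = "genesis"
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,        # what the agent proposed
        "reasoning": reasoning,  # the agent's stated rationale
        "reviewer": reviewer,    # the human who clicked approve or deny
        "outcome": outcome,      # approved / denied / rolled_back
        "prev_hash": prev_hash,  # chain link for tamper evidence
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

append_audit_record(
    action="restart pod payments-7f9c",
    reasoning="OOMKilled 3x in 10m; memory limit unchanged since last deploy",
    reviewer="alice@example.com",
    outcome="approved",
)
```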
Weeks 9–12 — Rung 2, three pre-approved playbooks.
- Pick three playbooks where blast radius is bounded and rollback is well-understood: pod restart on liveness failure, log volume truncation on disk pressure, expired TLS cert rotation.
- Author the playbooks as your team would write them. The agent picks which to run; it does not improvise (the registry sketch after this list shows what that means mechanically).
- Wire the kill-switch: any rung-2 execution should be reversible by a single PagerDuty button or az CLI command.
- Exit criteria: 100 rung-2 executions, zero unintended consequences, audit team has reviewed the log and signed off in writing.
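Mechanically, "it does not improvise" means a closed registry: the agent supplies a key, never code. A sketch covering the three playbooks above, with the bodies stubbed where real ones would shell out to kubectl or az:

```python
from typing import Callable

# Illustrative registry: each entry was written, reviewed, and signed off
# by humans. The agent may only select a key from this dict.

def restart_pod_on_liveness_failure(ctx: dict) -> str:
    return f"restarted {ctx['pod']}"           # stub: real body calls kubectl

def truncate_logs_on_disk_pressure(ctx: dict) -> str:
    return f"truncated logs on {ctx['node']}"  # stub

def rotate_expired_tls_cert(ctx: dict) -> str:
    return f"rotated cert {ctx['cert']}"       # stub

APPROVED_PLAYBOOKS: dict[str, Callable[[dict], str]] = {
    "pod-restart": restart_pod_on_liveness_failure,
    "log-truncate": truncate_logs_on_disk_pressure,
    "cert-rotate": rotate_expired_tls_cert,
}

def execute_rung2(playbook_key: str, ctx: dict, kill_switch: bool = False) -> str:
    """The agent chooses playbook_key; anything off the list is refused."""
    if kill_switch:
        raise RuntimeError("kill switch engaged: rung-2 execution halted")
    playbook = APPROVED_PLAYBOOKS.get(playbook_key)
    if playbook is None:
        raise PermissionError(f"{playbook_key!r} is not a pre-approved playbook")
    return playbook(ctx)

print(execute_rung2("pod-restart", {"pod": "payments-7f9c"}))  # restarted payments-7f9c
```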
By week 13, you have evidence — concrete, auditable, FinOps-aware evidence — for whether to push to rung 3 on a single allowlisted runbook. That's a conversation you have with data. Most teams who try to skip to rung 3 on day one have it without data, lose the argument, and get rolled back to rung 0.
8. Where this goes by Q4 2026
PagerDuty's Fully Autonomous Responder lands in H2. Anthropic's Claude Code already has a PagerDuty integration; Cursor's IDE-agent ships incident replays. Microsoft and AWS will bring their MCP catalogues to feature parity by KubeCon NA. By Q4, the question won't be "do we have an SRE agent?" — every team will. The question is "what runbooks have we taught it, and who owns the audit log when it does the wrong thing?"
Gartner's headline projection — 70% enterprise penetration of agentic IT operations by 2029, up from less than 5% in 2025 — is the kind of curve that looks slow until it isn't. The teams who started this work in March 2026 will be running rung 3 on twelve runbooks in 2027 with a clean audit trail. The teams who waited until next quarter will be racing to catch up while their first AAU bill lands.
The takeaway
The Azure SRE Agent GA isn't the endpoint of the agentic-ops story. It's the start. The teams who treat it as a button to press will be back at rung 0 in six months, after a Friday-afternoon incident they can't explain to compliance.
The teams who treat it as a multi-quarter readiness program — autonomy ladder, audit log, AAU caps, allowlisted runbooks — are the ones who'll still be running the agent in production when the H2 features land.
The button is shipping. The program is your design problem.
Azure SRE Agent gives you the engine. The autonomy ladder, the runbook allowlist, the approval workflow, and the AAU spend cap are all still your design problem — and getting one of them wrong is how you become the 40% of agentic projects Gartner predicts will be cancelled by 2027. InfraZen runs a 4-week AI-SRE Readiness Engagement: we map your current incident topology, define the rung 1→rung 3 gates against your change-management policy, ship the audit/rollback plumbing, and pilot one allowlisted autonomous runbook in production. Talk to us before your first AAU bill lands.
Related: April 2026 DevOps & SRE recap · Site Reliability Engineering services · The alert fatigue trap · Platform engineering adoption crisis