On April 21, Datadog published its 2026 State of AI Engineering report. Buried near the front: 5% of AI requests now fail in production. The shareable headline was the failure rate. The much more interesting number was the next one — 60% of those failures are capacity, not model quality.
Hold that thought. Now look at the rest of April.
Vercel got popped through blanket OAuth permissions granted to an AI productivity tool a junior employee had connected to their Google Workspace. The biggest npm supply-chain attack of the year, Mini Shai-Hulud, used the Bun runtime to slip past EDR and spread through AI coding-agent config files. OpenClaw, the fastest-growing GitHub project of 2026, racked up its 138th CVE on April 18 with 135,000 publicly exposed instances. CrowdStrike, Anthropic, and Wiz launched Project QuiltWorks on April 23 — a coalition that's basically an admission that the disclosure pipeline is broken because AI is finding vulnerabilities faster than humans can triage them.
Meanwhile, Cloudflare Agents Week shipped Dynamic Workers (claimed 100× faster than containers for short-lived agent workloads). Bluesky published one of the cleanest TIME_WAIT death-spiral postmortems we've seen in years. KubeCon's running theme was Solo.io's framing: cloud-native to AI-native. Even Airbnb pulled their observability stack back in-house with a public engineering write-up about why.
Squint at the month and you see the same shape in every story. An AI tool, AI workload, or AI-adjacent agent is somewhere in the blast radius. April 2026 is the first 30-day window where AI ops stopped being aspirational and started being the thing waking your team up at 3 AM.
Here's what mattered, why it spread, and what we'd put on your Q3 roadmap.
1. Three breaches, one root cause
If you read only one section, read this one. Three of the most-shared security stories of April share an architectural pattern: an AI tool, agent, or marketplace artefact ended up with permissions it shouldn't have had, on infrastructure nobody had inventoried as an attack surface.
Vercel — April 19–20. The hosting provider confirmed that customer data had been stolen via a breach at Context.ai, an AI productivity tool that had been OAuth-connected to a Vercel employee's Google Workspace. ShinyHunters listed the data on BreachForums for $2 million. The chain started with a Lumma Stealer infection at Context.ai in February; the tokens it captured reached Vercel's CI, secrets, and deploy pipeline two months later. The top Hacker News comment (867 points) made the architectural diagnosis: "When one OAuth token can compromise dev tools, CI pipeline, secrets, and deployment simultaneously, something architectural has gone wrong."
SAP “Mini Shai-Hulud” — April 29. A draft PR titled “feat: ci speedup” sat in an SAP repository long enough to leak a CircleCI npm publish token. Within hours, four packages were poisoned: @cap-js/sqlite@2.2.2, @cap-js/postgres@2.2.2, @cap-js/db-service@2.10.1, and mbt@1.2.48 — combined ~572K weekly downloads. Two firsts in this attack: it used the Bun runtime to evade EDR allowlists ("we don't have Bun on our list, so endpoint detection won't see it"), and it propagated through AI coding-agent config files (Claude Code skills) that nobody's SBOM tracks.
OpenClaw — a running disaster all month. The open-source agent platform that hit 20K stars in a single day in January and 346K stars by April is the most popular GitHub repo nobody can responsibly run. Microsoft's February advisory said it plainly: "It is not appropriate to run it on a standard personal or corporate machine." By the end of April, the community CVE tracker had logged 138 CVEs at an average CVSS of 7.0. ARMO's April 24 disclosure of CVE-2026-32922, a privilege-escalation path through the agent skill marketplace, was the one most teams actually heard about, typically because their cloud-security platform flagged six instances nobody remembered deploying.
The shared root cause is not "AI vendors are sloppy." It's that the AI tooling layer — productivity tools with OAuth, agent skill marketplaces, MCP servers, model-config files — has become a supply-chain layer that almost nobody's audit covers. Your SBOM tracks npm and Maven and PyPI. It does not track which Claude Code skills your developers downloaded last week, which OAuth grants are live in your Google org, or which MCP servers your platform team has running. Until it does, every one of these stories is going to repeat with different brand names attached. Pull the OAuth list. Revoke 80% of it. Re-grant with least-privilege scopes. That's a one-day project that's worth more than a quarter of feature work.
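If you want to start that audit today, the Google Workspace Admin SDK exposes per-user OAuth token grants directly. Here's a minimal sketch, assuming a service account with domain-wide delegation and the admin.directory.user.security scope; the admin address, key file name, and "risky scope" list are placeholders, not recommendations.

```python
# Sketch: inventory third-party OAuth grants across a Google Workspace org and
# flag broad scopes for revocation. Requires google-api-python-client and a
# delegated admin credential (service account with domain-wide delegation).
from googleapiclient.discovery import build
from google.oauth2 import service_account

SCOPES = [
    "https://www.googleapis.com/auth/admin.directory.user.readonly",
    "https://www.googleapis.com/auth/admin.directory.user.security",
]
# Scopes broad enough to enable a Vercel-style pivot; tune this list to your org.
RISKY = ("https://mail.google.com/",
         "https://www.googleapis.com/auth/drive",
         "https://www.googleapis.com/auth/admin")

creds = service_account.Credentials.from_service_account_file(
    "sa.json", scopes=SCOPES, subject="admin@example.com")  # hypothetical admin
directory = build("admin", "directory_v1", credentials=creds)

users = directory.users().list(customer="my_customer", maxResults=500).execute()
for user in users.get("users", []):
    email = user["primaryEmail"]
    grants = directory.tokens().list(userKey=email).execute().get("items", [])
    for grant in grants:
        broad = [s for s in grant.get("scopes", []) if s.startswith(RISKY)]
        if broad:
            print(f"{email}: {grant['displayText']} ({grant['clientId']}) -> {broad}")
            # After review, revoke and re-grant with least privilege:
            # directory.tokens().delete(userKey=email, clientId=grant["clientId"]).execute()
```

The same pattern applies to GitHub OAuth apps and Microsoft 365 enterprise applications; the point is a list you can review weekly, not a one-off spreadsheet.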
2. The vendors just repriced themselves around AI ops
While the breaches dominated headlines, three vendor moves told the longer-term story.
Datadog State of AI Engineering 2026 (April 21). The 5% production failure rate got the social shares. The other number, that 60% of AI failures are capacity issues rather than model quality, got the attention of every infra team that had spent two years arguing about evals and prompts. The buried takeaway: AI in production is failing the way every cloud service failed in 2014. Thundering herds. No backoff. No circuit breakers around third-party API calls. The cure isn't smarter models; it's the rate-limit and concurrency hygiene we already know how to do. Yanbing Li's tweet — "AI is starting to look a lot like the early days of cloud" — captured the mood. If you only do one thing this quarter: put a token-bucket rate limiter and a fallback model in front of every LLM call, and chart your degradation paths.
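For concreteness, the minimum viable version looks something like this: a token bucket sized to your downstream quota, with a fallback path that degrades instead of erroring. The call_primary and call_fallback functions are stand-ins for whatever model clients you actually run, and the rates are placeholders.

```python
# Sketch: token-bucket rate limit plus a fallback model around an LLM call.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_s=5, burst=20)    # size against your provider quota

def call_primary(prompt: str, timeout: float) -> str:
    raise NotImplementedError("wire up your primary model client here")

def call_fallback(prompt: str) -> str:
    return "[degraded] " + prompt[:100]          # smaller model or cached answer in practice

def complete(prompt: str) -> str:
    if not bucket.allow():                       # shed load instead of joining the herd
        return call_fallback(prompt)
    try:
        return call_primary(prompt, timeout=10)
    except Exception:
        return call_fallback(prompt)             # degrade, don't 500
```

The interesting design decision is the first branch: when the bucket is empty you go straight to the fallback rather than queueing, which is what "chart your degradation paths" means in practice.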
CrowdStrike Project QuiltWorks (April 23) + Falcon × Claude Opus 4.7 (April 30). First time an EDR vendor explicitly partnered with two foundation-model labs (Anthropic and OpenAI) plus IBM, Accenture, EY, and Kroll to handle the AI-discovered-vulnerability volume problem. Wiz's adjacent disclosure — that Claude Mythos had autonomously found thousands of zero-days in their internal red-team program — lit up infosec Twitter. Read the announcement carefully and the message is: the patch-Tuesday cadence collapses when the discovery rate outpaces human triage. Treat "patches available" as a continuous event stream, not a calendar. Your SOC's 30-day SLA is incompatible with a world where new CVEs land hourly.
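The "continuous event stream" framing is concrete enough to prototype. Here's a minimal sketch that polls the public NVD 2.0 API for CVEs modified in the last 24 hours and surfaces anything high-severity or on CISA's Known Exploited Vulnerabilities list. Parameter and field names follow the NVD 2.0 schema at the time of writing; verify against the current docs (and add API-key handling) before wiring this into a SOC.

```python
# Sketch: a daily CVE intake pass, prioritised by CVSS and KEV membership
# rather than by the patch-Tuesday calendar.
from datetime import datetime, timedelta, timezone
import requests

NVD = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_last_24h():
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=1)
    params = {
        "lastModStartDate": start.strftime("%Y-%m-%dT%H:%M:%S.000"),
        "lastModEndDate": end.strftime("%Y-%m-%dT%H:%M:%S.000"),
        "resultsPerPage": 2000,
    }
    resp = requests.get(NVD, params=params, timeout=30)
    return resp.json().get("vulnerabilities", [])

def triage(record):
    cve = record["cve"]
    metrics = cve.get("metrics", {}).get("cvssMetricV31", [])
    score = metrics[0]["cvssData"]["baseScore"] if metrics else 0.0
    kev = "cisaExploitAdd" in cve          # present when CISA lists it as exploited
    return (kev, score, cve["id"])

for kev, score, cve_id in sorted(map(triage, fetch_last_24h()), reverse=True):
    if kev or score >= 8.0:                # route to on-call triage, not a monthly queue
        print(f"{cve_id}  cvss={score}  known-exploited={kev}")
```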
Cloudflare Agents Week (April 13–17). Cloudflare went hardest of any vendor at the "where do you run a million short-lived AI workloads" question. Dynamic Workers shipped in beta with a claim of 100× faster cold start than containers for short-lived agent calls. Sandboxes hit GA. Workers Mesh and Agent Memory rounded out a story that, frankly, is the most aggressive bet against Kubernetes for AI workloads we've seen from a top-3 cloud provider. Whether their isolate model holds is open; the fact that they're betting publicly tells you where the puck is going. Don't migrate. Do put a "what would we do if 30% of our agent workloads ran on Workers?" plan in the doc folder where your CFO can find it.
3. The boring outages still matter
April had two outages that became case studies — neither of them about AI directly, both of them about classic ops failure modes the AI hype cycle has been ignoring.
Bluesky, April 7. The postmortem is the kind of thing you forward to junior engineers. An internal logging service started sending 15,000–20,000 URIs in a single batch. TCP ephemeral port exhaustion. Memcache connections starved. A blocking syscall in the hot path turned a recoverable error into a death spiral: errors drove log volume, log volume drove thread spawns, thread spawns drove GC pressure, and GC pressure drove more errors. The top Hacker News comment captured it: "Ahh, the three relevant numbers in development: 0, 1, and infinity." Audit your error paths for unbounded amplification this week.
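One concrete shape that audit can take: make the error path strictly cheaper than the happy path. Below is a sketch of a bounded, non-blocking error logger that drops and samples under pressure instead of spawning threads; the queue size and batch limit are placeholders.

```python
# Sketch: cap error-path amplification. The request thread never blocks on
# logging, and overflow is dropped rather than amplified.
import logging
import queue
import threading

log_q: "queue.Queue[str]" = queue.Queue(maxsize=1000)   # bounded on purpose

def log_error(msg: str) -> None:
    """Called on the request path. Never blocks, never amplifies."""
    try:
        log_q.put_nowait(msg)
    except queue.Full:
        pass                 # dropping a log line beats a death spiral

def drain() -> None:
    logger = logging.getLogger("errors")
    while True:
        batch = [log_q.get()]            # block only on the background thread
        while len(batch) < 100:
            try:
                batch.append(log_q.get_nowait())
            except queue.Empty:
                break
        logger.error("%d errors in batch (sampled): %s", len(batch), batch[0])

threading.Thread(target=drain, daemon=True).start()
```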
Azure East US, April 24–25. ARM gateway deadlock for 13.5 hours. Dependency loop in the control plane: a regional management service required the gateway to come up, the gateway required a service that required the management service. Same shape as the CrowdStrike-driven Windows BSOD chain from 2024 — the lesson hasn't landed. If your control plane has loops, you don't have a control plane; you have a bistable system.
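Loops like that are cheap to catch before they become a 13-hour outage. A sketch using Python's standard-library graphlib on a hypothetical bring-up graph; feed it whatever dependency manifest your control plane actually declares.

```python
# Sketch: detect circular bring-up dependencies in a declared service graph.
# The graph below is hypothetical and loosely mirrors the Azure incident shape.
from graphlib import TopologicalSorter, CycleError

# service -> services it needs before it can come up
bring_up = {
    "arm-gateway": {"regional-mgmt"},
    "regional-mgmt": {"config-store"},
    "config-store": {"arm-gateway"},     # the loop: no valid cold-start order exists
}

try:
    order = list(TopologicalSorter(bring_up).static_order())
    print("cold-start order:", order)
except CycleError as exc:
    print("control-plane loop, no valid cold start:", exc.args[1])
```

Run it in CI against your service manifests and a new loop becomes a failed build instead of a regional incident.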
The pattern: thundering herds, blocking syscalls, control-plane dependency loops, ephemeral port exhaustion. None of these are new failure modes, yet all of them showed up in 2026 as if they were. That's no coincidence: the teams that moved fastest on AI investment are largely the same teams that haven't audited their error paths since 2022. The way to ship AI ops without losing the boring fundamentals is to have a person on the platform team whose explicit job is "boring infra reliability." If everyone's working on the agent, no one is reading the postmortems.
4. The platform layer is reorganising itself
Three signals from the second half of the month, individually small, collectively a shift.
Ingress-nginx EOL. The Kubernetes ingress-nginx project was officially archived in March, with the active-vulnerability EOL clock ticking through April. It was the most-deployed ingress controller in the CNCF ecosystem; tens of thousands of clusters need to migrate to Gateway API or to a vendor controller before security CVEs go unpatched. If this isn't on your Jira board, put it there today. Migration takes 2–6 sprints depending on your config sprawl; don't be the team that finds out the hard way.
Kubernetes 1.36 “Haru” — April 22. Headline features include Workload-Aware Scheduling for AI/ML (better GPU and topology-aware bin-packing), fine-grained kubelet authz GA, and PSI-based health checks for nodes. If you have any GPU pools, pilot Workload-Aware Scheduling; the bin-packing improvement has been measured at 12–18% utilisation gain in early adopters.
KubeCon EU's running theme — "from cloud-native to AI-native." Solo.io's keynote and the agentregistry CNCF donation framed the shift cleanly. Spotify's Backstage went agent-first in their April release notes. Airbnb's "From Vendors to Vanguard" engineering blog detailed why they pulled their observability stack back in-house with OpenTelemetry + VictoriaMetrics — and the cost numbers ($14M/year saved) lit up LinkedIn for a week. The platform layer that ran 2018–2024 was a stack of vendor SaaS. The platform layer emerging in 2026 is more in-sourced, more agent-aware, and less tolerant of vendor lock-in.
5. What we'd put on your Q3 roadmap
Reading the month, these are the items we'd argue for in your next planning meeting. None of them are speculative — every one is a direct answer to something that broke in April.
- OAuth scope audit. Inventory every AI tool with OAuth into Google Workspace, Microsoft 365, GitHub, and your CI. Revoke broad-scope grants. Re-issue with least privilege. (1–2 days. Defends against: Vercel-style breach.)
- Agent supply-chain inventory. Map every Claude Code skill, MCP server, and AI coding-agent config your developers have installed. Add to SBOM. (1–2 sprints. Defends against: Mini Shai-Hulud-style propagation.)
- Token-bucket rate limit + fallback model around every LLM call. Plus circuit breakers and exponential backoff; a minimal retry/circuit-breaker sketch follows this list. (1 sprint per service. Defends against: 60% of AI failures per Datadog.)
- Ingress-nginx → Gateway API migration plan. With timeline. (2–6 sprints. Defends against: zero-day CVE on archived project.)
- Workload-Aware Scheduling pilot if you run GPUs. (1 sprint. Returns: 12–18% utilisation.)
- Error-path amplification audit. Every blocking syscall in a hot path. Every retry without backoff. Every log line in a hot loop. (1 sprint per service. Defends against: Bluesky-style death spiral.)
- Continuous CVE intake stream. Replace the monthly patch calendar with a pipeline that triages new CVEs daily and prioritises by exploitability. (1 sprint to build, ongoing. Defends against: AI-velocity CVE discovery per QuiltWorks.)
- Observability cost review with an honest "could we do this in-house?" question on the table. Not because everyone should follow Airbnb, but because the answer in 2026 is no longer reflexively "stay on the vendor."
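The backoff and circuit-breaker item above is small enough to sketch in full. Treat this as a starting point rather than a library recommendation: `call` stands in for any third-party API call, and the thresholds are placeholders to tune per dependency.

```python
# Sketch: exponential backoff with full jitter plus a crude circuit breaker.
import random
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, cooldown_s: float = 30.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.opened_at = 0, 0.0

    def allow(self) -> bool:
        return (self.failures < self.max_failures
                or time.monotonic() - self.opened_at > self.cooldown_s)

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def with_backoff(call, breaker: CircuitBreaker, attempts: int = 5):
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: failing fast instead of herding")
        try:
            result = call()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            # full jitter: spread retries out instead of synchronising the herd
            time.sleep(random.uniform(0, min(30, 2 ** attempt)))
    raise RuntimeError("retries exhausted")
```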
The takeaway
The story of April 2026 is not that AI broke production. AI didn't break production.
AI tools were the proximate cause in three of the month's biggest stories, and they will be in three more next month. But the actual lesson is the older one. Every supply-chain compromise that hit in April was preventable by audit. Every record-breaking AI failure was preventable by rate limiting. Every postmortem that went viral pointed at error paths and control-plane loops that have been in the SRE textbook for fifteen years.
The teams that will outperform the next six months aren't the ones who adopt the most AI. They're the ones who treat AI tools as a new layer of the supply chain they have to govern, AI workloads as a new distribution of the same old failure modes they already know how to handle, and the boring fundamentals — auditing, rate-limiting, error-path discipline, EOL migrations — as table stakes that have to run alongside the agent program, not after it.
Either you build the runbooks for the new layer or your incidents will write them for you. April was the warning. Q3 is when the bill comes due.
InfraZen runs a one-week SRE Health Check that maps directly to the items above: OAuth scope audit, agent supply-chain inventory, error-path amplification review, observability cost benchmarking, and an EOL/migration plan for your top three platform dependencies. You'll have a written report and a prioritised backlog by Friday. Talk to us before your next viral postmortem is yours.
Related: Azure SRE Agent production playbook · The alert fatigue trap · Site Reliability Engineering services