Last week, NeuBird AI published the 2026 State of Production Reliability and AI Adoption Report — a survey of 1,039 SRE, DevOps, and IT operations professionals. The headline that made every on-call engineer on Hacker News groan in recognition: 44% of organizations had an outage in the past year caused directly by a suppressed or ignored alert. 78% had at least one incident where no alert fired at all, and engineers discovered the failure only after customers were already tweeting about it.
Translation: the problem is no longer "we don't have monitoring." It's "we have so much monitoring that the signal drowned in the noise, and the team's favourite coping mechanism is cat >> /dev/null."
Here's why it happens, and what the teams we work with do to climb out of the hole.
1. The alert fatigue flywheel
Alert fatigue isn't a single bad decision. It's a slow, self-reinforcing loop that every monitoring-first org eventually walks into:
- An incident slips through, so the team adds more alerts "just in case."
- Alert volume climbs, and the fraction that is actionable falls.
- Engineers start ignoring or suppressing pages to cope.
- A suppressed page masks a real failure, an outage follows, and the loop restarts at step one.
The numbers from the NeuBird report back up the loop point-by-point: 77% of on-call teams receive at least 10 alerts per day, 57% say fewer than 30% are actionable, and 83% admit to ignoring alerts "at least occasionally." You cannot tune your way out of that. You have to restructure.
2. Page on symptoms, not causes
The biggest lever is also the least glamorous: page on symptoms, not on causes.
A cause alert fires when something might lead to user pain — disk at 80%, replication lag at 30 seconds, GC pauses over 500ms. Most of these never actually hurt a user. Most of them auto-recover. All of them will fire at 2 AM anyway.
A symptom alert fires when the user is, right now, getting a worse experience than your SLO allows. P99 latency above 800ms. Checkout error rate above 0.5%. Login success rate below 99.9%. Payment authorisation failure above 1%.
Action: Replace your cause alerts with dashboards (still visible, still queryable, still useful for RCA) and page only on symptoms. In the engagements we've run, this single change cuts page volume by 70–85%.
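The symptom-only rule can be sketched in a few lines. This is an illustrative sketch, not any particular vendor's API: the metric names and the `should_page` helper are hypothetical, and the thresholds simply mirror the SLO symptoms listed above.

```python
# Symptom thresholds: page only when users are measurably hurting right now.
# These names and numbers are illustrative, mirroring the examples in the text.
SYMPTOM_THRESHOLDS = {
    "checkout_error_rate": 0.005,        # page above 0.5%
    "p99_latency_ms": 800,               # page above 800 ms
    "payment_auth_failure_rate": 0.01,   # page above 1%
}

def should_page(metric: str, value: float) -> bool:
    """Return True only for symptom metrics breaching their SLO threshold.

    Cause-style metrics (disk %, replication lag, GC pauses) are absent on
    purpose: they belong on dashboards for RCA, not in the pager.
    """
    threshold = SYMPTOM_THRESHOLDS.get(metric)
    return threshold is not None and value > threshold

# Disk at 80% is a dashboard item, not a page:
assert not should_page("disk_used_percent", 80)
# Checkout errors at 1.2% breach the 0.5% symptom threshold:
assert should_page("checkout_error_rate", 0.012)
```

The design point is the absence: anything not in the symptom table simply cannot page, no matter how alarming it looks on a graph.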
3. Burn-rate alerts are the under-used weapon
For every SLO, set up multi-window burn-rate alerts: one fast (5-minute window, ~14x normal burn), one slow (6-hour window, ~6x normal burn). The fast one catches sudden cliff-edges. The slow one catches slow bleeds that would otherwise eat your whole monthly error budget before anyone notices.
The math is in Google's SRE Workbook and the Grafana burn-rate docs, but the short version: a 2% error rate over 5 minutes is a "page now" event. A 0.5% error rate sustained over 6 hours is also a "page now" event, because you'll blow your monthly budget before the next on-call rotation sees it.
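The arithmetic is worth seeing once. A minimal sketch of the burn-rate calculation, assuming a 99.9% SLO and the ~14.4x / 6x thresholds from the SRE Workbook; the function names are mine, not from any library:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning.

    A burn rate of 1.0 means the budget lasts exactly the SLO window
    (e.g. 30 days); 14.4 means it is gone in 30 / 14.4, about 2 days.
    """
    error_budget = 1.0 - slo_target  # e.g. 0.1% for a 99.9% SLO
    return error_rate / error_budget

def page_decision(fast_err: float, slow_err: float, slo_target: float = 0.999,
                  fast_threshold: float = 14.4, slow_threshold: float = 6.0) -> str:
    """Multi-window check: the fast window catches cliffs, the slow one catches bleeds."""
    if burn_rate(fast_err, slo_target) >= fast_threshold:
        return "page: fast burn"
    if burn_rate(slow_err, slo_target) >= slow_threshold:
        return "page: slow burn"
    return "no page"

# 2% errors in the fast window on a 99.9% SLO is a 20x burn: page now.
assert page_decision(fast_err=0.02, slow_err=0.0) == "page: fast burn"
# 0.7% sustained in the slow window is a 7x burn: also page.
assert page_decision(fast_err=0.0, slow_err=0.007) == "page: slow burn"
```

Note that a 2% error rate against a 0.1% budget is a 20x burn, comfortably above the 14.4x fast threshold, which is exactly why it is a "page now" event.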
Two alerts per SLO. Not two hundred. Every page is actionable by definition.
4. What good looks like after 6–12 weeks
Here's what a de-fatigued SRE program looks like after 6–12 weeks of restructuring:
- 10–20 paging alerts, not 100–200.
- Every page has a runbook with a concrete first action — not "investigate."
- Every page has a known severity and a known blast radius.
- A weekly alert review meeting kills alerts that fired without leading to action.
- Incidents shrink (because pages are real) and teams sleep (because pages are rare).
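The weekly review can start life as a one-page script. A sketch of the triage logic, assuming your incident tracker can export a firing log; the `alert_log` shape and the 30% actionability cut-off are illustrative assumptions, not a standard:

```python
from collections import defaultdict

def review_candidates(alert_log, min_fires=3, max_action_rate=0.3):
    """Flag alerts that fire often but rarely lead to action.

    `alert_log` is a list of (alert_name, led_to_action) tuples: a stand-in
    for whatever your incident tracker exports. Alerts that fired at least
    `min_fires` times with an actionable ratio at or below `max_action_rate`
    are candidates to demote to a dashboard or delete.
    """
    fires = defaultdict(int)
    actions = defaultdict(int)
    for name, acted in alert_log:
        fires[name] += 1
        actions[name] += int(acted)
    return sorted(
        name for name in fires
        if fires[name] >= min_fires
        and actions[name] / fires[name] <= max_action_rate
    )

log = [
    ("disk_80_percent", False), ("disk_80_percent", False),
    ("disk_80_percent", False), ("disk_80_percent", True),
    ("checkout_error_rate", True), ("checkout_error_rate", True),
]
# disk_80_percent fired 4 times with 1 action (25% actionable): kill it.
assert review_candidates(log) == ["disk_80_percent"]
```

Run it before the weekly meeting and the agenda writes itself: every name it returns is an alert that owes the team an explanation for still existing.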
5. The AI-assisted future (with caveats)
The NeuBird report also found that 60% of SREs are optimistic about AI in incident response, and more than half plan to deploy agentic AI systems in production within 12 months. We're bullish too — but skeptical of the "AI will read the alerts for you" vendor pitch.
What actually works in 2026: LLM-based incident summarisation, natural-language runbook search, RCA narrative generation, correlation across logs + traces + metrics, and post-mortem draft generation. These save 30–60% of the toil without putting AI in the critical path.
What doesn't work yet: fully autonomous alert suppression without human review. Garbage-in, garbage-out still applies, and an AI that suppresses a real alert is worse than a human who ignores one — because at least the human can be asked why.
The takeaway
Fixing alert fatigue is not an AI problem, a vendor problem, or a budget problem. It's an alignment problem. Your on-call rotation and your SLO definitions have to actually be the same document. Everything else is cleanup.
If 44% of the industry just proved that ignoring alerts leads to outages, the lesson isn't "ignore fewer alerts." It's "page on fewer things, but page on the right things."
Is your on-call rotation drowning in pages? This is exactly the kind of work we do. Book a free 30-minute reliability review — we'll look at your alert catalogue and your SLOs and tell you honestly where the 70% reduction is hiding.
Related: Site Reliability Engineering services · SRE for SaaS · Three Kubernetes migration mistakes