Pillar Guide · Site Reliability Engineering

What is SRE?
Site Reliability Engineering, demystified.

Site Reliability Engineering (SRE) is what happens when a software engineer is tasked with what used to be called operations. That's Ben Treynor Sloss's line from a 2014 interview, and twelve years later it's still the cleanest one-sentence definition. The discipline originated at Google in 2003, was codified in the 2016 Site Reliability Engineering book, and is now standard practice at most large tech companies.

The plain-English version: SRE is the engineering practice of running production systems with explicit numerical reliability targets, an error budget that gates risk-taking, and a discipline of eliminating repetitive operational work (toil) by writing code instead of doing it manually.

This guide covers what SRE actually is in 2026, the four practices that define it, how SRE differs from DevOps and traditional ops, and a practical decision framework for engineering leaders thinking about when to bring in SRE consulting or hire their first SRE.

Where SRE came from

Google in 2003 had a problem. Their services were growing faster than their ability to operate them with traditional sysadmins. Ben Treynor Sloss was hired to run the team that kept Google's production systems online and asked to "do it like a software problem." He hired software engineers and gave them an explicit constraint: spend at most 50% of your time on operations. The other 50% had to go to engineering work that reduced future operations — automation, tooling, architectural simplifications.

That constraint — the 50% toil cap — turned out to be the load-bearing idea. It forced engineers to invest in eliminating work rather than scaling it. The rest of the SRE discipline (SLOs, error budgets, blameless post-mortems, on-call rotation hygiene) is the operating model that makes the 50% cap work in practice.

The four practices that define SRE

1. SLIs and SLOs. A Service Level Indicator (SLI) is a measurement of something users care about — success rate, latency, freshness. A Service Level Objective (SLO) is a target for that SLI. Example: "99.95% of API requests return a non-5xx status code within 800ms over a rolling 28-day window." SLOs are not aspirations; they're contracts the team commits to. If they're set too high, every deploy feels risky. If they're set too low, customers churn. Picking the right SLOs is half the job.
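The SLO in the example above can be checked mechanically. A minimal sketch, using a made-up request log (status code, latency in ms) rather than any real monitoring API:

```python
# Hypothetical request log over the measurement window: (status_code, latency_ms).
requests = [(200, 120), (200, 640), (500, 30), (200, 790), (503, 15), (200, 200)]

# SLI from the example SLO: fraction of requests that return a non-5xx
# status code within 800 ms.
good = sum(1 for status, latency in requests if status < 500 and latency <= 800)
sli = good / len(requests)

SLO = 0.9995  # the 99.95% target
print(f"SLI = {sli:.4f}, meets SLO: {sli >= SLO}")
```

In production the log would be a rolling 28-day aggregate from your metrics store, but the arithmetic is exactly this: count good events, divide by total events, compare to the target.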

2. Error budgets. An error budget is the inverse of an SLO. If your SLO is 99.95% over 28 days, your error budget is 0.05% — about 20 minutes of allowed downtime per month. The error budget is what makes SRE different from "ops with stricter rules." Teams that haven't burned their budget can ship aggressively, push experimental features, take deployment risk. Teams that have burned their budget freeze risky changes until reliability recovers. The budget converts reliability arguments into capacity decisions.
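The "about 20 minutes" figure falls straight out of the SLO. A quick sanity-check calculation (nothing here is service-specific):

```python
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Allowed 'bad' minutes in the window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.9995))  # ~20.16 minutes over 28 days
print(error_budget_minutes(0.999))   # ~40.3 minutes over 28 days
```

Running the same function at 99.9% versus 99.99% is a fast way to show stakeholders what each extra nine actually buys (and costs).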

3. Incident response. Three rituals every SRE team runs: an on-call rotation that pages on error-budget burn rate (a symptom users feel, not a low-level cause), a written incident runbook for every paging alert, and a blameless post-mortem after every Sev-1 or Sev-2 with action items tracked to completion. The point isn't preventing incidents (impossible); it's making each one a learning artefact. Read our take on the alert fatigue trap for the on-call hygiene side, and the Azure SRE Agent production playbook for the AI-assisted incident response direction.
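Burn-rate paging is simpler than it sounds. A sketch of the multi-window pattern (the 14.4x threshold and the 1-hour/5-minute window pair are illustrative values, loosely following the pattern popularized in the Google SRE Workbook, not a prescription):

```python
def burn_rate(bad_fraction_observed: float, slo: float) -> float:
    """How fast the error budget is being consumed relative to plan.
    A burn rate of 1.0 means the budget lasts exactly the SLO window."""
    budget_fraction = 1 - slo
    return bad_fraction_observed / budget_fraction

def should_page(err_1h: float, err_5m: float, slo: float = 0.9995) -> bool:
    """Page only when both a long and a short window show a fast burn:
    the long window filters noise, the short window confirms it's ongoing."""
    return burn_rate(err_1h, slo) > 14.4 and burn_rate(err_5m, slo) > 14.4

# 1% errors against a 0.05% budget is a 20x burn in both windows -> page.
print(should_page(err_1h=0.01, err_5m=0.012))
```

The payoff is that a brief blip that self-heals never pages anyone, while a sustained burn that threatens the monthly budget pages within minutes.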

4. Eliminating toil. Toil is operational work that's manual, repetitive, automatable, and grows with system size. The SRE commitment is to spend at least 50% of engineering time on work that reduces future toil — automation, refactors, better tooling. A team where everyone is firefighting and nobody is building is by definition not doing SRE, regardless of what the org chart says.

SRE vs DevOps vs Platform Engineering

The three terms overlap heavily and engineering leaders confuse them constantly. The clean distinction:

  • DevOps is the broader cultural and practice layer for shipping software fast and reliably. See our What is DevOps? guide for the full definition.
  • SRE is a specific implementation of the reliability side of DevOps, originating at Google, defined by SLOs, error budgets, and explicit toil reduction.
  • Platform Engineering is what you build to make DevOps and SRE practices self-serve at scale — an internal platform that abstracts complexity for feature teams.

For the side-by-side: DevOps vs SRE vs Platform Engineering →

SRE roles: embedded, dedicated, advisory

SRE shows up in three structural patterns inside companies. Pick the one that matches your stage:

Embedded SRE. An SRE is part of a feature team for a fixed period (typically 6 months). Their job is to upskill the team on SRE practices, instrument the service properly, and hand off. Best for teams that need to absorb the discipline without growing a separate org. This is the structure most consulting engagements model.

Dedicated SRE. A separate SRE team owns reliability for a portfolio of services and works with feature teams as a peer (not a service desk). Best for companies past 100 engineers with multiple business-critical services. The risk: the team drifts into SRE-as-ops if it isn't given engineering autonomy and rotational ownership.

Advisory SRE. SRE expertise lives with consultants or staff engineers who set standards (SLO templates, runbook conventions, observability stack) but don't own services. Most common in mid-stage companies that need the discipline before they can justify a dedicated team.

When to bring in SRE help

Four triggers we see consistently across our clients:

  1. First enterprise customer. Procurement is asking for an uptime SLA, a security questionnaire, and an incident response plan. Outside SRE consulting turns those three deliverables around in two to four weeks.
  2. On-call is breaking the team. Engineers are quitting or asking off the rotation. Pages outnumber meaningful incidents 10-to-1. The fix is restructuring around SLOs and burn-rate alerts — usually visible within four weeks.
  3. Past 25 engineers, no platform discipline. Production access is informal, runbooks live in three different wikis, the same person debugs every Tuesday-night Postgres lock. Bring in SRE to set standards before staffing a full team.
  4. Post-incident reset. A Sev-1 ate a customer or a quarter. The post-mortem said "we'll do better." Three months later nothing has structurally changed. An outside SRE engagement makes the structural shift stick.

If two or more of those describe your team right now, a 30-minute conversation is genuinely useful even if you don't end up engaging us.

Common SRE myths

  • "SRE means 99.99% uptime." SRE doesn't prescribe a target. It prescribes setting a target that matches user expectations and managing to it. 99.9% is correct for many B2B services; 99.99% is overkill and expensive for most.
  • "SRE is just renamed ops." The 50% toil cap and the engineering autonomy to refactor systems are non-negotiable parts of the model. A team called SRE that doesn't have those is doing ops with a cooler title.
  • "You need a separate team to do SRE." The practices apply at any scale. The team structure is a function of company size, not a prerequisite for SRE.
  • "SRE owns reliability." The opposite, actually. The feature team owns reliability. SRE provides standards, tooling, and consultation. Reliability that an SRE team can't get the feature team bought into doesn't last past the next reorg.

Tired of 3 AM pages? Wondering whether you need an SRE hire or just better practices? InfraZen runs a free 30-minute SRE review that ends in honest advice on what to fix first — structure, alerting, or both. Book the review.

Related: SRE Consulting services · What is DevOps? · DevOps vs SRE vs Platform Engineering · The alert fatigue trap · SRE for SaaS

Reliability that isn't a liability.

Free 30-minute SRE review. SLOs, on-call, incident response. We'll tell you honestly where the highest-leverage week of work is.

SRE services · Book the Review