Part 9: SLOs, SLIs, SLAs & Error Budgets

May 14, 2026 Wasil Zafar 18 min read

Monitoring tells you what happened. SLOs tell you whether it matters. Service Level Objectives are the bridge between technical metrics and business impact — they quantify "good enough" reliability and create a shared language between engineering, product, and leadership for making trade-off decisions.

Table of Contents

  1. SLI, SLO, SLA — Definitions
  2. Choosing the Right SLIs
  3. Setting Meaningful SLOs
  4. Error Budgets
  5. Multi-Window Burn Rate Alerting
  6. Conclusion & Next Steps

SLI, SLO, SLA — Definitions

These three terms are often confused. They form a hierarchy from measurement to target to contract:

| Term | Definition | Audience | Example |
|---|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service behaviour, expressed as a ratio of good events to total events | Engineering | Proportion of HTTP requests completing in < 300ms |
| SLO (Service Level Objective) | A target value or range for an SLI; defines what "good enough" means | Engineering + Product | 99.9% of requests complete in < 300ms over 30 days |
| SLA (Service Level Agreement) | A contractual commitment with consequences for breach (refunds, credits) | Business + Customers | "99.95% uptime or receive 10% service credit" |

The Hierarchy: SLIs are measured. SLOs are targets set against SLIs. SLAs are promises made to customers based on SLOs. Your SLA should always be looser than your SLO (e.g., SLO = 99.95%, SLA = 99.9%), giving you a buffer before contractual penalties apply.

Choosing the Right SLIs

SLI Types by Service Category

Different types of services need different SLIs. Google's SRE guidance groups SLIs by service type; four common categories are:

| Service Type | Primary SLIs | Formulas |
|---|---|---|
| Request-driven (APIs, web apps) | Availability + Latency | Good requests / Total requests; Requests < threshold / Total requests |
| Pipeline / batch (ETL, data processing) | Freshness + Correctness | Records processed within deadline / Total records; Records without errors / Total records |
| Storage (databases, object stores) | Durability + Availability | Successful reads / Total reads; Data objects intact / Total data objects |
| Streaming (Kafka, event systems) | Throughput + Freshness | Messages delivered within SLA / Total messages; Time with consumer lag < threshold / Total time |

Implementing SLIs in Prometheus

# SLI: Availability — proportion of non-5xx requests
# Good events: requests with status != 5xx
# Total events: all requests
sum(rate(http_requests_total{service="order-service",status!~"5.."}[30d]))
/ sum(rate(http_requests_total{service="order-service"}[30d]))

# SLI: Latency — proportion of requests completing under 300ms
# Good events: requests in buckets up to 300ms (the histogram must define a bucket boundary at le="0.3")
# Total events: all requests
sum(rate(http_request_duration_seconds_bucket{service="order-service",le="0.3"}[30d]))
/ sum(rate(http_request_duration_seconds_count{service="order-service"}[30d]))

# SLI: Combined (most realistic) — requests that are both successful AND fast
# Good events: non-5xx requests completing under 300ms
# This requires custom instrumentation or recording rules

SLI Best Practice: Always express SLIs as a ratio: good events / total events. This produces a number between 0 and 1 (or 0% and 100%) that is directly comparable to your SLO target. Avoid absolute thresholds like "latency must be below 300ms"; instead, measure the "proportion of requests below 300ms."
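The combined SLI (requests that are both successful and fast) is described above but not shown as a query. As a minimal Python sketch of the same ratio, assuming per-request records with a status code and a duration are available (the `Request` type and `combined_sli` function are hypothetical names for illustration, not part of Prometheus or any library):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    duration_s: float  # request latency in seconds


def combined_sli(requests, latency_threshold_s=0.3):
    """Return good / total, where good = non-5xx AND within the latency threshold."""
    total = len(requests)
    if total == 0:
        return None  # no traffic: the SLI is undefined rather than 0 or 1
    good = sum(
        1
        for r in requests
        if not (500 <= r.status <= 599) and r.duration_s <= latency_threshold_s
    )
    return good / total


sample = [
    Request(200, 0.12),  # good
    Request(200, 0.45),  # successful but too slow
    Request(503, 0.05),  # fast but failed
    Request(404, 0.08),  # client error: not a 5xx, so it counts as good here
]
print(combined_sli(sample))  # 0.5
```

Whether 4xx responses count as good events is a design choice; this sketch follows the availability query above, which excludes only 5xx responses.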

Setting Meaningful SLOs

The Nines Table — What Each Level Actually Means

| SLO Target | Allowed Downtime / Month (730 h average month) | Error Budget / Month | Practical Meaning |
|---|---|---|---|
| 99% (two nines) | 7.3 hours | 1% of requests can fail | Internal tools, batch systems |
| 99.5% | 3.65 hours | 0.5% failure rate | Non-critical customer-facing services |
| 99.9% (three nines) | 43.8 minutes | 0.1% failure rate | Standard customer-facing APIs |
| 99.95% | 21.9 minutes | 0.05% failure rate | High-value transactional systems |
| 99.99% (four nines) | 4.38 minutes | 0.01% failure rate | Payment processing, auth systems |
| 99.999% (five nines) | 26.3 seconds | 0.001% failure rate | Life-critical systems only |

The Cost of Each Nine: As a rule of thumb, each additional nine of reliability requires roughly a 10x increase in engineering effort and infrastructure cost. Going from 99.9% to 99.99% does not just mean "a little more reliable"; it typically means a fundamentally different architecture (multi-region, active-active, zero-downtime deployments). Never set an SLO higher than your users actually need.
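The downtime figures in the nines table can be reproduced with a few lines of arithmetic. This is a small sketch, assuming an average month of 730 hours (365 days × 24 h / 12); the function name `allowed_downtime_minutes` is illustrative only:

```python
HOURS_PER_MONTH = 730  # assumption: average month = 365 days * 24 h / 12


def allowed_downtime_minutes(slo_percent, window_hours=HOURS_PER_MONTH):
    """Minutes of full downtime that fit inside the error budget for the window."""
    budget_fraction = 1 - slo_percent / 100
    return window_hours * 60 * budget_fraction


for slo in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.2f} minutes ({minutes * 60:.1f} seconds)")
```

Passing `window_hours=720` instead gives the figures for a fixed 30-day window, which is why a 99.9% SLO is sometimes quoted as 43.2 minutes rather than 43.8.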

Common SLO Pitfalls

  • Setting SLOs too high: If your SLO is 99.99% but your architecture can only deliver 99.9%, you will always be in violation — demoralising the team and making the SLO meaningless
  • Setting SLOs too low: If your SLO is 95% but users expect 99.9%, you will meet your SLO while users are unhappy — the SLO is not useful
  • Too many SLOs: Start with 1-3 SLOs per service (availability + latency). Adding more creates noise without clarity
  • SLOs without error budget policies: An SLO without consequences for breach is just a number on a dashboard
  • Measuring from the wrong vantage point: Measure SLIs from the user's perspective (load balancer, API gateway), not from within the service

Error Budgets

An error budget is the complement of an SLO (1 - SLO): it quantifies how much unreliability is acceptable. If your SLO is 99.9%, your error budget is 0.1%, so up to 0.1% of requests can fail within the SLO window.

Error Budget Math

Error Budget Example: order-service

Given: SLO = 99.9% availability over 30 days. The service handles 10 million requests per day.

  • Total requests in 30 days: 10M × 30 = 300 million
  • Error budget (requests): 300M × 0.001 = 300,000 failed requests allowed
  • Error budget (time-based): 30 days × 24h × 60min × 0.001 = 43.2 minutes of total downtime
  • Daily budget: ~10,000 failed requests or ~1.44 minutes of downtime per day

If a deployment causes 50,000 errors in one hour, that consumes 16.7% of the monthly error budget in a single incident.
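The same calculation can be written as a short Python sketch. The numbers come from the order-service example above; the function names `error_budget` and `budget_consumed` are hypothetical helpers introduced here for illustration:

```python
def error_budget(slo, requests_per_day, window_days=30):
    """Return the request-based and time-based error budget for an SLO window."""
    budget_fraction = 1 - slo
    total_requests = requests_per_day * window_days
    return {
        "allowed_failures": total_requests * budget_fraction,
        "allowed_downtime_min": window_days * 24 * 60 * budget_fraction,
    }


def budget_consumed(failed_requests, allowed_failures):
    """Fraction of the error budget used up by a number of failed requests."""
    return failed_requests / allowed_failures


budget = error_budget(slo=0.999, requests_per_day=10_000_000)
print(round(budget["allowed_failures"]))         # 300000
print(round(budget["allowed_downtime_min"], 1))  # 43.2

incident = budget_consumed(50_000, budget["allowed_failures"])
print(f"{incident:.1%}")                         # 16.7%
```

Rounding is applied only when printing, because `1 - 0.999` is not exactly 0.001 in floating point.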

Error Budget Policy

An error budget policy defines what happens when the budget is exhausted. Without a policy, error budgets are just numbers. With a policy, they become a decision-making framework.

| Budget Remaining | Action | Feature Velocity Impact |
|---|---|---|
| > 50% | Normal operations: ship features freely | Full speed |
| 25-50% | Increased caution: require canary deployments for all changes | Slight slowdown |
| 10-25% | Reduced risk: no non-critical changes, reliability work prioritised | Significant slowdown |
| < 10% | Feature freeze: only reliability improvements and critical bug fixes | Frozen |
| Exhausted (0%) | Full freeze + post-mortem required before resuming feature work | Completely frozen |

The Error Budget Trade-Off: Error budgets create a healthy tension between feature velocity and reliability. When the budget is healthy, product teams can ship fast and take risks. When the budget is depleted, the team must slow down and fix reliability. This is not a punishment; it is a self-correcting mechanism that prevents reliability debt from accumulating unchecked.
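A policy like the one in the table above can be encoded as a simple lookup so that dashboards or deploy tooling apply it consistently. This is only a sketch: the function name `policy_action` is hypothetical, and the handling of the exact boundary values (10%, 25%, 50%) is an assumption, since the table leaves them ambiguous.

```python
def policy_action(budget_remaining_pct):
    """Map remaining error budget (in percent) to the policy tier from the table."""
    if budget_remaining_pct <= 0:
        return "full freeze + post-mortem"
    if budget_remaining_pct < 10:
        return "feature freeze"
    if budget_remaining_pct < 25:   # assumption: exactly 10% falls in this tier
        return "reduced risk"
    if budget_remaining_pct <= 50:  # assumption: exactly 25% and 50% fall in this tier
        return "increased caution"
    return "normal operations"


for remaining in (80, 45, 22, 5, 0):
    print(f"{remaining}% remaining -> {policy_action(remaining)}")
```

A deploy pipeline could, for example, refuse non-canary rollouts whenever the returned tier is anything other than "normal operations".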

Multi-Window Burn Rate Alerting

The Google SRE Workbook's recommended approach to SLO alerting: instead of alerting when SLI < SLO (too late and too noisy), alert when the burn rate (the speed at which error budget is being consumed) exceeds a threshold.

A burn rate of 1x means consuming budget at exactly the rate that would exhaust it at the end of the SLO window. Higher burn rates exhaust budget faster:

| Burn Rate | Budget Exhaustion Time (30-day SLO) | Alert Severity |
|---|---|---|
| 14.4x | ~2 days, about 50 hours (acute incident) | P1: page immediately |
| 6x | ~5 days (degradation) | P2: page during hours |
| 3x | ~10 days (slow burn) | P3: ticket |
| 1x | 30 days (normal consumption) | No alert |
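The arithmetic behind burn rates is short enough to check by hand or in code. The sketch below assumes a 30-day SLO window and a 99.9% SLO; the three helper names are illustrative and not taken from any library:

```python
def hours_to_exhaustion(burn_rate, window_days=30):
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed_fraction(burn_rate, alert_window_hours, window_days=30):
    """Share of the total budget burned during one alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


def error_rate_threshold(burn_rate, slo=0.999):
    """Error ratio that corresponds to a given burn rate for this SLO."""
    return burn_rate * (1 - slo)


print(round(hours_to_exhaustion(14.4), 1))          # 50.0 hours
print(round(hours_to_exhaustion(6) / 24, 1))        # 5.0 days
print(round(budget_consumed_fraction(14.4, 1), 4))  # 0.02 -> 2% of budget in 1 hour
print(round(budget_consumed_fraction(6, 6), 4))     # 0.05 -> 5% of budget in 6 hours
print(round(error_rate_threshold(14.4), 4))         # 0.0144 error ratio
```

The last value is the same quantity as the `14.4 * 0.001` threshold used in Prometheus-style burn rate expressions for a 99.9% SLO.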

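Multi-window alerting fires only when both windows exceed the burn rate threshold: a long window (confirming that the burn is sustained and has consumed a meaningful share of budget) and a short window (confirming that the problem is still happening right now, so the alert clears quickly after recovery):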

# Multi-window burn rate alert rules (Prometheus)
# P1: fast burn, 14.4x burn rate, 1h long window + 5m short window
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Fast burn (P1): consumes 2% of 30-day budget in 1 hour
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="order-service"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="order-service"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "order-service SLO burn rate critical (14.4x)"
          description: "Error budget will be exhausted in ~2 days (about 50 hours) at current rate."
          runbook_url: "https://runbooks.internal/slo-burn-critical"

      # Slow burn (P2): consumes 5% of 30-day budget in 6 hours
      - alert: SLOBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="order-service"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="order-service"}[30m]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "order-service SLO burn rate high (6x)"
          description: "Error budget will be exhausted in ~5 days at current rate."
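
The evaluation logic of the rules above can be sketched outside Prometheus to make the "both windows must agree" condition explicit. This is a simplified Python illustration, not how Prometheus evaluates rules internally; `burn_rate` and `should_alert` are hypothetical names, and each window is assumed to be an `(errors, total)` pair of request counts.

```python
def burn_rate(errors, total, slo=0.999):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # assumption: no traffic is treated as no burn
    return (errors / total) / (1 - slo)


def should_alert(long_window, short_window, threshold, slo=0.999):
    """Fire only if BOTH the long and the short window exceed the threshold."""
    return (
        burn_rate(*long_window, slo) > threshold
        and burn_rate(*short_window, slo) > threshold
    )


# Ongoing incident: 2% errors over 1h and 2.5% over the last 5m -> page.
print(should_alert((2000, 100_000), (200, 8000), threshold=14.4))  # True

# Incident already over: the 1h window is still high, the last 5m is clean.
print(should_alert((2000, 100_000), (1, 8000), threshold=14.4))    # False

# Brief blip: the last 5m is bad, but the 1h window is near the normal 1x burn.
print(should_alert((100, 100_000), (200, 8000), threshold=14.4))   # False
```

The second case shows why the short window matters: without it, the alert would keep firing for up to an hour after the service has recovered.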

Error Budget Consumption Over Time

flowchart LR
    A[Day 1\nBudget: 100%] --> B[Day 5\nIncident burns 20%\nBudget: 80%]
    B --> C[Day 10\nNormal ops\nBudget: 78%]
    C --> D[Day 15\nBad deploy burns 30%\nBudget: 48%]
    D --> E[Day 18\nPolicy: canary all deploys\nBudget: 45%]
    E --> F[Day 25\nSlow burn detected\nBudget: 30%]
    F --> G[Day 30\nBudget: 22%\nSLO Met ✓]

Conclusion & Next Steps

SLOs are among the most important concepts in reliability engineering because they translate technical metrics into business decisions. Key takeaways from Part 9:

  • SLIs measure behaviour (good events / total events); SLOs set targets; SLAs are contractual commitments
  • Express SLIs as ratios and measure from the user's perspective
  • As a rule of thumb, each additional nine of reliability costs roughly 10x more, so never set SLOs higher than users need
  • Error budgets quantify acceptable unreliability and create a trade-off framework between features and reliability
  • Error budget policies with concrete actions (feature freeze, mandatory canaries) make SLOs enforceable
  • Multi-window burn rate alerting detects both acute incidents and gradual degradation