Part 9: SLOs, SLIs, SLAs & Error Budgets

SLI, SLO, SLA — Definitions

These three terms are often confused. They form a hierarchy from measurement to target to contract:

Term	Definition	Audience	Example
SLI (Service Level Indicator)	A quantitative measure of service behaviour — a ratio of good events to total events	Engineering	Proportion of HTTP requests completing in < 300ms
SLO (Service Level Objective)	A target value or range for an SLI — what "good enough" means	Engineering + Product	99.9% of requests complete in < 300ms over 30 days
SLA (Service Level Agreement)	A contractual commitment with consequences for breach (refunds, credits)	Business + Customers	"99.95% uptime or receive 10% service credit"

                            
The Hierarchy: SLIs are measured. SLOs are targets set against SLIs. SLAs are promises made to customers based on SLOs. Your SLA should always be looser than your SLO (e.g., SLO = 99.95%, SLA = 99.9%), giving you a buffer before contractual penalties apply.

Choosing the Right SLIs

SLI Types by Service Category

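Different types of services need different SLIs. The table below covers four common service categories; the first three follow the request-driven, pipeline, and storage grouping used in Google's SRE guidance: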

Service Type	Primary SLIs	Formulas
Request-driven (APIs, web apps)	Availability + Latency	Availability: good requests / total requests; Latency: requests under threshold / total requests
Pipeline / batch (ETL, data processing)	Freshness + Correctness	Freshness: records processed within deadline / total records; Correctness: records without errors / total records
Storage (databases, object stores)	Durability + Availability	Durability: data objects intact / total data objects; Availability: successful reads / total reads
Streaming (Kafka, event systems)	Throughput + Freshness	Throughput: messages delivered within SLA / total messages; Freshness: time with consumer lag under threshold / total time
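
As an illustration of the pipeline row, here is a minimal Python sketch of a freshness SLI. The `Record` shape, the `freshness_sli` name, and the 10-minute deadline are assumptions made for this example; counting never-processed records as bad events is also a choice made here, not something the table prescribes.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical record shape: (created_at, processed_at or None if never processed)
Record = tuple[datetime, Optional[datetime]]

def freshness_sli(records: list[Record], deadline: timedelta) -> float:
    """Proportion of records processed within `deadline` of their creation."""
    if not records:
        return 1.0  # convention chosen here: an empty window counts as fully good
    good = sum(
        1
        for created, processed in records
        if processed is not None and processed - created <= deadline
    )
    return good / len(records)

start = datetime(2024, 1, 1, 12, 0, 0)
batch = [
    (start, start + timedelta(minutes=4)),
    (start, start + timedelta(minutes=9)),
    (start, start + timedelta(minutes=20)),  # processed, but late
    (start, None),                           # never processed
]
freshness = freshness_sli(batch, deadline=timedelta(minutes=10))
print(f"freshness SLI = {freshness:.2f}")
```

The same good-over-total shape applies to the other rows; only the definition of a "good" event changes.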

Implementing SLIs in Prometheus

# SLI: Availability — proportion of non-5xx requests
# Good events: requests with status != 5xx
# Total events: all requests
sum(rate(http_requests_total{service="order-service",status!~"5.."}[30d]))
/ sum(rate(http_requests_total{service="order-service"}[30d]))

# SLI: Latency — proportion of requests completing under 300ms
# Good events: requests in buckets up to 300ms
# Total events: all requests
sum(rate(http_request_duration_seconds_bucket{service="order-service",le="0.3"}[30d]))
/ sum(rate(http_request_duration_seconds_count{service="order-service"}[30d]))

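# SLI: Combined (most realistic): requests that are both successful AND fast
# Good events: non-5xx requests completing under 300ms
# Assumption: the latency histogram carries a status label. If it does not,
# this requires custom instrumentation or recording rules.
sum(rate(http_request_duration_seconds_bucket{service="order-service",status!~"5..",le="0.3"}[30d]))
/ sum(rate(http_request_duration_seconds_count{service="order-service"}[30d]))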

                            
SLI Best Practice: Always express SLIs as a ratio: good events / total events. This produces a number between 0 and 1 (or 0% and 100%) that is directly comparable to your SLO target. Avoid absolute thresholds like "latency must be below 300ms"; instead, measure "proportion of requests below 300ms."
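
The ratio form can be sketched in a few lines of Python. The `Request` class, the function names, and the sample data are hypothetical and exist only for illustration; the 300ms threshold matches the example used in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    duration_s: float  # request duration in seconds

def sli_ratio(good: int, total: int) -> float:
    """good events / total events; an empty window counts as fully good (a convention chosen here)."""
    return 1.0 if total == 0 else good / total

def combined_sli(requests: list[Request], threshold_s: float = 0.3) -> float:
    """Proportion of requests that are both non-5xx and faster than the threshold."""
    good = sum(1 for r in requests if r.status < 500 and r.duration_s < threshold_s)
    return sli_ratio(good, len(requests))

sample = [
    Request(200, 0.12),
    Request(200, 0.45),  # successful but slow
    Request(503, 0.05),  # fast but failed
    Request(201, 0.29),
]
availability = sli_ratio(sum(1 for r in sample if r.status < 500), len(sample))
combined = combined_sli(sample)
print(f"availability={availability:.2f} combined={combined:.2f}")
```

Because both values are ratios between 0 and 1, either can be compared directly against a target such as 0.999.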
                        

Setting Meaningful SLOs

The Nines Table — What Each Level Actually Means

SLO Target	Allowed Downtime / Month (average month, ~730.5 h)	Error Budget / Month	Practical Meaning
99% (two nines)	7.3 hours	1% of requests can fail	Internal tools, batch systems
99.5%	3.65 hours	0.5% failure rate	Non-critical customer-facing services
99.9% (three nines)	43.8 minutes	0.1% failure rate	Standard customer-facing APIs
99.95%	21.9 minutes	0.05% failure rate	High-value transactional systems
99.99% (four nines)	4.38 minutes	0.01% failure rate	Payment processing, auth systems
99.999% (five nines)	26.3 seconds	0.001% failure rate	Life-critical systems only
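
The downtime column can be reproduced with a short calculation. This sketch assumes an average month of 365.25 × 24 / 12 ≈ 730.5 hours, which is what the figures above appear to use; the function name is hypothetical.

```python
HOURS_PER_MONTH = 365.25 * 24 / 12  # average month, about 730.5 hours (assumption)

def allowed_downtime_minutes(slo_percent: float, window_hours: float = HOURS_PER_MONTH) -> float:
    """Minutes of full downtime that fit inside the error budget for the window."""
    return window_hours * 60 * (100 - slo_percent) / 100

for target in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} min/month")
```

Passing `window_hours=720` instead gives the figure for a strict 30-day window, which is slightly smaller (about 43.2 minutes at 99.9%).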

                            
The Cost of Each Nine: Each additional nine of reliability roughly requires a 10x increase in engineering effort and infrastructure cost. Going from 99.9% to 99.99% does not just mean "a little more reliable"; it means fundamentally different architecture (multi-region, active-active, zero-downtime deployments). Never set an SLO higher than your users actually need.

Common SLO Pitfalls

Setting SLOs too high: If your SLO is 99.99% but your architecture can only deliver 99.9%, you will always be in violation — demoralising the team and making the SLO meaningless
Setting SLOs too low: If your SLO is 95% but users expect 99.9%, you will meet your SLO while users are unhappy — the SLO is not useful
Too many SLOs: Start with 1-3 SLOs per service (availability + latency). Adding more creates noise without clarity
SLOs without error budget policies: An SLO without consequences for breach is just a number on a dashboard
Measuring from the wrong vantage point: Measure SLIs from the user's perspective (load balancer, API gateway), not from within the service

Error Budgets

An error budget is the complement of an SLO (1 − SLO): it quantifies how much unreliability is acceptable. If your SLO is 99.9%, your error budget is 0.1%, meaning 0.1% of requests can fail within the SLO window.

Error Budget Math

Error Budget Example: order-service

Given: SLO = 99.9% availability over 30 days. The service handles 10 million requests per day.

Total requests in 30 days: 10M × 30 = 300 million
Error budget (requests): 300M × 0.001 = 300,000 failed requests allowed
Error budget (time-based): 30 days × 24h × 60min × 0.001 = 43.2 minutes of total downtime
Daily budget: ~10,000 failed requests or ~1.44 minutes of downtime per day

If a deployment causes 50,000 errors in one hour, that consumes 16.7% of the monthly error budget in a single incident.
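
The arithmetic above can be checked with a small Python sketch. The variable and function names are illustrative; `Fraction` is used only to keep the 0.1% budget exact and avoid floating-point noise.

```python
from fractions import Fraction

SLO = Fraction(999, 1000)        # 99.9% availability
REQUESTS_PER_DAY = 10_000_000
WINDOW_DAYS = 30

budget_fraction = 1 - SLO                                   # 1/1000
total_requests = REQUESTS_PER_DAY * WINDOW_DAYS             # 300 million
budget_requests = int(total_requests * budget_fraction)     # failed requests allowed
budget_minutes = float(WINDOW_DAYS * 24 * 60 * budget_fraction)

def budget_share_consumed(failed_requests: int) -> float:
    """Fraction of the window's error budget used by a number of failed requests."""
    return float(Fraction(failed_requests, budget_requests))

incident_share = budget_share_consumed(50_000)
print(budget_requests, budget_minutes, f"{incident_share:.1%}")
```

The request-based and time-based budgets agree only if traffic is roughly uniform; for services with strong daily peaks, the request-based figure is usually the more faithful one.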

Error Budget Policy

An error budget policy defines what happens as the budget is consumed and when it is exhausted. Without a policy, error budgets are just numbers. With a policy, they become a decision-making framework.

Budget Remaining	Action	Feature Velocity Impact
> 50%	Normal operations — ship features freely	Full speed
25-50%	Increased caution — require canary deployments for all changes	Slight slowdown
10-25%	Reduced risk — no non-critical changes, reliability work prioritised	Significant slowdown
< 10%	Feature freeze — only reliability improvements and critical bug fixes	Frozen
Exhausted (0%)	Full freeze + post-mortem required before resuming feature work	Completely frozen
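
A policy like this is easy to encode so that dashboards or deployment tooling can act on it. The sketch below is one possible mapping, with a hypothetical function name; the table does not say which tier owns the exact boundary values (10%, 25%, 50%), so the choices in the comments are assumptions.

```python
def policy_action(budget_remaining: float) -> str:
    """Map remaining error budget (a fraction from 0.0 to 1.0) to a policy tier."""
    if budget_remaining <= 0:
        return "Full freeze + post-mortem"
    if budget_remaining < 0.10:
        return "Feature freeze"
    if budget_remaining < 0.25:    # assumption: exactly 10% falls in this tier
        return "Reduced risk"
    if budget_remaining <= 0.50:   # assumption: exactly 25% and exactly 50% fall in this tier
        return "Increased caution"
    return "Normal operations"

for remaining in (0.80, 0.48, 0.22, 0.05, 0.0):
    print(f"{remaining:.0%} remaining -> {policy_action(remaining)}")
```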

                            
The Error Budget Trade-Off: Error budgets create a healthy tension between feature velocity and reliability. When the budget is healthy, product teams can ship fast and take risks. When the budget is depleted, the team must slow down and fix reliability. This is not a punishment; it is a self-correcting mechanism that prevents reliability debt from accumulating unchecked.

Multi-Window Burn Rate Alerting

The Google SRE Workbook's recommended approach to SLO alerting: instead of alerting when SLI < SLO (too late and too noisy), alert when the burn rate (the speed at which error budget is being consumed) exceeds a threshold.

A burn rate of 1x means consuming budget at exactly the rate that would exhaust it at the end of the SLO window. Higher burn rates exhaust budget faster:

Burn Rate	Budget Exhaustion Time (30-day SLO)	Alert Severity
14.4x	~50 hours, about 2 days (acute incident)	P1: Page immediately
6x	~5 days (degradation)	P2: Page during business hours
3x	~10 days (slow burn)	P3: Ticket
1x	30 days (normal consumption)	No alert
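
The relationship between burn rate, exhaustion time, and alert thresholds is simple arithmetic. The sketch below assumes a 30-day SLO window and a 99.9% SLO; the function names are hypothetical.

```python
SLO_WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day SLO window

def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone if this burn rate is sustained."""
    return SLO_WINDOW_HOURS / burn_rate

def budget_fraction_burned(burn_rate: float, alert_window_hours: float) -> float:
    """Share of the total budget consumed over one alert window at this burn rate."""
    return burn_rate * alert_window_hours / SLO_WINDOW_HOURS

def error_ratio_threshold(burn_rate: float, slo: float = 0.999) -> float:
    """Error ratio that corresponds to this burn rate for the given SLO."""
    return burn_rate * (1 - slo)

for rate, window_h in ((14.4, 1), (6, 6), (3, 24), (1, 720)):
    print(
        f"{rate}x: exhausted in {hours_to_exhaustion(rate):.0f} h, "
        f"burns {budget_fraction_burned(rate, window_h):.1%} of budget in {window_h} h, "
        f"error ratio threshold {error_ratio_threshold(rate):.4f}"
    )
```

At 14.4x, the budget burns at about 2% per hour and would last roughly 50 hours; at 6x, about 5% is gone after 6 hours.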

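Multi-window alerting requires two conditions to hold before firing: a long window (confirming that a meaningful share of the budget has been burned) and a short window (confirming that the burn is still happening, so the alert clears soon after the incident ends):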

# Multi-window burn rate alert rules (Prometheus)
# P1: fast burn at 14.4x burn rate, 1h long window + 5m short window
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Fast burn (P1): consumes 2% of 30-day budget in 1 hour
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="order-service"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="order-service"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "order-service SLO burn rate critical (14.4x)"
          description: "Error budget will be exhausted in ~2 days at current rate."
          runbook_url: "https://runbooks.internal/slo-burn-critical"

      # Slow burn (P2): consumes 5% of 30-day budget in 6 hours
      - alert: SLOBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="order-service"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="order-service"}[30m]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "order-service SLO burn rate high (6x)"
          description: "Error budget will be exhausted in ~5 days at current rate."
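
The firing condition in these rules can be expressed as a small Python sketch, which may help when unit-testing thresholds outside Prometheus. It is an illustration of the logic only, not a replacement for the rules; the function names and sample ratios are hypothetical, and the `for:` duration is not modelled.

```python
def error_ratio(errors: float, total: float) -> float:
    """Fraction of failed requests; an empty window counts as no errors (a convention chosen here)."""
    return 0.0 if total == 0 else errors / total

def burn_alert_fires(long_ratio: float, short_ratio: float,
                     burn_rate: float, slo: float = 0.999) -> bool:
    """True only when BOTH windows exceed burn_rate * (1 - SLO)."""
    threshold = burn_rate * (1 - slo)
    return long_ratio > threshold and short_ratio > threshold

# Ongoing incident: 2% errors over the last hour, 3% over the last 5 minutes.
ongoing = burn_alert_fires(error_ratio(200, 10_000), error_ratio(30, 1_000), burn_rate=14.4)

# Recovered incident: the 1h window is still elevated, but the 5m window is clean.
recovered = burn_alert_fires(error_ratio(200, 10_000), error_ratio(1, 1_000), burn_rate=14.4)

print(f"ongoing={ongoing} recovered={recovered}")
```

Requiring the short window as well is what lets the alert stop firing shortly after recovery, even though the long window stays above the threshold for a while.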

Error Budget Consumption Over Time

flowchart LR
    A[Day 1\nBudget: 100%] --> B[Day 5\nIncident burns 20%\nBudget: 80%]
    B --> C[Day 10\nNormal ops\nBudget: 78%]
    C --> D[Day 15\nBad deploy burns 30%\nBudget: 48%]
    D --> E[Day 18\nPolicy: canary all deploys\nBudget: 45%]
    E --> F[Day 25\nSlow burn detected\nBudget: 30%]
    F --> G[Day 30\nBudget: 22%\nSLO Met ✓]

Conclusion & Next Steps

SLOs are among the most important concepts in reliability engineering because they translate technical metrics into business decisions. Key takeaways from Part 9:

SLIs measure behaviour (good events / total events); SLOs set targets; SLAs are contractual commitments
Express SLIs as ratios and measure from the user's perspective
Each additional nine of reliability costs ~10x more — never set SLOs higher than users need
Error budgets quantify acceptable unreliability and create a trade-off framework between features and reliability
Error budget policies with concrete actions (feature freeze, mandatory canaries) make SLOs enforceable
Multi-window burn rate alerting detects both acute incidents and gradual degradation

