SLI, SLO, SLA — Definitions
These three terms are often confused. They form a hierarchy from measurement to target to contract:
| Term | Definition | Audience | Example |
|---|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service behaviour, typically a ratio of good events to total events | Engineering | Proportion of HTTP requests completing in < 300ms |
| SLO (Service Level Objective) | A target value or range for an SLI; it defines what "good enough" means | Engineering + Product | 99.9% of requests complete in < 300ms over 30 days |
| SLA (Service Level Agreement) | A contractual commitment with consequences for breach (refunds, credits) | Business + Customers | "99.95% uptime or receive 10% service credit" |
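The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names and the event counts are hypothetical, and treating a window with no traffic as "SLO met" is an assumption you may want to change.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Return the SLI as the fraction of good events."""
    if total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as fully good
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Hypothetical counts for a 30-day window.
sli = sli_ratio(good_events=999_400, total_events=1_000_000)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli, 0.999)}")
```

The SLA does not appear in the code because it is a contract rather than a measurement; it would typically reference the same SLI with its own threshold and a penalty clause.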
Choosing the Right SLIs
SLI Types by Service Category
Different types of services need different SLIs. Google SRE identifies four common SLI categories:
| Service Type | Primary SLIs | Formulas |
|---|---|---|
| Request-driven (APIs, web apps) | Availability + Latency | Availability: good requests / total requests. Latency: requests faster than threshold / total requests |
| Pipeline / batch (ETL, data processing) | Freshness + Correctness | Freshness: records processed within deadline / total records. Correctness: records without errors / total records |
| Storage (databases, object stores) | Durability + Availability | Durability: data objects intact / total data objects. Availability: successful reads / total reads |
| Streaming (Kafka, event systems) | Throughput + Freshness | Throughput: messages delivered within SLA / total messages. Freshness: time with consumer lag < threshold / total time |
Implementing SLIs in Prometheus
# SLI: Availability — proportion of non-5xx requests
# Good events: requests with status != 5xx
# Total events: all requests
sum(rate(http_requests_total{service="order-service",status!~"5.."}[30d]))
/ sum(rate(http_requests_total{service="order-service"}[30d]))
# SLI: Latency — proportion of requests completing under 300ms
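# Good events: requests in the cumulative le="0.3" bucket (at or under 300ms)
# Note: the histogram must define a bucket boundary at exactly 0.3 seconds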
# Total events: all requests
sum(rate(http_request_duration_seconds_bucket{service="order-service",le="0.3"}[30d]))
/ sum(rate(http_request_duration_seconds_count{service="order-service"}[30d]))
# SLI: Combined (most realistic) — requests that are both successful AND fast
# Good events: non-5xx requests completing under 300ms
# This requires custom instrumentation or recording rules
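To make the combined SLI concrete, here is a minimal Python sketch of the classification that custom instrumentation would perform per request. The `Request` record, the `is_good` helper and the sample data are all hypothetical. The 0.3 s threshold comes from the latency SLI above, and the comparison uses "at or under" to mirror Prometheus `le` bucket semantics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def is_good(req: Request, latency_threshold_s: float = 0.3) -> bool:
    """A request is good only if it is non-5xx AND within the latency threshold."""
    return req.status < 500 and req.duration_s <= latency_threshold_s


def combined_sli(requests: list[Request]) -> float:
    """Fraction of requests that are both successful and fast."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully good
    return sum(is_good(r) for r in requests) / len(requests)


sample = [
    Request(200, 0.12),  # good
    Request(200, 0.45),  # successful but too slow
    Request(503, 0.05),  # fast but failed
    Request(404, 0.08),  # client error: not a 5xx, so counted as good here
]
print(combined_sli(sample))  # 0.5
```

In a real service the same decision would usually be made at instrumentation time, for example by incrementing a dedicated "good requests" counter, so that the ratio can be computed from two counters.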
Setting Meaningful SLOs
The Nines Table — What Each Level Actually Means
| SLO Target | Allowed Downtime / Month (730 h average month) | Error Budget / Month | Practical Meaning |
|---|---|---|---|
| 99% (two nines) | 7.3 hours | 1% of requests can fail | Internal tools, batch systems |
| 99.5% | 3.65 hours | 0.5% failure rate | Non-critical customer-facing services |
| 99.9% (three nines) | 43.8 minutes | 0.1% failure rate | Standard customer-facing APIs |
| 99.95% | 21.9 minutes | 0.05% failure rate | High-value transactional systems |
| 99.99% (four nines) | 4.38 minutes | 0.01% failure rate | Payment processing, auth systems |
| 99.999% (five nines) | 26.3 seconds | 0.001% failure rate | Life-critical systems only |
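The downtime column can be reproduced with a short calculation. The sketch below assumes an average month of 730 hours (8,760 hours / 12), which is the basis that matches the figures in the table; the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month, 8,760 h / 12


def allowed_downtime_minutes(slo_percent: float,
                             period_hours: float = HOURS_PER_MONTH) -> float:
    """Minutes of full downtime permitted by an availability SLO over the period."""
    error_budget = 1 - slo_percent / 100
    return period_hours * 60 * error_budget


for target in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/month")
```

Passing `period_hours=720` gives the equivalent figures for a strict 30-day window, for example 43.2 minutes at 99.9%.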
Common SLO Pitfalls
- Setting SLOs too high: If your SLO is 99.99% but your architecture can only deliver 99.9%, you will always be in violation — demoralising the team and making the SLO meaningless
- Setting SLOs too low: If your SLO is 95% but users expect 99.9%, you will meet your SLO while users are unhappy — the SLO is not useful
- Too many SLOs: Start with 1-3 SLOs per service (availability + latency). Adding more creates noise without clarity
- SLOs without error budget policies: An SLO without consequences for breach is just a number on a dashboard
- Measuring from the wrong vantage point: Measure SLIs from the user's perspective (load balancer, API gateway), not from within the service
Error Budgets
An error budget is the complement of an SLO (1 - SLO): it quantifies how much unreliability is acceptable. If your SLO is 99.9%, your error budget is 0.1%, meaning 0.1% of requests may fail within the SLO window.
Error Budget Math
Error Budget Example: order-service
Given: SLO = 99.9% availability over 30 days. The service handles 10 million requests per day.
- Total requests in 30 days: 10M × 30 = 300 million
- Error budget (requests): 300M × 0.001 = 300,000 failed requests allowed
- Error budget (time-based): 30 days × 24h × 60min × 0.001 = 43.2 minutes of total downtime (the nines table above shows 43.8 minutes because it assumes an average 730-hour month)
- Daily budget: ~10,000 failed requests or ~1.44 minutes of downtime per day
If a deployment causes 50,000 errors in one hour, that consumes 16.7% of the monthly error budget in a single incident.
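The arithmetic above can be checked with a small Python sketch. The function names are illustrative, and the inputs are the order-service figures from the example.

```python
def error_budget_requests(slo: float, total_requests: int) -> float:
    """Number of requests allowed to fail within the SLO window."""
    return total_requests * (1 - slo)


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full downtime allowed within the SLO window."""
    return window_days * 24 * 60 * (1 - slo)


def budget_consumed(failed_requests: int, slo: float, total_requests: int) -> float:
    """Fraction of the error budget used up by a given number of failures."""
    return failed_requests / error_budget_requests(slo, total_requests)


total = 10_000_000 * 30  # 10M requests/day over 30 days
print(f"{error_budget_requests(0.999, total):,.0f} failed requests allowed")
print(f"{error_budget_minutes(0.999, 30):.1f} minutes of downtime allowed")
print(f"{budget_consumed(50_000, 0.999, total):.1%} of budget used by the incident")
```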
Error Budget Policy
An error budget policy defines what happens as the budget is consumed and when it is exhausted. Without a policy, error budgets are just numbers. With a policy, they become a decision-making framework.
| Budget Remaining | Action | Feature Velocity Impact |
|---|---|---|
| > 50% | Normal operations — ship features freely | Full speed |
| 25-50% | Increased caution — require canary deployments for all changes | Slight slowdown |
| 10-25% | Reduced risk — no non-critical changes, reliability work prioritised | Significant slowdown |
| < 10% | Feature freeze — only reliability improvements and critical bug fixes | Frozen |
| Exhausted (0%) | Full freeze + post-mortem required before resuming feature work | Completely frozen |
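A policy like this is easy to encode so that tooling (a deploy gate, a dashboard) can apply it consistently. The sketch below is one possible encoding. The function name is hypothetical, and because the table's ranges share their boundary values, the code assumes that exactly 50% and 25% fall into "increased caution" and exactly 10% into "reduced risk".

```python
def policy_action(budget_remaining: float) -> str:
    """Map remaining error budget (fraction between 0 and 1) to a policy action."""
    if budget_remaining <= 0:
        return "Full freeze + post-mortem"
    if budget_remaining < 0.10:
        return "Feature freeze"
    if budget_remaining < 0.25:
        return "Reduced risk"
    if budget_remaining <= 0.50:
        return "Increased caution"
    return "Normal operations"


for remaining in (0.80, 0.48, 0.22, 0.05, 0.0):
    print(f"{remaining:.0%} remaining -> {policy_action(remaining)}")
```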
Multi-Window Burn Rate Alerting
Google's Site Reliability Workbook recommends alerting on burn rate rather than on the raw SLI. Instead of alerting when SLI < SLO (too late and too noisy), alert when the burn rate, the speed at which error budget is being consumed, exceeds a threshold.
A burn rate of 1x means consuming budget at exactly the rate that would exhaust it at the end of the SLO window. Higher burn rates exhaust budget faster:
| Burn Rate | Budget Exhaustion Time (30-day SLO) | Alert Severity |
|---|---|---|
| 14.4x | ~2 days (~50 hours; acute incident) | P1: page immediately |
| 6x | ~5 days (degradation) | P2: page during business hours |
| 3x | ~10 days (slow burn) | P3: ticket |
| 1x | 30 days (normal consumption) | No alert |
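The relationship between burn rate, time to exhaustion and budget consumed is simple arithmetic, sketched below with illustrative function names. The key identities are: burn rate = observed error rate / error budget, and time to exhaustion = SLO window / burn rate. At 14.4x, 2% of a 30-day budget is spent per hour, which works out to roughly 50 hours until the budget is gone.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo)


def hours_to_exhaustion(rate: float, slo_window_days: float = 30) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    return slo_window_days * 24 / rate


def budget_consumed_in(window_hours: float, rate: float,
                       slo_window_days: float = 30) -> float:
    """Fraction of the budget spent during window_hours at the given burn rate."""
    return rate * window_hours / (slo_window_days * 24)


for rate in (14.4, 6, 3, 1):
    print(f"{rate}x -> budget exhausted in {hours_to_exhaustion(rate):.0f} h")
print(f"{budget_consumed_in(1, 14.4):.0%} of budget burned in 1 h at 14.4x")
print(f"{budget_consumed_in(6, 6):.0%} of budget burned in 6 h at 6x")
```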
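Multi-window alerting fires only when two conditions hold at once: the burn rate over a long window exceeds the threshold (showing that a meaningful share of budget has been spent) and the burn rate over a short window also exceeds it (showing that the problem is still happening):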
# Multi-window burn rate alert rules (Prometheus)
# 0.001 is the error budget for a 99.9% SLO (1 - 0.999)
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Fast burn (P1): consumes 2% of 30-day budget in 1 hour
      # 14.4x burn rate, 1h long window + 5m short window
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="order-service"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="order-service"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "order-service SLO burn rate critical (14.4x)"
          description: "Error budget will be exhausted in ~2 days at current rate."
          runbook_url: "https://runbooks.internal/slo-burn-critical"
      # Slow burn (P2): consumes 5% of 30-day budget in 6 hours
      # 6x burn rate, 6h long window + 30m short window
      - alert: SLOBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="order-service"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="order-service",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="order-service"}[30m]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "order-service SLO burn rate high (6x)"
          description: "Error budget will be exhausted in ~5 days at current rate."
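The alert condition itself is a logical AND over two windows. The Python sketch below mirrors that logic outside Prometheus so the behaviour is easier to reason about. The function name and the sample error rates are hypothetical, and the 99.9% SLO is taken from the rules above.

```python
def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 burn_rate_threshold: float,
                 slo: float = 0.999) -> bool:
    """Fire only if BOTH windows exceed burn_rate_threshold x error budget."""
    threshold = burn_rate_threshold * (1 - slo)
    return (long_window_error_rate > threshold
            and short_window_error_rate > threshold)


# Ongoing incident: both the 1h and the 5m error rates are above 1.44%.
print(should_alert(0.02, 0.03, 14.4))    # True

# Incident already over: the 1h rate is still high, the 5m rate has recovered.
print(should_alert(0.02, 0.001, 14.4))   # False

# Brief blip: the 5m rate spikes but the 1h rate shows little budget spent.
print(should_alert(0.005, 0.03, 14.4))   # False
```

The second case shows why the short window matters: without it, the alert would keep firing for up to an hour after the service had recovered.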
Example of error budget consumption across one 30-day window (Mermaid flowchart):
flowchart LR
A[Day 1\nBudget: 100%] --> B[Day 5\nIncident burns 20%\nBudget: 80%]
B --> C[Day 10\nNormal ops\nBudget: 78%]
C --> D[Day 15\nBad deploy burns 30%\nBudget: 48%]
D --> E[Day 18\nPolicy: canary all deploys\nBudget: 45%]
E --> F[Day 25\nSlow burn detected\nBudget: 30%]
F --> G[Day 30\nBudget: 22%\nSLO Met ✓]
Conclusion & Next Steps
SLOs are a central concept in reliability engineering because they translate technical metrics into business decisions. Key takeaways from Part 9:
- SLIs measure behaviour (good events / total events); SLOs set targets; SLAs are contractual commitments
- Express SLIs as ratios and measure from the user's perspective
- As a rule of thumb, each additional nine of reliability costs roughly 10x more, so never set SLOs higher than users need
- Error budgets quantify acceptable unreliability and create a trade-off framework between features and reliability
- Error budget policies with concrete actions (feature freeze, mandatory canaries) make SLOs enforceable
- Multi-window burn rate alerting detects both acute incidents and gradual degradation