SLI / SLO / SLA Hierarchy
The Reliability Hierarchy
| Term | Definition | Example | Owner |
|---|---|---|---|
| SLI | Service Level Indicator — a quantitative measure of service quality | Ratio of successful HTTP requests | Engineering |
| SLO | Service Level Objective — target value for an SLI | 99.9% of requests succeed in <300ms | Engineering + Product |
| SLA | Service Level Agreement — contractual commitment with consequences | 99.5% uptime or credits issued | Business + Legal |
Golden Rule: Your SLO should always be stricter than your SLA. If your SLA promises 99.5% availability, set your internal SLO to 99.9%. The gap between SLO and SLA is your safety margin — when the SLO is breached, you have time to recover before violating the SLA.
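To make the safety margin concrete, the following sketch converts both targets into allowed downtime. The function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration, not part of any tool mentioned here.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime that an availability target permits in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target)


sla_minutes = allowed_downtime_minutes(0.995)  # external, contractual commitment
slo_minutes = allowed_downtime_minutes(0.999)  # internal, stricter objective
margin_minutes = sla_minutes - slo_minutes     # headroom after the SLO is breached

print(f"SLA allows {sla_minutes:.1f} min, SLO allows {slo_minutes:.1f} min, "
      f"margin {margin_minutes:.1f} min")
```

Under these assumptions the SLA tolerates 216 minutes of downtime and the SLO 43.2 minutes, so roughly 173 minutes separate an SLO breach from an SLA violation.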
Defining SLIs with Prometheus
# Recording rules for SLI: Request Availability
# SLI = proportion of successful requests
groups:
  - name: slo:api-gateway:availability
    rules:
      # Total and failed request rates over a 5-minute window
      - record: slo:api_requests:rate5m
        expr: sum(rate(http_requests_total{job="api-gateway"}[5m]))
      - record: slo:api_errors:rate5m
        expr: sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[5m]))
      # Error ratio over the last 5 minutes
      - record: slo:api_error_ratio:rate5m
        expr: |
          slo:api_errors:rate5m / slo:api_requests:rate5m
      # SLI: 1 - error_ratio = availability
      - record: slo:api_availability:rate5m
        expr: 1 - (slo:api_errors:rate5m / slo:api_requests:rate5m)
# Recording rules for SLI: Request Latency
# SLI = proportion of requests faster than threshold
groups:
  - name: slo:api-gateway:latency
    rules:
      # Requests within latency budget (300ms threshold)
      - record: slo:api_latency_good:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{
            job="api-gateway",
            le="0.3"
          }[5m]))
      - record: slo:api_latency_total:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_count{
            job="api-gateway"
          }[5m]))
      # SLI: proportion within latency budget
      - record: slo:api_latency_ratio:rate5m
        expr: |
          slo:api_latency_good:rate5m / slo:api_latency_total:rate5m
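Both SLIs above reduce to the same ratio of good events to total events. The sketch below shows that arithmetic outside Prometheus; the counter values are made up, and `sli_ratio` is a hypothetical helper name.

```python
from typing import Optional


def sli_ratio(good_events: float, total_events: float) -> Optional[float]:
    """Return good/total, or None when there was no traffic to measure."""
    if total_events <= 0:
        return None  # PromQL would yield NaN or an empty result here
    return good_events / total_events


# Hypothetical counter increases over one evaluation window
total_requests = 120_000
server_errors = 84         # responses with a 5xx status
fast_requests = 118_200    # observations counted in the le="0.3" bucket

availability_sli = sli_ratio(total_requests - server_errors, total_requests)
latency_sli = sli_ratio(fast_requests, total_requests)

print(f"availability SLI: {availability_sli:.4%}")
print(f"latency SLI:      {latency_sli:.4%}")
```

Returning `None` for an empty window is a design choice in this sketch: a period with no traffic says nothing about reliability, so it should not count as either 0% or 100%.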
Error Budgets
The error budget is the acceptable amount of unreliability over a time window. For a 99.9% SLO over 30 days:
Error Budget Calculation:
Error budget = 1 - SLO target = 1 - 0.999 = 0.001 (0.1%)
In 30 days: 30 × 24 × 60 = 43,200 minutes
Budget in minutes: 43,200 × 0.001 = 43.2 minutes of downtime allowed
Budget in requests: if 1M requests/month, 1,000 can fail
# Recording rule: Error budget remaining (request-based)
groups:
  - name: slo:api-gateway:error-budget
    rules:
      # Total errors in 30-day window
      - record: slo:api_errors:increase30d
        expr: |
          sum(increase(http_requests_total{job="api-gateway",status=~"5.."}[30d]))
      # Total requests in 30-day window
      - record: slo:api_requests:increase30d
        expr: |
          sum(increase(http_requests_total{job="api-gateway"}[30d]))
      # Error budget consumed (0 = none consumed, 1 = fully exhausted)
      - record: slo:api_error_budget_consumed:ratio
        expr: |
          (slo:api_errors:increase30d / slo:api_requests:increase30d)
          / (1 - 0.999)
      # Error budget remaining (1 = full budget, 0 = exhausted, negative = over)
      - record: slo:api_error_budget_remaining:ratio
        expr: 1 - slo:api_error_budget_consumed:ratio
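The consumed and remaining ratios recorded above can be checked by hand. This is a minimal sketch of the same arithmetic with hypothetical request counts; `budget_consumed` is an illustrative name, not a Prometheus or Sloth function.

```python
def budget_consumed(errors: float, total: float, slo_target: float) -> float:
    """Fraction of the error budget used: observed error ratio / allowed error ratio."""
    if total <= 0:
        return 0.0  # assumption: no traffic means no budget was spent
    return (errors / total) / (1 - slo_target)


# Hypothetical 30-day totals for a 99.9% SLO
consumed = budget_consumed(errors=400, total=1_000_000, slo_target=0.999)
remaining = 1 - consumed

print(f"consumed:  {consumed:.1%}")
print(f"remaining: {remaining:.1%}")
```

With 400 failures out of 1M requests, the observed error ratio is 0.04% against an allowed 0.1%, so about 40% of the budget is spent and 60% remains.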
Multi-Window Multi-Burn-Rate Alerting
Burn rate measures how fast you’re consuming your error budget. A burn rate of 1x means you’ll exactly exhaust your budget in 30 days. A burn rate of 14.4x means you’ll exhaust it in 30 days / 14.4 ≈ 50 hours (about 2 days), which is the same as burning 2% of the budget every hour:
Multi-Window Burn-Rate Alert Windows
| Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Time to Exhaust |
|---|---|---|---|---|---|
| Page | 14.4x | 1h | 5m | 2% in 1h | ~2 days (50 hours) |
| Page | 6x | 6h | 30m | 5% in 6h | ~5 days |
| Ticket | 3x | 1d | 2h | 10% in 1d | ~10 days |
| Ticket | 1x | 3d | 6h | 10% in 3d | ~30 days |
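The "Budget Consumed" and "Time to Exhaust" columns follow from two small formulas over a 30-day (720-hour) window. The sketch below derives them for each burn rate; the function names are illustrative.

```python
WINDOW_HOURS = 30 * 24  # 720 hours in the 30-day SLO window


def budget_consumed_in(burn_rate: float, hours: float) -> float:
    """Fraction of the whole budget burned after `hours` at a constant burn rate."""
    return burn_rate * hours / WINDOW_HOURS


def hours_to_exhaust(burn_rate: float) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return WINDOW_HOURS / burn_rate


# (burn rate, long window in hours) for each alert tier
tiers = [(14.4, 1), (6, 6), (3, 24), (1, 72)]

for burn_rate, long_window in tiers:
    consumed = budget_consumed_in(burn_rate, long_window)
    exhaust_days = hours_to_exhaust(burn_rate) / 24
    print(f"{burn_rate:>5}x: {consumed:.0%} in {long_window}h, "
          f"exhausted in {exhaust_days:.1f} days")
```

The 14.4x tier, for example, burns 2% of the budget in one hour and would exhaust a full budget in about 2.1 days.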
# Multi-window multi-burn-rate alerts for 99.9% availability SLO
groups:
  - name: slo:api-gateway:burn-rate-alerts
    rules:
      # ---- Recording rules for burn rates ----
      # 5m error ratio
      - record: slo:api_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api-gateway"}[5m]))
      # 30m error ratio
      - record: slo:api_error_ratio:rate30m
        expr: |
          sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[30m]))
          / sum(rate(http_requests_total{job="api-gateway"}[30m]))
      # 1h error ratio
      - record: slo:api_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="api-gateway"}[1h]))
      # 2h error ratio
      - record: slo:api_error_ratio:rate2h
        expr: |
          sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[2h]))
          / sum(rate(http_requests_total{job="api-gateway"}[2h]))
      # 6h error ratio
      - record: slo:api_error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[6h]))
          / sum(rate(http_requests_total{job="api-gateway"}[6h]))
      # 1d error ratio
      - record: slo:api_error_ratio:rate1d
        expr: |
          sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[1d]))
          / sum(rate(http_requests_total{job="api-gateway"}[1d]))
      # 3d error ratio
      - record: slo:api_error_ratio:rate3d
        expr: |
          sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[3d]))
          / sum(rate(http_requests_total{job="api-gateway"}[3d]))
      # ---- Alerting rules ----
      # Page: 14.4x burn rate over 1h (confirmed by 5m short window)
      - alert: SLOBurnRateCritical
        expr: |
          slo:api_error_ratio:rate1h > (14.4 * 0.001)
          and
          slo:api_error_ratio:rate5m > (14.4 * 0.001)
        labels:
          severity: critical
          slo: api-availability
          window: 1h
        annotations:
          summary: "API availability burning error budget 14.4x faster than allowed"
          description: "At current rate, 30-day error budget will be exhausted in ~2 days (50 hours)"
          runbook_url: "https://runbooks.example.com/slo-api-availability"
      # Page: 6x burn rate over 6h (confirmed by 30m short window)
      - alert: SLOBurnRateHigh
        expr: |
          slo:api_error_ratio:rate6h > (6 * 0.001)
          and
          slo:api_error_ratio:rate30m > (6 * 0.001)
        labels:
          severity: critical
          slo: api-availability
          window: 6h
        annotations:
          summary: "API availability burning error budget 6x faster than allowed"
          description: "At current rate, 30-day error budget will be exhausted in ~5 days"
      # Ticket: 3x burn rate over 1d (confirmed by 2h short window)
      - alert: SLOBurnRateWarning
        expr: |
          slo:api_error_ratio:rate1d > (3 * 0.001)
          and
          slo:api_error_ratio:rate2h > (3 * 0.001)
        labels:
          severity: warning
          slo: api-availability
          window: 1d
        annotations:
          summary: "API availability burning error budget 3x faster than allowed"
      # Ticket: 1x burn rate over 3d (confirmed by 6h short window)
      - alert: SLOBurnRateSlow
        expr: |
          slo:api_error_ratio:rate3d > (1 * 0.001)
          and
          slo:api_error_ratio:rate6h > (1 * 0.001)
        labels:
          severity: warning
          slo: api-availability
          window: 3d
        annotations:
          summary: "API availability on track to miss SLO this period"
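Each alert above fires only when both the long and the short window exceed the same threshold. The sketch below reproduces that condition with hypothetical error ratios to show why the short window matters; `burn_rate_alert_fires` is an illustrative name.

```python
def burn_rate_alert_fires(long_window_ratio: float,
                          short_window_ratio: float,
                          burn_rate: float,
                          slo_target: float = 0.999) -> bool:
    """True when both windows burn budget faster than `burn_rate` times the allowed rate."""
    threshold = burn_rate * (1 - slo_target)  # e.g. 14.4 * 0.001 = 0.0144
    return long_window_ratio > threshold and short_window_ratio > threshold


# Incident still in progress: both the 1h and 5m windows show 2-3% errors.
ongoing = burn_rate_alert_fires(0.020, 0.030, burn_rate=14.4)

# Incident already over: the 1h window is still elevated, the 5m window is healthy.
recovered = burn_rate_alert_fires(0.020, 0.0005, burn_rate=14.4)

print(f"ongoing incident pages:   {ongoing}")
print(f"recovered incident pages: {recovered}")
```

The second case is the one the short window exists for: the long window alone would keep paging for up to an hour after the errors stopped.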
Automating with Sloth
Sloth generates all the recording rules and burn-rate alerts from a simple SLO specification:
# slo.yaml: Sloth SLO specification
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: platform
  tier: tier-1
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of API requests succeed"
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api-gateway"}[{{.window}}]))
    alerting:
      name: APIGatewayAvailability
      labels:
        team: platform
      annotations:
        runbook_url: "https://runbooks.example.com/api-gateway-availability"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
  - name: "requests-latency"
    objective: 99.0
    description: "99% of API requests complete within 300ms"
    sli:
      events:
        error_query: |
          sum(rate(http_request_duration_seconds_count{job="api-gateway"}[{{.window}}]))
          -
          sum(rate(http_request_duration_seconds_bucket{job="api-gateway",le="0.3"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="api-gateway"}[{{.window}}]))
    alerting:
      name: APIGatewayLatency
      labels:
        team: platform
# Generate Prometheus rules from Sloth spec
sloth generate -i slo.yaml -o rules/
# Output: rules/api-gateway.yaml
# Contains:
# - 14 recording rules (error ratios at multiple windows)
# - 4 alerting rules (multi-burn-rate, multi-window)
# - SLO metadata labels on all rules
# Validate generated rules
promtool check rules rules/api-gateway.yaml
# Sloth also supports Kubernetes CRD mode
sloth generate -i slo.yaml --mode kubernetes -o manifests/
SLO Dashboards
# Key queries for SLO dashboards
# 1. Current error budget remaining (as a ratio; multiply by 100 for a percentage)
1 - (
  sum(increase(http_requests_total{job="api-gateway",status=~"5.."}[30d]))
  / sum(increase(http_requests_total{job="api-gateway"}[30d]))
) / (1 - 0.999)
# 2. Burn rate (current, over 1h)
(
  sum(rate(http_requests_total{job="api-gateway",status=~"5.."}[1h]))
  / sum(rate(http_requests_total{job="api-gateway"}[1h]))
) / (1 - 0.999)
# 3. Time until budget exhaustion (hours); 720 = hours in the 30-day window
# Only meaningful if burn_rate > 1. burn_rate_1h stands for the result of query 2.
(720 * (1 - slo:api_error_budget_consumed:ratio)) / burn_rate_1h
# 4. SLI over sliding windows
slo:api_availability:rate5m   # Real-time
slo:api_error_ratio:rate1h    # Hourly trend
slo:api_error_ratio:rate1d    # Daily trend
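Query 3 is a linear projection: the remaining budget fraction, scaled to the 720-hour window, divided by the current burn rate. A minimal sketch of that projection, with hypothetical inputs and an illustrative function name:

```python
import math


def hours_until_exhaustion(budget_consumed: float,
                           burn_rate: float,
                           window_hours: float = 720.0) -> float:
    """Project how long the remaining budget lasts at the current burn rate."""
    remaining = 1 - budget_consumed
    if remaining <= 0:
        return 0.0        # budget is already exhausted
    if burn_rate <= 0:
        return math.inf   # no errors are being produced right now
    return window_hours * remaining / burn_rate


# 40% of the budget is spent and errors are arriving at 6x the allowed rate.
hours_left = hours_until_exhaustion(budget_consumed=0.4, burn_rate=6)
print(f"budget exhausted in about {hours_left:.0f} hours")
```

The projection assumes the burn rate stays constant, so it is best read as a trend indicator on a dashboard rather than a forecast.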
Error Budget Policy
Error Budget Policy Framework:
- >50% remaining: Normal velocity. Ship features freely, accept reasonable risk.
- 25–50% remaining: Caution. Require extra review for risky deployments. Prioritize reliability work.
- <25% remaining: Freeze non-critical deploys. Focus engineering time on reliability improvements.
- 0% (exhausted): Full feature freeze. All engineering effort on reliability until budget replenishes.
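A policy like this is easiest to enforce when it is encoded somewhere a deploy pipeline can query it. The sketch below maps the remaining budget ratio to one of the four tiers; the tier labels and the handling of the exact 25% and 50% boundaries are assumptions, since the framework above does not spell them out.

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (1.0 = full, 0.0 = exhausted) to a policy tier."""
    if budget_remaining <= 0:
        return "full-freeze"          # all effort on reliability
    if budget_remaining < 0.25:
        return "freeze-non-critical"  # only critical deploys go out
    if budget_remaining <= 0.50:
        return "caution"              # extra review for risky deploys
    return "normal"                   # ship features freely


for remaining in (0.80, 0.40, 0.10, -0.05):
    print(f"{remaining:>6.0%} remaining -> {deploy_policy(remaining)}")
```

In practice the input would come from a metric such as `slo:api_error_budget_remaining:ratio`, which can go negative once the budget is overspent.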
Conclusion
Key Takeaways:
- SLIs are ratios — good events / total events, measured from user-facing metrics
- Error budgets quantify risk tolerance — 99.9% SLO = 43 minutes of allowed downtime per month
- Multi-burn-rate alerts are precise — page for fast burns (14.4x), ticket for slow burns (1x)
- Short windows prevent stale alerts — confirm the long-window signal is still active
- Sloth automates the math — declare SLO target, get recording rules and alerts for free
- Error budget policies drive decisions — connect SLO status to deployment velocity