Introduction — Reliability as a Delivery Concern
There is a dangerous myth in software engineering: that deployment is the finish line. You write code, tests pass, CI goes green, you merge, and the pipeline deploys. Done, right? Wrong. You are not done deploying until you have verified it works in production, confirmed it has not degraded existing functionality, and established that the system remains healthy under real traffic.
Reliability is not something that happens after delivery — it is embedded within delivery. The CI/CD pipeline must include reliability gates: health checks, canary analysis, error budget evaluation, and automated rollback mechanisms. Without these, you are not doing continuous delivery — you are doing continuous hope.
The Reliability Spectrum
No system is 100% reliable. Pursuing absolute reliability is economically irrational — the cost increases exponentially while the benefit diminishes. Instead, teams must decide how reliable they need to be and engineer to that target:
| Availability Target | Downtime Per Year | Downtime Per Month | Typical Use Case |
|---|---|---|---|
| 99% ("two nines") | 3.65 days | 7.3 hours | Internal tools, batch processing |
| 99.9% ("three nines") | 8.77 hours | 43.8 minutes | Standard SaaS, e-commerce |
| 99.95% | 4.38 hours | 21.9 minutes | Infrastructure services, APIs |
| 99.99% ("four nines") | 52.6 minutes | 4.38 minutes | Payment processing, healthcare |
| 99.999% ("five nines") | 5.26 minutes | 26.3 seconds | Emergency services, stock exchanges |
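The difference between 99.9% and 99.99% looks small numerically, but it cuts the allowed downtime tenfold (from 8.77 hours to 52.6 minutes per year) and typically demands roughly an order of magnitude more engineering effort, redundancy, testing, and operational discipline.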
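The downtime figures in the table follow directly from the availability target. The sketch below reproduces the arithmetic, assuming an average year of 365.25 days and a month of one twelfth of that; the function name `allowed_downtime` is illustrative only.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def allowed_downtime(availability: float, period_days: float) -> timedelta:
    """Return the downtime permitted by an availability target over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return timedelta(days=period_days * (1.0 - availability))


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
        per_year = allowed_downtime(target, DAYS_PER_YEAR)
        per_month = allowed_downtime(target, DAYS_PER_YEAR / 12)
        print(f"{target:.3%}: {per_year} per year, {per_month} per month")
```

Each extra nine divides the allowance by ten, which is why the jump from three to four nines is so much harder than it looks.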
SLIs, SLOs & Error Budgets
Google's SRE framework introduced a structured approach to reliability that has become industry standard. The three key concepts form a hierarchy:
Service Level Indicators (SLIs)
An SLI is a quantitative measure of service behaviour. It answers: "What are we actually measuring?" Good SLIs are user-facing — they measure what users experience, not internal system metrics.
Common SLIs include:
- Availability — proportion of successful requests (HTTP 2xx/3xx vs 5xx)
- Latency — time to serve a response (p50, p95, p99)
- Throughput — requests successfully processed per second
- Error rate — proportion of requests resulting in errors
- Correctness — proportion of responses with correct data
# Example: Prometheus recording rules for SLIs
groups:
  - name: sli_rules
    rules:
      # Availability SLI: successful (2xx/3xx) requests / total requests
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"[23].."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      # Latency SLI: requests under 300ms / total requests
      # (requires a histogram bucket boundary at exactly 0.3 seconds)
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
Service Level Objectives (SLOs)
An SLO is a target value for an SLI over a defined time window. It answers: "How reliable do we need to be?" SLOs are internal commitments — they set the engineering bar.
# Example SLO definitions
slos:
  - name: "API Availability"
    sli: sli:availability:ratio_rate5m
    target: 0.999   # 99.9% of requests succeed
    window: 30d     # measured over rolling 30 days
  - name: "API Latency (p99)"
    sli: sli:latency:ratio_rate5m
    target: 0.99    # 99% of requests under 300ms
    window: 30d
Error Budgets — The Balance Between Reliability and Velocity
The error budget is the most powerful concept in SRE. It is simply 1 minus the SLO target. If your SLO is 99.9% availability, your error budget is 0.1%: you are "allowed" to be unavailable for about 43 minutes in a rolling 30-day window (43.8 minutes in an average calendar month).
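For a request-based SLO, the same budget can be tracked as a count of failures rather than minutes. This is a minimal sketch of that arithmetic; the function names and the traffic figures in the usage note are illustrative assumptions.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests (or time) allowed to fail: 1 minus the SLO target."""
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    1.0 means untouched, 0.0 means exhausted, a negative value means overspent.
    """
    allowed_failures = error_budget(slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no budget available: SLO is 100% or there was no traffic")
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9% SLO and 10,000,000 requests in the window, 10,000 failures are allowed; after 4,000 failures, `budget_remaining(0.999, 10_000_000, 4_000)` reports about 0.6, meaning 60% of the budget is left.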
Error budgets resolve the eternal tension between feature velocity and reliability:
- Budget remaining? → Ship features aggressively. Take risks. Deploy frequently.
- Budget exhausted? → Freeze feature releases. Focus entirely on reliability improvements. Fix the systems that caused the budget burn.
flowchart TD
A[Check Error Budget] --> B{Budget Remaining?}
B -->|Yes > 50%| C[Ship Features Aggressively]
B -->|Yes 10-50%| D[Ship with Extra Caution]
B -->|No < 10%| E[Feature Freeze]
C --> F[Monitor SLI Impact]
D --> F
E --> G[Reliability Sprint]
G --> H[Fix Root Causes]
H --> I[Budget Recovers]
I --> A
F --> A
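The decision in the flowchart can be expressed as a small function that a release pipeline could call before promoting a build. This is a sketch only: the thresholds mirror the diagram, and `release_policy` is a hypothetical name, not part of any tool.

```python
def release_policy(budget_remaining: float) -> str:
    """Map the unspent fraction of the error budget to a release stance.

    Thresholds follow the flowchart: above 50%, 10-50%, below 10%.
    """
    if budget_remaining > 0.5:
        return "ship features aggressively"
    if budget_remaining >= 0.1:
        return "ship with extra caution"
    return "feature freeze"
```

A pipeline gate might refuse to deploy non-reliability changes whenever this returns "feature freeze".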
Google's Error Budget Policy
At Google, error budgets are enforced rigorously. When a team exhausts their error budget, releases are halted and the team must spend engineering time on reliability. This creates natural alignment between SRE teams (who want stability) and product teams (who want features). Neither side "wins" — the error budget is the objective arbiter. Teams that maintain good reliability earn the freedom to deploy quickly; teams that burn their budget lose that privilege until they invest in stability.
The Three Pillars of Observability
Observability is the ability to understand a system's internal state by examining its external outputs. Unlike monitoring (which tells you something is wrong), observability tells you why it is wrong — even for failure modes you did not anticipate.
The three pillars work together — each provides a different lens on system behaviour:
flowchart LR
    subgraph Pillars
        L["Logs<br/>What happened?"]
        M["Metrics<br/>How much? How fast?"]
        T["Traces<br/>Where did time go?"]
    end
    L --> C[Correlated View]
    M --> C
    T --> C
    C --> I[Full System Understanding]
Pillar 1: Logs
Logs are discrete, timestamped records of events. They are the most detailed observability signal — they tell you exactly what happened and when. Modern logging requires structure:
{
  "timestamp": "2026-05-13T14:23:01.234Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "userId": "user_42",
  "message": "Payment processing failed",
  "error": {
    "type": "TimeoutException",
    "message": "Stripe API did not respond within 5000ms",
    "stack": "at PaymentProcessor.charge(PaymentProcessor.java:142)"
  },
  "context": {
    "amount": 4999,
    "currency": "USD",
    "paymentMethod": "card_ending_4242"
  }
}
Key logging best practices:
- Structured format — JSON, not free-text. Enables querying and aggregation.
- Centralised collection — All services ship logs to one place (ELK Stack, Loki, CloudWatch).
- Correlation IDs — Include trace ID and span ID in every log entry to connect logs across services.
- Log levels — DEBUG, INFO, WARN, ERROR, FATAL. Production defaults to INFO; switch to DEBUG during incidents.
- Avoid PII — Never log passwords, tokens, or personally identifiable information.
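The practices above can be approximated with Python's standard `logging` module alone. The sketch below emits one JSON object per log line and copies correlation IDs from the `extra=` argument; the `JsonFormatter` class, `build_logger` helper, and the chosen field names are assumptions for illustration, and a production service would more likely use an established structured-logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields, copied from `extra=` when present.
    CONTEXT_FIELDS = ("service", "traceId", "spanId")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        if record.exc_info:
            entry["error"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr at INFO level."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    log.error(
        "Payment processing failed",
        extra={"service": "payment-service", "traceId": "abc123def456", "spanId": "span789"},
    )
```

Because every entry carries the trace ID, a log query for one `traceId` returns the matching lines from every service that handled the request.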
Pillar 2: Metrics
Metrics are numeric measurements aggregated over time. Unlike logs (one entry per event), metrics summarise behaviour: "How many requests per second? What is the 95th percentile latency?"
The three most commonly used metric types (Prometheus also defines a fourth, Summary):
| Type | Description | Example | Use When |
|---|---|---|---|
| Counter | Monotonically increasing value | Total HTTP requests, errors | Counting occurrences |
| Gauge | Value that goes up and down | Memory usage, queue depth | Current state snapshots |
| Histogram | Distribution of values in buckets | Request latency percentiles | Understanding distributions |
from prometheus_client import Counter, Histogram, Gauge

# Counter: total requests served
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram: request duration distribution
# (includes a 0.3s boundary so a 300ms latency SLI can be read from the buckets)
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.3, 0.5, 1.0, 2.5, 5.0]
)

# Gauge: current active connections
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)
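Histograms store only cumulative bucket counts, so percentiles are estimated rather than measured exactly. The sketch below shows the linear-interpolation idea behind functions such as PromQL's `histogram_quantile`; it is a simplified illustration, not the actual Prometheus implementation, and the bucket counts in the usage note are invented.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the last bound being math.inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile falls past the last finite bound; report that bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound
```

For buckets `[(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]`, the estimated p90 is roughly 0.417 seconds. The estimate can never be more precise than the bucket boundaries, which is why bucket choice matters.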
Pillar 3: Distributed Traces
A trace follows a single request as it flows through multiple services. Each service adds a "span" — a named, timed operation. The trace shows the full journey: which services were called, how long each took, and where failures occurred.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure OpenTelemetry tracing
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payment-service")

# Create spans that form a trace: the nested spans become children of process_payment
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.amount", 4999)
    span.set_attribute("payment.currency", "USD")

    with tracer.start_as_current_span("validate_card"):
        # Card validation logic
        pass

    with tracer.start_as_current_span("charge_stripe"):
        # Stripe API call
        pass

    with tracer.start_as_current_span("update_database"):
        # DB write
        pass
Context propagation is critical — trace context (trace ID, span ID) must be passed between services via HTTP headers (W3C Trace Context standard: traceparent header) so that spans from different services are connected into a single trace.
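To make the header concrete, here is a sketch that parses a `traceparent` value and builds the header for an outgoing call. It handles only version `00` of the format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags); the `TraceContext` type and both function names are hypothetical. In practice the OpenTelemetry SDK's propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context carried by a traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(incoming: Optional[TraceContext]) -> str:
    """Build the header for an outgoing request.

    The trace ID is kept so spans join the same trace; a fresh span ID
    identifies the current operation as the parent of the downstream call.
    """
    trace_id = incoming.trace_id if incoming else secrets.token_hex(16)
    sampled = incoming.sampled if incoming else True
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

A service would call `parse_traceparent` on the inbound request and attach the result of `child_traceparent` to every outbound HTTP call it makes while handling that request.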
Observability in CI/CD
Observability is not just for production. Your delivery pipeline itself needs observability:
Build & Deploy Metrics
- Build duration — Is CI getting slower? Which stages are bottlenecks?
- Build success rate — What percentage of builds pass? Trending up or down?
- Test flakiness — Which tests fail intermittently? Flaky tests erode confidence.
- Deployment frequency — How often are you deploying? (DORA metric)
- Lead time for changes — Commit to production in how many minutes?
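Two of these metrics, deployment frequency and lead time for changes, can be derived from nothing more than commit and deploy timestamps. The sketch below assumes a hypothetical `Deployment` record exported from your pipeline; the field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class Deployment:
    commit_time: datetime  # when the change was committed
    deploy_time: datetime  # when it reached production


def deployment_frequency(deploys: Sequence[Deployment], window_days: float) -> float:
    """Average number of production deployments per day over the window."""
    return len(deploys) / window_days


def median_lead_time(deploys: Sequence[Deployment]) -> timedelta:
    """Median time from commit to production."""
    seconds = [(d.deploy_time - d.commit_time).total_seconds() for d in deploys]
    return timedelta(seconds=median(seconds))
```

The median is used rather than the mean so that one change that sat unreleased for days does not hide an otherwise fast pipeline.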
Change Correlation — Connecting Deployments to Incidents
The single most important observability practice for CI/CD is deploy markers — annotations on your dashboards showing when deployments occurred. When an incident spike appears on a graph, the first question is always: "Did anything deploy recently?"
# Add deployment annotation to Grafana
curl -X POST http://grafana:3000/api/annotations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -d '{
    "dashboardUID": "main-dashboard",
    "time": '"$(date +%s000)"',
    "tags": ["deployment", "payment-service", "v2.14.1"],
    "text": "Deployed payment-service v2.14.1 (commit abc123)"
  }'
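The "did anything deploy recently?" question can also be answered programmatically if deployments are recorded somewhere queryable. This sketch filters a list of deploy events to those inside a lookback window before the incident; the tuple layout, the 30-minute default, and the function name are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical deploy event: (service, version, deployed_at)
DeployEvent = Tuple[str, str, datetime]


def recent_deploys(
    deploys: List[DeployEvent],
    incident_start: datetime,
    lookback: timedelta = timedelta(minutes=30),
) -> List[DeployEvent]:
    """Return deploys that landed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    suspects = [d for d in deploys if window_start <= d[2] <= incident_start]
    return sorted(suspects, key=lambda d: d[2], reverse=True)
```

The newest deploy is listed first because it is usually the first candidate to roll back.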
Automated Rollbacks
A rollback is the act of reverting a system to its previous known-good state. Automated rollbacks remove the human delay from incident mitigation — the system detects a problem and reverts itself before users are significantly impacted.
Rollback Triggers
What signals should trigger an automatic rollback?
- Error rate spike — 5xx responses exceed threshold (e.g., >1% for 2 minutes)
- Latency increase — p99 latency exceeds baseline by >200% for 3 minutes
- Health check failure — Liveness or readiness probes fail on new pods
- Crash loop — New containers restart more than 3 times in 5 minutes
- SLO burn rate — Error budget consumption rate exceeds safe threshold
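Triggers phrased as "above X for N minutes" need a sustained-condition check so that a single noisy sample does not cause a rollback. The following is a sketch of that logic under stated assumptions: samples arrive in time order, and the `Sample` type and `should_roll_back` name are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    timestamp: float   # seconds since some epoch
    error_rate: float  # fraction of requests failing, e.g. 0.02 for 2%


def should_roll_back(
    samples: Sequence[Sample],
    threshold: float = 0.01,
    sustained_seconds: float = 120,
) -> bool:
    """True when the error rate stayed above the threshold for the whole trailing window.

    `samples` must be sorted by timestamp, oldest first.
    """
    if not samples:
        return False
    window_start = samples[-1].timestamp - sustained_seconds
    # Without history covering the full window we cannot call it sustained.
    if samples[0].timestamp > window_start:
        return False
    window = [s for s in samples if s.timestamp >= window_start]
    return all(s.error_rate > threshold for s in window)
```

Requiring the full window trades a slower reaction for fewer false rollbacks; shorter windows suit services where a bad release fails loudly.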
# Argo Rollouts: automated rollback on error rate spike
# (fragment: replicas, selector, and pod template omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 5m }
        - setWeight: 60
        - pause: { duration: 5m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1  # Begin analysis after first weight increase
      # A failed analysis aborts the rollout and returns traffic to stable.
      # This setting only controls how long canary pods are kept after an abort.
      abortScaleDownDelaySeconds: 30
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 60s
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5..",service="payment-service",version="canary"}[2m]))
            /
            sum(rate(http_requests_total{service="payment-service",version="canary"}[2m]))
      successCondition: result[0] < 0.01  # Less than 1% error rate
Rollback Strategies by Deployment Type
flowchart TD
A[Problem Detected] --> B{Deployment Type?}
B -->|Blue-Green| C[Switch traffic back to Blue]
B -->|Canary| D[Halt promotion, route 100% to stable]
B -->|Rolling| E[Reverse rolling update]
B -->|Feature Flag| F[Disable flag instantly]
C --> G[Instant Recovery]
D --> G
F --> G
E --> H[Gradual Recovery]
| Strategy | Rollback Speed | User Impact During Rollback | Complexity |
|---|---|---|---|
| Blue-Green | Near-instant (load balancer switch; DNS-based switches wait on TTL expiry) | Zero (old environment still running) | Low |
| Canary | Fast (stop promotion) | Minimal (only canary % affected) | Medium |
| Rolling | Slow (reverse rollout) | Some (mixed versions during rollback) | Medium |
| Feature Flag | Instant (toggle) | Zero (code still deployed, feature hidden) | Low (but requires flag infrastructure) |
Incident Response
When things go wrong — and they will — the speed and quality of your response determines the blast radius. A well-rehearsed incident response process turns a potential catastrophe into a minor blip.
The Incident Lifecycle
- Detection — Alerts fire, users report, or automated systems detect anomalies. Faster detection = smaller blast radius.
- Triage — Assess severity. Is this a P1 (revenue-impacting, all-hands) or a P3 (minor degradation, fix tomorrow)?
- Mitigation — Stop the bleeding. Rollback, scale up, failover, disable feature flag. Mitigation before root cause analysis.
- Resolution — Fix the underlying issue. This may happen after mitigation has already restored service.
- Review — Blameless postmortem. What happened, why, and what can we improve?
On-Call Rotations & Escalation
On-call is the practice of having designated engineers available to respond to incidents outside business hours. Sustainable on-call requires:
- Rotation schedules — Weekly rotations with primary and secondary on-call. No one is on-call for more than one week in four.
- Escalation paths — If the primary cannot resolve within 15 minutes, escalate to the secondary. If both cannot resolve, escalate to the engineering manager and then to senior staff.
- Compensation — On-call engineers should receive additional pay or time off. Uncompensated on-call leads to burnout and attrition.
- Runbooks — Pre-written playbooks for common incidents. "If error X occurs, do steps Y and Z." Reduces cognitive load at 3 AM.
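An escalation path is easiest to audit when it is written down as data. The sketch below encodes the 15-minute primary-to-secondary step described above; the 30- and 45-minute thresholds for the later levels are assumptions added for illustration, as are the names `ESCALATION_POLICY` and `who_to_page`.

```python
from typing import List, Sequence, Tuple

# (minutes unresolved, role to page). Only the 15-minute step comes from the
# text above; the 30 and 45 minute steps are illustrative assumptions.
ESCALATION_POLICY: Sequence[Tuple[int, str]] = (
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
    (45, "senior staff"),
)


def who_to_page(minutes_unresolved: int) -> List[str]:
    """Return every role that should have been paged by this point in the incident."""
    return [role for threshold, role in ESCALATION_POLICY if minutes_unresolved >= threshold]
```

Tools such as PagerDuty or Opsgenie implement this kind of policy for you; the value of writing it out is that the thresholds become reviewable.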
The Incident Commander Role
For major incidents (P1/P2), an Incident Commander (IC) takes charge. The IC does not fix the problem — they coordinate the response:
- Assigns roles (investigation, communication, mitigation)
- Makes decisions when opinions conflict
- Communicates status to stakeholders every 15-30 minutes
- Decides when to escalate and when to stand down
- Ensures someone is taking notes for the postmortem
PagerDuty's Incident Response Framework
PagerDuty publishes their entire incident response process as open-source documentation. Their framework defines clear roles (Incident Commander, Scribe, Subject Matter Expert, Customer Liaison), communication templates, severity definitions, and escalation procedures. Their key principle: "If in doubt, page." It is always better to wake someone up unnecessarily than to let an incident grow because you were unsure whether to escalate. Companies including Dropbox, Shopify, and Stripe have adopted this framework.
Post-Incident Reviews (PIRs)
A Post-Incident Review (also called a postmortem or retrospective) is a structured analysis of what went wrong, why, and what to change. The goal is learning, not blame.
Blameless Postmortems
The cardinal rule of post-incident reviews: blameless, not nameless. We acknowledge who did what (because the details matter for understanding the sequence of events), but we do not punish people for making mistakes in complex systems. If an engineer can be blamed for an outage, the real failure is that the system allowed a single human error to cause an outage.
The Five Whys Technique
Originally from Toyota's manufacturing process, the Five Whys technique digs past surface symptoms to root causes:
- Why did the site go down? — The database ran out of connections.
- Why did it run out of connections? — A new service was leaking connections (not closing them).
- Why was it leaking connections? — The developer did not use the connection pool correctly.
- Why did incorrect usage get deployed? — The code review did not catch the issue.
- Why did the code review miss it? — There is no automated check for connection pool usage patterns.
Root cause: Missing static analysis rule for connection pool patterns. Action item: Add a linter rule that flags raw connection usage without proper cleanup.
Postmortem Template
# Post-Incident Review Template
title: "Payment Processing Outage - 2026-05-10"
severity: P1
duration: 47 minutes (14:23 - 15:10 UTC)
impact: "~12,000 users unable to complete purchases. Estimated revenue loss: $43,000"
timeline:
  - "14:20 - Deploy payment-service v2.14.1 (canary 10%)"
  - "14:23 - Error rate alert fires (5xx rate > 5%)"
  - "14:25 - On-call engineer acknowledges alert"
  - "14:28 - IC declared, war room opened"
  - "14:35 - Root cause identified: Stripe SDK version incompatibility"
  - "14:38 - Rollback initiated"
  - "14:42 - Rollback complete, error rate normalising"
  - "15:10 - All metrics nominal, incident resolved"
root_cause: |
  Stripe SDK v4.2.0 introduced a breaking change in the payment intent API.
  Our upgrade from v4.1.x to v4.2.0 was not caught by integration tests
  because the test mock did not reflect the new API contract.
action_items:
  - "Add contract tests against live Stripe sandbox (owner: @alice, due: May 17)"
  - "Implement canary analysis with automatic rollback on >1% error rate (owner: @bob, due: May 20)"
  - "Update Stripe SDK upgrade runbook with breaking change checklist (owner: @carol, due: May 15)"
lessons_learned:
  - "Mocked external APIs can hide breaking changes - need contract tests"
  - "Canary analysis was manual - automated rollback would have limited blast radius to <2 minutes"
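A timeline like the one in this template also yields the response metrics worth tracking across incidents. The sketch below derives them from the example's timestamps; the key names in `TIMELINE` and the helper `minutes_between` are illustrative, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Timestamps taken from the example timeline above (UTC, same day).
TIMELINE = {
    "deployed": "14:20",
    "alert_fired": "14:23",
    "acknowledged": "14:25",
    "rollback_complete": "14:42",
    "resolved": "15:10",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["deployed"], TIMELINE["alert_fired"])
time_to_acknowledge = minutes_between(TIMELINE["alert_fired"], TIMELINE["acknowledged"])
time_to_mitigate = minutes_between(TIMELINE["alert_fired"], TIMELINE["rollback_complete"])
incident_duration = minutes_between(TIMELINE["alert_fired"], TIMELINE["resolved"])

if __name__ == "__main__":
    print(f"detect: {time_to_detect} min, acknowledge: {time_to_acknowledge} min")
    print(f"mitigate: {time_to_mitigate} min, total: {incident_duration} min")
```

Here detection took 3 minutes but mitigation took 19, which supports the template's action item on automating the rollback.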
Alerting Strategy
Alerts are the bridge between observability and action. A good alerting strategy wakes you up for real problems and lets you sleep through noise. A bad strategy causes alert fatigue — engineers start ignoring alerts, and real incidents go unnoticed.
Alert on Symptoms, Not Causes
The golden rule of alerting:
- Symptom alert (good): "API error rate is above 1% for the last 5 minutes" — this directly affects users.
- Cause alert (bad): "CPU usage is at 85%" — this might not affect users at all. Many services run fine at 85% CPU.
Alert on what users experience. Investigate causes after you are alerted to symptoms.
Reducing Alert Fatigue
- Every alert must be actionable — if there is nothing an engineer can do about it, remove the alert.
- Tune thresholds — if an alert fires 10 times but only 1 was a real problem, the threshold is too sensitive.
- Group related alerts — 50 alerts from 50 pods failing for the same reason should be 1 grouped alert.
- Severity levels — P1 pages you immediately. P3 goes to a ticket queue for business hours.
- Review alert volume weekly — if on-call engineers are paged more than twice per shift, there is a problem.
# Prometheus alerting rule: symptom-based
groups:
  - name: payment-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5..",service="payment-service"}[5m]))
          /
          sum(rate(http_requests_total{service="payment-service"}[5m]))
          > 0.01
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payment service error rate above 1%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/payment-errors"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) by (le)
          ) > 2.0
        for: 3m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Payment service p99 latency above 2 seconds"
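The HighErrorRate rule uses a fixed 1% threshold. An alternative that ties alerting to the SLO is the burn rate listed earlier among the rollback triggers: how many times faster than "exactly on budget" the service is currently failing. This is a sketch of the arithmetic only; the function names and the 30-day period default are assumptions.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_rate / budget


def budget_consumed(rate: float, window_hours: float, slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the whole period's budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / slo_period_hours
```

With a 99.9% SLO, a 1.44% error rate is a burn rate of 14.4; sustained for one hour, that spends 2% of a 30-day budget. Alerting on burn rate rather than a raw percentage keeps the threshold meaningful when the SLO changes.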
Tools Landscape
The observability and reliability tooling ecosystem is vast. Here is a brief comparison of major players:
| Tool | Category | Strengths | Best For |
|---|---|---|---|
| Prometheus + Grafana | Metrics + Visualisation | Open-source, PromQL power, massive ecosystem | Kubernetes-native environments |
| Datadog | Full-stack observability | All three pillars unified, easy setup, APM | Teams wanting one platform |
| New Relic | APM + Full-stack | Strong APM, free tier, NRQL query language | Application performance focus |
| Honeycomb | Observability (trace-first) | High-cardinality queries, BubbleUp analysis | Debugging complex distributed systems |
| Jaeger | Distributed Tracing | Open-source, CNCF project, trace visualisation | Tracing in Kubernetes |
| OpenTelemetry | Instrumentation standard | Vendor-neutral, covers all three pillars | Avoiding vendor lock-in |
| PagerDuty / Opsgenie | Incident Management | Alerting, on-call scheduling, escalation | On-call management |
Exercises
Define SLIs and SLOs for a Service
Choose a service you work with (or imagine an e-commerce checkout service). Define 3 SLIs (availability, latency, error rate), set SLO targets for each, and calculate the monthly error budget. Document what happens when the budget is exhausted — what freezes, who decides when to resume releases?
Instrument a Service with OpenTelemetry
Take a simple HTTP service (Node.js, Python, or Go) and add OpenTelemetry instrumentation. Configure it to export traces to a local Jaeger instance and metrics to a backend that accepts them, such as Prometheus (Jaeger stores traces only). Verify you can see request traces with latency breakdowns for each middleware and handler.
Design an Automated Rollback Pipeline
Using Argo Rollouts (or describe pseudocode), design a canary deployment with automatic rollback. Define: the canary weight progression (10% → 30% → 60% → 100%), the analysis query (Prometheus or Datadog), the failure threshold, and what happens when analysis fails at the 30% step.
Write a Blameless Postmortem
Imagine this scenario: A configuration change was deployed that accidentally disabled rate limiting. A traffic spike overwhelmed the database, causing a 23-minute outage. Write a complete postmortem using the template provided: timeline, root cause, Five Whys analysis, action items with owners and due dates.
Conclusion & Next Steps
Reliability is not a phase you reach — it is a discipline you practice. By defining clear SLIs and SLOs, you make reliability measurable. By implementing error budgets, you make the reliability-versus-velocity tradeoff explicit and data-driven. By building observability into your CI/CD pipelines, you create a system that can detect, diagnose, and recover from failures automatically.
The three pillars of observability — logs, metrics, and traces — give you different lenses on the same truth. Combined with automated rollbacks, a clear incident response process, and blameless postmortems, you create a delivery system that learns from every failure and gets more reliable over time.
In Part 27: Monolith vs Microservices Delivery Architecture, we will explore how your architectural choices — monolith, modular monolith, or microservices — fundamentally shape your delivery pipelines, team structure, and the observability challenges you face.