
Part 26: Reliability, Rollbacks & Observability in CI/CD

May 13, 2026 Wasil Zafar 42 min read

Deploying code is not the finish line — verifying it works is. This article covers SLIs, SLOs, error budgets, the three pillars of observability, automated rollback strategies, incident response, and the culture of blameless postmortems that turns failures into improvements.

Table of Contents

  1. Introduction
  2. SLIs, SLOs & Error Budgets
  3. Three Pillars of Observability
  4. Observability in CI/CD
  5. Automated Rollbacks
  6. Incident Response
  7. Post-Incident Reviews
  8. Alerting Strategy
  9. Tools Landscape
  10. Exercises
  11. Conclusion & Next Steps

Introduction — Reliability as a Delivery Concern

There is a dangerous myth in software engineering: that deployment is the finish line. You write code, tests pass, CI goes green, you merge, and the pipeline deploys. Done, right? Wrong. You are not done deploying until you have verified it works in production, confirmed it has not degraded existing functionality, and established that the system remains healthy under real traffic.

Reliability is not something that happens after delivery — it is embedded within delivery. The CI/CD pipeline must include reliability gates: health checks, canary analysis, error budget evaluation, and automated rollback mechanisms. Without these, you are not doing continuous delivery — you are doing continuous hope.

Key Insight: Google's Site Reliability Engineering (SRE) book treats reliability as the most important feature of any system, because users cannot use features that do not work. Reliability is not opposed to velocity; it enables velocity by giving teams the confidence to deploy frequently.

The Reliability Spectrum

No system is 100% reliable. Pursuing absolute reliability is economically irrational — the cost increases exponentially while the benefit diminishes. Instead, teams must decide how reliable they need to be and engineer to that target:

| Availability Target | Downtime Per Year | Downtime Per Month | Typical Use Case |
|---|---|---|---|
| 99% ("two nines") | 3.65 days | 7.3 hours | Internal tools, batch processing |
| 99.9% ("three nines") | 8.77 hours | 43.8 minutes | Standard SaaS, e-commerce |
| 99.95% | 4.38 hours | 21.9 minutes | Infrastructure services, APIs |
| 99.99% ("four nines") | 52.6 minutes | 4.38 minutes | Payment processing, healthcare |
| 99.999% ("five nines") | 5.26 minutes | 26.3 seconds | Emergency services, stock exchanges |

The difference between 99.9% and 99.99% looks small numerically, but it cuts the permitted downtime tenfold (from 43.8 to 4.38 minutes per month) and typically demands far more engineering effort, redundancy, testing, and operational discipline.
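The figures in the table follow directly from the availability target. As a minimal sketch, assuming an average year of 365.25 days and a month of one twelfth of that (the function name is illustrative):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, including leap days


def allowed_downtime_seconds(availability: float, period_seconds: float) -> float:
    """Return the downtime permitted in a period at a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
        per_year = allowed_downtime_seconds(target, SECONDS_PER_YEAR)
        per_month = per_year / 12
        print(f"{target:.3%}: {per_year / 3600:.2f} h/year, {per_month / 60:.2f} min/month")
```

Each extra nine divides the permitted downtime by ten, which is why the jump from three to four nines is so much harder than the numbers suggest.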

SLIs, SLOs & Error Budgets

Google's SRE framework introduced a structured approach to reliability that has become industry standard. The three key concepts form a hierarchy:

Service Level Indicators (SLIs)

An SLI is a quantitative measure of service behaviour. It answers: "What are we actually measuring?" Good SLIs are user-facing — they measure what users experience, not internal system metrics.

Common SLIs include:

  • Availability — proportion of successful requests (HTTP 2xx/3xx vs 5xx)
  • Latency — time to serve a response (p50, p95, p99)
  • Throughput — requests successfully processed per second
  • Error rate — proportion of requests resulting in errors
  • Correctness — proportion of responses with correct data
# Example: Prometheus recording rules for SLIs
groups:
  - name: sli_rules
    rules:
      # Availability SLI: successful (2xx/3xx) requests / total requests
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"[23].."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # Latency SLI: requests under 300ms / total requests
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))

Service Level Objectives (SLOs)

An SLO is a target value for an SLI over a defined time window. It answers: "How reliable do we need to be?" SLOs are internal commitments — they set the engineering bar.

# Example SLO definitions
slos:
  - name: "API Availability"
    sli: sli:availability:ratio_rate5m
    target: 0.999          # 99.9% of requests succeed
    window: 30d            # measured over rolling 30 days
    
  - name: "API Latency (p99)"
    sli: sli:latency:ratio_rate5m
    target: 0.99           # 99% of requests under 300ms
    window: 30d
SLO vs SLA: An SLA (Service Level Agreement) is a contract with customers, often with financial penalties. An SLO is an internal target. Your SLOs should always be stricter than your SLAs — if your SLA promises 99.9%, your SLO should target 99.95% so you have margin before breaching the contract.

Error Budgets — The Balance Between Reliability and Velocity

The error budget is one of the most powerful concepts in SRE. It is simply 1 minus the SLO target. If your SLO is 99.9% availability, your error budget is 0.1%: you are "allowed" to be unavailable for roughly 43 minutes per month (43.2 minutes over a rolling 30-day window, 43.8 minutes in an average calendar month).

Error budgets resolve the eternal tension between feature velocity and reliability:

  • Budget remaining? → Ship features aggressively. Take risks. Deploy frequently.
  • Budget exhausted? → Freeze feature releases. Focus entirely on reliability improvements. Fix the systems that caused the budget burn.
Error Budget Decision Flow
flowchart TD
    A[Check Error Budget] --> B{Budget Remaining?}
    B -->|Yes > 50%| C[Ship Features Aggressively]
    B -->|Yes 10-50%| D[Ship with Extra Caution]
    B -->|No < 10%| E[Feature Freeze]
    C --> F[Monitor SLI Impact]
    D --> F
    E --> G[Reliability Sprint]
    G --> H[Fix Root Causes]
    H --> I[Budget Recovers]
    I --> A
    F --> A
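
The decision flow above can be expressed as a small policy function. This is a sketch under stated assumptions: the budget is counted in requests rather than minutes, the class and function names are hypothetical, and the 50% and 10% thresholds are taken from the flowchart.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.999 for a 99.9% SLO
    total_requests: int    # requests served in the SLO window
    failed_requests: int   # requests that violated the SLI in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is (1 - SLO) of all requests in the window.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched, 0.0 means exhausted, negative means overspent.
        if self.allowed_failures == 0:
            # No traffic (or a 100% SLO): any failure exhausts the budget.
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_policy(budget: ErrorBudget) -> str:
    """Map remaining budget to the three branches of the decision flow."""
    remaining = budget.remaining_fraction
    if remaining > 0.5:
        return "ship"
    if remaining >= 0.1:
        return "ship-with-caution"
    return "feature-freeze"
```

A pipeline gate could call `release_policy` before promoting a build and refuse feature deployments while it returns `"feature-freeze"`.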
                            
Case Study

Google's Error Budget Policy

At Google, error budgets are enforced rigorously. When a team exhausts their error budget, releases are halted and the team must spend engineering time on reliability. This creates natural alignment between SRE teams (who want stability) and product teams (who want features). Neither side "wins" — the error budget is the objective arbiter. Teams that maintain good reliability earn the freedom to deploy quickly; teams that burn their budget lose that privilege until they invest in stability.


The Three Pillars of Observability

Observability is the ability to understand a system's internal state by examining its external outputs. Unlike monitoring (which tells you something is wrong), observability tells you why it is wrong — even for failure modes you did not anticipate.

The three pillars work together — each provides a different lens on system behaviour:

Three Pillars of Observability
flowchart LR
    subgraph Pillars
        L["Logs<br/>What happened?"]
        M["Metrics<br/>How much? How fast?"]
        T["Traces<br/>Where did time go?"]
    end
    L --> C[Correlated View]
    M --> C
    T --> C
    C --> I[Full System Understanding]

Pillar 1: Logs

Logs are discrete, timestamped records of events. They are the most detailed observability signal — they tell you exactly what happened and when. Modern logging requires structure:

{
  "timestamp": "2026-05-13T14:23:01.234Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "userId": "user_42",
  "message": "Payment processing failed",
  "error": {
    "type": "TimeoutException",
    "message": "Stripe API did not respond within 5000ms",
    "stack": "at PaymentProcessor.charge(PaymentProcessor.java:142)"
  },
  "context": {
    "amount": 4999,
    "currency": "USD",
    "paymentMethod": "card_ending_4242"
  }
}

Key logging best practices:

  • Structured format — JSON, not free-text. Enables querying and aggregation.
  • Centralised collection — All services ship logs to one place (ELK Stack, Loki, CloudWatch).
  • Correlation IDs — Include trace ID and span ID in every log entry to connect logs across services.
  • Log levels — DEBUG, INFO, WARN, ERROR, FATAL. Production defaults to INFO; switch to DEBUG during incidents.
  • Avoid PII — Never log passwords, tokens, or personally identifiable information.
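
As one way to apply these practices, the sketch below emits structured JSON with Python's standard `logging` module. The `JsonFormatter` class and `build_logger` helper are illustrative names rather than part of any library, and the field names simply mirror the example entry above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # Correlation IDs are passed per call via `extra=`; None if absent.
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
        }
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)  # production default; lower to DEBUG during incidents
    logger.propagate = False
    return logger


logger = build_logger("payment-service")
logger.error("Payment processing failed", extra={"traceId": "abc123def456", "spanId": "span789"})
```

Because every entry is one JSON object carrying the trace ID, a central log store can filter on `traceId` to reassemble a request's path across services.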

Pillar 2: Metrics

Metrics are numeric measurements aggregated over time. Unlike logs (one entry per event), metrics summarise behaviour: "How many requests per second? What is the 95th percentile latency?"

The three fundamental metric types:

| Type | Description | Example | Use When |
|---|---|---|---|
| Counter | Monotonically increasing value | Total HTTP requests, errors | Counting occurrences |
| Gauge | Value that goes up and down | Memory usage, queue depth | Current state snapshots |
| Histogram | Distribution of values in buckets | Request latency percentiles | Understanding distributions |
from prometheus_client import Counter, Histogram, Gauge

# Counter: total requests served
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram: request duration distribution
# Buckets include 0.3 so a 300ms latency SLI (le="0.3") has an exact boundary
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.3, 0.5, 1.0, 2.5, 5.0]
)

# Gauge: current active connections
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)
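
To see how a histogram yields percentiles, the following simplified sketch estimates a quantile from cumulative bucket counts by interpolating linearly inside the matching bucket, which is the general approach PromQL's `histogram_quantile` takes for bucketed histograms. The function and its input format are illustrative, not the Prometheus implementation.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

The estimate can never be more precise than the bucket boundaries, so it pays to place boundaries near the latency thresholds you care about.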

Pillar 3: Distributed Traces

A trace follows a single request as it flows through multiple services. Each service adds a "span" — a named, timed operation. The trace shows the full journey: which services were called, how long each took, and where failures occurred.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure OpenTelemetry tracing
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payment-service")

# Create spans that form a trace
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.amount", 4999)
    span.set_attribute("payment.currency", "USD")
    
    with tracer.start_as_current_span("validate_card"):
        # Card validation logic
        pass
    
    with tracer.start_as_current_span("charge_stripe"):
        # Stripe API call
        pass
    
    with tracer.start_as_current_span("update_database"):
        # DB write
        pass

Context propagation is critical — trace context (trace ID, span ID) must be passed between services via HTTP headers (W3C Trace Context standard: traceparent header) so that spans from different services are connected into a single trace.
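
In practice the OpenTelemetry SDK handles propagation for you, but it helps to see what travels in the header. The sketch below covers only the version-00 `traceparent` layout (`00-<trace-id>-<parent-id>-<flags>`); the helper names are hypothetical, and sampling new traces by default is an assumption.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version 00, 32-hex trace ID, 16-hex parent (span) ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str   # shared by every span in the trace
    parent_id: str  # the caller's span ID
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(incoming: Optional[str]) -> str:
    """Header for an outgoing call: keep the trace ID, mint a new span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx.trace_id if ctx else secrets.token_hex(16)
    sampled = ctx.sampled if ctx else True  # assumption: sample new traces
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"
```

A service that receives a request forwards `child_traceparent(request_header)` on each downstream call, so every span shares one trace ID.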

Observability in CI/CD

Observability is not just for production. Your delivery pipeline itself needs observability:

Build & Deploy Metrics

  • Build duration — Is CI getting slower? Which stages are bottlenecks?
  • Build success rate — What percentage of builds pass? Trending up or down?
  • Test flakiness — Which tests fail intermittently? Flaky tests erode confidence.
  • Deployment frequency — How often are you deploying? (DORA metric)
  • Lead time for changes — Commit to production in how many minutes?
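
Two of these metrics, deployment frequency and lead time for changes, can be derived from nothing more than commit and deploy timestamps. The sketch below assumes you can export such records from your CI system; the `Deployment` type and function names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it reached production


def deployment_frequency(deploys: list[Deployment], window_days: int) -> float:
    """Average deployments per day over the reporting window."""
    return len(deploys) / window_days


def median_lead_time(deploys: list[Deployment]) -> timedelta:
    """Median commit-to-production time; the median resists outliers."""
    return median(d.deployed_at - d.committed_at for d in deploys)
```

Tracking the median rather than the mean keeps one stuck release from distorting the trend.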

Change Correlation — Connecting Deployments to Incidents

The single most important observability practice for CI/CD is deploy markers — annotations on your dashboards showing when deployments occurred. When an incident spike appears on a graph, the first question is always: "Did anything deploy recently?"

# Add deployment annotation to Grafana
curl -X POST http://grafana:3000/api/annotations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -d '{
    "dashboardUID": "main-dashboard",
    "time": '"$(date +%s000)"',
    "tags": ["deployment", "payment-service", "v2.14.1"],
    "text": "Deployed payment-service v2.14.1 (commit abc123)"
  }'
The Four Golden Signals: Google recommends monitoring four "golden signals" for every service: Latency, Traffic, Errors, and Saturation. If you can only monitor four things, monitor these, because together they cover the vast majority of failure scenarios.

Automated Rollbacks

A rollback is the act of reverting a system to its previous known-good state. Automated rollbacks remove the human delay from incident mitigation — the system detects a problem and reverts itself before users are significantly impacted.

Rollback Triggers

What signals should trigger an automatic rollback?

  • Error rate spike — 5xx responses exceed threshold (e.g., >1% for 2 minutes)
  • Latency increase — p99 latency exceeds baseline by >200% for 3 minutes
  • Health check failure — Liveness or readiness probes fail on new pods
  • Crash loop — New containers restart more than 3 times in 5 minutes
  • SLO burn rate — Error budget consumption rate exceeds safe threshold
# Argo Rollouts: automated rollback on error rate spike
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 5m }
        - setWeight: 60
        - pause: { duration: 5m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1    # Begin analysis after first weight increase
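      # A failed analysis aborts the rollout and shifts traffic back to stable;
      # this delay keeps the canary pods around briefly before scaling them down
      abortScaleDownDelaySeconds: 30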
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 60s
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5..",service="payment-service",version="canary"}[2m]))
            /
            sum(rate(http_requests_total{service="payment-service",version="canary"}[2m]))
      successCondition: result[0] < 0.01  # Less than 1% error rate
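
Triggers of the form "above a threshold for N minutes" share one shape: the condition must hold for every sample in a window, so a single noisy reading does not cause a rollback. A minimal sketch, assuming the metric is sampled at a fixed interval (the class name is hypothetical):

```python
from collections import deque


class SustainedThresholdTrigger:
    """Fire only when every sample in the window breaches the threshold."""

    def __init__(self, threshold: float, window_samples: int) -> None:
        self.threshold = threshold
        self.samples = deque(maxlen=window_samples)  # keeps the newest N samples

    def observe(self, value: float) -> bool:
        """Record one sample; return True if a rollback should start."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


# ">1% errors for 2 minutes", sampled every 30 seconds, is four samples.
error_rate_trigger = SustainedThresholdTrigger(threshold=0.01, window_samples=4)
```

A single healthy sample resets the streak, which trades a slightly slower reaction for far fewer false rollbacks.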

Rollback Strategies by Deployment Type

flowchart TD
    A[Problem Detected] --> B{Deployment Type?}
    B -->|Blue-Green| C[Switch traffic back to Blue]
    B -->|Canary| D[Halt promotion, route 100% to stable]
    B -->|Rolling| E[Reverse rolling update]
    B -->|Feature Flag| F[Disable flag instantly]
    C --> G[Instant Recovery]
    D --> G
    F --> G
    E --> H[Gradual Recovery]
                            
| Strategy | Rollback Speed | User Impact During Rollback | Complexity |
|---|---|---|---|
| Blue-Green | Instant (DNS/LB switch) | Zero (old environment still running) | Low |
| Canary | Fast (stop promotion) | Minimal (only canary % affected) | Medium |
| Rolling | Slow (reverse rollout) | Some (mixed versions during rollback) | Medium |
| Feature Flag | Instant (toggle) | Zero (code still deployed, feature hidden) | Low (but requires flag infrastructure) |
Key Insight: The fastest rollback is no rollback at all. Feature flags let you deploy code that is disabled by default, then enable it gradually. If something goes wrong, you flip the flag — no deployment needed. This is why progressive delivery with feature flags is increasingly preferred over traditional rollback strategies.
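
A feature flag with a gradual rollout can be sketched in a few lines. This is an in-memory illustration only: real flag systems store state in a shared service so that every instance sees a change at once, and the `FeatureFlag` class here is hypothetical.

```python
import hashlib


class FeatureFlag:
    def __init__(self, name: str, rollout_percent: int = 0) -> None:
        self.name = name
        self.rollout_percent = rollout_percent  # 0 disables the feature for everyone

    def is_enabled(self, user_id: str) -> bool:
        # Hash flag name + user ID so each user lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent

    def kill(self) -> None:
        """The 'rollback': turn the feature off without redeploying."""
        self.rollout_percent = 0


new_checkout = FeatureFlag("new-checkout", rollout_percent=10)
```

Because the bucket is derived from a hash, a user who saw the feature at 10% still sees it at 30%, and `kill()` removes it for everyone immediately.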

Incident Response

When things go wrong — and they will — the speed and quality of your response determines the blast radius. A well-rehearsed incident response process turns a potential catastrophe into a minor blip.

The Incident Lifecycle

  1. Detection — Alerts fire, users report, or automated systems detect anomalies. Faster detection = smaller blast radius.
  2. Triage — Assess severity. Is this a P1 (revenue-impacting, all-hands) or a P3 (minor degradation, fix tomorrow)?
  3. Mitigation — Stop the bleeding. Rollback, scale up, failover, disable feature flag. Mitigation before root cause analysis.
  4. Resolution — Fix the underlying issue. This may happen after mitigation has already restored service.
  5. Review — Blameless postmortem. What happened, why, and what can we improve?

On-Call Rotations & Escalation

On-call is the practice of having designated engineers available to respond to incidents outside business hours. Sustainable on-call requires:

  • Rotation schedules — Weekly rotations with primary and secondary on-call. No one is on-call for more than one week in four.
  • Escalation paths — If the primary cannot resolve within 15 minutes, escalate to the secondary. If both cannot resolve, escalate to the engineering manager and then to senior staff.
  • Compensation — On-call engineers should receive additional pay or time off. Uncompensated on-call leads to burnout and attrition.
  • Runbooks — Pre-written playbooks for common incidents. "If error X occurs, do steps Y and Z." Reduces cognitive load at 3 AM.
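
An escalation path is easiest to reason about when written down as data. In the sketch below, the 15-minute handoff to the secondary comes from the list above; the 30- and 45-minute thresholds for later tiers are illustrative assumptions, as are all the names.

```python
from typing import NamedTuple


class EscalationStep(NamedTuple):
    after_minutes: int  # minutes the incident has gone unresolved
    target: str


# Only the 15-minute step is from the text; later thresholds are assumptions.
ESCALATION_POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
    EscalationStep(45, "senior staff"),
]


def who_is_paged(minutes_unresolved: int) -> list[str]:
    """Everyone who should have been paged by now, in escalation order."""
    return [s.target for s in ESCALATION_POLICY if minutes_unresolved >= s.after_minutes]
```

Keeping the policy as a plain list makes it easy to review in a pull request and to mirror in whichever paging tool you use.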

The Incident Commander Role

For major incidents (P1/P2), an Incident Commander (IC) takes charge. The IC does not fix the problem — they coordinate the response:

  • Assigns roles (investigation, communication, mitigation)
  • Makes decisions when opinions conflict
  • Communicates status to stakeholders every 15-30 minutes
  • Decides when to escalate and when to stand down
  • Ensures someone is taking notes for the postmortem
Case Study

PagerDuty's Incident Response Framework

PagerDuty publishes their entire incident response process as open-source documentation. Their framework defines clear roles (Incident Commander, Scribe, Subject Matter Expert, Customer Liaison), communication templates, severity definitions, and escalation procedures. Their key principle: "If in doubt, page." It is always better to wake someone up unnecessarily than to let an incident grow because you were unsure whether to escalate. Companies including Dropbox, Shopify, and Stripe have adopted this framework.


Post-Incident Reviews (PIRs)

A Post-Incident Review (also called a postmortem or retrospective) is a structured analysis of what went wrong, why, and what to change. The goal is learning, not blame.

Blameless Postmortems

The cardinal rule of post-incident reviews: blameless, not nameless. We acknowledge who did what (because the details matter for understanding the sequence of events), but we do not punish people for making mistakes in complex systems. If an engineer can be blamed for an outage, the real failure is that the system allowed a single human error to cause an outage.

Anti-Pattern: "Engineer X made a typo in the config file and caused the outage" is blame. "The deployment pipeline did not validate the config file format, allowing a malformed config to reach production" is a systemic finding that leads to an actual fix.

The Five Whys Technique

Originally from Toyota's manufacturing process, the Five Whys technique digs past surface symptoms to root causes:

  1. Why did the site go down? — The database ran out of connections.
  2. Why did it run out of connections? — A new service was leaking connections (not closing them).
  3. Why was it leaking connections? — The developer did not use the connection pool correctly.
  4. Why did incorrect usage get deployed? — The code review did not catch the issue.
  5. Why did the code review miss it? — There is no automated check for connection pool usage patterns.

Root cause: Missing static analysis rule for connection pool patterns. Action item: Add a linter rule that flags raw connection usage without proper cleanup.

Postmortem Template

# Post-Incident Review Template
title: "Payment Processing Outage - 2026-05-10"
severity: P1
duration: 47 minutes (14:23 - 15:10 UTC)
impact: "~12,000 users unable to complete purchases. Estimated revenue loss: $43,000"

timeline:
  - "14:20 - Deploy payment-service v2.14.1 (canary 10%)"
  - "14:23 - Error rate alert fires (5xx rate > 5%)"
  - "14:25 - On-call engineer acknowledges alert"
  - "14:28 - IC declared, war room opened"
  - "14:35 - Root cause identified: Stripe SDK version incompatibility"
  - "14:38 - Rollback initiated"
  - "14:42 - Rollback complete, error rate normalising"
  - "15:10 - All metrics nominal, incident resolved"

root_cause: |
  Stripe SDK v4.2.0 introduced a breaking change in the payment intent API.
  Our upgrade from v4.1.x to v4.2.0 was not caught by integration tests
  because the test mock did not reflect the new API contract.

action_items:
  - "Add contract tests against live Stripe sandbox (owner: @alice, due: May 17)"
  - "Implement canary analysis with automatic rollback on >1% error rate (owner: @bob, due: May 20)"
  - "Update Stripe SDK upgrade runbook with breaking change checklist (owner: @carol, due: May 15)"

lessons_learned:
  - "Mocked external APIs can hide breaking changes - need contract tests"
  - "Canary analysis was manual - automated rollback would have limited user impact to <2 minutes"
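
A timeline like the one in this template also yields the standard response metrics. The sketch below copies the key timestamps from the example and computes time to detect, acknowledge, and mitigate; the dictionary keys and helper name are illustrative, and all times are assumed to fall on the same UTC day.

```python
from datetime import datetime, timedelta

# Key timestamps copied from the example timeline (all UTC, same day).
TIMELINE = {
    "deployed": "14:20",
    "detected": "14:23",      # alert fired
    "acknowledged": "14:25",
    "mitigated": "14:42",     # rollback complete
    "resolved": "15:10",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta / timedelta(minutes=1))


time_to_detect = minutes_between(TIMELINE["deployed"], TIMELINE["detected"])
time_to_acknowledge = minutes_between(TIMELINE["detected"], TIMELINE["acknowledged"])
time_to_mitigate = minutes_between(TIMELINE["detected"], TIMELINE["mitigated"])
incident_duration = minutes_between(TIMELINE["detected"], TIMELINE["resolved"])
```

Here detection took 3 minutes and mitigation 19, so the action items rightly target the gap between detection and rollback.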

Alerting Strategy

Alerts are the bridge between observability and action. A good alerting strategy wakes you up for real problems and lets you sleep through noise. A bad strategy causes alert fatigue — engineers start ignoring alerts, and real incidents go unnoticed.

Alert on Symptoms, Not Causes

The golden rule of alerting:

  • Symptom alert (good): "API error rate is above 1% for the last 5 minutes" — this directly affects users.
  • Cause alert (bad): "CPU usage is at 85%" — this might not affect users at all. Many services run fine at 85% CPU.

Alert on what users experience. Investigate causes after you are alerted to symptoms.

Reducing Alert Fatigue

  • Every alert must be actionable — if there is nothing an engineer can do about it, remove the alert.
  • Tune thresholds — if an alert fires 10 times but only 1 was a real problem, the threshold is too sensitive.
  • Group related alerts — 50 alerts from 50 pods failing for the same reason should be 1 grouped alert.
  • Severity levels — P1 pages you immediately. P3 goes to a ticket queue for business hours.
  • Review alert volume weekly — if on-call engineers are paged more than twice per shift, there is a problem.
# Prometheus alerting rule: symptom-based
groups:
  - name: payment-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5..",service="payment-service"}[5m]))
          /
          sum(rate(http_requests_total{service="payment-service"}[5m]))
          > 0.01
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payment service error rate above 1%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/payment-errors"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) by (le)
          ) > 2.0
        for: 3m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Payment service p99 latency above 2 seconds"
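
A fixed threshold such as the 1% in `HighErrorRate` can also be read as a burn rate: how many times faster than "exactly on budget" the error budget is being spent. The sketch below assumes a 99.9% SLO over a 30-day window, matching the earlier SLO example; the function names are illustrative.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_rate / budget


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Time to spend a full window's budget if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_days / rate
```

With a 99.9% SLO, a sustained 1% error rate is a burn rate of about 10, which would consume a 30-day budget in roughly 3 days. Framing thresholds this way ties alert severity to how quickly the SLO is at risk.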

Tools Landscape

The observability and reliability tooling ecosystem is vast. Here is a brief comparison of major players:

| Tool | Category | Strengths | Best For |
|---|---|---|---|
| Prometheus + Grafana | Metrics + Visualisation | Open-source, PromQL power, massive ecosystem | Kubernetes-native environments |
| Datadog | Full-stack observability | All three pillars unified, easy setup, APM | Teams wanting one platform |
| New Relic | APM + Full-stack | Strong APM, free tier, NRQL query language | Application performance focus |
| Honeycomb | Observability (trace-first) | High-cardinality queries, BubbleUp analysis | Debugging complex distributed systems |
| Jaeger | Distributed Tracing | Open-source, CNCF project, trace visualisation | Tracing in Kubernetes |
| OpenTelemetry | Instrumentation standard | Vendor-neutral, covers all three pillars | Avoiding vendor lock-in |
| PagerDuty / Opsgenie | Incident Management | Alerting, on-call scheduling, escalation | On-call management |
Recommendation: Start with OpenTelemetry for instrumentation (it is vendor-neutral and becoming the standard), then choose backends based on your budget and scale. Prometheus + Grafana for metrics, Loki for logs, and Jaeger for traces is a powerful open-source stack. Datadog or New Relic if you want everything in one paid platform.

Exercises

Exercise 1

Define SLIs and SLOs for a Service

Choose a service you work with (or imagine an e-commerce checkout service). Define 3 SLIs (availability, latency, error rate), set SLO targets for each, and calculate the monthly error budget. Document what happens when the budget is exhausted — what freezes, who decides when to resume releases?

Exercise 2

Instrument a Service with OpenTelemetry

Take a simple HTTP service (Node.js, Python, or Go) and add OpenTelemetry instrumentation. Configure it to export traces to a local Jaeger instance; Jaeger stores traces only, so if you also emit metrics, send them to a metrics backend such as Prometheus. Verify you can see request traces with latency breakdowns for each middleware and handler.

Exercise 3

Design an Automated Rollback Pipeline

Using Argo Rollouts (or describe pseudocode), design a canary deployment with automatic rollback. Define: the canary weight progression (10% → 30% → 60% → 100%), the analysis query (Prometheus or Datadog), the failure threshold, and what happens when analysis fails at the 30% step.

Exercise 4

Write a Blameless Postmortem

Imagine this scenario: A configuration change was deployed that accidentally disabled rate limiting. A traffic spike overwhelmed the database, causing a 23-minute outage. Write a complete postmortem using the template provided: timeline, root cause, Five Whys analysis, action items with owners and due dates.

Conclusion & Next Steps

Reliability is not a phase you reach — it is a discipline you practice. By defining clear SLIs and SLOs, you make reliability measurable. By implementing error budgets, you make the reliability-versus-velocity tradeoff explicit and data-driven. By building observability into your CI/CD pipelines, you create a system that can detect, diagnose, and recover from failures automatically.

The three pillars of observability — logs, metrics, and traces — give you different lenses on the same truth. Combined with automated rollbacks, a clear incident response process, and blameless postmortems, you create a delivery system that learns from every failure and gets more reliable over time.

In Part 27: Monolith vs Microservices Delivery Architecture, we will explore how your architectural choices — monolith, modular monolith, or microservices — fundamentally shape your delivery pipelines, team structure, and the observability challenges you face.