
Part 5: Failure & Resilience

May 14, 2026 Wasil Zafar 30 min read

In distributed systems, failure isn't an exception — it's the norm. Understanding failure modes and building self-healing resilience is what separates toy systems from production infrastructure.

Table of Contents

  1. The Philosophy of Failure
  2. Types of Failures
  3. Resilience Patterns
  4. Self-Healing Systems
  5. Chaos Engineering
  6. Real-World Failure Case Studies
  7. Exercises
  8. Conclusion

The Philosophy of Failure

Failure is Inevitable

In a single-machine system, failure is binary: the machine works or it doesn't. In distributed systems, failure is partial, unpredictable, and continuous. At any moment, some components are failing while others continue operating normally.

The Fundamental Shift: Don't design systems to prevent failure; design systems to function despite failure. If each of 10,000 components is independently available 99.99% of the time, then on average one component is down at any given moment. The question isn't "will something fail?" but "when something fails, what happens next?"
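
The arithmetic behind that claim can be checked with a short sketch. It assumes failures are independent and that every component has the same availability, which real fleets only approximate; the function names are illustrative.

```python
def expected_down(components: int, availability: float) -> float:
    """Average number of components that are down at any instant."""
    return components * (1.0 - availability)


def prob_all_up(components: int, availability: float) -> float:
    """Probability that every component is up at the same instant
    (assumes independent failures)."""
    return availability ** components


if __name__ == "__main__":
    n, a = 10_000, 0.9999
    print(f"expected components down: {expected_down(n, a):.2f}")
    print(f"probability all are up:   {prob_all_up(n, a):.1%}")
```

Under these assumptions roughly one component is down on average, and the chance that all 10,000 are healthy at the same instant is only about 37%.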

Consider the scale: as rough, illustrative figures, a large Kubernetes cluster with 5,000 nodes and 100,000 pods might experience:

  • ~2 node failures per day (hardware, kernel panics, OOM kills)
  • ~500 pod restarts per day (crashes, health check failures)
  • ~50 network connectivity blips per day (packet drops, timeout spikes)
  • ~5 disk failures per month

Google's internal data shows that in clusters with thousands of machines, something is always broken. The system must be designed so users never notice.

The Failure Spectrum

Not all failures are equal. They range from transient glitches to catastrophic data loss:

Failure Severity Spectrum
flowchart LR
    A[Transient<br/>Network blip<br/>1-100ms] --> B[Intermittent<br/>Flapping service<br/>Seconds to minutes]
    B --> C[Partial<br/>Node failure<br/>Minutes to hours]
    C --> D[Total<br/>Datacenter outage<br/>Hours to days]
    D --> E[Catastrophic<br/>Data corruption<br/>Permanent]

| Failure Type | Duration | Detection | Recovery Strategy |
| --- | --- | --- | --- |
| Transient | Milliseconds | Timeout triggers | Retry with backoff |
| Intermittent | Seconds–minutes | Health check failures | Circuit breaker + failover |
| Partial (node) | Minutes–hours | Heartbeat loss | Reschedule workloads |
| Total (zone/region) | Hours–days | External monitoring | Multi-region failover |
| Catastrophic (data) | Permanent | Consistency checks | Restore from backup |
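
The "retry with backoff" strategy for transient failures can be sketched as follows. This is a minimal illustration, not a library API: `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the "full jitter" variant (a random delay up to the exponential ceiling) is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one 'full jitter' delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(retries):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation, retries=4, sleep=time.sleep, **backoff):
    """Run operation(); on TransientError wait and retry, up to `retries` retries."""
    delays = backoff_delays(retries, **backoff)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the failure
            sleep(delay)


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky():
        # Fails twice, then succeeds, like a short network blip.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TransientError("network blip")
        return "ok"

    print(call_with_retry(flaky), "after", attempts["n"], "attempts")
```

The jitter matters as much as the exponent: without it, clients that failed together retry together, which recreates the load spike that caused the failure.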

Types of Failures

Node Failures

A node failure means an entire machine becomes unresponsive. In Kubernetes, this means all pods on that node stop serving traffic. The causes are varied: hardware failure, kernel panic, out-of-memory kill, disk full, or power loss.

# Rehearse losing a node in Kubernetes.
# Note: cordon + drain is a graceful, planned evacuation, not a crash.
# Watch current pod distribution across nodes
kubectl get pods -o wide --all-namespaces | head -20

# Cordon a node (prevent new pods from being scheduled)
kubectl cordon worker-node-3

# Drain the node (evict all pods gracefully)
kubectl drain worker-node-3 --ignore-daemonsets --delete-emptydir-data

# Observe pods being rescheduled to healthy nodes:
kubectl get pods -o wide -w
# NAME                        READY   STATUS    NODE
# payment-7d8f9c-abc12       1/1     Running   worker-node-1  ← rescheduled!
# inventory-5b6c7d-def34     1/1     Running   worker-node-2  ← rescheduled!

# Return the node to service when finished:
kubectl uncordon worker-node-3

# To simulate an unresponsive node (for testing):
# SSH into the node and stop kubelet:
# systemctl stop kubelet
# After node-monitor-grace-period (about 40-50 seconds, depending on the
# Kubernetes version), the node is marked NotReady and tainted as unreachable.
# Pods tolerate that taint for 300s by default (tolerationSeconds), after
# which they are evicted and replaced on healthy nodes.
# (Older releases configured this 5-minute wait with --pod-eviction-timeout.)
The Detection Problem: When a node stops responding, the control plane can't immediately distinguish between "node crashed" and "network cable disconnected." If it reschedules pods too quickly, you risk duplicate instances (split-brain). If it waits too long, you lose availability. Kubernetes uses configurable timeouts to balance this trade-off.
Node Failure Detection Timeline
sequenceDiagram
    participant K as kubelet (Worker)
    participant A as API Server
    participant NC as Node Controller
    participant RS as ReplicaSet Controller
    participant S as Scheduler

    K->>A: Heartbeat (every 10s)
    K->>A: Heartbeat (every 10s)
    Note over K: Node crashes!
    Note over A: No heartbeat received...
    A-->>NC: Grace period (~40s) exceeded
    NC->>A: Mark node "NotReady", taint "unreachable"
    Note over NC: Pods tolerate the taint for 5m by default
    NC->>A: Evict (delete) pods on failed node
    A-->>RS: Pod count below desired
    RS->>A: Create replacement pods
    S->>A: Bind new pods to healthy nodes
    Note over S: Pods rescheduled successfully

Network Partitions

A network partition splits a cluster into two or more groups that cannot communicate with each other — but each group is still internally functional. This is the most insidious failure because each side may believe it is the healthy majority.

# A network partition scenario in a 5-node cluster:
#
# Normal state: all nodes communicate freely
# [Node-1] ↔ [Node-2] ↔ [Node-3] ↔ [Node-4] ↔ [Node-5]
#
# After partition (e.g., switch failure between racks):
# Partition A: [Node-1] ↔ [Node-2] ↔ [Node-3]    (3 nodes — majority)
# Partition B: [Node-4] ↔ [Node-5]                 (2 nodes — minority)
#
# What happens in Kubernetes (assuming each of the 5 nodes runs an etcd member):
# - etcd requires quorum (3 of 5 members) → Partition A keeps quorum
# - API Server on Partition A continues operating normally
# - Partition B nodes are marked NotReady after timeout
# - Pods on Partition B nodes continue running locally (still serving)
# - But no new scheduling decisions for Partition B
# - When partition heals, nodes rejoin and reconcile state

# Real-world causes of partitions:
# 1. Network switch/router failure
# 2. Firewall rule changes
# 3. Cable damage
# 4. Cloud provider availability zone isolation
# 5. DNS failure (logical partition)
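Why Quorum Matters: In a 5-node etcd cluster, quorum requires 3 nodes (a majority). Two disjoint groups cannot both hold a majority, so during a partition at most one side can make progress. The minority side cannot commit writes (at best it can serve possibly stale reads), which prevents conflicting updates. Production clusters use odd numbers of control plane nodes (3, 5, 7) because an even-sized cluster tolerates no more failures than the odd-sized cluster one member smaller.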
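
The quorum rule is plain majority arithmetic, sketched below. The function names are illustrative and are not part of etcd; the sketch only shows why one side of a partition can proceed and why even cluster sizes add no fault tolerance.

```python
from typing import Optional, Sequence


def quorum(members: int) -> int:
    """Smallest strict majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


def side_with_quorum(members: int, partition_sizes: Sequence[int]) -> Optional[int]:
    """Index of the partition that can still commit writes, or None."""
    need = quorum(members)
    for index, size in enumerate(partition_sizes):
        if size >= need:
            return index
    return None


if __name__ == "__main__":
    for n in range(3, 8):
        print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
    # The 3/2 split from the scenario above: only the first side keeps quorum.
    print(side_with_quorum(5, [3, 2]))
```

Running it shows that 4 members tolerate one failure, the same as 3, and that an even 2/2 split of a 4-member cluster leaves neither side with quorum.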

Cascading Failures

A cascading failure occurs when the failure of one component triggers failures in dependent components, creating a domino effect that can bring down an entire system. These are the most dangerous failures because they amplify small problems into catastrophic outages.

Cascading Failure Domino Effect
flowchart TD
    A[Database node fails] --> B[Connection pool exhausted]
    B --> C[API requests queue up]
    C --> D[API response times spike to 30s]
    D --> E[Upstream services timeout]
    E --> F[Retry storms multiply load 10x]
    F --> G[Remaining database nodes overloaded]
    G --> H[ALL database nodes fail]
    H --> I[Complete system outage]
    
    style A fill:#BF092F,color:#fff
    style I fill:#132440,color:#fff
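
One way to break the chain between "upstream services timeout" and "retry storms" is a circuit breaker that fails fast once a dependency looks unhealthy. The sketch below is a simplified, single-threaded illustration; the class name, thresholds, and half-open behaviour are assumptions made for the example rather than any particular library's API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Timeout elapsed: half-open, so this call goes through as a trial.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the circuit is open, callers get an immediate error instead of holding a connection for the full timeout, so the struggling dependency sees less load rather than more.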
                            
Case Study: Cascading Failure Pattern
The Thundering Herd

A common cascading pattern: a cache server restarts, and all cached data is lost. Suddenly, thousands of requests that were served from cache now hit the database directly. The database, sized for cache-miss traffic (maybe 5% of total), receives 20x its expected load and crashes.

Timeline: Cache restart → 0s: cache miss rate goes from 5% to 100% → 2s: database connection pool saturated → 5s: database CPU at 100% → 10s: database OOM kill → 15s: all services returning 500 errors.

Prevention: Cache warming (pre-populate before routing traffic), request coalescing (single database query for duplicate concurrent requests), circuit breakers (fail fast instead of queuing), gradual traffic restoration.

Tags: Thundering Herd, Cache Stampede, Load Amplification
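
Request coalescing, one of the prevention techniques listed above, can be sketched with a small "single flight" helper: concurrent requests for the same key share one database query. The class and method names are hypothetical, and the sketch leaves out timeouts and result caching.

```python
import threading


class _Call:
    """Shared state for one in-flight load."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent loads of the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call shared by all waiters

    def do(self, key, loader):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = self._inflight[key] = _Call()
        if leader:
            try:
                call.result = loader()  # only the leader hits the backend
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every follower
        else:
            call.done.wait()  # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

After a cache restart, a thousand simultaneous misses for the same key then produce one query instead of a thousand; a real deployment would typically combine this with cache warming.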
# Cascading failure prevention in Kubernetes:

# 1. Pod Disruption Budgets — prevent too many pods dying at once
# This ensures at least 2 replicas of the payment service are always running:
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service
EOF

# 2. Resource limits — prevent one pod from starving others
# Without limits, a memory leak in one pod can exhaust node memory and get neighbouring pods OOM-killed or evicted:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: payment-pod
spec:
  containers:
  - name: payment
    image: payment:v2.1
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
EOF

# 3. Priority classes — critical workloads survive resource pressure
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "For production-critical services that should be the last to be preempted or evicted"
EOF

Byzantine Failures

Named after the Byzantine Generals Problem, these are failures where a component doesn't simply stop — it provides incorrect information. A node might report healthy while serving stale data, or a network device might corrupt packets silently.

| Failure Type | Behaviour | Detection | Example |
| --- | --- | --- | --- |
| Crash failure | Node stops completely | Easy (heartbeat timeout) | Server power loss |
| Omission failure | Node drops some messages | Moderate (sequence numbers) | Overloaded network buffer |
| Timing failure | Responds too late | Easy (deadline exceeded) | GC pause, disk I/O stall |
| Byzantine failure | Responds incorrectly | Very hard (checksums, consensus) | Bit-flip in RAM, compromised node |

Practical Insight: Most cloud-native systems (including Kubernetes) assume crash-failure models — nodes either work correctly or stop. Byzantine fault tolerance (BFT) is expensive (3f+1 replicas to tolerate f failures vs 2f+1 for crash tolerance) and is typically only used in blockchain systems or safety-critical aerospace/medical systems.

Resilience Patterns

Redundancy

Redundancy is the most fundamental resilience pattern: maintain multiple copies so that losing one doesn't mean losing the service. But redundancy has costs and complexities:

| Redundancy Level | What It Protects Against | Cost Multiplier | Example |
| --- | --- | --- | --- |
| Process redundancy | Single process crash | 2–3x compute | 3 pod replicas |
| Node redundancy | Node hardware failure | 3–5x nodes | Anti-affinity rules |
| Zone redundancy | Availability zone outage | 3x infrastructure | Topology spread constraints |
| Region redundancy | Regional disaster | 2–3x everything | Multi-region active-active |

# Kubernetes: Spread replicas across zones for zone-level redundancy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      # Anti-affinity (soft rule): prefer not to schedule two payment pods on the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - payment
              topologyKey: kubernetes.io/hostname
      # Topology spread: distribute evenly across availability zones
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: payment
      containers:
      - name: payment
        image: payment:v2.1
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"

Failover Strategies

Failover is the process of switching from a failed component to a healthy backup. The speed and correctness of failover determine how much downtime users experience.

Active-Passive vs Active-Active Failover
flowchart TD
    subgraph Active-Passive
        L1[Load Balancer] --> P1[Primary<br/>Handles all traffic]
        L1 -.->|failover| S1[Standby<br/>Idle, ready]
        P1 -->|replicate| S1
    end
    subgraph Active-Active
        L2[Load Balancer] --> A1[Instance A<br/>50% traffic]
        L2 --> A2[Instance B<br/>50% traffic]
        A1 <-->|sync state| A2
    end

| Strategy | Failover Time | Data Loss Risk | Cost | Use Case |
| --- | --- | --- | --- | --- |
| Cold standby | Minutes–hours | High (last backup) | Low (idle hardware) | Non-critical, cost-sensitive |
| Warm standby | Seconds–minutes | Moderate (async replication lag) | Medium | Business applications |
| Hot standby (active-passive) | Seconds | Low (sync replication) | High (full duplicate) | Databases, stateful services |
| Active-active | Near zero (no failover step) | Low (depends on replication and conflict resolution) | Highest (conflict resolution) | Global services, CDNs |

# Kubernetes automatic failover in action:

# A Deployment with 3 replicas across nodes:
kubectl get pods -o wide
# NAME                       READY   NODE
# api-server-6f7g8h-abc12   1/1     worker-1
# api-server-6f7g8h-def34   1/1     worker-2
# api-server-6f7g8h-ghi56   1/1     worker-3

# Worker-2 loses network connectivity:
# 1. kubelet heartbeats stop reaching the API server
# 2. After ~40s: node marked NotReady; its pods are marked not ready and
#    removed from Service endpoints
# 3. After 5m (default toleration): pods on the node are evicted
# 4. ReplicaSet controller sees 2/3 replicas and creates a replacement
# 5. Scheduler places the new pod on worker-1 or worker-3:

kubectl get pods -o wide
# NAME                       READY   NODE
# api-server-6f7g8h-abc12   1/1     worker-1
# api-server-6f7g8h-ghi56   1/1     worker-3
# api-server-6f7g8h-jkl78   1/1     worker-1  ← NEW replacement pod

# User impact: requests routed to the worker-2 pod can fail until step 2
# completes (~40s) unless clients time out and retry. After that, the
# Service routes only to the remaining healthy pods.

Graceful Degradation

Instead of failing completely, gracefully degraded systems shed load and reduce functionality to maintain core operations. The user gets a worse experience, but they get something rather than a blank error page.

Real-World Pattern: Netflix Graceful Degradation
Netflix's Degradation Layers

When Netflix's recommendation engine fails, users don't see an error — they see a generic "Popular on Netflix" row instead of personalised recommendations. The experience degrades gracefully:

  • Level 0 (healthy): Personalised recommendations + personalised artwork + personalised row ordering
  • Level 1 (partial): Generic recommendations + personalised artwork
  • Level 2 (degraded): "Popular in your country" + generic artwork
  • Level 3 (minimal): Static cached homepage served from CDN

At each level, users can still browse and watch content. The personalisation quality drops, but the core function (streaming video) remains available.

Tags: Graceful Degradation, Feature Flagging, Fallback
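
In application code, degradation layers like these often take the form of an ordered fallback chain. The sketch below is illustrative only: the provider functions are hypothetical stand-ins, and it says nothing about how Netflix actually implements its fallbacks.

```python
def first_available(providers):
    """Return (level, rows) from the first provider that succeeds.

    providers: zero-argument callables ordered best-first; the last one
    should be a static fallback that is very unlikely to fail.
    """
    last_error = None
    for level, provider in enumerate(providers):
        try:
            return level, provider()
        except Exception as exc:  # a real service would catch narrower errors
            last_error = exc
    raise RuntimeError("every degradation level failed") from last_error


def personalised_rows():
    raise TimeoutError("recommendation engine unavailable")  # simulated outage


def popular_in_country():
    return ["Popular in your country"]


def static_homepage():
    return ["Cached homepage"]


if __name__ == "__main__":
    level, rows = first_available([personalised_rows, popular_in_country, static_homepage])
    print(f"served at degradation level {level}: {rows}")
```

Returning the level alongside the data lets the caller emit a metric, so operators can see how often users are being served a degraded experience.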
# Graceful degradation patterns in Kubernetes:

# 1. Liveness vs Readiness probes — remove unhealthy pods from traffic
#    without killing them (gives time to recover)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: recommendation-service
spec:
  containers:
  - name: recommender
    image: recommender:v3.2
    # Readiness: controls traffic routing
    # If this fails, pod is removed from Service endpoints
    # but keeps running (can recover)
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
    # Liveness: controls pod lifecycle
    # If this fails, pod is restarted
    livenessProbe:
      httpGet:
        path: /healthz/live
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
      failureThreshold: 5
    # Startup: gives slow-starting apps time to initialise
    startupProbe:
      httpGet:
        path: /healthz/startup
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
EOF

# 2. Pod Priority — evict non-critical pods before critical ones
# When the cluster runs out of resources, low-priority batch jobs
# are evicted first to protect user-facing services

Bulkheads & Isolation

Named after ship bulkheads (watertight compartments that prevent a hull breach from sinking the entire ship), this pattern isolates components so that a failure in one doesn't propagate to others.

Bulkhead Isolation Pattern
flowchart LR
    subgraph No Bulkheads
        A1[All Services] --> P1[Shared Thread Pool<br/>100 threads]
        P1 --> DB1[(Shared Database)]
    end
    subgraph With Bulkheads
        B1[Payment Service] --> P2[Payment Pool<br/>30 threads]
        B2[Inventory Service] --> P3[Inventory Pool<br/>30 threads]
        B3[Search Service] --> P4[Search Pool<br/>40 threads]
        P2 --> DB2[(Payment DB)]
        P3 --> DB3[(Inventory DB)]
        P4 --> DB4[(Search DB)]
    end
# Kubernetes namespace-based bulkheads:
# Each team's workloads are isolated with resource quotas

# ResourceQuota for the payments namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    # Prevent payments team from consuming all cluster resources
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"
    services: "20"
---
# LimitRange sets per-pod defaults and maximums
apiVersion: v1
kind: LimitRange
metadata:
  name: payments-limits
  namespace: payments
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    type: Container
Bulkhead Levels in Kubernetes: Isolation can be applied at multiple levels — namespace (resource quotas), node (taints/tolerations to dedicate nodes), network (network policies restrict traffic), and even cluster (separate clusters for different environments or tenants). Each level adds isolation at the cost of operational complexity.
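
The same idea applies inside a single process. The sketch below caps concurrent calls per dependency with a semaphore, using the 30/30/40 split from the thread pool diagram; the class and exception names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Cap concurrent calls for one dependency so it cannot use every worker."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Non-blocking acquire: reject immediately instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()


# One compartment per downstream dependency.
BULKHEADS = {
    "payment": Bulkhead("payment", 30),
    "inventory": Bulkhead("inventory", 30),
    "search": Bulkhead("search", 40),
}
```

If the search backend hangs, at most 40 callers are stuck waiting on it; payment and inventory calls still find free slots in their own compartments.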

Self-Healing Systems

Health Checks & Probes

Self-healing starts with detection. A system cannot fix what it cannot observe. Health checks are the sensory nervous system of distributed infrastructure:

| Probe Type | Question It Answers | Failure Action | Use Case |
| --- | --- | --- | --- |
| Liveness | "Is the process alive?" | Restart container | Deadlocked threads, stuck processes |
| Readiness | "Can it handle traffic?" | Remove from Service endpoints | Still warming cache, loading config |
| Startup | "Has it finished starting?" | Wait (don't check liveness yet) | Slow-starting legacy apps |

# Common health check anti-patterns to avoid:

# BAD: Liveness probe that depends on external services
# If the database is down, this probe fails, pod restarts,
# but restarting won't fix the database — causes restart loops!
# livenessProbe:
#   httpGet:
#     path: /healthz  ← checks database connectivity
#   failureThreshold: 3

# GOOD: Liveness only checks internal process health
# livenessProbe:
#   httpGet:
#     path: /healthz/live  ← only checks "is the process responding?"
#   failureThreshold: 5

# GOOD: Readiness checks external dependencies
# readinessProbe:
#   httpGet:
#     path: /healthz/ready  ← checks DB, cache, etc.
#   failureThreshold: 3

# Key principle:
# Liveness = "Should this container be restarted?"
# Readiness = "Should this container receive traffic?"
# Never make liveness depend on things a restart won't fix!
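
On the application side, the split between the two endpoints might look like the sketch below. It uses only the Python standard library; `dependencies_ok` is a hypothetical placeholder for real database and cache checks, and port 8080 simply matches the probe examples earlier in this part.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_ok() -> bool:
    """Hypothetical check of the database, cache, etc. Replace with real pings."""
    return True


def probe_status(path: str, deps_ok=dependencies_ok) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz/live":
        return 200  # the process answered, so it is alive; no external calls
    if path == "/healthz/ready":
        return 200 if deps_ok() else 503  # not ready: stop routing traffic here
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ProbeHandler).serve_forever()
```

With this layout a database outage turns readiness to 503 while liveness stays at 200, so the pod is taken out of rotation but not restarted.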

Reconciliation Loops

The core self-healing mechanism in Kubernetes: controllers continuously compare desired state with actual state and take corrective action. This is the control loop pattern — the same principle used in thermostats, cruise control, and autopilots.

Kubernetes Reconciliation Loop
flowchart TD
    A[Observe Current State] --> B{Desired == Actual?}
    B -->|Yes| C[Sleep / Wait for Change]
    C --> A
    B -->|No| D[Calculate Difference]
    D --> E[Take Corrective Action]
    E --> A
    
    style B fill:#3B9797,color:#fff
    style D fill:#BF092F,color:#fff
                            
# Example: ReplicaSet controller reconciliation
# You declare: "I want 3 replicas of my payment service"

kubectl scale deployment payment --replicas=3

# The Deployment passes the replica count to its ReplicaSet, whose
# controller runs the reconciliation loop:
# 1. OBSERVE: Query API server → current pod count = 3 → matches desired → wait
# 2. EVENT: Pod is lost (deleted, evicted, or its node fails) → count drops to 2
# 3. OBSERVE: Watch event arrives → current = 2, desired = 3 → MISMATCH
# 4. DIFF: Need 1 more pod
# 5. ACT: Create a new Pod object via the API server; the scheduler assigns it a node
# 6. OBSERVE: current = 3 → matches desired → wait

# The loop is event-driven (API server watches) with a periodic resync as a safety net
# It's the same loop for ALL Kubernetes controllers:
# - ReplicaSet controller → maintains pod count
# - Deployment controller → manages rollouts
# - Node controller → marks unhealthy nodes
# - Job controller → ensures jobs run to completion
# - Endpoint controller → updates Service endpoints
# - Namespace controller → cleans up deleted namespaces
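
The observe, diff, act cycle can be reduced to a few lines. This is a toy model, not Kubernetes controller code: the pod list stands in for the API server, and `create_pod` and `delete_pod` are hypothetical callbacks.

```python
def reconcile(desired_replicas, running_pods, create_pod, delete_pod):
    """One pass of a control loop: compare desired and actual, then act.

    Returns the number of corrective actions taken (0 means already converged).
    """
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        for _ in range(diff):
            create_pod()
    elif diff < 0:
        # Iterate over a copy, since delete_pod may mutate the list.
        for pod in list(running_pods)[:-diff]:
            delete_pod(pod)
    return abs(diff)


if __name__ == "__main__":
    pods = ["payment-a", "payment-b", "payment-c"]
    counter = iter(range(1000))

    def create():
        pods.append(f"payment-new-{next(counter)}")

    pods.remove("payment-b")                        # a pod is lost
    actions = reconcile(3, pods, create, pods.remove)
    print(actions, pods)                            # one pod created
    print(reconcile(3, pods, create, pods.remove))  # converged: no actions
```

The function is idempotent: calling it again once the state has converged does nothing, which is what makes it safe to run on every event and on every periodic resync.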

Automatic Recovery

Self-healing goes beyond simple restarts. A truly self-healing system detects degradation, diagnoses root causes, and applies multi-level recovery:

Architecture Pattern: Multi-Level Auto-Recovery
Recovery Escalation Ladder

Production systems implement escalating recovery strategies:

  1. Level 1 (application restart): Container restarts via liveness probe failure (handles memory leaks, deadlocks)
  2. Level 2 (pod replacement): Controller creates fresh pod on same/different node (handles corrupted local state)
  3. Level 3 (node drain): Workloads evacuated from problematic node (handles node-level issues)
  4. Level 4 (node replacement): Node auto-repair or the cloud provider's instance group replaces the unhealthy node, and the old node is terminated (handles hardware failures)
  5. Level 5 (zone failover): Traffic shifted to other availability zones (handles zone-level outages)

Each level is triggered only if lower levels fail to resolve the issue. This prevents over-reaction to transient problems.

Tags: Escalation, Progressive Recovery, Auto-Remediation
# Kubernetes restart policies and backoff:
apiVersion: v1
kind: Pod
metadata:
  name: resilient-app
spec:
  # restartPolicy: Always is the Pod default (and the only value a Deployment allows)
  # Kubernetes uses exponential backoff for restarts:
  # 1st crash: restart immediately
  # 2nd crash: wait 10s
  # 3rd crash: wait 20s
  # 4th crash: wait 40s
  # ... up to 5 minutes maximum
  # After 10 minutes of stability, backoff resets
  restartPolicy: Always
  containers:
  - name: app
    image: my-app:v1.0
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "256Mi"
        cpu: "200m"
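
The restart schedule described in the comments above can be modelled in a few lines. This only reproduces the documented numbers (immediate first restart, then 10s doubling to a 5-minute cap); it is not the kubelet's implementation, and the function name is illustrative.

```python
def restart_delays(crashes, base=10, cap=300):
    """Delay in seconds before each restart for a container that keeps crashing."""
    delays = []
    for crash in range(1, crashes + 1):
        if crash == 1:
            delays.append(0)  # first crash: restart immediately
        else:
            delays.append(min(cap, base * 2 ** (crash - 2)))
    return delays


if __name__ == "__main__":
    print(restart_delays(8))  # [0, 10, 20, 40, 80, 160, 300, 300]
```

The cap is reached on the seventh crash, which is why a pod stuck in CrashLoopBackOff settles into one restart attempt roughly every five minutes.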

Chaos Engineering

Core Principles

Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. Rather than waiting for failures to happen, you deliberately inject failures to discover weaknesses before they cause outages.

Netflix's Philosophy: "The best way to avoid failure is to fail constantly." Netflix pioneered chaos engineering with Chaos Monkey, which randomly terminates production instances. If your system can't survive a random instance death, you'll discover that in a controlled experiment — not at 3 AM during peak traffic.

The scientific method of chaos engineering:

  1. Define steady state — what does "normal" look like? (e.g., p99 latency < 200ms, error rate < 0.1%)
  2. Hypothesise — "The system will continue in steady state when X fails"
  3. Design experiment — inject a specific failure (kill a pod, add network latency, fill a disk)
  4. Run experiment — execute in production (or pre-production with realistic traffic)
  5. Observe — did the system maintain steady state? What degraded?
  6. Fix — address any weaknesses discovered
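
The steps above can be sketched as a small experiment harness. Everything here is hypothetical scaffolding: `measure`, `inject`, and `rollback` are callbacks you would wire to your own metrics and fault-injection tooling, and the thresholds come from the example in step 1.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Thresholds that define 'normal' for the experiment."""
    max_p99_latency_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%

    def holds(self, p99_latency_ms: float, error_rate: float) -> bool:
        return (p99_latency_ms < self.max_p99_latency_ms
                and error_rate < self.max_error_rate)


def run_experiment(steady_state, measure, inject, rollback):
    """Verify the baseline, inject the fault, observe, and always roll back.

    measure() must return a (p99_latency_ms, error_rate) tuple.
    """
    if not steady_state.holds(*measure()):
        return "aborted: system not in steady state before injection"
    inject()
    try:
        held = steady_state.holds(*measure())
    finally:
        rollback()  # runs even if measuring raises
    return "hypothesis held" if held else "weakness found"
```

Checking the baseline before injecting anything matters: a fault introduced into a system that is already degraded proves nothing and may turn an incident into an outage.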

Tools & Practice

| Tool | Scope | Failure Types | Platform |
| --- | --- | --- | --- |
| Chaos Monkey | Instance termination | Random pod/VM kills | Netflix/Any cloud |
| Litmus Chaos | Kubernetes-native | Pod, node, network, IO faults | Kubernetes (CNCF) |
| Chaos Mesh | Kubernetes-native | Pod, network, IO, time, JVM | Kubernetes (CNCF) |
| Gremlin | Enterprise | CPU, memory, network, process | Any (SaaS platform) |
| AWS FIS | AWS services | EC2, ECS, RDS, AZ failures | AWS |

# Chaos Mesh: Inject network latency into a Kubernetes service
# This adds about 100ms of latency (with 20ms jitter) to the network traffic of the payment-service pods
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency-test
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    jitter: "20ms"
    correlation: "75"
  duration: "5m"
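  # To repeat this daily, wrap the spec in a Chaos Mesh Schedule resource;
  # the inline scheduler/cron field from Chaos Mesh 1.x was removed in 2.0.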
---
# Chaos Mesh: Kill random pods in the inventory service
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: inventory-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: inventory-service
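  # To repeat this every 2 hours, wrap the spec in a Chaos Mesh Schedule resource;
  # the inline scheduler/cron field from Chaos Mesh 1.x was removed in 2.0.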
Starting Safely: Don't begin chaos engineering in production on day one. Start in staging, limit blast radius (single pod, not entire service), run during business hours with the team ready, and have a clear "abort" button. Graduate to production only when you've built confidence and observability.

Real-World Failure Case Studies

AWS S3 Outage (February 2017)

Incident: Cascading Failure + Human Error
A Typo That Broke the Internet

What happened: While debugging slowness in the S3 billing system, an engineer ran an established playbook command meant to remove a small number of servers. A mistyped input removed a much larger set, including servers that supported the index subsystem, which manages metadata for S3 objects.

Cascade: S3 → CloudFront → Lambda → IoT → hundreds of websites and services that depended on S3 (Slack, Trello, Quora, IFTTT, and large portions of the internet). Services that stored their health check pages on S3 couldn't even report they were down.

Duration: 4+ hours of degraded service in US-EAST-1.

Root cause: No rate limit on server removal + index system had no fast-restart capability (took hours to rebuild from cold).

Lessons:

  • Blast radius controls: limit how much a single command can destroy
  • Dependency awareness: map what breaks when your infrastructure fails
  • Recovery speed: design for fast restart, not just redundancy
  • Don't host your status page on the infrastructure you're monitoring
Tags: Human Error, Cascading Failure, Blast Radius

GitHub Database Incident (October 2018)

Incident: Network Partition + Split Brain
43 Seconds That Caused 24 Hours of Recovery

What happened: Routine network maintenance caused a 43-second loss of connectivity between GitHub's US East Coast network hub and its primary US East Coast data centre, which hosted the primary MySQL databases. Orchestrator, the tool managing the MySQL cluster topology, interpreted the partition as a primary failure and promoted replicas in the US West Coast data centre to primary.

The problem: When connectivity was restored, both sites held writes that the other had not replicated, leaving data inconsistent across critical tables.

Duration: 24 hours and 11 minutes of degraded service. GitHub chose data integrity over availability — they manually reconciled the conflicting writes rather than losing data.

Lessons:

  • Short partitions can trigger long-lasting damage
  • Automatic failover must be tuned carefully (a 43-second partition should not trigger a cross-region promotion)
  • Data reconciliation after split-brain is extremely expensive
  • Prioritise consistency for write-heavy stateful systems
Tags: Network Partition, Split Brain, Data Integrity

Cloudflare Global Outage (July 2019)

Incident: Configuration Error + Missing Guardrails
A Regex That Knocked Out 13% of HTTP Requests

What happened: A WAF (Web Application Firewall) rule update contained a catastrophically backtracking regular expression. When deployed globally, it consumed 100% CPU on every Cloudflare edge server simultaneously.

Scale of impact: ~13% of all HTTP requests globally dropped for 27 minutes. Millions of websites behind Cloudflare became unreachable.

Why it wasn't caught: The rule passed its tests, which used short test strings and did not measure CPU cost. In production, real-world request payloads triggered catastrophic regex backtracking (ReDoS). WAF rule changes had no canary or staged rollout, so the rule went to all servers simultaneously.

Lessons:

  • Canary deployments: never deploy to 100% simultaneously
  • Test with production-representative data, not just synthetic tests
  • CPU usage guardrails: kill processes exceeding CPU thresholds
  • Rollback speed: how fast can you undo a bad deploy globally?
Tags: Global Deploy, No Canary, ReDoS

Exercises

Exercise 1 — Failure Mode Analysis: You're designing a 3-tier web application (frontend → API → database) deployed across 3 availability zones. Map out every failure mode you can think of (node, network, disk, etc.) and specify what the user experiences for each. Then design the redundancy and failover strategy to maintain 99.9% availability.
Exercise 2 — Cascading Failure Prevention: A payment processing system has: Load Balancer → 5 API pods → Connection Pool (10 connections per pod) → PostgreSQL primary (max 100 connections). If the database starts responding slowly (2s per query instead of 50ms), trace the cascade. Where would you add circuit breakers, timeouts, and backpressure to prevent total failure?
Exercise 3 — Chaos Experiment Design: Design three chaos experiments for a Kubernetes-based e-commerce platform: one at the pod level, one at the node level, and one at the network level. For each, define: (a) the hypothesis, (b) the steady-state metrics, (c) the injection method, (d) the expected outcome, and (e) the abort criteria.
Exercise 4 — Recovery Time Calculation: An etcd cluster has 5 members. Calculate: (a) how many members can fail while maintaining quorum, (b) if the detection timeout is 40s and pod scheduling takes 15s, what's the minimum recovery time for a single etcd member failure, (c) what happens if 3 members fail within 30 seconds of each other?

Conclusion

Distributed systems fail in creative, unexpected ways. The systems that survive aren't the ones that prevent all failures — they're the ones designed to expect failure and recover automatically.

Key principles from this part:

  • Design for failure: Assume every component can fail at any time
  • Limit blast radius: Bulkheads, resource quotas, and isolation prevent one failure from becoming many
  • Self-healing over manual intervention: Reconciliation loops and health checks detect and fix problems without human involvement
  • Graceful degradation over hard failure: Shed load and reduce features rather than returning errors
  • Practice failure: Chaos engineering builds confidence that recovery works before real outages test it
  • Learn from incidents: Post-mortems without blame drive systemic improvements

In Part 6, we'll cross from distributed systems theory into Kubernetes specifically — understanding how Kubernetes implements all these resilience principles through its architecture: the API server, etcd, schedulers, controllers, and the declarative reconciliation model that makes self-healing automatic.