The Philosophy of Failure
Failure is Inevitable
In a single-machine system, failure is binary: the machine works or it doesn't. In distributed systems, failure is partial, unpredictable, and continuous. At any moment, some components are failing while others continue operating normally.
Consider the scale: a large Kubernetes cluster with 5,000 nodes and 100,000 pods can expect failure rates of roughly this order:
- ~2 node failures per day (hardware, kernel panics, OOM kills)
- ~500 pod restarts per day (crashes, health check failures)
- ~50 network connectivity blips per day (packet drops, timeout spikes)
- ~5 disk failures per month
Google's internal data shows that in clusters with thousands of machines, something is always broken. The system must be designed so users never notice.
The Failure Spectrum
Not all failures are equal. They range from transient glitches to catastrophic data loss:
flowchart LR
A[Transient<br/>Network blip<br/>1-100ms] --> B[Intermittent<br/>Flapping service<br/>Seconds to minutes]
B --> C[Partial<br/>Node failure<br/>Minutes to hours]
C --> D[Total<br/>Datacenter outage<br/>Hours to days]
D --> E[Catastrophic<br/>Data corruption<br/>Permanent]
| Failure Type | Duration | Detection | Recovery Strategy |
|---|---|---|---|
| Transient | Milliseconds | Timeout triggers | Retry with backoff |
| Intermittent | Seconds–minutes | Health check failures | Circuit breaker + failover |
| Partial (node) | Minutes–hours | Heartbeat loss | Reschedule workloads |
| Total (zone/region) | Hours–days | External monitoring | Multi-region failover |
| Catastrophic (data) | Permanent | Consistency checks | Restore from backup |
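The "retry with backoff" strategy in the first row of the table can be sketched as follows. This is a minimal illustration rather than a production client: the function names, the default limits, and the choice of `ConnectionError` as the "transient" error type are all assumptions made for the example.

```python
import random
import time


def backoff_delays(retries, base=0.1, cap=5.0):
    """Return the upper bound (in seconds) of the wait before each retry."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]


def retry_with_backoff(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Uses "full jitter": each wait is a random value between 0 and the
    exponential ceiling, so many clients do not retry in lockstep.
    """
    ceilings = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:  # only retry errors assumed to be transient
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, ceilings[attempt]))
```

The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment, which is one way retry storms start.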
Types of Failures
Node Failures
A node failure means an entire machine becomes unresponsive. In Kubernetes, this means all pods on that node stop serving traffic. The causes are varied: hardware failure, kernel panic, out-of-memory kill, disk full, or power loss.
# Simulate a node failure in Kubernetes:
# Watch current pod distribution across nodes
kubectl get pods -o wide --all-namespaces | head -20
# Cordon a node (prevent new pods from being scheduled)
kubectl cordon worker-node-3
# Drain the node (evict all pods gracefully)
kubectl drain worker-node-3 --ignore-daemonsets --delete-emptydir-data
# Observe pods being rescheduled to healthy nodes:
kubectl get pods -o wide -w
# NAME READY STATUS NODE
# payment-7d8f9c-abc12 1/1 Running worker-node-1 ← rescheduled!
# inventory-5b6c7d-def34 1/1 Running worker-node-2 ← rescheduled!
# To simulate an unresponsive node (for testing):
# SSH into the node and stop the kubelet:
# systemctl stop kubelet
# After node-monitor-grace-period (roughly 40-50 seconds, depending on the Kubernetes version),
# the node is marked NotReady and tainted
# After a further ~5 minutes (the default 300s tolerationSeconds on pods), pods are evicted and replaced
sequenceDiagram
participant K as kubelet (Worker)
participant A as API Server
participant NC as Node Controller
participant S as Scheduler
K->>A: Heartbeat (every 10s)
K->>A: Heartbeat (every 10s)
Note over K: Node crashes!
Note over A: No heartbeat received...
A-->>NC: No heartbeat within node-monitor-grace-period (~40s)
NC->>A: Mark node "NotReady", taint it "unreachable"
Note over NC: Wait for pod tolerationSeconds (default 5m)
NC->>A: Evict pods from failed node
A-->>S: Replacement pods are pending
S->>A: Schedule pods on healthy nodes
Note over S: Pods rescheduled successfully
Network Partitions
A network partition splits a cluster into two or more groups that cannot communicate with each other — but each group is still internally functional. This is the most insidious failure because each side may believe it is the healthy majority.
# A network partition scenario in a 5-node cluster:
#
# Normal state: all nodes communicate freely
# [Node-1] ↔ [Node-2] ↔ [Node-3] ↔ [Node-4] ↔ [Node-5]
#
# After partition (e.g., switch failure between racks):
# Partition A: [Node-1] ↔ [Node-2] ↔ [Node-3] (3 nodes — majority)
# Partition B: [Node-4] ↔ [Node-5] (2 nodes — minority)
#
# What happens in Kubernetes:
# - etcd requires quorum (3 of 5 members, assuming each node here runs an etcd member) → Partition A keeps accepting writes
# - API Server on Partition A continues operating normally
# - Partition B nodes are marked NotReady after timeout
# - Pods on Partition B nodes continue running locally (still serving)
# - But no new scheduling decisions for Partition B
# - When partition heals, nodes rejoin and reconcile state
# Real-world causes of partitions:
# 1. Network switch/router failure
# 2. Firewall rule changes
# 3. Cable damage
# 4. Cloud provider availability zone isolation
# 5. DNS failure (logical partition)
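The majority rule that decides which side of the partition keeps operating comes down to simple arithmetic. The sketch below uses hypothetical helper names and is not etcd's implementation; it only shows why the 3-node side wins and the 2-node side does not.

```python
def quorum_size(cluster_size):
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable_members, cluster_size):
    """True if a partition with this many members may keep accepting writes."""
    return reachable_members >= quorum_size(cluster_size)


# The 5-node scenario above: Partition A has 3 members, Partition B has 2.
print(has_quorum(3, 5))  # True: Partition A keeps quorum
print(has_quorum(2, 5))  # False: Partition B cannot commit writes
```

Because at most one partition can hold a strict majority, two sides can never both accept writes. It also shows why even-sized clusters buy nothing: a 4-member cluster needs 3 for quorum, so it tolerates one failure, the same as a 3-member cluster.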
Cascading Failures
A cascading failure occurs when the failure of one component triggers failures in dependent components, creating a domino effect that can bring down an entire system. These are the most dangerous failures because they amplify small problems into catastrophic outages.
flowchart TD
A[Database node fails] --> B[Connection pool exhausted]
B --> C[API requests queue up]
C --> D[API response times spike to 30s]
D --> E[Upstream services timeout]
E --> F[Retry storms multiply load 10x]
F --> G[Remaining database nodes overloaded]
G --> H[ALL database nodes fail]
H --> I[Complete system outage]
style A fill:#BF092F,color:#fff
style I fill:#132440,color:#fff
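One common guard against the chain above is a circuit breaker: after repeated failures, callers stop sending requests to the struggling dependency and fail fast instead of queuing. The class below is a single-threaded sketch under assumed names and thresholds, not a reference implementation of any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Timeout elapsed: "half-open", so let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the overloaded database sees no traffic from this caller at all, which is what breaks the "retry storms multiply load" step in the diagram.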
The Thundering Herd
A common cascading pattern: a cache server restarts, and all cached data is lost. Suddenly, thousands of requests that were served from cache now hit the database directly. The database, sized for cache-miss traffic (maybe 5% of total), receives 20x its expected load and crashes.
Timeline: Cache restart → 0s: cache miss rate goes from 5% to 100% → 2s: database connection pool saturated → 5s: database CPU at 100% → 10s: database OOM kill → 15s: all services returning 500 errors.
Prevention: Cache warming (pre-populate before routing traffic), request coalescing (single database query for duplicate concurrent requests), circuit breakers (fail fast instead of queuing), gradual traffic restoration.
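Request coalescing is the least obvious of these techniques, so here is a small sketch of the idea: concurrent requests for the same key share one database query instead of each issuing their own. The `Coalescer` class and its `loader` callback are hypothetical names for this example.

```python
import threading


class _Call:
    """State shared by every caller waiting on one in-flight load."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None
        self.waiters = 0  # followers that joined this load


class Coalescer:
    """Share one in-flight load per key among concurrent callers."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def get(self, key):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
            else:
                call.waiters += 1
        if leader:
            try:
                call.result = self._loader(key)  # the only real query
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()  # wake every follower
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

With this in front of the database, a thousand simultaneous cache misses for the same key cost one query rather than a thousand. It does not help with misses spread across many different keys, which is why it is usually combined with cache warming.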
# Cascading failure prevention in Kubernetes:
# 1. Pod Disruption Budgets: cap how many pods voluntary disruptions can remove at once
# During drains and other evictions, this keeps at least 2 payment-service replicas running
# (a PDB does not protect against crashes or sudden node failures):
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service
EOF
# 2. Resource limits: prevent one pod from starving others
# Without limits, a memory leak in one pod can exhaust node memory and get neighbouring pods OOM-killed:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: payment-pod
spec:
  containers:
  - name: payment
    image: payment:v2.1
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
EOF
# 3. Priority classes: critical workloads are preempted and evicted last under resource pressure
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "For production-critical services that should be evicted last"
EOF
Byzantine Failures
Named after the Byzantine Generals Problem, these are failures where a component doesn't simply stop — it provides incorrect information. A node might report healthy while serving stale data, or a network device might corrupt packets silently.
| Failure Type | Behaviour | Detection | Example |
|---|---|---|---|
| Crash failure | Node stops completely | Easy (heartbeat timeout) | Server power loss |
| Omission failure | Node drops some messages | Moderate (sequence numbers) | Overloaded network buffer |
| Timing failure | Responds too late | Easy (deadline exceeded) | GC pause, disk I/O stall |
| Byzantine failure | Responds incorrectly | Very hard (checksums, consensus) | Bit-flip in RAM, compromised node |
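The "checksums" entry in the last row can be illustrated with a few lines of code. This is a sketch with hypothetical function names: the digest is stored alongside the data and re-verified on every read, so a silent bit-flip is detected rather than served.

```python
import hashlib


def store(payload: bytes):
    """Return the payload together with its SHA-256 digest."""
    return payload, hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected_digest: str) -> bool:
    """True if the payload still matches the digest recorded at write time."""
    return hashlib.sha256(payload).hexdigest() == expected_digest


data, digest = store(b"balance=100")

# Simulate a single bit-flip in the first byte.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]

print(verify(data, digest))       # True
print(verify(corrupted, digest))  # False
```

A checksum only catches accidental corruption. A compromised node can recompute the digest over its bad data, which is why the table lists consensus among several nodes as the other detection tool.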
Resilience Patterns
Redundancy
Redundancy is the most fundamental resilience pattern: maintain multiple copies so that losing one doesn't mean losing the service. But redundancy has costs and complexities:
| Redundancy Level | What It Protects Against | Cost Multiplier | Example |
|---|---|---|---|
| Process redundancy | Single process crash | 2–3x compute | 3 pod replicas |
| Node redundancy | Node hardware failure | 3–5x nodes | Anti-affinity rules |
| Zone redundancy | Availability zone outage | 3x infrastructure | Topology spread constraints |
| Region redundancy | Regional disaster | 2–3x everything | Multi-region active-active |
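To see why each level of redundancy is worth its cost, it helps to run the numbers. The sketch below assumes replica failures are independent, which is optimistic: replicas sharing a node, a zone, or a bad deploy fail together, and that is exactly what the higher levels in the table address. The function names are made up for this example.

```python
def combined_availability(single, replicas):
    """Availability of a service that is up while at least one replica is up.

    Assumes replicas fail independently of each other.
    """
    return 1 - (1 - single) ** replicas


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime in a 365-day year."""
    return (1 - availability) * 365 * 24 * 60


for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(n, round(a, 6), round(downtime_minutes_per_year(a), 1))
```

Under that assumption, going from one replica at 99% to three takes expected downtime from days per year to under a minute. Real systems see less benefit because failures are correlated.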
# Kubernetes: Spread replicas across zones for zone-level redundancy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      # Anti-affinity: prefer not to schedule two payment pods on the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - payment-service
              topologyKey: kubernetes.io/hostname
      # Topology spread: distribute evenly across availability zones
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: payment-service
      containers:
      - name: payment
        image: payment:v2.1
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
Failover Strategies
Failover is the process of switching from a failed component to a healthy backup. The speed and correctness of failover determine how much downtime users experience.
flowchart TD
subgraph Active-Passive
L1[Load Balancer] --> P1[Primary<br/>Handles all traffic]
L1 -.->|failover| S1[Standby<br/>Idle, ready]
P1 -->|replicate| S1
end
subgraph Active-Active
L2[Load Balancer] --> A1[Instance A<br/>50% traffic]
L2 --> A2[Instance B<br/>50% traffic]
A1 <-->|sync state| A2
end
| Strategy | Failover Time | Data Loss Risk | Cost | Use Case |
|---|---|---|---|---|
| Cold standby | Minutes–hours | High (last backup) | Low (idle hardware) | Non-critical, cost-sensitive |
| Warm standby | Seconds–minutes | Moderate (async replication lag) | Medium | Business applications |
| Hot standby (active-passive) | Seconds | Low (sync replication) | High (full duplicate) | Databases, stateful services |
| Active-active | Near zero (no failover needed) | Low, but concurrent writes can conflict | Highest (conflict resolution) | Global services, CDNs |
# Kubernetes automatic failover in action:
# A Deployment with 3 replicas across nodes:
kubectl get pods -o wide
# NAME READY NODE
# api-server-6f7g8h-abc12 1/1 worker-1
# api-server-6f7g8h-def34 1/1 worker-2
# api-server-6f7g8h-ghi56 1/1 worker-3
# Worker-2 loses network connectivity:
# 1. kubelet heartbeats stop reaching the API server
# 2. After ~40s: node marked NotReady and tainted; its pods are removed from Service endpoints
# 3. After a further 5m (default tolerationSeconds): pods on worker-2 are evicted
# 4. ReplicaSet controller sees only 2 of 3 replicas and creates a replacement pod
# 5. Scheduler places the new pod on worker-1 or worker-3:
kubectl get pods -o wide
# NAME READY NODE
# api-server-6f7g8h-abc12 1/1 worker-1
# api-server-6f7g8h-ghi56 1/1 worker-3
# api-server-6f7g8h-jkl78 1/1 worker-1 ← NEW replacement pod
# User impact: requests routed to the worker-2 pod can fail until the node is marked NotReady (step 2)
# After that, the Service routes only to the two healthy pods while the replacement is created
Graceful Degradation
Instead of failing completely, gracefully degraded systems shed load and reduce functionality to maintain core operations. The user gets a worse experience, but they get something rather than a blank error page.
Netflix's Degradation Layers
When Netflix's recommendation engine fails, users don't see an error — they see a generic "Popular on Netflix" row instead of personalised recommendations. The experience degrades gracefully:
- Level 0 (healthy): Personalised recommendations + personalised artwork + personalised row ordering
- Level 1 (partial): Generic recommendations + personalised artwork
- Level 2 (degraded): "Popular in your country" + generic artwork
- Level 3 (minimal): Static cached homepage served from CDN
At each level, users can still browse and watch content. The personalisation quality drops, but the core function (streaming video) remains available.
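In code, a degradation ladder like this is often just an ordered list of fallbacks. The sketch below is a generic illustration, not Netflix's implementation: the level names and provider functions are hypothetical.

```python
def first_available(providers):
    """Try each (level_name, provider) in order; return the first that works."""
    for level, provider in providers:
        try:
            return level, provider()
        except Exception:
            continue  # this level is unavailable; degrade to the next one
    raise RuntimeError("all degradation levels failed")


def personalised():
    raise TimeoutError("recommendation engine unavailable")


def popular_in_country():
    return ["Title A", "Title B"]


def static_homepage():
    return ["Cached Title"]


level, rows = first_available([
    ("personalised", personalised),
    ("popular", popular_in_country),
    ("static", static_homepage),
])
print(level, rows)  # popular ['Title A', 'Title B']
```

Returning the level alongside the result is a deliberate choice: it lets the caller emit a metric for how often users are being served a degraded experience.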
# Graceful degradation patterns in Kubernetes:
# 1. Liveness vs Readiness probes: remove unhealthy pods from traffic
# without killing them (gives time to recover)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: recommendation-service
spec:
  containers:
  - name: recommender
    image: recommender:v3.2
    # Readiness: controls traffic routing
    # If this fails, pod is removed from Service endpoints
    # but keeps running (can recover)
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
    # Liveness: controls container lifecycle
    # If this fails, the container is restarted
    livenessProbe:
      httpGet:
        path: /healthz/live
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
      failureThreshold: 5
    # Startup: gives slow-starting apps time to initialise
    # (up to 30 x 10s = 300s here before the container is restarted)
    startupProbe:
      httpGet:
        path: /healthz/startup
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
EOF
# 2. Pod Priority — evict non-critical pods before critical ones
# When the cluster runs out of resources, low-priority batch jobs
# are evicted first to protect user-facing services
Bulkheads & Isolation
Named after ship bulkheads (watertight compartments that prevent a hull breach from sinking the entire ship), this pattern isolates components so that a failure in one doesn't propagate to others.
flowchart LR
subgraph No Bulkheads
A1[All Services] --> P1[Shared Thread Pool<br/>100 threads]
P1 --> DB1[(Shared Database)]
end
subgraph With Bulkheads
B1[Payment Service] --> P2[Payment Pool<br/>30 threads]
B2[Inventory Service] --> P3[Inventory Pool<br/>30 threads]
B3[Search Service] --> P4[Search Pool<br/>40 threads]
P2 --> DB2[(Payment DB)]
P3 --> DB3[(Inventory DB)]
P4 --> DB4[(Search DB)]
end
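At the application level, the per-service pools in the diagram can be approximated with a semaphore per dependency. The following is a minimal sketch with hypothetical class names; it rejects work once a compartment is full rather than letting it queue and consume shared threads.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Non-blocking acquire: reject immediately instead of queuing.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()


# Pool sizes taken from the diagram above.
payment = Bulkhead("payment", max_concurrent=30)
inventory = Bulkhead("inventory", max_concurrent=30)
search = Bulkhead("search", max_concurrent=40)
```

If the search database stalls and all 40 search slots fill up, search requests start failing fast, but payment and inventory still have their full capacity available.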
# Kubernetes namespace-based bulkheads:
# Each team's workloads are isolated with resource quotas
# ResourceQuota for the payments namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    # Prevent payments team from consuming all cluster resources
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"
    services: "20"
---
# LimitRange sets per-container defaults and maximums
apiVersion: v1
kind: LimitRange
metadata:
  name: payments-limits
  namespace: payments
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    type: Container
Self-Healing Systems
Health Checks & Probes
Self-healing starts with detection. A system cannot fix what it cannot observe. Health checks are the sensory nervous system of distributed infrastructure:
| Probe Type | Question It Answers | Failure Action | Use Case |
|---|---|---|---|
| Liveness | "Is the process alive?" | Restart container | Deadlocked threads, stuck processes |
| Readiness | "Can it handle traffic?" | Remove from Service endpoints | Still warming cache, loading config |
| Startup | "Has it finished starting?" | Hold off liveness and readiness checks; restart container if it never succeeds | Slow-starting legacy apps |
# Common health check anti-patterns to avoid:
# BAD: Liveness probe that depends on external services
# If the database is down, this probe fails, pod restarts,
# but restarting won't fix the database, so it causes restart loops!
# livenessProbe:
#   httpGet:
#     path: /healthz ← checks database connectivity
#   failureThreshold: 3
# GOOD: Liveness only checks internal process health
# livenessProbe:
#   httpGet:
#     path: /healthz/live ← only checks "is the process responding?"
#   failureThreshold: 5
# GOOD: Readiness checks external dependencies
# readinessProbe:
#   httpGet:
#     path: /healthz/ready ← checks DB, cache, etc.
#   failureThreshold: 3
# Key principle:
# Liveness = "Should this container be restarted?"
# Readiness = "Should this container receive traffic?"
# Never make liveness depend on things a restart won't fix!
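On the application side, that principle translates into two separate handlers. The sketch below shows only the decision logic, with hypothetical function names and stubbed dependency checks; wiring it to `/healthz/live` and `/healthz/ready` in a real HTTP framework is left out.

```python
def liveness_status():
    """If this code runs at all, the process is responsive. No dependency checks."""
    return 200


def readiness_status(dependency_checks):
    """Return (status_code, failing_dependencies).

    Ready only when every dependency check passes. A check that raises
    is treated the same as a check that returns False.
    """
    failing = []
    for name, check in dependency_checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False
        if not healthy:
            failing.append(name)
    return (200, []) if not failing else (503, failing)


# Stub checks for illustration: the database is down, the cache is fine.
checks = {"database": lambda: False, "cache": lambda: True}
print(liveness_status())         # 200
print(readiness_status(checks))  # (503, ['database'])
```

With the database down, the pod reports alive but not ready: it is taken out of rotation without being restarted, and rejoins on its own once the database recovers.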
Reconciliation Loops
The core self-healing mechanism in Kubernetes: controllers continuously compare desired state with actual state and take corrective action. This is the control loop pattern — the same principle used in thermostats, cruise control, and autopilots.
flowchart TD
A[Observe Current State] --> B{Desired == Actual?}
B -->|Yes| C[Sleep / Wait for Change]
C --> A
B -->|No| D[Calculate Difference]
D --> E[Take Corrective Action]
E --> A
style B fill:#3B9797,color:#fff
style D fill:#BF092F,color:#fff
# Example: Deployment controller reconciliation
# You declare: "I want 3 replicas of my payment service"
kubectl scale deployment payment --replicas=3
# The reconciliation loop (the Deployment's ReplicaSet controller maintains the pod count):
# 1. OBSERVE: Watch API server → current pod count = 3 → matches desired → wait
# 2. EVENT: Pod is lost (e.g. its node fails) → current count drops to 2
# 3. OBSERVE: Watch event arrives → current = 2, desired = 3 → MISMATCH
# 4. DIFF: Need 1 more pod
# 5. ACT: Create a new Pod object via the API server; the scheduler assigns it to a node
# 6. OBSERVE: current = 3 → matches desired → wait
# Controllers react to watch events as they arrive, with a periodic resync as a safety net
# It's the same loop for ALL Kubernetes controllers:
# - ReplicaSet controller → maintains pod count
# - Deployment controller → manages rollouts
# - Node controller → marks unhealthy nodes
# - Job controller → ensures jobs run to completion
# - Endpoint controller → updates Service endpoints
# - Namespace controller → cleans up deleted namespaces
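Stripped of the Kubernetes machinery, one pass of such a loop is a pure function from (desired, actual) to a list of corrective actions. The sketch below is a toy model with made-up names, not controller source code.

```python
def reconcile(desired_replicas, running_pods, new_name):
    """Return the actions needed to make actual state match desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        return [("create", new_name()) for _ in range(diff)]
    if diff < 0:
        # Too many pods: delete the surplus (sorted only to be deterministic).
        return [("delete", pod) for pod in sorted(running_pods)[:-diff]]
    return []  # desired == actual: nothing to do


print(reconcile(3, ["pod-a", "pod-b"], lambda: "pod-new"))
print(reconcile(3, ["pod-a", "pod-b", "pod-c"], lambda: "pod-new"))
```

The function never asks why the state drifted. A crash, a deleted node, and a manual `kubectl delete` all look identical to it, which is what makes the pattern robust.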
Automatic Recovery
Self-healing goes beyond simple restarts. A truly self-healing system detects degradation, diagnoses root causes, and applies multi-level recovery:
Recovery Escalation Ladder
Production systems implement escalating recovery strategies:
- Level 1 — Application restart: Container restarts via liveness probe failure (handles memory leaks, deadlocks)
- Level 2 — Pod replacement: Controller creates fresh pod on same/different node (handles corrupted local state)
- Level 3 — Node drain: Workloads evacuated from problematic node (handles node-level issues)
- Level 4 — Node replacement: Cluster autoscaler provisions new node, old node terminated (handles hardware failures)
- Level 5 — Zone failover: Traffic shifted to other availability zones (handles zone-level outages)
Each level is triggered only if lower levels fail to resolve the issue. This prevents over-reaction to transient problems.
# Kubernetes restart policies and backoff:
apiVersion: v1
kind: Pod
metadata:
  name: resilient-app
spec:
  # restartPolicy: Always (the default for Pods, and the only value Deployments allow)
  # The kubelet uses exponential backoff for container restarts:
  # 1st crash: restart immediately
  # 2nd crash: wait 10s
  # 3rd crash: wait 20s
  # 4th crash: wait 40s
  # ... up to 5 minutes maximum
  # After 10 minutes of stability, backoff resets
  restartPolicy: Always
  containers:
  - name: app
    image: my-app:v1.0
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "256Mi"
        cpu: "200m"
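The backoff schedule described in those comments is easy to compute. The function below models the schedule as stated above (immediate first restart, then doubling from 10s up to a 5-minute cap); it is an illustration of the arithmetic, not the kubelet's code.

```python
def restart_delays(crashes, base=10, cap=300):
    """Seconds waited before each restart, for the given number of crashes."""
    delays = []
    for n in range(crashes):
        if n == 0:
            delays.append(0)  # first crash: restart immediately
        else:
            delays.append(min(cap, base * 2 ** (n - 1)))
    return delays


print(restart_delays(8))  # [0, 10, 20, 40, 80, 160, 300, 300]
```

A container that keeps crashing reaches the cap after seven restarts, at which point it is retried once every five minutes. This is the state `kubectl get pods` reports as `CrashLoopBackOff`.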
Chaos Engineering
Core Principles
Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. Rather than waiting for failures to happen, you deliberately inject failures to discover weaknesses before they cause outages.
The scientific method of chaos engineering:
- Define steady state — what does "normal" look like? (e.g., p99 latency < 200ms, error rate < 0.1%)
- Hypothesise — "The system will continue in steady state when X fails"
- Design experiment — inject a specific failure (kill a pod, add network latency, fill a disk)
- Run experiment — execute in production (or pre-production with realistic traffic)
- Observe — did the system maintain steady state? What degraded?
- Fix — address any weaknesses discovered
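The "define steady state" and "observe" steps only work if steady state is something a script can check. A minimal sketch, using the example thresholds from the list above (p99 latency under 200ms, error rate under 0.1%) and hypothetical function names:

```python
def p99(latencies_ms):
    """99th percentile using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer maths
    return ordered[rank - 1]


def in_steady_state(latencies_ms, errors, requests,
                    max_p99_ms=200, max_error_rate=0.001):
    """True if both the latency and the error-rate objectives hold."""
    return p99(latencies_ms) < max_p99_ms and errors / requests < max_error_rate


baseline = [50] * 99 + [500]          # one slow outlier in 100 requests
during_chaos = [50] * 98 + [500, 500]  # two slow requests in 100

print(in_steady_state(baseline, errors=0, requests=100))      # True
print(in_steady_state(during_chaos, errors=0, requests=100))  # False
```

Running the same check before, during, and after the experiment turns the hypothesis into a pass/fail result and gives an automatic signal for aborting an experiment that is hurting users.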
Tools & Practice
| Tool | Scope | Failure Types | Platform |
|---|---|---|---|
| Chaos Monkey | Instance termination | Random VM/container instance kills | Netflix (Spinnaker-managed clouds) |
| Litmus Chaos | Kubernetes-native | Pod, node, network, IO faults | Kubernetes (CNCF) |
| Chaos Mesh | Kubernetes-native | Pod, network, IO, time, JVM | Kubernetes (CNCF) |
| Gremlin | Enterprise | CPU, memory, network, process | Any (SaaS platform) |
| AWS FIS | AWS services | EC2, ECS, RDS, AZ failures | AWS |
# Chaos Mesh: Inject network latency into a Kubernetes service
# This adds 100ms of latency (with 20ms jitter) to network traffic of the payment service pods
# Note: the `scheduler` field is Chaos Mesh 1.x syntax; 2.x defines recurring runs with a separate Schedule resource
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency-test
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
    - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    jitter: "20ms"
    correlation: "75"
  duration: "5m"
  scheduler:
    cron: "@every 24h"
---
# Chaos Mesh: Kill one random pod in the inventory service
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: inventory-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
    - production
    labelSelectors:
      app: inventory-service
  scheduler:
    cron: "@every 2h"
Real-World Failure Case Studies
AWS S3 Outage (February 2017)
A Typo That Broke the Internet
What happened: An engineer ran a routine maintenance script to remove a small number of S3 billing servers. A typo in the command removed a much larger set of servers — including the index subsystem that managed metadata for S3 objects.
Cascade: S3 → CloudFront → Lambda → IoT → hundreds of websites and services that depended on S3 (Slack, Trello, Quora, IFTTT, and large portions of the internet). Services that stored their health check pages on S3 couldn't even report they were down.
Duration: 4+ hours of degraded service in US-EAST-1.
Root cause: No rate limit on server removal + index system had no fast-restart capability (took hours to rebuild from cold).
Lessons:
- Blast radius controls: limit how much a single command can destroy
- Dependency awareness: map what breaks when your infrastructure fails
- Recovery speed: design for fast restart, not just redundancy
- Don't host your status page on the infrastructure you're monitoring
GitHub Database Incident (October 2018)
43 Seconds That Caused 24 Hours of Recovery
What happened: Routine maintenance to replace failing network equipment caused a 43-second loss of connectivity between GitHub's US East Coast network hub and its primary US East Coast data centre. Orchestrator, the tool managing GitHub's MySQL cluster topologies, interpreted the partition as a primary failure and promoted replicas in the US West Coast data centre to primary.
The problem: When connectivity was restored, each site held writes the other had never received. The East Coast primaries had unreplicated writes from just before the partition, and the newly promoted West Coast primaries had accepted writes after it. The result was data inconsistency across critical tables.
Duration: 24 hours and 11 minutes of degraded service. GitHub chose data integrity over availability — they manually reconciled the conflicting writes rather than losing data.
Lessons:
- Short partitions can trigger long-lasting damage
- Automatic failover must be tuned carefully (promoting a new primary in response to a 43-second partition was too aggressive)
- Data reconciliation after split-brain is extremely expensive
- Prioritise consistency for write-heavy stateful systems
Cloudflare Global Outage (July 2019)
A Regex That Knocked Out 13% of HTTP Requests
What happened: A WAF (Web Application Firewall) rule update contained a catastrophically backtracking regular expression. When deployed globally, it consumed 100% CPU on every Cloudflare edge server simultaneously.
Scale of impact: ~13% of all HTTP requests globally dropped for 27 minutes. Millions of websites behind Cloudflare became unreachable.
Why it wasn't caught: The rule passed its tests, which had no way to flag excessive CPU consumption. In production, real-world request payloads triggered catastrophic regex backtracking (ReDoS). WAF rule updates also bypassed staged rollout: the rule went to all servers simultaneously.
Lessons:
- Canary deployments: never deploy to 100% simultaneously
- Test with production-representative data, not just synthetic tests
- CPU usage guardrails: kill processes exceeding CPU thresholds
- Rollback speed: how fast can you undo a bad deploy globally?
Conclusion
Distributed systems fail in creative, unexpected ways. The systems that survive aren't the ones that prevent all failures — they're the ones designed to expect failure and recover automatically.
Key principles from this part:
- Design for failure: Assume every component can fail at any time
- Limit blast radius: Bulkheads, resource quotas, and isolation prevent one failure from becoming many
- Self-healing over manual intervention: Reconciliation loops and health checks detect and fix problems without human involvement
- Graceful degradation over hard failure: Shed load and reduce features rather than returning errors
- Practice failure: Chaos engineering builds confidence that recovery works before real outages test it
- Learn from incidents: Post-mortems without blame drive systemic improvements
In Part 6, we'll cross from distributed systems theory into Kubernetes specifically — understanding how Kubernetes implements all these resilience principles through its architecture: the API server, etcd, schedulers, controllers, and the declarative reconciliation model that makes self-healing automatic.