Module 24: Graceful Degradation
The Philosophy of Graceful Degradation
A system that either works perfectly or fails completely is a brittle system. Graceful degradation means providing reduced but still useful functionality when components fail — preserving the core user experience while sacrificing non-essential features.
The hierarchy of degradation (from best to worst user experience):
- Full functionality: Everything works as designed
- Stale data: Serve cached/slightly-old data instead of failing (show yesterday's prices if real-time feed is down)
- Reduced features: Disable non-critical features (hide recommendations, disable search autocomplete)
- Read-only mode: Accept no writes but still serve reads (show catalog but disable purchasing)
- Static fallback: Serve a static version of the page from CDN
- Informational page: Show a "we're experiencing issues" page with ETA
- Complete outage: Nothing works — this is what we're trying to avoid
Degradation Patterns
Pattern 1: Cached fallback — When the primary data source is unavailable, serve cached data with a staleness indicator:
flowchart TD
    A[Incoming Request] --> B{"Primary Service<br/>Available?"}
    B -->|Yes| C[Serve Fresh Data]
    B -->|No| D{Cache Available?}
    D -->|Yes| E{"Cache Age<br/>Below Threshold?"}
    D -->|No| F{Feature Critical?}
    E -->|Yes| G["Serve Cached Data<br/>+ Staleness Header"]
    E -->|No| H["Serve Stale Cache<br/>+ Warning Banner"]
    F -->|Yes| I[Read-Only Mode]
    F -->|No| J["Disable Feature<br/>Silently"]
    style A fill:#132440,color:#fff
    style C fill:#3B9797,color:#fff
    style G fill:#3B9797,color:#fff
    style H fill:#16476A,color:#fff
    style I fill:#BF092F,color:#fff
    style J fill:#16476A,color:#fff
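The decision tree above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CacheEntry`, `fetch_with_fallback`, and the returned `mode` strings are hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CacheEntry:
    value: str
    stored_at: float  # Unix timestamp when the value was cached


def fetch_with_fallback(
    fetch_primary: Callable[[], str],
    cached: Optional[CacheEntry],
    max_age_s: float,
    critical: bool,
    now: Optional[float] = None,
) -> dict:
    """Walk the decision tree: fresh data first, then cache, then degrade."""
    now = time.time() if now is None else now
    try:
        return {"mode": "fresh", "data": fetch_primary()}
    except Exception:
        pass  # Primary unavailable: fall through to the cache

    if cached is not None:
        age = now - cached.stored_at
        # Young cache gets a staleness header; old cache also gets a banner
        mode = "cached" if age < max_age_s else "stale_with_warning"
        return {"mode": mode, "data": cached.value, "age_s": age}

    # No cache at all: critical features go read-only, others are hidden
    return {"mode": "read_only" if critical else "feature_disabled", "data": None}
```

The caller decides how each `mode` is rendered (header, banner, or hidden widget), which keeps the fallback policy separate from presentation.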
Pattern 2: Progressive feature disabling — Shed load by disabling features in priority order:
- Priority 4 (first to disable): Analytics tracking, A/B test assignment, personalized recommendations
- Priority 3: Search autocomplete, real-time notifications, chat widgets
- Priority 2: Non-essential API calls, image lazy-loading, third-party integrations
- Priority 1 (last resort): Write operations, user authentication, core business logic
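One way to express this ordering in code is a priority table plus a single shedding threshold. The sketch below is illustrative only: the feature keys and the `enabled_features` helper are assumptions made for this example, loosely mirroring the list above.

```python
# Feature names and priorities loosely mirror the list above (illustrative only).
FEATURE_PRIORITY = {
    "analytics": 4,
    "ab_tests": 4,
    "recommendations": 4,
    "search_autocomplete": 3,
    "notifications": 3,
    "chat_widget": 3,
    "nonessential_api_calls": 2,
    "third_party_integrations": 2,
    "writes": 1,
    "authentication": 1,
}


def enabled_features(shed_down_to: int) -> set[str]:
    """Return features still enabled after shedding every priority >= shed_down_to.

    shed_down_to=5 sheds nothing; shed_down_to=4 drops only priority-4 features.
    """
    return {name for name, prio in FEATURE_PRIORITY.items() if prio < shed_down_to}
```

Lowering one number sheds whole priority bands at once, so an operator under pressure never has to reason about individual flags.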
Pattern 3: Read-only mode — When write paths are compromised but reads are healthy, switch to read-only to preserve availability for the majority of users (most web traffic is reads).
Feature Flags for Progressive Degradation
Feature flags are the runtime control mechanism for graceful degradation. They allow operators to disable functionality in production without deploying code changes:
# feature-flags.yaml: degradation configuration
# Managed by operations team, applied at runtime
degradation_levels:
  normal:
    description: "All systems operational"
    recommendations: true
    search_autocomplete: true
    real_time_inventory: true
    analytics: true
    notifications: true
  level_1_light:
    description: "Minor degradation: non-critical features disabled"
    recommendations: false        # Disable ML recommendations
    search_autocomplete: false    # Disable autocomplete API calls
    real_time_inventory: true
    analytics: false              # Stop tracking to reduce load
    notifications: true
  level_2_moderate:
    description: "Moderate degradation: secondary features disabled"
    recommendations: false
    search_autocomplete: false
    real_time_inventory: false    # Show cached inventory counts
    analytics: false
    notifications: false          # Queue notifications for later
  level_3_severe:
    description: "Severe degradation: read-only mode"
    recommendations: false
    search_autocomplete: false
    real_time_inventory: false
    analytics: false
    notifications: false
    write_operations: false       # No purchases, no account changes
    serve_cached_pages: true      # Serve from CDN cache
# Automatic triggers (can also be manually activated)
auto_triggers:
  - condition: "error_rate > 5%"
    action: "level_1_light"
  - condition: "error_rate > 15%"
    action: "level_2_moderate"
  - condition: "error_rate > 30% OR p99_latency > 10s"
    action: "level_3_severe"
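The automatic triggers in the configuration above amount to a small threshold function. A minimal sketch, assuming the error rate arrives as a fraction and p99 latency in seconds (the function name `degradation_level` is hypothetical):

```python
def degradation_level(error_rate: float, p99_latency_s: float) -> str:
    """Map current metrics to a degradation level.

    error_rate is a fraction (0.05 == 5%). Conditions are checked from most
    to least severe so overlapping thresholds resolve to the highest level.
    """
    if error_rate > 0.30 or p99_latency_s > 10:
        return "level_3_severe"
    if error_rate > 0.15:
        return "level_2_moderate"
    if error_rate > 0.05:
        return "level_1_light"
    return "normal"
```

A production version would likely add hysteresis (for example, requiring several consecutive healthy readings before stepping back down) so the system does not flap between levels. That refinement is not part of the configuration shown here.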
Module 25: Chaos Engineering
Principles of Chaos Engineering
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. It's not about breaking things randomly — it's a scientific method applied to distributed systems.
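Four of the advanced principles from the Principles of Chaos Engineering (the published list also includes a fifth, minimizing blast radius, which appears here under the third principle):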
- Build a hypothesis around steady state behavior: Define what "normal" looks like with measurable metrics (throughput, error rate, latency percentiles). The hypothesis is: "When we inject failure X, the system maintains steady state."
- Vary real-world events: Inject failures that actually happen — server crashes, network partitions, disk full, clock skew, DNS failures, certificate expiry. Don't inject unrealistic failures.
- Run experiments in production: Staging environments can't replicate production's complexity (traffic patterns, data distribution, dependency behavior). Start small in production with blast radius controls.
- Automate experiments to run continuously: Running chaos manually during business hours is a starting point. The goal is automated, continuous experiments that catch regressions immediately.
flowchart LR
    A["1. Define Steady<br/>State Hypothesis"] --> B["2. Design<br/>Experiment"]
    B --> C["3. Limit Blast<br/>Radius"]
    C --> D["4. Run<br/>Experiment"]
    D --> E{"5. Steady State<br/>Maintained?"}
    E -->|Yes| F["✓ Confidence<br/>Increased"]
    E -->|No| G["✗ Weakness<br/>Found"]
    G --> H["6. Fix &<br/>Retest"]
    H --> A
    F --> I["7. Expand<br/>Scope"]
    I --> A
    style A fill:#132440,color:#fff
    style D fill:#3B9797,color:#fff
    style F fill:#3B9797,color:#fff
    style G fill:#BF092F,color:#fff
    style H fill:#16476A,color:#fff
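The experiment loop can be reduced to a small harness that refuses to inject into an unhealthy system and always removes the fault afterwards. This is a sketch under stated assumptions: `steady_state_ok`, `inject`, and `rollback` are hypothetical callables that a team would wire to its own metrics and fault-injection tooling.

```python
from typing import Callable


def run_experiment(
    steady_state_ok: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one chaos experiment and report what it showed."""
    if not steady_state_ok():
        # Never inject a fault into a system that is already unhealthy
        return "skipped: system not in steady state"

    inject()
    try:
        held = steady_state_ok()
    finally:
        rollback()  # Always remove the fault, even if the check raised

    return "confidence increased" if held else "weakness found"
```

The `try/finally` is the important design choice: an experiment that cannot clean up after itself turns a controlled test into a real incident.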
Chaos Mesh: Kubernetes-Native Chaos
Chaos Mesh is a CNCF project that brings chaos engineering to Kubernetes with fine-grained control over failure injection:
# chaos-mesh-pod-kill.yaml
# Kill random pods in the payment service to test resilience
# Note: spec.scheduler is the Chaos Mesh 1.x form; Chaos Mesh 2.x schedules
# experiments through a separate Schedule resource instead.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                     # Kill one random pod
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  scheduler:
    cron: "@every 2h"           # Run every 2 hours
---
# Network delay experiment: simulate cross-region latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: catalog-network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: catalog-service
  delay:
    latency: "200ms"            # Add 200ms latency
    jitter: "50ms"              # ±50ms variation
    correlation: "75"           # 75% correlation between packets
  duration: "5m"                # Run for 5 minutes
  scheduler:
    cron: "0 14 * * 1-5"        # Weekdays at 2 PM
---
# I/O chaos: simulate disk degradation
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: db-io-latency
  namespace: chaos-testing
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres-primary
  volumePath: /var/lib/postgresql/data
  path: "**/*"
  delay: "100ms"                # 100ms I/O latency
  percent: 50                   # Affect 50% of operations
  duration: "3m"
LitmusChaos Workflows
LitmusChaos provides a workflow-based approach where multiple chaos experiments are orchestrated together, simulating complex real-world failure scenarios:
# litmus-workflow.yaml
# Complete resilience test: network + pod + resource pressure
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: resilience-test-workflow
  namespace: litmus
spec:
  entrypoint: resilience-pipeline
  templates:
    - name: resilience-pipeline
      steps:
        # Step 1: Verify steady state before experiments
        - - name: verify-steady-state
            template: check-health
        # Step 2: Run chaos experiments in parallel
        - - name: network-partition
            template: network-chaos
          - name: pod-stress
            template: cpu-stress
        # Step 3: Verify system recovered
        - - name: verify-recovery
            template: check-health
        # Step 4: More aggressive chaos
        - - name: kill-nodes
            template: node-drain
    - name: check-health
      container:
        # Note: the script below needs curl, jq, and bc. The stock curl image
        # may not include jq, so a custom image may be required.
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            # Check that the aggregate error rate stays below 1%
            ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query \
              --data-urlencode 'query=sum(rate(http_errors_total[5m]))/sum(rate(http_requests_total[5m]))' \
              | jq -r '.data.result[0].value[1]')
            # bc prints 1 when the comparison is true; POSIX test works in /bin/sh
            if [ "$(echo "$ERROR_RATE > 0.01" | bc -l)" -eq 1 ]; then
              echo "FAIL: Error rate $ERROR_RATE exceeds 1% threshold"
              exit 1
            fi
            echo "PASS: Error rate $ERROR_RATE is within bounds"
    - name: network-chaos
      container:
        image: litmuschaos/litmus-checker:latest
        args:
          - -name=pod-network-loss
          - -appns=production
          - -applabel=app=order-service
          - -network-packet-loss-percentage=30
          - -total-chaos-duration=120
    - name: cpu-stress
      container:
        image: litmuschaos/litmus-checker:latest
        args:
          - -name=pod-cpu-hog
          - -appns=production
          - -applabel=app=order-service
          - -cpu-cores=2
          - -total-chaos-duration=120
    - name: node-drain
      container:
        image: litmuschaos/litmus-checker:latest
        args:
          - -name=node-drain
          - -node-label=role=worker
          - -total-chaos-duration=60
GameDay Practices
A GameDay is a structured chaos engineering exercise where teams intentionally break systems to test their response capabilities — both technical (does the system handle it?) and organizational (do the people handle it?).
GameDay progression (maturity levels):
- Level 1 — Table-top: Discuss failure scenarios on a whiteboard. "What happens if the database fails?" No actual injection.
- Level 2 — Staging chaos: Inject failures in staging/pre-prod. Verify alarms fire, runbooks are followed.
- Level 3 — Production chaos (canary): Inject failures in production affecting a small percentage of traffic (1-5%).
- Level 4 — Production chaos (full): Inject failures affecting significant infrastructure (kill an AZ, partition a region).
- Level 5 — Continuous chaos: Automated experiments run 24/7 with automatic abort on SLO violation.
Module 26: Disaster Recovery
RTO & RPO: The DR Metrics
Every disaster recovery plan centers on two critical metrics:
- RPO (Recovery Point Objective): How much data can you afford to lose? Measured in time. RPO = 1 hour means you accept losing up to 1 hour of data. Determines backup frequency.
- RTO (Recovery Time Objective): How long can you be down? Measured in time. RTO = 4 hours means the business can tolerate 4 hours of complete outage. Determines recovery speed requirements.
DR Tiers: Cost vs. Recovery Speed
flowchart LR
    A["💰 Backup &<br/>Restore<br/>RTO: 24h+<br/>RPO: 24h<br/>Cost: $"] --> B["🔥 Pilot<br/>Light<br/>RTO: 4-8h<br/>RPO: 1h<br/>Cost: $$"]
    B --> C["♨️ Warm<br/>Standby<br/>RTO: 1-4h<br/>RPO: 15min<br/>Cost: $$$"]
    C --> D["🔄 Hot Standby<br/>(Active-Passive)<br/>RTO: 5-30min<br/>RPO: ~0<br/>Cost: $$$$"]
    D --> E["⚡ Active-Active<br/>(Multi-Region)<br/>RTO: ~0<br/>RPO: 0<br/>Cost: $$$$$"]
    style A fill:#D4EDED,color:#132440
    style B fill:#7CCBCB,color:#132440
    style C fill:#3B9797,color:#fff
    style D fill:#16476A,color:#fff
    style E fill:#132440,color:#fff
Tier 1 — Backup & Restore: Data backed up periodically to another region. Recovery requires provisioning new infrastructure and restoring from backup. Cheapest but slowest.
Tier 2 — Pilot Light: Core infrastructure (database replicas, DNS records) kept running in DR region. Compute resources scaled to zero. On disaster, scale up compute and switch traffic. Like a gas pilot light — ready to ignite.
Tier 3 — Warm Standby: Scaled-down but fully functional copy running in DR region. Handles a fraction of traffic (or internal traffic). On disaster, scale up and redirect all traffic.
Tier 4 — Hot Standby: Full-scale replica in DR region with synchronous data replication. Traffic served from primary; failover is automatic or one-click. Near-zero data loss.
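Tier 5 — Active-Active: Both regions serve production traffic simultaneously. Data replicated bidirectionally. There is no separate "failover" step: if one region dies, the other absorbs its traffic. Near-zero downtime and data loss (writes not yet replicated when a region fails can still be lost unless replication is synchronous), at maximum cost.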
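Choosing a tier then becomes a lookup: walk the tiers from cheapest to most expensive and stop at the first one whose worst-case RTO and RPO fit the business targets. The sketch below uses the upper end of each range in the diagram, treating "24h+" as 24 hours and "~0" as 0 purely for illustration. The table and the `cheapest_tier` helper are assumptions of this example, not vendor figures.

```python
# (name, worst-case RTO in minutes, worst-case RPO in minutes), cheapest first.
# Figures are the upper ends of the ranges in the tier diagram (illustrative).
DR_TIERS = [
    ("backup_restore", 24 * 60, 24 * 60),
    ("pilot_light", 8 * 60, 60),
    ("warm_standby", 4 * 60, 15),
    ("hot_standby", 30, 0),
    ("active_active", 0, 0),
]


def cheapest_tier(rto_min: float, rpo_min: float) -> str:
    """Return the cheapest tier whose worst case meets both targets."""
    for name, tier_rto, tier_rpo in DR_TIERS:
        if tier_rto <= rto_min and tier_rpo <= rpo_min:
            return name
    raise ValueError("no tier meets the requested targets")
```

With the targets used earlier (RTO = 4 hours, RPO = 1 hour), `cheapest_tier(240, 60)` selects `warm_standby`: pilot light satisfies the RPO, but its worst-case RTO of 8 hours is too slow.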
Multi-Region Failover Automation
Manual failover during a disaster is error-prone (people panic, runbooks are outdated, credentials are expired). Automated failover scripts should be tested regularly:
#!/bin/bash
# dr-failover.sh: automated multi-region failover script
# Triggered by health check failure or manual invocation
set -euo pipefail

PRIMARY_REGION="us-east-1"
DR_REGION="eu-west-1"
HEALTH_ENDPOINT="https://api.primary.example.com/health"
DNS_ZONE="example.com"
FAILOVER_RECORD="api.example.com"

# The values below are not defined in this script, so they must come from the
# environment. Check them up front: with `set -u`, a missing variable would
# otherwise abort the run halfway through, after the DR database is promoted.
: "${DNS_ZONE_ID:?must be set to the Route 53 hosted zone ID for $DNS_ZONE}"
: "${DR_ALB_ZONE_ID:?must be set to the hosted zone ID of the DR load balancer}"
: "${DR_ALB_DNS:?must be set to the DNS name of the DR load balancer}"
: "${CDN_DISTRIBUTION_ID:?must be set to the CloudFront distribution ID}"
: "${INCIDENT_TOPIC_ARN:?must be set to the SNS incident topic ARN}"

echo "[$(date -u)] === DISASTER RECOVERY FAILOVER INITIATED ==="
echo "[$(date -u)] Primary: $PRIMARY_REGION | DR: $DR_REGION"

# Step 1: Verify primary is actually down (prevent false positives)
echo "[$(date -u)] Step 1: Verifying primary failure..."
FAIL_COUNT=0
for i in {1..5}; do
  if ! curl -sf --max-time 5 "$HEALTH_ENDPOINT" > /dev/null 2>&1; then
    FAIL_COUNT=$((FAIL_COUNT + 1))
  fi
  sleep 2
done
if [ "$FAIL_COUNT" -lt 4 ]; then
  echo "[$(date -u)] ABORT: Primary responded $((5-FAIL_COUNT))/5 times. Not a confirmed outage."
  exit 1
fi
echo "[$(date -u)] CONFIRMED: Primary failed $FAIL_COUNT/5 health checks."

# Step 2: Scale up DR region compute
echo "[$(date -u)] Step 2: Scaling DR region compute..."
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "app-asg-$DR_REGION" \
  --min-size 6 --desired-capacity 12 --max-size 24 \
  --region "$DR_REGION"

# Wait for instances to be healthy. The AWS CLI has no waiter for Auto Scaling
# groups, so poll until the InService count reaches the minimum size (6).
echo "[$(date -u)] Waiting for DR instances to pass health checks..."
until [ "$(aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names "app-asg-$DR_REGION" \
      --region "$DR_REGION" \
      --query "length(AutoScalingGroups[0].Instances[?LifecycleState=='InService'])" \
      --output text)" -ge 6 ]; do
  sleep 15
done

# Step 3: Promote DR database replica to primary
echo "[$(date -u)] Step 3: Promoting DR database replica..."
aws rds promote-read-replica \
  --db-instance-identifier "app-db-replica-$DR_REGION" \
  --region "$DR_REGION"
aws rds wait db-instance-available \
  --db-instance-identifier "app-db-replica-$DR_REGION" \
  --region "$DR_REGION"

# Step 4: Update DNS to point to DR region
echo "[$(date -u)] Step 4: Updating DNS records..."
aws route53 change-resource-record-sets \
  --hosted-zone-id "$DNS_ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "'"$FAILOVER_RECORD"'",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "'"$DR_ALB_ZONE_ID"'",
          "DNSName": "'"$DR_ALB_DNS"'",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

# Step 5: Invalidate CDN cache
echo "[$(date -u)] Step 5: Invalidating CDN cache..."
aws cloudfront create-invalidation \
  --distribution-id "$CDN_DISTRIBUTION_ID" \
  --paths "/*"

# Step 6: Send notifications
echo "[$(date -u)] Step 6: Sending notifications..."
aws sns publish \
  --topic-arn "$INCIDENT_TOPIC_ARN" \
  --subject "DR FAILOVER COMPLETE" \
  --message "Failover from $PRIMARY_REGION to $DR_REGION completed at $(date -u). All traffic now served from DR region."

echo "[$(date -u)] === FAILOVER COMPLETE ==="
echo "[$(date -u)] Traffic now served from: $DR_REGION"
Module 27: Recovery Architecture
Self-Healing Systems
Self-healing systems detect failures and recover automatically without human intervention. The foundation in Kubernetes is the reconciliation loop — controllers continuously compare desired state with actual state and take corrective action.
flowchart TD
    A["Desired State<br/>(Declared in YAML)"] --> B["Controller<br/>(Reconciliation Loop)"]
    C["Actual State<br/>(Observed)"] --> B
    B --> D{"Desired ==<br/>Actual?"}
    D -->|Yes| E["✓ No Action<br/>(Wait & Watch)"]
    D -->|No| F["Take Corrective<br/>Action"]
    F --> G["Create/Delete/Update<br/>Resources"]
    G --> C
    E -->|"Poll interval"| C
    style A fill:#3B9797,color:#fff
    style C fill:#16476A,color:#fff
    style E fill:#3B9797,color:#fff
    style F fill:#BF092F,color:#fff
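Stripped of Kubernetes specifics, a reconciliation loop is a diff between two maps followed by corrective actions. The sketch below models desired and actual state as replica counts per workload. It is a toy illustration of the pattern, not how any real controller is implemented, and the `reconcile` and `apply_actions` names are hypothetical.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return the actions needed to move actual replica counts to desired ones."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(("create", name, want - have))
        elif have > want:
            actions.append(("delete", name, have - want))
    for name, have in actual.items():
        if name not in desired and have > 0:
            actions.append(("delete", name, have))  # Not declared: remove it
    return actions


def apply_actions(actual: dict[str, int], actions: list[tuple[str, str, int]]) -> None:
    """Mutate the observed state as if the actions had been carried out."""
    for verb, name, count in actions:
        delta = count if verb == "create" else -count
        actual[name] = actual.get(name, 0) + delta
```

Running `reconcile` again after `apply_actions` returns an empty list, which is the "Desired == Actual" branch of the diagram: the loop is idempotent and safe to run on every poll.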
Kubernetes liveness and readiness probes are the basic building blocks of self-healing:
# k8s-self-healing.yaml
# Comprehensive health probes for self-healing behavior
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:v2.3.1
          ports:
            - containerPort: 8080
          # Liveness probe: Is the process alive?
          # If this fails, Kubernetes RESTARTS the container
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30   # Wait 30s after start
            periodSeconds: 10         # Check every 10s
            timeoutSeconds: 3         # Timeout after 3s
            failureThreshold: 3       # Restart after 3 failures
            successThreshold: 1       # 1 success = healthy
          # Readiness probe: Can the process handle traffic?
          # If this fails, Kubernetes removes from Service endpoints
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5    # Check sooner than liveness
            periodSeconds: 5          # Check more frequently
            timeoutSeconds: 2
            failureThreshold: 2       # Remove faster (2 failures)
            successThreshold: 2       # Require 2 passes to re-add
          # Startup probe: Is the app still initializing?
          # Disables liveness/readiness until startup succeeds
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            failureThreshold: 30      # Allow up to 150s to start
            successThreshold: 1
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
---
# Pod Disruption Budget: limit simultaneous disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: production     # Must match the Deployment's namespace to select its pods
spec:
  minAvailable: 2           # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: order-service
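The probe settings translate directly into how long a failure can go unnoticed. As a rough worked example (ignoring probe timeouts and scheduling jitter), the window is approximately `periodSeconds × failureThreshold`; the helper name below is hypothetical.

```python
def detection_window_s(period_s: int, failure_threshold: int) -> int:
    """Approximate seconds of consecutive failures before Kubernetes acts."""
    return period_s * failure_threshold


# Values taken from the manifest above
liveness_restart_after = detection_window_s(period_s=10, failure_threshold=3)
readiness_removed_after = detection_window_s(period_s=5, failure_threshold=2)
startup_budget = detection_window_s(period_s=5, failure_threshold=30)

print(liveness_restart_after, readiness_removed_after, startup_budget)
```

With these settings, an unready pod stops receiving traffic after roughly 10 seconds, a hung container is restarted after roughly 30 seconds, and a slow-starting container gets up to 150 seconds before it is killed. Readiness reacting faster than liveness is deliberate: removing a pod from rotation is cheap and reversible, while a restart is not.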
Safe Rollbacks: Blue-Green & Canary
Deployment failures are a leading cause of outages. Safe rollback strategies ensure that bad deployments can be reversed quickly with minimal user impact:
Blue-Green Deployment: Maintain two identical production environments. Route all traffic to "Blue" while deploying to "Green." Once Green is verified, switch the router. Instant rollback = switch back to Blue.
- Advantage: Instant rollback, zero-downtime deployment, full production testing before exposure
- Disadvantage: Double infrastructure cost, database schema migrations are tricky (both versions must handle both schemas)
Canary Deployment: Route a small percentage of traffic (1-5%) to the new version. Monitor error rates, latency, and business metrics. Gradually increase traffic if healthy. Automatically roll back if metrics degrade.
- Advantage: Limited blast radius, real production validation, automated rollback
- Disadvantage: Requires sophisticated traffic routing, session affinity challenges, slower rollout
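The canary control loop is simple to state in code: raise the traffic share step by step and abort at the first unhealthy reading. In this minimal sketch, `error_rate_at` is a hypothetical hook that routes the given percentage of traffic to the new version and returns the observed error rate. The step sizes and the 1% threshold are illustrative assumptions.

```python
from typing import Callable, Sequence


def run_canary(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (1, 5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> tuple[str, int]:
    """Shift traffic step by step; roll back at the first unhealthy step.

    Returns the outcome and the last traffic percentage that was tried.
    """
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            # Caller sends all traffic back to the old version
            return ("rolled_back", percent)
    return ("promoted", steps[-1])
```

In practice the health check would compare the canary against the current version's baseline and include latency and business metrics, as described above, rather than a single fixed error threshold.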
State Recovery & Reconciliation
After a failure, the most complex recovery challenge is state reconciliation — ensuring that all data stores, caches, queues, and indexes are consistent with each other.
Event sourcing for perfect recovery: If you store events (not just current state), you can always rebuild state by replaying events. This makes recovery deterministic — replay from the last known-good snapshot.
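A minimal sketch of snapshot-plus-replay recovery, assuming a toy account-balance domain (the `Event` type and `rebuild` function are invented for this illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int      # Position in the event log
    account: str
    delta: int    # Signed change to the balance


def rebuild(snapshot: dict[str, int], snapshot_seq: int, log: list[Event]) -> dict[str, int]:
    """Rebuild balances by replaying events newer than the snapshot."""
    state = dict(snapshot)  # Copy so the snapshot itself is never mutated
    for event in sorted(log, key=lambda e: e.seq):
        if event.seq > snapshot_seq:
            state[event.account] = state.get(event.account, 0) + event.delta
    return state
```

Replaying the full log from an empty state and replaying from a snapshot taken after the first event produce the same balances. That equivalence is what makes this style of recovery deterministic.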
Reconciliation patterns:
- Full reconciliation: Compare entire datasets between primary and replica. Expensive but thorough. Run nightly.
- Incremental reconciliation: Compare only records modified since last reconciliation. Faster, run hourly.
- Checksum-based: Compute checksums of data blocks. Only reconcile blocks with mismatches. Efficient for large datasets.
- Event-driven: Use change data capture (CDC) to stream mutations. Consumers rebuild their views from the stream.
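The checksum-based pattern can be illustrated with fixed-size blocks of rows: hash each block on both sides, then reconcile only the blocks whose hashes differ. This is a simplified sketch that assumes both sides store rows in the same order and that rows are plain strings. The function names are hypothetical.

```python
import hashlib


def block_checksums(rows: list[str], block_size: int) -> list[str]:
    """Hash fixed-size blocks of rows so whole blocks can be compared cheaply."""
    return [
        hashlib.sha256("\n".join(rows[i:i + block_size]).encode()).hexdigest()
        for i in range(0, len(rows), block_size)
    ]


def mismatched_blocks(primary: list[str], replica: list[str], block_size: int = 2) -> list[int]:
    """Return indexes of blocks that differ or exist on only one side."""
    a = block_checksums(primary, block_size)
    b = block_checksums(replica, block_size)
    longest = max(len(a), len(b))
    return [i for i in range(longest) if i >= len(a) or i >= len(b) or a[i] != b[i]]
```

Only the block indexes returned need a row-by-row comparison, so the cost scales with the amount of divergence rather than with dataset size.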
Case Studies
Netflix: The Evolution of Chaos Monkey
From Chaos Monkey to the Simian Army
During its migration to AWS, which began in 2008, Netflix built Chaos Monkey, a tool that randomly terminates production instances. The reasoning: if engineers know that instances can be killed at any time, they'll build services that handle it gracefully.
Evolution of Netflix's chaos tools:
- Chaos Monkey (2010): Randomly kills individual instances. Forces services to be stateless and handle instance loss.
- Chaos Gorilla (2011): Simulates an entire AWS Availability Zone outage. Forces multi-AZ architecture.
- Chaos Kong (2015): Simulates an entire AWS Region failure. Forces multi-region architecture.
- Latency Monkey: Injects artificial latency between services. Forces proper timeout handling.
- FIT (Failure Injection Testing): Targeted failure injection with precise scope control.
Key lessons from Netflix:
- Chaos engineering only works with strong observability — you must detect the impact
- Start simple (kill instances) before going complex (region failure)
- Make chaos opt-out, not opt-in — new services inherit chaos testing by default
- The goal isn't to prove the system is resilient — it's to find the weaknesses before customers do
Google DiRT: Disaster Recovery Testing at Scale
DiRT (Disaster Recovery Testing)
Google runs DiRT — annual disaster recovery exercises that simulate catastrophic failures affecting core infrastructure. These aren't small-scale tests; they simulate events like losing an entire data center or a critical internal service going offline.
How DiRT works:
- Planning (months before): A small team designs scenarios without telling operational teams what will happen. Scenarios range from "single service failure" to "multiple data center loss."
- Execution: Failures are injected during business hours. Operational teams must respond using existing runbooks and tools. The planning team observes but doesn't help.
- Post-mortem: Detailed analysis of what worked, what didn't, what was surprising. Action items tracked to completion.
DiRT findings that changed Google's architecture:
- Discovered that many services had undocumented dependencies on a single authentication service
- Found that failover procedures assumed network connectivity that wouldn't exist during a real disaster
- Revealed that backup restoration took 10x longer than documented because procedures were outdated
- Led to the "N+2" capacity rule — always have enough capacity to lose 2 data centers and still serve traffic
Conclusion & Next Steps
The key takeaways from these modules:
- Graceful degradation is designed, not accidental. Define degradation levels in advance, implement them with feature flags, and test them regularly. The worst time to figure out your degradation strategy is during an incident.
- Chaos engineering is scientific, not destructive. Form a hypothesis about steady state, inject a failure, observe the result. If the system maintains steady state, your confidence increases. If not, you've found a weakness to fix.
- Start chaos small and grow. Kill a pod → Kill a node → Partition a network → Drop an AZ → Lose a region. Each level requires the previous one to be solid.
- RTO/RPO drive architecture decisions. The business defines acceptable loss and downtime. The architecture implements the cheapest DR tier that meets those targets.
- Self-healing is a reconciliation loop. Desired state vs. actual state, continuously. Kubernetes does this for compute. You need to extend it to data, configuration, and external dependencies.
- Safe deployments are safe rollbacks. Blue-green and canary aren't just deployment strategies — they're recovery strategies. Every deployment should be as easy to undo as it was to do.
- Test your DR plan or it doesn't exist. An untested failover script is a fiction. Google's DiRT and Netflix's Chaos Kong show that recovery can only be trusted once it has been exercised.
Next in the Series
In Part 14: Distributed Coordination & Consistency, we'll explore the fundamental challenges of making distributed nodes agree — consensus algorithms (Raft, Paxos), leader election, quorum systems, and the full spectrum of consistency models from strong to eventual.