Chaos Engineering Principles
Chaos engineering, pioneered by Netflix, is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It is not randomly breaking things — it is structured, hypothesis-driven experimentation.
The Principles of Chaos Engineering document lists five principles:
- Build a hypothesis around steady state behaviour: Define what "normal" looks like (request rate, error rate, latency) before injecting faults
- Vary real-world events: Inject faults that actually happen — server crashes, network partitions, disk failures, dependency timeouts
- Run experiments in production: Staging environments do not have real traffic patterns, real data volumes, or real dependency chains
- Automate experiments to run continuously: A one-off experiment only validates the system as it was on that day; automated, recurring experiments catch regressions as code, configuration, and dependencies change
- Minimise blast radius: Start small, have a kill switch, and limit the scope of each experiment
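The first principle only works if "normal" is written down as something measurable. As a minimal sketch, assuming the services are scraped by Prometheus and export a counter named `http_requests_total` (with `service` and `status` labels) and a histogram named `http_request_duration_seconds`, the three steady-state signals could be captured as recording rules. Prometheus itself, and every metric and label name here, is an assumption for illustration rather than something this article prescribes.

```yaml
# Prometheus rules file (sketch). Metric and label names are hypothetical;
# substitute whatever your services actually export.
groups:
  - name: steady-state-baseline
    rules:
      # Request rate per service, averaged over 5 minutes
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      # Fraction of requests that returned a 5xx status
      - record: service:http_requests_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      # 99th-percentile latency in seconds, estimated from histogram buckets
      - record: service:http_request_duration_seconds:p99_rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Recording these series before an experiment gives the hypothesis a concrete baseline to compare against while the fault is active and after it is removed.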
Designing Chaos Experiments
```mermaid
flowchart TD
    A[Define Steady State\nRequest rate, error rate, latency baseline] --> B[Formulate Hypothesis\nSystem should tolerate X failure]
    B --> C[Design Experiment\nFault type, scope, duration, kill switch]
    C --> D[Run Experiment\nInject fault, observe metrics]
    D --> E{Hypothesis\nConfirmed?}
    E -->|Yes| F[Confidence Increased\nSystem is resilient to this failure]
    E -->|No| G[Weakness Found\nFile bug, fix, retest]
    F --> H[Increase Scope\nLarger blast radius or more faults]
    G --> H
```
Chaos Experiment: Payment Service Dependency Failure
Steady State: Order success rate > 99.5%, p99 latency < 500ms, checkout completion rate stable.
Hypothesis: When the payment provider API returns 503 errors for 30 seconds, the order service should circuit-break, return a friendly "try again" message, and queue the payment for retry. Order success rate should recover to > 99% within 60 seconds of payment API recovery.
Experiment: Inject 503 responses to 100% of payment API calls for 30 seconds using a service mesh fault injection rule.
Kill Switch: Remove the fault injection rule immediately via kubectl delete virtualservice payment-fault.
Blast Radius: Single environment (staging), single service, 30-second duration.
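The fault injection rule in this experiment could look roughly like the following, assuming the service mesh is Istio (which the `virtualservice` resource in the kill switch suggests, although the article does not name the mesh). The host `api.payment-provider.example` and the `staging` namespace are hypothetical placeholders. The sketch also assumes the provider is already registered in the mesh (for an external API, typically via a ServiceEntry) and that the sidecar sees the call as plain HTTP, since HTTP faults cannot be injected into TLS traffic the proxy merely passes through.

```yaml
# Istio VirtualService (sketch): abort 100% of payment API calls with HTTP 503.
# Host and namespace are hypothetical; adjust to your environment.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-fault            # matches the name used by the kill switch
  namespace: staging
spec:
  hosts:
    - api.payment-provider.example
  http:
    - fault:
        abort:
          percentage:
            value: 100.0         # percentage of requests to abort
          httpStatus: 503        # status code returned to the caller
      route:
        - destination:
            host: api.payment-provider.example
```

Istio does not expire the rule on its own, so the 30-second window has to be enforced by whoever (or whatever script) applies it, by deleting the resource when time is up. If the rule lives in a namespace other than the current kubectl default, the kill switch command needs a matching `-n` flag.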
Common Fault Types
| Category | Fault Type | What It Tests |
|---|---|---|
| Infrastructure | Kill a VM/container | Auto-scaling, pod scheduling, health checks |
| Infrastructure | Fill disk to 100% | Disk pressure handling, log rotation, alerts |
| Infrastructure | CPU/memory stress | Resource limits, throttling behaviour, OOMKill recovery |
| Network | Latency injection (add 500ms) | Timeout configuration, circuit breakers, user experience |
| Network | Packet loss (5-50%) | Retry logic, idempotency, connection pooling |
| Network | DNS failure | DNS caching, fallback resolution, service discovery |
| Application | Dependency returns errors | Error handling, circuit breakers, graceful degradation |
| Application | Dependency returns slowly | Timeout configuration, async patterns, backpressure |
| Application | Configuration change | Feature flag rollback, config validation, hot reload |
| State | Database primary failover | Connection retry, read replica routing, data consistency |
| State | Cache eviction (flush Redis) | Cache-miss handling, thundering herd protection |
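To make one row of the table concrete, the CPU/memory stress fault could be expressed with Chaos Mesh (one of the tools covered in the next section) along these lines. This is a sketch under assumptions: the `staging` namespace, the `app: order-service` label, and the sizes are illustrative values, not taken from a real deployment.

```yaml
# Chaos Mesh StressChaos (sketch): memory pressure on one order-service pod.
# Namespace, label, and sizes are illustrative assumptions.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: order-service-memory-stress
  namespace: staging
spec:
  mode: one                  # target a single matching pod to limit blast radius
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: order-service
  stressors:
    memory:
      workers: 2             # number of stress worker threads
      size: "256MB"          # total memory the stressor tries to occupy
  duration: "60s"            # stress is removed after one minute
```

If the requested size exceeds the container's memory limit, the pod is likely to be OOM-killed, which is exactly the recovery path the table row is meant to exercise. Choose a smaller size if the goal is to observe behaviour under pressure without a restart.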
Chaos Engineering Tools
| Tool | Platform | Fault Types | Best For |
|---|---|---|---|
| Litmus Chaos | Kubernetes | Pod kill, network, CPU, disk, DNS | K8s-native chaos with CRDs |
| Chaos Mesh | Kubernetes | Pod, network, I/O, time, JVM, kernel | Advanced K8s faults (time skew, JVM) |
| Gremlin | Any (SaaS) | Full spectrum + scenarios | Enterprise teams wanting managed service |
| AWS FIS | AWS | EC2, ECS, RDS, network, AZ failure | AWS-native fault injection |
| Azure Chaos Studio | Azure | VM, AKS, network, Cosmos DB | Azure-native fault injection |
| Toxiproxy | Any | Network latency, timeout, bandwidth | Lightweight proxy-based network faults |
```yaml
# Litmus Chaos: delete pods from the order-service deployment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  engineState: active              # set to "stop" to abort the experiment
  appinfo:
    appns: production
    applabel: app=order-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # Total chaos window, in seconds
            - name: CHAOS_INTERVAL
              value: "10"          # Seconds between successive pod deletions
            - name: FORCE
              value: "false"       # Graceful termination (not force kill)
```
```yaml
# Chaos Mesh: inject 200ms network latency to payment-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: production
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"
    correlation: "100"
    jitter: "50ms"
  duration: "60s"
```
Game Days
A game day is a scheduled, team-wide chaos exercise where you simulate a real incident and practice your response. It combines chaos engineering (fault injection) with incident management (response practice).
- Pre-game: Define the scenario, prepare kill switches, and notify stakeholders. Brief only some of the participants, so that detection and response by the rest are tested realistically
- Execute: Inject the fault, let the on-call team detect and respond using normal processes
- Observe: Did alerts fire? Did the team follow the runbook? How long to detect, mitigate, resolve?
- Debrief: Run a mini post-mortem — what worked, what did not, what needs to change
- Action items: Update runbooks, tune alerts, fix gaps discovered during the exercise
Chaos Engineering Maturity Model
| Level | Description | Practices |
|---|---|---|
| Level 0 | No chaos testing | Rely on testing environments and hope for the best |
| Level 1 | Ad-hoc experiments | Manual fault injection in staging, occasional game days |
| Level 2 | Systematic experiments | Documented experiments with hypotheses, run in staging regularly |
| Level 3 | Production chaos | Controlled experiments in production with small blast radius |
| Level 4 | Continuous chaos | Automated chaos experiments run continuously in CI/CD or production |
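One way to approach Level 4 on Kubernetes is Chaos Mesh's `Schedule` resource, which re-creates an experiment on a cron schedule. The sketch below wraps a latency experiment like the one shown earlier. The hourly cadence, the `staging` namespace, and the choice of `mode: one` are assumptions made for illustration.

```yaml
# Chaos Mesh Schedule (sketch): run a 60-second latency experiment every hour.
# Cadence, namespace, and mode are illustrative assumptions.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: payment-latency-hourly
  namespace: staging
spec:
  schedule: "0 * * * *"          # cron syntax: minute 0 of every hour
  concurrencyPolicy: Forbid      # do not start a run while one is still active
  historyLimit: 5                # number of past runs to retain
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: one                    # a single matching pod per run
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: payment-service
    delay:
      latency: "200ms"
      jitter: "50ms"
    duration: "60s"
```

Deleting the `Schedule` object stops future runs, which makes it the kill switch for the recurring experiment. Pairing each run with an automated steady-state check is what turns a scheduled fault into a continuous experiment rather than background noise.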
Conclusion & Next Steps
Chaos engineering turns unknown unknowns into known risks with mitigations. Key takeaways from Part 11:
- Chaos engineering is hypothesis-driven experimentation, not random breakage
- Every experiment needs a steady state hypothesis, a blast radius, and a kill switch
- Common fault categories: infrastructure (VM/container kills, disk fill, resource stress), network (latency, packet loss, DNS failure), application (dependency errors or slowness, configuration changes), state (DB failover, cache flush)
- Tools like Litmus and Chaos Mesh bring chaos experiments to Kubernetes natively via CRDs
- Game days combine fault injection with incident response practice
- Start simple in staging (Level 1) and graduate to continuous production chaos (Level 4)