Part 11: Chaos Engineering & Reliability Testing

Chaos Engineering Principles

Chaos engineering, pioneered by Netflix, is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It is not randomly breaking things — it is structured, hypothesis-driven experimentation.

The five principles from the Principles of Chaos Engineering:

Build a hypothesis around steady state behaviour: Define what "normal" looks like (request rate, error rate, latency) before injecting faults
Vary real-world events: Inject faults that actually happen — server crashes, network partitions, disk failures, dependency timeouts
Run experiments in production: Staging environments do not have real traffic patterns, real data volumes, or real dependency chains
Automate experiments to run continuously: A one-time chaos test is useful; continuous chaos testing is transformative
Minimise blast radius: Start small, have a kill switch, and limit the scope of each experiment

Designing Chaos Experiments

Chaos Experiment Workflow

flowchart TD
    A[Define Steady State\nRequest rate, error rate, latency baseline] --> B[Formulate Hypothesis\nSystem should tolerate X failure]
    B --> C[Design Experiment\nFault type, scope, duration, kill switch]
    C --> D[Run Experiment\nInject fault, observe metrics]
    D --> E{Hypothesis\nConfirmed?}
    E -->|Yes| F[Confidence Increased\nSystem is resilient to this failure]
    E -->|No| G[Weakness Found\nFile bug, fix, retest]
    F --> H[Increase Scope\nLarger blast radius or more faults]
    G --> H
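
The workflow can also be read as a small control loop. The following is a minimal Python sketch, not a production runner: the `Experiment` fields, the `run_experiment` function, and the outcome strings are hypothetical names chosen for illustration, and the fault is injected through caller-supplied callables rather than any specific chaos tool.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (illustrative names)."""
    name: str
    steady_state: Callable[[], bool]  # returns True while metrics look normal
    inject_fault: Callable[[], None]  # starts the fault
    kill_switch: Callable[[], None]   # removes the fault again
    observations: int = 5             # how many times to re-check steady state
    interval_s: float = 1.0           # pause between checks, in seconds


def run_experiment(exp: Experiment) -> str:
    """Run one experiment and return a coarse outcome label."""
    # Never start from an unhealthy baseline: the result would be meaningless.
    if not exp.steady_state():
        return "skipped: system not in steady state"

    exp.inject_fault()
    try:
        for _ in range(exp.observations):
            time.sleep(exp.interval_s)
            if not exp.steady_state():
                # Hypothesis refuted: stop early to limit the blast radius.
                return "weakness found"
        return "confidence increased"
    finally:
        # Once the fault has been injected, the kill switch runs on every path.
        exp.kill_switch()
```

Putting the kill switch in a `finally` clause is a deliberate choice in this sketch: the fault is removed whether the hypothesis holds, fails, or the check itself raises.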

Example Experiment

Chaos Experiment: Payment Service Dependency Failure

Steady State: Order success rate > 99.5%, p99 latency < 500ms, checkout completion rate stable.
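
A steady state is only useful if it can be checked mechanically. As a rough sketch, the two numeric thresholds above could be encoded like this; `SteadyState`, `p99`, and `in_steady_state` are hypothetical names, the nearest-rank percentile is one of several common definitions, and the "checkout completion rate stable" condition is left out because it needs a baseline this example does not have.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SteadyState:
    """Thresholds taken from the steady-state definition of this experiment."""
    min_success_rate: float = 0.995    # order success rate must exceed 99.5%
    max_p99_latency_ms: float = 500.0  # p99 latency must stay below 500 ms


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def in_steady_state(
    successes: int,
    total: int,
    latencies_ms: Sequence[float],
    limits: SteadyState = SteadyState(),
) -> bool:
    """Return True only if both thresholds hold for the observed window."""
    if total == 0:
        return False  # no traffic: nothing can be concluded
    success_rate = successes / total
    return (
        success_rate > limits.min_success_rate
        and p99(latencies_ms) < limits.max_p99_latency_ms
    )
```

For example, `in_steady_state(996, 1000, latencies)` passes the success-rate check, while exactly 995 successes out of 1000 does not, because the threshold is strict.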

Hypothesis: When the payment provider API returns 503 errors for 30 seconds, the order service should circuit-break, return a friendly "try again" message, and queue the payment for retry. Order success rate should recover to > 99% within 60 seconds of payment API recovery.

Experiment: Inject 503 responses to 100% of payment API calls for 30 seconds using a service mesh fault injection rule.

Kill Switch: Remove the fault injection rule immediately via kubectl delete virtualservice payment-fault.

Blast Radius: Single environment (staging), single service, 30-second duration.

Circuit Breaker, Graceful Degradation, Retry Logic
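
The behaviour the hypothesis expects from the order service (circuit-break, answer with a friendly message, queue the payment for retry) can be sketched in a few lines of Python. This is an illustration under assumptions, not the order service's actual code: `CircuitBreaker`, `PaymentUnavailable`, `checkout`, `TRY_AGAIN`, and the threshold values are all hypothetical.

```python
import time
from collections import deque

TRY_AGAIN = "Payment is temporarily unavailable. Please try again shortly."


class PaymentUnavailable(Exception):
    """Hypothetical error raised when the payment provider returns 503."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let calls through as a trial.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit


def checkout(order_id, charge, breaker: CircuitBreaker, retry_queue: deque) -> str:
    """Try to charge; on failure or open circuit, degrade gracefully."""
    if breaker.allow():
        try:
            charge(order_id)
            breaker.record_success()
            return "paid"
        except PaymentUnavailable:
            breaker.record_failure()
    retry_queue.append(order_id)  # payment is retried later by a worker
    return TRY_AGAIN
```

While the circuit is open, `checkout` never calls `charge`, so the failing provider is not hit again until the cool-down has elapsed.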

Common Fault Types

Category	Fault Type	What It Tests
Infrastructure	Kill a VM/container	Auto-scaling, pod scheduling, health checks
Infrastructure	Fill disk to 100%	Disk pressure handling, log rotation, alerts
Infrastructure	CPU/memory stress	Resource limits, throttling behaviour, OOMKill recovery
Network	Latency injection (add 500ms)	Timeout configuration, circuit breakers, user experience
Network	Packet loss (5-50%)	Retry logic, idempotency, connection pooling
Network	DNS failure	DNS caching, fallback resolution, service discovery
Application	Dependency returns errors	Error handling, circuit breakers, graceful degradation
Application	Dependency returns slowly	Timeout configuration, async patterns, backpressure
Application	Configuration change	Feature flag rollback, config validation, hot reload
State	Database primary failover	Connection retry, read replica routing, data consistency
State	Cache eviction (flush Redis)	Cache-miss handling, thundering herd protection
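
Two of the application and network faults in the table, a slow dependency and a dependency that returns errors, can be approximated in-process before reaching for a dedicated tool. The wrapper below is a small Python sketch for local tests only; `with_faults` and `InjectedFault` are hypothetical names, and it is a stand-in for, not a replacement of, proxy-level or mesh-level injection.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of the real call when an error is injected."""


def with_faults(func, *, latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call is delayed and may fail with probability error_rate."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)              # latency injection
        if rng.random() < error_rate:     # random() is in [0.0, 1.0)
            raise InjectedFault(f"injected failure calling {func.__name__}")
        return func(*args, **kwargs)

    return wrapper
```

A test might wrap a client method as `with_faults(client.charge, latency_s=0.5, error_rate=0.2)` (where `client.charge` is a placeholder) and assert that the caller's timeout and retry logic still behave as intended.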

Chaos Engineering Tools

Tool	Platform	Fault Types	Best For
Litmus Chaos	Kubernetes	Pod kill, network, CPU, disk, DNS	K8s-native chaos with CRDs
Chaos Mesh	Kubernetes	Pod, network, I/O, time, JVM, kernel	Advanced K8s faults (time skew, JVM)
Gremlin	Any (SaaS)	Full spectrum + scenarios	Enterprise teams wanting managed service
AWS FIS	AWS	EC2, ECS, RDS, network, AZ failure	AWS-native fault injection
Azure Chaos Studio	Azure	VM, AKS, network, Cosmos DB	Azure-native fault injection
Toxiproxy	Any	Network latency, timeout, bandwidth	Lightweight proxy-based network faults

# Litmus Chaos — Kill a random pod in the order-service deployment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=order-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # Total experiment duration, in seconds
            - name: CHAOS_INTERVAL
              value: "10"          # Seconds between successive pod deletions
            - name: FORCE
              value: "false"       # "false" = graceful termination, "true" = force delete

# Chaos Mesh — Inject 200ms network latency to payment-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: production
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"
    correlation: "100"
    jitter: "50ms"
  duration: "60s"

Game Days

A game day is a scheduled, team-wide chaos exercise where you simulate a real incident and practice your response. It combines chaos engineering (fault injection) with incident management (response practice).

                            
Game Day Checklist:
Pre-game: Define scenario, brief participants (but not all of them; some should be surprised), prepare kill switches, notify stakeholders
Execute: Inject the fault, let the on-call team detect and respond using normal processes
Observe: Did alerts fire? Did the team follow the runbook? How long to detect, mitigate, resolve?
Debrief: Run a mini post-mortem — what worked, what did not, what needs to change
Action items: Update runbooks, tune alerts, fix gaps discovered during the exercise
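
The "how long to detect, mitigate, resolve" question in the Observe step is easier to answer in the debrief if timestamps are recorded during the exercise. A minimal Python sketch, assuming four timestamps noted by the facilitator; `response_times` and its key names are hypothetical, and all durations are measured from the moment of fault injection:

```python
from datetime import datetime, timedelta


def response_times(
    injected: datetime,
    detected: datetime,
    mitigated: datetime,
    resolved: datetime,
) -> dict[str, timedelta]:
    """Return detect/mitigate/resolve durations measured from fault injection."""
    if not injected <= detected <= mitigated <= resolved:
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detect": detected - injected,
        "time_to_mitigate": mitigated - injected,
        "time_to_resolve": resolved - injected,
    }
```

Comparing these numbers across game days gives the team a simple trend to discuss, alongside the qualitative findings from the debrief.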

                        

Chaos Engineering Maturity Model

Level	Description	Practices
Level 0	No chaos testing	Rely on testing environments and hope for the best
Level 1	Ad-hoc experiments	Manual fault injection in staging, occasional game days
Level 2	Systematic experiments	Documented experiments with hypotheses, run in staging regularly
Level 3	Production chaos	Controlled experiments in production with small blast radius
Level 4	Continuous chaos	Automated chaos experiments run continuously in CI/CD or production

                            
Start at Level 1: Most teams should start with simple experiments in staging: kill a pod, add network latency, fail a dependency. Move to production only once you trust your kill switches, your monitoring detects the fault quickly, and the team is comfortable with the process.
                        

Conclusion & Next Steps

Chaos engineering turns unknown unknowns into known risks with mitigations. Key takeaways from Part 11:

Chaos engineering is hypothesis-driven experimentation, not random breakage
Every experiment needs a steady state hypothesis, a blast radius, and a kill switch
Common fault categories: infrastructure (kill VMs), network (latency, partition), application (dependency failure), state (DB failover, cache flush)
Tools like Litmus and Chaos Mesh bring chaos experiments to Kubernetes natively via CRDs
Game days combine fault injection with incident response practice
Start simple in staging (Level 1) and graduate to continuous production chaos (Level 4)
