
Part 11: Chaos Engineering & Reliability Testing

May 14, 2026 Wasil Zafar 19 min read

Monitoring tells you when things break. Chaos engineering tells you what will break before it does. By deliberately injecting failures into production-like environments, you discover weaknesses in your system's resilience and fix them before they cause real outages.

Table of Contents

  1. Chaos Engineering Principles
  2. Designing Chaos Experiments
  3. Common Fault Types
  4. Chaos Engineering Tools
  5. Game Days
  6. Chaos Engineering Maturity Model
  7. Conclusion & Next Steps

Chaos Engineering Principles

Chaos engineering, pioneered by Netflix, is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It is not randomly breaking things — it is structured, hypothesis-driven experimentation.

The five principles from the Principles of Chaos Engineering:

  1. Build a hypothesis around steady state behaviour: Define what "normal" looks like (request rate, error rate, latency) before injecting faults
  2. Vary real-world events: Inject faults that actually happen — server crashes, network partitions, disk failures, dependency timeouts
  3. Run experiments in production: Staging environments do not have real traffic patterns, real data volumes, or real dependency chains
  4. Automate experiments to run continuously: A one-time chaos test is useful; continuous chaos testing is transformative
  5. Minimise blast radius: Start small, have a kill switch, and limit the scope of each experiment
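
The first principle is easier to apply once "normal" is written down as explicit thresholds. The following is a minimal sketch of such a check, not a prescribed implementation: the `SteadyState` class, the `violations` helper, and all threshold values are hypothetical names and numbers chosen for illustration, and real metrics would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds that define 'normal' for one service (values are illustrative)."""
    max_error_rate: float      # fraction of failed requests, e.g. 0.005 = 0.5%
    max_p99_latency_ms: float  # ceiling for 99th-percentile latency
    min_request_rate: float    # requests per second; traffic must actually be present


def violations(state: SteadyState, error_rate: float,
               p99_latency_ms: float, request_rate: float) -> list[str]:
    """Return the name of every threshold that the observed metrics break."""
    broken = []
    if error_rate > state.max_error_rate:
        broken.append("error_rate")
    if p99_latency_ms > state.max_p99_latency_ms:
        broken.append("p99_latency_ms")
    if request_rate < state.min_request_rate:
        broken.append("request_rate")
    return broken


# Hypothetical baseline for a checkout service.
checkout = SteadyState(max_error_rate=0.005, max_p99_latency_ms=500, min_request_rate=50)

print(violations(checkout, error_rate=0.002, p99_latency_ms=310, request_rate=120))
# → []
print(violations(checkout, error_rate=0.040, p99_latency_ms=900, request_rate=120))
# → ['error_rate', 'p99_latency_ms']
```

Running the same check before, during, and after a fault gives the hypothesis something concrete to be confirmed or refuted against.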

Designing Chaos Experiments

Chaos Experiment Workflow
flowchart TD
    A[Define Steady State\nRequest rate, error rate, latency baseline] --> B[Formulate Hypothesis\nSystem should tolerate X failure]
    B --> C[Design Experiment\nFault type, scope, duration, kill switch]
    C --> D[Run Experiment\nInject fault, observe metrics]
    D --> E{Hypothesis\nConfirmed?}
    E -->|Yes| F[Confidence Increased\nSystem is resilient to this failure]
    E -->|No| G[Weakness Found\nFile bug, fix, retest]
    F --> H[Increase Scope\nLarger blast radius or more faults]
    G --> H
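
The "Run Experiment" step of this workflow can be sketched as a small loop that injects the fault, watches one metric, and triggers the kill switch if a guardrail is breached. This is an illustrative sketch under stated assumptions: `run_experiment` and its parameters are hypothetical, and the `inject`, `rollback`, and `read_error_rate` callables stand in for whatever your chaos tool and monitoring stack provide.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_error_rate: Callable[[], float],
    abort_threshold: float,
    duration_s: float,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Inject a fault, poll one metric, and roll back early if it gets too bad."""
    samples = []
    aborted = False
    inject()
    try:
        elapsed = 0.0
        while elapsed < duration_s:
            rate = read_error_rate()
            samples.append(rate)
            if rate > abort_threshold:   # guardrail breached: stop the experiment now
                aborted = True
                break
            sleep(poll_interval_s)
            elapsed += poll_interval_s
    finally:
        rollback()                       # always remove the fault, even on exceptions
    return {"aborted": aborted, "samples": samples, "worst": max(samples, default=0.0)}


# Demo with fake callables: the third reading breaches the 10% guardrail.
events = []
readings = iter([0.01, 0.02, 0.30, 0.01])
result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_error_rate=lambda: next(readings),
    abort_threshold=0.10,
    duration_s=30,
    sleep=lambda seconds: None,          # skip real waiting in this demo
)
print(result["aborted"], result["samples"], events)
# → True [0.01, 0.02, 0.3] ['inject', 'rollback']
```

Putting the rollback in a `finally` block is the design choice that matters here: the fault is removed whether the experiment finishes, aborts, or crashes.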
                            
Example Experiment

Chaos Experiment: Payment Service Dependency Failure

Steady State: Order success rate > 99.5%, p99 latency < 500ms, checkout completion rate stable.

Hypothesis: When the payment provider API returns 503 errors for 30 seconds, the order service should circuit-break, return a friendly "try again" message, and queue the payment for retry. Order success rate should recover to > 99% within 60 seconds of payment API recovery.

Experiment: Inject 503 responses to 100% of payment API calls for 30 seconds using a service mesh fault injection rule.

Kill Switch: Remove the fault injection rule immediately via kubectl delete virtualservice payment-fault.

Blast Radius: Single environment (staging), single service, 30-second duration.

Tags: Circuit Breaker, Graceful Degradation, Retry Logic
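
To make the hypothesis concrete, here is a minimal simulation of a circuit breaker in front of a payment call that fails for 30 seconds. It is a sketch, not the order service's actual code: `CircuitBreaker`, `CircuitOpenError`, `PaymentUnavailable`, `charge`, and the threshold and timeout values are all hypothetical, the clock is simulated, and the retry queue from the hypothesis is left out.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class PaymentUnavailable(Exception):
    """Stands in for a 503 response from the payment provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_timeout_s: float, clock):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock        # injected so the demo needs no real waiting
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("payment unavailable, please try again")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result


sim = {"now": 0, "provider_calls": 0}


def charge():
    sim["provider_calls"] += 1
    if sim["now"] < 30:                         # fault window: provider returns 503
        raise PaymentUnavailable("503 from provider")
    return "charged"


breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=10, clock=lambda: sim["now"])
outcomes = []
for second in range(45):                        # one checkout attempt per simulated second
    sim["now"] = second
    try:
        outcomes.append(breaker.call(charge))
    except CircuitOpenError:
        outcomes.append("fast-fail")            # friendly "try again" path
    except PaymentUnavailable:
        outcomes.append("error")

print(outcomes.count("error"), outcomes.count("fast-fail"), outcomes.count("charged"))
# → 5 27 13
print("first success at t =", outcomes.index("charged"))
# → first success at t = 32
```

In this simulation the provider is only called 5 times during the 30-second fault, and the first successful charge lands 2 seconds after the provider recovers. That is the kind of measurable outcome the hypothesis should be checked against.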

Common Fault Types

| Category | Fault Type | What It Tests |
|---|---|---|
| Infrastructure | Kill a VM/container | Auto-scaling, pod scheduling, health checks |
| Infrastructure | Fill disk to 100% | Disk pressure handling, log rotation, alerts |
| Infrastructure | CPU/memory stress | Resource limits, throttling behaviour, OOMKill recovery |
| Network | Latency injection (add 500ms) | Timeout configuration, circuit breakers, user experience |
| Network | Packet loss (5-50%) | Retry logic, idempotency, connection pooling |
| Network | DNS failure | DNS caching, fallback resolution, service discovery |
| Application | Dependency returns errors | Error handling, circuit breakers, graceful degradation |
| Application | Dependency returns slowly | Timeout configuration, async patterns, backpressure |
| Application | Configuration change | Feature flag rollback, config validation, hot reload |
| State | Database primary failover | Connection retry, read replica routing, data consistency |
| State | Cache eviction (flush Redis) | Cache-miss handling, thundering herd protection |
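
The two application-level dependency faults in the table ("returns errors" and "returns slowly") can be approximated in a test harness without any infrastructure tooling. The wrapper below is a rough sketch: `with_faults` and `InjectedFault` are hypothetical names, and the error rate and latency values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Stands in for a dependency error such as an HTTP 503."""


def with_faults(func: Callable[[], T], *, error_rate: float = 0.0,
                latency_s: float = 0.0, rng: random.Random,
                sleep: Callable[[float], None] = time.sleep) -> Callable[[], T]:
    """Wrap a zero-argument call so that it is delayed and sometimes fails."""
    def wrapped() -> T:
        if latency_s > 0:
            sleep(latency_s)                                   # dependency returns slowly
        if rng.random() < error_rate:
            raise InjectedFault("injected dependency error")   # dependency returns errors
        return func()
    return wrapped


# Demo: 20% injected errors and 500ms of added latency on every call.
delays = []                      # record requested delays instead of really sleeping
flaky = with_faults(lambda: "ok", error_rate=0.2, latency_s=0.5,
                    rng=random.Random(42), sleep=delays.append)
results = []
for _ in range(1000):
    try:
        results.append(flaky())
    except InjectedFault:
        results.append("fault")

fault_fraction = results.count("fault") / len(results)
print(f"injected faults: {fault_fraction:.1%}")   # close to 20%; exact value depends on the seed
```

Passing in a seeded `random.Random` keeps a run reproducible, which helps when a failed hypothesis needs to be replayed after a fix.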

Chaos Engineering Tools

| Tool | Platform | Fault Types | Best For |
|---|---|---|---|
| Litmus Chaos | Kubernetes | Pod kill, network, CPU, disk, DNS | K8s-native chaos with CRDs |
| Chaos Mesh | Kubernetes | Pod, network, I/O, time, JVM, kernel | Advanced K8s faults (time skew, JVM) |
| Gremlin | Any (SaaS) | Full spectrum + scenarios | Enterprise teams wanting managed service |
| AWS FIS | AWS | EC2, ECS, RDS, network, AZ failure | AWS-native fault injection |
| Azure Chaos Studio | Azure | VM, AKS, network, Cosmos DB | Azure-native fault injection |
| Toxiproxy | Any | Network latency, timeout, bandwidth | Lightweight proxy-based network faults |

# Litmus Chaos: delete random pods from the order-service deployment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  engineState: active        # Kill switch: patch to "stop" to abort the experiment
  appinfo:
    appns: production
    applabel: app=order-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # Total chaos window, in seconds
            - name: CHAOS_INTERVAL
              value: "10"          # Delete a pod every 10 seconds within that window
            - name: FORCE
              value: "false"       # Graceful termination (not force kill)
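
# Chaos Mesh: inject 200ms network latency into payment-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: production
spec:
  action: delay
  mode: all                  # Affect every pod that matches the selector
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"         # Base delay added to each packet
    correlation: "100"       # Percent correlation with the previous packet's delay
    jitter: "50ms"           # Variation around the base delay
  duration: "60s"            # Fault is removed automatically after 60 seconds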

Game Days

A game day is a scheduled, team-wide chaos exercise where you simulate a real incident and practice your response. It combines chaos engineering (fault injection) with incident management (response practice).

Game Day Checklist:
  1. Pre-game: Define scenario, brief participants (not all — some should be surprised), prepare kill switches, notify stakeholders
  2. Execute: Inject the fault, let the on-call team detect and respond using normal processes
  3. Observe: Did alerts fire? Did the team follow the runbook? How long to detect, mitigate, resolve?
  4. Debrief: Run a mini post-mortem — what worked, what did not, what needs to change
  5. Action items: Update runbooks, tune alerts, fix gaps discovered during the exercise
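
The "Observe" questions become easier to compare across game days when the timings are captured as numbers. Below is a small sketch of that bookkeeping; the `response_times` helper and the timeline are made up for illustration, and each phase is measured from the moment the fault was injected.

```python
from datetime import datetime, timedelta


def response_times(injected: datetime, detected: datetime,
                   mitigated: datetime, resolved: datetime) -> dict[str, timedelta]:
    """Measure each response phase from the moment the fault was injected."""
    if not injected <= detected <= mitigated <= resolved:
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detect": detected - injected,
        "time_to_mitigate": mitigated - injected,
        "time_to_resolve": resolved - injected,
    }


t0 = datetime(2026, 5, 14, 10, 0, 0)   # made-up game day timeline
times = response_times(
    injected=t0,
    detected=t0 + timedelta(minutes=4),
    mitigated=t0 + timedelta(minutes=11),
    resolved=t0 + timedelta(minutes=38),
)
for name, delta in times.items():
    print(f"{name}: {delta.total_seconds() / 60:.0f} min")
# → time_to_detect: 4 min
# → time_to_mitigate: 11 min
# → time_to_resolve: 38 min
```

Recording these three numbers in each debrief shows whether runbook and alerting changes actually shorten the response over time.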

Chaos Engineering Maturity Model

| Level | Description | Practices |
|---|---|---|
| Level 0 | No chaos testing | Rely on testing environments and hope for the best |
| Level 1 | Ad-hoc experiments | Manual fault injection in staging, occasional game days |
| Level 2 | Systematic experiments | Documented experiments with hypotheses, run in staging regularly |
| Level 3 | Production chaos | Controlled experiments in production with small blast radius |
| Level 4 | Continuous chaos | Automated chaos experiments run continuously in CI/CD or production |

Start at Level 1: Most teams should start with simple experiments in staging: kill a pod, add network latency, fail a dependency. Move to production only once you trust your kill switches, your monitoring detects the injected fault quickly, and the team is comfortable with the process.

Conclusion & Next Steps

Chaos engineering turns unknown unknowns into known risks with mitigations. Key takeaways from Part 11:

  • Chaos engineering is hypothesis-driven experimentation, not random breakage
  • Every experiment needs a steady state hypothesis, a blast radius, and a kill switch
  • Common fault categories: infrastructure (kill VMs), network (latency, partition), application (dependency failure), state (DB failover, cache flush)
  • Tools like Litmus and Chaos Mesh bring chaos experiments to Kubernetes natively via CRDs
  • Game days combine fault injection with incident response practice
  • Start simple in staging (Level 1) and graduate to continuous production chaos (Level 4)