Introduction
Modern applications run on layers of infrastructure defined in code — Terraform modules, Kubernetes manifests, Helm charts, Dockerfiles, network policies, IAM roles. This infrastructure is as complex and bug-prone as application code, yet many teams deploy it without any automated testing.
The consequences are severe: misconfigured security groups expose databases to the internet, under-provisioned resources collapse under load, and compliance violations trigger regulatory penalties. Infrastructure testing applies the same discipline you use for application code — static analysis, unit tests, integration tests, and chaos experiments — to the platform itself.
Why Test Infrastructure
Infrastructure testing serves three purposes:
- Prevent misconfigurations before they reach production
- Verify resilience — prove the system recovers from failures
- Validate performance — ensure the platform handles expected load
Infrastructure-as-Code Testing
IaC testing follows the same pyramid as application testing: static analysis (fast, cheap) at the base, unit tests in the middle, and integration tests (slow, expensive — they actually provision infrastructure) at the top.
flowchart LR
A[Lint & Format] --> B[Static Analysis]
B --> C[Unit Tests]
C --> D[Plan Review]
D --> E[Integration Tests]
E --> F[Apply to Staging]
F --> G[Smoke Tests]
style A fill:#3B9797,color:#fff
style B fill:#3B9797,color:#fff
style C fill:#16476A,color:#fff
style D fill:#16476A,color:#fff
style E fill:#BF092F,color:#fff
style F fill:#BF092F,color:#fff
style G fill:#132440,color:#fff
Static Analysis
Static analysis tools examine IaC files without executing them. They catch syntax errors, security misconfigurations, and policy violations in seconds.
| Tool | Focus | IaC Formats | Key Feature |
|---|---|---|---|
| tflint | Terraform linting | Terraform HCL | Provider-specific rules (AWS, Azure, GCP) |
| Checkov | Security & compliance | Terraform, CloudFormation, K8s, Helm, Dockerfile | 1000+ built-in security policies |
| OPA/Rego | Custom policy enforcement | Any JSON/YAML (via conftest) | Write custom policies in Rego language |
| tfsec | Terraform security | Terraform HCL | Fast, focused on security misconfigs; development has since moved into Trivy |
| KICS | Multi-format scanning | Terraform, K8s, Docker, Ansible, CloudFormation | Broad coverage, CI-friendly |
# Run Checkov against Terraform files
checkov --directory ./terraform/ \
--framework terraform \
--output cli \
--soft-fail # Don't fail pipeline (initial adoption)
# Example output:
# Passed checks: 42, Failed checks: 3, Skipped checks: 0
# Check: CKV_AWS_24: "Ensure no security group allows ingress from 0.0.0.0/0 to port 22"
# FAILED for resource: aws_security_group.web
# File: /terraform/main.tf:45-60
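Once a team moves past the `--soft-fail` stage, the pipeline has to decide which findings actually block a merge. The sketch below is one way to do that, assuming Checkov was run with `--output json` and that the report exposes `results.failed_checks` entries with `check_id`, `resource`, and `file_path` fields. The `BLOCKING_CHECKS` set and the sample report are hypothetical and hand-written for illustration.

```python
import json

# Hypothetical allow-list of check IDs that must block the pipeline;
# everything else is reported as a warning during initial adoption.
BLOCKING_CHECKS = {"CKV_AWS_24"}


def split_findings(report: dict, blocking: set[str]) -> tuple[list[str], list[str]]:
    """Split failed checks into blocking errors and non-blocking warnings."""
    errors, warnings = [], []
    for check in report.get("results", {}).get("failed_checks", []):
        line = f'{check["check_id"]} {check["resource"]} ({check["file_path"]})'
        (errors if check["check_id"] in blocking else warnings).append(line)
    return errors, warnings


# Hand-written sample in the assumed report shape (not real Checkov output).
sample_report = json.loads("""
{
  "check_type": "terraform",
  "results": {
    "failed_checks": [
      {"check_id": "CKV_AWS_24", "resource": "aws_security_group.web",
       "file_path": "/main.tf"},
      {"check_id": "CKV_AWS_18", "resource": "aws_s3_bucket.logs",
       "file_path": "/s3.tf"}
    ]
  }
}
""")

errors, warnings = split_findings(sample_report, BLOCKING_CHECKS)
for line in warnings:
    print("WARN ", line)
for line in errors:
    print("ERROR", line)
# A CI wrapper would exit non-zero when `errors` is non-empty.
```

Growing the blocking set one check at a time lets a team tighten the gate gradually instead of switching from "never fail" to "always fail" in one step.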
Unit Testing IaC
Unit tests for IaC validate that modules produce the expected outputs given specific inputs — without actually provisioning infrastructure.
# test_vpc_module.py
# Plan-based unit test using pytest and the Terraform CLI via subprocess.
# Alternatives: Terratest (Go) or the tftest package for Python.
# Assumes `terraform init` has already been run in ./modules/vpc.
import json
import subprocess

MODULE_DIR = './modules/vpc'


def test_vpc_module_creates_three_subnets():
    """Verify VPC module creates correct number of subnets."""
    # Run terraform plan and save the binary plan; check=True fails the test on CLI errors
    subprocess.run(
        ['terraform', 'plan', '-input=false', '-out=plan.bin', '-var', 'subnet_count=3'],
        cwd=MODULE_DIR,
        capture_output=True, text=True, check=True
    )
    # Convert plan to JSON for inspection
    show_result = subprocess.run(
        ['terraform', 'show', '-json', 'plan.bin'],
        cwd=MODULE_DIR,
        capture_output=True, text=True, check=True
    )
    plan = json.loads(show_result.stdout)
    # Count planned subnet resources
    subnets = [
        r for r in plan['planned_values']['root_module'].get('resources', [])
        if r['type'] == 'aws_subnet'
    ]
    assert len(subnets) == 3, f"Expected 3 subnets, got {len(subnets)}"


def test_vpc_module_uses_correct_cidr():
    """Verify VPC uses the configured CIDR block."""
    # Similar pattern: plan → JSON → assert on values
    pass
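The "plan → JSON → assert" pattern can be tested without Terraform installed if the assertion logic is separated from the CLI call. The sketch below assumes the `planned_values` layout of `terraform show -json`, where a module holds `resources` and nested `child_modules`. The helper names and the sample plan are hypothetical.

```python
from typing import Iterator


def iter_planned_resources(module: dict) -> Iterator[dict]:
    """Yield every resource in a planned_values module, including child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_planned_resources(child)


def vpc_cidrs(plan: dict) -> list[str]:
    """Return the cidr_block of each planned aws_vpc resource."""
    root = plan["planned_values"]["root_module"]
    return [
        r["values"]["cidr_block"]
        for r in iter_planned_resources(root)
        if r["type"] == "aws_vpc"
    ]


# Hand-written fragment in the shape of `terraform show -json` output.
sample_plan = {
    "planned_values": {
        "root_module": {
            "resources": [],
            "child_modules": [
                {
                    "address": "module.vpc",
                    "resources": [
                        {"address": "module.vpc.aws_vpc.main", "type": "aws_vpc",
                         "values": {"cidr_block": "10.0.0.0/16"}},
                        {"address": "module.vpc.aws_subnet.a", "type": "aws_subnet",
                         "values": {"cidr_block": "10.0.1.0/24"}},
                    ],
                }
            ],
        }
    }
}

assert vpc_cidrs(sample_plan) == ["10.0.0.0/16"]
```

Walking `child_modules` recursively matters when the module under test is called from a wrapper configuration: its resources then no longer appear directly under `root_module`.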
Plan-Based Testing
The terraform plan output is a powerful testing artifact. It shows exactly what will change before anything is applied. Teams can write assertions against the plan to prevent dangerous operations.
# Sentinel policy (HashiCorp): prevent destroying databases
# aws_db_instance is the AWS provider's resource type for RDS instances
import "tfplan/v2" as tfplan

main = rule {
    all tfplan.resource_changes as _, rc {
        rc.type is not "aws_db_instance" or
        rc.change.actions not contains "delete"
    }
}

# OPA/Rego equivalent (conftest)
# policy/terraform.rego
package terraform

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_db_instance"
    resource.change.actions[_] == "delete"
    msg := sprintf("Deleting RDS instance %v is not allowed via automation", [resource.address])
}
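The same assertion can be written in plain Python for teams that already drive their plan checks from pytest. This is a minimal sketch, assuming the `resource_changes` list of `terraform show -json` and assuming `aws_db_instance` (the AWS provider's resource type for RDS instances) is the type to protect. The `PROTECTED_TYPES` set and the sample plan are hypothetical.

```python
PROTECTED_TYPES = {"aws_db_instance"}  # assumption: resource types to protect


def destructive_changes(plan: dict, protected: set[str]) -> list[str]:
    """Return addresses of protected resources whose planned actions include delete."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["type"] in protected and "delete" in rc["change"]["actions"]
    ]


# Hand-written fragment in the shape of `terraform show -json` output.
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.orders", "type": "aws_db_instance",
         "change": {"actions": ["delete", "create"]}},  # a replacement
        {"address": "aws_db_instance.users", "type": "aws_db_instance",
         "change": {"actions": ["update"]}},
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["delete"]}},
    ]
}

blocked = destructive_changes(sample_plan, PROTECTED_TYPES)
print(blocked)
```

Checking for `"delete"` anywhere in the action list also catches replacements, which Terraform reports as a delete paired with a create.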
Configuration Testing
Beyond Terraform, applications depend on configuration files — Kubernetes manifests, Helm charts, ConfigMaps, application YAML. These configurations should be validated before deployment.
# Kubernetes manifest validation
# 1. Schema validation (are fields correct?)
kubeconform -strict -kubernetes-version 1.28.0 ./k8s/
# 2. Helm chart linting
helm lint ./charts/my-app/ --values values-production.yaml
# 3. Policy checks with conftest
# (--namespace is needed because conftest only evaluates package "main" by default)
conftest test ./k8s/ --policy ./policy/ --namespace kubernetes
# Example policy: all containers must have resource limits
# policy/k8s.rego
package kubernetes

deny[msg] {
    container := input.spec.template.spec.containers[_]
    not container.resources.limits
    msg := sprintf("Container %v must have resource limits", [container.name])
}
# 4. Dry-run against cluster API
kubectl apply --dry-run=server -f ./k8s/deployment.yaml
The --dry-run=server flag (unlike --dry-run=client) sends the manifest to the Kubernetes API server, which validates it against admission controllers, webhooks, and CRD schemas without persisting anything. It catches issues that local validation misses, such as invalid resource references or webhook rejections.
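For quick feedback inside a unit test suite, the resource-limits rule can also be expressed as a small Python function. This is a sketch only: it assumes a Deployment-style manifest already loaded into a dict by whatever YAML or JSON loader you use, and the function name and sample manifest are hypothetical.

```python
def containers_missing_limits(manifest: dict) -> list[str]:
    """Return names of containers in a Deployment-style manifest without resource limits."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    return [
        container["name"]
        for container in pod_spec.get("containers", [])
        if not container.get("resources", {}).get("limits")
    ]


# Hand-written Deployment fragment for illustration.
deployment = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "api",
                     "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
                    {"name": "sidecar", "resources": {"requests": {"cpu": "100m"}}},
                    {"name": "worker"},
                ]
            }
        }
    },
}

print(containers_missing_limits(deployment))
```

Like the Rego rule, this only inspects `spec.template.spec.containers`; init containers and bare Pods would need their own paths.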
Chaos Engineering
Chaos engineering is the discipline of proactively injecting failures into a system to discover weaknesses before they cause real outages. It answers the question: "How does our system behave when things go wrong?"
Principles of Chaos Engineering
- Define steady state — what does "normal" look like? (e.g., p99 latency < 200ms, error rate < 0.1%)
- Hypothesise — "If we kill one database replica, the system should failover within 30 seconds with zero user-visible errors"
- Inject failure — kill the replica in a controlled manner
- Observe — did the system maintain steady state? How long was the disruption?
- Learn — document findings, fix weaknesses, update runbooks
flowchart TD
A[Define Steady State] --> B[Form Hypothesis]
B --> C[Design Experiment]
C --> D[Limit Blast Radius]
D --> E[Run Experiment]
E --> F{Steady State Maintained?}
F -->|Yes| G[Increase Scope]
F -->|No| H[Fix Weakness]
H --> I[Document & Share]
G --> I
I --> A
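The experiment loop can be sketched in a few lines of Python. This is an illustration of the control flow only, not the API of any chaos tool: `run_experiment`, `ExperimentResult`, and the in-memory `state` standing in for a real system are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    steady_before: bool
    steady_after: bool

    @property
    def passed(self) -> bool:
        return self.steady_before and self.steady_after


def run_experiment(
    hypothesis: str,
    is_steady: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, inject the failure, observe, and always roll back."""
    if not is_steady():
        # Never inject failure into a system that is already unhealthy.
        return ExperimentResult(hypothesis, steady_before=False, steady_after=False)
    try:
        inject()
        steady_after = is_steady()
    finally:
        rollback()  # limit the blast radius even if observation raises
    return ExperimentResult(hypothesis, True, steady_after)


# Toy system: steady state means at least two replicas are serving.
state = {"replicas": 3}
result = run_experiment(
    hypothesis="Losing one replica keeps at least two serving",
    is_steady=lambda: state["replicas"] >= 2,
    inject=lambda: state.update(replicas=state["replicas"] - 1),
    rollback=lambda: state.update(replicas=3),
)
print(result.passed)
```

In a real experiment `is_steady` would query monitoring (for example p99 latency and error rate) and `inject` would call the chaos tool. The two points worth keeping from the sketch are the pre-check and the unconditional rollback.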
Chaos Engineering Tools
| Tool | Environment | Failure Types | Key Feature |
|---|---|---|---|
| Chaos Monkey (Netflix) | AWS | Instance termination | Pioneer — randomly kills EC2 instances during business hours |
| Litmus (CNCF) | Kubernetes | Pod kill, network delay, disk fill, CPU stress | ChaosEngine CRD — declarative experiments |
| Chaos Mesh (PingCAP/CNCF) | Kubernetes | Pod, network, I/O, time, kernel faults | Dashboard UI, fine-grained network partitioning |
| Gremlin | Any (SaaS) | Full spectrum (resource, network, state) | Commercial platform with safety controls and GameDay orchestration |
| AWS Fault Injection Service | AWS | EC2, ECS, RDS, network | Native AWS integration, IAM-controlled blast radius |
Netflix — From Chaos Monkey to Simian Army
Netflix pioneered chaos engineering in 2010 with Chaos Monkey, which randomly terminated EC2 instances during business hours. The hypothesis: if engineers know that any instance can be killed at any time, they will build resilient architectures. It worked: Netflix's migration to AWS (2008-2016) produced one of the most resilient distributed systems ever built. Netflix then expanded the idea into the "Simian Army": Chaos Gorilla (takes out an entire AWS availability zone), Latency Monkey (injects network delays), and Conformity Monkey (finds instances that do not adhere to best practices). The key lesson: chaos engineering works because it creates organisational pressure toward resilience, not just technical improvements.
Performance Testing
Performance testing verifies that your system handles expected (and unexpected) load. It encompasses several distinct test types, each answering different questions.
| Test Type | Purpose | Duration | Load Pattern |
|---|---|---|---|
| Load test | Verify system handles expected production load | 5-30 minutes | Steady at expected peak |
| Stress test | Find the breaking point | Ramp until failure | Continuously increasing |
| Spike test | Verify recovery from sudden traffic bursts | Short bursts (1-5 min) | Normal → 10x → normal |
| Soak test | Find memory leaks, connection pool exhaustion | Hours to days | Steady at moderate load |
| Breakpoint test | Determine maximum capacity | Ramp until SLA breach | Step-wise increase with measurement |
Performance Testing Tools
| Tool | Language | Protocol Support | Best For |
|---|---|---|---|
| k6 (Grafana) | JavaScript | HTTP, WebSocket, gRPC, Browser | Developer-friendly, CI integration, scripted scenarios |
| JMeter | Java (GUI + XML) | HTTP, JDBC, LDAP, FTP, JMS, SOAP | Enterprise, complex protocols, large test plans |
| Gatling | Scala/Java | HTTP, WebSocket | High throughput, beautiful reports |
| Locust | Python | HTTP (extensible) | Python teams, simple scripting, distributed |
| Artillery | YAML + JavaScript | HTTP, WebSocket, Socket.io | Quick setup, YAML-driven scenarios |
// k6 load test example
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 100 }, // Ramp up to 100 users
{ duration: '5m', target: 100 }, // Stay at 100 users
{ duration: '2m', target: 200 }, // Ramp to 200 users
{ duration: '5m', target: 200 }, // Stay at 200 users
{ duration: '2m', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<500'], // 95% of requests under 500ms
http_req_failed: ['rate<0.01'], // Error rate under 1%
},
};
export default function () {
// Simulate user browsing products
const res = http.get('https://api.example.com/products');
check(res, {
'status is 200': (r) => r.status === 200,
'response time OK': (r) => r.timings.duration < 500,
'body has products': (r) => JSON.parse(r.body).length > 0,
});
sleep(Math.random() * 3 + 1); // Think time: 1-4 seconds
}
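To make the two thresholds concrete, here is a minimal Python sketch of what `p(95)<500` and `rate<0.01` evaluate. It uses a nearest-rank percentile as a simplification; k6's own percentile calculation may differ slightly, and the function names and sample data are hypothetical.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(durations_ms: list[float], failures: int) -> dict[str, bool]:
    """Mirror the two thresholds: p95 under 500 ms and failure rate under 1%."""
    return {
        "p95_under_500ms": percentile(durations_ms, 95) < 500,
        "error_rate_under_1pct": failures / len(durations_ms) < 0.01,
    }


# 100 requests: 95 fast ones and 5 slow outliers, with no failed requests.
durations = [100.0] * 95 + [900.0] * 5
print(evaluate(durations, failures=0))
```

The example shows why percentile thresholds tolerate a few outliers: five slow requests out of a hundred still pass `p(95)<500`, while a sixth would push the 95th percentile to 900 ms and fail the run.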
Performance Testing in CI
Running performance tests in CI pipelines enables automated regression detection. The key challenges are: where to run (dedicated infrastructure), how to set baselines, and how to define pass/fail criteria.
# GitHub Actions: k6 performance gate
name: Performance Test
on:
  push:
    branches: [main]
jobs:
  perf-test:
    runs-on: ubuntu-latest
    services:
      app:
        image: myapp:${{ github.sha }}
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/performance/load-test.js
          flags: --out json=results.json
      - name: Check thresholds
        run: |
          # k6 exits non-zero if thresholds are breached
          echo "Performance test passed - all thresholds met"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: k6-results
          path: results.json
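Fixed thresholds catch outright failures, but baselines catch slow drift. One possible pass/fail rule is to compare the current run against a stored baseline with a tolerance. The sketch below assumes both runs have been reduced to a flat dict of "lower is better" metrics; the metric names, the 10% tolerance, and the function name are hypothetical.

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.10) -> dict[str, float]:
    """Return metrics that grew by more than `tolerance` relative to the baseline.

    Assumes lower is better for every metric (latency, error rate).
    """
    worse = {}
    for name, base in baseline.items():
        if name not in current or base <= 0:
            continue  # skip metrics that cannot be compared
        change = (current[name] - base) / base
        if change > tolerance:
            worse[name] = round(change, 3)
    return worse


# Hypothetical summaries from the last good build and the current build.
baseline_run = {"p95_ms": 400.0, "p99_ms": 800.0}
current_run = {"p95_ms": 420.0, "p99_ms": 1000.0}

print(regressions(baseline_run, current_run))
```

A relative tolerance absorbs the run-to-run noise of shared CI runners. A dedicated load-generation environment allows a tighter value.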
Security Testing
Security testing in the infrastructure context spans the application code, dependencies, container images, and deployed configurations.
| Type | When | What It Tests | Tools |
|---|---|---|---|
| SAST (Static Application Security Testing) | At commit/PR | Source code vulnerabilities (SQL injection, XSS patterns) | SonarQube, Semgrep, CodeQL |
| SCA (Software Composition Analysis) | At build | Known vulnerabilities in dependencies | Snyk, Dependabot, npm audit, Trivy |
| Container Scanning | After image build | OS package vulnerabilities in Docker images | Trivy, Grype, Snyk Container |
| DAST (Dynamic Application Security Testing) | Against running app | Runtime vulnerabilities (auth bypass, SSRF) | OWASP ZAP, Burp Suite, Nuclei |
| IAST (Interactive Application Security Testing) | During test execution | Vulnerabilities discovered through instrumented runtime | Contrast Security, Hdiv |
# Trivy: Scan container image for vulnerabilities
trivy image --severity HIGH,CRITICAL \
--exit-code 1 \
myregistry.io/myapp:latest
# Trivy: Scan Kubernetes manifests for misconfigurations
trivy config --severity HIGH,CRITICAL ./k8s/
# OWASP ZAP: Automated DAST scan (baseline scan)
# The volume mount lets ZAP write the report into the current directory
docker run -v "$(pwd):/zap/wrk/:rw" -t zaproxy/zap-stable zap-baseline.py \
  -t https://staging.example.com \
  -r zap-report.html
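When the scanner's exit code is too coarse, a pipeline can parse the report itself. The sketch below assumes a Trivy report produced with `--format json`, with a top-level `Results` list whose entries may carry a `Vulnerabilities` list. The sample report is hand-written, and its vulnerability IDs are placeholders rather than real CVEs.

```python
from collections import Counter


def severity_counts(report: dict) -> Counter:
    """Count vulnerabilities per severity across all scan targets."""
    counts: Counter = Counter()
    for result in report.get("Results", []):
        # The key may be missing or null for targets without findings.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln["Severity"]] += 1
    return counts


def should_block(counts: Counter, max_critical: int = 0, max_high: int = 0) -> bool:
    """Block the pipeline if either budget is exceeded (budgets are assumptions)."""
    return counts["CRITICAL"] > max_critical or counts["HIGH"] > max_high


# Hand-written report fragment; IDs are placeholders.
sample_report = {
    "Results": [
        {"Target": "myapp (debian)", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ]},
        {"Target": "app/package-lock.json", "Vulnerabilities": None},
    ]
}

counts = severity_counts(sample_report)
print(dict(counts), should_block(counts))
```

Separate budgets per severity make it possible to start with a tolerated backlog of HIGH findings and ratchet the number down over time.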
Compliance Testing
Compliance testing automates the verification of regulatory and organisational policies. Instead of manual audits (expensive, infrequent, point-in-time), Policy-as-Code enforces compliance continuously.
Open Policy Agent (OPA)
OPA is a general-purpose policy engine that decouples policy decisions from policy enforcement. You write policies in Rego (a declarative language), and OPA evaluates them against any JSON/YAML data.
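# OPA/Rego policy: Ensure all S3 buckets have encryption enabled
# policy/s3.rego
# Note: this rule checks the inline block on aws_s3_bucket. With AWS provider v4+,
# encryption is usually declared in a separate
# aws_s3_bucket_server_side_encryption_configuration resource, so adapt the rule accordingly.
package terraform.aws

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not resource.change.after.server_side_encryption_configuration
    msg := sprintf(
        "S3 bucket %v must have server-side encryption enabled (SOC2 CC6.1)",
        [resource.address]
    )
}

# Run with conftest (--namespace is needed because the package is not "main")
# conftest test --policy ./policy/ --namespace terraform.aws tfplan.json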
Common compliance frameworks automated with Policy-as-Code:
- SOC 2 — encryption at rest, access logging, least-privilege IAM
- HIPAA — PHI encryption, audit trails, access controls
- PCI-DSS — network segmentation, vulnerability scanning, key rotation
- GDPR — data residency, retention policies, consent mechanisms
Smoke & Sanity Testing
After every deployment, a quick verification confirms the system is alive and functioning at a basic level. This is the final safety net before users hit new code.
Post-Deployment Verification
- Health checks: HTTP 200 from /health endpoints
- Smoke tests: 3-5 critical path assertions (login works, homepage loads, API responds)
- Synthetic monitoring: external probes that continuously verify availability from multiple regions
- Canary tests: lightweight tests running against the canary deployment before full rollout
#!/bin/bash
# Post-deployment smoke test script
set -e
BASE_URL="${1:-https://api.example.com}"
echo "Running smoke tests against $BASE_URL"
# Test 1: Health endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_URL/health")
[ "$STATUS" -eq 200 ] || { echo "FAIL: Health check returned $STATUS"; exit 1; }
echo "PASS: Health endpoint OK"
# Test 2: API responds with valid JSON
RESPONSE=$(curl -s "$BASE_URL/api/v1/status")
echo "$RESPONSE" | jq -e '.version' > /dev/null || { echo "FAIL: Invalid API response"; exit 1; }
echo "PASS: API status OK (version: $(echo "$RESPONSE" | jq -r '.version'))"
# Test 3: Database connectivity (via app health)
DB_STATUS=$(curl -s "$BASE_URL/health/db" | jq -r '.status')
[ "$DB_STATUS" = "healthy" ] || { echo "FAIL: Database unhealthy"; exit 1; }
echo "PASS: Database connectivity OK"
echo "All smoke tests passed!"
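The same checks can live in Python when the deployment tooling is already Python-based. This sketch keeps the three endpoints from the script above and injects the HTTP call, so the logic can be unit-tested without a network. The function names and the `fake_fetch` stub are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import Callable

Fetch = Callable[[str], tuple[int, str]]


def http_fetch(url: str) -> tuple[int, str]:
    """Default fetcher: GET the URL and return (status, body)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, ""  # urlopen raises on 4xx/5xx; report the status instead


def run_smoke_tests(base_url: str, fetch: Fetch = http_fetch) -> list[str]:
    """Return failure messages; an empty list means every check passed."""
    failures = []
    status, _ = fetch(f"{base_url}/health")
    if status != 200:
        failures.append(f"health returned {status}")

    _, body = fetch(f"{base_url}/api/v1/status")
    try:
        if "version" not in json.loads(body):
            failures.append("status response has no version")
    except json.JSONDecodeError:
        failures.append("status response is not valid JSON")

    _, body = fetch(f"{base_url}/health/db")
    try:
        if json.loads(body).get("status") != "healthy":
            failures.append("database unhealthy")
    except json.JSONDecodeError:
        failures.append("db health response is not valid JSON")
    return failures


def fake_fetch(url: str) -> tuple[int, str]:
    """Stub standing in for a healthy deployment."""
    responses = {
        "/health": (200, "ok"),
        "/api/v1/status": (200, '{"version": "1.2.3"}'),
        "/health/db": (200, '{"status": "healthy"}'),
    }
    return responses[url.removeprefix("https://api.example.com")]


print(run_smoke_tests("https://api.example.com", fake_fetch))
```

Unlike the shell version, which stops at the first failure because of `set -e`, this variant collects every failure, which tends to shorten diagnosis after a bad deploy.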
Google — Canary Analysis at Scale
Google's deployment system (Borg/Kubernetes) uses automated canary analysis for every production deployment. A new version is deployed to 1% of traffic. An automated system compares metrics (latency, error rate, CPU usage) between the canary and the baseline; Kayenta, the canary analysis service that Google and Netflix developed jointly and open-sourced as part of Spinnaker, implements this approach. If metrics diverge beyond a threshold, the canary is rolled back automatically, with no human intervention required. Only after the analysis confirms the canary is healthy does traffic shift gradually to 5%, 25%, 50%, and finally 100%. This system reportedly catches approximately 90% of bad deployments before they affect more than 1% of users.
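The promote-or-roll-back loop can be sketched as follows. The comparison here is a deliberately simple ratio check standing in for real canary analysis, which relies on statistical tests; it is not Kayenta's algorithm. The metric names, the 1.2 ratio, and the `observe` callbacks are hypothetical.

```python
from typing import Callable

ROLLOUT_STEPS = [1, 5, 25, 50, 100]  # percent of traffic, as in the case study


def canary_healthy(baseline: dict[str, float], canary: dict[str, float],
                   max_ratio: float = 1.2) -> bool:
    """Healthy if no canary metric exceeds max_ratio times its baseline value."""
    return all(canary[name] <= base * max_ratio for name, base in baseline.items())


def rollout(baseline: dict[str, float],
            observe: Callable[[int], dict[str, float]]) -> int:
    """Step through ROLLOUT_STEPS; return final traffic percent (0 means rolled back)."""
    for percent in ROLLOUT_STEPS:
        if not canary_healthy(baseline, observe(percent)):
            return 0  # automatic rollback, no human in the loop
    return 100


baseline_metrics = {"latency_ms": 100.0, "error_rate": 0.01}


def healthy_canary(percent: int) -> dict[str, float]:
    return {"latency_ms": 105.0, "error_rate": 0.01}


def degrades_under_load(percent: int) -> dict[str, float]:
    # Latency regression that only shows up once the canary takes real traffic.
    latency = 105.0 if percent < 25 else 180.0
    return {"latency_ms": latency, "error_rate": 0.01}


print(rollout(baseline_metrics, healthy_canary))
print(rollout(baseline_metrics, degrades_under_load))
```

The second scenario illustrates why traffic shifts in steps: some regressions are invisible at 1% and only appear once the canary carries a meaningful share of load.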
Conclusion & Next Steps
Infrastructure testing is the discipline that separates "it works on my machine" from "it works reliably in production at scale." The key takeaways:
- IaC deserves the full testing pyramid — lint, static analysis, unit tests, plan assertions, integration tests
- Chaos engineering builds confidence through controlled failure injection — start small, increase scope gradually
- Performance testing prevents surprises — automated gates catch regressions before users notice
- Security scanning shifts left — catch vulnerabilities at commit time, not in production
- Compliance-as-Code replaces manual audits with continuous automated verification
Next in the Series
In Part 23: Software Supply Chain Security, we examine the most critical security frontier — securing the path from source code to production. From SolarWinds to Log4Shell, learn SBOM generation, the SLSA framework, Sigstore signing, and dependency management strategies.