
Part 22: Infrastructure & Platform Testing

May 13, 2026 Wasil Zafar 38 min read

Application code is only half the system. Your infrastructure — Terraform modules, Kubernetes manifests, network policies — deserves the same test discipline. Master IaC testing, chaos engineering, performance testing, and security scanning for platform reliability.

Table of Contents

  1. Introduction
  2. IaC Testing
  3. Configuration Testing
  4. Chaos Engineering
  5. Performance Testing
  6. Performance in CI
  7. Security Testing
  8. Compliance Testing
  9. Smoke & Sanity Testing
  10. Exercises
  11. Conclusion & Next Steps

Introduction

Modern applications run on layers of infrastructure defined in code — Terraform modules, Kubernetes manifests, Helm charts, Dockerfiles, network policies, IAM roles. This infrastructure is as complex and bug-prone as application code, yet many teams deploy it without any automated testing.

The consequences are severe: misconfigured security groups expose databases to the internet, under-provisioned resources collapse under load, and compliance violations trigger regulatory penalties. Infrastructure testing applies the same discipline you use for application code — static analysis, unit tests, integration tests, and chaos experiments — to the platform itself.

Why Test Infrastructure

Key Insight: Infrastructure bugs are more expensive than application bugs. A misconfigured firewall rule can expose every service. A wrong Terraform variable can delete a production database. A missing resource limit can let a single pod consume all cluster memory. The blast radius of infrastructure mistakes is typically the entire platform.

Infrastructure testing serves three purposes:

  • Prevent misconfigurations before they reach production
  • Verify resilience — prove the system recovers from failures
  • Validate performance — ensure the platform handles expected load

Infrastructure-as-Code Testing

IaC testing follows the same pyramid as application testing: static analysis (fast, cheap) at the base, unit tests in the middle, and integration tests (slow, expensive — they actually provision infrastructure) at the top.

IaC Testing Pipeline
flowchart LR
    A[Lint & Format] --> B[Static Analysis]
    B --> C[Unit Tests]
    C --> D[Plan Review]
    D --> E[Integration Tests]
    E --> F[Apply to Staging]
    F --> G[Smoke Tests]
    style A fill:#3B9797,color:#fff
    style B fill:#3B9797,color:#fff
    style C fill:#16476A,color:#fff
    style D fill:#16476A,color:#fff
    style E fill:#BF092F,color:#fff
    style F fill:#BF092F,color:#fff
    style G fill:#132440,color:#fff
                            

Static Analysis

Static analysis tools examine IaC files without executing them. They catch syntax errors, security misconfigurations, and policy violations in seconds.

Tool | Focus | IaC Formats | Key Feature
tflint | Terraform linting | Terraform HCL | Provider-specific rules (AWS, Azure, GCP)
Checkov | Security & compliance | Terraform, CloudFormation, K8s, Helm, Dockerfile | 1000+ built-in security policies
OPA/Rego | Custom policy enforcement | Any JSON/YAML (via conftest) | Write custom policies in Rego language
tfsec | Terraform security | Terraform HCL | Fast, focused on security misconfigs
KICS | Multi-format scanning | Terraform, K8s, Docker, Ansible, CloudFormation | Broad coverage, CI-friendly

# Run Checkov against Terraform files
checkov --directory ./terraform/ \
  --framework terraform \
  --output cli \
  --soft-fail  # Don't fail pipeline (initial adoption)

# Example output:
# Passed checks: 42, Failed checks: 3, Skipped checks: 0
# Check: CKV_AWS_24: "Ensure no security group allows ingress from 0.0.0.0/0 to port 22"
#   FAILED for resource: aws_security_group.web
#   File: /terraform/main.tf:45-60
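
To make the idea of a static rule concrete, here is a minimal Python sketch of the kind of check that CKV_AWS_24 represents. It is not Checkov's implementation. It assumes the resources have already been loaded as dictionaries shaped like the `planned_values` entries of `terraform show -json`, and the function name `find_open_ssh` and the sample data are hypothetical.

```python
from typing import Any

OPEN_CIDR = "0.0.0.0/0"
SSH_PORT = 22


def find_open_ssh(resources: list[dict[str, Any]]) -> list[str]:
    """Return addresses of security groups that allow SSH from anywhere."""
    offenders = []
    for res in resources:
        if res.get("type") != "aws_security_group":
            continue
        for rule in res.get("values", {}).get("ingress") or []:
            covers_ssh = rule.get("from_port", 0) <= SSH_PORT <= rule.get("to_port", 0)
            world_open = OPEN_CIDR in (rule.get("cidr_blocks") or [])
            if covers_ssh and world_open:
                offenders.append(res.get("address", "<unknown>"))
                break  # one finding per resource is enough
    return offenders


# Hypothetical sample data for illustration
sample = [
    {"address": "aws_security_group.web", "type": "aws_security_group",
     "values": {"ingress": [{"from_port": 22, "to_port": 22,
                             "cidr_blocks": ["0.0.0.0/0"]}]}},
    {"address": "aws_security_group.db", "type": "aws_security_group",
     "values": {"ingress": [{"from_port": 5432, "to_port": 5432,
                             "cidr_blocks": ["10.0.0.0/16"]}]}},
]
print(find_open_ssh(sample))  # ['aws_security_group.web']
```

The sketch ignores IPv6 ranges and "all traffic" rules, which is one reason a maintained rule set is usually preferable to hand-written checks.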

Unit Testing IaC

Unit tests for IaC validate that modules produce the expected outputs given specific inputs — without actually provisioning infrastructure.

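# test_vpc_module.py
# Plan-based module tests with pytest, calling the Terraform CLI via subprocess.
# (Terratest is the Go equivalent; tftest wraps the same workflow for Python.)
# Note: planning against a cloud provider may still need valid credentials.
import json
import subprocess

MODULE_DIR = './modules/vpc'

def plan_as_json(*tf_vars):
    """Run terraform plan for the module and return the plan as a dict."""
    subprocess.run(
        ['terraform', 'init', '-input=false'],
        cwd=MODULE_DIR, check=True, capture_output=True, text=True
    )
    var_args = [arg for v in tf_vars for arg in ('-var', v)]
    subprocess.run(
        ['terraform', 'plan', '-input=false', '-out=plan.bin', *var_args],
        cwd=MODULE_DIR, check=True, capture_output=True, text=True
    )
    # Convert the binary plan to JSON for inspection
    show_result = subprocess.run(
        ['terraform', 'show', '-json', 'plan.bin'],
        cwd=MODULE_DIR, check=True, capture_output=True, text=True
    )
    return json.loads(show_result.stdout)

def planned_resources(plan, resource_type):
    """Return the planned root-module resources of the given type."""
    resources = plan['planned_values']['root_module'].get('resources', [])
    return [r for r in resources if r['type'] == resource_type]

def test_vpc_module_creates_three_subnets():
    """Verify VPC module creates correct number of subnets."""
    plan = plan_as_json('subnet_count=3')
    subnets = planned_resources(plan, 'aws_subnet')
    assert len(subnets) == 3, f"Expected 3 subnets, got {len(subnets)}"

def test_vpc_module_uses_correct_cidr():
    """Verify VPC uses the configured CIDR block."""
    # 'vpc_cidr' is an assumed variable name; use the one your module exposes
    plan = plan_as_json('vpc_cidr=10.0.0.0/16')
    vpcs = planned_resources(plan, 'aws_vpc')
    assert [v['values']['cidr_block'] for v in vpcs] == ['10.0.0.0/16']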

Plan-Based Testing

The terraform plan output is a powerful testing artifact. It shows exactly what will change before anything is applied. Teams can write assertions against the plan to prevent dangerous operations.

# Sentinel policy (HashiCorp): prevent destroying databases
# (aws_db_instance is the AWS provider's resource type for RDS instances)
import "tfplan/v2" as tfplan

main = rule {
    all tfplan.resource_changes as _, rc {
        rc.type is not "aws_db_instance" or
        rc.change.actions not contains "delete"   # also catches replace (delete + create)
    }
}

# OPA/Rego equivalent (conftest)
# policy/terraform.rego
# Run with: conftest test --policy ./policy/ --namespace terraform tfplan.json
package terraform

import rego.v1

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_db_instance"
    resource.change.actions[_] == "delete"
    msg := sprintf("Deleting RDS instance %v is not allowed via automation", [resource.address])
}
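
The same guard can also live in an ordinary test suite. The following is a minimal Python sketch, assuming the plan has been exported with `terraform show -json`; the function name `destructive_changes`, the set of protected types, and the inline sample plan are illustrative assumptions, not part of any tool.

```python
import json

# Assumed set of resource types that must never be deleted by automation
PROTECTED_TYPES = {"aws_db_instance", "aws_rds_cluster"}


def destructive_changes(plan: dict, protected=PROTECTED_TYPES) -> list[str]:
    """List protected resources whose planned actions include a delete.

    A replace shows up as ["delete", "create"] (or the reverse), so checking
    for membership catches replacements as well as plain deletes.
    """
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["type"] in protected and "delete" in rc["change"]["actions"]
    ]


# Hypothetical plan fragment for illustration
plan = json.loads("""
{"resource_changes": [
  {"address": "aws_db_instance.orders", "type": "aws_db_instance",
   "change": {"actions": ["delete", "create"]}},
  {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
   "change": {"actions": ["update"]}}
]}
""")
print(destructive_changes(plan))  # ['aws_db_instance.orders']
```

In a pipeline, a non-empty result would typically fail the job and require a manual override.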

Configuration Testing

Beyond Terraform, applications depend on configuration files — Kubernetes manifests, Helm charts, ConfigMaps, application YAML. These configurations should be validated before deployment.

# Kubernetes manifest validation
# 1. Schema validation (are fields correct?)
kubeconform -strict -kubernetes-version 1.28.0 ./k8s/

# 2. Helm chart linting
helm lint ./charts/my-app/ --values values-production.yaml

# 3. Policy checks with conftest
#    conftest only evaluates package "main" by default, so name the namespace
conftest test ./k8s/ --policy ./policy/ --namespace kubernetes
# Example policy: all containers must have resource limits
# policy/k8s.rego
package kubernetes

import rego.v1

deny contains msg if {
    container := input.spec.template.spec.containers[_]
    not container.resources.limits
    msg := sprintf("Container %v must have resource limits", [container.name])
}

# 4. Dry-run against cluster API
kubectl apply --dry-run=server -f ./k8s/deployment.yaml
Key Insight: The --dry-run=server flag (not --dry-run=client) sends the manifest to the Kubernetes API server for validation against admission controllers, webhooks, and CRDs. It catches issues that local validation misses — like invalid resource references or webhook rejections.
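
For teams that keep such checks alongside their application tests, the resource-limit rule can be sketched in Python as well. This is an illustrative sketch only: it assumes the manifest has already been parsed into a dictionary (for example by a YAML loader), and `containers_missing_limits` and the sample Deployment are hypothetical.

```python
def containers_missing_limits(manifest: dict) -> list[str]:
    """Return names of containers in a pod template that lack resource limits."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    # Init containers can exhaust node resources too, so check both lists
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            limits = (container.get("resources") or {}).get("limits")
            if not limits:
                missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical Deployment for illustration
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
        {"name": "sidecar", "resources": {"requests": {"cpu": "100m"}}},
    ]}}},
}
print(containers_missing_limits(deployment))  # ['sidecar']
```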

Chaos Engineering

Chaos engineering is the discipline of proactively injecting failures into a system to discover weaknesses before they cause real outages. It answers the question: "How does our system behave when things go wrong?"

Principles of Chaos Engineering

  1. Define steady state — what does "normal" look like? (e.g., p99 latency < 200ms, error rate < 0.1%)
  2. Hypothesise — "If we kill one database replica, the system should failover within 30 seconds with zero user-visible errors"
  3. Inject failure — kill the replica in a controlled manner
  4. Observe — did the system maintain steady state? How long was the disruption?
  5. Learn — document findings, fix weaknesses, update runbooks
Chaos Engineering Experiment Flow
flowchart TD
    A[Define Steady State] --> B[Form Hypothesis]
    B --> C[Design Experiment]
    C --> D[Limit Blast Radius]
    D --> E[Run Experiment]
    E --> F{Steady State Maintained?}
    F -->|Yes| G[Increase Scope]
    F -->|No| H[Fix Weakness]
    H --> I[Document & Share]
    G --> I
    I --> A
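
The experiment loop can be sketched as a small Python harness. This is a simplified illustration under stated assumptions: `measure`, `inject`, and `rollback` are hypothetical callables that you would wire to your monitoring system and fault-injection tool, and the thresholds mirror the steady-state example given earlier (p99 latency under 200 ms, error rate under 0.1%).

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    max_p99_ms: float
    max_error_rate: float

    def holds(self, p99_ms: float, error_rate: float) -> bool:
        return p99_ms < self.max_p99_ms and error_rate < self.max_error_rate


def run_experiment(
    steady: SteadyState,
    measure: Callable[[], tuple[float, float]],  # returns (p99_ms, error_rate)
    inject: Callable[[], None],
    rollback: Callable[[], None],
    observations: int = 5,
    interval_s: float = 0.0,
) -> bool:
    """Return True if steady state held for every observation after injection."""
    if not steady.holds(*measure()):
        raise RuntimeError("Not in steady state; aborting before injection")
    inject()
    try:
        for _ in range(observations):
            if not steady.holds(*measure()):
                return False  # hypothesis refuted: fix the weakness
            time.sleep(interval_s)
        return True  # hypothesis held: consider increasing scope
    finally:
        rollback()  # always undo the fault, whether the run passed or failed


# Simulated readings: healthy, healthy, then degraded after the fault
readings = iter([(120.0, 0.0005), (150.0, 0.0008), (480.0, 0.02)])
events = []
ok = run_experiment(
    SteadyState(max_p99_ms=200, max_error_rate=0.001),
    measure=lambda: next(readings),
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
)
print(ok, events)  # False ['inject', 'rollback']
```

Checking steady state before injecting and rolling back in a `finally` block are two simple ways to limit blast radius in code.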

Chaos Engineering Tools

Tool | Environment | Failure Types | Key Feature
Chaos Monkey (Netflix) | AWS | Instance termination | The pioneer: randomly kills EC2 instances during business hours
Litmus (CNCF) | Kubernetes | Pod kill, network delay, disk fill, CPU stress | ChaosEngine CRD for declarative experiments
Chaos Mesh (PingCAP/CNCF) | Kubernetes | Pod, network, I/O, time, kernel faults | Dashboard UI, fine-grained network partitioning
Gremlin | Any (SaaS) | Full spectrum (resource, network, state) | Commercial platform with safety controls and GameDay orchestration
AWS Fault Injection Service | AWS | EC2, ECS, RDS, network | Native AWS integration, IAM-controlled blast radius

Case Study

Netflix — From Chaos Monkey to Simian Army

Netflix pioneered chaos engineering in 2010 with Chaos Monkey, which randomly terminated EC2 instances during business hours. The hypothesis: if engineers know their instances can be killed at any time, they will build resilient architectures. It worked: Netflix's migration to AWS (2008-2016) produced one of the most resilient distributed systems ever built. They expanded to the "Simian Army": Chaos Gorilla (takes out an entire AWS availability zone), Latency Monkey (injects network delays), and Conformity Monkey (finds instances not adhering to best practices). The key lesson: chaos engineering works because it creates organisational pressure toward resilience, not just technical improvements.


Performance Testing

Performance testing verifies that your system handles expected (and unexpected) load. It encompasses several distinct test types, each answering different questions.

Test Type | Purpose | Duration | Load Pattern
Load test | Verify system handles expected production load | 5-30 minutes | Steady at expected peak
Stress test | Find the breaking point | Ramp until failure | Continuously increasing
Spike test | Verify recovery from sudden traffic bursts | Short bursts (1-5 min) | Normal → 10x → normal
Soak test | Find memory leaks, connection pool exhaustion | Hours to days | Steady at moderate load
Breakpoint test | Determine maximum capacity | Ramp until SLA breach | Step-wise increase with measurement
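
The load patterns in the table differ mainly in the shape of the ramp. As a rough sketch, the Python helpers below generate stage lists in the `duration`/`target` form that k6 uses for its `stages` option. The function names and the specific durations are assumptions chosen to fall within the ranges above, not recommended values.

```python
import json


def to_k6(stages: list[tuple[str, int]]) -> list[dict]:
    """Convert (duration, target) pairs to k6-style stage objects."""
    return [{"duration": d, "target": t} for d, t in stages]


def load_profile(peak: int) -> list[dict]:
    """Load test: ramp to expected peak, hold steady, ramp down."""
    return to_k6([("2m", peak), ("20m", peak), ("2m", 0)])


def spike_profile(normal: int, factor: int = 10) -> list[dict]:
    """Spike test: normal load, sudden burst to factor x normal, back to normal."""
    burst = normal * factor
    return to_k6([("2m", normal), ("10s", burst), ("3m", burst),
                  ("10s", normal), ("2m", normal)])


def breakpoint_profile(start: int, step: int, levels: int) -> list[dict]:
    """Breakpoint test: step-wise increase, holding each level for measurement."""
    stages = []
    for level in range(levels):
        target = start + level * step
        stages += [("1m", target), ("4m", target)]  # ramp to level, then hold
    return to_k6(stages)


print(json.dumps(spike_profile(50), indent=2))
```

Generating profiles from a few parameters keeps the different test types consistent with each other when the expected peak changes.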

Performance Testing Tools

Tool | Language | Protocol Support | Best For
k6 (Grafana) | JavaScript | HTTP, WebSocket, gRPC, Browser | Developer-friendly, CI integration, scripted scenarios
JMeter | Java (GUI + XML) | HTTP, JDBC, LDAP, FTP, JMS, SOAP | Enterprise, complex protocols, large test plans
Gatling | Scala/Java | HTTP, WebSocket | High throughput, beautiful reports
Locust | Python | HTTP (extensible) | Python teams, simple scripting, distributed
Artillery | YAML + JavaScript | HTTP, WebSocket, Socket.io | Quick setup, YAML-driven scenarios

// k6 load test example
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 users
    { duration: '5m', target: 100 },   // Stay at 100 users
    { duration: '2m', target: 200 },   // Ramp to 200 users
    { duration: '5m', target: 200 },   // Stay at 200 users
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],   // 95% of requests under 500ms
    http_req_failed: ['rate<0.01'],     // Error rate under 1%
  },
};

export default function () {
  // Simulate user browsing products
  const res = http.get('https://api.example.com/products');
  
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time OK': (r) => r.timings.duration < 500,
    'body has products': (r) => JSON.parse(r.body).length > 0,
  });

  sleep(Math.random() * 3 + 1); // Think time: 1-4 seconds
}
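
To show what thresholds such as `p(95)<500` and `rate<0.01` actually assert, here is a small Python sketch that evaluates the same two conditions over a list of request durations. It uses the nearest-rank percentile definition, which may differ slightly from how k6 interpolates percentiles; the function names and sample data are hypothetical.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate(durations_ms: list[float], failures: int,
             p95_limit_ms: float = 500.0, max_error_rate: float = 0.01) -> dict:
    """Apply a p95 latency threshold and an error-rate threshold to one test run."""
    result = {
        "p95_ms": percentile(durations_ms, 95),
        "error_rate": failures / len(durations_ms),
    }
    result["passed"] = (result["p95_ms"] < p95_limit_ms
                        and result["error_rate"] < max_error_rate)
    return result


# 100 simulated requests: most fast, a few slow, two failed
durations = [120.0] * 94 + [450.0] * 4 + [900.0] * 2
print(evaluate(durations, failures=2))
# {'p95_ms': 450.0, 'error_rate': 0.02, 'passed': False}
```

In this run the latency threshold is met, but the 2% error rate breaches the 1% limit, so the run fails overall.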

Performance Testing in CI

Running performance tests in CI pipelines enables automated regression detection. The key challenges are: where to run (dedicated infrastructure), how to set baselines, and how to define pass/fail criteria.

# GitHub Actions: k6 performance gate
name: Performance Test
on:
  push:
    branches: [main]

jobs:
  perf-test:
    runs-on: ubuntu-latest
    services:
      app:
        # Assumes an earlier job pushed this image to a registry the runner can pull from
        image: myapp:${{ github.sha }}
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/performance/load-test.js
          flags: --out json=results.json
      - name: Check thresholds
        run: |
          # k6 exits non-zero if thresholds are breached
          echo "Performance test passed - all thresholds met"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: k6-results
          path: results.json
Warning: Never run load tests against production without explicit approval and safeguards. Even "small" load tests can trigger auto-scaling costs, rate limiting, or DDoS protection. Use isolated staging environments that mirror production topology but are safe to stress.
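
Fixed thresholds catch outright failures; comparing against a stored baseline catches gradual regressions. The sketch below is one simple way to express that comparison in Python. It assumes every metric is "lower is better" and that both runs have been summarised into dictionaries; the metric names, the 10% tolerance, and the function name `regressions` are illustrative assumptions.

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.10) -> dict[str, float]:
    """Return metrics that are worse than baseline by more than `tolerance`.

    Values in the result are the relative increase (0.2 means 20% worse).
    """
    found = {}
    for name, base in baseline.items():
        if name not in current or base <= 0:
            continue  # nothing to compare against
        change = (current[name] - base) / base
        if change > tolerance:
            found[name] = round(change, 3)
    return found


# Hypothetical summaries of the last accepted run and the current run
baseline = {"p95_ms": 400.0, "p99_ms": 800.0, "error_rate": 0.004}
current = {"p95_ms": 420.0, "p99_ms": 960.0, "error_rate": 0.004}
print(regressions(baseline, current))  # {'p99_ms': 0.2}
```

A tolerance is needed because shared CI runners are noisy; setting it too tight produces flaky gates, too loose hides real regressions.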

Security Testing

Security testing in the infrastructure context spans the application code, dependencies, container images, and deployed configurations.

Type | When | What It Tests | Tools
SAST (Static Application Security Testing) | At commit/PR | Source code vulnerabilities (SQL injection, XSS patterns) | SonarQube, Semgrep, CodeQL
SCA (Software Composition Analysis) | At build | Known vulnerabilities in dependencies | Snyk, Dependabot, npm audit, Trivy
Container Scanning | After image build | OS package vulnerabilities in Docker images | Trivy, Grype, Snyk Container
DAST (Dynamic Application Security Testing) | Against running app | Runtime vulnerabilities (auth bypass, SSRF) | OWASP ZAP, Burp Suite, Nuclei
IAST (Interactive Application Security Testing) | During test execution | Vulnerabilities discovered through instrumented runtime | Contrast Security, Hdiv

# Trivy: Scan container image for vulnerabilities
trivy image --severity HIGH,CRITICAL \
  --exit-code 1 \
  myregistry.io/myapp:latest

# Trivy: Scan Kubernetes manifests for misconfigurations
trivy config --severity HIGH,CRITICAL ./k8s/

# OWASP ZAP: Automated DAST scan (baseline scan)
# Mount the working directory so the HTML report is written to the host
docker run -v "$(pwd):/zap/wrk/:rw" -t zaproxy/zap-stable zap-baseline.py \
  -t https://staging.example.com \
  -r zap-report.html
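
Scanner output usually needs a triage policy: which findings block the deployment and which only create tickets. The Python sketch below illustrates one such policy over a report shaped like Trivy's JSON output (`trivy image --format json`), with a top-level `Results` list whose entries may carry a `Vulnerabilities` list. The field names reflect that format as I understand it and should be checked against your Trivy version; the CVE identifiers are placeholders, not real advisories.

```python
BLOCKING = {"CRITICAL", "HIGH"}  # assumed policy: these severities block deployment


def triage(report: dict) -> tuple[list[str], list[str]]:
    """Split findings into deployment blockers and ticket-only findings."""
    blockers, tickets = [], []
    for result in report.get("Results") or []:
        # Targets without findings may omit the key entirely
        for vuln in result.get("Vulnerabilities") or []:
            entry = f'{vuln["VulnerabilityID"]} ({vuln["PkgName"]})'
            if vuln["Severity"] in BLOCKING:
                blockers.append(entry)
            else:
                tickets.append(entry)
    return blockers, tickets


# Placeholder report for illustration
report = {
    "Results": [
        {"Target": "myapp (debian 12)", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "PkgName": "openssl", "Severity": "CRITICAL"},
            {"VulnerabilityID": "CVE-0000-0002", "PkgName": "zlib", "Severity": "MEDIUM"},
        ]},
        {"Target": "package-lock.json"},
    ]
}
blockers, tickets = triage(report)
print("block deploy:", blockers)  # block deploy: ['CVE-0000-0001 (openssl)']
print("file tickets:", tickets)   # file tickets: ['CVE-0000-0002 (zlib)']
```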

Compliance Testing

Compliance testing automates the verification of regulatory and organisational policies. Instead of manual audits (expensive, infrequent, point-in-time), Policy-as-Code enforces compliance continuously.

Open Policy Agent (OPA)

OPA is a general-purpose policy engine that decouples policy decisions from policy enforcement. You write policies in Rego (a declarative language), and OPA evaluates them against any JSON/YAML data.

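# OPA/Rego policy: Ensure all S3 buckets have encryption enabled
# policy/s3.rego
# Note: with AWS provider v4+, bucket encryption is usually declared in the
# separate aws_s3_bucket_server_side_encryption_configuration resource;
# adapt this check if your code uses that resource instead.
package terraform.aws

import rego.v1

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not resource.change.after.server_side_encryption_configuration
    msg := sprintf(
        "S3 bucket %v must have server-side encryption enabled (SOC2 CC6.1)",
        [resource.address]
    )
}

# Run with conftest (name the namespace; only package "main" is evaluated by default)
# conftest test --policy ./policy/ --namespace terraform.aws tfplan.json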

Common compliance frameworks automated with Policy-as-Code:

  • SOC 2 — encryption at rest, access logging, least-privilege IAM
  • HIPAA — PHI encryption, audit trails, access controls
  • PCI-DSS — network segmentation, vulnerability scanning, key rotation
  • GDPR — data residency, retention policies, consent mechanisms

Smoke & Sanity Testing

After every deployment, a quick verification confirms the system is alive and functioning at a basic level. This is the final safety net before users hit new code.

Post-Deployment Verification

  • Health checks — HTTP 200 from /health endpoints
  • Smoke tests — 3-5 critical path assertions (login works, homepage loads, API responds)
  • Synthetic monitoring — external probes that continuously verify availability from multiple regions
  • Canary tests — lightweight tests running against the canary deployment before full rollout
#!/bin/bash
# Post-deployment smoke test script
set -euo pipefail

BASE_URL="${1:-https://api.example.com}"
echo "Running smoke tests against $BASE_URL"

# Test 1: Health endpoint
# "|| true" lets a connection error fall through to the FAIL message below
STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$BASE_URL/health" || true)
[ "$STATUS" = "200" ] || { echo "FAIL: Health check returned $STATUS"; exit 1; }
echo "PASS: Health endpoint OK"

# Test 2: API responds with valid JSON
RESPONSE=$(curl -s --max-time 10 "$BASE_URL/api/v1/status" || true)
echo "$RESPONSE" | jq -e '.version' > /dev/null || { echo "FAIL: Invalid API response"; exit 1; }
echo "PASS: API status OK (version: $(echo "$RESPONSE" | jq -r '.version'))"

# Test 3: Database connectivity (via app health)
DB_STATUS=$(curl -s --max-time 10 "$BASE_URL/health/db" | jq -r '.status' || true)
[ "$DB_STATUS" = "healthy" ] || { echo "FAIL: Database unhealthy"; exit 1; }
echo "PASS: Database connectivity OK"

echo "All smoke tests passed!"
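
Immediately after a rollout, instances may still be starting, so a single failed probe is not always meaningful. The Python sketch below adds a bounded retry around a health probe. The names `wait_until_healthy` and `http_status` are hypothetical, the attempt count and delay are arbitrary, and the demonstration uses a fake probe so that it runs without network access.

```python
import time
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Real probe: return the HTTP status (urlopen raises on 4xx/5xx and on connection errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def wait_until_healthy(probe: Callable[[], int], attempts: int = 5,
                       delay_s: float = 2.0) -> bool:
    """Call probe() until it returns 200 or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # refused, timed out, or HTTP error: service may still be starting
        if attempt < attempts:
            time.sleep(delay_s)
    return False


# Fake probe: unavailable twice, then healthy
statuses = iter([503, 503, 200])
print(wait_until_healthy(lambda: next(statuses), delay_s=0))  # True
```

In real use the probe would be something like `lambda: http_status(base_url + "/health")`, and a `False` result would trigger a rollback.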
Case Study

Google — Canary Analysis at Scale

Google uses automated canary analysis for production deployments. A new version is deployed to 1% of traffic. An automated system compares metrics (latency, error rate, CPU usage) between the canary and the baseline; Kayenta, developed jointly by Google and Netflix and open-sourced as part of Spinnaker, implements this kind of analysis. If metrics diverge beyond a threshold, the canary is automatically rolled back, with no human intervention required. Only after automated analysis confirms the canary is healthy does traffic gradually shift to 5%, 25%, 50%, and finally 100%. This system catches approximately 90% of bad deployments before they affect more than 1% of users.
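
The progression described here can be sketched in a few lines of Python. This is a deliberately simplified illustration: real canary analysis tools such as Kayenta use statistical tests over time series rather than a single relative-threshold comparison, and the names `canary_healthy`, `progressive_rollout`, and `observe`, as well as the 10% threshold, are assumptions.

```python
from typing import Callable

TRAFFIC_STEPS = [1, 5, 25, 50, 100]  # percentages from the case study


def canary_healthy(baseline: dict[str, float], canary: dict[str, float],
                   max_relative_increase: float = 0.10) -> bool:
    """Healthy if no metric exceeds baseline by more than the allowed margin."""
    return all(
        canary[metric] <= base * (1 + max_relative_increase)
        for metric, base in baseline.items()
    )


def progressive_rollout(baseline: dict[str, float],
                        observe: Callable[[int], dict[str, float]]) -> int:
    """Return the final traffic percentage: 100 on success, 0 after rollback."""
    for pct in TRAFFIC_STEPS[:-1]:
        # Analyse the canary at this step before promoting to the next one
        if not canary_healthy(baseline, observe(pct)):
            return 0  # automatic rollback
    return TRAFFIC_STEPS[-1]


baseline = {"latency_ms": 100.0, "error_rate": 0.01}


def observe(pct: int) -> dict[str, float]:
    """Simulated canary metrics: latency degrades once traffic reaches 25%."""
    latency = 102.0 if pct < 25 else 130.0
    return {"latency_ms": latency, "error_rate": 0.01}


print(progressive_rollout(baseline, observe))  # 0
```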


Exercises

Exercise 1 — IaC Test Pipeline: Design a complete testing pipeline for a Terraform module that provisions an AWS VPC with public/private subnets, NAT gateways, and security groups. List every stage from lint to integration test, the tools at each stage, and what each stage catches.
Exercise 2 — Chaos Experiment Design: Your team runs a microservices application on Kubernetes with 3 replicas of each service. Design a chaos experiment to verify that the system survives the loss of an entire availability zone. Define: steady-state metrics, hypothesis, blast radius controls, and success criteria.
Exercise 3 — Performance Test Script: Write a k6 script that simulates a realistic e-commerce workload: 70% browsing (GET /products), 20% adding to cart (POST /cart), 10% checkout (POST /orders). Include proper think times, thresholds (p95 < 400ms, error rate < 0.5%), and a ramp-up/ramp-down pattern.
Exercise 4 — Security Scanning Pipeline: Your organisation requires SOC 2 compliance. Design a CI pipeline that includes SAST, SCA, container scanning, and IaC policy checks. For each stage, specify the tool, what it catches, and the failure criteria (which severities block deployment vs. which create tickets).

Conclusion & Next Steps

Infrastructure testing is the discipline that separates "it works on my machine" from "it works reliably in production at scale." The key takeaways:

  • IaC deserves the full testing pyramid — lint, static analysis, unit tests, plan assertions, integration tests
  • Chaos engineering builds confidence through controlled failure injection — start small, increase scope gradually
  • Performance testing prevents surprises — automated gates catch regressions before users notice
  • Security scanning shifts left — catch vulnerabilities at commit time, not in production
  • Compliance-as-Code replaces manual audits with continuous automated verification

Next in the Series

In Part 23: Software Supply Chain Security, we examine the most critical security frontier — securing the path from source code to production. From SolarWinds to Log4Shell, learn SBOM generation, the SLSA framework, Sigstore signing, and dependency management strategies.