Part 22: Infrastructure & Platform Testing

Introduction

Modern applications run on layers of infrastructure defined in code — Terraform modules, Kubernetes manifests, Helm charts, Dockerfiles, network policies, IAM roles. This infrastructure is as complex and bug-prone as application code, yet many teams deploy it without any automated testing.

The consequences are severe: misconfigured security groups expose databases to the internet, under-provisioned resources collapse under load, and compliance violations trigger regulatory penalties. Infrastructure testing applies the same discipline you use for application code — static analysis, unit tests, integration tests, and chaos experiments — to the platform itself.

Why Test Infrastructure

                            
Key Insight: Infrastructure bugs often cost more than application bugs because their blast radius is wider. A misconfigured firewall rule can expose every service. A wrong Terraform variable can delete a production database. A missing resource limit can let a single pod consume all the memory on its node. The blast radius of an infrastructure mistake is typically the entire platform.
                        

Infrastructure testing serves three purposes:

Prevent misconfigurations before they reach production
Verify resilience — prove the system recovers from failures
Validate performance — ensure the platform handles expected load

Infrastructure-as-Code Testing

IaC testing follows the same pyramid as application testing: static analysis (fast, cheap) at the base, unit tests in the middle, and integration tests (slow, expensive — they actually provision infrastructure) at the top.

IaC Testing Pipeline

flowchart LR
    A[Lint & Format] --> B[Static Analysis]
    B --> C[Unit Tests]
    C --> D[Plan Review]
    D --> E[Integration Tests]
    E --> F[Apply to Staging]
    F --> G[Smoke Tests]
    style A fill:#3B9797,color:#fff
    style B fill:#3B9797,color:#fff
    style C fill:#16476A,color:#fff
    style D fill:#16476A,color:#fff
    style E fill:#BF092F,color:#fff
    style F fill:#BF092F,color:#fff
    style G fill:#132440,color:#fff

Static Analysis

Static analysis tools examine IaC files without executing them. They catch syntax errors, security misconfigurations, and policy violations in seconds.

Tool	Focus	IaC Formats	Key Feature
tflint	Terraform linting	Terraform HCL	Provider-specific rules (AWS, Azure, GCP)
Checkov	Security & compliance	Terraform, CloudFormation, K8s, Helm, Dockerfile	1000+ built-in security policies
OPA/Rego	Custom policy enforcement	Any JSON/YAML (via conftest)	Write custom policies in Rego language
tfsec	Terraform security	Terraform HCL	Fast, focused on security misconfigs; its checks now also ship in Trivy
KICS	Multi-format scanning	Terraform, K8s, Docker, Ansible, CloudFormation	Broad coverage, CI-friendly

# Run Checkov against Terraform files
checkov --directory ./terraform/ \
  --framework terraform \
  --output cli \
  --soft-fail  # Don't fail pipeline (initial adoption)

# Example output:
# Passed checks: 42, Failed checks: 3, Skipped checks: 0
# Check: CKV_AWS_24: "Ensure no security group allows ingress from 0.0.0.0/0 to port 22"
#   FAILED for resource: aws_security_group.web
#   File: /terraform/main.tf:45-60
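To make concrete what a rule such as CKV_AWS_24 looks for, the sketch below re-implements the idea in plain Python against the JSON form of a Terraform plan. It is not Checkov's implementation. The function name `find_open_ssh` and the sample plan are hypothetical, and the sketch assumes ingress rules are declared inline on `aws_security_group` resources.

```python
from typing import Any, Dict, List


def find_open_ssh(plan: Dict[str, Any]) -> List[str]:
    """Return addresses of security groups that allow 0.0.0.0/0 to reach port 22."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        after = (rc.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            open_to_world = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
            # A rule covers SSH when port 22 falls inside its port range
            covers_ssh = (rule.get("from_port") or 0) <= 22 <= (rule.get("to_port") or 0)
            if open_to_world and covers_ssh:
                offenders.append(rc["address"])
                break
    return offenders


# Hypothetical, hand-written plan fragment for illustration only
sample_plan = {
    "resource_changes": [
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]}}},
        {"address": "aws_security_group.internal", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/8"]}]}}},
    ]
}

print(find_open_ssh(sample_plan))
```

A real scanner handles many more cases (IPv6 ranges, "all traffic" rules, standalone rule resources), which is a good reason to prefer the maintained tools over hand-rolled checks.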

Unit Testing IaC

Unit tests for IaC validate that modules produce the expected outputs given specific inputs — without actually provisioning infrastructure.

# test_vpc_module.py
# Plan-based unit tests with pytest, driving the terraform CLI via subprocess.
# Alternatives: Terratest (Go) or the tftest package for Python.
# Assumes the module exposes `subnet_count` and `vpc_cidr` variables (`vpc_cidr` is a
# hypothetical name) and that provider credentials are available for `terraform plan`.
import json
import subprocess

MODULE_DIR = './modules/vpc'

def terraform(*args):
    """Run a terraform command in the module directory and fail loudly on errors."""
    return subprocess.run(
        ['terraform', *args],
        cwd=MODULE_DIR, capture_output=True, text=True, check=True
    )

def planned_resources(*var_args):
    """Plan the module with the given -var arguments and return its planned resources."""
    terraform('init', '-input=false')
    terraform('plan', '-input=false', '-out=plan.bin', *var_args)
    # Convert the binary plan to JSON for inspection
    plan = json.loads(terraform('show', '-json', 'plan.bin').stdout)
    return plan['planned_values']['root_module']['resources']

def test_vpc_module_creates_three_subnets():
    """Verify VPC module creates correct number of subnets."""
    subnets = [
        r for r in planned_resources('-var', 'subnet_count=3')
        if r['type'] == 'aws_subnet'
    ]
    assert len(subnets) == 3, f"Expected 3 subnets, got {len(subnets)}"

def test_vpc_module_uses_correct_cidr():
    """Verify VPC uses the configured CIDR block."""
    vpcs = [
        r for r in planned_resources('-var', 'vpc_cidr=10.0.0.0/16')
        if r['type'] == 'aws_vpc'
    ]
    assert [v['values']['cidr_block'] for v in vpcs] == ['10.0.0.0/16']
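One detail worth knowing when asserting on plan JSON: resources created by nested modules do not appear in `root_module.resources` but under `child_modules`. The sketch below shows one way to walk the whole tree. The helper names `iter_planned_resources` and `count_by_type` and the sample plan are hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_planned_resources(module: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every resource in a plan module, descending into child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_planned_resources(child)


def count_by_type(plan: Dict[str, Any], resource_type: str) -> int:
    """Count planned resources of one type anywhere in the module tree."""
    root = plan["planned_values"]["root_module"]
    return sum(1 for r in iter_planned_resources(root) if r["type"] == resource_type)


# Hypothetical plan: a root VPC plus three subnets created by a nested module
sample_plan = {
    "planned_values": {
        "root_module": {
            "resources": [{"address": "aws_vpc.main", "type": "aws_vpc"}],
            "child_modules": [{
                "address": "module.network",
                "resources": [
                    {"address": f"module.network.aws_subnet.private[{i}]",
                     "type": "aws_subnet"}
                    for i in range(3)
                ],
            }],
        }
    }
}

print(count_by_type(sample_plan, "aws_subnet"))
```

This matters mostly when the configuration under test is a root module that calls other modules; a module planned on its own keeps its resources at the root level.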

Plan-Based Testing

The terraform plan output is a powerful testing artifact. It shows exactly what will change before anything is applied. Teams can write assertions against the plan to prevent dangerous operations.

# Sentinel policy (HashiCorp): prevent destroying databases
# aws_db_instance is the AWS provider's resource type for RDS instances.
# A replacement shows up as ["delete", "create"], so test membership rather than equality.
import "tfplan/v2" as tfplan

main = rule {
    all tfplan.resource_changes as _, rc {
        rc.type is not "aws_db_instance" or
        rc.change.actions not contains "delete"
    }
}

# OPA/Rego equivalent (conftest)
# policy/terraform.rego
# Run with: conftest test --namespace terraform tfplan.json
package terraform

import rego.v1

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_db_instance"
    resource.change.actions[_] == "delete"
    msg := sprintf("Deleting RDS instance %v is not allowed via automation", [resource.address])
}
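The same plan assertion can be written as an ordinary Python check, which is handy when a team wants to unit-test the rule itself. This is a minimal sketch rather than a replacement for Sentinel or OPA. It uses `aws_db_instance`, the AWS provider's resource type for RDS instances; the set of protected types, the function name, and the sample plan are assumptions.

```python
from typing import Any, Dict, FrozenSet, List

# Assumption: the resource types this team treats as too dangerous to delete automatically
PROTECTED_TYPES = frozenset({"aws_db_instance", "aws_rds_cluster"})


def destructive_changes(plan: Dict[str, Any],
                        protected: FrozenSet[str] = PROTECTED_TYPES) -> List[str]:
    """List protected resources whose planned actions include a delete."""
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = (rc.get("change") or {}).get("actions") or []
        # A replacement is planned as ["delete", "create"], so check membership
        if rc.get("type") in protected and "delete" in actions:
            violations.append(f'{rc["address"]}: {"+".join(actions)}')
    return violations


# Hypothetical plan fragment for illustration only
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_db_instance.replica", "type": "aws_db_instance",
         "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["delete"]}},
    ]
}

print(destructive_changes(sample_plan))
```

Note that the replacement of `aws_db_instance.main` is flagged even though the plan never lists a bare delete, which is the case an equality check against `["delete"]` would miss.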

Configuration Testing

Beyond Terraform, applications depend on configuration files — Kubernetes manifests, Helm charts, ConfigMaps, application YAML. These configurations should be validated before deployment.

# Kubernetes manifest validation
# 1. Schema validation (are fields correct?)
kubeconform -strict -kubernetes-version 1.28.0 ./k8s/

# 2. Helm chart linting
helm lint ./charts/my-app/ --values values-production.yaml

# 3. Policy checks with conftest (policies outside the "main" package need --namespace)
conftest test ./k8s/ --policy ./policies/ --namespace kubernetes

# 4. Dry-run against cluster API
kubectl apply --dry-run=server -f ./k8s/deployment.yaml

# Example policy for step 3: all containers must have resource limits
# policies/k8s.rego
package kubernetes

import rego.v1

deny contains msg if {
    container := input.spec.template.spec.containers[_]
    not container.resources.limits
    msg := sprintf("Container %v must have resource limits", [container.name])
}
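The resource-limit rule only inspects `spec.template.spec.containers`, which fits Deployments and similar workloads but not bare Pods or CronJobs. As a rough illustration of how the check generalises, here is a Python sketch that operates on already-parsed manifests (plain dictionaries, so no YAML library is needed). The function names and the sample manifest are hypothetical.

```python
from typing import Any, Dict, List


def pod_spec_of(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Locate the pod spec for common workload kinds."""
    kind = manifest.get("kind")
    spec = manifest.get("spec") or {}
    if kind == "Pod":
        return spec
    if kind == "CronJob":
        spec = (spec.get("jobTemplate") or {}).get("spec") or {}
    # Deployment, StatefulSet, DaemonSet, Job (and the Job inside a CronJob)
    return (spec.get("template") or {}).get("spec") or {}


def containers_missing_limits(manifest: Dict[str, Any]) -> List[str]:
    """Return the names of containers that declare no resource limits."""
    missing = []
    for container in pod_spec_of(manifest).get("containers") or []:
        if not (container.get("resources") or {}).get("limits"):
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical Deployment: one container with limits, one without
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
        {"name": "sidecar"},
    ]}}},
}

print(containers_missing_limits(deployment))
```

The sketch ignores init containers and ephemeral containers; whether those need limits is a policy decision for the team.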

                            
Key Insight: The --dry-run=server flag (unlike --dry-run=client) sends the manifest to the Kubernetes API server, which runs it through schema validation for installed CRDs, admission controllers, and admission webhooks without persisting anything. It catches issues that local validation misses, such as changes to immutable fields or webhook rejections.
                        

Chaos Engineering

Chaos engineering is the discipline of proactively injecting failures into a system to discover weaknesses before they cause real outages. It answers the question: "How does our system behave when things go wrong?"

Principles of Chaos Engineering

Define steady state — what does "normal" look like? (e.g., p99 latency < 200ms, error rate < 0.1%)
Hypothesise — "If we kill one database replica, the system should failover within 30 seconds with zero user-visible errors"
Inject failure — kill the replica in a controlled manner
Observe — did the system maintain steady state? How long was the disruption?
Learn — document findings, fix weaknesses, update runbooks
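The five steps above can be sketched as a small harness. This is an illustrative outline under stated assumptions, not the API of any chaos tool: `SteadyState`, `run_experiment`, and the callables passed to it are hypothetical names, and the thresholds are the example values from the first step.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Metrics = Tuple[float, float]  # (p99 latency in ms, error rate as a fraction)


@dataclass(frozen=True)
class SteadyState:
    max_p99_latency_ms: float = 200.0  # p99 latency < 200ms
    max_error_rate: float = 0.001      # error rate < 0.1%

    def holds(self, metrics: Metrics) -> bool:
        p99_ms, error_rate = metrics
        return p99_ms < self.max_p99_latency_ms and error_rate < self.max_error_rate


def run_experiment(steady: SteadyState,
                   measure: Callable[[], Metrics],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Check steady state, inject a failure, observe, and always roll back."""
    if not steady.holds(measure()):
        return "aborted"  # never inject into a system that is already unhealthy
    inject()
    try:
        held = steady.holds(measure())
    finally:
        rollback()  # limit the blast radius even if measurement fails
    return "hypothesis held" if held else "weakness found"


# Simulated readings: healthy before injection, degraded afterwards
readings = iter([(120.0, 0.0002), (450.0, 0.02)])
result = run_experiment(SteadyState(), lambda: next(readings),
                        inject=lambda: None, rollback=lambda: None)
print(result)
```

In a real experiment `measure` would query the monitoring system over an observation window and `inject` would call the chaos tool; the "Learn" step remains a human activity.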

Chaos Engineering Experiment Flow

flowchart TD
    A[Define Steady State] --> B[Form Hypothesis]
    B --> C[Design Experiment]
    C --> D[Limit Blast Radius]
    D --> E[Run Experiment]
    E --> F{"Steady State<br/>Maintained?"}
    F -->|Yes| G[Increase Scope]
    F -->|No| H[Fix Weakness]
    H --> I[Document & Share]
    G --> I
    I --> A

Chaos Engineering Tools

Tool	Environment	Failure Types	Key Feature
Chaos Monkey (Netflix)	AWS	Instance termination	Pioneer — randomly kills EC2 instances during business hours
Litmus (CNCF)	Kubernetes	Pod kill, network delay, disk fill, CPU stress	ChaosEngine CRD — declarative experiments
Chaos Mesh (PingCAP/CNCF)	Kubernetes	Pod, network, I/O, time, kernel faults	Dashboard UI, fine-grained network partitioning
Gremlin	Any (SaaS)	Full spectrum (resource, network, state)	Commercial platform with safety controls and GameDay orchestration
AWS Fault Injection Service	AWS	EC2, ECS, RDS, network	Native AWS integration, IAM-controlled blast radius

Case Study

Netflix — From Chaos Monkey to Simian Army

Netflix pioneered chaos engineering in 2010 with Chaos Monkey, which randomly terminated EC2 instances during business hours. The hypothesis: if engineers know their instances can be killed at any time, they will build resilient architectures. It worked. Netflix's migration to AWS (2008-2016) produced a platform that is widely cited as a model of resilient distributed design. The team then expanded the idea into the "Simian Army": Chaos Gorilla (take out an entire AWS availability zone), Latency Monkey (inject network delays), and Conformity Monkey (find instances that do not adhere to best practices). The key lesson: chaos engineering works because it creates organisational pressure toward resilience, not just technical improvements.

Performance Testing

Performance testing verifies that your system handles expected (and unexpected) load. It encompasses several distinct test types, each answering different questions.

Test Type	Purpose	Duration	Load Pattern
Load test	Verify system handles expected production load	5-30 minutes	Steady at expected peak
Stress test	Find the breaking point	Ramp until failure	Continuously increasing
Spike test	Verify recovery from sudden traffic bursts	Short bursts (1-5 min)	Normal → 10x → normal
Soak test	Find memory leaks, connection pool exhaustion	Hours to days	Steady at moderate load
Breakpoint test	Determine maximum capacity	Ramp until SLA breach	Step-wise increase with measurement
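The load patterns in the table translate directly into stage lists of the kind k6 accepts (`duration` plus `target`). The sketch below generates them in Python. The function name `stages_for` and all concrete durations and multipliers are illustrative picks within the ranges above, not recommendations.

```python
from typing import Dict, List, Union

Stage = Dict[str, Union[str, int]]


def stages_for(test_type: str, baseline_vus: int) -> List[Stage]:
    """Return k6-style stages for a test type, relative to the expected peak load."""
    if test_type == "load":
        # Steady at expected peak
        return [
            {"duration": "2m", "target": baseline_vus},
            {"duration": "20m", "target": baseline_vus},
            {"duration": "2m", "target": 0},
        ]
    if test_type == "stress":
        # Continuously increasing; stop the run manually or via an abort threshold
        return [{"duration": "30m", "target": baseline_vus * 5}]
    if test_type == "spike":
        # Normal, then 10x, then back to normal
        return [
            {"duration": "5m", "target": baseline_vus},
            {"duration": "10s", "target": baseline_vus * 10},
            {"duration": "3m", "target": baseline_vus * 10},
            {"duration": "10s", "target": baseline_vus},
            {"duration": "5m", "target": baseline_vus},
        ]
    if test_type == "soak":
        # Steady at moderate load for hours (assumed here: half of peak for 8 hours)
        moderate = baseline_vus // 2
        return [
            {"duration": "5m", "target": moderate},
            {"duration": "8h", "target": moderate},
            {"duration": "5m", "target": 0},
        ]
    if test_type == "breakpoint":
        # Step-wise increase: add 25% of baseline per step, holding each step to measure
        stages: List[Stage] = []
        for step in range(1, 9):
            target = baseline_vus * step // 4
            stages.append({"duration": "1m", "target": target})
            stages.append({"duration": "5m", "target": target})
        return stages
    raise ValueError(f"unknown test type: {test_type}")


for stage in stages_for("spike", 100):
    print(stage)
```

Generating profiles from one baseline number keeps the five test types consistent with each other when the expected peak changes.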

Performance Testing Tools

Tool	Language	Protocol Support	Best For
k6 (Grafana)	JavaScript	HTTP, WebSocket, gRPC, Browser	Developer-friendly, CI integration, scripted scenarios
JMeter	Java (GUI + XML)	HTTP, JDBC, LDAP, FTP, JMS, SOAP	Enterprise, complex protocols, large test plans
Gatling	Scala/Java	HTTP, WebSocket	High throughput, beautiful reports
Locust	Python	HTTP (extensible)	Python teams, simple scripting, distributed
Artillery	YAML + JavaScript	HTTP, WebSocket, Socket.io	Quick setup, YAML-driven scenarios

// k6 load test example
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 users
    { duration: '5m', target: 100 },   // Stay at 100 users
    { duration: '2m', target: 200 },   // Ramp to 200 users
    { duration: '5m', target: 200 },   // Stay at 200 users
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],   // 95% of requests under 500ms
    http_req_failed: ['rate<0.01'],     // Error rate under 1%
  },
};

export default function () {
  // Simulate user browsing products
  const res = http.get('https://api.example.com/products');
  
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time OK': (r) => r.timings.duration < 500,
    'body has products': (r) => r.status === 200 && r.json().length > 0,
  });

  sleep(Math.random() * 3 + 1); // Think time: 1-4 seconds
}
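It helps to estimate what request rate a script like this produces before running it. Each virtual user sends one request, waits for the response, then sleeps for the think time, so throughput is roughly the number of users divided by the time per iteration. The Python arithmetic below works this out; the 0.2-second average response time is an assumed figure, not a measurement.

```python
def expected_rps(virtual_users: int, avg_think_s: float, avg_response_s: float) -> float:
    """Approximate requests per second for a closed-loop test with one request per iteration."""
    return virtual_users / (avg_think_s + avg_response_s)


# sleep(Math.random() * 3 + 1) is uniform between 1 and 4 seconds, so the mean is 2.5
AVG_THINK_S = (1 + 4) / 2
ASSUMED_RESPONSE_S = 0.2  # assumption for illustration

for vus in (100, 200):
    rate = expected_rps(vus, AVG_THINK_S, ASSUMED_RESPONSE_S)
    print(f"{vus} virtual users -> about {rate:.1f} requests/second")
```

Under these assumptions the two plateaus correspond to roughly 37 and 74 requests per second. If production peak is far above that, the test needs more virtual users or shorter think times.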

Performance Testing in CI

Running performance tests in CI pipelines enables automated regression detection. The key challenges are: where to run (dedicated infrastructure), how to set baselines, and how to define pass/fail criteria.

# GitHub Actions: k6 performance gate
name: Performance Test
on:
  push:
    branches: [main]

jobs:
  perf-test:
    runs-on: ubuntu-latest
    services:
      app:
        image: myapp:${{ github.sha }}
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/performance/load-test.js
          flags: --out json=results.json
      - name: Report result
        run: |
          # Reached only if the k6 step succeeded: k6 exits non-zero when a threshold is breached
          echo "Performance test passed - all thresholds met"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: k6-results
          path: results.json
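Fixed thresholds answer "is it fast enough?" but not "did it get slower?". For the baseline question, one option is to post-process the results file and compare against a stored number. The sketch below assumes the k6 JSON output is newline-delimited, with metric samples recorded as `Point` entries carrying `metric` and `data.value` fields. The function names, the 10% tolerance, and the sample data are hypothetical.

```python
import json
from typing import Iterable, List


def durations_from_k6_json(lines: Iterable[str]) -> List[float]:
    """Extract http_req_duration samples (ms) from newline-delimited k6 JSON output."""
    values = []
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "Point" and record.get("metric") == "http_req_duration":
            values.append(record["data"]["value"])
    return values


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def is_regression(current_ms: float, baseline_ms: float, tolerance: float = 0.10) -> bool:
    """Flag a regression when the current value exceeds the baseline by more than the tolerance."""
    return current_ms > baseline_ms * (1 + tolerance)


# Synthetic samples standing in for results.json: durations of 1..100 ms
sample_lines = [
    json.dumps({"type": "Point", "metric": "http_req_duration",
                "data": {"value": float(ms)}})
    for ms in range(1, 101)
]

current = p95(durations_from_k6_json(sample_lines))
print(current, is_regression(current, baseline_ms=80.0))
```

k6 computes its own percentiles with a different interpolation method, so numbers from this script may differ slightly from the k6 summary. What matters is comparing like with like across runs.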

                            
                            Warning: Never run load tests against production without explicit approval and safeguards. Even "small" load tests can trigger auto-scaling costs, rate limiting, or DDoS protection. Use isolated staging environments that mirror production topology but are safe to stress.
                        

Security Testing

Security testing in the infrastructure context spans the application code, dependencies, container images, and deployed configurations.

Type	When	What It Tests	Tools
SAST (Static Application Security Testing)	At commit/PR	Source code vulnerabilities (SQL injection, XSS patterns)	SonarQube, Semgrep, CodeQL
SCA (Software Composition Analysis)	At build	Known vulnerabilities in dependencies	Snyk, Dependabot, npm audit, Trivy
Container Scanning	After image build	OS package vulnerabilities in Docker images	Trivy, Grype, Snyk Container
DAST (Dynamic Application Security Testing)	Against running app	Runtime vulnerabilities (auth bypass, SSRF)	OWASP ZAP, Burp Suite, Nuclei
IAST (Interactive Application Security Testing)	During test execution	Vulnerabilities discovered through instrumented runtime	Contrast Security, Hdiv

# Trivy: Scan container image for vulnerabilities
trivy image --severity HIGH,CRITICAL \
  --exit-code 1 \
  myregistry.io/myapp:latest

# Trivy: Scan Kubernetes manifests for misconfigurations
trivy config --severity HIGH,CRITICAL ./k8s/

# OWASP ZAP: Automated DAST scan (baseline scan)
# The volume mount lets ZAP write the report into the current directory
docker run -v "$(pwd):/zap/wrk/:rw" -t zaproxy/zap-stable zap-baseline.py \
  -t https://staging.example.com \
  -r zap-report.html
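Teams often want finer control than a single exit code, for example blocking on some severities while filing tickets for the rest. A possible approach is to have Trivy write a JSON report (`--format json`) and triage it in a script. The sketch below assumes the report's `Results[].Vulnerabilities[]` layout with `VulnerabilityID`, `PkgName`, and `Severity` fields; the function name, the blocking set, and the sample report (including its placeholder IDs) are made up for illustration.

```python
from typing import Any, Dict, FrozenSet, List, Tuple

BLOCKING = frozenset({"CRITICAL", "HIGH"})  # assumption: severities that stop a deployment


def triage(report: Dict[str, Any],
           blocking: FrozenSet[str] = BLOCKING) -> Tuple[List[str], List[str]]:
    """Split findings into deployment blockers and ticket-worthy items."""
    blockers: List[str] = []
    tickets: List[str] = []
    for result in report.get("Results") or []:
        # Targets with no findings may omit the Vulnerabilities key entirely
        for vuln in result.get("Vulnerabilities") or []:
            finding = f'{vuln["VulnerabilityID"]} in {vuln["PkgName"]} ({vuln["Severity"]})'
            (blockers if vuln["Severity"] in blocking else tickets).append(finding)
    return blockers, tickets


# Hand-written sample with placeholder identifiers, not real advisories
sample_report = {
    "Results": [
        {"Target": "myapp (debian)", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "PkgName": "openssl", "Severity": "CRITICAL"},
            {"VulnerabilityID": "CVE-0000-0002", "PkgName": "zlib", "Severity": "MEDIUM"},
        ]},
        {"Target": "app/package-lock.json"},
    ]
}

blockers, tickets = triage(sample_report)
print("block:", blockers)
print("ticket:", tickets)
```

In a pipeline the script would exit non-zero when `blockers` is non-empty and hand `tickets` to the issue tracker.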

Compliance Testing

Compliance testing automates the verification of regulatory and organisational policies. Instead of manual audits (expensive, infrequent, point-in-time), Policy-as-Code enforces compliance continuously.

Open Policy Agent (OPA)

OPA is a general-purpose policy engine that decouples policy decisions from policy enforcement. You write policies in Rego (a declarative language), and OPA evaluates them against any JSON/YAML data.

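# OPA/Rego policy: Ensure all S3 buckets have encryption enabled
# policy/s3.rego
# Assumes encryption is declared inline on the bucket (AWS provider v3 style). With provider v4+,
# check for aws_s3_bucket_server_side_encryption_configuration resources instead.
package terraform.aws

import rego.v1

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not resource.change.after.server_side_encryption_configuration
    msg := sprintf(
        "S3 bucket %v must have server-side encryption enabled (SOC2 CC6.1)",
        [resource.address]
    )
}

# Run with conftest (policies outside the "main" package need --namespace)
# conftest test --policy ./policy/ --namespace terraform.aws tfplan.json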

Common compliance frameworks automated with Policy-as-Code:

SOC 2 — encryption at rest, access logging, least-privilege IAM
HIPAA — PHI encryption, audit trails, access controls
PCI-DSS — network segmentation, vulnerability scanning, key rotation
GDPR — data residency, retention policies, consent mechanisms
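Auditors usually want results grouped by framework rather than by tool. As a sketch of that reporting layer, the Python below tags each rule with the framework requirement it supports and groups violations accordingly. The rule-to-framework mapping is purely illustrative and not an authoritative control mapping; `Rule`, `compliance_report`, and the sample plan are hypothetical names and data.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    framework: str
    requirement: str
    resource_type: str
    passes: Callable[[Dict[str, Any]], bool]


# Illustrative rules only: real mappings come from your compliance team
RULES = [
    Rule("SOC 2", "encryption at rest", "aws_db_instance",
         lambda after: after.get("storage_encrypted") is True),
    Rule("HIPAA", "PHI encryption", "aws_ebs_volume",
         lambda after: after.get("encrypted") is True),
]


def compliance_report(plan: Dict[str, Any],
                      rules: List[Rule] = RULES) -> Dict[str, List[str]]:
    """Map each framework to the violations found in a Terraform plan."""
    report: Dict[str, List[str]] = {rule.framework: [] for rule in rules}
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after")
        if after is None:  # resource is being deleted, nothing to check
            continue
        for rule in rules:
            if rc.get("type") == rule.resource_type and not rule.passes(after):
                report[rule.framework].append(f'{rc["address"]}: {rule.requirement}')
    return report


# Hypothetical plan: an unencrypted database and an encrypted volume
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"after": {"storage_encrypted": False}}},
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"after": {"encrypted": True}}},
    ]
}

print(compliance_report(sample_plan))
```

A framework with an empty list has no violations among the rules that were evaluated, which is weaker than saying the system is compliant.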

Smoke & Sanity Testing

After every deployment, a quick verification confirms the system is alive and functioning at a basic level. This is the final safety net before users hit new code.

Post-Deployment Verification

Health checks — HTTP 200 from /health endpoints
Smoke tests — 3-5 critical path assertions (login works, homepage loads, API responds)
Synthetic monitoring — external probes that continuously verify availability from multiple regions
Canary tests — lightweight tests running against the canary deployment before full rollout

#!/bin/bash
# Post-deployment smoke test script
set -e

BASE_URL="${1:-https://api.example.com}"
echo "Running smoke tests against $BASE_URL"

# Test 1: Health endpoint
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_URL/health")
[ "$STATUS" -eq 200 ] || { echo "FAIL: Health check returned $STATUS"; exit 1; }
echo "PASS: Health endpoint OK"

# Test 2: API responds with valid JSON
RESPONSE=$(curl -s "$BASE_URL/api/v1/status")
echo "$RESPONSE" | jq -e '.version' > /dev/null || { echo "FAIL: Invalid API response"; exit 1; }
echo "PASS: API status OK (version: $(echo "$RESPONSE" | jq -r '.version'))"

# Test 3: Database connectivity (via app health)
DB_STATUS=$(curl -s "$BASE_URL/health/db" | jq -r '.status')
[ "$DB_STATUS" = "healthy" ] || { echo "FAIL: Database unhealthy"; exit 1; }
echo "PASS: Database connectivity OK"

echo "All smoke tests passed!"
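A script like the one above fails on the first refused connection, yet a freshly deployed service often needs a few seconds before it answers. One way to handle that is to retry the health check a bounded number of times before declaring failure. The Python sketch below shows the pattern; `wait_until_healthy` and `http_ok` are hypothetical helpers, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True when the URL answers with HTTP 200 (raises OSError on network errors)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status == 200


def wait_until_healthy(check: Callable[[], bool],
                       attempts: int = 5,
                       delay_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry a health check until it passes or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except OSError:
            pass  # connection refused, timeout, or HTTP error while the service starts
        if attempt < attempts:
            sleep(delay_s)
    return False


# Simulated service that becomes healthy on the third probe (no real network calls)
probe_results = iter([False, False, True])
healthy = wait_until_healthy(lambda: next(probe_results), sleep=lambda _: None)
print("healthy" if healthy else "unhealthy")
```

Against a real deployment the call would look like `wait_until_healthy(lambda: http_ok(base_url + "/health"))`. Keeping the retry budget small matters, because a smoke test that waits indefinitely hides slow startups.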

Case Study

Google — Canary Analysis at Scale

Google's deployment system (Borg/Kubernetes) uses automated canary analysis for production deployments. A new version first receives 1% of traffic. An automated judge then compares metrics (latency, error rate, CPU usage) between the canary and the baseline; Kayenta, developed jointly by Google and Netflix and open-sourced as part of Spinnaker, is the public implementation of this approach. If metrics diverge beyond a threshold, the canary is rolled back automatically, with no human intervention required. Only after the analysis confirms the canary is healthy does traffic gradually shift to 5%, 25%, 50%, and finally 100%. This system reportedly catches approximately 90% of bad deployments before they affect more than 1% of users.
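The control loop described here can be outlined in a few lines of Python. This is a deliberately simplified sketch: it compares each metric against the baseline with a fixed relative tolerance, whereas production canary judges such as Kayenta apply statistical tests. All names, the 10% tolerance, and the simulated metrics are assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple

MetricSet = Dict[str, float]
TRAFFIC_STEPS = (1, 5, 25, 50, 100)  # percent of traffic, as in the case study


def canary_healthy(baseline: MetricSet, canary: MetricSet,
                   max_relative_increase: float = 0.10) -> bool:
    """Healthy when no canary metric exceeds its baseline by more than the tolerance."""
    return all(
        canary[name] <= value * (1 + max_relative_increase)
        for name, value in baseline.items()
    )


def progressive_rollout(observe: Callable[[int], Tuple[MetricSet, MetricSet]],
                        steps: Sequence[int] = TRAFFIC_STEPS) -> Tuple[str, int]:
    """Advance through the traffic steps, rolling back at the first unhealthy comparison."""
    for percent in steps:
        baseline, canary = observe(percent)
        if not canary_healthy(baseline, canary):
            return ("rolled back", percent)
    return ("promoted", steps[-1])


def simulated_observe(percent: int) -> Tuple[MetricSet, MetricSet]:
    """Fake metrics: the canary's error rate degrades once it takes 25% of traffic."""
    baseline = {"latency_ms": 100.0, "error_rate": 0.001}
    canary = {"latency_ms": 102.0, "error_rate": 0.001 if percent < 25 else 0.01}
    return baseline, canary


print(progressive_rollout(simulated_observe))
```

The simulated run is rolled back at the 25% step, which illustrates why staged traffic shifts matter: some problems only appear once the canary carries meaningful load.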

Exercises

                            
                            Exercise 1 — IaC Test Pipeline: Design a complete testing pipeline for a Terraform module that provisions an AWS VPC with public/private subnets, NAT gateways, and security groups. List every stage from lint to integration test, the tools at each stage, and what each stage catches.
                        

                            
                            Exercise 2 — Chaos Experiment Design: Your team runs a microservices application on Kubernetes with 3 replicas of each service. Design a chaos experiment to verify that the system survives the loss of an entire availability zone. Define: steady-state metrics, hypothesis, blast radius controls, and success criteria.
                        

                            
                            Exercise 3 — Performance Test Script: Write a k6 script that simulates a realistic e-commerce workload: 70% browsing (GET /products), 20% adding to cart (POST /cart), 10% checkout (POST /orders). Include proper think times, thresholds (p95 < 400ms, error rate < 0.5%), and a ramp-up/ramp-down pattern.
                        

                            
                            Exercise 4 — Security Scanning Pipeline: Your organisation requires SOC 2 compliance. Design a CI pipeline that includes SAST, SCA, container scanning, and IaC policy checks. For each stage, specify the tool, what it catches, and the failure criteria (which severities block deployment vs. which create tickets).
                        

Conclusion & Next Steps

Infrastructure testing is the discipline that separates "it works on my machine" from "it works reliably in production at scale." The key takeaways:

IaC deserves the full testing pyramid — lint, static analysis, unit tests, plan assertions, integration tests
Chaos engineering builds confidence through controlled failure injection — start small, increase scope gradually
Performance testing prevents surprises — automated gates catch regressions before users notice
Security scanning shifts left — catch vulnerabilities at commit time, not in production
Compliance-as-Code replaces manual audits with continuous automated verification

Next in the Series

In Part 23: Software Supply Chain Security, we examine the most critical security frontier — securing the path from source code to production. From SolarWinds to Log4Shell, learn SBOM generation, the SLSA framework, Sigstore signing, and dependency management strategies.
