Why CI/CD for Infrastructure
Infrastructure management has historically relied on manual processes — engineers SSH into servers, click through cloud consoles, and run scripts by hand. This approach creates snowflake environments that drift from their intended state, introduces human error at every step, and makes teams afraid to deploy changes because they cannot predict the outcome.
CI/CD pipelines transform infrastructure management by establishing a single, automated path from code commit to production. Every change is version-controlled, tested, reviewed, and deployed through the same repeatable process. The pipeline becomes the gatekeeper — no human touches production directly.
The Pipeline as Single Path to Production
In a mature infrastructure organization, the CI/CD pipeline is the only way changes reach production. No SSH access, no console clicks, no ad-hoc scripts. This principle ensures every change is auditable, repeatable, and reversible.
flowchart LR
subgraph Manual["❌ Manual Process"]
direction TB
M1[Engineer writes code] --> M2[SSH to server]
M2 --> M3[Run commands manually]
M3 --> M4[Hope nothing breaks]
M4 --> M5[No audit trail]
end
subgraph Automated["✅ CI/CD Pipeline"]
direction TB
A1[Engineer pushes code] --> A2[Automated lint & validate]
A2 --> A3[Security scan & test]
A3 --> A4[Plan review & approval]
A4 --> A5[Automated apply]
A5 --> A6[Full audit trail]
end
The benefits of pipeline-driven infrastructure are immediate and compounding:
| Aspect | Manual Deployment | CI/CD Pipeline |
|---|---|---|
| Speed | Hours to days | Minutes to hours |
| Consistency | Varies by engineer | Identical every time |
| Audit trail | Sparse or missing | Complete git + pipeline history |
| Rollback | Manual, error-prone | Revert commit, re-run pipeline |
| Risk | High (unknown state) | Low (tested, reviewed) |
| Scale | Limited by team size | Limited by runner capacity, not team size |
CI/CD Fundamentals
Before diving into specific tools, it is essential to understand the foundational concepts that all infrastructure CI/CD pipelines share. These principles apply regardless of whether you use GitHub Actions, GitLab CI, Jenkins, or any other platform.
Continuous Integration vs Continuous Delivery vs Continuous Deployment
These three terms are often confused but represent distinct levels of automation maturity:
| Concept | Definition | Infrastructure Example |
|---|---|---|
| Continuous Integration (CI) | Merge code frequently, run automated tests on every commit | Terraform validate, lint, and plan on every PR |
| Continuous Delivery (CD) | Code is always in a deployable state; deployment requires manual approval | Terraform plan is ready; human approves apply |
| Continuous Deployment (CD) | Every passing change is automatically deployed to production | Merged Terraform automatically applies (rare for infra) |
For infrastructure, continuous delivery with a manual approval before terraform apply is standard practice. The pipeline automates everything up to the apply step.
Pipeline Stages for Infrastructure
A well-designed infrastructure pipeline follows a progression of increasingly expensive and risky stages. Each stage acts as a gate — if it fails, subsequent stages do not run:
flowchart LR
A[Format Check] --> B[Validate]
B --> C[Lint & Security]
C --> D[Plan]
D --> E[Cost Estimate]
E --> F[Manual Approval]
F --> G[Apply]
G --> H[Smoke Test]
style A fill:#e8f5e9
style B fill:#e8f5e9
style C fill:#fff3e0
style D fill:#fff3e0
style E fill:#fff3e0
style F fill:#ffebee
style G fill:#ffebee
style H fill:#e3f2fd
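The gating behaviour described above can be sketched in a few lines of Python. This is only an illustration of the fail-fast idea, not how any CI platform is implemented; `run_pipeline` and the stage callables are hypothetical stand-ins for real jobs.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that actually executed.
    """
    executed = []
    for name, check in stages:
        executed.append(name)
        if not check():
            # A failed gate blocks every later, more expensive stage.
            break
    return executed


if __name__ == "__main__":
    demo = [
        ("format", lambda: True),
        ("validate", lambda: True),
        ("security", lambda: False),  # simulated scanner finding
        ("plan", lambda: True),
        ("apply", lambda: True),
    ]
    print(run_pipeline(demo))
```

In the demo, the simulated security failure means the plan and apply stages never run, which is the property that keeps cheap checks in front of risky ones.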
Artifacts and State Passing
Infrastructure pipelines generate artifacts that must be passed between stages. The most critical is the Terraform plan file — it ensures the exact changes reviewed are the exact changes applied:
# Stage 1: Generate plan and save as artifact
terraform plan -out=tfplan -input=false
# Stage 2 (after approval): Apply the EXACT saved plan
terraform apply -input=false tfplan
Never run terraform apply without a saved plan file in CI/CD. Between the plan and apply stages, another team member might merge changes that alter the plan. Applying the saved plan file ensures that the changes reviewed are the changes applied.
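One way to make the "reviewed equals applied" property checkable is to record a checksum of the plan file when it is created and verify it again just before apply. The sketch below assumes the digest is passed between stages by some mechanism of your choosing (for example a pipeline artifact); `plan_digest` and `verify_plan` are hypothetical helper names, not part of Terraform.

```python
import hashlib
from pathlib import Path


def plan_digest(plan_path: Path) -> str:
    """Return the SHA-256 hex digest of a saved plan file."""
    return hashlib.sha256(plan_path.read_bytes()).hexdigest()


def verify_plan(plan_path: Path, expected_digest: str) -> bool:
    """True only if the file is byte-for-byte the plan that was reviewed."""
    return plan_digest(plan_path) == expected_digest
```

The plan stage would store `plan_digest(Path("tfplan"))` next to the artifact, and the apply stage would refuse to continue when `verify_plan` returns False.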
GitHub Actions for Infrastructure
GitHub Actions is a common CI/CD choice for infrastructure teams whose code lives in GitHub repositories. Its tight integration with pull requests, environments, and OIDC authentication makes it well suited to Terraform workflows.
Workflow File Structure
GitHub Actions workflows live in .github/workflows/ and are defined in YAML. Each workflow contains triggers, jobs, and steps:
# .github/workflows/terraform.yml
# Complete Terraform CI/CD pipeline for AWS infrastructure
name: "Terraform Infrastructure"
on:
push:
branches: [main]
paths: ["terraform/**"]
pull_request:
branches: [main]
paths: ["terraform/**"]
workflow_dispatch:
inputs:
environment:
description: "Target environment"
required: true
type: choice
options: [dev, staging, production]
permissions:
id-token: write # Required for OIDC
contents: read # Required to checkout
pull-requests: write # Required for PR comments
env:
TF_VERSION: "1.7.0"
AWS_REGION: "us-east-1"
WORKING_DIR: "terraform/environments/dev"
OIDC Authentication (No Long-Lived Credentials)
Modern CI/CD pipelines authenticate to cloud providers using OIDC (OpenID Connect) instead of storing long-lived access keys as secrets. This eliminates credential rotation burden and reduces the blast radius of a compromised pipeline:
# OIDC authentication - no stored credentials needed
jobs:
terraform:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Configure AWS credentials via OIDC
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-terraform
role-session-name: github-actions-${{ github.run_id }}
aws-region: ${{ env.AWS_REGION }}
# For Azure:
# - name: Azure Login via OIDC
# uses: azure/login@v2
# with:
# client-id: ${{ secrets.AZURE_CLIENT_ID }}
# tenant-id: ${{ secrets.AZURE_TENANT_ID }}
# subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
# For GCP:
# - name: GCP Auth via OIDC
# uses: google-github-actions/auth@v2
# with:
# workload_identity_provider: projects/123456/locations/global/workloadIdentityPools/github/providers/my-repo
# service_account: terraform@my-project.iam.gserviceaccount.com
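OIDC only reduces risk if the cloud-side trust policy restricts which workflows may assume the role. GitHub's token carries a `sub` claim such as `repo:<owner>/<repo>:ref:refs/heads/<branch>` or `repo:<owner>/<repo>:environment:<name>`, and the trust policy typically matches it against a pattern. The sketch below imitates that wildcard match in Python to show why broad patterns are risky; `subject_allowed` is a hypothetical helper and the repository names are made up.

```python
from fnmatch import fnmatchcase
from typing import List


def subject_allowed(sub: str, allowed_patterns: List[str]) -> bool:
    """Return True if the OIDC subject matches any allowed pattern.

    '*' matches any run of characters, similar in spirit to a
    wildcard condition in a cloud trust policy.
    """
    return any(fnmatchcase(sub, pattern) for pattern in allowed_patterns)


if __name__ == "__main__":
    strict = ["repo:myorg/infrastructure:environment:production"]
    broad = ["repo:myorg/*"]
    pr_subject = "repo:myorg/some-other-repo:pull_request"
    print(subject_allowed(pr_subject, strict))  # strict pattern rejects it
    print(subject_allowed(pr_subject, broad))   # broad pattern accepts it
```

Pinning the pattern to a specific repository and branch or environment keeps an unrelated repository or a pull request build from assuming the production role.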
Complete Terraform Pipeline
This is a production-ready GitHub Actions workflow that handles the full Terraform lifecycle with format checking, validation, security scanning, planning, and conditional apply:
# .github/workflows/terraform-complete.yml
name: "Terraform CI/CD"
on:
push:
branches: [main]
paths: ["terraform/**"]
pull_request:
branches: [main]
paths: ["terraform/**"]
permissions:
id-token: write
contents: read
pull-requests: write
jobs:
# ─── Stage 1: Format & Validate ────────────────────────────
validate:
name: "Format & Validate"
runs-on: ubuntu-latest
defaults:
run:
working-directory: terraform/environments/production
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.7.0"
- name: Terraform Format Check
run: terraform fmt -check -recursive -diff
- name: Terraform Init
run: terraform init -backend=false
- name: Terraform Validate
run: terraform validate
# ─── Stage 2: Security Scan ─────────────────────────────────
security:
name: "Security Scan"
runs-on: ubuntu-latest
needs: validate
steps:
- uses: actions/checkout@v4
- name: Run tfsec
uses: aquasecurity/tfsec-action@v1.0.3
with:
working_directory: terraform/
soft_fail: false
- name: Run Checkov
uses: bridgecrewio/checkov-action@v12
with:
directory: terraform/
framework: terraform
output_format: sarif
# ─── Stage 3: Plan ─────────────────────────────────────────
plan:
name: "Terraform Plan"
runs-on: ubuntu-latest
needs: [validate, security]
defaults:
run:
working-directory: terraform/environments/production
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.7.0"
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-terraform
aws-region: us-east-1
- name: Terraform Init
run: terraform init
- name: Terraform Plan
id: plan
run: |
terraform plan -out=tfplan -input=false -no-color 2>&1 | tee plan-output.txt
echo "plan_output<<EOF" >> $GITHUB_OUTPUT
cat plan-output.txt >> $GITHUB_OUTPUT
echo "EOF" >> $GITHUB_OUTPUT
- name: Upload Plan Artifact
uses: actions/upload-artifact@v4
with:
name: tfplan
path: terraform/environments/production/tfplan
retention-days: 5
- name: Post Plan to PR
if: github.event_name == 'pull_request'
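uses: actions/github-script@v7
env:
PLAN: ${{ steps.plan.outputs.plan_output }}
with:
script: |
// Read the plan from the environment so that backticks or ${...}
// in the plan text cannot break this template literal.
const output = `#### Terraform Plan 📖
\`\`\`hcl
${process.env.PLAN}
\`\`\`
*Pushed by: @${{ github.actor }}, Action: \`${{ github.event_name }}\`*`;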
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: output
});
# ─── Stage 4: Apply (main branch only) ─────────────────────
apply:
name: "Terraform Apply"
runs-on: ubuntu-latest
needs: plan
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
environment:
name: production
url: https://console.aws.amazon.com
defaults:
run:
working-directory: terraform/environments/production
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.7.0"
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-terraform
aws-region: us-east-1
- name: Terraform Init
run: terraform init
- name: Download Plan
uses: actions/download-artifact@v4
with:
name: tfplan
path: terraform/environments/production/
- name: Terraform Apply
run: terraform apply -input=false tfplan
Environment Protection Rules
GitHub Environments provide approval gates, deployment restrictions, and secrets scoping:
# Environment configuration (set in GitHub repo settings):
# Settings > Environments > production:
# - Required reviewers: [lead-engineer, platform-team]
# - Wait timer: 5 minutes
# - Deployment branches: main only
# - Environment secrets: AWS_ROLE_ARN (scoped to production)
# In workflow, reference the environment:
jobs:
deploy-production:
environment:
name: production
url: https://app.example.com
# This job will pause and wait for approval
# before any steps execute
GitLab CI for Infrastructure
GitLab CI offers a powerful integrated experience with built-in Terraform state management, merge request pipelines, and environment tracking — all within a single platform.
Pipeline Structure
# .gitlab-ci.yml - Complete Terraform pipeline
image:
name: hashicorp/terraform:1.7.0
entrypoint: [""]
variables:
TF_ROOT: "terraform/environments/${CI_ENVIRONMENT_NAME}"
TF_STATE_NAME: "${CI_ENVIRONMENT_NAME}"
TF_ADDRESS: "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/${TF_STATE_NAME}"
stages:
- validate
- security
- plan
- apply
- verify
cache:
key: "${CI_COMMIT_REF_SLUG}"
paths:
- ${TF_ROOT}/.terraform/
# ─── Validate Stage ──────────────────────────────────────────
fmt-check:
stage: validate
script:
- cd ${TF_ROOT}
- terraform fmt -check -recursive -diff
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
- if: $CI_COMMIT_BRANCH == "main"
validate:
stage: validate
script:
- cd ${TF_ROOT}
- terraform init -backend=false
- terraform validate
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
- if: $CI_COMMIT_BRANCH == "main"
# ─── Security Stage ──────────────────────────────────────────
tfsec:
stage: security
image: aquasec/tfsec:latest
script:
- tfsec ${TF_ROOT} --format json --out tfsec-results.json
artifacts:
reports:
sast: tfsec-results.json
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
# ─── Plan Stage ──────────────────────────────────────────────
plan:
stage: plan
script:
- cd ${TF_ROOT}
- terraform init
-backend-config="address=${TF_ADDRESS}"
-backend-config="lock_address=${TF_ADDRESS}/lock"
-backend-config="unlock_address=${TF_ADDRESS}/lock"
-backend-config="username=gitlab-ci-token"
-backend-config="password=${CI_JOB_TOKEN}"
- terraform plan -out=plan.cache -input=false
- terraform show -no-color plan.cache > plan.txt
artifacts:
paths:
- ${TF_ROOT}/plan.cache
reports:
terraform: ${TF_ROOT}/plan.txt
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
- if: $CI_COMMIT_BRANCH == "main"
# ─── Apply Stage ─────────────────────────────────────────────
apply:
stage: apply
script:
- cd ${TF_ROOT}
- terraform init
-backend-config="address=${TF_ADDRESS}"
-backend-config="lock_address=${TF_ADDRESS}/lock"
-backend-config="unlock_address=${TF_ADDRESS}/lock"
-backend-config="username=gitlab-ci-token"
-backend-config="password=${CI_JOB_TOKEN}"
- terraform apply -input=false plan.cache
dependencies:
- plan
rules:
- if: $CI_COMMIT_BRANCH == "main"
when: manual
environment:
name: production
action: start
GitLab Managed Terraform State
GitLab provides built-in Terraform state management, eliminating the need for external state backends like S3:
# GitLab Terraform state is managed automatically via the API
# State is stored per-project and accessible via CI_JOB_TOKEN
# No S3 bucket, DynamoDB table, or external backend needed
# View states: Operate > Terraform states (under the Infrastructure menu in older GitLab versions)
# Each environment gets its own state file
# States are versioned and support locking
Environment-Specific Deployments
# Multi-environment deployment with promotion
.terraform_base:
image: hashicorp/terraform:1.7.0
before_script:
- cd terraform/environments/${CI_ENVIRONMENT_NAME}
- terraform init
deploy-dev:
extends: .terraform_base
stage: apply
environment:
name: dev
script:
- terraform apply -auto-approve -input=false
rules:
- if: $CI_COMMIT_BRANCH == "develop"
deploy-staging:
extends: .terraform_base
stage: apply
environment:
name: staging
script:
- terraform apply -auto-approve -input=false
rules:
- if: $CI_COMMIT_BRANCH == "main"
needs: ["plan-staging"]
deploy-production:
extends: .terraform_base
stage: apply
environment:
name: production
script:
- terraform apply -input=false plan.cache
rules:
- if: $CI_COMMIT_BRANCH == "main"
when: manual
needs: ["deploy-staging"]
Jenkins Pipelines
While GitHub Actions and GitLab CI dominate new projects, Jenkins remains prevalent in enterprises with existing investments. Its flexibility and plugin ecosystem make it adaptable to any infrastructure workflow.
Declarative vs Scripted Pipelines
| Aspect | Declarative Pipeline | Scripted Pipeline |
|---|---|---|
| Syntax | pipeline { } block | Full Groovy scripting |
| Complexity | Simple, structured | Flexible, complex |
| Error handling | post { } blocks | try/catch/finally |
| Best for | Standard CI/CD flows | Complex logic, dynamic stages |
| Recommendation | Use for most pipelines | Use when declarative is insufficient |
Terraform Jenkinsfile
// Jenkinsfile - Declarative Terraform Pipeline
pipeline {
agent {
docker {
image 'hashicorp/terraform:1.7.0'
// The image's entrypoint is the terraform binary; clear it so Jenkins can run sh steps
args '--entrypoint='
}
}
parameters {
choice(name: 'ENVIRONMENT', choices: ['dev', 'staging', 'production'])
booleanParam(name: 'AUTO_APPROVE', defaultValue: false)
}
environment {
AWS_CREDENTIALS = credentials('aws-terraform-role')
TF_DIR = "terraform/environments/${params.ENVIRONMENT}"
}
stages {
stage('Init') {
steps {
dir("${TF_DIR}") {
sh 'terraform init -input=false'
}
}
}
stage('Validate') {
steps {
dir("${TF_DIR}") {
sh 'terraform fmt -check -recursive'
sh 'terraform validate'
}
}
}
stage('Plan') {
steps {
dir("${TF_DIR}") {
sh 'terraform plan -out=tfplan -input=false'
archiveArtifacts artifacts: 'tfplan'
}
}
}
stage('Approval') {
when {
expression { !params.AUTO_APPROVE }
}
steps {
input message: "Apply Terraform changes to ${params.ENVIRONMENT}?",
ok: "Apply"
}
}
stage('Apply') {
steps {
dir("${TF_DIR}") {
sh 'terraform apply -input=false tfplan'
}
}
}
}
post {
always {
cleanWs()
}
failure {
slackSend channel: '#infra-alerts',
message: "Terraform pipeline FAILED for ${params.ENVIRONMENT}"
}
success {
slackSend channel: '#infra-deploys',
message: "Terraform applied to ${params.ENVIRONMENT} ✅"
}
}
}
Infrastructure Testing in Pipelines
Infrastructure testing follows a pyramid similar to application testing — fast, cheap tests at the base catching most issues, with slower, expensive tests at the top providing deeper confidence.
flowchart TB
subgraph Pyramid["Testing Pyramid"]
direction TB
E2E["🔺 End-to-End Tests<br/>(Terratest, real infrastructure)"]
Integration["🔶 Integration Tests<br/>(Policy checks, module testing)"]
Static["🟩 Static Analysis<br/>(fmt, validate, lint, scan)"]
end
E2E -.->|"Slow, expensive<br/>Run: pre-release"| Cost1["$$$ Minutes"]
Integration -.->|"Medium speed<br/>Run: every PR"| Cost2["$$ Seconds"]
Static -.->|"Fast, cheap<br/>Run: every commit"| Cost3["$ Milliseconds"]
Static Analysis Tools
# terraform fmt - Check code formatting
terraform fmt -check -recursive -diff
# Returns exit code 1 if files need formatting
# terraform validate - Check syntax and internal consistency
terraform init -backend=false
terraform validate
# Catches: missing variables, invalid references, type errors
# tflint - Terraform-specific linting
# Install: brew install tflint (or download binary)
tflint --init # Download plugins
tflint --recursive --format=compact
# Catches: deprecated syntax, invalid instance types,
# naming conventions, unused variables
Security Scanning
# tfsec - Find security vulnerabilities in Terraform
tfsec terraform/ --format json --out results.json
# Checks: open security groups, unencrypted storage,
# public access, missing logging
# Checkov - Policy-as-code scanner (multi-framework)
checkov -d terraform/ --framework terraform --output sarif
# Checks: CIS benchmarks, SOC2, HIPAA, PCI-DSS compliance
# Trivy - Container + IaC scanning
trivy config terraform/
# Checks: misconfigurations across Terraform, Kubernetes, Docker
# KICS - Keeping Infrastructure as Code Secure
docker run -v $(pwd):/path checkmarx/kics:latest scan -p /path/terraform
# Checks: 2000+ queries across multiple frameworks
| Tool | Focus | Speed | Frameworks | Cost |
|---|---|---|---|---|
| tfsec | Security best practices | Very fast | Terraform only | Free |
| Checkov | Compliance frameworks | Fast | Terraform, K8s, Docker, ARM | Free / Paid |
| Trivy | Vulnerabilities + misconfig | Fast | Multi-framework + containers | Free |
| KICS | Broad coverage | Medium | 15+ frameworks | Free |
| Sentinel | Enterprise policy | Fast | Terraform (HCP) | Paid (TFC) |
Policy as Code with OPA/Conftest
# Convert Terraform plan to JSON for policy evaluation
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
# Run OPA/Conftest policies against the plan
conftest test tfplan.json --policy policy/ --output json
# Example: Deny resources without required tags
# policy/terraform.rego - OPA policy for Terraform
package terraform
import future.keywords.in
# Deny resources without required tags
deny[msg] {
resource := input.resource_changes[_]
resource.change.actions[_] == "create"
tags := resource.change.after.tags
required_tags := {"Environment", "Team", "CostCenter"}
missing := required_tags - {key | tags[key]}
count(missing) > 0
msg := sprintf("Resource %s missing tags: %v", [resource.address, missing])
}
# Deny overly permissive security groups
deny[msg] {
resource := input.resource_changes[_]
resource.type == "aws_security_group_rule"
resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
resource.change.after.type == "ingress"
msg := sprintf("Security group %s allows ingress from 0.0.0.0/0", [resource.address])
}
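To make the logic of the first rule easier to follow, here is a rough Python equivalent of the required-tags check, run against the same `terraform show -json` structure. It is a sketch for reading alongside the Rego, not a replacement for Conftest; `missing_tag_violations` is a hypothetical name, and this version treats an absent or null `tags` attribute as an empty tag set.

```python
from typing import Any, Dict, List

REQUIRED_TAGS = {"Environment", "Team", "CostCenter"}


def missing_tag_violations(plan: Dict[str, Any]) -> List[str]:
    """Return one message per created resource that lacks a required tag."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        # Only resources being created are checked, as in the Rego rule.
        if "create" not in change.get("actions", []):
            continue
        tags = (change.get("after") or {}).get("tags") or {}
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            violations.append(
                f"Resource {rc['address']} missing tags: {missing}"
            )
    return violations
```

A pipeline step could load `tfplan.json` with `json.load` and fail the job whenever the returned list is non-empty.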
Integration Testing with Terratest
# Terratest runs real infrastructure and validates it
# Written in Go, deploys then destroys test infrastructure
# Install Terratest
# go get github.com/gruntwork-io/terratest/modules/terraform
# Run tests (deploys real resources!)
cd test/
go test -v -timeout 30m ./...
# Cost estimation with Infracost
# Install: brew install infracost
infracost breakdown --path terraform/
infracost diff --path terraform/ --compare-to infracost-base.json
# Infracost in CI - add cost comments to PRs
# .github/workflows/infracost.yml snippet
- name: Generate Infracost diff
run: |
infracost diff \
--path=terraform/ \
--format=json \
--compare-to=/tmp/infracost-base.json \
--out-file=/tmp/infracost-diff.json
- name: Post Infracost comment
run: |
infracost comment github \
--path=/tmp/infracost-diff.json \
--repo=$GITHUB_REPOSITORY \
--pull-request=${{ github.event.pull_request.number }} \
--github-token=${{ secrets.GITHUB_TOKEN }}
# Result: "This PR will increase monthly costs by $47.20 (+12%)"
GitOps for Infrastructure
GitOps extends CI/CD by making Git the single source of truth for both application and infrastructure state. Instead of pipelines pushing changes to clusters, GitOps controllers pull desired state from Git and reconcile the live environment continuously.
Push-Based vs Pull-Based GitOps
| Aspect | Push-Based (Traditional CI/CD) | Pull-Based (GitOps) |
|---|---|---|
| Who applies | CI server pushes to cluster | In-cluster agent pulls from Git |
| Credentials | CI has cluster credentials | Agent has Git read access only |
| Drift detection | Manual or scheduled scans | Continuous reconciliation |
| Security | Broader attack surface | Minimal external access |
| Tools | GitHub Actions, GitLab CI | ArgoCD, Flux |
flowchart LR
Dev[Developer] -->|"git push"| Git[Git Repository]
Git -->|"watches"| Agent["GitOps Agent<br/>ArgoCD / Flux"]
Agent -->|"reconciles"| Cluster[Kubernetes Cluster]
Cluster -->|"reports status"| Agent
Agent -->|"updates status"| Git
style Git fill:#e8f5e9
style Agent fill:#fff3e0
style Cluster fill:#e3f2fd
ArgoCD
ArgoCD is the most popular GitOps controller for Kubernetes. It watches Git repositories and automatically synchronizes cluster state to match the desired configuration.
# argocd-application.yaml - ArgoCD Application CRD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: production-infrastructure
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/myorg/infrastructure.git
targetRevision: main
path: kubernetes/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Revert manual changes in cluster
syncOptions:
- CreateNamespace=true
- ApplyOutOfSyncOnly=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
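The retry block in the manifest above describes an exponential backoff. As a rough illustration, assuming each delay is the base duration multiplied by the factor once per previous attempt and capped at the maximum, the wait times can be computed as follows. `backoff_schedule` is a hypothetical helper; check the Argo CD documentation for the exact semantics.

```python
from typing import List


def backoff_schedule(base_seconds: float, factor: float,
                     max_seconds: float, limit: int) -> List[float]:
    """Delays before each retry: base * factor**attempt, capped at max."""
    return [
        min(base_seconds * factor ** attempt, max_seconds)
        for attempt in range(limit)
    ]


if __name__ == "__main__":
    # duration: 5s, factor: 2, maxDuration: 3m, limit: 5
    print(backoff_schedule(5, 2, 180, 5))
```

Under that assumption the values from the manifest give delays of 5, 10, 20, 40, and 80 seconds, so the 3 minute cap would only matter with a higher retry limit.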
# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Access ArgoCD UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
# Create application via CLI
argocd app create production-infra \
--repo https://github.com/myorg/infrastructure.git \
--path kubernetes/overlays/production \
--dest-server https://kubernetes.default.svc \
--dest-namespace production \
--sync-policy automated \
--auto-prune \
--self-heal
Flux
Flux is a CNCF graduated project that takes a more modular, composable approach to GitOps with separate controllers for different responsibilities:
# flux-source.yaml - GitRepository source
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: infrastructure
namespace: flux-system
spec:
interval: 1m
url: https://github.com/myorg/infrastructure.git
ref:
branch: main
secretRef:
name: github-credentials
---
# flux-kustomization.yaml - Kustomization (what to deploy)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: production-infra
namespace: flux-system
spec:
interval: 5m
path: ./kubernetes/overlays/production
prune: true
sourceRef:
kind: GitRepository
name: infrastructure
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: web-api
namespace: production
timeout: 3m
---
# flux-helmrelease.yaml - Helm chart deployment
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
name: nginx-ingress
namespace: ingress-system
spec:
interval: 10m
chart:
spec:
chart: ingress-nginx
version: "4.x"
sourceRef:
kind: HelmRepository
name: ingress-nginx
values:
controller:
replicaCount: 3
service:
type: LoadBalancer
GitOps for Terraform
While ArgoCD and Flux handle Kubernetes resources, Terraform changes need specialized GitOps tools:
| Tool | Type | How It Works | Best For |
|---|---|---|---|
| Atlantis | Self-hosted | Webhook-triggered plan/apply on PRs | Teams wanting full control |
| Spacelift | SaaS | Managed Terraform runner with policies | Enterprise with compliance needs |
| env0 | SaaS | Self-service IaC with guardrails | Platform teams enabling developers |
| Terraform Cloud | SaaS | HashiCorp's managed workflow | Teams already in HCP ecosystem |
| Crossplane | K8s-native | Cloud resources as K8s CRDs | Teams standardizing on K8s APIs |
# atlantis.yaml - Atlantis repo-level config (custom workflows and apply_requirements overrides must be allowed in the server-side repos.yaml)
version: 3
projects:
- name: production-vpc
dir: terraform/environments/production/vpc
workspace: default
autoplan:
when_modified: ["*.tf", "*.tfvars", "modules/**/*.tf"]
enabled: true
apply_requirements: [approved, mergeable]
workflow: production
workflows:
production:
plan:
steps:
- init
- run: tfsec .
- run: infracost breakdown --path . --format json --out-file /tmp/infracost.json
- plan:
extra_args: ["-var-file=production.tfvars"]
apply:
steps:
- apply
Deployment Strategies
Infrastructure changes require careful deployment strategies to minimize downtime and provide rollback capabilities. Unlike application deployments where you can run multiple versions simultaneously, infrastructure changes often affect shared resources.
Blue-Green Deployments for Infrastructure
Blue-green deployments maintain two identical environments. Traffic switches from the current (blue) to the new (green) once validation passes. For infrastructure, this typically applies to compute layers rather than databases:
flowchart TB
LB[Load Balancer / DNS]
subgraph Blue["Blue Environment (Current)"]
B1[ASG v1.2]
B2[App Servers]
end
subgraph Green["Green Environment (New)"]
G1[ASG v1.3]
G2[App Servers]
end
DB[(Shared Database)]
LB -->|"100% traffic"| Blue
LB -.->|"0% traffic<br/>(switch after validation)"| Green
Blue --> DB
Green --> DB
style Blue fill:#e3f2fd
style Green fill:#e8f5e9
# blue-green.tf - Blue-Green deployment with Terraform
# Using create_before_destroy lifecycle
resource "aws_launch_template" "app" {
name_prefix = "app-"
image_id = var.ami_id # New AMI triggers replacement
instance_type = "t3.medium"
lifecycle {
create_before_destroy = true
}
}
resource "aws_autoscaling_group" "app" {
name = "app-${aws_launch_template.app.latest_version}"
desired_capacity = 3
max_size = 6
min_size = 3
target_group_arns = [aws_lb_target_group.app.arn]
vpc_zone_identifier = var.private_subnet_ids
launch_template {
id = aws_launch_template.app.id
version = "$Latest"
}
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 66
instance_warmup = 300
}
}
lifecycle {
create_before_destroy = true
}
}
Canary Deployments
Canary deployments route a small percentage of traffic to the new infrastructure, gradually increasing it as confidence builds:
# canary-rollout.yaml - Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: web-api
namespace: production
spec:
replicas: 10
strategy:
canary:
steps:
- setWeight: 5 # 5% traffic to canary
- pause: {duration: 5m}
- analysis:
templates:
- templateName: success-rate
- setWeight: 25 # 25% if metrics pass
- pause: {duration: 10m}
- analysis:
templates:
- templateName: success-rate
- setWeight: 50 # 50% traffic
- pause: {duration: 15m}
- setWeight: 100 # Full rollout
canaryService: web-api-canary
stableService: web-api-stable
trafficRouting:
nginx:
stableIngress: web-api-ingress
---
# analysis-template.yaml - Automated rollback criteria
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
metrics:
- name: success-rate
interval: 60s
successCondition: result[0] >= 0.95
provider:
prometheus:
address: http://prometheus:9090
query: |
sum(rate(http_requests_total{service="web-api-canary",code=~"2.."}[5m]))
/
sum(rate(http_requests_total{service="web-api-canary"}[5m]))
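The analysis step boils down to a ratio compared against a threshold. The Python sketch below mirrors that decision so the pass and fail cases are easy to see; `success_rate` and `canary_passes` are hypothetical helpers, and treating "no traffic" as a failed check is an assumption made here, not documented Argo Rollouts behaviour.

```python
import math


def success_rate(requests_2xx: float, requests_total: float) -> float:
    """Fraction of requests that returned a 2xx status."""
    if requests_total == 0:
        # No traffic means there is no evidence either way.
        return math.nan
    return requests_2xx / requests_total


def canary_passes(rate: float, threshold: float = 0.95) -> bool:
    """Mirror of successCondition: result[0] >= 0.95."""
    # Any comparison with NaN is False, so zero traffic does not pass.
    return rate >= threshold


if __name__ == "__main__":
    print(canary_passes(success_rate(960, 1000)))  # healthy canary
    print(canary_passes(success_rate(900, 1000)))  # too many errors
```

Keeping the threshold in one place makes it straightforward to tighten it for later, higher-traffic steps of the rollout.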
Rollback Strategies
# Git-based rollback (preferred)
# Revert the commit and let the pipeline re-run
git revert HEAD --no-edit
git push origin main
# Pipeline runs with previous state → infrastructure reverts
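# Terraform rollback by re-applying a known-good input
# (Terraform has no built-in undo: pin the previous version and apply again.
# -target limits the run to one module and is meant for exceptional cases.)
terraform apply -target=module.networking -var="version=1.2.0"
# ArgoCD rollback (automated sync must be disabled for the app first)
argocd app rollback production-infra
# Or: set targetRevision to a specific commit
argocd app set production-infra --revision abc1234
# Flux rollback - Flux follows Git, so revert the commit as shown above,
# then trigger an immediate reconciliation instead of waiting for the interval
flux reconcile kustomization production-infra --with-source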
Pipeline Security & Best Practices
CI/CD pipelines are high-value targets — they have credentials to modify production infrastructure. Securing the pipeline itself is as important as securing the infrastructure it manages.
Secret Management in Pipelines
# Best practice: OIDC > Vault > Repository Secrets > Environment Variables
# 1. OIDC (best) - No stored secrets at all
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ vars.AWS_ROLE_ARN }} # Not a secret!
aws-region: us-east-1
# 2. HashiCorp Vault integration
- name: Import Secrets from Vault
uses: hashicorp/vault-action@v3
with:
url: https://vault.company.com
method: jwt
role: ci-terraform
secrets: |
secret/data/terraform/aws access_key | AWS_ACCESS_KEY_ID ;
secret/data/terraform/aws secret_key | AWS_SECRET_ACCESS_KEY
# 3. GitHub repository/environment secrets (acceptable)
- name: Configure credentials
env:
DB_PASSWORD: ${{ secrets.PROD_DB_PASSWORD }} # Never log this!
Pipeline Hardening
# Pin action versions to full SHA (not tags which can be moved)
# BAD: uses: actions/checkout@v4
# GOOD: uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
# Limit permissions to minimum required
permissions:
contents: read # Only read repo
id-token: write # Only for OIDC
pull-requests: write # Only for PR comments
# Everything else implicitly denied
# Require signed commits
# Settings > Branches > Branch protection:
# ✓ Require signed commits
# ✓ Require linear history
# Use step-level environment isolation
- name: Terraform Apply
env:
TF_VAR_sensitive_value: ${{ secrets.VALUE }}
run: terraform apply -input=false tfplan
# Secret only available in this step, not globally
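The SHA-pinning rule above is easy to check mechanically. The following Python sketch scans workflow text for `uses:` references that are not pinned to a 40-character commit SHA. It is a simplified line-based check under stated assumptions (unquoted references, no YAML parsing), and `unpinned_actions` is a hypothetical helper rather than an existing tool.

```python
import re
from typing import List

# Matches "uses: owner/repo@ref", optionally preceded by a list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> List[str]:
    """Return action references whose version is not a full commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1)
        # Local actions and docker images are out of scope for this sketch.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not SHA_RE.match(version):
            findings.append(ref)
    return findings
```

Running this over every file in `.github/workflows/` as an early pipeline step would flag tag-pinned actions before they are merged.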
Drift Detection Pipelines
Scheduled pipelines detect when live infrastructure drifts from its Terraform-defined state (due to manual console changes, other tools, or AWS service updates):
# .github/workflows/drift-detection.yml
name: "Infrastructure Drift Detection"
on:
schedule:
- cron: "0 8 * * 1-5" # Weekdays at 8am UTC
workflow_dispatch: {}
jobs:
detect-drift:
runs-on: ubuntu-latest
strategy:
matrix:
environment: [dev, staging, production]
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ vars.AWS_ROLE_ARN }}
aws-region: us-east-1
- name: Detect Drift
id: drift
working-directory: terraform/environments/${{ matrix.environment }}
run: |
terraform init
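set +e
terraform plan -detailed-exitcode -input=false 2>&1 | tee drift-output.txt
EXIT_CODE=${PIPESTATUS[0]} # exit code of terraform, not of tee
set -e
if [ "$EXIT_CODE" -eq 2 ]; then
echo "drift_detected=true" >> "$GITHUB_OUTPUT"
elif [ "$EXIT_CODE" -eq 0 ]; then
echo "drift_detected=false" >> "$GITHUB_OUTPUT"
else
exit "$EXIT_CODE" # the plan itself failed; do not report "no drift"
fi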
- name: Alert on Drift
if: steps.drift.outputs.drift_detected == 'true'
run: |
curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
-H "Content-type: application/json" \
-d '{
"text": "⚠️ Infrastructure drift detected in ${{ matrix.environment }}!\nReview: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
}'
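The drift check relies on the three exit codes that `terraform plan -detailed-exitcode` returns: 0 for no changes, 1 for an error, and 2 for a successful plan with changes present. The Python sketch below spells out that mapping; `PlanResult`, `classify_plan_exit`, and `drift_detected` are hypothetical names, and treating any unexpected code as an error is an assumption made for safety.

```python
from enum import Enum


class PlanResult(Enum):
    NO_CHANGES = 0       # live infrastructure matches the configuration
    ERROR = 1            # the plan could not be produced
    CHANGES_PRESENT = 2  # plan succeeded and found a difference


def classify_plan_exit(code: int) -> PlanResult:
    """Map a -detailed-exitcode status to a result, defaulting to ERROR."""
    try:
        return PlanResult(code)
    except ValueError:
        return PlanResult.ERROR


def drift_detected(code: int) -> bool:
    """Only exit code 2 means drift; an error is not the same as no drift."""
    return classify_plan_exit(code) is PlanResult.CHANGES_PRESENT
```

Keeping errors separate from "no drift" matters because a plan that fails, for example due to expired credentials, would otherwise be reported as a clean environment.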
Hands-On Exercises
Build a GitHub Actions Terraform Pipeline
Create a complete CI/CD workflow for Terraform that runs on pull requests and pushes to main.
- Create .github/workflows/terraform.yml with format check, validate, and plan jobs
- Configure OIDC authentication to your cloud provider (use a free tier account)
- Add plan output as a PR comment using actions/github-script
- Add a conditional apply job that only runs on main branch merges
- Create a GitHub Environment with required reviewers for the apply stage
- Test by opening a PR that adds a new resource — verify the plan appears as a comment
Add Security Scanning to the Pipeline
Enhance the pipeline with automated security checks that catch vulnerabilities before they reach production.
- Add a tfsec job that runs in parallel with validation
- Add Checkov scanning with SARIF output for GitHub Security tab integration
- Create a custom OPA policy (using Conftest) that denies resources without required tags
- Add Infracost to post cost estimates as PR comments
- Configure the pipeline to fail on critical security findings but warn on medium ones
- Deliberately introduce a security issue (e.g., open security group) and verify the pipeline catches it
Implement Environment Promotion (Dev → Staging → Prod)
Build a multi-environment pipeline with automatic promotion through lower environments and manual gates for production.
- Structure your Terraform code with separate environments/dev, environments/staging, environments/production directories sharing common modules
- Create a pipeline that automatically applies to dev on every main branch push
- Add a staging job that triggers after dev succeeds (with a smoke test between)
- Add a production job with manual approval gate (GitHub Environment protection rules)
- Implement a rollback mechanism: if staging smoke tests fail, automatically revert dev
- Add Slack notifications for each promotion step
Set Up ArgoCD for Kubernetes GitOps
Deploy ArgoCD and configure it to manage Kubernetes resources from a Git repository with automatic sync and self-healing.
- Install ArgoCD on a local Kubernetes cluster (minikube, kind, or k3d)
- Create a Git repository with Kubernetes manifests organized using Kustomize overlays (base + production)
- Create an ArgoCD Application that watches the repository and auto-syncs
- Enable auto-prune and self-heal in the sync policy
- Test self-healing: manually kubectl delete a resource and watch ArgoCD recreate it
- Test drift correction: kubectl edit a deployment’s replicas and watch ArgoCD revert it
- Set up an ApplicationSet to manage multiple environments from a single template
Conclusion & Next Steps
CI/CD pipelines transform infrastructure management from a manual, error-prone process into an automated, auditable, and repeatable workflow. By establishing the pipeline as the single path to production, teams gain confidence to deploy frequently while maintaining safety through automated testing, security scanning, and approval gates.
The key principles to carry forward:
- Pipeline as gatekeeper — no human touches production directly
- Test pyramid — fast static analysis catches most issues; expensive integration tests provide deep confidence
- Saved plan files — what you review is exactly what gets applied
- OIDC over stored secrets — eliminate long-lived credentials from pipelines
- GitOps for Kubernetes — pull-based reconciliation provides self-healing and drift correction
- Drift detection — scheduled scans catch unauthorized manual changes
Next in the Series
In Part 13: Monitoring & Observability, we will explore Prometheus, Grafana, centralized logging, distributed tracing, alerting strategies, and SLOs/SLIs for infrastructure — ensuring you can detect, diagnose, and resolve issues before they impact users.