Part 12: CI/CD Pipelines for Infrastructure

Why CI/CD for Infrastructure

Infrastructure management has historically relied on manual processes — engineers SSH into servers, click through cloud consoles, and run scripts by hand. This approach creates snowflake environments that drift from their intended state, introduces human error at every step, and makes teams afraid to deploy changes because they cannot predict the outcome.

CI/CD pipelines transform infrastructure management by establishing a single, automated path from code commit to production. Every change is version-controlled, tested, reviewed, and deployed through the same repeatable process. The pipeline becomes the gatekeeper — no human touches production directly.

Key Insight: CI/CD for infrastructure is not just about automation; it is about establishing confidence. When every change must pass through lint checks, security scans, plan reviews, and approval gates before reaching production, teams deploy more frequently with less risk. The pipeline is both the safety net and the accelerator.
                        

The Pipeline as Single Path to Production

In a mature infrastructure organization, the CI/CD pipeline is the only way changes reach production. No SSH access, no console clicks, no ad-hoc scripts. This principle ensures every change is auditable, repeatable, and reversible.

Manual vs Automated Infrastructure Deployment

flowchart LR
    subgraph Manual["❌ Manual Process"]
        direction TB
        M1[Engineer writes code] --> M2[SSH to server]
        M2 --> M3[Run commands manually]
        M3 --> M4[Hope nothing breaks]
        M4 --> M5[No audit trail]
    end

    subgraph Automated["✅ CI/CD Pipeline"]
        direction TB
        A1[Engineer pushes code] --> A2[Automated lint & validate]
        A2 --> A3[Security scan & test]
        A3 --> A4[Plan review & approval]
        A4 --> A5[Automated apply]
        A5 --> A6[Full audit trail]
    end

The benefits of pipeline-driven infrastructure are immediate and compounding:

Aspect	Manual Deployment	CI/CD Pipeline
Speed	Hours to days	Minutes to hours
Consistency	Varies by engineer	Identical every time
Audit trail	Sparse or missing	Complete git + pipeline history
Rollback	Manual, error-prone	Revert commit, re-run pipeline
Risk	High (unknown state)	Low (tested, reviewed)
Scale	Limited by team size	Limited only by runner capacity

CI/CD Fundamentals

Before diving into specific tools, it is essential to understand the foundational concepts that all infrastructure CI/CD pipelines share. These principles apply regardless of whether you use GitHub Actions, GitLab CI, Jenkins, or any other platform.

Continuous Integration vs Continuous Delivery vs Continuous Deployment

These three terms are often confused but represent distinct levels of automation maturity:

Concept	Definition	Infrastructure Example
Continuous Integration (CI)	Merge code frequently, run automated tests on every commit	Terraform validate, lint, and plan on every PR
Continuous Delivery (CD)	Code is always in a deployable state; deployment requires manual approval	Terraform plan is ready; human approves apply
Continuous Deployment (CD)	Every passing change is automatically deployed to production	Merged Terraform automatically applies (rare for infra)

Infrastructure Best Practice: Most infrastructure teams use Continuous Delivery rather than Continuous Deployment. Infrastructure changes can be destructive (deleting databases, removing networks), so a human approval gate before terraform apply is standard practice. The pipeline automates everything up to the apply step.
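To make the risk of destructive changes concrete, here is a minimal Python sketch of a check that a pipeline could run before the approval gate. It assumes the plan has been exported with terraform show -json, whose resource_changes entries carry a change.actions list. The function name, sample addresses, and the idea of surfacing the result to the approver are illustrative assumptions, not part of any tool.

```python
import json


def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace.

    A replacement appears in the plan JSON as both "delete" and "create"
    in the same actions list, so checking for "delete" covers both cases.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical, heavily trimmed plan JSON used only for illustration.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}}
  ]
}
""")

if __name__ == "__main__":
    for address in destructive_changes(SAMPLE_PLAN):
        print(f"destructive change: {address}")
```

A pipeline might print this list next to the approval prompt so the reviewer sees deletions and replacements first.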
                        

Pipeline Stages for Infrastructure

A well-designed infrastructure pipeline follows a progression of increasingly expensive and risky stages. Each stage acts as a gate — if it fails, subsequent stages do not run:

Infrastructure CI/CD Pipeline Stages

flowchart LR
    A[Format Check] --> B[Validate]
    B --> C[Lint & Security]
    C --> D[Plan]
    D --> E[Cost Estimate]
    E --> F[Manual Approval]
    F --> G[Apply]
    G --> H[Smoke Test]

    style A fill:#e8f5e9
    style B fill:#e8f5e9
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#fff3e0
    style F fill:#ffebee
    style G fill:#ffebee
    style H fill:#e3f2fd
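The gating behaviour in the diagram can be sketched in a few lines of Python. This is a toy model, not how any CI platform is implemented: each stage is a hypothetical callable that returns True on success, and the runner stops at the first failure so later, more expensive stages never start.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that actually ran, including the
    one that failed (if any).
    """
    executed = []
    for name, check in stages:
        executed.append(name)
        if not check():
            break  # gate closed: skip every remaining stage
    return executed


if __name__ == "__main__":
    # Hypothetical stage results: the security scan fails, so the plan
    # and apply stages are never reached.
    stages = [
        ("format", lambda: True),
        ("validate", lambda: True),
        ("lint-security", lambda: False),
        ("plan", lambda: True),
        ("apply", lambda: True),
    ]
    print(run_pipeline(stages))
```

Ordering the cheap checks first means most mistakes are rejected before any cloud credentials are used.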

Artifacts and State Passing

Infrastructure pipelines generate artifacts that must be passed between stages. The most critical is the Terraform plan file — it ensures the exact changes reviewed are the exact changes applied:

# Stage 1: Generate plan and save as artifact
terraform plan -out=tfplan -input=false

# Stage 2 (after approval): Apply the EXACT saved plan
terraform apply -input=false tfplan

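Critical Warning: Never run terraform apply without a saved plan file in CI/CD. Between the plan and apply stages, another team member might merge changes, or the real infrastructure might change, so a fresh apply could perform actions nobody reviewed. Applying the saved plan file ensures that only the reviewed changes are executed, and Terraform refuses to apply a saved plan that has gone stale because the state changed after it was created.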
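One optional extra safeguard, sketched here in Python, is to record a checksum of the plan file when it is reviewed and compare it again just before apply. This is an assumption about how a team might harden artifact passing, not something Terraform or any CI platform requires; the function names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str) -> str:
    """Hash a file in chunks so large plan files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_matches(data: bytes, reviewed_digest: str) -> bool:
    """True if the plan bytes are identical to what was reviewed."""
    return sha256_hex(data) == reviewed_digest


if __name__ == "__main__":
    reviewed = sha256_hex(b"example plan bytes")
    print(plan_matches(b"example plan bytes", reviewed))
    print(plan_matches(b"tampered plan bytes", reviewed))
```

In a real pipeline the plan stage would publish the digest alongside the tfplan artifact, and the apply stage would call sha256_of_file on the downloaded file before running terraform apply.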
                        

GitHub Actions for Infrastructure

GitHub Actions is a popular CI/CD platform for infrastructure teams whose code already lives in GitHub repositories. Its tight integration with pull requests, environments, and OIDC authentication makes it a good fit for Terraform workflows.

Workflow File Structure

GitHub Actions workflows live in .github/workflows/ and are defined in YAML. Each workflow contains triggers, jobs, and steps:

# .github/workflows/terraform.yml
# Complete Terraform CI/CD pipeline for AWS infrastructure
name: "Terraform Infrastructure"

on:
  push:
    branches: [main]
    paths: ["terraform/**"]
  pull_request:
    branches: [main]
    paths: ["terraform/**"]
  workflow_dispatch:
    inputs:
      environment:
        description: "Target environment"
        required: true
        type: choice
        options: [dev, staging, production]

permissions:
  id-token: write    # Required for OIDC
  contents: read     # Required to checkout
  pull-requests: write  # Required for PR comments

env:
  TF_VERSION: "1.7.0"
  AWS_REGION: "us-east-1"
  WORKING_DIR: "terraform/environments/dev"

OIDC Authentication (No Long-Lived Credentials)

Modern CI/CD pipelines authenticate to cloud providers using OIDC (OpenID Connect) instead of storing long-lived access keys as secrets. This eliminates credential rotation burden and reduces the blast radius of a compromised pipeline:

# OIDC authentication - no stored credentials needed
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-terraform
          role-session-name: github-actions-${{ github.run_id }}
          aws-region: ${{ env.AWS_REGION }}

      # For Azure:
      # - name: Azure Login via OIDC
      #   uses: azure/login@v2
      #   with:
      #     client-id: ${{ secrets.AZURE_CLIENT_ID }}
      #     tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      #     subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      # For GCP:
      # - name: GCP Auth via OIDC
      #   uses: google-github-actions/auth@v2
      #   with:
      #     workload_identity_provider: projects/123456/locations/global/workloadIdentityPools/github/providers/my-repo
      #     service_account: terraform@my-project.iam.gserviceaccount.com
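On the cloud side, the role's trust policy typically restricts which workflows may assume it by matching the sub (subject) claim of the OIDC token, which GitHub formats like repo:OWNER/REPO:ref:refs/heads/BRANCH or repo:OWNER/REPO:environment:NAME. The Python sketch below only imitates that matching with shell-style wildcards to show the idea; it is not how AWS evaluates conditions, and the patterns are hypothetical examples built from the repository name used elsewhere in this part.

```python
from fnmatch import fnmatchcase
from typing import List


def subject_allowed(sub: str, allowed_patterns: List[str]) -> bool:
    """Return True if the token subject matches any allowed pattern.

    fnmatchcase gives case-sensitive '*' and '?' wildcards, which is a
    rough stand-in for a StringLike condition in a trust policy.
    """
    return any(fnmatchcase(sub, pattern) for pattern in allowed_patterns)


# Hypothetical allow-list: only the main branch and the production
# environment of one repository may assume the role.
ALLOWED = [
    "repo:myorg/infrastructure:ref:refs/heads/main",
    "repo:myorg/infrastructure:environment:production",
]

if __name__ == "__main__":
    print(subject_allowed("repo:myorg/infrastructure:ref:refs/heads/main", ALLOWED))
    print(subject_allowed("repo:myorg/infrastructure:pull_request", ALLOWED))
    print(subject_allowed("repo:someone-else/infrastructure:ref:refs/heads/main", ALLOWED))
```

Keeping the patterns this narrow is what limits the blast radius: a workflow running from a fork or a feature branch presents a different subject and is refused.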

Complete Terraform Pipeline

This GitHub Actions workflow handles the full Terraform lifecycle: format checking, validation, security scanning, planning, and conditional apply. The account ID, role name, region, and directory paths are placeholders to replace with your own values:

# .github/workflows/terraform-complete.yml
name: "Terraform CI/CD"

on:
  push:
    branches: [main]
    paths: ["terraform/**"]
  pull_request:
    branches: [main]
    paths: ["terraform/**"]

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  # ─── Stage 1: Format & Validate ────────────────────────────
  validate:
    name: "Format & Validate"
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"

      - name: Terraform Format Check
        run: terraform fmt -check -recursive -diff

      - name: Terraform Init
        run: terraform init -backend=false

      - name: Terraform Validate
        run: terraform validate

  # ─── Stage 2: Security Scan ─────────────────────────────────
  security:
    name: "Security Scan"
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v4

      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.3
        with:
          working_directory: terraform/
          soft_fail: false

      - name: Run Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: terraform/
          framework: terraform
          output_format: sarif

  # ─── Stage 3: Plan ─────────────────────────────────────────
  plan:
    name: "Terraform Plan"
    runs-on: ubuntu-latest
    needs: [validate, security]
    defaults:
      run:
        working-directory: terraform/environments/production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform
          aws-region: us-east-1

      - name: Terraform Init
        run: terraform init

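      - name: Terraform Plan
        id: plan
        run: |
          # Without pipefail, tee's exit code would mask a failed plan
          set -o pipefail
          terraform plan -out=tfplan -input=false -no-color 2>&1 | tee plan-output.txt
          # Multiline step outputs use the name<<DELIMITER ... DELIMITER syntax
          echo "plan_output<<EOF" >> "$GITHUB_OUTPUT"
          cat plan-output.txt >> "$GITHUB_OUTPUT"
          echo "EOF" >> "$GITHUB_OUTPUT"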

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: terraform/environments/production/tfplan
          retention-days: 5

      - name: Post Plan to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        env:
          # Pass the plan through the environment so its contents are never
          # interpolated into the script source (backticks would break it)
          PLAN: ${{ steps.plan.outputs.plan_output }}
        with:
          script: |
            const output = `#### Terraform Plan 📖
            \`\`\`hcl
            ${process.env.PLAN}
            \`\`\`
            *Pushed by: @${{ github.actor }}, Action: \`${{ github.event_name }}\`*`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

  # ─── Stage 4: Apply (main branch only) ─────────────────────
  apply:
    name: "Terraform Apply"
    runs-on: ubuntu-latest
    needs: plan
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment:
      name: production
      url: https://console.aws.amazon.com
    defaults:
      run:
        working-directory: terraform/environments/production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform
          aws-region: us-east-1

      - name: Terraform Init
        run: terraform init

      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: terraform/environments/production/

      - name: Terraform Apply
        run: terraform apply -input=false tfplan

Environment Protection Rules

GitHub Environments provide approval gates, deployment restrictions, and secrets scoping:

# Environment configuration (set in GitHub repo settings):
# Settings > Environments > production:
#   - Required reviewers: [lead-engineer, platform-team]
#   - Wait timer: 5 minutes
#   - Deployment branches: main only
#   - Environment secrets: AWS_ROLE_ARN (scoped to production)

# In workflow, reference the environment:
jobs:
  deploy-production:
    environment:
      name: production
      url: https://app.example.com
    # This job will pause and wait for approval
    # before any steps execute

GitLab CI for Infrastructure

GitLab CI offers a powerful integrated experience with built-in Terraform state management, merge request pipelines, and environment tracking — all within a single platform.

Pipeline Structure

# .gitlab-ci.yml - Complete Terraform pipeline
image:
  name: hashicorp/terraform:1.7.0
  entrypoint: [""]

variables:
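  # CI_ENVIRONMENT_NAME is only defined in jobs that declare `environment:`,
  # so the validate, security, and plan jobs need an explicit path
  TF_ROOT: "terraform/environments/production"
  TF_STATE_NAME: "production"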
  TF_ADDRESS: "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/${TF_STATE_NAME}"

stages:
  - validate
  - security
  - plan
  - apply
  - verify

cache:
  key: "${CI_COMMIT_REF_SLUG}"
  paths:
    - ${TF_ROOT}/.terraform/

# ─── Validate Stage ──────────────────────────────────────────
fmt-check:
  stage: validate
  script:
    - cd ${TF_ROOT}
    - terraform fmt -check -recursive -diff
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

validate:
  stage: validate
  script:
    - cd ${TF_ROOT}
    - terraform init -backend=false
    - terraform validate
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

# ─── Security Stage ──────────────────────────────────────────
tfsec:
  stage: security
  image: aquasec/tfsec:latest
  script:
    - tfsec ${TF_ROOT} --format json --out tfsec-results.json
  artifacts:
    reports:
      sast: tfsec-results.json
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

# ─── Plan Stage ──────────────────────────────────────────────
plan:
  stage: plan
  script:
    - cd ${TF_ROOT}
    - terraform init
        -backend-config="address=${TF_ADDRESS}"
        -backend-config="lock_address=${TF_ADDRESS}/lock"
        -backend-config="unlock_address=${TF_ADDRESS}/lock"
        -backend-config="username=gitlab-ci-token"
        -backend-config="password=${CI_JOB_TOKEN}"
    - terraform plan -out=plan.cache -input=false
    - terraform show -no-color plan.cache > plan.txt
  artifacts:
    paths:
      - ${TF_ROOT}/plan.cache
    reports:
      terraform: ${TF_ROOT}/plan.txt
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

# ─── Apply Stage ─────────────────────────────────────────────
apply:
  stage: apply
  script:
    - cd ${TF_ROOT}
    - terraform init
        -backend-config="address=${TF_ADDRESS}"
        -backend-config="lock_address=${TF_ADDRESS}/lock"
        -backend-config="unlock_address=${TF_ADDRESS}/lock"
        -backend-config="username=gitlab-ci-token"
        -backend-config="password=${CI_JOB_TOKEN}"
    - terraform apply -input=false plan.cache
  dependencies:
    - plan
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
  environment:
    name: production
    action: start

GitLab Managed Terraform State

GitLab provides built-in Terraform state management, eliminating the need for external state backends like S3:

# GitLab Terraform state is managed automatically via the API
# State is stored per-project and accessible via CI_JOB_TOKEN
# No S3 bucket, DynamoDB table, or external backend needed

# View states: Operate > Terraform states (Infrastructure > Terraform in older GitLab versions)
# Each environment gets its own state file
# States are versioned and support locking
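The backend addresses passed to terraform init in the pipeline above follow a fixed pattern built from the API URL, the project ID, and the state name. The small Python sketch below reproduces that pattern so the three values are easy to see side by side; the function name and the example host and project ID are hypothetical.

```python
def gitlab_state_urls(api_v4_url: str, project_id: int, state_name: str) -> dict:
    """Build the HTTP backend addresses for a GitLab-managed state.

    Mirrors the TF_ADDRESS variable and the lock/unlock -backend-config
    values used in the pipeline: the lock and unlock endpoints are the
    same URL with "/lock" appended.
    """
    address = f"{api_v4_url.rstrip('/')}/projects/{project_id}/terraform/state/{state_name}"
    return {
        "address": address,
        "lock_address": f"{address}/lock",
        "unlock_address": f"{address}/lock",
    }


if __name__ == "__main__":
    # Hypothetical GitLab instance and project ID.
    urls = gitlab_state_urls("https://gitlab.example.com/api/v4", 42, "production")
    for key, value in urls.items():
        print(f"{key} = {value}")
```

Using one state name per environment, as the pipeline does, keeps dev, staging, and production state fully separate within the same project.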

Environment-Specific Deployments

# Multi-environment deployment with promotion
.terraform_base:
  image:
    name: hashicorp/terraform:1.7.0
    entrypoint: [""]   # Clear the image's terraform entrypoint so GitLab can run shell commands
  before_script:
    - cd terraform/environments/${CI_ENVIRONMENT_NAME}
    - terraform init

deploy-dev:
  extends: .terraform_base
  stage: apply
  environment:
    name: dev
  script:
    - terraform apply -auto-approve -input=false
  rules:
    - if: $CI_COMMIT_BRANCH == "develop"

deploy-staging:
  extends: .terraform_base
  stage: apply
  environment:
    name: staging
  script:
    - terraform apply -auto-approve -input=false
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  needs: ["plan-staging"]

deploy-production:
  extends: .terraform_base
  stage: apply
  environment:
    name: production
  script:
    - terraform apply -input=false plan.cache
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
  needs: ["deploy-staging"]

Jenkins Pipelines

While GitHub Actions and GitLab CI dominate new projects, Jenkins remains prevalent in enterprises with existing investments. Its flexibility and plugin ecosystem make it adaptable to any infrastructure workflow.

Declarative vs Scripted Pipelines

Aspect	Declarative Pipeline	Scripted Pipeline
Syntax	`pipeline { }` block	Full Groovy scripting
Complexity	Simple, structured	Flexible, complex
Error handling	`post { }` blocks	`try/catch/finally`
Best for	Standard CI/CD flows	Complex logic, dynamic stages
Recommendation	Use for most pipelines	Use when declarative is insufficient

Terraform Jenkinsfile

// Jenkinsfile - Declarative Terraform Pipeline
pipeline {
    agent {
        docker {
            image 'hashicorp/terraform:1.7.0'
            // The image's entrypoint is terraform; clear it so Jenkins can run sh steps
            args '--entrypoint='
        }
    }

    parameters {
        choice(name: 'ENVIRONMENT', choices: ['dev', 'staging', 'production'])
        booleanParam(name: 'AUTO_APPROVE', defaultValue: false)
    }

    environment {
        AWS_CREDENTIALS = credentials('aws-terraform-role')
        TF_DIR = "terraform/environments/${params.ENVIRONMENT}"
    }

    stages {
        stage('Init') {
            steps {
                dir("${TF_DIR}") {
                    sh 'terraform init -input=false'
                }
            }
        }

        stage('Validate') {
            steps {
                dir("${TF_DIR}") {
                    sh 'terraform fmt -check -recursive'
                    sh 'terraform validate'
                }
            }
        }

        stage('Plan') {
            steps {
                dir("${TF_DIR}") {
                    sh 'terraform plan -out=tfplan -input=false'
                    archiveArtifacts artifacts: 'tfplan'
                }
            }
        }

        stage('Approval') {
            when {
                expression { !params.AUTO_APPROVE }
            }
            steps {
                input message: "Apply Terraform changes to ${params.ENVIRONMENT}?",
                      ok: "Apply"
            }
        }

        stage('Apply') {
            steps {
                dir("${TF_DIR}") {
                    sh 'terraform apply -input=false tfplan'
                }
            }
        }
    }

    post {
        always {
            cleanWs()
        }
        failure {
            slackSend channel: '#infra-alerts',
                      message: "Terraform pipeline FAILED for ${params.ENVIRONMENT}"
        }
        success {
            slackSend channel: '#infra-deploys',
                      message: "Terraform applied to ${params.ENVIRONMENT} ✅"
        }
    }
}

Infrastructure Testing in Pipelines

Infrastructure testing follows a pyramid similar to application testing — fast, cheap tests at the base catching most issues, with slower, expensive tests at the top providing deeper confidence.

Infrastructure Testing Pyramid

flowchart TB
    subgraph Pyramid["Testing Pyramid"]
        direction TB
        E2E["🔺 End-to-End Tests<br/>(Terratest, real infrastructure)"]
        Integration["🔶 Integration Tests<br/>(Policy checks, module testing)"]
        Static["🟩 Static Analysis<br/>(fmt, validate, lint, scan)"]
    end

    E2E -.->|"Slow, expensive<br/>Run: pre-release"| Cost1["$$$ Minutes"]
    Integration -.->|"Medium speed<br/>Run: every PR"| Cost2["$$ Seconds"]
    Static -.->|"Fast, cheap<br/>Run: every commit"| Cost3["$ Milliseconds"]

Static Analysis Tools

# terraform fmt - Check code formatting
terraform fmt -check -recursive -diff
# Returns exit code 1 if files need formatting

# terraform validate - Check syntax and internal consistency
terraform init -backend=false
terraform validate
# Catches: missing variables, invalid references, type errors

# tflint - Terraform-specific linting
# Install: brew install tflint (or download binary)
tflint --init  # Download plugins
tflint --recursive --format=compact
# Catches: deprecated syntax, invalid instance types,
#           naming conventions, unused variables

Security Scanning

# tfsec - Find security vulnerabilities in Terraform
tfsec terraform/ --format json --out results.json
# Checks: open security groups, unencrypted storage,
#          public access, missing logging

# Checkov - Policy-as-code scanner (multi-framework)
checkov -d terraform/ --framework terraform --output sarif
# Checks: CIS benchmarks, SOC2, HIPAA, PCI-DSS compliance

# Trivy - Container + IaC scanning
trivy config terraform/
# Checks: misconfigurations across Terraform, Kubernetes, Docker

# KICS - Keeping Infrastructure as Code Secure
docker run -v $(pwd):/path checkmarx/kics:latest scan -p /path/terraform
# Checks: 2000+ queries across multiple frameworks

Tool	Focus	Speed	Frameworks	Cost
tfsec	Security best practices	Very fast	Terraform only	Free
Checkov	Compliance frameworks	Fast	Terraform, K8s, Docker, ARM	Free / Paid
Trivy	Vulnerabilities + misconfig	Fast	Multi-framework + containers	Free
KICS	Broad coverage	Medium	15+ frameworks	Free
Sentinel	Enterprise policy	Fast	Terraform (HCP)	Paid (TFC)

Policy as Code with OPA/Conftest

# Convert Terraform plan to JSON for policy evaluation
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json

# Run OPA/Conftest policies against the plan
conftest test tfplan.json --policy policy/ --output json

# Example: Deny resources without required tags

# policy/terraform.rego - OPA policy for Terraform
package terraform

import future.keywords.in

# Deny resources without required tags
deny[msg] {
    resource := input.resource_changes[_]
    resource.change.actions[_] == "create"
    tags := resource.change.after.tags
    required_tags := {"Environment", "Team", "CostCenter"}
    missing := required_tags - {key | tags[key]}
    count(missing) > 0
    msg := sprintf("Resource %s missing tags: %v", [resource.address, missing])
}

# Deny overly permissive security groups
deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_security_group_rule"
    resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
    resource.change.after.type == "ingress"
    msg := sprintf("Security group %s allows ingress from 0.0.0.0/0", [resource.address])
}
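For readers less familiar with Rego, the required-tags rule boils down to a set difference. The Python sketch below is a rough analogue for experimenting with fixture plans, not a replacement for the policy. One deliberate difference, labelled here as an assumption: it treats a missing or null tags attribute as "no tags", which is stricter than the Rego rule, where an absent attribute leaves the rule undefined.

```python
REQUIRED_TAGS = {"Environment", "Team", "CostCenter"}


def missing_tags(plan: dict) -> dict:
    """Map each created resource address to its sorted list of missing tags."""
    violations = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if "create" not in change.get("actions", []):
            continue  # the policy only inspects resources being created
        tags = (change.get("after") or {}).get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            violations[rc["address"]] = sorted(missing)
    return violations


if __name__ == "__main__":
    # Hypothetical fixture with one compliant and one non-compliant resource.
    fixture = {
        "resource_changes": [
            {"address": "aws_instance.ok", "change": {"actions": ["create"], "after": {
                "tags": {"Environment": "prod", "Team": "platform", "CostCenter": "42"}}}},
            {"address": "aws_instance.bad", "change": {"actions": ["create"], "after": {
                "tags": {"Environment": "prod"}}}},
        ]
    }
    print(missing_tags(fixture))
```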

Integration Testing with Terratest

# Terratest runs real infrastructure and validates it
# Written in Go, deploys then destroys test infrastructure

# Install Terratest
# go get github.com/gruntwork-io/terratest/modules/terraform

# Run tests (deploys real resources!)
cd test/
go test -v -timeout 30m ./...

# Cost estimation with Infracost
# Install: brew install infracost
infracost breakdown --path terraform/
infracost diff --path terraform/ --compare-to infracost-base.json

# Infracost in CI - add cost comments to PRs
# .github/workflows/infracost.yml snippet
- name: Generate Infracost diff
  run: |
    infracost diff \
      --path=terraform/ \
      --format=json \
      --compare-to=/tmp/infracost-base.json \
      --out-file=/tmp/infracost-diff.json

- name: Post Infracost comment
  run: |
    infracost comment github \
      --path=/tmp/infracost-diff.json \
      --repo=$GITHUB_REPOSITORY \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }}
# Result: "This PR will increase monthly costs by $47.20 (+12%)"
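A cost comment becomes a gate once the pipeline compares the increase against a budget threshold. The Python sketch below shows only that arithmetic; it does not read Infracost's output, and the 10% default threshold and function names are hypothetical choices.

```python
from typing import Optional, Tuple


def cost_change(base_monthly: float, new_monthly: float) -> Tuple[float, Optional[float]]:
    """Return (absolute difference, percentage change).

    The percentage is None when the baseline is zero, because a relative
    change is undefined in that case.
    """
    diff = new_monthly - base_monthly
    pct = None if base_monthly == 0 else diff / base_monthly * 100
    return diff, pct


def exceeds_budget(base_monthly: float, new_monthly: float,
                   max_increase_pct: float = 10.0) -> bool:
    """True if the change should block the pipeline pending review."""
    diff, pct = cost_change(base_monthly, new_monthly)
    if pct is None:
        return diff > 0  # any new spend on a zero baseline needs review
    return pct > max_increase_pct


if __name__ == "__main__":
    diff, pct = cost_change(200.0, 300.0)
    print(f"monthly cost changes by ${diff:.2f} ({pct:+.0f}%)")
    print("blocked" if exceeds_budget(200.0, 300.0) else "allowed")
```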

GitOps for Infrastructure

GitOps extends CI/CD by making Git the single source of truth for both application and infrastructure state. Instead of pipelines pushing changes to clusters, GitOps controllers pull desired state from Git and reconcile the live environment continuously.

GitOps Principles: (1) Declarative configuration stored in Git, (2) Git as the single source of truth, (3) Changes applied automatically via reconciliation, (4) Continuous monitoring and self-healing. If live state drifts from Git, the controller corrects it automatically.
                        

Push-Based vs Pull-Based GitOps

Aspect	Push-Based (Traditional CI/CD)	Pull-Based (GitOps)
Who applies	CI server pushes to cluster	In-cluster agent pulls from Git
Credentials	CI has cluster credentials	Agent has Git read access only
Drift detection	Manual or scheduled scans	Continuous reconciliation
Security	Broader attack surface	Minimal external access
Tools	GitHub Actions, GitLab CI	ArgoCD, Flux

Pull-Based GitOps Flow

flowchart LR
    Dev[Developer] -->|"git push"| Git[Git Repository]
    Git -->|"watches"| Agent[GitOps Agent<br/>ArgoCD / Flux]
    Agent -->|"reconciles"| Cluster[Kubernetes Cluster]
    Cluster -->|"reports status"| Agent
    Agent -->|"updates status"| Git

    style Git fill:#e8f5e9
    style Agent fill:#fff3e0
    style Cluster fill:#e3f2fd
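The reconciliation step at the heart of this flow can be sketched as a diff between desired and live state. The Python below is a toy model under simplifying assumptions (resources are plain dictionaries keyed by name, and comparison is simple equality); real controllers such as ArgoCD and Flux use far more elaborate diffing.

```python
def reconcile(desired: dict, live: dict, prune: bool = True) -> dict:
    """Compute the actions needed to make live state match desired state.

    desired and live map resource names to their specs. With prune
    disabled, resources missing from Git are left in place.
    """
    actions = {"create": [], "update": [], "delete": []}
    for name, spec in desired.items():
        if name not in live:
            actions["create"].append(name)
        elif live[name] != spec:
            actions["update"].append(name)  # drift: revert to what Git says
    if prune:
        actions["delete"] = [name for name in live if name not in desired]
    return {kind: sorted(names) for kind, names in actions.items()}


if __name__ == "__main__":
    # Hypothetical states: someone scaled "web" by hand and left "old" behind.
    desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
    live = {"web": {"replicas": 1}, "old": {"replicas": 1}}
    print(reconcile(desired, live))
```

An agent runs this comparison on a timer, which is why manual changes in the cluster are reverted without anyone triggering a pipeline.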

ArgoCD

ArgoCD is one of the most widely used GitOps controllers for Kubernetes. It watches Git repositories and automatically synchronizes cluster state to match the desired configuration.

# argocd-application.yaml - ArgoCD Application CRD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-infrastructure
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/infrastructure.git
    targetRevision: main
    path: kubernetes/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # Delete resources removed from Git
      selfHeal: true     # Revert manual changes in cluster
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
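The retry settings above describe an exponential backoff. The Python sketch below computes the wait before each retry under the assumption that the delay is duration multiplied by factor for every further attempt and capped at maxDuration; the controller's exact behaviour may differ in detail.

```python
from typing import List


def backoff_schedule(limit: int, duration_s: float, factor: float,
                     max_duration_s: float) -> List[float]:
    """Return the wait in seconds before each of `limit` retries."""
    return [min(duration_s * factor ** attempt, max_duration_s)
            for attempt in range(limit)]


if __name__ == "__main__":
    # Values from the Application manifest: limit 5, 5s base, factor 2, 3m cap.
    print(backoff_schedule(limit=5, duration_s=5, factor=2, max_duration_s=180))
```

With these values the cap is never reached within five retries; it only matters if the limit is raised.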

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Access ArgoCD UI
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Create application via CLI
argocd app create production-infra \
    --repo https://github.com/myorg/infrastructure.git \
    --path kubernetes/overlays/production \
    --dest-server https://kubernetes.default.svc \
    --dest-namespace production \
    --sync-policy automated \
    --auto-prune \
    --self-heal

Flux

Flux is a CNCF graduated project that takes a more modular, composable approach to GitOps with separate controllers for different responsibilities:

# flux-source.yaml - GitRepository source
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/myorg/infrastructure.git
  ref:
    branch: main
  secretRef:
    name: github-credentials

---
# flux-kustomization.yaml - Kustomization (what to deploy)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production-infra
  namespace: flux-system
spec:
  interval: 5m
  path: ./kubernetes/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: web-api
      namespace: production
  timeout: 3m

---
# flux-helmrelease.yaml - Helm chart deployment
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: nginx-ingress
  namespace: ingress-system
spec:
  interval: 10m
  chart:
    spec:
      chart: ingress-nginx
      version: "4.x"
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
  values:
    controller:
      replicaCount: 3
      service:
        type: LoadBalancer

GitOps for Terraform

While ArgoCD and Flux handle Kubernetes resources, Terraform changes need specialized GitOps tools:

Tool	Type	How It Works	Best For
Atlantis	Self-hosted	Webhook-triggered plan/apply on PRs	Teams wanting full control
Spacelift	SaaS	Managed Terraform runner with policies	Enterprise with compliance needs
env0	SaaS	Self-service IaC with guardrails	Platform teams enabling developers
Terraform Cloud	SaaS	HashiCorp's managed workflow	Teams already in HCP ecosystem
Crossplane	K8s-native	Cloud resources as K8s CRDs	Teams standardizing on K8s APIs

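# atlantis.yaml - Atlantis repo-level config (lives in the repository root)
# Custom workflows and apply_requirements overrides must be allowed in the server-side repos.yaml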
version: 3
projects:
  - name: production-vpc
    dir: terraform/environments/production/vpc
    workspace: default
    autoplan:
      when_modified: ["*.tf", "*.tfvars", "modules/**/*.tf"]
      enabled: true
    apply_requirements: [approved, mergeable]
    workflow: production

workflows:
  production:
    plan:
      steps:
        - init
        - run: tfsec .
        - run: infracost breakdown --path . --format json --out-file /tmp/infracost.json
        - plan:
            extra_args: ["-var-file=production.tfvars"]
    apply:
      steps:
        - apply

Deployment Strategies

Infrastructure changes require careful deployment strategies to minimize downtime and provide rollback capabilities. Unlike application deployments where you can run multiple versions simultaneously, infrastructure changes often affect shared resources.

Blue-Green Deployments for Infrastructure

Blue-green deployments maintain two identical environments. Traffic switches from the current (blue) to the new (green) once validation passes. For infrastructure, this typically applies to compute layers rather than databases:

Blue-Green Infrastructure Deployment

flowchart TB
    LB[Load Balancer / DNS]

    subgraph Blue["Blue Environment (Current)"]
        B1[ASG v1.2]
        B2[App Servers]
    end

    subgraph Green["Green Environment (New)"]
        G1[ASG v1.3]
        G2[App Servers]
    end

    DB[(Shared Database)]

    LB -->|"100% traffic"| Blue
    LB -.->|"0% traffic<br/>(switch after validation)"| Green
    Blue --> DB
    Green --> DB

    style Blue fill:#e3f2fd
    style Green fill:#e8f5e9

# blue-green.tf - Blue-Green deployment with Terraform
# Using create_before_destroy lifecycle

resource "aws_launch_template" "app" {
  name_prefix   = "app-"
  image_id      = var.ami_id  # A new AMI creates a new launch template version
  instance_type = "t3.medium"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "app" {
  name                = "app-${aws_launch_template.app.latest_version}"
  desired_capacity    = 3
  max_size            = 6
  min_size            = 3
  target_group_arns   = [aws_lb_target_group.app.arn]
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 66
      instance_warmup        = 300
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

Canary Deployments

Canary deployments route a small percentage of traffic to the new infrastructure, gradually increasing it as confidence builds:

# canary-rollout.yaml - Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-api
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5        # 5% traffic to canary
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25       # 25% if metrics pass
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50       # 50% traffic
        - pause: {duration: 15m}
        - setWeight: 100      # Full rollout
      canaryService: web-api-canary
      stableService: web-api-stable
      trafficRouting:
        nginx:
          stableIngress: web-api-ingress

---
# analysis-template.yaml - Automated rollback criteria
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="web-api-canary",code=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="web-api-canary"}[5m]))
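
The gate that the AnalysisTemplate expresses comes down to a ratio check against a threshold. The following is a minimal shell sketch of that check, assuming the request counts have already been fetched from a metrics backend. The function name `canary_passes` is hypothetical, and treating zero traffic as a failure is an assumption made here for illustration, not a statement about how Argo Rollouts handles an empty result.

```shell
#!/bin/sh
# canary_passes SUCCESS_COUNT TOTAL_COUNT [THRESHOLD]
# Exits 0 when SUCCESS_COUNT / TOTAL_COUNT >= THRESHOLD (default 0.95),
# and 1 otherwise. Hypothetical helper for illustration only.
canary_passes() {
  awk -v s="$1" -v t="$2" -v th="${3:-0.95}" 'BEGIN {
    # Assumption: a canary that received no traffic cannot be judged healthy.
    if (t <= 0) exit 1
    exit ((s / t) >= th ? 0 : 1)
  }'
}

# Usage: 970 successful requests out of 1000 clears the 95% threshold.
if canary_passes 970 1000; then
  echo "promote"
else
  echo "abort"
fi
```

Keeping the threshold as a parameter mirrors the way `successCondition` is declared once in the template and reused at each analysis step.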

Rollback Strategies

# Git-based rollback (preferred)
# Revert the commit and let the pipeline re-run
git revert HEAD --no-edit
git push origin main
# Pipeline re-applies the previous configuration → infrastructure reverts

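# Terraform rollback by re-applying a known-good version
# Terraform re-applies configuration; it does not "apply" an old state file.
# `version` is a reserved variable name, so this assumes a variable named app_version.
# Use -target only for exceptional recovery, not routine deployments.
terraform apply -target=module.networking -var="app_version=1.2.0"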

# ArgoCD rollback
argocd app rollback production-infra
# Or: set targetRevision to a specific commit
argocd app set production-infra --revision abc1234

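# Flux rollback - suspend reconciliation, revert in Git, then resume
flux suspend kustomization production-infra
git revert HEAD --no-edit && git push origin main
flux resume kustomization production-infra
flux reconcile kustomization production-infra --with-source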

Pipeline Security & Best Practices

CI/CD pipelines are high-value targets — they have credentials to modify production infrastructure. Securing the pipeline itself is as important as securing the infrastructure it manages.

                            
Security Critical: A compromised CI/CD pipeline can destroy entire environments, exfiltrate secrets, or create backdoor access. Treat pipeline security with the same rigor as production server security. Apply least privilege, audit all changes, and assume breach.
                        

Secret Management in Pipelines

# Best practice: OIDC > Vault > Repository Secrets > Environment Variables

# 1. OIDC (best) - No stored secrets at all
#    (the job must grant `permissions: id-token: write`)
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ vars.AWS_ROLE_ARN }}  # Not a secret!
    aws-region: us-east-1

# 2. HashiCorp Vault integration
- name: Import Secrets from Vault
  uses: hashicorp/vault-action@v3
  with:
    url: https://vault.company.com
    method: jwt
    role: ci-terraform
    secrets: |
      secret/data/terraform/aws access_key | AWS_ACCESS_KEY_ID ;
      secret/data/terraform/aws secret_key | AWS_SECRET_ACCESS_KEY

# 3. GitHub repository/environment secrets (acceptable)
- name: Configure credentials
  env:
    DB_PASSWORD: ${{ secrets.PROD_DB_PASSWORD }}  # Never log this!
  run: ./scripts/configure-db.sh  # hypothetical script; every step needs `run` or `uses`
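
Values read from the `secrets` context are masked in logs automatically, but values fetched at runtime (for example from an external secret store) are not. GitHub Actions provides the `add-mask` workflow command for that case. The sketch below wraps it in a helper; the function name `mask_value` is hypothetical, and the sketch assumes single-line values.

```shell
#!/bin/sh
# mask_value VALUE
# Emits the GitHub Actions `add-mask` workflow command so that later
# occurrences of VALUE in the job log are replaced with ***.
# Hypothetical helper; assumes VALUE is a single line.
mask_value() {
  # An empty value has nothing to mask.
  [ -n "$1" ] || return 0
  printf '::add-mask::%s\n' "$1"
}

# Usage inside a workflow `run:` step, before the value is used anywhere:
#   TOKEN=$(./scripts/fetch-token.sh)   # hypothetical script
#   mask_value "$TOKEN"
```

Registering the mask before the first use matters, because masking applies only to output written after the command is processed.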

Pipeline Hardening

# Pin action versions to full SHA (not tags which can be moved)
# BAD:  uses: actions/checkout@v4
# GOOD: uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11

# Limit permissions to minimum required
permissions:
  contents: read      # Only read repo
  id-token: write     # Only for OIDC
  pull-requests: write  # Only for PR comments
  # Everything else implicitly denied

# Require signed commits
# Settings > Branches > Branch protection:
#   ✓ Require signed commits
#   ✓ Require linear history

# Use step-level environment isolation
- name: Terraform Apply
  env:
    TF_VAR_sensitive_value: ${{ secrets.VALUE }}
  run: terraform apply -input=false tfplan
  # Secret only available in this step, not globally
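
SHA pinning is easy to state and easy to forget, so it helps to check it mechanically. The following is a minimal sketch of such a check, assuming workflows live in plain YAML files with one `uses:` reference per line. The function name `find_unpinned_actions` is hypothetical, and the pattern matching is a heuristic rather than a full YAML parse.

```shell
#!/bin/sh
# find_unpinned_actions FILE...
# Prints "file:line:content" for every `uses:` reference that is not pinned
# to a full 40-character commit SHA. Local actions (./path) are skipped and
# commented-out lines are ignored. Exits 0 if any unpinned reference was
# found, 1 if none were. Hypothetical helper; heuristic, not a YAML parser.
find_unpinned_actions() {
  # /dev/null forces grep to print the file name even for a single file.
  grep -nE '^[[:space:]-]*uses:[[:space:]]*[^#[:space:]]+' "$@" /dev/null \
    | grep -vE 'uses:[[:space:]]*\./' \
    | grep -vE '@[0-9a-f]{40}([[:space:]]|$)'
}

# Usage in a CI step that should fail when an unpinned action is present:
#   if find_unpinned_actions .github/workflows/*.yml; then
#     echo "Pin the actions listed above to a commit SHA" >&2
#     exit 1
#   fi
```

A check like this can run as an early lint job so that a moved tag never reaches a job that holds cloud credentials.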

Drift Detection Pipelines

Scheduled pipelines detect when live infrastructure drifts from its Terraform-defined state (due to manual console changes, other tools, or AWS service updates):

# .github/workflows/drift-detection.yml
name: "Infrastructure Drift Detection"

on:
  schedule:
    - cron: "0 8 * * 1-5"  # Weekdays at 8am UTC
  workflow_dispatch: {}

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    permissions:
      contents: read    # Checkout
      id-token: write   # Required for OIDC role assumption
    strategy:
      matrix:
        environment: [dev, staging, production]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-1

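      - name: Detect Drift
        id: drift
        working-directory: terraform/environments/${{ matrix.environment }}
        shell: bash
        run: |
          terraform init -input=false
          # Exit code 2 means "changes present", so do not abort on non-zero
          set +e
          terraform plan -detailed-exitcode -input=false 2>&1 | tee drift-output.txt
          EXIT_CODE=${PIPESTATUS[0]}  # terraform's exit code, not tee's
          set -e
          if [ "$EXIT_CODE" -eq 2 ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
          elif [ "$EXIT_CODE" -eq 0 ]; then
            echo "drift_detected=false" >> "$GITHUB_OUTPUT"
          else
            echo "terraform plan failed with exit code $EXIT_CODE" >&2
            exit 1
          fi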

      - name: Alert on Drift
        if: steps.drift.outputs.drift_detected == 'true'
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-type: application/json" \
            -d '{
              "text": "⚠️ Infrastructure drift detected in ${{ matrix.environment }}!\nReview: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }'
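
With `-detailed-exitcode`, `terraform plan` exits with 0 when there are no changes, 1 on error, and 2 when changes are present. A drift job has to keep those three outcomes apart, since an error is not the same as a clean plan. The sketch below shows that mapping in isolation; the function name `classify_plan_exit` and the labels it prints are hypothetical.

```shell
#!/bin/sh
# classify_plan_exit EXIT_CODE
# Maps the exit code of `terraform plan -detailed-exitcode` to a label.
# Hypothetical helper; the labels are illustrative.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;   # plan succeeded, no changes
    2) echo "drift" ;;   # plan succeeded, changes present
    *) echo "error" ;;   # plan failed (1) or an unexpected code
  esac
}

# Usage (terraform must be installed and initialised). Redirecting to a file
# instead of piping through tee keeps $? equal to terraform's own exit code:
#   terraform plan -detailed-exitcode -input=false > drift-output.txt 2>&1
#   status=$(classify_plan_exit "$?")
```

Treating every unexpected code as an error keeps a failed plan from being reported as "no drift".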

Hands-On Exercises

Exercise 1 (45 minutes)

Build a GitHub Actions Terraform Pipeline

Create a complete CI/CD workflow for Terraform that runs on pull requests and pushes to main.

Create .github/workflows/terraform.yml with format check, validate, and plan jobs
Configure OIDC authentication to your cloud provider (use a free tier account)
Add plan output as a PR comment using actions/github-script
Add a conditional apply job that only runs on main branch merges
Create a GitHub Environment with required reviewers for the apply stage
Test by opening a PR that adds a new resource — verify the plan appears as a comment

Topics: GitHub Actions, Terraform, OIDC

Exercise 2 (30 minutes)

Add Security Scanning to the Pipeline

Enhance the pipeline with automated security checks that catch vulnerabilities before they reach production.

Add a tfsec job that runs in parallel with validation
Add Checkov scanning with SARIF output for GitHub Security tab integration
Create a custom OPA policy (using Conftest) that denies resources without required tags
Add Infracost to post cost estimates as PR comments
Configure the pipeline to fail on critical security findings but warn on medium ones
Deliberately introduce a security issue (e.g., open security group) and verify the pipeline catches it

Topics: tfsec, Checkov, OPA, Infracost

Exercise 3 (60 minutes)

Implement Environment Promotion (Dev → Staging → Prod)

Build a multi-environment pipeline with automatic promotion through lower environments and manual gates for production.

Structure your Terraform code with separate environments/dev, environments/staging, environments/production directories sharing common modules
Create a pipeline that automatically applies to dev on every main branch push
Add a staging job that triggers after dev succeeds (with a smoke test between)
Add a production job with manual approval gate (GitHub Environment protection rules)
Implement a rollback mechanism: if staging smoke tests fail, automatically revert dev
Add Slack notifications for each promotion step

Topics: Multi-Environment, Promotion, Approval Gates

Exercise 4 (60 minutes)

Set Up ArgoCD for Kubernetes GitOps

Deploy ArgoCD and configure it to manage Kubernetes resources from a Git repository with automatic sync and self-healing.

Install ArgoCD on a local Kubernetes cluster (minikube, kind, or k3d)
Create a Git repository with Kubernetes manifests organized using Kustomize overlays (base + production)
Create an ArgoCD Application that watches the repository and auto-syncs
Enable auto-prune and self-heal in the sync policy
Test self-healing: manually kubectl delete a resource and watch ArgoCD recreate it
Test drift correction: kubectl edit a deployment’s replicas and watch ArgoCD revert it
Set up an ApplicationSet to manage multiple environments from a single template

Topics: ArgoCD, Kubernetes, GitOps, Self-Healing

Conclusion & Next Steps

CI/CD pipelines transform infrastructure management from a manual, error-prone process into an automated, auditable, and repeatable workflow. By establishing the pipeline as the single path to production, teams gain confidence to deploy frequently while maintaining safety through automated testing, security scanning, and approval gates.

The key principles to carry forward:

Pipeline as gatekeeper — no human touches production directly
Test pyramid — fast static analysis catches most issues; expensive integration tests provide deep confidence
Saved plan files — what you review is exactly what gets applied
OIDC over stored secrets — eliminate long-lived credentials from pipelines
GitOps for Kubernetes — pull-based reconciliation provides self-healing and drift correction
Drift detection — scheduled scans catch unauthorized manual changes

Next in the Series

In Part 13: Monitoring & Observability, we will explore Prometheus, Grafana, centralized logging, distributed tracing, alerting strategies, and SLOs/SLIs for infrastructure — ensuring you can detect, diagnose, and resolve issues before they impact users.
