The IaC Revolution
Imagine you need to build an identical copy of your production environment for a new development team. With traditional infrastructure management, this might take weeks of manual work — logging into consoles, clicking through wizards, running commands from memory, and praying you didn't miss a security group rule. With Infrastructure as Code, you run a single command and have an identical environment in minutes.
Why IaC Exists: The Problems It Solves
Before IaC, infrastructure was managed through a combination of manual processes that introduced serious problems:
flowchart TD
A[Manual Infrastructure] --> B[Snowflake Servers]
A --> C[Configuration Drift]
A --> D[Tribal Knowledge]
A --> E[Slow Provisioning]
A --> F[No Audit Trail]
B --> G[Every server is unique]
C --> H[Environments diverge over time]
D --> I[Only one person knows how it works]
E --> J[Days/weeks to provision]
F --> K[Who changed what and when?]
style A fill:#BF092F,color:#fff
style B fill:#132440,color:#fff
style C fill:#132440,color:#fff
style D fill:#132440,color:#fff
style E fill:#132440,color:#fff
style F fill:#132440,color:#fff
| Problem | Manual Approach | IaC Solution |
|---|---|---|
| Reproducibility | Follow a wiki/runbook (often outdated) | Run code — same result every time |
| Consistency | Humans make mistakes | Code is deterministic |
| Speed | Hours/days per environment | Minutes per environment |
| Documentation | Separate docs (drift from reality) | Code IS the documentation |
| Collaboration | Screen sharing, tribal knowledge | Code reviews, pull requests |
| Disaster Recovery | Rebuild from memory/backups | Re-apply code to new region |
| Audit Trail | Manual change logs (if any) | Git history shows every change |
Key Benefits of IaC
mindmap
  root((IaC Benefits))
    Version Control
      Git history
      Branching
      Code review
      Rollback
    Reproducibility
      Identical environments
      Disaster recovery
      Multi-region
      Dev/Staging/Prod parity
    Automation
      CI/CD pipelines
      Self-service
      Scheduled deployments
      Auto-scaling rules
    Collaboration
      Pull requests
      Team ownership
      Knowledge sharing
      Onboarding
    Testing
      Validation
      Linting
      Security scanning
      Cost estimation
Declarative vs Imperative
The most fundamental decision in IaC is choosing between two paradigms: declarative (describe the desired end state) and imperative (describe the steps to reach that state). This distinction shapes everything from how you think about infrastructure to which tools you choose.
Declarative: Describe WHAT You Want
In the declarative approach, you define the desired state of your infrastructure. The tool figures out how to make reality match that state. You say "I want 3 web servers behind a load balancer" without specifying the exact sequence of API calls.
# Declarative (Terraform HCL) — WHAT you want
# The tool determines HOW to create/modify/delete resources
resource "aws_instance" "web" {
count = 3
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
tags = {
Name = "web-server-${count.index + 1}"
Environment = "production"
ManagedBy = "terraform"
}
}
resource "aws_lb" "web" {
name = "web-load-balancer"
internal = false
load_balancer_type = "application"
subnets = var.public_subnet_ids
}
Imperative: Describe HOW to Get There
In the imperative approach, you write the exact sequence of steps (commands, API calls) that should execute. You have full control over ordering and logic but must handle edge cases yourself.
#!/bin/bash
# Imperative (Bash script) — HOW to do it
# You specify every step and handle edge cases
set -euo pipefail   # Abort on the first failed command or unset variable
INSTANCE_IDS=()     # Collects the IDs created in the loop below
# Step 1: Create instances
for i in 1 2 3; do
  INSTANCE_ID=$(aws ec2 run-instances \
    --image-id ami-0c55b159cbfafe1f0 \
    --instance-type t3.micro \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=web-server-$i}]" \
    --query 'Instances[0].InstanceId' \
    --output text)
  echo "Created instance: $INSTANCE_ID"
  INSTANCE_IDS+=("$INSTANCE_ID")
done
# Step 2: Wait for instances to be running
aws ec2 wait instance-running --instance-ids "${INSTANCE_IDS[@]}"
# Step 3: Create load balancer
LB_ARN=$(aws elbv2 create-load-balancer \
--name web-load-balancer \
--subnets subnet-abc123 subnet-def456 \
--type application \
--query 'LoadBalancers[0].LoadBalancerArn' \
--output text)
echo "Created load balancer: $LB_ARN"
flowchart LR
subgraph Declarative
D1[Define Desired State] --> D2[Tool Reads Current State]
D2 --> D3[Tool Calculates Diff]
D3 --> D4[Tool Applies Changes]
D4 --> D5[Desired = Current]
end
subgraph Imperative
I1[Write Step 1] --> I2[Write Step 2]
I2 --> I3[Write Step 3]
I3 --> I4[Handle Errors]
I4 --> I5[Hope State is Correct]
end
style D5 fill:#3B9797,color:#fff
style I5 fill:#BF092F,color:#fff
When to Use Each Approach
| Aspect | Declarative | Imperative |
|---|---|---|
| You specify | Desired end state | Step-by-step instructions |
| Idempotent? | Yes (by design) | Must be manually ensured |
| Ordering | Tool determines order | You control order |
| Drift handling | Automatic (re-converge) | Must detect and fix yourself |
| Learning curve | Domain-specific language | Familiar scripting languages |
| Flexibility | Limited to tool’s capabilities | Unlimited (any logic) |
| Best for | Cloud resources, repeatable infra | Complex orchestration, migrations |
| Examples | Terraform, CloudFormation, Bicep, K8s YAML | Bash and Python scripts, Ansible playbooks; Pulumi programs are written imperatively but converge on a declared state |
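The "Ordering" row deserves a concrete picture. A declarative tool typically derives a dependency graph from the references between resources and walks it so that dependencies are created first. The sketch below illustrates that idea with Python's standard `graphlib`; the three-resource graph and the `plan_order` helper are hypothetical and do not reflect any tool's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key depends on the resources in its set.
DEPENDENCIES = {
    "aws_vpc.main": set(),
    "aws_subnet.public": {"aws_vpc.main"},
    "aws_instance.web": {"aws_subnet.public"},
}


def plan_order(dependencies):
    """Return resource addresses in an order where dependencies come first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for address in plan_order(DEPENDENCIES):
        print(address)
```

In an imperative script the same ordering lives implicitly in the sequence of commands, so a reordered or skipped step is only noticed when a call fails.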
The IaC Landscape
The IaC ecosystem has matured significantly, with specialized tools for different use cases. Understanding the landscape helps you choose the right tool for your situation.
Terraform (HashiCorp)
Terraform is the most widely adopted multi-cloud IaC tool. It uses HashiCorp Configuration Language (HCL) — a declarative language designed specifically for infrastructure definition.
# Example: Terraform configuration for AWS VPC
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = "us-east-1"
}
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "main-vpc"
Environment = "production"
}
}
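The `version = "~> 5.0"` line above uses the pessimistic constraint operator, which lets only the rightmost component you specify increase. As a rough illustration, the following Python sketch mimics that rule. It assumes plain numeric versions with at least two components in the constraint, ignores pre-release suffixes, and the helper names are made up for this example.

```python
def parse(version):
    """Turn '5.31.0' into (5, 31, 0)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version, constraint):
    """Check a version against a '~> X.Y' or '~> X.Y.Z' style constraint."""
    base = parse(constraint.replace("~>", "").strip())
    candidate = parse(version)
    # Upper bound: bump the second-to-last component and drop the last one.
    upper = base[:-2] + (base[-2] + 1,)
    # Pad all tuples to the same length so comparison is component-wise.
    width = max(len(base), len(candidate))

    def pad(parts):
        return parts + (0,) * (width - len(parts))

    return pad(base) <= pad(candidate) < pad(upper)


if __name__ == "__main__":
    for v in ("4.67.0", "5.0.0", "5.31.0", "6.0.0"):
        print(v, satisfies_pessimistic(v, "~> 5.0"))
```

Under this rule `~> 5.0` accepts any 5.x release but not 6.0, while `~> 5.0.1` would accept only 5.0.x patch releases from 5.0.1 upward.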
AWS CloudFormation
CloudFormation is AWS's native IaC service. It uses JSON or YAML templates with deep integration into AWS services, offering features like change sets, drift detection, and StackSets for multi-account deployment.
# Example: CloudFormation YAML template
AWSTemplateFormatVersion: '2010-09-09'
Description: Simple VPC with public subnet
Resources:
  MainVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: main-vpc
        - Key: Environment
          Value: production
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref MainVPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: public-subnet-1
Outputs:
  VpcId:
    Description: VPC ID
    Value: !Ref MainVPC
    Export:
      Name: MainVpcId
Azure Bicep
Bicep is Azure's domain-specific language for deploying Azure resources. It compiles down to ARM (Azure Resource Manager) templates but with a much cleaner syntax.
// Example: Azure Bicep for a Storage Account
@description('Location for all resources')
param location string = resourceGroup().location
@description('Storage account name')
param storageAccountName string = 'st${uniqueString(resourceGroup().id)}'
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
name: storageAccountName
location: location
sku: {
name: 'Standard_LRS'
}
kind: 'StorageV2'
properties: {
minimumTlsVersion: 'TLS1_2'
supportsHttpsTrafficOnly: true
accessTier: 'Hot'
}
}
output storageAccountId string = storageAccount.id
output primaryEndpoint string = storageAccount.properties.primaryEndpoints.blob
Pulumi
Pulumi takes a different approach — you write IaC using general-purpose programming languages (Python, TypeScript, Go, C#). This gives you full access to loops, conditionals, type systems, and testing frameworks.
# Example: Pulumi with Python (conceptual)
# Install: pip install pulumi pulumi-aws
# Initialize: pulumi new aws-python
# __main__.py
import pulumi
import pulumi_aws as aws
# Create a VPC using familiar Python
vpc = aws.ec2.Vpc("main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    tags={"Name": "main-vpc", "Environment": "production"}
)
# Use Python loops for multiple subnets
subnets = []
for i, az in enumerate(["us-east-1a", "us-east-1b", "us-east-1c"]):
    subnet = aws.ec2.Subnet(f"subnet-{i}",
        vpc_id=vpc.id,
        cidr_block=f"10.0.{i+1}.0/24",
        availability_zone=az,
        tags={"Name": f"subnet-{az}"}
    )
    subnets.append(subnet)
# Export outputs
pulumi.export("vpc_id", vpc.id)
pulumi.export("subnet_ids", [s.id for s in subnets])
Ansible
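Ansible is primarily a configuration management and orchestration tool. Its YAML playbooks run tasks in the order they are written (a procedural style), although most built-in modules are themselves idempotent. It's agentless (connects via SSH) and excels at configuring servers after they're provisioned.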
# Example: Ansible playbook for server configuration
---
- name: Configure web servers
  hosts: webservers
  become: yes
  vars:
    app_port: 8080
    nginx_version: "1.24"
  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600
    - name: Install nginx
      apt:
        name: "nginx={{ nginx_version }}*"
        state: present
    - name: Configure nginx virtual host
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/app
        mode: '0644'
      notify: Restart nginx
    - name: Enable site
      file:
        src: /etc/nginx/sites-available/app
        dest: /etc/nginx/sites-enabled/app
        state: link
  handlers:
    - name: Restart nginx
      systemd:
        name: nginx
        state: restarted
        enabled: yes
Tool Comparison
| Tool | Language | Cloud Support | State | Paradigm | Best For |
|---|---|---|---|---|---|
| Terraform | HCL | Multi-cloud (1000+ providers) | State file | Declarative | Multi-cloud infrastructure |
| CloudFormation | YAML/JSON | AWS only | Managed by AWS | Declarative | AWS-native shops |
| Bicep | Bicep DSL | Azure only | Managed by Azure | Declarative | Azure-native shops |
| Pulumi | Python/TS/Go/C# | Multi-cloud | Pulumi Cloud or self-managed | Imperative* | Devs preferring real languages |
| Ansible | YAML | Multi-cloud + on-prem | Stateless | Imperative | Configuration management |
| CDK (AWS) | Python/TS/Java | AWS only | Managed by AWS | Imperative* | Complex AWS patterns |
* Pulumi and CDK define infrastructure imperatively in code but generate declarative state that converges on desired outcomes.
Core IaC Concepts
Regardless of which tool you choose, these foundational concepts underpin all IaC systems.
Desired State vs Current State
Declarative IaC tools work by comparing two things: what you want (desired state, defined in code) and what exists (current state, read from the provider's APIs and, for some tools, a state file). The tool then computes the set of changes needed to make current match desired.
flowchart TD
A[Your Code<br/>Desired State] --> C{IaC Engine}
B[Cloud APIs<br/>Current State] --> C
C --> D{Differences?}
D -->|Yes| E[Generate Plan]
D -->|No| F[No Changes Needed]
E --> G[Create Resources]
E --> H[Update Resources]
E --> I[Delete Resources]
G --> J[State Converged ✓]
H --> J
I --> J
F --> J
style A fill:#3B9797,color:#fff
style B fill:#16476A,color:#fff
style J fill:#3B9797,color:#fff
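The comparison in the diagram can be sketched in a few lines of Python. This is a deliberately simplified model, assuming resources are plain dictionaries keyed by address; real engines also handle dependencies, replacement, and provider-specific rules. The `compute_plan` name and the sample data are illustrative only.

```python
def compute_plan(desired, current):
    """Compare desired and current resources (dicts keyed by address)."""
    to_create = sorted(set(desired) - set(current))
    to_delete = sorted(set(current) - set(desired))
    to_update = sorted(
        address
        for address in set(desired) & set(current)
        if desired[address] != current[address]
    )
    return {"create": to_create, "update": to_update, "delete": to_delete}


# Hypothetical sample data for illustration.
DESIRED = {
    "aws_instance.web[0]": {"instance_type": "t3.micro"},
    "aws_instance.web[1]": {"instance_type": "t3.micro"},
}
CURRENT = {
    "aws_instance.web[0]": {"instance_type": "t3.small"},
    "aws_instance.old": {"instance_type": "t3.micro"},
}

if __name__ == "__main__":
    print(compute_plan(DESIRED, CURRENT))
```

When desired and current are equal, all three lists are empty, which corresponds to the "No Changes Needed" branch.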
Idempotency
Idempotency means running the same operation multiple times produces the same result. This is perhaps the most important property of declarative IaC — you can safely re-run your code without fear of creating duplicate resources or breaking things.
# Idempotent behavior example:
# Running terraform apply multiple times
$ terraform apply # First run: Creates 3 servers, 1 VPC, 1 LB
# Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
$ terraform apply # Second run: Nothing to do!
# Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
$ terraform apply # Third run: Still nothing!
# Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
# Non-idempotent (imperative) behavior:
$ bash create-servers.sh # Creates 3 servers
$ bash create-servers.sh # Creates 3 MORE servers (6 total!)
$ bash create-servers.sh # Creates 3 MORE servers (9 total!)
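The difference between the two behaviors above comes down to "create" versus "ensure". The following Python sketch models it with an in-memory stand-in for a cloud API; `FakeCloud` and `run_twice` are hypothetical names used only for this illustration.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, for illustration)."""

    def __init__(self):
        self.servers = []

    def create_server(self, name):
        # Imperative "create": always adds a server, even if one exists.
        self.servers.append(name)

    def ensure_server(self, name):
        # Idempotent "ensure": acts only when the desired state is not met.
        if name not in self.servers:
            self.servers.append(name)


def run_twice(action_name):
    """Apply the same three-server request twice and count the servers."""
    cloud = FakeCloud()
    for _ in range(2):
        for i in (1, 2, 3):
            getattr(cloud, action_name)(f"web-server-{i}")
    return len(cloud.servers)


if __name__ == "__main__":
    print("create_server:", run_twice("create_server"))
    print("ensure_server:", run_twice("ensure_server"))
```

An imperative script can be made idempotent in the same way, but the existence check has to be written and maintained by hand for every resource.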
Plan/Apply Workflow
Most IaC tools separate planning from execution. The plan phase shows you what changes will occur, and the apply phase executes them. Reviewing the plan first helps catch unintended modifications before they reach real infrastructure.
# The Plan/Apply workflow in Terraform:
# Step 1: PLAN — Preview changes (safe, read-only)
$ terraform plan
# Terraform will perform the following actions:
#
# # aws_instance.web[0] will be created
# + resource "aws_instance" "web" {
# + ami = "ami-0c55b159cbfafe1f0"
# + instance_type = "t3.micro"
# + tags = {
# + "Name" = "web-server-1"
# }
# }
#
# Plan: 3 to add, 0 to change, 0 to destroy.
# Step 2: REVIEW — Team examines the plan (human checkpoint)
# Step 3: APPLY — Execute the plan (makes real changes)
$ terraform apply
# Do you want to perform these actions?
# Enter a value: yes
# aws_instance.web[0]: Creating...
# aws_instance.web[0]: Creation complete after 32s
# Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
Drift Detection
Configuration drift occurs when the actual state of infrastructure diverges from the defined state — typically caused by manual changes (ClickOps), emergency fixes, or external automation that bypasses IaC.
sequenceDiagram
participant Dev as Developer
participant Code as IaC Code
participant Cloud as Cloud Provider
participant Ops as Ops Engineer
Dev->>Code: Define desired state
Code->>Cloud: terraform apply
Note over Cloud: State matches code ✓
Ops->>Cloud: Manual change via console!
Note over Cloud: State DRIFTED from code ✗
Dev->>Code: terraform plan
Code-->>Dev: Drift detected!<br/>Instance type changed manually
Dev->>Code: terraform apply
Code->>Cloud: Revert to desired state
Note over Cloud: State matches code ✓
Run terraform plan on a schedule and alert when drift is detected.
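At the level of a single resource, drift detection amounts to comparing the attributes recorded at the last apply with what the provider reports now. Here is a minimal Python sketch of that comparison, assuming both sides are flat dictionaries; the `detect_drift` helper and the sample values are illustrative.

```python
def detect_drift(recorded, actual):
    """Return {attribute: (recorded_value, actual_value)} for changed attributes."""
    drift = {}
    for key in sorted(set(recorded) | set(actual)):
        if recorded.get(key) != actual.get(key):
            drift[key] = (recorded.get(key), actual.get(key))
    return drift


# Hypothetical values: someone resized the instance in the console.
RECORDED = {"instance_type": "t3.micro", "ami": "ami-0c55b159cbfafe1f0"}
ACTUAL = {"instance_type": "t3.large", "ami": "ami-0c55b159cbfafe1f0"}

if __name__ == "__main__":
    print(detect_drift(RECORDED, ACTUAL))
```

An empty result means the resource still matches what was last applied.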
Immutable vs Mutable Infrastructure
| Aspect | Mutable (Pets) | Immutable (Cattle) |
|---|---|---|
| Update strategy | Modify in-place (patch, upgrade) | Replace entirely (new image/container) |
| Server identity | Named, long-lived ("db-prod-01") | Disposable, auto-scaled ("instance-abc123") |
| Configuration drift | Accumulates over time | Largely avoided (instances are replaced, not modified) |
| Rollback | Complex (undo changes) | Simple (deploy previous version) |
| Tools | Ansible, Chef, Puppet | Terraform + Packer, Kubernetes, Docker |
| Example | SSH in, run apt upgrade | Build new AMI, replace instances |
State Management
State is the mechanism by which IaC tools track the relationship between your code and real-world resources. Understanding state is critical for working effectively with tools like Terraform.
What State Is and Why It's Needed
When you write resource "aws_instance" "web" {...} in Terraform, the tool needs to know: does this resource already exist? What's its current ID? What properties does it have? This mapping between code and real resources is stored in the state file.
{
"version": 4,
"terraform_version": "1.7.0",
"resources": [
{
"mode": "managed",
"type": "aws_instance",
"name": "web",
"instances": [
{
"attributes": {
"id": "i-0abc123def456789",
"ami": "ami-0c55b159cbfafe1f0",
"instance_type": "t3.micro",
"private_ip": "10.0.1.42",
"public_ip": "54.23.167.89",
"tags": {
"Name": "web-server-1"
}
}
}
]
}
]
}
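To make the mapping concrete, the Python sketch below reads a trimmed-down version of the state shown above and builds an address-to-ID lookup. It is for illustration only: the state file layout is an internal format, so in practice commands such as `terraform state list` or `terraform show -json` are the safer way to inspect state. The `resource_ids` helper is a made-up name.

```python
import json

# Trimmed copy of the state excerpt above (illustrative, not a full state file).
STATE_JSON = """
{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "instances": [
        {"attributes": {"id": "i-0abc123def456789", "instance_type": "t3.micro"}}
      ]
    }
  ]
}
"""


def resource_ids(state):
    """Map 'type.name' addresses to the list of real resource IDs in state."""
    mapping = {}
    for resource in state.get("resources", []):
        address = f"{resource['type']}.{resource['name']}"
        mapping[address] = [
            instance["attributes"]["id"]
            for instance in resource.get("instances", [])
        ]
    return mapping


if __name__ == "__main__":
    print(resource_ids(json.loads(STATE_JSON)))
```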
Remote State Backends
For team environments, state must be stored remotely where everyone can access it. Local state files on a developer's laptop won't work for collaboration.
# AWS S3 Backend with DynamoDB locking
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "production/network/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-locks"
}
}
# Azure Blob Storage Backend
terraform {
backend "azurerm" {
resource_group_name = "rg-terraform-state"
storage_account_name = "stterraformstate"
container_name = "tfstate"
key = "production/network.tfstate"
}
}
# Google Cloud Storage Backend
terraform {
backend "gcs" {
bucket = "mycompany-terraform-state"
prefix = "production/network"
}
}
State Locking
When multiple team members might run Terraform simultaneously, state locking prevents concurrent modifications that could corrupt state or create conflicting resources.
sequenceDiagram
participant A as Developer A
participant Lock as Lock Table<br/>(DynamoDB)
participant State as State File<br/>(S3)
participant B as Developer B
A->>Lock: Acquire lock
Lock-->>A: Lock granted ✓
A->>State: Read state
A->>State: Write updated state
B->>Lock: Acquire lock
Lock-->>B: Lock DENIED ✗<br/>(already held)
Note over B: Waits or errors out
A->>Lock: Release lock
Lock-->>A: Lock released
B->>Lock: Acquire lock
Lock-->>B: Lock granted ✓
| Backend | Locking Mechanism | Configuration |
|---|---|---|
| AWS S3 | DynamoDB table | dynamodb_table = "terraform-state-locks" |
| Azure Blob | Blob lease | Automatic (built-in) |
| GCS | Lock object written next to the state | Automatic (built-in) |
| Terraform Cloud | Managed locking | Automatic (SaaS) |
| Consul | Session-based locks | lock = true |
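Whatever the backend, the locking pattern is the same: a conditional write that succeeds only if nobody else holds the lock. The Python sketch below models the sequence from the diagram with an in-memory table. `LockTable` is a hypothetical single-process stand-in, so it shows the protocol rather than a real distributed lock.

```python
class LockTable:
    """In-memory stand-in for a lock table (hypothetical, single process)."""

    def __init__(self):
        self._locks = {}

    def acquire(self, lock_id, owner):
        # Conditional write: succeed only if no one holds the lock.
        if lock_id in self._locks:
            return False
        self._locks[lock_id] = owner
        return True

    def release(self, lock_id, owner):
        # Only the current holder may release the lock.
        if self._locks.get(lock_id) == owner:
            del self._locks[lock_id]
            return True
        return False


if __name__ == "__main__":
    table = LockTable()
    key = "production/network/terraform.tfstate"
    print("A acquires:", table.acquire(key, "developer-a"))
    print("B acquires:", table.acquire(key, "developer-b"))
    print("A releases:", table.release(key, "developer-a"))
    print("B acquires:", table.acquire(key, "developer-b"))
```

Real backends add details this sketch leaves out, such as lock metadata (who, when, which operation) and a way to force-unlock after a crashed run.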
State Security
# Security checklist for remote state:
# 1. Enable encryption at rest
# S3: Server-side encryption with KMS
aws s3api put-bucket-encryption \
--bucket mycompany-terraform-state \
--server-side-encryption-configuration '{
"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
}'
# 2. Block public access
aws s3api put-public-access-block \
--bucket mycompany-terraform-state \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# 3. Enable versioning (for recovery)
aws s3api put-bucket-versioning \
--bucket mycompany-terraform-state \
--versioning-configuration Status=Enabled
# 4. Restrict access with bucket policy
# Only allow specific IAM roles to read/write state
Terraform Introduction
Terraform is the industry standard for multi-cloud IaC. Let's explore its core concepts with a practical example that provisions a complete network stack.
HCL Syntax Basics
HashiCorp Configuration Language (HCL) is designed to be human-readable while being machine-parseable. Its three main constructs are resources, variables, and outputs.
# variables.tf — Input variables (parameterize your config)
variable "environment" {
description = "Deployment environment"
type = string
default = "development"
validation {
condition = contains(["development", "staging", "production"], var.environment)
error_message = "Environment must be development, staging, or production."
}
}
variable "instance_count" {
description = "Number of web server instances"
type = number
default = 2
}
variable "allowed_cidr_blocks" {
description = "CIDR blocks allowed to access the application"
type = list(string)
default = ["10.0.0.0/8"]
}
variable "tags" {
description = "Common tags for all resources"
type = map(string)
default = {
ManagedBy = "terraform"
Team = "platform"
}
}
# main.tf — Resource definitions
# Data sources referenced by the resources below
# (the AMI name filter is an assumed pattern for Amazon Linux 2023; adjust as needed)
data "aws_availability_zones" "available" {
state = "available"
}
data "aws_ami" "amazon_linux" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["al2023-ami-2023.*-x86_64"]
}
}
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(var.tags, {
Name = "${var.environment}-vpc"
Environment = var.environment
})
}
resource "aws_subnet" "public" {
count = 2
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
availability_zone = data.aws_availability_zones.available.names[count.index]
map_public_ip_on_launch = true
tags = merge(var.tags, {
Name = "${var.environment}-public-${count.index + 1}"
Tier = "public"
})
}
resource "aws_instance" "web" {
count = var.instance_count
ami = data.aws_ami.amazon_linux.id
instance_type = "t3.micro"
subnet_id = aws_subnet.public[count.index % 2].id
tags = merge(var.tags, {
Name = "${var.environment}-web-${count.index + 1}"
Role = "webserver"
})
}
# outputs.tf — Output values (expose information)
output "vpc_id" {
description = "ID of the VPC"
value = aws_vpc.main.id
}
output "public_subnet_ids" {
description = "IDs of public subnets"
value = aws_subnet.public[*].id
}
output "instance_public_ips" {
description = "Public IPs of web server instances"
value = aws_instance.web[*].public_ip
}
Providers
Providers are plugins that Terraform uses to interact with cloud platforms, SaaS services, and other APIs. Each provider offers a set of resources and data sources.
| Provider | Purpose | Resources | Registry |
|---|---|---|---|
| hashicorp/aws | Amazon Web Services | 1,300+ | registry.terraform.io/providers/hashicorp/aws |
| hashicorp/azurerm | Microsoft Azure | 900+ | registry.terraform.io/providers/hashicorp/azurerm |
| hashicorp/google | Google Cloud Platform | 800+ | registry.terraform.io/providers/hashicorp/google |
| hashicorp/kubernetes | Kubernetes clusters | 50+ | registry.terraform.io/providers/hashicorp/kubernetes |
| integrations/github | GitHub repos/teams | 40+ | registry.terraform.io/providers/integrations/github |
| cloudflare/cloudflare | Cloudflare DNS/CDN | 60+ | registry.terraform.io/providers/cloudflare/cloudflare |
Resource Lifecycle
stateDiagram-v2
[*] --> Planned: terraform plan
Planned --> Creating: terraform apply
Creating --> Created: API success
Created --> Updating: Code changed + apply
Updating --> Created: Update complete
Created --> Destroying: Resource removed from code
Destroying --> [*]: Destroyed
Created --> Tainted: Manual taint
Tainted --> Creating: terraform apply (recreate)
note right of Created: Normal steady state
note right of Tainted: Marked for recreation
Workflow Commands
# Complete Terraform workflow from scratch:
# 1. Initialize — download providers and modules
terraform init
# Initializing the backend...
# Initializing provider plugins...
# - Finding hashicorp/aws versions matching "~> 5.0"...
# - Installing hashicorp/aws v5.31.0...
# Terraform has been successfully initialized!
# 2. Format — consistent code style
terraform fmt -recursive
# main.tf
# variables.tf
# 3. Validate — check syntax and internal consistency
terraform validate
# Success! The configuration is valid.
# 4. Plan — preview what will change
terraform plan -out=tfplan
# Plan: 5 to add, 0 to change, 0 to destroy.
# Saved the plan to: tfplan
# 5. Apply — execute the plan
terraform apply tfplan
# Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
# 6. Show — inspect current state
terraform show
# 7. Destroy — tear everything down (careful!)
terraform destroy
# Plan: 0 to add, 0 to change, 5 to destroy.
# Do you really want to destroy all resources?
Modules & Reusability
As your infrastructure grows, you'll find yourself repeating patterns — every application needs a VPC, subnets, security groups, and a load balancer. Modules let you package these patterns into reusable components.
Module Structure
# Standard module directory structure:
modules/
└── vpc/
    ├── main.tf        # Resource definitions
    ├── variables.tf   # Input variables (module parameters)
    ├── outputs.tf     # Output values (what the module exposes)
    ├── versions.tf    # Provider version constraints
    └── README.md      # Usage documentation
# modules/vpc/variables.tf
variable "name" {
description = "Name prefix for VPC resources"
type = string
}
variable "cidr_block" {
description = "CIDR block for the VPC"
type = string
default = "10.0.0.0/16"
}
variable "public_subnet_count" {
description = "Number of public subnets"
type = number
default = 2
}
variable "environment" {
description = "Environment tag"
type = string
}
# modules/vpc/main.tf
# Availability zones referenced by the subnets below
data "aws_availability_zones" "available" {
state = "available"
}
resource "aws_vpc" "this" {
cidr_block = var.cidr_block
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.name}-vpc"
Environment = var.environment
}
}
resource "aws_subnet" "public" {
count = var.public_subnet_count
vpc_id = aws_vpc.this.id
cidr_block = cidrsubnet(var.cidr_block, 8, count.index)
availability_zone = data.aws_availability_zones.available.names[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.name}-public-${count.index + 1}"
Tier = "public"
}
}
resource "aws_internet_gateway" "this" {
vpc_id = aws_vpc.this.id
tags = {
Name = "${var.name}-igw"
}
}
# modules/vpc/outputs.tf
output "vpc_id" {
description = "The ID of the VPC"
value = aws_vpc.this.id
}
output "public_subnet_ids" {
description = "List of public subnet IDs"
value = aws_subnet.public[*].id
}
output "internet_gateway_id" {
description = "The ID of the Internet Gateway"
value = aws_internet_gateway.this.id
}
Using Modules
# Using your custom module:
module "production_vpc" {
source = "./modules/vpc"
name = "prod"
cidr_block = "10.0.0.0/16"
public_subnet_count = 3
environment = "production"
}
module "staging_vpc" {
source = "./modules/vpc"
name = "staging"
cidr_block = "10.1.0.0/16"
public_subnet_count = 2
environment = "staging"
}
# Using a public registry module:
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.4.0"
name = "my-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = true
}
# Reference module outputs:
resource "aws_instance" "web" {
subnet_id = module.production_vpc.public_subnet_ids[0]
# ...
}
When to Create Modules
- Create a module when you repeat the same pattern 3+ times across projects
- Don't create a module for a single resource — that adds complexity without benefit
- Keep modules focused — a "vpc" module, not a "everything-for-my-app" module
- Version your modules using Git tags so consumers can pin versions
- Document inputs/outputs — a module without docs is a liability
IaC Best Practices
Version Control Everything
# .gitignore for Terraform projects
# Never commit these files:
# Local state files (use remote backends!)
*.tfstate
*.tfstate.*
# Note: do NOT ignore .terraform.lock.hcl. It is the dependency lock file
# (provider versions and checksums) and should be committed.
# Terraform working directories
.terraform/
# Variable files with secrets
*.tfvars
!example.tfvars
# Plan files (may contain sensitive data)
*.tfplan
tfplan
# Crash log files
crash.log
crash.*.log
# Override files (local dev only)
override.tf
override.tf.json
*_override.tf
*_override.tf.json
Environment Separation
# Recommended directory structure for multi-environment:
infrastructure/
├── modules/                      # Reusable modules
│   ├── vpc/
│   ├── ecs-cluster/
│   └── rds/
├── environments/
│   ├── dev/
│   │   ├── main.tf               # Uses modules with dev params
│   │   ├── variables.tf
│   │   ├── terraform.tfvars      # Dev-specific values
│   │   └── backend.tf            # Points to dev state bucket
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── backend.tf
└── global/                       # Shared resources (IAM, DNS)
    ├── iam/
    └── route53/
CI/CD for Infrastructure
flowchart LR
A[Developer<br/>Push Code] --> B[PR Created]
B --> C[CI Pipeline]
C --> D[terraform fmt<br/>check]
D --> E[terraform validate]
E --> F[terraform plan]
F --> G[Security Scan<br/>tfsec/checkov]
G --> H[Cost Estimate<br/>infracost]
H --> I[Plan Comment<br/>on PR]
I --> J{Approval?}
J -->|Yes| K[Merge to main]
K --> L[CD Pipeline]
L --> M[terraform apply<br/>-auto-approve]
M --> N[Post-deploy<br/>verification]
J -->|No| O[Request Changes]
style A fill:#3B9797,color:#fff
style M fill:#132440,color:#fff
style N fill:#3B9797,color:#fff
# Example: GitHub Actions CI/CD for Terraform
name: Terraform CI/CD
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']
jobs:
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/production
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
        working-directory: infrastructure
      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/environments/production
      - name: Terraform Plan
        run: terraform plan -no-color -out=tfplan
        working-directory: infrastructure/environments/production
      - name: Comment Plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            // Post plan output as PR comment
  apply:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0   # Same version as the plan job
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/production
      - name: Terraform Apply
        run: terraform apply -auto-approve
        working-directory: infrastructure/environments/production
Testing Infrastructure Code
| Testing Level | Tool | What It Checks | Speed |
|---|---|---|---|
| Syntax | terraform validate | HCL syntax, type errors | Instant |
| Formatting | terraform fmt -check | Consistent code style | Instant |
| Linting | tflint | Best practices, deprecated features | Seconds |
| Security | tfsec, Checkov, Trivy | Security misconfigurations | Seconds |
| Cost | Infracost | Cost impact of changes | Seconds |
| Policy | OPA/Sentinel | Organizational policies | Seconds |
| Integration | Terratest (Go) | Actually deploys & verifies | Minutes |
# Running IaC tests in CI:
# 1. Format check (fails if code isn't formatted)
terraform fmt -check -recursive -diff
# 2. Validate syntax
terraform init -backend=false
terraform validate
# 3. Lint with tflint
tflint --init
tflint --recursive
# 4. Security scan with tfsec
tfsec . --format json --out results.json
# 5. Security scan with Checkov
checkov -d . --framework terraform
# 6. Cost estimation with Infracost
infracost breakdown --path . --format table
# NAME MONTHLY QTY UNIT MONTHLY COST
# aws_instance.web[0]
# ├─ Instance usage 730 hours $7.59
# └─ root_block_device
# └─ Storage (gp3) 20 GB $1.60
# OVERALL TOTAL $9.19
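Policy checks in the table usually run against the machine-readable plan produced by `terraform show -json tfplan`. As a rough illustration of what such a check looks at, the Python sketch below flags any resource whose planned actions include a delete. The embedded JSON is a hand-written, heavily trimmed excerpt, and `destructive_changes` is a hypothetical helper, not a substitute for OPA or Sentinel.

```python
import json

# Hand-written excerpt in the shape of `terraform show -json` plan output.
PLAN_JSON = """
{
  "resource_changes": [
    {"address": "aws_instance.web[0]", "change": {"actions": ["create"]}},
    {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}}
  ]
}
"""


def destructive_changes(plan):
    """Return addresses whose planned actions include a delete."""
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]


if __name__ == "__main__":
    flagged = destructive_changes(json.loads(PLAN_JSON))
    if flagged:
        print("Plan would destroy or replace:", ", ".join(flagged))
```

A CI job could fail the pipeline when the list is non-empty and require an explicit override for intentional replacements.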
Hands-On Exercises
Write Your First Terraform Configuration
Create a Terraform config that manages a local file (no cloud account needed). This uses the local provider to demonstrate the full IaC lifecycle without any cloud costs.
# exercise1/main.tf
# Install Terraform, then run: terraform init && terraform apply
terraform {
required_providers {
local = {
source = "hashicorp/local"
version = "~> 2.0"
}
}
}
resource "local_file" "hello" {
content = "Hello, Infrastructure as Code!\nManaged by Terraform."
filename = "${path.module}/output/hello.txt"
}
resource "local_file" "config" {
content = jsonencode({
app_name = "my-app"
environment = "development"
version = "1.0.0"
features = ["logging", "metrics"]
})
filename = "${path.module}/output/config.json"
}
output "hello_file_path" {
value = local_file.hello.filename
}
output "config_file_path" {
value = local_file.config.filename
}
Tasks: (1) Run terraform init, (2) Run terraform plan and examine the output, (3) Run terraform apply, (4) Verify files were created, (5) Modify content and re-apply, (6) Run terraform destroy.
Declarative vs Imperative Comparison
Implement the same task using both paradigms and observe the differences in behavior, especially around idempotency and error handling.
Task: Create a directory structure with 3 config files.
#!/bin/bash
# imperative-approach.sh
# Run: chmod +x imperative-approach.sh && ./imperative-approach.sh
# Imperative: explicit steps, with no record of what the script manages
mkdir -p /tmp/iac-exercise/configs
echo '{"service": "api", "port": 8080}' > /tmp/iac-exercise/configs/api.json
echo '{"service": "web", "port": 3000}' > /tmp/iac-exercise/configs/web.json
echo '{"service": "worker", "port": 9090}' > /tmp/iac-exercise/configs/worker.json
echo "Created 3 config files"
ls -la /tmp/iac-exercise/configs/
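# Problem: Run this twice and it silently overwrites, with no preview of what will change
# Problem: Delete one file manually and nothing reports it; remove a service from the script and its old file stays behind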
# declarative-approach/main.tf
# Declarative: define desired state, Terraform handles the rest
locals {
services = {
api = { port = 8080 }
web = { port = 3000 }
worker = { port = 9090 }
}
}
resource "local_file" "config" {
for_each = local.services
content = jsonencode({
service = each.key
port = each.value.port
})
filename = "${path.module}/configs/${each.key}.json"
}
# Benefits: Idempotent, detects drift, manages lifecycle
Observe: (1) Run the bash script twice — what happens? (2) Delete a file and re-run each approach. (3) Which approach detects and corrects drift?
Design a Remote State Architecture
Design (on paper or in HCL) a complete remote state architecture for a team of 5 engineers working across 3 environments (dev, staging, prod).
Requirements to address:
- Where will state be stored? (S3, Azure Blob, GCS)
- How will you prevent concurrent modifications? (Locking)
- How will you separate environments? (Separate state files per env)
- How will you encrypt sensitive data in state?
- Who has access to production state vs dev state?
- How will you recover from state corruption?
# Starter template for your design:
# bootstrap/main.tf — Creates the state infrastructure itself
resource "aws_s3_bucket" "terraform_state" {
bucket = "myteam-terraform-state"
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
# Your task: Add encryption, access policies, and per-env backends
Identify and Prevent Drift Scenarios
For each scenario below, explain: (1) How drift occurs, (2) How IaC detects it, (3) How to prevent it.
| Scenario | Your Analysis |
|---|---|
| An engineer opens port 22 via the AWS console during debugging and forgets to close it | How does terraform plan detect this? What policy prevents it? |
| Auto-scaling adds 3 new instances that aren't in Terraform state | Is this drift? How should IaC handle auto-scaled resources? |
| A database RDS instance is upgraded manually from db.t3.medium to db.t3.large | What happens on next terraform apply? Is that safe? |
| Someone deletes a resource that Terraform manages | What does Terraform do? Recreate or error? |
Bonus: Write a cron job or CI schedule that runs terraform plan daily and alerts on drift.
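One possible starting point for the bonus task is the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The Python sketch below wraps that convention; the function names are illustrative, the alerting step is left out, and a result of "drift" can also mean that unapplied code changes exist.

```python
import subprocess


def classify_exit_code(code):
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "drift"
    return "error"


def check_drift(working_dir):
    """Run a plan in working_dir and classify the result.

    Assumes terraform is installed and the directory is already initialized.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=working_dir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(result.returncode)
```

A daily cron entry or scheduled CI job could call `check_drift` for each environment directory and send a notification whenever the result is not "in-sync".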
Conclusion & Next Steps
Infrastructure as Code fundamentally transforms how we manage infrastructure — from manual, error-prone processes to automated, version-controlled, reviewable workflows. The key takeaways from this article:
- IaC eliminates snowflakes — infrastructure is reproducible, consistent, and documented as code
- Declarative > Imperative for most infrastructure provisioning (idempotent, drift-resistant)
- Terraform is the industry standard for multi-cloud declarative IaC with the largest ecosystem
- State is sacred — store remotely, lock it, encrypt it, version it
- Modules enable reuse — package patterns once, deploy everywhere
- CI/CD for infrastructure — plan on PR, apply on merge, test continuously
- Prevent drift — enforce no-manual-change policies and monitor for deviations
Next in the Series
In Part 9: Terraform Fundamentals, we take a deep dive into Terraform's HCL language — providers, resources, data sources, locals, expressions, functions, and real-world deployment patterns. You'll build complete infrastructure from scratch across AWS and Azure.