Your Infrastructure Journey
Congratulations — you've reached the final chapter of a 20-part journey through infrastructure and cloud automation. From bare-metal hardware fundamentals to platform engineering at scale, you've built a comprehensive understanding of how modern infrastructure is designed, deployed, secured, and optimized. This final installment brings everything together: career guidance to translate your knowledge into professional success, and a capstone project that demonstrates mastery across the entire stack.
Over the course of this series, you've progressed from understanding physical servers, networking, and operating systems through configuration management, containerization, orchestration, infrastructure as code, CI/CD, security, and financial optimization. Each part built upon the last, creating a foundation that mirrors the real-world evolution from traditional IT operations to modern cloud-native platform engineering.
Skills You've Built
Let's acknowledge the breadth and depth of knowledge you've accumulated:
| Phase | Parts | Skills Acquired |
|---|---|---|
| Foundations | 1–4 | Hardware, networking, Linux administration, shell scripting |
| Automation | 5–6 | Configuration management (Ansible), containers (Docker) |
| Cloud & IaC | 7–9 | Cloud fundamentals, Terraform, CI/CD pipelines |
| Orchestration | 10–11 | Kubernetes, security hardening |
| Operations | 12–14 | Monitoring, GitOps, platform engineering |
| Advanced | 15–20 | Service mesh, multi-cloud, serverless, DR, FinOps, career |
flowchart LR
subgraph Foundations
P1[1. Hardware]
P2[2. Networking]
P3[3. Linux]
P4[4. Scripting]
end
subgraph Automation
P5[5. Ansible]
P6[6. Containers]
end
subgraph Cloud
P7[7. Cloud Fundamentals]
P8[8. Terraform]
P9[9. CI/CD]
end
subgraph Orchestration
P10[10. Kubernetes]
P11[11. Security]
end
subgraph Operations
P12[12. Monitoring]
P13[13. GitOps]
P14[14. Platform Eng]
end
subgraph Advanced
P15[15. Service Mesh]
P16[16. Multi-Cloud]
P17[17. Serverless]
P18[18. DR]
P19[19. FinOps]
P20[20. Career]
end
Foundations --> Automation --> Cloud --> Orchestration --> Operations --> Advanced
Career Paths in Infrastructure
The infrastructure domain offers multiple career trajectories, each with distinct responsibilities, required skills, and growth potential. Understanding these paths helps you target your job search and professional development effectively.
Infrastructure / Cloud Engineer
Cloud Engineers design, implement, and maintain cloud infrastructure. They focus on resource provisioning, networking, storage, and compute services. This role emphasizes breadth across cloud services and strong IaC skills. Day-to-day work includes writing Terraform modules, configuring VPCs, managing IAM policies, and troubleshooting infrastructure issues.
DevOps Engineer
DevOps Engineers bridge development and operations, focusing on CI/CD pipelines, automation, and developer productivity. They build and maintain deployment pipelines, manage artifact repositories, implement testing automation, and ensure smooth software delivery from code commit to production. The role demands strong scripting skills and deep knowledge of CI/CD tooling.
Site Reliability Engineer (SRE)
SREs apply software engineering principles to infrastructure and operations problems. Originating at Google, SRE focuses on reliability through SLOs, error budgets, incident management, and automation that eliminates toil. SREs write code to solve operational problems and are expected to spend at least 50% of their time on engineering rather than operations.
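The error-budget arithmetic behind that model is simple enough to sketch. The Python below is a minimal illustration, not part of any SRE toolkit: the function names and the 30-day default window are assumptions chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget
```

For a 99.9% availability SLO over 30 days, `error_budget_minutes(0.999)` comes to roughly 43.2 minutes. A common policy is to slow feature releases and prioritize reliability work once that budget is spent.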
Platform Engineer
Platform Engineers build internal developer platforms (IDPs) that abstract infrastructure complexity. They create golden paths, self-service portals, and standardized templates that enable development teams to deploy independently. This role combines infrastructure expertise with product thinking — your users are internal developers.
Cloud Architect / Solutions Architect
Cloud Architects design large-scale distributed systems, make technology selection decisions, define standards, and provide technical leadership. They bridge business requirements with technical implementation, often working with enterprise customers or leading architecture decisions across multiple teams.
Security Engineer / DevSecOps
Security Engineers focused on infrastructure protect cloud environments through policy enforcement, vulnerability management, compliance automation, and incident response. DevSecOps practitioners embed security into CI/CD pipelines and shift security left into the development lifecycle.
| Role | Primary Focus | Key Skills | Salary Range (US) |
|---|---|---|---|
| Cloud Engineer | Infrastructure provisioning & maintenance | Terraform, AWS/Azure/GCP, networking | $110k–$170k |
| DevOps Engineer | CI/CD & developer productivity | Jenkins/GitHub Actions, Docker, scripting | $120k–$180k |
| SRE | Reliability & incident response | SLOs, observability, Go/Python, Kubernetes | $140k–$220k |
| Platform Engineer | Internal developer platforms | Backstage, Kubernetes, API design, UX | $140k–$210k |
| Cloud Architect | System design & technical strategy | Multi-cloud, distributed systems, leadership | $160k–$250k |
| DevSecOps | Security automation & compliance | SAST/DAST, OPA, network security, compliance | $130k–$200k |
flowchart TD
A[Junior Sysadmin / IT Support] --> B[Cloud Engineer]
A --> C[DevOps Engineer]
B --> D[Senior Cloud Engineer]
C --> E[Senior DevOps Engineer]
D --> F[Cloud Architect]
D --> G[SRE]
E --> G
E --> H[Platform Engineer]
G --> I[Staff SRE / Principal]
H --> J[Staff Platform Engineer]
F --> K[Principal Architect / VP Engineering]
I --> K
J --> K
C --> L[DevSecOps Engineer]
L --> M[Security Architect]
M --> K
Certification Roadmap
Certifications validate your knowledge and signal competence to employers. While they're not a substitute for hands-on experience, they open doors — particularly for career changers and early-career professionals. Here's a strategic roadmap organized by vendor and difficulty level.
AWS Certification Path
| Level | Certification | Focus | Prep Time |
|---|---|---|---|
| Foundational | Cloud Practitioner (CLF-C02) | Cloud concepts, billing, security basics | 2–4 weeks |
| Associate | Solutions Architect Associate (SAA-C03) | Architecture design, services selection | 4–8 weeks |
| Associate | SysOps Administrator (SOA-C02) | Operations, monitoring, troubleshooting | 4–6 weeks |
| Professional | DevOps Engineer Professional (DOP-C02) | CI/CD, automation, SDLC | 8–12 weeks |
| Specialty | Advanced Networking / Security | Deep-dive domains | 6–10 weeks |
Azure Certification Path
| Level | Certification | Focus | Prep Time |
|---|---|---|---|
| Foundational | AZ-900: Azure Fundamentals | Cloud concepts, Azure services overview | 1–3 weeks |
| Associate | AZ-104: Azure Administrator | Resource management, networking, identity | 4–8 weeks |
| Expert | AZ-400: DevOps Engineer Expert | CI/CD, IaC, security, compliance | 6–10 weeks |
| Expert | AZ-305: Solutions Architect Expert | Architecture design, governance, identity | 8–12 weeks |
Vendor-Neutral & Kubernetes Certifications
| Certification | Vendor | Focus | Difficulty | Cost |
|---|---|---|---|---|
| Terraform Associate (003) | HashiCorp | IaC fundamentals, HCL, state management | Moderate | $70 |
| CKA (Certified Kubernetes Admin) | CNCF | Cluster admin, networking, troubleshooting | Hard | $395 |
| CKAD (Certified Kubernetes App Dev) | CNCF | Application deployment, configuration | Moderate | $395 |
| CKS (Certified Kubernetes Security) | CNCF | Cluster security, supply chain, runtime | Very Hard | $395 |
| LFCS (Linux Foundation Certified Sysadmin) | Linux Foundation | Linux administration, networking, security | Moderate | $395 |
flowchart TD
subgraph "Year 1: Foundations"
A1[AWS Cloud Practitioner<br/>or AZ-900] --> A2[Terraform Associate]
A2 --> A3[AWS SAA or AZ-104]
end
subgraph "Year 2: Specialization"
B1[CKA] --> B2[AWS DevOps Professional<br/>or AZ-400]
B2 --> B3[CKAD or CKS]
end
subgraph "Year 3+: Expert"
C1[AWS/Azure Solutions Architect] --> C2[Specialty Certs]
end
A3 --> B1
B3 --> C1
Building Your Portfolio
Your GitHub profile is your infrastructure resume. Hiring teams for cloud and DevOps roles often review a candidate's repositories alongside, or even before, their LinkedIn profile. A well-structured portfolio demonstrates that you can not only build infrastructure but also document, organize, and communicate your work professionally.
GitHub Profile Best Practices
- Profile README: Create a personal README.md with a brief intro, tech stack badges, and links to key projects
- Pinned repositories: Pin your 4–6 best infrastructure projects
- Consistent activity: Regular commits show ongoing learning
- Clean commit history: Meaningful commit messages, not "fix" or "update"
Key Projects to Showcase
# Ideal GitHub repository structure for an infrastructure project
my-infra-project/
├── README.md                  # Architecture diagram, setup instructions, decisions
├── LICENSE
├── .github/
│   └── workflows/
│       ├── ci.yml             # Terraform validate + plan on PR
│       └── cd.yml             # Terraform apply on merge to main
├── docs/
│   ├── architecture.md        # Detailed architecture decisions
│   ├── runbook.md             # Operational procedures
│   └── cost-analysis.md       # Monthly cost breakdown
├── terraform/
│   ├── environments/
│   │   ├── dev/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── terraform.tfvars
│   │   ├── staging/
│   │   └── prod/
│   ├── modules/
│   │   ├── networking/
│   │   ├── compute/
│   │   ├── database/
│   │   └── monitoring/
│   └── global/
│       └── backend.tf
├── kubernetes/
│   ├── base/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── ingress.yaml
│   └── overlays/
│       ├── dev/
│       └── prod/
├── monitoring/
│   ├── prometheus/
│   ├── grafana/
│   └── alertmanager/
└── Makefile                   # Common commands documented
Writing About Your Projects
Document not just what you built but why you made specific decisions. A README that explains "I chose EKS over ECS because..." demonstrates architectural thinking. Include:
- Architecture diagrams (Mermaid or draw.io)
- Cost estimates (Infracost output)
- Trade-off decisions and alternatives considered
- What you would do differently in a production environment
- Lessons learned during implementation
# Example README.md header for a portfolio project
cat <<'EOF' > README.md
# Production-Grade EKS Infrastructure
[Terraform](https://terraform.io)
[Amazon EKS](https://aws.amazon.com/eks/)
[GitHub Actions](https://github.com/features/actions)
## Architecture
Multi-environment EKS cluster with:
- VPC with public/private subnets across 3 AZs
- Managed node groups with spot instances (70% cost saving)
- RDS PostgreSQL with Multi-AZ failover
- Prometheus + Grafana observability stack
- GitHub Actions CI/CD with Terraform plan/apply
## Quick Start
```bash
# Prerequisites: AWS CLI, Terraform, kubectl
make init ENV=dev
make plan ENV=dev
make apply ENV=dev
```
## Cost Estimate
| Environment | Monthly Cost | Spot Savings |
|-------------|-------------|--------------|
| Dev | $145/month | $89 saved |
| Staging | $312/month | $198 saved |
| Prod | $1,247/month| $834 saved |
## Architecture Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Orchestration | EKS | Team Kubernetes expertise, portability |
| Node type | Spot + On-demand mix | Cost optimization with reliability |
| Database | RDS PostgreSQL | Managed service, Multi-AZ, automated backups |
| IaC | Terraform | Multi-cloud support, mature ecosystem |
EOF
Interview Preparation
Infrastructure interviews test both technical depth and operational judgment. You'll face a mix of system design questions, hands-on coding challenges, troubleshooting scenarios, and behavioral questions that assess how you handle incidents and collaborate with teams.
Common Interview Topics by Role
| Topic | Cloud Engineer | DevOps | SRE | Platform Eng |
|---|---|---|---|---|
| Networking (VPC, DNS, LB) | ★★★ | ★★ | ★★ | ★★ |
| Terraform / IaC | ★★★ | ★★★ | ★★ | ★★★ |
| CI/CD Pipelines | ★★ | ★★★ | ★★ | ★★★ |
| Kubernetes | ★★ | ★★★ | ★★★ | ★★★ |
| Monitoring & SLOs | ★★ | ★★ | ★★★ | ★★ |
| System Design | ★★ | ★★ | ★★★ | ★★★ |
| Incident Management | ★ | ★★ | ★★★ | ★★ |
| Security / IAM | ★★★ | ★★ | ★★ | ★★ |
Sample Questions with Answer Frameworks
Answer Framework (STAR + Architecture):
# Framework: describe the pipeline stages and technology choices
# 1. Source Stage
# - GitHub repository with branch protection
# - PR triggers CI, merge to main triggers CD
# 2. Build Stage
# - Docker multi-stage builds
# - Unit tests + code coverage
# - SAST scanning (Snyk/Trivy)
# - Container image push to ECR/ACR
# 3. Deploy to Dev (automatic)
# - Terraform plan + apply for infrastructure
# - Kubernetes rolling deployment
# - Smoke tests after deploy
# 4. Deploy to Staging (automatic after dev passes)
# - Integration tests
# - Performance tests (k6/Locust)
# - Security scanning
# 5. Deploy to Production (manual approval gate)
# - Canary deployment (10% → 50% → 100%)
# - Automated rollback if error rate > 1%
# - Post-deploy validation
# 6. Observability
# - Deployment markers in Grafana
# - SLO monitoring during rollout
# - Automated incident creation on failure
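The promotion and rollback rule in stage 5 can be expressed in a few lines, which is handy if an interviewer asks you to whiteboard it. This is a hedged sketch: `next_canary_step`, `CANARY_STEPS`, and the choice to hold when there is no traffic are illustrative assumptions, and real rollouts are typically driven by a progressive-delivery controller (for example Argo Rollouts or Flagger) rather than hand-written code.

```python
from typing import Optional

CANARY_STEPS = (10, 50, 100)   # traffic percentages from the framework above
ERROR_RATE_THRESHOLD = 0.01    # roll back when more than 1% of requests fail


def next_canary_step(current_percent: int, errors: int, requests: int) -> Optional[int]:
    """Return the next traffic percentage, or None to signal a rollback.

    With no traffic observed yet, hold at the current step (an assumption).
    At 100% and healthy, the result stays at 100.
    """
    if current_percent not in CANARY_STEPS:
        raise ValueError(f"unknown canary step: {current_percent}")
    if requests == 0:
        return current_percent  # not enough data to decide
    if errors / requests > ERROR_RATE_THRESHOLD:
        return None  # caller should trigger the rollback
    idx = CANARY_STEPS.index(current_percent)
    return CANARY_STEPS[min(idx + 1, len(CANARY_STEPS) - 1)]
```

In an interview, the useful follow-up is what feeds `errors` and `requests`: usually a metrics query scoped to the canary pods over a fixed observation window.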
# Answer: Modular VPC with 3 AZs
# This is a common live-coding challenge in infrastructure interviews
variable "vpc_cidr" {
description = "CIDR block for the VPC"
type = string
default = "10.0.0.0/16"
}
variable "environment" {
description = "Environment name"
type = string
}
variable "availability_zones" {
description = "List of AZs"
type = list(string)
default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.environment}-vpc"
Environment = var.environment
ManagedBy = "terraform"
}
}
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index)
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.environment}-public-${var.availability_zones[count.index]}"
Type = "public"
}
}
resource "aws_subnet" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
availability_zone = var.availability_zones[count.index]
tags = {
Name = "${var.environment}-private-${var.availability_zones[count.index]}"
Type = "private"
}
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = { Name = "${var.environment}-igw" }
}
resource "aws_eip" "nat" {
domain = "vpc"
tags = { Name = "${var.environment}-nat-eip" }
}
resource "aws_nat_gateway" "main" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public[0].id
tags = { Name = "${var.environment}-nat" }
depends_on = [aws_internet_gateway.main]
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = { Name = "${var.environment}-public-rt" }
}
resource "aws_route_table" "private" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main.id
}
tags = { Name = "${var.environment}-private-rt" }
}
resource "aws_route_table_association" "public" {
count = length(var.availability_zones)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private" {
count = length(var.availability_zones)
subnet_id = aws_subnet.private[count.index].id
route_table_id = aws_route_table.private.id
}
output "vpc_id" { value = aws_vpc.main.id }
output "public_subnet_ids" { value = aws_subnet.public[*].id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
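If you are asked to explain what `cidrsubnet(var.vpc_cidr, 8, count.index)` actually produces, it helps to be able to reproduce the arithmetic. The sketch below approximates Terraform's `cidrsubnet` for IPv4 with Python's standard `ipaddress` module. The Python function is a stand-in written for this example, not Terraform's implementation.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Approximate Terraform's cidrsubnet() for IPv4 networks."""
    network = ipaddress.IPv4Network(prefix)
    new_prefix = network.prefixlen + newbits
    if new_prefix > network.max_prefixlen:
        raise ValueError("newbits extends past the 32-bit address length")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    size = 2 ** (network.max_prefixlen - new_prefix)  # addresses per subnet
    base = int(network.network_address) + netnum * size
    return str(ipaddress.IPv4Network((base, new_prefix)))


# Same expressions as the public and private subnet resources above
public = [cidrsubnet("10.0.0.0/16", 8, i) for i in range(3)]
private = [cidrsubnet("10.0.0.0/16", 8, i + 10) for i in range(3)]
```

With the default `10.0.0.0/16`, the public subnets come out as `10.0.0.0/24` through `10.0.2.0/24` and the private ones as `10.0.10.0/24` through `10.0.12.0/24`, so the `+ 10` offset keeps the two ranges apart.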
Behavioral Questions for DevOps/SRE
- "Tell me about an incident you managed." — Use the timeline format: detection → triage → mitigation → resolution → post-mortem
- "How do you handle pushback from developers on security/process changes?" — Emphasize empathy, data-driven arguments, and incremental adoption
- "Describe a time you automated a manual process." — Quantify: hours saved, error reduction, team impact
- "What's your approach to on-call?" — Runbooks, escalation policies, blameless post-mortems, reducing alert fatigue
The Capstone Project — Overview
This capstone project ties together concepts from all 20 parts into a single, production-grade infrastructure deployment. You'll build a multi-environment web application platform using Terraform for provisioning, Kubernetes for orchestration, GitHub Actions for CI/CD, and Prometheus/Grafana for observability.
Technology Stack
| Layer | Technology | Series Part |
|---|---|---|
| Infrastructure as Code | Terraform 1.7+ | Part 8 |
| Cloud Provider | AWS (VPC, EKS, RDS, S3) | Part 7 |
| Container Orchestration | Kubernetes (EKS 1.29) | Part 10 |
| CI/CD | GitHub Actions | Part 9 |
| Monitoring | Prometheus + Grafana (Helm) | Part 12 |
| Security | RBAC, Network Policies, OPA | Part 11 |
| Cost Management | Infracost, spot instances | Part 19 |
flowchart TB
subgraph Internet
U[Users]
GH[GitHub]
end
subgraph AWS["AWS Cloud"]
subgraph VPC["VPC (10.0.0.0/16)"]
subgraph Public["Public Subnets"]
ALB[Application Load Balancer]
NAT[NAT Gateway]
end
subgraph Private["Private Subnets"]
subgraph EKS["EKS Cluster"]
APP[App Pods]
MON[Monitoring Pods]
end
RDS[(RDS PostgreSQL)]
end
end
S3[S3 Bucket]
ECR[ECR Registry]
end
U --> ALB --> APP
APP --> RDS
APP --> S3
GH --> ECR --> EKS
MON --> APP
Capstone Phase 1: Foundation & Networking
Create the network foundation: VPC with public and private subnets across 3 availability zones, internet gateway, NAT gateway, security groups, and network ACLs. This is the base layer everything else builds upon.
# terraform/environments/dev/main.tf
# Capstone Phase 1: Foundation configuration
terraform {
required_version = ">= 1.7.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.40"
}
}
backend "s3" {
bucket = "capstone-terraform-state"
key = "dev/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = "capstone"
Environment = var.environment
ManagedBy = "terraform"
Owner = "platform-team"
}
}
}
module "networking" {
source = "../../modules/networking"
environment = var.environment
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
private_subnet_cidrs = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
enable_nat_gateway = true
single_nat_gateway = true # Cost saving for dev; use false for prod
}
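Because the subnet CIDRs here are hard-coded lists rather than computed values, a typo can produce overlapping or out-of-range subnets that only surface when AWS rejects them during apply. One possible pre-flight check, sketched in Python with the standard `ipaddress` module, is shown below. `validate_subnets` is a hypothetical helper, not something Terraform or the AWS provider ships.

```python
import ipaddress
from itertools import combinations
from typing import List


def validate_subnets(vpc_cidr: str, subnet_cidrs: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must sit inside the VPC range
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside {vpc}")
    # No two subnets may share addresses
    for a, b in combinations(subnets, 2):
        if a.overlaps(b):
            problems.append(f"{a} overlaps {b}")
    return problems


# The values passed to the networking module above
capstone_subnets = [
    "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24",
    "10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24",
]
problems = validate_subnets("10.0.0.0/16", capstone_subnets)
```

A check like this could run in CI before `terraform plan`, for example against values read from `terraform.tfvars`.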
# terraform/modules/networking/main.tf
# Reusable networking module
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = { Name = "${var.environment}-vpc" }
}
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = var.public_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.environment}-public-${count.index + 1}"
"kubernetes.io/role/elb" = "1"
}
}
resource "aws_subnet" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = var.private_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
tags = {
Name = "${var.environment}-private-${count.index + 1}"
"kubernetes.io/role/internal-elb" = "1"
}
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = { Name = "${var.environment}-igw" }
}
resource "aws_eip" "nat" {
count = var.enable_nat_gateway ? 1 : 0
domain = "vpc"
tags = { Name = "${var.environment}-nat-eip" }
}
resource "aws_nat_gateway" "main" {
count = var.enable_nat_gateway ? 1 : 0
allocation_id = aws_eip.nat[0].id
subnet_id = aws_subnet.public[0].id
tags = { Name = "${var.environment}-nat" }
depends_on = [aws_internet_gateway.main]
}
Security Groups
# terraform/modules/networking/security_groups.tf
# Security groups for the capstone project
resource "aws_security_group" "alb" {
name_prefix = "${var.environment}-alb-"
vpc_id = aws_vpc.main.id
description = "Security group for Application Load Balancer"
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTP from internet"
}
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTPS from internet"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { Name = "${var.environment}-alb-sg" }
}
resource "aws_security_group" "eks_nodes" {
name_prefix = "${var.environment}-eks-nodes-"
vpc_id = aws_vpc.main.id
description = "Security group for EKS worker nodes"
ingress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
description = "Node-to-node communication"
}
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
description = "HTTP from ALB"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { Name = "${var.environment}-eks-nodes-sg" }
}
resource "aws_security_group" "database" {
name_prefix = "${var.environment}-db-"
vpc_id = aws_vpc.main.id
description = "Security group for RDS database"
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.eks_nodes.id]
description = "PostgreSQL from EKS nodes only"
}
tags = { Name = "${var.environment}-db-sg" }
}
Capstone Phase 2: Compute & Storage
Deploy a managed Kubernetes cluster with mixed node groups (on-demand + spot), a managed PostgreSQL database with automated backups, and S3 storage with lifecycle policies for cost optimization.
# terraform/modules/compute/eks.tf
# EKS cluster with managed node groups
resource "aws_eks_cluster" "main" {
name = "${var.environment}-cluster"
role_arn = aws_iam_role.eks_cluster.arn
version = "1.29"
vpc_config {
subnet_ids = var.private_subnet_ids
endpoint_private_access = true
endpoint_public_access = var.environment == "dev"
security_group_ids = [var.eks_security_group_id]
}
encryption_config {
provider { key_arn = aws_kms_key.eks.arn }
resources = ["secrets"]
}
enabled_cluster_log_types = [
"api", "audit", "authenticator",
"controllerManager", "scheduler"
]
tags = { Name = "${var.environment}-eks" }
}
resource "aws_eks_node_group" "ondemand" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.environment}-ondemand"
node_role_arn = aws_iam_role.eks_node.arn
subnet_ids = var.private_subnet_ids
capacity_type = "ON_DEMAND"
instance_types = ["t3.medium"]
scaling_config {
desired_size = var.environment == "prod" ? 3 : 2
min_size = var.environment == "prod" ? 3 : 1
max_size = var.environment == "prod" ? 10 : 4
}
labels = {
workload = "critical"
type = "ondemand"
}
tags = { Name = "${var.environment}-ondemand-nodes" }
}
resource "aws_eks_node_group" "spot" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.environment}-spot"
node_role_arn = aws_iam_role.eks_node.arn
subnet_ids = var.private_subnet_ids
capacity_type = "SPOT"
instance_types = ["t3.medium", "t3.large", "t3a.medium", "t3a.large"]
scaling_config {
desired_size = var.environment == "prod" ? 3 : 1
min_size = 0
max_size = var.environment == "prod" ? 8 : 3
}
labels = {
workload = "flexible"
type = "spot"
}
taint {
key = "spot"
value = "true"
effect = "PREFER_NO_SCHEDULE"
}
tags = { Name = "${var.environment}-spot-nodes" }
}
Database & Object Storage
# terraform/modules/database/rds.tf
# Managed PostgreSQL with security best practices
resource "aws_db_subnet_group" "main" {
name = "${var.environment}-db-subnet-group"
subnet_ids = var.private_subnet_ids
tags = { Name = "${var.environment}-db-subnet-group" }
}
resource "aws_db_instance" "postgres" {
identifier = "${var.environment}-postgres"
engine = "postgres"
engine_version = "16.1"
instance_class = var.environment == "prod" ? "db.r6g.large" : "db.t3.medium"
allocated_storage = 20
max_allocated_storage = var.environment == "prod" ? 100 : 50
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
db_name = "capstone"
username = "capstone_admin"
password = var.db_password # From secrets manager
multi_az = var.environment == "prod"
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [var.database_security_group_id]
backup_retention_period = var.environment == "prod" ? 30 : 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
deletion_protection = var.environment == "prod"
skip_final_snapshot = var.environment != "prod"
performance_insights_enabled = true
tags = { Name = "${var.environment}-postgres" }
}
# terraform/modules/storage/s3.tf
# Object storage with lifecycle policies
resource "aws_s3_bucket" "app" {
bucket = "${var.environment}-capstone-app-assets"
tags = { Name = "${var.environment}-app-assets" }
}
resource "aws_s3_bucket_versioning" "app" {
bucket = aws_s3_bucket.app.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "app" {
bucket = aws_s3_bucket.app.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
resource "aws_s3_bucket_lifecycle_configuration" "app" {
bucket = aws_s3_bucket.app.id
rule {
id = "transition-to-ia"
status = "Enabled"
filter {} # Empty filter applies the rule to every object in the bucket
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
noncurrent_version_expiration {
noncurrent_days = 60
}
}
}
resource "aws_s3_bucket_public_access_block" "app" {
bucket = aws_s3_bucket.app.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
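To reason about where an object sits at a given age under this lifecycle rule, a small model helps. The Python below is only a sketch of the rule's intent: it assumes objects are uploaded to the STANDARD class, and it ignores the fact that S3 applies lifecycle actions asynchronously, so real transitions can lag the configured day counts.

```python
from typing import List, Tuple

# (minimum age in days, storage class), mirroring the transitions in the rule above
TRANSITIONS: List[Tuple[int, str]] = [(30, "STANDARD_IA"), (90, "GLACIER")]
NONCURRENT_EXPIRATION_DAYS = 60


def storage_class_for_age(age_days: int) -> str:
    """Storage class a current object version is expected to have at a given age."""
    current = "STANDARD"  # assumed upload class
    for min_age, storage_class in sorted(TRANSITIONS):
        if age_days >= min_age:
            current = storage_class
    return current


def noncurrent_version_expired(days_noncurrent: int) -> bool:
    """Whether a superseded version is past the noncurrent expiration window."""
    return days_noncurrent >= NONCURRENT_EXPIRATION_DAYS
```

A model like this is mainly useful for cost estimates: multiply the bytes in each age bucket by the per-class storage price for your region.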
Capstone Phase 3: Application & CI/CD
Application Deployment & CI/CD
Deploy the application with Kubernetes manifests (Deployment, Service, Ingress, ConfigMap) and automate the entire workflow with GitHub Actions: lint, plan, apply, build, push, and deploy across environments.
# kubernetes/base/deployment.yaml
# Application deployment with best practices
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capstone-app
  labels:
    app: capstone
    component: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: capstone
      component: api
  template:
    metadata:
      labels:
        app: capstone
        component: api
    spec:
      serviceAccountName: capstone-app
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: app
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/capstone:latest
          ports:
            - containerPort: 8080
              protocol: TCP
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: capstone-db-credentials
                  key: url
            - name: S3_BUCKET
              valueFrom:
                configMapKeyRef:
                  name: capstone-config
                  key: s3_bucket
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: capstone
# kubernetes/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: capstone-app
  labels:
    app: capstone
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app: capstone
    component: api
---
# kubernetes/base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: capstone-app
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789:certificate/xxx
    # ssl-redirect only acts on an HTTP listener, so port 80 must be listed too
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: capstone-app
                port:
                  number: 80
GitHub Actions CI/CD Pipeline
# .github/workflows/deploy.yml
# Complete CI/CD pipeline for the capstone project
name: Deploy Infrastructure & Application

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  AWS_REGION: us-east-1
  EKS_CLUSTER: dev-cluster
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  ECR_REPOSITORY: capstone

jobs:
  # Job 1: Terraform Lint & Validate
  terraform-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
      - name: Terraform Format Check
        run: terraform fmt -check -recursive terraform/
      - name: Terraform Init & Validate
        working-directory: terraform/environments/dev
        run: |
          terraform init -backend=false
          terraform validate

  # Job 2: Terraform Plan (on PR)
  terraform-plan:
    runs-on: ubuntu-latest
    needs: terraform-validate
    if: github.event_name == 'pull_request'
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Plan
        working-directory: terraform/environments/dev
        run: |
          set -o pipefail  # Fail the step when plan fails, not only when tee fails
          terraform init
          terraform plan -out=tfplan -no-color | tee plan-output.txt
      - name: Comment PR with Plan
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/environments/dev/plan-output.txt', 'utf8');
            const truncated = plan.length > 60000 ? plan.substring(0, 60000) + '\n...(truncated)' : plan;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan\n\`\`\`\n${truncated}\n\`\`\``
            });

  # Job 3: Build & Push Container
  build:
    runs-on: ubuntu-latest
    needs: terraform-validate
    if: github.event_name == 'push'
    permissions:
      id-token: write  # Required for OIDC role assumption
      contents: read
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}
      - uses: aws-actions/amazon-ecr-login@v2
      - name: Build and Push
        id: meta
        run: |
          IMAGE_TAG="${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}"
          docker build -t "$IMAGE_TAG" .
          docker push "$IMAGE_TAG"
          echo "tags=$IMAGE_TAG" >> "$GITHUB_OUTPUT"

  # Job 4: Deploy to Dev
  deploy-dev:
    runs-on: ubuntu-latest
    needs: build
    if: github.event_name == 'push'
    environment: dev
    permissions:
      id-token: write  # Required for OIDC role assumption
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name ${{ env.EKS_CLUSTER }}
      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/capstone-app \
            app=${{ needs.build.outputs.image_tag }} \
            -n capstone
          kubectl rollout status deployment/capstone-app \
            -n capstone --timeout=300s
      - name: Smoke Test
        run: |
          APP_URL=$(kubectl get ingress capstone-app -n capstone -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
          # Calling the ALB hostname directly: send the Ingress host so the rule matches,
          # and use -k because the certificate is issued for app.example.com
          for i in {1..10}; do
            # "|| true" keeps a failed connection from aborting the retry loop
            STATUS=$(curl -sk -o /dev/null -w "%{http_code}" -H "Host: app.example.com" "https://$APP_URL/healthz" || true)
            if [ "$STATUS" = "200" ]; then echo "Health check passed"; exit 0; fi
            sleep 5
          done
          echo "Health check failed"; exit 1
Capstone Phase 4: Observability & Security
Deploy a full monitoring stack with Prometheus, Grafana, and Alertmanager via Helm. Implement Kubernetes RBAC, network policies, and secrets management to secure the entire platform.
# monitoring/prometheus/values.yaml
# Helm values for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi

grafana:
  enabled: true
  adminPassword: ""  # Set via secret
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: 'Capstone'
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 5Gi
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: 'slack-notifications'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - receiver: 'pagerduty-critical'
          match:
            severity: critical
          repeat_interval: 1h
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - api_url: ''  # Set via secret
            channel: '#alerts-capstone'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: ''  # Set via secret
# monitoring/alerts/app-alerts.yaml
# Custom PrometheusRule for application alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capstone-app-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: capstone.app
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="capstone-app",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="capstone-app"}[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on capstone-app"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="capstone-app"}[5m])) by (le)
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High p99 latency on capstone-app"
            description: "p99 latency is {{ $value }}s (threshold: 2s)"
        - alert: PodCrashLooping
          # increase() yields the restart count over the window, which is what
          # the description reports; rate() would give restarts per second.
          expr: increase(kube_pod_container_status_restarts_total{namespace="capstone"}[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash-looping"
            description: "Pod has restarted {{ $value | humanize }} times in the last 15 minutes"
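Before relying on the HighErrorRate rule, it can help to evaluate its expression by hand against a running Prometheus. The sketch below builds the same ratio (without the threshold) and sends it to the Prometheus HTTP API endpoint `/api/v1/query`. The function names and the `PROM_URL` default are assumptions for illustration; the default presumes a local port-forward to the Prometheus service.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the alert expression by hand.

# error_ratio_query JOB WINDOW
# Prints the PromQL ratio used by the HighErrorRate rule, minus the threshold.
error_ratio_query() {
  local job="$1" window="$2"
  printf 'sum(rate(http_requests_total{job="%s",status=~"5.."}[%s])) / sum(rate(http_requests_total{job="%s"}[%s]))\n' \
    "$job" "$window" "$job" "$window"
}

# prom_query EXPR
# Sends an instant query to the Prometheus HTTP API and prints the JSON reply.
# PROM_URL is an assumption: the default presumes a port-forward on localhost:9090.
prom_query() {
  local expr="$1"
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=${expr}"
}
```

A typical session would port-forward first (for example `kubectl port-forward -n monitoring svc/prometheus-operated 9090`, assuming the operator's default service name) and then run `prom_query "$(error_ratio_query capstone-app 5m)"`. An empty result usually means the `job` label or metric name does not match what the application exports.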
RBAC & Network Policies
# kubernetes/security/rbac.yaml
# Namespace-scoped RBAC for the capstone application
apiVersion: v1
kind: Namespace
metadata:
  name: capstone
  labels:
    name: capstone
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: capstone-app
  namespace: capstone
  annotations:
    # Placeholder account ID; replace with your own 12-digit AWS account ID
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/capstone-app-role
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: capstone-app-role
  namespace: capstone
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: capstone-app-binding
  namespace: capstone
subjects:
  - kind: ServiceAccount
    name: capstone-app
    namespace: capstone
roleRef:
  kind: Role
  name: capstone-app-role
  apiGroup: rbac.authorization.k8s.io
# kubernetes/security/network-policy.yaml
# Restrict network traffic to least-privilege
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: capstone-app-policy
  namespace: capstone
spec:
  podSelector:
    matchLabels:
      app: capstone
      component: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace by Kubernetes
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow DNS (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow database access
    - to:
        - ipBlock:
            cidr: 10.0.11.0/24  # Private subnet CIDR
      ports:
        - protocol: TCP
          port: 5432
    # Allow HTTPS egress for S3 (via VPC endpoint).
    # Note: 0.0.0.0/0 permits HTTPS to any destination; narrow it to
    # known S3 ranges if stricter egress control is required.
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
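The policy above only restricts pods labelled `app: capstone` and `component: api`; any other pod in the namespace stays unrestricted. A common complement is a namespace-wide default-deny policy. The sketch below prints such a manifest so it can be piped to `kubectl apply`; the function name and the policy name `default-deny-all` are hypothetical.

```shell
#!/usr/bin/env bash
# default_deny_manifest NAMESPACE
# Prints a NetworkPolicy that selects every pod in NAMESPACE and allows no
# ingress or egress. Policies are additive, so capstone-app-policy still
# permits the traffic it explicitly lists.
default_deny_manifest() {
  local namespace="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ${namespace}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF
}
```

Usage would be `default_deny_manifest capstone | kubectl apply -f -`. Either policy only takes effect when the cluster's CNI plugin enforces NetworkPolicy, so it is worth confirming that before relying on these rules.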
# Deploy the monitoring stack and security policies
# Run these commands after EKS cluster is ready
# Add Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create monitoring namespace
kubectl create namespace monitoring
# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values monitoring/prometheus/values.yaml \
--set grafana.adminPassword="$(aws secretsmanager get-secret-value \
--secret-id capstone/grafana-admin --query SecretString --output text)"
# Apply RBAC and Network Policies
kubectl apply -f kubernetes/security/rbac.yaml
kubectl apply -f kubernetes/security/network-policy.yaml
# Apply custom alert rules
kubectl apply -f monitoring/alerts/app-alerts.yaml
# Verify deployment
kubectl get pods -n monitoring
kubectl get prometheusrules -n monitoring
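After applying the manifests, it is worth confirming that the Role grants what was intended and nothing more. A minimal sketch using `kubectl auth can-i` with service-account impersonation follows. The helper name `expect_can_i` and the `KUBECTL` override variable are assumptions introduced here; the override exists only so the function can be exercised without a cluster.

```shell
#!/usr/bin/env bash
# expect_can_i EXPECTED VERB RESOURCE
# Asks the API server whether the capstone-app service account may perform
# VERB on RESOURCE in the capstone namespace, then compares the printed
# answer ("yes" or "no") with EXPECTED.
expect_can_i() {
  local expected="$1" verb="$2" resource="$3"
  local subject="system:serviceaccount:capstone:capstone-app"
  local answer
  # kubectl exits non-zero when the answer is "no", so ignore the exit code
  # and compare the printed answer instead.
  answer="$("${KUBECTL:-kubectl}" auth can-i "$verb" "$resource" \
    --as="$subject" -n capstone 2>/dev/null || true)"
  if [ "$answer" = "$expected" ]; then
    echo "ok: ${verb} ${resource} -> ${answer}"
  else
    echo "MISMATCH: ${verb} ${resource} -> ${answer:-none} (expected ${expected})"
    return 1
  fi
}
```

For the Role defined in rbac.yaml, `expect_can_i yes get secrets` and `expect_can_i no delete pods` should both report `ok`. Impersonation requires that the user running the check is allowed to impersonate service accounts, which a cluster administrator normally is.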
Capstone: Cost & Cleanup
Before deploying your capstone to a cloud provider, estimate costs with Infracost. After completing the project and documenting it for your portfolio, tear down expensive resources to avoid ongoing charges while keeping documentation and code intact.
# Estimate monthly costs before deploying
infracost breakdown --path terraform/environments/dev
# Expected output for dev environment:
# ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
# ┃ Resource ┃ Monthly Cost ┃
# ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━┫
# ┃ aws_eks_cluster.main ┃ $73.00 ┃
# ┃ aws_eks_node_group.ondemand ┃ $61.32 ┃
# ┃ aws_eks_node_group.spot ┃ ~$18.40 ┃
# ┃ aws_db_instance.postgres ┃ $49.28 ┃
# ┃ aws_nat_gateway.main ┃ $32.40 ┃
# ┃ aws_s3_bucket.app ┃ ~$2.30 ┃
# ┃ Other (ALB, EBS, etc.) ┃ ~$35.00 ┃
# ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━┫
# ┃ TOTAL ┃ ~$271/month ┃
# ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━┛
# Cleanup: Destroy all resources when done
# IMPORTANT: Save screenshots and documentation first!
# Step 1: Delete Kubernetes resources first
kubectl delete namespace capstone
kubectl delete namespace monitoring
# Step 2: Destroy Terraform infrastructure
cd terraform/environments/dev
terraform destroy -auto-approve
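# Step 3: Clean up S3 state bucket (manual - has versioning)
# Note: on a versioned bucket, `aws s3 rm` only adds delete markers.
# delete-bucket fails until every object version and delete marker is gone,
# so empty the bucket fully first (for example with "Empty" in the S3 console).
aws s3 rm s3://capstone-terraform-state --recursive
aws s3api delete-bucket --bucket capstone-terraform-state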
# Step 4: Delete ECR images
aws ecr batch-delete-image \
--repository-name capstone \
--image-ids "$(aws ecr list-images --repository-name capstone \
--query 'imageIds[*]' --output json)"
What to Keep for Your Portfolio
| Keep | Destroy |
|---|---|
| All Terraform code on GitHub | Running EKS cluster ($73+/month) |
| Kubernetes manifests | RDS database ($49+/month) |
| GitHub Actions workflows | NAT Gateway ($32+/month) |
| Architecture diagrams | Load Balancer |
| Screenshots of Grafana dashboards | EC2 instances / node groups |
| Cost analysis documentation | EBS volumes |
| README with architecture decisions | S3 buckets with data |
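The monthly figures in the Destroy column only accrue while the resources exist. As a rough sketch, the cost of keeping the dev environment up for a few days can be prorated from the Infracost estimate. The function names are hypothetical, the line items are copied from the sample estimate shown earlier, and the 730-hour month is an assumed convention.

```shell
#!/usr/bin/env bash
# monthly_total
# Sums the line items from the sample Infracost output (USD per month).
monthly_total() {
  printf '%s\n' 73.00 61.32 18.40 49.28 32.40 2.30 35.00 |
    awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# prorated_cost MONTHLY DAYS
# Prorates a monthly cost over DAYS, assuming a 730-hour billing month.
prorated_cost() {
  local monthly="$1" days="$2"
  awk -v m="$monthly" -v d="$days" 'BEGIN { printf "%.2f\n", m * d * 24 / 730 }'
}
```

Running `prorated_cost "$(monthly_total)" 7` puts a one-week run at roughly $63. Treat this as an approximation: usage-based charges such as data transfer do not scale linearly with time.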
When you have finished documenting the project, run terraform destroy immediately. Set a billing alarm at $50 as a safety net. Never leave infrastructure running unattended.
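The $50 billing alarm mentioned above can be created from the CLI. The sketch below uses the `EstimatedCharges` metric in the `AWS/Billing` namespace, which is published in `us-east-1` and only after billing alerts are enabled on the account. The alarm name, the `AWS_CLI` override variable, and the SNS topic ARN argument are assumptions; substitute your own values.

```shell
#!/usr/bin/env bash
# create_billing_alarm THRESHOLD_USD SNS_TOPIC_ARN
# Creates a CloudWatch alarm that fires when estimated charges exceed
# THRESHOLD_USD and notifies the given SNS topic.
create_billing_alarm() {
  local threshold="$1" topic_arn="$2"
  "${AWS_CLI:-aws}" cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name "capstone-billing-${threshold}-usd" \
    --namespace AWS/Billing \
    --metric-name EstimatedCharges \
    --dimensions Name=Currency,Value=USD \
    --statistic Maximum \
    --period 21600 \
    --evaluation-periods 1 \
    --threshold "$threshold" \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
}
```

A call such as `create_billing_alarm 50 "$TOPIC_ARN"` assumes an existing SNS topic with an email subscription. The alarm is a safety net, not a spending cap: charges can still exceed the threshold before anyone reacts.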
Documenting Your Project for Employers
# Create a comprehensive project summary
cat <<'EOF' > docs/project-summary.md
# Capstone Project Summary
## What I Built
Production-grade multi-environment infrastructure on AWS demonstrating:
- Infrastructure as Code (Terraform modules, remote state, workspaces)
- Container orchestration (EKS with spot/on-demand mixed nodes)
- CI/CD automation (GitHub Actions with plan-on-PR, apply-on-merge)
- Full observability (Prometheus, Grafana, custom alerts, SLOs)
- Security hardening (RBAC, network policies, encryption, least privilege)
- Cost optimization (spot instances, lifecycle policies, Infracost)
## Key Decisions & Trade-offs
| Decision | Choice | Why |
|----------|--------|-----|
| Single vs multi NAT | Single (dev) | $64/mo savings, acceptable for non-prod |
| Spot vs on-demand nodes | Mixed fleet | 60% savings with graceful interruption handling |
| RDS vs self-managed database | RDS | Operational simplicity, automated backups/patches |
| Prometheus vs CloudWatch | Prometheus | Portability, community dashboards, cost |
## What I Would Change in Production
- Multi-NAT gateway for AZ resilience
- Dedicated VPN/Direct Connect for hybrid connectivity
- WAF in front of ALB
- Multi-region DR with Route 53 failover
- Dedicated monitoring account (cross-account metrics)
## Skills Demonstrated
Terraform, AWS (VPC, EKS, RDS, S3, IAM, KMS), Kubernetes,
GitHub Actions, Helm, Prometheus, Grafana, Network Security,
Cost Optimization, Documentation
EOF
Series Complete — Congratulations!
Let's reflect on what you've accomplished across this entire series:
- Parts 1–4 (Foundations): You understand how physical servers work, how networks route traffic, how Linux operates, and how to automate with scripts
- Parts 5–6 (Automation): You can configure fleets of servers with Ansible and containerize applications with Docker
- Parts 7–9 (Cloud & IaC): You provision cloud infrastructure declaratively with Terraform and automate delivery with CI/CD
- Parts 10–11 (Orchestration): You deploy and secure applications on Kubernetes with production-grade practices
- Parts 12–14 (Operations): You build observability stacks, implement GitOps, and design developer platforms
- Parts 15–20 (Advanced): You understand service meshes, multi-cloud, serverless, disaster recovery, cost optimization, and career development
mindmap
  root((Infrastructure Engineer))
    Foundations
      Hardware
      Networking
      Linux
      Scripting
    Automation
      Ansible
      Docker
      CI/CD
    Cloud
      AWS / Azure / GCP
      Terraform
      Serverless
    Orchestration
      Kubernetes
      Service Mesh
      GitOps
    Operations
      Monitoring
      Incident Response
      FinOps
    Security
      IAM / RBAC
      Network Policies
      Compliance
The Landscape is Always Evolving
Infrastructure engineering is a field that never stands still. New tools emerge, paradigms shift, and best practices evolve. Stay current by:
- Communities: CNCF Slack, DevOps subreddit, HashiCorp Discuss, Kubernetes Slack
- Conferences: KubeCon, HashiConf, re:Invent, DevOpsDays, SREcon
- Newsletters: TLDR DevOps, DevOps Weekly, KubeWeekly, Last Week in AWS
- Podcasts: Ship It!, Kubernetes Podcast, Software Engineering Daily
- Blogs: Kelsey Hightower, Charity Majors, Julia Evans, Corey Quinn
Final Thoughts
The infrastructure domain rewards curiosity, persistence, and a willingness to break things in order to understand them. Every outage you troubleshoot, every module you write, every pipeline you build adds to your expertise. The capstone project in this article is your launchpad — customize it, extend it, make it your own, and let it tell your story to future employers.
Thank you for joining this 20-part journey. Now go build something remarkable.