Introduction
Networking is the circulatory system of every infrastructure deployment. Without networks, your beautifully virtualized compute instances are isolated islands. Understanding networking — from the physical signal on a wire to the abstracted software-defined overlays in cloud environments — is essential for anyone working with modern infrastructure.
The Networking Landscape
Infrastructure networking has evolved through several paradigms:
- Physical networking (1980s–2000s): Manual switch configuration, static routes, hardware firewalls
- Virtualized networking (2000s–2010s): Virtual switches, VLANs, software firewalls on hypervisors
- Software-defined networking (2010s–present): Programmatic control, overlay networks, intent-based policies
- Cloud-native networking (2015–present): VPCs, service meshes, eBPF, zero-trust micro-segmentation
flowchart LR
A["Physical<br/>Manual Config"] --> B["Virtualized<br/>vSwitches/VLANs"]
B --> C["SDN<br/>Programmable"]
C --> D["Cloud-Native<br/>VPCs/Service Mesh"]
Network Fundamentals Review
Before diving into cloud networking, we must ground ourselves in the fundamentals. These concepts don't change regardless of whether you're configuring a physical Cisco switch or defining an AWS security group.
OSI Model — Infrastructure-Relevant Layers
The OSI (Open Systems Interconnection) model has 7 layers, but infrastructure engineers primarily work with four of them:
| Layer | Name | Key Protocols | Infrastructure Role | Devices |
|---|---|---|---|---|
| L2 | Data Link | Ethernet, ARP, STP | MAC addressing, switching, VLANs | Switches, bridges |
| L3 | Network | IP, ICMP, BGP, OSPF | Routing, subnetting, IP addressing | Routers, L3 switches |
| L4 | Transport | TCP, UDP | Port-based services, connection state | Firewalls, L4 load balancers |
| L7 | Application | HTTP, DNS, TLS, gRPC | Content routing, API gateways | L7 load balancers, WAFs |
flowchart TB
subgraph L7["Layer 7: Application"]
HTTP["HTTP/HTTPS"]
DNS_P["DNS"]
GRPC["gRPC"]
end
subgraph L4["Layer 4: Transport"]
TCP["TCP (reliable)"]
UDP["UDP (fast)"]
end
subgraph L3["Layer 3: Network"]
IP["IP Addressing"]
ROUTING["Routing (BGP/OSPF)"]
end
subgraph L2["Layer 2: Data Link"]
ETH["Ethernet Frames"]
VLAN["VLANs"]
ARP["ARP"]
end
L7 --> L4
L4 --> L3
L3 --> L2
IP Addressing & CIDR Notation
Every device on a network needs an IP address. IPv4 addresses are 32-bit numbers, typically written in dotted-decimal notation (e.g., 192.168.1.100). CIDR (Classless Inter-Domain Routing) notation expresses both the address and the subnet mask in a single format.
10.0.0.0/16 means the first 16 bits are the network portion (fixed), leaving 16 bits for host addresses. This gives 2^16 = 65,536 possible addresses in the subnet.
| CIDR | Subnet Mask | Usable Hosts | Common Use |
|---|---|---|---|
| /8 | 255.0.0.0 | 16,777,214 | Large enterprise (10.0.0.0/8) |
| /16 | 255.255.0.0 | 65,534 | VPC/VNet (10.0.0.0/16) |
| /20 | 255.255.240.0 | 4,094 | Large subnet |
| /24 | 255.255.255.0 | 254 | Standard subnet |
| /28 | 255.255.255.240 | 14 | Small subnet (firewall DMZ) |
| /32 | 255.255.255.255 | 1 | Single host route |
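The arithmetic behind the table can be checked with Python's standard ipaddress module. This is a minimal sketch; the describe helper is a name introduced here for illustration, and it follows the classic convention that the network and broadcast addresses are not assignable (cloud providers usually reserve a few more).

```python
import ipaddress


def describe(cidr: str) -> dict:
    """Summarize an IPv4 network given in CIDR notation."""
    net = ipaddress.ip_network(cidr, strict=True)
    total = net.num_addresses
    # Classic convention: network and broadcast addresses are not assignable.
    # /31 and /32 are special cases where every address counts as usable.
    usable = total - 2 if net.prefixlen <= 30 else total
    return {
        "netmask": str(net.netmask),
        "total": total,
        "usable": usable,
        "first": str(net.network_address),
        "last": str(net.broadcast_address),
    }


if __name__ == "__main__":
    for cidr in ("10.0.0.0/16", "10.0.0.0/24", "10.0.0.0/28"):
        print(cidr, describe(cidr))
```

For 10.0.0.0/16 this reports 65,536 total addresses and 65,534 usable ones, matching the /16 row above.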
# Calculate subnet information using ipcalc
ipcalc 10.0.0.0/16
# Output shows:
# Address: 10.0.0.0
# Netmask: 255.255.0.0 = 16
# Wildcard: 0.0.255.255
# Network: 10.0.0.0/16
# HostMin: 10.0.0.1
# HostMax: 10.0.255.254
# Broadcast: 10.0.255.255
# Hosts/Net: 65534
# Split a /16 into four /18 subnets
ipcalc 10.0.0.0/16 -s 16000 16000 16000 16000
# View your machine's IP configuration
ip addr show
# Show the routing table
ip route show
# Check which interface a packet to 8.8.8.8 would use
ip route get 8.8.8.8
# List all network interfaces with their status
ip link show
Routing Basics — How Packets Find Their Way
Routing is the process of forwarding packets from source to destination across potentially dozens of intermediate networks. Routers maintain routing tables that map destination networks to next-hop addresses.
flowchart LR
A["Source<br/>10.0.1.5"] --> R1["Router 1<br/>10.0.1.1"]
R1 --> R2["Router 2<br/>172.16.0.1"]
R2 --> R3["Router 3<br/>192.168.1.1"]
R3 --> B["Destination<br/>192.168.1.50"]
Static routes are manually configured and never change unless an admin updates them. Dynamic routing protocols like BGP (Border Gateway Protocol) and OSPF (Open Shortest Path First) allow routers to discover and advertise routes automatically.
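When several routing table entries match a destination, routers pick the most specific one (longest prefix match). The sketch below illustrates that lookup in plain Python; the ROUTES table and the next_hop helper are hypothetical and only loosely reuse the addresses from the diagram above.

```python
import ipaddress

# Hypothetical routing table: (destination network, next hop)
ROUTES = [
    ("0.0.0.0/0", "10.0.1.1"),          # default route
    ("192.168.0.0/16", "172.16.0.1"),
    ("192.168.1.0/24", "192.168.1.1"),
]


def next_hop(destination, routes=ROUTES):
    """Return the next hop of the most specific route matching destination."""
    addr = ipaddress.ip_address(destination)
    best = None
    for cidr, hop in routes:
        net = ipaddress.ip_network(cidr)
        # Keep the matching route with the longest prefix seen so far.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None
```

Here next_hop("192.168.1.50") selects the /24 route even though the /16 and the default route also match. Real routers use tries or hardware tables rather than a linear scan.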
# Add a static route: send traffic for 192.168.2.0/24 via gateway 10.0.1.1
sudo ip route add 192.168.2.0/24 via 10.0.1.1
# Delete a static route
sudo ip route del 192.168.2.0/24
# View routing table with all details
ip route show table all
# Trace the path packets take to reach a destination
traceroute 8.8.8.8
# Show BGP session state on a software router
sudo birdc show protocols all        # BIRD
sudo vtysh -c "show bgp summary"     # FRRouting
Switching & VLANs
Switches operate at Layer 2, forwarding Ethernet frames based on MAC addresses. VLANs (Virtual LANs) logically segment a physical switch into multiple broadcast domains, providing isolation without requiring separate physical infrastructure.
# Create a VLAN interface on Linux
sudo ip link add link eth0 name eth0.100 type vlan id 100
sudo ip addr add 10.0.100.1/24 dev eth0.100
sudo ip link set eth0.100 up
# View VLAN configuration
cat /proc/net/vlan/config
# Show bridge (virtual switch) information
bridge link show
# Create a Linux bridge (virtual switch)
sudo ip link add br0 type bridge
sudo ip link set eth0 master br0
sudo ip link set br0 up
Spanning Tree Protocol (STP) prevents broadcast loops in networks with redundant switch links. It designates one switch as the root bridge and blocks redundant paths, activating them only if the primary path fails.
Firewalls — Stateful vs Stateless
Firewalls control which traffic is permitted to flow between network segments. Understanding the difference between stateful and stateless filtering is critical for configuring cloud security correctly.
| Feature | Stateless (ACLs/NACLs) | Stateful (Security Groups) |
|---|---|---|
| Connection tracking | No — evaluates each packet independently | Yes — tracks connection state |
| Return traffic | Must explicitly allow both directions | Automatically allows return traffic |
| Rule evaluation | Processes rules in order (number priority) | Evaluates all rules, most permissive wins |
| Performance | Faster (no state table) | Slightly slower (maintains state) |
| Cloud example | AWS NACLs | AWS Security Groups, Azure NSGs (subnet- or NIC-level) |
| Use case | Broad subnet-level rules, DDoS mitigation | Instance-level micro-segmentation |
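The difference in how return traffic is handled is easier to see in a toy model. The following sketch is illustrative only: Packet, stateless_allows, and StatefulFirewall are names invented here, and real firewalls also track protocol, TCP state, and timeouts.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int


def stateless_allows(packet, allowed_dports):
    """Stateless filter: each packet is judged only against the rule list."""
    return packet.dport in allowed_dports


@dataclass
class StatefulFirewall:
    allowed_dports: set
    connections: set = field(default_factory=set)

    def outbound(self, packet):
        """Record an outgoing connection so its replies are recognized."""
        self.connections.add((packet.src, packet.sport, packet.dst, packet.dport))

    def inbound_allows(self, packet):
        """Allow replies to tracked connections, else fall back to the rules."""
        reverse = (packet.dst, packet.dport, packet.src, packet.sport)
        return reverse in self.connections or packet.dport in self.allowed_dports


request = Packet("10.0.1.10", 40000, "93.184.216.34", 443)
reply = Packet("93.184.216.34", 443, "10.0.1.10", 40000)

firewall = StatefulFirewall(allowed_dports={22})
firewall.outbound(request)
```

With only port 22 open inbound, the stateful model accepts the reply because it matches a tracked connection, while the stateless check rejects it unless the ephemeral port range is opened explicitly.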
# iptables: Stateful firewall rules on Linux
# Allow established connections (stateful tracking)
# (the conntrack match supersedes the older "-m state --state" form)
sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Allow SSH (port 22) from a specific subnet
sudo iptables -A INPUT -p tcp --dport 22 -s 10.0.1.0/24 -j ACCEPT
# Allow HTTP/HTTPS from anywhere
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# Drop everything else (default deny)
sudo iptables -A INPUT -j DROP
# List all rules with line numbers
sudo iptables -L -n --line-numbers
# nftables: Modern Linux firewall (successor to iptables)
sudo nft add table inet filter
sudo nft add chain inet filter input '{ type filter hook input priority 0; policy drop; }'
# Allow established connections
sudo nft add rule inet filter input ct state established,related accept
# Allow SSH from management subnet
sudo nft add rule inet filter input ip saddr 10.0.1.0/24 tcp dport 22 accept
# Allow HTTP/HTTPS
sudo nft add rule inet filter input tcp dport '{ 80, 443 }' accept
# List ruleset
sudo nft list ruleset
NAT (Network Address Translation)
NAT allows multiple devices on a private network to share a single public IP address. It's essential for cloud infrastructure because private subnets need NAT gateways to reach the internet for software updates and API calls without exposing instances directly.
flowchart LR
subgraph Private["Private Subnet (10.0.1.0/24)"]
VM1["VM: 10.0.1.10"]
VM2["VM: 10.0.1.11"]
VM3["VM: 10.0.1.12"]
end
subgraph Public["Public Subnet"]
NAT["NAT Gateway<br/>Public IP: 54.23.x.x"]
end
VM1 --> NAT
VM2 --> NAT
VM3 --> NAT
NAT --> Internet["Internet"]
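Conceptually, a NAT gateway keeps a translation table that maps each private (address, port) pair to a port on the shared public address. The sketch below is a toy model under stated assumptions: the SourceNat class is hypothetical, 203.0.113.10 is a documentation address standing in for the real public IP, and port allocation is a simple counter.

```python
class SourceNat:
    """Toy port-address translation table (illustrative only)."""

    def __init__(self, public_ip, first_port=20000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.out = {}   # (private_ip, private_port) -> public_port
        self.back = {}  # public_port -> (private_ip, private_port)

    def translate_out(self, private_ip, private_port):
        """Rewrite the source of an outgoing flow to the shared public IP."""
        key = (private_ip, private_port)
        if key not in self.out:
            # First packet of a new flow: allocate a public port for it.
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out[key]

    def translate_in(self, public_port):
        """Map a reply arriving on public_port back to the private host."""
        return self.back.get(public_port)


nat = SourceNat("203.0.113.10")  # hypothetical public address
```

Two VMs that both use source port 5000 receive different public ports, which is how replies find their way back to the right instance. Unsolicited inbound traffic has no table entry and is not forwarded.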
# Enable IP forwarding (required for NAT)
echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
# Configure SNAT (Source NAT) using iptables
# All traffic from 10.0.1.0/24 going out eth0 gets masqueraded
sudo iptables -t nat -A POSTROUTING -s 10.0.1.0/24 -o eth0 -j MASQUERADE
# Configure DNAT (Destination NAT) - port forwarding
# Forward port 8080 on public IP to internal server 10.0.1.10:80
sudo iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 10.0.1.10:80
# View NAT table rules
sudo iptables -t nat -L -n -v
Software-Defined Networking (SDN)
SDN is the paradigm shift that made cloud networking possible. Traditional networks couple the control plane (routing decisions) with the data plane (packet forwarding) in every switch and router. SDN separates them, centralizing control logic in software while switches become simple forwarding devices.
SDN Architecture
flowchart TB
subgraph App["Application Layer"]
FW["Firewall App"]
LB["Load Balancer App"]
MON["Monitoring App"]
end
subgraph Control["Control Plane (Centralized)"]
CTRL["SDN Controller<br/>(OpenDaylight, ONOS)"]
end
subgraph Data["Data Plane (Distributed)"]
SW1["Switch 1"]
SW2["Switch 2"]
SW3["Switch 3"]
SW4["Switch 4"]
end
App -->|"Northbound API (REST)"| Control
Control -->|"Southbound API (OpenFlow)"| Data
Why SDN matters for cloud:
- Programmatic network configuration via APIs (no manual switch CLI)
- Multi-tenant isolation without physical separation
- Network policies that follow workloads (not tied to physical ports)
- Fast provisioning: new networks in seconds, not days
Overlay Networks (VXLAN, GRE)
Overlay networks create virtual network topologies on top of existing physical infrastructure. They encapsulate original packets inside new headers, allowing virtual networks to span multiple physical networks transparently.
| Technology | Encapsulation | Max Networks | Use Case |
|---|---|---|---|
| VXLAN | UDP + VXLAN header (50 bytes) | 16 million (24-bit VNI) | Cloud multi-tenancy, Kubernetes CNI |
| GRE | IP + GRE header (24 bytes) | Unlimited (tunnel-based) | Site-to-site VPNs, legacy overlay |
| Geneve | UDP + Geneve header (variable) | 16 million | Extensible successor to VXLAN (used by AWS Gateway Load Balancer) |
| VLAN | 802.1Q tag (4 bytes) | 4,094 (12-bit ID) | Physical switch segmentation |
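To make the 24-bit VNI and the 50-byte figure concrete, the sketch below packs and parses the 8-byte VXLAN header as laid out in RFC 7348. The function names are introduced here for illustration, and the overhead calculation assumes an IPv4 underlay with an untagged outer Ethernet frame.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from an 8-byte VXLAN header."""
    _, second_word = struct.unpack("!II", header[:8])
    return second_word >> 8


# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8)
VXLAN_OVERHEAD = 14 + 20 + 8 + 8
```

Because of this overhead, the underlay MTU generally needs to be at least 50 bytes larger than the MTU presented to the overlay.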
# Create a VXLAN interface on Linux
# VNI 100, destination multicast group, bound to physical interface eth0
sudo ip link add vxlan100 type vxlan id 100 \
group 239.1.1.1 \
dev eth0 \
dstport 4789
sudo ip addr add 10.200.1.1/24 dev vxlan100
sudo ip link set vxlan100 up
# Verify VXLAN interface
ip -d link show vxlan100
# Create a GRE tunnel
sudo ip tunnel add gre1 mode gre remote 203.0.113.1 local 198.51.100.1 ttl 255
sudo ip addr add 10.10.10.1/30 dev gre1
sudo ip link set gre1 up
# Verify tunnel
ip tunnel show
How Cloud Providers Implement Virtual Networking
Every major cloud provider runs a massive SDN platform under the hood:
AWS VPC uses a custom SDN whose packet encapsulation is offloaded to the Nitro hardware. The VPC wire format itself is proprietary; Geneve is the documented encapsulation for Gateway Load Balancer. Each ENI (Elastic Network Interface) is a virtual port in their software switch. The "Blackfoot" edge devices handle NAT, internet gateways, and VPN termination.
Azure VNet runs on a host-based SDN stack whose data path can be offloaded to FPGA-based SmartNICs (the Accelerated Networking feature). The SmartNICs handle encapsulation at line rate, and the Azure Network Controller manages millions of virtual networks.
GCP VPC uses their Andromeda SDN stack, which implements virtual networking in software on every host. Google's global backbone (B4 network) connects their data centers using centralized traffic engineering.
Load Balancing
Load balancers distribute incoming traffic across multiple backend servers to improve availability, reliability, and performance. They're one of the most critical components in any production infrastructure.
Layer 4 vs Layer 7 Load Balancing
flowchart TB
subgraph L4LB["Layer 4 Load Balancer"]
direction TB
L4["Sees: IP + Port<br/>Decides by: TCP connection"]
end
subgraph L7LB["Layer 7 Load Balancer"]
direction TB
L7["Sees: Full HTTP request<br/>Decides by: URL, headers, cookies"]
end
Client["Client Request"] --> L4LB
Client --> L7LB
L4LB --> S1["Server 1"]
L4LB --> S2["Server 2"]
L7LB -->|"/api/*"| API["API Servers"]
L7LB -->|"/static/*"| CDN["CDN/Cache"]
L7LB -->|"/ws/*"| WS["WebSocket Servers"]
| Feature | Layer 4 (TCP/UDP) | Layer 7 (HTTP/HTTPS) |
|---|---|---|
| Inspects | IP address + port number | Full HTTP request (URL, headers, body) |
| Speed | Very fast (no content parsing) | Slower (must parse HTTP) |
| TLS termination | Usually no (pass-through), though some offer it (e.g., AWS NLB TLS listeners) | Yes (offloads TLS from backends) |
| Content routing | No | Yes (path, host, header-based) |
| Sticky sessions | Source IP hash only | Cookie-based affinity |
| WebSocket support | Yes (transparent) | Yes (with upgrade handling) |
| Use case | Database clusters, TCP services, gaming | Web apps, APIs, microservices |
| Cloud examples | AWS NLB, Azure LB, GCP TCP/UDP LB | AWS ALB, Azure App GW, GCP HTTP(S) LB |
Load Balancing Algorithms
The algorithm determines which backend server receives each incoming request:
- Round Robin: Requests distributed sequentially 1 → 2 → 3 → 1 → 2... Simple but doesn't account for server load
- Weighted Round Robin: Servers with higher weight get proportionally more requests. Use when backends have different capacities
- Least Connections: Routes to the server with fewest active connections. Best for long-lived connections (WebSocket, database)
- IP Hash: Client IP determines the backend. Provides session affinity without cookies
- Random: Statistically equivalent to round-robin at scale, with no state tracking needed
- Least Response Time: Combines fewest connections with fastest response. Most intelligent but requires active measurement
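Several of these algorithms fit in a few lines each. The following is a minimal sketch, not how any particular load balancer implements them: the function names and the SERVERS list are assumptions for illustration, and the weighted variant is the naive "repeat by weight" form rather than the smoother interleaving that production balancers use.

```python
import hashlib
import itertools

SERVERS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]  # hypothetical backends


def round_robin(servers):
    """Return an iterator that yields servers in a repeating sequence."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Naive weighting: repeat each server 'weight' times per cycle."""
    expanded = [server for server, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map a client IP to a stable backend using a hash of the address."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]
```

Note that ip_hash remaps most clients whenever the server list changes; consistent hashing is the usual remedy when that matters.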
Health Checks & Failover
Load balancers continuously verify backend health to route traffic only to healthy instances:
# NGINX upstream with health checks
upstream backend_servers {
# Least connections algorithm
least_conn;
server 10.0.1.10:8080 weight=3 max_fails=3 fail_timeout=30s;
server 10.0.1.11:8080 weight=2 max_fails=3 fail_timeout=30s;
server 10.0.1.12:8080 weight=1 max_fails=3 fail_timeout=30s;
# Passive health checking: mark unhealthy after 3 failures
# Active health checking (NGINX Plus only):
# health_check interval=5s fails=3 passes=2;
}
server {
listen 80;
location / {
proxy_pass http://backend_servers;
proxy_next_upstream error timeout http_500 http_502 http_503;
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
}
}
# HAProxy configuration with active health checks
# /etc/haproxy/haproxy.cfg
cat <<'EOF'
global
daemon
maxconn 4096
defaults
mode http
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
option httpchk GET /health
frontend web_frontend
bind *:80
default_backend web_servers
backend web_servers
balance leastconn
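# HAProxy 2.2+ syntax; older releases put the whole request on the "option httpchk" line
option httpchk
http-check send meth GET uri /health ver HTTP/1.1 hdr Host localhost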
server web1 10.0.1.10:8080 check inter 5s fall 3 rise 2 weight 3
server web2 10.0.1.11:8080 check inter 5s fall 3 rise 2 weight 2
server web3 10.0.1.12:8080 check inter 5s fall 3 rise 2 weight 1
listen stats
bind *:8404
stats enable
stats uri /stats
EOF
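The "fall 3 rise 2" parameters in the HAProxy configuration above describe a small state machine: a backend is marked down after three consecutive failed checks and marked up again after two consecutive successes. A sketch of that logic, assuming a hypothetical HealthTracker class and ignoring check intervals and timeouts:

```python
class HealthTracker:
    """Track one backend's state from a stream of check results."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall
        self.rise = rise
        self.healthy = True
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        """Feed one check result and return the (possibly updated) state."""
        if check_passed == self.healthy:
            # Result agrees with the current state: reset the counter.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self.streak >= threshold:
                self.healthy = not self.healthy
                self.streak = 0
        return self.healthy
```

Requiring consecutive results in both directions keeps a single slow response from ejecting a backend and keeps a flapping backend from rejoining too early.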
Cloud Load Balancers
AWS Application Load Balancer (ALB): Layer 7, supports path/host-based routing, WebSocket, gRPC, native WAF integration. Ideal for microservices with multiple target groups.
AWS Network Load Balancer (NLB): Layer 4, handles millions of requests/sec with ultra-low latency. Supports static IPs and preserves source IP. Use for TCP/UDP services, gaming, IoT.
Azure Application Gateway: Layer 7 with built-in WAF, URL-based routing, cookie affinity. Supports autoscaling and zone redundancy.
Azure Load Balancer: Layer 4, supports both public and internal load balancing. HA Ports feature for NVAs (Network Virtual Appliances).
GCP Global HTTP(S) Load Balancer: Unique global anycast architecture — single IP address routes to nearest healthy backend worldwide. Integrated with Cloud CDN and Cloud Armor WAF.
# AWS: Create an Application Load Balancer
aws elbv2 create-load-balancer \
--name my-web-alb \
--subnets subnet-0123456789abcdef0 subnet-0fedcba9876543210 \
--security-groups sg-0123456789abcdef0 \
--scheme internet-facing \
--type application
# Create a target group with health check
aws elbv2 create-target-group \
--name my-web-targets \
--protocol HTTP \
--port 80 \
--vpc-id vpc-0123456789abcdef0 \
--health-check-protocol HTTP \
--health-check-path /health \
--health-check-interval-seconds 10 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3
# Register targets
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-web-targets/abc123 \
--targets Id=i-0123456789abcdef0 Id=i-0fedcba9876543210
DNS & Service Discovery
DNS (Domain Name System) is the internet's phone book — translating human-readable names into IP addresses. In cloud infrastructure, DNS also serves as a service discovery mechanism, enabling microservices to find each other dynamically.
How DNS Works
sequenceDiagram
participant Client
participant Resolver as Recursive Resolver<br/>(ISP/8.8.8.8)
participant Root as Root NS<br/>(.)
participant TLD as TLD NS<br/>(.com)
participant Auth as Authoritative NS<br/>(example.com)
Client->>Resolver: Query: api.example.com?
Resolver->>Root: Where is .com?
Root-->>Resolver: Ask TLD at x.gtld-servers.net
Resolver->>TLD: Where is example.com?
TLD-->>Resolver: Ask NS at ns1.example.com
Resolver->>Auth: What is api.example.com?
Auth-->>Resolver: A 10.0.1.50 (TTL 300s)
Resolver-->>Client: A 10.0.1.50 (cached)
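The last step of the diagram, where the resolver answers from cache, depends on the record's TTL. The sketch below models just that caching behavior; CachingResolver and its upstream callable are hypothetical stand-ins for a real recursive resolver, and the clock is injectable so the expiry can be exercised without waiting.

```python
import time


class CachingResolver:
    """Cache answers from an upstream lookup until their TTL expires."""

    def __init__(self, upstream, clock=time.monotonic):
        self.upstream = upstream  # callable: name -> (address, ttl_seconds)
        self.clock = clock
        self.cache = {}           # name -> (address, expires_at)

    def resolve(self, name):
        now = self.clock()
        hit = self.cache.get(name)
        if hit and hit[1] > now:
            return hit[0]         # still fresh: answer from cache
        # Missing or expired: ask upstream and remember the answer.
        address, ttl = self.upstream(name)
        self.cache[name] = (address, now + ttl)
        return address
```

With a TTL of 300 seconds, a change to the record can take up to five minutes to reach clients of this resolver, which is why TTLs are often lowered ahead of a planned migration.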
# Query DNS for a domain (shows A record)
dig api.example.com
# Query specific record type
dig api.example.com AAAA # IPv6 address
dig example.com MX # Mail servers
dig example.com TXT # TXT records (SPF, DKIM, verification)
dig example.com NS # Name servers
# Trace the full resolution path
dig +trace api.example.com
# Query a specific DNS server
dig @8.8.8.8 api.example.com
# Short answer only
dig +short api.example.com
# Show TTL remaining
dig +ttlid api.example.com
DNS Record Types
| Record | Purpose | Example Value | Infrastructure Use |
|---|---|---|---|
| A | IPv4 address | 93.184.216.34 | Map domain to server IP |
| AAAA | IPv6 address | 2606:2800:220:1::248 | IPv6-enabled services |
| CNAME | Canonical name (alias) | api.example.com → lb.aws.com | Point to load balancer DNS name |
| MX | Mail exchange | 10 mail.example.com | Email routing |
| TXT | Text data | "v=spf1 include:..." | SPF, DKIM, domain verification |
| SRV | Service location | _http._tcp.example.com → 10 5 80 web1.example.com | Service discovery (priority, weight, port, host) |
| NS | Name server | ns1.example.com | Delegation to authoritative DNS |
| PTR | Reverse lookup | 34.216.184.93.in-addr.arpa → example.com | Email validation, debugging |
Cloud DNS Services
# AWS Route 53: Create a hosted zone and records
aws route53 create-hosted-zone \
--name example.com \
--caller-reference "$(date +%s)"
# Create an A record pointing to an ALB (alias record)
aws route53 change-resource-record-sets \
--hosted-zone-id Z0123456789ABCDEF \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "my-alb-123456.us-east-1.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
}]
}'
# Create a weighted routing policy (blue/green)
aws route53 change-resource-record-sets \
--hosted-zone-id Z0123456789ABCDEF \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "blue",
"Weight": 90,
"TTL": 60,
"ResourceRecords": [{"Value": "10.0.1.10"}]
}
}, {
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "green",
"Weight": 10,
"TTL": 60,
"ResourceRecords": [{"Value": "10.0.2.10"}]
}
}]
}'
# Azure DNS: Create a zone and records
az network dns zone create \
--resource-group myResourceGroup \
--name example.com
# Add an A record
az network dns record-set a add-record \
--resource-group myResourceGroup \
--zone-name example.com \
--record-set-name api \
--ipv4-address 10.0.1.10
# Add a CNAME record
az network dns record-set cname set-record \
--resource-group myResourceGroup \
--zone-name example.com \
--record-set-name www \
--cname myapp.azurewebsites.net
Service Discovery Patterns
In dynamic cloud environments where instances scale up and down constantly, services need to discover each other automatically. There are several patterns:
flowchart TB
subgraph DNS_SD["DNS-Based Discovery"]
SVC1["Service A"] -->|"dig service-b.internal"| DNS_INT["Internal DNS<br/>(Route 53 Private Zone)"]
DNS_INT -->|"10.0.1.x"| SVC2["Service B"]
end
subgraph REG_SD["Registry-Based Discovery"]
SVC3["Service C"] -->|"lookup(service-d)"| REG["Service Registry<br/>(Consul/etcd/ZooKeeper)"]
REG -->|"10.0.2.x:8080"| SVC4["Service D"]
end
subgraph MESH_SD["Service Mesh Discovery"]
SVC5["Service E"] -->|"localhost:port"| PROXY1["Sidecar Proxy<br/>(Envoy)"]
PROXY1 -->|"mTLS"| PROXY2["Sidecar Proxy<br/>(Envoy)"]
PROXY2 --> SVC6["Service F"]
end
- DNS-based discovery: Simplest approach. Services register DNS records; clients resolve names. Works with any language/framework. Limitation: DNS caching can serve stale results
- Registry-based discovery: Services register themselves with a central registry (Consul, etcd, ZooKeeper). Clients query the registry for endpoints. Provides health metadata and real-time updates
- Service mesh: Sidecar proxies (Envoy, Linkerd) handle discovery transparently. Applications talk to localhost; the mesh routes to the correct destination with mTLS encryption, retries, and observability
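The registry-based pattern can be sketched as an in-memory table of leases that instances must keep renewing. This is a simplified illustration, not the Consul, etcd, or ZooKeeper API: the ServiceRegistry class and its method names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class ServiceRegistry:
    """In-memory registry where instances expire unless they heartbeat."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self.instances = {}  # (service, endpoint) -> last heartbeat timestamp

    def heartbeat(self, service, endpoint, now):
        """Register an instance or refresh its lease."""
        self.instances[(service, endpoint)] = now

    def lookup(self, service, now):
        """Return the endpoints whose lease has not expired."""
        return sorted(
            endpoint
            for (name, endpoint), seen in self.instances.items()
            if name == service and now - seen <= self.ttl
        )


registry = ServiceRegistry(ttl=30.0)
registry.heartbeat("web", "10.0.2.5:8080", now=0.0)
registry.heartbeat("web", "10.0.2.6:8080", now=20.0)
```

An instance that crashes simply stops sending heartbeats and drops out of lookup results once its lease runs out, which is the property that DNS caching makes hard to guarantee.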
# Kubernetes DNS-based service discovery
# When you create a Service, Kubernetes creates a DNS record:
# my-service.my-namespace.svc.cluster.local
# From inside a pod, resolve service DNS
nslookup my-service.my-namespace.svc.cluster.local
# Headless service (returns pod IPs instead of ClusterIP)
nslookup my-headless-service.my-namespace.svc.cluster.local
# SRV records for port discovery
dig _http._tcp.my-service.my-namespace.svc.cluster.local SRV
# Consul: Register a service and query for it
# Register service via HTTP API
curl -X PUT http://localhost:8500/v1/agent/service/register \
-H "Content-Type: application/json" \
-d '{
"ID": "web-1",
"Name": "web",
"Port": 8080,
"Tags": ["production", "v2"],
"Check": {
"HTTP": "http://localhost:8080/health",
"Interval": "10s",
"Timeout": "5s"
}
}'
# Query for healthy instances of a service
# (URL quoted so the shell does not treat "?" as a glob character)
curl 'http://localhost:8500/v1/health/service/web?passing=true'
# DNS interface: Consul also exposes services via DNS
dig @127.0.0.1 -p 8600 web.service.consul SRV
Cloud Networking Patterns
Cloud providers abstract physical networking into programmable, API-driven constructs. Understanding these patterns is essential for designing secure, scalable infrastructure.
VPC/VNet Design
A well-designed VPC separates concerns using subnets, route tables, and gateways:
flowchart TB
IGW["Internet Gateway"] --- PUB
subgraph VPC["VPC: 10.0.0.0/16"]
subgraph PUB["Public Subnets"]
PUB_A["10.0.1.0/24<br/>AZ-a<br/>ALB, NAT GW"]
PUB_B["10.0.2.0/24<br/>AZ-b<br/>ALB, NAT GW"]
end
subgraph PRIV["Private Subnets (App)"]
PRIV_A["10.0.10.0/24<br/>AZ-a<br/>App Servers"]
PRIV_B["10.0.11.0/24<br/>AZ-b<br/>App Servers"]
end
subgraph DATA["Private Subnets (Data)"]
DATA_A["10.0.20.0/24<br/>AZ-a<br/>RDS, ElastiCache"]
DATA_B["10.0.21.0/24<br/>AZ-b<br/>RDS, ElastiCache"]
end
end
PUB_A --> PRIV_A
PUB_B --> PRIV_B
PRIV_A --> DATA_A
PRIV_B --> DATA_B
PRIV_A -->|"NAT GW"| IGW
PRIV_B -->|"NAT GW"| IGW
# AWS: Create a VPC with public and private subnets
# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16 --tag-specifications \
'ResourceType=vpc,Tags=[{Key=Name,Value=production-vpc}]'
# Create public subnet in AZ-a
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
--cidr-block 10.0.1.0/24 \
--availability-zone us-east-1a \
--tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=public-a}]'
# Create private subnet in AZ-a
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
--cidr-block 10.0.10.0/24 \
--availability-zone us-east-1a \
--tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=private-app-a}]'
# Create Internet Gateway and attach to VPC
aws ec2 create-internet-gateway --tag-specifications \
'ResourceType=internet-gateway,Tags=[{Key=Name,Value=prod-igw}]'
aws ec2 attach-internet-gateway --internet-gateway-id igw-0123456789 --vpc-id vpc-0123456789abcdef0
# Create NAT Gateway (requires an Elastic IP)
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway --subnet-id subnet-public-a \
--allocation-id eipalloc-0123456789 \
--tag-specifications 'ResourceType=natgateway,Tags=[{Key=Name,Value=nat-a}]'
# Terraform: Define a production VPC
resource "aws_vpc" "production" {
cidr_block = "10.0.0.0/16"
enable_dns_support = true
enable_dns_hostnames = true
tags = {
Name = "production-vpc"
Environment = "production"
}
}
# Availability zones referenced by the subnet resources below
data "aws_availability_zones" "available" {
state = "available"
}
resource "aws_subnet" "public" {
count = 2
vpc_id = aws_vpc.production.id
cidr_block = cidrsubnet(aws_vpc.production.cidr_block, 8, count.index + 1)
availability_zone = data.aws_availability_zones.available.names[count.index]
map_public_ip_on_launch = true
tags = {
Name = "public-${count.index + 1}"
Tier = "public"
}
}
resource "aws_subnet" "private_app" {
count = 2
vpc_id = aws_vpc.production.id
cidr_block = cidrsubnet(aws_vpc.production.cidr_block, 8, count.index + 10)
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "private-app-${count.index + 1}"
Tier = "private"
}
}
resource "aws_subnet" "private_data" {
count = 2
vpc_id = aws_vpc.production.id
cidr_block = cidrsubnet(aws_vpc.production.cidr_block, 8, count.index + 20)
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "private-data-${count.index + 1}"
Tier = "data"
}
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.production.id
tags = { Name = "production-igw" }
}
resource "aws_nat_gateway" "main" {
count = 2
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = { Name = "nat-gw-${count.index + 1}" }
}
resource "aws_eip" "nat" {
count = 2
domain = "vpc"
}
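The cidrsubnet(prefix, newbits, netnum) calls in the Terraform above carve numbered subnets out of the VPC range. The arithmetic can be mirrored in Python as a rough check; this reimplementation is a sketch written for illustration, not Terraform's own code.

```python
import ipaddress


def cidrsubnet(prefix, newbits, netnum):
    """Return subnet number netnum after extending prefix by newbits bits."""
    base = ipaddress.ip_network(prefix)
    new_prefixlen = base.prefixlen + newbits
    if new_prefixlen > base.max_prefixlen or not 0 <= netnum < 2**newbits:
        raise ValueError("subnet does not fit in the base prefix")
    # Each new subnet spans this many addresses.
    size = 2 ** (base.max_prefixlen - new_prefixlen)
    address = base.network_address + netnum * size
    return str(ipaddress.ip_network((address, new_prefixlen)))


if __name__ == "__main__":
    for tier, offset in (("public", 1), ("private_app", 10), ("private_data", 20)):
        print(tier, [cidrsubnet("10.0.0.0/16", 8, i + offset) for i in range(2)])
```

With newbits set to 8, the /16 becomes 256 possible /24 subnets, and the offsets 1, 10, and 20 reproduce the 10.0.1.0/24, 10.0.10.0/24, and 10.0.20.0/24 layout from the diagram.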
Network Security Groups & NACLs
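Cloud networks provide two layers of firewall protection: security groups, which are stateful and attach to instances or network interfaces, and network ACLs (NACLs), which are stateless and apply to whole subnets: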
# Terraform: Security group for web servers (stateful)
resource "aws_security_group" "web" {
name = "web-servers"
description = "Security group for web servers"
vpc_id = aws_vpc.production.id
# Inbound: Allow HTTP/HTTPS from ALB only
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
description = "HTTP from ALB"
}
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
description = "HTTPS from ALB"
}
# Outbound: Allow all (for package updates, API calls)
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound"
}
tags = { Name = "web-servers-sg" }
}
# NACL for private subnets (stateless - need both directions)
resource "aws_network_acl" "private" {
vpc_id = aws_vpc.production.id
subnet_ids = aws_subnet.private_app[*].id
# Allow inbound from VPC CIDR
ingress {
protocol = "-1"
rule_no = 100
action = "allow"
cidr_block = "10.0.0.0/16"
from_port = 0
to_port = 0
}
# Allow return traffic from internet (ephemeral ports)
ingress {
protocol = "tcp"
rule_no = 200
action = "allow"
cidr_block = "0.0.0.0/0"
from_port = 1024
to_port = 65535
}
# Allow all outbound
egress {
protocol = "-1"
rule_no = 100
action = "allow"
cidr_block = "0.0.0.0/0"
from_port = 0
to_port = 0
}
tags = { Name = "private-nacl" }
}
VPC Peering & Transit Gateways
As organizations grow, they need to connect multiple VPCs (different environments, teams, or regions). Transit Gateways act as a hub connecting many VPCs through a single attachment point, avoiding the O(n²) mesh of VPC peering.
flowchart TB
subgraph Peering["VPC Peering (Full Mesh)"]
P1["VPC A"] <--> P2["VPC B"]
P2 <--> P3["VPC C"]
P1 <--> P3
P1 <--> P4["VPC D"]
P2 <--> P4
P3 <--> P4
end
subgraph TGW["Transit Gateway (Hub-Spoke)"]
T1["VPC A"] --> HUB["Transit<br/>Gateway"]
T2["VPC B"] --> HUB
T3["VPC C"] --> HUB
T4["VPC D"] --> HUB
VPN_T["VPN"] --> HUB
end
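The scaling difference between the two topologies is simple arithmetic: a full mesh needs one peering connection per pair of VPCs, while a hub needs one attachment per VPC. A small sketch, with function names chosen here for illustration:

```python
def peering_connections(vpcs: int) -> int:
    """Full mesh: one peering connection per unordered pair of VPCs."""
    return vpcs * (vpcs - 1) // 2


def transit_gateway_attachments(vpcs: int) -> int:
    """Hub-and-spoke: one attachment per VPC."""
    return vpcs


if __name__ == "__main__":
    for n in (4, 10, 50):
        print(n, peering_connections(n), transit_gateway_attachments(n))
```

Four VPCs need 6 peering connections, matching the mesh drawn above, while 50 VPCs would need 1,225 peerings against 50 Transit Gateway attachments.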
# AWS: Create a Transit Gateway
aws ec2 create-transit-gateway \
--description "Production Transit Gateway" \
--options "AmazonSideAsn=64512,AutoAcceptSharedAttachments=enable,DefaultRouteTableAssociation=enable,DefaultRouteTablePropagation=enable,DnsSupport=enable"
# Attach VPCs to the Transit Gateway
aws ec2 create-transit-gateway-vpc-attachment \
--transit-gateway-id tgw-0123456789abcdef0 \
--vpc-id vpc-production \
--subnet-ids subnet-0123456789abcdef0 subnet-0fedcba9876543210
# Add route in VPC route table pointing to Transit Gateway
aws ec2 create-route \
--route-table-id rtb-0123456789abcdef0 \
--destination-cidr-block 10.1.0.0/16 \
--transit-gateway-id tgw-0123456789abcdef0
VPN & Direct Connect / ExpressRoute
Hybrid connectivity bridges on-premises data centers with cloud environments:
| Feature | Site-to-Site VPN | Direct Connect / ExpressRoute |
|---|---|---|
| Connection type | Encrypted tunnel over internet | Dedicated physical circuit |
| Bandwidth | Up to ~1.25 Gbps per tunnel | 1 Gbps to 100 Gbps |
| Latency | Variable (internet-dependent) | Consistent, low latency |
| Setup time | Minutes | Weeks to months |
| Cost | Low (data transfer fees) | High (port fees + cross-connect) |
| Encryption | Yes (IPsec) | Not by default (add MACsec or VPN overlay) |
| Redundancy | Multiple tunnels across AZs | Dual circuits to different locations |
| Best for | Dev/test, backup path, quick setup | Production workloads, large data transfer |
# AWS: Create a Site-to-Site VPN connection
# Step 1: Create a Virtual Private Gateway
aws ec2 create-vpn-gateway --type ipsec.1 --amazon-side-asn 64512
aws ec2 attach-vpn-gateway --vpn-gateway-id vgw-0123456789 --vpc-id vpc-0123456789abcdef0
# Step 2: Create a Customer Gateway (your on-prem device)
aws ec2 create-customer-gateway \
--type ipsec.1 \
--public-ip 203.0.113.1 \
--bgp-asn 65000
# Step 3: Create the VPN connection
aws ec2 create-vpn-connection \
--type ipsec.1 \
--vpn-gateway-id vgw-0123456789 \
--customer-gateway-id cgw-0123456789 \
--options '{"StaticRoutesOnly": false}'
# Step 4: Enable route propagation in VPC route table
aws ec2 enable-vgw-route-propagation \
--route-table-id rtb-0123456789 \
--gateway-id vgw-0123456789
Hands-On Exercises
Objective: Practice CIDR notation and subnet planning for a production VPC.
Scenario: You need to design a VPC with the following requirements:
- VPC CIDR: 10.0.0.0/16
- 3 Availability Zones
- Each AZ needs: 1 public subnet (small), 1 private-app subnet (medium), 1 private-data subnet (small)
- Room for future expansion
Tasks:
- Calculate the CIDR blocks for 9 subnets that don't overlap
- Verify no addresses are wasted (use ipcalc)
- Document which subnets get a route to the Internet Gateway vs NAT Gateway
# Exercise 1 Solution: Subnet planning
# VPC: 10.0.0.0/16 (65,536 addresses)
# Strategy: Use /20 for private-app (4094 hosts), /24 for public and data (254 hosts)
# Note: AWS reserves 5 addresses in every subnet, so the usable counts there are 4091 and 251
# Public subnets (small - for ALB + NAT GW)
echo "Public AZ-a: 10.0.1.0/24 (254 hosts)"
echo "Public AZ-b: 10.0.2.0/24 (254 hosts)"
echo "Public AZ-c: 10.0.3.0/24 (254 hosts)"
# Private app subnets (larger - for EC2/ECS workloads)
echo "Private-App AZ-a: 10.0.16.0/20 (4094 hosts)"
echo "Private-App AZ-b: 10.0.32.0/20 (4094 hosts)"
echo "Private-App AZ-c: 10.0.48.0/20 (4094 hosts)"
# Private data subnets (small - for RDS/ElastiCache)
echo "Private-Data AZ-a: 10.0.64.0/24 (254 hosts)"
echo "Private-Data AZ-b: 10.0.65.0/24 (254 hosts)"
echo "Private-Data AZ-c: 10.0.66.0/24 (254 hosts)"
# Verify with ipcalc
ipcalc 10.0.16.0/20
ipcalc 10.0.32.0/20
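The plan can also be checked programmatically instead of by eye. The sketch below uses Python's ipaddress module to confirm that every subnet sits inside the VPC range and that no two overlap; the PLAN dictionary restates the solution above, and check_plan is a helper name introduced for this example.

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")
PLAN = {
    "public-a": "10.0.1.0/24",
    "public-b": "10.0.2.0/24",
    "public-c": "10.0.3.0/24",
    "private-app-a": "10.0.16.0/20",
    "private-app-b": "10.0.32.0/20",
    "private-app-c": "10.0.48.0/20",
    "private-data-a": "10.0.64.0/24",
    "private-data-b": "10.0.65.0/24",
    "private-data-c": "10.0.66.0/24",
}


def check_plan(vpc, plan):
    """Return a list of problems: subnets outside the VPC or overlapping pairs."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    problems = [f"{name} outside VPC" for name, net in nets.items() if not net.subnet_of(vpc)]
    problems += [
        f"{a} overlaps {b}"
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]
    return problems


def allocated_addresses(plan):
    """Total number of addresses covered by the planned subnets."""
    return sum(ipaddress.ip_network(cidr).num_addresses for cidr in plan.values())
```

An empty result from check_plan(VPC, PLAN) means the plan is consistent; comparing allocated_addresses(PLAN) with VPC.num_addresses shows how much of the /16 remains for future expansion.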
Objective: Configure NGINX as a Layer 7 reverse proxy with health checks and multiple backend pools.
Tasks:
- Set up NGINX with two upstream groups: api_servers and web_servers
- Route /api/* requests to api_servers using the least_conn algorithm
- Route all other traffic to web_servers using round-robin
- Configure passive health checks (3 failures = unhealthy, 30s timeout)
- Add request headers: X-Real-IP, X-Forwarded-For, X-Request-ID
# Exercise 2 Solution: NGINX reverse proxy config
cat <<'EOF' > /etc/nginx/conf.d/loadbalancer.conf
upstream api_servers {
least_conn;
server 10.0.10.1:8080 max_fails=3 fail_timeout=30s;
server 10.0.10.2:8080 max_fails=3 fail_timeout=30s;
server 10.0.10.3:8080 max_fails=3 fail_timeout=30s;
}
upstream web_servers {
server 10.0.11.1:3000 max_fails=3 fail_timeout=30s;
server 10.0.11.2:3000 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name example.com;
# API traffic -> api_servers
location /api/ {
proxy_pass http://api_servers;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Request-ID $request_id;
proxy_next_upstream error timeout http_500 http_502 http_503;
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
}
# All other traffic -> web_servers
location / {
proxy_pass http://web_servers;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Request-ID $request_id;
}
}
EOF
# Test configuration
nginx -t
# Reload without downtime
nginx -s reload
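To make the behavior of least_conn together with max_fails=3 fail_timeout=30s easier to reason about, here is a simplified Python model. It is a sketch for illustration only, not NGINX's implementation: it ignores server weights, breaks ties by list order, and takes the clock as an explicit argument. The Backend class and pick_least_conn function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    address: str
    max_fails: int = 3
    fail_timeout: float = 30.0
    active: int = 0                                # in-flight connections
    failures: list = field(default_factory=list)   # timestamps of recent failures
    down_until: float = 0.0                        # backend is skipped until this time

    def available(self, now):
        return now >= self.down_until

    def record_failure(self, now):
        # Only failures inside the fail_timeout window count toward max_fails.
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            # Passive health check: take the backend out for fail_timeout seconds.
            self.down_until = now + self.fail_timeout
            self.failures.clear()


def pick_least_conn(backends, now):
    """Pick the available backend with the fewest in-flight connections, or None."""
    candidates = [b for b in backends if b.available(now)]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.active)


if __name__ == "__main__":
    pool = [Backend("10.0.10.1:8080"), Backend("10.0.10.2:8080"), Backend("10.0.10.3:8080")]
    pool[0].active, pool[1].active, pool[2].active = 2, 1, 3
    print("chosen:", pick_least_conn(pool, now=0.0).address)
    for t in (1.0, 2.0, 3.0):
        pool[1].record_failure(t)          # three failures within 30s
    print("chosen after failures:", pick_least_conn(pool, now=10.0).address)
```

Because the checks are passive, a backend is only marked unavailable after real client requests have failed against it; proxy_next_upstream in the config above is what retries those requests on another server.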
Exercise 3: DNS Troubleshooting
Objective: Debug DNS resolution issues using command-line tools.
Scenario: Your application reports "could not resolve host: api.internal.example.com". Debug the issue systematically.
Tasks:
- Check local DNS resolver configuration (/etc/resolv.conf)
- Query the configured DNS server directly
- Trace the full resolution path
- Check if the record exists on the authoritative nameserver
- Verify TTL and caching behavior
# Exercise 3 Solution: DNS troubleshooting workflow
# Step 1: Check local resolver config
cat /etc/resolv.conf
# Look for: nameserver, search domain, options
# Step 2: Query the configured DNS server
# (awk matches only real nameserver lines, not commented-out ones)
dig @"$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)" api.internal.example.com
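# Step 3: Try different DNS servers
# (public resolvers cannot see private zones; NXDOMAIN from them is expected for an internal-only name)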
dig @8.8.8.8 api.internal.example.com # Google Public DNS
dig @1.1.1.1 api.internal.example.com # Cloudflare DNS
dig @169.254.169.253 api.internal.example.com # AWS VPC DNS resolver
# Step 4: Trace full resolution path
dig +trace api.internal.example.com
# Step 5: Find authoritative nameserver and query it directly
dig NS example.com
dig @ns1.example.com api.internal.example.com
# Step 6: Check if it's a caching issue (compare TTLs)
dig +norecurse @8.8.8.8 api.internal.example.com # Cached?
dig +nocmd +noall +answer api.internal.example.com # Show TTL
# Step 7: Flush local DNS cache (if needed)
sudo resolvectl flush-caches # systemd-resolved (older releases: systemd-resolve --flush-caches)
sudo killall -HUP mDNSResponder # macOS
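Step 1 asks you to look at the nameserver, search, and options lines. The sketch below shows, under simplifying assumptions, how those settings determine which names the resolver actually queries. It models the common glibc-style rule that a name with fewer dots than ndots (default 1) is tried against the search domains first. The sample file contents and the function names are hypothetical, and real resolvers have more options than are handled here.

```python
SAMPLE_RESOLV_CONF = """\
# hypothetical resolver configuration
nameserver 10.0.0.2
search internal.example.com example.com
options ndots:2 timeout:2
"""


def parse_resolv_conf(text):
    """Extract nameservers, search domains, and options; comment lines are ignored."""
    config = {"nameservers": [], "search": [], "options": []}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if not line:
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            config["search"] = values          # a later search line replaces an earlier one
        elif keyword == "options":
            config["options"].extend(values)
    return config


def candidate_names(hostname, config):
    """Return the fully qualified names a resolver would try, in order (simplified)."""
    if hostname.endswith("."):
        return [hostname]                      # already absolute: search list is skipped
    ndots = 1
    for option in config["options"]:
        if option.startswith("ndots:"):
            ndots = int(option.split(":", 1)[1])
    searched = [f"{hostname}.{domain}." for domain in config["search"]]
    absolute = hostname + "."
    if hostname.count(".") >= ndots:
        return [absolute] + searched
    return searched + [absolute]


if __name__ == "__main__":
    cfg = parse_resolv_conf(SAMPLE_RESOLV_CONF)
    print("nameservers:", cfg["nameservers"])
    for name in ("api", "api.internal.example.com"):
        print(name, "->", candidate_names(name, cfg))
```

If an application resolves a short name such as api, a missing or wrong search domain is enough to produce a "could not resolve host" error even when the record itself exists.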
Exercise 4: Production VPC with Terraform
Objective: Deploy a complete production-ready VPC using Terraform with all networking components.
Tasks:
- Create a VPC with DNS support enabled
- Create 2 public subnets, 2 private-app subnets, 2 private-data subnets across 2 AZs
- Set up Internet Gateway, NAT Gateways (one per AZ for HA), and route tables
- Configure security groups for: ALB (public), web servers (from ALB only), database (from web only)
- Output all subnet IDs and security group IDs
# Exercise 4 Solution: Complete production VPC module
# main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = "us-east-1"
}
data "aws_availability_zones" "available" {
state = "available"
}
locals {
azs = slice(data.aws_availability_zones.available.names, 0, 2)
vpc_cidr = "10.0.0.0/16"
environment = "production"
}
resource "aws_vpc" "main" {
cidr_block = local.vpc_cidr
enable_dns_support = true
enable_dns_hostnames = true
tags = { Name = "${local.environment}-vpc" }
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = { Name = "${local.environment}-igw" }
}
resource "aws_subnet" "public" {
count = length(local.azs)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(local.vpc_cidr, 8, count.index + 1)
availability_zone = local.azs[count.index]
map_public_ip_on_launch = true
tags = { Name = "public-${local.azs[count.index]}", Tier = "public" }
}
resource "aws_subnet" "private_app" {
count = length(local.azs)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(local.vpc_cidr, 4, count.index + 2)
availability_zone = local.azs[count.index]
tags = { Name = "private-app-${local.azs[count.index]}", Tier = "private" }
}
resource "aws_subnet" "private_data" {
count = length(local.azs)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(local.vpc_cidr, 8, count.index + 64)
availability_zone = local.azs[count.index]
tags = { Name = "private-data-${local.azs[count.index]}", Tier = "data" }
}
resource "aws_eip" "nat" {
count = length(local.azs)
domain = "vpc"
tags = { Name = "nat-eip-${local.azs[count.index]}" }
}
resource "aws_nat_gateway" "main" {
count = length(local.azs)
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = { Name = "nat-${local.azs[count.index]}" }
depends_on = [aws_internet_gateway.main]
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = { Name = "public-rt" }
}
resource "aws_route_table" "private" {
count = length(local.azs)
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id
}
tags = { Name = "private-rt-${local.azs[count.index]}" }
}
resource "aws_route_table_association" "public" {
count = length(local.azs)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private_app" {
count = length(local.azs)
subnet_id = aws_subnet.private_app[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
resource "aws_route_table_association" "private_data" {
count = length(local.azs)
subnet_id = aws_subnet.private_data[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
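# Security groups for the ALB -> web -> database chain required by the tasks.
# Port numbers are assumptions (the exercise does not specify them):
# 80/443 on the ALB, 8080 on the web servers, 5432 (PostgreSQL) on the database.
resource "aws_security_group" "alb" {
name = "${local.environment}-alb-sg"
description = "Public ALB: HTTP/HTTPS from anywhere"
vpc_id = aws_vpc.main.id
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { Name = "${local.environment}-alb-sg" }
}
resource "aws_security_group" "web" {
name = "${local.environment}-web-sg"
description = "Web servers: traffic from the ALB only"
vpc_id = aws_vpc.main.id
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { Name = "${local.environment}-web-sg" }
}
resource "aws_security_group" "database" {
name = "${local.environment}-db-sg"
description = "Database: traffic from web servers only"
vpc_id = aws_vpc.main.id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.web.id]
}
tags = { Name = "${local.environment}-db-sg" }
}
output "vpc_id" { value = aws_vpc.main.id }
output "security_group_ids" {
value = {
alb = aws_security_group.alb.id
web = aws_security_group.web.id
database = aws_security_group.database.id
}
}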
output "public_subnet_ids" { value = aws_subnet.public[*].id }
output "private_app_subnet_ids" { value = aws_subnet.private_app[*].id }
output "private_data_subnet_ids" { value = aws_subnet.private_data[*].id }
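The subnet CIDRs in this module come from Terraform's cidrsubnet(prefix, newbits, netnum) function, so they are not visible until plan time. The following Python sketch re-implements the same arithmetic for illustration (it is not Terraform's code) and prints the blocks the module would produce. The function module_subnets is a hypothetical helper that mirrors the three aws_subnet resources above.

```python
import ipaddress
from itertools import combinations


def cidrsubnet(prefix, newbits, netnum):
    """Extend prefix by newbits and return the netnum-th resulting subnet."""
    base = ipaddress.ip_network(prefix)
    new_prefix = base.prefixlen + newbits
    if new_prefix > base.max_prefixlen:
        raise ValueError("not enough address bits left for newbits")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    host_bits = base.max_prefixlen - new_prefix
    network_address = base.network_address + (netnum << host_bits)
    return f"{network_address}/{new_prefix}"


def module_subnets(vpc_cidr="10.0.0.0/16", az_count=2):
    """Mirror the count.index arithmetic of the public, private_app, and private_data subnets."""
    return {
        "public": [cidrsubnet(vpc_cidr, 8, i + 1) for i in range(az_count)],
        "private_app": [cidrsubnet(vpc_cidr, 4, i + 2) for i in range(az_count)],
        "private_data": [cidrsubnet(vpc_cidr, 8, i + 64) for i in range(az_count)],
    }


def has_overlap(cidrs):
    """True if any two CIDR blocks in the list overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return any(a.overlaps(b) for a, b in combinations(nets, 2))


if __name__ == "__main__":
    subnets = module_subnets()
    for tier, cidrs in subnets.items():
        print(f"{tier:13} {cidrs}")
    all_cidrs = [c for cidrs in subnets.values() for c in cidrs]
    print("overlap:", has_overlap(all_cidrs))
```

Note that the private-app blocks here start at 10.0.32.0/20 (netnum 2), not 10.0.16.0/20 as in Exercise 1; both layouts are free of overlaps, but they are not identical.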
Conclusion & Next Steps
Networking is the connective tissue that binds all infrastructure components together. In this article, we've covered:
- Fundamentals: OSI model, IP addressing, CIDR, routing, switching, firewalls, and NAT
- SDN: Control/data plane separation, overlay networks (VXLAN, Geneve), and cloud SDN implementations
- Load Balancing: L4 vs L7, algorithms, health checks, and cloud load balancer services
- DNS: Resolution process, record types, cloud DNS, and service discovery patterns
- Cloud Networking: VPC design, security groups/NACLs, transit gateways, and hybrid connectivity
Next in the Series
In Part 6: Infrastructure Storage, we explore block, object, and file storage fundamentals, RAID configurations, storage protocols (iSCSI, NFS, S3 API), cloud storage tiers, and data lifecycle management.