
Part 5: Infrastructure Networking

May 14, 2026 Wasil Zafar 45 min read

A deep dive into infrastructure networking — from OSI fundamentals and IP addressing through SDN, load balancing, DNS, and cloud VPC design patterns that power modern distributed systems.

Table of Contents

  1. Introduction
  2. Network Fundamentals
  3. Software-Defined Networking
  4. Load Balancing
  5. DNS & Service Discovery
  6. Cloud Networking Patterns
  7. Hands-On Exercises
  8. Conclusion & Next Steps

Introduction

Networking is the circulatory system of every infrastructure deployment. Without networks, your beautifully virtualized compute instances are isolated islands. Understanding networking — from the physical signal on a wire to the abstracted software-defined overlays in cloud environments — is essential for anyone working with modern infrastructure.

Key Insight: Many cloud outage root-cause analyses lead back to networking. Understanding how packets flow, how DNS resolves, and how load balancers distribute traffic is a core skill that separates senior infrastructure engineers from beginners.

The Networking Landscape

Infrastructure networking has evolved through several paradigms:

  • Physical networking (1980s–2000s): Manual switch configuration, static routes, hardware firewalls
  • Virtualized networking (2000s–2010s): Virtual switches, VLANs, software firewalls on hypervisors
  • Software-defined networking (2010s–present): Programmatic control, overlay networks, intent-based policies
  • Cloud-native networking (2015–present): VPCs, service meshes, eBPF, zero-trust micro-segmentation

Evolution of Infrastructure Networking
                                flowchart LR
                                    A["Physical<br/>Manual Config"] --> B["Virtualized<br/>vSwitches/VLANs"]
                                    B --> C["SDN<br/>Programmable"]
                                    C --> D["Cloud-Native<br/>VPCs/Service Mesh"]

Network Fundamentals Review

Before diving into cloud networking, we must ground ourselves in the fundamentals. These concepts don't change regardless of whether you're configuring a physical Cisco switch or defining an AWS security group.

OSI Model — Infrastructure-Relevant Layers

The OSI (Open Systems Interconnection) model has 7 layers, but infrastructure engineers primarily work with four of them:

| Layer | Name | Key Protocols | Infrastructure Role | Devices |
|---|---|---|---|---|
| L2 | Data Link | Ethernet, ARP, STP | MAC addressing, switching, VLANs | Switches, bridges |
| L3 | Network | IP, ICMP, BGP, OSPF | Routing, subnetting, IP addressing | Routers, L3 switches |
| L4 | Transport | TCP, UDP | Port-based services, connection state | Firewalls, L4 load balancers |
| L7 | Application | HTTP, DNS, TLS, gRPC | Content routing, API gateways | L7 load balancers, WAFs |

OSI Layers in Infrastructure Context
                                flowchart TB
                                    subgraph L7["Layer 7: Application"]
                                        HTTP["HTTP/HTTPS"]
                                        DNS_P["DNS"]
                                        GRPC["gRPC"]
                                    end
                                    subgraph L4["Layer 4: Transport"]
                                        TCP["TCP (reliable)"]
                                        UDP["UDP (fast)"]
                                    end
                                    subgraph L3["Layer 3: Network"]
                                        IP["IP Addressing"]
                                        ROUTING["Routing (BGP/OSPF)"]
                                    end
                                    subgraph L2["Layer 2: Data Link"]
                                        ETH["Ethernet Frames"]
                                        VLAN["VLANs"]
                                        ARP["ARP"]
                                    end
                                    L7 --> L4
                                    L4 --> L3
                                    L3 --> L2
                            

IP Addressing & CIDR Notation

Every device on a network needs an IP address. IPv4 addresses are 32-bit numbers, typically written in dotted-decimal notation (e.g., 192.168.1.100). CIDR (Classless Inter-Domain Routing) notation expresses both the address and the subnet mask in a single format.

CIDR Notation: 10.0.0.0/16 means the first 16 bits are the network portion (fixed), leaving 16 bits for host addresses. This gives 2^16 = 65,536 possible addresses in the subnet.

| CIDR | Subnet Mask | Usable Hosts | Common Use |
|---|---|---|---|
| /8 | 255.0.0.0 | 16,777,214 | Large enterprise (10.0.0.0/8) |
| /16 | 255.255.0.0 | 65,534 | VPC/VNet (10.0.0.0/16) |
| /20 | 255.255.240.0 | 4,094 | Large subnet |
| /24 | 255.255.255.0 | 254 | Standard subnet |
| /28 | 255.255.255.240 | 14 | Small subnet (firewall DMZ) |
| /32 | 255.255.255.255 | 1 | Single host route |

# Calculate subnet information using ipcalc
ipcalc 10.0.0.0/16

# Output shows:
# Address:   10.0.0.0
# Netmask:   255.255.0.0 = 16
# Wildcard:  0.0.255.255
# Network:   10.0.0.0/16
# HostMin:   10.0.0.1
# HostMax:   10.0.255.254
# Broadcast: 10.0.255.255
# Hosts/Net: 65534

# Split a /16 into four /18 subnets
ipcalc 10.0.0.0/16 -s 16000 16000 16000 16000

# View your machine's IP configuration
ip addr show

# Show the routing table
ip route show

# Check which interface a packet to 8.8.8.8 would use
ip route get 8.8.8.8

# List all network interfaces with their status
ip link show
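
The CIDR arithmetic above can also be checked programmatically. The following is a minimal sketch using Python's standard `ipaddress` module; the helper names `describe` and `split` are illustrative and not part of any tool mentioned in this article.

```python
import ipaddress

def describe(cidr: str) -> dict:
    """Summarize an IPv4 network, similar to what ipcalc prints."""
    net = ipaddress.ip_network(cidr, strict=True)
    # Network and broadcast addresses are not usable for hosts,
    # except in /31 and /32 networks where every address counts.
    hosts = net.num_addresses - 2 if net.prefixlen < 31 else net.num_addresses
    return {
        "netmask": str(net.netmask),
        "broadcast": str(net.broadcast_address),
        "total_addresses": net.num_addresses,
        "usable_hosts": hosts,
    }

def split(cidr: str, new_prefix: int) -> list:
    """Split a network into equal subnets of the given prefix length."""
    net = ipaddress.ip_network(cidr)
    return [str(subnet) for subnet in net.subnets(new_prefix=new_prefix)]

if __name__ == "__main__":
    print(describe("10.0.0.0/16"))
    print(split("10.0.0.0/16", 18))
```

Splitting a /16 at prefix length 18 yields four subnets, which matches the `ipcalc -s` request for four blocks of roughly 16,000 hosts each.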

Routing Basics — How Packets Find Their Way

Routing is the process of forwarding packets from source to destination across potentially dozens of intermediate networks. Routers maintain routing tables that map destination networks to next-hop addresses.

Packet Routing Through Multiple Hops
                                flowchart LR
                                    A["Source<br/>10.0.1.5"] --> R1["Router 1<br/>10.0.1.1"]
                                    R1 --> R2["Router 2<br/>172.16.0.1"]
                                    R2 --> R3["Router 3<br/>192.168.1.1"]
                                    R3 --> B["Destination<br/>192.168.1.50"]

Static routes are manually configured and never change unless an admin updates them. Dynamic routing protocols like BGP (Border Gateway Protocol) and OSPF (Open Shortest Path First) allow routers to discover and advertise routes automatically.

# Add a static route: send traffic for 192.168.2.0/24 via gateway 10.0.1.1
sudo ip route add 192.168.2.0/24 via 10.0.1.1

# Delete a static route
sudo ip route del 192.168.2.0/24

# View routing table with all details
ip route show table all

# Trace the path packets take to reach a destination
traceroute 8.8.8.8

# Show routing protocol status on a router running BIRD
sudo birdc show protocols all

# Show the BGP summary on a router running FRRouting
sudo vtysh -c "show ip bgp summary"
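
When several routes contain the same destination, routers pick the most specific one (longest-prefix match). A minimal sketch of that lookup follows; the `ROUTES` table is hypothetical, reusing the 192.168.2.0/24 static route from the commands above and inventing the other two entries for illustration.

```python
import ipaddress

# Hypothetical routing table: (destination network, next hop)
ROUTES = [
    ("0.0.0.0/0", "10.0.1.254"),       # default route (assumed)
    ("192.168.0.0/16", "10.0.1.253"),  # broader route (assumed)
    ("192.168.2.0/24", "10.0.1.1"),    # static route from the example above
]

def next_hop(destination: str, routes=ROUTES):
    """Return the next hop of the most specific route containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for cidr, hop in routes:
        network = ipaddress.ip_network(cidr)
        if addr in network:
            matches.append((network.prefixlen, hop))
    if not matches:
        return None  # no route to host
    # The longest prefix (largest prefix length) wins.
    return max(matches, key=lambda match: match[0])[1]

if __name__ == "__main__":
    for dest in ("192.168.2.7", "192.168.9.9", "8.8.8.8"):
        print(dest, "->", next_hop(dest))
```

This is roughly the decision that `ip route get` reports for a single destination, without the policy-routing and metric handling a real kernel performs.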

Switching & VLANs

Switches operate at Layer 2, forwarding Ethernet frames based on MAC addresses. VLANs (Virtual LANs) logically segment a physical switch into multiple broadcast domains, providing isolation without requiring separate physical infrastructure.

Key Insight: VLANs play a role in physical networks loosely comparable to subnets in a cloud VPC, with security groups and NACLs layered on top for filtering. They isolate traffic domains so that a broadcast storm in one VLAN doesn't affect others, and hosts in different VLANs can only communicate through a router or Layer 3 switch.

# Create a VLAN interface on Linux
sudo ip link add link eth0 name eth0.100 type vlan id 100
sudo ip addr add 10.0.100.1/24 dev eth0.100
sudo ip link set eth0.100 up

# View VLAN configuration
cat /proc/net/vlan/config

# Show bridge (virtual switch) information
bridge link show

# Create a Linux bridge (virtual switch)
sudo ip link add br0 type bridge
sudo ip link set eth0 master br0
sudo ip link set br0 up

Spanning Tree Protocol (STP) prevents broadcast loops in networks with redundant switch links. It designates one switch as the root bridge and blocks redundant paths, activating them only if the primary path fails.

Firewalls — Stateful vs Stateless

Firewalls control which traffic is permitted to flow between network segments. Understanding the difference between stateful and stateless filtering is critical for configuring cloud security correctly.

| Feature | Stateless (ACLs/NACLs) | Stateful (Security Groups) |
|---|---|---|
| Connection tracking | No: evaluates each packet independently | Yes: tracks connection state |
| Return traffic | Must explicitly allow both directions | Automatically allows return traffic |
| Rule evaluation | Processes rules in order (number priority) | Evaluates all rules, most permissive wins (AWS security groups; Azure NSGs use rule priority) |
| Performance | Faster (no state table) | Slightly slower (maintains state) |
| Cloud example | AWS NACLs | AWS Security Groups, Azure NSGs (stateful at both subnet and NIC level) |
| Use case | Broad subnet-level rules, DDoS mitigation | Instance-level micro-segmentation |

# iptables: Stateful firewall rules on Linux
# Allow established connections (stateful tracking via conntrack;
# the older "-m state --state" match is deprecated)
sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow SSH (port 22) from a specific subnet
sudo iptables -A INPUT -p tcp --dport 22 -s 10.0.1.0/24 -j ACCEPT

# Allow HTTP/HTTPS from anywhere
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT

# Drop everything else (default deny)
sudo iptables -A INPUT -j DROP

# List all rules with line numbers
sudo iptables -L -n --line-numbers
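
# nftables: Modern Linux firewall (successor to iptables)
# Caution: "policy drop" takes effect as soon as the chain is created.
# On a remote session, load the table, chain, and accept rules atomically
# from a file with "nft -f" so you don't lock yourself out.
sudo nft add table inet filter
sudo nft add chain inet filter input '{ type filter hook input priority 0; policy drop; }'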

# Allow established connections
sudo nft add rule inet filter input ct state established,related accept

# Allow SSH from management subnet
sudo nft add rule inet filter input ip saddr 10.0.1.0/24 tcp dport 22 accept

# Allow HTTP/HTTPS
sudo nft add rule inet filter input tcp dport { 80, 443 } accept

# List ruleset
sudo nft list ruleset
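
To make the stateful/stateless distinction concrete, here is a toy model in Python. It is a sketch only: the `Packet` tuple and both filters are hypothetical simplifications and ignore TCP flags, timeouts, and protocols other than a single port pair.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int

def stateless_allow(pkt: Packet, allowed_dports) -> bool:
    """Stateless filter: each packet is judged only on its own header."""
    return pkt.dport in allowed_dports

class StatefulFirewall:
    """Remembers outbound flows and admits their replies inbound."""

    def __init__(self):
        self.flows = set()

    def outbound(self, pkt: Packet) -> bool:
        # Record the flow so the reply can be recognized later.
        self.flows.add((pkt.src, pkt.sport, pkt.dst, pkt.dport))
        return True

    def inbound(self, pkt: Packet, allowed_dports=frozenset()) -> bool:
        # A reply has the source and destination of a tracked flow swapped.
        is_reply = (pkt.dst, pkt.dport, pkt.src, pkt.sport) in self.flows
        return is_reply or pkt.dport in allowed_dports

if __name__ == "__main__":
    fw = StatefulFirewall()
    request = Packet("10.0.1.10", 40000, "93.184.216.34", 443)
    reply = Packet("93.184.216.34", 443, "10.0.1.10", 40000)
    fw.outbound(request)
    print("stateful admits reply:", fw.inbound(reply))
    print("stateless admits reply:", stateless_allow(reply, {22, 80, 443}))
```

The stateless filter only admits the reply if the ephemeral port range is opened explicitly, which is why NACLs typically carry a 1024-65535 inbound rule.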

NAT (Network Address Translation)

NAT allows multiple devices on a private network to share a single public IP address. It's essential for cloud infrastructure because private subnets need NAT gateways to reach the internet for software updates and API calls without exposing instances directly.

NAT Gateway in Cloud Architecture
                                flowchart LR
                                    subgraph Private["Private Subnet (10.0.1.0/24)"]
                                        VM1["VM: 10.0.1.10"]
                                        VM2["VM: 10.0.1.11"]
                                        VM3["VM: 10.0.1.12"]
                                    end
                                    subgraph Public["Public Subnet"]
                                        NAT["NAT Gateway<br/>Public IP: 54.23.x.x"]
                                    end
                                    VM1 --> NAT
                                    VM2 --> NAT
                                    VM3 --> NAT
                                    NAT --> Internet["Internet"]

# Enable IP forwarding (required for NAT)
echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward

# Configure SNAT (Source NAT) using iptables
# All traffic from 10.0.1.0/24 going out eth0 gets masqueraded
sudo iptables -t nat -A POSTROUTING -s 10.0.1.0/24 -o eth0 -j MASQUERADE

# Configure DNAT (Destination NAT) - port forwarding
# Forward port 8080 on public IP to internal server 10.0.1.10:80
sudo iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 10.0.1.10:80

# View NAT table rules
sudo iptables -t nat -L -n -v
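
The bookkeeping behind source NAT is a translation table keyed by the private address and port. The sketch below is a toy model of that table, not how the kernel's conntrack implements it; the public address 203.0.113.5 and the starting port 20000 are assumptions chosen for illustration.

```python
import itertools

class SourceNat:
    """Toy port-address translation table (SNAT / masquerade)."""

    def __init__(self, public_ip: str, first_port: int = 20000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self.out = {}   # (private_ip, private_port) -> public_port
        self.back = {}  # public_port -> (private_ip, private_port)

    def translate_out(self, private_ip: str, private_port: int):
        """Rewrite an outbound source address, reusing existing mappings."""
        key = (private_ip, private_port)
        if key not in self.out:
            port = next(self._ports)
            self.out[key] = port
            self.back[port] = key
        return self.public_ip, self.out[key]

    def translate_in(self, public_port: int):
        """Map a reply back to the private host, or None if unknown."""
        return self.back.get(public_port)

if __name__ == "__main__":
    nat = SourceNat("203.0.113.5")
    print(nat.translate_out("10.0.1.10", 51000))
    print(nat.translate_out("10.0.1.11", 51000))
    print(nat.translate_in(20001))
```

Two VMs using the same source port get distinct public ports, which is how many private hosts can share one public IP.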

Software-Defined Networking (SDN)

SDN is the paradigm shift that made cloud networking possible. Traditional networks couple the control plane (routing decisions) with the data plane (packet forwarding) in every switch and router. SDN separates them, centralizing control logic in software while switches become simple forwarding devices.

SDN Architecture

SDN Architecture — Separation of Planes
                                flowchart TB
                                    subgraph App["Application Layer"]
                                        FW["Firewall App"]
                                        LB["Load Balancer App"]
                                        MON["Monitoring App"]
                                    end
                                    subgraph Control["Control Plane (Centralized)"]
                                        CTRL["SDN Controller<br/>(OpenDaylight, ONOS)"]
                                    end
                                    subgraph Data["Data Plane (Distributed)"]
                                        SW1["Switch 1"]
                                        SW2["Switch 2"]
                                        SW3["Switch 3"]
                                        SW4["Switch 4"]
                                    end
                                    App -->|"Northbound API (REST)"| Control
                                    Control -->|"Southbound API (OpenFlow)"| Data

Control Plane vs Data Plane: The control plane decides where traffic should go (routing decisions, policy enforcement). The data plane actually moves the packets based on those decisions. SDN centralizes the "brain" while distributing the "muscles."

Why SDN matters for cloud:

  • Programmatic network configuration via APIs (no manual switch CLI)
  • Multi-tenant isolation without physical separation
  • Network policies that follow workloads (not tied to physical ports)
  • Near-instant provisioning: new networks in seconds, not days

Overlay Networks (VXLAN, GRE)

Overlay networks create virtual network topologies on top of existing physical infrastructure. They encapsulate original packets inside new headers, allowing virtual networks to span multiple physical networks transparently.

| Technology | Encapsulation | Max Networks | Use Case |
|---|---|---|---|
| VXLAN | UDP + VXLAN header (50 bytes) | 16 million (24-bit VNI) | Cloud multi-tenancy, Kubernetes CNI |
| GRE | IP + GRE header (24 bytes) | Unlimited (tunnel-based) | Site-to-site VPNs, legacy overlay |
| Geneve | UDP + Geneve header (variable) | 16 million (24-bit VNI) | Extensible successor to VXLAN (used by AWS Gateway Load Balancer, for example) |
| VLAN | 802.1Q tag (4 bytes) | 4,094 (12-bit ID) | Physical switch segmentation |

# Create a VXLAN interface on Linux
# VNI 100, destination multicast group, bound to physical interface eth0
sudo ip link add vxlan100 type vxlan id 100 \
    group 239.1.1.1 \
    dev eth0 \
    dstport 4789

sudo ip addr add 10.200.1.1/24 dev vxlan100
sudo ip link set vxlan100 up

# Verify VXLAN interface
ip -d link show vxlan100

# Create a GRE tunnel
sudo ip tunnel add gre1 mode gre remote 203.0.113.1 local 198.51.100.1 ttl 255
sudo ip addr add 10.10.10.1/30 dev gre1
sudo ip link set gre1 up

# Verify tunnel
ip tunnel show
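
The 24-bit VNI and the 50-byte overhead quoted for VXLAN both follow from the header layout in RFC 7348. The sketch below packs and parses that 8-byte header with Python's `struct` module; the function names are illustrative, and the overhead figure assumes an untagged outer Ethernet frame and an outer IPv4 header without options.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag defined in RFC 7348

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given network identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, followed by 24 reserved bits.
    # Word 2: VNI in the top 24 bits, followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN header."""
    _, word2 = struct.unpack("!II", header[:8])
    return word2 >> 8

# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8)
OVERHEAD_BYTES = 14 + 20 + 8 + len(vxlan_header(100))

if __name__ == "__main__":
    print(vxlan_header(100).hex())
    print("overhead:", OVERHEAD_BYTES, "bytes")
    print("max networks:", 2**24)
```

This overhead is why overlay interfaces usually run with a smaller MTU than the underlying physical link.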

How Cloud Providers Implement Virtual Networking

Every major cloud provider runs a massive SDN platform under the hood:

Cloud SDN Implementations AWS / Azure / GCP

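AWS VPC runs on a proprietary SDN whose packet processing is offloaded to Nitro hardware. Each ENI (Elastic Network Interface) is a virtual port in that software switch. The "Blackfoot" edge devices handle NAT, internet gateways, and VPN termination. AWS publicly documents Geneve encapsulation for Gateway Load Balancer; treat its use elsewhere in the VPC data plane as unconfirmed.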

Azure VNet runs on their "Azure Accelerated Networking" stack with FPGA-offloaded SDN. SmartNICs handle encapsulation at line rate, and the Azure Network Controller manages millions of virtual networks.

GCP VPC uses their Andromeda SDN stack, which implements virtual networking in software on every host. Google's global backbone (B4 network) connects their data centers using centralized traffic engineering.


Load Balancing

Load balancers distribute incoming traffic across multiple backend servers to improve availability, reliability, and performance. They're one of the most critical components in any production infrastructure.

Layer 4 vs Layer 7 Load Balancing

L4 vs L7 Load Balancing
                                flowchart TB
                                    subgraph L4LB["Layer 4 Load Balancer"]
                                        direction TB
                                        L4["Sees: IP + Port<br/>Decides by: TCP connection"]
                                    end
                                    subgraph L7LB["Layer 7 Load Balancer"]
                                        direction TB
                                        L7["Sees: Full HTTP request<br/>Decides by: URL, headers, cookies"]
                                    end
                                    Client["Client Request"] --> L4LB
                                    Client --> L7LB
                                    L4LB --> S1["Server 1"]
                                    L4LB --> S2["Server 2"]
                                    L7LB -->|"/api/*"| API["API Servers"]
                                    L7LB -->|"/static/*"| CDN["CDN/Cache"]
                                    L7LB -->|"/ws/*"| WS["WebSocket Servers"]

| Feature | Layer 4 (TCP/UDP) | Layer 7 (HTTP/HTTPS) |
|---|---|---|
| Inspects | IP address + port number | Full HTTP request (URL, headers, body) |
| Speed | Very fast (no content parsing) | Slower (must parse HTTP) |
| TLS termination | Usually no (pass-through); some services, such as AWS NLB with TLS listeners, can terminate TLS | Yes (offloads TLS from backends) |
| Content routing | No | Yes (path, host, header-based) |
| Sticky sessions | Source IP hash only | Cookie-based affinity |
| WebSocket support | Yes (transparent) | Yes (with upgrade handling) |
| Use case | Database clusters, TCP services, gaming | Web apps, APIs, microservices |
| Cloud examples | AWS NLB, Azure LB, GCP TCP/UDP LB | AWS ALB, Azure App GW, GCP HTTP(S) LB |

Load Balancing Algorithms

The algorithm determines which backend server receives each incoming request:

  • Round Robin: Requests distributed sequentially 1 → 2 → 3 → 1 → 2... Simple but doesn't account for server load
  • Weighted Round Robin: Servers with higher weight get proportionally more requests. Use when backends have different capacities
  • Least Connections: Routes to the server with fewest active connections. Best for long-lived connections (WebSocket, database)
  • IP Hash: Client IP determines the backend. Provides session affinity without cookies
  • Random: Statistically equivalent to round-robin at scale, with no state tracking needed
  • Least Response Time: Combines fewest connections with fastest response. Most intelligent but requires active measurement
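
Several of the algorithms listed above fit in a few lines each. The following sketch is illustrative only: the backend addresses are borrowed from the configuration examples in this section, the weighted variant uses naive list expansion rather than the smooth weighting production balancers implement, and the hash choice (SHA-256) is an assumption.

```python
import hashlib
import itertools

BACKENDS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

def round_robin(backends):
    """Yield backends in a fixed repeating order."""
    return itertools.cycle(backends)

def weighted_round_robin(weights: dict):
    """Repeat each backend according to its integer weight (naive expansion)."""
    expanded = [backend for backend, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

def least_connections(active: dict) -> str:
    """Pick the backend with the fewest open connections."""
    return min(active, key=active.get)

def ip_hash(client_ip: str, backends) -> str:
    """Map a client IP to a stable backend using a deterministic hash."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

if __name__ == "__main__":
    rr = round_robin(BACKENDS)
    print([next(rr) for _ in range(4)])
    print(least_connections({"10.0.1.10": 5, "10.0.1.11": 2, "10.0.1.12": 9}))
    print(ip_hash("198.51.100.7", BACKENDS))
```

Note that a plain modulo hash remaps most clients whenever the backend list changes; consistent hashing is the usual remedy when that matters.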

Health Checks & Failover

Load balancers continuously verify backend health to route traffic only to healthy instances:

# NGINX upstream with health checks
upstream backend_servers {
    # Least connections algorithm
    least_conn;

    server 10.0.1.10:8080 weight=3 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 weight=2 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 weight=1 max_fails=3 fail_timeout=30s;

    # Passive health checking: mark unhealthy after 3 failures
    # Active health checking (NGINX Plus only):
    # health_check interval=5s fails=3 passes=2;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend_servers;
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }
}

# HAProxy configuration with active health checks
# The heredoc below only prints a sample config; save it as /etc/haproxy/haproxy.cfg to use it

cat <<'EOF'
global
    daemon
    maxconn 4096

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    option httpchk GET /health

frontend web_frontend
    bind *:80
    default_backend web_servers

backend web_servers
    balance leastconn
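    # HAProxy 2.2+ syntax; older releases use:
    #   option httpchk GET /health HTTP/1.1\r\nHost:\ localhost
    option httpchk
    http-check send meth GET uri /health ver HTTP/1.1 hdr Host localhost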
    
    server web1 10.0.1.10:8080 check inter 5s fall 3 rise 2 weight 3
    server web2 10.0.1.11:8080 check inter 5s fall 3 rise 2 weight 2
    server web3 10.0.1.12:8080 check inter 5s fall 3 rise 2 weight 1

listen stats
    bind *:8404
    stats enable
    stats uri /stats
EOF
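
The `fall 3 rise 2` settings in the HAProxy example describe a small state machine: a backend flips state only after enough consecutive contradicting results. The sketch below models that idea with a hypothetical `HealthTracker` class; it is not HAProxy's implementation, only an illustration of the thresholds.

```python
class HealthTracker:
    """Marks a backend down after `fall` consecutive failed checks
    and up again after `rise` consecutive successful checks."""

    def __init__(self, rise: int = 2, fall: int = 3, healthy: bool = True):
        self.rise = rise
        self.fall = fall
        self.healthy = healthy
        self._streak = 0  # consecutive results contradicting the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one check result and return the resulting health state."""
        if check_passed == self.healthy:
            # Result agrees with the current state, so the streak resets.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.fall if self.healthy else self.rise
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy

if __name__ == "__main__":
    tracker = HealthTracker(rise=2, fall=3)
    for result in (False, False, False, True, True):
        print(result, "->", tracker.record(result))
```

Requiring consecutive results avoids flapping: a single slow response does not pull a server out of rotation, and a single lucky success does not put a broken one back.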

Cloud Load Balancers

Cloud Load Balancer Comparison 2026

AWS Application Load Balancer (ALB): Layer 7, supports path/host-based routing, WebSocket, gRPC, native WAF integration. Ideal for microservices with multiple target groups.

AWS Network Load Balancer (NLB): Layer 4, handles millions of requests/sec with ultra-low latency. Supports static IPs and preserves source IP. Use for TCP/UDP services, gaming, IoT.

Azure Application Gateway: Layer 7 with built-in WAF, URL-based routing, cookie affinity. Supports autoscaling and zone redundancy.

Azure Load Balancer: Layer 4, supports both public and internal load balancing. HA Ports feature for NVAs (Network Virtual Appliances).

GCP Global HTTP(S) Load Balancer: Unique global anycast architecture — single IP address routes to nearest healthy backend worldwide. Integrated with Cloud CDN and Cloud Armor WAF.

# AWS: Create an Application Load Balancer
aws elbv2 create-load-balancer \
    --name my-web-alb \
    --subnets subnet-0123456789abcdef0 subnet-0fedcba9876543210 \
    --security-groups sg-0123456789abcdef0 \
    --scheme internet-facing \
    --type application

# Create a target group with health check
aws elbv2 create-target-group \
    --name my-web-targets \
    --protocol HTTP \
    --port 80 \
    --vpc-id vpc-0123456789abcdef0 \
    --health-check-protocol HTTP \
    --health-check-path /health \
    --health-check-interval-seconds 10 \
    --healthy-threshold-count 2 \
    --unhealthy-threshold-count 3

# Register targets
aws elbv2 register-targets \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-web-targets/abc123 \
    --targets Id=i-0123456789abcdef0 Id=i-0fedcba9876543210

DNS & Service Discovery

DNS (Domain Name System) is the internet's phone book — translating human-readable names into IP addresses. In cloud infrastructure, DNS also serves as a service discovery mechanism, enabling microservices to find each other dynamically.

How DNS Works

DNS Resolution Process
                                sequenceDiagram
                                    participant Client
                                    participant Resolver as Recursive Resolver<br/>(ISP/8.8.8.8)
                                    participant Root as Root NS<br/>(.)
                                    participant TLD as TLD NS<br/>(.com)
                                    participant Auth as Authoritative NS<br/>(example.com)
                                    Client->>Resolver: Query: api.example.com?
                                    Resolver->>Root: Where is .com?
                                    Root-->>Resolver: Ask TLD at x.gtld-servers.net
                                    Resolver->>TLD: Where is example.com?
                                    TLD-->>Resolver: Ask NS at ns1.example.com
                                    Resolver->>Auth: What is api.example.com?
                                    Auth-->>Resolver: A 10.0.1.50 (TTL 300s)
                                    Resolver-->>Client: A 10.0.1.50 (cached)

Key Insight: DNS TTL (Time To Live) is one of the most impactful settings in infrastructure. A TTL of 300 seconds means changes take up to 5 minutes to propagate. Set TTL low (60s) before planned migrations, and high (3600s) for stable records to reduce query load.
# Query DNS for a domain (shows A record)
dig api.example.com

# Query specific record type
dig api.example.com AAAA    # IPv6 address
dig example.com MX          # Mail servers
dig example.com TXT         # TXT records (SPF, DKIM, verification)
dig example.com NS          # Name servers

# Trace the full resolution path
dig +trace api.example.com

# Query a specific DNS server
dig @8.8.8.8 api.example.com

# Short answer only
dig +short api.example.com

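# Show only the answer section; the second column is the TTL,
# which counts down when the answer comes from a resolver's cache
dig +noall +answer api.example.com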
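
The effect of TTL on propagation comes from resolver caching: an answer is reused until its TTL expires, and only then is the authoritative server asked again. The following is a minimal sketch of such a cache; the `TtlCache` class and its injectable clock are assumptions made for illustration and testing, not part of any resolver's API.

```python
import time

class TtlCache:
    """Caches DNS-style answers until their TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._records = {}  # name -> (value, expires_at)

    def put(self, name: str, value: str, ttl: float) -> None:
        self._records[name] = (value, self._clock() + ttl)

    def get(self, name: str):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._records.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._records[name]  # expired: force a fresh lookup
            return None
        return value

if __name__ == "__main__":
    cache = TtlCache()
    cache.put("api.example.com", "10.0.1.50", ttl=300)
    print(cache.get("api.example.com"))
```

With a 300-second TTL, a client that cached the record just before a change keeps using the old address for up to five minutes, which is why lowering the TTL ahead of a migration helps.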

DNS Record Types

| Record | Purpose | Example Value | Infrastructure Use |
|---|---|---|---|
| A | IPv4 address | 93.184.216.34 | Map domain to server IP |
| AAAA | IPv6 address | 2606:2800:220:1::248 | IPv6-enabled services |
| CNAME | Canonical name (alias) | api.example.com → lb.aws.com | Point to load balancer DNS name |
| MX | Mail exchange | 10 mail.example.com | Email routing |
| TXT | Text data | "v=spf1 include:..." | SPF, DKIM, domain verification |
| SRV | Service location | _http._tcp: 10 60 80 web1.example.com (priority, weight, port, target; values illustrative) | Service discovery (port + host) |
| NS | Name server | ns1.example.com | Delegation to authoritative DNS |
| PTR | Reverse lookup | 34.216.184.93.in-addr.arpa → example.com | Email validation, debugging |

Cloud DNS Services

# AWS Route 53: Create a hosted zone and records
aws route53 create-hosted-zone \
    --name example.com \
    --caller-reference "$(date +%s)"

# Create an A record pointing to an ALB (alias record)
aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789ABCDEF \
    --change-batch '{
        "Changes": [{
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": "Z35SXDOTRQ7X7K",
                    "DNSName": "my-alb-123456.us-east-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": true
                }
            }
        }]
    }'

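# Create a weighted routing policy (blue/green)
# Note: Route 53 does not allow weighted and non-weighted records with the same
# name and type, so use this instead of the alias record above, not alongside it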
aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789ABCDEF \
    --change-batch '{
        "Changes": [{
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "SetIdentifier": "blue",
                "Weight": 90,
                "TTL": 60,
                "ResourceRecords": [{"Value": "10.0.1.10"}]
            }
        }, {
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "SetIdentifier": "green",
                "Weight": 10,
                "TTL": 60,
                "ResourceRecords": [{"Value": "10.0.2.10"}]
            }
        }]
    }'
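
Weighted routing answers each query with one record chosen in proportion to its weight, so weights of 90 and 10 send roughly 90% of resolutions to blue. The sketch below simulates that selection with the values from the example above; it is an illustration of the idea, not Route 53's actual selection code.

```python
import random

# (value, weight) pairs taken from the blue/green example above
RECORDS = [("10.0.1.10", 90), ("10.0.2.10", 10)]

def resolve(records, rng=random):
    """Pick one record value with probability proportional to its weight."""
    values = [value for value, _ in records]
    weights = [weight for _, weight in records]
    return rng.choices(values, weights=weights, k=1)[0]

def share(records, value) -> float:
    """Expected fraction of answers that return the given value."""
    total = sum(weight for _, weight in records)
    return next(w for v, w in records if v == value) / total

if __name__ == "__main__":
    rng = random.Random(0)
    sample = [resolve(RECORDS, rng) for _ in range(10_000)]
    print("blue share:", sample.count("10.0.1.10") / len(sample))
    print("expected:", share(RECORDS, "10.0.1.10"))
```

Because clients cache answers for the TTL, the observed traffic split only approaches the weights over many independent resolvers.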

# Azure DNS: Create a zone and records
az network dns zone create \
    --resource-group myResourceGroup \
    --name example.com

# Add an A record
az network dns record-set a add-record \
    --resource-group myResourceGroup \
    --zone-name example.com \
    --record-set-name api \
    --ipv4-address 10.0.1.10

# Add a CNAME record
az network dns record-set cname set-record \
    --resource-group myResourceGroup \
    --zone-name example.com \
    --record-set-name www \
    --cname myapp.azurewebsites.net

Service Discovery Patterns

In dynamic cloud environments where instances scale up and down constantly, services need to discover each other automatically. There are several patterns:

Service Discovery Patterns
                                flowchart TB
                                    subgraph DNS_SD["DNS-Based Discovery"]
                                        SVC1["Service A"] -->|"dig service-b.internal"| DNS_INT["Internal DNS<br/>(Route 53 Private Zone)"]
                                        DNS_INT -->|"10.0.1.x"| SVC2["Service B"]
                                    end
                                    subgraph REG_SD["Registry-Based Discovery"]
                                        SVC3["Service C"] -->|"lookup(service-d)"| REG["Service Registry<br/>(Consul/etcd/ZooKeeper)"]
                                        REG -->|"10.0.2.x:8080"| SVC4["Service D"]
                                    end
                                    subgraph MESH_SD["Service Mesh Discovery"]
                                        SVC5["Service E"] -->|"localhost:port"| PROXY1["Sidecar Proxy<br/>(Envoy)"]
                                        PROXY1 -->|"mTLS"| PROXY2["Sidecar Proxy<br/>(Envoy)"]
                                        PROXY2 --> SVC6["Service F"]
                                    end

  • DNS-based discovery: Simplest approach. Services register DNS records; clients resolve names. Works with any language/framework. Limitation: DNS caching can serve stale results
  • Registry-based discovery: Services register themselves with a central registry (Consul, etcd, ZooKeeper). Clients query the registry for endpoints. Provides health metadata and real-time updates
  • Service mesh: Sidecar proxies (Envoy, Linkerd) handle discovery transparently. Applications talk to localhost; the mesh routes to the correct destination with mTLS encryption, retries, and observability
# Kubernetes DNS-based service discovery
# When you create a Service, Kubernetes creates a DNS record:
# my-service.my-namespace.svc.cluster.local

# From inside a pod, resolve service DNS
nslookup my-service.my-namespace.svc.cluster.local

# Headless service (returns pod IPs instead of ClusterIP)
nslookup my-headless-service.my-namespace.svc.cluster.local

# SRV records for port discovery
dig _http._tcp.my-service.my-namespace.svc.cluster.local SRV

# Consul: Register a service and query for it
# Register service via HTTP API
curl -X PUT http://localhost:8500/v1/agent/service/register \
    -H "Content-Type: application/json" \
    -d '{
        "ID": "web-1",
        "Name": "web",
        "Port": 8080,
        "Tags": ["production", "v2"],
        "Check": {
            "HTTP": "http://localhost:8080/health",
            "Interval": "10s",
            "Timeout": "5s"
        }
    }'

# Query for healthy instances of a service
# (the URL is quoted so the shell does not treat "?" as a glob character)
curl 'http://localhost:8500/v1/health/service/web?passing=true'

# DNS interface: Consul also exposes services via DNS
dig @127.0.0.1 -p 8600 web.service.consul SRV
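
Registry-based discovery boils down to three operations: register, heartbeat, and look up only the instances that are still alive. The sketch below is a hypothetical in-memory model of that pattern; it does not reflect Consul's API or data model, and the TTL-based liveness rule is an assumption standing in for real health checks.

```python
import time

class ServiceRegistry:
    """In-memory registry: instances must heartbeat within `ttl` seconds."""

    def __init__(self, ttl: float = 10.0, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._instances = {}  # (service, instance_id) -> (address, last_seen)

    def register(self, service: str, instance_id: str, address: str) -> None:
        self._instances[(service, instance_id)] = (address, self._clock())

    # A heartbeat simply refreshes the registration timestamp.
    heartbeat = register

    def deregister(self, service: str, instance_id: str) -> None:
        self._instances.pop((service, instance_id), None)

    def lookup(self, service: str) -> list:
        """Return addresses of instances seen within the last `ttl` seconds."""
        now = self._clock()
        return sorted(
            address
            for (svc, _), (address, seen) in self._instances.items()
            if svc == service and now - seen <= self.ttl
        )

if __name__ == "__main__":
    registry = ServiceRegistry(ttl=10)
    registry.register("web", "web-1", "10.0.2.10:8080")
    print(registry.lookup("web"))
```

Compared with DNS-based discovery, a stale instance disappears from lookups as soon as its heartbeat lapses, without waiting for client-side caches to expire.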

Cloud Networking Patterns

Cloud providers abstract physical networking into programmable, API-driven constructs. Understanding these patterns is essential for designing secure, scalable infrastructure.

VPC/VNet Design

A well-designed VPC separates concerns using subnets, route tables, and gateways:

Production VPC Architecture
                                flowchart TB
                                    IGW["Internet Gateway"] --- PUB
                                    subgraph VPC["VPC: 10.0.0.0/16"]
                                        subgraph PUB["Public Subnets"]
                                            PUB_A["10.0.1.0/24<br/>AZ-a<br/>ALB, NAT GW"]
                                            PUB_B["10.0.2.0/24<br/>AZ-b<br/>ALB, NAT GW"]
                                        end
                                        subgraph PRIV["Private Subnets (App)"]
                                            PRIV_A["10.0.10.0/24<br/>AZ-a<br/>App Servers"]
                                            PRIV_B["10.0.11.0/24<br/>AZ-b<br/>App Servers"]
                                        end
                                        subgraph DATA["Private Subnets (Data)"]
                                            DATA_A["10.0.20.0/24<br/>AZ-a<br/>RDS, ElastiCache"]
                                            DATA_B["10.0.21.0/24<br/>AZ-b<br/>RDS, ElastiCache"]
                                        end
                                    end
                                    PUB_A --> PRIV_A
                                    PUB_B --> PRIV_B
                                    PRIV_A --> DATA_A
                                    PRIV_B --> DATA_B
                                    PRIV_A -->|"NAT GW"| IGW
                                    PRIV_B -->|"NAT GW"| IGW

Critical Design Rule: Never place databases, caches, or internal services in public subnets. Public subnets have a route to the Internet Gateway. Only load balancers, NAT gateways, and bastion hosts belong in public subnets.
# AWS: Create a VPC with public and private subnets
# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16 --tag-specifications \
    'ResourceType=vpc,Tags=[{Key=Name,Value=production-vpc}]'

# Create public subnet in AZ-a
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
    --cidr-block 10.0.1.0/24 \
    --availability-zone us-east-1a \
    --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=public-a}]'

# Create private subnet in AZ-a
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
    --cidr-block 10.0.10.0/24 \
    --availability-zone us-east-1a \
    --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=private-app-a}]'

# Create Internet Gateway and attach to VPC
aws ec2 create-internet-gateway --tag-specifications \
    'ResourceType=internet-gateway,Tags=[{Key=Name,Value=prod-igw}]'
aws ec2 attach-internet-gateway --internet-gateway-id igw-0123456789 --vpc-id vpc-0123456789abcdef0

# Create NAT Gateway (requires an Elastic IP)
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway --subnet-id subnet-public-a \
    --allocation-id eipalloc-0123456789 \
    --tag-specifications 'ResourceType=natgateway,Tags=[{Key=Name,Value=nat-a}]'
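
# Terraform: Define a production VPC
# Availability zones referenced by the subnet resources below
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "production" {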
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.production.id
  cidr_block              = cidrsubnet(aws_vpc.production.cidr_block, 8, count.index + 1)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "public-${count.index + 1}"
    Tier = "public"
  }
}

resource "aws_subnet" "private_app" {
  count             = 2
  vpc_id            = aws_vpc.production.id
  cidr_block        = cidrsubnet(aws_vpc.production.cidr_block, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "private-app-${count.index + 1}"
    Tier = "private"
  }
}

resource "aws_subnet" "private_data" {
  count             = 2
  vpc_id            = aws_vpc.production.id
  cidr_block        = cidrsubnet(aws_vpc.production.cidr_block, 8, count.index + 20)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "private-data-${count.index + 1}"
    Tier = "data"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.production.id
  tags   = { Name = "production-igw" }
}

resource "aws_nat_gateway" "main" {
  count         = 2
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  tags          = { Name = "nat-gw-${count.index + 1}" }
}

resource "aws_eip" "nat" {
  count  = 2
  domain = "vpc"
}
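
Before applying a subnet plan like this, it can help to compute the resulting CIDR blocks and confirm they don't overlap. The sketch below reimplements the IPv4 behavior of Terraform's `cidrsubnet()` in Python for that purpose; the function is an approximation written for illustration, and `overlaps` is a hypothetical helper.

```python
import ipaddress

def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Python analogue of Terraform's cidrsubnet() for IPv4 prefixes."""
    net = ipaddress.ip_network(prefix)
    new_prefix = net.prefixlen + newbits
    if new_prefix > net.max_prefixlen or not 0 <= netnum < 2**newbits:
        raise ValueError("subnet does not fit in the parent prefix")
    # Each subnet spans this many addresses; netnum selects which one.
    size = 2 ** (net.max_prefixlen - new_prefix)
    base = int(net.network_address) + netnum * size
    return f"{ipaddress.ip_address(base)}/{new_prefix}"

def overlaps(cidrs) -> bool:
    """Return True if any two networks in the list overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])

if __name__ == "__main__":
    # Same offsets as the Terraform resources: +1 public, +10 app, +20 data
    plan = [cidrsubnet("10.0.0.0/16", 8, n) for n in (1, 2, 10, 11, 20, 21)]
    print(plan)
    print("overlap:", overlaps(plan))
```

The computed blocks line up with the subnets shown in the production VPC diagram.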

Network Security Groups & NACLs

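Cloud networks provide two layers of firewall protection: security groups, which are stateful and attach to instances or network interfaces, and network ACLs (NACLs), which are stateless and attach to subnets: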

# Terraform: Security group for web servers (stateful)
resource "aws_security_group" "web" {
  name        = "web-servers"
  description = "Security group for web servers"
  vpc_id      = aws_vpc.production.id

  # Inbound: Allow HTTP/HTTPS from ALB only
  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "HTTP from ALB"
  }

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "HTTPS from ALB"
  }

  # Outbound: Allow all (for package updates, API calls)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound"
  }

  tags = { Name = "web-servers-sg" }
}

# NACL for private subnets (stateless - need both directions)
resource "aws_network_acl" "private" {
  vpc_id     = aws_vpc.production.id
  subnet_ids = aws_subnet.private_app[*].id

  # Allow inbound from VPC CIDR
  ingress {
    protocol   = "-1"
    rule_no    = 100
    action     = "allow"
    cidr_block = aws_vpc.production.cidr_block
    from_port  = 0
    to_port    = 0
  }

  # Allow return traffic from internet (ephemeral ports)
  ingress {
    protocol   = "tcp"
    rule_no    = 200
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 1024
    to_port    = 65535
  }

  # Allow all outbound
  egress {
    protocol   = "-1"
    rule_no    = 100
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 0
    to_port    = 0
  }

  tags = { Name = "private-nacl" }
}
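
The "stateless" comment on the NACL is easier to see with a small model of rule evaluation. The sketch below is a simplification written for illustration: the `Rule` class and `evaluate` function are hypothetical, the rules mirror the Terraform above (assuming the VPC CIDR is 10.0.0.0/16), and `203.0.113.10` stands in for an arbitrary internet host.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_no: int
    protocol: str   # "tcp", or "-1" for all protocols
    cidr: str       # peer range: source for ingress, destination for egress
    from_port: int  # destination port range (ignored when protocol is "-1")
    to_port: int
    action: str     # "allow" or "deny"


def evaluate(rules, protocol, peer_ip, dest_port):
    """Return the action of the lowest-numbered matching rule, else 'deny'."""
    peer = ipaddress.ip_address(peer_ip)
    for rule in sorted(rules, key=lambda r: r.rule_no):
        protocol_ok = rule.protocol in ("-1", protocol)
        cidr_ok = peer in ipaddress.ip_network(rule.cidr)
        port_ok = rule.protocol == "-1" or rule.from_port <= dest_port <= rule.to_port
        if protocol_ok and cidr_ok and port_ok:
            return rule.action
    return "deny"  # nothing matched: implicit deny


INGRESS = [
    Rule(100, "-1", "10.0.0.0/16", 0, 0, "allow"),
    Rule(200, "tcp", "0.0.0.0/0", 1024, 65535, "allow"),
]
EGRESS = [Rule(100, "-1", "0.0.0.0/0", 0, 0, "allow")]

# 1. A private instance opens an HTTPS connection to an internet host.
print("request out :", evaluate(EGRESS, "tcp", "203.0.113.10", 443))
# 2. The reply comes back addressed to the client's ephemeral port.
print("reply in    :", evaluate(INGRESS, "tcp", "203.0.113.10", 50000))
# 3. Without rule 200 the same reply is dropped, because the NACL keeps
#    no record of the outbound request.
print("no rule 200 :", evaluate(INGRESS[:1], "tcp", "203.0.113.10", 50000))
```

A security group would admit the reply in step 3 automatically because it tracks connections; the NACL needs the explicit ephemeral-port rule.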

VPC Peering & Transit Gateways

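As organizations grow, they need to connect multiple VPCs (different environments, teams, or regions). VPC peering links exactly two VPCs and is not transitive, so fully connecting n VPCs requires n(n-1)/2 peering connections, each with its own route table entries. A Transit Gateway instead acts as a hub: each VPC (or VPN) attaches once, which avoids the O(n²) mesh of VPC peering.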
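
The growth difference can be checked with a few lines of arithmetic. This is only a counting sketch; the function names are illustrative, and it ignores per-account limits and pricing.

```python
def peering_connections(vpc_count: int) -> int:
    """Full mesh: one peering connection per unordered pair of VPCs."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count: int) -> int:
    """Hub and spoke: one attachment per VPC."""
    return vpc_count


for n in (4, 10, 50):
    print(
        f"{n:>3} VPCs: {peering_connections(n):>5} peering connections "
        f"vs {transit_gateway_attachments(n):>3} attachments"
    )
```

Four VPCs need 6 peering connections, 10 need 45, and 50 need 1,225, while the number of Transit Gateway attachments stays equal to the number of VPCs.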

VPC Peering vs Transit Gateway
                                flowchart TB
                                    subgraph Peering["VPC Peering (Full Mesh)"]
                                        P1["VPC A"] <--> P2["VPC B"]
                                        P2 <--> P3["VPC C"]
                                        P1 <--> P3
                                        P1 <--> P4["VPC D"]
                                        P2 <--> P4
                                        P3 <--> P4
                                    end
                                    subgraph TGW["Transit Gateway (Hub-Spoke)"]
                                        T1["VPC A"] --> HUB["Transit Gateway"]
                                        T2["VPC B"] --> HUB
                                        T3["VPC C"] --> HUB
                                        T4["VPC D"] --> HUB
                                        VPN_T["VPN"] --> HUB
                                    end

# AWS: Create a Transit Gateway
aws ec2 create-transit-gateway \
    --description "Production Transit Gateway" \
    --options "AmazonSideAsn=64512,AutoAcceptSharedAttachments=enable,DefaultRouteTableAssociation=enable,DefaultRouteTablePropagation=enable,DnsSupport=enable"

# Attach VPCs to the Transit Gateway
aws ec2 create-transit-gateway-vpc-attachment \
    --transit-gateway-id tgw-0123456789abcdef0 \
    --vpc-id vpc-0123456789abcdef0 \
    --subnet-ids subnet-0123456789abcdef0 subnet-0fedcba9876543210

# Add route in VPC route table pointing to Transit Gateway
aws ec2 create-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 10.1.0.0/16 \
    --transit-gateway-id tgw-0123456789abcdef0

VPN & Direct Connect / ExpressRoute

Hybrid connectivity bridges on-premises data centers with cloud environments:

| Feature | Site-to-Site VPN | Direct Connect / ExpressRoute |
|---|---|---|
| Connection type | Encrypted tunnel over internet | Dedicated physical circuit |
| Bandwidth | Up to ~1.25 Gbps per tunnel | 1 Gbps to 100 Gbps |
| Latency | Variable (internet-dependent) | Consistent, low latency |
| Setup time | Minutes | Weeks to months |
| Cost | Low (data transfer fees) | High (port fees + cross-connect) |
| Encryption | Yes (IPsec) | Not by default (add MACsec or VPN overlay) |
| Redundancy | Multiple tunnels across AZs | Dual circuits to different locations |
| Best for | Dev/test, backup path, quick setup | Production workloads, large data transfer |

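
To put the bandwidth figures in perspective, here is a rough transfer-time estimate. It is a back-of-the-envelope sketch: the 50 TB dataset is an arbitrary example, and the calculation assumes the link is fully used with no protocol overhead (lower the `efficiency` parameter to model a more realistic link).

```python
def transfer_hours(data_terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `data_terabytes` (decimal TB) over a `link_gbps` link."""
    bits_to_move = data_terabytes * 1e12 * 8
    bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / bits_per_second / 3600


DATASET_TB = 50  # hypothetical migration size

for label, gbps in [
    ("Single VPN tunnel (1.25 Gbps)", 1.25),
    ("Dedicated circuit (10 Gbps)", 10),
    ("Dedicated circuit (100 Gbps)", 100),
]:
    print(f"{label:<32} {transfer_hours(DATASET_TB, gbps):6.1f} hours")
```

Under these idealized assumptions the 50 TB copy takes roughly 89 hours over one VPN tunnel, about 11 hours at 10 Gbps, and just over an hour at 100 Gbps, which is why the table lists large data transfer under the dedicated-circuit column.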
# AWS: Create a Site-to-Site VPN connection
# Step 1: Create a Virtual Private Gateway
aws ec2 create-vpn-gateway --type ipsec.1 --amazon-side-asn 64512
aws ec2 attach-vpn-gateway --vpn-gateway-id vgw-0123456789 --vpc-id vpc-0123456789abcdef0

# Step 2: Create a Customer Gateway (your on-prem device)
aws ec2 create-customer-gateway \
    --type ipsec.1 \
    --public-ip 203.0.113.1 \
    --bgp-asn 65000

# Step 3: Create the VPN connection
aws ec2 create-vpn-connection \
    --type ipsec.1 \
    --vpn-gateway-id vgw-0123456789 \
    --customer-gateway-id cgw-0123456789 \
    --options '{"StaticRoutesOnly": false}'

# Step 4: Enable route propagation in VPC route table
aws ec2 enable-vgw-route-propagation \
    --route-table-id rtb-0123456789 \
    --gateway-id vgw-0123456789

Hands-On Exercises

Exercise 1: Subnet Calculator (Beginner)

Objective: Practice CIDR notation and subnet planning for a production VPC.

Scenario: You need to design a VPC with the following requirements:

  • VPC CIDR: 10.0.0.0/16
  • 3 Availability Zones
  • Each AZ needs: 1 public subnet (small), 1 private-app subnet (medium), 1 private-data subnet (small)
  • Room for future expansion

Tasks:

  1. Calculate the CIDR blocks for 9 subnets that don't overlap
  2. Verify no addresses are wasted (use ipcalc)
  3. Document which subnets get a route to the Internet Gateway vs NAT Gateway

Tags: CIDR, Subnetting, VPC Design

# Exercise 1 Solution: Subnet planning
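# VPC: 10.0.0.0/16 (65,536 addresses)
# Strategy: Use /20 for private-app (4094 hosts), /24 for public and data (254 hosts)
# Note: these are the generic host counts that ipcalc reports. AWS reserves 5 addresses
# in every subnet, so the usable counts are 4091 (/20) and 251 (/24).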

# Public subnets (small - for ALB + NAT GW)
echo "Public AZ-a: 10.0.1.0/24 (254 hosts)"
echo "Public AZ-b: 10.0.2.0/24 (254 hosts)"
echo "Public AZ-c: 10.0.3.0/24 (254 hosts)"

# Private app subnets (larger - for EC2/ECS workloads)
echo "Private-App AZ-a: 10.0.16.0/20 (4094 hosts)"
echo "Private-App AZ-b: 10.0.32.0/20 (4094 hosts)"
echo "Private-App AZ-c: 10.0.48.0/20 (4094 hosts)"

# Private data subnets (small - for RDS/ElastiCache)
echo "Private-Data AZ-a: 10.0.64.0/24 (254 hosts)"
echo "Private-Data AZ-b: 10.0.65.0/24 (254 hosts)"
echo "Private-Data AZ-c: 10.0.66.0/24 (254 hosts)"

# Verify with ipcalc
ipcalc 10.0.16.0/20
ipcalc 10.0.32.0/20
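
The plan can also be checked programmatically rather than by eye. The following is a minimal sketch using Python's standard `ipaddress` module; the `PLAN` dictionary copies the CIDRs from the solution above, and the helper names are illustrative.

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")

PLAN = {
    "public-a": "10.0.1.0/24",
    "public-b": "10.0.2.0/24",
    "public-c": "10.0.3.0/24",
    "private-app-a": "10.0.16.0/20",
    "private-app-b": "10.0.32.0/20",
    "private-app-c": "10.0.48.0/20",
    "private-data-a": "10.0.64.0/24",
    "private-data-b": "10.0.65.0/24",
    "private-data-c": "10.0.66.0/24",
}

AWS_RESERVED = 5  # AWS reserves the first four addresses and the last one per subnet


def find_overlaps(plan):
    """Return every pair of subnet names whose ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def usable_in_aws(cidr):
    """Addresses left for workloads after AWS's per-subnet reservations."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED


inside_vpc = all(ipaddress.ip_network(c).subnet_of(VPC) for c in PLAN.values())
allocated = sum(ipaddress.ip_network(c).num_addresses for c in PLAN.values())

print("overlapping pairs:", find_overlaps(PLAN))
print("all inside VPC   :", inside_vpc)
print("unallocated      :", VPC.num_addresses - allocated)
for name, cidr in PLAN.items():
    print(f"{name:<15} {cidr:<14} usable in AWS: {usable_in_aws(cidr)}")
```

The plan has no overlaps and leaves 51,712 of the VPC's 65,536 addresses unallocated, which covers the "room for future expansion" requirement.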

Exercise 2: Load Balancer Configuration (Intermediate)

Objective: Configure NGINX as a Layer 7 reverse proxy with health checks and multiple backend pools.

Tasks:

  1. Set up NGINX with two upstream groups: api_servers and web_servers
  2. Route /api/* requests to api_servers using least_conn algorithm
  3. Route all other traffic to web_servers using round-robin
  4. Configure passive health checks (3 failures = unhealthy, 30s timeout)
  5. Add request headers: X-Real-IP, X-Forwarded-For, X-Request-ID

Tags: NGINX, Load Balancing, Reverse Proxy

# Exercise 2 Solution: NGINX reverse proxy config
# Writing under /etc/nginx normally requires root, hence sudo tee
sudo tee /etc/nginx/conf.d/loadbalancer.conf > /dev/null <<'EOF'
upstream api_servers {
    least_conn;
    server 10.0.10.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.10.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.10.3:8080 max_fails=3 fail_timeout=30s;
}

upstream web_servers {
    server 10.0.11.1:3000 max_fails=3 fail_timeout=30s;
    server 10.0.11.2:3000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name example.com;

    # API traffic -> api_servers
    location /api/ {
        proxy_pass http://api_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Request-ID $request_id;
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }

    # All other traffic -> web_servers
    location / {
        proxy_pass http://web_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Request-ID $request_id;
    }
}
EOF

# Test configuration
sudo nginx -t

# Reload without downtime
sudo nginx -s reload
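
The interaction between `least_conn` and the passive health check parameters can be modelled in a few lines. This is a simplified sketch and not NGINX's actual implementation: the `Backend` class and `pick_least_conn` function are hypothetical, timestamps are plain numbers of seconds, and the model only captures the documented idea that `max_fails` failures within `fail_timeout` take a server out of rotation for `fail_timeout`.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    address: str
    max_fails: int = 3
    fail_timeout: float = 30.0
    active_connections: int = 0
    recent_failures: list = field(default_factory=list)  # failure timestamps
    down_until: float = 0.0

    def record_failure(self, now: float) -> None:
        # Only failures inside the fail_timeout window count toward max_fails.
        self.recent_failures = [
            t for t in self.recent_failures if now - t < self.fail_timeout
        ]
        self.recent_failures.append(now)
        if len(self.recent_failures) >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.recent_failures.clear()

    def available(self, now: float) -> bool:
        return now >= self.down_until


def pick_least_conn(backends, now: float):
    """Choose the available backend with the fewest active connections."""
    candidates = [b for b in backends if b.available(now)]
    if not candidates:
        return None  # no live upstreams
    return min(candidates, key=lambda b: b.active_connections)


pool = [Backend("10.0.10.1:8080"), Backend("10.0.10.2:8080"), Backend("10.0.10.3:8080")]
pool[0].active_connections = 4
pool[1].active_connections = 1
pool[2].active_connections = 2

print(pick_least_conn(pool, now=0).address)   # least loaded backend

for timestamp in (1, 2, 3):                   # three failures within 30 seconds
    pool[1].record_failure(now=timestamp)

print(pick_least_conn(pool, now=4).address)   # failed backend is skipped
print(pick_least_conn(pool, now=40).address)  # back in rotation after fail_timeout
```

In this run the second server is chosen first, skipped after its third failure, and chosen again once the 30-second window has passed. Passive checks only learn from real client traffic, so the first few requests to a failing server still see errors or retries.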

Exercise 3: DNS Troubleshooting (Intermediate)

Objective: Debug DNS resolution issues using command-line tools.

Scenario: Your application reports "could not resolve host: api.internal.example.com". Debug the issue systematically.

Tasks:

  1. Check local DNS resolver configuration (/etc/resolv.conf)
  2. Query the configured DNS server directly
  3. Trace the full resolution path
  4. Check if the record exists on the authoritative nameserver
  5. Verify TTL and caching behavior

Tags: DNS, Troubleshooting, dig

# Exercise 3 Solution: DNS troubleshooting workflow

# Step 1: Check local resolver config
cat /etc/resolv.conf
# Look for: nameserver, search domain, options

# Step 2: Query the configured DNS server
# Anchor the match so commented-out "nameserver" lines are ignored
dig @"$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)" api.internal.example.com

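# Step 3: Try different DNS servers. Public resolvers cannot see private zones, so
# NXDOMAIN from them alongside an answer from the VPC resolver points to an internal-only record.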
dig @8.8.8.8 api.internal.example.com          # Google Public DNS
dig @1.1.1.1 api.internal.example.com          # Cloudflare DNS
dig @169.254.169.253 api.internal.example.com  # AWS VPC DNS resolver

# Step 4: Trace full resolution path
dig +trace api.internal.example.com

# Step 5: Find authoritative nameserver and query it directly
dig NS example.com
dig @ns1.example.com api.internal.example.com

# Step 6: Check if it's a caching issue (compare TTLs)
dig +norecurse @8.8.8.8 api.internal.example.com  # Cached?
dig +nocmd +noall +answer api.internal.example.com # Show TTL

# Step 7: Flush local DNS cache (if needed)
sudo resolvectl flush-caches         # systemd-resolved (older releases: systemd-resolve --flush-caches)
sudo killall -HUP mDNSResponder      # macOS
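
Step 1 of the workflow can be automated when the same check has to run on many hosts. The sketch below parses resolver configuration text into its main directives; `parse_resolv_conf` is a hypothetical helper, the sample content is invented for illustration, and only the `nameserver`, `search`, `domain`, and `options` keywords are handled.

```python
def parse_resolv_conf(text: str) -> dict:
    """Extract nameservers, search domains, and options from resolv.conf text."""
    config = {"nameservers": [], "search": [], "options": []}
    for line in text.splitlines():
        # Lines starting with '#' or ';' are comments.
        if line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        keyword, values = fields[0], fields[1:]
        if keyword == "nameserver":
            config["nameservers"].append(values[0])
        elif keyword in ("search", "domain"):
            config["search"] = values  # the last search/domain line wins
        elif keyword == "options":
            config["options"].extend(values)
    return config


# Invented sample resembling a file written by a DHCP client inside a VPC.
SAMPLE = """\
# Generated by the DHCP client
;nameserver 192.0.2.53
nameserver 10.0.0.2
nameserver 169.254.169.253
search ec2.internal internal.example.com
options timeout:2 attempts:3
"""

config = parse_resolv_conf(SAMPLE)
print("nameservers:", config["nameservers"])
print("search     :", config["search"])
print("options    :", config["options"])
```

On a real host the text would come from reading `/etc/resolv.conf`. An empty `nameservers` list, or a `search` list that lacks the internal domain, is often enough to explain a "could not resolve host" error before any `dig` query is run.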

Exercise 4: Full VPC with Terraform (Advanced)

Objective: Deploy a complete production-ready VPC using Terraform with all networking components.

Tasks:

  1. Create a VPC with DNS support enabled
  2. Create 2 public subnets, 2 private-app subnets, 2 private-data subnets across 2 AZs
  3. Set up Internet Gateway, NAT Gateways (one per AZ for HA), and route tables
  4. Configure security groups for: ALB (public), web servers (from ALB only), database (from web only)
  5. Output all subnet IDs and security group IDs

Tags: Terraform, VPC, IaC, Advanced

# Exercise 4 Solution: Complete production VPC module
# main.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  azs         = slice(data.aws_availability_zones.available.names, 0, 2)
  vpc_cidr    = "10.0.0.0/16"
  environment = "production"
}

resource "aws_vpc" "main" {
  cidr_block           = local.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags = { Name = "${local.environment}-vpc" }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "${local.environment}-igw" }
}

resource "aws_subnet" "public" {
  count                   = length(local.azs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(local.vpc_cidr, 8, count.index + 1)
  availability_zone       = local.azs[count.index]
  map_public_ip_on_launch = true
  tags = { Name = "public-${local.azs[count.index]}", Tier = "public" }
}

resource "aws_subnet" "private_app" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(local.vpc_cidr, 4, count.index + 2)
  availability_zone = local.azs[count.index]
  tags = { Name = "private-app-${local.azs[count.index]}", Tier = "private" }
}

resource "aws_subnet" "private_data" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(local.vpc_cidr, 8, count.index + 64)
  availability_zone = local.azs[count.index]
  tags = { Name = "private-data-${local.azs[count.index]}", Tier = "data" }
}

resource "aws_eip" "nat" {
  count  = length(local.azs)
  domain = "vpc"
  tags   = { Name = "nat-eip-${local.azs[count.index]}" }
}

resource "aws_nat_gateway" "main" {
  count         = length(local.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  tags          = { Name = "nat-${local.azs[count.index]}" }
  depends_on    = [aws_internet_gateway.main]
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  tags = { Name = "public-rt" }
}

resource "aws_route_table" "private" {
  count  = length(local.azs)
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
  tags = { Name = "private-rt-${local.azs[count.index]}" }
}

resource "aws_route_table_association" "public" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private_app" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.private_app[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

resource "aws_route_table_association" "private_data" {
  count          = length(local.azs)
  subnet_id      = aws_subnet.private_data[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

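# Task 4: security groups for ALB (public), web (from ALB only), database (from web only)
resource "aws_security_group" "alb" {
  name        = "${local.environment}-alb"
  description = "Public ALB"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTP from the internet"
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS from the internet"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound"
  }
}

resource "aws_security_group" "web" {
  name        = "${local.environment}-web"
  description = "Web servers, reachable from the ALB only"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "HTTP from ALB"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound"
  }
}

resource "aws_security_group" "db" {
  name        = "${local.environment}-db"
  description = "Database, reachable from web servers only"
  vpc_id      = aws_vpc.main.id

  # Assumption: PostgreSQL on 5432; change the port for your database engine
  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.web.id]
    description     = "Database traffic from web servers"
  }
}

# Task 5: outputs
output "vpc_id" { value = aws_vpc.main.id }
output "public_subnet_ids" { value = aws_subnet.public[*].id }
output "private_app_subnet_ids" { value = aws_subnet.private_app[*].id }
output "private_data_subnet_ids" { value = aws_subnet.private_data[*].id }
output "alb_security_group_id" { value = aws_security_group.alb.id }
output "web_security_group_id" { value = aws_security_group.web.id }
output "db_security_group_id" { value = aws_security_group.db.id }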

Conclusion & Next Steps

Networking is the connective tissue that binds all infrastructure components together. In this article, we've covered:

  • Fundamentals: OSI model, IP addressing, CIDR, routing, switching, firewalls, and NAT
  • SDN: Control/data plane separation, overlay networks (VXLAN, Geneve), and cloud SDN implementations
  • Load Balancing: L4 vs L7, algorithms, health checks, and cloud load balancer services
  • DNS: Resolution process, record types, cloud DNS, and service discovery patterns
  • Cloud Networking: VPC design, security groups/NACLs, transit gateways, and hybrid connectivity

Remember: Good network design is invisible; you only notice it when it fails. Invest time in proper subnet planning, redundant paths, and defense in depth. These decisions are expensive to change later.

Next in the Series

In Part 6: Infrastructure Storage, we explore block, object, and file storage fundamentals, RAID configurations, storage protocols (iSCSI, NFS, S3 API), cloud storage tiers, and data lifecycle management.