Service Deep Dive
In Part 4, we introduced Kubernetes Services as stable endpoints for ephemeral pods. Now we'll dig into how they actually work at the kernel level — the iptables rules, IPVS tables, and EndpointSlices that make virtual IPs real.
ClusterIP Internals — How Virtual IPs Work
When you create a Service, this is what happens under the hood:
flowchart TD
    A[Pod A sends packet to<br/>ClusterIP 10.96.45.123:80] --> B[Packet hits iptables<br/>PREROUTING chain]
    B --> C[KUBE-SERVICES chain<br/>matches ClusterIP]
    C --> D[KUBE-SVC-xxx chain<br/>probability-based selection]
    D -->|33%| E[DNAT to Pod 1<br/>10.244.1.15:8080]
    D -->|33%| F[DNAT to Pod 2<br/>10.244.2.22:8080]
    D -->|34%| G[DNAT to Pod 3<br/>10.244.3.8:8080]
    E --> H[Packet delivered to<br/>selected Pod]
    F --> H
    G --> H
# See the actual iptables rules kube-proxy creates
# (-S prints rules in the "-A ..." form shown here; output abbreviated):
sudo iptables -t nat -S KUBE-SERVICES | grep "inventory"
# -A KUBE-SERVICES -d 10.96.45.123/32 -p tcp --dport 80 -j KUBE-SVC-ABCDEF
# The KUBE-SVC chain implements load balancing via probability.
# Rules are tried in order: 1/3 of packets match the first rule, half of the
# remaining 2/3 match the second, and the rest fall through to the third,
# so each pod receives about a third of new connections:
sudo iptables -t nat -S KUBE-SVC-ABCDEF
# -A KUBE-SVC-ABCDEF -m statistic --mode random --probability 0.333 -j KUBE-SEP-POD1
# -A KUBE-SVC-ABCDEF -m statistic --mode random --probability 0.500 -j KUBE-SEP-POD2
# -A KUBE-SVC-ABCDEF -j KUBE-SEP-POD3
# Each KUBE-SEP chain performs the actual DNAT:
sudo iptables -t nat -S KUBE-SEP-POD1
# -A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.15:8080
# With IPVS mode (better for large clusters):
sudo ipvsadm -Ln | grep -A 5 "10.96.45.123"
# TCP 10.96.45.123:80 rr
# -> 10.244.1.15:8080 Masq 1 0 0
# -> 10.244.2.22:8080 Masq 1 0 0
# -> 10.244.3.8:8080 Masq 1 0 0
# Note: the ipvsadm output above only appears when kube-proxy runs in IPVS mode (--proxy-mode=ipvs).
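The per-rule probabilities in the KUBE-SVC chain look uneven (0.333, 0.500, then an unconditional rule), yet the resulting split is uniform. The following is a minimal sketch of that arithmetic, not kube-proxy code; the function names are made up for illustration, and it assumes rule i of n matches with probability 1/(n - i), which is consistent with the three rules shown above.

```python
from fractions import Fraction


def chain_probabilities(n_endpoints: int) -> list[Fraction]:
    """Per-rule match probability: rule i (0-based) matches with 1 / (n - i)."""
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]


def effective_share(rule_probs: list[Fraction]) -> list[Fraction]:
    """Overall share of traffic each rule receives when rules are tried in order."""
    shares = []
    remaining = Fraction(1)  # fraction of packets that reach the current rule
    for p in rule_probs:
        shares.append(remaining * p)
        remaining *= 1 - p  # packets that did not match fall through
    return shares


rules = chain_probabilities(3)
shares = effective_share(rules)
print([str(p) for p in rules])   # ['1/3', '1/2', '1']
print([str(s) for s in shares])  # ['1/3', '1/3', '1/3']
```

The same calculation works for any number of endpoints, which is why the first rule's probability shrinks as a Service gains pods.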
Session Affinity & External Traffic Policy
By default, each request is independently load-balanced. Sometimes you need requests from the same client to consistently reach the same pod:
# Session affinity: same client IP → same pod (for 1 hour)
apiVersion: v1
kind: Service
metadata:
  name: shopping-cart
spec:
  selector:
    app: cart
  ports:
    - port: 80
      targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600 # 1 hour sticky sessions
---
# externalTrafficPolicy: Local preserves the client source IP
# (NodePort/LoadBalancer services only)
apiVersion: v1
kind: Service
metadata:
  name: web-frontend
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local # No SNAT, so the client IP is preserved
  selector:
    app: frontend
  ports:
    - port: 443
      targetPort: 8443
# With "Cluster" (default): traffic may hop between nodes, client IP is lost (SNAT)
# With "Local": traffic only goes to pods on the receiving node
# Pro: client IP preserved, no extra network hop
# Con: uneven load distribution if pods aren't evenly spread across nodes
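To make the "uneven load distribution" trade-off concrete, here is a small worked example. It is a sketch under one stated assumption: the external load balancer spreads traffic evenly across the nodes that host at least one ready pod (nodes without a local pod fail the load balancer's health check and receive nothing). The node names and pod counts are hypothetical.

```python
def per_pod_share(pods_per_node: dict[str, int]) -> dict[str, float]:
    """Share of total traffic handled by each pod on a node with externalTrafficPolicy: Local."""
    # Only nodes with a local pod receive traffic from the load balancer.
    serving = {node: count for node, count in pods_per_node.items() if count > 0}
    node_share = 1 / len(serving)  # assumption: the LB balances evenly per node
    return {node: node_share / count for node, count in serving.items()}


# Hypothetical placement: 4 pods spread unevenly over 3 nodes.
shares = per_pod_share({"worker-1": 1, "worker-2": 3, "worker-3": 0})
for node, share in shares.items():
    print(f"{node}: each pod handles {share:.1%} of traffic")
```

In this placement the single pod on worker-1 carries three times the load of each pod on worker-2, which is why spreading pods evenly (for example with topology spread constraints) matters when using Local.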
Headless Services for StatefulSets
A headless service (clusterIP: None) skips the virtual IP entirely. DNS returns the individual pod IPs directly, giving clients full control over which pod they connect to:
# Headless service for a Kafka StatefulSet:
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
spec:
  clusterIP: None # ← No virtual IP allocated
  selector:
    app: kafka
  ports:
    - port: 9092
      name: broker
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless # Links StatefulSet to headless service
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
# DNS returns individual pod IPs (not a virtual IP):
nslookup kafka-headless.default.svc.cluster.local
# Name: kafka-headless.default.svc.cluster.local
# Address: 10.244.1.15
# Address: 10.244.2.22
# Address: 10.244.3.8
# Each pod gets a stable DNS name (StatefulSet guarantee):
nslookup kafka-0.kafka-headless.default.svc.cluster.local
# Address: 10.244.1.15
nslookup kafka-1.kafka-headless.default.svc.cluster.local
# Address: 10.244.2.22
# Kafka brokers use these stable DNS names for cluster membership.
# Even if kafka-0 restarts with a new IP, its DNS name stays the same.
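Because the per-pod names follow a fixed pattern (pod name, headless service, namespace, cluster domain), clients can compute them instead of discovering them. The helper below is an illustrative sketch; the function name is made up, and it assumes the default cluster domain cluster.local and the default namespace used in the lookups above.

```python
def statefulset_pod_dns(
    name: str,
    service: str,
    replicas: int,
    namespace: str = "default",
    cluster_domain: str = "cluster.local",
) -> list[str]:
    """Build the stable DNS name of every pod in a StatefulSet backed by a headless Service."""
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


brokers = statefulset_pod_dns("kafka", "kafka-headless", replicas=3)
# A client-side bootstrap list, using the broker port from the Service above.
bootstrap_servers = ",".join(f"{host}:9092" for host in brokers)
print(bootstrap_servers)
```

A list like this can be passed to a Kafka client as its bootstrap servers; it stays valid across pod restarts because only the IPs behind the names change.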
EndpointSlices (Replacing Legacy Endpoints)
# View EndpointSlices for a service:
kubectl get endpointslices -l kubernetes.io/service-name=inventory
# NAME ADDRESSTYPE PORTS ENDPOINTS AGE
# inventory-abc12 IPv4 8080 10.244.1.15,10.244.2.22... 5m
# inventory-def34 IPv4 8080 10.244.3.8,10.244.4.11... 5m
# Detailed view of a single slice:
kubectl describe endpointslice inventory-abc12
# Endpoints:
# - Addresses: 10.244.1.15
# Conditions: Ready=true, Serving=true, Terminating=false
# TargetRef: Pod/inventory-7d8f9c-abc12
# NodeName: worker-1
# Zone: us-east-1a
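The reason a Service can show several slices, as in the listing above, is that its endpoints are split into bounded chunks (100 endpoints per slice by default), so a pod change rewrites one small object instead of one very large one. The sketch below models only that chunking and the Ready condition; it is a simplification, not the controller's actual packing logic, and the generated addresses are made up.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str
    ready: bool = True


def slice_endpoints(endpoints: list[Endpoint], max_per_slice: int = 100) -> list[list[Endpoint]]:
    """Split endpoints into chunks of at most max_per_slice entries."""
    return [endpoints[i:i + max_per_slice] for i in range(0, len(endpoints), max_per_slice)]


def routable_addresses(slices: list[list[Endpoint]]) -> list[str]:
    """Addresses a proxy would load-balance to: ready endpoints across all slices."""
    return [ep.address for chunk in slices for ep in chunk if ep.ready]


# 250 hypothetical pod IPs, one of which is not ready.
endpoints = [Endpoint(f"10.244.{i // 250}.{i % 250 + 1}") for i in range(250)]
endpoints[7].ready = False

slices = slice_endpoints(endpoints)
print([len(chunk) for chunk in slices])  # [100, 100, 50]
print(len(routable_addresses(slices)))   # 249
```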
Ingress
Ingress Resource vs Ingress Controller
A Service exposes pods inside the cluster. But how do external users reach your application? You could use type: LoadBalancer (one cloud LB per service — expensive), or you could use Ingress — a single entry point that routes external HTTP/HTTPS traffic to multiple services based on hostname or path.
flowchart LR
    A[External Client<br/>browser/mobile] -->|HTTPS| B[Cloud Load Balancer<br/>AWS ALB / GCP LB]
    B -->|NodePort| C[Ingress Controller Pod<br/>nginx/Traefik/HAProxy]
    C -->|Host: api.example.com<br/>Path: /users| D[users-service<br/>ClusterIP]
    C -->|Host: api.example.com<br/>Path: /orders| E[orders-service<br/>ClusterIP]
    C -->|Host: app.example.com| F[frontend-service<br/>ClusterIP]
    D --> G[Pod 1]
    D --> H[Pod 2]
    E --> I[Pod 3]
    F --> J[Pod 4]
    F --> K[Pod 5]
TLS Termination & Routing
# Ingress resource with host-based and path-based routing + TLS:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rpm: "100" # Max requests per minute per client IP
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
        - app.example.com
      secretName: example-tls-cert # TLS cert stored as K8s Secret
  rules:
    # Host-based routing:
    - host: api.example.com
      http:
        paths:
          # Path-based routing within a host:
          - path: /users
            pathType: Prefix
            backend:
              service:
                name: users-service
                port:
                  number: 80
          - path: /orders
            pathType: Prefix
            backend:
              service:
                name: orders-service
                port:
                  number: 80
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80
# Deploy nginx ingress controller:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx --create-namespace
# Verify controller is running:
kubectl get pods -n ingress-nginx
# NAME READY STATUS AGE
# ingress-nginx-controller-5d88495688-abc12 1/1 Running 2m
# Check Ingress status (should show ADDRESS after LB provisions):
kubectl get ingress main-ingress
# NAME CLASS HOSTS ADDRESS PORTS AGE
# main-ingress nginx api.example.com,app.example.com 34.120.5.67 80, 443 5m
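# Test routing (--resolve pins each hostname to the LB address, so SNI and
# certificate validation still see the real hostname):
curl --resolve api.example.com:443:34.120.5.67 https://api.example.com/users
curl --resolve app.example.com:443:34.120.5.67 https://app.example.com/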
Ingress Controller Comparison
| Controller | Proxy Engine | Config via | Best For | Limitations |
|---|---|---|---|---|
| ingress-nginx | NGINX | Annotations + ConfigMap | General purpose, battle-tested | Config reloads can interrupt long-lived connections |
| Traefik | Go native | IngressRoute CRD + annotations | Auto-discovery, Let's Encrypt auto | Performance lower than nginx at scale |
| HAProxy | HAProxy | Annotations + ConfigMap | High-performance TCP/HTTP, WebSocket | Steeper learning curve |
| Envoy (Contour) | Envoy | HTTPProxy CRD | Advanced routing, gRPC, rate limiting | More complex setup |
| AWS ALB Controller | AWS ALB | Annotations | AWS-native, WAF integration | AWS-only, vendor lock-in |
Gateway API
The Evolution Beyond Ingress
The Ingress resource has fundamental limitations that annotations can't fix:
- Annotation soup: Every controller invents its own annotations for features (rate limiting, auth, retries). Your Ingress YAML is non-portable.
- HTTP-only: No native support for TCP, UDP, gRPC, or TLS passthrough.
- No role separation: Infrastructure teams and app teams edit the same resource — no delegation model.
- Limited routing: Can't do header-based routing, traffic splitting, or request mirroring.
The Gateway API (its core resources reached GA with the v1.0 release in October 2023; it ships as CRDs, independent of the Kubernetes release cycle) addresses these issues with a role-oriented, extensible resource model:
flowchart TD
    A[GatewayClass<br/>Infra Provider defines<br/>controller implementation] --> B[Gateway<br/>Cluster Operator provisions<br/>listeners, ports, TLS]
    B --> C[HTTPRoute<br/>App Developer defines<br/>routing rules]
    B --> D[GRPCRoute<br/>App Developer defines<br/>gRPC routing]
    B --> E[TCPRoute<br/>App Developer defines<br/>TCP routing]
    C --> F[Service A]
    C --> G[Service B]
    D --> H[gRPC Service]
    E --> I[TCP Service]
    style A fill:#132440,color:#fff
    style B fill:#16476A,color:#fff
    style C fill:#3B9797,color:#fff
    style D fill:#3B9797,color:#fff
    style E fill:#3B9797,color:#fff
Gateway API Resources
# 1. GatewayClass: defines which controller implements Gateways
# (created once by infra provider, like StorageClass)
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: istio
spec:
  controllerName: istio.io/gateway-controller
---
# 2. Gateway: provisions actual infrastructure (load balancer, listeners)
# (created by cluster operator / platform team)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
  namespace: infra
spec:
  gatewayClassName: istio
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-cert
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true" # Only labeled namespaces can attach
    - name: http
      protocol: HTTP
      port: 80
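The allowedRoutes selector on the https listener is what implements delegation: a route may attach only if its namespace carries the required labels. The check itself is simple label matching, sketched below. The namespace names and their labels are hypothetical, and real implementations also support matchExpressions, which this sketch leaves out.

```python
# Hypothetical namespaces and their labels (not read from a real cluster).
NAMESPACE_LABELS = {
    "orders-team": {"gateway-access": "true", "team": "orders"},
    "sandbox": {"team": "experiments"},
}

# Mirrors listeners[].allowedRoutes.namespaces.selector.matchLabels on the Gateway.
LISTENER_MATCH_LABELS = {"gateway-access": "true"}


def namespace_may_attach(namespace: str) -> bool:
    """A route's namespace qualifies only if it carries every required label."""
    labels = NAMESPACE_LABELS.get(namespace, {})
    return all(labels.get(key) == value for key, value in LISTENER_MATCH_LABELS.items())


for namespace in NAMESPACE_LABELS:
    verdict = "may attach" if namespace_may_attach(namespace) else "rejected"
    print(f"{namespace}: {verdict}")
```

The platform team controls access by labeling namespaces; application teams never need write access to the Gateway itself.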
HTTPRoute & Traffic Splitting
# 3. HTTPRoute: app developer defines routing (in their own namespace)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-route
  namespace: orders-team # App team's namespace
spec:
  parentRefs:
    - name: production-gateway
      namespace: infra # References Gateway in another namespace
  hostnames:
    - "api.example.com"
  rules:
    # Path-based routing with header matching:
    - matches:
        - path:
            type: PathPrefix
            value: /orders
          headers:
            - name: x-api-version
              value: "v2"
      backendRefs:
        - name: orders-v2
          port: 80
    # Default route (no header match):
    - matches:
        - path:
            type: PathPrefix
            value: /orders
      backendRefs:
        # Traffic splitting for a canary deployment:
        - name: orders-v1
          port: 80
          weight: 90 # 90% to stable
        - name: orders-v2
          port: 80
          weight: 10 # 10% to canary
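Weights in backendRefs are proportions rather than percentages: each backend receives weight divided by the sum of all weights. The following sketch shows one straightforward way to implement such a weighted pick. It illustrates the idea only and is not how any specific Gateway implementation selects backends.

```python
import random
from collections import Counter

# Backend names and weights from the default rule of the HTTPRoute above.
BACKENDS = [("orders-v1", 90), ("orders-v2", 10)]


def pick_backend(backends: list[tuple[str, int]], rng: random.Random) -> str:
    """Pick a backend with probability weight / sum(weights)."""
    total = sum(weight for _, weight in backends)
    point = rng.random() * total  # uniform in [0, total)
    cumulative = 0
    for name, weight in backends:
        cumulative += weight
        if point < cumulative:
            return name
    raise ValueError("backends must have a positive total weight")


rng = random.Random(42)  # seeded so repeated runs give the same counts
counts = Counter(pick_backend(BACKENDS, rng) for _ in range(10_000))
print(counts)
```

Over 10,000 simulated requests the canary should receive roughly a tenth of them; a backend with weight 0 is never selected.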
Service Mesh Architecture
Why Service Meshes Exist
As your cluster grows from 5 services to 50 to 500, you face increasingly complex requirements:
- Security: Every service-to-service call should be encrypted (mTLS) and authorized. Implementing this in every service's code is unsustainable.
- Observability: You need latency metrics, success rates, and distributed traces for every call — without instrumenting every service.
- Traffic control: Canary deployments, circuit breaking, retries, and rate limiting should be configurable per-route, not hardcoded.
- Multi-language: Your services are in Go, Java, Python, Rust — you can't enforce consistent behavior with per-language libraries.
A service mesh solves this by moving all networking concerns out of application code and into infrastructure (sidecar proxies).
Data Plane vs Control Plane
flowchart TB
    subgraph Control Plane
        CP[Control Plane<br/>istiod / linkerd-control]
    end
    subgraph Data Plane
        subgraph Pod A
            A1[App Container<br/>orders-service] <--> A2[Sidecar Proxy<br/>Envoy/linkerd2-proxy]
        end
        subgraph Pod B
            B1[App Container<br/>payment-service] <--> B2[Sidecar Proxy<br/>Envoy/linkerd2-proxy]
        end
        subgraph Pod C
            C1[App Container<br/>inventory-service] <--> C2[Sidecar Proxy<br/>Envoy/linkerd2-proxy]
        end
    end
    CP -->|Config push:<br/>routing rules, mTLS certs,<br/>retry policies| A2
    CP -->|Config push| B2
    CP -->|Config push| C2
    A2 -->|mTLS encrypted| B2
    A2 -->|mTLS encrypted| C2
    B2 -->|mTLS encrypted| C2
    A2 -->|Telemetry:<br/>latency, errors, traces| CP
    B2 -->|Telemetry| CP
    C2 -->|Telemetry| CP
Istio
Istio Architecture
Istio is the most feature-rich service mesh, used in production by companies like Airbnb, eBay, and Salesforce. Its architecture consists of:
- Envoy sidecars: High-performance C++ proxy, hot-restartable, extensive filter chain for traffic manipulation.
- istiod: Unified control plane (merged Pilot, Citadel, and Galley). Handles service discovery, config distribution, certificate management.
- Istio CNI: Optional — replaces the init container that modifies iptables with a CNI plugin (no privileged containers needed).
# Install Istio with the demo profile (includes all features):
# Pin the version so the extracted directory matches the cd below:
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.22.0 sh -
cd istio-1.22.0
export PATH=$PWD/bin:$PATH
istioctl install --set profile=demo -y
# ✔ Istio core installed
# ✔ Istiod installed
# ✔ Ingress gateways installed
# ✔ Egress gateways installed
# Enable automatic sidecar injection for a namespace:
kubectl label namespace production istio-injection=enabled
# Verify sidecars are injected (2/2 means app + sidecar):
kubectl get pods -n production
# NAME READY STATUS AGE
# orders-5d88495688-abc12 2/2 Running 30s ← 2 containers!
# payment-7f8c9d-def34 2/2 Running 30s
# Check mesh status:
istioctl proxy-status
# NAME CDS LDS EDS RDS ISTIOD
# orders-5d88495688-abc12 SYNCED SYNCED SYNCED SYNCED istiod-xyz
# payment-7f8c9d-def34 SYNCED SYNCED SYNCED SYNCED istiod-xyz
Traffic Management — VirtualService & DestinationRule
# VirtualService: Define HOW traffic is routed
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-routing
  namespace: production
spec:
  hosts:
    - orders # Kubernetes service name
  http:
    # Route based on header (internal testing):
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: orders
            subset: canary
    # Default: traffic splitting (90/10 canary):
    - route:
        - destination:
            host: orders
            subset: stable
          weight: 90
        - destination:
            host: orders
            subset: canary
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: "5xx,reset,connect-failure"
      timeout: 10s
---
# DestinationRule: Define WHERE traffic goes (subsets + connection policies)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders-destination
  namespace: production
spec:
  host: orders
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 50
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
mTLS & Security
# Enable strict mTLS for the entire mesh:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system # Mesh-wide policy
spec:
  mtls:
    mode: STRICT # Reject any non-mTLS traffic
---
# Authorization policy: only orders-service can call payment-service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-access
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - "cluster.local/ns/production/sa/orders-service"
      to:
        - operation:
            methods: ["POST"]
            paths: ["/api/charge"]
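Once an ALLOW policy selects a workload, a request to that workload is admitted only if at least one rule matches; everything else is denied. The sketch below models that evaluation for the payment-access policy. It is deliberately simplified: it handles exact values only (Istio also supports prefix and suffix wildcards) and ignores DENY and CUSTOM policies. The Request type and function name are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    principal: str  # identity taken from the caller's mTLS certificate
    method: str
    path: str


# One rule, mirroring the payment-access AuthorizationPolicy above.
RULES = [
    {
        "principals": {"cluster.local/ns/production/sa/orders-service"},
        "methods": {"POST"},
        "paths": {"/api/charge"},
    }
]


def is_allowed(request: Request, rules: list[dict] = RULES) -> bool:
    """ALLOW semantics: admit the request only if some rule matches every field."""
    return any(
        request.principal in rule["principals"]
        and request.method in rule["methods"]
        and request.path in rule["paths"]
        for rule in rules
    )


orders = "cluster.local/ns/production/sa/orders-service"
print(is_allowed(Request(orders, "POST", "/api/charge")))  # True
print(is_allowed(Request(orders, "GET", "/api/charge")))   # False
```

Because the principal comes from the mTLS certificate, this kind of policy only works reliably when mTLS is enforced between the two workloads.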
# Verify mTLS is active between services.
# `istioctl authn tls-check` is not available in recent Istio releases (including 1.22);
# use the describe command and look for the effective PeerAuthentication / mTLS mode:
istioctl x describe pod orders-5d88495688-abc12 -n production
# Inject a fault to test resilience (500 error on 50% of requests):
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-fault
  namespace: production
spec:
  hosts:
    - payment
  http:
    - fault:
        abort:
          httpStatus: 500
          percentage:
            value: 50
      route:
        - destination:
            host: payment
EOF
# Observe how the circuit breaker in orders-service responds
Linkerd
Lightweight Architecture
Linkerd takes a fundamentally different approach: simplicity over features. Its micro-proxy (linkerd2-proxy) is written in Rust, uses ~10MB of memory (vs Envoy's ~50-100MB), and adds <1ms p99 latency.
# Install Linkerd:
curl -fsL https://run.linkerd.io/install | sh
export PATH=$HOME/.linkerd2/bin:$PATH
# Check prerequisites:
linkerd check --pre
# Install control plane:
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
# Wait for control plane to be ready:
linkerd check
# √ control-plane-version
# √ linkerd-existence
# √ linkerd-config
# Status check results are √
# Inject sidecars into a namespace:
kubectl get deploy -n production -o yaml | linkerd inject - | kubectl apply -f -
# Or annotate namespace for auto-injection:
kubectl annotate namespace production linkerd.io/inject=enabled
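# Install the viz extension (required by the dashboard, routes, and tap commands below):
linkerd viz install | kubectl apply -f -
# View real-time traffic metrics:
linkerd viz dashboard &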
# Per-route success rate and latency:
linkerd viz routes deploy/orders -n production
# ROUTE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
# POST /api/orders 99.2% 45 12ms 35ms 78ms
# GET /api/orders/{id} 100.0% 120 5ms 12ms 25ms
# Live traffic tap (see individual requests):
linkerd viz tap deploy/orders -n production
# req id=0:1 src=10.244.1.15:52341 dst=10.244.2.22:8080 :method=POST :path=/api/orders
# rsp id=0:1 src=10.244.2.22:8080 dst=10.244.1.15:52341 :status=200 latency=15ms
Linkerd vs Istio
Service Mesh Feature Comparison
| Feature | Istio | Linkerd |
|---|---|---|
| Proxy | Envoy (C++, ~50MB RAM) | linkerd2-proxy (Rust, ~10MB RAM) |
| Latency overhead | ~3-5ms p99 | <1ms p99 |
| mTLS | ✅ Automatic, configurable | ✅ Automatic, always-on |
| Traffic splitting | ✅ VirtualService weights | ✅ HTTPRoute weights (TrafficSplit via the SMI extension) |
| Circuit breaking | ✅ DestinationRule outlierDetection | ✅ Built-in (simpler config) |
| Fault injection | ✅ Delays, aborts | ❌ Not native (emulated by splitting traffic to a faulty backend) |
| Multi-cluster | ✅ Full federation | ✅ Multi-cluster link |
| Egress control | ✅ ServiceEntry + egress gateway | ❌ Limited |
| Complexity | High (steep learning curve) | Low (opinionated defaults) |
| CNCF status | Graduated | Graduated |
Choose Istio when you need advanced traffic manipulation (fault injection, complex routing rules, egress control) and your team can manage the complexity. Choose Linkerd when you want mTLS + observability with minimal overhead and operational burden.
Traffic Management Patterns
Canary & Blue-Green Deployments
Service meshes make sophisticated deployment strategies declarative:
# Canary: Gradually shift traffic from v1 to v2
# Week 1: 95/5 split
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-canary
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
            subset: stable # version: v1
          weight: 95
        - destination:
            host: orders
            subset: canary # version: v2
          weight: 5
# Week 2: If metrics look good, shift to 80/20
# Week 3: 50/50
# Week 4: 0/100 (promote canary to stable)
---
# Blue-Green: Instant switchover (all-or-nothing)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-blue-green
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
            subset: blue # Current production (v1)
          weight: 100
        - destination:
            host: orders
            subset: green # New version (v2), 0% until switch
          weight: 0
# To switch: flip weights to 0/100. Rollback: flip back to 100/0.
Circuit Breaking & Retries with Budgets
# Circuit breaking via Istio DestinationRule:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-circuit-breaker
spec:
  host: payment
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 50 # Max concurrent TCP connections
      http:
        http1MaxPendingRequests: 25 # Requests queued when all connections busy
        http2MaxRequests: 100 # Max concurrent HTTP/2 requests
        maxRequestsPerConnection: 10 # Recycle connections after 10 requests
        maxRetries: 3 # Max concurrent retries across all hosts
    outlierDetection:
      consecutive5xxErrors: 3 # Eject after 3 consecutive 5xx errors
      interval: 10s # Check every 10 seconds
      baseEjectionTime: 30s # Base ejection time, multiplied by the number of times the host has been ejected
      maxEjectionPercent: 30 # Never eject more than 30% of hosts
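The interaction between consecutive5xxErrors, baseEjectionTime, and maxEjectionPercent is easier to follow as a small state machine. The model below is a simplified sketch, not Envoy's implementation: it ignores the periodic interval sweep, treats the ejection duration as the base time multiplied by how often the host has been ejected, and always permits at least one ejection. Class, method, and host names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0


class OutlierDetector:
    def __init__(self, names, threshold=3, base_ejection_s=30.0, max_ejection_percent=30):
        self.hosts = {name: Host(name) for name in names}
        self.threshold = threshold
        self.base_ejection_s = base_ejection_s
        # Cap on simultaneously ejected hosts (simplification: at least one).
        self.max_ejected = max(1, len(names) * max_ejection_percent // 100)

    def healthy(self, now: float) -> list[str]:
        """Hosts currently eligible for load balancing."""
        return [h.name for h in self.hosts.values() if h.ejected_until <= now]

    def record(self, name: str, status: int, now: float) -> None:
        """Feed one response status into the detector."""
        host = self.hosts[name]
        if status < 500:
            host.consecutive_5xx = 0  # any non-5xx response resets the streak
            return
        host.consecutive_5xx += 1
        currently_ejected = len(self.hosts) - len(self.healthy(now))
        if host.consecutive_5xx >= self.threshold and currently_ejected < self.max_ejected:
            host.times_ejected += 1
            host.ejected_until = now + self.base_ejection_s * host.times_ejected
            host.consecutive_5xx = 0


detector = OutlierDetector([f"payment-{i}" for i in range(10)])
for _ in range(3):
    detector.record("payment-0", 503, now=0.0)
print(detector.hosts["payment-0"].ejected_until)  # 30.0
print(len(detector.healthy(now=10.0)))            # 9
```

A host that keeps failing after it returns is ejected for longer each time, while the percentage cap keeps a widespread failure from emptying the load-balancing pool.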
Traffic Mirroring (Shadowing)
Mirror production traffic to a new version without affecting real users — the gold standard for testing with real load:
# Mirror 100% of traffic to canary (responses are discarded):
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-mirror
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
            subset: stable
          weight: 100
      mirror:
        host: orders
        subset: canary
      mirrorPercentage:
        value: 100.0
# Real users always get responses from "stable"
# Canary receives a copy of all requests — you monitor its metrics
# (latency, error rate, resource consumption) under real load
# without any risk to production traffic.
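The essential property of mirroring is that the shadow copy can never change what the caller sees. The sketch below captures that contract in plain Python. It is a conceptual model only: a real proxy sends the mirrored request asynchronously, whereas this version calls the shadow inline for simplicity, and all function names are made up.

```python
import random


def handle_with_mirror(request, primary, shadow, mirror_percent=100.0, rng=random.random):
    """Return the primary's response; optionally send a copy of the request to the shadow."""
    response = primary(request)
    if rng() * 100 < mirror_percent:
        try:
            shadow(request)  # the shadow's response is discarded
        except Exception:
            pass  # a failing shadow must never affect the caller
    return response


def stable(request):
    return f"200 OK from stable for {request}"


seen_by_canary = []


def canary(request):
    seen_by_canary.append(request)
    raise RuntimeError("bug in the new version")


print(handle_with_mirror("GET /api/orders/1", stable, canary))
print(seen_by_canary)  # ['GET /api/orders/1']
```

Even though the canary raises on every request, the caller still receives the stable response, and the canary's failures show up only in its own metrics and logs.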
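# Monitor mirrored traffic in the canary pods. Envoy appends "-shadow" to the
# Host/Authority header of mirrored requests, so this works if the app logs that header:
kubectl logs -f deploy/orders-v2 -n production | grep -- "-shadow"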
# Compare metrics between stable and canary:
istioctl dashboard kiali
# Kiali shows side-by-side success rates, latency distributions,
# and error patterns for both versions.
# Linkerd equivalent — check golden metrics:
linkerd viz stat deploy/orders-v2 -n production
# NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P99
# orders-v2 1/1 99.8% 45 8ms 42ms
Exercises
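- Inspect the NAT rules kube-proxy generates for one of your Services (sudo iptables -t nat -L KUBE-SERVICES -n). Identify the probability-based load balancing chain. Delete one pod and observe how the iptables rules update. Bonus: Switch your cluster to IPVS mode and compare with ipvsadm -Ln.
- Deploy two versions (v1 and v2) of a service behind a 90/10 traffic split, then use a load generator (hey or fortio) to send 1000 requests. Verify that ~10% reach v2. Gradually shift to 50/50, then 100% v2. Monitor metrics throughout. Finally, configure a circuit breaker and test it by making v2 return 500 errors.
- Enable strict mTLS in your mesh and verify the mTLS state with istioctl (istioctl x describe pod on current releases; istioctl authn tls-check exists only in older ones). Examine the certificate chain with openssl s_client connecting to the Envoy sidecar port.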
Conclusion
Kubernetes Services, Ingress, and Service Meshes form a layered networking stack: Services provide stable in-cluster endpoints, Ingress (and increasingly Gateway API) exposes applications externally with sophisticated routing, and Service Meshes handle the hard problems of service-to-service communication — mTLS, observability, traffic management, and resilience — without modifying application code.
The Gateway API is actively replacing Ingress as the standard for external traffic management, while service meshes are becoming table stakes for production Kubernetes deployments with more than a handful of services.
In Part 10, we'll explore Kubernetes Storage — Persistent Volumes, StorageClasses, CSI drivers, and the patterns for running stateful workloads (databases, message queues) on a platform designed for stateless containers.