Why Service Discovery Exists
Static vs Dynamic Environments
In traditional infrastructure, services lived at known, fixed IP addresses. You'd configure database.internal:5432 in your application config and it would work for years. The server was a pet — named, maintained, and irreplaceable.
In container environments, everything changes constantly:
- Pods get new IPs every time they restart
- Scaling creates/destroys instances dynamically
- Rolling deployments replace pods one by one
- Node failures cause pods to reschedule elsewhere
The Ephemeral Container Problem
# Watch Kubernetes pod IPs change in real time:
kubectl get pods -o wide -w
# NAME READY STATUS IP NODE
# inventory-7d8f9c-abc12 1/1 Running 10.244.1.15 worker-1
# inventory-7d8f9c-def34 1/1 Running 10.244.2.22 worker-2
# Delete a pod — Kubernetes creates a new one with a new IP:
kubectl delete pod inventory-7d8f9c-abc12
# NAME READY STATUS IP NODE
# inventory-7d8f9c-xyz99 1/1 Running 10.244.3.8 worker-3 ← NEW IP!
# The payment service can't track these changes manually.
# It needs an abstraction layer: a stable name that always resolves
# to the current healthy instances. That's service discovery.
Discovery Methods
DNS-Based Discovery
The simplest approach: register services as DNS records. Clients look up a hostname, DNS returns the current IP(s). When instances change, DNS records are updated.
sequenceDiagram
participant Client
participant DNS as DNS Server
participant S1 as Service Instance 1
participant S2 as Service Instance 2
Client->>DNS: Resolve "inventory.svc"
DNS->>Client: [10.244.1.15, 10.244.2.22]
Client->>S1: HTTP GET /api/stock
S1->>Client: 200 OK {"stock": 42}
Note over S1: Instance 1 crashes
Client->>DNS: Resolve "inventory.svc"
DNS->>Client: [10.244.2.22] (updated)
Client->>S2: HTTP GET /api/stock
S2->>Client: 200 OK {"stock": 42}
Advantages: Universal (every language has DNS support), no special client libraries needed.
Disadvantages: DNS TTL caching causes stale entries (clients may use dead IPs for seconds/minutes). DNS doesn't support health checking natively. Round-robin DNS provides poor load balancing.
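The lookup step in the diagram above can be sketched from the client side with nothing more than the standard library. This is a minimal illustration, not a production resolver: the function name `resolve_all` is an assumption, and the results reflect whatever the local resolver (and its caches) currently returns, which is exactly where stale entries come from.

```python
import socket


def resolve_all(hostname: str, port: int) -> list[str]:
    """Return the de-duplicated IP addresses the resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses: list[str] = []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    for *_, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

A client would call `resolve_all("inventory.svc", 80)` before connecting and pick one of the returned addresses. Re-resolving on connection failure, rather than holding on to the first answer, is one way to limit the damage from cached records.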
Registry-Based Discovery
A dedicated service registry maintains a real-time map of service names to healthy instances. Services register themselves on startup and deregister on shutdown. Clients query the registry to find available instances.
| Registry | Protocol | Health Checks | Used By |
|---|---|---|---|
| Consul | HTTP/DNS | Active (TCP, HTTP, gRPC) | HashiCorp ecosystem |
| etcd | gRPC | Lease-based TTL | Kubernetes |
| ZooKeeper | Custom TCP | Ephemeral nodes | Kafka (before KRaft mode), Hadoop |
| Eureka | HTTP REST | Client heartbeats | Netflix/Spring Cloud |
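The register, heartbeat, and lookup cycle described above can be sketched as a toy in-memory registry. This is a simplified model under stated assumptions, not how Consul, etcd, ZooKeeper, or Eureka are implemented: the class name `ServiceRegistry`, the TTL rule, and the injectable `clock` are all hypothetical choices made to keep the example small and testable.

```python
import time
from typing import Callable


class ServiceRegistry:
    """Toy registry: an instance stays listed only while it keeps heartbeating."""

    def __init__(self, ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {instance address -> time of last heartbeat}
        self._instances: dict[str, dict[str, float]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, {})[address] = self._clock()

    def heartbeat(self, service: str, address: str) -> None:
        # A heartbeat simply refreshes the registration timestamp.
        self.register(service, address)

    def deregister(self, service: str, address: str) -> None:
        self._instances.get(service, {}).pop(address, None)

    def lookup(self, service: str) -> list[str]:
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        entries = self._instances.get(service, {})
        return sorted(a for a, seen in entries.items() if now - seen <= self._ttl)
```

An instance that crashes without deregistering simply stops heartbeating and drops out of `lookup` results once its TTL passes, which is the same idea as the lease-based and heartbeat-based health checks in the table.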
Kubernetes Service Discovery
Kubernetes combines both approaches elegantly. A Service object provides a stable virtual IP (ClusterIP) and DNS name that automatically routes to healthy pod backends:
# Kubernetes Service creates a stable discovery endpoint:
apiVersion: v1
kind: Service
metadata:
name: inventory
namespace: production
spec:
selector:
app: inventory # Routes to pods with this label
ports:
- port: 80 # Service port (what clients connect to)
targetPort: 8080 # Pod port (where the app listens)
type: ClusterIP # Internal-only (default)
# After creating this Service, any pod in the cluster can reach it via:
# 1. DNS name (recommended):
curl http://inventory.production.svc.cluster.local/api/stock
curl http://inventory.production/api/stock # Short form (works from any namespace)
curl http://inventory/api/stock # Shortest (same namespace)
# 2. ClusterIP (stable virtual IP):
kubectl get svc inventory -n production
# NAME TYPE CLUSTER-IP PORT(S)
# inventory ClusterIP 10.96.45.123 80/TCP
# 3. Environment variables (legacy, injected at pod startup):
# INVENTORY_SERVICE_HOST=10.96.45.123
# INVENTORY_SERVICE_PORT=80
# Kubernetes automatically:
# - Watches for pods matching selector "app: inventory"
# - Maintains the Endpoints object (EndpointSlices in current versions) with current pod IPs
# - Updates kube-proxy rules to route traffic
# - Removes pods that fail their readiness probes from the endpoint list
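The naming rules behind the DNS name and the legacy environment variables are mechanical enough to express in a few lines. The sketch below assumes the default cluster domain `cluster.local`; the helper names are hypothetical and are not part of any Kubernetes client library.

```python
import os
from typing import Mapping, Optional


def service_fqdn(name: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified DNS name of a Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


def service_env_prefix(name: str) -> str:
    """Service names are upper-cased and dashes become underscores in env vars."""
    return name.upper().replace("-", "_")


def service_url_from_env(name: str,
                         environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return an http URL from the injected variables, or None if they are absent."""
    prefix = service_env_prefix(name)
    host = environ.get(f"{prefix}_SERVICE_HOST")
    port = environ.get(f"{prefix}_SERVICE_PORT")
    if host is None or port is None:
        return None
    return f"http://{host}:{port}"
```

For the Service above, `service_fqdn("inventory", "production")` yields the same name used in the first `curl` command. The environment variables only exist for Services created before the pod started, which is one reason the DNS name is the recommended path.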
Communication Patterns
RPC & REST
The two dominant synchronous communication styles in distributed systems:
| Aspect | REST | RPC |
|---|---|---|
| Paradigm | Resource-oriented (nouns) | Action-oriented (verbs) |
| Protocol | HTTP/1.1 or HTTP/2 | Various (HTTP/2, TCP, custom) |
| Data format | JSON (human-readable) | Protobuf/binary (efficient) |
| Contract | OpenAPI/Swagger (optional) | IDL (mandatory, e.g., .proto files) |
| Best for | External APIs, CRUD | Internal services, high performance |
gRPC
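gRPC is an open-source RPC framework that originated at Google and is widely used in Kubernetes and cloud-native systems. It uses HTTP/2 for transport and Protocol Buffers for serialization. Compact binary encoding and multiplexed connections usually make it faster than JSON over REST for internal service communication. Figures of 7–10x are often quoted, but the actual gain depends on payload shape, language runtime, and network conditions.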
# Protocol Buffer definition (inventory.proto):
# syntax = "proto3";
#
# service InventoryService {
# rpc GetStock (StockRequest) returns (StockResponse);
# rpc UpdateStock (UpdateRequest) returns (UpdateResponse);
# rpc WatchStock (StockRequest) returns (stream StockUpdate); // Server streaming
# }
#
# message StockRequest {
# string product_id = 1;
# }
#
# message StockResponse {
# string product_id = 1;
# int32 quantity = 2;
# string warehouse = 3;
# }
# Key gRPC features for distributed systems:
# 1. Bidirectional streaming (real-time updates)
# 2. Built-in deadlines/timeouts (propagate across services)
# 3. Retries with backoff (configured through a retry policy in the service config)
# 4. Load balancing (client-side or via proxy)
# 5. Strong typing (compile-time contract enforcement)
# The Kubernetes API Server exposes a REST API to clients, but talks to etcd over gRPC
# The kubelet also uses gRPC to talk to the container runtime (CRI)
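To make the serialization point concrete, the sketch below hand-encodes the StockResponse message from inventory.proto in the Protocol Buffers wire format and compares its size with compact JSON. The helper names and sample values are illustrative assumptions. Real services would use classes generated by `protoc`, and this sketch handles only strings and non-negative integers.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    tag = encode_varint((field_number << 3) | 2)  # wire type 2 = length-delimited
    return tag + encode_varint(len(data)) + data


def encode_int_field(field_number: int, value: int) -> bytes:
    tag = encode_varint((field_number << 3) | 0)  # wire type 0 = varint
    return tag + encode_varint(value)


def encode_stock_response(product_id: str, quantity: int, warehouse: str) -> bytes:
    # Field numbers 1, 2, 3 match the StockResponse definition above.
    return (encode_string_field(1, product_id)
            + encode_int_field(2, quantity)
            + encode_string_field(3, warehouse))


binary = encode_stock_response("sku-1", 42, "east")
as_json = json.dumps(
    {"product_id": "sku-1", "quantity": 42, "warehouse": "east"},
    separators=(",", ":"),
).encode("utf-8")
print(len(binary), len(as_json))
```

With these sample values the binary form is 15 bytes and the compact JSON is 55 bytes, mostly because field names are replaced by one-byte tags. Payload size is only one ingredient of performance, so this is not a throughput benchmark.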
Message Queues
Asynchronous communication decouples producers from consumers. The sender doesn't wait for a response — it publishes a message and moves on. This provides temporal decoupling (services don't need to be online simultaneously) and load leveling (queues absorb traffic spikes).
flowchart LR
subgraph Synchronous
A1[Order Service] -->|HTTP/gRPC| B1[Inventory Service]
B1 -->|Response| A1
end
subgraph Asynchronous
A2[Order Service] -->|Publish| Q[Message Queue]
Q -->|Subscribe| B2[Inventory Service]
Q -->|Subscribe| C2[Notification Service]
Q -->|Subscribe| D2[Analytics Service]
end
| System | Pattern | Ordering | Use Case |
|---|---|---|---|
| Apache Kafka | Log-based (persistent) | Per-partition | Event streaming, audit logs |
| RabbitMQ | Broker (message queue) | Per-queue (FIFO) | Task queues, request buffering |
| NATS | Pub/Sub (lightweight) | Best-effort | Cloud-native, IoT, K8s events |
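The fan-out shown in the asynchronous half of the diagram can be sketched with an in-process stand-in for a broker. This is a toy model under stated assumptions: `InMemoryBroker` is a hypothetical class, messages live only in memory, and there is no persistence, acknowledgement, or partitioning as there would be in Kafka, RabbitMQ, or NATS.

```python
import queue
from collections import defaultdict


class InMemoryBroker:
    """Toy pub/sub broker: every subscriber to a topic gets its own copy."""

    def __init__(self) -> None:
        # topic -> list of subscriber inboxes
        self._subscribers: dict[str, list[queue.Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> queue.Queue:
        inbox: queue.Queue = queue.Queue()
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic: str, message: dict) -> int:
        """Deliver message to each inbox and return how many received it."""
        inboxes = self._subscribers[topic]
        for inbox in inboxes:
            inbox.put(message)  # the publisher returns without waiting for consumers
        return len(inboxes)
```

The order service would call `publish("orders", {...})` and move on, while the inventory, notification, and analytics services each drain their own inbox at their own pace. That separation is what provides the load leveling described above.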
Resilience Patterns
Networks fail. Services crash. Distributed communication must be designed for failure, not just success.
Retries & Exponential Backoff
Transient failures (network blips, temporary overload) often resolve themselves. Retrying failed requests is the simplest resilience pattern — but naive retries can amplify failures:
# BAD: Immediate retries during an outage
# Service is overloaded, responding slowly
# 100 clients retry instantly → 200 requests → more overload → 400 requests → crash
# This is a "retry storm" — retries make the problem worse
# GOOD: Exponential backoff with jitter
# Attempt 1: fails → wait 100ms + random(0-50ms)
# Attempt 2: fails → wait 200ms + random(0-100ms)
# Attempt 3: fails → wait 400ms + random(0-200ms)
# Attempt 4: fails → wait 800ms + random(0-400ms)
# Attempt 5: fails → give up, return error
# The "jitter" prevents all clients from retrying at the same moment
# (thundering herd problem)
# Kubernetes uses this pattern for:
# - kubelet registering with API Server
# - Controller retry loops
# - CrashLoopBackOff (10s → 20s → 40s → ... → 5min cap)
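The backoff schedule above can be sketched as a small retry helper. The function names, the 50% jitter factor, and the injectable `sleep` and `rng` parameters are assumptions chosen to mirror the example timings and keep the code testable. A real client would normally retry only on errors known to be transient rather than on every exception.

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng=random.random):
    """Yield the wait before each retry: base * 2^n plus up to 50% jitter."""
    for n in range(attempts - 1):
        delay = min(cap, base * (2 ** n))
        yield delay + rng() * delay * 0.5


def call_with_retries(operation, attempts: int = 5, sleep=time.sleep,
                      **backoff_kwargs):
    """Call operation until it succeeds or the attempts are used up."""
    delays = backoff_delays(attempts, **backoff_kwargs)
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(next(delays))
```

With the defaults, five attempts produce four waits starting at roughly 100 ms, 200 ms, 400 ms, and 800 ms, each stretched by a random amount so that clients which failed together do not retry together.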
Timeouts
Without timeouts, a slow downstream service causes the caller to hang indefinitely, consuming connections, threads, and memory. Always set timeouts at every integration point:
- Connection timeout: How long to wait establishing a TCP connection (~1–5s)
- Request timeout: How long to wait for a response (~1–30s depending on operation)
- Deadline propagation: Subtract elapsed time from the overall budget at each hop
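Deadline propagation in particular is easier to see in code. The sketch below assumes a single overall budget created when the request enters the system; the `Deadline` class and its method names are hypothetical, and frameworks such as gRPC carry the equivalent value in request metadata instead.

```python
import time


class Deadline:
    """Overall time budget for a request, shared across every downstream hop."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_next_call(self, per_call_cap: float) -> float:
        """Timeout for the next hop: the smaller of the cap and what is left."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("deadline exceeded before the call was made")
        return min(per_call_cap, left)
```

A handler would create `Deadline(3.0)` on entry and pass `deadline.timeout_for_next_call(2.0)` to each outgoing request, so a slow first hop automatically shrinks the time allowed for later hops instead of letting the total exceed the caller's budget.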
Circuit Breakers
A circuit breaker stops a client from repeatedly calling a failing service, giving that service time to recover. The pattern is named after electrical circuit breakers, which cut the current to prevent overload. It moves between three states:
stateDiagram-v2
[*] --> Closed
Closed --> Open : Failure threshold exceeded
Open --> HalfOpen : Timeout expires
HalfOpen --> Closed : Test request succeeds
HalfOpen --> Open : Test request fails
note right of Closed : Normal operation\nAll requests pass through
note right of Open : Fast-fail all requests\nReturn error immediately
note right of HalfOpen : Allow one test request\nDecide based on result
- Closed: Normal operation. Requests flow through. Track failure rate.
- Open: Failure threshold exceeded. Immediately return errors without calling the service. Wait for a cooldown period.
- Half-Open: Allow one test request through. If it succeeds, close the circuit (service recovered). If it fails, open again.
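The three states can be sketched as a small wrapper around any callable. This is a simplified, single-threaded illustration: the class names are hypothetical, it counts consecutive failures rather than a failure rate, and it does not restrict the half-open state to exactly one concurrent trial as a production library would.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is not open

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; downstream is cooling down")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial in half-open, or too many failures while closed,
            # (re)opens the circuit and restarts the cooldown.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each outgoing request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback or return an error immediately, rather than waiting on a service that is known to be struggling.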
Bulkheads
Named after ship compartments that prevent a leak from sinking the whole vessel. In software, bulkheads isolate failures so one slow dependency doesn't consume all resources:
- Thread pool isolation: Each downstream service gets its own thread pool. If Service A is slow and exhausts its pool, Service B's pool remains unaffected.
- Connection pool limits: Cap connections per downstream service.
- In Kubernetes: Resource limits (CPU/memory) on pods prevent one workload from starving others. Namespaces with ResourceQuotas provide cluster-level bulkheads.
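The connection-limit flavor of a bulkhead can be sketched with a semaphore per downstream dependency. The `Bulkhead` class and its reject-when-full behavior are assumptions for illustration; other designs queue the caller for a bounded time instead of rejecting at once.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all slots for a dependency are already in use."""


class Bulkhead:
    """Caps the number of concurrent calls to one downstream dependency."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up callers that are waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return operation()
        finally:
            self._slots.release()
```

Giving Service A and Service B separate `Bulkhead` instances means that when every slot for A is occupied by slow calls, requests to B still find free slots in their own compartment.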
Exercises
Create a Service whose selector is app: backend. Deploy 3 replica pods with that label. From another pod, verify you can reach the service by DNS name. Kill one pod and confirm traffic still flows. Observe the Endpoints object updating in real time with kubectl get endpoints -w.
Conclusion
Service discovery and resilient communication are the circulatory system of distributed applications. Kubernetes solves discovery elegantly through Services (stable DNS names + virtual IPs), while resilience patterns (retries, timeouts, circuit breakers) ensure communication survives the inevitable failures of distributed networks.
In Part 5, we'll explore failure and resilience at the system level — node failures, network partitions, cascading failures, and the self-healing patterns that make distributed systems robust.