
Part 4: Service Discovery & Communication

May 14, 2026 Wasil Zafar 35 min read

Containers are ephemeral — IPs constantly change. How do services find each other, and how do you build reliable communication in an unreliable network?

Table of Contents

  1. Why Service Discovery Exists
  2. Discovery Methods
  3. Communication Patterns
  4. Resilience Patterns
  5. Exercises
  6. Conclusion

Why Service Discovery Exists

Static vs Dynamic Environments

In traditional infrastructure, services lived at known, fixed IP addresses. You'd configure database.internal:5432 in your application config and it would work for years. The server was a pet — named, maintained, and irreplaceable.

In container environments, everything changes constantly:

  • Pods get new IPs every time they restart
  • Scaling creates/destroys instances dynamically
  • Rolling deployments replace pods one by one
  • Node failures cause pods to reschedule elsewhere

The Ephemeral Container Problem

The Core Problem: If a payment service needs to call an inventory service, and the inventory service's IP changes every time it restarts, how does the payment service know where to send requests? Hardcoding IPs is impossible when containers live for minutes or hours, not months.
# Watch Kubernetes pod IPs change in real-time:
kubectl get pods -o wide -w
# NAME                     READY   STATUS    IP           NODE
# inventory-7d8f9c-abc12   1/1     Running   10.244.1.15  worker-1
# inventory-7d8f9c-def34   1/1     Running   10.244.2.22  worker-2

# Delete a pod — Kubernetes creates a new one with a new IP:
kubectl delete pod inventory-7d8f9c-abc12
# NAME                     READY   STATUS    IP           NODE
# inventory-7d8f9c-xyz99   1/1     Running   10.244.3.8   worker-3  ← NEW IP!

# The payment service can't track these changes manually.
# It needs an abstraction layer: a stable name that always resolves
# to the current healthy instances. That's service discovery.

Discovery Methods

DNS-Based Discovery

The simplest approach: register services as DNS records. Clients look up a hostname, DNS returns the current IP(s). When instances change, DNS records are updated.

DNS-Based Service Discovery
sequenceDiagram
    participant Client
    participant DNS as DNS Server
    participant S1 as Service Instance 1
    participant S2 as Service Instance 2
    
    Client->>DNS: Resolve "inventory.svc"
    DNS->>Client: [10.244.1.15, 10.244.2.22]
    Client->>S1: HTTP GET /api/stock
    S1->>Client: 200 OK {"stock": 42}
    Note over S1: Instance 1 crashes
    Client->>DNS: Resolve "inventory.svc"
    DNS->>Client: [10.244.2.22] (updated)
    Client->>S2: HTTP GET /api/stock
    S2->>Client: 200 OK {"stock": 42}
                            

Advantages: Universal (every language has DNS support), no special client libraries needed.

Disadvantages: DNS TTL caching causes stale entries (clients may use dead IPs for seconds/minutes). DNS doesn't support health checking natively. Round-robin DNS provides poor load balancing.
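
To make the lookup step concrete, here is a minimal sketch of what a client does when it resolves a service name. It uses only Python's standard `socket` module; the function name `resolve_service` is a hypothetical helper, not part of any library. Note that it resolves on every call, so any caching (and therefore staleness) comes from the resolver and DNS TTLs, not from this code.

```python
import socket


def resolve_service(hostname: str, port: int) -> list[str]:
    """Return the unique IP addresses DNS currently reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Inside a cluster you might call `resolve_service("inventory.production.svc.cluster.local", 80)`. A client that wants to notice instance changes has to call it again periodically rather than holding on to the first answer.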

Registry-Based Discovery

A dedicated service registry maintains a real-time map of service names to healthy instances. Services register themselves on startup and deregister on shutdown. Clients query the registry to find available instances.

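| Registry  | Protocol   | Health Checks            | Used By              |
|-----------|------------|--------------------------|----------------------|
| Consul    | HTTP/DNS   | Active (TCP, HTTP, gRPC) | HashiCorp ecosystem  |
| etcd      | gRPC       | Lease-based TTL          | Kubernetes           |
| ZooKeeper | Custom TCP | Ephemeral nodes          | Kafka, Hadoop        |
| Eureka    | HTTP REST  | Client heartbeats        | Netflix/Spring Cloud |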
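
The register / heartbeat / lookup cycle can be sketched as a toy in-memory registry. This is an illustration under stated assumptions, not how Consul, etcd, ZooKeeper, or Eureka are implemented: the class name `ServiceRegistry`, the 10-second TTL, and the injectable `clock` are all hypothetical choices made for the example.

```python
import time
from typing import Callable


class ServiceRegistry:
    """In-memory registry: instances stay listed only while they keep heartbeating."""

    def __init__(self, ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {instance address -> time of last heartbeat}
        self._instances: dict[str, dict[str, float]] = {}

    def register(self, service: str, address: str) -> None:
        """Called on startup, and again as a heartbeat to renew the lease."""
        self._instances.setdefault(service, {})[address] = self._clock()

    def deregister(self, service: str, address: str) -> None:
        """Called on graceful shutdown."""
        self._instances.get(service, {}).pop(address, None)

    def lookup(self, service: str) -> list[str]:
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        live = self._instances.get(service, {})
        return sorted(addr for addr, seen in live.items() if now - seen <= self._ttl)
```

The TTL is what handles crashes: an instance that dies without deregistering simply stops heartbeating and drops out of `lookup` once its lease expires.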

Kubernetes Service Discovery

Kubernetes combines both approaches elegantly. A Service object provides a stable virtual IP (ClusterIP) and DNS name that automatically routes to healthy pod backends:

# Kubernetes Service creates a stable discovery endpoint:
apiVersion: v1
kind: Service
metadata:
  name: inventory
  namespace: production
spec:
  selector:
    app: inventory      # Routes to pods with this label
  ports:
    - port: 80          # Service port (what clients connect to)
      targetPort: 8080  # Pod port (where the app listens)
  type: ClusterIP       # Internal-only (default)

# After creating this Service, any pod in the cluster can reach it via:
# 1. DNS name (recommended):
curl http://inventory.production.svc.cluster.local/api/stock   # Fully qualified (any namespace)
curl http://inventory.production/api/stock   # Short form (works from any namespace)
curl http://inventory/api/stock              # Shortest (same namespace only)

# 2. ClusterIP (stable virtual IP):
kubectl get svc inventory -n production
# NAME        TYPE        CLUSTER-IP     PORT(S)
# inventory   ClusterIP   10.96.45.123   80/TCP

# 3. Environment variables (legacy, injected at pod startup):
# INVENTORY_SERVICE_HOST=10.96.45.123
# INVENTORY_SERVICE_PORT=80

# Kubernetes automatically:
# - Watches for pods matching selector "app: inventory"
# - Maintains Endpoints/EndpointSlice objects with current pod IPs
# - Updates kube-proxy rules to route traffic
# - Removes pods that fail readiness probes from the endpoint list

How It Works Internally: When you create a Service, the cluster DNS (CoreDNS) adds an A record that maps the Service name to its ClusterIP. kube-proxy on every node programs iptables or IPVS rules that intercept traffic to the ClusterIP and load-balance it across the IPs of ready pods. When pods die or new ones start, the endpoints are typically updated within seconds and the routing rules are refreshed automatically.
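
The selection logic described above can be modeled in a few lines. This is a simplified sketch, not Kubernetes source code: the `Pod` dataclass, `select_endpoints`, and `service_dns_name` are hypothetical names, and the default cluster domain `cluster.local` is an assumption that matches the examples in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    ip: str
    labels: dict[str, str] = field(default_factory=dict)
    ready: bool = True  # outcome of the readiness probe


def service_dns_name(service: str, namespace: str,
                     cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def select_endpoints(pods: list[Pod], selector: dict[str, str]) -> list[str]:
    """Return IPs of ready pods whose labels contain every selector pair."""
    return [
        pod.ip
        for pod in pods
        if pod.ready and all(pod.labels.get(k) == v for k, v in selector.items())
    ]
```

A pod that matches the selector but fails its readiness probe is left out of the result, which is why clients of the Service never see it.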

Communication Patterns

RPC & REST

The two dominant synchronous communication styles in distributed systems:

| Aspect      | REST                       | RPC                                 |
|-------------|----------------------------|-------------------------------------|
| Paradigm    | Resource-oriented (nouns)  | Action-oriented (verbs)             |
| Protocol    | HTTP/1.1 or HTTP/2         | Various (HTTP/2, TCP, custom)       |
| Data format | JSON (human-readable)      | Protobuf/binary (efficient)         |
| Contract    | OpenAPI/Swagger (optional) | IDL (mandatory, e.g., .proto files) |
| Best for    | External APIs, CRUD        | Internal services, high performance |

gRPC

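gRPC is an open-source RPC framework that originated at Google and is widely used in Kubernetes and cloud-native systems. It uses HTTP/2 for transport and Protocol Buffers for serialization, which typically makes it faster and more compact than JSON over REST for internal service communication. Speedups of 7–10x are often quoted, but the actual gain depends on payload size, language, and workload, so benchmark your own services before relying on a number.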

// Protocol Buffer definition (inventory.proto):
syntax = "proto3";

service InventoryService {
  rpc GetStock (StockRequest) returns (StockResponse);
  rpc UpdateStock (UpdateRequest) returns (UpdateResponse);
  rpc WatchStock (StockRequest) returns (stream StockUpdate);  // Server streaming
}

message StockRequest {
  string product_id = 1;
}

message StockResponse {
  string product_id = 1;
  int32 quantity = 2;
  string warehouse = 3;
}

// UpdateRequest, UpdateResponse, and StockUpdate are omitted here for brevity;
// they must be defined before this file compiles.

# Key gRPC features for distributed systems:
# 1. Streaming, including bidirectional (real-time updates)
# 2. Built-in deadlines/timeouts (propagate across services)
# 3. Configurable retries with backoff (enabled through the service config)
# 4. Load balancing (client-side or via proxy)
# 5. Strong typing (compile-time contract enforcement)

# Where Kubernetes uses gRPC:
# - The API server talks to etcd over gRPC (the etcd v3 client API)
# - The kubelet talks to container runtimes over gRPC (CRI)

Message Queues

Asynchronous communication decouples producers from consumers. The sender doesn't wait for a response — it publishes a message and moves on. This provides temporal decoupling (services don't need to be online simultaneously) and load leveling (queues absorb traffic spikes).

Synchronous vs Asynchronous Communication
flowchart LR
    subgraph Synchronous
        A1[Order Service] -->|HTTP/gRPC| B1[Inventory Service]
        B1 -->|Response| A1
    end
    subgraph Asynchronous
        A2[Order Service] -->|Publish| Q[Message Queue]
        Q -->|Subscribe| B2[Inventory Service]
        Q -->|Subscribe| C2[Notification Service]
        Q -->|Subscribe| D2[Analytics Service]
    end
                            
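| System       | Pattern                | Ordering         | Use Case                       |
|--------------|------------------------|------------------|--------------------------------|
| Apache Kafka | Log-based (persistent) | Per-partition    | Event streaming, audit logs    |
| RabbitMQ     | Broker (message queue) | Per-queue (FIFO) | Task queues, request buffering |
| NATS         | Pub/Sub (lightweight)  | Best-effort      | Cloud-native, IoT, K8s events  |
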
When to Use Each: Use synchronous (REST/gRPC) when the caller needs an immediate response (user-facing requests, queries). Use asynchronous (queues) when the work can be deferred, when you need fan-out to multiple consumers, or when you need to survive downstream service outages.
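
The fan-out case is easy to see in a toy in-process broker. This sketch is a teaching model only and assumes nothing about Kafka, RabbitMQ, or NATS internals: `InMemoryBroker` and its methods are hypothetical names, there is no persistence, and everything runs in a single process.

```python
from collections import defaultdict, deque
from typing import Optional


class InMemoryBroker:
    """Toy broker: each subscriber gets its own queue per topic (fan-out)."""

    def __init__(self) -> None:
        # topic -> {subscriber name -> pending messages}
        self._queues: dict[str, dict[str, deque]] = defaultdict(dict)

    def subscribe(self, topic: str, subscriber: str) -> None:
        self._queues[topic].setdefault(subscriber, deque())

    def publish(self, topic: str, message: dict) -> int:
        """Enqueue a copy for every subscriber and return without waiting."""
        for queue in self._queues[topic].values():
            queue.append(dict(message))
        return len(self._queues[topic])

    def poll(self, topic: str, subscriber: str) -> Optional[dict]:
        """Hand the subscriber its oldest pending message, if any."""
        queue = self._queues[topic].get(subscriber)
        return queue.popleft() if queue else None
```

The publisher never learns whether a subscriber is running. A consumer that was down simply finds its messages waiting when it next polls, which is the temporal decoupling described earlier.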

Resilience Patterns

Networks fail. Services crash. Distributed communication must be designed for failure, not just success.

Retries & Exponential Backoff

Transient failures (network blips, temporary overload) often resolve themselves. Retrying failed requests is the simplest resilience pattern — but naive retries can amplify failures:

# BAD: Immediate retries during an outage
# Service is overloaded, responding slowly
# 100 clients retry instantly → 200 requests → more overload → 400 requests → crash
# This is a "retry storm" — retries make the problem worse

# GOOD: Exponential backoff with jitter
# Attempt 1: fails → wait 100ms + random(0-50ms)
# Attempt 2: fails → wait 200ms + random(0-100ms)
# Attempt 3: fails → wait 400ms + random(0-200ms)
# Attempt 4: fails → wait 800ms + random(0-400ms)
# Attempt 5: fails → give up, return error

# The "jitter" prevents all clients from retrying at the same moment
# (thundering herd problem)

# Kubernetes uses this pattern for:
# - kubelet registering with API Server
# - Controller retry loops
# - CrashLoopBackOff (10s → 20s → 40s → ... → 5min cap)
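
The schedule above translates into a small retry helper. This is a sketch under assumptions: the names `backoff_delay` and `call_with_retries` are hypothetical, the 30-second cap is an arbitrary choice, and retrying on every exception is a simplification (real code should retry only errors known to be transient and only operations that are safe to repeat).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based).

    The fixed part doubles each attempt (base, 2*base, 4*base, ...) up to `cap`;
    the jitter adds a random 0-50% of that fixed part.
    """
    fixed = min(cap, base * 2 ** (attempt - 1))
    return fixed + rng() * fixed / 2


def call_with_retries(operation: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Run `operation`, retrying failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` and `rng` as parameters keeps the helper testable without real waiting, and the cap stops the delay from growing without bound on long outages.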

Timeouts

Without timeouts, a slow downstream service causes the caller to hang indefinitely, consuming connections, threads, and memory. Always set timeouts at every integration point:

  • Connection timeout: How long to wait establishing a TCP connection (~1–5s)
  • Request timeout: How long to wait for a response (~1–30s depending on operation)
  • Deadline propagation: Subtract elapsed time from the overall budget at each hop
The Timeout Dilemma: Too short = false positives (healthy-but-slow services look dead, triggering unnecessary retries). Too long = resource exhaustion (threads/connections held open for unresponsive services). There's no universally correct timeout — measure P99 latency and set timeouts slightly above it. Monitor and adjust.
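
Deadline propagation is the least obvious of the three, so here is a minimal sketch of the bookkeeping. The `Deadline` class and `DeadlineExceeded` exception are hypothetical names for illustration; frameworks such as gRPC provide their own deadline mechanisms.

```python
import time
from typing import Callable


class DeadlineExceeded(Exception):
    """Raised when the overall time budget has already been spent."""


class Deadline:
    """Tracks one overall time budget shared by every hop of a request."""

    def __init__(self, budget_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left in the budget (never negative)."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_next_call(self, per_call_cap: float) -> float:
        """Timeout to pass downstream: the smaller of the cap and what is left."""
        left = self.remaining()
        if left <= 0.0:
            raise DeadlineExceeded("time budget exhausted before the call started")
        return min(per_call_cap, left)
```

With a 2-second budget, a first call capped at 1.5s that takes 1.25s leaves only 0.75s for the second call. Failing before the call starts, once the budget is gone, avoids doing work whose result nobody is still waiting for.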

Circuit Breakers

A circuit breaker stops a client from repeatedly calling a failing service, giving that service time to recover. The name comes from electrical circuit breakers, which cut the current to prevent overload:

Circuit Breaker State Machine
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold exceeded
    Open --> HalfOpen : Timeout expires
    HalfOpen --> Closed : Test request succeeds
    HalfOpen --> Open : Test request fails
    
    note right of Closed : Normal operation\nAll requests pass through
    note right of Open : Fast-fail all requests\nReturn error immediately
    note right of HalfOpen : Allow one test request\nDecide based on result
                            
  • Closed: Normal operation. Requests flow through. Track failure rate.
  • Open: Failure threshold exceeded. Immediately return errors without calling the service. Wait for a cooldown period.
  • Half-Open: Allow one test request through. If it succeeds, close the circuit (service recovered). If it fails, open again.
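
The three states map directly onto a small class. This is a simplified sketch with hypothetical names (`CircuitBreaker`, `CircuitOpenError`): it counts consecutive failures rather than a failure rate, and it is not thread-safe, so the "one test request" rule only holds when calls are made one at a time.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the service while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self.state = "closed"
        self._failures = 0     # consecutive failures since the last success
        self._opened_at = 0.0  # when the circuit last opened

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self._cooldown:
                raise CircuitOpenError("failing fast while the service cools down")
            self.state = "half-open"  # cooldown over: let a test request through
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed test request, or too many failures while closed, opens the circuit.
            if self.state == "half-open" or self._failures >= self._threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self.state = "closed"
        self._failures = 0
        return result
```

Callers wrap each outbound request, for example `breaker.call(lambda: fetch_stock("sku-1"))` where `fetch_stock` is whatever client function you already have, and treat `CircuitOpenError` as the signal to use a fallback.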

Bulkheads

Bulkheads are named after the watertight compartments that keep a single leak from sinking a whole ship. In software, they isolate failures so that one slow dependency doesn't consume all resources:

  • Thread pool isolation: Each downstream service gets its own thread pool. If Service A is slow and exhausts its pool, Service B's pool remains unaffected.
  • Connection pool limits: Cap connections per downstream service.
  • In Kubernetes: Resource limits (CPU/memory) on pods prevent one workload from starving others. Namespaces with ResourceQuotas provide cluster-level bulkheads.
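
A per-dependency concurrency cap is one simple way to build the first two kinds of bulkhead. The sketch below uses a semaphore from Python's standard library; `Bulkhead` and `BulkheadFullError` are hypothetical names, and rejecting immediately (instead of queueing) is a design choice made for this example.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already reached."""


class Bulkhead:
    """Caps the number of concurrent calls to one downstream dependency."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation: Callable[[], T]) -> T:
        # Non-blocking acquire: reject at once rather than queue behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: all slots are in use")
        try:
            return operation()
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means that a slow inventory service can exhaust only its own slots, while calls to other services keep flowing.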

Exercises

Exercise 1 — Kubernetes Service Discovery: Create a Kubernetes Service of type ClusterIP that selects pods with label app: backend. Deploy 3 replica pods with that label. From another pod, verify you can reach the service by DNS name. Kill one pod and confirm traffic still flows. Observe the Endpoints object updating in real-time with kubectl get endpoints -w.
Exercise 2 — Circuit Breaker Design: Design a circuit breaker for a payment service calling a fraud detection API. Define: (a) What failure threshold triggers OPEN state (e.g., 5 failures in 10 seconds)? (b) How long is the cooldown before HALF-OPEN? (c) What does the payment service do when the circuit is OPEN (reject payment? allow with risk? queue for later)? Each choice has business implications.
Exercise 3 — Sync vs Async: An e-commerce checkout flow needs to: (a) Validate payment, (b) Decrement inventory, (c) Send confirmation email, (d) Update analytics. Which should be synchronous (blocking checkout) and which async (queued)? Justify your choices considering user experience, data consistency, and failure handling.

Conclusion

Service discovery and resilient communication are the circulatory system of distributed applications. Kubernetes solves discovery elegantly through Services (stable DNS names + virtual IPs), while resilience patterns (retries, timeouts, circuit breakers) ensure communication survives the inevitable failures of distributed networks.

In Part 5, we'll explore failure and resilience at the system level — node failures, network partitions, cascading failures, and the self-healing patterns that make distributed systems robust.