Service Discovery & Communication - Part 4

Parts 1–3 covered how distributed nodes store and agree on data. But agreeing on state is useless if services can't find each other and communicate reliably. This part focuses on the networking layer: how ephemeral services discover each other, the protocols they use to talk, and the resilience patterns that keep communication working when individual services fail.

Why Service Discovery Exists

Static vs Dynamic Environments

In traditional infrastructure, services lived at known, fixed IP addresses. You'd configure database.internal:5432 in your application config and it would work for years. The server was a pet — named, maintained, and irreplaceable.

In container environments, everything changes constantly:

Pods get new IPs every time they restart
Scaling creates/destroys instances dynamically
Rolling deployments replace pods one by one
Node failures cause pods to reschedule elsewhere

The Ephemeral Container Problem

                            
The Core Problem: If a payment service needs to call an inventory service, and the inventory service's IP changes every time it restarts, how does the payment service know where to send requests? Hardcoding IPs is impossible when containers live for minutes or hours, not months.

# Watch Kubernetes pod IPs change in real-time:
kubectl get pods -o wide -w
# NAME                     READY   STATUS    IP           NODE
# inventory-7d8f9c-abc12   1/1     Running   10.244.1.15  worker-1
# inventory-7d8f9c-def34   1/1     Running   10.244.2.22  worker-2

# Delete a pod — Kubernetes creates a new one with a new IP:
kubectl delete pod inventory-7d8f9c-abc12
# NAME                     READY   STATUS    IP           NODE
# inventory-7d8f9c-xyz99   1/1     Running   10.244.3.8   worker-3  ← NEW IP!

# The payment service can't track these changes manually.
# It needs an abstraction layer: a stable name that always resolves
# to the current healthy instances. That's service discovery.

Discovery Methods

DNS-Based Discovery

The simplest approach: register services as DNS records. Clients look up a hostname, DNS returns the current IP(s). When instances change, DNS records are updated.

DNS-Based Service Discovery

sequenceDiagram
    participant Client
    participant DNS as DNS Server
    participant S1 as Service Instance 1
    participant S2 as Service Instance 2
    
    Client->>DNS: Resolve "inventory.svc"
    DNS->>Client: [10.244.1.15, 10.244.2.22]
    Client->>S1: HTTP GET /api/stock
    S1->>Client: 200 OK {"stock": 42}
    Note over S1: Instance 1 crashes
    Client->>DNS: Resolve "inventory.svc"
    DNS->>Client: [10.244.2.22] (updated)
    Client->>S2: HTTP GET /api/stock
    S2->>Client: 200 OK {"stock": 42}

Advantages: Universal (every language has DNS support), no special client libraries needed.

Disadvantages: DNS TTL caching causes stale entries (clients may use dead IPs for seconds/minutes). DNS doesn't support health checking natively. Round-robin DNS provides poor load balancing.
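
The client side of DNS-based discovery can be sketched with Python's standard library alone. The helper name `resolve_all` is made up for this illustration; in a cluster you would pass a service hostname such as the `inventory.svc` name used in the diagram.

```python
import socket


def resolve_all(hostname: str, port: int) -> list[str]:
    """Return every distinct IPv4 address the resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (ip, port).
    return sorted({sockaddr[0] for *_, sockaddr in infos})


if __name__ == "__main__":
    # A literal IP needs no DNS server, so this line runs anywhere.
    print(resolve_all("127.0.0.1", 80))
```

Because resolvers and client libraries may cache answers for the record's TTL, a client that relies on this approach generally needs to re-resolve periodically or after a connection failure.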

Registry-Based Discovery

A dedicated service registry maintains a real-time map of service names to healthy instances. Services register themselves on startup and deregister on shutdown. Clients query the registry to find available instances.

Registry	Protocol	Health Checks	Used By
Consul	HTTP/DNS	Active (TCP, HTTP, gRPC)	HashiCorp ecosystem
etcd	gRPC	Lease-based TTL	Kubernetes
ZooKeeper	Custom TCP	Ephemeral nodes	Kafka (before KRaft), Hadoop
Eureka	HTTP REST	Client heartbeats	Netflix/Spring Cloud
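
The register, heartbeat, and lookup cycle described above can be sketched as a small in-memory registry. This is a toy model in the spirit of heartbeat-based registries, not the API of Consul, etcd, ZooKeeper, or Eureka; the class name, method names, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Callable


class ServiceRegistry:
    """In-memory registry: instances stay listed only while they heartbeat."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {instance address -> time of last heartbeat}
        self._instances: dict[str, dict[str, float]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, {})[address] = self._clock()

    # A heartbeat simply refreshes the timestamp, same as registering again.
    heartbeat = register

    def deregister(self, service: str, address: str) -> None:
        self._instances.get(service, {}).pop(address, None)

    def lookup(self, service: str) -> list[str]:
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        entries = self._instances.get(service, {})
        return sorted(a for a, seen in entries.items() if now - seen <= self._ttl)
```

An instance that crashes without deregistering simply stops heartbeating and drops out of `lookup` results once its TTL lapses, which is the same idea as a lease or an ephemeral node.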

Kubernetes Service Discovery

Neither plain DNS nor an external registry is ideal on its own. DNS suffers from TTL staleness (clients cache dead IPs) and lacks health checking. External registries require every application to include registration code. Kubernetes solves both by building service discovery into the platform:

No TTL staleness: The DNS name resolves to the ClusterIP, a virtual IP that stays fixed for the lifetime of the Service, so a cached DNS answer does not go stale when pods change. Under the hood, kube-proxy programs each node to intercept traffic to this IP and forward it to live pods.
Automatic registration: Pods don't register themselves. The endpoints controller watches for pods matching the Service's label selector and keeps the list of backends (Endpoints, or EndpointSlices in newer clusters) up to date automatically.
Built-in health checking: Only pods passing their readinessProbe receive traffic. A failing pod is removed from the endpoints once the probe's failure threshold is reached, typically within seconds.

A Service object provides a stable virtual IP (ClusterIP) and DNS name that automatically routes to healthy pod backends:

# Kubernetes Service creates a stable discovery endpoint:
apiVersion: v1
kind: Service
metadata:
  name: inventory
  namespace: production
spec:
  selector:
    app: inventory      # Routes to pods with this label
  ports:
    - port: 80          # Service port (what clients connect to)
      targetPort: 8080  # Pod port (where the app listens)
  type: ClusterIP       # Internal-only (default)

# After creating this Service, any pod in the cluster can reach it via:
# 1. DNS name (recommended):
curl http://inventory.production.svc.cluster.local/api/stock
curl http://inventory.production/api/stock   # Short form (works from any namespace)
curl http://inventory/api/stock              # Shortest (same namespace only)

# 2. ClusterIP (stable virtual IP):
kubectl get svc inventory -n production
# NAME        TYPE        CLUSTER-IP     PORT(S)
# inventory   ClusterIP   10.96.45.123   80/TCP

# 3. Environment variables (legacy, injected at pod startup):
# INVENTORY_SERVICE_HOST=10.96.45.123
# INVENTORY_SERVICE_PORT=80

# Kubernetes automatically:
# - Watches for pods matching selector "app: inventory"
# - Maintains an Endpoints object with current pod IPs
# - Updates kube-proxy rules to route traffic
# - Removes unhealthy pods (failed readiness probes)
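
The selector-plus-readiness rule in the list above can be modelled in a few lines of Python. This is a simplified illustration, not Kubernetes code: the `Pod` dataclass and `select_endpoints` function are hypothetical, and the `ready` flag stands in for a passing readiness probe.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    ip: str
    labels: dict[str, str] = field(default_factory=dict)
    ready: bool = True  # stands in for "readiness probe is passing"


def select_endpoints(pods: list[Pod], selector: dict[str, str]) -> list[str]:
    """Return IPs of ready pods whose labels include every selector pair."""
    return [
        pod.ip
        for pod in pods
        if pod.ready and all(pod.labels.get(k) == v for k, v in selector.items())
    ]
```

One simplification to be aware of: an empty selector matches every pod in this sketch, whereas Kubernetes does not populate endpoints automatically for a Service defined without a selector.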

                            
How It Works Internally: When you create a Service, Kubernetes DNS (CoreDNS) adds an A record. kube-proxy on every node creates iptables/IPVS rules that intercept traffic to the ClusterIP and load-balance it across healthy pod IPs. When pods die or new ones start, the Endpoints are updated within seconds and routing rules are refreshed automatically.

Communication Patterns

RPC & REST

The two dominant synchronous communication styles in distributed systems:

Aspect	REST	RPC
Paradigm	Resource-oriented (nouns)	Action-oriented (verbs)
Protocol	HTTP/1.1 or HTTP/2	Various (HTTP/2, TCP, custom)
Data format	JSON (human-readable)	Protobuf/binary (efficient)
Contract	OpenAPI/Swagger (optional)	IDL (mandatory, e.g., .proto files)
Best for	External APIs, CRUD	Internal services, high performance

gRPC

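gRPC is an open-source RPC framework that originated at Google and is widely used in Kubernetes and cloud-native systems. It uses HTTP/2 for transport and Protocol Buffers for serialization. Compact binary encoding and multiplexed connections usually make it noticeably faster than JSON over HTTP/1.1 for internal service communication. Figures of 7–10x are sometimes quoted, but the real gain depends heavily on payload shape and workload, so benchmark your own traffic before relying on a number.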

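// Protocol Buffer definition (inventory.proto):
syntax = "proto3";

service InventoryService {
  rpc GetStock (StockRequest) returns (StockResponse);
  rpc UpdateStock (UpdateRequest) returns (UpdateResponse);
  rpc WatchStock (StockRequest) returns (stream StockUpdate);  // Server streaming
}

message StockRequest {
  string product_id = 1;
}

message StockResponse {
  string product_id = 1;
  int32 quantity = 2;
  string warehouse = 3;
}

// The messages below are illustrative placeholders so the file compiles;
// their fields are assumptions, not part of any real inventory API.
message UpdateRequest {
  string product_id = 1;
  int32 delta = 2;
}

message UpdateResponse {
  int32 quantity = 1;
}

message StockUpdate {
  string product_id = 1;
  int32 quantity = 2;
}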

# Key gRPC features for distributed systems:
# 1. Streaming in either or both directions (real-time updates)
# 2. Built-in deadlines/timeouts (can propagate across services)
# 3. Configurable retries with backoff (via a retry policy in the service config)
# 4. Load balancing (client-side or via proxy)
# 5. Strong typing (compile-time contract enforcement)

# Where gRPC shows up in Kubernetes:
# - The API server talks to etcd over gRPC
# - The kubelet talks to the container runtime (CRI) over gRPC
# - The API server's own client-facing API is REST over HTTP

Message Queues

Asynchronous communication decouples producers from consumers. The sender doesn't wait for a response — it publishes a message and moves on. This provides temporal decoupling (services don't need to be online simultaneously) and load leveling (queues absorb traffic spikes).

Synchronous vs Asynchronous Communication

flowchart LR
    subgraph Synchronous
        A1[Order Service] -->|HTTP/gRPC| B1[Inventory Service]
        B1 -->|Response| A1
    end
    subgraph Asynchronous
        A2[Order Service] -->|Publish| Q[Message Queue]
        Q -->|Subscribe| B2[Inventory Service]
        Q -->|Subscribe| C2[Notification Service]
        Q -->|Subscribe| D2[Analytics Service]
    end
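
The asynchronous half of the diagram can be approximated in-process with one queue per subscriber. This is only a sketch of the fan-out idea: the `Broker` class and its methods are invented for the example and do not correspond to the Kafka, RabbitMQ, or NATS client APIs, and nothing here is persisted or sent over a network.

```python
from collections import defaultdict, deque


class Broker:
    """Tiny in-process stand-in for a message broker with fan-out."""

    def __init__(self):
        # topic -> {subscriber name -> that subscriber's private queue}
        self._queues = defaultdict(dict)

    def subscribe(self, topic: str, subscriber: str) -> None:
        self._queues[topic].setdefault(subscriber, deque())

    def publish(self, topic: str, message: dict) -> None:
        # The publisher returns immediately; nobody has processed anything yet.
        for queue in self._queues[topic].values():
            queue.append(message)

    def poll(self, topic: str, subscriber: str):
        """Return the subscriber's next message, or None if its queue is empty."""
        queue = self._queues[topic].get(subscriber)
        return queue.popleft() if queue else None
```

Each subscriber drains its own queue at its own pace, so a slow analytics consumer does not hold up the inventory consumer, and messages published while a consumer is not polling simply wait for it.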

System	Pattern	Ordering	Use Case
Apache Kafka	Log-based (persistent)	Per-partition	Event streaming, audit logs
RabbitMQ	Broker (message queue)	Per-queue (FIFO)	Task queues, request buffering
NATS	Pub/Sub (lightweight)	Best-effort	Cloud-native, IoT, K8s events

                            
When to Use Each: Use synchronous (REST/gRPC) when the caller needs an immediate response (user-facing requests, queries). Use asynchronous (queues) when the work can be deferred, when you need fan-out to multiple consumers, or when you need to survive downstream service outages.

Resilience Patterns

Networks fail. Services crash. Distributed communication must be designed for failure, not just success.

Retries & Exponential Backoff

Transient failures (network blips, temporary overload) often resolve themselves. Retrying failed requests is the simplest resilience pattern — but naive retries can amplify failures:

# BAD: Immediate retries during an outage
# Service is overloaded, responding slowly
# 100 clients retry instantly → 200 requests → more overload → 400 requests → crash
# This is a "retry storm" — retries make the problem worse

# GOOD: Exponential backoff with jitter
# Attempt 1: fails → wait 100ms + random(0-50ms)
# Attempt 2: fails → wait 200ms + random(0-100ms)
# Attempt 3: fails → wait 400ms + random(0-200ms)
# Attempt 4: fails → wait 800ms + random(0-400ms)
# Attempt 5: fails → give up, return error

# The "jitter" prevents all clients from retrying at the same moment
# (thundering herd problem)

# Kubernetes uses this pattern for:
# - kubelet registering with API Server
# - Controller retry loops
# - CrashLoopBackOff (10s → 20s → 40s → ... → 5min cap)
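
The backoff schedule above translates fairly directly into Python. Treat this as a minimal sketch: the function names, the 50% jitter ratio (taken from the schedule above), and the injectable `sleep` and `rng` parameters are choices made for the example, and real code should retry only errors known to be transient rather than every `Exception`.

```python
import random
import time
from typing import Callable


def backoff_delays(base: float = 0.1, attempts: int = 5,
                   rng: Callable[[], float] = random.random) -> list[float]:
    """Delays between attempts: base * 2**n plus up to 50% random jitter."""
    return [base * 2**n * (1 + 0.5 * rng()) for n in range(attempts - 1)]


def call_with_retries(operation, attempts: int = 5, base: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random):
    """Call operation, retrying failures with exponential backoff and jitter."""
    delays = backoff_delays(base, attempts, rng)
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # simplification: narrow this to transient errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

Passing `sleep` and `rng` as parameters keeps the logic testable without real waiting; in production the defaults (`time.sleep`, `random.random`) apply.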

Timeouts

Without timeouts, a slow downstream service causes the caller to hang indefinitely, consuming connections, threads, and memory. Always set timeouts at every integration point:

Connection timeout: How long to wait while establishing a TCP connection (~1–5s)
Request timeout: How long to wait for a response (~1–30s depending on operation)
Deadline propagation: Carry one overall deadline through the call chain and subtract the time already spent at each hop, so no downstream call is given longer than the caller has left
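
Deadline propagation is easiest to see in code. The sketch below assumes a hypothetical `Deadline` helper that each service creates when a request arrives and consults before every downstream call; the class, its method names, and the injectable `clock` are illustrative only.

```python
import time
from typing import Callable


class Deadline:
    """An absolute point in time by which the whole request must finish."""

    def __init__(self, budget_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_next_hop(self, cap: float) -> float:
        """Timeout for one downstream call: never more than what is left."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("deadline exceeded before calling downstream")
        return min(cap, left)
```

For example, with a 2-second budget of which 1.5 seconds are already spent, the next call gets at most 0.5 seconds even if its usual request timeout is longer, and once the budget is gone the call is not attempted at all.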

                            
The Timeout Dilemma: Too short = false positives (healthy-but-slow services look dead, triggering unnecessary retries). Too long = resource exhaustion (threads/connections held open for unresponsive services). There's no universally correct timeout: measure P99 latency and set timeouts slightly above it, then monitor and adjust.

Circuit Breakers

A circuit breaker stops a client from repeatedly calling a failing service, giving that service time to recover. The name comes from electrical circuit breakers, which cut the current to prevent overload:

Circuit Breaker State Machine

stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold exceeded
    Open --> HalfOpen : Timeout expires
    HalfOpen --> Closed : Test request succeeds
    HalfOpen --> Open : Test request fails
    
    note right of Closed : Normal operation\nAll requests pass through
    note right of Open : Fast-fail all requests\nReturn error immediately
    note right of HalfOpen : Allow one test request\nDecide based on result

Closed: Normal operation. Requests flow through. Track failure rate.
Open: Failure threshold exceeded. Immediately return errors without calling the service. Wait for a cooldown period.
Half-Open: Allow one test request through. If it succeeds, close the circuit (service recovered). If it fails, open again.
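
The three states map onto a small class. The sketch below makes several simplifying assumptions that a production library would not: it counts consecutive failures rather than a failure rate, it is not thread-safe, and it does not restrict half-open to a single in-flight trial request. All names (`CircuitBreaker`, `CircuitOpenError`, the `clock` parameter) are invented for the example.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised instead of calling the service while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is not open

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._cooldown:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; service is cooling down")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial in half-open, or too many failures in closed,
            # (re)opens the circuit and restarts the cooldown.
            if state == "half-open" or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each remote call as `breaker.call(lambda: client.get_stock(...))` (where `client.get_stock` is a placeholder for whatever call you are protecting) and treat `CircuitOpenError` as the signal to use a fallback.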

Bulkheads

Bulkheads are named after the watertight compartments that keep a single leak from sinking a whole ship. In software, bulkheads isolate failures so one slow dependency doesn't consume all resources:

Thread pool isolation: Each downstream service gets its own thread pool. If Service A is slow and exhausts its pool, Service B's pool remains unaffected.
Connection pool limits: Cap connections per downstream service.
In Kubernetes: Resource limits (CPU/memory) on pods prevent one workload from starving others. Namespaces with ResourceQuotas provide cluster-level bulkheads.
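
A per-dependency concurrency cap is one simple way to build a bulkhead in application code. The sketch below uses a semaphore instead of a dedicated thread pool, which is a deliberate substitution to keep the example short; the `Bulkhead` and `BulkheadFullError` names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: reject immediately rather than queue up callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return operation()
        finally:
            self._slots.release()
```

Creating one `Bulkhead` per downstream service means that when calls to a slow service occupy all of its slots, further calls to that service fail fast while calls to other services keep their own capacity.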

                            
Resilience in Kubernetes: Where These Patterns Live
Retries + Backoff: CrashLoopBackOff (10s → 20s → 40s → 5min cap); kubelet registration retries
Timeouts: readinessProbe.timeoutSeconds, terminationGracePeriodSeconds, client-go request timeouts
Circuit Breaker (closest analogue): a failing readinessProbe removes the pod from Service endpoints, so traffic stops flowing to it without restarting it, which gives it time to recover
Bulkheads: Pod resource limits, namespace ResourceQuotas, PriorityClasses for eviction ordering

Exercises

                            
Exercise 1 (Kubernetes Service Discovery): Create a Kubernetes Service of type ClusterIP that selects pods with label app: backend. Deploy 3 replica pods with that label. From another pod, verify you can reach the service by DNS name. Kill one pod and confirm traffic still flows. Observe the Endpoints object updating in real-time with kubectl get endpoints -w.

Exercise 2 (Circuit Breaker Design): Design a circuit breaker for a payment service calling a fraud detection API. Define: (a) What failure threshold triggers OPEN state (e.g., 5 failures in 10 seconds)? (b) How long is the cooldown before HALF-OPEN? (c) What does the payment service do when the circuit is OPEN (reject payment? allow with risk? queue for later)? Each choice has business implications.

Exercise 3 (Sync vs Async): An e-commerce checkout flow needs to: (a) Validate payment, (b) Decrement inventory, (c) Send confirmation email, (d) Update analytics. Which should be synchronous (blocking checkout) and which async (queued)? Justify your choices considering user experience, data consistency, and failure handling.

Conclusion

Service discovery and resilient communication are the circulatory system of distributed applications. Kubernetes solves discovery elegantly through Services (stable DNS names + virtual IPs), while resilience patterns (retries, timeouts, circuit breakers) ensure communication survives the inevitable failures of distributed networks.

In Part 5, we'll explore failure and resilience at the system level — node failures, network partitions, cascading failures, and the self-healing patterns that make distributed systems robust.
