
Distributed Control Systems & Feedback

May 15, 2026 Wasil Zafar 24 min read

"Every controller in your infrastructure — from HPA to circuit breakers to rate limiters — is implementing the same feedback loop that engineers have studied since James Watt's steam governor in 1788. Understanding control theory transforms these from mysterious black boxes into predictable, tunable systems."

Table of Contents

  1. Control Theory Foundations
  2. PID Controllers in Infrastructure
  3. Kubernetes as PID Controller
  4. Reconciliation as Control Loop
  5. Stability Analysis
  6. Oscillation & Damping
  7. Hierarchical Control
  8. Event-Driven vs Polling
  9. Self-Stabilizing Systems
  10. Key Takeaway

Control Theory Foundations

Control theory — a branch of engineering mathematics developed for physical systems — provides the theoretical foundation for understanding how distributed infrastructure manages itself. Every autoscaler, reconciliation loop, and self-healing system is implementing a feedback control system.

The Universal Control Loop: Measure the current state (process variable), compare it to the desired state (setpoint), compute the difference (error signal), and apply a corrective action (controller output). This cycle repeats continuously. Every Kubernetes controller, every autoscaler, every circuit breaker follows this exact pattern.

The four fundamental elements of any control system:

  • Setpoint (SP) — the desired state (e.g., 3 replicas, 70% CPU utilization, 100ms p99 latency)
  • Process Variable (PV) — the measured current state (actual replicas, current CPU, observed latency)
  • Error Signal (e) — the difference: e = SP - PV
  • Controller Output (u) — the corrective action (scale up 2 pods, increase rate limit, open circuit)
Generic Feedback Control Loop
flowchart LR
    SP["Setpoint\n(Desired State)"] --> SUM(("+/−"))
    SUM -->|"Error Signal"| CTRL["Controller\n(Decision Logic)"]
    CTRL -->|"Control Output"| PLANT["Plant/Process\n(Infrastructure)"]
    PLANT -->|"Actual State"| SENSOR["Sensor\n(Monitoring)"]
    SENSOR -->|"Measurement"| SUM
    DIST["Disturbances\n(Load Spikes, Failures)"] -->|"External Input"| PLANT
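The four elements above can be sketched in a few lines of Python. This is a minimal illustration with a toy plant, not a model of any real controller: the names `control_step` and `run_loop` and the gain of `0.5` are assumptions chosen for the example.

```python
def control_step(setpoint, measured, gain=0.5):
    """One pass of the loop: compare SP with PV, return (error, output)."""
    error = setpoint - measured      # e = SP - PV
    output = gain * error            # u: the corrective action
    return error, output


def run_loop(setpoint, initial_value, steps=20, gain=0.5):
    """Toy plant: the process variable moves by exactly the controller output."""
    value = initial_value
    history = []
    for _ in range(steps):
        _, output = control_step(setpoint, value, gain)
        value += output              # the plant responds to the action
        history.append(value)
    return history


# Start far from the setpoint and watch the error shrink each cycle
trajectory = run_loop(setpoint=3.0, initial_value=10.0)
print([round(v, 2) for v in trajectory[:5]])
```

With a gain of 0.5 each cycle removes half of the remaining error, so the measured value approaches the setpoint without overshooting it.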
                            

PID Controllers in Infrastructure

The PID (Proportional-Integral-Derivative) controller is the most widely deployed control algorithm in engineering history. Its three terms each address a different aspect of error correction:

  • P (Proportional) — responds proportionally to current error. Larger error → larger correction. Fast but leaves steady-state error.
  • I (Integral) — accumulates past errors over time. Eliminates steady-state error but can cause overshoot and windup.
  • D (Derivative) — responds to rate of change of error. Provides damping, reduces overshoot, but amplifies noise.
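Written out, the three terms combine into the standard textbook PID law. The symbols below are assumptions introduced here: e is the error SP − PV defined earlier, K_p, K_i and K_d are the three gains, and Δt is the sampling interval.

```latex
u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{de(t)}{dt}
```

A controller that samples at fixed intervals uses the discrete form:

```latex
u_k = K_p\, e_k + K_i \sum_{j=0}^{k} e_j\, \Delta t + K_d\, \frac{e_k - e_{k-1}}{\Delta t}
```

For a reverse-acting process, where raising the output lowers the measurement (more replicas means lower CPU), the sign of u is inverted.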
"""
PID Controller Simulation — Applied to Infrastructure Autoscaling
Demonstrates how P, I, and D terms interact to control replica count.
"""
import numpy as np

class PIDController:
    """PID controller for infrastructure autoscaling."""

    def __init__(self, kp, ki, kd, setpoint, output_min=1, output_max=100):
        self.kp = kp          # Proportional gain
        self.ki = ki          # Integral gain
        self.kd = kd          # Derivative gain
        self.setpoint = setpoint
        self.output_min = output_min
        self.output_max = output_max
        self.integral = 0.0
        self.prev_error = 0.0

    def compute(self, measured_value, dt=1.0):
        """Compute control output given current measurement."""
        error = self.setpoint - measured_value

        # Proportional term — react to current error
        p_term = self.kp * error

        # Integral term — accumulate past errors (with anti-windup)
        self.integral += error * dt
        self.integral = np.clip(self.integral, -50, 50)  # Anti-windup
        i_term = self.ki * self.integral

        # Derivative term — react to rate of change
        derivative = (error - self.prev_error) / dt
        d_term = self.kd * derivative
        self.prev_error = error

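        # Combined output (clamped to valid range).
        # Adding replicas lowers CPU, so this loop is reverse-acting:
        # negate the sum so that CPU above the setpoint (negative error)
        # produces a larger replica count instead of clamping to the floor.
        output = -(p_term + i_term + d_term)
        return float(np.clip(output, self.output_min, self.output_max))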


# Simulate: target 70% CPU utilization, control replica count
controller = PIDController(kp=0.5, ki=0.1, kd=0.2, setpoint=70.0)

# Simulated CPU readings (starting overloaded, then load spike at t=20)
cpu_readings = [90, 88, 82, 78, 74, 72, 71, 70, 70, 70,
                70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
                95, 92, 88, 85, 80, 76, 73, 71, 70, 70]

print("Time | CPU% | Error | Replicas (output)")
print("-" * 50)
for t, cpu in enumerate(cpu_readings):
    replicas = controller.compute(cpu)
    error = 70.0 - cpu
    print(f"  {t:2d}  | {cpu:3.0f}% | {error:+5.1f} | {replicas:5.1f}")

Kubernetes as PID Controller

The Kubernetes Horizontal Pod Autoscaler (HPA) is fundamentally a control loop. It does not implement a full PID controller: it applies proportional (ratio-based) control, and its stabilization windows add damping against short-lived metric swings.

Kubernetes HPA as Control Loop
flowchart TD
    SP["Setpoint:\ntargetCPUUtilization = 70%"] --> COMP["HPA Controller\n(Compare & Decide)"]
    COMP -->|"Scale to N replicas"| DEPLOY["Deployment\n(ReplicaSet)"]
    DEPLOY --> PODS["Running Pods\n(Actual Workload)"]
    PODS --> METRICS["Metrics Server\n(CPU/Memory/Custom)"]
    METRICS -->|"Current avg CPU = X%"| COMP
    LOAD["User Traffic\n(Disturbance)"] --> PODS
    COMP -->|"Stabilization\nWindow"| STAB["Cooldown\n(Prevent Thrashing)"]
    STAB --> COMP
                            
# Kubernetes HPA — Control Theory in YAML
# The HPA is a proportional controller with stabilization (damping)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3        # Output floor (lower saturation limit)
  maxReplicas: 50       # Output ceiling (upper saturation limit)
  metrics:
    # Setpoint: target 70% average CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # SP = 70%
    # Secondary metric: custom requests/sec
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"      # SP = 1000 rps/pod
  behavior:
    # Stabilization windows = damping (D-term approximation)
    scaleUp:
      stabilizationWindowSeconds: 60    # Use the lowest recommendation from the last 60s
      policies:
        - type: Percent
          value: 100                     # Max 2x in one step
          periodSeconds: 60
        - type: Pods
          value: 4                       # Or max +4 pods
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300   # Use the highest recommendation from the last 5 min
      policies:
        - type: Percent
          value: 10                      # Max 10% reduction per period
          periodSeconds: 60
      selectPolicy: Min                  # Conservative scale-down
Why HPA Isn't Pure PID: Kubernetes HPA uses proportional control with a formula: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric)). It approximates integral behavior through repeated proportional corrections, and approximates derivative behavior through stabilization windows. A true PID controller would provide smoother scaling but requires careful tuning to avoid instability in discrete, quantized systems (you can't have 3.7 pods).
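The formula above is easy to check by hand. The following sketch reproduces it in Python; the function name `desired_replicas` and the min/max bounds are illustrative assumptions, and the 10% tolerance band mirrors the documented HPA default, under which small deviations from the target trigger no scaling.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50, tolerance=0.1):
    """Ratio-based scaling: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured output range
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 70))    # 4 * 90/70 = 5.14, rounded up to 6
print(desired_replicas(10, 73, 70))   # within tolerance, stays at 10
print(desired_replicas(10, 35, 70))   # half the target, scales down to 5
```

Because the result is always rounded up to a whole pod, the controller cannot settle on a fractional value, which is one source of the quantization effects discussed later.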

Reconciliation as Control Loop

Declarative systems like Kubernetes implement control through reconciliation loops — the observe-diff-act cycle that continuously drives actual state toward desired state. This is a specific implementation of feedback control where the controller output is "apply the diff."

Pattern
The Observe-Diff-Act Reconciliation Pattern

Observe: Read current state from the system (equivalent to sensor measurement). Diff: Compare current state against desired state (equivalent to computing the error signal). Act: Apply changes to close the gap (equivalent to controller output). The cycle is re-triggered by change notifications and by periodic resyncs whose interval is configurable per controller (the HPA, for example, re-evaluates every 15 seconds by default), creating a continuous control loop that corrects drift without external intervention.
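As a rough sketch of the diff step, the function below compares two dictionaries of resources and returns the actions needed to close the gap. The resource names (`web-0` and so on) and the `reconcile` signature are hypothetical; a real controller would read the observed state from an API rather than from a literal.

```python
def reconcile(desired, observed):
    """Return the actions that move `observed` toward `desired`.

    Both arguments map a resource name to its spec.
    """
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name))   # missing resource
        elif observed[name] != spec:
            actions.append(("update", name))   # drifted resource
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))   # resource nobody declared
    return actions


desired = {"web-0": "v2", "web-1": "v2", "web-2": "v2"}
observed = {"web-0": "v2", "web-1": "v1", "web-9": "v1"}
print(reconcile(desired, observed))
```

When the observed state already matches, the function returns an empty list, which is the "zero error" case of the control loop.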


Stability Analysis

A control system is stable if it converges to the setpoint after a disturbance. An unstable system oscillates with growing amplitude or diverges entirely. In infrastructure terms, an unstable autoscaler thrashes between extremes, constantly scaling up and down without settling.

Stability Criteria for Infrastructure Controllers: A system is stable when: (1) Gain margin > 0 dB, so the loop gain can rise somewhat before corrections start to grow instead of shrink, (2) Phase margin > 0°, so the system doesn't react so late that corrections make things worse, (3) The feedback delay is short relative to the system's natural oscillation period.
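The interaction between gain and delay can be seen in a linearised toy model. The sketch below is an assumption-laden illustration, not a model of any specific autoscaler: it iterates x[n+1] = x[n] − gain · x[n−delay], where x is the deviation from the setpoint, and reports how much deviation is left after many steps. The function name and the chosen gains are hypothetical.

```python
def delayed_loop_amplitude(gain, delay=3, steps=200):
    """Simulate x[n+1] = x[n] - gain * x[n - delay].

    Returns the largest |x| seen over the final 20 steps.
    """
    x = [1.0] * (delay + 1)   # start displaced from the setpoint by 1
    for _ in range(steps):
        # The correction is based on a measurement that is `delay` steps old
        x.append(x[-1] - gain * x[-1 - delay])
    return max(abs(v) for v in x[-20:])


for gain in (0.1, 0.3, 0.8):
    amp = delayed_loop_amplitude(gain)
    verdict = "settles" if amp < 1e-3 else "does not settle"
    print(f"gain={gain}: {verdict}")
```

With a three-step delay, the two smaller gains die out while the largest one grows without bound. Shortening the delay raises the gain the loop can tolerate, which is the practical meaning of criterion (3).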
System Response Types — Stability Analysis
flowchart LR
    subgraph STABLE["Stable (Well-Tuned)"]
        S1["Setpoint Change"] --> S2["Quick Rise"]
        S2 --> S3["Slight Overshoot"]
        S3 --> S4["Settle at Target"]
    end
    subgraph UNDER["Underdamped (Oscillating)"]
        U1["Setpoint Change"] --> U2["Overshoot"]
        U2 --> U3["Undershoot"]
        U3 --> U4["Oscillates..."]
        U4 --> U5["Eventually Settles"]
    end
    subgraph UNSTABLE["Unstable (Thrashing)"]
        X1["Setpoint Change"] --> X2["Massive Overshoot"]
        X2 --> X3["Massive Undershoot"]
        X3 --> X4["Growing Oscillation"]
        X4 --> X5["System Failure"]
    end
                            
"""
Stability Analysis — Simulating Autoscaler Behavior
Shows how different gain values affect system stability.
"""
import numpy as np

def simulate_autoscaler(kp, load_pattern, target_cpu=70, dt=1.0, steps=50):
    """Simulate proportional autoscaler with delayed feedback."""
    replicas = 5.0
    cpu_history = []
    replica_history = []
    feedback_delay = 3  # 3 time steps for metrics to propagate

    delayed_readings = [target_cpu] * feedback_delay  # Initialize buffer

    for t in range(steps):
        # Actual CPU depends on load / replicas
        actual_load = load_pattern[t % len(load_pattern)]
        actual_cpu = (actual_load / replicas) * 100

        # Delayed feedback (metrics pipeline latency)
        delayed_readings.append(actual_cpu)
        measured_cpu = delayed_readings[t]  # Read delayed value

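        # Proportional controller on the relative error, scaled by the
        # current replica count (kp=1 reproduces the HPA ratio formula).
        # A raw percentage-point error would make even kp=0.1 thrash,
        # because a single replica shifts CPU by many points.
        error = (measured_cpu - target_cpu) / target_cpu
        adjustment = kp * replicas * error
        replicas = max(1, replicas + adjustment)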

        cpu_history.append(actual_cpu)
        replica_history.append(replicas)

    return cpu_history, replica_history


# Test different proportional gains
load = [3.5] * 10 + [7.0] * 20 + [3.5] * 20  # Load doubles at t=10

print("=== Low Gain (kp=0.1) — Stable but Slow ===")
cpu, reps = simulate_autoscaler(kp=0.1, load_pattern=load)
print(f"Final CPU: {cpu[-1]:.1f}% | Final Replicas: {reps[-1]:.1f}")
print(f"Peak CPU: {max(cpu):.1f}%")

print("\n=== Medium Gain (kp=0.3) — Well-Tuned ===")
cpu, reps = simulate_autoscaler(kp=0.3, load_pattern=load)
print(f"Final CPU: {cpu[-1]:.1f}% | Final Replicas: {reps[-1]:.1f}")
print(f"Peak CPU: {max(cpu):.1f}%")

print("\n=== High Gain (kp=0.8) — Unstable (Thrashing) ===")
cpu, reps = simulate_autoscaler(kp=0.8, load_pattern=load)
print(f"Final CPU: {cpu[-1]:.1f}% | Final Replicas: {reps[-1]:.1f}")
print(f"Peak CPU: {max(cpu):.1f}%")
print(f"Oscillation range: {min(cpu[-10:]):.1f}% - {max(cpu[-10:]):.1f}%")

Oscillation & Damping

Oscillation is the most common failure mode of infrastructure controllers. It happens when the correction overshoots the target, triggering a reverse correction that also overshoots, creating a cycle of thrashing.

Common causes and solutions:

  • Too aggressive P gain → reduce proportional response, scale in smaller increments
  • No derivative term → add stabilization windows, rate limiters
  • Feedback delay → reduce metrics collection interval, use predictive scaling
  • Quantization effects → can't have fractional pods, so small errors cause discrete jumps
Damping Strategies: (1) Cooldown periods — don't scale again within N seconds of last action, (2) Rate limiters — max N% change per period, (3) Hysteresis — different thresholds for scale-up vs scale-down (e.g., scale up at 80% CPU, down at 50%), (4) Moving averages — smooth noisy metrics before feeding to controller.
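Two of the strategies above, hysteresis and moving averages, combine naturally. The sketch below is a minimal illustration: the class name `HysteresisScaler`, the 80%/50% thresholds (taken from the example in the text) and the three-sample window are assumptions, not settings of any real autoscaler.

```python
from collections import deque


class HysteresisScaler:
    """Smooths CPU readings, then applies separate up/down thresholds."""

    def __init__(self, up_threshold=80.0, down_threshold=50.0, window=3):
        self.up = up_threshold
        self.down = down_threshold
        self.samples = deque(maxlen=window)   # moving-average window

    def decide(self, cpu_percent):
        self.samples.append(cpu_percent)
        smoothed = sum(self.samples) / len(self.samples)
        if smoothed > self.up:
            return "scale_up"
        if smoothed < self.down:
            return "scale_down"
        return "hold"   # inside the dead band: no action


scaler = HysteresisScaler()
readings = [60, 95, 60, 84, 90, 88, 55, 40, 30]
decisions = [scaler.decide(r) for r in readings]
print(decisions)
```

The single spike to 95% is absorbed by the moving average, while the sustained run of high readings does trigger a scale-up. The gap between the two thresholds keeps the controller from reversing a decision on a small change.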
# Measure controller lag — critical for stability analysis
# High lag = risk of oscillation

echo "=== HPA Controller Lag Analysis ==="
# Time between metric change and scaling action
echo "HPA reaction time (last 10 scaling events):"
kubectl get events -n production \
  --field-selector reason=SuccessfulRescale \
  --sort-by='.lastTimestamp' | tail -10

echo ""
echo "Metrics pipeline latency:"
# Time from pod metric emission to HPA seeing it
kubectl top pods -n production --no-headers | head -5
echo "vs"
kubectl get hpa -n production -o jsonpath='{range .items[*]}{.metadata.name}: current={.status.currentMetrics[0].resource.current.averageUtilization}% target={.spec.metrics[0].resource.target.averageUtilization}%{"\n"}{end}'

echo ""
echo "=== Detecting Oscillation ==="
# Count scaling events in last hour (>10 = potential thrashing)
SCALE_COUNT=$(kubectl get events -n production \
  --field-selector reason=SuccessfulRescale \
  --output json | python3 -c "
import json, sys
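from datetime import datetime, timedelta, timezone
events = json.load(sys.stdin)['items']
# Event timestamps are UTC, so compare against a timezone-aware UTC cutoff
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
def event_time(e):
    # lastTimestamp may be null on newer events; fall back to eventTime
    raw = e.get('lastTimestamp') or e.get('eventTime')
    return datetime.fromisoformat(raw.replace('Z', '+00:00')) if raw else None
recent = [e for e in events if event_time(e) and event_time(e) > cutoff]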
print(len(recent))
")
echo "Scaling events in last hour: $SCALE_COUNT"
if [ "$SCALE_COUNT" -gt 10 ]; then
    echo "WARNING: Possible oscillation detected! Consider:"
    echo "  - Increasing stabilizationWindowSeconds"
    echo "  - Reducing scale-down percentage"
    echo "  - Adding hysteresis gap"
fi

Hierarchical Control

Real infrastructure uses multiple controllers at different abstraction levels, forming a hierarchical control system. Higher-level controllers set the parameters for lower-level controllers, creating a cascade of increasingly fine-grained control.

Hierarchical Controllers in Kubernetes
flowchart TD
    L1["Level 1: Cluster Autoscaler\n(Add/remove nodes)"]
    L2["Level 2: HPA\n(Add/remove pods)"]
    L3["Level 3: VPA\n(Adjust pod resources)"]
    L4["Level 4: Application\n(Connection pools, caches)"]

    L1 -->|"Provides capacity\nfor pods to schedule"| L2
    L2 -->|"Determines pod count\nbased on load"| L3
    L3 -->|"Right-sizes each pod\nfor efficiency"| L4
    L4 -->|"Reports actual\nresource usage"| L3
    L3 -->|"Aggregate resource\ndemand"| L2
    L2 -->|"Pending pods signal\nnode shortage"| L1

    style L1 fill:#132440,color:#fff
    style L2 fill:#16476A,color:#fff
    style L3 fill:#3B9797,color:#fff
    style L4 fill:#f8f9fa,color:#132440
                            
Design Principle
Separation of Time Scales

Hierarchical control works because each level operates at a different time scale. The application adjusts connection pools in milliseconds. VPA adjusts pod resources in minutes. HPA adjusts replica count in seconds-to-minutes. Cluster Autoscaler adds nodes in minutes-to-hours. Each controller can treat the slower levels as "fixed infrastructure" and the faster levels as "fast-settling subsystems." This separation prevents resonance, where controllers acting at the same frequency on the same signal fight each other; it is also why HPA and VPA should not both scale on the same CPU or memory metric.


Event-Driven vs Polling-Based Control

Control loops can be triggered in two fundamental ways, each with distinct tradeoffs:

Comparison
Event-Driven vs Polling Control Loops
Dimension | Event-Driven | Polling-Based
Trigger | State change notification | Periodic timer (every N seconds)
Detection Speed | Immediate (milliseconds) | Up to one polling interval
System Load | Proportional to change rate | Constant (regardless of changes)
Missed Events | Possible if queue overflows | Impossible (always reads current state)
Consistency | Eventual (event ordering issues) | Point-in-time consistent
K8s Example | Watch API (informers) | Resync period (full re-list)
Kubernetes Hybrid Approach: Kubernetes uses both: informers watch for real-time events (fast detection) and also resync periodically, re-processing every known object (consistency guarantee). The resync period is configurable per controller and ensures that even if events are missed, the controller will eventually converge. This "belt and suspenders" approach is why Kubernetes self-heals reliably.
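The hybrid idea can be sketched without any Kubernetes machinery. In the example below, `HybridController`, `on_event` and `resync` are hypothetical names, and a plain dictionary stands in for the cluster; the point is only to show a lost event being repaired by the next full re-read.

```python
class HybridController:
    """Keeps a local cache current via events, with resync as a safety net."""

    def __init__(self):
        self.cache = {}

    def on_event(self, key, value):
        """Fast path: apply one change notification (None means deleted)."""
        if value is None:
            self.cache.pop(key, None)
        else:
            self.cache[key] = value

    def resync(self, list_all):
        """Slow path: re-read everything and return the keys that had drifted."""
        actual = list_all()
        drifted = sorted(
            k for k in set(actual) | set(self.cache)
            if actual.get(k) != self.cache.get(k)
        )
        self.cache = dict(actual)
        return drifted


# A dictionary stands in for the real source of truth
cluster = {"pod-a": "Running", "pod-b": "Running"}
ctrl = HybridController()
ctrl.resync(lambda: cluster)              # initial full read

cluster["pod-c"] = "Running"
ctrl.on_event("pod-c", "Running")         # this event is delivered
cluster["pod-b"] = "Failed"               # this event is lost in transit

stale = ctrl.cache["pod-b"]               # cache still holds the old value
repaired = ctrl.resync(lambda: cluster)   # periodic resync finds the drift
print(stale, repaired)
```

The event path keeps latency low in the common case, and the resync path bounds how long a missed event can leave the cache wrong.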

Self-Stabilizing Systems

Dijkstra's concept of self-stabilizing systems (1974) provides a theoretical framework for understanding why reconciliation-based control planes are robust: regardless of what state the system starts in (even an arbitrary, corrupted state), it will converge to a legitimate state within finite time.

Dijkstra's Self-Stabilization Applied to Infrastructure: A self-stabilizing system requires no initialization and recovers from any transient fault. Kubernetes controllers approximate this: you can delete managed resources or kill controllers mid-operation, and the system will eventually converge back to the declared desired state, provided the desired state stored in etcd is itself intact. The key properties: (1) convergence: from any state, reach a legitimate state in finite steps, (2) closure: once in a legitimate state, stay there unless disturbed.
"""
Self-Stabilizing System Simulation
Demonstrates convergence from arbitrary initial state to desired state.
Models a Kubernetes-like reconciliation controller.
"""
import numpy as np

class SelfStabilizingController:
    """Simulates a reconciler that converges from any state."""

    def __init__(self, desired_state):
        self.desired = desired_state  # Target configuration

    def reconcile(self, current_state):
        """Single reconciliation step — observe, diff, act."""
        actions = []
        for key, desired_val in self.desired.items():
            current_val = current_state.get(key, None)
            if current_val != desired_val:
                actions.append(f"Fix {key}: {current_val} -> {desired_val}")
                current_state[key] = desired_val
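        # Remove anything the desired state does not declare, so that
        # convergence holds from a truly arbitrary starting state
        for key in [k for k in current_state if k not in self.desired]:
            actions.append(f"Remove {key}: {current_state.pop(key)}")
        return actions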

    def simulate_convergence(self, initial_state, max_steps=10):
        """Simulate convergence from arbitrary initial state."""
        state = dict(initial_state)
        print(f"Desired state: {self.desired}")
        print(f"Initial state: {state}")
        print("-" * 50)

        for step in range(max_steps):
            actions = self.reconcile(state)
            if not actions:
                print(f"Step {step}: CONVERGED — system stable")
                return step
            print(f"Step {step}: {actions}")

        print("WARNING: Did not converge within max steps")
        return max_steps


# Desired state: 3 replicas, image v2, port 8080
desired = {"replicas": 3, "image": "app:v2", "port": 8080, "healthy": True}

# Test 1: Completely wrong initial state (corrupted)
print("=== Test 1: Corrupted State ===")
ctrl = SelfStabilizingController(desired)
ctrl.simulate_convergence(
    {"replicas": 7, "image": "app:v1", "port": 9090, "healthy": False}
)

print("\n=== Test 2: Partially Correct ===")
ctrl.simulate_convergence(
    {"replicas": 3, "image": "app:v1", "port": 8080, "healthy": True}
)

print("\n=== Test 3: Empty State (Fresh Start) ===")
ctrl.simulate_convergence({})

Key Takeaway

Infrastructure IS Control Theory

Every self-managing system in modern infrastructure is implementing concepts from control theory — often unknowingly. HPA is a proportional controller. Stabilization windows are derivative approximations. Rate limiters are gain limiters. Cooldown periods are damping. Circuit breakers are bang-bang controllers. Understanding these connections transforms debugging from "why is this autoscaler thrashing?" to "the gain is too high relative to the feedback delay — reduce the proportional response or increase the damping window." Control theory gives you a vocabulary and mathematical framework for reasoning about system behavior.
