The Declarative Model
Kubernetes is built on a declarative paradigm: you tell the system what you want (desired state), not how to get there. The system continuously works to make reality match your declaration. This is fundamentally different from imperative systems where you issue commands ("start 3 containers") and hope the system maintains that state.
flowchart LR
USER["User declares\nDesired State"] --> ETCD["etcd stores\nDesired State"]
ETCD --> CTRL["Controller reads\nDesired State"]
CTRL --> COMPARE{"Compare\nDesired vs Actual"}
COMPARE -->|"Drift detected"| ACT["Take Corrective\nAction"]
ACT --> ACTUAL["Actual State\nconverges"]
ACTUAL --> CTRL
COMPARE -->|"No drift"| WAIT["Wait for\nnext cycle"]
WAIT --> CTRL
The declarative model has three fundamental components:
- Desired State — stored in etcd as resource specifications (e.g., spec.replicas: 3)
- Actual State — observed from the real world (e.g., kubelet reports 2 running pods)
- Reconciliation — the process of driving actual state toward desired state
# Desired state declaration: user intent, not an imperative command
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
spec:
  replicas: 3                # Desired: 3 pods running
  selector:
    matchLabels:
      app: web-frontend
  strategy:
    type: RollingUpdate      # How to reconcile during updates
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
Control Loop Pattern
Every controller in Kubernetes implements the same fundamental pattern — an infinite loop that observes, compares, and acts. This is the control loop, borrowed from control systems engineering (think thermostats, cruise control, industrial PID controllers).
flowchart TD
OBS["1. OBSERVE\nRead current state\nfrom cluster"] --> DIFF["2. DIFF\nCompare current state\nwith desired state"]
DIFF --> ACT["3. ACT\nTake corrective action\nto close the gap"]
ACT --> UPDATE["4. UPDATE STATUS\nWrite status back\nto API Server"]
UPDATE --> OBS
style OBS fill:#3B9797,color:#fff
style DIFF fill:#16476A,color:#fff
style ACT fill:#BF092F,color:#fff
style UPDATE fill:#132440,color:#fff
The four phases of the control loop:
- Observe — read current state from the cluster via informers (cached watches on API Server)
- Diff — compare observed state with the desired state stored in the resource spec
- Act — if a difference exists, take the minimum action needed to converge (create pod, delete pod, update config)
- Update Status — write the observed result back to the resource's .status field
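The four phases can be sketched as a single function over in-memory data. This is a minimal illustration, not how any real controller is written: `ReplicaSetLike`, the `pods` list, and the generated pod names are hypothetical stand-ins for API objects and the informer cache.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaSetLike:
    """Hypothetical stand-in for a resource with a spec and a status."""
    desired_replicas: int                        # plays the role of spec.replicas
    status: dict = field(default_factory=dict)   # plays the role of .status


def reconcile(resource: ReplicaSetLike, pods: list) -> list:
    """Run one pass of observe, diff, act, update status over a pod list."""
    # 1. Observe: a real controller would read this from its informer cache.
    actual = len(pods)
    # 2. Diff: positive means too few pods, negative means too many.
    diff = resource.desired_replicas - actual
    # 3. Act: take only the action needed to close the gap.
    if diff > 0:
        pods = pods + [f"pod-{actual + i}" for i in range(diff)]
    elif diff < 0:
        pods = pods[:resource.desired_replicas]
    # 4. Update status: record what was observed after acting.
    resource.status = {"replicas": len(pods)}
    return pods


rs = ReplicaSetLike(desired_replicas=3)
pods = reconcile(rs, ["pod-0", "pod-1"])   # one pod short, so one is created
pods = reconcile(rs, pods)                 # already converged, nothing changes
print(pods, rs.status)
```

Calling `reconcile` a second time changes nothing, which is the property that lets a real controller run the loop as often as it likes.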
Kubernetes as a PID Controller
In control systems engineering, a PID controller continuously calculates an error value (difference between setpoint and measured variable) and applies a correction. Kubernetes controllers follow the same closed-loop idea: the "setpoint" is spec.replicas, the "measured variable" is the count of ready pods, and the "correction" is creating or deleting pods. The analogy is loose: most controllers act only on the current error, with no integral or derivative terms, and they operate on discrete objects (pods) rather than continuous signals.
How Controllers Implement Reconciliation
Controllers don't poll the API Server on every loop iteration — that would be catastrophically expensive at scale. Instead, they use the informer pattern: a local cached copy of cluster state maintained by long-lived watch connections to the API Server.
flowchart TD
ETCD["etcd"] -->|"watch stream"| API["API Server"]
API -->|"watch events"| INF["Informer\n(SharedIndexInformer)"]
INF -->|"cache update"| CACHE["Local Cache\n(Thread-safe Store)"]
INF -->|"event handler"| QUEUE["Work Queue\n(Rate-limited)"]
QUEUE -->|"dequeue key"| RECONCILE["Reconcile(key)\nFunction"]
RECONCILE -->|"read from"| CACHE
RECONCILE -->|"write to"| API
style INF fill:#3B9797,color:#fff
style QUEUE fill:#16476A,color:#fff
style RECONCILE fill:#BF092F,color:#fff
Informers — Efficient State Watching
An informer establishes a single watch connection to the API Server and maintains a local cache of all objects matching its criteria. When an object changes, the informer receives the event and updates its cache — no polling required.
- SharedIndexInformer — multiple controllers share one watch connection per resource type
- Reflector — handles the LIST+WATCH protocol (initial list, then incremental watches)
- DeltaFIFO — queues incoming events as deltas (Added, Updated, Deleted)
- Indexer — stores objects with custom indexes for fast lookup (e.g., by namespace, label)
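The LIST+WATCH flow can be sketched with a small in-memory cache. This is a simplified model, not client-go's implementation: `LocalCache`, `key_of`, and the `pod` helper are hypothetical names, and the only event types handled are the watch types ADDED, MODIFIED, and DELETED.

```python
class LocalCache:
    """Hypothetical in-memory store keyed by 'namespace/name'."""

    def __init__(self):
        self.objects = {}
        self.resource_version = "0"

    def replace(self, items, resource_version):
        """Initial LIST: replace the whole cache with a snapshot."""
        self.objects = {key_of(obj): obj for obj in items}
        self.resource_version = resource_version

    def apply(self, event_type, obj):
        """WATCH event: apply one incremental change and return the key."""
        key = key_of(obj)
        if event_type in ("ADDED", "MODIFIED"):
            self.objects[key] = obj
        elif event_type == "DELETED":
            self.objects.pop(key, None)
        # Remember where the stream is, so a broken watch can resume from here.
        self.resource_version = obj["metadata"]["resourceVersion"]
        return key


def key_of(obj):
    meta = obj["metadata"]
    return f'{meta["namespace"]}/{meta["name"]}'


def pod(name, rv):
    """Build a minimal pod-like dict for the example."""
    return {"metadata": {"namespace": "production", "name": name,
                         "resourceVersion": rv}}


cache = LocalCache()
cache.replace([pod("web-1", "10"), pod("web-2", "11")], resource_version="11")
cache.apply("DELETED", pod("web-1", "12"))
cache.apply("ADDED", pod("web-3", "13"))
print(sorted(cache.objects))
```

The key returned by `apply` is what an event handler would hand to the work queue; the reconcile function later reads the full object back from the cache.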
Work Queue — Rate Limiting and Deduplication
When an informer detects a change, it doesn't immediately trigger reconciliation. Instead, it enqueues the object's key (namespace/name) into a work queue. This provides critical guarantees:
- Deduplication — if an object changes 10 times before processing, only one reconciliation runs
- Rate limiting — prevents thundering herd from overwhelming the API Server
- Retry with backoff — failed reconciliations are re-enqueued with exponential delay
- Per-key serialization — each key is processed by at most one worker at a time, so two reconciliations of the same object never run concurrently
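Deduplication and the one-worker-per-key rule can be sketched with two sets and a queue. This is a simplified, single-threaded model loosely based on how client-go's work queue tracks pending and in-flight keys; the `WorkQueue` class here is hypothetical and omits locking, rate limiting, and shutdown.

```python
from collections import deque


class WorkQueue:
    """Minimal dedup queue: a key is queued at most once and never handed
    to a second worker while the first is still processing it."""

    def __init__(self):
        self.queue = deque()
        self.dirty = set()        # keys that need (re)processing
        self.processing = set()   # keys a worker currently holds

    def add(self, key):
        if key in self.dirty:
            return                # already pending: deduplicate
        self.dirty.add(key)
        if key not in self.processing:
            self.queue.append(key)

    def get(self):
        key = self.queue.popleft()
        self.processing.add(key)
        self.dirty.discard(key)
        return key

    def done(self, key):
        self.processing.discard(key)
        if key in self.dirty:     # changed while being processed: run again
            self.queue.append(key)


q = WorkQueue()
for _ in range(10):
    q.add("production/web-frontend")   # ten changes, one queued key
key = q.get()
q.add(key)        # change arrives mid-processing: deferred, not queued yet
q.done(key)       # now the key is queued again for a follow-up pass
print(list(q.queue))
```

Because the queue carries only keys, the follow-up pass reads whatever the latest cached state is, not the state at the time of each individual event.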
The Reconcile Function
The actual business logic lives in the Reconcile() function. It receives an object key, reads the desired state from cache, observes actual state, and takes action:
# Python custom controller using the kopf framework
# Reconcile function for a custom CronBackup resource
import kopf
import kubernetes
from kubernetes import client, config


@kopf.on.create('backups.example.com', 'v1', 'cronbackups')
@kopf.on.update('backups.example.com', 'v1', 'cronbackups')
@kopf.on.resume('backups.example.com', 'v1', 'cronbackups')
def reconcile_cronbackup(spec, name, namespace, status, **kwargs):
    """
    Reconciliation loop for CronBackup custom resource.
    Ensures a CronJob exists matching the desired backup spec.
    """
    config.load_incluster_config()
    batch_v1 = client.BatchV1Api()

    # Desired state comes from the custom resource's spec
    desired_schedule = spec.get('schedule', '0 2 * * *')
    desired_target = spec.get('target_database')
    desired_retention = spec.get('retention_days', 30)
    cronjob_name = f"backup-{name}"

    try:
        # 1. OBSERVE: Check if CronJob already exists
        existing = batch_v1.read_namespaced_cron_job(cronjob_name, namespace)
        current_schedule = existing.spec.schedule
        # 2. DIFF: Compare desired vs actual (only the schedule is compared here)
        if current_schedule != desired_schedule:
            # 3. ACT: Update the CronJob to match desired state
            existing.spec.schedule = desired_schedule
            batch_v1.replace_namespaced_cron_job(cronjob_name, namespace, existing)
            return {'message': f'Updated schedule to {desired_schedule}'}
        else:
            return {'message': 'CronJob already matches desired state'}
    except kubernetes.client.exceptions.ApiException as e:
        if e.status == 404:
            # 3. ACT: CronJob doesn't exist, so create it
            # (_build_cronjob is a helper that returns the CronJob manifest)
            cronjob = _build_cronjob(cronjob_name, namespace,
                                     desired_schedule, desired_target, desired_retention)
            batch_v1.create_namespaced_cron_job(namespace, cronjob)
            return {'message': f'Created CronJob {cronjob_name}'}
        raise
Built-in Controller Examples
Kubernetes ships with dozens of built-in controllers, each typically responsible for reconciling one resource type. The kube-controller-manager binary runs them in a single process with shared informers.
Deployment Controller — Rolling Update Reconciliation
The Deployment controller doesn't manage pods directly. It manages ReplicaSets, which in turn manage pods. During a rolling update, it orchestrates the transition between old and new ReplicaSets:
sequenceDiagram
participant User
participant Deployment Controller
participant ReplicaSet Old (v1)
participant ReplicaSet New (v2)
participant Pods
User->>Deployment Controller: Update image to v2
Deployment Controller->>ReplicaSet New (v2): Create with replicas=1
ReplicaSet New (v2)->>Pods: Create 1 new pod (v2)
Note over Deployment Controller: Wait for v2 pod Ready
Deployment Controller->>ReplicaSet Old (v1): Scale down to 2
ReplicaSet Old (v1)->>Pods: Delete 1 old pod (v1)
Deployment Controller->>ReplicaSet New (v2): Scale up to 2
ReplicaSet New (v2)->>Pods: Create 1 more pod (v2)
Note over Deployment Controller: Repeat until complete
Deployment Controller->>ReplicaSet New (v2): Scale to 3
Deployment Controller->>ReplicaSet Old (v1): Scale to 0
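The scaling sequence in the diagram follows from two bounds: total pods may not exceed the desired count plus maxSurge, and available pods may not drop below the desired count minus maxUnavailable. The sketch below simulates that arithmetic under a strong simplifying assumption (every new pod becomes Ready before the next step); `rolling_update` is a hypothetical function, not the Deployment controller's actual code.

```python
def rolling_update(desired, max_surge, max_unavailable):
    """Return the (old, new) ReplicaSet sizes after each scaling step.

    Simplification: every new pod is assumed Ready before the next step.
    """
    old, new = desired, 0
    steps = [(old, new)]
    while new < desired or old > 0:
        # Scale up the new ReplicaSet without exceeding desired + max_surge.
        room = desired + max_surge - (old + new)
        scale_up = min(room, desired - new)
        if scale_up > 0:
            new += scale_up
            steps.append((old, new))
        # Scale down the old ReplicaSet while keeping at least
        # desired - max_unavailable pods available.
        removable = (old + new) - (desired - max_unavailable)
        scale_down = min(removable, old)
        if scale_down > 0:
            old -= scale_down
            steps.append((old, new))
        if scale_up <= 0 and scale_down <= 0:
            raise ValueError("max_surge and max_unavailable cannot both be 0")
    return steps


for old, new in rolling_update(desired=3, max_surge=1, max_unavailable=0):
    print(f"old={old} new={new}")
```

With `desired=3`, `max_surge=1`, and `max_unavailable=0`, the total never exceeds 4 pods and never drops below 3, matching the one-up, one-down rhythm shown in the sequence diagram.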
ReplicaSet Controller — Pod Count Maintenance
The ReplicaSet controller has the simplest reconciliation logic: count running pods matching the selector, compare with spec.replicas, create or delete pods to match:
# Observe the reconciliation in action
# Delete a pod manually — watch the ReplicaSet controller recreate it
# Current state: 3 pods running
kubectl get pods -l app=web-frontend
# NAME READY STATUS AGE
# web-frontend-7d4b8c6-abc12 1/1 Running 2h
# web-frontend-7d4b8c6-def34 1/1 Running 2h
# web-frontend-7d4b8c6-ghi56 1/1 Running 2h
# Simulate failure — delete a pod
kubectl delete pod web-frontend-7d4b8c6-abc12
# Within seconds, controller detects drift (2 actual vs 3 desired)
kubectl get pods -l app=web-frontend
# NAME READY STATUS AGE
# web-frontend-7d4b8c6-def34 1/1 Running 2h
# web-frontend-7d4b8c6-ghi56 1/1 Running 2h
# web-frontend-7d4b8c6-jkl78 0/1 ContainerCreating 3s
# Check controller events
kubectl describe rs web-frontend-7d4b8c6 | grep -A5 Events
# Events:
# Type Reason Message
# ---- ------ -------
# Normal SuccessfulCreate Created pod: web-frontend-7d4b8c6-jkl78
HPA — Metric-Based Scaling Loop
The Horizontal Pod Autoscaler reconciles at a fixed interval (default 15 seconds), calculating desired replicas from current metrics:
# HPA desired state: scale between 2-10 based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70          # Target: 70% CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5min before scaling down
      policies:
      - type: Percent
        value: 10                       # Max 10% reduction per period
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
      - type: Percent
        value: 100                      # Can double capacity per period
        periodSeconds: 60
The HPA computes its target as desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue)). For example, if average CPU utilization is 140% of the requested CPU across 3 pods and the target is 70%, the HPA calculates ceil(3 × (140/70)) = 6 replicas needed.
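The formula is easy to check by hand or in a few lines of code. The function below is a sketch of the core calculation only, with clamping to the min/max bounds; the name `desired_replicas` and its parameters are illustrative, and the real autoscaler also applies a tolerance band and the `behavior` policies before changing the replica count.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=None):
    """Core HPA arithmetic: scale replicas by the metric ratio, then clamp."""
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    if max_replicas is not None:
        raw = min(raw, max_replicas)
    return max(raw, min_replicas)


# Worked example from the text: 3 pods at 140% CPU against a 70% target.
print(desired_replicas(3, 140, 70, min_replicas=2, max_replicas=10))  # 6
# Load halves below target: 4 pods at 35% against a 70% target.
print(desired_replicas(4, 35, 70, min_replicas=2, max_replicas=10))   # 2
```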
Custom Controllers & Operator Pattern
The Operator pattern extends Kubernetes' reconciliation machinery to manage any stateful application. An Operator is a custom controller that watches custom resources (instances of a type defined by a Custom Resource Definition, or CRD) and reconciles the application's lifecycle: installation, scaling, backup, upgrade, failover.
flowchart TD
CRD["Custom Resource\n(e.g., PostgresCluster)"] -->|"stored in"| ETCD["etcd"]
ETCD -->|"watched by"| OP["Operator Controller"]
OP -->|"manages"| PRIMARY["Primary DB Pod"]
OP -->|"manages"| REPLICA1["Replica Pod 1"]
OP -->|"manages"| REPLICA2["Replica Pod 2"]
OP -->|"manages"| BACKUP["Backup CronJob"]
OP -->|"manages"| SVC["Service + Endpoints"]
OP -->|"manages"| SECRET["TLS Certificates"]
subgraph "Control Plane Extension"
CRD
ETCD
OP
end
subgraph "Data Plane (Managed Workload)"
PRIMARY
REPLICA1
REPLICA2
BACKUP
SVC
SECRET
end
style OP fill:#BF092F,color:#fff
Operators encode domain-specific operational knowledge into software. Instead of a DBA manually failing over a database, the operator's reconciliation loop detects the failure and orchestrates the failover automatically — promoting a replica, updating endpoints, notifying dependent services.
CrunchyData PostgreSQL Operator — Reconciliation in Action
When you declare a PostgresCluster resource with replicas: 3, the Crunchy operator's reconciliation loop: (1) creates a primary StatefulSet, (2) creates replica StatefulSets with streaming replication configured, (3) creates PgBouncer connection pooling, (4) issues TLS certificates from its internal CA, (5) configures automated backups via pgBackRest, (6) monitors replication lag and triggers failover if primary is unresponsive for 30s. All from a single YAML declaration.
Level-Triggered vs Edge-Triggered
This is one of the most important architectural decisions in Kubernetes. Controllers are level-triggered, not edge-triggered. The difference is profound:
- Edge-triggered — react to changes (events). "A pod was deleted" → create one pod. If you miss the event, you miss the action.
- Level-triggered — react to state. "There should be 3 pods, there are 2" → create one pod. Even if you miss events, you eventually converge.
Practical consequences of level-triggered design:
- Idempotent reconciliation — running the same reconciliation 10 times produces the same result as running it once
- Crash recovery — a controller that crashes and restarts re-lists its objects from the API Server, rebuilds its cache, and reconciles every object, catching up on anything missed
- Resync periods — controllers periodically re-reconcile ALL objects, catching any drift that slipped through event delivery
- No ordering dependency — events can arrive out of order; the reconciler only cares about current state
# Demonstrate level-triggered behavior:
# Even after controller restart, desired state is maintained
# Scale deployment to 5
kubectl scale deployment web-frontend --replicas=5
# Simulate controller manager restart
# (assumes a kubeadm-style cluster; there the controller manager is a static pod,
# so this removes only its mirror pod and the process itself may keep running)
kubectl delete pod -n kube-system -l component=kube-controller-manager
# After a real restart, the controller lists all objects and reconciles
# No pods are lost: level-triggered means "what should be" is re-evaluated
kubectl get pods -l app=web-frontend
# All 5 pods still running: the new controller reads desired state
# from the API Server and observes actual state matches, no action needed
# Desired state is stored independently of the controller
# (In practice, kubelet still runs pods even without controller)
kubectl get deployment web-frontend -o jsonpath='{.spec.replicas}'
# 5 (desired state persists in etcd regardless of controller health)
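The difference between the two trigger styles shows up as soon as an event is lost. The sketch below is a toy model with hypothetical function names: two pods are deleted, but only one deletion event reaches the controller.

```python
def edge_triggered(pod_count, delivered_events):
    """React to each delivered 'deleted' event by creating one pod."""
    for event in delivered_events:
        if event == "deleted":
            pod_count += 1
    return pod_count


def level_triggered(pod_count, desired):
    """Ignore history and close the gap between desired and observed state."""
    return pod_count + (desired - pod_count)


desired = 3
observed = 1                      # two pods were deleted...
delivered = ["deleted"]           # ...but only one event arrived

print(edge_triggered(observed, delivered))   # 2: stuck one pod short
print(level_triggered(observed, desired))    # 3: converges regardless
```

The level-triggered version never needs the event history, which is why a restarted controller can pick up where the previous one left off.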
Drift Detection & Correction
One of Kubernetes' most powerful properties is self-healing from manual changes. If someone (or something) modifies the actual state outside of the declared desired state, the reconciliation loop will detect and correct the drift:
# Detect drift: compare live state vs desired state
kubectl diff -f deployment.yaml
# Shows what would change if you re-applied the desired state
# Example: someone manually scales the Deployment
kubectl scale deployment web-frontend --replicas=7
# But an HPA owns the replica count for this Deployment.
# On its next sync (every 15s by default) it recomputes the
# desired count from metrics, for example:
# current: 7 pods, desired (based on metrics): 4 pods
# and scales back toward 4, subject to its scaleDown behavior settings
# GitOps drift detection with Argo CD
# Argo CD runs its own reconciliation loop comparing
# Git (desired state) vs cluster (actual state)
kubectl get application web-frontend -n argocd -o jsonpath='{.status.sync.status}'
# "OutOfSync" — drift detected between Git and cluster
# Argo CD's reconciliation will either:
# 1. Auto-sync (if configured) — apply Git state to cluster
# 2. Alert — notify operators of drift for manual resolution
Drift Correction in Production
A junior engineer SSHs into a node and manually kills a container. In an imperative system, that container stays gone until someone notices. In Kubernetes: (1) the kubelet detects that the container exited and records the termination in the pod's status, (2) the kubelet restarts the container per the pod's restartPolicy: Always (only repeated failures would surface as CrashLoopBackOff, with growing delays between restarts), (3) if the pod itself is deleted, the ReplicaSet controller detects drift (actual < desired) and creates a replacement pod, (4) the scheduler places it on a healthy node. Several independent reconciliation loops catch the drift.
Reconciliation Frequency & Performance
Reconciliation isn't free. Every loop iteration consumes API Server resources, network bandwidth, and controller CPU. Kubernetes uses several mechanisms to balance responsiveness with efficiency:
Resync Period
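Even with watches, informers periodically replay every object in the local cache to the event handlers, so each object is reconciled again even if nothing changed. This catches updates that a controller dropped or handled incorrectly; a broken watch connection is handled separately, by the reflector re-listing and re-watching. Typical intervals:
- kube-controller-manager — resync period randomized between 12 and 24 hours by default (--min-resync-period=12h)
- HPA — 15-second sync interval (--horizontal-pod-autoscaler-sync-period)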
- Node controller — 5-second node health check interval
- Custom controllers — configurable, typically 30s-10min depending on urgency
Exponential Backoff on Errors
When reconciliation fails (API Server unavailable, invalid state, resource conflict), the work queue applies exponential backoff before retrying. The illustrative summary below uses the values of client-go's default controller rate limiter, which combines per-item exponential backoff with an overall token-bucket limit:
{
  "rate_limiter_config": {
    "per_item_backoff": "exponential",
    "base_delay": "5ms",
    "max_delay": "1000s",
    "overall_rate_limit": "10 qps, burst 100"
  },
  "example_backoff_sequence": [
    "Attempt 1: immediate",
    "Attempt 2: 5ms delay",
    "Attempt 3: 10ms delay",
    "Attempt 4: 20ms delay",
    "Attempt 5: 40ms delay",
    "Attempt 6: 80ms delay",
    "Later attempts: delay keeps doubling",
    "Max delay cap: 1000s"
  ],
  "reset_condition": "successful reconciliation resets backoff to 0"
}
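The per-item delay is plain doubling with a cap. A minimal sketch, using the 5ms base and 1000s cap listed above; `backoff_delays` is an illustrative name, and it models only the per-item exponential part, not any overall queue rate limit.

```python
def backoff_delays(failures, base_delay=0.005, max_delay=1000.0):
    """Delay in seconds before each retry: base_delay * 2**n, capped."""
    return [min(base_delay * (2 ** n), max_delay) for n in range(failures)]


# Delays before the first five retries of a repeatedly failing key.
for attempt, delay in enumerate(backoff_delays(5), start=2):
    print(f"Attempt {attempt}: wait {delay * 1000:.0f}ms")

# After enough consecutive failures the delay stops growing at the cap.
print(backoff_delays(20)[-1])
```

Because the delay is tracked per key, one persistently failing object backs off on its own without slowing reconciliation of healthy objects.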
Leader Election
In a high-availability setup with multiple controller manager replicas, only one leader actively reconciles at any time. Others remain on standby. If the leader crashes, a new leader is elected within the lease duration (typically 15 seconds).
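The lease mechanics can be sketched with a holder name and a renewal timestamp. This is a toy, single-process model under stated assumptions: the `Lease` class and `try_acquire_or_renew` function are hypothetical, time is passed in explicitly, and the 15-second duration mirrors the figure above. A real election stores the lease as an API object and relies on the API Server to arbitrate concurrent writers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0     # seconds a lease stays valid without renewal


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Take the lease if it is free, already ours, or expired."""
    expired = now - lease.renew_time > lease.duration
    if lease.holder in (None, candidate) or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False


lease = Lease()
print(try_acquire_or_renew(lease, "cm-a", now=0))    # True: lease was free
print(try_acquire_or_renew(lease, "cm-b", now=5))    # False: cm-a holds it
print(try_acquire_or_renew(lease, "cm-a", now=10))   # True: cm-a renews
print(try_acquire_or_renew(lease, "cm-b", now=26))   # True: cm-a stopped renewing
```

Only the replica that currently holds the lease runs its controllers; the others keep retrying and take over once renewals stop.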
Optimistic Concurrency
When two controllers (or a controller and a user) try to modify the same object simultaneously, Kubernetes uses resource versions for optimistic concurrency control. An update carries the resourceVersion that the writer last read; if it is stale (someone else wrote in between), the API Server rejects the update with a 409 Conflict. The controller simply re-reads the object and retries.
# Resource version enables optimistic concurrency
# The controller reads version 12345, makes changes, submits with that version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
  resourceVersion: "12345"   # Must match current version for update to succeed
spec:
  replicas: 5
# If another write changed resourceVersion to 12346,
# this update fails with 409 Conflict → controller retries
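The read, modify, write, retry cycle can be sketched against an in-memory store. This is a hypothetical model, not the API Server: `Store`, `Conflict`, and `update_with_retry` are invented names, and the integer version stands in for resourceVersion, which in Kubernetes is an opaque string that clients should only compare for equality.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class Store:
    """Single-object store that rejects writes made against a stale version."""

    def __init__(self, spec):
        self.spec = dict(spec)
        self.resource_version = 1

    def get(self):
        return dict(self.spec), self.resource_version

    def update(self, spec, resource_version):
        if resource_version != self.resource_version:
            raise Conflict(f"stale resourceVersion {resource_version}")
        self.spec = dict(spec)
        self.resource_version += 1


def update_with_retry(store, mutate, max_attempts=5):
    """Re-read and re-apply the change until the write is accepted."""
    for _ in range(max_attempts):
        spec, version = store.get()
        mutate(spec)
        try:
            store.update(spec, version)
            return store.resource_version
        except Conflict:
            continue
    raise Conflict("gave up after repeated conflicts")


store = Store({"replicas": 3})
_, stale_version = store.get()                       # another writer reads v1
update_with_retry(store, lambda s: s.update(replicas=5))
try:
    store.update({"replicas": 7}, stale_version)     # that writer is now stale
except Conflict as err:
    print("rejected:", err)
print(store.spec)
```

Re-applying the mutation to a fresh read, rather than resubmitting the old object, is what keeps the retry from overwriting the other writer's change.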
Reconciliation Is the Superpower
Kubernetes' reconciliation architecture is what makes it a distributed control system rather than just an orchestrator. Orchestrators execute sequences of commands. Control systems continuously maintain invariants. This distinction is why Kubernetes can self-heal from node failures, survive controller crashes, correct manual drift, and scale to thousands of nodes — the reconciliation loop makes the system eventually consistent without human intervention. Every component can fail independently and the system converges to the declared state once recovered.