Kubernetes Reconciliation Architecture

The Declarative Model

Kubernetes is built on a declarative paradigm: you tell the system what you want (desired state), not how to get there. The system continuously works to make reality match your declaration. This is fundamentally different from imperative systems where you issue commands ("start 3 containers") and hope the system maintains that state.

Declarative Model: Desired State → Actual State → Reconciliation

flowchart LR
    USER["User declares\nDesired State"] --> ETCD["etcd stores\nDesired State"]
    ETCD --> CTRL["Controller reads\nDesired State"]
    CTRL --> COMPARE{"Compare\nDesired vs Actual"}
    COMPARE -->|"Drift detected"| ACT["Take Corrective\nAction"]
    ACT --> ACTUAL["Actual State\nconverges"]
    ACTUAL --> CTRL
    COMPARE -->|"No drift"| WAIT["Wait for\nnext cycle"]
    WAIT --> CTRL

The declarative model has three fundamental components:

Desired State — stored in etcd as resource specifications (e.g., spec.replicas: 3)
Actual State — observed from the real world (e.g., kubelet reports 2 running pods)
Reconciliation — the process of driving actual state toward desired state

                            
Key Insight: The user never says "create a pod." The user says "I want 3 replicas." If a pod dies, no human intervention is needed: the reconciliation loop detects the drift (2 actual vs 3 desired) and creates a replacement. This is self-healing by design.

# Desired state declaration — user intent, not imperative command
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
spec:
  replicas: 3                    # Desired: 3 pods running
  selector:
    matchLabels:
      app: web-frontend
  strategy:
    type: RollingUpdate          # How to reconcile during updates
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            cpu: 100m
            memory: 128Mi

Control Loop Pattern

Every controller in Kubernetes implements the same fundamental pattern — an infinite loop that observes, compares, and acts. This is the control loop, borrowed from control systems engineering (think thermostats, cruise control, industrial PID controllers).

Reconciliation Control Loop

flowchart TD
    OBS["1. OBSERVE\nRead current state\nfrom cluster"] --> DIFF["2. DIFF\nCompare current state\nwith desired state"]
    DIFF --> ACT["3. ACT\nTake corrective action\nto close the gap"]
    ACT --> UPDATE["4. UPDATE STATUS\nWrite status back\nto API Server"]
    UPDATE --> OBS

    style OBS fill:#3B9797,color:#fff
    style DIFF fill:#16476A,color:#fff
    style ACT fill:#BF092F,color:#fff
    style UPDATE fill:#132440,color:#fff

The four phases of the control loop:

Observe — read current state from the cluster via informers (cached watches on API Server)
Diff — compare observed state with the desired state stored in the resource spec
Act — if a difference exists, take the minimum action needed to converge (create pod, delete pod, update config)
Update Status — write the observed result back to the resource's .status field
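
The four phases above can be sketched against an in-memory stand-in for the cluster. This is a minimal illustration, not how any real controller is written: `FakeCluster` and `reconcile` are hypothetical names, and pods are plain strings rather than API objects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for the cluster; not a real Kubernetes client."""
    desired_replicas: int
    pods: list = field(default_factory=list)
    status: dict = field(default_factory=dict)
    next_id: int = 0


def reconcile(cluster: FakeCluster) -> str:
    # 1. OBSERVE: read current state
    actual = len(cluster.pods)
    # 2. DIFF: compare with desired state
    drift = cluster.desired_replicas - actual
    # 3. ACT: take the minimum action needed to converge
    if drift > 0:
        for _ in range(drift):
            cluster.pods.append(f"pod-{cluster.next_id}")
            cluster.next_id += 1
        action = f"created {drift}"
    elif drift < 0:
        del cluster.pods[drift:]  # drop the surplus pods from the end of the list
        action = f"deleted {-drift}"
    else:
        action = "none"
    # 4. UPDATE STATUS: record what was observed after acting
    cluster.status = {"replicas": len(cluster.pods)}
    return action


if __name__ == "__main__":
    cluster = FakeCluster(desired_replicas=3)
    print(reconcile(cluster))  # created 3
    cluster.pods.pop(0)        # simulate a pod dying
    print(reconcile(cluster))  # created 1
    print(reconcile(cluster))  # none
```

Calling `reconcile` repeatedly is harmless: once actual matches desired, every further pass reports "none".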

Control Theory Parallel

Kubernetes as a PID Controller

In control systems engineering, a PID controller continuously calculates an error value (the difference between a setpoint and a measured variable) and applies a correction. Kubernetes controllers follow the same feedback principle: the "setpoint" is spec.replicas, the "measured variable" is the count of ready pods, and the "correction" is creating or deleting pods. The analogy is loose, though: a Kubernetes controller acts only on the current error, with no integral or derivative term, and it operates on discrete objects (pods) rather than continuous signals.


How Controllers Implement Reconciliation

Controllers don't poll the API Server on every loop iteration — that would be catastrophically expensive at scale. Instead, they use the informer pattern: a local cached copy of cluster state maintained by long-lived watch connections to the API Server.

Controller Architecture: Informers, Work Queue, Reconciler

flowchart TD
    ETCD["etcd"] -->|"watch stream"| API["API Server"]
    API -->|"watch events"| INF["Informer\n(SharedIndexInformer)"]
    INF -->|"cache update"| CACHE["Local Cache\n(Thread-safe Store)"]
    INF -->|"event handler"| QUEUE["Work Queue\n(Rate-limited)"]
    QUEUE -->|"dequeue key"| RECONCILE["Reconcile(key)\nFunction"]
    RECONCILE -->|"read from"| CACHE
    RECONCILE -->|"write to"| API

    style INF fill:#3B9797,color:#fff
    style QUEUE fill:#16476A,color:#fff
    style RECONCILE fill:#BF092F,color:#fff

Informers — Efficient State Watching

An informer establishes a single watch connection to the API Server and maintains a local cache of all objects matching its criteria. When an object changes, the informer receives the event and updates its cache — no polling required.

SharedIndexInformer — multiple controllers share one watch connection per resource type
Reflector — handles the LIST+WATCH protocol (initial list, then incremental watches)
DeltaFIFO — queues incoming events as deltas (Added, Updated, Deleted)
Indexer — stores objects with custom indexes for fast lookup (e.g., by namespace, label)
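
As a rough sketch of how these pieces fit together, the toy class below seeds a cache from an initial list, applies watch deltas to it, maintains a namespace index, and notifies a handler with the object key. `ToyInformer` and its method names are hypothetical; the real implementation lives in client-go and is considerably more involved.

```python
from collections import defaultdict


class ToyInformer:
    """Toy model of an informer cache; not the client-go implementation."""

    def __init__(self, handler):
        self.store = {}                       # key -> object (the local cache)
        self.by_namespace = defaultdict(set)  # index: namespace -> keys
        self.handler = handler                # called with the key on every change

    @staticmethod
    def key_of(obj):
        return f"{obj['namespace']}/{obj['name']}"

    def sync(self, objects):
        """Initial LIST: seed the cache with a full snapshot."""
        for obj in objects:
            self.apply("Added", obj)

    def apply(self, event_type, obj):
        """WATCH: apply one delta (Added, Updated, Deleted) to the cache."""
        key = self.key_of(obj)
        if event_type == "Deleted":
            self.store.pop(key, None)
            self.by_namespace[obj["namespace"]].discard(key)
        else:
            self.store[key] = obj
            self.by_namespace[obj["namespace"]].add(key)
        self.handler(key)

    def in_namespace(self, namespace):
        """Indexed lookup served from local memory, with no API call."""
        return sorted(self.by_namespace[namespace])


if __name__ == "__main__":
    informer = ToyInformer(handler=lambda key: print("enqueue", key))
    informer.sync([{"namespace": "production", "name": "web-frontend"}])
    informer.apply("Deleted", {"namespace": "production", "name": "web-frontend"})
    print(informer.in_namespace("production"))  # []
```

Note that the handler receives only the key, never the event type, which is what lets the reconciler work from current state alone.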

Work Queue — Rate Limiting and Deduplication

When an informer detects a change, it doesn't immediately trigger reconciliation. Instead, it enqueues the object's key (namespace/name) into a work queue. This provides critical guarantees:

Deduplication: if an object changes 10 times before processing, only one reconciliation runs
Rate limiting: prevents a thundering herd of requeues from overwhelming the API Server
Retry with backoff: failed reconciliations are re-enqueued with exponential delay
Per-key serialization: each key is processed by at most one worker at a time, so two workers never reconcile the same object concurrently
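
The deduplication and one-worker-per-key behavior can be sketched with two sets and a queue. This is a simplified illustration, loosely modelled on the dirty/processing bookkeeping used by client-go's work queue; `DedupQueue` is a hypothetical name, and rate limiting and backoff are left out.

```python
from collections import deque


class DedupQueue:
    """Toy work queue: collapses duplicate keys and never hands one key to two workers."""

    def __init__(self):
        self.queue = deque()     # keys waiting, in arrival order
        self.dirty = set()       # keys that need (re)processing
        self.processing = set()  # keys currently held by a worker

    def add(self, key):
        if key in self.dirty:
            return               # already pending: collapse into a single run
        self.dirty.add(key)
        if key not in self.processing:
            self.queue.append(key)
        # If a worker holds the key, done() re-queues it afterwards.

    def get(self):
        key = self.queue.popleft()
        self.dirty.discard(key)
        self.processing.add(key)
        return key

    def done(self, key):
        self.processing.discard(key)
        if key in self.dirty:    # the object changed again while being processed
            self.queue.append(key)

    def __len__(self):
        return len(self.queue)


if __name__ == "__main__":
    q = DedupQueue()
    for _ in range(10):
        q.add("production/web-frontend")
    print(len(q))  # 1
```

Because only the key is queued, the worker that eventually runs reads the latest state from the cache, so nothing is lost by collapsing ten changes into one run.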

The Reconcile Function

The actual business logic lives in the Reconcile() function. It receives an object key, reads the desired state from cache, observes actual state, and takes action:

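# Python custom controller using the kopf framework
# Reconcile handler for a custom CronBackup resource
import kopf
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException


@kopf.on.startup()
def load_cluster_config(**kwargs):
    # Load in-cluster credentials once at startup, not on every reconcile
    config.load_incluster_config()


def _build_cronjob(name, namespace, schedule, target, retention_days):
    """Build the desired CronJob manifest.

    The image and environment variable names are placeholders for illustration.
    """
    return {
        'apiVersion': 'batch/v1',
        'kind': 'CronJob',
        'metadata': {'name': name, 'namespace': namespace},
        'spec': {
            'schedule': schedule,
            'jobTemplate': {'spec': {'template': {'spec': {
                'restartPolicy': 'OnFailure',
                'containers': [{
                    'name': 'backup',
                    'image': 'example.com/db-backup:latest',  # placeholder image
                    'env': [
                        {'name': 'TARGET_DATABASE', 'value': str(target)},
                        {'name': 'RETENTION_DAYS', 'value': str(retention_days)},
                    ],
                }],
            }}}},
        },
    }


@kopf.on.create('backups.example.com', 'v1', 'cronbackups')
@kopf.on.update('backups.example.com', 'v1', 'cronbackups')
@kopf.on.resume('backups.example.com', 'v1', 'cronbackups')
def reconcile_cronbackup(spec, name, namespace, **kwargs):
    """
    Reconciliation handler for the CronBackup custom resource.
    Ensures a CronJob exists matching the desired backup spec.
    """
    batch_v1 = client.BatchV1Api()

    # Desired state, read from the custom resource spec
    desired_schedule = spec.get('schedule', '0 2 * * *')
    desired_target = spec.get('target_database')
    desired_retention = spec.get('retention_days', 30)

    cronjob_name = f"backup-{name}"

    # 1. OBSERVE: check whether the CronJob already exists
    try:
        existing = batch_v1.read_namespaced_cron_job(cronjob_name, namespace)
    except ApiException as e:
        if e.status != 404:
            raise
        # 3. ACT: the CronJob doesn't exist, so create it
        cronjob = _build_cronjob(cronjob_name, namespace,
                                 desired_schedule, desired_target, desired_retention)
        batch_v1.create_namespaced_cron_job(namespace, cronjob)
        return {'message': f'Created CronJob {cronjob_name}'}

    # 2. DIFF: compare desired vs actual (only the schedule is compared here)
    if existing.spec.schedule == desired_schedule:
        return {'message': 'CronJob already matches desired state'}

    # 3. ACT: update the CronJob to match desired state
    existing.spec.schedule = desired_schedule
    batch_v1.replace_namespaced_cron_job(cronjob_name, namespace, existing)
    return {'message': f'Updated schedule to {desired_schedule}'}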

Built-in Controller Examples

Kubernetes ships with dozens of built-in controllers, each typically responsible for reconciling one resource type. The kube-controller-manager binary runs them all in a single process with shared informers.

Deployment Controller — Rolling Update Reconciliation

The Deployment controller doesn't manage pods directly. It manages ReplicaSets, which in turn manage pods. During a rolling update, it orchestrates the transition between old and new ReplicaSets:

Deployment Rolling Update — Two ReplicaSets

sequenceDiagram
    participant User
    participant Deployment Controller
    participant ReplicaSet Old (v1)
    participant ReplicaSet New (v2)
    participant Pods

    User->>Deployment Controller: Update image to v2
    Deployment Controller->>ReplicaSet New (v2): Create with replicas=1
    ReplicaSet New (v2)->>Pods: Create 1 new pod (v2)
    Note over Deployment Controller: Wait for v2 pod Ready
    Deployment Controller->>ReplicaSet Old (v1): Scale down to 2
    ReplicaSet Old (v1)->>Pods: Delete 1 old pod (v1)
    Deployment Controller->>ReplicaSet New (v2): Scale up to 2
    ReplicaSet New (v2)->>Pods: Create 1 more pod (v2)
    Note over Deployment Controller: Repeat until complete
    Deployment Controller->>ReplicaSet New (v2): Scale to 3
    Deployment Controller->>ReplicaSet Old (v1): Scale to 0
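
The sequence above can be reproduced with a small simulation of the surge and availability limits. This is a sketch under a strong simplifying assumption: every new pod becomes Ready before the next step. `rolling_update_steps` is a hypothetical helper, not the Deployment controller's actual algorithm.

```python
def rolling_update_steps(replicas, max_surge=1, max_unavailable=0):
    """Return the (old, new) ReplicaSet sizes after each scaling step.

    Assumes every new pod becomes Ready before the next step.
    """
    old, new = replicas, 0
    steps = [(old, new)]
    while new < replicas or old > 0:
        # Scale up the new ReplicaSet without exceeding replicas + max_surge in total
        scale_up = min(replicas - new, replicas + max_surge - (old + new))
        if scale_up > 0:
            new += scale_up
            steps.append((old, new))
        # Scale down the old ReplicaSet while keeping replicas - max_unavailable pods
        scale_down = min(old, (old + new) - (replicas - max_unavailable))
        if scale_down > 0:
            old -= scale_down
            steps.append((old, new))
        if scale_up <= 0 and scale_down <= 0:
            raise ValueError("no progress possible: max_surge and max_unavailable are both 0")
    return steps


if __name__ == "__main__":
    for old, new in rolling_update_steps(3, max_surge=1, max_unavailable=0):
        print(f"old={old} new={new}")
```

With replicas=3, maxSurge=1 and maxUnavailable=0, the total never exceeds 4 pods and never drops below 3, matching the settings in the Deployment manifest earlier in the article.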

ReplicaSet Controller — Pod Count Maintenance

The ReplicaSet controller has the simplest reconciliation logic: count the active (non-terminating) pods matching the selector, compare with spec.replicas, and create or delete pods to match:

# Observe the reconciliation in action
# Delete a pod manually — watch the ReplicaSet controller recreate it

# Current state: 3 pods running
kubectl get pods -l app=web-frontend
# NAME                           READY   STATUS    AGE
# web-frontend-7d4b8c6-abc12    1/1     Running   2h
# web-frontend-7d4b8c6-def34    1/1     Running   2h
# web-frontend-7d4b8c6-ghi56    1/1     Running   2h

# Simulate failure — delete a pod
kubectl delete pod web-frontend-7d4b8c6-abc12

# Within seconds, controller detects drift (2 actual vs 3 desired)
kubectl get pods -l app=web-frontend
# NAME                           READY   STATUS              AGE
# web-frontend-7d4b8c6-def34    1/1     Running             2h
# web-frontend-7d4b8c6-ghi56    1/1     Running             2h
# web-frontend-7d4b8c6-jkl78    0/1     ContainerCreating   3s

# Check controller events
kubectl describe rs web-frontend-7d4b8c6 | grep -A5 Events
# Events:
#   Type    Reason            Message
#   ----    ------            -------
#   Normal  SuccessfulCreate  Created pod: web-frontend-7d4b8c6-jkl78
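
The counting step behind that output can be sketched as a selector match followed by a subtraction. The pod dictionaries and the `terminating` flag are illustrative assumptions, not the real API object shape, and only equality-based label selectors are handled.

```python
def matches(selector, labels):
    """True when every key/value pair in the selector appears in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def replica_diff(desired, selector, pods):
    """Return how many pods to create (positive) or delete (negative)."""
    active = [
        pod for pod in pods
        if matches(selector, pod["labels"]) and not pod.get("terminating", False)
    ]
    return desired - len(active)


if __name__ == "__main__":
    pods = [
        {"name": "web-frontend-7d4b8c6-def34", "labels": {"app": "web-frontend"}},
        {"name": "web-frontend-7d4b8c6-ghi56", "labels": {"app": "web-frontend"}},
    ]
    print(replica_diff(3, {"app": "web-frontend"}, pods))  # 1
```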

HPA — Metric-Based Scaling Loop

The Horizontal Pod Autoscaler reconciles at a fixed interval (default 15 seconds), calculating desired replicas from current metrics:

# HPA desired state: scale between 2-10 based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # Target: 70% CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5min before scaling down
      policies:
      - type: Percent
        value: 10                       # Max 10% reduction per period
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
      - type: Percent
        value: 100                      # Can double capacity per period
        periodSeconds: 60

                            
HPA Formula: desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue)). If current CPU is 140% across 3 pods and target is 70%, the HPA calculates ceil(3 × (140/70)) = 6 replicas needed.
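
The formula is easy to check numerically. The sketch below applies it and then clamps the result to minReplicas and maxReplicas; `desired_replicas` is a hypothetical helper, and the real controller additionally applies a tolerance band and the behavior policies, which are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=None):
    """Apply the HPA formula, then clamp to the configured bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    if max_replicas is not None:
        raw = min(raw, max_replicas)
    return max(raw, min_replicas)


if __name__ == "__main__":
    # 3 pods at 140% CPU with a 70% target
    print(desired_replicas(3, 140, 70, min_replicas=2, max_replicas=10))  # 6
```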
                        

Custom Controllers & Operator Pattern

The Operator pattern extends Kubernetes' reconciliation machinery to manage complex, often stateful, applications. An Operator is a custom controller that watches custom resources (instances of a type defined by a Custom Resource Definition, or CRD) and reconciles the application's lifecycle: installation, scaling, backup, upgrade, failover.

Operator Pattern — Extending the Control Plane

flowchart TD
    CRD["Custom Resource\n(e.g., PostgresCluster)"] -->|"stored in"| ETCD["etcd"]
    ETCD -->|"watched by"| OP["Operator Controller"]
    OP -->|"manages"| PRIMARY["Primary DB Pod"]
    OP -->|"manages"| REPLICA1["Replica Pod 1"]
    OP -->|"manages"| REPLICA2["Replica Pod 2"]
    OP -->|"manages"| BACKUP["Backup CronJob"]
    OP -->|"manages"| SVC["Service + Endpoints"]
    OP -->|"manages"| SECRET["TLS Certificates"]

    subgraph "Control Plane Extension"
        CRD
        ETCD
        OP
    end

    subgraph "Data Plane (Managed Workload)"
        PRIMARY
        REPLICA1
        REPLICA2
        BACKUP
        SVC
        SECRET
    end

    style OP fill:#BF092F,color:#fff

Operators encode domain-specific operational knowledge into software. Instead of a DBA manually failing over a database, the operator's reconciliation loop detects the failure and orchestrates the failover automatically — promoting a replica, updating endpoints, notifying dependent services.
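
A failover decision of this kind might be sketched as follows. Everything here is hypothetical: the member dictionaries, the `lag` field, and the promotion rule (pick the healthy replica with the least replication lag) are illustrative assumptions, not the behavior of any particular operator.

```python
def reconcile_failover(members):
    """Promote a replica if the primary is unhealthy; return the promoted name or None.

    members: list of dicts with 'name', 'role', 'healthy' and 'lag' keys.
    """
    primary = next((m for m in members if m["role"] == "primary"), None)
    if primary is not None and primary["healthy"]:
        return None  # actual state already matches desired state

    candidates = [m for m in members if m["role"] == "replica" and m["healthy"]]
    if not candidates:
        return None  # no safe promotion possible; a later pass will try again

    best = min(candidates, key=lambda m: m["lag"])
    if primary is not None:
        primary["role"] = "replica"  # the failed primary rejoins as a replica later
    best["role"] = "primary"
    return best["name"]


if __name__ == "__main__":
    cluster = [
        {"name": "pg-0", "role": "primary", "healthy": False, "lag": 0},
        {"name": "pg-1", "role": "replica", "healthy": True, "lag": 120},
        {"name": "pg-2", "role": "replica", "healthy": True, "lag": 15},
    ]
    print(reconcile_failover(cluster))  # pg-2
    print(reconcile_failover(cluster))  # None
```

The second call does nothing because the function looks at current state rather than at the failure event, so running it again after a successful promotion is safe.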

Real-World Example

CrunchyData PostgreSQL Operator — Reconciliation in Action

When you declare a PostgresCluster resource with replicas: 3, the Crunchy operator's reconciliation loop: (1) creates a primary StatefulSet, (2) creates replica StatefulSets with streaming replication configured, (3) creates PgBouncer connection pooling, (4) issues TLS certificates from its internal CA, (5) configures automated backups via pgBackRest, (6) monitors replication lag and triggers failover if primary is unresponsive for 30s. All from a single YAML declaration.


Level-Triggered vs Edge-Triggered

This is one of the most important architectural decisions in Kubernetes. Controllers are level-triggered, not edge-triggered. The difference is profound:

Edge-triggered — react to changes (events). "A pod was deleted" → create one pod. If you miss the event, you miss the action.
Level-triggered — react to state. "There should be 3 pods, there are 2" → create one pod. Even if you miss events, you eventually converge.

                            
Why Level-Triggered Wins: In a distributed system, messages get lost, controllers crash and restart, watches disconnect. An edge-triggered system that misses an event will permanently diverge from desired state. A level-triggered system will self-correct on the next reconciliation: it doesn't matter what happened, only what is vs what should be.

Practical consequences of level-triggered design:

Idempotent reconciliation: running the same reconciliation 10 times produces the same result as running it once
Crash recovery: a controller that crashes and restarts re-lists all objects, rebuilds its cache, and reconciles each one, catching up on anything missed
Resync periods: controllers periodically re-reconcile ALL objects, catching any drift that slipped through event delivery
No ordering dependency: events can arrive out of order; the reconciler only cares about current state
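
The difference shows up as soon as one event is lost. In this deliberately small sketch, each list entry is one pod deletion and the boolean says whether the event reached the controller; both function names are hypothetical.

```python
def edge_triggered(actual, events_delivered):
    """React to events: create one pod per deletion event that was actually received."""
    for delivered in events_delivered:
        actual -= 1          # a pod is deleted in the cluster
        if delivered:
            actual += 1      # the controller saw the event and replaces the pod
    return actual


def level_triggered(actual, events_delivered, desired=3):
    """React to state: ignore the events and close whatever gap exists."""
    for _ in events_delivered:
        actual -= 1          # the same deletions happen
    # One reconcile pass compares current state with desired state
    return actual + (desired - actual)


if __name__ == "__main__":
    events = [True, False, True]          # the second deletion event is lost
    print(edge_triggered(3, events))      # 2
    print(level_triggered(3, events))     # 3
```

The edge-triggered version stays one pod short until something else intervenes, while the level-triggered version converges regardless of which events were delivered.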

# Demonstrate level-triggered behavior:
# Even after a controller restart, desired state is maintained

# Scale deployment to 5
kubectl scale deployment web-frontend --replicas=5

# Simulate a controller manager restart
# (Where kube-controller-manager runs as a static pod, as on kubeadm clusters,
# this only deletes the mirror pod; to actually restart the process, move its
# manifest out of the kubelet's static pod directory and back.)
kubectl delete pod -n kube-system -l component=kube-controller-manager

# The controller restarts, lists all objects, and reconciles
# No pods are lost: level-triggered means "what should be" is re-evaluated
kubectl get pods -l app=web-frontend
# All 5 pods still running: the new controller reads desired state
# from the API Server and observes that actual state matches, so no action is needed

# Desired state survives controller downtime
# (kubelet keeps running existing pods even without a controller)
kubectl get deployment web-frontend -o jsonpath='{.spec.replicas}'
# 5 (desired state persists in etcd regardless of controller health)

Drift Detection & Correction

One of Kubernetes' most powerful properties is self-healing from manual changes. If someone (or something) modifies the actual state outside of the declared desired state, the reconciliation loop will detect and correct the drift:

# Detect drift: compare live state vs desired state
kubectl diff -f deployment.yaml
# Shows what would change if you re-applied the desired state

# Example: someone manually changes the replica count
kubectl scale deployment web-frontend --replicas=7

# The HPA owns spec.replicas for this Deployment (minReplicas 2, maxReplicas 10)
# On its next sync (every 15s) the HPA reconciliation loop will detect:
#   current: 7 pods, desired (based on metrics): 4 pods
# and scale back down toward 4, paced by the scaleDown behavior policy

# GitOps drift detection with Argo CD
# Argo CD runs its own reconciliation loop comparing
# Git (desired state) vs cluster (actual state)
kubectl get application web-frontend -n argocd -o jsonpath='{.status.sync.status}'
# "OutOfSync" — drift detected between Git and cluster

# Argo CD's reconciliation will either:
# 1. Auto-sync (if configured) — apply Git state to cluster
# 2. Alert — notify operators of drift for manual resolution
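
At its core, drift detection is a field-by-field comparison of desired state against live state. The sketch below is a simplified illustration of that idea, not how kubectl diff or Argo CD are implemented; `find_drift` is a hypothetical helper and handles nested dictionaries only.

```python
def find_drift(desired, live, path=""):
    """Yield (path, desired_value, live_value) for every field that differs.

    Only fields present in the desired state are checked, so values the
    cluster adds on its own (status, defaults) are not reported as drift.
    """
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key) if isinstance(live, dict) else None
        if isinstance(want, dict):
            yield from find_drift(want, have if isinstance(have, dict) else {}, here)
        elif want != have:
            yield here, want, have


if __name__ == "__main__":
    desired = {"spec": {"replicas": 3, "template": {"image": "nginx:1.25"}}}
    live = {
        "spec": {"replicas": 7, "template": {"image": "nginx:1.25"}},
        "status": {"readyReplicas": 7},
    }
    for field, want, have in find_drift(desired, live):
        print(f"{field}: desired={want} live={have}")  # spec.replicas: desired=3 live=7
```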

Self-Healing Scenario

Drift Correction in Production

A junior engineer SSHs into a node and manually kills a container. In an imperative system, that container stays gone until someone notices and restarts it. In Kubernetes: (1) kubelet detects that the container exited and records the termination in the pod status (CrashLoopBackOff appears only if the container keeps failing), (2) kubelet automatically restarts the container per the pod's restartPolicy: Always, (3) if the pod itself is deleted, the ReplicaSet controller detects drift (actual < desired) and creates a replacement pod, (4) the scheduler places it on a healthy node. Four layers of reconciliation catch the drift.


Reconciliation Frequency & Performance

Reconciliation isn't free. Every loop iteration consumes API Server resources, network bandwidth, and controller CPU. Kubernetes uses several mechanisms to balance responsiveness with efficiency:

Resync Period

Even with watches, informers periodically replay every object in their local cache to the event handlers, so each object is reconciled again even if nothing changed. This catches drift that slipped past event handling; a dropped watch connection is handled separately, by re-listing from the API Server. Typical periods and loop intervals:

kube-controller-manager: resync is governed by --min-resync-period, which defaults to 12 hours
HPA: 15-second sync interval
Node controller: 5-second node health check interval
Custom controllers: configurable, from tens of seconds to hours depending on urgency

Exponential Backoff on Errors

When reconciliation fails (API Server unavailable, invalid state, resource conflict), the work queue applies exponential backoff before retrying. The values below mirror client-go's default controller rate limiter: per-item exponential backoff from 5 ms up to a 1000 s cap, combined with an overall token bucket of 10 requeues per second with a burst of 100.

{
  "rate_limiter_config": {
    "base_delay": "5ms",
    "max_delay": "1000s",
    "overall_rate_per_second": 10,
    "overall_burst": 100
  },
  "example_backoff_sequence": [
    "Attempt 1: immediate",
    "Attempt 2: 5ms delay",
    "Attempt 3: 10ms delay",
    "Attempt 4: 20ms delay",
    "Attempt 5: 40ms delay",
    "Attempt 6: 80ms delay",
    "Attempt n: delay doubles after each failure",
    "Max delay cap: 1000s"
  ],
  "reset_condition": "successful reconciliation resets backoff to 0"
}
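
The per-item exponential part of that schedule is a one-line formula. The sketch below uses the 5 ms base and 1000 s cap from the configuration above and works in integer milliseconds; `backoff_delay_ms` is a hypothetical helper, and any overall rate limit is ignored.

```python
def backoff_delay_ms(failures, base_ms=5, cap_ms=1_000_000):
    """Delay in milliseconds before the retry that follows `failures` consecutive failures."""
    if failures <= 0:
        return 0  # first attempt, or backoff reset after a success
    return min(base_ms * 2 ** (failures - 1), cap_ms)


if __name__ == "__main__":
    print([backoff_delay_ms(n) for n in range(6)])  # [0, 5, 10, 20, 40, 80]
```

With these values the delay doubles until it reaches the cap at the 19th consecutive failure, which is why a persistently failing object ends up being retried only about once every 17 minutes.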

Leader Election

In a high-availability setup with multiple controller manager replicas, only one leader actively reconciles at any time. Others remain on standby. If the leader crashes, a new leader is elected within the lease duration (typically 15 seconds).
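
Lease-based election can be sketched as a record of who holds the lease and when it was last renewed. This is a toy model under stated assumptions: `Lease` and `try_acquire` are hypothetical names, time is passed in explicitly, and clock skew and concurrent writers are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renewed_at: float = 0.0


def try_acquire(lease, candidate, now, lease_duration=15.0):
    """Return True if `candidate` holds the lease after this attempt."""
    expired = lease.holder is None or now - lease.renewed_at >= lease_duration
    if lease.holder == candidate or expired:
        lease.holder = candidate   # acquire, or renew if already the holder
        lease.renewed_at = now
        return True
    return False                   # someone else holds a live lease: stay on standby


if __name__ == "__main__":
    lease = Lease()
    print(try_acquire(lease, "cm-a", now=0))    # True
    print(try_acquire(lease, "cm-b", now=5))    # False
    print(try_acquire(lease, "cm-b", now=20))   # True
```

The standby replica takes over only after the holder has failed to renew for a full lease duration, which is where the failover delay comes from.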

                            
Performance at Scale: In a 5,000-node cluster with 150,000 pods, the controller manager processes approximately 500 reconciliation events per second. The informer cache eliminates 99%+ of API Server reads: controllers read from local memory, only writing to the API Server when taking action. This architecture scales to clusters with millions of objects.

Optimistic Concurrency

When two controllers (or a controller and a user) try to modify the same object simultaneously, Kubernetes uses resource versions for optimistic concurrency control. Every object carries a metadata.resourceVersion, and an update that includes it succeeds only if it still matches the stored object. If it's stale (someone else wrote in between), the API Server rejects the update with a 409 Conflict. The controller simply re-reads and retries.

# Resource version enables optimistic concurrency
# The controller reads version 12345, makes changes, submits with that version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
  resourceVersion: "12345"    # Must match current version for update to succeed
spec:
  replicas: 5
# If another write changed resourceVersion to 12346,
# this update fails with 409 Conflict → controller retries
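
The read, modify, write, retry cycle can be sketched against an in-memory store. `Store`, `Conflict` and `update_with_retry` are hypothetical names, and the version is an integer counter only for readability; real resourceVersion values are opaque strings that clients should compare only for equality.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class Store:
    """In-memory stand-in for the API Server's versioned storage."""

    def __init__(self):
        self.objects = {}  # name -> (resource_version, spec)

    def get(self, name):
        version, spec = self.objects[name]
        return version, dict(spec)

    def update(self, name, expected_version, spec):
        version, _ = self.objects[name]
        if version != expected_version:
            raise Conflict(f"{name}: stored version {version}, got {expected_version}")
        self.objects[name] = (version + 1, dict(spec))
        return version + 1


def update_with_retry(store, name, mutate, max_attempts=5):
    """Re-read and re-apply the change until the write lands or attempts run out."""
    for _ in range(max_attempts):
        version, spec = store.get(name)
        mutate(spec)
        try:
            return store.update(name, version, spec)
        except Conflict:
            continue  # someone else wrote in between: start over from a fresh read
    raise Conflict(f"{name}: gave up after {max_attempts} attempts")


if __name__ == "__main__":
    store = Store()
    store.objects["web-frontend"] = (12345, {"replicas": 3})
    print(update_with_retry(store, "web-frontend", lambda s: s.update(replicas=5)))  # 12346
```

Re-applying the mutation to a fresh read, rather than resubmitting the stale object, is what keeps the other writer's change from being overwritten.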

Key Takeaway

Reconciliation Is the Superpower

Kubernetes' reconciliation architecture is what makes it a distributed control system rather than just an orchestrator. Orchestrators execute sequences of commands. Control systems continuously maintain invariants. This distinction is why Kubernetes can self-heal from node failures, survive controller crashes, correct manual drift, and scale to thousands of nodes — the reconciliation loop makes the system eventually consistent without human intervention. Every component can fail independently and the system converges to the declared state once recovered.

