
Kubernetes Control Plane — API Server, etcd, Scheduler & Controllers

May 15, 2026 Wasil Zafar 26 min read

"The Kubernetes control plane is the brain of the cluster — it observes the desired state declared by users, compares it to actual state reported by nodes, and continuously reconciles the difference." — Understanding this machinery is key to operating Kubernetes at scale.

Table of Contents

  1. API Server — The Single Entry Point
  2. Watch Mechanism & Informers
  3. etcd — The Source of Truth
  4. Scheduler — Filtering & Scoring
  5. Controller Manager — Reconciliation Loops
  6. Cloud Controller Manager
  7. High Availability Configuration

API Server — The Single Entry Point

The kube-apiserver is the only component that talks directly to etcd. Every other component — scheduler, controllers, kubelet, kubectl — communicates exclusively through the API Server. This makes it the central hub of the entire control plane.

Every request to the API Server passes through a strict pipeline:

API Server Request Pipeline
flowchart LR
    REQ["Client Request\n(kubectl, kubelet)"] --> AUTH["Authentication\n(certs, tokens, OIDC)"]
    AUTH --> AUTHZ["Authorization\n(RBAC, ABAC, Webhook)"]
    AUTHZ --> ADM["Mutating Admission\n(defaults, injection)"]
    ADM --> VAL["Schema Validation\n(object schema check)"]
    VAL --> VADM["Validating Admission\n(policies, quotas)"]
    VADM --> ETCD["etcd Write\n(persist state)"]
    ETCD --> RESP["Response\nto Client"]
                            

Authentication

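Multiple authentication strategies can be enabled at the same time. The first one that successfully authenticates the request short-circuits evaluation, and the order in which authenticators run is not guaranteed: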

  • X.509 client certificates — the default for cluster components (kubelet, controller-manager)
  • Bearer tokens — ServiceAccount tokens (JWT) for in-cluster workloads
  • OIDC tokens — integration with identity providers (Azure AD, Google, Okta)
  • Webhook token authentication — delegate to external service

Authorization (RBAC)

Role-Based Access Control determines whether the authenticated identity can perform the requested action (verb + resource + namespace). Roles (namespaced) and ClusterRoles (cluster-wide) are bound to users, groups, or service accounts via RoleBindings and ClusterRoleBindings. RBAC rules are purely additive: there are no deny rules.
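
The authorization decision can be sketched as a lookup over bound rules. This is a simplified model under stated assumptions, not the API Server's implementation: the `Rule`, `Binding`, and `is_allowed` names are hypothetical, and real RBAC also considers API groups, resource names, and non-resource URLs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    verbs: frozenset
    resources: frozenset


@dataclass(frozen=True)
class Binding:
    subject: str               # user, group, or service account name
    rules: tuple               # rules taken from the bound Role or ClusterRole
    namespace: Optional[str]   # None models a ClusterRoleBinding (all namespaces)


def is_allowed(bindings, subject, verb, resource, namespace):
    """Allow if any bound rule matches; otherwise deny (there are no deny rules)."""
    for binding in bindings:
        if binding.subject != subject:
            continue
        if binding.namespace is not None and binding.namespace != namespace:
            continue
        for rule in binding.rules:
            if verb in rule.verbs and resource in rule.resources:
                return True
    return False


pod_reader = Rule(frozenset({"get", "list", "watch"}), frozenset({"pods"}))
bindings = [Binding("alice", (pod_reader,), "default")]

print(is_allowed(bindings, "alice", "list", "pods", "default"))      # True
print(is_allowed(bindings, "alice", "delete", "pods", "default"))    # False
print(is_allowed(bindings, "alice", "list", "pods", "kube-system"))  # False
```

Because matching is additive, granting more access means adding a binding, and removing access means removing one.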

Admission Controllers

The most powerful extension point. Two phases:

  • Mutating admission — can modify the request (inject sidecars, add labels, set defaults)
  • Validating admission — can only accept/reject (enforce policies, quotas)

Admission Webhooks: Modern policy engines (OPA/Gatekeeper, Kyverno, Kubewarden) are implemented as admission webhooks, most commonly ValidatingAdmissionWebhooks. The API Server calls the webhook with the request, and the webhook returns allow or deny. This is how organizations enforce "no containers running as root" or "all pods must have resource limits" without modifying Kubernetes source code.

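# Explore API Server resources and verbs
kubectl api-resources --sort-by=name -o wide

# List the admission plugins the binary enables by default.
# "kube-apiserver-master" is an example static pod name; on kubeadm clusters find yours with:
#   kubectl get pods -n kube-system -l component=kube-apiserver
kubectl exec -n kube-system kube-apiserver-master -- \
  kube-apiserver --help 2>&1 | grep enable-admission-plugins

# Show the plugins explicitly enabled on this instance (if the flag is set)
kubectl get pod -n kube-system kube-apiserver-master -o yaml | \
  grep -- --enable-admission-plugins

# View API Server audit entries. This only works when --audit-log-path=- sends
# audit events to stdout; otherwise read the configured audit log file on the node.
kubectl logs -n kube-system kube-apiserver-master | \
  grep 'audit.k8s.io' | head -20

# Check API Server health (/healthz is deprecated in favor of /livez and /readyz)
kubectl get --raw '/livez?verbose'
kubectl get --raw '/readyz?verbose'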
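
The request pipeline described in this section can be sketched as a chain of stages where any stage may reject the request. This is an illustrative model only: the stage functions, the `Rejected` exception, and the request dictionary layout are hypothetical, and object schema validation is omitted for brevity.

```python
class Rejected(Exception):
    """Raised by any stage to stop the request before it is persisted."""


def authenticate(req):
    if req.get("token") != "valid-token":     # stand-in for certs, tokens, OIDC
        raise Rejected("401 Unauthorized")
    req["user"] = "alice"
    return req


def authorize(req):
    if req["verb"] not in {"create", "get"}:  # stand-in for an RBAC lookup
        raise Rejected("403 Forbidden")
    return req


def mutate(req):
    # Mutating admission may change the object, e.g. add a default label.
    req["object"].setdefault("labels", {}).setdefault("team", "unassigned")
    return req


def validate(req):
    # Validating admission may only accept or reject.
    if req["object"].get("runAsRoot"):
        raise Rejected("denied by policy: containers must not run as root")
    return req


def handle(req, store):
    for stage in (authenticate, authorize, mutate, validate):
        req = stage(req)
    store[req["object"]["name"]] = req["object"]  # stand-in for the etcd write
    return "201 Created"


store = {}
ok = {"token": "valid-token", "verb": "create", "object": {"name": "web"}}
print(handle(ok, store), store["web"]["labels"])

bad = {"token": "valid-token", "verb": "create",
       "object": {"name": "db", "runAsRoot": True}}
try:
    handle(bad, store)
except Rejected as err:
    print("rejected:", err)
```

Running mutation before validation matters: the validating stage sees the object as it will actually be stored, including any injected defaults.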

Watch Mechanism & Informers

The watch mechanism is how Kubernetes achieves event-driven architecture without polling. Components open long-lived HTTP connections to the API Server and receive notifications when resources change.

How Watches Work

  1. Client opens a watch: GET /api/v1/pods?watch=true&resourceVersion=12345
  2. API Server holds the connection open (HTTP chunked transfer encoding)
  3. When a pod changes, API Server pushes the event (ADDED, MODIFIED, DELETED)
  4. Client processes the event and updates its local cache
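
The client side of these steps can be sketched as a function that folds a stream of watch events into a local cache. The event dictionaries below are hand-written stand-ins for what a watch would deliver, and `apply_watch_events` and `pod` are hypothetical helper names.

```python
def apply_watch_events(cache, events, resource_version):
    """Apply watch events to a cache keyed by namespace/name.

    Returns the last resourceVersion seen, which a client would pass
    back when it needs to re-establish the watch.
    """
    for event in events:
        obj = event["object"]
        key = f'{obj["namespace"]}/{obj["name"]}'
        if event["type"] in ("ADDED", "MODIFIED"):
            cache[key] = obj
        elif event["type"] == "DELETED":
            cache.pop(key, None)
        resource_version = obj["resourceVersion"]
    return resource_version


def pod(name, phase, rv):
    return {"namespace": "default", "name": name,
            "phase": phase, "resourceVersion": rv}


events = [
    {"type": "ADDED", "object": pod("web", "Pending", "12346")},
    {"type": "MODIFIED", "object": pod("web", "Running", "12347")},
    {"type": "ADDED", "object": pod("job", "Running", "12348")},
    {"type": "DELETED", "object": pod("job", "Succeeded", "12349")},
]

cache = {}
last_seen = apply_watch_events(cache, events, "12345")
print(sorted(cache), cache["default/web"]["phase"], last_seen)
```

Resource versions are treated as opaque strings here, which mirrors how clients are expected to handle them.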

Shared Informers

Raw watches are expensive because each watcher gets its own stream. Shared Informers solve this with a single watch connection per resource type, shared across all controllers running in the same process (for example, inside kube-controller-manager):

  • Reflector — maintains the watch, handles reconnection and bookmark events
  • Store (cache) — local in-memory copy of all watched objects
  • Indexer — allows efficient lookup by key (namespace/name) or custom indexes
  • Event handlers — callbacks for AddFunc, UpdateFunc, DeleteFunc
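
How these pieces fit together can be sketched with a toy informer that keeps one store, one custom index, and fans each event out to every registered handler. This is not the client-go API: `SharedInformer`, `add_event_handler`, and `on_event` are hypothetical names, and the reflector that would feed `on_event` from a real watch is not shown.

```python
from collections import defaultdict


class SharedInformer:
    def __init__(self):
        self.store = {}                       # key -> latest object
        self.by_namespace = defaultdict(set)  # custom index: namespace -> keys
        self.handlers = []

    def add_event_handler(self, add=None, update=None, delete=None):
        self.handlers.append({"ADDED": add, "MODIFIED": update, "DELETED": delete})

    def on_event(self, event_type, obj):
        """Update the cache once, then notify every handler."""
        key = f'{obj["namespace"]}/{obj["name"]}'
        if event_type == "DELETED":
            self.store.pop(key, None)
            self.by_namespace[obj["namespace"]].discard(key)
        else:
            self.store[key] = obj
            self.by_namespace[obj["namespace"]].add(key)
        for handler in self.handlers:
            callback = handler.get(event_type)
            if callback:
                callback(obj)


informer = SharedInformer()
scaled, audited = [], []
# Two independent "controllers" share the same informer and cache.
informer.add_event_handler(add=lambda o: scaled.append(o["name"]))
informer.add_event_handler(add=lambda o: audited.append(o["name"]),
                           delete=lambda o: audited.append("-" + o["name"]))

informer.on_event("ADDED", {"namespace": "default", "name": "web"})
informer.on_event("ADDED", {"namespace": "kube-system", "name": "dns"})
informer.on_event("DELETED", {"namespace": "default", "name": "web"})

print(scaled, audited, sorted(informer.by_namespace["kube-system"]))
```
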
Architecture Insight
Why Informers Matter for Scalability

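In a cluster with 10,000 pods and 50 controllers, naive polling adds up quickly: if every controller fetched every pod individually just once per minute, that would be 10,000 × 50 = 500,000 API calls per minute. Informers reduce this to one watch connection per resource type per process, with reads served from the local cache. The API Server sends an event only when an object changes (each event carries the object's new state, not a field-level delta), and resourceVersion tracking lets a client resume a watch where it left off. This is a large part of why Kubernetes can scale to thousands of nodes.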
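
A back-of-the-envelope version of this comparison, assuming (as an illustration, not a measurement) that each controller polls each pod individually once per minute:

```python
pods = 10_000
controllers = 50
polls_per_pod_per_minute = 1  # assumption: one GET per pod per controller per minute

# Naive polling: every controller asks about every pod.
polling_calls_per_minute = pods * controllers * polls_per_pod_per_minute

# Shared informer: one long-lived watch per resource type per process,
# with all reads served from the local cache.
resource_types_watched = 1  # only Pods in this example
watch_connections = resource_types_watched

print(polling_calls_per_minute)  # 500000
print(watch_connections)         # 1
```
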

Scalability, Caching, Event-Driven

etcd — The Source of Truth

etcd is a distributed key-value store that holds the entire cluster state. Built-in objects such as pods, services, secrets, and configmaps are persisted in etcd as serialized protobuf (custom resources are stored as JSON). If etcd is lost and no backup exists, the cluster state is lost.

Raft Consensus

etcd uses the Raft consensus algorithm to replicate data across cluster members:

etcd Raft Consensus — Write Path
sequenceDiagram
    participant Client as API Server
    participant Leader as etcd Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    Client->>Leader: Write request
    Leader->>Leader: Append to log
    Leader->>F1: AppendEntries RPC
    Leader->>F2: AppendEntries RPC
    F1->>Leader: Ack (log replicated)
    F2->>Leader: Ack (log replicated)
    Note over Leader: Majority achieved (2/3)
    Leader->>Leader: Commit entry
    Leader->>Client: Write confirmed
    Leader->>F1: Commit notification
    Leader->>F2: Commit notification
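
The commit rule in the diagram can be sketched in a few lines: an entry is committed once a majority of members, counting the leader itself, have it in their logs. This is a minimal model of the majority check only, with hypothetical function names, and it leaves out terms, log matching, and elections.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def replicate(followers_up):
    """Return (committed, acks) for one write.

    followers_up lists, per follower, whether it acknowledged the entry.
    """
    members = 1 + len(followers_up)
    acks = 1 + sum(1 for up in followers_up if up)  # the leader counts itself
    return acks >= quorum(members), acks


print(replicate([True, True]))    # (True, 3): both followers acknowledged
print(replicate([True, False]))   # (True, 2): one follower down, still a majority
print(replicate([False, False]))  # (False, 1): leader alone cannot commit
```
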
                            

Data Model

Kubernetes objects are stored under a hierarchical key prefix:

  • /registry/pods/default/my-pod — Pod in default namespace
  • /registry/services/specs/kube-system/kube-dns — kube-dns Service
  • /registry/secrets/production/db-credentials — Secret
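
The key layout for namespaced objects can be sketched as a pair of helpers. These function names are hypothetical, and the sketch covers only the namespaced pattern shown in the examples above; cluster-scoped objects are not handled.

```python
def registry_key(resource, namespace, name, prefix="/registry"):
    """Build the etcd key for a namespaced object."""
    return f"{prefix}/{resource}/{namespace}/{name}"


def parse_registry_key(key, prefix="/registry"):
    """Split a namespaced key into (resource, namespace, name)."""
    parts = key[len(prefix) + 1:].split("/")
    *resource, namespace, name = parts
    return "/".join(resource), namespace, name


print(registry_key("pods", "default", "my-pod"))
print(parse_registry_key("/registry/services/specs/kube-system/kube-dns"))
```

The resource segment may itself contain a slash, as with `services/specs`, which is why the parser takes the namespace and name from the end of the key.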

Operational Concerns

  • Compaction — etcd keeps history for watch replay; compaction removes old revisions to reclaim space
  • Defragmentation — after compaction, free space is fragmented; defrag returns it to the filesystem, but it blocks reads and writes on the member being defragmented, so run it on one member at a time
  • Backup — periodic snapshots are critical; etcd data IS the cluster state
  • Cluster size — 3 nodes (tolerates 1 failure), 5 nodes (tolerates 2), 7 nodes maximum recommended
# etcd cluster configuration (static bootstrapping)
# /etc/etcd/etcd.conf.yaml
name: etcd-node-1
data-dir: /var/lib/etcd
listen-client-urls: https://10.0.1.10:2379
advertise-client-urls: https://10.0.1.10:2379
listen-peer-urls: https://10.0.1.10:2380
initial-advertise-peer-urls: https://10.0.1.10:2380
# Keep the member list on one line: a folded block scalar would insert spaces after the commas
initial-cluster: etcd-node-1=https://10.0.1.10:2380,etcd-node-2=https://10.0.1.11:2380,etcd-node-3=https://10.0.1.12:2380
initial-cluster-state: new
client-transport-security:
  cert-file: /etc/etcd/pki/server.crt
  key-file: /etc/etcd/pki/server.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true
peer-transport-security:
  cert-file: /etc/etcd/pki/peer.crt
  key-file: /etc/etcd/pki/peer.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true
auto-compaction-mode: periodic
auto-compaction-retention: "8h"
quota-backend-bytes: 8589934592  # 8GB
# etcd health and operational commands

# Shared connection settings (the cluster above requires client certificates)
ENDPOINTS=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379
export ETCDCTL_CACERT=/etc/etcd/pki/ca.crt
export ETCDCTL_CERT=/etc/etcd/pki/server.crt
export ETCDCTL_KEY=/etc/etcd/pki/server.key

# Check cluster health
etcdctl endpoint health --endpoints="$ENDPOINTS"

# Check cluster member status (shows leader)
etcdctl endpoint status --write-out=table --endpoints="$ENDPOINTS"

# Create snapshot backup (a snapshot is taken from a single endpoint)
etcdctl snapshot save "/backup/etcd-snapshot-$(date +%Y%m%d).db" \
  --endpoints=https://10.0.1.10:2379

# Verify snapshot integrity (newer releases move this to: etcdutl snapshot status)
etcdctl snapshot status /backup/etcd-snapshot-20260515.db --write-out=table

# Check database size per endpoint (a NOSPACE alarm triggers at the quota)
etcdctl endpoint status --write-out=json --endpoints="$ENDPOINTS" | \
  python3 -c "import sys,json; [print(f'{e[\"Endpoint\"]}: {e[\"Status\"][\"dbSize\"]/1024/1024:.1f} MB') for e in json.load(sys.stdin)]"

Critical Warning: etcd performance directly determines API Server responsiveness. etcd needs low-latency storage, ideally local SSD/NVMe rather than slow network-attached disks. As a rule of thumb, the 99th percentile WAL fsync latency should stay below 10ms; sustained higher latency can cause missed heartbeats, leader elections, watch disconnections, and cascading control plane instability. Dedicate fast disks to etcd.

Scheduler — Filtering & Scoring

The kube-scheduler watches for newly created Pods with no assigned node and selects the best node for each one. It operates in two phases:

Scheduler Filter-Score Pipeline
flowchart LR
    POD["Unscheduled Pod"] --> FILTER["Filter Phase\n(Predicates)"]
    FILTER --> FEASIBLE["Feasible Nodes\n(passed all filters)"]
    FEASIBLE --> SCORE["Score Phase\n(Priorities)"]
    SCORE --> RANK["Ranked Nodes\n(highest score wins)"]
    RANK --> BIND["Bind\n(assign pod to node)"]
    BIND --> ETCD2["etcd\n(pod.spec.nodeName)"]
                            

Filter Phase (Predicates)

Eliminates nodes that cannot run the Pod. Each filter is a hard constraint:

  • NodeResourcesFit — node has enough CPU/memory for pod requests
  • NodePorts — requested host ports are available
  • NodeAffinity — node matches required affinity rules
  • TaintToleration — pod tolerates all node taints
  • PodTopologySpread — respects topology spread constraints
  • VolumeBinding — required PVs available in the node's zone
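
The filter phase can be sketched as a list of predicates that a node must all pass. The three functions below are heavily simplified stand-ins for the plugins of the same name (real taints carry keys, values, and effects, for instance), and the node and pod dictionaries are hypothetical.

```python
def node_resources_fit(pod, node):
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]


def taint_toleration(pod, node):
    # Every taint on the node must be tolerated by the pod.
    return set(node["taints"]) <= set(pod["tolerations"])


def node_ports(pod, node):
    # None of the requested host ports may already be in use.
    return not (set(pod["host_ports"]) & set(node["used_ports"]))


FILTERS = [node_resources_fit, taint_toleration, node_ports]


def feasible_nodes(pod, nodes):
    """A node is feasible only if it passes every filter (hard constraints)."""
    return [n["name"] for n in nodes if all(f(pod, n) for f in FILTERS)]


nodes = [
    {"name": "node-a", "free_cpu": 4, "free_mem": 8, "taints": [], "used_ports": [80]},
    {"name": "node-b", "free_cpu": 1, "free_mem": 8, "taints": [], "used_ports": []},
    {"name": "node-c", "free_cpu": 8, "free_mem": 16, "taints": ["gpu"], "used_ports": []},
    {"name": "node-d", "free_cpu": 8, "free_mem": 16, "taints": [], "used_ports": []},
]
pod = {"cpu": 2, "mem": 4, "tolerations": [], "host_ports": [80]}

print(feasible_nodes(pod, nodes))  # ['node-d']
```

Here node-a fails on the host port, node-b on CPU, and node-c on the untolerated taint, which leaves node-d as the only candidate for scoring.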

Score Phase (Priorities)

Ranks feasible nodes to find the best fit. Each scoring plugin produces 0–100:

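  • NodeResourcesFit with the LeastAllocated strategy (legacy name: LeastRequestedPriority) — prefer nodes with most available resources
  • NodeResourcesBalancedAllocation (legacy name: BalancedResourceAllocation) — prefer nodes where CPU/memory utilization is balanced
  • InterPodAffinity — prefer co-location with affinity targets
  • ImageLocality — prefer nodes that already have container images cached
  • NodePreferAvoidPods — legacy plugin that respected node annotations discouraging placement; it is not available in the v1 scheduler configuration API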
# Custom scheduler profile with scoring weights
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 2
          - name: NodeResourcesFit
            weight: 1
          - name: InterPodAffinity
            weight: 2
          - name: ImageLocality
            weight: 1
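        # The legacy NodeResourcesLeastAllocated plugin is not part of the v1 API,
        # so nothing is listed under "disabled"; the strategy is set in pluginConfig below.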
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated  # Bin-packing strategy
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
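
How plugin weights and the scoring strategy combine can be sketched as a weighted sum of per-plugin scores in the 0 to 100 range. The scoring functions are simplified stand-ins (CPU only, and image locality reduced to present or absent), and all names and node data are hypothetical.

```python
def least_allocated(node, pod):
    """Higher score when more CPU stays free after placement (spreading)."""
    used = node["used_cpu"] + pod["cpu"]
    return 100 * (node["cpu"] - used) // node["cpu"]


def most_allocated(node, pod):
    """Higher score when the node ends up fuller (bin-packing)."""
    used = node["used_cpu"] + pod["cpu"]
    return 100 * used // node["cpu"]


def image_locality(node, pod):
    return 100 if pod["image"] in node["images"] else 0


def rank(nodes, pod, plugins):
    """plugins is a list of (score_function, weight) pairs."""
    totals = {n["name"]: sum(w * fn(n, pod) for fn, w in plugins) for n in nodes}
    order = sorted(totals, key=lambda name: totals[name], reverse=True)
    return order, totals


nodes = [
    {"name": "node-a", "cpu": 8, "used_cpu": 6, "images": set()},
    {"name": "node-b", "cpu": 8, "used_cpu": 1, "images": {"nginx"}},
]
pod = {"cpu": 1, "image": "nginx"}

print(rank(nodes, pod, [(most_allocated, 1)]))   # node-a first: packs the fuller node
print(rank(nodes, pod, [(least_allocated, 1)]))  # node-b first: spreads load
print(rank(nodes, pod, [(most_allocated, 1), (image_locality, 2)]))  # node-b first
```

The last call shows the effect of weights: a heavily weighted image-locality score outvotes the bin-packing preference.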
Advanced Topic
Gang Scheduling & Scheduler Extenders

Standard Kubernetes scheduling is pod-by-pod. For workloads requiring "all-or-nothing" scheduling (ML training needing 8 GPUs simultaneously, MPI jobs), gang scheduling is needed. Solutions include Volcano (batch scheduler), Coscheduling plugin, and scheduler extenders that add custom filter/score logic via webhooks. The Scheduling Framework (plugin-based) in Kubernetes 1.19+ makes custom scheduling logic first-class.

ML, Batch, Advanced

Controller Manager — Reconciliation Loops

The kube-controller-manager runs dozens of control loops, each responsible for one piece of cluster state. Every controller follows the same pattern:

Controller Reconciliation Loop Pattern
flowchart TB
    OBSERVE["Observe\n(Watch API Server\nfor changes)"] --> DIFF["Diff\n(Compare desired\nvs actual state)"]
    DIFF --> ACT["Act\n(Create/Update/Delete\nresources to converge)"]
    ACT --> OBSERVE
    STATUS["Update Status\n(report current state\nback to API Server)"] --> OBSERVE
    ACT --> STATUS
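
One pass of this loop can be sketched for a ReplicaSet-like controller. This is a toy model with hypothetical names: pods are just strings, and "creating" or "deleting" a pod means editing a list rather than calling the API Server.

```python
import itertools


def reconcile(desired, actual, new_names):
    """One observe -> diff -> act pass. Returns (new actual state, status)."""
    actual = list(actual)                 # Observe: snapshot of current pods
    diff = desired - len(actual)          # Diff: desired vs actual
    if diff > 0:                          # Act: create missing pods
        actual += [next(new_names) for _ in range(diff)]
    elif diff < 0:                        # Act: delete surplus pods
        actual = actual[:desired]
    status = {"replicas": len(actual), "ready": len(actual) == desired}
    return actual, status


names = (f"web-{i}" for i in itertools.count())

pods, status = reconcile(3, [], names)        # scale up from zero
print(pods, status)
pods, status = reconcile(3, pods[:1], names)  # two pods disappeared
print(pods, status)
pods, status = reconcile(1, pods, names)      # desired count lowered
print(pods, status)
```

The function never asks what happened, only what the gap is right now, which is what lets the same code handle scale-up, failure recovery, and scale-down.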
                            

Key Built-in Controllers

Deployment Controller: Watches Deployment objects. When spec changes (new image, replicas), creates/updates the corresponding ReplicaSet. Manages rollout strategy (RollingUpdate, Recreate), rollback history, and progress conditions.

ReplicaSet Controller: Ensures the desired number of pod replicas are running. Creates pods when under count, deletes when over count. Uses label selectors to identify owned pods.

StatefulSet Controller: Like ReplicaSet but with ordered creation/deletion, stable network identities (pod-0, pod-1), and persistent volume claims per replica.

Job Controller: Runs pods to completion. Tracks succeeded/failed counts. Supports parallelism, completions, backoff limits, and indexed jobs.

DaemonSet Controller: Ensures exactly one pod runs on every node (or a subset via nodeSelector/affinity). Used for node-level agents (log collectors, CNI, monitoring).

Level-Triggered vs Edge-Triggered: Kubernetes controllers are level-triggered — they reconcile based on the CURRENT state difference, not on individual events. If the controller crashes and restarts, it simply observes current state and acts. This makes the system self-healing: missed events don't cause permanent drift because the next reconciliation will catch up.
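
The difference shows up as soon as an event goes missing. In this minimal sketch (hypothetical function names, replica counts as plain integers), three scale-up events are sent but only two are delivered:

```python
def edge_triggered(actual, delivered_events):
    """Reacts to each delivered event; a lost event is never compensated for."""
    for delta in delivered_events:
        actual += delta
    return actual


def level_triggered(actual, desired):
    """Ignores event history and closes whatever gap exists right now."""
    missing = desired - actual
    return actual + missing


start, desired = 2, 5
delivered = [+1, +1]          # the third +1 was lost in transit

print(edge_triggered(start, delivered))                            # 4: permanent drift
print(level_triggered(edge_triggered(start, delivered), desired))  # 5: converged
```
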

Cloud Controller Manager

The cloud-controller-manager runs controllers with cloud-provider-specific logic, separated from the core controller-manager to allow cloud providers to evolve independently:

  • Node Controller — checks cloud provider to determine if a node VM still exists; if deleted, removes the Node object
  • Route Controller — configures cloud network routes so pods on different nodes can communicate
  • Service Controller — creates/updates/deletes cloud load balancers for Services of type LoadBalancer

High Availability Configuration

Production clusters must survive control plane component failures. The HA strategy differs by component:

API Server HA

Stateless — run multiple replicas behind a load balancer. Each instance connects to the same etcd cluster. Clients (kubelet, kubectl) use the LB endpoint.

Controller Manager & Scheduler HA

Only one instance can be active at a time (to avoid conflicting decisions). Uses leader election via a Lease object in Kubernetes:

  • All replicas start and attempt to acquire the lease
  • Winner becomes active leader, others are standby
  • Leader renews lease every ~2 seconds
  • If renewal fails (crash), another replica acquires the lease within ~15 seconds
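
The lease steps above can be sketched with a simulated clock. This is a simplified model with hypothetical names: the real mechanism updates a Lease object through the API Server and relies on optimistic concurrency to settle races, which is not modeled here. The 15 second duration and 2 second retry period are taken from the figures in the list.

```python
from dataclasses import dataclass

LEASE_DURATION = 15.0  # seconds a lease stays valid without renewal
RETRY_PERIOD = 2       # seconds between acquire/renew attempts


@dataclass
class Lease:
    holder: str = ""
    renew_time: float = float("-inf")


def try_acquire_or_renew(lease, candidate, now):
    """Return True if `candidate` holds the lease after this attempt."""
    expired = now - lease.renew_time > LEASE_DURATION
    if lease.holder in ("", candidate) or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False


lease = Lease()
print(try_acquire_or_renew(lease, "cm-a", now=0.0))  # True: first to ask wins
print(try_acquire_or_renew(lease, "cm-b", now=1.0))  # False: lease is held
for t in (2.0, 4.0):                                 # leader keeps renewing
    try_acquire_or_renew(lease, "cm-a", now=t)

# cm-a crashes after t=4; cm-b keeps retrying every RETRY_PERIOD seconds.
takeover = next(t for t in range(6, 40, RETRY_PERIOD)
                if try_acquire_or_renew(lease, "cm-b", now=float(t)))
print(takeover, lease.holder)  # 20 cm-b
```

In this run the standby takes over 16 seconds after the last renewal: the lease duration plus the wait until its next retry.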

etcd HA

Run 3, 5, or 7 members (always odd for Raft majority). Tolerates (n-1)/2 failures. 3 members is standard; 5 for large clusters; 7 is maximum recommended (more members increase write latency due to quorum requirement).
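
The arithmetic behind these sizes is short enough to check directly. The helper names are hypothetical; the formulas are the standard majority rule.

```python
def quorum(n):
    """Members required for a majority."""
    return n // 2 + 1


def tolerated_failures(n):
    """Members that can fail while a majority remains."""
    return n - quorum(n)


for n in (3, 4, 5, 6, 7):
    print(f"members={n} quorum={quorum(n)} tolerates={tolerated_failures(n)}")
```

Going from 3 to 4 members, or from 5 to 6, raises the quorum without raising the number of tolerated failures, which is why odd sizes are preferred.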

Key Takeaway
The Control Plane as SDN Controller

The Kubernetes control plane is, in effect, an SDN controller for compute resources. The API Server is the centralized brain. etcd is the routing table (cluster state). Controllers are the routing protocols (continuously computing desired state). The Scheduler is path computation. And kubelet on each node is the data plane forwarding engine executing the control plane's decisions. It is the same separation-of-concerns pattern from networking, applied to container orchestration.

Architecture, SDN Parallel, Pattern