API Server — The Single Entry Point
The kube-apiserver is the only component that talks directly to etcd. Every other component — scheduler, controllers, kubelet, kubectl — communicates exclusively through the API Server. This makes it the central hub of the entire control plane.
Every request to the API Server passes through a strict pipeline:
flowchart LR
REQ["Client Request\n(kubectl, kubelet)"] --> AUTH["Authentication\n(certs, tokens, OIDC)"]
AUTH --> AUTHZ["Authorization\n(RBAC, ABAC, Webhook)"]
AUTHZ --> MUT["Mutating Admission\n(defaults, injection)"]
MUT --> VAL["Schema Validation\n(object schema check)"]
VAL --> ADM["Validating Admission\n(policies, quotas)"]
ADM --> ETCD["etcd Write\n(persist state)"]
ETCD --> RESP["Response\nto Client"]
Authentication
Several authentication strategies can be enabled at once. The API Server tries them as a chain, in no guaranteed order, and the first one that succeeds wins:
- X.509 client certificates — the default for cluster components (kubelet, controller-manager)
- Bearer tokens — ServiceAccount tokens (JWT) for in-cluster workloads
- OIDC tokens — integration with identity providers (Azure AD, Google, Okta)
- Webhook token authentication — delegate to external service
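The first-success-wins behaviour can be sketched as a chain of authenticators. This is a toy model, not the API Server's implementation: the request dictionary, the two authenticator functions, and the static token table are hypothetical stand-ins for certificate and JWT validation.

```python
from typing import Callable, List, Optional

# Hypothetical request shape: a dict with optional "client_cert_cn" and "bearer_token".
Request = dict
Authenticator = Callable[[Request], Optional[str]]


def cert_authenticator(req: Request) -> Optional[str]:
    """Return the certificate common name as the username, if a cert was presented."""
    return req.get("client_cert_cn")


def token_authenticator(req: Request) -> Optional[str]:
    """Look the bearer token up in a static table (stand-in for JWT validation)."""
    known_tokens = {"abc123": "system:serviceaccount:default:builder"}
    return known_tokens.get(req.get("bearer_token", ""))


def authenticate(req: Request, chain: List[Authenticator]) -> str:
    """Try each authenticator; the first one that returns a user wins."""
    for authenticator in chain:
        user = authenticator(req)
        if user is not None:
            return user
    # Assumption: anonymous access is enabled; otherwise the request is rejected.
    return "system:anonymous"


chain = [cert_authenticator, token_authenticator]
print(authenticate({"client_cert_cn": "system:node:worker-1"}, chain))
print(authenticate({"bearer_token": "abc123"}, chain))
print(authenticate({}, chain))
```

The result of authentication is only an identity (username and groups); whether that identity may do anything is decided in the next stage.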
Authorization (RBAC)
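Role-Based Access Control (RBAC) determines whether the authenticated identity may perform the requested action (verb + resource + namespace). Permissions are defined in Roles (namespaced) and ClusterRoles (cluster-wide) and granted to users, groups, or service accounts through RoleBindings and ClusterRoleBindings. RBAC rules are purely additive; there are no deny rules.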
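A minimal sketch of how such a check can be evaluated, assuming a simplified data model: the `Rule` and `Binding` classes and the `is_allowed` function are hypothetical, and a `Binding` here carries the rules of the role it references directly. Real RBAC also matches API groups, resource names, and group membership.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    verbs: Tuple[str, ...]      # e.g. ("get", "list") or ("*",)
    resources: Tuple[str, ...]  # e.g. ("pods",)


@dataclass(frozen=True)
class Binding:
    subject: str             # user, group, or service account name
    namespace: str           # "" models a cluster-wide binding
    rules: Tuple[Rule, ...]  # rules of the referenced Role or ClusterRole


def is_allowed(bindings: List[Binding], subject: str, verb: str,
               resource: str, namespace: str) -> bool:
    """Allow if any bound rule matches; with no match the request is denied."""
    for binding in bindings:
        if binding.subject != subject:
            continue
        if binding.namespace not in ("", namespace):
            continue
        for rule in binding.rules:
            verb_ok = "*" in rule.verbs or verb in rule.verbs
            resource_ok = "*" in rule.resources or resource in rule.resources
            if verb_ok and resource_ok:
                return True
    return False


pod_reader = Rule(verbs=("get", "list", "watch"), resources=("pods",))
bindings = [Binding(subject="jane", namespace="dev", rules=(pod_reader,))]

print(is_allowed(bindings, "jane", "list", "pods", "dev"))    # True
print(is_allowed(bindings, "jane", "delete", "pods", "dev"))  # False
print(is_allowed(bindings, "jane", "list", "pods", "prod"))   # False
```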
Admission Controllers
The most powerful extension point. Two phases:
- Mutating admission — can modify the request (inject sidecars, add labels, set defaults)
- Validating admission — can only accept/reject (enforce policies, quotas)
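The two phases can be sketched as two lists of functions applied in order. The policies below (`add_default_team_label`, `require_memory_limit`) and the `admit` helper are hypothetical examples, not built-in admission plugins; real admission webhooks receive an AdmissionReview object over HTTPS.

```python
import copy
from typing import Callable, List, Optional

Pod = dict


def add_default_team_label(pod: Pod) -> Pod:
    """Mutating: set a default label if the client did not provide one."""
    labels = pod.setdefault("metadata", {}).setdefault("labels", {})
    labels.setdefault("team", "unassigned")
    return pod


def require_memory_limit(pod: Pod) -> Optional[str]:
    """Validating: return a rejection reason, or None to accept."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if "memory" not in limits:
            return f"container {container.get('name', '?')} has no memory limit"
    return None


def admit(pod: Pod,
          mutators: List[Callable[[Pod], Pod]],
          validators: List[Callable[[Pod], Optional[str]]]) -> Pod:
    """Run every mutator first, then every validator on the mutated object."""
    obj = copy.deepcopy(pod)  # never modify the caller's copy
    for mutate in mutators:
        obj = mutate(obj)
    for validate in validators:
        reason = validate(obj)
        if reason is not None:
            raise PermissionError(reason)
    return obj


good = {"metadata": {"name": "web"}, "spec": {"containers": [
    {"name": "app", "resources": {"limits": {"memory": "128Mi"}}}]}}
admitted = admit(good, [add_default_team_label], [require_memory_limit])
print(admitted["metadata"]["labels"])
```

Running validators after mutators matters: a validator must judge the object as it will actually be stored, including anything a mutator injected.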
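# Explore API Server resources and verbs (-o wide adds the VERBS column)
kubectl api-resources --sort-by=name -o wide
# Check which admission plugins are explicitly enabled on the running API Server
# (kube-apiserver-master is an example pod name; plugins on by default are not listed)
kubectl get pod -n kube-system kube-apiserver-master -o yaml | \
  grep enable-admission-plugins
# View API Server audit events (only visible here if auditing writes to stdout;
# otherwise read the file named by --audit-log-path on the control plane node)
kubectl logs -n kube-system kube-apiserver-master | \
  grep -i "audit" | head -20
# Check API Server health (/healthz is deprecated in favor of /livez and /readyz)
kubectl get --raw='/livez?verbose'
kubectl get --raw='/readyz?verbose'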
Watch Mechanism & Informers
The watch mechanism is how Kubernetes achieves event-driven architecture without polling. Components open long-lived HTTP connections to the API Server and receive notifications when resources change.
How Watches Work
- Client opens a watch: GET /api/v1/pods?watch=true&resourceVersion=12345
- API Server holds the connection open (HTTP chunked transfer encoding)
- When a pod changes, API Server pushes the event (ADDED, MODIFIED, DELETED)
- Client processes the event and updates its local cache
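The client side of these steps can be sketched as follows. The stream is simulated with a list of JSON lines so the example runs without a cluster; the pod names and resourceVersion values are made up, and `apply_event` is a hypothetical helper.

```python
import json

# Simulated response body: the watch delivers one JSON document per event.
stream = [
    '{"type": "ADDED", "object": {"metadata": {"namespace": "default", "name": "web-1", "resourceVersion": "12346"}}}',
    '{"type": "MODIFIED", "object": {"metadata": {"namespace": "default", "name": "web-1", "resourceVersion": "12350"}}}',
    '{"type": "ADDED", "object": {"metadata": {"namespace": "default", "name": "web-2", "resourceVersion": "12351"}}}',
    '{"type": "DELETED", "object": {"metadata": {"namespace": "default", "name": "web-1", "resourceVersion": "12360"}}}',
]


def apply_event(cache: dict, event: dict) -> None:
    """Update the local cache from a single watch event."""
    obj = event["object"]
    key = f'{obj["metadata"]["namespace"]}/{obj["metadata"]["name"]}'
    if event["type"] in ("ADDED", "MODIFIED"):
        cache[key] = obj
    elif event["type"] == "DELETED":
        cache.pop(key, None)


cache = {}
last_resource_version = "12345"  # the version the watch was opened at
for line in stream:
    event = json.loads(line)
    apply_event(cache, event)
    # Remember where we are so a dropped connection can resume from here.
    last_resource_version = event["object"]["metadata"]["resourceVersion"]

print(sorted(cache))
print(last_resource_version)
```

Tracking the last seen resourceVersion is what lets a client reconnect with `resourceVersion=<last seen>` instead of listing everything again.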
Shared Informers
Raw watches are expensive: each watcher gets its own stream. Shared Informers solve this by sharing a single watch connection and cache per resource type across all controllers in the same process (for example, every controller inside kube-controller-manager):
- Reflector — maintains the watch, handles reconnection and bookmark events
- Store (cache) — local in-memory copy of all watched objects
- Indexer — allows efficient lookup by key (namespace/name) or custom indexes
- Event handlers — callbacks for AddFunc, UpdateFunc, DeleteFunc
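A minimal sketch of how these pieces fit together, assuming a toy `SharedInformer` class (hypothetical, loosely modelled on the roles listed above). The reflector is not modelled; `handle` stands for the point where it would deliver each watch event.

```python
from collections import defaultdict


class SharedInformer:
    def __init__(self):
        self.store = {}                       # key "namespace/name" -> object
        self.by_namespace = defaultdict(set)  # custom index: namespace -> keys
        self.handlers = []                    # one entry per registered controller

    def add_event_handler(self, on_add=None, on_update=None, on_delete=None):
        self.handlers.append((on_add, on_update, on_delete))

    def list_namespace(self, namespace):
        """Read from the local cache; no API call is made."""
        return [self.store[key] for key in sorted(self.by_namespace[namespace])]

    def handle(self, event_type, obj):
        """Update the store and index once, then fan out to every handler."""
        ns, name = obj["metadata"]["namespace"], obj["metadata"]["name"]
        key = f"{ns}/{name}"
        old = self.store.get(key)
        if event_type == "DELETED":
            self.store.pop(key, None)
            self.by_namespace[ns].discard(key)
        else:
            self.store[key] = obj
            self.by_namespace[ns].add(key)
        for on_add, on_update, on_delete in self.handlers:
            if event_type == "ADDED" and on_add:
                on_add(obj)
            elif event_type == "MODIFIED" and on_update:
                on_update(old, obj)
            elif event_type == "DELETED" and on_delete:
                on_delete(obj)


informer = SharedInformer()
seen_by_a, seen_by_b = [], []
informer.add_event_handler(on_add=lambda o: seen_by_a.append(o["metadata"]["name"]))
informer.add_event_handler(
    on_add=lambda o: seen_by_b.append(o["metadata"]["name"]),
    on_delete=lambda o: seen_by_b.append("deleted " + o["metadata"]["name"]),
)

pod = {"metadata": {"namespace": "default", "name": "web-1"}}
informer.handle("ADDED", pod)
print(len(informer.list_namespace("default")))
informer.handle("DELETED", pod)
print(seen_by_a, seen_by_b)
```

Two controllers registered handlers, yet each event was received and cached only once.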
Why Informers Matter for Scalability
In a cluster with 10,000 pods and 50 controllers, naive polling in which every controller fetches every pod once per minute would generate 500,000 API calls per minute. Informers reduce this to one watch connection per resource type per process, with reads served from the local cache. After the initial list, the API Server sends an event only for objects that changed, and resourceVersion tracking lets a client resume a watch where it left off. Each event still carries the full object, not a field-level diff. This is a large part of why Kubernetes can scale to thousands of nodes.
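The load figures can be checked with a few lines of arithmetic. Two polling models are assumed here for comparison, since the cost depends heavily on how the polling is done; the single-process assumption for the informer case is also an assumption of this sketch.

```python
pods = 10_000
controllers = 50

# Model 1: every controller fetches every pod individually once per minute.
per_object_calls_per_minute = pods * controllers

# Model 2: every controller issues one LIST of all pods per second.
list_calls_per_minute = controllers * 60
objects_sent_per_minute = list_calls_per_minute * pods

# Shared informer, assuming all 50 controllers live in one process:
# one long-lived watch for the Pod resource type.
watch_connections = 1

print(per_object_calls_per_minute)  # 500000
print(list_calls_per_minute)        # 3000
print(objects_sent_per_minute)      # 30000000
print(watch_connections)            # 1
```

Model 2 makes fewer calls, but each call returns the whole pod list, so the serialization and bandwidth cost is far higher than the call count suggests.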
etcd — The Source of Truth
etcd is a distributed key-value store that holds the entire cluster state. Every object you create in Kubernetes (pods, services, secrets, configmaps) is persisted in etcd; built-in types are stored as serialized protobuf by default, while custom resources are stored as JSON. If etcd is lost and no backup exists, the cluster state is lost.
Raft Consensus
etcd uses the Raft consensus algorithm to replicate data across cluster members:
sequenceDiagram
participant Client as API Server
participant Leader as etcd Leader
participant F1 as Follower 1
participant F2 as Follower 2
Client->>Leader: Write request
Leader->>Leader: Append to log
Leader->>F1: AppendEntries RPC
Leader->>F2: AppendEntries RPC
F1->>Leader: Ack (log replicated)
F2->>Leader: Ack (log replicated)
Note over Leader: Majority achieved (2/3)
Leader->>Leader: Commit entry
Leader->>Client: Write confirmed
Leader->>F1: Next AppendEntries (carries commit index)
Leader->>F2: Next AppendEntries (carries commit index)
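The commit decision in the diagram comes down to counting acknowledgements. The sketch below models only that rule; terms, log indexes, and elections are left out, and `commit_write` is a hypothetical helper rather than an etcd API.

```python
from typing import List


def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def commit_write(follower_acks: List[bool]) -> bool:
    """Return True if the leader may commit the entry.

    The leader's own log append counts as one acknowledgement.
    """
    cluster_size = 1 + len(follower_acks)
    acks = 1 + sum(follower_acks)
    return acks >= quorum(cluster_size)


print(commit_write([True, True]))    # both followers replied
print(commit_write([True, False]))   # one follower down, still a majority
print(commit_write([False, False]))  # leader alone, no majority
```

In a three-member cluster the leader needs only one follower acknowledgement to commit, which is why a single member failure does not block writes.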
Data Model
Kubernetes objects are stored under a hierarchical key prefix:
- /registry/pods/default/my-pod: Pod in the default namespace
- /registry/services/specs/kube-system/kube-dns: the kube-dns Service
- /registry/secrets/production/db-credentials: Secret
Operational Concerns
- Compaction — etcd keeps history for watch replay; compaction removes old revisions to reclaim space
- Defragmentation — after compaction, free space is fragmented; defrag reclaims it (causes brief unavailability)
- Backup — periodic snapshots are critical; etcd data IS the cluster state
- Cluster size — 3 nodes (tolerates 1 failure), 5 nodes (tolerates 2), 7 nodes maximum recommended
# etcd cluster configuration (static bootstrapping)
# /etc/etcd/etcd.conf.yaml
name: etcd-node-1
data-dir: /var/lib/etcd
listen-client-urls: https://10.0.1.10:2379
advertise-client-urls: https://10.0.1.10:2379
listen-peer-urls: https://10.0.1.10:2380
initial-advertise-peer-urls: https://10.0.1.10:2380
# Kept on one line so that no whitespace ends up between the entries
initial-cluster: etcd-node-1=https://10.0.1.10:2380,etcd-node-2=https://10.0.1.11:2380,etcd-node-3=https://10.0.1.12:2380
initial-cluster-state: new
client-transport-security:
  cert-file: /etc/etcd/pki/server.crt
  key-file: /etc/etcd/pki/server.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true
peer-transport-security:
  cert-file: /etc/etcd/pki/peer.crt
  key-file: /etc/etcd/pki/peer.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true
auto-compaction-mode: periodic
auto-compaction-retention: "8h"
quota-backend-bytes: 8589934592  # 8 GiB
# etcd health and operational commands
# client-cert-auth is enabled, so every etcdctl call needs the TLS material;
# setting it once through the environment keeps the commands short
export ETCDCTL_CACERT=/etc/etcd/pki/ca.crt
export ETCDCTL_CERT=/etc/etcd/pki/server.crt
export ETCDCTL_KEY=/etc/etcd/pki/server.key
ENDPOINTS=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379
# Check cluster health
etcdctl endpoint health --endpoints="$ENDPOINTS"
# Check cluster member status (shows leader)
etcdctl endpoint status --write-out=table --endpoints="$ENDPOINTS"
# Create snapshot backup (taken from a single member)
SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d).db
etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://10.0.1.10:2379
# Verify snapshot integrity (on etcd 3.5+, etcdutl snapshot status is preferred)
etcdctl snapshot status "$SNAPSHOT" --write-out=table
# Check database size of the first endpoint (alarm triggers at quota)
etcdctl endpoint status --write-out=json --endpoints="$ENDPOINTS" | \
  python3 -c "import sys,json; d=json.load(sys.stdin); print(f'DB Size: {d[0][\"Status\"][\"dbSize\"]/1024/1024:.1f} MB')"
Scheduler — Filtering & Scoring
The kube-scheduler watches for newly created Pods with no assigned node and selects the best node for each one. It operates in two phases:
flowchart LR
POD["Unscheduled Pod"] --> FILTER["Filter Phase\n(Predicates)"]
FILTER --> FEASIBLE["Feasible Nodes\n(passed all filters)"]
FEASIBLE --> SCORE["Score Phase\n(Priorities)"]
SCORE --> RANK["Ranked Nodes\n(highest score wins)"]
RANK --> BIND["Bind\n(assign pod to node)"]
BIND --> ETCD2["etcd\n(pod.spec.nodeName)"]
Filter Phase (Predicates)
Eliminates nodes that cannot run the Pod. Each filter is a hard constraint:
- NodeResourcesFit — node has enough CPU/memory for pod requests
- NodePorts — requested host ports are available
- NodeAffinity — node matches required affinity rules
- TaintToleration — pod tolerates all node taints
- PodTopologySpread — respects topology spread constraints
- VolumeBinding — required PVs available in the node's zone
Score Phase (Priorities)
Ranks feasible nodes to find the best fit. Each scoring plugin produces 0–100:
- NodeResourcesFit: with the default LeastAllocated strategy, prefers nodes with the most available resources (formerly LeastRequestedPriority)
- NodeResourcesBalancedAllocation: prefers nodes where CPU and memory utilization are balanced
- InterPodAffinity: prefers co-location with affinity targets
- ImageLocality: prefers nodes that already have the container images cached
- NodePreferAvoidPods: respected a node annotation discouraging placement; this legacy plugin has been removed in recent Kubernetes releases
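The two phases can be sketched in a few lines. This is a toy model, not the kube-scheduler's code: the node and pod dictionaries, the two filters, and the two scoring functions are simplified, hypothetical stand-ins for the plugins named above (the real ImageLocality plugin, for instance, weighs image sizes).

```python
def fits_resources(pod, node):
    return pod["cpu"] <= node["free_cpu"] and pod["mem"] <= node["free_mem"]


def tolerates_taints(pod, node):
    return set(node["taints"]) <= set(pod["tolerations"])


def free_capacity_score(pod, node):
    """0-100: average fraction of CPU and memory still free on the node."""
    return 100 * (node["free_cpu"] / node["cpu"] + node["free_mem"] / node["mem"]) / 2


def image_locality_score(pod, node):
    """0-100: all or nothing, depending on whether the image is cached."""
    return 100 if pod["image"] in node["images"] else 0


def schedule(pod, nodes, filters, scorers):
    """Filter out infeasible nodes, then pick the highest weighted score."""
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # the pod stays Pending
    best = max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in scorers))
    return best["name"]


nodes = [
    {"name": "a", "cpu": 4, "mem": 8, "free_cpu": 1, "free_mem": 2, "taints": [], "images": ["nginx"]},
    {"name": "b", "cpu": 4, "mem": 8, "free_cpu": 3, "free_mem": 6, "taints": [], "images": []},
    {"name": "c", "cpu": 8, "mem": 16, "free_cpu": 8, "free_mem": 16, "taints": ["gpu"], "images": []},
    {"name": "d", "cpu": 4, "mem": 8, "free_cpu": 2, "free_mem": 4, "taints": [], "images": ["nginx"]},
]
pod = {"cpu": 2, "mem": 2, "tolerations": [], "image": "nginx"}
filters = [fits_resources, tolerates_taints]

# Node a lacks CPU and node c is tainted, so only b and d are scored.
print(schedule(pod, nodes, filters, [(free_capacity_score, 1), (image_locality_score, 1)]))
print(schedule(pod, nodes, filters, [(free_capacity_score, 1)]))
```

With both scorers weighted equally, node d wins because the cached image outweighs node b's extra free capacity; with free capacity alone, node b wins.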
# Custom scheduler profile with scoring weights
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 2
          - name: NodeResourcesFit
            weight: 1
          - name: InterPodAffinity
            weight: 2
          - name: ImageLocality
            weight: 1
        # No "disabled" entry for NodeResourcesLeastAllocated: that plugin does not
        # exist in the v1 API; its behavior is selected through scoringStrategy below
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated  # Bin-packing strategy
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
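To see what the scoringStrategy type changes, the sketch below compares the two strategies on the same node. It follows the general shape of the documented formulas (free or used fraction of capacity, scaled to 0-100, then a weighted average across resources). `resource_score` is a hypothetical helper, and scoring an over-committed resource as 0 is an assumption of this sketch.

```python
MAX_NODE_SCORE = 100


def resource_score(strategy, requested, capacity, weights):
    """Weighted average of per-resource scores, using integer arithmetic.

    `requested` is what the node would have allocated after placing the pod.
    """
    weighted_sum = 0
    weight_total = 0
    for name, weight in weights.items():
        cap, req = capacity[name], requested[name]
        if cap == 0 or req > cap:
            score = 0  # assumption: over-committed resources score zero
        elif strategy == "LeastAllocated":
            score = (cap - req) * MAX_NODE_SCORE // cap
        elif strategy == "MostAllocated":
            score = req * MAX_NODE_SCORE // cap
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        weighted_sum += score * weight
        weight_total += weight
    return weighted_sum // weight_total


capacity = {"cpu": 4000, "memory": 8192}    # millicores, MiB
busy_node = {"cpu": 3000, "memory": 6144}   # 75% allocated
weights = {"cpu": 1, "memory": 1}

print(resource_score("LeastAllocated", busy_node, capacity, weights))  # 25
print(resource_score("MostAllocated", busy_node, capacity, weights))   # 75
```

MostAllocated rewards the busy node, which packs pods onto fewer nodes; LeastAllocated rewards empty nodes, which spreads load.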
Gang Scheduling & Scheduler Extenders
Standard Kubernetes scheduling is pod-by-pod. For workloads requiring "all-or-nothing" scheduling (ML training needing 8 GPUs simultaneously, MPI jobs), gang scheduling is needed. Solutions include Volcano (batch scheduler), Coscheduling plugin, and scheduler extenders that add custom filter/score logic via webhooks. The Scheduling Framework (plugin-based) in Kubernetes 1.19+ makes custom scheduling logic first-class.
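The all-or-nothing property can be illustrated with a small sketch. `gang_schedule` is a hypothetical function using greedy first-fit placement over free GPU counts; it is not how Volcano or the Coscheduling plugin work internally, and first-fit can miss placements that a smarter search would find.

```python
from typing import Dict, List, Optional


def gang_schedule(gpu_requests: List[int],
                  free_gpus: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Place every pod of the group, or place none of them.

    Returns {pod_index: node_name} and updates free_gpus on success;
    returns None and leaves free_gpus untouched on failure.
    """
    remaining = dict(free_gpus)  # work on a copy until the whole group fits
    placement = {}
    for index, need in enumerate(gpu_requests):
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod cannot be placed, so nothing is committed
        remaining[node] -= need
        placement[index] = node
    free_gpus.update(remaining)  # commit the whole group at once
    return placement


cluster = {"n1": 4, "n2": 4}
print(gang_schedule([4, 4, 4], cluster))  # None: the third pod does not fit
print(cluster)                            # unchanged
print(gang_schedule([4, 4], cluster))     # both pods placed
print(cluster)                            # all GPUs now in use
```

Pod-by-pod scheduling would have started two of the three pods in the first call, leaving a training job that holds eight GPUs while waiting indefinitely for four more.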
Controller Manager — Reconciliation Loops
The kube-controller-manager runs dozens of control loops, each responsible for one piece of cluster state. Every controller follows the same pattern:
flowchart TB
OBSERVE["Observe\n(Watch API Server\nfor changes)"] --> DIFF["Diff\n(Compare desired\nvs actual state)"]
DIFF --> ACT["Act\n(Create/Update/Delete\nresources to converge)"]
ACT --> OBSERVE
STATUS["Update Status\n(report current state\nback to API Server)"] --> OBSERVE
ACT --> STATUS
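The loop in the diagram can be sketched against an in-memory stand-in for the cluster. `FakeCluster` and `reconcile_once` are hypothetical names; the logic resembles what a ReplicaSet-style controller does, heavily simplified (no label selectors, no ownership, no status reporting).

```python
import itertools


class FakeCluster:
    """In-memory stand-in for the API Server's view of running pods."""

    def __init__(self, pods):
        self.pods = set(pods)
        self._ids = itertools.count()

    def create_pod(self):
        name = f"web-{next(self._ids)}"
        while name in self.pods:  # skip names that are already taken
            name = f"web-{next(self._ids)}"
        self.pods.add(name)

    def delete_pod(self, name):
        self.pods.discard(name)


def reconcile_once(cluster, desired_replicas):
    """One pass of observe -> diff -> act. Returns the number of actions taken."""
    actual = sorted(cluster.pods)          # observe
    diff = desired_replicas - len(actual)  # diff
    if diff > 0:                           # act: too few pods
        for _ in range(diff):
            cluster.create_pod()
    elif diff < 0:                         # act: too many pods
        for name in actual[:-diff]:
            cluster.delete_pod(name)
    return abs(diff)


cluster = FakeCluster({"web-0"})
print(reconcile_once(cluster, 3), sorted(cluster.pods))
print(reconcile_once(cluster, 3), sorted(cluster.pods))
print(reconcile_once(cluster, 1), sorted(cluster.pods))
```

The second call does nothing because desired and actual state already match. That idempotence is what makes it safe to run the loop again after any event, restart, or missed notification.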
Key Built-in Controllers
Deployment Controller: Watches Deployment objects. When spec changes (new image, replicas), creates/updates the corresponding ReplicaSet. Manages rollout strategy (RollingUpdate, Recreate), rollback history, and progress conditions.
ReplicaSet Controller: Ensures the desired number of pod replicas are running. Creates pods when under count, deletes when over count. Uses label selectors to identify owned pods.
StatefulSet Controller: Like ReplicaSet but with ordered creation/deletion, stable network identities (pod-0, pod-1), and persistent volume claims per replica.
Job Controller: Runs pods to completion. Tracks succeeded/failed counts. Supports parallelism, completions, backoff limits, and indexed jobs.
DaemonSet Controller: Ensures exactly one pod runs on every node (or a subset via nodeSelector/affinity). Used for node-level agents (log collectors, CNI, monitoring).
Cloud Controller Manager
The cloud-controller-manager runs controllers with cloud-provider-specific logic, separated from the core controller-manager to allow cloud providers to evolve independently:
- Node Controller — checks cloud provider to determine if a node VM still exists; if deleted, removes the Node object
- Route Controller — configures cloud network routes so pods on different nodes can communicate
- Service Controller — creates/updates/deletes cloud load balancers for Services of type LoadBalancer
High Availability Configuration
Production clusters must survive control plane component failures. The HA strategy differs by component:
API Server HA
Stateless — run multiple replicas behind a load balancer. Each instance connects to the same etcd cluster. Clients (kubelet, kubectl) use the LB endpoint.
Controller Manager & Scheduler HA
Only one instance can be active at a time (to avoid conflicting decisions). Uses leader election via a Lease object in Kubernetes:
- All replicas start and attempt to acquire the lease
- Winner becomes active leader, others are standby
- Leader renews lease every ~2 seconds
- If renewal fails (crash), another replica acquires the lease within ~15 seconds
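These steps can be simulated with explicit timestamps. The `Lease` class and `try_acquire_or_renew` function are hypothetical and only model the timing rule; the real mechanism updates a Lease object through the API Server and relies on optimistic concurrency so that two candidates cannot both win. The 15-second duration mirrors the figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str = ""
    renew_time: float = 0.0
    duration_seconds: float = 15.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Take or renew the lease unless another holder renewed it recently."""
    held_by_other = lease.holder not in ("", candidate)
    expired = now - lease.renew_time > lease.duration_seconds
    if held_by_other and not expired:
        return False
    lease.holder = candidate
    lease.renew_time = now
    return True


lease = Lease()
# (time in seconds, replica that attempts to acquire or renew)
timeline = [(0, "cm-a"), (2, "cm-b"), (2, "cm-a"), (10, "cm-b"), (18, "cm-b")]
results = []
for now, candidate in timeline:
    won = try_acquire_or_renew(lease, candidate, now)
    results.append(won)
    print(f"t={now:>2}s {candidate}: {'leader' if won else 'standby'}")
```

In this timeline cm-a stops renewing after t=2s (as if it had crashed), so cm-b's attempt at t=18s succeeds once more than 15 seconds have passed since the last renewal.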
etcd HA
Run 3, 5, or 7 members (always odd for Raft majority). Tolerates (n-1)/2 failures. 3 members is standard; 5 for large clusters; 7 is maximum recommended (more members increase write latency due to quorum requirement).
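The preference for odd member counts follows directly from the quorum arithmetic, as this short calculation shows (function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of the cluster."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a majority remains."""
    return members - quorum(members)


print("members quorum tolerated_failures")
for n in range(1, 8):
    print(f"{n:>7} {quorum(n):>6} {tolerated_failures(n):>18}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the extra member adds replication cost without adding resilience.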
The Control Plane as SDN Controller
The Kubernetes control plane is, in effect, an SDN controller for compute resources. The API Server is the centralized brain, and etcd is the routing table (cluster state). Controllers are the routing protocols, continuously computing desired state. The Scheduler is path computation, and the kubelet on each node is the data-plane forwarding engine that executes the control plane's decisions. It is the same separation-of-concerns pattern from networking, applied to container orchestration.