Part 6: Kubernetes Architecture

May 14, 2026 Wasil Zafar 40 min read

Kubernetes is not "just container orchestration." It is a distributed systems control plane for declarative infrastructure management — implementing every resilience principle we've studied so far.

Table of Contents

  1. The Kubernetes Mental Model
  2. Control Plane Components
  3. Worker Node Components
  4. Component Communication Flow
  5. Cluster Setup & Installation
  6. Exercises
  7. Conclusion

The Kubernetes Mental Model

Declarative Reconciliation

Every concept from Parts 1–5 — consensus, replication, service discovery, resilience — converges in Kubernetes. But Kubernetes adds one powerful abstraction that makes it all manageable: declarative reconciliation.

The Core Insight: You don't tell Kubernetes how to do things. You tell it what you want, and it figures out how to get there — and how to stay there. This is the difference between imperative ("create 3 pods, put them on nodes 1, 2, and 3") and declarative ("I want 3 replicas of this application running at all times").
Kubernetes Reconciliation Model
flowchart TD
    A[User Submits Desired State] --> B[API Server Stores in etcd]
    B --> C[Controllers Watch for Changes]
    C --> D{Desired == Actual?}
    D -->|Yes| E[No action needed]
    D -->|No| F[Controller takes corrective action]
    F --> G[Actual state moves toward desired]
    G --> C
    E --> C
                            

This model is fundamentally different from traditional infrastructure management:

| Aspect | Imperative (Traditional) | Declarative (Kubernetes) |
|---|---|---|
| Instructions | "Create VM, install nginx, start service" | "3 nginx pods should be running" |
| Failure handling | Manual detection, manual fix | Auto-detected, auto-fixed |
| Drift | Accumulates silently | Continuously reconciled |
| Scaling | "Create 2 more VMs and configure them" | "Change replicas from 3 to 5" |
| State tracking | CMDB (often outdated) | etcd is single source of truth |

Desired State vs Actual State

This is the single most important concept in Kubernetes. Everything else flows from it:

# This YAML is a declaration of DESIRED STATE:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3          # "I want 3 pods running at all times"
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"

# When you apply this:
# 1. API Server stores it in etcd
# 2. Deployment controller sees: desired=3, actual=0
# 3. Creates a ReplicaSet
# 4. ReplicaSet controller sees: desired=3, actual=0
# 5. Creates 3 Pod objects
# 6. Scheduler sees 3 unscheduled pods
# 7. Assigns each to a node
# 8. kubelet on each node starts the container

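# If a pod is lost (for example, deleted or evicted):
# 1. ReplicaSet controller sees: desired=3, actual=2
# 2. Creates 1 new Pod
# 3. Scheduler assigns it
# 4. kubelet starts it
# Typical recovery time: ~5-15 seconds when the image is already cached on the node
# (A container that merely crashes is restarted in place by the kubelet; no new Pod is created.)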
Analogy — The Thermostat: A thermostat is a reconciliation loop. You set desired temperature (21°C). The thermostat constantly observes actual temperature. If actual < desired, it turns on heating. If actual > desired, it turns on cooling. Kubernetes controllers work exactly the same way — but for infrastructure instead of temperature.
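The compare-and-correct loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how the real controllers are written: `FakeCluster` and `reconcile` are hypothetical names, and the "cluster" is just an in-memory set of pod names.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakeCluster:
    """In-memory stand-in for actual cluster state (hypothetical, not a Kubernetes API)."""
    pods: set = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1))

    def create_pod(self) -> str:
        name = f"web-{next(self._ids)}"
        self.pods.add(name)
        return name

    def delete_pod(self, name: str) -> None:
        self.pods.discard(name)


def reconcile(desired_replicas: int, cluster: FakeCluster) -> list:
    """One pass of the loop: compare desired with actual and correct the difference."""
    actions = []
    diff = desired_replicas - len(cluster.pods)
    if diff > 0:
        for _ in range(diff):
            actions.append(("create", cluster.create_pod()))
    elif diff < 0:
        for name in sorted(cluster.pods)[:-diff]:
            cluster.delete_pod(name)
            actions.append(("delete", name))
    return actions


cluster = FakeCluster()
first = reconcile(3, cluster)      # desired=3, actual=0: three creates
cluster.delete_pod("web-2")        # simulate losing a pod
second = reconcile(3, cluster)     # desired=3, actual=2: one create
third = reconcile(3, cluster)      # already converged: no actions
print(first, second, third, sep="\n")
```

The function never records what it did last time. It only looks at the current difference, which is why the same pass handles initial creation, recovery, and the steady state.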

Control Plane Components

The control plane is the "brain" of the cluster. It makes global decisions about scheduling, detects failures, and maintains desired state. In production, control plane components run on dedicated nodes (often 3 or 5 for high availability).

Kubernetes Control Plane Architecture
flowchart TD
    subgraph Control Plane
        API[API Server<br/>Central Hub]
        ETCD[(etcd<br/>State Store)]
        SCHED[Scheduler<br/>Pod Placement]
        CM[Controller Manager<br/>Reconciliation Loops]
        CCM[Cloud Controller<br/>Provider Integration]
    end
    subgraph Worker Nodes
        K1[kubelet]
        K2[kubelet]
        K3[kubelet]
        KP1[kube-proxy]
        KP2[kube-proxy]
        KP3[kube-proxy]
    end
    API <--> ETCD
    SCHED --> API
    CM --> API
    CCM --> API
    K1 --> API
    K2 --> API
    K3 --> API

API Server (kube-apiserver)

The API Server is the front door to everything in Kubernetes. Every component — kubectl, controllers, kubelets, external tools — communicates exclusively through the API Server. Nothing talks to etcd directly except the API Server.

# The API Server exposes a RESTful API over HTTPS:
# Every Kubernetes operation is an API call

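# List pods (GET request to /api/v1/namespaces/default/pods)
kubectl get pods
# Roughly equivalent (credentials omitted; -k skips TLS verification, for illustration only):
# curl -k https://api-server:6443/api/v1/namespaces/default/pods

# Create a pod (POST request)
kubectl apply -f pod.yaml
# kubectl apply sends a POST when the object does not exist yet, and a PATCH when it does.
# Roughly equivalent for the create case:
# curl -X POST -H "Content-Type: application/yaml" --data-binary @pod.yaml https://api-server:6443/api/v1/namespaces/default/pods

# Watch for changes (long-lived HTTP connection with chunked responses)
kubectl get pods --watch
# Roughly equivalent (URL quoted so the shell does not interpret the "?"):
# curl "https://api-server:6443/api/v1/namespaces/default/pods?watch=true"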

# Explore the API directly:
kubectl api-resources          # List all resource types
kubectl api-versions           # List all API versions
kubectl explain deployment     # Show schema for a resource type
kubectl explain deployment.spec.template.spec.containers

Key responsibilities of the API Server:

  • Authentication: Verifies identity (certificates, tokens, OIDC)
  • Authorisation: Checks permissions (RBAC — can this user create pods?)
  • Admission Control: Validates and mutates requests (resource quotas, default values, policy enforcement)
  • Persistence: Stores validated objects in etcd
  • Watch notifications: Notifies controllers of state changes
API Server Request Processing Pipeline
flowchart LR
    A[Client Request] --> B[Authentication]
    B --> C[Authorization]
    C --> D[Admission Control<br/>Mutating Webhooks]
    D --> E[Schema Validation]
    E --> F[Admission Control<br/>Validating Webhooks]
    F --> G[Persist to etcd]
    G --> H[Response to Client]
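To make the ordering of the pipeline concrete, here is a small Python sketch that runs a request through the same stages in sequence and stops at the first rejection. Everything in it is an assumption made for illustration: the `TOKENS`, `ALLOWED`, and `STORE` dictionaries stand in for real authentication backends, RBAC, and etcd, and the ":latest" rule is an invented example policy rather than anything Kubernetes enforces by default.

```python
class Rejected(Exception):
    """Raised when a stage refuses the request; the message names the stage."""


# Hypothetical in-memory data standing in for authn/authz backends and etcd.
TOKENS = {"token-alice": "alice"}
ALLOWED = {("alice", "create", "pods")}
STORE = {}


def authenticate(req):
    user = TOKENS.get(req.get("token"))
    if user is None:
        raise Rejected("authentication: unknown token")
    req["user"] = user


def authorize(req):
    if (req["user"], req["verb"], req["resource"]) not in ALLOWED:
        raise Rejected("authorization: forbidden")


def mutate(req):
    # Mutating admission: fill in a default, as a defaulting webhook might.
    req["object"].setdefault("namespace", "default")


def validate_schema(req):
    if "name" not in req["object"]:
        raise Rejected("schema validation: name is required")


def validate_policy(req):
    # Validating admission: an illustrative policy, not a built-in rule.
    if req["object"].get("image", "").endswith(":latest"):
        raise Rejected("validating admission: ':latest' tags are not allowed")


def persist(req):
    obj = req["object"]
    STORE[(obj["namespace"], obj["name"])] = obj


PIPELINE = [authenticate, authorize, mutate, validate_schema, validate_policy, persist]


def handle(req):
    """Run the stages in order; the first rejection stops the request."""
    try:
        for stage in PIPELINE:
            stage(req)
    except Rejected as err:
        return f"rejected at {err}"
    return "created"


ok = handle({"token": "token-alice", "verb": "create", "resource": "pods",
             "object": {"name": "web", "image": "nginx:1.25"}})
bad = handle({"token": "token-alice", "verb": "create", "resource": "pods",
              "object": {"name": "web2", "image": "nginx:latest"}})
print(ok)
print(bad)
```

Mutation runs before validation so that validators see the object in its final form, defaults included. Nothing reaches the store unless every earlier stage has passed.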

etcd

etcd is the distributed key-value store that holds all cluster state. It's the single source of truth — if etcd is lost and unrecoverable, the cluster is gone. It uses the Raft consensus algorithm (which we studied in Part 2) to maintain consistency across replicas.

# What etcd stores (all Kubernetes objects as key-value pairs):
# Key: /registry/pods/default/my-pod
# Value: serialised Pod object (protobuf by default for built-in types; custom resources are stored as JSON)

# Check etcd cluster health:
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key

# Output:
# https://10.0.1.10:2379 is healthy: committed index = 15423
# https://10.0.1.11:2379 is healthy: committed index = 15423
# https://10.0.1.12:2379 is healthy: committed index = 15423

# Check cluster member list (pass the same --endpoints and TLS flags as above):
ETCDCTL_API=3 etcdctl member list --write-out=table
# +------------------+---------+--------+----------------------------+
# |        ID        | STATUS  |  NAME  |         PEER ADDRS         |
# +------------------+---------+--------+----------------------------+
# | 8e9e05c52164694d | started | etcd-0 | https://10.0.1.10:2380     |
# | 91bc3c398fb3c146 | started | etcd-1 | https://10.0.1.11:2380     |
# | fd422379fda50e48 | started | etcd-2 | https://10.0.1.12:2380     |
# +------------------+---------+--------+----------------------------+

# Backup etcd (CRITICAL for disaster recovery):
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://10.0.1.10:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
Production Critical: etcd is the most important component in a Kubernetes cluster. If etcd loses quorum, the control plane stops working: no new scheduling, no reconciliation, no writes through the API. Pods that are already running keep running, but nothing repairs or reschedules them. Always run etcd as a 3- or 5-member cluster (for quorum), use fast SSDs (etcd is latency-sensitive and needs fsync latency under 10 ms), and take regular snapshots for disaster recovery.
| etcd Property | Value | Why It Matters |
|---|---|---|
| Consensus | Raft (leader-based) | Strong consistency, linearizable reads |
| Quorum (3 nodes) | 2 of 3 must agree | Survives 1 node failure |
| Quorum (5 nodes) | 3 of 5 must agree | Survives 2 node failures |
| Storage limit | 2 GB default quota (8 GB suggested maximum) | Compaction and defragmentation required to reclaim space |
| Disk requirement | <10 ms fsync | Slow disk = slow cluster |
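The quorum rows in the table follow from simple majority arithmetic, which the short Python sketch below works through. The function names `quorum` and `tolerated_failures` are just labels chosen for this example.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a majority still remains."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Running it shows why odd sizes are preferred: going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure, so the fourth member adds replication cost but no extra resilience.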

Scheduler (kube-scheduler)

The Scheduler watches for newly created Pods that have no node assigned and selects a suitable node for each one. It doesn't run the pod — it just decides where it should go.

Scheduler Decision Pipeline
flowchart LR
    A[New Pod<br/>No Node Assigned] --> B[Filtering Phase<br/>Which nodes CAN run it?]
    B --> C[Scoring Phase<br/>Which node is BEST?]
    C --> D[Binding<br/>Assign pod to winner]
    B -->|Excludes| E[Insufficient CPU/Memory]
    B -->|Excludes| F[Taints not tolerated]
    B -->|Excludes| G[Affinity violated]
# How scheduling works step by step:

# 1. FILTERING: Eliminate nodes that can't run the pod
#    - Sufficient CPU/memory? (resource requests vs node capacity)
#    - Ports available? (hostPort conflicts)
#    - Node selectors match? (nodeSelector, nodeAffinity)
#    - Taints tolerated? (NoSchedule taints)
#    - Volume requirements met? (PV availability in zone)

# 2. SCORING: Rank remaining nodes (0-100 per plugin)
#    - NodeResourcesFit (LeastAllocated strategy): prefer nodes with most free resources
#    - NodeResourcesBalancedAllocation: balance CPU/memory usage
#    - InterPodAffinity: prefer nodes that satisfy preferred pod (anti-)affinity
#    - ImageLocality: prefer nodes that already have the container image
#    - PodTopologySpread: spread pods evenly across zones or nodes

# 3. BINDING: Assign pod to highest-scoring node
#    - Update Pod.spec.nodeName in API Server
#    - kubelet on that node picks it up

# See why a pod isn't scheduling:
kubectl describe pod stuck-pod
# Events:
#   Warning  FailedScheduling  0/5 nodes available:
#   2 Insufficient memory, 3 node(s) had taint {dedicated: gpu}

# See scheduler decisions:
kubectl get events --field-selector reason=Scheduled
# Successfully assigned default/web-pod to worker-node-2
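The filter-then-score sequence above can be sketched as a toy scheduler in Python. This is a deliberately simplified model under stated assumptions: the `Node` and `Pod` classes are hypothetical, taints are plain strings, and the single scoring function is a rough least-allocated heuristic rather than the real set of weighted plugins.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_capacity_m: int        # total CPU in millicores
    mem_capacity_mi: int       # total memory in MiB
    cpu_requested_m: int = 0   # already requested by pods on the node
    mem_requested_mi: int = 0
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    tolerations: set = field(default_factory=set)


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering: requests must fit and every taint must be tolerated."""
    fits_cpu = node.cpu_requested_m + pod.cpu_m <= node.cpu_capacity_m
    fits_mem = node.mem_requested_mi + pod.mem_mi <= node.mem_capacity_mi
    return fits_cpu and fits_mem and node.taints <= pod.tolerations


def score(pod: Pod, node: Node) -> float:
    """Scoring: 0-100, higher when more capacity stays free after placement."""
    cpu_free = (node.cpu_capacity_m - node.cpu_requested_m - pod.cpu_m) / node.cpu_capacity_m
    mem_free = (node.mem_capacity_mi - node.mem_requested_mi - pod.mem_mi) / node.mem_capacity_mi
    return 100 * (cpu_free + mem_free) / 2


def schedule(pod: Pod, nodes: list):
    """Return the chosen node name, or None if the pod must stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


nodes = [
    Node("worker-1", 4000, 8192, cpu_requested_m=3950, mem_requested_mi=4096),
    Node("worker-2", 4000, 8192, cpu_requested_m=1000, mem_requested_mi=2048),
    Node("worker-3", 4000, 8192, cpu_requested_m=2000, mem_requested_mi=2048),
    Node("gpu-1", 8000, 16384, taints={"dedicated=gpu"}),
]
web = Pod("web-pod", cpu_m=100, mem_mi=64)
chosen = schedule(web, nodes)
print(chosen)
```

Here worker-1 is filtered out for insufficient CPU and gpu-1 for an untolerated taint, so only worker-2 and worker-3 are scored. The emptiest node is the idle gpu-1, yet it is never considered, which shows that filtering is a hard gate and scoring only ranks what survives it.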

Controller Manager (kube-controller-manager)

The Controller Manager runs dozens of independent reconciliation loops (controllers), each responsible for a specific resource type. Each controller watches the API Server for changes and takes action to align actual state with desired state.

| Controller | Watches | Reconciles |
|---|---|---|
| Deployment | Deployment objects | Creates/updates ReplicaSets for rollouts |
| ReplicaSet | ReplicaSets + Pods | Maintains desired pod count |
| Node | Node heartbeats | Marks unresponsive nodes NotReady |
| Job | Job objects | Ensures pods run to completion |
| Endpoints | Services + Pods | Updates Service endpoints when pods change |
| Namespace | Namespace deletions | Cleans up all resources in deleted namespaces |
| ServiceAccount | Namespace creation | Creates default ServiceAccount per namespace |
# All controllers run as goroutines within a single binary:
# kube-controller-manager

# Check control plane component health (componentstatuses is deprecated since v1.19 and reports health, not individual controllers):
kubectl get componentstatuses
# NAME                 STATUS    MESSAGE
# controller-manager   Healthy   ok
# scheduler            Healthy   ok
# etcd-0               Healthy   {"health":"true","reason":""}

# Controller Manager flags (key configuration):
# --controllers=*                    # Enable all controllers
# --concurrent-deployment-syncs=5    # Parallel deployment reconciliations
# --node-monitor-grace-period=40s    # Time before marking node NotReady
# --pod-eviction-timeout=5m0s        # Legacy flag; with taint-based eviction the delay comes from each pod's tolerationSeconds (300s by default)
# --cluster-cidr=10.244.0.0/16      # Pod network range
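As one concrete example of what a controller decides, the Python sketch below mimics the Node controller's heartbeat check. It is a simplified model, not the real implementation: `node_status` is a hypothetical function, the timestamps are invented, and the only thing taken from the text is the 40-second grace period.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(seconds=40)   # same value as --node-monitor-grace-period above


def node_status(last_heartbeat: dict, now: datetime) -> dict:
    """Report NotReady for any node whose last heartbeat is older than the grace period."""
    return {
        name: "Ready" if now - seen <= GRACE_PERIOD else "NotReady"
        for name, seen in last_heartbeat.items()
    }


now = datetime(2026, 5, 14, 10, 0, 0)
heartbeats = {
    "worker-1": now - timedelta(seconds=8),    # heartbeating normally
    "worker-2": now - timedelta(seconds=95),   # silent for longer than the grace period
}
status = node_status(heartbeats, now)
print(status)
```

The controller infers failure from the absence of a signal rather than from an explicit error, which is why the grace period is a trade-off: a shorter value detects dead nodes sooner but is more likely to flag a node that is only briefly unreachable.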

Cloud Controller Manager

The Cloud Controller Manager connects Kubernetes to the underlying cloud provider (AWS, GCP, Azure). It handles cloud-specific operations that Kubernetes itself doesn't need to know about:

  • Node Controller: Detects when cloud VMs are deleted, updates node status
  • Route Controller: Configures cloud network routes for pod communication
  • Service Controller: Creates cloud load balancers for LoadBalancer-type Services
Why Separate? The Cloud Controller Manager was extracted from kube-controller-manager to decouple Kubernetes core from cloud provider code. This allows cloud providers to evolve independently and supports self-hosted Kubernetes (bare metal, on-premises) where no cloud controller is needed.

Worker Node Components

Worker nodes are the machines that actually run your application containers. Each worker node runs three core components:

kubelet

The kubelet is the agent on every worker node. It receives pod specifications from the API Server and ensures the described containers are running and healthy. It's the component that actually makes things happen on the physical machine.

# kubelet responsibilities:
# 1. Register the node with the API Server
# 2. Watch API Server for pods assigned to this node
# 3. Pull container images
# 4. Start/stop containers via container runtime (CRI)
# 5. Execute liveness/readiness/startup probes
# 6. Report pod status back to API Server
# 7. Manage volumes (mount/unmount)
# 8. Send node heartbeats (NodeLease)

# Check kubelet status on a node:
systemctl status kubelet

# kubelet logs:
journalctl -u kubelet --no-pager -n 20   # last 20 lines; use -f instead of -n 20 to follow live

# Key kubelet configuration:
# --pod-manifest-path=/etc/kubernetes/manifests  # Static pods
# --cluster-dns=10.96.0.10                       # CoreDNS service IP
# --max-pods=110                                 # Max pods per node
# --node-status-update-frequency=10s             # Heartbeat interval
# --eviction-hard=memory.available<100Mi         # Eviction thresholds

# Static pods (managed directly by kubelet, not API Server):
ls /etc/kubernetes/manifests/
# etcd.yaml
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml
# This is how control plane components run on control plane nodes in a kubeadm cluster!

kube-proxy

kube-proxy maintains network rules on each node that allow pods to communicate with Services. It implements the Service abstraction — translating a virtual ClusterIP into actual pod IPs.

# kube-proxy modes:

# 1. iptables mode (default):
# Creates iptables rules for each Service → endpoint mapping
# Packets are redirected at kernel level (very fast, no userspace)
iptables -t nat -L KUBE-SERVICES | head -20
# Chain KUBE-SERVICES
# target           prot  source     destination
# KUBE-SVC-ABC123  tcp   anywhere   10.96.45.12/32  /* default/my-svc */

# 2. IPVS mode (better for large clusters):
# Uses Linux IPVS (IP Virtual Server) for load balancing
# Supports more algorithms: round-robin, least-connection, weighted
# Better performance at 10,000+ services
ipvsadm -Ln
# TCP  10.96.45.12:80 rr
#   -> 10.244.1.15:8080   Masq  1  0  0
#   -> 10.244.2.22:8080   Masq  1  0  0
#   -> 10.244.3.8:8080    Masq  1  0  0

# Check kube-proxy mode:
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
# mode: "iptables"  (or "ipvs")
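The translation from a virtual ClusterIP to a real pod IP can be pictured as a lookup table with a rotation per Service. The Python sketch below models the round-robin behaviour shown in the IPVS output above. `ServiceTable` is a hypothetical helper, and the real work happens in kernel rules rather than in a userspace object like this one. In iptables mode the backend is chosen at random per connection rather than in strict rotation.

```python
from itertools import cycle


class ServiceTable:
    """Maps a virtual ClusterIP:port to real pod endpoints (hypothetical helper)."""

    def __init__(self):
        self._rotation = {}

    def sync(self, cluster_ip, endpoints):
        """Rebuild the rotation when the Service's endpoints change."""
        self._rotation[cluster_ip] = cycle(list(endpoints))

    def route(self, cluster_ip):
        """Choose the backend for a new connection, round-robin like the IPVS 'rr' scheduler."""
        return next(self._rotation[cluster_ip])


table = ServiceTable()
table.sync("10.96.45.12:80", ["10.244.1.15:8080", "10.244.2.22:8080", "10.244.3.8:8080"])
picks = [table.route("10.96.45.12:80") for _ in range(4)]
print(picks)   # the fourth connection wraps around to the first endpoint

# A pod goes away: the endpoint list shrinks and traffic only reaches the survivors.
table.sync("10.96.45.12:80", ["10.244.1.15:8080", "10.244.3.8:8080"])
after = [table.route("10.96.45.12:80") for _ in range(2)]
print(after)
```

Clients keep using the same ClusterIP throughout. Only the table behind it changes, which is what lets pods come and go without clients noticing.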

Container Runtime

The container runtime is responsible for pulling images, creating containers, and managing their lifecycle. Kubernetes communicates with it through the Container Runtime Interface (CRI) — an abstraction that allows different runtimes to be plugged in.

| Runtime | CRI Compatible | Use Case | Notes |
|---|---|---|---|
| containerd | Yes (native) | Standard production runtime | Default for most distributions |
| CRI-O | Yes (native) | Lightweight, OCI-focused | Popular with OpenShift |
| Docker Engine | Not directly (dockershim removed in 1.24) | Development only | Usable only through the external cri-dockerd adapter |
| gVisor (runsc) | Yes (via containerd) | Security sandbox | Kernel syscall interception |
| Kata Containers | Yes (via containerd) | VM-level isolation | Lightweight VMs per pod |
# Check which container runtime a cluster is using:
kubectl get nodes -o wide
# NAME       STATUS   ROLES    VERSION   CONTAINER-RUNTIME
# worker-1   Ready    <none>   v1.30.0   containerd://1.7.13
# worker-2   Ready    <none>   v1.30.0   containerd://1.7.13

# crictl: a CLI for any CRI-compatible runtime (containerd, CRI-O):
crictl ps                        # List running containers
crictl images                    # List images on node
crictl inspect <container-id>    # Container details
crictl logs <container-id>       # Container logs

# Check containerd status:
systemctl status containerd
crictl info | head -20

Component Communication Flow

Pod Creation Lifecycle

When you run kubectl apply -f deployment.yaml, here's the complete sequence of events across all components:

Complete Pod Creation Flow
sequenceDiagram
    participant U as User (kubectl)
    participant API as API Server
    participant ETCD as etcd
    participant DC as Deployment Controller
    participant RC as ReplicaSet Controller
    participant S as Scheduler
    participant KL as kubelet (Worker)
    participant CR as Container Runtime
    
    U->>API: POST /apis/apps/v1/namespaces/default/deployments
    API->>API: Authenticate + Authorise + Admit
    API->>ETCD: Store Deployment object
    API->>U: 201 Created
    
    DC->>API: Watch detects new Deployment
    DC->>API: Create ReplicaSet
    API->>ETCD: Store ReplicaSet
    
    RC->>API: Watch detects new ReplicaSet
    RC->>API: Create Pod (nodeName empty)
    API->>ETCD: Store Pod
    
    S->>API: Watch detects unscheduled Pod
    S->>S: Filter + Score nodes
    S->>API: Bind Pod to worker-2
    API->>ETCD: Update Pod.spec.nodeName
    
    KL->>API: Watch detects Pod assigned to me
    KL->>CR: Pull image + Create container
    CR->>KL: Container started
    KL->>API: Update Pod status: Running
    API->>ETCD: Store updated status
                            
Key Observation: Notice that no component coordinates with another directly. Every change to cluster state goes through the API Server; the only hop in this flow that bypasses it is the kubelet's local call to the container runtime over CRI. This is the "hub and spoke" pattern: it simplifies security (one endpoint to secure), enables audit logging (all actions pass through one point), and decouples components (any can be replaced independently).
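The hub-and-spoke flow can be simulated in a few dozen lines of Python. In this sketch, which is an illustration under stated assumptions rather than a model of the real components, `ApiServer` is an in-memory dictionary standing in for the API Server plus etcd, and each component is a function that only reads from and writes to that hub. Names such as `web-rs` and the two worker nodes are invented for the example.

```python
class ApiServer:
    """The hub: the only state the components share (stand-in for API Server + etcd)."""

    def __init__(self):
        self.objects = {}                      # (kind, name) -> fields

    def create(self, kind, name, **fields):
        self.objects[(kind, name)] = dict(fields)

    def update(self, kind, name, **fields):
        self.objects[(kind, name)].update(fields)

    def list(self, kind):
        # Return copies so components can only change state through update().
        return {n: dict(o) for (k, n), o in self.objects.items() if k == kind}


def deployment_controller(api):
    for name, dep in api.list("Deployment").items():
        if f"{name}-rs" not in api.list("ReplicaSet"):
            api.create("ReplicaSet", f"{name}-rs", replicas=dep["replicas"])
            return True
    return False


def replicaset_controller(api):
    for name, rs in api.list("ReplicaSet").items():
        owned = [p for p in api.list("Pod") if p.startswith(name + "-")]
        if len(owned) < rs["replicas"]:
            api.create("Pod", f"{name}-{len(owned)}", node=None, phase="Pending")
            return True
    return False


def scheduler(api, nodes=("worker-1", "worker-2")):
    pods = api.list("Pod")
    for name, pod in pods.items():
        if pod["node"] is None:
            # Naive placement: the node currently running the fewest pods.
            load = {n: sum(p["node"] == n for p in pods.values()) for n in nodes}
            api.update("Pod", name, node=min(nodes, key=load.get))
            return True
    return False


def kubelet(api):
    for name, pod in api.list("Pod").items():
        if pod["node"] is not None and pod["phase"] == "Pending":
            api.update("Pod", name, phase="Running")
            return True
    return False


api = ApiServer()
api.create("Deployment", "web", replicas=3)
components = [deployment_controller, replicaset_controller, scheduler, kubelet]
while any(step(api) for step in components):   # repeat until no component has work left
    pass
pods = api.list("Pod")
print(pods)
```

No function here calls another. Each one reacts to what it finds in the hub, so any of them could be removed, restarted, or replaced without the others needing to change.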

Watch Mechanism

Controllers don't poll the API Server — they establish long-lived watch connections. When any object changes, the API Server pushes the update to all watchers. This is efficient and enables near-instant reactions:

# How watches work:
# 1. Controller opens HTTP connection: GET /api/v1/pods?watch=true
# 2. API Server keeps connection open
# 3. When a pod changes, API Server sends event over the connection:
#    {"type": "MODIFIED", "object": {"kind": "Pod", ...}}
# 4. Controller processes the event and reconciles

# Watch types:
# ADDED    — new object created
# MODIFIED — existing object updated
# DELETED  — object removed

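# Resource versions let a watcher resume without missing events:
# If the connection drops, controller reconnects with last resourceVersion
# API Server replays all changes since that version, or returns 410 Gone if that
# history has been compacted, in which case the controller re-lists and starts a new watch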

# See watches in action:
kubectl get pods --watch -v=7
# I0514 10:23:45.123456   GET https://api:6443/api/v1/pods?watch=true
# I0514 10:23:45.234567   Response Status: 200 OK
# (stream of events follows...)
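Resuming a watch from a resourceVersion can be modelled with an append-only event log. The Python sketch below is a simplified illustration: `EventLog` and `Gone` are hypothetical names, versions are plain integers, and `Gone` stands in for the HTTP 410 response a client receives when the history it asks for is no longer retained.

```python
class Gone(Exception):
    """Stand-in for HTTP 410: the requested resourceVersion is no longer retained."""


class EventLog:
    def __init__(self):
        self.events = []      # (resource_version, event_type, name)
        self.version = 0
        self.compacted = 0    # events at or below this version have been discarded
        self.objects = {}     # current state: name -> resource_version

    def record(self, event_type, name):
        self.version += 1
        self.events.append((self.version, event_type, name))
        if event_type == "DELETED":
            self.objects.pop(name, None)
        else:
            self.objects[name] = self.version

    def compact(self, up_to):
        self.compacted = up_to
        self.events = [e for e in self.events if e[0] > up_to]

    def watch(self, since):
        """Return every event after `since`, or raise Gone if part of it was discarded."""
        if since < self.compacted:
            raise Gone(f"resourceVersion {since} is too old")
        return [e for e in self.events if e[0] > since]

    def list_all(self):
        """Full snapshot plus the version to start the next watch from."""
        return dict(self.objects), self.version


log = EventLog()
log.record("ADDED", "web-1")
log.record("ADDED", "web-2")
last_seen = log.watch(0)[-1][0]        # watcher has processed up to version 2

# The connection drops; changes keep happening while the watcher is away.
log.record("MODIFIED", "web-1")
log.record("DELETED", "web-2")
missed = log.watch(last_seen)          # reconnect and replay only what was missed
print(missed)

# If the history was compacted in the meantime, fall back to a fresh list.
log.compact(4)
try:
    log.watch(last_seen)
except Gone:
    snapshot, resume_from = log.list_all()
    print(snapshot, resume_from)
```

The fallback works because controllers act on the current state rather than on the sequence of events that led to it, so a full re-list loses nothing they depend on.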

Cluster Setup & Installation

kubeadm

kubeadm is the standard tool for bootstrapping Kubernetes clusters. It handles the complex process of generating certificates, configuring etcd, starting control plane components, and creating the token for worker nodes to join.

# Bootstrap a Kubernetes cluster with kubeadm:

# STEP 1: Install prerequisites (all nodes)
# - Container runtime (containerd)
# - kubeadm, kubelet, kubectl packages
# - Disable swap
# - Load required kernel modules (br_netfilter, overlay)
# - Set sysctl (net.bridge.bridge-nf-call-iptables = 1)

# STEP 2: Initialise the control plane (master node only)
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --control-plane-endpoint="k8s-api.example.com:6443" \
  --upload-certs

# Output includes:
# - kubeconfig setup instructions
# - Worker join command with token
# - Control plane join command (for HA)

# STEP 3: Configure kubectl (master node)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# STEP 4: Install CNI plugin (networking)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml

# STEP 5: Join worker nodes
# (run on each worker with the token from step 2)
sudo kubeadm join k8s-api.example.com:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:abc123...

# Verify cluster:
kubectl get nodes
# NAME       STATUS   ROLES           AGE   VERSION
# master-1   Ready    control-plane   10m   v1.30.0
# worker-1   Ready    <none>          3m    v1.30.0
# worker-2   Ready    <none>          2m    v1.30.0

Managed vs Self-Managed Kubernetes

| Aspect | Managed (EKS/GKE/AKS) | Self-Managed (kubeadm/k3s) |
|---|---|---|
| Control plane | Provider manages (HA, upgrades, etcd) | You manage everything |
| etcd | Hidden, auto-backed up | You must back up and maintain |
| Upgrades | One-click or automatic | Manual (kubeadm upgrade) |
| Networking | Pre-integrated CNI | Install and configure CNI yourself |
| Cost | $70–$200/month for control plane | $0 (just node costs) |
| Best for | Production workloads, teams without deep K8s ops expertise | Learning, edge/IoT, air-gapped, extreme customisation |
Lightweight Distributions: K3s, MicroK8s, Kind, and Minikube

For development and edge computing, lightweight distributions strip down Kubernetes:

  • K3s (Rancher): Single binary (~60MB), uses SQLite instead of etcd by default (embedded etcd is available for multi-server HA), ideal for edge/IoT and development. Production-ready for small clusters.
  • MicroK8s (Canonical): Snap-based, single-node or multi-node, built-in addons (Istio, Prometheus). Great for developers on Ubuntu.
  • Kind (Kubernetes in Docker): Runs cluster nodes as Docker containers. Perfect for CI/CD testing and local development. Not for production.
  • Minikube: Single-node cluster in a VM or container. Focused on local development with easy addon management.

Exercises

Exercise 1 — Component Identification: Using kubectl get pods -n kube-system, identify every control plane component running in your cluster. For each, explain: (a) what it does, (b) what happens if it fails, and (c) how it recovers.
Exercise 2 — Pod Creation Trace: Create a Deployment with 2 replicas and use kubectl get events --sort-by=.metadata.creationTimestamp to trace the complete creation flow. Map each event to the component that generated it (scheduler, kubelet, controller, etc.).
Exercise 3 — Failure Simulation: If you have a multi-node cluster: (a) Stop kubelet on a worker node. What happens to pods? How long before they're rescheduled? (b) Restart kubelet. What happens to the rescheduled pods? Do they move back?
Exercise 4 — Architecture Diagram: Draw a complete architecture diagram of a 3-master, 5-worker production cluster. Label: etcd cluster (3 or 5 members?), API Server load balancer, which components run where, and all communication paths. Include the CNI plugin and kube-proxy.

Conclusion

Kubernetes architecture implements every distributed systems principle we've covered:

  • Consensus (Part 2): etcd uses Raft for consistent state storage
  • CAP (Part 3): Kubernetes favours consistency — the API Server provides linearizable reads from etcd
  • Service Discovery (Part 4): CoreDNS + Services provide automatic discovery
  • Self-Healing (Part 5): Controllers continuously reconcile desired vs actual state

In Part 7, we'll explore the Kubernetes Object Model — Pods, ReplicaSets, Deployments, Services, ConfigMaps, and Secrets — and learn to write YAML manifests that declare the desired state of your applications.