
Part 6: Kubernetes Architecture

May 14, 2026 Wasil Zafar 35 min read

Kubernetes is not "just container orchestration." It is a distributed systems control plane for declarative infrastructure management — implementing every resilience principle we've studied so far.

Table of Contents

  1. Cluster Setup & Installation
  2. The Kubernetes Mental Model
  3. Control Plane Components
  4. Worker Node Components
  5. Component Communication Flow
  6. Exercises
  7. Conclusion

This article takes a build-first approach: we'll set up a working Kubernetes cluster, then explore how each component works. By having a live environment, you can verify every concept hands-on as you read. The complete "Hello Kubernetes" exercise that demonstrates all features together is in Part 7, which pairs theory with practice for each object type.

Cluster Setup & Installation

kubeadm

kubeadm is the standard tool for bootstrapping Kubernetes clusters. It handles the complex process of generating certificates, configuring etcd, starting control plane components, and creating the token for worker nodes to join.

# Bootstrap a Kubernetes cluster with kubeadm:

# STEP 1: Install prerequisites (run on ALL nodes — master + workers)

# 1a. Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# 1b. Load required kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# 1c. Set required sysctl parameters
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system

# 1d. Install containerd (container runtime)
sudo apt-get update
sudo apt-get install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
# Enable SystemdCgroup (required for kubelet)
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable containerd

# 1e. Install kubeadm, kubelet, kubectl
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p /etc/apt/keyrings

# Add Kubernetes signing key + apt repository (v1.30)
# Method A — Import from Release.key URL (works when key matches repo):
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | \
  sudo gpg --dearmor --yes -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

# Method B — If Method A fails with "NO_PUBKEY" error during apt-get update,
# the repo signing key has rotated. Import directly from a keyserver instead:
#   sudo rm -f /etc/apt/keyrings/kubernetes-apt-keyring.gpg
#   sudo gpg --no-default-keyring \
#     --keyring gnupg-ring:/etc/apt/keyrings/kubernetes-apt-keyring.gpg \
#     --keyserver hkps://keyserver.ubuntu.com \
#     --recv-keys 234654DA9A296436
#   sudo chmod 644 /etc/apt/keyrings/kubernetes-apt-keyring.gpg
#   sudo rm -f /etc/apt/keyrings/kubernetes-apt-keyring.gpg~

echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /' | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
Troubleshooting GPG Key Errors: If apt-get update fails with NO_PUBKEY 234654DA9A296436, it means the signing key in the Release.key URL is stale (Kubernetes periodically rotates keys). The fix is to use Method B above — import the key directly from Ubuntu's keyserver using the gnupg-ring: prefix. This creates the keyring in the legacy format that apt can read. The chmod 644 is required because GPG creates files with 600 permissions but apt needs read access.
# STEP 2: Initialise the control plane (master node only)
# Run EITHER Option A OR Option B, not both.
#
# Option A: single-node / learning cluster (simplest):
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16

# Option B: multi-master HA cluster (production):
# --control-plane-endpoint should point to a load balancer FQDN
# or a DNS name that resolves to your API server(s).
# If you have no load balancer yet, you can point it at this machine's IP or hostname instead:
#   --control-plane-endpoint="$(hostname -I | awk '{print $1}'):6443"
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --control-plane-endpoint="k8s-api.example.com:6443" \
  --upload-certs
Hostname Resolution: If hostname -f fails with "Name or service not known", add your hostname to /etc/hosts first: echo "127.0.0.1 $(hostname)" | sudo tee -a /etc/hosts (a plain >> redirect fails here because the redirection runs as your unprivileged user, not as root). kubeadm needs the node's hostname to resolve. The --control-plane-endpoint flag is only needed for HA setups where multiple control planes share a load balancer. For single-node clusters, omit it entirely.
# Output from kubeadm init (annotated):
# [init] Using Kubernetes version: v1.30.14
# [preflight] Running pre-flight checks
# [preflight] Pulling images required for setting up a Kubernetes cluster
# [certs] Generating "ca" certificate and key
# [certs] Generating "apiserver" certificate and key
# [certs] apiserver serving cert is signed for DNS names
#         [kubernetes kubernetes.default kubernetes.default.svc
#          kubernetes.default.svc.cluster.local your-hostname]
#         and IPs [10.96.0.1 <your-node-ip>]
# [certs] Generating etcd/ca, etcd/server, etcd/peer, front-proxy certs...
# [kubeconfig] Writing admin.conf, kubelet.conf, controller-manager.conf, scheduler.conf
# [etcd] Creating static Pod manifest for local etcd
# [control-plane] Creating static Pod manifests for apiserver, controller-manager, scheduler
# [kubelet-start] Starting the kubelet
# [kubelet-check] The kubelet is healthy after ~1s
# [api-check] The API server is healthy after ~6s
# [upload-config] Storing configuration in ConfigMap "kubeadm-config"
# [upload-certs] Storing certificates in Secret "kubeadm-certs"
# [mark-control-plane] Adding labels and taints to control-plane node
# [bootstrap-token] Creating token for node joins
# [addons] Applied essential addon: CoreDNS
# [addons] Applied essential addon: kube-proxy
#
# ✅ Your Kubernetes control-plane has initialized successfully!

# STEP 3: Configure kubectl (master node)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# (Or if root: export KUBECONFIG=/etc/kubernetes/admin.conf)

# STEP 4: Install CNI plugin (networking)
# Without a CNI, pods can't communicate and nodes stay NotReady
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml

# STEP 5: Join worker nodes
# kubeadm init outputs a join command — run it on each worker:
sudo kubeadm join your-master:6443 \
  --token <token-from-init-output> \
  --discovery-token-ca-cert-hash sha256:<hash-from-init-output>

# For additional control-plane nodes (HA), add --control-plane --certificate-key:
# sudo kubeadm join your-master:6443 --token <token> \
#   --discovery-token-ca-cert-hash sha256:<hash> \
#   --control-plane --certificate-key <cert-key-from-init-output>

# If you lost the join command, regenerate the token:
kubeadm token create --print-join-command
# Output from kubeadm join (run on each worker node):
# [preflight] Running pre-flight checks
# [preflight] Reading configuration from the cluster...
# [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
# [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
# [kubelet-start] Starting the kubelet
# [kubelet-check] The kubelet is healthy after ~500ms
# [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap
#
# ✅ This node has joined the cluster:
# * Certificate signing request was sent to apiserver and a response was received.
# * The Kubelet was informed of the new secure connection details.

# Verify cluster (run on master):
kubectl get nodes
# NAME         STATUS     ROLES           AGE    VERSION
# master-1     Ready      control-plane   55m    v1.30.14
# worker-1     Ready      <none>          2m     v1.30.14
# worker-2     NotReady   <none>          30s    v1.30.14
#
# Note: Workers show "NotReady" for 30-90 seconds while the CNI
# plugin initialises networking. This is normal — wait and re-check:
kubectl get nodes
# NAME         STATUS   ROLES           AGE    VERSION
# master-1     Ready    control-plane   55m    v1.30.14
# worker-1     Ready    <none>          2m     v1.30.14
# worker-2     Ready    <none>          97s    v1.30.14
What kubeadm init Does Under the Hood: It generates a full PKI (12+ certificates), writes kubeconfig files for each component, creates static pod manifests in /etc/kubernetes/manifests/ for etcd + apiserver + controller-manager + scheduler, starts kubelet which launches those static pods, waits for the API server to be healthy, uploads config to ConfigMaps, creates bootstrap tokens, and installs CoreDNS + kube-proxy addons. Once the control plane images have been pulled, all of this takes roughly 10 seconds.
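Bootstrap tokens are a common source of failed joins: they have a fixed shape (6 characters, a dot, then 16 characters, all lowercase letters or digits) and by default expire after 24 hours, which is why the join command sometimes has to be regenerated. The sketch below only checks the shape of a token string before you paste it into kubeadm join. The function name is_bootstrap_token and the sample tokens are hypothetical, and passing this check says nothing about whether the token is still valid on the cluster.

```shell
# is_bootstrap_token TOKEN
# Exit status 0 if TOKEN matches the documented bootstrap token shape
# [a-z0-9]{6}.[a-z0-9]{16}; non-zero otherwise. Format check only.
is_bootstrap_token() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# Hypothetical sample values, not real tokens.
for t in "abcdef.0123456789abcdef" "ABCDEF.0123456789abcdef" "abcdef-0123456789abcdef"; do
    if is_bootstrap_token "$t"; then
        echo "$t: valid format"
    else
        echo "$t: invalid format"
    fi
done
```

A malformed token usually means a copy-and-paste error; a well-formed token that is still rejected has most likely expired.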

Single-Node Alternative

For a single-node learning cluster, remove the control-plane taint so pods can schedule on the master:

# SINGLE-NODE SETUP: Allow pods to run on the control-plane node
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

# Verify the taint is gone:
kubectl describe node | grep Taints
# Taints: <none>

Managed vs Self-Managed Kubernetes

| Aspect | Managed (EKS/GKE/AKS) | Self-Managed (kubeadm/k3s) |
| --- | --- | --- |
| Control plane | Provider manages (HA, upgrades, etcd) | You manage everything |
| etcd | Hidden, auto-backed up | You must back up and maintain |
| Upgrades | One-click or automatic | Manual (kubeadm upgrade) |
| Networking | Pre-integrated CNI | Install and configure CNI yourself |
| Cost | Roughly $70–$200/month for the control plane, depending on provider and tier | $0 (just node costs) |
| Best for | Production workloads, teams without deep K8s ops expertise | Learning, edge/IoT, air-gapped, extreme customisation |

Lightweight Distributions: K3s, MicroK8s, and Kind

For development and edge computing, lightweight distributions strip down Kubernetes:

  • K3s (Rancher): Single binary (~60MB) that uses SQLite instead of etcd by default (embedded etcd is available for HA). Ideal for edge/IoT and development. Production-ready for small clusters.
  • MicroK8s (Canonical): Snap-based, single-node or multi-node, built-in addons (Istio, Prometheus). Great for developers on Ubuntu.
  • Kind (Kubernetes in Docker): Runs cluster nodes as Docker containers. Perfect for CI/CD testing and local development. Not for production.
  • Minikube: Single-node cluster in a VM or container. Focused on local development with easy addon management.
Going Deeper (Local Dev Tracks): This section introduces Minikube and Kind conceptually. For full hands-on walkthroughs (install, addons, multi-node clusters, and CI integration), see the dedicated Minikube Track and Kind Track in this series.
Checkpoint: You now have a running Kubernetes cluster. In the sections below, we'll explore what each component does — and you can verify every concept by running commands against your live cluster.

The Kubernetes Mental Model

Now that you have a running cluster, let's explore the mental model that makes Kubernetes tick. Start by looking at what's already running:

# See every component Kubernetes installed automatically:
kubectl get pods -n kube-system
# NAME                                       READY   STATUS    AGE
# calico-kube-controllers-...                1/1     Running   ...
# calico-node-...                            1/1     Running   ...
# coredns-...                                1/1     Running   ...
# etcd-master-1                              1/1     Running   ...
# kube-apiserver-master-1                    1/1     Running   ...
# kube-controller-manager-master-1           1/1     Running   ...
# kube-proxy-...                             1/1     Running   ...
# kube-scheduler-master-1                    1/1     Running   ...

# Every component you see here is explained in the sections below.

Declarative Reconciliation

Every concept from Parts 1–5 — consensus, replication, service discovery, resilience — converges in Kubernetes. But Kubernetes adds one powerful abstraction that makes it all manageable: declarative reconciliation.

The Core Insight: You don't tell Kubernetes how to do things. You tell it what you want, and it figures out how to get there — and how to stay there. This is the difference between imperative ("create 3 pods, put them on nodes 1, 2, and 3") and declarative ("I want 3 replicas of this application running at all times").
Kubernetes Reconciliation Model
flowchart TD
    A[User Submits Desired State] --> B[API Server Stores in etcd]
    B --> C[Controllers Watch for Changes]
    C --> D{Desired == Actual?}
    D -->|Yes| E[No action needed]
    D -->|No| F[Controller takes corrective action]
    F --> G[Actual state moves toward desired]
    G --> C
    E --> C
                            

This model is fundamentally different from traditional infrastructure management:

| Aspect | Imperative (Traditional) | Declarative (Kubernetes) |
| --- | --- | --- |
| Instructions | "Create VM, install nginx, start service" | "3 nginx pods should be running" |
| Failure handling | Manual detection, manual fix | Auto-detected, auto-fixed |
| Drift | Accumulates silently | Continuously reconciled |
| Scaling | "Create 2 more VMs and configure them" | "Change replicas from 3 to 5" |
| State tracking | CMDB (often outdated) | etcd is single source of truth |

Desired State vs Actual State

This is the single most important concept in Kubernetes. Everything else flows from it:

Reading YAML — A Quick Primer: Kubernetes resources are defined in YAML files (a human-readable data format). You don't need to memorise the syntax now — Part 7 covers the object model in detail. For now, just understand the structure:
  • apiVersion + kind: what type of resource this is (e.g., a Deployment)
  • metadata: the resource's name and labels
  • spec: your desired state (how many replicas, which container image, etc.)
You submit this file with kubectl apply -f filename.yaml. Kubernetes reads it, stores it in etcd, and works to make reality match your declaration.
# This YAML is a declaration of DESIRED STATE:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3          # "I want 3 pods running at all times"
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"

# When you apply this:
# 1. API Server stores it in etcd
# 2. Deployment controller sees: desired=3, actual=0
# 3. Creates a ReplicaSet
# 4. ReplicaSet controller sees: desired=3, actual=0
# 5. Creates 3 Pod objects
# 6. Scheduler sees 3 unscheduled pods
# 7. Assigns each to a node
# 8. kubelet on each node starts the container

# If a pod crashes:
# 1. ReplicaSet controller sees: desired=3, actual=2
# 2. Creates 1 new Pod
# 3. Scheduler assigns it
# 4. kubelet starts it
# Total recovery time: ~5-15 seconds

# Don't worry about writing YAML yet — Part 7 teaches every field.
# For now, focus on the PATTERN: declare what you want → Kubernetes makes it happen.
Analogy — The Thermostat: A thermostat is a reconciliation loop. You set desired temperature (21°C). The thermostat constantly observes actual temperature. If actual < desired, it turns on heating. If actual > desired, it turns on cooling. Kubernetes controllers work exactly the same way — but for infrastructure instead of temperature.
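The observe, compare, act cycle can be sketched in a few lines of shell. This is a toy model under stated assumptions, not real controller code: the function name reconcile and the pod counts are hypothetical, and a real controller reacts to watch events from the API Server rather than counting in a loop.

```shell
# reconcile DESIRED ACTUAL
# Toy reconciliation loop: compares desired and actual pod counts and
# takes one corrective step per iteration until they match.
reconcile() {
    desired=$1
    actual=$2
    while [ "$actual" -ne "$desired" ]; do
        if [ "$actual" -lt "$desired" ]; then
            actual=$((actual + 1))
            echo "actual < desired: created 1 pod (actual=$actual)"
        else
            actual=$((actual - 1))
            echo "actual > desired: deleted 1 pod (actual=$actual)"
        fi
    done
    echo "converged: desired=$desired actual=$actual"
}

reconcile 3 0   # initial rollout: nothing is running yet
reconcile 3 2   # one pod crashed
reconcile 3 5   # extra pods were started by hand
```

The same function handles all three situations because it never asks how the drift happened. It only looks at the gap between desired and actual state, which is the property that makes declarative systems self-healing.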

Control Plane Components

The control plane is the "brain" of the cluster. It makes global decisions about scheduling, detects failures, and maintains desired state. In production, control plane components run on dedicated nodes (often 3 or 5 for high availability).

Kubernetes Control Plane Architecture
flowchart LR
    subgraph CP[Control Plane]
        direction TB
        ETCD[(etcd<br/>State Store)]
        API[API Server<br/>Central Hub]
        SCHED[Scheduler<br/>Pod Placement]
        CM[Controller Manager<br/>Reconciliation Loops]
        CCM[Cloud Controller<br/>Provider Integration]
        API <-->|read/write state| ETCD
        SCHED -->|watch unscheduled pods| API
        CM -->|watch & reconcile| API
        CCM -->|cloud resources| API
    end
    subgraph W1[Worker Node 1]
        direction TB
        K1[kubelet<br/>Pod Lifecycle]
        KP1[kube-proxy<br/>Network Rules]
    end
    subgraph W2[Worker Node 2]
        direction TB
        K2[kubelet<br/>Pod Lifecycle]
        KP2[kube-proxy<br/>Network Rules]
    end
    K1 -->|report status & watch pods| API
    KP1 -->|watch Services & Endpoints| API
    K2 -->|report status & watch pods| API
    KP2 -->|watch Services & Endpoints| API

API Server (kube-apiserver)

The API Server is the front door to everything in Kubernetes. Every component — kubectl, controllers, kubelets, external tools — communicates exclusively through the API Server. Nothing talks to etcd directly except the API Server.

# The API Server exposes a RESTful API over HTTPS:
# Every Kubernetes operation is an API call

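# List pods (GET request to /api/v1/namespaces/default/pods)
kubectl get pods
# Roughly equivalent (client credentials omitted; without them the request is normally rejected):
#   curl -k https://api-server:6443/api/v1/namespaces/default/pods

# Create a pod (POST request when the object does not exist yet)
kubectl apply -f pod.yaml
# Roughly equivalent (--data-binary keeps the YAML newlines that -d would strip):
#   curl -k -X POST -H 'Content-Type: application/yaml' --data-binary @pod.yaml \
#     https://api-server:6443/api/v1/namespaces/default/pods

# Watch for changes (long-lived HTTP connection with chunked responses)
kubectl get pods --watch
# Roughly equivalent: curl -k 'https://api-server:6443/api/v1/namespaces/default/pods?watch=true'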

# Explore the API directly:
kubectl api-resources          # List all resource types
kubectl api-versions           # List all API versions
kubectl explain deployment     # Show schema for a resource type
kubectl explain deployment.spec.template.spec.containers

# ╔══════════════════════════════════════════════════════════╗
# ║  🔧 TRY IT: Run these on your cluster right now:       ║
# ║  kubectl api-resources | head -20                      ║
# ║  kubectl get pods -n kube-system -l component=kube-apiserver  ║
# ╚══════════════════════════════════════════════════════════╝

Key responsibilities of the API Server:

  • Authentication: Verifies identity (certificates, tokens, OIDC)
  • Authorisation: Checks permissions (RBAC — can this user create pods?)
  • Admission Control: Validates and mutates requests (resource quotas, default values, policy enforcement)
  • Persistence: Stores validated objects in etcd
  • Watch notifications: Notifies controllers of state changes
API Server Request Processing Pipeline
flowchart TD
    A["`**Client Request**
    kubectl apply -f deploy.yaml`"] --> B

    subgraph AUTH[Identity & Access]
        direction LR
        B[Authentication<br/>Who are you?] --> C[Authorization<br/>Are you allowed?]
    end
    C --> D
    subgraph ADM[Admission Control]
        direction LR
        D[Mutating Webhooks<br/>Modify the request] --> E[Schema Validation<br/>Is it well-formed?] --> F[Validating Webhooks<br/>Policy checks]
    end
    F --> G[Persist to etcd]
    G --> H["`**Response to Client** 201 Created`"]
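The ordering of the pipeline matters: a request that fails an early stage never reaches a later one. The sketch below is a toy model of that short-circuiting, assuming a hypothetical user list (alice, bob), a hypothetical permission table, and a made-up admission policy that rejects a resource named privileged-pod. None of these names come from a real cluster, and the real API Server evaluates certificates, RBAC objects, and webhooks instead of case statements.

```shell
# handle_request USER VERB RESOURCE
# Toy request pipeline: authentication, then authorization, then admission
# (writes only), then persistence. Prints the outcome of the first stage
# that rejects the request, or the success result.
handle_request() {
    user=$1; verb=$2; resource=$3

    # 1. Authentication: is this a known identity? (hypothetical user list)
    case "$user" in
        alice|bob) ;;
        *) echo "rejected: unauthenticated (401)"; return ;;
    esac

    # 2. Authorization: toy permission table (alice: anything, bob: read-only).
    case "$user:$verb" in
        alice:*|bob:get|bob:list) ;;
        *) echo "rejected: forbidden (403)"; return ;;
    esac

    # 3. Admission control applies to writes, not reads.
    if [ "$verb" = "create" ]; then
        if [ "$resource" = "privileged-pod" ]; then
            echo "rejected: denied by admission policy"; return
        fi
        # 4. Persistence: a real API Server would write the object to etcd here.
        echo "accepted: 201 Created"
    else
        echo "accepted: 200 OK"
    fi
}

handle_request mallory get pods
handle_request bob create web
handle_request alice create privileged-pod
handle_request alice create web
```

Note that bob's create request is stopped at authorization, so the admission policy never sees it. This is why admission webhooks can assume the caller has already been authenticated and authorized.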

etcd

etcd is the distributed key-value store that holds all cluster state. It's the single source of truth — if etcd is lost and unrecoverable, the cluster is gone. It uses the Raft consensus algorithm (which we studied in Part 2) to maintain consistency across replicas.

# What etcd stores (all Kubernetes objects as key-value pairs):
# Key: /registry/pods/default/my-pod
# Value: serialized Pod object (protobuf by default for built-in types; custom resources are stored as JSON)

# ╔══════════════════════════════════════════════════════════╗
# ║  🔧 TRY IT: Find YOUR etcd endpoints and certs first: ║
# ╚══════════════════════════════════════════════════════════╝

# Step 1: Get your etcd pod name (it varies per cluster!):
kubectl get pods -n kube-system -l component=etcd
# NAME                                    READY   STATUS    AGE
# etcd-k8s-master.lab.example.com         1/1     Running   10h

# Step 2: Store the pod name in a variable for reuse:
ETCD_POD=$(kubectl get pods -n kube-system -l component=etcd -o jsonpath='{.items[0].metadata.name}')
echo $ETCD_POD
# etcd-k8s-master.lab.example.com

# Step 3: Extract the advertise URL (your actual etcd endpoint):
kubectl describe pod $ETCD_POD -n kube-system | grep -- --advertise-client-urls
# --advertise-client-urls=https://10.42.38.10:2379

# Step 4: Use kubectl exec to run etcdctl INSIDE the etcd pod
# (certs are already mounted inside — no host path issues):
kubectl exec -n kube-system $ETCD_POD -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Output:
# https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 11.16ms

# Member list (shows all etcd nodes in an HA cluster):
kubectl exec -n kube-system $ETCD_POD -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list --write-out=table

# Single-node cluster output:
# +------------------+---------+------------------------------+--------------------------+--------------------------+------------+
# |        ID        | STATUS  |             NAME             |       PEER ADDRS         |      CLIENT ADDRS        | IS LEARNER |
# +------------------+---------+------------------------------+--------------------------+--------------------------+------------+
# | e179b74d7e9a5155 | started | k8s-master.lab.example.com   | https://10.42.38.10:2380 | https://10.42.38.10:2379 |      false |
# +------------------+---------+------------------------------+--------------------------+--------------------------+------------+
#
# Multi-node HA cluster output (3 voting members):
# | 8e9e05c52164694d | started | master-1 | https://192.168.1.100:2380 | https://192.168.1.100:2379 | false |
# | 91bc3c398fb3c146 | started | master-2 | https://192.168.1.101:2380 | https://192.168.1.101:2379 | false |
# | fd422379fda50e48 | started | master-3 | https://192.168.1.102:2380 | https://192.168.1.102:2379 | false |
#
# IS LEARNER column:
#   false = Full voting member (participates in Raft quorum)
#   true  = Learner/non-voting member (receives log replication but
#           does NOT vote in elections or count toward quorum).
#           Used when adding a new node — it catches up on data first,
#           then gets promoted to voter with: etcdctl member promote <memberID>
#           This prevents a slow new node from disrupting cluster consensus.

# Backup etcd (CRITICAL for disaster recovery):
kubectl exec -n kube-system $ETCD_POD -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/lib/etcd/snapshot.db

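# Copy the snapshot to your local machine.
# Note: kubectl cp needs tar inside the container, which minimal etcd images may not ship:
kubectl cp kube-system/$ETCD_POD:/var/lib/etcd/snapshot.db ./etcd-backup.db
# If that fails: in kubeadm clusters /var/lib/etcd is a hostPath volume, so the snapshot
# is also on the control-plane node's own filesystem and can be copied from there:
#   sudo cp /var/lib/etcd/snapshot.db ./etcd-backup.db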
Production Critical: etcd is the most important component in a Kubernetes cluster. If etcd is down, the control plane cannot function: no new scheduling, no reconciliation, no API access. Containers that are already running keep running, but nothing about the cluster can change until etcd is back. Always run etcd as a 3- or 5-member cluster (for quorum), use fast SSDs (etcd is latency-sensitive and needs fsync latency under 10ms), and take regular snapshots for disaster recovery.

| etcd Property | Value | Why It Matters |
| --- | --- | --- |
| Consensus | Raft (leader-based) | Strong consistency, linearizable reads |
| Quorum (3 nodes) | 2 of 3 must agree | Survives 1 node failure |
| Quorum (5 nodes) | 3 of 5 must agree | Survives 2 node failures |
| Storage limit | 2 GB default quota (8 GB suggested maximum) | Compaction and defragmentation required to reclaim space |
| Disk requirement | <10ms fsync | Slow disk = slow cluster |
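The quorum figures follow from simple majority arithmetic: a cluster of N members needs floor(N/2) + 1 votes, so it can lose floor((N-1)/2) members. A minimal sketch of that arithmetic is shown here; the function names quorum and tolerated are illustrative, not etcd commands.

```shell
# quorum N     -> votes needed for a majority of N members
# tolerated N  -> members that can fail while a majority still survives
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
# members=1 quorum=1 tolerated_failures=0
# members=2 quorum=2 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

The output also shows why even cluster sizes are avoided: going from 3 to 4 members raises the quorum without tolerating any additional failure.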

Scheduler (kube-scheduler)

The Scheduler watches for newly created Pods that have no node assigned and selects a suitable node for each one. It doesn't run the pod — it just decides where it should go.

Scheduler Decision Pipeline
flowchart LR
    A[New Pod<br/>No Node Assigned] --> B[Filtering Phase<br/>Which nodes CAN run it?]
    B --> C[Scoring Phase<br/>Which node is BEST?]
    C --> D[Binding<br/>Assign pod to winner]
    B -->|Excludes| E[Insufficient CPU/Memory]
    B -->|Excludes| F[Taints not tolerated]
    B -->|Excludes| G[Affinity violated]

The scheduler works in three phases every time it sees an unassigned pod:

1. Filtering — Which nodes can run this pod?

Eliminates nodes that violate hard constraints:

  • Resources: Does the node have enough free CPU/memory for the pod's requests?
  • Ports: Is the required hostPort already taken?
  • Node selectors: Does the node match nodeSelector or nodeAffinity rules?
  • Taints: Does the pod tolerate the node's taints (e.g., NoSchedule)?
  • Volumes: Is the required PersistentVolume available in the node's zone?

2. Scoring — Which surviving node is best?

Ranks each remaining node 0–100 using scoring plugins:

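  • NodeResourcesFit with the LeastAllocated strategy (formerly LeastRequestedPriority): Prefer nodes with the most free resources
  • NodeResourcesBalancedAllocation (formerly BalancedResourceAllocation): Prefer nodes where CPU and memory usage are balanced
  • InterPodAffinity: Prefer nodes that satisfy the pod's preferred affinity rules, such as nodes already running co-located pods
  • ImageLocality: Prefer nodes that already pulled the container image (faster start)
  • PodTopologySpread: Spread pods evenly across zones/nodes according to the pod's topologySpreadConstraints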

3. Binding — Assign the pod to the winner

The scheduler records its decision by creating a Binding through the API Server, which sets the pod's spec.nodeName. The kubelet on that node detects the assignment and starts the container.
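The filter-then-score flow can be sketched as a small shell function. This is a deliberately simplified model: the node inventory is hypothetical, only three hard constraints are checked, and the score simply adds leftover CPU and memory, whereas the real scheduler normalises each plugin's score to 0–100 and combines them with weights.

```shell
# schedule POD_CPU_M POD_MEM_MI   (node inventory is read from stdin)
# Each input line: NAME FREE_CPU_MILLICORES FREE_MEM_MIB TAINTED(yes|no)
# Prints the chosen node, or "unschedulable" if every node is filtered out.
schedule() {
    pod_cpu=$1
    pod_mem=$2
    best_node="unschedulable"
    best_score=-1
    while read -r name free_cpu free_mem tainted; do
        # Filtering: failing any hard constraint removes the node.
        [ "$tainted" = "yes" ] && continue
        [ "$free_cpu" -lt "$pod_cpu" ] && continue
        [ "$free_mem" -lt "$pod_mem" ] && continue
        # Scoring: toy "most resources left after placement" metric.
        score=$(( (free_cpu - pod_cpu) + (free_mem - pod_mem) ))
        if [ "$score" -gt "$best_score" ]; then
            best_score=$score
            best_node=$name
        fi
    done
    echo "$best_node"
}

# Hypothetical inventory, not taken from a real cluster.
nodes="worker-1 500 1024 no
worker-2 2000 4096 no
worker-3 4000 8192 yes
worker-4 50 256 no"

echo "$nodes" | schedule 100 64     # small pod: several candidates survive filtering
echo "$nodes" | schedule 8000 64    # oversized pod: every node is filtered out
```

In the first call worker-3 has the most free capacity but is excluded by its taint, and worker-4 lacks CPU, so the choice falls to the better of the two survivors. The second call mirrors a Pending pod with a FailedScheduling event.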

# See why a pod isn't scheduling:
kubectl describe pod stuck-pod
# Events:
#   Warning  FailedScheduling  0/5 nodes available:
#   2 Insufficient memory, 3 node(s) had taint {dedicated: gpu}

# See successful scheduler decisions:
kubectl get events --field-selector reason=Scheduled
# Successfully assigned default/web-pod to worker-node-2

Controller Manager (kube-controller-manager)

The Controller Manager runs dozens of independent reconciliation loops (controllers), each responsible for a specific resource type. Each controller watches the API Server for changes and takes action to align actual state with desired state.

| Controller | Watches | Reconciles |
| --- | --- | --- |
| Deployment | Deployment objects | Creates/updates ReplicaSets for rollouts |
| ReplicaSet | ReplicaSets + Pods | Maintains desired pod count |
| Node | Node heartbeats | Marks unresponsive nodes NotReady |
| Job | Job objects | Ensures pods run to completion |
| Endpoints | Services + Pods | Updates Service endpoints when pods change |
| Namespace | Namespace deletions | Cleans up all resources in deleted namespaces |
| ServiceAccount | Namespace creation | Creates default ServiceAccount per namespace |

# All controllers run as goroutines within a single binary:
# kube-controller-manager

# Check control plane component health. This does NOT list individual controllers,
# and componentstatuses has been deprecated since v1.19 (expect a warning):
kubectl get componentstatuses
# NAME                 STATUS    MESSAGE
# controller-manager   Healthy   ok
# scheduler            Healthy   ok
# etcd-0               Healthy   {"health":"true","reason":""}

# Controller Manager flags (key configuration):
# --controllers=*                    # Enable all on-by-default controllers
# --concurrent-deployment-syncs=5    # Parallel deployment reconciliations
# --node-monitor-grace-period=40s    # Time before marking node NotReady
# --pod-eviction-timeout=5m0s        # Legacy flag, removed in recent releases; eviction from NotReady
#                                    # nodes is now driven by NoExecute taints (default toleration 300s)
# --cluster-cidr=10.244.0.0/16      # Pod network range
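To make the --node-monitor-grace-period setting concrete, the sketch below models the Node controller's decision as a comparison between the age of the last heartbeat and the grace period. The function name node_status and the timestamps are hypothetical. The real controller works from Lease and NodeStatus objects, not from plain integers.

```shell
# node_status LAST_HEARTBEAT NOW GRACE   (all values in seconds)
# Toy version of the Node controller check: a node whose last heartbeat
# is older than the grace period is reported as NotReady.
node_status() {
    age=$(( $2 - $1 ))
    if [ "$age" -le "$3" ]; then
        echo "Ready (last heartbeat ${age}s ago)"
    else
        echo "NotReady (last heartbeat ${age}s ago, grace period ${3}s)"
    fi
}

node_status 1000 1010 40   # heartbeat 10s ago: still within the grace period
node_status 1000 1045 40   # heartbeat 45s ago: grace period exceeded
```

A longer grace period tolerates brief network blips at the cost of slower failure detection, which is the usual trade-off when tuning this value.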

Cloud Controller Manager

The Cloud Controller Manager connects Kubernetes to the underlying cloud provider (AWS, GCP, Azure). It handles cloud-specific operations that Kubernetes itself doesn't need to know about:

  • Node Controller: Detects when cloud VMs are deleted, updates node status
  • Route Controller: Configures cloud network routes for pod communication
  • Service Controller: Creates cloud load balancers for LoadBalancer-type Services
Why Separate? The Cloud Controller Manager was extracted from kube-controller-manager to decouple Kubernetes core from cloud provider code. This allows cloud providers to evolve independently and supports self-hosted Kubernetes (bare metal, on-premises) where no cloud controller is needed.

Worker Node Components

Worker nodes are the machines that actually run your application containers. Each worker node runs three core components: the kubelet, kube-proxy, and a container runtime (containerd in this guide).

kubelet

The kubelet is the agent on every worker node. It receives pod specifications from the API Server and ensures the described containers are running and healthy. It's the component that actually makes things happen on the physical machine.

# kubelet responsibilities:
# 1. Register the node with the API Server
# 2. Watch API Server for pods assigned to this node
# 3. Pull container images
# 4. Start/stop containers via container runtime (CRI)
# 5. Execute liveness/readiness/startup probes
# 6. Report pod status back to API Server
# 7. Manage volumes (mount/unmount)
# 8. Send node heartbeats (NodeLease)

# Check kubelet status on a node:
systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#      Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: enabled)
#     Drop-In: /usr/lib/systemd/system/kubelet.service.d
#              └─10-kubeadm.conf
#      Active: active (running) since Sun 2026-06-07 14:27:41 UTC; 1 week 0 days ago
#        Docs: https://kubernetes.io/docs/
#    Main PID: 6416 (kubelet)
#       Tasks: 14 (limit: 19093)
#      Memory: 35.4M (peak: 37.1M)
#         CPU: 3h 12min 14.748s
#      CGroup: /system.slice/kubelet.service
#              └─6416 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boot...
#
# Jun 14 22:24:46 k8s-worker-1.lab.example.com kubelet[6416]: I0614 22:24:46.9...
# Jun 14 22:24:47 k8s-worker-1.lab.example.com kubelet[6416]: I0614 22:24:47.1...

# kubelet logs (last 20 lines; replace -n 20 with -f for a live tail):
journalctl -u kubelet -n 20 --no-pager

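# Key kubelet configuration (shown as legacy flags; kubeadm sets the equivalent fields,
# such as staticPodPath, clusterDNS and maxPods, in /var/lib/kubelet/config.yaml):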
# --pod-manifest-path=/etc/kubernetes/manifests  # Static pods
# --cluster-dns=10.96.0.10                       # CoreDNS service IP
# --max-pods=110                                 # Max pods per node
# --node-status-update-frequency=10s             # Heartbeat interval
# --eviction-hard=memory.available<100Mi         # Eviction thresholds

# Static pods (managed directly by kubelet, not API Server):
# ON A MASTER NODE:
ls /etc/kubernetes/manifests/
# etcd.yaml
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml
# These are how control plane components run on master nodes!

# ON A WORKER NODE:
ls /etc/kubernetes/manifests/
# (empty — workers have no static pods, only kubelet + kube-proxy)

# ╔══════════════════════════════════════════════════════════╗
# ║  🔧 TRY IT: SSH into a worker node and run:            ║
# ║  systemctl status kubelet                              ║
# ║  ls /etc/kubernetes/manifests/   (empty on workers!)   ║
# ║  crictl ps                       (running containers) ║
# ╚══════════════════════════════════════════════════════════╝

kube-proxy

kube-proxy maintains network rules on each node that allow pods to communicate with Services. It implements the Service abstraction — translating a virtual ClusterIP into actual pod IPs.
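In iptables mode, the translation from one ClusterIP to several pod IPs is typically done with a chain of random-probability rules: with N endpoints, rule i matches with probability 1/(N-i+1), and the last rule always matches, which gives every endpoint an equal overall share. The sketch below only reproduces that arithmetic. The function name endpoint_probabilities is illustrative, and the exact rules on your nodes depend on the kube-proxy version and configuration.

```shell
# endpoint_probabilities N
# Prints the per-rule match probability for a Service with N endpoints,
# following the 1/N, 1/(N-1), ..., 1 pattern.
endpoint_probabilities() {
    awk -v n="$1" 'BEGIN {
        for (i = 0; i < n; i++) {
            printf "rule %d: probability %.5f\n", i + 1, 1 / (n - i)
        }
    }'
}

endpoint_probabilities 3
# rule 1: probability 0.33333
# rule 2: probability 0.50000
# rule 3: probability 1.00000
```

For three endpoints the overall shares work out to 1/3 each: the first rule takes a third of the traffic, the second takes half of the remaining two thirds, and the third takes whatever is left.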

# kube-proxy modes:

# 1. iptables mode (default):
# Creates iptables rules for each Service → endpoint mapping
# Packets are redirected at kernel level (very fast, no userspace)
sudo iptables -t nat -L KUBE-SERVICES | head -20
# Chain KUBE-SERVICES (2 references)
# target                     prot opt source       destination
# KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere     10.96.0.1    /* default/kubernetes:https cluster IP */ tcp dpt:https
# KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere     10.96.0.10   /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
# KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  anywhere     10.96.0.10   /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
# KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere     10.96.0.10   /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
# KUBE-NODEPORTS             all  --  anywhere     anywhere     /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
#
# Each KUBE-SVC-* chain contains the actual load-balancing rules
# that distribute traffic to pod endpoints.

# 2. IPVS mode (better for large clusters):
# Uses Linux IPVS (IP Virtual Server) for load balancing
# Supports more algorithms: round-robin, least-connection, weighted
# Scales better with thousands of Services (hash-table lookups instead of sequential rule matching)
#
# Note: If your cluster uses iptables mode (the default), ipvsadm
# will show empty output — this is normal:
ipvsadm -Ln
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# (empty — cluster is using iptables mode, not IPVS)
#
# With IPVS mode enabled, you'd see:
# TCP  10.96.0.1:443 rr
#   -> <control-plane-node-IP>:6443   Masq  1  0  0   (the API Server uses host networking, so this is a node IP, not a pod IP)
# TCP  10.96.0.10:53 rr
#   -> 10.244.0.2:53      Masq  1  0  0
#   -> 10.244.0.3:53      Masq  1  0  0

# Check kube-proxy mode (run from master):
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
#     mode: ""
#
# mode: "" (empty string) means kube-proxy uses the DEFAULT mode,
# which is iptables. Possible values:
#   ""          → iptables (default since Kubernetes 1.2)
#   "iptables"  → explicit iptables mode
#   "ipvs"      → IPVS mode (scales better with thousands of Services)
#   "nftables"  → nftables mode (alpha in 1.29, beta in 1.31, GA in 1.33; intended successor to iptables mode)
#
# Key fields in the kube-proxy ConfigMap (config.conf):
#   clusterCIDR: 10.244.0.0/16      # Pod network range
#   iptables.syncPeriod: 0s          # How often rules are refreshed (0 = default 30s)
#   ipvs.scheduler: ""               # Load balancing algorithm (rr, lc, wrr, etc.)
#   ipvs.strictARP: false            # Must be true for MetalLB with IPVS
#   conntrack.maxPerCore: null        # Max tracked connections per CPU core (null = default 32768)
#
# The ConfigMap also contains kubeconfig.conf pointing to the API server:
#   server: https://k8s-master.lab.example.com:6443
#
# Note: kubectl only works where a valid kubeconfig is available.
# On worker nodes without ~/.kube/config, you'll get:
#   "The connection to the server localhost:8080 was refused"
# Solution: run kubectl on the master, or copy a kubeconfig to the worker
# (admin.conf grants full cluster-admin access, so treat it as a secret).
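
Returning to iptables mode: each KUBE-SVC-* chain holds one rule per endpoint, and every rule except the last carries a `statistic --mode random` match. Rules are tried in order, so with n endpoints rule i (counting from 0) must match with probability 1/(n-i) for every endpoint to receive an equal 1/n share overall. The sketch below prints those per-rule probabilities. The function name is hypothetical, and the output is rounded for readability.

```shell
# Print the per-rule match probability for a Service with N endpoints,
# in the order iptables evaluates them. Each rule only sees the traffic
# that earlier rules did not take, so the probabilities rise toward 1.
svc_rule_probabilities() {
    endpoints=$1
    awk -v n="$endpoints" 'BEGIN {
        for (i = 0; i < n; i++) {
            # The last rule has no statistic match: it takes whatever is left.
            if (i == n - 1) printf "rule %d: 1.00000 (unconditional)\n", i + 1
            else            printf "rule %d: %.5f\n", i + 1, 1 / (n - i)
        }
    }'
}

svc_rule_probabilities 3
# rule 1: 0.33333
# rule 2: 0.50000
# rule 3: 1.00000 (unconditional)
```

To compare against a live node, list one of the KUBE-SVC-* chains from the output above with `iptables -t nat -L <chain> -n` and look for the `statistic mode random probability` values. A Service with two endpoints should show a value close to 0.5 on its first rule.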

Container Runtime

The container runtime is responsible for pulling images, creating containers, and managing their lifecycle. Kubernetes communicates with it through the Container Runtime Interface (CRI) — an abstraction that allows different runtimes to be plugged in.

| Runtime | CRI support | Use case | Notes |
|---|---|---|---|
| containerd | Yes (native) | Standard production runtime | Default for most distributions |
| CRI-O | Yes (native) | Lightweight, OCI-focused | Popular with OpenShift |
| Docker Engine | Only through the external cri-dockerd adapter (dockershim removed in 1.24) | Development only | No longer supported natively in K8s |
| gVisor (runsc) | Yes (via containerd) | Security sandbox | Kernel syscall interception |
| Kata Containers | Yes (via containerd) | VM-level isolation | Lightweight VMs per pod |

# Check which container runtime a cluster is using:
kubectl get nodes -o wide
# NAME       STATUS   ROLES    VERSION   CONTAINER-RUNTIME
# worker-1   Ready    <none>   v1.30.0   containerd://1.7.13
# worker-2   Ready    <none>   v1.30.0   containerd://1.7.13

# crictl: CRI-compatible CLI (works with containerd and CRI-O):
crictl ps                      # List running containers
crictl images                  # List images on node
crictl inspect <container-id>  # Container details
crictl logs <container-id>     # Container logs

# Check containerd status:
systemctl status containerd
crictl info | head -20
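
In a cluster that mixes runtimes, it can help to flag the nodes that differ from the one you expect. The sketch below is one possible way to do that: it reads "node runtime://version" pairs on standard input and prints the exceptions. The function name and the sample data (including a CRI-O node) are made up for illustration.

```shell
# Read "NODE RUNTIME://VERSION" pairs on stdin and print every node whose
# runtime name differs from the expected one.
nodes_not_running() {
    expected=$1
    while read -r node runtime; do
        [ -n "$node" ] || continue
        name=${runtime%%://*}        # text before "://", e.g. containerd
        version=${runtime#*://}      # text after "://", e.g. 1.7.13
        if [ "$name" != "$expected" ]; then
            printf '%s uses %s (%s)\n' "$node" "$name" "$version"
        fi
    done
}

# Hypothetical sample input:
nodes_not_running containerd <<'EOF'
worker-1 containerd://1.7.13
worker-2 cri-o://1.30.0
EOF
# worker-2 uses cri-o (1.30.0)
```

Against a real cluster, the input can come from `kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'` piped into the function.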

Component Communication Flow

Pod Creation Lifecycle

When you run kubectl apply -f deployment.yaml, here's the complete sequence of events across all components:

Complete Pod Creation Flow
sequenceDiagram
    participant U as User (kubectl)
    participant API as API Server
    participant ETCD as etcd
    participant DC as Deployment Controller
    participant RC as ReplicaSet Controller
    participant S as Scheduler
    participant KL as kubelet (Worker)
    participant CR as Container Runtime
    
    U->>API: POST /apis/apps/v1/namespaces/default/deployments
    API->>API: Authenticate + Authorise + Admit
    API->>ETCD: Store Deployment object
    API->>U: 201 Created
    
    DC->>API: Watch detects new Deployment
    DC->>API: Create ReplicaSet
    API->>ETCD: Store ReplicaSet
    
    RC->>API: Watch detects new ReplicaSet
    RC->>API: Create Pod (nodeName empty)
    API->>ETCD: Store Pod
    
    S->>API: Watch detects unscheduled Pod
    S->>S: Filter + Score nodes
    S->>API: Bind Pod to worker-2
    API->>ETCD: Update Pod.spec.nodeName
    
    KL->>API: Watch detects Pod assigned to me
    KL->>CR: Pull image + Create container
    CR->>KL: Container started
    KL->>API: Update Pod status: Running
    API->>ETCD: Store updated status
                            
Key Observation: No control plane component talks to another directly. Every interaction goes through the API Server, which is also the only component that reads or writes etcd. (The kubelet's calls to the container runtime stay local to its node.) This "hub and spoke" pattern simplifies security (one endpoint to secure), enables audit logging (all actions pass through one point), and decouples components (any of them can be replaced independently).

Watch Mechanism

Controllers don't poll the API Server. Instead, they establish long-lived watch connections. When an object changes, the API Server pushes the event to every client watching that resource type. This is efficient and enables near-instant reactions:

# How watches work:
# 1. Controller opens HTTP connection: GET /api/v1/pods?watch=true
# 2. API Server keeps connection open
# 3. When a pod changes, API Server sends event over the connection:
#    {"type": "MODIFIED", "object": {"kind": "Pod", ...}}
# 4. Controller processes the event and reconciles

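# Watch event types:
# ADDED    - new object created
# MODIFIED - existing object updated
# DELETED  - object removed
# BOOKMARK - progress marker carrying only a newer resourceVersion (opt-in)
# ERROR    - the watch failed (for example, the resourceVersion is too old)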

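# Resource versions keep a watcher from missing events:
# If the connection drops, the controller reconnects with the last resourceVersion it saw
# and the API Server replays the changes since that version.
# If that version is no longer retained (410 Gone), the controller must re-list and start a new watch.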

# See watches in action:
kubectl get pods --watch -v=7
# I0514 10:23:45.123456   GET https://api:6443/api/v1/pods?watch=true
# I0514 10:23:45.234567   Response Status: 200 OK
# (stream of events follows...)
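
To make the resourceVersion bookkeeping concrete, here is a minimal sketch of a watch consumer. It assumes the stream delivers one JSON event per line, and it extracts fields with sed purely for illustration. A real client would use a JSON parser or a client library. The function name is hypothetical.

```shell
# Consume a watch stream (one JSON event per line), print each event's
# type and resourceVersion, and remember the last version seen: that is
# the value a client sends when it has to reconnect.
track_watch() {
    last_rv=""
    while IFS= read -r event; do
        # Event type sits at the start of the line: {"type":"ADDED",...
        type=$(printf '%s\n' "$event" | sed -n 's/^{"type": *"\([A-Z]*\)".*/\1/p')
        # resourceVersion is an opaque string, so capture it verbatim.
        rv=$(printf '%s\n' "$event" | sed -n 's/.*"resourceVersion": *"\([^"]*\)".*/\1/p')
        if [ -n "$rv" ]; then
            last_rv=$rv
        fi
        printf '%s rv=%s\n' "$type" "$rv"
    done
    printf 'resume with: ?watch=true&resourceVersion=%s\n' "$last_rv"
}
```

To try it against a live cluster, one option is to run `kubectl proxy --port=8001 &` and then `curl -sN 'http://127.0.0.1:8001/api/v1/namespaces/default/pods?watch=true' | track_watch`. The resume line only appears once the stream ends.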

Exercises

Exercise 1 — Component Identification: Using kubectl get pods -n kube-system, identify every control plane component running in your cluster. For each, explain: (a) what it does, (b) what happens if it fails, and (c) how it recovers.
Exercise 2 — Pod Creation Trace: Create a Deployment with 2 replicas and use kubectl get events --sort-by=.metadata.creationTimestamp to trace the complete creation flow. Map each event to the component that generated it (scheduler, kubelet, controller, etc.).
Exercise 3 — Failure Simulation: If you have a multi-node cluster: (a) Stop kubelet on a worker node. What happens to pods? How long before they're rescheduled? (b) Restart kubelet. What happens to the rescheduled pods? Do they move back?
Exercise 4 — Architecture Diagram: Draw a complete architecture diagram of a 3-master, 5-worker production cluster. Label: etcd cluster (3 or 5 members?), API Server load balancer, which components run where, and all communication paths. Include the CNI plugin and kube-proxy.

Conclusion

Kubernetes architecture implements every distributed systems principle we've covered:

  • Consensus (Part 2): etcd uses Raft for consistent state storage
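  • CAP (Part 3): Kubernetes favours consistency: by default the API Server serves linearizable reads from etcd, and a partition that costs etcd its quorum blocks writes rather than allowing divergent state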
  • Service Discovery (Part 4): CoreDNS + Services provide automatic discovery
  • Self-Healing (Part 5): Controllers continuously reconcile desired vs actual state

In Part 7, we'll explore the Kubernetes Object Model — Pods, ReplicaSets, Deployments, Services, ConfigMaps, and Secrets — with hands-on exercises for each object type so you can practice every concept on the cluster you just built.