Kubernetes Architecture - Part 6

This article takes a build-first approach: we'll set up a working Kubernetes cluster, then explore how each component works. By having a live environment, you can verify every concept hands-on as you read. The complete "Hello Kubernetes" exercise that demonstrates all features together is in Part 7, which pairs theory with practice for each object type.

Cluster Setup & Installation

kubeadm

kubeadm is the standard tool for bootstrapping Kubernetes clusters. It handles the complex process of generating certificates, configuring etcd, starting control plane components, and creating the token for worker nodes to join.

# Bootstrap a Kubernetes cluster with kubeadm:

# STEP 1: Install prerequisites (run on ALL nodes — master + workers)

# 1a. Disable swap (required by kubelet)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# 1b. Load required kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# 1c. Set required sysctl parameters
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system

# 1d. Install containerd (container runtime)
sudo apt-get update
sudo apt-get install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
# Enable SystemdCgroup (kubeadm sets the kubelet's cgroup driver to systemd, so containerd must match)
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable containerd

# 1e. Install kubeadm, kubelet, kubectl
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p /etc/apt/keyrings

# Add Kubernetes signing key + apt repository (v1.30)
# Method A — Import from Release.key URL (works when key matches repo):
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | \
  sudo gpg --dearmor --yes -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

# Method B — If Method A fails with "NO_PUBKEY" error during apt-get update,
# the repo signing key has rotated. Import directly from a keyserver instead:
#   sudo rm -f /etc/apt/keyrings/kubernetes-apt-keyring.gpg
#   sudo gpg --no-default-keyring \
#     --keyring gnupg-ring:/etc/apt/keyrings/kubernetes-apt-keyring.gpg \
#     --keyserver hkps://keyserver.ubuntu.com \
#     --recv-keys 234654DA9A296436
#   sudo chmod 644 /etc/apt/keyrings/kubernetes-apt-keyring.gpg
#   sudo rm -f /etc/apt/keyrings/kubernetes-apt-keyring.gpg~

echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /' | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

                            Troubleshooting GPG Key Errors: If apt-get update fails with NO_PUBKEY 234654DA9A296436, it means the signing key in the Release.key URL is stale (Kubernetes periodically rotates keys). The fix is to use Method B above — import the key directly from Ubuntu's keyserver using the gnupg-ring: prefix. This creates the keyring in the legacy format that apt can read. The chmod 644 is required because GPG creates files with 600 permissions but apt needs read access.
                        

# STEP 2: Initialise the control plane (master node only)
# Run exactly ONE of the two options below, not both.
#
# Option A: Single-node / learning cluster (simplest):
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16

# Option B: Multi-master HA cluster (production):
# --control-plane-endpoint should point to a load balancer FQDN
# or a DNS name that resolves to your API server(s).
# If you have no load balancer yet but may add control-plane nodes later,
# you can point it at this machine's IP or hostname for now:
#   --control-plane-endpoint="$(hostname -I | awk '{print $1}'):6443"
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --control-plane-endpoint="k8s-api.example.com:6443" \
  --upload-certs

                            Hostname Resolution: If hostname -f fails with "Name or service not known", add your hostname to /etc/hosts first: echo "127.0.0.1 $(hostname)" | sudo tee -a /etc/hosts (a plain >> redirect fails for non-root users because the shell, not sudo, opens the file). kubeadm needs the node's hostname to resolve. The --control-plane-endpoint flag is only required for HA setups where multiple control planes share a load balancer; for a single-node cluster that will never grow, omit it.
                        

# Output from kubeadm init (annotated):
# [init] Using Kubernetes version: v1.30.14
# [preflight] Running pre-flight checks
# [preflight] Pulling images required for setting up a Kubernetes cluster
# [certs] Generating "ca" certificate and key
# [certs] Generating "apiserver" certificate and key
# [certs] apiserver serving cert is signed for DNS names
#         [kubernetes kubernetes.default kubernetes.default.svc
#          kubernetes.default.svc.cluster.local your-hostname]
#         and IPs [10.96.0.1 <your-node-ip>]
# [certs] Generating etcd/ca, etcd/server, etcd/peer, front-proxy certs...
# [kubeconfig] Writing admin.conf, kubelet.conf, controller-manager.conf, scheduler.conf
# [etcd] Creating static Pod manifest for local etcd
# [control-plane] Creating static Pod manifests for apiserver, controller-manager, scheduler
# [kubelet-start] Starting the kubelet
# [kubelet-check] The kubelet is healthy after ~1s
# [api-check] The API server is healthy after ~6s
# [upload-config] Storing configuration in ConfigMap "kubeadm-config"
# [upload-certs] Storing certificates in Secret "kubeadm-certs"
# [mark-control-plane] Adding labels and taints to control-plane node
# [bootstrap-token] Creating token for node joins
# [addons] Applied essential addon: CoreDNS
# [addons] Applied essential addon: kube-proxy
#
# ✅ Your Kubernetes control-plane has initialized successfully!

# STEP 3: Configure kubectl (master node)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# (Or if root: export KUBECONFIG=/etc/kubernetes/admin.conf)

# STEP 4: Install CNI plugin (networking)
# Without a CNI, pods can't communicate and nodes stay NotReady
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml

# STEP 5: Join worker nodes
# kubeadm init outputs a join command — run it on each worker:
sudo kubeadm join your-master:6443 \
  --token <token-from-init-output> \
  --discovery-token-ca-cert-hash sha256:<hash-from-init-output>

# For additional control-plane nodes (HA), add --control-plane --certificate-key:
# sudo kubeadm join your-master:6443 --token <token> \
#   --discovery-token-ca-cert-hash sha256:<hash> \
#   --control-plane --certificate-key <cert-key-from-init-output>

# If you lost the join command, or the token has expired (default lifetime: 24 hours),
# generate a new one on the control-plane node:
sudo kubeadm token create --print-join-command

# Output from kubeadm join (run on each worker node):
# [preflight] Running pre-flight checks
# [preflight] Reading configuration from the cluster...
# [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
# [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
# [kubelet-start] Starting the kubelet
# [kubelet-check] The kubelet is healthy after ~500ms
# [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap
#
# ✅ This node has joined the cluster:
# * Certificate signing request was sent to apiserver and a response was received.
# * The Kubelet was informed of the new secure connection details.

# Verify cluster (run on master):
kubectl get nodes
# NAME         STATUS     ROLES           AGE    VERSION
# master-1     Ready      control-plane   55m    v1.30.14
# worker-1     Ready      <none>          2m     v1.30.14
# worker-2     NotReady   <none>          30s    v1.30.14
#
# Note: Workers show "NotReady" for 30-90 seconds while the CNI
# plugin initialises networking. This is normal — wait and re-check:
kubectl get nodes
# NAME         STATUS   ROLES           AGE    VERSION
# master-1     Ready    control-plane   55m    v1.30.14
# worker-1     Ready    <none>          2m     v1.30.14
# worker-2     Ready    <none>          97s    v1.30.14

                            What kubeadm init Does Under the Hood: It generates a full PKI (separate CAs for the cluster, etcd, and the front proxy, plus the certificates they sign), writes kubeconfig files for each component, creates static pod manifests in /etc/kubernetes/manifests/ for etcd + apiserver + controller-manager + scheduler, starts kubelet which launches those static pods, waits for the API server to be healthy, uploads config to ConfigMaps, creates bootstrap tokens, and installs CoreDNS + kube-proxy addons. Once the images are pulled, the whole sequence usually finishes in seconds.
                        

Single-Node Alternative

For a single-node learning cluster, remove the control-plane taint so pods can schedule on the master:

# SINGLE-NODE SETUP: Allow pods to run on the control-plane node
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

# Verify the taint is gone:
kubectl describe node | grep Taints
# Taints: <none>
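
Why removing the taint matters can be sketched in a few lines of Python. This is a simplified model of taint and toleration matching, not the scheduler's actual code: the dictionaries and the `tolerates` and `schedulable` helpers are illustrative names, and only the NoSchedule effect is modelled.

```python
# Simplified model of taint/toleration matching (illustrative, NoSchedule only).
CONTROL_PLANE_TAINT = {
    "key": "node-role.kubernetes.io/control-plane",
    "value": "",
    "effect": "NoSchedule",
}

def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return toleration.get("key", "") in ("", taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))

def schedulable(pod_tolerations, node_taints):
    """A NoSchedule taint blocks the pod unless some toleration matches it."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )

plain_pod = []  # ordinary workloads carry no tolerations
system_pod = [{"key": "node-role.kubernetes.io/control-plane",
               "operator": "Exists", "effect": "NoSchedule"}]

print(schedulable(plain_pod, [CONTROL_PLANE_TAINT]))   # False: blocked by the taint
print(schedulable(system_pod, [CONTROL_PLANE_TAINT]))  # True: toleration matches
print(schedulable(plain_pod, []))                      # True: taint removed
```

The last line corresponds to the single-node setup: once the taint is gone, ordinary pods need no toleration to land on the control-plane node.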

Managed vs Self-Managed Kubernetes

Aspect	Managed (EKS/GKE/AKS)	Self-Managed (kubeadm/k3s)
Control plane	Provider manages (HA, upgrades, etcd)	You manage everything
etcd	Hidden, auto-backed up	You must back up and maintain
Upgrades	One-click or automatic	Manual (kubeadm upgrade)
Networking	Pre-integrated CNI	Install and configure CNI yourself
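Cost	Per-cluster control plane fee on most providers (around $0.10/hour, roughly $73/month; some tiers are free) plus node costs	No control plane fee; you pay only for the nodes that host it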
Best for	Production workloads, teams without deep K8s ops expertise	Learning, edge/IoT, air-gapped, extreme customisation

Lightweight Distributions

K3s, MicroK8s, Kind, and Minikube

For development and edge computing, lightweight distributions strip down Kubernetes:

K3s (Rancher): Single binary (~60MB) that uses SQLite instead of etcd as its default datastore (embedded etcd is available for multi-server HA). Ideal for edge/IoT and development; production-ready for small clusters.
MicroK8s (Canonical): Snap-based, single-node or multi-node, built-in addons (Istio, Prometheus). Great for developers on Ubuntu.
Kind (Kubernetes in Docker): Runs cluster nodes as Docker containers. Perfect for CI/CD testing and local development. Not for production.
Minikube: Single-node cluster in a VM or container. Focused on local development with easy addon management.

Going Deeper — Local Dev Tracks: This section introduces Minikube and Kind conceptually. For full hands-on walkthroughs — install, addons, multi-node clusters, and CI integration — see the dedicated tracks in this series: Minikube Track → and Kind Track →.


                            Checkpoint: You now have a running Kubernetes cluster. In the sections below, we'll explore what each component does — and you can verify every concept by running commands against your live cluster.
                        

The Kubernetes Mental Model

Now that you have a running cluster, let's explore the mental model that makes Kubernetes tick. Start by looking at what's already running:

# See every component Kubernetes installed automatically:
kubectl get pods -n kube-system
# NAME                                       READY   STATUS    AGE
# calico-kube-controllers-...                1/1     Running   ...
# calico-node-...                            1/1     Running   ...
# coredns-...                                1/1     Running   ...
# etcd-master-1                              1/1     Running   ...
# kube-apiserver-master-1                    1/1     Running   ...
# kube-controller-manager-master-1           1/1     Running   ...
# kube-proxy-...                             1/1     Running   ...
# kube-scheduler-master-1                    1/1     Running   ...

# Every component you see here is explained in the sections below.

Declarative Reconciliation

Every concept from Parts 1–5 — consensus, replication, service discovery, resilience — converges in Kubernetes. But Kubernetes adds one powerful abstraction that makes it all manageable: declarative reconciliation.

                            
                            The Core Insight: You don't tell Kubernetes how to do things. You tell it what you want, and it figures out how to get there — and how to stay there. This is the difference between imperative ("create 3 pods, put them on nodes 1, 2, and 3") and declarative ("I want 3 replicas of this application running at all times").
                        

Kubernetes Reconciliation Model

flowchart TD
    A[User Submits Desired State] --> B[API Server Stores in etcd]
    B --> C[Controllers Watch for Changes]
    C --> D{Desired == Actual?}
    D -->|Yes| E[No action needed]
    D -->|No| F[Controller takes corrective action]
    F --> G[Actual state moves toward desired]
    G --> C
    E --> C

This model is fundamentally different from traditional infrastructure management:

Aspect	Imperative (Traditional)	Declarative (Kubernetes)
Instructions	"Create VM, install nginx, start service"	"3 nginx pods should be running"
Failure handling	Manual detection, manual fix	Auto-detected, auto-fixed
Drift	Accumulates silently	Continuously reconciled
Scaling	"Create 2 more VMs and configure them"	"Change replicas from 3 to 5"
State tracking	CMDB (often outdated)	etcd is single source of truth

Desired State vs Actual State

This is the single most important concept in Kubernetes. Everything else flows from it:

                            
                            Reading YAML — A Quick Primer: Kubernetes resources are defined in YAML files (a human-readable data format). You don't need to memorise the syntax now — Part 7 covers the object model in detail. For now, just understand the structure:
                            apiVersion + kind — what type of resource (e.g., a Deployment)
metadata — the resource's name and labels
spec — your desired state (how many replicas, which container image, etc.)

                            You submit this file with kubectl apply -f filename.yaml. Kubernetes reads it, stores it in etcd, and works to make reality match your declaration.
                        

# This YAML is a declaration of DESIRED STATE:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 3          # "I want 3 pods running at all times"
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"

# When you apply this:
# 1. API Server stores it in etcd
# 2. Deployment controller sees: desired=3, actual=0
# 3. Creates a ReplicaSet
# 4. ReplicaSet controller sees: desired=3, actual=0
# 5. Creates 3 Pod objects
# 6. Scheduler sees 3 unscheduled pods
# 7. Assigns each to a node
# 8. kubelet on each node starts the container

# If a pod crashes:
# 1. ReplicaSet controller sees: desired=3, actual=2
# 2. Creates 1 new Pod
# 3. Scheduler assigns it
# 4. kubelet starts it
# Total recovery time: ~5-15 seconds

# Don't worry about writing YAML yet — Part 7 teaches every field.
# For now, focus on the PATTERN: declare what you want → Kubernetes makes it happen.
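
One detail in the manifest above deserves a closer look: the selector (`app: web`) and the pod template labels (`app: web`) must agree, because the label selector is how the ReplicaSet controller decides which pods count toward "actual". The Python sketch below models that relationship with plain dictionaries; the helper names and the sample pod list are illustrative assumptions.

```python
import copy

def selector_matches_template(deployment):
    """Every key/value in spec.selector.matchLabels must appear in the template's labels."""
    selector = deployment["spec"]["selector"]["matchLabels"]
    labels = deployment["spec"]["template"]["metadata"]["labels"]
    return all(labels.get(k) == v for k, v in selector.items())

def count_matching(pods, selector):
    """Count the pods whose labels satisfy the selector (the 'actual' replica count)."""
    return sum(all(p["labels"].get(k) == v for k, v in selector.items()) for p in pods)

# The Deployment from the YAML above, reduced to the fields used here.
web_server = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web-server"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "web"}},
        "template": {"metadata": {"labels": {"app": "web"}}},
    },
}

# A broken variant: the template label no longer matches the selector.
broken = copy.deepcopy(web_server)
broken["spec"]["template"]["metadata"]["labels"]["app"] = "api"

# Hypothetical pods currently in the namespace.
pods = [
    {"name": "web-1", "labels": {"app": "web"}},
    {"name": "web-2", "labels": {"app": "web"}},
    {"name": "db-1", "labels": {"app": "db"}},
]

print(selector_matches_template(web_server))  # True
print(selector_matches_template(broken))      # False
print(count_matching(pods, {"app": "web"}))   # 2
```

With two matching pods and `replicas: 3`, the controller in this model would see desired=3, actual=2 and create one more pod.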

                            
                            Analogy — The Thermostat: A thermostat is a reconciliation loop. You set desired temperature (21°C). The thermostat constantly observes actual temperature. If actual < desired, it turns on heating. If actual > desired, it turns on cooling. Kubernetes controllers work exactly the same way — but for infrastructure instead of temperature.
                        

Control Plane Components

The control plane is the "brain" of the cluster. It makes global decisions about scheduling, detects failures, and maintains desired state. In production, control plane components run on dedicated nodes (often 3 or 5 for high availability).

Kubernetes Control Plane Architecture

flowchart LR
    subgraph CP[Control Plane]
        direction TB
        ETCD[(etcd<br/>State Store)]
        API[API Server<br/>Central Hub]
        SCHED[Scheduler<br/>Pod Placement]
        CM[Controller Manager<br/>Reconciliation Loops]
        CCM[Cloud Controller<br/>Provider Integration]

        API <-->|read/write state| ETCD
        SCHED -->|watch unscheduled pods| API
        CM -->|watch & reconcile| API
        CCM -->|cloud resources| API
    end

    subgraph W1[Worker Node 1]
        direction TB
        K1[kubelet<br/>Pod Lifecycle]
        KP1[kube-proxy<br/>Network Rules]
    end

    subgraph W2[Worker Node 2]
        direction TB
        K2[kubelet<br/>Pod Lifecycle]
        KP2[kube-proxy<br/>Network Rules]
    end

    K1 -->|report status & watch pods| API
    KP1 -->|watch Services & Endpoints| API
    K2 -->|report status & watch pods| API
    KP2 -->|watch Services & Endpoints| API

API Server (kube-apiserver)

The API Server is the front door to everything in Kubernetes. Every component — kubectl, controllers, kubelets, external tools — communicates exclusively through the API Server. Nothing talks to etcd directly except the API Server.

# The API Server exposes a RESTful API over HTTPS:
# Every Kubernetes operation is an API call

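# List pods (GET request to /api/v1/namespaces/default/pods)
kubectl get pods
# Same call without kubectl's table formatting:
#   kubectl get --raw /api/v1/namespaces/default/pods
# Roughly equivalent curl (also needs credentials, e.g. a bearer token or client cert):
#   curl -k https://api-server:6443/api/v1/namespaces/default/pods

# Create a pod (kubectl create always POSTs; kubectl apply POSTs only when
# the object does not exist yet and PATCHes it otherwise)
kubectl apply -f pod.yaml
# Roughly equivalent: curl -X POST -H "Content-Type: application/yaml" \
#   --data-binary @pod.yaml https://api-server:6443/api/v1/namespaces/default/pods

# Watch for changes (long-lived HTTP connection with chunked responses)
kubectl get pods --watch
# Roughly equivalent: curl "https://api-server:6443/api/v1/namespaces/default/pods?watch=true"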

# Explore the API directly:
kubectl api-resources          # List all resource types
kubectl api-versions           # List all API versions
kubectl explain deployment     # Show schema for a resource type
kubectl explain deployment.spec.template.spec.containers

# ╔══════════════════════════════════════════════════════════╗
# ║  🔧 TRY IT: Run these on your cluster right now:       ║
# ║  kubectl api-resources | head -20                      ║
# ║  kubectl get pods -n kube-system -l component=kube-apiserver  ║
# ╚══════════════════════════════════════════════════════════╝
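
The URLs in the curl comments above follow a regular pattern: core-group resources such as Pods live under /api/v1, while resources in named groups such as apps live under /apis/<group>/<version>. The following Python sketch builds such paths; the `resource_path` helper is a hypothetical name used only for illustration.

```python
def resource_path(resource, namespace=None, group="", version="v1", name=None):
    """Build the REST path for a Kubernetes resource.

    Core-group resources live under /api/<version>; all other groups
    live under /apis/<group>/<version>.
    """
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace is not None:
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name is not None:
        parts.append(name)
    return "/".join(parts)

print(resource_path("pods", namespace="default"))
# /api/v1/namespaces/default/pods
print(resource_path("deployments", namespace="default", group="apps", name="web-server"))
# /apis/apps/v1/namespaces/default/deployments/web-server
print(resource_path("nodes"))
# /api/v1/nodes  (cluster-scoped, so no namespace segment)
```

Any of these paths can be passed to `kubectl get --raw` to see the raw JSON the API Server returns.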

Key responsibilities of the API Server:

Authentication: Verifies identity (certificates, tokens, OIDC)
Authorisation: Checks permissions (RBAC — can this user create pods?)
Admission Control: Validates and mutates requests (resource quotas, default values, policy enforcement)
Persistence: Stores validated objects in etcd
Watch notifications: Notifies controllers of state changes
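
The first four responsibilities run as ordered stages, and a request is persisted only if every stage passes. The Python sketch below models that ordering with a plain dictionary standing in for etcd. Everything in it is a simplifying assumption: the stage functions, the hard-coded permission set, and the `restartPolicy` default stand in for real authenticators, RBAC rules, and admission plugins.

```python
class Rejected(Exception):
    """Raised when a stage refuses the request."""

def authenticate(req):
    if req.get("user") is None:
        raise Rejected("401 Unauthorized")

def authorize(req):
    allowed = {("alice", "create", "pods")}  # hypothetical stand-in for RBAC rules
    if (req["user"], req["verb"], req["resource"]) not in allowed:
        raise Rejected("403 Forbidden")

def mutate(req):
    # Mutation step: fill in a default the client omitted.
    req["object"].setdefault("restartPolicy", "Always")

def validate(req):
    # Validation step: reject objects missing a required field.
    if "image" not in req["object"]:
        raise Rejected("422 Unprocessable Entity")

def handle(req, store):
    """Run the stages in order; persist only if every stage passes."""
    try:
        for stage in (authenticate, authorize, mutate, validate):
            stage(req)
    except Rejected as err:
        return str(err)
    store[req["object"]["name"]] = req["object"]  # stand-in for the etcd write
    return "201 Created"

store = {}
ok = {"user": "alice", "verb": "create", "resource": "pods",
      "object": {"name": "web", "image": "nginx:1.25"}}
print(handle(ok, store))                # 201 Created
print(store["web"]["restartPolicy"])    # Always (added by the mutation step)
print(handle({**ok, "user": None}, store))   # 401 Unauthorized
print(handle({**ok, "user": "bob"}, store))  # 403 Forbidden
```

Because mutation runs before validation, validators always see the object in its final, defaulted form.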

API Server Request Processing Pipeline

flowchart TD
    A["`**Client Request**
    kubectl apply -f deploy.yaml`"] --> B

    subgraph AUTH[Identity & Access]
        direction LR
        B[Authentication<br/>Who are you?] --> C[Authorization<br/>Are you allowed?]
    end

    C --> D

    subgraph ADM[Admission Control]
        direction LR
        D[Mutating Webhooks<br/>Modify the request] --> E[Schema Validation<br/>Is it well-formed?] --> F[Validating Webhooks<br/>Policy checks]
    end

    F --> G[Persist to etcd]
    G --> H["`**Response to Client**
    201 Created`"]

etcd

etcd is the distributed key-value store that holds all cluster state. It's the single source of truth — if etcd is lost and unrecoverable, the cluster is gone. It uses the Raft consensus algorithm (which we studied in Part 2) to maintain consistency across replicas.

# What etcd stores (all Kubernetes objects as key-value pairs):
# Key: /registry/pods/default/my-pod
# Value: serialized Pod object (protobuf for built-in types, JSON for custom resources)

# ╔══════════════════════════════════════════════════════════╗
# ║  🔧 TRY IT: Find YOUR etcd endpoints and certs first: ║
# ╚══════════════════════════════════════════════════════════╝

# Step 1: Get your etcd pod name (it varies per cluster!):
kubectl get pods -n kube-system -l component=etcd
# NAME                                    READY   STATUS    AGE
# etcd-k8s-master.lab.example.com         1/1     Running   10h

# Step 2: Store the pod name in a variable for reuse:
ETCD_POD=$(kubectl get pods -n kube-system -l component=etcd -o jsonpath='{.items[0].metadata.name}')
echo $ETCD_POD
# etcd-k8s-master.lab.example.com

# Step 3: Extract the advertise URL (your actual etcd endpoint):
kubectl describe pod $ETCD_POD -n kube-system | grep -- --advertise-client-urls
# --advertise-client-urls=https://10.42.38.10:2379

# Step 4: Use kubectl exec to run etcdctl INSIDE the etcd pod
# (certs are already mounted inside — no host path issues):
kubectl exec -n kube-system $ETCD_POD -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Output:
# https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 11.16ms

# Member list (shows all etcd nodes in an HA cluster):
kubectl exec -n kube-system $ETCD_POD -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list --write-out=table

# Single-node cluster output:
# +------------------+---------+------------------------------+--------------------------+--------------------------+------------+
# |        ID        | STATUS  |             NAME             |       PEER ADDRS         |      CLIENT ADDRS        | IS LEARNER |
# +------------------+---------+------------------------------+--------------------------+--------------------------+------------+
# | e179b74d7e9a5155 | started | k8s-master.lab.example.com   | https://10.42.38.10:2380 | https://10.42.38.10:2379 |      false |
# +------------------+---------+------------------------------+--------------------------+--------------------------+------------+
#
# Multi-node HA cluster output (3 voting members):
# | 8e9e05c52164694d | started | master-1 | https://192.168.1.100:2380 | https://192.168.1.100:2379 | false |
# | 91bc3c398fb3c146 | started | master-2 | https://192.168.1.101:2380 | https://192.168.1.101:2379 | false |
# | fd422379fda50e48 | started | master-3 | https://192.168.1.102:2380 | https://192.168.1.102:2379 | false |
#
# IS LEARNER column:
#   false = Full voting member (participates in Raft quorum)
#   true  = Learner/non-voting member (receives log replication but
#           does NOT vote in elections or count toward quorum).
#           Used when adding a new node — it catches up on data first,
#           then gets promoted to voter with: etcdctl member promote <memberID>
#           This prevents a slow new node from disrupting cluster consensus.

# Backup etcd (CRITICAL for disaster recovery):
kubectl exec -n kube-system $ETCD_POD -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/lib/etcd/snapshot.db

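# Copy the snapshot to your local machine. kubectl cp needs tar inside the
# container; if the etcd image lacks it, the file is still on the control-plane
# node's disk, because kubeadm mounts /var/lib/etcd from the host:
kubectl cp kube-system/$ETCD_POD:/var/lib/etcd/snapshot.db ./etcd-backup.db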

                            
                            Production Critical: etcd is the most important component in a Kubernetes cluster. If etcd is unavailable, the control plane stops working: no API writes, no new scheduling, no reconciliation. Pods that are already running keep running, but nothing can change. Always run etcd with 3 or 5 members (an odd count, for quorum), use fast SSDs (etcd is latency-sensitive; aim for fsync latency under 10ms), and take regular snapshots for disaster recovery.
                        

etcd Property	Value	Why It Matters
Consensus	Raft (leader-based)	Strong consistency, linearizable reads
Quorum (3 nodes)	2 of 3 must agree	Survives 1 node failure
Quorum (5 nodes)	3 of 5 must agree	Survives 2 node failures
Storage limit	2 GB default quota (8 GB suggested maximum)	Compaction and defragmentation required to reclaim space
Disk requirement	<10ms fsync	Slow disk = slow cluster
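
The quorum rows in the table follow from simple majority arithmetic, sketched below in Python. The function names are illustrative.

```python
def quorum(members):
    """Smallest majority of a Raft cluster with the given number of voting members."""
    return members // 2 + 1

def failures_tolerated(members):
    """How many voting members can fail while a majority remains."""
    return members - quorum(members)

for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 4 3 1
# 5 3 2
```

Going from 3 to 4 members raises the quorum without raising the number of failures the cluster survives, which is why odd member counts are the usual recommendation.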

Scheduler (kube-scheduler)

The Scheduler watches for newly created Pods that have no node assigned and selects a suitable node for each one. It doesn't run the pod — it just decides where it should go.

Scheduler Decision Pipeline

flowchart LR
    A[New Pod<br/>No Node Assigned] --> B[Filtering Phase<br/>Which nodes CAN run it?]
    B --> C[Scoring Phase<br/>Which node is BEST?]
    C --> D[Binding<br/>Assign pod to winner]
    
    B -->|Excludes| E[Insufficient CPU/Memory]
    B -->|Excludes| F[Taints not tolerated]
    B -->|Excludes| G[Affinity violated]

The scheduler works in three phases every time it sees an unassigned pod:

1. Filtering — Which nodes can run this pod?

Eliminates nodes that violate hard constraints:

Resources: Does the node have enough free CPU/memory for the pod's requests?
Ports: Is the required hostPort already taken?
Node selectors: Does the node match nodeSelector or nodeAffinity rules?
Taints: Does the pod tolerate the node's taints (e.g., NoSchedule)?
Volumes: Is the required PersistentVolume available in the node's zone?

2. Scoring — Which surviving node is best?

Ranks each remaining node 0–100 using scoring plugins:

NodeResourcesFit (LeastAllocated strategy, formerly LeastRequestedPriority): Prefer nodes with the most free resources
NodeResourcesBalancedAllocation: Prefer nodes where CPU and memory usage are balanced
InterPodAffinity: Prefer nodes already running pods that this pod has affinity for
ImageLocality: Prefer nodes that already pulled the container image (faster start)
PodTopologySpread: Honour topologySpreadConstraints by spreading pods evenly across zones/nodes

3. Binding — Assign the pod to the winner

The scheduler updates Pod.spec.nodeName in the API Server. The kubelet on that node detects the assignment and starts the container.
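
The three phases can be sketched in Python as a filter, a score, and a pick. This is a deliberately reduced model, assuming only CPU (millicores) and memory (MiB) matter and using a single "most free capacity" score; the real scheduler combines many plugins. The node figures are invented for illustration, and the pod requests match the Deployment shown earlier (100m CPU, 64Mi memory).

```python
def fits(node, pod):
    """Filtering: does the node have enough unreserved CPU and memory?"""
    return node["cpu_free"] >= pod["cpu"] and node["mem_free"] >= pod["mem"]

def score(node, pod):
    """Scoring: 0-100, higher when more capacity would remain free after placement."""
    cpu_left = (node["cpu_free"] - pod["cpu"]) / node["cpu_total"]
    mem_left = (node["mem_free"] - pod["mem"]) / node["mem_total"]
    return round(100 * (cpu_left + mem_left) / 2)

def schedule(nodes, pod):
    """Binding: return the name of the best feasible node, or None if none fits."""
    feasible = [n for n in nodes if fits(n, pod)]
    if not feasible:
        return None  # the pod stays Pending
    return max(feasible, key=lambda n: score(n, pod))["name"]

# Hypothetical cluster state.
nodes = [
    {"name": "worker-1", "cpu_total": 2000, "cpu_free": 50,   "mem_total": 4096, "mem_free": 3000},
    {"name": "worker-2", "cpu_total": 2000, "cpu_free": 1200, "mem_total": 4096, "mem_free": 1024},
    {"name": "worker-3", "cpu_total": 2000, "cpu_free": 1800, "mem_total": 4096, "mem_free": 3072},
]
pod = {"cpu": 100, "mem": 64}

print(schedule(nodes, pod))                        # worker-3
print(schedule(nodes, {"cpu": 4000, "mem": 64}))   # None
```

Here worker-1 is filtered out for insufficient CPU, and worker-3 outscores worker-2 because it would keep more of both resources free.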

# See why a pod isn't scheduling:
kubectl describe pod stuck-pod
# Events:
#   Warning  FailedScheduling  0/5 nodes available:
#   2 Insufficient memory, 3 node(s) had taint {dedicated: gpu}

# See successful scheduler decisions:
kubectl get events --field-selector reason=Scheduled
# Successfully assigned default/web-pod to worker-node-2

Controller Manager (kube-controller-manager)

The Controller Manager runs dozens of independent reconciliation loops (controllers), each responsible for a specific resource type. Each controller watches the API Server for changes and takes action to align actual state with desired state.

Controller	Watches	Reconciles
Deployment	Deployment objects	Creates/updates ReplicaSets for rollouts
ReplicaSet	ReplicaSets + Pods	Maintains desired pod count
Node	Node heartbeats	Marks unresponsive nodes NotReady
Job	Job objects	Ensures pods run to completion
Endpoints	Services + Pods	Updates Service endpoints when pods change
Namespace	Namespace deletions	Cleans up all resources in deleted namespaces
ServiceAccount	Namespace creation	Creates default ServiceAccount per namespace

# All controllers run as goroutines within a single binary:
# kube-controller-manager

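# Check control plane component health (this does not list individual
# controllers, and componentstatuses has been deprecated since v1.19):
kubectl get componentstatuses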
# NAME                 STATUS    MESSAGE
# controller-manager   Healthy   ok
# scheduler            Healthy   ok
# etcd-0               Healthy   {"health":"true","reason":""}

# Controller Manager flags (key configuration):
# --controllers=*                    # Enable all controllers
# --concurrent-deployment-syncs=5    # Parallel deployment reconciliations
# --node-monitor-grace-period=40s    # Time before marking node NotReady
# --pod-eviction-timeout=5m0s        # Legacy eviction delay; current releases use taint-based eviction (default tolerationSeconds: 300)
# --cluster-cidr=10.244.0.0/16      # Pod network range
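
As one concrete example of such a loop, the Node controller's heartbeat check can be modelled as a timestamp comparison against the grace period listed above. The Python sketch below is a simplified illustration: the `node_status` helper is hypothetical, and the returned strings mirror what `kubectl get nodes` displays rather than the underlying condition values.

```python
from datetime import datetime, timedelta

# Value taken from the --node-monitor-grace-period flag shown above.
NODE_MONITOR_GRACE_PERIOD = timedelta(seconds=40)

def node_status(last_heartbeat, now, grace=NODE_MONITOR_GRACE_PERIOD):
    """Return 'Ready' while heartbeats are fresh, 'NotReady' once they go stale."""
    return "Ready" if now - last_heartbeat <= grace else "NotReady"

now = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary fixed clock for the example
print(node_status(now - timedelta(seconds=10), now))  # Ready
print(node_status(now - timedelta(seconds=40), now))  # Ready (exactly at the limit)
print(node_status(now - timedelta(seconds=41), now))  # NotReady
```

A longer grace period tolerates brief network blips at the cost of slower failure detection.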

Cloud Controller Manager

The Cloud Controller Manager connects Kubernetes to the underlying cloud provider (AWS, GCP, Azure). It handles cloud-specific operations that Kubernetes itself doesn't need to know about:

Node Controller: Detects when cloud VMs are deleted, updates node status
Route Controller: Configures cloud network routes for pod communication
Service Controller: Creates cloud load balancers for LoadBalancer-type Services

                            
                            Why Separate? The Cloud Controller Manager was extracted from kube-controller-manager to decouple Kubernetes core from cloud provider code. This allows cloud providers to evolve independently and supports self-hosted Kubernetes (bare metal, on-premises) where no cloud controller is needed.
                        

Worker Node Components

Worker nodes are the machines that actually run your application containers. Each worker node runs three core components: the kubelet, kube-proxy, and a container runtime (containerd in the setup above).

kubelet

The kubelet is the agent that runs on every node, control plane nodes included. It receives pod specifications from the API Server and ensures the described containers are running and healthy. It's the component that actually makes things happen on the physical machine.

# kubelet responsibilities:
# 1. Register the node with the API Server
# 2. Watch API Server for pods assigned to this node
# 3. Pull container images
# 4. Start/stop containers via container runtime (CRI)
# 5. Execute liveness/readiness/startup probes
# 6. Report pod status back to API Server
# 7. Manage volumes (mount/unmount)
# 8. Send node heartbeats (NodeLease)

# Check kubelet status on a node:
systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#      Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: enabled)
#     Drop-In: /usr/lib/systemd/system/kubelet.service.d
#              └─10-kubeadm.conf
#      Active: active (running) since Sun 2026-06-07 14:27:41 UTC; 1 week 0 days ago
#        Docs: https://kubernetes.io/docs/
#    Main PID: 6416 (kubelet)
#       Tasks: 14 (limit: 19093)
#      Memory: 35.4M (peak: 37.1M)
#         CPU: 3h 12min 14.748s
#      CGroup: /system.slice/kubelet.service
#              └─6416 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/boot...
#
# Jun 14 22:24:46 k8s-worker-1.lab.example.com kubelet[6416]: I0614 22:24:46.9...
# Jun 14 22:24:47 k8s-worker-1.lab.example.com kubelet[6416]: I0614 22:24:47.1...

# kubelet logs (last 20 lines; use "journalctl -u kubelet -f" for a live tail):
journalctl -u kubelet -n 20 --no-pager

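# Key kubelet configuration (shown as flags; kubeadm sets the equivalent
# fields, e.g. staticPodPath, in /var/lib/kubelet/config.yaml):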
# --pod-manifest-path=/etc/kubernetes/manifests  # Static pods
# --cluster-dns=10.96.0.10                       # CoreDNS service IP
# --max-pods=110                                 # Max pods per node
# --node-status-update-frequency=10s             # Heartbeat interval
# --eviction-hard=memory.available<100Mi         # Eviction thresholds

# Static pods (managed directly by kubelet, not API Server):
# ON A MASTER NODE:
ls /etc/kubernetes/manifests/
# etcd.yaml
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml
# These are how control plane components run on master nodes!

# ON A WORKER NODE:
ls /etc/kubernetes/manifests/
# (empty — workers have no static pods, only kubelet + kube-proxy)

# ╔══════════════════════════════════════════════════════════╗
# ║  🔧 TRY IT: SSH into a worker node and run:            ║
# ║  systemctl status kubelet                              ║
# ║  ls /etc/kubernetes/manifests/   (empty on workers!)   ║
# ║  crictl ps                       (running containers) ║
# ╚══════════════════════════════════════════════════════════╝
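
The eviction threshold in the configuration list above (`memory.available<100Mi`) reads as: start evicting pods once available memory on the node falls below 100 MiB. The Python sketch below shows that comparison, including the binary suffixes Kubernetes uses for memory quantities. It is a reduced illustration; the helper names are hypothetical, and only the Ki, Mi, and Gi suffixes are handled.

```python
# Binary suffixes used in Kubernetes memory quantities (subset).
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def parse_quantity(text):
    """Convert a binary-suffixed quantity such as '100Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)  # no suffix: plain byte count

def under_memory_pressure(available_bytes, threshold="100Mi"):
    """True when available memory has dropped below the hard eviction threshold."""
    return available_bytes < parse_quantity(threshold)

print(parse_quantity("100Mi"))                         # 104857600
print(under_memory_pressure(parse_quantity("64Mi")))   # True
print(under_memory_pressure(parse_quantity("2Gi")))    # False
```

The same suffixes appear in the Deployment's `resources` block, where `64Mi` and `128Mi` are the pod's memory request and limit.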

kube-proxy

kube-proxy maintains network rules on each node that allow pods to communicate with Services. It implements the Service abstraction — translating a virtual ClusterIP into actual pod IPs.

# kube-proxy modes:

# 1. iptables mode (default):
# Creates iptables rules for each Service → endpoint mapping
# Packets are redirected at kernel level (very fast, no userspace)
sudo iptables -t nat -L KUBE-SERVICES | head -20
# Chain KUBE-SERVICES (2 references)
# target                     prot opt source       destination
# KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere     10.96.0.1    /* default/kubernetes:https cluster IP */ tcp dpt:https
# KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere     10.96.0.10   /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
# KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  anywhere     10.96.0.10   /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
# KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere     10.96.0.10   /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
# KUBE-NODEPORTS             all  --  anywhere     anywhere     /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
#
# Each KUBE-SVC-* chain contains the actual load-balancing rules
# that distribute traffic to pod endpoints.

# 2. IPVS mode (better for large clusters):
# Uses Linux IPVS (IP Virtual Server) for load balancing
# Supports more algorithms: round-robin, least-connection, weighted round-robin, and others
# Scales better than iptables once a cluster has thousands of Services
#
# Note: If your cluster uses iptables mode (the default), ipvsadm
# will show empty output — this is normal:
ipvsadm -Ln
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# (empty — cluster is using iptables mode, not IPVS)
#
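# With IPVS mode enabled, you'd see entries like these
# (<control-plane-IP> stands for the API server's node address, because
# the API server runs on the host network rather than on a pod IP):
# TCP  10.96.0.1:443 rr
#   -> <control-plane-IP>:6443   Masq  1  0  0
# TCP  10.96.0.10:53 rr
#   -> 10.244.0.2:53      Masq  1  0  0
#   -> 10.244.0.3:53      Masq  1  0  0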

# Check kube-proxy mode (run from master):
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
#     mode: ""
#
# mode: "" (empty string) means kube-proxy uses the DEFAULT mode,
# which is iptables on Linux. Possible values:
#   ""          → iptables (default since Kubernetes 1.2)
#   "iptables"  → explicit iptables mode
#   "ipvs"      → IPVS mode (scales better with thousands of Services)
#   "nftables"  → nftables mode (alpha in 1.29, beta in 1.31, GA in 1.33; an alternative to iptables mode)
#
# Key fields in the kube-proxy ConfigMap (config.conf):
#   clusterCIDR: 10.244.0.0/16      # Pod network range
#   iptables.syncPeriod: 0s          # How often rules are refreshed (0 = default 30s)
#   ipvs.scheduler: ""               # Load balancing algorithm (rr, lc, wrr, etc.)
#   ipvs.strictARP: false            # Must be true for MetalLB with IPVS
#   conntrack.maxPerCore: null        # Max conntrack entries per CPU core (null = kube-proxy default)
#
# The ConfigMap also contains kubeconfig.conf pointing to the API server:
#   server: https://k8s-master.lab.example.com:6443
#
# Note: kubectl only works on nodes with a valid kubeconfig.
# On worker nodes without ~/.kube/config, you'll get:
#   "The connection to the server localhost:8080 was refused"
# Solution: copy admin.conf from master or run kubectl on master.
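The KUBE-SVC-* chains mentioned under iptables mode spread new connections across endpoints with a sequence of random-match rules. As a rough sketch of the arithmetic: with N endpoints, rule i (counting from 0) matches with probability 1/(N-i), and the last rule always matches, so every endpoint ends up with an equal 1/N share. The function name below is made up for illustration.

```shell
#!/bin/sh
# Sketch: per-rule match probabilities that give N endpoints an equal
# share of new connections in kube-proxy's iptables mode.
# Rule i (0-based) matches with probability 1/(N-i); the final rule is
# effectively unconditional. The function name is hypothetical.
endpoint_probabilities() {
  n="$1"
  awk -v n="$n" 'BEGIN {
    for (i = 0; i < n; i++) printf "endpoint %d: %.5f\n", i + 1, 1 / (n - i)
  }'
}

endpoint_probabilities 3
# endpoint 1: 0.33333
# endpoint 2: 0.50000
# endpoint 3: 1.00000
```

For three endpoints, the first rule takes 1/3 of connections, the second takes half of the remaining 2/3 (another 1/3), and the last takes whatever is left. On a node you can look for the matching "statistic mode random probability" entries with iptables -t nat -L on one of the KUBE-SVC-* chains.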

Container Runtime

The container runtime is responsible for pulling images, creating containers, and managing their lifecycle. Kubernetes communicates with it through the Container Runtime Interface (CRI) — an abstraction that allows different runtimes to be plugged in.

Runtime	CRI Compatible	Use Case	Notes
containerd	Yes (native)	Standard production runtime	Default for most distributions
CRI-O	Yes (native)	Lightweight, OCI-focused	Popular with OpenShift
Docker Engine	Via cri-dockerd adapter (dockershim removed in 1.24)	Development, legacy setups	No built-in support since 1.24
gVisor (runsc)	Yes (via containerd)	Security sandbox	Kernel syscall interception
Kata Containers	Yes (via containerd)	VM-level isolation	Lightweight VMs per pod

# Check which container runtime a cluster is using:
kubectl get nodes -o wide
# NAME       STATUS   ROLES    VERSION   CONTAINER-RUNTIME
# worker-1   Ready    <none>   v1.30.0   containerd://1.7.13
# worker-2   Ready    <none>   v1.30.0   containerd://1.7.13

# crictl (CLI for any CRI-compatible runtime, including containerd):
crictl ps                        # List running containers
crictl images                    # List images on node
crictl inspect <container-id>    # Container details
crictl logs <container-id>       # Container logs

# Check containerd status:
systemctl status containerd
crictl info | head -20
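The CONTAINER-RUNTIME column packs the runtime name and version into one value. If you want to check them in a script, a small helper can split the two. This is only a sketch: the function name is made up, and it assumes the name://version shape shown in the output above.

```shell
#!/bin/sh
# Sketch: split a CONTAINER-RUNTIME value such as "containerd://1.7.13"
# into runtime name and version. The function name is hypothetical.
parse_runtime() {
  value="$1"
  name="${value%%://*}"      # text before "://"
  version="${value#*://}"    # text after "://"
  printf '%s %s\n' "$name" "$version"
}

parse_runtime "containerd://1.7.13"
# containerd 1.7.13

# On a live cluster the same value can be read per node with:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
```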

Component Communication Flow

Pod Creation Lifecycle

When you run kubectl apply -f deployment.yaml, here's the complete sequence of events across all components:

Complete Pod Creation Flow

sequenceDiagram
    participant U as User (kubectl)
    participant API as API Server
    participant ETCD as etcd
    participant DC as Deployment Controller
    participant RC as ReplicaSet Controller
    participant S as Scheduler
    participant KL as kubelet (Worker)
    participant CR as Container Runtime
    
    U->>API: POST /apis/apps/v1/namespaces/default/deployments
    API->>API: Authenticate + Authorise + Admit
    API->>ETCD: Store Deployment object
    API->>U: 201 Created
    
    DC->>API: Watch detects new Deployment
    DC->>API: Create ReplicaSet
    API->>ETCD: Store ReplicaSet
    
    RC->>API: Watch detects new ReplicaSet
    RC->>API: Create Pod (nodeName empty)
    API->>ETCD: Store Pod
    
    S->>API: Watch detects unscheduled Pod
    S->>S: Filter + Score nodes
    S->>API: Bind Pod to worker-2
    API->>ETCD: Update Pod.spec.nodeName
    
    KL->>API: Watch detects Pod assigned to me
    KL->>CR: Pull image + Create container
    CR->>KL: Container started
    KL->>API: Update Pod status: Running
    API->>ETCD: Store updated status

                            
Key Observation: Notice that no control plane component talks to another directly. The only direct links are the API Server to etcd and the kubelet to its local container runtime; everything else goes through the API Server. This is the "hub and spoke" pattern: it simplifies security (one endpoint to secure), enables audit logging (all actions pass through one point), and decouples components (any can be replaced independently).
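One way to read the sequence above: each component decides whether to act purely from the object state it sees through the API Server. The sketch below encodes that idea for a single Pod, using the real fields spec.nodeName and status.phase as inputs. The function name and the output strings are made up for illustration, and the logic is deliberately simplified.

```shell
#!/bin/sh
# Sketch: given a Pod's spec.nodeName and status.phase, name the component
# that acts next in the creation flow. Function name and output strings
# are hypothetical; real components check far more state than this.
next_actor() {
  node_name="$1"
  phase="$2"
  if [ -z "$node_name" ]; then
    echo "scheduler: pod has no node yet, filter + score + bind"
  elif [ "$phase" = "Pending" ]; then
    echo "kubelet on $node_name: pull image and start containers"
  else
    echo "none: pod is $phase on $node_name"
  fi
}

next_actor "" "Pending"
next_actor "worker-2" "Pending"
next_actor "worker-2" "Running"
```

On a live cluster you could feed it real values from kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName} {.status.phase}'.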
                        

Watch Mechanism

Controllers don't poll the API Server; they establish long-lived watch connections. When an object changes, the API Server pushes the update to every client watching that resource type. This is efficient and enables near-instant reactions:

# How watches work:
# 1. Controller opens HTTP connection: GET /api/v1/pods?watch=true
# 2. API Server keeps connection open
# 3. When a pod changes, API Server sends event over the connection:
#    {"type": "MODIFIED", "object": {"kind": "Pod", ...}}
# 4. Controller processes the event and reconciles

# Watch event types (BOOKMARK and ERROR events also exist):
# ADDED    — new object created
# MODIFIED — existing object updated
# DELETED  — object removed

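# Resource versions let a watcher resume without missing events:
# If the connection drops, the controller reconnects with the last resourceVersion it saw
# and the API Server replays the changes since that version.
# If that version is too old to replay, the server answers 410 Gone
# and the controller must re-list the objects and start a new watch.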

# See watches in action:
kubectl get pods --watch -v=7
# I0514 10:23:45.123456   GET https://api:6443/api/v1/pods?watch=true
# I0514 10:23:45.234567   Response Status: 200 OK
# (stream of events follows...)
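To make the resume logic concrete, here is a minimal sketch of a watch consumer. It reads one JSON event per line, prints the event type, and remembers the last resourceVersion so a reconnect could continue from there. The sample events are hand-written stand-ins for a real stream, the function name is made up, and the sed patterns assume single-line events that begin with {"type":"..." and numeric resource versions. Clients should treat resourceVersion as an opaque string, and a JSON parser such as jq is the better choice for real objects.

```shell
#!/bin/sh
# Sketch: consume a watch stream (one JSON event per line) and remember the
# last resourceVersion so a reconnect can resume from it.
# Assumption: simplified single-line events; sed is only good enough for
# this sample, not for full Pod objects.
follow_watch() {
  last_rv=""
  while IFS= read -r event; do
    type=$(printf '%s\n' "$event" | sed -n 's/^{"type":"\([A-Z]*\)".*/\1/p')
    rv=$(printf '%s\n' "$event" | sed -n 's/.*"resourceVersion":"\([0-9]*\)".*/\1/p')
    if [ -n "$rv" ]; then
      last_rv="$rv"
    fi
    echo "$type resourceVersion=$rv"
  done
  echo "reconnect with ?watch=true&resourceVersion=$last_rv"
}

# Hand-written sample events standing in for a real stream:
follow_watch <<'EOF'
{"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"web-1","resourceVersion":"1001"}}}
{"type":"MODIFIED","object":{"kind":"Pod","metadata":{"name":"web-1","resourceVersion":"1007"}}}
{"type":"DELETED","object":{"kind":"Pod","metadata":{"name":"web-1","resourceVersion":"1012"}}}
EOF
# ADDED resourceVersion=1001
# MODIFIED resourceVersion=1007
# DELETED resourceVersion=1012
# reconnect with ?watch=true&resourceVersion=1012
```

To try it against a real cluster, one option is to run kubectl proxy --port=8001 in another terminal and pipe curl -sN 'http://127.0.0.1:8001/api/v1/namespaces/default/pods?watch=true' into follow_watch.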

Exercises

Exercise 1: Component Identification. Using kubectl get pods -n kube-system, identify every control plane component running in your cluster. For each, explain: (a) what it does, (b) what happens if it fails, and (c) how it recovers.

Exercise 2: Pod Creation Trace. Create a Deployment with 2 replicas and use kubectl get events --sort-by=.metadata.creationTimestamp to trace the complete creation flow. Map each event to the component that generated it (scheduler, kubelet, controller, etc.).

Exercise 3: Failure Simulation. If you have a multi-node cluster: (a) Stop kubelet on a worker node. What happens to pods? How long before they're rescheduled? (b) Restart kubelet. What happens to the rescheduled pods? Do they move back?

Exercise 4: Architecture Diagram. Draw a complete architecture diagram of a 3-master, 5-worker production cluster. Label: etcd cluster (3 or 5 members?), API Server load balancer, which components run where, and all communication paths. Include the CNI plugin and kube-proxy.

Conclusion

Kubernetes architecture implements every distributed systems principle we've covered:

Consensus (Part 2): etcd uses Raft for consistent state storage
CAP (Part 3): Kubernetes favours consistency; writes go through etcd's Raft quorum, and the API Server can serve linearizable (quorum) reads from etcd
Service Discovery (Part 4): CoreDNS + Services provide automatic discovery
Self-Healing (Part 5): Controllers continuously reconcile desired vs actual state

In Part 7, we'll explore the Kubernetes Object Model — Pods, ReplicaSets, Deployments, Services, ConfigMaps, and Secrets — with hands-on exercises for each object type so you can practice every concept on the cluster you just built.
