Kubernetes Data Plane — Nodes, kubelet, Container Runtime & Networking

Worker Node Components

A Kubernetes worker node is a machine (physical or virtual) that runs containerized workloads. While the control plane decides WHAT should run WHERE, the data plane (worker nodes) is responsible for actually RUNNING it. Each node has three essential components:

kubelet — the primary node agent; watches the API Server and ensures pods are running
Container runtime — actually creates and manages containers (containerd, CRI-O)
kube-proxy — implements Service networking (load balancing traffic to pods)

                            
Data Plane Analogy: Think of each worker node as a network switch in SDN. The switch (node) has a forwarding table (pod specs) programmed by the controller (API Server). It executes forwarding decisions (runs containers) at "line rate" without consulting the control plane for every packet (request). The kubelet is the node's local agent, much like the switch's OpenFlow agent that receives and applies flow rules.
                        

kubelet — Pod Lifecycle Manager

The kubelet is the most important data plane component. It watches the API Server for Pods assigned to its node and drives the container runtime to match desired state:

kubelet Pod Lifecycle Management

flowchart TB
    API["API Server"] -->|"Watch: Pods on this node"| KL["kubelet"]
    KL --> SYNC["SyncLoop\n(periodic reconciliation)"]
    SYNC --> ADMIT["Admission\n(resource check, node conditions)"]
    ADMIT --> CRI2["CRI Calls\n(create/start/stop containers)"]
    CRI2 --> RT["Container Runtime\n(containerd / CRI-O)"]
    KL --> PROBE["Health Probes\n(liveness, readiness, startup)"]
    KL --> STATUS["Status Reporter\n(→ API Server)"]
    PROBE -->|"Failure"| CRI2
    RT --> CONTAINERS["Running Containers"]
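
The reconciliation step in the diagram above can be sketched as a pure function that compares desired and actual state. This is a minimal illustration, not the kubelet's actual SyncLoop: `PodSpec`, `plan_sync`, and the pod names and images are hypothetical, and the real kubelet tracks far more state than an image name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSpec:
    """Hypothetical, stripped-down stand-in for a pod assigned to this node."""
    name: str
    image: str


def plan_sync(desired: dict[str, PodSpec], running: dict[str, PodSpec]) -> list[tuple[str, str]]:
    """Return the (action, pod name) steps needed to make `running` match `desired`."""
    actions: list[tuple[str, str]] = []
    # Pods that exist locally but are no longer assigned to this node.
    for name in sorted(running.keys() - desired.keys()):
        actions.append(("stop", name))
    for name in sorted(desired):
        if name not in running:
            actions.append(("start", name))
        elif running[name] != desired[name]:
            # The spec differs (here: a different image), so recreate the containers.
            actions.append(("restart", name))
    return actions


desired = {"web": PodSpec("web", "myapp:v2.1"), "cache": PodSpec("cache", "redis:7")}
running = {"web": PodSpec("web", "myapp:v2.0"), "old-job": PodSpec("old-job", "batch:1")}
print(plan_sync(desired, running))
```

Running the function again once the actions have been applied yields an empty plan, which is the property that makes a periodic loop safe to repeat.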

Key kubelet Responsibilities

Pod lifecycle — create, start, restart, and kill containers based on pod spec
Health checking — execute liveness/readiness/startup probes, restart failing containers
Resource management — enforce CPU/memory limits via cgroups, evict pods when node is under pressure
Volume mounting — attach and mount persistent volumes into containers
Node status — report node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) to API Server
Image management — pull container images, garbage collect unused images
Logging/metrics — expose container logs via API, serve metrics on :10250
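
To make the health-checking item above concrete, here is a small sketch of how consecutive liveness failures turn into restarts. It assumes the default `failureThreshold` of 3 and ignores timing details such as `initialDelaySeconds` and restart back-off; `restarts_needed` is a hypothetical helper, not kubelet code.

```python
def restarts_needed(probe_results: list[bool], failure_threshold: int = 3) -> int:
    """Count restarts triggered by runs of consecutive liveness-probe failures."""
    restarts = 0
    consecutive_failures = 0
    for ok in probe_results:
        if ok:
            # Any success resets the failure streak.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # the restarted container begins with a clean streak
    return restarts


# Two failures followed by a success do not restart the container;
# three failures in a row do.
print(restarts_needed([True, False, False, True, False, False, False]))
```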

# Check kubelet status on a node
systemctl status kubelet

# View the last 50 kubelet log lines for debugging
# (use -f instead of -n 50 to follow the log live; combining -f with tail never returns)
journalctl -u kubelet --no-pager -n 50

# Check node conditions reported by kubelet
kubectl describe node worker-1 | grep -A 10 "Conditions:"

# View the kubelet's live configuration through the API Server's node proxy
# (avoids leaving a background "kubectl proxy" process running)
kubectl get --raw "/api/v1/nodes/worker-1/proxy/configz" | \
  python3 -m json.tool | head -40

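# Check kubelet health endpoint (run on the node itself)
# The unauthenticated healthz endpoint listens on localhost:10248 by default;
# port 10250 is the authenticated kubelet API and rejects anonymous requests on most clusters
curl -s http://localhost:10248/healthz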

Container Runtime Interface (CRI)

The kubelet doesn't create containers directly — it uses the Container Runtime Interface (CRI), a gRPC API that abstracts the runtime. This decouples kubelet from specific container implementations.

CRI: How Pods Become Containers

sequenceDiagram
    participant KL as kubelet
    participant CRI as CRI (gRPC)
    participant RT as containerd
    participant OCI as runc (OCI)

    KL->>CRI: RunPodSandbox(PodSandboxConfig)
    CRI->>RT: Create sandbox (pause container)
    RT->>OCI: Create namespace + cgroups
    OCI-->>RT: Sandbox running
    RT-->>KL: PodSandboxID

    KL->>CRI: PullImage(ImageSpec)
    CRI->>RT: Pull from registry
    RT-->>KL: ImageRef

    KL->>CRI: CreateContainer(PodSandboxID, ContainerConfig)
    CRI->>RT: Create container in sandbox
    RT->>OCI: runc create
    OCI-->>RT: Container created

    KL->>CRI: StartContainer(ContainerID)
    CRI->>RT: Start container process
    RT->>OCI: runc start
    OCI-->>RT: Process running
    RT-->>KL: Started
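
The call order in the sequence diagram above can be modelled with a stand-in runtime that only records what it is asked to do. This is a sketch: `RecordingRuntime` and `start_pod` are hypothetical, the method names merely mirror the CRI RPC names from the diagram, and real CRI calls are gRPC requests with much richer messages.

```python
class RecordingRuntime:
    """Hypothetical in-memory stand-in for a CRI runtime; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self._next_id = 0

    def _new_id(self, prefix: str) -> str:
        self._next_id += 1
        return f"{prefix}-{self._next_id}"

    def run_pod_sandbox(self, pod_name: str) -> str:
        self.calls.append("RunPodSandbox")
        return self._new_id("sandbox")

    def pull_image(self, image: str) -> str:
        self.calls.append("PullImage")
        return image  # a real runtime returns a resolved image reference

    def create_container(self, sandbox_id: str, name: str, image_ref: str) -> str:
        self.calls.append("CreateContainer")
        return self._new_id("container")

    def start_container(self, container_id: str) -> None:
        self.calls.append("StartContainer")


def start_pod(runtime: RecordingRuntime, pod_name: str, containers: dict[str, str]):
    """Start a pod; `containers` maps container name -> image."""
    # The sandbox comes first so every container can join its namespaces.
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    started = []
    for name, image in containers.items():
        image_ref = runtime.pull_image(image)
        container_id = runtime.create_container(sandbox_id, name, image_ref)
        runtime.start_container(container_id)
        started.append(container_id)
    return sandbox_id, started


runtime = RecordingRuntime()
sandbox_id, started = start_pod(runtime, "guaranteed-pod", {"app": "myapp:v2.1"})
print(runtime.calls)
```

Because the kubelet side only depends on these four operations, swapping containerd for CRI-O does not change the flow.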

containerd

The industry-standard runtime (used by Docker, Kubernetes, cloud providers). It handles image pull/push, container lifecycle, storage (snapshots), and interfaces with the low-level OCI runtime (runc) for actual container creation.

CRI-O

Purpose-built for Kubernetes: it implements the CRI and nothing else. Its scope is narrower than containerd's, with no Docker compatibility layer and no general-purpose client API. Favored by OpenShift/Red Hat deployments.

                            
The Sandbox Model: Every pod gets a "pause" container (sandbox) that holds the network namespace. Application containers join this existing namespace. This is why containers in the same pod share localhost: they literally share the same network stack created by the sandbox.
                        

kube-proxy — Service Routing

kube-proxy implements the Kubernetes Service abstraction by mapping a stable ClusterIP to ephemeral pod IPs. It watches the API Server for Service and EndpointSlice objects (Endpoints objects in older releases), then programs the node's network stack to route traffic correctly.

kube-proxy Modes

kube-proxy Modes — iptables vs IPVS vs eBPF

flowchart TB
    subgraph IPT["iptables Mode"]
        PKT1["Packet → ClusterIP"] --> CHAIN["iptables KUBE-SERVICES chain"]
        CHAIN --> RULE["Random DNAT rule\n(statistic module)"]
        RULE --> POD1["Pod IP:Port"]
    end
    subgraph IPVS2["IPVS Mode"]
        PKT2["Packet → ClusterIP"] --> VIP["IPVS Virtual Server"]
        VIP --> ALGO["LB Algorithm\n(rr, lc, sh, dh)"]
        ALGO --> POD2["Pod IP:Port"]
    end
    subgraph EBPF2["eBPF Mode (Cilium)"]
        PKT3["Packet → ClusterIP"] --> BPF["eBPF Program\n(socket-level or TC)"]
        BPF --> MAP["BPF Map Lookup\n(Service → backends)"]
        MAP --> POD3["Pod IP:Port"]
    end

iptables mode (default): Creates one iptables rule per Service endpoint. Uses the statistic module for random load balancing. Simple but O(n) rule evaluation — degrades at thousands of Services.

IPVS mode: Uses Linux IPVS (IP Virtual Server) kernel module. O(1) lookup via hash tables. Supports multiple load balancing algorithms (round-robin, least connections, source hash). Recommended for clusters with 1000+ Services.

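eBPF mode (Cilium): Strictly speaking not a kube-proxy mode but a replacement for it, provided by the CNI plugin. Service routing is done in eBPF programs attached to socket or TC hooks. It avoids both iptables and IPVS overhead, and socket-level load balancing can skip most of the network stack traversal, so it is typically the fastest of the three at large Service counts.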

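# Find the per-Service chain that kube-proxy created for a Service
# (rules carry a "namespace/name" comment, so grep can match the Service name)
iptables -t nat -L KUBE-SERVICES -n | grep my-service

# Dump that chain in rule-spec form: one jump per endpoint
# (KUBE-SVC-XXXXX is a placeholder for the chain name found above)
iptables -t nat -S KUBE-SVC-XXXXX

# Example output showing random load balancing across three endpoints;
# the DNAT itself happens in each KUBE-SEP-* chain:
# -A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.333
#   -j KUBE-SEP-AAAAA
# -A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.500
#   -j KUBE-SEP-BBBBB
# -A KUBE-SVC-XXXXX -j KUBE-SEP-CCCCC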

# Check IPVS virtual servers (if using IPVS mode)
ipvsadm -Ln

# Check kube-proxy mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | grep "Using"
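
The probabilities in the iptables example output above (0.333, 0.500, then an unconditional jump) look uneven, but they give every endpoint the same overall share because each rule is only reached when the earlier ones did not match. The arithmetic can be checked with a short sketch; the function names are hypothetical and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def rule_probabilities(n_endpoints: int) -> list[Fraction]:
    """Probability attached to each rule, in chain order; the last rule always matches."""
    if n_endpoints < 1:
        raise ValueError("need at least one endpoint")
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]


def effective_share(probabilities: list[Fraction]) -> list[Fraction]:
    """Overall chance that a new connection lands on each endpoint."""
    shares = []
    remaining = Fraction(1)  # chance that no earlier rule matched
    for p in probabilities:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                   # per-rule probabilities
print([str(s) for s in effective_share(probs)])  # resulting per-endpoint share
```

The same sequential evaluation is why this mode scales linearly: a packet for the last endpoint has to walk past every earlier rule.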

Service Types and Traffic Flow

ClusterIP — internal-only virtual IP. kube-proxy DNATs packets destined for ClusterIP → pod IP
NodePort — opens a port (30000-32767 by default) on every node. Traffic to any node's IP:NodePort → pod
LoadBalancer — provisions external cloud LB that routes to NodePorts (or directly to pods with cloud provider integration)
ExternalName — DNS CNAME alias, no proxying involved

Pod Networking Model

Kubernetes mandates a flat networking model with three fundamental rules:

Every pod gets its own IP address
Pods can communicate with all other pods without NAT
Agents on a node can communicate with all pods on that node

The CNI (Container Network Interface) plugin is responsible for implementing these guarantees. When kubelet creates a pod sandbox, it calls the CNI plugin to configure networking for the new namespace.

Pod-to-Pod Networking — Same Node vs Cross-Node

flowchart LR
    subgraph Node1["Node 1 (10.0.1.0/24)"]
        PA["Pod A\n10.0.1.5"] --> BR1["Bridge/veth\n(cbr0)"]
        PB["Pod B\n10.0.1.6"] --> BR1
    end
    subgraph Node2["Node 2 (10.0.2.0/24)"]
        PC["Pod C\n10.0.2.3"] --> BR2["Bridge/veth\n(cbr0)"]
        PD["Pod D\n10.0.2.4"] --> BR2
    end
    BR1 -->|"Same node:\nbridge forwarding"| BR1
    BR1 -->|"Cross-node:\nencap or routing"| BR2

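Same-node communication: Each pod has its own network namespace, connected to the host by a veth pair. Pods on the same node reach each other through a Linux bridge or direct veth-pair routing in the host's network namespace, so the traffic never leaves the node.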

Cross-node communication: Depends on CNI implementation — either overlay (VXLAN encapsulation), direct routing (BGP announces pod CIDRs), or cloud-native (AWS ENI, Azure CNI assigning VNet IPs to pods).
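
The same-node versus cross-node decision can be illustrated with the pod CIDRs from the diagram above. This sketch assumes each node owns one contiguous pod CIDR, which holds for many but not all CNI plugins; `NODE_POD_CIDRS`, `node_for`, and `path` are hypothetical names.

```python
import ipaddress

# Per-node pod CIDRs taken from the diagram (assumed one CIDR per node).
NODE_POD_CIDRS = {
    "node-1": ipaddress.ip_network("10.0.1.0/24"),
    "node-2": ipaddress.ip_network("10.0.2.0/24"),
}


def node_for(pod_ip: str) -> str:
    """Return the node whose pod CIDR contains `pod_ip`."""
    addr = ipaddress.ip_address(pod_ip)
    for node, cidr in NODE_POD_CIDRS.items():
        if addr in cidr:
            return node
    raise LookupError(f"{pod_ip} is not in any known pod CIDR")


def path(src_ip: str, dst_ip: str) -> str:
    """Describe how a packet between two pod IPs would be delivered."""
    if node_for(src_ip) == node_for(dst_ip):
        return "same-node: bridge/veth forwarding"
    return "cross-node: encapsulation or routing via the CNI plugin"


print(path("10.0.1.5", "10.0.1.6"))  # Pod A -> Pod B
print(path("10.0.1.5", "10.0.2.3"))  # Pod A -> Pod C
```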

Node Resource Management

The kubelet enforces resource boundaries using Linux cgroups and monitors node health for eviction decisions.

QoS Classes

Kubernetes assigns a Quality of Service class to each pod based on its resource specifications:

Guaranteed — every container sets CPU and memory limits, and requests == limits for both (highest priority, last evicted)
Burstable — does not qualify as Guaranteed, but at least one container sets a CPU or memory request or limit (medium priority)
BestEffort — no container sets any requests or limits (lowest priority, first evicted)
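
The classification rules can be sketched as a small function. This is an illustration under stated assumptions rather than the kubelet's implementation: `qos_class` is a hypothetical helper, it only considers `cpu` and `memory`, and it compares quantity strings literally (so "500m" and "0.5" would count as different).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "500m", "memory": "256Mi"}
print(qos_class([{"requests": same, "limits": same}]))
print(qos_class([{"requests": {"cpu": "100m"}}]))
print(qos_class([{}]))
```

Note that a single container without a memory limit is enough to drop the whole pod from Guaranteed to Burstable.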

# Pod with Guaranteed QoS (requests == limits)
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
  namespace: production
spec:
  containers:
    - name: app
      image: myapp:v2.1
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 3

Eviction

When node resources run low, kubelet evicts pods to protect node stability:

Soft eviction — threshold crossed for grace period (e.g., memory.available < 500Mi for 2 minutes)
Hard eviction — immediate eviction at threshold (e.g., memory.available < 100Mi)
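Eviction order: pods whose usage exceeds their requests go first (BestEffort pods, then Burstable pods over their requests), ranked by pod priority and by how far usage exceeds requests; Guaranteed pods and Burstable pods within their requests are evicted last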
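
A simplified sketch of both ideas follows: deciding whether a threshold has fired, and ranking candidates. The threshold values are the examples from the list above; `PodUsage`, `eviction_signal`, and `eviction_order` are hypothetical, and the ranking assumes the kubelet's documented order (usage above requests first, then pod priority, then usage relative to requests) for memory only.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    memory_request_mi: int  # 0 for BestEffort pods
    memory_usage_mi: int
    priority: int = 0


def eviction_signal(available_mi: float, seconds_below_soft: float,
                    soft_mi: int = 500, soft_grace_s: int = 120, hard_mi: int = 100) -> str:
    """Return "hard", "soft", or "none" for the memory.available signal."""
    if available_mi < hard_mi:
        return "hard"  # no grace period
    if available_mi < soft_mi and seconds_below_soft >= soft_grace_s:
        return "soft"  # threshold held for the whole grace period
    return "none"


def eviction_order(pods: list[PodUsage]) -> list[str]:
    """Names of pods, first to be evicted first."""
    def key(pod: PodUsage):
        over = pod.memory_usage_mi - pod.memory_request_mi
        # Pods above their requests sort first, then lower priority,
        # then the largest overshoot.
        return (over <= 0, pod.priority, -over)
    return [pod.name for pod in sorted(pods, key=key)]


pods = [
    PodUsage("guaranteed", memory_request_mi=512, memory_usage_mi=400),
    PodUsage("burstable-over", memory_request_mi=100, memory_usage_mi=150),
    PodUsage("best-effort", memory_request_mi=0, memory_usage_mi=200),
]
print(eviction_signal(available_mi=300, seconds_below_soft=150))
print(eviction_order(pods))
```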

                            
Resource Limits Are Enforced by the Data Plane: CPU limits use CFS throttling (cgroup v1 cpu.cfs_quota_us, cgroup v2 cpu.max), so the kernel stops scheduling the container once it has used its quota for the current period. Memory limits use cgroup v1 memory.limit_in_bytes (cgroup v2 memory.max), and the kernel OOM-kills a process in the container if usage exceeds the limit. These are hard enforcements by the Linux kernel, not suggestions. A container exceeding its memory limit WILL be killed, regardless of what the control plane thinks.
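
As a worked example of how pod limits translate into kernel settings, the sketch below converts the limits from the Guaranteed pod manifest into a CFS quota and a byte count. It assumes the default 100 ms CFS period and handles only the `m` CPU suffix and the `Ki`/`Mi`/`Gi` memory suffixes; both helper names are hypothetical.

```python
CFS_PERIOD_US = 100_000  # default CFS period of 100 ms (assumed)

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_quota_us(cpu: str, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time in microseconds the container may use per period."""
    if cpu.endswith("m"):
        millicores = int(cpu[:-1])
    else:
        millicores = round(float(cpu) * 1000)
    return millicores * period_us // 1000


def memory_limit_bytes(memory: str) -> int:
    """Memory limit as the byte count written to the cgroup."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if memory.endswith(suffix):
            return int(memory[:-len(suffix)]) * factor
    return int(memory)  # plain byte count


# Limits from the Guaranteed pod: cpu "500m", memory "256Mi"
print(cpu_quota_us("500m"))         # 50000 us of CPU per 100000 us period
print(memory_limit_bytes("256Mi"))  # 268435456 bytes
```

A 500m limit therefore means the container is throttled after 50 ms of CPU time in every 100 ms window, even if the node is otherwise idle.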
                        

Data Plane Extensions

The Kubernetes data plane is extensible through several plugin interfaces:

Device Plugins

Expose specialized hardware to pods (GPUs, FPGAs, smart NICs, TPUs). The device plugin registers with kubelet via gRPC, advertises available devices, and allocates them to containers at runtime.

CSI Drivers (Container Storage Interface)

Provide persistent storage to pods. CSI separates storage provisioning (control plane: create volume) from mounting (data plane: attach volume to node, mount into container).

# CSI StorageClass — control plane configuration
# that drives data plane volume mounting
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true

Service Mesh Sidecars

Inject proxy containers (Envoy, Linkerd-proxy) alongside application containers. The sidecar intercepts all network traffic, adding mTLS, retries, circuit breaking, and observability — extending the data plane with service-level networking features without application changes.

Architecture Pattern

The Data Plane Keeps Getting Richer

Originally, the Kubernetes data plane was just "run containers and route packets." Today it encompasses GPUs, persistent storage, service meshes, security policies (seccomp, AppArmor, SELinux), network policies (CNI firewalling), and runtime security (Falco, Tetragon). Each extension adds data plane functionality without changing the control plane API. This is the power of the separation: the control plane remains stable while the data plane evolves rapidly.


Key Takeaway

Control Plane Decides, Data Plane Executes

The Kubernetes data plane embodies the same principle as a network switch in SDN: it receives instructions from the control plane and executes them locally at maximum efficiency. kubelet is the local agent, the container runtime is the forwarding engine, kube-proxy is the service routing table, and CNI is the network fabric. All are optimized for execution speed and reliability, while intelligence and decision-making remain centralized in the control plane.

