Part 3: Control Groups — Resource Management

May 14, 2026 Wasil Zafar 25 min read

Namespaces control what a process can see. Control groups (cgroups) control what a process can use. Together they form the complete container isolation model — visibility control plus resource budgets.

Table of Contents

  1. What Are Control Groups?
  2. cgroups v1 vs v2
  3. CPU Limits
  4. Memory Limits
  5. I/O Limits
  6. Process Limits
  7. cgroups v2 Unified Hierarchy
  8. How Docker Uses cgroups
  9. Kubernetes Connection
  10. Exercises
  11. Conclusion & Next Steps

What Are Control Groups?

Control groups (cgroups) are a Linux kernel feature that organises processes into hierarchical groups whose resource usage can be limited, monitored, and accounted for. Originally developed by Google engineers Paul Menage and Rohit Seth in 2006 (initially called "process containers"), cgroups were merged into the Linux kernel in version 2.6.24 (January 2008).

While namespaces create the illusion of isolation by controlling visibility, cgroups provide actual physical resource constraints. A process in a PID namespace might not be able to see other processes, but without cgroups it could still consume 100% of the CPU, allocate all available memory, or saturate disk I/O — effectively crashing the entire host and all other containers.

The Resource Budget Analogy: Think of cgroups as departmental budgets in a company. Namespaces are like separate offices (each department can only see their own work). But cgroups are the budget allocations — the engineering department gets 60% of the compute budget, marketing gets 25%, and HR gets 15%. No department can spend more than its allocation, regardless of what they can or cannot see. If engineering tries to exceed its budget, it gets throttled or denied.

cgroups provide four key capabilities:

  • Resource limiting — Set hard caps on CPU time, memory usage, I/O bandwidth, and process count
  • Prioritisation — Give some groups more resources than others when there is contention
  • Accounting — Track how much of each resource a group has consumed (for billing, monitoring, capacity planning)
  • Control — Freeze, checkpoint, restart, or kill all processes in a group atomically

Hierarchical Organisation

cgroups are organised in a tree structure. Each node in the tree is a group, and a child group can never exceed the limits of its parent, although it can set stricter limits of its own. This hierarchy maps naturally to container orchestration: the system gets the root cgroup, each container runtime gets a child cgroup, and each container gets a grandchild cgroup.

# View the cgroup hierarchy on a modern Linux system (cgroups v2)
# The filesystem is mounted at /sys/fs/cgroup
ls /sys/fs/cgroup/
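# Output: cgroup.controllers  cgroup.max.depth  cgroup.procs  cgroup.subtree_control  cpu.stat  memory.stat ...
# (limit files such as cpu.max and memory.max exist only in child cgroups, not in the root)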

# View the tree structure
find /sys/fs/cgroup -maxdepth 3 -name "cgroup.procs" | head -20
# /sys/fs/cgroup/cgroup.procs
# /sys/fs/cgroup/system.slice/cgroup.procs
# /sys/fs/cgroup/system.slice/docker.service/cgroup.procs
# /sys/fs/cgroup/system.slice/ssh.service/cgroup.procs
# /sys/fs/cgroup/user.slice/cgroup.procs

# See which cgroup a specific process belongs to
cat /proc/self/cgroup
# Output (v2): 0::/user.slice/user-1000.slice/session-1.scope

# See which cgroup a Docker container's process belongs to
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' my-container)
cat /proc/$CONTAINER_PID/cgroup
# Output (v2): 0::/system.slice/docker-abc123...def.scope
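The path printed in /proc/<pid>/cgroup is relative to the cgroup mount point, so it can be turned into a directory to inspect. The sketch below assumes a pure cgroups v2 host mounted at /sys/fs/cgroup; the helper name cgroup_dir_from_line is hypothetical.

```shell
# Hypothetical helper: map a cgroups v2 line from /proc/<pid>/cgroup
# to the matching directory under /sys/fs/cgroup.
cgroup_dir_from_line() {
    line=$1
    case $line in
        0::*) printf '/sys/fs/cgroup%s\n' "${line#0::}" ;;
        *)    echo "not a cgroups v2 entry: $line" >&2; return 1 ;;
    esac
}

cgroup_dir_from_line "0::/user.slice/user-1000.slice/session-1.scope"
# prints /sys/fs/cgroup/user.slice/user-1000.slice/session-1.scope
```

On a live system the argument would come from `cat /proc/self/cgroup`, and the resulting directory holds that process's control files.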

cgroups v1 vs v2 — Understanding the Transition

The Linux kernel has two implementations of cgroups that differ significantly in architecture. Understanding both is essential because production systems still run both versions, and Docker/Kubernetes behave differently depending on which is available.

| Feature | cgroups v1 | cgroups v2 |
|---|---|---|
| Kernel version | 2.6.24 (2008) | 4.5 (2016), mature by 5.x |
| Hierarchy | Multiple hierarchies (one per controller) | Single unified hierarchy |
| Mount point | /sys/fs/cgroup/cpu, /sys/fs/cgroup/memory, etc. | /sys/fs/cgroup (single mount) |
| Process membership | Process can be in different groups per controller | Process in exactly one group for all controllers |
| Thread support | Limited | Thread-level granularity with threaded controllers |
| Delegation | Complex, security issues | Clean delegation model for rootless containers |
| PSI (Pressure Stall Info) | Not available | Built-in resource pressure monitoring |
| Memory controller | memory.limit_in_bytes | memory.max, memory.high |
| CPU controller | cpu.shares, cpu.cfs_quota_us | cpu.weight, cpu.max |

Migration Status (2026)

The industry is in the midst of transitioning from v1 to v2:

Current State: Ubuntu 22.04+, Fedora 31+, Debian 11+, and RHEL 9+ all default to cgroups v2. Docker Engine 20.10+ supports cgroups v2 natively. Kubernetes 1.25+ supports cgroups v2 with full feature parity. If you are on a recent Linux distribution, you are almost certainly running cgroups v2. Legacy systems may still use v1 or a hybrid mode.
# Check which cgroup version your system uses
stat -fc %T /sys/fs/cgroup/
# Output: "cgroup2fs" = v2, "tmpfs" = v1 (or hybrid)

# Alternative check
mount | grep cgroup
# v2: "cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)"
# v1: multiple lines like "cgroup on /sys/fs/cgroup/cpu type cgroup (rw,...,cpu)"

# Check Docker's cgroup driver
docker info | grep -i cgroup
# Output: Cgroup Driver: systemd  (or cgroupfs)
#         Cgroup Version: 2
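For scripts that need to branch on the cgroup version, the stat check above can be wrapped in a small function. This is a sketch only; the function name and the returned labels are hypothetical.

```shell
# Hypothetical helper: classify the filesystem type reported by
# `stat -fc %T /sys/fs/cgroup/`.
cgroup_version_from_fstype() {
    case $1 in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

cgroup_version_from_fstype cgroup2fs   # prints v2
# On a live host:
#   cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup/)"
```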

CPU Limits — Controlling Compute Time

CPU limiting is the most commonly used cgroup feature. It comes in two flavours: relative weights (how CPU is shared when there is contention) and hard limits (absolute caps on CPU time regardless of availability).

The key CPU control parameters are:

| Parameter (v1) | Parameter (v2) | Purpose | Default |
|---|---|---|---|
| cpu.shares | cpu.weight | Relative weight for fair scheduling under contention | 1024 (v1) / 100 (v2) |
| cpu.cfs_period_us | cpu.max (period component) | Length of the scheduling period in microseconds | 100000 (100ms) |
| cpu.cfs_quota_us | cpu.max (quota component) | Maximum CPU time allowed per period | -1 (unlimited) |
| cpuset.cpus | cpuset.cpus | Pin processes to specific CPU cores | All CPUs |

The difference between shares/weight and quota/max is critical:

  • Shares (relative) — Only matter when CPUs are busy. If your container has 512 shares and another has 1024, the second gets twice the CPU time during contention. But if the system is idle, both can use 100% of available CPU.
  • Quota (absolute) — Hard limit regardless of system load. If you set a quota of 50000 per 100000 period (0.5 CPUs), your container will be throttled at 50% of one core even if the other 31 cores sit completely idle.
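The relative nature of shares can be checked with a little arithmetic: under full contention each group receives its weight divided by the sum of all competing weights. The helper below is a hypothetical sketch using integer percentages.

```shell
# Hypothetical helper: percentage of CPU a group receives under full
# contention. The first argument is the group's own weight; the
# remaining arguments are the weights of ALL competing groups
# (including the group itself).
cpu_share_percent() {
    mine=$1; shift
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    echo $((mine * 100 / total))
}

cpu_share_percent 512 512 1024    # prints 33
cpu_share_percent 1024 512 1024   # prints 66
```

The second group gets twice the first group's share, matching the 512 versus 1024 example. When the CPUs are not contended, these percentages do not apply and either group can use whatever is idle.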

Hands-On: Setting CPU Limits

# === cgroups v2 example (modern systems) ===

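# Make sure the cpu and cpuset controllers are enabled for child cgroups
# (often already the case on systemd hosts; the write is harmless if so)
echo "+cpu +cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Create a new cgroup
sudo mkdir /sys/fs/cgroup/my-container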

# Set CPU limit: 50% of one CPU (50ms quota per 100ms period)
echo "50000 100000" | sudo tee /sys/fs/cgroup/my-container/cpu.max
# Format: $QUOTA $PERIOD (in microseconds)
# 50000/100000 = 0.5 CPUs

# For 2 full CPUs: 200000 100000 (200ms per 100ms period)
# For 0.25 CPUs: 25000 100000 (25ms per 100ms period)

# Set CPU weight (relative priority, 1-10000, default 100)
echo "200" | sudo tee /sys/fs/cgroup/my-container/cpu.weight

# Pin to specific CPUs (cores 0 and 1 only)
echo "0-1" | sudo tee /sys/fs/cgroup/my-container/cpuset.cpus

# Add current shell to the cgroup
echo $$ | sudo tee /sys/fs/cgroup/my-container/cgroup.procs

# Run a CPU-intensive task and observe throttling
dd if=/dev/zero of=/dev/null bs=1M &
# Check: the task will be limited to 50% of one CPU

# View throttling statistics
cat /sys/fs/cgroup/my-container/cpu.stat
# Output:
# usage_usec 1234567     (total CPU time consumed)
# user_usec 1200000      (user-space CPU time)
# system_usec 34567      (kernel-space CPU time)
# nr_periods 456         (scheduling periods elapsed)
# nr_throttled 123       (periods where throttling occurred)
# throttled_usec 789000  (total time spent throttled)

# Clean up
kill %1
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/my-container
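The quota values used above follow from a simple product: quota = CPUs × period. A minimal sketch of that conversion, assuming awk is available and using a hypothetical function name:

```shell
# Hypothetical helper: convert a CPU count (may be fractional) into
# the "QUOTA PERIOD" string expected by cpu.max.
# The period defaults to 100000 microseconds (100ms).
cpus_to_cpu_max() {
    awk -v cpus="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%d %d\n", cpus * period + 0.5, period }'
}

cpus_to_cpu_max 0.5    # prints 50000 100000
cpus_to_cpu_max 2      # prints 200000 100000
cpus_to_cpu_max 0.25   # prints 25000 100000
```

The output could be piped to `sudo tee .../cpu.max` in place of the hand-written values.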
Experiment

The CPU Throttling Latency Problem

Setting aggressive CPU quotas can cause unexpected latency spikes. Consider a container with a quota of 10ms per 100ms period. If it receives a burst of requests and two busy threads exhaust the 10ms quota in the first 5ms of the period, the container is throttled for the remaining 95ms, adding nearly a tenth of a second of latency to any request caught in that window. This is why Kubernetes documentation warns against setting CPU limits too low for latency-sensitive services. Some teams prefer CPU requests without hard limits, relying on the scheduler's fair sharing instead of hard throttling.
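The stall in this scenario is simple arithmetic: the quota is consumed in quota / threads of wall-clock time, and the rest of the period is spent throttled. The sketch below uses whole milliseconds and a hypothetical function name.

```shell
# Hypothetical helper: worst-case throttled time within one period.
# Usage: worst_case_stall_ms QUOTA_MS PERIOD_MS BUSY_THREADS
worst_case_stall_ms() {
    quota=$1; period=$2; threads=$3
    burn=$((quota / threads))   # wall-clock ms until the quota is spent
    echo $((period - burn))
}

worst_case_stall_ms 10 100 2   # prints 95
worst_case_stall_ms 10 100 1   # prints 90
```

More busy threads spend the quota sooner, so the stall grows towards the full period length.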


Memory Limits — Preventing Resource Exhaustion

Memory is a non-compressible resource — unlike CPU (where a process just runs slower when throttled), when memory runs out, something must give. The system must either deny the allocation (returning an error to the application) or kill a process to free memory. This makes memory limiting both critically important and potentially dangerous.

The key memory control parameters are:

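| Parameter (v1) | Parameter (v2) | Purpose |
|---|---|---|
| memory.limit_in_bytes | memory.max | Hard limit; the OOM killer is triggered if usage cannot be kept below it |
| memory.soft_limit_in_bytes | memory.low (closest equivalent) | Best-effort protection; memory below this level is reclaimed last |
| (no equivalent) | memory.high | Throttle limit; above it the kernel reclaims aggressively and slows allocations |
| memory.memsw.limit_in_bytes | memory.swap.max | Swap limit (v1 counts memory + swap combined; v2 counts swap only) |
| memory.usage_in_bytes | memory.current | Current memory usage (read-only) |
| memory.oom_control | memory.oom.group | OOM killer behaviour control |
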
# === Setting memory limits (cgroups v2) ===

# Create a cgroup with 256MB memory limit
sudo mkdir /sys/fs/cgroup/mem-test

# Set hard limit (256 MB)
echo "268435456" | sudo tee /sys/fs/cgroup/mem-test/memory.max

# Set throttle threshold (200 MB): above this the kernel reclaims aggressively and slows allocations
echo "209715200" | sudo tee /sys/fs/cgroup/mem-test/memory.high

# Disable swap for this cgroup (force OOM rather than slow swap)
echo "0" | sudo tee /sys/fs/cgroup/mem-test/memory.swap.max

# Add current shell
echo $$ | sudo tee /sys/fs/cgroup/mem-test/cgroup.procs

# Monitor memory usage
cat /sys/fs/cgroup/mem-test/memory.current
# Output: current usage in bytes

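# View detailed memory statistics
cat /sys/fs/cgroup/mem-test/memory.stat
# Output includes: anon, file, kernel, slab, sock, shmem, file_mapped, etc.

# Clean up: move the shell back to the root cgroup, then remove the group
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/mem-test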
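The byte values written above are binary multiples (256 × 1024 × 1024 and so on). A small conversion helper, hypothetical and limited to integer sizes with a k, m, or g suffix, avoids doing that multiplication by hand:

```shell
# Hypothetical helper: convert sizes such as 256m or 1g to bytes.
# Only whole numbers with an optional k/m/g suffix are handled.
to_bytes() {
    num=${1%[kKmMgG]}
    case $1 in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 256m   # prints 268435456
to_bytes 200M   # prints 209715200
```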

The OOM Killer — When Memory Runs Out

When a cgroup reaches its memory.max limit and no memory can be reclaimed, the kernel's Out-of-Memory (OOM) killer activates. It selects a process within the cgroup to kill, freeing memory for the survivors. In container contexts, this usually means the container's main process is killed and the container exits; whether it restarts depends on the configured restart policy.

Experiment

Triggering the OOM Killer

You can safely observe the OOM killer in action with Docker:

# Run a container with a 64MB memory limit (and no additional swap, so the limit is actually hit)
docker run --rm --memory=64m --memory-swap=64m --name oom-test alpine sh -c '
    echo "Allocating memory until OOM..."
    # Allocate memory in 10MB chunks
    i=0
    while true; do
        dd if=/dev/zero of=/dev/shm/block$i bs=10M count=1 2>/dev/null
        i=$((i+1))
        echo "Allocated $((i*10))MB"
    done
'
# Output:
# Allocating memory until OOM...
# Allocated 10MB
# Allocated 20MB
# Allocated 30MB
# Allocated 40MB
# Allocated 50MB
# Killed

# Check Docker events (the command keeps streaming; press Ctrl+C to stop)
docker events --filter event=oom --since 1m
# Output: container oom abc123... (image=alpine, name=oom-test)
The Silent OOM Problem: A common production issue is that the OOM killer kills a process inside the container, and whether anything is restarted depends on the runtime's restart policy. Monitor memory.events, whose oom and oom_kill counters record these events, and log when usage approaches the limit. In Kubernetes, an OOM-killed container is reported with the reason OOMKilled and is restarted according to the Pod's restartPolicy. Alert when container_memory_working_set_bytes approaches the limit.
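Reading the OOM counter out of memory.events takes one awk expression. The sketch below assumes the v2 key/value format (one "key value" pair per line); the function name is hypothetical.

```shell
# Hypothetical helper: print the oom_kill counter from memory.events
# content supplied on stdin (prints 0 if the key is absent).
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2; found = 1 }
         END { if (!found) print 0 }'
}

# Sample content standing in for a real memory.events file
printf 'low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n' | oom_kill_count
# prints 1

# On a live host:
#   oom_kill_count < /sys/fs/cgroup/mem-test/memory.events
```

A monitoring script could poll this value and raise an alert whenever it increases.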

I/O Limits — Controlling Disk Bandwidth

The I/O controller (called blkio in cgroups v1, io in v2) limits the rate at which a cgroup can read from and write to block devices. Without I/O limits, a single container performing heavy disk operations (like a database backup or log rotation) can starve other containers of disk bandwidth, causing latency spikes across the entire host.

I/O limiting works on a per-device basis — you specify limits for specific block devices (identified by major:minor numbers):

# === I/O limiting (cgroups v2) ===

# Find the major:minor number of your disk
lsblk -o NAME,MAJ:MIN
# Output:
# NAME    MAJ:MIN
# sda       8:0
# ├─sda1    8:1
# └─sda2    8:2

# Create a cgroup with I/O limits
sudo mkdir /sys/fs/cgroup/io-test

# Set read bandwidth limit: 10MB/s on device 8:0
echo "8:0 rbps=10485760" | sudo tee /sys/fs/cgroup/io-test/io.max

# Set write bandwidth limit: 5MB/s on device 8:0
echo "8:0 wbps=5242880" | sudo tee /sys/fs/cgroup/io-test/io.max

# Set IOPS limits (operations per second)
echo "8:0 riops=1000 wiops=500" | sudo tee /sys/fs/cgroup/io-test/io.max

# Combined: set all limits at once
echo "8:0 rbps=10485760 wbps=5242880 riops=1000 wiops=500" | \
    sudo tee /sys/fs/cgroup/io-test/io.max

# With Docker — limit write speed to 10MB/s
docker run --rm --device-write-bps /dev/sda:10mb alpine \
    sh -c 'dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct'
# Without limit: ~200MB/s
# With limit: ~10MB/s (throttled)

# Monitor I/O statistics
cat /sys/fs/cgroup/io-test/io.stat
# Output: 8:0 rbytes=1048576 wbytes=524288 rios=100 wios=50 dbytes=0 dios=0
The Direct I/O Caveat: Under cgroups v1, I/O throttling only works reliably with direct I/O (the O_DIRECT flag). Buffered I/O goes through the page cache, and v1 cannot attribute the later writeback to the cgroup that dirtied the pages, so a container doing buffered writes can exceed its I/O limits because the actual disk writes happen asynchronously in kernel context. cgroups v2 can charge writeback to the originating cgroup when both the memory and io controllers are enabled and the filesystem supports cgroup writeback, but enforcement is still delayed compared with direct I/O. For precise I/O control, configure your applications to use direct I/O or accept that buffered I/O limits are approximate.
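The io.max lines shown earlier can be generated rather than typed. This sketch builds a bandwidth-only entry from MiB/s figures; the function name is hypothetical and the device numbers are taken from the lsblk example.

```shell
# Hypothetical helper: build an io.max entry from MiB/s limits.
# Usage: io_max_line MAJ:MIN READ_MIB_PER_S WRITE_MIB_PER_S
io_max_line() {
    mib=$((1024 * 1024))
    echo "$1 rbps=$(($2 * mib)) wbps=$(($3 * mib))"
}

io_max_line 8:0 10 5
# prints 8:0 rbps=10485760 wbps=5242880
```

The result can be piped to `sudo tee /sys/fs/cgroup/io-test/io.max` exactly like the literal strings above.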

Process Limits — Fork Bomb Protection

The PID controller limits the number of processes (and threads) that can exist within a cgroup. This is the primary defence against fork bombs — malicious or buggy code that recursively creates processes until the system's process table is exhausted, crashing the entire host.

# === Process limits (cgroups v2) ===

# Set maximum number of processes in a cgroup
sudo mkdir /sys/fs/cgroup/pid-test
echo "100" | sudo tee /sys/fs/cgroup/pid-test/pids.max

# Current process count
cat /sys/fs/cgroup/pid-test/pids.current
# Output: 0 (no processes assigned yet)

# With Docker: limit to 50 processes
docker run --rm --pids-limit=50 alpine sh -c '
    echo "Attempting to create processes..."
    i=0
    while [ "$i" -lt 100 ]; do
        i=$((i+1))
        # A backgrounded command always sets $? to 0, so a failed fork
        # shows up as an error printed by the shell, not as an exit status
        sleep 60 &
        echo "Started process $i"
    done
'
# Expected: the count stops just short of 50 (the shell itself occupies a slot)
# and the shell reports a fork error such as "Resource temporarily unavailable"

# The classic fork bomb (DO NOT RUN without limits!)
# :(){ :|:& };:
# With pids.max set, this is safely contained — it hits the limit and stops
Security Scenario

Fork Bomb Containment

Without process limits, a single compromised container running :(){ :|:& };: (the classic bash fork bomb) can exhaust the host's process ID space (typically 32,768 or 4,194,304 depending on kernel.pid_max). This prevents ALL processes on the host from forking — including the container runtime, SSH daemon, and monitoring agents. With pids.max set, the fork bomb is contained to its cgroup. It fills up its 100-process allocation and stops. Other containers and the host are completely unaffected.
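To see how close a group is to its process limit, pids.current can be compared against pids.max, keeping in mind that an unlimited group reports the literal string max. A hypothetical sketch:

```shell
# Hypothetical helper: remaining process slots in a cgroup.
# Usage: pids_headroom CURRENT MAX   (MAX may be the literal "max")
pids_headroom() {
    if [ "$2" = "max" ]; then
        echo "unlimited"
    else
        echo $(($2 - $1))
    fi
}

pids_headroom 48 100   # prints 52
pids_headroom 48 max   # prints unlimited

# On a live host:
#   pids_headroom "$(cat /sys/fs/cgroup/pid-test/pids.current)" \
#                 "$(cat /sys/fs/cgroup/pid-test/pids.max)"
```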


cgroups v2 — The Unified Hierarchy

cgroups v2 was designed to fix the architectural problems of v1. The most important change is the unified hierarchy — instead of separate filesystem trees for each controller (cpu, memory, io, pids), there is a single tree where all controllers are managed together.

cgroups v2 Unified Hierarchy
flowchart TD
    ROOT["/sys/fs/cgroup (root)"] --> SYSTEM["system.slice"]
    ROOT --> USER["user.slice"]
    ROOT --> DOCKER["docker"]
    SYSTEM --> SSH["ssh.service"]
    SYSTEM --> NGINX["nginx.service"]
    DOCKER --> C1["container-abc123"]
    DOCKER --> C2["container-def456"]
    DOCKER --> C3["container-ghi789"]
    C1 --- C1R["cpu.max: 100000 100000<br/>memory.max: 512M<br/>pids.max: 200"]
    C2 --- C2R["cpu.max: 50000 100000<br/>memory.max: 256M<br/>pids.max: 100"]
    C3 --- C3R["cpu.max: 200000 100000<br/>memory.max: 1G<br/>pids.max: 500"]
    style ROOT fill:#132440,stroke:#3B9797,color:#fff
    style SYSTEM fill:#16476A,stroke:#3B9797,color:#fff
    style USER fill:#16476A,stroke:#3B9797,color:#fff
    style DOCKER fill:#3B9797,stroke:#132440,color:#fff
    style C1 fill:#BF092F,stroke:#132440,color:#fff
    style C2 fill:#BF092F,stroke:#132440,color:#fff
    style C3 fill:#BF092F,stroke:#132440,color:#fff

Key improvements in v2:

  • Single hierarchy — A process belongs to exactly one node, all resource controls applied at that node
  • No internal processes — Only leaf nodes can contain processes (simplifies resource distribution)
  • Delegation — Clean model for giving unprivileged users control over subtrees (critical for rootless containers)
  • Weight-based CPU — cpu.weight (1–10000, default 100) replaces the confusing cpu.shares (2–262144, default 1024)
  • memory.high — Soft throttling before the hard kill (v1 had no graceful degradation)

Pressure Stall Information (PSI)

One of v2's most valuable additions is PSI — real-time metrics showing how much time processes spend waiting for resources. PSI answers the question: "Are my containers experiencing resource pressure?"

# Read PSI metrics for a cgroup
cat /sys/fs/cgroup/docker/container-abc123/cpu.pressure
# Output:
# some avg10=4.56 avg60=2.34 avg300=1.12 total=567890
# full avg10=0.12 avg60=0.08 avg300=0.03 total=12345

# Interpretation:
# "some" = percentage of time at least ONE task is stalled waiting for CPU
# "full" = percentage of time ALL tasks are stalled (complete starvation)
# avg10/60/300 = exponential moving averages over 10s/60s/300s windows

cat /sys/fs/cgroup/docker/container-abc123/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# (healthy: no memory pressure)

cat /sys/fs/cgroup/docker/container-abc123/io.pressure
# some avg10=12.34 avg60=8.56 avg300=5.23 total=234567
# full avg10=2.10 avg60=1.45 avg300=0.89 total=45678
# (some I/O contention — processes spending ~12% of time waiting for disk)
PSI for Right-Sizing: PSI metrics are invaluable for container right-sizing. If a container shows zero CPU pressure, its CPU limits might be too generous (wasting capacity). If it shows persistent pressure (for example, above 20%), it is likely under-provisioned. Tools like the Kubernetes VPA (Vertical Pod Autoscaler) base their recommendations on observed usage; PSI complements them by telling you whether containers are actually experiencing resource contention rather than just sitting theoretically near their limits.
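For alerting, the 10-second average is usually the first value worth extracting from a pressure file. A minimal sketch with a hypothetical function name, fed with the sample cpu.pressure content from above:

```shell
# Hypothetical helper: print the avg10 value of the "some" or "full"
# line from a PSI file supplied on stdin.
# Usage: psi_avg10 some|full < pressure-file
psi_avg10() {
    awk -v kind="$1" '$1 == kind { split($2, kv, "="); print kv[2] }'
}

sample='some avg10=4.56 avg60=2.34 avg300=1.12 total=567890
full avg10=0.12 avg60=0.08 avg300=0.03 total=12345'

printf '%s\n' "$sample" | psi_avg10 some   # prints 4.56
printf '%s\n' "$sample" | psi_avg10 full   # prints 0.12
```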

How Docker Uses cgroups

Every docker run flag that controls resources maps directly to cgroup parameters. Understanding this mapping demystifies Docker's resource management — there is no magic, just cgroup files being written:

| Docker Flag | cgroup v2 File | Effect |
|---|---|---|
| --cpus=1.5 | cpu.max → "150000 100000" | 1.5 CPU cores (150ms per 100ms period) |
| --cpu-shares=512 | cpu.weight → proportional mapping | Relative CPU weight during contention |
| --cpuset-cpus="0,2" | cpuset.cpus → "0,2" | Pin to CPU cores 0 and 2 only |
| --memory=512m | memory.max → "536870912" | 512MB hard memory limit |
| --memory-reservation=256m | memory.low → "268435456" | 256MB soft reservation (reclaimed last while usage stays below it) |
| --memory-swap=1g | memory.swap.max → value minus --memory | Combined memory + swap ceiling; only the swap portion is written |
| --pids-limit=100 | pids.max → "100" | Maximum 100 processes |
| --device-read-bps /dev/sda:10mb | io.max → "8:0 rbps=10485760" | 10MB/s read limit on /dev/sda |
| --device-write-iops /dev/sda:100 | io.max → "8:0 wiops=100" | 100 write IOPS on /dev/sda |

# Run a container with resource limits
docker run -d --name resource-test \
    --cpus=0.5 \
    --memory=256m \
    --memory-swap=256m \
    --pids-limit=50 \
    nginx:alpine

# Verify the cgroup settings
CONTAINER_ID=$(docker inspect --format '{{.Id}}' resource-test)

# On cgroups v2 with systemd driver:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# Output: 50000 100000

cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# Output: 268435456

cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max
# Output: 50

# View real-time resource usage
docker stats resource-test --no-stream
# Output:
# CONTAINER ID  NAME           CPU %  MEM USAGE / LIMIT  MEM %  NET I/O  BLOCK I/O  PIDS
# abc123def456  resource-test  0.01%  2.4MiB / 256MiB    0.94%  656B/0B  0B/0B      2

# Clean up
docker rm -f resource-test
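The "proportional mapping" from --cpu-shares to cpu.weight in the table above can be illustrated with the linear conversion that OCI runtimes have commonly used to map the v1 range (2–262144) onto the v2 range (1–10000). Treat the formula as an assumption about your runtime: newer runtime versions may use a different curve, so check the actual cpu.weight file when the exact value matters.

```shell
# Sketch of a linear cpu.shares -> cpu.weight conversion
# (assumed formula; verify against your container runtime).
shares_to_weight() {
    echo $((1 + ($1 - 2) * 9999 / 262142))
}

shares_to_weight 1024   # prints 39
shares_to_weight 512    # prints 20
shares_to_weight 2      # prints 1
```

Under this mapping the v1 default of 1024 does not land on the v2 default of 100, which is one reason weights read from a container's cgroup can look surprising.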

Kubernetes Connection — Requests and Limits

Kubernetes exposes cgroup controls through its resource requests and limits model. Every container spec in a Pod can declare CPU and memory requests (minimum guaranteed resources) and limits (maximum allowed resources). These map directly to cgroup parameters:

The Kubernetes Resource Model: A request is the minimum resource guarantee — the scheduler uses it to decide which node can run the Pod. A limit is the maximum — enforced by cgroups at runtime. Setting requests without limits creates "burstable" Pods that can use more resources when available. Setting requests equal to limits creates "guaranteed" Pods with predictable performance but no elasticity.
# Kubernetes Pod spec with resource constraints
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:alpine
    resources:
      requests:
        cpu: "250m"        # 0.25 CPU cores (maps to cpu.weight)
        memory: "128Mi"    # 128 MiB minimum (scheduling decision)
      limits:
        cpu: "500m"        # 0.5 CPU cores (maps to cpu.max: "50000 100000")
        memory: "256Mi"    # 256 MiB maximum (maps to memory.max: "268435456")

The mapping between Kubernetes resources and cgroup files:

| Kubernetes Setting | cgroup Effect | What Happens If Exceeded |
|---|---|---|
| resources.requests.cpu | cpu.weight (proportional share) | N/A — requests are guarantees, not limits |
| resources.limits.cpu | cpu.max (hard cap) | Process is throttled (slowed down) |
| resources.requests.memory | Scheduling decision only | N/A — used for node placement |
| resources.limits.memory | memory.max (hard cap) | Container is OOM-killed and restarted |
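The CPU rows of this mapping reduce to two multiplications. The sketch below follows the arithmetic commonly described for the kubelet (shares = millicores × 1024 / 1000 with a floor of 2, quota = millicores × period / 1000); the function names are hypothetical and the exact rounding in your Kubernetes version is an assumption.

```shell
# Hypothetical helpers: convert Kubernetes millicores to cgroup values.
millicpu_to_shares() {
    shares=$(($1 * 1024 / 1000))
    if [ "$shares" -lt 2 ]; then
        shares=2            # assumed minimum share value
    fi
    echo "$shares"
}

millicpu_to_cpu_max() {
    period=${2:-100000}     # default period: 100ms in microseconds
    echo "$(($1 * period / 1000)) $period"
}

millicpu_to_shares 250    # prints 256
millicpu_to_cpu_max 500   # prints 50000 100000
```

These match the Pod spec above: the 250m request becomes a share value (later converted to cpu.weight on v2) and the 500m limit becomes the cpu.max string "50000 100000".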
Production Pattern

QoS Classes in Kubernetes

Kubernetes assigns Quality of Service (QoS) classes based on how resources are configured:

  • Guaranteed — requests == limits for both CPU and memory in every container. Receives the strongest protection from the OOM killer and is the last to be evicted.
  • Burstable — requests < limits (or only requests set). Can use more than requested when available. Medium eviction priority.
  • BestEffort — No requests or limits set. Gets whatever is left over. First to be evicted under pressure.

For production workloads, always set both requests and limits. For batch/background jobs that can tolerate disruption, BestEffort can maximise cluster utilisation at the cost of predictability.
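The classification rules above can be written down as a small decision function. This sketch handles a single-container Pod only, compares quantities as plain strings (so "500m" and "0.5" would not be treated as equal), and uses a hypothetical function name.

```shell
# Hypothetical helper: QoS class for a single-container Pod.
# Usage: qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM   (pass "" for unset)
qos_class() {
    cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4
    if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
        echo "BestEffort"
        return
    fi
    # An unset request defaults to the corresponding limit
    if [ -z "$cpu_req" ]; then cpu_req=$cpu_lim; fi
    if [ -z "$mem_req" ]; then mem_req=$mem_lim; fi
    if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
        echo "Guaranteed"
    else
        echo "Burstable"
    fi
}

qos_class 250m 500m 128Mi 256Mi   # prints Burstable
qos_class 500m 500m 256Mi 256Mi   # prints Guaranteed
qos_class "" "" "" ""             # prints BestEffort
```

The first call corresponds to the resource-demo Pod shown earlier, which is Burstable because its requests are lower than its limits.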


Exercises

  1. CPU Throttling Observation — Run docker run --cpus=0.25 --rm alpine sh -c 'dd if=/dev/zero of=/dev/null bs=1M' and simultaneously run docker stats in another terminal. Observe the CPU percentage staying at ~25%. Now remove the --cpus flag and observe the difference.
  2. Memory OOM Experiment — Run a container with --memory=32m and attempt to allocate more memory using dd if=/dev/zero of=/dev/shm/test bs=1M count=64. Observe the OOM kill. Check docker inspect for the OOM-killed status.
  3. cgroup Filesystem Exploration — On your Linux system (or in a VM), explore /sys/fs/cgroup/. Start a Docker container, find its cgroup directory, and manually read cpu.max, memory.max, pids.max, and memory.current. Correlate with what docker stats reports.
  4. PSI Monitoring — If you have a cgroups v2 system, run a CPU-intensive container with tight CPU limits and monitor cpu.pressure in its cgroup directory. Then read nr_periods and nr_throttled from cpu.stat and calculate the percentage of periods in which the container was throttled.
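For the last exercise, one way to get a throttling percentage is to divide nr_throttled by nr_periods from cpu.stat. The sketch below uses a hypothetical function name and the sample counters shown in the CPU section.

```shell
# Hypothetical helper: percentage of scheduling periods in which the
# cgroup was throttled, from cpu.stat content supplied on stdin.
throttled_percent() {
    awk '$1 == "nr_periods"   { p = $2 }
         $1 == "nr_throttled" { t = $2 }
         END { if (p > 0) printf "%d\n", t * 100 / p; else print 0 }'
}

printf 'nr_periods 456\nnr_throttled 123\n' | throttled_percent
# prints 26

# On a live host (path depends on your cgroup driver):
#   throttled_percent < /sys/fs/cgroup/<container-cgroup>/cpu.stat
```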

Conclusion & Next Steps

Control groups complete the container isolation picture. Together with namespaces from Part 2:

  • Namespaces control visibility — what a process can see (processes, network, filesystem, hostname, IPC, users)
  • cgroups control consumption — what a process can use (CPU time, memory, disk I/O, process count)

Key takeaways from this article:

  • cgroups organise processes into a hierarchy with per-group resource limits and accounting
  • cgroups v2 provides a unified hierarchy, cleaner interface, and PSI metrics
  • CPU limits use quotas (hard caps) and weights (relative priority during contention)
  • Memory is non-compressible — exceeding limits triggers the OOM killer
  • Every Docker --cpus, --memory, --pids-limit flag maps to a cgroup file
  • Kubernetes requests and limits are the user-facing abstraction over cgroups

With namespaces and cgroups understood, the final kernel-level building block is the filesystem layer — how containers get their own root filesystem efficiently using union filesystems and copy-on-write.

Next in the Series

In Part 4: Union File Systems & Image Layering, we will explore how OverlayFS and copy-on-write semantics enable container images to be built in layers — making them space-efficient, fast to distribute, and sharable across containers.