What Are Control Groups?
Control groups (cgroups) are a Linux kernel feature that organises processes into hierarchical groups whose resource usage can be limited, monitored, and accounted for. Originally developed by Google engineers Paul Menage and Rohit Seth in 2006 (initially called "process containers"), cgroups were merged into the Linux kernel in version 2.6.24 (January 2008).
While namespaces create the illusion of isolation by controlling visibility, cgroups provide actual physical resource constraints. A process in a PID namespace might not be able to see other processes, but without cgroups it could still consume 100% of the CPU, allocate all available memory, or saturate disk I/O — effectively crashing the entire host and all other containers.
cgroups provide four key capabilities:
- Resource limiting — Set hard caps on CPU time, memory usage, I/O bandwidth, and process count
- Prioritisation — Give some groups more resources than others when there is contention
- Accounting — Track how much of each resource a group has consumed (for billing, monitoring, capacity planning)
- Control — Freeze, checkpoint, restart, or kill all processes in a group atomically
Hierarchical Organisation
cgroups are organised in a tree structure. Each node in the tree is a group, and child groups are bound by their parents' limits: a child can set a stricter limit, but it can never use more than its parent allows. This hierarchy maps naturally to container orchestration: the system gets the root cgroup, each container runtime gets a child cgroup, and each container gets a grandchild cgroup.
# View the cgroup hierarchy on a modern Linux system (cgroups v2)
# The filesystem is mounted at /sys/fs/cgroup
ls /sys/fs/cgroup/
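# Output: cgroup.controllers cgroup.max.depth cgroup.procs cgroup.subtree_control cpu.stat system.slice user.slice ...
# (limit files such as cpu.max and memory.max exist only in non-root cgroups)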
# View the tree structure
find /sys/fs/cgroup -maxdepth 3 -name "cgroup.procs" | head -20
# /sys/fs/cgroup/cgroup.procs
# /sys/fs/cgroup/system.slice/cgroup.procs
# /sys/fs/cgroup/system.slice/docker.service/cgroup.procs
# /sys/fs/cgroup/system.slice/ssh.service/cgroup.procs
# /sys/fs/cgroup/user.slice/cgroup.procs
# See which cgroup a specific process belongs to
cat /proc/self/cgroup
# Output (v2): 0::/user.slice/user-1000.slice/session-1.scope
# See which cgroup a Docker container's process belongs to
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' my-container)
cat /proc/$CONTAINER_PID/cgroup
# Output (v2): 0::/system.slice/docker-abc123...def.scope
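The v2 membership line shown above (0::<path>) can be turned into a directory under /sys/fs/cgroup. The following is a minimal sketch, assuming a pure cgroups v2 host mounted at /sys/fs/cgroup; the function name cgroup_dir_from_line is hypothetical and not part of any tool.

```sh
# Hypothetical helper: convert a cgroups v2 membership line ("0::/path")
# into the matching directory under /sys/fs/cgroup.
# Assumes a pure v2 host; v1 and hybrid hosts print several lines instead.
cgroup_dir_from_line() {
    line=$1
    case $line in
        0::*) printf '/sys/fs/cgroup%s\n' "${line#0::}" ;;
        *)    echo "not a cgroups v2 entry: $line" >&2; return 1 ;;
    esac
}

# Example usage on a live system (not run here):
#   cgroup_dir_from_line "$(cat /proc/self/cgroup)"
```

Listing the resulting directory shows the control files that apply to that process.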
cgroups v1 vs v2 — Understanding the Transition
The Linux kernel has two implementations of cgroups that differ significantly in architecture. Understanding both is essential because production systems still run both versions, and Docker/Kubernetes behave differently depending on which is available.
| Feature | cgroups v1 | cgroups v2 |
|---|---|---|
| Kernel version | 2.6.24 (2008) | 4.5 (2016), mature by 5.x |
| Hierarchy | Multiple hierarchies (one per controller) | Single unified hierarchy |
| Mount point | /sys/fs/cgroup/cpu, /sys/fs/cgroup/memory, etc. | /sys/fs/cgroup (single mount) |
| Process membership | Process can be in different groups per controller | Process in exactly one group for all controllers |
| Thread support | Threads of one process can be placed in different cgroups | Process granularity by default; opt-in threaded mode for selected controllers |
| Delegation | Complex, security issues | Clean delegation model for rootless containers |
| PSI (Pressure Stall Info) | Not available | Built-in resource pressure monitoring |
| Memory controller | memory.limit_in_bytes | memory.max, memory.high |
| CPU controller | cpu.shares, cpu.cfs_quota_us | cpu.weight, cpu.max |
Migration Status (2026)
The industry is still transitioning from v1 to v2, so check which version a host runs before relying on v2-only features:
# Check which cgroup version your system uses
stat -fc %T /sys/fs/cgroup/
# Output: "cgroup2fs" = v2, "tmpfs" = v1 (or hybrid)
# Alternative check
mount | grep cgroup
# v2: "cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)"
# v1: multiple lines like "cgroup on /sys/fs/cgroup/cpu type cgroup (rw,...,cpu)"
# Check Docker's cgroup driver
docker info | grep -i cgroup
# Output: Cgroup Driver: systemd (or cgroupfs)
# Cgroup Version: 2
CPU Limits — Controlling Compute Time
CPU limiting is the most commonly used cgroup feature. It comes in two flavours: relative weights (how CPU is shared when there is contention) and hard limits (absolute caps on CPU time regardless of availability).
The key CPU control parameters are:
| Parameter (v1) | Parameter (v2) | Purpose | Default |
|---|---|---|---|
| cpu.shares | cpu.weight | Relative weight for fair scheduling under contention | 1024 (v1) / 100 (v2) |
| cpu.cfs_period_us | cpu.max (period component) | Length of the scheduling period in microseconds | 100000 (100ms) |
| cpu.cfs_quota_us | cpu.max (quota component) | Maximum CPU time allowed per period | -1 (v1) / max (v2), both meaning unlimited |
| cpuset.cpus | cpuset.cpus | Pin processes to specific CPU cores | All CPUs |
The difference between shares/weight and quota/max is critical:
- Shares (relative) — Only matter when CPUs are busy. If your container has 512 shares and another has 1024, the second gets twice the CPU time during contention. But if the system is idle, both can use 100% of available CPU.
- Quota (absolute) — Hard limit regardless of system load. If you set a quota of 50000 per 100000 period (0.5 CPUs), your container will be throttled at 50% of one core even if every other core on the host sits completely idle.
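The arithmetic behind the two models can be sketched in a few lines of shell. This is an illustration only: the function names are hypothetical, and the share calculation assumes every group is competing for CPU at the same time.

```sh
# Hypothetical helpers illustrating the two CPU control models.

# Percentage of CPU time a group receives under full contention, given
# its own weight/shares and the sum of all competing weights/shares.
contended_share_pct() {
    awk -v mine="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * mine / total }'
}

# Number of CPUs a quota/period pair allows, regardless of contention.
quota_cpus() {
    awk -v quota="$1" -v period="$2" 'BEGIN { printf "%.2f\n", quota / period }'
}

contended_share_pct 512 1536    # 512 vs 1024 shares -> 33.3
contended_share_pct 1024 1536   # -> 66.7
quota_cpus 50000 100000         # -> 0.50
```

The share percentages only describe behaviour under contention; the quota result holds whether or not the host is busy.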
Hands-On: Setting CPU Limits
# === cgroups v2 example (modern systems) ===
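# Create a new cgroup
sudo mkdir /sys/fs/cgroup/my-container
# Make sure the cpu and cpuset controllers are enabled for child cgroups
echo "+cpu +cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control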
# Set CPU limit: 50% of one CPU (50ms quota per 100ms period)
echo "50000 100000" | sudo tee /sys/fs/cgroup/my-container/cpu.max
# Format: $QUOTA $PERIOD (in microseconds)
# 50000/100000 = 0.5 CPUs
# For 2 full CPUs: 200000 100000 (200ms per 100ms period)
# For 0.25 CPUs: 25000 100000 (25ms per 100ms period)
# Set CPU weight (relative priority, 1-10000, default 100)
echo "200" | sudo tee /sys/fs/cgroup/my-container/cpu.weight
# Pin to specific CPUs (cores 0 and 1 only)
echo "0-1" | sudo tee /sys/fs/cgroup/my-container/cpuset.cpus
# Add current shell to the cgroup
echo $$ | sudo tee /sys/fs/cgroup/my-container/cgroup.procs
# Run a CPU-intensive task and observe throttling
dd if=/dev/zero of=/dev/null bs=1M &
# Check: the task will be limited to 50% of one CPU
# View throttling statistics
cat /sys/fs/cgroup/my-container/cpu.stat
# Output:
# usage_usec 1234567 (total CPU time consumed)
# user_usec 1200000 (user-space CPU time)
# system_usec 34567 (kernel-space CPU time)
# nr_periods 456 (scheduling periods elapsed)
# nr_throttled 123 (periods where throttling occurred)
# throttled_usec 789000 (total time spent throttled)
# Clean up
kill %1
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/my-container
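The counters in cpu.stat are easier to interpret as a ratio. Below is a small sketch that reads cpu.stat content on standard input and reports the share of scheduling periods in which throttling occurred; the function name throttled_period_pct is hypothetical.

```sh
# Hypothetical helper: read cpu.stat content on stdin and print the
# percentage of scheduling periods in which the group was throttled.
throttled_period_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else             print "0.0"
        }
    '
}

# Example with the sample figures shown above (123 of 456 periods):
printf 'nr_periods 456\nnr_throttled 123\n' | throttled_period_pct   # -> 27.0

# On a live system (path is the example cgroup from above):
#   throttled_period_pct < /sys/fs/cgroup/my-container/cpu.stat
```

A persistently high percentage suggests the quota is too tight for the workload.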
The CPU Throttling Latency Problem
Setting aggressive CPU quotas can cause unexpected latency spikes. Consider a container with a quota of 10ms per 100ms period. If it receives a burst of requests and two threads handle them in parallel, it exhausts its 10ms quota in the first 5ms of the period and is throttled for the remaining 95ms. Any request still in flight waits almost a tenth of a second before it can make progress. This is a common reason to avoid setting CPU limits too low for latency-sensitive services in Kubernetes. Some teams prefer CPU requests without hard limits, relying on the scheduler's fair sharing instead of hard throttling.
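The stall in this scenario can be worked out directly. The sketch below assumes that all runnable threads burn CPU continuously from the start of the period, which is the worst case; the 5ms figure corresponds to two threads running in parallel. The function name is hypothetical.

```sh
# Hypothetical helper: worst-case stall per period, in milliseconds.
# Arguments: quota_ms period_ms runnable_threads
# Assumes all threads burn CPU continuously from the start of the period.
worst_case_stall_ms() {
    awk -v q="$1" -v p="$2" -v n="$3" 'BEGIN {
        busy = q / n              # wall-clock time until the quota is spent
        if (busy > p) busy = p    # quota not exhausted within the period
        printf "%.1f\n", p - busy
    }'
}

worst_case_stall_ms 10 100 2    # -> 95.0
worst_case_stall_ms 10 100 1    # -> 90.0
```

More threads spend the quota faster, so the stall grows with parallelism even though the quota itself is unchanged.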
Memory Limits — Preventing Resource Exhaustion
Memory is a non-compressible resource — unlike CPU (where a process just runs slower when throttled), when memory runs out, something must give. The system must either deny the allocation (returning an error to the application) or kill a process to free memory. This makes memory limiting both critically important and potentially dangerous.
The key memory control parameters are:
| Parameter (v1) | Parameter (v2) | Purpose |
|---|---|---|
| memory.limit_in_bytes | memory.max | Hard limit — OOM killer triggered if exceeded |
| memory.soft_limit_in_bytes | memory.high | Soft limit — kernel reclaims memory aggressively (in v2, allocations above memory.high are also throttled) |
| memory.memsw.limit_in_bytes | memory.swap.max | Swap limit (v1: memory + swap combined; v2: swap only) |
| memory.usage_in_bytes | memory.current | Current memory usage (read-only) |
| memory.oom_control | memory.oom.group | OOM killer behaviour control (v1: enable or disable the OOM killer; v2: kill the whole group as a unit) |
# === Setting memory limits (cgroups v2) ===
# Create a cgroup with 256MB memory limit
sudo mkdir /sys/fs/cgroup/mem-test
# Set hard limit (256 MB)
echo "268435456" | sudo tee /sys/fs/cgroup/mem-test/memory.max
# Set soft limit (200 MB) — kernel starts reclaiming at this point
echo "209715200" | sudo tee /sys/fs/cgroup/mem-test/memory.high
# Disable swap for this cgroup (force OOM rather than slow swap)
echo "0" | sudo tee /sys/fs/cgroup/mem-test/memory.swap.max
# Add current shell
echo $$ | sudo tee /sys/fs/cgroup/mem-test/cgroup.procs
# Monitor memory usage
cat /sys/fs/cgroup/mem-test/memory.current
# Output: current usage in bytes
# View detailed memory statistics
cat /sys/fs/cgroup/mem-test/memory.stat
# Output includes: anon, file, kernel, slab, sock, shmem, mapped_file, etc.
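The byte values written above are binary multiples (256 MB here means 256 × 1024 × 1024 bytes). A small conversion helper avoids arithmetic slips; this is a sketch with a hypothetical function name that handles only whole numbers with a K, M, or G suffix.

```sh
# Hypothetical helper: convert a size such as 256M or 1G to bytes
# (binary units: K = 1024, M = 1024^2, G = 1024^3).
# Input without a suffix is returned unchanged.
to_bytes() {
    num=${1%[KkMmGg]}
    case $1 in
        *[Kk]) echo $((num * 1024)) ;;
        *[Mm]) echo $((num * 1024 * 1024)) ;;
        *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 256M   # -> 268435456
to_bytes 200M   # -> 209715200
```

The two results match the values written to memory.max and memory.high in the example above.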
The OOM Killer — When Memory Runs Out
When a cgroup reaches its memory.max limit and no memory can be reclaimed, the kernel's Out-of-Memory (OOM) killer activates. It selects a process within the cgroup to kill, freeing memory for the survivors. In container contexts, this usually means the container's main process is killed, so the container exits and is restarted only if a restart policy is configured.
Triggering the OOM Killer
You can safely observe the OOM killer in action with Docker:
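# Run a container with 64MB memory limit
# --memory-swap equal to --memory disables swap, so the limit cannot be dodged by swapping
docker run --rm --memory=64m --memory-swap=64m --name oom-test alpine sh -c '
echo "Allocating memory until OOM..."
# Allocate memory in 10MB chunks (files in /dev/shm live in RAM and count against the limit)
i=0
while true; do
  # Stop as soon as dd fails or is killed
  dd if=/dev/zero of=/dev/shm/block$i bs=10M count=1 2>/dev/null || exit 1
  i=$((i+1))
  echo "Allocated $((i*10))MB"
done
'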
# Output:
# Allocating memory until OOM...
# Allocated 10MB
# Allocated 20MB
# Allocated 30MB
# Allocated 40MB
# Allocated 50MB
# Killed
# Check Docker events
docker events --filter event=oom --since 1m
# Output: container oom abc123... (image=alpine, name=oom-test)
In production, watch the cgroup's memory.events file (which counts OOM events) and log when workloads approach their limits. In Kubernetes, an OOM-killed container gets status OOMKilled and is restarted according to the Pod's restartPolicy. Monitor for container_memory_working_set_bytes approaching limits in your alerting.
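Reading the OOM counter from memory.events takes one line of awk. The sketch below assumes the cgroups v2 "key value" layout of that file; the function name is hypothetical and the sample values are invented for illustration.

```sh
# Hypothetical helper: read memory.events content on stdin and print
# the oom_kill counter (0 if the key is absent).
oom_kill_count() {
    awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }'
}

# Sample content in the cgroups v2 "key value" format (values invented):
printf 'low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n' | oom_kill_count   # -> 1

# On a live system (path is the example cgroup used earlier):
#   oom_kill_count < /sys/fs/cgroup/mem-test/memory.events
```

Polling this counter and alerting when it increases is one simple way to notice OOM kills that would otherwise only show up as restarts.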
I/O Limits — Controlling Disk Bandwidth
The I/O controller (called blkio in cgroups v1, io in v2) limits the rate at which a cgroup can read from and write to block devices. Without I/O limits, a single container performing heavy disk operations (like a database backup or log rotation) can starve other containers of disk bandwidth, causing latency spikes across the entire host.
I/O limiting works on a per-device basis — you specify limits for specific block devices (identified by major:minor numbers):
# === I/O limiting (cgroups v2) ===
# Find the major:minor number of your disk
lsblk -o NAME,MAJ:MIN
# Output:
# NAME MAJ:MIN
# sda 8:0
# ├─sda1 8:1
# └─sda2 8:2
# Create a cgroup with I/O limits
sudo mkdir /sys/fs/cgroup/io-test
# Set read bandwidth limit: 10MB/s on device 8:0
echo "8:0 rbps=10485760" | sudo tee /sys/fs/cgroup/io-test/io.max
# Set write bandwidth limit: 5MB/s on device 8:0
echo "8:0 wbps=5242880" | sudo tee /sys/fs/cgroup/io-test/io.max
# Set IOPS limits (operations per second)
echo "8:0 riops=1000 wiops=500" | sudo tee /sys/fs/cgroup/io-test/io.max
# Combined: set all limits at once
echo "8:0 rbps=10485760 wbps=5242880 riops=1000 wiops=500" | \
sudo tee /sys/fs/cgroup/io-test/io.max
# With Docker — limit write speed to 10MB/s
docker run --rm --device-write-bps /dev/sda:10mb alpine \
sh -c 'dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct'
# Without limit: ~200MB/s
# With limit: ~10MB/s (throttled)
# Monitor I/O statistics
cat /sys/fs/cgroup/io-test/io.stat
# Output: 8:0 rbytes=1048576 wbytes=524288 rios=100 wios=50 dbytes=0 dios=0
I/O limits are most reliable for direct I/O (the O_DIRECT flag). Buffered I/O goes through the page cache, and the kernel may not attribute cache writebacks to the correct cgroup accurately. This means a container doing buffered writes might appear to exceed its I/O limits because the actual disk writes happen asynchronously in kernel context. For precise I/O control, configure your applications to use direct I/O or accept that buffered I/O limits are approximate.
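Because io.max takes raw byte counts keyed by device number, a small helper can build the line from friendlier units. This is a sketch only: the function name is hypothetical, and it covers just the rbps and wbps keys with limits given in MiB/s.

```sh
# Hypothetical helper: build an io.max line from a major:minor device
# number and read/write bandwidth limits given in MiB/s.
io_max_line() {
    dev=$1 read_mib=$2 write_mib=$3
    printf '%s rbps=%d wbps=%d\n' "$dev" \
        $((read_mib * 1024 * 1024)) $((write_mib * 1024 * 1024))
}

io_max_line 8:0 10 5    # -> 8:0 rbps=10485760 wbps=5242880

# On a live system (path is the example cgroup from above):
#   io_max_line 8:0 10 5 | sudo tee /sys/fs/cgroup/io-test/io.max
```

Generating the line in one place keeps the device number and the limits consistent across scripts.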
Process Limits — Fork Bomb Protection
The PID controller limits the number of processes (and threads) that can exist within a cgroup. This is the primary defence against fork bombs — malicious or buggy code that recursively creates processes until the system's process table is exhausted, crashing the entire host.
# === Process limits (cgroups v2) ===
# Set maximum number of processes in a cgroup
sudo mkdir /sys/fs/cgroup/pid-test
echo "100" | sudo tee /sys/fs/cgroup/pid-test/pids.max
# Current process count
cat /sys/fs/cgroup/pid-test/pids.current
# Output: 0 (no processes assigned yet)
# With Docker — limit to 50 processes
docker run --rm --pids-limit=50 alpine sh -c '
echo "Attempting to create processes..."
for i in $(seq 1 100); do
  # Print before forking: $? after "cmd &" is always 0, so it cannot detect a failed fork
  echo "Starting process $i"
  sleep 60 &
done
'
# Output: the shell reports a fork error ("Resource temporarily unavailable") close to process 50
# (the limit also counts the container's own shell)
# The classic fork bomb (DO NOT RUN without limits!)
# :(){ :|:& };:
# With pids.max set, this is safely contained — it hits the limit and stops
Fork Bomb Containment
Without process limits, a single compromised container running :(){ :|:& };: (the classic bash fork bomb) can exhaust the host's process ID space (typically 32,768 or 4,194,304 depending on kernel.pid_max). This prevents ALL processes on the host from forking — including the container runtime, SSH daemon, and monitoring agents. With pids.max set, the fork bomb is contained to its cgroup. It fills up its 100-process allocation and stops. Other containers and the host are completely unaffected.
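For monitoring, the useful number is how many process slots a cgroup has left. The sketch below compares the values of pids.current and pids.max; the function name is hypothetical, and it relies on pids.max containing the literal string "max" when no limit is set.

```sh
# Hypothetical helper: remaining process slots given the contents of
# pids.current and pids.max.
# pids.max contains the literal string "max" when no limit is set.
pids_headroom() {
    current=$1 limit=$2
    if [ "$limit" = "max" ]; then
        echo "unlimited"
    else
        echo $((limit - current))
    fi
}

pids_headroom 48 50     # -> 2
pids_headroom 10 max    # -> unlimited

# On a live system (path is the example cgroup from above):
#   pids_headroom "$(cat /sys/fs/cgroup/pid-test/pids.current)" \
#                 "$(cat /sys/fs/cgroup/pid-test/pids.max)"
```

A headroom that stays near zero means legitimate forks are likely to start failing, not only fork bombs.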
cgroups v2 — The Unified Hierarchy
cgroups v2 was designed to fix the architectural problems of v1. The most important change is the unified hierarchy — instead of separate filesystem trees for each controller (cpu, memory, io, pids), there is a single tree where all controllers are managed together.
flowchart TD
ROOT["/sys/fs/cgroup (root)"] --> SYSTEM["system.slice"]
ROOT --> USER["user.slice"]
ROOT --> DOCKER["docker"]
SYSTEM --> SSH["ssh.service"]
SYSTEM --> NGINX["nginx.service"]
DOCKER --> C1["container-abc123"]
DOCKER --> C2["container-def456"]
DOCKER --> C3["container-ghi789"]
C1 --- C1R["cpu.max: 100000 100000
memory.max: 512M
pids.max: 200"]
C2 --- C2R["cpu.max: 50000 100000
memory.max: 256M
pids.max: 100"]
C3 --- C3R["cpu.max: 200000 100000
memory.max: 1G
pids.max: 500"]
style ROOT fill:#132440,stroke:#3B9797,color:#fff
style SYSTEM fill:#16476A,stroke:#3B9797,color:#fff
style USER fill:#16476A,stroke:#3B9797,color:#fff
style DOCKER fill:#3B9797,stroke:#132440,color:#fff
style C1 fill:#BF092F,stroke:#132440,color:#fff
style C2 fill:#BF092F,stroke:#132440,color:#fff
style C3 fill:#BF092F,stroke:#132440,color:#fff
Key improvements in v2:
- Single hierarchy — A process belongs to exactly one node, all resource controls applied at that node
- No internal processes — Only leaf nodes can contain processes (simplifies resource distribution)
- Delegation — Clean model for giving unprivileged users control over subtrees (critical for rootless containers)
- Weight-based CPU — cpu.weight (1–10000, default 100) replaces the confusing cpu.shares (2–262144, default 1024)
- memory.high — Soft throttling before the hard kill (v1 had no graceful degradation)
Pressure Stall Information (PSI)
One of v2's most valuable additions is PSI — real-time metrics showing how much time processes spend waiting for resources. PSI answers the question: "Are my containers experiencing resource pressure?"
# Read PSI metrics for a cgroup
cat /sys/fs/cgroup/docker/container-abc123/cpu.pressure
# Output:
# some avg10=4.56 avg60=2.34 avg300=1.12 total=567890
# full avg10=0.12 avg60=0.08 avg300=0.03 total=12345
# Interpretation:
# "some" = percentage of time at least ONE task is stalled waiting for CPU
# "full" = percentage of time ALL tasks are stalled (complete starvation)
# avg10/60/300 = exponential moving averages over 10s/60s/300s windows
cat /sys/fs/cgroup/docker/container-abc123/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# (healthy: no memory pressure)
cat /sys/fs/cgroup/docker/container-abc123/io.pressure
# some avg10=12.34 avg60=8.56 avg300=5.23 total=234567
# full avg10=2.10 avg60=1.45 avg300=0.89 total=45678
# (some I/O contention — processes spending ~12% of time waiting for disk)
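PSI files are plain text, so individual values are easy to extract for alerting. The following sketch pulls one field from the "some" or "full" line of a pressure file read on standard input; the function name psi_field is hypothetical.

```sh
# Hypothetical helper: read a *.pressure file on stdin and print one field.
# Arguments: line kind ("some" or "full") and field name
# (avg10, avg60, avg300, or total).
psi_field() {
    awk -v kind="$1" -v field="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == field) print kv[2]
            }
        }
    '
}

# Example with the sample cpu.pressure output shown above:
printf 'some avg10=4.56 avg60=2.34 avg300=1.12 total=567890\nfull avg10=0.12 avg60=0.08 avg300=0.03 total=12345\n' |
    psi_field some avg10    # -> 4.56
```

Comparing the avg10 value with a threshold gives a simple short-term pressure alarm, while avg300 is better suited to trend dashboards.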
How Docker Uses cgroups
Every docker run flag that controls resources maps directly to cgroup parameters. Understanding this mapping demystifies Docker's resource management — there is no magic, just cgroup files being written:
| Docker Flag | cgroup v2 File | Effect |
|---|---|---|
| --cpus=1.5 | cpu.max → "150000 100000" | 1.5 CPU cores (150ms per 100ms period) |
| --cpu-shares=512 | cpu.weight → proportional mapping | Relative CPU weight during contention |
| --cpuset-cpus="0,2" | cpuset.cpus → "0,2" | Pin to CPU cores 0 and 2 only |
| --memory=512m | memory.max → "536870912" | 512MB hard memory limit |
| --memory-reservation=256m | memory.low → "268435456" | 256MB soft limit (memory below this is protected from reclaim) |
| --memory-swap=1g | memory.swap.max → memory-swap minus memory | Total memory + swap limit; the swap allowance is what remains after --memory |
| --pids-limit=100 | pids.max → "100" | Maximum 100 processes |
| --device-read-bps /dev/sda:10mb | io.max → "8:0 rbps=10485760" | 10MB/s read limit on /dev/sda |
| --device-write-iops /dev/sda:100 | io.max → "8:0 wiops=100" | 100 write IOPS on /dev/sda |
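Two of these conversions are simple enough to reproduce by hand. The sketch below uses hypothetical function names. The --cpus conversion assumes the default 100000-microsecond period. The shares-to-weight conversion is an assumption: it is a linear mapping from the v1 range to the v2 range, and the exact formula can differ between runtime versions, so verify it against your own runtime.

```sh
# Hypothetical helpers reproducing two of the conversions in the table.

# --cpus=N  ->  cpu.max value, assuming the default 100000us period.
cpus_to_cpu_max() {
    awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

# --cpu-shares=N  ->  cpu.weight, using a linear mapping from the v1
# range [2, 262144] to the v2 range [1, 10000].
# ASSUMPTION: the exact formula depends on the runtime version.
shares_to_weight() {
    echo $((1 + ($1 - 2) * 9999 / 262142))
}

cpus_to_cpu_max 1.5       # -> 150000 100000
shares_to_weight 2        # -> 1
shares_to_weight 262144   # -> 10000
```

Reading cpu.max and cpu.weight in a running container's cgroup directory shows what your runtime actually wrote.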
# Run a container with resource limits
docker run -d --name resource-test \
--cpus=0.5 \
--memory=256m \
--memory-swap=256m \
--pids-limit=50 \
nginx:alpine
# Verify the cgroup settings
CONTAINER_ID=$(docker inspect --format '{{.Id}}' resource-test)
# On cgroups v2 with systemd driver:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# Output: 50000 100000
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# Output: 268435456
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max
# Output: 50
# View real-time resource usage
docker stats resource-test --no-stream
# Output:
# CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
# abc123def456 resource-test 0.01% 2.4MiB / 256MiB 0.94% 656B/0B 0B/0B 2
# Clean up
docker rm -f resource-test
Kubernetes Connection — Requests and Limits
Kubernetes exposes cgroup controls through its resource requests and limits model. Every container spec in a Pod can declare CPU and memory requests (minimum guaranteed resources) and limits (maximum allowed resources). These map directly to cgroup parameters:
# Kubernetes Pod spec with resource constraints
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:alpine
    resources:
      requests:
        cpu: "250m" # 0.25 CPU cores (maps to cpu.weight)
        memory: "128Mi" # 128 MiB minimum (scheduling decision)
      limits:
        cpu: "500m" # 0.5 CPU cores (maps to cpu.max: "50000 100000")
        memory: "256Mi" # 256 MiB maximum (maps to memory.max: "268435456")
The mapping between Kubernetes resources and cgroup files:
| Kubernetes Setting | cgroup Effect | What Happens If Exceeded |
|---|---|---|
| resources.requests.cpu | cpu.weight (proportional share) | N/A — requests are guarantees, not limits |
| resources.limits.cpu | cpu.max (hard cap) | Process is throttled (slowed down) |
| resources.requests.memory | Scheduling decision only | N/A — used for node placement |
| resources.limits.memory | memory.max (hard cap) | Container is OOM-killed and restarted |
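The CPU rows of this mapping are plain integer arithmetic. The sketch below uses hypothetical function names and assumes the default 100000-microsecond period, with 1000 millicores treated as 1024 v1-style shares (floor of 2), the intermediate value from which cpu.weight is then derived.

```sh
# Hypothetical helpers for the CPU rows of the table.

# CPU limit in millicores -> cpu.max, assuming the default 100000us period.
millicores_to_cpu_max() {
    echo "$(($1 * 100000 / 1000)) 100000"
}

# CPU request in millicores -> v1-style cpu.shares
# (1000m == 1024 shares, with a floor of 2).
millicores_to_shares() {
    shares=$(($1 * 1024 / 1000))
    [ "$shares" -lt 2 ] && shares=2
    echo "$shares"
}

millicores_to_cpu_max 500   # -> 50000 100000
millicores_to_shares 250    # -> 256
```

The first result matches the cpu.max value for a 500m limit shown in the Pod spec above.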
QoS Classes in Kubernetes
Kubernetes assigns Quality of Service (QoS) classes based on how resources are configured:
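- Guaranteed — requests == limits for both CPU and memory in every container. Best protected: last to be evicted under node pressure.
- Burstable — at least one container sets a request or limit, but the Pod does not meet the Guaranteed criteria (for example, requests < limits, or only requests set). Can use more than requested when available. Medium eviction priority.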
- BestEffort — No requests or limits set. Gets whatever is left over. First to be evicted under pressure.
For production workloads, always set both requests and limits. For batch/background jobs that can tolerate disruption, BestEffort can maximise cluster utilisation at the cost of predictability.
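The classification rules can be expressed as a short decision function. This is a simplified sketch for a Pod with a single container: the function name is hypothetical, values are compared as strings (so "500m" and "0.5" would not be treated as equal), and real Pods are classified across all of their containers.

```sh
# Hypothetical helper: classify a single-container Pod.
# Arguments: cpu_request cpu_limit mem_request mem_limit ("" = unset).
# Simplification: values are compared as strings, one container only.
qos_class() {
    cpu_req=$1 cpu_lim=$2 mem_req=$3 mem_lim=$4
    if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
        echo BestEffort
        return
    fi
    # An unset request defaults to the limit.
    [ -z "$cpu_req" ] && cpu_req=$cpu_lim
    [ -z "$mem_req" ] && mem_req=$mem_lim
    if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
        echo Guaranteed
    else
        echo Burstable
    fi
}

qos_class 250m 500m 128Mi 256Mi   # -> Burstable
qos_class 500m 500m 256Mi 256Mi   # -> Guaranteed
qos_class "" "" "" ""             # -> BestEffort
```

The first call uses the values from the resource-demo Pod above, which is therefore Burstable.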
Exercises
- CPU Throttling Observation — Run docker run --cpus=0.25 --rm alpine sh -c 'dd if=/dev/zero of=/dev/null bs=1M' and simultaneously run docker stats in another terminal. Observe the CPU percentage staying at ~25%. Now remove the --cpus flag and observe the difference.
- Memory OOM Experiment — Run a container with --memory=32m and attempt to allocate more memory using dd if=/dev/zero of=/dev/shm/test bs=1M count=64. Observe the OOM kill. Check docker inspect for the OOM-killed status.
- cgroup Filesystem Exploration — On your Linux system (or in a VM), explore /sys/fs/cgroup/. Start a Docker container, find its cgroup directory, and manually read cpu.max, memory.max, pids.max, and memory.current. Correlate with what docker stats reports.
- PSI Monitoring — If you have a cgroups v2 system, run a CPU-intensive container with tight CPU limits and monitor cpu.pressure in its cgroup directory. Calculate the percentage of time the container is being throttled.
Conclusion & Next Steps
Control groups complete the container isolation picture. Together with namespaces from Part 2:
- Namespaces control visibility — what a process can see (processes, network, filesystem, hostname, IPC, users)
- cgroups control consumption — what a process can use (CPU time, memory, disk I/O, process count)
Key takeaways from this article:
- cgroups organise processes into a hierarchy with per-group resource limits and accounting
- cgroups v2 provides a unified hierarchy, cleaner interface, and PSI metrics
- CPU limits use quotas (hard caps) and weights (relative priority during contention)
- Memory is non-compressible — exceeding limits triggers the OOM killer
- Every Docker --cpus, --memory, --pids-limit flag maps to a cgroup file
- Kubernetes requests and limits are the user-facing abstraction over cgroups
With namespaces and cgroups understood, the final kernel-level building block is the filesystem layer — how containers get their own root filesystem efficiently using union filesystems and copy-on-write.
Next in the Series
In Part 4: Union File Systems & Image Layering, we will explore how OverlayFS and copy-on-write semantics enable container images to be built in layers — making them space-efficient, fast to distribute, and sharable across containers.