Part 21: How Containers Actually Work — Namespaces & cgroups

Containers vs VMs

A virtual machine emulates an entire hardware stack — each VM runs its own kernel, its own OS, its own init system. A container is just a regular Linux process with restricted visibility and limited resources, sharing the host kernel. This distinction explains why containers start in milliseconds (no boot sequence) and use megabytes of overhead (no guest kernel).

VMs vs Containers — Architecture Comparison

flowchart TD
    subgraph VM["Virtual Machine"]
        VA[App A] --> GOS1[Guest OS]
        VB[App B] --> GOS2[Guest OS]
        GOS1 --> HV[Hypervisor]
        GOS2 --> HV
        HV --> HOS1[Host OS]
        HOS1 --> HW1[Hardware]
    end

    subgraph CT["Container"]
        CA[App A] --> CR[Container Runtime]
        CB[App B] --> CR
        CR --> HOS2[Host OS / Shared Kernel]
        HOS2 --> HW2[Hardware]
    end

            
Key Insight: Containers share the host kernel, so a kernel exploit inside a container can compromise the host. VMs provide stronger isolation because each guest runs its own kernel and the hypervisor mediates access to the hardware. Choose VMs for multi-tenant workloads requiring hard security boundaries; choose containers for fast, lightweight deployment where you control the workload.
        

Linux Namespaces

Namespaces partition kernel resources so that one set of processes sees one set of resources, and another set sees a different set. Each container gets its own namespaces — its own PID tree, its own network stack, its own filesystem mounts — while the kernel remains shared.

Namespace	Isolates	Kernel Flag	Effect
PID	Process IDs	`CLONE_NEWPID`	Container's init is PID 1; host PIDs invisible
NET	Network stack	`CLONE_NEWNET`	Own interfaces, IPs, routes, iptables rules
MNT	Mount points	`CLONE_NEWNS`	Own mount table; mounts made inside are invisible to the host
UTS	Hostname & domain	`CLONE_NEWUTS`	Container has its own hostname
IPC	System V IPC, POSIX MQs	`CLONE_NEWIPC`	Shared memory / semaphores isolated per container
USER	UIDs / GIDs	`CLONE_NEWUSER`	UID 0 inside can map to an unprivileged UID on the host
CGROUP	cgroup root view	`CLONE_NEWCGROUP`	Container sees its own cgroup as the root
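
To check which of these namespaces two processes actually share, it can help to compare the identifiers the kernel exposes under /proc/<pid>/ns/. The sketch below is one way to do that. The helper names ns_id and same_ns are made up for this example, and reading the namespaces of another user's process may require root.

```shell
# ns_id PID TYPE: print the namespace identifier, e.g. "pid:[4026531836]"
ns_id() {
    readlink "/proc/$1/ns/$2" 2>/dev/null
}

# same_ns PID_A PID_B TYPE: succeed only if both links are readable and equal
same_ns() {
    a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}

# Compare this shell with its parent for each namespace type in the table
for type in pid net mnt uts ipc user cgroup; do
    if same_ns "$$" "$PPID" "$type"; then
        echo "$type: shared with parent"
    else
        echo "$type: different or unreadable"
    fi
done
```

Two processes are in the same namespace exactly when these link targets match, which is why the loop normally reports every type as shared for an ordinary shell and its parent.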

PID Namespace

# Create a new PID namespace — the process inside sees itself as PID 1
sudo unshare --pid --fork --mount-proc bash -c '
    echo "My PID: $$"
    echo "Processes visible:"
    ps aux
'
# Output: PID 1 is bash — it cannot see host processes

# View namespaces of a running process
ls -la /proc/self/ns/
# Shows: cgroup, ipc, mnt, net, pid, pid_for_children, user, uts
# (plus time and time_for_children on kernel 5.6 and later)

# Compare namespaces of two processes
sudo readlink /proc/1/ns/pid       # Host init PID namespace
sudo readlink /proc/$(pgrep -o dockerd)/ns/pid  # dockerd's namespace (same as host)

NET Namespace

# Create an isolated network namespace
sudo ip netns add mycontainer

# List network namespaces
ip netns list

# Run a command inside the network namespace
sudo ip netns exec mycontainer ip link show
# Only sees: lo (loopback) — no eth0, no host interfaces

# Create a veth pair to connect namespaces
sudo ip link add veth-host type veth peer name veth-ct
sudo ip link set veth-ct netns mycontainer

# Assign IPs
sudo ip addr add 10.0.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth-ct
sudo ip netns exec mycontainer ip link set veth-ct up
sudo ip netns exec mycontainer ip link set lo up

# Ping across namespaces
ping -c 1 10.0.0.2   # Works! Traffic flows over the veth pair

# Cleanup
sudo ip netns del mycontainer

MNT Namespace

# Mount namespace gives the process its own filesystem view
sudo unshare --mount bash -c '
    # Mounts here are invisible to the host
    mount -t tmpfs tmpfs /tmp
    echo "secret" > /tmp/hidden.txt
    cat /tmp/hidden.txt   # "secret" — only visible in this namespace
    mount | grep tmpfs | tail -3
'
# After exit: /tmp/hidden.txt does not exist on the host

UTS Namespace

# UTS namespace isolates hostname
sudo unshare --uts bash -c '
    hostname mycontainer
    echo "Inside: $(hostname)"
'
echo "Host: $(hostname)"
# Container sees "mycontainer", host is unchanged

IPC Namespace

# IPC namespace isolates System V shared memory, semaphores, message queues
sudo unshare --ipc bash -c '
    ipcs -a    # Empty — no IPC objects visible from host
    # Create a shared memory segment (only visible in this namespace)
    ipcmk -M 1024
    ipcs -m    # Shows the new segment
'
ipcs -m    # Host does not see the segment created inside

USER Namespace

# User namespace remaps UIDs — root inside ≠ root outside
unshare --user --map-root-user bash -c '
    echo "Inside I am: $(whoami) (UID=$(id -u))"
    cat /proc/self/uid_map
    # 0  1000  1 — UID 0 inside maps to UID 1000 on host
'
echo "Outside I am: $(whoami) (UID=$(id -u))"
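
Each line of uid_map has three fields: the first UID inside the namespace, the first UID outside, and the length of the range. As a rough illustration of how a lookup against that table works, here is a small helper; the name map_uid is hypothetical and not part of any tool.

```shell
# map_uid INSIDE_UID: read uid_map lines on stdin, print the matching outside UID.
# Exits non-zero if the UID falls in no mapped range.
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# A single-UID map like the one shown above
printf '0 1000 1\n' | map_uid 0            # prints 1000

# A range map of the kind used for rootless containers (values are illustrative)
printf '0 100000 65536\n' | map_uid 1000   # prints 101000

# The map of the current process
map_uid 0 < /proc/self/uid_map || echo "UID 0 is not mapped here"
```

A UID outside every range has no mapping, which is why files owned by unmapped users tend to show up as nobody inside a user namespace.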

CGROUP Namespace

# Cgroup namespace makes the process see its cgroup as the root
# Without it, container processes can see the full cgroup hierarchy
cat /proc/self/cgroup   # Shows full path on host

sudo unshare --cgroup bash -c '
    cat /proc/self/cgroup   # Shows "/" — thinks it is at the root
'
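
On cgroups v2 the entry in /proc/<pid>/cgroup has the form 0::<path>, where the path is relative to the reader's cgroup namespace root. A small filter like the following can pull that path out; cgroup_v2_path is a name invented for this sketch.

```shell
# cgroup_v2_path: read /proc/<pid>/cgroup content on stdin, print the v2 path.
# The v2 entry always has hierarchy ID 0 and an empty controller list.
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Path of the current shell as seen from its own cgroup namespace
cgroup_v2_path < /proc/self/cgroup
```

Run on the host this prints the full path; run inside a fresh cgroup namespace it prints "/".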

cgroups — Resource Limits

Control groups (cgroups) limit, account for, and isolate the resource usage of process groups. While namespaces control what a process can see, cgroups control how much it can use. Most current distributions default to cgroups v2 (the unified hierarchy), which the examples below assume.
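
Before trying the commands in this section it may be worth confirming which cgroup layout the host uses. One common heuristic is to look at the filesystem type mounted at /sys/fs/cgroup; the function name cgroup_mode is an assumption of this sketch, and `stat -f -c %T` is the GNU coreutils form.

```shell
# cgroup_mode FSTYPE: classify the filesystem type found at /sys/fs/cgroup
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2 (unified)" ;;
        tmpfs)     echo "v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Fall back to "none" when stat is unavailable or the path is missing
fs_type=$(stat -f -c %T /sys/fs/cgroup 2>/dev/null || echo none)
echo "cgroup mode: $(cgroup_mode "$fs_type")"
```

On a v1 or hybrid host the per-controller files live under separate hierarchies, so paths such as /sys/fs/cgroup/mycontainer/cpu.max will not exist.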

CPU Limits

# cgroups v2: limit a process to 50% of one CPU core
# Create a cgroup
sudo mkdir -p /sys/fs/cgroup/mycontainer

# If cpu.max is missing, the cpu controller is not yet enabled for child cgroups:
#   echo "+cpu" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Set CPU limit: 50ms out of every 100ms period = 50% of one core
echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max

# Move current shell into the cgroup
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs

# Verify: run a CPU-intensive task, let it finish, then inspect the counters
stress --cpu 1 --timeout 5
cat /sys/fs/cgroup/mycontainer/cpu.stat
# nr_throttled and throttled_usec show how often and for how long the group was throttled

# Docker equivalent (the stock ubuntu image does not ship stress; install it first
# or use an image that includes it):
# docker run --cpus=0.5 ubuntu stress --cpu 1 --timeout 5
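
The arithmetic behind cpu.max is simply quota = CPUs × period. A minimal sketch of that conversion, assuming the 100000 µs period used above (the helper name cpus_to_cpu_max is hypothetical):

```shell
# cpus_to_cpu_max CPUS [PERIOD_USEC]: print the "quota period" pair for cpu.max
cpus_to_cpu_max() {
    awk -v cpus="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%d %d\n", int(cpus * period + 0.5), period }'
}

cpus_to_cpu_max 0.5    # prints: 50000 100000
cpus_to_cpu_max 2      # prints: 200000 100000
```

A quota larger than the period is valid and means the group may use more than one core's worth of time per period.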

Memory Limits

# Limit memory to 100MB
echo "104857600" | sudo tee /sys/fs/cgroup/mycontainer/memory.max

# View current memory usage
cat /sys/fs/cgroup/mycontainer/memory.current

# View OOM kill count
cat /sys/fs/cgroup/mycontainer/memory.events
# oom_kill shows how many times the OOM killer triggered

# Set memory.high for throttling before hard kill
echo "83886080" | sudo tee /sys/fs/cgroup/mycontainer/memory.high
# Above 80MB (memory.high): kernel throttles the group and reclaims aggressively (slow but alive)
# At 100MB (memory.max): OOM kill

# Docker equivalent of the hard limit:
# docker run --memory=100m ubuntu
# Note: --memory-reservation=80m sets a soft reservation (memory.low on cgroups v2), not memory.high
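
The byte values written above are plain powers-of-two multiples, and memory.events is a list of "name count" lines. The two helpers below sketch both conversions; to_bytes and oom_kills are illustrative names, and the sample events text is made up rather than captured from a real system.

```shell
# to_bytes SIZE: convert sizes like 100m or 1g to bytes (binary multiples)
to_bytes() {
    num=${1%[kKmMgG]}
    case "$1" in
        *[kK]) echo $(( $num * 1024 )) ;;
        *[mM]) echo $(( $num * 1024 * 1024 )) ;;
        *[gG]) echo $(( $num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# oom_kills: read memory.events content on stdin, print the oom_kill counter
oom_kills() {
    awk '$1 == "oom_kill" { print $2 }'
}

to_bytes 100m   # prints 104857600, the memory.max value used above
to_bytes 80m    # prints 83886080, the memory.high value used above

# Hypothetical memory.events content
printf 'low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n' | oom_kills   # prints 1
```

On a live system the second helper would be fed from /sys/fs/cgroup/mycontainer/memory.events.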

I/O Limits

# Limit disk I/O to 10MB/s write on device 8:0 (sda)
echo "8:0 wbps=10485760" | sudo tee /sys/fs/cgroup/mycontainer/io.max

# View I/O statistics for the cgroup
cat /sys/fs/cgroup/mycontainer/io.stat
# Shows: rbytes, wbytes, rios, wios, dbytes, dios

# Docker equivalent:
# docker run --device-write-bps /dev/sda:10mb ubuntu dd if=/dev/zero of=/tmp/test bs=1M count=100
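
io.stat has one line per device, keyed by major:minor and followed by key=value pairs. To watch a single counter while a throttled workload runs, a filter along these lines can help. The name io_stat_field and the sample numbers are assumptions for illustration; the major:minor pair for a real disk can be looked up with `lsblk -o NAME,MAJ:MIN`.

```shell
# io_stat_field DEVICE KEY: read io.stat content on stdin, print KEY's value for DEVICE
io_stat_field() {
    awk -v dev="$1" -v key="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }
    '
}

# Hypothetical io.stat content for two devices
sample='8:0 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:16 rbytes=4096 wbytes=0 rios=1 wios=0 dbytes=0 dios=0'

printf '%s\n' "$sample" | io_stat_field 8:0 wbytes   # prints 314773504
```

Sampling wbytes twice a second apart and subtracting gives the effective write rate, which should stay near the wbps value set in io.max.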

OverlayFS — The Container Filesystem

OverlayFS is a union filesystem that layers a writable upper directory on top of read-only lower directories. Container images are stacks of read-only layers; when a container writes a file, the change goes to the upper (container-specific) layer without modifying the image layers below. This enables efficient sharing — 100 containers from the same image share the same lower layers on disk.

# Create an OverlayFS manually — simulating how Docker layers work
mkdir -p /tmp/overlay/{lower,upper,work,merged}

# Populate the "base image" layer
echo "I am from the image" > /tmp/overlay/lower/base.txt
echo "config=default" > /tmp/overlay/lower/config.txt

# Mount the overlay
sudo mount -t overlay overlay \
    -o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
    /tmp/overlay/merged

# The merged view shows the lower layer contents
cat /tmp/overlay/merged/base.txt      # "I am from the image"
cat /tmp/overlay/merged/config.txt    # "config=default"

# Write a new file: it is created directly in the upper layer
echo "container data" > /tmp/overlay/merged/new.txt
ls /tmp/overlay/upper/   # new.txt exists here (no copy-up needed for new files)

# Modify an existing file: copy-up leaves the original in lower and puts the changed copy in upper
echo "config=custom" > /tmp/overlay/merged/config.txt
cat /tmp/overlay/lower/config.txt   # Still "config=default" (unchanged)
cat /tmp/overlay/upper/config.txt   # "config=custom" (the copied-up version)

# Cleanup
sudo umount /tmp/overlay/merged
rm -rf /tmp/overlay
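
The lookup rule that produces the merged view can be mimicked in user space without mounting anything: search the upper directory first, then each lower directory in order. The sketch below only models that precedence (it ignores whiteouts and directory merging, which the kernel also handles), and overlay_resolve is a name invented here.

```shell
# overlay_resolve NAME UPPER LOWER...: print the path that would back NAME
# in the merged view; fail if no layer contains it.
overlay_resolve() {
    name=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$name" ]; then
            echo "$layer/$name"
            return 0
        fi
    done
    return 1
}

# Throwaway layers mirroring the example above
demo=$(mktemp -d)
mkdir -p "$demo/lower" "$demo/upper"
echo "I am from the image" > "$demo/lower/base.txt"
echo "config=default" > "$demo/lower/config.txt"
echo "config=custom" > "$demo/upper/config.txt"

overlay_resolve config.txt "$demo/upper" "$demo/lower"   # the upper copy wins
overlay_resolve base.txt "$demo/upper" "$demo/lower"     # falls through to lower
overlay_resolve missing.txt "$demo/upper" "$demo/lower" || echo "not in any layer"

rm -rf "$demo"
```

With several lower directories the same first-match rule applies from left to right, which matches the order of the lowerdir= option.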

Building a Container from Scratch

A "container" is nothing more than: (1) namespaces for isolation, (2) a root filesystem, (3) cgroups for resource limits, and (4) seccomp and capabilities for restricting syscalls and privileges. Let's build one with basic Linux tools.

Hands-On Lab

From unshare to Docker — Building a Container by Hand

This step-by-step lab creates a minimal container using only standard Linux command-line tools, no Docker required. You'll create namespaces, set up a root filesystem, mount /proc, configure cgroups, and pivot into the isolated environment. This is, in simplified form, what runc does under the hood.

unshare, chroot, cgroups, namespaces

# Step 1: Download a minimal rootfs (Alpine Linux — ~3MB)
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
curl -sL https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz \
    | tar -xz -C rootfs

# Step 2: Create cgroup for resource limits
# (if cpu.max or memory.max is missing, enable the controllers first:
#  echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control)
sudo mkdir -p /sys/fs/cgroup/mycontainer
echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max    # 50% CPU
echo "67108864" | sudo tee /sys/fs/cgroup/mycontainer/memory.max     # 64MB RAM

# Step 3: Move this shell into the cgroup before unsharing, so the container
# inherits the limits and its cgroup namespace is rooted at mycontainer
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs

# Step 4: Launch with all namespaces isolated
sudo unshare \
    --pid \
    --net \
    --mount \
    --uts \
    --ipc \
    --cgroup \
    --fork \
    bash -c '
        # Step 5: Set hostname
        hostname mycontainer

        # Step 6: Set up the rootfs (pivot_root needs the new root to be a mount point)
        mount --bind rootfs rootfs
        cd rootfs

        # Step 7: Mount essential filesystems
        mount -t proc proc proc/
        mount -t sysfs sys sys/
        mount -t cgroup2 cgroup2 sys/fs/cgroup
        mount -t tmpfs tmp tmp/

        # Step 8: Pivot into the new root
        mkdir -p .old_root
        pivot_root . .old_root
        cd /
        umount -l /.old_root
        rmdir /.old_root
        hash -r   # forget host command paths that bash cached before the pivot

        # Step 9: We are now "inside" the container
        echo "Hostname: $(hostname)"
        echo "PID: $$"
        echo "Processes:"
        ps aux
        echo "Filesystem:"
        ls /
        echo "Memory limit:"
        cat /sys/fs/cgroup/memory.max   # should print 67108864
    '

# Cleanup: move the shell back to the root cgroup, then remove the now-empty cgroup
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/mycontainer
cd / && rm -rf /tmp/mycontainer

            
Never Run Containers as --privileged: The --privileged flag disables all security restrictions. It gives the container full access to all host devices, disables seccomp filtering, grants all Linux capabilities, and removes AppArmor/SELinux confinement. A process inside a --privileged container can mount the host filesystem, load kernel modules, and escape trivially. Use --cap-add to grant only the specific capabilities needed, and --security-opt seccomp=profile.json for fine-grained syscall filtering.
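
One way to see what a process has actually been granted is to decode the CapEff bitmask in /proc/<pid>/status, where bit N corresponds to capability number N from <linux/capability.h>. The sketch below checks three bits relevant to the warning above. The helper name has_cap is hypothetical, and the mask a80425fb mentioned afterwards is the value commonly reported for Docker's default capability set, so treat it as an example that may vary between versions.

```shell
# has_cap MASK_HEX BIT: succeed if capability number BIT is set in the hex mask
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Effective capability mask of the current process (0 if it cannot be read)
cap_eff=$(awk '$1 == "CapEff:" { print $2 }' /proc/self/status 2>/dev/null)
cap_eff=${cap_eff:-0}

# name:number pairs; numbers are from <linux/capability.h>
for entry in CAP_SYS_MODULE:16 CAP_SYS_CHROOT:18 CAP_SYS_ADMIN:21; do
    name=${entry%%:*}
    bit=${entry##*:}
    if has_cap "$cap_eff" "$bit"; then
        echo "$name: present"
    else
        echo "$name: absent"
    fi
done
```

Checking the example mask with `has_cap 00000000a80425fb 21` fails (no CAP_SYS_ADMIN), while a mask with every bit set, as seen under --privileged, passes for all three.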
        

Docker Architecture

Docker is not a monolith — it's a stack of components following the OCI (Open Container Initiative) specification. Understanding the layers helps you debug container issues and choose alternatives (Podman, containerd+nerdctl, CRI-O).

Docker Architecture — Component Stack

flowchart TD
    CLI["docker CLI"] -->|REST API| D["dockerd\n(Docker daemon)"]
    D -->|gRPC| CTD["containerd\n(container lifecycle)"]
    CTD -->|OCI spec| SHIM["containerd-shim"]
    SHIM -->|fork/exec| RUNC["runc\n(OCI runtime)"]
    RUNC -->|"clone() + namespaces\n+ cgroups + pivot_root"| CP["Container Process\n(your app)"]

    CTD -.->|Image pull/push| REG["Registry\n(Docker Hub, ECR, etc.)"]
    CTD -.->|Snapshots| OFS["OverlayFS\n(image layers)"]

containerd, runc, OCI

containerd manages the full container lifecycle: image pull, storage (snapshots), container creation, and task execution. runc is the low-level OCI runtime that actually creates the Linux namespaces, sets up cgroups, and execs the container process. The OCI runtime spec defines a JSON config (config.json) that any compliant runtime can consume, which is why runc, crun, youki, and gVisor's runsc can be swapped for one another.
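
To make the config.json idea concrete, the sketch below writes a heavily trimmed, hypothetical bundle config and lists the namespaces it requests. The helper oci_namespaces is invented for this example and relies on the one-entry-per-line layout shown here (similar to what `runc spec` emits); a JSON-aware tool such as jq would be more robust for real files.

```shell
# oci_namespaces FILE: print the namespace types listed under "namespaces"
oci_namespaces() {
    sed -n '/"namespaces"/,/\]/p' "$1" \
        | grep -o '"type": *"[a-z]*"' \
        | sed 's/.*"\([a-z]*\)"$/\1/'
}

# Hypothetical, trimmed config.json (a real one has many more fields)
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
{
  "ociVersion": "1.0.2",
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" }
  ],
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ]
  }
}
EOF

oci_namespaces "$cfg"   # prints pid, network, mount on separate lines
rm -f "$cfg"
```

Restricting the search to the namespaces array matters because mount entries also carry a "type" key.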

# See what Docker uses under the hood
docker info | grep -E "Runtime|Storage|Cgroup"
# Default Runtime: runc
# Storage Driver: overlay2
# Cgroup Driver: systemd (or cgroupfs)

# Inspect a container's namespaces and cgroups
CONTAINER_ID=$(docker run -d --name test-ns alpine sleep 3600)
PID=$(docker inspect --format '{{.State.Pid}}' $CONTAINER_ID)

# View the container's namespaces
sudo ls -la /proc/$PID/ns/
# ipc, mnt, net, pid, uts differ from the host; cgroup differs on cgroups v2 hosts;
# user matches the host unless userns-remap is enabled

# View its cgroup
cat /proc/$PID/cgroup
# 0::/system.slice/docker-<container-id>.scope (systemd) or 0::/docker/<container-id> (cgroupfs)

# View resource limits set by Docker
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/memory.max
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/cpu.max

# Enter a container's namespaces with nsenter
sudo nsenter --target $PID --pid --net --mount -- ps aux
# You see only the container's processes

# Cleanup
docker rm -f test-ns

# Use runc directly (bypassing Docker entirely)
mkdir -p /tmp/oci-bundle/rootfs
cd /tmp/oci-bundle

# Create a rootfs
docker export $(docker create alpine) | tar -C rootfs -xf -

# Generate OCI config.json
runc spec

# View the namespace list in the generated config (mounts and resource settings live in the same file)
python3 -m json.tool config.json | grep -A 20 '"namespaces"'
# Under "linux": "namespaces": [{"type": "pid"}, {"type": "network"}, ...]
# Limits, when set, go under "linux": {"resources": {"memory": {"limit": ...}}}

# Run the container with runc
sudo runc run my-container
# You're now inside an OCI container, no Docker needed

# Cleanup (runc run removes the container when its process exits,
# so the delete only matters if something was left behind)
sudo runc delete my-container 2>/dev/null
rm -rf /tmp/oci-bundle

Exercises

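# Exercise 1: Create a PID namespace and verify isolation
sudo unshare --pid --fork --mount-proc ps aux
# You should see a single process: ps itself, running as PID 1
# (unshare stays in the parent PID namespace; only its forked child enters the new one)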

# Exercise 2: View namespaces of a Docker container
docker run -d --name ex-ns alpine sleep 60
docker inspect --format '{{.State.Pid}}' ex-ns | xargs -I{} sudo ls -la /proc/{}/ns/
docker rm -f ex-ns

# Exercise 3: Set a memory limit with cgroups v2
sudo mkdir -p /sys/fs/cgroup/exercise
echo "52428800" | sudo tee /sys/fs/cgroup/exercise/memory.max   # 50MB
echo $$ | sudo tee /sys/fs/cgroup/exercise/cgroup.procs
cat /sys/fs/cgroup/exercise/memory.current
# Move the shell back first; rmdir fails while the cgroup still contains a process
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/exercise

# Exercise 4: Create an OverlayFS and test copy-on-write
mkdir -p /tmp/ex-ov/{lower,upper,work,merged}
echo "original" > /tmp/ex-ov/lower/file.txt
sudo mount -t overlay overlay -o lowerdir=/tmp/ex-ov/lower,upperdir=/tmp/ex-ov/upper,workdir=/tmp/ex-ov/work /tmp/ex-ov/merged
echo "modified" > /tmp/ex-ov/merged/file.txt
cat /tmp/ex-ov/lower/file.txt   # Still "original"
cat /tmp/ex-ov/upper/file.txt   # "modified" (copy-up)
sudo umount /tmp/ex-ov/merged && rm -rf /tmp/ex-ov

# Exercise 5: Compare Docker overhead vs VM by confirming the kernel is shared
cat /proc/version                          # Host kernel
docker run --rm alpine cat /proc/version   # Same kernel as host!

Conclusion & Next Steps

Containers are built from four Linux kernel primitives: namespaces (what a process can see), cgroups (how much it can use), OverlayFS (layered filesystem), and seccomp plus capabilities (which syscalls and privileged operations are allowed). Docker, Podman, and CRI-O are all front ends that, through an OCI runtime such as runc, call clone() with the right flags, create cgroup directories, set up an overlay mount, and exec your process. Understanding these primitives makes container debugging, security hardening, and performance tuning far more approachable, because under every docker run is just a Linux process.

