Containers vs VMs
A virtual machine emulates an entire hardware stack — each VM runs its own kernel, its own OS, its own init system. A container is just a regular Linux process with restricted visibility and limited resources, sharing the host kernel. This distinction explains why containers start in milliseconds (no boot sequence) and use megabytes of overhead (no guest kernel).
flowchart TD
subgraph VM["Virtual Machine"]
VA[App A] --> GOS1[Guest OS]
VB[App B] --> GOS2[Guest OS]
GOS1 --> HV[Hypervisor]
GOS2 --> HV
HV --> HOS1[Host OS]
HOS1 --> HW1[Hardware]
end
subgraph CT["Container"]
CA[App A] --> CR[Container Runtime]
CB[App B] --> CR
CR --> HOS2[Host OS / Shared Kernel]
HOS2 --> HW2[Hardware]
end
Linux Namespaces
Namespaces partition kernel resources so that one set of processes sees one set of resources, and another set sees a different set. Each container gets its own namespaces — its own PID tree, its own network stack, its own filesystem mounts — while the kernel remains shared.
| Namespace | Isolates | Kernel Flag | Effect |
|---|---|---|---|
| PID | Process IDs | CLONE_NEWPID | Container's init is PID 1; host PIDs invisible |
| NET | Network stack | CLONE_NEWNET | Own interfaces, IPs, routes, iptables rules |
| MNT | Mount points | CLONE_NEWNS | Own mount table; mounts made inside do not appear on the host |
| UTS | Hostname & domain | CLONE_NEWUTS | Container has its own hostname |
| IPC | System V IPC, POSIX MQs | CLONE_NEWIPC | Shared memory / semaphores isolated per container |
| USER | UIDs / GIDs | CLONE_NEWUSER | UID 0 inside maps to unprivileged UID on host |
| CGROUP | cgroup root view | CLONE_NEWCGROUP | Container sees its own cgroup as the root |
PID Namespace
# Create a new PID namespace — the process inside sees itself as PID 1
sudo unshare --pid --fork --mount-proc bash -c '
echo "My PID: $$"
echo "Processes visible:"
ps aux
'
# Output: PID 1 is bash — it cannot see host processes
# View namespaces of a running process
ls -la /proc/self/ns/
# Shows: cgroup, ipc, mnt, net, pid, pid_for_children, user, uts (plus time and time_for_children on kernel 5.6+)
# Compare namespaces of two processes
sudo readlink /proc/1/ns/pid # Host init PID namespace
sudo readlink /proc/$(pgrep dockerd)/ns/pid # dockerd's namespace (same as host)
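The readlink comparison above can be wrapped in a small helper. This is a minimal sketch, assuming a Linux /proc filesystem; the function names ns_inode and same_ns are made up for illustration and are not part of any tool.

```shell
# ns_inode LINK: extract the inode number from a namespace link
# such as "pid:[4026531836]".
ns_inode() {
  link=${1#*\[}                 # strip the "type:[" prefix
  printf '%s\n' "${link%\]}"    # strip the trailing "]"
}

# same_ns PID_A PID_B TYPE: succeed if both processes are in the same
# namespace of the given type (pid, net, mnt, uts, ipc, user, cgroup).
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

ns_inode 'pid:[4026531836]'     # prints 4026531836

# A process always shares a namespace with itself
if same_ns $$ $$ pid; then echo "same PID namespace"; fi
```

Comparing against another user's process (for example PID 1) typically needs root, so the function would then be run from a root shell.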
NET Namespace
# Create an isolated network namespace
sudo ip netns add mycontainer
# List network namespaces
ip netns list
# Run a command inside the network namespace
sudo ip netns exec mycontainer ip link show
# Only sees: lo (loopback) — no eth0, no host interfaces
# Create a veth pair to connect namespaces
sudo ip link add veth-host type veth peer name veth-ct
sudo ip link set veth-ct netns mycontainer
# Assign IPs
sudo ip addr add 10.0.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth-ct
sudo ip netns exec mycontainer ip link set veth-ct up
sudo ip netns exec mycontainer ip link set lo up
# Ping across namespaces
ping -c 1 10.0.0.2 # Works! Traffic flows over the veth pair
# Cleanup
sudo ip netns del mycontainer
MNT Namespace
# Mount namespace gives the process its own filesystem view
sudo unshare --mount bash -c '
# Mounts here are invisible to the host
mount -t tmpfs tmpfs /tmp
echo "secret" > /tmp/hidden.txt
cat /tmp/hidden.txt # "secret" — only visible in this namespace
mount | grep tmpfs | tail -3
'
# After exit: /tmp/hidden.txt does not exist on the host
UTS Namespace
# UTS namespace isolates hostname
sudo unshare --uts bash -c '
hostname mycontainer
echo "Inside: $(hostname)"
'
echo "Host: $(hostname)"
# Container sees "mycontainer", host is unchanged
IPC Namespace
# IPC namespace isolates System V shared memory, semaphores, message queues
sudo unshare --ipc bash -c '
ipcs -a # Empty — no IPC objects visible from host
# Create a shared memory segment (only visible in this namespace)
ipcmk -M 1024
ipcs -m # Shows the new segment
'
ipcs -m # Host does not see the segment created inside
USER Namespace
# User namespace remaps UIDs — root inside ≠ root outside
unshare --user --map-root-user bash -c '
echo "Inside I am: $(whoami) (UID=$(id -u))"
cat /proc/self/uid_map
# 0 1000 1 — UID 0 inside maps to UID 1000 on host
'
echo "Outside I am: $(whoami) (UID=$(id -u))"
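Each uid_map line has three fields: the first UID inside the namespace, the first UID outside, and the length of the range. The translation can be sketched as follows; map_uid is a hypothetical helper name, and the second sample mapping is an assumed example rather than output from the commands above.

```shell
# map_uid INSIDE_UID: read uid_map-style lines on stdin and print the UID
# outside the namespace that INSIDE_UID corresponds to.
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) { print "unmapped"; exit 1 } }
  '
}

# The single-UID mapping shown above: root inside is UID 1000 outside
printf '0 1000 1\n' | map_uid 0          # prints 1000

# An assumed range mapping of 65536 UIDs starting at host UID 100000
printf '0 100000 65536\n' | map_uid 5    # prints 100005
```

On a live system the input would come from /proc/PID/uid_map instead of printf.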
CGROUP Namespace
# Cgroup namespace makes the process see its cgroup as the root
# Without it, container processes can see the full cgroup hierarchy
cat /proc/self/cgroup # Shows full path on host
sudo unshare --cgroup bash -c '
cat /proc/self/cgroup # Shows "/" — thinks it is at the root
'
cgroups — Resource Limits
Control groups (cgroups) limit, account for, and isolate the resource usage of process groups. While namespaces control what a process can see, cgroups control how much it can use. Most current distributions default to cgroups v2 (the unified hierarchy), which the examples below assume.
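Before running the cgroup examples, it may help to check which version the host uses. A minimal sketch, assuming a stat that supports -f and -c (GNU coreutils does); cgroup_version is a made-up helper name, and the optional path argument exists only to make it easy to try.

```shell
# cgroup_version [MOUNTPOINT]: report v2, v1, or unknown based on the
# filesystem type mounted at /sys/fs/cgroup.
cgroup_version() {
  fstype=$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null) || fstype=""
  case $fstype in
    cgroup2fs) echo v2 ;;        # unified hierarchy
    tmpfs)     echo v1 ;;        # v1 mounts a tmpfs with one directory per controller
    *)         echo unknown ;;
  esac
}

cgroup_version
```

If this reports v1, files such as cpu.max and memory.max used in the following sections will not exist under those names.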
CPU Limits
# cgroups v2: limit a process to 50% of one CPU core
# Create a cgroup
sudo mkdir -p /sys/fs/cgroup/mycontainer
# Set CPU limit: 50ms out of every 100ms period = 50% of one core
echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max
# Move current shell into the cgroup
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
# Verify: run a CPU-intensive task to completion, then check for throttling
stress --cpu 1 --timeout 5
cat /sys/fs/cgroup/mycontainer/cpu.stat
# nr_throttled and throttled_usec show how often and for how long the cgroup was throttled
# Docker equivalent:
# docker run --cpus=0.5 ubuntu stress --cpu 1 --timeout 5
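The quota and period pair written to cpu.max follows directly from the desired number of cores. The arithmetic can be sketched like this; cpus_to_cpu_max is a hypothetical helper, and the 100000 microsecond default period is taken from the example above.

```shell
# cpus_to_cpu_max CPUS [PERIOD_USEC]: print a "<quota> <period>" pair
# suitable for writing to cpu.max.
cpus_to_cpu_max() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%.0f %d\n", cpus * period, period }'
}

cpus_to_cpu_max 0.5      # prints: 50000 100000
cpus_to_cpu_max 2        # prints: 200000 100000
```

A quota larger than the period, as in the second call, allows the cgroup to use more than one core's worth of CPU time per period.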
Memory Limits
# Limit memory to 100MB
echo "104857600" | sudo tee /sys/fs/cgroup/mycontainer/memory.max
# View current memory usage
cat /sys/fs/cgroup/mycontainer/memory.current
# View OOM kill count
cat /sys/fs/cgroup/mycontainer/memory.events
# oom_kill shows how many times the OOM killer triggered
# Set memory.high for throttling before hard kill
echo "83886080" | sudo tee /sys/fs/cgroup/mycontainer/memory.high
# At 80MB: kernel reclaims aggressively (slow but alive)
# At 100MB (memory.max): OOM kill
# Docker equivalent for the hard limit:
# docker run --memory=100m ubuntu
# Note: --memory-reservation is a soft limit that maps to memory.low on cgroups v2, not memory.high
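The byte values written to memory.max and memory.high are plain powers-of-1024 conversions. A small sketch of that conversion, with to_bytes as a made-up helper name that accepts only integer values with an optional k, m, or g suffix:

```shell
# to_bytes SIZE: convert a size such as 100m to bytes (1k = 1024 bytes).
to_bytes() {
  number=${1%[kKmMgG]}      # numeric part
  unit=${1#"$number"}       # trailing suffix, if any
  case $unit in
    "")  echo "$number" ;;
    k|K) echo $(( $number * 1024 )) ;;
    m|M) echo $(( $number * 1024 * 1024 )) ;;
    g|G) echo $(( $number * 1024 * 1024 * 1024 )) ;;
  esac
}

to_bytes 100m     # prints 104857600 (the memory.max value above)
to_bytes 80m      # prints 83886080  (the memory.high value above)
```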
I/O Limits
# Limit disk I/O to 10MB/s write on device 8:0 (sda)
echo "8:0 wbps=10485760" | sudo tee /sys/fs/cgroup/mycontainer/io.max
# View I/O statistics for the cgroup
cat /sys/fs/cgroup/mycontainer/io.stat
# Shows: rbytes, wbytes, rios, wios, dbytes, dios
# Docker equivalent:
# docker run --device-write-bps /dev/sda:10mb ubuntu dd if=/dev/zero of=/tmp/test bs=1M count=100
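Each io.stat line starts with a MAJOR:MINOR device number followed by key=value pairs, so a single counter can be pulled out with awk. A minimal sketch; io_stat_field is a hypothetical helper, and the sample line uses made-up numbers.

```shell
# io_stat_field DEVICE KEY: read io.stat-style lines on stdin and print the
# value of KEY (for example wbytes) for the given MAJOR:MINOR device.
io_stat_field() {
  awk -v dev="$1" -v key="$2" '
    $1 == dev {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) { print kv[2]; exit }
      }
    }
  '
}

# Sample line with made-up numbers, in the io.stat format
echo "8:0 rbytes=4096 wbytes=1048576 rios=1 wios=8 dbytes=0 dios=0" \
  | io_stat_field 8:0 wbytes     # prints 1048576
```

Sampling wbytes twice, a second apart, and subtracting gives the observed write rate to compare against the wbps limit.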
OverlayFS — The Container Filesystem
OverlayFS is a union filesystem that layers a writable upper directory on top of read-only lower directories. Container images are stacks of read-only layers; when a container writes a file, the change goes to the upper (container-specific) layer without modifying the image layers below. This enables efficient sharing — 100 containers from the same image share the same lower layers on disk.
# Create an OverlayFS manually — simulating how Docker layers work
mkdir -p /tmp/overlay/{lower,upper,work,merged}
# Populate the "base image" layer
echo "I am from the image" > /tmp/overlay/lower/base.txt
echo "config=default" > /tmp/overlay/lower/config.txt
# Mount the overlay
sudo mount -t overlay overlay \
-o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
/tmp/overlay/merged
# The merged view shows the lower layer contents
cat /tmp/overlay/merged/base.txt # "I am from the image"
cat /tmp/overlay/merged/config.txt # "config=default"
# Write a new file — goes to the upper layer only
echo "container data" > /tmp/overlay/merged/new.txt
ls /tmp/overlay/upper/ # new.txt exists here (new files are created directly in the upper layer)
# Modify an existing file: copy-up copies it into the upper layer first; the original in lower is untouched
echo "config=custom" > /tmp/overlay/merged/config.txt
cat /tmp/overlay/lower/config.txt # Still "config=default" (unchanged)
cat /tmp/overlay/upper/config.txt # "config=custom" (the overlay)
# Cleanup
sudo umount /tmp/overlay/merged
rm -rf /tmp/overlay
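The lookup order in the merged view can be modeled without root or a real overlay mount. The sketch below is a simplified two-layer model with a made-up helper name, overlay_lookup; real OverlayFS also handles whiteouts for deleted files and opaque directories, which this ignores.

```shell
# overlay_lookup UPPER LOWER PATH: print which layer would serve PATH.
# The upper layer wins; the lower layer is consulted only as a fallback.
overlay_lookup() {
  if [ -e "$1/$3" ]; then
    echo upper
  elif [ -e "$2/$3" ]; then
    echo lower
  else
    echo missing
    return 1
  fi
}

demo=$(mktemp -d)
mkdir -p "$demo/lower" "$demo/upper"
echo "config=default" > "$demo/lower/config.txt"
overlay_lookup "$demo/upper" "$demo/lower" config.txt   # prints lower

# Simulate copy-up: copy the file into the upper layer, then modify the copy
cp "$demo/lower/config.txt" "$demo/upper/config.txt"
echo "config=custom" > "$demo/upper/config.txt"
overlay_lookup "$demo/upper" "$demo/lower" config.txt   # prints upper
cat "$demo/lower/config.txt"                            # prints config=default
rm -rf "$demo"
```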
Building a Container from Scratch
A "container" is nothing more than: (1) namespaces for isolation, (2) a root filesystem, (3) cgroups for resource limits, and (4) seccomp and capabilities for restricting syscalls and privileged operations. Let's build one with basic Linux tools; the lab below covers the first three.
From unshare to Docker — Building a Container by Hand
This step-by-step lab creates a minimal container using only standard Linux command-line tools, no Docker required. You'll create namespaces, set up a root filesystem, mount /proc, configure cgroups, and run commands in the isolated environment. These are essentially the same steps runc performs under the hood, minus the security hardening.
# Step 1: Download a minimal rootfs (Alpine Linux — ~3MB)
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
curl -sL https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz \
| tar -xz -C rootfs
# Step 2: Create cgroup for resource limits
sudo mkdir -p /sys/fs/cgroup/mycontainer
echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max # 50% CPU
echo "67108864" | sudo tee /sys/fs/cgroup/mycontainer/memory.max # 64MB RAM
# Step 3: Move the current shell into the cgroup before unsharing
# (children inherit it, and the new cgroup namespace is rooted at this cgroup)
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
# Step 4: Launch with all namespaces isolated
sudo unshare \
--pid \
--net \
--mount \
--uts \
--ipc \
--cgroup \
--fork \
bash -c '
# Step 5: Set hostname
hostname mycontainer
# Step 6: Set up the rootfs (pivot_root needs the new root to be a mount point)
mount --bind rootfs rootfs
cd rootfs
# Step 7: Mount essential filesystems
mount -t proc proc proc/
mount -t sysfs sys sys/
mount -t cgroup2 none sys/fs/cgroup
mount -t tmpfs tmp tmp/
# Step 8: Pivot into the new root
mkdir -p .old_root
pivot_root . .old_root
cd /
umount -l /.old_root
rmdir /.old_root
# Step 9: We are now "inside" the container
echo "Hostname: $(hostname)"
echo "PID: $$"
echo "Processes:"
ps aux
echo "Filesystem:"
ls /
echo "Memory limit:"
cat /sys/fs/cgroup/memory.max 2>/dev/null || echo "(not visible)"
'
# Cleanup: move the shell back to the root cgroup so the cgroup can be removed
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/mycontainer 2>/dev/null
cd /tmp && rm -rf /tmp/mycontainer
Docker's --privileged flag disables nearly all of the container's security restrictions: it gives the container full access to all host devices, disables seccomp filtering, grants all Linux capabilities, and removes AppArmor/SELinux confinement. A process inside a --privileged container can mount the host filesystem, load kernel modules, and escape trivially. Instead, use --cap-add to grant only the specific capabilities needed, and --security-opt seccomp=profile.json for fine-grained syscall filtering.
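One way to see the difference is to decode the CapEff bitmask that the kernel reports in /proc/PID/status, where bit N corresponds to capability number N. The sketch below uses a made-up helper name, has_cap; the first mask is assumed to match Docker's default capability set and the second is an example with every bit up to 40 set, so treat both as illustrative values rather than output from a real container.

```shell
# has_cap HEXMASK BIT: succeed if capability number BIT is set in a
# CapEff-style hexadecimal mask.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_SYS_MODULE=16   # load kernel modules
CAP_SYS_ADMIN=21    # mount filesystems, among much else

default_mask=00000000a80425fb      # assumed default-container mask
privileged_mask=000001ffffffffff   # example mask with all bits 0..40 set

has_cap "$default_mask" "$CAP_SYS_ADMIN"    || echo "default: no CAP_SYS_ADMIN"
has_cap "$privileged_mask" "$CAP_SYS_ADMIN" && echo "privileged: CAP_SYS_ADMIN present"
```

The mask for a live process can be read with grep CapEff /proc/PID/status.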
Docker Architecture
Docker is not a monolith — it's a stack of components following the OCI (Open Container Initiative) specification. Understanding the layers helps you debug container issues and choose alternatives (Podman, containerd+nerdctl, CRI-O).
flowchart TD
CLI["docker CLI"] -->|REST API| D["dockerd\n(Docker daemon)"]
D -->|gRPC| CTD["containerd\n(container lifecycle)"]
CTD -->|OCI spec| SHIM["containerd-shim"]
SHIM -->|fork/exec| RUNC["runc\n(OCI runtime)"]
RUNC -->|"clone() + namespaces\n+ cgroups + pivot_root"| CP["Container Process\n(your app)"]
CTD -.->|Image pull/push| REG["Registry\n(Docker Hub, ECR, etc.)"]
CTD -.->|Snapshots| OFS["OverlayFS\n(image layers)"]
containerd, runc, OCI
containerd manages the full container lifecycle: image pull, storage (snapshots), container creation, and task execution. runc is the low-level OCI runtime that actually creates the Linux namespaces, sets up cgroups, and execs the container process. The OCI runtime spec defines a JSON config (config.json) that any compliant runtime can consume, which is why runc, crun, youki, and gVisor's runsc can be swapped for one another.
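To make the config.json idea concrete, the namespace list a runtime reads is a JSON array of objects with a type field. The sketch below prints only that fragment; oci_namespaces is a hypothetical helper, and a real config.json contains many more fields (process, root, mounts, and so on).

```shell
# oci_namespaces TYPE...: print a JSON array in the shape used by the
# linux.namespaces field of an OCI config.json.
oci_namespaces() {
  sep=""
  printf '['
  for ns in "$@"; do
    printf '%s{"type": "%s"}' "$sep" "$ns"
    sep=", "
  done
  printf ']\n'
}

oci_namespaces pid network ipc uts mount
# prints: [{"type": "pid"}, {"type": "network"}, {"type": "ipc"}, {"type": "uts"}, {"type": "mount"}]
```

Leaving a type out of the list means the container process stays in the host's namespace of that type.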
# See what Docker uses under the hood
docker info | grep -E "Runtime|Storage|Cgroup"
# Default Runtime: runc
# Storage Driver: overlay2
# Cgroup Driver: systemd (or cgroupfs)
# Inspect a container's namespaces and cgroups
CONTAINER_ID=$(docker run -d --name test-ns alpine sleep 3600)
PID=$(docker inspect --format '{{.State.Pid}}' $CONTAINER_ID)
# View the container's namespaces
sudo ls -la /proc/$PID/ns/
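# ipc, mnt, net, pid, uts differ from the host; cgroup is private by default on cgroups v2
# user is shared with the host unless userns-remap or rootless mode is enabled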
# View its cgroup
cat /proc/$PID/cgroup
# 0::/system.slice/docker-<container-id>.scope (systemd driver) or 0::/docker/<container-id> (cgroupfs driver)
# View resource limits set by Docker
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/memory.max
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/cpu.max
# Enter a container's namespaces with nsenter
sudo nsenter --target $PID --pid --net --mount -- ps aux
# You see only the container's processes
# Cleanup
docker rm -f test-ns
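Rather than globbing under system.slice, the cgroup directory can be derived from /proc/PID/cgroup, which works for either cgroup driver. A minimal sketch for cgroups v2; cgroup_v2_path is a made-up helper name and the sample line is hypothetical.

```shell
# cgroup_v2_path: read /proc/<pid>/cgroup content on stdin and print the
# cgroups v2 path, which is the entry with hierarchy ID 0 ("0::<path>").
cgroup_v2_path() {
  sed -n 's/^0:://p'
}

# Hypothetical sample line; a real one comes from /proc/$PID/cgroup
echo "0::/system.slice/docker-abc123.scope" | cgroup_v2_path
# prints: /system.slice/docker-abc123.scope
```

With a real container PID, the limit file would then be /sys/fs/cgroup followed by the printed path and /memory.max, read from the host so that the path is not relative to the container's cgroup namespace.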
# Use runc directly (bypassing Docker entirely)
mkdir -p /tmp/oci-bundle/rootfs
cd /tmp/oci-bundle
# Create a rootfs
docker export $(docker create alpine) | tar -C rootfs -xf -
# Generate OCI config.json
runc spec
# View the generated config (namespaces, mounts, cgroups defined here)
cat config.json | python3 -m json.tool | head -50
# "linux": {"namespaces": [{"type": "pid"}, {"type": "network"}, ...]}
# Resource limits, when set, go under "linux": {"resources": {"memory": {"limit": ...}}}
# Run the container with runc
sudo runc run my-container
# You're now inside an OCI container — no Docker needed
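# Cleanup (a foreground "runc run" removes the container on exit; delete only matters if it was left behind)
sudo runc delete my-container 2>/dev/null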
rm -rf /tmp/oci-bundle
Exercises
# Exercise 1: Create a PID namespace and verify isolation
sudo unshare --pid --fork --mount-proc ps aux
# You should see only one process: ps itself, running as PID 1 (unshare stays in the parent namespace)
# Exercise 2: View namespaces of a Docker container
docker run -d --name ex-ns alpine sleep 60
docker inspect --format '{{.State.Pid}}' ex-ns | xargs -I{} sudo ls -la /proc/{}/ns/
docker rm -f ex-ns
# Exercise 3: Set a memory limit with cgroups v2
sudo mkdir -p /sys/fs/cgroup/exercise
echo "52428800" | sudo tee /sys/fs/cgroup/exercise/memory.max # 50MB
echo $$ | sudo tee /sys/fs/cgroup/exercise/cgroup.procs
cat /sys/fs/cgroup/exercise/memory.current
# Move the shell back to the root cgroup; rmdir fails while the cgroup still has members
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/exercise
# Exercise 4: Create an OverlayFS and test copy-on-write
mkdir -p /tmp/ex-ov/{lower,upper,work,merged}
echo "original" > /tmp/ex-ov/lower/file.txt
sudo mount -t overlay overlay -o lowerdir=/tmp/ex-ov/lower,upperdir=/tmp/ex-ov/upper,workdir=/tmp/ex-ov/work /tmp/ex-ov/merged
echo "modified" > /tmp/ex-ov/merged/file.txt
cat /tmp/ex-ov/lower/file.txt # Still "original"
cat /tmp/ex-ov/upper/file.txt # "modified" (copy-up)
sudo umount /tmp/ex-ov/merged && rm -rf /tmp/ex-ov
# Exercise 5: Confirm that a container shares the host kernel (a VM would report its own)
cat /proc/version # Host kernel
docker run --rm alpine cat /proc/version # Identical output: same kernel as host
Conclusion & Next Steps
Containers are built from four Linux kernel primitives: namespaces (what a process can see), cgroups (how much it can use), OverlayFS (layered filesystem), and seccomp plus capabilities (which syscalls and privileged operations are allowed). Docker, Podman, and CRI-O are all front ends to runtimes that call clone() with the right flags, create cgroup directories, set up an overlay mount, and exec your process. Understanding these primitives makes container debugging, security hardening, and performance tuning far more approachable, because under every docker run is just a Linux process.