
Part 14: containerd & runc Deep Dive

May 14, 2026 Wasil Zafar 28 min read

When you type docker run, at least four processes collaborate to create your container: the Docker CLI, the Docker daemon (dockerd), containerd, and runc. This article peels back Docker's user-friendly surface to reveal the machinery underneath — the high-level runtime (containerd) that manages images, snapshots, and container metadata, and the low-level runtime (runc) that directly manipulates Linux kernel primitives to create isolated processes.

Table of Contents

  1. The Runtime Stack
  2. containerd Architecture
  3. Image Management
  4. Snapshotters
  5. Container Lifecycle
  6. containerd-shim
  7. runc Deep Dive
  8. runc Internals
  9. The OCI Bundle
  10. Container Runtime Interface
  11. CRI-O
  12. Alternative Runtimes
  13. Exercises
  14. Conclusion & Next Steps

The Runtime Stack

Container runtimes exist at two distinct levels, each with a clear responsibility boundary:

  • High-level runtimes (containerd, CRI-O) — Manage the complete container lifecycle: pulling images, managing storage, creating network namespaces, supervising container processes, and exposing APIs
  • Low-level runtimes (runc, crun, youki) — Execute a single operation: given an OCI bundle (rootfs + config.json), create a container process with the specified isolation. No image pulling, no networking, no supervision

The Complete Call Chain

When you run docker run nginx, here's the actual chain of process invocations:

Docker Run Call Chain
flowchart TD
    A["docker CLI<br/>(client)"] -->|"REST API"| B["dockerd<br/>(Docker daemon)"]
    B -->|"gRPC"| C["containerd<br/>(high-level runtime)"]
    C -->|"exec"| D["containerd-shim<br/>(per-container process)"]
    D -->|"exec"| E["runc<br/>(low-level runtime)"]
    E -->|"clone() + exec()"| F["container process<br/>(nginx)"]
    style A fill:#f8f9fa,stroke:#132440
    style B fill:#f8f9fa,stroke:#132440
    style C fill:#f0f9f9,stroke:#3B9797
    style D fill:#f0f9f9,stroke:#3B9797
    style E fill:#fff5f5,stroke:#BF092F
    style F fill:#f8f9fa,stroke:#132440

Each component has a distinct lifecycle and can be restarted independently:

| Component | Responsibility | Can Restart Without Killing Containers? |
| --- | --- | --- |
| docker CLI | User interface, command parsing | N/A (stateless client) |
| dockerd | API server, image builds, networking, volumes, Swarm | Yes, if live-restore is enabled (by default, stopping the daemon stops its containers) |
| containerd | Image management, container metadata, task supervision | Yes (shims maintain containers) |
| containerd-shim | Holds stdio, reaps exit codes, reports status | No (one per container) |
| runc | Creates the container, then exits | N/A (short-lived process) |

Key Insight: runc is not a daemon. It sets up the container and exits, and the containerd-shim remains as the parent of the container process. This is why you can upgrade runc without affecting running containers: it is invoked only for discrete operations such as create, start, exec, and delete, and never stays resident.
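
To see which links of this chain exist on a given host, a small check along the following lines may help. It is only a sketch: the binary names are the ones used in this article, and distributions may install them under different names or outside PATH.

```shell
# Report which pieces of the runtime stack are installed on this host.
# Binary names follow the article; packaging can differ per distribution.
stack_report() {
  for bin in docker dockerd containerd containerd-shim-runc-v2 runc; do
    if path=$(command -v "$bin" 2>/dev/null); then
      printf '%-26s %s\n' "$bin" "$path"
    else
      printf '%-26s %s\n' "$bin" "(not found on PATH)"
    fi
  done
}

stack_report
```

A missing docker or dockerd with containerd and runc present is typical of a Kubernetes node that talks to containerd directly.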

containerd Architecture

containerd is a CNCF graduated project designed as an industry-standard container runtime. Unlike Docker's monolithic daemon, containerd is built around a plugin architecture where every major function is a plugin that can be swapped or extended.

containerd Component Architecture
flowchart TB
    subgraph API["gRPC API Layer"]
        A1[Images Service]
        A2[Containers Service]
        A3[Tasks Service]
        A4[Content Service]
        A5[Snapshots Service]
        A6[Namespaces Service]
    end
    subgraph Plugins["Plugin Layer"]
        P1["Runtime Plugin<br/>runc / kata / gvisor"]
        P2["Snapshotter Plugin<br/>overlayfs / native / btrfs"]
        P3["Content Store<br/>blob storage"]
        P4["Differ Plugin<br/>layer diffing"]
        P5["GC Plugin<br/>garbage collection"]
        P6["CRI Plugin<br/>Kubernetes interface"]
    end
    subgraph Storage["Storage Layer"]
        S1[("Content Store<br/>blobs by digest")]
        S2[("Metadata Store<br/>BoltDB")]
        S3[("Snapshots<br/>filesystem layers")]
    end
    API --> Plugins
    Plugins --> Storage
    style API fill:#f0f9f9,stroke:#3B9797
    style Plugins fill:#f8f9fa,stroke:#132440
    style Storage fill:#fff5f5,stroke:#BF092F

The gRPC API

containerd exposes all functionality through a gRPC API over a Unix socket (default: /run/containerd/containerd.sock). This API is the interface used by Docker, Kubernetes (via CRI plugin), and the ctr CLI.

# Check containerd is running
sudo systemctl status containerd

# containerd configuration file
cat /etc/containerd/config.toml

# List containerd plugins and their status
sudo ctr plugins ls

# Example output showing plugin types:
# TYPE                                   ID                       PLATFORMS   STATUS
# io.containerd.content.v1               content                  -           ok
# io.containerd.snapshotter.v1           overlayfs                linux/amd64 ok
# io.containerd.runtime.v2               task                     linux/amd64 ok
# io.containerd.grpc.v1                  cri                      linux/amd64 ok
# io.containerd.service.v1               containers-service       -           ok
# io.containerd.service.v1               tasks-service            -           ok
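
Before reaching for ctr, it can be useful to confirm that the API socket exists at all. The helper below is a sketch; its name is made up for illustration, and the default path is the one quoted above, which a custom config.toml may override.

```shell
# Report whether a path is a Unix socket.
# Default path taken from the text above; pass another path to override it.
socket_status() {
  sock=${1:-/run/containerd/containerd.sock}
  if [ -S "$sock" ]; then
    echo "socket present: $sock"
  else
    echo "no socket at: $sock"
  fi
}

socket_status
```

A present socket only shows that something created it; whether containerd is answering is what `ctr plugins ls` then confirms.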

containerd uses namespaces to isolate different clients. Docker's containers live in the moby namespace, Kubernetes uses k8s.io, and ctr falls back to the namespace named default unless you pass -n. These are containerd's own administrative namespaces, unrelated to Linux kernel namespaces, and they keep Docker and Kubernetes from interfering with each other on the same node.

# List all containerd namespaces
sudo ctr namespaces ls
# NAME    LABELS
# default
# moby    (Docker containers live here)
# k8s.io  (Kubernetes containers live here)

# Work in a specific namespace
sudo ctr -n moby containers ls
sudo ctr -n k8s.io containers ls

containerd: Image Management

containerd's image management is built on top of its content store — a content-addressable blob storage that holds all image layers, manifests, and configurations referenced by their SHA256 digest.

# Pull an image with ctr (containerd's CLI)
sudo ctr images pull docker.io/library/nginx:alpine
# docker.io/library/nginx:alpine: resolved
# index-sha256:... done
# manifest-sha256:... done
# layer-sha256:... done
# config-sha256:... done
# elapsed: 4.2 s

# List pulled images
sudo ctr images ls
# REF                            TYPE                                    DIGEST          SIZE
# docker.io/library/nginx:alpine application/vnd.oci.image.index.v1+json sha256:a1b2c3.. 44.2 MiB

# Inspect image details
sudo ctr images info docker.io/library/nginx:alpine

# Check image content (layers, config)
sudo ctr content ls | head -10
# DIGEST                                                                  SIZE    AGE
# sha256:a1b2c3d4e5f6...  7.6 kB  2 minutes
# sha256:b2c3d4e5f6a1...  3.4 MB  2 minutes
# sha256:c3d4e5f6a1b2...  28.1 MB 2 minutes

# Read a specific blob from the content store
sudo ctr content get sha256:a1b2c3d4e5f6... | jq .
Content Store Location: By default, containerd stores all content at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/. Each file is named by its SHA256 digest. This is pure content-addressable storage — identical blobs are stored exactly once regardless of how many images reference them.
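
The deduplication property is easy to reproduce outside containerd. The sketch below builds a toy content-addressable store in a temporary directory; the function names are invented for illustration, it assumes GNU sha256sum is available, and it does not touch containerd's real store.

```shell
# Minimal content-addressable store: a blob's file name is its SHA-256 digest.
# STORE mimics the blobs/sha256 layout in a temp dir; it is not containerd's store.
STORE=$(mktemp -d)/blobs/sha256
mkdir -p "$STORE"

# cas_put FILE: copy FILE into the store under its digest, print "sha256:<hex>".
cas_put() {
  digest=$(sha256sum "$1" | cut -d' ' -f1)
  # Identical content maps to the same name, so a second put stores nothing new.
  [ -e "$STORE/$digest" ] || cp "$1" "$STORE/$digest"
  echo "sha256:$digest"
}

# cas_verify DIGEST: re-hash the stored blob and compare the result with its name.
cas_verify() {
  hex=${1#sha256:}
  [ "$(sha256sum "$STORE/$hex" | cut -d' ' -f1)" = "$hex" ]
}

tmp=$(mktemp -d)
printf 'layer-data' > "$tmp/a"
printf 'layer-data' > "$tmp/b"     # same bytes, different file name
d1=$(cas_put "$tmp/a")
d2=$(cas_put "$tmp/b")
echo "$d1"
ls "$STORE" | wc -l                # counts blobs: one file for two puts
cas_verify "$d1" && echo "digest verified"
```

Because the name is derived from the bytes, verification needs no separate checksum file: re-hashing the blob and comparing against its own name is enough.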

containerd: Snapshotters

Snapshotters are containerd's abstraction for preparing filesystem layers. They take the compressed tar layers from the content store and prepare them as mountable filesystem views. The snapshotter interface supports different backend technologies:

| Snapshotter | Backend | Copy-on-Write | Best For |
| --- | --- | --- | --- |
| overlayfs | OverlayFS kernel module | Yes | Default, general purpose (Linux 4.0+) |
| native | Plain directory copies | No | Filesystems without overlay support |
| btrfs | Btrfs subvolumes | Yes (filesystem-level) | Btrfs-based systems |
| zfs | ZFS clones | Yes (filesystem-level) | ZFS-based systems |
| devmapper | Device mapper thin-provisioning | Yes (block-level) | AWS Firecracker, block storage |
| stargz | Lazy-pulling eStargz images | Yes | Large images, fast startup |
| nydus | Nydus image format | Yes | Container image acceleration |

# List all snapshots
sudo ctr snapshots ls
# KEY                                    PARENT                                 KIND
# sha256:a1b2c3...                                                             Committed
# sha256:b2c3d4...                       sha256:a1b2c3...                       Committed
# sha256:c3d4e5...                       sha256:b2c3d4...                       Committed
# nginx-container                        sha256:c3d4e5...                       Active

# Inspect a snapshot's details (mounts, parent chain)
sudo ctr snapshots info sha256:a1b2c3...
# {
#   "Kind": "Committed",
#   "Name": "sha256:a1b2c3...",
#   "Created": "2026-05-14T10:00:00Z",
#   "Updated": "2026-05-14T10:00:00Z"
# }

# View mount instructions for a snapshot
sudo ctr snapshots mounts /tmp/mnt nginx-container
# mount -t overlay overlay -o
#   lowerdir=/var/lib/containerd/.../sha256:c3d4e5...,
#   upperdir=/var/lib/containerd/.../nginx-container/fs,
#   workdir=/var/lib/containerd/.../nginx-container/work
#   /tmp/mnt

# Prepare a new active snapshot (writable layer)
sudo ctr snapshots prepare my-new-layer sha256:c3d4e5...

The snapshot hierarchy mirrors the image layer stack. Each committed snapshot represents a read-only layer. An active snapshot is a writable layer created on top of committed parents — this is the container's writable layer where runtime changes are stored.
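
For the overlayfs snapshotter, the parent chain ends up as the lowerdir list of a mount call. The sketch below assembles that option string from a chain of layer directories. The directory names are hypothetical and the helper is not part of containerd; the one detail it illustrates is that overlayfs reads lowerdir from top-most layer to bottom-most, the reverse of the order in which layers are stacked.

```shell
# overlay_opts UPPER WORK LAYER...: print overlayfs mount options.
# LAYER arguments are given bottom-first (base layer first), matching the
# parent chain; overlayfs expects lowerdir top-first, so they are reversed.
overlay_opts() {
  upper=$1; work=$2; shift 2
  lower=""
  for layer in "$@"; do
    lower="$layer${lower:+:$lower}"   # prepend: last argument ends up leftmost
  done
  echo "lowerdir=$lower,upperdir=$upper,workdir=$work"
}

# Hypothetical directories for a three-layer image plus one writable layer.
overlay_opts /snap/rw/fs /snap/rw/work /snap/1/fs /snap/2/fs /snap/3/fs
```

The committed snapshots supply the lowerdir entries, and the active snapshot supplies upperdir and workdir.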

containerd: Container Lifecycle

containerd distinguishes between a container (metadata object) and a task (running process). A container is a static record of what to run. A task is the running instance of that container. This separation allows you to create containers without starting them, and to re-create tasks from the same container definition.

# Pull an image first
sudo ctr images pull docker.io/library/alpine:3.19

# Create a container (metadata only; nothing is running yet)
# A long-running command is given here so the detached task stays alive
sudo ctr containers create docker.io/library/alpine:3.19 my-alpine sleep 3600

# List containers (note: no running process yet)
sudo ctr containers ls
# CONTAINER    IMAGE                             RUNTIME
# my-alpine    docker.io/library/alpine:3.19     io.containerd.runc.v2

# Inspect container metadata
sudo ctr containers info my-alpine

# Start a task (this creates the actual running process)
sudo ctr tasks start -d my-alpine

# List running tasks
sudo ctr tasks ls
# TASK        PID      STATUS
# my-alpine   12345    RUNNING

# Execute a command in the running task
sudo ctr tasks exec --exec-id shell1 -t my-alpine /bin/sh

# View task process metrics
sudo ctr tasks metrics my-alpine

# Kill the task (container metadata remains)
# PID 1 in a container ignores SIGTERM unless it installs a handler, so send SIGKILL
sudo ctr tasks kill -s SIGKILL my-alpine

# Delete the task
sudo ctr tasks delete my-alpine

# Delete the container (metadata)
sudo ctr containers delete my-alpine
Container vs Task: Think of it like a program (file on disk) vs a process (running instance). A container is the "recipe" — image reference, runtime config, snapshot. A task is the "execution" — PID, status, I/O streams. You can delete and recreate tasks without losing the container's configuration.
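
The same split can be imitated with ordinary files and processes. The sketch below is an analogy only: it never talks to containerd, and the function names are made up. A "container" is a spec file on disk, a "task" is a process started from that spec, and deleting the task leaves the spec untouched.

```shell
# Analogy only: a "container" is a spec file, a "task" is a live process.
# Nothing here talks to containerd; the function names are invented.
STATE=$(mktemp -d)

container_create() {            # container_create NAME COMMAND...
  name=$1; shift
  echo "$*" > "$STATE/$name.spec"
}

task_start() {                  # task_start NAME: run the recorded command
  sh -c "exec $(cat "$STATE/$1.spec")" >/dev/null 2>&1 &
  echo $! > "$STATE/$1.pid"
}

task_delete() {                 # task_delete NAME: stop the process, keep the spec
  pid=$(cat "$STATE/$1.pid")
  kill "$pid" 2>/dev/null
  wait "$pid" 2>/dev/null || true
  rm -f "$STATE/$1.pid"
}

container_create demo sleep 30
task_start demo
first=$(cat "$STATE/demo.pid")
task_delete demo
task_start demo                 # same spec, new process
second=$(cat "$STATE/demo.pid")
task_delete demo
echo "spec: $(cat "$STATE/demo.spec")"
[ "$first" != "$second" ] && echo "two different PIDs from one container record"
```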

containerd-shim

The containerd-shim is a critical but often overlooked component. One shim process exists per container, serving as the direct parent of the container process. The shim exists for three reasons:

  1. Daemon-less containers — The shim allows containerd to restart without killing running containers. The shim keeps the container alive independently.
  2. Exit code reaping — In Linux, when a process exits, its parent must call wait() to collect the exit status. The shim is the parent that reaps the container process.
  3. stdio management — The shim holds the container's stdin/stdout/stderr file descriptors and forwards them to logging systems (FIFO pipes or files).
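
The second role, exit code reaping, is the same mechanism a shell uses for background jobs. The sketch below reduces it to plain shell; run_and_reap is a name invented for this example, and the real shim does considerably more (stdio forwarding, status reporting to containerd).

```shell
# The parent role the shim plays, reduced to plain shell:
# start a child, stay alive as its parent, and collect its exit status.
run_and_reap() {
  "$@" &                 # child runs in the background
  child=$!
  wait "$child"          # blocks until the child exits and yields its status
  status=$?
  echo "child $child exited with status $status"
  return "$status"
}

run_and_reap sh -c 'exit 42' || echo "reaped a non-zero status"
```

If no parent ever called wait, the exited child would linger as a zombie and its status would be lost, which is why each container needs a long-lived parent that outlasts runc.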
containerd-shim Isolation Model
flowchart TB
    subgraph Daemon["containerd daemon"]
        CD["containerd process<br/>PID 1000"]
    end
    subgraph Shims["Per-Container Shims"]
        S1["shim (nginx)<br/>PID 2001"]
        S2["shim (redis)<br/>PID 2002"]
        S3["shim (postgres)<br/>PID 2003"]
    end
    subgraph Containers["Container Processes"]
        C1["nginx<br/>PID 3001"]
        C2["redis-server<br/>PID 3002"]
        C3["postgres<br/>PID 3003"]
    end
    CD -.->|"ttrpc"| S1
    CD -.->|"ttrpc"| S2
    CD -.->|"ttrpc"| S3
    S1 -->|"parent of"| C1
    S2 -->|"parent of"| C2
    S3 -->|"parent of"| C3
    style Daemon fill:#f0f9f9,stroke:#3B9797
    style Shims fill:#f8f9fa,stroke:#132440
    style Containers fill:#fff5f5,stroke:#BF092F

# View shim processes on a running system
ps aux | grep containerd-shim
# root  2001  containerd-shim-runc-v2 -namespace moby -id abc123...
# root  2002  containerd-shim-runc-v2 -namespace moby -id def456...
# root  2003  containerd-shim-runc-v2 -namespace moby -id ghi789...

# Each shim's parent is PID 1 (init), NOT containerd
# This is by design — allows containerd restart
ps -o pid,ppid,comm -p 2001
# PID   PPID  COMMAND
# 2001  1     containerd-shim-runc-v2

# Shim communicates with containerd via ttrpc (a lightweight, gRPC-style protocol)
# Socket location per container:
ls /run/containerd/io.containerd.runtime.v2.task/moby/abc123.../
# address  config.json  init.pid  log  log.json  shim.pid
Shim Version: containerd has shipped two shim generations: containerd-shim (v1, legacy) and containerd-shim-runc-v2 (v2, current). The v2 shim speaks ttrpc, can serve a whole Kubernetes pod from one shim process instead of one per container, and uses fewer resources. Always use v2: v1 is deprecated and was removed in containerd 2.0.

runc Deep Dive

runc is the OCI reference implementation — the canonical low-level container runtime originally extracted from Docker's codebase. Written in Go, it directly manipulates Linux kernel features (namespaces, cgroups, capabilities, seccomp) to create containers. It takes an OCI bundle as input and produces a running, isolated process.

# Check runc version and supported features
runc --version
# runc version 1.1.12
# commit: v1.1.12-0-g51d5e946
# spec: 1.1.0
# go: go1.21.6
# libseccomp: 2.5.4

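# Generate a default OCI config.json
mkdir -p mycontainer/rootfs
cd mycontainer
runc spec
# Creates config.json with sensible defaults

# Create a rootfs from an Alpine image
docker export $(docker create alpine:3.19) | tar -xC rootfs/

# The default spec requests a terminal and runs "sh". A detached
# "runc create" needs the terminal disabled and a long-running command.
sed -i 's/"terminal": true/"terminal": false/; s/"sh"/"sleep", "3600"/' config.json

# Create a container (does NOT start it)
sudo runc create my-container
# Container is now in "created" state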

# Check container state
sudo runc state my-container
# {
#   "ociVersion": "1.1.0",
#   "id": "my-container",
#   "pid": 45678,
#   "status": "created",
#   "bundle": "/home/user/mycontainer",
#   "rootfs": "/home/user/mycontainer/rootfs",
#   "created": "2026-05-14T10:00:00.123456789Z"
# }

# Start the container (transitions to "running")
sudo runc start my-container

# List running containers managed by runc
sudo runc list
# ID              PID     STATUS   BUNDLE                          CREATED
# my-container    45678   running  /home/user/mycontainer          2026-05-14T10:00:00Z

# Execute a command in the running container
sudo runc exec my-container ls /

# Send a signal to the container
# (PID 1 ignores SIGTERM unless it installs a handler, so SIGKILL is used here)
sudo runc kill my-container SIGKILL

# Delete the container
sudo runc delete my-container

runc Internals

When runc creates a container, it performs a carefully orchestrated sequence of kernel operations. Understanding this sequence reveals exactly what "creating a container" means at the system level:

runc Container Creation Sequence
sequenceDiagram
    participant P as Parent (runc)
    participant I as Init Process (runc init)
    participant K as Kernel

    P->>K: clone(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC)
    K-->>I: New process in new namespaces
    I->>K: mount("none", "/", MS_REC|MS_PRIVATE)
    I->>K: Setup mounts (/proc, /dev, /sys, tmpfs)
    I->>K: pivot_root(rootfs, old_root)
    I->>K: umount(old_root, MNT_DETACH)
    I->>K: Set hostname (UTS namespace)
    P->>K: Configure cgroups for the init PID (memory, cpu, pids)
    I->>K: Apply seccomp filter (BPF program)
    I->>K: Drop capabilities (keep only allowed set)
    I->>K: setuid/setgid (drop root if configured)
    I->>P: Signal ready (via pipe)
    P-->>I: Signal start (via pipe)
    I->>K: exec(container_entrypoint)
    Note over I: Init process replaced by<br/>application process

Step-by-Step Breakdown

  1. clone() with namespace flags: Creates a new process in fresh namespaces. CLONE_NEWPID makes it PID 1 inside the container, CLONE_NEWNET gives it an isolated network stack, and CLONE_NEWNS gives it a private mount table.
  2. Mount propagation: Marks every mount as private (MS_REC|MS_PRIVATE) so that mounts made inside the container don't propagate to the host and vice versa.
  3. Set up special filesystems: Mounts /proc (process information scoped to the PID namespace), /dev (minimal device nodes), and /sys (read-only sysfs).
  4. pivot_root(): Switches the root filesystem to the container's rootfs. Unlike chroot, which only changes how one process resolves "/", pivot_root swaps the root mount itself. Once the old root is detached with umount, the container has no path left that leads back to the host filesystem.
  5. Cgroup configuration: Places the container process into the appropriate cgroups to enforce resource limits (memory, CPU, PIDs, I/O bandwidth). The parent runc process does this from the host side, because the cgroup hierarchy is managed outside the container.
  6. seccomp filter: Installs a BPF (Berkeley Packet Filter) program that the kernel evaluates on every system call, rejecting disallowed calls before they execute.
  7. Capability drop: Linux capabilities split root's powers into roughly 40 discrete privileges. The container keeps only the configured set; the default written by runc spec is CAP_AUDIT_WRITE, CAP_KILL, and CAP_NET_BIND_SERVICE.
  8. exec(): The init process replaces itself with the actual container application. From this point, runc's code is gone and only the application code runs.
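
The capability drop in step 7 is bit arithmetic underneath: each capability has a number, and a capability set is a mask with those bits set, which is what the CapEff line in /proc/<pid>/status shows in hex. The sketch below builds the mask for the three capabilities named above. The capability numbers come from linux/capability.h; the helper names are invented for illustration.

```shell
# Capability sets are bitmasks: bit N set means capability number N is held.
# Numbers below are the values defined in <linux/capability.h>.
CAP_KILL=5
CAP_NET_BIND_SERVICE=10
CAP_SYS_ADMIN=21
CAP_AUDIT_WRITE=29

# cap_mask N...: combine capability numbers into one hex mask.
cap_mask() {
  mask=0
  for n in "$@"; do mask=$(( mask | (1 << n) )); done
  printf '0x%016x\n' "$mask"
}

# has_cap MASK N: succeed if bit N is set in MASK.
has_cap() { [ $(( ($1 >> $2) & 1 )) -eq 1 ]; }

kept=$(cap_mask "$CAP_KILL" "$CAP_NET_BIND_SERVICE" "$CAP_AUDIT_WRITE")
echo "kept set: $kept"
has_cap "$kept" "$CAP_NET_BIND_SERVICE" && echo "may bind ports below 1024"
has_cap "$kept" "$CAP_SYS_ADMIN" || echo "CAP_SYS_ADMIN dropped"
```

The same has_cap test can be pointed at the CapEff value of a running container process to check what it actually retained.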
Why Two Processes? runc uses a "parent-child" or "double-fork" pattern. The parent process (runc) stays in the host namespaces and orchestrates. The child process (runc init) enters the new namespaces and sets up the container from the inside. They communicate via a pipe. This is necessary because some namespace operations can only be performed from inside the new namespace (e.g., you can only mount /proc for the new PID namespace from within it).
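
Whether two processes really sit in different namespaces can be checked from /proc, where every namespace a process belongs to appears as a symlink carrying an identifier. The sketch below prints those identifiers for a PID; it assumes a Linux host with /proc mounted, and ns_ids is a name made up for this example.

```shell
# Print the namespace identifiers of a process (default: this shell).
# Two processes share a namespace exactly when the identifiers match.
ns_ids() {
  pid=${1:-$$}
  for ns in mnt pid net uts ipc; do
    printf '%s\t%s\n' "$ns" "$(readlink "/proc/$pid/ns/$ns" 2>/dev/null || echo unavailable)"
  done
}

ns_ids            # namespaces of the current shell
```

Run as root with the PID reported by `runc state`, the same call should show identifiers that differ from the host shell's for every namespace the container was given.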

The OCI Bundle

An OCI bundle is the complete input to a low-level runtime. You can create one manually to understand exactly what runc receives from containerd:

# Create directory structure
mkdir -p my-oci-bundle/rootfs

# Option 1: Export a Docker image's filesystem
docker export $(docker create busybox:latest) | tar -xC my-oci-bundle/rootfs/

# Option 2: Build a minimal rootfs from scratch
mkdir -p my-oci-bundle/rootfs/{bin,dev,proc,sys,tmp}
cp /bin/busybox my-oci-bundle/rootfs/bin/
cd my-oci-bundle/rootfs/bin && for cmd in sh ls cat echo ps mount; do ln -s busybox $cmd; done && cd ../../..

# Generate the default config.json
cd my-oci-bundle
runc spec

# Customize config.json: change the process
# Edit config.json to set: "args": ["/bin/sh", "-c", "echo Hello from OCI bundle && ps aux"]

# Run the container directly with runc
sudo runc run my-first-oci-container

# runc run deletes the container once its process exits,
# so the list no longer contains it
sudo runc list

The minimal viable OCI bundle needs only:

  • A directory (rootfs/) with at least one executable file
  • A config.json that references that executable in process.args

Everything else in config.json — namespaces, mounts, cgroups, capabilities — provides isolation and resource control. Without them, you'd have a chrooted process with no real security boundary.
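
Those two requirements can be turned into a quick pre-flight check. The sketch below is structural only: check_bundle is an invented helper, it does not validate config.json against the OCI runtime specification, and the config written in the demo is a stub that runc itself would reject.

```shell
# check_bundle DIR: verify the two things the text lists as mandatory.
# Structural check only; config.json is not validated against the OCI spec.
check_bundle() {
  dir=$1
  [ -d "$dir/rootfs" ]      || { echo "missing rootfs/ directory"; return 1; }
  [ -f "$dir/config.json" ] || { echo "missing config.json"; return 1; }
  grep -q '"args"' "$dir/config.json" \
                            || { echo "config.json has no process args"; return 1; }
  echo "bundle looks complete: $dir"
}

# Demo with a throwaway bundle; the config is a stub, not a usable spec.
b=$(mktemp -d)
check_bundle "$b" || true
mkdir "$b/rootfs"
printf '{"process": {"args": ["/bin/sh"]}}\n' > "$b/config.json"
check_bundle "$b"
```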

Container Runtime Interface (CRI)

The Container Runtime Interface (CRI) is Kubernetes' abstraction layer for container runtimes. It's a gRPC protocol that defines how the kubelet (Kubernetes node agent) communicates with any container runtime, without knowing the implementation details.

CRI Architecture in Kubernetes
flowchart LR
    subgraph K8s["Kubernetes Node"]
        KL[kubelet]
    end
    subgraph CRI["CRI Protocol (gRPC)"]
        RS[RuntimeService]
        IS[ImageService]
    end
    subgraph Runtimes["CRI Implementations"]
        CD["containerd<br/>+ CRI plugin"]
        CO[CRI-O]
    end
    subgraph Low["Low-Level"]
        R1[runc]
        R2[kata]
    end
    KL --> RS
    KL --> IS
    RS --> CD
    RS --> CO
    IS --> CD
    IS --> CO
    CD --> R1
    CD --> R2
    CO --> R1
    style K8s fill:#f8f9fa,stroke:#132440
    style CRI fill:#f0f9f9,stroke:#3B9797
    style Runtimes fill:#f8f9fa,stroke:#132440
    style Low fill:#fff5f5,stroke:#BF092F

CRI Services

| Service | Key Operations | Purpose |
| --- | --- | --- |
| RuntimeService | RunPodSandbox, CreateContainer, StartContainer, StopContainer, RemoveContainer, ListContainers, ExecSync | Container lifecycle management |
| ImageService | PullImage, ListImages, RemoveImage, ImageStatus, ImageFsInfo | Image operations |

# Use crictl to interact with the CRI directly
# (crictl is the CRI equivalent of docker CLI)

# Configure crictl to use containerd's CRI socket
cat /etc/crictl.yaml
# runtime-endpoint: unix:///run/containerd/containerd.sock
# image-endpoint: unix:///run/containerd/containerd.sock

# List pods (sandboxes)
sudo crictl pods
# POD ID    CREATED     STATE   NAME              NAMESPACE   ATTEMPT
# abc123    2 hours ago Ready   nginx-pod         default     0

# List containers
sudo crictl ps
# CONTAINER  IMAGE     CREATED     STATE     NAME    ATTEMPT  POD ID
# def456     nginx     2 hours ago Running   nginx   0        abc123

# Pull an image via CRI
sudo crictl pull docker.io/library/nginx:alpine

# Inspect a container
sudo crictl inspect def456

# View container logs
sudo crictl logs def456

# Execute a command in a container
sudo crictl exec -it def456 /bin/sh
Why Kubernetes Dropped Docker: In Kubernetes 1.24, the "dockershim" was removed. Kubernetes didn't stop supporting Docker images — it stopped using Docker as a runtime. The problem: Docker doesn't implement CRI natively. The dockershim was an adapter that translated CRI calls to Docker API calls. This extra layer added complexity with no benefit — containerd (which Docker uses internally anyway) implements CRI directly. Removing the shim simplified the stack: kubelet → containerd → runc, instead of kubelet → dockershim → dockerd → containerd → runc.

CRI-O

CRI-O is an alternative high-level runtime purpose-built for Kubernetes. Unlike containerd (which serves Docker, Kubernetes, and standalone use), CRI-O is designed only to implement the CRI protocol. This narrower scope keeps it lighter and reduces its attack surface.

| Feature | containerd | CRI-O |
| --- | --- | --- |
| Primary Purpose | General-purpose container runtime | Kubernetes-only runtime |
| CRI Support | Via built-in CRI plugin | Native (it is a CRI implementation) |
| Docker Compatibility | Yes (Docker uses containerd) | No (Kubernetes only) |
| CLI Tool | ctr, nerdctl | crictl only |
| Image Building | Supports BuildKit | No (use Buildah externally) |
| Default Snapshotter | overlayfs | overlayfs |
| Version Sync | Independent releases | Tracks Kubernetes versions (1.28, 1.29...) |
| Maintained By | CNCF (graduated) | CNCF (graduated in 2023) |
| Used By | Docker, AWS EKS, GKE, AKS | Red Hat OpenShift, Fedora CoreOS |

# CRI-O configuration
cat /etc/crio/crio.conf

# Key settings:
# [crio.runtime]
# default_runtime = "runc"
# conmon_cgroup = "pod"
#
# [crio.runtime.runtimes.runc]
# runtime_path = "/usr/bin/runc"
# runtime_type = "oci"
#
# [crio.runtime.runtimes.kata]
# runtime_path = "/usr/bin/kata-runtime"
# runtime_type = "oci"
# privileged_without_host_devices = true

# Check CRI-O status
sudo systemctl status crio

# CRI-O also uses crictl for interaction
sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock pods

Alternative Runtimes

OCI compliance enables a rich ecosystem of alternative low-level runtimes, each optimised for different use cases:

| Runtime | Language | Key Differentiator | Startup Overhead | Use Case |
| --- | --- | --- | --- | --- |
| runc | Go | Reference implementation, battle-tested | ~100ms | General purpose, default everywhere |
| crun | C | 2× faster startup, 50% less memory | ~50ms | Performance-critical, Podman default |
| youki | Rust | Memory safety, no GC pauses | ~60ms | Security-focused, emerging alternative |
| kata-containers | Go/Rust | Runs each container in a lightweight VM | ~500ms | Multi-tenant clouds, untrusted workloads |
| gVisor (runsc) | Go | User-space kernel intercepts all syscalls | ~150ms | Defence in depth, Google Cloud Run |
| Firecracker | Rust | MicroVM with minimal device model | ~125ms | AWS Lambda, AWS Fargate |
| WasmEdge | C++/Rust | WebAssembly runtime as OCI runtime | ~1ms | Lightweight serverless, edge computing |

Choosing a Runtime: Decision Matrix

The choice of runtime depends on your threat model and performance requirements:

  • Single-tenant, trusted workloads → runc or crun (maximum performance, namespace isolation is sufficient)
  • Multi-tenant, untrusted workloads → kata-containers or gVisor (additional VM/kernel boundary between tenants)
  • Serverless/FaaS platforms → Firecracker (fast VM boot, minimal overhead, strong isolation for function execution)
  • Edge computing, plugins → WasmEdge (sub-millisecond startup, sandboxed execution, language-agnostic)
  • Maximum startup performance → crun (C implementation, no garbage collector, minimal overhead)

Key principle: All OCI-compliant runtimes run the same images. Your choice of runtime is independent of your image format. You can switch runtimes per workload in Kubernetes using RuntimeClass resources.
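
The decision list can be written down as a lookup, which is sometimes handy in provisioning scripts that choose a RuntimeClass per workload. The sketch below only restates the mapping given above; the profile names are invented for this example and carry no meaning to Kubernetes or to any runtime.

```shell
# suggest_runtime PROFILE: the decision list above as a lookup table.
# Profile names are invented for this sketch; the mapping is the article's.
suggest_runtime() {
  case $1 in
    trusted)     echo "runc or crun" ;;
    untrusted)   echo "kata-containers or gVisor" ;;
    serverless)  echo "Firecracker" ;;
    edge)        echo "WasmEdge" ;;
    fast-start)  echo "crun" ;;
    *)           echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

for p in trusted untrusted serverless edge fast-start; do
  printf '%-11s -> %s\n' "$p" "$(suggest_runtime "$p")"
done
```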

# Kubernetes RuntimeClass: run specific pods with kata-containers
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
scheduling:
  nodeSelector:
    kubernetes.io/runtime: kata
---
# Use the RuntimeClass in a Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: untrusted-image:latest
    resources:
      limits:
        memory: "256Mi"
        cpu: "500m"

Exercises

Hands-On 30 minutes
Exercise 1: Run a Container with runc Directly

Bypass Docker entirely and create a container using only runc:

# Create the OCI bundle structure
mkdir -p runc-exercise/rootfs
cd runc-exercise

# Extract Alpine's filesystem
docker export $(docker create alpine:3.19) | tar -xC rootfs/

# Generate config.json
runc spec

# Edit config.json to run a custom command
# Change "args": ["sh"] to "args": ["/bin/sh", "-c", "echo PID: $$ && hostname && cat /proc/self/cgroup && sleep 30"]
# (the trailing sleep keeps the container alive long enough to inspect it)

# Run the container
sudo runc run exercise-container

# In another terminal, while the sleep is still running, check the container state
sudo runc state exercise-container
sudo runc list
Hands-On 25 minutes
Exercise 2: Explore containerd with ctr

Use containerd's native CLI to pull images, create containers, and manage tasks:

# Pull an image directly through containerd
sudo ctr images pull docker.io/library/busybox:latest

# Create a container (metadata only) with a specific command
# The command belongs to the container definition, so it is given at create time
sudo ctr containers create docker.io/library/busybox:latest my-busybox /bin/sh -c "while true; do date; sleep 5; done"

# Start a task from that definition
sudo ctr tasks start -d my-busybox

# List tasks and check PID
sudo ctr tasks ls

# Exec into the running task
sudo ctr tasks exec --exec-id debug -t my-busybox /bin/sh

# View the container's snapshot
sudo ctr snapshots ls | grep busybox

# Cleanup (SIGKILL, because PID 1 ignores SIGTERM unless it installs a handler)
sudo ctr tasks kill -s SIGKILL my-busybox
sudo ctr tasks delete my-busybox
sudo ctr containers delete my-busybox
Hands-On 20 minutes
Exercise 3: Trace the Runtime Stack

Start a Docker container and observe the entire process hierarchy:

# Start a long-running container
docker run -d --name trace-test nginx:alpine

# Find all related processes
CONTAINER_PID=$(docker inspect trace-test --format '{{.State.Pid}}')
echo "Container PID: $CONTAINER_PID"

# Show the process tree from init down to the container
pstree -p -s $CONTAINER_PID
# systemd(1)───containerd-shim(5678)───nginx(9012)───nginx(9013)

# Verify the shim is the parent
ps -o pid,ppid,comm -p $CONTAINER_PID
# PID    PPID  COMMAND
# 9012   5678  nginx

# Check the shim's parent (should be PID 1, not containerd)
SHIM_PID=$(ps -o ppid= -p $CONTAINER_PID | tr -d ' ')
ps -o pid,ppid,comm -p $SHIM_PID
# PID    PPID  COMMAND
# 5678   1     containerd-shim

# Cleanup
docker rm -f trace-test

Conclusion & Next Steps

The container runtime stack is a masterpiece of separation of concerns. containerd handles the "what" — managing images, snapshots, and container metadata. runc handles the "how" — creating isolated processes using Linux kernel primitives. The containerd-shim bridges them, providing daemon-less container survival and exit code reaping.

Key takeaways:

  • containerd is a high-level runtime managing the full container lifecycle: images (content store), filesystem preparation (snapshotters), and process supervision (tasks)
  • runc is a short-lived process that creates containers from OCI bundles using clone(), pivot_root(), cgroups, seccomp, and capabilities — then exits
  • containerd-shim is the per-container process that parents the container, enabling daemon restarts without killing workloads
  • CRI abstracts runtime differences away from Kubernetes, enabling containerd and CRI-O to be interchangeable
  • Alternative runtimes (kata, gVisor, crun) provide different trade-offs between isolation strength and performance, all using the same OCI image format

Next in the Series

In Part 15: Security Fundamentals, we'll explore the security model that makes containers safe — from Linux capabilities and seccomp profiles to AppArmor/SELinux policies, rootless containers, and image signing. You'll learn to harden containers for production and understand the threat models that inform runtime design decisions.