Defence in Depth
No single security control is sufficient. Runtime security follows the defence-in-depth principle: multiple independent layers so that if one fails, others still protect the system. An attacker who bypasses seccomp still faces AppArmor; one who escapes AppArmor still hits a read-only filesystem with no capabilities.
Security Layer Stack
flowchart TB
A["Application Code"] --> B["User Namespace (non-root)"]
B --> C["Capability Restrictions"]
C --> D["Seccomp (syscall filtering)"]
D --> E["AppArmor / SELinux (MAC)"]
E --> F["Read-Only Filesystem"]
F --> G["Resource Limits (cgroups)"]
G --> H["Host Kernel"]
style A fill:#f8f9fa,stroke:#132440
style B fill:#f0f9f9,stroke:#3B9797
style C fill:#f0f9f9,stroke:#3B9797
style D fill:#f0f9f9,stroke:#3B9797
style E fill:#f0f9f9,stroke:#3B9797
style F fill:#f0f9f9,stroke:#3B9797
style G fill:#f0f9f9,stroke:#3B9797
style H fill:#fff5f5,stroke:#BF092F
Each layer addresses a different class of attack:
| Layer | Defends Against | Mechanism |
|---|---|---|
| Non-root user | Privilege escalation starting point | USER directive, user namespaces |
| Capabilities | Excessive root-like powers | --cap-drop ALL, selective --cap-add |
| Seccomp | Kernel exploitation via syscalls | BPF filter blocking dangerous syscalls |
| AppArmor/SELinux | File access, network, mount abuse | Mandatory access control policies |
| Read-only FS | Persistence, malware installation | --read-only flag, tmpfs for writes |
| Resource limits | DoS, cryptomining, fork bombs | cgroup memory, CPU, PID limits |
Linux Capabilities
Traditionally, Linux has two privilege levels: root (UID 0, can do everything) and non-root (restricted). Capabilities split root's powers into ~40 distinct privileges that can be independently granted or revoked.
Docker grants containers a subset of capabilities by default — more than necessary for most applications, but less than full root:
Default Docker Capabilities
| Capability | What It Allows | Usually Needed? |
|---|---|---|
| CAP_CHOWN | Change file ownership | Rarely |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | Rarely |
| CAP_FSETID | Don't clear set-user-ID/set-group-ID bits on file modify | Rarely |
| CAP_KILL | Send signals to any process | Sometimes |
| CAP_SETGID | Manipulate process GIDs | Sometimes |
| CAP_SETUID | Manipulate process UIDs | Sometimes |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 | Often (web servers) |
| CAP_NET_RAW | Use RAW/PACKET sockets (ping, tcpdump) | Rarely |
| CAP_SYS_CHROOT | Use chroot() | Rarely |
| CAP_MKNOD | Create special files using mknod() | Rarely |
| CAP_AUDIT_WRITE | Write to kernel audit log | Rarely |
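| CAP_SETFCAP | Set file capabilities | Rarely |
| CAP_FOWNER | Bypass permission checks that require the file owner's UID | Rarely |
| CAP_SETPCAP | Modify the process's own capability sets | Rarely |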
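# Run with ALL capabilities dropped, add back only what's needed
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx:alpine
# NET_BIND_SERVICE covers binding to ports 80/443; everything else is unnecessary attack surface.
# Images that start as root and then switch user (the stock nginx image does) typically
# also need CHOWN, SETGID and SETUID; add them only if startup fails without them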
# View capabilities of a running container
docker exec mycontainer cat /proc/1/status | grep Cap
# Decode capability hex values
capsh --decode=00000000a80425fb
# Run with completely unprivileged container (zero capabilities)
docker run --cap-drop ALL --user 1000:1000 myapp:latest
# DANGEROUS: Never do this in production
docker run --privileged myapp:latest
# --privileged gives ALL capabilities + device access + disables seccomp/AppArmor
The --privileged flag is the nuclear option: it grants ALL capabilities, disables seccomp, disables AppArmor, gives access to all host devices, and, on cgroup v1 hosts, defaults to the host's cgroup namespace. A privileged container can trivially escape to the host. Never use --privileged in production; identify the specific capability or device your container needs instead.
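To make the capsh --decode step above less opaque, here is a minimal sketch in POSIX shell that decodes a capability mask by hand, which can help on hosts where capsh is not installed. The helper name decode_caps is hypothetical; the bit order is assumed to follow the kernel's capability numbering in linux/capability.h.

```shell
# Capability names in kernel bit order (bit 0 = CAP_CHOWN), per linux/capability.h.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print one cap_* name per bit set in the mask.
# (Hypothetical helper; the body runs in a subshell so its variables stay local.)
decode_caps() (
  mask=$((0x$1))
  bit=0
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "cap_$name"
    fi
    bit=$((bit + 1))
  done
)

# Usage: decode the effective set of a container's PID 1
# decode_caps "$(docker exec mycontainer awk '/^CapEff/ {print $2}' /proc/1/status)"
```

Running it on the mask shown earlier (00000000a80425fb) lists the capabilities Docker grants by default, which is a quick way to confirm that a --cap-drop flag actually took effect.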
Seccomp (Secure Computing Mode)
Seccomp filters which system calls a process can make. The Linux kernel has ~450 syscalls, but most applications need only 50-100. By blocking unused syscalls, you eliminate entire classes of kernel exploits — an attacker who gains code execution inside the container cannot use blocked syscalls to escalate privileges.
Docker applies a default seccomp profile that blocks ~44 dangerous syscalls while allowing ~300+ safe ones:
| Blocked Syscall | Why It's Dangerous |
|---|---|
| mount / umount2 | Could mount host filesystems into container |
| reboot | Could reboot the host |
| clock_settime | Could alter system time affecting all containers |
| kexec_load | Could load a new kernel (complete host takeover) |
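| ptrace | Could debug/control other processes (blocked by the profile only on kernels older than 4.8; otherwise limited by the missing CAP_SYS_PTRACE) |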
| add_key / keyctl | Access kernel keyring (secrets of other containers) |
| unshare | Create new namespaces (escape current isolation) |
| bpf | Load eBPF programs (kernel-level code execution) |
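A quick way to confirm that a seccomp filter is actually attached to a process is the Seccomp field in /proc/<pid>/status, where the kernel reports 0 (disabled), 1 (strict) or 2 (filter). The sketch below wraps that lookup; the helper name seccomp_mode is hypothetical.

```shell
# seccomp_mode STATUS_FILE: translate the "Seccomp:" field of a
# /proc/<pid>/status file into a readable mode (hypothetical helper).
seccomp_mode() {
  case "$(awk '/^Seccomp:/ {print $2}' "$1")" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Usage inside a container (PID 1 is the main process):
# seccomp_mode /proc/1/status
```

A container started with Docker's default profile should report filter, while one started with seccomp=unconfined should report disabled.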
Creating a Custom Seccomp Profile
{
"defaultAction": "SCMP_ACT_ERRNO",
"archMap": [
{ "architecture": "SCMP_ARCH_X86_64", "subArchitectures": ["SCMP_ARCH_X86"] }
],
"syscalls": [
{
"names": [
"accept", "accept4", "access", "bind", "brk",
"chdir", "chmod", "chown", "close", "connect",
"dup", "dup2", "dup3", "epoll_create", "epoll_ctl",
"epoll_wait", "execve", "exit", "exit_group",
"fchmod", "fchown", "fcntl", "fstat", "futex",
"getcwd", "getdents64", "getegid", "geteuid",
"getgid", "getpid", "getppid", "getuid",
"ioctl", "listen", "lseek", "madvise", "mmap",
"mprotect", "munmap", "nanosleep", "open",
"openat", "pipe", "poll", "read", "readlink",
"recvfrom", "recvmsg", "rename", "rt_sigaction",
"rt_sigprocmask", "rt_sigreturn", "select",
"sendmsg", "sendto", "set_robust_list",
"setsockopt", "shutdown", "socket", "stat",
"statfs", "sysinfo", "tgkill", "uname",
"unlink", "wait4", "write", "writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
# Apply a custom seccomp profile
docker run --security-opt seccomp=custom-profile.json myapp:latest
# Disable seccomp entirely (DANGEROUS — for debugging only)
docker run --security-opt seccomp=unconfined myapp:latest
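# Generate a custom profile from observed syscalls (using strace)
# Tracing `docker run` from the host only captures the Docker CLI client, because the
# daemon starts the container process. Trace the application inside the container instead
# (assumes strace is installed in the image; /path/to/app is a placeholder):
docker run --rm --security-opt seccomp=unconfined -v "$PWD/trace:/trace" \
myapp:latest strace -f -o /trace/syscalls.log /path/to/app
# Parse the log to build a minimal allow-list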
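# An OCI seccomp BPF hook can record syscalls and generate tighter profiles;
# the package below installs the libseccomp Go bindings needed to build such tooling
sudo apt-get install golang-github-seccomp-libseccomp-golang-dev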
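The "parse the log" step can be sketched in a few lines of shell. This is a minimal illustration, not a complete generator: it assumes the log was written with strace -f -o (so each line starts with a PID followed by the syscall name), and both function names are hypothetical.

```shell
# syscalls_from_strace LOG: list the unique syscall names in a log written by
# `strace -f -o LOG ...` (lines look like: `1234 openat(AT_FDCWD, ...) = 3`).
# Signal and exit lines ("--- SIGCHLD ...", "+++ exited ...") are skipped.
syscalls_from_strace() {
  sed -nE 's/^([0-9]+ +)?([a-z_][a-z0-9_]*)\(.*/\2/p' "$1" | sort -u
}

# seccomp_profile_from_strace LOG: emit a deny-by-default profile that allows
# only the observed syscalls. Both function names are hypothetical.
seccomp_profile_from_strace() {
  names=$(syscalls_from_strace "$1" | sed 's/.*/"&"/' | paste -sd, -)
  printf '{"defaultAction":"SCMP_ACT_ERRNO","syscalls":[{"names":[%s],"action":"SCMP_ACT_ALLOW"}]}\n' "$names"
}

# Usage:
# seccomp_profile_from_strace /tmp/syscalls.log > custom-profile.json
```

The result only reflects the code paths exercised while tracing, and the container runtime itself may need a few extra syscalls during startup, so treat the generated list as a starting point and test it before relying on it.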
AppArmor
AppArmor is a Linux Security Module (LSM) that confines programs to a set of listed resources — files they can read/write, network operations they can perform, and other capabilities. Docker automatically loads a default AppArmor profile (docker-default) for every container.
# Check if AppArmor is enabled
cat /sys/module/apparmor/parameters/enabled
# Y
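# View Docker's default AppArmor profile
# Docker generates docker-default at runtime and loads it into the kernel, so there may be
# no file under /etc/apparmor.d; confirm that it is loaded instead
sudo aa-status | grep docker-default
# List loaded profiles
sudo aa-status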
# Create a custom AppArmor profile for a web application
cat > /etc/apparmor.d/docker-webapp << 'EOF'
#include <tunables/global>
profile docker-webapp flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
#include <abstractions/nameservice>
# Deny all file writes except to /tmp and /var/log
deny /etc/** w,
deny /usr/** w,
deny /bin/** w,
deny /sbin/** w,
# Allow read access to application files
/app/** r,
/app/node_modules/** r,
# Allow writes to specific directories only
/tmp/** rw,
/var/log/app/** rw,
/run/nginx.pid rw,
# Deny network raw access (no packet sniffing)
deny network raw,
# Deny mount operations
deny mount,
# Deny ptrace (no debugging other processes)
deny ptrace,
}
EOF
# Load the profile
apparmor_parser -r /etc/apparmor.d/docker-webapp
# Apply to a container
docker run --security-opt apparmor=docker-webapp mywebapp:latest
# Run without AppArmor (DANGEROUS)
docker run --security-opt apparmor=unconfined myapp:latest
SELinux
SELinux (Security-Enhanced Linux) provides Mandatory Access Control through type enforcement: every process and file has a security label, and policies define which labels can interact. SELinux is the default on RHEL/CentOS/Fedora systems.
| Feature | AppArmor | SELinux |
|---|---|---|
| Default distro | Ubuntu, Debian, SUSE | RHEL, CentOS, Fedora |
| Model | Path-based access control | Label-based type enforcement |
| Learning curve | Moderate (profile syntax) | Steep (policy language, labels) |
| Granularity | File paths + capabilities | Labels on all objects (files, ports, processes) |
| Multi-container isolation | Same profile per image | MCS labels (unique per container) |
| Docker support | Default profile auto-loaded | Works when host SELinux is enforcing |
# Check SELinux status
getenforce
# Enforcing
# Run Docker with SELinux enabled (RHEL/Fedora)
# Docker automatically assigns MCS (Multi-Category Security) labels
docker run --rm -it fedora:39 cat /proc/1/attr/current
# system_u:system_r:container_t:s0:c123,c456
# Each container gets unique MCS categories (c123,c456)
# This prevents containers from accessing each other's files
# Apply a custom SELinux label
docker run --security-opt label=type:custom_container_t myapp:latest
# Disable SELinux for a container (DANGEROUS)
docker run --security-opt label=disable myapp:latest
# Relabel host volumes for container access
docker run -v /host/data:/data:Z myapp:latest
# :Z relabels the directory with the container's MCS label (private)
# :z relabels with shared label (accessible by multiple containers)
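The label printed by /proc/1/attr/current has the shape user:role:type:sensitivity:categories. As a small sketch of how the MCS part can be compared between two containers, the helpers below split a context string and check whether two contexts share the same categories. Both function names are hypothetical.

```shell
# selinux_field CONTEXT N: print field N (1=user, 2=role, 3=type,
# 4=sensitivity, 5=MCS categories) of a context such as
# system_u:system_r:container_t:s0:c123,c456 (hypothetical helper).
selinux_field() {
  printf '%s\n' "$1" | cut -d: -f"$2"
}

# same_mcs CTX_A CTX_B: succeed if both contexts carry identical MCS
# categories, in which case MCS alone would not separate the two processes.
same_mcs() {
  [ "$(selinux_field "$1" 5)" = "$(selinux_field "$2" 5)" ]
}

# Usage: compare two running containers (names are placeholders)
# a=$(docker exec app1 cat /proc/1/attr/current)
# b=$(docker exec app2 cat /proc/1/attr/current)
# same_mcs "$a" "$b" && echo "containers share MCS categories"
```

Two containers started normally should report different categories; identical categories usually mean the labels were set by hand or labelling was disabled.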
Read-Only Filesystems
A read-only root filesystem prevents an attacker from modifying binaries, installing malware, or persisting backdoors. Combined with tmpfs for necessary writable paths, this creates an immutable container runtime.
# Run with read-only root filesystem
docker run --read-only nginx:alpine
# This will likely fail because nginx needs to write to /var/cache/nginx and /var/run
# Solution: Add tmpfs mounts for required writable paths
docker run --read-only \
--tmpfs /var/cache/nginx:size=10m \
--tmpfs /var/run:size=1m \
--tmpfs /tmp:size=50m \
nginx:alpine
# Full production example with all hardening combined
docker run -d \
--name secure-nginx \
--read-only \
--tmpfs /var/cache/nginx:size=10m,noexec,nosuid \
--tmpfs /var/run:size=1m,noexec,nosuid \
--tmpfs /tmp:size=50m,noexec,nosuid \
--cap-drop ALL \
--cap-add NET_BIND_SERVICE \
--security-opt no-new-privileges \
--user 101:101 \
-p 8080:80 \
nginx:alpine
# Verify the filesystem is read-only
docker exec secure-nginx touch /etc/test
# touch: /etc/test: Read-only file system
The noexec flag on tmpfs mounts prevents direct execution of files written to those directories. Even if an attacker manages to write a binary to /tmp, they cannot run it from there, although a script can still be passed to an interpreter that already exists in the image (for example sh /tmp/script.sh). Combined with a read-only root filesystem, this blocks the most common post-exploitation technique: downloading and running malware.
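One way to find out which paths need a tmpfs mount is to run the image once without --read-only and look at docker diff, which lists added (A), changed (C) and deleted (D) paths. The sketch below reduces that output to the directories that received new files. The helper name written_dirs is hypothetical, and it assumes paths without spaces.

```shell
# written_dirs: read `docker diff` output on stdin (lines such as
# "A /var/cache/nginx/client_temp") and print the parent directory of every
# added path, de-duplicated. Hypothetical helper; assumes no spaces in paths.
written_dirs() {
  awk '$1 == "A" { sub(/\/[^\/]*$/, "", $2); print ($2 == "" ? "/" : $2) }' | sort -u
}

# Usage: run the image writable first, then list where it wrote
# docker run -d --name probe nginx:alpine
# docker diff probe | written_dirs
```

Each directory it prints is a candidate for a --tmpfs mount; anything unexpected in the list is worth investigating before you make the filesystem read-only.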
Rootless Containers
Rootless mode runs the entire Docker daemon and containers as a non-root user. Even if an attacker breaks out of the container, they land in an unprivileged user namespace, limited to whatever that unprivileged host user can access.
# Install rootless Docker (Ubuntu/Debian)
# Prerequisites
sudo apt-get install -y uidmap dbus-user-session
# Install Docker rootless
dockerd-rootless-setuptool.sh install
# The daemon runs as your user (no sudo required)
docker info | grep "Root Dir"
# Docker Root Dir: /home/user/.local/share/docker
# Verify rootless mode
docker info --format '{{.SecurityOptions}}'
# The printed list should include name=rootless
# Start/stop rootless Docker
systemctl --user start docker
systemctl --user stop docker
# Enable on login
systemctl --user enable docker
sudo loginctl enable-linger "$(whoami)"
Rootless mode limitations:
| Feature | Root Mode | Rootless Mode |
|---|---|---|
| Bind to port < 1024 | Yes | No (without sysctl) |
| overlay2 storage driver | Yes | Yes (kernel 5.11+) |
| Network performance | Native | Slight overhead (slirp4netns/pasta) |
| cgroup v2 resource limits | Full | Limited (systemd user slice) |
| Host device access | Yes | No |
| Container escape impact | Full host root access | Unprivileged user only |
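Rootless mode depends on subordinate UID and GID ranges (the reason uidmap is installed above), and Docker's documentation asks for at least 65,536 of each per user. A minimal sketch for checking that prerequisite follows; the helper name subid_count is hypothetical.

```shell
# subid_count USER FILE: sum the subordinate ID counts allotted to USER in a
# file formatted like /etc/subuid (user:start:count). Hypothetical helper.
subid_count() {
  awk -F: -v u="$1" '$1 == u { total += $3 } END { print total + 0 }' "$2"
}

# Usage: check both files before running the rootless setup tool
# [ "$(subid_count "$(whoami)" /etc/subuid)" -ge 65536 ] || echo "too few subuids"
# [ "$(subid_count "$(whoami)" /etc/subgid)" -ge 65536 ] || echo "too few subgids"
```

If either count is too low, the setup tool is likely to fail or containers may be unable to map all the UIDs an image expects.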
No-New-Privileges Flag
The no-new-privileges flag prevents processes inside the container from gaining additional privileges through setuid/setgid binaries, file capabilities, or other escalation mechanisms.
# Enable no-new-privileges
docker run --security-opt no-new-privileges myapp:latest
# What this blocks:
# - setuid binaries (like sudo, su, ping)
# - setgid binaries
# - File capabilities (getcap/setcap)
# - execve() with elevated privileges
# Example: Without no-new-privileges, a setuid binary can escalate
docker run --rm -it ubuntu bash -c "
chmod u+s /usr/bin/find
# find is now setuid — could be used for escalation
"
# With no-new-privileges, setuid is ignored
docker run --rm --security-opt no-new-privileges -it ubuntu bash -c "
chmod u+s /usr/bin/find
ls -la /usr/bin/find
# The setuid bit is set but the kernel ignores it
"
Use no-new-privileges in production. Combined with a non-root USER and dropped capabilities, this creates a container where privilege escalation is effectively impossible through standard Linux mechanisms.
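Whether the flag is really in force can be read from the NoNewPrivs field of /proc/<pid>/status (exposed by Linux 4.10 and later). The following is a small sketch of that check; the helper name no_new_privs is hypothetical.

```shell
# no_new_privs STATUS_FILE: succeed if the process described by a
# /proc/<pid>/status file has the no_new_privs bit set (reported as
# "NoNewPrivs: 1"). Hypothetical helper.
no_new_privs() {
  [ "$(awk '/^NoNewPrivs:/ {print $2}' "$1")" = "1" ]
}

# Usage inside a container:
# no_new_privs /proc/1/status && echo "no-new-privileges is active"
```

Because the bit is inherited by child processes and cannot be cleared, checking PID 1 is enough to cover everything the container starts afterwards.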
Resource Limits as Security
Resource limits (cgroups) aren't just for performance — they're a security control that prevents denial-of-service attacks, cryptomining abuse, and fork bombs.
# Memory limit: Container is OOM-killed if it exceeds 256MB
docker run --memory=256m --memory-swap=256m myapp:latest
# CPU limit: Container gets at most 0.5 CPU cores
docker run --cpus=0.5 myapp:latest
# PID limit: Prevents fork bombs (maximum 100 processes)
docker run --pids-limit=100 myapp:latest
# Block I/O limits: Prevent I/O saturation attacks
docker run --device-write-bps /dev/sda:10mb myapp:latest
# Ulimits: File descriptor and process limits
docker run --ulimit nofile=1024:1024 --ulimit nproc=50:50 myapp:latest
# Complete production security profile
docker run -d \
--name hardened-app \
--memory=512m \
--memory-swap=512m \
--cpus=1.0 \
--pids-limit=200 \
--ulimit nofile=2048:2048 \
--read-only \
--tmpfs /tmp:size=100m,noexec,nosuid \
--cap-drop ALL \
--cap-add NET_BIND_SERVICE \
--security-opt no-new-privileges \
--security-opt seccomp=custom-profile.json \
--user 1000:1000 \
myapp:latest
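On a cgroup v2 host, the limits set by these flags surface as plain files in the container's cgroup directory: memory.max, pids.max and cpu.max. The sketch below prints them so you can confirm what the kernel is enforcing; the helper name cgroup_limits is hypothetical and cgroup v1 hosts use different file names.

```shell
# cgroup_limits DIR: print the cgroup v2 limit files that back --memory,
# --pids-limit and --cpus. DIR is /sys/fs/cgroup when run inside a container
# on a cgroup v2 host. Hypothetical helper.
cgroup_limits() {
  for f in memory.max pids.max cpu.max; do
    if [ -r "$1/$f" ]; then
      printf '%s=%s\n' "$f" "$(cat "$1/$f")"
    else
      printf '%s=unavailable\n' "$f"
    fi
  done
}

# Usage inside a container:
# cgroup_limits /sys/fs/cgroup
```

For the hardened-app example above you would expect memory.max to hold 512 MiB expressed in bytes, pids.max to hold 200, and cpu.max to hold a quota equal to its period. A value of max means no limit is applied.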
Docker Bench for Security
Docker Bench is an automated script that checks your Docker installation against the CIS Docker Benchmark — over 100 security best practices covering the host, daemon configuration, images, containers, and networking.
# Run Docker Bench for Security
docker run --rm --net host --pid host --userns host --cap-add audit_control \
-e DOCKER_CONTENT_TRUST=$DOCKER_CONTENT_TRUST \
-v /etc:/etc:ro \
-v /usr/bin/containerd:/usr/bin/containerd:ro \
-v /usr/bin/runc:/usr/bin/runc:ro \
-v /usr/lib/systemd:/usr/lib/systemd:ro \
-v /var/lib:/var/lib:ro \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
docker/docker-bench-security
# Example output categories:
# [PASS] 1.1 - Ensure a separate partition for containers exists
# [WARN] 2.1 - Ensure network traffic is restricted between containers
# [PASS] 4.1 - Ensure that a user for the container has been created
# [WARN] 4.5 - Ensure Content trust for Docker is Enabled
# [PASS] 5.1 - Ensure that AppArmor Profile is Set
# [WARN] 5.4 - Ensure that privileged containers are not used
# [PASS] 5.12 - Ensure that the container's root filesystem is read-only
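For a single container, a few of these checks can be approximated with docker inspect. The sketch below is not part of Docker Bench and does not follow the CIS numbering; it only illustrates the idea, and all three function names are hypothetical. The no-new-privileges check is a simple substring match, so it would not notice an explicit false value.

```shell
# inspect_field NAME TEMPLATE: read one field from `docker inspect`.
inspect_field() {
  docker inspect --format "$2" "$1"
}

# verdict LABEL ACTUAL EXPECTED: print a Docker-Bench-style result line.
verdict() {
  if [ "$2" = "$3" ]; then
    echo "[PASS] $1"
  else
    echo "[WARN] $1"
  fi
}

# audit_container NAME: spot-check four settings covered in this article.
# All three function names are hypothetical.
audit_container() {
  verdict "Ensure the container is not privileged" \
    "$(inspect_field "$1" '{{.HostConfig.Privileged}}')" "false"
  verdict "Ensure the root filesystem is read-only" \
    "$(inspect_field "$1" '{{.HostConfig.ReadonlyRootfs}}')" "true"
  # SecurityOpt prints as a list such as [no-new-privileges seccomp=...]
  case "$(inspect_field "$1" '{{.HostConfig.SecurityOpt}}')" in
    *no-new-privileges*) echo "[PASS] Ensure no-new-privileges is set" ;;
    *) echo "[WARN] Ensure no-new-privileges is set" ;;
  esac
  # Memory is reported in bytes; 0 (or no output) means no limit
  case "$(inspect_field "$1" '{{.HostConfig.Memory}}')" in
    0|"") echo "[WARN] Ensure a memory limit is set" ;;
    *) echo "[PASS] Ensure a memory limit is set" ;;
  esac
}

# Usage:
# audit_container hardened-app
```

A script like this suits a quick check in CI or after a deployment, while Docker Bench remains the better choice for a full host and daemon audit.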
Runtime Security Tools
Static hardening (seccomp, AppArmor, capabilities) defines what containers can do. Runtime security tools detect anomalous behaviour — activity that's technically allowed but suspicious, like spawning a shell in a container that should only serve HTTP.
Falco — Cloud-Native Runtime Security
Falco (a CNCF graduated project) monitors system calls in real time using eBPF and alerts on suspicious behaviour:
- Shell spawned in a container (bash, sh)
- Unexpected network connections (reverse shells, C2 traffic)
- Sensitive file access (/etc/shadow, /proc/kcore)
- Binary modification or file creation in non-tmp directories
- Package manager execution (apt, yum) inside running containers
# Example Falco rule
- rule: Shell Spawned in Container
  desc: Detect shell execution in a container
  condition: >
    spawned_process and container and
    proc.name in (bash, sh, zsh, dash, csh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
    shell=%proc.name parent=%proc.pname)
  priority: WARNING
  tags: [container, shell]
| Tool | Mechanism | Strengths |
|---|---|---|
| Falco | eBPF/kernel module syscall monitoring | CNCF project, rich rule language, low overhead |
| Sysdig Secure | eBPF + commercial platform | Commercial Falco + compliance + forensics |
| Tracee | eBPF (by Aqua Security) | Event-based, good for container forensics |
| Tetragon | eBPF (by Cilium/Isovalent) | Enforcement (kill processes), low overhead |
Exercises
- Run nginx:alpine with --cap-drop ALL. It will fail because the default configuration needs privileges it no longer has, such as binding to port 80. Add back the minimum capabilities needed (NET_BIND_SERVICE, plus any others the error messages point to) to make it work. Then try running on port 8080 instead: does it still need the capability?
- Build a custom seccomp profile for an application of your choice. Use strace to identify the syscall set, then test your container with the restrictive profile.
- Run a container with --read-only and --security-opt no-new-privileges. Use docker exec to attempt: writing files, installing packages (apt-get), running setuid binaries. Document which operations fail and why.
Conclusion & Next Steps
Runtime security transforms containers from convenient isolation into genuine security boundaries through layered defences:
- Capabilities — Drop ALL, add back only what's needed (usually just NET_BIND_SERVICE)
- Seccomp — Filter syscalls to a minimal allow-list, blocking kernel exploit vectors
- AppArmor/SELinux — Mandatory access control preventing unauthorized file and network access
- Read-only FS — Immutable containers that cannot be modified post-deployment
- Rootless mode — Even container escapes land in an unprivileged user namespace
- Runtime detection — Falco and eBPF tools catch anomalous behaviour in real-time
With images secured (Part 15) and runtime hardened (this article), the remaining attack vector is the supply chain itself — the provenance of your dependencies and the handling of secrets.
Next in the Series
In Part 17: Supply Chain Security & Secrets, we'll secure the software supply chain with SBOMs, provenance attestations, image signing workflows, and safe secret management patterns — ensuring that every component in your container has a verified origin and that credentials never leak into image layers.