Why Containers Need Special Filesystems
Consider the fundamental challenge: you want to run 50 containers on a single host, each running a different application but all based on Ubuntu 22.04. The Ubuntu base image is roughly 30 MB compressed (about 78 MB uncompressed). A naive approach would require 50 × 78 MB ≈ 3.9 GB of disk space just for identical base OS files. This is wildly inefficient: those 50 copies of /usr/bin/bash, /lib/x86_64-linux-gnu/libc.so.6, and thousands of other files are byte-for-byte identical.
Virtual machines solved this differently (each VM carries its own full filesystem), but containers needed something smarter. The solution is a union filesystem — a filesystem that presents a unified view of multiple directory trees (layers) stacked on top of each other, where only the differences between layers need to be stored.
The Core Innovation
Union filesystems provide three key innovations that make containers practical:
- Space efficiency — Shared base layers are stored once on disk, regardless of how many containers use them
- Fast startup — Creating a new container doesn't require copying the entire filesystem; it only creates a thin writable layer
- Efficient distribution — Pulling an image only downloads layers not already present on the host
Union Filesystem Concepts
A union filesystem merges multiple directory trees (called branches or layers) into a single coherent view. The key rules that govern this merge are:
| Rule | Description | Example |
|---|---|---|
| Upper takes priority | If a file exists in multiple layers, the uppermost version wins | Custom /etc/nginx.conf overrides base layer's default |
| Directory merge | Directories with the same name are merged, showing all entries | /usr/bin/ shows files from all layers combined |
| Lower layers read-only | Only the topmost layer is writable; lower layers are immutable | Original base image is never modified by running containers |
| Deletion via whiteout | Removing a lower-layer file creates a special marker in the upper layer | Deleting /var/log/apt creates a whiteout, hiding the original |
flowchart TB
subgraph merged["Merged View (what the container sees)"]
M1["/bin/bash"]
M2["/app/server.js"]
M3["/etc/nginx.conf (custom)"]
M4["/var/log/ (empty)"]
end
subgraph upper["Upper Layer (writable — container changes)"]
U1["/etc/nginx.conf (modified)"]
U2["/var/log/apt (whiteout)"]
end
subgraph mid["Middle Layer (read-only — application)"]
A1["/app/server.js"]
A2["/app/package.json"]
end
subgraph lower["Lower Layer (read-only — base OS)"]
L1["/bin/bash"]
L2["/etc/nginx.conf (default)"]
L3["/var/log/apt/"]
end
lower --> mid --> upper --> merged
Layer Visibility Rules
When a process in a container accesses a file path, the union filesystem resolves it by searching layers from top to bottom:
- Check the upper (writable) layer first: if the path exists here, use it immediately. If the entry is a whiteout marker, the lookup stops and the file is reported as missing, even though a lower layer still contains it
- Check middle layers in order: search the read-only layers from newest to oldest
- Check the base layer last: if the path was found nowhere else, look in the bottommost layer
- Return ENOENT: if the path is not found in any layer, report "No such file or directory"
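The lookup order above can be sketched with plain directories standing in for layers. This is a simulation, not OverlayFS itself: the `resolve` function and the `.wh.<name>` marker file are assumptions made for illustration (real OverlayFS records a deletion as a 0,0 character device), and the sketch ignores whiteouts placed on parent directories.

```shell
#!/usr/bin/env bash
# Simulated top-down lookup across layer directories (not real OverlayFS).
# A deletion is modelled by a marker file ".wh.<name>" in the directory
# where the file would be; this marker convention is illustrative only.

# resolve REL_PATH LAYER_TOP [LAYER ...]   (layers listed top to bottom)
# Prints the backing path of the first match, or returns 2 for "ENOENT".
resolve() {
  local rel=$1; shift
  local dir base layer
  dir=$(dirname -- "$rel")
  base=$(basename -- "$rel")
  for layer in "$@"; do
    # A whiteout in a higher layer hides every copy further down.
    if [ -e "$layer/$dir/.wh.$base" ]; then
      return 2
    fi
    if [ -e "$layer/$rel" ]; then
      printf '%s\n' "$layer/$rel"
      return 0
    fi
  done
  return 2
}

demo=$(mktemp -d)
mkdir -p "$demo"/{upper,mid,lower}/etc "$demo"/lower/var/log "$demo"/upper/var/log
echo default > "$demo/lower/etc/nginx.conf"
echo custom  > "$demo/upper/etc/nginx.conf"
echo entry   > "$demo/lower/var/log/apt"
: > "$demo/upper/var/log/.wh.apt"        # simulated whiteout

resolve etc/nginx.conf "$demo/upper" "$demo/mid" "$demo/lower"   # upper copy wins
resolve var/log/apt    "$demo/upper" "$demo/mid" "$demo/lower" || echo "ENOENT"
```

The first call prints the path inside the upper directory, and the second prints ENOENT even though the lower directory still holds the file.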
OverlayFS Deep Dive
OverlayFS (also written as overlay or overlay2 in Docker's storage driver terminology) is the dominant union filesystem in modern container runtimes. It's built into the Linux kernel, requires no additional packages, and provides excellent performance characteristics for container workloads.
OverlayFS operates with four directory components:
| Component | Mount Option | Description | Docker Mapping |
|---|---|---|---|
| lowerdir | lowerdir= | One or more read-only directories, colon-separated; the leftmost entry is the topmost layer and the rightmost is the bottom | Image layers (base OS, dependencies, application code) |
| upperdir | upperdir= | Single writable directory where all modifications are stored | Container's writable layer (ephemeral) |
| workdir | workdir= | Empty scratch directory for atomic operations (must be on the same filesystem as upperdir) | Internal work directory managed by Docker |
| merged | mount point | The unified view presented to processes (the "result") | Container's root filesystem (/) |
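Because the kernel reads `lowerdir=` with the topmost layer on the left, a list of layers kept in build order (bottom first) has to be reversed before mounting. A minimal sketch follows; the `build_lowerdir` helper and the `/layers/...` and `/ctr/...` paths are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# build_lowerdir LAYER_BOTTOM [LAYER ...]   (arguments ordered bottom to top)
# Prints a lowerdir value with the topmost layer first, as mount expects.
build_lowerdir() {
  local out="" layer
  for layer in "$@"; do
    # Prepend, so later (higher) layers end up further left.
    out="$layer${out:+:$out}"
  done
  printf '%s\n' "$out"
}

# Hypothetical layer paths, bottom to top: base OS, dependencies, app code.
lower=$(build_lowerdir /layers/base /layers/deps /layers/app)
echo "lowerdir=$lower,upperdir=/ctr/upper,workdir=/ctr/work"
```

The resulting string lists /layers/app first and /layers/base last, which is the order an overlay mount needs for the application layer to take priority over the base.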
Hands-On: Building an OverlayFS Mount
Let's manually create an OverlayFS mount to understand exactly what Docker does under the hood:
# Create the directory structure
mkdir -p /tmp/overlay-demo/{lower,upper,work,merged}
# Populate the "base image" (lower layer)
echo "I am the base OS config" > /tmp/overlay-demo/lower/config.txt
printf '#!/bin/bash\necho hello\n' > /tmp/overlay-demo/lower/app.sh
mkdir -p /tmp/overlay-demo/lower/logs
echo "old log entry" > /tmp/overlay-demo/lower/logs/access.log
# Mount the overlay filesystem
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/overlay-demo/lower,upperdir=/tmp/overlay-demo/upper,workdir=/tmp/overlay-demo/work \
  /tmp/overlay-demo/merged
# Verify: the merged view shows lower layer content
ls /tmp/overlay-demo/merged/
# Output: app.sh config.txt logs
cat /tmp/overlay-demo/merged/config.txt
# Output: I am the base OS config
# Now modify a file through the merged view
echo "I am the MODIFIED config" > /tmp/overlay-demo/merged/config.txt
# Create a new file
echo "new data" > /tmp/overlay-demo/merged/newfile.txt
# Delete a file from the lower layer
rm /tmp/overlay-demo/merged/logs/access.log
# Inspect what happened in the upper layer
ls -F /tmp/overlay-demo/upper/
# Output: config.txt  logs/  newfile.txt
cat /tmp/overlay-demo/upper/config.txt
# Output: I am the MODIFIED config
# The lower layer is UNTOUCHED
cat /tmp/overlay-demo/lower/config.txt
# Output: I am the base OS config
# Check for whiteout file (character device 0,0)
ls -la /tmp/overlay-demo/upper/logs/
# Output: c--------- 1 root root 0, 0 ... access.log (whiteout!)
# Clean up
sudo umount /tmp/overlay-demo/merged
The Lower Layer is Sacred
No matter what operations the container performs — modifying files, creating new ones, deleting existing ones — the lower layer (image) is never touched. All changes are captured in the upper layer. This is the fundamental guarantee that makes image sharing possible: one hundred containers can all use the same base image layers simultaneously because none of them can modify those layers.
When you docker commit a container, what actually happens is that the upper layer's contents are frozen and become a new read-only layer in a new image.
Copy-on-Write Semantics
Copy-on-Write (CoW) is the strategy OverlayFS uses to provide writable access to files that originate in read-only lower layers. The name describes the policy: a file is only copied to the writable layer at the moment it is first written to — never before.
There are three fundamental operations:
| Operation | What Happens | Performance Cost |
|---|---|---|
| Read-through | File is read directly from the lower layer — no copy needed | Near-zero (direct passthrough) |
| Copy-up | File is copied entirely from lower to upper layer before the first write | One-time cost proportional to file size |
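| Whiteout | A character device with device number 0,0 is created in the upper layer to mark the deletion (the .wh. filename prefix is the equivalent marker in AUFS and in image layer tarballs) | Negligible (creates tiny marker file) |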
flowchart TD
A["Process requests file operation"] --> B{"Operation type?"}
B -->|"READ"| C{"File in upper layer?"}
C -->|"Yes"| D["Read from upper layer"]
C -->|"No"| E["Read directly from lower layer<br/>(read-through, zero cost)"]
B -->|"WRITE"| F{"File already in upper layer?"}
F -->|"Yes"| G["Write directly to upper layer"]
F -->|"No"| H["COPY-UP: Copy entire file<br/>from lower → upper layer"]
H --> I["Write modification to<br/>the new upper-layer copy"]
B -->|"DELETE"| J{"File origin?"}
J -->|"Upper layer only"| K["Simply remove from upper"]
J -->|"Lower layer"| L["Create whiteout marker<br/>in upper layer"]
style H fill:#BF092F,color:#fff
style E fill:#3B9797,color:#fff
style L fill:#132440,color:#fff
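The read and write branches of the flowchart can be imitated with two ordinary directories. This is only a sketch of the policy, assuming hypothetical helper names `cow_append` and `cow_read`; the kernel performs the real copy-up inside the filesystem.

```shell
#!/usr/bin/env bash
# Simulated copy-on-write over two plain directories (not real OverlayFS).

# cow_append LOWER UPPER REL_PATH TEXT
# Appends TEXT to the file, copying it up from LOWER on the first write.
cow_append() {
  local lower=$1 upper=$2 rel=$3 text=$4
  mkdir -p "$(dirname -- "$upper/$rel")"
  if [ ! -e "$upper/$rel" ] && [ -e "$lower/$rel" ]; then
    cp -p -- "$lower/$rel" "$upper/$rel"   # copy-up: whole file, once
    echo "copy-up: $rel" >&2
  fi
  printf '%s\n' "$text" >> "$upper/$rel"
}

# cow_read LOWER UPPER REL_PATH: upper copy if present, else read-through.
cow_read() {
  local lower=$1 upper=$2 rel=$3
  if [ -e "$upper/$rel" ]; then
    cat -- "$upper/$rel"
  else
    cat -- "$lower/$rel"
  fi
}

demo=$(mktemp -d)
mkdir -p "$demo/lower" "$demo/upper"
echo "base line" > "$demo/lower/config.txt"

cow_read   "$demo/lower" "$demo/upper" config.txt            # read-through
cow_append "$demo/lower" "$demo/upper" config.txt "edit 1"   # triggers copy-up
cow_append "$demo/lower" "$demo/upper" config.txt "edit 2"   # no second copy
cow_read   "$demo/lower" "$demo/upper" config.txt
cat "$demo/lower/config.txt"                                 # lower copy unchanged
```

Only the first append reports a copy-up. The second goes straight to the upper copy, and the lower file keeps its single original line throughout.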
Whiteout Files and Opaque Directories
When a container deletes a file that exists in a lower (read-only) layer, the filesystem can't actually remove it — the lower layer is immutable. Instead, OverlayFS creates a whiteout file in the upper layer that effectively hides the original:
# OverlayFS uses character devices with major:minor 0:0 as whiteouts
# When you delete /etc/motd from a lower layer:
ls -la /var/lib/docker/overlay2/<container-id>/diff/etc/
# c--------- 1 root root 0, 0 May 14 10:00 motd
# For opaque directories (hide ALL content of a lower-layer directory):
# OverlayFS sets the trusted.overlay.opaque xattr to "y" on the upper-layer directory
# (a file named .wh..wh..opq is the equivalent marker in AUFS and in image layer tarballs)
sudo getfattr -n trusted.overlay.opaque /var/lib/docker/overlay2/<container-id>/diff/var/cache/
# trusted.overlay.opaque="y"
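To find every whiteout in an upper directory rather than inspecting one path at a time, a small helper along these lines may be useful. The function name `list_whiteouts` is an assumption for illustration, and the sketch relies on GNU `find` and `stat` as shipped on typical Linux hosts.

```shell
#!/usr/bin/env bash
# list_whiteouts UPPER_DIR
# Prints every character device with major:minor 0:0 under UPPER_DIR,
# which is how OverlayFS records a deleted lower-layer file.
list_whiteouts() {
  local upper=$1 f
  find "$upper" -type c -print0 2>/dev/null |
    while IFS= read -r -d '' f; do
      # GNU stat: %t and %T are the device major and minor numbers in hex.
      if [ "$(stat -c '%t:%T' -- "$f")" = "0:0" ]; then
        printf '%s\n' "${f#"$upper"/}"
      fi
    done
}

# Hypothetical usage against the demo mount built earlier (directories
# under /var/lib/docker would additionally need sudo):
# list_whiteouts /tmp/overlay-demo/upper
```

A directory with no deleted lower-layer files produces no output.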
Docker Image Layers
Every Docker image is composed of an ordered stack of read-only layers. Each layer represents a set of filesystem changes (files added, modified, or deleted) relative to the layer below it. When Docker builds an image from a Dockerfile, instructions that change the filesystem (RUN, COPY, ADD) create a new layer, while the others only record metadata:
- FROM ubuntu:22.04: sets the base, so the image inherits the layers of the Ubuntu image
- RUN apt-get update && apt-get install -y nginx: creates a layer with the installed packages
- COPY ./app /usr/share/nginx/html: creates a layer with your application files
- CMD ["nginx", "-g", "daemon off;"]: only adds metadata, no filesystem layer
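As a rough illustration of which instructions add filesystem layers, the sketch below labels each line of a Dockerfile. The `classify_instructions` function is hypothetical and deliberately simplified: it ignores line continuations and heredocs, and it treats every instruction other than RUN, COPY, and ADD as metadata, which glosses over edge cases such as WORKDIR creating a directory.

```shell
#!/usr/bin/env bash
# classify_instructions < Dockerfile
# Labels each instruction as "layer" (changes the filesystem), "base",
# or "metadata". Simplified: one instruction per line is assumed.
classify_instructions() {
  local line instr
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip blank lines and comments.
    case $line in ''|'#'*) continue ;; esac
    instr=${line%%[[:space:]]*}
    case ${instr^^} in
      RUN|COPY|ADD) printf 'layer     %s\n' "$instr" ;;
      FROM)         printf 'base      %s\n' "$instr" ;;
      *)            printf 'metadata  %s\n' "$instr" ;;
    esac
  done
}

classify_instructions <<'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y nginx
COPY ./app /usr/share/nginx/html
CMD ["nginx", "-g", "daemon off;"]
EOF
```

For the four instructions above, RUN and COPY are reported as layers, FROM as the base, and CMD as metadata.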
Inspecting Image Layers
# View the layers of an image with their sizes
docker history nginx:latest
# IMAGE CREATED CREATED BY SIZE
# 3b25b682ea82 2 weeks ago CMD ["nginx" "-g" "daemon off;"] 0B
# <missing> 2 weeks ago STOPSIGNAL SIGQUIT 0B
# <missing> 2 weeks ago EXPOSE map[80/tcp:{}] 0B
# <missing> 2 weeks ago ENTRYPOINT ["/docker-entrypoint.sh"] 0B
# <missing> 2 weeks ago COPY file:xxx in /docker-entrypoint.d 4.62kB
# <missing> 2 weeks ago COPY file:xxx in /docker-entrypoint.d 3.02kB
# <missing> 2 weeks ago COPY file:xxx in /docker-entrypoint.sh 1.62kB
# <missing> 2 weeks ago RUN /bin/sh -c set -x ... 61.1MB
# <missing> 2 weeks ago ENV PKG_RELEASE=1~bookworm 0B
# <missing> 2 weeks ago ENV NJS_VERSION=0.8.4 0B
# <missing> 2 weeks ago ENV NGINX_VERSION=1.25.5 0B
# <missing> 3 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
# <missing> 3 weeks ago /bin/sh -c #(nop) ADD file:xxx in / 74.8MB
# Inspect the detailed layer information (content-addressable storage)
docker image inspect nginx:latest --format '{{json .RootFS}}' | jq .
# {
# "Type": "layers",
# "Layers": [
# "sha256:2edcec3590a4ec7...", ← Debian base (74.8 MB)
# "sha256:e379e8aedd4d72...", ← nginx installation (61.1 MB)
# "sha256:b8d6e692a25e11...", ← entrypoint script
# "sha256:f1db227348d0a5...", ← config files
# "sha256:32ce5f6a5106ec...", ← more config
# "sha256:d73e673c3f132e...", ← final config
# "sha256:9a1f56e408ebbc..." ← latest changes
# ]
# }
# Examine the actual filesystem content of each layer on disk
# Docker stores layers in /var/lib/docker/overlay2/
sudo ls /var/lib/docker/overlay2/
# Output shows one directory per layer, named by a locally generated ID
# (these names do not match the sha256 digests listed above)
# View the diff (changes) in a specific layer
sudo ls /var/lib/docker/overlay2/<layer-dir>/diff/
# Shows only the files added/modified in that specific layer
Content-Addressable Storage
Docker uses content-addressable storage — each layer is identified by the SHA256 hash of its contents. This means:
- Two identical layers (same files, same permissions) always produce the same hash
- Layers can be safely deduplicated — if two images share a layer hash, only one copy exists on disk
- Layer integrity is guaranteed — if the content changes, the hash changes
- Pull operations can skip layers that already exist locally (matching by hash)
This is why image pulls are often faster than expected — if you already have ubuntu:22.04 layers cached, pulling another image based on the same Ubuntu version reuses those layers entirely.
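The idea of addressing content by its hash can be tried with `sha256sum` on ordinary files. In this sketch the files are stand-ins for layer archives and `layer_digest` is a hypothetical helper; real layer digests are computed over the layer tarball, so metadata such as timestamps inside the archive also affects the result.

```shell
#!/usr/bin/env bash
# layer_digest FILE: print a Docker-style content address for FILE.
layer_digest() {
  printf 'sha256:%s\n' "$(sha256sum -- "$1" | cut -d' ' -f1)"
}

demo=$(mktemp -d)
printf 'same bytes\n'  > "$demo/layer-a.tar"
printf 'same bytes\n'  > "$demo/layer-b.tar"
printf 'other bytes\n' > "$demo/layer-c.tar"

layer_digest "$demo/layer-a.tar"
layer_digest "$demo/layer-b.tar"   # identical content, identical digest
layer_digest "$demo/layer-c.tar"   # any change yields a different digest
```

The first two digests match because the bytes match, so a store keyed by digest needs to keep only one copy of them.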
Layer Sharing & Deduplication
Layer sharing is one of Docker's most powerful efficiency features. When multiple images share the same base layers, those layers are stored only once on disk and shared across all images and containers that reference them.
# Check overall disk usage (summary view)
docker system df
# TYPE TOTAL ACTIVE SIZE RECLAIMABLE
# Images 12 5 4.235GB 2.1GB (49%)
# Containers 5 3 125.4MB 45.2MB (36%)
# Local Volumes 8 4 890MB 445MB (50%)
# Build Cache 23 0 1.2GB 1.2GB
# Detailed image breakdown showing SHARED SIZE
docker system df -v | head -20
# REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE
# myapp latest abc123 1h ago 450MB 350MB 100MB
# nginx latest def456 2w ago 187MB 187MB 0B
# node 18 ghi789 3w ago 350MB 74.8MB 275.2MB
# Demonstrate layer sharing: pull two images with the same base
docker pull node:18-slim
# 18-slim: Pulling from library/node
# bd159e379b3b: Already exists ← Shared Debian layer!
# 28b2c5b2ea8e: Pull complete ← Node-specific layer
# ...
docker pull python:3.11-slim
# 3.11-slim: Pulling from library/python
# bd159e379b3b: Already exists ← Same shared Debian layer!
# a1b2c3d4e5f6: Pull complete ← Python-specific layer
# ...
If 20 containers run from images based on node:18-slim, the base layer (~80 MB) is stored once, not 20 times. That's 80 MB on disk instead of 1.6 GB, a 95% saving on base layer storage. The same applies to memory at runtime: shared read-only layers use shared page cache entries, meaning the kernel caches the file data once for all containers accessing the same layer content.
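The arithmetic behind that saving is easy to check. The sketch below uses a hypothetical `layer_savings` helper with integer megabytes, and plugs in the figures quoted above (20 copies of an 80 MB base layer).

```shell
#!/usr/bin/env bash
# layer_savings COUNT BASE_MB
# Compares COUNT private copies of a base layer with one shared copy.
layer_savings() {
  local count=$1 base_mb=$2
  local naive=$(( count * base_mb ))
  local shared=$base_mb
  local saved_pct=$(( (naive - shared) * 100 / naive ))
  printf 'naive=%dMB shared=%dMB saved=%d%%\n' "$naive" "$shared" "$saved_pct"
}

layer_savings 20 80   # naive=1600MB shared=80MB saved=95%
```

The percentage depends only on the number of copies: with N consumers of one layer, the saving is (N - 1) / N, so it approaches 100% as more containers share the base.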
The Writable Container Layer
When Docker creates a container from an image, it adds one thin writable layer on top of the image's read-only layers. This is called the container layer (or the diff). All filesystem modifications made by the running container — file writes, package installations, log files, temp files — are stored here and only here.
# Start a container and make changes
docker run -d --name demo nginx:latest
# Check the container's filesystem usage
docker inspect demo --format '{{.GraphDriver.Data.UpperDir}}'
# /var/lib/docker/overlay2/abc123.../diff
# See what's in the writable layer (initially almost empty)
sudo ls /var/lib/docker/overlay2/abc123.../diff/
# (empty or minimal runtime files)
# Now write some data inside the container
docker exec demo sh -c 'echo "hello" > /tmp/myfile.txt'
docker exec demo sh -c 'apt-get update > /dev/null 2>&1'
# Check the writable layer again — all changes are here
sudo ls /var/lib/docker/overlay2/abc123.../diff/
# tmp/ var/
sudo du -sh /var/lib/docker/overlay2/abc123.../diff/
# 28M (the apt-get update cached files)
When a container is removed (docker rm), its writable layer is permanently deleted. Any data written to the container filesystem that isn't stored in a volume is gone forever. This is intentional: containers are designed to be disposable. Production applications should write persistent data to named volumes or bind mounts, never to the container layer.
# Prove the ephemeral nature
docker run -d --name temp-demo alpine sleep 3600
docker exec temp-demo sh -c 'echo "important data" > /data.txt'
docker restart -t 1 temp-demo # restart to verify
docker exec temp-demo cat /data.txt
# important data (still there: same container)
docker rm -f temp-demo # remove the container
docker run --rm alpine cat /data.txt
# cat: can't open '/data.txt': No such file or directory
# The data is GONE: new container, new empty writable layer
Alternative Storage Drivers
While overlay2 is the default and recommended storage driver for Docker, alternatives exist for specific use cases or legacy systems:
| Driver | Status | Backing Filesystem | Key Feature | Limitation |
|---|---|---|---|---|
| overlay2 | Default, recommended | ext4, xfs | In-kernel, no patches needed, fast | Limited to 128 lower layers |
| AUFS | Deprecated (removed in Docker 24+) | ext4, xfs | First union FS Docker used; mature | Not in mainline kernel; requires patches |
| btrfs | Supported | btrfs only | Native snapshots, compression, checksumming | Requires btrfs filesystem on host |
| zfs | Supported | ZFS only | Enterprise features: dedup, compression, snapshots | Heavy memory usage; complex setup |
| devicemapper | Deprecated | Block device | Works without specific filesystem requirements | Complex, slower, loopback mode unreliable |
| fuse-overlayfs | Rootless mode | Any | Works in user namespaces (rootless Docker) | Slower than kernel overlay2 (FUSE overhead) |
# Check which storage driver Docker is currently using
docker info --format '{{.Driver}}'
# overlay2
# View detailed storage driver information
docker info | grep -A5 "Storage Driver"
# Storage Driver: overlay2
# Backing Filesystem: extfs
# Supports d_type: true
# Using metacopy: false
# Native Overlay Diff: true
# userxattr: false
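To set the storage driver, edit /etc/docker/daemon.json. JSON does not allow comments, so the caveats are stated here rather than inside the file: overlay2.size is supported only when the backing filesystem is xfs mounted with the pquota option, and overlay2.override_kernel_check is a legacy option for old kernels that current setups do not need.
{
"storage-driver": "overlay2",
"storage-opts": [
"overlay2.size=20G"
]
}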
Performance Implications
Understanding the performance characteristics of union filesystems is essential for building efficient container images and running performant containers:
Read Performance
Reads are fast — almost as fast as native filesystem access. When reading a file from a lower layer, OverlayFS passes through directly to the underlying filesystem with minimal overhead. The kernel page cache still works normally, so frequently-accessed files remain cached in memory.
Write Performance (Copy-Up Cost)
The first write to a lower-layer file incurs the copy-up penalty. Subsequent writes to the same file are fast (writing directly to the upper layer copy). The cost depends on file size:
- Small files (configs, scripts): Copy-up is near-instantaneous
- Large files (databases, logs): Copy-up can be expensive (must copy entire file)
- Very large files (500 MB+ data files): Severe penalty — this is why volumes exist
Delete Performance (Whiteouts)
Deletions create tiny whiteout files — nearly free in terms of I/O. However, whiteouts do consume inode entries, and excessive whiteouts in deeply layered images can slow directory listings.
Best Practices for Dockerfile Ordering
# BAD: Copying source code (changes often) before installing deps (changes rarely)
FROM node:18-slim
WORKDIR /app
# Changes on every commit, which invalidates every later layer
COPY . .
# Forced to reinstall every time!
RUN npm install
CMD ["node", "server.js"]
# GOOD: Install dependencies first (cached), copy source last
FROM node:18-slim
WORKDIR /app
# Changes only when dependencies change
COPY package*.json ./
# Cached on most rebuilds
RUN npm install
# Only this layer rebuilds on code changes
COPY . .
CMD ["node", "server.js"]
# GOOD: Combine RUN commands to minimize layers and reduce image size
FROM ubuntu:22.04
RUN apt-get update && \
apt-get install -y --no-install-recommends \
nginx \
curl \
ca-certificates && \
rm -rf /var/lib/apt/lists/*
# Single layer: install + cleanup = smaller image
# BAD: Separate RUN commands (cleanup doesn't help — data is in earlier layer)
FROM ubuntu:22.04
RUN apt-get update
RUN apt-get install -y nginx curl ca-certificates
RUN rm -rf /var/lib/apt/lists/*
# Three layers: the rm only hides files with whiteouts, doesn't recover space!
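The effect of ordering on the build cache follows one rule: once a step's inputs change, that step and every step after it are rebuilt. The sketch below models just that rule, with a hypothetical `rebuilt_steps` helper; it does not inspect real Docker cache state.

```shell
#!/usr/bin/env bash
# rebuilt_steps FIRST_CHANGED STEP [STEP ...]
# Prints the steps that must rebuild when step number FIRST_CHANGED
# (1-based) has changed inputs: that step and everything after it.
rebuilt_steps() {
  local first=$1; shift
  local i=0 step
  for step in "$@"; do
    i=$(( i + 1 ))
    if [ "$i" -ge "$first" ]; then
      printf '%s\n' "$step"
    fi
  done
}

# Source code changed. In the "bad" ordering the source COPY is step 2,
# so npm install reruns; in the "good" ordering it is step 4.
rebuilt_steps 2 "FROM node:18-slim" "COPY . /app" "RUN npm install"
rebuilt_steps 4 "FROM node:18-slim" "COPY package*.json /app/" "RUN npm install" "COPY . /app"
```

The first call lists both the COPY and the npm install, while the second lists only the final COPY, which is why putting rarely changing steps first keeps rebuilds short.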
The "Delete Doesn't Save Space" Trap
A frequent mistake by Docker beginners is believing that deleting files in a later RUN instruction reduces image size. It doesn't: the rm creates whiteout markers that hide the files, but the earlier layer still contains them and still ships with the image. To actually save space, combine the install and cleanup in the same RUN instruction, so the files never reach a committed layer.
# Prove it: build these two Dockerfiles and compare sizes
# Version A: separate layers
echo 'FROM ubuntu:22.04
RUN dd if=/dev/zero of=/bigfile bs=1M count=100
RUN rm /bigfile' | docker build -t test-separate -
docker image ls test-separate
# REPOSITORY TAG SIZE
# test-separate latest 183MB (the deleted ~105 MB file is still stored!)
# Version B: single layer
echo 'FROM ubuntu:22.04
RUN dd if=/dev/zero of=/bigfile bs=1M count=100 && rm /bigfile' | docker build -t test-combined -
docker image ls test-combined
# REPOSITORY TAG SIZE
# test-combined latest 78MB (no waste!)
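The size difference can be reasoned about before building anything, since an image's size is the sum of its layers. In this sketch `image_size_mb` is a hypothetical helper, the 78 MB base comes from the test-combined output above, and the 105 MB figure is the 100 MiB written by dd expressed in decimal megabytes.

```shell
#!/usr/bin/env bash
# image_size_mb LAYER_MB [LAYER_MB ...]: an image is the sum of its layers.
image_size_mb() {
  local total=0 mb
  for mb in "$@"; do
    total=$(( total + mb ))
  done
  printf '%d\n' "$total"
}

base=78        # ubuntu:22.04 base layers
bigfile=105    # layer created by the dd step
whiteout=0     # a later rm adds only a tiny marker

echo "separate RUNs: $(image_size_mb "$base" "$bigfile" "$whiteout") MB"
echo "combined RUN:  $(image_size_mb "$base" 0) MB"
```

With separate instructions the dd layer is counted in full, and the rm layer adds almost nothing but removes nothing either. In the combined form the file never exists when the layer is committed.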
Exercises
- Manual OverlayFS Experiment: On a Linux system (or VM), create an OverlayFS mount as shown in this article. Create files in the lower layer, modify them through the merged view, and verify that the lower layer remains unchanged. Create a file in the merged view, then find it in the upper layer. Delete a lower-layer file and inspect the whiteout character device.
- Layer Inspection Challenge: Pull the nginx:latest image and use docker history and docker image inspect to document every layer, its size, and the instruction that created it. Then navigate to /var/lib/docker/overlay2/ and find the actual directories on disk. Browse the diff/ contents of each layer to see what files were added.
- Layer Sharing Test: Pull both python:3.11-slim and node:18-slim (both based on Debian). Run docker system df -v and identify the shared layers. Calculate the actual disk savings from layer deduplication versus storing each image independently.
- Image Size Optimisation: Write a Dockerfile that installs build-essential (to compile something), compiles a simple C program, then removes build-essential. First write it with separate RUN instructions, then rewrite as a single RUN. Compare the final image sizes and document the difference. Finally, use a multi-stage build (preview of Part 8) for maximum efficiency.
Conclusion & Next Steps
Union filesystems and copy-on-write semantics are the third pillar of container technology (alongside namespaces and cgroups). They solve the fundamental storage challenge: giving each container its own writable filesystem without the overhead of full copies.
Key takeaways from this article:
- Union filesystems merge multiple directory trees into a single unified view using layers
- OverlayFS is the default storage driver — it uses lowerdir (read-only layers), upperdir (writable), workdir (scratch), and merged (unified view)
- Copy-on-write means reads are cheap (passthrough), first writes have a copy-up cost, and deletes create whiteout markers
- Docker images are ordered stacks of read-only layers identified by SHA256 content hashes
- Layer sharing means the same base image stored once serves unlimited containers
- The container layer is ephemeral — use volumes for persistent data
- Dockerfile instruction ordering critically affects build cache performance and image size
With namespaces (isolation), cgroups (resource limits), and union filesystems (efficient storage) understood, you now have a complete picture of the Linux kernel primitives that power containers. In Part 5, we'll rise above these primitives to explore the Docker architecture — how the CLI, daemon, containerd, and runc orchestrate all these kernel features into a usable platform.
Next in the Series
In Part 5: Docker Architecture & Core Concepts, we will explore how Docker Engine's components — CLI, daemon, containerd, and runc — work together to transform kernel primitives into the container platform you use daily.