
Part 4: Union File Systems & Image Layering

May 14, 2026 Wasil Zafar 22 min read

Containers don't carry entire operating system copies — they share base layers and only store what's different. This magic is powered by union file systems and copy-on-write semantics, which make images lightweight, fast to distribute, and incredibly space-efficient.

Table of Contents

  1. Why Special Filesystems?
  2. Union Filesystem Concepts
  3. OverlayFS Deep Dive
  4. Copy-on-Write Semantics
  5. Docker Image Layers
  6. Layer Sharing & Deduplication
  7. The Writable Container Layer
  8. Alternative Storage Drivers
  9. Performance Implications
  10. Exercises
  11. Conclusion & Next Steps

Why Containers Need Special Filesystems

Consider the fundamental challenge: you want to run 50 containers on a single host, each running a different application but all based on Ubuntu 22.04. The Ubuntu base image is roughly 78 MB on disk once unpacked (about 30 MB compressed for transfer). A naive approach would require 50 × 78 MB ≈ 3.9 GB of disk space just for identical base OS files. This is wildly inefficient: those 50 copies of /usr/bin/bash, /lib/x86_64-linux-gnu/libc.so.6, and thousands of other files are byte-for-byte identical.

Virtual machines solved this differently (each VM carries its own full filesystem), but containers needed something smarter. The solution is a union filesystem — a filesystem that presents a unified view of multiple directory trees (layers) stacked on top of each other, where only the differences between layers need to be stored.

The Transparent Overlay Analogy: Imagine a printed paper map of a city (the base layer). You place a transparent acetate sheet on top and draw your custom route with a marker (the upper layer). When you look down, you see both the original map and your additions as one unified image. You haven't modified the original map — it's still pristine underneath. Multiple people can each have their own acetate sheet over the same printed map, each drawing different routes, without affecting each other. This is exactly how union filesystems work.

The Core Innovation

Union filesystems provide three key innovations that make containers practical:

  • Space efficiency — Shared base layers are stored once on disk, regardless of how many containers use them
  • Fast startup — Creating a new container doesn't require copying the entire filesystem; it only creates a thin writable layer
  • Efficient distribution — Pulling an image only downloads layers not already present on the host

Union Filesystem Concepts

A union filesystem merges multiple directory trees (called branches or layers) into a single coherent view. The key rules that govern this merge are:

| Rule | Description | Example |
| --- | --- | --- |
| Upper takes priority | If a file exists in multiple layers, the uppermost version wins | Custom /etc/nginx.conf overrides base layer's default |
| Directory merge | Directories with the same name are merged, showing all entries | /usr/bin/ shows files from all layers combined |
| Lower layers read-only | Only the topmost layer is writable; lower layers are immutable | Original base image is never modified by running containers |
| Deletion via whiteout | Removing a lower-layer file creates a special marker in the upper layer | Deleting /var/log/apt creates a whiteout, hiding the original |

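The directory-merge and whiteout rules can be sketched as a small model. This is an illustrative simulation, not how the kernel implements it; the layer dictionaries and the `merged_listing` helper are hypothetical names invented for this example.

```python
def merged_listing(layers_top_to_bottom):
    """Return the sorted names visible in one directory of the merged view.

    Each layer is a dict with optional keys:
      "entries":   set of names present in that layer's copy of the directory
      "whiteouts": set of names deleted by that layer
      "opaque":    True if the layer hides everything beneath it
    """
    visible = set()
    hidden = set()
    for layer in layers_top_to_bottom:
        # Names from this layer appear unless a higher layer deleted them.
        visible |= layer.get("entries", set()) - hidden
        # Whiteouts recorded here hide the same names in all lower layers.
        hidden |= layer.get("whiteouts", set())
        if layer.get("opaque", False):
            break  # nothing below an opaque directory is visible
    return sorted(visible)


# Hypothetical contents of /var/log in three layers.
upper = {"entries": {"custom.conf"}, "whiteouts": {"apt"}}
app = {"entries": {"app.log"}}
base = {"entries": {"apt", "dpkg.log"}}

print(merged_listing([upper, app, base]))
# ['app.log', 'custom.conf', 'dpkg.log']
```

The `apt` entry from the base layer never appears because the upper layer's whiteout is seen first.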
Union Filesystem Layer Stacking
flowchart TB
    subgraph merged["Merged View (what the container sees)"]
        M1["/bin/bash"]
        M2["/app/server.js"]
        M3["/etc/nginx.conf (custom)"]
        M4["/var/log/ (empty)"]
    end

    subgraph upper["Upper Layer (writable — container changes)"]
        U1["/etc/nginx.conf (modified)"]
        U2["/var/log/.wh.apt (whiteout)"]
    end

    subgraph mid["Middle Layer (read-only — application)"]
        A1["/app/server.js"]
        A2["/app/package.json"]
    end

    subgraph lower["Lower Layer (read-only — base OS)"]
        L1["/bin/bash"]
        L2["/etc/nginx.conf (default)"]
        L3["/var/log/apt/"]
    end

    lower --> mid --> upper --> merged
                            

Layer Visibility Rules

When a process in a container accesses a file path, the union filesystem resolves it by searching layers from top to bottom:

  1. Check upper (writable) layer first: if the path exists here, use it immediately. If the entry is a whiteout marker, stop and treat the file as deleted without consulting lower layers
  2. Check middle layers in order: search read-only layers from newest to oldest
  3. Check base layer last: if found nowhere else, look in the bottommost layer
  4. Return ENOENT: if not found in any layer, report "file not found"

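The lookup order above can be sketched in a few lines. This is a toy model under stated assumptions: each layer is a plain dictionary, and the `WHITEOUT` sentinel and `resolve` function are hypothetical names used only for illustration.

```python
WHITEOUT = object()  # sentinel standing in for a whiteout marker


def resolve(path, layers_top_to_bottom):
    """Return the content visible at `path`, or raise FileNotFoundError (ENOENT)."""
    for layer in layers_top_to_bottom:
        if path in layer:
            entry = layer[path]
            if entry is WHITEOUT:
                # A whiteout ends the search: lower copies stay hidden.
                break
            return entry
    raise FileNotFoundError(path)


upper = {"/etc/nginx.conf": "custom", "/var/log/apt": WHITEOUT}
middle = {"/app/server.js": "app code"}
lower = {"/bin/bash": "shell", "/etc/nginx.conf": "default", "/var/log/apt": "old logs"}
layers = [upper, middle, lower]

print(resolve("/etc/nginx.conf", layers))  # custom
print(resolve("/bin/bash", layers))        # shell
try:
    resolve("/var/log/apt", layers)
except FileNotFoundError:
    print("ENOENT")                        # ENOENT
```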
Historical Context: Union filesystems aren't new to containers. The concept dates back to the Plan 9 operating system (Bell Labs, 1992) and was first implemented in Linux as UnionFS (2003), followed by AUFS (Another Union File System, 2006) which Docker originally used. OverlayFS was merged into the Linux kernel mainline in version 3.18 (December 2014), making it the default choice today because it doesn't require out-of-tree patches.

OverlayFS Deep Dive

OverlayFS (also written as overlay or overlay2 in Docker's storage driver terminology) is the dominant union filesystem in modern container runtimes. It's built into the Linux kernel, requires no additional packages, and provides excellent performance characteristics for container workloads.

OverlayFS operates with four directory components:

| Component | Mount Option | Description | Docker Mapping |
| --- | --- | --- | --- |
| lowerdir | lowerdir= | One or more read-only directories, colon-separated and listed top to bottom (the leftmost entry is the uppermost layer) | Image layers (base OS, dependencies, application code) |
| upperdir | upperdir= | Single writable directory where all modifications are stored | Container's writable layer (ephemeral) |
| workdir | workdir= | Scratch space for atomic operations (must be on the same filesystem as upperdir) | Internal work directory managed by Docker |
| merged | mount point | The unified view presented to processes (the "result") | Container's root filesystem (/) |
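Because the kernel treats the leftmost lowerdir entry as the top of the stack, a tool that tracks image layers in build order (base first) has to reverse them when composing the mount options. The sketch below shows that step only; the function name and the directory paths are hypothetical.

```python
def overlay_mount_options(image_layers_bottom_to_top, upperdir, workdir):
    """Build the -o option string for `mount -t overlay`."""
    # lowerdir is read left to right as top to bottom, so reverse build order.
    lowerdir = ":".join(reversed(image_layers_bottom_to_top))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


# Hypothetical layer directories, listed in the order they were built.
layers = ["/layers/base", "/layers/deps", "/layers/app"]
print(overlay_mount_options(layers, "/ctr/diff", "/ctr/work"))
# lowerdir=/layers/app:/layers/deps:/layers/base,upperdir=/ctr/diff,workdir=/ctr/work
```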

Hands-On: Building an OverlayFS Mount

Let's manually create an OverlayFS mount to understand exactly what Docker does under the hood:

# Create the directory structure
mkdir -p /tmp/overlay-demo/{lower,upper,work,merged}

# Populate the "base image" (lower layer)
echo "I am the base OS config" > /tmp/overlay-demo/lower/config.txt
# printf interprets \n, and single quotes avoid history expansion of "!"
printf '#!/bin/bash\necho hello\n' > /tmp/overlay-demo/lower/app.sh
mkdir -p /tmp/overlay-demo/lower/logs
echo "old log entry" > /tmp/overlay-demo/lower/logs/access.log

# Mount the overlay filesystem
sudo mount -t overlay overlay \
    -o lowerdir=/tmp/overlay-demo/lower,\
upperdir=/tmp/overlay-demo/upper,\
workdir=/tmp/overlay-demo/work \
    /tmp/overlay-demo/merged

# Verify: the merged view shows lower layer content
ls /tmp/overlay-demo/merged/
# Output: app.sh  config.txt  logs

cat /tmp/overlay-demo/merged/config.txt
# Output: I am the base OS config
# Now modify a file through the merged view
echo "I am the MODIFIED config" > /tmp/overlay-demo/merged/config.txt

# Create a new file
echo "new data" > /tmp/overlay-demo/merged/newfile.txt

# Delete a file from the lower layer
rm /tmp/overlay-demo/merged/logs/access.log

# Inspect what happened in the upper layer
ls /tmp/overlay-demo/upper/
# Output: config.txt  logs  newfile.txt

cat /tmp/overlay-demo/upper/config.txt
# Output: I am the MODIFIED config

# The lower layer is UNTOUCHED
cat /tmp/overlay-demo/lower/config.txt
# Output: I am the base OS config

# Check for whiteout file (character device 0,0)
ls -la /tmp/overlay-demo/upper/logs/
# Output: c--------- 1 root root 0, 0 ... access.log  (whiteout!)

# Clean up
sudo umount /tmp/overlay-demo/merged
Key Insight

The Lower Layer is Sacred

No matter what operations the container performs — modifying files, creating new ones, deleting existing ones — the lower layer (image) is never touched. All changes are captured in the upper layer. This is the fundamental guarantee that makes image sharing possible: one hundred containers can all use the same base image layers simultaneously because none of them can modify those layers.

When you docker commit a container, what actually happens is that the upper layer's contents are frozen and become a new read-only layer in a new image.


Copy-on-Write Semantics

Copy-on-Write (CoW) is the strategy OverlayFS uses to provide writable access to files that originate in read-only lower layers. The name describes the policy: a file is only copied to the writable layer at the moment it is first written to — never before.

There are three fundamental operations:

| Operation | What Happens | Performance Cost |
| --- | --- | --- |
| Read-through | File is read directly from the lower layer, no copy needed | Near-zero (direct passthrough) |
| Copy-up | File is copied entirely from lower to upper layer before the first write | One-time cost proportional to file size |
| Whiteout | A character device (0,0) or .wh. prefixed file marks deletion | Negligible (creates tiny marker file) |

Copy-on-Write Decision Flow
flowchart TD
    A["Process requests file operation"] --> B{"Operation type?"}
    B -->|"READ"| C{"File in upper layer?"}
    C -->|"Yes"| D["Read from upper layer"]
    C -->|"No"| E["Read directly from lower layer<br/>(read-through, zero cost)"]
    B -->|"WRITE"| F{"File already in upper layer?"}
    F -->|"Yes"| G["Write directly to upper layer"]
    F -->|"No"| H["COPY-UP: Copy entire file<br/>from lower → upper layer"]
    H --> I["Write modification to<br/>the new upper-layer copy"]
    B -->|"DELETE"| J{"File origin?"}
    J -->|"Upper layer only"| K["Simply remove from upper"]
    J -->|"Lower layer"| L["Create whiteout marker<br/>in upper layer"]
    style H fill:#BF092F,color:#fff
    style E fill:#3B9797,color:#fff
    style L fill:#132440,color:#fff
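The decision flow can be modelled with two dictionaries and a set of whiteouts. This is a simplified simulation, assuming whole files held in memory and no file creation; the `CowFilesystem` class is a hypothetical name, not part of any real library.

```python
class CowFilesystem:
    """Toy copy-on-write model: one read-only lower layer, one writable upper layer."""

    def __init__(self, lower):
        self.lower = dict(lower)  # path -> bytes, never modified
        self.upper = {}           # path -> bytes, holds all changes
        self.whiteouts = set()    # lower-layer paths marked as deleted
        self.bytes_copied = 0     # total bytes moved by copy-up

    def read(self, path):
        if path in self.upper:
            return self.upper[path]
        if path in self.whiteouts or path not in self.lower:
            raise FileNotFoundError(path)
        return self.lower[path]   # read-through: nothing is copied

    def write_at(self, path, offset, data):
        if path not in self.upper:
            if path in self.whiteouts or path not in self.lower:
                raise FileNotFoundError(path)
            # Copy-up: the whole file moves, however small the change is.
            self.upper[path] = self.lower[path]
            self.bytes_copied += len(self.lower[path])
        buf = bytearray(self.upper[path])
        buf[offset:offset + len(data)] = data
        self.upper[path] = bytes(buf)

    def delete(self, path):
        was_in_upper = self.upper.pop(path, None) is not None
        if path in self.lower and path not in self.whiteouts:
            self.whiteouts.add(path)  # hide the lower copy
        elif not was_in_upper:
            raise FileNotFoundError(path)


fs = CowFilesystem({"/data/db.bin": bytes(1000), "/etc/motd": b"hello\n"})
fs.read("/etc/motd")
print(fs.bytes_copied)              # 0
fs.write_at("/data/db.bin", 0, b"X")
print(fs.bytes_copied)              # 1000
fs.write_at("/data/db.bin", 1, b"Y")
print(fs.bytes_copied)              # 1000
fs.delete("/etc/motd")
print("/etc/motd" in fs.whiteouts)  # True
```

A one-byte write costs a 1000-byte copy the first time and nothing extra afterwards, and the lower layer's data is never altered.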

Whiteout Files and Opaque Directories

When a container deletes a file that exists in a lower (read-only) layer, the filesystem can't actually remove it — the lower layer is immutable. Instead, OverlayFS creates a whiteout file in the upper layer that effectively hides the original:

# OverlayFS uses character devices with major:minor 0:0 as whiteouts
# When you delete /etc/motd from a lower layer:
ls -la /var/lib/docker/overlay2/<container-id>/diff/etc/
# c--------- 1 root root 0, 0 May 14 10:00 motd

# For opaque directories (hide ALL content of a lower-layer directory):
# OverlayFS sets the trusted.overlay.opaque xattr on the upper-layer directory
# (a file named .wh..wh..opq is the equivalent marker inside image layer tarballs)
sudo getfattr -n trusted.overlay.opaque /var/lib/docker/overlay2/<container-id>/diff/var/cache/
# trusted.overlay.opaque="y"
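Whiteouts can also be found programmatically by checking for character devices with device number 0:0. The helper names below are hypothetical, and the demo runs against a scratch directory so that it works without root or a mounted overlay.

```python
import os
import stat
import tempfile


def is_whiteout(path):
    """True if `path` is a character device with device number 0:0."""
    st = os.lstat(path)
    return (
        stat.S_ISCHR(st.st_mode)
        and os.major(st.st_rdev) == 0
        and os.minor(st.st_rdev) == 0
    )


def find_whiteouts(upperdir):
    """Walk an upper layer and return the relative paths of all whiteouts."""
    found = []
    for root, _dirs, files in os.walk(upperdir):
        for name in files:
            full = os.path.join(root, name)
            if is_whiteout(full):
                found.append(os.path.relpath(full, upperdir))
    return sorted(found)


# The scratch directory holds only a regular file, so nothing is reported.
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "newfile.txt"), "w") as f:
        f.write("new data\n")
    print(find_whiteouts(scratch))  # []
```

Pointing `find_whiteouts` at a real upper directory (which usually requires root) would list the lower-layer files that the container has deleted.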
Copy-Up Gotcha: The copy-up operation copies the entire file, even if you only modify one byte. This means modifying a single character in a 500 MB database file will copy all 500 MB to the upper layer. This is why Docker best practices discourage storing large, frequently-modified files in the container filesystem — use volumes instead. The copy-up also includes all metadata (permissions, xattrs, timestamps).

Docker Image Layers

Every Docker image is composed of an ordered stack of read-only layers. Each layer represents a set of filesystem changes (files added, modified, or deleted) relative to the layer below it. When Docker builds an image from a Dockerfile, instructions that change the filesystem (RUN, COPY, ADD) create a new layer, while the remaining instructions only record metadata in the image configuration:

  • FROM ubuntu:22.04: sets the base layers (the Ubuntu image contributes its own layer or layers)
  • RUN apt-get update && apt-get install -y nginx: creates a layer with the installed packages
  • COPY ./app /usr/share/nginx/html: creates a layer with your application files
  • CMD ["nginx", "-g", "daemon off;"]: only adds metadata, no filesystem layer
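As a rough illustration of this split, the sketch below labels each instruction in a Dockerfile. It is a simplification: it ignores line continuations and build-tool specifics, and the `classify` helper is a hypothetical name.

```python
LAYER_INSTRUCTIONS = {"RUN", "COPY", "ADD"}


def classify(dockerfile_text):
    """Label each instruction as base layers, new layer, or metadata only."""
    result = []
    for line in dockerfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword = line.split(maxsplit=1)[0].upper()
        if keyword == "FROM":
            kind = "base layers"
        elif keyword in LAYER_INSTRUCTIONS:
            kind = "new layer"
        else:
            kind = "metadata only"
        result.append((keyword, kind))
    return result


example = """\
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y nginx
COPY ./app /usr/share/nginx/html
CMD ["nginx", "-g", "daemon off;"]
"""

for keyword, kind in classify(example):
    print(f"{keyword:5} {kind}")
```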

Inspecting Image Layers

# View the layers of an image with their sizes
docker history nginx:latest
# IMAGE          CREATED       CREATED BY                                      SIZE
# 3b25b682ea82   2 weeks ago   CMD ["nginx" "-g" "daemon off;"]                0B
# <missing>      2 weeks ago   STOPSIGNAL SIGQUIT                              0B
# <missing>      2 weeks ago   EXPOSE map[80/tcp:{}]                           0B
# <missing>      2 weeks ago   ENTRYPOINT ["/docker-entrypoint.sh"]            0B
# <missing>      2 weeks ago   COPY file:xxx in /docker-entrypoint.d           4.62kB
# <missing>      2 weeks ago   COPY file:xxx in /docker-entrypoint.d           3.02kB
# <missing>      2 weeks ago   COPY file:xxx in /docker-entrypoint.sh          1.62kB
# <missing>      2 weeks ago   RUN /bin/sh -c set -x ...                      61.1MB
# <missing>      2 weeks ago   ENV PKG_RELEASE=1~bookworm                      0B
# <missing>      2 weeks ago   ENV NJS_VERSION=0.8.4                           0B
# <missing>      2 weeks ago   ENV NGINX_VERSION=1.25.5                        0B
# <missing>      3 weeks ago   /bin/sh -c #(nop) CMD ["bash"]                  0B
# <missing>      3 weeks ago   /bin/sh -c #(nop) ADD file:xxx in /             74.8MB
# Inspect the detailed layer information (content-addressable storage)
docker image inspect nginx:latest --format '{{json .RootFS}}' | jq .
# {
#   "Type": "layers",
#   "Layers": [
#     "sha256:2edcec3590a4ec7...",   ← Debian base (74.8 MB)
#     "sha256:e379e8aedd4d72...",   ← nginx installation (61.1 MB)
#     "sha256:b8d6e692a25e11...",   ← entrypoint script
#     "sha256:f1db227348d0a5...",   ← config files
#     "sha256:32ce5f6a5106ec...",   ← more config
#     "sha256:d73e673c3f132e...",   ← final config
#     "sha256:9a1f56e408ebbc..."    ← latest changes
#   ]
# }
# Examine the actual filesystem content of each layer on disk
# Docker stores layers in /var/lib/docker/overlay2/
ls /var/lib/docker/overlay2/
# Output shows one directory per layer, named by a locally generated ID
# (these directory names are not the layer's SHA256 content hash)

# View the diff (changes) in a specific layer
ls /var/lib/docker/overlay2/<layer-dir>/diff/
# Shows only the files added/modified in that specific layer
Deep Dive

Content-Addressable Storage

Docker uses content-addressable storage — each layer is identified by the SHA256 hash of its contents. This means:

  • Two byte-identical layers (same files, same permissions, same timestamps) always produce the same hash
  • Layers can be safely deduplicated — if two images share a layer hash, only one copy exists on disk
  • Layer integrity is guaranteed — if the content changes, the hash changes
  • Pull operations can skip layers that already exist locally (matching by hash)

This is why image pulls are often faster than expected — if you already have ubuntu:22.04 layers cached, pulling another image based on the same Ubuntu version reuses those layers entirely.
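The deduplication and pull-skipping behaviour follows directly from hashing. The sketch below uses Python's standard `hashlib`; the `LayerStore` class and the byte strings standing in for layer archives are hypothetical.

```python
import hashlib


class LayerStore:
    """Toy content-addressable store: layers are keyed by their SHA256 digest."""

    def __init__(self):
        self.blobs = {}

    def put(self, layer_bytes):
        digest = "sha256:" + hashlib.sha256(layer_bytes).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self.blobs.setdefault(digest, layer_bytes)
        return digest

    def missing(self, digests):
        """Digests a pull would still need to download."""
        return [d for d in digests if d not in self.blobs]


store = LayerStore()
base = b"pretend this is the base OS layer archive"
image_a = [store.put(base), store.put(b"app A files")]
image_b = [store.put(base), store.put(b"app B files")]

print(image_a[0] == image_b[0])  # True
print(len(store.blobs))          # 3
print(store.missing(image_a))    # []
```

Four `put` calls leave three blobs in the store because the shared base layer hashes to the same digest both times.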

SHA256 Deduplication Content-Addressable

Layer Sharing & Deduplication

Layer sharing is one of Docker's most powerful efficiency features. When multiple images share the same base layers, those layers are stored only once on disk and shared across all images and containers that reference them.

# Check overall disk usage (summary view)
docker system df
# TYPE           TOTAL    ACTIVE   SIZE      RECLAIMABLE
# Images         12       5        4.235GB   2.1GB (49%)
# Containers     5        3        125.4MB   45.2MB (36%)
# Local Volumes  8        4        890MB     445MB (50%)
# Build Cache    23       0        1.2GB     1.2GB

# Detailed image breakdown showing SHARED SIZE
docker system df -v | head -20
# REPOSITORY    TAG     IMAGE ID     CREATED    SIZE      SHARED SIZE   UNIQUE SIZE
# myapp         latest  abc123       1h ago     450MB     350MB         100MB
# nginx         latest  def456       2w ago     187MB     187MB         0B
# node          18      ghi789       3w ago     350MB     74.8MB        275.2MB
# Demonstrate layer sharing: pull two images with the same base
docker pull node:18-slim
# 18-slim: Pulling from library/node
# bd159e379b3b: Already exists    ← Shared Debian layer!
# 28b2c5b2ea8e: Pull complete     ← Node-specific layer
# ...

docker pull python:3.11-slim
# 3.11-slim: Pulling from library/python
# bd159e379b3b: Already exists    ← Same shared Debian layer!
# a1b2c3d4e5f6: Pull complete     ← Python-specific layer
# ...
Real-World Savings: In a typical microservices deployment with 20 services all based on node:18-slim, the base layer (~80 MB) is stored once, not 20 times. That's 80 MB on disk instead of 1.6 GB — a 95% saving on base layer storage. The same applies to container runtime memory: shared read-only layers use shared page cache entries, meaning the kernel caches the file data once for all containers accessing the same layer content.
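The arithmetic behind that figure is easy to reproduce. The function name is hypothetical, and the inputs are the 20 services and roughly 80 MB base layer quoted above.

```python
def base_layer_savings(num_services, base_layer_mb):
    """Compare storing the base layer once versus once per service."""
    unshared_mb = num_services * base_layer_mb
    shared_mb = base_layer_mb
    saved_fraction = 1 - shared_mb / unshared_mb
    return unshared_mb, shared_mb, saved_fraction


unshared, shared, saved = base_layer_savings(20, 80)
print(f"{unshared} MB without sharing, {shared} MB with sharing, {saved:.0%} saved")
# 1600 MB without sharing, 80 MB with sharing, 95% saved
```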

The Writable Container Layer

When Docker creates a container from an image, it adds one thin writable layer on top of the image's read-only layers. This is called the container layer (or the diff). All filesystem modifications made by the running container — file writes, package installations, log files, temp files — are stored here and only here.

# Start a container and make changes
docker run -d --name demo nginx:latest

# Check the container's filesystem usage
docker inspect demo --format '{{.GraphDriver.Data.UpperDir}}'
# /var/lib/docker/overlay2/abc123.../diff

# See what's in the writable layer (initially almost empty)
sudo ls /var/lib/docker/overlay2/abc123.../diff/
# (empty or minimal runtime files)

# Now write some data inside the container
docker exec demo sh -c 'echo "hello" > /tmp/myfile.txt'
docker exec demo sh -c 'apt-get update > /dev/null 2>&1'

# Check the writable layer again — all changes are here
sudo ls /var/lib/docker/overlay2/abc123.../diff/
# tmp/  var/

sudo du -sh /var/lib/docker/overlay2/abc123.../diff/
# 28M   (the apt-get update cached files)
Ephemeral by Design: When a container is removed (docker rm), its writable layer is permanently deleted. Any data written to the container filesystem that isn't stored in a volume is gone forever. This is intentional — containers are designed to be disposable. Production applications should write persistent data to named volumes or bind mounts, never to the container layer.
# Prove the ephemeral nature
docker run -d --name temp-demo alpine sleep 3600
docker exec temp-demo sh -c 'echo "important data" > /data.txt'
docker restart -t 1 temp-demo  # restart to verify
docker exec temp-demo cat /data.txt
# important data  (still there: same container, same writable layer)

docker rm -f temp-demo  # remove the container
docker run --rm alpine cat /data.txt
# cat: can't open '/data.txt': No such file or directory
# The data is GONE: new container, new empty writable layer

Alternative Storage Drivers

While overlay2 is the default and recommended storage driver for Docker, alternatives exist for specific use cases or legacy systems:

| Driver | Status | Backing Filesystem | Key Feature | Limitation |
| --- | --- | --- | --- | --- |
| overlay2 | Default, recommended | ext4, xfs | In-kernel, no patches needed, fast | Limited to 128 lower layers |
| AUFS | Deprecated (removed in Docker 24+) | ext4, xfs | First union FS Docker used; mature | Not in mainline kernel; requires patches |
| btrfs | Supported | btrfs only | Native snapshots, compression, checksumming | Requires btrfs filesystem on host |
| zfs | Supported | ZFS only | Enterprise features: dedup, compression, snapshots | Heavy memory usage; complex setup |
| devicemapper | Deprecated | Block device | Works without specific filesystem requirements | Complex, slower, loopback mode unreliable |
| fuse-overlayfs | Rootless mode | Any | Works in user namespaces (rootless Docker) | Slower than kernel overlay2 (FUSE overhead) |

# Check which storage driver Docker is currently using
docker info --format '{{.Driver}}'
# overlay2

# View detailed storage driver information
docker info | grep -A5 "Storage Driver"
# Storage Driver: overlay2
#  Backing Filesystem: extfs
#  Supports d_type: true
#  Using metacopy: false
#  Native Overlay Diff: true
#  userxattr: false
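Configure the storage driver in /etc/docker/daemon.json. JSON does not allow comments, so keep the file free of them. The overlay2.size option only takes effect when the backing filesystem is xfs mounted with the pquota option:

{
    "storage-driver": "overlay2",
    "storage-opts": [
        "overlay2.size=20G"
    ]
}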
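One way to avoid syntax slips in this file is to generate it rather than edit it by hand. The sketch below only renders the JSON text with Python's standard `json` module; the `render_daemon_config` helper is a hypothetical name, and writing the result to /etc/docker/daemon.json and restarting the daemon are left out.

```python
import json


def render_daemon_config(storage_driver, storage_opts=None):
    """Return daemon.json content as a strictly valid JSON string."""
    config = {"storage-driver": storage_driver}
    if storage_opts:
        config["storage-opts"] = list(storage_opts)
    return json.dumps(config, indent=4)


text = render_daemon_config("overlay2", ["overlay2.size=20G"])
print(text)

# Parsing the text back confirms it is well-formed JSON.
parsed = json.loads(text)
print(parsed["storage-driver"])  # overlay2
```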

Performance Implications

Understanding the performance characteristics of union filesystems is essential for building efficient container images and running performant containers:

Read Performance

Reads are fast — almost as fast as native filesystem access. When reading a file from a lower layer, OverlayFS passes through directly to the underlying filesystem with minimal overhead. The kernel page cache still works normally, so frequently-accessed files remain cached in memory.

Write Performance (Copy-Up Cost)

The first write to a lower-layer file incurs the copy-up penalty. Subsequent writes to the same file are fast (writing directly to the upper layer copy). The cost depends on file size:

  • Small files (configs, scripts): Copy-up is near-instantaneous
  • Large files (databases, logs): Copy-up can be expensive (must copy entire file)
  • Very large files (500 MB+ data files): Severe penalty — this is why volumes exist

Delete Performance (Whiteouts)

Deletions create tiny whiteout files — nearly free in terms of I/O. However, whiteouts do consume inode entries, and excessive whiteouts in deeply layered images can slow directory listings.

Best Practices for Dockerfile Ordering

Layer Ordering Rule: Place instructions that change least frequently at the top of your Dockerfile, and instructions that change most frequently at the bottom. This maximises layer cache reuse during rebuilds. Since Docker invalidates all layers after the first changed layer, a change in an early layer forces all subsequent layers to rebuild.
# BAD: Copying source code (changes often) before installing deps (changes rarely)
FROM node:18-slim
WORKDIR /app
# Changes on every commit, so everything below is rebuilt
COPY . .
# Forced to reinstall every time!
RUN npm install
CMD ["node", "server.js"]

# GOOD: Install dependencies first (cached), copy source last
FROM node:18-slim
WORKDIR /app
# Changes only when dependencies change
COPY package*.json ./
# Cached on most rebuilds
RUN npm install
# Only this layer rebuilds on code changes
COPY . .
CMD ["node", "server.js"]

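The invalidation rule can be simulated by comparing two builds step by step. This is a toy model: real builders derive cache keys from the instruction and the checksums of copied files, whereas the fingerprints below are hypothetical labels, and `rebuilt_steps` is an invented helper.

```python
def rebuilt_steps(previous, current):
    """Return the instructions in `current` that cannot be served from cache.

    Each build is a list of (instruction, input_fingerprint) pairs. The first
    mismatch invalidates that step and every step after it.
    """
    for index, step in enumerate(current):
        if index >= len(previous) or previous[index] != step:
            return [instruction for instruction, _ in current[index:]]
    return []


# Source copied before dependencies are installed.
bad_v1 = [("COPY . /app", "src-v1"), ("RUN npm install", None)]
bad_v2 = [("COPY . /app", "src-v2"), ("RUN npm install", None)]
print(rebuilt_steps(bad_v1, bad_v2))
# ['COPY . /app', 'RUN npm install']

# Dependencies installed before the source is copied.
good_v1 = [("COPY package*.json /app/", "deps-v1"), ("RUN npm install", None), ("COPY . /app", "src-v1")]
good_v2 = [("COPY package*.json /app/", "deps-v1"), ("RUN npm install", None), ("COPY . /app", "src-v2")]
print(rebuilt_steps(good_v1, good_v2))
# ['COPY . /app']
```

In both cases the same source change is made; only the ordering decides whether `npm install` runs again.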
# GOOD: Combine RUN commands to minimize layers and reduce image size
FROM ubuntu:22.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        nginx \
        curl \
        ca-certificates && \
    rm -rf /var/lib/apt/lists/*
# Single layer: install + cleanup = smaller image

# BAD: Separate RUN commands (cleanup doesn't help — data is in earlier layer)
FROM ubuntu:22.04
RUN apt-get update
RUN apt-get install -y nginx curl ca-certificates
RUN rm -rf /var/lib/apt/lists/*
# Three layers: the rm only hides files with whiteouts, doesn't recover space!
Common Mistake

The "Delete Doesn't Save Space" Trap

A frequent mistake by Docker beginners is believing that deleting files in a later RUN instruction reduces image size. It doesn't: the later layer only adds whiteout markers that hide the files, while the earlier layer still contains them. To actually save space, combine the install and cleanup in the same RUN instruction (or use a multi-stage build).

# Prove it: build these two Dockerfiles and compare sizes
# Version A: separate layers
echo 'FROM ubuntu:22.04
RUN dd if=/dev/zero of=/bigfile bs=1M count=100
RUN rm /bigfile' | docker build -t test-separate -
docker image ls test-separate
# REPOSITORY      TAG     SIZE
# test-separate   latest  178MB  (100MB wasted!)

# Version B: single layer
echo 'FROM ubuntu:22.04
RUN dd if=/dev/zero of=/bigfile bs=1M count=100 && rm /bigfile' | docker build -t test-combined -
docker image ls test-combined
# REPOSITORY      TAG     SIZE
# test-combined   latest  78MB   (no waste!)
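The size difference falls out of how layers are measured: each layer stores whatever was added or changed relative to the snapshot before it, and an image's size is the sum of its layers. The sketch below models that under simplifying assumptions (sizes in MB, file size used as a stand-in for content, whiteouts counted as zero); the helper names are hypothetical.

```python
def layer_size_mb(before, after):
    """Size of a layer: files added or changed relative to the previous snapshot."""
    return sum(size for path, size in after.items() if before.get(path) != size)


def image_size_mb(base_mb, snapshots):
    """Sum the base image and one layer per step between consecutive snapshots."""
    total = base_mb
    for before, after in zip(snapshots, snapshots[1:]):
        total += layer_size_mb(before, after)
    return total


BASE_MB = 78  # size of the base image used in the comparison above

# Separate RUN instructions: a snapshot is taken after dd and again after rm.
separate = [{}, {"/bigfile": 100}, {}]
# Single RUN instruction: the file is gone before the snapshot is taken.
combined = [{}, {}]

print(image_size_mb(BASE_MB, separate))  # 178
print(image_size_mb(BASE_MB, combined))  # 78
```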

Exercises

  1. Manual OverlayFS Experiment — On a Linux system (or VM), create an OverlayFS mount as shown in this article. Create files in the lower layer, modify them through the merged view, and verify that the lower layer remains unchanged. Create a file in the merged view, then find it in the upper layer. Delete a lower-layer file and inspect the whiteout character device.
  2. Layer Inspection Challenge — Pull the nginx:latest image and use docker history and docker image inspect to document every layer, its size, and the instruction that created it. Then navigate to /var/lib/docker/overlay2/ and find the actual directories on disk. Browse the diff/ contents of each layer to see what files were added.
  3. Layer Sharing Test — Pull both python:3.11-slim and node:18-slim (both based on Debian). Run docker system df -v and identify the shared layers. Calculate the actual disk savings from layer deduplication versus storing each image independently.
  4. Image Size Optimisation — Write a Dockerfile that installs build-essential (to compile something), compiles a simple C program, then removes build-essential. First write it with separate RUN instructions, then rewrite as a single RUN. Compare the final image sizes and document the difference. Finally, use a multi-stage build (preview of Part 8) for maximum efficiency.

Conclusion & Next Steps

Union file systems and copy-on-write semantics are the third pillar of container technology (alongside namespaces and cgroups). They solve the fundamental storage challenge: giving each container its own writable filesystem without the overhead of full copies.

Key takeaways from this article:

  • Union filesystems merge multiple directory trees into a single unified view using layers
  • OverlayFS backs Docker's default storage driver (overlay2): it uses lowerdir (read-only layers), upperdir (writable), workdir (scratch), and merged (unified view)
  • Copy-on-write means reads are cheap (passthrough), first writes have a copy-up cost, and deletes create whiteout markers
  • Docker images are ordered stacks of read-only layers identified by SHA256 content hashes
  • Layer sharing means the same base image stored once serves unlimited containers
  • The container layer is ephemeral — use volumes for persistent data
  • Dockerfile instruction ordering critically affects build cache performance and image size

With namespaces (isolation), cgroups (resource limits), and union filesystems (efficient storage) understood, you now have a complete picture of the Linux kernel primitives that power containers. In Part 5, we'll rise above these primitives to explore the Docker architecture — how the CLI, daemon, containerd, and runc orchestrate all these kernel features into a usable platform.

Next in the Series

In Part 5: Docker Architecture & Core Concepts, we will explore how Docker Engine's components — CLI, daemon, containerd, and runc — work together to transform kernel primitives into the container platform you use daily.