Part 7: File Systems & Storage — inodes, ext4 & OverlayFS

Block Devices & Partitions

Storage devices on Linux are represented as block devices — files in /dev/ that support reading and writing fixed-size blocks. Naming conventions:

/dev/sda — first SATA/SCSI disk; /dev/sda1, /dev/sda2 — its partitions
/dev/nvme0n1 — first NVMe SSD; /dev/nvme0n1p1 — its first partition
/dev/vda — virtual disk (KVM/cloud VMs)
/dev/loop0 — loop device (an image file mounted as if it were a disk)
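
The naming pattern above follows a simple rule that can be sketched in shell: when a disk name ends in a digit (as with nvme0n1 or loop0), a "p" is placed between the disk name and the partition number. The helper name partition_name is hypothetical and only illustrates the convention; it does not query any real device.

```shell
# Build a partition device name from a disk name and a partition number.
# Convention: disk names that end in a digit get a "p" separator.
partition_name() {
    local disk="$1" part="$2"
    case "$disk" in
        *[0-9]) printf '%sp%s\n' "$disk" "$part" ;;
        *)      printf '%s%s\n'  "$disk" "$part" ;;
    esac
}

partition_name /dev/sda 1        # /dev/sda1
partition_name /dev/nvme0n1 1    # /dev/nvme0n1p1
partition_name /dev/vda 2        # /dev/vda2
```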

# List block devices with hierarchy
lsblk                        # Tree: disk → partition → LVM volumes
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT

# List devices with UUIDs (used in /etc/fstab)
blkid

# Disk I/O stats (read/write throughput per device)
iostat -xh 1 3   # Extended stats, human-readable, 3 samples at 1-second intervals

# Partition table info
sudo fdisk -l /dev/sda 2>/dev/null || sudo fdisk -l /dev/nvme0n1 2>/dev/null

Inodes & Directory Entries

In Unix file systems, a file's metadata and its name are stored separately:

An inode stores everything about a file except its name: file type, permissions, owner, group, size, timestamps (atime/mtime/ctime), and pointers to the data blocks on disk.
A directory entry (dentry) maps a filename to an inode number. A directory is just a file containing a table of (name, inode_number) pairs.

This separation enables hard links: multiple directory entries pointing to the same inode. The inode and its data blocks are freed only when the link count drops to 0 (every hard link has been removed) and no process still has the file open.

# Inspect a file's inode
ls -i /etc/passwd          # Show inode number
stat /etc/passwd           # Full inode info: size, blocks, permissions, all timestamps

# Hard links — two names, one inode
touch /tmp/test.txt                      # Create the file to link to
ln /tmp/test.txt /tmp/test-hardlink.txt  # Create hard link
ls -li /tmp/test*.txt      # Same inode number, link count = 2
stat /tmp/test.txt | grep "Links"  # Links: 2

# Symbolic links — separate inode, contains path
ln -s /tmp/test.txt /tmp/test-symlink.txt
ls -li /tmp/test*.txt      # Symlink has DIFFERENT inode number

# Check inode usage (inodes can run out before disk space does!)
df -i                      # Show inode usage per filesystem

# Count unique inodes under /var (up to 3 levels deep, same filesystem only)
find /var -xdev -maxdepth 3 -printf '%i\n' 2>/dev/null | sort -u | wc -l
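
The second half of the deletion rule (no process has the file open) is easy to miss. The sketch below, which uses a throwaway directory from mktemp, removes a file's only name while a file descriptor is still open and then reads the data through that descriptor.

```shell
# A file's data outlives its last name while a descriptor is still open.
demo_dir=$(mktemp -d)
echo "still here" > "$demo_dir/data.txt"

exec 3< "$demo_dir/data.txt"   # open the file for reading on descriptor 3
rm "$demo_dir/data.txt"        # remove the only name: link count drops to 0

ls -A "$demo_dir"              # prints nothing: the directory is empty
survivor=$(cat <&3)            # the inode and its data are still reachable via fd 3
echo "read via fd 3: $survivor"

exec 3<&-                      # closing the last descriptor lets the kernel free the inode
rmdir "$demo_dir"
```

This is also why deleting a large log file does not free space until the process writing to it closes the file or exits.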

            
Inode Exhaustion: You can run out of inodes before you run out of disk space. This happens when you create huge numbers of small files (e.g., npm node_modules with 100,000 tiny files, or a mail server with millions of small messages). df -i shows inode usage. The fix is either to delete files or to recreate the filesystem with more inodes (mkfs.ext4 -N NUM_INODES). On ext4 the inode count is set when the filesystem is created, so you can't raise it on an existing filesystem without reformatting.
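
A small check along these lines can flag filesystems that are close to inode exhaustion. This is a minimal sketch: the function name inode_alert and the 80% threshold are arbitrary choices, and the parsing assumes mount points without spaces.

```shell
# Print mount points whose inode usage is at or above a threshold (percent).
# Reads `df -iP`-style text on stdin so it can also be run on saved output.
inode_alert() {
    local threshold="$1"
    awk -v t="$threshold" 'NR > 1 && $5 ~ /^[0-9]+%$/ {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t) print $6, $5
    }'
}

# Column 5 of `df -iP` is IUse%, column 6 is the mount point
df -iP | inode_alert 80
```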
        

ext4 On-Disk Layout

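ext4 is the default filesystem on many Linux distributions, including Debian and Ubuntu. The disk is divided into block groups, each containing block and inode bitmaps, an inode table, and data blocks. The superblock (filesystem-wide metadata) lives in the first block group, with backup copies stored in some of the other groups.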

# View ext4 filesystem info (on a specific device/mount)
sudo tune2fs -l /dev/sda1 2>/dev/null || sudo tune2fs -l /dev/nvme0n1p1 2>/dev/null
# Shows: inode count, block count, block size, last check, mount count...

# Check filesystem integrity (run on an unmounted filesystem)
sudo e2fsck -n /dev/sda1 2>/dev/null   # -n = open read-only and answer "no" to every prompt (no changes)

# View ext4 features enabled
sudo dumpe2fs /dev/sda1 2>/dev/null | grep "Filesystem features"

# Check the superblock
sudo dumpe2fs /dev/sda1 2>/dev/null | head -40
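
The size of a block group follows from the bitmap layout, and the arithmetic can be checked against the "Blocks per group" value printed by tune2fs -l. The sketch below assumes 4096-byte blocks, a block bitmap that occupies exactly one block, and no bigalloc feature; the 100 GiB filesystem size is a made-up example.

```shell
# One block bitmap block holds block_size * 8 bits, and each bit tracks one block.
block_size=4096                                  # bytes (assumed)
blocks_per_group=$(( block_size * 8 ))
group_bytes=$(( blocks_per_group * block_size ))
group_mib=$(( group_bytes / 1024 / 1024 ))
echo "blocks per group: $blocks_per_group"       # 32768
echo "group size: ${group_mib} MiB"              # 128 MiB

# Number of block groups on a hypothetical 100 GiB filesystem (rounded up)
fs_bytes=$(( 100 * 1024 * 1024 * 1024 ))
groups=$(( (fs_bytes + group_bytes - 1) / group_bytes ))
echo "block groups in 100 GiB: $groups"          # 800
```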

Journaling & Crash Recovery

Before journaling, a crash or power loss in the middle of a write could leave the filesystem inconsistent, with metadata that no longer matched the data on disk. Recovery required fsck to check the metadata of the entire filesystem, which could take hours on large disks. Journaling solves this by writing changes to a circular log (the journal) before applying them to the main filesystem. After a crash, the journal is replayed: transactions that were fully committed to the journal are applied, and incomplete ones are discarded.

ext4 offers three journaling modes:

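journal (data=journal): Both metadata and file data are written to the journal. Slowest, but safest.
ordered (data=ordered, the default): Only metadata is journaled, but data blocks are written to disk before the metadata that refers to them is committed. Good balance.
writeback (data=writeback): Only metadata is journaled, with no ordering between data and metadata writes. Fastest, but files written shortly before a crash may contain stale data afterwards.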
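
The mode is chosen with the data= mount option. As a rough illustration, the helper below (journal_mode is a hypothetical name) pulls that option out of a mount option string. Falling back to ordered when no data= option is listed is an assumption of this sketch, since the kernel may omit the option from /proc/mounts when the default is in effect.

```shell
# Extract the ext4 data= journaling mode from a comma-separated option string.
# Assumption: with no data= option present, the default (ordered) applies.
journal_mode() {
    local opts="$1" mode
    mode=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^data=//p' | head -n 1)
    echo "${mode:-ordered}"
}

journal_mode "rw,relatime,data=journal"        # journal
journal_mode "rw,noatime,data=writeback"       # writeback
journal_mode "rw,relatime,errors=remount-ro"   # ordered

# Apply it to the ext4 filesystems that are currently mounted
# (/proc/mounts fields: device, mount point, type, options, ...)
if [ -r /proc/mounts ]; then
    while read -r dev mnt type opts rest; do
        if [ "$type" = "ext4" ]; then
            echo "$mnt: $(journal_mode "$opts")"
        fi
    done < /proc/mounts
fi
```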

The VFS Layer

The VFS (Virtual File System) is an abstraction layer in the kernel that gives userspace a uniform API (open(), read(), write(), close()) regardless of which filesystem is underneath: ext4, btrfs, XFS, tmpfs, procfs, or NFS. Each filesystem implements the operations the VFS expects (grouped in kernel structures such as super_operations, inode_operations, and file_operations) and registers itself with the VFS.

# See all mounted filesystems (types)
mount | column -t
# Note: proc, sysfs, devpts, tmpfs, cgroup — all virtual FSes!

# List just real/physical mounts
findmnt --real

# See the VFS cache stats
cat /proc/sys/fs/dentry-state   # Dentry cache: nr_dentry, nr_unused, age_limit, want_pages, ...
cat /proc/sys/fs/inode-state    # Inode cache: nr_inodes, nr_free_inodes, ...

# View file descriptor limits
cat /proc/sys/fs/file-max   # System-wide max open files
ulimit -n                    # Per-process soft limit (usually 1024)
ulimit -Hn                   # Hard limit
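
The uniform API is visible from the shell: the same command works on paths backed by completely different filesystems, and only the reported type differs. This sketch assumes GNU coreutils stat (for the -f and -c options); the list of paths is just a typical selection and missing ones are skipped.

```shell
# Ask each path which filesystem backs it, using one and the same call.
for path in / /proc /sys /dev/shm /tmp; do
    [ -e "$path" ] || continue
    printf '%-10s %s\n' "$path" "$(stat -f -c %T "$path")"
done

# Reading works the same way whether the "file" lives on disk or is
# generated by the kernel on demand.
head -c 80 /proc/version 2>/dev/null
echo
```

On a typical Linux system /proc reports proc and /sys reports sysfs, while / shows whatever on-disk or overlay filesystem the root is mounted from.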

OverlayFS — Container Layers

OverlayFS is a union mount filesystem that layers multiple directories as if they were one. It's the technology behind Docker image layers and container filesystems.

OverlayFS: How Docker Container Layers Work

flowchart TD
    subgraph Container["Container View (merged)"]
        M["/app/main.py (from container layer)\n/etc/nginx.conf (from image layer 3)\n/usr/bin/python3 (from image layer 2)\n/lib/libc.so.6 (from image layer 1)"]
    end
    subgraph Layers["Actual Layers on Disk"]
        UL["upperdir (Container writable layer)\n/app/main.py  ← new file written here"]
        L3["lowerdir 3 (Image Layer 3 — read-only)\n/etc/nginx.conf"]
        L2["lowerdir 2 (Image Layer 2 — read-only)\n/usr/bin/python3"]
        L1["lowerdir 1 (Image Layer 1 — read-only)\n/lib/libc.so.6"]
    end
    Container --> Layers
    style UL fill:#3B9797,color:#fff
    style Container fill:#132440,color:#fff
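
The lookup order in the diagram can be mimicked in plain shell without root. The sketch below is a userspace simulation, not a real overlay mount: overlay_lookup is a hypothetical helper that checks the upper layer first and then each lower layer from top to bottom. Real OverlayFS additionally merges directories and honors whiteouts, which are left out here.

```shell
# overlay_lookup REL_PATH LAYER...   (layers listed top first: upper, then lowers)
# Prints the path of the first layer that contains REL_PATH.
overlay_lookup() {
    local rel="$1" layer
    shift
    for layer in "$@"; do
        if [ -e "$layer/$rel" ]; then
            echo "$layer/$rel"
            return 0
        fi
    done
    return 1
}

demo=$(mktemp -d)
mkdir -p "$demo/upper/app" "$demo/layer3/etc" "$demo/layer2/etc"
echo "from upper"   > "$demo/upper/app/main.py"
echo "from layer 3" > "$demo/layer3/etc/nginx.conf"
echo "from layer 2" > "$demo/layer2/etc/nginx.conf"   # shadowed by layer 3

hit=$(overlay_lookup etc/nginx.conf "$demo/upper" "$demo/layer3" "$demo/layer2")
hit_content=$(cat "$hit")
echo "$hit_content"            # from layer 3

rm -rf "$demo"
```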

# See OverlayFS mounts for Docker containers
mount | grep overlay
# overlay on /var/lib/docker/overlay2/.../merged type overlay (...)

# Inspect a container's overlay layers
CONTAINER_ID=$(docker ps -q | head -1)
if [ -n "$CONTAINER_ID" ]; then
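    docker inspect "$CONTAINER_ID" | python3 -c "
import json, sys
data = json.load(sys.stdin)[0]
# GraphDriver.Data holds the overlay paths (e.g. LowerDir, UpperDir, MergedDir, WorkDir)
gs = data.get('GraphDriver', {}).get('Data', {}) or {}
for k, v in gs.items():
    v = str(v)  # values are normally strings; convert defensively before slicing
    print(f'{k}:\n  {v[:100]}...' if len(v) > 100 else f'{k}: {v}')
"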
fi

# Manual OverlayFS mount (academic example; mounting requires root)
mkdir -p /tmp/overlay-demo/{lower,upper,work,merged}
echo "from lower" > /tmp/overlay-demo/lower/file.txt
sudo mount -t overlay overlay \
    -o lowerdir=/tmp/overlay-demo/lower,upperdir=/tmp/overlay-demo/upper,workdir=/tmp/overlay-demo/work \
    /tmp/overlay-demo/merged && \
    ls /tmp/overlay-demo/merged/ && \
    echo "modified" >> /tmp/overlay-demo/merged/file.txt && \
    echo "Lower unchanged:" && cat /tmp/overlay-demo/lower/file.txt && \
    echo "Upper has copy-on-write copy:" && cat /tmp/overlay-demo/upper/file.txt && \
    sudo umount /tmp/overlay-demo/merged

Docker Internals

Why Docker Images Are Fast to Share and Pull

With Docker's overlay2 storage driver, each image is a stack of read-only layers combined by OverlayFS. Layers are identified by their content, so when two images are built on the same base layer (for example, python:3.12-slim and node:20-slim, if both use the same Debian base layer), that layer is downloaded and stored on disk only once. When a container writes to a file, OverlayFS performs copy-on-write: the file is copied from the read-only lower layer to the writable upper layer (the container layer) and modified there, while the original stays untouched. This is why two containers from the same image share read-only layers on disk and in RAM (via the page cache), but have independent writable upper layers.
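
The copy-up step can be imitated with ordinary directories to show why containers stay independent. This is a userspace simulation under stated assumptions, not what the kernel literally does: cow_append is a hypothetical helper, and upper_a and upper_b stand in for the writable layers of two containers that share one lower layer.

```shell
# cow_append LOWER UPPER REL_PATH TEXT
# Copies the file from LOWER into UPPER on first write, then appends to the copy.
cow_append() {
    local lower="$1" upper="$2" rel="$3" text="$4"
    mkdir -p "$(dirname "$upper/$rel")"
    if [ ! -e "$upper/$rel" ] && [ -e "$lower/$rel" ]; then
        cp -p "$lower/$rel" "$upper/$rel"     # copy-up: happens once per file
    fi
    printf '%s\n' "$text" >> "$upper/$rel"    # every write lands in the upper layer
}

demo=$(mktemp -d)
mkdir -p "$demo/lower/etc" "$demo/upper_a" "$demo/upper_b"
echo "base config" > "$demo/lower/etc/app.conf"

cow_append "$demo/lower" "$demo/upper_a" etc/app.conf "container A change"

lower_after=$(cat "$demo/lower/etc/app.conf")
upper_a_lines=$(wc -l < "$demo/upper_a/etc/app.conf")
if [ -e "$demo/upper_b/etc/app.conf" ]; then b_has_copy=yes; else b_has_copy=no; fi

echo "lower layer:  $lower_after"           # base config (unchanged)
echo "upper A has:  $upper_a_lines lines"   # original line plus the appended one
echo "upper B copy: $b_has_copy"            # no: container B never wrote the file

rm -rf "$demo"
```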


LVM — Logical Volume Manager

LVM abstracts physical storage into flexible logical volumes that can be resized, snapshotted, and spanned across multiple disks. The three-tier hierarchy:

# LVM three layers: PV → VG → LV

# Physical Volumes (PV) — raw disks or partitions
sudo pvdisplay         # Show all PVs
sudo pvs               # Brief summary

# Volume Groups (VG) — pool of PV storage
sudo vgdisplay         # Show all VGs
sudo vgs               # Brief: name, #PVs, #LVs, size, free

# Logical Volumes (LV) — slices of VG storage
sudo lvdisplay         # Show all LVs
sudo lvs               # Brief: name, VG, attributes, size

# Common operations:
# Extend an LV and filesystem (no unmount needed for ext4/xfs)
# sudo lvextend -L +10G /dev/vg0/lv-data
# sudo resize2fs /dev/vg0/lv-data      # ext4
# sudo xfs_growfs /mount/point         # xfs

# Take a snapshot (instant, space-efficient backup point)
# sudo lvcreate -L 5G -s -n lv-data-snap /dev/vg0/lv-data
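
LVM hands out space in fixed-size physical extents, so a size such as +10G is rounded to a whole number of extents. The arithmetic below assumes the default extent size of 4 MiB; the real value for a volume group appears as "PE Size" in the vgdisplay output.

```shell
# How many physical extents does a 10 GiB extension consume?
pe_size_mib=4                                  # assumed default PE size
extend_gib=10
extents=$(( extend_gib * 1024 / pe_size_mib ))
echo "+${extend_gib}G = $extents extents"      # +10G = 2560 extents
```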

Mounting & /etc/fstab

# Mount a filesystem
sudo mount /dev/sdb1 /mnt/data                # Basic mount (mounting requires root)
sudo mount -t ext4 /dev/sdb1 /mnt/data        # Explicit type
sudo mount -o ro,noexec /dev/sdb1 /mnt/data   # Read-only, no exec

# Unmount
sudo umount /mnt/data                         # By mountpoint
sudo umount /dev/sdb1                         # By device
sudo umount -l /mnt/data                      # Lazy unmount: detach now, clean up once no longer busy

# View currently mounted filesystems
findmnt                       # Tree view
mount | column -t             # Traditional view
cat /proc/mounts              # Kernel's view

# /etc/fstab — persistent mounts (loaded at boot)
# Format: device  mountpoint  type  options  dump  pass
# Example entries:
cat /etc/fstab
# UUID=xxx...   /            ext4   errors=remount-ro   0 1
# UUID=yyy...   /boot/efi    vfat   umask=0077          0 1
# tmpfs         /tmp         tmpfs  defaults,size=2G    0 0

# Mount all entries in /etc/fstab
# sudo mount -a    (tests /etc/fstab changes without rebooting)
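
The six-field format is easy to pick apart with awk. The sketch below reads fstab-style text on stdin, skips comments and blank lines, and labels each field. The name parse_fstab and the sample UUID are placeholders; treating missing dump and pass fields as 0 follows the defaults described in fstab(5).

```shell
# Parse fstab-format text on stdin and print one labelled line per entry.
parse_fstab() {
    awk '!/^[ \t]*(#|$)/ {
        dump = (NF >= 5) ? $5 : 0
        pass = (NF >= 6) ? $6 : 0
        printf "device=%s mountpoint=%s type=%s options=%s dump=%s pass=%s\n",
               $1, $2, $3, $4, dump, pass
    }'
}

parse_fstab <<'EOF'
# sample entries (placeholder UUID)
UUID=0000-EXAMPLE  /      ext4   errors=remount-ro  0 1
tmpfs              /tmp   tmpfs  defaults,size=2G
EOF
```

To inspect the real file, run parse_fstab < /etc/fstab.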

Exercises

# Exercise 1: Inode inspection
stat /etc/hostname          # View full inode info
ls -i /etc/hostname         # View inode number
df -i /                     # View inode usage on root filesystem

# Exercise 2: Explore an ext4 filesystem
mount | grep "type ext4"    # Find an ext4 mount
# sudo tune2fs -l /dev/sda1 | head -20

# Exercise 3: Understand hard vs symlinks
mkdir -p /tmp/link-demo
echo "test" > /tmp/link-demo/original.txt
ln /tmp/link-demo/original.txt /tmp/link-demo/hardlink.txt
ln -s /tmp/link-demo/original.txt /tmp/link-demo/symlink.txt
ls -li /tmp/link-demo/
# Hard link: same inode number, same size
# Symlink: different inode, small size (just the path string)
rm /tmp/link-demo/original.txt
cat /tmp/link-demo/hardlink.txt  # Works — data still has reference
cat /tmp/link-demo/symlink.txt 2>&1  # Fails — dangling symlink!

# Exercise 4: Check Docker overlay layers (if Docker available)
mount | grep overlay | head -5

Conclusion & Next Steps

Files in Linux are inodes with data blocks — names are just directory entries pointing at inodes. ext4 uses journaling to survive crashes safely. The VFS gives a unified API across all filesystem types. OverlayFS is the foundation of container image layering — understanding it demystifies Docker's copy-on-write and image sharing. LVM adds flexibility for managing physical storage.
