Part 14: Memory Hierarchy & Cache

Introduction

Modern computers use a memory hierarchy to balance speed and capacity. Fast memory is expensive and limited; slower memory is cheap and abundant. Caches bridge this gap.

Graph showing the growing speed gap between CPU processing speed and memory access time over decades — The memory wall: CPU speeds have grown exponentially faster than memory speeds, making caches essential.

                        
                        Series Context: This is Part 14 of 24 in the Computer Architecture & Operating Systems Mastery series. We transition from process management to memory systems.
                    

Computer Architecture & OS Mastery

Your 24-step learning path • Currently on Step 14

1

14

Memory Hierarchy & Cache

L1/L2/L3, cache coherence, NUMA

You Are Here

15

Memory Management Fundamentals

Address spaces, fragmentation, allocation

16

Virtual Memory & Paging

Page tables, TLB, demand paging

17

File Systems & Storage

Inodes, journaling, ext4, NTFS

18

I/O Systems & Device Drivers

Interrupts, DMA, disk scheduling

19

Multiprocessor Systems

SMP, NUMA, cache coherence

20

OS Security & Protection

Privilege levels, ASLR, sandboxing

21

Virtualization & Containers

Hypervisors, namespaces, cgroups

22

Advanced Kernel Internals

Linux subsystems, kernel debugging

23

Case Studies

Linux vs Windows vs macOS

24

Capstone Projects

Shell, thread pool, paging simulator

                        
                        The Memory Wall: CPUs have gotten 10,000× faster over decades, but memory access time has only improved 10×. Caches hide this latency gap by keeping frequently-used data close to the CPU.
                    

Memory Hierarchy Overview

Memory is organized in a pyramid: smaller, faster, more expensive memory at the top; larger, slower, cheaper memory at the bottom.

Memory hierarchy pyramid showing registers, L1/L2/L3 cache, RAM, SSD, and HDD with latency and capacity comparisons — The memory hierarchy pyramid balances speed, capacity, and cost across multiple levels from registers to disk storage.

The Memory Pyramid

Memory Hierarchy (Modern Desktop/Server):
══════════════════════════════════════════════════════════════

              ┌─────────┐
              │Registers│  ~1 KB      ~0.3 ns    $$$$$$
              │  ~32    │
         ┌────┴─────────┴────┐
         │    L1 Cache       │  32-64 KB   ~1 ns      $$$$$
         │  (per core)       │  (data + instruction separate)
    ┌────┴───────────────────┴────┐
    │        L2 Cache             │  256-512 KB  ~4 ns    $$$$
    │       (per core)            │
┌───┴─────────────────────────────┴───┐
│           L3 Cache (shared)         │  8-32 MB    ~12 ns   $$$
│          (all cores share)          │
├─────────────────────────────────────┤
│           Main Memory (RAM)         │  16-128 GB  ~100 ns   $$
│              (DDR4/5)               │
├─────────────────────────────────────┤
│          SSD (NVMe/SATA)            │  512GB-4TB  ~50-100 μs  $
│                                     │
├─────────────────────────────────────┤
│          HDD (Magnetic)             │  1-20 TB    ~5-10 ms    ¢
└─────────────────────────────────────┘


Speed Comparison (cycles at 3 GHz):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Level           Latency      CPU Cycles     Relative
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Register        0.3 ns       1              1×
L1 Cache        1 ns         3-4            3×
L2 Cache        4 ns         12             12×
L3 Cache        12 ns        36             36×
Main Memory     100 ns       300            300×
SSD             50 μs        150,000        150,000×
HDD             5 ms         15,000,000     15,000,000×

# View CPU cache information on Linux
$ lscpu | grep -i cache
L1d cache:           32K    # L1 data cache per core
L1i cache:           32K    # L1 instruction cache per core
L2 cache:            256K   # L2 per core
L3 cache:            8192K  # L3 shared

# Detailed cache info
$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
$ cat /sys/devices/system/cpu/cpu0/cache/index0/type
Data

# Windows PowerShell
Get-WmiObject Win32_CacheMemory | Select Purpose, InstalledSize

Locality of Reference

Caches work because of locality—programs tend to access the same data (or nearby data) repeatedly.

Illustration of temporal and spatial locality showing repeated access patterns in loop iteration over an array — Programs exhibit temporal locality (reusing data) and spatial locality (accessing nearby data), enabling effective caching.

Types of Locality:
══════════════════════════════════════════════════════════════

1. TEMPORAL LOCALITY (Time)
   • Recently accessed data will likely be accessed again soon
   • Example: Loop counter, frequently called functions
   
   for (int i = 0; i < 1000; i++) {  /* i accessed 1000 times */
       sum += array[i];               /* sum accessed 1000 times */
   }

2. SPATIAL LOCALITY (Space)
   • Data near recently accessed data will likely be accessed
   • Example: Array elements, struct fields, sequential code
   
   int array[1000];
   for (int i = 0; i < 1000; i++) {
       sum += array[i];   /* array[0], array[1], array[2]... */
   }
   /* Each array element is adjacent in memory */


Why Caches Exploit Locality:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Temporal: Keep recently accessed data in cache
• Spatial: Load entire cache lines (64 bytes), not single bytes
  → Access array[0] loads array[0..15] into cache
  → array[1..15] are already cached!

                        
                        Rule of Thumb: Programs spend 90% of time in 10% of code. A small, fast cache can satisfy most memory accesses if it holds the "hot" working set.
                    

Cache Basics

A cache stores copies of frequently accessed memory in small, fast SRAM. Key concepts: cache lines, tags, and hit/miss.

Diagram showing cache line structure with valid bit, dirty bit, tag, and data fields, plus memory address decomposition into tag, index, and offset — A cache line contains metadata (valid, dirty, tag) and data, with memory addresses decomposed into tag, index, and offset fields.

Cache Structure

Cache Terminology:
══════════════════════════════════════════════════════════════

CACHE LINE (Block):
• Unit of transfer between cache and memory
• Typically 64 bytes (exploits spatial locality)
• Contains: Tag + Data + Status bits

TAG:
• Identifies which memory address is stored
• Part of memory address stored alongside data

VALID BIT:
• Indicates if cache line contains valid data

DIRTY BIT:
• Indicates if cached data differs from memory
• (For write-back caches)


Cache Line Structure:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌───────┬───────┬──────────────────────────────────────────┐
│ Valid │ Dirty │   Tag   │          Data (64 bytes)       │
│   1   │   1   │  ~20    │                                │
└───────┴───────┴─────────┴────────────────────────────────┘


Memory Address Decomposition:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
For 64-byte cache line, 64 sets, 32-bit address:

│◄──────── 32 bits ─────────►│
┌─────────────┬────────┬──────┐
│    Tag      │ Index  │Offset│
│   20 bits   │ 6 bits │6 bits│
└─────────────┴────────┴──────┘

• Offset (6 bits): Which byte within 64-byte line
• Index (6 bits): Which cache set (64 sets)
• Tag (20 bits): Distinguishes different addresses mapping to same set

Cache Hit and Miss:
══════════════════════════════════════════════════════════════

CACHE HIT:
CPU requests → Check cache → Found! → Return data
Latency: 1-4 cycles (L1), 12 cycles (L2), 36 cycles (L3)

CACHE MISS:
CPU requests → Check cache → Not found → Fetch from memory
                                       → Store in cache
                                       → Return data
Latency: 100+ cycles (memory access)


Types of Cache Misses (The Three C's):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. COMPULSORY (Cold) Miss
   • First access to data - never been in cache
   • Unavoidable (except prefetching)

2. CAPACITY Miss
   • Cache not big enough for working set
   • Data evicted and accessed again
   • Fix: Larger cache

3. CONFLICT Miss
   • Multiple addresses map to same cache location
   • Data evicted prematurely
   • Fix: Higher associativity

Cache Mapping Strategies

How does a memory address map to a cache location? Three strategies with different trade-offs.

Cache Mapping Types

1. DIRECT MAPPED Cache:
══════════════════════════════════════════════════════════════
Each memory address maps to exactly ONE cache location.
Location = (Address) mod (Number of cache lines)

Memory:         Cache:
┌─────┐         ┌─────┐
│  0  │ ────┐   │  0  │
├─────┤     │   ├─────┤
│  1  │ ────┼→  │  1  │
├─────┤     │   ├─────┤
│  2  │ ────┤   │  2  │
├─────┤     │   ├─────┤
│  3  │ ────┘   │  3  │ ← Address 0,4,8... all map here!
├─────┤         └─────┘
│  4  │ ────┐
├─────┤     │   Problem: Addresses 0 and 4 CONFLICT
│  5  │     │   Accessing alternately causes thrashing!
│ ... │     │
└─────┘     └──→

✓ Simple, fast lookup (one comparison)
✗ High conflict miss rate


2. FULLY ASSOCIATIVE Cache:
══════════════════════════════════════════════════════════════
Any memory address can go in ANY cache location.

Memory:         Cache:
┌─────┐         ┌─────┐
│ Any │ ────────│ Any │  Check ALL tags in parallel
│ ... │ ────────│ ... │  
└─────┘         └─────┘

✓ No conflict misses
✗ Expensive - must compare all tags simultaneously
✗ Only practical for small caches (TLB)


3. SET-ASSOCIATIVE Cache (Best of Both):
══════════════════════════════════════════════════════════════
Address maps to a SET of N locations (N-way associative).

4-Way Set-Associative (4 lines per set):
                    ┌─────────────────────┐
Memory addr ──────→ │ Set 0: Way0|Way1|Way2|Way3 │
maps to set         │ Set 1: Way0|Way1|Way2|Way3 │
                    │ Set 2: Way0|Way1|Way2|Way3 │
                    │  ...                       │
                    └─────────────────────┘

Set = (Address) mod (Number of sets)
Then search within set (4 comparisons max)

✓ Balance of speed and flexibility
✓ Most common: 4-way, 8-way, 16-way L1
✗ More complex than direct-mapped

Comparison:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                    Direct    4-Way       Fully
                    Mapped    Assoc.      Assoc.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Comparisons/access     1         4          All
Conflict misses       High      Low        None
Hardware cost         Low       Medium     High
Typical use           Old L1    Modern     TLB, 
                                L1/L2/L3   small caches

Replacement Policies

When a cache set is full, which line do we evict? The replacement policy decides.

Cache Replacement Policies:
══════════════════════════════════════════════════════════════

1. LRU (Least Recently Used)
   • Evict line that hasn't been accessed longest
   • Exploits temporal locality
   • Expensive to implement perfectly (track all accesses)
   • Common: Pseudo-LRU approximations
   
   Access order: A B C D A B E (4-way cache)
   Cache: [E, A, B, D]  ← C was LRU, evicted

2. FIFO (First In, First Out)
   • Evict oldest line (first loaded)
   • Simple to implement
   • Doesn't consider access frequency
   • Can evict frequently-used data

3. Random
   • Pick random line to evict
   • Surprisingly competitive!
   • Avoids pathological worst-cases
   • Simple hardware

4. LFU (Least Frequently Used)
   • Evict line with fewest accesses
   • Requires access counters
   • Problem: Old but formerly-popular data never evicted
   

Modern CPUs: Pseudo-LRU or Tree-PLRU
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
True LRU for 8-way requires tracking order of 8 elements.
Approximations use tree structure with few bits:

        ┌───────┐
        │   0   │  ← Points to "older" half
        └───┬───┘
       ┌────┴────┐
       │    │    │
     ┌─┴─┐ ┌─┴─┐
     │ 0 │ │ 1 │
     └─┬─┘ └─┬─┘
    ┌──┴──┐┌─┴──┐
   W0 W1  W2 W3

Evict by following 0-pointers down tree.

Write Policies

When CPU writes data, when does it update main memory? Two main strategies.

Write-Through vs Write-Back

Write Policies:
══════════════════════════════════════════════════════════════

WRITE-THROUGH:
• Every write goes to BOTH cache AND memory immediately
• Memory always has current data
• Simple, no dirty bits needed

CPU writes X=5:
   ┌─────┐    X=5    ┌───────┐    X=5    ┌────────┐
   │ CPU │ ────────→ │ Cache │ ────────→ │ Memory │
   └─────┘           └───────┘           └────────┘

✓ Memory always consistent
✓ Simple recovery on crash
✗ Every write is slow (memory speed)
✗ High memory bandwidth usage

Solution: Write Buffer
   Writes queue in buffer, drain to memory in background
   CPU doesn't wait for memory


WRITE-BACK:
• Write only to cache (mark line "dirty")
• Write to memory only when line evicted
• Memory may have stale data

CPU writes X=5:
   ┌─────┐    X=5    ┌───────┐           ┌────────┐
   │ CPU │ ────────→ │ Cache │  (old X)  │ Memory │
   └─────┘           │[dirty]│           │ X=old  │
                     └───────┘           └────────┘

Later, when cache line evicted:
                     ┌───────┐    X=5    ┌────────┐
                     │ Cache │ ────────→ │ Memory │
                     └───────┘           └────────┘

✓ Fewer memory writes (coalesce multiple writes)
✓ Lower bandwidth
✗ Complex (track dirty bits)
✗ Crash can lose data

Most modern CPUs: Write-back for performance

Write Miss Policies:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What if write is to address NOT in cache?

Write-Allocate (typical with write-back):
  Load cache line from memory, then write to cache
  Good if writing more data to same line soon

No-Write-Allocate (typical with write-through):
  Write directly to memory, don't load to cache
  Good if data won't be read soon

Cache Coherence

In multicore systems, each core has its own L1/L2 cache. What if two cores cache the same address and one writes to it?

The Cache Coherence Problem:
══════════════════════════════════════════════════════════════

    Core 0           Core 1
   ┌──────┐         ┌──────┐
   │ L1   │         │ L1   │
   │ X=5  │         │ X=5  │  ← Both cached X=5
   └──┬───┘         └──┬───┘
      │                │
      └──────┬─────────┘
             │
         ┌───┴───┐
         │  L3   │
         │ X=5   │
         └───────┘

Core 0 writes X=7:
   ┌──────┐         ┌──────┐
   │ L1   │         │ L1   │
   │ X=7  │         │ X=5  │  ← STALE! Core 1 sees old value!
   └──────┘         └──────┘

This is INCOHERENT - different cores see different values!


MESI Protocol (Most common coherence protocol):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Each cache line has a state:

M (Modified): This cache has only copy, it's dirty
              • Can read/write freely
              • Must write back before others can read

E (Exclusive): This cache has only copy, it's clean
              • Can read freely
              • Can transition to Modified on write

S (Shared): Multiple caches may have copies
              • Can read freely
              • Must broadcast before writing (→ invalidate others)

I (Invalid): Line not valid / not present
              • Must fetch from memory or other cache


State Transitions:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
         Read hit     Write hit     Other reads    Other writes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Modified    M            M         →S (share)      →I (inval)
Exclusive   E           →M         →S              →I
Shared      S           →M*        S               →I
Invalid    →E or S*     →M*        -               -

* Requires bus transaction (snoop)

                        
                        Performance Impact: Cache coherence adds overhead. "False sharing" occurs when different cores write to different variables that happen to be on the same cache line—causing constant invalidations even though there's no true sharing!
                    

/* False Sharing Example */
struct CountersBad {
    int counter0;  /* Core 0 increments */
    int counter1;  /* Core 1 increments */
};  /* Both on SAME cache line! Constant invalidations */

/* Fix: Pad to separate cache lines */
struct CountersGood {
    int counter0;
    char padding[60];  /* 64-byte cache line alignment */
    int counter1;
};  /* Each counter on OWN cache line */

NUMA Architecture

NUMA (Non-Uniform Memory Access) systems have memory physically distributed across multiple nodes. Access time depends on which node owns the memory.

NUMA Topology

NUMA Architecture (2-Socket Server):
══════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────┐
│                        Node 0                               │
│  ┌────────────────────────┐    ┌──────────────────────┐    │
│  │  CPU 0 (8 cores)       │    │  Local Memory        │    │
│  │  ┌────┐ ┌────┐ ┌────┐  │←──→│  64 GB DDR4          │    │
│  │  │Core│ │Core│ │... │  │    │  Access: ~80ns       │    │
│  │  └────┘ └────┘ └────┘  │    └──────────────────────┘    │
│  │  └──── L3 Cache ──────┘│                                │
│  └────────────────────────┘                                │
│            │                                                │
│            │ Interconnect (QPI / UPI)                      │
│            │ ~40-100 GB/s, adds ~40ns latency              │
│            │                                                │
│            ↓                                                │
└────────────┼────────────────────────────────────────────────┘
             │
┌────────────┼────────────────────────────────────────────────┐
│            ↓                        Node 1                  │
│  ┌────────────────────────┐    ┌──────────────────────┐    │
│  │  CPU 1 (8 cores)       │    │  Remote Memory       │    │
│  │  ┌────┐ ┌────┐ ┌────┐  │←──→│  64 GB DDR4          │    │
│  │  │Core│ │Core│ │... │  │    │  Access from Node0:  │    │
│  │  └────┘ └────┘ └────┘  │    │  ~120ns (+40ns hop)  │    │
│  │  └──── L3 Cache ──────┘│    └──────────────────────┘    │
│  └────────────────────────┘                                │
└─────────────────────────────────────────────────────────────┘


NUMA Latency Impact:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Access Type          Latency     Bandwidth
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Local memory         ~80ns       ~100 GB/s
Remote memory        ~120ns      ~60 GB/s (interconnect limited)
                     (+50%)      (-40%)

For memory-intensive workloads, NUMA placement matters!

# Check NUMA topology on Linux
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-7
node 0 size: 65536 MB
node 1 cpus: 8-15  
node 1 size: 65536 MB
node distances:
node   0   1 
  0:  10  21    # Local=10, Remote=21 (2.1× slower)
  1:  21  10

# Run process on specific NUMA node
$ numactl --cpunodebind=0 --membind=0 ./my_program

# View NUMA statistics
$ numastat
                           node0           node1
numa_hit              1234567890      987654321
numa_miss                   1234         56789  # Cross-node accesses (bad)
numa_foreign                5678          1234

                        
                        NUMA Optimization: Allocate memory on the same node as the threads that use it. Linux's "first touch" policy places pages on the node where they're first accessed. Pre-fault pages in the right context!
                    

Conclusion & Next Steps

Memory hierarchy is fundamental to computer performance. We've covered:

Memory Pyramid: Speed vs capacity trade-offs
Locality: Temporal and spatial access patterns
Cache Mapping: Direct, fully associative, set-associative
Replacement: LRU, FIFO, random policies
Write Policies: Write-through vs write-back
Coherence: MESI protocol for multicore
NUMA: Non-uniform memory access in multi-socket systems

                        
                        Key Insight: Writing cache-friendly code means accessing memory sequentially, reusing data while it's cached, and avoiding false sharing in multithreaded programs.
                    

Table of Contents

Introduction

Computer Architecture & OS Mastery

Part 1: Foundations of Computer Systems

Digital Logic & CPU Building Blocks

Instruction Set Architecture (ISA)

Assembly Language & Machine Code

Assemblers, Linkers & Loaders

Compilers & Program Translation

CPU Execution & Pipelining

OS Architecture & Kernel Design

Processes & Program Execution

Threads & Concurrency

CPU Scheduling Algorithms

Synchronization & Coordination

Deadlocks & Prevention