Part 19: Multiprocessor Systems

Introduction

Modern systems feature multiple CPU cores and processors. Operating systems must efficiently schedule work across these resources while maintaining cache coherence and minimizing contention.

                        
                        Series Context: This is Part 19 of 24 in the Computer Architecture & Operating Systems Mastery series. We explore how operating systems scale to multiple processors.
                    

Computer Architecture & OS Mastery

Your 24-step learning path • Currently on Step 19

1

19

Multiprocessor Systems

SMP, NUMA, cache coherence

You Are Here

20

OS Security & Protection

Privilege levels, ASLR, sandboxing

21

Virtualization & Containers

Hypervisors, namespaces, cgroups

22

Advanced Kernel Internals

Linux subsystems, kernel debugging

23

Case Studies

Linux vs Windows vs macOS

24

Capstone Projects

Shell, thread pool, paging simulator

                        
                        The Scaling Challenge: Clock speeds hit a wall around 2005. The solution? Multiple cores. But how does an OS efficiently use 8, 64, or even 1000 CPUs?
                    

Types of Multiprocessor Systems

Multiprocessor systems vary dramatically in how CPUs share memory and communicate.

Classification

Flynn's Taxonomy:
══════════════════════════════════════════════════════════════

SISD - Single Instruction, Single Data
• Traditional uniprocessor
• One CPU, one data stream

SIMD - Single Instruction, Multiple Data
• GPUs, vector processors
• Same operation on many data elements
• Great for graphics, ML

MIMD - Multiple Instruction, Multiple Data
• General-purpose multiprocessors
• Each CPU runs independent program
• This is what we focus on!


Memory Architecture:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. SHARED MEMORY (UMA/NUMA)
   • All CPUs access same physical memory
   • Communication via shared variables
   • Examples: Desktop, server CPUs

2. DISTRIBUTED MEMORY
   • Each node has private memory
   • Communication via message passing
   • Examples: Supercomputers, clusters

Symmetric Multiprocessing (SMP)

SMP (also called UMA - Uniform Memory Access) treats all processors equally with uniform memory access times.

SMP architecture diagram showing multiple CPUs with private L1 caches connected through a shared L3 cache and system bus to unified main memory — Symmetric multiprocessing (SMP) — all CPUs share equal access to memory through a common bus

SMP Architecture:
══════════════════════════════════════════════════════════════

┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐
│ CPU 0 │  │ CPU 1 │  │ CPU 2 │  │ CPU 3 │
│  L1   │  │  L1   │  │  L1   │  │  L1   │
└───┬───┘  └───┬───┘  └───┬───┘  └───┬───┘
    │          │          │          │
    └──────────┼──────────┼──────────┘
               │
        ┌──────┴──────┐
        │   L3 Cache   │  (shared)
        └──────┬──────┘
               │
        ┌──────┴──────┐
        │  System Bus  │
        └──────┬──────┘
               │
        ┌──────┴──────┐
        │    Memory    │
        └─────────────┘

• All CPUs equal - any can run any process
• Uniform access time to all memory
• Single OS instance manages all CPUs
• Shared bus becomes bottleneck beyond ~8 CPUs

# View CPU topology on Linux
$ lscpu
Architecture:          x86_64
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1

# Detailed topology
$ cat /sys/devices/system/cpu/cpu0/topology/core_id
0

NUMA Architecture

NUMA (Non-Uniform Memory Access) scales beyond SMP by giving each CPU group its own local memory.

NUMA Topology

NUMA Architecture (2-node example):
══════════════════════════════════════════════════════════════

        NUMA Node 0                    NUMA Node 1
   ┌────────────────────┐       ┌────────────────────┐
   │ CPU0  CPU1  CPU2 │       │ CPU4  CPU5  CPU6 │
   │       CPU3       │       │       CPU7       │
   └─────────┬──────────┘       └─────────┬──────────┘
              │                         │
   ┌─────────┴──────────┐       ┌─────────┴──────────┐
   │   Local Memory   │       │   Local Memory   │
   │     (64 GB)      │       │     (64 GB)      │
   └─────────┬──────────┘       └─────────┬──────────┘
              │                         │
              └───────────┬───────────┘
                        │
              ┌─────────┴─────────┐
              │   Interconnect   │ (QPI, UPI, etc.)
              └───────────────────┘


Memory Access Latencies:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CPU0 accessing Node 0 memory: ~80 ns  (local)
CPU0 accessing Node 1 memory: ~150 ns (remote, 2x!)

NUMA ratio = remote/local latency
Typical: 1.5x to 2x penalty for remote access

# View NUMA topology
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 65536 MB
node 1 cpus: 4 5 6 7
node 1 size: 65536 MB
node distances:
node   0   1
  0:  10  21   # Local=10, remote=21 (2.1x penalty)
  1:  21  10

# Run program on specific node
$ numactl --cpunodebind=0 --membind=0 ./my_program

Cache Coherence

When multiple CPUs cache the same data, we need cache coherence to ensure consistency.

The Cache Coherence Problem:
══════════════════════════════════════════════════════════════

1. CPU0 reads X=5 into cache
2. CPU1 reads X=5 into cache
3. CPU0 writes X=10 (updates its cache)
4. CPU1 reads X... sees stale value 5!

   CPU0 Cache       CPU1 Cache
   ┌─────────┐       ┌─────────┐
   │ X = 10  │       │ X = 5   │  ← Incoherent!
   └─────────┘       └─────────┘
        │               │
   ┌────┴───────────────┴────┐
   │      Memory: X = 5         │
   └───────────────────────────┘


MESI Protocol (common solution):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Modified:  This cache has only valid copy (dirty)
Exclusive: This cache has only valid copy (clean)
Shared:    Multiple caches have copy (read-only)
Invalid:   Cache line is stale/empty

When CPU0 writes:
1. Invalidate all other caches holding that line
2. Transition to Modified state
3. Other CPUs must fetch from CPU0 or memory

                        
                        False Sharing: Different variables on the same cache line cause unnecessary coherence traffic. Solution: Pad data structures to cache line boundaries (64 bytes typical).
                    

Multiprocessor Scheduling

Scheduling for multiple CPUs introduces new challenges beyond single-CPU scheduling.

Comparison of global queue vs per-CPU runqueue scheduling approaches showing task distribution, lock contention, and cache affinity tradeoffs — Multiprocessor scheduling — global queue vs per-CPU runqueues with load balancing

Scheduling Approaches:
══════════════════════════════════════════════════════════════

1. GLOBAL QUEUE (Single shared queue)

   ┌───────────────────────────┐
   │ Task Queue: [T1][T2][T3]... │
   └───────┬─────┬──────┬───────┘
           │     │      │
         CPU0  CPU1   CPU2

   ✓ Automatic load balancing
   ✗ Queue lock is bottleneck
   ✗ Poor cache locality


2. PER-CPU QUEUES (Linux approach)

   CPU0 Queue    CPU1 Queue    CPU2 Queue
   [T1][T4]      [T2][T5]      [T3][T6]
       │             │             │
     CPU0          CPU1          CPU2

   ✓ No lock contention
   ✓ Better cache affinity
   ✗ Can become unbalanced
   → Need load balancing between queues

Processor Affinity

Processor affinity keeps threads on the same CPU to maintain cache warmth.

# View process affinity
$ taskset -p $$
pid 1234's current affinity mask: ff   # Can run on CPUs 0-7

# Set affinity - run on CPUs 0 and 1 only
$ taskset -c 0,1 ./my_program

# Change running process
$ taskset -p -c 0-3 1234

# In C code:
#include <sched.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);  // Run on CPU 0
sched_setaffinity(0, sizeof(cpuset), &cpuset);

                        
                        Soft vs Hard Affinity: Soft affinity is a preference (scheduler tries to respect). Hard affinity is mandatory (process can ONLY run on specified CPUs). Use hard affinity for latency-critical applications.
                    

Load Balancing

Per-CPU queues need load balancing to prevent idle CPUs while others are overloaded.

Linux Load Balancing:
══════════════════════════════════════════════════════════════

Scheduling Domains (hierarchical):

                  ┌─────────────┐
                  │  System      │ Balance every 64ms
                  └──────┬──────┘
            ┌───────┴───────┐
      ┌─────┴─────┐  ┌─────┴─────┐
      │  NUMA 0   │  │  NUMA 1   │ Balance every 16ms
      └────┬─────┘  └────┬─────┘
       ┌───┴───┐          ┌───┴───┐
     ┌─┴─┐ ┌─┴─┐      ┌─┴─┐ ┌─┴─┐
     │C0│ │C1│      │C2│ │C3│  Core level
     └──┘ └──┘      └──┘ └──┘  Balance every 4ms

Balancing prefers:
1. Same core (shared L2) - cheapest
2. Same NUMA node - moderate cost
3. Different NUMA node - most expensive


Push vs Pull Migration:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Push: Busy CPU pushes tasks to idle CPUs (periodic)
Pull: Idle CPU steals tasks from busy CPUs (immediate)

Conclusion & Next Steps

Multiprocessor systems are now the norm. We've covered:

System Types: UMA vs NUMA, shared vs distributed memory
SMP: Symmetric multiprocessing with uniform access
NUMA: Non-uniform access with local/remote memory
Cache Coherence: MESI protocol and false sharing
Scheduling: Global vs per-CPU queues
Affinity: Keeping threads on same CPU
Load Balancing: Work stealing and push migration

                        
                        Key Insight: Memory locality is everything in multiprocessor systems. Keep data close to the CPU that uses it, and avoid unnecessary data sharing between cores.
                    

Table of Contents

Introduction

Computer Architecture & OS Mastery

Part 1: Foundations of Computer Systems

Digital Logic & CPU Building Blocks

Instruction Set Architecture (ISA)

Assembly Language & Machine Code

Assemblers, Linkers & Loaders

Compilers & Program Translation

CPU Execution & Pipelining

OS Architecture & Kernel Design

Processes & Program Execution

Threads & Concurrency

CPU Scheduling Algorithms

Synchronization & Coordination

Deadlocks & Prevention

Memory Hierarchy & Cache

Memory Management Fundamentals

Virtual Memory & Paging

File Systems & Storage

I/O Systems & Device Drivers