
Part 7: CPU Execution & Pipelining

January 31, 2026 Wasil Zafar 28 min read

Master CPU execution fundamentals—the fetch-decode-execute cycle, pipeline stages, hazard detection and resolution, and branch prediction techniques.

Table of Contents

  1. Introduction
  2. Instruction Cycle
  3. Pipeline Stages
  4. Pipeline Hazards
  5. Branch Prediction
  6. Conclusion & Next Steps

Introduction

CPU pipelining is a fundamental technique that allows processors to execute multiple instructions simultaneously by overlapping their execution stages. Understanding pipelining reveals how modern CPUs achieve high throughput despite instruction-level dependencies.

Series Context: This is Part 7 of 24 in the Computer Architecture & Operating Systems Mastery series. Building on compilers and program translation, we now explore how CPUs execute instructions efficiently through pipelining.

Computer Architecture & OS Mastery

Your 24-step learning path • Currently on Step 7
   1. Foundations of Computer Systems: system overview, architectures, OS role
   2. Digital Logic & CPU Building Blocks: gates, registers, datapath, microarchitecture
   3. Instruction Set Architecture (ISA): RISC vs CISC, instruction formats, addressing
   4. Assembly Language & Machine Code: registers, stack, calling conventions
   5. Assemblers, Linkers & Loaders: object files, ELF, dynamic linking
   6. Compilers & Program Translation: lexing, parsing, code generation
   7. CPU Execution & Pipelining: fetch-decode-execute, hazards, prediction ← You Are Here
   8. OS Architecture & Kernel Design: monolithic, microkernel, system calls
   9. Processes & Program Execution: process lifecycle, PCB, fork/exec
  10. Threads & Concurrency: threading models, pthreads, race conditions
  11. CPU Scheduling Algorithms: FCFS, RR, CFS, real-time scheduling
  12. Synchronization & Coordination: locks, semaphores, classic problems
  13. Deadlocks & Prevention: Coffman conditions, Banker's algorithm
  14. Memory Hierarchy & Cache: L1/L2/L3, cache coherence, NUMA
  15. Memory Management Fundamentals: address spaces, fragmentation, allocation
  16. Virtual Memory & Paging: page tables, TLB, demand paging
  17. File Systems & Storage: inodes, journaling, ext4, NTFS
  18. I/O Systems & Device Drivers: interrupts, DMA, disk scheduling
  19. Multiprocessor Systems: SMP, NUMA, cache coherence
  20. OS Security & Protection: privilege levels, ASLR, sandboxing
  21. Virtualization & Containers: hypervisors, namespaces, cgroups
  22. Advanced Kernel Internals: Linux subsystems, kernel debugging
  23. Case Studies: Linux vs Windows vs macOS
  24. Capstone Projects: shell, thread pool, paging simulator

Data Forwarding (Bypassing)

Without Forwarding (Stall required):
══════════════════════════════════════════════════════════════
I1: ADD R1, R2, R3      # Writes R1
I2: SUB R4, R1, R5      # Reads R1 ← needs I1's result

       1    2    3    4    5    6    7    8
I1:   IF   ID   EX   MEM  WB
I2:        IF   ID   ID   ID   EX   MEM  WB
                ↑         ↑
           Detects    ID repeats (2 bubbles) until R1 is
           hazard     written back in cycle 5 (register file
                      writes in the first half of the cycle,
                      reads in the second)

→ 2 cycle penalty


With Forwarding:
══════════════════════════════════════════════════════════════
       1    2    3    4    5    6
I1:   IF   ID   EX   MEM  WB
                 │
                 └────┐   Forwarding unit routes I1's EX
                      ▼   result straight into I2's EX input
I2:        IF   ID   EX   MEM  WB
                      ↑
           Receives R1 value directly from I1's EX output

→ 0 cycle penalty (data arrives just in time)

Load-Use Hazard: Even with forwarding, a load followed immediately by a use of its result costs at least 1 stall cycle. The data isn't available until the end of the MEM stage, but the next instruction needs it at the start of EX. Hardware interlocks insert the bubble automatically; early MIPS instead exposed it as an architectural "load delay slot" that the compiler (or programmer) had to fill.
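
The stall-vs-forward decision can be sketched as a small check on two adjacent instructions. The tuple encoding and function name below are illustrative, not taken from any real simulator:

```python
# Sketch of forwarding / load-use stall detection in a 5-stage pipeline.
# An instruction is modeled as (dest_reg, src_regs, is_load).

def resolve_hazard(producer, consumer):
    """Return 'forward', 'stall', or 'none' for two adjacent instructions."""
    dest, _, producer_is_load = producer
    _, srcs, _ = consumer
    if dest is None or dest not in srcs:
        return "none"      # no RAW dependence on the producer
    if producer_is_load:
        return "stall"     # load data ready only after MEM: one bubble,
                           # then forward MEM/WB -> EX
    return "forward"       # ALU result forwarded EX/MEM -> EX

# ADD R1, R2, R3  followed by  SUB R4, R1, R5  -> forwardable
add_r1 = ("R1", {"R2", "R3"}, False)
sub_r4 = ("R4", {"R1", "R5"}, False)
print(resolve_hazard(add_r1, sub_r4))   # forward

# LW R1, 0(R2)  followed by  SUB R4, R1, R5  -> load-use stall
lw_r1 = ("R1", {"R2"}, True)
print(resolve_hazard(lw_r1, sub_r4))    # stall
```

Real hazard units compare register fields across the ID/EX, EX/MEM, and MEM/WB pipeline registers; this sketch collapses that to a single pairwise check.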

Branch Prediction

Branches occur every 5-7 instructions in typical code. On a deep pipeline a misprediction can cost 10-20 cycles, so CPI would be terrible without prediction. Modern CPUs achieve >95% branch prediction accuracy.

Static Prediction

Static prediction uses fixed rules without runtime history. Simple but limited accuracy.

Static Branch Prediction Strategies:
══════════════════════════════════════════════════════════════
1. Always Predict NOT TAKEN
   - Assume branch falls through
   - Fetch sequential instruction (PC + 4)
   - Simple, ~40-50% accuracy

2. Always Predict TAKEN
   - Assume branch jumps
   - Problem: Don't know target until decode
   - ~60-70% accuracy (loops taken more often)

3. Backward Taken, Forward Not Taken (BTFNT)
   - Backward branches (target < PC): Predict TAKEN
     → Loops usually taken (except last iteration)
   - Forward branches (target > PC): Predict NOT TAKEN
     → if-statements often fall through
   - ~65-75% accuracy

Loop example:
    0x100:  loop: ...
    0x110:        ...
    0x120:        BNE  R1, R0, loop   ← Backward branch: predict TAKEN
    0x124:        ...

    0x200:        BEQ  R2, R3, skip   ← Forward branch: predict NOT TAKEN
    0x204:        ...
    0x220:  skip: ...
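
The BTFNT rule is simple enough to state as a one-line function. This is a hypothetical sketch; the addresses mirror the example above:

```python
# Static BTFNT (backward taken, forward not taken) prediction, decided
# purely from the branch's own address and its target.

def predict_btfnt(pc, target):
    """Predict True (taken) for backward branches, False for forward ones."""
    return target < pc   # backward branch => likely a loop => predict taken

print(predict_btfnt(0x120, 0x100))  # True  (BNE back to the loop head)
print(predict_btfnt(0x200, 0x220))  # False (BEQ forward over an if-body)
```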

Dynamic Prediction

Dynamic prediction uses runtime history to predict future behavior. Each branch has state tracked in a prediction table.

1-Bit and 2-Bit Predictors

1-Bit Predictor:
══════════════════════════════════════════════════════════════
State: Last outcome (Taken or Not Taken)
Predict: Same as last time

Problem with loops:
for (i=0; i<1000; i++) { ... }

Iteration:  1    2    3   ...  999  1000
Actual:     T    T    T   ...   T    NT
Predicted:  ?    T    T   ...   T    T
Result:     ?    ✓    ✓   ...   ✓    ✗ (mispredicted)

Next loop iteration 1:
Predicted: NT (from last loop's exit)
Actual:    T
Result:    ✗ (mispredicted again!)

→ 2 mispredictions per loop (first and last iteration)


2-Bit Saturating Counter:
══════════════════════════════════════════════════════════════
4 states, requires 2 consecutive mispredictions to change

        ┌───────────┐                 ┌───────────┐
   NT   │ Strongly  │      T          │ Strongly  │   T
   ┌───→│ Not Taken │←────────────────│   Taken   │←───┐
   │    │  (00)     │                 │   (11)    │    │
   │    └─────┬─────┘                 └─────┬─────┘    │
   │          │ T                       NT  │          │
   │          ▼                             ▼          │
   │    ┌───────────┐                 ┌───────────┐    │
   │    │  Weakly   │       T         │  Weakly   │    │
   └────│ Not Taken │────────────────→│   Taken   │────┘
    NT  │  (01)     │                 │   (10)    │ NT
        └───────────┘                 └───────────┘

Prediction: States 10, 11 → Taken; States 00, 01 → Not Taken

Loop with 2-bit predictor:
Iteration:  1    2    3   ...  999  1000  | 1    2
Actual:     T    T    T   ...   T    NT   | T    T
State:      11   11   11  ...  11    10   | 11   11
Predicted:  T    T    T   ...   T    T    | T    T
Result:     ✓    ✓    ✓   ...   ✓    ✗    | ✓    ✓

→ Only 1 misprediction per loop (last iteration only)
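
Both predictors are easy to simulate. This sketch (with hypothetical helper names) replays two passes of the 1000-iteration loop above and counts mispredictions for each scheme:

```python
# Count mispredictions for a 1-bit predictor vs a 2-bit saturating counter
# on a given branch outcome history (True = taken).

def mispredicts_1bit(outcomes, state=True):
    misses = 0
    for taken in outcomes:
        if taken != state:
            misses += 1
        state = taken                        # always flip to the last outcome
    return misses

def mispredicts_2bit(outcomes, counter=3):   # 3 = strongly taken (11)
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:          # states 10/11 predict taken
            misses += 1
        # saturating update: toward 3 on taken, toward 0 on not taken
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

# Two consecutive executions of a 1000-iteration loop
history = ([True] * 999 + [False]) * 2
print(mispredicts_1bit(history))  # 3: last iteration of each pass, plus
                                  #    the first iteration of pass 2
print(mispredicts_2bit(history))  # 2: only the last iteration of each pass
```

The counts match the tables above: the 1-bit predictor pays twice per loop execution in steady state, the 2-bit counter only once.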

Modern Predictors: Real CPUs use far more sophisticated predictors:
  • Correlating predictors: use the history of other branches
  • Tournament predictors: run multiple predictors and pick whichever has been more accurate
  • Perceptron (neural) predictors: simple learned weights, used in AMD Zen
  • TAGE: TAgged GEometric history length, the basis of most state-of-the-art designs

Branch Target Buffer

The Branch Target Buffer (BTB) caches branch target addresses so we don't have to wait for decode to know where to fetch next.

BTB Structure and Operation

Branch Target Buffer (BTB):
══════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────┐
│  PC Tag  │  Target Address  │  Prediction Bits  │  Valid   │
├──────────┼──────────────────┼───────────────────┼──────────┤
│  0x1020  │     0x1200       │       11 (ST)     │    1     │
│  0x2048  │     0x2000       │       10 (WT)     │    1     │
│  0x3100  │     0x3500       │       01 (WNT)    │    1     │
│  ...     │     ...          │       ...         │   ...    │
└─────────────────────────────────────────────────────────────┘

BTB Lookup Process:
                                                              
   Fetch Stage                                                
   ┌────────┐       ┌─────────┐                              
   │   PC   │──────→│   BTB   │                              
   │        │       │ Lookup  │                              
   └────────┘       └────┬────┘                              
                         │                                    
           ┌─────────────┼─────────────┐                     
           │             │             │                      
           ▼             ▼             ▼                      
        No Hit       Hit + NT       Hit + T                   
         │             │             │                        
         ▼             ▼             ▼                        
     Fetch PC+4    Fetch PC+4   Fetch Target                 
                                                              

Timeline with BTB:
══════════════════════════════════════════════════════════════
Without BTB (must wait for decode):
       1    2    3    4    5
BEQ:  IF   ID   EX   MEM  WB
            ↑
      Target known here (cycle 2)
      Then fetch from target → 1+ cycle penalty always

With BTB (lookup in parallel with fetch):
       1    2    3    4    5
BEQ:  IF   ID   EX   MEM  WB
      ↑
      BTB provides target HERE (cycle 1)
      Start fetching target immediately
      → 0 cycle penalty if prediction correct!
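
A minimal sketch of the BTB lookup, assuming a plain dictionary keyed by fetch PC and the 2-bit counters from the table above (purely illustrative, not any real CPU's organization):

```python
# Toy BTB: maps fetch PC -> {target, 2-bit counter}. The next fetch address
# is chosen from the PC alone, before the instruction is even decoded.

btb = {
    0x1020: {"target": 0x1200, "counter": 3},  # 11 = strongly taken
    0x2048: {"target": 0x2000, "counter": 2},  # 10 = weakly taken
    0x3100: {"target": 0x3500, "counter": 1},  # 01 = weakly not taken
}

def next_fetch_pc(pc):
    """Choose the next fetch address using only the BTB (no decode needed)."""
    entry = btb.get(pc)
    if entry is None or entry["counter"] < 2:  # miss, or predicted not taken
        return pc + 4                          # fall through
    return entry["target"]                     # hit + predicted taken

print(hex(next_fetch_pc(0x1020)))  # 0x1200 (hit, strongly taken)
print(hex(next_fetch_pc(0x3100)))  # 0x3104 (hit, but predicted not taken)
print(hex(next_fetch_pc(0x4000)))  # 0x4004 (miss: fetch PC + 4)
```

Real BTBs are set-associative caches with partial tags, and they update the counter and target when the branch actually resolves; this sketch shows only the fetch-side lookup.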

Speculative Execution

Speculative Execution:
══════════════════════════════════════════════════════════════
CPU predicts branch outcome and executes speculatively
→ If correct: full speed, no penalty
→ If wrong: flush speculated instructions, restart

Example: Predicted TAKEN (but actually NOT TAKEN)

       1    2    3    4    5    6    7    8
BEQ:  IF   ID   EX   MEM  WB
                 ↑
          Branch resolved: NOT TAKEN!

target:    IF   ID   ← Speculative (wrong path)
target+4:       IF   ← Speculative (wrong path)
                      
       1    2    3    4    5    6    7    8
BEQ:  IF   ID   EX   MEM  WB
target:    IF   ID   XX   XX  ← FLUSHED (wrong prediction)
target+4:       IF   XX   XX  ← FLUSHED
PC+4:               IF   ID   EX  MEM  WB  ← Correct path

Misprediction Penalty = 2 cycles (for 5-stage)
In deep pipelines (20+ stages): 10-20+ cycle penalty!


Why Misprediction Penalty Matters:
═══════════════════════════════════════════════════════════════
Assuming the penalty is paid on 15% of instructions
(CPI = 1 + 0.15 × penalty):

Pipeline Depth    Misprediction Penalty    Resulting CPI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
5-stage           2 cycles                 1.30
14-stage          7 cycles                 2.05
20-stage          10 cycles                2.50
31-stage          15 cycles                3.25

This is why Intel reduced pipeline depth after Pentium 4!
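
The CPI column follows from a simple model; this sketch reproduces the table's numbers under the stated assumption that 15% of instructions pay the full penalty:

```python
# CPI = base CPI + (fraction of instructions penalized) * (penalty cycles).
# The 0.15 fraction folds together branch frequency and misprediction rate.

def cpi(penalty_cycles, penalized_fraction=0.15, base_cpi=1.0):
    return base_cpi + penalized_fraction * penalty_cycles

for depth, penalty in [(5, 2), (14, 7), (20, 10), (31, 15)]:
    print(f"{depth}-stage: penalty {penalty} -> CPI {cpi(penalty):.2f}")
```

Deeper pipelines raise clock frequency but multiply the cost of every misprediction, which is exactly the trade-off the table quantifies.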

Conclusion & Next Steps

CPU pipelining is one of the most important techniques in computer architecture, enabling processors to execute multiple instructions simultaneously and achieve high throughput. We've explored:

  • Instruction Cycle: Fetch, decode, execute, memory, write-back stages
  • Pipeline Benefits: N-stage pipeline can approach N× throughput improvement
  • Hazards: Data (RAW, WAR, WAW), control (branches), structural (resource conflicts)
  • Hazard Resolution: Stalling, forwarding, branch prediction, compiler scheduling
  • Branch Prediction: Static (BTFNT), dynamic (2-bit saturating counters), BTB

Key Takeaway: Pipelining increases throughput, not latency. Modern CPUs use deep pipelines (14-20 stages) with sophisticated branch predictors (95%+ accuracy) and forwarding networks to minimize stalls.

Next in the Series

In Part 8: OS Architecture & Kernel Design, we'll explore how the operating system kernel manages hardware resources, system calls, interrupts, and privilege levels.