
ARM Assembly Part 21: ARM Microarchitecture Deep Dive

May 28, 2026 • Wasil Zafar • 25 min read

Understanding what happens below the ISA boundary explains every counter-intuitive performance result from Part 18. This part maps the journey from instruction fetch and branch prediction down through register renaming, the reorder buffer, execution ports, store queues, and the cache/TLB hierarchy.

Table of Contents

  1. Pipeline Overview
  2. Out-of-Order Execution
  3. Branch Prediction
  4. Memory Subsystem
  5. Cache & TLB Architecture
  6. Core Comparison
  7. Assembly Implications
  8. Case Study: M1 vs Cortex-X2
  9. Hands-On Exercises
  10. Conclusion & Next Steps

Pipeline Overview

Series Overview: Part 21 of 28. Prerequisites: Part 18 (Performance Profiling), Part 7 (Memory Model), Part 5 (Branching).

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 21

  1. Architecture History & Core Concepts: ARMv1→v9, RISC philosophy, profiles
  2. ARM32 Instruction Set Fundamentals: ARM vs Thumb, registers, CPSR, barrel shifter
  3. AArch64 Registers, Addressing & Data Movement: X/W regs, addressing modes, load/store pairs
  4. Arithmetic, Logic & Bit Manipulation: ADD/SUB, bitfield extract/insert, CLZ
  5. Branching, Loops & Conditional Execution: Branch types, link register, jump tables
  6. Stack, Subroutines & AAPCS: Calling conventions, prologue/epilogue
  7. Memory Model, Caches & Barriers: Weak ordering, DMB/DSB/ISB, TLB
  8. NEON & Advanced SIMD: Vector ops, intrinsics, media processing
  9. SVE & SVE2 Scalable Vector Extensions: Predicate regs, gather/scatter, HPC/ML
  10. Floating-Point & VFP Instructions: IEEE-754, scalar FP, rounding modes
  11. Exception Levels, Interrupts & Vector Tables: EL0–EL3, GIC, fault debugging
  12. MMU, Page Tables & Virtual Memory: Stage-1 translation, permissions, huge pages
  13. TrustZone & ARM Security Extensions: Secure monitor, world switching, TF-A
  14. Cortex-M Assembly & Bare-Metal Embedded: NVIC, SysTick, linker scripts, low-power
  15. Cortex-A System Programming & Boot: EL3→EL1 transitions, MMU setup, PSCI
  16. Apple Silicon & macOS ABI: ARM64e PAC, Mach-O, dyld, perf counters
  17. Inline Assembly, GCC/Clang & C Interop: Constraints, clobbers, compiler interaction
  18. Performance Profiling & Micro-Optimization: Pipeline hazards, PMU, benchmarking
  19. Reverse Engineering & ARM Binary Analysis: ELF, disassembly, CFR, iOS/Android quirks
  20. Building a Bare-Metal OS Kernel: Bootloader, UART, scheduler, context switch
  21. ARM Microarchitecture Deep Dive: OOO pipelines, reorder buffers, branch predict ← You Are Here
  22. Virtualization Extensions: EL2 hypervisor, stage-2 translation, KVM
  23. Debugging & Tooling Ecosystem: GDB, OpenOCD/JTAG, ETM/ITM, QEMU
  24. Linkers, Loaders & Binary Format Internals: ELF deep dive, relocations, PIC, crt0
  25. Cross-Compilation & Build Systems: GCC/Clang toolchains, CMake, firmware gen
  26. ARM in Real Systems: Android, FreeRTOS/Zephyr, U-Boot, TF-A
  27. Security Research & Exploitation: ASLR, PAC attacks, ROP/JOP, kernel exploit
  28. Emerging ARMv9 & Future Directions: MTE, SME, confidential compute, AI accel
Real-World Analogy — A Restaurant Kitchen: A modern ARM core is like a high-volume restaurant kitchen. The fetch/decode stage is the waiter taking orders. Register renaming is the head chef assigning each dish to a specific station and pan (physical register) — even if three tables all ordered "steak" (same architectural register), each gets its own cooking surface to avoid conflicts. The reorder buffer is the pass (the heated counter where finished plates wait): dishes can finish in any order (out-of-order execution), but the expeditor (commit logic) sends them to tables strictly in the order they were ordered, so no guest gets dessert before their appetizer. Branch prediction is the kitchen prepping speculative dishes — "Table 5 always orders tiramisu after the steak" — gambling time to start early. If the prediction is wrong (they order cheesecake), the speculative work is thrown out (pipeline flush), wasting cycles. The store buffer is the pickup window: finished dishes (stores) queue here until the waiter (memory system) is ready, and a subsequent order for "that same dish" (load from same address) can be fulfilled directly from the window without going back to the fridge (cache).
OOO Pipeline Stages: Fetch → Decode → Rename → Dispatch → Issue → Execute → Writeback → Commit (Retire). Each stage can hide work from the ISA programmer but every stage has capacity limits that show up as performance bottlenecks.

Out-of-Order Execution

Register Renaming

The architectural register file (X0–X30) is far too small to hold the inflight work of a wide OOO core. A physical register file (PRF) of 128–256 entries is allocated dynamically. The rename stage maps each destination register write to a fresh PRF entry, eliminating WAR (Write-After-Read) and WAW (Write-After-Write) hazards while preserving true RAW (Read-After-Write) data dependences.

// Visible code — appears serial:
MUL  X1, X2, X3    // WAW: both write X1 ... but after renaming →
ADD  X1, X4, X5    //   MUL  → PRF[p47], ADD → PRF[p48]  (no conflict)

// RAW hazard (cannot rename away):
SDIV X0, X6, X7    // Long latency (≥12 cycles on A78)
ADD  X8, X0, X1    // Must wait for SDIV result in PRF[p47]
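
The rename step can be sketched as a mapping table plus a free list. A minimal Python model (table size and register names purely illustrative) shows why the WAW pair above is harmless while the RAW pair is not:

```python
# Minimal rename sketch: every destination write allocates a fresh
# physical register, so WAW/WAR hazards vanish; sources read the current
# mapping, so true RAW dependences are preserved.

class Renamer:
    def __init__(self, num_phys=128):
        self.free = list(range(num_phys))   # free physical registers
        self.table = {}                     # architectural -> physical

    def read(self, arch):
        return self.table[arch]             # source operand lookup (RAW)

    def write(self, arch):
        phys = self.free.pop(0)             # fresh physical register
        self.table[arch] = phys
        return phys

r = Renamer()
for src in ("X2", "X3", "X4", "X5"):
    r.write(src)                            # give the sources mappings

mul_dst = r.write("X1")                     # MUL X1, X2, X3
add_dst = r.write("X1")                     # ADD X1, X4, X5 -> new phys reg
```

A later reader of X1 sees the ADD's physical register, preserving in-order semantics even though both writes can be in flight concurrently.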

Reorder Buffer (ROB)

Instructions are inserted into the ROB in program order at dispatch. They can execute out of order — the ROB records which have completed — but they commit (become visible state) only in program order from the head of the ROB. This is what makes precise exceptions possible: any instruction that faulted can be identified because nothing past it has committed.

ROB Sizes (approximate):
Cortex-A55: fully in-order (2-wide), so no ROB at all. Cortex-A78: ~160 entry ROB. Neoverse N2: ~400 entry ROB. Apple Firestorm: ~630 entry ROB — the largest publicly measured (as of 2023), enabling deep memory-level parallelism.
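
The in-order-commit property is easy to see in a toy model (the size and instruction tags here are made up for illustration):

```python
from collections import deque

# Toy reorder buffer: entries are appended in program order; execution
# may mark any entry complete, but retirement only pops from the head,
# so committed state is always a program-order prefix.

class ROB:
    def __init__(self, size):
        self.size = size
        self.entries = deque()            # [tag, completed?] in program order

    def dispatch(self, tag):
        if len(self.entries) >= self.size:
            return False                  # ROB full: the front end stalls
        self.entries.append([tag, False])
        return True

    def complete(self, tag):
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True           # executed out of order, not retired

    def commit(self):
        retired = []
        while self.entries and self.entries[0][1]:
            retired.append(self.entries.popleft()[0])
        return retired

rob = ROB(size=4)
for tag in ["mul", "add", "ldr", "str"]:
    rob.dispatch(tag)

rob.complete("ldr")                       # the load finishes first...
early = rob.commit()                      # ...but cannot retire past "mul"
rob.complete("mul")
rob.complete("add")
retired = rob.commit()                    # retires mul, add, then ldr in order
```

If "mul" had faulted, nothing younger would have committed, which is exactly what makes the exception precise.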

Reservation Stations & Execution Ports

// Cortex-A78 issue queues (approximate per ARM TRM):
// IQ0 (integer ALU, shift, CSEL): 4 entries × 2 pipes → 2 ALU/cycle
// IQ1 (multiply, divide):          2 entries × 1 pipe → 1 MUL/cycle
// IQ2 (load):                      12 entries × 2 pipes → 2 loads/cycle
// IQ3 (store):                     8 entries × 1 pipe → 1 store/cycle
// IQ4 (NEON/FP):                   8 entries × 2 pipes → 2 NEON/cycle

// Throughput-limiting example: a partially dependent multiply chain
MUL  X4, X0, X1    // Independent → issues on the single MUL pipe
MUL  X5, X2, X3    // Independent, but same pipe → issues the next cycle
MUL  X6, X4, X5    // Depends on X4 and X5 → must wait → serialised
MUL  X7, X6, X3    // Depends on X6 → more serialisation

// Solution: expose more independent chains (see Part 18, multiple accumulators)

Branch Prediction

Bimodal & Two-Level Predictors

A bimodal predictor is a table of 2-bit saturating counters indexed by PC bits. Two-level (correlated) predictors index the counter table with a combination of the PC and the Global History Register (GHR), a shift register of the last N branch outcomes. The GHR captures correlation patterns like "this branch is always taken if the last three were also taken."

// Branch that aliases in a bimodal predictor (same PC bits, different ctx):
// Two calls to same function with different condition chains → predictor thrashes
// Mitigation: restructure the code so the branches index different table
// entries, or rely on a correlated/tagged predictor to separate contexts
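
To make the two predictor classes concrete, here is a small simulation: a 2-bit bimodal table and a gshare-style two-level predictor that XORs the GHR into the index. Table sizes, the hash, and the test pattern are all illustrative, not any specific core's design:

```python
# Bimodal: one 2-bit saturating counter per (hashed) PC.
# GShare: same counters, but indexed by PC XOR global history, so the
# prediction can depend on the *pattern* of recent outcomes.

class Bimodal:
    def __init__(self, bits=10):
        self.ctr = [1] * (1 << bits)        # 2-bit counters, weakly not-taken
        self.mask = (1 << bits) - 1

    def index(self, pc):
        return pc & self.mask

    def predict(self, pc):
        return self.ctr[self.index(pc)] >= 2    # 2 or 3 -> predict taken

    def update(self, pc, taken):
        i = self.index(pc)
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)

class GShare(Bimodal):
    def __init__(self, bits=10):
        super().__init__(bits)
        self.ghr = 0                        # global history register

    def index(self, pc):
        return (pc ^ self.ghr) & self.mask  # correlate on recent outcomes

    def update(self, pc, taken):
        super().update(pc, taken)           # update counter at hashed index
        self.ghr = ((self.ghr << 1) | taken) & self.mask

def accuracy(pred, pc, outcomes):
    hits = 0
    for taken in outcomes:
        hits += pred.predict(pc) == taken
        pred.update(pc, taken)
    return hits / len(outcomes)

# A repeating taken-taken-taken-not pattern: bimodal plateaus near 75%,
# gshare learns the period through the history register.
pattern = [1, 1, 1, 0] * 500
bimodal_acc = accuracy(Bimodal(), 0x1F00, pattern)
gshare_acc = accuracy(GShare(), 0x1F00, pattern)
```

On this pattern the bimodal counter mispredicts every fourth branch forever, while gshare converges to near-perfect prediction after a short warmup.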

TAGE Predictor

Tagged GEometric history length (TAGE) uses a base bimodal table plus K tagged tables, where table i is indexed by hash(PC, GHR) over a geometrically growing history length L(i) ≈ L(1)·α^(i−1). The longest history whose tag matches provides the prediction; per-entry "useful" counters decide which entries may be replaced after a mispredict. Cortex-A78 and Neoverse N2 implement TAGE-style predictors achieving >98% accuracy on SPECint workloads.

Branch Target Buffer (BTB) & Return Address Stack (RAS)

// BTB Miss: indirect branch (BR Xn) where target changes frequently
// Triggering BTB miss costs ~20 cycles pipeline flush on A78
// Perf event: 0x010 = BR_MIS_PRED
mrs x0, pmcr_el0
// ... configure PMU event 0x010 as in Part 18 ...

// Indirect call via function pointer (high BTB miss rate if target varies):
BLR X9        // BTB must predict X9 content — if wrong: flush + refetch

// Static indirect branches (e.g. switch jump table): BTB usually correct after warmup
// Dynamic virtual dispatch: 1 call site → many targets → BTB capacity exhausted

// RAS (Return Address Stack, ~16–32 deep):
BL  func      // Pushes PC+4 onto RAS
...
func:
    ...
    RET       // Pops RAS → correct most of the time (mismatch on longjmp/setjmp)
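
The RAS failure mode (call depth exceeding capacity) is easy to model. The depth of 16 below mirrors the typical range quoted above, and the addresses are illustrative:

```python
# Toy Return Address Stack: BL pushes the return address, RET pops the
# predicted target. On overflow the oldest entries are silently lost,
# so the deepest returns of a deep call chain mispredict.

class RAS:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: drop the oldest entry
        self.stack.append(return_addr)

    def ret(self):
        return self.stack.pop() if self.stack else None

ras = RAS(depth=16)
true_stack = []
for pc in range(0, 20 * 4, 4):             # 20 nested 4-byte BLs
    ras.call(pc + 4)                       # hardware pushes PC+4
    true_stack.append(pc + 4)

mispredicts = 0
while true_stack:
    if ras.ret() != true_stack.pop():      # compare prediction vs reality
        mispredicts += 1
```

With 20 nested calls into a 16-deep RAS, the innermost 16 returns predict correctly and the outermost 4 miss, matching the "mostly correct" behaviour described above.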

Memory Subsystem

Store-to-Load Forwarding

A store writes to the store buffer (not directly to cache). A subsequent load to the same address is fulfilled from the store buffer — this is store-to-load forwarding. It avoids the cache latency when the data is still "in flight." However, partial overlaps (e.g., write 8 bytes, read 4 bytes misaligned) cause a forwarding bubble of ~4–8 extra cycles.

// Perfect forwarding: same address, same size → data from store buffer
STR  X0, [X1]      // Write X0 to [X1]
LDR  X2, [X1]      // Forwarded from store buffer: ~4–5 cycles, no wait for the store to drain to L1D

// Forwarding failure: partial overlap
STR  W0, [X1]      // Write 4 bytes to [X1]
LDR  X2, [X1]      // Read 8 bytes from [X1] — only 4 overlap → forwarding stall

// Common kernel bug: write byte, read word (struct alignment pitfall)
STRB W0, [X1]      // 1-byte store
LDR  W2,  [X1]     // 4-byte load → partial overlap → ~10 cycle penalty on A78
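
The three cases above follow a simple containment rule: forwarding works only when the load's bytes are fully covered by the store's bytes. A sketch, with illustrative A78-like latencies:

```python
# Classify a store->load interaction in the store buffer:
#   "forward" : load fully contained in the store -> fast path
#   "stall"   : partial overlap -> forwarding fails, replay penalty
#   "cache"   : disjoint -> ordinary L1D access

def forward_case(st_addr, st_size, ld_addr, ld_size):
    st_end, ld_end = st_addr + st_size, ld_addr + ld_size
    if ld_end <= st_addr or st_end <= ld_addr:
        return "cache"
    if st_addr <= ld_addr and ld_end <= st_end:
        return "forward"
    return "stall"

LATENCY = {"forward": 4, "cache": 4, "stall": 10}   # rough cycle counts
```

Running the three examples through it: `STR X0 / LDR X2` forwards, `STR W0 / LDR X2` stalls (8-byte load over a 4-byte store), and a load from a different line goes to cache.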

Memory Dependence Speculation

ARM OOO cores speculate that loads do not alias older stores whose addresses are not yet known. If a load issues early and an older store later resolves to the same address, the core detects the ordering violation, flushes the load (and everything dispatched after it), and re-executes. PMU events such as 0x66 = MEM_ACCESS_LD, combined with perf's top-down memory_bound metric, help reveal when this is expensive.

Cache & TLB Architecture

// Query cache parameters at runtime (ARM CTR_EL0):
// CTR_EL0[19:16] = DminLine: L1 data cache line size (log2 words)
// CTR_EL0[3:0]   = IminLine: L1 instruction cache line size (log2 words)
mrs  x0, ctr_el0
ubfx x1, x0, #16, #4   // Extract DminLine (DCache line size field)
mov  x2, #4
lsl  x3, x2, x1         // Cache line = 4 << field (bytes)
// On A78: CTR_EL0 = 0x84448004 → L1D line = 64 bytes, L1I line = 64 bytes
Cache Parameters (Cortex-A78, typical SoC):
L1I: 64 KB, 4-way set-assoc, 64-byte lines, 4-cycle hit
L1D: 32 KB, 4-way set-assoc, 64-byte lines, 4-cycle hit
L2 (per core): 256 KB–512 KB, 8-way, 12-cycle hit
L3 (shared): 4–8 MB, 16-way, 30–40 cycle hit
DRAM: 200–300 cycles (DDR5 at 5600 MT/s)
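
The same decode as the assembly above can be written in Python, using the sample CTR_EL0 value quoted for the A78 (field layout per the architecture: IminLine in bits [3:0], DminLine in bits [19:16], both log2 of 4-byte words):

```python
# Decode L1 cache line sizes from a raw CTR_EL0 value.

def cache_line_sizes(ctr_el0):
    iminline = ctr_el0 & 0xF            # L1I line, log2(words)
    dminline = (ctr_el0 >> 16) & 0xF    # L1D line, log2(words)
    return 4 << iminline, 4 << dminline # both in bytes (word = 4 bytes)

i_line, d_line = cache_line_sizes(0x84448004)   # sample A78 value
```

Both fields decode to 64-byte lines, matching the parameter table above.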

Prefetchers

// Hardware prefetcher types (all implicit, cannot be disabled via ISA):
// 1. Next-line: always fetch the next cache line after a miss
// 2. Stride: detect constant-stride access patterns
// 3. Stream: detect sequential streams, issue ahead-of-time prefetch

// Software prefetch (explicit, from Part 18):
PRFM PLDL1KEEP, [X0, #256]   // Prefetch 4 cache lines ahead into L1D
PRFM PLDL2KEEP, [X0, #512]   // Prefetch into L2 (for long loops)

// PRFM hint types:
// PLDL1KEEP  = load, L1, keep (non-evict hint)
// PSTL1KEEP  = store, L1, keep — prefetch for write (allocate clean line)
// PLDL3STRM  = load, L3, stream (evict-soon hint — scratchpad pattern)
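
How far ahead should a PRFM point? A rule of thumb: far enough that the line arrives before the loop reaches it, i.e. roughly memory latency divided by loop cost. A sketch with illustrative numbers (every parameter here is a placeholder to be measured as in Part 18):

```python
import math

# Back-of-envelope prefetch distance: issue the PRFM enough iterations
# ahead that a DRAM-latency fetch completes in time, rounded up to a
# whole cache line.

def prefetch_offset(mem_latency_cycles, cycles_per_iter, bytes_per_iter,
                    line_bytes=64):
    iters_ahead = math.ceil(mem_latency_cycles / cycles_per_iter)
    offset = iters_ahead * bytes_per_iter
    return (offset + line_bytes - 1) // line_bytes * line_bytes

# e.g. ~200-cycle DRAM, 4-cycle loop body consuming 16 bytes (one LDP)
off = prefetch_offset(mem_latency_cycles=200, cycles_per_iter=4,
                      bytes_per_iter=16)
```

For these numbers the sketch suggests prefetching several hundred bytes ahead, which is why offsets like #256 and #512 in the examples above are in the right ballpark.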

TLB Structure & Shootdown

// TLB Miss: triggers full page table walk (50–200 cycles!)
// TLBI (TLB Invalidate) instructions:
TLBI  VMALLE1IS      // Invalidate all EL1 translations (inner-shareable)
TLBI  VAE1IS, X0     // Invalidate EL1 by virtual address: X0[43:0] = VA >> 12, X0[63:48] = ASID
TLBI  ASIDE1IS, X0   // Invalidate all entries with ASID in X0[63:48]
DSB   ISH            // Ensure TLB invalidation visible to all observers
ISB                  // Flush pipeline after TLB change

// TLB shootdown on SMP: each core running threads of the same process
// must invalidate TLBs when a mapping changes. IS (Inner-Shareable) suffix
// broadcasts the TLBI to all cores in the same inner-shareable domain.

Core Comparison

Cortex-A55 vs Cortex-A78 vs Neoverse N2 (summary):

Cortex-A55 (efficiency core used in DynamIQ clusters): fully in-order; 2-wide decode with restricted dual-issue; no reorder buffer; ~1.9 IPC best case; sub-1W active power. Designed for background tasks and always-on workloads.

Cortex-A78 (performance core, 2020–2023 smartphones): 4-wide decode; 160-entry ROB; 6 execution ports; 2 load + 1 store / cycle; TAGE branch predictor; 3.6 IPC peak; ~2.5W TDP in 5 nm silicon.

Neoverse N2 (server, 2022+): 4-wide decode; 400-entry ROB; 8 execution ports; 3 load + 2 store / cycle (SVE adds vector load/store ports); CHI interconnect for cache coherence at rack scale; ~4.0 IPC peak; 10W+ TDP.

Assembly Implications

// ── Rule 1: Break dependence chains for ILP ──
// Bad: chain of 4 multiplies → 1 result/4 cycles
MUL  X0, X1, X2
MUL  X0, X0, X3
MUL  X0, X0, X4
MUL  X0, X0, X5

// Good: 4 independent multiplies → 4 results/1 cycle (wide issue)
MUL  X0, X1, X2
MUL  X6, X3, X4
MUL  X7, X5, X8
MUL  X9, X10, X11
// Then merge: MUL X12, X0, X6 / MUL X13, X7, X9 / MUL X0, X12, X13
// ── Rule 2: Avoid indirect branch thrash → prefer direct branches ──
// VTable dispatch: BLR Xn — poor BTB predictions when targets vary
// Alternative: inline or devirtualise in hot paths
// ── Rule 3: Align hot loop head to I-Cache line boundary ──
.balign 64           // 64-byte = L1I cache line on A78
.loop_head:
    LDP  X0, X1, [X2], #16
    LDP  X3, X4, [X2], #16
    // ... loop body ...
    B.NE .loop_head
// ── Rule 4: Separate store and load to same address ──
STR  X0, [X1]
// Insert ~4 independent instructions here to allow forwarding pipeline
ADD  X5, X6, X7    // Fill cycle 1
ADD  X8, X9, X10   // Fill cycle 2
ADD  X11, X12, X13 // Fill cycle 3 (A78 forwarding latency ≈ 4)
LDR  X2, [X1]      // Now forwarding hits clean without stall
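
Rule 1's payoff can be quantified with a back-of-envelope bound: execution time is at least max(chain depth × latency, op count ÷ pipe throughput). A sketch with illustrative numbers (3-cycle MUL latency, one MUL pipe, neither of which is a spec value):

```python
# Lower bound on cycles for a block of operations:
#   latency bound    -- the longest dependence chain, serialised
#   throughput bound -- total ops divided by issue rate

def lower_bound_cycles(num_ops, chain_depth, latency, ops_per_cycle):
    latency_bound = chain_depth * latency
    throughput_bound = num_ops / ops_per_cycle
    return max(latency_bound, throughput_bound)

# The "bad" version: 4 multiplies in one serial chain (depth 4).
serial = lower_bound_cycles(num_ops=4, chain_depth=4, latency=3, ops_per_cycle=1)

# The "good" version: 4 independent multiplies plus 3 merge multiplies,
# arranged as a tree of depth 3 (7 ops total).
tree = lower_bound_cycles(num_ops=7, chain_depth=3, latency=3, ops_per_cycle=1)
```

Even though the tree does more multiplies, its shorter critical path wins; that is the trade Rule 1 is making.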

Case Study: Apple M1 vs Cortex-X2 — Two Takes on Wide OOO

How Different Design Philosophies Yield Different Results

When Apple's M1 (Firestorm cores) launched in 2020, it shattered ARM performance expectations. Its microarchitecture makes a fascinating comparison to ARM's own Cortex-X2 (2022):

Parameter             Apple Firestorm (M1)          Cortex-X2
Decode Width          8-wide                        5-wide
ROB Size              ~630 entries                  ~288 entries
Integer ALU Ports     6                             4
Load/Store Ports      3 load + 2 store              2 load + 2 store
L1D Cache             128 KB, 8-way                 64 KB, 4-way
L2 Cache              12 MB shared (perf cluster)   512 KB–1 MB (per core)
Power Target          ~10W (perf cluster)           ~3W (single core)

Key insight: Apple's advantage comes from being both the chip designer and the only customer — they can afford a 630-entry ROB and 128 KB L1D because they control the thermal design of the MacBook chassis. ARM's Cortex-X2 must work across dozens of Android phones with different thermal budgets, forcing a more conservative design. The lesson: microarchitecture trade-offs are inseparable from the system they ship in.

Performance impact: On SPEC CPU 2017, Firestorm's wider issue and larger ROB deliver ~15–20% higher single-thread IPC than X2, but X2's smaller area allows more cores per cluster. In server workloads (Neoverse N2), throughput per watt matters more than single-thread speed, so ARM chose a 4-wide balanced design instead.

From StrongARM to Neoverse: 30 Years of ARM Microarchitecture

ARM microarchitecture has evolved through distinct generations:

  • 1996 — StrongARM (DEC, then Intel): The first high-performance ARM core. 5-stage in-order pipeline, 233 MHz, 1W. Proved ARM could compete on performance, not just power.
  • 2005 — Cortex-A8: ARM's first superscalar core. 2-wide in-order, dual-issue integer. Powered the iPhone 3GS (2009) and several Kindle models.
  • 2011 — Cortex-A15: ARM's first fully out-of-order core. 3-wide decode, ~100-entry ROB. The architecture that made ARM credible for servers (Calxeda, Applied Micro).
  • 2018 — Cortex-A76 ("Enyo"): 4-wide OOO, 128-entry ROB, micro-op cache. Closed the gap with Intel mobile Core i5. Basis for Neoverse N1 (AWS Graviton2).
  • 2022 — Neoverse V2: 5-wide decode, SVE2, 400+ entry ROB, CHI mesh interconnect. Powers AWS Graviton4 and NVIDIA Grace — ARM's entry into datacenter dominance.

Each generation roughly doubled ROB size and added 1–2 execution ports, a cadence limited by the superlinear growth of power and area with OOO window size.

Hands-On Exercises

Exercise 1 (Beginner)
Measure ROB Depth via Dependency Chain

Empirically estimate your CPU's ROB size by measuring when independent instruction throughput drops:

  1. Write a loop containing N independent ADD Xn, Xn, #1 instructions (each writing a different register), followed by a single long-latency SDIV whose result is consumed by the next iteration's first ADD
  2. Increase N from 10 to 300 in steps of 10. Time 10 million iterations of each
  3. Plot cycles/iteration vs N. While N < ROB depth, the independent ADDs fill the ROB and overlap with the SDIV. Once N exceeds the ROB, dispatching stalls — the curve flattens
  4. The "knee" of the curve approximates the ROB size

Expected: On Cortex-A78, the knee should appear around N=150–160. On Apple M-series, around N=600+.
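
For step 4, the knee can be located programmatically. This hypothetical helper (`find_knee`, with synthetic data modelling a 160-entry ROB rather than real measurements) returns the last N before the slope of the curve rises:

```python
# Find the "knee": the last N before cycles/iteration starts growing,
# i.e. before dispatch begins stalling on a full ROB.

def find_knee(ns, cycles, threshold=0.5):
    for i in range(1, len(ns)):
        slope = (cycles[i] - cycles[i - 1]) / (ns[i] - ns[i - 1])
        if slope > threshold:
            return ns[i - 1]
    return None

# Synthetic measurements: flat while the ROB hides the SDIV latency,
# then roughly one extra cycle per ADD once the window is full.
ns = list(range(10, 310, 10))
cycles = [40.0 if n <= 160 else 40.0 + (n - 160) for n in ns]
knee = find_knee(ns, cycles)
```

On real data the transition is softer, so it is worth eyeballing the plot as well as trusting the threshold.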

Exercise 2 (Intermediate)
Branch Predictor Stress Test

Design a benchmark that defeats TAGE prediction:

  1. Create an array of 256 random 0/1 values (unseeded random — different each run)
  2. Loop through the array; for each element, execute CBZ/CBNZ to branch into one of two code paths
  3. Measure total cycles and branch mispredictions using PMU events 0x10 (BR_MIS_PRED) and 0x12 (BR_PRED)
  4. Compare against a sorted version of the same array (all 0s then all 1s) — mispredictions should drop to near-zero

Analysis: Calculate misprediction rate for random vs sorted. Typical results: ~45–50% miss rate on random (essentially a coin flip), <1% on sorted.
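
The experiment can be previewed in simulation before touching the PMU. This sketch uses a single 2-bit counter (the data-dependent branch maps to one predictor entry) and a seeded RNG for reproducibility; real hardware rates will differ:

```python
import random

# Miss rate of a single 2-bit saturating counter over a branch-outcome
# stream: predict taken when the counter is 2 or 3, then train.

def miss_rate(outcomes):
    ctr, misses = 1, 0
    for taken in outcomes:
        if (ctr >= 2) != taken:
            misses += 1
        ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
    return misses / len(outcomes)

rng = random.Random(42)                          # seeded for reproducibility
data = [rng.randint(0, 1) for _ in range(256)] * 100
rand_rate = miss_rate(data)                      # ~50%: unlearnable
sorted_rate = miss_rate(sorted(data))            # near zero: one transition
```

Sorting the array turns thousands of mispredictions into a couple at the 0→1 transition, which is the whole point of step 4.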

Exercise 3 (Advanced)
Store-to-Load Forwarding Latency Measurement

Measure the difference between forwarded and non-forwarded loads:

  1. Write a tight loop: STR X0, [X1] / LDR X2, [X1] (same address, same size — perfect forwarding). Time 100M iterations using cycle counter
  2. Modify to misaligned partial overlap: STR X0, [X1] / LDR W2, [X1, #2] (4-byte load at 2-byte offset into 8-byte store). Time again
  3. Modify to complete miss: STR X0, [X1] / LDR X2, [X3] where X3 points to a different cache line
  4. Calculate: forwarding latency, partial-overlap penalty, and L1D hit latency (from the miss case)

Expected on A78: ~4 cycles (forwarded), ~10-12 cycles (partial overlap), ~4 cycles (L1D hit, no forwarding — same if data is warm in cache).

Conclusion & Next Steps

The Cortex-A78 represents the intersection of every concept in the series: ISA constraints shape what rename can do, weak memory ordering emerges from the store buffer design, and cache line boundaries determine when PRFM helps. Every assembly optimization from Part 18 maps to a physical circuit now visible in this part. The Apple M1 vs Cortex-X2 comparison shows how system-level constraints drive wildly different microarchitectural choices from the same ISA, and the exercises let you probe these mechanisms empirically by measuring ROB depth, branch predictor accuracy, and forwarding latency on real hardware.

Next in the Series

In Part 22: Virtualization Extensions, we enter EL2, write trap handlers, configure stage-2 page tables for guest memory isolation, wire virtual GIC list registers, and understand how KVM on ARM implements hardware-accelerated virtual machines.
