ARM Assembly Part 12: MMU, Page Tables & Virtual Memory

Introduction & Translation Overview

                        
                        Series Overview: This is Part 12 of our 28-part ARM Assembly Mastery Series. In Part 11 we covered the exception model. Now we configure what every AArch64 OS must set up before running user space: the Memory Management Unit. Without the MMU there is no process isolation, no kernel/user separation, and no demand paging.
                    

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 12

MMU, Page Tables & Virtual Memory

Stage-1 translation, permissions, huge pages

You Are Here

TrustZone & ARM Security Extensions

Secure monitor, world switching, TF-A

Cortex-M Assembly & Bare-Metal Embedded

NVIC, SysTick, linker scripts, low-power

Cortex-A System Programming & Boot

EL3→EL1 transitions, MMU setup, PSCI

Apple Silicon & macOS ABI

ARM64e PAC, Mach-O, dyld, perf counters

Inline Assembly, GCC/Clang & C Interop

Constraints, clobbers, compiler interaction

Performance Profiling & Micro-Optimization

Pipeline hazards, PMU, benchmarking

Reverse Engineering & ARM Binary Analysis

ELF, disassembly, CFR, iOS/Android quirks

Building a Bare-Metal OS Kernel

Bootloader, UART, scheduler, context switch

ARM Microarchitecture Deep Dive

OOO pipelines, reorder buffers, branch predict

Virtualization Extensions

EL2 hypervisor, stage-2 translation, KVM

Debugging & Tooling Ecosystem

GDB, OpenOCD/JTAG, ETM/ITM, QEMU

Linkers, Loaders & Binary Format Internals

ELF deep dive, relocations, PIC, crt0

Cross-Compilation & Build Systems

GCC/Clang toolchains, CMake, firmware gen

ARM in Real Systems

Android, FreeRTOS/Zephyr, U-Boot, TF-A

Security Research & Exploitation

ASLR, PAC attacks, ROP/JOP, kernel exploit

Emerging ARMv9 & Future Directions

MTE, SME, confidential compute, AI accel

AArch64 virtual address translation converts a 64-bit virtual address to a physical address through a multi-level page table walk. Two independent address spaces exist: TTBR0_EL1 covers user space (VA bits[63:56] = 0x00) and TTBR1_EL1 covers kernel space (VA bits[63:56] = 0xFF). Both ranges and granule choices are encoded in TCR_EL1. Memory attributes (cacheability, shareability, device type) come from MAIR_EL1, indexed by a 3-bit AttrIndx field in each page table descriptor.

Multi-level page table walk diagram showing L0 through L3 translation from virtual address to physical address — AArch64 address translation walk — four-level page table lookup from virtual address to physical frame

                        
                        Real-World Analogy — The Library Index Card System: Think of the MMU as a library card catalogue. Books (data) are stored on physical shelves at specific physical addresses. But patrons never navigate by shelf number — they look up a virtual reference (call number) in the catalogue. The catalogue has multiple levels: first you find the floor directory (L0 table), then the aisle (L1), then the shelf section (L2), then the exact shelf position (L3/page). Two separate catalogues (TTBR0 and TTBR1) exist — one for public readers (user space) and a restricted one for librarians (kernel space). Access permissions on each card (AP bits) determine who can read which books — and the PXN/UXN bits prevent patrons from executing the library code manuals. The TLB is the librarian's memory of recently looked-up books, avoiding the full catalogue walk every time.
                    

Walkthrough Step-by-Step Address Translation (4KB Granule)

Consider translating virtual address 0x0000_0040_1234_5678 with T0SZ=25 and 4KB granule:

Step 1 — Top-bit check: Bits[63:56] = 0x00 → use TTBR0_EL1 (user space).

Step 2 — L0 index: VA[47:39] = 0b000000010 = 2 → read entry at TTBR0 + 2×8 bytes. This is a Table entry pointing to L1 table base.

Step 3 — L1 index: VA[38:30] = 0b000000000 = 0 → read L1[0]. If this is a 1GB Block entry, translation is done (PA = Block base | VA[29:0]). Otherwise, it's a Table entry pointing to L2.

Step 4 — L2 index: VA[29:21] = 0b010010001 = 145 → read L2[145]. If 2MB Block entry, PA = Block base | VA[20:0]. Otherwise, Table → L3.

Step 5 — L3 index: VA[20:12] = 0b001000101 = 69 → read L3[69]. This is a 4KB Page entry. Final PA = Page base | VA[11:0] (offset 0x678).

The total walk costs 4 sequential memory reads — in the worst case ~40 cycles without TLB caching. The TLB caches the final PA, making subsequent accesses to the same page a single-cycle lookup.

Kernel/User Isolation via TTBR0/TTBR1

ARM's split address space design provides natural kernel/user isolation. User processes use TTBR0_EL1 (lower VA range, 0x0000...) and can only access pages marked with AP=01 (EL1/EL0 RW) or AP=11 (EL1/EL0 RO). The kernel's TTBR1_EL1 (upper VA range, 0xFFFF...) maps with AP=00 (EL1 only). On context switch, the kernel changes TTBR0_EL1 to the new process's page table — TTBR1_EL1 stays constant since kernel mappings are shared. Post-Spectre, Linux on ARM64 implements KPTI (Kernel Page Table Isolation): when running in EL0, the kernel TTBR1 page tables are replaced with a minimal "trampoline" table that only maps the exception vectors. On EL1 entry, the handler quickly switches to the full kernel tables. This prevents Meltdown-style attacks where EL0 speculatively reads kernel memory through TTBR1.

TCR_EL1 — Translation Control

T0SZ / T1SZ Address Range

TCR_EL1.T0SZ defines the size of the TTBR0 address space as 2^(64-T0SZ) bytes. Linux typically uses T0SZ=25 giving a 512 GB user address space (0x0000_0000_0000_0000 to 0x0000_007F_FFFF_FFFF). TCR_EL1.T1SZ mirrors this for the kernel TTBR1 range. Addresses outside both ranges generate an address size fault.

TCR_EL1 T0SZ/T1SZ address space split showing TTBR0 user range and TTBR1 kernel range — TCR_EL1 address space configuration — T0SZ/T1SZ define the TTBR0 (user) and TTBR1 (kernel) VA ranges

Granule Selection (TG0/TG1)

TG0/TG1 select the page granule: 00=4KB, 01=64KB, 10=16KB. The 4KB granule is the most common (used by Linux) and gives 512-entry tables at each level. The 64KB granule reduces walk depth and TLB pressure for workloads with large contiguous allocations (common in HPC/NUMA systems like Cavium Thunder).

Walk Levels Derived from T0SZ

// TCR_EL1 setup for Linux-style config:
// T0SZ=25, T1SZ=25, 4KB granule, inner shareable, write-back cacheable
LDR  x0, =0x00000015B5194515
// Bit breakdown:
//  [5:0]  T0SZ = 25  (0x19)
//  [7]    EPD0 = 0   (TTBR0 walk enabled)
//  [9:8]  IRGN0 = 01 (inner WB WA cacheable)
//  [11:10] ORGN0 = 01 (outer WB WA cacheable)
//  [13:12] SH0  = 11  (inner shareable)
//  [15:14] TG0  = 00  (4KB granule)
//  [21:16] T1SZ = 25  (0x19)
//  [23]   EPD1 = 0
//  [29:28] TG1  = 10  (4KB granule for TTBR1, encoding differs!)
//  [35:32] IPS  = 100 (48-bit phys addr space)
MSR  TCR_EL1, x0
ISB

MAIR_EL1 — Memory Attributes

Attribute Index (AttrIndx)

MAIR_EL1 holds eight 8-bit attribute encodings (indices 0–7). Each page table descriptor references one via its 3-bit AttrIndx field. A typical kernel setup uses: index 0 = Device-nGnRnE (MMIO, strictly ordered), index 1 = Normal, Inner/Outer Write-Back, Read-Allocate, Write-Allocate (cached RAM), index 2 = Normal, Non-Cacheable (uncached DMA buffers).

MAIR_EL1 eight attribute index slots mapping to Device and Normal memory type encodings — MAIR_EL1 attribute index slots — eight 8-bit encodings for Device-nGnRnE, Normal cached, and Non-Cacheable types

// MAIR_EL1 encoding for three common attribute types
// Attr[0] = 0x00 = Device-nGnRnE (strictly-ordered MMIO)
// Attr[1] = 0xFF = Normal, Inner/Outer WB/WA/RA (cacheable)
// Attr[2] = 0x44 = Normal, Inner/Outer Non-Cacheable

LDR  x0, =0x000000000044FF00
MSR  MAIR_EL1, x0
ISB

Device vs Normal Memory

Device memory (Device-nGnRnE / nGnRE / nGRE / GRE) is used for memory-mapped I/O registers. Accesses to Device memory are always in program order relative to other Device accesses to the same peripheral, preventing the hardware from reordering reads or writes. Normal memory allows full out-of-order and speculative access; without explicit barriers it may be reordered by both the CPU and memory system.

Page Table Descriptors

Block vs Page vs Table Entries

A 4KB-granule descriptor is 8 bytes. Bits[1:0] identify the type: 00/01 = invalid, 11 = table (points to next-level table), 01 at leaf = page (4 KB), 01 at L1/L2 = block (1 GB or 2 MB respectively). Block entries eliminate one or two levels of walk, reducing TLB miss latency for large contiguous regions. Linux uses 2 MB huge pages extensively for the kernel image and large anonymous mappings.

AArch64 page table descriptor bit field layout showing AP, UXN, PXN, AF, SH, and AttrIndx positions — 4KB page descriptor format — permission (AP), execute-never (UXN/PXN), access flag (AF), and attribute index fields

AP, UXN, PXN, AF, SH bits

// 4KB page descriptor bit fields (simplified):
// [0]    Valid bit
// [1]    Type (1=block/page, 0=table at non-leaf)
// [4:2]  AttrIndx — index into MAIR_EL1 (bits[4:2])
// [5]    NS  — Non-Secure (EL3 only)
// [7:6]  AP  — Access Permissions
//               00 = EL1 RW, EL0 no access
//               01 = EL1/EL0 RW
//               10 = EL1 RO, EL0 no access
//               11 = EL1/EL0 RO
// [9:8]  SH  — Shareability (10=outer, 11=inner shareable)
// [10]   AF  — Access Flag (must be 1 or take AF fault on first access)
// [11]   nG  — Not Global (ASID-tagged if set)
// [47:12] OA — Output (physical) address[47:12]
// [51:48] (RES0 for 48-bit PA)
// [53]   PXN — Privileged Execute Never
// [54]   UXN — Unprivileged Execute Never

// Example: EL1 RW, EL0 no access, cached, inner shareable, AF=1
// AttrIndx=1 (Normal cached), AP=00, SH=11, AF=1
// Descriptor bits: AF|SH|AP|AttrIndx | OA | valid+type
// = 0x0060000000000703  (for OA=0x0, type=page)

MOV x0, #0x703             // bits[11:0]: AF=1, SH=11, AttrIndx=001, valid+type
ORR x0, x0, #(1 << 10)    // AF bit already included above
ORR x0, x0, x5             // OR in physical page address (x5 = page_phys_addr)

Building an Identity Map at Boot

Level-0 & Level-1 Tables

An identity map (VA == PA) is the simplest starting point for boot: the MMU is enabled while code is still running at physical addresses. With T0SZ=25 and 4KB granule, the walk is L0→L1→L2→L3 for 4KB pages, or terminates at L1 for 1 GB blocks. A minimal boot identity-map uses a single L0 table (4KB, 512×8-byte entries) pointing to a few L1 block entries.

// Minimal identity map: map 0x0–0x40000000 (1 GB) as Device,
// and 0x40000000–0x80000000 (1 GB) as Normal cached.
// Page table base in x9 (4KB aligned, zero-filled before use)

// L0[0] → L1 table
ADR   x0, l1_table
ORR   x0, x0, #3           // Table descriptor (valid=1, type=table)
STR   x0, [x9]             // L0 entry 0

// L1[0] = 1 GB block: PA 0x0, Device-nGnRnE (AttrIndx=0, AP=00, AF=1)
MOV   x1, #(1 << 10)       // AF=1
ORR   x1, x1, #(3 << 8)   // SH=11 (inner shareable)
ORR   x1, x1, #1           // valid=1, block (type=01 at L1)
STR   x1, [x10]            // l1_table[0]

// L1[1] = 1 GB block: PA 0x40000000, Normal cached (AttrIndx=1)
MOV   x2, #0x40000000      // physical base
ORR   x2, x2, #(1 << 10)  // AF
ORR   x2, x2, #(3 << 8)   // SH inner
ORR   x2, x2, #(1 << 2)   // AttrIndx=1
ORR   x2, x2, #1           // valid+block
STR   x2, [x10, #8]        // l1_table[1]

Enabling the MMU (SCTLR_EL1.M)

// Enable MMU: write TTBR0, TTBR1, TCR, MAIR, then set SCTLR_EL1.M
ADR   x0, l0_table
MSR   TTBR0_EL1, x0
ADR   x0, kernel_l0_table
MSR   TTBR1_EL1, x0

// Ensure all page table writes are visible before enabling MMU
DSB   ISH
TLBI  VMALLE1IS              // Invalidate all TLB entries
DSB   ISH
ISB

MRS   x0, SCTLR_EL1
ORR   x0, x0, #(1 << 0)    // M=1: enable MMU
ORR   x0, x0, #(1 << 2)    // C=1: data cache enable
ORR   x0, x0, #(1 << 12)   // I=1: instruction cache enable
MSR   SCTLR_EL1, x0
ISB                          // Instruction barrier after MMU enable

Huge Pages & Contiguous Hint

Beyond 2 MB L2 blocks, AArch64 supports a Contiguous Hint (bit[52] in descriptors): when 16 consecutive naturally-aligned L3 page descriptors all have Contiguous=1, the TLB may cache them as a single 64 KB entry, reducing TLB pressure for hot large-page regions. Linux uses this for anonymous huge pages on systems without 2 MB aligned physical memory. The kernel also supports 1 GB L1 block mappings for areas like device RAM on server SoCs where the full gigabyte is contiguous.

TLB Maintenance

TLBI Operations

// TLBI syntax: TLBI <op>{IS}{nXS}, {Xt}
// IS = inner-shareable broadcast (affects all cores in shareability domain)
// nXS = not extended shareability

// Invalidate all EL1 entries (broadcast to all PEs)
TLBI  VMALLE1IS

// Invalidate a specific VA in EL1, non-broadcast (current PE only)
// Xt = VA[55:12] shifted — bits[63:44]=ASID, bits[43:0]=VA>>12
MOV   x0, x5, LSR #12      // VA page number
TLBI  VAE1, x0             // Invalidate by VA in EL1

// Invalidate all entries for a specific ASID
MOV   x0, #(1234 << 48)    // ASID in bits[63:48]
TLBI  ASIDE1IS, x0

// Always use DSB before and after TLBI in critical sections
DSB   ISHST                 // Ensure preceding stores (page table writes) complete
TLBI  VMALLE1IS
DSB   ISH                  // Ensure TLBI completes before continuing
ISB

Break-Before-Make Rule

When changing a live page table entry from one valid mapping to another (e.g., changing permissions or PA), the ARM architecture requires Break-Before-Make: first write an invalid descriptor, execute a DSB, issue the appropriate TLBI, issue another DSB, then write the new valid descriptor. Skipping this can expose a window where both old and new mappings coexist in different cores' TLBs, causing undefined behaviour. For read-only promotions (adding a restriction) the rule is relaxed, but erring on the side of BBM is the safe default.

Break-Before-Make sequence diagram showing invalidate, barrier, TLBI, barrier, and remap steps for safe page table updates — Break-Before-Make protocol — required sequence for safely updating live page table entries across all cores

                        
                        Key Insight: The Access Flag (AF, bit[10]) is deliberately omitted from descriptors at first mapping creation in some OS designs. The first access to the page causes an AF fault taken to EL1. The kernel's AF fault handler sets the AF bit and returns — this records which pages have been recently accessed for the page replacement algorithm. It's a hardware-assisted LRU approximation that costs zero kernel-side tracking overhead.
                    

Hands-On Exercises

                        
                        Practice Exercises: Test your understanding of ARM virtual memory and page tables.
                    

Exercise 1 Identity Map with Permissions

Task: Extend the boot identity map to cover 4 regions: 0x00000000–0x3FFFFFFF as Device-nGnRnE (MMIO), 0x40000000–0x7FFFFFFF as Normal RW cached (kernel data), 0x80000000–0xBFFFFFFF as Normal RO+PXN cached (user code mapped as read-only), and 0xC0000000–0xFFFFFFFF as Normal RW Non-Cacheable (DMA buffers). Use L1 block entries and verify each region's attributes by reading back the descriptors in GDB.

Hint: You need four L1 entries. Set AttrIndx=0 for Device, AttrIndx=1 for cached Normal, AttrIndx=2 for non-cacheable. Use AP[7:6] to control read/write permissions.

Exercise 2 TLB Shootdown Sequence

Task: Write a function that implements the Break-Before-Make pattern: given a pointer to a live L3 page table entry and a new descriptor value, (1) store an invalid entry, (2) DSB ISH, (3) TLBI VAE1IS with the correct VA, (4) DSB ISH, (5) store the new valid entry, (6) DSB ISH + ISB. Test by changing a page from RW to RO while another thread reads it — verify the reader takes a permission fault after the change.

Hint: The TLBI operand format is: bits[63:48]=ASID, bits[43:0]=VA≡12. Zero the ASID field for global pages.

Exercise 3 Demand Paging via AF Faults

Task: Create page table entries with AF=0 for a 64KB region. Write an AF fault handler (ESR_EL1 EC=0b100100, ISS DFSC=0b001001) that sets the AF bit in the faulting descriptor, issues TLBI+DSB, and returns to the faulting instruction. Count how many AF faults you take when linearly scanning the region — verify it equals exactly 16 (one per 4KB page).

Hint: The FAR_EL1 gives the faulting VA. Walk the page table from TTBR0 to find the L3 entry, then OR in bit[10]. Remember to use atomic ORR if other cores might access the same descriptor.

Page Table Configuration Generator

Use this tool to plan and document your MMU configuration — address space layout, granule sizes, memory regions with attributes, and TLBI maintenance strategy. Download as Word, Excel, or PDF for your system design documentation.

Page Table Configuration Planner

Document your MMU setup: address ranges, attributes, and TLB strategy. Download as Word, Excel, or PDF.

Draft auto-saved

All data stays in your browser. Nothing is sent to or stored on any server.

System Name *

Granule Size *

T0SZ / T1SZ Config *

MAIR_EL1 Attributes *

Memory Regions

TLB Maintenance Strategy

Author Name

Conclusion & Next Steps

We configured the AArch64 MMU from scratch: TCR_EL1 T0SZ/T1SZ/TG0/TG1/IPS, MAIR_EL1 attribute indices for Device and Normal memory, page table descriptor fields (AP, UXN, PXN, AF, SH, AttrIndx), the step-by-step address translation walk, kernel/user isolation via split TTBR0/TTBR1 address spaces and KPTI, building a boot identity map with L1 block entries, enabling the MMU via SCTLR_EL1.M with proper cache and barrier sequencing, huge pages and the Contiguous Hint, and TLBI operations with the Break-Before-Make rule.

Cookie Consent

Cookie Preferences