ARM Assembly Part 7: Memory Model, Caches & Barriers

Introduction

                        
                        Series Overview: This is Part 7 of our 28-part ARM Assembly Mastery Series. Parts 1–6 covered architecture history, ARM32, AArch64 registers, arithmetic, branching, and the AAPCS64 calling convention. Now we tackle the memory subsystem — arguably the most treacherous area of systems programming.
                    

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 7

Memory Model, Caches & Barriers

Weak ordering, DMB/DSB/ISB, TLB

You Are Here

NEON & Advanced SIMD

Vector ops, intrinsics, media processing

SVE & SVE2 Scalable Vector Extensions

Predicate regs, gather/scatter, HPC/ML

Floating-Point & VFP Instructions

IEEE-754, scalar FP, rounding modes

Exception Levels, Interrupts & Vector Tables

EL0–EL3, GIC, fault debugging

MMU, Page Tables & Virtual Memory

Stage-1 translation, permissions, huge pages

TrustZone & ARM Security Extensions

Secure monitor, world switching, TF-A

Cortex-M Assembly & Bare-Metal Embedded

NVIC, SysTick, linker scripts, low-power

Cortex-A System Programming & Boot

EL3→EL1 transitions, MMU setup, PSCI

Apple Silicon & macOS ABI

ARM64e PAC, Mach-O, dyld, perf counters

Inline Assembly, GCC/Clang & C Interop

Constraints, clobbers, compiler interaction

Performance Profiling & Micro-Optimization

Pipeline hazards, PMU, benchmarking

Reverse Engineering & ARM Binary Analysis

ELF, disassembly, CFR, iOS/Android quirks

Building a Bare-Metal OS Kernel

Bootloader, UART, scheduler, context switch

ARM Microarchitecture Deep Dive

OOO pipelines, reorder buffers, branch predict

Virtualization Extensions

EL2 hypervisor, stage-2 translation, KVM

Debugging & Tooling Ecosystem

GDB, OpenOCD/JTAG, ETM/ITM, QEMU

Linkers, Loaders & Binary Format Internals

ELF deep dive, relocations, PIC, crt0

Cross-Compilation & Build Systems

GCC/Clang toolchains, CMake, firmware gen

ARM in Real Systems

Android, FreeRTOS/Zephyr, U-Boot, TF-A

Security Research & Exploitation

ASLR, PAC attacks, ROP/JOP, kernel exploit

Emerging ARMv9 & Future Directions

MTE, SME, confidential compute, AI accel

Why Memory Ordering Matters

Modern CPUs reorder memory accesses for performance — stores may complete out of order, loads may be satisfied from store buffers before writing to cache, and the memory system may merge adjacent writes. Without explicit ordering constraints, shared-memory algorithms like spinlocks, message queues, and RCU break silently. The bug won't crash immediately — it will manifest as rare data corruption under load, on specific hardware, at 3 AM. ARM's weak ordering model is significantly more aggressive than x86's Total Store Order (TSO), meaning code that works on Intel may break on ARM without proper barriers.

                        
                        The x86 Trap: On x86, most memory operations are naturally ordered (TSO guarantees store-store and load-load ordering). Developers coming from x86 often accidentally write correct concurrent code without explicit barriers. When porting this code to ARM, hidden ordering assumptions break. The Linux kernel's ARM port required hundreds of barrier insertions that were invisible on x86.
                    

ARM Memory Model

Weak vs Strong Ordering

x86 uses Total Store Order (TSO) — stores become globally visible in program order, and loads are never reordered before earlier loads. ARM uses a weaker model: loads and stores to Normal memory may be observed by other CPUs in any order unless device memory or barrier instructions are used. This means the hardware can reorder loads past stores, stores past stores, and loads past loads — all three are allowed. The ARM architecture manual guarantees only address dependency ordering (if load B depends on the address from load A, B won't execute before A).

Comparison of x86 Total Store Order versus ARM weak memory model reordering rules — x86 Total Store Order versus ARM weak memory model: allowed reordering of loads and stores across architectures

Reordering	x86 (TSO)	ARM (Weak)	Implication
Load → Load	Ordered	May reorder	Need DMB LD or LDAR
Load → Store	Ordered	May reorder	Need DMB or acquire/release
Store → Store	Ordered	May reorder	Need DMB ST or STLR
Store → Load	May reorder	May reorder	Both need barriers for this

Memory Attribute Types

AArch64 defines memory types in the MAIR_EL1 register (Memory Attribute Indirection Register). Page table entries reference MAIR indices to select the attribute for each page:

Normal — Cacheable, write-back or write-through, out-of-order access permitted. Used for RAM. The CPU may merge, reorder, and speculatively access Normal memory freely.
Device — Non-cacheable, with ordering guarantees. Four sub-types control hardware behaviour:
- Device-nGnRnE — No Gathering, No Reordering, No Early-write-Ack (strictest, for MMIO)
- Device-nGnRE — No Gathering, No Reordering, Early-write-Ack allowed
- Device-nGRE — No Gathering, Reordering allowed, Early-write-Ack allowed
- Device-GRE — Gathering, Reordering, and Early-write-Ack all allowed (loosest device type)
Normal Non-Cacheable — Byte-addressable like Normal memory but bypasses caches. Used for DMA buffers and framebuffers where coherency would add overhead.

Shareability Domains

Shareability determines which agents must observe a data write — think of it as the "broadcast radius" for coherency:

Non-Shareable (NSH) — Only the local CPU core sees updates. Suitable for thread-private data.
Inner Shareable (ISH) — All CPUs in the inner domain (typically one cluster, or all big.LITTLE cores). This is the correct scope for most kernel and user-space synchronisation.
Outer Shareable (OSH) — Inner domain plus outer agents (GPU, DMA controllers, interconnect-attached accelerators).
Full System (SY) — All agents including ones outside the coherency domain. Required for system-level TLB and cache maintenance broadcasts.

Barrier instructions (DMB, DSB) accept a shareability option that determines their scope. Using ISH when SY is needed (or vice versa) is a common source of rare concurrency bugs.

Cache Hierarchy

Set/Way Organisation

A cache is organised like a post office with mailboxes: each address maps to a set (row), and each set has W ways (slots). Total capacity = S sets × W ways × L line bytes. Address bits are split into three fields:

Offset (low bits) — byte position within the cache line (log2 L)
Index (middle bits) — selects the set (log2 S)
Tag (remaining high bits) — stored alongside data to identify which address is cached

Level	Typical Size	Associativity	Line Size	Latency
L1-I / L1-D	32–64 KB each	4-way	64 bytes	1–4 cycles
L2 (per-core)	256 KB – 1 MB	8-way	64 bytes	8–15 cycles
L3 (shared)	4–32 MB	16-way+	64 bytes	20–50 cycles

On a cache miss, the hardware must evict an existing line (typically using LRU or pseudo-random replacement policy) to make room for the new data. Higher associativity reduces conflict misses but increases access latency and area.

VIPT vs PIPT Caches

Modern ARM L1 caches are Virtually Indexed, Physically Tagged (VIPT). The index bits come from the virtual address (available immediately, no TLB wait), while the tag comparison uses the physical address (from TLB lookup that happens in parallel). This hides TLB latency — by the time the set is read, the physical tag is ready for comparison.

VIPT can cause cache aliasing: if two virtual addresses map to the same physical page but produce different set indices, the same data exists in two cache lines. ARM sidesteps this by ensuring the L1 cache size divided by associativity does not exceed the page size (a 64 KB 4-way cache uses 14 index+offset bits ≤ 16 KB per way = 14 bits, matching a 16 KB page). L2 and L3 caches are always PIPT — they index and tag using physical addresses only, eliminating aliasing entirely at the cost of waiting for the TLB translation.

                        
                        Apple M-series VIPT trick: Apple's M1/M2 use 128 KB L1 caches (larger than typical) with 16 KB pages. They avoid aliasing by using 8-way associativity: 128 KB / 8 = 16 KB per way, keeping index+offset within 14 bits. The ARM architecture spec explicitly permits this design because the index bits fall within the page offset.
                    

Cache Coherency (MESI)

ARM's interconnect (CCI, CCN, CMN, or DSU) maintains coherency using a MOESI-like protocol with five states per cache line:

Modified (M) — Dirty and exclusive. This core has the only valid copy, and it's been written to.
Owned (O) — Dirty but shared. This core supplies data to snoop requests; other cores hold stale Shared copies.
Exclusive (E) — Clean and exclusive. Matches main memory; can transition to M without bus traffic.
Shared (S) — Clean and shared. Multiple cores hold identical copies; must invalidate others before writing.
Invalid (I) — Line is not present or has been invalidated.

Hardware coherency is automatic for Inner Shareable Normal memory — when Core 0 stores to address X, the interconnect invalidates or updates X in Core 1's cache. Software barriers are only needed for:

Ordering guarantees (when which accesses become visible matters)
I-cache/D-cache coherency (the I-cache is not hardware-coherent with the D-cache)
DMA coherency (devices outside the coherency domain)

Case Study Linux Kernel

False Sharing: The Silent Performance Killer

When two cores write to different variables that happen to share the same 64-byte cache line, the coherency protocol bounces the line between M states on each core — a phenomenon called false sharing. In the Linux kernel, per-CPU counters were originally packed into arrays. Accessing counter[cpu] on different CPUs caused constant invalidation traffic. The fix was __cacheline_aligned_in_smp — padding each counter to a full cache line (64 bytes), eliminating cross-core coherency traffic and improving throughput by 10–50× on contended paths.

Probing Cache Info (CTR_EL0/CLIDR_EL1)

// Read Cache Type Register
MRS  x0, CTR_EL0        // Cache Type Register EL0
// Bits [3:0]  = IminLine (log2 of I-cache line in words)
// Bits [19:16] = DminLine (log2 of D-cache line in words)
// Bits [27:24] = CWG     (cache writeback granule)

MRS  x1, CLIDR_EL1      // Cache Level ID Register
// Bits [2:0]  = L1 cache type (0=none,1=I,2=D,3=sep I+D,4=unified)

Cache Maintenance

Data Cache Operations (DC)

ARM provides cache maintenance instructions that operate on individual cache lines by virtual address. These are essential for DMA, self-modifying code, and firmware:

Data cache maintenance operations: DC CIVAC, DC CVAC, and DC IVAC at Point of Coherency — Data cache maintenance flow: DC CIVAC, DC CVAC, and DC IVAC operations at Point of Coherency and Point of Unification

Instruction	Action	Scope	Use Case
`DC CIVAC`	Clean + Invalidate	Point of Coherency (PoC)	DMA buffers (outbound + inbound)
`DC CVAC`	Clean (writeback)	PoC	DMA outbound (device reads from RAM)
`DC CVAP`	Clean to PoP	Point of Persistence (ARMv8.2)	Persistent memory / NVDIMMs
`DC IVAC`	Invalidate only	PoC	Dangerous — discards dirty data. Use after DMA inbound only.
`DC CSW`	Clean by Set/Way	All levels	Full cache flush at power-down
`DC CISW`	Clean + Invalidate by S/W	All levels	Firmware boot / cache disable sequence

                        
                        DC IVAC is dangerous: It discards dirty data without writing it back. If any cache line in the range has been modified, those modifications are lost. Only use DC IVAC for DMA receive buffers where you know the CPU hasn't written to the range. For all other cases, use DC CIVAC.
                    

// Flush a range of memory to PoC (for DMA)
// x0 = start address, x1 = size
flush_dcache_range:
    MRS  x2, CTR_EL0
    UBFX x2, x2, #16, #4       // Extract DminLine
    MOV  x3, #4
    LSL  x3, x3, x2            // Cache line size in bytes
    SUB  x4, x3, #1            // Mask
    BIC  x5, x0, x4            // Align start down
1:  DC   CIVAC, x5             // Clean+Invalidate by VA to PoC
    ADD  x5, x5, x3
    CMP  x5, x1
    B.LS 1b
    DSB  SY                    // Ensure all maintenance complete
    RET

Instruction Cache Operations (IC)

The I-cache is not coherent with the D-cache in ARM — writing to memory via stores (which go through the D-cache) does not automatically update the I-cache. Three maintenance instructions handle I-cache invalidation:

IC IALLU — Invalidate all I-cache lines to the Point of Unification (PoU). Requires EL1+ privilege. Affects only the executing CPU.
IC IALLUIS — Same as IALLU but broadcasts to all CPUs in the Inner Shareable domain. Used after loading kernel modules or patching code that may execute on any core.
IC IVAU — Invalidate I-cache by VA to PoU. Available at EL0 (user-space JIT compilers use this). Only invalidates a single cache line.

The Point of Unification (PoU) is the level where the I-cache and D-cache share a common view — typically the L2 cache. After cleaning D-cache lines to PoU (DC CVAU), invalidating I-cache to PoU (IC IVAU) ensures the instruction fetch path sees the new code.

I-Cache Coherency After JIT

// JIT code generation: write code, then execute it
// x0 = start of new code, x1 = end of new code
jit_flush:
    // 1. Flush D-cache to PoU so the I-cache can see the writes
    MRS  x2, CTR_EL0
    UBFX x3, x2, #16, #4
    MOV  x4, #4
    LSL  x4, x4, x3            // D-cache line size
    BIC  x5, x0, x4
.Ldc: DC CVAU, x5              // Clean D-cache line to PoU
    ADD  x5, x5, x4
    CMP  x5, x1
    B.LS .Ldc
    DSB  ISH                   // Ensure D-cache maintenance complete
    // 2. Invalidate I-cache to PoU
    AND  x6, x2, #15           // IminLine
    MOV  x7, #4
    LSL  x7, x7, x6            // I-cache line size
    BIC  x8, x0, x7
.Lic: IC IVAU, x8              // Invalidate I-cache line to PoU
    ADD  x8, x8, x7
    CMP  x8, x1
    B.LS .Lic
    DSB  ISH                   // Ensure I-cache maintenance complete
    ISB                        // Synchronise fetch pipeline
    RET

Memory Barrier Instructions

SummaryThree Barrier Instructions

Quick Reference: DMB vs DSB vs ISB

DMB — orders memory accesses. Subsequent accesses won't reorder before prior ones. Does NOT stall the pipeline.

DSB — stronger: all prior memory accesses AND cache/TLB maintenance complete before anything after executes. Stalls until completion.

ISB — flushes the instruction pipeline and refetches. Required after changing system registers, installing page tables, or code modification.

DMB — Data Memory Barrier

DMB ensures all memory accesses (loads and/or stores) before the barrier are observed by the specified domain before any memory access after the barrier. Think of DMB as a one-way gate: earlier accesses must pass through before later ones are allowed.

DMB, DSB, and ISB barrier instructions compared showing memory ordering versus pipeline effects — DMB, DSB, and ISB barrier instructions: memory ordering (DMB), completion synchronisation (DSB), and pipeline flush (ISB)

Crucially, DMB does not stall the pipeline — the CPU can continue executing non-memory instructions (arithmetic, branches) while waiting for ordering to be established. It also does not wait for cache/TLB maintenance operations to complete (that requires DSB). DMB is the correct barrier for most synchronisation patterns:

Message passing: Store data, DMB, store flag — ensures the consumer sees data before the flag
Lock release: Store shared data, DMB, store lock=0 — ensures all critical section writes are visible before the lock appears free
Device register sequences: Write config register, DMB, write enable register

// Producer: publish a message
STR  x1, [x0]         // Write data
DMB  ISH               // Ensure data write visible before flag write
STR  x2, [x3]         // Write flag = 1 (consumer side sees data first)

DSB — Data Synchronisation Barrier

DSB is strictly stronger than DMB. In addition to ordering memory accesses, DSB stalls the pipeline until all prior memory accesses and all pending cache/TLB maintenance operations have completed to the specified domain. No instruction of any kind (not even non-memory instructions) executes after DSB until completion.

Use DSB when you need to guarantee completion, not just ordering:

Before ISB: System register changes (MSR to SCTLR, TCR, TTBR) need DSB to ensure the write propagates before ISB flushes the pipeline
After cache maintenance: DC/IC operations are not guaranteed complete until DSB
After TLBI: TLB invalidation broadcasts are asynchronous; DSB waits for all invalidations to complete across all cores
Before WFI/WFE: Ensures all stores are visible before the core enters low-power state

ISB — Instruction Synchronisation Barrier

ISB is fundamentally different from DMB and DSB — it operates on the instruction pipeline, not memory. ISB flushes all instructions that were fetched or decoded before the ISB and forces the CPU to re-fetch everything after it. This guarantees that any changes to system state (control registers, page tables, branch prediction) take effect for subsequent instructions.

ISB does not wait for memory accesses to complete — always pair with DSB first when modifying system registers via MSR:

MSR SCTLR_EL1, x0 — Write the system register
DSB SY — Ensure the MSR write propagates to all observers
ISB — Flush pipeline so subsequent instructions use the new configuration

// Enable the MMU (EL1)
MRS  x0, SCTLR_EL1
ORR  x0, x0, #1        // Set M bit
MSR  SCTLR_EL1, x0
DSB  SY                // Ensure write to SCTLR visible
ISB                    // Flush pipeline; now executing with MMU on

Barrier Option Variants

Both DMB and DSB accept an option that encodes scope × direction. ISB always uses SY (full system). The option determines which agents and which access types are constrained:

Option	Scope	Direction	Typical Use
`SY`	Full System	All (LD+ST)	System-level TLBI, cache ops, device I/O
`ST`	Full System	Store → Store	Order stores to different devices
`LD`	Full System	Load → Load/Store	Order loads before dependent stores
`ISH`	Inner Shareable	All	Most common: kernel/user sync between CPUs
`ISHST`	Inner Shareable	Store → Store	Release stores in spinlocks
`ISHLD`	Inner Shareable	Load → Load/Store	Acquire loads in spinlocks
`OSH`	Outer Shareable	All	GPU/DMA coherency
`NSH`	Non-Shareable	All	Thread-private ordering (rare)

                        
                        Rule of thumb: Use DMB ISH / DSB ISH for CPU-to-CPU synchronisation. Use DSB SY for system-level operations (TLBI, DC, IC, MSR). Using SY everywhere works but is slower — ISH only broadcasts within the CPU cluster.
                    

Acquire/Release Semantics

LDAR / STLR Instructions

ARMv8.0 introduced load-acquire (LDAR) and store-release (STLR) instructions that embed barrier semantics directly into the load/store, eliminating the need for separate DMB instructions in most synchronisation patterns:

LDAR (Load-Acquire) — No subsequent load or store can be reordered before this load. This is the "acquire" half of a lock — everything inside the critical section stays after the lock acquisition.
STLR (Store-Release) — No prior load or store can be reordered after this store. This is the "release" half — all critical section writes complete before the unlock becomes visible.

These map directly to C11/C++ memory_order_acquire and memory_order_release. The exclusive variants (LDAXR/STLXR) combine exclusivity (for atomic read-modify-write) with acquire/release semantics. This gives you a complete spinlock without any explicit barrier instructions:

Load-acquire and store-release one-way barrier semantics across critical section boundaries — LDAR (load-acquire) and STLR (store-release) one-way barrier semantics preventing reordering across critical section boundaries

// Spinlock acquire using LDAXR/STXR + LDAR/STLR
spin_lock:
    MOV  w1, #1
.Ltry:
    LDAXR w2, [x0]       // Load-Acquire Exclusive
    CBNZ  w2, .Ltry      // Spin if locked
    STXR  w3, w1, [x0]   // Store Exclusive
    CBNZ  w3, .Ltry      // Retry if store failed
    RET

spin_unlock:
    STLR  wzr, [x0]      // Store-Release zero (unlock)
    RET

LDAPR (ARMv8.3 Relaxed Acquire)

LDAPR (Load-AcquirePC) was added in ARMv8.3 as a weaker alternative to LDAR. While LDAR enforces full acquire ordering (no reordering of any subsequent access), LDAPR only prevents reordering of the load relative to subsequent stores to the same address. Specifically, LDAPR allows later loads to other addresses to be reordered before it — a relaxation that can improve performance in read-heavy workloads.

LDAPR maps to the C++ memory_order_consume concept (data-dependency ordering). In practice, the Linux kernel uses LDAPR for its smp_load_acquire() implementation on ARMv8.3+ systems, falling back to LDAR on older cores. The performance difference is measurable on workloads with frequent acquire loads followed by independent reads — LDAPR allows the CPU's load queue to service those reads out of order.

                        
                        Ordering hierarchy: LDR (no ordering) < LDAPR (reorder stores only by address dependency) < LDAR (full acquire: nothing reorders before this load). Choose the weakest that satisfies your algorithm's correctness requirements.
                    

LSE Atomic Operations (ARMv8.1)

The Large System Extensions (LSE), mandatory from ARMv8.1, add single-instruction atomic read-modify-write operations that replace LDXR/STXR retry loops:

Instruction	Operation	C11 Equivalent
`LDADD / LDADDA / LDADDAL`	Atomic load + add	`atomic_fetch_add()`
`LDCLR / LDCLRA / LDCLRAL`	Atomic load + clear bits	`atomic_fetch_and(~val)`
`LDSET / LDSETA / LDSETAL`	Atomic load + set bits	`atomic_fetch_or()`
`LDEOR`	Atomic load + XOR	`atomic_fetch_xor()`
`SWP / SWPA / SWPAL`	Atomic swap	`atomic_exchange()`
`CAS / CASA / CASAL`	Compare and swap	`atomic_compare_exchange()`
`CASP`	Compare and swap pair (128-bit)	`__atomic_compare_exchange_16()`

The suffix A adds acquire semantics, L adds release, and AL adds both (sequential consistency). LSE atomics are significantly faster than LDXR/STXR loops on large systems because the hardware can implement them without exclusive monitors — the interconnect handles the atomic operation at the cache level, reducing coherency traffic.

Case Study Performance

MySQL on Graviton2: LSE Atomics Impact

When AWS released Graviton2 (ARMv8.2 with LSE), MySQL compiled with -march=armv8.1-a (enabling LSE) showed 30–40% higher throughput on contended workloads compared to the same binary compiled for ARMv8.0 (using LDXR/STXR loops). The improvement came primarily from LDADD replacing reference count increments and CAS replacing mutex acquisition loops. The Linux kernel added runtime LSE detection in 2019, patching atomic operations at boot time based on ID_AA64ISAR0_EL1.

TLB Maintenance

TLBI Operations

TLB invalidation instructions follow a naming convention: TLBI <type><level><shareability>. The IS suffix broadcasts the invalidation to all CPUs in the Inner Shareable domain — without it, only the local CPU's TLB is affected:

Instruction	Scope	Use Case
`TLBI VMALLE1IS`	All EL1 entries, broadcast	Full TLB flush after changing ASID or TCR
`TLBI VAE1IS, Xt`	Single VA, all levels, broadcast	Unmapping a page (general)
`TLBI VALE1IS, Xt`	Single VA, last-level only, broadcast	Permission change (PTE-only update)
`TLBI ASIDE1IS, Xt`	All entries for an ASID, broadcast	Process exit / address space teardown
`TLBI VMALLE1`	All EL1 entries, local CPU only	Single-core boot or test scenarios

Always follow TLBI with DSB ISH to ensure the invalidation completes on all cores before proceeding, then ISB if the current CPU's translations were affected. Missing the DSB creates a race where another core may still use a stale TLB entry.

Break-Before-Make Sequence

// Correct page table entry update (Break-Before-Make)
// x0 = PTE address, x1 = new PTE value
update_pte:
    STR  xzr, [x0]         // 1. Write invalid entry (break)
    DSB  ISHST             // Ensure invalid entry visible
    TLBI VAE1IS, x0        // 2. Invalidate TLB for this VA
    DSB  ISH               // Ensure TLB invalidation complete
    STR  x1, [x0]          // 3. Write new valid entry (make)
    DSB  ISH               // Ensure new entry visible
    ISB                    // Use new mapping immediately
    RET

Device & Non-Cacheable Memory

Memory-mapped I/O registers must be mapped as Device memory in the page tables — typically Device-nGnRnE (no Gathering, no Reordering, no Early-write-Acknowledgement), the strictest device type. This prevents the CPU from:

Gathering: Merging two 32-bit writes into a single 64-bit write (which the peripheral wouldn't understand)
Reordering: Issuing writes to different MMIO registers out of order (which could violate hardware init sequences)
Early acknowledgement: Reporting a write as "complete" before it actually reaches the device bus (which could miss error status)

Accessing MMIO through Normal memory attributes causes undefined behaviour — the CPU may speculatively read device registers (triggering side effects), cache device responses (returning stale data), or merge writes. In Linux, ioremap() maps device memory as Device-nGnRnE by default. For DMA buffers that the CPU also reads/writes, Normal Non-Cacheable is appropriate — it prevents caching but still allows merging and reordering (acceptable for bulk data transfers).

Case Study Embedded

UART Init: When Memory Type Matters

A common embedded bug: mapping a UART's control registers as Normal cacheable memory. The CPU caches the status register value, so polling UART_FR (flag register) for "transmit FIFO not full" returns a stale cached value — the loop either spins forever or overwrites data in the FIFO. Mapping as Device-nGnRnE forces every load to read from the actual hardware register, and every store to reach the FIFO in program order.

                        
                        Key Insight: The three-barrier pattern for system register changes is always DSB → MSR → DSB → ISB. The first DSB ensures prior cache/TLB maintenance is complete; the MSR updates the register; the second DSB ensures the write propagates; ISB flushes the pipeline to use the new value. Omitting any step causes intermittent, hardware-specific failures.
                    

Conclusion & Next Steps

In this part we explored the foundations of ARM's memory system — from the weakly-ordered memory model that gives hardware maximum reordering freedom, through memory attribute types (Normal, Device, Non-Cacheable) and shareability domains that control coherency scope, to the set-associative cache hierarchy with VIPT L1 and PIPT L2/L3, maintained by the MOESI coherency protocol.

We covered the practical tools: DC operations for flushing data to main memory (essential for DMA), IC operations for I-cache coherency after JIT compilation, the three barrier instructions (DMB for ordering, DSB for completion, ISB for pipeline flush) with their option variants, acquire/release semantics (LDAR/STLR) that eliminate explicit barriers in lock primitives, LSE atomics that replace LDXR/STXR loops, TLB maintenance with the critical break-before-make sequence, and device memory mapping requirements for MMIO safety.

Practice Exercises

Message-Passing Litmus Test: Write a two-thread message-passing pattern using only STLR/LDAR (no DMB). Core 0 stores data then flag; Core 1 polls flag then reads data. Verify that acquire/release ordering prevents the consumer from seeing stale data. Then replace LDAR with plain LDR and explain why it could fail on ARM.
DMA Buffer Flush: Write a prepare_dma_buffer function that (a) writes data to a buffer, (b) cleans the D-cache range to PoC using DC CVAC, (c) executes DSB to ensure completion, and (d) returns the physical address. Explain why DC CVAC (clean only) is sufficient for outbound DMA but DC CIVAC (clean + invalidate) is needed if the DMA controller will write results back.
Break-Before-Make: Modify the update_pte example to handle a large page (2 MB block entry) that spans 512 contiguous PTEs. Your solution must invalidate TLB entries for all 512 pages and handle the case where another CPU is concurrently walking the page tables. Explain why a single TLBI VAE1IS is insufficient.

Cookie Consent

Cookie Preferences