Back to Technology

ARM Assembly Part 7: Memory Model, Caches & Barriers

February 26, 2026 Wasil Zafar 24 min read

Understand ARM's weakly-ordered memory model, the cache hierarchy (L1/L2/L3), data and instruction cache maintenance operations, DMB/DSB/ISB barrier instructions with all option variants, acquire/release semantics via LDAR/STLR, and TLB maintenance sequences.

Table of Contents

  1. Introduction
  2. ARM Memory Model
  3. Cache Hierarchy
  4. Cache Maintenance
  5. Memory Barrier Instructions
  6. Acquire/Release Semantics
  7. TLB Maintenance
  8. Device & Non-Cacheable Memory
  9. Conclusion & Next Steps

Introduction

Series Overview: This is Part 7 of our 28-part ARM Assembly Mastery Series. Parts 1–6 covered architecture history, ARM32, AArch64 registers, arithmetic, branching, and the AAPCS64 calling convention. Now we tackle the memory subsystem — arguably the most treacherous area of systems programming.

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 7
1
Architecture History & Core Concepts
ARMv1→v9, RISC philosophy, profiles
2
ARM32 Instruction Set Fundamentals
ARM vs Thumb, registers, CPSR, barrel shifter
3
AArch64 Registers, Addressing & Data Movement
X/W regs, addressing modes, load/store pairs
4
Arithmetic, Logic & Bit Manipulation
ADD/SUB, bitfield extract/insert, CLZ
5
Branching, Loops & Conditional Execution
Branch types, link register, jump tables
6
Stack, Subroutines & AAPCS
Calling conventions, prologue/epilogue
7
Memory Model, Caches & Barriers
Weak ordering, DMB/DSB/ISB, TLB
You Are Here
8
NEON & Advanced SIMD
Vector ops, intrinsics, media processing
9
SVE & SVE2 Scalable Vector Extensions
Predicate regs, gather/scatter, HPC/ML
10
Floating-Point & VFP Instructions
IEEE-754, scalar FP, rounding modes
11
Exception Levels, Interrupts & Vector Tables
EL0–EL3, GIC, fault debugging
12
MMU, Page Tables & Virtual Memory
Stage-1 translation, permissions, huge pages
13
TrustZone & ARM Security Extensions
Secure monitor, world switching, TF-A
14
Cortex-M Assembly & Bare-Metal Embedded
NVIC, SysTick, linker scripts, low-power
15
Cortex-A System Programming & Boot
EL3→EL1 transitions, MMU setup, PSCI
16
Apple Silicon & macOS ABI
ARM64e PAC, Mach-O, dyld, perf counters
17
Inline Assembly, GCC/Clang & C Interop
Constraints, clobbers, compiler interaction
18
Performance Profiling & Micro-Optimization
Pipeline hazards, PMU, benchmarking
19
Reverse Engineering & ARM Binary Analysis
ELF, disassembly, CFR, iOS/Android quirks
20
Building a Bare-Metal OS Kernel
Bootloader, UART, scheduler, context switch
21
ARM Microarchitecture Deep Dive
OOO pipelines, reorder buffers, branch predict
22
Virtualization Extensions
EL2 hypervisor, stage-2 translation, KVM
23
Debugging & Tooling Ecosystem
GDB, OpenOCD/JTAG, ETM/ITM, QEMU
24
Linkers, Loaders & Binary Format Internals
ELF deep dive, relocations, PIC, crt0
25
Cross-Compilation & Build Systems
GCC/Clang toolchains, CMake, firmware gen
26
ARM in Real Systems
Android, FreeRTOS/Zephyr, U-Boot, TF-A
27
Security Research & Exploitation
ASLR, PAC attacks, ROP/JOP, kernel exploit
28
Emerging ARMv9 & Future Directions
MTE, SME, confidential compute, AI accel

Why Memory Ordering Matters

Modern CPUs reorder memory accesses for performance — stores may complete out of order, loads may be satisfied from store buffers before writing to cache, and the memory system may merge adjacent writes. Without explicit ordering constraints, shared-memory algorithms like spinlocks, message queues, and RCU break silently. The bug won't crash immediately — it will manifest as rare data corruption under load, on specific hardware, at 3 AM. ARM's weak ordering model is significantly more aggressive than x86's Total Store Order (TSO), meaning code that works on Intel may break on ARM without proper barriers.

The x86 Trap: On x86, most memory operations are naturally ordered (TSO guarantees store-store and load-load ordering). Developers coming from x86 often accidentally write correct concurrent code without explicit barriers. When porting this code to ARM, hidden ordering assumptions break. The Linux kernel's ARM port required hundreds of barrier insertions that were invisible on x86.

ARM Memory Model

Weak vs Strong Ordering

x86 uses Total Store Order (TSO) — stores become globally visible in program order, and loads are never reordered before earlier loads. ARM uses a weaker model: loads and stores to Normal memory may be observed by other CPUs in any order unless device memory or barrier instructions are used. This means the hardware can reorder loads past stores, stores past stores, and loads past loads — all three are allowed. The ARM architecture manual guarantees only address dependency ordering (if load B depends on the address from load A, B won't execute before A).

Reorderingx86 (TSO)ARM (Weak)Implication
Load → LoadOrderedMay reorderNeed DMB LD or LDAR
Load → StoreOrderedMay reorderNeed DMB or acquire/release
Store → StoreOrderedMay reorderNeed DMB ST or STLR
Store → LoadMay reorderMay reorderBoth need barriers for this

Memory Attribute Types

AArch64 defines memory types in the MAIR_EL1 register (Memory Attribute Indirection Register). Page table entries reference MAIR indices to select the attribute for each page:

  • Normal — Cacheable, write-back or write-through, out-of-order access permitted. Used for RAM. The CPU may merge, reorder, and speculatively access Normal memory freely.
  • Device — Non-cacheable, with ordering guarantees. Four sub-types control hardware behaviour:
    • Device-nGnRnE — No Gathering, No Reordering, No Early-write-Ack (strictest, for MMIO)
    • Device-nGnRE — No Gathering, No Reordering, Early-write-Ack allowed
    • Device-nGRE — No Gathering, Reordering allowed, Early-write-Ack allowed
    • Device-GRE — Gathering, Reordering, and Early-write-Ack all allowed (loosest device type)
  • Normal Non-Cacheable — Byte-addressable like Normal memory but bypasses caches. Used for DMA buffers and framebuffers where coherency would add overhead.

Shareability Domains

Shareability determines which agents must observe a data write — think of it as the "broadcast radius" for coherency:

  • Non-Shareable (NSH) — Only the local CPU core sees updates. Suitable for thread-private data.
  • Inner Shareable (ISH) — All CPUs in the inner domain (typically one cluster, or all big.LITTLE cores). This is the correct scope for most kernel and user-space synchronisation.
  • Outer Shareable (OSH) — Inner domain plus outer agents (GPU, DMA controllers, interconnect-attached accelerators).
  • Full System (SY) — All agents including ones outside the coherency domain. Required for system-level TLB and cache maintenance broadcasts.

Barrier instructions (DMB, DSB) accept a shareability option that determines their scope. Using ISH when SY is needed (or vice versa) is a common source of rare concurrency bugs.

Cache Hierarchy

Set/Way Organisation

A cache is organised like a post office with mailboxes: each address maps to a set (row), and each set has W ways (slots). Total capacity = S sets × W ways × L line bytes. Address bits are split into three fields:

  • Offset (low bits) — byte position within the cache line (log2 L)
  • Index (middle bits) — selects the set (log2 S)
  • Tag (remaining high bits) — stored alongside data to identify which address is cached
LevelTypical SizeAssociativityLine SizeLatency
L1-I / L1-D32–64 KB each4-way64 bytes1–4 cycles
L2 (per-core)256 KB – 1 MB8-way64 bytes8–15 cycles
L3 (shared)4–32 MB16-way+64 bytes20–50 cycles

On a cache miss, the hardware must evict an existing line (typically using LRU or pseudo-random replacement policy) to make room for the new data. Higher associativity reduces conflict misses but increases access latency and area.

VIPT vs PIPT Caches

Modern ARM L1 caches are Virtually Indexed, Physically Tagged (VIPT). The index bits come from the virtual address (available immediately, no TLB wait), while the tag comparison uses the physical address (from TLB lookup that happens in parallel). This hides TLB latency — by the time the set is read, the physical tag is ready for comparison.

VIPT can cause cache aliasing: if two virtual addresses map to the same physical page but produce different set indices, the same data exists in two cache lines. ARM sidesteps this by ensuring the L1 cache size divided by associativity does not exceed the page size (a 64 KB 4-way cache uses 14 index+offset bits ≤ 16 KB per way = 14 bits, matching a 16 KB page). L2 and L3 caches are always PIPT — they index and tag using physical addresses only, eliminating aliasing entirely at the cost of waiting for the TLB translation.

Apple M-series VIPT trick: Apple's M1/M2 use 128 KB L1 caches (larger than typical) with 16 KB pages. They avoid aliasing by using 8-way associativity: 128 KB / 8 = 16 KB per way, keeping index+offset within 14 bits. The ARM architecture spec explicitly permits this design because the index bits fall within the page offset.

Cache Coherency (MESI)

ARM's interconnect (CCI, CCN, CMN, or DSU) maintains coherency using a MOESI-like protocol with five states per cache line:

  • Modified (M) — Dirty and exclusive. This core has the only valid copy, and it's been written to.
  • Owned (O) — Dirty but shared. This core supplies data to snoop requests; other cores hold stale Shared copies.
  • Exclusive (E) — Clean and exclusive. Matches main memory; can transition to M without bus traffic.
  • Shared (S) — Clean and shared. Multiple cores hold identical copies; must invalidate others before writing.
  • Invalid (I) — Line is not present or has been invalidated.

Hardware coherency is automatic for Inner Shareable Normal memory — when Core 0 stores to address X, the interconnect invalidates or updates X in Core 1's cache. Software barriers are only needed for:

  1. Ordering guarantees (when which accesses become visible matters)
  2. I-cache/D-cache coherency (the I-cache is not hardware-coherent with the D-cache)
  3. DMA coherency (devices outside the coherency domain)
Case Study Linux Kernel
False Sharing: The Silent Performance Killer

When two cores write to different variables that happen to share the same 64-byte cache line, the coherency protocol bounces the line between M states on each core — a phenomenon called false sharing. In the Linux kernel, per-CPU counters were originally packed into arrays. Accessing counter[cpu] on different CPUs caused constant invalidation traffic. The fix was __cacheline_aligned_in_smp — padding each counter to a full cache line (64 bytes), eliminating cross-core coherency traffic and improving throughput by 10–50× on contended paths.

Probing Cache Info (CTR_EL0/CLIDR_EL1)

// Read Cache Type Register
MRS  x0, CTR_EL0        // Cache Type Register EL0
// Bits [3:0]  = IminLine (log2 of I-cache line in words)
// Bits [19:16] = DminLine (log2 of D-cache line in words)
// Bits [27:24] = CWG     (cache writeback granule)

MRS  x1, CLIDR_EL1      // Cache Level ID Register
// Bits [2:0]  = L1 cache type (0=none,1=I,2=D,3=sep I+D,4=unified)

Cache Maintenance

Data Cache Operations (DC)

ARM provides cache maintenance instructions that operate on individual cache lines by virtual address. These are essential for DMA, self-modifying code, and firmware:

InstructionActionScopeUse Case
DC CIVACClean + InvalidatePoint of Coherency (PoC)DMA buffers (outbound + inbound)
DC CVACClean (writeback)PoCDMA outbound (device reads from RAM)
DC CVAPClean to PoPPoint of Persistence (ARMv8.2)Persistent memory / NVDIMMs
DC IVACInvalidate onlyPoCDangerous — discards dirty data. Use after DMA inbound only.
DC CSWClean by Set/WayAll levelsFull cache flush at power-down
DC CISWClean + Invalidate by S/WAll levelsFirmware boot / cache disable sequence
DC IVAC is dangerous: It discards dirty data without writing it back. If any cache line in the range has been modified, those modifications are lost. Only use DC IVAC for DMA receive buffers where you know the CPU hasn't written to the range. For all other cases, use DC CIVAC.
// Flush a range of memory to PoC (for DMA)
// x0 = start address, x1 = size
flush_dcache_range:
    MRS  x2, CTR_EL0
    UBFX x2, x2, #16, #4       // Extract DminLine
    MOV  x3, #4
    LSL  x3, x3, x2            // Cache line size in bytes
    SUB  x4, x3, #1            // Mask
    BIC  x5, x0, x4            // Align start down
1:  DC   CIVAC, x5             // Clean+Invalidate by VA to PoC
    ADD  x5, x5, x3
    CMP  x5, x1
    B.LS 1b
    DSB  SY                    // Ensure all maintenance complete
    RET

Instruction Cache Operations (IC)

The I-cache is not coherent with the D-cache in ARM — writing to memory via stores (which go through the D-cache) does not automatically update the I-cache. Three maintenance instructions handle I-cache invalidation:

  • IC IALLU — Invalidate all I-cache lines to the Point of Unification (PoU). Requires EL1+ privilege. Affects only the executing CPU.
  • IC IALLUIS — Same as IALLU but broadcasts to all CPUs in the Inner Shareable domain. Used after loading kernel modules or patching code that may execute on any core.
  • IC IVAU — Invalidate I-cache by VA to PoU. Available at EL0 (user-space JIT compilers use this). Only invalidates a single cache line.

The Point of Unification (PoU) is the level where the I-cache and D-cache share a common view — typically the L2 cache. After cleaning D-cache lines to PoU (DC CVAU), invalidating I-cache to PoU (IC IVAU) ensures the instruction fetch path sees the new code.

I-Cache Coherency After JIT

// JIT code generation: write code, then execute it
// x0 = start of new code, x1 = end of new code
jit_flush:
    // 1. Flush D-cache to PoU so the I-cache can see the writes
    MRS  x2, CTR_EL0
    UBFX x3, x2, #16, #4
    MOV  x4, #4
    LSL  x4, x4, x3            // D-cache line size
    BIC  x5, x0, x4
.Ldc: DC CVAU, x5              // Clean D-cache line to PoU
    ADD  x5, x5, x4
    CMP  x5, x1
    B.LS .Ldc
    DSB  ISH                   // Ensure D-cache maintenance complete
    // 2. Invalidate I-cache to PoU
    AND  x6, x2, #15           // IminLine
    MOV  x7, #4
    LSL  x7, x7, x6            // I-cache line size
    BIC  x8, x0, x7
.Lic: IC IVAU, x8              // Invalidate I-cache line to PoU
    ADD  x8, x8, x7
    CMP  x8, x1
    B.LS .Lic
    DSB  ISH                   // Ensure I-cache maintenance complete
    ISB                        // Synchronise fetch pipeline
    RET

Memory Barrier Instructions

SummaryThree Barrier Instructions
Quick Reference: DMB vs DSB vs ISB

DMB — orders memory accesses. Subsequent accesses won't reorder before prior ones. Does NOT stall the pipeline.

DSB — stronger: all prior memory accesses AND cache/TLB maintenance complete before anything after executes. Stalls until completion.

ISB — flushes the instruction pipeline and refetches. Required after changing system registers, installing page tables, or code modification.

DMB — Data Memory Barrier

DMB ensures all memory accesses (loads and/or stores) before the barrier are observed by the specified domain before any memory access after the barrier. Think of DMB as a one-way gate: earlier accesses must pass through before later ones are allowed.

Crucially, DMB does not stall the pipeline — the CPU can continue executing non-memory instructions (arithmetic, branches) while waiting for ordering to be established. It also does not wait for cache/TLB maintenance operations to complete (that requires DSB). DMB is the correct barrier for most synchronisation patterns:

  • Message passing: Store data, DMB, store flag — ensures the consumer sees data before the flag
  • Lock release: Store shared data, DMB, store lock=0 — ensures all critical section writes are visible before the lock appears free
  • Device register sequences: Write config register, DMB, write enable register
// Producer: publish a message
STR  x1, [x0]         // Write data
DMB  ISH               // Ensure data write visible before flag write
STR  x2, [x3]         // Write flag = 1 (consumer side sees data first)

DSB — Data Synchronisation Barrier

DSB is strictly stronger than DMB. In addition to ordering memory accesses, DSB stalls the pipeline until all prior memory accesses and all pending cache/TLB maintenance operations have completed to the specified domain. No instruction of any kind (not even non-memory instructions) executes after DSB until completion.

Use DSB when you need to guarantee completion, not just ordering:

  • Before ISB: System register changes (MSR to SCTLR, TCR, TTBR) need DSB to ensure the write propagates before ISB flushes the pipeline
  • After cache maintenance: DC/IC operations are not guaranteed complete until DSB
  • After TLBI: TLB invalidation broadcasts are asynchronous; DSB waits for all invalidations to complete across all cores
  • Before WFI/WFE: Ensures all stores are visible before the core enters low-power state

ISB — Instruction Synchronisation Barrier

ISB is fundamentally different from DMB and DSB — it operates on the instruction pipeline, not memory. ISB flushes all instructions that were fetched or decoded before the ISB and forces the CPU to re-fetch everything after it. This guarantees that any changes to system state (control registers, page tables, branch prediction) take effect for subsequent instructions.

ISB does not wait for memory accesses to complete — always pair with DSB first when modifying system registers via MSR:

  1. MSR SCTLR_EL1, x0 — Write the system register
  2. DSB SY — Ensure the MSR write propagates to all observers
  3. ISB — Flush pipeline so subsequent instructions use the new configuration
// Enable the MMU (EL1)
MRS  x0, SCTLR_EL1
ORR  x0, x0, #1        // Set M bit
MSR  SCTLR_EL1, x0
DSB  SY                // Ensure write to SCTLR visible
ISB                    // Flush pipeline; now executing with MMU on

Barrier Option Variants

Both DMB and DSB accept an option that encodes scope × direction. ISB always uses SY (full system). The option determines which agents and which access types are constrained:

OptionScopeDirectionTypical Use
SYFull SystemAll (LD+ST)System-level TLBI, cache ops, device I/O
STFull SystemStore → StoreOrder stores to different devices
LDFull SystemLoad → Load/StoreOrder loads before dependent stores
ISHInner ShareableAllMost common: kernel/user sync between CPUs
ISHSTInner ShareableStore → StoreRelease stores in spinlocks
ISHLDInner ShareableLoad → Load/StoreAcquire loads in spinlocks
OSHOuter ShareableAllGPU/DMA coherency
NSHNon-ShareableAllThread-private ordering (rare)
Rule of thumb: Use DMB ISH / DSB ISH for CPU-to-CPU synchronisation. Use DSB SY for system-level operations (TLBI, DC, IC, MSR). Using SY everywhere works but is slower — ISH only broadcasts within the CPU cluster.

Acquire/Release Semantics

LDAR / STLR Instructions

ARMv8.0 introduced load-acquire (LDAR) and store-release (STLR) instructions that embed barrier semantics directly into the load/store, eliminating the need for separate DMB instructions in most synchronisation patterns:

  • LDAR (Load-Acquire) — No subsequent load or store can be reordered before this load. This is the "acquire" half of a lock — everything inside the critical section stays after the lock acquisition.
  • STLR (Store-Release) — No prior load or store can be reordered after this store. This is the "release" half — all critical section writes complete before the unlock becomes visible.

These map directly to C11/C++ memory_order_acquire and memory_order_release. The exclusive variants (LDAXR/STLXR) combine exclusivity (for atomic read-modify-write) with acquire/release semantics. This gives you a complete spinlock without any explicit barrier instructions:

// Spinlock acquire using LDAXR/STXR + LDAR/STLR
spin_lock:
    MOV  w1, #1
.Ltry:
    LDAXR w2, [x0]       // Load-Acquire Exclusive
    CBNZ  w2, .Ltry      // Spin if locked
    STXR  w3, w1, [x0]   // Store Exclusive
    CBNZ  w3, .Ltry      // Retry if store failed
    RET

spin_unlock:
    STLR  wzr, [x0]      // Store-Release zero (unlock)
    RET

LDAPR (ARMv8.3 Relaxed Acquire)

LDAPR (Load-AcquirePC) was added in ARMv8.3 as a weaker alternative to LDAR. While LDAR enforces full acquire ordering (no reordering of any subsequent access), LDAPR only prevents reordering of the load relative to subsequent stores to the same address. Specifically, LDAPR allows later loads to other addresses to be reordered before it — a relaxation that can improve performance in read-heavy workloads.

LDAPR maps to the C++ memory_order_consume concept (data-dependency ordering). In practice, the Linux kernel uses LDAPR for its smp_load_acquire() implementation on ARMv8.3+ systems, falling back to LDAR on older cores. The performance difference is measurable on workloads with frequent acquire loads followed by independent reads — LDAPR allows the CPU's load queue to service those reads out of order.

Ordering hierarchy: LDR (no ordering) < LDAPR (reorder stores only by address dependency) < LDAR (full acquire: nothing reorders before this load). Choose the weakest that satisfies your algorithm's correctness requirements.

LSE Atomic Operations (ARMv8.1)

The Large System Extensions (LSE), mandatory from ARMv8.1, add single-instruction atomic read-modify-write operations that replace LDXR/STXR retry loops:

InstructionOperationC11 Equivalent
LDADD / LDADDA / LDADDALAtomic load + addatomic_fetch_add()
LDCLR / LDCLRA / LDCLRALAtomic load + clear bitsatomic_fetch_and(~val)
LDSET / LDSETA / LDSETALAtomic load + set bitsatomic_fetch_or()
LDEORAtomic load + XORatomic_fetch_xor()
SWP / SWPA / SWPALAtomic swapatomic_exchange()
CAS / CASA / CASALCompare and swapatomic_compare_exchange()
CASPCompare and swap pair (128-bit)__atomic_compare_exchange_16()

The suffix A adds acquire semantics, L adds release, and AL adds both (sequential consistency). LSE atomics are significantly faster than LDXR/STXR loops on large systems because the hardware can implement them without exclusive monitors — the interconnect handles the atomic operation at the cache level, reducing coherency traffic.

Case Study Performance
MySQL on Graviton2: LSE Atomics Impact

When AWS released Graviton2 (ARMv8.2 with LSE), MySQL compiled with -march=armv8.1-a (enabling LSE) showed 30–40% higher throughput on contended workloads compared to the same binary compiled for ARMv8.0 (using LDXR/STXR loops). The improvement came primarily from LDADD replacing reference count increments and CAS replacing mutex acquisition loops. The Linux kernel added runtime LSE detection in 2019, patching atomic operations at boot time based on ID_AA64ISAR0_EL1.

TLB Maintenance

TLBI Operations

TLB invalidation instructions follow a naming convention: TLBI <type><level><shareability>. The IS suffix broadcasts the invalidation to all CPUs in the Inner Shareable domain — without it, only the local CPU's TLB is affected:

InstructionScopeUse Case
TLBI VMALLE1ISAll EL1 entries, broadcastFull TLB flush after changing ASID or TCR
TLBI VAE1IS, XtSingle VA, all levels, broadcastUnmapping a page (general)
TLBI VALE1IS, XtSingle VA, last-level only, broadcastPermission change (PTE-only update)
TLBI ASIDE1IS, XtAll entries for an ASID, broadcastProcess exit / address space teardown
TLBI VMALLE1All EL1 entries, local CPU onlySingle-core boot or test scenarios

Always follow TLBI with DSB ISH to ensure the invalidation completes on all cores before proceeding, then ISB if the current CPU's translations were affected. Missing the DSB creates a race where another core may still use a stale TLB entry.

Break-Before-Make Sequence

// Correct page table entry update (Break-Before-Make)
// x0 = PTE address, x1 = new PTE value
update_pte:
    STR  xzr, [x0]         // 1. Write invalid entry (break)
    DSB  ISHST             // Ensure invalid entry visible
    TLBI VAE1IS, x0        // 2. Invalidate TLB for this VA
    DSB  ISH               // Ensure TLB invalidation complete
    STR  x1, [x0]          // 3. Write new valid entry (make)
    DSB  ISH               // Ensure new entry visible
    ISB                    // Use new mapping immediately
    RET

Device & Non-Cacheable Memory

Memory-mapped I/O registers must be mapped as Device memory in the page tables — typically Device-nGnRnE (no Gathering, no Reordering, no Early-write-Acknowledgement), the strictest device type. This prevents the CPU from:

  • Gathering: Merging two 32-bit writes into a single 64-bit write (which the peripheral wouldn't understand)
  • Reordering: Issuing writes to different MMIO registers out of order (which could violate hardware init sequences)
  • Early acknowledgement: Reporting a write as "complete" before it actually reaches the device bus (which could miss error status)

Accessing MMIO through Normal memory attributes causes undefined behaviour — the CPU may speculatively read device registers (triggering side effects), cache device responses (returning stale data), or merge writes. In Linux, ioremap() maps device memory as Device-nGnRnE by default. For DMA buffers that the CPU also reads/writes, Normal Non-Cacheable is appropriate — it prevents caching but still allows merging and reordering (acceptable for bulk data transfers).

Case Study Embedded
UART Init: When Memory Type Matters

A common embedded bug: mapping a UART's control registers as Normal cacheable memory. The CPU caches the status register value, so polling UART_FR (flag register) for "transmit FIFO not full" returns a stale cached value — the loop either spins forever or overwrites data in the FIFO. Mapping as Device-nGnRnE forces every load to read from the actual hardware register, and every store to reach the FIFO in program order.

Key Insight: The three-barrier pattern for system register changes is always DSB → MSR → DSB → ISB. The first DSB ensures prior cache/TLB maintenance is complete; the MSR updates the register; the second DSB ensures the write propagates; ISB flushes the pipeline to use the new value. Omitting any step causes intermittent, hardware-specific failures.

Conclusion & Next Steps

In this part we explored the foundations of ARM's memory system — from the weakly-ordered memory model that gives hardware maximum reordering freedom, through memory attribute types (Normal, Device, Non-Cacheable) and shareability domains that control coherency scope, to the set-associative cache hierarchy with VIPT L1 and PIPT L2/L3, maintained by the MOESI coherency protocol.

We covered the practical tools: DC operations for flushing data to main memory (essential for DMA), IC operations for I-cache coherency after JIT compilation, the three barrier instructions (DMB for ordering, DSB for completion, ISB for pipeline flush) with their option variants, acquire/release semantics (LDAR/STLR) that eliminate explicit barriers in lock primitives, LSE atomics that replace LDXR/STXR loops, TLB maintenance with the critical break-before-make sequence, and device memory mapping requirements for MMIO safety.

Practice Exercises

  1. Message-Passing Litmus Test: Write a two-thread message-passing pattern using only STLR/LDAR (no DMB). Core 0 stores data then flag; Core 1 polls flag then reads data. Verify that acquire/release ordering prevents the consumer from seeing stale data. Then replace LDAR with plain LDR and explain why it could fail on ARM.
  2. DMA Buffer Flush: Write a prepare_dma_buffer function that (a) writes data to a buffer, (b) cleans the D-cache range to PoC using DC CVAC, (c) executes DSB to ensure completion, and (d) returns the physical address. Explain why DC CVAC (clean only) is sufficient for outbound DMA but DC CIVAC (clean + invalidate) is needed if the DMA controller will write results back.
  3. Break-Before-Make: Modify the update_pte example to handle a large page (2 MB block entry) that spans 512 contiguous PTEs. Your solution must invalidate TLB entries for all 512 pages and handle the case where another CPU is concurrently walking the page tables. Explain why a single TLBI VAE1IS is insufficient.

Next in the Series

In Part 8: NEON & Advanced SIMD, we shift to high-throughput vector processing — lane notation, vector arithmetic, permutations, saturating arithmetic, dot-product instructions, and structured load/store patterns for image and signal processing.

Technology