ARMv9.0 (2021) was the first major architecture version since ARMv8 in 2011. It made SVE2 the baseline SIMD extension for ARMv9-A cores, introduced the Realm Management Extension for confidential computing, and continued a decade-long trend of adding new feature flags instead of breaking instruction semantics. This final instalment surveys the most impactful ARMv9 additions — MTE, SME, CCA, and the AI matrix instructions — and closes with a glance at where ARM is heading next.
ARMv9 Overview
Mandatory and New Features
ARMv9 key feature changes vs ARMv8
| Feature | ARMv8 | ARMv9 |
|---|---|---|
| SIMD baseline | NEON (mandatory for A-profile) | SVE2 mandatory for ARMv9-A cores |
| Memory safety | PAC (8.3), BTI (8.5), MTE optional (8.5) | MTE more deeply integrated across profiles |
| Confidential compute | TrustZone (S/NS worlds) | CCA adds Realm world (4th security state) |
| Matrix compute | FEAT_DotProd (8.4), I8MM (8.6) | SME (Scalable Matrix Extension) new in 9.2 |
| Branch recording | ETM/PTM trace | BRBE (Branch Record Buffer Extension) |
| Transactional memory | Not present | TME (Transactional Memory Extension) — optional |
| Debug | Self-hosted debug v8.4 | TRBE (Trace Buffer Extension) in-memory trace |
ARMv9 Silicon Timeline
ARMv9 processor releases
- Cortex-A510/A710/X2 (announced 2021, in devices 2022) — First ARMv9 CPU cores; Cortex-A510, the power-efficient core, implements SVE2 at 128-bit VL (vector length).
- Cortex-A520/A720/X4 (2023) — ARMv9.2, adds MTE support; Cortex-X4 paired with Immortalis-G720 GPU in Snapdragon 8 Gen 3.
- Neoverse V2 (2022) — ARMv9 server core (AWS Graviton4, Google Axion); 4×128-bit SVE2, MTE, DDR5 support.
- Apple M-series — roughly ARMv8.5/8.6-class (arm64e); PAC everywhere, but no SVE2 — Apple instead pairs 128-bit NEON with its own AMX matrix units.
- Cortex-A725/X925 (2024) — ARMv9.2+, SME2 debut in mobile; Cortex-X925 reaches ~3.85 GHz peak.
Memory Tagging Extension (MTE)
MTE (introduced in ARMv8.5-A as FEAT_MTE, and a headline ARMv9 security feature) is a hardware mechanism for detecting spatial and temporal memory safety violations at low runtime cost. Every 16-byte granule of memory has a 4-bit allocation tag held in dedicated tag storage. Every pointer carries a 4-bit logical tag in its top byte (bits [59:56]). A tag mismatch on a load or store can fault synchronously, fault asynchronously, or be silently ignored, depending on system register configuration.
MTE Instructions
; ----- MTE Instruction Reference -----
; IRG Xd, Xn [, Xm] — Insert Random Tag
; Generates a random 4-bit tag, inserts into bits [59:56] of Xd
; Xn provides the base address; optional Xm is a tag exclusion mask
IRG x0, x1 ; x0 = tagged ptr (random tag in [59:56]), base=x1
; ADDG Xd, Xn, #imm6, #imm4 — Add with Tag
; Adds a scaled offset to Xn AND adds #imm4 to the logical tag (mod 16)
ADDG x1, x0, #16, #2 ; x1 = x0+16, tag = tag(x0)+2 (next 16B granule)
; STG Xt, [Xn {, #simm9}] — Store Tag
; Stores the tag from bits [59:56] of Xt to the allocation tag of [Xn]
STG x0, [x0] ; allocate tag at the address x0 points to
; STGM Xt, [Xn] — Store Tag, Multiple (a block of granules at once)
; EL1+ only — block size is IMPLEMENTATION DEFINED (see GMID_EL1.BS)
; LDG Xd, [Xn {, #simm9}] — Load Tag
; Loads the 4-bit allocation tag from [Xn] into bits [59:56] of Xd
LDG x2, [x0] ; x2 = x0 with allocation tag filled in [59:56]
; STGP Xt1, Xt2, [Xn{, #simm7}] — Store Tag and Pair
; Stores two registers and sets tag in one instruction (like STP+STG)
; STZG Xt, [Xn {, #simm9}] — Store Zero with Tag
; Zeroes the 16-byte granule and sets its allocation tag
; Example: allocating a tagged buffer
; Assume x0 = untagged base address of 64-byte allocation
mte_alloc_64:
IRG x0, x0 ; assign random tag to pointer
STG x0, [x0, #0] ; tag granule 0
STG x0, [x0, #16] ; tag granule 1
STG x0, [x0, #32] ; tag granule 2
STG x0, [x0, #48] ; tag granule 3 (all 4 granules get same tag)
ret ; caller uses tagged x0 for all access
# Enable MTE for a process (Linux 5.10+, requires kernel CONFIG_ARM64_MTE)
# prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC, 0, 0, 0)
# In C:
# #include <sys/prctl.h>
# prctl(PR_SET_TAGGED_ADDR_CTRL,
# PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xffff << PR_MTE_TAG_SHIFT),
# 0, 0, 0);
# Build with GCC/Clang MTE sanitizer (software-assisted MTE, Android 12+)
clang --target=aarch64-linux-android33 \
    -fsanitize=memtag \
    -fsanitize-memtag-mode=sync \
    -march=armv8.5-a+memtag \
    -O2 -o mte_test main.c   # sync mode: synchronous fault on mismatch
# Verify MTE availability on system
grep mte /proc/cpuinfo   # an "mte" entry in Features confirms support
Android MTE Deployment
MTE rollout strategy (Android 13/14)
- Pixel 8/8 Pro (Tensor G3, ARMv9.2) — First production Android devices to offer MTE, enabled in async mode for the system allocator (Scudo). Catches heap overflows and UAF asynchronously at roughly 1% perf overhead.
- Async vs sync modes: Async — the tag check happens off the critical path and the fault is reported asynchronously (lowest overhead, misses some races). Sync — fault on every mismatching access; higher overhead, precise attribution. Used for development/fuzzing.
- Scudo allocator integration: Android's production allocator Scudo uses MTE natively — `IRG` on each `malloc()` to assign a random pointer tag, `STGM` to initialise the tags, and retagging on `free()` to catch use-after-free.
- Stack tagging: `-fsanitize=memtag-stack` tags stack variables at function entry; enables detection of stack buffer overflows and dangling stack pointer use at far lower overhead than software AddressSanitizer.
Scalable Matrix Extension (SME)
SME (ARMv9.2-A, FEAT_SME) is designed for matrix operations at the core of modern ML inference and scientific computing. It introduces a new 2D register array called ZA, a new execution mode (Streaming SVE mode), and a set of outer-product accumulation instructions.
ZA Storage, Modes, and Outer Product
; ===== SME Mode Entry/Exit =====
SMSTART ; enter Streaming SVE mode AND enable ZA (zeroed on enable)
SMSTART SM ; enter Streaming SVE mode only (ZA unchanged)
SMSTART ZA ; enable ZA array only (stay in normal SVE mode)
SMSTOP ; exit Streaming SVE mode AND disable ZA (PSTATE.SM=0, PSTATE.ZA=0)
SMSTOP ZA ; disable ZA only (PSTATE.ZA=0, ZA contents lost)
; RDSVL — read streaming vector length
RDSVL x0, #1 ; x0 = SVL in bytes × 1 (the immediate is a multiplier)
; ===== ZA Tile Access =====
; ZA tiles are named by element size: ZA0.B, ZA0.H, ZA0.S, ZA0.D, ZA0.Q
; For SVL=128: ZA0.S is a 4×4 tile of single-precision floats
; Load row 0 of ZA0.S from memory:
LD1W {ZA0H.S[W12, 0]}, P0/Z, [x0] ; load row indexed by w12+0
LD1W {ZA0H.S[W12, 1]}, P0/Z, [x0, x1, LSL #2]
; ===== Outer Product — FP32 MOPA =====
; FMOPA ZAd.S, Pg/M, Pg/M, Zn.S, Zm.S
; ZAd.S += outer_product(Zn.S, Zm.S) [masked by two predicate registers]
; This performs: for each (row i, col j): ZA0.S[i][j] += Zn[i] * Zm[j]
; A single FMOPA instruction accumulates one full SVL×SVL outer product!
ZERO {ZA} ; zero all ZA tiles
FMOPA ZA0.S, P0/M, P1/M, Z0.S, Z1.S ; ZA0.S += Z0 ⊗ Z1 (outer product)
FMOPA ZA0.S, P0/M, P1/M, Z2.S, Z3.S ; accumulate more outer products
; Store result row-by-row
ST1W {ZA0H.S[W12, 0]}, P0, [x2]
ST1W {ZA0H.S[W12, 1]}, P0, [x2, x3, LSL #2]
// SME intrinsics in C (arm_sme.h, ACLE 2023)
#include <arm_sme.h>
// Matrix-multiply accumulate using SME intrinsics
// Computes C += A * B where A is M×K, B is K×N, C is M×N (all FP32)
// ACLE SME keyword attributes: __arm_streaming runs the function in
// Streaming SVE mode; __arm_inout("za") declares it reads and writes ZA
void sme_matmul_fp32(float *C, const float *A, const float *B,
int M, int K, int N) __arm_streaming __arm_inout("za")
{
uint64_t vl = svcntw(); // SVL in 32-bit elements
svbool_t pg_all = svptrue_b32();
svfloat32_t va, vb;
// Zero the accumulator tile
svzero_za();
for (int k = 0; k < K; k++) {
// Load column k of A (A stored column-major; assumes M <= vl)
va = svld1_f32(pg_all, A + k * M);
// Load row k of B (B stored row-major; assumes N <= vl)
vb = svld1_f32(pg_all, B + k * N);
// Outer product accumulate into ZA tile 0
svmopa_za32_f32_m(0, pg_all, pg_all, va, vb);
}
// Store ZA tile 0 result to C
for (int i = 0; i < M; i++) {
svst1_hor_za32(0, i, pg_all, C + i * N);
}
}
ARM Confidential Compute Architecture (CCA)
CCA (part of ARMv9) addresses cloud multi-tenancy security: even a malicious hypervisor or compromised Secure-world firmware should be unable to read a tenant's VM memory. CCA achieves this through a fourth security world — Realm — managed by a new firmware component, the Realm Management Monitor (RMM), running at R-EL2.
ARM four-world security model
| World | Exception Level | Trust | Content |
|---|---|---|---|
| Normal | EL0/EL1/EL2 | Untrusted | Guest OS, hypervisor (KVM, Xen, ESXi) |
| Secure | S-EL0/S-EL1 | Trusted by vendor | OP-TEE, trusted OS and services |
| Realm | R-EL0/R-EL1/R-EL2 | Attestable by tenant | Realm VM tenant code + RMM — isolated from hypervisor |
| Root | EL3 | Silicon root of trust | EL3 Monitor (TF-A), GPT management |
Granule Protection Tables (GPT)
GPTs are a new physical-address protection check controlled by EL3, applied after normal address translation. Every physical granule (4 KB at the smallest granule size) has a GPT entry — a 4-bit GPI field — identifying which world may access it: Normal, Secure, Realm, or Root. A Normal-world hypervisor attempting to read a Realm-assigned granule receives a GPF (Granule Protection Fault) rather than the data.
// Realm lifecycle (simplified — RMI SMC calls issued by the NS host to the RMM)
// These calls are defined by the Realm Management Interface (RMI); FID values below are illustrative
#define SMC_RMM_REALM_CREATE 0xC4000150
#define SMC_RMM_REALM_ACTIVATE 0xC4000151
#define SMC_RMM_REALM_DESTROY 0xC4000152
#define SMC_RMM_REC_CREATE 0xC4000158 // Realm Execution Context = vCPU
#define SMC_RMM_REC_ENTER 0xC400015A // enter a REC (run the Realm vCPU)
// Step 1: Create Realm descriptor (Host NS hypervisor calls RMM)
// x0=SMC_RMM_REALM_CREATE, x1=rd_addr, x2=realm_params_addr
// RMM allocates a Realm Descriptor granule, sets GPT entry to REALM
// Step 2: Create REC (virtual CPU for the Realm VM)
// x0=SMC_RMM_REC_CREATE, x1=rec_addr, x2=rd_addr, x3=mpidr
// Step 3: Enter REC — transitions CPU from Normal EL2 to Realm R-EL1
// x0=SMC_RMM_REC_ENTER, x1=rec_addr, x2=run_object_addr
// RMM saves/restores NS register state, loads Realm state, ERET to R-EL1
// The Realm VM cannot be read by the host hypervisor — GPT enforces physical isolation
// Realm's identity (measurement) is attested via CCA attestation token (CBOR/COSE)
AI/ML Acceleration Instructions
ARM has progressively added inference-optimised instructions across ARMv8.4–ARMv9, targeting quantised integer and reduced-precision floating-point workloads that dominate on-device ML.
DotProd (FEAT_DotProd) and I8MM (FEAT_I8MM)
; ===== SDOT / UDOT — 8-bit dot product (FEAT_DotProd, ARMv8.4) =====
; Vector form: SDOT Vd.4S, Vn.16B, Vm.16B
; For each of the 4 output int32 lanes:
;   Vd[i].S += Vn[4i..4i+3].B (signed) · Vm[4i..4i+3].B (signed)
; Computes 4 × 4-element signed dot products per instruction (16 MACs/instr)
; (An indexed form, SDOT Vd.4S, Vn.16B, Vm.4B[lane], reuses one 4-byte group of Vm)
; Load two quantised int8 activation vectors (16 elements each)
LD1 {v0.16B}, [x0], #16 ; load 16 int8 activations
LD1 {v1.16B}, [x1], #16 ; load 16 int8 weights
; Accumulate into int32 result register v2.4S
SDOT v2.4S, v0.16B, v1.16B ; v2 += dot(v0[0:4],v1[0:4]) | dot(v0[4:8],v1[4:8]) | ...
; ===== SMMLA / UMMLA — 8-bit matrix multiply accumulate (FEAT_I8MM, ARMv8.6) =====
; SMMLA Vd.4S, Vn.16B, Vm.16B
; Multiplies a 2×8 int8 matrix (Vn) by an 8×2 int8 matrix (Vm, held 2×8
; row-major and used transposed), accumulating a 2×2 int32 tile
; Each instruction computes 4 outputs × 8 MACs each = 32 MACs!
SMMLA v4.4S, v0.16B, v1.16B ; 2×2 int32 accumulation from two 2×8 int8 matrices
; v4.S[0] = sum_k(Vn[0,k]*Vm[k,0]) for k=0..7 (row0·col0)
; v4.S[1] = sum_k(Vn[0,k]*Vm[k,1]) for k=0..7 (row0·col1)
; v4.S[2] = sum_k(Vn[1,k]*Vm[k,0]) for k=0..7 (row1·col0)
; v4.S[3] = sum_k(Vn[1,k]*Vm[k,1]) for k=0..7 (row1·col1)
BFloat16 Matrix Multiply (FEAT_BFloat16)
BFloat16 (BF16) is a 16-bit floating-point format with the same 8-bit exponent as FP32 but only 7 mantissa bits (vs 23 for FP32). It is the dominant training and inference format in ML accelerators (Google TPU, NVIDIA A100) due to better dynamic range than FP16.
; ===== BFMMLA — BFloat16 matrix multiply accumulate (FEAT_BFloat16, ARMv8.6) =====
; BFMMLA Vd.4S, Vn.8H, Vm.8H
; Multiplies a 2×4 BF16 matrix (Vn) by a 4×2 BF16 matrix (Vm, used transposed),
; accumulating a 2×2 FP32 tile — 16 MACs per instruction, ideal for inference
; Load BF16 input row (8 × BF16 = 128 bits = one 128-bit register)
LD1 {v0.8H}, [x0], #16 ; 8 BF16 activations
LD1 {v1.8H}, [x1], #16 ; 8 BF16 weights
; Accumulate 2×2 FP32 result
BFMMLA v2.4S, v0.8H, v1.8H ; v2.4S += matmul(v0_matrix_2x4, v1_matrix_4x2)
; ===== BFCVT — Convert FP32 to BF16 =====
; BFCVTN Vd.4H, Vn.4S — convert 4 FP32 → 4 BF16 (narrow, lower half)
; BFCVTN2 Vd.8H, Vn.4S — convert 4 FP32 → 4 BF16 (narrow, upper half)
; BFCVT Hd, Sn — convert one FP32 → BF16 (scalar)
BFCVTN v3.4H, v4.4S ; pack FP32 result back to BF16 for storage
ML instruction evolution — multiply-accumulate throughput per instruction
| Feature | Instruction | MACs/instr (128-bit) | Data type |
|---|---|---|---|
| NEON baseline | FMLA V.4S | 4 | FP32 |
| FEAT_DotProd | SDOT V.4S | 16 | INT8×INT8→INT32 |
| FEAT_BFloat16 | BFMMLA V.4S | 16 | BF16×BF16→FP32 |
| FEAT_I8MM | SMMLA V.4S | 32 | INT8×INT8→INT32 |
| SME MOPA | FMOPA ZA0.S (SVL=512) | 256 | FP32×FP32→FP32 |
ARMv10, CHERI, and the Horizon
Emerging directions in ARM architecture research
- ARMv10 (speculative): ARM has not publicly announced ARMv10. Industry speculation points to deeper confidential-computing substrate, enhanced SME3/4 for LLM inference, and possible native CHERI-style capability checks integrated rather than bolted on.
- CHERI (Capability Hardware Enhanced RISC Instructions): A Cambridge/ARM research project adding hardware capability support to AArch64. Each pointer is a 128-bit capability (address + bounds + permissions + tag bit). The Morello evaluation board (Neoverse N1 with CHERI extensions) shipped in 2022. CHERI eliminates entire classes of spatial/temporal memory safety errors in hardware — stronger than MTE's probabilistic 4-bit tags, but with higher pointer storage overhead.
- SME2 (ARMv9.2): Extends SME with multi-vector instructions, lookup tables (`LUTI2`/`LUTI4` for non-linear activation functions), and support for structured-sparse matrix operations for pruned model inference.
- MTE3 (FEAT_MTE3, ARMv8.7): Adds asymmetric tag checking — loads fault synchronously while stores are checked asynchronously — a middle ground between precision and overhead for production deployment.
- BRBE (Branch Record Buffer Extension): Hardware branch history ring buffer — like Intel LBR. Software-controlled, captures up to 64 last taken branches per core with timestamps and EL information. Enables low-overhead production profiling without ETM streaming infrastructure.
- RME v1.1 / CCA silicon: First commercial CCA silicon expected 2025–2026 in cloud server platforms. NVIDIA GH200 descendants and AWS Graviton evolution are candidate first adopters. This enables fully attested confidential VMs on ARM-hosted public cloud.
Case Study: ARM in the Cloud — The Graviton Revolution
The most dramatic real-world validation of ARMv9's capabilities is happening in the cloud, where ARM-based server processors have gone from a curiosity to a genuine x86 challenger in under six years.
AWS Graviton — The ARM Server Timeline
| Generation | Year | Core Design | Architecture | Key Specs |
|---|---|---|---|---|
| Graviton1 | 2018 | Cortex-A72 (16 cores) | ARMv8.0 | First ARM EC2 instance (A1), budget-tier |
| Graviton2 | 2020 | Neoverse N1 (64 cores) | ARMv8.2 | 40% better price-performance vs x86 M5 |
| Graviton3 | 2021 | Neoverse V1 (64 cores) | ARMv8.4+ | SVE 256-bit, DDR5, 25% faster than G2 |
| Graviton4 | 2023 | Neoverse V2 (96 cores) | ARMv9.0 | SVE2, MTE-capable, 536 GiB/s memory BW, 30% faster than G3 |
Market Impact: ARM server market share grew from <1% (2018) to ~10% (2024), with industry projections of 20%+ by 2027. Major cloud providers now all offer ARM instances: Azure Cobalt (2024, Neoverse N2, 128 cores, ARMv9), Google Axion (2024, Neoverse V2, ARMv9), and NVIDIA Grace (2023, Neoverse V2/Hopper GPU, 480 GB LPDDR5X, 900 GB/s NVLink-C2C for AI supercomputing). The "ARM can't do servers" myth died with Graviton2 — ARMv9 is now accelerating the transition.
Hardware telemetry matters at this scale too: features such as BRBE enable always-on profiling (`perf record` with minimal overhead) — vital for fleet-wide optimisation across millions of instances.
The Evolution of ARM Architecture Versions (1985–2025)
Four decades of ARM architecture innovation
| Era | Architecture | Key Innovation | Impact |
|---|---|---|---|
| 1985 | ARMv1 | Sophie Wilson's 26-bit RISC design, 25,000 transistors | Acorn RISC Machine — proved RISC could be simple and power-efficient |
| 1994–2002 | ARMv4/v5/v6 | Thumb mode, DSP extensions, Java acceleration | Mobile revolution — Nokia, Palm, iPod all ran ARM cores |
| 2005 | ARMv7 | NEON SIMD, TrustZone, Cortex-A8 | iPhone 3GS (2009) ran Cortex-A8 — ARM became the smartphone ISA |
| 2011 | ARMv8.0 | 64-bit AArch64, crypto extensions (AES/SHA) | ARM enters servers — first serious x86 challenge since Itanium |
| 2016–2020 | ARMv8.x | PAC (8.3), BTI (8.5), MTE (8.5), I8MM (8.6), BF16 (8.6) | Feature-flag explosion — security and ML acceleration without breaking ABI |
| 2021 | ARMv9.0 | SVE2 mandatory, CCA/Realm world, BRBE, TME | Biggest architectural jump since ARMv8 — security and compute in tandem |
| 2022–2024 | ARMv9.2+ | SME, SME2, MTE3, TRBE | First ARMv9 cloud silicon (Graviton4, Cobalt, Axion, Grace) |
| 2025+ | CHERI / ARMv10 | Hardware capabilities, memory-safe pointers by default | End of spatial/temporal memory safety bugs in hardware — the holy grail |
Hands-On Exercises
MTE Feature Detection & Tag Mismatch Experiment
Explore Memory Tagging Extension by detecting hardware support, allocating tagged memory, and deliberately triggering a tag mismatch fault to understand sync vs async modes.
Step 1 — Check MTE support:
# On a physical ARMv8.5+ device (Pixel 8, Graviton4) or QEMU with MTE:
cat /proc/cpuinfo | grep -i mte
# From EL1 (kernel module or baremetal):
# MRS X0, ID_AA64PFR1_EL1 → bits [11:8] = MTE level (0=none, 1=EL0 only, 2=full, 3=MTE3)
# On QEMU: qemu-system-aarch64 -machine virt,mte=on -cpu max ...
Step 2 — Write a C program that triggers a tag mismatch:
// mte_test.c — MTE tag mismatch demonstration
// Requires Linux 5.10+, kernel CONFIG_ARM64_MTE=y, ARMv8.5+ hardware or QEMU
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <string.h>
#include <arm_acle.h> // __arm_mte_create_random_tag / __arm_mte_set_tag intrinsics
// MTE prctl constants (Linux uapi)
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_TAGGED_ADDR_ENABLE (1UL << 0)
#define PR_MTE_TCF_SYNC (1UL << 1) // Synchronous fault
#define PR_MTE_TCF_ASYNC (1UL << 2) // Asynchronous fault
#define PR_MTE_TAG_SHIFT 3
#endif
#ifndef PROT_MTE
#define PROT_MTE 0x20
#endif
int main(void) {
// Enable MTE in synchronous mode for this process
if (prctl(PR_SET_TAGGED_ADDR_CTRL,
PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
(0xffffUL << PR_MTE_TAG_SHIFT), 0, 0, 0)) {
perror("prctl MTE enable failed (no MTE support?)");
return 1;
}
printf("[+] MTE enabled in SYNC mode\n");
// Allocate MTE-tagged memory (must use mmap with PROT_MTE)
size_t sz = 4096;
char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE | PROT_MTE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED) { perror("mmap"); return 1; }
// Tag the pointer using IRG (compiler builtin or inline asm)
// __arm_mte_create_random_tag assigns a random tag to buf
char *tagged = __arm_mte_create_random_tag(buf, 0);
__arm_mte_set_tag(tagged); // STG — set allocation tag in memory
printf("[+] Untagged ptr: %p\n", buf);
printf("[+] Tagged ptr: %p\n", tagged);
// Valid access — tags match
tagged[0] = 'A';
printf("[+] Valid write succeeded: '%c'\n", tagged[0]);
// Deliberate mismatch — access with wrong tag (untagged pointer)
printf("[!] Attempting access with mismatched tag (should fault in SYNC)...\n");
buf[0] = 'X'; // BUG: buf has tag=0, memory has random tag → SEGV!
printf("[-] If you see this, MTE did not catch the mismatch\n");
munmap(buf, sz);
return 0;
}
// Build: clang --target=aarch64-linux-gnu -march=armv8.5-a+memtag -O2 -o mte_test mte_test.c
// Run: ./mte_test → expect SIGSEGV (sync) or SIGBUS (async) on the buf[0] write
Step 3 — Observe sync vs async behaviour: Recompile and run with PR_MTE_TCF_ASYNC instead of PR_MTE_TCF_SYNC. In async mode, the fault is reported later (possibly after the function returns) via TFSRE0_EL1 — useful for production (low overhead) but imprecise. Sync mode faults immediately at the offending instruction — ideal for debugging and fuzzing.
SDOT vs SMMLA Throughput Benchmark
Compare the effective multiply-accumulate throughput of SDOT (ARMv8.4, 16 MACs/instr) versus SMMLA (ARMv8.6, 32 MACs/instr) by timing tight loops with the PMU cycle counter.
// sdot_bench.S — SDOT throughput measurement
.global sdot_bench
.type sdot_bench, %function
sdot_bench:
// x0 = iteration count (e.g., 10000000)
// Returns: x0 = cycle count
movi v0.16B, #1 // activations = all 1s
movi v1.16B, #2 // weights = all 2s
movi v2.4S, #0 // accumulator = 0
mrs x1, PMCCNTR_EL0 // read cycle counter (start) — EL0 access requires the kernel to enable PMUSERENR_EL0.EN
.Lsdot_loop:
sdot v2.4S, v0.16B, v1.16B // 16 MACs
sdot v3.4S, v0.16B, v1.16B // 16 MACs (pipeline fill)
sdot v4.4S, v0.16B, v1.16B // 16 MACs
sdot v5.4S, v0.16B, v1.16B // 16 MACs (4 × 16 = 64 MACs/iteration)
subs x0, x0, #1
b.ne .Lsdot_loop
mrs x2, PMCCNTR_EL0 // read cycle counter (end)
sub x0, x2, x1 // x0 = elapsed cycles
ret
// smmla_bench.S — SMMLA throughput measurement
.global smmla_bench
.type smmla_bench, %function
smmla_bench:
// x0 = iteration count
// Returns: x0 = cycle count
movi v0.16B, #1
movi v1.16B, #2
movi v2.4S, #0
mrs x1, PMCCNTR_EL0
.Lsmmla_loop:
smmla v2.4S, v0.16B, v1.16B // 32 MACs
smmla v3.4S, v0.16B, v1.16B // 32 MACs
smmla v4.4S, v0.16B, v1.16B // 32 MACs
smmla v5.4S, v0.16B, v1.16B // 32 MACs (4 × 32 = 128 MACs/iteration)
subs x0, x0, #1
b.ne .Lsmmla_loop
mrs x2, PMCCNTR_EL0
sub x0, x2, x1
ret
// bench_main.c — Driver program
#include <stdio.h>
#include <stdint.h>
extern uint64_t sdot_bench(uint64_t iterations);
extern uint64_t smmla_bench(uint64_t iterations);
int main(void) {
uint64_t iters = 10000000;
uint64_t sdot_cycles = sdot_bench(iters);
uint64_t smmla_cycles = smmla_bench(iters);
double sdot_macs = (double)iters * 64; // 4 SDOT × 16 MACs
double smmla_macs = (double)iters * 128; // 4 SMMLA × 32 MACs
printf("SDOT: %lu cycles, %.2f MACs/cycle\n", sdot_cycles, sdot_macs / sdot_cycles);
printf("SMMLA: %lu cycles, %.2f MACs/cycle\n", smmla_cycles, smmla_macs / smmla_cycles);
printf("SMMLA speedup: %.2fx throughput\n", (smmla_macs/smmla_cycles) / (sdot_macs/sdot_cycles));
return 0;
}
// Makefile:
// CC = aarch64-linux-gnu-gcc
// CFLAGS = -march=armv8.6-a+i8mm -O2
// bench: bench_main.c sdot_bench.S smmla_bench.S
// $(CC) $(CFLAGS) -o $@ $^
// Run on ARMv8.6+ or QEMU: qemu-aarch64 -cpu max ./bench
Expected results: On Cortex-X4/Neoverse V2, SMMLA typically achieves ~1.8–2× the throughput of SDOT per cycle: each SMMLA covers a 2×2 output tile (32 MACs) against SDOT's four independent dot products (16 MACs), at similar pipeline latency. On in-order cores (Cortex-A520), the gap is narrower because the frontend, not execution width, is the bottleneck.
ARMv9 Feature Register Interrogation (Kernel Module)
Write a Linux kernel module that reads ARM ID registers to produce a comprehensive ARMv9 feature summary — similar to what Linux dmesg reports at boot but with full field-level decoding.
// armv9_features.c — Linux kernel module for ARMv9 feature detection
// Build: make -C /lib/modules/$(uname -r)/build M=$PWD modules
// Load: sudo insmod armv9_features.ko && dmesg | tail -30
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <asm/sysreg.h>
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("ARMv9 Feature Register Decoder");
static void decode_pfr0(u64 val) {
unsigned sve = (val >> 32) & 0xF;
unsigned el2 = (val >> 8) & 0xF;
unsigned el3 = (val >> 12) & 0xF;
unsigned fp = (val >> 16) & 0xF;
unsigned asimd = (val >> 20) & 0xF;
pr_info(" ID_AA64PFR0_EL1 = 0x%016llx\n", val);
pr_info(" SVE: %s (level %u)\n", sve ? "YES" : "NO", sve);
pr_info(" EL2: %s\n", el2 ? "implemented" : "not implemented");
pr_info(" EL3: %s\n", el3 ? "implemented" : "not implemented");
pr_info(" FP: %s\n", fp != 0xF ? "YES" : "NO");
pr_info(" ASIMD: %s\n", asimd != 0xF ? "YES" : "NO");
}
static void decode_pfr1(u64 val) {
unsigned mte = (val >> 8) & 0xF;
unsigned sme = (val >> 24) & 0xF;
unsigned bt = (val >> 0) & 0xF;
pr_info(" ID_AA64PFR1_EL1 = 0x%016llx\n", val);
pr_info(" MTE: level %u (%s)\n", mte,
mte == 0 ? "none" : mte == 1 ? "EL0 insn only" :
mte == 2 ? "full MTE2" : mte == 3 ? "MTE3 (asym)" : "unknown");
pr_info(" SME: %s (level %u)\n", sme ? "YES" : "NO", sme);
pr_info(" BTI: %s\n", bt ? "YES" : "NO");
}
static void decode_zfr0(u64 val) {
unsigned svever = (val >> 0) & 0xF;
unsigned bf16 = (val >> 20) & 0xF;
unsigned i8mm = (val >> 44) & 0xF;
pr_info(" ID_AA64ZFR0_EL1 = 0x%016llx\n", val);
pr_info(" SVE2: %s (ver %u)\n", svever ? "YES" : "SVE1 only", svever);
pr_info(" BFloat16: %s\n", bf16 ? "YES" : "NO");
pr_info(" I8MM: %s\n", i8mm ? "YES" : "NO");
}
static void decode_smfr0(u64 val) {
unsigned smever = (val >> 56) & 0xF;
pr_info(" ID_AA64SMFR0_EL1 = 0x%016llx\n", val);
pr_info(" SME version: %u (%s)\n", smever,
smever == 0 ? "not implemented" : smever == 1 ? "SME" :
smever == 2 ? "SME2" : "SME2+");
}
static int __init armv9_features_init(void) {
u64 pfr0, pfr1, zfr0, smfr0;
pr_info("=== ARMv9 Feature Register Dump ===\n");
pfr0 = read_sysreg_s(SYS_ID_AA64PFR0_EL1);
decode_pfr0(pfr0);
pfr1 = read_sysreg_s(SYS_ID_AA64PFR1_EL1);
decode_pfr1(pfr1);
zfr0 = read_sysreg_s(SYS_ID_AA64ZFR0_EL1);
decode_zfr0(zfr0);
// SME register only valid if SME is present
if (((pfr1 >> 24) & 0xF) > 0) {
smfr0 = read_sysreg_s(SYS_ID_AA64SMFR0_EL1);
decode_smfr0(smfr0);
} else {
pr_info(" SME not present — skipping ID_AA64SMFR0_EL1\n");
}
pr_info("=== End Feature Dump ===\n");
return 0;
}
static void __exit armv9_features_exit(void) {
pr_info("ARMv9 feature module unloaded\n");
}
module_init(armv9_features_init);
module_exit(armv9_features_exit);
# Makefile for the kernel module
# obj-m += armv9_features.o
# all:
# make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
# clean:
# make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
# Illustrative output on an ARMv9 server core such as AWS Graviton4 (register values vary by implementation):
# === ARMv9 Feature Register Dump ===
# ID_AA64PFR0_EL1 = 0x0000000100001100
# SVE: YES (level 1)
# EL2: implemented
# EL3: implemented
# FP: YES
# ASIMD: YES
# ID_AA64PFR1_EL1 = 0x0000000000000221
# MTE: level 2 (full MTE2)
# SME: NO (level 0)
# BTI: YES
# ID_AA64ZFR0_EL1 = 0x0000100000100001
# SVE2: YES (ver 1)
# BFloat16: YES
# I8MM: YES
# SME not present — skipping ID_AA64SMFR0_EL1
# === End Feature Dump ===
Challenge extensions: (a) Add decoding for ID_AA64ISAR0_EL1 to detect AES, SHA, CRC32, and atomic instruction support. (b) Add ID_AA64MMFR0_EL1 to detect physical address size (PA range), granule sizes, and stage-2 translation support. (c) Compare output between QEMU -cpu max (all features) and -cpu cortex-a76 (ARMv8.2 subset) to see the feature gap.
Series Summary — 28 Parts, One Architecture
This final section looks back at what the ARM Assembly Mastery Series covers end-to-end — from the earliest ARMv1 silicon to the ARMv9 features shipping in 2026 hardware.
28-part learning arc
| Section | Parts | Theme |
|---|---|---|
| Foundations | 1–6 | ISA history, ARM32, AArch64, arithmetic, branching, AAPCS calling convention |
| Memory Subsystem | 7–8 | Cache hierarchy, memory ordering, barriers, NEON SIMD |
| Advanced ISA | 9–10 | SVE/SVE2 scalable vectors, VFP floating-point |
| System Architecture | 11–13 | Exception levels, MMU/page tables, TrustZone security |
| Platform Programming | 14–16 | Cortex-M bare metal, Cortex-A boot chain, Apple Silicon ARM64e |
| Compiler & Performance | 17–18 | Inline assembly, GCC/Clang internals, micro-optimisation, PMU profiling |
| Binary Analysis | 19 | Reverse engineering, ELF/Mach-O, Ghidra, iOS/Android quirks |
| Advanced Systems | 20–24 | Bare-metal OS, microarchitecture, hypervisors, debugging/tracing, linker internals |
| Toolchain & Production | 25–26 | Cross-compilation toolchains, Android NDK, FreeRTOS/Zephyr, U-Boot, TF-A |
| Security & Future | 27–28 | Exploitation research, ROP/JOP/PAC/BTI, ARMv9 MTE/SME/CCA, AI instructions |
Series Complete!
Congratulations — you have completed all 28 parts of the ARM Assembly Mastery Series. You have travelled from ARMv1 RISC origins through AArch64 system programming, advanced SIMD, virtual memory, hypervisors, security research, production deployment, and now to the cutting edge of ARMv9 memory tagging, matrix extensions, and confidential computing. The ARM architecture continues to evolve — ARM's public specifications and the GCC/LLVM codebases are the best resources to stay current as new feature flags ship in silicon. Keep building, keep measuring, and keep reading the reference manual.