ARMv9.0 (2021) was the first major architecture version since ARMv8 in 2011. It made SVE2 the baseline SIMD extension for ARMv9-A cores, introduced the Realm Management Extension for confidential computing, and continued a decade-long trend of adding new feature flags instead of breaking instruction semantics. This final instalment surveys the most impactful ARMv9 additions — MTE, SME, CCA, and the AI matrix instructions — and closes with a glance at where ARM is heading next.
ARMv9 Overview
Mandatory and New Features
ARMv9 key feature changes vs ARMv8
| Feature | ARMv8 | ARMv9 |
|---|---|---|
| SIMD baseline | NEON (mandatory for A-profile) | SVE2 mandatory for ARMv9-A cores |
| Memory safety | PAC (8.3), BTI (8.5), MTE optional (8.5) | MTE more deeply integrated across profiles |
| Confidential compute | TrustZone (S/NS worlds) | CCA adds Realm world (4th security state) |
| Matrix compute | FEAT_DotProd (8.4), I8MM (8.6) | SME (Scalable Matrix Extension) new in 9.2 |
| Branch recording | ETM/PTM trace | BRBE (Branch Record Buffer Extension) |
| Transactional memory | Not present | TME (Transactional Memory Extension) — optional |
| Debug | Self-hosted debug v8.4 | TRBE (Trace Buffer Extension) in-memory trace |
ARMv9 Silicon Timeline
ARMv9 processor releases
- Cortex-A510/A710/X2 (announced 2021, in devices 2022) — First ARMv9 CPU cores; Cortex-A510, the power-efficient core, implements SVE2 at 128-bit VL (vector length).
- Cortex-A520/A720/X4 (2023) — ARMv9.2, adds MTE support; Cortex-X4 paired with Immortalis-G720 GPU in Snapdragon 8 Gen 3.
- Neoverse V2 (2022) — ARMv9 server core (AWS Graviton4, Google Axion); 4×128-bit SVE2, MTE, DDR5 support.
- Apple M-series — roughly ARMv8.5/8.6-class (arm64e); PAC everywhere, but no SVE2 — Apple instead pairs 128-bit NEON with its own AMX matrix units.
- Cortex-A725/X925 (2024) — ARMv9.2+, SME2 debut in mobile; Cortex-X925 reaches ~3.85 GHz peak.
Memory Tagging Extension (MTE)
MTE (introduced in ARMv8.5-A as FEAT_MTE, and a headline ARMv9 security feature) is a hardware mechanism for detecting spatial and temporal memory safety violations at low runtime cost. Every 16-byte granule of memory has a 4-bit allocation tag held in dedicated tag storage. Every pointer carries a 4-bit logical tag in its top byte (bits [59:56]). A tag mismatch on a load or store can fault synchronously, fault asynchronously, or be silently ignored, depending on system register configuration.
MTE Instructions
; ----- MTE Instruction Reference -----
; IRG Xd, Xn [, Xm] — Insert Random Tag
; Generates a random 4-bit tag, inserts into bits [59:56] of Xd
; Xn provides the base address; optional Xm is a tag exclusion mask
IRG x0, x1 ; x0 = tagged ptr (random tag in [59:56]), base=x1
; ADDG Xd, Xn, #imm6, #imm4 — Add with Tag
; Adds a scaled offset to Xn AND adds #imm4 to the logical tag (mod 16)
ADDG x1, x0, #16, #2 ; x1 = x0+16, tag = tag(x0)+2 (next 16B granule)
; STG Xt, [Xn {, #simm9}] — Store Tag
; Stores the tag from bits [59:56] of Xt to the allocation tag of [Xn]
STG x0, [x0] ; allocate tag at the address x0 points to
; STGM Xt, [Xn] — Store Tag, Multiple (a block of granules at once)
; EL1+ only — block size is IMPLEMENTATION DEFINED (see GMID_EL1.BS)
; LDG Xd, [Xn {, #simm9}] — Load Tag
; Loads the 4-bit allocation tag from [Xn] into bits [59:56] of Xd
LDG x2, [x0] ; x2 = x0 with allocation tag filled in [59:56]
; STGP Xt1, Xt2, [Xn{, #simm7}] — Store Tag and Pair
; Stores two registers and sets tag in one instruction (like STP+STG)
; STZG Xt, [Xn {, #simm9}] — Store Zero with Tag
; Zeroes the 16-byte granule and sets its allocation tag
; Example: allocating a tagged buffer
; Assume x0 = untagged base address of 64-byte allocation
mte_alloc_64:
IRG x0, x0 ; assign random tag to pointer
STG x0, [x0, #0] ; tag granule 0
STG x0, [x0, #16] ; tag granule 1
STG x0, [x0, #32] ; tag granule 2
STG x0, [x0, #48] ; tag granule 3 (all 4 granules get same tag)
ret ; caller uses tagged x0 for all access
# Enable MTE for a process (Linux 5.10+, requires kernel CONFIG_ARM64_MTE)
# prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC, 0, 0, 0)
# In C:
# #include <sys/prctl.h>
# prctl(PR_SET_TAGGED_ADDR_CTRL,
# PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xffff << PR_MTE_TAG_SHIFT),
# 0, 0, 0);
# Build with GCC/Clang MTE sanitizer (software-assisted MTE, Android 12+)
clang --target=aarch64-linux-android33 \
    -fsanitize=memtag \
    -fsanitize-memtag-mode=sync \
    -march=armv8.5-a+memtag \
    -O2 -o mte_test main.c   # sync mode: synchronous fault on mismatch
# Verify MTE availability on system
grep mte /proc/cpuinfo   # an "mte" entry in Features confirms support
Android MTE Deployment
MTE rollout strategy (Android 13/14)
- Pixel 8/8 Pro (Tensor G3, ARMv9.2) — First production Android devices to offer MTE, enabled in async mode for the system allocator (Scudo). Catches heap overflows and UAF asynchronously at roughly 1% perf overhead.
- Async vs sync modes: Async — the tag check happens off the critical path and the fault is reported asynchronously (lowest overhead, misses some races). Sync — fault on every mismatching access; higher overhead, precise attribution. Used for development/fuzzing.
- Scudo allocator integration: Android's production allocator Scudo uses MTE natively — `IRG` on each `malloc()` to assign a random pointer tag, `STGM` to initialise the tags, and retagging on `free()` to catch use-after-free.
- Stack tagging: `-fsanitize=memtag-stack` tags stack variables at function entry; enables detection of stack buffer overflows and dangling stack pointer use at far lower overhead than software AddressSanitizer.
Scalable Matrix Extension (SME)
SME (ARMv9.2-A, FEAT_SME) is designed for matrix operations at the core of modern ML inference and scientific computing. It introduces a new 2D register array called ZA, a new execution mode (Streaming SVE mode), and a set of outer-product accumulation instructions.
ZA Storage, Modes, and Outer Product
; ===== SME Mode Entry/Exit =====
SMSTART ; enter Streaming SVE mode AND enable ZA (zeroed on enable)
SMSTART SM ; enter Streaming SVE mode only (ZA unchanged)
SMSTART ZA ; enable ZA array only (stay in normal SVE mode)
SMSTOP ; exit Streaming SVE mode AND disable ZA (PSTATE.SM=0, PSTATE.ZA=0)
SMSTOP ZA ; disable ZA only (PSTATE.ZA=0, ZA contents lost)
; RDSVL — read streaming vector length
RDSVL x0, #1 ; x0 = SVL in bytes × 1 (the immediate is a multiplier)
; ===== ZA Tile Access =====
; ZA tiles are named by element size: ZA0.B, ZA0.H, ZA0.S, ZA0.D, ZA0.Q
; For SVL=128: ZA0.S is a 4×4 tile of single-precision floats
; Load row 0 of ZA0.S from memory:
LD1W {ZA0H.S[W12, 0]}, P0/Z, [x0] ; load row indexed by w12+0
LD1W {ZA0H.S[W12, 1]}, P0/Z, [x0, x1, LSL #2]
; ===== Outer Product — FP32 MOPA =====
; FMOPA ZAd.S, Pg/M, Pg/M, Zn.S, Zm.S
; ZAd.S += outer_product(Zn.S, Zm.S) [masked by two predicate registers]
; This performs: for each (row i, col j): ZA0.S[i][j] += Zn[i] * Zm[j]
; A single FMOPA instruction accumulates one full SVL×SVL outer product!
ZERO {ZA} ; zero all ZA tiles
FMOPA ZA0.S, P0/M, P1/M, Z0.S, Z1.S ; ZA0.S += Z0 ⊗ Z1 (outer product)
FMOPA ZA0.S, P0/M, P1/M, Z2.S, Z3.S ; accumulate more outer products
; Store result row-by-row
ST1W {ZA0H.S[W12, 0]}, P0, [x2]
ST1W {ZA0H.S[W12, 1]}, P0, [x2, x3, LSL #2]
// SME intrinsics in C (arm_sme.h, ACLE 2023)
#include <arm_sme.h>
// Matrix-multiply accumulate using SME intrinsics
// Computes C += A * B where A is M×K, B is K×N, C is M×N (all FP32)
// ACLE SME keyword attributes: __arm_streaming runs the function in
// Streaming SVE mode; __arm_inout("za") declares it reads and writes ZA
void sme_matmul_fp32(float *C, const float *A, const float *B,
int M, int K, int N) __arm_streaming __arm_inout("za")
{
uint64_t vl = svcntw(); // SVL in 32-bit elements
svbool_t pg_all = svptrue_b32();
svfloat32_t va, vb;
// Zero the accumulator tile
svzero_za();
for (int k = 0; k < K; k++) {
// Load column k of A (A stored column-major; assumes M <= vl)
va = svld1_f32(pg_all, A + k * M);
// Load row k of B (B stored row-major; assumes N <= vl)
vb = svld1_f32(pg_all, B + k * N);
// Outer product accumulate into ZA tile 0
svmopa_za32_f32_m(0, pg_all, pg_all, va, vb);
}
// Store ZA tile 0 result to C
for (int i = 0; i < M; i++) {
svst1_hor_za32(0, i, pg_all, C + i * N);
}
}
ARM Confidential Compute Architecture (CCA)
CCA (part of ARMv9) addresses cloud multi-tenancy security: even a malicious hypervisor or compromised Secure-world firmware should be unable to read a tenant's VM memory. CCA achieves this through a fourth security world — Realm — managed by a new firmware component, the Realm Management Monitor (RMM), running at R-EL2.
ARM four-world security model
| World | Exception Level | Trust | Content |
|---|---|---|---|
| Normal | EL0/EL1/EL2 | Untrusted | Guest OS, hypervisor (KVM, Xen, ESXi) |
| Secure | S-EL0/S-EL1 | Trusted by vendor | OP-TEE, trusted OS and services |
| Realm | R-EL0/R-EL1/R-EL2 | Attestable by tenant | Realm VM tenant code + RMM — isolated from hypervisor |
| Root | EL3 | Silicon root of trust | EL3 Monitor (TF-A), GPT management |
Granule Protection Tables (GPT)
GPTs are a new physical-address protection check controlled by EL3, applied after normal address translation. Every physical granule (4 KB at the smallest granule size) has a GPT entry — a 4-bit GPI field — identifying which world may access it: Normal, Secure, Realm, or Root. A Normal-world hypervisor attempting to read a Realm-assigned granule receives a GPF (Granule Protection Fault) rather than the data.
// Realm lifecycle (simplified — RMI SMC calls issued by the NS host to the RMM)
// These calls are defined by the Realm Management Interface (RMI); FID values below are illustrative
#define SMC_RMM_REALM_CREATE 0xC4000150
#define SMC_RMM_REALM_ACTIVATE 0xC4000151
#define SMC_RMM_REALM_DESTROY 0xC4000152
#define SMC_RMM_REC_CREATE 0xC4000158 // Realm Execution Context = vCPU
#define SMC_RMM_REC_ENTER 0xC400015A // enter a REC (run the Realm vCPU)
// Step 1: Create Realm descriptor (Host NS hypervisor calls RMM)
// x0=SMC_RMM_REALM_CREATE, x1=rd_addr, x2=realm_params_addr
// RMM allocates a Realm Descriptor granule, sets GPT entry to REALM
// Step 2: Create REC (virtual CPU for the Realm VM)
// x0=SMC_RMM_REC_CREATE, x1=rec_addr, x2=rd_addr, x3=mpidr
// Step 3: Enter REC — transitions CPU from Normal EL2 to Realm R-EL1
// x0=SMC_RMM_REC_ENTER, x1=rec_addr, x2=run_object_addr
// RMM saves/restores NS register state, loads Realm state, ERET to R-EL1
// The Realm VM cannot be read by the host hypervisor — GPT enforces physical isolation
// Realm's identity (measurement) is attested via CCA attestation token (CBOR/COSE)
AI/ML Acceleration Instructions
ARM has progressively added inference-optimised instructions across ARMv8.4–ARMv9, targeting quantised integer and reduced-precision floating-point workloads that dominate on-device ML.
DotProd (FEAT_DotProd) and I8MM (FEAT_I8MM)
; ===== SDOT / UDOT — 8-bit dot product (FEAT_DotProd, ARMv8.4) =====
; Vector form: SDOT Vd.4S, Vn.16B, Vm.16B
; For each of the 4 output int32 lanes:
;   Vd[i].S += Vn[4i..4i+3].B (signed) · Vm[4i..4i+3].B (signed)
; Computes 4 × 4-element signed dot products per instruction (16 MACs/instr)
; (An indexed form, SDOT Vd.4S, Vn.16B, Vm.4B[lane], reuses one 4-byte group of Vm)
; Load two quantised int8 activation vectors (16 elements each)
LD1 {v0.16B}, [x0], #16 ; load 16 int8 activations
LD1 {v1.16B}, [x1], #16 ; load 16 int8 weights
; Accumulate into int32 result register v2.4S
SDOT v2.4S, v0.16B, v1.16B ; v2 += dot(v0[0:4],v1[0:4]) | dot(v0[4:8],v1[4:8]) | ...
; ===== SMMLA / UMMLA — 8-bit matrix multiply accumulate (FEAT_I8MM, ARMv8.6) =====
; SMMLA Vd.4S, Vn.16B, Vm.16B
; Multiplies a 2×8 int8 matrix (Vn) by an 8×2 int8 matrix (Vm, held 2×8
; row-major and used transposed), accumulating a 2×2 int32 tile
; Each instruction computes 4 outputs × 8 MACs each = 32 MACs!
SMMLA v4.4S, v0.16B, v1.16B ; 2×2 int32 accumulation from two 2×8 int8 matrices
; v4.S[0] = sum_k(Vn[0,k]*Vm[k,0]) for k=0..7 (row0·col0)
; v4.S[1] = sum_k(Vn[0,k]*Vm[k,1]) for k=0..7 (row0·col1)
; v4.S[2] = sum_k(Vn[1,k]*Vm[k,0]) for k=0..7 (row1·col0)
; v4.S[3] = sum_k(Vn[1,k]*Vm[k,1]) for k=0..7 (row1·col1)
BFloat16 Matrix Multiply (FEAT_BFloat16)
BFloat16 (BF16) is a 16-bit floating-point format with the same 8-bit exponent as FP32 but only 7 mantissa bits (vs 23 for FP32). It is the dominant training and inference format in ML accelerators (Google TPU, NVIDIA A100) due to better dynamic range than FP16.
; ===== BFMMLA — BFloat16 matrix multiply accumulate (FEAT_BFloat16, ARMv8.6) =====
; BFMMLA Vd.4S, Vn.8H, Vm.8H
; Multiplies a 2×4 BF16 matrix (Vn) by a 4×2 BF16 matrix (Vm, used transposed),
; accumulating a 2×2 FP32 tile — 16 MACs per instruction, ideal for inference
; Load BF16 input row (8 × BF16 = 128 bits = one 128-bit register)
LD1 {v0.8H}, [x0], #16 ; 8 BF16 activations
LD1 {v1.8H}, [x1], #16 ; 8 BF16 weights
; Accumulate 2×2 FP32 result
BFMMLA v2.4S, v0.8H, v1.8H ; v2.4S += matmul(v0_matrix_2x4, v1_matrix_4x2)
; ===== BFCVT — Convert FP32 to BF16 =====
; BFCVTN Vd.4H, Vn.4S — convert 4 FP32 → 4 BF16 (narrow, lower half)
; BFCVTN2 Vd.8H, Vn.4S — convert 4 FP32 → 4 BF16 (narrow, upper half)
; BFCVT Hd, Sn — convert one FP32 → BF16 (scalar)
BFCVTN v3.4H, v4.4S ; pack FP32 result back to BF16 for storage
ML instruction evolution — multiply-accumulate throughput per instruction
| Feature | Instruction | MACs/instr (128-bit) | Data type |
|---|---|---|---|
| NEON baseline | FMLA V.4S | 4 | FP32 |
| FEAT_DotProd | SDOT V.4S | 16 | INT8×INT8→INT32 |
| FEAT_BFloat16 | BFMMLA V.4S | 16 | BF16×BF16→FP32 |
| FEAT_I8MM | SMMLA V.4S | 32 | INT8×INT8→INT32 |
| SME MOPA | FMOPA ZA0.S (SVL=512) | 256 | FP32×FP32→FP32 |
ARMv10, CHERI, and the Horizon
Emerging directions in ARM architecture research
- ARMv10 (speculative): ARM has not publicly announced ARMv10. Industry speculation points to deeper confidential-computing substrate, enhanced SME3/4 for LLM inference, and possible native CHERI-style capability checks integrated rather than bolted on.
- CHERI (Capability Hardware Enhanced RISC Instructions): A Cambridge/ARM research project adding hardware capability support to AArch64. Each pointer is a 128-bit capability (address + bounds + permissions + tag bit). The Morello evaluation board (Neoverse N1 with CHERI extensions) shipped in 2022. CHERI eliminates entire classes of spatial/temporal memory safety errors in hardware — stronger than MTE's probabilistic 4-bit tags, but with higher pointer storage overhead.
- SME2 (ARMv9.2): Extends SME with multi-vector instructions, lookup tables (`LUTI2`/`LUTI4` for non-linear activation functions), and support for structured-sparse matrix operations for pruned model inference.
- MTE3 (FEAT_MTE3, ARMv8.7): Adds asymmetric tag checking — loads fault synchronously while stores are checked asynchronously — a middle ground between precision and overhead for production deployment.
- BRBE (Branch Record Buffer Extension): Hardware branch history ring buffer — like Intel LBR. Software-controlled, captures up to 64 last taken branches per core with timestamps and EL information. Enables low-overhead production profiling without ETM streaming infrastructure.
- RME v1.1 / CCA silicon: First commercial CCA silicon expected 2025–2026 in cloud server platforms. NVIDIA GH200 descendants and AWS Graviton evolution are candidate first adopters. This enables fully attested confidential VMs on ARM-hosted public cloud.
Case Study: ARM in the Cloud — The Graviton Revolution
The most dramatic real-world validation of ARMv9's capabilities is happening in the cloud, where ARM-based server processors have gone from a curiosity to a genuine x86 challenger in under six years.
AWS Graviton — The ARM Server Timeline
| Generation | Year | Core Design | Architecture | Key Specs |
|---|---|---|---|---|
| Graviton1 | 2018 | Cortex-A72 (16 cores) | ARMv8.0 | First ARM EC2 instance (A1), budget-tier |
| Graviton2 | 2020 | Neoverse N1 (64 cores) | ARMv8.2 | 40% better price-performance vs x86 M5 |
| Graviton3 | 2021 | Neoverse V1 (64 cores) | ARMv8.4+ | SVE 256-bit, DDR5, 25% faster than G2 |
| Graviton4 | 2023 | Neoverse V2 (96 cores) | ARMv9.0 | SVE2, MTE-capable, 536 GiB/s memory BW, 30% faster than G3 |
Market Impact: ARM server market share grew from <1% (2018) to ~10% (2024), with industry projections of 20%+ by 2027. Major cloud providers now all offer ARM instances: Azure Cobalt (2024, Neoverse N2, 128 cores, ARMv9), Google Axion (2024, Neoverse V2, ARMv9), and NVIDIA Grace (2023, Neoverse V2/Hopper GPU, 480 GB LPDDR5X, 900 GB/s NVLink-C2C for AI supercomputing). The "ARM can't do servers" myth died with Graviton2 — ARMv9 is now accelerating the transition.
Hardware telemetry matters at this scale too: features such as BRBE enable always-on profiling (`perf record` with minimal overhead) — vital for fleet-wide optimisation across millions of instances.
The Evolution of ARM Architecture Versions (1985–2025)
Four decades of ARM architecture innovation
| Era | Architecture | Key Innovation | Impact |
|---|---|---|---|
| 1985 | ARMv1 | Sophie Wilson's 26-bit RISC design, 25,000 transistors | Acorn RISC Machine — proved RISC could be simple and power-efficient |
| 1994–2002 | ARMv4/v5/v6 | Thumb mode, DSP extensions, Java acceleration | Mobile revolution — Nokia, Palm, iPod all ran ARM cores |
| 2005 | ARMv7 | NEON SIMD, TrustZone, Cortex-A8 | iPhone 3GS (2009) ran Cortex-A8 — ARM became the smartphone ISA |
| 2011 | ARMv8.0 | 64-bit AArch64, crypto extensions (AES/SHA) | ARM enters servers — first serious x86 challenge since Itanium |
| 2016–2020 | ARMv8.x | PAC (8.3), BTI (8.5), MTE (8.5), I8MM (8.6), BF16 (8.6) | Feature-flag explosion — security and ML acceleration without breaking ABI |
| 2021 | ARMv9.0 | SVE2 mandatory, CCA/Realm world, BRBE, TME | Biggest architectural jump since ARMv8 — security and compute in tandem |
| 2022–2024 | ARMv9.2+ | SME, SME2, MTE3, TRBE | First ARMv9 cloud silicon (Graviton4, Cobalt, Axion, Grace) |
| 2025+ | CHERI / ARMv10 | Hardware capabilities, memory-safe pointers by default | End of spatial/temporal memory safety bugs in hardware — the holy grail |
Hands-On Exercises
MTE Feature Detection & Tag Mismatch Experiment
Explore Memory Tagging Extension by detecting hardware support, allocating tagged memory, and deliberately triggering a tag mismatch fault to understand sync vs async modes.
Step 1 — Check MTE support:
# On a physical ARMv8.5+ device (Pixel 8, Graviton4) or QEMU with MTE:
cat /proc/cpuinfo | grep -i mte
# From EL1 (kernel module or baremetal):
# MRS X0, ID_AA64PFR1_EL1 → bits [11:8] = MTE level (0=none, 1=EL0 only, 2=full, 3=MTE3)
# On QEMU: qemu-system-aarch64 -machine virt,mte=on -cpu max ...
Step 2 — Write a C program that triggers a tag mismatch:
// mte_test.c — MTE tag mismatch demonstration
// Requires Linux 5.10+, kernel CONFIG_ARM64_MTE=y, ARMv8.5+ hardware or QEMU
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <string.h>
#include <arm_acle.h> // __arm_mte_create_random_tag / __arm_mte_set_tag intrinsics
// MTE prctl constants (Linux uapi)
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_TAGGED_ADDR_ENABLE (1UL << 0)
#define PR_MTE_TCF_SYNC (1UL << 1) // Synchronous fault
#define PR_MTE_TCF_ASYNC (1UL << 2) // Asynchronous fault
#define PR_MTE_TAG_SHIFT 3
#endif
#ifndef PROT_MTE
#define PROT_MTE 0x20
#endif
int main(void) {
// Enable MTE in synchronous mode for this process
if (prctl(PR_SET_TAGGED_ADDR_CTRL,
PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
(0xffffUL << PR_MTE_TAG_SHIFT), 0, 0, 0)) {
perror("prctl MTE enable failed (no MTE support?)");
return 1;
}
printf("[+] MTE enabled in SYNC mode\n");
// Allocate MTE-tagged memory (must use mmap with PROT_MTE)
size_t sz = 4096;
char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE | PROT_MTE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED) { perror("mmap"); return 1; }
// Tag the pointer using IRG (compiler builtin or inline asm)
// __arm_mte_create_random_tag assigns a random tag to buf
char *tagged = __arm_mte_create_random_tag(buf, 0);
__arm_mte_set_tag(tagged); // STG — set allocation tag in memory
printf("[+] Untagged ptr: %p\n", buf);
printf("[+] Tagged ptr: %p\n", tagged);
// Valid access — tags match
tagged[0] = 'A';
printf("[+] Valid write succeeded: '%c'\n", tagged[0]);
// Deliberate mismatch — access with wrong tag (untagged pointer)
printf("[!] Attempting access with mismatched tag (should fault in SYNC)...\n");
buf[0] = 'X'; // BUG: buf has tag=0, memory has random tag → SEGV!
printf("[-] If you see this, MTE did not catch the mismatch\n");
munmap(buf, sz);
return 0;
}
// Build: clang --target=aarch64-linux-gnu -march=armv8.5-a+memtag -O2 -o mte_test mte_test.c
// Run: ./mte_test → expect SIGSEGV (sync) or SIGBUS (async) on the buf[0] write
Step 3 — Observe sync vs async behaviour: Recompile and run with PR_MTE_TCF_ASYNC instead of PR_MTE_TCF_SYNC. In async mode, the fault is reported later (possibly after the function returns) via TFSRE0_EL1 — useful for production (low overhead) but imprecise. Sync mode faults immediately at the offending instruction — ideal for debugging and fuzzing.
SDOT vs SMMLA Throughput Benchmark
Compare the effective multiply-accumulate throughput of SDOT (ARMv8.4, 16 MACs/instr) versus SMMLA (ARMv8.6, 32 MACs/instr) by timing tight loops with the PMU cycle counter.
// sdot_bench.S — SDOT throughput measurement
.global sdot_bench
.type sdot_bench, %function
sdot_bench:
// x0 = iteration count (e.g., 10000000)
// Returns: x0 = cycle count
movi v0.16B, #1 // activations = all 1s
movi v1.16B, #2 // weights = all 2s
movi v2.4S, #0 // accumulator = 0
mrs x1, PMCCNTR_EL0 // read cycle counter (start) — EL0 access requires the kernel to enable PMUSERENR_EL0.EN
.Lsdot_loop:
sdot v2.4S, v0.16B, v1.16B // 16 MACs
sdot v3.4S, v0.16B, v1.16B // 16 MACs (pipeline fill)
sdot v4.4S, v0.16B, v1.16B // 16 MACs
sdot v5.4S, v0.16B, v1.16B // 16 MACs (4 × 16 = 64 MACs/iteration)
subs x0, x0, #1
b.ne .Lsdot_loop
mrs x2, PMCCNTR_EL0 // read cycle counter (end)
sub x0, x2, x1 // x0 = elapsed cycles
ret
// smmla_bench.S — SMMLA throughput measurement
.global smmla_bench
.type smmla_bench, %function
smmla_bench:
// x0 = iteration count
// Returns: x0 = cycle count
movi v0.16B, #1
movi v1.16B, #2
movi v2.4S, #0
mrs x1, PMCCNTR_EL0
.Lsmmla_loop:
smmla v2.4S, v0.16B, v1.16B // 32 MACs
smmla v3.4S, v0.16B, v1.16B // 32 MACs
smmla v4.4S, v0.16B, v1.16B // 32 MACs
smmla v5.4S, v0.16B, v1.16B // 32 MACs (4 × 32 = 128 MACs/iteration)
subs x0, x0, #1
b.ne .Lsmmla_loop
mrs x2, PMCCNTR_EL0
sub x0, x2, x1
ret
// bench_main.c — Driver program
#include <stdio.h>
#include <stdint.h>
extern uint64_t sdot_bench(uint64_t iterations);
extern uint64_t smmla_bench(uint64_t iterations);
int main(void) {
uint64_t iters = 10000000;
uint64_t sdot_cycles = sdot_bench(iters);
uint64_t smmla_cycles = smmla_bench(iters);
double sdot_macs = (double)iters * 64; // 4 SDOT × 16 MACs
double smmla_macs = (double)iters * 128; // 4 SMMLA × 32 MACs
printf("SDOT: %lu cycles, %.2f MACs/cycle\n", sdot_cycles, sdot_macs / sdot_cycles);
printf("SMMLA: %lu cycles, %.2f MACs/cycle\n", smmla_cycles, smmla_macs / smmla_cycles);
printf("SMMLA speedup: %.2fx throughput\n", (smmla_macs/smmla_cycles) / (sdot_macs/sdot_cycles));
return 0;
}
// Makefile:
// CC = aarch64-linux-gnu-gcc
// CFLAGS = -march=armv8.6-a+i8mm -O2
// bench: bench_main.c sdot_bench.S smmla_bench.S
// $(CC) $(CFLAGS) -o $@ $^
// Run on ARMv8.6+ or QEMU: qemu-aarch64 -cpu max ./bench
Expected results: On Cortex-X4/Neoverse V2, SMMLA typically achieves ~1.8–2× the throughput of SDOT per cycle: each SMMLA covers a 2×2 output tile (32 MACs) against SDOT's four independent dot products (16 MACs), at similar pipeline latency. On in-order cores (Cortex-A520), the gap is narrower because the frontend, not execution width, is the bottleneck.
ARMv9 Feature Register Interrogation (Kernel Module)
Write a Linux kernel module that reads ARM ID registers to produce a comprehensive ARMv9 feature summary — similar to what Linux dmesg reports at boot but with full field-level decoding.
// armv9_features.c — Linux kernel module for ARMv9 feature detection
// Build: make -C /lib/modules/$(uname -r)/build M=$PWD modules
// Load: sudo insmod armv9_features.ko && dmesg | tail -30
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <asm/sysreg.h>
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("ARMv9 Feature Register Decoder");
static void decode_pfr0(u64 val) {
unsigned sve = (val >> 32) & 0xF;
unsigned el2 = (val >> 8) & 0xF;
unsigned el3 = (val >> 12) & 0xF;
unsigned fp = (val >> 16) & 0xF;
unsigned asimd = (val >> 20) & 0xF;
pr_info(" ID_AA64PFR0_EL1 = 0x%016llx\n", val);
pr_info(" SVE: %s (level %u)\n", sve ? "YES" : "NO", sve);
pr_info(" EL2: %s\n", el2 ? "implemented" : "not implemented");
pr_info(" EL3: %s\n", el3 ? "implemented" : "not implemented");
pr_info(" FP: %s\n", fp != 0xF ? "YES" : "NO");
pr_info(" ASIMD: %s\n", asimd != 0xF ? "YES" : "NO");
}
static void decode_pfr1(u64 val) {
unsigned mte = (val >> 8) & 0xF;
unsigned sme = (val >> 24) & 0xF;
unsigned bt = (val >> 0) & 0xF;
pr_info(" ID_AA64PFR1_EL1 = 0x%016llx\n", val);
pr_info(" MTE: level %u (%s)\n", mte,
mte == 0 ? "none" : mte == 1 ? "EL0 insn only" :
mte == 2 ? "full MTE2" : mte == 3 ? "MTE3 (asym)" : "unknown");
pr_info(" SME: %s (level %u)\n", sme ? "YES" : "NO", sme);
pr_info(" BTI: %s\n", bt ? "YES" : "NO");
}
static void decode_zfr0(u64 val) {
unsigned svever = (val >> 0) & 0xF;
unsigned bf16 = (val >> 20) & 0xF;
unsigned i8mm = (val >> 44) & 0xF;
pr_info(" ID_AA64ZFR0_EL1 = 0x%016llx\n", val);
pr_info(" SVE2: %s (ver %u)\n", svever ? "YES" : "SVE1 only", svever);
pr_info(" BFloat16: %s\n", bf16 ? "YES" : "NO");
pr_info(" I8MM: %s\n", i8mm ? "YES" : "NO");
}
static void decode_smfr0(u64 val) {
unsigned smever = (val >> 56) & 0xF;
pr_info(" ID_AA64SMFR0_EL1 = 0x%016llx\n", val);
pr_info(" SME version: %u (%s)\n", smever,
smever == 0 ? "not implemented" : smever == 1 ? "SME" :
smever == 2 ? "SME2" : "SME2+");
}
static int __init armv9_features_init(void) {
u64 pfr0, pfr1, zfr0, smfr0;
pr_info("=== ARMv9 Feature Register Dump ===\n");
pfr0 = read_sysreg_s(SYS_ID_AA64PFR0_EL1);
decode_pfr0(pfr0);
pfr1 = read_sysreg_s(SYS_ID_AA64PFR1_EL1);
decode_pfr1(pfr1);
zfr0 = read_sysreg_s(SYS_ID_AA64ZFR0_EL1);
decode_zfr0(zfr0);
// SME register only valid if SME is present
if (((pfr1 >> 24) & 0xF) > 0) {
smfr0 = read_sysreg_s(SYS_ID_AA64SMFR0_EL1);
decode_smfr0(smfr0);
} else {
pr_info(" SME not present — skipping ID_AA64SMFR0_EL1\n");
}
pr_info("=== End Feature Dump ===\n");
return 0;
}
static void __exit armv9_features_exit(void) {
pr_info("ARMv9 feature module unloaded\n");
}
module_init(armv9_features_init);
module_exit(armv9_features_exit);
# Makefile for the kernel module
# obj-m += armv9_features.o
# all:
# make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
# clean:
# make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
# Illustrative output on an ARMv9 server core such as AWS Graviton4 (register values vary by implementation):
# === ARMv9 Feature Register Dump ===
# ID_AA64PFR0_EL1 = 0x0000000100001100
# SVE: YES (level 1)
# EL2: implemented
# EL3: implemented
# FP: YES
# ASIMD: YES
# ID_AA64PFR1_EL1 = 0x0000000000000221
# MTE: level 2 (full MTE2)
# SME: NO (level 0)
# BTI: YES
# ID_AA64ZFR0_EL1 = 0x0000100000100001
# SVE2: YES (ver 1)
# BFloat16: YES
# I8MM: YES
# SME not present — skipping ID_AA64SMFR0_EL1
# === End Feature Dump ===
Challenge extensions: (a) Add decoding for ID_AA64ISAR0_EL1 to detect AES, SHA, CRC32, and atomic instruction support. (b) Add ID_AA64MMFR0_EL1 to detect physical address size (PA range), granule sizes, and stage-2 translation support. (c) Compare output between QEMU -cpu max (all features) and -cpu cortex-a76 (ARMv8.2 subset) to see the feature gap.
Series Summary — 28 Parts, One Architecture
This final section looks back at what the ARM Assembly Mastery Series covers end-to-end — from the earliest ARMv1 silicon to the ARMv9 features shipping in 2026 hardware.
28-part learning arc
| Section | Parts | Theme |
|---|---|---|
| Foundations | 1–6 | ISA history, ARM32, AArch64, arithmetic, branching, AAPCS calling convention |
| Memory Subsystem | 7–8 | Cache hierarchy, memory ordering, barriers, NEON SIMD |
| Advanced ISA | 9–10 | SVE/SVE2 scalable vectors, VFP floating-point |
| System Architecture | 11–13 | Exception levels, MMU/page tables, TrustZone security |
| Platform Programming | 14–16 | Cortex-M bare metal, Cortex-A boot chain, Apple Silicon ARM64e |
| Compiler & Performance | 17–18 | Inline assembly, GCC/Clang internals, micro-optimisation, PMU profiling |
| Binary Analysis | 19 | Reverse engineering, ELF/Mach-O, Ghidra, iOS/Android quirks |
| Advanced Systems | 20–24 | Bare-metal OS, microarchitecture, hypervisors, debugging/tracing, linker internals |
| Toolchain & Production | 25–26 | Cross-compilation toolchains, Android NDK, FreeRTOS/Zephyr, U-Boot, TF-A |
| Security & Future | 27–28 | Exploitation research, ROP/JOP/PAC/BTI, ARMv9 MTE/SME/CCA, AI instructions |
Series Complete!
Congratulations — you have completed all 28 parts of the ARM Assembly Mastery Series. You have travelled from ARMv1 RISC origins through AArch64 system programming, advanced SIMD, virtual memory, hypervisors, security research, production deployment, and now to the cutting edge of ARMv9 memory tagging, matrix extensions, and confidential computing. The ARM architecture continues to evolve — ARM's public specifications and the GCC/LLVM codebases are the best resources to stay current as new feature flags ship in silicon. Keep building, keep measuring, and keep reading the reference manual.