
ARM Assembly Part 9: SVE & SVE2 Scalable Vector Extensions

March 12, 2026 · Wasil Zafar · 26 min read

ARM's Scalable Vector Extension (SVE) introduces hardware-agnostic vector length, predicate registers for per-lane control, gather/scatter memory access, first-fault loads for speculative loops, and a rich programming model powering HPC on Neoverse and ML on Grace Hopper.

Table of Contents

  1. Introduction & SVE Design Goals
  2. SVE Registers
  3. Predication
  4. SVE Memory Operations
  5. Vectorisation Loop Patterns
  6. SVE2 Extensions
  7. SME — Scalable Matrix Extension
  8. Conclusion & Next Steps

Introduction & SVE Design Goals

Series Overview: This is Part 9 of our 28-part ARM Assembly Mastery Series. Parts 1–8 covered architecture history through NEON SIMD. Now we go further — SVE removes NEON's fixed-width constraint, letting the same binary run efficiently on 128-bit Cortex-A55 through 2048-bit Fujitsu A64FX without recompilation.

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 9

  1. Architecture History & Core Concepts (ARMv1→v9, RISC philosophy, profiles)
  2. ARM32 Instruction Set Fundamentals (ARM vs Thumb, registers, CPSR, barrel shifter)
  3. AArch64 Registers, Addressing & Data Movement (X/W regs, addressing modes, load/store pairs)
  4. Arithmetic, Logic & Bit Manipulation (ADD/SUB, bitfield extract/insert, CLZ)
  5. Branching, Loops & Conditional Execution (Branch types, link register, jump tables)
  6. Stack, Subroutines & AAPCS (Calling conventions, prologue/epilogue)
  7. Memory Model, Caches & Barriers (Weak ordering, DMB/DSB/ISB, TLB)
  8. NEON & Advanced SIMD (Vector ops, intrinsics, media processing)
  9. SVE & SVE2 Scalable Vector Extensions (Predicate regs, gather/scatter, HPC/ML) ← You Are Here
  10. Floating-Point & VFP Instructions (IEEE-754, scalar FP, rounding modes)
  11. Exception Levels, Interrupts & Vector Tables (EL0–EL3, GIC, fault debugging)
  12. MMU, Page Tables & Virtual Memory (Stage-1 translation, permissions, huge pages)
  13. TrustZone & ARM Security Extensions (Secure monitor, world switching, TF-A)
  14. Cortex-M Assembly & Bare-Metal Embedded (NVIC, SysTick, linker scripts, low-power)
  15. Cortex-A System Programming & Boot (EL3→EL1 transitions, MMU setup, PSCI)
  16. Apple Silicon & macOS ABI (ARM64e PAC, Mach-O, dyld, perf counters)
  17. Inline Assembly, GCC/Clang & C Interop (Constraints, clobbers, compiler interaction)
  18. Performance Profiling & Micro-Optimization (Pipeline hazards, PMU, benchmarking)
  19. Reverse Engineering & ARM Binary Analysis (ELF, disassembly, CFR, iOS/Android quirks)
  20. Building a Bare-Metal OS Kernel (Bootloader, UART, scheduler, context switch)
  21. ARM Microarchitecture Deep Dive (OOO pipelines, reorder buffers, branch prediction)
  22. Virtualization Extensions (EL2 hypervisor, stage-2 translation, KVM)
  23. Debugging & Tooling Ecosystem (GDB, OpenOCD/JTAG, ETM/ITM, QEMU)
  24. Linkers, Loaders & Binary Format Internals (ELF deep dive, relocations, PIC, crt0)
  25. Cross-Compilation & Build Systems (GCC/Clang toolchains, CMake, firmware gen)
  26. ARM in Real Systems (Android, FreeRTOS/Zephyr, U-Boot, TF-A)
  27. Security Research & Exploitation (ASLR, PAC attacks, ROP/JOP, kernel exploitation)
  28. Emerging ARMv9 & Future Directions (MTE, SME, confidential compute, AI accel)

ARM's Scalable Vector Extension (SVE) appeared with ARMv8.2-A in 2016 — the same year Fujitsu and RIKEN committed to building Fugaku, the supercomputer that would become the world's fastest machine in 2020 using 48-core A64FX chips running SVE at 512 bits. The central insight driving the design: the HPC community was tired of rewriting and recompiling code every time hardware doubled its vector width (MMX 64-bit → SSE 128-bit → AVX 256-bit → AVX-512). SVE eliminates this treadmill by making vector length an implementation choice invisible to the programmer.

Unlike NEON's fixed 128-bit registers, SVE allows each silicon vendor to choose a vector length (VL) from 128 to 2048 bits in 128-bit increments. Software never hard-codes VL — it writes VL-agnostic loops that process however many elements the hardware provides per iteration. A Cortex-A510 with 128-bit SVE, a Neoverse V1 with 256-bit SVE, and an A64FX with 512-bit SVE all run the same binary. The loop simply iterates fewer times on wider hardware. On the final iteration, a predicate register masks off the leftover lanes — no scalar cleanup epilogue required.

Why It Matters: AVX-512 binaries crash on AVX2-only CPUs. SVE binaries run everywhere, because the instruction set encodes operations on predicates and scalable vectors, not fixed widths. This single design choice lets cloud providers deploy one ARM binary across an entire heterogeneous fleet — Graviton3 (256-bit SVE) and Graviton4 (128-bit SVE2) execute identical code paths.

SVE Registers

Z Registers (Scalable Vectors)

SVE provides 32 scalable vector registers, Z0–Z31, each exactly VL bits wide. The crucial detail: VL is not known at compile time. The assembler never encodes a fixed width into the instruction stream. Instead, software discovers VL at runtime using the CNTB family of instructions.

Z registers alias the lower 128 bits of the NEON V registers: Z0[127:0] is bit-identical to V0. This means you can call a NEON function, and the result sits in Z0 ready for SVE processing without any move instruction. The upper bits (128 to VL-1) are defined as zero after a NEON write — no garbage leaks upward.

| Suffix | Lane Width | Lanes (128-bit VL) | Lanes (512-bit VL) | Use Case |
|--------|------------|--------------------|--------------------|----------|
| .B | 8-bit byte | 16 | 64 | Text processing, INT8 ML inference |
| .H | 16-bit half | 8 | 32 | FP16 inference, BFloat16 training |
| .S | 32-bit single | 4 | 16 | FP32 physics, image pixels |
| .D | 64-bit double | 2 | 8 | FP64 HPC, scientific computing |

Think of each Z register as an elastic container: the hardware stretches it wider on silicon that supports longer vectors, but the number of registers is always 32. This is in contrast to x86, where AVX-512 added entirely new ZMM registers on top of the existing XMM/YMM set, tripling the state that must be saved on context switch.

P Registers (Predicates)

SVE's most revolutionary feature is its 16 scalable predicate registers, P0–P15, each VL/8 bits wide — that is, one bit per byte of vector length. On a 256-bit implementation, P registers are 32 bits wide (256/8); on a 512-bit implementation, they are 64 bits. The predicate width scales automatically with VL.

Predicate registers divide into two groups by convention:

  • P0–P7 (governing predicates) — used as masks on arithmetic and memory instructions. When you write FADD Z0.S, P3/M, Z0.S, Z1.S, P3 controls which lanes participate.
  • P8–P15 (general predicates) — available for temporary masks, loop counters, comparison results, and complex predicate logic.

Each bit in a predicate corresponds to one byte of vector data. For 32-bit (.S) operations, only every 4th bit is consulted; for 64-bit (.D) operations, every 8th bit. Predicate-generating instructions such as WHILELT and PTRUE write the intervening bits as zero, so a correctly formed predicate works at any element size.

Predicate Modes: SVE instructions accept predicates in two flavours. /M (merging) — inactive lanes retain their previous value. /Z (zeroing) — inactive lanes are forced to zero. Memory loads always use /Z (inactive lanes read zero); stores simply skip inactive lanes. The choice between /M and /Z affects register pressure: /Z creates a clean zero that downstream instructions can detect, while /M avoids an extra MOV to preserve partial results.

FFR — First-Fault Register

The First-Fault Register (FFR) is a single implicit predicate-width register that records which lanes of an LDFF1 (first-fault load) completed successfully. It is SVE's answer to a classic vectorisation problem: "what if my vector-width load straddles a page boundary into unmapped memory?"

Here is how first-fault loads work step by step:

  1. SETFFR — Reset the FFR to all-ones (all lanes valid).
  2. LDFF1W {Z0.S}, P0/Z, [X0, X1, LSL #2] — Attempt the load. The first element (lane 0) is guaranteed to fault normally if unmapped (so your loop still traps on genuine bugs). But if any subsequent element faults, the hardware silently suppresses that fault and clears FFR from that lane onward.
  3. RDFFR P1.B — Read the FFR into a predicate register. P1 now indicates exactly which lanes loaded valid data.
  4. Process only the valid lanes using P1 as the governing predicate, then advance the loop index by the number of valid elements.

This mechanism lets the compiler vectorise loops where the iteration count is unknown and the data might end near a page boundary — for example, scanning a null-terminated C string with SVE. Without first-fault loads, such loops would need conservative scalar fallbacks near page edges.
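The four-step workflow above can be sketched as a scalar C model (illustrative only; `ldff1b_model` and the `mapped` array are inventions standing in for hardware page-fault behaviour):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of SETFFR / LDFF1B / RDFFR. 'mapped' marks which bytes
 * of 'mem' are readable; a real CPU discovers this via page faults.
 * Returns the number of valid lanes and sets *ffr to the valid-lane
 * mask, or returns (size_t)-1 if lane 0 itself faults. */
size_t ldff1b_model(const uint8_t *mem, const int *mapped,
                    size_t start, size_t lanes,
                    uint8_t *dst, uint64_t *ffr) {
    size_t valid = 0;
    *ffr = 0;                          /* model: rebuild FFR from scratch */
    for (size_t i = 0; i < lanes; i++) {
        if (!mapped[start + i]) {
            if (i == 0)
                return (size_t)-1;     /* lane 0 faults normally: real trap */
            break;                     /* later lanes: fault suppressed,
                                          FFR cleared from here onward */
        }
        dst[i] = mem[start + i];
        *ffr |= 1ull << i;             /* this lane completed */
        valid++;
    }
    return valid;                      /* advance the loop index by this count */
}
```

A strlen-style loop would process only the `*ffr` lanes, advance by the returned count, and retry from the new offset.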

Vector Length & CNTB/CNTW/CNTD

// Query current hardware vector length in bytes
CNTB x0           // x0 = VL/8 (number of bytes per Z register)
CNTW x1           // x1 = VL/32 (number of 32-bit lanes)
CNTD x2           // x2 = VL/64 (number of 64-bit lanes)
CNTH x3           // x3 = VL/16 (number of 16-bit lanes)

// Typical: on 512-bit SVE hardware, CNTB returns 64

Predication

WHILELT / WHILELE / WHILELO

WHILELT is the instruction that makes SVE's VL-agnostic loops work. It compares a scalar loop index against a scalar limit and generates a predicate mask where lane i is active if (index + i) < limit. On the first iteration of a 100-element loop on 512-bit hardware (16 × 32-bit lanes), WHILELT sets all 16 predicate bits. On the final iteration (elements 96–99), only the first 4 bits are set — the remaining 12 lanes are masked off.

| Instruction | Condition | Signed/Unsigned | Typical Use |
|-------------|-----------|-----------------|-------------|
| WHILELT Pd.T, Xn, Xm | index + lane < limit | Signed | Standard for-loop (i < n) |
| WHILELE Pd.T, Xn, Xm | index + lane ≤ limit | Signed | Inclusive upper bound (i ≤ n) |
| WHILELO Pd.T, Xn, Xm | index + lane < limit | Unsigned | size_t / pointer-based loops |
| WHILELS Pd.T, Xn, Xm | index + lane ≤ limit | Unsigned | Unsigned inclusive bounds |

After WHILELT executes, the condition flags are set: the NONE condition is true if no lanes are active (loop done), and FIRST is true if at least one lane is active. The canonical SVE loop tests B.NONE .Ldone immediately after WHILELT to exit cleanly. No peeling, no scalar epilogue, no alignment check — the predicate handles everything.
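WHILELT's mask generation is easy to model in scalar C (a sketch; the function name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* Model of WHILELT: lane i is active while (index + i) < limit.
 * 'lanes' is the lane count for the element size (CNTW for .S, CNTD
 * for .D). A result of 0 corresponds to the NONE condition. */
uint64_t whilelt(int64_t index, int64_t limit, unsigned lanes) {
    uint64_t pred = 0;
    for (unsigned i = 0; i < lanes; i++)
        if (index + (int64_t)i < limit)
            pred |= 1ull << i;         /* lane i participates */
    return pred;
}
```

For the 100-element example on 16 lanes: the first iteration yields all 16 bits set, the last (index 96) yields only the bottom 4, and index 100 yields 0, which is exactly when B.NONE exits the loop.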

// SVE vectorised loop: for (i=0; i<n; i++) b[i] = a[i] * c
//   x0 = pointer to a, x1 = pointer to b, x2 = n, z1.s = broadcast c
    MOV  x3, #0                    // i = 0
.Lloop:
    WHILELT p0.s, x3, x2           // p0: active lanes where i+lane < n
    B.NONE  .Ldone                 // Exit if no active lanes
    LD1W    {z0.s}, p0/z, [x0, x3, LSL #2]  // Load active a[i..i+VL-1]
    FMUL    z0.s, p0/m, z0.s, z1.s          // Multiply (merging predicate)
    ST1W    {z0.s}, p0, [x1, x3, LSL #2]   // Store active elements
    INCW    x3                              // i += number of 32-bit lanes
    B       .Lloop
.Ldone:

Predicated Arithmetic & Memory

Nearly every SVE data-processing instruction accepts a governing predicate between the opcode and the operands. The predicate register is written with a mode suffix that controls what happens to inactive lanes:

| Mode | Syntax | Inactive Lanes | When to Use |
|------|--------|----------------|-------------|
| /M (Merging) | FADD Z0.S, P0/M, Z0.S, Z1.S | Retain previous Z0 value | Accumulation, conditional update |
| /Z (Zeroing) | MOVPRFX Z0.S, P0/Z, Z1.S followed by FADD Z0.S, P0/M, Z0.S, Z2.S | Forced to zero | Fresh computation; arithmetic gets /Z via a fused MOVPRFX prefix, while loads support /Z directly |

For memory operations, the rules are slightly different. Loads always use /Z — inactive lanes produce zero, never garbage from memory. Stores use the predicate without a mode suffix: ST1W {Z0.S}, P0, [X0, X1, LSL #2]. Inactive store lanes are simply suppressed; no byte hits the cache.

This design eliminates the need for explicit blend/select instructions after masked operations. Compare this to AVX-512, where you often need VBLENDMPS to merge results back — in SVE, the merging is built into every instruction.
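The /M versus /Z semantics can be modelled per lane in scalar C (sketch; the function names are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Merging (/M): inactive lanes keep the destination's old value. */
void add_merging(float *zd, const float *zn, const float *zm,
                 uint64_t pred, unsigned lanes) {
    for (unsigned i = 0; i < lanes; i++)
        if (pred >> i & 1)
            zd[i] = zn[i] + zm[i];     /* inactive: zd[i] untouched */
}

/* Zeroing (/Z): inactive lanes are forced to zero. */
void add_zeroing(float *zd, const float *zn, const float *zm,
                 uint64_t pred, unsigned lanes) {
    for (unsigned i = 0; i < lanes; i++)
        zd[i] = (pred >> i & 1) ? zn[i] + zm[i] : 0.0f;
}
```

The merging variant is why no separate blend step is needed: the old destination values survive in the masked-off lanes.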

Compiler Hint: GCC and Clang map C if statements inside vectorised loops directly to predicated SVE instructions. The compiler generates a comparison (FCMGT P1.S, P0/Z, Z0.S, Z1.S), then uses P1 as the governing predicate on the conditional body — no branch, no scalar fallback. This is why SVE auto-vectorises control flow that NEON cannot.

PTEST / PFIRST / PNEXT

Predicate management instructions let you inspect and iterate through active elements, enabling patterns far beyond simple vectorised loops:

| Instruction | Encoding | Purpose | Flags Set |
|-------------|----------|---------|-----------|
| PTEST | PTEST Pg, Pn.B | Test predicate contents without modifying any register | NONE (all-false), !NONE (at least one true), FIRST, LAST |
| PFIRST | PFIRST Pdn.B, Pg, Pdn.B | Set the lane at Pg's first active element to true in Pdn (other lanes unchanged) | Updates NONE flag |
| PNEXT | PNEXT Pdn.T, Pg, Pdn.T | Advance Pdn to the single next active element within governing Pg | Updates NONE flag |
| PTRUE | PTRUE Pd.T {, pattern} | Initialize predicate (all-true or a specific VL pattern) | |

The PFIRST/PNEXT pair enables element-serial iteration — processing one active element at a time within a predicate mask. This is essential for irregular data patterns like sparse matrix traversal, where you want to extract non-zero indices one by one from a comparison predicate. The pattern looks like:

  1. CMPEQ P1.S, P0/Z, Z0.S, #0 — find all matching elements (match mask in P1)
  2. PFALSE P2.B — clear the iterator predicate P2
  3. PNEXT P2.S, P1, P2.S — advance P2 to the next active element of the match mask
  4. B.NONE .Ldone — stop when no more matches
  5. Extract and process the single active lane in P2, then repeat from step 3
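The element-serial iteration can be modelled in scalar C (sketch; `pnext` here is an invented scalar analogue of the instruction, returning a lane index instead of a one-hot predicate):

```c
#include <assert.h>
#include <stdint.h>

/* Model of PNEXT-style iteration: given a match mask, return the next
 * active lane strictly after 'prev' (pass -1 to start), or -1 when no
 * more matches remain (the NONE condition). */
int pnext(uint64_t mask, int prev, unsigned lanes) {
    for (int i = prev + 1; i < (int)lanes; i++)
        if (mask >> i & 1)
            return i;                  /* single active element found */
    return -1;                         /* no further matches */
}
```

A caller loops `for (int l = pnext(m, -1, n); l >= 0; l = pnext(m, l, n))`, visiting each active lane exactly once, which mirrors the sparse-index extraction pattern described above.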

SVE Memory Operations

Contiguous Load/Store (LD1/ST1)

SVE's contiguous memory instructions are the workhorse of vectorised loops. They access a consecutive run of elements, masked by a governing predicate to handle loop tails cleanly:

| Instruction | Description | Address Form |
|-------------|-------------|--------------|
| LD1W {Zd.S}, Pg/Z, [Xn, Xm, LSL #2] | Scalar base + scalar index × 4 | Indexed (loop counter in Xm) |
| LD1W {Zd.S}, Pg/Z, [Xn] | Scalar base, unit stride | Simple pointer dereference |
| LD1W {Zd.S}, Pg/Z, [Xn, #3, MUL VL] | Base + immediate × VL | Accessing stacked vectors (e.g., 3rd chunk) |
| ST1W {Zt.S}, Pg, [Xn, Xm, LSL #2] | Predicated contiguous store | Same addressing as loads |

SVE also supports multi-register loads for interleaved data: LD2W loads two Z registers (even/odd elements de-interleaved), LD3W loads three (e.g., RGB pixel channels), and LD4W loads four (RGBA). These are the SVE equivalent of NEON's LD2/LD3/LD4 but work at scalable widths.

The key difference from NEON loads: SVE loads are always predicated. Even a "load everything" operation uses PTRUE P0.S to generate the all-ones predicate first. This uniformity simplifies hardware design and ensures the same instruction works for both full-width and loop-tail iterations.

Gather Loads / Scatter Stores

// Gather load: Z0.S[i] = *(uint32_t *)(base + (Z1.S[i] << 2))
// Z1 holds unsigned 32-bit element indices; UXTW #2 zero-extends
// each index and scales it by 4 into a byte offset
LD1W  {z0.s}, p0/z, [x0, z1.s, UXTW #2]   // Gather: base + (z1[i] << 2)

// Scatter store: *(uint32_t *)(base + (Z1.S[i] << 2)) = Z2.S[i]
ST1W  {z2.s}, p0, [x0, z1.s, UXTW #2]     // Scatter store
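Lane by lane, the gather/scatter pair is equivalent to this scalar C (sketch; the names are invented, and `idx` holds element indices that the UXTW #2 addressing mode scales by 4):

```c
#include <assert.h>
#include <stdint.h>

/* Gather: z0[i] = base[idx[i]] for active lanes; loads are /Z, so
 * inactive lanes read as zero. */
void gather_w(uint32_t *z0, const uint32_t *base, const uint32_t *idx,
              uint64_t pred, unsigned lanes) {
    for (unsigned i = 0; i < lanes; i++)
        z0[i] = (pred >> i & 1) ? base[idx[i]] : 0;
}

/* Scatter: base[idx[i]] = z2[i] for active lanes; inactive lanes
 * write nothing at all. */
void scatter_w(const uint32_t *z2, uint32_t *base, const uint32_t *idx,
               uint64_t pred, unsigned lanes) {
    for (unsigned i = 0; i < lanes; i++)
        if (pred >> i & 1)
            base[idx[i]] = z2[i];
}
```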

First-Fault & Non-Fault Loads

SVE provides two speculative load variants that solve the "last page" problem — what happens when your vector-width access extends past the end of mapped memory:

| Instruction | First Element Fault? | Subsequent Faults? | Use Case |
|-------------|----------------------|--------------------|----------|
| LDFF1W (First-Fault) | Yes — normal trap | Suppressed — FFR cleared | Vectorising strlen, strchr, memchr |
| LDNF1W (Non-Fault) | Suppressed | Suppressed | Prefetch-like speculation where even the first element may be unmapped |

The first-fault workflow is the more commonly used pattern. Imagine vectorising strlen(): you load a vector of bytes starting at the current pointer, check for a zero byte, and advance. Near the end of a string that sits close to a page boundary, the load might cross into an unmapped page. With LDFF1B, the hardware loads as many bytes as possible, then records in the FFR which lanes succeeded. You process only those lanes, advance by that count, and repeat.

Real-World Example: glibc SVE

SVE-Optimised strlen in glibc

ARM's contributed SVE implementation of strlen() in the GNU C Library uses exactly this LDFF1B + RDFFR pattern. On Neoverse V1 (256-bit SVE), it processes 32 bytes per iteration compared to NEON's 16 bytes. The first-fault mechanism eliminates the need for page-boundary alignment checks that traditional implementations require, reducing the function's branch count by ~40% and improving throughput on short strings by 15–25% compared to the NEON path.

Vectorisation Loop Patterns

DAXPY / SAXPY Kernel

// DAXPY: y[i] += alpha * x[i]   (double precision)
// x0=n, x1=*x, x2=*y, d0=alpha (broadcast to z0.d first)
    MOV     z0.d, d0               // Broadcast scalar alpha to all lanes
    MOV     x3, #0
.Ldaxpy:
    WHILELT p0.d, x3, x0
    B.NONE  .Ldaxpy_done
    LD1D    {z1.d}, p0/z, [x1, x3, LSL #3]  // x[i]
    LD1D    {z2.d}, p0/z, [x2, x3, LSL #3]  // y[i]
    FMLA    z2.d, p0/m, z1.d, z0.d          // y += alpha*x
    ST1D    {z2.d}, p0, [x2, x3, LSL #3]
    INCD    x3
    B       .Ldaxpy
.Ldaxpy_done:
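For reference, this is the C loop the kernel implements; with SVE enabled (for example, -O2 -march=armv8-a+sve) GCC and Clang auto-vectorise it into essentially the WHILELT/FMLA pattern shown above:

```c
#include <assert.h>
#include <stddef.h>

/* DAXPY: y[i] += alpha * x[i], double precision. */
void daxpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```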

Reduction with FADDA

// Horizontal sum of float32 array using SVE
// x0 = *array, x1 = n,  result → s0
    FMOV    z0.s, #0.0              // Accumulator vector = 0
    MOV     x2, #0
.Lreduce:
    WHILELT p1.s, x2, x1
    B.NONE  .Lreduce_done
    LD1W    {z1.s}, p1/z, [x0, x2, LSL #2]
    FADD    z0.s, p1/m, z0.s, z1.s  // Accumulate active lanes
    INCW    x2
    B       .Lreduce
.Lreduce_done:
    PTRUE   p0.s                    // All lanes active for the final reduce
    FMOV    s0, wzr                 // Scalar accumulator = 0.0
    FADDA   s0, p0, s0, z0.s        // Ordered reduce: s0 += sum of z0 lanes
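FADDA's ordered-reduction semantics can be modelled in scalar C (sketch; the function name is invented). Unlike the tree-shaped FADDV, the additions happen strictly left to right, preserving IEEE-754 rounding order:

```c
#include <assert.h>
#include <stdint.h>

/* Model of FADDA: fold the active elements of z into the scalar
 * accumulator in lane order (lane 0 first). */
float fadda(float acc, uint64_t pred, const float *z, unsigned lanes) {
    for (unsigned i = 0; i < lanes; i++)
        if (pred >> i & 1)
            acc += z[i];           /* strictly ordered accumulation */
    return acc;
}
```

The ordered form matters when the source code demands bit-exact results (e.g., strict FP semantics); FADDV is faster but may round differently.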

SVE2 Extensions

New SVE2 Instructions

SVE2 was introduced with ARMv9-A (2021) and is mandatory for all ARMv9 implementations — unlike SVE, which was optional in ARMv8.2+. SVE2 brings NEON's full integer and fixed-point capability into the scalable framework, closing the gaps that made SVE primarily useful for floating-point HPC workloads:

| Category | Key Instructions | What It Enables |
|----------|------------------|-----------------|
| Widening Multiply-Add | SMLALB, SMLALT, UMLALB, UMLALT | INT16×INT16→INT32 accumulation (audio codecs, image filters) |
| Complex Arithmetic | FCADD, FCMLA | Complex number multiply-add treating adjacent lane pairs as real+imag (FFT, 5G baseband) |
| Saturating Narrowing | SQSHRUNB, SQSHRUNT, UQSHRNT | 32-bit → 16-bit with saturation and shift (video encode quantisation) |
| Polynomial Multiply | PMULLB, PMULLT | GF(2) multiplication for CRC, error correction codes |
| Cross-Lane Permute | TBL (two-source), TBX | Arbitrary byte shuffles across full VL width |
| Non-Temporal Gather/Scatter | LDNT1W (scalar+vector) | Cache-bypassing random access for graph analytics |

The widening and narrowing instructions use a bottom/top (B/T) convention: SMLALB multiplies the bottom (even-indexed) narrow elements while SMLALT multiplies the top (odd-indexed) elements, accumulating both into the wider destination. This interleaved approach processes an entire vector of narrow data in two instructions without any explicit unpack/pack step.
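The bottom/top convention can be modelled in scalar C (sketch; the function names mirror the mnemonics but are inventions):

```c
#include <assert.h>
#include <stdint.h>

/* SMLALB model: multiply the even-indexed (bottom) int16 pairs and
 * accumulate into int32 lanes. 'wide_lanes' is the int32 lane count. */
void smlalb(int32_t *acc, const int16_t *zn, const int16_t *zm,
            unsigned wide_lanes) {
    for (unsigned i = 0; i < wide_lanes; i++)
        acc[i] += (int32_t)zn[2 * i] * zm[2 * i];         /* bottom halves */
}

/* SMLALT model: same, but for the odd-indexed (top) pairs. */
void smlalt(int32_t *acc, const int16_t *zn, const int16_t *zm,
            unsigned wide_lanes) {
    for (unsigned i = 0; i < wide_lanes; i++)
        acc[i] += (int32_t)zn[2 * i + 1] * zm[2 * i + 1]; /* top halves */
}
```

Issuing both calls covers every narrow element of the vector, which is why the B/T pair replaces the explicit widen/unpack step NEON would need.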

Cryptography (AES/SHA3/SM)

SVE2 includes optional cryptographic extensions that parallelise block cipher processing across the entire SVE width. Since each Z register holds VL/128 independent 128-bit blocks, wider SVE implementations naturally process more cipher blocks per instruction:

| Feature Flag | Instructions | Blocks per Instruction (256-bit VL) | Use Case |
|--------------|--------------|-------------------------------------|----------|
| FEAT_SVE_AES | AESE, AESD, AESMC, AESIMC | 2 | AES-GCM, AES-CTR for TLS/IPsec |
| FEAT_SVE_SHA3 | RAX1, XAR, EOR3 | 2 | SHA3/Keccak, SHAKE for post-quantum crypto |
| FEAT_SVE_SM4 | SM4E, SM4EKEY | 2 | Chinese national standard SM4 cipher |

On a 512-bit SVE2 implementation, AESE processes 4 AES blocks per instruction, comparable to Intel's VAES on AVX-512. For server workloads dominated by TLS termination, SVE crypto can help saturate 100 Gbps network links without dedicated accelerator hardware.

Feature Detection Required: The crypto extensions are individually optional. Before using them, software must check CPUID feature registers (ID_AA64ZFR0_EL1) at runtime. A Neoverse V1 might implement FEAT_SVE_AES but not FEAT_SVE_SM4. Libraries like OpenSSL probe these flags during initialization and select the appropriate code path.

Bit Permutation & Histogram

SVE2 includes two specialised instruction groups that unlock workloads previously impossible to vectorise efficiently:

Bit Manipulation (FEAT_SVE_BitPerm)

| Instruction | Operation | Application |
|-------------|-----------|-------------|
| BEXT Zd.T, Zn.T, Zm.T | Extract bits from Zn at positions specified by set bits in Zm, pack to bottom | Bit-level compression, Morton code extraction |
| BDEP Zd.T, Zn.T, Zm.T | Deposit (scatter) bottom bits of Zn into positions specified by Zm | Bit-level decompression, Z-order curve encoding |

These are the SVE equivalents of Intel's PEXT/PDEP (BMI2), operating across an entire vector of elements. They are essential for database engines that use bit-packed column stores and need to extract specific bit fields from compressed records.
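The BEXT/BDEP per-element semantics, modelled on 64-bit values in C (sketch; in hardware this runs independently in every lane of the vector):

```c
#include <assert.h>
#include <stdint.h>

/* BEXT model: gather the bits of 'val' selected by 'mask' and pack
 * them contiguously at the bottom of the result. */
uint64_t bext(uint64_t val, uint64_t mask) {
    uint64_t out = 0;
    for (int i = 0, j = 0; i < 64; i++)
        if (mask >> i & 1) {
            out |= (val >> i & 1) << j;   /* selected bit -> position j */
            j++;
        }
    return out;
}

/* BDEP model: scatter the bottom bits of 'val' into the positions
 * set in 'mask' (the inverse of bext for in-range values). */
uint64_t bdep(uint64_t val, uint64_t mask) {
    uint64_t out = 0;
    for (int i = 0, j = 0; i < 64; i++)
        if (mask >> i & 1) {
            out |= (val >> j & 1) << i;   /* bit j -> selected position */
            j++;
        }
    return out;
}
```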

Histogram Instructions

| Instruction | Operation | Application |
|-------------|-----------|-------------|
| HISTCNT Zd.S, Pg/Z, Zn.S, Zm.S | For each lane, count how many elements in Zm match Zn[lane] | Database GROUP BY, frequency counting |
| HISTSEG Zd.B, Zn.B, Zm.B | Byte-level histogram within 128-bit segments | Character frequency analysis, compression stats |

Case Study: Database Engineering

HISTCNT in Columnar Database Engines

A SELECT colour, COUNT(*) FROM products GROUP BY colour query on a columnar store requires counting how often each distinct value appears. Scalar code needs a hash table lookup per row. With SVE2's HISTCNT, the engine loads a vector of colour codes and counts matches against each distinct value in a single instruction. On Neoverse V2 (128-bit SVE2), this processes 4 int32 comparisons per cycle; on future 256-bit implementations, it doubles to 8. Early benchmarks from ARM Research show 2.5× speedup on TPC-H Query 1 aggregation kernels compared to scalar loops.
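HISTCNT's per-lane match counting can be modelled in scalar C (a simplified sketch that counts every active match against each lane; the function name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* For each active lane i, zd[i] = number of active elements of zm
 * equal to zn[i]; inactive lanes produce zero (/Z governing). */
void histcnt(uint32_t *zd, uint64_t pred, const uint32_t *zn,
             const uint32_t *zm, unsigned lanes) {
    for (unsigned i = 0; i < lanes; i++) {
        uint32_t count = 0;
        if (pred >> i & 1)
            for (unsigned j = 0; j < lanes; j++)
                if ((pred >> j & 1) && zm[j] == zn[i])
                    count++;                /* one match found */
        zd[i] = count;
    }
}
```

In the GROUP BY use case, zn holds the distinct keys under consideration and zm a vector of column values, so one call yields a partial count for every key at once.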

SME — Scalable Matrix Extension

The Scalable Matrix Extension (SME), introduced with ARMv9.2-A in 2022, takes the scalable philosophy one step further: from 1D vectors to 2D matrix tiles. SME targets neural network inference and HPC GEMM (General Matrix Multiply) kernels — workloads where the inner loop is fundamentally a matrix outer product.

Streaming SVE Mode (SSVE)

SME introduces a new processor mode called Streaming SVE, entered with SMSTART and exited with SMSTOP. In streaming mode, the Z and P registers are available with a potentially different vector length (the streaming VL, SVL) than normal SVE. The key addition is the ZA register array — a 2D tile of SVL/8 rows, each SVL bits wide (SVL²/8 bits in total).

| Instruction | Operation | Matrix Dimension (256-bit SVL) |
|-------------|-----------|--------------------------------|
| SMSTART | Enter streaming SVE mode, enable ZA | |
| FMOPA ZA0.S, P0/M, P1/M, Z0.S, Z1.S | FP32 outer product: ZA += Z0 ⊗ Z1ᵀ | 8×8 tile accumulate |
| BFMOPA ZA0.S, P0/M, P1/M, Z0.H, Z1.H | BF16 outer product into FP32 accum | 8×8 tile (16 BF16 pairs) |
| SMOPA ZA0.S, P0/M, P1/M, Z0.B, Z1.B | INT8 outer product into INT32 accum | 8×8 tile (32 INT8 pairs) |
| LD1W {ZA0H.S[W12, 0]}, P0/Z, [X0] | Load one horizontal row of ZA tile | 8 elements |
| ST1W {ZA0V.S[W12, 0]}, P0, [X1] | Store one vertical column of ZA tile | 8 elements |
| SMSTOP | Exit streaming mode; ZA contents are lost | |

The outer product approach is revolutionary for GEMM: instead of loading rows and columns and computing dot products (the traditional vector approach), FMOPA takes one column vector and one row vector and accumulates their outer product into the entire tile in a single instruction. For an 8×8 tile, that is 64 multiply-accumulate operations per instruction. This matches the throughput of dedicated matrix accelerators like Google's TPU systolic array — but implemented as a general-purpose ISA extension.
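The outer-product accumulate can be modelled in scalar C (sketch; `fmopa` is an invented scalar analogue, with the tile `za` stored row-major):

```c
#include <assert.h>

/* Model of FMOPA: za[r][c] += zn[r] * zm[c] for every row/column pair,
 * i.e. a full rank-1 update of the tile per call. */
void fmopa(float *za, unsigned lanes, const float *zn, const float *zm) {
    for (unsigned r = 0; r < lanes; r++)
        for (unsigned c = 0; c < lanes; c++)
            za[r * lanes + c] += zn[r] * zm[c];   /* one MAC per tile cell */
}
```

A GEMM kernel calls this once per K-step: each call adds one rank-1 update (a column of A times a row of B) to the accumulator tile, so K calls produce the full lanes×lanes block of C.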

Real-World Impact: AI/ML Inference

SME in ARM's Compute Library

ARM's open-source Compute Library (ACL) added SME backends targeting Neoverse V2+ and future Cortex-X cores. For INT8 quantised transformer inference, the SMOPA instruction computes a 32×32 output tile from INT8 inputs in 32 outer-product accumulations — compared to 1,024 individual multiply-accumulate operations in scalar code. Early projections show SME delivering 4–8× inference throughput improvement over SVE2 SDOT for matrix-heavy layers, with the largest gains on fully-connected and attention layers where GEMM dominates.

Key Insight: SVE's key advantage over AVX-512 isn't just wider vectors — it's the predicate register file. Instead of mask registers being a scarce 8-register resource (k0–k7), SVE offers 16 predicate registers with clean loop-tail handling via WHILELT. This makes compiler auto-vectorisation dramatically more reliable and eliminates the scalar epilogue loops that reduce practical AVX-512 speedup.

Conclusion & Next Steps

SVE and SVE2 represent a fundamental rethinking of how vector processors should work. Instead of encoding fixed widths into the ISA and forcing recompilation every hardware generation, ARM made the vector length an implementation-invisible parameter. The key concepts from this part:

  • Z0–Z31 — scalable vector registers (VL bits wide) aliasing NEON V registers at their lower 128 bits
  • P0–P15 — scalable predicate registers (VL/8 bits) enabling per-lane masking in /M (merging) and /Z (zeroing) modes
  • WHILELT/INCW loop pattern — the canonical VL-agnostic loop that eliminates scalar cleanup epilogues
  • FFR and LDFF1 — first-fault loads for safe speculative vectorisation near page boundaries
  • Contiguous, gather/scatter, and interleaved — comprehensive memory access patterns all governed by predicates
  • SVE2 — mandatory in ARMv9, adding widening integer ops, complex arithmetic, crypto, histogram, and bit permutation
  • SME — 2D matrix tiles with outer-product accumulation for GEMM-class workloads

Exercises:
  1. VL-Agnostic memcpy — Write an SVE loop that copies n bytes from source to destination using LD1B/ST1B with WHILELT and INCB. Verify it works without changes on both 128-bit and 256-bit SVE (use qemu-aarch64 -cpu max,sve256=on to test different VLs).
  2. First-Fault strlen — Implement strlen() using LDFF1B, RDFFR, and CMPEQ to find the first zero byte. Handle the FFR cleanup with SETFFR before each speculative load. Compare cycle counts against a scalar LDRB/CBNZ loop on QEMU.
  3. Predicate Logic — Given two float32 arrays A and B of length n, write SVE code that computes C[i] = (A[i] > 0) ? A[i] * B[i] : A[i] + B[i] using predicated instructions (FCMGT to generate the predicate, then FMUL with /M and FADD with the inverted predicate via NOT). No branches allowed.

Next in the Series

In Part 10: Floating-Point & VFP Instructions, we cover IEEE-754 representation, scalar FP arithmetic, rounding mode control, FP comparison and classification, half-precision (FP16/BF16), and the interplay between the scalar FP and SIMD register file.
