
ARM Assembly Part 5: Branching, Loops & Conditional Execution

March 5, 2026 Wasil Zafar 18 min read

Master all AArch64 branch instructions: unconditional B/BR, call BL/BLR with the link register, compare-and-branch CBZ/CBNZ, test-bit-and-branch TBZ/TBNZ, conditional B.cond branches with NZCV, loop patterns, function pointers, and computed jump tables.

Table of Contents

  1. Introduction
  2. Unconditional Branches
  3. Compare & Branch
  4. Conditional Branches
  5. Loop Patterns
  6. Function Pointers & Indirect Calls
  7. Computed Jump Tables
  8. Conclusion & Next Steps

Introduction

Series Overview: This is Part 5 of our 28-part ARM Assembly Mastery Series. Parts 1–4 covered ARM history, ARM32, AArch64 registers, and arithmetic. Now we tackle control flow — the instructions that make code loop, branch, and call functions.

ARM Assembly Mastery

Your 28-step learning path (currently on Step 5):

  1. Architecture History & Core Concepts: ARMv1→v9, RISC philosophy, profiles
  2. ARM32 Instruction Set Fundamentals: ARM vs Thumb, registers, CPSR, barrel shifter
  3. AArch64 Registers, Addressing & Data Movement: X/W regs, addressing modes, load/store pairs
  4. Arithmetic, Logic & Bit Manipulation: ADD/SUB, bitfield extract/insert, CLZ
  5. Branching, Loops & Conditional Execution: Branch types, link register, jump tables (you are here)
  6. Stack, Subroutines & AAPCS: Calling conventions, prologue/epilogue
  7. Memory Model, Caches & Barriers: Weak ordering, DMB/DSB/ISB, TLB
  8. NEON & Advanced SIMD: Vector ops, intrinsics, media processing
  9. SVE & SVE2 Scalable Vector Extensions: Predicate regs, gather/scatter, HPC/ML
  10. Floating-Point & VFP Instructions: IEEE-754, scalar FP, rounding modes
  11. Exception Levels, Interrupts & Vector Tables: EL0–EL3, GIC, fault debugging
  12. MMU, Page Tables & Virtual Memory: Stage-1 translation, permissions, huge pages
  13. TrustZone & ARM Security Extensions: Secure monitor, world switching, TF-A
  14. Cortex-M Assembly & Bare-Metal Embedded: NVIC, SysTick, linker scripts, low-power
  15. Cortex-A System Programming & Boot: EL3→EL1 transitions, MMU setup, PSCI
  16. Apple Silicon & macOS ABI: ARM64e PAC, Mach-O, dyld, perf counters
  17. Inline Assembly, GCC/Clang & C Interop: Constraints, clobbers, compiler interaction
  18. Performance Profiling & Micro-Optimization: Pipeline hazards, PMU, benchmarking
  19. Reverse Engineering & ARM Binary Analysis: ELF, disassembly, CFR, iOS/Android quirks
  20. Building a Bare-Metal OS Kernel: Bootloader, UART, scheduler, context switch
  21. ARM Microarchitecture Deep Dive: OOO pipelines, reorder buffers, branch predict
  22. Virtualization Extensions: EL2 hypervisor, stage-2 translation, KVM
  23. Debugging & Tooling Ecosystem: GDB, OpenOCD/JTAG, ETM/ITM, QEMU
  24. Linkers, Loaders & Binary Format Internals: ELF deep dive, relocations, PIC, crt0
  25. Cross-Compilation & Build Systems: GCC/Clang toolchains, CMake, firmware gen
  26. ARM in Real Systems: Android, FreeRTOS/Zephyr, U-Boot, TF-A
  27. Security Research & Exploitation: ASLR, PAC attacks, ROP/JOP, kernel exploit
  28. Emerging ARMv9 & Future Directions: MTE, SME, confidential compute, AI accel

Branch Instruction Overview

If data-processing instructions are the "muscles" of a program, branches are its nervous system — they make decisions, create loops, call functions, and return results. AArch64 provides three clean families of branch instructions:

Family | Instructions | Purpose | Condition Source
Unconditional | B, BR, BL, BLR, RET | Jumps, function calls, returns | None (always taken)
Compare & Branch | CBZ, CBNZ, TBZ, TBNZ | Test a register value directly | Register value (no flags needed)
Conditional | B.cond, CSEL, CSET, CSINC, CSINV | Branch or select based on NZCV | NZCV flags (from prior ADDS/CMP/etc.)

Key Change from ARM32: AArch64 drops ARM32's pervasive conditional execution, where nearly every A32 instruction carried a 4-bit condition field and Thumb-2 code used IT blocks to predicate short runs of instructions. In AArch64, conditional behaviour is limited to B.cond and a small set of conditional data-processing instructions (the CSEL family, plus conditional compares such as CCMP). This simplifies the hardware and frees encoding space for the larger 31-register file.

PC-Relative Range Limits

Every direct branch instruction encodes its target as a PC-relative offset within the 32-bit instruction word; register-indirect branches take the full address from a register instead. Different instructions spend different numbers of bits on the offset, giving them different ranges:

Instruction | Offset Bits | Range | Typical Context
B / BL | 26 bits × 4 | ±128 MB | Function calls, long jumps (covers most binaries)
B.cond, CBZ/CBNZ | 19 bits × 4 | ±1 MB | Conditional branches, loop tests, null checks
TBZ/TBNZ | 14 bits × 4 | ±32 KB | Bit tests, usually close to the test site
BR/BLR/RET | Register (64-bit) | Full 64-bit address space | Indirect calls, function pointers, returns

Veneers (Trampolines): If a B or BL target is out of range, the linker inserts a veneer: a small trampoline near the call site that builds the full address in a scratch register (X16 or X17, which the ABI reserves for this purpose) and branches indirectly. You rarely see this in practice because ±128 MB covers most binaries, but it explains otherwise mysterious extra code in very large firmware images. Linkers generally insert veneers silently, and typically only for B/BL; an out-of-range B.cond, CBZ, or TBZ target usually surfaces as a relocation error at link time instead.
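As a rough illustration, a veneer for an out-of-range call might look like the sketch below (GNU assembler syntax). The symbol names call_far, far_target, and far_target_veneer are invented for this example, far_target is defined locally only so the file assembles on its own, and real linkers pick their own names and may emit a different instruction sequence.

```asm
    .text
    .global call_far

// Caller: a normal BL, but aimed at the nearby veneer instead of the real target
call_far:
    STP  x29, x30, [sp, #-16]!   // Non-leaf: save FP and LR
    MOV  x29, sp
    BL   far_target_veneer       // Must be within ±128 MB of this BL
    LDP  x29, x30, [sp], #16
    RET

// Veneer (hypothetical): builds the target address in x16 and jumps to it
far_target_veneer:
    ADRP x16, far_target                 // x16 = 4 KB page of the target (±4 GB reach)
    ADD  x16, x16, :lo12:far_target      // Add the low 12 bits of the address
    BR   x16                             // Plain jump: X30 still holds call_far's return point

// Stand-in for a function that would normally live far away
far_target:
    MOV  x0, #42
    RET
```

Because the veneer uses BR rather than BLR, X30 is left untouched, so far_target returns straight to the instruction after the original BL.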

Unconditional Branches

B & BR — Jump

B (Branch) is the simplest control-flow instruction — an unconditional PC-relative jump. BR (Branch Register) is its indirect sibling, jumping to the absolute 64-bit address held in a register:

// B — PC-relative unconditional jump
    B    .target              // Jump to label (PC + signed offset)
    B    _start               // Jump to symbol (linker resolves)

    // BR — Register-indirect jump (no link)
    BR   x16                  // Jump to address in x16
                               // Does NOT save return address
                               // Used for: computed gotos, tail calls,
                               // jump tables, longjmp
B vs BR: B is for "I know where I'm going at compile time" — the target is a label or symbol resolved by the assembler/linker. BR is for "the destination is computed at runtime" — function pointers, jump tables, or tail-call optimisations where the target varies.

BL & BLR — Call (Link)

BL (Branch with Link) and BLR (Branch with Link to Register) are the function call instructions. Before jumping, they save PC+4 (the address of the next instruction) into X30 (the Link Register), creating a breadcrumb trail for the callee to return to:

// Direct function call (BL)
    BL   printf              // X30 = PC+4, then jump to printf
                               // printf will RET back to here

    // Indirect function call (BLR)
    LDR  x8, [x0, #callback] // Load function pointer from struct
    BLR  x8                  // X30 = PC+4, then call via x8

    // Multiple calls in sequence
    BL   init_hardware        // X30 = addr_of_next_instr
    BL   config_clocks        // X30 updated again (overwritten!)
    BL   main                 // Each BL overwrites X30
X30 Overwrite Danger: Every BL/BLR overwrites X30. A non-leaf function (one that calls another function) must therefore save X30 to the stack in its prologue and restore it before RET. Forgetting this is a classic cause of infinite loops and crashes in hand-written assembly: after the inner call returns, X30 still points just past the function's own BL, so RET jumps back into the function body instead of to the caller.

RET — Return

RET branches to the address in X30 (by default), returning control to the caller. Critical details:

  • RET is NOT a stack pop: It only jumps to X30. It does NOT pop the return address from the stack (unlike x86-64's RET). The callee must restore X30 from the stack manually if it saved it.
  • Optional register: RET Xn returns to the address in Xn instead of X30 — rarely used but available for coroutines or custom dispatch.
  • Branch prediction hint: The processor treats RET differently from BR x30 — RET uses the return address stack predictor, which is faster. Always use RET for function returns, never BR X30.
// Leaf function (no calls → no need to save X30)
    ADD  x0, x0, x1           // Compute result
    RET                        // Return to caller via X30

    // Non-leaf function (calls other functions → must save X30)
    STP  x29, x30, [sp, #-16]!  // Save FP and LR
    MOV  x29, sp                 // Set frame pointer
    BL   helper                  // Call helper (overwrites X30)
    LDP  x29, x30, [sp], #16    // Restore FP and LR
    RET                          // Return to original caller

Compare & Branch

CBZ/CBNZ — Compare Zero

CBZ (Compare and Branch on Zero) and CBNZ (Compare and Branch on Nonzero) test a register directly without setting or reading NZCV flags. They combine a comparison with zero and a branch into a single instruction — saving the separate CMP that x86-64 requires:

// Null pointer check (most common use)
    CBZ  x0, .return_null    // if (ptr == NULL) goto return_null
    // ptr is guaranteed non-null here

    // Loop termination (count down to zero)
    SUB  x1, x1, #1          // Decrement (no S suffix = no flags)
    CBNZ x1, .loop_body      // if (count != 0) continue

    // Linked list traversal
.traverse:
    LDR  x0, [x0, #next]     // Follow next pointer
    CBZ  x0, .end_of_list    // NULL? End of list
    // Process node...
    B    .traverse

    // 32-bit variants (test W registers)
    CBZ  w0, .is_zero_32     // Test lower 32 bits only
    CBNZ w5, .nonzero_32
Why CBZ/CBNZ Matter: Null checks and count-to-zero loop tests make up a large share of the branches in typical C code. Without CBZ, each needs two instructions: CMP x0, #0 followed by B.EQ label. CBZ does the same work in one instruction and leaves NZCV untouched, which saves code size and fetch bandwidth on every such branch.

TBZ/TBNZ — Test Bit

TBZ (Test Bit and Branch if Zero) and TBNZ (Test Bit and Branch if Nonzero) test a single specific bit within a register and branch accordingly. Like CBZ/CBNZ, they don't touch the NZCV flags:

// Test bit 31 (sign bit of a 32-bit value in a 64-bit register)
    TBNZ x0, #31, .is_negative  // Branch if signed negative

    // Test bit 0 (odd/even check)
    TBZ  x0, #0, .is_even       // Branch if bit 0 is clear (even)

    // Check a specific permission flag
    TBNZ x_flags, #PERM_WRITE, .has_write  // Branch if write permitted

    // Hardware status register bit test
    MRS  x0, DAIF               // Read interrupt mask register
    TBNZ x0, #7, .irq_masked    // Branch if IRQ bit (bit 7) is masked

    // Bit-scan loop (find first set bit; x1 = bit position)
    MOV  x1, xzr                // Bit position starts at 0
.scan:
    TBZ  x0, #0, .next_bit      // Test LSB
    // Bit 0 is set: x1 holds its original position
    B    .done
.next_bit:
    LSR  x0, x0, #1             // Shift right
    ADD  x1, x1, #1             // Increment bit position
    CBNZ x0, .scan              // Continue if bits remain
.done:
Performance Pipeline
CBZ/TBZ vs CMP+B.cond — Fusion and Efficiency

On high-performance AArch64 cores (Cortex-A76 and later, Apple M-series, Neoverse V1), CBZ/CBNZ/TBZ/TBNZ decode to a single micro-op, whereas CMP + B.cond may or may not be fused depending on the core's macro-fusion rules. Even when fused, the pair still occupies two instruction slots in fetch and in the instruction cache. In tight loops this can translate to measurable throughput gains: trimming a 5-instruction loop body to 4 removes 20% of the instructions per iteration, although the actual speedup depends on whether the loop is limited by front-end bandwidth (Apple's M1, for example, is reported to decode up to 8 instructions per cycle).


Conditional Branches

NZCV Conditions Reference

AArch64 uses four condition flags, N (Negative), Z (Zero), C (Carry), and V (Overflow), collectively called NZCV. These flags are set by flag-setting instructions (ADDS, SUBS, ANDS, CMP, TST, etc.) and read by conditional branches and CSEL/CSET. There are 15 named conditions in everyday use (14 meaningful + AL/always); the one remaining encoding, 1111 (NV), also behaves as "always" in AArch64:

Reference AArch64
Condition Code Quick Reference
Code | Suffix | Meaning | Flags Test | Signed/Unsigned
0000 | EQ | Equal | Z = 1 | Both
0001 | NE | Not Equal | Z = 0 | Both
0010 | CS / HS | Carry Set / Higher or Same | C = 1 | Unsigned ≥
0011 | CC / LO | Carry Clear / Lower | C = 0 | Unsigned <
0100 | MI | Minus (Negative) | N = 1 | Signed result negative
0101 | PL | Plus (Positive or Zero) | N = 0 | Signed result ≥ 0
0110 | VS | Overflow Set | V = 1 | Signed overflow occurred
0111 | VC | Overflow Clear | V = 0 | No signed overflow
1000 | HI | Higher | C = 1 AND Z = 0 | Unsigned >
1001 | LS | Lower or Same | C = 0 OR Z = 1 | Unsigned ≤
1010 | GE | Greater or Equal | N = V | Signed ≥
1011 | LT | Less Than | N ≠ V | Signed <
1100 | GT | Greater Than | Z = 0 AND N = V | Signed >
1101 | LE | Less or Equal | Z = 1 OR N ≠ V | Signed ≤
1110 | AL | Always | (unconditional) | N/A
Signed: GE/LT/GT/LE Unsigned: HS/LO/HI/LS Both: EQ/NE
Signed vs Unsigned — The Most Common Bug: Using B.GT/B.LT (signed) when comparing unsigned values, or B.HI/B.LO (unsigned) with signed values. Example: comparing address 0xFFFF...0 with 0x1 — unsigned, the first is higher; signed, it's -16 (less). Always match the comparison type to your data type: unsigned → HS/LO/HI/LS, signed → GE/LT/GT/LE.
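The mismatch described above can be made concrete with a small sketch. The function name cmp_demo and its return encoding (bit 0 = unsigned "higher" held, bit 1 = signed "greater" held) are invented for illustration; the code assumes GNU assembler syntax.

```asm
    .text
    .global cmp_demo

// Compares 0xFFFFFFFFFFFFFFF0 with 1 using both condition families.
// Result in x0: bit 0 set if unsigned x1 > x2, bit 1 set if signed x1 > x2.
cmp_demo:
    MOV  x0, xzr             // result = 0
    MOV  x1, #-16            // 0xFFFFFFFFFFFFFFF0: huge unsigned, -16 signed
    MOV  x2, #1
    CMP  x1, x2              // Flags from x1 - x2 (ORR below does not change them)
    B.LS .skip_unsigned      // Unsigned <= : not taken here
    ORR  x0, x0, #1          // Unsigned view: x1 is higher
.skip_unsigned:
    B.LE .skip_signed        // Signed <= : taken here, because -16 <= 1
    ORR  x0, x0, #2          // Signed view: x1 is greater (skipped)
.skip_signed:
    RET                      // x0 = 1: only the unsigned test says "greater"
```

The same CMP feeds both branches; only the choice of condition code decides whether the operands are interpreted as signed or unsigned.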

B.cond — Conditional Branch

Append any condition code to B. to create a conditional branch. The condition is evaluated against NZCV flags set by the most recent flag-setting instruction. The two most common flag-setters are CMP (subtract, discard result) and TST (AND, discard result):

// If-else pattern: if (x0 > x1) { ... } else { ... }
    CMP   x0, x1             // Set NZCV based on x0 - x1
    B.GT  .greater           // Signed: x0 > x1?
    // else path here
    B     .done
.greater:
    // greater-than path
.done:

    // Unsigned comparison
    CMP   x0, x1
    B.HI  .x0_higher         // Unsigned: x0 > x1?

    // Chained comparisons (if-elseif-else)
    CMP   x0, #0
    B.LT  .negative          // x0 < 0?
    B.EQ  .zero              // x0 == 0?
    // x0 > 0 (fall-through)

    // CMN — Compare Negative (adds instead of subtracts)
    CMN   x0, #1             // Equivalent to CMP x0, #-1
    B.EQ  .is_minus_one      // x0 == -1?

    // TST-based branching (test specific bits)
    TST   x0, #0x3           // Test alignment (low 2 bits)
    B.NE  .not_aligned        // Branch if any low bits set

CSEL, CSET, CSINC, CSINV

The CSEL family provides branchless conditional data operations — they select between two values based on a condition code without branching. This eliminates branch misprediction penalties for simple conditionals:

// CSEL — Conditional Select
    CMP   x0, x1
    CSEL  x2, x0, x1, LT    // x2 = (x0 < x1) ? x0 : x1  (min)
    CSEL  x3, x1, x0, LT    // x3 = (x0 < x1) ? x1 : x0  (max)

    // CSET — Conditional Set (boolean result)
    CMP   x0, #0
    CSET  x1, GT             // x1 = (x0 > 0) ? 1 : 0

    // CSINC — Conditional Select Increment
    CMP   x0, x1
    CSINC x2, x3, x4, EQ    // x2 = (x0 == x1) ? x3 : x4+1

    // CSINV — Conditional Select Invert
    CMP   x0, #0
    CSINV x1, xzr, xzr, GE  // x1 = (x0 >= 0) ? 0 : -1 (all-ones sign mask)

    // CSNEG — Conditional Select Negate
    CMP   x0, #0
    CSNEG x1, x0, x0, GE    // x1 = (x0 >= 0) ? x0 : -x0  (abs value!)

    // Common patterns:
    // abs(x):   CSNEG x0, x0, x0, GE  (after CMP x0, #0)
    // clamp:    CMP x0, x_max; CSEL x0, x_max, x0, GT
    // bool:     CMP x0, x1; CSET x2, EQ  (x2 = x0 == x1)
Performance Branch Prediction
CSEL vs Branch — When Branchless Wins

A branch misprediction typically costs on the order of 10–15 cycles or more on modern AArch64 cores (pipeline flush + refill). For unpredictable conditions (close to 50/50), CSEL avoids that risk: it has a short, fixed latency regardless of the outcome. Compilers usually convert x = (a > b) ? c : d to CSEL at -O2 when both operands are cheap to compute. However, if the condition is highly predictable (e.g., error checking that rarely triggers), a branch is often better: CSEL must wait for both inputs and the flags, while a correctly predicted branch lets the core run ahead down the likely path without computing the other value.
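To make the trade-off concrete, here is the same signed max(a, b) written both ways. This is a sketch in GNU assembler syntax; the function names max_branchy and max_branchless are invented for the example, and both take a in x0 and b in x1 and return the result in x0.

```asm
    .text
    .global max_branchy
    .global max_branchless

// Branching version: cheap when the predictor guesses right, costly when it does not
max_branchy:
    CMP  x0, x1
    B.GE .keep_a             // a >= b: x0 already holds the answer
    MOV  x0, x1              // otherwise the answer is b
.keep_a:
    RET

// Branchless version: same cost whichever operand wins
max_branchless:
    CMP  x0, x1
    CSEL x0, x0, x1, GE      // x0 = (a >= b) ? a : b
    RET
```

Which one is faster depends on the data: for inputs where a >= b almost always holds, the branching version is likely to be predicted well, while random inputs tend to favour CSEL.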


Loop Patterns

Counted Loops (SUBS + B.NE)

The most fundamental loop pattern in assembly: decrement a counter, check if it's zero, and branch back if not. The S suffix on SUBS sets the Zero flag when the counter reaches 0, eliminating the need for a separate CMP:

// Counted loop: sum 64-element array
    MOV  x0, xzr             // sum = 0
    MOV  x2, #64             // count = 64
    ADRP x1, array; ADD x1, x1, :lo12:array
.loop:
    LDR  x3, [x1], #8        // Load element, post-increment pointer
    ADD  x0, x0, x3          // sum += element
    SUBS x2, x2, #1          // count--; set Z flag when 0
    B.NE .loop                // Repeat if count != 0
    // x0 now contains the sum

    // Alternative: CBZ-based loop (no flag setting)
    MOV  x2, #64
.loop2:
    LDR  x3, [x1], #8
    ADD  x0, x0, x3
    SUB  x2, x2, #1          // No S suffix
    CBNZ x2, .loop2           // Test register directly

    // Decrement-until-zero with early exit
    MOV  x2, #MAX_ITER
.search:
    LDR  x3, [x1], #8
    CMP  x3, x_target
    B.EQ .found               // Exit early if found
    SUBS x2, x2, #1
    B.NE .search
    // Not found (exhausted iterations)

While Loops & Do-While

A while loop tests the condition before the first iteration (may execute zero times). A do-while loop tests at the bottom (always executes at least once). Compilers typically transform while-loops into do-while with a guard check for efficiency:

// while (x0 > 0) { x0 = x0 - x1; count++; }
    // Compiler transforms to: if (x0 > 0) do { ... } while (x0 > 0)
    CMP  x0, #0
    B.LE .while_done          // Guard: skip if condition already false
.while_body:
    SUB  x0, x0, x1          // x0 -= x1
    ADD  x2, x2, #1          // count++
    CMP  x0, #0
    B.GT .while_body          // Bottom test (do-while style)
.while_done:

    // do-while: process string characters
.do_loop:
    LDRB w3, [x0], #1        // Load byte, advance pointer
    // ... process character ...
    CBNZ w3, .do_loop         // Continue until null terminator

    // strlen implementation (do-while with pointer arithmetic)
    MOV  x1, x0              // Save start pointer
.strlen_loop:
    LDRB w2, [x0], #1        // Load byte
    CBNZ w2, .strlen_loop    // Continue until \0
    SUB  x0, x0, x1          // Length = end - start
    SUB  x0, x0, #1          // Adjust for post-increment past \0

Loop Unrolling Hints

Unrolling processes multiple elements per iteration, reducing branch overhead and enabling the processor to schedule instructions more efficiently:

// 4x unrolled array sum (processes 4 elements per iteration)
    MOV  x0, xzr             // sum = 0
    MOV  x2, #64             // count (must be multiple of 4)
.unrolled:
    LDP  x3, x4, [x1], #16   // Load 2 elements
    LDP  x5, x6, [x1], #16   // Load 2 more
    ADD  x0, x0, x3
    ADD  x0, x0, x4
    ADD  x0, x0, x5
    ADD  x0, x0, x6
    SUBS x2, x2, #4
    B.NE .unrolled
Unrolling Guidelines: (1) Use LDP/STP pairs to load/store two registers at once. (2) Align loop entry to cache-line boundary with .balign 64. (3) Unroll by 2× or 4× — beyond that, diminishing returns and I-cache pressure. (4) Add a scalar cleanup loop for element counts not divisible by the unroll factor. (5) Hardware prefetchers work best with consistent, predictable stride patterns.
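Guideline (4) can be sketched as follows: the 4x unrolled loop handles whole blocks, and a scalar loop mops up the remaining 0 to 3 elements. The function name sum_u64 and its signature (pointer in x0, element count in x1, sum returned in x0) are assumptions made for this example, written in GNU assembler syntax.

```asm
    .text
    .global sum_u64

// uint64_t sum_u64(const uint64_t *p, uint64_t n)   (hypothetical signature)
sum_u64:
    MOV  x2, xzr             // sum = 0
    LSR  x3, x1, #2          // x3 = number of full 4-element blocks
    AND  x1, x1, #3          // x1 = leftover elements (0..3)
    CBZ  x3, .tail_check     // Fewer than 4 elements: skip the unrolled loop
.block:
    LDP  x4, x5, [x0], #16   // Load 2 elements, advance pointer
    LDP  x6, x7, [x0], #16   // Load 2 more
    ADD  x2, x2, x4
    ADD  x2, x2, x5
    ADD  x2, x2, x6
    ADD  x2, x2, x7
    SUBS x3, x3, #1          // One block done
    B.NE .block
.tail_check:
    CBZ  x1, .sum_done       // No leftovers
.tail:
    LDR  x4, [x0], #8        // Scalar cleanup: one element at a time
    ADD  x2, x2, x4
    SUBS x1, x1, #1
    B.NE .tail
.sum_done:
    MOV  x0, x2              // Return the sum
    RET
```

Splitting the count up front (LSR for blocks, AND for the remainder) keeps the hot loop free of any extra bounds logic.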

Function Pointers & Indirect Calls

BLR for Function Pointers

In C, every function pointer call compiles to a simple pattern: load the pointer into a register, then BLR. The AAPCS64 calling convention is identical for direct and indirect calls — arguments in X0–X7, return value in X0, return address in X30:

// C: result = callback(arg1, arg2);
// callback is a function pointer stored in a struct
    LDR  x8, [x0, #fn_offset]  // Load function pointer from struct
    MOV  x0, x1                 // First argument (shifts x1 → x0)
    MOV  x1, x2                 // Second argument
    BLR  x8                     // Call via function pointer
                                 // Return value in x0

    // Array of function pointers (dispatch table)
    ADRP x9, handler_table
    ADD  x9, x9, :lo12:handler_table
    LDR  x8, [x9, x0, LSL #3]  // handlers[event_type]
    BLR  x8                     // Dispatch to handler

    // Callback with context (like qsort comparator)
    LDR  x8, [x19, #cmp_func]  // Load comparator
    LDP  x0, x1, [x20]         // Load two items to compare
    BLR  x8                     // Call comparator(a, b)
    CMP  w0, #0                 // Comparator returns int: test w0 (upper bits of x0 are unspecified)
    B.LT .swap                   // Result < 0 (a < b): swap them
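The dispatch-table fragment above omits two things a complete routine needs: a bounds check on the index and saving X30 around the BLR. A minimal self-contained sketch follows, in GNU assembler syntax; the names dispatch, on_open, on_close, the two-entry table, and the -1 error return are all invented for illustration.

```asm
    .text
    .global dispatch

// long dispatch(unsigned long event)   (hypothetical): calls handler_table[event]
dispatch:
    CMP  x0, #2                  // Table has 2 entries
    B.HS .bad_event              // Unsigned >= 2: out of range
    STP  x29, x30, [sp, #-16]!   // BLR overwrites X30, so save it
    MOV  x29, sp
    ADRP x9, handler_table
    ADD  x9, x9, :lo12:handler_table
    LDR  x8, [x9, x0, LSL #3]    // x8 = handler_table[event] (8-byte entries)
    BLR  x8                      // Indirect call; handler's result is in x0
    LDP  x29, x30, [sp], #16
    RET
.bad_event:
    MOV  x0, #-1                 // Error code for unknown events
    RET

on_open:
    MOV  x0, #100
    RET
on_close:
    MOV  x0, #200
    RET

    .data
    .balign 8
handler_table:
    .dword on_open, on_close     // Absolute 64-bit function addresses
```

The unsigned B.HS check also rejects values that would be negative if the index were interpreted as signed, so one comparison covers both ends of the range.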

C++ vtable Dispatch Pattern

Virtual method calls in C++ follow a characteristic two-level indirection: load the vtable pointer from the object, then load the method pointer from the vtable slot:

// C++: obj->virtual_method(arg);
// Object layout: +0 = vptr, +8 = first data member
    LDR  x8, [x0]              // Load vptr (first word of object)
    LDR  x9, [x8, #16]         // Load vtable slot 2 (offset = N*8)
    BLR  x9                     // Call virtual method
                                 // x0 = 'this' pointer (already there)

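    // Apple ARM64e: PAC-authenticated vtable dispatch (simplified)
    LDR  x8, [x0]              // Load vptr
    LDR  x9, [x8, #16]         // Load vtable slot
    BLRAAZ x9                   // Authenticate with key A (zero modifier) and call
                                 // An invalid PAC makes the call fault (ROP/JOP protection)
                                 // Real arm64e code also authenticates the vptr and
                                 // typically uses BLRAA with a non-zero discriminator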
Case Study Security
Pointer Authentication Codes (PAC) on Apple Silicon

Apple's ARM64e ABI uses Pointer Authentication to cryptographically sign return addresses and function pointers. BL itself stores a plain return address in X30; the callee's prologue then signs it with PACIASP or PACIBSP, which embeds a PAC (computed from the address, the current SP, and a secret key) in the unused upper bits of the pointer. The epilogue authenticates it with AUTIASP/AUTIBSP, or with the combined RETAA/RETAB forms, before returning. If an attacker has overwritten the saved X30 (a ROP attack), authentication fails and the return faults instead of reaching attacker-chosen code, so the process is killed. Authenticated indirect branches use instructions such as BLRAAZ/BRAAZ.
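A sign-on-entry, authenticate-on-exit frame can be sketched as below. This assumes GNU assembler syntax with the ARMv8.3-A extension enabled and uses the A-key forms; the function names pac_protected and helper are invented for the example, and Apple's own toolchain differs in syntax and typically uses the B-key variants.

```asm
    .arch armv8.3-a
    .text
    .global pac_protected

// Non-leaf function whose saved return address is PAC-protected
pac_protected:
    PACIASP                      // Sign X30 with key A, using SP as the modifier
    STP  x29, x30, [sp, #-16]!   // The signed X30 is what lands on the stack
    MOV  x29, sp
    BL   helper                  // Overwrites X30 (plain, unsigned address)
    LDP  x29, x30, [sp], #16     // Reload the signed return address
    AUTIASP                      // Authenticate; a tampered value becomes an invalid address
    RET                          // Faults if authentication failed

helper:
    RET                          // Leaf: X30 never leaves the register, no signing needed
```

PACIASP and AUTIASP are encoded in the hint space, so on cores without pointer authentication they execute as NOPs and the function still behaves correctly, just without the protection.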


Computed Jump Tables

switch() Lowering

When a switch statement has dense, contiguous cases (e.g., 0–7), compilers generate a jump table — an array of target addresses indexed by the switch value. This is O(1) dispatch vs O(n) for a chain of compare-and-branch. The pattern has three steps: (1) bounds-check to handle the default case, (2) index into the table to load the target, (3) branch via BR:

// Compiler-generated switch/jump-table pattern
// switch (event_type) { case 0: ... case 7: ... default: ... }
    CMP   x0, #7              // Bounds check against max case
    B.HI  .default             // Out-of-range → default handler
    ADRP  x1, .jtable
    ADD   x1, x1, :lo12:.jtable
    LDR   x2, [x1, x0, LSL #3] // Load target address (8-byte entries)
    BR    x2                    // Jump to case handler

    .balign 8
.jtable:
    .dword case0, case1, case2, case3
    .dword case4, case5, case6, case7

case0:
    // Handle event 0
    B     .switch_done
case1:
    // Handle event 1
    B     .switch_done
    // ... cases 2–7 ...
.default:
    // Default handler
.switch_done:
When Do Compilers Use Jump Tables? GCC/Clang heuristics generally emit a jump table when: (1) there are 4+ cases, (2) the case values are reasonably dense (sparse switches use binary search or cascaded branches), (3) the range isn't excessively large (e.g., case 0 and case 1000000 would not table-ify). You can guide the compiler with __builtin_expect for the most common case.

PIC-Safe Jump Tables

Position-Independent Code (PIC) cannot embed absolute addresses in jump tables because the code may be loaded at any virtual address (shared libraries, ASLR). Instead, entries store PC-relative offsets from a known base. The loader never needs to patch the table:

// PIC-safe jump table: entries are signed offsets from table base
    CMP   x0, #3                // Table below has 4 entries (cases 0–3)
    B.HI  .default
    ADR   x1, .pic_jtable       // PC-relative address of table
    LDRSW x2, [x1, x0, LSL #2] // Load 32-bit signed offset
    ADD   x2, x1, x2           // target = table_base + offset
    BR    x2                    // Jump to case handler

    .balign 4
.pic_jtable:
    .word case0 - .pic_jtable   // Offset from table base
    .word case1 - .pic_jtable
    .word case2 - .pic_jtable
    .word case3 - .pic_jtable
Case Study Compilers
How GCC and Clang Differ on Jump Tables

The exact encoding varies by compiler, version, and optimisation level, so treat the following as typical rather than guaranteed. Clang/LLVM commonly emits 32-bit signed offsets relative to the table base and dispatches with LDRSW + ADD + BR, the pattern shown above. GCC usually compresses the table further: it stores byte or halfword entries (loaded with LDRB/LDRH) that are scaled by 4 and added to an anchor label obtained with ADR. Both approaches are position-independent and need no dynamic relocations for the table. When reverse-engineering a binary, recognising these patterns tells you "this is a switch statement"; the number of table entries matches the span of case values covered, with any gaps pointing at the default handler.

Key Insight: AArch64 replaces ARM32's per-instruction condition field (and Thumb-2 IT blocks) with a small set of dedicated instructions: B.cond, CBZ/CBNZ, TBZ/TBNZ, and the CSEL/CSET family. This simplifies the hardware and keeps the encoding space clean for the larger register file.

Conclusion & Next Steps

This part covered the complete AArch64 branching toolkit. Unconditional branches (B, BR, BL, BLR, RET) handle jumps, calls, and returns. Compare & Branch (CBZ/CBNZ, TBZ/TBNZ) provide efficient zero-test and bit-test patterns without touching the flags. Conditional branches (B.cond) use the 15 condition codes against NZCV flags, while CSEL/CSET/CSINC/CSINV/CSNEG deliver branchless conditional data selection. For loops, the SUBS + B.NE idiom handles counted loops, while-loops canonicalize to do-while with guard checks, and manual unrolling with LDP boosts throughput. Function pointers compile to a simple LDR + BLR pattern, C++ vtable dispatch adds one extra indirection, and PAC secures indirect calls on Apple Silicon. Jump tables give O(1) switch dispatch with PIC-safe offset encoding for shared libraries.

Practice Exercises:
  1. Fibonacci Loop: Write a counted loop that computes the first 20 Fibonacci numbers, storing each in an array. Use SUBS + B.NE for the loop and ADD for the recurrence.
  2. Branchless Min/Max: Given three values in X0, X1, X2, compute the minimum and maximum using only CMP + CSEL (no B.cond). Store min in X3, max in X4.
  3. PIC Jump Table: Write a 4-case switch statement using a PIC-safe jump table. Each case should load a different constant into X0. Test it under gcc -fPIC to verify it links correctly in a shared library.

Next in the Series

In Part 6: Stack, Subroutines & AAPCS, we formalize the AArch64 Procedure Call Standard — register assignment rules, caller/callee-saved registers, stack frame layout, variadic functions, and the complete prologue/epilogue pattern you need to write ABI-compliant assembly.
