ARM Assembly Part 5: Branching, Loops & Conditional Execution

Introduction

                        
Series Overview: This is Part 5 of our 28-part ARM Assembly Mastery Series. Parts 1–4 covered ARM history, ARM32, AArch64 registers, and arithmetic. Now we tackle control flow: the instructions that make code loop, branch, and call functions.
                    

Branch Instruction Overview

If data-processing instructions are the "muscles" of a program, branches are its nervous system — they make decisions, create loops, call functions, and return results. AArch64 provides three clean families of branch instructions:

Family	Instructions	Purpose	Condition Source
Unconditional	B, BR, BL, BLR, RET	Jumps, function calls, returns	None — always taken
Compare & Branch	CBZ, CBNZ, TBZ, TBNZ	Test a register value directly	Register value (no flags needed)
Conditional	B.cond, CSEL, CSET, CSINC, CSINV	Branch or select based on NZCV	NZCV flags (from prior ADDS/CMP/etc.)

                        
Key Change from ARM32: AArch64 drops the pervasive conditional execution of ARM32, where nearly every A32 instruction carried a 4-bit condition field and Thumb-2 code used IT blocks to predicate up to four following instructions. Conditional behaviour is now confined to B.cond and a small set of conditional data instructions (CSEL, CSET, CSINC, CSINV, CSNEG, and the conditional compares CCMP/CCMN). This keeps the hardware simpler and frees encoding bits for the larger 31-register file.
                    

PC-Relative Range Limits

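Every direct branch encodes its target as a signed PC-relative offset inside the 32-bit instruction word, counted in 4-byte instructions. Different instructions give up different amounts of encoding space to the offset, which gives them different ranges. Register-indirect branches (BR, BLR, RET) carry no offset at all and can reach any address: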

Instruction	Offset Bits	Range	Typical Context
`B` / `BL`	26 bits × 4	±128 MB	Function calls, long jumps — covers most binaries
`B.cond`, `CBZ`/`CBNZ`	19 bits × 4	±1 MB	Conditional branches, loop tests, null checks
`TBZ`/`TBNZ`	14 bits × 4	±32 KB	Bit tests — usually close to the test site
`BR`/`BLR`/`RET`	Register (64-bit)	Full 64-bit address space	Indirect calls, function pointers, returns
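
The ranges in the table follow directly from the immediate width: a signed N-bit count of 4-byte instructions. A minimal Python sketch of that arithmetic (the helper name `branch_range` is ours, chosen for illustration only):

```python
def branch_range(imm_bits: int) -> tuple:
    """Return (lowest, highest) byte offset reachable by a signed
    imm_bits-wide immediate that counts 4-byte instructions."""
    lowest = -(1 << (imm_bits - 1)) * 4
    highest = ((1 << (imm_bits - 1)) - 1) * 4
    return lowest, highest


for name, bits in [("B/BL", 26), ("B.cond/CBZ/CBNZ", 19), ("TBZ/TBNZ", 14)]:
    lo, hi = branch_range(bits)
    print(f"{name:16} {lo:+d} .. {hi:+d} bytes")
```

Note that the forward reach is one instruction (4 bytes) short of the round figure, because a two's-complement immediate has one more negative value than positive.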

                        
Veneers (Trampolines): If a B or BL target is out of range, the linker inserts a veneer: a small trampoline that materialises the full address (conventionally in the scratch registers X16/X17) and performs an indirect branch. You rarely see this in practice because ±128 MB covers most binaries, but it explains otherwise mysterious extra code in very large firmware images. Veneers are added silently and only for B/BL. A B.cond, CBZ, or TBZ whose target is out of range is normally reported as a link-time relocation error instead, and the usual fix is to invert the condition and branch around an unconditional B.
                    

Unconditional Branches

B & BR — Jump

B (Branch) is the simplest control-flow instruction — an unconditional PC-relative jump. BR (Branch Register) is its indirect sibling, jumping to the absolute 64-bit address held in a register:

Figure: AArch64 branch instruction types and their control flow: B (PC-relative), BR (register-indirect), BL (call with link), and RET (return).

// B — PC-relative unconditional jump
    B    .target              // Jump to label (PC + signed offset)
    B    _start               // Jump to symbol (linker resolves)

    // BR — Register-indirect jump (no link)
    BR   x16                  // Jump to address in x16
                               // Does NOT save return address
                               // Used for: computed gotos, tail calls,
                               // jump tables, longjmp

                        
                        B vs BR: B is for "I know where I'm going at compile time" — the target is a label or symbol resolved by the assembler/linker. BR is for "the destination is computed at runtime" — function pointers, jump tables, or tail-call optimisations where the target varies.
                    

BL & BLR — Call (Link)

BL (Branch with Link) and BLR (Branch with Link to Register) are the function call instructions. Before jumping, they save PC+4 (the address of the next instruction) into X30 (the Link Register), creating a breadcrumb trail for the callee to return to:

// Direct function call (BL)
    BL   printf              // X30 = PC+4, then jump to printf
                               // printf will RET back to here

    // Indirect function call (BLR)
    LDR  x8, [x0, #callback] // Load function pointer from struct
    BLR  x8                  // X30 = PC+4, then call via x8

    // Multiple calls in sequence
    BL   init_hardware        // X30 = addr_of_next_instr
    BL   config_clocks        // X30 updated again (overwritten!)
    BL   main                 // Each BL overwrites X30

                        
X30 Overwrite Danger: Every BL/BLR overwrites X30. A non-leaf function (one that calls another function) must therefore save X30 to the stack in its prologue and restore it before RET. Forgetting this is a classic cause of infinite loops and crashes in hand-written assembly: RET branches to the instruction after the function's own most recent BL instead of back to the caller.
                    

RET — Return

RET branches to the address in X30 (by default), returning control to the caller. Critical details:

RET is NOT a stack pop: It only jumps to X30. It does NOT pop the return address from the stack (unlike x86-64's RET). The callee must restore X30 from the stack manually if it saved it.
Optional register: RET Xn returns to the address in Xn instead of X30 — rarely used but available for coroutines or custom dispatch.
Branch prediction hint: The processor treats RET differently from BR x30 — RET uses the return address stack predictor, which is faster. Always use RET for function returns, never BR X30.

// Leaf function (no calls → no need to save X30)
    ADD  x0, x0, x1           // Compute result
    RET                        // Return to caller via X30

    // Non-leaf function (calls other functions → must save X30)
    STP  x29, x30, [sp, #-16]!  // Save FP and LR
    MOV  x29, sp                 // Set frame pointer
    BL   helper                  // Call helper (overwrites X30)
    LDP  x29, x30, [sp], #16    // Restore FP and LR
    RET                          // Return to original caller
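
To see why the non-leaf version needs the STP/LDP pair, here is a deliberately tiny Python model of X30 across a `main -> outer -> helper` call chain. The function name and the address strings such as `"main+4"` are hypothetical placeholders, and the model tracks only the link register and a stack slot, nothing else.

```python
def outer_returns_to(save_lr: bool) -> str:
    """Model main -> outer -> helper and report where outer's RET lands."""
    stack = []
    x30 = "main+4"            # main executed BL outer: X30 = return address in main
    if save_lr:
        stack.append(x30)     # STP x29, x30, [sp, #-16]!
    x30 = "outer+8"           # outer executes BL helper: X30 is overwritten
    # helper is a leaf: its RET jumps to "outer+8" and leaves X30 unchanged
    if save_lr:
        x30 = stack.pop()     # LDP x29, x30, [sp], #16
    return x30                # outer's RET branches to whatever X30 holds now


print(outer_returns_to(save_lr=True))    # back in main
print(outer_returns_to(save_lr=False))   # back inside outer: an endless loop
```

Without the save, `outer` returns to the instruction after its own BL, reaches its RET again, and repeats.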

Compare & Branch

CBZ/CBNZ — Compare Zero

CBZ (Compare and Branch on Zero) and CBNZ (Compare and Branch on Nonzero) test a register directly without setting or reading NZCV flags. They combine a comparison with zero and a branch into a single instruction — saving the separate CMP that x86-64 requires:

Figure: CBZ/CBNZ instruction flow, showing compare-and-branch in a single instruction versus the two-instruction CMP + B.EQ alternative.

// Null pointer check (most common use)
    CBZ  x0, .return_null    // if (ptr == NULL) goto return_null
    // ptr is guaranteed non-null here

    // Loop termination (count down to zero)
    SUB  x1, x1, #1          // Decrement (no S suffix = no flags)
    CBNZ x1, .loop_body      // if (count != 0) continue

    // Linked list traversal
.traverse:
    LDR  x0, [x0, #next]     // Follow next pointer
    CBZ  x0, .end_of_list    // NULL? End of list
    // Process node...
    B    .traverse

    // 32-bit variants (test W registers)
    CBZ  w0, .is_zero_32     // Test lower 32 bits only
    CBNZ w5, .nonzero_32
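
The W-register forms look only at the low 32 bits, which matters when the upper half of the X register holds stale data. A small Python model of the test, offered as a sketch (the helper names `cbz_taken` and `cbnz_taken` are ours):

```python
def cbz_taken(reg_value: int, width: int = 64) -> bool:
    """True if CBZ would branch: the low `width` bits are all zero."""
    if width not in (32, 64):
        raise ValueError("width must be 32 (W register) or 64 (X register)")
    mask = (1 << width) - 1
    return (reg_value & mask) == 0


def cbnz_taken(reg_value: int, width: int = 64) -> bool:
    """CBNZ branches exactly when CBZ would not."""
    return not cbz_taken(reg_value, width)


value = 0x0000_0001_0000_0000      # bit 32 set, low 32 bits clear
print(cbz_taken(value, 64))        # CBZ x0: not taken
print(cbz_taken(value, 32))        # CBZ w0: taken
```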

                        
Why CBZ/CBNZ Matter: Null checks and count-down loop tests are among the most common branches in compiled C code. Without CBZ, each requires two instructions: CMP x0, #0 followed by B.EQ label. CBZ does the same work in one instruction and leaves NZCV untouched, which saves code size and fetch bandwidth and lets flags set earlier survive across the test.
                    

TBZ/TBNZ — Test Bit

TBZ (Test Bit and Branch if Zero) and TBNZ (Test Bit and Branch if Nonzero) test a single specific bit within a register and branch accordingly. Like CBZ/CBNZ, they don't touch the NZCV flags:

// Test bit 31 (sign bit of a 32-bit value in a 64-bit register)
    TBNZ x0, #31, .is_negative  // Branch if signed negative

    // Test bit 0 (odd/even check)
    TBZ  x0, #0, .is_even       // Branch if bit 0 is clear (even)

    // Check a specific permission flag (x9 holds the flags word;
    // PERM_WRITE is a bit index 0-63 defined with .equ, not a mask)
    TBNZ x9, #PERM_WRITE, .has_write  // Branch if write permitted

    // Hardware status register bit test
    MRS  x0, DAIF               // Read interrupt mask register
    TBNZ x0, #7, .irq_masked    // Branch if IRQ bit (bit 7) is masked

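    // Bit-scan loop (find first set bit; x1 = its bit position)
    MOV  x1, #0                 // Bit position starts at 0
.scan:
    TBZ  x0, #0, .next_bit      // Test LSB
    // Bit 0 is set: x1 holds its original position
    B    .done
.next_bit:
    LSR  x0, x0, #1             // Shift right
    ADD  x1, x1, #1             // Increment bit position
    CBNZ x0, .scan              // Continue if bits remain
    // Falls through with x0 == 0 when no bit was set
.done: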
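
The same bit test and scan can be modelled in a few lines of Python. This is a sketch of the outcome rather than a line-for-line translation: the helper names are ours, and for an all-zero input it returns `None` where the assembly simply runs out of bits.

```python
from typing import Optional

MASK64 = (1 << 64) - 1


def tbz_taken(reg_value: int, bit: int) -> bool:
    """True if TBZ would branch: the selected bit is clear."""
    return ((reg_value >> bit) & 1) == 0


def first_set_bit(value: int) -> Optional[int]:
    """Index of the lowest set bit, or None if no bit is set."""
    value &= MASK64
    position = 0
    while value:
        if not tbz_taken(value, 0):   # bit 0 set: TBZ falls through
            return position
        value >>= 1                   # LSR x0, x0, #1
        position += 1                 # ADD x1, x1, #1
    return None


print(first_set_bit(0b101000))
```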


CBZ/TBZ vs CMP+B.cond — Fusion and Efficiency

On high-performance AArch64 cores (Cortex-A76 and later, Apple M-series, Neoverse V1), CBZ/CBNZ/TBZ/TBNZ decode as a single micro-op, while CMP + B.cond may or may not be fused depending on the core's macro-fusion rules. Even when the pair is fused, the compare-and-branch form saves one instruction's worth of fetch and decode bandwidth. In a tight loop, replacing CMP + B.cond with CBNZ shrinks a 5-instruction body to 4, which is 20% fewer instructions per iteration. Whether that becomes a measurable speed-up depends on where the loop is bottlenecked: on a wide core such as Apple's M1, which can decode up to 8 instructions per cycle, a short loop is more often limited by its dependency chain or by taken-branch throughput than by instruction count.


Conditional Branches

NZCV Conditions Reference

AArch64 uses four condition flags, N (Negative), Z (Zero), C (Carry), and V (Overflow), collectively called NZCV. These flags are set by flag-setting instructions (ADDS, SUBS, ANDS, CMP, TST, etc.) and read by conditional branches and CSEL/CSET. The table below lists 15 named conditions: 14 flag tests plus AL (always). The remaining encoding, 1111 (NV), also behaves as "always" in AArch64:

Figure: NZCV condition flags and their relationship to the 15 condition codes used by B.cond.


Condition Code Quick Reference

Code	Suffix	Meaning	Flags Test	Signed/Unsigned
0000	`EQ`	Equal	Z = 1	Both
0001	`NE`	Not Equal	Z = 0	Both
0010	`CS / HS`	Carry Set / Higher or Same	C = 1	Unsigned ≥
0011	`CC / LO`	Carry Clear / Lower	C = 0	Unsigned <
0100	`MI`	Minus (Negative)	N = 1	Signed result negative
0101	`PL`	Plus (Positive or Zero)	N = 0	Signed result ≥ 0
0110	`VS`	Overflow Set	V = 1	Signed overflow occurred
0111	`VC`	Overflow Clear	V = 0	No signed overflow
1000	`HI`	Higher	C = 1 AND Z = 0	Unsigned >
1001	`LS`	Lower or Same	C = 0 OR Z = 1	Unsigned ≤
1010	`GE`	Greater or Equal	N = V	Signed ≥
1011	`LT`	Less Than	N ≠ V	Signed <
1100	`GT`	Greater Than	Z = 0 AND N = V	Signed >
1101	`LE`	Less or Equal	Z = 1 OR N ≠ V	Signed ≤
1110	`AL`	Always	(unconditional)	N/A

Signed: GE/LT/GT/LE Unsigned: HS/LO/HI/LS Both: EQ/NE

                        
                        Signed vs Unsigned — The Most Common Bug: Using B.GT/B.LT (signed) when comparing unsigned values, or B.HI/B.LO (unsigned) with signed values. Example: comparing address 0xFFFF...0 with 0x1 — unsigned, the first is higher; signed, it's -16 (less). Always match the comparison type to your data type: unsigned → HS/LO/HI/LS, signed → GE/LT/GT/LE.
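
The flag tests in the table can be checked mechanically. The following Python sketch models the NZCV result of a 64-bit `CMP a, b` and evaluates each condition against it. The names `cmp_flags`, `CONDITIONS`, and `branch_taken` are ours; only the flag equations come from the table above.

```python
MASK64 = (1 << 64) - 1


def cmp_flags(a: int, b: int) -> dict:
    """NZCV after CMP a, b (a 64-bit subtract whose result is discarded)."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    sa, sb, sr = a >> 63, b >> 63, result >> 63   # sign bits
    return {
        "N": sr,
        "Z": int(result == 0),
        "C": int(a >= b),                  # carry set means "no borrow"
        "V": int(sa != sb and sr != sa),   # signed overflow of a - b
    }


CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "HS": lambda f: f["C"] == 1,
    "LO": lambda f: f["C"] == 0,
    "MI": lambda f: f["N"] == 1,
    "PL": lambda f: f["N"] == 0,
    "VS": lambda f: f["V"] == 1,
    "VC": lambda f: f["V"] == 0,
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
}


def branch_taken(cond: str, a: int, b: int) -> bool:
    """Would B.<cond> be taken immediately after CMP a, b?"""
    return CONDITIONS[cond](cmp_flags(a, b))


high = 0xFFFF_FFFF_FFFF_FFF0               # -16 when read as signed
print(branch_taken("HI", high, 1))         # unsigned view: higher
print(branch_taken("GT", high, 1))         # signed view: not greater
```

The same register contents give opposite answers for HI and GT, which is exactly the bug described above.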
                    

B.cond — Conditional Branch

Append any condition code to B. to create a conditional branch. The condition is evaluated against NZCV flags set by the most recent flag-setting instruction. The two most common flag-setters are CMP (subtract, discard result) and TST (AND, discard result):

// If-else pattern: if (x0 > x1) { ... } else { ... }
    CMP   x0, x1             // Set NZCV based on x0 - x1
    B.GT  .greater           // Signed: x0 > x1?
    // else path here
    B     .done
.greater:
    // greater-than path
.done:

    // Unsigned comparison
    CMP   x0, x1
    B.HI  .x0_higher         // Unsigned: x0 > x1?

    // Chained comparisons (if-elseif-else)
    CMP   x0, #0
    B.LT  .negative          // x0 < 0?
    B.EQ  .zero              // x0 == 0?
    // x0 > 0 (fall-through)

    // CMN — Compare Negative (adds instead of subtracts)
    CMN   x0, #1             // Equivalent to CMP x0, #-1
    B.EQ  .is_minus_one      // x0 == -1?

    // TST-based branching (test specific bits)
    TST   x0, #0x3           // Test alignment (low 2 bits)
    B.NE  .not_aligned        // Branch if any low bits set

CSEL, CSET, CSINC, CSINV

The CSEL family provides branchless conditional data operations — they select between two values based on a condition code without branching. This eliminates branch misprediction penalties for simple conditionals:

// CSEL — Conditional Select
    CMP   x0, x1
    CSEL  x2, x0, x1, LT    // x2 = (x0 < x1) ? x0 : x1  (min)
    CSEL  x3, x1, x0, LT    // x3 = (x0 < x1) ? x1 : x0  (max)

    // CSET — Conditional Set (boolean result)
    CMP   x0, #0
    CSET  x1, GT             // x1 = (x0 > 0) ? 1 : 0

    // CSINC — Conditional Select Increment
    CMP   x0, x1
    CSINC x2, x3, x4, EQ    // x2 = (x0 == x1) ? x3 : x4+1

    // CSINV — Conditional Select Invert
    CMP   x0, #0
    CSINV x1, xzr, xzr, GE  // x1 = (x0 >= 0) ? 0 : -1 (sign extension)

    // CSNEG — Conditional Select Negate
    CMP   x0, #0
    CSNEG x1, x0, x0, GE    // x1 = (x0 >= 0) ? x0 : -x0  (abs value!)

    // Common patterns:
    // abs(x):   CSNEG x0, x0, x0, GE  (after CMP x0, #0)
    // clamp:    CMP x0, x_max; CSEL x0, x_max, x0, GT
    // bool:     CMP x0, x1; CSET x2, EQ  (x2 = x0 == x1)
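
For readers who prefer to see the selection rules as expressions, here is a Python sketch of the CSEL family on 64-bit values. The function names mirror the mnemonics but are our own helpers. `cond` stands for the already-evaluated condition, and `cset` is written as the CSINC alias it is defined to be.

```python
MASK64 = (1 << 64) - 1


def to_signed(value: int) -> int:
    """Interpret a 64-bit register value as two's-complement."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def csel(cond: bool, a: int, b: int) -> int:
    return (a if cond else b) & MASK64


def csinc(cond: bool, a: int, b: int) -> int:
    return (a if cond else b + 1) & MASK64


def csinv(cond: bool, a: int, b: int) -> int:
    return (a if cond else ~b) & MASK64


def csneg(cond: bool, a: int, b: int) -> int:
    return (a if cond else -b) & MASK64


def cset(cond: bool) -> int:
    # CSET Rd, cond is an alias of CSINC Rd, XZR, XZR, <inverted cond>
    return csinc(not cond, 0, 0)


def abs64(x: int) -> int:
    """CMP x0, #0 ; CSNEG x0, x0, x0, GE"""
    x &= MASK64
    return csneg(to_signed(x) >= 0, x, x)


print(abs64(-5), cset(True), hex(csinv(False, 0, 0)))
```

One edge case falls out of the model: the most negative value has no positive counterpart, so `abs64(-2**63)` returns the same bit pattern unchanged.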


CSEL vs Branch — When Branchless Wins

A branch misprediction typically costs on the order of 10–15 cycles on modern AArch64 cores (pipeline flush plus refill). For unpredictable conditions (close to 50/50), CSEL avoids that risk entirely: it has a fixed latency of about one cycle. Compilers usually convert x = (a > b) ? c : d to CSEL at -O2. If the condition is highly predictable (for example, error checks that rarely trigger), a branch is often better: CSEL must wait for the flags and for both source operands, whereas a correctly predicted branch lets the core run ahead speculatively down the likely path without ever computing the other one.


Loop Patterns

Counted Loops (SUBS + B.NE)

The most fundamental loop pattern in assembly: decrement a counter, check if it's zero, and branch back if not. The S suffix on SUBS sets the Zero flag when the counter reaches 0, eliminating the need for a separate CMP:

Figure: Counted loop, while loop, and do-while loop patterns in AArch64 assembly, with control-flow diagrams.

// Counted loop: sum 64-element array
    MOV  x0, xzr             // sum = 0
    MOV  x2, #64             // count = 64
    ADRP x1, array; ADD x1, x1, :lo12:array
.loop:
    LDR  x3, [x1], #8        // Load element, post-increment pointer
    ADD  x0, x0, x3          // sum += element
    SUBS x2, x2, #1          // count--; set Z flag when 0
    B.NE .loop                // Repeat if count != 0
    // x0 now contains the sum

    // Alternative: CBZ-based loop (no flag setting)
    MOV  x2, #64
.loop2:
    LDR  x3, [x1], #8
    ADD  x0, x0, x3
    SUB  x2, x2, #1          // No S suffix
    CBNZ x2, .loop2           // Test register directly

    // Decrement-until-zero with early exit
    MOV  x2, #MAX_ITER
.search:
    LDR  x3, [x1], #8
    CMP  x3, x4               // x4 = target value (loaded before the loop)
    B.EQ .found               // Exit early if found
    SUBS x2, x2, #1
    B.NE .search
    // Not found (exhausted iterations)
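
Because SUBS + B.NE tests at the bottom, the body always runs once before the counter is checked. The Python sketch below models the counter register to show the consequence for a zero start value. An 8-bit register is used so the wrap-around is cheap to simulate; both function names are ours.

```python
def simulate_subs_bne(count: int, bits: int = 8) -> int:
    """Step a bottom-tested SUBS/B.NE loop and count body executions."""
    mask = (1 << bits) - 1
    x2 = count & mask
    iterations = 0
    while True:
        iterations += 1          # loop body
        x2 = (x2 - 1) & mask     # SUBS x2, x2, #1 (wraps like the register)
        if x2 == 0:              # Z set: B.NE falls through
            return iterations


def expected_iterations(count: int, bits: int = 64) -> int:
    """Closed form: a zero start wraps and needs 2**bits passes."""
    count &= (1 << bits) - 1
    return count if count != 0 else 1 << bits


print(simulate_subs_bne(64))   # runs 64 times
print(simulate_subs_bne(0))    # wraps: runs 256 times on the 8-bit model
```

On a real 64-bit counter the zero case means 2^64 iterations, so guard the loop entry with a CBZ whenever the count can be zero.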

While Loops & Do-While

A while loop tests the condition before the first iteration (may execute zero times). A do-while loop tests at the bottom (always executes at least once). Compilers typically transform while-loops into do-while with a guard check for efficiency:

// while (x0 > 0) { x0 = x0 - x1; count++; }
    // Compiler transforms to: if (x0 > 0) do { ... } while (x0 > 0)
    CMP  x0, #0
    B.LE .while_done          // Guard: skip if condition already false
.while_body:
    SUB  x0, x0, x1          // x0 -= x1
    ADD  x2, x2, #1          // count++
    CMP  x0, #0
    B.GT .while_body          // Bottom test (do-while style)
.while_done:

    // do-while: process string characters
.do_loop:
    LDRB w3, [x0], #1        // Load byte, advance pointer
    // ... process character ...
    CBNZ w3, .do_loop         // Continue until null terminator

    // strlen implementation (do-while with pointer arithmetic)
    MOV  x1, x0              // Save start pointer
.strlen_loop:
    LDRB w2, [x0], #1        // Load byte
    CBNZ w2, .strlen_loop    // Continue until \0
    SUB  x0, x0, x1          // Length = end - start
    SUB  x0, x0, #1          // Adjust for post-increment past \0

Loop Unrolling Hints

Unrolling processes multiple elements per iteration, reducing branch overhead and enabling the processor to schedule instructions more efficiently:

// 4x unrolled array sum (processes 4 elements per iteration)
    MOV  x0, xzr             // sum = 0
    MOV  x2, #64             // count (must be multiple of 4)
.unrolled:
    LDP  x3, x4, [x1], #16   // Load 2 elements
    LDP  x5, x6, [x1], #16   // Load 2 more
    ADD  x0, x0, x3
    ADD  x0, x0, x4
    ADD  x0, x0, x5
    ADD  x0, x0, x6
    SUBS x2, x2, #4
    B.NE .unrolled

                        
                        Unrolling Guidelines: (1) Use LDP/STP pairs to load/store two registers at once. (2) Align loop entry to cache-line boundary with .balign 64. (3) Unroll by 2× or 4× — beyond that, diminishing returns and I-cache pressure. (4) Add a scalar cleanup loop for element counts not divisible by the unroll factor. (5) Hardware prefetchers work best with consistent, predictable stride patterns.
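
Guideline (4) is the one most often forgotten. As a language-neutral sketch of the control structure, here is a 4x unrolled sum in Python with a scalar cleanup loop for the leftover elements. The function name is ours, and 64-bit wrap-around of the accumulator is not modelled.

```python
def unrolled_sum(values) -> int:
    """Sum a sequence four elements at a time, then mop up the remainder."""
    total = 0
    n = len(values)
    main_end = n - (n % 4)       # largest multiple of 4 not exceeding n
    i = 0
    while i < main_end:          # unrolled body: two LDP loads + four ADDs
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    while i < n:                 # scalar cleanup: 0 to 3 leftover elements
        total += values[i]
        i += 1
    return total


print(unrolled_sum(list(range(67))))
```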
                    

Function Pointers & Indirect Calls

BLR for Function Pointers

In C, every function pointer call compiles to a simple pattern: load the pointer into a register, then BLR. The AAPCS64 calling convention is identical for direct and indirect calls — arguments in X0–X7, return value in X0, return address in X30:

// C: result = callback(arg1, arg2);
// callback is a function pointer stored in a struct
    LDR  x8, [x0, #fn_offset]  // Load function pointer from struct
    MOV  x0, x1                 // First argument (shifts x1 → x0)
    MOV  x1, x2                 // Second argument
    BLR  x8                     // Call via function pointer
                                 // Return value in x0

    // Array of function pointers (dispatch table)
    ADRP x9, handler_table
    ADD  x9, x9, :lo12:handler_table
    LDR  x8, [x9, x0, LSL #3]  // handlers[event_type]
    BLR  x8                     // Dispatch to handler

    // Callback via stored pointer (like a qsort comparator)
    LDR  x8, [x19, #cmp_func]  // Load comparator
    LDP  x0, x1, [x20]         // Load the two arguments (a, b)
    BLR  x8                     // Call comparator(a, b)
    CMP  w0, #0                 // int result is in w0; upper half of x0 is unspecified
    B.LT .swap                   // Negative result (a < b): swap them

C++ vtable Dispatch Pattern

Virtual method calls in C++ follow a characteristic two-level indirection: load the vtable pointer from the object, then load the method pointer from the vtable slot:

// C++: obj->virtual_method(arg);
// Object layout: +0 = vptr, +8 = first data member
    LDR  x8, [x0]              // Load vptr (first word of object)
    LDR  x9, [x8, #16]         // Load vtable slot 2 (offset = N*8)
    BLR  x9                     // Call virtual method
                                 // x0 = 'this' pointer (already there)

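    // Apple ARM64e: PAC-authenticated vtable dispatch (typical shape;
    // METHOD_DISC stands in for the compiler-chosen 16-bit discriminator)
    LDR  x16, [x0]             // Load signed vptr
    AUTDZA x16                  // Authenticate vptr (data key A, zero modifier)
    LDR  x9, [x16, #16]!       // Load signed slot; x16 = slot address
    MOV  x17, x16               // Modifier = slot address...
    MOVK x17, #METHOD_DISC, LSL #48 // ...blended with the discriminator
    BLRAA x9, x17               // Authenticate (instruction key A) and call
                                 // A failed authentication faults instead of jumping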


Pointer Authentication Codes (PAC) on Apple Silicon

Apple's ARM64e ABI uses Pointer Authentication to cryptographically sign return addresses and function pointers. BL itself stores a plain return address in X30; the callee's prologue then signs it with a PAC instruction (PACIBSP on ARM64e, PACIASP in typical Linux builds), using a secret key and the current stack pointer as the modifier. The epilogue authenticates it with the matching AUTIBSP/AUTIASP before RET, or with a combined RETAB/RETAA. If an attacker has overwritten the saved return address (a ROP attack), authentication fails and the return faults, which normally terminates the process. Indirect calls and jumps use authenticated forms such as BLRAAZ and BRAAZ.


Computed Jump Tables

switch() Lowering

When a switch statement has dense, contiguous cases (e.g., 0–7), compilers generate a jump table — an array of target addresses indexed by the switch value. This is O(1) dispatch vs O(n) for a chain of compare-and-branch. The pattern has three steps: (1) bounds-check to handle the default case, (2) index into the table to load the target, (3) branch via BR:

Figure: Computed jump table dispatch, showing the bounds check, table indexing, and indirect branch that give O(1) switch dispatch.

// Compiler-generated switch/jump-table pattern
// switch (event_type) { case 0: ... case 7: ... default: ... }
    CMP   x0, #7              // Bounds check against max case
    B.HI  .default             // Out-of-range → default handler
    ADRP  x1, .jtable
    ADD   x1, x1, :lo12:.jtable
    LDR   x2, [x1, x0, LSL #3] // Load target address (8-byte entries)
    BR    x2                    // Jump to case handler

    .balign 8
.jtable:
    .dword case0, case1, case2, case3
    .dword case4, case5, case6, case7

case0:
    // Handle event 0
    B     .switch_done
case1:
    // Handle event 1
    B     .switch_done
    // ... cases 2–7 ...
.default:
    // Default handler
.switch_done:
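
The single unsigned comparison (CMP + B.HI) in the listing rejects both too-large and negative indices, because a negative value looks like a huge unsigned number. A minimal Python sketch of the three steps, where `HANDLERS` and `dispatch` are illustrative names and the strings stand in for code addresses:

```python
MASK64 = (1 << 64) - 1

# Stand-ins for the eight case labels stored in the table
HANDLERS = [f"case{i}" for i in range(8)]


def dispatch(x0: int) -> str:
    """Bounds-check, index the table, and 'branch' to the handler."""
    index = x0 & MASK64          # register contents viewed as unsigned
    if index > 7:                # CMP x0, #7 ; B.HI .default
        return "default"
    return HANDLERS[index]       # LDR x2, [x1, x0, LSL #3] ; BR x2


print(dispatch(3), dispatch(8), dispatch(-1))
```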

                        
                        When Do Compilers Use Jump Tables? GCC/Clang heuristics generally emit a jump table when: (1) there are 4+ cases, (2) the case values are reasonably dense (sparse switches use binary search or cascaded branches), (3) the range isn't excessively large (e.g., case 0 and case 1000000 would not table-ify). You can guide the compiler with __builtin_expect for the most common case.
                    

PIC-Safe Jump Tables

Position-Independent Code (PIC) cannot embed absolute addresses in jump tables because the code may be loaded at any virtual address (shared libraries, ASLR). Instead, entries store PC-relative offsets from a known base. The loader never needs to patch the table:

// PIC-safe jump table: entries are signed offsets from table base
    CMP   x0, #3                // Bounds check: the table below has 4 entries (cases 0–3)
    B.HI  .default
    ADR   x1, .pic_jtable       // PC-relative address of table
    LDRSW x2, [x1, x0, LSL #2] // Load 32-bit signed offset
    ADD   x2, x1, x2           // target = table_base + offset
    BR    x2                    // Jump to case handler

    .balign 4
.pic_jtable:
    .word case0 - .pic_jtable   // Offset from table base
    .word case1 - .pic_jtable
    .word case2 - .pic_jtable
    .word case3 - .pic_jtable
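
To make the "no patching needed" claim concrete, the Python sketch below builds a table of 32-bit offsets once, then resolves targets at two different load addresses. The addresses, the load bias, and the helper names are all made up for illustration; only the arithmetic mirrors the LDRSW + ADD sequence.

```python
import struct


def build_table(table_addr: int, case_addrs: list) -> bytes:
    """Encode `.word caseN - table` entries as little-endian signed 32-bit."""
    return b"".join(struct.pack("<i", addr - table_addr) for addr in case_addrs)


def resolve(table: bytes, table_addr: int, index: int) -> int:
    """LDRSW + ADD: sign-extend entry `index`, then add the table base."""
    (offset,) = struct.unpack_from("<i", table, index * 4)
    return table_addr + offset


# Hypothetical link-time layout: two handlers before the table, two after
LINK_TABLE = 0x1000
LINK_CASES = [0x0F00, 0x0F20, 0x1040, 0x1060]
table = build_table(LINK_TABLE, LINK_CASES)

for load_bias in (0, 0x7F00_0000_0000):
    base = LINK_TABLE + load_bias            # what ADR yields at run time
    targets = [resolve(table, base, i) for i in range(4)]
    print([hex(t) for t in targets])
```

The table bytes never change; only the base returned by ADR moves with the image, so every target moves with it. Signed entries also let handlers sit before the table.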


How GCC and Clang Differ on Jump Tables

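The exact sequence varies with compiler version and optimisation level, but both GCC and Clang/LLVM emit relative jump tables on AArch64, so the tables need no load-time relocations. Clang/LLVM commonly stores 32-bit signed offsets relative to the table base and dispatches with LDRSW + ADD + BR, the pattern shown above; when all targets are close it may compress the entries to bytes or halfwords. GCC typically emits compact byte or halfword entries that hold the distance, in instructions, from an anchor label: it loads the entry with LDRB/LDRH, takes the anchor address with ADR, and adds the entry shifted left by 2. When reverse-engineering a binary, recognising these patterns tells you "this is a switch statement". The number of table entries equals the span of case values covered (gaps point at the default handler), which gives an upper bound on the number of distinct cases.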


                        
Key Insight: AArch64 replaces ARM32's pervasive conditional execution (A32 condition fields and Thumb-2 IT blocks) with dedicated CBZ/CBNZ, TBZ/TBNZ, B.cond, and CSEL/CSET instructions. This keeps the hardware simpler and the encoding space clean for the larger register file.
                    

Conclusion & Next Steps

This part covered the complete AArch64 branching toolkit. Unconditional branches (B, BR, BL, BLR, RET) handle jumps, calls, and returns. Compare & Branch (CBZ/CBNZ, TBZ/TBNZ) provide efficient zero-test and bit-test patterns without touching the flags. Conditional branches (B.cond) use the 15 condition codes against NZCV flags, while CSEL/CSET/CSINC/CSINV/CSNEG deliver branchless conditional data selection. For loops, the SUBS + B.NE idiom handles counted loops, while-loops canonicalize to do-while with guard checks, and manual unrolling with LDP boosts throughput. Function pointers compile to a simple LDR + BLR pattern, C++ vtable dispatch adds one extra indirection, and PAC secures indirect calls on Apple Silicon. Jump tables give O(1) switch dispatch with PIC-safe offset encoding for shared libraries.

                        
                        Practice Exercises:
                        Fibonacci Loop: Write a counted loop that computes the first 20 Fibonacci numbers, storing each in an array. Use SUBS + B.NE for the loop and ADD for the recurrence.
Branchless Min/Max: Given three values in X0, X1, X2, compute the minimum and maximum using only CMP + CSEL (no B.cond). Store min in X3, max in X4.
PIC Jump Table: Write a 4-case switch statement using a PIC-safe jump table. Each case should load a different constant into X0. Test it under gcc -fPIC to verify it links correctly in a shared library.
