ARM Assembly Part 16: Apple Silicon & macOS ABI

April 23, 2026 Wasil Zafar 22 min read

Apple Silicon runs arm64e, Apple's extension of the standard AArch64 ABI that builds on hardware pointer authentication codes (PAC) from ARMv8.3-A, alongside Branch Target Identification (BTI) on cores that implement it. Combined with Apple's ABI requirements (mandatory frame pointer, 16-byte aligned stack) and the Mach-O binary format with its load commands, macOS assembly has a distinct character that differs from Linux and Windows ARM64.

Table of Contents

  1. Introduction & ARM64e vs AArch64
  2. Pointer Authentication Codes (PAC)
  3. Branch Target Identification (BTI)
  4. Mach-O Binary Format
  5. dyld & Lazy Binding
  6. Apple ARM64 ABI Specifics
  7. PMU Performance Counters on M1/M2
  8. Hands-On Exercises
  9. Conclusion & Next Steps

Introduction & ARM64e vs AArch64

Series Overview: This is Part 16 of the 28-part ARM Assembly Mastery Series. Apple Silicon (M1, M2, M3, M4 families) builds on the ARMv8.3-A+ instruction set with Apple-specific extensions under the ARM64e ABI. Writing system-level code, hand-optimised kernels, or reverse-engineering macOS binaries requires understanding both the architectural extensions and Apple's unique ABI choices.

ARM64e is Apple's ABI built on ARMv8.3-A Pointer Authentication, used alongside ARMv8.5-A BTI where the hardware supports it. The ISA gains new instructions (the PACIA/AUTIA families), the ABI gains new calling convention requirements (return-address signing in prologues), and the hardware gains new key registers (the APxxKey family). On macOS, this coexists with the Mach-O binary format, dyld interposition, and Apple's hardened runtime, all of which influence how hand-written assembly must be structured.

Analogy: The Airport Security Check. Think of Pointer Authentication as a passport stamp system. When you enter a function (arrive at the gate), the CPU stamps your return address (passport) with a cryptographic signature (PACIASP). When you leave (AUTIASP), security checks the stamp: if it has been tampered with (a buffer overflow corrupted the saved LR), you're denied boarding (the process crashes with EXC_BAD_ACCESS). BTI is the designated entry point system: you can only enter the terminal through authorised doors (marked with BTI instructions). If malicious code tries to jump into the middle of a function (sneak through a service entrance), the hardware blocks it. Together, PAC and BTI create a low-overhead security perimeter that makes Apple Silicon one of the most exploit-resistant consumer platforms shipped so far.

Pointer Authentication Codes (PAC)

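PAC stores a cryptographic MAC in the otherwise unused upper bits of a 64-bit virtual address. The exact field depends on the configured virtual address size and on whether top-byte-ignore is enabled; bit 55 is never used because it selects the upper or lower address range. The key lives in hardware registers (for the IA key, the APIAKeyLo_EL1/APIAKeyHi_EL1 pair) that are not accessible from EL0. Signing inserts the PAC; authenticating verifies and removes it; stripping removes it without verification. A forged pointer fails authentication and is turned into a non-canonical address, so the next use faults (reported as EXC_BAD_ACCESS / SIGSEGV on macOS).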
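To make the sign / authenticate / strip cycle concrete, here is a minimal Python sketch of the idea. It is a toy model under stated assumptions, not the hardware algorithm: the 48-bit address width, the BLAKE2 keyed hash standing in for the real cipher, the choice of bit 62 as the "poison" bit, and all function names are illustrative.

```python
import hashlib

VA_BITS = 48                                  # assumed virtual-address width
MASK64 = (1 << 64) - 1
ADDR_MASK = (1 << VA_BITS) - 1                # bits 47..0 carry the address
BIT55 = 1 << 55                               # address-range select bit, never part of the PAC
PAC_MASK = (MASK64 & ~ADDR_MASK) & ~BIT55     # bits 63..48 except bit 55


def toy_mac(addr: int, modifier: int, key: bytes) -> int:
    """Keyed hash standing in for the hardware MAC (NOT the real cipher)."""
    data = addr.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hashlib.blake2b(data, key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def strip(ptr: int) -> int:
    """XPAC-like: drop the upper bits of a user-space pointer, no verification."""
    return ptr & ADDR_MASK


def sign(ptr: int, modifier: int, key: bytes) -> int:
    """PACIA-like: place a truncated MAC of (address, modifier) in the PAC bits."""
    addr = strip(ptr)
    return addr | (toy_mac(addr, modifier, key) & PAC_MASK)


def authenticate(ptr: int, modifier: int, key: bytes) -> int:
    """AUTIA-like: return the plain address, or a poisoned non-canonical one."""
    addr = strip(ptr)
    if sign(addr, modifier, key) == ptr:
        return addr
    return addr | (1 << 62)                   # simplified poison: any use would fault


if __name__ == "__main__":
    key = bytes(range(16))                    # stands in for the hidden key register
    lr, sp = 0x0000_0001_0000_4F30, 0x0000_0001_6FDF_F000
    signed_lr = sign(lr, sp, key)
    print(hex(signed_lr))                                        # address plus PAC bits
    print(hex(authenticate(signed_lr, sp, key)))                 # original address
    print(hex(authenticate(signed_lr ^ (1 << 50), sp, key)))     # tampered: poisoned
```

The modifier plays the role SP plays in PACIASP: the same return address signed in a different stack frame gets a different PAC, which is what makes replaying a signed pointer elsewhere hard.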

PACIASP / AUTIASP

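// ARM64e function prologue/epilogue: the Apple ABI signs the return address
// in any function that spills LR to the stack (i.e., any non-leaf function)
// PACIASP: sign LR with the IA key, using SP as the modifier; result stays in LR
// AUTIASP: authenticate LR with the same key and modifier, then strip the PAC
// Note: Apple's clang emits the B-key forms (PACIBSP / RETAB) for arm64e;
// the A-key forms shown here work the same way with a different key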

    .text
    .align 4
_my_function:
    PACIASP                    // Sign LR with SP context (IA key)
    STP   x29, x30, [sp, #-16]!  // Push frame pointer + signed LR
    MOV   x29, sp              // Set up frame pointer (MANDATORY on Apple)

    // ... function body ...
    BL    _some_other_function

    LDP   x29, x30, [sp], #16  // Restore frame pointer + signed LR
    AUTIASP                    // Authenticate LR; a failed check makes the return fault
    RET                        // Jump via authenticated LR

// Leaf function (no BL inside): no PAC needed, LR never saved
    .align 4
_leaf_function:
    ADD   x0, x0, x1
    RET

Data Pointer Authentication

// Signing and authenticating data pointers with the DA key
// PACDA: sign data pointer in Xd using the DA key and a modifier register (SP here)
// AUTDA: authenticate data pointer (strip and verify)
// XPACD: strip PAC without authentication (for debugging/comparison)

// Sign a global pointer before storing
LDR   x0, =my_vtable_ptr      // Load pointer value (untagged)
PACDA x0, sp                   // Sign: x0 = signed_pointer(x0, sp, DA_key)
STR   x0, [x19, #vtable_offset] // Store signed pointer

// Authenticate before indirect call
LDR   x0, [x19, #vtable_offset] // Load signed pointer
AUTDA x0, sp                   // Authenticate: restores untagged on success
                                 // Poisons to non-canonical addr on failure
LDR   x8, [x0, #method_offset] // Load method pointer from vtable
BLRAA x8, x0                   // Branch with PAC: call + authenticate in one op

// Strip PAC for display (no authentication — unsafe!)
LDR   x0, [x19, #vtable_offset]
XPACD x0                       // Remove PAC bits: x0 = raw VA
// Use x0 only for printing / comparison, NOT for calling

Branch Target Identification (BTI)

// BTI: valid indirect branch targets must start with BTI instruction
// Enabled via GNU_PROPERTY_AARCH64_FEATURE_1_BTI in ELF note or
// _COMM_PAGE_CPU_CAPABILITIES bit on macOS
// BTI variants: BTI c (BLR targets), BTI j (BR targets), BTI jc (both)

    .section __TEXT,__text
    .align 4
_callback_function:
    BTI   c                    // Valid target for BLR only
    STP   x29, x30, [sp, #-16]!
    MOV   x29, sp
    // ... implementation ...
    LDP   x29, x30, [sp], #16
    RET

    .align 4
_jump_table_entry:
    BTI   j                    // Valid target for BR only (switch jump tables)
    B     _actual_handler

    .align 4
_general_entry:
    BTI   jc                   // Valid for both BR and BLR
    NOP

// Attempting indirect branch to non-BTI landing pad → EXC_BAD_INSTRUCTION
// (SIGILL; ESR_EL1.EC = 0b001101, the Branch Target Exception class)

// macOS disables BTI enforcement for userspace by default but Hardened Runtime
// .entitlements can enable it. XNU kernel enforces BTI for kernel extensions.
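The landing-pad rules can be modelled in a few lines of Python. This is a sketch of the architectural check as described in the Arm ARM, with hypothetical helper names; it ignores the implicit landing pads provided by PACIASP/PACIBSP and treats direct branches as always allowed. One detail worth noticing is that BR through x16 or x17 (the registers used by stubs and veneers) is accepted by BTI c as well as BTI j.

```python
# PSTATE.BTYPE values recorded by the previous branch
BTYPE_NONE = 0b00         # direct branch or fall-through: never checked
BTYPE_BR_X16_X17 = 0b01   # BR via x16/x17 (stubs, linker veneers)
BTYPE_BLR = 0b10          # BLR: indirect call
BTYPE_BR_OTHER = 0b11     # BR via any other register (e.g. jump tables)

# Which BTYPE values each landing pad accepts on a guarded page
ACCEPTS = {
    "bti c": {BTYPE_BR_X16_X17, BTYPE_BLR},
    "bti j": {BTYPE_BR_X16_X17, BTYPE_BR_OTHER},
    "bti jc": {BTYPE_BR_X16_X17, BTYPE_BLR, BTYPE_BR_OTHER},
}


def branch_type(opcode: str, reg: str) -> int:
    """Classify an indirect branch the way the hardware sets BTYPE."""
    if opcode == "blr":
        return BTYPE_BLR
    if opcode == "br":
        return BTYPE_BR_X16_X17 if reg in ("x16", "x17") else BTYPE_BR_OTHER
    return BTYPE_NONE


def lands_ok(first_insn: str, btype: int, guarded: bool = True) -> bool:
    """True if execution may continue at a target starting with first_insn."""
    if not guarded or btype == BTYPE_NONE:
        return True                       # unguarded page or direct branch
    return btype in ACCEPTS.get(first_insn, set())


if __name__ == "__main__":
    print(lands_ok("bti c", branch_type("blr", "x8")))   # callback via BLR
    print(lands_ok("bti j", branch_type("blr", "x8")))   # jump-table pad hit by a call
    print(lands_ok("stp", branch_type("br", "x9")))      # no landing pad at all
```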

Mach-O Binary Format

Mach-O Header & Load Commands

Every macOS binary starts with a mach_header_64 (magic 0xFEEDFACF for 64-bit). This is followed by load commands: LC_SEGMENT_64 defines virtual memory regions; LC_DYLD_INFO_ONLY carries bind/lazy-bind/export trie tables; LC_SYMTAB holds the symbol table; LC_DYSYMTAB holds dynamic symbol info. Unlike ELF, Mach-O has no separate section headers independent of segments — sections are nested inside LC_SEGMENT_64 commands.
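The header-then-load-commands layout is simple enough to walk by hand. The Python sketch below parses a thin, little-endian 64-bit image; the constants follow the public mach-o/loader.h definitions, while parse_macho and demo_image are hypothetical helpers written for this illustration. It does not handle universal (fat) binaries, which wrap several thin images behind a different magic number.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF
CPU_TYPE_ARM64 = 0x0100000C
CPU_SUBTYPE_ARM64E = 2
LC_SEGMENT_64 = 0x19
LC_SYMTAB = 0x2

# mach_header_64: magic, cputype, cpusubtype, filetype, ncmds, sizeofcmds, flags, reserved
HEADER = struct.Struct("<8I")


def parse_macho(data: bytes) -> dict:
    magic, cputype, cpusubtype, _ftype, ncmds, _size, _flags, _rsvd = HEADER.unpack_from(data, 0)
    if magic != MH_MAGIC_64:
        raise ValueError("not a thin little-endian 64-bit Mach-O image")
    commands, segments = [], []
    offset = HEADER.size                      # load commands follow the header directly
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<II", data, offset)
        if cmdsize < 8:
            raise ValueError("corrupt load command")
        commands.append(cmd)
        if cmd == LC_SEGMENT_64:
            # segment_command_64: segname at +8 (16 bytes), nsects at +64
            name = data[offset + 8:offset + 24].rstrip(b"\0").decode("ascii")
            nsects = struct.unpack_from("<I", data, offset + 64)[0]
            segments.append((name, nsects))
        offset += cmdsize                     # cmdsize covers any nested section records
    return {
        # the top byte of cpusubtype holds capability bits, so mask it off
        "arm64e": cputype == CPU_TYPE_ARM64 and (cpusubtype & 0x00FFFFFF) == CPU_SUBTYPE_ARM64E,
        "commands": commands,
        "segments": segments,
    }


def demo_image() -> bytes:
    """Hand-built two-command image so the sketch runs without a real binary."""
    segment = struct.pack("<II16s4Q2i2I", LC_SEGMENT_64, 72, b"__TEXT",
                          0x100000000, 0x4000, 0, 0x4000, 5, 5, 0, 0)
    symtab = struct.pack("<6I", LC_SYMTAB, 24, 0, 0, 0, 0)
    header = HEADER.pack(MH_MAGIC_64, CPU_TYPE_ARM64, 0x80000000 | CPU_SUBTYPE_ARM64E,
                         2, 2, len(segment) + len(symtab), 0, 0)
    return header + segment + symtab


if __name__ == "__main__":
    print(parse_macho(demo_image()))
```

On a Mac, the same function can be pointed at the bytes of a thin binary you built yourself; the section records nested inside each LC_SEGMENT_64 are skipped here because cmdsize already accounts for them.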

# Inspect Mach-O structure
otool -h -l /usr/bin/clang | head -60

# Dump __TEXT,__text section disassembly. System dylibs have lived in the
# dyld shared cache since macOS 11, so point otool at an on-disk binary
otool -t -v -V /bin/ls | head -80

# Show all load commands
otool -l /usr/bin/swift | grep -A4 "LC_SEGMENT_64"

# Show defined external symbols (substitute any on-disk dylib or executable)
nm -gU /bin/ls | head -20

# Check for an arm64e slice (recorded as the CPU subtype, not as a load command)
lipo -archs /usr/bin/clang

__TEXT and __DATA Segments

// macOS assembly section declarations (Apple GAS / clang assembler syntax)
    .section __TEXT,__text,regular,pure_instructions   // Code
    .section __TEXT,__const                            // Read-only data
    .section __TEXT,__objc_methname,cstring_literals   // ObjC method name strings
    .section __DATA,__data                             // Initialised data (RW)
    .section __DATA,__bss                              // Uninitialised data
    .section __DATA,__la_symbol_ptr,lazy_symbol_pointers // Lazy symbol pointers (patched by dyld)
    .section __DATA_CONST,__got,non_lazy_symbol_pointers // GOT (non-lazy)

// Example: define a global variable + string literal
    .section __DATA,__data
    .align 3
my_counter:
    .quad  0                   // 8-byte global counter

    .section __TEXT,__cstring,cstring_literals
hello_str:
    .asciz "Hello, Apple Silicon\n"

dyld & Lazy Binding

// Calling a dynamic library function from assembly
// First call goes through dyld stub → resolver → patches GOT → real function
// Subsequent calls bypass dyld (direct through GOT)

    .text
    .align 4
call_printf_example:
    PACIASP
    STP   x29, x30, [sp, #-32]!
    MOV   x29, sp

    ADRP  x0, hello_str@PAGE      // Load PC-relative page of string
    ADD   x0, x0, hello_str@PAGEOFF
    BL    _printf                  // Call via __TEXT,__stubs entry (first call: slow path)

    LDP   x29, x30, [sp], #32
    AUTIASP
    RET

// macOS stub (in __TEXT,__stubs):
// _printf:
//     ADRP   x16, __la_symbol_ptr_printf@GOTPAGE
//     LDR    x16, [x16, __la_symbol_ptr_printf@GOTPAGEOFF]
//     BR     x16                  // First call: jumps to dyld_stub_binder
//                                 // After binding: jumps directly to _printf in libSystem

// objc_msgSend dispatch (message passing — extremely hot path on macOS)
// Apple's hand-written assembly dispatcher is in libobjc.A.dylib
// BL _objc_msgSend: x0=receiver, x1=selector, x2..=args
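The stub-and-pointer dance described in the comments can be sketched as a small Python model. The class and names below are hypothetical; the point is only the control flow: the stub always branches through a pointer slot, the slot initially targets a binder, and the binder patches the slot so later calls skip it. Newer toolchains may resolve these pointers eagerly at load time instead, in which case the slot already holds the final target before the first call.

```python
class LazyStub:
    """Toy model of one stub plus its lazy symbol pointer."""

    def __init__(self, symbol, exports):
        self.symbol = symbol
        self.exports = exports                    # stands in for the library's export table
        self.binder_calls = 0
        self.lazy_pointer = self._bind_and_call   # slot initially targets the binder

    def _bind_and_call(self, *args):
        # Slow path: look the symbol up, patch the slot, then finish the call
        self.binder_calls += 1
        target = self.exports[self.symbol]
        self.lazy_pointer = target
        return target(*args)

    def __call__(self, *args):
        # The stub itself: load the pointer slot and branch to whatever it holds
        return self.lazy_pointer(*args)


if __name__ == "__main__":
    exports = {"_strlen": len}                    # hypothetical export table
    strlen_stub = LazyStub("_strlen", exports)
    print(strlen_stub(b"hello"), strlen_stub.binder_calls)   # first call binds
    print(strlen_stub(b"hi"), strlen_stub.binder_calls)      # second call goes direct
```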

Apple ARM64 ABI Specifics

Mandatory Frame Pointer

// Apple ARM64 ABI requires x29 (frame pointer) always valid on non-leaf calls
// This enables reliable stack unwinding for crash reports and Instruments profiling
// Unlike Linux where -fomit-frame-pointer is common, Apple mandates x29

// Correct Apple ABI function prologue:
_my_func:
    PACIASP
    STP   x29, x30, [sp, #-16]!   // Decrement SP by 16; store x29+x30
    MOV   x29, sp                  // x29 = current fp (chain to caller's frame)
    // ... body ...
    LDP   x29, x30, [sp], #16     // Restore; increment SP by 16
    AUTIASP
    RET

// Frame chains: x29 → caller_x29 → caller_caller_x29 → ... → 0
// Crash reporter walks this chain to produce symbolicated backtraces

// Stack alignment: SP must be 16-byte aligned at all call sites
// Variadic calls: Apple passes the anonymous arguments on the stack, so reserve
// 16-byte-aligned space for them before the BL
    SUB   sp, sp, #32              // 32-byte frame (16-byte aligned)
    STP   x0, x1, [sp]             // Save argument registers
    ADD   sp, sp, #32              // Restore (must also be 16-byte aligned)
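The frame chain mentioned above is what makes backtraces cheap: each frame record holds the caller's x29 at [fp] and the saved LR at [fp, #8]. A minimal Python sketch of such a walk follows, using a dictionary as a stand-in for stack memory; walk_frames and the addresses are illustrative, and a real unwinder on arm64e would also strip the PAC bits from each saved LR before symbolication.

```python
def walk_frames(memory: dict, fp: int, max_depth: int = 64) -> list:
    """Follow [fp] -> caller's fp and collect [fp + 8] (saved LR) until fp is 0."""
    return_addresses = []
    while fp != 0 and len(return_addresses) < max_depth:
        caller_fp = memory[fp]                # first slot of the frame record
        return_addresses.append(memory[fp + 8])
        # The stack grows down, so a sane chain only moves to higher addresses
        if caller_fp != 0 and caller_fp <= fp:
            raise ValueError("corrupt frame chain")
        fp = caller_fp
    return return_addresses


if __name__ == "__main__":
    # Two frames: _compute (current) called from _main, whose record ends the chain
    stack = {
        0x16FDFF7D0: 0x16FDFF7F0,   # _compute frame: saved x29 -> _main frame
        0x16FDFF7D8: 0x100003F40,   # _compute frame: saved LR (return into _main)
        0x16FDFF7F0: 0,             # _main frame: saved x29 = 0 terminates the walk
        0x16FDFF7F8: 0x100003E10,   # _main frame: saved LR (return into the runtime)
    }
    print([hex(a) for a in walk_frames(stack, 0x16FDFF7D0)])
```

This only works because every non-leaf function stores the x29/x30 pair and points x29 at it; a function that reuses x29 as a scratch register breaks the chain for every frame above it.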

macOS System Call Convention

// macOS ARM64 system call: SVC #0x80 (unlike Linux SVC #0)
// System call number in x16 (not x8 as on Linux)
// Arguments: x0–x5 (same as Linux)
// Return: x0 (result), x1 (2nd return word for some calls)
// Error: NZCV carry flag C=1 → x0 = errno value

// Example: write(1, buf, len) → syscall #4 (BSD write)
    .text
_raw_write:
    MOV   x16, #4              // SYS_write = 4 on macOS
    SVC   #0x80                // Trap to XNU
    BCS   syscall_error        // Carry set = error; x0 = errno
    RET
syscall_error:
    NEG   x0, x0               // Return negative errno (Linux-kernel style; libc instead sets errno and returns -1)
    RET

// Example: exit(0) → syscall #1 (SYS_exit)
_raw_exit:
    MOV   x16, #1              // SYS_exit
    MOV   x0, #0               // exit code 0
    SVC   #0x80

// Mach trap numbers: negative (e.g., mach_msg = -31, task_self = -28)
// Called via SVC #0x80 with x16 holding the negative number in two's complement
// (e.g. MOV x16, #-31, which leaves 0xFFFFFFFFFFFFFFE1 in x16)
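The register conventions above can be sanity-checked with a little arithmetic. The Python sketch below shows the 64-bit image that ends up in x16 for a positive BSD syscall number versus a negative Mach trap number, and how a libc-style wrapper would interpret the carry flag on return. The helper names are hypothetical, and the errno value 9 (EBADF) is used purely as an example.

```python
MASK64 = (1 << 64) - 1


def x16_for(number: int) -> int:
    """Register image for a BSD syscall (positive) or Mach trap (negative) number."""
    return number & MASK64                # two's complement for negative values


def libc_result(x0: int, carry: bool) -> tuple:
    """Mimic a libc wrapper: carry set means x0 holds errno, so return (-1, errno)."""
    return (-1, x0) if carry else (x0, 0)


if __name__ == "__main__":
    print(hex(x16_for(4)))                # SYS_write
    print(hex(x16_for(-31)))              # mach_msg trap
    print(libc_result(21, carry=False))   # write() wrote 21 bytes
    print(libc_result(9, carry=True))     # failed call, errno 9
```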

PMU Performance Counters on M1/M2

# Apple's PMU is not exposed through perf(1), and user space cannot read PMU registers directly with MRS
# Use Apple's kdebug / kperf API from privileged processes, or Instruments.app

# Option 1: pmu counters via kperf (requires entitlement com.apple.private.kpc)
# Headers: , 
# Available via Apple Internal SDK or open-source xnu headers

# Option 2: ktrace / kdebug via libkdd
#   sudo ktrace artrace -o trace.artrace -- ./benchmark
#   instruments -l 10000 -t "CPU Profiler" ./benchmark

# Option 3: proc_pid_rusage() for coarse wall/user/system time
# clock_gettime(CLOCK_UPTIME_RAW): accesses ARM CNTVCT_EL0 via commpage

# Measure CNTVCT_EL0 (virtual timer counter — accessible from EL0):
    .text
read_cntvct:
    MRS   x0, CNTVCT_EL0      // Virtual count (runs at CNTFRQ_EL0 Hz ≈ 24 MHz)
    RET

read_cntfrq:
    MRS   x0, CNTFRQ_EL0      // Counter frequency in Hz (24000000 on M1/M2)
    RET

# Compute elapsed time: (cntvct_end - cntvct_start) * 1e9 / cntfrq = nanoseconds
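The conversion formula is easy to get subtly wrong with floating point or a hard-coded frequency, so here is a small Python sketch of the arithmetic using integers throughout. The function names are illustrative; the 24 MHz figure is only the example value quoted above, and real code should pass in whatever CNTFRQ_EL0 reports.

```python
MASK64 = (1 << 64) - 1


def ticks_to_ns(start: int, end: int, freq_hz: int) -> int:
    """Elapsed nanoseconds between two CNTVCT_EL0 reads (integer math, wrap-safe)."""
    elapsed_ticks = (end - start) & MASK64     # tolerate 64-bit counter wrap-around
    return elapsed_ticks * 1_000_000_000 // freq_hz


def gflops(flop_count: int, elapsed_ns: int) -> float:
    """FLOPs per nanosecond is numerically the same as GFLOP/s."""
    return flop_count / elapsed_ns


if __name__ == "__main__":
    freq = 24_000_000                          # example: value read from CNTFRQ_EL0
    ns = ticks_to_ns(1_000, 25_000, freq)      # 24,000 ticks at 24 MHz
    print(ns)
    print(gflops(8_000_000, ns))               # e.g. 1,000,000 FMLA.4S = 8 FLOPs each
```

At 24 MHz one tick is roughly 41.7 ns, so short code sequences need many iterations before the counter resolution stops dominating the measurement.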

Key Insight: For developers porting Linux arm64 code to macOS, the most impactful difference is return-address signing with PACIASP/AUTIASP. On cores without PAC these instructions execute as NOPs because they live in the HINT space. On Apple Silicon running arm64e they are enforced, and the pair must stay matched: if a function omits PACIASP but still executes AUTIASP, it authenticates a value that was never signed, leaving a non-canonical address in LR and raising EXC_BAD_ACCESS on return. The second most common gotcha is the syscall register: Linux uses x8, macOS uses x16, and mixing them produces SIGSYS or unexpected errno values.

Case Study: The M1 Transition — Rosetta 2 and PAC Compatibility

When Apple shipped the M1 in November 2020, a vast body of existing x86_64 software needed to run on ARM64. Rosetta 2 translates x86_64 to ARM64 ahead of time (at install or first launch) and falls back to JIT translation for code generated at run time. PAC posed a unique challenge: translated code doesn't use PACIASP/AUTIASP (x86 has no equivalent), so Rosetta processes run out of a dedicated translation cache without return-address signing, while native ARM64e processes enforce PAC fully.

Performance insight: Despite the translation overhead, Rosetta 2 is commonly reported to reach roughly 70-80% of native performance on compute-heavy workloads. The M1's wide microarchitecture helps: published estimates put the performance cores at an 8-wide decode pipeline with a reorder buffer of roughly 600 entries, which absorbs much of the extra instruction count Rosetta generates. Hand-written ARM64 assembly for M1 can exploit the same width by keeping several independent integer, FP/SIMD, and load/store operations in flight each cycle.

From PowerPC to ARM: Apple's Three Architecture Transitions

Apple has navigated three major ISA transitions: 68K→PowerPC (1994), PowerPC→x86 (2006), and x86→ARM64 (2020). Each transition introduced a translation layer (68K emulator, Rosetta 1, Rosetta 2) and a universal binary format (fat binaries, Universal 2). The ARM64 transition was uniquely aggressive: Apple designed both the silicon (M1) and the OS (Big Sur) simultaneously, ensuring that XNU kernel, drivers, and core frameworks shipped as native ARM64e binaries from day one — no emulation layer for the OS itself.

The ARM64e ABI was first deployed on the A12 Bionic (iPhone XS, 2018) with iOS 12, giving Apple two years of real-world PAC deployment data before the Mac transition. By the time M1 shipped, many PAC-related kernel bugs had already been found and patched across the iPhone install base, a strategic advantage no other ARM desktop platform has had.

Hands-On Exercises

Exercise 1: PAC-Signed Function Calls on macOS

Write a macOS ARM64e assembly program with three functions: _main (calls _compute), _compute (calls _helper), and _helper (leaf function). Each non-leaf function must use PACIASP/AUTIASP and the mandatory frame pointer setup (STP x29, x30; MOV x29, sp). Build with clang -arch arm64e -o pac_demo pac_demo.s and run; macOS may refuse to launch third-party arm64e binaries unless the arm64e preview ABI is enabled. Then intentionally corrupt the saved LR on the stack (write 0xDEAD to [sp, #8] after the STP) and observe the EXC_BAD_ACCESS crash. Use lldb to inspect the register state at the fault and identify the PAC authentication failure in the crash report.

Exercise 2: Minimal Mach-O "Hello World" in Assembly

Write a complete macOS ARM64 assembly program that prints "Hello, Apple Silicon!" using the raw syscall interface (SVC #0x80). Use x16 = 4 (SYS_write), x0 = 1 (stdout), x1 = buffer, x2 = length. Then exit with x16 = 1 (SYS_exit). Assemble with as -o hello.o hello.s && ld -o hello hello.o -lSystem -syslibroot $(xcrun --sdk macosx --show-sdk-path) -e _main. Examine the resulting binary with otool -h -l hello to identify all LC_SEGMENT_64 commands and compare the __TEXT and __DATA segment layout to what you see in otool -l /usr/bin/true.

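Exercise 3: Precision Timing with CNTVCT_EL0

Write a macOS ARM64 assembly function _bench_neon_dotproduct that: (1) reads CNTVCT_EL0 into x19 (start time), (2) executes 1,000,000 iterations of a 128-bit NEON fused multiply-add over two 4-element float32 vectors (FMLA v2.4s, v0.4s, v1.4s), the accumulation step of a dot product, (3) reads CNTVCT_EL0 again into x20, (4) computes elapsed = x20 - x19. Because x19 and x20 are callee-saved, save and restore them in the prologue and epilogue. Call the function from a C wrapper that reads CNTFRQ_EL0 (24 MHz on M1/M2; read it rather than hard-coding it) and converts to nanoseconds: ns = elapsed * 1e9 / freq. Compare the throughput (GFLOP/s, counting 8 FLOPs per FMLA) against the theoretical peak of your Apple Silicon chip, which is FMA pipes × 4 lanes × 2 FLOPs × clock frequency. A single accumulator forms a dependency chain, so this loop measures FMLA latency; use several independent accumulators to approach the peak.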

Conclusion & Next Steps

We covered ARM64e on Apple Silicon comprehensively: PAC instruction families (PACIASP/AUTIASP for return addresses, PACDA/AUTDA for data pointers, BLRAA for authenticated indirect calls), BTI landing pads for indirect branch hardening, the Mach-O binary structure (load commands, segment/section layout, Apple assembler syntax), dyld lazy binding stubs and GOT entries, Apple ABI requirements (mandatory frame pointer, SVC #0x80 with syscall number in x16), and measuring timing via CNTVCT_EL0. We also traced the M1 transition history and how Rosetta 2 handles the PAC compatibility challenge.

Next in the Series

In Part 17: Inline Assembly, GCC/Clang & C Interop, we move to embedding assembly directives inside C/C++ code: constraint strings, clobber lists, memory barriers from C, and when the compiler wins vs when hand-written asm still matters.
