ARM Assembly Part 15: Cortex-A System Programming & Boot

April 16, 2026 Wasil Zafar 25 min read

Every Cortex-A core that implements EL3 comes out of reset in the Secure world at EL3, and the boot flow then descends through the exception levels until it reaches user space. We trace this descent in assembly: EL3 hardware init, the TF-A BL1→BL2→BL31 handoff, the EL3→EL1 ERET chain, identity-mapped MMU enable, and the final FDT pointer handoff to the Linux kernel entry point.

Table of Contents

  1. Introduction & Boot Chain Overview
  2. EL3 Initialisation
  3. TF-A BL1 → BL2 → BL31/BL32/BL33
  4. EL3 → EL2 → EL1 ERET Chain
  5. Enabling the MMU at EL1
  6. PSCI — Power State Coordination Interface
  7. FDT / ATags Handoff to Linux
  8. Hands-On Exercises
  9. Boot Sequence Planner
  10. Conclusion & Next Steps

Introduction & Boot Chain Overview

Series Overview: This is Part 15 of the 28-part ARM Assembly Mastery Series. We now synthesise the earlier chapters: exception levels (Part 11), MMU (Part 12), and TrustZone (Part 13) into a working boot sequence — the kind you'd find in a Raspberry Pi 4 (BCM2711), a Qualcomm Snapdragon at bootloader stage, or an Arm FVP Base platform.

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 15

  1. Architecture History & Core Concepts: ARMv1→v9, RISC philosophy, profiles
  2. ARM32 Instruction Set Fundamentals: ARM vs Thumb, registers, CPSR, barrel shifter
  3. AArch64 Registers, Addressing & Data Movement: X/W regs, addressing modes, load/store pairs
  4. Arithmetic, Logic & Bit Manipulation: ADD/SUB, bitfield extract/insert, CLZ
  5. Branching, Loops & Conditional Execution: Branch types, link register, jump tables
  6. Stack, Subroutines & AAPCS: Calling conventions, prologue/epilogue
  7. Memory Model, Caches & Barriers: Weak ordering, DMB/DSB/ISB, TLB
  8. NEON & Advanced SIMD: Vector ops, intrinsics, media processing
  9. SVE & SVE2 Scalable Vector Extensions: Predicate regs, gather/scatter, HPC/ML
  10. Floating-Point & VFP Instructions: IEEE-754, scalar FP, rounding modes
  11. Exception Levels, Interrupts & Vector Tables: EL0–EL3, GIC, fault debugging
  12. MMU, Page Tables & Virtual Memory: Stage-1 translation, permissions, huge pages
  13. TrustZone & ARM Security Extensions: Secure monitor, world switching, TF-A
  14. Cortex-M Assembly & Bare-Metal Embedded: NVIC, SysTick, linker scripts, low-power
  15. Cortex-A System Programming & Boot: EL3→EL1 transitions, MMU setup, PSCI (You Are Here)
  16. Apple Silicon & macOS ABI: ARM64e PAC, Mach-O, dyld, perf counters
  17. Inline Assembly, GCC/Clang & C Interop: Constraints, clobbers, compiler interaction
  18. Performance Profiling & Micro-Optimization: Pipeline hazards, PMU, benchmarking
  19. Reverse Engineering & ARM Binary Analysis: ELF, disassembly, CFR, iOS/Android quirks
  20. Building a Bare-Metal OS Kernel: Bootloader, UART, scheduler, context switch
  21. ARM Microarchitecture Deep Dive: OOO pipelines, reorder buffers, branch predict
  22. Virtualization Extensions: EL2 hypervisor, stage-2 translation, KVM
  23. Debugging & Tooling Ecosystem: GDB, OpenOCD/JTAG, ETM/ITM, QEMU
  24. Linkers, Loaders & Binary Format Internals: ELF deep dive, relocations, PIC, crt0
  25. Cross-Compilation & Build Systems: GCC/Clang toolchains, CMake, firmware gen
  26. ARM in Real Systems: Android, FreeRTOS/Zephyr, U-Boot, TF-A
  27. Security Research & Exploitation: ASLR, PAC attacks, ROP/JOP, kernel exploit
  28. Emerging ARMv9 & Future Directions: MTE, SME, confidential compute, AI accel

On power-on, a Cortex-A core resets into its highest implemented exception level; on cores with EL3 that is EL3 in Secure state. The boot chain then proceeds down the exception levels: EL3 (Secure firmware, TF-A BL31) → EL2 (hypervisor, or passed through unused) → EL1 (OS kernel) → EL0 (user space). Each ERET drops privilege and, through the SPSR, selects the state of the next level (target EL, SP_EL0 versus SP_ELx, interrupt masks), while SCR_EL3.RW and HCR_EL2.RW fix the register width of the levels below. Understanding this chain is essential for writing secure firmware, debugging early boot failures, or porting new platforms.

Analogy — The Office Building Startup Sequence: Imagine arriving at corporate headquarters before dawn. The security guard (EL3 firmware) unlocks the building, verifies alarm codes, and checks structural safety. The building manager (EL2 hypervisor) then enables elevators, assigns floors to tenants, and configures the HVAC system. Finally, individual offices (EL1 OS) open their doors, turn on lights, and let employees (EL0 user applications) begin work. Each person only enters after the previous stage completes — and if the guard doesn't set the alarm correctly, nothing else functions. The Cortex-A boot chain works identically: each ERET drops privilege only after the firmware at the current level has fully configured the hardware for the next occupant.

EL3 Initialisation

CPU Reset Registers

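On AArch64 reset, many system registers hold UNKNOWN or implementation-defined values. Firmware must therefore force the MMU and data cache into a known disabled state (so early code does not run with stale translations or cache contents), initialise the cache-related CPU registers, and set up the stack pointer before calling C. Registers such as CPUECTLR_EL1 and L2CTLR_EL1 are implementation-defined: the encoding and bit layout used below follow Cortex-A57/A72 and differ on other cores, so check the Technical Reference Manual for your core.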

// Early EL3 init — CPU comes out of reset here
// SP_EL3 not yet valid; code must be PC-relative until stack set
    .section ".text.boot"
    .global el3_reset_entry
el3_reset_entry:
    // 1. Disable D/I caches and MMU via SCTLR_EL3
    MRS   x0, SCTLR_EL3
    BIC   x0, x0, #(1 << 0)   // M=0: disable MMU
    BIC   x0, x0, #(1 << 2)   // C=0: disable D-cache
    BIC   x0, x0, #(1 << 12)  // I=0: disable I-cache
    MSR   SCTLR_EL3, x0
    ISB

    // 2. Set EL3 stack
    LDR   x0, =__el3_stack_top
    MOV   sp, x0

    // 3. Stop trapping FP/SIMD to EL3 (CPTR_EL3.TFP=0); EL2 and EL1 still
    //    gate FP/SIMD through CPTR_EL2 and CPACR_EL1.FPEN
    MSR   CPTR_EL3, xzr

    // 4. Configure CPU extended control (Cortex-A72 example)
    // MRS   x0, S3_1_C15_C2_1   // CPUECTLR_EL1
    // ORR   x0, x0, #(1 << 6)   // SMPEN=1: join hardware coherency before enabling caches
    // MSR   S3_1_C15_C2_1, x0
    ISB

    B     bl31_main              // Jump to TF-A BL31 C runtime

SCR_EL3 & HCR_EL2 Configuration

// Configure SCR_EL3 for NS world handoff
// SCR_EL3 controls: NS bit, IRQ/FIQ/EA routing, SMC disable, HVC enable, RW (register width of the next lower EL)
MRS   x0, SCR_EL3
ORR   x0, x0, #(1 << 0)    // NS=1: lower ELs run in Non-secure state
BIC   x0, x0, #(1 << 1)    // IRQ=0: IRQs go to the Normal-world OS (IRQ=1 would route them to EL3)
BIC   x0, x0, #(1 << 7)    // SMD=0: SMC instruction stays enabled
ORR   x0, x0, #(1 << 8)    // HCE=1: HVC instruction enabled
ORR   x0, x0, #(1 << 10)   // RW=1: next lower EL is AArch64 (not AArch32)
MSR   SCR_EL3, x0
ISB

// Configure HCR_EL2 — if EL2 is present, set RW for EL1
MRS   x0, HCR_EL2
ORR   x0, x0, #(1 << 31)   // RW=1: EL1 runs AArch64
MSR   HCR_EL2, x0
ISB
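
Bit positions in SCR_EL3 are easy to get wrong, so it can help to compose the value on a host machine before hard-coding it. What follows is a minimal host-side Python sketch, not firmware code: the helper names scr_el3 and decode_scr_el3 are hypothetical, and the bit positions are the ones given in the Armv8-A description of SCR_EL3, with bits [5:4] treated as RES1.

```python
# Bit positions of the SCR_EL3 fields discussed in this section.
SCR_EL3_BITS = {
    "NS": 0,    # lower ELs run in Non-secure state
    "IRQ": 1,   # route physical IRQs to EL3
    "FIQ": 2,   # route physical FIQs to EL3
    "EA": 3,    # route external aborts and SError to EL3
    "SMD": 7,   # 1 = SMC instruction disabled
    "HCE": 8,   # 1 = HVC instruction enabled
    "RW": 10,   # 1 = next lower EL is AArch64
}
SCR_EL3_RES1 = 0b11 << 4  # bits [5:4], assumed reserved-as-one


def scr_el3(*fields):
    """Return an SCR_EL3 value with the named fields set to 1."""
    value = SCR_EL3_RES1
    for name in fields:
        value |= 1 << SCR_EL3_BITS[name]
    return value


def decode_scr_el3(value):
    """List the known fields that are set in a raw SCR_EL3 value."""
    return sorted(name for name, bit in SCR_EL3_BITS.items() if (value >> bit) & 1)


ns_handoff = scr_el3("NS", "HCE", "RW")
print(hex(ns_handoff))             # 0x531
print(decode_scr_el3(ns_handoff))  # ['HCE', 'NS', 'RW']
```

Running decode_scr_el3 on a constant copied from a listing is a quick way to confirm that each ORR or BIC touches the field its comment names.
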

TF-A BL1 → BL2 → BL31/BL32/BL33

BL1 ROM Code

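BL1 executes from ROM. It initialises the EL3 execution environment, authenticates BL2 and copies it to trusted SRAM, then ERETs into it. BL2 loads BL31 (the resident secure monitor), optionally BL32 (a Secure-world OS such as OP-TEE), and BL33 (U-Boot or UEFI). BL31 installs the SMC dispatcher and then ERETs to BL33 in the Normal world. In upstream TF-A, BL2 normally runs at Secure EL1 and BL33 is entered at EL2 when the core implements it; the stubs in this section simplify both, keeping BL2 at EL3 and entering BL33 at EL1.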

// BL1 stub: copy BL2 from memory-mapped flash to secure SRAM, then jump
// (a block device such as eMMC needs a driver read instead of a plain copy)
    .global bl1_main
bl1_main:
    LDR   x0, =BL2_SRC_ADDR    // BL2 in non-volatile storage
    LDR   x1, =BL2_DST_ADDR    // Secure SRAM destination
    LDR   x2, =BL2_SIZE
    BL    memcpy_el3             // Simple EL3 memcpy

    // Optional: hash verification (SHA-256 over BL2 image + CoT check)
    // BL    bl1_verify_bl2

    // ERET to BL2, staying at EL3 in this simplified stub
    LDR   x0, =BL2_DST_ADDR
    MSR   ELR_EL3, x0           // Return address = BL2 entry
    MOV   x1, #0x3CD            // DAIF masked, M[4:0]=0b01101 = EL3h (AArch64, SP_EL3)
    MSR   SPSR_EL3, x1          // Written whole: SPSR_EL3 is UNKNOWN out of reset
    ERET                         // Jump to BL2

BL31 Resident Monitor

// BL31 passes execution to BL33 (U-Boot) at EL1 NS
    .global bl31_exit_to_ns
bl31_exit_to_ns:
    // Load BL33 (U-Boot) entry point and context
    LDR   x0, =UBOOT_ENTRY_ADDR
    MSR   ELR_EL3, x0

    // Build SPSR_EL3: target = EL1h (EL1 with SP_EL1), NS=1 already in SCR
    MOV   x1, #0b00101          // M[4:0] = EL1h
    ORR   x1, x1, #(0b1111 << 6) // DAIF all masked initially
    MSR   SPSR_EL3, x1

    // x0–x3 per BL33 calling convention: 0=FDT addr, 1–3=0
    LDR   x0, =FDT_BASE_ADDR    // Pass device tree to bootloader
    MOV   x1, xzr
    MOV   x2, xzr
    MOV   x3, xzr
    ERET                         // Jump to U-Boot EL1

EL3 → EL2 → EL1 ERET Chain

// Full three-level ERET descent (EL3→EL2→EL1)
// Useful when enabling hypervisor before OS

// === EL3 → EL2 ===
drop_to_el2:
    ADR   x0, el2_entry         // EL2 entry point
    MSR   ELR_EL3, x0
    MOV   x0, #0x3C9            // SPSR: EL2h (M[3:0]=0b1001), DAIF masked until EL2 vectors exist
    MSR   SPSR_EL3, x0
    ERET

    .balign 4
el2_entry:
    // Configure EL2 regs, then drop to EL1
    // HCR_EL2 is UNKNOWN out of reset, so write the whole register
    MOV   x0, #(1 << 31)        // RW=1 (EL1 is AArch64); E2H (bit 34) and TGE (bit 27) stay 0
    MSR   HCR_EL2, x0           // TGE=1 would make the ERET to EL1 below illegal
    ISB

// === EL2 → EL1 ===
drop_to_el1:
    ADR   x0, el1_entry
    MSR   ELR_EL2, x0
    MOV   x0, #0x3C5            // SPSR: EL1h (M[3:0]=0b0101), DAIF masked
    MSR   SPSR_EL2, x0
    ERET

    .balign 4
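el1_entry:
    // We are now at EL1: set up the kernel environment
    LDR   x0, =__kernel_stack_top
    MOV   sp, x0                // SP cannot be the target of LDR =literal, so go through x0
    BL    kernel_main
    B     .                     // kernel_main should not return; park the core if it does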
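
The SPSR constants used in the ERET chain all follow one pattern: M[3:0] carries the target EL and the stack-pointer selection, M[4]=0 selects AArch64, and bits [9:6] carry the D, A, I, F masks. A small host-side Python sketch makes the pattern explicit; the function name spsr_aarch64 is hypothetical, and the helper is only meant for checking a constant before it goes into assembly.

```python
def spsr_aarch64(target_el, use_sp_elx=True, mask_daif=True):
    """Build an SPSR_ELx value for an ERET into AArch64 state."""
    if target_el not in (0, 1, 2, 3):
        raise ValueError("target_el must be 0..3")
    if target_el == 0 and use_sp_elx:
        raise ValueError("EL0 always uses SP_EL0 (mode EL0t)")
    # M[3:2] = target EL, M[0] = 1 for SP_ELx ("h") or 0 for SP_EL0 ("t")
    mode = (target_el << 2) | (1 if use_sp_elx else 0)
    # D, A, I, F mask bits live in [9:6]
    daif = (0b1111 << 6) if mask_daif else 0
    return daif | mode


for name, el in (("EL1h", 1), ("EL2h", 2), ("EL3h", 3)):
    print(name, hex(spsr_aarch64(el)))
```

With interrupts masked this gives 0x3c5 for EL1h, 0x3c9 for EL2h, and 0x3cd for EL3h; passing mask_daif=False yields the bare mode values 0x5, 0x9, and 0xd.
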

Enabling the MMU at EL1

// Enable identity-mapped MMU at EL1 for early kernel
// Assumes page tables already populated (Part 12 pattern)
enable_mmu_el1:
    LDR   x0, =ttb0_l1_base    // TTBR0_EL1: user/low address space table
    MSR   TTBR0_EL1, x0
    LDR   x0, =ttb1_l1_base    // TTBR1_EL1: kernel/high address space table
    MSR   TTBR1_EL1, x0

    // TCR_EL1: 48-bit VA, 4K granule, Inner/Outer WB-WA cacheable, Inner Shareable walks
    LDR   x0, =0x00000002B5103510  // T0SZ=16, T1SZ=16, TG0=4K, TG1=4K, IPS=40-bit
    MSR   TCR_EL1, x0

    // MAIR_EL1: attr0=Device-nGnRnE (0x00), attr1=Normal WB-WA (0xFF)
    LDR   x0, =0xFF00
    MSR   MAIR_EL1, x0
    ISB

    DSB   ISH                   // Ensure page table writes visible
    TLBI  VMALLE1               // Invalidate all EL1 TLBs
    DSB   ISH
    ISB

    MRS   x0, SCTLR_EL1
    ORR   x0, x0, #(1 << 0)   // M=1: enable MMU
    ORR   x0, x0, #(1 << 2)   // C=1: enable D-cache
    ORR   x0, x0, #(1 << 12)  // I=1: enable I-cache
    MSR   SCTLR_EL1, x0
    ISB                         // Fetch subsequent instructions with MMU on
    RET                         // Callers reach this routine with BL
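
TCR_EL1 and MAIR_EL1 pack many small fields into one constant, and a single wrong nibble silently changes the address-space size or a memory type. The host-side Python sketch below encodes and decodes the fields named in the comments above, assuming 4 KB granules, write-back write-allocate table walks, and Inner Shareable walks for both halves. The helper names (tcr_el1, decode_tcr_el1, mair_el1) are hypothetical.

```python
IPS_BITS = {0: 32, 1: 36, 2: 40, 3: 42, 4: 44, 5: 48}   # TCR_EL1.IPS encodings
IPS_CODE = {bits: code for code, bits in IPS_BITS.items()}


def tcr_el1(t0sz, t1sz, pa_bits):
    """Encode TCR_EL1 for 4 KB granules with WB-WA, Inner Shareable walks."""
    irgn_wbwa, orgn_wbwa, sh_inner = 0b01, 0b01, 0b11
    tg0_4k, tg1_4k = 0b00, 0b10   # TG0 and TG1 use different encodings for 4 KB
    ttbr0_half = (t0sz | (irgn_wbwa << 8) | (orgn_wbwa << 10)
                  | (sh_inner << 12) | (tg0_4k << 14))
    ttbr1_half = ((t1sz << 16) | (irgn_wbwa << 24) | (orgn_wbwa << 26)
                  | (sh_inner << 28) | (tg1_4k << 30))
    return ttbr0_half | ttbr1_half | (IPS_CODE[pa_bits] << 32)


def decode_tcr_el1(value):
    """Pull the size and granule fields back out of a raw TCR_EL1 value."""
    return {
        "T0SZ": value & 0x3F,
        "TG0": (value >> 14) & 0x3,
        "T1SZ": (value >> 16) & 0x3F,
        "TG1": (value >> 30) & 0x3,
        "PA_bits": IPS_BITS.get((value >> 32) & 0x7),
    }


def mair_el1(*attrs):
    """Pack up to eight 8-bit attribute encodings, attr0 in the low byte."""
    value = 0
    for index, attr in enumerate(attrs):
        value |= (attr & 0xFF) << (8 * index)
    return value


print(hex(tcr_el1(16, 16, 40)))      # 0x2b5103510
print(decode_tcr_el1(tcr_el1(16, 16, 40)))
print(hex(mair_el1(0x00, 0xFF)))     # 0xff00
```

Decoding any constant before using it is a cheap check that T0SZ, T1SZ, and IPS really match the comment next to it.
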

PSCI — Power State Coordination Interface

CPU_ON (SMP Bringup)

// PSCI CPU_ON: bring secondary CPU online from Linux kernel (EL1)
// Calling convention: SMCCC (function ID in x0, args in x1-x3)
// CPU_ON function ID: 0xC4000003 (64-bit PSCI)

    .global psci_cpu_on
psci_cpu_on:
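    // x0 = PSCI function ID (CPU_ON = 0xC4000003)
    // x1 = target_cpu: affinity fields of the target's MPIDR_EL1, all other bits zero
    //      (e.g., 0x1 for Aff0=1; a raw MPIDR such as 0x80000001 must be masked first)
    // x2 = entry_point_address (secondary CPU starts here)
    // x3 = context_id (arbitrary value, delivered to the secondary in x0)
    LDR   x0, =0xC4000003       // PSCI64 CPU_ON
    LDR   x1, =0x1              // CPU1 on a single-cluster layout (Aff0=1); platform-specific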
    LDR   x2, =secondary_entry  // Entry point for secondary
    MOV   x3, #0                // context_id
    SMC   #0                    // SMC to BL31 PSCI handler
    // x0 returns PSCI_SUCCESS (0) or error code
    RET

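// Secondary CPU entry point (PSCI firmware ERETs here with MMU and caches off)
    .global secondary_entry
secondary_entry:
    // On entry x0 = context_id from CPU_ON; no other register is passed in
    LDR   x1, =secondary_stack_top   // Each core needs its own stack in a real system
    MOV   sp, x1                // SP cannot be loaded with LDR =literal directly
    BL    enable_mmu_el1        // Programs TTBR0/TTBR1 itself
    BL    secondary_main
    B     .                     // Park the core if secondary_main returns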
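
The CPU_ON function ID is not a magic number: it is composed from SMCCC fields (fast-call bit 31, SMC64 bit 30, owner in bits [29:24], function number in bits [15:0]), and the target_cpu argument is the target's MPIDR_EL1 with everything except the affinity fields cleared. The host-side Python sketch below shows both calculations; the helper names are hypothetical and the field layout is the one described by the SMCCC and PSCI specifications.

```python
AFFINITY_MASK = 0xFF00FFFFFF   # Aff3 in bits [39:32], Aff2..Aff0 in bits [23:0]


def smccc_function_id(number, smc64=True, owner=4, fast=True):
    """Compose an SMCCC function ID; owner 4 is the Standard Secure Service range."""
    return (int(fast) << 31) | (int(smc64) << 30) | (owner << 24) | number


def psci_target_cpu(mpidr):
    """Clear the non-affinity bits of a raw MPIDR_EL1 value."""
    return mpidr & AFFINITY_MASK


PSCI_VERSION = smccc_function_id(0, smc64=False)
CPU_ON_64 = smccc_function_id(3)

print(hex(PSCI_VERSION))                 # 0x84000000
print(hex(CPU_ON_64))                    # 0xc4000003
print(hex(psci_target_cpu(0x80000001)))  # 0x1
```

Masking matters because MPIDR_EL1 bit 31 always reads as one, so a raw register value is never a valid target_cpu argument.
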

FDT / ATags Handoff to Linux

// Linux AArch64 kernel entry: arch/arm64/kernel/head.S primary_entry
// Calling convention from bootloader:
//   x0 = Physical address of device tree blob (DTB/FDT)
//   x1–x3 = 0 (reserved)
// Kernel image sits at a 2 MB aligned base and is entered at its first instruction
// CPU state on entry: MMU off, D-cache off, all of DAIF masked, at EL2 (preferred) or Non-secure EL1

launch_linux:
    // Disable caches and MMU (clean the kernel image and DTB to the point of
    // coherency first if they were written with the D-cache on)
    MRS   x0, SCTLR_EL1
    BIC   x0, x0, #(1 << 0)   // M=0: MMU off
    BIC   x0, x0, #(1 << 2)   // C=0: D-cache off
    BIC   x0, x0, #(1 << 12)  // I=0: I-cache off
    MSR   SCTLR_EL1, x0
    ISB

    // Invalidate TLBs
    TLBI  VMALLE1
    DSB   SY
    ISB

    // Pass FDT physical address in x0 (blob must be 8-byte aligned and at most 2 MB)
    LDR   x0, =FDT_PHYS_ADDR
    MOV   x1, xzr
    MOV   x2, xzr
    MOV   x3, xzr

    // Mask D, A, I, F, then branch to the kernel (no link: it never returns)
    MSR   DAIFSet, #0xF
    LDR   x4, =KERNEL_ENTRY_PHYS
    BR    x4
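
Several of the handoff requirements can be checked on the build host before anything is flashed. The Python sketch below validates a device tree blob and a planned memory layout against the constraints from the arm64 boot protocol: FDT magic 0xD00DFEED, blob size at most 2 MB, blob address 8-byte aligned, and kernel placed text_offset bytes from a 2 MB aligned base. The function name check_handoff and the example addresses are hypothetical.

```python
import struct

FDT_MAGIC = 0xD00DFEED          # first big-endian word of a flattened device tree
TWO_MB = 2 * 1024 * 1024


def check_handoff(dtb, dtb_addr, kernel_addr, text_offset=0):
    """Return a list of problems with a planned kernel handoff (empty = looks fine)."""
    if len(dtb) < 8:
        return ["DTB is too short to hold a header"]
    problems = []
    # The FDT header starts with two big-endian words: magic and totalsize
    magic, totalsize = struct.unpack_from(">II", dtb, 0)
    if magic != FDT_MAGIC:
        problems.append(f"bad FDT magic {magic:#x}")
    if totalsize > TWO_MB:
        problems.append("DTB larger than 2 MB")
    if dtb_addr % 8:
        problems.append("DTB address is not 8-byte aligned")
    if (kernel_addr - text_offset) % TWO_MB:
        problems.append("kernel is not text_offset bytes from a 2 MB aligned base")
    return problems


# Synthetic blob and illustrative addresses inside QEMU virt's DRAM window
dtb = struct.pack(">II", FDT_MAGIC, 0x1000) + bytes(0x1000 - 8)
print(check_handoff(dtb, dtb_addr=0x48000000, kernel_addr=0x40200000))
print(check_handoff(dtb, dtb_addr=0x48000004, kernel_addr=0x40280000))
```

The first call reports no problems; the second flags both the misaligned DTB address and the misaligned kernel base.
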

Key Insight: The most common early boot bugs are enabling the MMU on a new set of page tables while stale TLB entries are still present, or before the table writes are visible to the table walker. The mandatory sequence is: populate page tables → DSB ISH → TLBI VMALLE1{IS} → DSB ISH → ISB → MSR SCTLR_EL1 (M=1) → ISB. Any deviation can leave the CPU executing with inconsistent translation state, which shows up as seemingly random data aborts or instruction aborts immediately after the SCTLR write.

History Evolution

Evolution of ARM Boot: From Single-Stage to Chain of Trust

Early ARM systems (ARM7TDMI era, 1990s) had a trivial boot: the reset vector at address 0x00000000 in ROM jumped straight to application code, with a handful of processor modes but no exception levels, no EL transitions, and no firmware chain. By the Cortex-A8 generation (ARMv7-A), a multi-stage boot (ROM → bootloader → OS) had become the norm. TrustZone, first introduced with ARMv6KZ (ARM1176JZ-S) and carried into ARMv7-A, split the system into Secure and Normal worlds. ARMv8-A formalised the four exception levels (EL0–EL3), and the Arm Trusted Firmware project (now TF-A), first released in 2013, established the BL1→BL2→BL31→BL33 reference chain that most Armv8-A platforms follow today.

The Raspberry Pi boot chain is unusual: the Broadcom VideoCore GPU (VideoCore IV on Pi 1–3, VideoCore VI on Pi 4), not the ARM CPU, is the primary boot processor. On earlier models the GPU loads bootcode.bin from the SD card (on Pi 4 this stage lives in an SPI EEPROM), which in turn loads start.elf (start4.elf on Pi 4), the GPU firmware. Only then does the firmware release the ARM cores from reset, having placed the kernel image and FDT in memory. Because of this GPU-first design, the ARM cores never execute a boot ROM of their own: they start in a small stub supplied by the GPU firmware, and TF-A runs only when it is explicitly installed in place of that stub.

Case Study Server

Case Study: AWS Graviton Boot Chain — From Nitro to Linux in 1.2 Seconds

Amazon's Graviton3 (Neoverse V1, 64 cores) boots through a hardened TF-A chain: the Nitro Security Chip acts as BL1, verifying a chain of trust rooted in one-time-programmable (OTP) eFuses. BL2 loads from SPI NOR flash, performs DRAM training (calibrating DDR5 timing parameters) — the single longest boot phase at ~400 ms. BL31 installs PSCI handlers for 64-core SMP bring-up, then ERET to BL33 (a minimal UEFI firmware). The UEFI stub passes ACPI tables (not FDT, unlike embedded Linux) to the kernel via the EFI system table.

Engineering insight: To achieve sub-2-second boot-to-shell, AWS parallelized DRAM training across all memory controllers and eliminated the U-Boot stage entirely — UEFI directly loads the Linux kernel. PSCI CPU_ON brings secondary cores online asynchronously: the primary core starts scheduling before all 63 secondaries have completed their MMU enable and TLB invalidation. This pipelining reduced total boot time from 8 seconds (sequential) to 1.2 seconds.

Hands-On Exercises

Exercise 1 EL Transitions

Exercise: EL3 → EL1 Minimal Boot Stub

Write a complete AArch64 boot stub that starts at EL3 and drops to EL1. Your code must: (1) disable MMU and caches via SCTLR_EL3, (2) set SP_EL3, (3) configure SCR_EL3 (NS=1, RW=1, HCE=1), (4) configure HCR_EL2 (RW=1), (5) set ELR_EL3 to your EL1 entry point, (6) set SPSR_EL3 for EL1h mode with DAIF masked, (7) ERET. At the EL1 entry point, read CurrentEL and print the level to confirm the transition succeeded: the EL is held in bits [3:2], so the register reads 0x4 at EL1 and the value after shifting right by two is 1. Test on the QEMU virt machine with EL3 and EL2 emulation enabled: qemu-system-aarch64 -M virt,secure=on,virtualization=on -cpu cortex-a72 -nographic -kernel boot.elf.

Exercise 2 MMU

Exercise: Identity-Map MMU Enable at EL1

Extend your EL1 boot stub to enable the MMU with an identity map. Create a Level 1 page table with 1 GB block descriptors: map the first 1 GB (0x00000000–0x3FFFFFFF) as Device-nGnRnE (for UART MMIO), and the second 1 GB (0x40000000–0x7FFFFFFF) as Normal cacheable (for DRAM). Configure MAIR_EL1, TCR_EL1 (T0SZ=25 for 39-bit VA, 4 KB granule), TTBR0_EL1, then enable with the mandatory DSB → TLBI → DSB → ISB → MSR SCTLR_EL1 → ISB sequence. Write a character to QEMU's PL011 UART (0x09000000) after MMU enable to prove the Device mapping works.
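
One way to sanity-check the two block descriptors before writing them into the table is to compute them on the host. The Python sketch below assumes the layout from this exercise (MAIR index 0 = Device-nGnRnE, index 1 = Normal cacheable) and the stage-1 block descriptor format for a 4 KB granule: bits [1:0]=0b01 for a block, AttrIndx in bits [4:2], SH in bits [9:8], AF in bit 10, PXN and UXN in bits 53 and 54. The helper name l1_block is hypothetical.

```python
ONE_GB = 1 << 30


def l1_block(pa, attr_index, shareability=0b00, execute_never=False):
    """Build a level 1 (1 GB) stage-1 block descriptor for a 4 KB granule."""
    if pa % ONE_GB:
        raise ValueError("a level 1 block must be 1 GB aligned")
    # Output address | Access Flag | SH | AttrIndx | block descriptor type
    desc = pa | (1 << 10) | (shareability << 8) | (attr_index << 2) | 0b01
    if execute_never:
        desc |= (1 << 54) | (1 << 53)   # UXN and PXN: never fetch code from here
    return desc


# With T0SZ=25 the level 1 table has 512 entries, indexed by VA[38:30]
table = [0] * 512
table[0] = l1_block(0x00000000, attr_index=0, execute_never=True)   # MMIO, incl. UART
table[1] = l1_block(0x40000000, attr_index=1, shareability=0b11)    # DRAM, Inner Shareable

for index in (0, 1):
    print(index, hex(table[index]))
```

Leaving the Access Flag clear is a frequent cause of an immediate fault after the MMU is enabled, which is why the sketch sets it unconditionally.
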

Exercise 3 PSCI

Exercise: PSCI CPU_ON Secondary Core Bringup

On QEMU virt with -smp 4, bring all four cores online using PSCI. Run this exercise at EL1 on the default virt machine (without secure=on), where QEMU's built-in PSCI implementation answers the calls; in that configuration the conduit is HVC rather than SMC, and a stub running at EL3 would instead have to implement the CPU_ON handler itself. The primary core (MPIDR affinity 0) issues PSCI CPU_ON (function ID 0xC4000003) targeting cores 1, 2, and 3. Each secondary core should: (1) read its MPIDR_EL1, (2) write a unique byte (core ID) to a shared memory location, (3) WFE in a holding pen. After all secondaries are online, the primary core reads the shared memory array and prints all four core IDs via UART. This exercises the same SMP bringup path that Linux drives from arch/arm64/kernel/smp.c.

Boot Sequence Planner

ARM Boot Sequence Planner

Document your platform's boot chain from power-on to OS entry. Download as Word, Excel, or PDF.

Conclusion & Next Steps

We traced the complete Cortex-A boot chain in assembly: EL3 hardware init with cache disable, SCR_EL3/HCR_EL2 configuration for Non-Secure AArch64 execution, the TF-A BL1→BL2→BL31 firmware loading chain, EL3→EL2→EL1 ERET descent with SPSR programming, identity-mapped MMU enable with the mandatory DSB/TLBI/ISB sequence, PSCI CPU_ON for SMP bringup of secondary cores, and the Linux AArch64 kernel entry convention (x0=FDT, caches off, MMU off). Along the way, we examined how real platforms (Raspberry Pi, AWS Graviton) implement these stages and how the boot chain evolved from simple ROM jumps to today's Chain of Trust model.

Next in the Series

In Part 16: Apple Silicon & macOS ABI, we explore ARM64e pointer authentication codes (PAC), BTI, the Mach-O binary format, dyld3 lazy binding, Apple-specific register conventions, and how to read PMU events on M1/M2/M3 without perf(1).
