Back to Technology

ARM Assembly Part 26: ARM in Real Systems — Android, RTOS & Bootloaders

June 25, 2026 Wasil Zafar 23 min read

Deploy ARM assembly in production environments: Android NDK ABI, FreeRTOS/Zephyr RTOS, U-Boot bootloader internals, and Trusted Firmware-A boot flow.

Table of Contents

  1. Android NDK & ABI
  2. FreeRTOS on Cortex-M
  3. Zephyr RTOS on ARM
  4. U-Boot Bootloader
  5. Trusted Firmware-A
  6. Case Study: ARM in Production
  7. Hands-On Exercises
  8. Summary

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 26
1
Architecture History & Core Concepts
ARMv1→v9, RISC philosophy
2
ARM32 Instruction Set Fundamentals
ARM vs Thumb, CPSR
3
AArch64 Registers, Addressing & Data Movement
X/W regs, addressing modes
4
Arithmetic, Logic & Bit Manipulation
ADD/SUB, bitfield, CLZ
5
Branching, Loops & Conditional Execution
Branch types, jump tables
6
Stack, Subroutines & AAPCS
Calling conventions
7
Memory Model, Caches & Barriers
Weak ordering, DMB/DSB/ISB
8
NEON & Advanced SIMD
Vector ops, intrinsics
9
SVE & SVE2 Scalable Vectors
Predicate regs, HPC/ML
10
Floating-Point & VFP Instructions
IEEE-754, rounding modes
11
Exception Levels, Interrupts & Vectors
EL0–EL3, GIC
12
MMU, Page Tables & Virtual Memory
Stage-1 translation
13
TrustZone & Security Extensions
Secure monitor, TF-A
14
Cortex-M Assembly & Bare-Metal
NVIC, SysTick, linker scripts
15
Cortex-A System Programming & Boot
EL3→EL1, MMU setup, PSCI
16
Apple Silicon & macOS ABI
ARM64e PAC, Mach-O, dyld
17
Inline Assembly & C Interop
Constraints, clobbers
18
Performance Profiling & Micro-Opt
Pipeline hazards, PMU
19
Reverse Engineering & Binary Analysis
ELF, disassembly, CFR
20
Building a Bare-Metal OS Kernel
Bootloader, UART, scheduler
21
ARM Microarchitecture Deep Dive
OOO pipelines, branch predict
22
Virtualization Extensions
EL2 hypervisor, stage-2, KVM
23
Debugging & Tooling Ecosystem
GDB, OpenOCD/JTAG, ETM/ITM
24
Linkers, Loaders & Binary Format Internals
ELF deep dive, relocations, PIC
25
Cross-Compilation & Build Systems
GCC/Clang toolchains, CMake, firmware gen
26
ARM in Real Systems
Android, FreeRTOS/Zephyr, U-Boot, TF-A
You Are Here
27
Security Research & Exploitation
ASLR, PAC attacks, ROP/JOP, kernel exploit
28
Emerging ARMv9 & Future Directions
MTE, SME, confidential compute, AI accel

Theory is validated in production. This part walks through five real deployment environments where ARM assembly knowledge is directly exercised: Android native code via the NDK, two embedded RTOSes (FreeRTOS and Zephyr), the U-Boot bootloader that brings up hundreds of millions of Linux boards, and Trusted Firmware-A — the ARM-blessed reference implementation of the secure world boot chain.

Why real systems? Textbook examples run in simulators with flat memory, no timing constraints, and no boot dependencies. Real systems impose ABI contracts, timing budgets, platform quirks, and multi-stage initialisation chains that only appear when you touch actual hardware or production SDKs.
Analogy — The City Infrastructure: Think of an ARM-powered device as a city. TF-A (BL1→BL31) is the underground utility grid — power lines, water mains, gas pipes — laid down before any building exists, invisible once the city is running. U-Boot is the construction crane that erects the buildings (initialises DRAM, loads the kernel) and is removed once the skyline is up. FreeRTOS/Zephyr are the traffic management systems — scheduling which vehicles (tasks) use which roads (CPU time), with PendSV acting as the traffic light that stops one stream and lets another through. Android is the city itself: restaurants (apps), offices (services), and shops (activities) all operating above the infrastructure, connected to the native foundation via bridges (JNI). Each layer trusts the one below it — if the utility grid (TF-A) is compromised, the entire city falls.

Android NDK & ARM ABI

ARM ABI Variants in the NDK

Android supports two ARM ABIs: armeabi-v7a (Thumb-2, VFP-D16, soft-float ABI) and arm64-v8a (AArch64, mandatory on 64-bit devices since Android 5.0). Since the 2019 Play Store requirement that all new apps support 64-bit, arm64-v8a has been the primary target.

ABI
armeabi-v7a vs arm64-v8a
Propertyarmeabi-v7aarm64-v8a
ISAARM32 / Thumb-2AArch64
GPRsr0–r15x0–x30, sp, pc
Arg registersr0–r3 (4 × 32-bit)x0–x7 (8 × 64-bit)
FP/SIMDVFP-D16 or NEONv0–v31 (128-bit)
Stack alignment8 bytes16 bytes
Pointer size32 bits64 bits
SyscallSVC #0 (r7=NR)SVC #0 (x8=NR)

JNI Integration — Calling Native Code from Java

The Java Native Interface (JNI) is the bridge between Dalvik/ART bytecode and ARM native code. Native functions must follow a strict naming convention and the ABI calling convention.

// C/C++ JNI function — compiles to arm64-v8a native library
// File: jni/native-lib.cpp
#include <jni.h>
#include <string.h>
#include <arm_neon.h>

// JNI name: Java_com_example_myapp_MainActivity_dotProduct
// x0 = JNIEnv*, x1 = jobject thiz, x2 = jfloatArray a, x3 = jfloatArray b
extern "C" JNIEXPORT jfloat JNICALL
Java_com_example_myapp_MainActivity_dotProduct(
        JNIEnv *env, jobject /* thiz */,
        jfloatArray a, jfloatArray b)
{
    jsize len = env->GetArrayLength(a);
    float *pa = env->GetFloatArrayElements(a, nullptr);
    float *pb = env->GetFloatArrayElements(b, nullptr);

    float32x4_t vsum = vdupq_n_f32(0.0f);
    jsize i = 0;
    for (; i + 3 < len; i += 4) {
        float32x4_t va = vld1q_f32(pa + i);
        float32x4_t vb = vld1q_f32(pb + i);
        vsum = vmlaq_f32(vsum, va, vb);   // vsum += va * vb (fused multiply-add)
    }
    // Horizontal add of vsum lanes
    float sum = vaddvq_f32(vsum);
    for (; i < len; i++) sum += pa[i] * pb[i];   // tail

    env->ReleaseFloatArrayElements(a, pa, JNI_ABORT);
    env->ReleaseFloatArrayElements(b, pb, JNI_ABORT);
    return sum;
}
# CMakeLists.txt for Android NDK
cmake_minimum_required(VERSION 3.22)
project(myapp)

add_library(native-lib SHARED jni/native-lib.cpp)

# AArch64 NEON is always available — no -mfpu flag needed
target_compile_options(native-lib PRIVATE
    -O3
    -march=armv8-a+simd        # enable explicit SVE/NEON intrinsics
    -ffast-math
)

find_library(log-lib log)
target_link_libraries(native-lib ${log-lib})
# Build with NDK CMake wrapper
# In app/build.gradle:
# externalNativeBuild { cmake { path "CMakeLists.txt" } }
# defaultConfig { externalNativeBuild { cmake { abiFilters "arm64-v8a" } } }

# Direct command-line build (NDK r25+)
$NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/clang++ \
    --target=aarch64-linux-android33 \
    --sysroot=$NDK/toolchains/llvm/prebuilt/darwin-x86_64/sysroot \
    -O3 -march=armv8-a+simd -ffast-math \
    -shared -fPIC \
    jni/native-lib.cpp -o libnative-lib.so

Bionic libc — Key Differences from glibc

Android's Bionic C library is smaller and faster than glibc but has important differences that surface in native ARM code:

Bionic
Bionic vs glibc differences that affect assembly
  • No pthread_cancel — asynchronous cancellation points absent; code relying on cancel-safe wrappers must be reworked.
  • No getcontext/swapcontext — coroutine libraries must use raw setjmp/longjmp or hand-written context switch.
  • Stack size — main thread 8 MB by default; spawned threads 1 MB. Deeply recursive assembly-heavy code must call pthread_attr_setstacksize.
  • System call numbers — Bionic uses the kernel ABI directly (x8 = syscall number, SVC #0). Numbers match arch/arm64/include/asm/unistd.h in the kernel tree.
  • TLS access — thread-local storage slot 0 (TPIDR_EL0) points to Bionic pthread_internal_t; slot 1 is errno.
// Accessing errno on Android/Bionic — assembly equivalent
// In C: errno is at __get_tls()[1]
// In ARM64 ASM:
//   mrs  x0, tpidr_el0      // load TLS base
//   ldr  x0, [x0, #8]       // slot 1 = errno pointer
//   ldr  w0, [x0]           // load errno value

// Minimal raw syscall (no Bionic wrapper)
// __NR_write = 64  (arm64 kernel)
static inline long raw_write(int fd, const void *buf, size_t count)
{
    register long x0 __asm__("x0") = fd;
    register long x1 __asm__("x1") = (long)buf;
    register long x2 __asm__("x2") = count;
    register long x8 __asm__("x8") = 64;   // __NR_write
    __asm__ volatile ("svc #0"
        : "+r"(x0) : "r"(x1), "r"(x2), "r"(x8)
        : "memory", "cc");
    return x0;
}

FreeRTOS on Cortex-M

FreeRTOS is the dominant RTOS for Cortex-M microcontrollers. Its scheduler relies on SysTick for periodic preemption and on PendSV for deferred context switching — both implemented in ARM assembly.

Context switch mechanism: FreeRTOS uses two exceptions cooperatively. SysTick (highest configurable priority) marks a task switch needed and pends PendSV. PendSV (lowest priority) then performs the actual register save/restore, guaranteeing the switch happens after all higher-priority ISRs complete.

Task Creation & Scheduler API

// FreeRTOS task API — Cortex-M4 (STM32F4 example)
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

// configTICK_RATE_HZ = 1000  (1 ms tick via SysTick)
// configMAX_SYSCALL_INTERRUPT_PRIORITY = 0x50  (5 NVIC priority levels reserved)

static QueueHandle_t xDataQueue;

// Producer task — highest priority
void vProducerTask(void *pvParams)
{
    (void)pvParams;
    uint32_t count = 0;
    for (;;) {
        count++;
        xQueueSend(xDataQueue, &count, portMAX_DELAY);
        vTaskDelay(pdMS_TO_TICKS(10));   // yield for 10 ms
    }
}

// Consumer task — lower priority
void vConsumerTask(void *pvParams)
{
    (void)pvParams;
    uint32_t received;
    for (;;) {
        if (xQueueReceive(xDataQueue, &received, portMAX_DELAY) == pdTRUE) {
            // process received
        }
    }
}

int main(void)
{
    xDataQueue = xQueueCreate(16, sizeof(uint32_t));

    xTaskCreate(vProducerTask, "PROD", 256, NULL, 3, NULL);
    xTaskCreate(vConsumerTask, "CONS", 256, NULL, 2, NULL);

    vTaskStartScheduler();   // never returns if heap is sufficient
    for (;;) {}
}

Context Switch Assembly — PendSV Handler

The FreeRTOS PendSV handler is pure ARM assembly. On Cortex-M4 the hardware stacks {r0–r3, r12, lr, pc, xPSR} automatically on exception entry; the RTOS handler saves the remaining callee-saved registers {r4–r11} and the floating-point context.

; FreeRTOS PendSV_Handler — Cortex-M4 + FPU (ARM syntax, GCC/Keil)
; extern void *pxCurrentTCB;  // pointer to current TCB (thread control block)

    .syntax unified
    .thumb
    .global PendSV_Handler
    .type   PendSV_Handler, %function
PendSV_Handler:
    ; --- Save current task context ---
    mrs     r0, psp                 ; r0 = process stack pointer
    isb                             ; instruction sync barrier

    ldr     r3, =pxCurrentTCB       ; r3 = &pxCurrentTCB
    ldr     r2, [r3]                ; r2 = pxCurrentTCB (TCB pointer)

    ; Check if FPU context is active (CONTROL.FPCA bit)
    tst     lr, #0x10               ; EXC_RETURN: bit4=0 → FPU used
    it      eq
    vstmdbeq r0!, {s16-s31}         ; save S16-S31 (S0-S15 already on HW stack)

    stmdb   r0!, {r4-r11, lr}       ; save r4-r11 and EXC_RETURN
    str     r0, [r2]                ; store new PSP into TCB->pxTopOfStack

    ; --- Select next task ---
    stmdb   sp!, {r3}               ; preserve r3 across C call
    mov     r0, #configMAX_SYSCALL_INTERRUPT_PRIORITY
    msr     basepri, r0             ; disable interrupts ≤ config priority
    dsb
    isb
    bl      vTaskSwitchContext       ; updates pxCurrentTCB
    mov     r0, #0
    msr     basepri, r0             ; re-enable all interrupts

    ldmia   sp!, {r3}               ; restore r3

    ; --- Restore next task context ---
    ldr     r1, [r3]                ; r1 = new pxCurrentTCB
    ldr     r0, [r1]                ; r0 = new task's pxTopOfStack

    ldmia   r0!, {r4-r11, lr}       ; restore r4-r11 and EXC_RETURN

    tst     lr, #0x10               ; check FPU again
    it      eq
    vldmiaeq r0!, {s16-s31}         ; restore S16-S31 if FPU context present

    msr     psp, r0                 ; restore PSP
    isb
    bx      lr                      ; return from exception (hardware restores r0-r3,r12,lr,pc,xPSR)
    .size   PendSV_Handler, .-PendSV_Handler

SysTick ISR & Tick Hook

// FreeRTOS SysTick handler (in port.c — simplified excerpt)
// Cortex-M: SysTick fires every 1/configTICK_RATE_HZ seconds.
// The handler increments the tick count and optionally pends PendSV.

void xPortSysTickHandler(void)
{
    // The SysTick runs at max syscall priority —
    // lock the scheduler before touching kernel structures.
    portDISABLE_INTERRUPTS();
    {
        // Increment tick; returns pdTRUE if a context switch is needed
        if (xTaskIncrementTick() != pdFALSE) {
            // Trigger PendSV to do the actual switch after all ISRs finish
            portNVIC_INT_CTRL_REG = portNVIC_PENDSVSET_BIT;
        }
    }
    portENABLE_INTERRUPTS();
}

// Low-power idle hook — entered when no tasks are ready
void vApplicationIdleHook(void)
{
    // WFI suspends the core until the next interrupt (SysTick, peripheral, etc.)
    __asm volatile ("wfi");
}

Zephyr RTOS on ARM

Zephyr is a Linux Foundation RTOS targeting everything from Cortex-M0 to Cortex-A53. Its hardware abstraction is driven by Device Tree (DTS) and Kconfig, making the same application code portable across hundreds of boards.

Device Tree, Kconfig & West Build

# Build a Zephyr blinky application for nRF52840-DK (Cortex-M4F)
west build -b nrf52840dk/nrf52840 samples/basic/blinky

# Build for STM32 Nucleo-F401RE
west build -b nucleo_f401re samples/basic/blinky

# Build for QEMU ARM Cortex-M3 emulator
west build -b qemu_cortex_m3 samples/hello_world
west build -t run     # run in QEMU

# Flash to real hardware via OpenOCD
west flash --runner openocd

# Flash via J-Link
west flash --runner jlink
// Zephyr application — GPIO blink with ARM-aware timing
#include <zephyr/kernel.h>
#include <zephyr/drivers/gpio.h>

// DTS alias: led0 → &gpio0 pin 13 (board-specific)
#define LED0_NODE DT_ALIAS(led0)
static const struct gpio_dt_spec led = GPIO_DT_SPEC_GET(LED0_NODE, gpios);

int main(void)
{
    gpio_pin_configure_dt(&led, GPIO_OUTPUT_ACTIVE);

    while (1) {
        gpio_pin_toggle_dt(&led);
        k_msleep(500);   // uses SysTick via Zephyr kernel timer
    }
    return 0;
}
# prj.conf — Kconfig options for stack size and FPU
CONFIG_MAIN_STACK_SIZE=2048
CONFIG_FPU=y              # enable hardware FPU context save/restore
CONFIG_FPU_SHARING=y      # multiple threads may use FPU

# Override thread stack analysis
CONFIG_THREAD_STACK_INFO=y
CONFIG_THREAD_ANALYZER=y
Zephyr ARM context switch: Zephyr's Cortex-M port similarly uses PendSV for context switching. The assembly implementation is in arch/arm/core/aarch32/swap_helper.S. On Cortex-M33 (ARMv8-M) it leverages the security extension stack banked registers when TrustZone is enabled.

U-Boot Bootloader

U-Boot is the dominant open-source bootloader for Linux on ARM. It runs before the kernel — initialising DRAM, configuring clocks, and loading the kernel image plus Device Tree blob (DTB) into memory.

Boot Flow & SPL

Boot chain
Typical 4-stage ARM Linux boot
  1. ROM bootloader (BootROM) — on-chip, vendor ROM. Loads SPL from eMMC/SD boot partition into on-chip SRAM (typically 64–256 KB).
  2. SPL (Secondary Program Loader) — tiny first-stage U-Boot. Initialises DRAM controller, then loads full U-Boot into DRAM.
  3. U-Boot proper — full bootloader. Initialises peripherals, parses boot_targets, loads kernel + DTB + initrd.
  4. Linux kernel — entered via booti command (Image format, AArch64) or bootz (zImage, ARM32).
# U-Boot SPL entry — arch/arm/cpu/armv8/start.S (simplified)
# The BootROM jumps here with:
#   x0 = pointer to ATF/BL31 info (platform specific)
#   x1 = machine ID or 0
#   All other cores spinning at wfe loop

    .global _start
    .type   _start, @function
_start:
    /* Switch to AArch64 EL3 if BootROM left us there */
    mrs     x0, CurrentEL
    lsr     x0, x0, #2
    cmp     x0, #3
    b.eq    el3_entry

el1_entry:
    /* Already at EL1 (some platforms drop to EL1 directly) */
    adr     x0, _start
    ldr     x1, =CONFIG_SYS_TEXT_BASE
    /* relocate if load address != link address ... */
    b       board_init_f_init_reserve

el3_entry:
    /* Configure SCTLR_EL3: disable MMU/cache, set endianness */
    mov     x0, #0
    msr     sctlr_el3, x0
    isb

    /* Set up EL3 stack */
    ldr     x0, =CONFIG_SYS_INIT_SP_ADDR
    bic     sp, x0, #0xf          /* 16-byte align */
    bl      board_init_f          /* C entry point */

Relocation & FDT Manipulation

U-Boot relocates itself to the top of available DRAM to leave low memory for the kernel. After relocation, it patches the Device Tree blob with runtime information (serial number, MAC addresses, memory map).

// U-Boot board init sequence (board/myboard/myboard.c)
#include <common.h>
#include <fdt_support.h>

// Called after DRAM init — set up board-level structures
int board_init(void)
{
    // Example: configure a GPIO for board ID detection
    // (platform-specific, shown schematically)
    return 0;
}

// Called just before booting the kernel: patch DTB with runtime info
int ft_board_setup(void *blob, struct bd_info *bd)
{
    // Patch /memory node with actual DRAM size detected at runtime
    u64 base = CONFIG_SYS_SDRAM_BASE;
    u64 size = (u64)gd->ram_size;
    fdt_fixup_memory(blob, base, size);

    // Add Ethernet MAC address read from OTP efuses
    u8 mac[6];
    read_mac_from_efuse(mac);
    do_fixup_by_path(blob, "/ethernet@ff200000", "mac-address",
                     mac, sizeof(mac), 1);
    return 0;
}

Distro Boot — Generic Boot Script

# U-Boot environment for distro boot (Raspberry Pi 4 example)
# This lives in /boot/extlinux/extlinux.conf on the SD card

LABEL Linux
    KERNEL /Image                   # AArch64 uncompressed kernel
    FDT    /dtbs/broadcom/bcm2711-rpi-4-b.dtb
    APPEND root=/dev/mmcblk0p2 rootwait console=ttyS0,115200

# U-Boot command that triggers the above:
# => run distro_bootcmd
# Internally executes:
# load mmc 0:1 ${loadaddr} /Image
# load mmc 0:1 ${fdt_addr} /dtbs/broadcom/bcm2711-rpi-4-b.dtb
# booti ${loadaddr} - ${fdt_addr}

# Manual kernel boot from U-Boot prompt
=> setenv bootargs "root=/dev/mmcblk0p2 rootwait console=ttyS0,115200"
=> load mmc 0:1 0x40480000 /Image
=> load mmc 0:1 0x44000000 /dtbs/broadcom/bcm2711-rpi-4-b.dtb
=> booti 0x40480000 - 0x44000000

Trusted Firmware-A (TF-A)

TF-A is ARM's reference implementation of the secure-world firmware. It executes at EL3 and is responsible for the BL1→BL2→BL31/BL32/BL33 boot chain and runtime PSCI services used to power-manage cores from the OS.

BL1–BL33 Boot Chain

TF-A stages
Five boot loader stages
StageELLocationResponsibility
BL1EL3Trusted ROMAP cold boot, authentication, loads BL2
BL2EL1-STrusted SRAMPlatform init, loads BL31 + BL32 + BL33
BL31EL3 (resident)Trusted SRAMSecure monitor: SMC dispatcher, PSCI, GIC setup
BL32S-EL1Trusted DRAMTrusted OS (OP-TEE) — optional
BL33EL2 or EL1-NSDRAMNormal world bootloader (U-Boot or UEFI)
; TF-A BL1 cold boot entry (bl1/aarch64/bl1_entrypoint.S — simplified)
    .globl  bl1_entrypoint
bl1_entrypoint:
    ; Verify we are at EL3
    mrs     x0, CurrentEL
    lsr     x0, x0, #2
    cmp     x0, #MODE_EL3
    b.ne    do_panic

    ; Reset SCTLR_EL3 — disable MMU, disable caches, set little-endian
    mov_imm x0, (SCTLR_EL3_RES1)   // required-1 bits
    msr     sctlr_el3, x0
    isb

    ; Configure SCR_EL3 — set NS=0 for secure world initially
    mov_imm x0, (SCR_RESET_VAL)
    msr     scr_el3, x0
    isb

    ; Set up EL3 stack pointer
    adr     x0, __BL1_RAM_START__
    add     x0, x0, #BL1_STACKS_OFFSET
    mov     sp, x0

    ; Zero-initialise BSS
    adr     x0, __BSS_START__
    adr     x1, __BSS_END__
bl1_bss_loop:
    str     xzr, [x0], #8
    cmp     x0, x1
    b.lo    bl1_bss_loop

    ; Jump to C entry
    bl      bl1_setup           ; platform_setup(), console_init()
    bl      bl1_main            ; loads BL2, validates hash chain
    ; bl1_main never returns — it jumps to BL31 via ERET

PSCI SMC Handler

Power State Coordination Interface (PSCI) is the standard ARM interface for OS-initiated core power management. Linux calls SVC (in EL1) or HVC (in EL2) which traps to EL3 where BL31 handles it.

// TF-A PSCI handler (lib/psci/psci_common.c — conceptual excerpt)
// SMC function IDs (PSCI 1.1, 64-bit convention)
#define PSCI_CPU_OFF            0xC4000002
#define PSCI_CPU_ON_AARCH64     0xC4000003
#define PSCI_SYSTEM_SUSPEND     0xC400000E
#define PSCI_SYSTEM_RESET       0x84000009

// Called from SMC dispatcher after verifying caller EL is ≥ EL1
int psci_handler(uint32_t smc_fid,
                 u_register_t x1, u_register_t x2, u_register_t x3)
{
    switch (smc_fid) {
    case PSCI_CPU_ON_AARCH64:
        // x1 = target_cpu (MPIDR), x2 = entry_point_address, x3 = context_id
        return psci_cpu_on((u_register_t)x1,
                           (uintptr_t)x2, (u_register_t)x3);

    case PSCI_CPU_OFF:
        // Powers off calling CPU; does not return to caller
        psci_cpu_off();
        panic();   // unreachable

    case PSCI_SYSTEM_SUSPEND:
        // x1 = entry_point address; powers off cluster if all CPUs off
        return psci_system_suspend((uintptr_t)x1, x2);

    case PSCI_SYSTEM_RESET:
        psci_system_reset();
        panic();   // unreachable

    default:
        return PSCI_E_NOT_SUPPORTED;
    }
}
; BL31 SMC entry vector (Secure Monitor Call handler)
; When Linux calls SVC from EL1, EL3 receives control here.
; x0 = SMC function identifier (SMCCC standard)
; x1–x6 = arguments

    .globl  smc_handler64
smc_handler64:
    ; Save general-purpose registers (caller ABI args preserved in x0-x6)
    stp     x0,  x1,  [sp, #-16]!
    stp     x2,  x3,  [sp, #-16]!
    stp     x4,  x5,  [sp, #-16]!
    stp     x6,  x7,  [sp, #-16]!
    stp     x8,  x9,  [sp, #-16]!
    stp     x18, x30, [sp, #-16]!

    ; Decode SMC function ID and dispatch
    mov     w19, w0             ; save fid
    bl      smc_dispatch        ; C function returns result in x0

    ; Restore non-volatile state, return result in x0
    ldp     x18, x30, [sp], #16
    ldp     x8,  x9,  [sp], #16

    ; Restore callee-saved but NOT x0 (holds return value from smc_dispatch)
    add     sp, sp, #(6*8)      ; skip x1-x6 (caller discards them)
    eret                        ; return to EL1 NS world
Chain of trust: TF-A uses ARM's Trusted Board Boot Requirements (TBBR) specification. Each stage's binary is hash-chained: BL1 verifies BL2's hash, BL2 verifies BL31/BL32/BL33. Hash roots are stored in OTP fuses. Bypassing a link in this chain requires physical access to inject before authentication, or a vulnerability in the authentication logic itself.

Case Study: ARM in Production at Scale

Industry Case Study

Raspberry Pi: The Full ARM Stack in 145 Million Units

The Raspberry Pi is the most tangible example of every system covered in this article working together in a single product. Since 2012, over 145 million units have shipped, evolving from ARMv6 (ARM1176JZF-S) to ARMv8-A (Cortex-A76 in Pi 5).

Boot chain: The Pi 4/5 uses a VideoCore GPU BootROM → start4.elf (GPU firmware, analogous to BL1/BL2) → U-Boot or the Linux kernel directly. TF-A (BL31) runs at EL3 providing PSCI services for CPU hotplug. The Device Tree Blob (bcm2711-rpi-4-b.dtb) is loaded by U-Boot and passed to the kernel — exactly as shown in the distro boot section above.

RTOS usage: Industrial variants like the Raspberry Pi Pico (RP2040, dual Cortex-M0+) run FreeRTOS for real-time control — robotic arms, CNC machines, 3D printers. The Zephyr RTOS also supports the RP2040 via west build -b rpi_pico.

Android: Android runs on the Pi 4 via community builds (KonstaKANG), demonstrating the full NDK / JNI / Bionic stack on a $35 board. The same ABI constraints (arm64-v8a, 16-byte stack alignment, NEON always-on) apply whether you're on a Pixel phone or a Pi.

Historical Evolution

ARM Software Ecosystem: From Proprietary to Open Source (1990–2024)

1990s — Closed World: ARM development meant purchasing ARM's own SDT (Software Development Toolkit) or Keil's tools. No open-source RTOS existed for ARM. Boot code was vendor-proprietary, shipped as binary blobs. Cross-compilation from x86 required expensive Metrowerks CodeWarrior licenses.

2000s — Open Source Rising: U-Boot (derived from PPCBoot, 2000) became the standard ARM bootloader. FreeRTOS launched in 2003, offering a free RTOS alternative to VxWorks ($30K+ licenses). GCC's ARM support matured, making free cross-compilation practical. Android's launch (2008) brought ARM to mainstream consumer devices — the NDK (r1, June 2009) gave developers direct access to ARM native code.

2010s — Standardisation Era: ARM published PSCI (2013), standardising how OSes manage core power across all vendors. Linaro (founded 2010) unified ARM Linux kernel support. TF-A (originally ARM Trusted Firmware, 2014) became the reference EL3 implementation. Zephyr RTOS (2016) brought Linux Foundation governance to embedded. Android required 64-bit (arm64-v8a) support by 2019.

2020s — Convergence: The same TF-A/U-Boot/Linux stack runs on Raspberry Pi, AWS Graviton servers, and automotive ECUs. FreeRTOS was acquired by Amazon (2017) and integrated with AWS IoT. Zephyr surpassed FreeRTOS in supported board count (500+). ARM's SystemReady certification program ensured standards-based boot across all Cortex-A platforms.

Hands-On Exercises

Exercise 1 — Beginner

Build & Run a Zephyr Application in QEMU

Install the Zephyr SDK and build the hello_world sample for the QEMU Cortex-M3 target. Then modify it to print the SysTick reload value and the number of context switches after 5 seconds of running two threads.

# Install Zephyr (follow https://docs.zephyrproject.org/latest/develop/getting_started/)
pip install west
west init ~/zephyrproject
cd ~/zephyrproject && west update

# Build hello_world for QEMU Cortex-M3
west build -b qemu_cortex_m3 samples/hello_world
west build -t run

# Expected output:
# *** Booting Zephyr OS build v3.x.0 ***
# Hello World! qemu_cortex_m3

# Now build the philosophers sample (multi-threaded):
west build -b qemu_cortex_m3 samples/philosophers -p
west build -t run

# Observe thread scheduling with CONFIG_THREAD_ANALYZER=y in prj.conf

Expected Learning: Zephyr's west tool abstracts cross-compilation entirely. The same application code runs on QEMU and real hardware with only a board change (-b flag).

Exercise 2 — Intermediate

Trace FreeRTOS Context Switches on STM32

Using STM32CubeIDE or QEMU with the STM32F4 machine, create a FreeRTOS project with 3 tasks at different priorities. Add instrumentation to the PendSV handler to toggle a GPIO pin on every context switch, then measure the switch frequency and latency.

// Add to your FreeRTOS project — context switch counter
#include "FreeRTOS.h"
#include "task.h"

volatile uint32_t context_switch_count = 0;

// Hook called by FreeRTOS on every task switch
void traceTASK_SWITCHED_IN(void)
{
    context_switch_count++;
    // Toggle PA5 (LED on Nucleo-F401RE) to visualise on oscilloscope
    GPIOA->ODR ^= (1 << 5);
}

// Three tasks at different priorities
void vHighTask(void *p)  { for(;;) { vTaskDelay(1); } }
void vMedTask(void *p)   { for(;;) { vTaskDelay(5); } }
void vLowTask(void *p)   {
    for(;;) {
        vTaskDelay(pdMS_TO_TICKS(1000));
        printf("Switches/sec: %lu\n", context_switch_count);
        context_switch_count = 0;
    }
}

int main(void)
{
    HAL_Init();
    SystemClock_Config();
    xTaskCreate(vHighTask, "H", 128, NULL, 3, NULL);
    xTaskCreate(vMedTask,  "M", 128, NULL, 2, NULL);
    xTaskCreate(vLowTask,  "L", 256, NULL, 1, NULL);
    vTaskStartScheduler();
}

Expected Learning: Context switch frequency depends on task delay values and priorities. On a Cortex-M4 at 168 MHz, PendSV handler completes in ~1-2 microseconds including FPU context save.

Exercise 3 — Advanced

Boot U-Boot + Linux on QEMU virt Machine

Build U-Boot from source for the QEMU AArch64 virt machine, create a minimal initramfs, and boot a complete Linux system. Observe the full boot chain: U-Boot SPL → U-Boot proper → Linux kernel → userspace.

# 1. Build U-Boot for QEMU ARM64 virt
git clone https://source.denx.de/u-boot/u-boot.git
cd u-boot
make qemu_arm64_defconfig
make CROSS_COMPILE=aarch64-linux-gnu- -j$(nproc)

# Output: u-boot.bin (BL33 image for QEMU)

# 2. Download a prebuilt AArch64 Linux kernel (or build from source)
wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.6.tar.xz
# ... configure with: make ARCH=arm64 defconfig && make ARCH=arm64 Image

# 3. Create minimal initramfs
mkdir -p initramfs/{bin,dev,proc,sys}
cat > initramfs/init <<'INIT'
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
echo "=== ARM64 Linux booted via U-Boot ==="
echo "Kernel: $(uname -r)"
echo "CPU: $(cat /proc/cpuinfo | grep 'model name' | head -1)"
exec /bin/sh
INIT
chmod +x initramfs/init
# ... (statically link busybox into initramfs/bin/sh)

# 4. Boot with QEMU
qemu-system-aarch64 \
    -machine virt -cpu cortex-a57 -m 512M \
    -bios u-boot.bin \
    -device loader,file=Image,addr=0x40400000 \
    -device loader,file=initramfs.cpio.gz,addr=0x44000000 \
    -serial stdio -display none -no-reboot

Expected Learning: U-Boot initialises the virt machine's DRAM and GICv2, then booti transfers control to the kernel at EL1. The full chain demonstrates every concept from Parts 15 (boot), 20 (bare-metal), and 24 (linkers/ELF).

Summary

Five production environments — each exercising a distinct layer of ARM system knowledge:

Summary
Key takeaways by environment
  • Android NDK: arm64-v8a ABI enforces 16-byte stack alignment, 8 argument registers, and NEON always available. Bionic differs from glibc in TLS layout and absent POSIX features.
  • FreeRTOS: PendSV handler is the core context switch engine. Hardware auto-saves r0–r3/r12/lr/pc/xPSR; the RTOS saves r4–r11 and optional FPU S16–S31.
  • Zephyr: Device Tree + Kconfig abstracts hardware; the ARM port still drops to assembly for context switch and exception setup.
  • U-Boot: SPL bootstraps DRAM from BootROM, U-Boot proper handles FDT patching and distro boot. AArch64 entry code at _start mirrors bare-metal kernel entry.
  • TF-A: BL31 is the permanent EL3 resident. PSCI SMC handler dispatches core power calls from Linux to firmware with full register save/restore discipline.

Next in the Series

In Part 27: Security Research & Exploitation, we examine ASLR bypass, PAC oracle attacks, ROP/JOP chain construction on AArch64, and kernel exploitation patterns — applying everything learned across the series to offensive security.

Technology