
ARM Assembly Part 23: Debugging & Tooling Ecosystem

June 11, 2026 Wasil Zafar 22 min read

The ARM debugging ecosystem spans from QEMU software simulation up to hardware probes with instruction-level trace. This part covers the GDB remote serial protocol, OpenOCD configuration for JTAG and SWD adapters, ETM/ITM trace decoding, and practical kernel debugging workflows with QEMU.

Table of Contents

  1. Debugging Architecture Overview
  2. GDB Remote Protocol
  3. OpenOCD & JTAG/SWD
  4. ETM Instruction Trace
  5. ITM Stimulus Ports
  6. QEMU as Debug Target
  7. Kernel Debugging Workflows
  8. Case Study: ARM Boot Failure
  9. Hands-On Exercises
  10. Conclusion & Next Steps

Debugging Architecture Overview

Series Overview: Part 23 of 28. Related: Part 14 (Cortex-M/NVIC), Part 15 (Cortex-A boot), Part 20 (bare-metal kernel).

Real-World Analogy — A Doctor's Diagnostic Toolkit: Debugging an ARM system is like a doctor diagnosing a patient using increasingly invasive tools. Printf/UART debugging is like asking the patient "where does it hurt?" — cheap, non-invasive, but relies on the patient (code) being conscious (running) and honest (correct print statements). GDB breakpoints are an X-ray: you halt the patient (stop execution), take a snapshot of internal state (registers, memory), then let them continue. JTAG/SWD is surgery — you physically connect probes to the patient's nervous system (debug port), gaining total control even when the patient is unconscious (crashed or in a tight loop). ETM instruction trace is a continuous ECG monitor: it records every heartbeat (instruction) in real-time without slowing the patient down, letting you reconstruct exactly what happened leading up to a cardiac event (crash). ITM stimulus ports are blood pressure cuffs — lightweight, always-on monitoring channels that report vital signs (debug events) without disrupting the patient. Each tool has a cost-benefit trade-off: more invasive means more data but more disruption.
ARM Debug Architecture Layers:
Hardware: JTAG/SWD pins → Debug Access Port (DAP) → Memory Access Port (MEM-AP) → system bus
Trace: ETM (instruction-level) → CoreSight trace bus → TPIU/ETB → host
Protocol: OpenOCD (JTAG/SWD) → GDB Remote Serial Protocol (RSP) → GDB
Virtual: QEMU -s flag → built-in GDB stub (no hardware needed)

GDB Remote Protocol

# ── Connect GDB to a QEMU GDB stub (from Part 20 bare-metal kernel) ──
qemu-system-aarch64 \
    -machine virt -cpu cortex-a57 -m 128M \
    -kernel kernel.elf -serial stdio -display none \
    -s -S   # -s: GDB on port 1234, -S: halt at boot

# In a second terminal:
aarch64-linux-gnu-gdb kernel.elf
(gdb) target remote :1234
(gdb) b _start
(gdb) continue
(gdb) info registers           # Show all GP registers
(gdb) x/4i $pc                 # Disassemble 4 instructions at PC
(gdb) x/32xg $sp               # Dump 32 quadwords from stack pointer
(gdb) p/x *(uint64_t*)0x40000000   # Read memory (a physical address only while the MMU is off)
(gdb) set $x0 = 0x1234         # Modify register
(gdb) watch *(uint64_t*)0x40100000 # Watchpoint on a memory address (hardware-assisted when available)

# ── Useful AArch64-specific GDB commands ──
# Inspect registers through the QEMU monitor passthrough:
(gdb) monitor info registers   # QEMU: show all QEMU-visible regs

# Set conditional breakpoint (break when X0 == 5):
(gdb) b some_function if $x0 == 5

# TUI mode for source+asm side by side:
(gdb) tui enable
(gdb) layout split

# Save/load breakpoints:
(gdb) save breakpoints bp.txt
(gdb) source bp.txt
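
Every command above travels to the stub as a framed Remote Serial Protocol packet: a `$`, the payload, a `#`, and a two-digit hex checksum equal to the sum of the payload bytes modulo 256. The following is a minimal sketch of that framing only. The function names are illustrative, not part of GDB, and the sketch ignores escaping of `$`, `#` and `}` as well as run-length encoding, both of which a real stub has to handle.

```python
def rsp_checksum(payload: str) -> int:
    """Modulo-256 sum of the payload bytes, as used by the GDB remote protocol."""
    return sum(payload.encode("ascii")) % 256


def rsp_frame(payload: str) -> str:
    """Wrap a payload as $<payload>#<two lowercase hex digits>."""
    return f"${payload}#{rsp_checksum(payload):02x}"


def rsp_unframe(packet: str) -> str:
    """Validate a framed packet and return its payload."""
    if len(packet) < 4 or not packet.startswith("$") or packet[-3] != "#":
        raise ValueError("malformed packet")
    payload = packet[1:-3]
    received = int(packet[-2:], 16)
    if rsp_checksum(payload) != received:
        raise ValueError("checksum mismatch")
    return payload


if __name__ == "__main__":
    print(rsp_frame("g"))            # 'g' asks the stub for all general registers
    print(rsp_frame("m40000000,8"))  # 'm addr,length' reads memory
```

In GDB, `set debug remote 1` prints the real packets exchanged with the stub, which is a convenient way to compare against this sketch.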

OpenOCD & JTAG/SWD

# ── Connect OpenOCD to a Raspberry Pi 4 via J-Link ──
# (The Raspberry Pi 4 exposes JTAG on GPIO 22-27 once enable_jtag_gpio=1 is set in config.txt)

# OpenOCD config for RPi4 via J-Link
cat > rpi4.cfg <<'EOF'
# Probe: SEGGER J-Link in JTAG mode (the pins the Pi exposes are JTAG, not SWD)
source [find interface/jlink.cfg]
transport select jtag

# Target: BCM2711, 4x ARM Cortex-A72 (AArch64); this target file ships with recent OpenOCD releases
source [find target/bcm2711.cfg]

# JTAG clock in kHz: start conservative
adapter speed 1000
EOF

openocd -f rpi4.cfg &

# Then attach GDB:
aarch64-linux-gnu-gdb vmlinux
(gdb) target extended-remote :3333
(gdb) monitor reset halt
(gdb) b start_kernel
(gdb) continue

# ── OpenOCD flash programming (Cortex-M embedded, Part 14 context) ──
# Example: flash bare-metal firmware to STM32 via ST-Link
openocd \
    -f interface/stlink.cfg \
    -f target/stm32f4x.cfg \
    -c "program firmware.elf verify reset exit"

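# Memory read/write from the OpenOCD telnet interface.
# 0x20000000 is the start of SRAM on STM32F4; 0x40000000 is peripheral space, so avoid writing there blindly.
# Keep comments on their own lines: the Tcl shell does not treat a trailing '#' as a comment.
telnet localhost 4444
> halt
> mdw 0x20000000 8
> mww 0x20000000 0xDEADBEEF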
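
When memory dumps are captured from the telnet session into a log, it can help to turn them back into numbers. The sketch below assumes the usual `mdw` output layout of an address, a colon, and space-separated 32-bit hex words per line. `parse_mdw` is a hypothetical helper, not part of OpenOCD.

```python
def parse_mdw(text: str) -> dict:
    """Map each word address to its value from 'mdw'-style output lines."""
    words = {}
    for line in text.splitlines():
        addr_part, sep, data_part = line.partition(":")
        if not sep:
            continue  # not a dump line (prompt, blank line, log message)
        try:
            base = int(addr_part.strip(), 16)
        except ValueError:
            continue  # text before the colon was not an address
        for index, token in enumerate(data_part.split()):
            words[base + 4 * index] = int(token, 16)
    return words


if __name__ == "__main__":
    sample = "0x20000000: deadbeef 00000001 \n0x20000008: 00000002\n"
    for address, value in sorted(parse_mdw(sample).items()):
        print(f"{address:#010x} = {value:#010x}")
```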

ETM Instruction Trace

CoreSight ETM Architecture: The Embedded Trace Macrocell (ETM) captures every instruction executed (PC values + branch outcomes) at full CPU speed using a compressed protocol. The trace is output via the CoreSight trace bus to an Embedded Trace Buffer (ETB, on-chip SRAM) or off-chip via a TPIU (Trace Port Interface Unit). An ETB can store ~4–32 KB of compressed trace, decoded offline to reconstruct the full execution path.
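
To get a feel for how much history an on-chip buffer holds, the estimate below converts a buffer size into an approximate instruction count and capture window. The compression ratio, clock and IPC are assumptions chosen for illustration. Real ratios depend heavily on branch density and on whether timestamps or cycle counts are enabled.

```python
def etb_capacity_instructions(buffer_bytes: int, bits_per_instruction: float) -> int:
    """Approximate number of instructions a trace buffer can hold."""
    if bits_per_instruction <= 0:
        raise ValueError("bits_per_instruction must be positive")
    return int(buffer_bytes * 8 / bits_per_instruction)


def capture_window_us(instructions: int, clock_hz: float, ipc: float) -> float:
    """Approximate wall-clock time covered by that many instructions, in microseconds."""
    return instructions / (clock_hz * ipc) * 1e6


if __name__ == "__main__":
    # Assumed figures: 32 KB buffer, ~1 bit per instruction, 2 GHz core, IPC of 1.
    count = etb_capacity_instructions(32 * 1024, 1.0)
    print(count, "instructions")
    print(round(capture_window_us(count, 2e9, 1.0), 3), "microseconds")
```

Under these assumptions the buffer covers only a short window before a crash, which is why larger sinks that stream to system memory matter for long-running captures.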
// Enable ETMv4 through its memory-mapped CoreSight interface (privileged access required)
// CoreSight component addresses are SoC-specific: take them from the SoC's ROM table,
// TRM, or device tree. ETM_BASE below is a placeholder, not a Cortex-A constant.
// Some implementations also need the software lock cleared first
// (write 0xC5ACCE55 to TRCLAR at offset 0xFB0).

// In assembly (EL2 or debug monitor):
.equ ETM_BASE,       0xE0041000  // ETM for CPU0 (placeholder, SoC-specific)
.equ ETM_TRCPRGCTLR, 0x004       // Programming Control Register
.equ ETM_TRCSTATR,   0x00C       // Status Register (bit 0 = IDLE)
.equ ETM_TRCCONFIGR, 0x010       // Configuration Register
.equ ETM_TRCVICTLR,  0x080       // ViewInst Control Register (which instructions to trace)
.equ ETM_TRCOSLAR,   0x300       // OS Lock Access Register

// Step 1: Clear the OS Lock (ETM programming is blocked while it is set)
ldr  x0, =ETM_BASE              // Literal load: this address does not fit a single MOV immediate
str  wzr, [x0, #ETM_TRCOSLAR]   // Write 0 to unlock

// Step 2: Disable ETM for programming, then wait until it reports idle
ldr  w1, [x0, #ETM_TRCPRGCTLR]
bic  w1, w1, #1                 // Clear EN bit
str  w1, [x0, #ETM_TRCPRGCTLR]
dsb  sy
isb
1:
ldr  w1, [x0, #ETM_TRCSTATR]
tbz  w1, #0, 1b                 // Spin until TRCSTATR.IDLE = 1

// Step 3: Configure instruction trace with timestamps on
mov  w2, #(1 << 11)             // TRCCONFIGR.TS (bit 11): timestamps enabled
str  w2, [x0, #ETM_TRCCONFIGR]

// Step 4: ViewInst, trace instructions at every exception level (no filter)
mov  w3, #0x201                 // TRCVICTLR: EVENT = resource 1 (always true), SSSTATUS = 1;
                                // EXLEVEL bits left clear, so no exception level is excluded
str  w3, [x0, #ETM_TRCVICTLR]

// Step 5: Re-enable ETM
ldr  w1, [x0, #ETM_TRCPRGCTLR]
orr  w1, w1, #1
str  w1, [x0, #ETM_TRCPRGCTLR]
isb
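
# Capture and decode ETM trace on Linux using the CoreSight subsystem
# (kernel must be built with CONFIG_CORESIGHT=y, CONFIG_CORESIGHT_SOURCE_ETM4X=y)
# Device names under /sys/bus/coresight/devices/ are platform-specific;
# etm0 and tmc_etr0 below are examples only.

# Enable a trace sink first, then the ETM source on CPU0:
echo 1 > /sys/bus/coresight/devices/tmc_etr0/enable_sink
echo 1 > /sys/bus/coresight/devices/etm0/enable_source

# Run workload:
./my_workload &
PID=$!
sleep 1
kill -STOP $PID
echo 0 > /sys/bus/coresight/devices/etm0/enable_source

# Read the raw trace from the sink's device node:
dd if=/dev/tmc_etr0 of=trace.raw

# Alternative: let perf drive the ETM (the name after '@' selects the sink, not the ETM;
# decoding requires a perf binary built with OpenCSD support):
perf record -e cs_etm/@tmc_etr0/ --per-thread ./my_workload
perf report --stdio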

ITM Stimulus Ports

@ ITM (Instrumentation Trace Macrocell): Cortex-M debug printf over SWO
@ The ITM lives in the Cortex-M private peripheral bus, so this code is Thumb-2, not AArch64.
@ ITM_STIM0 at 0xE0000000: write a byte/word and it appears in the trace stream.
@ Writes are cheap: the core only waits when the stimulus FIFO is full.
@ The debugger (or firmware) must also configure the TPIU/SWO baud rate; that is not shown here.

.syntax unified
.thumb

.equ ITM_STIM0,    0xE0000000   @ Stimulus port 0
.equ ITM_TER,      0xE0000E00   @ Trace Enable Register (enable port 0)
.equ ITM_TCR,      0xE0000E80   @ Trace Control Register
.equ ITM_LAR,      0xE0000FB0   @ Lock Access Register
.equ DEMCR,        0xE000EDFC   @ Debug Exception and Monitor Control Register

@ itm_init: enable trace, unlock the ITM, enable stimulus port 0
.thumb_func
itm_init:
    ldr  r0, =DEMCR
    ldr  r1, [r0]
    orr  r1, r1, #(1 << 24)     @ TRCENA: power up DWT/ITM
    str  r1, [r0]

    ldr  r0, =ITM_LAR
    ldr  r1, =0xC5ACCE55        @ CoreSight unlock key
    str  r1, [r0]               @ Unlock ITM

    ldr  r0, =ITM_TCR
    ldr  r1, =0x00010005        @ ITMENA=1, SYNCENA=1, TraceBusID=1
    str  r1, [r0]

    ldr  r0, =ITM_TER
    movs r1, #1                 @ Enable port 0 only
    str  r1, [r0]
    bx   lr

@ itm_putc(char c): r0 = character; busy-waits until the FIFO has space, then writes
.thumb_func
itm_putc:
    ldr  r1, =ITM_STIM0
1:
    ldr  r2, [r1]               @ Reading a stimulus port returns FIFOREADY in bit 0
    tst  r2, #1
    beq  1b                     @ Wait until FIFO has space
    strb r0, [r1]               @ Write 1 byte to port 0
    bx   lr
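
On the host side, the SWO byte stream has to be split back into per-port data. Each software stimulus write becomes an instrumentation packet: a header byte carrying the port number in bits [7:3], a 0 in bit 2, and a size code in bits [1:0] (01 = 1 byte, 10 = 2 bytes, 11 = 4 bytes), followed by the payload. The sketch below decodes only those packets. It is a simplification: it drops hardware-source packets and does not parse synchronization or timestamp packets, so it only suits streams captured with timestamps disabled. The function name is hypothetical.

```python
from collections import defaultdict

# Size code in header bits [1:0] -> number of payload bytes
PAYLOAD_SIZE = {0b01: 1, 0b10: 2, 0b11: 4}


def decode_itm_stimulus(stream: bytes) -> dict:
    """Collect the payload bytes written to each ITM stimulus port."""
    ports = defaultdict(bytearray)
    i = 0
    while i < len(stream):
        header = stream[i]
        i += 1
        size = PAYLOAD_SIZE.get(header & 0b11)
        if size is None:
            # Size code 00 marks a protocol packet (sync, overflow, timestamp).
            # Simplification: skip this one byte and carry on.
            continue
        payload = stream[i:i + size]
        i += size
        if len(payload) < size:
            break  # truncated capture
        if header & 0b100:
            continue  # hardware-source (DWT) packet, ignored here
        ports[header >> 3] += payload
    return {port: bytes(data) for port, data in ports.items()}


if __name__ == "__main__":
    # Two single-byte writes to port 0, as itm_putc would produce for 'H' and 'i'.
    print(decode_itm_stimulus(bytes([0x01, 0x48, 0x01, 0x69])))
```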

QEMU as Debug Target

# ── Debug AArch64 Linux kernel with QEMU + GDB ──
# Run QEMU with GDB stub enabled and halted at boot:
qemu-system-aarch64 \
    -machine virt -cpu cortex-a72 -m 2G \
    -kernel Image \
    -initrd initramfs.cpio.gz \
    -append "nokaslr console=ttyAMA0 debug" \
    -serial stdio -display none \
    -s -S

# Note: nokaslr disables KASLR (kernel address space layout randomisation) so that runtime addresses match the symbols in vmlinux; without it, breakpoints land at the wrong addresses

# Attach GDB with kernel vmlinux symbols:
aarch64-linux-gnu-gdb vmlinux
(gdb) target remote :1234
(gdb) add-symbol-file drivers/net/virtio_net.ko 0xffff000001234000  # Load module symbols
(gdb) b net_rx_action     # Kernel breakpoint
(gdb) continue

# Trace individual syscalls with a breakpoint on the arm64 syscall entry point.
# (GDB's "catch syscall" targets user-space process debugging and is not offered by QEMU's system-mode stub.)
(gdb) b __arm64_sys_read   # Break on every read() syscall
(gdb) commands             # Run commands on each hit
    silent
    printf "read at pc=%#lx\n", $pc
    continue
end

# ── QEMU register inspection while halted ──
# In GDB: QEMU monitor commands are reachable through the monitor passthrough
(gdb) monitor info registers   # All registers including PSTATE, ELR, SPSR

# In the QEMU monitor (Ctrl+A C toggles between serial and monitor when QEMU is started with -serial mon:stdio or -nographic):
(qemu) x /10i 0xffff800010000000   # Disassemble kernel .text
(qemu) xp /4x 0x40000000           # Physical memory dump
(qemu) info mtree                   # Show full ARM memory map
(qemu) info cpus                    # All vCPUs and their states

Kernel Debugging Workflows

# ── 1. KASAN (Kernel Address Sanitizer) for ARM64 memory bugs ──
# Build kernel with:
# CONFIG_KASAN=y, CONFIG_KASAN_GENERIC=y
# Detected heap out-of-bounds and use-after-free accesses are reported as:
# BUG: KASAN: slab-out-of-bounds in function+offset/size

# ── 2. KGDB: built-in kernel GDB stub over serial ──
# Build with CONFIG_KGDB=y, CONFIG_KGDB_SERIAL_CONSOLE=y
# Boot: append "kgdboc=ttyAMA0,115200 kgdbwait"
# Connect: aarch64-linux-gnu-gdb vmlinux -ex "set serial baud 115200" -ex "target remote /dev/ttyUSB0"

# ── 3. Ftrace: function tracer controlled through tracefs (needs CONFIG_FUNCTION_TRACER=y, no extra userspace tools) ──
echo function > /sys/kernel/debug/tracing/current_tracer
echo schedule > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/tracing_on
sleep 1
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace | head -50

# ── 4. Crash dump analysis with the crash tool ──
# Configure kernel: CONFIG_KEXEC=y, CONFIG_CRASH_DUMP=y
# kdump via kexec: on panic, a capture kernel boots and saves the dump,
# for example to /var/crash/vmcore (the exact path depends on the distribution):
crash vmlinux /var/crash/vmcore
crash> bt         # Backtrace of panicking task
crash> log        # Kernel log buffer at crash time
crash> ps         # All tasks at crash time

Case Study: Debugging a Real-World ARM Boot Failure

Embedded · Production · Real-World
When the Board Won't Boot: A Cortex-A53 Custom SoC Story

A hardware startup bringing up a custom Cortex-A53 quad-core SoC encountered a devastating production bug: 1 in 50 boards failed to boot, hanging silently after ATF (Arm Trusted Firmware) started. No UART output, no error LED. The debugging journey illustrates every tool in this article:

  • Step 1 — JTAG (OpenOCD + J-Link): Connected to the DAP and halted all four cores. Three cores were parked correctly (WFE loop). Core 0 was stuck in a DMB SY at the very end of BL2 → BL31 handoff. The program counter hadn't advanced past the barrier.
  • Step 2 — GDB inspection: info registers showed SCTLR_EL3.C=1 (caches enabled) but MAIR_EL3 contained garbage. The memory attribute for the ATF code region was configured as Device-nGnRnE instead of Normal Cacheable, so every instruction fetch was non-cacheable and non-speculative. The DMB was waiting for all outstanding stores to complete, but the store buffer was wedged because the cache controller was attempting coherency checks on a non-cacheable region.
  • Step 3 — ETM trace (Lauterbach TRACE32): Instruction trace showed that MAIR_EL3 was programmed with the correct value on the boards that booted. On failing boards, a voltage droop during the PLL lock sequence caused a transient SRAM bit-flip in the ATF's .data section where MAIR constants were stored. The trace proved that the write instruction itself executed as intended; the value it had loaded from memory was already corrupted.
  • Fix: Added a MAIR readback verification loop after initial configuration: write MAIR, read it back, compare, retry up to 3 times. Added CRC check on ATF .data section at boot. Zero failures across 10,000 units after the fix.

Key lesson: Each debugging layer revealed what the previous couldn't. JTAG + GDB found where the CPU was stuck. ETM trace found why — a hardware transient that no amount of printf debugging could have caught.
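
The integrity check from the fix can be modelled on the host in a few lines. This is only an illustration of the idea, using `zlib.crc32` and made-up section contents. The real check would run in firmware against the actual .data image, and the helper names here are hypothetical.

```python
import zlib


def section_crc(data: bytes) -> int:
    """CRC-32 of a section image, as an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF


def verify_section(data: bytes, expected_crc: int) -> bool:
    """True when the section still matches the CRC recorded at build time."""
    return section_crc(data) == expected_crc


if __name__ == "__main__":
    golden = bytes.fromhex("ff440c0400000000") * 4  # made-up stand-in for .data
    expected = section_crc(golden)                   # recorded at build time

    corrupted = bytearray(golden)
    corrupted[5] ^= 0x10                             # simulate a single SRAM bit-flip

    print("clean image passes:", verify_section(golden, expected))
    print("flipped bit passes:", verify_section(bytes(corrupted), expected))
```

A CRC-32 detects any single-bit error, so it covers the transient bit-flip described in the case study.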

History · Evolution
Evolution of ARM Debug: From ICE to CoreSight

ARM debugging technology has evolved through distinct generations:

  • 1990s — EmbeddedICE: ARM7TDMI included the first on-chip debug unit: two watchpoint units, each usable as either a hardware breakpoint or a data watchpoint, controlled via a JTAG scan chain. Debuggers reached the ICE through vendor-specific probes and protocols (ARM Multi-ICE, Lauterbach TRACE32).
  • 2004 — CoreSight v1: ARM introduced a standardized debug interconnect — the Debug Access Port (DAP) and standard memory-mapped debug components. For the first time, different debug probes could connect to any CoreSight-compliant chip using the same protocol.
  • 2012 — ETMv4: Full instruction-level tracing at GHz speeds. Each instruction compresses to roughly 1 bit on average (branch outcomes + exception transitions), so a 32KB ETB (262,144 bits) holds on the order of a few hundred thousand instructions.
  • 2017 — CoreSight SoC-600: Built on the existing cross-trigger (CTI) and trace bus (AMBA ATB) infrastructure, adding debug access over functional interfaces such as USB and PCIe and higher-bandwidth trace streaming to system memory.
  • 2020+ — ARM Statistical Profiling Extension (SPE): Hardware-sampled profiling written directly to memory. Operations are sampled at a configurable interval with full context (PC, latency, data address, cache/TLB events), similar to perf sampling but without an interrupt per sample.

Hands-On Exercises

Exercise 1 · Beginner
GDB Command Mastery with QEMU

Using the Part 20 bare-metal kernel in QEMU with GDB:

  1. Set a breakpoint on uart_putc. When it hits, print the character being sent: p/c (char)$x0
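  2. Set a watchpoint on the UART data register: watch *(uint32_t*)0x09000000. Continue and check whether GDB breaks each time a character is written. GDB's watch only reports a stop when the value it reads back has changed, so for an MMIO register you may need awatch (stop on any access) instead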
  3. Use display/i $pc to automatically show the next instruction at every breakpoint hit
  4. Use record full (if supported by QEMU-GDB) to enable reverse debugging, then reverse-stepi to step backwards through execution

Deliverable: A GDB command script (.gdbinit) that automates connecting to QEMU, loading symbols, and setting up your standard breakpoints.

Exercise 2 · Intermediate
Ftrace Kernel Function Profiling

On an ARM64 Linux system (Raspberry Pi, cloud instance, or QEMU with full Linux):

  1. Enable function tracing: echo function > /sys/kernel/debug/tracing/current_tracer
  2. Filter to scheduler functions only: echo 'schedule*' > /sys/kernel/debug/tracing/set_ftrace_filter
  3. Run a workload (stress-ng --cpu 4 --timeout 5), then read the trace
  4. Switch to function_graph tracer and repeat — observe the call tree with timing for each function
  5. Use trace-cmd record -p function_graph -g schedule for a cleaner workflow

Analysis: Identify the top 5 most-called functions in the scheduler path. Correlate with what you know about context switching from Part 20.
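
For the analysis step, a short script can tally the functions in a saved trace. The sketch assumes the default function-tracer line layout, in which each event ends with `function <-caller`. The sample lines are illustrative rather than captured output, and `count_functions` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the tail of a function-tracer line: "<timestamp>: func <-parent"
TRACE_LINE = re.compile(r":\s+(?P<func>[\w.]+)\s+<-(?P<parent>[\w.]+)\s*$")


def count_functions(trace_text: str) -> Counter:
    """Count how often each traced function appears."""
    counts = Counter()
    for line in trace_text.splitlines():
        if line.startswith("#"):
            continue  # header lines written by ftrace
        match = TRACE_LINE.search(line)
        if match:
            counts[match.group("func")] += 1
    return counts


if __name__ == "__main__":
    sample = """# tracer: function
          <idle>-0       [001] d..2.   100.000001: schedule_idle <-do_idle
       stress-ng-812     [002] ....1   100.000105: schedule <-do_nanosleep
       stress-ng-813     [003] ....1   100.000230: schedule <-futex_wait_queue
"""
    for name, hits in count_functions(sample).most_common(5):
        print(f"{hits:6d}  {name}")
```

Run it against the text saved from /sys/kernel/debug/tracing/trace to get the top five directly.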

Exercise 3 · Advanced
Custom GDB Python Script for ARM64

Write a GDB Python extension that pretty-prints ARM64 system register state:

  1. Create arm64_debug.py with a GDB command class that inherits from gdb.Command
  2. When invoked, read key registers: gdb.parse_and_eval("$pc"), $sp, $cpsr
  3. Decode PSTATE fields from CPSR: N, Z, C, V flags, current EL (bits [3:2]), SP selection (bit 0), and exception masking (DAIF, bits [9:6])
  4. Pretty-print: EL1h | NZCV=0b1001 | DAIF=0b1111 (all masked) | SP=0xFFFF...1000
  5. Bonus: Add an arm64-bt command that walks the frame pointer chain (X29) manually, printing each saved LR (X30). This gives a manual stack trace that works even when GDB's built-in bt fails

Load: (gdb) source arm64_debug.py then use (gdb) arm64-state
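
As a starting point, the bit-field decoding can be developed and tested outside GDB before it is wrapped in a gdb.Command subclass. The sketch below covers only that decode step, using the bit positions listed in the exercise. The function names and the sample register values are illustrative.

```python
def decode_pstate(cpsr: int) -> dict:
    """Split a CPSR/PSTATE value into the fields named in the exercise."""
    return {
        "nzcv": (cpsr >> 28) & 0xF,   # condition flags, bits [31:28]
        "daif": (cpsr >> 6) & 0xF,    # exception masks, bits [9:6]
        "el": (cpsr >> 2) & 0x3,      # current exception level, bits [3:2]
        "sp_sel": cpsr & 0x1,         # 0 = SP_EL0 (t), 1 = SP_ELx (h)
    }


def format_pstate(cpsr: int, sp: int) -> str:
    """Render the one-line summary shown in step 4."""
    fields = decode_pstate(cpsr)
    mode = f"EL{fields['el']}{'h' if fields['sp_sel'] else 't'}"
    return (
        f"{mode} | NZCV=0b{fields['nzcv']:04b} | "
        f"DAIF=0b{fields['daif']:04b} | SP={sp:#x}"
    )


if __name__ == "__main__":
    # Illustrative values; inside GDB they would come from
    # int(gdb.parse_and_eval("$cpsr")) and int(gdb.parse_and_eval("$sp")).
    print(format_pstate(0x900003C5, 0xFFFF000000001000))
```

Keeping the decoder free of gdb imports means the same function can be unit-tested on the host and then called from the command's invoke method.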

Conclusion & Next Steps

The ARM debugging stack is deep but consistent: every layer from JTAG hardware to kernel tracing exposes the same fundamental abstractions — halt, single-step, memory read/write, and execution trace. Mastering these tools transforms opaque crashes into navigable execution histories. The boot failure case study shows how layered debugging tools complement each other, and the exercises build practical muscle memory with GDB, Ftrace, and custom debug scripting that you'll use throughout your ARM development career.

Next in the Series

In Part 24: Linkers, Loaders & Binary Format Internals, we descend into ELF section anatomy, RELA relocations, PLT/GOT mechanics, position-independent code, and the crt0 startup sequence that connects the OS loader to your main() function.
