Series Context: This is Part 9 of 20 in the CMSIS Mastery Series. With our foundational layers in place — Core, RTOS2, DSP, drivers, and Pack — we now turn to debugging: the skill that separates professional embedded developers from those who spend days on bugs that should take minutes. This article covers the hardware debug infrastructure that every Cortex-M device ships with.
CoreSight Debug Architecture
ARM CoreSight is the debug and trace infrastructure built into every ARM Cortex-M processor. It is not a single block — it is a hierarchy of components connected via the Advanced Peripheral Bus (APB) and the Advanced High-performance Bus (AHB), all accessible from outside the chip through a Debug Access Port. Understanding this architecture is essential because every debugger interaction — setting a breakpoint, reading a variable, decoding a HardFault — works through this system.
Debug Access Port (DAP)
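The DAP is the top-level interface between the external debug probe and the on-chip debug infrastructure. It contains a Debug Port (SW-DP or JTAG-DP) facing the probe and one or more Access Ports facing the chip. On a typical Cortex-M3/M4 microcontroller a single AHB-AP gives the debugger memory-mapped access to the whole processor address space: code, RAM, peripheral registers, and the CoreSight component registers themselves (ITM, DWT, FPB, plus ETM and TPIU where fitted), which sit on the Private Peripheral Bus at 0xE0000000 and above. An APB-AP, which reaches debug components over a dedicated debug APB, is more typical of larger multi-core SoCs. The probe communicates with the DAP via either SWD or JTAG, and the DAP routes each request through the appropriate AP to read or write memory, control execution, or configure trace.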
ITM, ETM, TPIU, DWT, FPB
The CoreSight ecosystem includes several specialised components that work together to provide debug and trace capability without requiring the processor to halt:
ITM
Instrumentation Trace Macrocell
Software-controlled trace. Firmware writes to ITM->PORT[n] stimulus registers and the data is serialised out of the TPIU as trace packets. Because the core only performs a register write, the timing impact is far smaller than with UART printf. Supports up to 32 independent ports; port 0 is conventionally used for text output, and this series reserves port 1 for RTOS event logging.
DWT
Data Watchpoint & Trace Unit
Provides hardware watchpoints (break on data access at an address), cycle counting via DWT->CYCCNT, and PC sampling. The cycle counter is invaluable for microsecond-precision profiling without modifying code logic — just read CYCCNT before and after the code under test.
FPB
Flash Patch & Breakpoint Unit
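Provides hardware breakpoints. On M3/M4 you get 6 instruction comparators and 2 literal comparators; the M7 implements up to 8 instruction comparators and no literal comparators. Hardware breakpoints work in flash (unlike software breakpoints, which require writable memory). On M3/M4 the FPB also supports flash patching: remapping selected flash addresses to RAM-resident patches without reflashing. The FP_CTRL register reports how many comparators a given device actually implements.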
ETM / TPIU
Embedded Trace Macrocell & TPIU
ETM records the instruction execution stream in compressed form: branch and exception information from which the host reconstructs every instruction executed. The TPIU (Trace Port Interface Unit) drives trace data off-chip, either on the single-pin SWO output or on a parallel trace port of up to 4 data bits plus a clock. In practice SWO carries ITM and DWT packets only, because it lacks the bandwidth for ETM. Most CMSIS-DAP probes support only SWO; parallel ETM trace requires a trace-capable probe such as ULINKpro or J-Trace.
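To make the component list above concrete, the sketch below maps an address to the debug block that owns it. The base addresses are the ones the ARMv7-M architecture fixes on the Private Peripheral Bus (Cortex-M3/M4/M7); the `ppb_block_t` type and the `ppb_component_for` helper are hypothetical names invented for this illustration, and the code runs on a host PC rather than on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* One 4 KB register block on the Private Peripheral Bus. */
typedef struct {
    const char *name;
    uint32_t    base;
} ppb_block_t;

/* Base addresses fixed by the ARMv7-M architecture (Cortex-M3/M4/M7). */
static const ppb_block_t ppb_blocks[] = {
    { "ITM",  0xE0000000u },
    { "DWT",  0xE0001000u },
    { "FPB",  0xE0002000u },
    { "SCS",  0xE000E000u },  /* SysTick, NVIC, SCB, CoreDebug */
    { "TPIU", 0xE0040000u },
    { "ETM",  0xE0041000u },
};

/* Hypothetical helper: name of the block that owns addr, or NULL. */
const char *ppb_component_for(uint32_t addr)
{
    for (size_t i = 0; i < sizeof ppb_blocks / sizeof ppb_blocks[0]; i++) {
        /* Unsigned subtraction wraps, so addresses below base never match. */
        if ((uint32_t)(addr - ppb_blocks[i].base) < 0x1000u) {
            return ppb_blocks[i].name;
        }
    }
    return NULL;
}
```

For example, `ppb_component_for(0xE0001004u)` (the address of DWT->CYCCNT) names the DWT block, while an SRAM address such as 0x20000000 returns NULL.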
SWD vs JTAG Protocols
Both SWD and JTAG are physical layer protocols for communicating with the DAP. Understanding their differences matters when you are choosing a probe, designing a PCB debug header, or diagnosing connectivity problems.
SWD Bit-Banging & Timing
SWD uses only two signals: SWDIO (bidirectional data) and SWCLK (clock). The host drives SWCLK, while SWDIO is driven by the host during request packets and by the target during acknowledgement and data phases. A line turnaround period separates direction changes. Here is an illustrative bit-banging sequence (simplified for readability — production probes use hardware shift registers):
/**
* SWD line reset + JTAG-to-SWD switch sequence (simplified illustration).
* In practice this is done in hardware by the CMSIS-DAP probe firmware.
*
* Physical signals: SWDIO (GPIO output/input) and SWCLK (GPIO output)
*/
/* Step 1: Drive SWDIO high, send 50+ clock pulses (line reset) */
SWDIO_SET(1U); /* SWDIO must stay HIGH for the whole reset */
for (int i = 0; i < 50; i++) {
SWCLK_LOW(); __NOP(); __NOP();
SWCLK_HIGH(); __NOP(); __NOP();
}
/* Step 2: Send the 16-bit JTAG-to-SWD select sequence.
 * It is 0xE79E when shifted LSB first (the same bits read 0x79E7 MSB first). */
uint16_t magic = 0xE79EU;
for (int i = 0; i < 16; i++) {
SWDIO_SET((magic >> i) & 1U);
SWCLK_LOW(); __NOP(); __NOP();
SWCLK_HIGH(); __NOP(); __NOP();
}
/* Step 3: Line reset again (50 clocks, SWDIO = HIGH) */
/* Step 4: 2 idle clocks (SWDIO = LOW) */
/* Step 5: DAP is now in SWD mode — send IDCODE read request */
/**
* SWD packet format (8-bit request):
* bit[0] = start (always 1)
* bit[1] = APnDP (0=DP, 1=AP)
* bit[2] = RnW (0=write, 1=read)
* bit[3:4] = A[2:3] (register address bits 2–3)
* bit[5] = parity (even parity over bits 1–4: set to 1 when an odd number of them are 1)
* bit[6] = stop (always 0)
* bit[7] = park (always 1, line pulled high)
*
* After request: 1 turnaround + 3-bit ACK from target (OK=001, WAIT=010, FAULT=100)
* After ACK: 32-bit data + 1 parity bit (read), or turnaround + 32-bit data + parity (write)
*/
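The request format in the comment above can be checked with a few lines of host-side C. This is a minimal sketch: `swd_request` is a hypothetical helper name, and the returned byte is meant to be shifted out least significant bit first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build the 8-bit SWD request header for a DP or AP register access.
 * addr is the register byte address (0x0, 0x4, 0x8 or 0xC); only
 * address bits 2 and 3 travel on the wire. */
uint8_t swd_request(bool ap, bool read, uint8_t addr)
{
    uint8_t apndp  = ap   ? 1u : 0u;
    uint8_t rnw    = read ? 1u : 0u;
    uint8_t a2     = (uint8_t)((addr >> 2) & 1u);
    uint8_t a3     = (uint8_t)((addr >> 3) & 1u);
    /* Parity is 1 when an odd number of the four fields are 1. */
    uint8_t parity = (uint8_t)((apndp ^ rnw ^ a2 ^ a3) & 1u);

    return (uint8_t)(0x01u                      /* start, bit 0      */
                   | (uint8_t)(apndp  << 1)
                   | (uint8_t)(rnw    << 2)
                   | (uint8_t)(a2     << 3)
                   | (uint8_t)(a3     << 4)
                   | (uint8_t)(parity << 5)
                   | 0x80u);                    /* stop = 0, park = 1 */
}
```

Reading DP register 0x0 (IDCODE/DPIDR) gives `swd_request(false, true, 0x0) == 0xA5`, the byte that shows up as the first request in most SWD captures.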
Protocol Comparison
| Feature | SWD | JTAG |
|---|---|---|
| Signals required | 2 (SWDIO, SWCLK) | 4+ (TDI, TDO, TMS, TCK, optional nTRST) |
| PCB pin count | Minimal: 10-pin Cortex Debug header, or a smaller vendor-specific SWD header | 20-pin standard ARM JTAG header; larger footprint |
| Typical max speed | 10 MHz (probe-dependent; J-Link up to 50 MHz) | 10–25 MHz typical; deterministic for long chains |
| Multi-device daisy chain | Point-to-point in the common case (SWD v2 multi-drop exists but is rarely used) | Yes: JTAG chains support multiple devices/TAPs |
| SWO trace support | Yes: SWO (single-pin serial) via a third signal | No: SWO shares the TDO pin, so trace needs the parallel trace port |
| Cortex-M support | All Cortex-M variants | Commonly M3, M4, M7, M33; M0/M0+/M23 silicon is almost always SWD-only |
| Preferred for | Single-chip embedded, space-constrained PCBs | Complex SoCs, FPGAs, multi-chip debug chains |
Practical Advice: Default to SWD for new Cortex-M designs. It requires only two pins, is supported by every modern probe, and is the only option on most M0/M0+ devices, which are almost always built without a JTAG port. Add the SWO pin (a third signal) if you want ITM/SWV trace capability; M0/M0+ parts have no ITM, so SWO does not apply there.
CMSIS-DAP Probes
CMSIS-DAP is a firmware standard that defines the USB protocol between a debug probe and the host PC: version 1 uses USB HID reports, which need no driver installation, and version 2 adds USB bulk endpoints for higher throughput. Any microcontroller running CMSIS-DAP firmware is recognised by CMSIS-DAP compatible debuggers (OpenOCD, pyOCD, Keil MDK, IAR Embedded Workbench). This means you can build your own debug probe from an inexpensive MCU board running DAPLink firmware.
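As a rough illustration of what travels over that USB link, the sketch below builds two CMSIS-DAP request packets: DAP_Connect (command ID 0x02) and DAP_SWJ_Clock (command ID 0x11, followed by the clock in Hz as a little-endian 32-bit value). The function and enum names are invented for this example; real host tools also handle the USB transfer and pad HID reports to the probe's packet size.

```c
#include <stddef.h>
#include <stdint.h>

/* Command IDs and port value from the CMSIS-DAP specification. */
enum {
    DAP_CMD_CONNECT   = 0x02,
    DAP_CMD_SWJ_CLOCK = 0x11,
    DAP_PORT_SWD      = 1
};

/* Hypothetical helper: request the probe to connect using the given port.
 * buf must hold at least 2 bytes. Returns the request length. */
size_t dap_build_connect(uint8_t *buf, uint8_t port)
{
    buf[0] = DAP_CMD_CONNECT;
    buf[1] = port;
    return 2;
}

/* Hypothetical helper: set the SWCLK/TCK frequency in Hz.
 * buf must hold at least 5 bytes. Returns the request length. */
size_t dap_build_swj_clock(uint8_t *buf, uint32_t hz)
{
    buf[0] = DAP_CMD_SWJ_CLOCK;
    buf[1] = (uint8_t)(hz & 0xFFu);          /* little-endian payload */
    buf[2] = (uint8_t)((hz >> 8) & 0xFFu);
    buf[3] = (uint8_t)((hz >> 16) & 0xFFu);
    buf[4] = (uint8_t)((hz >> 24) & 0xFFu);
    return 5;
}
```

The probe answers each request with a packet that starts with the same command ID, which is how the host pairs responses with requests.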
DAPLink Open-Source Probe
DAPLink is the reference open-source implementation of the CMSIS-DAP firmware, maintained by Arm. It runs on interface MCUs such as the LPC11U35, LPC4322, or nRF52840 and provides: a USB debug interface (CMSIS-DAP v1 over HID and v2 over bulk endpoints), USB mass-storage drag-and-drop flashing (drop a .hex/.bin onto the virtual drive), and a USB CDC virtual COM port (connects to the target's UART). It is the interface firmware on many Mbed-enabled boards, for example NXP FRDM boards and the BBC micro:bit. ST Nucleo boards ship with ST-LINK firmware instead, and the Raspberry Pi Debug Probe runs its own CMSIS-DAP compatible firmware rather than DAPLink.
| Probe | Protocol | SWO Trace | ETM Trace | Approx. Cost | Notes |
|---|---|---|---|---|---|
| DAPLink / mbed HDK | CMSIS-DAP v1/v2, SWD, JTAG | Yes (SWO), board-dependent | No | Free (built into many dev boards, e.g. NXP FRDM, micro:bit) | Open-source; drag-and-drop flash |
| Raspberry Pi Debug Probe | CMSIS-DAP v2, SWD | No (SWD and UART connectors only) | No | ~$12 USD | RP2040-based; excellent OpenOCD support; UART passthrough |
| J-Link BASE / EDU | JTAG, SWD, Segger RTT | Yes (SWO) | No (requires J-Trace) | $20 EDU / $500+ BASE | Gold standard; J-Link RTT for low-overhead tracing; vendor SDK support |
| ST-LINK v3 | JTAG, SWD, virtual COM | Yes (SWV) | No | ~$15 (STLINK-V3MODS) | STM32-focused; excellent CubeIDE integration; CMSIS-DAP via third-party firmware |
| ULINKpro (Keil) | JTAG, SWD | Yes (SWV) | Yes (parallel 4-bit) | ~$500 USD | Keil MDK native; supports full ETM instruction trace |
| Black Magic Probe | GDB server (native SWD/JTAG) | Yes (SWO) | No | ~$70 USD | No separate GDB server needed: the probe is the GDB server via USB CDC; open firmware |
Debug Techniques
Understanding the physical debug infrastructure lets you use it intentionally rather than accidentally. Two of the most powerful on-chip debug resources are hardware breakpoints (via FPB) and data watchpoints (via DWT). Both operate without modifying the code under test — they are purely hardware mechanisms.
Hardware vs Software Breakpoints
Software breakpoints replace an instruction with a BKPT opcode. The processor halts on execution, and the debugger restores the original instruction before resuming. This requires writable memory, so software breakpoints work directly in RAM; in flash, the debugger must erase and reprogram the affected page, which is slow and wears the flash. They are effectively unlimited in number, but each one costs a memory patch when it is set and again when it is removed.
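The patch-and-restore cycle can be sketched on a host PC by treating a small array as the target's RAM. The Thumb encoding of BKPT #imm8 is 0xBE00 ORed with the immediate; the `sw_breakpoint_t` type and the two helper functions are hypothetical names for this illustration, not a debugger API.

```c
#include <stdint.h>

/* Thumb encoding of BKPT #imm8. */
#define THUMB_BKPT(imm8) ((uint16_t)(0xBE00u | ((imm8) & 0xFFu)))

/* Bookkeeping a debugger needs for one software breakpoint. */
typedef struct {
    uint16_t *where;  /* patched halfword              */
    uint16_t  saved;  /* original instruction halfword */
} sw_breakpoint_t;

/* Save the original halfword, then overwrite it with BKPT #0. */
void sw_bp_insert(sw_breakpoint_t *bp, uint16_t *insn)
{
    bp->where = insn;
    bp->saved = *insn;
    *insn = THUMB_BKPT(0u);
}

/* Put the original instruction back. */
void sw_bp_remove(const sw_breakpoint_t *bp)
{
    *bp->where = bp->saved;
}
```

A real debugger performs the same two writes through the DAP, and for a 32-bit Thumb-2 instruction it still patches only the first halfword.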
Hardware breakpoints use the FPB comparators to halt execution when the PC reaches a specific address, without any code modification. The M3 and M4 provide 6 instruction comparators; the M7 implements up to 8. If you set more hardware breakpoints than the device has comparators, the debugger must either refuse or transparently fall back to software breakpoints for the excess. The FPB matches on address only: conditional breakpoints are evaluated by the debugger after each halt, so a condition that is checked often slows execution noticeably.
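Rather than relying on a per-core figure, a debugger (or firmware) can read the comparator counts from FP_CTRL. The sketch below decodes a raw FP_CTRL value using the ARMv7-M field layout: NUM_CODE is split across bits [7:4] and [14:12], NUM_LIT is bits [11:8], and REV is bits [31:28]. The `fpb_info_t` type and function name are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned num_code;  /* instruction address comparators */
    unsigned num_lit;   /* literal address comparators     */
    unsigned rev;       /* FPB architecture revision field */
} fpb_info_t;

/* Decode a raw FP_CTRL value (ARMv7-M field layout). */
fpb_info_t fpb_decode_ctrl(uint32_t fp_ctrl)
{
    fpb_info_t info;
    info.num_code = (unsigned)(((fp_ctrl >> 4) & 0xFu)
                             | (((fp_ctrl >> 12) & 0x7u) << 4));
    info.num_lit  = (unsigned)((fp_ctrl >> 8) & 0xFu);
    info.rev      = (unsigned)(fp_ctrl >> 28);
    return info;
}
```

A value of 0x00000260, for instance, decodes to 6 instruction comparators and 2 literal comparators. On the target, the raw value would come from the FPB block at 0xE0002000.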
Data Watchpoints (DWT)
Watchpoints halt (or trace) on data access at a specific address: reads, writes, or both. This is invaluable for tracking down memory corruption: set a watchpoint on a variable whose value is mysteriously changing, and the processor will halt the moment any code writes to that address, regardless of which thread or interrupt caused it. The M3, M4 and M7 implement up to 4 DWT comparators. The count is a silicon vendor option on every core, so read the NUMCOMP field of DWT->CTRL to see what your device actually provides.
/**
 * Programmatic DWT watchpoint configuration via CMSIS registers.
 * Applies to ARMv7-M cores (Cortex-M3/M4/M7); ARMv8-M parts have no MASK
 * registers and use different FUNCTION encodings.
 * Triggers on any write to the word that holds 'g_shared_counter'.
 *
 * Prerequisites: CoreDebug->DEMCR must have TRCENA set. The match only stops
 * the core if a debugger is attached (halting debug) or the DebugMonitor
 * exception is enabled through DEMCR.MON_EN.
 */
extern volatile uint32_t g_shared_counter; /* Variable we suspect is corrupted */
void dwt_set_watchpoint(uint32_t address, uint32_t mask_bits, uint32_t function) {
    /* NUMCOMP reads as 0 when no comparators are implemented */
    if ((DWT->CTRL & DWT_CTRL_NUMCOMP_Msk) == 0U) {
        return; /* No DWT comparators on this device */
    }
    /* Comparator 0 */
    DWT->FUNCTION0 = 0U;        /* Disable while reprogramming */
    DWT->COMP0 = address;       /* Address to watch */
    DWT->MASK0 = mask_bits;     /* Number of low address bits to ignore */
    DWT->FUNCTION0 = function;  /* 0x5 = read; 0x6 = write; 0x7 = read or write */
    /* From here on, a matching access raises a halt or DebugMon event */
}
void setup_watchpoint_on_counter(void) {
    /* TRCENA must be set before any DWT register is written */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0U;                     /* Reset cycle counter while we are here */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;  /* Enable cycle counter */
    /* Set watchpoint on g_shared_counter: break on any write */
    dwt_set_watchpoint((uint32_t)&g_shared_counter,
                       2U,    /* Ignore address bits [1:0]: covers the aligned 4-byte word */
                       0x6U); /* Function: data write */
}
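On ARMv7-M the DWT MASK register holds the number of low-order address bits that the comparator ignores, so the watched region is always a naturally aligned power of two in size. A small host-side sketch of that calculation follows; `dwt_mask_for` is a hypothetical helper name, and the largest mask a device accepts is implementation-defined, which this sketch does not check.

```c
#include <stdint.h>

/* Return the DWT MASK value that covers [address, address + size_bytes),
 * or -1 if the region is not a naturally aligned power of two. */
int dwt_mask_for(uint32_t address, uint32_t size_bytes)
{
    if (size_bytes == 0u || (size_bytes & (size_bytes - 1u)) != 0u) {
        return -1;                        /* size is not a power of two */
    }
    if ((address & (size_bytes - 1u)) != 0u) {
        return -1;                        /* region is not aligned to its size */
    }
    int mask = 0;
    while (((uint32_t)1u << mask) < size_bytes) {
        mask++;                           /* mask = log2(size_bytes) */
    }
    return mask;
}
```

A 4-byte variable at an aligned address needs a mask of 2, while a mask of 0 matches a single byte address only.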
HardFault Debugging
HardFaults are the Cortex-M's last line of defence: an escalation mechanism that fires when a more specific fault (MemManage, BusFault, UsageFault) cannot be handled or is disabled. For most developers, encountering a HardFault means the debugger shows execution spinning in the default handler with no obvious clue about what went wrong. This is unnecessary: the processor has already saved everything you need to diagnose the root cause into the exception stack frame and the fault status registers. Note that the configurable fault status registers described here exist on the M3, M4, M7 and M33 class cores; the M0 and M0+ provide only the stacked frame.
CFSR, HFSR, BFAR, MMFAR
The Configurable Fault Status Register (CFSR at address 0xE000ED28) is a composite of three sub-registers: MMFSR (MemManage fault), BFSR (BusFault), and UFSR (UsageFault). Each bit identifies the specific fault cause. The HardFault Status Register (HFSR) indicates whether the fault was a forced hard fault (escalated from a configurable fault) or a debug event. When BFAR/MMFAR valid bits are set, the corresponding address registers give the faulting address.
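The register layout described above can be sketched as a host-side decoder. MMFSR occupies CFSR bits [7:0], BFSR bits [15:8] and UFSR bits [31:16]; MMARVALID is CFSR bit 7, BFARVALID is bit 15, and HFSR.FORCED is bit 30. The `cfsr_fields_t` type and function names are hypothetical and only illustrate the bit arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t  mmfsr;        /* MemManage fault status, CFSR[7:0]   */
    uint8_t  bfsr;         /* BusFault status,        CFSR[15:8]  */
    uint16_t ufsr;         /* UsageFault status,      CFSR[31:16] */
    bool     mmfar_valid;  /* MMARVALID: MMFAR holds the address  */
    bool     bfar_valid;   /* BFARVALID: BFAR holds the address   */
} cfsr_fields_t;

/* Split a raw CFSR value into its three sub-registers. */
cfsr_fields_t cfsr_split(uint32_t cfsr)
{
    cfsr_fields_t f;
    f.mmfsr       = (uint8_t)(cfsr & 0xFFu);
    f.bfsr        = (uint8_t)((cfsr >> 8) & 0xFFu);
    f.ufsr        = (uint16_t)(cfsr >> 16);
    f.mmfar_valid = (cfsr & (1u << 7)) != 0u;
    f.bfar_valid  = (cfsr & (1u << 15)) != 0u;
    return f;
}

/* True when the HardFault was escalated from a configurable fault. */
bool hfsr_is_forced(uint32_t hfsr)
{
    return (hfsr & (1u << 30)) != 0u;
}
```

A CFSR of 0x00008200, for example, splits into BFSR = 0x82: a precise data bus error with a valid BFAR.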
Fault Handler Implementation
/**
* Complete HardFault handler that:
* 1. Detects whether the fault occurred in Thread mode (MSP/PSP) or Handler mode
* 2. Extracts the stacked exception frame (saved PC, LR, PSR, R0-R3, R12)
* 3. Reads all fault status registers (CFSR, HFSR, BFAR, MMFAR)
* 4. Reports via ITM (SWO) — no UART needed, works at any baud rate
*
* The __attribute__((naked)) prevents the compiler adding a prologue that
* would corrupt SP before we inspect it.
*/
/* ITM channel 0 output — visible in SWV console */
static void itm_print_hex(const char *label, uint32_t value) {
/* In production replace with full ITM_SendChar loop; abbreviated here */
(void)label; (void)value;
/* See ITM section for full implementation */
}
/* Called from HardFault_Handler with the correct stack pointer */
void HardFault_Handler_C(uint32_t *fault_frame, uint32_t lr_value) {
/* ── Stacked exception frame (8 words automatically pushed by processor) ── */
uint32_t stacked_r0 = fault_frame[0];
uint32_t stacked_r1 = fault_frame[1];
uint32_t stacked_r2 = fault_frame[2];
uint32_t stacked_r3 = fault_frame[3];
uint32_t stacked_r12 = fault_frame[4];
uint32_t stacked_lr = fault_frame[5]; /* LR at time of fault */
uint32_t stacked_pc = fault_frame[6]; /* PC at time of fault — the culprit */
uint32_t stacked_xpsr = fault_frame[7];
/* ── Fault status registers ──────────────────────────────────────────────── */
uint32_t cfsr = SCB->CFSR; /* 0xE000ED28: composite fault status */
uint32_t hfsr = SCB->HFSR; /* 0xE000ED2C: hard fault status */
uint32_t dfsr = SCB->DFSR; /* 0xE000ED30: debug fault status */
uint32_t mmfar = SCB->MMFAR; /* 0xE000ED34: MemManage fault address */
uint32_t bfar = SCB->BFAR; /* 0xE000ED38: BusFault address */
uint32_t afsr = SCB->AFSR; /* 0xE000ED3C: auxiliary fault (vendor-dep.) */
/* ── Decode which stack was active: EXC_RETURN in LR ──────────────────────
* lr_value bit[2]: 0 = MSP was active (Handler mode or Thread/MSP)
* 1 = PSP was active (Thread mode using PSP) */
uint8_t used_psp = (lr_value & 0x4U) ? 1U : 0U;
/* ── Send to ITM channel 0 for SWO capture ──────────────────────────────── */
itm_print_hex("FAULT PC ", stacked_pc);
itm_print_hex("FAULT LR ", stacked_lr);
itm_print_hex("CFSR ", cfsr);
itm_print_hex("HFSR ", hfsr);
itm_print_hex("MMFAR ", mmfar);
itm_print_hex("BFAR ", bfar);
itm_print_hex("PSP active ", (uint32_t)used_psp);
/* ── Decode CFSR sub-fields for human-readable diagnosis ────────────────── */
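/* MMFAR and BFAR are only meaningful when their VALID bits are set */
if (cfsr & SCB_CFSR_IACCVIOL_Msk) itm_print_hex("MemManage: Instr fetch violation", 0U);
if (cfsr & SCB_CFSR_DACCVIOL_Msk) itm_print_hex("MemManage: Data access violation", 0U);
if (cfsr & SCB_CFSR_MMARVALID_Msk) itm_print_hex("MemManage: faulting address @", mmfar);
if (cfsr & SCB_CFSR_IBUSERR_Msk) itm_print_hex("BusFault: Instruction prefetch", 0U);
if (cfsr & SCB_CFSR_PRECISERR_Msk) itm_print_hex("BusFault: Precise data bus error", 0U);
if (cfsr & SCB_CFSR_BFARVALID_Msk) itm_print_hex("BusFault: faulting address @", bfar);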
if (cfsr & SCB_CFSR_IMPRECISERR_Msk) itm_print_hex("BusFault: Imprecise (async) error", 0U);
if (cfsr & SCB_CFSR_UNDEFINSTR_Msk) itm_print_hex("UsageFault: Undefined instruction", 0U);
if (cfsr & SCB_CFSR_UNALIGNED_Msk) itm_print_hex("UsageFault: Unaligned access", 0U);
if (cfsr & SCB_CFSR_DIVBYZERO_Msk) itm_print_hex("UsageFault: Divide by zero", 0U);
/* Suppress unused variable warnings in minimal builds */
(void)stacked_r0; (void)stacked_r1; (void)stacked_r2; (void)stacked_r3;
(void)stacked_r12; (void)stacked_xpsr; (void)dfsr; (void)afsr;
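/* Break into the debugger only if one is attached: executing BKPT inside
 * HardFault without halting debug enabled sends the core into lockup. */
if (CoreDebug->DHCSR & CoreDebug_DHCSR_C_DEBUGEN_Msk) { __BKPT(0); }
for (;;) { } /* Park here so a debugger can attach later */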
}
/**
* Naked trampoline: reads the active stack pointer and calls the C handler.
* The naked attribute prevents any prologue/epilogue that would alter SP.
*/
__attribute__((naked)) void HardFault_Handler(void) {
__asm volatile (
" tst lr, #4 \n" /* Test EXC_RETURN bit[2] (Thread=PSP?) */
" ite eq \n"
" mrseq r0, msp \n" /* EQ (bit2=0): use MSP */
" mrsne r0, psp \n" /* NE (bit2=1): use PSP */
" mov r1, lr \n" /* Pass EXC_RETURN value as second arg */
" b HardFault_Handler_C \n"
/* No operand or clobber list: GCC supports only basic asm inside naked functions */
);
}
Critical Step: For synchronous faults (MemManage, precise BusFault, UsageFault) the stacked PC points at the faulting instruction. For an imprecise BusFault the write was buffered, so the stacked PC points at some later instruction and only narrows the search. Look the address up in your disassembly, or run arm-none-eabi-addr2line -e on your ELF file with the stacked PC, to find the C source line. In most cases this one register turns a HardFault from a mystery into a specific line of code.
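The EXC_RETURN value that the trampoline passes as lr_value carries more than the stack selection. The sketch below decodes the ARMv7-M bits (bit 2: PSP in use, bit 3: return to Thread mode, bit 4 clear: extended frame with floating-point state) and reconstructs the stack pointer as it was just before the fault. The type and function names are hypothetical, and ARMv8-M adds further EXC_RETURN bits that this sketch ignores.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool thread_mode;  /* fault interrupted Thread mode (else Handler mode) */
    bool used_psp;     /* frame was pushed on the process stack             */
    bool fp_frame;     /* extended frame: FP registers were stacked too     */
} exc_return_t;

/* Decode an ARMv7-M EXC_RETURN value such as 0xFFFFFFFD. */
exc_return_t exc_return_decode(uint32_t lr)
{
    exc_return_t r;
    r.used_psp    = (lr & 0x04u) != 0u;
    r.thread_mode = (lr & 0x08u) != 0u;
    r.fp_frame    = (lr & 0x10u) == 0u;
    return r;
}

/* Stack pointer value immediately before exception entry.
 * Basic frame: 8 words (0x20 bytes). Extended frame: 26 words (0x68 bytes).
 * Stacked xPSR bit 9 is set when the hardware inserted a 4-byte pad to
 * keep the frame 8-byte aligned. */
uint32_t prefault_sp(uint32_t frame_addr, uint32_t lr, uint32_t stacked_xpsr)
{
    uint32_t sp = frame_addr + (exc_return_decode(lr).fp_frame ? 0x68u : 0x20u);
    if ((stacked_xpsr & (1u << 9)) != 0u) {
        sp += 4u;
    }
    return sp;
}
```

Knowing the pre-fault stack pointer lets you check it against the stack limits from the linker script, which quickly confirms or rules out a stack overflow.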
ITM Real-Time Tracing
ITM tracing is one of the most under-used features in embedded development. A polled UART printf stalls the core for the full character time (roughly 87 µs per character at 115200 baud), whereas an ITM write is a single store to the stimulus port register, after which the hardware serialises the data to the SWO pin asynchronously. If firmware writes while the stimulus port FIFO is full, the hardware discards the write rather than stalling the processor, so code that must not lose output polls the port's ready bit first. The host-side SWV (Serial Wire Viewer) tool reconstructs the stream.
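To illustrate what the host-side viewer has to undo, here is a simplified decoder for the raw SWO byte stream. Each software (stimulus port) packet starts with a header byte whose bits [7:3] are the port number, bit 2 is 0, and bits [1:0] encode the payload size (1, 2 or 4 bytes). The function name is hypothetical, and the sketch assumes timestamp packets are disabled so that the only protocol bytes in the stream are sync and overflow packets.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Collect the first payload byte of every software packet sent on `port`.
 * Returns the number of bytes written to out (at most out_cap). */
size_t itm_extract_port(const uint8_t *stream, size_t len, uint8_t port,
                        char *out, size_t out_cap)
{
    size_t n = 0;
    size_t i = 0;

    while (i < len) {
        uint8_t hdr  = stream[i++];
        uint8_t size = (uint8_t)(hdr & 0x03u);

        if (size == 0u) {
            continue;                 /* sync or overflow byte: skip it */
        }
        size_t payload = (size == 3u) ? 4u : (size_t)size;
        if (i + payload > len) {
            break;                    /* truncated packet at end of capture */
        }
        bool software = (hdr & 0x04u) == 0u;  /* bit 2 set = DWT hardware packet */
        if (software && (uint8_t)(hdr >> 3) == port && n < out_cap) {
            out[n++] = (char)stream[i];
        }
        i += payload;
    }
    return n;
}
```

An 8-bit write of 'H' to port 0 therefore appears on the wire as the two bytes 0x01 0x48, and a 32-bit write to port 1 starts with header 0x0B.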
ITM printf Implementation
/**
* ITM printf implementation — redirects stdout to ITM channel 0.
* Requires: SWO pin connected to probe, SWV enabled in debugger at correct baud.
*
* Initialisation sequence must be performed once before any ITM writes.
*/
#include "core_cm4.h" /* Provides CoreDebug, ITM, DWT register definitions */
/**
* Enable ITM trace with SWO output.
* @param cpu_clock_hz Core clock in Hz (e.g. 168000000 for 168 MHz)
* @param swo_baud Desired SWO baud rate (e.g. 2000000 for 2 Mbaud)
*/
void ITM_Init(uint32_t cpu_clock_hz, uint32_t swo_baud) {
uint32_t prescaler = (cpu_clock_hz / swo_baud) - 1U; /* Assumes the TPIU runs at the core clock */
/* 1. Enable TRCENA in DEMCR first: it is the master enable and must be set
 *    before the ITM, DWT and TPIU registers are programmed */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
/* 2. Unlock ITM register access */
ITM->LAR = 0xC5ACCE55UL;
/* 3. Configure TPIU for asynchronous SWO (NRZ encoding).
 *    Many MCUs also need a vendor-specific step to route SWO to the pin. */
TPI->SPPR = 0x00000002UL; /* Async SWO NRZ (UART framing) */
TPI->ACPR = prescaler; /* Async clock prescaler */
TPI->FFCR = 0x00000100UL; /* Formatter bypassed (EnFCont = 0), as needed for ITM-only SWO */
/* 4. Enable ITM with all 32 stimulus ports and ATB ID 1 */
ITM->TCR = ITM_TCR_ITMENA_Msk /* Global ITM enable */
| ITM_TCR_SYNCENA_Msk /* Enable sync packets */
| ITM_TCR_DWTENA_Msk /* Enable DWT packet forwarding via ITM */
| (1UL << ITM_TCR_TraceBusID_Pos); /* ATB ID = 1 */
/* 5. Enable all 32 stimulus ports (bit N = port N enabled) */
ITM->TER = 0xFFFFFFFFUL;
}
/**
 * Write a single character to an ITM stimulus port.
 * Returns the character on success, or -1 if ITM or the port is disabled.
 * Named ITM_SendCharPort because core_cm4.h already defines
 * ITM_SendChar(uint32_t ch) for port 0; redefining that name does not compile.
 * Note: this busy-waits while the port FIFO is full, so it can stall when SWO
 * bandwidth is saturated or nothing is draining the trace output.
 */
int32_t ITM_SendCharPort(uint32_t port, uint32_t ch) {
if (port > 31U) return -1; /* Only ports 0..31 exist */
if ((ITM->TCR & ITM_TCR_ITMENA_Msk) == 0U) return -1; /* ITM disabled */
if ((ITM->TER & (1UL << port)) == 0U) return -1; /* Port disabled */
/* Wait until the stimulus port FIFO is ready (bit[0] = 1 when ready) */
while (ITM->PORT[port].u32 == 0U) { __NOP(); }
/* Write character: hardware handles SWO serialisation */
ITM->PORT[port].u8 = (uint8_t)ch;
return (int32_t)ch;
}
/**
 * Redirect printf to ITM channel 0.
 * Implement _write() (GCC newlib) or fputc() (Keil Arm Compiler) to call ITM_SendCharPort.
 */
int _write(int fd, char *buf, int len) {
(void)fd;
for (int i = 0; i < len; i++) {
ITM_SendCharPort(0U, (uint32_t)(uint8_t)buf[i]);
}
return len;
}
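ITM_Init above derives the ACPR prescaler with a truncating division, which can leave the real SWO rate further from the requested one than necessary. The sketch below rounds to the nearest divider and reports the baud rate actually produced, so the debugger can be configured to match. The `swo_clock_t` type and function name are hypothetical, and the input is the TPIU trace clock, which this example assumes equals the core clock.

```c
#include <stdint.h>

typedef struct {
    uint32_t acpr;         /* value to write to TPI->ACPR (divider - 1) */
    uint32_t actual_baud;  /* SWO rate that divider really produces     */
} swo_clock_t;

/* Pick the divider closest to trace_clk_hz / target_baud. */
swo_clock_t swo_prescaler(uint32_t trace_clk_hz, uint32_t target_baud)
{
    swo_clock_t r = { 0u, 0u };
    if (target_baud == 0u) {
        return r;                              /* invalid request */
    }
    /* 64-bit intermediate avoids overflow when rounding */
    uint64_t div = ((uint64_t)trace_clk_hz + target_baud / 2u) / target_baud;
    if (div == 0u) {
        div = 1u;                              /* cannot run faster than the clock */
    }
    r.acpr        = (uint32_t)(div - 1u);
    r.actual_baud = (uint32_t)(trace_clk_hz / div);
    return r;
}
```

At 168 MHz a request for 2.25 Mbaud rounds to a divider of 75, giving 2.24 Mbaud, whereas truncation would pick 74 and land about 0.9% high. The host must be told the achieved rate and the same clock, otherwise the SWV output turns to garbage.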
DWT Cycle Counter for Microsecond Profiling
/**
* DWT cycle counter — microsecond profiling without modifying target behaviour.
*
* DWT->CYCCNT increments every CPU clock cycle.
* At 168 MHz: 1 µs = 168 counts. Resolution: 1/168 MHz ≈ 5.95 ns.
* Wraps at 2^32 cycles (~25.6 s at 168 MHz — sufficient for most measurements).
*/
/** Enable the DWT cycle counter. Call once at startup after ITM_Init(). */
void DWT_CycleCounter_Enable(void) {
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* TRCENA must be set */
DWT->CYCCNT = 0U; /* Reset counter */
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk; /* Enable counter */
}
/** Read current cycle count */
static inline uint32_t DWT_GetCycles(void) {
return DWT->CYCCNT;
}
/** Convert a cycle delta to microseconds. Avoid division if performance matters. */
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t cpu_mhz) {
return cycles / cpu_mhz; /* e.g. cycles / 168 for 168 MHz system */
}
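/**
 * Example: Profile a DSP function using DWT.
 * No printf overhead inside the measured region, only two register reads.
 * Assumes arm_math.h and stdio.h are included and that fir_inst, input_buf,
 * output_buf and BLOCK_SIZE are defined elsewhere in the project.
 */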
void profile_my_fir_filter(void) {
uint32_t t_start, t_end, elapsed_cycles, elapsed_us;
t_start = DWT_GetCycles();
/* ── Code under measurement ───────────────────── */
arm_fir_f32(&fir_inst, input_buf, output_buf, BLOCK_SIZE);
/* ─────────────────────────────────────────────── */
t_end = DWT_GetCycles();
elapsed_cycles = t_end - t_start; /* Unsigned subtraction stays correct across one counter wrap */
elapsed_us = cycles_to_us(elapsed_cycles, 168U);
/* Report via ITM — no blocking UART */
printf("[PROFILE] FIR filter: %lu cycles = %lu µs\r\n",
(unsigned long)elapsed_cycles, (unsigned long)elapsed_us);
}
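The cycles_to_us helper in the listing truncates to whole microseconds, which hides short measurements. A minimal alternative is sketched below: it converts a cycle delta to nanoseconds with a 64-bit intermediate and makes the wrap-safe subtraction explicit. Both function names are hypothetical additions, and the code is plain C that can be checked on a host PC.

```c
#include <stdint.h>

/* Cycles between two CYCCNT samples. Unsigned arithmetic keeps the result
 * correct across a single 32-bit wrap of the counter. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to nanoseconds (rounded down). The 64-bit product
 * cannot overflow: 2^32 * 10^9 is below 2^64. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    return ((uint64_t)cycles * 1000000000ULL) / cpu_hz;
}
```

At 168 MHz, 168 cycles convert to exactly 1000 ns, and a single cycle reports 5 ns because the 5.95 ns period is rounded down.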
Exercises
Exercise 1
Beginner
Trigger a HardFault Intentionally and Decode CFSR
Write a small test function that deliberately causes a specific fault type: (a) read from an address that is unmapped on your device to trigger a BusFault (check the memory map first: address 0x00000000 is usually valid flash or an alias of it, so a null dereference often does not fault unless the MPU blocks it), (b) execute an undefined instruction (__asm volatile("udf #0")) for a UsageFault, (c) enable the divide-by-zero trap in CCR (SCB->CCR |= SCB_CCR_DIV_0_TRP_Msk) and divide by zero. Unless you enable the configurable fault handlers in SCB->SHCSR, each of these escalates to HardFault with HFSR.FORCED set. For each fault: read CFSR from your debugger's memory view, identify the specific set bits, and match them to the ARM Architecture Reference Manual description. Document your findings: which bits were set, what they mean, and whether BFAR or MMFAR held a valid address.
HardFault
CFSR
Fault Analysis
Exercise 2
Intermediate
Profile a Function Using the DWT Cycle Counter
Implement the DWT cycle counter setup from this article. Choose a non-trivial function to profile — for example, a 256-point FFT using CMSIS-DSP, a CRC computation over 1 kB of data, or a sorting algorithm. Measure the cycle count for: (a) default compilation (-O0), (b) size-optimised (-Os), (c) performance-optimised (-O2 or -O3). Report the cycle counts and corresponding microsecond values at your MCU's core clock. Note which optimisation level produces the best throughput and verify the results match your theoretical expectation (e.g., FFT should be approximately O(N log N) cycles).
DWT
Profiling
Optimisation
Exercise 3
Advanced
Stream ITM Data to Host and Visualise via SWV Timeline
Implement a complete ITM logging system with multiple channels: port 0 for printf output, port 1 for RTOS context switch events (called from osRtxThreadSwitch or FreeRTOS traceTASK_SWITCHED_IN), and port 2 for custom performance counters (DWT cycle snapshots). Configure your debugger (Keil MDK SWV or OpenOCD SWO capture) to receive the trace stream. Verify you can see: (a) text output on port 0 in the SWV console, (b) thread switch timestamps on port 1 in the SWV timeline/event view, (c) cycle count values on port 2. Capture a 5-second trace and identify the task with the highest CPU utilisation. Describe the SWO baud rate configuration and any probe limitations you encountered.
ITM
SWV Timeline
RTOS Tracing
Conclusion & Next Steps
ARM's CoreSight infrastructure provides a professional-grade debug and trace ecosystem that is largely invisible until you understand what it offers. The key takeaways from this article:
- The CoreSight hierarchy (DAP → access port → ITM/ETM/DWT/FPB) is the physical path from your debug probe to every breakpoint, watchpoint, and trace event. Understanding it makes probe connectivity problems much easier to diagnose.
- SWD is the correct choice for the vast majority of Cortex-M designs: two signals, supports all variants, lower PCB overhead, and works natively with every modern probe. Add SWO as a third signal if ITM tracing is required.
- The HardFault handler with stacked frame inspection transforms an opaque crash into a diagnosed line of code in seconds. CFSR plus the stacked PC is enough to locate the majority of HardFaults.
- ITM tracing provides low-overhead, multi-channel data streaming from the target to the host over a single SWO pin. Use it instead of UART printf whenever timing accuracy matters: each write costs a few cycles rather than a full UART character time.
- The DWT cycle counter (DWT->CYCCNT) enables cycle-accurate profiling (about 6 ns resolution at 168 MHz) with two register reads, making it one of the most lightweight profiling mechanisms available on Cortex-M.
Next in the Series
In Part 10: Portable Firmware — Multi-Vendor CMSIS Projects, we pivot from debugging individual bugs to designing firmware that targets multiple MCU families. We will cover hardware abstraction layer patterns in C, conditional compilation with CMake board variables, BSP design, and the specific porting deltas between STM32F4, NXP LPC55S69, Nordic nRF52840, and Renesas RA4M1.
Related Articles in This Series
Part 2: CMSIS-Core — Registers, NVIC & SysTick
The processor-level APIs used in the fault handler — SCB, CFSR, NVIC priority grouping, and SysTick — all originate in CMSIS-Core headers.
Part 11: Interrupts, Concurrency & Real-Time Constraints
Watchpoints and ITM tracing are essential tools for diagnosing race conditions and interrupt latency violations covered in Part 11.
Part 18: Performance Optimization
The DWT cycle counter profiling technique from this article is used extensively in Part 18 to validate compiler flag and cache optimisation results.