CMSIS Part 9: Debugging with CMSIS-DAP & CoreSight

                        
                        Series Context: This is Part 9 of 20 in the CMSIS Mastery Series. With our foundational layers in place — Core, RTOS2, DSP, drivers, and Pack — we now turn to debugging: the skill that separates professional embedded developers from those who spend days on bugs that should take minutes. This article covers the hardware debug infrastructure that every Cortex-M device ships with.
                    

CoreSight Debug Architecture

ARM CoreSight is the debug and trace infrastructure built into every ARM Cortex-M processor. It is not a single block — it is a hierarchy of components connected via the Advanced Peripheral Bus (APB) and the Advanced High-performance Bus (AHB), all accessible from outside the chip through a Debug Access Port. Understanding this architecture is essential because every debugger interaction — setting a breakpoint, reading a variable, decoding a HardFault — works through this system.

Debug Access Port (DAP)

The DAP is the top-level interface between the external debug probe and the on-chip debug infrastructure. It consists of a Debug Port (DP), which speaks SWD or JTAG to the probe, and one or more Access Ports (APs) behind it. On Cortex-M3/M4 devices a single AHB-AP typically gives the probe memory-mapped access to everything: code, RAM, peripheral registers, and the debug components themselves (ITM, DWT, FPB, and the core debug registers), which sit on the Private Peripheral Bus from 0xE0000000 upward. Larger or multi-core SoCs may add an APB-AP that reaches CoreSight components on a dedicated debug bus. The probe communicates with the DP via either SWD or JTAG, then routes requests through the appropriate AP to read or write memory, control execution, or configure trace.
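
To make the routing step concrete: under ADIv5 the probe picks the Access Port (and the bank of four AP registers) for subsequent AP accesses by writing the DP SELECT register. The sketch below is host-side arithmetic only, assuming the ADIv5 layout (APSEL in bits [31:24], APBANKSEL in bits [7:4]). The helper names `dp_select_value` and `ap_reg_a32` are illustrative and not part of any CMSIS API.

```c
#include <assert.h>
#include <stdint.h>

/* Standard MEM-AP register offsets (ADIv5). */
#define MEM_AP_CSW  0x00U  /* Control/Status Word */
#define MEM_AP_TAR  0x04U  /* Transfer Address    */
#define MEM_AP_DRW  0x0CU  /* Data Read/Write     */
#define MEM_AP_IDR  0xFCU  /* Identification      */

/* Hypothetical helper: value to write to DP SELECT so that the next AP
 * access reaches register 'ap_reg' of Access Port number 'ap_index'. */
uint32_t dp_select_value(uint8_t ap_index, uint8_t ap_reg)
{
    assert((ap_reg & 0x3U) == 0U);        /* AP registers are word-aligned  */
    return ((uint32_t)ap_index << 24)     /* APSEL: which Access Port       */
         | ((uint32_t)ap_reg & 0xF0U);    /* APBANKSEL: bank of 4 registers */
}

/* Hypothetical helper: the A[3:2] field sent in the SWD/JTAG request
 * to address the register inside the selected bank. */
uint8_t ap_reg_a32(uint8_t ap_reg)
{
    return (uint8_t)((ap_reg >> 2) & 0x3U);
}
```

A memory read through AP 0 is then: write SELECT, write the address to TAR, read DRW. Reading the AP's IDR needs bank 0xF, which is why `dp_select_value(0, MEM_AP_IDR)` yields 0x000000F0.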

ITM, ETM, TPIU, DWT, FPB

The CoreSight ecosystem includes several specialised components that work together to provide debug and trace capability without requiring the processor to halt:

ITM

Instrumentation Trace Macrocell

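Software-controlled trace. Firmware writes to ITM->PORT[n] stimulus registers and the data is serialised out through the TPIU as trace packets. The cost to the application is one register write (plus a short poll if the FIFO is busy), far less than a blocking UART printf. Supports up to 32 independent ports; port 0 is conventionally used for text output (the CMSIS ITM_SendChar helper writes to it), and the remaining ports are free for uses such as RTOS event logging.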

DWT

Data Watchpoint & Trace Unit

Provides hardware watchpoints (break on data access at an address), cycle counting via DWT->CYCCNT, and PC sampling. The cycle counter is invaluable for microsecond-precision profiling without modifying code logic — just read CYCCNT before and after the code under test.

FPB

Flash Patch & Breakpoint Unit

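Provides hardware breakpoints. On M3/M4 the full configuration has 6 instruction comparators and 2 literal comparators; the M7 implements up to 8 instruction comparators and no literal comparators. Vendors may implement fewer. Hardware breakpoints work in flash (unlike software breakpoints, which require writable memory). On M3/M4 the FPB also supports flash patching: remapping accesses to flash addresses onto RAM-resident patches without reflashing. The M7's FPB drops this remap feature.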

ETM / TPIU

Embedded Trace Macrocell & TPIU

ETM records the instruction execution stream in compressed form (essentially branch outcomes, from which the host reconstructs every instruction executed). The TPIU (Trace Port Interface Unit) serialises trace data (ITM + ETM) onto the single-wire SWO pin or a parallel trace port of up to 4 bits. In practice SWO carries ITM and DWT packets only; ETM needs the bandwidth of the parallel port. Most CMSIS-DAP probes support only SWO; parallel ETM trace requires a trace-capable probe such as ULINKpro or J-Trace.

SWD vs JTAG Protocols

Both SWD and JTAG are physical layer protocols for communicating with the DAP. Understanding their differences matters when you are choosing a probe, designing a PCB debug header, or diagnosing connectivity problems.

SWD Bit-Banging & Timing

SWD uses only two signals: SWDIO (bidirectional data) and SWCLK (clock). The host drives SWCLK, while SWDIO is driven by the host during request packets and by the target during acknowledgement and data phases. A line turnaround period separates direction changes. Here is an illustrative bit-banging sequence (simplified for readability — production probes use hardware shift registers):

/**
 * SWD line reset + JTAG-to-SWD switch sequence (simplified illustration).
 * In practice this is done in hardware by the CMSIS-DAP probe firmware.
 *
 * Physical signals: SWDIO (GPIO output/input) and SWCLK (GPIO output)
 */

/* SWDIO_SET(), SWCLK_LOW() and SWCLK_HIGH() are board-specific GPIO macros (not shown). */

/* Step 1: Line reset: hold SWDIO high for at least 50 clock cycles */
SWDIO_SET(1U);
for (int i = 0; i < 50; i++) {
    SWCLK_LOW();  __NOP(); __NOP();
    SWCLK_HIGH(); __NOP(); __NOP();
}

/* Step 2: Send the 16-bit JTAG-to-SWD select sequence.
 * On the wire it is 0x79E7 MSB first, which is 0xE79E when shifted LSB first. */
uint16_t magic = 0xE79EU;
for (int i = 0; i < 16; i++) {
    SWDIO_SET((magic >> i) & 1U);
    SWCLK_LOW();  __NOP(); __NOP();
    SWCLK_HIGH(); __NOP(); __NOP();
}

/* Step 3: Line reset again (at least 50 clocks, SWDIO = HIGH) */
/* Step 4: At least 2 idle clocks (SWDIO = LOW) */
/* Step 5: The DP is now in SWD mode; read DPIDR (IDCODE) before any other access */

/**
 * SWD packet format (8-bit request):
 *   bit[0]   = start (always 1)
 *   bit[1]   = APnDP  (0=DP, 1=AP)
 *   bit[2]   = RnW    (0=write, 1=read)
 *   bit[3:4] = A[2:3] (register address bits 2–3)
 *   bit[5]   = parity (1 if an odd number of bits 1–4 are set, i.e. even parity)
 *   bit[6]   = stop   (always 0)
 *   bit[7]   = park   (always 1, line pulled high)
 *
 * After request: 1 turnaround + 3-bit ACK from target (OK=001, WAIT=010, FAULT=100)
 * After ACK:     32-bit data + 1 parity bit (read), or turnaround + 32-bit data + parity (write)
 */
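
The request layout above can be checked with a few lines of host-side C. This is a sketch, not probe firmware: `swd_request` is a hypothetical helper that packs the eight request bits into a byte whose bit 0 is the first bit on the wire. The parity bit is 1 when an odd number of the four bits APnDP, RnW, A[2], A[3] are set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the 8-bit SWD request; bit 0 is transmitted first.
 * 'addr' is the DP/AP register offset: 0x0, 0x4, 0x8 or 0xC. */
uint8_t swd_request(bool ap, bool read, uint8_t addr)
{
    assert((addr & ~0x0CU) == 0U);

    uint8_t apndp  = ap   ? 1U : 0U;
    uint8_t rnw    = read ? 1U : 0U;
    uint8_t a2     = (addr >> 2) & 1U;
    uint8_t a3     = (addr >> 3) & 1U;
    uint8_t parity = (apndp ^ rnw ^ a2 ^ a3) & 1U;

    return (uint8_t)( 1U               /* bit 0: start, always 1 */
                    | (apndp  << 1)
                    | (rnw    << 2)
                    | (a2     << 3)
                    | (a3     << 4)
                    | (parity << 5)
                                       /* bit 6: stop, always 0  */
                    | (1U     << 7));  /* bit 7: park, always 1  */
}
```

For example, `swd_request(false, true, 0x0)` (read DPIDR) gives 0xA5, and `swd_request(false, true, 0xC)` (read RDBUFF) gives 0xBD, the two request bytes seen most often in SWD captures.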

Protocol Comparison

Feature	SWD	JTAG
Signals required	2 (SWDIO, SWCLK)	4+ (TDI, TDO, TMS, TCK, optional nTRST)
PCB pin count	Minimal — 10-pin or 5-pin SWD header standard	20-pin standard ARM JTAG header; larger footprint
Typical max speed	10 MHz (probe-dependent; J-Link up to 50 MHz)	10–25 MHz typical; deterministic for long chains
Multi-device daisy chain	No — point-to-point only	Yes — JTAG chains support multiple devices/TAPs
SWO trace support	Yes: SWO is an optional third pin	No: SWO shares the JTAG TDO pin, so trace needs the parallel trace port
Cortex-M support	All Cortex-M variants	Optional: common on M3/M4/M7/M33 parts; M0/M0+/M23 parts are almost always SWD-only
Preferred for	Single-chip embedded, space-constrained PCBs	Complex SoCs, FPGAs, multi-chip debug chains

                        
Practical Advice: Use SWD for new Cortex-M designs unless you need a JTAG chain or boundary scan. It requires only two pins, is supported by every modern probe, and is the only option on most M0/M0+ devices, which are rarely built with a JTAG port. Add the SWO pin (a third signal) if you want ITM/SWV trace capability.
                    

CMSIS-DAP Probes

CMSIS-DAP is a firmware specification that defines the USB protocol between a debug probe and the host PC. Version 1 uses USB HID reports, so it needs no driver; version 2 uses USB bulk endpoints for higher throughput. Any USB-capable microcontroller running CMSIS-DAP firmware is recognised by CMSIS-DAP compatible debuggers (OpenOCD, pyOCD, probe-rs, Keil MDK). This means you can build your own debug probe from an inexpensive MCU board running DAPLink firmware.
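
At the protocol level, each CMSIS-DAP command is a short byte packet: a command ID followed by arguments. The sketch below builds two such packets on the host, using the command IDs and transfer-request bits from the CMSIS-DAP specification (DAP_Connect = 0x02, DAP_Transfer = 0x05). The function names are hypothetical, and the USB transport (HID report or bulk transfer) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command IDs and request bits as defined by the CMSIS-DAP specification. */
#define ID_DAP_CONNECT      0x02U
#define ID_DAP_TRANSFER     0x05U
#define DAP_PORT_SWD        0x01U
#define DAP_TRANSFER_APnDP  (1U << 0)  /* 0 = DP register, 1 = AP register */
#define DAP_TRANSFER_RnW    (1U << 1)  /* 0 = write, 1 = read              */

/* Hypothetical helper: DAP_Connect request selecting SWD mode. */
size_t dap_build_connect_swd(uint8_t *buf, size_t cap)
{
    assert(cap >= 2U);
    buf[0] = ID_DAP_CONNECT;
    buf[1] = DAP_PORT_SWD;
    return 2U;
}

/* Hypothetical helper: DAP_Transfer request with a single read of the
 * DP register at offset 'addr' (0x0, 0x4, 0x8 or 0xC). */
size_t dap_build_dp_read(uint8_t *buf, size_t cap, uint8_t addr)
{
    assert(cap >= 4U);
    assert((addr & ~0x0CU) == 0U);
    buf[0] = ID_DAP_TRANSFER;
    buf[1] = 0U;   /* DAP index: ignored for SWD              */
    buf[2] = 1U;   /* number of transfers in this packet      */
    buf[3] = (uint8_t)(DAP_TRANSFER_RnW | addr);  /* A[3:2] already sit in bits 3:2 */
    return 4U;
}
```

Sending the second packet with `addr = 0x0` asks the probe to read DPIDR; the probe's response carries the transfer count, the ACK, and the 32-bit value.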

DAPLink Open-Source Probe

DAPLink is the reference open-source implementation of the CMSIS-DAP firmware, maintained by ARM. It runs on interface MCUs such as the LPC11U35, LPC4322, and nRF52-series parts, and provides a CMSIS-DAP debug interface (HID for v1, bulk for v2), USB mass-storage drag-and-drop flashing (drop a .hex/.bin onto the virtual drive), and a USB CDC virtual COM port (connects to the target's UART). It is the on-board interface firmware of many mbed-enabled boards, including the BBC micro:bit. ST Nucleo boards use ST-LINK firmware instead, and the Raspberry Pi Debug Probe runs Raspberry Pi's own CMSIS-DAP firmware (debugprobe).

Probe	Protocol	SWO Trace	ETM Trace	Approx. Cost	Notes
DAPLink / mbed HDK	CMSIS-DAP v1/v2, SWD, JTAG	Interface-dependent	No	Free (built in on many mbed-enabled boards)	Open-source; drag-and-drop flash; on many dev boards
Raspberry Pi Debug Probe	CMSIS-DAP v2, SWD	No (SWD and UART pins only)	No	~$12 USD	RP2040-based; good OpenOCD support; UART passthrough
J-Link BASE / EDU	JTAG, SWD, Segger RTT	Yes (SWO)	No (streaming ETM needs J-Trace)	$20 EDU / $500+ BASE	Widely supported; RTT for low-overhead logging over the debug link; vendor SDK support
ST-LINK v3	JTAG, SWD, virtual COM	Yes (SWV)	No	~$15 (STLINK-V3MODS)	STM32-focused; excellent CubeIDE integration; CMSIS-DAP via third-party firmware
ULINK pro (Keil)	JTAG, SWD, CMSIS-DAP v2	Yes (SWV)	Yes (parallel 4-bit)	~$500 USD	Keil MDK native; supports full ETM instruction trace; power measurement
Black Magic Probe	GDB server (native SWD/JTAG)	Yes (SWO)	No	~$70 USD	No separate GDB server needed — probe IS the GDB server via USB CDC; open firmware

Debug Techniques

Understanding the physical debug infrastructure lets you use it intentionally rather than accidentally. Two of the most powerful on-chip debug resources are hardware breakpoints (via FPB) and data watchpoints (via DWT). Both operate without modifying the code under test — they are purely hardware mechanisms.

Hardware vs Software Breakpoints

Software breakpoints replace an instruction with a BKPT #0 opcode. The processor traps on execution, the debugger restores the original instruction, and resumes. This requires writable memory — so software breakpoints work in RAM but fail in flash (read-only) without flash modification cycles. They are unlimited in number but introduce latency from the instruction patch cycle.

Hardware breakpoints use the FPB comparators to halt execution when the PC reaches a specific address, without any code modification. The M3/M4 provide up to 6 instruction comparators and the M7 up to 8. If you request more than the hardware offers, the outcome depends on the tool: some debuggers fall back to software or flash breakpoints for the excess, others simply refuse to set the extra breakpoint. The FPB has no notion of conditions; a conditional breakpoint is an ordinary breakpoint whose condition the debugger evaluates after each halt, which is why it slows hot code paths noticeably.
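
What the debugger writes when you set a hardware breakpoint can be sketched in a few lines. The helper below computes an FP_COMPn register value under the FPB version 1 layout found on Cortex-M3/M4 (ENABLE in bit 0, address bits [28:2], REPLACE in bits [31:30]). The name `fpb_v1_breakpoint_value` is hypothetical, and FPB version 2 (Cortex-M7, ARMv8-M) uses a different encoding that this sketch does not cover.

```c
#include <assert.h>
#include <stdint.h>

#define FPB_COMP_ENABLE         (1UL << 0)
#define FPB_COMP_REPLACE_LOWER  (1UL << 30)  /* break on halfword at addr[1] = 0 */
#define FPB_COMP_REPLACE_UPPER  (2UL << 30)  /* break on halfword at addr[1] = 1 */

/* Hypothetical helper: FP_COMPn value for a breakpoint on the Thumb
 * instruction at 'instr_addr' (FPB version 1 layout only). */
uint32_t fpb_v1_breakpoint_value(uint32_t instr_addr)
{
    assert(instr_addr < 0x20000000UL);  /* FPBv1 only covers the Code region      */
    assert((instr_addr & 1UL) == 0UL);  /* Thumb instructions are halfword aligned */

    uint32_t replace = (instr_addr & 2UL) ? FPB_COMP_REPLACE_UPPER
                                          : FPB_COMP_REPLACE_LOWER;
    return (uint32_t)((instr_addr & 0x1FFFFFFCUL) | replace | FPB_COMP_ENABLE);
}
```

The Code-region restriction is the reason FPBv1 breakpoints cannot be placed on functions that execute from SRAM at 0x20000000 and above; debuggers use software breakpoints there instead.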

Data Watchpoints (DWT)

Watchpoints halt (or trace) on data access at a specific address: reads, writes, or both. This is invaluable for tracking down memory corruption: set a watchpoint on a variable whose value is mysteriously changing, and the processor will halt the moment any code writes to that address, regardless of which thread or interrupt caused it. The M3/M4/M7 provide up to 4 DWT comparators; the number actually implemented on your device is reported in the NUMCOMP field of DWT->CTRL.

/**
 * Programmatic DWT watchpoint configuration via CMSIS registers.
 * Halts on any write to the address of 'g_shared_counter'.
 *
 * Prerequisites: CoreDebug->DEMCR must have TRCENA set (see ITM section below).
 */

extern volatile uint32_t g_shared_counter;  /* Variable we suspect is corrupted */

void dwt_set_watchpoint(uint32_t address, uint32_t mask_bits, uint32_t function) {
    /* Check that DWT comparators exist (NUMCOMP field = 0 means none implemented) */
    if ((DWT->CTRL & DWT_CTRL_NUMCOMP_Msk) == 0U) {
        return;  /* No DWT comparators on this device */
    }

    /* Comparator 0, ARMv7-M register layout (M3/M4/M7).
     * ARMv8-M parts (M23/M33) have no MASK register and use a different
     * FUNCTION encoding, so check the reference manual before reusing this. */
    DWT->COMP0     = address;        /* Address to watch                          */
    DWT->MASK0     = mask_bits;      /* Number of low address bits to ignore      */
    DWT->FUNCTION0 = function;       /* 0x5 = read; 0x6 = write; 0x7 = read/write */
    /* After this write, a matching access raises a halt or DebugMonitor event   */
}

void setup_watchpoint_on_counter(void) {
    /* Enable DWT (TRCENA must be set in DEMCR first) */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0U;   /* Reset cycle counter while we are here */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;  /* Enable cycle counter */

    /* Set watchpoint on g_shared_counter: break on any write */
    dwt_set_watchpoint((uint32_t)&g_shared_counter,
                       2U,    /* Ignore 2 low address bits: covers all 4 bytes */
                       0x6U); /* Function: data write */
}
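
The mask argument in the listing above is a count of low address bits to ignore, so a comparator can only watch a power-of-two sized, naturally aligned region. A small host-side sketch of that calculation follows; `dwt_mask_for_region` is a hypothetical helper, and the largest mask a given device accepts is implementation-defined, so treat the result as a request rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: DWT MASK value (number of ignored low address bits)
 * for a region of 'size_bytes' starting at 'base'.
 * Returns -1 if the region cannot be expressed by a single comparator. */
int dwt_mask_for_region(uint32_t base, uint32_t size_bytes)
{
    if (size_bytes == 0U || (size_bytes & (size_bytes - 1U)) != 0U) {
        return -1;  /* size is not a power of two */
    }
    if ((base & (size_bytes - 1U)) != 0U) {
        return -1;  /* base is not aligned to the region size */
    }

    int mask = 0;
    while ((1UL << mask) < size_bytes) {
        mask++;
    }
    return mask;
}
```

A 4-byte variable therefore needs mask 2, a 16-byte struct mask 4, and a 12-byte struct needs either two comparators or a 16-byte window that also covers its neighbour.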

HardFault Debugging

HardFaults are the Cortex-M's last line of defence: an escalation mechanism that fires when a more specific fault (MemManage, BusFault, UsageFault) cannot be handled or is disabled. For most developers, encountering a HardFault means the debugger shows execution parked in an infinite loop inside a default handler, with no useful information. This is unnecessary: the processor has already saved everything you need to diagnose the root cause into the exception stack frame and the fault status registers.

CFSR, HFSR, BFAR, MMFAR

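The Configurable Fault Status Register (CFSR at address 0xE000ED28) is a composite of three sub-registers: MMFSR (MemManage fault, bits 7:0), BFSR (BusFault, bits 15:8), and UFSR (UsageFault, bits 31:16). Each bit identifies a specific fault cause. The HardFault Status Register (HFSR) indicates whether the fault was a forced hard fault (FORCED, escalated from a configurable fault), a vector table read error (VECTTBL), or a debug event (DEBUGEVT). BFAR and MMFAR hold the faulting address only while the BFARVALID and MMARVALID bits in CFSR are set. These registers exist on ARMv7-M and ARMv8-M Mainline cores (M3/M4/M7/M33); the M0/M0+ has no CFSR, so only the stacked frame is available there.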
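
Decoding a raw CFSR value by hand gets tedious, so here is a minimal host-side sketch that splits the register into its three sub-registers and names the set bits. Bit positions follow the ARMv7-M definition of CFSR; the type and function names (`cfsr_fields_t`, `cfsr_split`, `cfsr_print_flags`) are illustrative, not CMSIS API, and the flag table omits the lazy FP stacking bits for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t  mmfsr;  /* CFSR[7:0]   MemManage fault status  */
    uint8_t  bfsr;   /* CFSR[15:8]  BusFault status         */
    uint16_t ufsr;   /* CFSR[31:16] UsageFault status       */
} cfsr_fields_t;

cfsr_fields_t cfsr_split(uint32_t cfsr)
{
    cfsr_fields_t f;
    f.mmfsr = (uint8_t)(cfsr & 0xFFU);
    f.bfsr  = (uint8_t)((cfsr >> 8) & 0xFFU);
    f.ufsr  = (uint16_t)(cfsr >> 16);
    return f;
}

typedef struct { uint8_t bit; const char *name; } cfsr_flag_t;

static const cfsr_flag_t k_cfsr_flags[] = {
    { 0,  "IACCVIOL"   }, { 1,  "DACCVIOL"    }, { 3,  "MUNSTKERR" },
    { 4,  "MSTKERR"    }, { 7,  "MMARVALID"   }, { 8,  "IBUSERR"   },
    { 9,  "PRECISERR"  }, { 10, "IMPRECISERR" }, { 11, "UNSTKERR"  },
    { 12, "STKERR"     }, { 15, "BFARVALID"   }, { 16, "UNDEFINSTR"},
    { 17, "INVSTATE"   }, { 18, "INVPC"       }, { 19, "NOCP"      },
    { 24, "UNALIGNED"  }, { 25, "DIVBYZERO"   },
};

/* Prints the name of every set flag; returns how many were set. */
int cfsr_print_flags(uint32_t cfsr)
{
    int count = 0;
    for (size_t i = 0; i < sizeof k_cfsr_flags / sizeof k_cfsr_flags[0]; i++) {
        if ((cfsr >> k_cfsr_flags[i].bit) & 1U) {
            printf("%s\n", k_cfsr_flags[i].name);
            count++;
        }
    }
    return count;
}

/* BFAR / MMFAR are only meaningful while these bits are set. */
bool cfsr_bfar_valid(uint32_t cfsr)  { return ((cfsr >> 15) & 1U) != 0U; }
bool cfsr_mmfar_valid(uint32_t cfsr) { return ((cfsr >> 7)  & 1U) != 0U; }
```

A CFSR reading of 0x00008200, for instance, decodes to PRECISERR plus BFARVALID: a precise bus error whose address can be read from BFAR.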

Fault Handler Implementation

/**
 * Complete HardFault handler that:
 *   1. Detects whether the fault occurred in Thread mode (MSP/PSP) or Handler mode
 *   2. Extracts the stacked exception frame (saved PC, LR, PSR, R0-R3, R12)
 *   3. Reads all fault status registers (CFSR, HFSR, BFAR, MMFAR)
 *   4. Reports via ITM (SWO) — no UART needed, works at any baud rate
 *
 * The __attribute__((naked)) prevents the compiler adding a prologue that
 * would corrupt SP before we inspect it.
 */

/* ITM channel 0 output — visible in SWV console */
static void itm_print_hex(const char *label, uint32_t value) {
    /* In production replace with full ITM_SendChar loop; abbreviated here */
    (void)label; (void)value;
    /* See ITM section for full implementation */
}

/* Called from HardFault_Handler with the correct stack pointer */
void HardFault_Handler_C(uint32_t *fault_frame, uint32_t lr_value) {
    /* ── Stacked exception frame (8 words automatically pushed by processor) ── */
    uint32_t stacked_r0   = fault_frame[0];
    uint32_t stacked_r1   = fault_frame[1];
    uint32_t stacked_r2   = fault_frame[2];
    uint32_t stacked_r3   = fault_frame[3];
    uint32_t stacked_r12  = fault_frame[4];
    uint32_t stacked_lr   = fault_frame[5];  /* LR at time of fault */
    uint32_t stacked_pc   = fault_frame[6];  /* PC at time of fault — the culprit */
    uint32_t stacked_xpsr = fault_frame[7];

    /* ── Fault status registers ──────────────────────────────────────────────── */
    uint32_t cfsr  = SCB->CFSR;   /* 0xE000ED28: composite fault status        */
    uint32_t hfsr  = SCB->HFSR;   /* 0xE000ED2C: hard fault status             */
    uint32_t dfsr  = SCB->DFSR;   /* 0xE000ED30: debug fault status            */
    uint32_t mmfar = SCB->MMFAR;  /* 0xE000ED34: MemManage fault address       */
    uint32_t bfar  = SCB->BFAR;   /* 0xE000ED38: BusFault address              */
    uint32_t afsr  = SCB->AFSR;   /* 0xE000ED3C: auxiliary fault (vendor-dep.) */

    /* ── Decode which stack was active: EXC_RETURN in LR ──────────────────────
     * lr_value bit[2]: 0 = MSP was active (Handler mode or Thread/MSP)
     *                  1 = PSP was active (Thread mode using PSP)          */
    uint8_t used_psp = (lr_value & 0x4U) ? 1U : 0U;

    /* ── Send to ITM channel 0 for SWO capture ──────────────────────────────── */
    itm_print_hex("FAULT PC   ", stacked_pc);
    itm_print_hex("FAULT LR   ", stacked_lr);
    itm_print_hex("CFSR       ", cfsr);
    itm_print_hex("HFSR       ", hfsr);
    itm_print_hex("MMFAR      ", mmfar);
    itm_print_hex("BFAR       ", bfar);
    itm_print_hex("PSP active ", (uint32_t)used_psp);

    /* ── Decode CFSR sub-fields for human-readable diagnosis ────────────────── */
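    /* MMFAR/BFAR are only meaningful while MMARVALID/BFARVALID are set; report 0 otherwise */
    uint32_t mm_addr  = (cfsr & SCB_CFSR_MMARVALID_Msk) ? mmfar : 0U;
    uint32_t bus_addr = (cfsr & SCB_CFSR_BFARVALID_Msk) ? bfar  : 0U;
    if (cfsr & SCB_CFSR_IACCVIOL_Msk)  itm_print_hex("MemManage: Instr fetch violation, PC ", stacked_pc);
    if (cfsr & SCB_CFSR_DACCVIOL_Msk)  itm_print_hex("MemManage: Data access violation @", mm_addr);
    if (cfsr & SCB_CFSR_IBUSERR_Msk)   itm_print_hex("BusFault:  Instruction prefetch",    0U);
    if (cfsr & SCB_CFSR_PRECISERR_Msk) itm_print_hex("BusFault:  Precise data bus error @", bus_addr);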
    if (cfsr & SCB_CFSR_IMPRECISERR_Msk) itm_print_hex("BusFault:  Imprecise (async) error", 0U);
    if (cfsr & SCB_CFSR_UNDEFINSTR_Msk) itm_print_hex("UsageFault: Undefined instruction",  0U);
    if (cfsr & SCB_CFSR_UNALIGNED_Msk)  itm_print_hex("UsageFault: Unaligned access",       0U);
    if (cfsr & SCB_CFSR_DIVBYZERO_Msk)  itm_print_hex("UsageFault: Divide by zero",         0U);

    /* Suppress unused variable warnings in minimal builds */
    (void)stacked_r0; (void)stacked_r1; (void)stacked_r2; (void)stacked_r3;
    (void)stacked_r12; (void)stacked_xpsr; (void)dfsr; (void)afsr;

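    /* Break into the debugger only if one is attached: executing BKPT inside
     * the HardFault handler with halting debug disabled causes a lockup. */
    if (CoreDebug->DHCSR & CoreDebug_DHCSR_C_DEBUGEN_Msk) { __BKPT(0); }
    for (;;) { }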
}

/**
 * Naked trampoline: reads the active stack pointer and calls the C handler.
 * The naked attribute prevents any prologue/epilogue that would alter SP.
 */
__attribute__((naked)) void HardFault_Handler(void) {
    __asm volatile (
        " tst   lr, #4          \n"  /* Test EXC_RETURN bit[2] (0 = MSP, 1 = PSP) */
        " ite   eq              \n"
        " mrseq r0, msp         \n"  /* EQ (bit2=0): use MSP                 */
        " mrsne r0, psp         \n"  /* NE (bit2=1): use PSP                 */
        " mov   r1, lr          \n"  /* Pass EXC_RETURN value as second arg  */
        " b     HardFault_Handler_C \n"
        /* No operand or clobber lists: GCC only guarantees basic asm in naked functions */
    );
}

                        
Critical Step: For precise faults, the stacked PC points to the instruction that caused the fault. For an imprecise BusFault (IMPRECISERR) it points somewhere after the offending store, possibly several instructions later, because the write was buffered. Look up the stacked PC in your disassembly or .map file, or pass it to arm-none-eabi-addr2line together with your ELF file, to find the line of C code involved. In most HardFault investigations this one value, together with CFSR, is enough to locate the cause.
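
The EXC_RETURN value that the trampoline passes as its second argument encodes more than the stack selection. A host-side sketch of the decoding follows, assuming ARMv7-M semantics (bit 2 = stack used, bit 3 = return to Thread mode, bit 4 = 0 when a floating-point extended frame was stacked). ARMv8-M adds security-related bits that this sketch ignores, and the names `exc_return_info_t`, `exc_return_decode` and `exc_frame_words` are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool valid;           /* value has the EXC_RETURN signature          */
    bool thread_mode;     /* fault interrupted Thread mode (else Handler) */
    bool used_psp;        /* frame was stacked on PSP (else MSP)          */
    bool extended_frame;  /* frame also holds S0-S15 and FPSCR            */
} exc_return_info_t;

exc_return_info_t exc_return_decode(uint32_t lr)
{
    exc_return_info_t info;
    info.valid          = (lr & 0xFF000000UL) == 0xFF000000UL;
    info.thread_mode    = (lr & (1UL << 3)) != 0UL;
    info.used_psp       = (lr & (1UL << 2)) != 0UL;
    info.extended_frame = (lr & (1UL << 4)) == 0UL;  /* bit 4 clear = FP frame */
    return info;
}

/* Number of 32-bit words the hardware pushed for this exception entry. */
size_t exc_frame_words(exc_return_info_t info)
{
    return info.extended_frame ? 26U : 8U;
}
```

The three values seen on parts without an FPU are 0xFFFFFFF1 (Handler mode, MSP), 0xFFFFFFF9 (Thread mode, MSP) and 0xFFFFFFFD (Thread mode, PSP). The frame size matters when the handler wants to reconstruct the stack pointer value from before the fault.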
                    

ITM Real-Time Tracing

ITM tracing is one of the most under-used features in embedded development. A blocking UART printf stalls the CPU for the full character time (about 87 µs per character at 115200 baud), whereas an ITM write is a single store to a stimulus port register; the hardware serialises the data to the SWO pin asynchronously. The hardware never stalls the core: a write to a port whose FIFO is full is simply dropped. Software therefore usually polls the port's ready bit first, as the CMSIS ITM_SendChar helper does, which costs a short spin only when the SWO link is saturated. The host-side SWV (Serial Wire Viewer) tool reconstructs the stream.
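
It helps to know what the host actually receives. Each stimulus-port write becomes a software instrumentation packet: one header byte carrying the port number and payload size, followed by the payload in little-endian order. The sketch below encodes such a packet on the host, assuming the ARMv7-M ITM packet format (header bits [7:3] = port, bit [2] = 0, bits [1:0] = size code). `itm_encode_packet` is a hypothetical helper for illustration and test tooling, not something firmware needs to call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one ITM software-source packet as it
 * appears in the SWO byte stream. Returns the number of bytes written
 * to 'out' (header + payload, at most 5). */
size_t itm_encode_packet(uint8_t port, uint32_t value,
                         uint8_t size_bytes, uint8_t out[5])
{
    assert(port < 32U);
    assert(size_bytes == 1U || size_bytes == 2U || size_bytes == 4U);

    /* Size code: 1 byte -> 0b01, 2 bytes -> 0b10, 4 bytes -> 0b11 */
    uint8_t size_code = (size_bytes == 4U) ? 3U : size_bytes;

    out[0] = (uint8_t)((port << 3) | size_code);
    for (uint8_t i = 0U; i < size_bytes; i++) {
        out[1U + i] = (uint8_t)(value >> (8U * i));  /* little-endian payload */
    }
    return (size_t)size_bytes + 1U;
}
```

Writing the character 'A' to port 0 as a byte therefore appears on SWO as the two bytes 0x01 0x41, which is why a one-character printf costs two byte times on the wire.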

ITM printf Implementation

/**
 * ITM printf implementation — redirects stdout to ITM channel 0.
 * Requires: SWO pin connected to probe, SWV enabled in debugger at correct baud.
 *
 * Initialisation sequence must be performed once before any ITM writes.
 */

#include "stm32f4xx.h"  /* Example device header; use your own. It pulls in core_cm4.h (CoreDebug, ITM, TPI, DWT) */

/**
 * Enable ITM trace with SWO output.
 * @param cpu_clock_hz  Core clock in Hz (e.g. 168000000 for 168 MHz)
 * @param swo_baud      Desired SWO baud rate (e.g. 2000000 for 2 Mbaud)
 */
void ITM_Init(uint32_t cpu_clock_hz, uint32_t swo_baud) {
    uint32_t prescaler = (cpu_clock_hz / swo_baud) - 1U;

    /* 1. Enable TRCENA in DEMCR first: ITM, DWT and TPIU registers are not
     *    usable until the trace subsystem is enabled */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;

    /* 2. Unlock ITM register access */
    ITM->LAR = 0xC5ACCE55UL;

    /* 3. Configure TPIU for asynchronous SWO (NRZ encoding).
     *    Many devices also need a vendor-specific step to route SWO to the pin. */
    TPI->SPPR  = 0x00000002UL;          /* Async SWO NRZ (UART framing)            */
    TPI->ACPR  = prescaler;             /* Async clock prescaler                    */
    TPI->FFCR  = 0x00000100UL;          /* Formatter off (EnFCont = 0), TrigIn set  */

    /* 4. Enable ITM with all 32 stimulus ports and ATB ID 1 */
    ITM->TCR = ITM_TCR_ITMENA_Msk      /* Global ITM enable                    */
             | ITM_TCR_SYNCENA_Msk     /* Enable sync packets                   */
             | ITM_TCR_DWTENA_Msk      /* Enable DWT packet forwarding via ITM  */
             | (1UL << ITM_TCR_TraceBusID_Pos); /* ATB ID = 1                  */

    /* 5. Enable all 32 stimulus ports (bit N = port N enabled) */
    ITM->TER = 0xFFFFFFFFUL;
}

/**
 * Write a single character to the given ITM stimulus port.
 * Returns the character on success, or -1 if ITM or the port is disabled.
 * Named ITM_SendCharPort because CMSIS already defines ITM_SendChar(uint32_t)
 * for port 0 in core_cm4.h; redefining that name would not compile.
 */
int32_t ITM_SendCharPort(uint32_t port, uint32_t ch) {
    if ((ITM->TCR & ITM_TCR_ITMENA_Msk) == 0U) return -1;  /* ITM disabled    */
    if ((ITM->TER & (1UL << port))       == 0U) return -1;  /* Port disabled   */

    /* Spin while the stimulus port FIFO is full (register reads non-zero when ready) */
    while (ITM->PORT[port].u32 == 0U) { __NOP(); }

    /* Write character; hardware handles SWO serialisation */
    ITM->PORT[port].u8 = (uint8_t)ch;
    return (int32_t)ch;
}

/**
 * Redirect printf to ITM channel 0.
 * Implement _write() (GCC newlib) or fputc() (Arm Compiler / Keil) to call ITM_SendCharPort.
 */
int _write(int fd, char *buf, int len) {
    (void)fd;
    for (int i = 0; i < len; i++) {
        (void)ITM_SendCharPort(0U, (uint32_t)(uint8_t)buf[i]);
    }
    return len;
}
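
The prescaler arithmetic in ITM_Init is worth checking on the host before blaming the probe for garbled output, because the SWO baud rate is an integer division of the trace clock and the host must be configured with the rate actually produced. A minimal sketch, assuming the trace clock equals the core clock (true on many parts, but device-specific); `swo_cfg_t` and `swo_compute_prescaler` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t acpr;         /* value to write to TPI->ACPR (divider - 1) */
    uint32_t actual_baud;  /* baud rate the hardware will really produce */
} swo_cfg_t;

/* Hypothetical helper: pick the divider nearest to the requested baud
 * rate and report the rate it actually yields. */
swo_cfg_t swo_compute_prescaler(uint32_t trace_clk_hz, uint32_t target_baud)
{
    assert(target_baud != 0U);
    assert(trace_clk_hz >= target_baud);

    /* Round to the nearest integer divider rather than truncating. */
    uint32_t divider = (trace_clk_hz + target_baud / 2U) / target_baud;

    swo_cfg_t cfg = { divider - 1U, trace_clk_hz / divider };
    return cfg;
}
```

With a 168 MHz trace clock, 2 Mbaud divides exactly (ACPR = 83). With 100 MHz and a 3 Mbaud request the nearest divider is 33, giving about 3.03 Mbaud, roughly 1% off, which a UART-style receiver normally tolerates but is worth knowing about.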

DWT Cycle Counter for Microsecond Profiling

/**
 * DWT cycle counter — microsecond profiling without modifying target behaviour.
 *
 * DWT->CYCCNT increments every CPU clock cycle.
 * At 168 MHz: 1 µs = 168 counts.  Resolution: 1/168 MHz ≈ 5.95 ns.
 * Wraps at 2^32 cycles (~25.6 s at 168 MHz — sufficient for most measurements).
 */

/** Enable the DWT cycle counter. Call once at startup. */
void DWT_CycleCounter_Enable(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* TRCENA must be set */
    /* Some Cortex-M7 parts also need DWT->LAR = 0xC5ACCE55 here to unlock the DWT */
    DWT->CYCCNT = 0U;                                  /* Reset counter      */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* Enable counter     */
}

/** Read current cycle count */
static inline uint32_t DWT_GetCycles(void) {
    return DWT->CYCCNT;
}

/** Convert a cycle delta to microseconds. Avoid division if performance matters. */
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t cpu_mhz) {
    return cycles / cpu_mhz;  /* e.g. cycles / 168 for 168 MHz system */
}

/**
 * Example: Profile a DSP function using DWT.
 * No printf overhead inside the measured region — only two register reads.
 */
void profile_my_fir_filter(void) {
    uint32_t t_start, t_end, elapsed_cycles, elapsed_us;

    t_start = DWT_GetCycles();

    /* ── Code under measurement ───────────────────── */
    arm_fir_f32(&fir_inst, input_buf, output_buf, BLOCK_SIZE);
    /* ─────────────────────────────────────────────── */

    t_end = DWT_GetCycles();
    elapsed_cycles = t_end - t_start;         /* Handles 32-bit wrap correctly */
    elapsed_us     = cycles_to_us(elapsed_cycles, 168U);

    /* Report via ITM — no blocking UART */
    printf("[PROFILE] FIR filter: %lu cycles = %lu µs\r\n",
           (unsigned long)elapsed_cycles, (unsigned long)elapsed_us);
}

Exercises

Exercise 1 Beginner

Trigger a HardFault Intentionally and Decode CFSR

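Write a small test function that deliberately causes a specific fault type: (a) read from an address that your device's memory map marks as reserved, to trigger a BusFault. Note that a null dereference (*(volatile uint32_t *)0x00000000) usually does not fault on Cortex-M, because address 0 is readable (the vector table is normally mapped there) unless the MPU is configured to catch it; (b) execute a permanently undefined instruction (__asm volatile("udf #0")) for a UsageFault; (c) enable the divide-by-zero trap in CCR (SCB->CCR |= SCB_CCR_DIV_0_TRP_Msk) and divide by zero. Unless you enable the individual fault handlers in SCB->SHCSR, all three escalate to HardFault with HFSR.FORCED set. For each fault: read CFSR from your debugger's memory view, identify the specific set bits, and match them to the ARM Architecture Reference Manual description. Document your findings: which bits were set, what they mean, and whether BFAR or MMFAR held a valid address.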

HardFault CFSR Fault Analysis

Exercise 2 Intermediate

Profile a Function Using the DWT Cycle Counter

Implement the DWT cycle counter setup from this article. Choose a non-trivial function to profile — for example, a 256-point FFT using CMSIS-DSP, a CRC computation over 1 kB of data, or a sorting algorithm. Measure the cycle count for: (a) default compilation (-O0), (b) size-optimised (-Os), (c) performance-optimised (-O2 or -O3). Report the cycle counts and corresponding microsecond values at your MCU's core clock. Note which optimisation level produces the best throughput and verify the results match your theoretical expectation (e.g., FFT should be approximately O(N log N) cycles).

DWT Profiling Optimisation

Exercise 3 Advanced

Stream ITM Data to Host and Visualise via SWV Timeline

Implement a complete ITM logging system with multiple channels: port 0 for printf output, port 1 for RTOS context switch events (called from osRtxThreadSwitch or FreeRTOS traceTASK_SWITCHED_IN), and port 2 for custom performance counters (DWT cycle snapshots). Configure your debugger (Keil MDK SWV or OpenOCD SWO capture) to receive the trace stream. Verify you can see: (a) text output on port 0 in the SWV console, (b) thread switch timestamps on port 1 in the SWV timeline/event view, (c) cycle count values on port 2. Capture a 5-second trace and identify the task with the highest CPU utilisation. Describe the SWO baud rate configuration and any probe limitations you encountered.

ITM SWV Timeline RTOS Tracing

Conclusion & Next Steps

ARM's CoreSight infrastructure provides a professional-grade debug and trace ecosystem that is largely invisible until you understand what it offers. The key takeaways from this article:

The CoreSight hierarchy (debug probe → DAP → Access Port → ITM/ETM/DWT/FPB) is the physical path from your debug probe to every breakpoint, watchpoint, and trace event. Knowing it turns probe connectivity problems into a layer-by-layer check.
SWD is the correct choice for the vast majority of Cortex-M designs: two signals, supports all variants, lower PCB overhead, and works natively with every modern probe. Add SWO as a third signal if ITM tracing is required.
The HardFault handler with stacked frame inspection turns an opaque crash into a diagnosed line of code. CFSR plus the stacked PC is usually enough to locate the fault, even on a unit in the field with no debugger attached.
ITM tracing provides low-overhead, multi-channel data streaming from the target to the host over a single SWO pin. Use it instead of UART printf whenever timing accuracy matters: each write costs a handful of cycles rather than a full character time.
The DWT cycle counter (DWT->CYCCNT) enables single-cycle-resolution profiling (about 6 ns at 168 MHz) with two register reads, which makes it one of the most lightweight profiling mechanisms available on Cortex-M3 and above.

Next in the Series

In Part 10: Portable Firmware — Multi-Vendor CMSIS Projects, we pivot from debugging individual bugs to designing firmware that targets multiple MCU families. We will cover hardware abstraction layer patterns in C, conditional compilation with CMake board variables, BSP design, and the specific porting deltas between STM32F4, NXP LPC55S69, Nordic nRF52840, and Renesas RA4M1.
