Series Context: This is Part 11 of our 20-part CMSIS Mastery Series — the first article in the Bonus/Advanced section. Parts 1–10 covered CMSIS fundamentals; now we tackle the professional-grade topics that distinguish senior embedded engineers.
1. Overview & ARM Cortex-M Ecosystem (CMSIS layers, Cortex-M families, memory map, toolchains)
2. CMSIS-Core: Registers, NVIC & SysTick (core_cmX.h, register access, interrupt controller, SysTick timer)
3. Startup Code, Linker Scripts & Vector Table (Reset handler, BSS init, scatter files, boot process)
4. CMSIS-RTOS2: Threads, Mutexes & Semaphores (Thread management, synchronization primitives, scheduling)
5. CMSIS-RTOS2: Message Queues & Event Flags (Inter-thread comms, ISR-to-thread, real-time design patterns)
6. CMSIS-DSP: Filters, FFT & Math Functions (FIR/IIR filters, FFT, SIMD optimizations)
7. CMSIS-Driver: UART, SPI & I2C (Driver abstraction layer, callbacks, DMA integration)
8. CMSIS-Pack & Software Components (Pack files, device support, dependency management)
9. Debugging with CMSIS-DAP & CoreSight (SWD/JTAG, HardFault analysis, ITM tracing)
10. Portable Firmware: Multi-Vendor Projects (HAL vs CMSIS, cross-platform BSPs, reusable libraries)
11. Interrupts, Concurrency & Real-Time Constraints (Interrupt latency, critical sections, lock-free programming) [You Are Here]
12. Memory Management in Embedded Systems (Static vs dynamic, heap fragmentation, memory pools)
13. Low Power & Energy Optimization (Sleep modes, clock gating, tickless RTOS, power profiling)
14. DMA & High-Performance Data Handling (DMA basics, peripheral transfers, zero-copy techniques)
15. Security: ARMv8-M & TrustZone (Secure/non-secure worlds, secure boot, firmware protection)
16. Bootloaders & Firmware Updates (OTA updates, dual-bank flash, fail-safe strategies)
17. Testing & Validation (Unity/Ceedling unit tests, HIL testing, integration testing)
18. Performance Optimization (Compiler flags, inline assembly, cache (M7/M33), profiling)
19. Embedded Software Architecture (Layered design, event-driven, state machines, component-based)
20. Tooling & Workflow (Professional Level) (CI/CD for embedded, MISRA, static analysis, Doxygen)
Interrupt Latency Analysis
Interrupt latency is the elapsed time between a peripheral asserting an interrupt request line and the first instruction of the corresponding ISR executing on the CPU. On Cortex-M it consists of two components: hardware stacking latency (the processor saves eight registers — R0–R3, R12, LR, PC, xPSR — to the active stack before branching to the ISR) and pipeline flush latency (any in-flight instruction must complete or be cancelled). Understanding both is the foundation of real-time budgeting.
For most Cortex-M3/M4/M7 designs the minimum hardware latency is 12 clock cycles from IRQ assertion to ISR entry when no higher-priority interrupt is active, no instruction is stalling on a bus, and no FPU lazy stacking is required. In practice, cache misses on the M7, write-buffer draining, and ISR code placed in slow flash can push measured latency to several hundred cycles — an order of magnitude higher than the architectural minimum.
Key Insight: The Cortex-M hardware guarantees a minimum latency, not a maximum. Your job as a firmware engineer is to measure the worst-case path and prove it satisfies your system's real-time deadlines through margin analysis, not hope.
Measuring Interrupt Latency with the DWT Cycle Counter
The ARM Data Watchpoint and Trace (DWT) unit includes a 32-bit free-running cycle counter, DWT->CYCCNT, that increments every CPU clock cycle. The key to a clean measurement is controlling the exact moment the interrupt becomes pending: pend the IRQ from software with NVIC_SetPendingIRQ() while interrupts are disabled, timestamp CYCCNT immediately before re-enabling them, and timestamp again in the first statement of the ISR. The difference is the pend-to-entry latency in cycles, with no dependence on when a peripheral happens to fire. A GPIO toggle in the ISR provides an oscilloscope cross-check.
/**
 * Interrupt latency measurement using DWT CYCCNT and a software-pended IRQ.
 * Target: STM32F4 (Cortex-M4F, 168 MHz).
 *
 * Method:
 *   1. With interrupts disabled, pend TIM2_IRQn from software.
 *   2. Timestamp CYCCNT, then re-enable interrupts; the pending IRQ
 *      is taken immediately.
 *   3. The ISR timestamps CYCCNT in its first statement and toggles a
 *      test GPIO for an oscilloscope cross-check.
 */
#include "stm32f407xx.h"
#include "core_cm4.h"

/* DWT initialisation: TRCENA must be set before CYCCNT is writable */
static void DWT_Init(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable DWT/ITM */
    DWT->CYCCNT = 0U;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start cycle counter */
}

/* Shared measurement variables; volatile prevents optimisation */
static volatile uint32_t g_irq_entry_cycle = 0U;
static volatile uint32_t g_irq_arm_cycle   = 0U;
static volatile uint32_t g_latency_cycles  = 0U;

/* ISR for TIM2 (pended from software below) */
void TIM2_IRQHandler(void) {
    /* Capture cycle count in the first statement of the handler */
    g_irq_entry_cycle = DWT->CYCCNT;
    /* Toggle PA1: visible on an oscilloscope as a cross-check */
    GPIOA->ODR ^= GPIO_ODR_OD1;
    /* No peripheral flag to clear: the IRQ was pended via the NVIC.
     * If TIM2 itself fires, clear UIF with: TIM2->SR = ~TIM_SR_UIF; */
}

int main(void) {
    DWT_Init();

    /* Configure PA1 as output for the oscilloscope probe */
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    __DSB();
    GPIOA->MODER = (GPIOA->MODER & ~GPIO_MODER_MODER1_Msk)
                 | (0x01UL << GPIO_MODER_MODER1_Pos);

    /* Enable the TIM2 vector at priority 5. The timer peripheral itself
     * stays idle; pending the IRQ in software gives a cycle-exact arm point. */
    NVIC_SetPriority(TIM2_IRQn, 5U);
    NVIC_EnableIRQ(TIM2_IRQn);

    for (;;) {
        g_irq_entry_cycle = 0U;

        __disable_irq();
        NVIC_SetPendingIRQ(TIM2_IRQn);   /* IRQ is now pending */
        g_irq_arm_cycle = DWT->CYCCNT;   /* cycle-exact arm point */
        __enable_irq();                  /* pending IRQ taken right here */

        while (g_irq_entry_cycle == 0U) { __NOP(); }

        /* Latency: cycles from re-enable to first ISR instruction.
         * Track the maximum over many iterations; at 168 MHz the
         * 12-cycle architectural minimum corresponds to ~71 ns. */
        g_latency_cycles = g_irq_entry_cycle - g_irq_arm_cycle;
        (void)g_latency_cycles;          /* log or assert in production */
    }
}
Interrupt Latency by Cortex-M Variant
The following table captures the architectural minimum and a realistic worst-case scenario for each Cortex-M variant. The worst-case figures assume code executing from internal flash and include FPU lazy-stacking overhead where an FPU is present.

| Core | Min Cycles (IRQ to ISR) | Typical Worst Case | FPU Lazy Stack Overhead | Notes |
|------|-------------------------|--------------------|-------------------------|-------|
| Cortex-M0 | 16 | 16–30 | N/A | 3-stage pipeline; deterministic from zero-wait-state memory |
| Cortex-M0+ | 15 | 15–30 | N/A | 2-stage pipeline, very predictable |
| Cortex-M3 | 12 | 12–40 | N/A | Tail-chaining reduces back-to-back ISR overhead to 6 cycles |
| Cortex-M4F | 12 | 12–50 | +10–27 (first FPU use) | FPCCR.LSPEN=1 defers the FP context save lazily |
| Cortex-M7 | 12 | 12–200+ | +10–27 | 6-stage dual-issue (in-order); I/D-cache miss can add 100+ cycles from flash |
| Cortex-M33 | 12 | 12–60 | +10–27 (optional FPU) | TrustZone transition adds ~3–5 cycles at the NS-to-S boundary |
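The lazy-stacking jitter in the table can be traded for a constant cost: with LSPEN cleared and ASPEN still set, the FP context is stacked eagerly on every exception entry, so latency no longer depends on whether the ISR happens to touch the FPU. A minimal sketch using the standard CMSIS FPU register definitions; the function name is illustrative, and this only makes sense when the fixed extra stacking cost fits your budget:

```c
#include "core_cm4.h"   /* FPU->FPCCR and bit definitions */

/* Disable lazy stacking (keep automatic FP state preservation).
 * Result: every exception entry pays the FP stacking cost, but the
 * "+10-27 cycles on first FPU use" jitter disappears. */
static void fpu_disable_lazy_stacking(void) {
    FPU->FPCCR = (FPU->FPCCR & ~FPU_FPCCR_LSPEN_Msk)
               | FPU_FPCCR_ASPEN_Msk;
    __DSB();
    __ISB();
}
```

Call this once during startup, before interrupts are enabled, and re-measure worst-case latency afterwards to confirm the trade-off on your silicon.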
Critical Sections
A critical section is a code region that must execute atomically with respect to interrupts — no ISR may interleave. On Cortex-M there are two primary mechanisms: PRIMASK, which disables all maskable exceptions globally, and BASEPRI, which masks all exceptions at or below a numeric priority threshold while leaving higher-priority ISRs active. Choosing between them is a real-time engineering decision, not a convenience one.
PRIMASK Save & Restore Pattern
The naive approach — __disable_irq() / __enable_irq() — works only if critical sections never nest. In production firmware, always use save/restore to handle re-entrant callers safely. CMSIS provides __get_PRIMASK() and __set_PRIMASK() for this pattern.
/**
* PRIMASK-based critical section: save/restore pattern.
* Safe for nested calls — preserves the caller's IRQ state.
*/
#include <stdint.h>
#include "cmsis_compiler.h" /* __get_PRIMASK, __set_PRIMASK, __disable_irq */
/* Enter critical section — returns previous PRIMASK value */
static inline uint32_t critical_section_enter(void) {
uint32_t primask = __get_PRIMASK();
__disable_irq(); /* sets PRIMASK = 1, blocks all IRQs */
__DSB(); /* ensure write-buffer is drained */
__ISB(); /* flush pipeline so masking takes effect */
return primask;
}
/* Exit critical section — restores caller's IRQ state */
static inline void critical_section_exit(uint32_t primask) {
__set_PRIMASK(primask);
__ISB();
}
/* ----- Usage example: protect a shared counter --------- */
static volatile uint32_t g_shared_counter = 0U;
void increment_shared_counter(void) {
uint32_t saved = critical_section_enter();
g_shared_counter++; /* read-modify-write is now atomic */
critical_section_exit(saved);
}
/* Even safe when called from an ISR that itself disabled IRQs */
void TIM3_IRQHandler(void) {
uint32_t saved = critical_section_enter();
g_shared_counter += 10U;
critical_section_exit(saved); /* restores PRIMASK=1, not 0 */
TIM3->SR = ~TIM_SR_UIF;  /* write-0-to-clear: avoid RMW on rc_w0 bits */
}
BASEPRI Threshold Masking
BASEPRI is only available on Cortex-M3/M4/M7/M33 (ARMv7-M and ARMv8-M Main). It masks all exceptions whose numeric priority is greater than or equal to the BASEPRI value (lower number = higher priority). Setting BASEPRI = 0x50 (priority 5 on an 8-bit field) leaves priorities 0–4 unmasked — those ISRs can still preempt your critical section. This is the basis of FreeRTOS's taskENTER_CRITICAL().
/**
* BASEPRI-based partial masking: block ISRs at priority >= threshold
* while leaving high-priority ISRs (e.g. safety watchdog) runnable.
*
* configMAX_SYSCALL_INTERRUPT_PRIORITY in FreeRTOS maps directly
* to this register.
*/
#include "core_cm4.h"
/* Priority threshold — mask ISRs with numeric priority >= this value.
* NOTE: On 4-bit priority implementations (most STM32), shift left 4. */
#define CRITICAL_BASEPRI_VALUE (5U << (8U - __NVIC_PRIO_BITS))
static inline uint32_t basepri_enter(void) {
uint32_t old = __get_BASEPRI();
__set_BASEPRI_MAX(CRITICAL_BASEPRI_VALUE);
__DSB();
__ISB();
return old;
}
static inline void basepri_exit(uint32_t old_basepri) {
__set_BASEPRI(old_basepri);
__ISB();
}
Critical Section Technique Comparison

| Technique | IRQ Masking Level | Nesting Safe | Overhead (cycles) | Cores | Notes |
|-----------|-------------------|--------------|-------------------|-------|-------|
| PRIMASK (global disable) | All maskable IRQs | Yes (save/restore) | 3–5 | All Cortex-M | Does not mask NMI or HardFault; FAULTMASK additionally raises to HardFault level |
| BASEPRI threshold | Priority ≥ threshold | Yes (save/restore) | 3–5 | M3/M4/M7/M33 | Allows high-priority ISRs; FreeRTOS default |
| RTOS taskENTER_CRITICAL | Priority ≥ configMAX_SYSCALL | Yes (nesting counter) | 5–10 | M3/M4/M7/M33 | Uses BASEPRI internally; kernel-aware |
| RTOS vTaskSuspendAll | Scheduler only (IRQs run) | Yes | 10–20 | All | Suspends context switches; IRQs still fire |
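Firmware that must build for both ARMv6-M and ARMv7-M/ARMv8-M cores commonly wraps the choice between the two mechanisms behind one pair of functions. A sketch using the ACLE architecture macros; the names irq_lock/irq_unlock and the DRIVER_MASK_PRIO threshold are illustrative assumptions, not CMSIS names:

```c
#include <stdint.h>
#include "cmsis_compiler.h"

/* Portable critical section: BASEPRI where the architecture has it
 * (ARMv7-M / ARMv8-M Mainline), PRIMASK fallback on ARMv6-M. */
#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__) || \
    defined(__ARM_ARCH_8M_MAIN__)
  /* Mask ISRs at priority >= 5; pick this from your budget table */
  #define DRIVER_MASK_PRIO  (5U << (8U - __NVIC_PRIO_BITS))

  static inline uint32_t irq_lock(void) {
      uint32_t saved = __get_BASEPRI();
      __set_BASEPRI_MAX(DRIVER_MASK_PRIO);  /* never lowers the mask */
      __ISB();
      return saved;
  }
  static inline void irq_unlock(uint32_t saved) {
      __set_BASEPRI(saved);
      __ISB();
  }
#else  /* ARMv6-M: no BASEPRI, fall back to global masking */
  static inline uint32_t irq_lock(void) {
      uint32_t saved = __get_PRIMASK();
      __disable_irq();
      return saved;
  }
  static inline void irq_unlock(uint32_t saved) {
      __set_PRIMASK(saved);
  }
#endif
```

Note that the fallback silently widens the masking scope from "priority ≥ threshold" to "everything maskable", so any latency guarantees made for high-priority ISRs on M4/M7 builds must be re-verified on M0 builds.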
Atomic Operations
For single 32-bit variables, full critical sections are overkill. ARMv7-M and ARMv8-M provide Load-Exclusive / Store-Exclusive (LDREX/STREX) instructions that implement hardware-level compare-and-swap without disabling interrupts at all. If an interrupt fires between LDREX and STREX, STREX detects the hazard and returns 1 (failure) — the caller retries. This gives you interrupt-transparent atomics with zero latency impact on ISRs.
LDREX/STREX Compare-and-Swap on Cortex-M
/**
 * Atomic compare-and-swap using LDREX/STREX (ARMv7-M, ARMv8-M Main).
 * Returns 1 if the swap succeeded, 0 if *ptr did not hold the expected
 * value. A failed STREX (e.g. an interrupt between LDREX and STREX)
 * is retried internally and is invisible to the caller.
 *
 * CMSIS intrinsics: __LDREXW / __STREXW / __CLREX
 */
#include <stdint.h>
#include "cmsis_compiler.h"
/**
* @brief Atomic CAS: if *ptr == expected, write desired and return 1.
*/
static inline int atomic_cas32(volatile uint32_t *ptr,
uint32_t expected,
uint32_t desired) {
uint32_t current;
do {
current = __LDREXW(ptr); /* Load-exclusive */
if (current != expected) {
__CLREX(); /* Clear exclusive monitor */
return 0; /* Mismatch — no swap */
}
} while (__STREXW(desired, ptr)); /* Retry if interrupted */
__DMB(); /* Data memory barrier: ensure visibility before return */
return 1;
}
/**
* @brief Atomic fetch-and-add: atomically adds delta to *ptr,
* returns the original value.
*/
static inline uint32_t atomic_fetch_add(volatile uint32_t *ptr,
uint32_t delta) {
uint32_t old, tmp;
do {
old = __LDREXW(ptr);
tmp = old + delta;
} while (__STREXW(tmp, ptr));
__DMB();
return old;
}
/* ---- Usage: lock-free reference counter ---- */
static volatile uint32_t g_ref_count = 0U;
void object_acquire(void) {
atomic_fetch_add(&g_ref_count, 1U);
}
void object_release(void) {
uint32_t prev = atomic_fetch_add(&g_ref_count, (uint32_t)-1);
if (prev == 1U) {
/* Last reference released — trigger cleanup */
}
}
Cortex-M0/M0+ Warning: LDREX/STREX are not available on ARMv6-M (M0, M0+). For those cores you must use PRIMASK-based critical sections. With arm-none-eabi-gcc and -march=armv6-m, the C11 _Atomic keyword typically lowers to atomic helper library calls (libatomic) that mask interrupts internally rather than inline PRIMASK sequences, so inspect the generated code and link the helpers before relying on it.
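Where the toolchain supports C11, <stdatomic.h> offers a portable alternative to the hand-rolled helpers above; on ARMv7-M the compiler lowers these operations to the same LDREX/STREX retry loops. A minimal, host-testable sketch (cas32_c11 and fetch_add_c11 are illustrative names, not from the code above):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable C11 equivalents of atomic_cas32 / atomic_fetch_add.
 * On ARMv7-M, GCC/Clang compile these to LDREX/STREX retry loops. */

/* CAS: if *obj == expected, store desired and return true */
static inline bool cas32_c11(_Atomic uint32_t *obj,
                             uint32_t expected,
                             uint32_t desired) {
    /* atomic_compare_exchange_strong writes the observed value back
     * into 'expected' on failure; we discard it to keep the simple
     * succeeded/failed contract of atomic_cas32 */
    return atomic_compare_exchange_strong(obj, &expected, desired);
}

/* Fetch-and-add: returns the value held before the addition */
static inline uint32_t fetch_add_c11(_Atomic uint32_t *obj,
                                     uint32_t delta) {
    return atomic_fetch_add(obj, delta);
}

static _Atomic uint32_t g_ref_count_c11 = 0;
```

The default memory order is sequentially consistent, which on a single-core Cortex-M is stronger than strictly necessary but never wrong.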
Lock-Free Data Structures
The most common pattern in embedded concurrent programming is the Single-Producer Single-Consumer (SPSC) ring buffer: one writer (often main loop or a DMA callback) and one reader (often a communication task or ISR). With only one producer and one consumer, no atomic operations or critical sections are needed — only memory barriers to prevent out-of-order memory access from breaking the invariant.
SPSC Ring Buffer with Memory Barriers
/**
* Lock-free SPSC ring buffer for embedded use.
* Producer (writer) and consumer (reader) each own one index pointer.
* Only __DMB() barriers are needed — no disabling of interrupts.
*
* Safe when producer runs in main context and consumer in ISR (or vice versa).
* For MPMC (multiple producers/consumers), use atomic CAS instead.
*/
#include "cmsis_compiler.h"
#include <stdint.h>
#include <stdbool.h>
#define RING_BUF_SIZE 256U /* Must be power of 2 */
#define RING_BUF_MASK (RING_BUF_SIZE - 1U)
typedef struct {
uint8_t buf[RING_BUF_SIZE];
volatile uint32_t head; /* Written by producer only */
volatile uint32_t tail; /* Written by consumer only */
} RingBuf_t;
/**
* @brief Write one byte to the ring buffer (producer side).
* @return true on success, false if buffer is full.
*/
bool ring_buf_write(RingBuf_t *rb, uint8_t byte) {
uint32_t head = rb->head;
uint32_t next = (head + 1U) & RING_BUF_MASK;
/* Check full: next == tail means buffer is full */
if (next == rb->tail) {
return false;
}
rb->buf[head] = byte;
/* DMB: ensure the data write is visible before updating head */
__DMB();
rb->head = next;
return true;
}
/**
* @brief Read one byte from the ring buffer (consumer side).
* @return true on success, false if buffer is empty.
*/
bool ring_buf_read(RingBuf_t *rb, uint8_t *out) {
uint32_t tail = rb->tail;
/* Empty check */
if (tail == rb->head) {
return false;
}
/* DMB: ensure head is read before accessing buf[tail] */
__DMB();
*out = rb->buf[tail];
rb->tail = (tail + 1U) & RING_BUF_MASK;
return true;
}
/* ---- UART receive ISR -> main loop example ---- */
static RingBuf_t g_uart_rx;
/* Called from UART ISR */
void USART1_IRQHandler(void) {
if (USART1->SR & USART_SR_RXNE) {
uint8_t byte = (uint8_t)(USART1->DR & 0xFFU);
ring_buf_write(&g_uart_rx, byte); /* no IRQ disable needed */
}
}
/* Called from main loop */
void process_uart_bytes(void) {
uint8_t byte;
while (ring_buf_read(&g_uart_rx, &byte)) {
/* Process byte */
}
}
Why DMB, not DSB? __DMB() (Data Memory Barrier) ensures all preceding memory accesses complete before subsequent ones — sufficient for ordering between CPU and peripheral/ISR observers. __DSB() additionally waits for write buffers to drain to the bus and is needed before __ISB() or before entering sleep. Use DMB for ordering in lock-free structures; use DSB when configuring hardware registers.
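Because the producer and consumer functions are plain C, their FIFO ordering and full/empty invariants can be checked off-target before deploying to hardware. The sketch below mirrors the buffer above with a host-build stand-in for __DMB(); TestRing_t, HOST_DMB, and the test functions are test-only names, not part of CMSIS:

```c
#include <stdint.h>
#include <stdbool.h>

/* Host stand-in for the CMSIS barrier: a compiler barrier is enough
 * for a single-threaded functional test; on target this is __DMB(). */
#define HOST_DMB() __asm__ volatile("" ::: "memory")

#define TEST_RING_SIZE 8U                 /* power of 2, as on target */
#define TEST_RING_MASK (TEST_RING_SIZE - 1U)

typedef struct {
    uint8_t buf[TEST_RING_SIZE];
    volatile uint32_t head;               /* producer-owned */
    volatile uint32_t tail;               /* consumer-owned */
} TestRing_t;

static bool test_ring_write(TestRing_t *rb, uint8_t byte) {
    uint32_t head = rb->head;
    uint32_t next = (head + 1U) & TEST_RING_MASK;
    if (next == rb->tail) { return false; }   /* full */
    rb->buf[head] = byte;
    HOST_DMB();                               /* data before index */
    rb->head = next;
    return true;
}

static bool test_ring_read(TestRing_t *rb, uint8_t *out) {
    uint32_t tail = rb->tail;
    if (tail == rb->head) { return false; }   /* empty */
    HOST_DMB();                               /* index before data */
    *out = rb->buf[tail];
    rb->tail = (tail + 1U) & TEST_RING_MASK;
    return true;
}
```

Note the capacity invariant the test should exercise: a ring of size N holds at most N-1 bytes, because head == tail must remain unambiguous as "empty".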
Real-Time Constraint Budgeting
Every professional embedded system has a real-time budget — a table mapping each IRQ to its deadline, worst-case execution time (WCET), and allowable latency. Budget violations cause missed deadlines, corrupted data, communication dropouts, and safety events. The process of constructing and verifying this budget is not optional in production firmware.
The budgeting workflow proceeds in four steps. First, list all interrupt sources, their expected firing rate (Hz), and their hard deadlines. Second, measure worst-case ISR execution time using DWT CYCCNT with all cache misses and bus contention present. Third, calculate CPU utilisation: U = sum(WCET_i * rate_i), with WCET in seconds. Fourth, apply Rate Monotonic Analysis (RMA): if total utilisation is below the Liu and Layland bound n(2^(1/n) - 1), which decreases toward ln 2 ≈ 0.693 as n grows, the fixed-priority system is provably schedulable.
/**
* Real-time budget enforcement: assert that ISR WCET stays within budget.
* Uses DWT CYCCNT for cycle-accurate measurement.
* In production, replace assert() with a fault handler or error log.
*/
#include "core_cm4.h"
#include <assert.h>
/* Budget in clock cycles (168 MHz, 5 µs deadline = 840 cycles) */
#define ADC_ISR_BUDGET_CYCLES 840U
static volatile uint32_t g_adc_wcet = 0U;
void ADC_IRQHandler(void) {
uint32_t t_enter = DWT->CYCCNT;
/* --- ISR work: read ADC, run filter, write to shared buffer --- */
/* ... (your actual ISR code here) ... */
uint32_t elapsed = DWT->CYCCNT - t_enter;
/* Track WCET */
if (elapsed > g_adc_wcet) {
g_adc_wcet = elapsed;
}
/* Hard budget enforcement — violation is a firmware bug */
assert(elapsed <= ADC_ISR_BUDGET_CYCLES);
ADC1->SR = ~ADC_SR_EOC;  /* write-0-to-clear: avoid RMW on rc_w0 bits */
}
Priority Assignment Rule: Always assign higher NVIC priority (lower numeric value) to IRQs with tighter deadlines, following Rate Monotonic priority assignment. An ADC IRQ with a 5 µs deadline must preempt a UART IRQ with a 100 µs deadline — configure NVIC accordingly and verify with DWT measurements under representative load.
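As a sketch of that rule on an STM32F4 (4 priority bits), the grouping can be configured so all bits act as preemption bits, then priorities assigned by deadline. The specific IRQ choices and levels below are illustrative, not a fixed recipe:

```c
#include "stm32f407xx.h"
#include "core_cm4.h"

/* Rate-monotonic priority assignment: tighter deadline -> lower
 * numeric priority, so it can always preempt. PRIGROUP = 3 makes
 * all four implemented bits preemption bits (no sub-priority). */
static void configure_irq_priorities(void) {
    NVIC_SetPriorityGrouping(3U);

    NVIC_SetPriority(ADC_IRQn,    1U);   /* 5 us deadline: tightest */
    NVIC_SetPriority(USART1_IRQn, 5U);   /* 100 us deadline: looser */

    NVIC_EnableIRQ(ADC_IRQn);
    NVIC_EnableIRQ(USART1_IRQn);
}
```

After configuration, verify with DWT measurements that the ADC ISR's observed worst-case latency is unaffected by sustained UART traffic.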
Exercises
Exercise 1 (Beginner): Measure Worst-Case IRQ Latency for Your MCU
Using the DWT CYCCNT technique from this article, measure the interrupt latency for a timer IRQ on your hardware. Run the measurement loop for at least 10,000 interrupt events. Record: (a) minimum latency in cycles, (b) maximum latency in cycles, (c) any latency spikes caused by other ISRs preempting the measurement path. Convert cycles to nanoseconds for your clock frequency. Compare against the architectural minimum from the table above.
Tags: DWT CYCCNT, IRQ Latency, Real-Time Measurement
Exercise 2 (Intermediate): Replace a Mutex with a Lock-Free Queue in a Producer-Consumer
Take an existing firmware module that uses an RTOS mutex to protect a shared buffer between an ISR producer and a task consumer. Replace the mutex with the SPSC ring buffer from this article. Measure: (a) the reduction in worst-case IRQ latency (the ISR no longer blocks on a mutex), (b) the reduction in CPU time spent in critical sections, (c) any new edge cases introduced. Document your findings in a brief technical note.
Tags: Lock-Free, SPSC Ring Buffer, ISR Latency Reduction
Exercise 3 (Advanced): Implement BASEPRI-Based Partial Interrupt Masking
Design a system with three interrupt priority groups: (a) priority 0–2: safety-critical ISRs that must never be masked (watchdog, fault handlers), (b) priority 3–5: real-time control ISRs masked during driver critical sections, (c) priority 6–15: background communication ISRs. Implement driver_enter_critical() / driver_exit_critical() using BASEPRI that masks group (c) and (b) but not (a). Write a test that verifies a priority-1 ISR fires correctly while a driver critical section is active.
Tags: BASEPRI, Priority Grouping, Critical Section Design
Conclusion & Next Steps
In this article we have covered the full concurrency toolkit for professional ARM Cortex-M firmware:
- DWT CYCCNT measurement gives you cycle-accurate latency data — measure under realistic load with all IRQs active, not just in isolation.
- PRIMASK save/restore is the universal critical section for all Cortex-M cores; always save and restore rather than blindly re-enabling.
- BASEPRI threshold masking is the professional choice on M3/M4/M7/M33 — it keeps safety-critical high-priority ISRs running while protecting lower-priority shared resources.
- LDREX/STREX atomics give you interrupt-transparent compare-and-swap for single 32-bit variables on ARMv7-M and ARMv8-M Main — the building block of all lock-free algorithms.
- SPSC ring buffers with DMB barriers are the correct pattern for ISR-to-task communication — no mutex, no critical section, minimal overhead.
- Build a real-time budget table for every project. Measure WCET, assign priorities by Rate Monotonic rules, and enforce budgets with assertions during development.
Next in the Series
In Part 12: Memory Management in Embedded Systems, we tackle the other major source of non-determinism in embedded firmware — dynamic memory allocation. We'll cover static allocation strategies, RTOS memory pools for O(1) deterministic allocation, stack watermark measurement, heap fragmentation analysis, and MPU-based stack guard pages that catch overflows before they corrupt your system state.
Related Articles in This Series
Part 4: CMSIS-RTOS2 — Threads, Mutexes & Semaphores
Master the CMSIS-RTOS2 synchronisation APIs — how mutexes, semaphores, and priority inheritance interact with the interrupt architecture covered here.
Part 2: CMSIS-Core — Registers, NVIC & SysTick
The foundation for everything in this article — NVIC priority grouping, preemption levels, and SysTick configuration that underpins real-time scheduling.
Part 14: DMA & High-Performance Data Handling
DMA transfers interact directly with the concurrency model covered here — understanding circular buffers, half-transfer interrupts, and cache coherence on M7.