Series Context: This is Part 11 of our 20-part CMSIS Mastery Series — the first article in the Bonus/Advanced section. Parts 1–10 covered CMSIS fundamentals; now we tackle the professional-grade topics that distinguish senior embedded engineers.
1. Overview & ARM Cortex-M Ecosystem (CMSIS layers, Cortex-M families, memory map, toolchains)
2. CMSIS-Core: Registers, NVIC & SysTick (core_cmX.h, register access, interrupt controller, SysTick timer)
3. Startup Code, Linker Scripts & Vector Table (Reset handler, BSS init, scatter files, boot process)
4. CMSIS-RTOS2: Threads, Mutexes & Semaphores (Thread management, synchronization primitives, scheduling)
5. CMSIS-RTOS2: Message Queues & Event Flags (Inter-thread comms, ISR-to-thread, real-time design patterns)
6. CMSIS-DSP: Filters, FFT & Math Functions (FIR/IIR filters, FFT, SIMD optimizations)
7. CMSIS-Driver: UART, SPI & I2C (Driver abstraction layer, callbacks, DMA integration)
8. CMSIS-Pack & Software Components (Pack files, device support, dependency management)
9. Debugging with CMSIS-DAP & CoreSight (SWD/JTAG, HardFault analysis, ITM tracing)
10. Portable Firmware: Multi-Vendor Projects (HAL vs CMSIS, cross-platform BSPs, reusable libraries)
11. Interrupts, Concurrency & Real-Time Constraints (Interrupt latency, critical sections, lock-free programming) [You Are Here]
12. Memory Management in Embedded Systems (Static vs dynamic, heap fragmentation, memory pools)
13. Low Power & Energy Optimization (Sleep modes, clock gating, tickless RTOS, power profiling)
14. DMA & High-Performance Data Handling (DMA basics, peripheral transfers, zero-copy techniques)
15. Security: ARMv8-M & TrustZone (Secure/non-secure worlds, secure boot, firmware protection)
16. Bootloaders & Firmware Updates (OTA updates, dual-bank flash, fail-safe strategies)
17. Testing & Validation (Unity/Ceedling unit tests, HIL testing, integration testing)
18. Performance Optimization (Compiler flags, inline assembly, cache (M7/M33), profiling)
19. Embedded Software Architecture (Layered design, event-driven, state machines, component-based)
20. Tooling & Workflow (Professional Level) (CI/CD for embedded, MISRA, static analysis, Doxygen)
Interrupt Latency Analysis
Interrupt latency is the elapsed time between a peripheral asserting an interrupt request line and the first instruction of the corresponding ISR executing on the CPU. On Cortex-M it consists of two components: hardware stacking latency (the processor saves eight registers — R0–R3, R12, LR, PC, xPSR — to the active stack before branching to the ISR) and pipeline flush latency (any in-flight instruction must complete or be cancelled). Understanding both is the foundation of real-time budgeting.
For most Cortex-M3/M4/M7 designs the minimum hardware latency is 12 clock cycles from IRQ assertion to ISR entry when no higher-priority interrupt is active, no instruction is stalling on a bus, and no FPU lazy stacking is required. In practice, cache misses on the M7, write-buffer draining, and ISR code placed in slow flash can push measured latency to several hundred cycles — an order of magnitude higher than the architectural minimum.
Key Insight: The Cortex-M hardware guarantees a minimum latency, not a maximum. Your job as a firmware engineer is to measure the worst-case path and prove it satisfies your system's real-time deadlines through margin analysis, not hope.
Measuring Interrupt Latency with the DWT Cycle Counter
The ARM Data Watchpoint and Trace (DWT) unit includes a 32-bit free-running cycle counter, DWT->CYCCNT, that increments every CPU clock cycle. The key to a clean measurement is controlling the exact moment the interrupt becomes pending: pend the IRQ from software with NVIC_SetPendingIRQ() while interrupts are disabled, timestamp CYCCNT immediately before re-enabling them, and timestamp again in the first statement of the ISR. The difference is the pend-to-entry latency in cycles, with no dependence on when a peripheral happens to fire. A GPIO toggle in the ISR provides an oscilloscope cross-check.
/**
 * Interrupt latency measurement using DWT CYCCNT and a software-pended IRQ.
 * Target: STM32F4 (Cortex-M4F, 168 MHz).
 *
 * Method:
 *   1. With interrupts disabled, pend TIM2_IRQn from software.
 *   2. Timestamp CYCCNT, then re-enable interrupts; the pending IRQ
 *      is taken immediately.
 *   3. The ISR timestamps CYCCNT in its first statement and toggles a
 *      test GPIO for an oscilloscope cross-check.
 */
#include "stm32f407xx.h"
#include "core_cm4.h"

/* DWT initialisation: TRCENA must be set before CYCCNT is writable */
static void DWT_Init(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable DWT/ITM */
    DWT->CYCCNT = 0U;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start cycle counter */
}

/* Shared measurement variables; volatile prevents optimisation */
static volatile uint32_t g_irq_entry_cycle = 0U;
static volatile uint32_t g_irq_arm_cycle   = 0U;
static volatile uint32_t g_latency_cycles  = 0U;

/* ISR for TIM2 (pended from software below) */
void TIM2_IRQHandler(void) {
    /* Capture cycle count in the first statement of the handler */
    g_irq_entry_cycle = DWT->CYCCNT;
    /* Toggle PA1: visible on an oscilloscope as a cross-check */
    GPIOA->ODR ^= GPIO_ODR_OD1;
    /* No peripheral flag to clear: the IRQ was pended via the NVIC.
     * If TIM2 itself fires, clear UIF with: TIM2->SR = ~TIM_SR_UIF; */
}

int main(void) {
    DWT_Init();

    /* Configure PA1 as output for the oscilloscope probe */
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    __DSB();
    GPIOA->MODER = (GPIOA->MODER & ~GPIO_MODER_MODER1_Msk)
                 | (0x01UL << GPIO_MODER_MODER1_Pos);

    /* Enable the TIM2 vector at priority 5. The timer peripheral itself
     * stays idle; pending the IRQ in software gives a cycle-exact arm point. */
    NVIC_SetPriority(TIM2_IRQn, 5U);
    NVIC_EnableIRQ(TIM2_IRQn);

    for (;;) {
        g_irq_entry_cycle = 0U;

        __disable_irq();
        NVIC_SetPendingIRQ(TIM2_IRQn);   /* IRQ is now pending */
        g_irq_arm_cycle = DWT->CYCCNT;   /* cycle-exact arm point */
        __enable_irq();                  /* pending IRQ taken right here */

        while (g_irq_entry_cycle == 0U) { __NOP(); }

        /* Latency: cycles from re-enable to first ISR instruction.
         * Track the maximum over many iterations; at 168 MHz the
         * 12-cycle architectural minimum corresponds to ~71 ns. */
        g_latency_cycles = g_irq_entry_cycle - g_irq_arm_cycle;
        (void)g_latency_cycles;          /* log or assert in production */
    }
}
Interrupt Latency by Cortex-M Variant
The following table captures the architectural minimum and a realistic worst-case scenario for each Cortex-M variant. The worst-case figures assume code executing from internal flash and include FPU lazy-stacking overhead where an FPU is present.

| Core | Min Cycles (IRQ to ISR) | Typical Worst Case | FPU Lazy Stack Overhead | Notes |
|------|-------------------------|--------------------|-------------------------|-------|
| Cortex-M0 | 16 | 16–30 | N/A | 3-stage pipeline; deterministic from zero-wait-state memory |
| Cortex-M0+ | 15 | 15–30 | N/A | 2-stage pipeline, very predictable |
| Cortex-M3 | 12 | 12–40 | N/A | Tail-chaining reduces back-to-back ISR overhead to 6 cycles |
| Cortex-M4F | 12 | 12–50 | +10–27 (first FPU use) | FPCCR.LSPEN=1 defers the FP context save lazily |
| Cortex-M7 | 12 | 12–200+ | +10–27 | 6-stage dual-issue (in-order); I/D-cache miss can add 100+ cycles from flash |
| Cortex-M33 | 12 | 12–60 | +10–27 (optional FPU) | TrustZone transition adds ~3–5 cycles at the NS-to-S boundary |
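The lazy-stacking jitter in the table can be traded for a constant cost: with LSPEN cleared and ASPEN still set, the FP context is stacked eagerly on every exception entry, so latency no longer depends on whether the ISR happens to touch the FPU. A minimal sketch using the standard CMSIS FPU register definitions; the function name is illustrative, and this only makes sense when the fixed extra stacking cost fits your budget:

```c
#include "core_cm4.h"   /* FPU->FPCCR and bit definitions */

/* Disable lazy stacking (keep automatic FP state preservation).
 * Result: every exception entry pays the FP stacking cost, but the
 * "+10-27 cycles on first FPU use" jitter disappears. */
static void fpu_disable_lazy_stacking(void) {
    FPU->FPCCR = (FPU->FPCCR & ~FPU_FPCCR_LSPEN_Msk)
               | FPU_FPCCR_ASPEN_Msk;
    __DSB();
    __ISB();
}
```

Call this once during startup, before interrupts are enabled, and re-measure worst-case latency afterwards to confirm the trade-off on your silicon.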
Critical Sections
A critical section is a code region that must execute atomically with respect to interrupts — no ISR may interleave. On Cortex-M there are two primary mechanisms: PRIMASK, which disables all maskable exceptions globally, and BASEPRI, which masks all exceptions at or below a numeric priority threshold while leaving higher-priority ISRs active. Choosing between them is a real-time engineering decision, not a convenience one.
PRIMASK Save & Restore Pattern
The naive approach — __disable_irq() / __enable_irq() — works only if critical sections never nest. In production firmware, always use save/restore to handle re-entrant callers safely. CMSIS provides __get_PRIMASK() and __set_PRIMASK() for this pattern.
/**
* PRIMASK-based critical section: save/restore pattern.
* Safe for nested calls — preserves the caller's IRQ state.
*/
#include <stdint.h>
#include "cmsis_compiler.h" /* __get_PRIMASK, __set_PRIMASK, __disable_irq */
/* Enter critical section — returns previous PRIMASK value */
static inline uint32_t critical_section_enter(void) {
uint32_t primask = __get_PRIMASK();
__disable_irq(); /* sets PRIMASK = 1, blocks all IRQs */
__DSB(); /* ensure write-buffer is drained */
__ISB(); /* flush pipeline so masking takes effect */
return primask;
}
/* Exit critical section — restores caller's IRQ state */
static inline void critical_section_exit(uint32_t primask) {
__set_PRIMASK(primask);
__ISB();
}
/* ----- Usage example: protect a shared counter --------- */
static volatile uint32_t g_shared_counter = 0U;
void increment_shared_counter(void) {
uint32_t saved = critical_section_enter();
g_shared_counter++; /* read-modify-write is now atomic */
critical_section_exit(saved);
}
/* Even safe when called from an ISR that itself disabled IRQs */
void TIM3_IRQHandler(void) {
uint32_t saved = critical_section_enter();
g_shared_counter += 10U;
critical_section_exit(saved); /* restores PRIMASK=1, not 0 */
TIM3->SR = ~TIM_SR_UIF;  /* write-0-to-clear: avoid RMW on rc_w0 bits */
}
BASEPRI Threshold Masking
BASEPRI is only available on Cortex-M3/M4/M7/M33 (ARMv7-M and ARMv8-M Main). It masks all exceptions whose numeric priority is greater than or equal to the BASEPRI value (lower number = higher priority). Setting BASEPRI = 0x50 (priority 5 on an 8-bit field) leaves priorities 0–4 unmasked — those ISRs can still preempt your critical section. This is the basis of FreeRTOS's taskENTER_CRITICAL().
/**
* BASEPRI-based partial masking: block ISRs at priority >= threshold
* while leaving high-priority ISRs (e.g. safety watchdog) runnable.
*
* configMAX_SYSCALL_INTERRUPT_PRIORITY in FreeRTOS maps directly
* to this register.
*/
#include "core_cm4.h"
/* Priority threshold — mask ISRs with numeric priority >= this value.
* NOTE: On 4-bit priority implementations (most STM32), shift left 4. */
#define CRITICAL_BASEPRI_VALUE (5U << (8U - __NVIC_PRIO_BITS))
static inline uint32_t basepri_enter(void) {
uint32_t old = __get_BASEPRI();
__set_BASEPRI_MAX(CRITICAL_BASEPRI_VALUE);
__DSB();
__ISB();
return old;
}
static inline void basepri_exit(uint32_t old_basepri) {
__set_BASEPRI(old_basepri);
__ISB();
}
Critical Section Technique Comparison

| Technique | IRQ Masking Level | Nesting Safe | Overhead (cycles) | Cores | Notes |
|-----------|-------------------|--------------|-------------------|-------|-------|
| PRIMASK (global disable) | All maskable IRQs | Yes (save/restore) | 3–5 | All Cortex-M | Does not mask NMI or HardFault; FAULTMASK additionally raises to HardFault level |
| BASEPRI threshold | Priority ≥ threshold | Yes (save/restore) | 3–5 | M3/M4/M7/M33 | Allows high-priority ISRs; FreeRTOS default |
| RTOS taskENTER_CRITICAL | Priority ≥ configMAX_SYSCALL | Yes (nesting counter) | 5–10 | M3/M4/M7/M33 | Uses BASEPRI internally; kernel-aware |
| RTOS vTaskSuspendAll | Scheduler only (IRQs run) | Yes | 10–20 | All | Suspends context switches; IRQs still fire |
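Firmware that must build for both ARMv6-M and ARMv7-M/ARMv8-M cores commonly wraps the choice between the two mechanisms behind one pair of functions. A sketch using the ACLE architecture macros; the names irq_lock/irq_unlock and the DRIVER_MASK_PRIO threshold are illustrative assumptions, not CMSIS names:

```c
#include <stdint.h>
#include "cmsis_compiler.h"

/* Portable critical section: BASEPRI where the architecture has it
 * (ARMv7-M / ARMv8-M Mainline), PRIMASK fallback on ARMv6-M. */
#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__) || \
    defined(__ARM_ARCH_8M_MAIN__)
  /* Mask ISRs at priority >= 5; pick this from your budget table */
  #define DRIVER_MASK_PRIO  (5U << (8U - __NVIC_PRIO_BITS))

  static inline uint32_t irq_lock(void) {
      uint32_t saved = __get_BASEPRI();
      __set_BASEPRI_MAX(DRIVER_MASK_PRIO);  /* never lowers the mask */
      __ISB();
      return saved;
  }
  static inline void irq_unlock(uint32_t saved) {
      __set_BASEPRI(saved);
      __ISB();
  }
#else  /* ARMv6-M: no BASEPRI, fall back to global masking */
  static inline uint32_t irq_lock(void) {
      uint32_t saved = __get_PRIMASK();
      __disable_irq();
      return saved;
  }
  static inline void irq_unlock(uint32_t saved) {
      __set_PRIMASK(saved);
  }
#endif
```

Note that the fallback silently widens the masking scope from "priority ≥ threshold" to "everything maskable", so any latency guarantees made for high-priority ISRs on M4/M7 builds must be re-verified on M0 builds.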
Atomic Operations
For single 32-bit variables, full critical sections are overkill. ARMv7-M and ARMv8-M provide Load-Exclusive / Store-Exclusive (LDREX/STREX) instructions that implement hardware-level compare-and-swap without disabling interrupts at all. If an interrupt fires between LDREX and STREX, STREX detects the hazard and returns 1 (failure) — the caller retries. This gives you interrupt-transparent atomics with zero latency impact on ISRs.
LDREX/STREX Compare-and-Swap on Cortex-M
/**
 * Atomic compare-and-swap using LDREX/STREX (ARMv7-M, ARMv8-M Main).
 * Returns 1 if the swap succeeded, 0 if *ptr did not hold the expected
 * value. A failed STREX (e.g. an interrupt between LDREX and STREX)
 * is retried internally and is invisible to the caller.
 *
 * CMSIS intrinsics: __LDREXW / __STREXW / __CLREX
 */
#include <stdint.h>
#include "cmsis_compiler.h"
/**
* @brief Atomic CAS: if *ptr == expected, write desired and return 1.
*/
static inline int atomic_cas32(volatile uint32_t *ptr,
uint32_t expected,
uint32_t desired) {
uint32_t current;
do {
current = __LDREXW(ptr); /* Load-exclusive */
if (current != expected) {
__CLREX(); /* Clear exclusive monitor */
return 0; /* Mismatch — no swap */
}
} while (__STREXW(desired, ptr)); /* Retry if interrupted */
__DMB(); /* Data memory barrier: ensure visibility before return */
return 1;
}
/**
* @brief Atomic fetch-and-add: atomically adds delta to *ptr,
* returns the original value.
*/
static inline uint32_t atomic_fetch_add(volatile uint32_t *ptr,
uint32_t delta) {
uint32_t old, tmp;
do {
old = __LDREXW(ptr);
tmp = old + delta;
} while (__STREXW(tmp, ptr));
__DMB();
return old;
}
/* ---- Usage: lock-free reference counter ---- */
static volatile uint32_t g_ref_count = 0U;
void object_acquire(void) {
atomic_fetch_add(&g_ref_count, 1U);
}
void object_release(void) {
uint32_t prev = atomic_fetch_add(&g_ref_count, (uint32_t)-1);
if (prev == 1U) {
/* Last reference released — trigger cleanup */
}
}
Cortex-M0/M0+ Warning: LDREX/STREX are not available on ARMv6-M (M0, M0+). For those cores you must use PRIMASK-based critical sections. With arm-none-eabi-gcc and -march=armv6-m, the C11 _Atomic keyword typically lowers to atomic helper library calls (libatomic) that mask interrupts internally rather than inline PRIMASK sequences, so inspect the generated code and link the helpers before relying on it.
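Where the toolchain supports C11, <stdatomic.h> offers a portable alternative to the hand-rolled helpers above; on ARMv7-M the compiler lowers these operations to the same LDREX/STREX retry loops. A minimal, host-testable sketch (cas32_c11 and fetch_add_c11 are illustrative names, not from the code above):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable C11 equivalents of atomic_cas32 / atomic_fetch_add.
 * On ARMv7-M, GCC/Clang compile these to LDREX/STREX retry loops. */

/* CAS: if *obj == expected, store desired and return true */
static inline bool cas32_c11(_Atomic uint32_t *obj,
                             uint32_t expected,
                             uint32_t desired) {
    /* atomic_compare_exchange_strong writes the observed value back
     * into 'expected' on failure; we discard it to keep the simple
     * succeeded/failed contract of atomic_cas32 */
    return atomic_compare_exchange_strong(obj, &expected, desired);
}

/* Fetch-and-add: returns the value held before the addition */
static inline uint32_t fetch_add_c11(_Atomic uint32_t *obj,
                                     uint32_t delta) {
    return atomic_fetch_add(obj, delta);
}

static _Atomic uint32_t g_ref_count_c11 = 0;
```

The default memory order is sequentially consistent, which on a single-core Cortex-M is stronger than strictly necessary but never wrong.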
Lock-Free Data Structures
The most common pattern in embedded concurrent programming is the Single-Producer Single-Consumer (SPSC) ring buffer: one writer (often main loop or a DMA callback) and one reader (often a communication task or ISR). With only one producer and one consumer, no atomic operations or critical sections are needed — only memory barriers to prevent out-of-order memory access from breaking the invariant.
SPSC Ring Buffer with Memory Barriers
/**
* Lock-free SPSC ring buffer for embedded use.
* Producer (writer) and consumer (reader) each own one index pointer.
* Only __DMB() barriers are needed — no disabling of interrupts.
*
* Safe when producer runs in main context and consumer in ISR (or vice versa).
* For MPMC (multiple producers/consumers), use atomic CAS instead.
*/
#include "cmsis_compiler.h"
#include <stdint.h>
#include <stdbool.h>
#define RING_BUF_SIZE 256U /* Must be power of 2 */
#define RING_BUF_MASK (RING_BUF_SIZE - 1U)
typedef struct {
uint8_t buf[RING_BUF_SIZE];
volatile uint32_t head; /* Written by producer only */
volatile uint32_t tail; /* Written by consumer only */
} RingBuf_t;
/**
* @brief Write one byte to the ring buffer (producer side).
* @return true on success, false if buffer is full.
*/
bool ring_buf_write(RingBuf_t *rb, uint8_t byte) {
uint32_t head = rb->head;
uint32_t next = (head + 1U) & RING_BUF_MASK;
/* Check full: next == tail means buffer is full */
if (next == rb->tail) {
return false;
}
rb->buf[head] = byte;
/* DMB: ensure the data write is visible before updating head */
__DMB();
rb->head = next;
return true;
}
/**
* @brief Read one byte from the ring buffer (consumer side).
* @return true on success, false if buffer is empty.
*/
bool ring_buf_read(RingBuf_t *rb, uint8_t *out) {
uint32_t tail = rb->tail;
/* Empty check */
if (tail == rb->head) {
return false;
}
/* DMB: ensure head is read before accessing buf[tail] */
__DMB();
*out = rb->buf[tail];
rb->tail = (tail + 1U) & RING_BUF_MASK;
return true;
}
/* ---- UART receive ISR -> main loop example ---- */
static RingBuf_t g_uart_rx;
/* Called from UART ISR */
void USART1_IRQHandler(void) {
if (USART1->SR & USART_SR_RXNE) {
uint8_t byte = (uint8_t)(USART1->DR & 0xFFU);
ring_buf_write(&g_uart_rx, byte); /* no IRQ disable needed */
}
}
/* Called from main loop */
void process_uart_bytes(void) {
uint8_t byte;
while (ring_buf_read(&g_uart_rx, &byte)) {
/* Process byte */
}
}
Why DMB, not DSB? __DMB() (Data Memory Barrier) ensures all preceding memory accesses complete before subsequent ones — sufficient for ordering between CPU and peripheral/ISR observers. __DSB() additionally waits for write buffers to drain to the bus and is needed before __ISB() or before entering sleep. Use DMB for ordering in lock-free structures; use DSB when configuring hardware registers.
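Because the producer and consumer functions are plain C, their FIFO ordering and full/empty invariants can be checked off-target before deploying to hardware. The sketch below mirrors the buffer above with a host-build stand-in for __DMB(); TestRing_t, HOST_DMB, and the test functions are test-only names, not part of CMSIS:

```c
#include <stdint.h>
#include <stdbool.h>

/* Host stand-in for the CMSIS barrier: a compiler barrier is enough
 * for a single-threaded functional test; on target this is __DMB(). */
#define HOST_DMB() __asm__ volatile("" ::: "memory")

#define TEST_RING_SIZE 8U                 /* power of 2, as on target */
#define TEST_RING_MASK (TEST_RING_SIZE - 1U)

typedef struct {
    uint8_t buf[TEST_RING_SIZE];
    volatile uint32_t head;               /* producer-owned */
    volatile uint32_t tail;               /* consumer-owned */
} TestRing_t;

static bool test_ring_write(TestRing_t *rb, uint8_t byte) {
    uint32_t head = rb->head;
    uint32_t next = (head + 1U) & TEST_RING_MASK;
    if (next == rb->tail) { return false; }   /* full */
    rb->buf[head] = byte;
    HOST_DMB();                               /* data before index */
    rb->head = next;
    return true;
}

static bool test_ring_read(TestRing_t *rb, uint8_t *out) {
    uint32_t tail = rb->tail;
    if (tail == rb->head) { return false; }   /* empty */
    HOST_DMB();                               /* index before data */
    *out = rb->buf[tail];
    rb->tail = (tail + 1U) & TEST_RING_MASK;
    return true;
}
```

Note the capacity invariant the test should exercise: a ring of size N holds at most N-1 bytes, because head == tail must remain unambiguous as "empty".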
Real-Time Constraint Budgeting
Every professional embedded system has a real-time budget — a table mapping each IRQ to its deadline, worst-case execution time (WCET), and allowable latency. Budget violations cause missed deadlines, corrupted data, communication dropouts, and safety events. The process of constructing and verifying this budget is not optional in production firmware.
The budgeting workflow proceeds in four steps. First, list all interrupt sources, their expected firing rate (Hz), and their hard deadlines. Second, measure worst-case ISR execution time using DWT CYCCNT with all cache misses and bus contention present. Third, calculate CPU utilisation: U = sum(WCET_i * rate_i), with WCET in seconds. Fourth, apply Rate Monotonic Analysis (RMA): if total utilisation is below the Liu and Layland bound n(2^(1/n) - 1), which decreases toward ln 2 ≈ 0.693 as n grows, the fixed-priority system is provably schedulable.
/**
* Real-time budget enforcement: assert that ISR WCET stays within budget.
* Uses DWT CYCCNT for cycle-accurate measurement.
* In production, replace assert() with a fault handler or error log.
*/
#include "core_cm4.h"
#include <assert.h>
/* Budget in clock cycles (168 MHz, 5 µs deadline = 840 cycles) */
#define ADC_ISR_BUDGET_CYCLES 840U
static volatile uint32_t g_adc_wcet = 0U;
void ADC_IRQHandler(void) {
uint32_t t_enter = DWT->CYCCNT;
/* --- ISR work: read ADC, run filter, write to shared buffer --- */
/* ... (your actual ISR code here) ... */
uint32_t elapsed = DWT->CYCCNT - t_enter;
/* Track WCET */
if (elapsed > g_adc_wcet) {
g_adc_wcet = elapsed;
}
/* Hard budget enforcement — violation is a firmware bug */
assert(elapsed <= ADC_ISR_BUDGET_CYCLES);
ADC1->SR = ~ADC_SR_EOC;  /* write-0-to-clear: avoid RMW on rc_w0 bits */
}
Priority Assignment Rule: Always assign higher NVIC priority (lower numeric value) to IRQs with tighter deadlines, following Rate Monotonic priority assignment. An ADC IRQ with a 5 µs deadline must preempt a UART IRQ with a 100 µs deadline — configure NVIC accordingly and verify with DWT measurements under representative load.
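As a sketch of that rule on an STM32F4 (4 priority bits), the grouping can be configured so all bits act as preemption bits, then priorities assigned by deadline. The specific IRQ choices and levels below are illustrative, not a fixed recipe:

```c
#include "stm32f407xx.h"
#include "core_cm4.h"

/* Rate-monotonic priority assignment: tighter deadline -> lower
 * numeric priority, so it can always preempt. PRIGROUP = 3 makes
 * all four implemented bits preemption bits (no sub-priority). */
static void configure_irq_priorities(void) {
    NVIC_SetPriorityGrouping(3U);

    NVIC_SetPriority(ADC_IRQn,    1U);   /* 5 us deadline: tightest */
    NVIC_SetPriority(USART1_IRQn, 5U);   /* 100 us deadline: looser */

    NVIC_EnableIRQ(ADC_IRQn);
    NVIC_EnableIRQ(USART1_IRQn);
}
```

After configuration, verify with DWT measurements that the ADC ISR's observed worst-case latency is unaffected by sustained UART traffic.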
Exercises
Exercise 1 (Beginner): Measure Worst-Case IRQ Latency for Your MCU
Using the DWT CYCCNT technique from this article, measure the interrupt latency for a timer IRQ on your hardware. Run the measurement loop for at least 10,000 interrupt events. Record: (a) minimum latency in cycles, (b) maximum latency in cycles, (c) any latency spikes caused by other ISRs preempting the measurement path. Convert cycles to nanoseconds for your clock frequency. Compare against the architectural minimum from the table above.
Tags: DWT CYCCNT, IRQ Latency, Real-Time Measurement
Exercise 2 (Intermediate): Replace a Mutex with a Lock-Free Queue in a Producer-Consumer
Take an existing firmware module that uses an RTOS mutex to protect a shared buffer between an ISR producer and a task consumer. Replace the mutex with the SPSC ring buffer from this article. Measure: (a) the reduction in worst-case IRQ latency (the ISR no longer blocks on a mutex), (b) the reduction in CPU time spent in critical sections, (c) any new edge cases introduced. Document your findings in a brief technical note.
Tags: Lock-Free, SPSC Ring Buffer, ISR Latency Reduction
Exercise 3 (Advanced): Implement BASEPRI-Based Partial Interrupt Masking
Design a system with three interrupt priority groups: (a) priority 0–2: safety-critical ISRs that must never be masked (watchdog, fault handlers), (b) priority 3–5: real-time control ISRs masked during driver critical sections, (c) priority 6–15: background communication ISRs. Implement driver_enter_critical() / driver_exit_critical() using BASEPRI that masks group (c) and (b) but not (a). Write a test that verifies a priority-1 ISR fires correctly while a driver critical section is active.
Tags: BASEPRI, Priority Grouping, Critical Section Design
Conclusion & Next Steps
In this article we have covered the full concurrency toolkit for professional ARM Cortex-M firmware:
- DWT CYCCNT measurement gives you cycle-accurate latency data — measure under realistic load with all IRQs active, not just in isolation.
- PRIMASK save/restore is the universal critical section for all Cortex-M cores; always save and restore rather than blindly re-enabling.
- BASEPRI threshold masking is the professional choice on M3/M4/M7/M33 — it keeps safety-critical high-priority ISRs running while protecting lower-priority shared resources.
- LDREX/STREX atomics give you interrupt-transparent compare-and-swap for single 32-bit variables on ARMv7-M and ARMv8-M Main — the building block of all lock-free algorithms.
- SPSC ring buffers with DMB barriers are the correct pattern for ISR-to-task communication — no mutex, no critical section, minimal overhead.
- Build a real-time budget table for every project. Measure WCET, assign priorities by Rate Monotonic rules, and enforce budgets with assertions during development.
Next in the Series
In Part 12: Memory Management in Embedded Systems, we tackle the other major source of non-determinism in embedded firmware — dynamic memory allocation. We'll cover static allocation strategies, RTOS memory pools for O(1) deterministic allocation, stack watermark measurement, heap fragmentation analysis, and MPU-based stack guard pages that catch overflows before they corrupt your system state.
Related Articles in This Series
Part 4: CMSIS-RTOS2 — Threads, Mutexes & Semaphores
Master the CMSIS-RTOS2 synchronisation APIs — how mutexes, semaphores, and priority inheritance interact with the interrupt architecture covered here.
Part 2: CMSIS-Core — Registers, NVIC & SysTick
The foundation for everything in this article — NVIC priority grouping, preemption levels, and SysTick configuration that underpins real-time scheduling.
Part 14: DMA & High-Performance Data Handling
DMA transfers interact directly with the concurrency model covered here — understanding circular buffers, half-transfer interrupts, and cache coherence on M7.