
CMSIS Part 18: Performance Optimization

March 31, 2026 Wasil Zafar 30 min read

From -O2 flags and LTO through inline assembly and TCM placement to cycle-accurate profiling — the complete toolkit for squeezing maximum performance from ARM Cortex-M.

Table of Contents

  1. Compiler Optimisation Flags
  2. Link-Time Optimisation
  3. Inline Assembly & CMSIS Intrinsics
  4. Cache & TCM on M7/M33
  5. DWT Cycle Counter Profiling
  6. SIMD Intrinsics for Manual Vectorisation
  7. Exercises
  8. Optimisation Tracker
  9. Conclusion & Next Steps
Series Context: This is Part 18 of the 20-part CMSIS Mastery Series. Performance optimisation sits after testing (Part 17) intentionally — you must have a correct, tested baseline before optimising. Measuring cycles on untested code is wasted effort.

CMSIS Mastery Series

Your 20-step learning path • Currently on Step 18
 1. Overview & ARM Cortex-M Ecosystem — CMSIS layers, Cortex-M families, memory map, toolchains
 2. CMSIS-Core: Registers, NVIC & SysTick — core_cmX.h, register access, interrupt controller, SysTick timer
 3. Startup Code, Linker Scripts & Vector Table — Reset handler, BSS init, scatter files, boot process
 4. CMSIS-RTOS2: Threads, Mutexes & Semaphores — Thread management, synchronization primitives, scheduling
 5. CMSIS-RTOS2: Message Queues & Event Flags — Inter-thread comms, ISR-to-thread, real-time design patterns
 6. CMSIS-DSP: Filters, FFT & Math Functions — FIR/IIR filters, FFT, SIMD optimizations
 7. CMSIS-Driver: UART, SPI & I2C — Driver abstraction layer, callbacks, DMA integration
 8. CMSIS-Pack & Software Components — Pack files, device support, dependency management
 9. Debugging with CMSIS-DAP & CoreSight — SWD/JTAG, HardFault analysis, ITM tracing
10. Portable Firmware: Multi-Vendor Projects — HAL vs CMSIS, cross-platform BSPs, reusable libraries
11. Interrupts, Concurrency & Real-Time Constraints — Interrupt latency, critical sections, lock-free programming
12. Memory Management in Embedded Systems — Static vs dynamic, heap fragmentation, memory pools
13. Low Power & Energy Optimization — Sleep modes, clock gating, tickless RTOS, power profiling
14. DMA & High-Performance Data Handling — DMA basics, peripheral transfers, zero-copy techniques
15. Security: ARMv8-M & TrustZone — Secure/non-secure worlds, secure boot, firmware protection
16. Bootloaders & Firmware Updates — OTA updates, dual-bank flash, fail-safe strategies
17. Testing & Validation — Unity/Ceedling unit tests, HIL testing, integration testing
18. Performance Optimization — Compiler flags, inline assembly, cache (M7/M33), profiling ← You Are Here
19. Embedded Software Architecture — Layered design, event-driven, state machines, component-based
20. Tooling & Workflow (Professional Level) — CI/CD for embedded, MISRA, static analysis, Doxygen

Compiler Optimisation Flags

The compiler is your most powerful performance tool. The right flags can halve code size and double execution speed with zero source changes. The wrong flags silently introduce bugs through aggressive assumptions. Understanding exactly what each flag does — and its trade-offs — is essential before optimising embedded firmware.

GCC's optimisation levels are shorthand for groups of individual transformations. The table below gives a practical reference for embedded decision-making.

Flag  | Code Size Delta             | Speed Delta        | Debug Impact                        | When to Use
-O0   | Baseline (largest)          | Baseline (slowest) | Full debuggability                  | Development, unit test builds, debugging HardFaults
-O1   | -20% typical                | +30% typical       | Minor variable optimisation         | Quick improvement without aggressive transforms
-O2   | -30% typical                | +50–80%            | Some inlining obscures call stack   | Release firmware default — good balance
-O3   | Often +10% (loop unrolling) | +60–100%           | Heavy inlining — difficult to debug | DSP hotspots only — can increase code size
-Os   | -35–40%                     | +20–40%            | Moderate                            | Flash-constrained MCUs (M0, M0+)
-Oz   | -40–45%                     | Minimal            | Heavy                               | Absolute minimum flash usage (bootloaders)
-flto | -15–25% additional          | +10–20% additional | Cross-TU inlining obscures frames   | Release builds — combine with -O2 or -Os

Beyond the level flags, several individual flags are essential for embedded:

# Recommended CMakeLists.txt compiler flags for Cortex-M4F release build
target_compile_options(firmware.elf PRIVATE

    # CPU / FPU architecture
    -mcpu=cortex-m4
    -mthumb
    -mfpu=fpv4-sp-d16
    -mfloat-abi=hard          # Use hardware FPU registers (not soft emulation)

    # Optimisation level
    -O2                       # Release: good balance speed vs size vs debuggability

    # Link-time optimisation (combine with linker -flto)
    -flto

    # Dead code elimination (requires --gc-sections at link time)
    -ffunction-sections       # Place each function in its own ELF section
    -fdata-sections           # Place each variable in its own ELF section

    # Warning flags (catch real bugs)
    -Wall
    -Wextra
    -Wshadow
    -Wdouble-promotion        # Warn when float silently promoted to double
    -Wundef

    # Strict aliasing — enabled by default at -O2/-O3; the warning flags likely violations
    -fstrict-aliasing
    -Wstrict-aliasing=3

    # Do NOT use -ffast-math in embedded — breaks IEEE 754 NaN/Inf handling
)

target_link_options(firmware.elf PRIVATE
    -flto                     # Must match compile flag
    -Wl,--gc-sections         # Remove unused sections (pairs with -ffunction/data-sections)
    -Wl,-Map=firmware.map     # Generate map file for size analysis
    -Wl,--print-memory-usage  # Print flash/RAM usage at link time
)
Avoid -ffast-math: This flag breaks IEEE 754 compliance — it assumes no NaN, no Inf, and allows reordering of floating-point operations. In embedded DSP and sensor fusion code, this can introduce subtle numerical errors that are nearly impossible to diagnose. Use -fno-math-errno as a safer alternative if you need partial math optimisation.

Link-Time Optimisation

Link-Time Optimisation (LTO) extends the compiler's visibility from a single translation unit to the entire program. Without LTO, GCC optimises each .c file independently — cross-module inlining and dead code elimination are impossible. With LTO enabled (-flto), the linker invokes the compiler again over the combined intermediate representation, enabling:

  • Cross-module inlining — functions in different .c files are inlined if the optimiser deems it beneficial
  • Whole-program dead code elimination — functions called from only one site and never exported are eliminated
  • Constant propagation across modules — a constant defined in one file propagates into callers in another

In practice, LTO with -O2 on a typical embedded firmware project reduces flash usage by an additional 10–25% and improves throughput by 10–20% — significant gains for zero source code changes.

LTO and Interrupt Handlers: LTO may eliminate functions that appear unreachable from main() — including interrupt handlers referenced only from the vector table. Always declare ISR functions with __attribute__((used)) or export them in the linker script to prevent elimination.
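The `__attribute__((used))` protection can be sketched as follows. The handler name here follows the STM32 vector-table convention (USART1_IRQHandler) purely as an illustration; the flag variable is hypothetical — substitute your own device's handler names:

```c
#include <stdint.h>

/* Hypothetical event counter the handler updates — illustration only. */
static volatile uint32_t g_uart_events;

/* __attribute__((used)) tells the compiler the symbol is referenced
 * outside its visibility (here: by the vector table), so LTO and
 * --gc-sections must not discard it. "noinline" preserves a standalone
 * symbol for the vector table to point at. */
__attribute__((used, noinline))
void USART1_IRQHandler(void) {
    g_uart_events++;   /* real code would read/clear peripheral status here */
}
```

An equivalent linker-script approach is to wrap the vector-table section in KEEP(), e.g. KEEP(*(.isr_vector)), so the table itself — and everything it references — survives garbage collection.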

Inline Assembly & CMSIS Intrinsics

Inline assembly is the last resort of the embedded optimiser — used when the compiler fails to generate the single instruction you need. The ARM Cortex-M ISA includes instructions with no C equivalent: RBIT (reverse bits), CLZ (count leading zeros), USAT/SSAT (saturating arithmetic), SEV/WFE (event synchronisation).

CMSIS-Core provides intrinsic functions in cmsis_gcc.h (GCC) and cmsis_armclang.h (ARMClang) that wrap these instructions without requiring raw inline assembly. Always prefer CMSIS intrinsics over hand-written asm — they are portable across compilers and explicitly documented.

/**
 * Inline assembly examples: RBIT instruction via CMSIS intrinsic
 * and raw __asm volatile for a custom bitfield swap.
 *
 * Include: cmsis_gcc.h (via core_cm4.h)
 */
#include "core_cm4.h"

/* ── Example 1: CMSIS intrinsic __RBIT ── */
uint32_t reverse_bits_cmsis(uint32_t value) {
    /* __RBIT compiles to a single RBIT instruction on Cortex-M3/M4/M7 */
    return __RBIT(value);
    /* Assembly output: rbit r0, r0 — 1 cycle */
}

/* ── Example 2: Raw __asm volatile with constraints ── */
/* Swap the upper and lower 16-bit halves of a 32-bit word using REV16 */
uint32_t swap_halfwords(uint32_t value) {
    uint32_t result;
    __asm volatile (
        "rev16 %[out], %[in]"          /* REV16: reverses bytes within each halfword */
        : [out] "=r" (result)          /* output operand: any register */
        : [in]  "r"  (value)           /* input operand: any register */
        :                              /* no clobbers */
    );
    return result;
}

/* ── Example 3: Saturating add using CMSIS intrinsic __QADD ── */
int32_t saturating_accumulate(int32_t acc, int32_t sample) {
    /* __QADD maps to QADD instruction — saturates at INT32_MAX/INT32_MIN */
    return __QADD(acc, sample);
}

/* ── Example 4: Count leading zeros — used for fast log2 / priority encode ── */
uint32_t fast_log2_floor(uint32_t value) {
    if (value == 0U) { return 0U; }
    /* __CLZ maps to CLZ instruction — 1 cycle on M3/M4/M7 */
    return 31U - __CLZ(value);
}

/* ── Example 5: Memory barrier intrinsics (critical for DMA / volatile access) ── */
void flush_write_buffer(void) {
    __DSB();   /* Data Synchronisation Barrier — wait for all memory transactions */
    __ISB();   /* Instruction Synchronisation Barrier — flush pipeline */
}

Cache & TCM on M7/M33

The Cortex-M7 is the first Cortex-M core with a Harvard L1 cache — separate instruction cache (ICache) and data cache (DCache), typically 16 KB or 32 KB each. M33-class devices may optionally include an ICache but no DCache. Enabling these caches is not automatic — you must explicitly enable them in startup code before running performance-critical code.

Beyond caches, the M7 provides Tightly Coupled Memories (TCM): ITCM for instruction storage and DTCM for data. TCM is accessed via a dedicated interface — no cache misses, no bus arbitration latency, deterministic 0-wait-state access at full CPU frequency. For ISRs and DSP kernels, TCM placement is the single most effective performance technique on the M7.

Core          | ICache              | DCache              | ITCM        | DTCM
Cortex-M0/M0+ | No                  | No                  | No          | No
Cortex-M3     | No                  | No                  | No          | No
Cortex-M4/M4F | No                  | No                  | No          | No
Cortex-M7     | 16–32 KB (optional) | 16–32 KB (optional) | Up to 16 MB | Up to 16 MB
Cortex-M23    | No                  | No                  | No          | No
Cortex-M33    | Optional            | No                  | No          | No
Cortex-M55    | Optional            | Optional            | Optional    | Optional
/**
 * Place a time-critical ISR in ITCM on STM32H7 (Cortex-M7).
 *
 * ITCM on STM32H743 starts at 0x00000000 — it is mapped to
 * the instruction fetch port directly. Zero wait states at 480 MHz.
 *
 * Linker script adds:
 *   .itcm_text : AT(_sitcm_flash)
 *   {
 *       _sitcm = .;
 *       *(.itcm_text*)
 *       _eitcm = .;
 *   } > ITCM
 *
 * Startup code copies .itcm_text from flash to ITCM at boot.
 */

/* GCC attribute places function in .itcm_text section */
__attribute__((section(".itcm_text"), noinline))
void TIM1_UP_IRQHandler(void) {
    /* This ISR runs from ITCM — no flash wait states, deterministic latency */

    /* Clear update interrupt flag with a plain write — a read-modify-write
     * (SR &= ~UIF) can race and discard a flag that sets between the read
     * and the write-back */
    TIM1->SR = ~TIM_SR_UIF;

    /* Execute time-critical control loop */
    motor_control_update();
}

/* Placement of DSP buffer in DTCM for zero-latency data access */
__attribute__((section(".dtcm_data")))
static float32_t g_fir_state[FIR_BLOCK_SIZE + FIR_TAPS - 1];

/* Enable caches in startup (call before main() in Reset_Handler) */
void cache_enable(void) {
    /* Enable ICache */
    SCB_EnableICache();

    /* Enable DCache */
    SCB_EnableDCache();

    /* Note: DCache requires explicit cache maintenance for DMA buffers.
     * Use SCB_CleanDCache_by_Addr() before DMA TX.
     * Use SCB_InvalidateDCache_by_Addr() after DMA RX.
     */
}
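The boot-time copy the linker-script comment refers to can be sketched as below. In real startup code _sitcm, _eitcm, and _sitcm_flash are linker-script symbols; here they are simulated with plain arrays so the copy logic is self-contained and host-runnable:

```c
#include <stdint.h>

/* Simulated stand-ins for linker symbols — in firmware these would be:
 *   extern uint32_t _sitcm[], _eitcm[], _sitcm_flash[];               */
static uint32_t flash_image[4] = { 0x11111111u, 0x22222222u,
                                   0x33333333u, 0x44444444u };
static uint32_t itcm_region[4];

static uint32_t *_sitcm       = &itcm_region[0];
static uint32_t *_eitcm       = &itcm_region[4];   /* one past the end */
static uint32_t *_sitcm_flash = &flash_image[0];

/* Word-by-word copy of .itcm_text from its flash load address (LMA) to
 * its ITCM run address (VMA) — call from Reset_Handler before main(). */
void itcm_copy(void) {
    uint32_t *src = _sitcm_flash;
    uint32_t *dst = _sitcm;
    while (dst < _eitcm) {
        *dst++ = *src++;
    }
}
```

The same pattern applies to a .dtcm_data section: copy initialised data, zero any BSS-style region, all before main() runs.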

DWT Cycle Counter Profiling

The Data Watchpoint and Trace (DWT) unit is a Cortex-M debug peripheral that includes a 32-bit cycle counter (DWT->CYCCNT). It counts processor clock cycles with zero impact on execution — unlike timer-based profiling which consumes a timer peripheral and has interrupt overhead.

The DWT cycle counter is the fastest, lightest profiling tool available on Cortex-M. Enable it once and wrap any code section with the macros below to get cycle-accurate measurements. Output can go to ITM (SWO trace), a memory buffer, or be printed over UART after profiling completes.

/**
 * DWT Cycle Counter Profiling Macros
 * Works on Cortex-M3, M4, M7, M33 (not M0/M0+ — no DWT CYCCNT)
 *
 * Usage:
 *   PROFILE_START();
 *   function_to_profile(data, length);
 *   PROFILE_END("my_function");
 */
#include "core_cm4.h"   /* or core_cm7.h / core_cm33.h */

/* ── One-time DWT initialisation (call in main before profiling) ── */
void dwt_init(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* Enable trace */
#if (__CORTEX_M == 7U)
    DWT->LAR = 0xC5ACCE55U;                           /* Unlock DWT on Cortex-M7 */
#endif
    DWT->CYCCNT  = 0U;                                /* Reset counter */
    DWT->CTRL   |= DWT_CTRL_CYCCNTENA_Msk;            /* Enable counter */
}

/* ── Profiling macros ── */
#define PROFILE_START()  \
    uint32_t _dwt_start = DWT->CYCCNT

#define PROFILE_END(name) \
    do { \
        uint32_t _dwt_cycles = DWT->CYCCNT - _dwt_start; \
        /* Cycle-to-microsecond: cycles / (SystemCoreClock / 1000000) */ \
        uint32_t _us = _dwt_cycles / (SystemCoreClock / 1000000U); \
        /* profile_log() is user-supplied — route it to ITM (SWV viewer), \
         * a RAM buffer, or UART after profiling completes */ \
        profile_log(name, _dwt_cycles, _us); \
    } while (0)

/* ── Example profiling a DSP FIR filter ── */
void profile_fir_filter(void) {
    float32_t input[FIR_BLOCK_SIZE];
    float32_t output[FIR_BLOCK_SIZE];

    /* Prepare input data */
    generate_test_tone(input, FIR_BLOCK_SIZE, 1000.0f, SAMPLE_RATE);

    dwt_init();

    /* Profile the CMSIS-DSP FIR filter */
    PROFILE_START();
    arm_fir_f32(&g_fir_instance, input, output, FIR_BLOCK_SIZE);
    PROFILE_END("arm_fir_f32");

    /* Expected output (STM32H743 @ 480 MHz, 64 taps, 128 samples):
     * [arm_fir_f32] 1,872 cycles = 3.9 us
     */
}
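One subtlety worth noting: CYCCNT is a free-running 32-bit counter, so at 480 MHz it wraps roughly every 8.9 seconds. The unsigned subtraction in PROFILE_END stays correct across a single wrap, because unsigned 32-bit arithmetic is modulo 2^32 — this host-runnable sketch shows the behaviour:

```c
#include <stdint.h>

/* Elapsed-cycle computation, identical to PROFILE_END's subtraction.
 * end - start is evaluated modulo 2^32, so the result is the true
 * elapsed count even if the counter wrapped once in between. */
uint32_t dwt_elapsed(uint32_t start, uint32_t end) {
    return end - start;   /* valid for intervals shorter than 2^32 cycles */
}
```

For measurements longer than one wrap period, either extend the counter in software from a periodic tick or fall back to a 64-bit timer.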

SIMD Intrinsics for Manual Vectorisation

The Cortex-M4, M7, and M33 implement a subset of ARM's SIMD instruction set through the DSP extension (ARMv7E-M). These SIMD instructions operate on packed 8-bit or 16-bit values within a single 32-bit register — effectively processing 4 bytes or 2 halfwords in parallel. CMSIS-Core exposes them as intrinsics in cmsis_gcc.h.

The key intrinsics for embedded DSP are: __SADD8 (signed parallel add, 4 bytes), __SADD16 (signed parallel add, 2 halfwords), __SMLAD (dual 16-bit multiply-accumulate), __PKHBT/__PKHTB (pack halfwords), and __SMUAD (dual 16-bit multiply with the two products summed — 2 MACs per cycle).

/**
 * CMSIS SIMD intrinsics for manual vectorisation.
 * Target: Cortex-M4F with DSP extension (ARMv7E-M).
 *
 * Example: Compute dot product of two Q15 vectors using SMUAD.
 * SMUAD: multiplies the two halfwords of one register with the two halfwords
 * of another and adds the results — 2 multiply-accumulate operations per cycle.
 */
#include "core_cm4.h"

/**
 * Dot product of two Q15 (int16_t) vectors.
 * Scalar version: N multiplies + N-1 adds.
 * SIMD version:   N/2 SMUAD instructions — 2x throughput.
 */
int32_t dot_product_q15_simd(const int16_t *a, const int16_t *b, uint32_t len) {
    int64_t acc = 0;
    uint32_t i = 0;

    /* Process 2 elements per iteration using SMUAD */
    for (; i < (len & ~1U); i += 2) {
        /* Pack two Q15 samples into one 32-bit word */
        uint32_t packed_a = __PKHBT((uint32_t)a[i], (uint32_t)a[i+1], 16);
        uint32_t packed_b = __PKHBT((uint32_t)b[i], (uint32_t)b[i+1], 16);

        /* SMUAD: multiply a[i]*b[i] + a[i+1]*b[i+1] in a single instruction */
        acc += (int32_t)__SMUAD(packed_a, packed_b);
    }

    /* Handle odd element */
    if (i < len) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }

    /* Saturate 64-bit accumulator to Q31 range */
    if (acc > (int64_t)INT32_MAX)  return INT32_MAX;
    if (acc < (int64_t)INT32_MIN)  return INT32_MIN;
    return (int32_t)acc;
}

/**
 * Parallel byte saturation using __SADD8.
 * Saturating-adds 4 signed bytes simultaneously.
 * Useful for audio mixing, image processing.
 */
uint32_t parallel_saturate_add_bytes(uint32_t a_packed, uint32_t b_packed) {
    /* __SADD8: packed signed add — result saturated to [-128, 127] per byte */
    return __SADD8(a_packed, b_packed);
}
CMSIS-DSP vs Manual SIMD: Before writing manual SIMD intrinsics, check whether CMSIS-DSP already provides the operation you need — arm_dot_prod_q15(), arm_add_q15(), etc. CMSIS-DSP functions are already SIMD-optimised internally and tested across all Cortex-M variants. Write manual intrinsics only when you need a specific combination CMSIS-DSP doesn't offer.

Exercises

Exercise 1 Intermediate

Profile a DSP Filter and Identify the Top Hotspot

Implement a 64-tap FIR filter in plain C (no CMSIS-DSP) processing 128-sample blocks. Enable the DWT cycle counter. Profile the complete filter execution. Then switch to arm_fir_f32() from CMSIS-DSP and profile again. Compute the speedup ratio. Identify the single most expensive line in the C implementation using the DWT macro around individual loop iterations.

DWT Profiling FIR Filter CMSIS-DSP Cycle Counting
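A possible starting point for the plain-C baseline in Exercise 1 — a block FIR with a history buffer, written without CMSIS-DSP. The inner loop is the hotspot the exercise asks you to profile (coefficient values and any arm_fir_f32 state-layout compatibility are left to you):

```c
#include <string.h>

#define FIR_TAPS       64
#define FIR_BLOCK_SIZE 128

/* Coefficients and state (history of the last FIR_TAPS - 1 input samples) */
static float fir_coeffs[FIR_TAPS];
static float fir_state[FIR_TAPS - 1];

/* Plain-C block FIR: y[n] = sum_k coeffs[k] * x[n - k] */
void fir_f32_plain(const float *input, float *output, int block) {
    /* Working buffer: saved history followed by the new block */
    float buf[FIR_TAPS - 1 + FIR_BLOCK_SIZE];
    memcpy(buf, fir_state, sizeof(fir_state));
    memcpy(buf + FIR_TAPS - 1, input, (size_t)block * sizeof(float));

    for (int n = 0; n < block; n++) {
        float acc = 0.0f;
        for (int k = 0; k < FIR_TAPS; k++) {
            /* newest sample x[n] sits at buf[FIR_TAPS - 1 + n] */
            acc += fir_coeffs[k] * buf[n + FIR_TAPS - 1 - k];
        }
        output[n] = acc;
    }

    /* Save the last FIR_TAPS - 1 samples as history for the next block */
    memcpy(fir_state, buf + block, sizeof(fir_state));
}
```

Wrap the call in PROFILE_START()/PROFILE_END() exactly as shown for arm_fir_f32 above, then compare the two cycle counts.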
Exercise 2 Intermediate

Place ISR in ITCM and Measure Latency Improvement

On an STM32H743 (or equivalent M7 target), implement a timer ISR that performs a fixed amount of arithmetic work. Measure its execution time from the DWT cycle counter. Then: (1) add the linker script ITCM region, (2) add the startup ITCM copy loop, (3) add __attribute__((section(".itcm_text"))) to the ISR. Re-measure. Document the cycle count reduction and confirm it matches the expected flash wait-state elimination (STM32H7 flash = 9 wait states at 480 MHz).

ITCM Cortex-M7 Linker Script ISR Latency
Exercise 3 Advanced

Replace a C Loop with SIMD Intrinsics — Verify and Measure

Implement a Q15 vector dot product in plain C. Profile it with DWT. Rewrite it using __SMUAD and __PKHBT CMSIS SIMD intrinsics as shown in the article. Verify numerical correctness against the C reference using Unity test assertions with TEST_ASSERT_INT32_WITHIN(tolerance, expected, actual). Profile the SIMD version and compute the cycle-count speedup. Note the compiler assembly output (-S) to confirm SMUAD instructions appear.

SIMD Intrinsics __SMUAD Q15 DSP Unity Verification
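For the correctness check in Exercise 3, a portable reference model of SMUAD is handy — it lets the Unity assertions run on the host as well as on target. This is a sketch of the instruction's documented behaviour (pairwise signed 16-bit multiplies, products summed), not a replacement for the hardware instruction:

```c
#include <stdint.h>

/* Host-side reference model of SMUAD: multiply the two signed 16-bit
 * halves of each operand pairwise and add the two products. */
int32_t ref_smuad(uint32_t x, uint32_t y) {
    int16_t xl = (int16_t)(x & 0xFFFFu);   /* bottom halfword of x */
    int16_t xh = (int16_t)(x >> 16);       /* top halfword of x */
    int16_t yl = (int16_t)(y & 0xFFFFu);
    int16_t yh = (int16_t)(y >> 16);
    return (int32_t)xl * yl + (int32_t)xh * yh;
}
```

In the Unity test, feed identical packed operands to ref_smuad() on the host and __SMUAD() on target and assert the results match across a sweep of values, including negative halfwords.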

Performance Optimisation Tracker

Use this tool to document your firmware optimisation session — profiling tool selection, hotspot identification, compiler flags applied, and benchmark results before and after. Download as Word, Excel, PDF, or PPTX for architecture reviews or performance sign-off documentation.


Conclusion & Next Steps

In this part we have covered the complete ARM Cortex-M performance toolkit:

  • Compiler flags: -O2, -flto, -ffunction-sections, and architecture-specific -mcpu/-mfpu flags are the foundation of embedded performance. Avoid -ffast-math.
  • LTO extends the compiler's reach across translation unit boundaries — typically delivering an additional 10–25% size reduction on top of per-file optimisation.
  • CMSIS intrinsics provide portable access to ARM-specific instructions — __RBIT, __CLZ, __QADD, barrier intrinsics — without raw inline assembly.
  • ITCM/DTCM placement on Cortex-M7 eliminates flash wait-state latency for critical ISRs and DSP kernels — often the single most impactful performance change available.
  • The DWT cycle counter is the lightest, most accurate profiling tool on Cortex-M — use it before any optimisation to confirm you are targeting the real hotspot.
  • SIMD intrinsics (__SMUAD, __SADD8, __PKHBT) enable 2–4x throughput on DSP loops with minimal code change on M4/M7/M33.

Next in the Series

In Part 19: Embedded Software Architecture, we step back from low-level optimisation to look at the big picture — layered architecture, hardware abstraction layers, event-driven design with finite state machines, component-based design, and the architectural principles that make embedded firmware maintainable across hardware revisions.
