
CMSIS Part 18: Performance Optimization

March 31, 2026 Wasil Zafar 30 min read

From -O2 flags and LTO through inline assembly and TCM placement to cycle-accurate profiling — the complete toolkit for squeezing maximum performance from ARM Cortex-M.

Table of Contents

  1. Compiler Optimisation Flags
  2. Link-Time Optimisation
  3. Inline Assembly & CMSIS Intrinsics
  4. Cache & TCM on M7/M33
  5. DWT Cycle Counter Profiling
  6. SIMD Intrinsics for Manual Vectorisation
  7. Exercises
  8. Optimisation Tracker
  9. Conclusion & Next Steps
Series Context: This is Part 18 of the 20-part CMSIS Mastery Series. Performance optimisation sits after testing (Part 17) intentionally — you must have a correct, tested baseline before optimising. Measuring cycles on untested code is wasted effort.

CMSIS Mastery Series

Your 20-step learning path • Currently on Step 18
 1. Overview & ARM Cortex-M Ecosystem — CMSIS layers, Cortex-M families, memory map, toolchains
 2. CMSIS-Core: Registers, NVIC & SysTick — core_cmX.h, register access, interrupt controller, SysTick timer
 3. Startup Code, Linker Scripts & Vector Table — Reset handler, BSS init, scatter files, boot process
 4. CMSIS-RTOS2: Threads, Mutexes & Semaphores — Thread management, synchronization primitives, scheduling
 5. CMSIS-RTOS2: Message Queues & Event Flags — Inter-thread comms, ISR-to-thread, real-time design patterns
 6. CMSIS-DSP: Filters, FFT & Math Functions — FIR/IIR filters, FFT, SIMD optimizations
 7. CMSIS-Driver: UART, SPI & I2C — Driver abstraction layer, callbacks, DMA integration
 8. CMSIS-Pack & Software Components — Pack files, device support, dependency management
 9. Debugging with CMSIS-DAP & CoreSight — SWD/JTAG, HardFault analysis, ITM tracing
10. Portable Firmware: Multi-Vendor Projects — HAL vs CMSIS, cross-platform BSPs, reusable libraries
11. Interrupts, Concurrency & Real-Time Constraints — Interrupt latency, critical sections, lock-free programming
12. Memory Management in Embedded Systems — Static vs dynamic, heap fragmentation, memory pools
13. Low Power & Energy Optimization — Sleep modes, clock gating, tickless RTOS, power profiling
14. DMA & High-Performance Data Handling — DMA basics, peripheral transfers, zero-copy techniques
15. Security: ARMv8-M & TrustZone — Secure/non-secure worlds, secure boot, firmware protection
16. Bootloaders & Firmware Updates — OTA updates, dual-bank flash, fail-safe strategies
17. Testing & Validation — Unity/Ceedling unit tests, HIL testing, integration testing
18. Performance Optimization — Compiler flags, inline assembly, cache (M7/M33), profiling ← You Are Here
19. Embedded Software Architecture — Layered design, event-driven, state machines, component-based
20. Tooling & Workflow (Professional Level) — CI/CD for embedded, MISRA, static analysis, Doxygen

Compiler Optimisation Flags

The compiler is your most powerful performance tool. The right flags can halve code size and double execution speed with zero source changes. The wrong flags silently introduce bugs through aggressive assumptions. Understanding exactly what each flag does — and its trade-offs — is essential before optimising embedded firmware.

GCC's optimisation levels are shorthand for groups of individual transformations. The table below gives a practical reference for embedded decision-making.

Flag  | Code Size Delta             | Speed Delta        | Debug Impact                        | When to Use
-O0   | Baseline (largest)          | Baseline (slowest) | Full debuggability                  | Development, unit test builds, debugging HardFaults
-O1   | -20% typical                | +30% typical       | Minor variable optimisation         | Quick improvement without aggressive transforms
-O2   | -30% typical                | +50–80%            | Some inlining obscures call stack   | Release firmware default — good balance
-O3   | Often +10% (loop unrolling) | +60–100%           | Heavy inlining — difficult to debug | DSP hotspots only — can increase code size
-Os   | -35–40%                     | +20–40%            | Moderate                            | Flash-constrained MCUs (M0, M0+)
-Oz   | -40–45%                     | Minimal            | Heavy                               | Absolute minimum flash usage (bootloaders)
-flto | -15–25% additional          | +10–20% additional | Cross-TU inlining obscures frames   | Release builds — combine with -O2 or -Os

Beyond the level flags, several individual flags are essential for embedded:

# Recommended CMakeLists.txt compiler flags for Cortex-M4F release build
target_compile_options(firmware.elf PRIVATE

    # CPU / FPU architecture
    -mcpu=cortex-m4
    -mthumb
    -mfpu=fpv4-sp-d16
    -mfloat-abi=hard          # Use hardware FPU registers (not soft emulation)

    # Optimisation level
    -O2                       # Release: good balance speed vs size vs debuggability

    # Link-time optimisation (combine with linker -flto)
    -flto

    # Dead code elimination (requires --gc-sections at link time)
    -ffunction-sections       # Place each function in its own ELF section
    -fdata-sections           # Place each variable in its own ELF section

    # Warning flags (catch real bugs)
    -Wall
    -Wextra
    -Wshadow
    -Wdouble-promotion        # Warn when float silently promoted to double
    -Wundef

    # Strict aliasing — enabled by default at -O2/-O3; the warning flags likely violations
    -fstrict-aliasing
    -Wstrict-aliasing=3

    # Do NOT use -ffast-math in embedded — breaks IEEE 754 NaN/Inf handling
)

target_link_options(firmware.elf PRIVATE
    -flto                     # Must match compile flag
    -Wl,--gc-sections         # Remove unused sections (pairs with -ffunction/data-sections)
    -Wl,-Map=firmware.map     # Generate map file for size analysis
    -Wl,--print-memory-usage  # Print flash/RAM usage at link time
)
Avoid -ffast-math: This flag breaks IEEE 754 compliance — it assumes no NaN, no Inf, and allows reordering of floating-point operations. In embedded DSP and sensor fusion code, this can introduce subtle numerical errors that are nearly impossible to diagnose. Use -fno-math-errno as a safer alternative if you need partial math optimisation.

Link-Time Optimisation

Link-Time Optimisation (LTO) extends the compiler's visibility from a single translation unit to the entire program. Without LTO, GCC optimises each .c file independently — cross-module inlining and dead code elimination are impossible. With LTO enabled (-flto), the linker invokes the compiler again over the combined intermediate representation, enabling:

  • Cross-module inlining — functions in different .c files are inlined if the optimiser deems it beneficial
  • Whole-program dead code elimination — functions called from only one site and never exported are eliminated
  • Constant propagation across modules — a constant defined in one file propagates into callers in another

In practice, LTO with -O2 on a typical embedded firmware project reduces flash usage by an additional 10–25% and improves throughput by 10–20% — significant gains for zero source code changes.

LTO and Interrupt Handlers: LTO may eliminate functions that appear unreachable from main() — including interrupt handlers referenced only from the vector table. Always declare ISR functions with __attribute__((used)) or export them in the linker script to prevent elimination.
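The `__attribute__((used))` protection can be sketched as follows. The handler name here follows the STM32 vector-table convention (USART1_IRQHandler) purely as an illustration; the flag variable is hypothetical — substitute your own device's handler names:

```c
#include <stdint.h>

/* Hypothetical event counter the handler updates — illustration only. */
static volatile uint32_t g_uart_events;

/* __attribute__((used)) tells the compiler the symbol is referenced
 * outside its visibility (here: by the vector table), so LTO and
 * --gc-sections must not discard it. "noinline" preserves a standalone
 * symbol for the vector table to point at. */
__attribute__((used, noinline))
void USART1_IRQHandler(void) {
    g_uart_events++;   /* real code would read/clear peripheral status here */
}
```

An equivalent linker-script approach is to wrap the vector-table section in KEEP(), e.g. KEEP(*(.isr_vector)), so the table itself — and everything it references — survives garbage collection.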

Inline Assembly & CMSIS Intrinsics

Inline assembly is the last resort of the embedded optimiser — used when the compiler fails to generate the single instruction you need. The ARM Cortex-M ISA includes instructions with no C equivalent: RBIT (reverse bits), CLZ (count leading zeros), USAT/SSAT (saturating arithmetic), SEV/WFE (event synchronisation).

CMSIS-Core provides intrinsic functions in cmsis_gcc.h (GCC) and cmsis_armclang.h (ARMClang) that wrap these instructions without requiring raw inline assembly. Always prefer CMSIS intrinsics over hand-written asm — they are portable across compilers and explicitly documented.

/**
 * Inline assembly examples: RBIT instruction via CMSIS intrinsic
 * and raw __asm volatile for a custom bitfield swap.
 *
 * Include: cmsis_gcc.h (via core_cm4.h)
 */
#include "core_cm4.h"

/* ── Example 1: CMSIS intrinsic __RBIT ── */
uint32_t reverse_bits_cmsis(uint32_t value) {
    /* __RBIT compiles to a single RBIT instruction on Cortex-M3/M4/M7 */
    return __RBIT(value);
    /* Assembly output: rbit r0, r0 — 1 cycle */
}

/* ── Example 2: Raw __asm volatile with constraints ── */
/* Swap the upper and lower 16-bit halves of a 32-bit word using REV16 */
uint32_t swap_halfwords(uint32_t value) {
    uint32_t result;
    __asm volatile (
        "rev16 %[out], %[in]"          /* REV16: reverses bytes within each halfword */
        : [out] "=r" (result)          /* output operand: any register */
        : [in]  "r"  (value)           /* input operand: any register */
        :                              /* no clobbers */
    );
    return result;
}

/* ── Example 3: Saturating add using CMSIS intrinsic __QADD ── */
int32_t saturating_accumulate(int32_t acc, int32_t sample) {
    /* __QADD maps to QADD instruction — saturates at INT32_MAX/INT32_MIN */
    return __QADD(acc, sample);
}

/* ── Example 4: Count leading zeros — used for fast log2 / priority encode ── */
uint32_t fast_log2_floor(uint32_t value) {
    if (value == 0U) { return 0U; }
    /* __CLZ maps to CLZ instruction — 1 cycle on M3/M4/M7 */
    return 31U - __CLZ(value);
}

/* ── Example 5: Memory barrier intrinsics (critical for DMA / volatile access) ── */
void flush_write_buffer(void) {
    __DSB();   /* Data Synchronisation Barrier — wait for all memory transactions */
    __ISB();   /* Instruction Synchronisation Barrier — flush pipeline */
}

Cache & TCM on M7/M33

The Cortex-M7 is the first Cortex-M core with a Harvard L1 cache — separate instruction cache (ICache) and data cache (DCache), typically 16 KB or 32 KB each. M33-class devices may optionally include an ICache but no DCache. Enabling these caches is not automatic — you must explicitly enable them in startup code before running performance-critical code.

Beyond caches, the M7 provides Tightly Coupled Memories (TCM): ITCM for instruction storage and DTCM for data. TCM is accessed via a dedicated interface — no cache misses, no bus arbitration latency, deterministic 0-wait-state access at full CPU frequency. For ISRs and DSP kernels, TCM placement is the single most effective performance technique on the M7.

Core          | ICache              | DCache              | ITCM        | DTCM
Cortex-M0/M0+ | No                  | No                  | No          | No
Cortex-M3     | No                  | No                  | No          | No
Cortex-M4/M4F | No                  | No                  | No          | No
Cortex-M7     | 16–32 KB (optional) | 16–32 KB (optional) | Up to 16 MB | Up to 16 MB
Cortex-M23    | No                  | No                  | No          | No
Cortex-M33    | Optional            | No                  | No          | No
Cortex-M55    | Optional            | Optional            | Optional    | Optional
/**
 * Place a time-critical ISR in ITCM on STM32H7 (Cortex-M7).
 *
 * ITCM on STM32H743 starts at 0x00000000 — it is mapped to
 * the instruction fetch port directly. Zero wait states at 480 MHz.
 *
 * Linker script adds:
 *   .itcm_text : AT(_sitcm_flash)
 *   {
 *       _sitcm = .;
 *       *(.itcm_text*)
 *       _eitcm = .;
 *   } > ITCM
 *
 * Startup code copies .itcm_text from flash to ITCM at boot.
 */

/* GCC attribute places function in .itcm_text section */
__attribute__((section(".itcm_text"), noinline))
void TIM1_UP_IRQHandler(void) {
    /* This ISR runs from ITCM — no flash wait states, deterministic latency */

    /* Clear update interrupt flag with a plain write — a read-modify-write
     * (SR &= ~UIF) can race and discard a flag that sets between the read
     * and the write-back */
    TIM1->SR = ~TIM_SR_UIF;

    /* Execute time-critical control loop */
    motor_control_update();
}

/* Placement of DSP buffer in DTCM for zero-latency data access */
__attribute__((section(".dtcm_data")))
static float32_t g_fir_state[FIR_BLOCK_SIZE + FIR_TAPS - 1];

/* Enable caches in startup (call before main() in Reset_Handler) */
void cache_enable(void) {
    /* Enable ICache */
    SCB_EnableICache();

    /* Enable DCache */
    SCB_EnableDCache();

    /* Note: DCache requires explicit cache maintenance for DMA buffers.
     * Use SCB_CleanDCache_by_Addr() before DMA TX.
     * Use SCB_InvalidateDCache_by_Addr() after DMA RX.
     */
}
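The boot-time copy the linker-script comment refers to can be sketched as below. In real startup code _sitcm, _eitcm, and _sitcm_flash are linker-script symbols; here they are simulated with plain arrays so the copy logic is self-contained and host-runnable:

```c
#include <stdint.h>

/* Simulated stand-ins for linker symbols — in firmware these would be:
 *   extern uint32_t _sitcm[], _eitcm[], _sitcm_flash[];               */
static uint32_t flash_image[4] = { 0x11111111u, 0x22222222u,
                                   0x33333333u, 0x44444444u };
static uint32_t itcm_region[4];

static uint32_t *_sitcm       = &itcm_region[0];
static uint32_t *_eitcm       = &itcm_region[4];   /* one past the end */
static uint32_t *_sitcm_flash = &flash_image[0];

/* Word-by-word copy of .itcm_text from its flash load address (LMA) to
 * its ITCM run address (VMA) — call from Reset_Handler before main(). */
void itcm_copy(void) {
    uint32_t *src = _sitcm_flash;
    uint32_t *dst = _sitcm;
    while (dst < _eitcm) {
        *dst++ = *src++;
    }
}
```

The same pattern applies to a .dtcm_data section: copy initialised data, zero any BSS-style region, all before main() runs.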

DWT Cycle Counter Profiling

The Data Watchpoint and Trace (DWT) unit is a Cortex-M debug peripheral that includes a 32-bit cycle counter (DWT->CYCCNT). It counts processor clock cycles with zero impact on execution — unlike timer-based profiling which consumes a timer peripheral and has interrupt overhead.

The DWT cycle counter is the fastest, lightest profiling tool available on Cortex-M. Enable it once and wrap any code section with the macros below to get cycle-accurate measurements. Output can go to ITM (SWO trace), a memory buffer, or be printed over UART after profiling completes.

/**
 * DWT Cycle Counter Profiling Macros
 * Works on Cortex-M3, M4, M7, M33 (not M0/M0+ — no DWT CYCCNT)
 *
 * Usage:
 *   PROFILE_START();
 *   function_to_profile(data, length);
 *   PROFILE_END("my_function");
 */
#include "core_cm4.h"   /* or core_cm7.h / core_cm33.h */

/* ── One-time DWT initialisation (call in main before profiling) ── */
void dwt_init(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* Enable trace */
#if (__CORTEX_M == 7U)
    DWT->LAR = 0xC5ACCE55U;                           /* Unlock DWT on Cortex-M7 */
#endif
    DWT->CYCCNT  = 0U;                                /* Reset counter */
    DWT->CTRL   |= DWT_CTRL_CYCCNTENA_Msk;            /* Enable counter */
}

/* ── Profiling macros ── */
#define PROFILE_START()  \
    uint32_t _dwt_start = DWT->CYCCNT

#define PROFILE_END(name) \
    do { \
        uint32_t _dwt_cycles = DWT->CYCCNT - _dwt_start; \
        /* Cycle-to-microsecond: cycles / (SystemCoreClock / 1000000) */ \
        uint32_t _us = _dwt_cycles / (SystemCoreClock / 1000000U); \
        /* profile_log() is user-supplied — route it to ITM (SWV viewer), \
         * a RAM buffer, or UART after profiling completes */ \
        profile_log(name, _dwt_cycles, _us); \
    } while (0)

/* ── Example profiling a DSP FIR filter ── */
void profile_fir_filter(void) {
    float32_t input[FIR_BLOCK_SIZE];
    float32_t output[FIR_BLOCK_SIZE];

    /* Prepare input data */
    generate_test_tone(input, FIR_BLOCK_SIZE, 1000.0f, SAMPLE_RATE);

    dwt_init();

    /* Profile the CMSIS-DSP FIR filter */
    PROFILE_START();
    arm_fir_f32(&g_fir_instance, input, output, FIR_BLOCK_SIZE);
    PROFILE_END("arm_fir_f32");

    /* Expected output (STM32H743 @ 480 MHz, 64 taps, 128 samples):
     * [arm_fir_f32] 1,872 cycles = 3.9 us
     */
}
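One subtlety worth noting: CYCCNT is a free-running 32-bit counter, so at 480 MHz it wraps roughly every 8.9 seconds. The unsigned subtraction in PROFILE_END stays correct across a single wrap, because unsigned 32-bit arithmetic is modulo 2^32 — this host-runnable sketch shows the behaviour:

```c
#include <stdint.h>

/* Elapsed-cycle computation, identical to PROFILE_END's subtraction.
 * end - start is evaluated modulo 2^32, so the result is the true
 * elapsed count even if the counter wrapped once in between. */
uint32_t dwt_elapsed(uint32_t start, uint32_t end) {
    return end - start;   /* valid for intervals shorter than 2^32 cycles */
}
```

For measurements longer than one wrap period, either extend the counter in software from a periodic tick or fall back to a 64-bit timer.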

SIMD Intrinsics for Manual Vectorisation

The Cortex-M4, M7, and M33 implement a subset of ARM's SIMD instruction set through the DSP extension (ARMv7E-M). These SIMD instructions operate on packed 8-bit or 16-bit values within a single 32-bit register — effectively processing 4 bytes or 2 halfwords in parallel. CMSIS-Core exposes them as intrinsics in cmsis_gcc.h.

The key intrinsics for embedded DSP are: __SADD8 (signed parallel add, 4 bytes), __SADD16 (signed parallel add, 2 halfwords), __SMLAD (dual 16-bit multiply-accumulate), __PKHBT/__PKHTB (pack halfwords), and __SMUAD (dual 16-bit multiply with the two products summed — 2 MACs per cycle).

/**
 * CMSIS SIMD intrinsics for manual vectorisation.
 * Target: Cortex-M4F with DSP extension (ARMv7E-M).
 *
 * Example: Compute dot product of two Q15 vectors using SMUAD.
 * SMUAD: multiplies the two halfwords of one register with the two halfwords
 * of another and adds the results — 2 multiply-accumulate operations per cycle.
 */
#include "core_cm4.h"

/**
 * Dot product of two Q15 (int16_t) vectors.
 * Scalar version: N multiplies + N-1 adds.
 * SIMD version:   N/2 SMUAD instructions — 2x throughput.
 */
int32_t dot_product_q15_simd(const int16_t *a, const int16_t *b, uint32_t len) {
    int64_t acc = 0;
    uint32_t i = 0;

    /* Process 2 elements per iteration using SMUAD */
    for (; i < (len & ~1U); i += 2) {
        /* Pack two Q15 samples into one 32-bit word */
        uint32_t packed_a = __PKHBT((uint32_t)a[i], (uint32_t)a[i+1], 16);
        uint32_t packed_b = __PKHBT((uint32_t)b[i], (uint32_t)b[i+1], 16);

        /* SMUAD: multiply a[i]*b[i] + a[i+1]*b[i+1] in a single instruction */
        acc += (int32_t)__SMUAD(packed_a, packed_b);
    }

    /* Handle odd element */
    if (i < len) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }

    /* Saturate 64-bit accumulator to Q31 range */
    if (acc > (int64_t)INT32_MAX)  return INT32_MAX;
    if (acc < (int64_t)INT32_MIN)  return INT32_MIN;
    return (int32_t)acc;
}

/**
 * Parallel byte saturation using __SADD8.
 * Saturating-adds 4 signed bytes simultaneously.
 * Useful for audio mixing, image processing.
 */
uint32_t parallel_saturate_add_bytes(uint32_t a_packed, uint32_t b_packed) {
    /* __SADD8: packed signed add — result saturated to [-128, 127] per byte */
    return __SADD8(a_packed, b_packed);
}
CMSIS-DSP vs Manual SIMD: Before writing manual SIMD intrinsics, check whether CMSIS-DSP already provides the operation you need — arm_dot_prod_q15(), arm_add_q15(), etc. CMSIS-DSP functions are already SIMD-optimised internally and tested across all Cortex-M variants. Write manual intrinsics only when you need a specific combination CMSIS-DSP doesn't offer.

Exercises

Exercise 1 Intermediate

Profile a DSP Filter and Identify the Top Hotspot

Implement a 64-tap FIR filter in plain C (no CMSIS-DSP) processing 128-sample blocks. Enable the DWT cycle counter. Profile the complete filter execution. Then switch to arm_fir_f32() from CMSIS-DSP and profile again. Compute the speedup ratio. Identify the single most expensive line in the C implementation using the DWT macro around individual loop iterations.

DWT Profiling FIR Filter CMSIS-DSP Cycle Counting
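A possible starting point for the plain-C baseline in Exercise 1 — a block FIR with a history buffer, written without CMSIS-DSP. The inner loop is the hotspot the exercise asks you to profile (coefficient values and any arm_fir_f32 state-layout compatibility are left to you):

```c
#include <string.h>

#define FIR_TAPS       64
#define FIR_BLOCK_SIZE 128

/* Coefficients and state (history of the last FIR_TAPS - 1 input samples) */
static float fir_coeffs[FIR_TAPS];
static float fir_state[FIR_TAPS - 1];

/* Plain-C block FIR: y[n] = sum_k coeffs[k] * x[n - k] */
void fir_f32_plain(const float *input, float *output, int block) {
    /* Working buffer: saved history followed by the new block */
    float buf[FIR_TAPS - 1 + FIR_BLOCK_SIZE];
    memcpy(buf, fir_state, sizeof(fir_state));
    memcpy(buf + FIR_TAPS - 1, input, (size_t)block * sizeof(float));

    for (int n = 0; n < block; n++) {
        float acc = 0.0f;
        for (int k = 0; k < FIR_TAPS; k++) {
            /* newest sample x[n] sits at buf[FIR_TAPS - 1 + n] */
            acc += fir_coeffs[k] * buf[n + FIR_TAPS - 1 - k];
        }
        output[n] = acc;
    }

    /* Save the last FIR_TAPS - 1 samples as history for the next block */
    memcpy(fir_state, buf + block, sizeof(fir_state));
}
```

Wrap the call in PROFILE_START()/PROFILE_END() exactly as shown for arm_fir_f32 above, then compare the two cycle counts.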
Exercise 2 Intermediate

Place ISR in ITCM and Measure Latency Improvement

On an STM32H743 (or equivalent M7 target), implement a timer ISR that performs a fixed amount of arithmetic work. Measure its execution time from the DWT cycle counter. Then: (1) add the linker script ITCM region, (2) add the startup ITCM copy loop, (3) add __attribute__((section(".itcm_text"))) to the ISR. Re-measure. Document the cycle count reduction and confirm it matches the expected flash wait-state elimination (STM32H7 flash = 9 wait states at 480 MHz).

ITCM Cortex-M7 Linker Script ISR Latency
Exercise 3 Advanced

Replace a C Loop with SIMD Intrinsics — Verify and Measure

Implement a Q15 vector dot product in plain C. Profile it with DWT. Rewrite it using __SMUAD and __PKHBT CMSIS SIMD intrinsics as shown in the article. Verify numerical correctness against the C reference using Unity test assertions with TEST_ASSERT_INT32_WITHIN(tolerance, expected, actual). Profile the SIMD version and compute the cycle-count speedup. Note the compiler assembly output (-S) to confirm SMUAD instructions appear.

SIMD Intrinsics __SMUAD Q15 DSP Unity Verification
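For the correctness check in Exercise 3, a portable reference model of SMUAD is handy — it lets the Unity assertions run on the host as well as on target. This is a sketch of the instruction's documented behaviour (pairwise signed 16-bit multiplies, products summed), not a replacement for the hardware instruction:

```c
#include <stdint.h>

/* Host-side reference model of SMUAD: multiply the two signed 16-bit
 * halves of each operand pairwise and add the two products. */
int32_t ref_smuad(uint32_t x, uint32_t y) {
    int16_t xl = (int16_t)(x & 0xFFFFu);   /* bottom halfword of x */
    int16_t xh = (int16_t)(x >> 16);       /* top halfword of x */
    int16_t yl = (int16_t)(y & 0xFFFFu);
    int16_t yh = (int16_t)(y >> 16);
    return (int32_t)xl * yl + (int32_t)xh * yh;
}
```

In the Unity test, feed identical packed operands to ref_smuad() on the host and __SMUAD() on target and assert the results match across a sweep of values, including negative halfwords.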

Performance Optimisation Tracker

Use this tool to document your firmware optimisation session — profiling tool selection, hotspot identification, compiler flags applied, and benchmark results before and after. Download as Word, Excel, PDF, or PPTX for architecture reviews or performance sign-off documentation.


Conclusion & Next Steps

In this part we have covered the complete ARM Cortex-M performance toolkit:

  • Compiler flags: -O2, -flto, -ffunction-sections, and architecture-specific -mcpu/-mfpu flags are the foundation of embedded performance. Avoid -ffast-math.
  • LTO extends the compiler's reach across translation unit boundaries — typically delivering an additional 10–25% size reduction on top of per-file optimisation.
  • CMSIS intrinsics provide portable access to ARM-specific instructions — __RBIT, __CLZ, __QADD, barrier intrinsics — without raw inline assembly.
  • ITCM/DTCM placement on Cortex-M7 eliminates flash wait-state latency for critical ISRs and DSP kernels — often the single most impactful performance change available.
  • The DWT cycle counter is the lightest, most accurate profiling tool on Cortex-M — use it before any optimisation to confirm you are targeting the real hotspot.
  • SIMD intrinsics (__SMUAD, __SADD8, __PKHBT) enable 2–4x throughput on DSP loops with minimal code change on M4/M7/M33.

Next in the Series

In Part 19: Embedded Software Architecture, we step back from low-level optimisation to look at the big picture — layered architecture, hardware abstraction layers, event-driven design with finite state machines, component-based design, and the architectural principles that make embedded firmware maintainable across hardware revisions.
