Series Progress: This is Part 6 of our 20-part CMSIS Mastery Series. Parts 1–5 covered the ecosystem, CMSIS-Core, startup code, RTOS threads, and RTOS IPC. Now we enter the signal processing domain with CMSIS-DSP.
DSP Fundamentals
Before writing a single line of CMSIS-DSP code, you need a firm grasp of the three concepts that underpin all digital signal processing on microcontrollers: the sampling theorem, quantisation noise, and the discrete-time system model. Getting these right means your filters will work. Getting them wrong means wasted silicon and firmware that silently produces garbage.
The Nyquist-Shannon sampling theorem states that to digitise an analogue signal without aliasing you must sample at more than twice the highest frequency present in the signal. An audio microphone capturing voice (bandwidth 3.4 kHz) needs a sample rate above 6.8 kHz; in practice 8 kHz is the telephony standard. An accelerometer for vibration analysis with content up to 5 kHz needs a sample rate above 10 kHz. Miss this requirement and content above fs/2 folds back into the baseband, appearing as phantom lower-frequency signals that no software filter can remove.
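To make the fold-back concrete, here is a minimal sketch that predicts where an out-of-band tone lands after sampling. The function name alias_frequency_hz is hypothetical (it is not part of CMSIS-DSP), and the code assumes integer frequencies in Hz and a non-zero sample rate.

```c
#include <stdint.h>

/* Hypothetical helper: frequency (Hz) at which a tone of f_hz appears
 * after sampling at fs_hz. Assumes fs_hz != 0. */
static uint32_t alias_frequency_hz(uint32_t f_hz, uint32_t fs_hz)
{
    uint32_t folded = f_hz % fs_hz;      /* wrap into [0, fs) */
    if (folded > fs_hz / 2U) {
        folded = fs_hz - folded;         /* mirror around fs/2 */
    }
    return folded;
}
```

For example, a 7 kHz vibration component sampled at 10 kHz would show up at 3 kHz, indistinguishable from a genuine 3 kHz signal.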
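Anti-Aliasing Filter: Always pair your ADC with an analogue anti-aliasing filter, for example a simple RC low-pass, before sampling. Its cut-off frequency should sit below fs/2, and well below it for a first-order RC stage, whose gentle 20 dB/decade roll-off needs room to attenuate content above fs/2. No digital filter can undo aliasing after the fact; it must be prevented in hardware.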
Quantisation noise arises because a finite-bit ADC rounds the true analogue value to the nearest representable level. A 12-bit ADC introduces quantisation noise with SNR ≈ 6.02N + 1.76 dB ≈ 74 dB. A 16-bit ADC gives ≈ 98 dB. For audio quality or precision measurements, the ADC bit depth sets the noise floor no amount of filtering can overcome.
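The arithmetic behind those two figures can be sketched in a few lines. The helper names below are hypothetical, and the formula assumes an ideal quantiser driven by a full-scale sine wave; real ADCs fall short of this bound.

```c
/* Ideal SNR (dB) of an N-bit quantiser with a full-scale sine input. */
static double ideal_adc_snr_db(unsigned bits)
{
    return 6.02 * (double)bits + 1.76;
}

/* Inverse relation: effective number of bits for a measured SNR (dB). */
static double enob_from_snr_db(double snr_db)
{
    return (snr_db - 1.76) / 6.02;
}
```

Calling ideal_adc_snr_db(12) gives 74.0 dB and ideal_adc_snr_db(16) gives 98.08 dB, matching the values quoted above.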
Fixed-point vs floating-point is the constant trade-off in MCU DSP. Cortex-M4 and M7 cores with the optional FPU can execute single-precision float operations in 1–14 cycles. Cortex-M0/M3 cores have no FPU and emulate float arithmetic in software, which typically costs tens to hundreds of cycles per operation. CMSIS-DSP provides float32 and fixed-point (Q7, Q15, Q31) variants of most algorithms, letting you choose the right format for your core and precision requirements.
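As an illustration of what the Q15 format means numerically, the following standalone sketch converts between float and Q15 with rounding and saturation. The function names are hypothetical and this is not the library's implementation; CMSIS-DSP ships its own arm_float_to_q15() for real projects.

```c
#include <stdint.h>

/* Sketch: convert a float to Q15 (1 sign bit, 15 fractional bits). */
static int16_t float_to_q15_sketch(float x)
{
    float scaled = x * 32768.0f;
    /* Round half away from zero before truncating to an integer. */
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;
    if (scaled > 32767.0f)  { return INT16_MAX; }   /* saturate high */
    if (scaled < -32768.0f) { return INT16_MIN; }   /* saturate low  */
    return (int16_t)scaled;
}

/* Sketch: convert a Q15 value back to float in [-1.0, +1.0). */
static float q15_to_float_sketch(int16_t q)
{
    return (float)q / 32768.0f;
}
```

Note that +1.0 is not representable in Q15 and saturates to 32767 (about +0.99997), which is why fixed-point coefficients are usually scaled to stay inside the open interval.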
CMSIS-DSP Library Architecture
CMSIS-DSP is a compiled library — not header-only like CMSIS-Core. It ships as a pre-built static library (libarm_cortexM4lf_math.a for Cortex-M4 with FPU, for example) and as source that you can compile yourself with the exact flags matching your target. The single include is arm_math.h, which pulls in type definitions, function declarations, and the instance structs that CMSIS-DSP algorithms use to hold their persistent state.
| Category | Example Functions | Approx. Function Count |
|---|---|---|
| Basic Math | arm_add_f32, arm_mult_q15, arm_scale_f32 | ~30 |
| Complex Math | arm_cmplx_mult_cmplx_f32, arm_cmplx_mag_f32 | ~15 |
| Filtering | arm_fir_f32, arm_biquad_cascade_df2T_f32, arm_lms_f32 | ~40 |
| Transforms | arm_rfft_fast_f32, arm_cfft_f32, arm_dct4_f32 | ~20 |
| Statistics | arm_mean_f32, arm_rms_f32, arm_var_f32, arm_max_f32 | ~25 |
| Matrix Operations | arm_mat_mult_f32, arm_mat_inverse_f32 | ~20 |
| SVM / Bayes | arm_svm_linear_predict_f32, arm_gaussian_naive_bayes_predict_f32 | ~10 |
The instance struct model is central to CMSIS-DSP design. Stateful algorithms (FIR, IIR, FFT) keep their internal state — delay line, twiddle factors, coefficient array — in a struct that you allocate in your application and pass to every function call. This eliminates hidden global state and makes it trivial to run multiple independent filter instances simultaneously, each with its own state buffer.
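The pattern itself is easy to reproduce in plain C. The sketch below is a hypothetical moving-average filter (the sketch_ma_* names are invented for illustration and are not CMSIS-DSP APIs) that follows the same init-then-process shape, with all persistent state held in a caller-owned struct and buffer.

```c
#include <stddef.h>

/* Hypothetical instance struct: every piece of persistent state lives here. */
typedef struct {
    float  *p_state;   /* caller-owned delay line of `length` elements */
    size_t  length;    /* window length */
    size_t  index;     /* next slot to overwrite */
    float   sum;       /* running sum of the window */
} sketch_ma_instance;

static void sketch_ma_init(sketch_ma_instance *s, float *p_state, size_t length)
{
    s->p_state = p_state;
    s->length  = length;
    s->index   = 0U;
    s->sum     = 0.0f;
    for (size_t i = 0U; i < length; i++) {
        p_state[i] = 0.0f;               /* start from a cleared delay line */
    }
}

/* Push one sample and return the current window average. */
static float sketch_ma_step(sketch_ma_instance *s, float x)
{
    s->sum += x - s->p_state[s->index];  /* replace the oldest sample */
    s->p_state[s->index] = x;
    s->index = (s->index + 1U) % s->length;
    return s->sum / (float)s->length;
}
```

Because nothing is global, two instances with different window lengths can run side by side without interfering, which is the property the CMSIS-DSP instance structs provide for FIR, IIR, and FFT state.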
# Add CMSIS-DSP to a CMake project using the CMSIS-DSP source tree
# (CMSIS-DSP 1.15+ supports CMake natively via add_subdirectory)
# In CMakeLists.txt:
# set(DISABLEFLOAT16 ON) # disable fp16 if not needed
# add_subdirectory(CMSIS-DSP/Source CMSISDSPBinary)
# target_link_libraries(my_firmware PRIVATE CMSISDSP)
# target_compile_definitions(my_firmware PRIVATE ARM_MATH_CM4 __FPU_PRESENT=1)
# Or link the pre-built library for Cortex-M4F hard-float ABI:
# target_link_libraries(my_firmware PRIVATE
# ${CMSIS_DSP_LIB_DIR}/libarm_cortexM4lf_math.a)
# Verify the library exports the expected symbols
arm-none-eabi-nm libarm_cortexM4lf_math.a | grep arm_fir_init_f32
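Compile Flag Requirement: Define ARM_MATH_CM4 (or the correct variant for your core) and __FPU_PRESENT=1 when targeting a core with an FPU, for example via -DARM_MATH_CM4 -D__FPU_PRESENT=1 in your compiler flags; without them the library may not select its optimised code paths. The defines alone are not enough: the compiler's own target options (for GCC, -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard) must also enable the FPU, otherwise floating-point arithmetic is compiled to software emulation.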
FIR Filters
A Finite Impulse Response (FIR) filter computes each output sample as a weighted sum of the current and previous N-1 input samples, where N is the number of taps (the filter order is N-1). FIR filters are inherently stable (no feedback), achieve exactly linear phase when their coefficients are symmetric (vital for audio and communications), and map directly to the multiply-accumulate hardware available in Cortex-M4/M7 DSP extensions.
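Written out directly, the weighted sum is a single loop. The sketch below uses a hypothetical function name and treats samples before the start of the buffer as zero; it is a reference for understanding, not a replacement for the optimised arm_fir_f32().

```c
#include <stddef.h>

/* y[n] = sum over k of h[k] * x[n - k], for k = 0 .. num_taps - 1.
 * Samples before x[0] are treated as zero. */
static float fir_output_sample(const float *h, size_t num_taps,
                               const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t k = 0U; k < num_taps && k <= n; k++) {
        acc += h[k] * x[n - k];          /* one multiply-accumulate per tap */
    }
    return acc;
}
```

With the 3-tap kernel {0.25, 0.5, 0.25} and a constant input of 1.0, the output ramps 0.25, 0.75, 1.0 and then settles at 1.0 once the delay line is full.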
The standard design workflow: (1) specify the filter in the frequency domain (passband, stopband, transition width, attenuation); (2) compute coefficients using a windowed sinc method (Hamming window for audio, Kaiser window for tighter specifications); (3) pass the coefficients and state buffer to arm_fir_init_f32(); (4) call arm_fir_f32() once per block of samples.
/* ── fir_audio_lowpass.c ─────────────────────────────────────────────────
* 64-tap FIR low-pass filter for audio at 48 kHz sample rate.
* Cut-off: 8 kHz. Designed with Hamming window (windowed sinc).
* Coefficients generated by SciPy: scipy.signal.firwin(64, 8000/24000)
* ──────────────────────────────────────────────────────────────────────── */
#include "arm_math.h"
#define BLOCK_SIZE 64U /* samples processed per call — match DMA buffer */
#define NUM_TAPS 64U /* filter order + 1 */
/* FIR coefficients (symmetric, Hamming-windowed sinc, fc = 8 kHz @ 48 kHz) */
static const float32_t g_fir_coeffs[NUM_TAPS] = {
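/* Illustrative placeholder table: 40 symmetric taps zero-padded to 64.
 * These are NOT the exact firwin() output (their sum, the DC gain, is
 * about 0.37 rather than 1.0); regenerate all 64 taps with the SciPy
 * call above before using this filter. */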
0.00000f, 0.00019f, -0.00048f, -0.00063f, 0.00000f, 0.00159f,
0.00247f, 0.00000f, -0.00489f, -0.00647f, 0.00000f, 0.01103f,
0.01371f, 0.00000f, -0.02221f, -0.02756f, 0.00000f, 0.05011f,
0.07568f, 0.09003f, 0.09003f, 0.07568f, 0.05011f, 0.00000f,
-0.02756f, -0.02221f, 0.00000f, 0.01371f, 0.01103f, 0.00000f,
-0.00647f, -0.00489f, 0.00000f, 0.00247f, 0.00159f, 0.00000f,
-0.00063f, -0.00048f, 0.00019f, 0.00000f, 0.00000f, 0.00000f,
0.00000f, 0.00000f, 0.00000f, 0.00000f, 0.00000f, 0.00000f,
0.00000f, 0.00000f, 0.00000f, 0.00000f, 0.00000f, 0.00000f,
0.00000f, 0.00000f, 0.00000f, 0.00000f, 0.00000f, 0.00000f,
0.00000f, 0.00000f, 0.00000f, 0.00000f
};
/* State buffer: NUM_TAPS + BLOCK_SIZE - 1 elements */
static float32_t g_fir_state[NUM_TAPS + BLOCK_SIZE - 1U];
/* Instance struct — holds pointers to coefficients and state */
static arm_fir_instance_f32 g_fir;
/* ── One-time initialisation ─────────────────────────────────────────── */
void fir_init(void)
{
arm_fir_init_f32(
&g_fir, /* instance struct (persistent state) */
NUM_TAPS, /* number of filter taps */
g_fir_coeffs, /* pointer to coefficient array */
g_fir_state, /* pointer to state buffer */
BLOCK_SIZE); /* block size for block processing */
}
/* ── Called every DMA interrupt or RTOS block tick ──────────────────── */
void fir_process(float32_t *p_input, float32_t *p_output)
{
/* Process BLOCK_SIZE samples in a single block call.
 * On Cortex-M4F the float32 kernel relies on the FPU and loop unrolling;
 * the packed SIMD MAC instructions apply only to the Q15/Q7 variants. */
arm_fir_f32(&g_fir, p_input, p_output, BLOCK_SIZE);
}
State Buffer Sizing: The state buffer must hold at least NUM_TAPS + BLOCK_SIZE - 1 float32 elements. An undersized buffer is a silent bug: the filter writes past its end and corrupts adjacent memory with no immediate fault. Always derive the size at compile time from the same macros, as in the g_fir_state declaration above.
IIR Biquad Filters
Where FIR filters require many taps to achieve steep roll-off, Infinite Impulse Response (IIR) filters achieve the same response with far fewer coefficients by using feedback. The trade-off: IIR filters can be unstable if poorly implemented, and they introduce non-linear phase shift. For most sensor and audio applications these are acceptable trade-offs — a 4-stage biquad cascade achieves an 8th-order Butterworth response with only 20 coefficients.
CMSIS-DSP implements the biquad in Direct Form II Transposed (arm_biquad_cascade_df2T_f32), which needs only two state variables per stage where Direct Form I needs four, and which is well suited to floating-point arithmetic. Each stage has five coefficients: b0, b1, b2 (feedforward) and a1, a2 (feedback), stored in the order [b0, b1, b2, a1, a2].
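The per-sample arithmetic of one Direct Form II Transposed stage fits in three lines. The sketch below uses a hypothetical struct and function name and follows the CMSIS-DSP sign convention, in which the feedback terms are added; it is meant for understanding and host-side testing rather than as a substitute for the library routine.

```c
/* One biquad stage in Direct Form II Transposed (feedback terms added):
 *   y[n] = b0*x[n] + d1
 *   d1   = b1*x[n] + a1*y[n] + d2
 *   d2   = b2*x[n] + a2*y[n]
 */
typedef struct {
    float b0, b1, b2, a1, a2;   /* coefficients */
    float d1, d2;               /* the two state variables */
} sketch_biquad;

static float sketch_biquad_step(sketch_biquad *s, float x)
{
    const float y = s->b0 * x + s->d1;
    s->d1 = s->b1 * x + s->a1 * y + s->d2;
    s->d2 = s->b2 * x + s->a2 * y;
    return y;
}
```

Feeding a unit impulse through a stage with b0 = 1 and a1 = 0.5 yields 1, 0.5, 0.25, and so on, a decaying response that never quite reaches zero, which is the "infinite" in IIR.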
/* ── iir_biquad_dc_removal.c ─────────────────────────────────────────────
* Two-stage biquad cascade for DC removal from a sensor signal.
* Stage 1: High-pass 1 Hz (removes DC offset and slow drift).
* Stage 2: Notch at 50 Hz (removes mains interference).
* Coefficients for 1 kHz sample rate, computed with scipy.signal.
* ──────────────────────────────────────────────────────────────────────── */
#include "arm_math.h"
#define NUM_STAGES 2U /* cascade of 2 biquad sections */
/* Coefficients in CMSIS-DSP order: [b0, b1, b2, a1, a2] per stage.
 * a1 and a2 are the NEGATED SciPy denominator terms (a1 = -a[1], a2 = -a[2]). */
static float32_t g_biquad_coeffs[5U * NUM_STAGES] = {
/* Stage 1: 1 Hz high-pass (fc=1 Hz, Q=0.707, fs=1000 Hz) */
0.99368f, -1.98736f, 0.99368f, /* b0, b1, b2 */
1.98728f, -0.98744f, /* a1, a2 (already negated) */
/* Stage 2: 50 Hz notch (fs=1000 Hz, -3 dB bandwidth approx. 9 Hz).
 * A notch at f0 requires b1 = -2*cos(2*pi*f0/fs)*b0. */
0.97204f, -1.84893f, 0.97204f, /* b0, b1, b2 */
1.84893f, -0.94408f /* a1, a2 (already negated) */
};
/* State buffer: 2 elements per stage */
static float32_t g_biquad_state[2U * NUM_STAGES];
/* Instance struct */
static arm_biquad_cascade_df2T_instance_f32 g_biquad;
void biquad_init(void)
{
arm_biquad_cascade_df2T_init_f32(
&g_biquad,
NUM_STAGES,
g_biquad_coeffs,
g_biquad_state);
}
void biquad_process(float32_t *p_src, float32_t *p_dst, uint32_t block_size)
{
arm_biquad_cascade_df2T_f32(&g_biquad, p_src, p_dst, block_size);
}
Coefficient Sign Convention: CMSIS-DSP adds the feedback terms in its difference equation, y[n] = b0·x[n] + b1·x[n−1] + b2·x[n−2] + a1·y[n−1] + a2·y[n−2], whereas SciPy and MATLAB subtract them and report a denominator of the form [1, a[1], a[2]]. The values you pass to CMSIS-DSP are therefore a1 = −a[1] and a2 = −a[2]. Forget the negation and the filter will not have the designed response and is usually unstable.
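The conversion is mechanical, so it is worth automating. The sketch below assumes the input is in SciPy's second-order-sections layout, one row of [b0, b1, b2, a0, a1, a2] per stage; the function name is hypothetical and the division by a0 is a precaution, since design tools normally return a0 = 1.

```c
#include <stddef.h>

/* sos: num_stages rows of [b0, b1, b2, a0, a1, a2] (SciPy "sos" layout).
 * out: num_stages groups of [b0, b1, b2, a1, a2] in CMSIS-DSP order,
 *      with the feedback coefficients negated. */
static void sos_to_cmsis_df2t(const double *sos, size_t num_stages, float *out)
{
    for (size_t s = 0U; s < num_stages; s++) {
        const double *row = &sos[6U * s];
        const double a0 = row[3];                     /* normally 1.0 */
        out[5U * s + 0U] = (float)( row[0] / a0);     /* b0 */
        out[5U * s + 1U] = (float)( row[1] / a0);     /* b1 */
        out[5U * s + 2U] = (float)( row[2] / a0);     /* b2 */
        out[5U * s + 3U] = (float)(-row[4] / a0);     /* a1 = -a[1] */
        out[5U * s + 4U] = (float)(-row[5] / a0);     /* a2 = -a[2] */
    }
}
```

The output array can be passed straight to arm_biquad_cascade_df2T_init_f32() as the coefficient pointer, provided it stays alive for as long as the filter instance is in use.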
FFT & Spectral Analysis
The Fast Fourier Transform converts a block of time-domain samples into a complex frequency-domain spectrum, from which the magnitude spectrum is derived. For embedded systems, the most common application is identifying dominant frequencies in a vibration, acoustic, or physiological signal without knowing in advance what those frequencies are. CMSIS-DSP provides arm_rfft_fast_f32() for real-valued input at roughly half the computational cost of a full complex FFT.
The output of the RFFT for N input samples is N floats in a packed layout: the first two floats hold the purely real DC bin (bin 0) and Nyquist bin (bin N/2), followed by N/2 − 1 interleaved real/imaginary pairs for bins 1 to N/2 − 1. Bin k corresponds to frequency k*fs/N Hz, so bin 1 sits at fs/N Hz and bin N/2 at the Nyquist frequency. Use arm_cmplx_mag_f32() to compute magnitudes and arm_max_f32() to find the peak bin.
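The bin-to-frequency mapping is simple enough to wrap in two small helpers. The names below are hypothetical, and the reverse mapping assumes a frequency between 0 and fs/2.

```c
#include <stdint.h>

/* Centre frequency (Hz) of FFT bin `bin`: k * fs / N. */
static float bin_to_frequency_hz(uint32_t bin, float sample_rate_hz,
                                 uint32_t fft_size)
{
    return (float)bin * sample_rate_hz / (float)fft_size;
}

/* Nearest bin index for a frequency in the range 0 .. fs/2. */
static uint32_t frequency_to_bin(float freq_hz, float sample_rate_hz,
                                 uint32_t fft_size)
{
    return (uint32_t)(freq_hz * (float)fft_size / sample_rate_hz + 0.5f);
}
```

With fs = 10 kHz and N = 1024, each bin is 9.765625 Hz wide, so a 1 kHz tone falls nearest to bin 102 and its energy leaks slightly into the neighbouring bins.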
/* ── fft_vibration_analysis.c ────────────────────────────────────────────
* Real FFT for vibration spectrum analysis on accelerometer data.
* FFT size: 1024 samples. Sample rate: 10 kHz.
* Frequency resolution: 10000/1024 ≈ 9.77 Hz per bin.
* ──────────────────────────────────────────────────────────────────────── */
#include "arm_math.h"
#include <math.h> /* cosf() for the Hann window */
#define FFT_SIZE 1024U
#define SAMPLE_RATE 10000.0f /* Hz */
/* Hann window, filled once in fft_init().
 * If RAM is tight, precompute it offline into a const table in flash. */
static float32_t g_hann_window[FFT_SIZE];
/* Input buffer and FFT output buffer */
static float32_t g_input_buf[FFT_SIZE]; /* windowed time-domain samples */
static float32_t g_fft_output[FFT_SIZE]; /* packed complex output (N floats) */
static float32_t g_magnitude[FFT_SIZE/2U]; /* magnitude spectrum */
/* RFFT instance struct */
static arm_rfft_fast_instance_f32 g_rfft;
void fft_init(void)
{
    /* w[n] = 0.5 * (1 - cos(2*pi*n/(N-1))) for n = 0..N-1 */
    for (uint32_t n = 0U; n < FFT_SIZE; n++) {
        g_hann_window[n] = 0.5f * (1.0f - cosf(2.0f * PI * (float32_t)n
                                               / (float32_t)(FFT_SIZE - 1U)));
    }
    /* Initialise for FFT_SIZE-point transform */
    (void)arm_rfft_fast_init_f32(&g_rfft, FFT_SIZE);
}
float32_t fft_find_peak_frequency(const float32_t *p_accel_samples)
{
/* Step 1: Apply Hann window to reduce spectral leakage */
arm_mult_f32(p_accel_samples, g_hann_window, g_input_buf, FFT_SIZE);
/* Step 2: Compute real FFT (forward transform, ifftFlag = 0) */
arm_rfft_fast_f32(&g_rfft, g_input_buf, g_fft_output,
0U /* ifftFlag=0 for forward FFT */);
/* Step 3: Compute magnitude of each complex bin
* g_fft_output layout: [Re0, Re_N/2, Re1, Im1, Re2, Im2, ...] */
/* Skip bin 0 (DC) and bin N/2 (Nyquist) — start from index 2 */
arm_cmplx_mag_f32(&g_fft_output[2], &g_magnitude[1],
(FFT_SIZE / 2U) - 1U);
/* Step 4: Find the bin with the maximum magnitude */
float32_t max_val;
uint32_t max_idx;
arm_max_f32(&g_magnitude[1], (FFT_SIZE / 2U) - 1U,
&max_val, &max_idx);
max_idx += 1U; /* adjust for skipped DC bin */
/* Step 5: Convert bin index to frequency in Hz */
float32_t peak_freq_hz = (float32_t)max_idx * SAMPLE_RATE / (float32_t)FFT_SIZE;
return peak_freq_hz;
}
Performance Optimisation
CMSIS-DSP's hand-optimised code uses ARMv7E-M SIMD intrinsics (such as __SMLAD, __PKHBT) to pack two 16-bit operations into one 32-bit instruction on Cortex-M4/M7. This gives 2–4x speedup over equivalent scalar C. On Cortex-M55 and M85 with the M-Profile Vector Extension (MVE / Helium), CMSIS-DSP 1.10+ uses 128-bit SIMD lanes for up to 8x throughput on Q15 operations.
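To see what "two 16-bit operations in one instruction" means, the following portable model reproduces the arithmetic of the dual multiply-accumulate. The function names are hypothetical, the model ignores the overflow (Q flag) behaviour of the real instruction, and on target hardware the __SMLAD intrinsic does all of this in a single instruction.

```c
#include <stdint.h>

/* Portable model of a dual 16-bit MAC:
 *   acc + lo(x)*lo(y) + hi(x)*hi(y)
 * where lo/hi are the signed 16-bit halves of each 32-bit word. */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc)
{
    const int16_t x_lo = (int16_t)(x & 0xFFFFU);
    const int16_t x_hi = (int16_t)(x >> 16);
    const int16_t y_lo = (int16_t)(y & 0xFFFFU);
    const int16_t y_hi = (int16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

/* Pack two Q15 samples into one word, first sample in the low half. */
static uint32_t pack_q15_pair(int16_t first, int16_t second)
{
    return ((uint32_t)(uint16_t)second << 16) | (uint32_t)(uint16_t)first;
}
```

A Q15 FIR inner loop built this way consumes two samples and two coefficients per step, which is where the factor-of-two upper bound on the SIMD speedup for 16-bit data comes from.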
/* ── perf_comparison.c ───────────────────────────────────────────────────
* Compare Q15 vs float32 FIR performance using DWT cycle counter.
* Run on Cortex-M4F at 168 MHz (STM32F407).
* ──────────────────────────────────────────────────────────────────────── */
#include "arm_math.h"
#include "stm32f4xx.h" /* device header: pulls in core_cm4.h (DWT, CoreDebug) */
#include <stdint.h>
#include <stdio.h> /* printf() */
#define TAPS 64U
#define BLOCK 256U
/* Enable DWT cycle counter (one-time setup) */
static void dwt_enable(void)
{
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0U;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}
static uint32_t dwt_cycles(void) { return DWT->CYCCNT; }
static float32_t g_f32_coeffs[TAPS];
static float32_t g_f32_state[TAPS + BLOCK - 1U];
static float32_t g_f32_in[BLOCK], g_f32_out[BLOCK];
static arm_fir_instance_f32 g_f32_fir;
static q15_t g_q15_coeffs[TAPS];
static q15_t g_q15_state[TAPS + BLOCK]; /* Q15 FIR on Cortex-M3/M4 needs numTaps + blockSize */
static q15_t g_q15_in[BLOCK], g_q15_out[BLOCK];
static arm_fir_instance_q15 g_q15_fir;
void perf_compare(void)
{
dwt_enable();
arm_fir_init_f32(&g_f32_fir, TAPS, g_f32_coeffs, g_f32_state, BLOCK);
arm_fir_init_q15(&g_q15_fir, TAPS, g_q15_coeffs, g_q15_state, BLOCK);
/* Benchmark float32 FIR */
uint32_t t0 = dwt_cycles();
arm_fir_f32(&g_f32_fir, g_f32_in, g_f32_out, BLOCK);
uint32_t f32_cycles = dwt_cycles() - t0;
/* Benchmark Q15 FIR */
t0 = dwt_cycles();
arm_fir_fast_q15(&g_q15_fir, g_q15_in, g_q15_out, BLOCK);
uint32_t q15_cycles = dwt_cycles() - t0;
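/* The filter performs TAPS * BLOCK = 16,384 multiply-accumulates per call, so
 * expect counts on the order of 10^4 cycles: even at the 2 MACs/cycle peak of
 * the dual 16-bit MAC, the Q15 kernel cannot go below ~8,200 cycles.
 * The ratio reported here is roughly 2.3x in favour of Q15 on an STM32F407.
 * Counts vary with flash wait states and compiler options (the Cortex-M4 has
 * no data cache); run several iterations and average. */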
printf("float32: %lu cycles | Q15: %lu cycles\r\n",
(unsigned long)f32_cycles, (unsigned long)q15_cycles);
}
| Format | Bit Width | Range | Precision | When to Use |
|---|---|---|---|---|
| Q7 | 8-bit | -1.0 to +0.992 | ~40 dB SNR | RAM-constrained M0, coarse classification |
| Q15 | 16-bit | -1.0 to +0.99997 | ~90 dB SNR | M4/M7 SIMD, audio processing, sensor fusion |
| Q31 | 32-bit | -1.0 to +1.0 | ~186 dB SNR | High precision without FPU on M3/M4 |
| float32 | 32-bit IEEE 754 | ±3.4×10^38 | ~150 dB SNR | M4F/M7F with FPU, convenience, rapid prototyping |
Exercises
Exercise 1
Beginner
Design and Implement a 50 Hz Notch Filter
Design a second-order IIR notch filter centred at 50 Hz for a 1 kHz sample rate using SciPy (scipy.signal.iirnotch(50, 30, 1000)). Extract the b and a coefficients, convert them to the CMSIS-DSP [b0, b1, b2, a1, a2] layout (remembering to negate a[1] and a[2]), and implement the filter using arm_biquad_cascade_df2T_f32(). Test with synthetic data: a 10 Hz sine wave corrupted with a 50 Hz sine wave. Verify the output amplitude at 50 Hz is attenuated by at least 30 dB while the 10 Hz component is unchanged. Plot the input and output spectra in Python from the captured data.
IIR Biquad
Coefficient Conversion
Notch Filter
Exercise 2
Intermediate
Peak-Frequency Detection Using RFFT and arm_max_f32
Implement a complete spectral analysis pipeline: capture 1024 samples from an ADC at a known sample rate, apply a Hann window using arm_mult_f32(), compute the real FFT with arm_rfft_fast_f32(), compute magnitudes with arm_cmplx_mag_f32(), and locate the peak frequency bin using arm_max_f32(). Test by driving the MCU ADC input with a function generator at three known frequencies (100 Hz, 500 Hz, 2 kHz). For each, verify the detected peak frequency is within ±(fs/N) of the true frequency. Explain why the resolution improves if you increase N from 512 to 1024.
arm_rfft_fast_f32
Windowing
Spectral Analysis
Exercise 3
Advanced
Benchmark FIR Filter in Q15 vs float32 on Real Hardware
Implement a 64-tap low-pass FIR filter in both float32 and Q15 variants using arm_fir_f32() and arm_fir_fast_q15(). Use the DWT cycle counter (CoreDebug/DWT registers) to measure the exact clock cycles consumed by each variant for a 256-sample block. Run the benchmark at three optimisation levels: -O0, -O2, and -Os. Record results in a table. Convert Q15 coefficients from float32 using arm_float_to_q15() and verify output equivalence: both filters applied to the same input should produce outputs within 0.01% of each other (Q15 quantisation error). Discuss the trade-off between execution time, RAM usage, and numerical accuracy.
DWT Profiling
Q15 vs float32
Optimisation Levels
Conclusion & Next Steps
In this article we have worked through the full CMSIS-DSP toolkit from fundamentals to implementation:
- Sampling theory: the Nyquist theorem, aliasing prevention with analogue anti-aliasing filters, and quantisation noise set hard limits that no digital processing can overcome.
- CMSIS-DSP architecture: instance structs keep state external, enabling multiple independent filter instances; compile-time flags (ARM_MATH_CM4) unlock SIMD optimisations.
- FIR filters: arm_fir_init_f32 + arm_fir_f32 implement windowed-sinc designs in block mode; state buffer sizing (NUM_TAPS + BLOCK_SIZE − 1) is critical.
- IIR biquad filters: arm_biquad_cascade_df2T_f32 achieves high-order responses with few coefficients; mind the CMSIS sign convention for feedback coefficients.
- FFT: arm_rfft_fast_f32 + arm_cmplx_mag_f32 + arm_max_f32 form a complete spectral analysis pipeline; windowing is essential to suppress spectral leakage.
- Performance: Q15 on Cortex-M4 with SIMD gives ~2–3x speedup over float32; the DWT cycle counter is the definitive benchmarking tool.
Next in the Series
In Part 7: CMSIS-Driver — UART, SPI & I2C, we shift focus to the peripheral abstraction layer: the ARM_DRIVER_xx struct pattern, asynchronous callback events, DMA-backed transfers, and how RTOS semaphores turn non-blocking drivers into clean blocking APIs for application code.
Related Articles in This Series
Part 7: CMSIS-Driver — UART, SPI & I2C
Understand vendor-independent peripheral drivers, the callback event model, and how to wrap CMSIS-Driver in RTOS-blocking APIs for clean application code.
Part 18: Performance Optimization
Deep-dive into compiler flags, inline assembly, Cortex-M7 cache configuration, and profiling techniques that apply directly to the DSP pipeline benchmarking introduced here.
Part 14: DMA & High-Performance Data Handling
Feed your CMSIS-DSP filters directly from DMA ping-pong buffers for zero-CPU data acquisition — the professional approach to continuous signal processing pipelines.