Series Context: This is Part 14 of our 20-part CMSIS Mastery Series. We now enter high-performance territory — DMA removes the CPU from the critical path for bulk data movement, enabling parallel processing and deterministic throughput. Parts 11 (interrupts) and 12 (memory management) are useful prerequisite reading.
| Part | Title | Topics | Status |
| --- | --- | --- | --- |
| 1 | Overview & ARM Cortex-M Ecosystem | CMSIS layers, Cortex-M families, memory map, toolchains | Completed |
| 2 | CMSIS-Core: Registers, NVIC & SysTick | core_cmX.h, register access, interrupt controller, SysTick timer | Completed |
| 3 | Startup Code, Linker Scripts & Vector Table | Reset handler, BSS init, scatter files, boot process | Completed |
| 4 | CMSIS-RTOS2: Threads, Mutexes & Semaphores | Thread management, synchronization primitives, scheduling | Completed |
| 5 | CMSIS-RTOS2: Message Queues & Event Flags | Inter-thread comms, ISR-to-thread, real-time design patterns | Completed |
| 6 | CMSIS-DSP: Filters, FFT & Math Functions | FIR/IIR filters, FFT, SIMD optimizations | Completed |
| 7 | CMSIS-Driver: UART, SPI & I2C | Driver abstraction layer, callbacks, DMA integration | Completed |
| 8 | CMSIS-Pack & Software Components | Pack files, device support, dependency management | Completed |
| 9 | Debugging with CMSIS-DAP & CoreSight | SWD/JTAG, HardFault analysis, ITM tracing | Completed |
| 10 | Portable Firmware: Multi-Vendor Projects | HAL vs CMSIS, cross-platform BSPs, reusable libraries | Completed |
| 11 | Interrupts, Concurrency & Real-Time Constraints | Interrupt latency, critical sections, lock-free programming | Completed |
| 12 | Memory Management in Embedded Systems | Static vs dynamic, heap fragmentation, memory pools | Completed |
| 13 | Low Power & Energy Optimization | Sleep modes, clock gating, tickless RTOS, power profiling | Completed |
| 14 | DMA & High-Performance Data Handling | DMA basics, peripheral transfers, zero-copy techniques | You Are Here |
| 15 | Security: ARMv8-M & TrustZone | Secure/non-secure worlds, secure boot, firmware protection | Upcoming |
| 16 | Bootloaders & Firmware Updates | OTA updates, dual-bank flash, fail-safe strategies | Upcoming |
| 17 | Testing & Validation | Unity/Ceedling unit tests, HIL testing, integration testing | Upcoming |
| 18 | Performance Optimization | Compiler flags, inline assembly, cache (M7/M33), profiling | Upcoming |
| 19 | Embedded Software Architecture | Layered design, event-driven, state machines, component-based | Upcoming |
| 20 | Tooling & Workflow (Professional Level) | CI/CD for embedded, MISRA, static analysis, Doxygen | Upcoming |
DMA Fundamentals
Direct Memory Access (DMA) is one of the most powerful performance tools in the embedded developer's arsenal. At its core, a DMA controller is a dedicated hardware block that moves data between memory regions or between peripherals and memory — entirely independently of the CPU. While DMA transfers are in progress, the processor is free to execute application code, sleep to save power, or handle other interrupts.
Why DMA Matters
Consider receiving 1024 bytes over UART at 115200 baud. Without DMA, your CPU must handle each received byte via an interrupt — 1024 interrupt entries and exits, each with register stacking, a vector fetch, and a buffer write. At 115200 baud each byte takes ~87 µs, so the CPU must be responsive within that window for every byte. With DMA, the UART peripheral triggers a single DMA transfer that captures all 1024 bytes autonomously. The CPU receives one interrupt at completion — or two if using half-transfer notification for double buffering. The difference in CPU load is dramatic: from O(N) interrupts down to O(1).
Rule of Thumb: For any peripheral transfer larger than ~4 bytes that happens repeatedly, DMA is almost always the right choice. The breakeven point where DMA setup overhead is recouped by reduced interrupt overhead is typically around 8–16 bytes.
Transfer Types
| Transfer Type | Source | Destination | Availability | Flow Controller | Typical Use Case |
| --- | --- | --- | --- | --- | --- |
| Memory-to-Memory | SRAM/Flash | SRAM | Most DMA controllers (not all channels) | DMA | Fast buffer copy, frame buffer fill, memset |
| Memory-to-Peripheral | SRAM | Peripheral DR | All DMA controllers | DMA or Peripheral | UART Tx, SPI Tx, DAC output stream |
| Peripheral-to-Memory | Peripheral DR | SRAM | All DMA controllers | DMA or Peripheral | UART Rx, ADC capture, SPI Rx |
| Peripheral-to-Peripheral | Peripheral DR | Peripheral DR | Limited (STM32 BDMA, some GPDMA) | Peripheral | Timer-triggered DAC output, ADC-to-DAC loopback |
DMA Controller Architecture
On STM32 devices, the DMA controller exposes a set of streams (STM32F4/H7 terminology) or channels (STM32L4/G4 terminology). Each stream/channel handles one transfer at a time and must be assigned to a specific peripheral request (also called a DMA request line or mux channel). The DMAMUX peripheral on newer STM32 devices allows flexible routing of any peripheral request to any DMA channel.
Key hardware parameters to configure for each DMA transfer:
- Direction: peripheral-to-memory, memory-to-peripheral, or memory-to-memory
- Data width: byte (8-bit), half-word (16-bit), or word (32-bit) — source and destination widths can differ
- Address increment: whether to auto-increment the source/destination pointer after each beat
- Circular mode: whether the transfer automatically restarts when complete
- Priority: low / medium / high / very high — arbitration between simultaneous requests
- FIFO / burst: whether the DMA uses a FIFO to batch transfers into AHB bursts (improves bus utilisation)
Peripheral-to-Memory Transfers
Peripheral-to-memory is the most common DMA use case. The peripheral (UART, SPI, ADC) generates a DMA request each time it has data ready. The DMA controller responds by reading from the peripheral data register and writing to the next location in your receive buffer — without any CPU involvement.
UART Rx via DMA — Ping-Pong Buffer
The following example configures STM32 UART1 receive with DMA in circular mode, using a ping-pong buffer. The DMA generates a half-transfer interrupt when the first half of the buffer is full, and a full-transfer (transfer-complete) interrupt when the second half fills. This allows continuous receive with zero data loss while processing occurs on the idle half.
/**
* UART DMA Rx — Circular Ping-Pong Buffer
 * Target: STM32F4xx (H7 parts differ: DMAMUX request routing, USART1->RDR instead of DR)
* Uses LL (Low-Layer) DMA API for clarity; HAL equivalent is similar.
*/
#include "stm32f4xx.h"
#include "stm32f4xx_ll_dma.h"
#include "stm32f4xx_ll_usart.h"
#define UART_RX_BUF_SIZE 256u /* total buffer — two halves of 128 */
#define UART_RX_HALF (UART_RX_BUF_SIZE / 2u)
/* Ping-pong buffer: DMA writes here, CPU reads the idle half */
static uint8_t uart_rx_buf[UART_RX_BUF_SIZE];
/* Which half is ready for processing: 0 = first half, 1 = second half */
static volatile uint8_t rx_half_ready = 0xFFu; /* 0xFF = nothing ready */
void uart_dma_init(void)
{
/* 1. Enable DMA2 clock (USART1 Rx is on DMA2 Stream5 Channel4 on F4) */
RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;
__DSB();
/* 2. Ensure stream is disabled before configuring */
LL_DMA_DisableStream(DMA2, LL_DMA_STREAM_5);
while (LL_DMA_IsEnabledStream(DMA2, LL_DMA_STREAM_5)) {}
/* 3. Configure DMA2 Stream5 — peripheral-to-memory, circular */
LL_DMA_ConfigTransfer(DMA2, LL_DMA_STREAM_5,
LL_DMA_DIRECTION_PERIPH_TO_MEMORY |
LL_DMA_MODE_CIRCULAR |
LL_DMA_PERIPH_NOINCREMENT |
LL_DMA_MEMORY_INCREMENT |
LL_DMA_PDATAALIGN_BYTE |
LL_DMA_MDATAALIGN_BYTE |
LL_DMA_PRIORITY_HIGH);
LL_DMA_SetChannelSelection(DMA2, LL_DMA_STREAM_5, LL_DMA_CHANNEL_4);
LL_DMA_SetPeriphAddress(DMA2, LL_DMA_STREAM_5,
(uint32_t)&USART1->DR);
LL_DMA_SetMemoryAddress(DMA2, LL_DMA_STREAM_5,
(uint32_t)uart_rx_buf);
LL_DMA_SetDataLength(DMA2, LL_DMA_STREAM_5, UART_RX_BUF_SIZE);
/* 4. Enable half-transfer and transfer-complete interrupts */
LL_DMA_EnableIT_HT(DMA2, LL_DMA_STREAM_5);
LL_DMA_EnableIT_TC(DMA2, LL_DMA_STREAM_5);
NVIC_SetPriority(DMA2_Stream5_IRQn, 5);
NVIC_EnableIRQ(DMA2_Stream5_IRQn);
/* 5. Enable DMA Rx on USART1, then start the DMA stream */
LL_USART_EnableDMAReq_RX(USART1);
LL_DMA_EnableStream(DMA2, LL_DMA_STREAM_5);
}
/* DMA2 Stream5 ISR */
void DMA2_Stream5_IRQHandler(void)
{
if (LL_DMA_IsActiveFlag_HT5(DMA2)) {
LL_DMA_ClearFlag_HT5(DMA2);
rx_half_ready = 0u; /* first half [0..127] ready */
}
if (LL_DMA_IsActiveFlag_TC5(DMA2)) {
LL_DMA_ClearFlag_TC5(DMA2);
rx_half_ready = 1u; /* second half [128..255] ready */
}
}
/* Called from application task / main loop */
void process_uart_data(void)
{
uint8_t half = rx_half_ready;
if (half == 0xFFu) return; /* nothing ready */
rx_half_ready = 0xFFu; /* consume */
const uint8_t *buf = uart_rx_buf + (half * UART_RX_HALF);
/* Process UART_RX_HALF bytes from buf — zero-copy, no memcpy needed */
for (uint32_t i = 0; i < UART_RX_HALF; i++) {
/* handle buf[i] */
(void)buf[i];
}
}
Half-Transfer & Full-Transfer Interrupts
In circular DMA mode the controller generates two interrupts per complete buffer rotation:
- HT (Half-Transfer): fired when the DMA pointer crosses the midpoint. At this moment, the first half of the buffer contains valid data and the DMA is writing into the second half — safe to process the first half.
- TC (Transfer-Complete): fired when the DMA pointer wraps back to the start. The second half contains valid data and the DMA is now writing into the first half again — safe to process the second half.
Critical Timing: You must finish processing one half before the DMA fills it again. At 115200 baud, you have ~11 ms to process 128 bytes. At 1 Mbaud you only have ~1.28 ms — use an RTOS thread and notify from the ISR rather than processing in the interrupt handler.
Circular DMA Mode
Circular mode is the key to continuous, zero-CPU-involvement data streaming. When the DMA reaches the end of the configured buffer it automatically resets its pointer to the buffer start and continues — no software restart required. This is essential for ADC streaming, audio capture, and any application that requires continuous data acquisition.
Ping-Pong Buffer Pattern
The ping-pong (double-buffer) pattern splits the DMA buffer into two equal halves. While the DMA writes into one half (the "active" half), the CPU processes the other half (the "ready" half). The roles swap at each half-transfer and transfer-complete interrupt.
/**
* Generic ping-pong buffer manager — CPU-side processing example.
* Buffer layout: [pingBuf | pongBuf] — each HALF_SIZE bytes.
 */
#include <stdint.h>
#include <stdbool.h>
#include "arm_math.h" /* float32_t */
#define HALF_SIZE 512u
#define BUF_SIZE (HALF_SIZE * 2u)
/* DMA writes into this buffer in circular mode */
__attribute__((aligned(32))) /* 32-byte alignment for M7 cache lines */
static uint16_t dma_buf[BUF_SIZE];
/* Processed results live here — separate from DMA buffer */
static float32_t result_ping[HALF_SIZE];
static float32_t result_pong[HALF_SIZE];
typedef enum { HALF_PING = 0, HALF_PONG = 1 } buf_half_t;
static volatile buf_half_t pending_half = (buf_half_t)0xFFu;
static volatile bool processing_busy = false;
/* Called from DMA ISR — minimal work here */
void dma_half_complete_callback(buf_half_t half)
{
if (!processing_busy) {
pending_half = half;
}
/* If processing_busy, we've overrun — log and handle in production */
}
/* Called from application thread / main loop */
void run_dsp_pipeline(void)
{
if (pending_half == (buf_half_t)0xFFu) return;
processing_busy = true;
buf_half_t half = pending_half;
pending_half = (buf_half_t)0xFFu;
const uint16_t *src = dma_buf + (half == HALF_PING ? 0 : HALF_SIZE);
float32_t *dest = (half == HALF_PING) ? result_ping : result_pong;
/* Convert ADC counts to voltage and apply DSP */
for (uint32_t i = 0; i < HALF_SIZE; i++) {
dest[i] = (float32_t)src[i] * (3.3f / 4095.0f); /* 12-bit ADC */
}
/* Run CMSIS-DSP filter on dest[] here */
processing_busy = false;
}
ADC DMA Circular Double Buffering
High-speed ADC acquisition is the canonical DMA use case. At 1 MSPS (1 million samples per second) you have 1 µs per sample — nowhere near enough time for an interrupt-per-sample approach. DMA in circular mode with double buffering is the only viable solution.
ADC DMA Setup — STM32H7
/**
* ADC1 continuous conversion with DMA circular double-buffer.
* Target: STM32H743 @ 480 MHz, ADC clock = 36 MHz, 12-bit, 1 MSPS.
*
* Buffer: 2048 samples total (two halves of 1024 = 1 ms per half @ 1 MSPS).
*/
#include "stm32h7xx.h"
#include "stm32h7xx_ll_adc.h"
#include "stm32h7xx_ll_dma.h"
#include <stdbool.h>
#define ADC_BUF_TOTAL 2048u
#define ADC_BUF_HALF (ADC_BUF_TOTAL / 2u)
/* 32-byte aligned for D-cache line flush/invalidate on M7 */
__attribute__((section(".dma_buffers"), aligned(32)))
static uint16_t adc_dma_buf[ADC_BUF_TOTAL];
static volatile bool adc_half_rdy = false;
static volatile bool adc_full_rdy = false;
void adc_dma_init(void)
{
/* Assumes ADC clocking, calibration and enable are configured elsewhere. */
/* ADC1/ADC2 requests route through DMAMUX1 to DMA1/DMA2; ADC3 uses BDMA; */
/* check the DMA request mapping in the reference manual for your part. */
RCC->AHB1ENR |= RCC_AHB1ENR_DMA1EN; /* enable DMA1 clock */
__DSB();
LL_DMA_DisableStream(DMA1, LL_DMA_STREAM_1);
while (LL_DMA_IsEnabledStream(DMA1, LL_DMA_STREAM_1)) {}
LL_DMA_SetPeriphRequest(DMA1, LL_DMA_STREAM_1, LL_DMAMUX1_REQ_ADC1);
LL_DMA_SetDataTransferDirection(DMA1, LL_DMA_STREAM_1,
LL_DMA_DIRECTION_PERIPH_TO_MEMORY);
LL_DMA_SetMode(DMA1, LL_DMA_STREAM_1, LL_DMA_MODE_CIRCULAR);
LL_DMA_SetPeriphIncMode(DMA1, LL_DMA_STREAM_1, LL_DMA_PERIPH_NOINCREMENT);
LL_DMA_SetMemoryIncMode(DMA1, LL_DMA_STREAM_1, LL_DMA_MEMORY_INCREMENT);
LL_DMA_SetPeriphSize(DMA1, LL_DMA_STREAM_1, LL_DMA_PDATAALIGN_HALFWORD);
LL_DMA_SetMemorySize(DMA1, LL_DMA_STREAM_1, LL_DMA_MDATAALIGN_HALFWORD);
LL_DMA_SetStreamPriority(DMA1, LL_DMA_STREAM_1, LL_DMA_PRIORITY_VERYHIGH);
LL_DMA_SetPeriphAddress(DMA1, LL_DMA_STREAM_1,
LL_ADC_DMA_GetRegAddr(ADC1, LL_ADC_DMA_REG_REGULAR_DATA));
LL_DMA_SetMemoryAddress(DMA1, LL_DMA_STREAM_1, (uint32_t)adc_dma_buf);
LL_DMA_SetDataLength(DMA1, LL_DMA_STREAM_1, ADC_BUF_TOTAL);
LL_DMA_EnableIT_HT(DMA1, LL_DMA_STREAM_1);
LL_DMA_EnableIT_TC(DMA1, LL_DMA_STREAM_1);
LL_DMA_EnableIT_TE(DMA1, LL_DMA_STREAM_1); /* transfer error */
NVIC_SetPriority(DMA1_Stream1_IRQn, 4);
NVIC_EnableIRQ(DMA1_Stream1_IRQn);
LL_ADC_REG_SetDMATransfer(ADC1, LL_ADC_REG_DMA_TRANSFER_UNLIMITED);
LL_DMA_EnableStream(DMA1, LL_DMA_STREAM_1);
LL_ADC_REG_StartConversion(ADC1);
}
void DMA1_Stream1_IRQHandler(void)
{
if (LL_DMA_IsActiveFlag_HT1(DMA1)) {
LL_DMA_ClearFlag_HT1(DMA1);
adc_half_rdy = true; /* first 1024 samples ready */
}
if (LL_DMA_IsActiveFlag_TC1(DMA1)) {
LL_DMA_ClearFlag_TC1(DMA1);
adc_full_rdy = true; /* second 1024 samples ready */
}
if (LL_DMA_IsActiveFlag_TE1(DMA1)) {
LL_DMA_ClearFlag_TE1(DMA1);
/* Handle DMA error — log, halt, or attempt recovery */
__BKPT(0);
}
}
Buffer Switching in the DMA ISR
The ISR should be kept minimal — set a flag or release a semaphore. The actual processing belongs in a dedicated RTOS thread. Using osThreadFlagsSet() from within the ISR is the idiomatic CMSIS-RTOS2 pattern (covered in Part 5).
/* Application thread processing ADC data — CMSIS-RTOS2 */
#include "cmsis_os2.h"
#include "arm_math.h" /* CMSIS-DSP */
#define FFT_SIZE 1024u
static arm_rfft_fast_instance_f32 fft_inst;
static float32_t fft_input[FFT_SIZE];
static float32_t fft_output[FFT_SIZE];
void adc_processing_thread(void *arg)
{
arm_rfft_fast_init_f32(&fft_inst, FFT_SIZE);
for (;;) {
/* Wait for either half; the DMA ISR sets flags 0x01/0x02 via osThreadFlagsSet() (see the ISR-notification example later in this part) */
uint32_t flags = osThreadFlagsWait(0x03u, osFlagsWaitAny, osWaitForever);
const uint16_t *src = (flags & 0x01u)
? adc_dma_buf /* first half */
: adc_dma_buf + ADC_BUF_HALF; /* second half */
/* If D-cache is enabled (M7), invalidate this half before reading; see the cache coherency section */
/* Convert to float */
for (uint32_t i = 0; i < FFT_SIZE; i++) {
fft_input[i] = (float32_t)src[i] * (3.3f / 4095.0f);
}
/* Real FFT using CMSIS-DSP */
arm_rfft_fast_f32(&fft_inst, fft_input, fft_output, 0);
/* fft_output now contains frequency-domain data */
}
}
Cache Coherency on Cortex-M7
The Cortex-M7 is the first Cortex-M core to include L1 data cache (D-cache). This introduces a subtle but dangerous problem: cache coherency. The DMA controller is a bus master that accesses memory directly via the AHB/AXI bus, bypassing the CPU's cache entirely. If the CPU's D-cache contains a stale copy of the DMA's destination buffer, the CPU will read old data. Conversely, if the CPU has written modified data in cache that hasn't been flushed to RAM, the DMA will transmit old data from RAM.
The D-Cache Problem Illustrated
Silent Data Corruption: Cache coherency bugs are among the hardest to debug in embedded systems. The code looks correct, the DMA configuration is correct, but the data processed by the CPU is silently stale — and the bug may only manifest at certain buffer sizes or alignment boundaries, making it intermittent.
Clean & Invalidate APIs (CMSIS-Core)
CMSIS-Core provides SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() for software cache management. The naming convention follows the ARM architecture manual:
- Clean: write dirty cache lines back to RAM (so DMA can read correct data)
- Invalidate: mark cache lines as invalid (so CPU will re-read from RAM after DMA has written)
/**
* Cache coherency wrappers for DMA buffers on Cortex-M7.
* Must be called with correct size and 32-byte aligned addresses
* (one cache line = 32 bytes on Cortex-M7).
*/
#include "core_cm7.h"
#define CACHE_LINE_SIZE 32u
/* Round up to next cache-line boundary */
#define CACHE_ALIGN_SIZE(sz) \
(((sz) + CACHE_LINE_SIZE - 1u) & ~(CACHE_LINE_SIZE - 1u))
/**
* Call BEFORE starting a DMA transmit (Memory-to-Peripheral):
* Ensures any CPU-modified data is written back to RAM before DMA reads.
*/
void dma_tx_prepare(void *buf, uint32_t len)
{
SCB_CleanDCache_by_Addr((uint32_t *)buf, (int32_t)CACHE_ALIGN_SIZE(len));
}
/**
* Call AFTER DMA receive completes (Peripheral-to-Memory):
* Invalidates the cache so CPU reads fresh DMA-written data from RAM.
*/
void dma_rx_complete(void *buf, uint32_t len)
{
SCB_InvalidateDCache_by_Addr((uint32_t *)buf,
(int32_t)CACHE_ALIGN_SIZE(len));
}
/* Example: SPI DMA transmit with cache management */
void spi_dma_send(const uint8_t *data, uint32_t len)
{
/* Ensure data is in RAM (not just cache) before DMA reads it */
dma_tx_prepare((void *)data, len);
/* Now start DMA — it will read correct data from RAM */
LL_DMA_SetMemoryAddress(DMA2, LL_DMA_STREAM_3, (uint32_t)data);
LL_DMA_SetDataLength(DMA2, LL_DMA_STREAM_3, len);
LL_DMA_EnableStream(DMA2, LL_DMA_STREAM_3);
}
/* DMA receive complete callback */
void spi_dma_rx_complete_callback(uint8_t *rx_buf, uint32_t len)
{
/* Invalidate before CPU reads — DMA has written to RAM */
dma_rx_complete(rx_buf, len);
/* Now safe to access rx_buf — CPU will re-read from RAM */
process_spi_data(rx_buf, len);
}
MPU-Based Solution — Non-Cacheable Regions
For DMA buffers that are continuously updated, repeatedly calling clean/invalidate adds overhead and complexity. A cleaner solution is to place DMA buffers in a memory region configured as non-cacheable via the Memory Protection Unit (MPU). The CPU accesses these regions without caching — every access goes directly to RAM — eliminating coherency concerns entirely at the cost of slightly slower CPU access.
/**
* Configure MPU region as non-cacheable for DMA buffers.
* Place DMA buffers at DMA_BUFFER_BASE in the linker script.
* Suitable for STM32H7 with AXI SRAM or dedicated DMA SRAM regions.
*/
#include "core_cm7.h"
#define DMA_BUFFER_BASE 0x30040000u /* D2 SRAM3 on STM32H743 (32 KB) */
#define DMA_BUFFER_SIZE 0x8000u /* 32 KB */
void mpu_configure_dma_region(void)
{
/* Disable MPU during configuration */
ARM_MPU_Disable();
/* Region 0: DMA buffers — Normal memory, non-cacheable, shareable */
MPU->RNR = 0u;
MPU->RBAR = DMA_BUFFER_BASE;
MPU->RASR = ARM_MPU_RASR(
1u, /* XN = 1: no instruction fetch from this region */
ARM_MPU_AP_FULL, /* full access */
1u, 1u, 0u, 0u, /* TEX=1, S=1, C=0, B=0 = Normal, non-cacheable, shareable */
0x00u, /* SRD: no sub-regions disabled */
ARM_MPU_REGION_SIZE_32KB
);
/* Re-enable MPU with background region (allows cached access elsewhere) */
ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);
__DSB();
__ISB();
}
/* DMA buffers must be placed at DMA_BUFFER_BASE in the linker script:
* .dma_buffers (NOLOAD) :
* {
* *(.dma_buffers)
* } > DMA_SRAM
*/
DMA with RTOS & Memory-to-Memory
Memory-to-Memory DMA vs memcpy()
DMA can also be used for memory-to-memory transfers — replacing memcpy() for large blocks. Whether this is faster than CPU-based copy depends on the memory bus configuration and the MCU. On M7 with TCM memory, a CPU-optimised memcpy() can outperform DMA for small blocks. For large transfers (>4 KB) to/from AXI SRAM, DMA typically wins because it can burst the AHB bus without competing with instruction fetches.
/**
* Memory-to-memory DMA on STM32F4 — DMA2 Stream0 Channel0.
* Benchmark: compare against memcpy() using DWT cycle counter.
*/
#include "stm32f4xx.h"
#include "core_cm4.h"
#include <string.h> /* memcpy */
#define BUF_LEN 4096u
static uint8_t src_buf[BUF_LEN] __attribute__((aligned(4)));
static uint8_t dst_buf[BUF_LEN] __attribute__((aligned(4)));
/* Enable DWT cycle counter */
static void dwt_enable(void)
{
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0u;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}
static uint32_t dwt_cycles(void) { return DWT->CYCCNT; }
/* Benchmark memcpy vs DMA */
void benchmark_copy(void)
{
dwt_enable();
/* --- memcpy benchmark --- */
DWT->CYCCNT = 0u;
memcpy(dst_buf, src_buf, BUF_LEN);
uint32_t cpu_cycles = dwt_cycles();
/* --- DMA memory-to-memory benchmark --- */
RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN; __DSB();
DMA2_Stream0->CR &= ~DMA_SxCR_EN;
while (DMA2_Stream0->CR & DMA_SxCR_EN) {}
DMA2_Stream0->PAR = (uint32_t)src_buf;
DMA2_Stream0->M0AR = (uint32_t)dst_buf;
DMA2_Stream0->NDTR = BUF_LEN / 4u; /* word transfers */
DMA2_Stream0->FCR = DMA_SxFCR_DMDIS; /* FIFO mode — required for mem-to-mem */
DMA2_Stream0->CR = (0u << DMA_SxCR_CHSEL_Pos) | /* Channel 0 */
DMA_SxCR_DIR_1 | /* DIR = 10: Mem-to-Mem */
DMA_SxCR_MINC | /* Mem (destination) increment */
DMA_SxCR_PINC | /* Periph port (source) increment */
(0x2u << DMA_SxCR_MSIZE_Pos) | /* 32-bit mem */
(0x2u << DMA_SxCR_PSIZE_Pos) | /* 32-bit periph */
DMA_SxCR_PL_1; /* High priority */
DWT->CYCCNT = 0u;
DMA2_Stream0->CR |= DMA_SxCR_EN;
while (!(DMA2->LISR & DMA_LISR_TCIF0)) {} /* wait for TC */
uint32_t dma_cycles = dwt_cycles();
DMA2->LIFCR = DMA_LIFCR_CTCIF0; /* clear TC flag */
/* Log results */
/* Results vary with core clock, wait states and bus contention; */
/* measure on your own hardware. The real win is that the CPU */
/* stays free while the DMA copy runs. */
(void)cpu_cycles;
(void)dma_cycles;
}
Notifying an RTOS Thread from a DMA ISR
The correct pattern for signalling an RTOS thread from a DMA ISR uses CMSIS-RTOS2 thread flags or semaphores. Never call blocking RTOS functions from an ISR: in CMSIS-RTOS2, use only functions documented as ISR-callable, such as osThreadFlagsSet(), osSemaphoreRelease(), and osMessageQueuePut() with zero timeout; in FreeRTOS, use the FromISR variants.
/* Thread ID — set during RTOS thread creation */
static osThreadId_t g_dma_thread_id = NULL;
#define FLAG_DMA_HALF (1u << 0u)
#define FLAG_DMA_FULL (1u << 1u)
/* DMA ISR — signal the processing thread */
void DMA1_Stream1_IRQHandler(void)
{
if (LL_DMA_IsActiveFlag_HT1(DMA1)) {
LL_DMA_ClearFlag_HT1(DMA1);
osThreadFlagsSet(g_dma_thread_id, FLAG_DMA_HALF);
}
if (LL_DMA_IsActiveFlag_TC1(DMA1)) {
LL_DMA_ClearFlag_TC1(DMA1);
osThreadFlagsSet(g_dma_thread_id, FLAG_DMA_FULL);
}
}
/* Processing thread */
void dma_processing_thread(void *arg)
{
g_dma_thread_id = osThreadGetId();
for (;;) {
uint32_t flags = osThreadFlagsWait(
FLAG_DMA_HALF | FLAG_DMA_FULL,
osFlagsWaitAny,
osWaitForever);
if (flags & FLAG_DMA_HALF) {
/* Invalidate cache for first half, then process */
SCB_InvalidateDCache_by_Addr((uint32_t *)adc_dma_buf,
ADC_BUF_HALF * sizeof(uint16_t));
process_samples(adc_dma_buf, ADC_BUF_HALF);
}
if (flags & FLAG_DMA_FULL) {
SCB_InvalidateDCache_by_Addr(
(uint32_t *)(adc_dma_buf + ADC_BUF_HALF),
ADC_BUF_HALF * sizeof(uint16_t));
process_samples(adc_dma_buf + ADC_BUF_HALF, ADC_BUF_HALF);
}
}
}
Common DMA Pitfalls
| Pitfall | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Cache coherency (M7) | CPU reads stale data; intermittent corruption at buffer boundaries | D-cache holds outdated copy; DMA wrote to RAM but cache not invalidated | SCB_InvalidateDCache_by_Addr() after DMA Rx; place DMA buffers in non-cacheable MPU region |
| Misaligned buffer address | HardFault or silent transfer error; DMA TE interrupt fires | DMA requires word-aligned addresses for 32-bit transfers; 32-byte alignment needed for cache ops | Use __attribute__((aligned(32))) on DMA buffers; check NDTR vs data width |
| DMA request collision | Transfer never starts; DMA stream stays busy | Two peripherals configured to share one DMA stream/channel | Check DMA request mapping table in reference manual; use DMAMUX on newer devices |
| Circular buffer race condition | Data appears corrupted every N frames; hard to reproduce | CPU processing half-buffer too slowly — DMA overwrites before processing completes | Increase buffer depth; move processing to a dedicated RTOS thread with higher priority; profile with DWT |
| Forgetting to clear DMA flags | ISR fires once then never again, or fires continuously | DMA flags are sticky — must be cleared manually in the ISR | Always clear HT, TC, and TE flags at the start of the ISR before processing |
| DMA buffer in Flash/const region | DMA appears to work but CPU reads zeros | DMA cannot write to Flash (read-only for AHB master writes) | Declare DMA receive buffers as non-const in SRAM; check linker script section placement |
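The alignment pitfall in the table can be caught at compile time rather than debugged at runtime. A sketch using the GCC/Clang aligned attribute and C11 _Static_assert (the buffer name here is illustrative):

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 32u
#define ADC_BUF_TOTAL   2048u

/* Hypothetical DMA buffer with explicit cache-line alignment. */
__attribute__((aligned(32)))
static uint16_t adc_dma_buf2[ADC_BUF_TOTAL];

/* Catch size mistakes at compile time: the buffer must span a whole
 * number of cache lines, or clean/invalidate operations will also
 * touch whatever variables happen to share the boundary lines. */
_Static_assert((sizeof(adc_dma_buf2) % CACHE_LINE_SIZE) == 0u,
               "DMA buffer must be a multiple of the cache line size");
_Static_assert((ADC_BUF_TOTAL % 2u) == 0u,
               "ping-pong buffer needs an even number of samples");

/* Accessor so the buffer's placement can be checked at runtime too. */
static uint16_t *dma_buf_base(void) { return adc_dma_buf2; }
```

A failed _Static_assert stops the build with the message shown, which is far cheaper than chasing an intermittent TE interrupt or a corrupted neighbour variable.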
Exercises
Exercise 1
Intermediate
Zero-Copy UART Logger with DMA Double Buffer
Implement a UART logging system that receives log strings over UART using DMA in circular mode with a ping-pong buffer. The receive half should be parsed (look for \n delimiters) and forwarded to a ring buffer without any memcpy(). Measure CPU utilisation with and without DMA using the DWT cycle counter. Target: <2% CPU load at 115200 baud receiving 1000 bytes/second.
DMA Circular
Zero-Copy
UART Rx
DWT Profiling
Exercise 2
Intermediate
1 MSPS ADC Capture with Background FFT
Configure your ADC (STM32 or equivalent) for continuous conversion at the maximum supported sample rate using DMA circular double-buffering. In the background RTOS thread, run a 1024-point CMSIS-DSP real FFT on each completed half-buffer. Display the dominant frequency component over UART. Verify correctness by feeding a known signal frequency from a signal generator or the MCU's own DAC.
ADC DMA
CMSIS-DSP FFT
RTOS Thread Flags
Double Buffer
Exercise 3
Advanced
Enable D-Cache on M7 and Fix Cache Coherency
Take an existing DMA project (UART Rx or SPI) running on a Cortex-M7 (STM32H7 or STM32F7) with D-cache disabled. Enable D-cache (SCB_EnableDCache()) and observe the data corruption. Then fix it using two approaches: (1) software clean/invalidate with SCB_CleanDCache_by_Addr() / SCB_InvalidateDCache_by_Addr(), and (2) MPU non-cacheable region for the DMA buffers. Compare the CPU overhead of both approaches using DWT.
D-Cache
Cache Coherency
MPU
Cortex-M7
Conclusion & Next Steps
DMA is a foundational technique for building high-performance, CPU-efficient embedded firmware. In this part we covered:
- DMA fundamentals: transfer types, controller architecture, and the key configuration parameters (direction, data width, increment, circular mode, priority).
- Peripheral-to-memory transfers: configuring UART DMA Rx with circular mode, ping-pong buffers, and half-transfer/transfer-complete interrupts for zero-copy continuous receive.
- ADC double buffering: high-speed circular DMA capture with background CMSIS-DSP processing, using RTOS thread flags to signal between the ISR and the processing thread.
- Cache coherency on Cortex-M7: understanding the D-cache problem, using SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr(), and the MPU non-cacheable region approach.
- Memory-to-memory DMA: benchmarking against memcpy() using the DWT cycle counter, and the common pitfalls table.
Next in the Series
In Part 15: Security — ARMv8-M & TrustZone, we'll explore the hardware security architecture of ARMv8-M processors — partitioning flash and SRAM into Secure and Non-Secure worlds, configuring the Security Attribution Unit (SAU), creating secure entry function veneers, implementing PSA Crypto APIs, and building a minimal secure boot chain with signature verification.
Related Articles in This Series
Part 6: CMSIS-DSP — Filters, FFT & Math Functions
Master CMSIS-DSP signal processing — FIR/IIR filters, real and complex FFT, and SIMD-optimised math. DMA-fed ADC buffers feed directly into these algorithms.
Part 7: CMSIS-Driver — UART, SPI & I2C
CMSIS-Driver wraps peripheral access including DMA-backed transfers behind a standardised callback interface — the complement to the low-level DMA configuration shown here.
Part 11: Interrupts, Concurrency & Real-Time Constraints
DMA ISRs are interrupts — understanding priority grouping, preemption, and ISR-to-thread communication patterns is essential for robust DMA designs.