Series Context: This is Part 14 of our 20-part CMSIS Mastery Series. We now enter high-performance territory — DMA removes the CPU from the critical path for bulk data movement, enabling parallel processing and deterministic throughput. Parts 11 (interrupts) and 12 (memory management) are useful prerequisite reading.
| Part | Title | Topics | Status |
| --- | --- | --- | --- |
| 1 | Overview & ARM Cortex-M Ecosystem | CMSIS layers, Cortex-M families, memory map, toolchains | Completed |
| 2 | CMSIS-Core: Registers, NVIC & SysTick | core_cmX.h, register access, interrupt controller, SysTick timer | Completed |
| 3 | Startup Code, Linker Scripts & Vector Table | Reset handler, BSS init, scatter files, boot process | Completed |
| 4 | CMSIS-RTOS2: Threads, Mutexes & Semaphores | Thread management, synchronization primitives, scheduling | Completed |
| 5 | CMSIS-RTOS2: Message Queues & Event Flags | Inter-thread comms, ISR-to-thread, real-time design patterns | Completed |
| 6 | CMSIS-DSP: Filters, FFT & Math Functions | FIR/IIR filters, FFT, SIMD optimizations | Completed |
| 7 | CMSIS-Driver: UART, SPI & I2C | Driver abstraction layer, callbacks, DMA integration | Completed |
| 8 | CMSIS-Pack & Software Components | Pack files, device support, dependency management | Completed |
| 9 | Debugging with CMSIS-DAP & CoreSight | SWD/JTAG, HardFault analysis, ITM tracing | Completed |
| 10 | Portable Firmware: Multi-Vendor Projects | HAL vs CMSIS, cross-platform BSPs, reusable libraries | Completed |
| 11 | Interrupts, Concurrency & Real-Time Constraints | Interrupt latency, critical sections, lock-free programming | Completed |
| 12 | Memory Management in Embedded Systems | Static vs dynamic, heap fragmentation, memory pools | Completed |
| 13 | Low Power & Energy Optimization | Sleep modes, clock gating, tickless RTOS, power profiling | Completed |
| 14 | DMA & High-Performance Data Handling | DMA basics, peripheral transfers, zero-copy techniques | You Are Here |
| 15 | Security: ARMv8-M & TrustZone | Secure/non-secure worlds, secure boot, firmware protection | Upcoming |
| 16 | Bootloaders & Firmware Updates | OTA updates, dual-bank flash, fail-safe strategies | Upcoming |
| 17 | Testing & Validation | Unity/Ceedling unit tests, HIL testing, integration testing | Upcoming |
| 18 | Performance Optimization | Compiler flags, inline assembly, cache (M7/M33), profiling | Upcoming |
| 19 | Embedded Software Architecture | Layered design, event-driven, state machines, component-based | Upcoming |
| 20 | Tooling & Workflow (Professional Level) | CI/CD for embedded, MISRA, static analysis, Doxygen | Upcoming |
DMA Fundamentals
Direct Memory Access (DMA) is one of the most powerful performance tools in the embedded developer's arsenal. At its core, a DMA controller is a dedicated hardware block that moves data between memory regions or between peripherals and memory — entirely independently of the CPU. While DMA transfers are in progress, the processor is free to execute application code, sleep to save power, or handle other interrupts.
Why DMA Matters
Consider receiving 1024 bytes over UART at 115200 baud. Without DMA, your CPU must handle each received byte via an interrupt — 1024 interrupt entries and exits, each with register stacking, a vector fetch, and a buffer write. At 115200 baud each byte takes ~87 µs, so the CPU must be responsive within that window for every byte. With DMA, the UART peripheral triggers a single DMA transfer that captures all 1024 bytes autonomously. The CPU receives one interrupt at completion — or two if using half-transfer notification for double buffering. The difference in CPU load is dramatic: from O(N) interrupts down to O(1).
Rule of Thumb: For any peripheral transfer larger than ~4 bytes that happens repeatedly, DMA is almost always the right choice. The breakeven point where DMA setup overhead is recouped by reduced interrupt overhead is typically around 8–16 bytes.
Transfer Types
| Transfer Type | Source | Destination | Availability | Flow Controller | Typical Use Case |
| --- | --- | --- | --- | --- | --- |
| Memory-to-Memory | SRAM/Flash | SRAM | Most DMA controllers (not all channels) | DMA | Fast buffer copy, frame buffer fill, memset |
| Memory-to-Peripheral | SRAM | Peripheral DR | All DMA controllers | DMA or Peripheral | UART Tx, SPI Tx, DAC output stream |
| Peripheral-to-Memory | Peripheral DR | SRAM | All DMA controllers | DMA or Peripheral | UART Rx, ADC capture, SPI Rx |
| Peripheral-to-Peripheral | Peripheral DR | Peripheral DR | Limited (STM32 BDMA, some GPDMA) | Peripheral | Timer-triggered DAC output, ADC-to-DAC loopback |
DMA Controller Architecture
On STM32 devices, the DMA controller exposes a set of streams (STM32F4/H7 terminology) or channels (STM32L4/G4 terminology). Each stream/channel handles one transfer at a time and must be assigned to a specific peripheral request (also called a DMA request line or mux channel). The DMAMUX peripheral on newer STM32 devices allows flexible routing of any peripheral request to any DMA channel.
Key hardware parameters to configure for each DMA transfer:
- Direction: peripheral-to-memory, memory-to-peripheral, or memory-to-memory
- Data width: byte (8-bit), half-word (16-bit), or word (32-bit) — source and destination widths can differ
- Address increment: whether to auto-increment the source/destination pointer after each beat
- Circular mode: whether the transfer automatically restarts when complete
- Priority: low / medium / high / very high — arbitration between simultaneous requests
- FIFO / burst: whether the DMA uses a FIFO to batch transfers into AHB bursts (improves bus utilisation)
Peripheral-to-Memory Transfers
Peripheral-to-memory is the most common DMA use case. The peripheral (UART, SPI, ADC) generates a DMA request each time it has data ready. The DMA controller responds by reading from the peripheral data register and writing to the next location in your receive buffer — without any CPU involvement.
UART Rx via DMA — Ping-Pong Buffer
The following example configures STM32 UART1 receive with DMA in circular mode, using a ping-pong buffer. The DMA generates a half-transfer interrupt when the first half of the buffer is full, and a full-transfer (transfer-complete) interrupt when the second half fills. This allows continuous receive with zero data loss while processing occurs on the idle half.
/**
* UART DMA Rx — Circular Ping-Pong Buffer
 * Target: STM32F4xx (H7 parts differ: DMAMUX request routing, USART1->RDR instead of DR)
* Uses LL (Low-Layer) DMA API for clarity; HAL equivalent is similar.
*/
#include "stm32f4xx.h"
#include "stm32f4xx_ll_dma.h"
#include "stm32f4xx_ll_usart.h"
#define UART_RX_BUF_SIZE 256u /* total buffer — two halves of 128 */
#define UART_RX_HALF (UART_RX_BUF_SIZE / 2u)
/* Ping-pong buffer: DMA writes here, CPU reads the idle half */
static uint8_t uart_rx_buf[UART_RX_BUF_SIZE];
/* Which half is ready for processing: 0 = first half, 1 = second half */
static volatile uint8_t rx_half_ready = 0xFFu; /* 0xFF = nothing ready */
void uart_dma_init(void)
{
/* 1. Enable DMA2 clock (USART1 Rx is on DMA2 Stream5 Channel4 on F4) */
RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;
__DSB();
/* 2. Ensure stream is disabled before configuring */
LL_DMA_DisableStream(DMA2, LL_DMA_STREAM_5);
while (LL_DMA_IsEnabledStream(DMA2, LL_DMA_STREAM_5)) {}
/* 3. Configure DMA2 Stream5 — peripheral-to-memory, circular */
LL_DMA_ConfigTransfer(DMA2, LL_DMA_STREAM_5,
LL_DMA_DIRECTION_PERIPH_TO_MEMORY |
LL_DMA_MODE_CIRCULAR |
LL_DMA_PERIPH_NOINCREMENT |
LL_DMA_MEMORY_INCREMENT |
LL_DMA_PDATAALIGN_BYTE |
LL_DMA_MDATAALIGN_BYTE |
LL_DMA_PRIORITY_HIGH);
LL_DMA_SetChannelSelection(DMA2, LL_DMA_STREAM_5, LL_DMA_CHANNEL_4);
LL_DMA_SetPeriphAddress(DMA2, LL_DMA_STREAM_5,
(uint32_t)&USART1->DR);
LL_DMA_SetMemoryAddress(DMA2, LL_DMA_STREAM_5,
(uint32_t)uart_rx_buf);
LL_DMA_SetDataLength(DMA2, LL_DMA_STREAM_5, UART_RX_BUF_SIZE);
/* 4. Enable half-transfer and transfer-complete interrupts */
LL_DMA_EnableIT_HT(DMA2, LL_DMA_STREAM_5);
LL_DMA_EnableIT_TC(DMA2, LL_DMA_STREAM_5);
NVIC_SetPriority(DMA2_Stream5_IRQn, 5);
NVIC_EnableIRQ(DMA2_Stream5_IRQn);
/* 5. Enable DMA Rx on USART1, then start the DMA stream */
LL_USART_EnableDMAReq_RX(USART1);
LL_DMA_EnableStream(DMA2, LL_DMA_STREAM_5);
}
/* DMA2 Stream5 ISR */
void DMA2_Stream5_IRQHandler(void)
{
if (LL_DMA_IsActiveFlag_HT5(DMA2)) {
LL_DMA_ClearFlag_HT5(DMA2);
rx_half_ready = 0u; /* first half [0..127] ready */
}
if (LL_DMA_IsActiveFlag_TC5(DMA2)) {
LL_DMA_ClearFlag_TC5(DMA2);
rx_half_ready = 1u; /* second half [128..255] ready */
}
}
/* Called from application task / main loop */
void process_uart_data(void)
{
uint8_t half = rx_half_ready;
if (half == 0xFFu) return; /* nothing ready */
rx_half_ready = 0xFFu; /* consume */
const uint8_t *buf = uart_rx_buf + (half * UART_RX_HALF);
/* Process UART_RX_HALF bytes from buf — zero-copy, no memcpy needed */
for (uint32_t i = 0; i < UART_RX_HALF; i++) {
/* handle buf[i] */
(void)buf[i];
}
}
Half-Transfer & Full-Transfer Interrupts
In circular DMA mode the controller generates two interrupts per complete buffer rotation:
- HT (Half-Transfer): fired when the DMA pointer crosses the midpoint. At this moment, the first half of the buffer contains valid data and the DMA is writing into the second half — safe to process the first half.
- TC (Transfer-Complete): fired when the DMA pointer wraps back to the start. The second half contains valid data and the DMA is now writing into the first half again — safe to process the second half.
Critical Timing: You must finish processing one half before the DMA fills it again. At 115200 baud, you have ~11 ms to process 128 bytes. At 1 Mbaud you only have ~1.28 ms — use an RTOS thread and notify from the ISR rather than processing in the interrupt handler.
Circular DMA Mode
Circular mode is the key to continuous, zero-CPU-involvement data streaming. When the DMA reaches the end of the configured buffer it automatically resets its pointer to the buffer start and continues — no software restart required. This is essential for ADC streaming, audio capture, and any application that requires continuous data acquisition.
Ping-Pong Buffer Pattern
The ping-pong (double-buffer) pattern splits the DMA buffer into two equal halves. While the DMA writes into one half (the "active" half), the CPU processes the other half (the "ready" half). The roles swap at each half-transfer and transfer-complete interrupt.
/**
* Generic ping-pong buffer manager — CPU-side processing example.
* Buffer layout: [pingBuf | pongBuf] — each HALF_SIZE bytes.
 */
#include <stdint.h>
#include <stdbool.h>
#include "arm_math.h" /* float32_t */
#define HALF_SIZE 512u
#define BUF_SIZE (HALF_SIZE * 2u)
/* DMA writes into this buffer in circular mode */
__attribute__((aligned(32))) /* 32-byte alignment for M7 cache lines */
static uint16_t dma_buf[BUF_SIZE];
/* Processed results live here — separate from DMA buffer */
static float32_t result_ping[HALF_SIZE];
static float32_t result_pong[HALF_SIZE];
typedef enum { HALF_PING = 0, HALF_PONG = 1 } buf_half_t;
static volatile buf_half_t pending_half = (buf_half_t)0xFFu;
static volatile bool processing_busy = false;
/* Called from DMA ISR — minimal work here */
void dma_half_complete_callback(buf_half_t half)
{
if (!processing_busy) {
pending_half = half;
}
/* If processing_busy, we've overrun — log and handle in production */
}
/* Called from application thread / main loop */
void run_dsp_pipeline(void)
{
if (pending_half == (buf_half_t)0xFFu) return;
processing_busy = true;
buf_half_t half = pending_half;
pending_half = (buf_half_t)0xFFu;
const uint16_t *src = dma_buf + (half == HALF_PING ? 0 : HALF_SIZE);
float32_t *dest = (half == HALF_PING) ? result_ping : result_pong;
/* Convert ADC counts to voltage and apply DSP */
for (uint32_t i = 0; i < HALF_SIZE; i++) {
dest[i] = (float32_t)src[i] * (3.3f / 4095.0f); /* 12-bit ADC */
}
/* Run CMSIS-DSP filter on dest[] here */
processing_busy = false;
}
ADC DMA Circular Double Buffering
High-speed ADC acquisition is the canonical DMA use case. At 1 MSPS (1 million samples per second) you have 1 µs per sample — nowhere near enough time for an interrupt-per-sample approach. DMA in circular mode with double buffering is the only viable solution.
ADC DMA Setup — STM32H7
/**
* ADC1 continuous conversion with DMA circular double-buffer.
* Target: STM32H743 @ 480 MHz, ADC clock = 36 MHz, 12-bit, 1 MSPS.
*
* Buffer: 2048 samples total (two halves of 1024 = 1 ms per half @ 1 MSPS).
*/
#include "stm32h7xx.h"
#include "stm32h7xx_ll_adc.h"
#include "stm32h7xx_ll_dma.h"
#include <stdbool.h>
#define ADC_BUF_TOTAL 2048u
#define ADC_BUF_HALF (ADC_BUF_TOTAL / 2u)
/* 32-byte aligned for D-cache line flush/invalidate on M7 */
__attribute__((section(".dma_buffers"), aligned(32)))
static uint16_t adc_dma_buf[ADC_BUF_TOTAL];
static volatile bool adc_half_rdy = false;
static volatile bool adc_full_rdy = false;
void adc_dma_init(void)
{
/* Assumes ADC clocking, calibration and enable are configured elsewhere. */
/* ADC1/ADC2 requests route through DMAMUX1 to DMA1/DMA2; ADC3 uses BDMA; */
/* check the DMA request mapping in the reference manual for your part. */
RCC->AHB1ENR |= RCC_AHB1ENR_DMA1EN; /* enable DMA1 clock */
__DSB();
LL_DMA_DisableStream(DMA1, LL_DMA_STREAM_1);
while (LL_DMA_IsEnabledStream(DMA1, LL_DMA_STREAM_1)) {}
LL_DMA_SetPeriphRequest(DMA1, LL_DMA_STREAM_1, LL_DMAMUX1_REQ_ADC1);
LL_DMA_SetDataTransferDirection(DMA1, LL_DMA_STREAM_1,
LL_DMA_DIRECTION_PERIPH_TO_MEMORY);
LL_DMA_SetMode(DMA1, LL_DMA_STREAM_1, LL_DMA_MODE_CIRCULAR);
LL_DMA_SetPeriphIncMode(DMA1, LL_DMA_STREAM_1, LL_DMA_PERIPH_NOINCREMENT);
LL_DMA_SetMemoryIncMode(DMA1, LL_DMA_STREAM_1, LL_DMA_MEMORY_INCREMENT);
LL_DMA_SetPeriphSize(DMA1, LL_DMA_STREAM_1, LL_DMA_PDATAALIGN_HALFWORD);
LL_DMA_SetMemorySize(DMA1, LL_DMA_STREAM_1, LL_DMA_MDATAALIGN_HALFWORD);
LL_DMA_SetStreamPriority(DMA1, LL_DMA_STREAM_1, LL_DMA_PRIORITY_VERYHIGH);
LL_DMA_SetPeriphAddress(DMA1, LL_DMA_STREAM_1,
LL_ADC_DMA_GetRegAddr(ADC1, LL_ADC_DMA_REG_REGULAR_DATA));
LL_DMA_SetMemoryAddress(DMA1, LL_DMA_STREAM_1, (uint32_t)adc_dma_buf);
LL_DMA_SetDataLength(DMA1, LL_DMA_STREAM_1, ADC_BUF_TOTAL);
LL_DMA_EnableIT_HT(DMA1, LL_DMA_STREAM_1);
LL_DMA_EnableIT_TC(DMA1, LL_DMA_STREAM_1);
LL_DMA_EnableIT_TE(DMA1, LL_DMA_STREAM_1); /* transfer error */
NVIC_SetPriority(DMA1_Stream1_IRQn, 4);
NVIC_EnableIRQ(DMA1_Stream1_IRQn);
LL_ADC_REG_SetDMATransfer(ADC1, LL_ADC_REG_DMA_TRANSFER_UNLIMITED);
LL_DMA_EnableStream(DMA1, LL_DMA_STREAM_1);
LL_ADC_REG_StartConversion(ADC1);
}
void DMA1_Stream1_IRQHandler(void)
{
if (LL_DMA_IsActiveFlag_HT1(DMA1)) {
LL_DMA_ClearFlag_HT1(DMA1);
adc_half_rdy = true; /* first 1024 samples ready */
}
if (LL_DMA_IsActiveFlag_TC1(DMA1)) {
LL_DMA_ClearFlag_TC1(DMA1);
adc_full_rdy = true; /* second 1024 samples ready */
}
if (LL_DMA_IsActiveFlag_TE1(DMA1)) {
LL_DMA_ClearFlag_TE1(DMA1);
/* Handle DMA error — log, halt, or attempt recovery */
__BKPT(0);
}
}
Buffer Switching in the DMA ISR
The ISR should be kept minimal — set a flag or release a semaphore. The actual processing belongs in a dedicated RTOS thread. Using osThreadFlagsSet() from within the ISR is the idiomatic CMSIS-RTOS2 pattern (covered in Part 5).
/* Application thread processing ADC data — CMSIS-RTOS2 */
#include "cmsis_os2.h"
#include "arm_math.h" /* CMSIS-DSP */
#define FFT_SIZE 1024u
static arm_rfft_fast_instance_f32 fft_inst;
static float32_t fft_input[FFT_SIZE];
static float32_t fft_output[FFT_SIZE];
void adc_processing_thread(void *arg)
{
arm_rfft_fast_init_f32(&fft_inst, FFT_SIZE);
for (;;) {
/* Wait for either half; the DMA ISR sets flags 0x01/0x02 via osThreadFlagsSet() (see the ISR-notification example later in this part) */
uint32_t flags = osThreadFlagsWait(0x03u, osFlagsWaitAny, osWaitForever);
const uint16_t *src = (flags & 0x01u)
? adc_dma_buf /* first half */
: adc_dma_buf + ADC_BUF_HALF; /* second half */
/* If D-cache is enabled (M7), invalidate this half before reading; see the cache coherency section */
/* Convert to float */
for (uint32_t i = 0; i < FFT_SIZE; i++) {
fft_input[i] = (float32_t)src[i] * (3.3f / 4095.0f);
}
/* Real FFT using CMSIS-DSP */
arm_rfft_fast_f32(&fft_inst, fft_input, fft_output, 0);
/* fft_output now contains frequency-domain data */
}
}
Cache Coherency on Cortex-M7
The Cortex-M7 is the first Cortex-M core to include L1 data cache (D-cache). This introduces a subtle but dangerous problem: cache coherency. The DMA controller is a bus master that accesses memory directly via the AHB/AXI bus, bypassing the CPU's cache entirely. If the CPU's D-cache contains a stale copy of the DMA's destination buffer, the CPU will read old data. Conversely, if the CPU has written modified data in cache that hasn't been flushed to RAM, the DMA will transmit old data from RAM.
The D-Cache Problem Illustrated
Silent Data Corruption: Cache coherency bugs are among the hardest to debug in embedded systems. The code looks correct, the DMA configuration is correct, but the data processed by the CPU is silently stale — and the bug may only manifest at certain buffer sizes or alignment boundaries, making it intermittent.
Clean & Invalidate APIs (CMSIS-Core)
CMSIS-Core provides SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() for software cache management. The naming convention follows the ARM architecture manual:
- Clean: write dirty cache lines back to RAM (so DMA can read correct data)
- Invalidate: mark cache lines as invalid (so CPU will re-read from RAM after DMA has written)
/**
* Cache coherency wrappers for DMA buffers on Cortex-M7.
* Must be called with correct size and 32-byte aligned addresses
* (one cache line = 32 bytes on Cortex-M7).
*/
#include "core_cm7.h"
#define CACHE_LINE_SIZE 32u
/* Round up to next cache-line boundary */
#define CACHE_ALIGN_SIZE(sz) \
(((sz) + CACHE_LINE_SIZE - 1u) & ~(CACHE_LINE_SIZE - 1u))
/**
* Call BEFORE starting a DMA transmit (Memory-to-Peripheral):
* Ensures any CPU-modified data is written back to RAM before DMA reads.
*/
void dma_tx_prepare(void *buf, uint32_t len)
{
SCB_CleanDCache_by_Addr((uint32_t *)buf, (int32_t)CACHE_ALIGN_SIZE(len));
}
/**
* Call AFTER DMA receive completes (Peripheral-to-Memory):
* Invalidates the cache so CPU reads fresh DMA-written data from RAM.
*/
void dma_rx_complete(void *buf, uint32_t len)
{
SCB_InvalidateDCache_by_Addr((uint32_t *)buf,
(int32_t)CACHE_ALIGN_SIZE(len));
}
/* Example: SPI DMA transmit with cache management */
void spi_dma_send(const uint8_t *data, uint32_t len)
{
/* Ensure data is in RAM (not just cache) before DMA reads it */
dma_tx_prepare((void *)data, len);
/* Now start DMA — it will read correct data from RAM */
LL_DMA_SetMemoryAddress(DMA2, LL_DMA_STREAM_3, (uint32_t)data);
LL_DMA_SetDataLength(DMA2, LL_DMA_STREAM_3, len);
LL_DMA_EnableStream(DMA2, LL_DMA_STREAM_3);
}
/* DMA receive complete callback */
void spi_dma_rx_complete_callback(uint8_t *rx_buf, uint32_t len)
{
/* Invalidate before CPU reads — DMA has written to RAM */
dma_rx_complete(rx_buf, len);
/* Now safe to access rx_buf — CPU will re-read from RAM */
process_spi_data(rx_buf, len);
}
MPU-Based Solution — Non-Cacheable Regions
For DMA buffers that are continuously updated, repeatedly calling clean/invalidate adds overhead and complexity. A cleaner solution is to place DMA buffers in a memory region configured as non-cacheable via the Memory Protection Unit (MPU). The CPU accesses these regions without caching — every access goes directly to RAM — eliminating coherency concerns entirely at the cost of slightly slower CPU access.
/**
* Configure MPU region as non-cacheable for DMA buffers.
* Place DMA buffers at DMA_BUFFER_BASE in the linker script.
* Suitable for STM32H7 with AXI SRAM or dedicated DMA SRAM regions.
*/
#include "core_cm7.h"
#define DMA_BUFFER_BASE 0x30040000u /* D2 SRAM3 on STM32H743 (32 KB) */
#define DMA_BUFFER_SIZE 0x8000u /* 32 KB */
void mpu_configure_dma_region(void)
{
/* Disable MPU during configuration */
ARM_MPU_Disable();
/* Region 0: DMA buffers — Normal memory, non-cacheable, shareable */
MPU->RNR = 0u;
MPU->RBAR = DMA_BUFFER_BASE;
MPU->RASR = ARM_MPU_RASR(
1u, /* XN = 1: no instruction fetch from this region */
ARM_MPU_AP_FULL, /* full access */
1u, 1u, 0u, 0u, /* TEX=1, S=1, C=0, B=0 = Normal, non-cacheable, shareable */
0x00u, /* SRD: no sub-regions disabled */
ARM_MPU_REGION_SIZE_32KB
);
/* Re-enable MPU with background region (allows cached access elsewhere) */
ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);
__DSB();
__ISB();
}
/* DMA buffers must be placed at DMA_BUFFER_BASE in the linker script:
* .dma_buffers (NOLOAD) :
* {
* *(.dma_buffers)
* } > DMA_SRAM
*/
DMA with RTOS & Memory-to-Memory
Memory-to-Memory DMA vs memcpy()
DMA can also be used for memory-to-memory transfers — replacing memcpy() for large blocks. Whether this is faster than CPU-based copy depends on the memory bus configuration and the MCU. On M7 with TCM memory, a CPU-optimised memcpy() can outperform DMA for small blocks. For large transfers (>4 KB) to/from AXI SRAM, DMA typically wins because it can burst the AHB bus without competing with instruction fetches.
/**
* Memory-to-memory DMA on STM32F4 — DMA2 Stream0 Channel0.
* Benchmark: compare against memcpy() using DWT cycle counter.
*/
#include "stm32f4xx.h"
#include "core_cm4.h"
#include <string.h> /* memcpy */
#define BUF_LEN 4096u
static uint8_t src_buf[BUF_LEN] __attribute__((aligned(4)));
static uint8_t dst_buf[BUF_LEN] __attribute__((aligned(4)));
/* Enable DWT cycle counter */
static void dwt_enable(void)
{
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0u;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}
static uint32_t dwt_cycles(void) { return DWT->CYCCNT; }
/* Benchmark memcpy vs DMA */
void benchmark_copy(void)
{
dwt_enable();
/* --- memcpy benchmark --- */
DWT->CYCCNT = 0u;
memcpy(dst_buf, src_buf, BUF_LEN);
uint32_t cpu_cycles = dwt_cycles();
/* --- DMA memory-to-memory benchmark --- */
RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN; __DSB();
DMA2_Stream0->CR &= ~DMA_SxCR_EN;
while (DMA2_Stream0->CR & DMA_SxCR_EN) {}
DMA2_Stream0->PAR = (uint32_t)src_buf;
DMA2_Stream0->M0AR = (uint32_t)dst_buf;
DMA2_Stream0->NDTR = BUF_LEN / 4u; /* word transfers */
DMA2_Stream0->FCR = DMA_SxFCR_DMDIS; /* FIFO mode — required for mem-to-mem */
DMA2_Stream0->CR = (0u << DMA_SxCR_CHSEL_Pos) | /* Channel 0 */
DMA_SxCR_DIR_1 | /* DIR = 10: Mem-to-Mem */
DMA_SxCR_MINC | /* Mem (destination) increment */
DMA_SxCR_PINC | /* Periph port (source) increment */
(0x2u << DMA_SxCR_MSIZE_Pos) | /* 32-bit mem */
(0x2u << DMA_SxCR_PSIZE_Pos) | /* 32-bit periph */
DMA_SxCR_PL_1; /* High priority */
DWT->CYCCNT = 0u;
DMA2_Stream0->CR |= DMA_SxCR_EN;
while (!(DMA2->LISR & DMA_LISR_TCIF0)) {} /* wait for TC */
uint32_t dma_cycles = dwt_cycles();
DMA2->LIFCR = DMA_LIFCR_CTCIF0; /* clear TC flag */
/* Log results */
/* Results vary with core clock, wait states and bus contention; */
/* measure on your own hardware. The real win is that the CPU */
/* stays free while the DMA copy runs. */
(void)cpu_cycles;
(void)dma_cycles;
}
Notifying an RTOS Thread from a DMA ISR
The correct pattern for signalling an RTOS thread from a DMA ISR uses CMSIS-RTOS2 thread flags or semaphores. Never call blocking RTOS functions from an ISR: in CMSIS-RTOS2, use only functions documented as ISR-callable, such as osThreadFlagsSet(), osSemaphoreRelease(), and osMessageQueuePut() with zero timeout; in FreeRTOS, use the FromISR variants.
/* Thread ID — set during RTOS thread creation */
static osThreadId_t g_dma_thread_id = NULL;
#define FLAG_DMA_HALF (1u << 0u)
#define FLAG_DMA_FULL (1u << 1u)
/* DMA ISR — signal the processing thread */
void DMA1_Stream1_IRQHandler(void)
{
if (LL_DMA_IsActiveFlag_HT1(DMA1)) {
LL_DMA_ClearFlag_HT1(DMA1);
osThreadFlagsSet(g_dma_thread_id, FLAG_DMA_HALF);
}
if (LL_DMA_IsActiveFlag_TC1(DMA1)) {
LL_DMA_ClearFlag_TC1(DMA1);
osThreadFlagsSet(g_dma_thread_id, FLAG_DMA_FULL);
}
}
/* Processing thread */
void dma_processing_thread(void *arg)
{
g_dma_thread_id = osThreadGetId();
for (;;) {
uint32_t flags = osThreadFlagsWait(
FLAG_DMA_HALF | FLAG_DMA_FULL,
osFlagsWaitAny,
osWaitForever);
if (flags & FLAG_DMA_HALF) {
/* Invalidate cache for first half, then process */
SCB_InvalidateDCache_by_Addr((uint32_t *)adc_dma_buf,
ADC_BUF_HALF * sizeof(uint16_t));
process_samples(adc_dma_buf, ADC_BUF_HALF);
}
if (flags & FLAG_DMA_FULL) {
SCB_InvalidateDCache_by_Addr(
(uint32_t *)(adc_dma_buf + ADC_BUF_HALF),
ADC_BUF_HALF * sizeof(uint16_t));
process_samples(adc_dma_buf + ADC_BUF_HALF, ADC_BUF_HALF);
}
}
}
Common DMA Pitfalls
| Pitfall | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Cache coherency (M7) | CPU reads stale data; intermittent corruption at buffer boundaries | D-cache holds outdated copy; DMA wrote to RAM but cache not invalidated | SCB_InvalidateDCache_by_Addr() after DMA Rx; place DMA buffers in non-cacheable MPU region |
| Misaligned buffer address | HardFault or silent transfer error; DMA TE interrupt fires | DMA requires word-aligned addresses for 32-bit transfers; 32-byte alignment needed for cache ops | Use __attribute__((aligned(32))) on DMA buffers; check NDTR vs data width |
| DMA request collision | Transfer never starts; DMA stream stays busy | Two peripherals configured to share one DMA stream/channel | Check DMA request mapping table in reference manual; use DMAMUX on newer devices |
| Circular buffer race condition | Data appears corrupted every N frames; hard to reproduce | CPU processing half-buffer too slowly — DMA overwrites before processing completes | Increase buffer depth; move processing to a dedicated RTOS thread with higher priority; profile with DWT |
| Forgetting to clear DMA flags | ISR fires once then never again, or fires continuously | DMA flags are sticky — must be cleared manually in the ISR | Always clear HT, TC, and TE flags at the start of the ISR before processing |
| DMA buffer in Flash/const region | DMA appears to work but CPU reads zeros | DMA cannot write to Flash (read-only for AHB master writes) | Declare DMA receive buffers as non-const in SRAM; check linker script section placement |
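The alignment pitfall in the table can be caught at compile time rather than debugged at runtime. A sketch using the GCC/Clang aligned attribute and C11 _Static_assert (the buffer name here is illustrative):

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 32u
#define ADC_BUF_TOTAL   2048u

/* Hypothetical DMA buffer with explicit cache-line alignment. */
__attribute__((aligned(32)))
static uint16_t adc_dma_buf2[ADC_BUF_TOTAL];

/* Catch size mistakes at compile time: the buffer must span a whole
 * number of cache lines, or clean/invalidate operations will also
 * touch whatever variables happen to share the boundary lines. */
_Static_assert((sizeof(adc_dma_buf2) % CACHE_LINE_SIZE) == 0u,
               "DMA buffer must be a multiple of the cache line size");
_Static_assert((ADC_BUF_TOTAL % 2u) == 0u,
               "ping-pong buffer needs an even number of samples");

/* Accessor so the buffer's placement can be checked at runtime too. */
static uint16_t *dma_buf_base(void) { return adc_dma_buf2; }
```

A failed _Static_assert stops the build with the message shown, which is far cheaper than chasing an intermittent TE interrupt or a corrupted neighbour variable.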
Exercises
Exercise 1
Intermediate
Zero-Copy UART Logger with DMA Double Buffer
Implement a UART logging system that receives log strings over UART using DMA in circular mode with a ping-pong buffer. The receive half should be parsed (look for \n delimiters) and forwarded to a ring buffer without any memcpy(). Measure CPU utilisation with and without DMA using the DWT cycle counter. Target: <2% CPU load at 115200 baud receiving 1000 bytes/second.
DMA Circular
Zero-Copy
UART Rx
DWT Profiling
Exercise 2
Intermediate
1 MSPS ADC Capture with Background FFT
Configure your ADC (STM32 or equivalent) for continuous conversion at the maximum supported sample rate using DMA circular double-buffering. In the background RTOS thread, run a 1024-point CMSIS-DSP real FFT on each completed half-buffer. Display the dominant frequency component over UART. Verify correctness by feeding a known signal frequency from a signal generator or the MCU's own DAC.
ADC DMA
CMSIS-DSP FFT
RTOS Thread Flags
Double Buffer
Exercise 3
Advanced
Enable D-Cache on M7 and Fix Cache Coherency
Take an existing DMA project (UART Rx or SPI) running on a Cortex-M7 (STM32H7 or STM32F7) with D-cache disabled. Enable D-cache (SCB_EnableDCache()) and observe the data corruption. Then fix it using two approaches: (1) software clean/invalidate with SCB_CleanDCache_by_Addr() / SCB_InvalidateDCache_by_Addr(), and (2) MPU non-cacheable region for the DMA buffers. Compare the CPU overhead of both approaches using DWT.
D-Cache
Cache Coherency
MPU
Cortex-M7
Conclusion & Next Steps
DMA is a foundational technique for building high-performance, CPU-efficient embedded firmware. In this part we covered:
- DMA fundamentals: transfer types, controller architecture, and the key configuration parameters (direction, data width, increment, circular mode, priority).
- Peripheral-to-memory transfers: configuring UART DMA Rx with circular mode, ping-pong buffers, and half-transfer/transfer-complete interrupts for zero-copy continuous receive.
- ADC double buffering: high-speed circular DMA capture with background CMSIS-DSP processing, using RTOS thread flags to signal between the ISR and the processing thread.
- Cache coherency on Cortex-M7: understanding the D-cache problem, using SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr(), and the MPU non-cacheable region approach.
- Memory-to-memory DMA: benchmarking against memcpy() using the DWT cycle counter, and the common pitfalls table.
Next in the Series
In Part 15: Security — ARMv8-M & TrustZone, we'll explore the hardware security architecture of ARMv8-M processors — partitioning flash and SRAM into Secure and Non-Secure worlds, configuring the Security Attribution Unit (SAU), creating secure entry function veneers, implementing PSA Crypto APIs, and building a minimal secure boot chain with signature verification.
Related Articles in This Series
Part 6: CMSIS-DSP — Filters, FFT & Math Functions
Master CMSIS-DSP signal processing — FIR/IIR filters, real and complex FFT, and SIMD-optimised math. DMA-fed ADC buffers feed directly into these algorithms.
Part 7: CMSIS-Driver — UART, SPI & I2C
CMSIS-Driver wraps peripheral access including DMA-backed transfers behind a standardised callback interface — the complement to the low-level DMA configuration shown here.
Part 11: Interrupts, Concurrency & Real-Time Constraints
DMA ISRs are interrupts — understanding priority grouping, preemption, and ISR-to-thread communication patterns is essential for robust DMA designs.