STM32 Part 8: DMA & Memory Efficiency

March 31, 2026 Wasil Zafar 27 min read

DMA is the difference between firmware that burns 90% of CPU time on data movement and firmware that runs at full throughput with the CPU free for real work — master every DMA mode on STM32.

Table of Contents

  1. DMA Architecture on STM32
  2. Peripheral-to-Memory
  3. Memory-to-Peripheral
  4. Circular Mode
  5. Memory-to-Memory
  6. Double Buffering
  7. Zero-Copy Firmware Patterns
  8. Exercises
  9. DMA Configuration Tool
  10. Conclusion & Next Steps
Series Overview: This is Part 8 of the 18-part STM32 Unleashed series. We have covered architecture, GPIO, UART, timers, ADC, SPI, and I2C. Now we tackle DMA — the mechanism that lets every peripheral run at full speed without CPU involvement.

DMA Architecture on STM32

The Direct Memory Access controller is one of the most powerful — and most misunderstood — peripherals on any microcontroller. On the STM32F4, there are two DMA controllers: DMA1 and DMA2. Each controller has 8 streams (Stream 0 through Stream 7), and each stream can be assigned to one of up to 8 channels. Only certain stream/channel combinations are valid for a given peripheral — this mapping is fixed in silicon and documented in the STM32F4 reference manual, Table 42 and Table 43.

Understanding the DMA architecture before writing any HAL code is essential. Many bugs in DMA-driven firmware stem from incorrectly assigned stream/channel pairs or from misunderstanding how each controller attaches to the bus fabric. DMA2's peripheral port goes through the AHB bus matrix, so it can reach AHB1, AHB2, and APB2 peripherals as well as memory. DMA1's peripheral port is wired only to the APB1 bridge, so DMA1 serves APB1 peripherals and nothing else. Peripherals hanging off AHB2 (such as the camera interface DCMI) can therefore only be served by DMA2.

DMA1 vs DMA2: Peripheral Connectivity

The split between DMA1 and DMA2 is not arbitrary: it follows the bus topology. DMA1 handles the APB1 peripherals (USART2/3, SPI2/3, I2C1 to I2C3, the DAC, TIM2 to TIM7), while DMA2 handles the APB2 and AHB2 peripherals (USART1, SPI1, the ADCs, DCMI) and is the only controller capable of memory-to-memory (M2M) transfers. In practice this means that if you need to DMA-drive ADC1 (which hangs off APB2 and whose DMA request is routed to DMA2), you cannot use DMA1 regardless of how you configure it.

Streams, Channels, and the Request Mapping Table

Each DMA stream can serve exactly one peripheral at a time, selected by the channel number field (CHSEL) in DMA_SxCR. The following table shows the most commonly used stream/channel assignments on STM32F4. Always cross-reference with the reference manual for your specific device — some assignments differ between F405, F407, and F446.

DMA Controller | Stream   | Channel | Peripheral | Direction
DMA1           | Stream 0 | Ch 0    | SPI3 RX    | P→M
DMA1           | Stream 1 | Ch 4    | USART3 RX  | P→M
DMA1           | Stream 2 | Ch 3    | I2C3 RX    | P→M
DMA1           | Stream 3 | Ch 4    | USART3 TX  | M→P
DMA1           | Stream 5 | Ch 4    | USART2 RX  | P→M
DMA1           | Stream 6 | Ch 4    | USART2 TX  | M→P
DMA2           | Stream 0 | Ch 0    | ADC1       | P→M
DMA2           | Stream 2 | Ch 3    | SPI1 RX    | P→M
DMA2           | Stream 3 | Ch 3    | SPI1 TX    | M→P
DMA2           | Stream 2 | Ch 0    | M2M        | M→M
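
One way to catch a wrong stream/channel pair early is to keep the mapping in a small table and check your configuration against it. The sketch below is a host-side illustration only: it encodes the peripheral rows of the table above, and the names `dma_route_t`, `dma_find_route`, and `dma_route_matches` are hypothetical helpers, not part of the HAL. Real devices usually offer alternate streams for the same request, so a complete table would need more rows taken from the reference manual.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of the request-mapping table (illustrative subset only). */
typedef struct {
    uint8_t     controller;   /* 1 = DMA1, 2 = DMA2 */
    uint8_t     stream;       /* 0..7 */
    uint8_t     channel;      /* 0..7, the CHSEL value */
    const char *request;      /* peripheral request name */
} dma_route_t;

static const dma_route_t dma_routes[] = {
    {1, 0, 0, "SPI3_RX"},
    {1, 1, 4, "USART3_RX"},
    {1, 2, 3, "I2C3_RX"},
    {1, 3, 4, "USART3_TX"},
    {1, 5, 4, "USART2_RX"},
    {1, 6, 4, "USART2_TX"},
    {2, 0, 0, "ADC1"},
    {2, 2, 3, "SPI1_RX"},
    {2, 3, 3, "SPI1_TX"},
};

/* Return the table entry for a request name, or NULL if it is not listed. */
const dma_route_t *dma_find_route(const char *request)
{
    for (size_t i = 0; i < sizeof dma_routes / sizeof dma_routes[0]; i++) {
        if (strcmp(dma_routes[i].request, request) == 0) {
            return &dma_routes[i];
        }
    }
    return NULL;
}

/* Return 1 if the controller/stream/channel triple matches the table entry. */
int dma_route_matches(const char *request, uint8_t controller,
                      uint8_t stream, uint8_t channel)
{
    const dma_route_t *r = dma_find_route(request);
    return r != NULL && r->controller == controller &&
           r->stream == stream && r->channel == channel;
}
```

A check such as `dma_route_matches("USART2_RX", 1, 5, 4)` could run in a unit test or a debug-build assertion before the handle is initialised.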

FIFO vs Direct Mode

Each DMA stream has an optional 4-word (16-byte) FIFO. In direct mode (FIFO disabled), data is transferred immediately from source to destination on each DMA request — there is no buffering. In FIFO mode, the DMA controller accumulates multiple data items before writing them in a burst, significantly reducing AHB bus contention. FIFO mode is mandatory for memory-to-memory transfers and for burst operations. The FIFO threshold is configurable: 1/4, 1/2, 3/4, or full (4 words).

Use direct mode for low-bandwidth peripherals (UART, I2C) where simplicity matters. Use FIFO mode with bursts for high-bandwidth transfers (display framebuffers, audio streams, ADC at maximum rate) to maximise bus efficiency.
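
Burst mode adds a constraint that is easy to get wrong: the FIFO threshold has to hold a whole number of memory bursts. The sketch below encodes that rule as I read the FIFO threshold table in the reference manual (threshold in bytes must be a multiple of burst beats times data size). The function name and parameter encoding are assumptions for illustration; treat it as a sanity check and confirm any combination against the manual for your device.

```c
#include <stdint.h>

/*
 * msize_bytes        : memory data size, 1, 2 or 4
 * burst_beats        : 4, 8 or 16 (INC4 / INC8 / INC16)
 * threshold_quarters : FIFO threshold in quarters of the 16-byte FIFO, 1..4
 * Returns 1 if the threshold holds a whole number of bursts, else 0.
 */
int dma_fifo_burst_valid(uint32_t msize_bytes, uint32_t burst_beats,
                         uint32_t threshold_quarters)
{
    if (msize_bytes != 1U && msize_bytes != 2U && msize_bytes != 4U) {
        return 0;
    }
    if (burst_beats != 4U && burst_beats != 8U && burst_beats != 16U) {
        return 0;
    }
    if (threshold_quarters < 1U || threshold_quarters > 4U) {
        return 0;
    }

    uint32_t threshold_bytes = threshold_quarters * 4U;   /* 4, 8, 12 or 16 */
    uint32_t burst_bytes     = msize_bytes * burst_beats;

    return (threshold_bytes % burst_bytes) == 0U;
}
```

For example, word-sized INC4 bursts move 16 bytes at a time, so only the full threshold passes the check.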

Priority levels resolve arbitration when two streams simultaneously request the bus. The four levels are Low, Medium, High, and Very High (configured via DMA_SxCR.PL). When priorities are equal, the lower stream number wins.
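
The arbitration rule can be modelled in a few lines, which is handy when reasoning about which stream might be starved. This is a host-side sketch of the rule as stated above, not hardware code; `dma_request_t` and `dma_arbitrate` are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    uint8_t stream;    /* 0..7 */
    uint8_t priority;  /* PL field: 0 = Low, 1 = Medium, 2 = High, 3 = Very High */
    uint8_t pending;   /* non-zero while the stream has a request outstanding */
} dma_request_t;

/* Return the index of the request that wins arbitration, or -1 if none is pending. */
int dma_arbitrate(const dma_request_t *reqs, int count)
{
    int winner = -1;

    for (int i = 0; i < count; i++) {
        if (!reqs[i].pending) {
            continue;
        }
        if (winner < 0 ||
            reqs[i].priority > reqs[winner].priority ||
            (reqs[i].priority == reqs[winner].priority &&
             reqs[i].stream < reqs[winner].stream)) {
            winner = i;   /* higher PL wins; on a tie the lower stream number wins */
        }
    }
    return winner;
}
```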

/* ─── DMA2 Stream0, Channel0: ADC1 → SRAM continuous conversion ──────────── */

/* 1. Enable clocks */
__HAL_RCC_DMA2_CLK_ENABLE();
__HAL_RCC_ADC1_CLK_ENABLE();

/* 2. Configure DMA handle */
DMA_HandleTypeDef hdma_adc1;

hdma_adc1.Instance                 = DMA2_Stream0;
hdma_adc1.Init.Channel             = DMA_CHANNEL_0;
hdma_adc1.Init.Direction           = DMA_PERIPH_TO_MEMORY;
hdma_adc1.Init.PeriphInc           = DMA_PINC_DISABLE;   /* ADC DR is fixed */
hdma_adc1.Init.MemInc              = DMA_MINC_ENABLE;    /* advance through buffer */
hdma_adc1.Init.PeriphDataAlignment = DMA_PDATAALIGN_HALFWORD; /* 12-bit result, read as 16-bit */
hdma_adc1.Init.MemDataAlignment    = DMA_MDATAALIGN_HALFWORD;
hdma_adc1.Init.Mode                = DMA_CIRCULAR;
hdma_adc1.Init.Priority            = DMA_PRIORITY_HIGH;
hdma_adc1.Init.FIFOMode            = DMA_FIFOMODE_DISABLE; /* direct mode */
hdma_adc1.Init.FIFOThreshold       = DMA_FIFO_THRESHOLD_HALFFULL;
hdma_adc1.Init.MemBurst            = DMA_MBURST_SINGLE;
hdma_adc1.Init.PeriphBurst         = DMA_PBURST_SINGLE;

HAL_DMA_Init(&hdma_adc1);

/* 3. Link DMA handle to ADC handle */
hadc1.DMA_Handle = &hdma_adc1;
hdma_adc1.Parent = &hadc1;

Peripheral-to-Memory (P2M) Transfers

Peripheral-to-Memory is the most common DMA direction on STM32. It is used whenever a peripheral generates data — ADC conversions, UART received bytes, SPI incoming frames — and you want to capture that data directly into an SRAM buffer without any CPU involvement.

The two principal HAL entry points are HAL_DMA_Start_IT() (used when you configure DMA independently) and the peripheral-integrated variants such as HAL_UART_Receive_DMA(), HAL_ADC_Start_DMA(), and HAL_SPI_Receive_DMA(). Always prefer the peripheral-integrated functions — they configure the DMA trigger source, enable the peripheral's DMA request bit, and register the correct completion callbacks automatically.

Transfer Complete and Half-Transfer Interrupts

Every DMA stream can generate two progress interrupts during a transfer: the Half Transfer Complete (HTC) interrupt fires when the DMA counter reaches exactly half the programmed NDTR value, and the Transfer Complete (TC) interrupt fires when NDTR reaches zero. Both interrupts are critical for implementing double-buffered processing: while the CPU processes the first half of the buffer, the DMA is filling the second half. Missing either callback means your firmware either processes stale data or overwrites data before it has been consumed.

/* ─── UART1 RX → 512-byte circular DMA buffer, half + full callbacks ──────── */
#define UART_RX_BUF_SIZE  512

uint8_t uart_rx_dma_buf[UART_RX_BUF_SIZE];
volatile uint8_t  half_received = 0;
volatile uint8_t  full_received = 0;

/* Call once after MX_DMA_Init() and MX_USART1_UART_Init() */
void UART_DMA_Start(void)
{
    /* USART1 RX is served by DMA2 (Stream2 or Stream5, Channel4) on STM32F4 */
    HAL_UART_Receive_DMA(&huart1, uart_rx_dma_buf, UART_RX_BUF_SIZE);
}

/* HAL weak callback: fired when first 256 bytes have arrived */
void HAL_UART_RxHalfCpltCallback(UART_HandleTypeDef *huart)
{
    if (huart->Instance == USART1)
    {
        half_received = 1;   /* signal main loop: process bytes [0..255] */
    }
}

/* HAL weak callback: fired when bytes [256..511] have arrived */
void HAL_UART_RxCpltCallback(UART_HandleTypeDef *huart)
{
    if (huart->Instance == USART1)
    {
        full_received = 1;   /* signal main loop: process bytes [256..511] */
    }
}

/* Main loop processing */
void UART_DMA_Process(void)
{
    if (half_received)
    {
        half_received = 0;
        /* Safe to read uart_rx_dma_buf[0 .. 255] here */
        ProcessData(&uart_rx_dma_buf[0], UART_RX_BUF_SIZE / 2);
    }
    if (full_received)
    {
        full_received = 0;
        /* Safe to read uart_rx_dma_buf[256 .. 511] here */
        ProcessData(&uart_rx_dma_buf[UART_RX_BUF_SIZE / 2], UART_RX_BUF_SIZE / 2);
    }
}

Note that HAL_UART_Receive_DMA() does not choose the transfer mode itself: the stream runs in circular mode only if the DMA handle's Init.Mode was set to DMA_CIRCULAR before HAL_DMA_Init(). If the handle is initialised in normal mode, HAL_UART_Receive_DMA() starts a single-shot transfer: the DMA stops after NDTR bytes, and the completion callback is responsible for restarting it.

Monitoring DMA Progress Without Interrupts

In some applications you need to know how many bytes the DMA has received so far, without waiting for the half or full callback. The __HAL_DMA_GET_COUNTER() macro returns the current value of the NDTR register, which is the number of remaining transfers. Subtracting this from the buffer size gives the index the DMA will write next. In circular mode, the bytes that arrived since the last check lie between the previously recorded index and this one, possibly wrapping past the end of the buffer. This technique is commonly used with UART idle-line detection to process variable-length packets:

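/* ─── UART idle-line detection + DMA: process variable-length packets ─────── */

/* Enable IDLE interrupt on UART1 (not done by HAL_UART_Receive_DMA by default) */
__HAL_UART_ENABLE_IT(&huart1, UART_IT_IDLE);

static uint32_t rx_last_pos = 0;   /* buffer index where the previous packet ended */

void USART1_IRQHandler(void)
{
    /* Check for IDLE line interrupt before calling HAL handler */
    if (__HAL_UART_GET_FLAG(&huart1, UART_FLAG_IDLE))
    {
        __HAL_UART_CLEAR_IDLEFLAG(&huart1);   /* otherwise the interrupt re-fires */

        /* NDTR counts down, so the DMA write index is size minus remaining */
        uint32_t remaining = __HAL_DMA_GET_COUNTER(huart1.hdmarx);
        uint32_t pos       = (UART_RX_BUF_SIZE - remaining) % UART_RX_BUF_SIZE;

        if (pos > rx_last_pos)
        {
            /* Packet is contiguous: [rx_last_pos .. pos-1] */
            ProcessPacket(&uart_rx_dma_buf[rx_last_pos], pos - rx_last_pos);
        }
        else if (pos < rx_last_pos)
        {
            /* Circular buffer wrapped: hand over the tail, then the head.
               A parser that needs one contiguous packet must join the two. */
            ProcessPacket(&uart_rx_dma_buf[rx_last_pos],
                          UART_RX_BUF_SIZE - rx_last_pos);
            if (pos > 0)
            {
                ProcessPacket(uart_rx_dma_buf, pos);
            }
        }
        rx_last_pos = pos;
    }
    HAL_UART_IRQHandler(&huart1);
}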

This pattern is far superior to fixed-size DMA receive for protocol parsers, because packets end on an IDLE condition (bus quiet for one frame period) rather than at a fixed byte count. Combined with circular DMA, it handles back-to-back packets without losing a single byte between packet boundaries.
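
The index arithmetic behind this pattern is worth testing on a PC before it goes anywhere near an ISR. The sketch below is a host-runnable model, assuming a circular buffer and an NDTR value read at the moment of the IDLE event; `rx_spans_t` and `rx_new_data` are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Up to two contiguous spans of new data in a circular DMA buffer. */
typedef struct {
    size_t first_off,  first_len;   /* span starting at the previous position */
    size_t second_off, second_len;  /* span starting at index 0 after a wrap */
} rx_spans_t;

/*
 * buf_size : total buffer length programmed into the stream
 * last_pos : index where the previous call left off
 * ndtr     : remaining-transfer count read from the stream
 */
rx_spans_t rx_new_data(size_t buf_size, size_t last_pos, size_t ndtr)
{
    assert(buf_size > 0 && last_pos < buf_size && ndtr <= buf_size);

    rx_spans_t s = {0, 0, 0, 0};
    size_t pos = (buf_size - ndtr) % buf_size;   /* current DMA write index */

    s.first_off = last_pos;
    if (pos >= last_pos) {
        s.first_len = pos - last_pos;            /* no wrap since last call */
    } else {
        s.first_len  = buf_size - last_pos;      /* tail of the buffer */
        s.second_off = 0;
        s.second_len = pos;                      /* head of the buffer */
    }
    return s;
}
```

One limitation of this model: when the write index equals the previous position it reports no new data, so a burst of exactly `buf_size` bytes between two checks would go unnoticed. Sizing the buffer well above the largest packet avoids that case.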

Memory-to-Peripheral (M2P) Transfers

Memory-to-Peripheral transfers push data from SRAM (or Flash) to a peripheral's data register. Common applications include SPI display framebuffer pushes, UART bulk logging, and DAC waveform generation. The key difference from P2M is that the DMA is now the master of the peripheral's transmit path — the peripheral signals readiness (e.g., SPI TXE flag), and the DMA responds by writing the next data item.

For SPI displays, the typical pattern is: assert CS → call HAL_SPI_Transmit_DMA() → wait for HAL_SPI_TxCpltCallback → deassert CS. The critical mistake is deasserting CS as soon as the DMA stream reports transfer complete. At that moment the last byte has only been written to the SPI data register; it has not necessarily been clocked out. CS must remain asserted until the SPI BSY flag clears. Deassert CS from HAL_SPI_TxCpltCallback (or from a flag-driven main loop section), and if you drive the DMA registers directly, poll BSY before releasing CS.

/* ─── SPI1 TX DMA: push 320×240 RGB565 framebuffer to an ILI9341 display ─── */
#define LCD_WIDTH   320
#define LCD_HEIGHT  240
#define FB_SIZE     (LCD_WIDTH * LCD_HEIGHT)   /* 76,800 pixels, 153,600 bytes */
#define FB_BYTES    (FB_SIZE * 2U)
#define SPI_CHUNK   32768U   /* bytes per DMA transfer: NDTR and the HAL Size
                                argument are 16-bit, so one transfer <= 65,535 */

/* Framebuffer can live in Flash (const) or SRAM. At ~150 KB it does not fit
   in the 128 KB main SRAM of an F405/F407; use a larger part or a partial buffer. */
static uint16_t framebuffer[FB_SIZE];

volatile uint8_t spi_dma_busy = 0;
static uint32_t  fb_sent;            /* bytes already handed to the DMA */

/* Queue the next slice of the framebuffer (SPI data size is 8-bit here,
   so the uint16_t buffer is sent as a byte stream) */
static void LCD_SendNextChunk(void)
{
    uint32_t left  = FB_BYTES - fb_sent;
    uint16_t chunk = (uint16_t)((left > SPI_CHUNK) ? SPI_CHUNK : left);

    HAL_SPI_Transmit_DMA(&hspi1, (uint8_t *)framebuffer + fb_sent, chunk);
    fb_sent += chunk;
}

void LCD_SendFramebuffer(void)
{
    if (spi_dma_busy) return;        /* previous transfer still in flight */

    spi_dma_busy = 1;
    fb_sent      = 0;

    HAL_GPIO_WritePin(LCD_CS_GPIO_Port, LCD_CS_Pin, GPIO_PIN_RESET); /* CS low */
    LCD_SendNextChunk();
}

/* Fires after each chunk; CS is released only after the last one */
void HAL_SPI_TxCpltCallback(SPI_HandleTypeDef *hspi)
{
    if (hspi->Instance == SPI1)
    {
        if (fb_sent < FB_BYTES)
        {
            LCD_SendNextChunk();     /* more of the frame left to send */
            return;
        }
        HAL_GPIO_WritePin(LCD_CS_GPIO_Port, LCD_CS_Pin, GPIO_PIN_SET); /* CS high */
        spi_dma_busy = 0;
    }
}

/* For DAC waveform: DAC1 DMA (DMA1 Stream5 Ch7 on F4) */
void DAC_DMA_WaveformStart(const uint16_t *wave, uint32_t len)
{
    /* TIM6 triggers DAC conversions; DMA replenishes DAC DHR12R1 */
    HAL_DAC_Start_DMA(&hdac, DAC_CHANNEL_1,
                      (uint32_t *)wave, len, DAC_ALIGN_12B_R);
}

For UART logging, HAL_UART_Transmit_DMA() is non-blocking: it sets up the DMA transfer and returns immediately. This is ideal for large log buffers. However, you must ensure the transmit buffer remains valid until HAL_UART_TxCpltCallback fires. A common bug is declaring the log buffer on the stack of a function that returns before the DMA completes — always use a static or global buffer for DMA transmit operations.
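
A small static ring buffer makes the buffer-lifetime rule hard to violate: callers copy their message in, and the DMA only ever sees memory that lives for the whole program. The following is a minimal host-runnable sketch of that ownership scheme. All names (`log_write`, `log_next_chunk`, `log_chunk_done`) and the 256-byte capacity are assumptions for illustration. On target, `log_next_chunk()` would feed HAL_UART_Transmit_DMA() and `log_chunk_done()` would be called from HAL_UART_TxCpltCallback, with a critical section around the shared indices.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_BUF_SIZE 256U              /* hypothetical capacity */

static uint8_t log_buf[LOG_BUF_SIZE];  /* static storage outlives every caller */
static size_t  log_head;               /* next index to write */
static size_t  log_tail;               /* oldest byte not yet released */
static size_t  log_inflight;           /* bytes at log_tail owned by the DMA */

static size_t log_used(void)
{
    return (log_head + LOG_BUF_SIZE - log_tail) % LOG_BUF_SIZE;
}

/* Queue a message. Returns len on success, 0 if it does not fit. */
size_t log_write(const uint8_t *data, size_t len)
{
    if (len > LOG_BUF_SIZE - 1U - log_used()) {
        return 0;
    }
    for (size_t i = 0; i < len; i++) {
        log_buf[log_head] = data[i];
        log_head = (log_head + 1U) % LOG_BUF_SIZE;
    }
    return len;
}

/* Claim the next contiguous span for transmission.
   Returns its length, or 0 if a transfer is in flight or nothing is queued. */
size_t log_next_chunk(const uint8_t **span)
{
    size_t used = log_used();

    if (log_inflight != 0U || used == 0U) {
        return 0;
    }
    size_t to_end = LOG_BUF_SIZE - log_tail;
    log_inflight = (used < to_end) ? used : to_end;
    *span = &log_buf[log_tail];
    return log_inflight;
}

/* Release the span once the transfer has finished. */
void log_chunk_done(void)
{
    log_tail = (log_tail + log_inflight) % LOG_BUF_SIZE;
    log_inflight = 0;
}
```

The design choice here is that the writer never touches bytes between `log_tail` and `log_tail + log_inflight`, which is exactly the region the DMA owns while a transfer is running.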

Circular Mode

In normal (non-circular) DMA mode, the controller transfers exactly NDTR items and then stops. The NDTR register cannot be reloaded automatically — the software must call a re-arm function before the next transfer can begin. This creates a gap during which data from a fast peripheral (like a UART at 1 Mbit/s or an ADC at 1 MSPS) is lost.

Circular mode eliminates this gap entirely. When NDTR reaches zero, the DMA hardware automatically reloads it from the originally programmed value and begins the next pass through the buffer — no CPU involvement required. The stream never stops until you explicitly disable it.

Half-Transfer and Full-Transfer in Circular Mode

Because the buffer wraps continuously, the software needs a way to know which half is currently being filled by the DMA and which half is safe to read. The half-transfer interrupt (HTIF flag) provides exactly this: it fires when the write pointer crosses the midpoint, giving the CPU an entire half-period to process the older data before the DMA comes back around. This is the classic ping-pong pattern, and it is the foundation of real-time audio streaming, continuous ADC monitoring, and high-speed data logging.

/* ─── ADC1 circular DMA, 256-sample buffer, alternating-half processing ────── */
#define ADC_BUF_SIZE  256
#define ADC_HALF      (ADC_BUF_SIZE / 2)

uint16_t adc_buf[ADC_BUF_SIZE];    /* DMA destination: never written by CPU */
float    processed[ADC_HALF];      /* CPU result buffer */

/* Start continuous ADC + circular DMA */
void ADC_DMA_StartCircular(void)
{
    /* ADC configured for continuous mode, scan mode off (1 channel) */
    HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adc_buf, ADC_BUF_SIZE);
}

/* Half-transfer: DMA has filled [0..127], now filling [128..255] */
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
    if (hadc->Instance == ADC1)
    {
        /* Process first half — DMA is safely writing second half */
        for (int i = 0; i < ADC_HALF; i++)
        {
            float voltage = (adc_buf[i] / 4095.0f) * 3.3f;
            processed[i] = voltage;   /* or apply filter, RMS, FFT etc. */
        }
    }
}

/* Full-transfer: DMA has filled [128..255], wraps back to [0] */
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
    if (hadc->Instance == ADC1)
    {
        /* Process second half — DMA is safely writing first half again */
        for (int i = 0; i < ADC_HALF; i++)
        {
            float voltage = (adc_buf[ADC_HALF + i] / 4095.0f) * 3.3f;
            processed[i] = voltage;
        }
    }
}

/* Audio streaming: same pattern with I2S DMA at 44.1 kHz */
/* Each callback provides exactly 128 new PCM samples (2.9 ms @ 44.1 kHz) */
/* Apply biquad IIR or gain before re-output via DAC DMA                  */

One important caveat: in circular mode, the DWT cycle counter or a hardware timer should be used to verify that your callback processing time is less than the time it takes the DMA to fill one half. If processing takes longer than one half-period, you will experience a buffer overrun — the DMA starts overwriting data the CPU has not yet finished reading.
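
The budget itself is simple arithmetic: the time to fill half the buffer, expressed in CPU cycles. The sketch below computes it and compares a measured cycle count against it with a safety margin. The function names and the idea of a percentage margin are assumptions for illustration; on target, the measured value would come from reading DWT->CYCCNT at the start and end of the callback.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles available while the DMA fills one half of the buffer. */
uint32_t half_buffer_budget_cycles(uint32_t cpu_hz, uint32_t sample_rate_hz,
                                   uint32_t half_samples)
{
    assert(sample_rate_hz > 0U);
    return (uint32_t)(((uint64_t)cpu_hz * half_samples) / sample_rate_hz);
}

/* Return 1 if the measured processing time leaves at least margin_percent
   of the budget unused, else 0. */
int processing_fits(uint32_t measured_cycles, uint32_t budget_cycles,
                    uint32_t margin_percent)
{
    if (margin_percent > 100U) {
        return 0;
    }
    uint64_t allowed = ((uint64_t)budget_cycles * (100U - margin_percent)) / 100U;
    return measured_cycles <= allowed;
}
```

At 168 MHz, a 44.1 kHz stream with 128-sample halves gives a budget of roughly 487,600 cycles per callback, which matches the 2.9 ms figure quoted above.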

Memory-to-Memory (M2M) Transfers

Memory-to-Memory DMA transfers copy data from one SRAM location to another (or from Flash to SRAM) without peripheral involvement. On STM32F4, only DMA2 supports M2M — DMA1 is restricted to peripheral-facing transfers. This is a hardware limitation of the AHB matrix topology and cannot be worked around in software.

M2M transfers use DMA2 Stream0 (or other streams on DMA2) with DMA_MEMORY_TO_MEMORY direction. There is no peripheral request trigger — the DMA runs at full bus speed, limited only by AHB arbitration. At 168 MHz with a 32-bit wide AHB, DMA M2M can sustain approximately 336 MB/s for word-aligned 32-bit transfers, compared to approximately 280 MB/s for a well-optimised CPU memcpy loop (which also stalls the CPU pipeline). The CPU memcpy in libc is typically slower than DMA M2M for buffers larger than about 64 bytes, and consumes 100% of CPU time during the copy.

Key limitation: M2M DMA does not support circular mode. It always operates in normal (single-shot) mode. Once NDTR reaches zero, the stream must be re-armed to start another transfer.

/* ─── DMA2 M2M: dma_memcpy(), blocking until transfer complete ──────────── */
#include "stm32f4xx_hal.h"

static DMA_HandleTypeDef hdma_m2m;
static volatile uint8_t  m2m_done;   /* 0 = running, 1 = complete, 2 = error */

/* The generic DMA driver has no weak HAL_DMA_XferCpltCallback(); it calls the
   function pointers stored in the handle, so they must be set explicitly. */
static void DMA_M2M_Complete(DMA_HandleTypeDef *hdma)
{
    (void)hdma;
    m2m_done = 1;
}

static void DMA_M2M_Error(DMA_HandleTypeDef *hdma)
{
    (void)hdma;
    m2m_done = 2;
}

/* Initialise DMA2 Stream0 for M2M (call once at startup).
   Pick another DMA2 stream if Stream0 is already serving ADC1. */
void DMA_M2M_Init(void)
{
    __HAL_RCC_DMA2_CLK_ENABLE();

    hdma_m2m.Instance                 = DMA2_Stream0;
    hdma_m2m.Init.Channel             = DMA_CHANNEL_0;
    hdma_m2m.Init.Direction           = DMA_MEMORY_TO_MEMORY;
    hdma_m2m.Init.PeriphInc           = DMA_PINC_ENABLE;  /* source increments */
    hdma_m2m.Init.MemInc              = DMA_MINC_ENABLE;  /* dest increments   */
    hdma_m2m.Init.PeriphDataAlignment = DMA_PDATAALIGN_WORD;
    hdma_m2m.Init.MemDataAlignment    = DMA_MDATAALIGN_WORD;
    hdma_m2m.Init.Mode                = DMA_NORMAL;       /* M2M: no circular  */
    hdma_m2m.Init.Priority            = DMA_PRIORITY_LOW;
    hdma_m2m.Init.FIFOMode            = DMA_FIFOMODE_ENABLE;
    hdma_m2m.Init.FIFOThreshold       = DMA_FIFO_THRESHOLD_FULL;
    hdma_m2m.Init.MemBurst            = DMA_MBURST_INC4;  /* 4-beat burst      */
    hdma_m2m.Init.PeriphBurst         = DMA_PBURST_INC4;

    HAL_DMA_Init(&hdma_m2m);

    hdma_m2m.XferCpltCallback  = DMA_M2M_Complete;
    hdma_m2m.XferErrorCallback = DMA_M2M_Error;

    HAL_NVIC_SetPriority(DMA2_Stream0_IRQn, 6, 0);
    HAL_NVIC_EnableIRQ(DMA2_Stream0_IRQn);
}

/* Blocking DMA memcpy: spins until the transfer ends (can be made async).
   Requires word-aligned src/dst, byte_count a multiple of 4, and at most
   65,535 words per call because NDTR is 16 bits wide. */
HAL_StatusTypeDef dma_memcpy(void *dst, const void *src, uint32_t byte_count)
{
    uint32_t words = byte_count / 4U;

    if ((byte_count % 4U) != 0U || words == 0U || words > 0xFFFFU)
    {
        return HAL_ERROR;
    }

    m2m_done = 0;
    /* NDTR counts source items, which are words in this configuration */
    if (HAL_DMA_Start_IT(&hdma_m2m, (uint32_t)src, (uint32_t)dst, words) != HAL_OK)
    {
        return HAL_ERROR;
    }
    while (m2m_done == 0U) { }   /* replace with task yield in RTOS context */

    return (m2m_done == 1U) ? HAL_OK : HAL_ERROR;
}

void DMA2_Stream0_IRQHandler(void)
{
    HAL_DMA_IRQHandler(&hdma_m2m);
}

For RTOS environments, replace the while (!m2m_done) spin-wait with a semaphore take. The DMA complete callback gives the semaphore from ISR context, and the calling task blocks efficiently during the transfer.

Throughput Comparison: DMA M2M vs CPU memcpy

The following table quantifies the practical benefit of DMA M2M for different buffer sizes, measured on an STM32F407 running at 168 MHz with all caches enabled and data residing in SRAM1. CPU utilisation is the percentage of available CPU cycles consumed by the transfer operation.

Buffer Size | CPU memcpy (cycles) | CPU memcpy (µs) | DMA M2M (cycles)         | DMA M2M (µs) | CPU util (DMA)
64 bytes    | ~48                 | 0.29            | ~80 (overhead dominates) | 0.48         | ~0% (CPU free after start)
256 bytes   | ~192                | 1.14            | ~100                     | 0.60         | 0%
1 KB        | ~768                | 4.57            | ~360                     | 2.14         | 0%
4 KB        | ~3,072              | 18.3            | ~1,280                   | 7.62         | 0%
64 KB       | ~49,152             | 293             | ~20,480                  | 122          | 0%

Note that for very small transfers (under ~64 bytes), DMA M2M is actually slower than CPU memcpy because the DMA setup overhead (configuring SxCR, SxNDTR, clearing flags, enabling the stream) takes approximately 80 cycles regardless of transfer size. The crossover point where DMA becomes faster is around 128–256 bytes. Below this threshold, use CPU memcpy; above it, DMA M2M frees the CPU for other work and is faster in wall-clock time.
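
That threshold can be folded into a small dispatch helper so that callers do not have to think about it. The sketch below is an assumption-laden illustration: the 256-byte cut-off is taken from the crossover range quoted above and should be tuned by measurement, and `choose_copy_method` is a hypothetical name. It also rejects copies that a word-sized M2M stream cannot do in one shot (unaligned pointers, lengths that are not a multiple of four, or more items than the 16-bit NDTR can count).

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_COPY_MIN_BYTES 256U    /* assumed crossover; tune by measurement */
#define DMA_MAX_ITEMS      65535U  /* NDTR is a 16-bit counter */

typedef enum { COPY_WITH_CPU, COPY_WITH_DMA } copy_method_t;

/* Decide how to copy `bytes` bytes from src to dst. Pointers are only
   inspected for alignment, never dereferenced. */
copy_method_t choose_copy_method(const void *dst, const void *src, size_t bytes)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (bytes < DMA_COPY_MIN_BYTES) {
        return COPY_WITH_CPU;              /* setup overhead dominates */
    }
    if (((d | s | (uintptr_t)bytes) & 3U) != 0U) {
        return COPY_WITH_CPU;              /* word transfers need 4-byte alignment */
    }
    if (bytes / 4U > DMA_MAX_ITEMS) {
        return COPY_WITH_CPU;              /* would have to be split into chunks */
    }
    return COPY_WITH_DMA;
}
```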

Double Buffering

While the circular mode ping-pong pattern implements double buffering in software, STM32F4 DMA also has a hardware-assisted Double Buffer Mode (DBM) that eliminates the need for manual half/full offset calculation. In DBM, the DMA controller maintains two memory pointers — Memory0 and Memory1 — and automatically switches between them each time NDTR reaches zero. The CPU always processes the buffer the DMA is not currently writing to, and the current target buffer is indicated by the CT bit in DMA_SxCR.

DBM is configured by setting the DBM bit in DMA_SxCR and providing both DMA_SxM0AR and DMA_SxM1AR. The HAL function HAL_DMAEx_MultiBufferStart_IT() handles this. DBM is especially useful for high-speed audio (I2S) and video capture (DCMI) where missing a single transfer window causes an audible glitch or dropped frame.

/* ─── Double Buffer DMA: I2S audio receive, two 256-sample buffers ─────────── */
#define AUDIO_BUF_SAMPLES  256

int16_t audio_buf0[AUDIO_BUF_SAMPLES];   /* Memory0 (DMA_SxM0AR) */
int16_t audio_buf1[AUDIO_BUF_SAMPLES];   /* Memory1 (DMA_SxM1AR) */

/* Called every time one full buffer of samples has been received.
   CT names the buffer the DMA is writing NOW, so the CPU takes the other one. */
static void Audio_BufferComplete(DMA_HandleTypeDef *hdma)
{
    if ((hdma->Instance->CR & DMA_SxCR_CT) != 0U)
    {
        /* CT = 1: DMA is now filling audio_buf1, audio_buf0 is complete */
        ApplyFilter(audio_buf0, AUDIO_BUF_SAMPLES);
        DAC_DMA_OutputBuffer(audio_buf0, AUDIO_BUF_SAMPLES);
    }
    else
    {
        /* CT = 0: DMA is now filling audio_buf0, audio_buf1 is complete */
        ApplyFilter(audio_buf1, AUDIO_BUF_SAMPLES);
        DAC_DMA_OutputBuffer(audio_buf1, AUDIO_BUF_SAMPLES);
    }
}

void Audio_DMA_DoubleBuffer_Start(void)
{
    /* hdma_i2s_rx is the DMA1 Stream3 handle (SPI2/I2S2 RX on F4).
       In double-buffer mode the hardware always wraps around, whatever
       Init.Mode says. */

    /* The generic DMA driver dispatches through handle function pointers:
       XferCpltCallback after Memory0 fills, XferM1CpltCallback after Memory1. */
    hdma_i2s_rx.XferCpltCallback   = Audio_BufferComplete;
    hdma_i2s_rx.XferM1CpltCallback = Audio_BufferComplete;

    HAL_DMAEx_MultiBufferStart_IT(
        &hdma_i2s_rx,
        (uint32_t)&SPI2->DR,           /* peripheral address (I2S data reg) */
        (uint32_t)audio_buf0,          /* Memory0                           */
        (uint32_t)audio_buf1,          /* Memory1                           */
        AUDIO_BUF_SAMPLES
    );

    /* This call only starts the stream. The I2S peripheral and its RX DMA
       request (RXDMAEN in SPI2->CR2) still have to be enabled separately. */
}
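
The CT logic is easy to invert by accident, so it helps to isolate it in a pure function that can be tested off-target. The sketch below assumes the DMA_SxCR bit positions from the reference manual (DBM at bit 18, CT at bit 19) and uses locally defined masks rather than the device header so that it builds on a PC. `dbm_cpu_buffer` is a hypothetical helper name.

```c
#include <stdint.h>

#define SXCR_DBM_BIT (1UL << 18)   /* double-buffer mode enabled */
#define SXCR_CT_BIT  (1UL << 19)   /* current target: 0 = Memory0, 1 = Memory1 */

/*
 * Given a snapshot of DMA_SxCR, return the index (0 or 1) of the buffer the
 * CPU may process, i.e. the one the DMA is NOT currently targeting.
 * Returns -1 if double-buffer mode is not enabled.
 */
int dbm_cpu_buffer(uint32_t cr)
{
    if ((cr & SXCR_DBM_BIT) == 0U) {
        return -1;
    }
    return ((cr & SXCR_CT_BIT) != 0U) ? 0 : 1;
}
```

Inside a transfer-complete handler the DMA has already switched targets, so CT set means Memory0 has just been completed, and CT clear means Memory1 has.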

A common pitfall with DBM is assuming the callback fires synchronously. On a Cortex-M4 at 168 MHz, the IRQ latency is 12 cycles minimum. For audio at 44.1 kHz with 256-sample buffers, each buffer lasts approximately 5.8 ms — plenty of headroom. For smaller buffers or very high sample rates, the latency budget shrinks and you must profile carefully.

Zero-Copy Firmware Patterns

The ultimate DMA optimisation is zero-copy: data moves from peripheral to its final resting place without any CPU memcpy step. This requires careful buffer ownership discipline. The rule is simple but must be enforced rigorously: the CPU may only access a DMA buffer during the window when the DMA controller is not accessing it. Violating this rule produces subtle, timing-dependent data corruption that is extremely difficult to debug.

DMA-Safe Memory Regions

On STM32F4, the Core Coupled Memory (CCM RAM, 0x10000000–0x1000FFFF on F407) is not accessible by DMA: it is connected directly to the Cortex-M4 data bus (D-bus), bypassing the AHB matrix. Pointing a DMA stream at a CCM buffer makes the transfer fail, typically with a transfer error flag and no data moved, which looks like a silent failure if the error interrupt is not handled. All DMA buffers must reside in standard SRAM1 or SRAM2. Either place them in a section that your linker script maps to main SRAM, for example __attribute__((section(".sram1_bss"))) if such a section is defined, or simply declare DMA buffers as global/static, which puts them in the default BSS/data section (in SRAM1 with the stock linker scripts).
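
A cheap guard against the CCM mistake is an address-range check in debug builds before a buffer is handed to the DMA. The sketch below is host-runnable and hard-codes an STM32F407 memory map as an assumption (CCM as stated above, and 128 KB of main SRAM starting at 0x20000000); other devices need different constants. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumed STM32F407 memory map; check the datasheet for your device. */
#define CCM_BASE   0x10000000UL
#define CCM_SIZE   0x00010000UL   /* 64 KB, CPU-only */
#define SRAM_BASE  0x20000000UL
#define SRAM_SIZE  0x00020000UL   /* SRAM1 (112 KB) + SRAM2 (16 KB) */

/* Return 1 if [addr, addr + len) lies entirely inside [base, base + size). */
static int addr_in_range(uint32_t addr, uint32_t len, uint32_t base, uint32_t size)
{
    return addr >= base && len <= size && (addr - base) <= (size - len);
}

/* Return 1 if the buffer is fully inside DMA-reachable main SRAM. */
int dma_buffer_ok(uint32_t addr, uint32_t len)
{
    return len > 0U && addr_in_range(addr, len, SRAM_BASE, SRAM_SIZE);
}

/* Return 1 if the buffer lies in CCM RAM, which the DMA cannot reach. */
int buffer_in_ccm(uint32_t addr, uint32_t len)
{
    return len > 0U && addr_in_range(addr, len, CCM_BASE, CCM_SIZE);
}
```

On target the address would come from a cast such as `(uint32_t)(uintptr_t)buf`. Buffers placed in Flash for memory-to-peripheral transfers would need an extra range, which this sketch leaves out.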

On STM32H7, there is an additional concern: the Cortex-M7 has a cache (D-cache and I-cache), and DMA operates on physical memory, bypassing the cache. A DMA write to SRAM will not be visible to the CPU until the cache line is invalidated. Use SCB_InvalidateDCache_by_Addr() after a DMA receive and SCB_CleanDCache_by_Addr() before a DMA transmit.

/* ─── Zero-copy packet receive: SPI-attached W5500 Ethernet controller ─────── */
/* The W5500 signals data-ready via an interrupt on PA4.
   We DMA the Ethernet frame directly from W5500 SPI RX into the
   application protocol buffer, with no intermediate copy.                  */

#define ETH_FRAME_MAX  1518

/* Application-layer receive buffer: DMA writes here directly */
static uint8_t eth_rx_frame[ETH_FRAME_MAX]
    __attribute__((aligned(4)));         /* word-aligned for DMA efficiency */

static volatile uint16_t eth_rx_len;
static volatile uint8_t  eth_frame_ready;

/* W5500 data-ready ISR: read frame length, then kick off DMA receive.
   The polled HAL calls below rely on HAL_GetTick() for their timeouts, so
   this EXTI line must run at a lower priority than SysTick. */
void EXTI4_IRQHandler(void)
{
    __HAL_GPIO_EXTI_CLEAR_IT(GPIO_PIN_4);

    /* Read the 2-byte received-size register (polled: it's only 2 bytes).
       Assumed here: socket 0, Sn_RX_RSR at offset 0x0026, control byte 0x08
       (socket 0 register block, read). Verify against the W5500 datasheet. */
    uint8_t cmd[3] = {0x00, 0x26, 0x08};
    HAL_GPIO_WritePin(W5500_CS_GPIO_Port, W5500_CS_Pin, GPIO_PIN_RESET);
    HAL_SPI_Transmit(&hspi1, cmd, 3, 10);
    uint8_t size_buf[2];
    HAL_SPI_Receive(&hspi1, size_buf, 2, 10);
    HAL_GPIO_WritePin(W5500_CS_GPIO_Port, W5500_CS_Pin, GPIO_PIN_SET);

    uint16_t len = (uint16_t)((size_buf[0] << 8) | size_buf[1]);
    if (len == 0 || len > ETH_FRAME_MAX) return;
    eth_rx_len = len;

    /* Now DMA the full frame in one shot, with no CPU involvement.
       A complete driver first sends the address/control header for the socket
       RX buffer (omitted here) and updates the read pointer afterwards. */
    HAL_GPIO_WritePin(W5500_CS_GPIO_Port, W5500_CS_Pin, GPIO_PIN_RESET);
    HAL_SPI_Receive_DMA(&hspi1, eth_rx_frame, len);
    /* CS deasserted in HAL_SPI_RxCpltCallback after DMA completes */
}

void HAL_SPI_RxCpltCallback(SPI_HandleTypeDef *hspi)
{
    if (hspi->Instance == SPI1)
    {
        HAL_GPIO_WritePin(W5500_CS_GPIO_Port, W5500_CS_Pin, GPIO_PIN_SET);
        eth_frame_ready = 1;   /* main loop: parse eth_rx_frame[0..eth_rx_len-1] */
    }
}

/* Safely stop a running DMA stream.
   HAL_DMA_Abort() polls the EN bit with a HAL_GetTick() timeout, so interrupts
   must stay enabled: with them masked the tick never advances, and a stream
   that fails to stop would hang this call forever. */
HAL_StatusTypeDef DMA_SafeDisable(DMA_HandleTypeDef *hdma)
{
    return HAL_DMA_Abort(hdma);   /* blocks until stream is disabled in hardware */
}

The scatter-gather pattern, where multiple non-contiguous destination buffers are chained, is not natively supported by the STM32F4 DMA controller (unlike the MDMA controller on STM32H7, which supports linked-list transfers). On F4, you must chain transfers manually in the TC callback, updating the memory address register before restarting the stream.

DMA Error Handling

DMA transfers can fail silently if not monitored. The DMA interrupt status registers (DMA_LISR/DMA_HISR) contain three error flags per stream: Transfer Error (TEIF), FIFO Error (FEIF), and Direct Mode Error (DMEIF). HAL reports these through the XferErrorCallback function pointer in the DMA handle, and HAL_DMA_GetError() returns which ones occurred. In production firmware, always install an error handler. A TEIF means a bus error occurred during a DMA access, for example an access to an address the DMA cannot reach or to a peripheral whose clock is not enabled.
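
When debugging without the HAL, it helps to decode these registers by hand. Each status register packs four streams at bit offsets 0, 6, 16, and 22, with the same flag layout inside each group. The sketch below reflects my reading of the register map in the reference manual and uses locally defined masks so that it runs on a PC; `dma_stream_flags` is a hypothetical helper, and the offsets should be verified for your device.

```c
#include <stdint.h>

/* Flag positions inside one stream's 6-bit group (bit 1 is reserved). */
#define DMA_FLAG_FEIF   (1UL << 0)   /* FIFO error */
#define DMA_FLAG_DMEIF  (1UL << 2)   /* direct mode error */
#define DMA_FLAG_TEIF   (1UL << 3)   /* transfer error */
#define DMA_FLAG_HTIF   (1UL << 4)   /* half transfer */
#define DMA_FLAG_TCIF   (1UL << 5)   /* transfer complete */

/*
 * Extract the flags of one stream (0..7) from snapshots of LISR and HISR.
 * Streams 0-3 live in LISR, streams 4-7 in HISR, with identical layout.
 */
uint32_t dma_stream_flags(uint32_t lisr, uint32_t hisr, unsigned stream)
{
    static const uint8_t shift[4] = {0, 6, 16, 22};
    uint32_t reg = (stream < 4U) ? lisr : hisr;

    return (reg >> shift[stream & 3U]) & 0x3DUL;   /* keep bits 0,2,3,4,5 */
}
```

A result with `DMA_FLAG_TEIF` set for the stream in question points at the bus-error case described above.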

/* ─── DMA error callback and recovery ──────────────────────────────────── */
void DMA_Error_Handler(DMA_HandleTypeDef *hdma)
{
    uint32_t error = HAL_DMA_GetError(hdma);

    if (error & HAL_DMA_ERROR_TE)
    {
        /* Transfer error: re-init the DMA stream entirely */
        HAL_DMA_DeInit(hdma);
        HAL_DMA_Init(hdma);

        /* Re-arm the receive: peripheral re-enables its DMA request */
        if (hdma->Instance == DMA2_Stream2)   /* USART1 RX (DMA2, Channel 4) */
        {
            HAL_UART_Receive_DMA(&huart1, uart_rx_dma_buf, UART_RX_BUF_SIZE);
        }
    }

    if (error & HAL_DMA_ERROR_FE)
    {
        /* FIFO error: switch to direct mode or adjust burst settings */
        hdma->Init.FIFOMode = DMA_FIFOMODE_DISABLE;
        HAL_DMA_DeInit(hdma);
        HAL_DMA_Init(hdma);
    }
}

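/* Register error callbacks after HAL_DMA_Init.
   Note: peripheral drivers such as HAL_UART_Receive_DMA() and HAL_ADC_Start_DMA()
   install their own DMA error handler when a transfer starts and report through
   HAL_UART_ErrorCallback() / HAL_ADC_ErrorCallback(). A pointer set here is only
   kept for streams started directly with HAL_DMA_Start_IT(). */
void RegisterDmaCallbacks(void)
{
    hdma_uart_rx.XferErrorCallback = DMA_Error_Handler;
    hdma_adc1.XferErrorCallback    = DMA_Error_Handler;
}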

DMA and the MPU on STM32H7

On STM32H7 (Cortex-M7), the Memory Protection Unit interacts with DMA in a non-obvious way. The Cortex-M7's D-cache caches SRAM accesses, but DMA operates on physical memory and bypasses the cache entirely. If DMA writes to a region that the CPU has cached, the CPU will read the old (pre-DMA) data from cache, which is a classic cache coherency bug. The solution is to declare DMA buffers in a memory region that is marked as non-cacheable in the MPU configuration, using an attribute section such as __attribute__((section(".noncacheable"))) and configuring that address range in the MPU as non-cacheable (with the STM32 HAL, IsCacheable = MPU_ACCESS_NOT_CACHEABLE in the MPU_Region_InitTypeDef passed to HAL_MPU_ConfigRegion()). Alternatively, use explicit cache maintenance: SCB_InvalidateDCache_by_Addr() after DMA receive and SCB_CleanDCache_by_Addr() before DMA transmit, at the cost of additional CPU cycles for each transfer. Cache maintenance works on whole 32-byte lines, so such buffers should be 32-byte aligned and padded to a multiple of 32 bytes.
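
Because cache maintenance operates on whole cache lines, the address and size passed to the by-address functions are normally rounded out to line boundaries. The sketch below shows that rounding as plain arithmetic, assuming the 32-byte D-cache line of the Cortex-M7; `cache_range_t` and `cache_align_range` are hypothetical names. On target, the resulting address and size would be passed to SCB_InvalidateDCache_by_Addr() or SCB_CleanDCache_by_Addr().

```c
#include <stdint.h>

#define DCACHE_LINE 32U   /* Cortex-M7 D-cache line size in bytes */

typedef struct {
    uint32_t addr;   /* start address, rounded down to a line boundary */
    uint32_t size;   /* length in bytes, a multiple of the line size */
} cache_range_t;

/* Expand [addr, addr + size) to the smallest enclosing run of cache lines. */
cache_range_t cache_align_range(uint32_t addr, uint32_t size)
{
    cache_range_t r;
    uint32_t end = addr + size;

    r.addr = addr & ~(DCACHE_LINE - 1U);
    end    = (end + DCACHE_LINE - 1U) & ~(DCACHE_LINE - 1U);
    r.size = end - r.addr;
    return r;
}
```

Rounding out is only safe if nothing else shares those lines: invalidating a line that also holds an unrelated variable can throw away that variable's latest value. That is the practical reason to align and pad DMA buffers themselves rather than rely on this helper alone.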

Exercises

Practice Approach: Each exercise builds on the previous. Use an STM32F4 Nucleo or Discovery board. A logic analyser or oscilloscope is strongly recommended for the intermediate and advanced exercises.

BeginnerExercise 1: DMA M2M vs CPU memcpy Benchmark

Use DMA2 to copy a 4096-byte array from one SRAM region to another (memory-to-memory transfer). Separately, implement a standard CPU memcpy of the same array. Use the DWT cycle counter (DWT->CYCCNT) to time both operations at the same clock frequency (168 MHz). Report the cycle count ratio. Expected result: the DMA transfer should take on the order of 1,000 to 1,300 cycles (4096/4 = 1024 words plus setup overhead), while CPU memcpy takes noticeably longer and consumes 100% of the processor during that time. Compare your figures with the 4 KB row of the throughput table in the Memory-to-Memory section.

IntermediateExercise 2: Circular UART DMA at 1 Mbit/s

Implement circular DMA receive on USART1 at 1 Mbit/s into a 512-byte buffer. Use the half-transfer and full-transfer callbacks to process the incoming data (e.g., calculate a running CRC-16 over each 256-byte block). Send 10,240 bytes continuously from a PC terminal application (Python pyserial is ideal); that is forty 256-byte blocks, so every byte is covered by a half- or full-transfer callback. Verify that no data is lost by checking that the received CRC matches the expected value. Common failure modes to investigate: incorrect DMA stream/channel selection, missing clock enable, buffer placed in CCM RAM.

AdvancedExercise 3: Audio Passthrough with Real-Time Filtering

Build a dual-ADC, dual-DAC audio passthrough. Configure ADC1 to sample at 44.1 kHz (timer-triggered, since free-running continuous mode cannot hit an exact audio rate) via DMA circular with a 256-sample (512-byte) buffer. In each half/full callback, apply a software biquad IIR filter (second-order section, low-pass at 4 kHz) to the just-completed half of the ADC buffer. Write the filtered output via DAC1 DMA. Measure total filter latency, defined as the time from when a transient appears on the ADC input to when it appears on the DAC output, using two GPIO toggles and an oscilloscope. Target latency should be under 6 ms (approximately 2.9 ms ADC half-buffer + 2.9 ms DAC half-buffer + processing overhead). Profile CPU utilisation using SysTick and a GPIO-high-during-processing pattern.

STM32 DMA Configuration Document Generator

Use this tool to document your DMA configuration for a project. Fill in stream assignments, mode, FIFO settings, and design notes, then export to Word, Excel, PDF, or PowerPoint. Drafts are saved automatically in your browser.

Conclusion & Next Steps

DMA is not an optional optimisation — on any STM32 firmware that involves continuous data streams, it is the only viable architecture. In this article we covered:

  • DMA architecture: Two controllers (DMA1 and DMA2) with 8 streams each, stream/channel mapping fixed in silicon, FIFO vs direct mode, priority arbitration.
  • Peripheral-to-Memory: ADC and UART circular DMA with half and full transfer callbacks — the foundation of zero-latency data acquisition.
  • Memory-to-Peripheral: SPI display framebuffer push, DAC waveform generation, UART bulk logging — non-blocking, CPU-free data output.
  • Circular mode: Auto-reloading NDTR for continuous streams, ping-pong half/full buffer pattern, audio streaming application.
  • Memory-to-Memory: DMA2-only M2M, dma_memcpy() implementation, throughput comparison vs CPU memcpy.
  • Double buffering (DBM): Hardware-managed double-buffer mode, CT bit inspection, audio codec application.
  • Zero-copy patterns: Direct-to-application-buffer receive, CCM RAM hazard, cache coherency on STM32H7, scatter-gather limitations.

Next in the Series

In Part 9: Interrupt Management & NVIC, we will master priority grouping, preemption, nested ISR design, the HAL callback dispatch architecture, and how to measure and budget interrupt latency — the essential knowledge for building deterministic real-time firmware.
