
STM32 Part 3: UART Communication

March 31, 2026 Wasil Zafar 26 min read

From polling to interrupt to DMA — master every STM32 UART mode, retarget printf to serial, implement a zero-copy ring buffer, and build a command-line interface over UART.

Table of Contents

  1. UART on STM32
  2. Polling Mode
  3. Interrupt Mode
  4. DMA Mode
  5. Printf Retargeting
  6. Ring Buffer Implementation
  7. UART Command-Line Interface
  8. Exercises
  9. UART Design Tool
  10. Conclusion & Next Steps
Series Overview: This is Part 3 of our 18-part STM32 Unleashed series. Having mastered GPIO in Part 2, we now tackle UART — the most universally useful peripheral for embedded debug output, host communication, and sensor interfacing. We cover all three transfer modes and build production-quality supporting infrastructure.

STM32 Unleashed: HAL Driver Development

Your 18-step learning path • Currently on Step 3
1
Architecture & CubeMX Setup
STM32 family, clock tree, HAL vs LL, CubeMX workflow, first project
Completed
2
GPIO & Button Debounce
GPIO modes, pull-up/down, EXTI, software debounce, HAL_GPIO_ReadPin
Completed
3
UART Communication
Polling, interrupt, DMA modes, printf retargeting, ring buffers
You Are Here
4
Timers, PWM & Input Capture
TIM basics, PWM generation, input capture, encoder mode
5
ADC & DAC
Single/continuous conversion, DMA, injected channels, DAC waveforms
6
SPI Protocol
SPI master/slave, full-duplex, DMA transfers, sensor drivers
7
I2C Protocol
I2C master, 7/10-bit addressing, DMA, multi-master, error handling
8
DMA & Memory Efficiency
DMA streams, circular mode, memory-to-memory, zero-copy patterns
9
Interrupt Management & NVIC
Priority grouping, preemption, ISR design, HAL callbacks, latency
10
Low-Power Modes
Sleep, Stop, Standby modes, RTC wakeup, LP UART, power profiling
11
RTC & Calendar
RTC configuration, alarms, backup registers, calendar subseconds
12
CAN Bus
FDCAN/bxCAN, filters, message frames, error handling, automotive use
13
USB CDC Virtual COM Port
USB FS/HS, CDC class, virtual serial, control transfers, descriptors
14
FreeRTOS Integration
Tasks, queues, semaphores, mutexes, CMSIS-RTOS2 wrapper, stack sizing
15
Bootloader Development
Custom IAP bootloader, UART/USB DFU, flash programming, jump-to-app
16
External Storage: SD & QSPI Flash
FATFS on SD card, QSPI NOR flash, memory-mapped execution, wear levelling
17
Ethernet & TCP/IP Stack
LwIP integration, DHCP, TCP server, HTTP, MQTT, Ethernet DMA descriptors
18
Production Readiness
Watchdog, HardFault handler, flash option bytes, code signing, CI/CD

UART on STM32

Serial communication via UART (Universal Asynchronous Receiver/Transmitter) remains the backbone of embedded system debugging and host communication decades after its invention. On STM32, the UART peripheral exists in three variants with progressively greater capability. Understanding the clock domain, baud rate calculation, and oversampling options before writing a single line of HAL code prevents the most common UART configuration bugs.

USART vs UART vs LPUART

USART (Universal Synchronous/Asynchronous Receiver/Transmitter) supports both synchronous mode (with a CK clock output pin) and standard asynchronous UART operation. It also optionally supports IrDA (infrared encoding), LIN (automotive local interconnect), and SmartCard (ISO 7816) protocols. Most STM32 UART peripherals are labelled USART in the reference manual, though they are used in asynchronous mode for standard serial communication.

UART (labelled UART4, UART5 on STM32F4) is asynchronous-only — no CK pin, no SmartCard. Lower pin count, but identical to USART for all standard serial use cases.

LPUART (Low-Power UART) is available on STM32L4, G0, G4, H7, and U5 families. It can operate from the LSE (32.768 kHz) clock in Stop mode, enabling the MCU to receive UART data while the core is sleeping — crucial for battery-powered applications that must wake on a command byte.

Clock Domains and APB Bus Assignment

UART peripheral clock speed determines the maximum achievable baud rate. On STM32F4:

  • USART1 and USART6 are clocked from APB2 — up to 84 MHz when SYSCLK = 168 MHz
  • USART2, USART3, UART4, UART5 are clocked from APB1 — up to 42 MHz when SYSCLK = 168 MHz
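
| USART Instance | APB Bus | Max Clock (F401 @ 84 MHz) | Max Baud (OVER8=0) | Nucleo-F401RE Default Use |
|---|---|---|---|---|
| USART1 | APB2 | 84 MHz | 5.25 Mbaud | |
| USART2 | APB1 | 42 MHz | 2.625 Mbaud | ST-Link VCP (PA2/PA3) |
| USART3 | APB1 | 42 MHz | 2.625 Mbaud | |
| UART4 | APB1 | 42 MHz | 2.625 Mbaud | |
| UART5 | APB1 | 42 MHz | 2.625 Mbaud | |
| USART6 | APB2 | 84 MHz | 5.25 Mbaud | |

Note: the F401 itself provides only USART1, USART2, and USART6. USART3, UART4, and UART5 are listed for larger F4 parts such as the F405/F407, where the same APB limits apply at SYSCLK = 168 MHz.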

Baud Rate Calculation

The UART baud rate is determined by the BRR (Baud Rate Register) value:

BRR = f_PCLK / (8 × (2 − OVER8) × BaudRate)

With OVER8 = 0 (oversampling by 16, the default): BRR = f_PCLK / (16 × BaudRate)

For USART2 at 115200 baud with APB1 = 42 MHz: BRR = 42,000,000 / (16 × 115,200) = 22.786. The integer part is stored in BRR[15:4]: 22 = 0x16. The fractional part is stored in BRR[3:0] as DIV_Fraction = 0.786 × 16 ≈ 12.6 → 13 = 0xD. So BRR = 0x016D. The actual baud rate = 42,000,000 / (16 × (22 + 13/16)) = 42,000,000 / 365 ≈ 115,068 baud, a 0.11% error, well within the ±2% tolerance of standard UART.
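
The arithmetic above can be checked with plain integer math. The following is a minimal sketch, assuming the F4-style BRR layout described in this section with OVER8 = 0; the function names are hypothetical helpers, not part of the HAL.

```c
#include <assert.h>
#include <stdint.h>

/* BRR value for OVER8 = 0.
 * BRR[15:4] holds the integer part and BRR[3:0] the fraction in 1/16 steps,
 * so the whole register equals round(16 * USARTDIV) = round(f_PCLK / baud). */
uint32_t usart_brr_over16(uint32_t pclk_hz, uint32_t baud)
{
    assert(baud != 0U);
    return (pclk_hz + baud / 2U) / baud;   /* integer divide, rounded to nearest */
}

/* Baud rate the hardware generates for a given BRR value (OVER8 = 0). */
uint32_t usart_actual_baud_over16(uint32_t pclk_hz, uint32_t brr)
{
    assert(brr != 0U);
    return pclk_hz / brr;
}

/* Signed error in parts per million relative to the requested baud rate. */
int32_t usart_baud_error_ppm(uint32_t actual, uint32_t target)
{
    assert(target != 0U);
    int64_t diff = (int64_t)actual - (int64_t)target;
    return (int32_t)(diff * 1000000 / (int64_t)target);
}
```

For a 42 MHz peripheral clock and 115200 baud, usart_brr_over16 returns 0x16D, matching the hand calculation. Running the same helpers for every baud rate you plan to support is a quick way to spot clock configurations that cannot hit a target rate closely enough.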

Oversampling by 16 vs 8

With OVER8 = 1 (oversampling by 8), the maximum achievable baud rate doubles to f_PCLK/8 (the BRR denominator halves). However, the fractional part uses only 3 bits instead of 4, reducing baud rate precision. More importantly, the receiver becomes less tolerant of clock deviation: both modes take a majority vote of three samples near the centre of each bit, but with 16 samples per bit those samples are placed more precisely. Use OVER8 = 0 (16x) unless you need baud rates above f_PCLK/16, which in practice means above 5.25 Mbaud on USART1/6 or above 2.625 Mbaud on the APB1 instances.
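
The 3-bit fraction changes how the divider is packed into BRR. Here is a hedged sketch of that packing, assuming the F4-class layout (mantissa in BRR[15:4], fraction in BRR[2:0], BRR[3] kept at 0); the helper names are hypothetical, and the layout should be verified against the reference manual for your family.

```c
#include <assert.h>
#include <stdint.h>

/* BRR encoding for OVER8 = 1: the divider is expressed in 1/8 steps. */
uint32_t usart_brr_over8(uint32_t pclk_hz, uint32_t baud)
{
    assert(baud != 0U);
    uint32_t div8     = (pclk_hz + baud / 2U) / baud;  /* round(8 * USARTDIV) */
    uint32_t mantissa = div8 >> 3;                     /* integer part        */
    uint32_t fraction = div8 & 0x7U;                   /* 3-bit fraction      */
    return (mantissa << 4) | fraction;                 /* BRR[3] stays 0      */
}

/* Baud rate produced by an OVER8 = 1 BRR value. */
uint32_t usart_actual_baud_over8(uint32_t pclk_hz, uint32_t brr)
{
    uint32_t div8 = ((brr >> 4) << 3) | (brr & 0x7U);
    assert(div8 != 0U);
    return pclk_hz / div8;
}
```

With an 84 MHz peripheral clock the smallest usable divider (USARTDIV = 1) corresponds to 10.5 Mbaud, which is where the "doubling" comes from.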

Hardware Flow Control: RTS/CTS

CTS (Clear To Send) and RTS (Request To Send) pins provide hardware handshaking. When the receiver's buffer is nearly full, it deasserts RTS; the sender sees its CTS input deasserted and pauses transmission until the receiver is ready. This removes the need for software flow control (XON/XOFF) and prevents overrun errors, provided the remote device honours the handshake. On STM32, enable flow control by setting huart.Init.HwFlowCtl = UART_HWCONTROL_RTS_CTS. It requires four pins (TX, RX, RTS, CTS).

/* ---------------------------------------------------------------
 * UART_HandleTypeDef initialisation for USART2 at 115200 baud
 * Target: STM32F401RE, APB1 clock = 42 MHz
 * --------------------------------------------------------------- */
#include "main.h"

UART_HandleTypeDef huart2;

HAL_StatusTypeDef USART2_Init(void)
{
    huart2.Instance          = USART2;
    huart2.Init.BaudRate     = 115200;
    huart2.Init.WordLength   = UART_WORDLENGTH_8B;
    huart2.Init.StopBits     = UART_STOPBITS_1;
    huart2.Init.Parity       = UART_PARITY_NONE;
    huart2.Init.Mode         = UART_MODE_TX_RX;
    huart2.Init.HwFlowCtl    = UART_HWCONTROL_NONE;
    huart2.Init.OverSampling = UART_OVERSAMPLING_16;

    /* HAL_UART_Init calls HAL_UART_MspInit internally,
     * which must enable GPIOA clock, configure PA2/PA3 as AF7,
     * enable USART2 clock, and optionally configure NVIC/DMA. */
    if (HAL_UART_Init(&huart2) != HAL_OK) {
        return HAL_ERROR;
    }
    return HAL_OK;
}

Polling Mode

Polling mode is the simplest UART transfer method. The CPU waits inside the HAL function until the transfer completes or a timeout expires. There is no interrupt, no DMA, and no callback — the CPU is fully occupied during the transfer. This blocking behaviour makes polling unsuitable for use inside ISRs or RTOS tasks with tight deadlines, but it is perfectly appropriate for startup banners, configuration reads, and simple debug logging where blocking a few milliseconds is acceptable.

HAL_UART_Transmit(handle, buf, size, timeout): Transmits size bytes from buf over the UART. Blocks until all bytes are shifted out of the TX shift register or timeout milliseconds elapse. Returns HAL_OK, HAL_TIMEOUT, HAL_ERROR, or HAL_BUSY if a transmit is already in progress.

HAL_UART_Receive(handle, buf, size, timeout): Waits for size bytes to arrive in the RX buffer. Blocks until the last byte is received or timeout expires. With HAL_MAX_DELAY, the function blocks forever if the remote side never sends data, so never use HAL_MAX_DELAY for receive in systems where the remote may be silent: it will hang your firmware. Calculate a realistic timeout from the baud rate and expected data size: timeout_ms = (size × 10 × 1000) / baud_rate (10 bits per byte for 8N1 framing), then add 10% margin. This covers wire time only; add however long the sender may take to start transmitting.
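
The timeout formula translates directly into a small helper. This is a sketch under stated assumptions: 8N1 framing (10 bits per byte), a 10% margin, and one extra millisecond to absorb tick granularity. The extra millisecond and the function name are this example's own choices, not something the HAL prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Wire-time receive timeout in milliseconds for `size` bytes at `baud`.
 * Computes ceil(size * 10 bits * 1000 ms/s * 1.1 / baud) + 1.
 * The trailing +1 ms is an assumption to cover 1 ms tick granularity. */
uint32_t uart_rx_timeout_ms(uint32_t size, uint32_t baud)
{
    assert(baud != 0U);
    uint64_t num = (uint64_t)size * 10U * 1000U * 11U;  /* margin 1.1 scaled by 10 */
    uint64_t den = (uint64_t)baud * 10U;
    return (uint32_t)((num + den - 1U) / den) + 1U;     /* round up, then +1 tick  */
}
```

The result is only the time the bytes spend on the wire. If the remote device needs time to react before it starts sending, add that latency on top.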

/* ---------------------------------------------------------------
 * Polling mode UART: transmit a greeting, receive 1 byte, echo
 * Timeout calculation: 1 byte at 115200 = ~87 µs → use 10 ms
 * --------------------------------------------------------------- */
#include "main.h"
#include <string.h>

extern UART_HandleTypeDef huart2;

void UART_Polling_Demo(void)
{
    HAL_StatusTypeDef status;
    uint8_t tx_buf[] = "Hello from STM32!\r\n";
    uint8_t rx_byte  = 0;

    /* Transmit startup message: blocking, 19 bytes take ~1.65 ms at 115200 baud */
    status = HAL_UART_Transmit(&huart2,
                                tx_buf,
                                (uint16_t)strlen((char*)tx_buf),
                                50);   /* 50 ms timeout              */
    if (status != HAL_OK) {
        /* Handle transmit error — line stuck low, baud mismatch, etc. */
        Error_Handler();
    }

    /* Receive a single byte — wait up to 5 seconds for user input */
    status = HAL_UART_Receive(&huart2, &rx_byte, 1, 5000);
    if (status == HAL_OK) {
        /* Echo the received byte back */
        HAL_UART_Transmit(&huart2, &rx_byte, 1, 10);
    } else if (status == HAL_TIMEOUT) {
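        /* Plain ASCII only, and non-const so it also matches older HAL
         * versions whose transmit functions take a non-const pointer. */
        uint8_t timeout_msg[] = "Timeout - no input received.\r\n";
        HAL_UART_Transmit(&huart2, timeout_msg,
                           (uint16_t)strlen((char*)timeout_msg), 100);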
    }
}
HAL_MAX_DELAY Pitfall: HAL_MAX_DELAY expands to 0xFFFFFFFFU. Inside HAL's polling loop the timeout check, (HAL_GetTick() - tickstart) > Timeout, uses unsigned arithmetic and so stays correct across tick-counter wraparound, but HAL skips the check entirely when Timeout equals HAL_MAX_DELAY. The function therefore blocks indefinitely if data never arrives. This is acceptable in a dedicated RTOS receive task but dangerous in a main loop, and polling calls should not be made from interrupt context at all.
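
The wraparound behaviour of an unsigned tick comparison is easy to verify on a host machine. The sketch below mirrors the shape of the check described in the pitfall note; tick_timed_out and TICK_WAIT_FOREVER are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_WAIT_FOREVER 0xFFFFFFFFU   /* same value as HAL_MAX_DELAY */

/* Returns true once more than timeout_ms ticks have elapsed since `start`.
 * The unsigned subtraction yields the elapsed time even if the 32-bit tick
 * counter has wrapped between `start` and `now`. */
bool tick_timed_out(uint32_t now, uint32_t start, uint32_t timeout_ms)
{
    assert(timeout_ms != 0U);
    if (timeout_ms == TICK_WAIT_FOREVER) {
        return false;                    /* "wait forever" never expires */
    }
    return (uint32_t)(now - start) > timeout_ms;
}
```

For example, with start = 0xFFFFFFF0 and now = 0x00000010, the elapsed time evaluates to 32 ticks even though the counter wrapped in between.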

Interrupt Mode

In interrupt mode, the HAL function initiates the transfer and returns immediately. The UART peripheral generates an interrupt when each byte is transferred (or the entire buffer completes), and HAL's ISR handles the data movement. Your application code is notified via a callback when the transfer finishes. This non-blocking approach is appropriate for most application-level UART use — the CPU is free to do other work while bytes are being shifted out.

Transmit IT

HAL_UART_Transmit_IT(handle, buf, size) enables the TXE (TX Empty) interrupt and returns HAL_OK immediately. HAL's ISR feeds bytes one at a time from buf into the TDR register as each byte is consumed. When the last byte is loaded, HAL disables the TXE interrupt, enables TC (Transmission Complete), and calls HAL_UART_TxCpltCallback from the TC interrupt.

Important: the buf pointer must remain valid until TxCpltCallback fires. Do not pass a local stack buffer to Transmit_IT and then return from the function — the HAL ISR will read stale stack memory. Use a static or heap-allocated transmit buffer.
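
One way to satisfy the buffer-lifetime rule is to copy each message into a static buffer guarded by a busy flag. This is a minimal sketch, not the only approach: uart_send_async, uart_send_done, and TX_BUF_SIZE are hypothetical names, and the host-side stub exists purely so the logic can be compiled and tested without hardware. It assumes a single calling context (for example, only the main loop).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#ifdef USE_HAL_DRIVER              /* defined by CubeMX-generated projects */
#include "main.h"
extern UART_HandleTypeDef huart2;
static bool tx_start(uint8_t *buf, uint16_t len)
{
    return HAL_UART_Transmit_IT(&huart2, buf, len) == HAL_OK;
}
#else                              /* host build: stand-in for unit testing */
static bool tx_start(uint8_t *buf, uint16_t len)
{
    (void)buf; (void)len;
    return true;
}
#endif

#define TX_BUF_SIZE 128U           /* hypothetical size; tune for your messages */

static uint8_t       s_tx_buf[TX_BUF_SIZE];  /* static storage outlives the caller */
static volatile bool s_tx_busy = false;

/* Copy the message into the static buffer and start an interrupt-driven send.
 * Returns false if a transfer is still in flight or the message does not fit. */
bool uart_send_async(const void *data, uint16_t len)
{
    assert(data != NULL);
    if (s_tx_busy || len == 0U || len > TX_BUF_SIZE) {
        return false;
    }
    memcpy(s_tx_buf, data, len);
    s_tx_busy = true;
    if (!tx_start(s_tx_buf, len)) {
        s_tx_busy = false;         /* start failed: release the buffer */
        return false;
    }
    return true;
}

/* Call from HAL_UART_TxCpltCallback once the transfer has completed. */
void uart_send_done(void)
{
    s_tx_busy = false;
}
```

Because the caller's data is copied before the function returns, passing a local array or a string literal is safe. The cost is one memcpy per message and a fixed upper bound on message length.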

Receive IT and Re-arming

HAL_UART_Receive_IT(handle, buf, size) receives exactly size bytes into buf before firing HAL_UART_RxCpltCallback. After the callback fires, receive interrupts are disabled. To continue receiving, you must call HAL_UART_Receive_IT again inside the callback — the "re-arm" pattern. The single-byte ping-pong technique (receive size = 1, re-arm in callback, push byte to ring buffer) is the most flexible approach for variable-length message protocols.

Error Handling

HAL_UART_ErrorCallback is called when a receive error is detected. Check huart->ErrorCode for the specific fault:

  • HAL_UART_ERROR_PE — parity error: received bit pattern does not match expected parity
  • HAL_UART_ERROR_FE — framing error: stop bit was not detected at the expected time; likely baud rate mismatch
  • HAL_UART_ERROR_ORE — overrun error: a new byte arrived before the previous one was read from RDR; receive interrupt latency too high or baud rate too fast for interrupt mode
  • HAL_UART_ERROR_NE — noise error: sampling detected noise on the bit cell boundary
/* ---------------------------------------------------------------
 * Interrupt-mode receive: single-byte ping-pong into ring buffer
 * Re-arms after every byte; ring buffer is defined in Section 6
 * --------------------------------------------------------------- */
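#include "main.h"
#include "uart_ringbuf.h"   /* assumed header name: declares RingBuf_t and RingBuf_Push (Section 6) */

extern UART_HandleTypeDef huart2;
extern RingBuf_t          g_rx_ringbuf;  /* see Section 6          */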

static uint8_t s_rx_byte = 0;   /* single-byte receive staging      */

/* Call once during initialisation to start the receive chain */
void UART_IT_StartReceive(void)
{
    HAL_UART_Receive_IT(&huart2, &s_rx_byte, 1);
}

/* HAL calls this weak function from USART2_IRQHandler via HAL_UART_IRQHandler */
void HAL_UART_RxCpltCallback(UART_HandleTypeDef *huart)
{
    if (huart->Instance == USART2) {
        /* Push received byte into the ring buffer */
        RingBuf_Push(&g_rx_ringbuf, s_rx_byte);

        /* Re-arm: start the next single-byte receive immediately */
        HAL_UART_Receive_IT(&huart2, &s_rx_byte, 1);
    }
}

void HAL_UART_TxCpltCallback(UART_HandleTypeDef *huart)
{
    if (huart->Instance == USART2) {
        /* Transmit buffer is now free — signal main loop if needed */
        /* e.g., set a semaphore or clear a "busy" flag             */
    }
}

void HAL_UART_ErrorCallback(UART_HandleTypeDef *huart)
{
    if (huart->Instance == USART2) {
        uint32_t err = huart->ErrorCode;
        if (err & HAL_UART_ERROR_ORE) {
            /* Overrun: increment counter, clear flag, re-arm       */
        }
        /* Always re-arm after error to avoid receive deadlock      */
        HAL_UART_Receive_IT(&huart2, &s_rx_byte, 1);
    }
}
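
The error codes listed in this section are bit flags, so one callback invocation can report several faults at once. Below is a hedged sketch of a per-fault tally that could be called from HAL_UART_ErrorCallback with huart->ErrorCode. UartErrStats_t and uart_err_tally are hypothetical names, and the fallback #defines are only for host builds; they are intended to mirror the values in the F4 HAL header, which takes precedence on target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-build fallbacks; on target the HAL header provides these. */
#ifndef HAL_UART_ERROR_PE
#define HAL_UART_ERROR_PE   0x00000001U   /* parity error  */
#define HAL_UART_ERROR_NE   0x00000002U   /* noise error   */
#define HAL_UART_ERROR_FE   0x00000004U   /* framing error */
#define HAL_UART_ERROR_ORE  0x00000008U   /* overrun error */
#endif

typedef struct {
    uint32_t parity;
    uint32_t noise;
    uint32_t framing;
    uint32_t overrun;
} UartErrStats_t;

/* Increment one counter per fault bit set in error_code. */
void uart_err_tally(UartErrStats_t *st, uint32_t error_code)
{
    assert(st != NULL);
    if (error_code & HAL_UART_ERROR_PE)  { st->parity++;  }
    if (error_code & HAL_UART_ERROR_NE)  { st->noise++;   }
    if (error_code & HAL_UART_ERROR_FE)  { st->framing++; }
    if (error_code & HAL_UART_ERROR_ORE) { st->overrun++; }
}
```

Keeping separate counters makes field diagnosis easier: a rising framing count points at a baud mismatch, while a rising overrun count points at interrupt latency.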

DMA Mode

DMA (Direct Memory Access) mode offloads data movement from the CPU entirely. Once initiated, the DMA controller streams bytes between the UART data register and your buffer in hardware, without CPU intervention. The CPU is free to execute application code until the DMA fires a completion interrupt. For high-throughput or high-baud applications (921600 baud and above), DMA mode is strongly preferable: at 921600 baud a byte arrives roughly every 10.9 µs (about 900 CPU cycles at 84 MHz), so a byte-by-byte receive interrupt that passes through the HAL handler and your callback consumes a large share of the CPU and leaves little margin before overruns occur.
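
To judge whether interrupt mode is viable at a given baud rate, it helps to compute the cycle budget per received byte. The helpers below are a back-of-the-envelope sketch with hypothetical names; the ISR cost you feed in must come from your own measurement (for example with the DWT cycle counter), and the 300-cycle figure mentioned afterwards is purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles available between consecutive bytes on the wire. */
uint32_t uart_cycles_per_byte(uint32_t cpu_hz, uint32_t baud, uint32_t bits_per_frame)
{
    assert(baud != 0U);
    return (uint32_t)(((uint64_t)cpu_hz * bits_per_frame) / baud);
}

/* Share of the CPU (in percent) spent in a per-byte ISR of known cost. */
uint32_t uart_isr_load_percent(uint32_t isr_cycles, uint32_t cycles_per_byte)
{
    assert(cycles_per_byte != 0U);
    return (uint32_t)(((uint64_t)isr_cycles * 100U) / cycles_per_byte);
}
```

At 84 MHz and 921600 baud with 10-bit frames the budget is 911 cycles per byte. If a measured ISR path cost, say, 300 cycles, the sustained load would be about 32% before any application work is done.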

DMA Transmit

HAL_UART_Transmit_DMA(handle, buf, size) programs a DMA stream to read from buf and write to USART2->DR (or TDR on newer devices) on each TXE trigger. The UART peripheral requests a DMA transfer whenever its transmit register is empty. HAL_UART_TxCpltCallback fires when the DMA completes the last transfer.

Circular DMA Receive with Idle-Line Detection

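The most powerful STM32 UART receive mechanism is HAL_UARTEx_ReceiveToIdle_DMA. This function starts a DMA receive and additionally enables the IDLE line interrupt; for continuous reception the RX DMA stream itself must be configured in circular mode (Mode = DMA_CIRCULAR in CubeMX or in the DMA handle). When the UART line goes idle (no new start bit for one frame time after the last byte), HAL fires HAL_UARTEx_RxEventCallback with the number of bytes in the buffer so far. On HAL versions that provide HAL_UARTEx_GetRxEventType, the callback can query it to distinguish HAL_UART_RXEVENT_IDLE from the half-transfer and transfer-complete events. This handles variable-length messages without knowing the message length in advance.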

In circular DMA mode, the DMA controller wraps around the buffer without stopping. You receive both a half-complete callback (HAL_UART_RXEVENT_HT) when the first half of the buffer fills and a full-complete callback (HAL_UART_RXEVENT_TC) when the buffer wraps. Processing both callbacks enables true zero-copy streaming at sustained high throughput.

Cache Coherency

On Cortex-M4 (STM32F4, G4, L4): no L1 data cache exists. DMA buffers do not require cache maintenance — the DMA controller reads/writes main SRAM directly, and the CPU sees the same physical memory. Simply declare your DMA buffer as a global array and it works.

On Cortex-M7 (STM32F7, H7): an L1 data cache exists (its size varies by device), and once it is enabled DMA transfers bypass it, so the CPU may read stale cached data. Before the CPU reads a DMA receive buffer, call SCB_InvalidateDCache_by_Addr(buf, size). Before DMA reads a CPU-written transmit buffer, call SCB_CleanDCache_by_Addr(buf, size). Both operate on whole 32-byte cache lines, so align DMA buffers to 32 bytes and make their size a multiple of 32. Alternatively, place DMA buffers in non-cacheable SRAM regions using the MPU or linker section attributes.
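
Cache maintenance by address works on whole cache lines, so the range passed to the CMSIS functions is effectively rounded outwards to line boundaries. The sketch below computes that rounded range on the host; CacheRange_t and dcache_align_range are hypothetical names, and the 32-byte line size is the Cortex-M7 value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32U   /* Cortex-M7 data cache line size in bytes */

typedef struct {
    uintptr_t start;    /* address rounded down to a cache line boundary */
    size_t    length;   /* byte count rounded up to whole cache lines    */
} CacheRange_t;

/* Expand [addr, addr + size) to the cache lines that cover it. */
CacheRange_t dcache_align_range(uintptr_t addr, size_t size)
{
    assert(size > 0U);
    const uintptr_t mask = (uintptr_t)DCACHE_LINE_SIZE - 1U;
    CacheRange_t r;
    r.start = addr & ~mask;
    uintptr_t end = (addr + size + mask) & ~mask;
    r.length = (size_t)(end - r.start);
    return r;
}
```

A 20-byte buffer at 0x20000010 expands to 64 bytes starting at 0x20000000, which means invalidating it would also discard any unrelated dirty data sharing those two lines. That side effect is the practical reason to give DMA buffers their own 32-byte-aligned, 32-byte-multiple storage rather than widening the range at run time.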

/* ---------------------------------------------------------------
 * Circular DMA receive with IDLE-line detection
 * HAL_UARTEx_ReceiveToIdle_DMA — handles variable-length frames
 * Buffer: 256 bytes circular; process data in RxEventCallback
 * --------------------------------------------------------------- */
#include "main.h"
#include <string.h>

extern UART_HandleTypeDef huart2;
extern DMA_HandleTypeDef  hdma_usart2_rx;

#define RX_DMA_BUF_SIZE  256U

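/* DMA buffer: must live in DMA-accessible SRAM (on F4 parts with CCM RAM,
 * the DMA controllers cannot reach CCM) */
static uint8_t  s_rx_dma_buf[RX_DMA_BUF_SIZE];
static uint16_t s_rx_prev_pos = 0;   /* tracks last processed index */

/* Forward declaration: defined at the end of this listing */
void UART_ProcessBytes(const uint8_t *data, uint16_t len);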

/* Call once to start the DMA receive loop */
void UART_DMA_StartReceive(void)
{
    s_rx_prev_pos = 0;
    HAL_UARTEx_ReceiveToIdle_DMA(&huart2, s_rx_dma_buf, RX_DMA_BUF_SIZE);
    /* Disable the half-complete DMA interrupt if you only want IDLE events:
     * __HAL_DMA_DISABLE_IT(&hdma_usart2_rx, DMA_IT_HT);
     * For streaming double-buffer processing, leave HT enabled.    */
}

/* HAL calls this for IDLE, HT (half-transfer), and TC (full transfer) events */
void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size)
{
    if (huart->Instance == USART2) {
        /* 'Size' is the total number of bytes received since DMA started
         * (or since last wrap), NOT the incremental count since last callback */
        uint16_t new_pos = Size;
        uint16_t num_bytes;

        if (new_pos >= s_rx_prev_pos) {
            num_bytes = new_pos - s_rx_prev_pos;
            /* Process s_rx_dma_buf[s_rx_prev_pos .. new_pos-1] */
            UART_ProcessBytes(&s_rx_dma_buf[s_rx_prev_pos], num_bytes);
        } else {
            /* Buffer wrapped around */
            num_bytes = RX_DMA_BUF_SIZE - s_rx_prev_pos;
            UART_ProcessBytes(&s_rx_dma_buf[s_rx_prev_pos], num_bytes);
            if (new_pos > 0) {
                UART_ProcessBytes(&s_rx_dma_buf[0], new_pos);
            }
        }
        s_rx_prev_pos = new_pos % RX_DMA_BUF_SIZE;
    }
}

/* Stub — replace with ring buffer push or frame decoder */
void UART_ProcessBytes(const uint8_t *data, uint16_t len)
{
    (void)data; (void)len;
    /* e.g., RingBuf_PushN(&g_rx_ringbuf, data, len); */
}

Printf Retargeting

The C standard library printf function ultimately calls a low-level output function that depends on your toolchain. Retargeting means overriding this low-level hook so that printf output flows to your UART instead of nowhere (the default in bare-metal firmware). Once retargeted, you gain formatted output — printf("ADC = %d mV\r\n", value) — with zero additional code changes.

GCC: Override _write()

The GCC newlib C runtime calls _write(int fd, char *buf, int len) for all stdout/stderr output. CubeIDE generates a stub in syscalls.c. Override it (or add a separate retarget.c) with your UART transmit call:

/* ---------------------------------------------------------------
 * retarget.c — GCC newlib printf retarget for STM32
 * Place this file in your project's Src/ directory.
 * Disable buffering with setvbuf() in main() after HAL_Init().
 * --------------------------------------------------------------- */
#include <sys/stat.h>
#include <errno.h>
#include "main.h"

extern UART_HandleTypeDef huart2;

/* Override newlib's low-level write syscall.
 * fd 1 = stdout, fd 2 = stderr; we send both to UART.           */
int _write(int fd, char *buf, int len)
{
    (void)fd;
    /* Use polling transmit to avoid re-entrancy issues in printf.
     * Timeout: 100 ms covers about 1150 bytes at 115200 baud but only
     * about 96 bytes at 9600 baud; scale it for slow links.          */
    if (HAL_UART_Transmit(&huart2,
                           (uint8_t*)buf,
                           (uint16_t)len,
                           100) == HAL_OK) {
        return len;
    }
    errno = EIO;
    return -1;
}

/* In main(), immediately after HAL_Init():
 *   setvbuf(stdout, NULL, _IONBF, 0);   // disable stdout buffering
 *   setvbuf(stderr, NULL, _IONBF, 0);   // disable stderr buffering
 * Without this, printf output may be held in a 1 KB stdio buffer
 * and never flushed unless you print a newline or call fflush(). */

Keil MDK and IAR EWARM Methods

Keil MDK (MicroLib or standard library): Override int fputc(int ch, FILE *f) in any .c file. MicroLib has no stdio buffering and calls fputc directly for each character. With the standard Keil library, buffering is enabled by default; add setvbuf(stdout, NULL, _IONBF, 0) in main.

IAR EWARM: Override size_t __write(int handle, const unsigned char *buf, size_t bufSize). Located in write.c in the DLIB library support directory. Same setvbuf requirement applies.

ITM/SWO Alternative

The Cortex-M3/4/7 Instrumentation Trace Macrocell (ITM) allows printf output via the Serial Wire Output (SWO) debug pin without using any UART. In CubeIDE, enable ITM trace in the debug configuration and override _write to call ITM_SendChar. Advantage: UART pins remain free; SWO output appears in the CubeIDE SWV console. Disadvantage: requires a debug probe and does not work on production hardware without SWO access.

/* ---------------------------------------------------------------
 * ITM/SWO retarget — printf via SWO debug pin (no UART required)
 * Enable SWO in CubeIDE: Run → Debug Configurations →
 *   Debugger → Serial Wire Viewer → Enable, set core clock
 * --------------------------------------------------------------- */
#include "main.h"   /* pulls in the device header, which in turn includes core_cm4.h */

/* Override _write to send to ITM channel 0 */
int _write(int fd, char *buf, int len)
{
    (void)fd;
    for (int i = 0; i < len; i++) {
        ITM_SendChar((uint32_t)buf[i]);
    }
    return len;
}

/* ITM_SendChar from CMSIS core_cm4.h:
 * Checks ITM->TER (Trace Enable Register) and ITM->PORT[n].u8
 * Polls until the port FIFO is empty, then writes the character.
 * At 2 MHz SWO clock, this can sustain ~200 kbaud effective output. */

Ring Buffer Implementation

A ring buffer (circular buffer) is the standard data structure for bridging the UART receive interrupt (producer) and the main loop (consumer). The interrupt pushes bytes as fast as they arrive; the main loop pops and processes them at its own pace. The key property of a lock-free ring buffer is that no mutex is required for single-producer / single-consumer: the head index is written only by the producer and the tail index is written only by the consumer, and on Cortex-M the index reads/writes are atomic for naturally-aligned 16-bit or 32-bit values.

Two implementation rules ensure correctness:

  • Power-of-2 buffer size: replace modulo division with bitwise AND, idx & (SIZE - 1). This is faster and avoids the integer division that a modulo needs when the size is not 2^n (a library call on cores without a hardware divider, such as Cortex-M0).
  • Overflow policy — when the buffer is full, either drop the newest byte (producer backs off — simplest), drop the oldest byte (consumer appears to advance — useful for streaming), or assert/signal an error. Choose based on whether losing the newest or oldest data is more acceptable for your protocol.
/* ---------------------------------------------------------------
 * Lock-free single-producer / single-consumer ring buffer
 * Size MUST be a power of 2 (64, 128, 256, 512...)
 * head: written by producer (ISR); read by consumer (main)
 * tail: written by consumer (main); read by producer (ISR)
 * --------------------------------------------------------------- */
#include <stdint.h>
#include <string.h>

#define RINGBUF_SIZE  256U    /* must be power of 2                  */
#define RINGBUF_MASK  (RINGBUF_SIZE - 1U)

typedef struct {
    volatile uint16_t head;         /* next write index (ISR writes) */
    volatile uint16_t tail;         /* next read  index (main writes)*/
    uint8_t           buf[RINGBUF_SIZE];
} RingBuf_t;

/* Declare in a shared header; define in uart_ringbuf.c             */
RingBuf_t g_rx_ringbuf = { 0, 0, {0} };

/* Push one byte — called from ISR (UART RxCpltCallback or DMA)    */
/* Returns 1 on success, 0 if buffer full (byte dropped)           */
uint8_t RingBuf_Push(RingBuf_t *rb, uint8_t byte)
{
    uint16_t next_head = (rb->head + 1U) & RINGBUF_MASK;
    if (next_head == rb->tail) {
        return 0;   /* full — drop newest byte (or increment ORE counter) */
    }
    rb->buf[rb->head] = byte;
    rb->head = next_head;   /* publish after writing data            */
    return 1;
}

/* Pop one byte — called from main loop                             */
/* Returns 1 on success, 0 if buffer empty                         */
uint8_t RingBuf_Pop(RingBuf_t *rb, uint8_t *byte)
{
    if (rb->tail == rb->head) {
        return 0;   /* empty                                          */
    }
    *byte    = rb->buf[rb->tail];
    rb->tail = (rb->tail + 1U) & RINGBUF_MASK;   /* advance after read */
    return 1;
}

/* Returns number of bytes available to read */
uint16_t RingBuf_Available(const RingBuf_t *rb)
{
    return (uint16_t)((rb->head - rb->tail) & RINGBUF_MASK);
}

/* Peek at next byte without consuming it */
uint8_t RingBuf_Peek(const RingBuf_t *rb, uint8_t *byte)
{
    if (rb->tail == rb->head) return 0;
    *byte = rb->buf[rb->tail];
    return 1;
}
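Volatile Ordering: The volatile qualifier on head and tail prevents the compiler from caching them in registers, ensuring the ISR's write to head is visible to the main loop on the next read. Because the ISR and the main loop run on the same core, each always observes the other's stores in program order, so no hardware barrier is needed on Cortex-M4 for this single-core use. Note that volatile does not stop the compiler from reordering the non-volatile buf[] access relative to the index update; a __DMB() between the data access and the index update (the CMSIS intrinsic also acts as a compiler barrier) guarantees the order. On Cortex-M7, or whenever another bus master such as a DMA controller or a second core observes the indices, add the barrier as a matter of course.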
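
A bulk push is useful when a DMA callback hands over a whole block of bytes. The sketch below is one possible shape for such a function, written against a self-contained copy of the ring buffer so it can be tested on a host; Rb_t, Rb_PushN, and Rb_Available are hypothetical names chosen to avoid clashing with the listing above. It keeps the drop-newest overflow policy and publishes head once, after all data is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 256U
#define RB_MASK (RB_SIZE - 1U)
_Static_assert((RB_SIZE & RB_MASK) == 0U, "RB_SIZE must be a power of 2");

typedef struct {
    volatile uint16_t head;          /* written by producer only */
    volatile uint16_t tail;          /* written by consumer only */
    uint8_t           buf[RB_SIZE];
} Rb_t;

/* Push up to len bytes; returns how many were accepted.
 * One slot is always left empty, so capacity is RB_SIZE - 1 bytes. */
uint16_t Rb_PushN(Rb_t *rb, const uint8_t *data, uint16_t len)
{
    assert(rb != NULL && data != NULL);
    uint16_t       head = rb->head;
    const uint16_t tail = rb->tail;  /* snapshot; consumer can only free more space */
    uint16_t free_slots = (uint16_t)((tail - head - 1U) & RB_MASK);

    if (len > free_slots) {
        len = free_slots;            /* drop the newest bytes that do not fit */
    }
    for (uint16_t i = 0; i < len; i++) {
        rb->buf[head] = data[i];
        head = (uint16_t)((head + 1U) & RB_MASK);
    }
    rb->head = head;                 /* publish after all data is in place */
    return len;
}

/* Number of bytes waiting to be read. */
uint16_t Rb_Available(const Rb_t *rb)
{
    return (uint16_t)((rb->head - rb->tail) & RB_MASK);
}
```

Pushing 300 bytes into an empty 256-slot buffer accepts 255 and reports the shortfall through the return value, so the caller can count dropped bytes.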

UART Command-Line Interface

A UART CLI transforms a debug serial port into a powerful interactive console. The architecture is: the ring buffer fills from the UART interrupt; the main loop pops bytes and builds a line buffer; when a newline arrives, the line is tokenised by spaces; the first token is looked up in a command table; the matched handler function is called with the remaining tokens as arguments.

A well-designed CLI command table is a static array of structs — no dynamic allocation, predictable memory layout, easy to extend at compile time. Each entry holds the command name string, a function pointer, and a help text string. The dispatcher iterates the table with a simple strcmp loop — at 20 commands this costs microseconds, not milliseconds.

/* ---------------------------------------------------------------
 * UART CLI — command table with 4 built-in commands
 * Supports: backspace editing, echo, line accumulation from ring buf
 * --------------------------------------------------------------- */
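#include "main.h"
#include "uart_ringbuf.h"   /* assumed header name: declares RingBuf_t and RingBuf_Pop (Section 6) */
#include <string.h>
#include <stdio.h>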

extern RingBuf_t g_rx_ringbuf;
extern UART_HandleTypeDef huart2;

#define CLI_MAX_LINE   80U
#define CLI_MAX_ARGS    8U

/* ---- Command handler declarations ----------------------------- */
static void cmd_help(int argc, char *argv[]);
static void cmd_led(int argc, char *argv[]);
static void cmd_reset(int argc, char *argv[]);
static void cmd_version(int argc, char *argv[]);

/* ---- Command table -------------------------------------------- */
typedef struct {
    const char *name;
    void (*handler)(int argc, char *argv[]);
    const char *help;
} CliCmd_t;

static const CliCmd_t s_cli_table[] = {
    { "help",    cmd_help,    "List all commands"              },
    { "led",     cmd_led,     "led [on|off|toggle]"            },
    { "reset",   cmd_reset,   "Perform a system software reset"},
    { "version", cmd_version, "Print firmware version string"  },
};
#define CLI_NUM_CMDS  (sizeof(s_cli_table) / sizeof(s_cli_table[0]))

/* ---- Line accumulator state ----------------------------------- */
static char     s_line[CLI_MAX_LINE + 1];
static uint16_t s_line_len = 0;

/* ---- CLI output helper ---------------------------------------- */
static void cli_print(const char *str)
{
    HAL_UART_Transmit(&huart2, (uint8_t*)str, (uint16_t)strlen(str), 200);
}

/* ---- Dispatcher ----------------------------------------------- */
static void CLI_Dispatch(char *line)
{
    char *argv[CLI_MAX_ARGS];
    int   argc = 0;
    char *tok  = strtok(line, " \t");
    while (tok && argc < CLI_MAX_ARGS) {
        argv[argc++] = tok;
        tok = strtok(NULL, " \t");
    }
    if (argc == 0) return;

    for (size_t i = 0; i < CLI_NUM_CMDS; i++) {
        if (strcmp(argv[0], s_cli_table[i].name) == 0) {
            s_cli_table[i].handler(argc, argv);
            return;
        }
    }
    cli_print("Unknown command. Type 'help'.\r\n");
}

/* ---- Main loop CLI tick — call every iteration ---------------- */
void CLI_Process(void)
{
    uint8_t byte;
    while (RingBuf_Pop(&g_rx_ringbuf, &byte)) {
        if (byte == '\r' || byte == '\n') {
            if (s_line_len > 0) {
                cli_print("\r\n");
                s_line[s_line_len] = '\0';
                CLI_Dispatch(s_line);
                s_line_len = 0;
            }
            cli_print("> ");   /* prompt */
        } else if (byte == 0x7F || byte == '\b') {
            /* Backspace: remove last character with destructive erase */
            if (s_line_len > 0) {
                s_line_len--;
                cli_print("\b \b");
            }
        } else if (s_line_len < CLI_MAX_LINE) {
            s_line[s_line_len++] = (char)byte;
            HAL_UART_Transmit(&huart2, &byte, 1, 10);  /* local echo */
        }
    }
}

/* ---- Command implementations ---------------------------------- */
static void cmd_help(int argc, char *argv[])
{
    (void)argc; (void)argv;
    for (size_t i = 0; i < CLI_NUM_CMDS; i++) {
        char line[60];
        snprintf(line, sizeof(line), "  %-12s %s\r\n",
                 s_cli_table[i].name, s_cli_table[i].help);
        cli_print(line);
    }
}

static void cmd_led(int argc, char *argv[])
{
    if (argc < 2) { cli_print("Usage: led [on|off|toggle]\r\n"); return; }
    if      (strcmp(argv[1], "on")     == 0) HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_SET);
    else if (strcmp(argv[1], "off")    == 0) HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_RESET);
    else if (strcmp(argv[1], "toggle") == 0) HAL_GPIO_TogglePin(GPIOA, GPIO_PIN_5);
    else cli_print("Unknown LED argument\r\n");
}

static void cmd_reset(int argc, char *argv[])
{
    (void)argc; (void)argv;
    cli_print("Resetting...\r\n");
    HAL_Delay(50);
    NVIC_SystemReset();
}

static void cmd_version(int argc, char *argv[])
{
    (void)argc; (void)argv;
    cli_print("Firmware: v1.0.0 built " __DATE__ " " __TIME__ "\r\n");
}
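
The dispatcher above relies on strtok, which keeps hidden static state and is therefore not reentrant. If the CLI may ever be called from more than one context, an in-place tokeniser without shared state is a small alternative. This is a sketch only; cli_tokenize is a hypothetical name, and it splits on spaces and tabs exactly like the strtok call it would replace.

```c
#include <assert.h>
#include <stddef.h>

/* Split `line` in place on spaces and tabs.
 * Stores up to max_args token pointers in argv and returns the token count.
 * Separators are overwritten with '\0', so `line` must be writable. */
int cli_tokenize(char *line, char *argv[], int max_args)
{
    assert(line != NULL && argv != NULL && max_args > 0);
    int   argc = 0;
    char *p    = line;

    while (*p != '\0' && argc < max_args) {
        while (*p == ' ' || *p == '\t') {
            p++;                          /* skip leading separators */
        }
        if (*p == '\0') {
            break;                        /* only whitespace remained */
        }
        argv[argc++] = p;                 /* token starts here */
        while (*p != '\0' && *p != ' ' && *p != '\t') {
            p++;                          /* scan to end of token */
        }
        if (*p != '\0') {
            *p++ = '\0';                  /* terminate token, move on */
        }
    }
    return argc;
}
```

Because it is a pure function of its arguments, this tokeniser can also be unit-tested on a host PC without any UART or HAL dependency.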

Exercises

Exercise 1 Beginner

Polling UART with Printf Retarget

Configure USART2 at 115200 baud via CubeMX on a Nucleo-F401RE. Add a retarget.c file with the _write() override shown in Section 5. Call setvbuf(stdout, NULL, _IONBF, 0) immediately after HAL_Init() in main. Send a startup message using printf and confirm it appears in PuTTY or minicom. Then implement an echo loop: receive one byte with HAL_UART_Receive (5-second timeout), print the received character and its ASCII value using printf("Received: '%c' (0x%02X)\r\n", ch, ch), and loop. Test with both printable characters and control characters (Tab, Escape).

UART Printf CubeMX Polling
Exercise 2 Intermediate

Interrupt-Driven UART with Ring Buffer Stress Test

Implement the interrupt-driven single-byte receive with re-arming (Section 3) and the ring buffer from Section 6. Configure USART2 at 921600 baud. Write a Python script that sends exactly 1,000 bytes (0x00–0xFF repeated) over the serial port as fast as possible. The STM32 should receive all bytes via interrupt into the ring buffer and echo every byte back. Verify on the Python side that all 1,000 echoed bytes match the sent sequence with no drops. Count overrun errors by incrementing a counter in HAL_UART_ErrorCallback whenever huart->ErrorCode has HAL_UART_ERROR_ORE set; the counter must be zero for a passing result. (Reading USART2->SR & USART_SR_ORE after the test is not sufficient, because the HAL interrupt handler clears the flag as it services the error.) Increase baud to 1,843,200 and document the maximum reliable rate for your processor speed.

UART IT Ring Buffer Python Overrun
Exercise 3 Advanced

CLI with DMA ADC Streaming

Integrate the CLI from Section 7 with DMA circular UART receive (Section 4). Add a log command: log start begins streaming ADC1 Channel 0 readings over UART at 1 kHz using a TIM2-triggered ADC in DMA mode; log stop terminates the stream. Format each sample as a 16-bit hex value followed by CRLF, e.g. 0A3F\r\n. Run the log for 10 seconds and check the USART SR ORE flag counter after stopping. Confirm it remains zero. Optionally, pipe the UART output to a Python script that plots the ADC waveform in real time using matplotlib to verify the 1 kHz sample rate.

CLI DMA ADC Streaming

UART Design Tool

Use this tool to document your STM32 UART configuration — peripheral instance, baud rate, transfer mode, buffer sizes, and protocol design. Download as Word, Excel, PDF, or PPTX for design review or project documentation.

Conclusion & Next Steps

In this article we have built a complete UART toolkit covering every layer from peripheral fundamentals to application-level infrastructure:

  • USART vs UART vs LPUART — knowing which instance is on which APB bus and calculating the actual baud rate error from the BRR register prevents clock-domain misconfigurations that silently corrupt data.
  • Polling mode is the right choice for startup banners and simple debug output but must never be used in ISR context or with HAL_MAX_DELAY when the remote may not respond.
  • Interrupt mode with the single-byte ping-pong receive pattern and a ring buffer handles the vast majority of embedded UART use cases efficiently without DMA complexity.
  • Circular DMA with idle-line detection via HAL_UARTEx_ReceiveToIdle_DMA is the professional solution for high-baud or streaming receive, with no per-byte CPU load.
  • Printf retargeting via _write() gives you formatted output over UART through a single small hook function (the printf implementation itself still adds code size).
  • The lock-free ring buffer is a production-quality data structure that bridges ISR producers and main-loop consumers without any mutex overhead.
  • A command-line interface built on the ring buffer and a static command table is one of the most practical embedded debug tools available: interactive, extensible, and cheap to maintain.

Next in the Series

In Part 4: Timers, PWM & Input Capture, we'll deep-dive into the STM32 timer architecture — TIM peripherals, prescaler and ARR configuration, PWM generation for motor and LED control, input capture for frequency and pulse-width measurement, and encoder interface mode for quadrature signal decoding.
