Series Overview: This is Part 1 of our 18-part STM32 Unleashed series. We journey from architectural fundamentals through professional HAL driver development — covering GPIO, UART, timers, ADC, SPI, I2C, DMA, interrupts, low-power modes, FreeRTOS, bootloaders, and production readiness.
1. Architecture & CubeMX Setup: STM32 family, clock tree, HAL vs LL, CubeMX workflow, first project (You Are Here)
2. GPIO & Button Debounce: GPIO modes, pull-up/down, EXTI, software debounce, HAL_GPIO_ReadPin
3. UART Communication: Polling, interrupt, DMA modes, printf retargeting, ring buffers
4. Timers, PWM & Input Capture: TIM basics, PWM generation, input capture, encoder mode
5. ADC & DAC: Single/continuous conversion, DMA, injected channels, DAC waveforms
6. SPI Protocol: SPI master/slave, full-duplex, DMA transfers, sensor drivers
7. I2C Protocol: I2C master, 7/10-bit addressing, DMA, multi-master, error handling
8. DMA & Memory Efficiency: DMA streams, circular mode, memory-to-memory, zero-copy patterns
9. Interrupt Management & NVIC: Priority grouping, preemption, ISR design, HAL callbacks, latency
10. Low-Power Modes: Sleep, Stop, Standby modes, RTC wakeup, LP UART, power profiling
11. RTC & Calendar: RTC configuration, alarms, backup registers, calendar subseconds
12. CAN Bus: FDCAN/bxCAN, filters, message frames, error handling, automotive use
13. USB CDC Virtual COM Port: USB FS/HS, CDC class, virtual serial, control transfers, descriptors
14. FreeRTOS Integration: Tasks, queues, semaphores, mutexes, CMSIS-RTOS2 wrapper, stack sizing
15. Bootloader Development: Custom IAP bootloader, UART/USB DFU, flash programming, jump-to-app
16. External Storage: SD & QSPI Flash: FATFS on SD card, QSPI NOR flash, memory-mapped execution, wear levelling
17. Ethernet & TCP/IP Stack: LwIP integration, DHCP, TCP server, HTTP, MQTT, Ethernet DMA descriptors
18. Production Readiness: Watchdog, HardFault handler, flash option bytes, code signing, CI/CD
The STM32 Family
STMicroelectronics' STM32 portfolio is one of the most widely deployed families of 32-bit microcontrollers on the planet. From the ultra-low-power STM32L0 running at 32 MHz to the STM32H7, which reaches 550 MHz on its fastest single-core Cortex-M7 parts and also comes in dual-core Cortex-M7/M4 variants, there is an STM32 for virtually every embedded application. The common thread, and the key to your productivity, is the STM32Cube ecosystem: a unified HAL, middleware stack, code generation tool (CubeMX), and IDE (CubeIDE) that span the entire family.
Before writing any code, you need to understand what you're actually targeting. The STM32 is not a single chip — it is a family architecture, and the decisions you make when selecting a product line and configuring its clocks will affect every peripheral driver you write for the rest of the project.
Product Lines & Naming
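STM32 part numbers follow a consistent pattern: STM32[family][sub-family][pin count][flash size][package][temperature range]. For example, STM32F407VGT6 is an F-type part from the 407 line with 100 pins (V), 1 MB of Flash (G), an LQFP package (T), and the −40 to +85 °C range (6). Understanding this nomenclature lets you decode any STM32 datasheet immediately.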
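The decoding can be sketched in a few lines of C. This is a minimal illustration, not an ST tool: the names `stm32_part_t`, `stm32_pins_from_code`, `stm32_flash_kb_from_code`, and `stm32_decode` are hypothetical, the lookup tables cover only common codes, and the parser assumes the classic layout of one family letter followed by three line digits (as in STM32F407VGT6). Newer part numbers with mixed letters and digits in the line field would need a more careful parser, so treat the datasheet ordering table as the authority.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for illustration only. */
typedef struct {
    char     family;    /* e.g. 'F' in STM32F407VGT6 */
    char     line[4];   /* e.g. "407" */
    unsigned pins;      /* 0 if the code is not in the table below */
    unsigned flash_kb;  /* 0 if the code is not in the table below */
    char     package;   /* e.g. 'T' for LQFP */
} stm32_part_t;

/* Pin-count letter to number of pins (partial table). */
unsigned stm32_pins_from_code(char c)
{
    switch (c) {
    case 'F': return 20;
    case 'G': return 28;
    case 'K': return 32;
    case 'C': return 48;
    case 'R': return 64;
    case 'V': return 100;
    case 'Z': return 144;
    case 'I': return 176;
    default:  return 0;
    }
}

/* Flash-size code to kilobytes (partial table). */
unsigned stm32_flash_kb_from_code(char c)
{
    switch (c) {
    case '4': return 16;
    case '6': return 32;
    case '8': return 64;
    case 'B': return 128;
    case 'C': return 256;
    case 'E': return 512;
    case 'G': return 1024;
    case 'I': return 2048;
    default:  return 0;
    }
}

/* Returns 0 on success, -1 if the string does not match the assumed layout. */
int stm32_decode(const char *pn, stm32_part_t *out)
{
    if (pn == NULL || out == NULL) return -1;
    if (strlen(pn) < 12 || strncmp(pn, "STM32", 5) != 0) return -1;

    out->family = pn[5];
    memcpy(out->line, &pn[6], 3);   /* three line characters */
    out->line[3] = '\0';
    out->pins     = stm32_pins_from_code(pn[9]);
    out->flash_kb = stm32_flash_kb_from_code(pn[10]);
    out->package  = pn[11];
    return 0;
}
```

Calling `stm32_decode("STM32F401RET6", &part)` under these assumptions would report 64 pins and 512 KB of Flash, which matches the MCU on the Nucleo-F401RE board.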
| Family | Core | Max Freq | Key Strength | Typical Use Cases |
| --- | --- | --- | --- | --- |
| F0 | Cortex-M0 | 48 MHz | Ultra-low cost | Simple I/O, LED control, keyboard |
| F1 | Cortex-M3 | 72 MHz | Widely available, legacy | Blue Pill boards, hobbyist projects |
| F3 | Cortex-M4F | 72 MHz | FPU + motor control timers | Motor drives, power conversion |
| F4 | Cortex-M4F | 180 MHz | Performance + peripherals | Audio, imaging, communications hub |
| F7 | Cortex-M7 | 216 MHz | High performance, L1 cache | HMI, video, high-speed DSP |
| H7 | Cortex-M7 (+M4) | 550 MHz | Flagship performance | Industrial control, AI inference |
| G0 | Cortex-M0+ | 64 MHz | Value + efficiency | Consumer electronics, SMPS |
| G4 | Cortex-M4F | 170 MHz | Motor + mixed signal | BLDC drives, inverters, UPS |
| L0/L1 | Cortex-M0+/M3 | 32 MHz | Ultra-low power | Battery IoT nodes, meters, tags |
| L4/L5 | Cortex-M4F/M33 | 120 MHz | Low power + performance | Wearables, portable instruments |
| U5 | Cortex-M33 | 160 MHz | PSA Level 3 security | Connected IoT, payment terminals |
| WB/WL | Cortex-M4+M0+ | 64 MHz | Integrated radio: BLE/802.15.4 (WB), sub-GHz (WL) | Wireless sensors, Thread, Zigbee |
Selection Rule of Thumb: Start with the STM32F4 (specifically the F401 or F411) for learning — it's well-documented, affordable, has an FPU, runs fast enough for any tutorial project, and its HAL patterns transfer directly to every other STM32 family. Graduate to the G4, H7, or U5 when your application demands it.
Cortex-M Core Selection
The Cortex-M core inside your STM32 determines which instructions you can use, whether you have hardware floating-point, and how the memory protection unit (MPU) behaves. This is not academic — it affects your linker script, compiler flags, and runtime library selection.
| Core | ISA | FPU | DSP SIMD | TrustZone | GCC -mcpu flag |
| --- | --- | --- | --- | --- | --- |
| Cortex-M0/M0+ | ARMv6-M | No | No | No | cortex-m0 / cortex-m0plus |
| Cortex-M3 | ARMv7-M | No | No | No | cortex-m3 |
| Cortex-M4 | ARMv7E-M | Optional | Yes | No | cortex-m4 -mfpu=fpv4-sp-d16 |
| Cortex-M7 | ARMv7E-M | SP or DP FPU (part-dependent) | Yes | No | cortex-m7 -mfpu=fpv5-d16 (DP) or -mfpu=fpv5-sp-d16 (SP) |
| Cortex-M33 | ARMv8-M Main | Optional | Yes | Yes | cortex-m33 -mfpu=fpv5-sp-d16 |
Memory Layout & Bus Architecture
Every STM32 shares the ARM Cortex-M fixed memory map, with vendor-specific peripheral placement on top. Understanding this map is critical for writing linker scripts and diagnosing address faults.
/* STM32F407 Representative Memory Map */
/* Code Region */
0x00000000 - 0x1FFFFFFF /* Code (aliased to Flash or SRAM) */
0x08000000 - 0x080FFFFF /* Flash memory (1 MB on F407VG) */
0x1FFF0000 - 0x1FFF77FF /* System memory (ST bootloader ROM) */
0x1FFF7800 - 0x1FFF7A0F /* OTP (One-Time Programmable) area */
/* SRAM Region */
0x20000000 - 0x2001BFFF /* SRAM1 (112 KB) */
0x2001C000 - 0x2001FFFF /* SRAM2 (16 KB) */
0x10000000 - 0x1000FFFF /* CCM data RAM (64 KB, CPU-only, no DMA) */
/* Peripheral Region */
0x40000000 - 0x40007FFF /* APB1 peripherals (USART2, SPI2, I2C1...) */
0x40010000 - 0x40014BFF /* APB2 peripherals (USART1, SPI1, ADC...) */
0x40020000 - 0x4007FFFF /* AHB1 peripherals (GPIO, CRC, RCC, DMA, Ethernet, USB OTG HS) */
0x50000000 - 0x50060BFF /* AHB2 peripherals (USB OTG FS, DCMI, RNG) */
0x60000000 - 0xDFFFFFFF /* External memory region (FSMC on F407: SRAM, NOR, NAND; the FMC on F42x/F43x adds SDRAM) */
/* System Region */
0xE0000000 - 0xE00FFFFF /* Cortex-M system (NVIC, SysTick, DWT, ITM) */
CCM RAM Warning: The Core Coupled Memory (CCM) on the STM32F4 is connected directly to the CPU data bus, not to the AHB bus matrix, so the DMA controllers cannot reach it. Placing DMA buffers in CCM is a common source of silent data corruption. Always put DMA buffers in SRAM1 or SRAM2.
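One inexpensive guard is to check a buffer's address range before handing it to the DMA. The sketch below uses the STM32F407 address ranges from the memory map above; the function and macro names are hypothetical, and it only considers RAM buffers (DMA can also read Flash and peripheral registers, which this check deliberately ignores).

```c
#include <stdbool.h>
#include <stdint.h>

/* Address ranges taken from the STM32F407 memory map above. */
#define F407_SRAM_BASE 0x20000000u  /* start of SRAM1 */
#define F407_SRAM_END  0x2001FFFFu  /* end of SRAM2 (SRAM1 and SRAM2 are contiguous) */
#define F407_CCM_BASE  0x10000000u
#define F407_CCM_END   0x1000FFFFu

/* Hypothetical helper: true if [addr, addr + len) lies entirely in SRAM1/SRAM2. */
bool f407_ram_buffer_is_dma_safe(uint32_t addr, uint32_t len)
{
    if (len == 0u) return false;
    if (addr < F407_SRAM_BASE || addr > F407_SRAM_END) return false;

    /* Bytes available from addr to the end of SRAM, computed without overflow. */
    uint32_t room = F407_SRAM_END - addr + 1u;
    return len <= room;
}

/* Hypothetical helper: true if the address falls inside CCM RAM. */
bool f407_addr_is_ccm(uint32_t addr)
{
    return addr >= F407_CCM_BASE && addr <= F407_CCM_END;
}
```

On target, a debug build might call `f407_ram_buffer_is_dma_safe((uint32_t)(uintptr_t)buf, sizeof buf)` inside an assertion just before starting a transfer, so a misplaced buffer fails loudly instead of corrupting data.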
The AHB (Advanced High-performance Bus) matrix is the backbone of the STM32 interconnect. Multiple bus masters — the CPU, DMA1, DMA2, and Ethernet (on F4) — connect to bus slaves (Flash, SRAM, APB bridges) through this matrix. Understanding which master can access which slave at what bandwidth is essential for high-throughput DMA design.
Clock System Deep Dive
The clock system is the heart of any STM32 project. Every peripheral — UART baud rate, SPI bit rate, timer tick, ADC sampling frequency — derives its clock from a source you configure. Getting the clock tree wrong produces subtle bugs: baud rates off by exactly 2×, timers ticking too fast, ADC conversions taking longer than expected.
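The "off by exactly 2×" symptom is easy to reproduce on paper. The sketch below assumes an STM32F4-style USART with 16x oversampling, where the BRR register value is the peripheral clock divided by the baud rate; the function names are hypothetical, and the code runs on a host PC rather than on the target.

```c
#include <assert.h>
#include <stdint.h>

/* BRR value for 16x oversampling: f_ck / baud, rounded to nearest. */
uint32_t usart_brr_over16(uint32_t fck_hz, uint32_t baud)
{
    assert(baud != 0u);
    return (fck_hz + baud / 2u) / baud;
}

/* Baud rate the hardware actually produces for a given BRR and real clock. */
uint32_t usart_actual_baud(uint32_t real_fck_hz, uint32_t brr)
{
    assert(brr != 0u);
    return real_fck_hz / brr;
}
```

If firmware computes the divider for an assumed 84 MHz peripheral clock (BRR = 729 for 115200 baud) but the bus really runs at 42 MHz, `usart_actual_baud(42000000, 729)` gives about 57.6 kbaud: exactly half the intended rate, and garbage on the terminal.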
HSI, HSE, LSI, LSE
STM32 devices offer multiple clock sources, each with different accuracy, power consumption, and startup time characteristics:
| Source | Typical Freq | Accuracy | Power | Startup | Typical Use |
| --- | --- | --- | --- | --- | --- |
| HSI | 8–64 MHz (family) | ±1–2% | Low | ~2 µs | Default boot source, no crystal needed |
| HSE | 4–26 MHz (crystal) | ±20–50 ppm | Medium | ~2 ms | PLL source for max system clock accuracy |
| LSI | 32 kHz (nominal) | ±30–50% | Very low | ~40 µs | IWDG, rough RTC (unreliable for timekeeping) |
| LSE | 32.768 kHz (crystal) | ±20 ppm | Very low | ~200 ms | RTC, calendar, low-power wakeup |
PLL Configuration
The Phase-Locked Loop multiplies a low-frequency input (HSI or HSE) to produce the high-frequency SYSCLK. On the STM32F4, the PLL is configured with three parameters: M (input divider), N (multiplier), and P (output divider for SYSCLK). Two additional outputs, Q (USB/SDIO/RNG) and R (some families), add flexibility.
The golden rule: the PLL VCO (voltage-controlled oscillator) must run between 100 MHz and 432 MHz on the F4. Work backwards from your target SYSCLK:
/* PLL calculation for STM32F4, HSE = 8 MHz, target SYSCLK = 168 MHz */
/*
* VCO_in = HSE / M = 8 MHz / 8 = 1 MHz (must be 1–2 MHz)
* VCO_out = VCO_in * N = 1 MHz * 336 = 336 MHz (must be 100–432 MHz)
* SYSCLK = VCO_out / P = 336 MHz / 2 = 168 MHz
* USB/SDIO = VCO_out / Q = 336 MHz / 7 = 48 MHz (exact 48 MHz required!)
*/
/* 168 MHz needs voltage scale 1; enable the PWR clock before selecting it */
__HAL_RCC_PWR_CLK_ENABLE();
__HAL_PWR_VOLTAGESCALING_CONFIG(PWR_REGULATOR_VOLTAGE_SCALE1);
RCC_OscInitTypeDef osc = {0};
osc.OscillatorType = RCC_OSCILLATORTYPE_HSE;
osc.HSEState = RCC_HSE_ON;
osc.PLL.PLLState = RCC_PLL_ON;
osc.PLL.PLLSource = RCC_PLLSOURCE_HSE;
osc.PLL.PLLM = 8;
osc.PLL.PLLN = 336;
osc.PLL.PLLP = RCC_PLLP_DIV2;
osc.PLL.PLLQ = 7;
if (HAL_RCC_OscConfig(&osc) != HAL_OK) {
    Error_Handler(); /* CubeMX-generated handler: HSE or PLL failed to start */
}
RCC_ClkInitTypeDef clk = {0};
clk.ClockType = RCC_CLOCKTYPE_SYSCLK | RCC_CLOCKTYPE_HCLK |
                RCC_CLOCKTYPE_PCLK1 | RCC_CLOCKTYPE_PCLK2;
clk.SYSCLKSource = RCC_SYSCLKSOURCE_PLLCLK;
clk.AHBCLKDivider = RCC_SYSCLK_DIV1; /* HCLK = 168 MHz */
clk.APB1CLKDivider = RCC_HCLK_DIV4; /* PCLK1 = 42 MHz (max 42 MHz) */
clk.APB2CLKDivider = RCC_HCLK_DIV2; /* PCLK2 = 84 MHz (max 84 MHz) */
/* HAL_RCC_ClockConfig() applies the flash latency before switching to the faster SYSCLK */
if (HAL_RCC_ClockConfig(&clk, FLASH_LATENCY_5) != HAL_OK) {
    Error_Handler();
}
Flash Latency: At 168 MHz with a 3.3 V supply, the STM32F4 Flash requires 5 wait states (FLASH_LATENCY_5). If SYSCLK is raised before the wait states are increased, the CPU can fetch corrupt instructions. HAL_RCC_ClockConfig() applies the latency passed as its second argument in the correct order, and CubeMX selects the value automatically. If you configure clocks through LL or direct register writes instead, set the latency yourself (for example with __HAL_FLASH_SET_LATENCY()) before switching SYSCLK to the faster source.
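The PLL arithmetic and the wait-state rule can be checked on a host PC before touching hardware. The sketch below is a hedged illustration: `f4_pll_check` and `f407_flash_wait_states` are hypothetical names, the VCO limits are the ones quoted in the text above, and the wait-state formula assumes the STM32F405/407 table for a 2.7 V to 3.6 V supply (one extra wait state per started 30 MHz of HCLK). Confirm both against the reference manual for your exact part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Limits as quoted in the text above for the STM32F4 main PLL. */
#define VCO_IN_MIN_HZ    1000000u
#define VCO_IN_MAX_HZ    2000000u
#define VCO_OUT_MIN_HZ 100000000u
#define VCO_OUT_MAX_HZ 432000000u

typedef struct {
    uint32_t vco_in_hz;
    uint32_t vco_out_hz;
    uint32_t sysclk_hz;
    uint32_t pll48_hz;
} pll_result_t;

/* Hypothetical helper: fills *out and returns true only if the VCO limits
 * hold and the Q output is exactly 48 MHz. */
bool f4_pll_check(uint32_t src_hz, uint32_t m, uint32_t n, uint32_t p,
                  uint32_t q, pll_result_t *out)
{
    if (m == 0u || p == 0u || q == 0u || out == NULL) return false;

    /* 64-bit intermediate so src_hz * n cannot overflow. */
    uint64_t vco_out = ((uint64_t)src_hz * n) / m;
    if (vco_out < VCO_OUT_MIN_HZ || vco_out > VCO_OUT_MAX_HZ) return false;

    out->vco_in_hz  = src_hz / m;
    out->vco_out_hz = (uint32_t)vco_out;
    out->sysclk_hz  = out->vco_out_hz / p;
    out->pll48_hz   = out->vco_out_hz / q;

    bool vco_in_ok = out->vco_in_hz >= VCO_IN_MIN_HZ &&
                     out->vco_in_hz <= VCO_IN_MAX_HZ;
    bool usb_ok    = (out->vco_out_hz % q == 0u) &&
                     (out->pll48_hz == 48000000u);
    return vco_in_ok && usb_ok;
}

/* Assumed F405/407 rule at 2.7 V to 3.6 V: 0 WS up to 30 MHz, 1 WS up to
 * 60 MHz, and so on, reaching 5 WS above 150 MHz. */
uint32_t f407_flash_wait_states(uint32_t hclk_hz)
{
    return (hclk_hz == 0u) ? 0u : (hclk_hz - 1u) / 30000000u;
}
```

With the values from the example (HSE = 8 MHz, M = 8, N = 336, P = 2, Q = 7) the check passes and reports 168 MHz for SYSCLK. Changing N to 360 gives a 180 MHz SYSCLK but fails the check, because 360 MHz divided by 7 is no longer exactly 48 MHz.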
AHB, APB1 & APB2 Prescalers
After SYSCLK is established, the clock tree splits into bus clocks through prescalers. Getting these wrong affects every peripheral you configure:
- HCLK (AHB clock): drives the CPU, memory, DMA, and GPIO. On the F4 it normally equals SYSCLK (AHB prescaler = 1), although the prescaler can divide it down.
- PCLK1 (APB1 clock): drives USART2/3, UART4/5, SPI2–3, I2C1–3, and the general-purpose and basic timers TIM2–7 and TIM12–14. Maximum 42 MHz on the F407.
- PCLK2 (APB2 clock): drives USART1/6, SPI1, ADC1–3, the advanced timers TIM1 and TIM8, and the general-purpose timers TIM9–11. Maximum 84 MHz on the F407.
Timer Clock Multiplier: When the APB prescaler is not 1, the timer input clock is doubled by the hardware. So with PCLK1 = 42 MHz (APB1 prescaler = 4), TIM2's clock source is 84 MHz — not 42 MHz. This catches many developers off guard when calculating timer periods.
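The doubling rule and its effect on timer periods can be written down as two small functions. This is a host-side sketch with hypothetical names; it models only the rule stated above (timer clock equals PCLK when the APB prescaler is 1, otherwise 2 × PCLK) and the standard update-rate formula f = f_tim / ((PSC + 1) × (ARR + 1)).

```c
#include <assert.h>
#include <stdint.h>

/* Timer input clock: PCLK if the APB prescaler is 1, otherwise 2 x PCLK. */
uint32_t f4_timer_clock_hz(uint32_t pclk_hz, uint32_t apb_prescaler)
{
    assert(apb_prescaler != 0u);
    return (apb_prescaler == 1u) ? pclk_hz : 2u * pclk_hz;
}

/* Update-event rate for given PSC and ARR register values. */
uint32_t timer_update_hz(uint32_t tim_clk_hz, uint32_t psc, uint32_t arr)
{
    /* 64-bit product so (PSC + 1) * (ARR + 1) cannot overflow. */
    uint64_t div = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(tim_clk_hz / div);
}
```

For TIM2 on APB1 at PCLK1 = 42 MHz with a prescaler of 4, the timer clock is 84 MHz, so PSC = 83 and ARR = 999 give a 1 kHz update rate. Computing PSC from 42 MHz instead would make the timer run twice as fast as intended.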
HAL vs LL vs Bare-Metal
STM32 gives you three distinct levels of hardware abstraction, and professional developers choose between them deliberately — not by default. Each level has a different trade-off between portability, performance, and code size.
HAL Architecture
The STM32 HAL (Hardware Abstraction Layer) is ST's high-level driver framework. Every HAL function follows predictable naming: HAL_[Peripheral]_[Action](handle, ...). HAL drivers maintain state in handle structures (UART_HandleTypeDef, SPI_HandleTypeDef, etc.) that persist across calls, so interrupt and DMA transfers track their progress in the handle rather than in ad hoc global variables.
/* HAL UART transmit — polling mode */
UART_HandleTypeDef huart2;
/* Initialise (typically generated by CubeMX) */
huart2.Instance = USART2;
huart2.Init.BaudRate = 115200;
huart2.Init.WordLength = UART_WORDLENGTH_8B;
huart2.Init.StopBits = UART_STOPBITS_1;
huart2.Init.Parity = UART_PARITY_NONE;
huart2.Init.Mode = UART_MODE_TX_RX;
huart2.Init.HwFlowCtl = UART_HWCONTROL_NONE;
HAL_UART_Init(&huart2);
/* Blocking transmit: HAL_MAX_DELAY waits indefinitely; pass a value in ms for a real timeout */
uint8_t msg[] = "Hello STM32\r\n";
HAL_UART_Transmit(&huart2, msg, sizeof(msg)-1, HAL_MAX_DELAY);
/* Non-blocking interrupt mode: returns immediately, so msg must stay valid until completion */
HAL_UART_Transmit_IT(&huart2, msg, sizeof(msg)-1);
/* Completion signalled via HAL_UART_TxCpltCallback() */
HAL advantages: readable code, easy portability between STM32 families, full interrupt and DMA support via callbacks, and direct CubeMX code generation. HAL disadvantage: overhead. A HAL GPIO write involves a function call and parameter handling that cost on the order of ten cycles, whereas a direct BSRR write is a single store instruction. For GPIO bit-banging or time-critical ISRs, that overhead is often too much.
Low-Layer (LL) Drivers
LL drivers are ST's thin wrapper around registers — inline functions that map directly to peripheral register operations with almost zero overhead. They give you register-level speed with slightly better readability than raw register access, and they're still generated by CubeMX.
/* LL GPIO — set PA5 high, then low */
LL_GPIO_SetOutputPin(GPIOA, LL_GPIO_PIN_5); /* ~1 cycle */
LL_GPIO_ResetOutputPin(GPIOA, LL_GPIO_PIN_5); /* ~1 cycle */
/* LL UART — wait for TX empty, then write */
while (!LL_USART_IsActiveFlag_TXE(USART2)) {}
LL_USART_TransmitData8(USART2, 'A');
Direct Register Access
Bare-metal register access uses CMSIS device headers directly — no HAL, no LL. You access peripheral registers through structs defined in the device header (e.g., stm32f407xx.h). This is the fastest possible approach, but requires you to read every register description in the reference manual.
/* Direct register access — enable GPIOA clock, configure PA5 output */
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN; /* Enable GPIOA clock */
/* Clear MODE bits for PA5, set to output (01) */
GPIOA->MODER &= ~(GPIO_MODER_MODER5);
GPIOA->MODER |= (GPIO_MODER_MODER5_0); /* Output mode */
/* Set output type to push-pull (default), speed to high */
GPIOA->OTYPER &= ~GPIO_OTYPER_OT5; /* Push-pull */
GPIOA->OSPEEDR |= GPIO_OSPEEDR_OSPEED5; /* High speed */
/* Toggle using Bit Set/Reset Register (atomic operation) */
GPIOA->BSRR = GPIO_BSRR_BS5; /* Set PA5 */
GPIOA->BSRR = GPIO_BSRR_BR5; /* Reset PA5 */
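The clear-then-set pattern used on MODER generalises to any pin, since each pin owns a 2-bit field at bit position 2 × pin. The following host-side sketch computes the new register value as a pure function; the enum and function names are hypothetical and chosen to avoid clashing with the HAL's own GPIO_MODE_* macros.

```c
#include <assert.h>
#include <stdint.h>

/* 2-bit MODER field encodings. */
enum {
    PINMODE_INPUT  = 0u,
    PINMODE_OUTPUT = 1u,
    PINMODE_ALT    = 2u,
    PINMODE_ANALOG = 3u
};

/* Returns moder with the 2-bit field for `pin` replaced by `mode`. */
uint32_t moder_with_pin_mode(uint32_t moder, unsigned pin, uint32_t mode)
{
    assert(pin < 16u && mode < 4u);
    uint32_t shift = pin * 2u;
    return (moder & ~(3u << shift)) | (mode << shift);
}
```

On target this would be used as `GPIOA->MODER = moder_with_pin_mode(GPIOA->MODER, 5, PINMODE_OUTPUT);`. Keeping the bit manipulation in a pure function makes it unit-testable off-target, at the cost of one extra call that the compiler will typically inline.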
Choosing the Right Abstraction
| Scenario | Recommended Level | Reason |
| --- | --- | --- |
| Peripheral initialisation (UART, SPI, I2C) | HAL or CubeMX | Complex init sequences, portability |
| DMA transfers, interrupt-driven comms | HAL | Callback infrastructure, handle state management |
| Time-critical ISR GPIO toggling | LL or bare-metal | Single-cycle throughput required |
| Bit-bang protocols (WS2812, 1-Wire) | Bare-metal registers | Cycle-accurate timing needed |
| Clock and power configuration | HAL (CubeMX generated) | Complex sequencing, flash latency, voltage scaling |
| New developer, learning project | HAL | Readable, well-documented, CubeMX support |
| Production firmware, code size critical | LL + selective bare-metal | Smallest binary, predictable performance |
CubeMX & CubeIDE Deep Dive
STM32CubeMX is the graphical configuration tool that generates initialisation code from your peripheral and clock settings. STM32CubeIDE is the Eclipse-based IDE that wraps CubeMX, the GCC toolchain, and a debug interface (ST's ST-LINK GDB server by default, with OpenOCD as an option) into a single environment. For most STM32 work, this is your primary development environment.
CubeMX Code Generation Workflow
The CubeMX workflow follows a consistent pattern regardless of which peripheral you're configuring:
- Select target MCU: choose the exact part number (e.g., STM32F407VGTx). This loads the correct pin count, flash/SRAM sizes, and available peripherals.
- Pinout & Configuration: assign peripherals to pins using the graphical pinout view. CubeMX enforces alternate function constraints.
- Clock Configuration: use the clock tree diagram to configure PLL, bus prescalers, and peripheral clocks. CubeMX validates that no maximum frequency is exceeded.
- Project Manager: set project name, IDE (CubeIDE, Keil, IAR), and code generation options. The "Generate peripheral initialization as a pair of .c/.h files" option keeps generated code modular.
- Generate Code: CubeMX writes main.c, stm32f4xx_hal_msp.c, peripheral init files, and the linker script. Your user code goes between /* USER CODE BEGIN */ and /* USER CODE END */ markers; everything outside these markers is overwritten on the next generation.
USER CODE Markers: CubeMX regenerates everything outside the user code markers. If you add code outside these sections, it will be deleted on the next "Generate Code". Always put your application logic, includes, and variables inside the markers. CubeIDE syntax-highlights these sections differently to remind you.
Configuring the Clock Tree
The CubeMX clock tree is a graphical representation of the RCC (Reset and Clock Control) register settings. You can either type in your target frequencies and let CubeMX solve the PLL parameters, or manually set M/N/P/Q values. CubeMX will highlight any violated constraint in red.
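What "solving the PLL parameters" means can be illustrated with a brute-force search. This is not CubeMX's algorithm, only a minimal sketch with hypothetical names: it scans M, N, and P for a combination that hits the target SYSCLK, keeps the VCO input within 1 to 2 MHz and the VCO output within 100 to 432 MHz, and yields an exact 48 MHz on Q. The field ranges used (M 2 to 63, N 50 to 432, P in {2, 4, 6, 8}, Q 2 to 15) are assumptions to verify against the reference manual for your part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t m, n, p, q;
} pll_cfg_t;

/* Hypothetical solver: returns the first valid configuration found,
 * trying small M first so the highest legal VCO input is preferred. */
bool f4_pll_solve(uint32_t src_hz, uint32_t sysclk_hz, pll_cfg_t *cfg)
{
    if (cfg == NULL) return false;

    for (uint32_t m = 2u; m <= 63u; m++) {
        if (src_hz % m != 0u) continue;            /* keep VCO input exact */
        uint32_t vco_in = src_hz / m;
        if (vco_in < 1000000u || vco_in > 2000000u) continue;

        for (uint32_t n = 50u; n <= 432u; n++) {
            uint32_t vco = vco_in * n;             /* at most 864 MHz, fits in 32 bits */
            if (vco < 100000000u || vco > 432000000u) continue;
            if (vco % 48000000u != 0u) continue;   /* Q output must be exactly 48 MHz */
            uint32_t q = vco / 48000000u;
            if (q < 2u || q > 15u) continue;

            for (uint32_t p = 2u; p <= 8u; p += 2u) {
                if (vco % p == 0u && vco / p == sysclk_hz) {
                    cfg->m = m; cfg->n = n; cfg->p = p; cfg->q = q;
                    return true;
                }
            }
        }
    }
    return false;
}
```

For HSE = 8 MHz and a 168 MHz target this search returns M = 4, N = 168, P = 2, Q = 7, which differs from the M = 8, N = 336 values used elsewhere in this article. Both produce the same 336 MHz VCO output; they differ only in VCO input frequency (2 MHz versus 1 MHz), which shows that the solution is not unique.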
For the STM32F407 running at maximum speed, the optimal clock tree configuration is:
# CubeMX Clock Configuration Summary (STM32F407, HSE=8MHz, max speed)
# Clock source: HSE (8 MHz crystal)
# PLL: M=8, N=336, P=2, Q=7
# SYSCLK = 168 MHz
# HCLK = 168 MHz (AHB prescaler = 1)
# PCLK1 = 42 MHz (APB1 prescaler = 4) → TIM2/3 input = 84 MHz
# PCLK2 = 84 MHz (APB2 prescaler = 2) → TIM1/8 input = 168 MHz
# USB/OTG = 48 MHz (PLL Q = 7) → exact 48 MHz for USB
# Flash latency: 5 wait states at 168 MHz, 3.3V
CubeMX Pitfalls & Best Practices
CubeMX accelerates development but introduces pitfalls that catch beginners and experienced developers alike:
| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Code outside USER CODE markers | Custom code disappears after regeneration | Always use /* USER CODE BEGIN */ blocks |
| HSE not enabled in oscillator config | Falls back to HSI, wrong baud rates | Enable HSE and verify RCC_OscInitTypeDef.HSEState |
| DMA buffer in CCM RAM | DMA transfer completes but data is wrong | Keep DMA buffers in default SRAM, or place them in a linker section mapped to SRAM1, e.g. __attribute__((section(".sram1"))) if your linker script defines one |
| NVIC priority grouping changed late or assumed incorrectly | Nested interrupts behave unexpectedly | HAL_Init() selects NVIC_PRIORITYGROUP_4; if you need another grouping, call HAL_NVIC_SetPriorityGrouping() once, before any priorities are set |
| GPIO alternate function not assigned | SPI/UART pin outputs nothing | Check GPIO_InitTypeDef.Alternate matches peripheral |
| SysTick used by both HAL and FreeRTOS | HAL timeouts work incorrectly under RTOS | Move HAL timebase to a spare timer such as TIM6 when using FreeRTOS |
Build System & Toolchain
CubeIDE manages the build internally, but professional STM32 development increasingly uses command-line build systems for CI/CD pipelines, reproducible builds, and editor freedom. Understanding the toolchain lets you build STM32 firmware from any machine without CubeIDE installed.
arm-none-eabi-gcc Setup
The ARM GNU toolchain (arm-none-eabi-gcc) is the standard open-source compiler for bare-metal ARM targets. "none" means no OS, "eabi" means Embedded ABI. Install it from the ARM Developer website or your package manager:
# Ubuntu/Debian
sudo apt install gcc-arm-none-eabi binutils-arm-none-eabi
# macOS (Homebrew)
brew install --cask gcc-arm-embedded
# Verify installation
arm-none-eabi-gcc --version
# arm-none-eabi-gcc (GNU Arm Embedded Toolchain 12.2) 12.2.1 20221205
# Essential flags for STM32F407 (Cortex-M4F, hard FP)
CFLAGS = -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard
CFLAGS += -DSTM32F407xx -DUSE_HAL_DRIVER
CFLAGS += -O2 -ffunction-sections -fdata-sections -Wall
LDFLAGS = -T STM32F407VGTx_FLASH.ld -Wl,--gc-sections -Wl,-Map=output.map
Makefile vs CMake
CubeMX can generate a Makefile project for command-line builds when you select Makefile as the toolchain in the Project Manager. This is functional but scales poorly for large projects. CMake has become a popular choice for professional STM32 work, especially when integrating multiple libraries, unit testing, and CI pipelines.
# Minimal CMake toolchain file for STM32F407 (arm-none-eabi.cmake)
set(CMAKE_SYSTEM_NAME Generic)
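set(CMAKE_SYSTEM_PROCESSOR ARM)
# Cross-compiled test programs cannot link without a linker script,
# so have CMake's compiler checks build a static library instead
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)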
set(CMAKE_C_COMPILER arm-none-eabi-gcc)
set(CMAKE_CXX_COMPILER arm-none-eabi-g++)
set(CMAKE_ASM_COMPILER arm-none-eabi-gcc)
set(CMAKE_OBJCOPY arm-none-eabi-objcopy)
set(CMAKE_SIZE arm-none-eabi-size)
set(CPU_FLAGS "-mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard")
set(CMAKE_C_FLAGS "${CPU_FLAGS} -ffunction-sections -fdata-sections -Wall" CACHE STRING "")
set(CMAKE_EXE_LINKER_FLAGS "${CPU_FLAGS} -Wl,--gc-sections -specs=nano.specs" CACHE STRING "")
OpenOCD & ST-Link
OpenOCD (Open On-Chip Debugger) is the open-source debug server that connects your development machine to the STM32 via ST-Link or J-Link. CubeIDE can use it as a debug back end, and you can also drive it directly for scripted flashing in CI pipelines:
# Flash firmware via OpenOCD (ST-Link V2)
openocd -f interface/stlink.cfg \
-f target/stm32f4x.cfg \
-c "program build/firmware.elf verify reset exit"
# Start debug server (GDB connects on port 3333)
openocd -f interface/stlink.cfg -f target/stm32f4x.cfg
# In another terminal, start GDB session
arm-none-eabi-gdb build/firmware.elf
(gdb) target remote :3333
(gdb) monitor reset halt
(gdb) load
(gdb) continue
First HAL Project: Blink
The canonical blink example reveals more about the STM32 than its simplicity suggests. Let's implement it with HAL and with direct register access, then compare the Flash footprint of the two versions.
HAL Initialisation Sequence
Every CubeMX-generated main.c follows the same initialisation sequence. Understanding this sequence prevents you from calling HAL functions before the hardware is ready:
int main(void)
{
/* 1. HAL_Init: configures SysTick for 1 ms timebase,
* sets NVIC priority grouping, initialises Flash prefetch */
HAL_Init();
/* 2. SystemClock_Config: configures HSE, PLL, bus prescalers,
* flash latency. Generated entirely by CubeMX. */
SystemClock_Config();
/* 3. Peripheral MX_xxx_Init functions: configure each peripheral
* in the order CubeMX generates them. */
MX_GPIO_Init();
MX_USART2_UART_Init();
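/* 4. User application loop. CubeMX places the WHILE markers around
 * the loop header; loop-body code belongs in the "3" section. */
/* USER CODE BEGIN WHILE */
while (1)
{
/* USER CODE END WHILE */
/* USER CODE BEGIN 3 */
HAL_GPIO_TogglePin(LD2_GPIO_Port, LD2_Pin);
HAL_Delay(500);
}
/* USER CODE END 3 */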
}
Blink with HAL_GPIO_TogglePin
The HAL GPIO configuration generated by CubeMX for a standard Nucleo-F401RE board (LED on PA5):
/* gpio.c — generated by CubeMX, lives in MX_GPIO_Init() */
static void MX_GPIO_Init(void)
{
GPIO_InitTypeDef GPIO_InitStruct = {0};
/* Enable GPIOA clock */
__HAL_RCC_GPIOA_CLK_ENABLE();
/* Set PA5 output level low (LED off initially) */
HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_RESET);
/* Configure PA5 as push-pull output, no pull, low speed */
GPIO_InitStruct.Pin = GPIO_PIN_5;
GPIO_InitStruct.Mode = GPIO_MODE_OUTPUT_PP;
GPIO_InitStruct.Pull = GPIO_NOPULL;
GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_LOW;
HAL_GPIO_Init(GPIOA, &GPIO_InitStruct);
}
/* In main() while(1) loop */
HAL_GPIO_TogglePin(GPIOA, GPIO_PIN_5);
HAL_Delay(500); /* 500 ms, blocks on SysTick */
Same Blink, Register-Level
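Here is the identical blink written with direct register access. The application code compiles to a few dozen bytes of Flash on top of the vector table and startup code, versus roughly 2 KB for the HAL version (including HAL library overhead). Exact figures depend on compiler version and optimisation level, so confirm them with arm-none-eabi-size: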
#include "stm32f4xx.h" /* CMSIS device header — all register definitions */
static volatile uint32_t tick;
void SysTick_Handler(void)
{
tick++;
}
static void delay_ms(uint32_t ms)
{
uint32_t start = tick;
while ((tick - start) < ms) {}
}
int main(void)
{
/* Configure SysTick: 16 MHz HSI / 16000 = 1 kHz (1 ms period) */
SysTick_Config(16000000U / 1000U);
/* Enable GPIOA peripheral clock on AHB1 */
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
(void)RCC->AHB1ENR; /* dummy read — wait for clock to propagate */
/* Configure PA5: output, push-pull, no pull, low speed */
GPIOA->MODER = (GPIOA->MODER & ~GPIO_MODER_MODER5) | (1u << 10);
GPIOA->OTYPER &= ~GPIO_OTYPER_OT5;
GPIOA->OSPEEDR &= ~GPIO_OSPEEDR_OSPEED5;
GPIOA->PUPDR &= ~GPIO_PUPDR_PUPD5;
for (;;)
{
GPIOA->BSRR = GPIO_BSRR_BS5; /* Atomic set PA5 high */
delay_ms(500);
GPIOA->BSRR = GPIO_BSRR_BR5; /* Atomic reset PA5 low */
delay_ms(500);
}
}
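The delay_ms() loop above compares `tick - start` against the delay rather than comparing `tick` against `start + ms`. That choice matters: unsigned subtraction wraps modulo 2^32, so the elapsed time stays correct even when the counter overflows (after about 49.7 days at 1 ms per tick). A host-side sketch of the same idea, with hypothetical function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Ticks elapsed between two samples; correct across one counter wrap. */
uint32_t ticks_elapsed(uint32_t now, uint32_t start)
{
    return now - start;   /* unsigned subtraction wraps modulo 2^32 */
}

/* True once at least timeout_ms ticks have passed since start. */
bool timeout_expired(uint32_t now, uint32_t start, uint32_t timeout_ms)
{
    return ticks_elapsed(now, start) >= timeout_ms;
}
```

With start = 0xFFFFFFFB and now = 5, the counter has wrapped, yet `ticks_elapsed` still returns 10. The alternative test `now >= start + timeout_ms` would misfire here because `start + timeout_ms` itself overflows.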
BSRR is Atomic: The GPIO Bit Set/Reset Register (BSRR) is a write-only, 32-bit register. The upper 16 bits reset pins, the lower 16 bits set pins. Because it's a single write, the operation is inherently atomic — no read-modify-write cycle that could be interrupted by an ISR. Always prefer BSRR over ODR for thread-safe GPIO manipulation.
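The BSRR layout can be modelled on a host PC to make the behaviour concrete. The sketch below uses hypothetical function names: one builds a BSRR word from a set mask and a reset mask, the other models the effect of a single BSRR write on the 16-bit output data register, including the rule that a set bit takes priority when both halves name the same pin.

```c
#include <stdint.h>

/* Build a BSRR word: low half sets pins, high half resets pins. */
uint32_t bsrr_word(uint16_t set_mask, uint16_t reset_mask)
{
    return (uint32_t)set_mask | ((uint32_t)reset_mask << 16);
}

/* Host-side model of one BSRR write applied to ODR (set wins over reset). */
uint16_t odr_after_bsrr(uint16_t odr, uint32_t bsrr)
{
    uint16_t set_mask   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset_mask = (uint16_t)(bsrr >> 16);
    /* Apply the reset first, then the set, so set bits take priority. */
    return (uint16_t)((odr & (uint16_t)~reset_mask) | set_mask);
}
```

Because both masks travel in one 32-bit store, `GPIOA->BSRR = bsrr_word(1u << 8, 1u << 0);` would raise PA8 and lower PA0 in the same write, with no window in which an interrupt can observe a half-updated port.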
Exercises
Exercise 1 (Beginner): CubeMX Clock Tree Exploration
Open CubeMX (or STM32CubeIDE's Device Configuration tool) and create a new project for any STM32F4 device. Navigate to the Clock Configuration tab. Configure the system to use HSE (8 MHz) with PLL to achieve: (a) 168 MHz SYSCLK, (b) exact 48 MHz for USB. Observe the flash latency setting CubeMX selects automatically and verify the APB timer clock multiplier rule applies to TIM2.
Tags: CubeMX, Clock Tree, PLL Configuration
Exercise 2 (Intermediate): Compare HAL vs Register-Level Code Size
Create two CubeIDE projects targeting the same STM32F4 MCU. In Project A, implement a 500 ms blink using HAL_GPIO_TogglePin and HAL_Delay (full HAL enabled). In Project B, implement the same blink using only CMSIS register access with a SysTick delay. Compile both at -O2 and compare: (a) total Flash usage from the .map file, (b) the generated assembly for the toggle operation, (c) worst-case SysTick interrupt latency in each approach.
Tags: HAL, Register Access, Code Size, Assembly
Exercise 3 (Advanced): Command-Line Build Without CubeIDE
Take the CubeMX-generated Makefile project from Exercise 2. Build it entirely from the command line using arm-none-eabi-gcc and make. Then convert the Makefile to a CMake project: write a CMakeLists.txt and a toolchain file, add the HAL sources, CMSIS headers, and linker script. Flash the resulting .elf using OpenOCD or ST-Link CLI without opening CubeIDE. Verify the LED blinks at the correct 500 ms period by measuring the GPIO toggle with a logic analyser or oscilloscope.
Tags: CMake, Makefile, OpenOCD, CI/CD Ready
Conclusion & Next Steps
In this opening article we have established the foundation every STM32 developer needs:
- The STM32 family spans over a dozen product lines — F0 through H7, G0/G4, L0–U5, WB/WL — all sharing the ARM Cortex-M architecture and the STM32Cube HAL ecosystem, enabling skills transfer across families.
- The clock system is the most critical configuration step: HSI/HSE → PLL → SYSCLK → AHB/APB bus clocks determine the performance and accuracy of every peripheral. Misconfigured clocks produce subtle, hard-to-diagnose bugs.
- HAL, LL, and bare-metal register access each have their place. HAL for most application code, LL for performance-critical paths, direct registers for cycle-accurate timing and minimal footprint.
- CubeMX accelerates development dramatically but requires discipline: always use USER CODE markers, verify clock settings, and be aware of DMA/CCM RAM incompatibility.
- A professional STM32 toolchain is arm-none-eabi-gcc + CMake + OpenOCD — reproducible, CI/CD friendly, and IDE-agnostic.
Next in the Series
In Part 2: GPIO & Button Debounce, we'll master every GPIO mode (input, output, alternate function, analog), implement a reliable software debounce algorithm for mechanical buttons, configure External Interrupts (EXTI) for edge-triggered events, and build a state machine that drives an LED pattern from button inputs — with all three abstraction levels compared.