Series Context: This is Part 3 of the 20-part CMSIS Mastery Series. Parts 1 and 2 covered the ecosystem and processor-level API. Here we go deeper — below the C runtime — to the very first instruction executed after a power-on reset.
Startup Code Deep Dive
When power is applied to a Cortex-M microcontroller, the processor reads two words from address 0x00000000: the initial stack pointer value and the reset vector. It loads the stack pointer into MSP and jumps to the reset vector — typically Reset_Handler in the startup file. No C runtime exists yet. Stack and heap are uninitialised. Global variables have indeterminate values. The startup code's job is to fix all of that before calling main().
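The reset fetch described above can be modelled off-target. The following is a minimal sketch, assuming the STM32F407 main SRAM range; the names `reset_words_t`, `reset_words_plausible`, and `reset_entry_address` are hypothetical helpers invented for illustration and are not part of CMSIS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed SRAM bounds for illustration (STM32F407: 128 KB of main SRAM). */
#define SRAM_START 0x20000000u
#define SRAM_END   0x20020000u /* one past the last byte; a legal initial SP */

/* The two words the core fetches at reset. */
typedef struct {
    uint32_t initial_sp;   /* word 0: loaded into MSP */
    uint32_t reset_vector; /* word 1: loaded into PC, bit 0 selects Thumb state */
} reset_words_t;

/* Returns true if the pair looks like something a Cortex-M core could boot from. */
bool reset_words_plausible(const reset_words_t *w)
{
    bool sp_in_sram = (w->initial_sp > SRAM_START) && (w->initial_sp <= SRAM_END);
    bool sp_aligned = (w->initial_sp & 0x7u) == 0u;   /* AAPCS: 8-byte aligned stack */
    bool thumb_bit  = (w->reset_vector & 0x1u) == 1u; /* Cortex-M executes Thumb only */
    return sp_in_sram && sp_aligned && thumb_bit;
}

/* Address of the first instruction fetched (reset vector with bit 0 cleared). */
uint32_t reset_entry_address(const reset_words_t *w)
{
    return w->reset_vector & ~0x1u;
}
```

A check of this kind is useful in a bootloader before jumping to an application image: an erased flash sector reads as 0xFFFFFFFF in both words and fails the test.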
Complete ARM Startup Assembly (GCC)
/**
* startup_stm32f407xx.s — Minimal startup file for STM32F407
* Adapted from ARM CMSIS Device template (startup_ARMCM4.S)
*/
.syntax unified
.cpu cortex-m4
.fpu softvfp
.thumb
/* Stack and Heap sizes — can be overridden by linker script symbols */
.set Stack_Size, 0x400 /* 1 KB default stack */
.set Heap_Size, 0x200 /* 512 bytes default heap */
.section .stack
.align 3
.globl __StackTop
.globl __StackLimit
__StackLimit:
.space Stack_Size
.size __StackLimit, . - __StackLimit
__StackTop:
.size __StackTop, . - __StackTop
.section .heap
.align 3
.globl __HeapBase
.globl __HeapLimit
__HeapBase:
.space Heap_Size
__HeapLimit:
.size __HeapBase, . - __HeapBase
/* Vector table — placed in .isr_vector section by linker script */
.section .isr_vector, "a", %progbits
.align 2
.globl __isr_vector
__isr_vector:
.long __StackTop /* Initial Stack Pointer */
.long Reset_Handler /* Reset Handler */
.long NMI_Handler /* NMI Handler */
.long HardFault_Handler /* HardFault Handler */
.long MemManage_Handler /* MPU Fault Handler */
.long BusFault_Handler /* Bus Fault Handler */
.long UsageFault_Handler /* Usage Fault Handler */
.long 0 /* Reserved */
.long 0 /* Reserved */
.long 0 /* Reserved */
.long 0 /* Reserved */
.long SVC_Handler /* SVCall Handler */
.long DebugMon_Handler /* Debug Monitor Handler */
.long 0 /* Reserved */
.long PendSV_Handler /* PendSV Handler */
.long SysTick_Handler /* SysTick Handler */
/* Device-specific interrupts (240 max) follow here */
.long WWDG_IRQHandler /* Window WatchDog */
.long PVD_IRQHandler /* PVD via EXTI Line */
/* ... (remaining STM32F4 interrupts omitted for brevity) */
.text
.thumb
.thumb_func
.align 2
.globl Reset_Handler
.type Reset_Handler, %function
Reset_Handler:
/* Copy initialised data from Flash (LMA) to SRAM (VMA) */
ldr r0, =__data_start__
ldr r1, =__data_end__
ldr r2, =__etext /* load address of .data in Flash */
movs r3, #0
b .L_data_loop_test
.L_data_loop:
ldr r4, [r2, r3]
str r4, [r0, r3]
adds r3, r3, #4
.L_data_loop_test:
adds r4, r0, r3
cmp r4, r1
bcc .L_data_loop
/* Zero-fill the .bss section */
ldr r0, =__bss_start__
ldr r1, =__bss_end__
movs r2, #0
b .L_bss_loop_test
.L_bss_loop:
str r2, [r0]
adds r0, r0, #4
.L_bss_loop_test:
cmp r0, r1
bcc .L_bss_loop
/* Call SystemInit() to configure clocks and memory */
bl SystemInit
/* Call static constructors (__libc_init_array) */
bl __libc_init_array
/* Call main; it should never return, so park the core if it does */
bl main
b .
.size Reset_Handler, . - Reset_Handler
Linking Startup Code to the Linker Script
The Key Link: The symbols __data_start__, __data_end__, __etext, __bss_start__, and __bss_end__ are not defined in C or assembly — they are exported by the linker script. The startup assembly reads these addresses at runtime to know exactly where the .data and .bss sections live in flash and SRAM.
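The two loops in Reset_Handler can also be expressed in C. This is a sketch only, written with plain pointer parameters so the logic can be exercised on a host; on target the arguments would be the addresses of the linker-exported symbols, as noted in the comment. The function names `copy_data` and `zero_bss` are illustrative, not taken from any vendor startup file.

```c
#include <stdint.h>

/*
 * On target the arguments would come from linker-provided symbols, e.g.
 *   extern uint32_t __data_start__, __data_end__, __etext;
 *   extern uint32_t __bss_start__, __bss_end__;
 *   copy_data(&__data_start__, &__data_end__, &__etext);
 *   zero_bss(&__bss_start__, &__bss_end__);
 */

/* Copy words from src into [dst, dst_end), one 32-bit word at a time. */
void copy_data(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Zero-fill [start, end). */
void zero_bss(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}
```

Both loops assume the regions are 4-byte aligned and a multiple of 4 bytes long, which the `ALIGN(4)` directives in the linker script are there to guarantee.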
Vector Table
The vector table is an array of 32-bit addresses stored at a known location (typically the start of flash). Entry 0 is the initial stack pointer; entries 1 onward are exception and interrupt handler addresses. When an exception is taken, the processor reads the matching entry from this table to find the handler to branch to.
| Offset | Exception Number | IRQn | Exception Name | Priority |
|---|---|---|---|---|
| 0x0000 | — | — | Initial Stack Pointer | — |
| 0x0004 | 1 | — | Reset | -3 (highest) |
| 0x0008 | 2 | -14 | NMI | -2 |
| 0x000C | 3 | -13 | HardFault | -1 |
| 0x0010 | 4 | -12 | MemManage Fault | Configurable |
| 0x0014 | 5 | -11 | BusFault | Configurable |
| 0x0018 | 6 | -10 | UsageFault | Configurable |
| 0x002C | 11 | -5 | SVCall | Configurable |
| 0x0030 | 12 | -4 | Debug Monitor | Configurable |
| 0x0038 | 14 | -2 | PendSV | Configurable |
| 0x003C | 15 | -1 | SysTick | Configurable |
| 0x0040 | 16 | 0 | IRQ0 (device-specific) | Configurable |
| 0x0044 | 17 | 1 | IRQ1 (device-specific) | Configurable |
| ... | ... | ... | Up to IRQ239 | Configurable |
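The relationship between the columns of the table is simple arithmetic: the exception number is IRQn + 16, and the byte offset is the exception number times 4. A small sketch, with hypothetical helper names, makes that explicit:

```c
#include <stdint.h>

#define SYSTEM_ENTRIES 16 /* entries 0..15: initial SP plus the system exceptions */

/* Exception number (as reported in IPSR) for a CMSIS-style IRQn value. */
uint32_t exception_number(int32_t irqn)
{
    return (uint32_t)(irqn + SYSTEM_ENTRIES);
}

/* Byte offset of the handler's entry from the start of the vector table. */
uint32_t vector_offset(int32_t irqn)
{
    return exception_number(irqn) * 4u;
}
```

For example, SysTick has IRQn -1, so its exception number is 15 and its vector sits at offset 0x3C, matching the table row.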
Vector Table in C with Weak Default Handlers
/*
* Vector table definition in C — placed in .isr_vector section.
* __attribute__((weak)) allows application code to override any handler
* by simply defining a function with the same name.
* __attribute__((alias)) makes unimplemented handlers fall through to
* the infinite-loop default, avoiding hard-to-find silent faults.
*/
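#include <stdint.h>
#include "stm32f4xx.h" /* device header: pulls in core_cm4.h for __get_IPSR() */
/* Default handler: catches any unimplemented exception or IRQ */
void Default_Handler(void) {
    /* Capture IPSR so a debugger can show which exception fired (value = IRQn + 16) */
    volatile uint32_t active_exception = __get_IPSR() & 0x1FFU;
    (void)active_exception;
    for (;;) {}
}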
/* Reset_Handler is defined in the startup code */
void Reset_Handler(void);
/* Weak aliases: override by defining the function in application code */
void NMI_Handler(void) __attribute__((weak, alias("Default_Handler")));
void HardFault_Handler(void) __attribute__((weak)); /* defined separately */
void MemManage_Handler(void) __attribute__((weak, alias("Default_Handler")));
void BusFault_Handler(void) __attribute__((weak, alias("Default_Handler")));
void UsageFault_Handler(void) __attribute__((weak, alias("Default_Handler")));
void SVC_Handler(void) __attribute__((weak, alias("Default_Handler")));
void DebugMon_Handler(void) __attribute__((weak, alias("Default_Handler")));
void PendSV_Handler(void) __attribute__((weak, alias("Default_Handler")));
void SysTick_Handler(void) __attribute__((weak, alias("Default_Handler")));
/* Device-specific IRQs */
void WWDG_IRQHandler(void) __attribute__((weak, alias("Default_Handler")));
void TIM2_IRQHandler(void) __attribute__((weak, alias("Default_Handler")));
void USART1_IRQHandler(void) __attribute__((weak, alias("Default_Handler")));
/* The actual vector table */
extern uint32_t __StackTop; /* Defined by linker script */
__attribute__((section(".isr_vector"), used))
const void * const g_pfnVectors[] = {
(void *)&__StackTop, /* Initial MSP */
Reset_Handler, /* Reset Handler */
NMI_Handler,
HardFault_Handler,
MemManage_Handler,
BusFault_Handler,
UsageFault_Handler,
0, 0, 0, 0, /* Reserved */
SVC_Handler,
DebugMon_Handler,
0, /* Reserved */
PendSV_Handler,
SysTick_Handler,
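/* Device interrupts: each handler must sit at index 16 + IRQn */
WWDG_IRQHandler, /* IRQ0 */
/* ... IRQ1 to IRQ27 ... */
TIM2_IRQHandler, /* IRQ28 on STM32F407 */
/* ... IRQ29 to IRQ36 ... */
USART1_IRQHandler, /* IRQ37 on STM32F407 */
/* ... */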
};
Relocating the Vector Table to SRAM via SCB->VTOR
/*
* VTOR (Vector Table Offset Register) — allows relocating the vector
* table anywhere in memory, aligned to the table size (next power of 2
* above number-of-vectors * 4, minimum 128 bytes on Cortex-M3+).
*
* Use cases:
* - Bootloader: application relocates its own vector table to its start
* - SRAM execution: copy vector table to SRAM for faster ISR dispatch
* - RAM-patching: override individual vectors at runtime
*/
#define NUM_VECTORS (16U + 82U) /* 16 system entries + 82 IRQs on STM32F407 */
/* Extern: vector table defined in startup file or linker symbol */
extern const void * const g_pfnVectors[];
/* RAM copy of the table. 98 words = 392 bytes, so VTOR needs 512-byte alignment.
 * A dedicated array keeps the copy clear of .data and .bss, which the linker
 * script places from the start of SRAM. */
static uint32_t s_ramVectors[NUM_VECTORS] __attribute__((aligned(512)));
void RelocateVectorTable(void) {
    /* Copy flash vector table to SRAM */
    const uint32_t *src = (const uint32_t *)g_pfnVectors;
    for (uint32_t i = 0; i < NUM_VECTORS; i++) {
        s_ramVectors[i] = src[i];
    }
    /* Make sure the copy has completed before VTOR points at it */
    __DSB();
    SCB->VTOR = (uint32_t)s_ramVectors;
    __DSB();
    __ISB(); /* Flush pipeline to ensure VTOR change takes effect */
}
/*
* Patching a single vector at runtime. This only works once VTOR points at
* a table in RAM; a flash-resident table cannot be modified with a plain store.
*/
void PatchVectorEntry(uint32_t irq_index, void (*handler)(void)) {
uint32_t *vtor_base = (uint32_t *)SCB->VTOR;
vtor_base[irq_index + 16U] = (uint32_t)handler | 0x1U; /* Thumb bit */
__DSB();
}
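The alignment rule quoted in the comment above (next power of two at or above the table size, with a 128-byte minimum) can be computed rather than memorised. The sketch below follows that rule as stated; the helper names are hypothetical, and the device reference manual remains the authority for a specific part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Required VTOR alignment in bytes for a table with num_irqs device interrupts. */
uint32_t vtor_alignment_bytes(uint32_t num_irqs)
{
    uint32_t table_bytes = (16u + num_irqs) * 4u; /* 16 system entries + IRQs */
    uint32_t align = 128u;                        /* minimum stated for Cortex-M3+ */
    while (align < table_bytes) {
        align <<= 1;
    }
    return align;
}

/* True if addr is a legal VTOR value for that many interrupts. */
bool vtor_address_ok(uint32_t addr, uint32_t num_irqs)
{
    return (addr & (vtor_alignment_bytes(num_irqs) - 1u)) == 0u;
}
```

With the 82 interrupts of an STM32F407 the table is 392 bytes, so the result is 512; an application placed at 0x08010000 satisfies that comfortably.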
Linker Scripts & Scatter Files
The linker script is the control file that tells the GNU linker (ld) exactly where to place every section of your program in the target's address space. Every embedded project has one — and understanding it is essential for advanced memory management, bootloaders, and performance optimisation.
MEMORY Block — Defining Address Regions
/* STM32F407VGT6 GCC Linker Script — complete example
* Flash: 1 MB at 0x08000000
* SRAM1: 112 KB at 0x20000000
* SRAM2: 16 KB at 0x2001C000
* CCM RAM: 64 KB at 0x10000000 (core-coupled data RAM; no instruction fetch on STM32F4)
*/
MEMORY {
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 1024K
SRAM (rwx) : ORIGIN = 0x20000000, LENGTH = 112K
SRAM2 (rwx) : ORIGIN = 0x2001C000, LENGTH = 16K
CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K
}
/* Minimum heap and stack sizes, reserved by the ._user_heap_stack section */
_Min_Heap_Size = 0x400; /* 1 KB minimum heap */
_Min_Stack_Size = 0x800; /* 2 KB minimum stack */
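A quick way to sanity-check a MEMORY block is to compute where each region ends and confirm that no two regions overlap. The following sketch does that arithmetic for the regions listed above; `region_t` and its helpers are illustrative names, not part of any toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t origin; /* ORIGIN from the MEMORY block */
    uint32_t length; /* LENGTH in bytes */
} region_t;

/* First address past the end of the region. */
uint32_t region_end(region_t r)
{
    return r.origin + r.length;
}

/* True if the two regions share at least one address. */
bool regions_overlap(region_t a, region_t b)
{
    return (a.origin < region_end(b)) && (b.origin < region_end(a));
}

/* True if b starts exactly where a ends. */
bool regions_contiguous(region_t a, region_t b)
{
    return region_end(a) == b.origin;
}
```

For the script above, SRAM ends at 0x20000000 + 112 KB = 0x2001C000, which is exactly where SRAM2 begins, so the two could also be declared as a single 128 KB region if separate placement is not needed.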
SECTIONS Block — Placing Code and Data
SECTIONS {
/* ── .isr_vector: Vector table at very start of flash ── */
.isr_vector : {
. = ALIGN(4);
KEEP(*(.isr_vector)) /* KEEP prevents garbage collection */
. = ALIGN(4);
} >FLASH
/* ── .text: Code and read-only data ── */
.text : {
. = ALIGN(4);
*(.text) /* .text sections from all object files */
*(.text*) /* .text.funcname (from -ffunction-sections) */
*(.glue_7) /* ARM/Thumb interworking glue */
*(.glue_7t)
*(.eh_frame)
KEEP(*(.init))
KEEP(*(.fini))
. = ALIGN(4);
_etext = .; /* End of .text in Flash (not the .data load address; see __etext below) */
} >FLASH
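/* ── .rodata: Read-only data (string literals, const arrays) ── */
.rodata : {
. = ALIGN(4);
*(.rodata)
*(.rodata*)
. = ALIGN(4);
} >FLASH
/* ── .preinit_array / .init_array: constructor tables walked by __libc_init_array ── */
.preinit_array : {
PROVIDE_HIDDEN(__preinit_array_start = .);
KEEP(*(.preinit_array*))
PROVIDE_HIDDEN(__preinit_array_end = .);
} >FLASH
.init_array : {
PROVIDE_HIDDEN(__init_array_start = .);
KEEP(*(SORT(.init_array.*)))
KEEP(*(.init_array*))
PROVIDE_HIDDEN(__init_array_end = .);
} >FLASH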
/* ── .data: Initialised global/static variables ── */
/* LMA (load address) is in Flash; VMA (runtime address) is in SRAM */
.data : {
. = ALIGN(4);
__data_start__ = .; /* Used by Reset_Handler data copy loop */
*(.data)
*(.data*)
. = ALIGN(4);
__data_end__ = .;
} >SRAM AT>FLASH /* Run in SRAM, stored in Flash */
/* __etext = load address of .data in Flash */
__etext = LOADADDR(.data);
/* ── .bss: Uninitialised global/static variables ── */
.bss : {
. = ALIGN(4);
__bss_start__ = .;
*(.bss)
*(.bss*)
*(COMMON)
. = ALIGN(4);
__bss_end__ = .;
} >SRAM
/* ── ._user_heap_stack: Heap and stack size check ── */
._user_heap_stack : {
. = ALIGN(8);
PROVIDE(end = .);
PROVIDE(_end = .);
. = . + _Min_Heap_Size;
. = . + _Min_Stack_Size;
. = ALIGN(8);
} >SRAM
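/* ── .ccmram: Time-critical data in CCM RAM (data only on STM32F4, no code) ── */
/* Initial values are stored in Flash; the startup code must copy them (see Exercise 1) */
.ccmram : {
. = ALIGN(4);
_sccmram = .;
*(.ccmram)
*(.ccmram*)
. = ALIGN(4);
_eccmram = .;
} >CCMRAM AT>FLASH
_siccmram = LOADADDR(.ccmram); /* load address of .ccmram in Flash */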
/* Discard unwanted sections */
/DISCARD/ : {
libc.a(*)
libm.a(*)
libgcc.a(*)
}
}
Boot Process: Reset to main()
Understanding the complete boot sequence lets you diagnose startup failures, implement proper clock configuration, and safely use global variables.
Step 1
Power-On / Hardware Reset
The CPU loads MSP from address 0x00000000, then reads the reset vector from 0x00000004 (on an STM32F4 booting from main flash, these addresses alias 0x08000000 and 0x08000004). The processor begins executing at the reset vector address (Reset_Handler). The general-purpose registers R0-R12 hold indeterminate values. The internal HSI oscillator runs at 16 MHz by default on STM32F4.
Step 2
.data Copy: Flash to SRAM
The startup assembly reads from the load address (LMA) of .data in Flash, given by __etext, and writes to its run-time address (VMA) in SRAM, from __data_start__ to __data_end__. After this step, all initialised global variables have their correct initial values.
Step 3
.bss Zero-Fill
The C standard requires objects with static storage duration and no explicit initialiser to start at zero. The compiler relies on the startup code to deliver this, so the startup code fills the region from __bss_start__ to __bss_end__ with zeros. If this step is skipped, those variables hold whatever happened to be in SRAM.
Step 4
SystemInit() — Clock & Memory Config
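Called by Reset_Handler before main(). Vendor-provided (e.g., system_stm32f4xx.c). It typically configures the PLL for the target CPU frequency, sets flash wait states, enables the FPU (if present), and updates SystemCoreClock. On ARMv8-M devices with TrustZone, secure-side SystemInit() implementations may also program the SAU to partition memory into secure and non-secure regions.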
/*
* Minimal SystemInit() for STM32F407: configure the PLL from HSI to reach 168 MHz.
* In practice you would use the vendor's clock configuration tool,
* but understanding this code demystifies the entire clock tree.
*/
void SystemInit(void) {
/* Enable FPU — must be done before any FP instruction */
SCB->CPACR |= ((3UL << 10*2) | (3UL << 11*2)); /* CP10, CP11 full access */
__DSB();
__ISB();
/* Enable HSI (16 MHz internal oscillator) — already on after reset */
RCC->CR |= RCC_CR_HSION;
while (!(RCC->CR & RCC_CR_HSIRDY)) {} /* Wait for HSI ready */
/* Reset RCC configuration to safe defaults */
RCC->CFGR = 0x00000000U;
/* Configure PLL: HSI (16 MHz) * (N=168) / (M=8) / P=2 = 168 MHz */
RCC->PLLCFGR = (8U << RCC_PLLCFGR_PLLM_Pos) | /* M = 8 */
(168U << RCC_PLLCFGR_PLLN_Pos) | /* N = 168 */
(0U << RCC_PLLCFGR_PLLP_Pos) | /* P = 2 (0b00) */
(7U << RCC_PLLCFGR_PLLQ_Pos); /* Q = 7 for USB */
/* Enable PLL and wait for lock */
RCC->CR |= RCC_CR_PLLON;
while (!(RCC->CR & RCC_CR_PLLRDY)) {}
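/* Configure Flash: 5 wait states + instruction cache + prefetch */
FLASH->ACR = FLASH_ACR_LATENCY_5WS | FLASH_ACR_ICEN | FLASH_ACR_PRFTEN;
/* Bus prescalers: AHB /1, APB1 /4 (42 MHz max), APB2 /2 (84 MHz max) */
RCC->CFGR |= RCC_CFGR_HPRE_DIV1 | RCC_CFGR_PPRE1_DIV4 | RCC_CFGR_PPRE2_DIV2;
/* Switch system clock to PLL */
RCC->CFGR |= RCC_CFGR_SW_PLL;
while ((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL) {}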
/* Update CMSIS global SystemCoreClock variable */
SystemCoreClock = 168000000U;
}
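The PLL settings in SystemInit() follow from a short chain of divisions and one multiplication. The sketch below reproduces that arithmetic so the M, N, P, and Q values can be checked before they are written to RCC->PLLCFGR. The struct and function names are hypothetical, and the code assumes all dividers are non-zero and that P is one of 2, 4, 6, or 8.

```c
#include <stdint.h>

typedef struct {
    uint32_t vco_in_hz;  /* source / M */
    uint32_t vco_out_hz; /* vco_in * N */
    uint32_t sysclk_hz;  /* vco_out / P */
    uint32_t usb_hz;     /* vco_out / Q */
} pll_out_t;

/* Compute the PLL output frequencies for the given source and dividers. */
pll_out_t pll_compute(uint32_t src_hz, uint32_t m, uint32_t n, uint32_t p, uint32_t q)
{
    pll_out_t out;
    out.vco_in_hz  = src_hz / m;
    out.vco_out_hz = out.vco_in_hz * n;
    out.sysclk_hz  = out.vco_out_hz / p;
    out.usb_hz     = out.vco_out_hz / q;
    return out;
}

/* Register encoding of the PLLP field: P = 2, 4, 6, 8 maps to 0, 1, 2, 3. */
uint32_t pllp_field(uint32_t p)
{
    return (p / 2u) - 1u;
}
```

With the values used above (16 MHz HSI, M = 8, N = 168, P = 2, Q = 7), the VCO input is 2 MHz, the VCO output is 336 MHz, the system clock is 168 MHz, and the USB clock is 48 MHz.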
Bootloaders & Firmware Update Strategies
A bootloader is a small program that runs first after reset, validates the application firmware, and jumps to it. Understanding startup code and vector table relocation is a prerequisite for writing a bootloader.
/*
* Minimal 2nd-stage bootloader: validates application and jumps to it.
* Application starts at APP_START_ADDR (must match app linker script).
*
* STM32F4: Bootloader at 0x08000000, Application at 0x08010000 (64 KB offset)
*/
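#include <stdbool.h>
#include <stdint.h>
#include "stm32f4xx.h" /* device header: SCB, SysTick, NVIC, __set_MSP() */
#define APP_START_ADDR 0x08010000U
#define FLASH_END_ADDR 0x080FFFFFU
/* Assumed to be provided elsewhere in the bootloader project:
 * APP_SIZE (image length in bytes, including the trailing CRC word),
 * crc32_calculate(), Error_Handler(), is_bootloader_entry_forced(),
 * firmware_update_receive(). */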
/* Type for a function pointer with no arguments/return (reset handler type) */
typedef void (*AppEntry_t)(void);
/* Simple CRC32 validation — replace with SHA-256 for production */
static bool validate_firmware(uint32_t start, uint32_t length) {
uint32_t stored_crc = *(uint32_t *)(start + length - 4U);
uint32_t computed_crc = crc32_calculate((uint8_t *)start, length - 4U);
return (stored_crc == computed_crc);
}
void bootloader_jump_to_app(uint32_t app_address) {
/* Read app stack pointer (first word in vector table) */
uint32_t app_sp = *(uint32_t *)app_address;
/* Validate stack pointer is within SRAM range */
if (app_sp < 0x20000000U || app_sp > 0x20020000U) {
/* Invalid firmware — stay in bootloader or signal error */
Error_Handler();
return;
}
/* Read app reset vector (second word) — this is the entry point */
AppEntry_t app_reset = (AppEntry_t)(*(uint32_t *)(app_address + 4U));
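/* Quiesce the bootloader's interrupt sources before handing over */
__disable_irq();
SysTick->CTRL = 0U; /* stop SysTick */
for (uint32_t i = 0U; i < 8U; i++) {
    NVIC->ICER[i] = 0xFFFFFFFFU; /* disable every IRQ */
    NVIC->ICPR[i] = 0xFFFFFFFFU; /* clear anything pending */
}
/* Relocate vector table to application start */
SCB->VTOR = app_address;
__DSB();
__ISB();
/* Nothing is enabled in the NVIC now, so unmasking is safe */
__enable_irq();
/* Load the application's MSP and branch in a single asm statement (GCC syntax)
 * so the compiler cannot touch the stack between the two steps */
__asm volatile ("msr msp, %0\n\tbx %1" : : "r"(app_sp), "r"(app_reset) : "memory");
for (;;) {} /* Never reached */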
}
int main(void) {
/* Check if bootloader entry is forced (e.g., button held or magic word) */
if (is_bootloader_entry_forced() || !validate_firmware(APP_START_ADDR, APP_SIZE)) {
/* Stay in bootloader — receive firmware via UART/USB/CAN */
firmware_update_receive();
} else {
bootloader_jump_to_app(APP_START_ADDR);
}
for (;;) {} /* Should never reach here */
}
/*
* Application linker script must set origin offset:
* FLASH (rx) : ORIGIN = 0x08010000, LENGTH = 960K
* Application must also set VTOR in its SystemInit():
* SCB->VTOR = 0x08010000U;
*/
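The listing above calls `crc32_calculate()` without defining it. One possible implementation is sketched here, assuming the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF) and a little-endian CRC word stored in the last four bytes of the image. Both the algorithm choice and the `image_crc_ok` helper are assumptions for illustration; the source does not specify which CRC variant or trailer layout it intends.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). Slow but table-free. */
uint32_t crc32_calculate(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u); /* all ones if the low bit is set */
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Check an image whose last 4 bytes hold the little-endian CRC of the rest. */
bool image_crc_ok(const uint8_t *image, size_t len)
{
    if (len < 4u) {
        return false;
    }
    const uint8_t *t = image + len - 4u;
    uint32_t stored = (uint32_t)t[0] | ((uint32_t)t[1] << 8) |
                      ((uint32_t)t[2] << 16) | ((uint32_t)t[3] << 24);
    return stored == crc32_calculate(image, len - 4u);
}
```

A CRC only detects accidental corruption. As the comment in the bootloader listing says, production systems should verify a cryptographic hash or signature instead.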
Dual-Bank Flash Strategy: For fail-safe OTA updates, use a device with dual-bank flash (for example STM32H7 or STM32U5 parts). Write the new firmware to Bank B while Bank A runs. Swap banks (through the device's flash option bytes) only after verifying the new firmware's CRC or signature. If the new firmware fails to boot, the bootloader detects this on the next reset and reverts to Bank A.
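The revert-on-failure behaviour can be sketched as a small decision function driven by a boot-attempt counter. Everything here is hypothetical: the state layout, the limit of three attempts, and the function names are illustrative choices, and the sketch assumes the state is kept somewhere that survives a reset (for example a backup register or a reserved flash page).

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BANK_A, BANK_B } bank_t;

/* Persistent boot state; storage location is an assumption of this sketch. */
typedef struct {
    bank_t  active;          /* bank the bootloader will try to boot */
    bool    pending_confirm; /* true until the new firmware confirms itself */
    uint8_t boot_attempts;   /* resets seen while confirmation is pending */
} boot_state_t;

#define MAX_BOOT_ATTEMPTS 3u /* illustrative limit */

/* Called by the bootloader on every reset. Returns the bank to boot. */
bank_t select_boot_bank(boot_state_t *s)
{
    if (s->pending_confirm) {
        if (s->boot_attempts >= MAX_BOOT_ATTEMPTS) {
            /* New firmware never confirmed: fall back to the other bank. */
            s->active = (s->active == BANK_A) ? BANK_B : BANK_A;
            s->pending_confirm = false;
            s->boot_attempts = 0u;
        } else {
            s->boot_attempts++;
        }
    }
    return s->active;
}

/* Called by the application once it is running correctly. */
void confirm_boot(boot_state_t *s)
{
    s->pending_confirm = false;
    s->boot_attempts = 0u;
}
```

The design choice here is that the application, not the bootloader, decides when an update has succeeded. A watchdog reset during early start-up then counts as a failed attempt without any extra logic.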
Exercises
Exercise 1
Beginner
Add a Custom .ccmram Section for STM32F4
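The provided linker script maps a .ccmram section to the CCM RAM region (0x10000000, 64 KB on STM32F407). Add a corresponding .ccmram copy loop in the startup assembly (mirror the .data copy pattern, using the section's load address in flash as the source). Then annotate a time-critical data buffer with __attribute__((section(".ccmram"))) and verify in the .map file that it was placed correctly. Note that CCM RAM on STM32F4 is connected to the core's data bus only, so it can hold data and stack but not executable code, and the DMA controllers cannot reach it.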
Linker Script
CCM RAM
STM32F4
Exercise 2
Intermediate
Software Reset via SCB->AIRCR
Implement a system_reset(void) function that triggers a software reset using the SCB->AIRCR SYSRESETREQ bit. Write the correct VECTKEY value (0x05FA) in the high halfword before setting the request bit. Verify that the system resets cleanly and that your startup code re-runs correctly, including the .bss zero-fill (add a verification variable that should be 0 after reset).
SCB->AIRCR
Software Reset
VECTKEY
Exercise 3
Advanced
Minimal 2nd-Stage Bootloader with Vector Table Relocation
Build a complete minimal bootloader for STM32F4: (1) Store a "firmware valid" magic word at a known flash address; (2) on reset, bootloader checks the magic word — if valid, relocates the vector table to APP_START_ADDR (0x08010000) and jumps to the application; (3) if invalid, blink an error LED. Build and flash the bootloader + application as separate binaries. Verify the application's global variables are correctly initialised (proving the application's own startup ran).
Bootloader
VTOR
Firmware Jump
Flash Layout
Conclusion & Next Steps
In this article we have traced the complete journey from power-on reset to the first instruction of main():
- The startup assembly is the bridge between raw hardware and the C runtime: it sets up the stack, copies initialised data from flash to SRAM, zero-fills BSS, configures clocks via SystemInit(), and calls main().
- The vector table is a flat array of 32-bit handler addresses. Weak aliases allow application code to override any handler; the __attribute__((section(".isr_vector"))) attribute places it at the correct flash address.
- The linker script exports the symbols (__data_start__, __etext, __bss_start__) that the startup code reads at runtime, so startup and linker script must agree exactly on symbol names.
- VTOR relocation via SCB->VTOR enables bootloaders to hand off to applications cleanly, and applications to patch individual interrupt vectors at runtime.
- Dual-bank flash combined with firmware signature verification is the correct foundation for production OTA update systems.
Next in the Series
In Part 4: CMSIS-RTOS2 — Threads, Mutexes & Semaphores, we build on the bare-metal foundation and introduce the CMSIS-RTOS2 API: how to create threads with osThreadNew(), protect shared resources with mutexes (including priority inheritance), and signal between threads and ISRs with semaphores — all through a kernel-agnostic API that works with FreeRTOS, Keil RTX5, and Zephyr.