ARM Assembly Part 14: Cortex-M Assembly & Bare-Metal Embedded

April 9, 2026 Wasil Zafar 24 min read

Cortex-M is ARM's microcontroller profile — it runs Thumb-2 exclusively, boots from a vector table in Flash, uses the NVIC for nested interrupt management, and exposes SysTick for RTOS tick generation. This part builds a complete bare-metal environment from reset vector through C runtime initialisation to a blinking LED in assembly.

Table of Contents

  1. Introduction & Cortex-M Profile
  2. Thumb-2 Instruction Set
  3. Cortex-M Vector Table
  4. NVIC — Nested Vectored Interrupt Controller
  5. SysTick Timer
  6. Peripheral Register Access (GPIO & MMIO)
  7. Linker Script & Memory Map
  8. Low-Power Modes (WFI / WFE)
  9. Hands-On Exercises
  10. Embedded Project Planner
  11. Conclusion & Next Steps

Introduction & Cortex-M Profile

Series Overview: This is Part 14 of our 28-part ARM Assembly Mastery Series. Parts 11–13 covered the Cortex-A system programming model. Now we shift to the Cortex-M microcontroller profile — a radically simpler architecture optimised for deterministic real-time response, sub-milliwatt standby, and zero-OS bare-metal programming from $0.25 MCUs to industrial STM32 boards.

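Cortex-M cores (M0, M0+, M3, M4, M7, M23, M33, M55, M85) have no MMU and execute only Thumb instructions: the ARMv6-M and ARMv8-M Baseline cores (M0, M0+, M23) implement mostly 16-bit Thumb plus a handful of 32-bit encodings, while M3 and above implement full Thumb-2. M3, M4, and M7 use a Harvard bus arrangement with separate instruction and data interfaces; M0 and M0+ share a single bus. There is no AArch64, no EL system, and no separate secure monitor, though the ARMv8-M cores (M23, M33, M55, M85) can add a simplified TrustZone-M for IoT security. The entire exception and interrupt control model fits in one 4 KB region: the System Control Space (SCS) at 0xE000E000.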

Analogy — The Kitchen Appliance vs. the Mainframe: Think of Cortex-A as a full desktop computer: it has virtual memory, multiple privilege levels, and runs a complex OS. Cortex-M is more like a dedicated kitchen appliance — a microwave or dishwasher — with one job, a simple control panel (Thumb-2), a built-in timer (SysTick), and an alarm system (NVIC) that wakes it when the door opens. There is no operating system; the firmware is the entire software. This simplicity gives Cortex-M deterministic interrupt response in as few as 12 clock cycles and power consumption measured in micro-amps during sleep.

Thumb-2 Instruction Set

16-bit vs 32-bit Encoding

Thumb-2 mixes 16-bit and 32-bit encodings in the same instruction stream. The 16-bit subset covers the most common operations (MOV, ADD, LDR, STR, push/pop, branches) for code density. The 32-bit Thumb-2 extensions add the full barrel-shifter operands, more registers, wider immediates, multiply-accumulate, and SIMD/DSP instructions. The assembler selects the shorter encoding automatically unless forced with the .W (wide) or .N (narrow) suffixes.

// 16-bit Thumb (narrow) — most efficient for Cortex-M0
MOVS  r0, #42          // 16-bit: set r0=42, update flags
ADDS  r1, r0, #1       // 16-bit: r1 = r0+1, flags updated

// 32-bit Thumb-2 (wide) — needed for wider immediates
MOVW  r0, #0xBEEF      // 32-bit: load 16-bit immediate
MOVT  r0, #0xDEAD      // 32-bit: load upper 16 bits → r0=0xDEADBEEF

// Barrel shifter in 32-bit encoding (not available in 16-bit)
ADD.W r2, r1, r0, LSL #3   // r2 = r1 + (r0 << 3)

// High-register MOV: one of the few 16-bit encodings that can reach r8-r12
MOV   r8, r0           // 16-bit encoding (special hi-reg form, flags not updated)
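
Because 16-bit and 32-bit encodings share one instruction stream, a decoder has to tell them apart from the first halfword alone. The sketch below is a host-side C illustration of that rule (the function name thumb_insn_size is made up for this example): a first halfword whose top five bits are 0b11101, 0b11110, or 0b11111 starts a 32-bit encoding, and anything else is a complete 16-bit instruction.

```c
#include <stdint.h>

/* Size in bytes (2 or 4) of the Thumb instruction whose first halfword
 * is hw1. Hypothetical helper for illustration only. */
static unsigned thumb_insn_size(uint16_t hw1)
{
    unsigned top5 = (unsigned)hw1 >> 11;   /* bits [15:11] */

    /* 0b11101, 0b11110 and 0b11111 mark the first half of a 32-bit encoding. */
    return (top5 >= 0x1Du) ? 4u : 2u;
}
```

For instance, 0x202A (MOVS r0, #42) is reported as 2 bytes, while a first halfword of 0xF24B (the start of a MOVW) is reported as 4.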

IT Block Conditional Execution

// IT block: up to 4 conditionally-executed Thumb instructions
// Syntax: IT[T|E][T|E][T|E] cond
// T = Then (execute if cond true), E = Else (execute if cond false)
CMP   r0, #0
ITE   EQ               // If EQ: execute next (Then), else execute after (Else)
MOVEQ r1, #1           // r1=1 if r0==0
MOVNE r1, #0           // r1=0 if r0!=0

// Four-instruction example (two Then, two Else):
CMP  r2, r3
ITTEE GT               // GT: Then Then, else Else Else
MOVGT r4, r2
ADDGT r5, r2, #1
MOVLE r4, r3
ADDLE r5, r3, #1

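// Note: IT is not available on Cortex-M0/M0+ (ARMv6-M); use conditional branches there.
// On Cortex-M3 and above IT blocks are fully supported. The deprecation of
// multi-instruction IT blocks concerns ARMv8-A AArch32 code, not the M profile.
// Conditional branches can still be clearer when the conditional body grows.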

Cortex-M Vector Table

Reset Handler & Stack Setup

The Cortex-M vector table starts at address 0x00000000 (or the address in VTOR). Entry 0 is the initial stack pointer value (loaded into MSP at reset). Entry 1 is the reset handler address. Every handler entry is an odd address (bit[0]=1) to indicate Thumb state; entry 0 is a plain stack address and is the only exception. An even handler address faults as soon as the core tries to execute the handler, because Cortex-M hardware never executes ARM-state code.

// Cortex-M4 minimal vector table (GAS syntax)
    .section ".vectors", "a", %progbits
    .type  vector_table, %object
vector_table:
    .word  _stack_top          // 0x00: Initial MSP value (from linker)
    .word  reset_handler + 1   // 0x04: Reset (bit[0]=1 for Thumb)
    .word  nmi_handler + 1     // 0x08: NMI
    .word  hardfault_handler + 1 // 0x0C: HardFault
    .word  memmanage_handler + 1 // 0x10: MemManage
    .word  busfault_handler + 1  // 0x14: BusFault
    .word  usagefault_handler + 1// 0x18: UsageFault
    .word  0, 0, 0, 0          // 0x1C-0x28: Reserved
    .word  svc_handler + 1     // 0x2C: SVCall
    .word  0, 0                // 0x30-0x34: Debug / Reserved
    .word  pendsv_handler + 1  // 0x38: PendSV (RTOS context switch)
    .word  systick_handler + 1 // 0x3C: SysTick
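    // External interrupts (IRQ0..N) follow from 0x40 onward, one word per IRQ.
    // The two entries below occupy IRQ0 and IRQ1 for illustration only; on a real
    // device each handler must sit in the slot given by its IRQ number.
    .word  uart_irq_handler + 1
    .word  tim2_irq_handler + 1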
    .size  vector_table, . - vector_table
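
The offsets in the table comments follow one rule: each entry is a 4-byte word, and external interrupt n is exception number 16 + n. A small host-side C sketch of that arithmetic follows; the helper names are invented for this example and are not part of any vendor header.

```c
#include <stdint.h>

/* Byte offset of a vector table entry from the table base.
 * Exception numbers: 1 = Reset, 3 = HardFault, 15 = SysTick, 16 = IRQ0.
 * Entry 0 is not an exception; it holds the initial MSP value. */
static uint32_t vector_offset(uint32_t exception_number)
{
    return exception_number * 4u;
}

/* External interrupt IRQn is exception number 16 + n. */
static uint32_t irq_vector_offset(uint32_t irqn)
{
    return vector_offset(16u + irqn);
}

/* A handler entry is usable only if bit[0] is set (Thumb state). */
static int is_thumb_entry(uint32_t entry)
{
    return (entry & 1u) != 0u;
}
```

With these helpers, SysTick (exception 15) lands at 0x3C and IRQ0 at 0x40, matching the comments in the listing.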

// Reset handler: copy .data, zero .bss, call main
    .section ".text"
    .thumb_func
    .global reset_handler
reset_handler:
    LDR   r0, =_data_load      // Source: Flash
    LDR   r1, =_data_start     // Destination: SRAM
    LDR   r2, =_data_end
copy_data:
    CMP   r1, r2
    BHS   zero_bss             // Unsigned compare: addresses are not signed values
    LDR   r3, [r0], #4
    STR   r3, [r1], #4
    B     copy_data
zero_bss:
    LDR   r1, =_bss_start
    LDR   r2, =_bss_end
    MOVS  r3, #0
clear_bss:
    CMP   r1, r2
    BHS   call_main            // Unsigned compare: addresses are not signed values
    STR   r3, [r1], #4
    B     clear_bss
call_main:
    BL    main
    B     .                    // Infinite loop if main returns
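
The two loops in the reset handler are easier to follow in C. The sketch below mirrors them on a host machine, with ordinary arrays standing in for Flash and SRAM; on a real target the pointers would come from the linker symbols instead. The names copy_data, zero_bss, and startup_demo are chosen for this illustration only.

```c
#include <stdint.h>

/* Copy initialised words from the load image into [dst, dst_end). */
static void copy_data(const uint32_t *src, uint32_t *dst, const uint32_t *dst_end)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Fill [start, end) with zero words. */
static void zero_bss(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}

/* Host-side demonstration: arrays stand in for Flash and SRAM.
 * Returns 1 if the simulated SRAM ends up as expected. */
static int startup_demo(void)
{
    static const uint32_t flash_image[3] = { 0x11u, 0x22u, 0x33u };
    uint32_t sram[5] = { 0xAAu, 0xAAu, 0xAAu, 0xAAu, 0xAAu };

    copy_data(flash_image, &sram[0], &sram[3]);  /* .data occupies sram[0..2] */
    zero_bss(&sram[3], &sram[5]);                /* .bss  occupies sram[3..4] */

    return sram[0] == 0x11u && sram[1] == 0x22u && sram[2] == 0x33u
        && sram[3] == 0u && sram[4] == 0u;
}
```

Both loops compare pointers, so the end address is treated as an unsigned bound, which is the same reason an unsigned branch condition suits the assembly version.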

HardFault / MemManage / BusFault

// HardFault handler: inspect stacked frame for diagnostic
    .thumb_func
hardfault_handler:
    // Determine which stack was active: MSP or PSP
    TST   lr, #4               // EXC_RETURN[2]: 0=MSP, 1=PSP
    ITE   EQ
    MRSEQ r0, MSP              // r0 = pointer to exception frame
    MRSNE r0, PSP
    // Exception frame (hardware push): r0,r1,r2,r3,r12,lr,pc,xpsr
    LDR   r1, [r0, #24]        // Stacked PC (faulting instruction)
    LDR   r2, =0xE000ED28      // CFSR (Configurable Fault Status Register)
    LDR   r3, [r2]             // Read CFSR: MemManage/BusFault/UsageFault flags
    B     .                    // Halt (replace with UART debug in production)
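
Once the handler has a pointer to the stacked frame and a copy of CFSR, the rest is plain data decoding, which is often handed to C. The following sketch assumes the basic 8-word frame (no floating-point context) and uses made-up names; it shows why the stacked PC sits at offset 24 and how CFSR splits into its three status fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Basic exception frame, in the order the hardware pushes it
 * (lowest address first). Assumes no FPU context was stacked. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exception_frame_t;

/* CFSR packs three fault status fields into one 32-bit word. */
static uint8_t cfsr_memmanage(uint32_t cfsr)    /* MMFSR: bits [7:0] */
{
    return (uint8_t)(cfsr & 0xFFu);
}

static uint8_t cfsr_busfault(uint32_t cfsr)     /* BFSR: bits [15:8] */
{
    return (uint8_t)((cfsr >> 8) & 0xFFu);
}

static uint16_t cfsr_usagefault(uint32_t cfsr)  /* UFSR: bits [31:16] */
{
    return (uint16_t)(cfsr >> 16);
}
```

A CFSR value of 0x00020000, for example, decodes to a UsageFault field of 0x0002 with the other two fields clear.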

NVIC — Nested Vectored Interrupt Controller

Priority Configuration

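The NVIC on Cortex-M3/M4/M7 supports up to 240 external interrupts (IRQs), each with a configurable priority. The number of implemented priority bits varies by implementation (3 to 8 on ARMv7-M parts; STM32F4 implements 4), and the implemented bits occupy the most significant end of each priority byte. A higher numerical value means a lower priority. Cortex-M3/M4/M7 support preemption grouping via AIRCR.PRIGROUP, which splits the priority byte into group priority (preemption) and sub-priority (tie-breaking) bits. A handler with a lower group priority number takes precedence and can preempt running handlers; sub-priority only decides which of several pending interrupts in the same group runs first.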
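
The PRIGROUP split is just a binary point inside the 8-bit priority byte: for a PRIGROUP value p, bits [7:p+1] form the group priority and bits [p:0] the sub-priority. A minimal host-side C sketch, with function names invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Group (preemption) priority: bits [7:prigroup+1] of the priority byte. */
static uint8_t group_priority(uint8_t prio, unsigned prigroup)
{
    assert(prigroup <= 7u);
    return (uint8_t)((unsigned)prio >> (prigroup + 1u));
}

/* Sub-priority (tie-break only): bits [prigroup:0] of the priority byte. */
static uint8_t sub_priority(uint8_t prio, unsigned prigroup)
{
    assert(prigroup <= 7u);
    return (uint8_t)((unsigned)prio & ((1u << (prigroup + 1u)) - 1u));
}
```

On a part that implements four priority bits, PRIGROUP = 3 puts the binary point below bit 4, so all four implemented bits act as group priority and a byte value of 0x40 is group priority 4 with sub-priority 0.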

// NVIC register base: 0xE000E100
// Enable IRQ #38 (USART2 on STM32F4)
// NVIC_ISER[1] = 1 << (38 - 32) = 1 << 6 = 0x40
LDR  r0, =0xE000E104       // NVIC_ISER1
MOV  r1, #(1 << 6)
STR  r1, [r0]              // Enable USART2 interrupt

// Set priority of IRQ 38 to 0x40 (moderate priority)
LDR  r0, =0xE000E424       // NVIC_IPR9, word-aligned (IRQ 38 is byte 2 of IPR9)
LDR  r1, [r0]
BIC  r1, r1, #(0xFF << 16) // Clear byte 2
ORR  r1, r1, #(0x40 << 16) // Set priority 0x40
STR  r1, [r0]

// Disable IRQ 38
LDR  r0, =0xE000E184       // NVIC_ICER1
MOV  r1, #(1 << 6)
STR  r1, [r0]
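
The register addresses and bit positions used for enabling, disabling, and prioritising an interrupt all derive from the IRQ number. The C sketch below reproduces that arithmetic on the host; the macro and function names are assumptions made for this example (CMSIS exposes the same registers through its own NVIC structure).

```c
#include <stdint.h>

#define NVIC_ISER_BASE 0xE000E100u  /* set-enable registers, 32 IRQs per word   */
#define NVIC_ICER_BASE 0xE000E180u  /* clear-enable registers, 32 IRQs per word */
#define NVIC_IPR_BASE  0xE000E400u  /* priority registers, one byte per IRQ     */

static uint32_t nvic_iser_addr(unsigned irqn)
{
    return NVIC_ISER_BASE + 4u * (irqn / 32u);
}

static uint32_t nvic_icer_addr(unsigned irqn)
{
    return NVIC_ICER_BASE + 4u * (irqn / 32u);
}

/* Bit to write within the selected ISER/ICER word. */
static uint32_t nvic_bit_mask(unsigned irqn)
{
    return 1u << (irqn % 32u);
}

/* Byte address of the priority field for irqn (suitable for STRB). */
static uint32_t nvic_ipr_byte_addr(unsigned irqn)
{
    return NVIC_IPR_BASE + irqn;
}
```

For IRQ 37 these give ISER1 at 0xE000E104 with mask 0x20 and a priority byte at 0xE000E425; for IRQ 38 the mask is 0x40 and the byte sits at 0xE000E426. The byte address is only valid for byte-wide stores; a word-wide read-modify-write must use the word-aligned address (the byte address with its low two bits cleared) and shift the value into the right byte lane.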

Writing ISRs in Assembly

// UART RX ISR: read received byte, store to ring buffer
// Cortex-M hardware auto-pushes: r0,r1,r2,r3,r12,lr,pc,xpsr
// EXC_RETURN in LR tells hardware what to restore on exit
    .equ   USART2_BASE, 0x40004400   // STM32F4 USART2 base address
    .thumb_func
    .global uart_irq_handler
uart_irq_handler:
    PUSH  {r4, lr}             // Save LR (EXC_RETURN); r4 keeps the stack 8-byte aligned

    LDR   r0, =USART2_BASE
    LDR   r1, [r0, #0x00]      // USART_SR: check RXNE flag
    TST   r1, #(1 << 5)        // RXNE (bit 5)?
    BEQ   uart_done

    LDR   r2, [r0, #0x04]      // USART_DR: read received byte (clears RXNE)
    UXTB  r0, r2                // First argument (r0) = received byte

    // Store in ring buffer (C function that takes the byte in r0)
    BL    ring_buffer_put

uart_done:
    POP   {r4, pc}              // Pop EXC_RETURN into PC: return from interrupt
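
The ISR hands the byte to a C function named ring_buffer_put, which the article does not show. One possible shape for it is sketched below, assuming a single producer (the ISR) and a single consumer (the main loop); the buffer size, the structure layout, and the companion ring_buffer_get are all assumptions made for this example.

```c
#include <stdint.h>

#define RB_SIZE 16u  /* hypothetical capacity; a power of two so indices wrap with a mask */

typedef struct {
    volatile uint32_t head;  /* next slot to write; advanced only by the ISR       */
    volatile uint32_t tail;  /* next slot to read; advanced only by the main loop  */
    uint8_t data[RB_SIZE];
} ring_buffer_t;

static ring_buffer_t rx_buf;

/* Called from the ISR. Returns 1 on success, 0 if the buffer is full
 * (the byte is dropped). One slot stays empty to tell full from empty. */
int ring_buffer_put(uint8_t byte)
{
    uint32_t next = (rx_buf.head + 1u) & (RB_SIZE - 1u);

    if (next == rx_buf.tail) {
        return 0;
    }
    rx_buf.data[rx_buf.head] = byte;
    rx_buf.head = next;
    return 1;
}

/* Called from the main loop. Returns the oldest byte, or -1 if empty. */
int ring_buffer_get(void)
{
    int byte;

    if (rx_buf.tail == rx_buf.head) {
        return -1;
    }
    byte = rx_buf.data[rx_buf.tail];
    rx_buf.tail = (rx_buf.tail + 1u) & (RB_SIZE - 1u);
    return byte;
}
```

Because each index has exactly one writer, this layout needs no interrupt masking on a single-core Cortex-M as long as only one ISR produces and only the main loop consumes.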

SysTick Timer

// SysTick registers (addresses 0xE000E010-0xE000E01F)
// SYST_CSR  = 0xE000E010: Control and Status
// SYST_RVR  = 0xE000E014: Reload Value
// SYST_CVR  = 0xE000E018: Current Value
// SYST_CALIB= 0xE000E01C: Calibration

// Configure SysTick for 1 ms tick at 72 MHz core clock
// Reload value = (72,000,000 / 1000) - 1 = 71999
LDR  r0, =0xE000E010           // SYST_CSR
LDR  r1, =0xE000E014           // SYST_RVR
LDR  r2, =0xE000E018           // SYST_CVR

LDR  r3, =71999                // 71999 (0x1193F) does not fit a MOV immediate
STR  r3, [r1]                  // Set reload value
MOV  r3, #0
STR  r3, [r2]                  // Clear current value (any write resets it)

// CSR: ENABLE=1, TICKINT=1 (generate interrupt), CLKSOURCE=1 (core clock)
MOV  r3, #0x7
STR  r3, [r0]                  // Start SysTick, enable interrupt

// SysTick ISR (fires every 1 ms)
    .thumb_func
    .global systick_handler
systick_handler:
    PUSH {r3, lr}              // r3 is padding: keeps the stack 8-byte aligned for the call
    BL   os_tick               // RTOS tick increment
    POP  {r3, pc}
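
The reload arithmetic generalises to any clock and tick rate, with one constraint: the SysTick reload register is only 24 bits wide. A host-side C sketch of the calculation, using an invented helper name and returning 0 when the requested period cannot be represented:

```c
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* reload register is 24 bits wide */

/* Reload value for a tick of tick_hz at a core clock of core_hz.
 * Returns 0 if the period is too short or too long for the counter. */
static uint32_t systick_reload(uint32_t core_hz, uint32_t tick_hz)
{
    uint32_t cycles;

    if (tick_hz == 0u) {
        return 0u;
    }
    cycles = core_hz / tick_hz;          /* core clock cycles per tick */
    if (cycles < 2u || cycles - 1u > SYSTICK_MAX_RELOAD) {
        return 0u;
    }
    return cycles - 1u;                  /* counter counts reload..0 inclusive */
}
```

At 72 MHz a 1 kHz tick gives 71999, while a 1 Hz tick would need 71,999,999, which exceeds the 24-bit limit and is rejected.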

Peripheral Register Access (GPIO & MMIO)

Every Cortex-M peripheral is memory-mapped. GPIO ports, UART controllers, ADC converters, and DMA engines are accessed by reading and writing 32-bit registers at fixed addresses. The vendor datasheet gives register offsets from each peripheral's base address. The key discipline is: read-modify-write (load register, mask bits, OR in new value, store back) to avoid clobbering other fields in the same register.

GPIO Configuration — Blinking an LED

// STM32F4 — Toggle PA5 (LED on Nucleo-F446RE)
// 1. Enable GPIOA clock (RCC AHB1ENR, bit 0)
    LDR   r0, =0x40023830      // RCC_AHB1ENR
    LDR   r1, [r0]
    ORR   r1, r1, #(1 << 0)    // GPIOAEN = 1
    STR   r1, [r0]

// 2. Set PA5 as output (GPIOA_MODER bits [11:10] = 01)
    LDR   r0, =0x40020000      // GPIOA_MODER
    LDR   r1, [r0]
    BIC   r1, r1, #(3 << 10)   // Clear bits [11:10]
    ORR   r1, r1, #(1 << 10)   // Set bit 10 → output mode
    STR   r1, [r0]

// 3. Toggle LED via BSRR (atomic set/reset — no read-modify-write needed)
    LDR   r0, =0x40020018      // GPIOA_BSRR
    MOV   r1, #(1 << 5)        // Set PA5 high (bits [15:0] = set)
    STR   r1, [r0]
    // ... delay ...
    MOV   r1, #(1 << 21)       // Reset PA5 low (bits [31:16] = reset)
    STR   r1, [r0]
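
The constants in the listing are specific to pin 5, but the same bit arithmetic works for any pin on an STM32F4-style GPIO port. The C sketch below computes the values on the host rather than touching hardware; the helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* New MODER value with the 2-bit field for `pin` set to 01 (general-purpose
 * output), leaving every other pin's mode untouched. */
static uint32_t moder_set_output(uint32_t moder, unsigned pin)
{
    assert(pin < 16u);
    moder &= ~(3u << (pin * 2u));   /* clear the pin's 2-bit mode field */
    moder |=  (1u << (pin * 2u));   /* 01 = output                      */
    return moder;
}

/* BSRR word that drives `pin` high: set bits live in [15:0]. */
static uint32_t bsrr_set(unsigned pin)
{
    assert(pin < 16u);
    return 1u << pin;
}

/* BSRR word that drives `pin` low: reset bits live in [31:16]. */
static uint32_t bsrr_reset(unsigned pin)
{
    assert(pin < 16u);
    return 1u << (pin + 16u);
}
```

For pin 5 this yields 0x20 to set and 0x200000 to reset, the same values the assembly writes to BSRR.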

Common MMIO Patterns

Critical MMIO Rules: (1) Always use volatile semantics. In assembly this is natural because every LDR/STR is a real memory access, but in C wrappers peripheral pointers must be declared volatile. (2) Some registers are write-only (e.g., BSRR); reading them returns zero or undefined data. (3) Bit-banding (an optional feature of Cortex-M3/M4, not available on Cortex-M7) maps each bit in the first 1 MB of the peripheral region to a word address in the alias region, enabling atomic single-bit set/clear without a software read-modify-write: alias_addr = 0x42000000 + (byte_offset × 32) + (bit × 4).

// Bit-banding example: set bit 5 of GPIOA_ODR atomically
// GPIOA_ODR = 0x40020014, bit 5
// Offset from peripheral base 0x40000000 = 0x00020014
// Alias = 0x42000000 + (0x20014 × 32) + (5 × 4) = 0x42400294
    LDR   r0, =0x42400294   // Bit-band alias for PA5
    MOV   r1, #1
    STR   r1, [r0]          // Atomic set PA5 — no RMW race condition
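
The alias address formula is easy to get wrong by hand, so it is worth wrapping in a helper. This host-side C sketch computes the peripheral bit-band alias for a register address and bit number; the macro and function names are assumptions for this example, and the range check reflects the 1 MB peripheral bit-band region.

```c
#include <assert.h>
#include <stdint.h>

#define PERIPH_BASE          0x40000000u  /* start of the peripheral bit-band region */
#define PERIPH_BITBAND_ALIAS 0x42000000u  /* start of its alias region               */
#define BITBAND_REGION_SIZE  0x00100000u  /* 1 MB of bit-addressable space           */

/* Word address in the alias region that maps to `bit` of the register
 * at reg_addr. Writing 1 or 0 to it sets or clears just that bit. */
static uint32_t bitband_alias(uint32_t reg_addr, unsigned bit)
{
    uint32_t byte_offset;

    assert(reg_addr >= PERIPH_BASE);
    assert(reg_addr - PERIPH_BASE < BITBAND_REGION_SIZE);
    assert(bit < 32u);

    byte_offset = reg_addr - PERIPH_BASE;
    return PERIPH_BITBAND_ALIAS + byte_offset * 32u + bit * 4u;
}
```

For a register at 0x40020014 and bit 5 the helper returns 0x42400294.
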

Case Study: STM32 Smart Thermostat — From Prototype to Product

A startup built an IoT smart thermostat on an STM32F401 (Cortex-M4, 84 MHz, 256 KB Flash, 64 KB SRAM). The firmware was entirely bare-metal assembly + C, with no RTOS, to minimize power consumption for battery operation.

Architecture decisions: The NVIC was configured with 4 priority groups — the ADC conversion-complete ISR (priority 0, highest) sampled the temperature sensor every 100 ms via DMA. The UART TX ISR (priority 2) sent readings to a Bluetooth module. SysTick at 1 kHz drove the PID control loop. A WFE sleep loop in the main thread dropped power to 12 μA between samples. GPIO PA5 drove the relay controlling the HVAC system via BSRR atomic writes.

Key insight: By using bit-banding for all GPIO flag checks, the team eliminated race conditions between the main loop and ISRs without disabling interrupts. The total firmware size was 18 KB — fitting in the smallest STM32 variant for the production run, reducing per-unit BOM cost by $0.40.

Linker Script & Memory Map

Flash / SRAM Section Layout

A typical STM32 linker script defines MEMORY regions for Flash (origin 0x08000000) and SRAM (origin 0x20000000). Sections map as: .vectors and .text → Flash; .data (initialised variables) → LMA in Flash, VMA in SRAM (copied by startup); .bss (zero-initialised) → SRAM only; .stack → top of SRAM. Export symbols (_data_load, _data_start, _data_end, _bss_start, _bss_end, _stack_top) that the startup assembly uses.

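// Minimal linker script excerpt (GNU LD)
// MEMORY {
//   FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 512K
//   RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
// }
// SECTIONS {
//   .vectors : { KEEP(*(.vectors)) } > FLASH   /* KEEP: survive --gc-sections */
//   .text    : { *(.text*) *(.rodata*) } > FLASH
//   .data    : {
//     _data_start = .;
//     *(.data*)
//     . = ALIGN(4);
//     _data_end = .;
//   } > RAM AT> FLASH                          /* LMA in Flash, VMA in RAM */
//   _data_load = LOADADDR(.data);
//   .bss     : {
//     _bss_start = .;
//     *(.bss*) *(COMMON)
//     . = ALIGN(4);
//     _bss_end = .;
//   } > RAM
//   _stack_top = ORIGIN(RAM) + LENGTH(RAM);
// }

Startup Code (crt0)

// Assembly startup entry (.data copy and .bss clear from reset_handler omitted for brevity):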
    .text
    .thumb_func
    .weak  _start
    .global _start
_start:
    LDR   sp, =_stack_top      // Set MSP (redundant if vector table entry is correct)
    BL    SystemInit            // Optional: clock/PLL init
    BL    main
    B     .

Low-Power Modes (WFI / WFE)

// WFI — Wait For Interrupt: halt CPU until any enabled interrupt fires
// Used by RTOS idle task; wakes on any unmasked pending interrupt
idle_loop:
    WFI                        // Freeze pipeline; resume on next interrupt
    B    idle_loop

// WFE: Wait For Event. Halt until the event register is set or an interrupt arrives
// More flexible than WFI: also wakes on SEV (Send Event) from another CPU,
// and returns at once if an event was latched before the WFE executed
sleep_loop:
    SEV                        // Set the local event register to a known state
    WFE                        // Event register is set: returns at once and clears it
    WFE                        // Real sleep: wait for the next event or interrupt
    B    sleep_loop

// Cortex-M4 SCB SCR register: enable SLEEPDEEP for deep sleep (Stop/Standby)
LDR  r0, =0xE000ED10           // SCB_SCR
LDR  r1, [r0]
ORR  r1, r1, #(1 << 2)        // SLEEPDEEP=1
STR  r1, [r0]
WFI                            // Enter deep sleep (STOP mode on STM32)

Key Insight: The Cortex-M hardware auto-push mechanism (saving r0–r3, r12, LR, PC, xPSR on interrupt entry) means ISRs can be written like normal C functions: the hardware acts as the prologue. The EXC_RETURN magic value in LR tells the hardware which stack (MSP/PSP) to restore from and whether to return to Thread or Handler mode. Using PSP for thread stacks and MSP for handler (kernel) stacks is the foundation of FreeRTOS context switching.
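
The EXC_RETURN value is a bit field, and three of its low bits carry the information described above. The host-side C sketch below decodes them; the function names are invented for this example.

```c
#include <stdint.h>

/* Bit 2: stack the frame is restored from (0 = MSP, 1 = PSP). */
static int exc_return_uses_psp(uint32_t exc_return)
{
    return (int)((exc_return >> 2) & 1u);
}

/* Bit 3: mode returned to (0 = Handler mode, 1 = Thread mode). */
static int exc_return_to_thread(uint32_t exc_return)
{
    return (int)((exc_return >> 3) & 1u);
}

/* Bit 4: frame type (1 = basic 8-word frame, 0 = extended frame that
 * also holds floating-point state, on cores with an FPU). */
static int exc_return_basic_frame(uint32_t exc_return)
{
    return (int)((exc_return >> 4) & 1u);
}
```

The three values seen most often without an FPU decode as follows: 0xFFFFFFF1 returns to Handler mode on MSP, 0xFFFFFFF9 returns to Thread mode on MSP, and 0xFFFFFFFD returns to Thread mode on PSP.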

Hands-On Exercises

Exercise 1 GPIO

Exercise: SOS Blink Pattern in Assembly

Write a Cortex-M4 assembly program that blinks an LED on PA5 in the SOS Morse code pattern: three short flashes (200 ms), three long flashes (600 ms), three short flashes, then a 2-second pause. Use a delay loop based on the core clock frequency (assume 72 MHz). Structure the code with subroutines: short_blink, long_blink, and delay_ms (parameter in r0). Configure GPIOA via RCC and MODER registers, then use BSRR for atomic set/reset.

Bonus: Replace the busy-wait delay loop with SysTick interrupts — set a volatile ms_ticks counter and have the main loop poll it for timing accuracy.

Exercise 2 NVIC

Exercise: Multi-Priority NVIC Configuration

Configure the NVIC for three interrupt sources on an STM32F4: (1) EXTI0 (button press on PA0) at priority 0x00 (highest — debounce via counter), (2) TIM2 overflow at priority 0x40 (medium — toggling PB0 LED), and (3) USART2 RXNE at priority 0x80 (lowest — buffering incoming bytes). Write each ISR in Thumb assembly. Demonstrate preemption: while the USART2 ISR is running, trigger EXTI0 and verify it preempts by toggling a different GPIO pin.

Verification: Use a logic analyser or oscilloscope to observe the GPIO toggles and confirm that higher-priority interrupts preempt lower-priority handlers with correct tail-chaining on return.

Exercise 3 Linker Script

Exercise: Custom Linker Script + Startup Assembly

Write a complete GNU LD linker script for an STM32F446RE (512 KB Flash at 0x08000000, 128 KB SRAM at 0x20000000). Define sections: .vectors → Flash (first 0x200 bytes), .text → Flash, .rodata → Flash, .data → SRAM (AT> FLASH), .bss → SRAM. Export all required symbols. Then write a startup assembly file (startup.s) that: (1) defines the vector table with the initial stack pointer, the 15 system exception slots, and 6 IRQ handlers, (2) copies .data from Flash to SRAM, (3) zeros .bss, (4) calls SystemInit and main. Build with arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -T linker.ld -nostdlib startup.s main.c -o firmware.elf (the -mcpu and -mthumb flags make the compiler and assembler target the Cortex-M4 Thumb instruction set).

Bonus: Add a .config section in Flash for non-volatile settings, and a .stack section with a guard word (0xDEADBEEF) to detect stack overflows at runtime.

Embedded Project Planner

ARM Embedded Project Planner

Plan your Cortex-M bare-metal project. Specify MCU, peripherals, interrupts, and firmware architecture. Download as Word, Excel, or PDF.

Conclusion & Next Steps

We built a complete Cortex-M bare-metal environment from the ground up: Thumb-2 instruction encodings (16-bit vs 32-bit, IT blocks for conditional execution), the vector table with reset handler initialising .data and .bss, NVIC interrupt enable/priority configuration with preemption groups, ISR writing with hardware auto-push/pop, SysTick for RTOS tick generation, GPIO and peripheral register access via MMIO and bit-banding, linker script section layout, startup assembly (crt0), and WFI/WFE low-power patterns for battery-powered IoT devices.

Next in the Series

In Part 15: Cortex-A System Programming & Boot, we return to the application-class profile and trace the complete boot sequence from power-on through EL3 firmware (TF-A), EL2 hypervisor (Xen/KVM), EL1 Linux kernel, to user space — writing key assembly stubs along the way.
