Series Context: This is Part 12 of the 20-part CMSIS Mastery Series (Bonus/Advanced section). Part 11 covered interrupt latency and concurrency; here we address the second major source of non-determinism in embedded firmware — dynamic memory management.
Static vs Dynamic Allocation
The embedded world's relationship with dynamic memory allocation is adversarial by necessity. The standard C library's malloc() and free() were designed with general-purpose operating systems in mind: they assume a virtual memory manager, a large address space, and an OS that reclaims all of a process's memory when it exits. None of these assumptions hold on a Cortex-M3 with 64 KB of SRAM and no MMU.
The fundamental problems with malloc() in embedded firmware are: (1) non-deterministic execution time: the allocator may traverse an arbitrarily long free list searching for a suitable block; (2) fragmentation: repeated alloc/free cycles of varying sizes leave the heap Swiss-cheesed, eventually causing allocation failures even when total free memory is sufficient; (3) out-of-memory handling: most embedded projects have no meaningful way to recover from a NULL return from malloc(); (4) thread safety: in a bare-metal newlib build the allocator is not thread-safe unless you supply your own __malloc_lock() and __malloc_unlock() implementations.
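To make point (1) concrete, the following is a minimal sketch of a first-fit search over a singly-linked free list. The `FreeNode` type and both helper names are illustrative assumptions, not code from newlib or any real allocator; the only purpose is to show that the number of nodes visited depends on the current state of the heap rather than on the request alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list node, for illustration only. */
typedef struct FreeNode {
    size_t size;            /* usable bytes in this free block */
    struct FreeNode *next;  /* next free block, or NULL */
} FreeNode;

/*
 * First-fit search: walk the list until a block of at least `want` bytes
 * is found. `steps_out` receives the number of nodes visited.
 * Returns the matching node, or NULL if no block is large enough.
 */
static FreeNode *first_fit(FreeNode *head, size_t want, size_t *steps_out)
{
    size_t steps = 0U;
    FreeNode *node = head;

    assert(steps_out != NULL);
    while (node != NULL) {
        steps++;
        if (node->size >= want) {
            break;
        }
        node = node->next;
    }
    *steps_out = steps;
    return node;
}

/*
 * Build a list of `count` 16-byte fragments followed by one 1024-byte
 * block, then report how many nodes a request for `want` bytes visits.
 */
static size_t steps_for_request(size_t count, size_t want)
{
    static FreeNode nodes[65];
    size_t steps = 0U;

    assert(count < 65U);
    for (size_t i = 0U; i < count; i++) {
        nodes[i].size = 16U;
        nodes[i].next = &nodes[i + 1U];
    }
    nodes[count].size = 1024U;
    nodes[count].next = NULL;

    (void)first_fit(&nodes[0], want, &steps);
    return steps;
}
```

With no fragments in front of the large block, a 512-byte request is satisfied after visiting one node; with 64 fragments in front of it, the same request visits 65. The request is identical in both cases, and only the heap's history differs.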
MISRA-C:2012 Rule 21.3: The memory allocation and deallocation functions of <stdlib.h> shall not be used. MISRA prohibits malloc, calloc, realloc, and free for exactly the reasons above. If your project requires MISRA compliance, static allocation and memory pools are the only permitted strategies.
Memory Allocation Strategy Comparison
| Strategy | Determinism | Fragmentation | Overhead | Typical Use Cases |
| --- | --- | --- | --- | --- |
| Static (compile-time) | Fully deterministic | None | Zero runtime | Safety-critical, MISRA, fixed-topology systems |
| Stack (automatic) | Deterministic (push/pop) | None | 1–2 instructions | Function-local data, temporary buffers |
| Fixed-size pool | O(1), deterministic | None (fixed block size) | Low (free-list pointer) | Message queues, packet buffers, RTOS objects |
| Heap4 (FreeRTOS) | O(n), non-deterministic | Moderate (first-fit) | 8–16 bytes header per block | Startup-only allocations, GUI frameworks |
| Heap5 (FreeRTOS) | O(n), non-deterministic | Moderate | 8–16 bytes + region list | Multi-region SRAM (CCM + main SRAM) |
| Custom slab allocator | O(1) per class size | Low (per-class slabs) | Medium (slab metadata) | Networking stacks, file systems with mixed object sizes |
Static Free-List Memory Pool
A fixed-size memory pool pre-allocates a contiguous array of identically-sized blocks at compile time and manages them with a singly-linked free list. Allocation is O(1) — pop a block from the head. Deallocation is O(1) — push a block back to the head. No fragmentation is possible because all blocks are the same size.
/**
* Static fixed-size memory pool — O(1) alloc/free, zero fragmentation.
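* Block size and pool depth set at compile time.
* Not thread-safe as written: wrap pool_alloc()/pool_free() in a critical
* section if they can be called from more than one task or from an ISR.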
*
* Pattern: overlay a "next" pointer on the first word of each free block
* to build an intrusive singly-linked free list.
*/
#include <stdint.h> /* uint8_t, uint32_t */
#include <stddef.h> /* NULL */
#include <string.h> /* memset */
#define POOL_BLOCK_SIZE 64U /* bytes per block; must be >= sizeof(void*) */
#define POOL_BLOCK_COUNT 32U /* total blocks in pool */
/* Alignment: the void* member gives every block at least pointer (4-byte) alignment */
typedef union {
uint8_t data[POOL_BLOCK_SIZE];
void *next; /* used when block is on the free list */
} PoolBlock_t;
typedef struct {
PoolBlock_t blocks[POOL_BLOCK_COUNT];
PoolBlock_t *free_head;
uint32_t free_count;
} MemPool_t;
/* Initialise pool — called once at startup */
void pool_init(MemPool_t *pool) {
pool->free_head = &pool->blocks[0];
pool->free_count = POOL_BLOCK_COUNT;
/* Link all blocks into the free list */
for (uint32_t i = 0U; i < POOL_BLOCK_COUNT - 1U; i++) {
pool->blocks[i].next = &pool->blocks[i + 1U];
}
pool->blocks[POOL_BLOCK_COUNT - 1U].next = NULL;
}
/**
* @brief Allocate one block — O(1).
* @return Pointer to block, or NULL if pool exhausted.
*/
void *pool_alloc(MemPool_t *pool) {
if (pool->free_head == NULL) {
return NULL; /* Pool exhausted — handle at call site */
}
PoolBlock_t *block = pool->free_head;
pool->free_head = (PoolBlock_t *)block->next;
pool->free_count--;
memset(block->data, 0, POOL_BLOCK_SIZE); /* Zero on alloc — security */
return block->data;
}
/**
* @brief Free one block — O(1).
* @param ptr Must be a pointer previously returned by pool_alloc().
*/
void pool_free(MemPool_t *pool, void *ptr) {
if (ptr == NULL) { return; }
PoolBlock_t *block = (PoolBlock_t *)ptr;
block->next = pool->free_head;
pool->free_head = block;
pool->free_count++;
}
/* Global pool instance — BSS segment, zero-initialised */
static MemPool_t g_packet_pool;
void app_init(void) {
pool_init(&g_packet_pool);
}
void process_packet(void) {
uint8_t *buf = (uint8_t *)pool_alloc(&g_packet_pool);
if (buf == NULL) {
/* Pool exhausted: log error, drop packet, do NOT malloc() */
return;
}
/* Use buffer... */
pool_free(&g_packet_pool, buf);
}
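pool_free() trusts its caller: a pointer from another pool, a pointer into the middle of a block, or a second free of the same block would silently corrupt the free list. One possible hardening, sketched here under the assumption that the pool holds at most 32 blocks, is to keep an in-use bitmap and validate every pointer before relinking it. The `ChkPool` type and `chk_pool_*` names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHK_BLOCK_SIZE  64U
#define CHK_BLOCK_COUNT 8U

typedef union {
    uint8_t data[CHK_BLOCK_SIZE];
    void   *next;               /* valid only while the block is free */
} ChkBlock;

typedef struct {
    ChkBlock  blocks[CHK_BLOCK_COUNT];
    ChkBlock *free_head;
    uint32_t  in_use;           /* bit i is set while blocks[i] is allocated */
} ChkPool;

_Static_assert(CHK_BLOCK_COUNT <= 32U, "in_use bitmap covers at most 32 blocks");

static void chk_pool_init(ChkPool *pool)
{
    assert(pool != NULL);
    for (uint32_t i = 0U; (i + 1U) < CHK_BLOCK_COUNT; i++) {
        pool->blocks[i].next = &pool->blocks[i + 1U];
    }
    pool->blocks[CHK_BLOCK_COUNT - 1U].next = NULL;
    pool->free_head = &pool->blocks[0];
    pool->in_use = 0U;
}

static void *chk_pool_alloc(ChkPool *pool)
{
    ChkBlock *block = pool->free_head;
    if (block == NULL) {
        return NULL;            /* pool exhausted */
    }
    pool->free_head = (ChkBlock *)block->next;
    pool->in_use |= (uint32_t)1U << (uint32_t)(block - pool->blocks);
    return block->data;
}

/* Returns true if the block went back to the pool, false if ptr was rejected. */
static bool chk_pool_free(ChkPool *pool, void *ptr)
{
    uintptr_t addr  = (uintptr_t)ptr;
    uintptr_t first = (uintptr_t)&pool->blocks[0];
    uintptr_t end   = first + sizeof(pool->blocks);

    if ((addr < first) || (addr >= end)) {
        return false;           /* NULL or a pointer from outside this pool */
    }
    if (((addr - first) % sizeof(ChkBlock)) != 0U) {
        return false;           /* points into the middle of a block */
    }

    uint32_t index = (uint32_t)((addr - first) / sizeof(ChkBlock));
    uint32_t mask  = (uint32_t)1U << index;
    if ((pool->in_use & mask) == 0U) {
        return false;           /* block is already free: double free */
    }

    pool->in_use &= ~mask;
    pool->blocks[index].next = pool->free_head;
    pool->free_head = &pool->blocks[index];
    return true;
}
```

The checks cost a few comparisons and one division by a compile-time constant, so both operations remain O(1). A caller would typically treat a `false` return as a programming error and assert on it in debug builds.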
Heap Fragmentation
Heap fragmentation occurs when free memory exists as many small non-contiguous chunks rather than one large contiguous block. A heap containing 8 KB of free memory split into 512 individual 16-byte fragments cannot satisfy a 1 KB allocation even though 8 KB is available. In long-running embedded systems, fragmentation tends to worsen over time unless the allocation pattern is perfectly predictable, which it rarely is.
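The arithmetic in that example can be checked with a small host-side model. This is only a sketch: the heap is represented as an array of allocation flags, one per 16-byte unit, and `HeapSummary`, `summarise_heap`, and `checkerboard_example` are names invented for the illustration rather than part of any allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t total_free;    /* sum of all free bytes */
    size_t largest_free;  /* longest contiguous free run, in bytes */
} HeapSummary;

/* used[i] == true means unit i is allocated; every unit is unit_bytes long. */
static HeapSummary summarise_heap(const bool *used, size_t units, size_t unit_bytes)
{
    HeapSummary s = { 0U, 0U };
    size_t run = 0U;

    assert(used != NULL);
    for (size_t i = 0U; i < units; i++) {
        if (used[i]) {
            run = 0U;                     /* an allocation ends the free run */
        } else {
            run += unit_bytes;
            s.total_free += unit_bytes;
            if (run > s.largest_free) {
                s.largest_free = run;
            }
        }
    }
    return s;
}

/*
 * 1024 units of 16 bytes with every second unit still allocated:
 * 512 free fragments of 16 bytes each.
 */
static HeapSummary checkerboard_example(void)
{
    static bool used[1024];

    for (size_t i = 0U; i < 1024U; i++) {
        used[i] = ((i % 2U) == 0U);
    }
    return summarise_heap(used, 1024U, 16U);
}
```

For the checkerboard layout the model reports 8192 free bytes in total but a largest contiguous run of only 16 bytes, which is why the 1 KB request fails. The second number, not the first, determines whether an allocation can succeed.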
Recognising Fragmentation in Practice
The classic symptom is an allocation failure that appears only after hours of operation: the system runs fine in the lab, then fails in the field. FreeRTOS's xPortGetFreeHeapSize() returns total free bytes but cannot tell you whether a 2 KB contiguous block is available; with heap_4 or heap_5, vPortGetHeapStats() additionally reports the size of the largest free block. Use xPortGetMinimumEverFreeHeapSize() during development to find the low-water mark, and catch allocation failures by implementing vApplicationMallocFailedHook(), which the kernel calls when configUSE_MALLOC_FAILED_HOOK is set to 1.
FreeRTOS Heap4 vs Pure Static Allocation
/**
* FreeRTOS static allocation: eliminate heap entirely for RTOS objects.
* Set configSUPPORT_STATIC_ALLOCATION 1 in FreeRTOSConfig.h.
*
* Benefits:
* - RTOS objects use zero heap space
* - Sizes verified at link time (the linker reports SRAM overflow)
* - No malloc failure path for RTOS internals
*/
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"
#include "semphr.h"
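/* Task entry point, assumed to be defined elsewhere in the application */
extern void sensor_task_fn(void *arg);
/* --- Static task creation --- */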
#define TASK_STACK_WORDS 256U /* Stack size in 32-bit words = 1 KB */
static StaticTask_t g_sensor_task_tcb;
static StackType_t g_sensor_task_stack[TASK_STACK_WORDS];
static TaskHandle_t g_sensor_task_handle;
/* --- Static queue creation --- */
#define QUEUE_DEPTH 16U
#define ITEM_SIZE sizeof(uint32_t)
static StaticQueue_t g_data_queue_struct;
static uint8_t g_data_queue_storage[QUEUE_DEPTH * ITEM_SIZE];
static QueueHandle_t g_data_queue;
/* --- Static mutex creation --- */
static StaticSemaphore_t g_spi_mutex_struct;
static SemaphoreHandle_t g_spi_mutex;
void app_create_rtos_objects(void) {
/* Create task with static memory — returns NULL only on bad params */
g_sensor_task_handle = xTaskCreateStatic(
sensor_task_fn, /* task function */
"SensorTask", /* debug name */
TASK_STACK_WORDS, /* stack depth in words */
NULL, /* task parameter */
3U, /* priority */
g_sensor_task_stack, /* stack buffer */
&g_sensor_task_tcb /* TCB storage */
);
configASSERT(g_sensor_task_handle != NULL);
g_data_queue = xQueueCreateStatic(QUEUE_DEPTH, ITEM_SIZE,
g_data_queue_storage,
&g_data_queue_struct);
configASSERT(g_data_queue != NULL);
g_spi_mutex = xSemaphoreCreateMutexStatic(&g_spi_mutex_struct);
configASSERT(g_spi_mutex != NULL);
}
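/* Required when configSUPPORT_STATIC_ALLOCATION=1. If configUSE_TIMERS is 1,
vApplicationGetTimerTaskMemory() must be provided in the same way. The type of
the third parameter may differ between kernel versions; match the prototype
declared in your task.h. */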
void vApplicationGetIdleTaskMemory(StaticTask_t **ppxIdleTaskTCBBuffer,
StackType_t **ppxIdleTaskStackBuffer,
uint32_t *pulIdleTaskStackSize) {
static StaticTask_t idle_tcb;
static StackType_t idle_stack[configMINIMAL_STACK_SIZE];
*ppxIdleTaskTCBBuffer = &idle_tcb;
*ppxIdleTaskStackBuffer = idle_stack;
*pulIdleTaskStackSize = configMINIMAL_STACK_SIZE;
}
Best Practice: In production RTOS firmware, set configSUPPORT_STATIC_ALLOCATION 1 and configSUPPORT_DYNAMIC_ALLOCATION 0. All RTOS objects (tasks, queues, semaphores, event groups) are created with static memory. This eliminates the heap entirely for RTOS internals and lets the linker verify at build time that your SRAM budget is not exceeded.
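As a configuration sketch, the two settings look like this in FreeRTOSConfig.h. The `#if`/`#error` guard underneath is an optional addition of this example, not something FreeRTOS requires; it can live in any application source file that includes FreeRTOS.h and stops the build if someone later re-enables dynamic allocation.

```c
/* FreeRTOSConfig.h (excerpt): create all kernel objects from static memory */
#define configSUPPORT_STATIC_ALLOCATION   1
#define configSUPPORT_DYNAMIC_ALLOCATION  0

/* Optional build-time guard (an assumption of this sketch, place it in
 * application code after including FreeRTOS.h). */
#if (configSUPPORT_STATIC_ALLOCATION != 1) || (configSUPPORT_DYNAMIC_ALLOCATION != 0)
#error "This firmware expects static-only RTOS object creation"
#endif
```

With dynamic allocation disabled, the dynamic creation functions such as xTaskCreate() are not compiled into the kernel, so an accidental call fails at build time instead of failing at run time in the field.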
RTOS Memory Pools
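CMSIS-RTOS2 provides a standardised memory pool API that each conforming kernel or wrapper implements (Keil RTX5 natively, FreeRTOS and Zephyr through their CMSIS-RTOS2 adaptation layers). The API offers deterministic O(1) allocation and deallocation with bounded latency. It is safe to call from threads, and the CMSIS-RTOS2 specification also permits osMemoryPoolAlloc() with a timeout of 0 and osMemoryPoolFree() to be called from ISRs.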
CMSIS-RTOS2 Memory Pool API
/**
* CMSIS-RTOS2 memory pool: deterministic fixed-block allocation.
* osMemoryPoolNew / osMemoryPoolAlloc / osMemoryPoolFree
*
* The pool is backed by statically-allocated storage when using
* static attributes — no heap involved.
*/
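#include "cmsis_os2.h"
#include <stdint.h>
#include <stddef.h>
/* Application-provided helpers, assumed to be defined elsewhere */
extern float read_adc_voltage(void);
extern void transmit_over_uart(const uint8_t *data, size_t len);
/* Define block type: 28-byte message descriptor (4 + 2 + 2 + 4 + 16) */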
typedef struct {
uint32_t timestamp_ms;
uint16_t sensor_id;
uint16_t flags;
float value;
uint8_t payload[16];
} SensorMsg_t;
#define MSG_POOL_CAPACITY 20U
/* Pool attributes. With cb_mem and mp_mem left NULL, the RTOS allocates the
pool from its own memory; point them at static buffers to avoid the heap */
static osMemoryPoolAttr_t pool_attr = {
.name = "SensorMsgPool",
.attr_bits = 0U,
.cb_mem = NULL, /* let RTOS allocate control block */
.cb_size = 0U,
.mp_mem = NULL, /* let RTOS allocate block storage */
.mp_size = 0U
};
static osMemoryPoolId_t g_msg_pool;
void pool_demo_init(void) {
g_msg_pool = osMemoryPoolNew(MSG_POOL_CAPACITY,
sizeof(SensorMsg_t),
&pool_attr);
if (g_msg_pool == NULL) {
/* Fatal: pool creation failed — check heap config */
while(1) {}
}
}
void sensor_task(void *arg) {
osMessageQueueId_t queue = (osMessageQueueId_t)arg;
for (;;) {
/* Allocate a message block — O(1), bounded latency */
SensorMsg_t *msg = (SensorMsg_t *)osMemoryPoolAlloc(g_msg_pool,
osWaitForever);
if (msg == NULL) { continue; } /* should not happen with osWaitForever */
/* Fill message */
msg->timestamp_ms = osKernelGetTickCount(); /* kernel ticks; equals ms only with a 1 kHz tick */
msg->sensor_id = 1U;
msg->value = read_adc_voltage();
/* Post the pointer, not the struct: the queue's msg_size must be sizeof(SensorMsg_t *) */
osStatus_t status = osMessageQueuePut(queue, &msg, 0U, 0U);
if (status != osOK) {
/* Queue full: return block to pool immediately */
osMemoryPoolFree(g_msg_pool, msg);
}
}
}
void comms_task(void *arg) {
osMessageQueueId_t queue = (osMessageQueueId_t)arg;
SensorMsg_t *msg;
for (;;) {
osStatus_t status = osMessageQueueGet(queue, &msg, NULL, osWaitForever);
if (status == osOK) {
transmit_over_uart((uint8_t *)msg, sizeof(SensorMsg_t));
/* Return block to pool after use — O(1) */
osMemoryPoolFree(g_msg_pool, msg);
}
}
}
Stack Overflow Detection
Stack overflows are among the most common causes of silent corruption in embedded RTOS firmware. A task that overflows its stack writes into whatever lies next to it in memory, often a TCB or another task's stack, corrupting kernel state without any immediate symptom. By the time the bug manifests, the call stack is meaningless. Early detection is essential.
Stack Watermark Checking with 0xDEADBEEF Pattern
/**
* Stack watermark detection: fill stack with sentinel pattern at startup,
* scan from the base upward to find the high-water mark at runtime.
*
* FreeRTOS fills task stacks with its own pattern (0xA5 bytes) when
* configCHECK_FOR_STACK_OVERFLOW is 2 or INCLUDE_uxTaskGetStackHighWaterMark is 1;
* uxTaskGetStackHighWaterMark() then reports the minimum free stack seen so far.
*
* For bare-metal or custom RTOS, implement manually as shown below.
*/
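#include <stdint.h> /* uint32_t */
#include <stddef.h> /* size_t */
#include "FreeRTOS.h" /* configASSERT, used by monitor_task */
#include "cmsis_os2.h" /* osDelay, used by monitor_task */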
#define STACK_SENTINEL 0xDEADBEEFUL
#define TASK_STACK_SIZE 1024U /* bytes */
/* Statically allocated task stack */
static uint32_t g_task_stack[TASK_STACK_SIZE / 4U];
/**
* @brief Fill task stack with sentinel pattern.
* Call before starting the scheduler.
*/
void stack_watermark_init(uint32_t *stack_base, size_t word_count) {
for (size_t i = 0U; i < word_count; i++) {
stack_base[i] = STACK_SENTINEL;
}
}
/**
* @brief Scan stack from base to find high-water mark.
* @return Number of words still containing the sentinel (unused stack words).
* Zero means the stack has completely overflowed.
*/
size_t stack_watermark_check(const uint32_t *stack_base, size_t word_count) {
size_t unused = 0U;
for (size_t i = 0U; i < word_count; i++) {
if (stack_base[i] == STACK_SENTINEL) {
unused++;
} else {
break; /* First modified word = deepest point the stack has reached */
}
}
return unused;
}
void monitor_task(void *arg) {
for (;;) {
size_t unused = stack_watermark_check(g_task_stack,
sizeof(g_task_stack) / 4U);
size_t used = (sizeof(g_task_stack) / 4U) - unused;
(void)used; /* only referenced by the logging call sketched below */
/* Log or assert: less than 64 words (256 bytes) remaining is danger */
if (unused < 64U) {
/* Log: "STACK LOW: task used %u/%u words", used, total */
configASSERT(0); /* Halt in debug builds */
}
osDelay(1000U); /* Check every second */
}
}
MPU-Based Memory Protection
The Memory Protection Unit (MPU) on Cortex-M3/M4/M7/M33 lets you configure a small number of memory regions (typically 8 or 16, depending on the implementation) with individual access permissions. The most powerful use case for embedded firmware is the stack guard: a small no-access region at the low end of each task stack. When the stack overflows, the CPU attempts to write into the guard region, which immediately triggers a MemManage fault with precise fault address information. The overflow is caught at the point where it happens, not discovered later through corrupted data. Because regions are scarce, an RTOS that guards every task typically reprograms the guard region on each context switch.
MPU Stack Guard Page Configuration
/**
* MPU stack guard: configure a 32-byte no-access region (the smallest size the
* ARMv7-M MPU supports) over the lowest 32 bytes of the task stack buffer.
* The stack grows downward, so an overflow hits the guard and faults immediately.
*
* Cortex-M4/M7: uses ARMv7-M MPU (typically 8 regions, base+size format).
* Cortex-M33: uses ARMv8-M MPU (up to 16 regions, base+limit format).
*
* This example targets ARMv7-M (STM32F4).
*/
#include "core_cm4.h" /* normally pulled in via the device header (e.g. stm32f4xx.h), which defines __MPU_PRESENT */
/* Task stack: 4 KB aligned (MPU regions must be power-of-2 aligned) */
__attribute__((aligned(4096)))
static uint8_t g_task_stack_buf[4096];
/**
* @brief Configure MPU region 7 as a no-access guard below the stack.
*
* ARMv7-M region encoding:
* RASR AP[26:24] = 0b000 → no access (any access triggers MemFault)
* RASR SIZE[5:1] = 0b00100 → 32 bytes region
* RASR ENABLE[0] = 1
*/
void mpu_configure_stack_guard(void) {
/* Disable MPU before configuration */
MPU->CTRL = 0U;
/* Region 7: 32-byte no-access guard at stack base */
MPU->RNR = 7U;
MPU->RBAR = (uint32_t)g_task_stack_buf | MPU_RBAR_VALID_Msk | 7U;
MPU->RASR = (0x00UL << MPU_RASR_AP_Pos) | /* No access */
(0x04UL << MPU_RASR_SIZE_Pos) | /* 32 bytes */
MPU_RASR_ENABLE_Msk;
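/* Enable the MemManage exception; without this the fault escalates to HardFault */
SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;
/* Enable MPU with the default memory map as background for privileged accesses.
HFNMIENA is left clear, so the MPU is bypassed inside HardFault and NMI handlers */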
MPU->CTRL = MPU_CTRL_ENABLE_Msk
| MPU_CTRL_PRIVDEFENA_Msk;
__DSB();
__ISB();
}
/**
* @brief MemFault handler: examine SCB->MMFAR for fault address.
* A MMFAR pointing to the guard region confirms stack overflow.
*/
void MemManage_Handler(void) {
uint32_t fault_addr = 0U;
if (SCB->CFSR & SCB_CFSR_MMARVALID_Msk) {
fault_addr = SCB->MMFAR;
}
/* Determine if fault address is in stack guard region */
if (fault_addr >= (uint32_t)g_task_stack_buf &&
fault_addr < (uint32_t)g_task_stack_buf + 32U) {
/* Confirmed stack overflow — log and halt */
/* In production: trigger watchdog reset, preserve fault info in RTC backup */
}
while (1) { __NOP(); }
}
MPU Region Alignment Requirement: ARMv7-M MPU regions must be naturally aligned — a 32-byte region must start at a 32-byte-aligned address, a 4 KB region at a 4 KB-aligned address. Use __attribute__((aligned(N))) on your stack buffers. Misaligned regions silently cover the wrong address range.
Common Embedded Memory Bugs
| Bug | Symptom | Root Cause | Detection Method |
| --- | --- | --- | --- |
| Stack overflow | Random resets, corrupted locals, wrong return addresses | Deep recursion, large local arrays, ISR nesting | MPU guard page, FreeRTOS stack checking, watermark scan |
| Heap fragmentation | NULL from malloc after hours of operation | Mixed alloc/free sizes, long-lived allocations | vApplicationMallocFailedHook, heap visualisation |
| Double-free | Heap corruption, hard fault, wrong data | Shared ownership without reference counting | Heap4 debug build, Valgrind-style wrappers |
| Dangling pointer | Intermittent wrong values, silent data corruption | Freeing memory while another component holds a pointer | MPU read-guard on freed region, static analysis |
| Null pointer dereference | HardFault on access to 0x00000000 or other low addresses; on parts with flash mapped at address 0, reads may succeed silently | Failed malloc/pool_alloc return value not checked | MPU no-access region at 0x00000000 (null guard) |
Exercises
Exercise 1
Beginner
Instrument Firmware to Track Peak Heap Usage
Wrap FreeRTOS's pvPortMalloc() and vPortFree() (or newlib's malloc() and free(), for example with the linker's --wrap option) to track: (a) total bytes currently allocated, (b) peak bytes ever allocated simultaneously, (c) total number of allocation failures (NULL returns). Log these values over a 24-hour soak test. Produce a time-series chart of heap utilisation. Identify the top three allocation sites by frequency using __builtin_return_address(0).
Heap Instrumentation
Memory Profiling
FreeRTOS
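As a possible starting point for Exercise 1, the sketch below tracks current bytes, peak bytes, and failure count by storing each block's size in a small header in front of the user data. It wraps the host's malloc()/free() so it can be tried on a PC; on target you would substitute pvPortMalloc()/vPortFree() and protect the counters with a critical section. `HeapStats`, `tracked_malloc`, and `tracked_free` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t   current_bytes;  /* bytes allocated right now */
    size_t   peak_bytes;     /* highest value current_bytes has reached */
    uint32_t failures;       /* number of NULL returns */
} HeapStats;

static HeapStats g_heap_stats;

/* Header placed in front of every block so the free path knows its size.
 * The max_align_t member keeps the user pointer suitably aligned. */
typedef union {
    size_t      size;
    max_align_t align;
} TrackHeader;

static void *tracked_malloc(size_t size)
{
    TrackHeader *hdr = NULL;

    if (size <= (SIZE_MAX - sizeof(TrackHeader))) {   /* avoid size overflow */
        hdr = malloc(sizeof(TrackHeader) + size);
    }
    if (hdr == NULL) {
        g_heap_stats.failures++;
        return NULL;
    }

    hdr->size = size;
    g_heap_stats.current_bytes += size;
    if (g_heap_stats.current_bytes > g_heap_stats.peak_bytes) {
        g_heap_stats.peak_bytes = g_heap_stats.current_bytes;
    }
    return hdr + 1;                                   /* user data follows header */
}

static void tracked_free(void *ptr)
{
    if (ptr == NULL) {
        return;
    }
    TrackHeader *hdr = (TrackHeader *)ptr - 1;
    assert(g_heap_stats.current_bytes >= hdr->size);
    g_heap_stats.current_bytes -= hdr->size;
    free(hdr);
}
```

The counters report requested bytes, not the allocator's internal overhead, so the real heap footprint is somewhat larger. Recording the caller's address for part of the exercise would mean adding a field to the header, at the cost of more RAM per block.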
Exercise 2
Intermediate
Replace Dynamic Allocations with Memory Pool Allocations
Take an RTOS task in your codebase that currently uses pvPortMalloc() to allocate message buffers dynamically. Replace all dynamic allocations with a CMSIS-RTOS2 osMemoryPoolAlloc() backed by a statically-allocated pool. Measure and document: (a) reduction in worst-case allocation time (cycles), (b) elimination of fragmentation risk, (c) new failure mode (pool exhaustion vs NULL malloc) and how you handle it. Verify the pool capacity is sufficient using osMemoryPoolGetSpace().
CMSIS-RTOS2
Memory Pool
Deterministic Allocation
Exercise 3
Advanced
Configure MPU Stack Guard and Trigger a Controlled MemFault
Configure an MPU no-access guard region (32 bytes) at the low end of a test task stack. Write a test function that intentionally overflows the stack, for example by recursing with a local array that is written on every call, until the guard region is hit. Verify: (a) the MemFault handler fires with the correct MMFAR address pointing into the guard region, (b) SCB->CFSR shows MMARVALID and DACCVIOL bits set, (c) the system does not silently corrupt adjacent data before the fault fires. Document the full fault register dump.
MPU
MemFault
Stack Guard
Conclusion & Next Steps
In this article we have built a complete embedded memory management toolkit:
- Static allocation is the foundation — zero runtime overhead, verifiable at link time, MISRA-compliant. Use it wherever objects have fixed lifetime and known size.
- Fixed-size memory pools give you O(1) deterministic allocation for variable-lifetime objects, making them the appropriate replacement for malloc() in real-time firmware.
- CMSIS-RTOS2 osMemoryPool provides a standardised pool API that works across FreeRTOS, RTX5, and Zephyr: write once, run on any CMSIS-RTOS2 implementation.
- FreeRTOS static allocation (xTaskCreateStatic, xQueueCreateStatic) eliminates the heap entirely for RTOS objects and is strongly recommended for production firmware.
- Stack watermark scanning with a sentinel pattern gives you high-water mark data during testing; MPU stack guard regions give you immediate, hardware-enforced overflow detection in production.
- The common memory bug table — stack overflow, fragmentation, double-free, dangling pointer, null dereference — gives you a diagnostic checklist for when memory-related faults appear.
Next in the Series
In Part 13: Low Power & Energy Optimization, we shift focus from correctness to efficiency. We'll cover the full ARM Cortex-M low-power toolkit: WFI/WFE sleep modes, STOP and Standby states, clock gating for unused peripherals, FreeRTOS tickless idle with LPTIM, power-domain management, and how to profile your firmware's average current with a hardware current probe to meet battery-life targets.