CMSIS Part 12: Memory Management in Embedded Systems

                        
Series Context: This is Part 12 of the 20-part CMSIS Mastery Series (Bonus/Advanced section). Part 11 covered interrupt latency and concurrency; here we address the second major source of non-determinism in embedded firmware: dynamic memory management.
                    

Static vs Dynamic Allocation

The embedded world's relationship with dynamic memory allocation is adversarial by necessity. The standard C library's malloc() and free() were designed with general-purpose operating systems in mind: they can rely on virtual memory, a large address space, and an operating system that reclaims everything a process allocated when it exits or crashes. None of these assumptions hold on a Cortex-M3 with 64 KB of SRAM and no MMU.

The fundamental problems with malloc() in embedded firmware are: (1) non-deterministic execution time — the allocator may traverse an arbitrarily long free-list searching for a suitable block; (2) fragmentation — repeated alloc/free cycles of varying sizes leave the heap Swiss-cheesed, eventually causing allocation failures even when total free memory is sufficient; (3) out-of-memory handling — most embedded projects have no meaningful way to recover from a NULL return from malloc(); (4) thread-safety — the newlib allocator is not thread-safe by default on Cortex-M without a custom __malloc_lock() implementation.

                        
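MISRA C:2012 Rule 21.3: The memory allocation and deallocation functions of <stdlib.h> shall not be used. The rule prohibits malloc, calloc, realloc, and free for exactly the reasons above, and Directive 4.12 states more broadly that dynamic memory allocation shall not be used. In MISRA projects, static allocation is therefore the default strategy; a fixed-block pool over statically reserved storage is a common compromise, but its use should be justified and documented in your compliance records.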
                    

Memory Allocation Strategy Comparison

Strategy	Determinism	Fragmentation	Overhead	Typical Use Cases
Static (compile-time)	Fully deterministic	None	Zero runtime	Safety-critical, MISRA, fixed-topology systems
Stack (automatic)	Deterministic (push/pop)	None	1–2 instructions	Function-local data, temporary buffers
Fixed-size pool	O(1), deterministic	No external fragmentation (internal waste if objects are smaller than the block)	Low (free-list pointer)	Message queues, packet buffers, RTOS objects
Heap4 (FreeRTOS)	O(n) — non-deterministic	Moderate (first-fit)	8–16 bytes header per block	Startup-only allocations, GUI frameworks
Heap5 (FreeRTOS)	O(n) — non-deterministic	Moderate	8–16 bytes + region list	Multi-region SRAM (CCM + main SRAM)
Custom slab allocator	O(1) per class size	Low (per-class slabs)	Medium (slab metadata)	Networking stacks, file systems with mixed object sizes

Static Free-List Memory Pool

A fixed-size memory pool pre-allocates a contiguous array of identically-sized blocks at compile time and manages them with a singly-linked free list. Allocation is O(1): pop a block from the head. Deallocation is O(1): push the block back onto the head. External fragmentation is impossible because every block is the same size; the only cost is internal waste when an object is smaller than the block.

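/**
 * Static fixed-size memory pool: O(1) alloc/free, no external fragmentation.
 * Block size and pool depth are set at compile time.
 *
 * Pattern: overlay a "next" pointer on the first word of each free block
 * to build an intrusive singly-linked free list.
 *
 * Not reentrant: if several tasks or ISRs share a pool, wrap pool_alloc()
 * and pool_free() in a critical section (e.g. briefly mask interrupts).
 */
#include <stdint.h>   /* uint8_t, uint32_t */
#include <stddef.h>   /* NULL, size_t      */
#include <string.h>   /* memset            */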

#define POOL_BLOCK_SIZE   64U    /* bytes per block — must be >= sizeof(void*) */
#define POOL_BLOCK_COUNT  32U    /* total blocks in pool                        */

/* Alignment: the void* member gives every block pointer alignment (4 bytes on Cortex-M) */
typedef union {
    uint8_t  data[POOL_BLOCK_SIZE];
    void    *next;   /* used when block is on the free list */
} PoolBlock_t;

typedef struct {
    PoolBlock_t  blocks[POOL_BLOCK_COUNT];
    PoolBlock_t *free_head;
    uint32_t     free_count;
} MemPool_t;

/* Initialise pool — called once at startup */
void pool_init(MemPool_t *pool) {
    pool->free_head = &pool->blocks[0];
    pool->free_count = POOL_BLOCK_COUNT;

    /* Link all blocks into the free list */
    for (uint32_t i = 0U; i < POOL_BLOCK_COUNT - 1U; i++) {
        pool->blocks[i].next = &pool->blocks[i + 1U];
    }
    pool->blocks[POOL_BLOCK_COUNT - 1U].next = NULL;
}

/**
 * @brief Allocate one block — O(1).
 * @return Pointer to block, or NULL if pool exhausted.
 */
void *pool_alloc(MemPool_t *pool) {
    if (pool->free_head == NULL) {
        return NULL;   /* Pool exhausted — handle at call site */
    }
    PoolBlock_t *block = pool->free_head;
    pool->free_head    = (PoolBlock_t *)block->next;
    pool->free_count--;
    memset(block->data, 0, POOL_BLOCK_SIZE);  /* Zero on alloc — security */
    return block->data;
}

/**
 * @brief Free one block — O(1).
 * @param ptr Must be a pointer previously returned by pool_alloc().
 */
void pool_free(MemPool_t *pool, void *ptr) {
    if (ptr == NULL) { return; }
    PoolBlock_t *block = (PoolBlock_t *)ptr;
    block->next     = pool->free_head;
    pool->free_head = block;
    pool->free_count++;
}

/* Global pool instance — BSS segment, zero-initialised */
static MemPool_t g_packet_pool;

void app_init(void) {
    pool_init(&g_packet_pool);
}

void process_packet(void) {
    uint8_t *buf = (uint8_t *)pool_alloc(&g_packet_pool);
    if (buf == NULL) {
        /* Pool exhausted: log error, drop packet, do NOT malloc() */
        return;
    }
    /* Use buffer... */
    pool_free(&g_packet_pool, buf);
}
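The pool_free() above trusts its caller: a foreign pointer, a pointer into the middle of a block, or the same block freed twice silently corrupts the free list. Below is a minimal sketch of a defensive variant, assuming one extra ownership flag per block is affordable. CheckedPool_t, cp_init(), cp_alloc() and cp_free() are hypothetical names, and like the original pool it is not safe to call concurrently from several tasks or ISRs without a critical section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes for this sketch (mirroring POOL_BLOCK_SIZE/COUNT) */
#define CP_BLOCK_SIZE   64U
#define CP_BLOCK_COUNT  8U

static_assert(CP_BLOCK_SIZE >= sizeof(void *), "block must hold a free-list pointer");

typedef union CpBlock {
    uint8_t         data[CP_BLOCK_SIZE];
    union CpBlock  *next;              /* meaningful only while the block is free */
} CpBlock_t;

typedef struct {
    CpBlock_t  blocks[CP_BLOCK_COUNT];
    CpBlock_t *free_head;
    bool       in_use[CP_BLOCK_COUNT]; /* one ownership flag per block */
} CheckedPool_t;

void cp_init(CheckedPool_t *p) {
    for (uint32_t i = 0U; i < CP_BLOCK_COUNT; i++) {
        p->blocks[i].next = (i + 1U < CP_BLOCK_COUNT) ? &p->blocks[i + 1U] : NULL;
        p->in_use[i] = false;
    }
    p->free_head = &p->blocks[0];
}

void *cp_alloc(CheckedPool_t *p) {
    CpBlock_t *b = p->free_head;
    if (b == NULL) { return NULL; }    /* pool exhausted */
    p->free_head = b->next;
    p->in_use[b - p->blocks] = true;   /* index = offset from array start */
    return b->data;
}

/* Returns false, leaving the pool untouched, for a pointer that is outside
   the pool, not on a block boundary, or already free (double free). */
bool cp_free(CheckedPool_t *p, void *ptr) {
    uintptr_t base = (uintptr_t)p->blocks;
    uintptr_t addr = (uintptr_t)ptr;
    if (addr < base || addr >= base + sizeof(p->blocks)) { return false; }
    if ((addr - base) % sizeof(CpBlock_t) != 0U) { return false; }
    size_t idx = (addr - base) / sizeof(CpBlock_t);
    if (!p->in_use[idx]) { return false; }
    p->in_use[idx] = false;
    p->blocks[idx].next = p->free_head;
    p->free_head = &p->blocks[idx];
    return true;
}
```

Returning false instead of asserting lets the caller decide whether a rejected free should halt a debug build or just be logged in the field.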

Heap Fragmentation

Heap fragmentation occurs when free memory exists as many small non-contiguous chunks rather than one large contiguous block. A heap containing 8 KB of free memory split into 512 individual 16-byte fragments cannot satisfy a 1 KB allocation even though 8 KB is available. In long-running embedded systems, fragmentation tends to worsen over time unless the allocation pattern is highly predictable, which it rarely is. Coalescing allocators such as FreeRTOS heap_4 merge adjacent free blocks, but they cannot merge free blocks that are separated by a live allocation.
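The same effect can be reproduced at toy scale. The sketch below uses a hypothetical 16-unit heap with a simple first-fit allocator (it is not FreeRTOS heap_4): fill it with eight 2-unit blocks, free every other one, and half the heap is free, yet no contiguous run is longer than 2 units. Coalescing would not help here, because no two freed blocks are adjacent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_UNITS 16U              /* toy heap of 16 equal allocation units */

static bool g_used[HEAP_UNITS];     /* true = unit currently allocated */

/* First-fit: claim the first run of n free units; return its start or -1. */
int toy_alloc(size_t n) {
    if (n == 0U || n > HEAP_UNITS) { return -1; }
    size_t run = 0U;
    for (size_t i = 0U; i < HEAP_UNITS; i++) {
        run = g_used[i] ? 0U : run + 1U;
        if (run == n) {
            size_t start = i + 1U - n;
            for (size_t j = start; j <= i; j++) { g_used[j] = true; }
            return (int)start;
        }
    }
    return -1;                      /* no contiguous run is long enough */
}

void toy_free(int start, size_t n) {
    for (size_t j = 0U; j < n; j++) { g_used[(size_t)start + j] = false; }
}

/* Analogous to a "total free bytes" query */
size_t toy_total_free(void) {
    size_t total = 0U;
    for (size_t i = 0U; i < HEAP_UNITS; i++) { if (!g_used[i]) { total++; } }
    return total;
}

/* What an allocation request actually needs: the longest free run */
size_t toy_largest_free(void) {
    size_t run = 0U, best = 0U;
    for (size_t i = 0U; i < HEAP_UNITS; i++) {
        run = g_used[i] ? 0U : run + 1U;
        if (run > best) { best = run; }
    }
    return best;
}
```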

Recognising Fragmentation in Practice

The classic symptom is an allocation failure that appears only after hours of operation: the system runs fine in the lab, then fails in the field. FreeRTOS's xPortGetFreeHeapSize() returns total free bytes but cannot tell you whether a 2 KB contiguous block is available; with heap_4 or heap_5, vPortGetHeapStats() additionally reports the size of the largest free block. Use xPortGetMinimumEverFreeHeapSize() during development to find the low-water mark, and catch allocation failures by setting configUSE_MALLOC_FAILED_HOOK to 1 and implementing vApplicationMallocFailedHook().
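On heap_4 and heap_5, vPortGetHeapStats() fills a HeapStats_t whose xSizeOfLargestFreeBlockInBytes and xAvailableHeapSpaceInBytes fields together expose fragmentation directly. A minimal sketch of turning such a snapshot into a percentage you can log or alarm on; HeapSnapshot_t and both functions are hypothetical, and the struct merely mirrors those two fields so the logic can be tested off-target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of two HeapStats_t fields (heap_4/heap_5) */
typedef struct {
    size_t available_bytes;     /* cf. xAvailableHeapSpaceInBytes     */
    size_t largest_free_block;  /* cf. xSizeOfLargestFreeBlockInBytes */
} HeapSnapshot_t;

/* 0 % = all free memory is one block; near 100 % = free memory is
   scattered in pieces far smaller than the total. */
unsigned heap_fragmentation_pct(const HeapSnapshot_t *s) {
    if (s->available_bytes == 0U || s->largest_free_block >= s->available_bytes) {
        return 0U;
    }
    return (unsigned)(100U - (s->largest_free_block * 100U) / s->available_bytes);
}

/* Optimistic check: ignores the allocator's per-block header and alignment
   padding, so a request close to the limit may still fail. */
int heap_can_satisfy(const HeapSnapshot_t *s, size_t request_bytes) {
    return s->largest_free_block >= request_bytes;
}
```

The 8 KB heap shredded into 16-byte fragments from the example above scores 100 % (after integer rounding) and correctly fails a 1 KB request, while xPortGetFreeHeapSize() alone would still report 8 KB free.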

FreeRTOS Heap4 vs Pure Static Allocation

/**
 * FreeRTOS static allocation: eliminate heap entirely for RTOS objects.
 * Set configSUPPORT_STATIC_ALLOCATION 1 in FreeRTOSConfig.h.
 *
 * Benefits:
 *   - RTOS objects use zero heap space
 *   - Sizes verified at compile time (linker reports overflow)
 *   - No malloc failure path for RTOS internals
 */
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"
#include "semphr.h"

/* Task function, defined elsewhere in the application */
void sensor_task_fn(void *pvParameters);

/* --- Static task creation --- */
#define TASK_STACK_WORDS   256U   /* Stack size in 32-bit words = 1 KB */

static StaticTask_t  g_sensor_task_tcb;
static StackType_t   g_sensor_task_stack[TASK_STACK_WORDS];
static TaskHandle_t  g_sensor_task_handle;

/* --- Static queue creation --- */
#define QUEUE_DEPTH   16U
#define ITEM_SIZE     sizeof(uint32_t)

static StaticQueue_t  g_data_queue_struct;
static uint8_t        g_data_queue_storage[QUEUE_DEPTH * ITEM_SIZE];
static QueueHandle_t  g_data_queue;

/* --- Static mutex creation --- */
static StaticSemaphore_t g_spi_mutex_struct;
static SemaphoreHandle_t g_spi_mutex;

void app_create_rtos_objects(void) {
    /* Create task with static memory — returns NULL only on bad params */
    g_sensor_task_handle = xTaskCreateStatic(
        sensor_task_fn,           /* task function           */
        "SensorTask",             /* debug name              */
        TASK_STACK_WORDS,         /* stack depth in words    */
        NULL,                     /* task parameter          */
        3U,                       /* priority                */
        g_sensor_task_stack,      /* stack buffer            */
        &g_sensor_task_tcb        /* TCB storage             */
    );
    configASSERT(g_sensor_task_handle != NULL);

    g_data_queue = xQueueCreateStatic(QUEUE_DEPTH, ITEM_SIZE,
                                      g_data_queue_storage,
                                      &g_data_queue_struct);
    configASSERT(g_data_queue != NULL);

    g_spi_mutex = xSemaphoreCreateMutexStatic(&g_spi_mutex_struct);
    configASSERT(g_spi_mutex != NULL);
}

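/* Required when configSUPPORT_STATIC_ALLOCATION == 1. This is the FreeRTOS
 * V10.x prototype; newer kernels changed the type of the last parameter,
 * so match the declaration in your task.h. If configUSE_TIMERS == 1,
 * vApplicationGetTimerTaskMemory() must be provided as well. */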
void vApplicationGetIdleTaskMemory(StaticTask_t **ppxIdleTaskTCBBuffer,
                                   StackType_t  **ppxIdleTaskStackBuffer,
                                   uint32_t      *pulIdleTaskStackSize) {
    static StaticTask_t idle_tcb;
    static StackType_t  idle_stack[configMINIMAL_STACK_SIZE];
    *ppxIdleTaskTCBBuffer   = &idle_tcb;
    *ppxIdleTaskStackBuffer = idle_stack;
    *pulIdleTaskStackSize   = configMINIMAL_STACK_SIZE;
}

                        
Best Practice: In production RTOS firmware, set configSUPPORT_STATIC_ALLOCATION to 1 and configSUPPORT_DYNAMIC_ALLOCATION to 0. All RTOS objects (tasks, queues, semaphores, event groups) are then created with static memory. This eliminates the heap entirely for RTOS internals and lets the linker verify at build time that your SRAM budget is not exceeded.
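As an illustration, the corresponding FreeRTOSConfig.h excerpt might look like the following. The stack-diagnostic lines are optional additions that support the watermark and overflow checks discussed in the stack overflow section, and the exact option set depends on your port and kernel version.

```c
/* FreeRTOSConfig.h (excerpt): static-only kernel objects */
#define configSUPPORT_STATIC_ALLOCATION      1
#define configSUPPORT_DYNAMIC_ALLOCATION     0  /* kernel itself no longer allocates from a heap */

/* Optional stack diagnostics */
#define configCHECK_FOR_STACK_OVERFLOW       2  /* calls vApplicationStackOverflowHook() */
#define INCLUDE_uxTaskGetStackHighWaterMark  1

/* If configUSE_TIMERS is 1, vApplicationGetTimerTaskMemory() must also be
   provided, in the same way as vApplicationGetIdleTaskMemory(). */
```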
                    

RTOS Memory Pools

CMSIS-RTOS2 provides a standardised memory pool API, implemented natively by Keil RTX5 and by the CMSIS-RTOS2 adaptation layers for FreeRTOS and Zephyr. The API offers O(1) allocation and deallocation with bounded latency. It is safe to call from threads; per the CMSIS-RTOS2 specification, osMemoryPoolAlloc() may also be called from an ISR when its timeout is 0, and osMemoryPoolFree() may be called from an ISR.

CMSIS-RTOS2 Memory Pool API

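/**
 * CMSIS-RTOS2 memory pool: deterministic fixed-block allocation.
 * osMemoryPoolNew / osMemoryPoolAlloc / osMemoryPoolFree
 *
 * With cb_mem and mp_mem left NULL (as below), the RTOS allocates the
 * control block and the block storage from its own memory when the pool
 * is created. To avoid that, pass statically allocated buffers instead;
 * their required sizes are implementation-specific (RTX5, for example,
 * provides osRtxMemoryPoolCbSize and osRtxMemoryPoolMemSize() in rtx_os.h).
 */
#include <string.h>   /* memset */
#include "cmsis_os2.h"

/* Application functions defined elsewhere (signatures assumed here) */
float read_adc_voltage(void);
void  transmit_over_uart(const uint8_t *data, uint32_t len);

/* Block type: 28-byte message descriptor (4 + 2 + 2 + 4 + 16 bytes) */
typedef struct {
    uint32_t  timestamp_ms;
    uint16_t  sensor_id;
    uint16_t  flags;
    float     value;
    uint8_t   payload[16];
} SensorMsg_t;

#define MSG_POOL_CAPACITY   20U

/* Pool attributes: named pool, memory supplied by the RTOS */
static const osMemoryPoolAttr_t pool_attr = {
    .name      = "SensorMsgPool",
    .attr_bits = 0U,
    .cb_mem    = NULL,   /* let RTOS allocate control block */
    .cb_size   = 0U,
    .mp_mem    = NULL,   /* let RTOS allocate block storage */
    .mp_size   = 0U
};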

static osMemoryPoolId_t g_msg_pool;

void pool_demo_init(void) {
    g_msg_pool = osMemoryPoolNew(MSG_POOL_CAPACITY,
                                  sizeof(SensorMsg_t),
                                  &pool_attr);
    if (g_msg_pool == NULL) {
        /* Fatal: pool creation failed — check heap config */
        while(1) {}
    }
}

void sensor_task(void *arg) {
    osMessageQueueId_t queue = (osMessageQueueId_t)arg;

    for (;;) {
        /* Allocate a message block — O(1), bounded latency */
        SensorMsg_t *msg = (SensorMsg_t *)osMemoryPoolAlloc(g_msg_pool,
                                                             osWaitForever);
        if (msg == NULL) { continue; }  /* should not happen with osWaitForever */

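        /* osMemoryPoolAlloc() does not guarantee a cleared block: zero it so
           unused fields (flags, payload) are never transmitted as stale data */
        memset(msg, 0, sizeof(*msg));

        /* Fill message (the tick count equals ms only with a 1 kHz tick) */
        msg->timestamp_ms = osKernelGetTickCount();
        msg->sensor_id    = 1U;
        msg->value        = read_adc_voltage();

        /* Post the pointer, not the struct: the queue must be created
           with msg_size = sizeof(SensorMsg_t *) */
        osStatus_t status = osMessageQueuePut(queue, &msg, 0U, 0U);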
        if (status != osOK) {
            /* Queue full: return block to pool immediately */
            osMemoryPoolFree(g_msg_pool, msg);
        }
    }
}

void comms_task(void *arg) {
    osMessageQueueId_t queue = (osMessageQueueId_t)arg;
    SensorMsg_t *msg;

    for (;;) {
        osStatus_t status = osMessageQueueGet(queue, &msg, NULL, osWaitForever);
        if (status == osOK) {
            transmit_over_uart((uint8_t *)msg, sizeof(SensorMsg_t));
            /* Return block to pool after use — O(1) */
            osMemoryPoolFree(g_msg_pool, msg);
        }
    }
}

Stack Overflow Detection

Stack overflows are among the most common causes of silent corruption in embedded RTOS firmware. A task that overflows its stack writes into whatever memory lies below it, often an RTOS TCB or another task's stack, silently corrupting kernel or application state. By the time the bug manifests, the call stack is meaningless. Early detection is essential.

Stack Watermark Checking with 0xDEADBEEF Pattern

/**
 * Stack watermark detection: fill the stack with a sentinel pattern before
 * the task starts, then scan from the lowest address upward at runtime to
 * find the high-water mark. Cortex-M stacks grow downward, so index 0 of
 * the stack array is the end the stack grows toward.
 *
 * FreeRTOS provides the same data through uxTaskGetStackHighWaterMark()
 * (set INCLUDE_uxTaskGetStackHighWaterMark 1); configCHECK_FOR_STACK_OVERFLOW
 * 1 or 2 adds a check at each context switch that calls
 * vApplicationStackOverflowHook().
 *
 * For bare-metal or a custom RTOS, implement it manually as shown below.
 * monitor_task() assumes CMSIS-FreeRTOS, so configASSERT() and osDelay()
 * are both available.
 */
#include <stdint.h>
#include <stddef.h>
#include "FreeRTOS.h"    /* configASSERT() */
#include "cmsis_os2.h"   /* osDelay()      */

#define STACK_SENTINEL   0xDEADBEEFUL
#define TASK_STACK_SIZE  1024U                                /* bytes */
#define STACK_WORDS      (TASK_STACK_SIZE / sizeof(uint32_t))

/* Statically allocated task stack */
static uint32_t g_task_stack[STACK_WORDS];

/**
 * @brief Fill task stack with sentinel pattern.
 *        Call before the task that owns this stack starts running.
 */
void stack_watermark_init(uint32_t *stack_base, size_t word_count) {
    for (size_t i = 0U; i < word_count; i++) {
        stack_base[i] = STACK_SENTINEL;
    }
}

/**
 * @brief Scan from the lowest address upward to find the high-water mark.
 * @return Number of words still holding the sentinel (never-used words).
 *         Zero means the stack reached its limit and has probably overflowed.
 */
size_t stack_watermark_check(const uint32_t *stack_base, size_t word_count) {
    size_t unused = 0U;
    for (size_t i = 0U; i < word_count; i++) {
        if (stack_base[i] == STACK_SENTINEL) {
            unused++;
        } else {
            break;  /* First modified word = deepest point the stack reached */
        }
    }
    return unused;
}

void monitor_task(void *arg) {
    (void)arg;
    for (;;) {
        size_t unused = stack_watermark_check(g_task_stack, STACK_WORDS);
        size_t used   = STACK_WORDS - unused;

        /* Less than 64 words (256 bytes) of headroom is treated as danger */
        if (unused < 64U) {
            /* Log: "STACK LOW: task used %u/%u words", used, STACK_WORDS */
            (void)used;
            configASSERT(0);  /* Halts if configASSERT() is enabled (debug builds) */
        }

        osDelay(1000U);  /* Check every second (assuming a 1 kHz kernel tick) */
    }
}
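A full scan is fine from a once-per-second monitor but too slow for a context-switch hook. FreeRTOS's configCHECK_FOR_STACK_OVERFLOW = 2 takes the cheaper route of checking only a few bytes at the end of the stack for its fill pattern. A minimal sketch of that idea with this article's sentinel; stack_canary_intact() and CANARY_WORDS are hypothetical names, and like any pattern check it only catches overflows that actually overwrite the canary words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_SENTINEL  0xDEADBEEFUL
#define CANARY_WORDS    4U   /* hypothetical: guard the last 16 bytes */

/* Constant-time check of the CANARY_WORDS lowest-addressed words. On a
   descending stack these are the last words the task can use before it
   runs off the end of its buffer. */
bool stack_canary_intact(const uint32_t *stack_limit) {
    for (size_t i = 0U; i < CANARY_WORDS; i++) {
        if (stack_limit[i] != STACK_SENTINEL) { return false; }
    }
    return true;
}
```

Usage: fill the stack with stack_watermark_init() as before, then call stack_canary_intact(g_task_stack) from a scheduler hook; reserve the full scan for periodic reporting.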

MPU-Based Memory Protection

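The Memory Protection Unit (MPU) on Cortex-M3/M4/M7/M33, when implemented, lets you configure typically 8 or 16 memory regions, depending on the core and implementation, with individual access permissions. The most powerful use case for embedded firmware is the stack guard: a small no-access region at the end of each task stack, in the direction the stack grows. When a stack overflow occurs, the CPU attempts to access the guard region, which immediately triggers a MemManage fault with precise fault address information, so the overflow is caught at the point it happens rather than silently afterwards. The guard only catches overflows that actually touch it: a function whose stack frame is larger than the guard, and that does not write the guarded words, can skip over it entirely.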

MPU Stack Guard Page Configuration

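/**
 * MPU stack guard: configure a 32-byte no-access region (the minimum
 * ARMv7-M region size) over the lowest addresses of the task stack buffer.
 * Cortex-M stacks grow downward, so an overflow runs into this region and
 * triggers a MemManage fault immediately. Usable stack = 4096 - 32 bytes.
 *
 * Cortex-M3/M4/M7: ARMv7-M MPU (8 or 16 regions, base + power-of-2 size).
 * Cortex-M33:      ARMv8-M MPU (base + limit format, different registers).
 *
 * This example targets ARMv7-M (e.g. STM32F4) with privileged code.
 */
#include "core_cm4.h"

/* Task stack. The guard itself only needs 32-byte alignment; 4 KB
   alignment matters only if the whole buffer becomes one MPU region. */
__attribute__((aligned(4096)))
static uint8_t g_task_stack_buf[4096];

/**
 * @brief Configure MPU region 7 as a no-access guard at the bottom of the stack.
 *
 * ARMv7-M region encoding:
 *   RASR AP[26:24] = 0b000   -> no access (any access triggers MemManage)
 *   RASR SIZE[5:1] = 0b00100 -> 2^(4+1) = 32-byte region
 *   RASR ENABLE[0] = 1
 * Where regions overlap, the highest-numbered region wins, so region 7
 * overrides any other mapping of these 32 bytes.
 */
void mpu_configure_stack_guard(void) {
    /* Let outstanding memory accesses complete, then disable the MPU */
    __DMB();
    MPU->CTRL = 0U;

    /* Region 7: 32-byte no-access guard at the lowest address of the buffer */
    MPU->RNR  = 7U;
    MPU->RBAR = (uint32_t)g_task_stack_buf | MPU_RBAR_VALID_Msk | 7U;
    MPU->RASR = (0x00UL << MPU_RASR_AP_Pos)   |  /* No access */
                (0x04UL << MPU_RASR_SIZE_Pos)  |  /* 32 bytes  */
                MPU_RASR_ENABLE_Msk;

    /* Without this, MPU violations escalate to HardFault instead of
       reaching MemManage_Handler */
    SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;

    /* Enable MPU. PRIVDEFENA keeps the default memory map for privileged
       accesses outside defined regions; HFNMIENA is left clear, so the MPU
       is disabled while HardFault and NMI handlers run. */
    MPU->CTRL = MPU_CTRL_ENABLE_Msk
              | MPU_CTRL_PRIVDEFENA_Msk;

    __DSB();
    __ISB();
}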

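/**
 * @brief MemManage handler: examine SCB->MMFAR for the fault address.
 *        An MMFAR inside the guard region confirms a stack overflow.
 *        If the overflow hit the guard during exception-entry stacking
 *        (CFSR.MSTKERR set), MMFAR is not valid and MMARVALID stays clear.
 */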
void MemManage_Handler(void) {
    uint32_t fault_addr = 0U;
    if (SCB->CFSR & SCB_CFSR_MMARVALID_Msk) {
        fault_addr = SCB->MMFAR;
    }

    /* Determine if fault address is in stack guard region */
    if (fault_addr >= (uint32_t)g_task_stack_buf &&
        fault_addr <  (uint32_t)g_task_stack_buf + 32U) {
        /* Confirmed stack overflow — log and halt */
        /* In production: trigger watchdog reset, preserve fault info in RTC backup */
    }

    while (1) { __NOP(); }
}

                        
MPU Region Alignment Requirement: ARMv7-M MPU regions must be naturally aligned: a 32-byte region must start at a 32-byte-aligned address, a 4 KB region at a 4 KB-aligned address. Use __attribute__((aligned(N))) on your stack buffers. Misaligned regions silently cover the wrong address range.
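Because ARMv7-M encodes the region size as a power of two and expects a naturally aligned base, it can help to compute and check these values in portable code before writing MPU->RBAR and MPU->RASR. A minimal sketch; the helper names and RASR_*_POS constants are hypothetical stand-ins for CMSIS's MPU_RASR_*_Pos (same bit positions as the ARMv7-M MPU_RASR layout), chosen so the logic can be unit-tested on a host. ARMv8-M uses base/limit registers and does not need this size encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR bit positions (CMSIS names them MPU_RASR_*_Pos) */
#define RASR_ENABLE_POS  0U
#define RASR_SIZE_POS    1U
#define RASR_AP_POS      24U

/* SIZE field: region bytes = 2^(SIZE + 1), from 32 B up to 4 GB.
   Returns -1 if 'bytes' is not a power of two in that range. */
int mpu_size_field(uint64_t bytes) {
    if (bytes < 32U || bytes > (1ULL << 32) || (bytes & (bytes - 1U)) != 0U) {
        return -1;
    }
    int log2 = 0;
    while ((1ULL << log2) < bytes) { log2++; }
    return log2 - 1;
}

/* Natural alignment: the base must be a multiple of the region size. */
bool mpu_region_aligned(uint32_t base, uint64_t bytes) {
    return bytes != 0U && ((uint64_t)base % bytes) == 0U;
}

/* RASR value for an enabled region with AP = 0b000 (no access at all). */
uint32_t mpu_rasr_no_access(uint64_t bytes) {
    int size = mpu_size_field(bytes);
    assert(size >= 0);                          /* caller must pass a legal size */
    return (0x0U << RASR_AP_POS)                /* AP = no access */
         | ((uint32_t)size << RASR_SIZE_POS)    /* SIZE field     */
         | (1U << RASR_ENABLE_POS);             /* ENABLE         */
}
```

For the 32-byte guard above, mpu_rasr_no_access(32) yields 0x09, the same value the original expression (0x04UL << MPU_RASR_SIZE_Pos) | MPU_RASR_ENABLE_Msk produces.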
                    

Common Embedded Memory Bugs

Bug	Symptom	Root Cause	Detection Method
Stack overflow	Random resets, corrupted locals, wrong return addresses	Deep recursion, large local arrays, ISR nesting	MPU guard page, FreeRTOS stack checking, watermark scan
Heap fragmentation	NULL from malloc after hours of operation	Mixed alloc/free sizes, long-lived allocations	vApplicationMallocFailedHook, heap visualisation
Double-free	Heap corruption, hard fault, wrong data	Shared ownership without reference counting	configASSERT() checks in FreeRTOS heap_4 vPortFree(), valgrind-style wrappers
Dangling pointer	Intermittent wrong values, silent data corruption	Freeing memory while another component holds a pointer	MPU read-guard on freed region, static analysis
Null pointer dereference	HardFault, or silently wrong data: on many MCUs address 0x00000000 maps to flash (the vector table), so reads through NULL may not fault	Failed malloc/pool_alloc return value not checked	MPU no-access region at 0x00000000 (null guard)

Exercises

Exercise 1 Beginner

Instrument Firmware to Track Peak Heap Usage

Wrap FreeRTOS's pvPortMalloc() and vPortFree() (for newlib's malloc() and free(), the GNU linker's --wrap=malloc and --wrap=free options serve the same purpose) to track: (a) total bytes currently allocated, (b) peak bytes ever allocated simultaneously, (c) total number of allocation failures (NULL returns). Log these values over a 24-hour soak test and produce a time-series chart of heap utilisation. Identify the top three allocation sites by frequency using __builtin_return_address(0).
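One possible starting point for parts (a) to (c) is a thin wrapper that stores each block's size in a small header. A minimal sketch, assuming the wrapper is the only path into the allocator: tracked_malloc(), tracked_free(), HeapTrack_t and the BACKEND_* macros are hypothetical names, malloc()/free() stand in for pvPortMalloc()/vPortFree() (or newlib's __real_malloc()/__real_free() under --wrap) so it runs on a host, and on target the counter updates need the same critical section as the allocator itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Host stand-ins; on target, route these to the real allocator */
#define BACKEND_ALLOC(n)  malloc(n)
#define BACKEND_FREE(p)   free(p)

typedef struct {
    size_t   current_bytes;   /* (a) bytes currently allocated    */
    size_t   peak_bytes;      /* (b) peak simultaneous allocation */
    uint32_t failures;        /* (c) number of NULL returns       */
} HeapTrack_t;

static HeapTrack_t g_track;

/* Size header stored in front of each block; the max_align_t member keeps
   the pointer handed to the caller suitably aligned. */
typedef union {
    size_t      size;
    max_align_t align;
} TrackHdr_t;

void *tracked_malloc(size_t n) {
    if (n > SIZE_MAX - sizeof(TrackHdr_t)) { g_track.failures++; return NULL; }
    TrackHdr_t *h = BACKEND_ALLOC(sizeof(TrackHdr_t) + n);
    if (h == NULL) { g_track.failures++; return NULL; }
    h->size = n;
    g_track.current_bytes += n;
    if (g_track.current_bytes > g_track.peak_bytes) {
        g_track.peak_bytes = g_track.current_bytes;
    }
    return h + 1;             /* caller's data starts after the header */
}

void tracked_free(void *p) {
    if (p == NULL) { return; }
    TrackHdr_t *h = (TrackHdr_t *)p - 1;
    g_track.current_bytes -= h->size;
    BACKEND_FREE(h);
}
```

Recording __builtin_return_address(0) inside tracked_malloc() into a small table keyed by caller address is one way to collect the per-site counts.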

Heap Instrumentation Memory Profiling FreeRTOS

Exercise 2 Intermediate

Replace Dynamic Allocations with Memory Pool Allocations

Take an RTOS task in your codebase that currently uses pvPortMalloc() to allocate message buffers dynamically. Replace all dynamic allocations with a CMSIS-RTOS2 osMemoryPoolAlloc() backed by a statically-allocated pool. Measure and document: (a) reduction in worst-case allocation time (cycles), (b) elimination of fragmentation risk, (c) new failure mode (pool exhaustion vs NULL malloc) and how you handle it. Verify the pool capacity is sufficient using osMemoryPoolGetSpace().

CMSIS-RTOS2 Memory Pool Deterministic Allocation

Exercise 3 Advanced

Configure MPU Stack Guard and Trigger a Controlled MemFault

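Configure an MPU no-access guard region (32 bytes) at the bottom (lowest address) of a test task stack, with MemManage enabled in SCB->SHCSR. Write a test function that intentionally overflows the stack by recursing, with each call filling a small local array so that every stack word on the way down is written, until the guard region is hit. A large array allocated inside a loop will not do this: the loop reuses a single stack frame, and one large, partly untouched frame can skip over a 32-byte guard. Verify: (a) the MemFault handler fires with the correct MMFAR address pointing into the guard region, (b) SCB->CFSR shows MMARVALID and DACCVIOL bits set, (c) the system does not silently corrupt adjacent data before the fault fires. Document the full fault register dump.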

MPU MemFault Stack Guard

Conclusion & Next Steps

In this article we have built a complete embedded memory management toolkit:

Static allocation is the foundation — zero runtime overhead, verifiable at link time, MISRA-compliant. Use it wherever objects have fixed lifetime and known size.
Fixed-size memory pools give you O(1) deterministic allocation for variable-lifetime objects — the correct replacement for malloc() in real-time firmware.
CMSIS-RTOS2 osMemoryPool provides a standardised pool API that works across FreeRTOS, RTX5, and Zephyr — write once, run on any CMSIS-RTOS2 implementation.
FreeRTOS static allocation (xTaskCreateStatic, xQueueCreateStatic) eliminates the heap entirely for RTOS objects — strongly recommended for production firmware.
Stack watermark scanning with a sentinel pattern gives you high-water mark data during testing; MPU stack guards give you immediate, hardware-enforced overflow detection in production.
The common memory bug table — stack overflow, fragmentation, double-free, dangling pointer, null dereference — gives you a diagnostic checklist for when memory-related faults appear.

Next in the Series

In Part 13: Low Power & Energy Optimization, we shift focus from correctness to efficiency. We'll cover the full ARM Cortex-M low-power toolkit: WFI/WFE sleep modes, STOP and Standby states, clock gating for unused peripherals, FreeRTOS tickless idle with LPTIM, power-domain management, and how to profile your firmware's average current with a hardware current probe to meet battery-life targets.
