Back to Embedded Systems Hardware Engineering Series

Capstone 3: Edge AI Camera

April 17, 2026 Wasil Zafar 45 min read

Design a camera module with on-device ML inference for real-time object detection, leveraging the STM32H7’s DSP capabilities and TensorFlow Lite Micro.

Table of Contents

  1. System Overview
  2. Image Pipeline
  3. ML Model Deployment
  4. Memory Architecture
  5. Performance Estimator
  6. Conclusion

System Overview

ParameterSpecificationComponent
MCUARM Cortex-M7 @ 480 MHzSTM32H743
Camera5MP, QVGA for inferenceOV5640 (DCMI)
RAM1MB internal + 8MB SDRAMIS42S16400J
Flash2MB internal + 64MB QSPIW25Q512
Display2.4" TFT (320×240)ILI9341 (SPI)
ConnectivityWiFi + BLEESP32-C3 (UART)
ModelMobileNet v2 (96×96 INT8)TFLite Micro
Inference<100 ms per frameCMSIS-NN accelerated
Edge AI Camera Data Flow
flowchart LR
    A["OV5640
Camera"] -->|DCMI| B["DMA
Transfer"] B --> C["Frame Buffer
SDRAM"] C --> D["Resize &
Preprocess"] D --> E["TFLite Micro
Inference"] E --> F["Post-process
NMS"] F --> G["Display
ILI9341"] F --> H["WiFi/BLE
ESP32-C3"] style E fill:#3B9797,color:#fff style A fill:#16476A,color:#fff

Image Pipeline

/* DCMI configuration for OV5640 → DMA → frame buffer
   Captures QVGA (320x240) RGB565 frames */

#include <stdint.h>

#define FRAME_WIDTH   320
#define FRAME_HEIGHT  240
#define BYTES_PER_PX  2          /* RGB565 */
#define FRAME_SIZE    (FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PX)

/* Frame buffers in external SDRAM (double-buffered) */
/* Place at SDRAM base: 0xC0000000 */
uint16_t frame_buf_0[FRAME_WIDTH * FRAME_HEIGHT];  /* 150 KB */
uint16_t frame_buf_1[FRAME_WIDTH * FRAME_HEIGHT];  /* 150 KB */

/* Inference buffer: 96x96 grayscale (INT8) */
int8_t inference_buf[96 * 96];  /* 9.2 KB in DTCM */

/* Downsample and convert RGB565 → grayscale INT8 */
void preprocess_frame(const uint16_t *src, int8_t *dst,
                      int src_w, int src_h,
                      int dst_w, int dst_h) {
    float x_ratio = (float)src_w / dst_w;
    float y_ratio = (float)src_h / dst_h;

    for (int y = 0; y < dst_h; y++) {
        for (int x = 0; x < dst_w; x++) {
            int sx = (int)(x * x_ratio);
            int sy = (int)(y * y_ratio);
            uint16_t pixel = src[sy * src_w + sx];

            /* RGB565 → grayscale: 0.299R + 0.587G + 0.114B */
            uint8_t r = (pixel >> 11) & 0x1F;
            uint8_t g = (pixel >> 5)  & 0x3F;
            uint8_t b = pixel & 0x1F;

            /* Scale to 8-bit and compute luminance */
            uint8_t gray = (uint8_t)(
                (r * 255 / 31) * 77 / 256 +
                (g * 255 / 63) * 150 / 256 +
                (b * 255 / 31) * 29 / 256
            );

            /* Quantize to INT8: [-128, 127] */
            dst[y * dst_w + x] = (int8_t)(gray - 128);
        }
    }
}

ML Model Deployment

# Model quantization — convert Keras model to TFLite INT8
# Suitable for STM32H7 with CMSIS-NN acceleration

import numpy as np

# Simulated model metrics for deployment planning
model_params = {
    "architecture": "MobileNet V2 (alpha=0.35)",
    "input_shape": (96, 96, 1),
    "num_classes": 10,
    "float32_size_kb": 1420,
    "int8_size_kb": 380,
    "float32_latency_ms": 850,
    "int8_latency_ms": 92,
    "accuracy_float32": 0.945,
    "accuracy_int8": 0.938,
}

# Memory layout for STM32H743
memory_map = {
    "ITCM (instructions)": (64, "Inference engine code"),
    "DTCM (fast data)":    (128, "Model weights (hot layers)"),
    "SRAM1":               (512, "Tensor arena"),
    "SRAM2":               (288, "Scratch buffers"),
    "SDRAM":               (8192, "Frame buffers + display"),
    "QSPI Flash":          (65536, "Full model + assets"),
}

print("Edge AI Camera — Model Deployment Plan")
print("=" * 55)
print(f"Model:       {model_params['architecture']}")
print(f"Input:       {model_params['input_shape']}")
print(f"Classes:     {model_params['num_classes']}")
print(f"\nQuantization Comparison:")
print(f"  Float32:  {model_params['float32_size_kb']} KB, "
      f"{model_params['float32_latency_ms']} ms, "
      f"{model_params['accuracy_float32']:.1%} accuracy")
print(f"  INT8:     {model_params['int8_size_kb']} KB, "
      f"{model_params['int8_latency_ms']} ms, "
      f"{model_params['accuracy_int8']:.1%} accuracy")
print(f"  Speedup:  {model_params['float32_latency_ms']/model_params['int8_latency_ms']:.1f}x")
print(f"  Size reduction: {1 - model_params['int8_size_kb']/model_params['float32_size_kb']:.0%}")

print(f"\nMemory Map:")
total_used = 0
for region, (size_kb, usage) in memory_map.items():
    print(f"  {region:<25} {size_kb:>6} KB  — {usage}")
    total_used += size_kb
print(f"  {'Total':<25} {total_used:>6} KB")

Memory Architecture

STM32H7 Memory Hierarchy for ML Inference
flowchart TD
    A["QSPI Flash
64MB — Model Storage"] -->|XIP or Copy| B["DTCM
128KB — Hot Weights"] A -->|DMA| C["SRAM1
512KB — Tensor Arena"] D["DCMI + DMA"] --> E["SDRAM
8MB — Frame Buffers"] E -->|Preprocess| C C --> F["CMSIS-NN
Inference Engine"] F --> G["Results
Bounding Boxes"] B --> F style F fill:#3B9797,color:#fff style A fill:#132440,color:#fff
CMSIS-NN Acceleration: The ARM CMSIS-NN library provides optimized kernels for quantized convolution, depthwise convolution, and fully connected layers. On the Cortex-M7 with DSP extensions, INT8 inference runs 5–9x faster than naive float32 implementations.

Inference Performance Estimator

Edge AI Camera Performance Estimator

Estimate inference performance for your model and target MCU. Download as Word, Excel, or PDF.

Draft auto-saved

Conclusion

The edge AI camera capstone demonstrates how to build a complete vision pipeline — from DCMI capture through preprocessing, INT8 inference with CMSIS-NN, and result display — all on a Cortex-M7 MCU without a cloud connection. The quantized MobileNet V2 model achieves <100ms inference at 93.8% accuracy.

Next Capstone

In Capstone 4: Home Automation Hub, we’ll design a multi-protocol gateway combining WiFi, Zigbee, and Bluetooth Mesh into a unified smart home controller.