
Edge AI & On-Device Intelligence

March 30, 2026 · Wasil Zafar · 31 min read

Not every AI workload belongs in the cloud. This article covers the engineering discipline of deploying AI to resource-constrained devices — smartphones, microcontrollers, IoT sensors, and embedded systems — through model quantization, pruning, knowledge distillation, TFLite, ONNX Runtime, and hardware-aware optimisation.

Table of Contents

  1. Why Edge AI?
  2. Model Compression Techniques
  3. PyTorch Quantization in Practice
  4. TFLite: Mobile & Embedded Deployment
  5. ONNX: Cross-Platform Inference
  6. Edge Hardware Landscape
  7. NAS & Architecture-Level Compression
  8. Exercises
  9. Edge AI Deployment Plan Generator
  10. Conclusion & Next Steps

AI in the Wild · Part 21 of 24

About This Article

Edge AI moves intelligence from data centres to the devices where data is generated and decisions are made — smartphones, wearables, autonomous vehicles, industrial sensors, and microcontrollers. This article covers the complete toolkit: why edge deployment matters, the constraints that shape it, model compression techniques (quantization, pruning, distillation), the leading deployment runtimes (TFLite, ONNX, CoreML), and the hardware landscape for edge inference.

Tags: Edge AI · Quantization · TFLite · ONNX Runtime · Model Compression

Why Edge AI?

The dominant model of AI deployment for the first decade of deep learning was straightforward: send data to a cloud server, run inference on GPU hardware, return results. This model works well when bandwidth is abundant, latency is acceptable, privacy constraints are minimal, and connectivity is guaranteed. But many of the most important AI applications fail at least one of these conditions. A medical device monitoring cardiac arrhythmias must respond in milliseconds — cloud round-trip latency is unacceptable. A smartwatch running sleep quality analysis on a full night of sensor data cannot transmit gigabytes of raw data over a low-energy Bluetooth connection. A factory floor quality inspection system operating in a facility with intermittent connectivity cannot depend on cloud availability for safety-critical decisions. A privacy-sensitive personal health coaching app cannot send voice audio and biometric data to a third-party server. These use cases drive the development of Edge AI — the discipline of deploying trained models directly on the device where inference is needed.

The benefits of edge deployment extend beyond the obvious latency and connectivity advantages. Privacy is preserved by design when raw data never leaves the device. Bandwidth and cloud compute costs are reduced, sometimes dramatically: a smart doorbell processing video locally and only uploading exception frames saves orders of magnitude in bandwidth compared to continuous streaming. Regulatory compliance is simplified when personal data remains on the user's device and is never transmitted. Reliability improves because inference continues even when cloud connectivity is lost. And the proliferation of edge AI hardware — purpose-built neural processing units in virtually every modern smartphone SoC, in microcontrollers, in smart home appliances — has made the deployment infrastructure increasingly capable and accessible.

Key Insight: Edge AI is not about taking cloud models and running them on phones. It is a fundamentally different design discipline that requires co-designing the model architecture, the compression strategy, and the deployment runtime together, constrained by the specific memory, compute, power, and latency budget of the target device. A model designed for a V100 GPU running in float32 will not simply "run on" a microcontroller — it requires a complete reimplementation strategy.

Edge Constraints

Edge devices are defined by four fundamental constraints that collectively shape every design decision. Memory is the most severe: a typical microcontroller has 256KB to 1MB of flash storage for model weights and 64KB to 512KB of RAM for activations. A mid-range smartphone might have 6-8GB of RAM shared across the operating system, all running applications, and any on-device ML models. These figures are orders of magnitude smaller than the tens of gigabytes available to a cloud GPU. Compute on edge devices is measured in millions of operations per second (MOPS) or TOPS (tera-operations per second) on dedicated NPUs — ranging from under 1 TOPS on an entry-level Arm Cortex-M microcontroller to 275 TOPS on an NVIDIA Jetson AGX Orin. Power is constrained by battery capacity and thermal envelope: a coin-cell battery powering an IoT sensor might allow microwatts of average inference power; a smartphone might allocate 1-2 watts to an ML task; an industrial edge server might support 25-50W thermal design power (TDP). Latency requirements vary from under 1ms for safety-critical embedded control to several seconds for background tasks on smartphones.

Use Cases & Deployment Contexts

Edge AI spans an enormous range of deployment contexts and capability levels. Keyword spotting (wake word detection) runs entirely on a microcontroller in milliwatts, continuously processing audio to detect "Hey Siri" or "OK Google" without any network connection. On-device image classification and object detection on smartphones powers features like photo organisation, document scanning, AR, and real-time translation of text in camera frames. Health monitoring on wearables runs continuous ECG and PPG analysis to detect arrhythmias, fall events, and sleep stages on devices with severely constrained battery budgets. Industrial anomaly detection on edge servers monitors machine vibration, temperature, and acoustic signatures in real time, triggering maintenance alerts without cloud round-trips. Autonomous vehicle perception runs multi-model inference pipelines for obstacle detection, lane keeping, and pedestrian recognition on custom edge AI SoCs that must process camera, LiDAR, and radar inputs within a 50ms latency budget at under 50W total system power.

Model Compression Techniques

Model compression is the collective term for techniques that reduce model size, memory footprint, and computational cost — enabling capable models to run on constrained hardware. The four primary families are quantization, pruning, knowledge distillation, and neural architecture search. Each targets a different aspect of model complexity and comes with different accuracy-efficiency trade-offs and engineering complexity costs.

Quantization

Quantization reduces the numerical precision of model weights and activations from the standard 32-bit floating point (float32) used during training to lower-bit representations — typically 8-bit integer (INT8) or, in the most aggressive cases, 4-bit or 2-bit integer or binary representations. The key insight is that neural network weights, once trained, exhibit a relatively narrow range of values that can be faithfully approximated by a linear mapping from the full-precision range to a smaller integer range. INT8 quantization typically reduces model size by 4x (from 4 bytes per weight to 1 byte) and reduces inference latency by 2-4x on CPU hardware that has efficient 8-bit multiply-accumulate (MAC) instructions — modern ARM cores and x86 CPUs have dedicated INT8 SIMD instruction sets (NEON, AVX2/VNNI) that can process 4x more operations per clock cycle in INT8 than in float32.
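The linear mapping can be made concrete with a short numpy sketch (`quantize_int8` and `dequantize` are illustrative names, not a library API). Note that the reconstruction error is bounded by roughly half the quantization step, i.e. half of `scale`:

```python
import numpy as np

# Affine INT8 quantization sketch (illustrative helper names):
#   q = clip(round(w / scale) + zero_point, -128, 127)
# where scale maps the observed float range onto 256 integer levels.
def quantize_int8(w):
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = (np.random.randn(256, 256) * 0.1).astype(np.float32)
q, scale, zp = quantize_int8(w)
err = float(np.abs(dequantize(q, scale, zp) - w).max())
print(f"scale={scale:.5f}, zero_point={zp}, max abs error={err:.5f}")
```

Real runtimes store `scale` and `zero_point` alongside each quantized tensor so inference can dequantize (or compute directly in integer arithmetic) at each layer.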

There are two principal quantization strategies with meaningfully different engineering requirements. Dynamic quantization quantizes model weights to INT8 at load time but performs activations in float32 during inference, converting on the fly. It requires no calibration data and is the simplest approach to apply, but only achieves partial latency savings. Static (post-training) quantization quantizes both weights and activations to INT8, requiring a calibration step in which a representative dataset of 100-1000 samples is fed through the model to measure the activation value ranges. This calibration data is used to compute the optimal quantization scale factors. Static quantization achieves the full 2-4x latency improvement but requires careful selection of calibration data — calibration data that does not represent the deployment distribution will produce quantization scales that are too narrow or too wide, degrading accuracy more than expected. Quantization-aware training (QAT) simulates quantization noise during the training forward pass using fake quantization nodes, allowing the model to learn weight values that are more robust to quantization. QAT typically achieves less than 0.5% accuracy degradation even at INT8 precision, at the cost of requiring a full retraining cycle. For the most demanding accuracy requirements at very low bit-widths (INT4 or binary), QAT is effectively mandatory.
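The fake-quantization node that QAT inserts can be sketched in a few lines of numpy (function name illustrative): the forward pass rounds values onto the INT8 grid and immediately dequantizes, so the loss is computed against realistic quantization noise, while gradients flow through via the straight-through estimator.

```python
import numpy as np

# Sketch of a QAT "fake quantization" op: quantize then dequantize,
# so training sees the noise of the INT8 grid while staying in float.
def fake_quantize(w, num_bits=8):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = np.round(qmin - w.min() / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale  # back to float, snapped to the grid

w = np.linspace(-1, 1, 5).astype(np.float32)
print(fake_quantize(w))  # close to w, but on the 256-level grid
```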

Quantization Results — MobileNetV3

Typical Quantization Impact Benchmarks

The following figures are representative of typical results on a standard image classification model (MobileNetV3-Small) deployed to an ARM Cortex-A72 CPU (Raspberry Pi 4):

  • Float32: 17 MB model size, 95ms inference latency, 67.4% top-1 accuracy (ImageNet)
  • INT8 (PTQ static): 4.3 MB model size, 31ms inference latency, 66.8% top-1 accuracy — 4x smaller, 3x faster, 0.6% accuracy cost
  • INT8 (QAT): 4.3 MB model size, 31ms inference latency, 67.1% top-1 accuracy — negligible accuracy cost vs. float32
  • INT4 (QAT): 2.1 MB model size, 22ms inference latency, 65.9% top-1 accuracy — suitable for severe memory-constrained targets

Note that INT8 accuracy loss is highly model- and task-dependent: vision transformers (ViTs) tend to be more sensitive to quantization than CNNs, and tasks requiring precise boundary detection (semantic segmentation) degrade more than coarse-grained classification.

Pruning

Pruning removes weights (or entire neurons, filters, or attention heads) from a trained model that contribute minimally to its output. The key observation motivating pruning is that neural networks are typically overparameterised — the number of weights substantially exceeds what is needed to represent the learned function, and many weights are near-zero or highly redundant. There are two principal pruning strategies. Unstructured pruning (weight-level sparsity) sets individual weights to zero based on their magnitude, creating a sparse weight matrix. This can achieve very high sparsity (90%+ of weights set to zero) with minimal accuracy loss, but sparse matrix computations do not translate to proportional latency improvements on standard hardware unless the sparsity exceeds ~80% and hardware supports sparse matrix-vector multiplication (e.g., NVIDIA's Sparse Tensor Cores). Structured pruning removes entire structural units — filters in a CNN, attention heads in a transformer, or entire layers — which does produce immediate latency improvements on standard dense hardware. Structured pruning at 50% filter removal typically reduces FLOPs by 40-50% and inference time by 30-40%, with accuracy costs of 1-3% depending on the model and task.

The iterative magnitude pruning (IMP) recipe — prune 10-20% of weights per round, retrain for a few epochs to recover accuracy, repeat — achieves better accuracy-sparsity trade-offs than one-shot pruning. The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) provides theoretical grounding: large networks contain "winning ticket" subnetworks that, when trained in isolation from the same initialisation, match or exceed the full network's performance. This explains why pruned-and-retrained networks can retain high accuracy despite significantly reduced parameterisation.
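One round of the IMP recipe can be sketched with PyTorch's built-in pruning utilities (toy model for illustration; in a real pipeline you would fine-tune for a few epochs after each round and repeat):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# One magnitude-pruning round on a toy model: zero the 20% smallest-
# magnitude weights in every Conv2d/Linear layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 30 * 30, 10),
)

for module in model.modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.2)

# Verify: the pruning mask zeroes ~20% of the weight elements
total, zeros = 0, 0
for module in model.modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        total += module.weight.nelement()
        zeros += int((module.weight == 0).sum())
print(f"Global weight sparsity: {zeros / total:.1%}")  # ~20%
```

Note that `l1_unstructured` produces unstructured sparsity; for latency gains on dense hardware you would use structured variants such as `prune.ln_structured` over whole filters.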

Knowledge Distillation

Knowledge distillation (Hinton et al., 2015) trains a small "student" model to mimic the output distribution of a large, accurate "teacher" model. Rather than training the student on one-hot hard labels, it is trained to match the teacher's soft probability distribution — the full softmax output — which contains rich information about inter-class similarities that the hard label discards. For example, when the teacher assigns a cat image 85% probability to "cat," 10% probability to "lynx," and 3% probability to "bobcat," the student learns not just "this is a cat" but also "this is a cat that looks a bit like wild felids" — a form of structured knowledge transfer. The temperature hyperparameter in the distillation loss controls the softness of the teacher distribution: higher temperature makes the distribution flatter, transferring more nuanced similarity structure; lower temperature focuses on the teacher's top predictions.
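A minimal numpy sketch of the combined objective (`distillation_loss`, `T` for temperature, and `alpha` for the soft/hard mixing weight are illustrative names): the soft term is cross-entropy against the teacher's temperature-softened distribution, scaled by T² to keep gradient magnitudes comparable across temperatures, as in Hinton et al. (2015).

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: cross-entropy against the teacher's softened
    # distribution, scaled by T^2
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    # Hard targets: ordinary cross-entropy against ground-truth labels
    log_p = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))
student_logits = rng.normal(size=(4, 10))
labels = np.array([1, 3, 5, 7])
print(distillation_loss(student_logits, teacher_logits, labels))
```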

Modern distillation approaches go beyond output-level matching. Feature distillation (FitNets, 2015) matches intermediate feature maps, forcing the student to learn similar internal representations. Attention transfer (AT, 2017) matches the spatial attention maps of student and teacher layers. Relational knowledge distillation (RKD, 2019) matches pairwise and triplet relationships between embeddings in the batch rather than individual activations. In practice, a combination of output-level soft label loss and feature-level matching loss typically achieves the best results. Distillation can reduce model size by 5-10x with accuracy costs of 1-3% compared to the teacher, which is often more accurate than pruning the teacher directly by the same factor.

PyTorch Quantization in Practice

PyTorch provides three quantization APIs, each covering a different point on the control-vs-ease tradeoff. The following code demonstrates both dynamic and static post-training quantization, with benchmarking to measure the real-world latency benefit:

import torch
import torch.quantization
import copy
import io
import time

# Post-training quantization: 32-bit float → 8-bit integer
# Reduces model size by 4x, inference latency by 2-4x on CPU

model = MyMobileNet()  # assume trained float32 model
model.eval()
float_model = copy.deepcopy(model)  # keep an unquantized copy for benchmarking

def get_model_size(m):
    """Serialised state_dict size in MB."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

# 1. Dynamic quantization — simplest, good for LSTM/Linear layers
dynamic_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8
)
print(f"Dynamic quantized size: {get_model_size(dynamic_model):.1f} MB")

# 2. Static quantization — better accuracy, requires calibration data.
# Note: eager-mode static quantization requires the model to wrap its
# inputs/outputs in QuantStub/DeQuantStub modules.
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # x86 CPU
torch.quantization.prepare(model, inplace=True)

# Calibration: run representative data through the model
with torch.no_grad():
    for batch in calibration_loader:  # 100-200 samples is sufficient
        model(batch)

torch.quantization.convert(model, inplace=True)

# Benchmark: average latency in milliseconds over n runs
def benchmark(model, x, n=100):
    times = []
    with torch.no_grad():
        for _ in range(n):
            t = time.time()
            model(x)
            times.append(time.time() - t)
    return sum(times) / len(times) * 1000  # ms

x = torch.randn(1, 3, 224, 224)
print(f"Float32 inference: {benchmark(float_model, x):.2f} ms")
print(f"INT8 inference:    {benchmark(model, x):.2f} ms")  # ~3x faster on x86

For ARM targets (Raspberry Pi, mobile), change the qconfig backend from 'fbgemm' (x86 optimised) to 'qnnpack'. The two backends use different kernel implementations optimised for their respective instruction sets — using the wrong backend for your target architecture will not cause errors but will produce significantly worse latency results. Always benchmark the quantized model on the actual target device, not just the development machine: the latency improvement ratio can differ substantially between architectures.

Production Warning: Quantization accuracy loss is not uniformly distributed across the input space. Models can degrade significantly more on underrepresented subgroups than on the majority. Always run the fairness audit from Part 19 on the quantized model before deployment — not just on the float32 model. A quantized model that achieves acceptable top-1 accuracy on ImageNet can still have substantially higher error rates on specific demographic groups, especially when the quantization calibration dataset did not include representative samples from those groups.

TFLite: Mobile & Embedded Deployment

TensorFlow Lite (TFLite) is Google's runtime for deploying machine learning models on mobile, embedded, and IoT devices. It is the de facto deployment standard for Android and is also widely used on Raspberry Pi, microcontrollers (via TensorFlow Lite for Microcontrollers, which has no OS dependency and a binary footprint under 20KB), and other Linux-based embedded systems. TFLite models use the FlatBuffers binary format (the .tflite file extension) which is memory-mapped for zero-copy access — no parsing or memory allocation overhead at model load time.

The conversion pipeline starts from a trained Keras or SavedModel, applies optional optimisations (INT8 quantization, float16 quantization, or dynamic range quantization), and produces a .tflite file. The following code demonstrates the complete conversion pipeline from a Keras model to an INT8 quantized TFLite model ready for microcontroller deployment:

import tensorflow as tf
import numpy as np

# Convert Keras model to TFLite for Android/iOS deployment
keras_model = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), include_top=True, weights='imagenet'
)

# Standard float32 conversion
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
tflite_model = converter.convert()
open("mobilenet_v3.tflite", "wb").write(tflite_model)
# Size: 4.3 MB vs Keras: 17 MB

# INT8 quantization (requires representative dataset)
def representative_dataset():
    # NOTE: random noise here is a placeholder. In practice, yield real
    # samples from the training distribution, or the calibration scales
    # will be poor and accuracy will suffer.
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
open("mobilenet_v3_int8.tflite", "wb").write(tflite_int8)
# Size: 1.1 MB — fits in microcontroller flash!

# Run inference with TFLite runtime
interpreter = tf.lite.Interpreter(model_path="mobilenet_v3_int8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.set_tensor(input_details[0]['index'], image_batch)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]['index'])

The INT8 TFLite model at 1.1 MB fits comfortably in the flash storage of an STM32F4 series microcontroller (typically 1MB flash, 192KB RAM). When setting inference_input_type = tf.int8, the model expects integer-quantized inputs — the host application must apply the same quantization scale and zero-point to the raw input data before passing it to the interpreter. This quantization metadata (scale, zero_point) is available from input_details[0]['quantization']. Failing to quantize inputs correctly is one of the most common deployment bugs in TFLite applications and will silently produce garbage predictions with no error message.
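That input-quantization step can be sketched as follows; `quantize_input` is a hypothetical helper, and the scale/zero-point values below are made-up examples (the real ones come from `interpreter.get_input_details()`):

```python
import numpy as np

# Hypothetical helper: quantize a float32 input with the interpreter's
# own scale and zero_point before calling set_tensor.
def quantize_input(x_float, input_detail):
    scale, zero_point = input_detail['quantization']
    q = np.round(x_float / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

# Example metadata; real values come from input_details[0]['quantization']
detail = {'quantization': (0.0078125, 0)}
x = np.array([[0.5, -0.5]], dtype=np.float32)
print(quantize_input(x, detail))  # → [[ 64 -64]]
```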

ONNX: Cross-Platform Inference

The Open Neural Network Exchange (ONNX) format is a framework-agnostic model representation that enables models trained in any framework (PyTorch, TensorFlow, JAX, scikit-learn) to be deployed in any runtime that supports ONNX. ONNX Runtime (ORT) is Microsoft's high-performance inference engine for ONNX models, which typically outperforms native PyTorch CPU inference by 50-200% and supports execution providers for GPU (CUDA, TensorRT), NPU (Qualcomm QNN, Apple CoreML), and edge devices. The ONNX ecosystem is particularly valuable for cross-platform deployment scenarios where the training framework differs from the deployment target — a common situation in production where models are trained in PyTorch but must run efficiently on iOS (via CoreML EP), Android (via NNAPI EP), or embedded Linux (via ACL EP).

import torch
import onnx
import onnxruntime as ort
import numpy as np
import time

# ONNX: framework-agnostic model exchange format
# Train in PyTorch → export to ONNX → run anywhere (CPU, GPU, edge, browser)

model = torch.load("production_model.pt", weights_only=False)  # full pickled model (PyTorch ≥ 2.6 defaults to weights_only=True)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # batch_size=1 for edge

# Export to ONNX
torch.onnx.export(
    model, dummy_input, "model.onnx",
    export_params=True,
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}  # variable batch
)

# Validate
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Run with ONNX Runtime (50-100% faster than PyTorch CPU on many models)
ort_session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]  # GPU preferred, CPU fallback
)

# Benchmark comparison
x_np = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_pt = torch.from_numpy(x_np)

# PyTorch CPU inference
with torch.no_grad():
    t = time.time()
    for _ in range(100):
        _ = model(x_pt)
pt_ms = (time.time() - t) / 100 * 1000  # average ms per run over 100 runs

# ONNX Runtime inference
t = time.time()
for _ in range(100):
    _ = ort_session.run(None, {"input": x_np})
ort_ms = (time.time() - t) / 100 * 1000

print(f"PyTorch CPU: {pt_ms:.1f} ms | ONNX Runtime: {ort_ms:.1f} ms | Speedup: {pt_ms/ort_ms:.2f}x")

ONNX Runtime applies a graph-level optimisation pass before execution: operation fusion (combining adjacent pointwise operations into single kernels), constant folding (pre-computing operations with static inputs at load time), and layout transformations (choosing memory layouts that maximise cache efficiency for the target architecture). These optimisations are applied automatically and require no changes to the model. For INT8 deployment via ONNX, the quantization-aware training workflow in PyTorch can be exported directly; alternatively, ONNX Runtime's built-in dynamic quantization tool can be applied post-export for a simpler pipeline at the cost of some accuracy compared to QAT.

Edge Hardware Landscape

The edge AI hardware landscape has diversified enormously, from milliwatt microcontrollers to multi-hundred-watt edge server SoCs. Choosing the right hardware for a given use case requires understanding the compute, power, memory, and connectivity budget, as well as the software ecosystem available for the target.

| Hardware | Compute (TOPS) | Power (W) | RAM | Price (USD) | Best For |
| --- | --- | --- | --- | --- | --- |
| Raspberry Pi 5 | ~0.1 (CPU only, no dedicated NPU) | 5-15 | 4-8 GB LPDDR4X | $60-80 | Prototyping, TFLite/ONNX CPU inference, hobbyist edge projects, robotics controllers |
| NVIDIA Jetson Orin Nano | 40 (CUDA + DLA) | 5-15 | 8 GB LPDDR5 | $149 (module) | Real-time video analytics, autonomous vehicles, robotics, industrial vision |
| Google Coral TPU | 4 (INT8 only) | 2 (USB) / 0.5 (M.2) | None (coprocessor) | $30-75 | Accelerating TFLite INT8 models on RPi; IoT object detection; low-power continuous inference |
| Apple Neural Engine (A17 Pro) | 35 | ~2 (ML workload budget) | 8 GB unified memory | N/A (iPhone 15 Pro SoC) | CoreML model deployment; on-device LLM inference (Phi-3, Mistral 7B); real-time audio ML |
| Qualcomm Hexagon DSP (Snapdragon 8 Gen 3) | 45 (NPU) | ~3 (peak AI workload) | 12-16 GB LPDDR5X | N/A (flagship Android SoC) | Android on-device ML; QNN SDK deployment; generative AI features; always-on sensor processing |
Real-World Deployment

On-Device LLM: Phi-3 Mini on Apple Neural Engine

Apple's 2024 on-device model deployment represents a landmark in edge AI capability. Running a quantized Phi-3 Mini (3.8B parameters, INT4 quantized to ~2GB) on the A17 Pro Neural Engine achieves approximately 20-30 tokens per second — sufficient for interactive text generation without any cloud connection. The key enabling technologies are: Apple's unified memory architecture (no PCIe bandwidth bottleneck between CPU and accelerator); the Neural Engine's 35 TOPS compute budget; INT4 weight compression reducing memory bandwidth demand by 8x compared to float32; and Core ML's optimised attention operator implementation. This deployment pattern — INT4 or INT8 quantized small language models running on device NPUs — will define the next generation of privacy-preserving AI assistants across smartphones, tablets, and laptops.

NAS, Low-Rank Factorisation & Compression Comparison

Beyond quantization, pruning, and distillation, two additional techniques are important in the practitioner's compression toolkit. Neural Architecture Search (NAS) automates the design of compute-efficient model architectures by searching over a space of architecture choices (layer widths, depths, kernel sizes, skip connections) to find the Pareto-optimal architecture for a given compute/accuracy target. Once-for-All (OFA) and MobileNetV3 were produced via hardware-aware NAS, producing architectures that achieve much better accuracy-latency trade-offs than manually designed models. Low-rank factorisation decomposes large weight matrices (or convolution kernels) into products of smaller matrices, reducing both storage and FLOPs. A weight matrix W of shape [M, N] can be approximated as UV where U is [M, r] and V is [r, N] with rank r much smaller than min(M,N), reducing the parameter count from MN to r(M+N). This is directly applicable to the fully-connected layers in transformer models where large linear projections dominate parameter count and compute.
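The factorisation can be sketched with a truncated SVD in numpy. One caveat on the example below: a random matrix is nearly full-rank, so its reconstruction error is pessimistic; trained weight matrices are typically much closer to low-rank.

```python
import numpy as np

# Truncated-SVD factorisation sketch: W [M, N] ≈ U_r @ V_r with rank r,
# cutting parameters from M*N to r*(M+N).
M, N, r = 512, 512, 64
W = np.random.randn(M, N).astype(np.float32)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r = U[:, :r] * s[:r]   # [M, r], singular values folded into U
V_r = Vt[:r, :]          # [r, N]

params_before = M * N
params_after = r * (M + N)
rel_err = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
print(f"params: {params_before} -> {params_after} "
      f"({params_before / params_after:.1f}x smaller)")
print(f"relative reconstruction error: {rel_err:.3f}")
```

In practice the factorised layers are fine-tuned briefly after decomposition to recover the accuracy lost to truncation.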

| Technique | Size Reduction | Accuracy Loss | Speed Gain | Complexity | When to Use |
| --- | --- | --- | --- | --- | --- |
| Quantization (INT8 PTQ) | 4x | 0.5-1.5% | 2-4x CPU | Low — no retraining needed | First technique to apply; works for most models; available in all runtimes |
| Quantization (INT8 QAT) | 4x | <0.5% | 2-4x CPU | Medium — requires retraining | When PTQ accuracy loss is unacceptable; accuracy-critical production models |
| Structured Pruning | 1.5-3x (at 50-70% filter pruning) | 1-3% | 1.3-2x | Medium — requires iterative prune+finetune | Reducing FLOPs for CPU/GPU latency; combine with quantization for multiplicative effect |
| Knowledge Distillation | 5-10x (vs teacher) | 1-3% (vs teacher) | 5-10x | High — requires teacher+student training | When target device is too constrained for the full model; student can be architecture-agnostic |
| NAS | 5-10x vs manually designed | Often improves accuracy | 5-10x | Very High — multi-GPU days to weeks of search | New model designs targeting specific hardware; justify cost only for large-scale deployment |
| Low-Rank Factorisation | 2-5x (layer dependent) | 1-4% | 1.5-3x | Medium — requires decomposition + finetuning | Large linear layers in transformers (LoRA for LLMs); embedding compression; LSTM weight reduction |

In practice, the most effective edge deployment strategy combines multiple techniques: start with architecture selection (choose or train a NAS-optimised backbone like MobileNetV3 or EfficientNet-Lite as the starting point rather than adapting a large model); apply structured pruning to remove 30-50% of filters with minimal accuracy cost; then apply INT8 quantization to the pruned model for an additional 4x size reduction and 2-4x latency improvement; and finally deploy via an optimised runtime (TFLite with XNNPACK, ONNX Runtime, or TensorRT for Jetson). This pipeline can achieve 20-40x size reduction and 5-15x latency improvement compared to the original float32 model, with total accuracy loss typically under 2-3% on standard benchmarks.

Exercises

Beginner

Exercise 1: ONNX Export and Latency Benchmark

Export a pre-trained MobileNetV3-Small from PyTorch to ONNX format and run inference using ONNX Runtime. Compare latency to native PyTorch inference on CPU.

  • Export the model using torch.onnx.export with opset version 17.
  • Validate the exported model using onnx.checker.check_model.
  • Run 100 iterations of inference with both PyTorch and ONNX Runtime and compute average, p50, p95, and p99 latencies.
  • What is the speedup ratio? Is it consistent for different batch sizes (1, 4, 16)?

Intermediate

Exercise 2: Post-Training INT8 Quantization

Apply post-training INT8 static quantization to a CNN (MobileNetV3 or a custom classifier) using PyTorch's quantization API.

  • Collect 200 representative calibration samples from the validation set.
  • Apply static quantization with the 'qnnpack' backend (ARM) and 'fbgemm' backend (x86). Compare results on your machine.
  • Measure: model size in MB, inference latency (avg over 100 runs), and top-5 accuracy on ImageNet validation set.
  • What accuracy do you lose on INT8 vs float32? Does the accuracy loss differ if you use a calibration set that only includes 5 ImageNet classes vs all 1000?

Advanced

Exercise 3: TFLite INT8 Deployment on Embedded Target

Convert a custom trained Keras image classifier to TFLite INT8 with representative dataset calibration, and benchmark it on a Raspberry Pi (or simulate ARM performance using QEMU).

  • Convert to TFLite float32 and INT8, collecting both model sizes and accuracies on a 1000-sample test set.
  • Verify that you correctly quantize the input tensor using the scale and zero_point from input_details[0]['quantization'].
  • Benchmark inference latency on the Pi (or ARM emulator): can you achieve under 50ms per image?
  • If you have a Google Coral USB Accelerator: compile the model with edgetpu_compiler, deploy to the Coral TPU, and compare TPU vs CPU latency on the Pi.

Federated Learning & Privacy-Preserving Edge AI

Edge AI and privacy are natural allies: when inference runs on the device, raw data never leaves the user's hands. But many AI applications require models that improve over time as they see more data, and centralising all user data for retraining raises serious privacy concerns. Federated learning addresses this by bringing the model training to the data, rather than bringing data to the model. In a federated learning setup, each device trains on its local data and sends only model updates (gradients or weight deltas) to a central server, which aggregates them into a global model improvement without ever seeing the raw data.

The canonical federated learning algorithm, FedAvg (McMahan et al., 2017), works as follows: the server distributes the current global model to a randomly selected cohort of devices; each device trains for a fixed number of local epochs on its local data and sends the resulting weight delta to the server; the server computes a weighted average of all received deltas (weighting by local dataset size) and applies the aggregated update to the global model; the cycle repeats. FedAvg approaches the accuracy of centralised training on many standard benchmarks (convergence slows, and accuracy can degrade, when device data is highly non-IID), with the privacy benefit that no device's raw data is ever shared. Apple's Federated Learning platform (underpinning Siri personalisation, QuickType suggestions, and emoji recommendations) and Google's Gboard keyboard are the most prominent production deployments, serving hundreds of millions of users.
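The server-side aggregation step of FedAvg can be sketched in a few lines of framework-agnostic Python. This is an illustrative sketch, not a production implementation: plain lists stand in for weight tensors, and all names (`fedavg_aggregate`, the toy inputs) are invented for the example.

```python
def fedavg_aggregate(global_weights, client_deltas, client_sizes):
    """FedAvg server step: weighted average of client weight deltas.

    global_weights : list[float]        current global model parameters
    client_deltas  : list[list[float]]  per-client weight deltas from local training
    client_sizes   : list[int]          local dataset size per client (the weights)
    """
    total = sum(client_sizes)
    aggregated = [0.0] * len(global_weights)
    for delta, n in zip(client_deltas, client_sizes):
        # Each client's contribution is proportional to its local dataset size
        for i, d in enumerate(delta):
            aggregated[i] += (n / total) * d
    # Apply the aggregated update to the global model
    return [w + d for w, d in zip(global_weights, aggregated)]

# One round: client 0 has 3x the data of client 1, so its delta dominates
new_global = fedavg_aggregate(
    global_weights=[1.0, 1.0],
    client_deltas=[[0.4, 0.0], [0.0, 0.4]],
    client_sizes=[300, 100],
)
print(new_global)
```

In a real system each `client_delta` is a full set of model tensors and the aggregation runs server-side over a sampled cohort, but the weighted-average arithmetic is exactly this.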

Privacy Amplification and Differential Privacy

Even without sharing raw data, gradient updates can leak information about the training data — gradient inversion attacks have demonstrated that input data can sometimes be reconstructed from gradient vectors. The standard defence is differential privacy (DP): before a device sends its gradient update to the server, it clips the gradient to a maximum norm (bounding the sensitivity) and adds calibrated Gaussian or Laplace noise. The resulting mechanism guarantees that including or excluding any one individual's data changes the probability of any observed output by at most a factor of e^epsilon, so the server cannot confidently infer whether a given individual's data contributed to the update. The privacy-accuracy trade-off is parameterised by the privacy budget epsilon: smaller epsilon means stronger privacy guarantees but larger accuracy cost (because more noise is added to the gradients). In practice, privacy budgets of epsilon = 1-10 have been found to produce acceptable accuracy for many production tasks.
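The clip-and-noise mechanism can be sketched in plain Python. This is a minimal illustration of the two steps (norm clipping, then Gaussian noise calibrated to the clip norm); the function name and parameters are invented for the example, and a real deployment would use a DP library with a privacy accountant to track epsilon.

```python
import math
import random

def privatize_update(grad, clip_norm=1.0, noise_multiplier=1.1, rng=random):
    """DP-style update sanitisation: clip to a max L2 norm, add Gaussian noise.

    Noise std is noise_multiplier * clip_norm, the usual calibration for the
    Gaussian mechanism; the resulting epsilon follows from a privacy accountant.
    """
    # 1. Clip: rescale the whole vector so its L2 norm is at most clip_norm,
    #    bounding any single example's influence (the sensitivity)
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    clipped = [g * scale for g in grad]
    # 2. Add calibrated Gaussian noise to every coordinate
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]

random.seed(0)
noisy = privatize_update([3.0, 4.0], clip_norm=1.0, noise_multiplier=1.1)
print(noisy)
```

Note that clipping happens per update before noise is added; without the clip, a single outlier gradient could dominate the aggregate and the noise could not be calibrated to a bounded sensitivity.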

Secure aggregation complements differential privacy at the protocol level: cryptographic techniques (specifically secret sharing or homomorphic encryption) are used to aggregate gradient updates from multiple devices such that the server only sees the sum of updates, not individual device updates. This provides additional protection against a malicious server that might try to learn individual device gradients from the aggregation step. TensorFlow Federated and PySyft are the leading open-source libraries for implementing federated learning with differential privacy and secure aggregation.
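The core idea behind secure aggregation — pairwise masks that cancel in the server's sum — can be shown with a toy sketch. This is a deliberately simplified illustration (a shared string seed stands in for a cryptographic pairwise secret, and there is no dropout handling); all names are invented for the example.

```python
import random

def masked_updates(updates, seed=42):
    """Toy pairwise masking: each device pair (i, j), i < j, derives a shared
    random mask; device i adds it, device j subtracts it. Each masked update
    looks random to the server, but the masks cancel in the sum."""
    n = len(updates)
    dim = len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # Stand-in for a pairwise secret established via key agreement
            pair_rng = random.Random(f"{seed}-{i}-{j}")
            mask = [pair_rng.uniform(-10, 10) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] += mask[k]
                masked[j][k] -= mask[k]
    return masked

updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked = masked_updates(updates)
# The server sees only masked vectors, yet the coordinate-wise sums agree:
print([sum(col) for col in zip(*updates)])
print([sum(col) for col in zip(*masked)])
```

Production protocols (e.g. the Bonawitz et al. design used at Google) add key agreement, secret sharing to survive device dropout, and authenticated channels on top of this cancellation trick.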

OTA Model Updates and Versioning

A critical operational challenge for edge AI that does not arise in cloud serving is the model update problem: how do you safely deploy an improved model to millions of heterogeneous devices without causing crashes, degraded performance, or security vulnerabilities on devices with slow update adoption? The standard pattern is over-the-air (OTA) update with staged rollout. The updated model is packaged (typically as a FlatBuffers .tflite file or ONNX binary) with a cryptographic signature and pushed to devices in phases: 1% of the fleet receives the update in the first wave, with an automated A/B comparison of inference accuracy and crash rate against the previous model version on the same devices. If the A/B comparison shows no regression after 24-48 hours, the rollout proceeds to 10%, then 100%. Any regression at any stage triggers an automatic rollback — the device receives a command to revert to the previous signed model version. Devices that cannot receive the new model (due to connectivity loss, storage constraints, or hardware incompatibility) continue running the previous version with a defined end-of-life date at which the model server updates the compatibility requirements.
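The staged-rollout decision logic described above reduces to a small state machine. The sketch below is illustrative only: the stage fractions, metric names, and regression thresholds are invented for the example, and a real system would also gate on statistical significance of the A/B comparison.

```python
STAGES = [0.01, 0.10, 1.00]  # fraction of the fleet receiving the update per wave

def next_rollout_action(stage_idx, candidate_metrics, baseline_metrics,
                        max_acc_drop=0.005, max_crash_increase=0.001):
    """Decide the next step of a staged OTA model rollout.

    Returns ('advance', next_stage), ('complete', None), or ('rollback', None)
    based on an A/B comparison of the candidate model against the currently
    deployed model on the same device cohort.
    """
    acc_regressed = (baseline_metrics['accuracy']
                     - candidate_metrics['accuracy']) > max_acc_drop
    crash_regressed = (candidate_metrics['crash_rate']
                       - baseline_metrics['crash_rate']) > max_crash_increase
    if acc_regressed or crash_regressed:
        return ('rollback', None)          # revert fleet to the previous signed model
    if stage_idx + 1 < len(STAGES):
        return ('advance', stage_idx + 1)  # widen rollout to the next wave
    return ('complete', None)              # entire fleet updated

action = next_rollout_action(
    stage_idx=0,
    candidate_metrics={'accuracy': 0.912, 'crash_rate': 0.0002},
    baseline_metrics={'accuracy': 0.910, 'crash_rate': 0.0002},
)
print(action)  # ('advance', 1)
```

The key design property is that rollback is possible from any stage, which is why the previous signed model version must remain available on-device (or re-downloadable) until the rollout completes.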

Key Insight: The "edge vs. cloud" framing is a false dichotomy in production systems. The most capable and robust edge AI deployments use a hybrid architecture: lightweight, latency-optimised models running on device for the common case (keyword detection, basic gesture recognition, simple classification) with selective offloading to cloud inference for rare, high-complexity inputs where the edge model's confidence falls below a threshold. This "confidence-gated offloading" pattern captures most of the latency and privacy benefits of edge inference while preserving accuracy on the tail of difficult inputs.
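Confidence-gated offloading is simple to express in code. The sketch below uses stub models for illustration; in practice `edge_model` would be a quantized on-device network returning its max softmax probability, and the cloud path would be an RPC with its own timeout and offline fallback.

```python
def predict_with_offloading(x, edge_model, cloud_model, threshold=0.80):
    """Hybrid inference: serve the edge prediction when the model is confident,
    otherwise offload to the higher-capacity cloud model.

    Each model maps an input to a (label, confidence) pair, where confidence
    is the max softmax probability of the prediction.
    """
    label, confidence = edge_model(x)
    if confidence >= threshold:
        return label, 'edge'    # common case: fast, private, works offline
    label, _ = cloud_model(x)   # rare hard inputs: accept the network round-trip
    return label, 'cloud'

# Stub models for illustration only
edge = lambda x: ('cat', 0.95) if x == 'easy' else ('cat', 0.55)
cloud = lambda x: ('lynx', 0.99)

print(predict_with_offloading('easy', edge, cloud))  # ('cat', 'edge')
print(predict_with_offloading('hard', edge, cloud))  # ('lynx', 'cloud')
```

The threshold is a tunable product decision: raising it shifts more traffic to the cloud (better tail accuracy, higher latency and cost), and it should be calibrated against the edge model's actual confidence distribution rather than set a priori.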

Browser-Based Edge AI with WebAssembly and WebGL

A special case of edge deployment that has grown dramatically in importance is in-browser ML inference using WebAssembly (WASM) and WebGL. ONNX Runtime Web supports running ONNX models directly in the browser via WASM (CPU) or WebGL (GPU) backends, with WASM inference typically within 2-5x of native CPU performance. TensorFlow.js provides a similar capability with a full JavaScript API and support for WebGL GPU acceleration. In-browser inference enables: privacy-preserving web features where user input (images, voice, text) is never sent to a server; offline-capable applications that continue to function when connectivity is lost; and reduced server infrastructure costs for applications where inference can be shifted to the client. The primary constraints are model size (large models must be downloaded, consuming bandwidth and browser memory) and the absence of dedicated ML hardware (WASM/WebGL cannot access device NPUs). Models targeting browser deployment should be limited to under 20MB (uncompressed) and under 10ms inference latency to avoid degrading user experience; this typically restricts in-browser inference to lightweight CNNs, MobileNets, and small transformers rather than large language models or detection networks.

Edge AI Deployment Plan Generator

Document your edge AI system's hardware targets, compression strategy, and deployment approach. Download as Word, Excel, PDF, or PowerPoint for engineering review and stakeholder communication.


Quantization-Aware Training (QAT)

Post-training quantization (PTQ) works by calibrating a float32 model on a representative dataset and then quantizing weights and activations offline. For most models the accuracy degradation is small, but for architectures with wide dynamic range activations or depthwise-separable convolutions, PTQ can cause unacceptable accuracy loss — particularly on classification boundaries where small numerical shifts change the argmax. Quantization-Aware Training (QAT) addresses this by inserting fake quantization nodes into the computation graph during training: forward passes simulate the rounding effects of INT8 arithmetic while backward passes use straight-through estimators to propagate gradients through the non-differentiable rounding operation. The model learns weight distributions that are robust to quantization, typically recovering 0.5–2% accuracy compared to PTQ on sensitive architectures.

Python — PyTorch Quantization-Aware Training with Full Training Loop
import torch
import torch.nn as nn
import torch.quantization as tq
from torch.quantization import (
    get_default_qat_qconfig,
    prepare_qat,
    convert,
    QConfig,
    FakeQuantize,
    default_per_channel_weight_observer,
    default_histogram_observer,
)
from torch.optim.lr_scheduler import CosineAnnealingLR

# ── 1. Define a QAT-compatible model ──────────────────────────────────────
class MobileBlock(nn.Module):
    """Depthwise-separable block — sensitive to PTQ; benefits from QAT."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        self.pw = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.pw(self.dw(x))

class LightweightClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # QuantStub/DeQuantStub mark entry/exit for quantized region
        self.quant = tq.QuantStub()
        self.dequant = tq.DeQuantStub()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(
            MobileBlock(32, 64),
            MobileBlock(64, 128, stride=2),
            MobileBlock(128, 256, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.quant(x)
        x = self.stem(x)
        x = self.blocks(x)
        x = self.pool(x).flatten(1)
        x = self.head(x)
        x = self.dequant(x)
        return x

    def fuse_model(self):
        """Fuse Conv+BN+ReLU triplets before QAT preparation."""
        tq.fuse_modules(self.stem, ['0', '1', '2'], inplace=True)
        for block in self.blocks:
            tq.fuse_modules(block.dw, ['0', '1', '2'], inplace=True)
            tq.fuse_modules(block.pw, ['0', '1', '2'], inplace=True)

# ── 2. Configure QAT qconfig ──────────────────────────────────────────────
model = LightweightClassifier(num_classes=10)

# Load a float32 pretrained checkpoint (fine-tune from here)
# model.load_state_dict(torch.load('pretrained_float.pth'))

model.train()
model.fuse_model()  # MUST fuse before preparing for QAT

# fbgemm for x86 servers; qnnpack for ARM/mobile
backend = 'qnnpack'
torch.backends.quantized.engine = backend  # engine must match the qconfig backend
model.qconfig = get_default_qat_qconfig(backend)

# Alternatively use per-channel weight quantization for better accuracy
# model.qconfig = QConfig(
#     activation=FakeQuantize.with_args(observer=default_histogram_observer),
#     weight=FakeQuantize.with_args(
#         observer=default_per_channel_weight_observer,
#         quant_min=-128, quant_max=127, dtype=torch.qint8,
#         qscheme=torch.per_channel_symmetric
#     )
# )

prepare_qat(model, inplace=True)

# ── 3. QAT fine-tuning loop ───────────────────────────────────────────────
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=10)

NUM_QAT_EPOCHS = 10
for epoch in range(NUM_QAT_EPOCHS):
    model.train()

    # Freeze BN statistics and observer updates after 3 epochs
    # to allow fake quantization to stabilise
    if epoch == 3:
        model.apply(tq.disable_observer)   # freeze scale/zero_point
        model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)

    running_loss = 0.0
    correct = 0
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        correct += (outputs.argmax(1) == labels).sum().item()

    scheduler.step()
    train_acc = correct / len(train_loader.dataset)
    print(f"Epoch {epoch+1:02d} | loss: {running_loss/len(train_loader):.4f} | acc: {train_acc:.4f}")

# ── 4. Convert QAT model to quantized INT8 ────────────────────────────────
model.eval()
quantized_model = convert(model, inplace=False)

# Compare INT8 model vs float32 on validation set
def evaluate(m, loader):
    m.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            out = m(x)
            correct += (out.argmax(1) == y).sum().item()
            total += y.size(0)
    return correct / total

float_acc = evaluate(model, val_loader)        # still has fake quant
int8_acc  = evaluate(quantized_model, val_loader)
print(f"Float32 (fake-quantized) accuracy: {float_acc:.4f}")
print(f"INT8 quantized accuracy:            {int8_acc:.4f}")
print(f"QAT accuracy drop:                  {(float_acc - int8_acc)*100:.2f}%")

# ── 5. Export to TorchScript and check size ───────────────────────────────
torch.backends.quantized.engine = backend
scripted = torch.jit.script(quantized_model)
scripted.save('model_qat_int8.pt')

import os
# Assumes the float32 baseline was exported earlier, e.g.
# torch.jit.script(float_model).save('model_float32.pt')
float_size = os.path.getsize('model_float32.pt') / 1024 / 1024
int8_size  = os.path.getsize('model_qat_int8.pt') / 1024 / 1024
print(f"Float32 size: {float_size:.1f} MB")
print(f"INT8 QAT size: {int8_size:.1f} MB")
print(f"Compression ratio: {float_size/int8_size:.1f}x")

Knowledge Distillation for Edge Deployment

Knowledge distillation trains a small student model to mimic the soft probability outputs of a large teacher model. The soft targets (temperature-scaled logits from the teacher) carry more information per example than hard labels: a near-zero probability on a wrong class still signals that the model considers that class slightly plausible, providing supervision signal that hard one-hot labels lose. The distillation loss is a weighted combination of the standard cross-entropy with hard labels and the KL divergence between teacher and student soft probabilities.
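The combined loss can be written out explicitly for a single example. The sketch below is framework-agnostic pure Python for clarity (in PyTorch this is a few `torch.nn.functional` calls over batches); the function names and the alpha/temperature values are illustrative choices, not canonical settings.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) at temperature T
       + (1 - alpha) * cross-entropy with the hard label."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[hard_label])
    # The T^2 factor keeps soft-target gradient magnitudes comparable
    # across temperatures (Hinton et al., 2015)
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

loss = distillation_loss(
    student_logits=[2.0, 0.5, -1.0],
    teacher_logits=[3.0, 1.0, -2.0],
    hard_label=0,
)
print(f"{loss:.4f}")
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains, which is why distillation can be seen as regularising standard training toward the teacher's output geometry.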

Technique | Size Reduction | Latency Reduction | Accuracy Impact | Implementation Complexity | Best For
--- | --- | --- | --- | --- | ---
Dynamic Quantization | 2–4x (weights only) | 1.3–2x (CPU) | <1% drop (NLP/RNN) | Very low — 2 lines of code | LSTM, Transformer, NLP models
Static PTQ (INT8) | 3–4x | 2–4x (CPU/ARM) | 0.5–2% drop typical | Low — calibration dataset needed | CNNs, most vision models
QAT (INT8) | 3–4x | 2–4x (CPU/ARM) | <0.5% vs float32 | Medium — requires fine-tuning | Accuracy-sensitive models, MobileNets
Structured Pruning | 2–5x (channel reduction) | 2–5x (hardware-friendly) | 1–5% (fine-tuning required) | High — iterative prune+fine-tune | CNN channel reduction for ARM NEON
Knowledge Distillation | 5–50x (architecture change) | 5–50x | Varies (student architecture dependent) | High — teacher-student training loop | Maximum compression; new architecture design
Low-Rank Factorization | 2–4x (selected layers) | 1.5–3x | 1–3% (layer-dependent) | Medium — SVD decomposition + fine-tune | Large fully-connected layers, embedding tables

TensorFlow Lite for Microcontrollers (TFLite Micro)

TFLite Micro extends edge inference to the most constrained hardware class: microcontrollers with no operating system, no dynamic memory allocation, and as little as 256KB of flash and 64KB of RAM. The runtime is written in C++ with no heap allocation — all memory is managed through a statically allocated tensor arena. This enables inference on ARM Cortex-M series (Arduino Nano 33 BLE Sense, STM32), ESP32, and specialised ML microcontrollers (Ambiq Apollo4, Syntiant NDP120) with a power draw measured in milliwatts rather than watts.

The deployment pipeline for TFLite Micro differs from standard TFLite: after INT8 quantization, the .tflite FlatBuffer file is converted to a C byte array using xxd -i and compiled directly into the firmware image. The model is loaded from flash memory rather than a file system — there is no file system. The tensor arena size must be determined empirically: too small causes a runtime allocation failure, too large wastes scarce RAM. The usual methodology is a binary search: start from a generous arena, shrink until AllocateTensors() fails, then step back up to the smallest size that succeeds.
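The xxd -i step can be reproduced portably with a short Python script, which is convenient on build machines without xxd. This is a sketch with illustrative names (`g_model`, the output layout); the emitted array matches what TFLite Micro examples expect, including 16-byte alignment for the FlatBuffer.

```python
def tflite_to_c_array(model_bytes, var_name='g_model'):
    """Emit a C/C++ source snippet embedding a .tflite FlatBuffer as a byte
    array, equivalent to `xxd -i model.tflite` plus alignment for TFLite Micro."""
    lines = [f'alignas(16) const unsigned char {var_name}[] = {{']
    for i in range(0, len(model_bytes), 12):
        chunk = model_bytes[i:i + 12]
        lines.append('  ' + ', '.join(f'0x{b:02x}' for b in chunk) + ',')
    lines.append('};')
    lines.append(f'const unsigned int {var_name}_len = {len(model_bytes)};')
    return '\n'.join(lines)

# Usage: read the quantized model and write the firmware source file, e.g.
#   data = open('model_int8.tflite', 'rb').read()
#   open('model_data.cc', 'w').write(tflite_to_c_array(data))
source = tflite_to_c_array(b'\x1c\x00\x00\x00TFL3')
print(source)
```

The generated .cc file is compiled into the firmware image, and the runtime receives `g_model` as a raw pointer into flash — no file I/O is involved at inference time.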

Architecture Consideration

Supported Operations on Microcontrollers

TFLite Micro ships with a subset of TFLite operators built as reference implementations in portable C++. As of 2025, this includes: Conv2D (INT8, INT16), DepthwiseConv2D, FullyConnected, AveragePool2D, MaxPool2D, Reshape, Softmax, ReLU/ReLU6, Add, Mul, Concatenation, LSTM (statically sized), GRU, and Embedding. Notably absent from the default build: Transformer attention, dynamic shapes, string tensors, and complex custom ops. For MCU deployment, model architecture selection must be constrained to supported ops from the outset — retrofitting an attention-based model to TFLite Micro is generally not feasible. The recommended architectures for MCU-scale vision are DS-CNN, MobileNetV1 (scaled to <0.25 width multiplier), and EfficientNet-Lite0 at INT8; for audio keyword spotting, the TFLite Micro pre-trained models for 250KB flash targets are the standard starting point.

CoreML Conversion for Apple Neural Engine

The Apple Neural Engine (ANE) is a dedicated ML accelerator present in Apple Silicon (A11 Bionic and later, M1 and later). On an M1 chip the ANE delivers up to 11 TOPS — comparable to a discrete mid-range GPU — at a small fraction of the power budget. Access to the ANE requires deploying models through CoreML, Apple's on-device ML framework. The conversion pipeline from PyTorch or TFLite to CoreML uses the coremltools library:

Python — CoreML Conversion from PyTorch with ANE Optimisation
import coremltools as ct
import torch
import torch.nn as nn

# ── 1. Export PyTorch model to TorchScript (required for CoreML) ──────────
model = LightweightClassifier(num_classes=10)
model.load_state_dict(torch.load('model_float32.pth', map_location='cpu'))
model.eval()

# Trace with representative input (batch size 1 for ANE compatibility)
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# ── 2. Convert to CoreML with ANE-friendly compute units ─────────────────
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(
        name='input_image',
        shape=example_input.shape,
        scale=1.0 / 255.0,
        bias=[0, 0, 0],
        color_layout=ct.colorlayout.RGB
    )],
    outputs=[ct.TensorType(name='class_logits')],
    # ALL: use ANE + GPU + CPU with automatic fallback
    compute_units=ct.ComputeUnit.ALL,
    # FP16 compute precision (the ANE's native compute precision)
    compute_precision=ct.precision.FLOAT16,
    convert_to='mlprogram',  # ML Program format (iOS 15+, macOS 12+)
)

# ── 3. Add metadata for model discoverability ─────────────────────────────
mlmodel.short_description = 'Lightweight image classifier for on-device inference'
mlmodel.author = 'MLOps Team'
mlmodel.version = '2.3.1'
mlmodel.license = 'Proprietary'

# Label outputs for human-readable predictions
import json
class_labels = ['cat', 'dog', 'car', 'tree', 'person',
                 'bicycle', 'bird', 'boat', 'chair', 'table']
mlmodel.user_defined_metadata['class_labels'] = json.dumps(class_labels)

# ── 4. Save and benchmark ─────────────────────────────────────────────────
mlmodel.save('LightweightClassifier.mlpackage')  # ML Program format

# Profile on device with Xcode's Core ML performance reports (macOS/iOS)
# to verify which ops are dispatched to the ANE vs. GPU/CPU.

# ── 5. INT8 weight quantization (further compression beyond FP16) ─────────
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig, OptimizationConfig,
    linearly_quantize_weights
)

config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode='linear_symmetric',
        dtype='int8',
        granularity='per_channel'
    )
)
quantized_mlmodel = linearly_quantize_weights(mlmodel, config)
quantized_mlmodel.save('LightweightClassifier_INT8.mlpackage')

import os
fp16_size = sum(
    os.path.getsize(os.path.join(dirpath, f))
    for dirpath, _, files in os.walk('LightweightClassifier.mlpackage')
    for f in files
) / 1024 / 1024

int8_size = sum(
    os.path.getsize(os.path.join(dirpath, f))
    for dirpath, _, files in os.walk('LightweightClassifier_INT8.mlpackage')
    for f in files
) / 1024 / 1024

print(f"FP16 mlpackage size: {fp16_size:.1f} MB")
print(f"INT8 mlpackage size: {int8_size:.1f} MB")
print(f"Additional compression: {fp16_size/int8_size:.1f}x")
ANE Compatibility Constraints: The Apple Neural Engine only accelerates operations that meet specific shape and data type requirements. Key constraints: batch size must be 1 (ANE does not support batched inference), all tensor shapes must be static (no dynamic shapes), supported ops include Conv2D/Linear/BatchNorm/ReLU/Pooling in FP16, and operations that fall outside these constraints are automatically dispatched to GPU or CPU. Using ComputeUnit.ALL lets CoreML automatically route each op to the fastest available accelerator — this is almost always the right choice. Custom model surgery to split an unsupported op into supported primitives is rarely worth the effort vs. selecting an ANE-compatible architecture from the outset.

Conclusion & Next Steps

Edge AI is not a niche application area — it is increasingly the primary deployment context for AI in consumer devices, industrial systems, healthcare wearables, and autonomous machines. The discipline requires a fundamentally different mindset than cloud AI engineering: rather than scaling compute to the model, edge AI engineers scale the model to the compute, applying a structured compression pipeline — architecture selection, structured pruning, quantization, and runtime optimisation — to produce models that meet strict memory, latency, and power budgets while preserving acceptable accuracy.

The compression toolkit covered in this article — PyTorch quantization (dynamic and static), TFLite INT8 conversion with representative dataset calibration, and ONNX Runtime cross-platform deployment — handles the vast majority of practical edge deployment scenarios. For the most constrained targets (microcontrollers with under 512KB flash), TensorFlow Lite for Microcontrollers with a custom kernel library provides a path to running inference with no operating system dependency. For the highest-performance edge scenarios (autonomous vehicles, high-resolution video analytics), NVIDIA Jetson with TensorRT provides GPU-class inference within a 15-50W power envelope.

A critical consideration that runs through every edge deployment decision is fairness and bias: quantization can affect demographic subgroups non-uniformly, and edge models are often deployed in high-stakes contexts (medical devices, access control, autonomous vehicles) where biased predictions have direct physical consequences. The fairness audit workflow from Part 19 must be applied to the compressed, quantized, edge-deployed model — not just the float32 cloud model — and the calibration dataset for quantization should be explicitly curated to include representative samples from all deployment demographic groups.

The next article shifts from edge deployment back to the infrastructure side, exploring the hardware and distributed systems landscape that powers large-scale AI training and serving in data centres — the complementary engineering challenge to the resource-constrained edge deployment covered here.

Next in the Series

In Part 22: AI Infrastructure, Hardware & Scaling, we cover GPUs and TPUs, distributed training with data parallelism and model parallelism, mixed precision training, the memory hierarchy of modern AI accelerators, and the infrastructure decisions that determine training cost and throughput at scale.
