Evolution of YOLO
You Only Look Once (YOLO) revolutionized object detection by framing it as a single regression problem. Instead of the two-stage, region-proposal pipelines that process an image in multiple passes, YOLO handles the entire image in one forward pass — achieving real-time detection speeds that were previously out of reach.
From YOLOv1 to YOLOv8
Each YOLO version introduced critical innovations:
- YOLOv1 (2016): Single-shot grid-based detection. Divides the image into an S×S grid; each cell predicts B bounding boxes (see the sketch after this list).
- YOLOv2 (2017): Batch normalization, anchor boxes, multi-scale training.
- YOLOv3 (2018): Darknet-53 backbone, Feature Pyramid Network, detection at 3 scales.
- YOLOv4 (2020): CSPDarknet, Mish activation, mosaic augmentation, CIoU loss.
- YOLOv5 (2020): PyTorch implementation, anchor-based, autoanchor, hyperparameter evolution.
- YOLOv8 (2023): Anchor-free, decoupled heads, C2f modules, Distribution Focal Loss.
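The grid arithmetic is worth seeing once; a quick sketch using the YOLOv1 paper's PASCAL VOC settings (S=7, B=2, C=20):

# YOLOv1 output size: each of S*S cells predicts B boxes (x, y, w, h, conf)
# plus C class probabilities shared across the cell's boxes
S, B, C = 7, 2, 20
per_cell = B * 5 + C
print(f"Output tensor: {S}x{S}x{per_cell} = {S * S * per_cell:,} values")  # 7x7x30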
Version Comparison
import numpy as np
# YOLO version comparison: mAP vs FPS on COCO val2017
# Benchmarked on NVIDIA V100 GPU at 640x640 input resolution
detector_data = {
"Model": [
"YOLOv3", "YOLOv4", "YOLOv5s", "YOLOv5m",
"YOLOv8n", "YOLOv8s", "YOLOv8m", "YOLOv8l"
],
"mAP_50_95": [33.0, 43.5, 37.4, 45.4, 37.3, 44.9, 50.2, 52.9],
"FPS_V100": [35, 50, 140, 110, 195, 160, 110, 75],
"Parameters_M": [61.9, 64.4, 7.2, 21.2, 3.2, 11.2, 25.9, 43.7],
"FLOPs_G": [65.9, 91.1, 16.5, 49.0, 8.7, 28.6, 78.9, 165.2]
}
# Display comparison table
print(f"{'Model':<10} {'mAP@50-95':<12} {'FPS':<8} {'Params(M)':<12} {'GFLOPs':<10}")
print("-" * 52)
for i in range(len(detector_data["Model"])):
print(f"{detector_data['Model'][i]:<10} "
f"{detector_data['mAP_50_95'][i]:<12.1f} "
f"{detector_data['FPS_V100'][i]:<8} "
f"{detector_data['Parameters_M'][i]:<12.1f} "
f"{detector_data['FLOPs_G'][i]:<10.1f}")
# Calculate efficiency ratio (mAP per GFLOPs)
efficiency = np.array(detector_data["mAP_50_95"]) / np.array(detector_data["FLOPs_G"])
best_idx = np.argmax(efficiency)
print(f"\nMost efficient: {detector_data['Model'][best_idx]} "
f"({efficiency[best_idx]:.3f} mAP/GFLOP)")
YOLOv8 Architecture Overview
YOLOv8 consists of three major components working together: the Backbone extracts hierarchical features, the Neck fuses multi-scale information, and the Head produces final predictions without anchor priors.
flowchart TD
A[Input Image 640x640x3] --> B[Stem: Conv 3x3 s2]
B --> C[Stage 1: Conv s2 + C2f]
C --> D[Stage 2: Conv s2 + C2f]
D --> E[Stage 3: Conv s2 + C2f]
E --> F[Stage 4: Conv s2 + C2f + SPPF]
F --> G[Upsample 2x]
G --> H[Concat with Stage 3]
H --> I[C2f Neck Block]
I --> J[Upsample 2x]
J --> K[Concat with Stage 2]
K --> L[C2f Neck Block - P3]
L --> M[Conv s2]
M --> N[Concat with I output]
N --> O[C2f Neck Block - P4]
O --> P[Conv s2]
P --> Q[Concat with F output]
Q --> R[C2f Neck Block - P5]
L --> S[Detect Head P3 - 80x80]
O --> T[Detect Head P4 - 40x40]
R --> U[Detect Head P5 - 20x20]
S --> V[NMS + Final Predictions]
T --> V
U --> V
Multi-Scale Detection
YOLOv8 detects objects at three scales, enabling it to find both small and large objects effectively:
- P3 (80×80): Small object detection — stride 8, high spatial resolution
- P4 (40×40): Medium object detection — stride 16, balanced features
- P5 (20×20): Large object detection — stride 32, rich semantic information
Output Tensor Shapes
import numpy as np
# Compute YOLOv8 output tensor shapes for 640x640 input
input_size = 640
num_classes = 80 # COCO dataset classes
reg_max = 16 # DFL distribution bins
# Three detection scales with their strides
scales = {
"P3": {"stride": 8, "description": "Small objects"},
"P4": {"stride": 16, "description": "Medium objects"},
"P5": {"stride": 32, "description": "Large objects"},
}
total_predictions = 0
print("YOLOv8 Output Tensor Shapes (input: 640x640)")
print("=" * 60)
for name, info in scales.items():
grid_size = input_size // info["stride"]
num_anchors = grid_size * grid_size
total_predictions += num_anchors
# Each prediction: 4 * reg_max (box) + num_classes (cls)
box_channels = 4 * reg_max # 64 channels for DFL
cls_channels = num_classes # 80 channels for classification
print(f"\n{name} ({info['description']}):")
print(f" Grid size: {grid_size} x {grid_size} = {num_anchors} predictions")
print(f" Box branch: (batch, {grid_size}, {grid_size}, {box_channels})")
print(f" Cls branch: (batch, {grid_size}, {grid_size}, {cls_channels})")
print(f"\nTotal predictions per image: {total_predictions}")
print(f"Final output shape: (batch, {total_predictions}, {4 + num_classes})")
print(f" = (batch, 8400, 84) for COCO 80 classes")
Backbone: CSPDarknet with C2f
The backbone uses Cross-Stage Partial (CSP) connections to reduce computation while maintaining gradient flow. The key innovation in YOLOv8 is the C2f module (a faster CSP bottleneck built around two convolutions), which replaces YOLOv5's C3 module with a more efficient design.
C2f Module Implementation
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
def darknet_conv(x, filters, kernel_size, strides=1):
"""Conv + BatchNorm + SiLU activation block."""
x = layers.Conv2D(
filters, kernel_size, strides=strides,
padding="same", use_bias=False
)(x)
x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
x = layers.Activation("swish")(x) # SiLU = x * sigmoid(x)
return x
def bottleneck(x, filters, shortcut=True):
"""Standard bottleneck block with optional residual connection."""
residual = x
x = darknet_conv(x, filters, kernel_size=3)
x = darknet_conv(x, filters, kernel_size=3)
if shortcut:
x = layers.Add()([residual, x])
return x
def c2f_module(x, filters, num_bottlenecks=1, shortcut=True):
"""C2f: Cross-Stage Partial with 2 convolutions and flow.
Split channels -> process half through bottlenecks -> concat all.
"""
# Initial 1x1 conv to adjust channels
hidden_channels = filters // 2
x = darknet_conv(x, 2 * hidden_channels, kernel_size=1)
# Split into two halves
split1, split2 = tf.split(x, 2, axis=-1)
# Collect outputs: start with both splits
outputs = [split1, split2]
# Process split2 through N bottleneck blocks
current = split2
for _ in range(num_bottlenecks):
current = bottleneck(current, hidden_channels, shortcut=shortcut)
outputs.append(current)
# Concatenate all outputs
x = layers.Concatenate(axis=-1)(outputs)
# Final 1x1 conv to reduce channels
x = darknet_conv(x, filters, kernel_size=1)
return x
# Demonstrate C2f module
input_tensor = keras.Input(shape=(80, 80, 128))
output = c2f_module(input_tensor, filters=256, num_bottlenecks=3)
print(f"C2f Input: {input_tensor.shape}")
print(f"C2f Output: {output.shape}")
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
def darknet_conv(x, filters, kernel_size, strides=1):
"""Conv + BatchNorm + SiLU activation block."""
x = layers.Conv2D(
filters, kernel_size, strides=strides,
padding="same", use_bias=False
)(x)
x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
return x
def c2f_module(x, filters, num_bottlenecks=1, shortcut=True):
"""C2f module (simplified for shape demo)."""
hidden_channels = filters // 2
x = darknet_conv(x, 2 * hidden_channels, kernel_size=1)
split1, split2 = tf.split(x, 2, axis=-1)
outputs = [split1, split2]
current = split2
for _ in range(num_bottlenecks):
res = current
current = darknet_conv(current, hidden_channels, kernel_size=3)
current = darknet_conv(current, hidden_channels, kernel_size=3)
if shortcut:
current = layers.Add()([res, current])
outputs.append(current)
x = layers.Concatenate(axis=-1)(outputs)
x = darknet_conv(x, filters, kernel_size=1)
return x
def sppf(x, filters, pool_size=5):
"""Spatial Pyramid Pooling - Fast."""
x = darknet_conv(x, filters // 2, kernel_size=1)
p1 = layers.MaxPooling2D(pool_size, strides=1, padding="same")(x)
p2 = layers.MaxPooling2D(pool_size, strides=1, padding="same")(p1)
p3 = layers.MaxPooling2D(pool_size, strides=1, padding="same")(p2)
x = layers.Concatenate(axis=-1)([x, p1, p2, p3])
x = darknet_conv(x, filters, kernel_size=1)
return x
def build_cspdarknet_backbone(input_shape=(640, 640, 3)):
"""Build CSPDarknet53 backbone returning P3, P4, P5 features."""
inputs = keras.Input(shape=input_shape)
# Stem
x = darknet_conv(inputs, 64, kernel_size=3, strides=2) # 320x320
# Stage 1
x = darknet_conv(x, 128, kernel_size=3, strides=2) # 160x160
x = c2f_module(x, 128, num_bottlenecks=3)
# Stage 2 - P3 output
x = darknet_conv(x, 256, kernel_size=3, strides=2) # 80x80
p3 = c2f_module(x, 256, num_bottlenecks=6)
# Stage 3 - P4 output
x = darknet_conv(p3, 512, kernel_size=3, strides=2) # 40x40
p4 = c2f_module(x, 512, num_bottlenecks=6)
# Stage 4 - P5 output
x = darknet_conv(p4, 1024, kernel_size=3, strides=2) # 20x20
x = c2f_module(x, 1024, num_bottlenecks=3)
p5 = sppf(x, 1024)
model = keras.Model(inputs, [p3, p4, p5], name="CSPDarknet")
print("CSPDarknet Backbone Feature Maps:")
print(f" P3: {p3.shape} (stride 8)")
print(f" P4: {p4.shape} (stride 16)")
print(f" P5: {p5.shape} (stride 32)")
print(f" Total params: {model.count_params():,}")
return model
backbone = build_cspdarknet_backbone()
Neck: Feature Pyramid Network + PAN
The neck combines two multi-scale feature fusion strategies: the Feature Pyramid Network (FPN) provides a top-down pathway for rich semantic information, while the Path Aggregation Network (PAN) adds a bottom-up pathway for precise localization signals.
Why Bidirectional Feature Fusion?
High-level features (P5) contain strong semantic information but lack spatial precision. Low-level features (P3) have precise localization but weak semantics. FPN passes semantic info downward; PAN passes spatial info upward. The result: every scale has both rich semantics and precise localization.
Implementation
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
def darknet_conv(x, filters, kernel_size, strides=1):
"""Conv + BatchNorm + SiLU."""
x = layers.Conv2D(
filters, kernel_size, strides=strides,
padding="same", use_bias=False
)(x)
x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
return x
def c2f_module(x, filters, num_bottlenecks=1, shortcut=False):
    """Simplified C2f for the neck (residual shortcut off by default)."""
    hidden = filters // 2
    x = darknet_conv(x, 2 * hidden, kernel_size=1)
    split1, split2 = tf.split(x, 2, axis=-1)
    outputs = [split1, split2]
    current = split2
    for _ in range(num_bottlenecks):
        res = current
        current = darknet_conv(current, hidden, kernel_size=3)
        current = darknet_conv(current, hidden, kernel_size=3)
        if shortcut:
            current = layers.Add()([res, current])
        outputs.append(current)
x = layers.Concatenate(axis=-1)(outputs)
x = darknet_conv(x, filters, kernel_size=1)
return x
def build_neck(p3, p4, p5):
"""Build FPN + PAN neck.
Args:
p3: Backbone P3 features (80x80, 256ch)
p4: Backbone P4 features (40x40, 512ch)
p5: Backbone P5 features (20x20, 1024ch)
Returns:
neck_p3, neck_p4, neck_p5: Fused multi-scale features
"""
# === FPN: Top-Down Path ===
# Reduce P5 channels and upsample
up5 = darknet_conv(p5, 512, kernel_size=1)
up5 = layers.UpSampling2D(size=2)(up5)
# Fuse with P4
fpn_p4 = layers.Concatenate(axis=-1)([up5, p4])
fpn_p4 = c2f_module(fpn_p4, 512, num_bottlenecks=3)
# Reduce and upsample to P3 scale
up4 = darknet_conv(fpn_p4, 256, kernel_size=1)
up4 = layers.UpSampling2D(size=2)(up4)
# Fuse with P3
fpn_p3 = layers.Concatenate(axis=-1)([up4, p3])
fpn_p3 = c2f_module(fpn_p3, 256, num_bottlenecks=3)
# === PAN: Bottom-Up Path ===
# Downsample P3 features to P4 scale
down3 = darknet_conv(fpn_p3, 256, kernel_size=3, strides=2)
pan_p4 = layers.Concatenate(axis=-1)([down3, fpn_p4])
pan_p4 = c2f_module(pan_p4, 512, num_bottlenecks=3)
# Downsample to P5 scale
down4 = darknet_conv(pan_p4, 512, kernel_size=3, strides=2)
pan_p5 = layers.Concatenate(axis=-1)([down4, p5])
pan_p5 = c2f_module(pan_p5, 1024, num_bottlenecks=3)
print("Neck Output Shapes:")
print(f" Neck P3: {fpn_p3.shape}")
print(f" Neck P4: {pan_p4.shape}")
print(f" Neck P5: {pan_p5.shape}")
return fpn_p3, pan_p4, pan_p5
# Example usage with placeholder inputs
p3_in = keras.Input(shape=(80, 80, 256))
p4_in = keras.Input(shape=(40, 40, 512))
p5_in = keras.Input(shape=(20, 20, 1024))
neck_p3, neck_p4, neck_p5 = build_neck(p3_in, p4_in, p5_in)
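With both helpers in scope, the backbone and neck wire together into a single feature model; a minimal sketch (assuming build_cspdarknet_backbone and build_neck from the blocks above have been run in the same session):

import tensorflow as tf
from tensorflow import keras
# Assemble backbone + neck into one model (detection heads are added next)
inputs = keras.Input(shape=(640, 640, 3))
feats_p3, feats_p4, feats_p5 = build_cspdarknet_backbone()(inputs)
neck_feats = build_neck(feats_p3, feats_p4, feats_p5)
feature_model = keras.Model(inputs, list(neck_feats), name="yolov8_features")
print(f"Feature model params: {feature_model.count_params():,}")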
Detection Head: Anchor-Free
YOLOv8’s most significant departure from prior versions is its anchor-free, decoupled detection head. Instead of predicting offsets relative to predefined anchor boxes, the head directly regresses bounding box coordinates using a distribution-based approach.
flowchart LR
A[Feature Map from Neck] --> B[Shared Stem Conv]
B --> C[Classification Branch]
B --> D[Regression Branch]
C --> C1[Conv 3x3 x2]
C1 --> C2[Conv 1x1]
C2 --> C3[Sigmoid]
C3 --> C4[Class Scores: HxWxC]
D --> D1[Conv 3x3 x2]
D1 --> D2[Conv 1x1]
D2 --> D3[DFL Decode]
D3 --> D4[Box: HxWx4]
Key Design Choices
- Decoupled heads: Separate branches for classification and regression — they have different optimization targets and benefit from independent feature processing.
- No anchors: Direct regression, from each grid-cell center, of the distances to the box's left, top, right, and bottom edges (the ltrb values decoded by DFL below).
- Distribution Focal Loss: Instead of predicting box edges as single values, predict a probability distribution over possible positions. This captures localization uncertainty.
Implementation
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
def darknet_conv(x, filters, kernel_size, strides=1):
"""Conv + BatchNorm + SiLU."""
x = layers.Conv2D(
filters, kernel_size, strides=strides,
padding="same", use_bias=False
)(x)
x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
return x
class DetectionHead(keras.layers.Layer):
"""YOLOv8 anchor-free decoupled detection head."""
def __init__(self, num_classes=80, reg_max=16, **kwargs):
super().__init__(**kwargs)
self.num_classes = num_classes
self.reg_max = reg_max
# 4 values (ltrb) each with reg_max bins
self.box_channels = 4 * reg_max
def build(self, input_shape):
ch = int(input_shape[-1])
hidden = max(ch, min(self.num_classes, 100))
        # Classification branch
        self.cls_conv1 = self._make_conv(hidden)
        self.cls_conv2 = self._make_conv(hidden)
        self.cls_pred = layers.Conv2D(
            self.num_classes, 1, padding="same"
        )
        # Regression branch
        self.reg_conv1 = self._make_conv(hidden)
        self.reg_conv2 = self._make_conv(hidden)
        self.reg_pred = layers.Conv2D(
            self.box_channels, 1, padding="same"
        )
    def _make_conv(self, out_ch):
        """3x3 Conv + BatchNorm + SiLU; input channels are inferred by Keras."""
        return keras.Sequential([
            layers.Conv2D(out_ch, 3, padding="same", use_bias=False),
            layers.BatchNormalization(momentum=0.97, epsilon=1e-3),
            layers.Activation("swish"),
        ])
def call(self, x):
# Classification branch
cls_feat = self.cls_conv1(x)
cls_feat = self.cls_conv2(cls_feat)
cls_output = tf.sigmoid(self.cls_pred(cls_feat))
# Regression branch
reg_feat = self.reg_conv1(x)
reg_feat = self.reg_conv2(reg_feat)
box_output = self.reg_pred(reg_feat)
return cls_output, box_output
# Test detection head
head = DetectionHead(num_classes=80, reg_max=16)
test_input = tf.random.normal((1, 80, 80, 256))
cls_out, box_out = head(test_input)
print(f"Input shape: {test_input.shape}")
print(f"Classification shape: {cls_out.shape}") # (1, 80, 80, 80)
print(f"Box regression shape: {box_out.shape}") # (1, 80, 80, 64)
# Decode DFL to box coordinates
def dfl_decode(box_pred, reg_max=16):
"""Decode Distribution Focal Loss predictions to box values."""
batch, h, w, channels = box_pred.shape
# Reshape to (batch, h, w, 4, reg_max)
box_pred = tf.reshape(box_pred, (-1, h, w, 4, reg_max))
# Softmax over distribution bins
box_dist = tf.nn.softmax(box_pred, axis=-1)
# Expected value: sum(prob_i * i) for i in [0, reg_max)
project = tf.range(reg_max, dtype=tf.float32)
box_decoded = tf.reduce_sum(box_dist * project, axis=-1)
return box_decoded # (batch, h, w, 4) = left, top, right, bottom
decoded_boxes = dfl_decode(box_out)
print(f"Decoded boxes shape: {decoded_boxes.shape}") # (1, 80, 80, 4)
Loss Functions
YOLOv8 combines three loss components during training, each targeting a different aspect of detection quality: CIoU loss for box regression, Distribution Focal Loss for the box-distribution branch, and binary cross-entropy for classification. The two box-related losses are detailed below; a sketch of the weighted combination follows the DFL example.
Complete IoU (CIoU) Loss
CIoU extends standard IoU by considering three geometric factors: overlap area, center distance, and aspect ratio consistency:
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
Where:
- $\rho^2(b, b^{gt})$ is the squared Euclidean distance between predicted and ground-truth box centers
- $c$ is the diagonal length of the smallest enclosing box
- $v = \frac{4}{\pi^2}(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h})^2$ measures aspect ratio consistency
- $\alpha = \frac{v}{(1 - IoU) + v}$ is a balancing parameter
import tensorflow as tf
import numpy as np
def compute_iou(box1, box2):
"""Compute IoU between two sets of boxes in (x1, y1, x2, y2) format.
Args:
box1: (N, 4) predicted boxes
box2: (N, 4) ground truth boxes
Returns:
iou: (N,) IoU values
"""
# Intersection area
inter_x1 = tf.maximum(box1[:, 0], box2[:, 0])
inter_y1 = tf.maximum(box1[:, 1], box2[:, 1])
inter_x2 = tf.minimum(box1[:, 2], box2[:, 2])
inter_y2 = tf.minimum(box1[:, 3], box2[:, 3])
inter_area = tf.maximum(inter_x2 - inter_x1, 0) * \
tf.maximum(inter_y2 - inter_y1, 0)
# Union area
area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
union_area = area1 + area2 - inter_area
iou = inter_area / (union_area + 1e-7)
return iou
def ciou_loss(pred_boxes, gt_boxes):
"""Complete IoU loss for bounding box regression.
Args:
pred_boxes: (N, 4) in (x1, y1, x2, y2) format
gt_boxes: (N, 4) in (x1, y1, x2, y2) format
Returns:
loss: (N,) CIoU loss values
"""
iou = compute_iou(pred_boxes, gt_boxes)
# Center distance
pred_cx = (pred_boxes[:, 0] + pred_boxes[:, 2]) / 2
pred_cy = (pred_boxes[:, 1] + pred_boxes[:, 3]) / 2
gt_cx = (gt_boxes[:, 0] + gt_boxes[:, 2]) / 2
gt_cy = (gt_boxes[:, 1] + gt_boxes[:, 3]) / 2
center_dist_sq = (pred_cx - gt_cx) ** 2 + (pred_cy - gt_cy) ** 2
# Diagonal of smallest enclosing box
enclose_x1 = tf.minimum(pred_boxes[:, 0], gt_boxes[:, 0])
enclose_y1 = tf.minimum(pred_boxes[:, 1], gt_boxes[:, 1])
enclose_x2 = tf.maximum(pred_boxes[:, 2], gt_boxes[:, 2])
enclose_y2 = tf.maximum(pred_boxes[:, 3], gt_boxes[:, 3])
enclose_diag_sq = (enclose_x2 - enclose_x1) ** 2 + \
(enclose_y2 - enclose_y1) ** 2
# Aspect ratio consistency
pred_w = pred_boxes[:, 2] - pred_boxes[:, 0]
pred_h = pred_boxes[:, 3] - pred_boxes[:, 1]
gt_w = gt_boxes[:, 2] - gt_boxes[:, 0]
gt_h = gt_boxes[:, 3] - gt_boxes[:, 1]
pi = tf.constant(np.pi, dtype=tf.float32)
v = (4.0 / (pi ** 2)) * (
tf.atan(gt_w / (gt_h + 1e-7)) -
tf.atan(pred_w / (pred_h + 1e-7))
) ** 2
alpha = v / (1.0 - iou + v + 1e-7)
# CIoU = 1 - IoU + distance_term + aspect_term
ciou = 1.0 - iou + center_dist_sq / (enclose_diag_sq + 1e-7) + alpha * v
return ciou
# Test CIoU loss
pred = tf.constant([[10.0, 10.0, 50.0, 50.0],
[20.0, 20.0, 80.0, 80.0]])
gt = tf.constant([[12.0, 12.0, 48.0, 52.0],
[25.0, 18.0, 75.0, 78.0]])
loss = ciou_loss(pred, gt)
print(f"CIoU Loss: {loss.numpy()}")
print(f"Mean CIoU Loss: {tf.reduce_mean(loss).numpy():.4f}")
Distribution Focal Loss (DFL)
Instead of regressing a single value for each box edge, DFL predicts a probability distribution over discrete positions. The loss encourages the distribution to peak near the true location:
$$\mathcal{L}_{DFL}(S_i, S_{i+1}) = -((y_{i+1} - y) \log(S_i) + (y - y_i) \log(S_{i+1}))$$
Where $y$ is the continuous target, $y_i$ and $y_{i+1}$ are the two nearest discrete bins, and $S_i$, $S_{i+1}$ are their predicted probabilities.
import tensorflow as tf
def distribution_focal_loss(pred_dist, target, reg_max=16):
"""Distribution Focal Loss for box regression.
Instead of predicting a single value, predict a distribution
over reg_max discrete positions. Target is a continuous value.
Args:
pred_dist: (N, reg_max) logits for each edge prediction
target: (N,) continuous regression targets in [0, reg_max-1]
Returns:
loss: (N,) DFL loss per sample
"""
# Get the two nearest integer bins
target_left = tf.cast(tf.floor(target), tf.int32)
target_right = target_left + 1
# Clamp to valid range
target_left = tf.clip_by_value(target_left, 0, reg_max - 1)
target_right = tf.clip_by_value(target_right, 0, reg_max - 1)
# Weights for interpolation
weight_right = target - tf.cast(target_left, tf.float32)
weight_left = 1.0 - weight_right
# Cross-entropy with both neighbors
log_probs = tf.nn.log_softmax(pred_dist, axis=-1)
# Gather log probabilities at target bins
batch_indices = tf.range(tf.shape(target_left)[0])
loss_left = -weight_left * tf.gather_nd(
log_probs,
tf.stack([batch_indices, target_left], axis=1)
)
loss_right = -weight_right * tf.gather_nd(
log_probs,
tf.stack([batch_indices, target_right], axis=1)
)
return loss_left + loss_right
# Test DFL
reg_max = 16
pred_logits = tf.random.normal((4, reg_max)) # 4 edges, 16 bins each
targets = tf.constant([3.7, 8.2, 1.5, 12.9]) # continuous targets
dfl_loss = distribution_focal_loss(pred_logits, targets, reg_max)
print(f"DFL per-edge losses: {dfl_loss.numpy()}")
print(f"Mean DFL loss: {tf.reduce_mean(dfl_loss).numpy():.4f}")
Training on Custom Dataset
Training YOLOv8 effectively requires careful data pipeline design, aggressive augmentation (especially mosaic), and a well-tuned learning rate schedule with warmup.
Mosaic Augmentation and Data Pipeline
import tensorflow as tf
import numpy as np
def parse_coco_annotation(image_path, annotations):
"""Parse COCO format annotation for a single image.
Args:
image_path: path to image file
annotations: list of dicts with 'bbox' [x, y, w, h] and 'category_id'
Returns:
image: (H, W, 3) float32 tensor
boxes: (N, 4) in [x1, y1, x2, y2] format, normalized to [0, 1]
labels: (N,) integer class labels
"""
image = tf.io.read_file(image_path)
image = tf.image.decode_jpeg(image, channels=3)
image = tf.cast(image, tf.float32) / 255.0
h, w = tf.shape(image)[0], tf.shape(image)[1]
boxes = []
labels = []
for ann in annotations:
x, y, bw, bh = ann["bbox"]
# Convert (x, y, w, h) to normalized (x1, y1, x2, y2)
x1 = x / float(w)
y1 = y / float(h)
x2 = (x + bw) / float(w)
y2 = (y + bh) / float(h)
boxes.append([x1, y1, x2, y2])
labels.append(ann["category_id"])
return image, np.array(boxes, dtype=np.float32), np.array(labels, dtype=np.int32)
def mosaic_augmentation(images, all_boxes, all_labels, target_size=640):
"""Create mosaic from 4 images.
Combines 4 images into a 2x2 grid with random center point,
merging their annotations accordingly.
Args:
images: list of 4 image tensors
all_boxes: list of 4 box arrays, each (N_i, 4) normalized
all_labels: list of 4 label arrays
target_size: output image size
Returns:
mosaic_img: (target_size, target_size, 3)
mosaic_boxes: (M, 4) merged boxes
mosaic_labels: (M,) merged labels
"""
s = target_size
# Random center point for the mosaic
cx = np.random.randint(s // 4, 3 * s // 4)
cy = np.random.randint(s // 4, 3 * s // 4)
mosaic_img = np.zeros((s, s, 3), dtype=np.float32)
merged_boxes = []
merged_labels = []
# Placement regions for each quadrant
placements = [
(0, 0, cx, cy), # top-left
(cx, 0, s, cy), # top-right
(0, cy, cx, s), # bottom-left
(cx, cy, s, s), # bottom-right
]
for i, (x1_p, y1_p, x2_p, y2_p) in enumerate(placements):
img = images[i]
# Resize image to fit placement region
pw, ph = x2_p - x1_p, y2_p - y1_p
if pw <= 0 or ph <= 0:
continue
img_resized = tf.image.resize(img, (ph, pw)).numpy()
mosaic_img[y1_p:y2_p, x1_p:x2_p] = img_resized
# Transform boxes to mosaic coordinates
boxes = all_boxes[i].copy()
if len(boxes) > 0:
# Scale to placement region
boxes[:, 0] = boxes[:, 0] * pw + x1_p # x1
boxes[:, 1] = boxes[:, 1] * ph + y1_p # y1
boxes[:, 2] = boxes[:, 2] * pw + x1_p # x2
boxes[:, 3] = boxes[:, 3] * ph + y1_p # y2
# Normalize to mosaic size
boxes[:, [0, 2]] /= s
boxes[:, [1, 3]] /= s
# Clip to [0, 1]
boxes = np.clip(boxes, 0.0, 1.0)
# Filter out degenerate boxes
valid = (boxes[:, 2] - boxes[:, 0] > 0.001) & \
(boxes[:, 3] - boxes[:, 1] > 0.001)
merged_boxes.append(boxes[valid])
merged_labels.append(all_labels[i][valid])
if merged_boxes:
mosaic_boxes = np.concatenate(merged_boxes, axis=0)
mosaic_labels = np.concatenate(merged_labels, axis=0)
else:
mosaic_boxes = np.zeros((0, 4), dtype=np.float32)
mosaic_labels = np.zeros((0,), dtype=np.int32)
return mosaic_img, mosaic_boxes, mosaic_labels
# Example: create synthetic data to demonstrate
print("Mosaic Augmentation Demo")
print("=" * 40)
dummy_images = [np.random.rand(480, 640, 3).astype(np.float32) for _ in range(4)]
dummy_boxes = [
np.array([[0.1, 0.2, 0.5, 0.6], [0.3, 0.4, 0.8, 0.9]]),
np.array([[0.2, 0.1, 0.7, 0.5]]),
np.array([[0.0, 0.0, 0.3, 0.3], [0.5, 0.5, 1.0, 1.0]]),
np.array([[0.4, 0.3, 0.9, 0.8]]),
]
dummy_labels = [
np.array([0, 1]),
np.array([2]),
np.array([0, 3]),
np.array([1]),
]
mosaic, boxes, labels = mosaic_augmentation(
dummy_images, dummy_boxes, dummy_labels, target_size=640
)
print(f"Mosaic image shape: {mosaic.shape}")
print(f"Merged boxes: {boxes.shape[0]} objects")
print(f"Labels: {labels}")
import tensorflow as tf
def create_yolo_dataset(image_dir, annotation_file, batch_size=16,
img_size=640, augment=True):
"""Create tf.data pipeline for YOLOv8 training.
Args:
image_dir: directory containing images
annotation_file: COCO format JSON annotation file
batch_size: training batch size
img_size: target image size (square)
augment: whether to apply augmentations
Returns:
tf.data.Dataset yielding (images, targets) batches
"""
def load_and_preprocess(image_path, boxes, labels):
"""Load image and apply basic preprocessing."""
image = tf.io.read_file(image_path)
image = tf.image.decode_jpeg(image, channels=3)
image = tf.image.resize(image, (img_size, img_size))
image = tf.cast(image, tf.float32) / 255.0
return image, boxes, labels
def augment_sample(image, boxes, labels):
"""Apply training augmentations."""
# Random horizontal flip
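        # Caveat: a Python `if` on a tensor only works in eager mode; inside
        # a graph-mode tf.data map, wrap this branch in tf.cond instead.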
if tf.random.uniform(()) > 0.5:
image = tf.image.flip_left_right(image)
# Flip box x coordinates: x -> 1 - x
x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
boxes = tf.stack([1.0 - x2, y1, 1.0 - x1, y2], axis=1)
# Color jitter
image = tf.image.random_brightness(image, 0.2)
image = tf.image.random_contrast(image, 0.8, 1.2)
image = tf.image.random_saturation(image, 0.8, 1.2)
image = tf.clip_by_value(image, 0.0, 1.0)
return image, boxes, labels
def encode_targets(image, boxes, labels):
"""Encode boxes and labels into training target format."""
# Pad to fixed number of boxes (max 100 per image)
max_boxes = 100
num_boxes = tf.shape(boxes)[0]
padded_boxes = tf.pad(boxes, [[0, max_boxes - num_boxes], [0, 0]])
padded_labels = tf.pad(labels, [[0, max_boxes - num_boxes]])
# Mask for valid boxes
valid_mask = tf.sequence_mask(num_boxes, max_boxes, dtype=tf.float32)
targets = {
"boxes": padded_boxes[:max_boxes],
"labels": padded_labels[:max_boxes],
"valid_mask": valid_mask
}
return image, targets
# Build pipeline (simplified - actual implementation reads COCO JSON)
# dataset = tf.data.Dataset.from_generator(...)
# dataset = dataset.map(load_and_preprocess)
# if augment:
# dataset = dataset.map(augment_sample)
# dataset = dataset.map(encode_targets)
# dataset = dataset.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
print(f"Dataset Configuration:")
print(f" Image size: {img_size}x{img_size}")
print(f" Batch size: {batch_size}")
print(f" Augmentation: {augment}")
print(f" Max boxes per image: 100")
print(f" Pipeline: load -> resize -> augment -> encode -> batch -> prefetch")
# Cosine learning rate schedule with warmup
def cosine_warmup_schedule(epoch, total_epochs=300,
warmup_epochs=3, initial_lr=0.01):
"""Cosine annealing with linear warmup."""
import math
if epoch < warmup_epochs:
# Linear warmup
return initial_lr * (epoch + 1) / warmup_epochs
else:
# Cosine decay
progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
return initial_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
# Demonstrate the dataset configuration, then the learning rate schedule
create_yolo_dataset("images/", "annotations.json")
print("\nLearning Rate Schedule (first 20 epochs):")
for e in range(20):
lr = cosine_warmup_schedule(e, total_epochs=300)
bar = "#" * int(lr * 500)
print(f" Epoch {e:3d}: lr={lr:.6f} {bar}")
TFLite Deployment
Deploying YOLOv8 on edge devices requires converting the trained model to TensorFlow Lite format and applying quantization to reduce model size and improve inference speed.
Model Conversion and Quantization
import tensorflow as tf
import numpy as np
def export_to_tflite(saved_model_path, output_path,
quantization="none", calibration_data=None):
"""Convert SavedModel to TFLite with optional quantization.
Args:
saved_model_path: path to TF SavedModel directory
output_path: path to save .tflite file
quantization: "none", "float16", "dynamic_int8", or "full_int8"
calibration_data: generator for full int8 calibration
Returns:
dict with model size and conversion details
"""
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
if quantization == "float16":
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
print("Applying float16 quantization (2x size reduction)")
elif quantization == "dynamic_int8":
converter.optimizations = [tf.lite.Optimize.DEFAULT]
print("Applying dynamic range int8 quantization (4x size reduction)")
elif quantization == "full_int8":
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = calibration_data
converter.target_spec.supported_ops = [
tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
print("Applying full integer int8 quantization (Edge TPU compatible)")
tflite_model = converter.convert()
with open(output_path, "wb") as f:
f.write(tflite_model)
size_mb = len(tflite_model) / (1024 * 1024)
print(f"Exported: {output_path} ({size_mb:.1f} MB)")
return {"size_mb": size_mb, "quantization": quantization}
def calibration_data_generator(dataset_path, num_samples=100, img_size=640):
"""Generate calibration data for full int8 quantization."""
def gen():
for i in range(num_samples):
            # Placeholder: random tensors stand in for real training images,
            # which should be loaded from dataset_path in practice
            img = np.random.rand(1, img_size, img_size, 3).astype(np.float32)
yield [img]
return gen
# Conversion pipeline demonstration
print("YOLOv8 TFLite Export Pipeline")
print("=" * 50)
print()
print("Step 1: Save trained model")
print(" model.save('yolov8_saved_model/')")
print()
print("Step 2: Convert with different quantization levels")
# Simulated size comparisons
variants = [
("none", "yolov8n.tflite", 6.4),
("float16", "yolov8n_fp16.tflite", 3.2),
("dynamic_int8", "yolov8n_int8_dyn.tflite", 1.8),
("full_int8", "yolov8n_int8_full.tflite", 1.6),
]
print(f"\n{'Quantization':<15} {'File':<30} {'Size (MB)':<12} {'Speedup':<10}")
print("-" * 67)
for quant, filename, size in variants:
speedup = 6.4 / size
print(f"{quant:<15} {filename:<30} {size:<12.1f} {speedup:<10.1f}x")
import numpy as np
def benchmark_tflite_inference(tflite_path, img_size=640, num_runs=100):
"""Benchmark TFLite model inference speed.
Args:
tflite_path: path to .tflite model
img_size: input image size
num_runs: number of inference runs for averaging
Returns:
dict with timing results
"""
# Note: requires tensorflow package installed
# interpreter = tf.lite.Interpreter(model_path=tflite_path)
# interpreter.allocate_tensors()
# input_details = interpreter.get_input_details()
# output_details = interpreter.get_output_details()
# Simulated benchmark results for demonstration
# Real benchmarks run on actual hardware
print(f"Benchmarking: {tflite_path}")
print(f"Input: {img_size}x{img_size}x3, Runs: {num_runs}")
print()
# Simulated timing results (ms per inference)
results = {
"CPU (x86 i7)": 45.2,
"CPU (ARM Cortex-A76)": 82.5,
"GPU (Mali-G78)": 18.3,
"Edge TPU (Coral)": 6.8,
"NPU (Hexagon DSP)": 12.1,
}
print(f"{'Device':<25} {'Latency (ms)':<15} {'FPS':<10}")
print("-" * 50)
for device, latency in results.items():
fps = 1000.0 / latency
print(f"{device:<25} {latency:<15.1f} {fps:<10.1f}")
return results
def nms_postprocess(boxes, scores, iou_threshold=0.45,
score_threshold=0.25, max_detections=300):
"""Non-Maximum Suppression for TFLite output post-processing.
Args:
boxes: (N, 4) detected boxes in [x1, y1, x2, y2] format
scores: (N, num_classes) class confidence scores
iou_threshold: NMS IoU threshold
score_threshold: minimum confidence to keep
max_detections: maximum output detections
Returns:
final_boxes: (M, 4) filtered boxes
final_scores: (M,) confidence scores
final_classes: (M,) class indices
"""
# Get max class score for each box
max_scores = np.max(scores, axis=1)
class_ids = np.argmax(scores, axis=1)
# Filter by confidence threshold
mask = max_scores > score_threshold
filtered_boxes = boxes[mask]
filtered_scores = max_scores[mask]
filtered_classes = class_ids[mask]
# Sort by score (descending)
order = np.argsort(-filtered_scores)
filtered_boxes = filtered_boxes[order]
filtered_scores = filtered_scores[order]
filtered_classes = filtered_classes[order]
# Apply NMS per class
keep = []
for cls in np.unique(filtered_classes):
cls_mask = filtered_classes == cls
cls_boxes = filtered_boxes[cls_mask]
cls_indices = np.where(cls_mask)[0]
while len(cls_boxes) > 0:
keep.append(cls_indices[0])
if len(cls_boxes) == 1:
break
            # Compute IoU of the highest-scoring box with the remaining boxes
            ious = compute_iou_numpy(cls_boxes[0:1], cls_boxes[1:])
            # Keep only boxes whose overlap is below the threshold
            remaining = ious < iou_threshold
cls_boxes = cls_boxes[1:][remaining]
cls_indices = cls_indices[1:][remaining]
    keep = np.array(keep[:max_detections], dtype=np.int64)
return filtered_boxes[keep], filtered_scores[keep], filtered_classes[keep]
def compute_iou_numpy(box, boxes):
"""Compute IoU between one box and array of boxes (numpy)."""
x1 = np.maximum(box[:, 0], boxes[:, 0])
y1 = np.maximum(box[:, 1], boxes[:, 1])
x2 = np.minimum(box[:, 2], boxes[:, 2])
y2 = np.minimum(box[:, 3], boxes[:, 3])
inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
area_box = (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])
area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
union = area_box + area_boxes - inter
return inter / (union + 1e-7)
# Run benchmark
benchmark_tflite_inference("yolov8n_int8_full.tflite")
# Test NMS
print("\nNMS Post-Processing Test:")
test_boxes = np.array([
[10, 10, 50, 50],
[12, 12, 52, 48], # overlaps with first
[100, 100, 200, 200],
[102, 98, 198, 202], # overlaps with third
], dtype=np.float32)
test_scores = np.array([
[0.9, 0.1], [0.85, 0.2], [0.8, 0.3], [0.75, 0.6]
], dtype=np.float32)
final_boxes, final_scores, final_classes = nms_postprocess(
test_boxes, test_scores, iou_threshold=0.45, score_threshold=0.25
)
print(f"Input: {len(test_boxes)} detections")
print(f"After NMS: {len(final_boxes)} detections")
print(f"Kept boxes:\n{final_boxes}")
print(f"Scores: {final_scores}")
print(f"Classes: {final_classes}")
Deployment Checklist
- Verify input preprocessing matches training (RGB, 0-1 normalization, letterbox resize; see the sketch after this checklist)
- Test with float32 model first to establish accuracy baseline
- Use representative calibration data (500+ images) for full int8 quantization
- Validate mAP drop after quantization is within 1-2% of float model
- Include NMS post-processing in deployment pipeline (not baked into TFLite model)
- Set appropriate confidence threshold (0.25-0.5) and NMS IoU threshold (0.45-0.7)
- Profile memory usage: an int8 model needs roughly one quarter of the RAM of its float32 counterpart
- Test on target hardware with realistic input sizes and batch sizes
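Finally, a minimal letterbox-resize sketch for the first checklist item: scale with the aspect ratio preserved, then pad to the square input size. The gray pad value of 114/255 follows the common Ultralytics convention and is an assumption here:

import tensorflow as tf
def letterbox_resize(image, target_size=640, pad_value=114.0 / 255.0):
    """Resize preserving aspect ratio, then pad to a square target."""
    h = tf.cast(tf.shape(image)[0], tf.float32)
    w = tf.cast(tf.shape(image)[1], tf.float32)
    scale = tf.minimum(target_size / h, target_size / w)
    new_h = tf.cast(h * scale, tf.int32)
    new_w = tf.cast(w * scale, tf.int32)
    resized = tf.image.resize(image, (new_h, new_w))
    pad_y = target_size - new_h
    pad_x = target_size - new_w
    # Center the image; keep the offsets to undo the transform on outputs
    return tf.pad(resized,
                  [[pad_y // 2, pad_y - pad_y // 2],
                   [pad_x // 2, pad_x - pad_x // 2],
                   [0, 0]],
                  constant_values=pad_value)
padded = letterbox_resize(tf.random.uniform((480, 640, 3)))
print(padded.shape)  # (640, 640, 3)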