Table of Contents

  1. Evolution of YOLO
  2. Architecture Overview
  3. Backbone: CSPDarknet
  4. Neck: FPN + PAN
  5. Detection Head
  6. Loss Functions
  7. Training on Custom Data
  8. TFLite Deployment

Deep Dive: YOLOv8 — Object Detection in TensorFlow

May 3, 2026 · Wasil Zafar · 35 min read

Implement YOLO-style object detection from anchor-free heads to TFLite deployment — build the CSPDarknet backbone, C2f modules, detection heads, and run inference on edge devices.

Evolution of YOLO

You Only Look Once (YOLO) revolutionized object detection by framing it as a single regression problem. Instead of region proposal networks that scan images multiple times, YOLO processes the entire image in one forward pass — achieving real-time detection speeds that were previously impossible.

Key Insight: YOLOv8 achieves 100+ FPS on modern GPUs while maintaining competitive mAP scores, making it the go-to architecture for real-time applications like autonomous driving, surveillance, and robotics.

From YOLOv1 to YOLOv8

Each YOLO version introduced critical innovations:

  • YOLOv1 (2016): Single-shot grid-based detection. Divides image into S×S grid, each cell predicts B bounding boxes.
  • YOLOv2 (2017): Batch normalization, anchor boxes, multi-scale training.
  • YOLOv3 (2018): Darknet-53 backbone, Feature Pyramid Network, detection at 3 scales.
  • YOLOv4 (2020): CSPDarknet, Mish activation, mosaic augmentation, CIoU loss.
  • YOLOv5 (2020): PyTorch implementation, anchor-based, autoanchor, hyperparameter evolution.
  • YOLOv8 (2023): Anchor-free, decoupled heads, C2f modules, Distribution Focal Loss.

Version Comparison

import numpy as np

# YOLO version comparison: mAP vs FPS on COCO val2017
# Benchmarked on NVIDIA V100 GPU at 640x640 input resolution
detector_data = {
    "Model": [
        "YOLOv3", "YOLOv4", "YOLOv5s", "YOLOv5m",
        "YOLOv8n", "YOLOv8s", "YOLOv8m", "YOLOv8l"
    ],
    "mAP_50_95": [33.0, 43.5, 37.4, 45.4, 37.3, 44.9, 50.2, 52.9],
    "FPS_V100": [35, 50, 140, 110, 195, 160, 110, 75],
    "Parameters_M": [61.9, 64.4, 7.2, 21.2, 3.2, 11.2, 25.9, 43.7],
    "FLOPs_G": [65.9, 91.1, 16.5, 49.0, 8.7, 28.6, 78.9, 165.2]
}

# Display comparison table
print(f"{'Model':<10} {'mAP@50-95':<12} {'FPS':<8} {'Params(M)':<12} {'GFLOPs':<10}")
print("-" * 52)
for i in range(len(detector_data["Model"])):
    print(f"{detector_data['Model'][i]:<10} "
          f"{detector_data['mAP_50_95'][i]:<12.1f} "
          f"{detector_data['FPS_V100'][i]:<8} "
          f"{detector_data['Parameters_M'][i]:<12.1f} "
          f"{detector_data['FLOPs_G'][i]:<10.1f}")

# Calculate efficiency ratio (mAP per GFLOPs)
efficiency = np.array(detector_data["mAP_50_95"]) / np.array(detector_data["FLOPs_G"])
best_idx = np.argmax(efficiency)
print(f"\nMost efficient: {detector_data['Model'][best_idx]} "
      f"({efficiency[best_idx]:.3f} mAP/GFLOP)")

YOLOv8 Architecture Overview

YOLOv8 consists of three major components working together: the Backbone extracts hierarchical features, the Neck fuses multi-scale information, and the Head produces final predictions without anchor priors.

YOLOv8 Full Architecture
flowchart TD
    A[Input Image 640x640x3] --> B[Stem: Conv 3x3 s2]
    B --> C[Stage 1: C2f + Conv s2]
    C --> D[Stage 2: C2f + Conv s2]
    D --> E[Stage 3: C2f + Conv s2]
    E --> F[Stage 4: C2f + SPPF]

    F --> G[Upsample 2x]
    G --> H[Concat with Stage 3]
    H --> I[C2f Neck Block]

    I --> J[Upsample 2x]
    J --> K[Concat with Stage 2]
    K --> L[C2f Neck Block - P3]

    L --> M[Conv s2]
    M --> N[Concat with I output]
    N --> O[C2f Neck Block - P4]

    O --> P[Conv s2]
    P --> Q[Concat with F output]
    Q --> R[C2f Neck Block - P5]

    L --> S[Detect Head P3 - 80x80]
    O --> T[Detect Head P4 - 40x40]
    R --> U[Detect Head P5 - 20x20]

    S --> V[NMS + Final Predictions]
    T --> V
    U --> V

Multi-Scale Detection

YOLOv8 detects objects at three scales, enabling it to find both small and large objects effectively:

  • P3 (80×80): Small object detection — stride 8, high spatial resolution
  • P4 (40×40): Medium object detection — stride 16, balanced features
  • P5 (20×20): Large object detection — stride 32, rich semantic information

Output Tensor Shapes

import numpy as np

# Compute YOLOv8 output tensor shapes for 640x640 input
input_size = 640
num_classes = 80  # COCO dataset classes
reg_max = 16      # DFL distribution bins

# Three detection scales with their strides
scales = {
    "P3": {"stride": 8,  "description": "Small objects"},
    "P4": {"stride": 16, "description": "Medium objects"},
    "P5": {"stride": 32, "description": "Large objects"},
}

total_predictions = 0
print("YOLOv8 Output Tensor Shapes (input: 640x640)")
print("=" * 60)

for name, info in scales.items():
    grid_size = input_size // info["stride"]
    num_anchors = grid_size * grid_size
    total_predictions += num_anchors

    # Each prediction: 4 * reg_max (box) + num_classes (cls)
    box_channels = 4 * reg_max  # 64 channels for DFL
    cls_channels = num_classes   # 80 channels for classification

    print(f"\n{name} ({info['description']}):")
    print(f"  Grid size: {grid_size} x {grid_size} = {num_anchors} predictions")
    print(f"  Box branch: (batch, {grid_size}, {grid_size}, {box_channels})")
    print(f"  Cls branch: (batch, {grid_size}, {grid_size}, {cls_channels})")

print(f"\nTotal predictions per image: {total_predictions}")
print(f"Final output shape: (batch, {total_predictions}, {4 + num_classes})")
print(f"  = (batch, 8400, 84) for COCO 80 classes")

Backbone: CSPDarknet with C2f

The backbone uses Cross-Stage Partial (CSP) connections to reduce computation while maintaining gradient flow. The key innovation in YOLOv8 is the C2f module (a Cross-Stage Partial block with 2 convolutions), which replaces YOLOv5’s C3 module with a lighter design that exposes more gradient paths.

CSP Design Principle: Split input channels into two parts. Process one half through a series of bottleneck blocks while leaving the other half untouched. Concatenate both halves — this preserves gradient information while reducing computational cost by roughly 50%.

C2f Module Implementation

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size, strides=1):
    """Conv + BatchNorm + SiLU activation block."""
    x = layers.Conv2D(
        filters, kernel_size, strides=strides,
        padding="same", use_bias=False
    )(x)
    x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
    x = layers.Activation("swish")(x)  # SiLU = x * sigmoid(x)
    return x

def bottleneck(x, filters, shortcut=True):
    """Standard bottleneck block with optional residual connection."""
    residual = x
    x = darknet_conv(x, filters, kernel_size=3)
    x = darknet_conv(x, filters, kernel_size=3)
    if shortcut:
        x = layers.Add()([residual, x])
    return x

def c2f_module(x, filters, num_bottlenecks=1, shortcut=True):
    """C2f: Cross-Stage Partial with 2 convolutions and flow.

    Split channels -> process half through bottlenecks -> concat all.
    """
    # Initial 1x1 conv to adjust channels
    hidden_channels = filters // 2
    x = darknet_conv(x, 2 * hidden_channels, kernel_size=1)

    # Split into two halves
    split1, split2 = tf.split(x, 2, axis=-1)

    # Collect outputs: start with both splits
    outputs = [split1, split2]

    # Process split2 through N bottleneck blocks
    current = split2
    for _ in range(num_bottlenecks):
        current = bottleneck(current, hidden_channels, shortcut=shortcut)
        outputs.append(current)

    # Concatenate all outputs
    x = layers.Concatenate(axis=-1)(outputs)

    # Final 1x1 conv to reduce channels
    x = darknet_conv(x, filters, kernel_size=1)
    return x

# Demonstrate C2f module
input_tensor = keras.Input(shape=(80, 80, 128))
output = c2f_module(input_tensor, filters=256, num_bottlenecks=3)
print(f"C2f Input:  {input_tensor.shape}")
print(f"C2f Output: {output.shape}")

Full Backbone with SPPF

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size, strides=1):
    """Conv + BatchNorm + SiLU activation block."""
    x = layers.Conv2D(
        filters, kernel_size, strides=strides,
        padding="same", use_bias=False
    )(x)
    x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
    x = layers.Activation("swish")(x)
    return x

def c2f_module(x, filters, num_bottlenecks=1, shortcut=True):
    """C2f module (simplified for shape demo)."""
    hidden_channels = filters // 2
    x = darknet_conv(x, 2 * hidden_channels, kernel_size=1)
    split1, split2 = tf.split(x, 2, axis=-1)
    outputs = [split1, split2]
    current = split2
    for _ in range(num_bottlenecks):
        res = current
        current = darknet_conv(current, hidden_channels, kernel_size=3)
        current = darknet_conv(current, hidden_channels, kernel_size=3)
        if shortcut:
            current = layers.Add()([res, current])
        outputs.append(current)
    x = layers.Concatenate(axis=-1)(outputs)
    x = darknet_conv(x, filters, kernel_size=1)
    return x

def sppf(x, filters, pool_size=5):
    """Spatial Pyramid Pooling - Fast."""
    x = darknet_conv(x, filters // 2, kernel_size=1)
    p1 = layers.MaxPooling2D(pool_size, strides=1, padding="same")(x)
    p2 = layers.MaxPooling2D(pool_size, strides=1, padding="same")(p1)
    p3 = layers.MaxPooling2D(pool_size, strides=1, padding="same")(p2)
    x = layers.Concatenate(axis=-1)([x, p1, p2, p3])
    x = darknet_conv(x, filters, kernel_size=1)
    return x

def build_cspdarknet_backbone(input_shape=(640, 640, 3)):
    """Build CSPDarknet53 backbone returning P3, P4, P5 features."""
    inputs = keras.Input(shape=input_shape)

    # Stem
    x = darknet_conv(inputs, 64, kernel_size=3, strides=2)   # 320x320

    # Stage 1
    x = darknet_conv(x, 128, kernel_size=3, strides=2)       # 160x160
    x = c2f_module(x, 128, num_bottlenecks=3)

    # Stage 2 - P3 output
    x = darknet_conv(x, 256, kernel_size=3, strides=2)       # 80x80
    p3 = c2f_module(x, 256, num_bottlenecks=6)

    # Stage 3 - P4 output
    x = darknet_conv(p3, 512, kernel_size=3, strides=2)      # 40x40
    p4 = c2f_module(x, 512, num_bottlenecks=6)

    # Stage 4 - P5 output
    x = darknet_conv(p4, 1024, kernel_size=3, strides=2)     # 20x20
    x = c2f_module(x, 1024, num_bottlenecks=3)
    p5 = sppf(x, 1024)

    model = keras.Model(inputs, [p3, p4, p5], name="CSPDarknet")
    print("CSPDarknet Backbone Feature Maps:")
    print(f"  P3: {p3.shape} (stride 8)")
    print(f"  P4: {p4.shape} (stride 16)")
    print(f"  P5: {p5.shape} (stride 32)")
    print(f"  Total params: {model.count_params():,}")
    return model

backbone = build_cspdarknet_backbone()

Neck: Feature Pyramid Network + PAN

The neck combines two multi-scale feature fusion strategies: the Feature Pyramid Network (FPN) provides a top-down pathway for rich semantic information, while the Path Aggregation Network (PAN) adds a bottom-up pathway for precise localization signals.

Why Bidirectional Feature Fusion?

High-level features (P5) contain strong semantic information but lack spatial precision. Low-level features (P3) have precise localization but weak semantics. FPN passes semantic info downward; PAN passes spatial info upward. The result: every scale has both rich semantics and precise localization.

Implementation

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size, strides=1):
    """Conv + BatchNorm + SiLU."""
    x = layers.Conv2D(
        filters, kernel_size, strides=strides,
        padding="same", use_bias=False
    )(x)
    x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
    x = layers.Activation("swish")(x)
    return x

def c2f_module(x, filters, num_bottlenecks=1, shortcut=False):
    """Simplified C2f for neck (no shortcut by default)."""
    hidden = filters // 2
    x = darknet_conv(x, 2 * hidden, kernel_size=1)
    split1, split2 = tf.split(x, 2, axis=-1)
    outputs = [split1, split2]
    current = split2
    for _ in range(num_bottlenecks):
        current = darknet_conv(current, hidden, kernel_size=3)
        current = darknet_conv(current, hidden, kernel_size=3)
        outputs.append(current)
    x = layers.Concatenate(axis=-1)(outputs)
    x = darknet_conv(x, filters, kernel_size=1)
    return x

def build_neck(p3, p4, p5):
    """Build FPN + PAN neck.

    Args:
        p3: Backbone P3 features (80x80, 256ch)
        p4: Backbone P4 features (40x40, 512ch)
        p5: Backbone P5 features (20x20, 1024ch)

    Returns:
        neck_p3, neck_p4, neck_p5: Fused multi-scale features
    """
    # === FPN: Top-Down Path ===
    # Reduce P5 channels and upsample
    up5 = darknet_conv(p5, 512, kernel_size=1)
    up5 = layers.UpSampling2D(size=2)(up5)

    # Fuse with P4
    fpn_p4 = layers.Concatenate(axis=-1)([up5, p4])
    fpn_p4 = c2f_module(fpn_p4, 512, num_bottlenecks=3)

    # Reduce and upsample to P3 scale
    up4 = darknet_conv(fpn_p4, 256, kernel_size=1)
    up4 = layers.UpSampling2D(size=2)(up4)

    # Fuse with P3
    fpn_p3 = layers.Concatenate(axis=-1)([up4, p3])
    fpn_p3 = c2f_module(fpn_p3, 256, num_bottlenecks=3)

    # === PAN: Bottom-Up Path ===
    # Downsample P3 features to P4 scale
    down3 = darknet_conv(fpn_p3, 256, kernel_size=3, strides=2)
    pan_p4 = layers.Concatenate(axis=-1)([down3, fpn_p4])
    pan_p4 = c2f_module(pan_p4, 512, num_bottlenecks=3)

    # Downsample to P5 scale
    down4 = darknet_conv(pan_p4, 512, kernel_size=3, strides=2)
    pan_p5 = layers.Concatenate(axis=-1)([down4, p5])
    pan_p5 = c2f_module(pan_p5, 1024, num_bottlenecks=3)

    print("Neck Output Shapes:")
    print(f"  Neck P3: {fpn_p3.shape}")
    print(f"  Neck P4: {pan_p4.shape}")
    print(f"  Neck P5: {pan_p5.shape}")

    return fpn_p3, pan_p4, pan_p5

# Example usage with placeholder inputs
p3_in = keras.Input(shape=(80, 80, 256))
p4_in = keras.Input(shape=(40, 40, 512))
p5_in = keras.Input(shape=(20, 20, 1024))
neck_p3, neck_p4, neck_p5 = build_neck(p3_in, p4_in, p5_in)

Detection Head: Anchor-Free

YOLOv8’s most significant departure from prior versions is its anchor-free, decoupled detection head. Instead of predicting offsets relative to predefined anchor boxes, the head directly regresses bounding box coordinates using a distribution-based approach.

Decoupled Detection Head
flowchart LR
    A[Feature Map from Neck] --> B[Shared Stem Conv]
    B --> C[Classification Branch]
    B --> D[Regression Branch]

    C --> C1[Conv 3x3 x2]
    C1 --> C2[Conv 1x1]
    C2 --> C3[Sigmoid]
    C3 --> C4[Class Scores: HxWxC]

    D --> D1[Conv 3x3 x2]
    D1 --> D2[Conv 1x1]
    D2 --> D3[DFL Decode]
    D3 --> D4[Box: HxWx4]

Key Design Choices

  • Decoupled heads: Separate branches for classification and regression — they have different optimization targets and benefit from independent feature processing.
  • No anchors: Each grid cell directly predicts distances to the four box edges (left, top, right, bottom) from its center, with no predefined anchor boxes to tune.
  • Distribution Focal Loss: Instead of predicting box edges as single values, predict a probability distribution over possible positions. This captures localization uncertainty.

Implementation

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size, strides=1):
    """Conv + BatchNorm + SiLU."""
    x = layers.Conv2D(
        filters, kernel_size, strides=strides,
        padding="same", use_bias=False
    )(x)
    x = layers.BatchNormalization(momentum=0.97, epsilon=1e-3)(x)
    x = layers.Activation("swish")(x)
    return x

class DetectionHead(keras.layers.Layer):
    """YOLOv8 anchor-free decoupled detection head."""

    def __init__(self, num_classes=80, reg_max=16, **kwargs):
        super().__init__(**kwargs)
        self.num_classes = num_classes
        self.reg_max = reg_max
        # 4 values (ltrb) each with reg_max bins
        self.box_channels = 4 * reg_max

    def build(self, input_shape):
        ch = int(input_shape[-1])
        hidden = max(ch, min(self.num_classes, 100))

        # Classification branch
        self.cls_conv1 = self._make_conv(ch, hidden)
        self.cls_conv2 = self._make_conv(hidden, hidden)
        self.cls_pred = layers.Conv2D(
            self.num_classes, 1, padding="same"
        )

        # Regression branch
        self.reg_conv1 = self._make_conv(ch, hidden)
        self.reg_conv2 = self._make_conv(hidden, hidden)
        self.reg_pred = layers.Conv2D(
            self.box_channels, 1, padding="same"
        )

    def _make_conv(self, in_ch, out_ch):
        return keras.Sequential([
            layers.Conv2D(out_ch, 3, padding="same", use_bias=False),
            layers.BatchNormalization(momentum=0.97, epsilon=1e-3),
            layers.Activation("swish"),
        ])

    def call(self, x):
        # Classification branch
        cls_feat = self.cls_conv1(x)
        cls_feat = self.cls_conv2(cls_feat)
        cls_output = tf.sigmoid(self.cls_pred(cls_feat))

        # Regression branch
        reg_feat = self.reg_conv1(x)
        reg_feat = self.reg_conv2(reg_feat)
        box_output = self.reg_pred(reg_feat)

        return cls_output, box_output

# Test detection head
head = DetectionHead(num_classes=80, reg_max=16)
test_input = tf.random.normal((1, 80, 80, 256))
cls_out, box_out = head(test_input)
print(f"Input shape:          {test_input.shape}")
print(f"Classification shape: {cls_out.shape}")  # (1, 80, 80, 80)
print(f"Box regression shape: {box_out.shape}")  # (1, 80, 80, 64)

# Decode DFL to box coordinates
def dfl_decode(box_pred, reg_max=16):
    """Decode Distribution Focal Loss predictions to box values."""
    batch, h, w, channels = box_pred.shape
    # Reshape to (batch, h, w, 4, reg_max)
    box_pred = tf.reshape(box_pred, (-1, h, w, 4, reg_max))
    # Softmax over distribution bins
    box_dist = tf.nn.softmax(box_pred, axis=-1)
    # Expected value: sum(prob_i * i) for i in [0, reg_max)
    project = tf.range(reg_max, dtype=tf.float32)
    box_decoded = tf.reduce_sum(box_dist * project, axis=-1)
    return box_decoded  # (batch, h, w, 4) = left, top, right, bottom

decoded_boxes = dfl_decode(box_out)
print(f"Decoded boxes shape:  {decoded_boxes.shape}")  # (1, 80, 80, 4)

Loss Functions

YOLOv8 combines three loss components during training, each targeting a different aspect of detection quality: box regression (CIoU loss), box-edge distributions (Distribution Focal Loss), and classification (binary cross-entropy).

Complete IoU (CIoU) Loss

CIoU extends standard IoU by considering three geometric factors: overlap area, center distance, and aspect ratio consistency:

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$

Where:

  • $\rho^2(b, b^{gt})$ is the squared Euclidean distance between predicted and ground-truth box centers
  • $c$ is the diagonal length of the smallest enclosing box
  • $v = \frac{4}{\pi^2}(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h})^2$ measures aspect ratio consistency
  • $\alpha = \frac{v}{(1 - IoU) + v}$ is a balancing parameter

import tensorflow as tf
import numpy as np

def compute_iou(box1, box2):
    """Compute IoU between two sets of boxes in (x1, y1, x2, y2) format.

    Args:
        box1: (N, 4) predicted boxes
        box2: (N, 4) ground truth boxes

    Returns:
        iou: (N,) IoU values
    """
    # Intersection area
    inter_x1 = tf.maximum(box1[:, 0], box2[:, 0])
    inter_y1 = tf.maximum(box1[:, 1], box2[:, 1])
    inter_x2 = tf.minimum(box1[:, 2], box2[:, 2])
    inter_y2 = tf.minimum(box1[:, 3], box2[:, 3])

    inter_area = tf.maximum(inter_x2 - inter_x1, 0) * \
                 tf.maximum(inter_y2 - inter_y1, 0)

    # Union area
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    union_area = area1 + area2 - inter_area

    iou = inter_area / (union_area + 1e-7)
    return iou

def ciou_loss(pred_boxes, gt_boxes):
    """Complete IoU loss for bounding box regression.

    Args:
        pred_boxes: (N, 4) in (x1, y1, x2, y2) format
        gt_boxes: (N, 4) in (x1, y1, x2, y2) format

    Returns:
        loss: (N,) CIoU loss values
    """
    iou = compute_iou(pred_boxes, gt_boxes)

    # Center distance
    pred_cx = (pred_boxes[:, 0] + pred_boxes[:, 2]) / 2
    pred_cy = (pred_boxes[:, 1] + pred_boxes[:, 3]) / 2
    gt_cx = (gt_boxes[:, 0] + gt_boxes[:, 2]) / 2
    gt_cy = (gt_boxes[:, 1] + gt_boxes[:, 3]) / 2
    center_dist_sq = (pred_cx - gt_cx) ** 2 + (pred_cy - gt_cy) ** 2

    # Diagonal of smallest enclosing box
    enclose_x1 = tf.minimum(pred_boxes[:, 0], gt_boxes[:, 0])
    enclose_y1 = tf.minimum(pred_boxes[:, 1], gt_boxes[:, 1])
    enclose_x2 = tf.maximum(pred_boxes[:, 2], gt_boxes[:, 2])
    enclose_y2 = tf.maximum(pred_boxes[:, 3], gt_boxes[:, 3])
    enclose_diag_sq = (enclose_x2 - enclose_x1) ** 2 + \
                      (enclose_y2 - enclose_y1) ** 2

    # Aspect ratio consistency
    pred_w = pred_boxes[:, 2] - pred_boxes[:, 0]
    pred_h = pred_boxes[:, 3] - pred_boxes[:, 1]
    gt_w = gt_boxes[:, 2] - gt_boxes[:, 0]
    gt_h = gt_boxes[:, 3] - gt_boxes[:, 1]

    pi = tf.constant(np.pi, dtype=tf.float32)
    v = (4.0 / (pi ** 2)) * (
        tf.atan(gt_w / (gt_h + 1e-7)) -
        tf.atan(pred_w / (pred_h + 1e-7))
    ) ** 2

    alpha = v / (1.0 - iou + v + 1e-7)

    # CIoU = 1 - IoU + distance_term + aspect_term
    ciou = 1.0 - iou + center_dist_sq / (enclose_diag_sq + 1e-7) + alpha * v
    return ciou

# Test CIoU loss
pred = tf.constant([[10.0, 10.0, 50.0, 50.0],
                    [20.0, 20.0, 80.0, 80.0]])
gt = tf.constant([[12.0, 12.0, 48.0, 52.0],
                  [25.0, 18.0, 75.0, 78.0]])

loss = ciou_loss(pred, gt)
print(f"CIoU Loss: {loss.numpy()}")
print(f"Mean CIoU Loss: {tf.reduce_mean(loss).numpy():.4f}")

Distribution Focal Loss (DFL)

Instead of regressing a single value for each box edge, DFL predicts a probability distribution over discrete positions. The loss encourages the distribution to peak near the true location:

$$\mathcal{L}_{DFL}(S_i, S_{i+1}) = -((y_{i+1} - y) \log(S_i) + (y - y_i) \log(S_{i+1}))$$

Where $y$ is the continuous target, $y_i$ and $y_{i+1}$ are the two nearest discrete bins, and $S_i$, $S_{i+1}$ are their predicted probabilities.

import tensorflow as tf

def distribution_focal_loss(pred_dist, target, reg_max=16):
    """Distribution Focal Loss for box regression.

    Instead of predicting a single value, predict a distribution
    over reg_max discrete positions. Target is a continuous value.

    Args:
        pred_dist: (N, reg_max) logits for each edge prediction
        target: (N,) continuous regression targets in [0, reg_max-1]

    Returns:
        loss: (N,) DFL loss per sample
    """
    # Get the two nearest integer bins
    target_left = tf.cast(tf.floor(target), tf.int32)
    target_right = target_left + 1

    # Clamp to valid range
    target_left = tf.clip_by_value(target_left, 0, reg_max - 1)
    target_right = tf.clip_by_value(target_right, 0, reg_max - 1)

    # Weights for interpolation
    weight_right = target - tf.cast(target_left, tf.float32)
    weight_left = 1.0 - weight_right

    # Cross-entropy with both neighbors
    log_probs = tf.nn.log_softmax(pred_dist, axis=-1)

    # Gather log probabilities at target bins
    batch_indices = tf.range(tf.shape(target_left)[0])

    loss_left = -weight_left * tf.gather_nd(
        log_probs,
        tf.stack([batch_indices, target_left], axis=1)
    )
    loss_right = -weight_right * tf.gather_nd(
        log_probs,
        tf.stack([batch_indices, target_right], axis=1)
    )

    return loss_left + loss_right

# Test DFL
reg_max = 16
pred_logits = tf.random.normal((4, reg_max))  # 4 edges, 16 bins each
targets = tf.constant([3.7, 8.2, 1.5, 12.9])  # continuous targets

dfl_loss = distribution_focal_loss(pred_logits, targets, reg_max)
print(f"DFL per-edge losses: {dfl_loss.numpy()}")
print(f"Mean DFL loss: {tf.reduce_mean(dfl_loss).numpy():.4f}")

Training on Custom Dataset

Training YOLOv8 effectively requires careful data pipeline design, aggressive augmentation (especially mosaic), and a well-tuned learning rate schedule with warmup.

Mosaic Augmentation and Data Pipeline

Important: Mosaic augmentation combines 4 images into one training sample, forcing the model to detect objects in varied contexts and at different scales. This significantly reduces the need for large batch sizes — YOLOv8 typically trains with batch size 16 but gets the diversity of batch size 64.

import tensorflow as tf
import numpy as np

def parse_coco_annotation(image_path, annotations):
    """Parse COCO format annotation for a single image.

    Args:
        image_path: path to image file
        annotations: list of dicts with 'bbox' [x, y, w, h] and 'category_id'

    Returns:
        image: (H, W, 3) float32 tensor
        boxes: (N, 4) in [x1, y1, x2, y2] format, normalized to [0, 1]
        labels: (N,) integer class labels
    """
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.cast(image, tf.float32) / 255.0
    h, w = tf.shape(image)[0], tf.shape(image)[1]

    boxes = []
    labels = []
    for ann in annotations:
        x, y, bw, bh = ann["bbox"]
        # Convert (x, y, w, h) to normalized (x1, y1, x2, y2)
        x1 = x / float(w)
        y1 = y / float(h)
        x2 = (x + bw) / float(w)
        y2 = (y + bh) / float(h)
        boxes.append([x1, y1, x2, y2])
        labels.append(ann["category_id"])

    return image, np.array(boxes, dtype=np.float32), np.array(labels, dtype=np.int32)

def mosaic_augmentation(images, all_boxes, all_labels, target_size=640):
    """Create mosaic from 4 images.

    Combines 4 images into a 2x2 grid with random center point,
    merging their annotations accordingly.

    Args:
        images: list of 4 image tensors
        all_boxes: list of 4 box arrays, each (N_i, 4) normalized
        all_labels: list of 4 label arrays
        target_size: output image size

    Returns:
        mosaic_img: (target_size, target_size, 3)
        mosaic_boxes: (M, 4) merged boxes
        mosaic_labels: (M,) merged labels
    """
    s = target_size
    # Random center point for the mosaic
    cx = np.random.randint(s // 4, 3 * s // 4)
    cy = np.random.randint(s // 4, 3 * s // 4)

    mosaic_img = np.zeros((s, s, 3), dtype=np.float32)
    merged_boxes = []
    merged_labels = []

    # Placement regions for each quadrant
    placements = [
        (0, 0, cx, cy),         # top-left
        (cx, 0, s, cy),         # top-right
        (0, cy, cx, s),         # bottom-left
        (cx, cy, s, s),         # bottom-right
    ]

    for i, (x1_p, y1_p, x2_p, y2_p) in enumerate(placements):
        img = images[i]
        h, w = img.shape[0], img.shape[1]

        # Resize image to fit placement region
        pw, ph = x2_p - x1_p, y2_p - y1_p
        if pw <= 0 or ph <= 0:
            continue

        img_resized = tf.image.resize(img, (ph, pw)).numpy()
        mosaic_img[y1_p:y2_p, x1_p:x2_p] = img_resized

        # Transform boxes to mosaic coordinates
        boxes = all_boxes[i].copy()
        if len(boxes) > 0:
            # Scale to placement region
            boxes[:, 0] = boxes[:, 0] * pw + x1_p  # x1
            boxes[:, 1] = boxes[:, 1] * ph + y1_p  # y1
            boxes[:, 2] = boxes[:, 2] * pw + x1_p  # x2
            boxes[:, 3] = boxes[:, 3] * ph + y1_p  # y2

            # Normalize to mosaic size
            boxes[:, [0, 2]] /= s
            boxes[:, [1, 3]] /= s

            # Clip to [0, 1]
            boxes = np.clip(boxes, 0.0, 1.0)

            # Filter out degenerate boxes
            valid = (boxes[:, 2] - boxes[:, 0] > 0.001) & \
                    (boxes[:, 3] - boxes[:, 1] > 0.001)
            merged_boxes.append(boxes[valid])
            merged_labels.append(all_labels[i][valid])

    if merged_boxes:
        mosaic_boxes = np.concatenate(merged_boxes, axis=0)
        mosaic_labels = np.concatenate(merged_labels, axis=0)
    else:
        mosaic_boxes = np.zeros((0, 4), dtype=np.float32)
        mosaic_labels = np.zeros((0,), dtype=np.int32)

    return mosaic_img, mosaic_boxes, mosaic_labels

# Example: create synthetic data to demonstrate
print("Mosaic Augmentation Demo")
print("=" * 40)
dummy_images = [np.random.rand(480, 640, 3).astype(np.float32) for _ in range(4)]
dummy_boxes = [
    np.array([[0.1, 0.2, 0.5, 0.6], [0.3, 0.4, 0.8, 0.9]]),
    np.array([[0.2, 0.1, 0.7, 0.5]]),
    np.array([[0.0, 0.0, 0.3, 0.3], [0.5, 0.5, 1.0, 1.0]]),
    np.array([[0.4, 0.3, 0.9, 0.8]]),
]
dummy_labels = [
    np.array([0, 1]),
    np.array([2]),
    np.array([0, 3]),
    np.array([1]),
]

mosaic, boxes, labels = mosaic_augmentation(
    dummy_images, dummy_boxes, dummy_labels, target_size=640
)
print(f"Mosaic image shape: {mosaic.shape}")
print(f"Merged boxes: {boxes.shape[0]} objects")
print(f"Labels: {labels}")

tf.data Pipeline and Learning Rate Schedule

import tensorflow as tf

def create_yolo_dataset(image_dir, annotation_file, batch_size=16,
                        img_size=640, augment=True):
    """Create tf.data pipeline for YOLOv8 training.

    Args:
        image_dir: directory containing images
        annotation_file: COCO format JSON annotation file
        batch_size: training batch size
        img_size: target image size (square)
        augment: whether to apply augmentations

    Returns:
        tf.data.Dataset yielding (images, targets) batches
    """
    def load_and_preprocess(image_path, boxes, labels):
        """Load image and apply basic preprocessing."""
        image = tf.io.read_file(image_path)
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, (img_size, img_size))
        image = tf.cast(image, tf.float32) / 255.0
        return image, boxes, labels

    def augment_sample(image, boxes, labels):
        """Apply training augmentations."""
        # Random horizontal flip
        if tf.random.uniform(()) > 0.5:
            image = tf.image.flip_left_right(image)
            # Flip box x coordinates: x -> 1 - x
            x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
            boxes = tf.stack([1.0 - x2, y1, 1.0 - x1, y2], axis=1)

        # Color jitter
        image = tf.image.random_brightness(image, 0.2)
        image = tf.image.random_contrast(image, 0.8, 1.2)
        image = tf.image.random_saturation(image, 0.8, 1.2)
        image = tf.clip_by_value(image, 0.0, 1.0)

        return image, boxes, labels

    def encode_targets(image, boxes, labels):
        """Encode boxes and labels into training target format."""
        # Pad to fixed number of boxes (max 100 per image)
        max_boxes = 100
        num_boxes = tf.shape(boxes)[0]
        pad_count = tf.maximum(max_boxes - num_boxes, 0)
        padded_boxes = tf.pad(boxes[:max_boxes], [[0, pad_count], [0, 0]])
        padded_labels = tf.pad(labels[:max_boxes], [[0, pad_count]])
        # Mask for valid boxes
        valid_mask = tf.sequence_mask(num_boxes, max_boxes, dtype=tf.float32)

        targets = {
            "boxes": padded_boxes[:max_boxes],
            "labels": padded_labels[:max_boxes],
            "valid_mask": valid_mask
        }
        return image, targets

    # Build pipeline (simplified - actual implementation reads COCO JSON)
    # dataset = tf.data.Dataset.from_generator(...)
    # dataset = dataset.map(load_and_preprocess)
    # if augment:
    #     dataset = dataset.map(augment_sample)
    # dataset = dataset.map(encode_targets)
    # dataset = dataset.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

    print(f"Dataset Configuration:")
    print(f"  Image size: {img_size}x{img_size}")
    print(f"  Batch size: {batch_size}")
    print(f"  Augmentation: {augment}")
    print(f"  Max boxes per image: 100")
    print(f"  Pipeline: load -> resize -> augment -> encode -> batch -> prefetch")

# Cosine learning rate schedule with warmup
def cosine_warmup_schedule(epoch, total_epochs=300,
                           warmup_epochs=3, initial_lr=0.01):
    """Cosine annealing with linear warmup."""
    import math
    if epoch < warmup_epochs:
        # Linear warmup
        return initial_lr * (epoch + 1) / warmup_epochs
    else:
        # Cosine decay
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return initial_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Show learning rate schedule
create_yolo_dataset("images/", "annotations.json")
print("\nLearning Rate Schedule (first 20 epochs):")
for e in range(20):
    lr = cosine_warmup_schedule(e, total_epochs=300)
    bar = "#" * int(lr * 500)
    print(f"  Epoch {e:3d}: lr={lr:.6f} {bar}")

TFLite Deployment

Deploying YOLOv8 on edge devices requires converting the trained model to TensorFlow Lite format and applying quantization to reduce model size and improve inference speed.

Model Conversion and Quantization

import tensorflow as tf
import numpy as np

def export_to_tflite(saved_model_path, output_path,
                     quantization="none", calibration_data=None):
    """Convert SavedModel to TFLite with optional quantization.

    Args:
        saved_model_path: path to TF SavedModel directory
        output_path: path to save .tflite file
        quantization: "none", "float16", "dynamic_int8", or "full_int8"
        calibration_data: generator for full int8 calibration

    Returns:
        dict with model size and conversion details
    """
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)

    if quantization == "float16":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_types = [tf.float16]
        print("Applying float16 quantization (2x size reduction)")

    elif quantization == "dynamic_int8":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        print("Applying dynamic range int8 quantization (4x size reduction)")

    elif quantization == "full_int8":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.representative_dataset = calibration_data
        converter.target_spec.supported_ops = [
            tf.lite.OpsSet.TFLITE_BUILTINS_INT8
        ]
        converter.inference_input_type = tf.uint8
        converter.inference_output_type = tf.uint8
        print("Applying full integer int8 quantization (Edge TPU compatible)")

    tflite_model = converter.convert()

    with open(output_path, "wb") as f:
        f.write(tflite_model)

    size_mb = len(tflite_model) / (1024 * 1024)
    print(f"Exported: {output_path} ({size_mb:.1f} MB)")
    return {"size_mb": size_mb, "quantization": quantization}

def calibration_data_generator(dataset_path, num_samples=100, img_size=640):
    """Generate calibration data for full int8 quantization."""
    def gen():
        for i in range(num_samples):
            # Placeholder input: in practice, load and preprocess representative
            # images from the training set rather than random noise
            img = np.random.rand(1, img_size, img_size, 3).astype(np.float32)
            yield [img]
    return gen

# Conversion pipeline demonstration
print("YOLOv8 TFLite Export Pipeline")
print("=" * 50)
print()
print("Step 1: Save trained model")
print("  model.save('yolov8_saved_model/')")
print()
print("Step 2: Convert with different quantization levels")

# Simulated size comparisons
variants = [
    ("none",         "yolov8n.tflite",          6.4),
    ("float16",      "yolov8n_fp16.tflite",     3.2),
    ("dynamic_int8", "yolov8n_int8_dyn.tflite", 1.8),
    ("full_int8",    "yolov8n_int8_full.tflite", 1.6),
]

print(f"\n{'Quantization':<15} {'File':<30} {'Size (MB)':<12} {'Speedup':<10}")
print("-" * 67)
for quant, filename, size in variants:
    speedup = 6.4 / size
    print(f"{quant:<15} {filename:<30} {size:<12.1f} {speedup:<10.1f}x")

Inference Benchmarking and NMS Post-Processing

import numpy as np

def benchmark_tflite_inference(tflite_path, img_size=640, num_runs=100):
    """Benchmark TFLite model inference speed.

    Args:
        tflite_path: path to .tflite model
        img_size: input image size
        num_runs: number of inference runs for averaging

    Returns:
        dict with timing results
    """
    # Note: the real benchmark needs the tensorflow package and the compiled
    # .tflite file on the target device; the loop would look roughly like this:
    # import time, tensorflow as tf
    # interpreter = tf.lite.Interpreter(model_path=tflite_path)
    # interpreter.allocate_tensors()
    # input_details = interpreter.get_input_details()
    # output_details = interpreter.get_output_details()
    # dummy = np.random.rand(1, img_size, img_size, 3).astype(np.float32)
    # start = time.perf_counter()
    # for _ in range(num_runs):
    #     interpreter.set_tensor(input_details[0]["index"], dummy)
    #     interpreter.invoke()
    #     _ = interpreter.get_tensor(output_details[0]["index"])
    # latency_ms = (time.perf_counter() - start) / num_runs * 1000

    # Simulated benchmark results for demonstration
    # Real benchmarks run on actual hardware
    print(f"Benchmarking: {tflite_path}")
    print(f"Input: {img_size}x{img_size}x3, Runs: {num_runs}")
    print()

    # Simulated timing results (ms per inference)
    results = {
        "CPU (x86 i7)": 45.2,
        "CPU (ARM Cortex-A76)": 82.5,
        "GPU (Mali-G78)": 18.3,
        "Edge TPU (Coral)": 6.8,
        "NPU (Hexagon DSP)": 12.1,
    }

    print(f"{'Device':<25} {'Latency (ms)':<15} {'FPS':<10}")
    print("-" * 50)
    for device, latency in results.items():
        fps = 1000.0 / latency
        print(f"{device:<25} {latency:<15.1f} {fps:<10.1f}")

    return results

def nms_postprocess(boxes, scores, iou_threshold=0.45,
                    score_threshold=0.25, max_detections=300):
    """Non-Maximum Suppression for TFLite output post-processing.

    Args:
        boxes: (N, 4) detected boxes in [x1, y1, x2, y2] format
        scores: (N, num_classes) class confidence scores
        iou_threshold: NMS IoU threshold
        score_threshold: minimum confidence to keep
        max_detections: maximum output detections

    Returns:
        final_boxes: (M, 4) filtered boxes
        final_scores: (M,) confidence scores
        final_classes: (M,) class indices
    """
    # Get max class score for each box
    max_scores = np.max(scores, axis=1)
    class_ids = np.argmax(scores, axis=1)

    # Filter by confidence threshold
    mask = max_scores > score_threshold
    filtered_boxes = boxes[mask]
    filtered_scores = max_scores[mask]
    filtered_classes = class_ids[mask]

    # Sort by score (descending)
    order = np.argsort(-filtered_scores)
    filtered_boxes = filtered_boxes[order]
    filtered_scores = filtered_scores[order]
    filtered_classes = filtered_classes[order]

    # Apply NMS per class
    keep = []
    for cls in np.unique(filtered_classes):
        cls_mask = filtered_classes == cls
        cls_boxes = filtered_boxes[cls_mask]
        cls_indices = np.where(cls_mask)[0]

        while len(cls_boxes) > 0:
            keep.append(cls_indices[0])
            if len(cls_boxes) == 1:
                break

            # Compute IoU with remaining boxes
            ious = compute_iou_numpy(cls_boxes[0:1], cls_boxes[1:])
            # Remove overlapping boxes
            remaining = ious[0] < iou_threshold
            cls_boxes = cls_boxes[1:][remaining]
            cls_indices = cls_indices[1:][remaining]

    keep = np.array(keep[:max_detections], dtype=np.int64)
    return filtered_boxes[keep], filtered_scores[keep], filtered_classes[keep]

def compute_iou_numpy(box, boxes):
    """Compute IoU between one box and array of boxes (numpy)."""
    x1 = np.maximum(box[:, 0], boxes[:, 0])
    y1 = np.maximum(box[:, 1], boxes[:, 1])
    x2 = np.minimum(box[:, 2], boxes[:, 2])
    y2 = np.minimum(box[:, 3], boxes[:, 3])

    inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
    area_box = (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area_box + area_boxes - inter

    return inter / (union + 1e-7)

# Run benchmark
benchmark_tflite_inference("yolov8n_int8_full.tflite")

# Test NMS
print("\nNMS Post-Processing Test:")
test_boxes = np.array([
    [10, 10, 50, 50],
    [12, 12, 52, 48],  # overlaps with first
    [100, 100, 200, 200],
    [102, 98, 198, 202],  # overlaps with third
], dtype=np.float32)

test_scores = np.array([
    [0.9, 0.1], [0.85, 0.2], [0.8, 0.3], [0.75, 0.6]
], dtype=np.float32)

final_boxes, final_scores, final_classes = nms_postprocess(
    test_boxes, test_scores, iou_threshold=0.45, score_threshold=0.25
)
print(f"Input: {len(test_boxes)} detections")
print(f"After NMS: {len(final_boxes)} detections")
print(f"Kept boxes:\n{final_boxes}")
print(f"Scores: {final_scores}")
print(f"Classes: {final_classes}")

Deployment Checklist:
  • Verify input preprocessing matches training (RGB, 0-1 normalization, letterbox resize; see the letterbox sketch after this list)
  • Test with float32 model first to establish accuracy baseline
  • Use representative calibration data (500+ images) for full int8 quantization
  • Validate mAP drop after quantization is within 1-2% of float model
  • Include NMS post-processing in deployment pipeline (not baked into TFLite model)
  • Set appropriate confidence threshold (0.25-0.5) and NMS IoU threshold (0.45-0.7)
  • Profile memory usage: an int8 model needs roughly a quarter of the RAM of its float32 counterpart
  • Test on target hardware with realistic input sizes and batch sizes
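
Letterbox Preprocessing

As referenced in the first checklist item, this is a minimal letterbox sketch (a hypothetical NumPy helper, not part of the exported TFLite graph) that resizes an arbitrary frame to the 640x640 model input while preserving aspect ratio and recording the scale and padding needed to map detections back to original coordinates. The grey pad value of 114/255 is a common convention, not a requirement.

import numpy as np

def letterbox(image, target_size=640, pad_value=114 / 255.0):
    """Aspect-preserving resize with constant padding (sketch).

    Args:
        image: (H, W, 3) float32 array in [0, 1], RGB order
        target_size: square model input size
        pad_value: constant fill for the padded border

    Returns:
        padded: (target_size, target_size, 3) letterboxed image
        scale: resize factor applied to the original image
        (pad_x, pad_y): left/top padding in pixels
    """
    h, w = image.shape[:2]
    scale = min(target_size / h, target_size / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))

    # Nearest-neighbor resize in pure numpy to keep the sketch dependency-free
    row_idx = (np.arange(new_h) / scale).astype(np.int32).clip(0, h - 1)
    col_idx = (np.arange(new_w) / scale).astype(np.int32).clip(0, w - 1)
    resized = image[row_idx][:, col_idx]

    pad_y = (target_size - new_h) // 2
    pad_x = (target_size - new_w) // 2
    padded = np.full((target_size, target_size, 3), pad_value, dtype=np.float32)
    padded[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = resized
    return padded, scale, (pad_x, pad_y)

# Example: a 480x640 camera frame letterboxed to the model input size
frame = np.random.rand(480, 640, 3).astype(np.float32)
inp, scale, (px, py) = letterbox(frame)
print(f"Letterboxed shape: {inp.shape}, scale={scale:.3f}, pad=({px}, {py})")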