Table of Contents

  1. Object Detection Landscape
  2. The YOLO Philosophy
  3. YOLO Grid System
  4. Bounding Box Representation
  5. YOLO Loss Function
  6. Building YOLOv1 Backbone
  7. Non-Maximum Suppression
  8. Modern YOLO: YOLOv8
  9. Anchor Boxes & FPN
  10. Training a Custom Detector
  11. Inference & Visualization
  12. Conclusion & Next Steps

Deep Dive: YOLO — Real-Time Object Detection

May 3, 2026 Wasil Zafar 35 min read

Build a YOLO object detector from scratch in PyTorch. Understand grid cells, bounding box predictions, non-maximum suppression, and train custom detectors with the modern Ultralytics YOLOv8 framework.

Object Detection Landscape

Before diving into YOLO, it's important to understand where object detection sits in the broader computer vision hierarchy. There are three progressively harder tasks that a neural network can perform on an image:

  • Image Classification — "What is in this image?" A single label per image (e.g., "cat").
  • Object Localization — "Where is the object?" Classification plus a single bounding box.
  • Object Detection — "Where are ALL objects?" Multiple bounding boxes, each with a class label and confidence score.

Object detection is dramatically harder than classification because the network must simultaneously predict how many objects are present, where each one is located, and what class each belongs to — all in a single forward pass.

Two-Stage vs One-Stage Detectors

Historically, object detectors fell into two camps based on their architectural philosophy:

Object Detection Approaches
flowchart TD
    A[Object Detection] --> B[Two-Stage Detectors]
    A --> C[One-Stage Detectors]
    B --> D[R-CNN Family]
    D --> D1[R-CNN 2014]
    D --> D2[Fast R-CNN 2015]
    D --> D3[Faster R-CNN 2015]
    C --> E[YOLO Family]
    C --> F[SSD 2016]
    C --> G[RetinaNet 2017]
    E --> E1[YOLOv1 2016]
    E --> E2[YOLOv3 2018]
    E --> E3[YOLOv5/v8 2020+]
    B -.->|Higher Accuracy| H[Slower ~5 FPS]
    C -.->|Real-Time| I[Faster ~30-60 FPS]
                            

Two-stage detectors (R-CNN family) first propose candidate regions where objects might be, then classify each region individually. This is thorough but slow — Faster R-CNN achieves about 5-7 FPS on a GPU. One-stage detectors (YOLO, SSD) skip the region proposal step entirely and predict bounding boxes and classes in a single network pass, enabling real-time speeds of 30-60+ FPS.

Why Real-Time Matters

Real-time detection (>30 FPS) is critical for autonomous driving (reacting to pedestrians in milliseconds), video surveillance (processing hundreds of camera feeds simultaneously), robotics (grasping moving objects), and augmented reality (overlaying information on live video). YOLO made this possible by reframing detection as a single regression problem.

The following code demonstrates the speed difference by timing a simple classification model versus a detection model. This gives you intuition for why architectural choices matter for real-time applications.

import torch
import torch.nn as nn
import time

# Simple classifier: single prediction per image
class SimpleClassifier(nn.Module):
    def __init__(self, num_classes=80):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

# Simple detector: predicts grid of boxes + classes
class SimpleDetector(nn.Module):
    def __init__(self, S=7, B=2, C=80):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(S)
        )
        self.head = nn.Conv2d(128, B * 5 + C, 1)

    def forward(self, x):
        x = self.features(x)
        return self.head(x)  # Shape: (batch, B*5+C, S, S)

# Benchmark both models
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
img = torch.randn(1, 3, 448, 448).to(device)

classifier = SimpleClassifier().to(device).eval()
detector = SimpleDetector().to(device).eval()

# Time classifier (synchronize so async CUDA kernels are fully counted)
if device.type == 'cuda':
    torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    with torch.no_grad():
        _ = classifier(img)
if device.type == 'cuda':
    torch.cuda.synchronize()
cls_time = (time.time() - start) / 100

# Time detector
if device.type == 'cuda':
    torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    with torch.no_grad():
        _ = detector(img)
if device.type == 'cuda':
    torch.cuda.synchronize()
det_time = (time.time() - start) / 100

print(f"Classifier: {cls_time*1000:.2f} ms/image")
print(f"Detector:   {det_time*1000:.2f} ms/image")
print(f"Detector output shape: {detector(img).shape}")

Notice that the detector produces a spatial grid of predictions (S×S) in roughly the same time as a classifier because it's still just one forward pass — the key insight behind YOLO's speed.

The YOLO Philosophy

"You Only Look Once" perfectly captures YOLO's core innovation. Unlike two-stage detectors that examine an image multiple times (first for proposals, then for classification), YOLO processes the entire image in a single neural network evaluation. The network simultaneously predicts all bounding boxes and class probabilities for every object in the frame.

Detection as Regression

YOLO's radical idea was to treat object detection as a regression problem rather than a classification problem. Instead of asking "Is there an object here?" at thousands of candidate locations, YOLO asks "Given this entire image, what are all the bounding box coordinates and class labels?" The network directly outputs a fixed-size tensor containing all predictions.

Key Insight (Redmon et al., 2016): YOLO's Single-Pass Design

By encoding the entire detection pipeline into a single convolutional neural network, YOLO achieves three things simultaneously: (1) it sees the full image context when making predictions (reducing background false positives), (2) it runs at real-time speeds because there's only one network to evaluate, and (3) it learns generalizable representations of objects that transfer well to new domains.


Here's how YOLO conceptually differs from a region-based approach. We can simulate the difference in approaches with pseudocode-style Python:

import torch
import torch.nn as nn

# Two-stage approach (conceptual): propose then classify
def two_stage_detect(image, rpn, classifier):
    """Slow: generates ~2000 proposals, classifies each one"""
    proposals = rpn(image)           # ~2000 candidate boxes
    results = []
    for box in proposals:            # Classify EACH proposal
        crop = image[:, :, box[1]:box[3], box[0]:box[2]]
        label = classifier(crop)
        results.append((box, label))
    return results  # Very slow!

# YOLO approach: single pass does everything
class YOLOConcept(nn.Module):
    """Fast: one forward pass predicts ALL boxes and classes"""
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        # Single backbone + head
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 192, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(S),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(192 * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        features = self.backbone(x)
        output = self.head(features)
        # Reshape to (batch, S, S, B*5 + C)
        return output.view(-1, self.S, self.S, self.B * 5 + self.C)

# Demo
model = YOLOConcept(S=7, B=2, C=20)
img = torch.randn(1, 3, 448, 448)
predictions = model(img)
print(f"Input shape:  {img.shape}")
print(f"Output shape: {predictions.shape}")
print(f"Grid: 7x7, each cell predicts: {2*5 + 20} values")
print(f"  = 2 boxes × 5 values (x, y, w, h, conf) + 20 class probs")

The output tensor has shape (batch, 7, 7, 30) — meaning each of the 49 grid cells predicts 2 bounding boxes (each with 5 values: x, y, w, h, confidence) plus 20 class probabilities. Everything in one shot.

YOLO Grid System

YOLO divides the input image into an $S \times S$ grid (typically $7 \times 7$ for YOLOv1). Each grid cell is responsible for detecting objects whose center falls within that cell. Each cell predicts:

  • $B$ bounding boxes, each with 5 values: $(x, y, w, h, \text{confidence})$
  • $C$ class probabilities: $P(\text{Class}_i | \text{Object})$ for each class

The total output tensor has shape $S \times S \times (B \times 5 + C)$. For YOLOv1 with PASCAL VOC (20 classes): $7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30$.

YOLO Grid Cell Prediction Structure
flowchart LR
    A[Input Image 448×448] --> B[CNN Backbone 24 Conv Layers]
    B --> C[Output Tensor 7×7×30]
    C --> D[Grid Cell i,j]
    D --> E[Box 1: x,y,w,h,conf]
    D --> F[Box 2: x,y,w,h,conf]
    D --> G[20 Class Probs]
    E --> H[Final Detection]
    F --> H
    G --> H
    H --> I["class_conf = P(class) × IoU"]

The Responsible Cell Concept

A critical detail: the grid cell that contains the center point of a ground-truth object is "responsible" for predicting that object. If a dog's center is at pixel (200, 300) in a 448×448 image, that maps to grid cell $(200/64, 300/64) = (3, 4)$ in a 7×7 grid (where each cell is 64 pixels wide). That specific cell must predict the dog's bounding box.

import torch
import numpy as np

def assign_objects_to_grid(bboxes, labels, S=7, img_size=448):
    """
    Assign ground-truth objects to grid cells.

    Args:
        bboxes: Tensor of shape (N, 4) with [x_center, y_center, width, height]
                all normalized to [0, 1] relative to image size
        labels: Tensor of shape (N,) with class indices
        S: Grid size (7 for YOLOv1)

    Returns:
        target: Tensor of shape (S, S, 5 + C) with assigned ground truth
    """
    C = 20  # Number of classes (PASCAL VOC)
    target = torch.zeros(S, S, 5 + C)

    for i in range(len(bboxes)):
        x_center, y_center, w, h = bboxes[i]

        # Which grid cell is responsible?
        grid_x = int(x_center * S)  # Column index
        grid_y = int(y_center * S)  # Row index

        # Clamp to valid range
        grid_x = min(grid_x, S - 1)
        grid_y = min(grid_y, S - 1)

        # Position relative to grid cell (0 to 1 within cell)
        x_cell = x_center * S - grid_x
        y_cell = y_center * S - grid_y

        # Store: [x_cell, y_cell, w, h, confidence, one-hot class]
        target[grid_y, grid_x, 0] = x_cell
        target[grid_y, grid_x, 1] = y_cell
        target[grid_y, grid_x, 2] = w
        target[grid_y, grid_x, 3] = h
        target[grid_y, grid_x, 4] = 1.0  # Object is present

        # One-hot encode class
        class_idx = int(labels[i])
        target[grid_y, grid_x, 5 + class_idx] = 1.0

    return target

# Example: 2 objects in a 448x448 image
bboxes = torch.tensor([
    [0.45, 0.65, 0.30, 0.40],  # Dog at center (0.45, 0.65)
    [0.80, 0.20, 0.15, 0.25],  # Car at center (0.80, 0.20)
])
labels = torch.tensor([11, 6])  # dog=11, car=6 in VOC

target = assign_objects_to_grid(bboxes, labels)
print(f"Target shape: {target.shape}")
print(f"Dog assigned to cell: ({int(0.65*7)}, {int(0.45*7)}) = (4, 3)")
print(f"Car assigned to cell: ({int(0.20*7)}, {int(0.80*7)}) = (1, 5)")
print(f"Cell (4,3) confidence: {target[4, 3, 4].item()}")
print(f"Cell (4,3) class 11 (dog): {target[4, 3, 5+11].item()}")

This assignment mechanism means that YOLO has a natural limitation: each grid cell can only predict one object in YOLOv1. If two objects have centers in the same cell, only one can be detected. Later versions (YOLOv3+) address this with anchor boxes and multi-scale predictions.

Bounding Box Representation

YOLO predicts bounding boxes using a specific coordinate format. Each box has 5 values:

  • $(x, y)$ — center of the box relative to the grid cell (values between 0 and 1)
  • $(w, h)$ — width and height relative to the entire image (values between 0 and 1)
  • confidence — $P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}}$

The confidence score captures two things: how likely an object exists in that box AND how well the predicted box aligns with the actual object. Formally:

$$\text{Confidence} = P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}}$$

Where IoU (Intersection over Union) measures the overlap between predicted and ground-truth boxes:

$$\text{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$

Format Conversion

In practice, we frequently convert between box formats. The two most common are center format (x_center, y_center, w, h) used by YOLO and corner format (x_min, y_min, x_max, y_max) used for IoU computation and visualization.

import torch

def center_to_corners(boxes):
    """
    Convert boxes from (x_center, y_center, w, h) to (x1, y1, x2, y2).
    All values normalized to [0, 1].
    """
    x_center, y_center, w, h = boxes.unbind(-1)
    x1 = x_center - w / 2
    y1 = y_center - h / 2
    x2 = x_center + w / 2
    y2 = y_center + h / 2
    return torch.stack([x1, y1, x2, y2], dim=-1)

def corners_to_center(boxes):
    """
    Convert boxes from (x1, y1, x2, y2) to (x_center, y_center, w, h).
    """
    x1, y1, x2, y2 = boxes.unbind(-1)
    x_center = (x1 + x2) / 2
    y_center = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    return torch.stack([x_center, y_center, w, h], dim=-1)

def compute_iou(boxes1, boxes2):
    """
    Compute IoU between two sets of boxes (both in corner format).
    boxes1: (N, 4), boxes2: (M, 4)
    Returns: (N, M) IoU matrix
    """
    # Intersection coordinates
    x1 = torch.max(boxes1[:, None, 0], boxes2[None, :, 0])
    y1 = torch.max(boxes1[:, None, 1], boxes2[None, :, 1])
    x2 = torch.min(boxes1[:, None, 2], boxes2[None, :, 2])
    y2 = torch.min(boxes1[:, None, 3], boxes2[None, :, 3])

    # Intersection area (clamp to 0 if no overlap)
    intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    # Union area
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    union = area1[:, None] + area2[None, :] - intersection

    return intersection / (union + 1e-6)

# Demo: Convert and compute IoU
pred_boxes = torch.tensor([
    [0.5, 0.5, 0.4, 0.6],   # Predicted box (center format)
    [0.3, 0.3, 0.2, 0.2],   # Another prediction
])
gt_boxes = torch.tensor([
    [0.48, 0.52, 0.38, 0.58],  # Ground truth (center format)
])

# Convert to corners for IoU
pred_corners = center_to_corners(pred_boxes)
gt_corners = center_to_corners(gt_boxes)

iou_matrix = compute_iou(pred_corners, gt_corners)
print(f"Predicted boxes (center): \n{pred_boxes}")
print(f"Predicted boxes (corners): \n{pred_corners}")
print(f"IoU with ground truth: {iou_matrix.squeeze()}")
print(f"Box 1 IoU: {iou_matrix[0, 0]:.4f} (good match)")
print(f"Box 2 IoU: {iou_matrix[1, 0]:.4f} (poor match)")

IoU is the fundamental metric for object detection — it tells us how well a predicted box overlaps with the ground truth. An IoU above 0.5 is typically considered a "correct" detection, while 0.75+ indicates high-quality localization.

YOLO Loss Function

The YOLO loss function is a multi-part sum-squared error that balances three objectives: box localization, confidence prediction, and class prediction. The full loss is:

$$\mathcal{L} = \lambda_{\text{coord}} \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{obj}} + \lambda_{\text{noobj}} \mathcal{L}_{\text{noobj}} + \mathcal{L}_{\text{class}}$$

Where:

  • $\lambda_{\text{coord}} = 5$ — upweights localization errors (boxes matter more than background confidence)
  • $\lambda_{\text{noobj}} = 0.5$ — downweights confidence loss for cells without objects (most cells are background)
  • Box width/height use $\sqrt{w}$ and $\sqrt{h}$ instead of raw values to make the loss more sensitive to small-box errors

Why sqrt(w) and sqrt(h)? A small absolute error in a large box (e.g., 300→305 pixels) is less important than the same error in a small box (e.g., 30→35 pixels). Taking the square root compresses large values relative to small ones, so the same absolute error costs more for a small box — making the loss roughly scale-invariant.
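
To make the effect concrete, here's a quick numeric check (illustrative values, not from the paper): the same 5-pixel error produces a much larger squared error in sqrt-space for the small box.

import math

# Same 5-pixel width error on a large box vs a small box
for w_true, w_pred in [(300, 305), (30, 35)]:
    raw_err = (w_pred - w_true) ** 2
    sqrt_err = (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2
    print(f"true={w_true:3d} pred={w_pred:3d}  "
          f"raw loss={raw_err:.1f}  sqrt loss={sqrt_err:.4f}")

# raw loss is identical (25.0) for both boxes, but the sqrt loss is
# roughly 10x larger for the small box — exactly the sensitivity we want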

Loss Implementation

Here's a complete implementation of the YOLOv1 loss function. This is one of the most educational pieces of code because it shows how every prediction component is supervised:

import torch
import torch.nn as nn

class YOLOv1Loss(nn.Module):
    """
    YOLOv1 loss function implementation.
    Penalizes localization, confidence, and classification errors.
    """
    def __init__(self, S=7, B=2, C=20, lambda_coord=5.0, lambda_noobj=0.5):
        super().__init__()
        self.S = S
        self.B = B
        self.C = C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj

    def forward(self, predictions, targets):
        """
        predictions: (batch, S, S, B*5 + C)
        targets: (batch, S, S, 5 + C) — only 1 box per cell in target
        """
        batch_size = predictions.shape[0]
        pred = predictions.reshape(batch_size, self.S, self.S, self.B * 5 + self.C)

        # Extract target components
        target_boxes = targets[..., :4]    # (batch, S, S, 4)
        target_conf = targets[..., 4:5]    # (batch, S, S, 1) — 1 if object, 0 otherwise
        target_class = targets[..., 5:]    # (batch, S, S, C)

        # Object mask: cells that contain an object
        obj_mask = target_conf.squeeze(-1)  # (batch, S, S)

        # Extract predictions for box 1 and box 2
        pred_box1 = pred[..., :5]          # x, y, w, h, conf for box 1
        pred_box2 = pred[..., 5:10]        # x, y, w, h, conf for box 2
        pred_class = pred[..., 10:]        # class predictions

        # ============ Localization Loss ============
        # Use box 1 for simplicity (full impl selects best IoU box)
        xy_loss = self.lambda_coord * torch.sum(
            obj_mask.unsqueeze(-1) * (pred_box1[..., :2] - target_boxes[..., :2]) ** 2
        )

        # sqrt(w) and sqrt(h) for scale sensitivity
        wh_pred = torch.sign(pred_box1[..., 2:4]) * torch.sqrt(
            torch.abs(pred_box1[..., 2:4]) + 1e-6
        )
        wh_target = torch.sqrt(target_boxes[..., 2:4] + 1e-6)
        wh_loss = self.lambda_coord * torch.sum(
            obj_mask.unsqueeze(-1) * (wh_pred - wh_target) ** 2
        )

        # ============ Confidence Loss ============
        # Object cells: confidence should match IoU
        conf_obj_loss = torch.sum(
            obj_mask * (pred_box1[..., 4] - target_conf.squeeze(-1)) ** 2
        )
        # No-object cells: confidence should be 0
        noobj_mask = 1.0 - obj_mask
        conf_noobj_loss = self.lambda_noobj * torch.sum(
            noobj_mask * (pred_box1[..., 4]) ** 2
        )
        # Also penalize box 2 confidence in no-object cells
        conf_noobj_loss += self.lambda_noobj * torch.sum(
            noobj_mask * (pred_box2[..., 4]) ** 2
        )

        # ============ Classification Loss ============
        class_loss = torch.sum(
            obj_mask.unsqueeze(-1) * (pred_class - target_class) ** 2
        )

        # Total loss
        total_loss = (xy_loss + wh_loss + conf_obj_loss +
                      conf_noobj_loss + class_loss) / batch_size
        return total_loss

# Demo: create random predictions and targets
loss_fn = YOLOv1Loss(S=7, B=2, C=20)
pred = torch.randn(4, 7, 7, 30)       # Batch of 4
target = torch.zeros(4, 7, 7, 25)     # 5 + 20 = 25

# Place an object at cell (3, 3) in first image
target[0, 3, 3, :4] = torch.tensor([0.5, 0.5, 0.3, 0.4])
target[0, 3, 3, 4] = 1.0   # Object present
target[0, 3, 3, 5 + 14] = 1.0  # Class 14 (person)

loss = loss_fn(pred, target)
print(f"YOLO Loss: {loss.item():.4f}")
print(f"Loss components penalize localization, confidence, and classification jointly")

The loss function is the heart of YOLO training. The lambda_coord = 5 multiplier ensures the network prioritizes getting bounding box positions right, while lambda_noobj = 0.5 prevents the overwhelming number of background cells from dominating the gradient signal.

Building YOLOv1 Backbone

The original YOLOv1 uses a custom backbone called Darknet, inspired by GoogLeNet's inception modules but simplified into plain convolutions. It consists of 24 convolutional layers followed by 2 fully connected layers. The architecture progressively reduces spatial resolution while increasing channel depth, ultimately producing the 7×7×30 output tensor.

PyTorch Implementation

Below is a faithful PyTorch implementation of the YOLOv1 architecture. We define the backbone as a sequence of convolutional blocks and attach a detection head that outputs the final predictions:

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv + BatchNorm + LeakyReLU block used throughout Darknet."""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size,
                      stride=stride, padding=padding, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class YOLOv1(nn.Module):
    """
    YOLOv1 Architecture (simplified but faithful).
    Input: 448x448x3
    Output: S x S x (B*5 + C) = 7x7x30
    """
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C

        # Darknet backbone (24 conv layers)
        self.backbone = nn.Sequential(
            # Block 1
            ConvBlock(3, 64, 7, stride=2, padding=3),      # 448 -> 224
            nn.MaxPool2d(2, stride=2),                      # 224 -> 112

            # Block 2
            ConvBlock(64, 192, 3, padding=1),               # 112 -> 112
            nn.MaxPool2d(2, stride=2),                      # 112 -> 56

            # Block 3
            ConvBlock(192, 128, 1),
            ConvBlock(128, 256, 3, padding=1),
            ConvBlock(256, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            nn.MaxPool2d(2, stride=2),                      # 56 -> 28

            # Block 4 (repeated 1x1 → 3x3 pattern)
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 512, 1),
            ConvBlock(512, 1024, 3, padding=1),
            nn.MaxPool2d(2, stride=2),                      # 28 -> 14

            # Block 5
            ConvBlock(1024, 512, 1),
            ConvBlock(512, 1024, 3, padding=1),
            ConvBlock(1024, 512, 1),
            ConvBlock(512, 1024, 3, padding=1),
            ConvBlock(1024, 1024, 3, padding=1),
            ConvBlock(1024, 1024, 3, stride=2, padding=1),  # 14 -> 7

            # Block 6
            ConvBlock(1024, 1024, 3, padding=1),
            ConvBlock(1024, 1024, 3, padding=1),            # 7x7x1024
        )

        # Detection head (2 FC layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 4096),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        features = self.backbone(x)
        output = self.head(features)
        return output.view(-1, self.S, self.S, self.B * 5 + self.C)

# Create model and verify output shape
model = YOLOv1(S=7, B=2, C=20)
x = torch.randn(2, 3, 448, 448)
output = model(x)
print(f"Input shape:  {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Expected:     (2, 7, 7, 30)")
print(f"Parameters:   {sum(p.numel() for p in model.parameters()):,}")

This model has roughly 270 million parameters — significantly larger than modern efficient detectors. The fully connected layers in particular are parameter-heavy, which is why later YOLO versions replaced them with convolutional heads.
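
To see where those parameters live, compare the first FC layer against a hypothetical 1×1 convolutional head (a rough sketch — the conv head below is ours for illustration, not the paper's):

import torch.nn as nn

# YOLOv1's first FC layer vs a hypothetical 1x1 conv detection head
fc = nn.Linear(1024 * 7 * 7, 4096)
conv_head = nn.Conv2d(1024, 2 * 5 + 20, kernel_size=1)  # hypothetical

fc_params = sum(p.numel() for p in fc.parameters())
conv_params = sum(p.numel() for p in conv_head.parameters())
print(f"FC layer params:      {fc_params:,}")   # ~205 million
print(f"1x1 conv head params: {conv_params:,}") # ~31 thousand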

Non-Maximum Suppression (NMS)

After YOLO produces predictions, multiple grid cells often detect the same object. A large dog might span several grid cells, and each cell might predict a box for it. Non-Maximum Suppression (NMS) is the post-processing step that removes duplicate detections, keeping only the best box for each object.

NMS Algorithm: (1) Sort all boxes by confidence score. (2) Take the highest-confidence box and add it to final results. (3) Remove all remaining boxes that have IoU > threshold with the selected box. (4) Repeat until no boxes remain. This greedy approach ensures each object has at most one detection.

NMS from Scratch

Let's implement NMS from scratch, then compare with PyTorch's built-in implementation:

import torch
from torchvision.ops import nms as torchvision_nms

def nms_from_scratch(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression implemented from scratch.

    Args:
        boxes: Tensor (N, 4) in corner format [x1, y1, x2, y2]
        scores: Tensor (N,) confidence scores
        iou_threshold: IoU threshold for suppression

    Returns:
        keep: indices of boxes to keep
    """
    # Sort by confidence (descending)
    order = scores.argsort(descending=True)
    keep = []

    while order.numel() > 0:
        # Pick the best box
        idx = order[0].item()
        keep.append(idx)

        if order.numel() == 1:
            break

        # Compute IoU of this box with all remaining boxes
        remaining = order[1:]
        best_box = boxes[idx].unsqueeze(0)      # (1, 4)
        other_boxes = boxes[remaining]            # (M, 4)

        # IoU calculation
        x1 = torch.max(best_box[:, 0], other_boxes[:, 0])
        y1 = torch.max(best_box[:, 1], other_boxes[:, 1])
        x2 = torch.min(best_box[:, 2], other_boxes[:, 2])
        y2 = torch.min(best_box[:, 3], other_boxes[:, 3])

        intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_best = (best_box[:, 2] - best_box[:, 0]) * (best_box[:, 3] - best_box[:, 1])
        area_other = (other_boxes[:, 2] - other_boxes[:, 0]) * (other_boxes[:, 3] - other_boxes[:, 1])
        union = area_best + area_other - intersection
        iou = intersection / (union + 1e-6)

        # Keep boxes with low IoU (different objects)
        mask = iou.squeeze(0) < iou_threshold
        order = remaining[mask]

    return torch.tensor(keep, dtype=torch.long)

# Demo: Multiple overlapping detections of the same object
boxes = torch.tensor([
    [100, 100, 300, 300],  # High confidence box
    [110, 105, 295, 310],  # Overlapping (same object)
    [105, 98, 305, 295],   # Overlapping (same object)
    [400, 200, 550, 400],  # Different object entirely
    [410, 205, 545, 395],  # Overlapping with box above
], dtype=torch.float32)

scores = torch.tensor([0.95, 0.88, 0.82, 0.90, 0.75])

# Our implementation
keep_ours = nms_from_scratch(boxes, scores, iou_threshold=0.5)
print(f"Our NMS keeps indices: {keep_ours.tolist()}")
print(f"Kept boxes: {len(keep_ours)} out of {len(boxes)}")

# PyTorch's implementation (should match)
keep_torch = torchvision_nms(boxes, scores, iou_threshold=0.5)
print(f"Torchvision NMS keeps: {keep_torch.tolist()}")
print(f"Results match: {keep_ours.tolist() == keep_torch.tolist()}")

NMS reduces our 5 raw predictions to just 2 final detections — one for each distinct object. The overlapping boxes for the same object are suppressed, keeping only the highest-confidence version.

Modern YOLO: YOLOv8 with Ultralytics

While understanding YOLOv1's mechanics is educational, real-world projects use modern implementations like YOLOv8 from Ultralytics. YOLOv8 incorporates years of improvements: anchor-free detection heads, CSP (Cross-Stage Partial) backbone, path aggregation networks, and state-of-the-art training recipes — all wrapped in a clean Python API.

Production Ready (Ultralytics, 2023): YOLOv8 Model Variants

YOLOv8 comes in five sizes (nano, small, medium, large, xlarge) trading speed for accuracy. YOLOv8n runs at 100+ FPS on edge GPUs while YOLOv8x achieves near state-of-the-art mAP on COCO. All models share the same API — just change the size suffix.

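A quick way to see the trade-off yourself (a sketch — each call downloads the corresponding weights on first use):

from ultralytics import YOLO

# Compare parameter counts across the five YOLOv8 sizes
for size in ['n', 's', 'm', 'l', 'x']:
    model = YOLO(f'yolov8{size}.pt')
    n_params = sum(p.numel() for p in model.model.parameters())
    print(f"yolov8{size}: {n_params / 1e6:.1f}M parameters")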

Inference & Fine-tuning

The Ultralytics library makes running YOLOv8 incredibly simple — just a few lines for inference on images. Install with pip install ultralytics:

from ultralytics import YOLO
import torch

# Load a pretrained YOLOv8 model (downloads automatically)
model = YOLO('yolov8n.pt')  # nano model (fastest)

# Run inference on an image
results = model('https://ultralytics.com/images/bus.jpg')

# Process results
for result in results:
    boxes = result.boxes  # Bounding box outputs

    print(f"Detected {len(boxes)} objects:")
    print(f"  Bounding boxes (xyxy): {boxes.xyxy.shape}")
    print(f"  Confidence scores: {boxes.conf}")
    print(f"  Class indices: {boxes.cls}")

    # Get class names
    for i, (box, conf, cls) in enumerate(zip(boxes.xyxy, boxes.conf, boxes.cls)):
        class_name = model.names[int(cls)]
        print(f"  [{i}] {class_name}: {conf:.2f} at {box.tolist()}")

For custom datasets, YOLOv8 supports fine-tuning with just a few more lines. You provide a YAML file describing your dataset and the library handles data loading, augmentation, and training:

from ultralytics import YOLO

# Load pretrained model as starting point
model = YOLO('yolov8n.pt')

# Fine-tune on custom dataset
# Requires a data.yaml file pointing to your train/val images and labels
# Example data.yaml:
# train: /path/to/train/images
# val: /path/to/val/images
# nc: 3  (number of classes)
# names: ['cat', 'dog', 'bird']

# Train for 50 epochs
results = model.train(
    data='data.yaml',       # Path to dataset config
    epochs=50,              # Number of training epochs
    imgsz=640,              # Input image size
    batch=16,               # Batch size
    lr0=0.01,               # Initial learning rate
    device='0',             # GPU device (or 'cpu')
    project='runs/detect',  # Save directory
    name='custom_yolov8',   # Experiment name
)

# Evaluate on validation set
metrics = model.val()
print(f"mAP@0.5: {metrics.box.map50:.4f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.4f}")

# Export model for deployment
model.export(format='onnx')  # ONNX format for production
print("Model exported to ONNX!")

YOLOv8 handles all the complexity internally — data augmentation (mosaic, mixup, HSV shifts), learning rate scheduling (cosine annealing with warmup), and multi-scale training. You just configure high-level parameters.
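
If you need to adjust that behavior, the same train() call accepts augmentation hyperparameters directly. A hedged sketch with illustrative values (argument names follow Ultralytics' training configuration):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Override augmentation hyperparameters (values here are illustrative)
model.train(
    data='data.yaml',
    epochs=50,
    mosaic=1.0,    # probability of mosaic augmentation
    mixup=0.1,     # probability of mixup augmentation
    hsv_h=0.015,   # hue jitter range
    hsv_s=0.7,     # saturation jitter range
    hsv_v=0.4,     # value (brightness) jitter range
)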

Anchor Boxes & FPN

YOLOv1's fixed grid has a major limitation: each cell only predicts 2 boxes with unconstrained shapes. This makes it hard to learn very wide or very tall objects (like a giraffe vs a school bus). Anchor boxes (introduced in YOLOv2) provide shape priors — predefined aspect ratios that the network refines rather than predicting from scratch.

Instead of predicting raw $(x, y, w, h)$, the network predicts offsets from predefined anchor shapes. If an anchor has width $a_w$ and height $a_h$, the predicted box is:

$$w = a_w \cdot e^{t_w}, \quad h = a_h \cdot e^{t_h}$$

Where $t_w, t_h$ are the network's learned adjustments. This makes training more stable because the network starts from reasonable shapes.
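
Here's a minimal sketch of that decoding step (YOLOv2/v3-style: the center offsets pass through a sigmoid so they stay inside the cell; the function and variable names are ours):

import torch

def decode_box(t, anchor_wh, grid_xy, stride):
    """Decode raw outputs (tx, ty, tw, th) against one anchor.
    Returns (cx, cy, w, h) in pixels."""
    tx, ty, tw, th = t
    aw, ah = anchor_wh
    cx = (torch.sigmoid(tx) + grid_xy[0]) * stride  # center stays in its cell
    cy = (torch.sigmoid(ty) + grid_xy[1]) * stride
    w = aw * torch.exp(tw)   # anchor width scaled by learned factor
    h = ah * torch.exp(th)   # anchor height scaled by learned factor
    return cx, cy, w, h

# Raw outputs near zero decode to roughly the anchor shape itself
t = torch.tensor([0.0, 0.0, 0.1, -0.2])
cx, cy, w, h = decode_box(t, anchor_wh=(116, 90), grid_xy=(6, 6), stride=32)
print(f"center=({cx:.1f}, {cy:.1f}), size=({w:.1f} x {h:.1f})")

The anchor grids themselves are generated by tiling each scale's anchor shapes across the feature map, as the next block shows: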

import torch
import numpy as np

def generate_anchors(feature_sizes, anchor_configs, image_size=416):
    """
    Generate anchor boxes at multiple scales (like YOLOv3).

    Args:
        feature_sizes: list of feature map sizes [52, 26, 13]
        anchor_configs: anchor (w, h) pairs for each scale
        image_size: input image resolution

    Returns:
        all_anchors: dict mapping scale to anchor boxes
    """
    all_anchors = {}

    for scale_idx, (feat_size, anchors) in enumerate(zip(feature_sizes, anchor_configs)):
        stride = image_size / feat_size
        grid_anchors = []

        for gy in range(feat_size):
            for gx in range(feat_size):
                cx = (gx + 0.5) * stride
                cy = (gy + 0.5) * stride

                for aw, ah in anchors:
                    # Anchor box in absolute pixel coordinates
                    x1 = cx - aw / 2
                    y1 = cy - ah / 2
                    x2 = cx + aw / 2
                    y2 = cy + ah / 2
                    grid_anchors.append([x1, y1, x2, y2])

        all_anchors[f'scale_{feat_size}x{feat_size}'] = torch.tensor(grid_anchors)

    return all_anchors

# YOLOv3-style anchor configuration (3 scales, 3 anchors each)
feature_sizes = [52, 26, 13]  # Small, Medium, Large objects
anchor_configs = [
    [(10, 13), (16, 30), (33, 23)],       # Small anchors (52x52)
    [(30, 61), (62, 45), (59, 119)],      # Medium anchors (26x26)
    [(116, 90), (156, 198), (373, 326)],  # Large anchors (13x13)
]

anchors = generate_anchors(feature_sizes, anchor_configs)
for scale, boxes in anchors.items():
    print(f"{scale}: {boxes.shape[0]} anchor boxes")

# Total anchors = 52*52*3 + 26*26*3 + 13*13*3 = 10647
total = sum(boxes.shape[0] for boxes in anchors.values())
print(f"Total anchor boxes: {total}")
print(f"Each anchor predicts: 4 coords + 1 objectness + 80 classes = 85 values")

Feature Pyramid Networks

Different-sized objects are best detected at different feature map resolutions. Small objects (like a distant car) need high-resolution features, while large objects (like a close-up face) are captured well by low-resolution, high-level features. Feature Pyramid Networks (FPN) combine both by creating a top-down pathway that merges features at multiple scales:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """
    Simplified Feature Pyramid Network for multi-scale detection.
    Merges features from different backbone stages.
    """
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        # Lateral connections (1x1 conv to match channels)
        self.lateral_convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_channels, 1)
            for in_ch in in_channels_list
        ])
        # Smoothing convolutions (3x3 after merge)
        self.smooth_convs = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels_list
        ])

    def forward(self, features):
        """
        features: list of feature maps from backbone [C3, C4, C5]
                  from high-res to low-res
        """
        # Apply lateral connections
        laterals = [conv(f) for conv, f in zip(self.lateral_convs, features)]

        # Top-down pathway: upsample and add
        for i in range(len(laterals) - 2, -1, -1):
            upsampled = F.interpolate(
                laterals[i + 1], size=laterals[i].shape[2:], mode='nearest'
            )
            laterals[i] = laterals[i] + upsampled

        # Apply smoothing
        outputs = [conv(lat) for conv, lat in zip(self.smooth_convs, laterals)]
        return outputs

# Demo: Simulate backbone features at 3 scales
batch_size = 2
# C3: 52x52, C4: 26x26, C5: 13x13
features = [
    torch.randn(batch_size, 256, 52, 52),   # High-res (small objects)
    torch.randn(batch_size, 512, 26, 26),   # Medium-res
    torch.randn(batch_size, 1024, 13, 13),  # Low-res (large objects)
]

fpn = SimpleFPN(in_channels_list=[256, 512, 1024], out_channels=256)
pyramid_features = fpn(features)

print("Feature Pyramid Network outputs:")
for i, feat in enumerate(pyramid_features):
    print(f"  P{i+3}: {feat.shape} — detects {'small' if i==0 else 'medium' if i==1 else 'large'} objects")

YOLOv3+ uses exactly this pattern: three detection heads at scales 13×13, 26×26, and 52×52. Large objects are detected at the 13×13 scale (large receptive field), while small objects are caught at 52×52 (high spatial resolution). This multi-scale approach is why modern YOLO handles objects of vastly different sizes.

Training a Custom Object Detector

Training a real object detector requires properly formatted data, effective augmentation (that transforms both images AND bounding boxes), and evaluation using the standard mean Average Precision (mAP) metric.

YOLO Annotation Format: Each image has a corresponding .txt file. Each line represents one object: class_id x_center y_center width height (all normalized to [0, 1]). This is the format expected by Ultralytics and most YOLO implementations.
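
For example, a label file for an image containing a dog and a person might look like this (class IDs depend on your dataset's names list; the values below are made up):

# A made-up example label file, one object per line:
label_text = """\
16 0.481 0.634 0.312 0.402
0 0.742 0.512 0.188 0.570
"""

# Parse it the way a YOLO data loader would
for line in label_text.strip().splitlines():
    class_id, x, y, w, h = line.split()
    print(f"class={class_id}  center=({x}, {y})  size=({w} x {h})")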

Here's how to build a custom dataset class for YOLO training that loads images and their bounding box annotations:

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from PIL import Image
import os

class YOLODataset(Dataset):
    """
    Custom dataset for YOLO-format annotations.
    Each image has a .txt label file with format:
    class_id x_center y_center width height (normalized)
    """
    def __init__(self, img_dir, label_dir, img_size=416, S=7, B=2, C=20):
        self.img_dir = img_dir
        self.label_dir = label_dir
        self.img_size = img_size
        self.S, self.B, self.C = S, B, C

        # List all image files
        self.images = [f for f in os.listdir(img_dir)
                       if f.endswith(('.jpg', '.png', '.jpeg'))]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Load image
        img_path = os.path.join(self.img_dir, self.images[idx])
        image = Image.open(img_path).convert('RGB')
        image = image.resize((self.img_size, self.img_size))
        image = torch.tensor(np.array(image), dtype=torch.float32)
        image = image.permute(2, 0, 1) / 255.0  # (C, H, W), normalized

        # Load labels
        # splitext handles .jpg/.jpeg/.png uniformly
        label_file = os.path.splitext(self.images[idx])[0] + '.txt'
        label_path = os.path.join(self.label_dir, label_file)

        boxes = []
        if os.path.exists(label_path):
            with open(label_path, 'r') as f:
                for line in f.readlines():
                    parts = line.strip().split()
                    class_id = int(parts[0])
                    x, y, w, h = map(float, parts[1:5])
                    boxes.append([class_id, x, y, w, h])

        # Convert to YOLO target tensor
        target = self._encode_target(boxes)
        return image, target

    def _encode_target(self, boxes):
        """Encode bounding boxes into S×S×(5+C) target tensor."""
        target = torch.zeros(self.S, self.S, 5 + self.C)

        for box in boxes:
            class_id, x, y, w, h = box
            grid_x = int(x * self.S)
            grid_y = int(y * self.S)
            grid_x = min(grid_x, self.S - 1)
            grid_y = min(grid_y, self.S - 1)

            # Only assign if cell is empty (first object wins)
            if target[grid_y, grid_x, 4] == 0:
                target[grid_y, grid_x, 0] = x * self.S - grid_x
                target[grid_y, grid_x, 1] = y * self.S - grid_y
                target[grid_y, grid_x, 2] = w
                target[grid_y, grid_x, 3] = h
                target[grid_y, grid_x, 4] = 1.0
                target[grid_y, grid_x, 5 + class_id] = 1.0

        return target

# Example usage (with synthetic data for demonstration)
print("YOLODataset expects:")
print("  img_dir/  → image files (.jpg, .png)")
print("  label_dir/ → label files (.txt, same name as image)")
print("  Label format: 'class_id x_center y_center width height'")
print(f"\nTarget tensor shape: ({7}, {7}, {5 + 20}) = (7, 7, 25)")
print("Each cell encodes: [x_offset, y_offset, w, h, objectness, 20 class probs]")

mAP Evaluation

The standard metric for object detection is mean Average Precision (mAP). It measures both localization quality and classification accuracy across all IoU thresholds. Here's a simplified mAP calculator:

import torch
import numpy as np

def calculate_ap(precisions, recalls):
    """Calculate Average Precision using 11-point interpolation."""
    ap = 0.0
    for t in np.arange(0, 1.1, 0.1):
        # Maximum precision at recall >= t
        prec_at_recall = precisions[recalls >= t]
        if len(prec_at_recall) > 0:
            ap += prec_at_recall.max()
    return ap / 11.0

def compute_map(predictions, ground_truths, iou_threshold=0.5, num_classes=20):
    """
    Compute mean Average Precision.

    Args:
        predictions: list of dicts with 'boxes', 'scores', 'labels', 'image_id'
        ground_truths: list of dicts with 'boxes', 'labels', 'image_id'
        iou_threshold: IoU threshold for a "correct" detection
        num_classes: number of object classes
    """
    aps = []

    for cls in range(num_classes):
        # Collect all predictions and GTs for this class
        cls_preds = []
        cls_gts = {}
        n_gt = 0

        for gt in ground_truths:
            mask = gt['labels'] == cls
            img_id = gt['image_id']
            cls_gts[img_id] = gt['boxes'][mask]
            n_gt += mask.sum().item()

        if n_gt == 0:
            continue

        for pred in predictions:
            mask = pred['labels'] == cls
            for box, score in zip(pred['boxes'][mask], pred['scores'][mask]):
                cls_preds.append({
                    'box': box, 'score': score.item(),
                    'image_id': pred['image_id']
                })

        # Sort by confidence
        cls_preds.sort(key=lambda x: x['score'], reverse=True)

        # Compute precision-recall curve
        tp = np.zeros(len(cls_preds))
        fp = np.zeros(len(cls_preds))
        matched = {img_id: set() for img_id in cls_gts}

        for i, pred in enumerate(cls_preds):
            img_id = pred['image_id']
            gt_boxes = cls_gts.get(img_id, torch.zeros(0, 4))

            if len(gt_boxes) == 0:
                fp[i] = 1
                continue

            # Find best matching GT box
            pred_box = pred['box'].unsqueeze(0)
            ious = compute_single_iou(pred_box, gt_boxes)
            best_iou, best_idx = ious.max(dim=0)  # ious is 1-D: one value per GT box

            if best_iou >= iou_threshold and best_idx.item() not in matched[img_id]:
                tp[i] = 1
                matched[img_id].add(best_idx.item())
            else:
                fp[i] = 1

        # Cumulative precision and recall
        cum_tp = np.cumsum(tp)
        cum_fp = np.cumsum(fp)
        precisions = cum_tp / (cum_tp + cum_fp + 1e-6)
        recalls = cum_tp / (n_gt + 1e-6)

        ap = calculate_ap(precisions, recalls)
        aps.append(ap)

    mAP = np.mean(aps) if aps else 0.0
    return mAP

def compute_single_iou(box, boxes):
    """Compute IoU of one box against many boxes."""
    x1 = torch.max(box[:, 0], boxes[:, 0])
    y1 = torch.max(box[:, 1], boxes[:, 1])
    x2 = torch.min(box[:, 2], boxes[:, 2])
    y2 = torch.min(box[:, 3], boxes[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area1 = (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])
    area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area1 + area2 - inter
    return inter / (union + 1e-6)

# Demo with synthetic detections
print("mAP Evaluation Summary:")
print("  mAP@0.5   → IoU threshold of 0.5 (standard)")
print("  mAP@0.75  → IoU threshold of 0.75 (strict)")
print("  mAP@[.5:.95] → Average across thresholds 0.5 to 0.95 (COCO metric)")
print("\nHigher mAP = better detector. COCO leaderboard uses mAP@[.5:.95]")

The mAP metric tells you: "Across all classes and all confidence thresholds, what fraction of detections are correct?" A detector with mAP@0.5 of 0.45 means 45% of its predictions at IoU>0.5 are true positives, averaged over all recall levels. Modern YOLOv8x achieves ~53.9 mAP@[.5:.95] on COCO — state-of-the-art for real-time detectors.
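
To see the metric behave, here's a tiny synthetic sanity check using the functions above (one image, one class, made-up boxes) — note how a confident false positive drags the AP down:

import torch

ground_truths = [
    {'image_id': 0,
     'boxes': torch.tensor([[100., 100., 200., 200.]]),
     'labels': torch.tensor([0])},
]
predictions = [
    {'image_id': 0,
     'boxes': torch.tensor([[300., 300., 400., 400.],    # false positive
                            [105.,  98., 198., 205.]]),  # good match
     'scores': torch.tensor([0.9, 0.6]),                 # FP is MORE confident
     'labels': torch.tensor([0, 0])},
]

ap = compute_map(predictions, ground_truths, iou_threshold=0.5, num_classes=1)
print(f"mAP@0.5: {ap:.2f}")  # 0.50 — the high-confidence miss halves the AP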

Inference & Visualization

A detection pipeline isn't complete until you can visualize the results. Drawing bounding boxes with class labels and confidence scores on images is essential for debugging and demonstrating your model. Here's a complete visualization function:

import torch
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def visualize_detections(image, boxes, scores, labels, class_names,
                         conf_threshold=0.5, figsize=(12, 8)):
    """
    Draw bounding boxes with labels on an image.

    Args:
        image: numpy array (H, W, 3) in [0, 255] or [0, 1]
        boxes: tensor (N, 4) in [x1, y1, x2, y2] pixel coords
        scores: tensor (N,) confidence scores
        labels: tensor (N,) class indices
        class_names: list of class name strings
        conf_threshold: minimum confidence to display
    """
    # Normalize image to [0, 1] for matplotlib
    if image.max() > 1.0:
        image = image / 255.0

    fig, ax = plt.subplots(1, figsize=figsize)
    ax.imshow(image)

    # Color palette for different classes
    colors = plt.cm.Set3(np.linspace(0, 1, len(class_names)))

    # Filter by confidence
    mask = scores >= conf_threshold
    boxes = boxes[mask]
    scores = scores[mask]
    labels = labels[mask]

    for box, score, label in zip(boxes, scores, labels):
        x1, y1, x2, y2 = box.tolist()
        w, h = x2 - x1, y2 - y1
        cls_idx = int(label)
        color = colors[cls_idx % len(colors)]

        # Draw bounding box
        rect = patches.Rectangle(
            (x1, y1), w, h,
            linewidth=2, edgecolor=color, facecolor='none'
        )
        ax.add_patch(rect)

        # Draw label background
        label_text = f"{class_names[cls_idx]}: {score:.2f}"
        ax.text(
            x1, y1 - 5, label_text,
            fontsize=10, fontweight='bold',
            color='white',
            bbox=dict(boxstyle='round,pad=0.3',
                      facecolor=color, alpha=0.8)
        )

    ax.axis('off')
    ax.set_title(f"Detections: {len(boxes)} objects (conf > {conf_threshold})")
    plt.tight_layout()
    plt.show()

# Demo with synthetic detections
np.random.seed(42)
H, W = 480, 640
image = np.random.randint(100, 200, (H, W, 3), dtype=np.uint8)

# Simulate detections
boxes = torch.tensor([
    [50, 80, 200, 300],    # Person
    [300, 150, 500, 400],  # Car
    [420, 50, 580, 180],   # Dog
    [100, 350, 250, 470],  # Chair
])
scores = torch.tensor([0.92, 0.87, 0.78, 0.45])
labels = torch.tensor([0, 2, 16, 56])
class_names = ['person'] + ['bicycle', 'car'] + [''] * 13 + ['dog'] + [''] * 39 + ['chair']

print("Visualization function ready!")
print(f"  {len(boxes)} raw detections")
print(f"  After threshold (0.5): {(scores >= 0.5).sum()} shown")
# Uncomment below to actually render (requires display):
# visualize_detections(image, boxes, scores, labels, class_names, conf_threshold=0.5)

Real-Time Detection Pipeline

For real-time detection from a webcam or video stream, you need an efficient processing loop that captures frames, runs inference, and displays results at interactive speeds. The class below sketches that loop with simulated frames so it runs without a camera — in production you'd plug in OpenCV for capture/display and Ultralytics for inference:

import torch
import numpy as np
import time

# Real-time detection pipeline (conceptual — requires webcam for actual use)
class RealtimeDetector:
    """
    Real-time object detection pipeline.
    Captures frames, runs YOLO inference, draws results.
    """
    def __init__(self, model_name='yolov8n.pt', conf_threshold=0.5):
        self.conf_threshold = conf_threshold
        self.fps_history = []
        # In production: self.model = YOLO(model_name)

    def process_frame(self, frame, model_output):
        """
        Process a single frame with detections.
        Returns annotated frame with boxes drawn.
        """
        start = time.time()

        # Simulate detection results (in production, model(frame))
        annotated = frame.copy()

        # Draw detections
        for box, score, label in model_output:
            if score < self.conf_threshold:
                continue
            x1, y1, x2, y2 = map(int, box)
            # In production: cv2.rectangle, cv2.putText
            annotated[y1:y1+3, x1:x2] = [0, 255, 0]  # Top border
            annotated[y2-3:y2, x1:x2] = [0, 255, 0]  # Bottom border

        # Calculate FPS
        elapsed = time.time() - start
        fps = 1.0 / (elapsed + 1e-6)
        self.fps_history.append(fps)

        return annotated, fps

    def run_benchmark(self, num_frames=100, frame_size=(640, 480)):
        """Benchmark detection speed without actual camera."""
        print(f"Benchmarking {num_frames} frames at {frame_size}...")

        for i in range(num_frames):
            # Simulate frame capture
            frame = np.random.randint(0, 255, (*frame_size, 3), dtype=np.uint8)

            # Simulate model inference (just timing the overhead)
            start = time.time()
            _ = torch.randn(1, 3, 640, 640)  # Simulate tensor creation
            elapsed = time.time() - start
            self.fps_history.append(1.0 / (elapsed + 1e-6))

        avg_fps = np.mean(self.fps_history[-num_frames:])
        print(f"Average processing speed: {avg_fps:.0f} FPS")
        print(f"Latency per frame: {1000/avg_fps:.1f} ms")
        return avg_fps

# Run benchmark
detector = RealtimeDetector(conf_threshold=0.5)
fps = detector.run_benchmark(num_frames=50)

print(f"\nReal-time detection pipeline:")
print(f"  Model: YOLOv8n (nano — optimized for speed)")
print(f"  Input: 640×640 (standard YOLO input size)")
print(f"  Expected FPS with GPU: 80-120 FPS")
print(f"  Expected FPS with CPU: 15-30 FPS")

In production, you'd use OpenCV's cv2.VideoCapture for camera input and cv2.imshow for display. The Ultralytics library also provides model.predict(source=0, show=True) which handles the entire webcam pipeline in one line.
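
For reference, a minimal version of that OpenCV loop might look like this (a sketch; results[0].plot() is Ultralytics' built-in drawing helper):

import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
cap = cv2.VideoCapture(0)  # default camera

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, conf=0.5, verbose=False)
    annotated = results[0].plot()  # BGR frame with boxes drawn
    cv2.imshow('YOLOv8', annotated)
    if cv2.waitKey(1) & 0xFF == ord('q'):  # quit on 'q'
        break

cap.release()
cv2.destroyAllWindows()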

Here's the complete Ultralytics one-liner for live webcam detection:

from ultralytics import YOLO

# One-line real-time webcam detection
model = YOLO('yolov8n.pt')

# Stream from webcam (source=0) with live display
# This opens a window showing detections in real-time
results = model.predict(
    source=0,           # Webcam index (0 = default camera)
    show=True,          # Display results live
    conf=0.5,           # Confidence threshold
    stream=True,        # Process as a stream (memory efficient)
    verbose=False,      # Suppress per-frame logging
)

# Process each frame's results (if needed)
for result in results:
    # Access detections
    boxes = result.boxes.xyxy      # Bounding boxes
    confs = result.boxes.conf      # Confidence scores
    classes = result.boxes.cls     # Class indices
    
    # Count detections per frame
    n_objects = len(boxes)
    if n_objects > 0:
        print(f"Frame: {n_objects} objects detected")
    
    # Break on 'q' key (handled by show=True internally)

That's the beauty of modern YOLO — decades of research compressed into a library that gives you real-time object detection with minimal code. But understanding the internals (grid cells, IoU, NMS, FPN) lets you debug problems, customize architectures, and push the boundaries of what's possible.

Data Augmentation for Detection

A crucial difference between augmenting for classification vs detection: when you transform the image (flip, rotate, crop), you MUST apply the same geometric transformation to the bounding boxes. Here's the underlying logic implemented by hand — libraries like Albumentations automate exactly these box-aware transforms:

import torch
import numpy as np

# Demonstrate bounding box-aware augmentation logic
def horizontal_flip_with_boxes(image, boxes):
    """
    Flip image horizontally and adjust bounding boxes.

    Args:
        image: numpy array (H, W, 3)
        boxes: numpy array (N, 4) in [x1, y1, x2, y2] normalized [0,1]

    Returns:
        flipped_image, flipped_boxes
    """
    # Flip image
    flipped_image = np.flip(image, axis=1).copy()

    # Flip boxes: x_new = 1 - x_old (mirror x-coordinates)
    flipped_boxes = boxes.copy()
    flipped_boxes[:, 0] = 1.0 - boxes[:, 2]  # new x1 = 1 - old x2
    flipped_boxes[:, 2] = 1.0 - boxes[:, 0]  # new x2 = 1 - old x1

    return flipped_image, flipped_boxes

def random_crop_with_boxes(image, boxes, min_scale=0.5):
    """
    Random crop that ensures at least some boxes remain valid.

    Args:
        image: numpy array (H, W, 3)
        boxes: numpy array (N, 4) in [x1, y1, x2, y2] normalized
        min_scale: minimum crop size relative to original
    """
    H, W = image.shape[:2]

    # Random crop parameters
    scale = np.random.uniform(min_scale, 1.0)
    crop_h, crop_w = int(H * scale), int(W * scale)
    top = np.random.randint(0, H - crop_h + 1)
    left = np.random.randint(0, W - crop_w + 1)

    # Crop image
    cropped = image[top:top+crop_h, left:left+crop_w]

    # Adjust boxes to crop coordinates
    crop_x1, crop_y1 = left / W, top / H
    crop_x2, crop_y2 = (left + crop_w) / W, (top + crop_h) / H

    adjusted_boxes = []
    for box in boxes:
        # Clip box to crop region
        new_x1 = max(0, (box[0] - crop_x1) / (crop_x2 - crop_x1))
        new_y1 = max(0, (box[1] - crop_y1) / (crop_y2 - crop_y1))
        new_x2 = min(1, (box[2] - crop_x1) / (crop_x2 - crop_x1))
        new_y2 = min(1, (box[3] - crop_y1) / (crop_y2 - crop_y1))

        # Keep box only if it has valid area
        if new_x2 > new_x1 + 0.01 and new_y2 > new_y1 + 0.01:
            adjusted_boxes.append([new_x1, new_y1, new_x2, new_y2])

    return cropped, np.array(adjusted_boxes) if adjusted_boxes else np.zeros((0, 4))

# Demo
image = np.random.randint(0, 255, (416, 416, 3), dtype=np.uint8)
boxes = np.array([
    [0.1, 0.2, 0.5, 0.7],   # Object on the left
    [0.6, 0.3, 0.9, 0.8],   # Object on the right
])

# Horizontal flip
flipped_img, flipped_boxes = horizontal_flip_with_boxes(image, boxes)
print("Original boxes:", boxes)
print("After H-flip:  ", flipped_boxes)
print("  Left object moved to right, right object moved to left ✓")

# Random crop
cropped_img, cropped_boxes = random_crop_with_boxes(image, boxes)
print(f"\nCropped image: {cropped_img.shape}")
print(f"Remaining boxes after crop: {len(cropped_boxes)}")

The Ultralytics library applies these augmentations (plus mosaic and mixup) automatically during training. Mosaic augmentation — stitching 4 images into one — is particularly effective for detection because it increases the variety of object scales and contexts the model sees during training.
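
As a rough sketch of the mosaic idea (simplified — real implementations also jitter the mosaic center point and clip boxes that cross quadrant borders):

import numpy as np

def mosaic_4(images, boxes_list, out_size=416):
    """Tile 4 images into a 2x2 grid; remap their normalized
    [x1, y1, x2, y2] boxes into the combined frame."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    all_boxes = []
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]  # (top, left)

    for (top, left), img, boxes in zip(offsets, images, boxes_list):
        h, w = img.shape[:2]
        # Nearest-neighbor resize to half x half
        ys = np.arange(half) * h // half
        xs = np.arange(half) * w // half
        canvas[top:top + half, left:left + half] = img[ys][:, xs]
        for x1, y1, x2, y2 in boxes:
            # Each source box shrinks by 2 and shifts into its quadrant
            all_boxes.append([
                (x1 * half + left) / out_size, (y1 * half + top) / out_size,
                (x2 * half + left) / out_size, (y2 * half + top) / out_size,
            ])
    return canvas, np.array(all_boxes)

# Demo with 4 random images, one box each
imgs = [np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8) for _ in range(4)]
bxs = [np.array([[0.2, 0.2, 0.8, 0.8]]) for _ in range(4)]
mosaic, mosaic_boxes = mosaic_4(imgs, bxs)
print(mosaic.shape, mosaic_boxes.shape)  # (416, 416, 3) (4, 4)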

Conclusion & Next Steps

You've now built a comprehensive understanding of YOLO object detection — from the foundational philosophy of single-pass regression, through the grid cell prediction system, all the way to modern YOLOv8 with Ultralytics. Here's what we covered:

  • The YOLO paradigm: treating detection as regression on a spatial grid
  • Core mechanics: grid cells, bounding box encoding, confidence scores
  • Training: multi-part loss function with scale-aware width/height
  • Post-processing: NMS for duplicate removal
  • Modern advances: anchor boxes, FPN, multi-scale detection
  • Production usage: Ultralytics YOLOv8 for training and inference

Next steps: Try fine-tuning YOLOv8 on a custom dataset (e.g., detecting specific products, animals, or defects). The Ultralytics library makes this straightforward — you just need annotated images in YOLO format and a data.yaml configuration file.