Object Detection Landscape
Before diving into YOLO, it's important to understand where object detection sits in the broader computer vision hierarchy. There are three progressively harder tasks that a neural network can perform on an image:
- Image Classification — "What is in this image?" A single label per image (e.g., "cat").
- Object Localization — "Where is the object?" Classification plus a single bounding box.
- Object Detection — "Where are ALL objects?" Multiple bounding boxes, each with a class label and confidence score.
Object detection is dramatically harder than classification because the network must simultaneously predict how many objects are present, where each one is located, and what class each belongs to — all in a single forward pass.
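To make that jump in output complexity concrete, here is a minimal sketch of what each task returns. The types are hypothetical (not from any particular library) and exist only to contrast the two output shapes:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    """One detected object: a box, a class label, and a confidence score."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    label: str
    score: float

# Classification: exactly one label per image
classification_output: str = "cat"

# Detection: a variable-length list of boxes, labels, and scores
detection_output: List[Detection] = [
    Detection((48.0, 60.0, 210.0, 300.0), "cat", 0.92),
    Detection((250.0, 80.0, 400.0, 290.0), "dog", 0.87),
]

print(f"Classification predicts 1 label; detection predicted {len(detection_output)} objects")
```

The detector must produce a variable number of structured outputs per image, which is precisely what makes the problem harder.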
Two-Stage vs One-Stage Detectors
Historically, object detectors fell into two camps based on their architectural philosophy:
flowchart TD
A[Object Detection] --> B[Two-Stage Detectors]
A --> C[One-Stage Detectors]
B --> D[R-CNN Family]
D --> D1[R-CNN 2014]
D --> D2[Fast R-CNN 2015]
D --> D3[Faster R-CNN 2015]
C --> E[YOLO Family]
C --> F[SSD 2016]
C --> G[RetinaNet 2017]
E --> E1[YOLOv1 2016]
E --> E2[YOLOv3 2018]
E --> E3[YOLOv5/v8 2020+]
B -.->|Higher Accuracy| H[Slower ~5 FPS]
C -.->|Real-Time| I[Faster ~30-60 FPS]
Two-stage detectors (R-CNN family) first propose candidate regions where objects might be, then classify each region individually. This is thorough but slow — Faster R-CNN achieves about 5-7 FPS on a GPU. One-stage detectors (YOLO, SSD) skip the region proposal step entirely and predict bounding boxes and classes in a single network pass, enabling real-time speeds of 30-60+ FPS.
Why Real-Time Matters
The following code demonstrates the speed difference by timing a simple classification model versus a detection model. This gives you intuition for why architectural choices matter for real-time applications.
import torch
import torch.nn as nn
import time

# Simple classifier: single prediction per image
class SimpleClassifier(nn.Module):
    def __init__(self, num_classes=80):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

# Simple detector: predicts grid of boxes + classes
class SimpleDetector(nn.Module):
    def __init__(self, S=7, B=2, C=80):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(S)
        )
        self.head = nn.Conv2d(128, B * 5 + C, 1)

    def forward(self, x):
        x = self.features(x)
        return self.head(x)  # Shape: (batch, B*5+C, S, S)

# Benchmark both models
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
img = torch.randn(1, 3, 448, 448).to(device)
classifier = SimpleClassifier().to(device).eval()
detector = SimpleDetector().to(device).eval()

def benchmark(model, img, iters=100):
    """Average forward-pass time; synchronize so GPU timing is accurate,
    since CUDA kernels launch asynchronously."""
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        with torch.no_grad():
            _ = model(img)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    return (time.time() - start) / iters

cls_time = benchmark(classifier, img)
det_time = benchmark(detector, img)
print(f"Classifier: {cls_time*1000:.2f} ms/image")
print(f"Detector:   {det_time*1000:.2f} ms/image")
with torch.no_grad():
    print(f"Detector output shape: {detector(img).shape}")
Notice that the detector produces a spatial grid of predictions (S×S) in roughly the same time as a classifier because it's still just one forward pass — the key insight behind YOLO's speed.
The YOLO Philosophy
"You Only Look Once" perfectly captures YOLO's core innovation. Unlike two-stage detectors that examine an image multiple times (first for proposals, then for classification), YOLO processes the entire image in a single neural network evaluation. The network simultaneously predicts all bounding boxes and class probabilities for every object in the frame.
Detection as Regression
YOLO's radical idea was to treat object detection as a regression problem rather than a classification problem. Instead of asking "Is there an object here?" at thousands of candidate locations, YOLO asks "Given this entire image, what are all the bounding box coordinates and class labels?" The network directly outputs a fixed-size tensor containing all predictions.
YOLO's Single-Pass Design
By encoding the entire detection pipeline into a single convolutional neural network, YOLO achieves three things simultaneously: (1) it sees the full image context when making predictions (reducing background false positives), (2) it runs at real-time speeds because there's only one network to evaluate, and (3) it learns generalizable representations of objects that transfer well to new domains.
Here's how YOLO conceptually differs from a region-based approach. We can simulate the difference in approaches with pseudocode-style Python:
import torch
import torch.nn as nn

# Two-stage approach (conceptual): propose then classify
def two_stage_detect(image, rpn, classifier):
    """Slow: generates ~2000 proposals, classifies each one"""
    proposals = rpn(image)  # ~2000 candidate boxes
    results = []
    for box in proposals:  # Classify EACH proposal
        crop = image[:, :, box[1]:box[3], box[0]:box[2]]
        label = classifier(crop)
        results.append((box, label))
    return results  # Very slow!

# YOLO approach: single pass does everything
class YOLOConcept(nn.Module):
    """Fast: one forward pass predicts ALL boxes and classes"""
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        # Single backbone + head
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 192, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(S),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(192 * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        features = self.backbone(x)
        output = self.head(features)
        # Reshape to (batch, S, S, B*5 + C)
        return output.view(-1, self.S, self.S, self.B * 5 + self.C)

# Demo
model = YOLOConcept(S=7, B=2, C=20)
img = torch.randn(1, 3, 448, 448)
predictions = model(img)
print(f"Input shape: {img.shape}")
print(f"Output shape: {predictions.shape}")
print(f"Grid: 7x7, each cell predicts: {2*5 + 20} values")
print(f"  = 2 boxes × 5 values (x, y, w, h, conf) + 20 class probs")
The output tensor has shape (batch, 7, 7, 30) — meaning each of the 49 grid cells predicts 2 bounding boxes (each with 5 values: x, y, w, h, confidence) plus 20 class probabilities. Everything in one shot.
YOLO Grid System
YOLO divides the input image into an $S \times S$ grid (typically $7 \times 7$ for YOLOv1). Each grid cell is responsible for detecting objects whose center falls within that cell. Each cell predicts:
- $B$ bounding boxes, each with 5 values: $(x, y, w, h, \text{confidence})$
- $C$ class probabilities: $P(\text{Class}_i | \text{Object})$ for each class
The total output tensor has shape $S \times S \times (B \times 5 + C)$. For YOLOv1 with PASCAL VOC (20 classes): $7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30$.
flowchart LR
A[Input Image<br/>448×448] --> B[CNN Backbone<br/>24 Conv Layers]
B --> C[Output Tensor<br/>7×7×30]
C --> D[Grid Cell i,j]
D --> E[Box 1: x,y,w,h,conf]
D --> F[Box 2: x,y,w,h,conf]
D --> G[20 Class Probs]
E --> H[Final Detection]
F --> H
G --> H
H --> I[class_conf =<br/>P class × IoU]
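The layout of that 30-value vector is easy to verify by slicing it apart. The snippet below uses a random tensor in place of real network output, purely to show the indexing:

```python
import torch

S, B, C = 7, 2, 20
output = torch.randn(1, S, S, B * 5 + C)  # mock YOLOv1 output: (1, 7, 7, 30)

# Slice one grid cell's prediction vector into its components
cell = output[0, 3, 4]       # cell at row 3, column 4: 30 values
box1 = cell[0:5]             # x, y, w, h, confidence for box 1
box2 = cell[5:10]            # x, y, w, h, confidence for box 2
class_probs = cell[10:]      # 20 conditional class probabilities

print(cell.shape, box1.shape, box2.shape, class_probs.shape)
```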
The Responsible Cell Concept
A critical detail: the grid cell that contains the center point of a ground-truth object is "responsible" for predicting that object. If a dog's center is at pixel (200, 300) in a 448×448 image, that maps to grid cell $(200/64, 300/64) = (3, 4)$ in a 7×7 grid (where each cell is 64 pixels wide). That specific cell must predict the dog's bounding box.
import torch

def assign_objects_to_grid(bboxes, labels, S=7, img_size=448):
    """
    Assign ground-truth objects to grid cells.
    Args:
        bboxes: Tensor of shape (N, 4) with [x_center, y_center, width, height]
                all normalized to [0, 1] relative to image size
        labels: Tensor of shape (N,) with class indices
        S: Grid size (7 for YOLOv1)
    Returns:
        target: Tensor of shape (S, S, 5 + C) with assigned ground truth
    """
    C = 20  # Number of classes (PASCAL VOC)
    target = torch.zeros(S, S, 5 + C)
    for i in range(len(bboxes)):
        x_center, y_center, w, h = bboxes[i]
        # Which grid cell is responsible?
        grid_x = int(x_center * S)  # Column index
        grid_y = int(y_center * S)  # Row index
        # Clamp to valid range
        grid_x = min(grid_x, S - 1)
        grid_y = min(grid_y, S - 1)
        # Position relative to grid cell (0 to 1 within cell)
        x_cell = x_center * S - grid_x
        y_cell = y_center * S - grid_y
        # Store: [x_cell, y_cell, w, h, confidence, one-hot class]
        target[grid_y, grid_x, 0] = x_cell
        target[grid_y, grid_x, 1] = y_cell
        target[grid_y, grid_x, 2] = w
        target[grid_y, grid_x, 3] = h
        target[grid_y, grid_x, 4] = 1.0  # Object is present
        # One-hot encode class
        class_idx = int(labels[i])
        target[grid_y, grid_x, 5 + class_idx] = 1.0
    return target

# Example: 2 objects in a 448x448 image
bboxes = torch.tensor([
    [0.45, 0.65, 0.30, 0.40],  # Dog at center (0.45, 0.65)
    [0.80, 0.20, 0.15, 0.25],  # Car at center (0.80, 0.20)
])
labels = torch.tensor([11, 6])  # dog=11, car=6 in VOC
target = assign_objects_to_grid(bboxes, labels)
print(f"Target shape: {target.shape}")
print(f"Dog assigned to cell: ({int(0.65*7)}, {int(0.45*7)}) = (4, 3)")
print(f"Car assigned to cell: ({int(0.20*7)}, {int(0.80*7)}) = (1, 5)")
print(f"Cell (4,3) confidence: {target[4, 3, 4].item()}")
print(f"Cell (4,3) class 11 (dog): {target[4, 3, 5+11].item()}")
This assignment mechanism means that YOLO has a natural limitation: each grid cell can only predict one object in YOLOv1. If two objects have centers in the same cell, only one can be detected. Later versions (YOLOv3+) address this with anchor boxes and multi-scale predictions.
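The collision is easy to demonstrate with two hypothetical object centers that happen to land in the same cell:

```python
S = 7
# Two nearby object centers, normalized (x, y) — both fall in the same cell
centers = [(0.44, 0.63), (0.47, 0.66)]

cells = [(int(y * S), int(x * S)) for x, y in centers]  # (row, col) per object
print(f"Object cells: {cells}")
print(f"Collision: {cells[0] == cells[1]}")  # both map to (4, 3), so one object is lost
```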
Bounding Box Representation
YOLO predicts bounding boxes using a specific coordinate format. Each box has 5 values:
- $(x, y)$ — center of the box relative to the grid cell (values between 0 and 1)
- $(w, h)$ — width and height relative to the entire image (values between 0 and 1)
- confidence — $P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}}$
The confidence score captures two things: how likely an object exists in that box AND how well the predicted box aligns with the actual object. Formally:
$$\text{Confidence} = P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}}$$
Where IoU (Intersection over Union) measures the overlap between predicted and ground-truth boxes:
$$\text{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$
Format Conversion
In practice, we frequently convert between box formats. The two most common are center format (x_center, y_center, w, h) used by YOLO and corner format (x_min, y_min, x_max, y_max) used for IoU computation and visualization.
import torch

def center_to_corners(boxes):
    """
    Convert boxes from (x_center, y_center, w, h) to (x1, y1, x2, y2).
    All values normalized to [0, 1].
    """
    x_center, y_center, w, h = boxes.unbind(-1)
    x1 = x_center - w / 2
    y1 = y_center - h / 2
    x2 = x_center + w / 2
    y2 = y_center + h / 2
    return torch.stack([x1, y1, x2, y2], dim=-1)

def corners_to_center(boxes):
    """
    Convert boxes from (x1, y1, x2, y2) to (x_center, y_center, w, h).
    """
    x1, y1, x2, y2 = boxes.unbind(-1)
    x_center = (x1 + x2) / 2
    y_center = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    return torch.stack([x_center, y_center, w, h], dim=-1)

def compute_iou(boxes1, boxes2):
    """
    Compute IoU between two sets of boxes (both in corner format).
    boxes1: (N, 4), boxes2: (M, 4)
    Returns: (N, M) IoU matrix
    """
    # Intersection coordinates
    x1 = torch.max(boxes1[:, None, 0], boxes2[None, :, 0])
    y1 = torch.max(boxes1[:, None, 1], boxes2[None, :, 1])
    x2 = torch.min(boxes1[:, None, 2], boxes2[None, :, 2])
    y2 = torch.min(boxes1[:, None, 3], boxes2[None, :, 3])
    # Intersection area (clamp to 0 if no overlap)
    intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    # Union area
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    union = area1[:, None] + area2[None, :] - intersection
    return intersection / (union + 1e-6)

# Demo: Convert and compute IoU
pred_boxes = torch.tensor([
    [0.5, 0.5, 0.4, 0.6],   # Predicted box (center format)
    [0.3, 0.3, 0.2, 0.2],   # Another prediction
])
gt_boxes = torch.tensor([
    [0.48, 0.52, 0.38, 0.58],  # Ground truth (center format)
])
# Convert to corners for IoU
pred_corners = center_to_corners(pred_boxes)
gt_corners = center_to_corners(gt_boxes)
iou_matrix = compute_iou(pred_corners, gt_corners)
print(f"Predicted boxes (center): \n{pred_boxes}")
print(f"Predicted boxes (corners): \n{pred_corners}")
print(f"IoU with ground truth: {iou_matrix.squeeze()}")
print(f"Box 1 IoU: {iou_matrix[0, 0]:.4f} (good match)")
print(f"Box 2 IoU: {iou_matrix[1, 0]:.4f} (poor match)")
IoU is the fundamental metric for object detection — it tells us how well a predicted box overlaps with the ground truth. An IoU above 0.5 is typically considered a "correct" detection, while 0.75+ indicates high-quality localization.
YOLO Loss Function
The YOLO loss function is a multi-part sum-squared error that balances three objectives: box localization, confidence prediction, and class prediction. The full loss is:
$$\mathcal{L} = \lambda_{\text{coord}} \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{conf}} + \mathcal{L}_{\text{class}}$$
Where:
- $\lambda_{\text{coord}} = 5$ — upweights localization errors (boxes matter more than background confidence)
- $\lambda_{\text{noobj}} = 0.5$ — downweights confidence loss for cells without objects (most cells are background)
- Box width/height use $\sqrt{w}$ and $\sqrt{h}$ instead of raw values to make the loss more sensitive to small-box errors
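A quick numeric check shows why the square root matters. The same absolute width error (0.05 here) costs the same under a raw squared loss whether the box is small or large, but under the sqrt transform the small box is penalized far more heavily:

```python
import math

def raw_error(w_pred, w_true):
    """Squared error on raw width."""
    return (w_pred - w_true) ** 2

def sqrt_error(w_pred, w_true):
    """Squared error on sqrt(width), as in the YOLOv1 loss."""
    return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2

small = (0.15, 0.10)   # predicted 0.15, true 0.10 — 50% relative error
large = (0.85, 0.80)   # predicted 0.85, true 0.80 — ~6% relative error

print(f"Raw loss:  small={raw_error(*small):.4f}  large={raw_error(*large):.4f}")   # identical
print(f"Sqrt loss: small={sqrt_error(*small):.4f}  large={sqrt_error(*large):.4f}")  # small box penalized more
```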
Loss Implementation
Here's a complete implementation of the YOLOv1 loss function. This is one of the most educational pieces of code because it shows how every prediction component is supervised:
import torch
import torch.nn as nn

class YOLOv1Loss(nn.Module):
    """
    YOLOv1 loss function implementation.
    Penalizes localization, confidence, and classification errors.
    """
    def __init__(self, S=7, B=2, C=20, lambda_coord=5.0, lambda_noobj=0.5):
        super().__init__()
        self.S = S
        self.B = B
        self.C = C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj

    def forward(self, predictions, targets):
        """
        predictions: (batch, S, S, B*5 + C)
        targets:     (batch, S, S, 5 + C) — only 1 box per cell in target
        """
        batch_size = predictions.shape[0]
        pred = predictions.reshape(batch_size, self.S, self.S, self.B * 5 + self.C)
        # Extract target components
        target_boxes = targets[..., :4]   # (batch, S, S, 4)
        target_conf = targets[..., 4:5]   # (batch, S, S, 1) — 1 if object, 0 otherwise
        target_class = targets[..., 5:]   # (batch, S, S, C)
        # Object mask: cells that contain an object
        obj_mask = target_conf.squeeze(-1)  # (batch, S, S)
        # Extract predictions for box 1 and box 2
        pred_box1 = pred[..., :5]    # x, y, w, h, conf for box 1
        pred_box2 = pred[..., 5:10]  # x, y, w, h, conf for box 2
        pred_class = pred[..., 10:]  # class predictions

        # ============ Localization Loss ============
        # Use box 1 for simplicity (full impl selects best IoU box)
        xy_loss = self.lambda_coord * torch.sum(
            obj_mask.unsqueeze(-1) * (pred_box1[..., :2] - target_boxes[..., :2]) ** 2
        )
        # sqrt(w) and sqrt(h) for scale sensitivity
        wh_pred = torch.sign(pred_box1[..., 2:4]) * torch.sqrt(
            torch.abs(pred_box1[..., 2:4]) + 1e-6
        )
        wh_target = torch.sqrt(target_boxes[..., 2:4] + 1e-6)
        wh_loss = self.lambda_coord * torch.sum(
            obj_mask.unsqueeze(-1) * (wh_pred - wh_target) ** 2
        )

        # ============ Confidence Loss ============
        # Object cells: confidence should match IoU
        conf_obj_loss = torch.sum(
            obj_mask * (pred_box1[..., 4] - target_conf.squeeze(-1)) ** 2
        )
        # No-object cells: confidence should be 0
        noobj_mask = 1.0 - obj_mask
        conf_noobj_loss = self.lambda_noobj * torch.sum(
            noobj_mask * (pred_box1[..., 4]) ** 2
        )
        # Also penalize box 2 confidence in no-object cells
        conf_noobj_loss += self.lambda_noobj * torch.sum(
            noobj_mask * (pred_box2[..., 4]) ** 2
        )

        # ============ Classification Loss ============
        class_loss = torch.sum(
            obj_mask.unsqueeze(-1) * (pred_class - target_class) ** 2
        )

        # Total loss
        total_loss = (xy_loss + wh_loss + conf_obj_loss +
                      conf_noobj_loss + class_loss) / batch_size
        return total_loss

# Demo: create random predictions and targets
loss_fn = YOLOv1Loss(S=7, B=2, C=20)
pred = torch.randn(4, 7, 7, 30)    # Batch of 4
target = torch.zeros(4, 7, 7, 25)  # 5 + 20 = 25
# Place an object at cell (3, 3) in first image
target[0, 3, 3, :4] = torch.tensor([0.5, 0.5, 0.3, 0.4])
target[0, 3, 3, 4] = 1.0       # Object present
target[0, 3, 3, 5 + 14] = 1.0  # Class 14 (person)
loss = loss_fn(pred, target)
print(f"YOLO Loss: {loss.item():.4f}")
print("Loss components penalize localization, confidence, and classification jointly")
The loss function is the heart of YOLO training. The lambda_coord = 5 multiplier ensures the network prioritizes getting bounding box positions right, while lambda_noobj = 0.5 prevents the overwhelming number of background cells from dominating the gradient signal.
Building YOLOv1 Backbone
The original YOLOv1 uses a custom backbone called Darknet, inspired by GoogLeNet's inception modules but simplified into plain convolutions. It consists of 24 convolutional layers followed by 2 fully connected layers. The architecture progressively reduces spatial resolution while increasing channel depth, ultimately producing the 7×7×30 output tensor.
PyTorch Implementation
Below is a faithful PyTorch implementation of the YOLOv1 architecture. We define the backbone as a sequence of convolutional blocks and attach a detection head that outputs the final predictions:
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv + BatchNorm + LeakyReLU block used throughout Darknet."""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size,
                      stride=stride, padding=padding, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class YOLOv1(nn.Module):
    """
    YOLOv1 Architecture (simplified but faithful).
    Input:  448x448x3
    Output: S x S x (B*5 + C) = 7x7x30
    """
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        # Darknet backbone (24 conv layers)
        self.backbone = nn.Sequential(
            # Block 1
            ConvBlock(3, 64, 7, stride=2, padding=3),  # 448 -> 224
            nn.MaxPool2d(2, stride=2),                 # 224 -> 112
            # Block 2
            ConvBlock(64, 192, 3, padding=1),          # 112 -> 112
            nn.MaxPool2d(2, stride=2),                 # 112 -> 56
            # Block 3
            ConvBlock(192, 128, 1),
            ConvBlock(128, 256, 3, padding=1),
            ConvBlock(256, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            nn.MaxPool2d(2, stride=2),                 # 56 -> 28
            # Block 4 (repeated 1x1 → 3x3 pattern)
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 256, 1),
            ConvBlock(256, 512, 3, padding=1),
            ConvBlock(512, 512, 1),
            ConvBlock(512, 1024, 3, padding=1),
            nn.MaxPool2d(2, stride=2),                 # 28 -> 14
            # Block 5
            ConvBlock(1024, 512, 1),
            ConvBlock(512, 1024, 3, padding=1),
            ConvBlock(1024, 512, 1),
            ConvBlock(512, 1024, 3, padding=1),
            ConvBlock(1024, 1024, 3, padding=1),
            ConvBlock(1024, 1024, 3, stride=2, padding=1),  # 14 -> 7
            # Block 6
            ConvBlock(1024, 1024, 3, padding=1),
            ConvBlock(1024, 1024, 3, padding=1),       # 7x7x1024
        )
        # Detection head (2 FC layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 4096),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        features = self.backbone(x)
        output = self.head(features)
        return output.view(-1, self.S, self.S, self.B * 5 + self.C)

# Create model and verify output shape
model = YOLOv1(S=7, B=2, C=20)
x = torch.randn(2, 3, 448, 448)
output = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Expected: (2, 7, 7, 30)")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
This model has roughly 270 million parameters — significantly larger than modern efficient detectors. The fully connected layers in particular are parameter-heavy, which is why later YOLO versions replaced them with convolutional heads.
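You can verify where those parameters live with quick arithmetic — the first fully connected layer alone accounts for about 205 million of them:

```python
S, B, C = 7, 2, 20

# Parameter counts (weights + biases) of the two FC layers in the head
fc1 = 1024 * S * S * 4096 + 4096                            # 7*7*1024 -> 4096
fc2 = 4096 * (S * S * (B * 5 + C)) + S * S * (B * 5 + C)    # 4096 -> 7*7*30 = 1470

print(f"FC1: {fc1:,} parameters")  # ~205M — the bulk of the model
print(f"FC2: {fc2:,} parameters")  # ~6M
```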
Non-Maximum Suppression (NMS)
After YOLO produces predictions, multiple grid cells often detect the same object. A large dog might span several grid cells, and each cell might predict a box for it. Non-Maximum Suppression (NMS) is the post-processing step that removes duplicate detections, keeping only the best box for each object.
NMS from Scratch
Let's implement NMS from scratch, then compare with PyTorch's built-in implementation:
import torch
from torchvision.ops import nms as torchvision_nms

def nms_from_scratch(boxes, scores, iou_threshold=0.5):
    """
    Non-Maximum Suppression implemented from scratch.
    Args:
        boxes: Tensor (N, 4) in corner format [x1, y1, x2, y2]
        scores: Tensor (N,) confidence scores
        iou_threshold: IoU threshold for suppression
    Returns:
        keep: indices of boxes to keep
    """
    # Sort by confidence (descending)
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        # Pick the best box
        idx = order[0].item()
        keep.append(idx)
        if order.numel() == 1:
            break
        # Compute IoU of this box with all remaining boxes
        remaining = order[1:]
        best_box = boxes[idx].unsqueeze(0)  # (1, 4)
        other_boxes = boxes[remaining]      # (M, 4)
        # IoU calculation
        x1 = torch.max(best_box[:, 0], other_boxes[:, 0])
        y1 = torch.max(best_box[:, 1], other_boxes[:, 1])
        x2 = torch.min(best_box[:, 2], other_boxes[:, 2])
        y2 = torch.min(best_box[:, 3], other_boxes[:, 3])
        intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_best = (best_box[:, 2] - best_box[:, 0]) * (best_box[:, 3] - best_box[:, 1])
        area_other = (other_boxes[:, 2] - other_boxes[:, 0]) * (other_boxes[:, 3] - other_boxes[:, 1])
        union = area_best + area_other - intersection
        iou = intersection / (union + 1e-6)
        # Keep boxes with low IoU (different objects)
        mask = iou.squeeze(0) < iou_threshold
        order = remaining[mask]
    return torch.tensor(keep, dtype=torch.long)

# Demo: Multiple overlapping detections of the same object
boxes = torch.tensor([
    [100, 100, 300, 300],  # High confidence box
    [110, 105, 295, 310],  # Overlapping (same object)
    [105, 98, 305, 295],   # Overlapping (same object)
    [400, 200, 550, 400],  # Different object entirely
    [410, 205, 545, 395],  # Overlapping with box above
], dtype=torch.float32)
scores = torch.tensor([0.95, 0.88, 0.82, 0.90, 0.75])

# Our implementation
keep_ours = nms_from_scratch(boxes, scores, iou_threshold=0.5)
print(f"Our NMS keeps indices: {keep_ours.tolist()}")
print(f"Kept boxes: {len(keep_ours)} out of {len(boxes)}")

# PyTorch's implementation (should match)
keep_torch = torchvision_nms(boxes, scores, iou_threshold=0.5)
print(f"Torchvision NMS keeps: {keep_torch.tolist()}")
print(f"Results match: {keep_ours.tolist() == keep_torch.tolist()}")
NMS reduces our 5 raw predictions to just 2 final detections — one for each distinct object. The overlapping boxes for the same object are suppressed, keeping only the highest-confidence version.
Modern YOLO: YOLOv5/v8 with Ultralytics
While understanding YOLOv1's mechanics is educational, real-world projects use modern implementations like YOLOv8 from Ultralytics. YOLOv8 incorporates years of improvements: anchor-free detection heads, CSP (Cross-Stage Partial) backbone, path aggregation networks, and state-of-the-art training recipes — all wrapped in a clean Python API.
YOLOv8 Model Variants
YOLOv8 comes in five sizes (nano, small, medium, large, xlarge) trading speed for accuracy. YOLOv8n runs at 100+ FPS on edge GPUs while YOLOv8x achieves near state-of-the-art mAP on COCO. All models share the same API — just change the size suffix.
Inference & Fine-tuning
The Ultralytics library makes running YOLOv8 incredibly simple — just a few lines for inference on images. Install with pip install ultralytics:
from ultralytics import YOLO

# Load a pretrained YOLOv8 model (downloads automatically)
model = YOLO('yolov8n.pt')  # nano model (fastest)

# Run inference on an image
results = model('https://ultralytics.com/images/bus.jpg')

# Process results
for result in results:
    boxes = result.boxes  # Bounding box outputs
    print(f"Detected {len(boxes)} objects:")
    print(f"  Bounding boxes (xyxy): {boxes.xyxy.shape}")
    print(f"  Confidence scores: {boxes.conf}")
    print(f"  Class indices: {boxes.cls}")
    # Get class names
    for i, (box, conf, cls) in enumerate(zip(boxes.xyxy, boxes.conf, boxes.cls)):
        class_name = model.names[int(cls)]
        print(f"  [{i}] {class_name}: {conf:.2f} at {box.tolist()}")
For custom datasets, YOLOv8 supports fine-tuning with just a few more lines. You provide a YAML file describing your dataset and the library handles data loading, augmentation, and training:
from ultralytics import YOLO

# Load pretrained model as starting point
model = YOLO('yolov8n.pt')

# Fine-tune on custom dataset.
# Requires a data.yaml file pointing to your train/val images and labels.
# Example data.yaml:
#   train: /path/to/train/images
#   val: /path/to/val/images
#   nc: 3                          # number of classes
#   names: ['cat', 'dog', 'bird']

# Train for 50 epochs
results = model.train(
    data='data.yaml',       # Path to dataset config
    epochs=50,              # Number of training epochs
    imgsz=640,              # Input image size
    batch=16,               # Batch size
    lr0=0.01,               # Initial learning rate
    device='0',             # GPU device (or 'cpu')
    project='runs/detect',  # Save directory
    name='custom_yolov8',   # Experiment name
)

# Evaluate on validation set
metrics = model.val()
print(f"mAP@0.5: {metrics.box.map50:.4f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.4f}")

# Export model for deployment
model.export(format='onnx')  # ONNX format for production
print("Model exported to ONNX!")
YOLOv8 handles all the complexity internally — data augmentation (mosaic, mixup, HSV shifts), learning rate scheduling (cosine annealing with warmup), and multi-scale training. You just configure high-level parameters.
Anchor Boxes & FPN
YOLOv1's fixed grid has a major limitation: each cell only predicts 2 boxes with unconstrained shapes. This makes it hard to learn very wide or very tall objects (like a giraffe vs a school bus). Anchor boxes (introduced in YOLOv2) provide shape priors — predefined aspect ratios that the network refines rather than predicting from scratch.
Instead of predicting raw $(x, y, w, h)$, the network predicts offsets from predefined anchor shapes. If an anchor has width $a_w$ and height $a_h$, the predicted box is:
$$w = a_w \cdot e^{t_w}, \quad h = a_h \cdot e^{t_h}$$
Where $t_w, t_h$ are the network's learned adjustments. This makes training more stable because the network starts from reasonable shapes.
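The decode step implied by these equations can be sketched in a few lines. This follows the YOLOv2/v3 convention, assuming the network emits raw values $(t_x, t_y, t_w, t_h)$: a sigmoid keeps the predicted center inside its grid cell, and the exponential rescales the anchor:

```python
import torch

def decode_box(t, anchor_wh, cell_xy, stride):
    """Decode raw network outputs (t_x, t_y, t_w, t_h) into a box in pixels."""
    tx, ty, tw, th = t
    aw, ah = anchor_wh
    cx, cy = cell_xy
    x = (cx + torch.sigmoid(tx)) * stride  # center x stays inside cell (cx, cx+1)
    y = (cy + torch.sigmoid(ty)) * stride  # center y likewise
    w = aw * torch.exp(tw)                 # w = a_w * e^{t_w}
    h = ah * torch.exp(th)                 # h = a_h * e^{t_h}
    return x, y, w, h

# Example: one anchor from the 13x13 scale (stride 32 at 416x416 input)
t = torch.tensor([0.2, -0.1, 0.5, -0.3])
x, y, w, h = decode_box(t, anchor_wh=(116.0, 90.0), cell_xy=(6, 6), stride=32.0)
print(f"Decoded box: center=({x:.1f}, {y:.1f}), size=({w:.1f}, {h:.1f})")
```

Because the network only nudges a reasonable prior, a zero output already yields a plausible box, which is exactly why training stabilizes.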
import torch

def generate_anchors(feature_sizes, anchor_configs, image_size=416):
    """
    Generate anchor boxes at multiple scales (like YOLOv3).
    Args:
        feature_sizes: list of feature map sizes [52, 26, 13]
        anchor_configs: anchor (w, h) pairs for each scale
        image_size: input image resolution
    Returns:
        all_anchors: dict mapping scale to anchor boxes
    """
    all_anchors = {}
    for scale_idx, (feat_size, anchors) in enumerate(zip(feature_sizes, anchor_configs)):
        stride = image_size / feat_size
        grid_anchors = []
        for gy in range(feat_size):
            for gx in range(feat_size):
                cx = (gx + 0.5) * stride
                cy = (gy + 0.5) * stride
                for aw, ah in anchors:
                    # Anchor box in absolute pixel coordinates
                    x1 = cx - aw / 2
                    y1 = cy - ah / 2
                    x2 = cx + aw / 2
                    y2 = cy + ah / 2
                    grid_anchors.append([x1, y1, x2, y2])
        all_anchors[f'scale_{feat_size}x{feat_size}'] = torch.tensor(grid_anchors)
    return all_anchors

# YOLOv3-style anchor configuration (3 scales, 3 anchors each)
feature_sizes = [52, 26, 13]  # Small, Medium, Large objects
anchor_configs = [
    [(10, 13), (16, 30), (33, 23)],       # Small anchors (52x52)
    [(30, 61), (62, 45), (59, 119)],      # Medium anchors (26x26)
    [(116, 90), (156, 198), (373, 326)],  # Large anchors (13x13)
]
anchors = generate_anchors(feature_sizes, anchor_configs)
for scale, boxes in anchors.items():
    print(f"{scale}: {boxes.shape[0]} anchor boxes")
# Total anchors = 52*52*3 + 26*26*3 + 13*13*3 = 10647
total = sum(boxes.shape[0] for boxes in anchors.values())
print(f"Total anchor boxes: {total}")
print(f"Each anchor predicts: 4 coords + 1 objectness + 80 classes = 85 values")
Feature Pyramid Networks
Different-sized objects are best detected at different feature map resolutions. Small objects (like a distant car) need high-resolution features, while large objects (like a close-up face) are captured well by low-resolution, high-level features. Feature Pyramid Networks (FPN) combine both by creating a top-down pathway that merges features at multiple scales:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """
    Simplified Feature Pyramid Network for multi-scale detection.
    Merges features from different backbone stages.
    """
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        # Lateral connections (1x1 conv to match channels)
        self.lateral_convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_channels, 1)
            for in_ch in in_channels_list
        ])
        # Smoothing convolutions (3x3 after merge)
        self.smooth_convs = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels_list
        ])

    def forward(self, features):
        """
        features: list of feature maps from backbone [C3, C4, C5]
                  from high-res to low-res
        """
        # Apply lateral connections
        laterals = [conv(f) for conv, f in zip(self.lateral_convs, features)]
        # Top-down pathway: upsample and add
        for i in range(len(laterals) - 2, -1, -1):
            upsampled = F.interpolate(
                laterals[i + 1], size=laterals[i].shape[2:], mode='nearest'
            )
            laterals[i] = laterals[i] + upsampled
        # Apply smoothing
        outputs = [conv(lat) for conv, lat in zip(self.smooth_convs, laterals)]
        return outputs

# Demo: Simulate backbone features at 3 scales
batch_size = 2
# C3: 52x52, C4: 26x26, C5: 13x13
features = [
    torch.randn(batch_size, 256, 52, 52),   # High-res (small objects)
    torch.randn(batch_size, 512, 26, 26),   # Medium-res
    torch.randn(batch_size, 1024, 13, 13),  # Low-res (large objects)
]
fpn = SimpleFPN(in_channels_list=[256, 512, 1024], out_channels=256)
pyramid_features = fpn(features)
print("Feature Pyramid Network outputs:")
for i, feat in enumerate(pyramid_features):
    print(f"  P{i+3}: {feat.shape} — detects {'small' if i==0 else 'medium' if i==1 else 'large'} objects")
YOLOv3+ uses exactly this pattern: three detection heads at scales 13×13, 26×26, and 52×52. Large objects are detected at the 13×13 scale (large receptive field), while small objects are caught at 52×52 (high spatial resolution). This multi-scale approach is why modern YOLO handles objects of vastly different sizes.
Training a Custom Object Detector
Training a real object detector requires properly formatted data, effective augmentation (that transforms both images AND bounding boxes), and evaluation using the standard mean Average Precision (mAP) metric.
YOLO's label format is simple: each image has a matching .txt file, and each line describes one object as class_id x_center y_center width height, with all coordinates normalized to [0, 1]. This is the format expected by Ultralytics and most YOLO implementations.
Here's how to build a custom dataset class for YOLO training that loads images and their bounding box annotations:
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from PIL import Image
import os
class YOLODataset(Dataset):
"""
Custom dataset for YOLO-format annotations.
Each image has a .txt label file with format:
class_id x_center y_center width height (normalized)
"""
def __init__(self, img_dir, label_dir, img_size=416, S=7, B=2, C=20):
self.img_dir = img_dir
self.label_dir = label_dir
self.img_size = img_size
self.S, self.B, self.C = S, B, C
# List all image files
self.images = [f for f in os.listdir(img_dir)
if f.endswith(('.jpg', '.png', '.jpeg'))]
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
# Load image
img_path = os.path.join(self.img_dir, self.images[idx])
image = Image.open(img_path).convert('RGB')
image = image.resize((self.img_size, self.img_size))
image = torch.tensor(np.array(image), dtype=torch.float32)
image = image.permute(2, 0, 1) / 255.0 # (C, H, W), normalized
# Load labels
label_file = os.path.splitext(self.images[idx])[0] + '.txt'  # handles .jpg/.jpeg/.png
label_path = os.path.join(self.label_dir, label_file)
boxes = []
if os.path.exists(label_path):
with open(label_path, 'r') as f:
for line in f.readlines():
parts = line.strip().split()
class_id = int(parts[0])
x, y, w, h = map(float, parts[1:5])
boxes.append([class_id, x, y, w, h])
# Convert to YOLO target tensor
target = self._encode_target(boxes)
return image, target
def _encode_target(self, boxes):
"""Encode bounding boxes into S×S×(5+C) target tensor."""
target = torch.zeros(self.S, self.S, 5 + self.C)
for box in boxes:
class_id, x, y, w, h = box
grid_x = int(x * self.S)
grid_y = int(y * self.S)
grid_x = min(grid_x, self.S - 1)
grid_y = min(grid_y, self.S - 1)
# Only assign if cell is empty (first object wins)
if target[grid_y, grid_x, 4] == 0:
target[grid_y, grid_x, 0] = x * self.S - grid_x
target[grid_y, grid_x, 1] = y * self.S - grid_y
target[grid_y, grid_x, 2] = w
target[grid_y, grid_x, 3] = h
target[grid_y, grid_x, 4] = 1.0
target[grid_y, grid_x, 5 + class_id] = 1.0
return target
# Example usage (with synthetic data for demonstration)
print("YOLODataset expects:")
print(" img_dir/ → image files (.jpg, .png)")
print(" label_dir/ → label files (.txt, same name as image)")
print(" Label format: 'class_id x_center y_center width height'")
print("\nTarget tensor shape: (S, S, 5 + C) = (7, 7, 25)")
print("Each cell encodes: [x_offset, y_offset, w, h, objectness, 20 class probs]")
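For sanity-checking a dataset, it helps to invert the encoding: walk the S×S grid and recover normalized boxes from any cell whose objectness flag is set. A minimal sketch (`decode_target` is my own helper mirroring the `_encode_target` logic above):

```python
import torch

def decode_target(target, S=7):
    """Recover [class_id, x, y, w, h] boxes from an S x S x (5+C) target tensor."""
    boxes = []
    for gy in range(S):
        for gx in range(S):
            cell = target[gy, gx]
            if cell[4] == 1.0:  # objectness flag set -> this cell owns an object
                x = (gx + cell[0].item()) / S  # undo the cell-offset encoding
                y = (gy + cell[1].item()) / S
                class_id = int(cell[5:].argmax())
                boxes.append([class_id, x, y, cell[2].item(), cell[3].item()])
    return boxes

# Round-trip check: hand-encode one box, then decode it
target = torch.zeros(7, 7, 25)
x, y, w, h, cls = 0.5, 0.5, 0.2, 0.3, 3
gx, gy = int(x * 7), int(y * 7)
target[gy, gx, 0] = x * 7 - gx
target[gy, gx, 1] = y * 7 - gy
target[gy, gx, 2:5] = torch.tensor([w, h, 1.0])
target[gy, gx, 5 + cls] = 1.0
print(decode_target(target))  # [[3, 0.5, 0.5, ~0.2, ~0.3]]
```

A round trip through encode then decode should reproduce the original boxes up to float32 precision; if it doesn't, the loss will be training against corrupted targets.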
mAP Evaluation
The standard metric for object detection is mean Average Precision (mAP). It measures both localization quality and classification accuracy across all IoU thresholds. Here's a simplified mAP calculator:
import torch
import numpy as np
def calculate_ap(precisions, recalls):
"""Calculate Average Precision using 11-point interpolation."""
ap = 0.0
for t in np.linspace(0, 1, 11):  # 11 recall thresholds; linspace avoids arange float drift
# Maximum precision at recall >= t
prec_at_recall = precisions[recalls >= t]
if len(prec_at_recall) > 0:
ap += prec_at_recall.max()
return ap / 11.0
def compute_map(predictions, ground_truths, iou_threshold=0.5, num_classes=20):
"""
Compute mean Average Precision.
Args:
predictions: list of dicts with 'boxes', 'scores', 'labels', 'image_id'
ground_truths: list of dicts with 'boxes', 'labels', 'image_id'
iou_threshold: IoU threshold for a "correct" detection
num_classes: number of object classes
"""
aps = []
for cls in range(num_classes):
# Collect all predictions and GTs for this class
cls_preds = []
cls_gts = {}
n_gt = 0
for gt in ground_truths:
mask = gt['labels'] == cls
img_id = gt['image_id']
cls_gts[img_id] = gt['boxes'][mask]
n_gt += mask.sum().item()
if n_gt == 0:
continue
for pred in predictions:
mask = pred['labels'] == cls
for box, score in zip(pred['boxes'][mask], pred['scores'][mask]):
cls_preds.append({
'box': box, 'score': score.item(),
'image_id': pred['image_id']
})
# Sort by confidence
cls_preds.sort(key=lambda x: x['score'], reverse=True)
# Compute precision-recall curve
tp = np.zeros(len(cls_preds))
fp = np.zeros(len(cls_preds))
matched = {img_id: set() for img_id in cls_gts}
for i, pred in enumerate(cls_preds):
img_id = pred['image_id']
gt_boxes = cls_gts.get(img_id, torch.zeros(0, 4))
if len(gt_boxes) == 0:
fp[i] = 1
continue
# Find best matching GT box
pred_box = pred['box'].unsqueeze(0)
ious = compute_single_iou(pred_box, gt_boxes)
best_iou, best_idx = ious.max(dim=0)  # ious is 1-D: one IoU per GT box
if best_iou >= iou_threshold and best_idx.item() not in matched[img_id]:
tp[i] = 1
matched[img_id].add(best_idx.item())
else:
fp[i] = 1
# Cumulative precision and recall
cum_tp = np.cumsum(tp)
cum_fp = np.cumsum(fp)
precisions = cum_tp / (cum_tp + cum_fp + 1e-6)
recalls = cum_tp / (n_gt + 1e-6)
ap = calculate_ap(precisions, recalls)
aps.append(ap)
mAP = np.mean(aps) if aps else 0.0
return mAP
def compute_single_iou(box, boxes):
"""Compute IoU of one box against many boxes."""
x1 = torch.max(box[:, 0], boxes[:, 0])
y1 = torch.max(box[:, 1], boxes[:, 1])
x2 = torch.min(box[:, 2], boxes[:, 2])
y2 = torch.min(box[:, 3], boxes[:, 3])
inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
area1 = (box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1])
area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
union = area1 + area2 - inter
return inter / (union + 1e-6)
# Demo with synthetic detections
print("mAP Evaluation Summary:")
print(" mAP@0.5 → IoU threshold of 0.5 (standard)")
print(" mAP@0.75 → IoU threshold of 0.75 (strict)")
print(" mAP@[.5:.95] → Average across thresholds 0.5 to 0.95 (COCO metric)")
print("\nHigher mAP = better detector. COCO leaderboard uses mAP@[.5:.95]")
The mAP metric answers: across all classes and all confidence thresholds, what fraction of detections are correct? A detector with mAP@0.5 of 0.45 has an average area of 0.45 under its per-class precision-recall curves at IoU ≥ 0.5; loosely, 45% precision averaged over all recall levels. Modern YOLOv8x achieves ~53.9 mAP@[.5:.95] on COCO, state-of-the-art among real-time detectors.
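To make the 11-point interpolation concrete, here's a tiny self-contained worked example with toy numbers of my own (a standalone re-implementation of the same scheme, using exact tenths so the threshold comparisons are float-safe):

```python
import numpy as np

def ap_11pt(precisions, recalls):
    """Average Precision via 11-point interpolation."""
    ap = 0.0
    for t in np.arange(11) / 10:  # thresholds 0.0, 0.1, ..., 1.0
        prec_at_recall = precisions[recalls >= t]
        if len(prec_at_recall) > 0:
            ap += prec_at_recall.max()
    return ap / 11.0

recalls = np.arange(1, 11) / 10  # 0.1 ... 1.0

# Perfect detector: precision 1.0 at every recall level -> AP = 1.0
print(ap_11pt(np.ones(10), recalls))  # 1.0

# Precision collapses to 0.5 beyond recall 0.5:
# 6 thresholds see precision 1.0, 5 see 0.5 -> AP = 8.5/11 ~ 0.773
precisions = np.where(recalls <= 0.5, 1.0, 0.5)
print(round(ap_11pt(precisions, recalls), 3))  # 0.773
```

The second case shows why AP rewards detectors that stay precise as recall climbs: losing half your precision on the top half of the recall range costs almost a quarter of the score.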
Inference & Visualization
A detection pipeline isn't complete until you can visualize the results. Drawing bounding boxes with class labels and confidence scores on images is essential for debugging and demonstrating your model. Here's a complete visualization function:
import torch
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
def visualize_detections(image, boxes, scores, labels, class_names,
conf_threshold=0.5, figsize=(12, 8)):
"""
Draw bounding boxes with labels on an image.
Args:
image: numpy array (H, W, 3) in [0, 255] or [0, 1]
boxes: tensor (N, 4) in [x1, y1, x2, y2] pixel coords
scores: tensor (N,) confidence scores
labels: tensor (N,) class indices
class_names: list of class name strings
conf_threshold: minimum confidence to display
"""
# Normalize image to [0, 1] for matplotlib
if image.max() > 1.0:
image = image / 255.0
fig, ax = plt.subplots(1, figsize=figsize)
ax.imshow(image)
# Color palette for different classes
colors = plt.cm.Set3(np.linspace(0, 1, len(class_names)))
# Filter by confidence
mask = scores >= conf_threshold
boxes = boxes[mask]
scores = scores[mask]
labels = labels[mask]
for box, score, label in zip(boxes, scores, labels):
x1, y1, x2, y2 = box.tolist()
w, h = x2 - x1, y2 - y1
cls_idx = int(label)
color = colors[cls_idx % len(colors)]
# Draw bounding box
rect = patches.Rectangle(
(x1, y1), w, h,
linewidth=2, edgecolor=color, facecolor='none'
)
ax.add_patch(rect)
# Draw label background
label_text = f"{class_names[cls_idx]}: {score:.2f}"
ax.text(
x1, y1 - 5, label_text,
fontsize=10, fontweight='bold',
color='white',
bbox=dict(boxstyle='round,pad=0.3',
facecolor=color, alpha=0.8)
)
ax.axis('off')
ax.set_title(f"Detections: {len(boxes)} objects (conf > {conf_threshold})")
plt.tight_layout()
plt.show()
# Demo with synthetic detections
np.random.seed(42)
H, W = 480, 640
image = np.random.randint(100, 200, (H, W, 3), dtype=np.uint8)
# Simulate detections
boxes = torch.tensor([
[50, 80, 200, 300], # Person
[300, 150, 500, 400], # Car
[420, 50, 580, 180], # Dog
[100, 350, 250, 470], # Chair
])
scores = torch.tensor([0.92, 0.87, 0.78, 0.45])
labels = torch.tensor([0, 2, 16, 56])
class_names = ['person'] + ['bicycle', 'car'] + [''] * 13 + ['dog'] + [''] * 39 + ['chair']  # sparse COCO name list: only indices 0, 1, 2, 16, 56 filled
print("Visualization function ready!")
print(f" {len(boxes)} raw detections")
print(f" After threshold (0.5): {(scores >= 0.5).sum()} shown")
# Uncomment below to actually render (requires display):
# visualize_detections(image, boxes, scores, labels, class_names, conf_threshold=0.5)
Real-Time Detection Pipeline
For real-time detection from a webcam or video stream, you need an efficient processing loop that captures frames, runs inference, and displays results at interactive speeds. Here's the complete pipeline using OpenCV and Ultralytics:
import torch
import numpy as np
import time
# Real-time detection pipeline (conceptual — requires webcam for actual use)
class RealtimeDetector:
"""
Real-time object detection pipeline.
Captures frames, runs YOLO inference, draws results.
"""
def __init__(self, model_name='yolov8n.pt', conf_threshold=0.5):
self.conf_threshold = conf_threshold
self.fps_history = []
# In production: self.model = YOLO(model_name)
def process_frame(self, frame, model_output):
"""
Process a single frame with detections.
Returns annotated frame with boxes drawn.
"""
start = time.time()
# Simulate detection results (in production, model(frame))
annotated = frame.copy()
# Draw detections
for box, score, label in model_output:
if score < self.conf_threshold:
continue
x1, y1, x2, y2 = map(int, box)
# In production: cv2.rectangle, cv2.putText
annotated[y1:y1+3, x1:x2] = [0, 255, 0] # Top border
annotated[y2-3:y2, x1:x2] = [0, 255, 0] # Bottom border
# Calculate FPS
elapsed = time.time() - start
fps = 1.0 / (elapsed + 1e-6)
self.fps_history.append(fps)
return annotated, fps
def run_benchmark(self, num_frames=100, frame_size=(640, 480)):
"""Benchmark pipeline overhead without an actual camera or model."""
print(f"Benchmarking {num_frames} frames at {frame_size}...")
for i in range(num_frames):
# Simulate frame capture
frame = np.random.randint(0, 255, (*frame_size, 3), dtype=np.uint8)
# Simulate model inference (just timing the overhead)
start = time.time()
_ = torch.randn(1, 3, 640, 640) # Simulate tensor creation
elapsed = time.time() - start
self.fps_history.append(1.0 / (elapsed + 1e-6))
avg_fps = np.mean(self.fps_history[-num_frames:])
print(f"Average processing speed: {avg_fps:.0f} FPS")
print(f"Latency per frame: {1000/avg_fps:.1f} ms")
return avg_fps
# Run benchmark
detector = RealtimeDetector(conf_threshold=0.5)
fps = detector.run_benchmark(num_frames=50)
print(f"\nReal-time detection pipeline:")
print(f" Model: YOLOv8n (nano — optimized for speed)")
print(f" Input: 640×640 (standard YOLO input size)")
print(f" Expected FPS with GPU: 80-120 FPS")
print(f" Expected FPS with CPU: 15-30 FPS")
In production, you'd use OpenCV's cv2.VideoCapture for camera input and cv2.imshow for display. The Ultralytics library also provides model.predict(source=0, show=True) which handles the entire webcam pipeline in one line.
Here's the complete Ultralytics one-liner for live webcam detection:
from ultralytics import YOLO
# One-line real-time webcam detection
model = YOLO('yolov8n.pt')
# Stream from webcam (source=0) with live display
# This opens a window showing detections in real-time
results = model.predict(
source=0, # Webcam index (0 = default camera)
show=True, # Display results live
conf=0.5, # Confidence threshold
stream=True, # Process as a stream (memory efficient)
verbose=False, # Suppress per-frame logging
)
# Process each frame's results (if needed)
for result in results:
# Access detections
boxes = result.boxes.xyxy # Bounding boxes
confs = result.boxes.conf # Confidence scores
classes = result.boxes.cls # Class indices
# Count detections per frame
n_objects = len(boxes)
if n_objects > 0:
print(f"Frame: {n_objects} objects detected")
# Break on 'q' key (handled by show=True internally)
That's the beauty of modern YOLO — decades of research compressed into a library that gives you real-time object detection with minimal code. But understanding the internals (grid cells, IoU, NMS, FPN) lets you debug problems, customize architectures, and push the boundaries of what's possible.
Data Augmentation for Detection
A crucial difference between augmenting for classification vs detection: when you transform the image (flip, rotate, crop), you MUST apply the same geometric transformation to the bounding boxes. Libraries like Albumentations handle this bookkeeping automatically; the code below builds the core logic from scratch so you can see exactly what happens to the boxes:
import torch
import numpy as np
# Demonstrate bounding box-aware augmentation logic
def horizontal_flip_with_boxes(image, boxes):
"""
Flip image horizontally and adjust bounding boxes.
Args:
image: numpy array (H, W, 3)
boxes: numpy array (N, 4) in [x1, y1, x2, y2] normalized [0,1]
Returns:
flipped_image, flipped_boxes
"""
# Flip image
flipped_image = np.flip(image, axis=1).copy()
# Flip boxes: x_new = 1 - x_old (mirror x-coordinates)
flipped_boxes = boxes.copy()
flipped_boxes[:, 0] = 1.0 - boxes[:, 2] # new x1 = 1 - old x2
flipped_boxes[:, 2] = 1.0 - boxes[:, 0] # new x2 = 1 - old x1
return flipped_image, flipped_boxes
def random_crop_with_boxes(image, boxes, min_scale=0.5):
"""
Random crop that ensures at least some boxes remain valid.
Args:
image: numpy array (H, W, 3)
boxes: numpy array (N, 4) in [x1, y1, x2, y2] normalized
min_scale: minimum crop size relative to original
"""
H, W = image.shape[:2]
# Random crop parameters
scale = np.random.uniform(min_scale, 1.0)
crop_h, crop_w = int(H * scale), int(W * scale)
top = np.random.randint(0, H - crop_h + 1)
left = np.random.randint(0, W - crop_w + 1)
# Crop image
cropped = image[top:top+crop_h, left:left+crop_w]
# Adjust boxes to crop coordinates
crop_x1, crop_y1 = left / W, top / H
crop_x2, crop_y2 = (left + crop_w) / W, (top + crop_h) / H
adjusted_boxes = []
for box in boxes:
# Clip box to crop region
new_x1 = max(0, (box[0] - crop_x1) / (crop_x2 - crop_x1))
new_y1 = max(0, (box[1] - crop_y1) / (crop_y2 - crop_y1))
new_x2 = min(1, (box[2] - crop_x1) / (crop_x2 - crop_x1))
new_y2 = min(1, (box[3] - crop_y1) / (crop_y2 - crop_y1))
# Keep box only if it has valid area
if new_x2 > new_x1 + 0.01 and new_y2 > new_y1 + 0.01:
adjusted_boxes.append([new_x1, new_y1, new_x2, new_y2])
return cropped, np.array(adjusted_boxes) if adjusted_boxes else np.zeros((0, 4))
# Demo
image = np.random.randint(0, 255, (416, 416, 3), dtype=np.uint8)
boxes = np.array([
[0.1, 0.2, 0.5, 0.7], # Object on the left
[0.6, 0.3, 0.9, 0.8], # Object on the right
])
# Horizontal flip
flipped_img, flipped_boxes = horizontal_flip_with_boxes(image, boxes)
print("Original boxes:", boxes)
print("After H-flip: ", flipped_boxes)
print(" Left object moved to right, right object moved to left ✓")
# Random crop
cropped_img, cropped_boxes = random_crop_with_boxes(image, boxes)
print(f"\nCropped image: {cropped_img.shape}")
print(f"Remaining boxes after crop: {len(cropped_boxes)}")
The Ultralytics library applies these augmentations (plus mosaic and mixup) automatically during training. Mosaic augmentation — stitching 4 images into one — is particularly effective for detection because it increases the variety of object scales and contexts the model sees during training.
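Mosaic can be sketched in a few lines of numpy: place four images into the quadrants of one canvas and shift each image's normalized boxes into its quadrant. This is a simplified sketch of my own with a fixed center and stride-based downsampling; real implementations randomize the mosaic center and use proper image resizing:

```python
import numpy as np

def mosaic_4(images, boxes_list, out_size=416):
    """Stitch 4 equal-sized images into a 2x2 mosaic; remap normalized boxes."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # (row, col) offsets: top-left, top-right, bottom-left, bottom-right
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    all_boxes = []
    for img, boxes, (oy, ox) in zip(images, boxes_list, offsets):
        # Crude half-size downsample via striding (stand-in for real resizing)
        sy, sx = img.shape[0] // half, img.shape[1] // half
        canvas[oy:oy + half, ox:ox + half] = img[::sy, ::sx][:half, :half]
        for x1, y1, x2, y2 in boxes:
            # Each box shrinks to half scale, then shifts into its quadrant
            all_boxes.append([
                x1 / 2 + ox / out_size, y1 / 2 + oy / out_size,
                x2 / 2 + ox / out_size, y2 / 2 + oy / out_size,
            ])
    return canvas, np.array(all_boxes)

# Demo: four flat-colored images, each with one centered box
imgs = [np.full((416, 416, 3), i * 60, dtype=np.uint8) for i in range(4)]
boxes = [np.array([[0.2, 0.2, 0.8, 0.8]])] * 4
mosaic, mboxes = mosaic_4(imgs, boxes)
print(mosaic.shape, mboxes.shape)  # (416, 416, 3) (4, 4)
```

Every source object survives at half scale, so a single mosaic sample exposes the model to four contexts and a new object-size distribution at once.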
Conclusion & Next Steps
You've now built a comprehensive understanding of YOLO object detection — from the foundational philosophy of single-pass regression, through the grid cell prediction system, all the way to modern YOLOv8 with Ultralytics. Here's what we covered:
- The YOLO paradigm: treating detection as regression on a spatial grid
- Core mechanics: grid cells, bounding box encoding, confidence scores
- Training: multi-part loss function with scale-aware width/height
- Post-processing: NMS for duplicate removal
- Modern advances: anchor boxes, FPN, multi-scale detection
- Production usage: Ultralytics YOLOv8 for training and inference