TensorFlow Mastery Series

Part 6: CNNs & Computer Vision

May 3, 2026 · Wasil Zafar · 30 min read

From pixels to predictions — master convolutional neural networks in TensorFlow. Build CNNs from scratch, leverage pretrained models via transfer learning, implement data augmentation pipelines, and visualize what your networks actually learn with Grad-CAM.

Table of Contents

  1. How CNNs See Images
  2. Convolution Operations
  3. Pooling & Downsampling
  4. Building a CNN from Scratch
  5. Data Augmentation for Vision
  6. Transfer Learning
  7. Fine-Tuning Strategy
  8. Image Classification Pipeline
  9. Grad-CAM Visualization
  10. Beyond Classification

How CNNs See Images

Convolutional Neural Networks (CNNs) are specialized architectures designed to process grid-structured data like images. Unlike fully-connected networks that treat each pixel independently, CNNs exploit the spatial structure of images through three key principles: local receptive fields, parameter sharing, and translation invariance.

A neuron in a convolutional layer doesn't see the entire image — it only looks at a small local region (its receptive field). The same filter weights are shared across all spatial positions, meaning a feature detector learned in one part of the image can detect that same feature anywhere. This parameter sharing dramatically reduces the number of learnable parameters compared to a fully-connected approach.

Key Insight: CNNs build a hierarchical feature representation. Early layers detect edges and textures. Middle layers combine these into parts (eyes, wheels). Deep layers recognize entire objects (faces, cars). This bottom-up hierarchy is what makes CNNs so powerful for vision tasks.

Hierarchical Feature Learning

The convolutional output dimension follows a precise formula. Given input width $W$, kernel size $K$, padding $P$, and stride $S$:

$$O = \lfloor\frac{W - K + 2P}{S}\rfloor + 1$$

For example, a 32×32 input with a 3×3 kernel, padding=1, stride=1 produces a 32×32 output (spatial dimensions preserved). With stride=2, the output becomes 16×16 (spatial downsampling by 2×).

import tensorflow as tf
import numpy as np

# Demonstrate CNN hierarchical features with a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),

    # Layer 1: Detects edges, gradients (3x3 receptive field)
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same', name='edges'),

    # Layer 2: Detects textures, patterns (5x5 effective receptive field)
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same', name='textures'),
    tf.keras.layers.MaxPooling2D((2, 2)),

    # Layer 3: Detects parts, shapes (larger effective receptive field)
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same', name='parts'),
    tf.keras.layers.MaxPooling2D((2, 2)),

    # Layer 4: Detects objects, high-level concepts
    tf.keras.layers.Conv2D(256, (3, 3), activation='relu', padding='same', name='objects'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.summary()

# Calculate output formula: O = floor((W - K + 2P) / S) + 1
W, K, P, S = 32, 3, 1, 1
output_size = (W - K + 2*P) // S + 1
print(f"\nInput={W}, Kernel={K}, Padding={P}, Stride={S} → Output={output_size}")

W, K, P, S = 32, 3, 1, 2
output_size = (W - K + 2*P) // S + 1
print(f"Input={W}, Kernel={K}, Padding={P}, Stride={S} → Output={output_size}")

# Parameters in Conv2D: K × K × C_in × C_out + C_out (bias)
K, C_in, C_out = 3, 3, 32
params = K * K * C_in * C_out + C_out
print(f"\nConv2D(32, 3x3) on RGB input: {params} parameters")
print(f"  = {K}×{K}×{C_in}×{C_out} + {C_out} (bias)")

The parameter count for a Conv2D layer is $K \times K \times C_{in} \times C_{out} + C_{out}$, where $C_{in}$ is input channels and $C_{out}$ is output filters. A 3×3 Conv2D with 3 input channels and 32 filters has only 896 parameters, versus 393,216 weights for a Dense layer connecting even a tiny flattened 64×64×3 input to 32 neurons.
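
Both counts can be checked directly in Keras. A quick sketch (the Dense layer is built without a bias so its count matches the weights-only figure above):

```python
import tensorflow as tf

# Conv2D(32, 3x3) on an RGB input: 3*3*3*32 weights + 32 biases = 896
conv = tf.keras.layers.Conv2D(32, (3, 3))
conv.build((None, 64, 64, 3))
print(f"Conv2D params: {conv.count_params()}")   # 896

# Dense(32) on a flattened 64x64x3 input: 12288*32 weights, no bias
dense = tf.keras.layers.Dense(32, use_bias=False)
dense.build((None, 64 * 64 * 3))
print(f"Dense params:  {dense.count_params()}")  # 393216
```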

Convolution Operations

The tf.keras.layers.Conv2D layer is the workhorse of CNN architectures. Understanding its parameters — filters, kernel_size, strides, padding — is essential for designing effective networks. Beyond standard convolutions, 1×1 convolutions and depthwise separable convolutions offer powerful alternatives for specific use cases.

import tensorflow as tf
import numpy as np

# Create a random "image" batch: (batch=1, height=32, width=32, channels=3)
images = tf.random.normal([1, 32, 32, 3])

# Standard Conv2D: filters=64, kernel=3x3, stride=1, padding='same'
conv_same = tf.keras.layers.Conv2D(64, (3, 3), strides=1, padding='same', activation='relu')
out_same = conv_same(images)
print(f"Input shape:        {images.shape}")            # (1, 32, 32, 3)
print(f"Conv2D same output: {out_same.shape}")          # (1, 32, 32, 64)

# padding='valid' (no padding) — shrinks spatial dims
conv_valid = tf.keras.layers.Conv2D(64, (3, 3), strides=1, padding='valid', activation='relu')
out_valid = conv_valid(images)
print(f"Conv2D valid output: {out_valid.shape}")        # (1, 30, 30, 64)

# Stride=2 for downsampling (alternative to pooling)
conv_stride2 = tf.keras.layers.Conv2D(64, (3, 3), strides=2, padding='same', activation='relu')
out_stride2 = conv_stride2(images)
print(f"Conv2D stride=2 output: {out_stride2.shape}")   # (1, 16, 16, 64)

# 1x1 convolution: channel mixing without spatial change
conv_1x1 = tf.keras.layers.Conv2D(32, (1, 1), activation='relu')
out_1x1 = conv_1x1(out_same)
print(f"1x1 conv output:    {out_1x1.shape}")           # (1, 32, 32, 32)
print(f"\n1x1 conv params: {conv_1x1.count_params()}")  # 64*32 + 32 = 2080

Depthwise Separable Convolutions

Depthwise separable convolutions (used in MobileNet) factorize a standard convolution into a depthwise convolution (one filter per input channel) followed by a pointwise 1×1 convolution. This cuts the multiply-add cost to roughly $\frac{1}{C_{out}} + \frac{1}{K^2}$ of a standard convolution's, making it ideal for mobile and edge deployment.
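
Plugging in the same shapes used in the comparison below (K=3, 64 input channels, 128 output channels), a quick numeric sketch confirms the cost ratio:

```python
# Multiply-add cost per output position:
#   standard:  K*K * C_in * C_out
#   separable: K*K * C_in (depthwise)  +  C_in * C_out (pointwise)
# ratio = separable / standard = 1/C_out + 1/K^2
K, C_in, C_out = 3, 64, 128
standard = K * K * C_in * C_out
separable = K * K * C_in + C_in * C_out
ratio = separable / standard
print(f"Exact ratio: {ratio:.4f}")
print(f"Formula:     {1 / C_out + 1 / K**2:.4f}")
print(f"Savings:     ~{standard / separable:.1f}x fewer multiply-adds")
```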

import tensorflow as tf
import numpy as np

# Compare standard Conv2D vs DepthwiseSeparable
images = tf.random.normal([1, 32, 32, 64])

# Standard Conv2D: 64 input channels → 128 output channels
standard_conv = tf.keras.layers.Conv2D(128, (3, 3), padding='same')
standard_conv(images)  # build
standard_params = standard_conv.count_params()

# Depthwise Separable: same input → same output
separable_conv = tf.keras.layers.SeparableConv2D(128, (3, 3), padding='same')
separable_conv(images)  # build
separable_params = separable_conv.count_params()

print(f"Standard Conv2D params:   {standard_params:,}")    # 73,856
print(f"SeparableConv2D params:   {separable_params:,}")   # 8,896
print(f"Reduction factor:         {standard_params / separable_params:.1f}x")

# MobileNet-style block: DepthwiseConv → BN → ReLU → PointwiseConv → BN → ReLU
def mobilenet_block(x, filters, stride=1):
    """MobileNet-style depthwise separable block."""
    # Depthwise convolution (spatial filtering)
    x = tf.keras.layers.DepthwiseConv2D(
        (3, 3), strides=stride, padding='same', use_bias=False
    )(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)

    # Pointwise convolution (channel mixing)
    x = tf.keras.layers.Conv2D(filters, (1, 1), use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    return x

# Demo
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = mobilenet_block(inputs, filters=128, stride=2)
block_model = tf.keras.Model(inputs, outputs)
print(f"\nMobileNet block output shape: {block_model.output_shape}")
print(f"MobileNet block params:       {block_model.count_params():,}")

Pooling & Downsampling

Pooling layers reduce spatial dimensions, decrease computation, and provide a degree of translation invariance. The choice between max pooling, average pooling, and stride-2 convolutions affects what information is preserved through the network.

import tensorflow as tf
import numpy as np

# Create feature maps to pool
feature_maps = tf.random.normal([1, 8, 8, 64])

# MaxPooling2D: keeps strongest activations (most common)
max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)
out_max = max_pool(feature_maps)
print(f"Input:          {feature_maps.shape}")   # (1, 8, 8, 64)
print(f"MaxPool2D:      {out_max.shape}")         # (1, 4, 4, 64)

# AveragePooling2D: smooths activations (good for final layers)
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2), strides=2)
out_avg = avg_pool(feature_maps)
print(f"AvgPool2D:      {out_avg.shape}")         # (1, 4, 4, 64)

# GlobalAveragePooling2D: collapses spatial dims entirely
gap = tf.keras.layers.GlobalAveragePooling2D()
out_gap = gap(feature_maps)
print(f"GlobalAvgPool:  {out_gap.shape}")         # (1, 64) — one value per channel

# GlobalMaxPooling2D: takes max per channel
gmp = tf.keras.layers.GlobalMaxPooling2D()
out_gmp = gmp(feature_maps)
print(f"GlobalMaxPool:  {out_gmp.shape}")         # (1, 64)

When to Use Each Pooling Type

Pooling Guidelines: Use MaxPool2D in early/middle layers to preserve dominant features. Use GlobalAveragePooling2D before the classification head (replaces Flatten + Dense, reduces overfitting). Use stride-2 convolutions when you want the network to learn how to downsample (modern practice in ResNets).

Modern architectures increasingly favor stride-2 convolutions over explicit pooling layers, as they allow the network to learn the downsampling operation rather than using a fixed function. However, GlobalAveragePooling2D remains the standard way to transition from convolutional features to the classification head.
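A minimal side-by-side of the two downsampling routes makes the trade-off concrete: both halve the spatial dimensions, but only the stride-2 convolution has weights the network can learn (shapes chosen to match the pooling examples above):

```python
import tensorflow as tf

x = tf.random.normal([1, 8, 8, 64])

# Fixed-function downsampling: no learnable parameters
max_pool = tf.keras.layers.MaxPooling2D((2, 2))
pooled = max_pool(x)

# Learned downsampling: stride-2 conv halves spatial dims AND trains weights
conv_down = tf.keras.layers.Conv2D(64, (3, 3), strides=2, padding='same')
strided = conv_down(x)

print(f"MaxPool output:       {pooled.shape}")             # (1, 4, 4, 64)
print(f"Stride-2 conv output: {strided.shape}")            # (1, 4, 4, 64)
print(f"MaxPool params:       {max_pool.count_params()}")  # 0
print(f"Stride-2 conv params: {conv_down.count_params()}") # 36928
```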

Building a CNN from Scratch

The classic CNN pattern stacks Conv → BatchNorm → ReLU → Pool blocks, progressively increasing filters while reducing spatial dimensions. After the convolutional backbone extracts features, a classification head (GlobalAveragePooling + Dense) produces predictions. Let's build one for CIFAR-10 (32×32 RGB images, 10 classes).

CNN Architecture Flow
flowchart LR
    A[Input 32×32×3] --> B[Conv 32 3×3]
    B --> C[BN + ReLU]
    C --> D[Conv 32 3×3]
    D --> E[BN + ReLU + MaxPool]
    E --> F[Conv 64 3×3]
    F --> G[BN + ReLU]
    G --> H[Conv 64 3×3]
    H --> I[BN + ReLU + MaxPool]
    I --> J[Conv 128 3×3]
    J --> K[BN + ReLU]
    K --> L[GAP]
    L --> M[Dense 10 Softmax]

import tensorflow as tf
import numpy as np

# Load CIFAR-10
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype(np.float32) / 255.0
X_test = X_test.astype(np.float32) / 255.0
y_train = y_train.flatten()
y_test = y_test.flatten()

print(f"Training: {X_train.shape}, Labels: {y_train.shape}")  # (50000, 32, 32, 3)
print(f"Test:     {X_test.shape}, Labels: {y_test.shape}")    # (10000, 32, 32, 3)
print(f"Classes:  {np.unique(y_train)}")                       # [0..9]

# Build CNN from scratch: Conv-BN-ReLU-Pool pattern
def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    inputs = tf.keras.Input(shape=input_shape)

    # Block 1: 32 filters
    x = tf.keras.layers.Conv2D(32, (3, 3), padding='same', use_bias=False)(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(32, (3, 3), padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Dropout(0.25)(x)

    # Block 2: 64 filters
    x = tf.keras.layers.Conv2D(64, (3, 3), padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(64, (3, 3), padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Dropout(0.25)(x)

    # Block 3: 128 filters
    x = tf.keras.layers.Conv2D(128, (3, 3), padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)

    # Classification head
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

    return tf.keras.Model(inputs, outputs, name='cifar10_cnn')

model = build_cnn()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
print(f"\nTotal parameters: {model.count_params():,}")

Training to 85%+ Accuracy

The self-contained script below trains the architecture above with two callbacks, ReduceLROnPlateau and EarlyStopping, the combination that typically pushes this network past 85% test accuracy on CIFAR-10 within ~50 epochs:

import tensorflow as tf
import numpy as np

# Load and preprocess CIFAR-10
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype(np.float32) / 255.0
X_test = X_test.astype(np.float32) / 255.0
y_train, y_test = y_train.flatten(), y_test.flatten()

# Build model (same architecture as above)
inputs = tf.keras.Input(shape=(32, 32, 3))
x = inputs
for filters in [32, 32]:
    x = tf.keras.layers.Conv2D(filters, (3, 3), padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Dropout(0.25)(x)
for filters in [64, 64]:
    x = tf.keras.layers.Conv2D(filters, (3, 3), padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Dropout(0.25)(x)
x = tf.keras.layers.Conv2D(128, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train with callbacks for best performance
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
]

history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=128,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1
)

# Evaluate on test set
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_acc:.4f}")
print(f"Test loss:     {test_loss:.4f}")
print(f"Best val accuracy: {max(history.history['val_accuracy']):.4f}")

Data Augmentation for Vision

Data augmentation artificially expands your training set by applying random transformations — flips, rotations, zooms, contrast changes — during training. In TensorFlow, augmentation layers can be embedded directly into the model (GPU-accelerated) or applied as a preprocessing step in the data pipeline.

import tensorflow as tf
import numpy as np

# Keras augmentation layers — embedded in the model
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),           # 50% chance horizontal flip
    tf.keras.layers.RandomRotation(0.1),                # ±10% of full rotation (±36°)
    tf.keras.layers.RandomZoom(0.1),                    # ±10% zoom
    tf.keras.layers.RandomContrast(0.1),                # ±10% contrast
    tf.keras.layers.RandomTranslation(0.1, 0.1),        # ±10% shift
], name='data_augmentation')

# Demo: augment a single image
(X_train, _), _ = tf.keras.datasets.cifar10.load_data()
sample_image = X_train[0:1].astype(np.float32) / 255.0  # (1, 32, 32, 3)

# Apply augmentation 5 times to see variety
print("Original shape:", sample_image.shape)
for i in range(5):
    augmented = data_augmentation(sample_image, training=True)
    # Check that values are valid (no NaN, reasonable range)
    print(f"  Augmented {i+1}: min={augmented.numpy().min():.3f}, "
          f"max={augmented.numpy().max():.3f}, "
          f"shape={augmented.shape}")

# Augmentation is ONLY active during training (training=True)
inference_output = data_augmentation(sample_image, training=False)
print(f"\nInference (no augmentation): identical={np.allclose(sample_image, inference_output.numpy())}")

Augmentation as Part of Model vs Dataset

Augmentation can live in two places: inside the model (GPU-accelerated, automatically disabled at inference) or in the tf.data pipeline (runs on CPU, overlapping asynchronously with GPU training). The self-contained script below demonstrates both approaches:

import tensorflow as tf
import numpy as np

# Approach 1: Augmentation inside the model (GPU-accelerated)
def build_model_with_augmentation(input_shape=(32, 32, 3), num_classes=10):
    inputs = tf.keras.Input(shape=input_shape)

    # Augmentation layers (only active during training)
    x = tf.keras.layers.RandomFlip("horizontal")(inputs)
    x = tf.keras.layers.RandomRotation(0.05)(x)
    x = tf.keras.layers.RandomZoom(0.1)(x)

    # CNN backbone
    x = tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu')(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

    return tf.keras.Model(inputs, outputs)

model_aug = build_model_with_augmentation()
model_aug.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print("Model with augmentation layers:")
print(f"  Total params: {model_aug.count_params():,}")

# Approach 2: Augmentation in tf.data pipeline (CPU, async with GPU training)
def augment_dataset(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image, label

# Create pipeline with augmentation
(X_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype(np.float32) / 255.0
y_train = y_train.flatten()

train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(10000).map(augment_dataset, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.batch(64).prefetch(tf.data.AUTOTUNE)

# Verify pipeline
for images, labels in train_ds.take(1):
    print(f"\nDataset pipeline batch: images={images.shape}, labels={labels.shape}")

Transfer Learning

Transfer learning uses a model pretrained on a large dataset (typically ImageNet with 1.2M images, 1000 classes) as a starting point for your task. Instead of learning features from scratch, you leverage features already learned by state-of-the-art architectures. This is especially powerful when you have limited data (hundreds to thousands of images).

Transfer Learning Workflow
flowchart TD
    A[Pretrained Model ImageNet weights] --> B{Strategy?}
    B -->|Feature Extraction| C[Freeze all base layers]
    C --> D[Add custom head]
    D --> E["Train head only, high LR: 1e-3"]
    B -->|Fine-Tuning| F[Freeze base initially]
    F --> G[Train head first]
    G --> H[Unfreeze top N layers]
    H --> I["Train end-to-end, low LR: 1e-5"]
    E --> J[Evaluate]
    I --> J

import tensorflow as tf
import numpy as np

# Transfer Learning: Feature Extraction with MobileNetV2
# Load pretrained backbone WITHOUT top classification layer
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3),
    include_top=False,           # Remove ImageNet classification head
    weights='imagenet'           # Load pretrained weights
)

# Freeze the base model — don't update pretrained weights
base_model.trainable = False

print(f"MobileNetV2 base layers:   {len(base_model.layers)}")
print(f"MobileNetV2 base params:   {base_model.count_params():,}")
print(f"Trainable params (frozen):  {sum(tf.keras.backend.count_params(w) for w in base_model.trainable_weights):,}")

# Build complete model with custom head
inputs = tf.keras.Input(shape=(96, 96, 3))

# Preprocessing: MobileNetV2 expects pixels in [-1, 1]
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)

# Feature extraction (frozen)
x = base_model(x, training=False)  # training=False keeps BN in inference mode

# Custom classification head
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(5, activation='softmax')(x)  # 5 classes

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Only the head is trainable
print(f"\nTotal params:     {model.count_params():,}")
print(f"Trainable params: {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")
print(f"Non-trainable:    {sum(tf.keras.backend.count_params(w) for w in model.non_trainable_weights):,}")

Choosing a Pretrained Backbone

Backbone Selection Guide:
MobileNetV2 — fast, small, good for mobile/edge (3.4M params)
EfficientNetB0 — excellent accuracy/efficiency trade-off (5.3M params)
ResNet50 — well-understood, good baseline (25.6M params)
EfficientNetB4+ — maximum accuracy, slower (19M+ params)

import tensorflow as tf

# Compare available pretrained models
backbones = {
    'MobileNetV2': tf.keras.applications.MobileNetV2,
    'EfficientNetB0': tf.keras.applications.EfficientNetB0,
    'ResNet50': tf.keras.applications.ResNet50,
    'DenseNet121': tf.keras.applications.DenseNet121,
}

print(f"{'Model':<18} {'Params':>12} {'Output Shape':>20}")
print("-" * 55)

for name, model_fn in backbones.items():
    model = model_fn(input_shape=(224, 224, 3), include_top=False, weights=None)
    output_shape = model.output_shape[1:]
    params = model.count_params()
    print(f"{name:<18} {params:>12,} {str(output_shape):>20}")
    del model

# Each backbone has its own preprocessing function
print("\nPreprocessing functions:")
print("  MobileNetV2:    pixels → [-1, 1]")
print("  EfficientNetB0: pixels → [0, 255] (no rescaling needed)")
print("  ResNet50:       pixels → caffe-style (BGR, mean-subtracted)")
print("  DenseNet121:    pixels → [0, 1], then imagenet normalization")
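
The first two of these mappings can be verified numerically. A quick sketch using a constant all-white image, where the expected output is easy to predict:

```python
import numpy as np
import tensorflow as tf

# A constant all-white image: every pixel is 255
img = np.full((1, 4, 4, 3), 255.0, dtype=np.float32)

# MobileNetV2: x/127.5 - 1, so 255 maps exactly to +1.0
mb = tf.keras.applications.mobilenet_v2.preprocess_input(img.copy())
print(f"MobileNetV2:    {float(np.max(mb)):.1f}")   # 1.0

# EfficientNet: identity; rescaling/normalization live inside the model
ef = tf.keras.applications.efficientnet.preprocess_input(img.copy())
print(f"EfficientNetB0: {float(np.max(ef)):.1f}")   # 255.0
```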

Fine-Tuning Strategy

Fine-tuning goes beyond feature extraction by unfreezing some pretrained layers and training them with a very low learning rate. The standard approach is two-phase training: first train only the head (fast convergence), then unfreeze top layers and fine-tune end-to-end (incremental improvement).

import tensorflow as tf
import numpy as np

# Phase 1: Feature extraction (head only)
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights='imagenet'
)
base_model.trainable = False

inputs = tf.keras.Input(shape=(96, 96, 3))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

# Phase 1: Train head with high learning rate
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Simulate training with random data
X_dummy = np.random.rand(200, 96, 96, 3).astype(np.float32) * 255
y_dummy = np.random.randint(0, 10, 200)
model.fit(X_dummy, y_dummy, epochs=5, batch_size=32, verbose=0)
print("Phase 1 complete: Head trained")
print(f"  Trainable params: {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")

# Phase 2: Unfreeze top layers for fine-tuning
base_model.trainable = True

# Freeze all layers except the last 30 (fine-tune top layers only)
fine_tune_from = len(base_model.layers) - 30
for layer in base_model.layers[:fine_tune_from]:
    layer.trainable = False

# Recompile with MUCH lower learning rate
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),  # 100x lower than Phase 1
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print(f"\nPhase 2: Fine-tuning from layer {fine_tune_from}/{len(base_model.layers)}")
print(f"  Trainable params: {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")

# Continue training with lower LR
model.fit(X_dummy, y_dummy, epochs=5, batch_size=32, verbose=0)
print("Phase 2 complete: Fine-tuned")

BatchNorm in Inference Mode

Critical: When fine-tuning, always pass training=False to the base model. This keeps BatchNormalization layers in inference mode (using moving averages). If training=True, BN layers will update their statistics on your small dataset, destroying pretrained representations.

import tensorflow as tf
import numpy as np

# CORRECT: BatchNorm in inference mode during fine-tuning
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights='imagenet'
)

inputs = tf.keras.Input(shape=(96, 96, 3))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)

# training=False is CRITICAL — keeps BN statistics frozen
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)

# Even after unfreezing, BN stays in inference mode because training=False
base_model.trainable = True
for layer in base_model.layers[:-30]:
    layer.trainable = False

# Verify BN behavior
bn_layers = [l for l in base_model.layers if isinstance(l, tf.keras.layers.BatchNormalization)]
trainable_bn = [l for l in bn_layers if l.trainable]
print(f"Total BN layers:     {len(bn_layers)}")
print(f"Trainable BN layers: {len(trainable_bn)}")
print(f"\nNote: Even trainable BN layers use inference mode (moving avg)")
print(f"because base_model is called with training=False")

Image Classification Pipeline

Here's a complete end-to-end image classification pipeline combining all techniques: loading images from directories, augmentation, pretrained backbone with fine-tuning, training, and evaluation. This pattern works for any custom image dataset organized in class-folder structure.

import tensorflow as tf
import numpy as np
import os

# Complete image classification pipeline
# Dataset structure expected:
#   dataset/
#     train/
#       class_a/ (images...)
#       class_b/ (images...)
#     validation/
#       class_a/
#       class_b/

# Configuration
IMG_SIZE = (160, 160)
BATCH_SIZE = 32
NUM_CLASSES = 5
EPOCHS_PHASE1 = 10
EPOCHS_PHASE2 = 10

# Step 1: Load dataset from directory
# (Using simulated data for reproducibility — replace with your directory)
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     'dataset/train',
#     image_size=IMG_SIZE,
#     batch_size=BATCH_SIZE,
#     label_mode='int'
# )

# Simulate dataset
X_train = np.random.rand(500, 160, 160, 3).astype(np.float32) * 255
y_train = np.random.randint(0, NUM_CLASSES, 500)
X_val = np.random.rand(100, 160, 160, 3).astype(np.float32) * 255
y_val = np.random.randint(0, NUM_CLASSES, 100)

train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(500).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_ds = val_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Step 2: Data augmentation layer
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

# Step 3: Build model with pretrained backbone
base_model = tf.keras.applications.EfficientNetB0(
    input_shape=(*IMG_SIZE, 3), include_top=False, weights='imagenet'
)
base_model.trainable = False  # Freeze for Phase 1

inputs = tf.keras.Input(shape=(*IMG_SIZE, 3))
x = data_augmentation(inputs)                           # Augmentation
x = tf.keras.applications.efficientnet.preprocess_input(x)  # Preprocessing
x = base_model(x, training=False)                       # Feature extraction
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.3)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

# Step 4: Phase 1 — Train head
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("Phase 1: Training classification head...")
history1 = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS_PHASE1, verbose=1)

# Step 5: Phase 2 — Fine-tune top layers
base_model.trainable = True
for layer in base_model.layers[:-20]:
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("\nPhase 2: Fine-tuning top layers...")
history2 = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS_PHASE2, verbose=1)

# Step 6: Evaluate
val_loss, val_acc = model.evaluate(val_ds, verbose=0)
print(f"\nFinal validation accuracy: {val_acc:.4f}")
print(f"Final validation loss:     {val_loss:.4f}")

Evaluation & Metrics

Overall accuracy can hide systematic per-class failures. The self-contained script below computes per-class precision, recall, and F1 directly from model predictions:

import tensorflow as tf
import numpy as np

# Evaluation utilities for classification models
def evaluate_classifier(model, test_ds, class_names):
    """Compute per-class precision, recall, F1."""
    all_preds = []
    all_labels = []

    for images, labels in test_ds:
        preds = model.predict(images, verbose=0)
        all_preds.extend(np.argmax(preds, axis=1))
        all_labels.extend(labels.numpy())

    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)
    num_classes = len(class_names)

    print(f"{'Class':<15} {'Precision':>10} {'Recall':>10} {'F1':>10} {'Support':>10}")
    print("-" * 58)

    for i in range(num_classes):
        tp = np.sum((all_preds == i) & (all_labels == i))
        fp = np.sum((all_preds == i) & (all_labels != i))
        fn = np.sum((all_preds != i) & (all_labels == i))

        precision = tp / (tp + fp + 1e-7)
        recall = tp / (tp + fn + 1e-7)
        f1 = 2 * precision * recall / (precision + recall + 1e-7)
        support = np.sum(all_labels == i)

        print(f"{class_names[i]:<15} {precision:>10.3f} {recall:>10.3f} {f1:>10.3f} {support:>10}")

    accuracy = np.mean(all_preds == all_labels)
    print(f"\n{'Overall Accuracy:':<15} {accuracy:.4f}")
    return accuracy

# Demo with a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

X_test = np.random.rand(200, 32, 32, 3).astype(np.float32)
y_test = np.random.randint(0, 5, 200)
model.fit(X_test, y_test, epochs=5, verbose=0)

test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(32)
class_names = ['Cat', 'Dog', 'Bird', 'Fish', 'Horse']
evaluate_classifier(model, test_ds, class_names)

Grad-CAM Visualization

Gradient-weighted Class Activation Mapping (Grad-CAM) reveals which regions of an image are most important for a CNN's prediction. It computes the gradient of the predicted class score with respect to the final convolutional layer's feature maps, then produces a heatmap highlighting the discriminative regions.

How Grad-CAM Works: (1) Forward pass to get feature maps from the last conv layer. (2) Compute gradient of predicted class score w.r.t. those feature maps. (3) Global average pool the gradients to get importance weights per channel. (4) Compute weighted sum of feature maps → ReLU → normalize to [0, 1]. The result shows where the model is "looking."

Grad-CAM Implementation

Here is a complete implementation. The first block computes the heatmap and is self-contained; the overlay example that follows reuses the `make_gradcam_heatmap` function defined in it:

import tensorflow as tf
import numpy as np

def make_gradcam_heatmap(img_array, model, last_conv_layer_name, pred_index=None):
    """
    Generate Grad-CAM heatmap for a given image and model.

    Args:
        img_array: Preprocessed image tensor (1, H, W, C)
        model: Trained Keras model
        last_conv_layer_name: Name of the last convolutional layer
        pred_index: Class index to visualize (None = top prediction)

    Returns:
        heatmap: Normalized heatmap array (H', W') in [0, 1]
    """
    # Build sub-model: input → last conv layer output + final predictions
    grad_model = tf.keras.Model(
        inputs=model.input,
        outputs=[
            model.get_layer(last_conv_layer_name).output,
            model.output
        ]
    )

    # Forward pass + gradient computation
    with tf.GradientTape() as tape:
        conv_outputs, predictions = grad_model(img_array)
        if pred_index is None:
            pred_index = tf.argmax(predictions[0])
        class_score = predictions[:, pred_index]

    # Gradient of class score w.r.t. conv layer output
    grads = tape.gradient(class_score, conv_outputs)

    # Global average pooling of gradients → importance weights
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Weighted combination of feature maps
    conv_outputs = conv_outputs[0]
    heatmap = conv_outputs @ pooled_grads[..., tf.newaxis]
    heatmap = tf.squeeze(heatmap)

    # ReLU + normalize to [0, 1]
    heatmap = tf.maximum(heatmap, 0) / (tf.reduce_max(heatmap) + 1e-8)
    return heatmap.numpy()

# Demo: Build a simple model and generate Grad-CAM
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                           input_shape=(32, 32, 3), name='conv1'),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same', name='conv2'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same', name='last_conv'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Train briefly on CIFAR-10 for meaningful gradients
(X_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
X_train = X_train[:1000].astype(np.float32) / 255.0
y_train = y_train[:1000].flatten()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, batch_size=64, verbose=0)

# Generate Grad-CAM for a sample image (taken from the training set for the demo)
test_image = X_train[0:1]  # (1, 32, 32, 3)
heatmap = make_gradcam_heatmap(test_image, model, 'last_conv')
print(f"Heatmap shape: {heatmap.shape}")
print(f"Heatmap range: [{heatmap.min():.3f}, {heatmap.max():.3f}]")
print(f"Predicted class: {np.argmax(model.predict(test_image, verbose=0))}")

# In practice, overlay heatmap on original image:
# 1. Resize heatmap to image size
# 2. Apply colormap (e.g., jet)
# 3. Blend with original: superimposed = alpha * colormap + (1-alpha) * image
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

def overlay_gradcam(image, heatmap, alpha=0.4):
    """
    Overlay Grad-CAM heatmap on original image.

    Args:
        image: Original image array (H, W, 3) in [0, 1]
        heatmap: Grad-CAM heatmap (H', W') in [0, 1]
        alpha: Blending factor

    Returns:
        superimposed: Blended image (H, W, 3) in [0, 1]
    """
    # Resize heatmap to image dimensions
    heatmap_resized = tf.image.resize(
        heatmap[..., tf.newaxis], (image.shape[0], image.shape[1])
    ).numpy().squeeze()

    # Apply colormap (jet-like: blue→green→red)
    heatmap_colored = plt.cm.jet(heatmap_resized)[:, :, :3]  # RGB only

    # Blend
    superimposed = alpha * heatmap_colored + (1 - alpha) * image
    superimposed = np.clip(superimposed, 0, 1)
    return superimposed

# Demo visualization
(X_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
X_sample = X_train[42:43].astype(np.float32) / 255.0
cifar_classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                 'dog', 'frog', 'horse', 'ship', 'truck']

# Build and train model
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same',
                           input_shape=(32, 32, 3), name='conv1'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same', name='last_conv'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train[:5000] / 255.0, y_train[:5000], epochs=5, batch_size=128, verbose=0)

# Generate and display
heatmap = make_gradcam_heatmap(X_sample, model, 'last_conv')
superimposed = overlay_gradcam(X_sample[0], heatmap)
pred_class = np.argmax(model.predict(X_sample, verbose=0))
true_class = y_train[42][0]

print(f"Predicted: {cifar_classes[pred_class]}")
print(f"True:      {cifar_classes[true_class]}")
print(f"Heatmap shape: {heatmap.shape}, Overlay shape: {superimposed.shape}")
plt.imshow(superimposed)
plt.title(f"Grad-CAM: {cifar_classes[pred_class]}")
plt.axis('off')
plt.show()

Beyond Classification

CNNs power far more than image classification. Object detection localizes and classifies multiple objects within an image. Semantic segmentation assigns a class label to every pixel. These tasks build on the same convolutional backbone features but use different head architectures.

Object Detection Overview

Object detection models predict bounding boxes and class labels for multiple objects. Key concepts include anchor boxes (predefined box shapes at each spatial location), Non-Maximum Suppression (NMS) for removing duplicate detections, and IoU (Intersection over Union) for measuring box overlap.

import tensorflow as tf
import numpy as np

def compute_iou(box1, box2):
    """
    Compute Intersection over Union between two boxes.
    Boxes are [y_min, x_min, y_max, x_max] format.
    """
    # Intersection area
    y_min = max(box1[0], box2[0])
    x_min = max(box1[1], box2[1])
    y_max = min(box1[2], box2[2])
    x_max = min(box1[3], box2[3])

    intersection = max(0, y_max - y_min) * max(0, x_max - x_min)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / (union + 1e-7)

def non_max_suppression(boxes, scores, iou_threshold=0.5, max_detections=100):
    """
    Apply NMS to remove overlapping detections.

    Args:
        boxes: Array of shape (N, 4) — [y_min, x_min, y_max, x_max]
        scores: Array of shape (N,) — confidence scores
        iou_threshold: Suppress boxes with IoU > threshold
        max_detections: Maximum number of boxes to keep

    Returns:
        indices: Indices of kept boxes
    """
    # TensorFlow's built-in NMS
    selected_indices = tf.image.non_max_suppression(
        boxes=boxes,
        scores=scores,
        max_output_size=max_detections,
        iou_threshold=iou_threshold
    )
    return selected_indices.numpy()

# Demo: NMS on overlapping detections
boxes = np.array([
    [10, 10, 50, 50],   # Box A
    [12, 12, 52, 52],   # Box B (overlaps heavily with A)
    [100, 100, 150, 150],  # Box C (separate)
    [102, 102, 148, 148],  # Box D (overlaps with C)
], dtype=np.float32)

scores = np.array([0.9, 0.75, 0.85, 0.7], dtype=np.float32)

# Compute IoU between overlapping boxes
iou_ab = compute_iou(boxes[0], boxes[1])
iou_cd = compute_iou(boxes[2], boxes[3])
print(f"IoU(A, B) = {iou_ab:.3f}")  # High overlap
print(f"IoU(C, D) = {iou_cd:.3f}")  # High overlap

# Apply NMS
kept = non_max_suppression(boxes, scores, iou_threshold=0.5)
print(f"\nNMS kept indices: {kept}")
print(f"Kept boxes: {len(kept)} out of {len(boxes)}")
for idx in kept:
    print(f"  Box {idx}: score={scores[idx]:.2f}, coords={boxes[idx]}")
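Anchor boxes, the third concept mentioned above, can be sketched with a minimal generator. The grid size, scales, and aspect ratios here are illustrative assumptions, not values from any particular detector:

```python
import numpy as np

def generate_anchors(feature_size, image_size, scales, ratios):
    """Generate anchor boxes [y_min, x_min, y_max, x_max] in image-pixel
    coordinates, one set per cell of a square feature map."""
    stride = image_size / feature_size  # pixels per feature-map cell
    anchors = []
    for i in range(feature_size):
        for j in range(feature_size):
            cy = (i + 0.5) * stride  # cell center (y) in image coords
            cx = (j + 0.5) * stride  # cell center (x)
            for scale in scales:
                for ratio in ratios:
                    # ratio = h/w while keeping area ≈ scale²
                    h = scale * np.sqrt(ratio)
                    w = scale / np.sqrt(ratio)
                    anchors.append([cy - h / 2, cx - w / 2,
                                    cy + h / 2, cx + w / 2])
    return np.array(anchors, dtype=np.float32)

# 4×4 feature map, 2 scales × 3 ratios → 6 anchors per cell, 96 total
anchors = generate_anchors(feature_size=4, image_size=128,
                           scales=[32, 64], ratios=[0.5, 1.0, 2.0])
print(f"Anchors shape: {anchors.shape}")  # (96, 4)
```

A detection head then predicts, for every anchor, a class score and offsets that deform the anchor toward the true object box; NMS prunes the resulting overlapping predictions.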

Segmentation (U-Net in TensorFlow)

U-Net is the standard architecture for image segmentation. It uses an encoder-decoder structure with skip connections that concatenate encoder features with decoder upsampling outputs, preserving spatial detail for pixel-precise predictions.

import tensorflow as tf
import numpy as np

def build_unet(input_shape=(128, 128, 3), num_classes=2):
    """
    Minimal U-Net for semantic segmentation.
    Encoder: Conv blocks with MaxPooling (downsampling)
    Decoder: Conv blocks with UpSampling + skip connections
    """
    inputs = tf.keras.Input(shape=input_shape)

    # --- Encoder ---
    # Block 1
    c1 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
    c1 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(c1)
    p1 = tf.keras.layers.MaxPooling2D((2, 2))(c1)  # 64x64

    # Block 2
    c2 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(p1)
    c2 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c2)
    p2 = tf.keras.layers.MaxPooling2D((2, 2))(c2)  # 32x32

    # Block 3 (Bottleneck)
    c3 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(p2)
    c3 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(c3)

    # --- Decoder ---
    # Block 4 (up + skip from c2)
    u4 = tf.keras.layers.UpSampling2D((2, 2))(c3)    # 64x64
    u4 = tf.keras.layers.Concatenate()([u4, c2])      # Skip connection
    c4 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(u4)
    c4 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c4)

    # Block 5 (up + skip from c1)
    u5 = tf.keras.layers.UpSampling2D((2, 2))(c4)    # 128x128
    u5 = tf.keras.layers.Concatenate()([u5, c1])      # Skip connection
    c5 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(u5)
    c5 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(c5)

    # Output: per-pixel classification
    outputs = tf.keras.layers.Conv2D(num_classes, (1, 1), activation='softmax')(c5)

    return tf.keras.Model(inputs, outputs, name='unet')

# Build and inspect U-Net
unet = build_unet(input_shape=(128, 128, 3), num_classes=4)
unet.summary()

# Verify input/output shapes match (full resolution)
test_input = np.random.rand(2, 128, 128, 3).astype(np.float32)
test_output = unet.predict(test_input, verbose=0)
print(f"\nInput shape:  {test_input.shape}")    # (2, 128, 128, 3)
print(f"Output shape: {test_output.shape}")     # (2, 128, 128, 4) — per-pixel class probs
print(f"Output sums to 1 per pixel: {np.allclose(test_output.sum(axis=-1), 1.0)}")
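Training a segmentation model uses the same sparse categorical loss as classification, applied independently at every pixel: the label is an integer mask of shape (H, W), and the loss broadcasts over the spatial dimensions. A minimal self-contained sketch — using a tiny stand-in fully-convolutional model and random masks, purely for illustration:

```python
import tensorflow as tf
import numpy as np

num_classes = 4

# Tiny fully-convolutional stand-in for a segmentation network
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', padding='same',
                           input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(num_classes, 1, activation='softmax')
])

# sparse_categorical_crossentropy accepts integer masks of shape (H, W)
# and applies the loss per pixel
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

X = np.random.rand(8, 64, 64, 3).astype(np.float32)
y = np.random.randint(0, num_classes, (8, 64, 64))  # one class id per pixel
model.fit(X, y, epochs=1, verbose=0)

# Predicted mask: argmax over the class axis at every pixel
preds = model.predict(X, verbose=0)       # (8, 64, 64, 4)
masks = np.argmax(preds, axis=-1)          # (8, 64, 64)
print(f"Predicted mask shape: {masks.shape}")
```

The same compile-and-fit pattern applies unchanged to the U-Net built earlier, since its output is likewise a per-pixel softmax.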

For feature extraction as a downstream task, you can remove the classification head from any trained CNN and use the feature vectors directly. This is common for image retrieval, clustering, and as input to other models.

import tensorflow as tf
import numpy as np

# Using a CNN as a feature extractor
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    weights='imagenet', pooling='avg'  # GlobalAveragePooling built-in
)

# Extract features from images
images = np.random.rand(10, 224, 224, 3).astype(np.float32) * 255
images_preprocessed = tf.keras.applications.mobilenet_v2.preprocess_input(images)

features = base.predict(images_preprocessed, verbose=0)
print(f"Feature vectors shape: {features.shape}")  # (10, 1280)
print(f"Feature vector dim:    {features.shape[1]}")

# Cosine similarity between feature vectors
from numpy.linalg import norm
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

sim_01 = cosine_similarity(features[0], features[1])
sim_09 = cosine_similarity(features[0], features[9])
print(f"\nCosine similarity (img 0 vs 1): {sim_01:.4f}")
print(f"Cosine similarity (img 0 vs 9): {sim_09:.4f}")
print("(Similar images → higher cosine similarity)")

Next in the Series

In Part 7: RNNs, NLP & Time Series, we'll tackle sequential data with recurrent architectures — LSTMs, GRUs, bidirectional layers, text classification, sequence-to-sequence models, and time series forecasting.