How CNNs See Images
Convolutional Neural Networks (CNNs) are specialized architectures designed to process grid-structured data like images. Unlike fully-connected networks, which treat each pixel as an unrelated input feature, CNNs exploit the spatial structure of images through three key principles: local receptive fields, parameter sharing, and translation equivariance (with approximate translation invariance added by pooling).
A neuron in a convolutional layer doesn't see the entire image — it only looks at a small local region (its receptive field). The same filter weights are shared across all spatial positions, meaning a feature detector learned in one part of the image can detect that same feature anywhere. This parameter sharing dramatically reduces the number of learnable parameters compared to a fully-connected approach.
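Both ideas — the local receptive field and weight sharing — are visible in a naive NumPy implementation of a single-channel convolution (a minimal sketch, not how TensorFlow computes it internally; the `conv2d_single` name and the edge kernel are illustrative):

```python
import numpy as np

def conv2d_single(image, kernel):
    """Naive 'valid' convolution (cross-correlation) for one channel:
    the SAME kernel weights slide over every spatial position."""
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1), dtype=image.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + K, j:j + K]   # local receptive field
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=np.float32).reshape(6, 6)   # toy 6x6 "image"
edge_kernel = np.array([[1, 0, -1]] * 3, dtype=np.float32)  # vertical-edge detector
result = conv2d_single(image, edge_kernel)
print(result.shape)  # (4, 4) — one output per valid position
```

Only nine weights are learned, yet the detector is applied at every position — this is the weight sharing that keeps parameter counts small.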
Hierarchical Feature Learning
The convolutional output dimension follows a precise formula. Given input width $W$, kernel size $K$, padding $P$, and stride $S$:
$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
For example, a 32×32 input with a 3×3 kernel, padding=1, stride=1 produces a 32×32 output (spatial dimensions preserved). With stride=2, the output becomes 16×16 (spatial downsampling by 2×).
import tensorflow as tf
import numpy as np
# Demonstrate CNN hierarchical features with a simple model
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(64, 64, 3)),
# Layer 1: Detects edges, gradients (3x3 receptive field)
tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same', name='edges'),
# Layer 2: Detects textures, patterns (5x5 effective receptive field)
tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same', name='textures'),
tf.keras.layers.MaxPooling2D((2, 2)),
# Layer 3: Detects parts, shapes (larger effective receptive field)
tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same', name='parts'),
tf.keras.layers.MaxPooling2D((2, 2)),
# Layer 4: Detects objects, high-level concepts
tf.keras.layers.Conv2D(256, (3, 3), activation='relu', padding='same', name='objects'),
tf.keras.layers.GlobalAveragePooling2D(),
tf.keras.layers.Dense(10, activation='softmax')
])
model.summary()
# Calculate output formula: O = floor((W - K + 2P) / S) + 1
W, K, P, S = 32, 3, 1, 1
output_size = (W - K + 2*P) // S + 1
print(f"\nInput={W}, Kernel={K}, Padding={P}, Stride={S} → Output={output_size}")
W, K, P, S = 32, 3, 1, 2
output_size = (W - K + 2*P) // S + 1
print(f"Input={W}, Kernel={K}, Padding={P}, Stride={S} → Output={output_size}")
# Parameters in Conv2D: K × K × C_in × C_out + C_out (bias)
K, C_in, C_out = 3, 3, 32
params = K * K * C_in * C_out + C_out
print(f"\nConv2D(32, 3x3) on RGB input: {params} parameters")
print(f" = {K}×{K}×{C_in}×{C_out} + {C_out} (bias)")
The parameter count for a Conv2D layer is $K \times K \times C_{in} \times C_{out} + C_{out}$, where $C_{in}$ is the number of input channels and $C_{out}$ the number of output filters. A 3×3 Conv2D with 3 input channels and 32 filters has only 896 parameters — compared to 393,216 weights for a Dense layer connecting even a tiny flattened 64×64×3 input to 32 neurons.
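The arithmetic behind these two numbers can be spot-checked directly (a quick sketch; the Dense count, like the figure above, omits the bias terms):

```python
# Conv2D: K*K*C_in*C_out weights + C_out biases
K, C_in, C_out = 3, 3, 32
conv_params = K * K * C_in * C_out + C_out   # 3*3*3*32 + 32
print(conv_params)      # 896

# Dense: every flattened input pixel connects to every unit (weights only)
dense_weights = 64 * 64 * 3 * 32
print(dense_weights)    # 393216
print(f"{dense_weights / conv_params:.0f}x more parameters")  # 439x
```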
Convolution Operations
The tf.keras.layers.Conv2D layer is the workhorse of CNN architectures. Understanding its parameters — filters, kernel_size, strides, padding — is essential for designing effective networks. Beyond standard convolutions, 1×1 convolutions and depthwise separable convolutions offer powerful alternatives for specific use cases.
import tensorflow as tf
import numpy as np
# Create a random "image" batch: (batch=1, height=32, width=32, channels=3)
images = tf.random.normal([1, 32, 32, 3])
# Standard Conv2D: filters=64, kernel=3x3, stride=1, padding='same'
conv_same = tf.keras.layers.Conv2D(64, (3, 3), strides=1, padding='same', activation='relu')
out_same = conv_same(images)
print(f"Input shape: {images.shape}") # (1, 32, 32, 3)
print(f"Conv2D same output: {out_same.shape}") # (1, 32, 32, 64)
# padding='valid' (no padding) — shrinks spatial dims
conv_valid = tf.keras.layers.Conv2D(64, (3, 3), strides=1, padding='valid', activation='relu')
out_valid = conv_valid(images)
print(f"Conv2D valid output: {out_valid.shape}") # (1, 30, 30, 64)
# Stride=2 for downsampling (alternative to pooling)
conv_stride2 = tf.keras.layers.Conv2D(64, (3, 3), strides=2, padding='same', activation='relu')
out_stride2 = conv_stride2(images)
print(f"Conv2D stride=2 output: {out_stride2.shape}") # (1, 16, 16, 64)
# 1x1 convolution: channel mixing without spatial change
conv_1x1 = tf.keras.layers.Conv2D(32, (1, 1), activation='relu')
out_1x1 = conv_1x1(out_same)
print(f"1x1 conv output: {out_1x1.shape}") # (1, 32, 32, 32)
print(f"\n1x1 conv params: {conv_1x1.count_params()}") # 64*32 + 32 = 2080
Depthwise Separable Convolutions
Depthwise separable convolutions (used in MobileNet) factorize a standard convolution into a depthwise convolution (one filter per input channel) followed by a pointwise 1×1 convolution. This reduces computation and parameters to roughly a fraction $\frac{1}{C_{out}} + \frac{1}{K^2}$ of the standard convolution's cost, making it ideal for mobile and edge deployment.
import tensorflow as tf
import numpy as np
# Compare standard Conv2D vs DepthwiseSeparable
images = tf.random.normal([1, 32, 32, 64])
# Standard Conv2D: 64 input channels → 128 output channels
standard_conv = tf.keras.layers.Conv2D(128, (3, 3), padding='same')
standard_conv(images) # build
standard_params = standard_conv.count_params()
# Depthwise Separable: same input → same output
separable_conv = tf.keras.layers.SeparableConv2D(128, (3, 3), padding='same')
separable_conv(images) # build
separable_params = separable_conv.count_params()
print(f"Standard Conv2D params: {standard_params:,}") # 73,856
print(f"SeparableConv2D params: {separable_params:,}") # 8,896
print(f"Reduction factor: {standard_params / separable_params:.1f}x")
# MobileNet-style block: DepthwiseConv → BN → ReLU → PointwiseConv → BN → ReLU
def mobilenet_block(x, filters, stride=1):
"""MobileNet-style depthwise separable block."""
# Depthwise convolution (spatial filtering)
x = tf.keras.layers.DepthwiseConv2D(
(3, 3), strides=stride, padding='same', use_bias=False
)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
# Pointwise convolution (channel mixing)
x = tf.keras.layers.Conv2D(filters, (1, 1), use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
return x
# Demo
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = mobilenet_block(inputs, filters=128, stride=2)
block_model = tf.keras.Model(inputs, outputs)
print(f"\nMobileNet block output shape: {block_model.output_shape}")
print(f"MobileNet block params: {block_model.count_params():,}")
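The cost-reduction ratio quoted at the start of this subsection can be verified arithmetically (a sketch that ignores bias terms, which is why the numbers differ slightly from the `count_params` comparison above):

```python
# Weight counts per output position, biases ignored
K, C_in, C_out = 3, 64, 128
standard = K * K * C_in * C_out            # one K x K filter per (C_in, C_out) pair
separable = K * K * C_in + C_in * C_out    # depthwise filters + pointwise 1x1

ratio = separable / standard
theory = 1 / C_out + 1 / K**2              # the 1/C_out + 1/K^2 formula
print(f"measured ratio: {ratio:.4f}")      # 0.1189
print(f"theoretical:    {theory:.4f}")     # 0.1189 — exact match without biases
```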
Pooling & Downsampling
Pooling layers reduce spatial dimensions, decrease computation, and provide a degree of translation invariance. The choice between max pooling, average pooling, and stride-2 convolutions affects what information is preserved through the network.
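The translation-invariance claim is easy to see with a toy max-pool in NumPy (a minimal sketch, not the Keras implementation): an activation that shifts by one pixel but stays inside the same pooling window produces an identical pooled output.

```python
import numpy as np

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling on a single-channel map."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4), dtype=np.float32); a[0, 0] = 1.0  # activation at (0, 0)
b = np.zeros((4, 4), dtype=np.float32); b[1, 1] = 1.0  # same activation shifted to (1, 1)
print(np.array_equal(max_pool2x2(a), max_pool2x2(b)))  # True — the shift vanishes after pooling
```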
import tensorflow as tf
import numpy as np
# Create feature maps to pool
feature_maps = tf.random.normal([1, 8, 8, 64])
# MaxPooling2D: keeps strongest activations (most common)
max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)
out_max = max_pool(feature_maps)
print(f"Input: {feature_maps.shape}") # (1, 8, 8, 64)
print(f"MaxPool2D: {out_max.shape}") # (1, 4, 4, 64)
# AveragePooling2D: smooths activations (good for final layers)
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2), strides=2)
out_avg = avg_pool(feature_maps)
print(f"AvgPool2D: {out_avg.shape}") # (1, 4, 4, 64)
# GlobalAveragePooling2D: collapses spatial dims entirely
gap = tf.keras.layers.GlobalAveragePooling2D()
out_gap = gap(feature_maps)
print(f"GlobalAvgPool: {out_gap.shape}") # (1, 64) — one value per channel
# GlobalMaxPooling2D: takes max per channel
gmp = tf.keras.layers.GlobalMaxPooling2D()
out_gmp = gmp(feature_maps)
print(f"GlobalMaxPool: {out_gmp.shape}") # (1, 64)
When to Use Each Pooling Type
Modern architectures increasingly favor stride-2 convolutions over explicit pooling layers, as they allow the network to learn the downsampling operation rather than using a fixed function. However, GlobalAveragePooling2D remains the standard way to transition from convolutional features to the classification head.
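The two downsampling options can be compared with the output-size formula from earlier: a stride-2 3×3 convolution (padding 1) and a 2×2 max-pool both halve a 32-wide input, but only the convolution adds learnable weights. A quick check:

```python
def conv_out(W, K, P, S):
    """Output width: O = floor((W - K + 2P) / S) + 1"""
    return (W - K + 2 * P) // S + 1

print(conv_out(32, 3, 1, 2))  # 16 — stride-2 3x3 conv, 'same'-style padding
print(conv_out(32, 2, 0, 2))  # 16 — 2x2 max-pool, stride 2 (K=2, P=0)

# The conv learns its downsampling; pooling has zero parameters
print(3 * 3 * 64 * 64)        # 36864 weights for a 64->64 channel stride-2 conv
```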
Building a CNN from Scratch
The classic CNN pattern stacks Conv → BatchNorm → ReLU → Pool blocks, progressively increasing filters while reducing spatial dimensions. After the convolutional backbone extracts features, a classification head (GlobalAveragePooling + Dense) produces predictions. Let's build one for CIFAR-10 (32×32 RGB images, 10 classes).
flowchart LR
    A[Input<br/>32×32×3] --> B[Conv 32<br/>3×3]
    B --> C[BN + ReLU]
    C --> D[Conv 32<br/>3×3]
    D --> E[BN + ReLU<br/>+ MaxPool]
    E --> F[Conv 64<br/>3×3]
    F --> G[BN + ReLU]
    G --> H[Conv 64<br/>3×3]
    H --> I[BN + ReLU<br/>+ MaxPool]
    I --> J[Conv 128<br/>3×3]
    J --> K[BN + ReLU]
    K --> L[GAP]
    L --> M[Dense 10<br/>Softmax]
import tensorflow as tf
import numpy as np
# Load CIFAR-10
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype(np.float32) / 255.0
X_test = X_test.astype(np.float32) / 255.0
y_train = y_train.flatten()
y_test = y_test.flatten()
print(f"Training: {X_train.shape}, Labels: {y_train.shape}") # (50000, 32, 32, 3)
print(f"Test: {X_test.shape}, Labels: {y_test.shape}") # (10000, 32, 32, 3)
print(f"Classes: {np.unique(y_train)}") # [0..9]
# Build CNN from scratch: Conv-BN-ReLU-Pool pattern
def build_cnn(input_shape=(32, 32, 3), num_classes=10):
inputs = tf.keras.Input(shape=input_shape)
# Block 1: 32 filters
x = tf.keras.layers.Conv2D(32, (3, 3), padding='same', use_bias=False)(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Conv2D(32, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Dropout(0.25)(x)
# Block 2: 64 filters
x = tf.keras.layers.Conv2D(64, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Conv2D(64, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Dropout(0.25)(x)
# Block 3: 128 filters
x = tf.keras.layers.Conv2D(128, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
# Classification head
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
return tf.keras.Model(inputs, outputs, name='cifar10_cnn')
model = build_cnn()
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
model.summary()
print(f"\nTotal parameters: {model.count_params():,}")
Training to 85%+ Accuracy
With batch normalization, dropout, and learning-rate callbacks, the architecture above can reach 85%+ test accuracy on CIFAR-10. The script below is self-contained: it rebuilds the model, trains with ReduceLROnPlateau and EarlyStopping, and evaluates on the held-out test set:
import tensorflow as tf
import numpy as np
# Load and preprocess CIFAR-10
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype(np.float32) / 255.0
X_test = X_test.astype(np.float32) / 255.0
y_train, y_test = y_train.flatten(), y_test.flatten()
# Build model (same architecture as above)
inputs = tf.keras.Input(shape=(32, 32, 3))
x = inputs
for filters in [32, 32]:
x = tf.keras.layers.Conv2D(filters, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Dropout(0.25)(x)
for filters in [64, 64]:
x = tf.keras.layers.Conv2D(filters, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Dropout(0.25)(x)
x = tf.keras.layers.Conv2D(128, (3, 3), padding='same', use_bias=False)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-3),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# Train with callbacks for best performance
callbacks = [
tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5),
tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
]
history = model.fit(
X_train, y_train,
epochs=50,
batch_size=128,
validation_split=0.1,
callbacks=callbacks,
verbose=1
)
# Evaluate on test set
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_acc:.4f}")
print(f"Test loss: {test_loss:.4f}")
print(f"Best val accuracy: {max(history.history['val_accuracy']):.4f}")
Data Augmentation for Vision
Data augmentation artificially expands your training set by applying random transformations — flips, rotations, zooms, contrast changes — during training. In TensorFlow, augmentation layers can be embedded directly into the model (GPU-accelerated) or applied as a preprocessing step in the data pipeline.
import tensorflow as tf
import numpy as np
# Keras augmentation layers — embedded in the model
data_augmentation = tf.keras.Sequential([
tf.keras.layers.RandomFlip("horizontal"), # 50% chance horizontal flip
tf.keras.layers.RandomRotation(0.1), # ±10% of full rotation (±36°)
tf.keras.layers.RandomZoom(0.1), # ±10% zoom
tf.keras.layers.RandomContrast(0.1), # ±10% contrast
tf.keras.layers.RandomTranslation(0.1, 0.1), # ±10% shift
], name='data_augmentation')
# Demo: augment a single image
(X_train, _), _ = tf.keras.datasets.cifar10.load_data()
sample_image = X_train[0:1].astype(np.float32) / 255.0 # (1, 32, 32, 3)
# Apply augmentation 5 times to see variety
print("Original shape:", sample_image.shape)
for i in range(5):
augmented = data_augmentation(sample_image, training=True)
# Check that values are valid (no NaN, reasonable range)
print(f" Augmented {i+1}: min={augmented.numpy().min():.3f}, "
f"max={augmented.numpy().max():.3f}, "
f"shape={augmented.shape}")
# Augmentation is ONLY active during training (training=True)
inference_output = data_augmentation(sample_image, training=False)
print(f"\nInference (no augmentation): identical={np.allclose(sample_image, inference_output.numpy())}")
Augmentation as Part of Model vs Dataset
Augmentation can live in two places: inside the model as Keras preprocessing layers (runs on the accelerator and is active only when training=True), or in the tf.data pipeline via tf.image ops (runs on CPU, asynchronously with GPU training). The self-contained example below demonstrates both approaches:
import tensorflow as tf
import numpy as np
# Approach 1: Augmentation inside the model (GPU-accelerated)
def build_model_with_augmentation(input_shape=(32, 32, 3), num_classes=10):
inputs = tf.keras.Input(shape=input_shape)
# Augmentation layers (only active during training)
x = tf.keras.layers.RandomFlip("horizontal")(inputs)
x = tf.keras.layers.RandomRotation(0.05)(x)
x = tf.keras.layers.RandomZoom(0.1)(x)
# CNN backbone
x = tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
return tf.keras.Model(inputs, outputs)
model_aug = build_model_with_augmentation()
model_aug.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print("Model with augmentation layers:")
print(f" Total params: {model_aug.count_params():,}")
# Approach 2: Augmentation in tf.data pipeline (CPU, async with GPU training)
def augment_dataset(image, label):
image = tf.image.random_flip_left_right(image)
image = tf.image.random_brightness(image, max_delta=0.1)
image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
return image, label
# Create pipeline with augmentation
(X_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype(np.float32) / 255.0
y_train = y_train.flatten()
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(10000).map(augment_dataset, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.batch(64).prefetch(tf.data.AUTOTUNE)
# Verify pipeline
for images, labels in train_ds.take(1):
print(f"\nDataset pipeline batch: images={images.shape}, labels={labels.shape}")
Transfer Learning
Transfer learning uses a model pretrained on a large dataset (typically ImageNet with 1.2M images, 1000 classes) as a starting point for your task. Instead of learning features from scratch, you leverage features already learned by state-of-the-art architectures. This is especially powerful when you have limited data (hundreds to thousands of images).
flowchart TD
    A[Pretrained Model<br/>ImageNet weights] --> B{Strategy?}
    B -->|Feature Extraction| C[Freeze all base layers]
    C --> D[Add custom head]
    D --> E[Train head only<br/>High LR: 1e-3]
    B -->|Fine-Tuning| F[Freeze base initially]
    F --> G[Train head first]
    G --> H[Unfreeze top N layers]
    H --> I[Train end-to-end<br/>Low LR: 1e-5]
    E --> J[Evaluate]
    I --> J
import tensorflow as tf
import numpy as np
# Transfer Learning: Feature Extraction with MobileNetV2
# Load pretrained backbone WITHOUT top classification layer
base_model = tf.keras.applications.MobileNetV2(
input_shape=(96, 96, 3),
include_top=False, # Remove ImageNet classification head
weights='imagenet' # Load pretrained weights
)
# Freeze the base model — don't update pretrained weights
base_model.trainable = False
print(f"MobileNetV2 base layers: {len(base_model.layers)}")
print(f"MobileNetV2 base params: {base_model.count_params():,}")
print(f"Trainable params (frozen): {sum(tf.keras.backend.count_params(w) for w in base_model.trainable_weights):,}")
# Build complete model with custom head
inputs = tf.keras.Input(shape=(96, 96, 3))
# Preprocessing: MobileNetV2 expects pixels in [-1, 1]
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
# Feature extraction (frozen)
x = base_model(x, training=False) # training=False keeps BN in inference mode
# Custom classification head
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(5, activation='softmax')(x) # 5 classes
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# Only the head is trainable
print(f"\nTotal params: {model.count_params():,}")
print(f"Trainable params: {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")
print(f"Non-trainable: {sum(tf.keras.backend.count_params(w) for w in model.non_trainable_weights):,}")
Choosing a Pretrained Backbone
• MobileNetV2 — fast, small, good for mobile/edge (3.4M params)
• EfficientNetB0 — excellent accuracy/efficiency trade-off (5.3M params)
• ResNet50 — well-understood, good baseline (25.6M params)
• EfficientNetB4+ — maximum accuracy, slower (19M+ params)
import tensorflow as tf
# Compare available pretrained models
backbones = {
'MobileNetV2': tf.keras.applications.MobileNetV2,
'EfficientNetB0': tf.keras.applications.EfficientNetB0,
'ResNet50': tf.keras.applications.ResNet50,
'DenseNet121': tf.keras.applications.DenseNet121,
}
print(f"{'Model':<18} {'Params':>12} {'Output Shape':>20}")
print("-" * 55)
for name, model_fn in backbones.items():
model = model_fn(input_shape=(224, 224, 3), include_top=False, weights=None)
output_shape = model.output_shape[1:]
params = model.count_params()
print(f"{name:<18} {params:>12,} {str(output_shape):>20}")
del model
# Each backbone has its own preprocessing function
print("\nPreprocessing functions:")
print(" MobileNetV2: pixels → [-1, 1]")
print(" EfficientNetB0: pixels → [0, 255] (no rescaling needed)")
print(" ResNet50: pixels → caffe-style (BGR, mean-subtracted)")
print(" DenseNet121: pixels → [0, 1], then imagenet normalization")
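For MobileNetV2 the mapping is simple enough to check by hand: preprocess_input scales pixels with x / 127.5 − 1. A NumPy sketch of that arithmetic (reproducing the transform rather than calling TensorFlow):

```python
import numpy as np

pixels = np.array([0.0, 127.5, 255.0], dtype=np.float32)
scaled = pixels / 127.5 - 1.0   # same arithmetic as mobilenet_v2.preprocess_input
print(scaled)  # [-1.  0.  1.]
```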
Fine-Tuning Strategy
Fine-tuning goes beyond feature extraction by unfreezing some pretrained layers and training them with a very low learning rate. The standard approach is two-phase training: first train only the head (fast convergence), then unfreeze top layers and fine-tune end-to-end (incremental improvement).
import tensorflow as tf
import numpy as np
# Phase 1: Feature extraction (head only)
base_model = tf.keras.applications.MobileNetV2(
input_shape=(96, 96, 3), include_top=False, weights='imagenet'
)
base_model.trainable = False
inputs = tf.keras.Input(shape=(96, 96, 3))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
# Phase 1: Train head with high learning rate
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-3),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# Simulate training with random data
X_dummy = np.random.rand(200, 96, 96, 3).astype(np.float32) * 255
y_dummy = np.random.randint(0, 10, 200)
model.fit(X_dummy, y_dummy, epochs=5, batch_size=32, verbose=0)
print("Phase 1 complete: Head trained")
print(f" Trainable params: {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")
# Phase 2: Unfreeze top layers for fine-tuning
base_model.trainable = True
# Freeze all layers except the last 30 (fine-tune top layers only)
fine_tune_from = len(base_model.layers) - 30
for layer in base_model.layers[:fine_tune_from]:
layer.trainable = False
# Recompile with MUCH lower learning rate
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-5), # 100x lower than Phase 1
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
print(f"\nPhase 2: Fine-tuning from layer {fine_tune_from}/{len(base_model.layers)}")
print(f" Trainable params: {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")
# Continue training with lower LR
model.fit(X_dummy, y_dummy, epochs=5, batch_size=32, verbose=0)
print("Phase 2 complete: Fine-tuned")
BatchNorm in Inference Mode
Always call the base model with training=False. This keeps BatchNormalization layers in inference mode, so they use their accumulated moving averages. With training=True, the BN layers would update their statistics on your small dataset, destroying the pretrained representations.
import tensorflow as tf
import numpy as np
# CORRECT: BatchNorm in inference mode during fine-tuning
base_model = tf.keras.applications.MobileNetV2(
input_shape=(96, 96, 3), include_top=False, weights='imagenet'
)
inputs = tf.keras.Input(shape=(96, 96, 3))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
# training=False is CRITICAL — keeps BN statistics frozen
x = base_model(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
# Even after unfreezing, BN stays in inference mode because training=False
base_model.trainable = True
for layer in base_model.layers[:-30]:
layer.trainable = False
# Verify BN behavior
bn_layers = [l for l in base_model.layers if isinstance(l, tf.keras.layers.BatchNormalization)]
trainable_bn = [l for l in bn_layers if l.trainable]
print(f"Total BN layers: {len(bn_layers)}")
print(f"Trainable BN layers: {len(trainable_bn)}")
print(f"\nNote: Even trainable BN layers use inference mode (moving avg)")
print(f"because base_model is called with training=False")
Image Classification Pipeline
Here's a complete end-to-end image classification pipeline combining all techniques: loading images from directories, augmentation, pretrained backbone with fine-tuning, training, and evaluation. This pattern works for any custom image dataset organized in class-folder structure.
import tensorflow as tf
import numpy as np
import os
# Complete image classification pipeline
# Dataset structure expected:
# dataset/
# train/
# class_a/ (images...)
# class_b/ (images...)
# validation/
# class_a/
# class_b/
# Configuration
IMG_SIZE = (160, 160)
BATCH_SIZE = 32
NUM_CLASSES = 5
EPOCHS_PHASE1 = 10
EPOCHS_PHASE2 = 10
# Step 1: Load dataset from directory
# (Using simulated data for reproducibility — replace with your directory)
# train_ds = tf.keras.utils.image_dataset_from_directory(
# 'dataset/train',
# image_size=IMG_SIZE,
# batch_size=BATCH_SIZE,
# label_mode='int'
# )
# Simulate dataset
X_train = np.random.rand(500, 160, 160, 3).astype(np.float32) * 255
y_train = np.random.randint(0, NUM_CLASSES, 500)
X_val = np.random.rand(100, 160, 160, 3).astype(np.float32) * 255
y_val = np.random.randint(0, NUM_CLASSES, 100)
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(500).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_ds = val_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
# Step 2: Data augmentation layer
data_augmentation = tf.keras.Sequential([
tf.keras.layers.RandomFlip("horizontal"),
tf.keras.layers.RandomRotation(0.1),
tf.keras.layers.RandomZoom(0.1),
])
# Step 3: Build model with pretrained backbone
base_model = tf.keras.applications.EfficientNetB0(
input_shape=(*IMG_SIZE, 3), include_top=False, weights='imagenet'
)
base_model.trainable = False # Freeze for Phase 1
inputs = tf.keras.Input(shape=(*IMG_SIZE, 3))
x = data_augmentation(inputs) # Augmentation
x = tf.keras.applications.efficientnet.preprocess_input(x) # Preprocessing
x = base_model(x, training=False) # Feature extraction
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.3)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
# Step 4: Phase 1 — Train head
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-3),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
print("Phase 1: Training classification head...")
history1 = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS_PHASE1, verbose=1)
# Step 5: Phase 2 — Fine-tune top layers
base_model.trainable = True
for layer in base_model.layers[:-20]:
layer.trainable = False
model.compile(
optimizer=tf.keras.optimizers.Adam(1e-5),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
print("\nPhase 2: Fine-tuning top layers...")
history2 = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS_PHASE2, verbose=1)
# Step 6: Evaluate
val_loss, val_acc = model.evaluate(val_ds, verbose=0)
print(f"\nFinal validation accuracy: {val_acc:.4f}")
print(f"Final validation loss: {val_loss:.4f}")
Evaluation & Metrics
Overall accuracy can hide per-class failures, so it pays to compute per-class precision, recall, and F1. The self-contained utility below collects predictions over a tf.data test set and prints a classification report:
import tensorflow as tf
import numpy as np
# Evaluation utilities for classification models
def evaluate_classifier(model, test_ds, class_names):
"""Compute per-class precision, recall, F1."""
all_preds = []
all_labels = []
for images, labels in test_ds:
preds = model.predict(images, verbose=0)
all_preds.extend(np.argmax(preds, axis=1))
all_labels.extend(labels.numpy())
all_preds = np.array(all_preds)
all_labels = np.array(all_labels)
num_classes = len(class_names)
print(f"{'Class':<15} {'Precision':>10} {'Recall':>10} {'F1':>10} {'Support':>10}")
print("-" * 58)
for i in range(num_classes):
tp = np.sum((all_preds == i) & (all_labels == i))
fp = np.sum((all_preds == i) & (all_labels != i))
fn = np.sum((all_preds != i) & (all_labels == i))
precision = tp / (tp + fp + 1e-7)
recall = tp / (tp + fn + 1e-7)
f1 = 2 * precision * recall / (precision + recall + 1e-7)
support = np.sum(all_labels == i)
print(f"{class_names[i]:<15} {precision:>10.3f} {recall:>10.3f} {f1:>10.3f} {support:>10}")
accuracy = np.mean(all_preds == all_labels)
print(f"\n{'Overall Accuracy:':<15} {accuracy:.4f}")
return accuracy
# Demo with a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
X_test = np.random.rand(200, 32, 32, 3).astype(np.float32)
y_test = np.random.randint(0, 5, 200)
model.fit(X_test, y_test, epochs=5, verbose=0)
test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(32)
class_names = ['Cat', 'Dog', 'Bird', 'Fish', 'Horse']
evaluate_classifier(model, test_ds, class_names)
Grad-CAM Visualization
Gradient-weighted Class Activation Mapping (Grad-CAM) reveals which regions of an image are most important for a CNN's prediction. It computes the gradient of the predicted class score with respect to the final convolutional layer's feature maps, then produces a heatmap highlighting the discriminative regions.
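In symbols, with $A^k$ the $k$-th feature map of the last convolutional layer and $y^c$ the score for class $c$:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$$

where $Z$ is the number of spatial positions. The globally averaged gradients $\alpha_k^c$ act as importance weights for the feature maps, and the ReLU keeps only regions that push the class score up.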
Grad-CAM Implementation
The implementation builds a sub-model that exposes both the last convolutional layer's feature maps and the final predictions, then uses tf.GradientTape to compute the class-score gradients. The example below is self-contained:
import tensorflow as tf
import numpy as np
def make_gradcam_heatmap(img_array, model, last_conv_layer_name, pred_index=None):
"""
Generate Grad-CAM heatmap for a given image and model.
Args:
img_array: Preprocessed image tensor (1, H, W, C)
model: Trained Keras model
last_conv_layer_name: Name of the last convolutional layer
pred_index: Class index to visualize (None = top prediction)
Returns:
heatmap: Normalized heatmap array (H', W') in [0, 1]
"""
# Build sub-model: input → last conv layer output + final predictions
grad_model = tf.keras.Model(
inputs=model.input,
outputs=[
model.get_layer(last_conv_layer_name).output,
model.output
]
)
# Forward pass + gradient computation
with tf.GradientTape() as tape:
conv_outputs, predictions = grad_model(img_array)
if pred_index is None:
pred_index = tf.argmax(predictions[0])
class_score = predictions[:, pred_index]
# Gradient of class score w.r.t. conv layer output
grads = tape.gradient(class_score, conv_outputs)
# Global average pooling of gradients → importance weights
pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
# Weighted combination of feature maps
conv_outputs = conv_outputs[0]
heatmap = conv_outputs @ pooled_grads[..., tf.newaxis]
heatmap = tf.squeeze(heatmap)
# ReLU + normalize to [0, 1]
heatmap = tf.maximum(heatmap, 0) / (tf.reduce_max(heatmap) + 1e-8)
return heatmap.numpy()
# Demo: Build a simple model and generate Grad-CAM
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same', name='conv1'),
tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same', name='conv2'),
tf.keras.layers.MaxPooling2D((2, 2)),
tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same', name='last_conv'),
tf.keras.layers.GlobalAveragePooling2D(),
tf.keras.layers.Dense(10, activation='softmax')
])
# Train briefly on CIFAR-10 for meaningful gradients
(X_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
X_train = X_train[:1000].astype(np.float32) / 255.0
y_train = y_train[:1000].flatten()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, batch_size=64, verbose=0)
# Generate Grad-CAM for a test image
test_image = X_train[0:1] # (1, 32, 32, 3)
heatmap = make_gradcam_heatmap(test_image, model, 'last_conv')
print(f"Heatmap shape: {heatmap.shape}")
print(f"Heatmap range: [{heatmap.min():.3f}, {heatmap.max():.3f}]")
print(f"Predicted class: {np.argmax(model.predict(test_image, verbose=0))}")
# In practice, overlay heatmap on original image:
# 1. Resize heatmap to image size
# 2. Apply colormap (e.g., jet)
# 3. Blend with original: superimposed = alpha * colormap + (1-alpha) * image
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
def overlay_gradcam(image, heatmap, alpha=0.4):
    """
    Overlay Grad-CAM heatmap on original image.
    Args:
        image: Original image array (H, W, 3) in [0, 1]
        heatmap: Grad-CAM heatmap (H', W') in [0, 1]
        alpha: Blending factor
    Returns:
        superimposed: Blended image (H, W, 3) in [0, 1]
    """
    # Resize heatmap to image dimensions
    heatmap_resized = tf.image.resize(
        heatmap[..., tf.newaxis], (image.shape[0], image.shape[1])
    ).numpy().squeeze()
    # Apply colormap (jet-like: blue→green→red)
    heatmap_colored = plt.cm.jet(heatmap_resized)[:, :, :3]  # RGB only
    # Blend heatmap with original image
    superimposed = alpha * heatmap_colored + (1 - alpha) * image
    superimposed = np.clip(superimposed, 0, 1)
    return superimposed
# Demo visualization
(X_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
X_sample = X_train[42:43].astype(np.float32) / 255.0
cifar_classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
# Build and train model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same', name='conv1'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same', name='last_conv'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train[:5000].astype(np.float32) / 255.0, y_train[:5000], epochs=5, batch_size=128, verbose=0)
# Generate and display
heatmap = make_gradcam_heatmap(X_sample, model, 'last_conv')
superimposed = overlay_gradcam(X_sample[0], heatmap)
pred_class = np.argmax(model.predict(X_sample, verbose=0))
true_class = y_train[42][0]
print(f"Predicted: {cifar_classes[pred_class]}")
print(f"True: {cifar_classes[true_class]}")
print(f"Heatmap shape: {heatmap.shape}, Overlay shape: {superimposed.shape}")
plt.imshow(superimposed)
plt.title(f"Grad-CAM: {cifar_classes[pred_class]}")
plt.axis('off')
plt.show()
Beyond Classification
CNNs power far more than image classification. Object detection localizes and classifies multiple objects within an image. Semantic segmentation assigns a class label to every pixel. These tasks build on the same convolutional backbone features but use different head architectures.
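The "same backbone, different heads" idea can be sketched directly in Keras. This is a toy illustration only, not a production detector or segmenter; the layer sizes and both head designs are made up for the example:

```python
import tensorflow as tf

# One convolutional backbone feeding two task-specific heads
inputs = tf.keras.Input(shape=(64, 64, 3))
x = tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same')(inputs)
x = tf.keras.layers.MaxPooling2D(2)(x)
backbone_features = tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same')(x)

# Head 1: image-level classification (pool spatial dims away)
pooled = tf.keras.layers.GlobalAveragePooling2D()(backbone_features)
class_out = tf.keras.layers.Dense(10, activation='softmax', name='class_head')(pooled)

# Head 2: per-pixel segmentation (upsample back to input resolution)
seg = tf.keras.layers.UpSampling2D(2)(backbone_features)
seg_out = tf.keras.layers.Conv2D(4, 1, activation='softmax', name='seg_head')(seg)

model = tf.keras.Model(inputs, [class_out, seg_out])
shapes = [tuple(o.shape) for o in model.outputs]
print(shapes)  # class head: (None, 10); seg head: (None, 64, 64, 4)
```

Both heads read the same `backbone_features` tensor; only the layers after it differ per task.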
Object Detection Overview
Object detection models predict bounding boxes and class labels for multiple objects. Key concepts include anchor boxes (predefined box shapes at each spatial location), Non-Maximum Suppression (NMS) for removing duplicate detections, and IoU (Intersection over Union) for measuring box overlap.
import tensorflow as tf
import numpy as np
def compute_iou(box1, box2):
    """
    Compute Intersection over Union between two boxes.
    Boxes are in [y_min, x_min, y_max, x_max] format.
    """
    # Intersection rectangle
    y_min = max(box1[0], box2[0])
    x_min = max(box1[1], box2[1])
    y_max = min(box1[2], box2[2])
    x_max = min(box1[3], box2[3])
    intersection = max(0, y_max - y_min) * max(0, x_max - x_min)
    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / (union + 1e-7)
def non_max_suppression(boxes, scores, iou_threshold=0.5, max_detections=100):
    """
    Apply NMS to remove overlapping detections.
    Args:
        boxes: Array of shape (N, 4) in [y_min, x_min, y_max, x_max] format
        scores: Array of shape (N,) with confidence scores
        iou_threshold: Suppress boxes with IoU > threshold
        max_detections: Maximum number of boxes to keep
    Returns:
        indices: Indices of kept boxes
    """
    # TensorFlow's built-in NMS
    selected_indices = tf.image.non_max_suppression(
        boxes=boxes,
        scores=scores,
        max_output_size=max_detections,
        iou_threshold=iou_threshold
    )
    return selected_indices.numpy()
# Demo: NMS on overlapping detections
boxes = np.array([
    [10, 10, 50, 50],      # Box A
    [12, 12, 52, 52],      # Box B (overlaps heavily with A)
    [100, 100, 150, 150],  # Box C (separate)
    [102, 102, 148, 148],  # Box D (overlaps with C)
], dtype=np.float32)
scores = np.array([0.9, 0.75, 0.85, 0.7], dtype=np.float32)
# Compute IoU between overlapping boxes
iou_ab = compute_iou(boxes[0], boxes[1])
iou_cd = compute_iou(boxes[2], boxes[3])
print(f"IoU(A, B) = {iou_ab:.3f}") # High overlap
print(f"IoU(C, D) = {iou_cd:.3f}") # High overlap
# Apply NMS
kept = non_max_suppression(boxes, scores, iou_threshold=0.5)
print(f"\nNMS kept indices: {kept}")
print(f"Kept boxes: {len(kept)} out of {len(boxes)}")
for idx in kept:
    print(f"  Box {idx}: score={scores[idx]:.2f}, coords={boxes[idx]}")
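The anchor-box concept mentioned above can also be sketched in a few lines. The `generate_anchors` helper below is hypothetical (real detectors such as SSD or Faster R-CNN tune scales and aspect ratios per dataset); it simply tiles boxes of several shapes over a feature-map grid:

```python
import numpy as np

def generate_anchors(feature_size, image_size, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Tile anchor boxes over a feature-map grid.

    Returns an array of shape (feature_size² * len(scales) * len(ratios), 4)
    in [y_min, x_min, y_max, x_max] image coordinates.
    """
    stride = image_size / feature_size  # pixels per feature-map cell
    anchors = []
    for i in range(feature_size):
        for j in range(feature_size):
            # Center of this grid cell in image coordinates
            cy, cx = (i + 0.5) * stride, (j + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep area ≈ scale², vary the aspect ratio
                    h = scale * np.sqrt(ratio)
                    w = scale / np.sqrt(ratio)
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors, dtype=np.float32)

anchors = generate_anchors(feature_size=8, image_size=128)
print(f"Anchors shape: {anchors.shape}")  # (8*8*2*3, 4) = (384, 4)
```

At training time, each anchor is matched to ground-truth boxes by IoU, and the network only regresses small offsets from matched anchors rather than predicting boxes from scratch.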
Segmentation (U-Net in TensorFlow)
U-Net is a widely used architecture for semantic segmentation, originally developed for biomedical images. It uses an encoder-decoder structure with skip connections that concatenate encoder features onto the decoder's upsampled outputs, preserving the spatial detail needed for pixel-precise predictions.
import tensorflow as tf
import numpy as np
def build_unet(input_shape=(128, 128, 3), num_classes=2):
    """
    Minimal U-Net for semantic segmentation.
    Encoder: Conv blocks with MaxPooling (downsampling)
    Decoder: Conv blocks with UpSampling + skip connections
    """
    inputs = tf.keras.Input(shape=input_shape)
    # --- Encoder ---
    # Block 1
    c1 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
    c1 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(c1)
    p1 = tf.keras.layers.MaxPooling2D((2, 2))(c1)  # 64x64
    # Block 2
    c2 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(p1)
    c2 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c2)
    p2 = tf.keras.layers.MaxPooling2D((2, 2))(c2)  # 32x32
    # Block 3 (bottleneck)
    c3 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(p2)
    c3 = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(c3)
    # --- Decoder ---
    # Block 4 (upsample + skip from c2)
    u4 = tf.keras.layers.UpSampling2D((2, 2))(c3)  # 64x64
    u4 = tf.keras.layers.Concatenate()([u4, c2])   # Skip connection
    c4 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(u4)
    c4 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c4)
    # Block 5 (upsample + skip from c1)
    u5 = tf.keras.layers.UpSampling2D((2, 2))(c4)  # 128x128
    u5 = tf.keras.layers.Concatenate()([u5, c1])   # Skip connection
    c5 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(u5)
    c5 = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(c5)
    # Output: per-pixel classification
    outputs = tf.keras.layers.Conv2D(num_classes, (1, 1), activation='softmax')(c5)
    return tf.keras.Model(inputs, outputs, name='unet')
# Build and inspect U-Net
unet = build_unet(input_shape=(128, 128, 3), num_classes=4)
unet.summary()
# Verify input/output shapes match (full resolution)
test_input = np.random.rand(2, 128, 128, 3).astype(np.float32)
test_output = unet.predict(test_input, verbose=0)
print(f"\nInput shape: {test_input.shape}") # (2, 128, 128, 3)
print(f"Output shape: {test_output.shape}") # (2, 128, 128, 4) — per-pixel class probs
print(f"Output sums to 1 per pixel: {np.allclose(test_output.sum(axis=-1), 1.0)}")
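Training the U-Net is ordinary classification applied at every pixel: integer masks as targets and sparse categorical crossentropy as the loss. Here is a minimal, self-contained sketch with a scaled-down two-level U-Net and synthetic data (the small shapes and random labels are purely illustrative):

```python
import tensorflow as tf
import numpy as np

# Tiny 2-level U-Net: one downsampling step, one upsampling step, one skip
inputs = tf.keras.Input(shape=(64, 64, 3))
c1 = tf.keras.layers.Conv2D(16, 3, activation='relu', padding='same')(inputs)
p1 = tf.keras.layers.MaxPooling2D(2)(c1)                     # 32x32
c2 = tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same')(p1)
u3 = tf.keras.layers.UpSampling2D(2)(c2)                     # back to 64x64
u3 = tf.keras.layers.Concatenate()([u3, c1])                 # skip connection
c3 = tf.keras.layers.Conv2D(16, 3, activation='relu', padding='same')(u3)
outputs = tf.keras.layers.Conv2D(4, 1, activation='softmax')(c3)
mini_unet = tf.keras.Model(inputs, outputs)

# Per-pixel classification loss: targets are integer masks of shape (H, W)
mini_unet.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Synthetic stand-in data: random images, random per-pixel labels in [0, 4)
X = np.random.rand(8, 64, 64, 3).astype(np.float32)
y = np.random.randint(0, 4, size=(8, 64, 64))
mini_unet.fit(X, y, epochs=1, batch_size=4, verbose=0)

pred = mini_unet.predict(X[:1], verbose=0)
print(f"Predicted mask shape: {pred.argmax(axis=-1).shape}")  # (1, 64, 64)
```

The same `compile`/`fit` calls work unchanged with the full `build_unet` model above, given real image/mask pairs.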
CNNs as Feature Extractors
For feature extraction as a downstream task, you can remove the classification head from any trained CNN and use the feature vectors directly. This is common for image retrieval, clustering, and as input to other models.
import tensorflow as tf
import numpy as np
# Using a CNN as a feature extractor
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    weights='imagenet', pooling='avg'  # GlobalAveragePooling built-in
)
# Extract features from images
images = np.random.rand(10, 224, 224, 3).astype(np.float32) * 255
images_preprocessed = tf.keras.applications.mobilenet_v2.preprocess_input(images)
features = base.predict(images_preprocessed, verbose=0)
print(f"Feature vectors shape: {features.shape}") # (10, 1280)
print(f"Feature vector dim: {features.shape[1]}")
# Cosine similarity between feature vectors
from numpy.linalg import norm
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))
sim_01 = cosine_similarity(features[0], features[1])
sim_09 = cosine_similarity(features[0], features[9])
print(f"\nCosine similarity (img 0 vs 1): {sim_01:.4f}")
print(f"Cosine similarity (img 0 vs 9): {sim_09:.4f}")
print("(Similar images → higher cosine similarity)")
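Retrieval on top of these embeddings reduces to nearest-neighbor search. A minimal sketch, with random vectors standing in for real CNN features:

```python
import numpy as np

# Random vectors stand in for CNN embeddings (e.g. the 1280-dim MobileNetV2
# features above); with real features, nearest neighbors are similar images.
rng = np.random.default_rng(0)
features = rng.normal(size=(10, 1280)).astype(np.float32)

# L2-normalize so plain dot products equal cosine similarities
features /= np.linalg.norm(features, axis=1, keepdims=True)

query = features[0]
similarities = features @ query        # (10,) cosine similarities to the query
ranking = np.argsort(-similarities)    # best match first

print(f"Top-3 matches for image 0: {ranking[:3]}")
# The query always ranks itself first (self-similarity = 1.0)
```

For large galleries, the same dot-product search is typically delegated to an approximate nearest-neighbor index rather than a full matrix product.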
Next in the Series
In Part 7: RNNs, NLP & Time Series, we'll tackle sequential data with recurrent architectures — LSTMs, GRUs, bidirectional layers, text classification, sequence-to-sequence models, and time series forecasting.