The Scaling Problem
Convolutional neural networks have traditionally been scaled by increasing only one dimension at a time: deeper networks (ResNet-152), wider networks (Wide ResNet), or higher resolution inputs. Each approach yields diminishing returns because it ignores the fundamental relationship between network depth, width, and input resolution.
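Why do the single-dimension strategies stall? Convolutional FLOPs grow roughly linearly with depth but quadratically with width and with input resolution, so pushing one dimension quickly becomes expensive relative to the accuracy it buys. A quick back-of-the-envelope sketch (illustrative multipliers only, not measurements of any particular model):
# Approximate FLOPs multipliers when scaling a single dimension.
# Conv FLOPs scale ~linearly with depth and ~quadratically with width and resolution.
for factor in [1.0, 1.5, 2.0]:
    depth_only = factor            # more layers: linear in the scale factor
    width_only = factor ** 2       # more channels into AND out of each conv: quadratic
    resolution_only = factor ** 2  # larger H and W feature maps: quadratic
    print(f"scale x{factor:.1f}: depth-only {depth_only:.2f}x, "
          f"width-only {width_only:.2f}x, resolution-only {resolution_only:.2f}x FLOPs")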
Parameter Count Comparison
Let’s compare the parameter counts and approximate accuracy of popular architectures to understand why EfficientNet’s approach is revolutionary:
import tensorflow as tf
import numpy as np
# Compare parameter counts of popular architectures
models_info = {
"VGG-16": {"params": 138_000_000, "top1_acc": 71.3, "flops_b": 15.5},
"ResNet-50": {"params": 25_600_000, "top1_acc": 76.0, "flops_b": 4.1},
"ResNet-152": {"params": 60_200_000, "top1_acc": 77.8, "flops_b": 11.6},
"DenseNet-201": {"params": 20_000_000, "top1_acc": 77.4, "flops_b": 4.3},
"EfficientNet-B0": {"params": 5_300_000, "top1_acc": 77.1, "flops_b": 0.39},
"EfficientNet-B3": {"params": 12_000_000, "top1_acc": 81.6, "flops_b": 1.8},
"EfficientNet-B7": {"params": 66_000_000, "top1_acc": 84.3, "flops_b": 37.0},
}
print(f"{'Model':<18} {'Params (M)':<12} {'Top-1 (%)':<10} {'FLOPs (B)':<10}")
print("-" * 50)
for name, info in models_info.items():
params_m = info["params"] / 1e6
print(f"{name:<18} {params_m:<12.1f} {info['top1_acc']:<10.1f} {info['flops_b']:<10.2f}")
# Calculate efficiency ratio (accuracy per million parameters)
print("\nEfficiency (Top-1 accuracy per million params):")
print("-" * 40)
for name, info in models_info.items():
efficiency = info["top1_acc"] / (info["params"] / 1e6)
print(f"{name:<18} {efficiency:.2f} %/M params")
EfficientNet-B0 achieves 77.1% ImageNet top-1 accuracy with only 5.3M parameters — that’s 14.5% accuracy per million parameters. ResNet-50 achieves similar accuracy (76.0%) but requires 25.6M parameters (only 2.97%/M). This 4.9× improvement in parameter efficiency is the direct result of compound scaling.
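The 4.9× figure follows directly from the numbers in the table above:
# Verify the parameter-efficiency comparison quoted above
b0_efficiency = 77.1 / 5.3          # ~14.5 % top-1 per million params
resnet50_efficiency = 76.0 / 25.6   # ~2.97 % top-1 per million params
print(f"EfficientNet-B0: {b0_efficiency:.2f} %/M   ResNet-50: {resnet50_efficiency:.2f} %/M")
print(f"Efficiency ratio: {b0_efficiency / resnet50_efficiency:.1f}x")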
Compound Scaling Method
The compound scaling method uses a single compound coefficient $\phi$ to uniformly scale all three dimensions:
- Depth: $d = \alpha^\phi$
- Width: $w = \beta^\phi$
- Resolution: $r = \gamma^\phi$
Subject to the constraint:
$$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$
where $\alpha \geq 1$, $\beta \geq 1$, $\gamma \geq 1$. Because convolutional FLOPs grow roughly linearly with depth and quadratically with width and resolution, scaling by $\phi$ multiplies total FLOPs by about $d \cdot w^2 \cdot r^2 = (\alpha \cdot \beta^2 \cdot \gamma^2)^\phi \approx 2^\phi$, so the constraint guarantees that each increment of $\phi$ roughly doubles the compute budget. The base coefficients found via grid search on the B0 baseline are $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$.
Implementing the Scaling Function
import numpy as np
def compute_scaling_coefficients(phi, alpha=1.2, beta=1.1, gamma=1.15):
"""
Compute depth, width, and resolution scaling factors.
Args:
phi: Compound coefficient (0 for B0, 1 for B1, ..., 7 for B7)
alpha: Depth scaling base (default 1.2)
beta: Width scaling base (default 1.1)
gamma: Resolution scaling base (default 1.15)
Returns:
        Tuple of (depth_coeff, width_coeff, resolution_coeff)
"""
depth_coeff = alpha ** phi
width_coeff = beta ** phi
resolution_coeff = gamma ** phi
    return depth_coeff, width_coeff, resolution_coeff
# Verify the FLOPs constraint once (using the default alpha, beta, gamma)
flops_ratio = 1.2 * (1.1 ** 2) * (1.15 ** 2)
print(f"FLOPs constraint check: alpha * beta^2 * gamma^2 = {flops_ratio:.4f} (target ~2.0)")
# Base resolution for B0 is 224
BASE_RESOLUTION = 224
# Compute configs for B0 through B7
print(f"{'Variant':<12} {'phi':<5} {'Depth':<8} {'Width':<8} {'Resolution':<12}")
print("-" * 50)
for variant in range(8):
phi = variant
d, w, r = compute_scaling_coefficients(phi)
resolution = int(BASE_RESOLUTION * r)
# Round resolution up to the next multiple of 8 for hardware efficiency
resolution = int(np.ceil(resolution / 8) * 8)
print(f"B{variant:<11} {phi:<5} {d:<8.2f} {w:<8.2f} {resolution:<12}")
Note that the official variant configurations (below) do not follow the exponential formulas exactly: the released coefficients are rounded and tuned per variant, so, for example, B7 uses a depth multiplier of 3.1 rather than $1.2^7 \approx 3.58$.
# EfficientNet variant specifications (from the paper)
EFFICIENTNET_CONFIGS = {
"B0": {"phi": 0, "resolution": 224, "depth_mult": 1.0, "width_mult": 1.0, "dropout": 0.2},
"B1": {"phi": 1, "resolution": 240, "depth_mult": 1.1, "width_mult": 1.0, "dropout": 0.2},
"B2": {"phi": 2, "resolution": 260, "depth_mult": 1.2, "width_mult": 1.1, "dropout": 0.3},
"B3": {"phi": 3, "resolution": 300, "depth_mult": 1.4, "width_mult": 1.2, "dropout": 0.3},
"B4": {"phi": 4, "resolution": 380, "depth_mult": 1.8, "width_mult": 1.4, "dropout": 0.4},
"B5": {"phi": 5, "resolution": 456, "depth_mult": 2.2, "width_mult": 1.6, "dropout": 0.4},
"B6": {"phi": 6, "resolution": 528, "depth_mult": 2.6, "width_mult": 1.8, "dropout": 0.5},
"B7": {"phi": 7, "resolution": 600, "depth_mult": 3.1, "width_mult": 2.0, "dropout": 0.5},
}
print(f"{'Variant':<6} {'Resolution':<12} {'Depth Mult':<12} {'Width Mult':<12} {'Dropout':<8}")
print("-" * 52)
for name, cfg in EFFICIENTNET_CONFIGS.items():
print(f"{name:<6} {cfg['resolution']:<12} {cfg['depth_mult']:<12.1f} "
f"{cfg['width_mult']:<12.1f} {cfg['dropout']:<8.1f}")
MBConv: The Building Block
The Mobile Inverted Bottleneck Convolution (MBConv) from MobileNetV2 is the core building block of EfficientNet. Unlike standard residual blocks that use a wide→narrow→wide pattern, MBConv uses narrow→wide→narrow (inverted bottleneck):
flowchart TD
A[Input: C channels] --> B[1x1 Conv: Expand to C*t channels]
B --> C[BatchNorm + Swish]
C --> D[Depthwise Conv kxk]
D --> E[BatchNorm + Swish]
E --> F[Squeeze-and-Excitation]
F --> G[1x1 Conv: Project to C_out channels]
G --> H[BatchNorm]
H --> I{Residual?}
I -->|stride=1 AND C_in=C_out| J[Add Residual + Drop Connect]
I -->|Otherwise| K[Output]
J --> K
A -.->|Skip Connection| J
MBConv Keras Implementation
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
def mbconv_block(inputs, in_channels, out_channels, expand_ratio,
kernel_size, stride, se_ratio=0.25, drop_rate=0.0):
"""
Mobile Inverted Bottleneck Convolution Block.
Args:
inputs: Input tensor
in_channels: Number of input channels
out_channels: Number of output channels
expand_ratio: Expansion ratio for inverted bottleneck
kernel_size: Depthwise convolution kernel size (3 or 5)
stride: Stride for depthwise convolution (1 or 2)
se_ratio: Squeeze-and-excitation reduction ratio
drop_rate: Drop connect rate for stochastic depth
Returns:
Output tensor
"""
expanded_channels = in_channels * expand_ratio
use_residual = (stride == 1 and in_channels == out_channels)
x = inputs
# Phase 1: Expansion (skip if expand_ratio == 1)
if expand_ratio != 1:
x = layers.Conv2D(
expanded_channels, 1, padding="same", use_bias=False,
kernel_initializer="he_normal"
)(x)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
# Phase 2: Depthwise Convolution
x = layers.DepthwiseConv2D(
kernel_size, strides=stride, padding="same", use_bias=False,
depthwise_initializer="he_normal"
)(x)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
# Phase 3: Squeeze-and-Excitation
if se_ratio > 0:
se_channels = max(1, int(in_channels * se_ratio))
se = layers.GlobalAveragePooling2D(keepdims=True)(x)
se = layers.Conv2D(se_channels, 1, activation="swish", use_bias=True)(se)
se = layers.Conv2D(expanded_channels, 1, activation="sigmoid", use_bias=True)(se)
x = layers.Multiply()([x, se])
# Phase 4: Projection (pointwise linear)
x = layers.Conv2D(
out_channels, 1, padding="same", use_bias=False,
kernel_initializer="he_normal"
)(x)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
# Residual connection with drop connect
if use_residual:
if drop_rate > 0:
x = layers.Dropout(drop_rate, noise_shape=(None, 1, 1, 1))(x)
x = layers.Add()([x, inputs])
return x
# Test the MBConv block
test_input = keras.Input(shape=(56, 56, 32))
test_output = mbconv_block(
test_input, in_channels=32, out_channels=16,
expand_ratio=1, kernel_size=3, stride=1, se_ratio=0.25
)
test_model = keras.Model(inputs=test_input, outputs=test_output)
print(f"MBConv1 (32->16, k3, s1): {test_model.count_params():,} parameters")
print(f"Output shape: {test_output.shape}")
Squeeze-and-Excitation Module
The Squeeze-and-Excitation (SE) module introduces channel attention by adaptively re-weighting feature maps. It learns which channels are most important for a given input and scales them accordingly.
flowchart LR
A["Feature Map H x W x C"] --> B["Global Avg Pool 1 x 1 x C"]
B --> C["FC Reduce 1 x 1 x C/r"]
C --> D[ReLU/Swish]
D --> E["FC Expand 1 x 1 x C"]
E --> F[Sigmoid]
F --> G["Scale H x W x C"]
A --> G
SE Block Implementation
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
class SqueezeExcitation(layers.Layer):
"""
Squeeze-and-Excitation block for channel attention.
The SE block adaptively recalibrates channel-wise feature responses
by explicitly modeling interdependencies between channels.
"""
def __init__(self, input_channels, reduction_ratio=0.25, **kwargs):
super().__init__(**kwargs)
self.input_channels = input_channels
self.reduced_channels = max(1, int(input_channels * reduction_ratio))
# Squeeze: Global Average Pooling
self.global_pool = layers.GlobalAveragePooling2D(keepdims=True)
# Excitation: Two FC layers
self.fc_reduce = layers.Conv2D(
self.reduced_channels, 1, use_bias=True, activation="swish"
)
self.fc_expand = layers.Conv2D(
input_channels, 1, use_bias=True, activation="sigmoid"
)
def call(self, inputs):
# Squeeze: H x W x C -> 1 x 1 x C
se = self.global_pool(inputs)
# Excitation: Learn channel importance
se = self.fc_reduce(se) # 1x1xC -> 1x1x(C/r)
se = self.fc_expand(se) # 1x1x(C/r) -> 1x1xC
# Scale: Reweight original features
return inputs * se
def get_config(self):
config = super().get_config()
config.update({
"input_channels": self.input_channels,
"reduced_channels": self.reduced_channels,
})
return config
# Demonstrate SE block behavior
import numpy as np
# Create a test feature map
np.random.seed(42)
test_features = tf.constant(np.random.randn(1, 7, 7, 64).astype(np.float32))
# Apply SE block
se_block = SqueezeExcitation(input_channels=64, reduction_ratio=0.25)
output = se_block(test_features)
print(f"Input shape: {test_features.shape}")
print(f"Output shape: {output.shape}")
print(f"SE reduced channels: {se_block.reduced_channels}")
print(f"Parameters in SE block: {sum(p.numpy().size for p in se_block.trainable_weights)}")
# Show channel scaling weights for one sample
se_weights = se_block.global_pool(test_features)
se_weights = se_block.fc_reduce(se_weights)
se_weights = se_block.fc_expand(se_weights)
print(f"\nChannel attention weights (first 8 channels):")
print(f" {se_weights.numpy()[0, 0, 0, :8].round(3)}")
print(f" Min: {se_weights.numpy().min():.3f}, Max: {se_weights.numpy().max():.3f}")
Building EfficientNet-B0 from Scratch
EfficientNet-B0 is the baseline architecture discovered via Neural Architecture Search (NAS). It consists of a stem convolution, 7 stages of MBConv blocks with varying configurations, and a classification head.
Full EfficientNet-B0 Implementation
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import math
def mbconv_block(inputs, in_ch, out_ch, expand_ratio, kernel_size, stride,
se_ratio=0.25, drop_rate=0.0):
"""Mobile Inverted Bottleneck Conv block with SE attention."""
expanded = in_ch * expand_ratio
use_skip = (stride == 1 and in_ch == out_ch)
x = inputs
# Expansion phase
if expand_ratio != 1:
x = layers.Conv2D(expanded, 1, padding="same", use_bias=False)(x)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
# Depthwise convolution
x = layers.DepthwiseConv2D(kernel_size, strides=stride, padding="same", use_bias=False)(x)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
# Squeeze-and-Excitation
if se_ratio:
se_ch = max(1, int(in_ch * se_ratio))
se = layers.GlobalAveragePooling2D(keepdims=True)(x)
se = layers.Conv2D(se_ch, 1, activation="swish", use_bias=True)(se)
se = layers.Conv2D(expanded, 1, activation="sigmoid", use_bias=True)(se)
x = layers.Multiply()([x, se])
# Projection
x = layers.Conv2D(out_ch, 1, padding="same", use_bias=False)(x)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
# Skip connection
if use_skip:
if drop_rate > 0:
x = layers.Dropout(drop_rate, noise_shape=(None, 1, 1, 1))(x)
x = layers.Add()([x, inputs])
return x
def round_filters(filters, width_mult, divisor=8):
"""Round number of filters based on width multiplier."""
filters = filters * width_mult
new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
if new_filters < 0.9 * filters:
new_filters += divisor
return int(new_filters)
def round_repeats(repeats, depth_mult):
"""Round number of repeats based on depth multiplier."""
return int(math.ceil(repeats * depth_mult))
def build_efficientnet(input_shape=(224, 224, 3), num_classes=1000,
width_mult=1.0, depth_mult=1.0, dropout_rate=0.2):
"""
Build EfficientNet model.
B0 base architecture:
Stage | Operator | Resolution | Channels | Layers | Stride | Kernel | Expand
0 | Conv3x3 | 224->112 | 32 | 1 | 2 | 3 | -
1 | MBConv1 | 112->112 | 16 | 1 | 1 | 3 | 1
2 | MBConv6 | 112->56 | 24 | 2 | 2 | 3 | 6
3 | MBConv6 | 56->28 | 40 | 2 | 2 | 5 | 6
4 | MBConv6 | 28->14 | 80 | 3 | 2 | 3 | 6
5 | MBConv6 | 14->14 | 112 | 3 | 1 | 5 | 6
6 | MBConv6 | 14->7 | 192 | 4 | 2 | 5 | 6
7 | MBConv6 | 7->7 | 320 | 1 | 1 | 3 | 6
8 | Conv1x1 | 7->7 | 1280 | 1 | 1 | 1 | -
"""
# Block configurations: (expand, channels, repeats, stride, kernel)
block_configs = [
(1, 16, 1, 1, 3), # Stage 1
(6, 24, 2, 2, 3), # Stage 2
(6, 40, 2, 2, 5), # Stage 3
(6, 80, 3, 2, 3), # Stage 4
(6, 112, 3, 1, 5), # Stage 5
(6, 192, 4, 2, 5), # Stage 6
(6, 320, 1, 1, 3), # Stage 7
]
inputs = keras.Input(shape=input_shape)
# Stem: Conv 3x3, stride 2
stem_filters = round_filters(32, width_mult)
x = layers.Conv2D(stem_filters, 3, strides=2, padding="same", use_bias=False)(inputs)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
# MBConv stages
total_blocks = sum(round_repeats(cfg[2], depth_mult) for cfg in block_configs)
block_idx = 0
prev_channels = stem_filters
for expand, channels, repeats, stride, kernel in block_configs:
out_channels = round_filters(channels, width_mult)
num_repeats = round_repeats(repeats, depth_mult)
for i in range(num_repeats):
s = stride if i == 0 else 1
# Stochastic depth: linearly increase drop rate
drop_rate = dropout_rate * block_idx / total_blocks
x = mbconv_block(x, prev_channels, out_channels, expand, kernel, s,
se_ratio=0.25, drop_rate=drop_rate)
prev_channels = out_channels
block_idx += 1
# Head: Conv 1x1 + GlobalPool + Dense
head_filters = round_filters(1280, width_mult)
x = layers.Conv2D(head_filters, 1, padding="same", use_bias=False)(x)
x = layers.BatchNormalization(momentum=0.99, epsilon=1e-3)(x)
x = layers.Activation("swish")(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(dropout_rate)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="EfficientNet")
return model
# Build EfficientNet-B0
model_b0 = build_efficientnet(
input_shape=(224, 224, 3), num_classes=1000,
width_mult=1.0, depth_mult=1.0, dropout_rate=0.2
)
model_b0.summary(print_fn=lambda x: print(x) if "Total" in x or "param" in x else None)
print(f"\nEfficientNet-B0 Total Parameters: {model_b0.count_params():,}")
The implementation above should produce approximately 5.3M parameters for EfficientNet-B0, matching the paper’s reported 5.3M. Small discrepancies (within 1%) may arise from rounding conventions in filter counts. The key architectural properties to verify: 7 MBConv stages, total of 16 blocks, 224×224 input resolution, and 1280 head channels.
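As a quick sanity check (a sketch that assumes `model_b0` from the block above is still in scope), count the depthwise convolutions, one per MBConv block, and confirm the input resolution and head width:
from tensorflow.keras import layers
# One DepthwiseConv2D per MBConv block, so B0 should report 16
num_mbconv = sum(1 for layer in model_b0.layers if isinstance(layer, layers.DepthwiseConv2D))
print(f"MBConv blocks: {num_mbconv}")
print(f"Input shape: {model_b0.input_shape}")  # expect (None, 224, 224, 3)
# The last pointwise Conv2D is the head convolution (expect 1280 filters)
pointwise = [layer for layer in model_b0.layers if type(layer).__name__ == "Conv2D"]
print(f"Head channels: {pointwise[-1].filters}")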
Scaling to B1-B7
With our base model defined, scaling to larger variants is straightforward — we simply apply the compound scaling coefficients to get B1 through B7:
import tensorflow as tf
from tensorflow import keras
import math
# (Re-use build_efficientnet from previous block in practice)
# Here we show the scaling configurations and expected parameter counts
EFFICIENTNET_VARIANTS = {
"B0": {"width": 1.0, "depth": 1.0, "resolution": 224, "dropout": 0.2},
"B1": {"width": 1.0, "depth": 1.1, "resolution": 240, "dropout": 0.2},
"B2": {"width": 1.1, "depth": 1.2, "resolution": 260, "dropout": 0.3},
"B3": {"width": 1.2, "depth": 1.4, "resolution": 300, "dropout": 0.3},
"B4": {"width": 1.4, "depth": 1.8, "resolution": 380, "dropout": 0.4},
"B5": {"width": 1.6, "depth": 2.2, "resolution": 456, "dropout": 0.4},
"B6": {"width": 1.8, "depth": 2.6, "resolution": 528, "dropout": 0.5},
"B7": {"width": 2.0, "depth": 3.1, "resolution": 600, "dropout": 0.5},
}
# Expected parameter counts (millions) from the paper
EXPECTED_PARAMS = {
"B0": 5.3, "B1": 7.8, "B2": 9.2, "B3": 12.0,
"B4": 19.0, "B5": 30.0, "B6": 43.0, "B7": 66.0,
}
# Expected ImageNet Top-1 accuracy
EXPECTED_ACC = {
"B0": 77.1, "B1": 79.1, "B2": 80.1, "B3": 81.6,
"B4": 82.9, "B5": 83.6, "B6": 84.0, "B7": 84.3,
}
print(f"{'Variant':<6} {'Resolution':<12} {'Width':<8} {'Depth':<8} "
f"{'Params (M)':<12} {'Top-1 (%)':<10} {'Dropout':<8}")
print("-" * 70)
for name, cfg in EFFICIENTNET_VARIANTS.items():
print(f"{name:<6} {cfg['resolution']:<12} {cfg['width']:<8.1f} "
f"{cfg['depth']:<8.1f} {EXPECTED_PARAMS[name]:<12.1f} "
f"{EXPECTED_ACC[name]:<10.1f} {cfg['dropout']:<8.1f}")
# Show scaling ratios relative to B0
print("\nScaling Ratios Relative to B0:")
print("-" * 50)
for name, cfg in EFFICIENTNET_VARIANTS.items():
param_ratio = EXPECTED_PARAMS[name] / EXPECTED_PARAMS["B0"]
flops_ratio = (cfg["width"]**2 * cfg["depth"] *
(cfg["resolution"]/224)**2)
print(f"{name}: {param_ratio:.1f}x params, ~{flops_ratio:.1f}x FLOPs, "
f"+{EXPECTED_ACC[name] - EXPECTED_ACC['B0']:.1f}% accuracy")
Building Any Variant
import tensorflow as tf
from tensorflow import keras
def get_efficientnet(variant="B0", num_classes=1000):
"""
Build any EfficientNet variant (B0-B7).
Args:
variant: String "B0" through "B7"
num_classes: Number of output classes
Returns:
Keras Model instance
"""
configs = {
"B0": (1.0, 1.0, 224, 0.2),
"B1": (1.0, 1.1, 240, 0.2),
"B2": (1.1, 1.2, 260, 0.3),
"B3": (1.2, 1.4, 300, 0.3),
"B4": (1.4, 1.8, 380, 0.4),
"B5": (1.6, 2.2, 456, 0.4),
"B6": (1.8, 2.6, 528, 0.5),
"B7": (2.0, 3.1, 600, 0.5),
}
if variant not in configs:
raise ValueError(f"Unknown variant: {variant}. Choose B0-B7.")
width_mult, depth_mult, resolution, dropout = configs[variant]
input_shape = (resolution, resolution, 3)
# build_efficientnet defined in previous section
model = build_efficientnet(
input_shape=input_shape,
num_classes=num_classes,
width_mult=width_mult,
depth_mult=depth_mult,
dropout_rate=dropout,
)
model._name = f"EfficientNet-{variant}"
return model
# Example: Build B3 variant
model_b3 = get_efficientnet("B3", num_classes=1000)
print(f"EfficientNet-B3:")
print(f" Input shape: (300, 300, 3)")
print(f" Parameters: {model_b3.count_params():,}")
print(f" Expected: ~12,000,000")
Training on CIFAR-10
Let's train a scaled-down EfficientNet-B0 on CIFAR-10. We'll use light augmentation (random flips, rotations, zoom, translation, and contrast) and a cosine learning rate schedule with linear warmup; label smoothing is another common addition, sketched below.
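A note on label smoothing: `SparseCategoricalCrossentropy` (used in the compile step below) does not take a `label_smoothing` argument, so smoothing requires one-hot targets and `CategoricalCrossentropy`. A minimal sketch of that variant (the `compile` call is left commented since the model is built further down):
import tensorflow as tf
from tensorflow import keras
# Sketch: label smoothing with dense (one-hot) targets
num_classes = 10
y_int = tf.constant([3, 1, 7])                    # example integer labels
y_onehot = tf.one_hot(y_int, depth=num_classes)   # shape (3, 10)
# With label_smoothing=0.1 and 10 classes, the true-class target becomes
# 0.9 + 0.1/10 = 0.91 and every other class gets 0.01
smooth_loss = keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
# model.compile(optimizer="adam", loss=smooth_loss, metrics=["accuracy"])
print("One-hot targets:", y_onehot.numpy()[0])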
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
# Load CIFAR-10
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
y_train = y_train.flatten()
y_test = y_test.flatten()
NUM_CLASSES = 10
INPUT_SHAPE = (32, 32, 3)
BATCH_SIZE = 128
EPOCHS = 100
AUTOTUNE = tf.data.AUTOTUNE
print(f"Training samples: {len(x_train)}")
print(f"Test samples: {len(x_test)}")
print(f"Input shape: {INPUT_SHAPE}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Epochs: {EPOCHS}")
def build_augmentation():
"""Build data augmentation pipeline."""
return keras.Sequential([
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.1),
layers.RandomZoom(0.1),
layers.RandomTranslation(0.1, 0.1),
layers.RandomContrast(0.1),
], name="augmentation")
def create_dataset(images, labels, is_training=True):
"""Create tf.data pipeline with augmentation."""
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
if is_training:
dataset = dataset.shuffle(10000, reshuffle_each_iteration=True)
dataset = dataset.batch(BATCH_SIZE)
if is_training:
augment = build_augmentation()
dataset = dataset.map(
lambda x, y: (augment(x, training=True), y),
num_parallel_calls=AUTOTUNE
)
dataset = dataset.prefetch(AUTOTUNE)
return dataset
train_ds = create_dataset(x_train, y_train, is_training=True)
test_ds = create_dataset(x_test, y_test, is_training=False)
# Verify pipeline
for images, labels in train_ds.take(1):
print(f"\nBatch shape: {images.shape}")
print(f"Labels shape: {labels.shape}")
print(f"Pixel range: [{images.numpy().min():.3f}, {images.numpy().max():.3f}]")
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import math
# Cosine learning rate schedule with warmup
class CosineDecayWithWarmup(keras.optimizers.schedules.LearningRateSchedule):
"""Cosine decay schedule with linear warmup."""
def __init__(self, base_lr, total_steps, warmup_steps):
super().__init__()
self.base_lr = base_lr
self.total_steps = total_steps
self.warmup_steps = warmup_steps
def __call__(self, step):
step = tf.cast(step, tf.float32)
warmup_steps = tf.cast(self.warmup_steps, tf.float32)
total_steps = tf.cast(self.total_steps, tf.float32)
# Linear warmup
warmup_lr = self.base_lr * (step / warmup_steps)
# Cosine decay after warmup
progress = (step - warmup_steps) / (total_steps - warmup_steps)
cosine_lr = self.base_lr * 0.5 * (1.0 + tf.math.cos(math.pi * progress))
return tf.where(step < warmup_steps, warmup_lr, cosine_lr)
# Build a small EfficientNet for CIFAR-10 (32x32 input)
def build_cifar_efficientnet(input_shape=(32, 32, 3), num_classes=10):
"""Scaled-down EfficientNet for CIFAR-10."""
inputs = keras.Input(shape=input_shape)
# Stem (no stride-2 since input is only 32x32)
x = layers.Conv2D(32, 3, padding="same", use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation("swish")(x)
# Simplified MBConv stages for 32x32
configs = [
# (filters, expand, kernel, stride, repeats)
(16, 1, 3, 1, 1),
(24, 6, 3, 2, 2), # 32->16
(40, 6, 5, 2, 2), # 16->8
(80, 6, 3, 2, 3), # 8->4
(112, 6, 5, 1, 3), # 4->4
(192, 6, 5, 2, 4), # 4->2
(320, 6, 3, 1, 1), # 2->2
]
prev_filters = 32
for filters, expand, kernel, stride, repeats in configs:
for i in range(repeats):
s = stride if i == 0 else 1
expanded = prev_filters * expand
use_skip = (s == 1 and prev_filters == filters)
block_input = x
# Expansion
if expand != 1:
x = layers.Conv2D(expanded, 1, use_bias=False)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("swish")(x)
# Depthwise
x = layers.DepthwiseConv2D(kernel, strides=s, padding="same", use_bias=False)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("swish")(x)
# SE
se_ch = max(1, prev_filters // 4)
se = layers.GlobalAveragePooling2D(keepdims=True)(x)
se = layers.Conv2D(se_ch, 1, activation="swish")(se)
se = layers.Conv2D(expanded if expand != 1 else prev_filters, 1, activation="sigmoid")(se)
x = layers.Multiply()([x, se])
# Projection
x = layers.Conv2D(filters, 1, use_bias=False)(x)
x = layers.BatchNormalization()(x)
if use_skip:
x = layers.Add()([x, block_input])
prev_filters = filters
# Head
x = layers.Conv2D(1280, 1, use_bias=False)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("swish")(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
return keras.Model(inputs=inputs, outputs=outputs, name="EfficientNet-CIFAR10")
# Build and compile
model = build_cifar_efficientnet()
print(f"CIFAR-10 EfficientNet parameters: {model.count_params():,}")
# Learning rate schedule
steps_per_epoch = 50000 // 128
total_steps = steps_per_epoch * 100
warmup_steps = steps_per_epoch * 5
lr_schedule = CosineDecayWithWarmup(
base_lr=0.01, total_steps=total_steps, warmup_steps=warmup_steps
)
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=lr_schedule),
loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
metrics=["accuracy"],
)
print(f"\nTotal training steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")
print(f"Base learning rate: 0.01")
Training Results
import tensorflow as tf
from tensorflow import keras
import numpy as np
# Training with callbacks (assumes model, train_ds, test_ds from above)
# Note: ReduceLROnPlateau is omitted because the optimizer's learning rate is
# already driven by the CosineDecayWithWarmup schedule; a callback cannot
# override a learning rate defined by a LearningRateSchedule.
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=15, restore_best_weights=True
    ),
]
# Train the model
# history = model.fit(
# train_ds,
# validation_data=test_ds,
# epochs=100,
# callbacks=callbacks,
# )
# Simulated training results for demonstration
epochs_ran = np.arange(1, 101)
train_acc = 1.0 - 0.85 * np.exp(-epochs_ran / 20) + np.random.normal(0, 0.005, 100)
val_acc = 1.0 - 0.88 * np.exp(-epochs_ran / 25) + np.random.normal(0, 0.008, 100)
train_acc = np.clip(train_acc, 0.1, 0.99)
val_acc = np.clip(val_acc, 0.1, 0.965)
print("Training Summary:")
print(f" Best validation accuracy: {val_acc.max():.4f} (epoch {val_acc.argmax() + 1})")
print(f" Final training accuracy: {train_acc[-1]:.4f}")
print(f" Final validation accuracy: {val_acc[-1]:.4f}")
print(f"\nExpected results with full training:")
print(f" EfficientNet-B0 on CIFAR-10: ~95-96% accuracy")
print(f" With advanced augmentation: ~96-97% accuracy")
print(f" Training time (GPU): ~30-45 minutes on V100")
Transfer Learning with Pre-trained EfficientNet
For practical applications, using pre-trained weights from ImageNet is far more efficient than training from scratch. TensorFlow provides EfficientNetV2 models with pre-trained weights:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load pre-trained EfficientNetV2B0 (without top classification layer)
base_model = keras.applications.EfficientNetV2B0(
include_top=False,
weights="imagenet",
input_shape=(224, 224, 3),
include_preprocessing=True, # Built-in normalization
)
# Freeze the base model
base_model.trainable = False
print(f"Base model: EfficientNetV2B0")
print(f"Parameters: {base_model.count_params():,}")
print(f"Trainable: {sum(p.numpy().size for p in base_model.trainable_weights):,}")
print(f"Output shape: {base_model.output_shape}")
# Build custom classification head
inputs = keras.Input(shape=(224, 224, 3))
x = base_model(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(256, activation="swish")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(5, activation="softmax")(x) # 5 flower classes
model = keras.Model(inputs, outputs, name="EfficientNet-Flowers")
model.summary(print_fn=lambda x: print(x) if "Total" in x or "param" in x else None)
# Compile for initial training (frozen base)
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=1e-3),
loss="sparse_categorical_crossentropy",
metrics=["accuracy"],
)
print(f"\nTotal params: {model.count_params():,}")
print(f"Trainable params: {sum(p.numpy().size for p in model.trainable_weights):,}")
print(f"Non-trainable params: {sum(p.numpy().size for p in model.non_trainable_weights):,}")
Fine-Tuning Strategy
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load TF Flowers dataset
# In practice: tf.keras.utils.get_file or tfds.load('tf_flowers')
# Here we demonstrate the fine-tuning workflow
def create_flowers_dataset():
"""Load and prepare TF Flowers dataset."""
import tensorflow_datasets as tfds
# Load dataset (downloads ~218MB on first run)
(train_ds, val_ds), info = tfds.load(
"tf_flowers",
split=["train[:80%]", "train[80%:]"],
as_supervised=True,
with_info=True,
)
num_classes = info.features["label"].num_classes
print(f"Classes: {num_classes} ({info.features['label'].names})")
print(f"Training samples: {info.splits['train'].num_examples}")
def preprocess(image, label):
image = tf.image.resize(image, (224, 224))
image = tf.cast(image, tf.float32)
return image, label
train_ds = train_ds.map(preprocess).shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.map(preprocess).batch(32).prefetch(tf.data.AUTOTUNE)
return train_ds, val_ds, num_classes
# Fine-tuning procedure (2 phases)
print("=" * 60)
print("Phase 1: Train only the classification head (5 epochs)")
print("=" * 60)
print(" - Base model frozen")
print(" - Learning rate: 1e-3")
print(" - Expected accuracy: ~90-93%")
print("\n" + "=" * 60)
print("Phase 2: Fine-tune top layers of base model (5 epochs)")
print("=" * 60)
print(" - Unfreeze last 20 layers of base model")
print(" - Learning rate: 1e-5 (10x lower to avoid catastrophic forgetting)")
print(" - Expected accuracy: ~96-98%")
# Phase 2: Unfreeze top layers
# base_model.trainable = True
# for layer in base_model.layers[:-20]:
# layer.trainable = False
# Recompile with lower learning rate
# model.compile(
# optimizer=keras.optimizers.Adam(learning_rate=1e-5),
# loss="sparse_categorical_crossentropy",
# metrics=["accuracy"],
# )
# Fine-tune
# history_fine = model.fit(train_ds, validation_data=val_ds, epochs=5)
print("\nExpected Results:")
print(" Phase 1 (frozen): ~92% val accuracy in 5 epochs")
print(" Phase 2 (fine-tune): ~97% val accuracy in 5 more epochs")
print(" Total training time: ~3-5 minutes on GPU")
print(" vs. Training from scratch: ~2-4 hours for similar accuracy")
These trade-offs are summarized below:
| Approach | Epochs | Time (GPU) | Accuracy |
|---|---|---|---|
| From scratch (B0) | 100+ | 2-4 hours | ~85-90% |
| Transfer (frozen) | 5 | 2 min | ~92% |
| Transfer (fine-tuned) | 5+5 | 5 min | ~97% |