
Part 2: Building Models with Keras

May 3, 2026 · Wasil Zafar · 30 min read

Master the three Keras model-building APIs — Sequential, Functional, and Subclassing — then explore built-in layers, custom layers, activations, regularization, and model inspection to architect any deep learning model.

Table of Contents

  1. The Three Keras APIs
  2. Sequential API
  3. Functional API
  4. Model Subclassing
  5. Built-in Layers
  6. Custom Layers
  7. Activation Functions
  8. Regularization Techniques
  9. Model Inspection & Visualization
  10. Putting It Together

The Three Keras APIs

Keras — now fully integrated as tf.keras — offers three distinct APIs for building models, each trading simplicity for flexibility. Choosing the right one for your architecture is the first design decision you'll make on every project.

API         | Complexity | Flexibility | Best For
Sequential  | Lowest     | Limited     | Simple stack of layers (feedforward, basic CNNs)
Functional  | Medium     | High        | Multi-input/output, shared layers, skip connections
Subclassing | Highest    | Maximum     | Dynamic architectures, Python control flow in forward pass
Rule of Thumb: Start with Sequential for prototyping, graduate to Functional for most production models, and use Subclassing only when you need dynamic behaviour (e.g., loops, conditionals inside the forward pass). The Functional API covers ~90% of real-world architectures.

Which API Should You Use?

This flowchart helps you pick the right API based on your model's requirements:

Keras API Decision Flowchart
flowchart TD
    A["Start: Define Your Model"] --> B{"Single input & single output?"}
    B -->|Yes| C{"Purely linear stack of layers?"}
    B -->|No| F["Functional API: Multi-input/output, shared layers, DAGs"]
    C -->|Yes| D["Sequential API: Simplest approach"]
    C -->|No| E{"Need Python control flow in forward pass?"}
    E -->|No| F
    E -->|Yes| G["Model Subclassing: Full Python flexibility"]
    style D fill:#3B9797,stroke:#132440,color:#ffffff
    style F fill:#16476A,stroke:#132440,color:#ffffff
    style G fill:#BF092F,stroke:#132440,color:#ffffff

Sequential API

The Sequential API is the simplest way to build a model in Keras. You create a tf.keras.Sequential object and add layers one at a time — each layer has exactly one input tensor and one output tensor. Think of it as a pipeline where data flows through a linear stack.

Stacking Layers

You can define a Sequential model either by passing a list of layers to the constructor or by calling .add() incrementally:

import tensorflow as tf

# Method 1: Pass layers as a list
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Method 2: Add layers incrementally
model2 = tf.keras.Sequential()
model2.add(tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)))
model2.add(tf.keras.layers.Dense(64, activation='relu'))
model2.add(tf.keras.layers.Dense(10, activation='softmax'))

print("Model 1 layers:", len(model.layers))
print("Model 2 layers:", len(model2.layers))

Both approaches produce identical models. Use the list form for concise definitions and .add() when you need conditional layer inclusion based on hyperparameters.
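
For example, here's a minimal sketch of conditional layer inclusion with .add(); the hidden_units list and use_dropout flag are hypothetical hyperparameters introduced only for this illustration:

import tensorflow as tf

# Hypothetical hyperparameters that control the architecture
hidden_units = [256, 128]
use_dropout = True

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(784,)))
for units in hidden_units:
    model.add(tf.keras.layers.Dense(units, activation='relu'))
    if use_dropout:  # layer included only when the flag is set
        model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.summary()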

Input Shapes & Model Summary

The first layer in a Sequential model must specify input_shape (or equivalently use an InputLayer). Once set, Keras infers all subsequent shapes automatically. Call model.summary() to inspect the architecture:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
# Output shows: layer names, output shapes, param counts
# Total params: 100 * 256 + 256 + 256 * 128 + 128 + 128 * 1 + 1 = 58,881

Building an MNIST Classifier

Here's a complete end-to-end Sequential classifier on MNIST — from data loading to prediction:

import tensorflow as tf
import numpy as np

# Load and preprocess MNIST
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

# Build Sequential model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile and train
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_split=0.1)

# Evaluate
loss, acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {acc:.4f}")
Experiment
Try It: Sequential Model Variants

Modify the classifier above to explore these variations:

  • Replace Dense(256) + Dense(128) with a single Dense(512) — does a wider single layer beat two narrower layers?
  • Remove the BatchNormalization() layer — how does convergence speed change?
  • Swap activation='relu' for activation='elu' — compare final accuracy
Sequential MNIST Classification

Functional API

The Functional API treats layers as functions that you call on tensors. Instead of stacking layers linearly, you create a directed acyclic graph (DAG) of layers — enabling multi-input models, multi-output models, shared layers, and skip connections. This is the workhorse API for most production architectures.

The pattern is always the same: (1) define Input() tensors, (2) chain layers as function calls, and (3) create a Model from inputs to outputs:

import tensorflow as tf

# Define input
inputs = tf.keras.Input(shape=(784,))

# Chain layers as function calls on tensors
x = tf.keras.layers.Dense(256, activation='relu')(inputs)
x = tf.keras.layers.Dropout(0.3)(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

# Create model from inputs → outputs
model = tf.keras.Model(inputs=inputs, outputs=outputs, name='functional_mnist')
model.summary()

Multi-Input / Multi-Output Model

A common real-world pattern is a model that takes multiple inputs (e.g., numerical features + text features) and produces multiple outputs (e.g., classification + regression). This is impossible with Sequential but natural with the Functional API:

import tensorflow as tf

# Input 1: Numerical features (e.g., age, income)
numerical_input = tf.keras.Input(shape=(10,), name='numerical')
x1 = tf.keras.layers.Dense(64, activation='relu')(numerical_input)
x1 = tf.keras.layers.Dense(32, activation='relu')(x1)

# Input 2: Categorical embedding (e.g., product category)
categorical_input = tf.keras.Input(shape=(1,), name='categorical')
x2 = tf.keras.layers.Embedding(input_dim=100, output_dim=16)(categorical_input)
x2 = tf.keras.layers.Flatten()(x2)
x2 = tf.keras.layers.Dense(32, activation='relu')(x2)

# Merge branches
merged = tf.keras.layers.Concatenate()([x1, x2])
x = tf.keras.layers.Dense(64, activation='relu')(merged)

# Output 1: Classification (buy / not buy)
class_output = tf.keras.layers.Dense(1, activation='sigmoid', name='buy_prediction')(x)

# Output 2: Regression (predicted spend amount)
spend_output = tf.keras.layers.Dense(1, activation='linear', name='spend_prediction')(x)

# Create model with 2 inputs and 2 outputs
model = tf.keras.Model(
    inputs=[numerical_input, categorical_input],
    outputs=[class_output, spend_output]
)
model.summary()
print("Input names:", [inp.name for inp in model.inputs])
print("Output names:", [out.name for out in model.outputs])

Skip Connections (Residual Blocks)

Skip connections — popularised by ResNet — add the input of a block directly to its output, enabling much deeper networks. This is where the Functional API really shines:

import tensorflow as tf

def residual_block(x, units):
    """A residual block: two Dense layers with a skip connection."""
    shortcut = x

    x = tf.keras.layers.Dense(units, activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(units, activation=None)(x)
    x = tf.keras.layers.BatchNormalization()(x)

    # Match dimensions if needed
    if shortcut.shape[-1] != units:
        shortcut = tf.keras.layers.Dense(units, activation=None)(shortcut)

    x = tf.keras.layers.Add()([x, shortcut])
    x = tf.keras.layers.Activation('relu')(x)
    return x

# Build a model with residual blocks
inputs = tf.keras.Input(shape=(128,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = residual_block(x, 64)
x = residual_block(x, 64)
x = residual_block(x, 32)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs, name='resnet_mlp')
model.summary()
print(f"Total residual blocks: 3")
Why Skip Connections Work: In deep networks, gradients can vanish during backpropagation. Skip connections provide a "gradient highway" — the gradient can flow directly through the addition operation, bypassing the non-linear transformations. This is why a 152-layer ResNet can be trained successfully, whereas plain (non-residual) networks start to degrade once they grow much beyond roughly 20 layers.
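
To make the "gradient highway" intuition concrete, here's a small toy example of ours (not taken from ResNet) using tf.GradientTape. With a skip connection, the gradient picks up a direct identity term, so it stays close to 1 even when the transformation branch contributes almost nothing:

import tensorflow as tf

x = tf.Variable([2.0])
w = tf.constant([1e-4])  # deliberately tiny weight so f(x) contributes almost nothing

with tf.GradientTape(persistent=True) as tape:
    f_x = w * x              # the transformation branch
    y_plain = f_x            # plain network: output is only f(x)
    y_residual = f_x + x     # residual: the skip connection adds the input back

grad_plain = tape.gradient(y_plain, x)
grad_residual = tape.gradient(y_residual, x)
print("Plain gradient:   ", grad_plain.numpy())     # ~0.0001, nearly vanished
print("Residual gradient:", grad_residual.numpy())  # ~1.0001, the identity path keeps it alive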

Model Subclassing

Model subclassing gives you maximum flexibility by letting you define the forward pass as arbitrary Python code. You inherit from tf.keras.Model, define layers in __init__(), and implement the forward pass in call(). This is the approach used by researchers who need loops, conditionals, or other dynamic behaviour in their model.

import tensorflow as tf

class MNISTClassifier(tf.keras.Model):
    def __init__(self, num_classes=10):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(256, activation='relu')
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.dropout1 = tf.keras.layers.Dropout(0.3)
        self.dense2 = tf.keras.layers.Dense(128, activation='relu')
        self.dropout2 = tf.keras.layers.Dropout(0.2)
        self.classifier = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, inputs, training=False):
        x = self.dense1(inputs)
        x = self.bn1(x, training=training)
        x = self.dropout1(x, training=training)
        x = self.dense2(x)
        x = self.dropout2(x, training=training)
        return self.classifier(x)

# Instantiate and build
model = MNISTClassifier()
model.build(input_shape=(None, 784))
model.summary()

# Test forward pass
sample = tf.random.normal((2, 784))
output = model(sample, training=False)
print("Output shape:", output.shape)  # (2, 10)
print("Sum of probabilities:", tf.reduce_sum(output, axis=1).numpy())  # [1.0, 1.0]

Dynamic Architectures

The real power of subclassing emerges when you need Python control flow in the forward pass — something impossible with Sequential or Functional:

import tensorflow as tf

class DynamicDepthModel(tf.keras.Model):
    """A model that applies a variable number of dense blocks
    based on input magnitude — demonstrating dynamic control flow."""

    def __init__(self, max_blocks=5, units=64):
        super().__init__()
        self.max_blocks = max_blocks
        self.input_proj = tf.keras.layers.Dense(units, activation='relu')
        # Create a list of dense blocks
        self.blocks = [
            tf.keras.layers.Dense(units, activation='relu')
            for _ in range(max_blocks)
        ]
        self.output_layer = tf.keras.layers.Dense(1)

    def call(self, inputs, training=False):
        x = self.input_proj(inputs)

        # Dynamic: number of blocks depends on input magnitude
        input_norm = tf.reduce_mean(tf.abs(inputs))
        num_blocks = tf.minimum(
            tf.cast(input_norm * self.max_blocks, tf.int32),
            self.max_blocks
        )

        for i in range(self.max_blocks):
            if i < num_blocks:
                x = self.blocks[i](x)
        return self.output_layer(x)

model = DynamicDepthModel(max_blocks=5, units=64)
# Small input → fewer blocks
small_input = tf.random.normal((4, 32)) * 0.1
# Large input → more blocks
large_input = tf.random.normal((4, 32)) * 2.0
print("Small output:", model(small_input).shape)
print("Large output:", model(large_input).shape)
Subclassing Trade-offs: Subclassed models lose some Keras conveniences: model.summary() requires calling model.build() first, plot_model() shows limited detail, and serialization requires implementing get_config(). Use subclassing only when you genuinely need dynamic forward passes — the Functional API is almost always sufficient.
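
As a minimal sketch of the serialization point (this ConfigurableClassifier is our own toy example, not a class used elsewhere in this post), implementing get_config() lets Keras rebuild a subclassed model from its configuration:

import tensorflow as tf

class ConfigurableClassifier(tf.keras.Model):
    def __init__(self, hidden_units=128, num_classes=10, **kwargs):
        super().__init__(**kwargs)
        self.hidden_units = hidden_units
        self.num_classes = num_classes
        self.hidden = tf.keras.layers.Dense(hidden_units, activation='relu')
        self.classifier = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, inputs):
        return self.classifier(self.hidden(inputs))

    def get_config(self):
        # Return the constructor arguments so Keras can re-instantiate the model
        return {'hidden_units': self.hidden_units, 'num_classes': self.num_classes}

    @classmethod
    def from_config(cls, config):
        return cls(**config)

model = ConfigurableClassifier(hidden_units=64)
clone = ConfigurableClassifier.from_config(model.get_config())
print("Rebuilt with hidden_units =", clone.hidden_units)  # 64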

Built-in Layers

Keras provides a rich library of pre-built layers. Here are the most important ones, organised by category:

Core Layers

Layer     | Purpose                              | Key Parameters
Dense     | Fully connected layer                | units, activation, kernel_regularizer
Flatten   | Flatten multi-D input to 1D          | (none)
Dropout   | Randomly zero out units              | rate (0–1)
Embedding | Map integer indices to dense vectors | input_dim, output_dim

Convolutional & Pooling Layers

Layer                  | Purpose                       | Key Parameters
Conv2D                 | 2D convolution (images)       | filters, kernel_size, strides, padding
MaxPooling2D           | Downsamples by taking the max | pool_size
GlobalAveragePooling2D | Global spatial average → 1D   | (none)

Recurrent & Normalization Layers

Layer              | Purpose                             | Key Parameters
LSTM               | Long Short-Term Memory RNN          | units, return_sequences, dropout
GRU                | Gated Recurrent Unit (lighter LSTM) | units, return_sequences
BatchNormalization | Normalise activations per batch     | momentum, epsilon
LayerNormalization | Normalise activations per sample    | epsilon

Here's a practical example using several built-in layers to build a small CNN:

import tensorflow as tf

# Mini CNN with built-in layers
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.summary()
print(f"Total parameters: {model.count_params():,}")

Custom Layers

When built-in layers aren't enough, you can create your own by subclassing tf.keras.layers.Layer. Custom layers participate fully in Keras's training, serialisation, and graph compilation.

build() vs __init__()

The key distinction: __init__() stores hyperparameters (things known before seeing data), while build() creates weights that depend on the input shape. This lazy initialization lets Keras infer shapes automatically:

import tensorflow as tf

class ScaledDense(tf.keras.layers.Layer):
    """A Dense layer with a learnable scaling factor."""

    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # Create weights — shape depends on input
        self.w = self.add_weight(
            name='kernel',
            shape=(input_shape[-1], self.units),
            initializer='glorot_uniform',
            trainable=True
        )
        self.b = self.add_weight(
            name='bias',
            shape=(self.units,),
            initializer='zeros',
            trainable=True
        )
        # Learnable scale factor
        self.scale = self.add_weight(
            name='scale',
            shape=(self.units,),
            initializer='ones',
            trainable=True
        )
        super().build(input_shape)

    def call(self, inputs):
        output = tf.matmul(inputs, self.w) + self.b
        output = output * self.scale  # Apply learnable scaling
        if self.activation is not None:
            output = self.activation(output)
        return output

# Use it like any Keras layer
layer = ScaledDense(64, activation='relu')
sample = tf.random.normal((4, 32))
output = layer(sample)  # build() called automatically on first call
print("Output shape:", output.shape)  # (4, 64)
print("Trainable weights:", [w.name for w in layer.trainable_weights])

The training Flag

Layers that behave differently during training vs inference (like Dropout and BatchNorm) use the training argument in call():

import tensorflow as tf

class NoisyDense(tf.keras.layers.Layer):
    """Dense layer that adds Gaussian noise during training only."""

    def __init__(self, units, noise_stddev=0.1, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.noise_stddev = noise_stddev

    def build(self, input_shape):
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer='glorot_uniform',
            trainable=True
        )
        self.b = self.add_weight(
            shape=(self.units,),
            initializer='zeros',
            trainable=True
        )
        super().build(input_shape)

    def call(self, inputs, training=False):
        output = tf.matmul(inputs, self.w) + self.b
        if training:
            noise = tf.random.normal(shape=tf.shape(output), stddev=self.noise_stddev)
            output = output + noise
        return tf.nn.relu(output)

# During training: noise is added
layer = NoisyDense(32)
sample = tf.random.normal((2, 16))
train_out = layer(sample, training=True)
infer_out = layer(sample, training=False)
print("Training output (with noise):", train_out[0, :5].numpy())
print("Inference output (no noise):", infer_out[0, :5].numpy())

Activation Functions

Activation functions introduce non-linearity into neural networks — without them, stacking linear layers would just produce another linear transformation. Here are the most important ones, with their formulas and use cases.
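
A quick way to see this for yourself (a toy demonstration of ours, not code from the Keras docs) is to compose two Dense layers with no activation and check that a single affine map reproduces them exactly:

import tensorflow as tf
import numpy as np

x = tf.random.normal((4, 8))

# Two stacked linear (activation-free) layers
layer1 = tf.keras.layers.Dense(16, activation=None)
layer2 = tf.keras.layers.Dense(5, activation=None)
stacked = layer2(layer1(x))

# The same computation collapses into ONE affine map: x @ (W1 @ W2) + (b1 @ W2 + b2)
w1, b1 = layer1.get_weights()
w2, b2 = layer2.get_weights()
single = x.numpy() @ (w1 @ w2) + (b1 @ w2 + b2)

print("Max difference:", np.max(np.abs(stacked.numpy() - single)))  # ~1e-6: effectively identical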

ReLU Family

ReLU (Rectified Linear Unit) is the default activation for hidden layers:

$f(x) = \max(0, x)$

Simple, fast, and works well in most cases. However, neurons can "die" — if a neuron's input is always negative, its gradient is permanently zero. Variants fix this:

  • LeakyReLU: $f(x) = \max(\alpha x, x)$ where $\alpha = 0.01$ — allows a small gradient when $x < 0$
  • ELU: $f(x) = x$ if $x \geq 0$, $\alpha(e^x - 1)$ if $x < 0$ — smooth curve, pushes mean activations closer to zero
import tensorflow as tf
import numpy as np

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])

# ReLU
relu = tf.keras.activations.relu(x)
print("ReLU:     ", relu.numpy())  # [0. 0. 0. 1. 3.]

# LeakyReLU (alpha=0.1)
leaky = tf.keras.layers.LeakyReLU(alpha=0.1)(x)
print("LeakyReLU:", leaky.numpy())  # [-0.3 -0.1  0.   1.   3. ]

# ELU
elu = tf.keras.activations.elu(x, alpha=1.0)
print("ELU:      ", elu.numpy())  # [-0.9502 -0.6321  0.  1.  3.]

Sigmoid & Softmax

Sigmoid squashes values to $(0, 1)$ — used for binary classification outputs:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

Softmax converts a vector of logits into a probability distribution — used for multi-class outputs:

$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$

import tensorflow as tf
import numpy as np

# Sigmoid — binary classification output
logits = tf.constant([-2.0, 0.0, 2.0, 5.0])
probs = tf.keras.activations.sigmoid(logits)
print("Sigmoid:", probs.numpy())  # [0.1192, 0.5, 0.8808, 0.9933]

# Softmax — multi-class classification output
logits_mc = tf.constant([[2.0, 1.0, 0.1]])
probs_mc = tf.keras.activations.softmax(logits_mc)
print("Softmax:", probs_mc.numpy())  # [[0.659, 0.242, 0.098]]
print("Sum:", tf.reduce_sum(probs_mc).numpy())  # 1.0

GELU & Swish

GELU (Gaussian Error Linear Unit) is the default activation in Transformers (BERT, GPT):

$\text{GELU}(x) = x \cdot \Phi(x)$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. Unlike ReLU, GELU is smooth everywhere and allows small negative values.

Swish (also called SiLU) is $f(x) = x \cdot \sigma(x)$, discovered through neural architecture search by Google:

import tensorflow as tf
import numpy as np

x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])

# GELU — used in Transformers (BERT, GPT, ViT)
gelu = tf.keras.activations.gelu(x)
print("GELU: ", gelu.numpy())

# Swish — smooth, non-monotonic
swish = tf.keras.activations.swish(x)
print("Swish:", swish.numpy())

# Side-by-side comparison
print("\nActivation Comparison:")
print(f"{'x':>6} | {'ReLU':>8} | {'GELU':>8} | {'Swish':>8}")
print("-" * 40)
for val in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    t = tf.constant([val])
    r = tf.keras.activations.relu(t).numpy()[0]
    g = tf.keras.activations.gelu(t).numpy()[0]
    s = tf.keras.activations.swish(t).numpy()[0]
    print(f"{val:>6.1f} | {r:>8.4f} | {g:>8.4f} | {s:>8.4f}")
Activation Cheat Sheet:
  • Hidden layers: ReLU (default), GELU (Transformers), Swish (EfficientNet)
  • Binary output: Sigmoid
  • Multi-class output: Softmax
  • Regression output: None (linear)
  • Dying ReLU problem: Try LeakyReLU or ELU

Regularization Techniques

Regularization prevents overfitting by constraining model capacity. Keras offers several built-in techniques that you can combine:

L1, L2, and ElasticNet Regularizers

Weight regularizers add a penalty term to the loss function that discourages large weights:

  • L2 (Ridge): $\lambda \sum w_i^2$ — shrinks weights uniformly, most common
  • L1 (Lasso): $\lambda \sum |w_i|$ — drives some weights to exactly zero (sparse)
  • ElasticNet: $\lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2$ — combination of both
import tensorflow as tf

# L2 regularization on Dense layer weights
model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        128, activation='relu',
        kernel_regularizer=tf.keras.regularizers.l2(0.01),
        input_shape=(64,)
    ),
    tf.keras.layers.Dense(
        64, activation='relu',
        kernel_regularizer=tf.keras.regularizers.l1(0.001)
    ),
    tf.keras.layers.Dense(
        32, activation='relu',
        kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.01)
    ),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Check regularization losses
sample = tf.random.normal((2, 64))
_ = model(sample)  # Forward pass to compute reg losses
print("Regularization losses:", [l.numpy() for l in model.losses])

Dropout & BatchNorm as Regularization

Dropout randomly zeroes out a fraction of neurons during training, forcing the network to learn redundant representations. BatchNormalization also acts as a mild regularizer because the mini-batch statistics add noise:

import tensorflow as tf

# Regularization strategy combining multiple techniques
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(100,),
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),

    tf.keras.layers.Dense(128, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),

    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
print(f"\nRegularization layers: {sum(1 for l in model.layers if isinstance(l, (tf.keras.layers.Dropout, tf.keras.layers.BatchNormalization)))}")
Regularization Rules of Thumb:
  • L2 (weight decay): Always a good default — start with 1e-4
  • Dropout: 0.2–0.5 for Dense layers; lower for Conv layers (0.1–0.25); place after activation
  • BatchNorm: Place before activation or after — both work, but before is the original paper's recommendation (see the sketch after this list)
  • Combining: L2 + Dropout + BatchNorm together is standard practice for large models
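
Here's a minimal sketch of the pre-activation BatchNorm placement mentioned above (the layer sizes are arbitrary): the Dense layer is created without an activation, BatchNorm normalises the linear output, the ReLU is applied as a separate Activation layer, and Dropout follows the activation:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation=None, input_shape=(64,)),  # linear transform only
    tf.keras.layers.BatchNormalization(),                            # normalise pre-activations
    tf.keras.layers.Activation('relu'),                              # non-linearity applied after BN
    tf.keras.layers.Dropout(0.3),                                    # Dropout after the activation
    tf.keras.layers.Dense(10, activation='softmax')
])

model.summary()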

Model Inspection & Visualization

Once you've built a model, Keras provides several tools to inspect its architecture, count parameters, access individual layers, and visualize the computation graph.

model.summary()

The summary() method prints a table showing each layer's name, output shape, and parameter count:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,), name='hidden_1'),
    tf.keras.layers.Dense(64, activation='relu', name='hidden_2'),
    tf.keras.layers.Dense(10, activation='softmax', name='output')
])

# Print summary
model.summary()

# Programmatic access to counts
total = model.count_params()
trainable = sum(tf.keras.backend.count_params(w) for w in model.trainable_weights)
non_trainable = total - trainable
print(f"\nTotal params: {total:,}")
print(f"Trainable: {trainable:,}")
print(f"Non-trainable: {non_trainable:,}")

Visualizing with plot_model()

For Functional and Sequential models, tf.keras.utils.plot_model() generates a diagram of the computation graph. This is especially useful for complex multi-branch architectures:

import tensorflow as tf

# Build a multi-branch model to visualize
input_a = tf.keras.Input(shape=(32,), name='input_a')
input_b = tf.keras.Input(shape=(16,), name='input_b')

branch_a = tf.keras.layers.Dense(64, activation='relu', name='dense_a')(input_a)
branch_b = tf.keras.layers.Dense(64, activation='relu', name='dense_b')(input_b)
merged = tf.keras.layers.Concatenate(name='merge')([branch_a, branch_b])
output = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(merged)

model = tf.keras.Model(inputs=[input_a, input_b], outputs=output, name='dual_branch')

# Generate architecture diagram (saves to file)
# Requires pydot plus the Graphviz binary: pip install pydot (and install Graphviz via your OS package manager)
tf.keras.utils.plot_model(
    model,
    to_file='model_architecture.png',
    show_shapes=True,
    show_layer_names=True,
    show_dtype=True,
    rankdir='TB'  # Top-to-Bottom layout
)
print("Model diagram saved to model_architecture.png")

Layer & Weight Access

You can access individual layers, get and set their weights, and even freeze layers for transfer learning:

import tensorflow as tf
import numpy as np

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,), name='layer_1'),
    tf.keras.layers.Dense(32, activation='relu', name='layer_2'),
    tf.keras.layers.Dense(10, activation='softmax', name='output')
])

# Access layers by name or index
layer = model.get_layer('layer_1')
print("Layer name:", layer.name)
print("Layer config:", layer.get_config())

# Get weights (kernel + bias)
kernel, bias = layer.get_weights()
print(f"\nKernel shape: {kernel.shape}")  # (32, 64)
print(f"Bias shape: {bias.shape}")        # (64,)
print(f"Kernel mean: {np.mean(kernel):.6f}")

# Freeze a layer (for transfer learning)
layer.trainable = False
print(f"\nTrainable params after freezing layer_1:")
print(f"  {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")
Keras Model Inspection API
flowchart LR
    A["model"] --> B[".summary(): text table of layers"]
    A --> C[".layers: list of Layer objects"]
    A --> D[".get_layer(name): single layer by name"]
    A --> E[".count_params(): total parameters"]
    A --> F[".trainable_weights: trainable Variables"]
    C --> G[".get_weights(): NumPy arrays"]
    C --> H[".trainable = False: freeze layer"]
    D --> G
    D --> H
    style A fill:#132440,stroke:#3B9797,color:#ffffff
    style B fill:#3B9797,stroke:#132440,color:#ffffff
    style C fill:#3B9797,stroke:#132440,color:#ffffff
    style D fill:#3B9797,stroke:#132440,color:#ffffff
    style E fill:#3B9797,stroke:#132440,color:#ffffff
    style F fill:#3B9797,stroke:#132440,color:#ffffff

Putting It Together

Let's build the same architecture — a two-hidden-layer classifier with BatchNorm and Dropout — using all three Keras APIs side by side. Seeing them together makes clear where each API shines and shows that, for a fixed architecture, all three are interchangeable:

Version 1: Sequential

Here is the Sequential version. Each of the three code examples below is self-contained and can be run independently:

import tensorflow as tf

def build_sequential_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(64,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ], name='sequential_model')

model_seq = build_sequential_model()
model_seq.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print("Sequential params:", model_seq.count_params())

Version 2: Functional

The same architecture expressed with the Functional API:

import tensorflow as tf

def build_functional_model():
    inputs = tf.keras.Input(shape=(64,))
    x = tf.keras.layers.Dense(128, activation='relu')(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    x = tf.keras.layers.Dense(64, activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(10, activation='softmax')(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name='functional_model')

model_func = build_functional_model()
model_func.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print("Functional params:", model_func.count_params())

Version 3: Subclassed

And finally the subclassed version:

import tensorflow as tf

class SubclassedModel(tf.keras.Model):
    def __init__(self):
        super().__init__(name='subclassed_model')
        self.dense1 = tf.keras.layers.Dense(128, activation='relu')
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.drop1 = tf.keras.layers.Dropout(0.3)
        self.dense2 = tf.keras.layers.Dense(64, activation='relu')
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.drop2 = tf.keras.layers.Dropout(0.2)
        self.output_layer = tf.keras.layers.Dense(10, activation='softmax')

    def call(self, inputs, training=False):
        x = self.dense1(inputs)
        x = self.bn1(x, training=training)
        x = self.drop1(x, training=training)
        x = self.dense2(x)
        x = self.bn2(x, training=training)
        x = self.drop2(x, training=training)
        return self.output_layer(x)

model_sub = SubclassedModel()
model_sub.build(input_shape=(None, 64))
model_sub.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print("Subclassed params:", model_sub.count_params())

Now let's verify that the Sequential and Functional versions produce identical parameter counts and identical output shapes on the same random data (the subclassed model uses the exact same layer stack, so its count matches as well):

import tensorflow as tf
import numpy as np

# Rebuild the Sequential and Functional versions
# --- Sequential ---
model_seq = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(64,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
], name='seq')

# --- Functional ---
inp = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(128, activation='relu')(inp)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.2)(x)
out = tf.keras.layers.Dense(10, activation='softmax')(x)
model_func = tf.keras.Model(inputs=inp, outputs=out, name='func')

# --- Compare parameter counts ---
print("Sequential params: ", model_seq.count_params())
print("Functional params: ", model_func.count_params())
print("Params match:", model_seq.count_params() == model_func.count_params())

# Test on same data
x_test = np.random.randn(8, 64).astype('float32')
out_seq = model_seq(x_test, training=False)
out_func = model_func(x_test, training=False)
print(f"\nSequential output shape: {out_seq.shape}")
print(f"Functional output shape: {out_func.shape}")
print(f"Both sum to 1.0: {tf.reduce_all(tf.abs(tf.reduce_sum(out_seq, axis=1) - 1.0) < 1e-5).numpy()}")
Challenge
Build a Multi-Modal Model

Using the Functional API, build a model that takes three inputs:

  • Image input: (28, 28, 1) → Conv2D → GlobalAveragePooling2D → Dense(64)
  • Text input: (100,) → Embedding(5000, 32) → GlobalAveragePooling1D → Dense(64)
  • Metadata input: (10,) → Dense(32)

Concatenate all three branches, add two Dense layers, and output a 5-class softmax. Inspect the model with summary() to verify the architecture.

Functional API Multi-Modal Architecture

Conclusion & Next Steps

You now have a complete toolkit for building models in Keras. Let's recap the key concepts:

  • Sequential API — the simplest approach for linear stacks of layers; ideal for prototyping
  • Functional API — treats layers as functions on tensors; supports multi-input/output, skip connections, and shared layers
  • Model Subclassing — maximum flexibility with Python control flow in the forward pass; use sparingly
  • Built-in layers — Dense, Conv2D, LSTM, Embedding, BatchNorm, Dropout, and more cover most architectures
  • Custom layers — subclass tf.keras.layers.Layer with build() for shape-dependent weights and call() for the forward pass
  • Activations — ReLU for most hidden layers, GELU for Transformers, Sigmoid/Softmax for outputs
  • Regularization — L2 + Dropout + BatchNorm is the standard combination for preventing overfitting
  • Model inspection — summary(), plot_model(), get_weights(), and the trainable flag for transfer learning

Next in the Series

In Part 3: Training & Optimization, we'll learn how to train these models — optimizers (SGD, Adam, AdamW), learning rate schedules, loss functions, metrics, and custom training loops with GradientTape.