Table of Contents

  1. The Architecture Landscape
  2. Feedforward Neural Networks
  3. Convolutional Neural Networks
  4. RNNs & LSTMs
  5. Autoencoders
  6. Generative Adversarial Networks
  7. Transformers
  8. Choosing the Right Architecture

Part 5: Neural Network Architectures Overview

May 3, 2026 · Wasil Zafar · 25 min read

A comprehensive survey of all major neural network architectures — when to use each, how they connect, and where to dive deeper in this series and our framework-specific courses.

The Architecture Landscape

Neural networks are not one-size-fits-all. Over the past four decades, researchers have designed specialized architectures to tackle fundamentally different types of data and problems. Each architecture introduces a specific inductive bias — a structural assumption about the data that allows the network to learn more efficiently.

In Parts 1–4, we built the foundations: perceptrons, activation functions, backpropagation, and a complete feedforward network from scratch. Now we zoom out to see the full landscape of architectures that build upon those foundations.

Key Insight: Every architecture in this guide is built from the same core primitives you learned in Parts 2–4: neurons, weights, biases, activation functions, and gradient descent. The difference is how those primitives are connected and constrained.

The Architecture Family Tree

The following diagram shows how architectures evolved over time, each solving a limitation of its predecessors:

Neural Network Architecture Evolution
flowchart TD
    A[Perceptron<br/>1958] --> B[Feedforward NN<br/>1986]
    B --> C[CNN<br/>1989 - LeNet]
    B --> D[RNN<br/>1986]
    B --> E[Autoencoder<br/>1986]
    D --> F[LSTM<br/>1997]
    D --> G[GRU<br/>2014]
    E --> H[VAE<br/>2013]
    E --> I[GAN<br/>2014]
    F --> J[Seq2Seq<br/>2014]
    J --> K[Attention<br/>2015]
    K --> L[Transformer<br/>2017]
    L --> M[GPT / BERT<br/>2018-2019]
    L --> N[Vision Transformer<br/>2020]
    C --> N
    L --> O[Diffusion Models<br/>2020]

Each branch in this tree represents a distinct approach to encoding structure into the network. Let’s explore each major architecture family.

Feedforward Neural Networks (FNN)

The feedforward neural network (also called a multi-layer perceptron or MLP) is the architecture you already built in Part 4. Data flows in one direction: input → hidden layers → output. There are no cycles, no memory, no spatial awareness — just a stack of fully connected layers.

You already built this! In Part 4, we implemented a complete FNN from scratch to solve the XOR problem. The architecture below is the same concept scaled up to real-world datasets.

Best Use Cases

  • Tabular data — structured data with rows and columns (customer features, financial metrics)
  • Classification — binary or multi-class prediction
  • Regression — predicting continuous values
  • Function approximation — learning arbitrary input-output mappings

Limitations

  • No spatial awareness (pixel position is meaningless)
  • No memory (cannot process sequences)
  • Parameter-heavy for high-dimensional inputs (e.g., images)

Quick Demo: FNN on Iris Dataset

import numpy as np

# Synthetic stand-in for the Iris dataset (4 features, 3 classes);
# random data keeps the demo self-contained
np.random.seed(42)
X = np.random.randn(150, 4)  # 150 samples, 4 features
y_true = np.random.randint(0, 3, 150)  # 3 classes

# One-hot encode targets
Y = np.zeros((150, 3))
for i, label in enumerate(y_true):
    Y[i, label] = 1.0

# FNN: 4 inputs -> 8 hidden (ReLU) -> 3 outputs (softmax)
W1 = np.random.randn(4, 8) * 0.5
b1 = np.zeros((1, 8))
W2 = np.random.randn(8, 3) * 0.5
b2 = np.zeros((1, 3))

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Forward pass
hidden = relu(X @ W1 + b1)
output = softmax(hidden @ W2 + b2)

# Cross-entropy loss
loss = -np.mean(np.sum(Y * np.log(output + 1e-8), axis=1))
predictions = np.argmax(output, axis=1)
accuracy = np.mean(predictions == y_true)

print(f"Initial loss: {loss:.4f}")
print(f"Initial accuracy: {accuracy:.2%}")
print(f"Network shape: 4 -> 8 -> 3")
print(f"Total parameters: {4*8 + 8 + 8*3 + 3}")  # 67

This simple two-layer FNN has only 67 parameters. For tabular data with a moderate number of features, FNNs remain highly effective and should be your first choice before reaching for more complex architectures.

Convolutional Neural Networks (CNN)

While FNNs treat every input feature independently, Convolutional Neural Networks exploit the spatial structure of grid-like data. The key insight: nearby pixels in an image are more related to each other than distant ones, and the same pattern (edge, texture, shape) can appear anywhere in the image.

Two Revolutionary Ideas: Local Connectivity + Weight Sharing

Local connectivity: Each neuron connects only to a small region of the input (its receptive field), not the entire image. This dramatically reduces parameters.

Weight sharing: The same filter (set of weights) slides across the entire image. A vertical-edge detector works regardless of where the edge appears. This gives CNNs translation invariance.


The Convolution Operation

A convolution slides a small filter (typically $3 \times 3$ or $5 \times 5$) across the input, computing dot products at each position to produce a feature map. Multiple filters detect different features (edges, corners, textures) simultaneously.

import numpy as np

# Demonstrate 2D convolution on a small image
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0]
], dtype=np.float32)

# Vertical edge detection filter
filter_v = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
], dtype=np.float32)

# Manual convolution (no padding, stride=1)
output_size = image.shape[0] - filter_v.shape[0] + 1  # 3x3 output
feature_map = np.zeros((output_size, output_size))

for i in range(output_size):
    for j in range(output_size):
        region = image[i:i+3, j:j+3]
        feature_map[i, j] = np.sum(region * filter_v)

print("Input image (5x5):")
print(image)
print("\nVertical edge filter (3x3):")
print(filter_v)
print("\nFeature map (3x3):")
print(feature_map)
print(f"\nParameter reduction: {5*5} pixels, only {3*3} = 9 filter params")

CNN Pipeline

A typical CNN stacks convolutional layers (feature extraction) with pooling layers (spatial reduction), ending with fully connected layers for classification:

Standard CNN Architecture Pipeline
flowchart LR
    A[Input Image<br/>32x32x3] --> B[Conv + ReLU<br/>32x32x32]
    B --> C[Max Pool<br/>16x16x32]
    C --> D[Conv + ReLU<br/>16x16x64]
    D --> E[Max Pool<br/>8x8x64]
    E --> F[Flatten<br/>4096]
    F --> G[Dense + ReLU<br/>256]
    G --> H[Dense + Softmax<br/>10 classes]
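
The pooling step in this pipeline is worth a closer look. Below is a minimal, self-contained sketch (on a small made-up feature map, not one from the snippets above) of 2x2 max pooling with stride 2: each output keeps only the strongest activation in its window, halving each spatial dimension.

import numpy as np

# 2x2 max pooling with stride 2: halves each spatial dimension
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 0, 4, 1]
], dtype=np.float32)

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        window = feature_map[2*i:2*i+2, 2*j:2*j+2]
        pooled[i, j] = window.max()  # keep only the strongest activation

print("Feature map (4x4):")
print(feature_map)
print("\nMax-pooled (2x2):")
print(pooled)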

When to Use CNNs

  • Image classification (CIFAR-10, ImageNet)
  • Object detection (YOLO, Faster R-CNN)
  • Audio spectrograms (speech recognition)
  • Any grid-structured data with local spatial patterns

Deep Dive Available: We build CNNs from scratch in Part 6. For framework implementations, see PyTorch Mastery Part 5 and TensorFlow Mastery Part 6.

Recurrent Neural Networks (RNN) & LSTMs

Feedforward networks process each input independently — they have no concept of order or time. Recurrent Neural Networks solve this by introducing a hidden state that carries information from one time step to the next, creating a form of memory.

The Key Insight: Hidden State as Memory

At each time step $t$, an RNN takes two inputs: the current data $x_t$ and the previous hidden state $h_{t-1}$. It produces a new hidden state $h_t$ that encodes everything the network has “seen” so far:

$$h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)$$

import numpy as np

# Simple RNN cell demonstration
np.random.seed(42)

input_size = 4    # features per time step
hidden_size = 8   # hidden state dimension
seq_length = 5    # sequence length

# Weights
W_xh = np.random.randn(input_size, hidden_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
b_h = np.zeros((1, hidden_size))

# Input sequence (5 time steps, 4 features each)
X_seq = np.random.randn(seq_length, input_size)

# Forward pass through time
h = np.zeros((1, hidden_size))  # initial hidden state
hidden_states = []

for t in range(seq_length):
    x_t = X_seq[t:t+1]  # shape (1, 4)
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    hidden_states.append(h.copy())

print(f"Sequence length: {seq_length} time steps")
print(f"Hidden state dimension: {hidden_size}")
print(f"Final hidden state (encodes full sequence):")
print(f"  {h[0, :4]}...")
print(f"Parameters: {input_size*hidden_size + hidden_size*hidden_size + hidden_size}")

The Vanishing Gradient Problem

In theory, RNNs can remember information from arbitrarily far back. In practice, gradients flowing through many time steps either vanish (approach zero) or explode (approach infinity). This makes it extremely difficult for standard RNNs to learn long-range dependencies.

Vanishing Gradients: After ~10–20 time steps, the gradient signal becomes so small that early inputs have virtually no influence on the loss. The network “forgets” earlier parts of the sequence.
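
You can watch this happen numerically. The sketch below (an illustration using the same 0.1-scaled W_hh as the RNN snippet above, with a fixed representative hidden state) pushes a gradient backward through repeated tanh steps; since $\tanh'(z) = 1 - \tanh(z)^2 \le 1$, each step can only shrink the signal further.

import numpy as np

# Illustrative sketch: a gradient backpropagated through many tanh RNN steps
np.random.seed(42)

hidden_size = 8
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1

grad = np.ones(hidden_size)                # gradient arriving at the final time step
h = np.tanh(np.random.randn(hidden_size))  # representative hidden state in (-1, 1)

for t in range(1, 21):
    # One step of backprop through time: tanh'(z) = 1 - tanh(z)^2
    grad = W_hh.T @ (grad * (1 - h ** 2))
    if t % 5 == 0:
        print(f"After {t:2d} steps back: |grad| = {np.linalg.norm(grad):.2e}")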

LSTMs: The Gating Solution

Long Short-Term Memory (LSTM) networks solve the vanishing gradient problem by introducing a cell state — a highway that carries information across many time steps with minimal transformation — and three gates that control information flow:

  • Forget gate ($f_t$): What old information to discard
  • Input gate ($i_t$): What new information to store
  • Output gate ($o_t$): What to expose as the hidden state

The GRU (Gated Recurrent Unit) is a simplified variant with only two gates (reset and update), often performing comparably with fewer parameters.
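
Here is a minimal sketch of a single LSTM cell's forward pass, with small randomly initialized gate weights (an illustration, not the full trainable implementation we build in Part 7). Note how the cell state is updated only by elementwise gating, which is what keeps gradients flowing:

import numpy as np

# Minimal LSTM cell forward pass (illustrative sketch)
np.random.seed(42)

input_size, hidden_size = 4, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [x_t, h_prev] concatenated
W_f = np.random.randn(input_size + hidden_size, hidden_size) * 0.1  # forget gate
W_i = np.random.randn(input_size + hidden_size, hidden_size) * 0.1  # input gate
W_o = np.random.randn(input_size + hidden_size, hidden_size) * 0.1  # output gate
W_c = np.random.randn(input_size + hidden_size, hidden_size) * 0.1  # candidate update
b_f, b_i, b_o, b_c = [np.zeros((1, hidden_size)) for _ in range(4)]

def lstm_cell(x_t, h_prev, c_prev):
    concat = np.hstack([x_t, h_prev])
    f = sigmoid(concat @ W_f + b_f)        # forget gate: what old info to discard
    i = sigmoid(concat @ W_i + b_i)        # input gate: what new info to store
    o = sigmoid(concat @ W_o + b_o)        # output gate: what to expose
    c_tilde = np.tanh(concat @ W_c + b_c)  # candidate cell update
    c = f * c_prev + i * c_tilde           # cell state: the gradient "highway"
    h = o * np.tanh(c)                     # new hidden state
    return h, c

h = np.zeros((1, hidden_size))
c = np.zeros((1, hidden_size))
for t in range(5):
    x_t = np.random.randn(1, input_size)
    h, c = lstm_cell(x_t, h, c)

print(f"Final hidden state (first 4 dims): {np.round(h[0, :4], 3)}")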

When to Use RNNs/LSTMs

  • Time series forecasting (stock prices, weather)
  • Text generation (character-level or word-level)
  • Speech recognition (audio as sequential frames)
  • Music composition (note sequences)

Deep Dive Available: Part 7 builds RNNs and LSTMs from scratch. Framework implementations: PyTorch Part 6, TensorFlow Part 7.

Autoencoders

An autoencoder learns to compress data into a low-dimensional representation and then reconstruct it. The network is squeezed through a bottleneck — a layer much smaller than the input — forcing it to keep only the most important features.

Architecture: Encoder → Bottleneck → Decoder

The encoder maps input $x$ to a latent code $z$, and the decoder maps $z$ back to a reconstruction $\hat{x}$. The loss is the reconstruction error: $\mathcal{L} = ||x - \hat{x}||^2$.

import numpy as np

# Simple autoencoder demonstration
np.random.seed(42)

# Generate sample data: 50 samples, 10 features
X = np.random.randn(50, 10)

# Autoencoder: 10 -> 3 (bottleneck) -> 10
# Encoder weights
W_enc = np.random.randn(10, 3) * 0.3
b_enc = np.zeros((1, 3))
# Decoder weights
W_dec = np.random.randn(3, 10) * 0.3
b_dec = np.zeros((1, 10))

# Forward pass
latent = np.tanh(X @ W_enc + b_enc)       # Encode: 10 -> 3
reconstruction = latent @ W_dec + b_dec     # Decode: 3 -> 10

# Reconstruction error
mse = np.mean((X - reconstruction) ** 2)
compression_ratio = 10 / 3

print(f"Input dimension: 10")
print(f"Latent dimension: 3 (bottleneck)")
print(f"Compression ratio: {compression_ratio:.1f}x")
print(f"Reconstruction MSE (untrained): {mse:.4f}")
print(f"Latent representation shape: {latent.shape}")

Variants: The Autoencoder Family

Denoising Autoencoder: Add noise to input, train to reconstruct the clean version. Learns robust features.

Variational Autoencoder (VAE): Learns a probability distribution in latent space (not just a point). Enables smooth generation of new samples by sampling from the learned distribution — see the sketch just below.

Sparse Autoencoder: Adds a sparsity penalty to the bottleneck, forcing most latent neurons to be inactive. Learns more interpretable features.

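The VAE's sampling step deserves one concrete line of math: the reparameterization trick, covered in depth in Part 8. The sketch below uses hypothetical encoder outputs (mu and log_var are made-up values, not produced by a real encoder) to show how a differentiable sample is drawn and how the KL regularizer is computed.

import numpy as np

# VAE reparameterization trick (illustrative sketch with made-up encoder outputs)
np.random.seed(42)

latent_dim = 3
mu = np.array([[0.5, -0.2, 0.1]])         # hypothetical encoder mean
log_var = np.array([[-1.0, -0.5, -2.0]])  # hypothetical encoder log-variance

eps = np.random.randn(1, latent_dim)      # noise from a standard normal
z = mu + np.exp(0.5 * log_var) * eps      # differentiable sample: z = mu + sigma * eps

# KL divergence from the approximate posterior to the standard normal prior
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

print(f"Sampled latent z: {np.round(z[0], 3)}")
print(f"KL divergence to prior: {kl:.4f}")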

When to Use Autoencoders

  • Anomaly detection — high reconstruction error = anomaly
  • Dimensionality reduction — non-linear alternative to PCA
  • Denoising — removing noise from images or signals
  • Generative modeling (VAEs) — generating new realistic samples

Deep Dive Available: Part 8 implements autoencoders and GANs from scratch, including the VAE reparameterization trick.

Generative Adversarial Networks (GANs)

While autoencoders learn to reconstruct, GANs learn to create. Introduced by Ian Goodfellow in 2014, GANs frame generation as a game between two competing networks:

  • Generator ($G$): Takes random noise and produces fake samples (e.g., images)
  • Discriminator ($D$): Tries to distinguish real samples from the generator’s fakes

Training alternates: the generator improves at fooling the discriminator, while the discriminator improves at catching fakes. At equilibrium, the generator produces samples indistinguishable from real data.

The Minimax Game

The GAN objective is a minimax optimization:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The discriminator wants to maximize this (correctly classify real vs fake), while the generator wants to minimize it (make $D(G(z))$ close to 1).
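
As a quick numerical illustration — a sketch with made-up discriminator scores, not a trained model — we can evaluate this value function directly, along with the non-saturating generator loss $-\log D(G(z))$ that the original GAN paper recommends for stronger early gradients:

import numpy as np

# Evaluating the GAN objective for hypothetical discriminator outputs
eps = 1e-8
D_real = np.array([0.9, 0.8, 0.95])  # made-up scores on real samples
D_fake = np.array([0.1, 0.2, 0.05])  # made-up scores on generated samples

# The discriminator wants to maximize this value function
V = np.mean(np.log(D_real + eps)) + np.mean(np.log(1 - D_fake + eps))

# The generator wants D(G(z)) near 1; the non-saturating form -log D(G(z))
# gives stronger gradients early in training than minimizing log(1 - D(G(z)))
G_loss = -np.mean(np.log(D_fake + eps))

print(f"Value function V(D, G): {V:.4f}")
print(f"Non-saturating generator loss: {G_loss:.4f}")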

GAN Applications in Practice

Image Generation: StyleGAN produces photorealistic human faces that don’t exist. Progressive growing enables high-resolution output (1024×1024).

Style Transfer: CycleGAN translates between domains (horses to zebras, summer to winter) without paired training data.

Super-Resolution: SRGAN upscales low-resolution images with realistic detail — far beyond simple interpolation.

Data Augmentation: Generate synthetic training data for rare classes in imbalanced datasets.

Training Challenge: GANs are notoriously difficult to train. Mode collapse (generator produces limited variety), training instability, and oscillation between generator/discriminator are common issues. Techniques like Wasserstein loss and gradient penalty help stabilize training.

Transformers

The Transformer architecture, introduced in the 2017 paper “Attention Is All You Need,” has become the dominant architecture in modern AI. It powers GPT, BERT, DALL-E, Stable Diffusion, and virtually every large language model. The key breakthrough: replacing recurrence with self-attention.

The Key Insight: Self-Attention

Instead of processing sequences one token at a time (like RNNs), transformers process all tokens simultaneously. Self-attention computes a weighted relationship between every pair of tokens, allowing each token to “attend to” any other token regardless of distance.

For each token, we compute three vectors: Query ($Q$), Key ($K$), and Value ($V$). Attention scores are:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

import numpy as np

# Self-attention mechanism demonstration
np.random.seed(42)

# Sequence of 4 tokens, each with embedding dimension 8
seq_length = 4
d_model = 8
d_k = 8  # key/query dimension

# Random token embeddings (simulating word embeddings)
X = np.random.randn(seq_length, d_model)

# Weight matrices for Q, K, V
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1

# Compute Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Scaled dot-product attention
scores = Q @ K.T / np.sqrt(d_k)

# Softmax (stable version)
exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Weighted sum of values
output = attention_weights @ V

print(f"Input shape: ({seq_length}, {d_model})")
print(f"Attention weights (each token attends to all others):")
print(np.round(attention_weights, 3))
print(f"\nOutput shape: {output.shape}")
print(f"Key advantage: ALL tokens processed in parallel!")

Transformer Architecture

Transformer Block Architecture
flowchart TD
    A[Input Embeddings<br/>+ Positional Encoding] --> B[Multi-Head<br/>Self-Attention]
    B --> C["Add & LayerNorm"]
    A --> C
    C --> D[Feed-Forward<br/>Network]
    D --> E["Add & LayerNorm"]
    C --> E
    E --> F[Output /<br/>Next Block]
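
The "multi-head" part of the diagram is a small extension of the single-head attention code above: project into several lower-dimensional subspaces, attend in each independently, then concatenate and project back. A minimal sketch follows — the head count and output projection W_O are illustrative choices, not taken from the earlier snippet:

import numpy as np

# Multi-head self-attention sketch: 2 heads over an 8-dimensional model
np.random.seed(42)

seq_length, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads  # each head works in a 4-dim subspace

X = np.random.randn(seq_length, d_model)
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1
W_O = np.random.randn(d_model, d_model) * 0.1  # output projection

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Project, then split into heads: (n_heads, seq_length, d_head)
Q = (X @ W_Q).reshape(seq_length, n_heads, d_head).transpose(1, 0, 2)
K = (X @ W_K).reshape(seq_length, n_heads, d_head).transpose(1, 0, 2)
V = (X @ W_V).reshape(seq_length, n_heads, d_head).transpose(1, 0, 2)

# Scaled dot-product attention within each head, in parallel
weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
heads = weights @ V  # (n_heads, seq_length, d_head)

# Concatenate the heads and project back to d_model
concat = heads.transpose(1, 0, 2).reshape(seq_length, d_model)
output = concat @ W_O

print(f"Per-head attention weights shape: {weights.shape}")  # (2, 4, 4)
print(f"Output shape: {output.shape}")                       # (4, 8)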

Why Transformers Dominate

  • Parallelism: Unlike RNNs, all positions are processed simultaneously (massive GPU speedup)
  • Long-range dependencies: Any token can attend to any other token directly (no vanishing gradients over distance)
  • Scalability: Performance improves predictably with more data and parameters (scaling laws)
  • Versatility: Same architecture works for text (GPT), images (ViT), audio (Whisper), and multimodal (GPT-4)

When to Use Transformers

  • Natural language processing — text classification, generation, translation
  • Long sequences — where RNNs struggle with vanishing gradients
  • Multi-modal tasks — combining text, images, audio
  • When you have large datasets — transformers are data-hungry

Deep Dive Available: Part 9 builds attention and a transformer block from scratch. Framework implementations: PyTorch Part 7, TensorFlow Part 8.

Choosing the Right Architecture

With so many architectures available, how do you choose? The decision primarily depends on your data type and problem structure. Here is a practical decision guide:

Architecture Decision Guide:
  • Tabular data (structured features) → FNN or gradient boosting (XGBoost)
  • Images / spatial data → CNN (or Vision Transformer for large datasets)
  • Text / language → Transformer (BERT, GPT)
  • Short sequences / time series → RNN/LSTM (or Transformer)
  • Generation (images) → GAN or Diffusion Model
  • Generation (text) → Autoregressive Transformer (GPT-style)
  • Anomaly detection → Autoencoder
  • Dimensionality reduction → Autoencoder or VAE

Quick Reference

Architecture Comparison Summary

Architecture   Inductive Bias     Best For                   Series Part
FNN            None (universal)   Tabular, classification    Part 4
CNN            Spatial locality   Images, grids              Part 6
RNN/LSTM       Sequential order   Time series, text          Part 7
Autoencoder    Compression        Anomalies, reduction       Part 8
GAN            Adversarial        Image generation           Part 8
Transformer    Global attention   Language, multimodal       Part 9

What’s Next

This overview gives you the map — now it’s time to explore the territory. Starting with Part 6, we’ll implement each architecture from scratch, building deep intuition for how convolutions, recurrence, attention, and adversarial training actually work at the code level.

Next in the Series

In Part 6: CNNs Deep Dive — Convolutions from Scratch, we’ll implement convolutional layers, pooling, and a complete CNN in NumPy to classify images — building every operation from first principles.