Why Sequences Need Special Networks
Feedforward neural networks (FNNs) process each input independently — they have no concept of order or context. When you feed an FNN a word, it has no idea what came before it. But language, time series, audio, and countless other real-world signals are inherently sequential: the meaning of “bank” depends on whether the preceding word was “river” or “savings.”
Let us demonstrate this failure concretely. We will try to predict the next value in a simple sequence using a feedforward network that sees only the current input:
import numpy as np

# Simple sequence: predict next value based on pattern
# Pattern: 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, ...
sequence = [0, 1, 2, 3] * 10  # repeating pattern

# FNN approach: predict next value from ONLY current value
# Problem: when input is 0, output should be 1
#          when input is 3, output should be 0
# But what if the sequence was 0, 2, 4, 6, 0, 2, 4, 6...?
# Then input 0 should predict 2, not 1!
# The FNN cannot distinguish these cases without memory
current_values = np.array(sequence[:-1])
next_values = np.array(sequence[1:])

# Simple linear model (single-layer FNN, no bias for simplicity):
# y = wx, fit w by closed-form least squares (minimize MSE)
# For pattern [0,1,2,3,0,1,2,3...]:
#   When x=0, y should be 1
#   When x=3, y should be 0 (wraps around!)
# A linear model CANNOT learn this wrapping behavior
w = np.sum(current_values * next_values) / (np.sum(current_values**2) + 1e-8)
predictions = current_values * w
mse = np.mean((predictions - next_values)**2)

print(f"Linear FNN weight: {w:.4f}")
print(f"MSE on sequence prediction: {mse:.4f}")
print("\nSample predictions vs actual:")
for i in range(8):
    print(f"  Input: {current_values[i]} -> Predicted: {predictions[i]:.2f}, Actual: {next_values[i]}")
print("\nThe FNN fails because it cannot remember previous context!")
print("It sees '3' and must predict '0', but also sees '0' and must predict '1'")
print("Without memory of WHERE we are in the sequence, this is impossible.")
The fundamental limitation is clear: without memory of previous inputs, the network cannot learn patterns that depend on context. This is why we need recurrent architectures.
RNN Architecture and Memory
A Recurrent Neural Network solves the memory problem by maintaining a hidden state $h_t$ that carries information from previous time steps. At each step, the network combines the current input $x_t$ with the previous hidden state $h_{t-1}$:
$$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$$
$$y_t = W_{hy} \cdot h_t + b_y$$
where:
- $W_{xh}$ — input-to-hidden weights
- $W_{hh}$ — hidden-to-hidden weights (the “recurrence”)
- $W_{hy}$ — hidden-to-output weights
- $b_h$, $b_y$ — biases
- $\tanh$ — squashes values to $(-1, 1)$, providing non-linearity
We implement a single RNN cell and step through a sequence to observe how hidden state evolves:
import numpy as np

class RNNCell:
    """A single RNN cell that processes one time step (hidden update only)."""

    def __init__(self, input_size, hidden_size):
        # Xavier initialization for stable training
        self.W_xh = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / (input_size + hidden_size))
        self.W_hh = np.random.randn(hidden_size, hidden_size) * np.sqrt(2.0 / (hidden_size + hidden_size))
        self.b_h = np.zeros((1, hidden_size))

    def forward(self, x_t, h_prev):
        """Process one time step.

        Args:
            x_t: input at time t, shape (batch_size, input_size)
            h_prev: hidden state from previous step, shape (batch_size, hidden_size)

        Returns:
            h_next: new hidden state, shape (batch_size, hidden_size)
        """
        # Combine input and previous hidden state
        h_next = np.tanh(x_t @ self.W_xh + h_prev @ self.W_hh + self.b_h)
        return h_next

# Create an RNN cell: input_size=3, hidden_size=4
np.random.seed(42)
cell = RNNCell(input_size=3, hidden_size=4)

# Process a sequence of 5 time steps
sequence_length = 5
batch_size = 1
input_size = 3
hidden_size = 4

# Random input sequence
X = np.random.randn(sequence_length, batch_size, input_size)

# Initial hidden state (zeros)
h = np.zeros((batch_size, hidden_size))

print("Processing sequence through RNN cell:")
print(f"{'Step':<6}{'Input (3 dims)':<25}{'Hidden State (4 dims)'}")
print("-" * 70)
for t in range(sequence_length):
    h = cell.forward(X[t], h)
    input_str = np.array2string(X[t][0], precision=2, separator=', ')
    hidden_str = np.array2string(h[0], precision=4, separator=', ')
    print(f"t={t:<4}{input_str:<25}{hidden_str}")

print("\nNotice: hidden state carries forward information from ALL previous inputs.")
print("At t=4, h encodes information from inputs at t=0, t=1, t=2, t=3, and t=4.")
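The claim that the final hidden state encodes earlier inputs can be checked directly with a finite-difference probe: perturb only the first input and see whether the last hidden state moves. This is a minimal, self-contained sketch (the tiny RNN step and all shapes here are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: show that the hidden state at the LAST step depends on the
# FIRST input, by perturbing x_0 and measuring the change in h_4.
rng = np.random.default_rng(42)
W_xh = rng.standard_normal((3, 4)) * 0.5  # input-to-hidden
W_hh = rng.standard_normal((4, 4)) * 0.5  # hidden-to-hidden

def run(X):
    h = np.zeros(4)
    for x_t in X:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

X = rng.standard_normal((5, 3))
h_base = run(X)

X_pert = X.copy()
X_pert[0] += 0.5              # nudge ONLY the input at t=0
h_pert = run(X_pert)

diff = np.linalg.norm(h_pert - h_base)
print(f"Change in final hidden state from perturbing x_0: {diff:.6f}")
# A nonzero difference confirms h_4 still carries information from x_0.
```

The same probe applied to an FNN (which sees only the current input) would show exactly zero change, since x_0 never reaches the final prediction.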
flowchart LR
subgraph "Time Step t-2"
X0["x(t-2)"] --> H0["h(t-2)"]
end
subgraph "Time Step t-1"
X1["x(t-1)"] --> H1["h(t-1)"]
end
subgraph "Time Step t"
X2["x(t)"] --> H2["h(t)"]
end
subgraph "Time Step t+1"
X3["x(t+1)"] --> H3["h(t+1)"]
end
H0 -->|"W_hh"| H1
H1 -->|"W_hh"| H2
H2 -->|"W_hh"| H3
H2 --> Y2["y(t)"]
H3 --> Y3["y(t+1)"]
When we “unroll” the RNN through time, we see that it is essentially a very deep network where the same weights are shared at every layer (time step). This weight sharing is both the power and the weakness of RNNs, as we will discover in the vanishing gradient section.
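Weight sharing can be made concrete: the unrolled RNN over T steps is just one step function applied T times with the same matrices, so depth grows with sequence length while the parameter count does not. A small sketch (weights and shapes are arbitrary for illustration):

```python
import numpy as np

# Sketch: "unrolling" an RNN is applying ONE step function repeatedly
# with the SAME weights at every time step.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((3, 4)) * 0.5
W_hh = rng.standard_normal((4, 4)) * 0.5

def step(x_t, h):
    return np.tanh(x_t @ W_xh + h @ W_hh)

X = rng.standard_normal((6, 3))

# Loop form (what we write) ...
h = np.zeros(4)
for x_t in X:
    h = step(x_t, h)

# ... equals the explicit 6-layer composition (what BPTT sees):
h_unrolled = step(X[5], step(X[4], step(X[3], step(X[2], step(X[1], step(X[0], np.zeros(4)))))))

print(np.allclose(h, h_unrolled))  # True: the same W_xh, W_hh at every "layer"
```

Because the same `W_hh` appears at every layer of this implicit deep network, its gradient is a sum over all time steps, and the repeated multiplication by `W_hh` during backpropagation is what causes the vanishing/exploding behavior discussed next.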
Building an RNN from Scratch
Now let us build a complete RNN that can be trained on character-level text prediction. Given a sequence of characters, the network learns to predict the next character. This is the simplest form of a language model.
import numpy as np

class CharRNN:
    """Character-level RNN for next-character prediction."""

    def __init__(self, vocab_size, hidden_size, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.lr = learning_rate
        # Weight initialization (Xavier)
        scale = np.sqrt(2.0 / (vocab_size + hidden_size))
        self.W_xh = np.random.randn(vocab_size, hidden_size) * scale
        self.W_hh = np.random.randn(hidden_size, hidden_size) * scale
        self.W_hy = np.random.randn(hidden_size, vocab_size) * scale
        self.b_h = np.zeros((1, hidden_size))
        self.b_y = np.zeros((1, vocab_size))

    def forward(self, inputs, h_prev):
        """Forward pass through entire sequence.

        Args:
            inputs: list of integer character indices
            h_prev: initial hidden state

        Returns:
            outputs: list of output probabilities at each step
            hiddens: list of hidden states
        """
        hiddens = [h_prev]
        outputs = []
        for t in range(len(inputs)):
            # One-hot encode input
            x_t = np.zeros((1, self.vocab_size))
            x_t[0, inputs[t]] = 1.0
            # RNN step
            h_t = np.tanh(x_t @ self.W_xh + hiddens[-1] @ self.W_hh + self.b_h)
            y_t = h_t @ self.W_hy + self.b_y
            # Softmax for probabilities
            exp_y = np.exp(y_t - np.max(y_t))
            probs = exp_y / np.sum(exp_y)
            hiddens.append(h_t)
            outputs.append(probs)
        return outputs, hiddens

    def loss(self, outputs, targets):
        """Cross-entropy loss."""
        total_loss = 0.0
        for t in range(len(targets)):
            total_loss -= np.log(outputs[t][0, targets[t]] + 1e-8)
        return total_loss / len(targets)

# Training on "hello world" pattern
text = "hello world " * 20
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
vocab_size = len(chars)
print(f"Vocabulary: {chars}")
print(f"Vocab size: {vocab_size}")

# Create the RNN
np.random.seed(42)
rnn = CharRNN(vocab_size=vocab_size, hidden_size=32, learning_rate=0.01)
h = np.zeros((1, 32))

# Show predictions before training (forward pass only for demonstration)
test_input = [char_to_idx[c] for c in "hello worl"]
outputs, _ = rnn.forward(test_input, h)
print("\nBefore training - predicting after 'hello worl':")
last_probs = outputs[-1][0]
top_3 = np.argsort(last_probs)[-3:][::-1]
for idx in top_3:
    print(f"  '{idx_to_char[idx]}' -> probability: {last_probs[idx]:.4f}")

print("\nThe untrained RNN outputs nearly uniform probabilities.")
print("After training with backpropagation through time (BPTT),")
print("it would learn that 'd' follows 'hello worl' with high probability.")
The key insight is that the hidden state $h_t$ accumulates information about everything seen so far. After processing “hello worl”, the hidden state encodes enough context to predict “d” as the next character.
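To make the “after training” claim concrete, here is a minimal BPTT sketch: a self-contained tiny RNN (the same forward equations as `CharRNN`, rebuilt so the block runs on its own) trained with plain SGD and elementwise gradient clipping on the same repeating text. The learning rate, clip threshold, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

# Minimal BPTT sketch: a tiny character RNN trained with plain SGD.
# Self-contained (rebuilds the weights rather than reusing CharRNN);
# hyperparameters here are illustrative, not tuned.
np.random.seed(42)
text = "hello world " * 20
chars = sorted(set(text))
c2i = {c: i for i, c in enumerate(chars)}
V, H, lr = len(chars), 32, 0.01

W_xh = np.random.randn(V, H) * 0.1
W_hh = np.random.randn(H, H) * 0.1
W_hy = np.random.randn(H, V) * 0.1
b_h, b_y = np.zeros(H), np.zeros(V)

def train_step(inputs, targets):
    """One forward pass + BPTT backward pass + in-place SGD update."""
    xs, hs, ps = {}, {-1: np.zeros(H)}, {}
    loss = 0.0
    # Forward: same equations as CharRNN.forward
    for t, idx in enumerate(inputs):
        x = np.zeros(V); x[idx] = 1.0
        xs[t] = x
        hs[t] = np.tanh(x @ W_xh + hs[t - 1] @ W_hh + b_h)
        y = hs[t] @ W_hy + b_y
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()
        loss -= np.log(ps[t][targets[t]] + 1e-8)
    # Backward through time
    grads = [np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy),
             np.zeros_like(b_h), np.zeros_like(b_y)]
    dW_xh, dW_hh, dW_hy, db_h, db_y = grads
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1.0   # d(loss)/d(logits)
        dW_hy += np.outer(hs[t], dy); db_y += dy
        dh = dy @ W_hy.T + dh_next                  # gradient into h_t
        dz = dh * (1.0 - hs[t] ** 2)                # through tanh
        dW_xh += np.outer(xs[t], dz); dW_hh += np.outer(hs[t - 1], dz); db_h += dz
        dh_next = dz @ W_hh.T                       # flows to previous time step
    # Clip elementwise, then SGD update (in place on the globals)
    for W, dW in zip([W_xh, W_hh, W_hy, b_h, b_y], grads):
        np.clip(dW, -5, 5, out=dW)
        W -= lr * dW
    return loss / len(inputs)

data = [c2i[c] for c in text]
losses = [train_step(data[:24], data[1:25]) for _ in range(500)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")  # loss should fall as the pattern is memorized
```

Note how `dh_next = dz @ W_hh.T` is exactly the repeated multiplication by $W_{hh}^T$ that the next section identifies as the source of vanishing gradients.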
The Vanishing Gradient Problem
During backpropagation through time (BPTT), gradients must flow backward through every time step. At each step, the gradient is multiplied by the derivative of $\tanh$ and by $W_{hh}^T$. Since $\tanh'(x) \leq 1$ and, with typical small initializations, the largest singular value of $W_{hh}$ is below 1, these multiplications compound:
$$\frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=2}^{T} W_{hh}^T \cdot \text{diag}(\tanh'(z_t))$$
If the largest singular value of $W_{hh}$ is less than 1, this product shrinks exponentially with $T$.
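The singular value bound is easy to verify numerically: the norm of a backpropagated vector after $T$ multiplications by $W_{hh}^T$ can never exceed $\sigma_{\max}^T$ times its initial norm (the $\tanh'$ factors, all at most 1, only shrink it further). A small sketch with an arbitrary small-scale initialization:

```python
import numpy as np

# Sketch: gradient norm through T steps is bounded by sigma_max**T.
# (tanh' factors are omitted here; they only make the decay faster.)
rng = np.random.default_rng(0)
n, T = 64, 50
W_hh = rng.standard_normal((n, n)) * 0.05
sigma_max = np.linalg.svd(W_hh, compute_uv=False)[0]  # largest singular value

v = np.ones(n)
for _ in range(T):
    v = W_hh.T @ v  # backward pass through one time step

ratio = np.linalg.norm(v) / np.linalg.norm(np.ones(n))
print(f"sigma_max = {sigma_max:.3f}")
print(f"actual decay after {T} steps: {ratio:.2e}")
print(f"upper bound sigma_max**T:     {sigma_max**T:.2e}")
```

The actual decay is usually far below the bound, because a generic vector is not aligned with the top singular direction at every step.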
import numpy as np

def demonstrate_vanishing_gradients(hidden_size=64, num_steps=50):
    """Show how gradients vanish over many time steps."""
    np.random.seed(42)
    # Initialize W_hh with values that cause vanishing
    # (most random initializations will cause this)
    W_hh = np.random.randn(hidden_size, hidden_size) * 0.5

    # Simulate gradient flow backward through time
    gradient = np.ones(hidden_size)  # start with a gradient of all ones
    gradient_norms = [np.linalg.norm(gradient)]
    for t in range(num_steps):
        # At each step, gradient is multiplied by W_hh^T and the tanh derivative
        # (tanh' is at most 1, typically well below 1 for random activations)
        tanh_derivative = np.random.uniform(0.2, 0.8, hidden_size)
        gradient = (W_hh.T @ gradient) * tanh_derivative
        gradient_norms.append(np.linalg.norm(gradient))

    print("Gradient Norm at Each Time Step (flowing backward):")
    print("-" * 55)
    steps_to_show = [0, 5, 10, 15, 20, 30, 40, 49]
    for step in steps_to_show:
        norm = gradient_norms[step]
        bar = "#" * min(int(norm * 50), 50)
        print(f"  Step {step:>2}: norm = {norm:.2e} {bar}")

    print(f"\n  Initial gradient norm: {gradient_norms[0]:.4f}")
    print(f"  After 50 steps: {gradient_norms[-1]:.2e}")
    print(f"  Ratio: {gradient_norms[-1] / gradient_norms[0]:.2e}")
    print("\n  The gradient has essentially vanished!")
    print("  Information from 50 steps ago cannot influence current weights.")
    return gradient_norms

gradient_norms = demonstrate_vanishing_gradients()

# Also demonstrate exploding gradients
print("\n" + "=" * 55)
print("Now with a LARGER W_hh (causes exploding gradients):")
print("=" * 55)
np.random.seed(42)
W_hh_large = np.random.randn(64, 64) * 1.5
gradient = np.ones(64)
for t in range(20):
    tanh_derivative = np.random.uniform(0.5, 1.0, 64)
    gradient = (W_hh_large.T @ gradient) * tanh_derivative
    if t % 5 == 0:
        print(f"  Step {t:>2}: gradient norm = {np.linalg.norm(gradient):.2e}")
print("\n  Exploding gradients cause NaN values and unstable training.")
print("  Solution: gradient clipping (cap the gradient norm at a threshold).")
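The clipping fix mentioned in the output above is a one-liner in practice: rescale the gradient whenever its norm exceeds a threshold, so the update direction is preserved but its magnitude is capped. A minimal sketch (the helper name `clip_by_norm` and the threshold are illustrative):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# An "exploded" gradient gets capped; a small one passes through untouched.
big = np.ones(64) * 100.0    # norm = 800
small = np.ones(64) * 0.01   # norm = 0.08
print(f"{np.linalg.norm(clip_by_norm(big)):.4f}")    # 5.0000
print(f"{np.linalg.norm(clip_by_norm(small)):.4f}")  # 0.0800
```

Clipping tames exploding gradients but does nothing for vanishing ones, which is why the gated architectures below are still needed.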
This is the fundamental motivation for gated architectures. We need a mechanism that allows gradients to flow unchanged over long distances — a “highway” for gradient information.
LSTM: Long Short-Term Memory
The LSTM (Hochreiter & Schmidhuber, 1997) introduces a cell state $C_t$ that runs through time like a conveyor belt, and three gates that control what information flows in and out:
The Three Gates
Forget Gate — decides what to discard from cell state:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
Input Gate — decides what new information to store:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
Output Gate — decides what to output from cell state:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
Cell State Update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Hidden State:
$$h_t = o_t \odot \tanh(C_t)$$
The $\sigma$ (sigmoid) function outputs values between 0 and 1, acting as a soft switch. The $\odot$ denotes element-wise multiplication.
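The “soft switch” behaviour is easy to see with concrete numbers: where $f_t = 1$ and $i_t = 0$, the update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ copies $C_{t-1}$ unchanged, which is exactly the gradient highway. A tiny sketch with hand-picked gate values:

```python
import numpy as np

# Concrete gate arithmetic: C_t = f_t * C_prev + i_t * C_tilde
C_prev = np.array([0.5, -1.2, 2.0])
C_tilde = np.array([9.0, 9.0, 9.0])  # candidate (would overwrite everything)

# Gates fully open/closed per element (real gates output values in between):
f_t = np.array([1.0, 1.0, 0.0])      # remember, remember, forget
i_t = np.array([0.0, 0.0, 1.0])      # ignore, ignore, write

C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # [ 0.5 -1.2  9. ]: first two copied exactly, third replaced
```

Because the copy path is a plain element-wise multiplication by $f_t$ (no repeated matrix multiply, no $\tanh$ squashing), gradients flowing along the cell state avoid the compounding shrinkage that plagues the vanilla RNN.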
flowchart TD
subgraph "LSTM Cell"
direction TB
CT_prev["C(t-1)"] -->|"cell state"| FORGET["x forget gate f_t"]
FORGET --> ADD["+ addition"]
IG["input gate i_t"] --> MUL_I["x"]
CAND["candidate C_tilde"] --> MUL_I
MUL_I --> ADD
ADD --> CT["C(t)"]
CT --> TANH_C["tanh"]
TANH_C --> MUL_O["x"]
OG["output gate o_t"] --> MUL_O
MUL_O --> HT["h(t)"]
end
XT["x(t)"] --> IG
XT --> OG
XT --> FORGET
XT --> CAND
HT_prev["h(t-1)"] --> IG
HT_prev --> OG
HT_prev --> FORGET
HT_prev --> CAND
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

class LSTMCell:
    """LSTM cell implementation from scratch."""

    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        concat_size = input_size + hidden_size
        # Initialize all four gate weight matrices
        scale = np.sqrt(2.0 / concat_size)
        self.W_f = np.random.randn(concat_size, hidden_size) * scale  # forget gate
        self.W_i = np.random.randn(concat_size, hidden_size) * scale  # input gate
        self.W_c = np.random.randn(concat_size, hidden_size) * scale  # candidate
        self.W_o = np.random.randn(concat_size, hidden_size) * scale  # output gate
        self.b_f = np.ones((1, hidden_size))  # bias forget gate to 1 (remember by default)
        self.b_i = np.zeros((1, hidden_size))
        self.b_c = np.zeros((1, hidden_size))
        self.b_o = np.zeros((1, hidden_size))

    def forward(self, x_t, h_prev, c_prev):
        """Process one time step.

        Args:
            x_t: input, shape (batch, input_size)
            h_prev: previous hidden state, shape (batch, hidden_size)
            c_prev: previous cell state, shape (batch, hidden_size)

        Returns:
            h_t: new hidden state
            c_t: new cell state
            gates: dict of gate values for inspection
        """
        # Concatenate input and previous hidden state
        concat = np.concatenate([h_prev, x_t], axis=1)
        # Compute all four gates
        f_t = sigmoid(concat @ self.W_f + self.b_f)      # forget gate
        i_t = sigmoid(concat @ self.W_i + self.b_i)      # input gate
        c_tilde = np.tanh(concat @ self.W_c + self.b_c)  # candidate values
        o_t = sigmoid(concat @ self.W_o + self.b_o)      # output gate
        # Update cell state
        c_t = f_t * c_prev + i_t * c_tilde
        # Compute hidden state
        h_t = o_t * np.tanh(c_t)
        gates = {'forget': f_t, 'input': i_t, 'output': o_t, 'candidate': c_tilde}
        return h_t, c_t, gates

# Demonstrate LSTM maintaining information over 100+ time steps
np.random.seed(42)
input_size = 5
hidden_size = 16
lstm = LSTMCell(input_size, hidden_size)

# Process a long sequence
seq_length = 120
X = np.random.randn(seq_length, 1, input_size) * 0.1
h = np.zeros((1, hidden_size))
c = np.zeros((1, hidden_size))

# Store cell state norms to show information is maintained
cell_norms = []
hidden_norms = []
for t in range(seq_length):
    h, c, gates = lstm.forward(X[t], h, c)
    cell_norms.append(np.linalg.norm(c))
    hidden_norms.append(np.linalg.norm(h))

print("LSTM Cell State Norm Over 120 Time Steps:")
print("-" * 50)
for t in [0, 10, 20, 40, 60, 80, 100, 119]:
    bar = "#" * min(int(cell_norms[t] * 5), 40)
    print(f"  Step {t:>3}: cell_norm = {cell_norms[t]:.4f} {bar}")

print("\n  Cell state remains stable (no vanishing)!")
print(f"  Initial: {cell_norms[0]:.4f}, Final: {cell_norms[-1]:.4f}")
print("  The forget gate preserves information across all 120 steps.")

# Show gate values at a sample step (replay the sequence up to t=60)
print("\n  Sample gate values at step 60:")
h_sample = np.zeros((1, hidden_size))
c_sample = np.zeros((1, hidden_size))
for t in range(61):
    h_sample, c_sample, gates = lstm.forward(X[t], h_sample, c_sample)
print(f"  Forget gate mean: {gates['forget'].mean():.3f} (close to 1 = remembering)")
print(f"  Input gate mean:  {gates['input'].mean():.3f}")
print(f"  Output gate mean: {gates['output'].mean():.3f}")
Compare this to a vanilla RNN: the LSTM cell state norm stays bounded and stable over 120+ steps, whereas a vanilla RNN’s hidden state either decays to near-zero or explodes.
import numpy as np

def compare_rnn_vs_lstm_memory(seq_length=100):
    """Compare information retention: vanilla RNN vs LSTM."""
    np.random.seed(42)
    hidden_size = 16
    input_size = 5

    # === Vanilla RNN ===
    W_xh = np.random.randn(input_size, hidden_size) * 0.3
    W_hh = np.random.randn(hidden_size, hidden_size) * 0.3
    b_h = np.zeros((1, hidden_size))

    # Inject a strong signal at step 0, then feed noise
    X = np.random.randn(seq_length, 1, input_size) * 0.01
    X[0] = np.ones((1, input_size)) * 2.0  # strong signal at t=0

    # Track how much of the initial signal survives
    h_rnn = np.zeros((1, hidden_size))
    rnn_norms = []
    for t in range(seq_length):
        h_rnn = np.tanh(X[t] @ W_xh + h_rnn @ W_hh + b_h)
        rnn_norms.append(np.linalg.norm(h_rnn))

    # === LSTM ===
    def sigmoid(x):
        return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))

    concat_size = input_size + hidden_size
    scale = np.sqrt(2.0 / concat_size)
    W_f = np.random.randn(concat_size, hidden_size) * scale
    W_i = np.random.randn(concat_size, hidden_size) * scale
    W_c = np.random.randn(concat_size, hidden_size) * scale
    W_o = np.random.randn(concat_size, hidden_size) * scale
    b_f = np.ones((1, hidden_size))  # forget bias = 1
    b_i = np.zeros((1, hidden_size))
    b_c = np.zeros((1, hidden_size))
    b_o = np.zeros((1, hidden_size))

    h_lstm = np.zeros((1, hidden_size))
    c_lstm = np.zeros((1, hidden_size))
    lstm_norms = []
    for t in range(seq_length):
        concat = np.concatenate([h_lstm, X[t]], axis=1)
        f_t = sigmoid(concat @ W_f + b_f)
        i_t = sigmoid(concat @ W_i + b_i)
        c_tilde = np.tanh(concat @ W_c + b_c)
        o_t = sigmoid(concat @ W_o + b_o)
        c_lstm = f_t * c_lstm + i_t * c_tilde
        h_lstm = o_t * np.tanh(c_lstm)
        lstm_norms.append(np.linalg.norm(c_lstm))

    print("Information Retention: RNN vs LSTM")
    print("(Strong signal at t=0, then noise)")
    print("=" * 55)
    print(f"{'Step':<8}{'RNN hidden norm':<20}{'LSTM cell norm'}")
    print("-" * 55)
    for t in [0, 5, 10, 20, 30, 50, 70, 99]:
        print(f"  {t:<6}{rnn_norms[t]:<20.6f}{lstm_norms[t]:.6f}")

    print(f"\n  RNN signal after 100 steps:  {rnn_norms[-1]:.6f}")
    print(f"  LSTM signal after 100 steps: {lstm_norms[-1]:.6f}")
    print(f"  LSTM retains {lstm_norms[-1] / max(lstm_norms[0], 1e-8) * 100:.1f}% of initial signal")

compare_rnn_vs_lstm_memory()
GRU: Gated Recurrent Unit
The GRU (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single state vector, and combining the forget and input gates into a single update gate. It uses only two gates:
Reset Gate — controls how much past information to forget:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
Update Gate — controls the balance between old and new information:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
Candidate hidden state:
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
Final hidden state:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

class GRUCell:
    """GRU cell implementation from scratch."""

    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        concat_size = input_size + hidden_size
        # Only 3 weight matrices (vs 4 in the LSTM)
        scale = np.sqrt(2.0 / concat_size)
        self.W_r = np.random.randn(concat_size, hidden_size) * scale  # reset gate
        self.W_z = np.random.randn(concat_size, hidden_size) * scale  # update gate
        self.W_h = np.random.randn(concat_size, hidden_size) * scale  # candidate
        self.b_r = np.zeros((1, hidden_size))
        self.b_z = np.zeros((1, hidden_size))
        self.b_h = np.zeros((1, hidden_size))

    def forward(self, x_t, h_prev):
        """Process one time step.

        Args:
            x_t: input, shape (batch, input_size)
            h_prev: previous hidden state, shape (batch, hidden_size)

        Returns:
            h_t: new hidden state
        """
        concat = np.concatenate([h_prev, x_t], axis=1)
        # Reset and update gates
        r_t = sigmoid(concat @ self.W_r + self.b_r)
        z_t = sigmoid(concat @ self.W_z + self.b_z)
        # Candidate hidden state (reset gate applied to h_prev)
        concat_reset = np.concatenate([r_t * h_prev, x_t], axis=1)
        h_tilde = np.tanh(concat_reset @ self.W_h + self.b_h)
        # Interpolate between old and new
        h_t = (1 - z_t) * h_prev + z_t * h_tilde
        return h_t

# Compare parameter counts
input_size = 32
hidden_size = 64
concat_size = input_size + hidden_size
lstm_params = 4 * (concat_size * hidden_size + hidden_size)  # 4 gates, each with W and b
gru_params = 3 * (concat_size * hidden_size + hidden_size)   # 3 gates

print("Parameter Count Comparison")
print("=" * 45)
print(f"  Input size:  {input_size}")
print(f"  Hidden size: {hidden_size}")
print(f"  Concat size: {concat_size}")
print()
print(f"  LSTM parameters: {lstm_params:,}")
print(f"    - 4 weight matrices: 4 x ({concat_size} x {hidden_size}) = {4 * concat_size * hidden_size:,}")
print(f"    - 4 bias vectors: 4 x {hidden_size} = {4 * hidden_size}")
print()
print(f"  GRU parameters: {gru_params:,}")
print(f"    - 3 weight matrices: 3 x ({concat_size} x {hidden_size}) = {3 * concat_size * hidden_size:,}")
print(f"    - 3 bias vectors: 3 x {hidden_size} = {3 * hidden_size}")
print()
print(f"  GRU saves {lstm_params - gru_params:,} parameters ({(1 - gru_params / lstm_params) * 100:.1f}% fewer)")

# Quick test: process a sequence through the GRU
np.random.seed(42)
gru = GRUCell(input_size=5, hidden_size=16)
X = np.random.randn(20, 1, 5) * 0.5
h = np.zeros((1, 16))
print("\nGRU processing 20-step sequence:")
for t in range(20):
    h = gru.forward(X[t], h)
    if t in (0, 5, 10, 15, 19):
        print(f"  Step {t:>2}: hidden norm = {np.linalg.norm(h):.4f}")
Framework Implementations
Now that we understand the internals, let us see how modern frameworks handle all of this in just a few lines. The from-scratch implementations above are equivalent to what happens inside these framework calls:
import numpy as np
# What took us ~40 lines in NumPy becomes 2 lines in PyTorch:
#
# import torch.nn as nn
#
# # Create an LSTM layer
# lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
#
# # Process a sequence
# output, (h_n, c_n) = lstm(input_tensor)
#
# That single nn.LSTM call handles:
# - All 4 gate computations (forget, input, output, candidate)
# - Cell state and hidden state management
# - Stacking multiple layers (num_layers=2)
# - Bidirectional processing (bidirectional=True)
# - Dropout between layers
# - CUDA/GPU acceleration
# - Optimized cuDNN kernels
# Parameter comparison: our scratch LSTM vs PyTorch.
# Note: PyTorch stores separate input-to-hidden and hidden-to-hidden
# weight matrices plus TWO bias vectors per layer (bias_ih and bias_hh),
# so its count is slightly higher than our concatenated version.
input_size = 32
hidden_size = 64

# Layer 1: W_ih (4H x input), W_hh (4H x H), bias_ih (4H), bias_hh (4H)
layer1 = 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
# Layer 2: its input is the hidden state from layer 1
layer2 = 4 * (hidden_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
total_params = layer1 + layer2

print("PyTorch nn.LSTM(input_size=32, hidden_size=64, num_layers=2) parameters:")
print(f"  Layer 1: {layer1:,} params")
print(f"  Layer 2: {layer2:,} params")
print(f"  Total:   {total_params:,} params")
print(f"\n All managed automatically with backprop, GPU support,")
print(f" and optimized CUDA kernels for 10-100x speedup.")
What’s Next
We have now built the three foundational recurrent architectures from scratch: vanilla RNNs, LSTMs, and GRUs. You understand exactly how hidden states carry memory, why gradients vanish in deep sequences, and how gating mechanisms provide a solution.
In the next article, we shift from sequence modeling to generative architectures:
Next in the Series
In Part 8: Autoencoders & GANs Deep Dive, we will build autoencoders for dimensionality reduction, variational autoencoders for generation, and generative adversarial networks from scratch, understanding the minimax game that produces synthetic data.