Why Sequences Need Special Networks
Feedforward neural networks (FNNs) process each input independently — they have no concept of order or context. When you feed an FNN a word, it has no idea what came before it. But language, time series, audio, and countless other real-world signals are inherently sequential: the meaning of “bank” depends on whether the preceding word was “river” or “savings.”
Let us demonstrate this failure concretely. We will try to predict the next value in a simple sequence using a feedforward network that sees only the current input:
import numpy as np

# Simple sequence: predict next value based on pattern
# Pattern: 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, ...
sequence = [0, 1, 2, 3] * 10  # repeating pattern

# FNN approach: predict next value from ONLY current value
# Problem: when input is 0, output should be 1
#          when input is 3, output should be 0
# But what if the sequence was 0, 2, 4, 6, 0, 2, 4, 6...?
# Then input 0 should predict 2, not 1!
# The FNN cannot distinguish these cases without memory
current_values = np.array(sequence[:-1])
next_values = np.array(sequence[1:])

# Simple linear model (single-layer FNN, no bias for simplicity):
# y = wx, fit w by closed-form least squares (minimize MSE)
# For pattern [0,1,2,3,0,1,2,3...]:
#   When x=0, y should be 1
#   When x=3, y should be 0 (wraps around!)
# A linear model CANNOT learn this wrapping behavior
w = np.sum(current_values * next_values) / (np.sum(current_values**2) + 1e-8)
predictions = current_values * w
mse = np.mean((predictions - next_values)**2)

print(f"Linear FNN weight: {w:.4f}")
print(f"MSE on sequence prediction: {mse:.4f}")
print("\nSample predictions vs actual:")
for i in range(8):
    print(f"  Input: {current_values[i]} -> Predicted: {predictions[i]:.2f}, Actual: {next_values[i]}")
print("\nThe FNN fails because it cannot remember previous context!")
print("It sees '3' and must predict '0', but also sees '0' and must predict '1'")
print("Without memory of WHERE we are in the sequence, this is impossible.")
The fundamental limitation is clear: without memory of previous inputs, the network cannot learn patterns that depend on context. This is why we need recurrent architectures.
RNN Architecture and Memory
A Recurrent Neural Network solves the memory problem by maintaining a hidden state $h_t$ that carries information from previous time steps. At each step, the network combines the current input $x_t$ with the previous hidden state $h_{t-1}$:
$$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$$
$$y_t = W_{hy} \cdot h_t + b_y$$
where:
- $W_{xh}$ — input-to-hidden weights
- $W_{hh}$ — hidden-to-hidden weights (the “recurrence”)
- $W_{hy}$ — hidden-to-output weights
- $b_h$, $b_y$ — biases
- $\tanh$ — squashes values to $(-1, 1)$, providing non-linearity
We implement a single RNN cell and step through a sequence to observe how hidden state evolves:
import numpy as np

class RNNCell:
    """A single RNN cell that processes one time step (hidden update only)."""

    def __init__(self, input_size, hidden_size):
        # Xavier initialization for stable training
        self.W_xh = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / (input_size + hidden_size))
        self.W_hh = np.random.randn(hidden_size, hidden_size) * np.sqrt(2.0 / (hidden_size + hidden_size))
        self.b_h = np.zeros((1, hidden_size))

    def forward(self, x_t, h_prev):
        """Process one time step.

        Args:
            x_t: input at time t, shape (batch_size, input_size)
            h_prev: hidden state from previous step, shape (batch_size, hidden_size)

        Returns:
            h_next: new hidden state, shape (batch_size, hidden_size)
        """
        # Combine input and previous hidden state
        h_next = np.tanh(x_t @ self.W_xh + h_prev @ self.W_hh + self.b_h)
        return h_next

# Create an RNN cell: input_size=3, hidden_size=4
np.random.seed(42)
cell = RNNCell(input_size=3, hidden_size=4)

# Process a sequence of 5 time steps
sequence_length = 5
batch_size = 1
input_size = 3
hidden_size = 4

# Random input sequence
X = np.random.randn(sequence_length, batch_size, input_size)

# Initial hidden state (zeros)
h = np.zeros((batch_size, hidden_size))

print("Processing sequence through RNN cell:")
print(f"{'Step':<6}{'Input (3 dims)':<25}{'Hidden State (4 dims)'}")
print("-" * 70)
for t in range(sequence_length):
    h = cell.forward(X[t], h)
    input_str = np.array2string(X[t][0], precision=2, separator=', ')
    hidden_str = np.array2string(h[0], precision=4, separator=', ')
    print(f"t={t:<4}{input_str:<25}{hidden_str}")

print("\nNotice: hidden state carries forward information from ALL previous inputs.")
print("At t=4, h encodes information from inputs at t=0, t=1, t=2, t=3, and t=4.")
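The claim that the final hidden state encodes earlier inputs can be checked directly with a finite-difference probe: perturb only the first input and see whether the last hidden state moves. This is a minimal, self-contained sketch (the tiny RNN step and all shapes here are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: show that the hidden state at the LAST step depends on the
# FIRST input, by perturbing x_0 and measuring the change in h_4.
rng = np.random.default_rng(42)
W_xh = rng.standard_normal((3, 4)) * 0.5  # input-to-hidden
W_hh = rng.standard_normal((4, 4)) * 0.5  # hidden-to-hidden

def run(X):
    h = np.zeros(4)
    for x_t in X:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

X = rng.standard_normal((5, 3))
h_base = run(X)

X_pert = X.copy()
X_pert[0] += 0.5              # nudge ONLY the input at t=0
h_pert = run(X_pert)

diff = np.linalg.norm(h_pert - h_base)
print(f"Change in final hidden state from perturbing x_0: {diff:.6f}")
# A nonzero difference confirms h_4 still carries information from x_0.
```

The same probe applied to an FNN (which sees only the current input) would show exactly zero change, since x_0 never reaches the final prediction.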
flowchart LR
subgraph "Time Step t-2"
X0["x(t-2)"] --> H0["h(t-2)"]
end
subgraph "Time Step t-1"
X1["x(t-1)"] --> H1["h(t-1)"]
end
subgraph "Time Step t"
X2["x(t)"] --> H2["h(t)"]
end
subgraph "Time Step t+1"
X3["x(t+1)"] --> H3["h(t+1)"]
end
H0 -->|"W_hh"| H1
H1 -->|"W_hh"| H2
H2 -->|"W_hh"| H3
H2 --> Y2["y(t)"]
H3 --> Y3["y(t+1)"]
When we “unroll” the RNN through time, we see that it is essentially a very deep network where the same weights are shared at every layer (time step). This weight sharing is both the power and the weakness of RNNs, as we will discover in the vanishing gradient section.
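Weight sharing can be made concrete: the unrolled RNN over T steps is just one step function applied T times with the same matrices, so depth grows with sequence length while the parameter count does not. A small sketch (weights and shapes are arbitrary for illustration):

```python
import numpy as np

# Sketch: "unrolling" an RNN is applying ONE step function repeatedly
# with the SAME weights at every time step.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((3, 4)) * 0.5
W_hh = rng.standard_normal((4, 4)) * 0.5

def step(x_t, h):
    return np.tanh(x_t @ W_xh + h @ W_hh)

X = rng.standard_normal((6, 3))

# Loop form (what we write) ...
h = np.zeros(4)
for x_t in X:
    h = step(x_t, h)

# ... equals the explicit 6-layer composition (what BPTT sees):
h_unrolled = step(X[5], step(X[4], step(X[3], step(X[2], step(X[1], step(X[0], np.zeros(4)))))))

print(np.allclose(h, h_unrolled))  # True: the same W_xh, W_hh at every "layer"
```

Because the same `W_hh` appears at every layer of this implicit deep network, its gradient is a sum over all time steps, and the repeated multiplication by `W_hh` during backpropagation is what causes the vanishing/exploding behavior discussed next.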
Building an RNN from Scratch
Now let us build a complete RNN that can be trained on character-level text prediction. Given a sequence of characters, the network learns to predict the next character. This is the simplest form of a language model.
import numpy as np

class CharRNN:
    """Character-level RNN for next-character prediction."""

    def __init__(self, vocab_size, hidden_size, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.lr = learning_rate
        # Weight initialization (Xavier)
        scale = np.sqrt(2.0 / (vocab_size + hidden_size))
        self.W_xh = np.random.randn(vocab_size, hidden_size) * scale
        self.W_hh = np.random.randn(hidden_size, hidden_size) * scale
        self.W_hy = np.random.randn(hidden_size, vocab_size) * scale
        self.b_h = np.zeros((1, hidden_size))
        self.b_y = np.zeros((1, vocab_size))

    def forward(self, inputs, h_prev):
        """Forward pass through entire sequence.

        Args:
            inputs: list of integer character indices
            h_prev: initial hidden state

        Returns:
            outputs: list of output probabilities at each step
            hiddens: list of hidden states
        """
        hiddens = [h_prev]
        outputs = []
        for t in range(len(inputs)):
            # One-hot encode input
            x_t = np.zeros((1, self.vocab_size))
            x_t[0, inputs[t]] = 1.0
            # RNN step
            h_t = np.tanh(x_t @ self.W_xh + hiddens[-1] @ self.W_hh + self.b_h)
            y_t = h_t @ self.W_hy + self.b_y
            # Softmax for probabilities
            exp_y = np.exp(y_t - np.max(y_t))
            probs = exp_y / np.sum(exp_y)
            hiddens.append(h_t)
            outputs.append(probs)
        return outputs, hiddens

    def loss(self, outputs, targets):
        """Cross-entropy loss."""
        total_loss = 0.0
        for t in range(len(targets)):
            total_loss -= np.log(outputs[t][0, targets[t]] + 1e-8)
        return total_loss / len(targets)

# Training on "hello world" pattern
text = "hello world " * 20
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
vocab_size = len(chars)
print(f"Vocabulary: {chars}")
print(f"Vocab size: {vocab_size}")

# Create the RNN
np.random.seed(42)
rnn = CharRNN(vocab_size=vocab_size, hidden_size=32, learning_rate=0.01)
h = np.zeros((1, 32))

# Show predictions before training (forward pass only for demonstration)
test_input = [char_to_idx[c] for c in "hello worl"]
outputs, _ = rnn.forward(test_input, h)
print("\nBefore training - predicting after 'hello worl':")
last_probs = outputs[-1][0]
top_3 = np.argsort(last_probs)[-3:][::-1]
for idx in top_3:
    print(f"  '{idx_to_char[idx]}' -> probability: {last_probs[idx]:.4f}")

print("\nThe untrained RNN outputs nearly uniform probabilities.")
print("After training with backpropagation through time (BPTT),")
print("it would learn that 'd' follows 'hello worl' with high probability.")
The key insight is that the hidden state $h_t$ accumulates information about everything seen so far. After processing “hello worl”, the hidden state encodes enough context to predict “d” as the next character.
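To make the “after training” claim concrete, here is a minimal BPTT sketch: a self-contained tiny RNN (the same forward equations as `CharRNN`, rebuilt so the block runs on its own) trained with plain SGD and elementwise gradient clipping on the same repeating text. The learning rate, clip threshold, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

# Minimal BPTT sketch: a tiny character RNN trained with plain SGD.
# Self-contained (rebuilds the weights rather than reusing CharRNN);
# hyperparameters here are illustrative, not tuned.
np.random.seed(42)
text = "hello world " * 20
chars = sorted(set(text))
c2i = {c: i for i, c in enumerate(chars)}
V, H, lr = len(chars), 32, 0.01

W_xh = np.random.randn(V, H) * 0.1
W_hh = np.random.randn(H, H) * 0.1
W_hy = np.random.randn(H, V) * 0.1
b_h, b_y = np.zeros(H), np.zeros(V)

def train_step(inputs, targets):
    """One forward pass + BPTT backward pass + in-place SGD update."""
    xs, hs, ps = {}, {-1: np.zeros(H)}, {}
    loss = 0.0
    # Forward: same equations as CharRNN.forward
    for t, idx in enumerate(inputs):
        x = np.zeros(V); x[idx] = 1.0
        xs[t] = x
        hs[t] = np.tanh(x @ W_xh + hs[t - 1] @ W_hh + b_h)
        y = hs[t] @ W_hy + b_y
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()
        loss -= np.log(ps[t][targets[t]] + 1e-8)
    # Backward through time
    grads = [np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy),
             np.zeros_like(b_h), np.zeros_like(b_y)]
    dW_xh, dW_hh, dW_hy, db_h, db_y = grads
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1.0   # d(loss)/d(logits)
        dW_hy += np.outer(hs[t], dy); db_y += dy
        dh = dy @ W_hy.T + dh_next                  # gradient into h_t
        dz = dh * (1.0 - hs[t] ** 2)                # through tanh
        dW_xh += np.outer(xs[t], dz); dW_hh += np.outer(hs[t - 1], dz); db_h += dz
        dh_next = dz @ W_hh.T                       # flows to previous time step
    # Clip elementwise, then SGD update (in place on the globals)
    for W, dW in zip([W_xh, W_hh, W_hy, b_h, b_y], grads):
        np.clip(dW, -5, 5, out=dW)
        W -= lr * dW
    return loss / len(inputs)

data = [c2i[c] for c in text]
losses = [train_step(data[:24], data[1:25]) for _ in range(500)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")  # loss should fall as the pattern is memorized
```

Note how `dh_next = dz @ W_hh.T` is exactly the repeated multiplication by $W_{hh}^T$ that the next section identifies as the source of vanishing gradients.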
The Vanishing Gradient Problem
During backpropagation through time (BPTT), gradients must flow backward through every time step. At each step, the gradient is multiplied by the derivative of $\tanh$ and by $W_{hh}^T$. Since $\tanh'(x) \leq 1$ and, with typical small initializations, the largest singular value of $W_{hh}$ is below 1, these multiplications compound:
$$\frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=2}^{T} W_{hh}^T \cdot \text{diag}(\tanh'(z_t))$$
If the largest singular value of $W_{hh}$ is less than 1, this product shrinks exponentially with $T$.
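The singular value bound is easy to verify numerically: the norm of a backpropagated vector after $T$ multiplications by $W_{hh}^T$ can never exceed $\sigma_{\max}^T$ times its initial norm (the $\tanh'$ factors, all at most 1, only shrink it further). A small sketch with an arbitrary small-scale initialization:

```python
import numpy as np

# Sketch: gradient norm through T steps is bounded by sigma_max**T.
# (tanh' factors are omitted here; they only make the decay faster.)
rng = np.random.default_rng(0)
n, T = 64, 50
W_hh = rng.standard_normal((n, n)) * 0.05
sigma_max = np.linalg.svd(W_hh, compute_uv=False)[0]  # largest singular value

v = np.ones(n)
for _ in range(T):
    v = W_hh.T @ v  # backward pass through one time step

ratio = np.linalg.norm(v) / np.linalg.norm(np.ones(n))
print(f"sigma_max = {sigma_max:.3f}")
print(f"actual decay after {T} steps: {ratio:.2e}")
print(f"upper bound sigma_max**T:     {sigma_max**T:.2e}")
```

The actual decay is usually far below the bound, because a generic vector is not aligned with the top singular direction at every step.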
import numpy as np

def demonstrate_vanishing_gradients(hidden_size=64, num_steps=50):
    """Show how gradients vanish over many time steps."""
    np.random.seed(42)
    # Initialize W_hh with values that cause vanishing
    # (most random initializations will cause this)
    W_hh = np.random.randn(hidden_size, hidden_size) * 0.5

    # Simulate gradient flow backward through time
    gradient = np.ones(hidden_size)  # start with a gradient of all ones
    gradient_norms = [np.linalg.norm(gradient)]
    for t in range(num_steps):
        # At each step, gradient is multiplied by W_hh^T and the tanh derivative
        # (tanh' is at most 1, typically well below 1 for random activations)
        tanh_derivative = np.random.uniform(0.2, 0.8, hidden_size)
        gradient = (W_hh.T @ gradient) * tanh_derivative
        gradient_norms.append(np.linalg.norm(gradient))

    print("Gradient Norm at Each Time Step (flowing backward):")
    print("-" * 55)
    steps_to_show = [0, 5, 10, 15, 20, 30, 40, 49]
    for step in steps_to_show:
        norm = gradient_norms[step]
        bar = "#" * min(int(norm * 50), 50)
        print(f"  Step {step:>2}: norm = {norm:.2e} {bar}")

    print(f"\n  Initial gradient norm: {gradient_norms[0]:.4f}")
    print(f"  After 50 steps: {gradient_norms[-1]:.2e}")
    print(f"  Ratio: {gradient_norms[-1] / gradient_norms[0]:.2e}")
    print("\n  The gradient has essentially vanished!")
    print("  Information from 50 steps ago cannot influence current weights.")
    return gradient_norms

gradient_norms = demonstrate_vanishing_gradients()

# Also demonstrate exploding gradients
print("\n" + "=" * 55)
print("Now with a LARGER W_hh (causes exploding gradients):")
print("=" * 55)
np.random.seed(42)
W_hh_large = np.random.randn(64, 64) * 1.5
gradient = np.ones(64)
for t in range(20):
    tanh_derivative = np.random.uniform(0.5, 1.0, 64)
    gradient = (W_hh_large.T @ gradient) * tanh_derivative
    if t % 5 == 0:
        print(f"  Step {t:>2}: gradient norm = {np.linalg.norm(gradient):.2e}")
print("\n  Exploding gradients cause NaN values and unstable training.")
print("  Solution: gradient clipping (cap the gradient norm at a threshold).")
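The clipping fix mentioned in the output above is a one-liner in practice: rescale the gradient whenever its norm exceeds a threshold, so the update direction is preserved but its magnitude is capped. A minimal sketch (the helper name `clip_by_norm` and the threshold are illustrative):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# An "exploded" gradient gets capped; a small one passes through untouched.
big = np.ones(64) * 100.0    # norm = 800
small = np.ones(64) * 0.01   # norm = 0.08
print(f"{np.linalg.norm(clip_by_norm(big)):.4f}")    # 5.0000
print(f"{np.linalg.norm(clip_by_norm(small)):.4f}")  # 0.0800
```

Clipping tames exploding gradients but does nothing for vanishing ones, which is why the gated architectures below are still needed.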
This is the fundamental motivation for gated architectures. We need a mechanism that allows gradients to flow unchanged over long distances — a “highway” for gradient information.
LSTM: Long Short-Term Memory
The LSTM (Hochreiter & Schmidhuber, 1997) introduces a cell state $C_t$ that runs through time like a conveyor belt, and three gates that control what information flows in and out:
The Three Gates
Forget Gate — decides what to discard from cell state:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
Input Gate — decides what new information to store:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
Output Gate — decides what to output from cell state:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
Cell State Update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Hidden State:
$$h_t = o_t \odot \tanh(C_t)$$
The $\sigma$ (sigmoid) function outputs values between 0 and 1, acting as a soft switch. The $\odot$ denotes element-wise multiplication.
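The “soft switch” behaviour is easy to see with concrete numbers: where $f_t = 1$ and $i_t = 0$, the update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ copies $C_{t-1}$ unchanged, which is exactly the gradient highway. A tiny sketch with hand-picked gate values:

```python
import numpy as np

# Concrete gate arithmetic: C_t = f_t * C_prev + i_t * C_tilde
C_prev = np.array([0.5, -1.2, 2.0])
C_tilde = np.array([9.0, 9.0, 9.0])  # candidate (would overwrite everything)

# Gates fully open/closed per element (real gates output values in between):
f_t = np.array([1.0, 1.0, 0.0])      # remember, remember, forget
i_t = np.array([0.0, 0.0, 1.0])      # ignore, ignore, write

C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # [ 0.5 -1.2  9. ]: first two copied exactly, third replaced
```

Because the copy path is a plain element-wise multiplication by $f_t$ (no repeated matrix multiply, no $\tanh$ squashing), gradients flowing along the cell state avoid the compounding shrinkage that plagues the vanilla RNN.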
flowchart TD
subgraph "LSTM Cell"
direction TB
CT_prev["C(t-1)"] -->|"cell state"| FORGET["x forget gate f_t"]
FORGET --> ADD["+ addition"]
IG["input gate i_t"] --> MUL_I["x"]
CAND["candidate C_tilde"] --> MUL_I
MUL_I --> ADD
ADD --> CT["C(t)"]
CT --> TANH_C["tanh"]
TANH_C --> MUL_O["x"]
OG["output gate o_t"] --> MUL_O
MUL_O --> HT["h(t)"]
end
XT["x(t)"] --> IG
XT --> OG
XT --> FORGET
XT --> CAND
HT_prev["h(t-1)"] --> IG
HT_prev --> OG
HT_prev --> FORGET
HT_prev --> CAND
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

class LSTMCell:
    """LSTM cell implementation from scratch."""

    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        concat_size = input_size + hidden_size
        # Initialize all four gate weight matrices
        scale = np.sqrt(2.0 / concat_size)
        self.W_f = np.random.randn(concat_size, hidden_size) * scale  # forget gate
        self.W_i = np.random.randn(concat_size, hidden_size) * scale  # input gate
        self.W_c = np.random.randn(concat_size, hidden_size) * scale  # candidate
        self.W_o = np.random.randn(concat_size, hidden_size) * scale  # output gate
        self.b_f = np.ones((1, hidden_size))  # bias forget gate to 1 (remember by default)
        self.b_i = np.zeros((1, hidden_size))
        self.b_c = np.zeros((1, hidden_size))
        self.b_o = np.zeros((1, hidden_size))

    def forward(self, x_t, h_prev, c_prev):
        """Process one time step.

        Args:
            x_t: input, shape (batch, input_size)
            h_prev: previous hidden state, shape (batch, hidden_size)
            c_prev: previous cell state, shape (batch, hidden_size)

        Returns:
            h_t: new hidden state
            c_t: new cell state
            gates: dict of gate values for inspection
        """
        # Concatenate input and previous hidden state
        concat = np.concatenate([h_prev, x_t], axis=1)
        # Compute all four gates
        f_t = sigmoid(concat @ self.W_f + self.b_f)      # forget gate
        i_t = sigmoid(concat @ self.W_i + self.b_i)      # input gate
        c_tilde = np.tanh(concat @ self.W_c + self.b_c)  # candidate values
        o_t = sigmoid(concat @ self.W_o + self.b_o)      # output gate
        # Update cell state
        c_t = f_t * c_prev + i_t * c_tilde
        # Compute hidden state
        h_t = o_t * np.tanh(c_t)
        gates = {'forget': f_t, 'input': i_t, 'output': o_t, 'candidate': c_tilde}
        return h_t, c_t, gates

# Demonstrate LSTM maintaining information over 100+ time steps
np.random.seed(42)
input_size = 5
hidden_size = 16
lstm = LSTMCell(input_size, hidden_size)

# Process a long sequence
seq_length = 120
X = np.random.randn(seq_length, 1, input_size) * 0.1
h = np.zeros((1, hidden_size))
c = np.zeros((1, hidden_size))

# Store cell state norms to show information is maintained
cell_norms = []
hidden_norms = []
for t in range(seq_length):
    h, c, gates = lstm.forward(X[t], h, c)
    cell_norms.append(np.linalg.norm(c))
    hidden_norms.append(np.linalg.norm(h))

print("LSTM Cell State Norm Over 120 Time Steps:")
print("-" * 50)
for t in [0, 10, 20, 40, 60, 80, 100, 119]:
    bar = "#" * min(int(cell_norms[t] * 5), 40)
    print(f"  Step {t:>3}: cell_norm = {cell_norms[t]:.4f} {bar}")

print("\n  Cell state remains stable (no vanishing)!")
print(f"  Initial: {cell_norms[0]:.4f}, Final: {cell_norms[-1]:.4f}")
print("  The forget gate preserves information across all 120 steps.")

# Show gate values at a sample step (replay the sequence up to t=60)
print("\n  Sample gate values at step 60:")
h_sample = np.zeros((1, hidden_size))
c_sample = np.zeros((1, hidden_size))
for t in range(61):
    h_sample, c_sample, gates = lstm.forward(X[t], h_sample, c_sample)
print(f"  Forget gate mean: {gates['forget'].mean():.3f} (close to 1 = remembering)")
print(f"  Input gate mean:  {gates['input'].mean():.3f}")
print(f"  Output gate mean: {gates['output'].mean():.3f}")
Compare this to a vanilla RNN: the LSTM cell state norm stays bounded and stable over 120+ steps, whereas a vanilla RNN’s hidden state either decays to near-zero or explodes.
import numpy as np

def compare_rnn_vs_lstm_memory(seq_length=100):
    """Compare information retention: vanilla RNN vs LSTM."""
    np.random.seed(42)
    hidden_size = 16
    input_size = 5

    # === Vanilla RNN ===
    W_xh = np.random.randn(input_size, hidden_size) * 0.3
    W_hh = np.random.randn(hidden_size, hidden_size) * 0.3
    b_h = np.zeros((1, hidden_size))

    # Inject a strong signal at step 0, then feed noise
    X = np.random.randn(seq_length, 1, input_size) * 0.01
    X[0] = np.ones((1, input_size)) * 2.0  # strong signal at t=0

    # Track how much of the initial signal survives
    h_rnn = np.zeros((1, hidden_size))
    rnn_norms = []
    for t in range(seq_length):
        h_rnn = np.tanh(X[t] @ W_xh + h_rnn @ W_hh + b_h)
        rnn_norms.append(np.linalg.norm(h_rnn))

    # === LSTM ===
    def sigmoid(x):
        return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))

    concat_size = input_size + hidden_size
    scale = np.sqrt(2.0 / concat_size)
    W_f = np.random.randn(concat_size, hidden_size) * scale
    W_i = np.random.randn(concat_size, hidden_size) * scale
    W_c = np.random.randn(concat_size, hidden_size) * scale
    W_o = np.random.randn(concat_size, hidden_size) * scale
    b_f = np.ones((1, hidden_size))  # forget bias = 1
    b_i = np.zeros((1, hidden_size))
    b_c = np.zeros((1, hidden_size))
    b_o = np.zeros((1, hidden_size))

    h_lstm = np.zeros((1, hidden_size))
    c_lstm = np.zeros((1, hidden_size))
    lstm_norms = []
    for t in range(seq_length):
        concat = np.concatenate([h_lstm, X[t]], axis=1)
        f_t = sigmoid(concat @ W_f + b_f)
        i_t = sigmoid(concat @ W_i + b_i)
        c_tilde = np.tanh(concat @ W_c + b_c)
        o_t = sigmoid(concat @ W_o + b_o)
        c_lstm = f_t * c_lstm + i_t * c_tilde
        h_lstm = o_t * np.tanh(c_lstm)
        lstm_norms.append(np.linalg.norm(c_lstm))

    print("Information Retention: RNN vs LSTM")
    print("(Strong signal at t=0, then noise)")
    print("=" * 55)
    print(f"{'Step':<8}{'RNN hidden norm':<20}{'LSTM cell norm'}")
    print("-" * 55)
    for t in [0, 5, 10, 20, 30, 50, 70, 99]:
        print(f"  {t:<6}{rnn_norms[t]:<20.6f}{lstm_norms[t]:.6f}")

    print(f"\n  RNN signal after 100 steps:  {rnn_norms[-1]:.6f}")
    print(f"  LSTM signal after 100 steps: {lstm_norms[-1]:.6f}")
    print(f"  LSTM retains {lstm_norms[-1] / max(lstm_norms[0], 1e-8) * 100:.1f}% of initial signal")

compare_rnn_vs_lstm_memory()
GRU: Gated Recurrent Unit
The GRU (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single state vector, and combining the forget and input gates into a single update gate. It uses only two gates:
Reset Gate — controls how much past information to forget:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
Update Gate — controls the balance between old and new information:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
Candidate hidden state:
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
Final hidden state:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

class GRUCell:
    """GRU cell implementation from scratch."""

    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        concat_size = input_size + hidden_size
        # Only 3 weight matrices (vs 4 in the LSTM)
        scale = np.sqrt(2.0 / concat_size)
        self.W_r = np.random.randn(concat_size, hidden_size) * scale  # reset gate
        self.W_z = np.random.randn(concat_size, hidden_size) * scale  # update gate
        self.W_h = np.random.randn(concat_size, hidden_size) * scale  # candidate
        self.b_r = np.zeros((1, hidden_size))
        self.b_z = np.zeros((1, hidden_size))
        self.b_h = np.zeros((1, hidden_size))

    def forward(self, x_t, h_prev):
        """Process one time step.

        Args:
            x_t: input, shape (batch, input_size)
            h_prev: previous hidden state, shape (batch, hidden_size)

        Returns:
            h_t: new hidden state
        """
        concat = np.concatenate([h_prev, x_t], axis=1)
        # Reset and update gates
        r_t = sigmoid(concat @ self.W_r + self.b_r)
        z_t = sigmoid(concat @ self.W_z + self.b_z)
        # Candidate hidden state (reset gate applied to h_prev)
        concat_reset = np.concatenate([r_t * h_prev, x_t], axis=1)
        h_tilde = np.tanh(concat_reset @ self.W_h + self.b_h)
        # Interpolate between old and new
        h_t = (1 - z_t) * h_prev + z_t * h_tilde
        return h_t

# Compare parameter counts
input_size = 32
hidden_size = 64
concat_size = input_size + hidden_size
lstm_params = 4 * (concat_size * hidden_size + hidden_size)  # 4 gates, each with W and b
gru_params = 3 * (concat_size * hidden_size + hidden_size)   # 3 gates

print("Parameter Count Comparison")
print("=" * 45)
print(f"  Input size:  {input_size}")
print(f"  Hidden size: {hidden_size}")
print(f"  Concat size: {concat_size}")
print()
print(f"  LSTM parameters: {lstm_params:,}")
print(f"    - 4 weight matrices: 4 x ({concat_size} x {hidden_size}) = {4 * concat_size * hidden_size:,}")
print(f"    - 4 bias vectors: 4 x {hidden_size} = {4 * hidden_size}")
print()
print(f"  GRU parameters: {gru_params:,}")
print(f"    - 3 weight matrices: 3 x ({concat_size} x {hidden_size}) = {3 * concat_size * hidden_size:,}")
print(f"    - 3 bias vectors: 3 x {hidden_size} = {3 * hidden_size}")
print()
print(f"  GRU saves {lstm_params - gru_params:,} parameters ({(1 - gru_params / lstm_params) * 100:.1f}% fewer)")

# Quick test: process a sequence through the GRU
np.random.seed(42)
gru = GRUCell(input_size=5, hidden_size=16)
X = np.random.randn(20, 1, 5) * 0.5
h = np.zeros((1, 16))
print("\nGRU processing 20-step sequence:")
for t in range(20):
    h = gru.forward(X[t], h)
    if t in (0, 5, 10, 15, 19):
        print(f"  Step {t:>2}: hidden norm = {np.linalg.norm(h):.4f}")
Framework Implementations
Now that we understand the internals, let us see how modern frameworks handle all of this in just a few lines. The from-scratch implementations above are equivalent to what happens inside these framework calls:
import numpy as np
# What took us ~40 lines in NumPy becomes 2 lines in PyTorch:
#
# import torch.nn as nn
#
# # Create an LSTM layer
# lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
#
# # Process a sequence
# output, (h_n, c_n) = lstm(input_tensor)
#
# That single nn.LSTM call handles:
# - All 4 gate computations (forget, input, output, candidate)
# - Cell state and hidden state management
# - Stacking multiple layers (num_layers=2)
# - Bidirectional processing (bidirectional=True)
# - Dropout between layers
# - CUDA/GPU acceleration
# - Optimized cuDNN kernels
# Parameter comparison: our scratch LSTM vs PyTorch.
# Note: PyTorch stores separate input-to-hidden and hidden-to-hidden
# weight matrices plus TWO bias vectors per layer (bias_ih and bias_hh),
# so its count is slightly higher than our concatenated version.
input_size = 32
hidden_size = 64

# Layer 1: W_ih (4H x input), W_hh (4H x H), bias_ih (4H), bias_hh (4H)
layer1 = 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
# Layer 2: its input is the hidden state from layer 1
layer2 = 4 * (hidden_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
total_params = layer1 + layer2

print("PyTorch nn.LSTM(input_size=32, hidden_size=64, num_layers=2) parameters:")
print(f"  Layer 1: {layer1:,} params")
print(f"  Layer 2: {layer2:,} params")
print(f"  Total:   {total_params:,} params")
print(f"\n All managed automatically with backprop, GPU support,")
print(f" and optimized CUDA kernels for 10-100x speedup.")
What’s Next
We have now built the three foundational recurrent architectures from scratch: vanilla RNNs, LSTMs, and GRUs. You understand exactly how hidden states carry memory, why gradients vanish in deep sequences, and how gating mechanisms provide a solution.
In the next article, we shift from sequence modeling to generative architectures:
Next in the Series
In Part 8: Autoencoders & GANs Deep Dive, we will build autoencoders for dimensionality reduction, variational autoencoders for generation, and generative adversarial networks from scratch, understanding the minimax game that produces synthetic data.