
Part 6: RNNs, LSTMs & Sequence Models

May 3, 2026 Wasil Zafar 30 min read

Build sequence models in PyTorch — understand RNNs, master LSTM & GRU gates, create word embeddings, build text classifiers, implement seq2seq architectures, and forecast time series data.

Table of Contents

  1. Why Sequence Models?
  2. Vanilla RNNs
  3. LSTM: Long Short-Term Memory
  4. GRU: Gated Recurrent Unit
  5. Word Embeddings
  6. Building a Text Classifier
  7. Bidirectional RNNs
  8. Sequence-to-Sequence Models
  9. Time Series Forecasting
  10. Packed Sequences
  11. Conclusion & Next Steps

Why Sequence Models?

Much of the world's data is sequential — it arrives in an ordered stream where the position of each element matters. Think about a sentence: "the cat sat on the mat" has meaning precisely because those words appear in that order. Rearrange them to "mat the on sat cat the" and the meaning collapses. The same is true for stock prices over time, audio waveforms, DNA nucleotide sequences, and sensor readings from a wearable device. In all these cases, order carries information.

The feedforward networks we built in earlier parts treat every input as an independent, fixed-size vector. If you wanted to classify a sentence, you would have to flatten it into a single vector (maybe averaging word vectors), losing all word-order information. That is like trying to understand a movie by looking at a single frame — you lose the plot, the dialogue, and every dramatic arc.

Key Insight: Sequence models process data one step at a time while maintaining a memory (hidden state) that summarises everything seen so far. This lets them capture patterns that depend on order, context, and long-range dependencies.

Sequence models solve this by processing data step by step, carrying forward a hidden state that acts as a running summary of everything the network has seen so far. At each time step the network reads one new element (a word, a price tick, an audio sample) and updates its hidden state. This hidden state is the network's memory — it lets the model remember earlier context while processing later inputs.

Let's start by exploring the simplest sequence model: the vanilla Recurrent Neural Network (RNN).

Vanilla RNNs

A Recurrent Neural Network (RNN) is a neural network with a loop. At each time step t, the RNN takes two inputs: the current data element xt and the previous hidden state ht-1. It combines them through a learned transformation to produce a new hidden state ht. The same weights are reused at every time step — this is called weight sharing, and it is what lets the RNN handle sequences of any length.

RNN hidden state update:

$$h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$$

Where $x_t \in \mathbb{R}^d$ is the input at time $t$, $h_{t-1} \in \mathbb{R}^h$ is the previous hidden state, $W_{xh} \in \mathbb{R}^{h \times d}$ maps input to hidden, $W_{hh} \in \mathbb{R}^{h \times h}$ maps hidden to hidden, and $b$ is the bias. The output at each step is simply $y_t = h_t$ (or a linear projection of it).

The following diagram shows an RNN "unrolled" across three time steps. You can see how the hidden state flows from left to right, carrying information from earlier time steps to later ones:

RNN Unrolled Across Time Steps
flowchart LR
    x0["x₀\n(input)"] --> RNN0["RNN Cell\ntanh(W·x + W·h + b)"]
    h_init["h₋₁\n(zeros)"] --> RNN0
    RNN0 --> h0["h₀"]
    RNN0 --> y0["y₀\n(output)"]
    
    x1["x₁\n(input)"] --> RNN1["RNN Cell\n(same weights)"]
    h0 --> RNN1
    RNN1 --> h1["h₁"]
    RNN1 --> y1["y₁\n(output)"]
    
    x2["x₂\n(input)"] --> RNN2["RNN Cell\n(same weights)"]
    h1 --> RNN2
    RNN2 --> h2["h₂"]
    RNN2 --> y2["y₂\n(output)"]
                            

Notice that all three cells share the exact same weights — we draw them as separate boxes only to show the flow through time. Before diving into code, let's build a rock-solid intuition for what an RNN actually does.

Building Intuition: RNNs in Plain English

Forget the math for a moment. An RNN is just a function that updates memory as new input arrives. That's it. In pseudocode:

memory = 0

for x in sequence:
    memory = update(x, memory)

This is an RNN. The variable x is the current input, memory is the hidden state, and update() is a small neural network. Everything else — weight matrices, tanh, PyTorch modules — is just implementation detail on top of this one idea.

The Best Analogy: Taking Notes While Reading a Book

Imagine reading a book and keeping a running set of notes:

  • Each sentence you read = $x_t$ (current input)
  • Your notes so far = $h_{t-1}$ (previous memory)
  • You update your notes after each sentence = $h_t$ (new memory)

new_notes = combine(current_sentence, previous_notes) — That's an RNN.

The rolling memory in action:

# The "rolling memory" mental model
# Each step REWRITES memory with a blend of new input + old memory

h = 0                    # start with blank memory
h = f(x1, h)            # Step 1: memory = f(first input, nothing)
h = f(x2, h)            # Step 2: memory = f(second input, memory of x1)
h = f(x3, h)            # Step 3: memory = f(third input, memory of x1+x2)
# ... h now summarizes the ENTIRE sequence

Now rename things to match the real equation:

$$h_t = \tanh\!\big(\underbrace{W_{xh} \cdot x_t}_{\text{what input says}} + \underbrace{W_{hh} \cdot h_{t-1}}_{\text{what memory says}} + b_h\big)$$

Read it as: new memory = squash(input influence + past memory influence). The tanh just keeps values in $[-1, 1]$ so memory doesn't explode.

The 3 Things That Define an RNN
| Property | What it means |
| --- | --- |
| 1. Sequence | Processes one step at a time (the for loop) |
| 2. State (memory) | Carries information forward (the hidden state h) |
| 3. Shared weights | Same function applied at every step (same W matrices) |

RNN vs Normal Neural Network

A standard feedforward net computes output = f(input) — each input is handled independently. An RNN computes output = f(input, history) — it carries memory of what came before. That's the entire upgrade from feedforward to recurrent.

Common Confusions (Cleared Up)

FAQ: Where People Get Stuck

"Why reuse the same weights at every step?"
Because the pattern of how to update memory is the same at every step — only the data changes. Just like your note-taking strategy doesn't change sentence to sentence.

"Where is time stored?"
In the loop and the hidden state — NOT in separate variables. Time is implicit in the iteration order.

"Why is it called recurrent?"
Because the output (hidden state) feeds back as input to the next step. Output → input → output → input. That's recurrence.

"In code, why h = f(x, h) instead of h_t = f(x_t, h_{t-1})?"
They're the same thing! Before the line executes, h holds the old value ($h_{t-1}$). After execution, h holds the new value ($h_t$). Python reassignment handles the "time step" for free.

With this intuition locked in, let's now implement the RNN from scratch in PyTorch so you can see every moving part.

RNN from Scratch (Manual Implementation)

Before using PyTorch's built-in nn.RNN, let's implement the recurrence manually with raw nn.Parameter tensors. This makes the weight sharing and time-step loop completely explicit, so you can see exactly what happens inside the black box.

An RNN maintains a hidden state vector $h_t$ that serves as the network's memory. At each time step $t$, the RNN takes two inputs — the current input $x_t$ and the previous hidden state $h_{t-1}$ — and computes a new hidden state by combining them through a learnable linear transformation followed by tanh activation:

Hidden state update (the core recurrence):

$$h_t = \tanh(x_t \cdot W_{xh} + h_{t-1} \cdot W_{hh} + b_h)$$

Output at each step:

$$y_t = h_t \cdot W_{hy} + b_y$$

The three weight matrices have distinct roles: $W_{xh}$ transforms the current input into "hidden space", $W_{hh}$ transforms the previous memory into a form that can be combined with the new input (this is the "recurrence" — past information feeding forward), and $W_{hy}$ projects the hidden state to produce an output. The tanh squashes values to $[-1, 1]$, preventing the hidden state from growing unboundedly over time.

Data Flow Through One RNN Time Step
flowchart LR
    xt["x_t\n(input at step t)"] --> WXH["× W_xh"]
    ht_prev["h_{t-1}\n(previous memory)"] --> WHH["× W_hh"]
    WXH --> SUM["+ (add)"]
    WHH --> SUM
    BH["b_h"] --> SUM
    SUM --> TANH["tanh"]
    TANH --> ht["h_t\n(new memory)"]
    ht --> WHY["× W_hy + b_y"]
    WHY --> yt["y_t\n(output)"]
    ht --> NEXT["→ next step\nas h_{t-1}"]
                            

Crucially, the same weight matrices are shared across all time steps — this is what makes it a "recurrent" network. Let's implement this step-by-step, starting with the simplest possible version before building up to a proper module.

Step 1: The Bare-Bones Loop (No Classes, No Batching)

Let's strip the RNN to its absolute core — just the loop, the weights, and tanh. This is the entire algorithm in 15 lines:

import torch

# === The simplest possible RNN ===
# Dimensions
input_size = 3    # features per time step (e.g., 3 sensor readings)
hidden_size = 4   # memory capacity (how much the RNN can "remember")
seq_len = 5       # number of time steps in our sequence

# Weight matrices (normally learned; we initialize randomly here)
W_xh = torch.randn(input_size, hidden_size) * 0.1   # input → hidden
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden → hidden (the "recurrence")
b_h = torch.zeros(hidden_size)                       # bias

# Input sequence: 5 time steps, each with 3 features
x = torch.randn(seq_len, input_size)
print(f"Input sequence shape: {x.shape}  (5 steps × 3 features)")

# The RNN loop — THIS IS THE ENTIRE ALGORITHM
h = torch.zeros(hidden_size)  # start with blank memory

for t in range(seq_len):
    # At each step: combine current input + previous memory → new memory
    h = torch.tanh(x[t] @ W_xh + h @ W_hh + b_h)
    print(f"  Step {t}: input={x[t].tolist()[:2]}... → h={h.tolist()[:3]}...")

print(f"\nFinal hidden state: {h.shape}  (this summarizes the ENTIRE sequence)")
print(f"Hidden values: {h.round(decimals=3).tolist()}")

That's it. The entire RNN is one line inside a loop: h = tanh(x[t] @ W_xh + h @ W_hh + b_h). Everything else — classes, batching, output projections — is engineering convenience built on top of this core idea. The key insight: the same weights are reused at every step, and the hidden state h carries information forward through time like a rolling summary.

Step 2: Full RNN Module (with Batching & Output)

Now let's wrap this into a proper nn.Module that supports batched inputs and produces an output at each step:

import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    """RNN built from scratch — identical to what nn.RNN does internally."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size

        # The three weight matrices from the equations above
        self.W_xh = nn.Parameter(torch.randn(input_size, hidden_size) * 0.01)
        self.W_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)
        self.W_hy = nn.Parameter(torch.randn(hidden_size, output_size) * 0.01)
        self.b_h = nn.Parameter(torch.zeros(hidden_size))
        self.b_y = nn.Parameter(torch.zeros(output_size))

    def forward(self, x):
        """x shape: (batch_size, seq_len, input_size)"""
        batch_size, seq_len, _ = x.size()

        # Initialize hidden state to zeros (blank memory)
        h = torch.zeros(batch_size, self.hidden_size, device=x.device)

        outputs = []
        for t in range(seq_len):
            # Core recurrence: h_t = tanh(x_t · W_xh + h_{t-1} · W_hh + b_h)
            h = torch.tanh(x[:, t, :] @ self.W_xh + h @ self.W_hh + self.b_h)

            # Output projection: y_t = h_t · W_hy + b_y
            y = h @ self.W_hy + self.b_y
            outputs.append(y.unsqueeze(1))

        # Stack all outputs: (batch, seq_len, output_size)
        outputs = torch.cat(outputs, dim=1)
        return outputs, h  # all outputs + final hidden state


# --- Try it out ---
model = SimpleRNN(input_size=10, hidden_size=20, output_size=5)

x = torch.randn(4, 6, 10)  # batch=4, seq_len=6, features=10
outputs, last_hidden = model(x)

print(f"Input:        {x.shape}           (4 sequences, 6 steps, 10 features)")
print(f"All outputs:  {outputs.shape}      (one output per step)")
print(f"Last hidden:  {last_hidden.shape}  (final memory — summarizes all 6 steps)")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

Reading the Code — What Maps to What

  • x[:, t, :] @ self.W_xh → the "$x_t \cdot W_{xh}$" term (transform current input)
  • h @ self.W_hh → the "$h_{t-1} \cdot W_{hh}$" term (transform previous memory)
  • torch.tanh(...) → the $\tanh$ that squashes the combination to $[-1, 1]$
  • h @ self.W_hy → the "$h_t \cdot W_{hy}$" output projection
  • The for t in range(seq_len) loop → "unrolling" the RNN through time

This is exactly what nn.RNN does internally — the same W_xh and W_hh matrices are applied at every time step (weight sharing). The loop over seq_len is the "unrolling" we visualized in the diagram above. Now let's see how PyTorch wraps this into a single efficient module:

nn.RNN in PyTorch

Let's create a simple RNN and pass a batch of sequences through it. The key parameters are input_size (the number of features per time step), hidden_size (the size of the hidden state vector), and num_layers (how many RNN layers to stack). The RNN returns two things: output (the hidden state at every time step) and h_n (the hidden state at the last time step only).

import torch
import torch.nn as nn

# Create an RNN: input features = 10, hidden size = 20, 1 layer
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

# Random input: batch=3, sequence_length=5, features=10
x = torch.randn(3, 5, 10)

# Initial hidden state: (num_layers, batch, hidden_size)
h0 = torch.zeros(1, 3, 20)

# Forward pass
output, h_n = rnn(x, h0)

print("Input shape:     ", x.shape)        # [3, 5, 10]
print("Output shape:    ", output.shape)    # [3, 5, 20] — hidden state at every time step
print("Final h_n shape: ", h_n.shape)       # [1, 3, 20] — hidden state at last time step only
print("output[:, -1, :] == h_n?", torch.allclose(output[:, -1, :], h_n[0], atol=1e-6))

This code demonstrates PyTorch's nn.RNN module, which wraps the manual loop we wrote earlier into a single optimized call. We create an RNN that accepts 10 features per time step and maintains a 20-dimensional hidden state. The input x is a batch of 3 sequences, each 5 time steps long, with 10 features at each step. Setting batch_first=True means our tensors are shaped [batch, seq_len, features] (the more intuitive layout) rather than PyTorch's default [seq_len, batch, features].

We initialize the hidden state h0 to zeros — this is the "blank memory" the RNN starts with before seeing any input. Its shape (num_layers, batch, hidden_size) accommodates stacked RNN layers; with num_layers=1, the leading dimension is simply 1.

The forward pass returns two tensors. The output tensor has shape [3, 5, 20] — it contains the hidden state at every time step, which is useful for sequence-to-sequence tasks (like tagging each word in a sentence). The h_n tensor has shape [1, 3, 20] — it contains only the final hidden state after processing the last time step. The final line proves these are the same: slicing the last time step from output gives exactly h_n. For classification tasks, you typically use h_n as the "summary" of the entire sequence and feed it into a linear layer.
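
Before moving on, it is worth verifying the earlier claim that the manual loop computes exactly what nn.RNN computes. The check below is a small sketch (not part of the original listings): it copies nn.RNN's own internal parameters (weight_ih_l0, weight_hh_l0 and the two bias vectors) into a hand-written loop and compares the outputs.

import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(2, 4, 10)          # batch=2, seq_len=4, features=10

# Built-in forward pass (h0 defaults to zeros)
out_builtin, _ = rnn(x)

# Manual loop using the SAME weights that nn.RNN holds internally
W_ih = rnn.weight_ih_l0            # [hidden, input]
W_hh = rnn.weight_hh_l0            # [hidden, hidden]
b = rnn.bias_ih_l0 + rnn.bias_hh_l0

h = torch.zeros(2, 20)
manual_outputs = []
for t in range(x.size(1)):
    h = torch.tanh(x[:, t, :] @ W_ih.T + h @ W_hh.T + b)
    manual_outputs.append(h.unsqueeze(1))
out_manual = torch.cat(manual_outputs, dim=1)

print("Manual loop matches nn.RNN:", torch.allclose(out_builtin, out_manual, atol=1e-5))  # True

If the two outputs agree, the built-in module is doing nothing more than the loop you wrote by hand, just in optimized C++/CUDA kernels.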

The Vanishing Gradient Problem

Vanilla RNNs have a critical flaw: when sequences are long, gradients that flow backward through many time steps get multiplied by the same weight matrix over and over. If the largest eigenvalue of that matrix is less than 1, gradients shrink exponentially toward zero — the network simply cannot learn dependencies between events that are far apart. If the eigenvalue is greater than 1, gradients explode instead. This is the infamous vanishing/exploding gradient problem.

The following code demonstrates this by examining gradient magnitudes after backpropagation through a 50-step sequence:

import torch
import torch.nn as nn

# A simple RNN on a long sequence
rnn = nn.RNN(input_size=5, hidden_size=32, num_layers=1, batch_first=True)
x = torch.randn(1, 50, 5, requires_grad=True)  # 50 time steps
h0 = torch.zeros(1, 1, 32)

output, h_n = rnn(x, h0)

# Backpropagate from the final hidden state
loss = h_n.sum()
loss.backward()

# Check gradient magnitude at different time steps
grad = x.grad[0]  # shape: [50, 5]
print("Gradient norm at step  0 (earliest):", grad[0].norm().item())
print("Gradient norm at step 25 (middle):  ", grad[25].norm().item())
print("Gradient norm at step 49 (latest):  ", grad[49].norm().item())
print("\nNotice: early time-step gradients are much smaller — the vanishing gradient problem!")

You'll typically see that gradients at early time steps are orders of magnitude smaller than at later steps. This means the RNN struggles to learn from information presented at the beginning of a long sequence. The solution? LSTM and GRU networks, which use gating mechanisms to preserve gradients across many time steps.

Experiment
Try It Yourself

Change the sequence length in the code above from 50 to 200. You will see the gradient at step 0 become even smaller (possibly approaching zero), while the gradient at the last step stays healthy. This is exactly why vanilla RNNs fail on long sequences like full paragraphs or multi-minute audio clips.


LSTM: Long Short-Term Memory

The Core Idea (Plain English First)

A vanilla RNN has one memory vector (h) that gets completely rewritten at every step. This is like erasing and rewriting your entire notebook after each sentence — eventually, early notes get lost. The LSTM solves this by adding a second memory lane called the cell state ($c_t$) that acts like a conveyor belt running alongside the main processing. Information on this conveyor belt flows forward largely untouched, protected by gates that decide what to add, remove, or read.

The Best Analogy: A Filing Cabinet with a Secretary

Think of an LSTM as a filing cabinet (cell state) managed by a secretary (gates):

  • Forget Gate = "Which old files should I shred?" — decides what to remove from the cabinet
  • Input Gate = "Which new documents should I file?" — decides what to add to the cabinet
  • Output Gate = "Which files should I pull out for the boss right now?" — decides what to expose as output

The cabinet itself (cell state) persists across time steps. Only the gates modify it — and they do so via addition, not replacement. This is why LSTMs can remember information across hundreds of time steps.

Ultra-compressed version:

# LSTM in 5 lines of pseudocode:
forget    = sigmoid(input + old_memory)                  # What to erase?
add       = sigmoid(input + old_memory)                  # What to write?
candidate = tanh(input + old_memory)                     # Proposed new content
cell      = forget * old_cell + add * candidate          # Update filing cabinet
output    = sigmoid(input + old_memory) * tanh(cell)     # What to share?

Compare with vanilla RNN (1 line): h = tanh(W*x + W*h). The LSTM has more machinery, but the purpose is the same — it's just smarter about what to remember and what to forget.

RNN vs LSTM — The Key Difference
| | Vanilla RNN | LSTM |
| --- | --- | --- |
| Memory update | Complete rewrite ($h = \tanh(...)$) | Selective edit (add/remove via gates) |
| Analogy | Erasing & rewriting your entire notebook | Filing cabinet with selective add/remove |
| Math operation | Multiplication (gradients vanish) | Addition (gradients flow) |
| Long-range memory | Fails beyond ~20 steps | Works across 100s of steps |

The Three Gates (Deep Dive)

Now let's formalize the intuition. The LSTM has three sigmoid gates (outputting values in $[0, 1]$) that control information flow, plus a tanh layer that generates new candidate information:

The Three Gates:
  • Forget Gate ($f_t$) — decides what old information to throw away from the cell state. Output near 1 = "keep this", near 0 = "forget this". ("Should I forget the subject of the sentence now that I've seen a period?")
  • Input Gate ($i_t$) — decides what new information to write into the cell state. Works with a tanh candidate layer that proposes new values. ("This new word is a noun — I should remember it as the subject.")
  • Output Gate ($o_t$) — decides what part of the cell state to expose as the hidden state output. ("For predicting the next word, I need the current subject and verb.")

The following diagram shows the internal structure of a single LSTM cell. The cell state runs horizontally across the top, while the three gates control what gets forgotten, added, and output:

LSTM Cell — Gates & Data Flow
flowchart TD
    subgraph LSTM["LSTM Cell at Time Step t"]
        direction TB
        IN["x_t, h_{t-1}"]
        FG["Forget Gate\nσ(W_f · [h,x] + b_f)"]
        IG["Input Gate\nσ(W_i · [h,x] + b_i)"]
        CAND["Candidate\ntanh(W_c · [h,x] + b_c)"]
        OG["Output Gate\nσ(W_o · [h,x] + b_o)"]
        
        CS_IN["c_{t-1}"] --> MUL_F["× forget"]
        FG --> MUL_F
        IG --> MUL_I["× input"]
        CAND --> MUL_I
        MUL_F --> ADD["+ add"]
        MUL_I --> ADD
        ADD --> CS_OUT["c_t"]
        ADD --> TANH["tanh"]
        OG --> MUL_O["× output"]
        TANH --> MUL_O
        MUL_O --> HID["h_t"]
        
        IN --> FG
        IN --> IG
        IN --> CAND
        IN --> OG
    end
                            

nn.LSTM in PyTorch

Using an LSTM in PyTorch is almost identical to using an RNN — the only difference is that LSTMs return two hidden states: h_n (hidden state) and c_n (cell state). Both are needed to continue the sequence from where you left off.

import torch
import torch.nn as nn

# Create an LSTM: input features = 10, hidden size = 20, 2 stacked layers
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)

# Random input: batch=3, sequence_length=8, features=10
x = torch.randn(3, 8, 10)

# Initial states: (num_layers, batch, hidden_size)
h0 = torch.zeros(2, 3, 20)  # hidden state
c0 = torch.zeros(2, 3, 20)  # cell state

# Forward pass — note the tuple (h0, c0)
output, (h_n, c_n) = lstm(x, (h0, c0))

print("Output shape: ", output.shape)   # [3, 8, 20] — hidden state at every step
print("h_n shape:    ", h_n.shape)      # [2, 3, 20] — final hidden state per layer
print("c_n shape:    ", c_n.shape)      # [2, 3, 20] — final cell state per layer
print("\nh_n contains the final hidden state from EACH layer.")
print("For classification, use h_n[-1] (the last layer's final hidden state).")

With 2 stacked layers, the output of the first LSTM layer becomes the input to the second LSTM layer at every time step. The h_n[-1] tensor (the last layer's final hidden state) is what you would typically feed into a classifier.

Why LSTMs Solve Vanishing Gradients

The secret is the cell state's additive update rule. In a vanilla RNN, the hidden state is computed by multiplying by a weight matrix at every step — this repeated multiplication causes gradients to vanish or explode. In an LSTM, the cell state is updated by addition. Addition preserves gradient magnitude, allowing information (and its gradient) to flow across hundreds of time steps without degradation. The forget gate learns to be close to 1 for information that should be remembered long-term.
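
The difference between a multiplicative and an additive path is easy to see on a toy example. The snippet below is not an LSTM; it is a minimal sketch that contrasts the gradient of the final state with respect to the initial state under the two update rules.

import torch

T = 50  # number of time steps

# Multiplicative recurrence (vanilla-RNN-like): the state is rescaled every step
h0 = torch.tensor(1.0, requires_grad=True)
h = h0
for _ in range(T):
    h = 0.5 * h                     # repeated multiplication
h.backward()
print(f"Multiplicative path: dh_T/dh_0 = {h0.grad.item():.2e}")   # 0.5**50 ≈ 8.9e-16 (vanished)

# Additive recurrence (cell-state-like): the state only accumulates contributions
c0 = torch.tensor(1.0, requires_grad=True)
c = c0
for _ in range(T):
    c = c + 0.1                     # repeated addition
c.backward()
print(f"Additive path:       dc_T/dc_0 = {c0.grad.item():.2e}")    # exactly 1.0 (preserved)

The multiplicative path shrinks the gradient by a factor at every step, while the additive path passes it through unchanged. The LSTM's forget gate makes that factor learnable, so the network can keep it close to 1 for information worth remembering.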

LSTM gate equations:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \quad \text{(candidate)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(cell state update)}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}$$
$$h_t = o_t \odot \tanh(c_t) \quad \text{(hidden state)}$$

Where $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and current input. The cell state $c_t$ flows through time via addition, avoiding the repeated multiplication that causes vanishing gradients.

GRU: Gated Recurrent Unit

The Core Idea (Plain English)

If LSTM is a filing cabinet with three separate controls, GRU is a smart notebook with one slider. That slider (the update gate) controls the balance between "keep old notes" and "write new notes" — all in a single operation. The GRU merges the LSTM's cell state and hidden state into one vector and reduces three gates to two. Less complexity, same fundamental idea.

The Best Analogy: A Volume Knob Between Past and Present

Imagine a mixing board with two channels:

  • Channel 1 = old memory ($h_{t-1}$)
  • Channel 2 = new candidate ($\tilde{h}_t$)
  • Update gate ($z_t$) = the crossfader between them

When $z_t = 0$: keep 100% of old memory (ignore new input). When $z_t = 1$: accept 100% of new candidate (forget old memory). Usually it's somewhere in between — a soft blend.

Ultra-compressed version:

# GRU in 4 lines of pseudocode:
reset  = sigmoid(input + old_memory)                 # How much of old memory to ERASE before computing new?
update = sigmoid(input + old_memory)                 # How much to blend new vs. old?
new    = tanh(input + reset * old_memory)            # Compute new candidate
output = (1 - update) * old_memory + update * new    # Blend old & new

The Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014, is a simplified alternative to the LSTM. Instead of three gates and a separate cell state, the GRU uses only two gates — a reset gate and an update gate — and merges the cell state and hidden state into a single vector. This makes GRUs faster to train (fewer parameters) while often achieving comparable performance.

The reset gate controls how much of the previous hidden state to ignore when computing a candidate new state (like erasing a whiteboard before writing new notes). The update gate controls the blend between keeping the old hidden state and accepting the new candidate (like a slider between “remember everything” and “start fresh”).

GRU gate equations:

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \quad \text{(update gate)}$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h) \quad \text{(candidate)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(final state)}$$

Where $r_t$ determines how much past info to reset before computing the candidate, and $z_t$ blends the old $h_{t-1}$ with the new candidate $\tilde{h}_t$. When $z_t \approx 1$, the GRU accepts the new candidate; when $z_t \approx 0$, it keeps the old state unchanged.

import torch
import torch.nn as nn

# Create a GRU: input features = 10, hidden size = 20, 1 layer
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

# Random input: batch=4, sequence_length=6, features=10
x = torch.randn(4, 6, 10)
h0 = torch.zeros(1, 4, 20)

# Forward pass — GRU returns output and h_n (no cell state)
output, h_n = gru(x, h0)

print("Output shape:", output.shape)   # [4, 6, 20]
print("h_n shape:   ", h_n.shape)      # [1, 4, 20]

Notice that the GRU's API is identical to the vanilla RNN — it returns (output, h_n) with no cell state. This makes it a drop-in replacement: just change nn.RNN to nn.GRU.

GRU vs LSTM — When to Use Each

Comparison
GRU vs LSTM at a Glance
| Feature | LSTM | GRU |
| --- | --- | --- |
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 (h and c) | 1 (h only) |
| Parameters | 4 × (input + hidden) × hidden | 3 × (input + hidden) × hidden |
| Training speed | Slower (~33% more params) | Faster |
| Long-range memory | Slightly better (separate cell state) | Comparable for most tasks |
| Best for | Very long sequences, language modeling | Smaller datasets, real-time systems |

Rule of thumb: Try GRU first (faster to iterate). Switch to LSTM if GRU's accuracy plateaus on your validation set, especially for long sequences (>200 steps).
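
You can confirm the parameter-count row of the table above directly in PyTorch; the sizes below are arbitrary and chosen only for illustration.

import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)

lstm_params = sum(p.numel() for p in lstm.parameters())
gru_params = sum(p.numel() for p in gru.parameters())

print(f"LSTM parameters: {lstm_params}")                     # 4 gate blocks of weights
print(f"GRU parameters:  {gru_params}")                      # 3 gate blocks of weights
print(f"LSTM / GRU ratio: {lstm_params / gru_params:.2f}")   # ≈ 1.33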


Word Embeddings

The Core Idea (Plain English)

Neural networks work with numbers, not words. So how do we convert "king", "queen", "banana" into something a network can process? The naive approach — one-hot encoding — gives each word a huge sparse vector (10,000 dimensions for 10,000 words) where every word is equally "different" from every other. This is like giving each student in a school a random locker number — the numbers tell you nothing about who the students are.

The Best Analogy: GPS Coordinates for Words

Word embeddings are like GPS coordinates for words on a "meaning map":

  • "King" and "Queen" are close together (both royalty)
  • "King" and "Banana" are far apart (unrelated)
  • The direction from "King" to "Queen" is similar to "Man" to "Woman" (gender axis)

Instead of 10,000 dimensions (one per word), you compress to ~100 dimensions where distance = semantic similarity. That's an embedding.

Ultra-compressed version:

# One-hot: word → [0, 0, 0, ..., 1, ..., 0, 0]  (10,000 dims, sparse, meaningless)
# Embedding: word → [0.21, -0.45, 0.87, ...]     (64 dims, dense, meaningful)

# In PyTorch: it's just a lookup table
embedding_table[word_index]  # → returns a learned dense vector

Before we can feed text into an RNN, LSTM, or GRU, we need to convert words into numbers. The simplest approach is one-hot encoding: give each word in a 10,000-word vocabulary a unique index, then represent it as a 10,000-dimensional vector with a single 1 and 9,999 zeros. This is spectacularly wasteful — the vectors are huge, sparse, and encode no semantic information (the vectors for "king" and "queen" are just as different as "king" and "banana").

nn.Embedding — Dense Vector Representations

Word embeddings solve this by mapping each word to a small, dense vector (typically 50–300 dimensions) where semantically similar words end up close together. PyTorch's nn.Embedding is essentially a lookup table: given a word index, it returns the corresponding embedding vector. During training, these vectors are learned alongside the rest of the network.

import torch
import torch.nn as nn

# Vocabulary of 1000 words, each mapped to a 64-dimensional vector
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64)

# A batch of 2 sentences, each with 5 word indices
word_indices = torch.tensor([
    [12, 45, 3, 678, 99],    # Sentence 1 (5 words)
    [7, 234, 56, 0, 888]     # Sentence 2 (5 words)
])

# Look up embeddings
embedded = embedding(word_indices)
print("Input shape: ", word_indices.shape)  # [2, 5]
print("Output shape:", embedded.shape)       # [2, 5, 64] — each word is now a 64-dim vector

# Embeddings are learnable parameters
print("Total embedding parameters:", embedding.weight.shape)  # [1000, 64]
print("Embedding for word index 12:", embedded[0, 0, :5])      # First 5 values of word 12's vector

The nn.Embedding layer is the standard first layer in any text model. It replaces sparse one-hot vectors with dense, learnable representations. The embedding weights are initialized randomly and refined through backpropagation — words that appear in similar contexts gradually develop similar vectors.
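
Because distance is what carries meaning, you typically compare embeddings with cosine similarity. The snippet below shows the mechanics only; with an untrained, randomly initialized table the numbers are meaningless, and the word indices standing in for "king", "queen", and "banana" are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64)

# Hypothetical indices for "king", "queen", "banana"
king, queen, banana = torch.tensor([12]), torch.tensor([45]), torch.tensor([3])

v_king, v_queen, v_banana = embedding(king), embedding(queen), embedding(banana)

print("cos(king, queen): ", F.cosine_similarity(v_king, v_queen).item())
print("cos(king, banana):", F.cosine_similarity(v_king, v_banana).item())
# After training (or with pretrained vectors), related pairs score noticeably higher.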

Pretrained Embeddings (Word2Vec, GloVe)

Training embeddings from scratch requires a lot of data. For smaller datasets, you can start with pretrained embeddings like Word2Vec or GloVe, which were trained on billions of words and capture rich semantic relationships. You can load these into nn.Embedding using from_pretrained():

import torch
import torch.nn as nn

# Simulate loading pretrained embeddings (normally you'd load a real file)
# GloVe-50d has 400K words × 50 dimensions
vocab_size = 5000
embed_dim = 50
pretrained_weights = torch.randn(vocab_size, embed_dim)  # Simulated GloVe vectors

# Create embedding layer from pretrained weights
embedding = nn.Embedding.from_pretrained(pretrained_weights, freeze=False)

# freeze=False: Fine-tune embeddings during training (recommended for most tasks)
# freeze=True:  Keep embeddings fixed (useful when training data is very small)

word_idx = torch.tensor([42, 100, 7])
vectors = embedding(word_idx)
print("Pretrained embedding shape:", vectors.shape)  # [3, 50]
print("Trainable?", embedding.weight.requires_grad)   # True (freeze=False)

With freeze=False, the pretrained vectors serve as a strong initialization but continue to adapt to your specific task during training. With freeze=True, they stay fixed — useful when your dataset is too small to reliably update embeddings.

Building a Text Classifier

The Core Idea (Plain English)

Text classification is the "hello world" of NLP: given a sentence, predict a category (positive/negative, spam/not-spam, topic A/B/C). The full pipeline is actually straightforward once you know the pieces:

The Best Analogy: A Reading Comprehension Assembly Line

Think of text classification as a 3-stage assembly line:

  1. Vocabulary — Assign each word a number (like a dictionary index). "great" → 42, "terrible" → 87
  2. Embedding — Look up each number in a table to get a meaning vector (GPS coordinates for words)
  3. LSTM + Classifier — Read the sequence of vectors, compress into one summary, then decide: positive or negative?

In code: text → indices → embeddings → LSTM → hidden state → linear layer → prediction

Ultra-compressed pipeline:

# The entire text classification pipeline in pseudocode:
indices = [vocab[word] for word in sentence.split()]  # words → numbers
vectors = embedding_table[indices]                     # numbers → dense vectors
_, (summary, _) = lstm(vectors)                       # read sequence → one vector
prediction = linear(summary)                           # one vector → class score

Now let's put everything together and build a complete sentiment classifier that reads a sentence and predicts whether it expresses positive or negative sentiment. This pipeline involves tokenization, vocabulary building, embedding lookup, LSTM processing, and a classification head.

Tokenization & Vocabulary

First, we need to convert raw text into sequences of integer indices. We'll build a simple vocabulary mapping from scratch. In production you would use a library like torchtext or HuggingFace's tokenizers, but this manual approach helps you understand what happens under the hood.

import torch

# Sample training data
texts = [
    "this movie is absolutely wonderful and amazing",
    "terrible film waste of time boring and dull",
    "great acting superb story loved every minute",
    "awful movie bad script horrible experience",
    "brilliant performance outstanding cinematography",
    "worst movie ever made completely unwatchable"
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Step 1: Build vocabulary from training data
word2idx = {"<PAD>": 0, "<UNK>": 1}
for text in texts:
    for word in text.lower().split():
        if word not in word2idx:
            word2idx[word] = len(word2idx)

print(f"Vocabulary size: {len(word2idx)}")

# Step 2: Convert texts to index sequences
def text_to_indices(text, word2idx, max_len=10):
    indices = [word2idx.get(w, word2idx["<UNK>"]) for w in text.lower().split()]
    # Pad or truncate to max_len
    if len(indices) < max_len:
        indices += [word2idx["<PAD>"]] * (max_len - len(indices))
    else:
        indices = indices[:max_len]
    return indices

sequences = [text_to_indices(t, word2idx) for t in texts]
X = torch.tensor(sequences)
y = torch.tensor(labels, dtype=torch.float32)

print("Input tensor shape:", X.shape)   # [6, 10]
print("First sequence:    ", X[0])
print("Labels:            ", y)

Each sentence is now a fixed-length tensor of word indices. The <PAD> token fills shorter sentences to the maximum length, and <UNK> handles words not in the vocabulary.

The Sentiment Classifier Model

Our model has three components: (1) an embedding layer that converts word indices to dense vectors, (2) an LSTM that reads the embedded sequence and produces a final hidden state, and (3) a linear layer (classifier head) that maps the hidden state to a single output logit for binary classification.

import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, pad_idx):
        super().__init__()
        # Embedding: word index -> dense vector
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        # LSTM: processes the embedded sequence
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        # Classifier: maps final hidden state to output
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(0.3)

    def forward(self, text):
        # text shape: [batch, seq_len]
        embedded = self.dropout(self.embedding(text))   # [batch, seq_len, embed_dim]
        output, (h_n, c_n) = self.lstm(embedded)        # h_n: [1, batch, hidden_dim]
        hidden = self.dropout(h_n[-1])                  # [batch, hidden_dim]
        logit = self.fc(hidden)                         # [batch, output_dim]
        return logit.squeeze(-1)                        # [batch]

# Instantiate the model
model = SentimentLSTM(
    vocab_size=50,      # Small vocab for demo
    embed_dim=16,       # Embedding size
    hidden_dim=32,      # LSTM hidden size
    output_dim=1,       # Binary classification
    pad_idx=0           # PAD token index
)

# Quick test with random data
test_input = torch.randint(0, 50, (4, 10))  # batch=4, seq_len=10
test_output = model(test_input)
print("Model output shape:", test_output.shape)  # [4]
print("Model output:      ", test_output)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

The padding_idx=pad_idx argument tells the embedding layer to keep the PAD token's vector as all zeros and never update it during training. This prevents padding tokens from contributing to the model's predictions.
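
A quick way to confirm that behaviour is to inspect the PAD row and its gradient directly; this is a small standalone check, not part of the classifier above.

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=50, embedding_dim=16, padding_idx=0)

# The PAD row is initialized to zeros
print("PAD vector is all zeros:", bool(torch.all(emb.weight[0] == 0)))   # True

# And its gradient stays zero, so the optimizer never updates it
out = emb(torch.tensor([[0, 3, 7]]))
out.sum().backward()
print("PAD row gradient norm:", emb.weight.grad[0].norm().item())        # 0.0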

Training the Classifier

Now let's train the sentiment classifier end-to-end. We use BCEWithLogitsLoss (binary cross-entropy with built-in sigmoid) and the Adam optimizer. Even with just 6 training examples, you can see the loss decrease and the model learn to separate positive from negative reviews.

import torch
import torch.nn as nn

# ---- Rebuild everything needed (independent code block) ----
texts = [
    "this movie is absolutely wonderful and amazing",
    "terrible film waste of time boring and dull",
    "great acting superb story loved every minute",
    "awful movie bad script horrible experience",
    "brilliant performance outstanding cinematography",
    "worst movie ever made completely unwatchable"
]
labels = [1, 0, 1, 0, 1, 0]

word2idx = {"<PAD>": 0, "<UNK>": 1}
for text in texts:
    for word in text.lower().split():
        if word not in word2idx:
            word2idx[word] = len(word2idx)

def text_to_indices(text, w2i, max_len=10):
    idx = [w2i.get(w, 1) for w in text.lower().split()]
    return (idx + [0] * max_len)[:max_len]

X = torch.tensor([text_to_indices(t, word2idx) for t in texts])
y = torch.tensor(labels, dtype=torch.float32)

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, text):
        embedded = self.embedding(text)
        _, (h_n, _) = self.lstm(embedded)
        return self.fc(h_n[-1]).squeeze(-1)

model = SentimentLSTM(len(word2idx), embed_dim=16, hidden_dim=32, pad_idx=0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

# ---- Training loop ----
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    predictions = model(X)
    loss = criterion(predictions, y)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            probs = torch.sigmoid(predictions)
            predicted_labels = (probs > 0.5).float()
            accuracy = (predicted_labels == y).float().mean()
        print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.4f} | Accuracy: {accuracy:.2%}")

# ---- Inference on new text ----
model.eval()
test_text = "amazing wonderful brilliant movie"
test_idx = torch.tensor([text_to_indices(test_text, word2idx)]).long()
with torch.no_grad():
    prob = torch.sigmoid(model(test_idx)).item()
print(f"\n'{test_text}' → Positive probability: {prob:.4f}")

After 50 epochs on this tiny dataset, the model should reach 100% training accuracy and correctly classify the test sentence as positive. In practice, you would use thousands of labeled examples, proper train/validation splits, and early stopping — but the pipeline structure remains exactly the same.
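
To sketch what that looks like, the block below wraps the same tiny dataset in a DataLoader with a held-out validation split. It reuses X, y, word2idx, and the SentimentLSTM class from the block above; with only six examples the split is purely illustrative, not a recipe.

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split

# Hypothetical train/validation split over the tiny dataset defined above
dataset = TensorDataset(X, y)
n_val = max(1, int(0.2 * len(dataset)))
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])

train_loader = DataLoader(train_set, batch_size=2, shuffle=True)
val_loader = DataLoader(val_set, batch_size=2)

model = SentimentLSTM(len(word2idx), embed_dim=16, hidden_dim=32, pad_idx=0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

for epoch in range(20):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    # Validation pass: this is where early stopping would monitor accuracy or loss
    model.eval()
    with torch.no_grad():
        correct = [((torch.sigmoid(model(xb)) > 0.5).float() == yb).float()
                   for xb, yb in val_loader]
        val_acc = torch.cat(correct).mean().item()
    print(f"Epoch {epoch+1:2d} | val accuracy: {val_acc:.2%}")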

Bidirectional RNNs

The Core Idea (Plain English)

A standard LSTM reads left-to-right — at each word, it only knows what came before. But meaning often depends on what comes after too. A bidirectional model solves this by reading the sentence twice: once forward, once backward — then combining both perspectives.

The Best Analogy: Reading a Mystery Novel Twice

Imagine reading a mystery novel:

  • First read (forward) — you notice clues as they appear, building suspense
  • Second read (backward) — knowing the ending, you spot foreshadowing you missed
  • Combined understanding — much richer than either pass alone

In "The bank of the river" — you need to see "river" (future context) to know "bank" means riverbank, not a financial institution. Bidirectional gives you both.

Ultra-compressed version:

# Bidirectional = two LSTMs, concatenated
h_forward  = lstm_forward("The cat sat on the mat")   # reads left → right
h_backward = lstm_backward("The cat sat on the mat")  # reads right → left
h_combined = concat(h_forward, h_backward)             # double the info!

A standard LSTM reads the sequence from left to right — at each position, it knows about the past but not the future. But in many tasks (machine translation, named entity recognition, sentiment analysis), the meaning of a word depends on context from both directions. For example, in "The bank of the river", you need to see "river" to know that "bank" means a riverbank, not a financial institution.

A bidirectional LSTM runs two separate LSTMs: one reading left-to-right and one reading right-to-left. Their outputs are concatenated at each time step, giving the model access to both past and future context. Setting bidirectional=True in PyTorch doubles the output dimension.

Bidirectional LSTM in Code

Here we create a bidirectional LSTM and observe how the output shape doubles compared to a unidirectional one. The hidden state also becomes 2 × num_layers because each layer has both a forward and backward component:

import torch
import torch.nn as nn

# Bidirectional LSTM — note bidirectional=True
bilstm = nn.LSTM(
    input_size=10,
    hidden_size=20,
    num_layers=1,
    batch_first=True,
    bidirectional=True   # <-- This is the only change!
)

x = torch.randn(3, 7, 10)  # batch=3, seq_len=7, features=10

# For bidirectional: num_directions = 2
# h0 shape: (num_layers * num_directions, batch, hidden_size)
h0 = torch.zeros(2, 3, 20)  # 2 = 1 layer × 2 directions
c0 = torch.zeros(2, 3, 20)

output, (h_n, c_n) = bilstm(x, (h0, c0))

print("Output shape:", output.shape)  # [3, 7, 40] — 40 = 20 * 2 (forward + backward)
print("h_n shape:   ", h_n.shape)     # [2, 3, 20] — h_n[0] = forward, h_n[1] = backward

# For classification: concatenate forward and backward final hidden states
h_forward = h_n[0]   # [3, 20] — forward LSTM's final hidden state
h_backward = h_n[1]  # [3, 20] — backward LSTM's final hidden state
h_combined = torch.cat([h_forward, h_backward], dim=1)  # [3, 40]
print("Combined hidden state:", h_combined.shape)  # [3, 40] — feed this to classifier

The output at each time step is now 40-dimensional (20 from forward + 20 from backward). For sequence classification, you concatenate the two final hidden states to get a 40-dimensional summary vector. For token-level tasks (like NER), you use the 40-dimensional output at each position.
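
As a concrete illustration of the token-level case, the sketch below puts a linear tagging head on top of the per-position outputs; the tag count and all dimensions are made up for the example.

import torch
import torch.nn as nn

num_tags = 5   # hypothetical tag set size (e.g., NER labels)

bilstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True, bidirectional=True)
tag_head = nn.Linear(2 * 20, num_tags)   # 40-dim bidirectional output -> tag logits

x = torch.randn(3, 7, 10)                # batch=3, seq_len=7, features=10
output, _ = bilstm(x)                    # [3, 7, 40]
tag_logits = tag_head(output)            # [3, 7, 5]: one score vector per token
print("Tag logits shape:", tag_logits.shape)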

When NOT to use bidirectional: Bidirectional models cannot be used for autoregressive generation (predicting the next word, next stock price, etc.) because they require seeing the entire sequence upfront. Use unidirectional LSTMs for any task where you're predicting the future step by step.

Sequence-to-Sequence Models

The Core Idea (Plain English)

All models so far produce either one output (classification) or one output per input step. But some tasks need to transform a whole sequence into a different-length sequence: translate English (5 words) → French (7 words), summarize a paragraph into a sentence, convert speech audio into text. The solution: Encoder-Decoder architecture.

The Best Analogy: A Simultaneous Interpreter

Think of a UN interpreter translating a speech:

  • Encoder (Listening Phase) — The interpreter listens to the entire English sentence, building a mental summary of its meaning
  • Context Vector — That mental summary — a compressed representation of everything that was said
  • Decoder (Speaking Phase) — The interpreter now produces the French translation word by word, drawing from their mental summary

The encoder and decoder are separate LSTMs. The only thing connecting them is the context vector (the encoder's final hidden state passed as the decoder's initial state).

Ultra-compressed version:

# Seq2Seq in 3 lines of pseudocode:
context = encoder("Hello world !")          # Read input → compress to one vector
output = []
for step in range(output_length):
    word, context = decoder(context)         # Generate one word at a time
    output.append(word)
# output = ["Bonjour", "monde", "!"]

A sequence-to-sequence (seq2seq) model transforms one variable-length sequence into another — for example, translating an English sentence into French, summarizing a paragraph into a single sentence, or converting speech to text. The architecture consists of two parts: an encoder that reads the input sequence and compresses it into a fixed-size context vector, and a decoder that generates the output sequence one token at a time using that context vector.

Sequence-to-Sequence Encoder-Decoder Architecture
flowchart LR
    subgraph Encoder
        E1["LSTM"] --> E2["LSTM"] --> E3["LSTM"]
    end
    
    subgraph Context["Context Vector"]
        CV["h, c"]
    end
    
    subgraph Decoder
        D1["LSTM"] --> D2["LSTM"] --> D3["LSTM"]
    end
    
    I1["Hello"] --> E1
    I2["world"] --> E2
    I3["!"] --> E3
    E3 --> CV
    CV --> D1
    D1 --> O1["Bonjour"]
    D2 --> O2["monde"]
    D3 --> O3["!"]
                            

Encoder-Decoder with Teacher Forcing

During training, the decoder uses the actual previous target token as input at each step (not its own prediction). This technique is called teacher forcing — it stabilizes and speeds up training because the decoder always receives correct context. At inference time, the decoder must use its own predictions since there are no ground-truth targets available.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        embedded = self.embedding(src)             # [batch, src_len, embed_dim]
        _, (hidden, cell) = self.lstm(embedded)     # hidden, cell: [1, batch, hidden_dim]
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_token, hidden, cell):
        # input_token: [batch, 1] — one token at a time
        embedded = self.embedding(input_token)                  # [batch, 1, embed_dim]
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))                 # [batch, output_dim]
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, target_vocab_size):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.target_vocab_size = target_vocab_size

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.size(0)
        trg_len = trg.size(1)
        outputs = torch.zeros(batch_size, trg_len, self.target_vocab_size)

        # Encode the source sequence
        hidden, cell = self.encoder(src)

        # First decoder input is the first target token (typically an <SOS> start-of-sequence token)
        input_token = trg[:, 0:1]  # [batch, 1]

        for t in range(1, trg_len):
            prediction, hidden, cell = self.decoder(input_token, hidden, cell)
            outputs[:, t, :] = prediction

            # Teacher forcing: use actual target or predicted token?
            if torch.rand(1).item() < teacher_forcing_ratio:
                input_token = trg[:, t:t+1]                    # Ground truth
            else:
                input_token = prediction.argmax(dim=1, keepdim=True)  # Model's prediction

        return outputs

# Demo
encoder = Encoder(input_dim=100, embed_dim=32, hidden_dim=64)
decoder = Decoder(output_dim=120, embed_dim=32, hidden_dim=64)
model = Seq2Seq(encoder, decoder, target_vocab_size=120)

src = torch.randint(0, 100, (2, 8))    # 2 source sentences, length 8
trg = torch.randint(0, 120, (2, 10))   # 2 target sentences, length 10
output = model(src, trg)
print("Seq2Seq output shape:", output.shape)  # [2, 10, 120]
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

The teacher_forcing_ratio parameter controls the probability of using ground truth vs. the model's own predictions at each decoder step. A ratio of 0.5 means the model uses ground truth about half the time. During inference (evaluation), you would set this to 0.0 so the model relies entirely on its own predictions.

Beam Search: At inference time, instead of greedily picking the single most likely token at each step (greedy decoding), beam search keeps track of the top-k most probable partial sequences (beams) and expands them in parallel. This often produces higher-quality translations. A beam width of 4–8 is typical in practice.
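
For completeness, here is what plain greedy decoding could look like at inference time. It is a sketch that reuses the Encoder and Decoder classes defined above; the start-of-sequence and end-of-sequence indices (sos_idx, eos_idx) are assumed special tokens in your target vocabulary.

import torch

def greedy_decode(encoder, decoder, src, sos_idx=1, eos_idx=2, max_len=20):
    """Generate a target sequence one token at a time, always taking the argmax."""
    encoder.eval()
    decoder.eval()
    with torch.no_grad():
        hidden, cell = encoder(src)                                   # encode the source
        input_token = torch.full((src.size(0), 1), sos_idx, dtype=torch.long)
        generated = []
        for _ in range(max_len):
            pred, hidden, cell = decoder(input_token, hidden, cell)
            input_token = pred.argmax(dim=1, keepdim=True)            # greedy choice
            generated.append(input_token)
            if (input_token == eos_idx).all():                        # every sequence finished
                break
    return torch.cat(generated, dim=1)                                # [batch, generated_len]

# Usage with the demo objects above:
# generated = greedy_decode(encoder, decoder, src)
# print(generated.shape)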

Time Series Forecasting

The Core Idea (Plain English)

LSTMs aren't just for text — they excel at any ordered data where the past predicts the future. Stock prices, temperature readings, energy consumption, heart rate signals — these all have temporal patterns an LSTM can learn. The technique is beautifully simple: show the model a window of past values and ask it to predict what comes next.

The Best Analogy: Weather Forecasting by Pattern

Think of how a weather forecaster works:

  • Look at the last 7 days of temperature — this is your "window" (input sequence)
  • Identify the pattern — temperatures rising? Cyclic? Following yesterday's rain?
  • Predict tomorrow — based on the pattern seen in the window

An LSTM does exactly this, but instead of hand-crafted rules, it learns the patterns from data. You slide the window forward one step at a time to create thousands of training examples from one time series.

Ultra-compressed version:

# Time series forecasting in 3 steps:
# 1. Slide a window across the data to create (input, target) pairs
#    [1,2,3,4,5] → 6,  [2,3,4,5,6] → 7,  [3,4,5,6,7] → 8 ...
# 2. Feed each window through LSTM → get final hidden state
# 3. Linear layer maps hidden state → predicted next value

LSTMs aren't just for text — they excel at any sequential prediction task. Time series forecasting uses historical values (stock prices, temperatures, energy consumption) to predict future values. The key technique is the sliding window: given a window of n past observations, predict the next value (or next k values for multi-step forecasting).

Creating Sliding Window Datasets

The function below takes a time series and creates input/target pairs using a sliding window. For example, with a window size of 5 and the series [1, 2, 3, 4, 5, 6, 7], it creates: input=[1,2,3,4,5] → target=6, input=[2,3,4,5,6] → target=7.

import torch
import numpy as np

def create_sequences(data, window_size):
    """Create sliding window sequences for time series prediction."""
    xs, ys = [], []
    for i in range(len(data) - window_size):
        x = data[i : i + window_size]
        y = data[i + window_size]
        xs.append(x)
        ys.append(y)
    return torch.tensor(np.array(xs), dtype=torch.float32), \
           torch.tensor(np.array(ys), dtype=torch.float32)

# Generate synthetic sine wave data
t = np.linspace(0, 50, 500)
data = np.sin(t) + 0.1 * np.random.randn(500)  # Sine wave + noise

# Create sequences with window size 20
window_size = 20
X, y = create_sequences(data, window_size)

# Add feature dimension for LSTM: [batch, seq_len, 1]
X = X.unsqueeze(-1)

print("Feature tensor shape:", X.shape)   # [480, 20, 1]
print("Target tensor shape: ", y.shape)   # [480]
print(f"Created {len(X)} training sequences from {len(data)} data points")

Each input sample is a window of 20 consecutive values, and the target is the 21st value. The .unsqueeze(-1) adds a feature dimension because LSTM expects 3D input: [batch, seq_len, features].

LSTM Forecasting Model

Now let's build and train an LSTM to predict the next value in our sine wave. The model reads 20 past observations and outputs a single predicted value for the next time step.

import torch
import torch.nn as nn
import numpy as np

# ---- Recreate data (independent block) ----
t = np.linspace(0, 50, 500)
data = np.sin(t) + 0.1 * np.random.randn(500)

def create_sequences(data, window_size):
    xs, ys = [], []
    for i in range(len(data) - window_size):
        xs.append(data[i:i+window_size])
        ys.append(data[i+window_size])
    return torch.tensor(np.array(xs), dtype=torch.float32), \
           torch.tensor(np.array(ys), dtype=torch.float32)

X, y = create_sequences(data, window_size=20)
X = X.unsqueeze(-1)  # [480, 20, 1]

# ---- LSTM forecasting model ----
class TimeSeriesLSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=50, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        output, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1]).squeeze(-1)

model = TimeSeriesLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# ---- Train ----
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    predictions = model(X)
    loss = criterion(predictions, y)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 25 == 0:
        print(f"Epoch {epoch+1:3d} | MSE Loss: {loss.item():.6f}")

# ---- Test prediction ----
model.eval()
with torch.no_grad():
    test_window = X[-1:, :, :]      # Last window
    pred = model(test_window).item()
    actual = y[-1].item()
    print(f"\nPredicted: {pred:.4f} | Actual: {actual:.4f} | Error: {abs(pred - actual):.4f}")

The loss should decrease steadily over 100 epochs. The model learns to approximate the sine wave pattern from just 20 past values. For real-world forecasting (stock prices, weather), you would normalize your data, use train/validation/test splits, and potentially stack multiple LSTM layers for more capacity.
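
For multi-step forecasting, one simple approach is to roll the model forward autoregressively: predict one value, append it to the window, and repeat. The function below is a sketch that assumes the TimeSeriesLSTM model and the X tensor from the block above.

import torch

def forecast(model, seed_window, steps=30):
    """Autoregressive multi-step forecast from one seed window of shape [1, window_size, 1]."""
    model.eval()
    window = seed_window.clone()
    preds = []
    with torch.no_grad():
        for _ in range(steps):
            next_val = model(window)                           # predict one step ahead
            preds.append(next_val.item())
            # Slide the window: drop the oldest value, append the prediction
            window = torch.cat([window[:, 1:, :], next_val.view(1, 1, 1)], dim=1)
    return preds

# Usage with the model trained above:
# future = forecast(model, X[-1:, :, :], steps=30)
# print(future[:5])

Note that errors compound over long horizons: each prediction feeds the next, so autoregressive forecasts drift the further ahead you go.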

Packed Sequences

The Core Idea (Plain English)

Real sentences have different lengths: "Hi" (1 word) vs "The quick brown fox jumped over the lazy dog" (9 words). To batch them, we pad shorter sentences with zeros to match the longest — but this wastes computation because the LSTM processes meaningless PAD tokens. Packed sequences fix this by telling the LSTM "stop here for this sentence" so it skips the padding entirely.

The Best Analogy: Shipping Boxes of Different Sizes

Imagine shipping 3 packages of different sizes:

  • Padding approach — put all packages in the same giant box, fill empty space with bubble wrap (wasteful)
  • Packing approach — use each package's actual size, label it with its dimensions, and stack them efficiently (no wasted space)

pack_padded_sequence = remove the bubble wrap. pad_packed_sequence = put it back afterward. The LSTM only processes real content, saving 15–40% computation.

Ultra-compressed version:

# Without packing: LSTM processes ALL positions (including useless PAD tokens)
# "Hello world _____ _____"  ← LSTM wastes time on _ _ _ _ _

# With packing:    LSTM processes ONLY real tokens
# "Hello world" (length=2)   ← LSTM stops after 2 steps for this sentence

# The workflow:
packed = pack_padded_sequence(padded_batch, lengths)  # Remove padding
output, hidden = lstm(packed)                          # Process efficiently
unpacked, lengths = pad_packed_sequence(output)        # Restore padding shape (returns tensor + lengths)

That is the whole trick in practice: you give the LSTM the true length of every sequence in the batch, it skips the padded positions entirely, and you get the same results with less computation.

PyTorch provides two utility functions for this:

  • pack_padded_sequence() — converts a padded tensor + lengths into a compact PackedSequence object
  • pad_packed_sequence() — converts a PackedSequence back to a padded tensor (after the LSTM has processed it)

pack_padded_sequence Example

The key workflow is: (1) sort your batch by sequence length in descending order, (2) pack it, (3) pass through the LSTM, (4) unpack the output. Here's a complete example:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Three sentences of different lengths (already padded with zeros)
# Sentence 1: 5 tokens, Sentence 2: 3 tokens, Sentence 3: 7 tokens
padded = torch.randn(3, 7, 10)  # batch=3, max_len=7, features=10
lengths = torch.tensor([5, 3, 7])

# Step 1: Sort by length (descending) — required by pack_padded_sequence
sorted_lengths, sort_idx = lengths.sort(descending=True)
sorted_padded = padded[sort_idx]
print("Sorted lengths:", sorted_lengths.tolist())  # [7, 5, 3]

# Step 2: Pack the padded sequence
packed = pack_padded_sequence(sorted_padded, sorted_lengths.cpu(), batch_first=True)
print("Packed data shape:", packed.data.shape)  # [15, 10] — only 15 real tokens (7+5+3)

# Step 3: Pass through LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
packed_output, (h_n, c_n) = lstm(packed)

# Step 4: Unpack back to padded format
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)
print("Unpacked output shape:", output.shape)        # [3, 7, 20]
print("Output lengths:       ", output_lengths.tolist())  # [7, 5, 3]

# Step 5: Unsort to restore original order
_, unsort_idx = sort_idx.sort()
output = output[unsort_idx]
h_n = h_n[:, unsort_idx, :]
print("Final h_n shape (original order):", h_n.shape)  # [1, 3, 20]

The packed representation contains only 15 real tokens (7 + 5 + 3) instead of the 21 positions in the padded tensor (3 × 7), so the LSTM skips 6 of the 21 positions, roughly 29% less computation on this small batch. For large batches with highly variable sequence lengths, the savings can be substantial.
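If the manual sort/unsort bookkeeping feels heavy, note that pack_padded_sequence also accepts enforce_sorted=False, in which case PyTorch sorts internally and returns everything in the original batch order (sorting by hand is only required if you later need ONNX export). A minimal sketch:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.randn(3, 7, 10)         # batch=3, max_len=7, features=10
lengths = torch.tensor([5, 3, 7])      # Unsorted; allowed with enforce_sorted=False

packed = pack_padded_sequence(padded, lengths.cpu(), batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
packed_output, (h_n, c_n) = lstm(packed)

# Outputs come back in the ORIGINAL batch order; no manual unsorting needed
output, out_lengths = pad_packed_sequence(packed_output, batch_first=True)
print(output.shape, out_lengths.tolist())   # torch.Size([3, 7, 20]) [5, 3, 7]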

Experiment
Packed vs Padded Performance

With packed sequences, the LSTM avoids processing PAD tokens entirely. This is especially impactful when batch sizes are large and sequences have very different lengths (e.g., short tweets mixed with long articles). In production NLP systems, packing typically gives a 15–40% training speedup with no change in model accuracy.
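To see the effect yourself, here is a rough micro-benchmark sketch on CPU; the shapes, lengths, and timing loop are illustrative, and the measured gap will vary with hardware and how variable the lengths are.

import time
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
batch, max_len, feat, hidden = 64, 200, 32, 128
# Variable-length sequences, padded up to max_len (sorted for enforce_sorted=True)
lengths = torch.randint(20, max_len + 1, (batch,)).sort(descending=True).values
padded = torch.randn(batch, max_len, feat)
lstm = nn.LSTM(feat, hidden, batch_first=True)

def avg_time(fn, repeats=10):
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
        return (time.perf_counter() - start) / repeats

padded_ms = avg_time(lambda: lstm(padded)) * 1000
# Packing cost is included in the timed call, to keep the comparison fair
packed_ms = avg_time(
    lambda: lstm(pack_padded_sequence(padded, lengths, batch_first=True))
) * 1000
print(f"Padded: {padded_ms:.1f} ms | Packed: {packed_ms:.1f} ms per forward pass")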


Putting It All Together — LSTM with Packing

Here is a complete model class that integrates embedding, packing, LSTM processing, and unpacking into a clean module you can reuse in any NLP project:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class PackedLSTMClassifier(nn.Module):
    """Text classifier that uses packed sequences for efficiency."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional
        self.dropout = nn.Dropout(0.5)

    def forward(self, text, lengths):
        # text: [batch, max_seq_len], lengths: [batch]
        embedded = self.dropout(self.embedding(text))

        # Sort, pack, process, unpack, unsort
        sorted_lengths, sort_idx = lengths.sort(descending=True)
        sorted_embedded = embedded[sort_idx]

        packed = pack_padded_sequence(sorted_embedded, sorted_lengths.cpu(), batch_first=True)
        packed_output, (h_n, _) = self.lstm(packed)

        # h_n: [2, batch, hidden] for bidirectional
        # Concatenate forward and backward final states
        h_combined = torch.cat([h_n[0], h_n[1]], dim=1)  # [batch, hidden*2]

        # Unsort to restore original order
        _, unsort_idx = sort_idx.sort()
        h_combined = h_combined[unsort_idx]

        return self.fc(self.dropout(h_combined))

# Demo
model = PackedLSTMClassifier(
    vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=3, pad_idx=0
)
texts = torch.randint(1, 5000, (4, 15))   # batch=4, max_len=15
lengths = torch.tensor([15, 10, 7, 12])    # Actual lengths

output = model(texts, lengths)
print("Classification output:", output.shape)  # [4, 3] — 3 classes
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

This pattern — embed → sort → pack → LSTM → unpack → unsort → classify — is the standard recipe for efficient text classification in PyTorch. The bidirectional LSTM gives the model access to both left and right context, and packing ensures no computation is wasted on padding tokens.

Practical Tips for Sequence Models:
  • Gradient clipping: Use torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) to prevent exploding gradients — essential for RNNs.
  • Layer normalization: Consider nn.LayerNorm instead of batch norm for sequence data, as batch statistics are less stable with variable-length sequences.
  • Learning rate: Start with 1e-3 for Adam. Reduce if training is unstable.
  • Hidden size: 128–512 is typical. Larger isn't always better — it can overfit on small datasets.
  • Stacking layers: 2–3 LSTM layers usually helps; more than 4 rarely does (diminishing returns).
  • Dropout between layers: Set dropout=0.3 in nn.LSTM() for multi-layer LSTMs (applied between layers, not within); see the sketch after this list.
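To make a few of these tips concrete, here is a minimal sketch of a 2-layer LSTM with inter-layer dropout and LayerNorm on the final hidden state. The sizes and dropout value are illustrative defaults, not tuned settings.

import torch
import torch.nn as nn

class StackedLSTMRegressor(nn.Module):
    """2-layer LSTM with dropout between layers and LayerNorm before the head."""
    def __init__(self, input_size=1, hidden_size=128, num_layers=2, dropout=0.3):
        super().__init__()
        # dropout= is applied between stacked LSTM layers, not inside a layer
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        self.norm = nn.LayerNorm(hidden_size)   # normalises the feature dimension per sample
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)               # h_n: [num_layers, batch, hidden]
        return self.fc(self.norm(h_n[-1])).squeeze(-1)

model = StackedLSTMRegressor()
x = torch.randn(8, 20, 1)                        # batch=8, seq_len=20, features=1
print(model(x).shape)                            # torch.Size([8])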

Gradient Clipping in Practice

Exploding gradients can cause training to diverge — loss suddenly jumps to infinity, and the model's weights become NaN. Gradient clipping scales down the gradient vector whenever its norm exceeds a threshold, keeping training stable. Here's how to integrate it into your training loop:

import torch
import torch.nn as nn

# Simple LSTM model for demonstration
model = nn.LSTM(input_size=10, hidden_size=32, num_layers=2, batch_first=True)
fc = nn.Linear(32, 1)
optimizer = torch.optim.Adam(list(model.parameters()) + list(fc.parameters()), lr=0.001)

# Training step with gradient clipping
x = torch.randn(4, 15, 10)
target = torch.randn(4)

output, (h_n, _) = model(x)
pred = fc(h_n[-1]).squeeze(-1)
loss = nn.MSELoss()(pred, target)

optimizer.zero_grad()
loss.backward()

# Check gradient norm BEFORE clipping
# (max_norm=inf caps the clip coefficient at 1.0, so this call only measures the norm; nothing is scaled)
total_norm_before = torch.nn.utils.clip_grad_norm_(
    list(model.parameters()) + list(fc.parameters()),
    max_norm=float('inf')
)

# Now actually clip the same gradients; no second backward() pass is needed
total_norm = torch.nn.utils.clip_grad_norm_(
    list(model.parameters()) + list(fc.parameters()),
    max_norm=1.0  # Clip to max norm of 1.0
)

optimizer.step()
print(f"Gradient norm before clip: {total_norm_before:.4f}")
print(f"Gradient norm after clip:  {min(total_norm.item(), 1.0):.4f}")
print("Clipping keeps gradients bounded, preventing training instability!")

Always add gradient clipping when training RNNs, LSTMs, or GRUs. A max_norm of 1.0 to 5.0 is typical. This single line can be the difference between stable training and mysterious NaN losses.

Multi-Step Forecasting

So far we've predicted one step ahead. For multi-step forecasting (predicting the next 5 or 10 values), there are two common approaches: (1) recursive — predict one step, append to input, repeat; or (2) direct — modify the model to output multiple values at once. Here's the direct approach:

import torch
import torch.nn as nn
import numpy as np

# Generate synthetic data
t = np.linspace(0, 50, 500)
data = np.sin(t)

# Create sequences: input = 20 steps, target = NEXT 5 steps
def create_multi_step_sequences(data, input_len, output_len):
    xs, ys = [], []
    for i in range(len(data) - input_len - output_len + 1):
        xs.append(data[i:i+input_len])
        ys.append(data[i+input_len:i+input_len+output_len])
    return torch.tensor(np.array(xs), dtype=torch.float32), \
           torch.tensor(np.array(ys), dtype=torch.float32)

X, y = create_multi_step_sequences(data, input_len=20, output_len=5)
X = X.unsqueeze(-1)  # [N, 20, 1]

print("Input shape: ", X.shape)   # [476, 20, 1]
print("Target shape:", y.shape)   # [476, 5] — predict 5 future values

# Model that outputs 5 future values
class MultiStepLSTM(nn.Module):
    def __init__(self, forecast_steps=5):
        super().__init__()
        self.lstm = nn.LSTM(1, 64, batch_first=True)
        self.fc = nn.Linear(64, forecast_steps)  # Output 5 values at once

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])

model = MultiStepLSTM(forecast_steps=5)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(50):
    pred = model(X)
    loss = nn.MSELoss()(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1:2d} | MSE: {loss.item():.6f}")

# Test: predict next 5 values from the last window
model.eval()
with torch.no_grad():
    forecast = model(X[-1:])
    print(f"\n5-step forecast: {forecast[0].tolist()}")
    print(f"Actual values:   {y[-1].tolist()}")

The direct approach is simpler and avoids error accumulation (where errors compound at each recursive step). For longer forecast horizons, you may need larger hidden sizes or attention mechanisms — which we'll cover in Part 7: Transformers & Attention.
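If you do want the recursive variant, here is a hedged sketch that reuses the single-step TimeSeriesLSTM from earlier (assumed to be already trained); each prediction is appended to the input window and fed back in:

import torch

def recursive_forecast(model, window, steps=5):
    """Roll a single-step model forward `steps` times.
    `window` has shape [1, window_size, 1]: the most recent observed values."""
    model.eval()
    preds = []
    with torch.no_grad():
        for _ in range(steps):
            next_val = model(window)                        # [1]: one step ahead
            preds.append(next_val.item())
            # Slide the window: drop the oldest value, append the new prediction
            window = torch.cat([window[:, 1:, :], next_val.view(1, 1, 1)], dim=1)
    return preds

# Usage sketch, assuming `model` and `X` from the single-step example exist:
# forecast = recursive_forecast(model, X[-1:], steps=5)
# print(forecast)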

Conclusion & Next Steps

In this article you've learned the complete toolkit for sequence modeling in PyTorch:

  • Vanilla RNNs — the foundational concept of recurrent processing and hidden states
  • LSTMs — cell states and three gates (forget, input, output) that solve vanishing gradients
  • GRUs — a lighter alternative with two gates and fewer parameters
  • Word Embeddings — dense, learnable vector representations via nn.Embedding
  • Text Classification — a full pipeline from raw text to trained sentiment classifier
  • Bidirectional LSTMs — capturing both past and future context simultaneously
  • Seq2Seq Models — encoder-decoder architectures with teacher forcing
  • Time Series Forecasting — sliding windows, single-step and multi-step prediction
  • Packed Sequences — efficient handling of variable-length inputs

RNNs and LSTMs dominated NLP and sequence modeling for years, but they have a fundamental limitation: they process sequences one token at a time, which makes them slow on long sequences and hard to parallelise on GPUs. In Part 7: Transformers & Attention, you'll learn about the architecture that largely replaced recurrent models — the Transformer — and how its self-attention mechanism processes all tokens in parallel while capturing even longer-range dependencies.

Next in the Series

In Part 7: Transformers & Attention, we'll implement self-attention from scratch, build a complete Transformer encoder, explore positional encoding, and see why attention mechanisms have become the foundation of modern NLP (BERT, GPT) and beyond.