
Part 7: RNNs, NLP & Time Series

May 3, 2026 Wasil Zafar 35 min read

From words to predictions — master recurrent neural networks for natural language processing and time series. Build LSTM/GRU models, implement sentiment analysis, generate text character-by-character, forecast time series data, and architect sequence-to-sequence models.

Table of Contents

  1. Sequence Data & Why RNNs
  2. SimpleRNN & LSTM
  3. Word Embeddings
  4. Text Preprocessing
  5. Sentiment Analysis
  6. Text Generation
  7. Time Series Fundamentals
  8. Time Series Forecasting
  9. Sequence-to-Sequence
  10. Practical Tips

Sequence Data & Why RNNs

Many real-world problems involve sequential data — text (sequences of words), time series (sequences of measurements), audio (sequences of samples), and video (sequences of frames). Unlike images where spatial neighbors matter, sequences have a temporal dimension: the order of elements carries meaning. "The cat sat on the mat" means something very different rearranged as "mat the the on sat cat."

Dense (fully-connected) networks treat each input independently — they have no concept of order or memory. If you flatten a sentence into a fixed-size vector and feed it to a Dense layer, the network cannot distinguish "dog bites man" from "man bites dog." We need architectures that maintain a hidden state — an internal memory that accumulates information as it processes each element in the sequence.

Key Insight: Recurrent Neural Networks (RNNs) process sequences one element at a time, maintaining a hidden state $h_t$ that serves as a compressed summary of everything seen so far. At each timestep, the network combines the current input $x_t$ with the previous hidden state $h_{t-1}$ to produce a new state: $h_t = f(W_h \cdot h_{t-1} + W_x \cdot x_t + b)$.

The Vanishing Gradient Problem

Simple RNNs struggle with long sequences because gradients must flow through many timesteps during backpropagation through time (BPTT). With repeated multiplication by the weight matrix, gradients either vanish (shrink exponentially, losing long-range dependencies) or explode (grow exponentially, causing numerical instability). This is why vanilla RNNs rarely work for sequences longer than ~20 timesteps.

import tensorflow as tf
import numpy as np

# Demonstrate sequence data types
# 1. Text: sequence of token indices
text_sequence = [45, 123, 7, 892, 33]  # "The cat sat on mat" → token IDs
print("Text sequence (token IDs):", text_sequence)

# 2. Time series: sequence of measurements
time_series = np.sin(np.linspace(0, 4 * np.pi, 100))  # 100 timesteps
print(f"Time series shape: {time_series.shape}")

# 3. Audio: sequence of amplitude samples
audio_signal = np.random.randn(16000)  # 1 second at 16kHz
print(f"Audio signal shape: {audio_signal.shape}")

# Why Dense networks fail for sequences:
# Flatten approach loses order information
sentence_a = "dog bites man"  # Different meaning
sentence_b = "man bites dog"  # Same words, different order
# A bag-of-words Dense layer cannot distinguish these!

# RNN hidden state evolution
# h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
np.random.seed(42)
W_h = np.random.randn(4, 4) * 0.5  # Hidden-to-hidden weights
W_x = np.random.randn(4, 3) * 0.5  # Input-to-hidden weights
b = np.zeros(4)

h = np.zeros(4)  # Initial hidden state
sequence = np.random.randn(5, 3)  # 5 timesteps, 3 features

print("\nRNN hidden state evolution:")
for t in range(5):
    h = np.tanh(W_h @ h + W_x @ sequence[t] + b)
    print(f"  t={t}: h = [{h[0]:.3f}, {h[1]:.3f}, {h[2]:.3f}, {h[3]:.3f}]")

# Demonstrate vanishing gradients
print("\nVanishing gradient demonstration:")
grad = 1.0
weight = 0.7  # Typical weight magnitude
for t in range(20):
    grad *= weight
    if t % 5 == 4:
        print(f"  After {t+1} timesteps: gradient = {grad:.6f}")

The output shows how gradients decay exponentially — after 20 timesteps with a weight of 0.7, the gradient has shrunk to less than 0.001 of its original value. This makes it nearly impossible for a SimpleRNN to learn dependencies spanning more than ~10-15 steps. LSTM and GRU architectures solve this with gating mechanisms.
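Vanishing gradients motivate the gated cells covered next; the exploding case is usually handled more directly with gradient clipping. A minimal sketch (toy model and data, illustrative hyperparameters) using the optimizer's clipnorm argument:

```python
import tensorflow as tf

# Toy model/data — the point is clipnorm: clipping the global gradient
# norm caps exploding updates while leaving well-behaved gradients alone
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50, 10)),
    tf.keras.layers.SimpleRNN(64),
    tf.keras.layers.Dense(1)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),
    loss='mse'
)

X = tf.random.normal([32, 50, 10])
y = tf.random.normal([32, 1])
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
```

clipvalue (clip each gradient element) is the other common option; clipnorm preserves the gradient's direction, which usually makes it the safer default.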

SimpleRNN & LSTM

TensorFlow provides three core recurrent layers: SimpleRNN, LSTM (Long Short-Term Memory), and GRU (Gated Recurrent Unit). While SimpleRNN implements the basic recurrence formula, LSTM introduces a sophisticated gating mechanism that allows information to flow across many timesteps without degradation.

LSTM Cell Architecture — Gates & Cell State
flowchart LR
    subgraph LSTM["LSTM Cell"]
        direction TB
        X["x_t (input)"] --> FG["Forget Gate: σ(W_f·[h,x]+b_f)"]
        H["h_{t-1}"] --> FG
        X --> IG["Input Gate: σ(W_i·[h,x]+b_i)"]
        H --> IG
        X --> CC["Candidate: tanh(W_c·[h,x]+b_c)"]
        H --> CC
        X --> OG["Output Gate: σ(W_o·[h,x]+b_o)"]
        H --> OG
        FG --> MUL1["⊙"]
        CS["C_{t-1}"] --> MUL1
        IG --> MUL2["⊙"]
        CC --> MUL2
        MUL1 --> ADD["+"]
        MUL2 --> ADD
        ADD --> CT["C_t"]
        CT --> TANH["tanh"]
        TANH --> MUL3["⊙"]
        OG --> MUL3
        MUL3 --> HT["h_t"]
    end

The LSTM cell maintains two states: the cell state $C_t$ (long-term memory highway) and the hidden state $h_t$ (short-term working memory). Three gates control information flow:

  • Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ — decides what to discard from cell state
  • Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ — decides what new information to store
  • Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ — decides what to output from cell state

The cell state update combines forgetting and adding:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ is the candidate cell state. The hidden state is then: $h_t = o_t \odot \tanh(C_t)$.
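To make the gate equations concrete, here is a single LSTM timestep traced by hand in NumPy — toy dimensions and random weights, a sketch of the math rather than the Keras implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden, features = 4, 3
concat = hidden + features  # size of the concatenation [h_{t-1}, x_t]

# One weight matrix per gate (forget, input, output) plus the candidate
W_f, W_i, W_o, W_c = (rng.normal(size=(hidden, concat)) * 0.5 for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(hidden)

h_prev = np.zeros(hidden)   # h_{t-1}
C_prev = np.zeros(hidden)   # C_{t-1}
x_t = rng.normal(size=features)
hx = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]

f_t = sigmoid(W_f @ hx + b_f)        # forget gate — what to discard
i_t = sigmoid(W_i @ hx + b_i)        # input gate — what to store
o_t = sigmoid(W_o @ hx + b_o)        # output gate — what to emit
C_tilde = np.tanh(W_c @ hx + b_c)    # candidate cell state

C_t = f_t * C_prev + i_t * C_tilde   # cell state: forget + add
h_t = o_t * np.tanh(C_t)             # hidden state

print("f_t:", np.round(f_t, 3))
print("C_t:", np.round(C_t, 3))
print("h_t:", np.round(h_t, 3))
```

Note the additive cell-state update: gradients can flow through `C_t = f_t * C_prev + ...` without repeated matrix multiplication, which is what sidesteps the vanishing-gradient problem.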

LSTM in TensorFlow

Here is how these recurrent layers are used in TensorFlow. Each code example below is self-contained and can be run independently:

import tensorflow as tf
import numpy as np

# SimpleRNN vs LSTM vs GRU comparison
batch_size = 32
timesteps = 50
features = 10

# Create dummy sequential data
X = tf.random.normal([batch_size, timesteps, features])

# 1. SimpleRNN — basic recurrence, struggles with long sequences
simple_rnn = tf.keras.layers.SimpleRNN(64, return_sequences=False)
output_simple = simple_rnn(X)
print(f"SimpleRNN output: {output_simple.shape}")  # (32, 64)

# 2. LSTM — gated architecture, handles long dependencies
lstm = tf.keras.layers.LSTM(64, return_sequences=False)
output_lstm = lstm(X)
print(f"LSTM output: {output_lstm.shape}")  # (32, 64)

# 3. LSTM with return_sequences=True — output at every timestep
lstm_seq = tf.keras.layers.LSTM(64, return_sequences=True)
output_seq = lstm_seq(X)
print(f"LSTM (return_sequences): {output_seq.shape}")  # (32, 50, 64)

# 4. LSTM with return_state=True — also returns cell state
lstm_state = tf.keras.layers.LSTM(64, return_sequences=False, return_state=True)
output, hidden_state, cell_state = lstm_state(X)
print(f"\nLSTM return_state:")
print(f"  Output (h_t): {output.shape}")       # (32, 64)
print(f"  Hidden state: {hidden_state.shape}")  # (32, 64) — same as output
print(f"  Cell state:   {cell_state.shape}")    # (32, 64)

# 5. GRU — simplified LSTM (2 gates instead of 3, no cell state)
gru = tf.keras.layers.GRU(64, return_sequences=False)
output_gru = gru(X)
print(f"\nGRU output: {output_gru.shape}")  # (32, 64)

# Parameter comparison
print("\nParameter counts:")
print(f"  SimpleRNN(64): {simple_rnn.count_params()}")  # (10+64)*64 + 64
print(f"  LSTM(64):      {lstm.count_params()}")         # 4 * [(10+64)*64 + 64]
print(f"  GRU(64):       {gru.count_params()}")          # 3 * [(10+64)*64 + 2*64] (reset_after=True default)

GRU & Bidirectional Wrappers

GRU (Gated Recurrent Unit) simplifies LSTM by merging the forget and input gates into a single "update gate" and combining the cell state with hidden state. It often performs comparably to LSTM with fewer parameters. The Bidirectional wrapper runs the RNN in both directions, capturing both past and future context at each timestep.
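For comparison with the LSTM equations, here is one GRU timestep traced in NumPy. This is a sketch with toy dimensions; note that sign conventions for the update gate vary between references — this version uses $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden, features = 4, 3
concat = hidden + features

# Two gates (update z, reset r) plus the candidate — vs three gates in LSTM
W_z, W_r, W_h = (rng.normal(size=(hidden, concat)) * 0.5 for _ in range(3))

h_prev = rng.normal(size=hidden) * 0.1
x_t = rng.normal(size=features)
hx = np.concatenate([h_prev, x_t])

z_t = sigmoid(W_z @ hx)   # update gate — blend old state vs candidate
r_t = sigmoid(W_r @ hx)   # reset gate — how much past to use in candidate
h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate
h_t = (1 - z_t) * h_prev + z_t * h_tilde  # interpolation, no separate C_t

print("z_t:", np.round(z_t, 3))
print("h_t:", np.round(h_t, 3))
```

There is no separate cell state: the hidden state itself carries the long-term memory, which is why GRU has roughly 3/4 of an LSTM's parameters.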

import tensorflow as tf
import numpy as np

# Bidirectional LSTM — processes sequence forwards AND backwards
X = tf.random.normal([32, 50, 10])  # batch=32, timesteps=50, features=10

# Unidirectional: only sees past context
uni_lstm = tf.keras.layers.LSTM(64, return_sequences=True)
out_uni = uni_lstm(X)
print(f"Unidirectional LSTM: {out_uni.shape}")  # (32, 50, 64)

# Bidirectional: sees past AND future context
bi_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True)
)
out_bi = bi_lstm(X)
print(f"Bidirectional LSTM:  {out_bi.shape}")  # (32, 50, 128) — 2×64 concatenated

# Stacked LSTMs — deeper representations
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50, 10)),
    # First LSTM must return_sequences=True to feed next LSTM
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dropout(0.3),
    # Second LSTM can return only final state
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=False)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
print(f"\nTotal parameters: {model.count_params():,}")
Critical Rule: When stacking multiple RNN layers, all layers except the last MUST use return_sequences=True. The intermediate layers need to pass the full sequence (all timesteps) to the next layer. Only the final RNN layer can use return_sequences=False to produce a single vector.

Word Embeddings

Words cannot be fed directly to neural networks — they must be converted to numerical vectors. One-hot encoding creates sparse vectors (vocabulary size = 10,000 → 10,000-dimensional vectors), wasting memory and failing to capture word relationships. Embeddings map each word to a dense, low-dimensional vector (typically 64–300 dimensions) where similar words cluster together in the vector space.

The tf.keras.layers.Embedding layer is a trainable lookup table: given an integer index, it returns the corresponding embedding vector. During training, these vectors are learned end-to-end to capture semantic relationships. Alternatively, you can initialize with pretrained embeddings like GloVe or Word2Vec for better generalization on small datasets.

import tensorflow as tf
import numpy as np

# Embedding layer basics
vocab_size = 10000    # Number of unique words
embedding_dim = 128   # Dimension of each word vector
max_length = 100      # Maximum sequence length

# Create an Embedding layer
embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,      # Vocabulary size
    output_dim=embedding_dim,  # Embedding dimension
    input_length=max_length    # Optional: sequence length
)

# Input: batch of integer sequences (word indices)
word_indices = tf.constant([[1, 42, 7, 0, 0],
                            [88, 5, 23, 14, 0]])  # (batch=2, seq_len=5)

# Output: corresponding embedding vectors
embedded = embedding(word_indices)
print(f"Input shape:  {word_indices.shape}")  # (2, 5)
print(f"Output shape: {embedded.shape}")      # (2, 5, 128)

# Each word index maps to a 128-dim vector
print(f"\nWord index 42 → vector shape: {embedding(tf.constant([42])).shape}")
print(f"Embedding matrix shape: {embedding.embeddings.shape}")  # (10000, 128)

# Embedding dimension guidelines:
# Vocabulary < 10K  → embedding_dim = 64
# Vocabulary 10K-50K → embedding_dim = 128
# Vocabulary > 50K  → embedding_dim = 256-300

# Simple model with Embedding
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

print(f"\nEmbedding parameters: {vocab_size * embedding_dim:,}")
print(f"Total model parameters: {model.count_params():,}")
model.summary()

Pretrained Embeddings (GloVe)

Pretrained embeddings like GloVe (Global Vectors) capture semantic relationships from massive corpora. "King - Man + Woman ≈ Queen" is the classic example. Using pretrained embeddings provides a strong initialization, especially when your training data is limited.

import tensorflow as tf
import numpy as np

# Simulating loading GloVe embeddings
# In practice: download glove.6B.100d.txt from Stanford NLP

# Step 1: Build word-to-index mapping
vocab = ["", "the", "cat", "sat", "on", "mat", "dog", "ran", "fast", "slow"]
word_index = {word: i for i, word in enumerate(vocab)}
vocab_size = len(vocab)
embedding_dim = 100

# Step 2: Create embedding matrix from pretrained vectors
# (simulated — real GloVe loads from file)
np.random.seed(42)
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_index.items():
    if idx > 0:  # Skip padding token
        # In reality: embedding_matrix[idx] = glove_vectors[word]
        embedding_matrix[idx] = np.random.randn(embedding_dim) * 0.1

print(f"Embedding matrix shape: {embedding_matrix.shape}")
print(f"'cat' vector (first 5): {embedding_matrix[word_index['cat']][:5]}")

# Step 3: Create non-trainable Embedding layer with pretrained weights
pretrained_embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    trainable=False  # Freeze pretrained embeddings
)

# Step 4: Model with frozen pretrained embeddings
model = tf.keras.Sequential([
    pretrained_embedding,
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

# Trainable vs non-trainable parameter count
trainable = sum(tf.keras.backend.count_params(w) for w in model.trainable_weights)
non_trainable = sum(tf.keras.backend.count_params(w) for w in model.non_trainable_weights)
print(f"\nTrainable params:     {trainable:,}")
print(f"Non-trainable params: {non_trainable:,} (frozen embeddings)")
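The "King - Man + Woman ≈ Queen" analogy reduces to vector arithmetic plus nearest-neighbor search under cosine similarity. A toy sketch with hand-made 3-D vectors (real GloVe vectors are 50–300-dimensional and learned, not hand-set):

```python
import numpy as np

# Toy 3-D "embeddings": dimensions loosely encode (royalty, gender, person-ness)
vectors = {
    "king":  np.array([0.9,  0.8, 0.7]),
    "queen": np.array([0.9, -0.8, 0.7]),
    "man":   np.array([0.1,  0.8, 0.9]),
    "woman": np.array([0.1, -0.8, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land nearest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print("king - man + woman ≈", best)  # queen
```

With real pretrained vectors you would also exclude the three query words from the candidate set before taking the nearest neighbor.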

Text Preprocessing

TensorFlow's TextVectorization layer handles the complete text preprocessing pipeline — lowercasing, punctuation stripping, tokenization, vocabulary building, and sequence padding — all within the model graph. This ensures consistent preprocessing during both training and inference, eliminating train-serve skew.

import tensorflow as tf
import numpy as np

# TextVectorization — end-to-end text preprocessing
text_data = [
    "TensorFlow makes deep learning accessible",
    "RNNs process sequential data effectively",
    "LSTM networks handle long-term dependencies",
    "Natural language processing is fascinating",
    "Word embeddings capture semantic meaning"
]

# Create TextVectorization layer
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,          # Vocabulary size cap
    output_mode='int',        # Output integer indices
    output_sequence_length=10, # Pad/truncate to 10 tokens
    standardize='lower_and_strip_punctuation',  # Preprocessing
    split='whitespace'        # Tokenization strategy
)

# Build vocabulary from data (call adapt)
vectorizer.adapt(text_data)

# Inspect vocabulary
vocab = vectorizer.get_vocabulary()
print(f"Vocabulary size: {len(vocab)}")
print(f"First 15 tokens: {vocab[:15]}")
print(f"  '' = padding, '[UNK]' = unknown")

# Vectorize text
sample = tf.constant(["LSTM networks are powerful"])
encoded = vectorizer(sample)
print(f"\nInput:   'LSTM networks are powerful'")
print(f"Encoded: {encoded.numpy()}")

# Decode back (manually)
idx_to_word = {i: w for i, w in enumerate(vocab)}
decoded = [idx_to_word.get(idx, '?') for idx in encoded.numpy()[0] if idx > 0]
print(f"Decoded: {' '.join(decoded)}")

# Batch vectorization
batch_encoded = vectorizer(text_data)
print(f"\nBatch shape: {batch_encoded.shape}")  # (5, 10)
print("First sentence:", batch_encoded[0].numpy())

Padding Strategies

Sequences in a batch must have uniform length. Pre-padding (default) adds zeros at the beginning — better for RNNs since the last elements are closest to the output. Post-padding adds zeros at the end — more natural for attention-based models. The Masking layer or mask_zero=True in Embedding tells the model to skip padded positions.

import tensorflow as tf
import numpy as np

# Padding strategies demonstration
sequences = [
    [1, 2, 3],           # Length 3
    [4, 5, 6, 7, 8],    # Length 5
    [9, 10]              # Length 2
]

# Pre-padding (default) — zeros at the start
pre_padded = tf.keras.utils.pad_sequences(sequences, maxlen=6, padding='pre')
print("Pre-padded (default for RNNs):")
print(pre_padded)

# Post-padding — zeros at the end
post_padded = tf.keras.utils.pad_sequences(sequences, maxlen=6, padding='post')
print("\nPost-padded:")
print(post_padded)

# Truncation — when sequences exceed maxlen
long_seq = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
truncated_pre = tf.keras.utils.pad_sequences(long_seq, maxlen=5, truncating='pre')
truncated_post = tf.keras.utils.pad_sequences(long_seq, maxlen=5, truncating='post')
print(f"\nOriginal: {long_seq[0]}")
print(f"Truncate pre  (keep end):   {truncated_pre[0]}")
print(f"Truncate post (keep start): {truncated_post[0]}")

# Masking: tell model to ignore padding tokens
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(100, 32, mask_zero=True),  # Generates mask
    tf.keras.layers.LSTM(64),  # Respects mask — skips padded timesteps
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Alternative: explicit Masking layer
model_explicit = tf.keras.Sequential([
    tf.keras.layers.Embedding(100, 32),
    tf.keras.layers.Masking(mask_value=0.0),  # Mask where all features = 0
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Verify masking works
sample_input = tf.constant([[0, 0, 5, 12, 3]])  # Pre-padded
embedding_layer = model.layers[0]
output = embedding_layer(sample_input)
mask = embedding_layer.compute_mask(sample_input)
print(f"\nInput: {sample_input.numpy()}")
print(f"Mask:  {mask.numpy()}")  # [False, False, True, True, True]

Sentiment Analysis

Let's build a complete sentiment analysis pipeline using the IMDB movie reviews dataset — 50,000 reviews labeled as positive or negative. This end-to-end example demonstrates the tokenized text → Embedding → LSTM → Dense pipeline, achieving over 85% accuracy on binary classification.

import tensorflow as tf
import numpy as np

# Load IMDB dataset (25,000 train + 25,000 test reviews)
(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(
    num_words=10000  # Keep top 10,000 most frequent words
)

print(f"Training samples: {len(train_data)}")
print(f"Test samples:     {len(test_data)}")
print(f"Sample lengths:   {len(train_data[0])}, {len(train_data[1])}, {len(train_data[2])}")

# Decode a review to see raw text
word_index = tf.keras.datasets.imdb.get_word_index()
reverse_word_index = {v + 3: k for k, v in word_index.items()}
reverse_word_index[0] = '<PAD>'
reverse_word_index[1] = '<START>'
reverse_word_index[2] = '<UNK>'

decoded = ' '.join(reverse_word_index.get(i, '?') for i in train_data[0][:30])
print(f"\nFirst review (first 30 words): {decoded}...")
print(f"Label: {'Positive' if train_labels[0] == 1 else 'Negative'}")

# Pad sequences to uniform length
max_length = 256
train_padded = tf.keras.utils.pad_sequences(train_data, maxlen=max_length, padding='post')
test_padded = tf.keras.utils.pad_sequences(test_data, maxlen=max_length, padding='post')
print(f"\nPadded shape: {train_padded.shape}")  # (25000, 256)

# Build LSTM sentiment classifier
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Train the model
history = model.fit(
    train_padded, train_labels,
    epochs=5,
    batch_size=64,
    validation_split=0.2,
    verbose=1
)

# Evaluate on test set
test_loss, test_acc = model.evaluate(test_padded, test_labels, verbose=0)
print(f"\nTest accuracy: {test_acc:.4f}")
print(f"Test loss:     {test_loss:.4f}")

Comparing Architectures

Different architectures trade off speed vs accuracy for text classification. Simple models (Embedding + GlobalAveragePooling) train fast but miss sequential patterns. LSTM captures order but is slower. Bidirectional LSTM is most powerful but doubles parameters. For production, choose based on your latency and accuracy requirements.

import tensorflow as tf
import numpy as np

# Compare 3 architectures for sentiment analysis
vocab_size = 10000
embedding_dim = 128
max_length = 256

# Architecture 1: Simple (Embedding + GlobalAvgPool)
model_simple = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
], name='simple_avgpool')

# Architecture 2: LSTM
model_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.LSTM(64, dropout=0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
], name='lstm')

# Architecture 3: Bidirectional LSTM
model_bilstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
], name='bidirectional_lstm')

# Architecture 4: Conv1D + LSTM hybrid
model_conv_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
], name='conv1d_lstm')

# Compare parameter counts and expected performance
models = [model_simple, model_lstm, model_bilstm, model_conv_lstm]
print(f"{'Model':<25} {'Parameters':>12} {'Expected Acc':>14}")
print("-" * 55)
expected = {'simple_avgpool': '~86%', 'lstm': '~87%',
            'bidirectional_lstm': '~88%', 'conv1d_lstm': '~87%'}
for m in models:
    print(f"{m.name:<25} {m.count_params():>12,} {expected[m.name]:>14}")

Text Generation

Text generation with RNNs works by training a model to predict the next character (or word) given the preceding context. At inference time, we feed the model a seed sequence, sample the next token, append it to the input, and repeat. The temperature parameter controls randomness: low temperature produces conservative, repetitive text; high temperature produces creative but potentially incoherent output.

import tensorflow as tf
import numpy as np

# Character-level text generation
# Using a small text corpus for demonstration
text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them. To die, to sleep."""

# Step 1: Build character vocabulary
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
vocab_size = len(chars)

print(f"Text length: {len(text)} characters")
print(f"Unique chars: {vocab_size}")
print(f"Vocabulary: {''.join(chars)}")

# Step 2: Create training sequences (input → next char)
seq_length = 40
sequences = []
targets = []

text_encoded = [char_to_idx[c] for c in text]
for i in range(len(text_encoded) - seq_length):
    sequences.append(text_encoded[i:i + seq_length])
    targets.append(text_encoded[i + seq_length])

X = np.array(sequences)
y = np.array(targets)
print(f"\nTraining sequences: {X.shape}")  # (N, 40)
print(f"Targets: {y.shape}")

# Step 3: Build character-level RNN model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64, input_length=seq_length),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Step 4: Train (brief for demonstration)
model.fit(X, y, epochs=3, batch_size=64, verbose=1)

Sampling Strategies

The temperature parameter reshapes the softmax distribution before sampling. Given logits $z_i$, the temperature-scaled probability is $p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$. Temperature $T = 1.0$ gives standard softmax; $T < 1$ sharpens the distribution (more deterministic); $T > 1$ flattens it (more random).

import tensorflow as tf
import numpy as np

# Sampling strategies for text generation
def sample_with_temperature(logits, temperature=1.0):
    """Sample from logits with temperature scaling."""
    logits = logits / temperature
    # tf.random.categorical expects log-probabilities
    predicted_id = tf.random.categorical(
        tf.expand_dims(logits, 0), num_samples=1
    )[0, 0].numpy()
    return predicted_id

# Simulate model output logits for next character
np.random.seed(42)
logits = np.array([2.0, 1.5, 0.5, -0.5, -1.0, 0.8, 3.0, 0.1])
chars = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

# Compare different temperatures
print("Sampling with different temperatures:")
print(f"{'Temp':<8} {'Sampled (10 draws)':<30} {'Top prob':>10}")
print("-" * 50)

for temp in [0.2, 0.5, 1.0, 1.5, 2.0]:
    # Calculate probabilities at this temperature
    scaled = logits / temp
    probs = np.exp(scaled) / np.exp(scaled).sum()
    top_prob = probs.max()

    # Sample 10 characters
    samples = []
    for _ in range(10):
        idx = sample_with_temperature(tf.constant(logits, dtype=tf.float32), temp)
        samples.append(chars[idx])

    print(f"{temp:<8} {''.join(samples):<30} {top_prob:>10.4f}")

# Generate text function (using trained model)
def generate_text(model, start_string, char_to_idx, idx_to_char, length=200, temperature=0.8):
    """Generate text character by character."""
    seq_length = model.input_shape[1]
    input_eval = [char_to_idx.get(c, 0) for c in start_string]

    # Pad to seq_length
    if len(input_eval) < seq_length:
        input_eval = [0] * (seq_length - len(input_eval)) + input_eval
    else:
        input_eval = input_eval[-seq_length:]

    generated = list(start_string)

    for _ in range(length):
        input_tensor = tf.expand_dims(input_eval, 0)  # (1, seq_length)
        predictions = model(input_tensor, training=False)[0]  # (vocab_size,) softmax probs

        # Convert probabilities back to logits — sample_with_temperature
        # (and tf.random.categorical) expects logits, not probabilities
        logits = tf.math.log(predictions + 1e-9)
        predicted_id = sample_with_temperature(logits, temperature)
        generated.append(idx_to_char.get(predicted_id, '?'))

        # Slide window
        input_eval = input_eval[1:] + [predicted_id]

    return ''.join(generated)

print("\n\nGeneration function ready.")
print("Usage: generate_text(model, 'To be', char_to_idx, idx_to_char, length=200, temperature=0.8)")

Time Series Fundamentals

Time series forecasting requires careful data preparation. The key challenge is creating windows — sliding input/target pairs from a continuous signal — while respecting temporal ordering to avoid data leakage (never let the model see future data during training). TensorFlow's tf.data.Dataset.window() provides an efficient windowing mechanism.

import tensorflow as tf
import numpy as np

# Generate synthetic time series (trend + seasonality + noise)
np.random.seed(42)
time = np.arange(1000)
trend = 0.02 * time
seasonality = 10 * np.sin(2 * np.pi * time / 50)
noise = np.random.randn(1000) * 2
series = trend + seasonality + noise

print(f"Time series shape: {series.shape}")
print(f"Range: [{series.min():.1f}, {series.max():.1f}]")

# Windowing function: create (input, target) pairs
def create_windows(series, window_size, forecast_horizon=1):
    """Create sliding windows for time series prediction."""
    dataset = tf.data.Dataset.from_tensor_slices(series)

    # Window = input + target
    total_window = window_size + forecast_horizon
    dataset = dataset.window(total_window, shift=1, drop_remainder=True)

    # Flatten nested datasets into tensors
    dataset = dataset.flat_map(lambda w: w.batch(total_window))

    # Split into input and target
    dataset = dataset.map(lambda w: (w[:window_size], w[-forecast_horizon:]))

    return dataset

# Create windows: 30 timesteps in → predict next 1
window_size = 30
dataset = create_windows(series.astype(np.float32), window_size, forecast_horizon=1)

# Inspect one window
for x, y in dataset.take(1):
    print(f"\nWindow input shape:  {x.shape}")   # (30,)
    print(f"Window target shape: {y.shape}")      # (1,)
    print(f"Input (last 5):  {x[-5:].numpy()}")
    print(f"Target:          {y.numpy()}")

# Batch and prefetch for training
train_dataset = dataset.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)

for batch_x, batch_y in train_dataset.take(1):
    print(f"\nBatch input:  {batch_x.shape}")   # (32, 30)
    print(f"Batch target: {batch_y.shape}")      # (32, 1)

Temporal Train/Val/Test Splits

For time series, you must never shuffle before splitting. The validation and test sets must come from the future relative to the training set — otherwise, the model gets to "peek" at future patterns during training. A typical split is 70/15/15 in temporal order.

import tensorflow as tf
import numpy as np

# Temporal split — respecting time order (NO shuffling before split!)
np.random.seed(42)
time = np.arange(1000)
series = 0.02 * time + 10 * np.sin(2 * np.pi * time / 50) + np.random.randn(1000) * 2

# Split boundaries (temporal order preserved)
train_end = 700
val_end = 850

train_series = series[:train_end]
val_series = series[train_end:val_end]
test_series = series[val_end:]

print(f"Train: steps 0-{train_end-1}   ({len(train_series)} samples)")
print(f"Val:   steps {train_end}-{val_end-1} ({len(val_series)} samples)")
print(f"Test:  steps {val_end}-{len(series)-1}  ({len(test_series)} samples)")

# Normalize using ONLY training statistics (prevent data leakage!)
train_mean = train_series.mean()
train_std = train_series.std()

train_norm = (train_series - train_mean) / train_std
val_norm = (val_series - train_mean) / train_std    # Use train stats!
test_norm = (test_series - train_mean) / train_std  # Use train stats!

print(f"\nNormalization stats (from train only):")
print(f"  Mean: {train_mean:.4f}")
print(f"  Std:  {train_std:.4f}")
print(f"  Train range: [{train_norm.min():.2f}, {train_norm.max():.2f}]")
print(f"  Val range:   [{val_norm.min():.2f}, {val_norm.max():.2f}]")

# Create windowed datasets from each split
def make_windowed_dataset(series, window_size, batch_size, shuffle=True):
    """Create batched windowed dataset for time series."""
    series = tf.cast(series, tf.float32)
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    ds = ds.map(lambda w: (w[:-1], w[-1:]))
    if shuffle:
        ds = ds.shuffle(1000)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds

window_size = 30
batch_size = 32

train_ds = make_windowed_dataset(train_norm, window_size, batch_size, shuffle=True)
val_ds = make_windowed_dataset(val_norm, window_size, batch_size, shuffle=False)
test_ds = make_windowed_dataset(test_norm, window_size, batch_size, shuffle=False)

for x, y in train_ds.take(1):
    print(f"\nTrain batch: input={x.shape}, target={y.shape}")
Avoiding Data Leakage: Three critical rules for time series: (1) Split temporally — validation/test must be after training data, (2) Normalize using only training statistics, (3) Shuffle windows within a split (safe) but never shuffle across splits.

Time Series Forecasting

Now let's build actual forecasting models. We'll compare single-step prediction (predict the next value) with multi-step prediction (predict the next N values). The LSTM architecture naturally handles the sequential nature of time series data, and combining it with Conv1D creates a powerful hybrid that captures both local patterns and long-range dependencies.

import tensorflow as tf
import numpy as np

# Generate realistic time series data
np.random.seed(42)
n_steps = 1500
time = np.arange(n_steps)
series = (0.03 * time +
          10 * np.sin(2 * np.pi * time / 50) +
          5 * np.sin(2 * np.pi * time / 120) +
          np.random.randn(n_steps) * 2).astype(np.float32)

# Temporal split and normalize
train_end, val_end = 1000, 1250
train_mean, train_std = series[:train_end].mean(), series[:train_end].std()
series_norm = (series - train_mean) / train_std

# Create windowed datasets
window_size = 50

def make_dataset(data, window_size, batch_size=32, shuffle=True):
    ds = tf.data.Dataset.from_tensor_slices(data)
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    ds = ds.map(lambda w: (tf.expand_dims(w[:-1], -1), w[-1:]))  # Add feature dim
    if shuffle:
        ds = ds.shuffle(500)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

train_ds = make_dataset(series_norm[:train_end], window_size)
val_ds = make_dataset(series_norm[train_end:val_end], window_size, shuffle=False)

# LSTM forecasting model (single-step prediction)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window_size, 1)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss='mse',
    metrics=['mae']
)

model.summary()

# Train with early stopping
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True
)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=20,
    callbacks=[early_stop],
    verbose=1
)

print(f"\nBest val MAE: {min(history.history['val_mae']):.4f}")
print(f"Best val loss (MSE): {min(history.history['val_loss']):.4f}")

# Denormalize predictions for real-world MAE
denorm_mae = min(history.history['val_mae']) * train_std
print(f"Denormalized MAE: {denorm_mae:.2f} (original scale)")

Conv1D + LSTM Hybrid

Combining 1D convolutions with LSTM creates a powerful hybrid: Conv1D extracts local temporal patterns (similar to n-grams in text), then LSTM captures long-range dependencies across the conv feature maps. This often matches or beats a pure LSTM while training faster, because each MaxPooling1D layer halves the sequence length the LSTM must process.

import tensorflow as tf
import numpy as np

# Conv1D + LSTM hybrid model for time series
window_size = 50

# Model 1: Pure stacked LSTM
model_lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window_size, 1)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1)
], name='pure_lstm')

# Model 2: Conv1D + LSTM hybrid
model_hybrid = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window_size, 1)),
    # Conv1D extracts local patterns (like 5-step motifs)
    tf.keras.layers.Conv1D(64, kernel_size=5, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling1D(2),
    # LSTM captures long-range dependencies on conv features
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
], name='conv_lstm_hybrid')

# Model 3: Multi-step forecaster (predict next 5 values)
forecast_horizon = 5
model_multistep = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window_size, 1)),
    tf.keras.layers.Conv1D(64, 5, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(forecast_horizon)  # Predict 5 future values
], name='multistep_forecaster')

# Compare architectures
print(f"{'Model':<25} {'Parameters':>12} {'Output Shape':>15}")
print("-" * 55)
for m in [model_lstm, model_hybrid, model_multistep]:
    print(f"{m.name:<25} {m.count_params():>12,} {str(m.output_shape):>15}")

# Compile the hybrid model
model_hybrid.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss='huber',  # Robust to outliers
    metrics=['mae']
)

# Generate dummy data to verify model works
dummy_x = tf.random.normal([32, window_size, 1])
dummy_y = model_hybrid(dummy_x)
print(f"\nHybrid model: input={dummy_x.shape} → output={dummy_y.shape}")

Sequence-to-Sequence

Sequence-to-sequence (seq2seq) models map a variable-length input sequence to a variable-length output sequence. The classic architecture uses an encoder (reads input and compresses it into a context vector) and a decoder (generates the output sequence from the context). This architecture powers machine translation, text summarization, and dialogue systems.

Seq2Seq Encoder-Decoder Architecture
flowchart LR
    subgraph Encoder["Encoder (reads input)"]
        direction LR
        E1["LSTM"] --> E2["LSTM"] --> E3["LSTM"]
        I1["hola"] --> E1
        I2["mundo"] --> E2
        I3["<eos>"] --> E3
    end
    subgraph Context["Context Vector"]
        CV["[h, c]"]
    end
    subgraph Decoder["Decoder (generates output)"]
        direction LR
        D1["LSTM"] --> D2["LSTM"] --> D3["LSTM"]
        D1 --> O1["hello"]
        D2 --> O2["world"]
        D3 --> O3["<eos>"]
    end
    E3 --> CV --> D1
                            

Teacher forcing is a training technique where the decoder receives the ground-truth previous token as input (instead of its own prediction). This stabilizes training but creates a discrepancy between training (sees correct tokens) and inference (sees its own predictions). Scheduled sampling gradually transitions from teacher forcing to free-running during training.
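Scheduled sampling is usually implemented inside a custom training loop, but the core idea can be sketched at the token level. A simplified, hypothetical sketch (the helper `mix_decoder_inputs` and the linear decay schedule are ours, not a standard API):

```python
import numpy as np

# Scheduled sampling, sketched at the data level. With probability p the
# decoder input is the ground-truth previous token (teacher forcing); with
# probability 1-p it is the model's own prediction. Real implementations
# flip this coin per decoder step inside the training loop.
def mix_decoder_inputs(ground_truth, model_preds, p_teacher, rng):
    """Per-token mix of teacher-forced and model-predicted decoder inputs."""
    use_truth = rng.random(ground_truth.shape) < p_teacher
    return np.where(use_truth, ground_truth, model_preds)

rng = np.random.default_rng(0)
truth = np.array([[5, 4, 3, 2, 1]])
preds = np.array([[5, 9, 3, 7, 1]])  # hypothetical model outputs

# Decay teacher forcing from 1.0 (all ground truth) to 0.0 (all model output)
for epoch, p in enumerate(np.linspace(1.0, 0.0, 5)):
    mixed = mix_decoder_inputs(truth, preds, p, rng)
    print(f"epoch {epoch}: p_teacher={p:.2f} -> decoder input {mixed[0]}")
```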

import tensorflow as tf
import numpy as np

# Basic Seq2Seq model for number reversal (demonstration)
# Input: [1, 2, 3, 4, 5] → Output: [5, 4, 3, 2, 1]

# Hyperparameters
vocab_size = 20       # Numbers 0-19
embedding_dim = 32
latent_dim = 64
max_length = 10

# Encoder
encoder_inputs = tf.keras.layers.Input(shape=(max_length,), name='encoder_input')
encoder_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)(encoder_inputs)
encoder_lstm = tf.keras.layers.LSTM(latent_dim, return_state=True, name='encoder_lstm')
_, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]  # Context vector

# Decoder
decoder_inputs = tf.keras.layers.Input(shape=(max_length,), name='decoder_input')
decoder_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True, return_state=True, name='decoder_lstm'
)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(vocab_size, activation='softmax', name='output')
outputs = decoder_dense(decoder_outputs)

# Full model (training with teacher forcing)
model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

# Generate training data: reverse sequences
def generate_reverse_data(n_samples, seq_length, vocab_size):
    """Generate pairs of (sequence, reversed_sequence)."""
    # Input sequences (random integers, start from 1 to reserve 0 for padding)
    encoder_input = np.random.randint(1, vocab_size, size=(n_samples, seq_length))

    # Target = reversed input
    decoder_target = encoder_input[:, ::-1]

    # Decoder input = target shifted right (teacher forcing)
    # Start token = 0
    decoder_input = np.zeros_like(decoder_target)
    decoder_input[:, 1:] = decoder_target[:, :-1]

    return encoder_input, decoder_input, decoder_target

# Generate and train
seq_length = 5
enc_in, dec_in, dec_target = generate_reverse_data(10000, seq_length, vocab_size)

# Pad to max_length
enc_in_padded = tf.keras.utils.pad_sequences(enc_in, maxlen=max_length, padding='post')
dec_in_padded = tf.keras.utils.pad_sequences(dec_in, maxlen=max_length, padding='post')
dec_target_padded = tf.keras.utils.pad_sequences(dec_target, maxlen=max_length, padding='post')

print(f"\nEncoder input:  {enc_in_padded[0]}")
print(f"Decoder input:  {dec_in_padded[0]}")
print(f"Decoder target: {dec_target_padded[0]}")

# Train
history = model.fit(
    [enc_in_padded, dec_in_padded],
    np.expand_dims(dec_target_padded, -1),
    epochs=10,
    batch_size=64,
    validation_split=0.1,
    verbose=1
)

print(f"\nFinal accuracy: {history.history['accuracy'][-1]:.4f}")

Teacher Forcing & Attention Preview

The attention mechanism (covered in depth in Part 8) addresses the seq2seq bottleneck: compressing an entire input sequence into a single fixed-size vector loses information for long sequences. Attention allows the decoder to "look back" at all encoder states, focusing on relevant parts for each output step. Here's a simplified preview:

import tensorflow as tf
import numpy as np

# Simplified attention mechanism preview
# Full implementation in Part 8: Transformers & Attention

class SimpleAttention(tf.keras.layers.Layer):
    """Bahdanau-style additive attention (preview)."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # For encoder states
        self.W2 = tf.keras.layers.Dense(units)  # For decoder state
        self.V = tf.keras.layers.Dense(1)       # Score function

    def call(self, encoder_outputs, decoder_hidden):
        # encoder_outputs: (batch, seq_len, hidden_dim)
        # decoder_hidden: (batch, hidden_dim)

        # Expand decoder hidden for broadcasting
        decoder_expanded = tf.expand_dims(decoder_hidden, 1)  # (batch, 1, hidden)

        # Alignment scores
        score = self.V(tf.nn.tanh(
            self.W1(encoder_outputs) + self.W2(decoder_expanded)
        ))  # (batch, seq_len, 1)

        # Attention weights (softmax over sequence)
        attention_weights = tf.nn.softmax(score, axis=1)  # (batch, seq_len, 1)

        # Context vector (weighted sum of encoder outputs)
        context = tf.reduce_sum(attention_weights * encoder_outputs, axis=1)

        return context, tf.squeeze(attention_weights, -1)

# Demonstrate attention
batch_size = 2
seq_len = 8
hidden_dim = 64

encoder_outputs = tf.random.normal([batch_size, seq_len, hidden_dim])
decoder_hidden = tf.random.normal([batch_size, hidden_dim])

attention = SimpleAttention(units=32)
context, weights = attention(encoder_outputs, decoder_hidden)

print(f"Encoder outputs: {encoder_outputs.shape}")    # (2, 8, 64)
print(f"Decoder hidden:  {decoder_hidden.shape}")     # (2, 64)
print(f"Context vector:  {context.shape}")            # (2, 64)
print(f"Attention weights: {weights.shape}")          # (2, 8)
print(f"\nSample attention weights (sum=1):")
print(f"  {weights[0].numpy()}")
print(f"  Sum: {weights[0].numpy().sum():.4f}")

Practical Tips

Training RNNs effectively requires careful attention to gradient flow, variable-length sequence handling, and hardware optimization. These practical tips will help you avoid common pitfalls and get the best performance from your recurrent models.

import tensorflow as tf
import numpy as np

# Tip 1: Gradient Clipping — prevent exploding gradients
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50, 10)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1)
])

# Clip gradients by global norm (most common for RNNs)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3,
    clipnorm=1.0  # Clip if gradient norm > 1.0
)

# Alternative: clip by value
optimizer_clipvalue = tf.keras.optimizers.Adam(
    learning_rate=1e-3,
    clipvalue=0.5  # Clip each gradient element to [-0.5, 0.5]
)

model.compile(optimizer=optimizer, loss='mse')
print("Tip 1: Gradient clipping configured (clipnorm=1.0)")

# Tip 2: CuDNN-optimized LSTM requirements
# For GPU acceleration, LSTM must use:
#   activation='tanh', recurrent_activation='sigmoid'
#   No dropout on recurrent connections (use dropout, not recurrent_dropout)
#   unroll=False, use_bias=True

cudnn_lstm = tf.keras.layers.LSTM(
    128,
    activation='tanh',              # Required for CuDNN
    recurrent_activation='sigmoid', # Required for CuDNN
    dropout=0.2,                    # OK — applied to input
    recurrent_dropout=0.0,          # Must be 0 for CuDNN!
    unroll=False,                   # Must be False for CuDNN
    use_bias=True                   # Must be True for CuDNN
)
print("\nTip 2: CuDNN-optimized LSTM configured")
print("  ✓ activation='tanh'")
print("  ✓ recurrent_activation='sigmoid'")
print("  ✓ recurrent_dropout=0.0")
print("  ✓ unroll=False, use_bias=True")

Masking & Stateful RNNs

Masking tells the model to skip padded timesteps during computation — essential for variable-length sequences. Stateful RNNs maintain their hidden state between batches instead of resetting at each batch boundary — useful for very long sequences that don't fit in a single window.

import tensorflow as tf
import numpy as np

# Tip 3: Masking variable-length sequences
# Method A: mask_zero in Embedding
model_masked = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 64, mask_zero=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Method B: Explicit Masking layer (for non-embedding inputs)
model_explicit_mask = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 5)),
    tf.keras.layers.Masking(mask_value=0.0),  # Skip timesteps where ALL features = 0
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1)
])

# Verify masking propagation
sample = tf.constant([[0, 0, 5, 12, 0, 8, 0]])  # caution: mask_zero masks EVERY 0, even mid-sequence
emb = model_masked.layers[0]
mask = emb.compute_mask(sample)
print("Tip 3: Masking")
print(f"  Input:  {sample.numpy()}")
print(f"  Mask:   {mask.numpy()}")  # False where padded

# Tip 4: Stateful RNNs — state persists across batches
batch_size = 16  # Must be fixed for stateful

stateful_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, stateful=True, batch_input_shape=(batch_size, 50, 1)),
    tf.keras.layers.Dense(1)
])

# Manual state management
print("\nTip 4: Stateful RNN")
dummy_input = tf.random.normal([batch_size, 50, 1])

# Process first chunk
out1 = stateful_model(dummy_input)
state_after_chunk1 = stateful_model.layers[0].states[0].numpy().mean()
print(f"  State after chunk 1: mean={state_after_chunk1:.4f}")

# Process second chunk (state carries over!)
out2 = stateful_model(dummy_input)
state_after_chunk2 = stateful_model.layers[0].states[0].numpy().mean()
print(f"  State after chunk 2: mean={state_after_chunk2:.4f}")

# Reset state at epoch boundary
stateful_model.reset_states()
state_after_reset = stateful_model.layers[0].states[0].numpy().mean()
print(f"  State after reset:   mean={state_after_reset:.4f}")

# Tip 5: Learning rate scheduling for RNNs
print("\nTip 5: Recommended training recipe for RNNs:")
print("  1. Start with Adam(lr=1e-3, clipnorm=1.0)")
print("  2. Use ReduceLROnPlateau(patience=3, factor=0.5)")
print("  3. EarlyStopping(patience=7, restore_best_weights=True)")
print("  4. Batch size: 32-128 (larger = smoother gradients)")
print("  5. Sequence length: start short (50-100), increase if needed")
Performance Checklist: (1) Use clipnorm=1.0 to prevent gradient explosion, (2) Set recurrent_dropout=0.0 to enable CuDNN acceleration on GPU, (3) Use mask_zero=True or Masking layer for variable-length inputs, (4) Prefer Bidirectional for classification, unidirectional for generation, (5) Start with LSTM — try GRU if you need faster training with similar performance.
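Tip 5's recipe can be expressed as concrete objects. A sketch with the suggested values (the patience/factor numbers are the recommendations above, not universal defaults):

```python
import tensorflow as tf

# The Tip 5 recipe: Adam with gradient clipping, LR reduction on plateau,
# and early stopping that restores the best weights.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', patience=3, factor=0.5, min_lr=1e-5, verbose=1),
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=7, restore_best_weights=True),
]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])

# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
print(f"Callbacks: {[type(cb).__name__ for cb in callbacks]}")
```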

Next in the Series

In Part 8: Transformers & Attention, we'll move beyond recurrence to the architecture that revolutionized NLP — self-attention, multi-head attention, positional encoding, and building Transformer models from scratch in TensorFlow.