1. Unsupervised Learning and Representation
Unlike supervised learning, where labels guide the model, autoencoders learn entirely without labels. The goal is to discover a compressed representation — an encoding — that captures the essential features of the data. This is a foundation of generative modeling: if we can learn a good latent representation, we can generate new data from it.
The core architecture consists of three components: an encoder that compresses input to a lower-dimensional latent space, a bottleneck (the compressed representation), and a decoder that reconstructs the original input from the bottleneck.
flowchart LR
A[Input x] --> B[Encoder]
B --> C[Bottleneck z]
C --> D[Decoder]
D --> E[Reconstruction x-hat]
E --> F[Loss: MSE between x and x-hat]
F -.->|Backprop| B
F -.->|Backprop| D
The encoder maps high-dimensional input to a low-dimensional latent code, while the decoder attempts to reconstruct the original from this code. The loss function measures reconstruction quality — typically Mean Squared Error (MSE) between input and output.
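To make the objective concrete, here is the reconstruction loss on a tiny made-up batch (the values are illustrative only, not from the dataset used later):

```python
import numpy as np

# Illustrative values only: a batch of two 2-D inputs and an imperfect reconstruction
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.1, 1.9], [2.8, 4.2]])

# MSE averages the squared error over every element in the batch
mse = np.mean((x - x_hat) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")  # -> 0.0250
```

Driving this number toward zero is the entire training signal — no labels are involved.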
2. Building a Basic Autoencoder
Let us build an autoencoder from scratch using only NumPy. Our encoder compresses input to a lower dimension, and the decoder reconstructs it. The loss is the reconstruction error (MSE between input and output).
Dense Layer Implementation
First, we need a reusable dense layer with forward and backward passes:
import numpy as np
class DenseLayer:
"""Fully connected layer with Xavier initialization."""
def __init__(self, input_dim, output_dim):
# Xavier initialization for stable gradients
scale = np.sqrt(2.0 / (input_dim + output_dim))
self.weights = np.random.randn(input_dim, output_dim) * scale
self.bias = np.zeros((1, output_dim))
self.input_cache = None
def forward(self, x):
self.input_cache = x
return x @ self.weights + self.bias
def backward(self, grad_output, lr=0.01):
# Compute gradients
grad_weights = self.input_cache.T @ grad_output
grad_bias = np.sum(grad_output, axis=0, keepdims=True)
grad_input = grad_output @ self.weights.T
# Update parameters
self.weights -= lr * grad_weights
self.bias -= lr * grad_bias
return grad_input
# Test the layer
np.random.seed(42)
layer = DenseLayer(4, 2)
x = np.random.randn(3, 4) # batch of 3, input dim 4
output = layer.forward(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
print("Output:\n", output)
Complete Autoencoder
Now we build a complete autoencoder that compresses 2D data to 1D and reconstructs it:
import numpy as np
class DenseLayer:
"""Fully connected layer with Xavier initialization."""
def __init__(self, input_dim, output_dim):
scale = np.sqrt(2.0 / (input_dim + output_dim))
self.weights = np.random.randn(input_dim, output_dim) * scale
self.bias = np.zeros((1, output_dim))
self.input_cache = None
def forward(self, x):
self.input_cache = x
return x @ self.weights + self.bias
def backward(self, grad_output, lr=0.01):
grad_weights = self.input_cache.T @ grad_output
grad_bias = np.sum(grad_output, axis=0, keepdims=True)
grad_input = grad_output @ self.weights.T
self.weights -= lr * grad_weights
self.bias -= lr * grad_bias
return grad_input
def relu(x):
return np.maximum(0, x)
def relu_derivative(x):
return (x > 0).astype(float)
class Autoencoder:
"""Basic autoencoder: 2D input -> 1D bottleneck -> 2D output."""
def __init__(self, input_dim=2, hidden_dim=4, latent_dim=1):
# Encoder layers
self.enc1 = DenseLayer(input_dim, hidden_dim)
self.enc2 = DenseLayer(hidden_dim, latent_dim)
# Decoder layers
self.dec1 = DenseLayer(latent_dim, hidden_dim)
self.dec2 = DenseLayer(hidden_dim, input_dim)
def encode(self, x):
h = relu(self.enc1.forward(x))
self.enc1_out = h
z = self.enc2.forward(h)
return z
def decode(self, z):
h = relu(self.dec1.forward(z))
self.dec1_out = h
x_hat = self.dec2.forward(h)
return x_hat
def forward(self, x):
z = self.encode(x)
x_hat = self.decode(z)
return x_hat
def train_step(self, x, lr=0.001):
# Forward pass
x_hat = self.forward(x)
        # MSE loss gradient: d(loss)/d(x_hat) is proportional to (x_hat - x);
        # we divide by batch size and fold the remaining constant into the lr
        n = x.shape[0]
        grad = 2.0 * (x_hat - x) / n
loss = np.mean((x_hat - x) ** 2)
# Backward through decoder
grad = self.dec2.backward(grad, lr)
grad = grad * relu_derivative(self.dec1_out)
grad = self.dec1.backward(grad, lr)
# Backward through encoder
grad = self.enc2.backward(grad, lr)
grad = grad * relu_derivative(self.enc1_out)
grad = self.enc1.backward(grad, lr)
return loss
# Generate synthetic 2D data (circle pattern)
np.random.seed(42)
theta = np.linspace(0, 2 * np.pi, 200)
data = np.column_stack([np.cos(theta), np.sin(theta)])
data += np.random.randn(*data.shape) * 0.05 # Add slight noise
# Train autoencoder
ae = Autoencoder(input_dim=2, hidden_dim=4, latent_dim=1)
losses = []
for epoch in range(500):
loss = ae.train_step(data, lr=0.005)
losses.append(loss)
if (epoch + 1) % 100 == 0:
print(f"Epoch {epoch+1}, Loss: {loss:.6f}")
# Reconstruct and compare
reconstructed = ae.forward(data)
print(f"\nFinal reconstruction error: {np.mean((data - reconstructed)**2):.6f}")
print(f"Original data sample: {data[0]}")
print(f"Reconstructed sample: {reconstructed[0]}")
Bottleneck Dimension Effect
Try changing latent_dim from 1 to 2. With a 2D bottleneck for 2D data, the autoencoder can achieve near-perfect reconstruction. With 1D, it must learn the most important axis of variation — a nonlinear analogue of PCA finding the first principal component.
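As a point of comparison — a sketch, not part of the autoencoder itself — here is what a purely linear 1D compression (PCA's first principal component) does on the same noisy circle data. The residual error shows how much a single axis of variation must discard:

```python
import numpy as np

# Same synthetic circle data as above
np.random.seed(42)
theta = np.linspace(0, 2 * np.pi, 200)
data = np.column_stack([np.cos(theta), np.sin(theta)])
data += np.random.randn(*data.shape) * 0.05

# First principal component via SVD of the centered data
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]  # top principal direction, shape (2,)

# Project onto the 1-D subspace spanned by pc1 and measure what is lost
proj = centered @ pc1[:, None] @ pc1[None, :]
pca_error = np.mean((centered - proj) ** 2)
print(f"PCA (1 component) reconstruction MSE: {pca_error:.4f}")
```

Because a circle has no dominant linear axis, PCA discards roughly half the variance; a nonlinear 1D autoencoder can do better by bending its latent axis around the circle.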
3. Denoising Autoencoders
A denoising autoencoder (DAE) receives corrupted input but is trained to output the clean version. This forces the network to learn robust features rather than simply memorizing an identity mapping. The corruption acts as a regularizer, preventing the autoencoder from learning trivial solutions.
import numpy as np
class DenseLayer:
"""Fully connected layer."""
def __init__(self, input_dim, output_dim):
scale = np.sqrt(2.0 / (input_dim + output_dim))
self.weights = np.random.randn(input_dim, output_dim) * scale
self.bias = np.zeros((1, output_dim))
self.input_cache = None
def forward(self, x):
self.input_cache = x
return x @ self.weights + self.bias
def backward(self, grad_output, lr=0.01):
grad_weights = self.input_cache.T @ grad_output
grad_bias = np.sum(grad_output, axis=0, keepdims=True)
grad_input = grad_output @ self.weights.T
self.weights -= lr * grad_weights
self.bias -= lr * grad_bias
return grad_input
def relu(x):
return np.maximum(0, x)
def relu_derivative(x):
return (x > 0).astype(float)
class DenoisingAutoencoder:
"""Denoising autoencoder that learns to remove Gaussian noise."""
def __init__(self, input_dim=2, hidden_dim=8, latent_dim=2):
self.enc1 = DenseLayer(input_dim, hidden_dim)
self.enc2 = DenseLayer(hidden_dim, latent_dim)
self.dec1 = DenseLayer(latent_dim, hidden_dim)
self.dec2 = DenseLayer(hidden_dim, input_dim)
def forward(self, x):
h1 = relu(self.enc1.forward(x))
self.h1 = h1
z = self.enc2.forward(h1)
h2 = relu(self.dec1.forward(z))
self.h2 = h2
x_hat = self.dec2.forward(h2)
return x_hat
def train_step(self, x_noisy, x_clean, lr=0.001):
x_hat = self.forward(x_noisy)
n = x_clean.shape[0]
loss = np.mean((x_hat - x_clean) ** 2)
grad = 2.0 * (x_hat - x_clean) / n
grad = self.dec2.backward(grad, lr)
grad = grad * relu_derivative(self.h2)
grad = self.dec1.backward(grad, lr)
grad = self.enc2.backward(grad, lr)
grad = grad * relu_derivative(self.h1)
grad = self.enc1.backward(grad, lr)
return loss
# Generate clean signal (sine wave samples)
np.random.seed(42)
t = np.linspace(0, 4 * np.pi, 300)
clean_data = np.column_stack([t / (4 * np.pi), np.sin(t)])
# Add Gaussian noise
noise_level = 0.3
noisy_data = clean_data + np.random.randn(*clean_data.shape) * noise_level
# Train denoising autoencoder
dae = DenoisingAutoencoder(input_dim=2, hidden_dim=8, latent_dim=2)
for epoch in range(1000):
# Each epoch: add fresh noise for variety
noise = np.random.randn(*clean_data.shape) * noise_level
x_noisy = clean_data + noise
loss = dae.train_step(x_noisy, clean_data, lr=0.002)
if (epoch + 1) % 200 == 0:
print(f"Epoch {epoch+1}, Denoising Loss: {loss:.6f}")
# Test: denoise the noisy data
denoised = dae.forward(noisy_data)
noise_error = np.mean((noisy_data - clean_data) ** 2)
denoised_error = np.mean((denoised - clean_data) ** 2)
print(f"\nNoisy data MSE from clean: {noise_error:.6f}")
print(f"Denoised data MSE from clean: {denoised_error:.6f}")
print(f"Noise reduction ratio: {noise_error / denoised_error:.2f}x")
4. Variational Autoencoders (VAE)
Standard autoencoders encode each input to a single point in latent space. Variational autoencoders (VAEs) instead encode to a distribution — specifically, a Gaussian parameterized by mean and variance. This enables smooth interpolation and generation of new data by sampling from the latent space.
The Reparameterization Trick
The key innovation of VAEs is the reparameterization trick. Instead of sampling directly from the learned distribution (which would block gradient flow), we sample noise from a standard normal and transform it:
$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$
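The trick in isolation (a standalone sketch, separate from the VAE class below): the randomness lives entirely in epsilon, so the sample still follows the intended Gaussian while gradients can flow through mu and sigma.

```python
import numpy as np

np.random.seed(1)
mu, sigma = 2.0, 0.5

# epsilon carries all the randomness and has no learnable parameters
eps = np.random.randn(100_000)

# z is a deterministic, differentiable function of mu and sigma
z = mu + sigma * eps
print(f"sample mean: {z.mean():.3f}, sample std: {z.std():.3f}")
```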
The total VAE loss combines reconstruction quality with a regularization term that keeps the latent distribution close to a standard normal:
$$\mathcal{L} = \mathcal{L}_{recon} + D_{KL}(q(z|x) \| p(z))$$
Where the KL divergence for Gaussians has a closed-form solution:
$$D_{KL} = -\frac{1}{2} \sum_{j=1}^{J} (1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2)$$
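As a sanity check of the closed form (an illustrative aside, not part of the VAE implementation), we can compare it against a Monte Carlo estimate of E_q[log q(z) − log p(z)] for a single 1D Gaussian:

```python
import numpy as np

np.random.seed(0)
mu, sigma = 0.5, 1.5
log_var = np.log(sigma ** 2)

# Closed-form KL between N(mu, sigma^2) and N(0, 1)
kl_closed = -0.5 * (1 + log_var - mu**2 - sigma**2)

# Monte Carlo estimate: average log q(z) - log p(z) over samples z ~ q
z = mu + sigma * np.random.randn(200_000)
log_q = -0.5 * (np.log(2 * np.pi * sigma**2) + ((z - mu) / sigma) ** 2)
log_p = -0.5 * (np.log(2 * np.pi) + z ** 2)
kl_mc = np.mean(log_q - log_p)

print(f"Closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")
```

The two values agree to a few decimal places, which is why the VAE can use the cheap closed form instead of sampling.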
import numpy as np
class DenseLayer:
"""Fully connected layer."""
def __init__(self, input_dim, output_dim):
scale = np.sqrt(2.0 / (input_dim + output_dim))
self.weights = np.random.randn(input_dim, output_dim) * scale
self.bias = np.zeros((1, output_dim))
self.input_cache = None
def forward(self, x):
self.input_cache = x
return x @ self.weights + self.bias
def backward(self, grad_output, lr=0.01):
grad_weights = self.input_cache.T @ grad_output
grad_bias = np.sum(grad_output, axis=0, keepdims=True)
grad_input = grad_output @ self.weights.T
self.weights -= lr * grad_weights
self.bias -= lr * grad_bias
return grad_input
def relu(x):
return np.maximum(0, x)
def relu_derivative(x):
return (x > 0).astype(float)
class VAE:
"""Variational Autoencoder with reparameterization trick."""
def __init__(self, input_dim=2, hidden_dim=8, latent_dim=2):
self.latent_dim = latent_dim
# Encoder
self.enc1 = DenseLayer(input_dim, hidden_dim)
self.enc_mu = DenseLayer(hidden_dim, latent_dim)
self.enc_logvar = DenseLayer(hidden_dim, latent_dim)
# Decoder
self.dec1 = DenseLayer(latent_dim, hidden_dim)
self.dec2 = DenseLayer(hidden_dim, input_dim)
def encode(self, x):
h = relu(self.enc1.forward(x))
self.h_enc = h
mu = self.enc_mu.forward(h)
log_var = self.enc_logvar.forward(h)
return mu, log_var
def reparameterize(self, mu, log_var):
"""Sample z = mu + sigma * epsilon (reparameterization trick)."""
std = np.exp(0.5 * log_var)
epsilon = np.random.randn(*mu.shape)
z = mu + std * epsilon
self.std = std
self.epsilon = epsilon
return z
def decode(self, z):
h = relu(self.dec1.forward(z))
self.h_dec = h
x_hat = self.dec2.forward(h)
return x_hat
def train_step(self, x, lr=0.001, kl_weight=0.1):
n = x.shape[0]
# Forward pass
mu, log_var = self.encode(x)
z = self.reparameterize(mu, log_var)
x_hat = self.decode(z)
# Losses
recon_loss = np.mean((x_hat - x) ** 2)
kl_loss = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
total_loss = recon_loss + kl_weight * kl_loss
# Backward: reconstruction gradient
grad = 2.0 * (x_hat - x) / n
# Through decoder
grad = self.dec2.backward(grad, lr)
grad = grad * relu_derivative(self.h_dec)
grad_z = self.dec1.backward(grad, lr)
        # KL gradients for mu and log_var (kl_loss averages over n * latent_dim elements)
        grad_mu_kl = mu / (n * self.latent_dim)
        grad_logvar_kl = 0.5 * (np.exp(log_var) - 1) / (n * self.latent_dim)
        # Through reparameterization: dz/dmu = 1, dz/dlog_var = 0.5 * std * epsilon
        grad_mu = grad_z + kl_weight * grad_mu_kl
        grad_logvar = grad_z * self.epsilon * 0.5 * self.std + kl_weight * grad_logvar_kl
        # Through encoder heads; backward returns the gradient w.r.t. the layer's
        # input (computed before the weight update), so summing the two paths is safe
        grad_h = (self.enc_mu.backward(grad_mu, lr) +
                  self.enc_logvar.backward(grad_logvar, lr))
        grad_h = grad_h * relu_derivative(self.h_enc)
        self.enc1.backward(grad_h, lr)
return total_loss, recon_loss, kl_loss
def generate(self, n_samples=5):
"""Generate new data by sampling from latent space."""
z = np.random.randn(n_samples, self.latent_dim)
return self.decode(z)
# Train VAE on 2D Gaussian mixture data
np.random.seed(42)
n_per_cluster = 100
cluster1 = np.random.randn(n_per_cluster, 2) * 0.3 + np.array([1, 1])
cluster2 = np.random.randn(n_per_cluster, 2) * 0.3 + np.array([-1, -1])
cluster3 = np.random.randn(n_per_cluster, 2) * 0.3 + np.array([1, -1])
data = np.vstack([cluster1, cluster2, cluster3])
vae = VAE(input_dim=2, hidden_dim=16, latent_dim=2)
for epoch in range(1500):
total, recon, kl = vae.train_step(data, lr=0.001, kl_weight=0.05)
if (epoch + 1) % 300 == 0:
print(f"Epoch {epoch+1} | Total: {total:.4f} | Recon: {recon:.4f} | KL: {kl:.4f}")
# Generate new samples from the learned distribution
generated = vae.generate(n_samples=10)
print(f"\nGenerated samples (from random latent vectors):")
for i, sample in enumerate(generated[:5]):
print(f" Sample {i+1}: [{sample[0]:.3f}, {sample[1]:.3f}]")
KL Weight Annealing
Try varying kl_weight from 0.01 to 1.0. Low values give good reconstruction but poor generation (the latent space has gaps); high values give a smooth latent space but blurry reconstructions. This is the fundamental VAE tradeoff — balancing fidelity with generative quality. In practice, KL annealing — starting kl_weight near zero and gradually increasing it during training — is a common way to get the best of both.
5. The Adversarial Game: GAN Fundamentals
Generative Adversarial Networks (GANs) take a completely different approach to generation. Instead of learning a reconstruction objective, two networks compete in a minimax game:
- Generator (G): Takes random noise and transforms it into fake data that looks real
- Discriminator (D): Tries to distinguish real data from the generator’s fakes
The mathematical formulation of this adversarial game is:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
At Nash equilibrium, the generator produces data indistinguishable from real data, and the discriminator outputs 0.5 for everything (it cannot tell the difference).
flowchart TD
Z[Random Noise z] --> G[Generator G]
G --> FAKE[Fake Data]
REAL[Real Data x] --> D[Discriminator D]
FAKE --> D
D --> LOSS_D[D Loss: classify real vs fake]
LOSS_D -->|Update D weights| D
D --> LOSS_G[G Loss: fool D]
LOSS_G -->|Update G weights| G
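The equilibrium claim can be checked directly: for a fixed generator, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)). This small sketch (separate from the trained GAN below) evaluates D* for two 1D Gaussians:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-3, 3, 7)
p_data = gauss_pdf(x, 0.0, 1.0)

# Mismatched generator: D* moves away from 0.5 wherever the densities differ
p_g = gauss_pdf(x, 1.0, 1.0)
d_star = p_data / (p_data + p_g)
print("D* with mismatched G:", np.round(d_star, 3))

# Perfect generator (p_g = p_data): D* is exactly 0.5 everywhere
d_star_eq = p_data / (p_data + p_data)
print("D* with perfect G:   ", np.round(d_star_eq, 3))
```

When the generator matches the data distribution, the best possible discriminator can do no better than a coin flip.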
6. Building a GAN from Scratch
We will implement a complete GAN that learns to generate samples from a 1D Gaussian distribution. The generator takes uniform random noise and transforms it to match the target distribution, while the discriminator learns to separate real from generated samples.
import numpy as np
def sigmoid(x):
x = np.clip(x, -500, 500)
return 1.0 / (1.0 + np.exp(-x))
def relu(x):
return np.maximum(0, x)
def relu_derivative(x):
return (x > 0).astype(float)
class GANLayer:
"""Dense layer for GAN networks."""
def __init__(self, input_dim, output_dim):
self.weights = np.random.randn(input_dim, output_dim) * 0.1
self.bias = np.zeros((1, output_dim))
self.input_cache = None
def forward(self, x):
self.input_cache = x
return x @ self.weights + self.bias
def backward(self, grad_output, lr):
grad_weights = self.input_cache.T @ grad_output
grad_bias = np.sum(grad_output, axis=0, keepdims=True)
grad_input = grad_output @ self.weights.T
self.weights -= lr * grad_weights
self.bias -= lr * grad_bias
return grad_input
class Generator:
"""Generator: maps noise to data space."""
def __init__(self, noise_dim=1, hidden_dim=16, output_dim=1):
self.layer1 = GANLayer(noise_dim, hidden_dim)
self.layer2 = GANLayer(hidden_dim, hidden_dim)
self.layer3 = GANLayer(hidden_dim, output_dim)
def forward(self, z):
h1 = relu(self.layer1.forward(z))
self.h1 = h1
h2 = relu(self.layer2.forward(h1))
self.h2 = h2
out = self.layer3.forward(h2)
return out
def backward(self, grad, lr):
grad = self.layer3.backward(grad, lr)
grad = grad * relu_derivative(self.h2)
grad = self.layer2.backward(grad, lr)
grad = grad * relu_derivative(self.h1)
grad = self.layer1.backward(grad, lr)
class Discriminator:
"""Discriminator: classifies real vs fake."""
def __init__(self, input_dim=1, hidden_dim=16):
self.layer1 = GANLayer(input_dim, hidden_dim)
self.layer2 = GANLayer(hidden_dim, hidden_dim)
self.layer3 = GANLayer(hidden_dim, 1)
def forward(self, x):
h1 = relu(self.layer1.forward(x))
self.h1 = h1
h2 = relu(self.layer2.forward(h1))
self.h2 = h2
logit = self.layer3.forward(h2)
out = sigmoid(logit)
self.logit = logit
return out
def backward(self, grad, lr):
# Gradient through sigmoid
s = sigmoid(self.logit)
grad = grad * s * (1 - s)
grad = self.layer3.backward(grad, lr)
grad = grad * relu_derivative(self.h2)
grad = self.layer2.backward(grad, lr)
grad = grad * relu_derivative(self.h1)
grad = self.layer1.backward(grad, lr)
return grad
# Training setup
np.random.seed(42)
real_mean = 3.0
real_std = 0.5
G = Generator(noise_dim=1, hidden_dim=16, output_dim=1)
D = Discriminator(input_dim=1, hidden_dim=16)
batch_size = 64
lr_d = 0.001
lr_g = 0.001
print("Training GAN to learn N(3.0, 0.5) distribution...")
print("-" * 50)
for epoch in range(2000):
# === Train Discriminator ===
# Real data from target distribution
real_data = np.random.randn(batch_size, 1) * real_std + real_mean
# Fake data from generator
noise = np.random.randn(batch_size, 1)
fake_data = G.forward(noise)
# D scores
d_real = D.forward(real_data)
# Binary cross-entropy gradient for real (label=1)
grad_real = -(1.0 / (d_real + 1e-8)) / batch_size
D.backward(grad_real, lr_d)
d_fake = D.forward(fake_data)
# Binary cross-entropy gradient for fake (label=0)
grad_fake = (1.0 / (1.0 - d_fake + 1e-8)) / batch_size
D.backward(grad_fake, lr_d)
# === Train Generator ===
noise = np.random.randn(batch_size, 1)
fake_data = G.forward(noise)
d_fake_for_g = D.forward(fake_data)
# Generator wants D to output 1 for fakes
grad_g_d = -(1.0 / (d_fake_for_g + 1e-8)) / batch_size
    # Backprop through D manually - D's weights stay frozen here; we only
    # need the gradient at G's output (dL/d(fake_data))
    s = sigmoid(D.logit)
    grad_g_out = grad_g_d * s * (1 - s)            # through D's sigmoid
    grad_g_out = grad_g_out @ D.layer3.weights.T
    grad_g_out = grad_g_out * relu_derivative(D.h2)
    grad_g_out = grad_g_out @ D.layer2.weights.T
    grad_g_out = grad_g_out * relu_derivative(D.h1)
    grad_g_out = grad_g_out @ D.layer1.weights.T
    G.backward(grad_g_out, lr_g)
if (epoch + 1) % 400 == 0:
test_noise = np.random.randn(1000, 1)
generated = G.forward(test_noise)
gen_mean = np.mean(generated)
gen_std = np.std(generated)
print(f"Epoch {epoch+1} | Gen mean: {gen_mean:.3f} (target: {real_mean}) | "
f"Gen std: {gen_std:.3f} (target: {real_std})")
# Final evaluation
test_noise = np.random.randn(5000, 1)
final_generated = G.forward(test_noise)
print(f"\nFinal Generator Statistics:")
print(f" Mean: {np.mean(final_generated):.4f} (target: {real_mean})")
print(f" Std: {np.std(final_generated):.4f} (target: {real_std})")
7. GAN Training Challenges
GAN training is notoriously unstable. Three major challenges plague practitioners:
Mode Collapse
The generator finds a single output that fools the discriminator and stops exploring other modes of the data distribution. Instead of generating diverse samples, it produces the same (or very similar) outputs repeatedly.
Vanishing Gradients
When the discriminator becomes too good, it outputs values very close to 0 for all generator outputs. The gradient signal to the generator becomes extremely small, halting learning.
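This is easy to see numerically. With respect to the discriminator's logit a (where D = sigmoid(a)), the gradient of the original minimax generator loss log(1 − D(G(z))) is −sigmoid(a), which vanishes exactly when D confidently rejects fakes. The widely used non-saturating alternative, −log D(G(z)), keeps a strong gradient in that regime (a quick sketch, not part of the GAN code above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Gradients w.r.t. the discriminator logit a (D = sigmoid(a)):
#   saturating loss      log(1 - D):  d/da = -sigmoid(a)      -> ~0 when D ~ 0
#   non-saturating loss  -log(D):     d/da = sigmoid(a) - 1   -> ~ -1 when D ~ 0
logits = np.array([-6.0, -3.0, 0.0, 3.0])  # confident fake ... confident real
grad_saturating = -sigmoid(logits)
grad_non_saturating = sigmoid(logits) - 1.0
for a, gs, gn in zip(logits, grad_saturating, grad_non_saturating):
    print(f"logit={a:5.1f}  D={sigmoid(a):.4f}  "
          f"saturating={gs:8.4f}  non-saturating={gn:8.4f}")
```

At a logit of −6 the saturating loss delivers a gradient of about −0.002 while the non-saturating loss still delivers about −1, which is why most practical implementations use the latter.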
Training Instability
The competing objectives can cause oscillation rather than convergence. One network may overpower the other, breaking the delicate balance needed for learning. Common causes include:
- Training D too many steps per G step — D dominates, G gradients vanish
- Learning rate too high — causes oscillation between G and D
- Generator not expressive enough — cannot represent the full data distribution
- No gradient clipping — exploding gradients destabilize training
import numpy as np
def sigmoid(x):
x = np.clip(x, -500, 500)
return 1.0 / (1.0 + np.exp(-x))
class SimpleGenerator:
"""Minimal generator for demonstrating mode collapse."""
def __init__(self):
self.w1 = np.random.randn(1, 8) * 0.1
self.b1 = np.zeros((1, 8))
self.w2 = np.random.randn(8, 1) * 0.1
self.b2 = np.zeros((1, 1))
def forward(self, z):
self.z = z
self.h = np.maximum(0, z @ self.w1 + self.b1)
return self.h @ self.w2 + self.b2
class SimpleDiscriminator:
"""Minimal discriminator."""
def __init__(self):
self.w1 = np.random.randn(1, 8) * 0.1
self.b1 = np.zeros((1, 8))
self.w2 = np.random.randn(8, 1) * 0.1
self.b2 = np.zeros((1, 1))
def forward(self, x):
self.h = np.maximum(0, x @ self.w1 + self.b1)
logit = self.h @ self.w2 + self.b2
return sigmoid(logit)
# Demonstrate mode collapse with bimodal target
np.random.seed(42)
# Target: bimodal distribution (two peaks at -2 and +2)
def sample_bimodal(n):
choices = np.random.choice([-2.0, 2.0], size=(n, 1))
return choices + np.random.randn(n, 1) * 0.2
# Train without label smoothing (prone to mode collapse)
G_no_smooth = SimpleGenerator()
D_no_smooth = SimpleDiscriminator()
print("=== Without Label Smoothing (Mode Collapse Risk) ===")
for epoch in range(500):
real = sample_bimodal(32)
noise = np.random.randn(32, 1)
fake = G_no_smooth.forward(noise)
    # Heavily simplified "training" for demonstration: nudge G's output toward
    # the mean of the real batch. Chasing a single statistic like this is
    # exactly how a collapsed generator behaves - the two modes average to ~0,
    # so G settles between the peaks instead of covering them
    target_direction = np.mean(real) - np.mean(fake)
    G_no_smooth.w2 += 0.001 * target_direction
    G_no_smooth.b2 += 0.0005 * target_direction
# Check for mode collapse
test_noise = np.random.randn(1000, 1)
generated_no_smooth = G_no_smooth.forward(test_noise)
print(f"Generated mean: {np.mean(generated_no_smooth):.3f}")
print(f"Generated std: {np.std(generated_no_smooth):.3f}")
print(f"All samples near single mode: {np.std(generated_no_smooth) < 0.5}")
# Train WITH label smoothing (reduces mode collapse)
print("\n=== With Label Smoothing (More Stable) ===")
G_smooth = SimpleGenerator()
np.random.seed(123)
for epoch in range(500):
real = sample_bimodal(32)
noise = np.random.randn(32, 1)
fake = G_smooth.forward(noise)
    # In a real GAN, label smoothing trains D against targets of 0.9 (real)
    # and 0.1 (fake) so D never becomes fully confident. This toy update has
    # no D loss, so we emulate the stabilizing effect by adding noise to the
    # update direction, which preserves output diversity
    diversity_noise = np.random.randn() * 0.5
    target_direction = np.mean(real) - np.mean(fake) + diversity_noise
    G_smooth.w2 += 0.001 * target_direction
    G_smooth.b2 += 0.0005 * target_direction
generated_smooth = G_smooth.forward(test_noise)
print(f"Generated mean: {np.mean(generated_smooth):.3f}")
print(f"Generated std: {np.std(generated_smooth):.3f}")
print(f"Better diversity (std > 0.5): {np.std(generated_smooth) > 0.5}")
print("\n=== Solutions Summary ===")
print("1. Label Smoothing: Use 0.9 for real, 0.1 for fake labels")
print("2. Feature Matching: Match intermediate layer statistics")
print("3. Wasserstein Distance: Use Earth Mover distance instead of BCE")
print("4. Spectral Normalization: Constrain D Lipschitz constant")
Wasserstein Distance (Conceptual)
The Wasserstein GAN (WGAN) replaces the binary cross-entropy loss with the Earth Mover’s distance. This provides more meaningful gradients even when the discriminator (called a "critic" in WGAN) performs well, solving the vanishing gradient problem:
import numpy as np
# Wasserstein loss concept demonstration
# Instead of log probabilities, use raw scores (no sigmoid)
def wasserstein_d_loss(d_real_scores, d_fake_scores):
"""Critic loss: maximize E[D(real)] - E[D(fake)]."""
return -(np.mean(d_real_scores) - np.mean(d_fake_scores))
def wasserstein_g_loss(d_fake_scores):
"""Generator loss: maximize E[D(fake)] = minimize -E[D(fake)]."""
return -np.mean(d_fake_scores)
def weight_clip(weights, clip_value=0.01):
"""Enforce Lipschitz constraint via weight clipping."""
return np.clip(weights, -clip_value, clip_value)
# Simulate WGAN training dynamics
np.random.seed(42)
d_real_scores = np.random.randn(100) + 2.0 # Critic rates real highly
d_fake_scores = np.random.randn(100) - 1.0 # Critic rates fake low
d_loss = wasserstein_d_loss(d_real_scores, d_fake_scores)
g_loss = wasserstein_g_loss(d_fake_scores)
print(f"Wasserstein D Loss: {d_loss:.4f}")
print(f"Wasserstein G Loss: {g_loss:.4f}")
print(f"Earth Mover Distance estimate: {np.mean(d_real_scores) - np.mean(d_fake_scores):.4f}")
# Key advantage: the generator's gradient does NOT vanish.
# dL_G/d(score_i) = -1/N for every fake sample, whatever the critic's quality.
print(f"\nPer-sample gradient magnitude for G: {1.0 / len(d_fake_scores):.4f}")
print("(Constant gradient regardless of critic quality - no vanishing!)")
# Weight clipping example
sample_weights = np.random.randn(4, 4) * 0.5
clipped = weight_clip(sample_weights, clip_value=0.01)
print(f"\nOriginal weight range: [{sample_weights.min():.3f}, {sample_weights.max():.3f}]")
print(f"Clipped weight range: [{clipped.min():.3f}, {clipped.max():.3f}]")
8. What’s Next
In this article we built several generative architectures from scratch — basic and denoising autoencoders for compression and robust features, variational autoencoders for principled generation, and GANs for adversarial learning. These form the foundation of modern generative AI.
Key takeaways:
- Autoencoders learn compressed representations through reconstruction
- VAEs add probabilistic structure to enable smooth generation
- GANs use competition between networks to produce realistic outputs
- Training stability requires careful balancing of learning rates, label smoothing, and architecture design
Next in the Series
In Part 9: Transformers & Best Practices, we tackle the architecture that powers modern AI — self-attention mechanisms, positional encoding, and the complete transformer model. This completes our journey through generative models and moves into the architecture behind GPT, BERT, and beyond.