
Part 8: Autoencoders & GANs Deep Dive

May 3, 2026 · Wasil Zafar · 40 min read

Implement autoencoders, variational autoencoders, and generative adversarial networks from scratch — understand latent spaces and adversarial training at the deepest level.

Table of Contents

  1. Unsupervised Learning and Representation
  2. Building a Basic Autoencoder
  3. Denoising Autoencoders
  4. Variational Autoencoders (VAE)
  5. The Adversarial Game: GAN Fundamentals
  6. Building a GAN from Scratch
  7. GAN Training Challenges
  8. What’s Next

1. Unsupervised Learning and Representation

Unlike supervised learning where we have labels guiding our models, autoencoders learn entirely without labels. The goal is to discover a compressed representation — an encoding — that captures the essential features of the data. This is the foundation of generative modeling: if we can learn a good latent representation, we can generate new data from it.

Key Insight: Autoencoders force the network to learn what matters by compressing information through a bottleneck. The network must decide which features are essential and which can be discarded.

The core architecture consists of three components: an encoder that compresses input to a lower-dimensional latent space, a bottleneck (the compressed representation), and a decoder that reconstructs the original input from the bottleneck.

Autoencoder Architecture
flowchart LR
    A[Input x] --> B[Encoder]
    B --> C[Bottleneck z]
    C --> D[Decoder]
    D --> E[Reconstruction x-hat]
    E --> F[Loss: MSE between x and x-hat]
    F -.->|Backprop| B
    F -.->|Backprop| D
                            

The encoder maps high-dimensional input to a low-dimensional latent code, while the decoder attempts to reconstruct the original from this code. The loss function measures reconstruction quality — typically Mean Squared Error (MSE) between input and output.
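The shape bookkeeping alone makes the bottleneck idea concrete. This minimal sketch (illustrative only, with arbitrary untrained random weights) pushes a batch of 2D points through a linear encoder and decoder and measures the MSE we will later minimize:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))      # batch of five 2-D inputs

# Untrained linear encoder/decoder with a 1-D bottleneck
W_enc = rng.standard_normal((2, 1))  # encoder: 2 -> 1
W_dec = rng.standard_normal((1, 2))  # decoder: 1 -> 2

z = x @ W_enc      # latent codes, shape (5, 1)
x_hat = z @ W_dec  # reconstructions, shape (5, 2)

# Reconstruction loss: the quantity training will drive down
mse = np.mean((x - x_hat) ** 2)
print(z.shape, x_hat.shape, mse)
```

With random weights the reconstruction error is large; training the encoder and decoder (as in the next section) is what makes the 1D code informative.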

2. Building a Basic Autoencoder

Let us build an autoencoder from scratch using only NumPy. Our encoder compresses input to a lower dimension, and the decoder reconstructs it. The loss is the reconstruction error (MSE between input and output).

Dense Layer Implementation

First, we need a reusable dense layer with forward and backward passes:

import numpy as np

class DenseLayer:
    """Fully connected layer with Xavier initialization."""
    def __init__(self, input_dim, output_dim):
        # Xavier initialization for stable gradients
        scale = np.sqrt(2.0 / (input_dim + output_dim))
        self.weights = np.random.randn(input_dim, output_dim) * scale
        self.bias = np.zeros((1, output_dim))
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.weights + self.bias

    def backward(self, grad_output, lr=0.01):
        # Compute gradients
        grad_weights = self.input_cache.T @ grad_output
        grad_bias = np.sum(grad_output, axis=0, keepdims=True)
        grad_input = grad_output @ self.weights.T

        # Update parameters
        self.weights -= lr * grad_weights
        self.bias -= lr * grad_bias
        return grad_input

# Test the layer
np.random.seed(42)
layer = DenseLayer(4, 2)
x = np.random.randn(3, 4)  # batch of 3, input dim 4
output = layer.forward(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
print("Output:\n", output)

Complete Autoencoder

Now we build a complete autoencoder that compresses 2D data to 1D and reconstructs it:

import numpy as np

class DenseLayer:
    """Fully connected layer with Xavier initialization."""
    def __init__(self, input_dim, output_dim):
        scale = np.sqrt(2.0 / (input_dim + output_dim))
        self.weights = np.random.randn(input_dim, output_dim) * scale
        self.bias = np.zeros((1, output_dim))
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.weights + self.bias

    def backward(self, grad_output, lr=0.01):
        grad_weights = self.input_cache.T @ grad_output
        grad_bias = np.sum(grad_output, axis=0, keepdims=True)
        grad_input = grad_output @ self.weights.T
        self.weights -= lr * grad_weights
        self.bias -= lr * grad_bias
        return grad_input

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

class Autoencoder:
    """Basic autoencoder: 2D input -> 1D bottleneck -> 2D output."""
    def __init__(self, input_dim=2, hidden_dim=4, latent_dim=1):
        # Encoder layers
        self.enc1 = DenseLayer(input_dim, hidden_dim)
        self.enc2 = DenseLayer(hidden_dim, latent_dim)
        # Decoder layers
        self.dec1 = DenseLayer(latent_dim, hidden_dim)
        self.dec2 = DenseLayer(hidden_dim, input_dim)

    def encode(self, x):
        h = relu(self.enc1.forward(x))
        self.enc1_out = h
        z = self.enc2.forward(h)
        return z

    def decode(self, z):
        h = relu(self.dec1.forward(z))
        self.dec1_out = h
        x_hat = self.dec2.forward(h)
        return x_hat

    def forward(self, x):
        z = self.encode(x)
        x_hat = self.decode(z)
        return x_hat

    def train_step(self, x, lr=0.001):
        # Forward pass
        x_hat = self.forward(x)

        # Gradient of the MSE loss w.r.t. x_hat, averaged over the batch
        # (the constant feature-dimension factor folds into the learning rate)
        n = x.shape[0]
        grad = 2.0 * (x_hat - x) / n
        loss = np.mean((x_hat - x) ** 2)

        # Backward through decoder
        grad = self.dec2.backward(grad, lr)
        grad = grad * relu_derivative(self.dec1_out)
        grad = self.dec1.backward(grad, lr)

        # Backward through encoder
        grad = self.enc2.backward(grad, lr)
        grad = grad * relu_derivative(self.enc1_out)
        grad = self.enc1.backward(grad, lr)

        return loss

# Generate synthetic 2D data (circle pattern)
np.random.seed(42)
theta = np.linspace(0, 2 * np.pi, 200)
data = np.column_stack([np.cos(theta), np.sin(theta)])
data += np.random.randn(*data.shape) * 0.05  # Add slight noise

# Train autoencoder
ae = Autoencoder(input_dim=2, hidden_dim=4, latent_dim=1)
losses = []
for epoch in range(500):
    loss = ae.train_step(data, lr=0.005)
    losses.append(loss)
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss:.6f}")

# Reconstruct and compare
reconstructed = ae.forward(data)
print(f"\nFinal reconstruction error: {np.mean((data - reconstructed)**2):.6f}")
print(f"Original data sample: {data[0]}")
print(f"Reconstructed sample: {reconstructed[0]}")

Experiment: Bottleneck Dimension Effect

Try changing latent_dim from 1 to 2. With a 2D bottleneck for 2D data, the autoencoder can achieve near-perfect reconstruction. With 1D, it must learn the most important axis of variation — similar to PCA finding the first principal component.

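The PCA analogy can be checked directly. This standalone sketch (an illustration, not part of the tutorial's pipeline) projects the same kind of circular data onto its first principal component via SVD and reports the reconstruction error that any 1D linear bottleneck must incur:

```python
import numpy as np

rng = np.random.default_rng(42)
theta = np.linspace(0, 2 * np.pi, 200)
data = np.column_stack([np.cos(theta), np.sin(theta)])
data += rng.standard_normal(data.shape) * 0.05

# First principal component via SVD of the centered data
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]  # unit vector along the largest-variance axis

# Project onto 1D and reconstruct: the linear analogue of a 1D bottleneck
codes = centered @ pc1                  # shape (200,), the "latent" values
recon = np.outer(codes, pc1) + data.mean(axis=0)

mse = np.mean((data - recon) ** 2)
print(f"PCA 1D reconstruction MSE: {mse:.4f}")
```

For a circle both axes carry equal variance, so roughly half of it is unrecoverable from one linear component; a nonlinear autoencoder can do better by curving its latent axis around the circle.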

3. Denoising Autoencoders

A denoising autoencoder (DAE) receives corrupted input but is trained to output the clean version. This forces the network to learn robust features rather than simply memorizing an identity mapping. The corruption acts as a regularizer, preventing the autoencoder from learning trivial solutions.

Why Denoising Works: By learning to remove noise, the network must understand the underlying structure of the data distribution. It cannot simply copy input to output — it must extract meaningful patterns.

import numpy as np

class DenseLayer:
    """Fully connected layer."""
    def __init__(self, input_dim, output_dim):
        scale = np.sqrt(2.0 / (input_dim + output_dim))
        self.weights = np.random.randn(input_dim, output_dim) * scale
        self.bias = np.zeros((1, output_dim))
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.weights + self.bias

    def backward(self, grad_output, lr=0.01):
        grad_weights = self.input_cache.T @ grad_output
        grad_bias = np.sum(grad_output, axis=0, keepdims=True)
        grad_input = grad_output @ self.weights.T
        self.weights -= lr * grad_weights
        self.bias -= lr * grad_bias
        return grad_input

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

class DenoisingAutoencoder:
    """Denoising autoencoder that learns to remove Gaussian noise."""
    def __init__(self, input_dim=2, hidden_dim=8, latent_dim=2):
        self.enc1 = DenseLayer(input_dim, hidden_dim)
        self.enc2 = DenseLayer(hidden_dim, latent_dim)
        self.dec1 = DenseLayer(latent_dim, hidden_dim)
        self.dec2 = DenseLayer(hidden_dim, input_dim)

    def forward(self, x):
        h1 = relu(self.enc1.forward(x))
        self.h1 = h1
        z = self.enc2.forward(h1)
        h2 = relu(self.dec1.forward(z))
        self.h2 = h2
        x_hat = self.dec2.forward(h2)
        return x_hat

    def train_step(self, x_noisy, x_clean, lr=0.001):
        x_hat = self.forward(x_noisy)
        n = x_clean.shape[0]
        loss = np.mean((x_hat - x_clean) ** 2)
        grad = 2.0 * (x_hat - x_clean) / n

        grad = self.dec2.backward(grad, lr)
        grad = grad * relu_derivative(self.h2)
        grad = self.dec1.backward(grad, lr)
        grad = self.enc2.backward(grad, lr)
        grad = grad * relu_derivative(self.h1)
        grad = self.enc1.backward(grad, lr)
        return loss

# Generate clean signal (sine wave samples)
np.random.seed(42)
t = np.linspace(0, 4 * np.pi, 300)
clean_data = np.column_stack([t / (4 * np.pi), np.sin(t)])

# Add Gaussian noise
noise_level = 0.3
noisy_data = clean_data + np.random.randn(*clean_data.shape) * noise_level

# Train denoising autoencoder
dae = DenoisingAutoencoder(input_dim=2, hidden_dim=8, latent_dim=2)
for epoch in range(1000):
    # Each epoch: add fresh noise for variety
    noise = np.random.randn(*clean_data.shape) * noise_level
    x_noisy = clean_data + noise
    loss = dae.train_step(x_noisy, clean_data, lr=0.002)
    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch+1}, Denoising Loss: {loss:.6f}")

# Test: denoise the noisy data
denoised = dae.forward(noisy_data)
noise_error = np.mean((noisy_data - clean_data) ** 2)
denoised_error = np.mean((denoised - clean_data) ** 2)
print(f"\nNoisy data MSE from clean: {noise_error:.6f}")
print(f"Denoised data MSE from clean: {denoised_error:.6f}")
print(f"Noise reduction ratio: {noise_error / denoised_error:.2f}x")

4. Variational Autoencoders (VAE)

Standard autoencoders encode each input to a single point in latent space. Variational autoencoders (VAEs) instead encode to a distribution — specifically, a Gaussian parameterized by mean and variance. This enables smooth interpolation and generation of new data by sampling from the latent space.

The Reparameterization Trick

The key innovation of VAEs is the reparameterization trick. Instead of sampling directly from the learned distribution (which would block gradient flow), we sample noise from a standard normal and transform it:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$

The total VAE loss combines reconstruction quality with a regularization term that keeps the latent distribution close to a standard normal:

$$\mathcal{L} = \mathcal{L}_{recon} + D_{KL}(q(z|x) \| p(z))$$

Where the KL divergence for Gaussians has a closed-form solution:

$$D_{KL} = -\frac{1}{2} \sum_{j=1}^{J} (1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2)$$

Mathematical Intuition: The KL divergence term prevents the encoder from collapsing all inputs to a single point. It ensures the latent space is smooth and continuous, which is what allows meaningful generation and interpolation.
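As a sanity check on the closed form above, a quick Monte Carlo estimate of the KL divergence (a standalone numerical check, separate from the VAE implementation) should agree with it:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.8, 0.6

# Closed-form KL(N(mu, sigma^2) || N(0, 1)) from the formula above
kl_closed = -0.5 * (1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo estimate: E_{z~q}[log q(z) - log p(z)]
z = rng.normal(mu, sigma, size=500_000)
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2)
log_p = -0.5 * np.log(2 * np.pi) - z**2 / 2
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")
```

The two values match to a few decimal places, which is exactly why VAEs use the closed form: it is exact, cheap, and differentiable without sampling noise.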

import numpy as np

class DenseLayer:
    """Fully connected layer."""
    def __init__(self, input_dim, output_dim):
        scale = np.sqrt(2.0 / (input_dim + output_dim))
        self.weights = np.random.randn(input_dim, output_dim) * scale
        self.bias = np.zeros((1, output_dim))
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.weights + self.bias

    def backward(self, grad_output, lr=0.01):
        grad_weights = self.input_cache.T @ grad_output
        grad_bias = np.sum(grad_output, axis=0, keepdims=True)
        grad_input = grad_output @ self.weights.T
        self.weights -= lr * grad_weights
        self.bias -= lr * grad_bias
        return grad_input

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

class VAE:
    """Variational Autoencoder with reparameterization trick."""
    def __init__(self, input_dim=2, hidden_dim=8, latent_dim=2):
        self.latent_dim = latent_dim
        # Encoder
        self.enc1 = DenseLayer(input_dim, hidden_dim)
        self.enc_mu = DenseLayer(hidden_dim, latent_dim)
        self.enc_logvar = DenseLayer(hidden_dim, latent_dim)
        # Decoder
        self.dec1 = DenseLayer(latent_dim, hidden_dim)
        self.dec2 = DenseLayer(hidden_dim, input_dim)

    def encode(self, x):
        h = relu(self.enc1.forward(x))
        self.h_enc = h
        mu = self.enc_mu.forward(h)
        log_var = self.enc_logvar.forward(h)
        return mu, log_var

    def reparameterize(self, mu, log_var):
        """Sample z = mu + sigma * epsilon (reparameterization trick)."""
        std = np.exp(0.5 * log_var)
        epsilon = np.random.randn(*mu.shape)
        z = mu + std * epsilon
        self.std = std
        self.epsilon = epsilon
        return z

    def decode(self, z):
        h = relu(self.dec1.forward(z))
        self.h_dec = h
        x_hat = self.dec2.forward(h)
        return x_hat

    def train_step(self, x, lr=0.001, kl_weight=0.1):
        n = x.shape[0]

        # Forward pass
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        x_hat = self.decode(z)

        # Losses
        recon_loss = np.mean((x_hat - x) ** 2)
        kl_loss = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
        total_loss = recon_loss + kl_weight * kl_loss

        # Backward: reconstruction gradient
        grad = 2.0 * (x_hat - x) / n

        # Through decoder
        grad = self.dec2.backward(grad, lr)
        grad = grad * relu_derivative(self.h_dec)
        grad_z = self.dec1.backward(grad, lr)

        # KL gradients for mu and log_var
        grad_mu_kl = mu / n
        grad_logvar_kl = 0.5 * (np.exp(log_var) - 1) / n

        # Through reparameterization: dL/dmu, dL/dlogvar
        grad_mu = grad_z + kl_weight * grad_mu_kl
        grad_logvar = grad_z * self.epsilon * 0.5 * self.std + kl_weight * grad_logvar_kl

        # Through encoder heads (backward returns the input gradient,
        # computed with the pre-update weights)
        grad_h_mu = self.enc_mu.backward(grad_mu, lr)
        grad_h_logvar = self.enc_logvar.backward(grad_logvar, lr)

        # Through shared encoder: sum the gradients from both heads
        grad_h = (grad_h_mu + grad_h_logvar) * relu_derivative(self.h_enc)
        self.enc1.backward(grad_h, lr)

        return total_loss, recon_loss, kl_loss

    def generate(self, n_samples=5):
        """Generate new data by sampling from latent space."""
        z = np.random.randn(n_samples, self.latent_dim)
        return self.decode(z)

# Train VAE on 2D Gaussian mixture data
np.random.seed(42)
n_per_cluster = 100
cluster1 = np.random.randn(n_per_cluster, 2) * 0.3 + np.array([1, 1])
cluster2 = np.random.randn(n_per_cluster, 2) * 0.3 + np.array([-1, -1])
cluster3 = np.random.randn(n_per_cluster, 2) * 0.3 + np.array([1, -1])
data = np.vstack([cluster1, cluster2, cluster3])

vae = VAE(input_dim=2, hidden_dim=16, latent_dim=2)
for epoch in range(1500):
    total, recon, kl = vae.train_step(data, lr=0.001, kl_weight=0.05)
    if (epoch + 1) % 300 == 0:
        print(f"Epoch {epoch+1} | Total: {total:.4f} | Recon: {recon:.4f} | KL: {kl:.4f}")

# Generate new samples from the learned distribution
generated = vae.generate(n_samples=10)
print(f"\nGenerated samples (from random latent vectors):")
for i, sample in enumerate(generated[:5]):
    print(f"  Sample {i+1}: [{sample[0]:.3f}, {sample[1]:.3f}]")

Experiment: KL Weight Annealing

Try varying kl_weight from 0.01 to 1.0. Low values give good reconstruction but poor generation (latent space has gaps). High values give smooth latent space but blurry reconstructions. This is the fundamental VAE tradeoff — balancing fidelity with generative quality.


5. The Adversarial Game: GAN Fundamentals

Generative Adversarial Networks (GANs) take a completely different approach to generation. Instead of learning a reconstruction objective, two networks compete in a minimax game:

  • Generator (G): Takes random noise and transforms it into fake data that looks real
  • Discriminator (D): Tries to distinguish real data from the generator’s fakes

The mathematical formulation of this adversarial game is:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

At Nash equilibrium, the generator produces data indistinguishable from real data, and the discriminator outputs 0.5 for everything (it cannot tell the difference).
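The 0.5 claim follows from the optimal discriminator, D*(x) = p_data(x) / (p_data(x) + p_g(x)). This standalone sketch evaluates D* for a mismatched and a perfectly matched generator on 1D Gaussians:

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-(x - mean)**2 / (2 * std**2)) / (std * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 9)
p_data = normal_pdf(x, 0.0, 1.0)

# Mismatched generator: D* deviates from 0.5 wherever the densities differ
p_g = normal_pdf(x, 1.5, 1.0)
d_star_mismatched = p_data / (p_data + p_g)

# Perfect generator (p_g identical to p_data): D* is exactly 0.5 everywhere
d_star_matched = p_data / (p_data + p_data)

print("mismatched:", np.round(d_star_mismatched, 3))
print("matched:   ", d_star_matched)
```

When the generator matches the data density, even the best possible discriminator degenerates to a coin flip, which is the Nash equilibrium described above.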

GAN Training Loop
flowchart TD
    Z[Random Noise z] --> G[Generator G]
    G --> FAKE[Fake Data]
    REAL[Real Data x] --> D[Discriminator D]
    FAKE --> D
    D --> LOSS_D[D Loss: classify real vs fake]
    LOSS_D -->|Update D weights| D
    D --> LOSS_G[G Loss: fool D]
    LOSS_G -->|Update G weights| G
                            
The Game Analogy: Think of the Generator as a counterfeiter learning to make fake currency, and the Discriminator as a detective trying to catch fakes. As the detective gets better, the counterfeiter must improve. This competition drives both networks to improve until the fakes are indistinguishable from real currency.

6. Building a GAN from Scratch

We will implement a complete GAN that learns to generate samples from a 1D Gaussian distribution. The generator takes uniform random noise and transforms it to match the target distribution, while the discriminator learns to separate real from generated samples.

import numpy as np

def sigmoid(x):
    x = np.clip(x, -500, 500)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

class GANLayer:
    """Dense layer for GAN networks."""
    def __init__(self, input_dim, output_dim):
        self.weights = np.random.randn(input_dim, output_dim) * 0.1
        self.bias = np.zeros((1, output_dim))
        self.input_cache = None

    def forward(self, x):
        self.input_cache = x
        return x @ self.weights + self.bias

    def backward(self, grad_output, lr):
        grad_weights = self.input_cache.T @ grad_output
        grad_bias = np.sum(grad_output, axis=0, keepdims=True)
        grad_input = grad_output @ self.weights.T
        self.weights -= lr * grad_weights
        self.bias -= lr * grad_bias
        return grad_input

class Generator:
    """Generator: maps noise to data space."""
    def __init__(self, noise_dim=1, hidden_dim=16, output_dim=1):
        self.layer1 = GANLayer(noise_dim, hidden_dim)
        self.layer2 = GANLayer(hidden_dim, hidden_dim)
        self.layer3 = GANLayer(hidden_dim, output_dim)

    def forward(self, z):
        h1 = relu(self.layer1.forward(z))
        self.h1 = h1
        h2 = relu(self.layer2.forward(h1))
        self.h2 = h2
        out = self.layer3.forward(h2)
        return out

    def backward(self, grad, lr):
        grad = self.layer3.backward(grad, lr)
        grad = grad * relu_derivative(self.h2)
        grad = self.layer2.backward(grad, lr)
        grad = grad * relu_derivative(self.h1)
        grad = self.layer1.backward(grad, lr)

class Discriminator:
    """Discriminator: classifies real vs fake."""
    def __init__(self, input_dim=1, hidden_dim=16):
        self.layer1 = GANLayer(input_dim, hidden_dim)
        self.layer2 = GANLayer(hidden_dim, hidden_dim)
        self.layer3 = GANLayer(hidden_dim, 1)

    def forward(self, x):
        h1 = relu(self.layer1.forward(x))
        self.h1 = h1
        h2 = relu(self.layer2.forward(h1))
        self.h2 = h2
        logit = self.layer3.forward(h2)
        out = sigmoid(logit)
        self.logit = logit
        return out

    def backward(self, grad, lr):
        # Gradient through sigmoid
        s = sigmoid(self.logit)
        grad = grad * s * (1 - s)
        grad = self.layer3.backward(grad, lr)
        grad = grad * relu_derivative(self.h2)
        grad = self.layer2.backward(grad, lr)
        grad = grad * relu_derivative(self.h1)
        grad = self.layer1.backward(grad, lr)
        return grad

# Training setup
np.random.seed(42)
real_mean = 3.0
real_std = 0.5

G = Generator(noise_dim=1, hidden_dim=16, output_dim=1)
D = Discriminator(input_dim=1, hidden_dim=16)

batch_size = 64
lr_d = 0.001
lr_g = 0.001

print("Training GAN to learn N(3.0, 0.5) distribution...")
print("-" * 50)

for epoch in range(2000):
    # === Train Discriminator ===
    # Real data from target distribution
    real_data = np.random.randn(batch_size, 1) * real_std + real_mean

    # Fake data from generator
    noise = np.random.randn(batch_size, 1)
    fake_data = G.forward(noise)

    # D scores
    d_real = D.forward(real_data)
    # Binary cross-entropy gradient for real (label=1)
    grad_real = -(1.0 / (d_real + 1e-8)) / batch_size
    D.backward(grad_real, lr_d)

    d_fake = D.forward(fake_data)
    # Binary cross-entropy gradient for fake (label=0)
    grad_fake = (1.0 / (1.0 - d_fake + 1e-8)) / batch_size
    D.backward(grad_fake, lr_d)

    # === Train Generator ===
    noise = np.random.randn(batch_size, 1)
    fake_data = G.forward(noise)
    d_fake_for_g = D.forward(fake_data)

    # Generator wants D to output 1 for fakes: BCE gradient for label=1
    grad_g_out = -(1.0 / (d_fake_for_g + 1e-8)) / batch_size
    # Backprop manually through D's cached activations WITHOUT updating D
    s = sigmoid(D.logit)
    grad_g_out = grad_g_out * s * (1 - s)        # through the sigmoid
    grad_g_out = grad_g_out @ D.layer3.weights.T
    grad_g_out = grad_g_out * relu_derivative(D.h2)
    grad_g_out = grad_g_out @ D.layer2.weights.T
    grad_g_out = grad_g_out * relu_derivative(D.h1)
    grad_g_out = grad_g_out @ D.layer1.weights.T
    G.backward(grad_g_out, lr_g)

    if (epoch + 1) % 400 == 0:
        test_noise = np.random.randn(1000, 1)
        generated = G.forward(test_noise)
        gen_mean = np.mean(generated)
        gen_std = np.std(generated)
        print(f"Epoch {epoch+1} | Gen mean: {gen_mean:.3f} (target: {real_mean}) | "
              f"Gen std: {gen_std:.3f} (target: {real_std})")

# Final evaluation
test_noise = np.random.randn(5000, 1)
final_generated = G.forward(test_noise)
print(f"\nFinal Generator Statistics:")
print(f"  Mean: {np.mean(final_generated):.4f} (target: {real_mean})")
print(f"  Std:  {np.std(final_generated):.4f} (target: {real_std})")

Convergence Indicator: A well-trained GAN will have the generator’s output statistics (mean and standard deviation) closely matching the target distribution. The discriminator accuracy should converge toward 50% as it becomes unable to distinguish real from fake.
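The 50% figure can be checked numerically with hypothetical score distributions (the arrays below are simulated stand-ins, not outputs of the networks above):

```python
import numpy as np

def d_accuracy(d_real, d_fake):
    """Fraction of samples classified correctly at a 0.5 threshold."""
    return 0.5 * (np.mean(d_real > 0.5) + np.mean(d_fake < 0.5))

rng = np.random.default_rng(0)

# Early in training: D cleanly separates real (~0.9) from fake (~0.1)
early = d_accuracy(rng.normal(0.9, 0.05, 1000), rng.normal(0.1, 0.05, 1000))

# Near convergence: both score distributions sit around 0.5 and
# accuracy collapses toward chance
late = d_accuracy(rng.normal(0.5, 0.05, 1000), rng.normal(0.5, 0.05, 1000))

print(f"early accuracy: {early:.2f}, converged accuracy: {late:.2f}")
```

Tracking this accuracy during training is a cheap convergence diagnostic: a value pinned near 1.0 signals a dominating discriminator, while ~0.5 suggests the generator has caught up.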

7. GAN Training Challenges

GAN training is notoriously unstable. Three major challenges plague practitioners:

Mode Collapse

The generator finds a single output that fools the discriminator and stops exploring other modes of the data distribution. Instead of generating diverse samples, it produces the same (or very similar) outputs repeatedly.

Vanishing Gradients

When the discriminator becomes too good, it outputs values very close to 0 for all generator outputs. The gradient signal to the generator becomes extremely small, halting learning.
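The vanishing can be seen directly from the two standard generator losses. With the saturating loss log(1 - D(G(z))), the gradient with respect to the discriminator's logit is -D(G(z)), which shrinks to zero as D grows confident; the common non-saturating alternative -log D(G(z)) keeps it near -1. A small standalone check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# An increasingly confident D pushes fake logits strongly negative
logits = np.array([-1.0, -4.0, -8.0, -12.0])
d = sigmoid(logits)  # D(G(z)) approaching 0

# d/dlogit of log(1 - D): equals -D, vanishes as D -> 0
grad_saturating = -d

# d/dlogit of -log(D): equals D - 1, stays near -1 as D -> 0
grad_nonsaturating = d - 1.0

print("saturating:    ", grad_saturating)
print("non-saturating:", grad_nonsaturating)
```

This is why most practical implementations (including the training loop in the previous section, which uses the label=1 BCE gradient for fakes) train G with the non-saturating objective.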

Training Instability

The competing objectives can cause oscillation rather than convergence. One network may overpower the other, breaking the delicate balance needed for learning.

Common Pitfalls:
  • Training D too many steps per G step — D dominates, G gradients vanish
  • Learning rate too high — causes oscillation between G and D
  • Generator not expressive enough — cannot represent the full data distribution
  • No gradient clipping — exploding gradients destabilize training
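The last pitfall is cheap to guard against. A minimal norm-clipping helper (a generic sketch, not wired into the training loops above) rescales any gradient whose L2 norm exceeds a threshold:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# An exploding gradient gets rescaled; a small one passes through untouched
big = np.array([[30.0, -40.0]])   # norm 50
small = np.array([[0.3, 0.4]])    # norm 0.5

print(np.linalg.norm(clip_by_norm(big)))    # 1.0
print(np.linalg.norm(clip_by_norm(small)))  # 0.5
```

Calling this on each layer's `grad_weights` before the update step bounds the size of any single parameter change without altering the gradient's direction.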

import numpy as np

def sigmoid(x):
    x = np.clip(x, -500, 500)
    return 1.0 / (1.0 + np.exp(-x))

class SimpleGenerator:
    """Minimal generator for demonstrating mode collapse."""
    def __init__(self):
        self.w1 = np.random.randn(1, 8) * 0.1
        self.b1 = np.zeros((1, 8))
        self.w2 = np.random.randn(8, 1) * 0.1
        self.b2 = np.zeros((1, 1))

    def forward(self, z):
        self.z = z
        self.h = np.maximum(0, z @ self.w1 + self.b1)
        return self.h @ self.w2 + self.b2

class SimpleDiscriminator:
    """Minimal discriminator."""
    def __init__(self):
        self.w1 = np.random.randn(1, 8) * 0.1
        self.b1 = np.zeros((1, 8))
        self.w2 = np.random.randn(8, 1) * 0.1
        self.b2 = np.zeros((1, 1))

    def forward(self, x):
        self.h = np.maximum(0, x @ self.w1 + self.b1)
        logit = self.h @ self.w2 + self.b2
        return sigmoid(logit)

# Demonstrate mode collapse with bimodal target
np.random.seed(42)

# Target: bimodal distribution (two peaks at -2 and +2)
def sample_bimodal(n):
    choices = np.random.choice([-2.0, 2.0], size=(n, 1))
    return choices + np.random.randn(n, 1) * 0.2

# Train without label smoothing (prone to mode collapse)
G_no_smooth = SimpleGenerator()
D_no_smooth = SimpleDiscriminator()

print("=== Without Label Smoothing (Mode Collapse Risk) ===")
for epoch in range(500):
    real = sample_bimodal(32)
    noise = np.random.randn(32, 1)
    fake = G_no_smooth.forward(noise)

    # Score both batches (shown for completeness; this simplified demo
    # updates G with a heuristic rule rather than true adversarial gradients)
    d_real = D_no_smooth.forward(real)
    d_fake = D_no_smooth.forward(fake)

    # Heuristic G update: push the output toward the real batch mean.
    # The bimodal target averages to ~0, so G collapses toward one point.
    target_direction = np.mean(real) - np.mean(fake)
    G_no_smooth.w2 += 0.001 * target_direction
    G_no_smooth.b2 += 0.0005 * target_direction

# Check for mode collapse
test_noise = np.random.randn(1000, 1)
generated_no_smooth = G_no_smooth.forward(test_noise)
print(f"Generated mean: {np.mean(generated_no_smooth):.3f}")
print(f"Generated std: {np.std(generated_no_smooth):.3f}")
print(f"All samples near single mode: {np.std(generated_no_smooth) < 0.5}")

# Train WITH label smoothing (reduces mode collapse)
print("\n=== With Label Smoothing (More Stable) ===")
G_smooth = SimpleGenerator()
np.random.seed(123)

for epoch in range(500):
    real = sample_bimodal(32)
    noise = np.random.randn(32, 1)
    fake = G_smooth.forward(noise)

    # In a full GAN, label smoothing would use 0.9 for real labels so D
    # never becomes fully confident. This simplified demo stands in for
    # that stabilizer by injecting noise into the update direction,
    # which keeps G from settling on a single mode.
    diversity_noise = np.random.randn() * 0.5
    target_direction = np.mean(real) - np.mean(fake) + diversity_noise
    G_smooth.w2 += 0.001 * target_direction
    G_smooth.b2 += 0.0005 * target_direction

generated_smooth = G_smooth.forward(test_noise)
print(f"Generated mean: {np.mean(generated_smooth):.3f}")
print(f"Generated std: {np.std(generated_smooth):.3f}")
print(f"Better diversity (std > 0.5): {np.std(generated_smooth) > 0.5}")

print("\n=== Solutions Summary ===")
print("1. Label Smoothing: Use 0.9 for real, 0.1 for fake labels")
print("2. Feature Matching: Match intermediate layer statistics")
print("3. Wasserstein Distance: Use Earth Mover distance instead of BCE")
print("4. Spectral Normalization: Constrain D Lipschitz constant")
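One of those solutions can be sketched in a few lines. Spectral normalization divides a weight matrix by its largest singular value, which power iteration estimates cheaply (a standalone illustration, not tied to the Discriminator class above):

```python
import numpy as np

def largest_singular_value(W, n_iters=50):
    """Estimate sigma_max(W) by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v = v / (np.linalg.norm(v) + 1e-12)
        u = W @ v
        u = u / (np.linalg.norm(u) + 1e-12)
    return float(u @ W @ v)

rng = np.random.default_rng(42)
W = rng.standard_normal((16, 8))
sigma = largest_singular_value(W)
W_sn = W / sigma  # largest singular value ~1 -> roughly 1-Lipschitz layer

print(f"sigma_max before: {sigma:.3f}, after: {np.linalg.norm(W_sn, 2):.3f}")
```

Applied to every discriminator layer before each forward pass, this bounds the network's Lipschitz constant without the brittle weight clipping used in the original WGAN.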

Wasserstein Distance (Conceptual)

The Wasserstein GAN (WGAN) replaces the binary cross-entropy loss with the Earth Mover’s distance. This provides more meaningful gradients even when the discriminator (called a "critic" in WGAN) performs well, solving the vanishing gradient problem:

import numpy as np

# Wasserstein loss concept demonstration
# Instead of log probabilities, use raw scores (no sigmoid)

def wasserstein_d_loss(d_real_scores, d_fake_scores):
    """Critic loss: maximize E[D(real)] - E[D(fake)]."""
    return -(np.mean(d_real_scores) - np.mean(d_fake_scores))

def wasserstein_g_loss(d_fake_scores):
    """Generator loss: maximize E[D(fake)] = minimize -E[D(fake)]."""
    return -np.mean(d_fake_scores)

def weight_clip(weights, clip_value=0.01):
    """Enforce Lipschitz constraint via weight clipping."""
    return np.clip(weights, -clip_value, clip_value)

# Simulate WGAN training dynamics
np.random.seed(42)
d_real_scores = np.random.randn(100) + 2.0  # Critic rates real highly
d_fake_scores = np.random.randn(100) - 1.0  # Critic rates fake low

d_loss = wasserstein_d_loss(d_real_scores, d_fake_scores)
g_loss = wasserstein_g_loss(d_fake_scores)

print(f"Wasserstein D Loss: {d_loss:.4f}")
print(f"Wasserstein G Loss: {g_loss:.4f}")
print(f"Earth Mover Distance estimate: {np.mean(d_real_scores) - np.mean(d_fake_scores):.4f}")

# Key advantage: the generator's gradient does NOT vanish.
# dL_G/d(score) = -1/n for every fake sample, however strong the critic is.
print(f"\nGradient per fake score for G: {1.0 / len(d_fake_scores):.4f}")
print("(Constant magnitude regardless of critic quality - no vanishing!)")

# Weight clipping example
sample_weights = np.random.randn(4, 4) * 0.5
clipped = weight_clip(sample_weights, clip_value=0.01)
print(f"\nOriginal weight range: [{sample_weights.min():.3f}, {sample_weights.max():.3f}]")
print(f"Clipped weight range: [{clipped.min():.3f}, {clipped.max():.3f}]")

8. What’s Next

In this article we built three powerful generative architectures from scratch — basic autoencoders for compression, variational autoencoders for principled generation, and GANs for adversarial learning. These form the foundation of modern generative AI.

Key takeaways:

  • Autoencoders learn compressed representations through reconstruction
  • VAEs add probabilistic structure to enable smooth generation
  • GANs use competition between networks to produce realistic outputs
  • Training stability requires careful balancing of learning rates, label smoothing, and architecture design

Next in the Series

In Part 9: Transformers & Best Practices, we tackle the architecture that powers modern AI — self-attention mechanisms, positional encoding, and the complete transformer model. This completes our journey through generative models and moves into the architecture behind GPT, BERT, and beyond.