The Artificial Neuron
The artificial neuron is the fundamental unit of every neural network. Inspired by biological neurons, it takes multiple inputs, multiplies each by a weight, sums the results, adds a bias, and passes the result through an activation function to produce an output.
Think of it as a tiny decision-maker: it gathers evidence (weighted inputs), shifts its firing threshold (bias), and decides how strongly to “fire” (activation). Let’s visualize this:
flowchart LR
x1["x1"] -->|"w1"| SUM["Sum + Bias"]
x2["x2"] -->|"w2"| SUM
x3["x3"] -->|"w3"| SUM
B["b"] --> SUM
SUM --> ACT["Activation f(z)"]
ACT --> Y["Output y"]
Implementing a Single Neuron
Let’s build a Neuron class from scratch and visualize how it responds to different inputs:
import numpy as np
import matplotlib.pyplot as plt
class Neuron:
"""A single artificial neuron with configurable activation."""
def __init__(self, n_inputs, activation='sigmoid'):
# Initialize weights randomly (small values)
self.weights = np.random.randn(n_inputs) * 0.5
self.bias = 0.0
self.activation = activation
def _activate(self, z):
"""Apply activation function."""
if self.activation == 'sigmoid':
return 1.0 / (1.0 + np.exp(-z))
elif self.activation == 'relu':
return np.maximum(0, z)
elif self.activation == 'tanh':
return np.tanh(z)
else:
return z # linear (no activation)
def forward(self, x):
"""Compute neuron output: f(w*x + b)."""
z = np.dot(self.weights, x) + self.bias
return self._activate(z)
# Create a neuron with 2 inputs
np.random.seed(42)
neuron = Neuron(n_inputs=2, activation='sigmoid')
print(f"Weights: {neuron.weights}")
print(f"Bias: {neuron.bias}")
# Test with sample inputs
inputs = np.array([0.5, -0.3])
output = neuron.forward(inputs)
print(f"\nInput: {inputs}")
print(f"Weighted sum: {np.dot(neuron.weights, inputs) + neuron.bias:.4f}")
print(f"Output (after sigmoid): {output:.4f}")
# Visualize neuron response across input space
x1_range = np.linspace(-3, 3, 100)
x2_range = np.linspace(-3, 3, 100)
X1, X2 = np.meshgrid(x1_range, x2_range)
Z = np.zeros_like(X1)
for i in range(100):
for j in range(100):
Z[i, j] = neuron.forward(np.array([X1[i, j], X2[i, j]]))
plt.figure(figsize=(8, 6))
plt.contourf(X1, X2, Z, levels=20, cmap='RdYlBu_r')
plt.colorbar(label='Neuron Output')
plt.xlabel('Input x1', fontsize=12)
plt.ylabel('Input x2', fontsize=12)
plt.title('Single Neuron Response Surface (Sigmoid)', fontsize=14)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.savefig('neuron_response.png', dpi=100, bbox_inches='tight')
plt.show()
print("\nThe neuron creates a LINEAR decision boundary")
print("(sigmoid squashes but the boundary is still a line).")
Modify the neuron’s weights to [2.0, -1.0] and bias to 0.5. How does the decision boundary change? Try switching the activation to 'relu' — notice how the response surface becomes piecewise linear instead of smooth.
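As a starting point for the exercise, here is a minimal standalone sketch (it re-derives the boundary algebraically rather than reusing the Neuron class) with the suggested weights and bias:

```python
import numpy as np

# Starting point for the exercise: the suggested weights [2.0, -1.0] and
# bias 0.5, computed inline rather than through the Neuron class.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([2.0, -1.0])
bias = 0.5

# The decision boundary (output = 0.5) is where 2*x1 - x2 + 0.5 = 0,
# i.e. the line x2 = 2*x1 + 0.5: steeper and shifted relative to the
# random initialisation.
for x in [np.array([0.0, 0.5]), np.array([1.0, 2.5]), np.array([1.0, 0.0])]:
    z = weights @ x + bias
    print(f"x={x}  z={z:+.2f}  sigmoid(z)={sigmoid(z):.3f}")
# The first two points lie exactly on the boundary (output 0.5);
# the third lies below the line and fires strongly (~0.92).
```

Points on the line x2 = 2·x1 + 0.5 give exactly 0.5; moving off the line in either direction pushes the output toward 0 or 1.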
Weights and Biases Explained
Weights and biases are the learnable parameters of a neural network. During training, the network adjusts these values to minimize error. But what exactly do they control?
What Weights Control: Strength and Direction
A weight $w_i$ controls how much influence input $x_i$ has on the output. Large positive weights amplify the input, large negative weights suppress it, and weights near zero ignore it. The weighted sum is:
$$z = \sum_{i=1}^{n} w_i x_i + b = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$
The Role of Bias: Shifting the Threshold
The bias $b$ shifts the activation threshold. Without bias, a neuron with sigmoid activation always outputs exactly 0.5 when all inputs are zero. Bias lets the neuron “pre-activate” (fire more readily) or require stronger evidence before firing.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
"""Sigmoid activation function."""
return 1.0 / (1.0 + np.exp(-z))
# Demonstrate effect of different weights (steepness)
z = np.linspace(-6, 6, 200)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: varying weights (steepness of sigmoid)
ax1 = axes[0]
weights = [0.5, 1.0, 2.0, 5.0]
for w in weights:
ax1.plot(z, sigmoid(w * z), linewidth=2, label=f'w = {w}')
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Input x', fontsize=12)
ax1.set_ylabel('Output sigmoid(w*x)', fontsize=12)
ax1.set_title('Effect of Weight: Controls Steepness', fontsize=13)
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)
# Right plot: varying bias (shift of sigmoid)
ax2 = axes[1]
biases = [-3, -1, 0, 1, 3]
for b in biases:
ax2.plot(z, sigmoid(z + b), linewidth=2, label=f'b = {b}')
ax2.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Input x', fontsize=12)
ax2.set_ylabel('Output sigmoid(x + b)', fontsize=12)
ax2.set_title('Effect of Bias: Shifts Threshold Left/Right', fontsize=13)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('weights_biases_effect.png', dpi=100, bbox_inches='tight')
plt.show()
print("=== Weight Effect ===")
print("Larger weight = steeper transition (more confident)")
print("Smaller weight = gradual transition (more uncertain)")
print("\n=== Bias Effect ===")
print("Positive bias = shifts threshold LEFT (fires more easily)")
print("Negative bias = shifts threshold RIGHT (harder to fire)")
print("Zero bias = transition centered at input = 0")
Activation Functions
Every neuron computes a weighted sum $z = \mathbf{w}^\top \mathbf{x} + b$ — a purely linear operation. Stacking linear operations always collapses to one linear transformation, no matter how many layers you add. Activation functions are the nonlinear gates that break this collapse, letting the network learn curves, boundaries, and concepts that no straight line can capture.
There are roughly a dozen activation functions in active use across modern ML. Below we cover every one that matters — formula, intuition, implementation, strengths, and exactly when to reach for it.
Sigmoid — $\sigma(z) = \dfrac{1}{1+e^{-z}}$
Range: (0, 1) | Zero-centred: No | Saturates: Yes
Analogy: A spam filter that turns raw email-weirdness scores into a clean probability — 0 = definitely not spam, 1 = definitely spam. Nothing is ever perfectly certain, so the output is always strictly between the extremes.
When Sigmoid Belongs in Your Network
Output layer of binary classifiers — logistic regression, medical diagnosis (disease/no disease), fraud detection. Sigmoid turns a raw score into an interpretable probability. It is almost never used in hidden layers of deep networks today because its derivative maxes out at 0.25 and vanishes toward zero for large $|z|$, so gradients shrink layer after layer during backpropagation (the vanishing gradient problem).
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
"""Squashes any real number to (0, 1)."""
return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500))) # clip prevents overflow
def sigmoid_derivative(z):
"""Max value is 0.25 at z=0 — shrinks fast for large |z|."""
s = sigmoid(z)
return s * (1.0 - s)
z = np.linspace(-6, 6, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, sigmoid(z), linewidth=2.5, label='sigmoid(z)')
plt.plot(z, sigmoid_derivative(z), linewidth=2, linestyle='--', label="sigmoid'(z) [max=0.25]")
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('Sigmoid and Its Derivative'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Practical demo: binary classification probability
raw_scores = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
probs = sigmoid(raw_scores)
for score, prob in zip(raw_scores, probs):
print(f"score={score:+.1f} → P(positive)={prob:.3f}")
# score=-3.0 → P(positive)=0.047
# score= 0.0 → P(positive)=0.500
# score=+3.0 → P(positive)=0.953
Tanh — $\tanh(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$
Range: (−1, 1) | Zero-centred: Yes | Saturates: Yes
Analogy: A sentiment meter that reads from −1 (strongly negative) through 0 (neutral) to +1 (strongly positive). Because the outputs are zero-centred, the inputs to the next layer average near zero; sigmoid’s always-positive outputs instead force all of a neuron’s weight gradients to share the same sign, causing zig-zag updates. Tanh avoids this and typically converges faster than sigmoid.
Where Tanh Still Wins in 2026
Tanh is the standard activation inside LSTM and GRU cells. It keeps hidden states in a bounded, zero-centred range, which prevents hidden state magnitudes from exploding over hundreds of timesteps. It is also used in some normalizing flows because it is invertible over its full domain.
import numpy as np
import matplotlib.pyplot as plt
def tanh(z):
"""Zero-centred, range (-1, 1)."""
return np.tanh(z)
def tanh_derivative(z):
"""Derivative: 1 - tanh(z)^2. Maximum 1.0 at z=0."""
return 1.0 - np.tanh(z) ** 2
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, tanh(z), linewidth=2.5, label='tanh(z)')
plt.plot(z, tanh_derivative(z), linewidth=2, linestyle='--', label="tanh'(z) [max=1.0]")
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('Tanh and Its Derivative'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# tanh vs sigmoid: derivative comparison
for z_val in [-3, -1, 0, 1, 3]:
sig = 1.0 / (1.0 + np.exp(-z_val))
sig_grad = sig * (1.0 - sig)
tanh_grad = tanh_derivative(z_val)
print(f"z={z_val:+d} sigmoid'={sig_grad:.4f} tanh'={tanh_grad:.4f}")
# z=0 sigmoid'=0.2500 tanh'=1.0000 ← tanh has 4× larger gradient at centre
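To see why tanh's boundedness matters over many timesteps, here is a toy recurrence sketch (not a real LSTM; the weight 1.5 and 200 steps are arbitrary illustrative values):

```python
import numpy as np

# Toy illustration (not a real LSTM): iterate the same recurrence 200 times
# with a weight greater than 1. The purely linear recurrence explodes, while
# wrapping each step in tanh keeps the state bounded inside (-1, 1).
w = 1.5
h_linear, h_tanh = 0.1, 0.1
for _ in range(200):
    h_linear = w * h_linear        # magnitude grows 1.5x per step
    h_tanh = np.tanh(w * h_tanh)   # squashed back into (-1, 1) every step

print(f"linear recurrence after 200 steps: {h_linear:.3e}")  # astronomically large
print(f"tanh recurrence after 200 steps:   {h_tanh:.4f}")    # stays bounded
```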
ReLU — $f(z) = \max(0, z)$
Range: [0, ∞) | Zero-centred: No | Saturates: No (positive side)
Analogy: A one-way valve in a water pipe — water flows freely in one direction (positive values pass through unchanged), but the valve blocks all backflow (negative values become zero). It is cheap to compute, never saturates on the positive side, and empirically trains deep networks faster than sigmoid or tanh.
Why ReLU Became the Default Hidden-Layer Activation
When deep learning took off with AlexNet (2012), ReLU was the key ingredient enabling 8 layers to train in days rather than months. Its gradient is always 1 for positive inputs (no shrinkage), so gradients flow cleanly to early layers. Most convolutional neural networks and MLPs today still use ReLU as the default unless there is a specific reason to switch. PyTorch's nn.ReLU() and TensorFlow's tf.keras.activations.relu both default to this.
import numpy as np
import matplotlib.pyplot as plt
def relu(z):
"""Passes positive values unchanged; clips negatives to zero."""
return np.maximum(0, z)
def relu_derivative(z):
"""1 for z>0, 0 for z<0, undefined at z=0 (convention: 0)."""
return (z > 0).astype(float)
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, relu(z), linewidth=2.5, label='relu(z)')
plt.plot(z, relu_derivative(z), linewidth=2, linestyle='--', label="relu'(z)")
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('ReLU and Its Derivative'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Demonstrate sparse activation (healthy for feature learning)
np.random.seed(42)
batch = np.random.randn(1000, 64) # 1000 samples, 64 features
activations = relu(batch)
sparsity = (activations == 0).mean()
print(f"Sparsity of ReLU activations: {sparsity:.1%}") # ~50% neurons silent
print("Sparse activations → efficient representations (only relevant neurons fire)")
Leaky ReLU & PReLU — $f(z) = \max(\alpha z,\, z)$
Range: (−∞, ∞) | Zero-centred: Near | Saturates: No
Analogy: The same one-way valve, but with a tiny leak — negative signals aren't completely blocked; a small fraction ($\alpha$) trickles through. This keeps the gradient non-zero everywhere, preventing neurons from dying permanently.
- Leaky ReLU: $\alpha$ is a fixed hyperparameter (typically 0.01 or 0.1).
- PReLU (Parametric ReLU): $\alpha$ is learned during backpropagation — one $\alpha$ per channel. Microsoft Research used PReLU to surpass human-level accuracy on ImageNet classification in 2015.
import numpy as np
import matplotlib.pyplot as plt
def leaky_relu(z, alpha=0.01):
"""Small fixed slope alpha for negative inputs."""
return np.where(z > 0, z, alpha * z)
def leaky_relu_derivative(z, alpha=0.01):
return np.where(z > 0, 1.0, alpha)
class PReLU:
"""Parametric ReLU — alpha is a learnable parameter."""
def __init__(self, init_alpha=0.25):
self.alpha = init_alpha # updated by optimizer during training
def forward(self, z):
return np.where(z > 0, z, self.alpha * z)
def backward(self, z, grad_output):
grad_z = np.where(z > 0, grad_output, self.alpha * grad_output)
grad_alpha = np.sum(np.where(z > 0, 0, grad_output * z)) # dL/d_alpha
return grad_z, grad_alpha
z = np.linspace(-4, 4, 300)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
alpha_values = [0.01, 0.1, 0.3]
for alpha in alpha_values:
axes[0].plot(z, leaky_relu(z, alpha), linewidth=2, label=f'alpha={alpha}')
axes[0].set_title('Leaky ReLU (various alpha)'); axes[0].legend(); axes[0].grid(alpha=0.3)
prelu = PReLU(init_alpha=0.25)
axes[1].plot(z, prelu.forward(z), linewidth=2.5, color='purple', label='PReLU (alpha=0.25 learned)')
axes[1].set_title('PReLU — learnable slope'); axes[1].legend(); axes[1].grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Quick comparison: what fraction of activations are zeroed out (dying-neuron risk)?
np.random.seed(0)
z_batch = np.random.randn(10000) - 1.5 # biased toward negative (simulates bad init)
relu_dead = (np.maximum(0, z_batch) == 0).mean()
leaky_dead = (leaky_relu(z_batch, 0.01) == 0).mean()
print(f"ReLU dead neurons: {relu_dead:.1%}")
print(f"Leaky ReLU dead neurons: {leaky_dead:.1%}")
# ReLU dead neurons: ~93% ← catastrophic
# Leaky ReLU dead neurons: 0.0% ← none, leak keeps gradient alive
ELU & SELU — smooth negative region
ELU Range: (−α, ∞) | Zero-centred: Approximately | Saturates: Only below at −α
Analogy (ELU): A soft-edged valve — instead of a hard kink at zero, the negative side curves smoothly to a floor of $-\alpha$. This reduces the bias shift that Leaky ReLU can cause and pushes the mean activation closer to zero, speeding up convergence.
$$\text{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha (e^z - 1) & z \leq 0 \end{cases}$$
SELU (Scaled ELU) adds two carefully chosen constants $\lambda$ and $\alpha$ (derived mathematically) so that activations are self-normalizing — the mean and variance of each layer's output are automatically kept near 0 and 1 respectively, without BatchNorm. This is the activation of choice for fully-connected deep networks where you want to avoid adding normalisation layers.
$$\text{SELU}(z) = \lambda \begin{cases} z & z > 0 \\ \alpha (e^z - 1) & z \leq 0 \end{cases} \quad \lambda \approx 1.0507,\ \alpha \approx 1.6733$$
import numpy as np
import matplotlib.pyplot as plt
def elu(z, alpha=1.0):
"""Smooth negative region, mean activation closer to zero."""
return np.where(z > 0, z, alpha * (np.exp(z) - 1))
# SELU constants are derived theoretically (Klambauer et al., 2017)
SELU_LAMBDA = 1.0507009873554804934193349852946
SELU_ALPHA = 1.6732632423543772848170429916717
def selu(z):
"""Self-normalizing ELU — keeps layer outputs near N(0,1)."""
return SELU_LAMBDA * np.where(z > 0, z, SELU_ALPHA * (np.exp(z) - 1))
z = np.linspace(-4, 4, 300)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for alpha in [0.5, 1.0, 2.0]:
axes[0].plot(z, elu(z, alpha), linewidth=2, label=f'ELU alpha={alpha}')
axes[0].set_title('ELU (various alpha)'); axes[0].legend(); axes[0].grid(alpha=0.3)
axes[1].plot(z, selu(z), linewidth=2.5, color='green', label='SELU')
axes[1].set_title('SELU — self-normalising'); axes[1].legend(); axes[1].grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Verify SELU self-normalisation property
np.random.seed(42)
x = np.random.randn(10000) # standard normal input
y = selu(x)
print(f"Input: mean={x.mean():.4f}, std={x.std():.4f}")
print(f"SELU: mean={y.mean():.4f}, std={y.std():.4f}")
# SELU: mean~-0.08, std~0.96 — close to N(0,1) after one pass
Caveat: SELU’s self-normalising guarantee only holds when the weights use LeCun normal initialisation (e.g. torch.nn.init.lecun_normal_) and the network uses AlphaDropout instead of standard dropout. Mixing SELU with BatchNorm or incorrect initialisation breaks the theoretical guarantee.
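To make this concrete, a quick sketch (width 128 and depth 10 are arbitrary choices, not from the original paper) checking that SELU plus LeCun normal initialisation keeps activations near mean 0 and standard deviation 1 across several layers:

```python
import numpy as np

SELU_LAMBDA, SELU_ALPHA = 1.0507009873554805, 1.6732632423543772

def selu(z):
    return SELU_LAMBDA * np.where(z > 0, z, SELU_ALPHA * (np.exp(z) - 1))

# Push a batch through 10 dense layers with LeCun normal initialisation
# (weights drawn with std = 1/sqrt(fan_in)) and watch mean/std stay near 0/1.
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 128))
for _ in range(10):
    W = rng.standard_normal((128, 128)) / np.sqrt(128)  # LeCun normal
    x = selu(x @ W)
print(f"after 10 SELU layers: mean={x.mean():+.3f}, std={x.std():.3f}")
# Values should stay close to 0 and 1; other init scales let them drift.
```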
GELU — Gaussian Error Linear Unit (Transformers)
Range: (−0.17, ∞) | Zero-centred: Approximately | Saturates: No
$$\text{GELU}(z) = z \cdot \Phi(z) \approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(z + 0.044715\,z^3\right)\right]\right)$$
Analogy: A probabilistic gate — instead of a hard or leaky switch, GELU asks "what is the probability that a Gaussian random variable is less than $z$?" and gates the signal by that probability. A neuron with a small positive input still passes some signal (proportional to its statistical likelihood of being signal rather than noise), while a neuron with a large positive input passes almost all of it.
GELU Powers BERT, GPT, and Most Modern LLMs
Google's BERT (2018) and OpenAI's GPT series both use GELU in their feed-forward sublayers. The smooth probabilistic gating empirically outperforms ReLU on NLP tasks, and GELU is now also the default in many vision transformers (ViT). In practice the tanh approximation above is often used for computational efficiency; PyTorch exposes both forms via F.gelu(x) and F.gelu(x, approximate='tanh').
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf # or use tanh approximation
def gelu_exact(z):
"""Exact GELU via the Gaussian CDF (requires scipy)."""
return z * 0.5 * (1.0 + erf(z / np.sqrt(2)))
def gelu_approx(z):
"""Tanh approximation used in PyTorch (matches torch.nn.functional.gelu)."""
return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(9, 4))
plt.plot(z, gelu_exact(z), linewidth=2.5, label='GELU (exact)')
plt.plot(z, gelu_approx(z), linewidth=2, linestyle='--', label='GELU (tanh approx)')
# Compare against ReLU
plt.plot(z, np.maximum(0, z), linewidth=1.5, linestyle=':', color='gray', label='ReLU')
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('GELU vs ReLU'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Key property: GELU can suppress small positive values (probabilistic gating)
for z_val in [-2, -0.5, 0, 0.5, 2]:
print(f"z={z_val:+.1f} GELU={gelu_exact(np.array([z_val]))[0]:+.4f} "
f"ReLU={max(0, z_val):.4f}")
# z=-0.5 GELU=-0.1543 ReLU=0.0000 ← GELU leaks slight negative signal
# z=+0.5 GELU=+0.3457 ReLU=0.5000 ← GELU gates down small positives
Swish / SiLU & Mish — smooth self-gated units
Swish / SiLU Range: (−0.28, ∞) | Zero-centred: Approximately | Saturates: No
$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1+e^{-z}} \qquad \text{Mish}(z) = z \cdot \tanh(\text{Softplus}(z)) = z \cdot \tanh(\ln(1+e^z))$$
Swish analogy: A self-gating highway — the input multiplies its own sigmoid, so large positive inputs pass through cleanly while small or negative inputs are smoothly attenuated. Discovered by Google Brain (2017) via neural architecture search, Swish outperforms ReLU on deep networks for image classification and language tasks.
Mish analogy: A smoother, slightly more aggressive version of Swish — it uses tanh(softplus(z)) as the gate, giving it a marginally richer gradient landscape. YOLOv4 and some recent object detection models prefer Mish because its slightly negative dip for small inputs helps prevent dead neurons. In practice, Swish and Mish are nearly interchangeable.
SiLU (Sigmoid Linear Unit) is identical to Swish with no learnable parameter. PyTorch calls it torch.nn.SiLU(); it is the default activation in EfficientNet and many diffusion model U-Nets.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
def swish(z, beta=1.0):
"""Swish / SiLU (beta=1). Also written as z * sigmoid(z)."""
return z * sigmoid(beta * z)
def softplus(z):
"""Smooth approximation to ReLU: log(1 + e^z), computed stably."""
return np.logaddexp(0.0, z) # stable for large positive and negative z
def mish(z):
"""Mish = z * tanh(softplus(z))."""
return z * np.tanh(softplus(z))
z = np.linspace(-4, 6, 300)
plt.figure(figsize=(10, 4))
plt.plot(z, swish(z), linewidth=2.5, label='Swish / SiLU')
plt.plot(z, mish(z), linewidth=2, linestyle='--', label='Mish')
plt.plot(z, np.maximum(0, z), linewidth=1.5, linestyle=':', color='gray', label='ReLU (reference)')
plt.axhline(0, color='k', linewidth=0.5)
plt.title('Swish / SiLU vs Mish vs ReLU'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Gradient comparison (Swish has a richer gradient landscape than ReLU)
def swish_derivative(z):
sig = sigmoid(z)
return sig + z * sig * (1 - sig)
print("Swish gradient at key points:")
for z_val in [-2, -0.5, 0, 1, 3]:
z_arr = np.array([z_val])
print(f" z={z_val:+.1f} swish'={swish_derivative(z_arr)[0]:+.4f}")
Softmax & Softplus — probability outputs and smooth approximation
Softmax is not used in hidden layers — it is exclusively an output-layer activation for multi-class classification. It converts a vector of raw scores (logits) into a valid probability distribution: all outputs are positive and sum to exactly 1.
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Analogy (Softmax): A tournament voting system — each contestant (class) gets votes proportional to the exponent of their score. The winner takes a large share, but every contestant gets at least some non-zero fraction of the total vote. This is why softmax is paired with cross-entropy loss in every multi-class classifier.
Softplus $= \ln(1 + e^z)$ is a smooth, everywhere-differentiable approximation of ReLU. It is mainly used in variational autoencoders (to produce positive variance predictions) and in models where smoothness at zero matters more than computational speed.
import numpy as np
import matplotlib.pyplot as plt
def softmax(z):
"""
Numerically stable softmax: subtract max before exp to prevent overflow.
Input: z — 1-D array of logits (raw scores)
Output: probability distribution summing to 1.0
"""
e = np.exp(z - z.max()) # subtract max for numerical stability
return e / e.sum()
def softplus(z):
"""Smooth, always-positive approximation to ReLU."""
return np.logaddexp(0.0, z) # numerically stable for any z
# Softmax demo: 5-class classifier (e.g., digit recognition 0–4)
logits = np.array([2.0, 1.0, 0.1, -0.5, 3.5]) # raw network outputs
probs = softmax(logits)
print("Logits → Softmax probabilities:")
for i, (logit, prob) in enumerate(zip(logits, probs)):
bar = '█' * int(prob * 40)
print(f" Class {i}: logit={logit:+.1f} P={prob:.3f} {bar}")
print(f" Sum = {probs.sum():.6f}")
# Softplus vs ReLU comparison
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, softplus(z), linewidth=2.5, label='Softplus')
plt.plot(z, np.maximum(0, z), linewidth=2, linestyle='--', color='gray', label='ReLU')
plt.axhline(0, color='k', linewidth=0.5)
plt.title('Softplus vs ReLU — Softplus is smooth at the kink')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
Framework note: standard cross-entropy losses (PyTorch's nn.CrossEntropyLoss, TensorFlow's tf.keras.losses.CategoricalCrossentropy(from_logits=True)) apply softmax internally during loss calculation. Pass raw logits, not the softmax output, to the loss function for better numerical stability.
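A small demonstration of why raw logits are preferred (the extreme logit values are contrived to trigger underflow):

```python
import numpy as np

# Why frameworks want raw logits: log(softmax(z)) computed in two steps can
# underflow to log(0) = -inf, while the fused log-softmax form
#   z - max(z) - log(sum(exp(z - max(z))))
# stays finite for the same inputs.
def log_softmax(z):
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([1000.0, 0.0, -1000.0])   # extreme but perfectly legal logits
probs = np.exp(logits - logits.max())
probs /= probs.sum()
with np.errstate(divide='ignore'):
    naive = np.log(probs)                    # two-step: softmax, then log
print("naive log(softmax):", naive)          # contains -inf (underflow)
print("fused log-softmax :", log_softmax(logits))  # all entries finite
```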
Choosing the Right Activation Function
With a dozen options available, the choice comes down to three questions: where in the network is the activation, what is the task, and how deep is the network?
| Situation | Recommended | Why |
|---|---|---|
| Default hidden layers (MLP/CNN) | ReLU | Fast, sparse, no vanishing gradient on positive side |
| Seeing dying neurons? | Leaky ReLU | Non-zero gradient everywhere; $\alpha$=0.01–0.1 |
| Deep MLP without BatchNorm | SELU | Self-normalises; use LeCun init + AlphaDropout |
| Transformer feed-forward layer | GELU | Used in BERT, GPT; probabilistic gating suits attention outputs |
| EfficientNet / diffusion U-Net | Swish / SiLU | Smooth, self-gated; matches neural architecture search results |
| Object detection (YOLO, etc.) | Mish | Slightly richer gradient than Swish for localisation tasks |
| LSTM / GRU hidden state | Tanh | Zero-centred, bounded — prevents state explosion over time |
| Binary classification output | Sigmoid | Outputs P(positive class) in (0,1) |
| Multi-class classification output | Softmax | Converts logits to valid probability distribution |
| Regression output | None (linear) | Unconstrained real-valued prediction |
| VAE variance head / positive output | Softplus | Smooth, always positive — models variance without clamping |
import numpy as np
# All activations in one self-contained reference block
def sigmoid(z): return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def tanh(z): return np.tanh(z)
def relu(z): return np.maximum(0, z)
def leaky_relu(z, alpha=0.01): return np.where(z > 0, z, alpha * z)
def elu(z, alpha=1.0): return np.where(z > 0, z, alpha * (np.exp(z) - 1))
SELU_L, SELU_A = 1.0507009873554805, 1.6732632423543772
def selu(z): return SELU_L * np.where(z > 0, z, SELU_A * (np.exp(z) - 1))
def gelu(z): return 0.5*z*(1+np.tanh(np.sqrt(2/np.pi)*(z+0.044715*z**3)))
def swish(z): return z * sigmoid(z)
def softplus(z): return np.logaddexp(0.0, z)
def mish(z): return z * np.tanh(softplus(z))
def softmax(z): e=np.exp(z-z.max()); return e/e.sum()
# Side-by-side evaluation at representative inputs
activations = {
'Sigmoid': sigmoid, 'Tanh': tanh,
'ReLU': relu, 'Leaky ReLU': leaky_relu,
'ELU': elu, 'SELU': selu,
'GELU': gelu, 'Swish/SiLU': swish,
'Mish': mish, 'Softplus': softplus,
}
test_values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(f"{'Function':<14} ", end='')
for v in test_values: print(f"z={v:+.0f} ", end='')
print()
print('-' * 75)
for name, fn in activations.items():
vals = fn(test_values)
print(f"{name:<14} ", end='')
for v in vals: print(f"{v:+7.3f} ", end='')
print()
Why Non-Linearity Matters
This is perhaps the most critical insight in neural network theory: without non-linear activations, no matter how many layers you stack, the entire network is equivalent to a single linear transformation. Let’s prove this mathematically and empirically.
The Linear Collapse Proof
If we have two layers without activation: $y = W_2 (W_1 x) = (W_2 W_1) x = W_{combined} x$. The composition of linear functions is still linear. Let’s demonstrate this with code:
import numpy as np
# PROOF: Without activations, N layers = 1 layer (linear collapse)
np.random.seed(42)
# Create 5 random weight matrices (simulating 5 layers)
input_dim = 4
hidden_dim = 8
output_dim = 2
W1 = np.random.randn(hidden_dim, input_dim)
W2 = np.random.randn(hidden_dim, hidden_dim)
W3 = np.random.randn(hidden_dim, hidden_dim)
W4 = np.random.randn(hidden_dim, hidden_dim)
W5 = np.random.randn(output_dim, hidden_dim)
# Forward pass through 5 layers WITHOUT activation
x = np.random.randn(input_dim)
layer1 = W1 @ x
layer2 = W2 @ layer1
layer3 = W3 @ layer2
layer4 = W4 @ layer3
output_5layers = W5 @ layer4
# The entire 5-layer network is just ONE matrix multiplication
W_combined = W5 @ W4 @ W3 @ W2 @ W1
output_1layer = W_combined @ x
print("=== Linear Collapse Demonstration ===")
print(f"\n5-layer output: {output_5layers}")
print(f"1-layer output: {output_1layer}")
print(f"\nDifference: {np.abs(output_5layers - output_1layer).max():.2e}")
print("(Effectively zero - just floating point noise)")
print("\n--- Conclusion ---")
print(f"5 layers ({input_dim}x{hidden_dim}x{hidden_dim}x{hidden_dim}x{hidden_dim}x{output_dim})")
print(f" = 1 layer ({input_dim}x{output_dim})")
print(f"\nW_combined shape: {W_combined.shape}")
print("Without activation, depth is MEANINGLESS!")
print("You cannot learn XOR, circles, or any non-linear pattern.")
Fixing XOR with Non-Linearity
The XOR problem (from Part 1) cannot be solved by any single linear boundary. But a network with just one hidden layer and ReLU activation can solve it perfectly:
import numpy as np
import matplotlib.pyplot as plt
# XOR problem: no single line can separate these points
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
# Manual solution: 2-layer network with ReLU
# Hidden layer: 2 neurons with hand-crafted weights
W1 = np.array([[1, 1], # neuron 1: detects (x1 + x2)
[1, 1]]) # neuron 2: same
b1 = np.array([0, -1]) # neuron 1 threshold: 0, neuron 2: -1
# Output layer: 1 neuron
W2 = np.array([[1, -2]]) # combines hidden outputs
b2 = np.array([0])
def relu(z):
return np.maximum(0, z)
print("=== Solving XOR with ReLU Network ===")
print(f"Architecture: 2 inputs -> 2 hidden (ReLU) -> 1 output")
print(f"\nW1 = {W1.tolist()}, b1 = {b1.tolist()}")
print(f"W2 = {W2.tolist()}, b2 = {b2.tolist()}")
print("\n" + "-" * 45)
print(f"{'Input':<12} {'Hidden (pre)':<15} {'Hidden (ReLU)':<15} {'Output':<8}")
print("-" * 45)
for i in range(4):
x = X_xor[i]
z1 = W1 @ x + b1 # pre-activation
h1 = relu(z1) # after ReLU
output = W2 @ h1 + b2 # final output
print(f"{x} -> {z1} -> {h1} -> {output[0]:.1f} (target: {y_xor[i]})")
# Visualize the hidden space transformation
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Original XOR space
colors = ['red' if y == 0 else 'blue' for y in y_xor]
axes[0].scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200,
edgecolors='black', linewidth=2, zorder=5)
axes[0].set_xlabel('x1', fontsize=12)
axes[0].set_ylabel('x2', fontsize=12)
axes[0].set_title('Original Space (NOT linearly separable)', fontsize=13)
axes[0].set_xlim(-0.5, 1.5)
axes[0].set_ylim(-0.5, 1.5)
axes[0].grid(True, alpha=0.3)
# Transformed hidden space
hidden_outputs = []
for x in X_xor:
z1 = W1 @ x + b1
h1 = relu(z1)
hidden_outputs.append(h1)
hidden_outputs = np.array(hidden_outputs)
axes[1].scatter(hidden_outputs[:, 0], hidden_outputs[:, 1], c=colors, s=200,
edgecolors='black', linewidth=2, zorder=5)
# Draw separating line in hidden space
h_range = np.linspace(-0.5, 2.5, 100)
sep_line = 0.5 * h_range # W2 @ h = 0 boundary
axes[1].plot(h_range, sep_line, 'g--', linewidth=2, label='Decision Boundary')
axes[1].set_xlabel('Hidden Neuron 1 Output', fontsize=12)
axes[1].set_ylabel('Hidden Neuron 2 Output', fontsize=12)
axes[1].set_title('Hidden Space (NOW linearly separable!)', fontsize=13)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('xor_nonlinearity.png', dpi=100, bbox_inches='tight')
plt.show()
print("\nReLU transforms the space so XOR becomes linearly separable!")
Layers: Input, Hidden, Output
Neural networks are organized into layers, each performing a specific transformation. Understanding layer types is essential for designing architectures:
- Input Layer: Not a computation layer — simply passes raw features to the network. Size equals number of input features.
- Hidden Layers: Where the learning happens. Each neuron computes a weighted sum + activation. Multiple hidden layers enable hierarchical feature learning.
- Output Layer: Produces the final prediction. Activation depends on the task (sigmoid for binary classification, softmax for multi-class, linear for regression).
flowchart LR
subgraph Input["Input Layer (2)"]
i1["x1"]
i2["x2"]
end
subgraph Hidden1["Hidden Layer 1 (3)"]
h1["h1"]
h2["h2"]
h3["h3"]
end
subgraph Hidden2["Hidden Layer 2 (2)"]
h4["h4"]
h5["h5"]
end
subgraph Output["Output Layer (1)"]
o1["y"]
end
i1 --> h1
i1 --> h2
i1 --> h3
i2 --> h1
i2 --> h2
i2 --> h3
h1 --> h4
h1 --> h5
h2 --> h4
h2 --> h5
h3 --> h4
h3 --> h5
h4 --> o1
h5 --> o1
Implementing a Dense Layer
Each dense (fully-connected) layer stores a weight matrix $\mathbf{W}$ of shape (outputs, inputs) and a bias vector $\mathbf{b}$ of shape (outputs,). The forward pass computes $\mathbf{a} = f(\mathbf{W} \cdot \mathbf{x} + \mathbf{b})$:
import numpy as np
class DenseLayer:
"""A fully-connected (dense) neural network layer."""
def __init__(self, n_inputs, n_neurons, activation='relu'):
# He initialization (std = sqrt(2/n_inputs)) — suits ReLU layers
scale = np.sqrt(2.0 / n_inputs)
self.weights = np.random.randn(n_neurons, n_inputs) * scale
self.biases = np.zeros(n_neurons)
self.activation = activation
# Store for later (backprop will need these)
self.last_input = None
self.last_z = None
self.last_output = None
def _activate(self, z):
"""Apply activation function."""
if self.activation == 'relu':
return np.maximum(0, z)
elif self.activation == 'sigmoid':
return 1.0 / (1.0 + np.exp(-z))
elif self.activation == 'tanh':
return np.tanh(z)
elif self.activation == 'linear':
return z
else:
raise ValueError(f"Unknown activation: {self.activation}")
def forward(self, x):
"""Forward pass: output = activation(W @ x + b)."""
self.last_input = x
self.last_z = self.weights @ x + self.biases
self.last_output = self._activate(self.last_z)
return self.last_output
def __repr__(self):
return (f"DenseLayer(inputs={self.weights.shape[1]}, "
f"neurons={self.weights.shape[0]}, "
f"activation='{self.activation}')")
# Create layers and demonstrate
np.random.seed(42)
layer1 = DenseLayer(n_inputs=3, n_neurons=4, activation='relu')
layer2 = DenseLayer(n_inputs=4, n_neurons=2, activation='sigmoid')
print("=== Layer Architecture ===")
print(f"Layer 1: {layer1}")
print(f" Weights shape: {layer1.weights.shape}")
print(f" Biases shape: {layer1.biases.shape}")
print(f" Total params: {layer1.weights.size + layer1.biases.size}")
print(f"\nLayer 2: {layer2}")
print(f" Weights shape: {layer2.weights.shape}")
print(f" Biases shape: {layer2.biases.shape}")
print(f" Total params: {layer2.weights.size + layer2.biases.size}")
# Forward pass through both layers
x = np.array([1.0, 0.5, -0.3])
h = layer1.forward(x)
y = layer2.forward(h)
print(f"\n=== Forward Pass ===")
print(f"Input: {x}")
print(f"After Layer 1: {h} (ReLU applied)")
print(f"After Layer 2: {y} (Sigmoid applied)")
print(f"\nTotal network parameters: "
f"{layer1.weights.size + layer1.biases.size + layer2.weights.size + layer2.biases.size}")
Forward Propagation Step-by-Step
Forward propagation is the process of passing input through every layer of the network to produce an output. In matrix notation, for layer $l$:
$$\mathbf{a}^{[l]} = g\left(\mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\right)$$

where $\mathbf{a}^{[0]} = \mathbf{x}$ (the input), $\mathbf{W}^{[l]}$ is the weight matrix of layer $l$, $\mathbf{b}^{[l]}$ is the bias vector, and $g$ is the activation function.
Step-by-Step with Intermediate Values
Let’s trace through a complete forward pass, printing every intermediate computation:
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)
# Define a small network: 3 inputs -> 4 hidden (ReLU) -> 2 output (sigmoid)
np.random.seed(123)
# Layer 1 parameters
W1 = np.random.randn(4, 3) * 0.5 # 4 neurons, 3 inputs each
b1 = np.zeros(4)
# Layer 2 parameters
W2 = np.random.randn(2, 4) * 0.5 # 2 neurons, 4 inputs each
b2 = np.zeros(2)
# Input
x = np.array([1.0, 0.5, -0.8])
print("=" * 60)
print("FORWARD PROPAGATION: Step-by-Step Trace")
print("=" * 60)
print(f"\nNetwork: 3 -> 4 (ReLU) -> 2 (Sigmoid)")
print(f"\nInput x = {x}")
# Step 1: Linear transformation in layer 1
z1 = W1 @ x + b1
print(f"\n--- Layer 1 (Hidden) ---")
print(f"W1 shape: {W1.shape}")
print(f"z1 = W1 @ x + b1")
print(f"z1 = {z1}")
# Step 2: Apply ReLU activation
a1 = relu(z1)
print(f"\na1 = ReLU(z1)")
print(f"a1 = {a1}")
print(f"(Negative values zeroed out by ReLU)")
# Step 3: Linear transformation in layer 2
z2 = W2 @ a1 + b2
print(f"\n--- Layer 2 (Output) ---")
print(f"W2 shape: {W2.shape}")
print(f"z2 = W2 @ a1 + b2")
print(f"z2 = {z2}")
# Step 4: Apply Sigmoid activation
a2 = sigmoid(z2)
print(f"\na2 = Sigmoid(z2)")
print(f"a2 = {a2}")
print(f"\n{'=' * 60}")
print(f"FINAL OUTPUT: {a2}")
print(f"{'=' * 60}")
# Verify dimensions at each step
print(f"\n--- Dimension Check ---")
print(f"x: {x.shape} (input)")
print(f"W1: {W1.shape} (layer 1 weights)")
print(f"z1: {z1.shape} (pre-activation 1)")
print(f"a1: {a1.shape} (post-activation 1)")
print(f"W2: {W2.shape} (layer 2 weights)")
print(f"z2: {z2.shape} (pre-activation 2)")
print(f"a2: {a2.shape} (final output)")
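The trace above processes one sample at a time. In practice, inputs are batched: stack samples as rows of a matrix `X` and compute `X @ W.T + b`, which applies the same layer to every row at once. A sketch with the same seed and weight shapes as the trace, so the first row should reproduce the single-sample output:

```python
import numpy as np

np.random.seed(123)
W1 = np.random.randn(4, 3) * 0.5
b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.5
b2 = np.zeros(2)

# Four samples stacked as rows: shape (4, 3); row 0 is the input from the trace
X = np.array([[1.0, 0.5, -0.8],
              [0.2, -1.0, 0.3],
              [0.0, 0.0, 0.0],
              [-0.5, 0.7, 1.2]])

Z1 = X @ W1.T + b1              # (4, 3) @ (3, 4) -> (4, 4)
A1 = np.maximum(0, Z1)          # ReLU, elementwise on the whole batch
Z2 = A1 @ W2.T + b2             # (4, 4) @ (4, 2) -> (4, 2)
A2 = 1.0 / (1.0 + np.exp(-Z2))  # sigmoid

print(A2.shape)                 # one 2-element output per sample
```

This is why the weight matrix is stored as (outputs, inputs): transposing it lets a whole batch flow through in one matrix multiply instead of a Python loop.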
Putting It Together: Multi-Layer Perceptron
Now let’s combine everything into a complete, configurable MLP class. This is the foundation we’ll build upon for training (backpropagation) in Part 3:
import numpy as np
class MLP:
    """Multi-Layer Perceptron with configurable architecture."""

    def __init__(self, layer_sizes, activations=None):
        """
        Args:
            layer_sizes: list of ints, e.g. [3, 4, 4, 2]
                (input=3, two hidden=4, output=2)
            activations: list of activation names for each layer
                (length = len(layer_sizes) - 1)
        """
        self.n_layers = len(layer_sizes) - 1
        # Default: ReLU for hidden, sigmoid for output
        if activations is None:
            activations = ['relu'] * (self.n_layers - 1) + ['sigmoid']
        self.activations = activations
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        for i in range(self.n_layers):
            n_in = layer_sizes[i]
            n_out = layer_sizes[i + 1]
            # Xavier/He initialization
            if activations[i] == 'relu':
                scale = np.sqrt(2.0 / n_in)  # He init for ReLU
            else:
                scale = np.sqrt(1.0 / n_in)  # Xavier init
            W = np.random.randn(n_out, n_in) * scale
            b = np.zeros(n_out)
            self.weights.append(W)
            self.biases.append(b)
        # Storage for forward pass values (needed for backprop)
        self.layer_inputs = []
        self.layer_outputs = []

    def _activate(self, z, activation):
        """Apply activation function."""
        if activation == 'relu':
            return np.maximum(0, z)
        elif activation == 'sigmoid':
            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        elif activation == 'tanh':
            return np.tanh(z)
        elif activation == 'linear':
            return z
        elif activation == 'leaky_relu':
            return np.where(z > 0, z, 0.01 * z)
        else:
            raise ValueError(f"Unknown activation: {activation}")

    def forward(self, x):
        """Forward pass through all layers."""
        self.layer_inputs = [x]
        self.layer_outputs = [x]
        a = x
        for i in range(self.n_layers):
            z = self.weights[i] @ a + self.biases[i]
            a = self._activate(z, self.activations[i])
            self.layer_inputs.append(z)
            self.layer_outputs.append(a)
        return a

    def predict(self, X):
        """Predict for multiple samples (batch)."""
        return np.array([self.forward(x) for x in X])

    def count_params(self):
        """Count total trainable parameters."""
        total = 0
        for W, b in zip(self.weights, self.biases):
            total += W.size + b.size
        return total

    def summary(self):
        """Print network architecture summary."""
        print("=" * 50)
        print("MLP Architecture Summary")
        print("=" * 50)
        for i in range(self.n_layers):
            n_in = self.weights[i].shape[1]
            n_out = self.weights[i].shape[0]
            params = self.weights[i].size + self.biases[i].size
            print(f"Layer {i+1}: {n_in} -> {n_out} "
                  f"({self.activations[i]}) "
                  f"| params: {params}")
        print("-" * 50)
        print(f"Total parameters: {self.count_params()}")
        print("=" * 50)
# Create and test the MLP
np.random.seed(42)
# Architecture: 2 inputs -> 4 hidden -> 3 hidden -> 1 output
mlp = MLP(
    layer_sizes=[2, 4, 3, 1],
    activations=['relu', 'relu', 'sigmoid']
)
mlp.summary()
# Test on XOR-like data
X_test = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print("\n=== Forward Pass on XOR Data ===")
for x in X_test:
    output = mlp.forward(x)
    print(f"Input: {x} -> Output: {output[0]:.4f}")
print("\nNote: Outputs are random (untrained network).")
print("After training (Part 3), this will solve XOR correctly!")
Testing the MLP on Simple Data
Let’s test our MLP on a slightly more complex dataset to verify it can produce different outputs for different inputs (even before training):
import numpy as np
import matplotlib.pyplot as plt
class MLP:
    """Multi-Layer Perceptron (repeated for independence)."""

    def __init__(self, layer_sizes, activations=None):
        self.n_layers = len(layer_sizes) - 1
        if activations is None:
            activations = ['relu'] * (self.n_layers - 1) + ['sigmoid']
        self.activations = activations
        self.weights = []
        self.biases = []
        for i in range(self.n_layers):
            n_in = layer_sizes[i]
            n_out = layer_sizes[i + 1]
            scale = np.sqrt(2.0 / n_in) if activations[i] == 'relu' else np.sqrt(1.0 / n_in)
            self.weights.append(np.random.randn(n_out, n_in) * scale)
            self.biases.append(np.zeros(n_out))

    def _activate(self, z, activation):
        if activation == 'relu':
            return np.maximum(0, z)
        elif activation == 'sigmoid':
            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        elif activation == 'tanh':
            return np.tanh(z)
        return z  # linear fallback

    def forward(self, x):
        a = x
        for i in range(self.n_layers):
            z = self.weights[i] @ a + self.biases[i]
            a = self._activate(z, self.activations[i])
        return a
# Create an MLP and visualize its decision landscape
np.random.seed(7)
mlp = MLP(layer_sizes=[2, 8, 4, 1], activations=['relu', 'relu', 'sigmoid'])
# Generate a grid of points
x_range = np.linspace(-3, 3, 80)
y_range = np.linspace(-3, 3, 80)
X_grid, Y_grid = np.meshgrid(x_range, y_range)
Z_grid = np.zeros_like(X_grid)
for i in range(80):
    for j in range(80):
        point = np.array([X_grid[i, j], Y_grid[i, j]])
        Z_grid[i, j] = mlp.forward(point)[0]
# Plot the untrained network decision landscape
plt.figure(figsize=(8, 6))
plt.contourf(X_grid, Y_grid, Z_grid, levels=20, cmap='RdYlBu_r')
plt.colorbar(label='Network Output (Sigmoid)')
plt.xlabel('x1', fontsize=12)
plt.ylabel('x2', fontsize=12)
plt.title('MLP Decision Landscape (Untrained, Random Weights)', fontsize=14)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.savefig('mlp_decision_landscape.png', dpi=100, bbox_inches='tight')
plt.show()
print("=== Untrained MLP Decision Landscape ===")
print("Even with random weights, the network creates")
print("NON-LINEAR boundaries thanks to ReLU activations.")
print(f"\nNetwork: 2 -> 8 -> 4 -> 1")
print(f"Total parameters: {sum(w.size for w in mlp.weights) + sum(b.size for b in mlp.biases)}")
print(f"Output range: [{Z_grid.min():.4f}, {Z_grid.max():.4f}]")
print("\nThe piecewise-linear regions are ReLU boundaries.")
print("Training will adjust these to match the target function.")
Exercise: Depth vs. Width
Create three MLPs with comparable parameter budgets but different shapes:
- Wide and shallow: [2, 32, 1] (129 params)
- Narrow and deep: [2, 4, 4, 4, 4, 1] (77 params)
- Balanced: [2, 8, 8, 1] (105 params)
Plot their decision landscapes. Which creates the most complex boundaries? This foreshadows the depth vs. width debate in deep learning.
What’s Next
We’ve built all the forward-pass machinery: neurons, weights, biases, activations, layers, and a complete MLP class. But our network produces random outputs — it doesn’t learn yet. In Part 3, we’ll teach it to learn by implementing:
- Loss Functions: Measuring how wrong the network is (MSE, cross-entropy)
- Gradient Descent: Adjusting weights to reduce error
- Backpropagation: The chain rule applied layer by layer to compute gradients
- Training Loop: Iterating forward → loss → backward → update
Next in the Series
In Part 3: How Neural Networks Learn, we’ll implement loss functions, gradient descent, and the backpropagation algorithm from scratch — transforming our MLP from a random function into an intelligent learning system.