The Artificial Neuron
The artificial neuron is the fundamental unit of every neural network. Inspired by biological neurons, it takes multiple inputs, multiplies each by a weight, sums the results, adds a bias, and passes the result through an activation function to produce an output.
Think of it as a tiny decision-maker: it gathers evidence (weighted inputs), shifts its firing threshold (bias), and decides how strongly to “fire” (activation). Let’s visualize this:
flowchart LR
x1["x1"] -->|"w1"| SUM["Sum + Bias"]
x2["x2"] -->|"w2"| SUM
x3["x3"] -->|"w3"| SUM
B["b"] --> SUM
SUM --> ACT["Activation f(z)"]
ACT --> Y["Output y"]
Implementing a Single Neuron
Let’s build a Neuron class from scratch and visualize how it responds to different inputs:
import numpy as np
import matplotlib.pyplot as plt
class Neuron:
"""A single artificial neuron with configurable activation."""
def __init__(self, n_inputs, activation='sigmoid'):
# Initialize weights randomly (small values)
self.weights = np.random.randn(n_inputs) * 0.5
self.bias = 0.0
self.activation = activation
def _activate(self, z):
"""Apply activation function."""
if self.activation == 'sigmoid':
return 1.0 / (1.0 + np.exp(-z))
elif self.activation == 'relu':
return np.maximum(0, z)
elif self.activation == 'tanh':
return np.tanh(z)
else:
return z # linear (no activation)
def forward(self, x):
"""Compute neuron output: f(w*x + b)."""
z = np.dot(self.weights, x) + self.bias
return self._activate(z)
# Create a neuron with 2 inputs
np.random.seed(42)
neuron = Neuron(n_inputs=2, activation='sigmoid')
print(f"Weights: {neuron.weights}")
print(f"Bias: {neuron.bias}")
# Test with sample inputs
inputs = np.array([0.5, -0.3])
output = neuron.forward(inputs)
print(f"\nInput: {inputs}")
print(f"Weighted sum: {np.dot(neuron.weights, inputs) + neuron.bias:.4f}")
print(f"Output (after sigmoid): {output:.4f}")
# Visualize neuron response across input space
x1_range = np.linspace(-3, 3, 100)
x2_range = np.linspace(-3, 3, 100)
X1, X2 = np.meshgrid(x1_range, x2_range)
Z = np.zeros_like(X1)
for i in range(100):
for j in range(100):
Z[i, j] = neuron.forward(np.array([X1[i, j], X2[i, j]]))
plt.figure(figsize=(8, 6))
plt.contourf(X1, X2, Z, levels=20, cmap='RdYlBu_r')
plt.colorbar(label='Neuron Output')
plt.xlabel('Input x1', fontsize=12)
plt.ylabel('Input x2', fontsize=12)
plt.title('Single Neuron Response Surface (Sigmoid)', fontsize=14)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.savefig('neuron_response.png', dpi=100, bbox_inches='tight')
plt.show()
print("\nThe neuron creates a LINEAR decision boundary")
print("(sigmoid squashes but the boundary is still a line).")
Modify the neuron’s weights to [2.0, -1.0] and bias to 0.5. How does the decision boundary change? Try switching the activation to 'relu' — notice how the response surface becomes piecewise linear instead of smooth.
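As a starting point for the exercise, here is a minimal standalone sketch (it re-derives the boundary algebraically rather than reusing the Neuron class) with the suggested weights and bias:

```python
import numpy as np

# Starting point for the exercise: the suggested weights [2.0, -1.0] and
# bias 0.5, computed inline rather than through the Neuron class.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([2.0, -1.0])
bias = 0.5

# The decision boundary (output = 0.5) is where 2*x1 - x2 + 0.5 = 0,
# i.e. the line x2 = 2*x1 + 0.5: steeper and shifted relative to the
# random initialisation.
for x in [np.array([0.0, 0.5]), np.array([1.0, 2.5]), np.array([1.0, 0.0])]:
    z = weights @ x + bias
    print(f"x={x}  z={z:+.2f}  sigmoid(z)={sigmoid(z):.3f}")
# The first two points lie exactly on the boundary (output 0.5);
# the third lies below the line and fires strongly (~0.92).
```

Points on the line x2 = 2·x1 + 0.5 give exactly 0.5; moving off the line in either direction pushes the output toward 0 or 1.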
Weights and Biases Explained
Weights and biases are the learnable parameters of a neural network. During training, the network adjusts these values to minimize error. But what exactly do they control?
What Weights Control: Strength and Direction
A weight $w_i$ controls how much influence input $x_i$ has on the output. Large positive weights amplify the input, large negative weights suppress it, and weights near zero ignore it. The weighted sum is:
$$z = \sum_{i=1}^{n} w_i x_i + b = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$
The Role of Bias: Shifting the Threshold
The bias $b$ shifts the activation threshold. Without bias, a neuron with sigmoid activation always outputs exactly 0.5 when all inputs are zero. Bias lets the neuron “pre-activate” (fire more readily) or require stronger evidence before firing.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
"""Sigmoid activation function."""
return 1.0 / (1.0 + np.exp(-z))
# Demonstrate effect of different weights (steepness)
z = np.linspace(-6, 6, 200)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: varying weights (steepness of sigmoid)
ax1 = axes[0]
weights = [0.5, 1.0, 2.0, 5.0]
for w in weights:
ax1.plot(z, sigmoid(w * z), linewidth=2, label=f'w = {w}')
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Input x', fontsize=12)
ax1.set_ylabel('Output sigmoid(w*x)', fontsize=12)
ax1.set_title('Effect of Weight: Controls Steepness', fontsize=13)
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)
# Right plot: varying bias (shift of sigmoid)
ax2 = axes[1]
biases = [-3, -1, 0, 1, 3]
for b in biases:
ax2.plot(z, sigmoid(z + b), linewidth=2, label=f'b = {b}')
ax2.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Input x', fontsize=12)
ax2.set_ylabel('Output sigmoid(x + b)', fontsize=12)
ax2.set_title('Effect of Bias: Shifts Threshold Left/Right', fontsize=13)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('weights_biases_effect.png', dpi=100, bbox_inches='tight')
plt.show()
print("=== Weight Effect ===")
print("Larger weight = steeper transition (more confident)")
print("Smaller weight = gradual transition (more uncertain)")
print("\n=== Bias Effect ===")
print("Positive bias = shifts threshold LEFT (fires more easily)")
print("Negative bias = shifts threshold RIGHT (harder to fire)")
print("Zero bias = transition centered at input = 0")
Activation Functions
Every neuron computes a weighted sum $z = \mathbf{w}^\top \mathbf{x} + b$ — a purely linear operation. Stacking linear operations always collapses to one linear transformation, no matter how many layers you add. Activation functions are the nonlinear gates that break this collapse, letting the network learn curves, boundaries, and concepts that no straight line can capture.
There are roughly a dozen activation functions in active use across modern ML. Below we cover every one that matters — formula, intuition, implementation, strengths, and exactly when to reach for it.
Sigmoid — $\sigma(z) = \dfrac{1}{1+e^{-z}}$
Range: (0, 1) | Zero-centred: No | Saturates: Yes
Analogy: A spam filter that turns raw email-weirdness scores into a clean probability — 0 = definitely not spam, 1 = definitely spam. Nothing is ever perfectly certain, so the output is always strictly between the extremes.
When Sigmoid Belongs in Your Network
Output layer of binary classifiers — logistic regression, medical diagnosis (disease/no disease), fraud detection. Sigmoid turns a raw score into an interpretable probability. It is almost never used in hidden layers of deep networks today because its derivative maxes out at 0.25 and vanishes toward zero for large $|z|$, so gradients shrink layer after layer during backpropagation (the vanishing gradient problem).
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
"""Squashes any real number to (0, 1)."""
return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500))) # clip prevents overflow
def sigmoid_derivative(z):
"""Max value is 0.25 at z=0 — shrinks fast for large |z|."""
s = sigmoid(z)
return s * (1.0 - s)
z = np.linspace(-6, 6, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, sigmoid(z), linewidth=2.5, label='sigmoid(z)')
plt.plot(z, sigmoid_derivative(z), linewidth=2, linestyle='--', label="sigmoid'(z) [max=0.25]")
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('Sigmoid and Its Derivative'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Practical demo: binary classification probability
raw_scores = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
probs = sigmoid(raw_scores)
for score, prob in zip(raw_scores, probs):
print(f"score={score:+.1f} → P(positive)={prob:.3f}")
# score=-3.0 → P(positive)=0.047
# score= 0.0 → P(positive)=0.500
# score=+3.0 → P(positive)=0.953
Tanh — $\tanh(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$
Range: (−1, 1) | Zero-centred: Yes | Saturates: Yes
Analogy: A sentiment meter that reads from −1 (strongly negative) through 0 (neutral) to +1 (strongly positive). Because the outputs are zero-centred, the inputs to the next layer average near zero; sigmoid’s always-positive outputs instead force all of a neuron’s weight gradients to share the same sign, causing zig-zag updates. Tanh avoids this and typically converges faster than sigmoid.
Where Tanh Still Wins in 2026
Tanh is the standard activation inside LSTM and GRU cells. It keeps hidden states in a bounded, zero-centred range, which prevents hidden state magnitudes from exploding over hundreds of timesteps. It is also used in some normalizing flows because it is invertible over its full domain.
import numpy as np
import matplotlib.pyplot as plt
def tanh(z):
"""Zero-centred, range (-1, 1)."""
return np.tanh(z)
def tanh_derivative(z):
"""Derivative: 1 - tanh(z)^2. Maximum 1.0 at z=0."""
return 1.0 - np.tanh(z) ** 2
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, tanh(z), linewidth=2.5, label='tanh(z)')
plt.plot(z, tanh_derivative(z), linewidth=2, linestyle='--', label="tanh'(z) [max=1.0]")
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('Tanh and Its Derivative'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# tanh vs sigmoid: derivative comparison
for z_val in [-3, -1, 0, 1, 3]:
sig = 1.0 / (1.0 + np.exp(-z_val))
sig_grad = sig * (1.0 - sig)
tanh_grad = tanh_derivative(z_val)
print(f"z={z_val:+d} sigmoid'={sig_grad:.4f} tanh'={tanh_grad:.4f}")
# z=0 sigmoid'=0.2500 tanh'=1.0000 ← tanh has 4× larger gradient at centre
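To see why tanh's boundedness matters over many timesteps, here is a toy recurrence sketch (not a real LSTM; the weight 1.5 and 200 steps are arbitrary illustrative values):

```python
import numpy as np

# Toy illustration (not a real LSTM): iterate the same recurrence 200 times
# with a weight greater than 1. The purely linear recurrence explodes, while
# wrapping each step in tanh keeps the state bounded inside (-1, 1).
w = 1.5
h_linear, h_tanh = 0.1, 0.1
for _ in range(200):
    h_linear = w * h_linear        # magnitude grows 1.5x per step
    h_tanh = np.tanh(w * h_tanh)   # squashed back into (-1, 1) every step

print(f"linear recurrence after 200 steps: {h_linear:.3e}")  # astronomically large
print(f"tanh recurrence after 200 steps:   {h_tanh:.4f}")    # stays bounded
```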
ReLU — $f(z) = \max(0, z)$
Range: [0, ∞) | Zero-centred: No | Saturates: No (positive side)
Analogy: A one-way valve in a water pipe — water flows freely in one direction (positive values pass through unchanged), but the valve blocks all backflow (negative values become zero). It is cheap to compute, never saturates on the positive side, and empirically trains deep networks faster than sigmoid or tanh.
Why ReLU Became the Default Hidden-Layer Activation
When deep learning took off with AlexNet (2012), ReLU was the key ingredient enabling 8 layers to train in days rather than months. Its gradient is always 1 for positive inputs (no shrinkage), so gradients flow cleanly to early layers. Most convolutional neural networks and MLPs today still use ReLU as the default unless there is a specific reason to switch. PyTorch's nn.ReLU() and TensorFlow's tf.keras.activations.relu both default to this.
import numpy as np
import matplotlib.pyplot as plt
def relu(z):
"""Passes positive values unchanged; clips negatives to zero."""
return np.maximum(0, z)
def relu_derivative(z):
"""1 for z>0, 0 for z<0, undefined at z=0 (convention: 0)."""
return (z > 0).astype(float)
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, relu(z), linewidth=2.5, label='relu(z)')
plt.plot(z, relu_derivative(z), linewidth=2, linestyle='--', label="relu'(z)")
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('ReLU and Its Derivative'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Demonstrate sparse activation (healthy for feature learning)
np.random.seed(42)
batch = np.random.randn(1000, 64) # 1000 samples, 64 features
activations = relu(batch)
sparsity = (activations == 0).mean()
print(f"Sparsity of ReLU activations: {sparsity:.1%}") # ~50% neurons silent
print("Sparse activations → efficient representations (only relevant neurons fire)")
Leaky ReLU & PReLU — $f(z) = \max(\alpha z,\, z)$
Range: (−∞, ∞) | Zero-centred: Near | Saturates: No
Analogy: The same one-way valve, but with a tiny leak — negative signals aren't completely blocked; a small fraction ($\alpha$) trickles through. This keeps the gradient non-zero everywhere, preventing neurons from dying permanently.
- Leaky ReLU: $\alpha$ is a fixed hyperparameter (typically 0.01 or 0.1).
- PReLU (Parametric ReLU): $\alpha$ is learned during backpropagation — one $\alpha$ per channel. Microsoft Research used PReLU to surpass human-level accuracy on ImageNet classification in 2015.
import numpy as np
import matplotlib.pyplot as plt
def leaky_relu(z, alpha=0.01):
"""Small fixed slope alpha for negative inputs."""
return np.where(z > 0, z, alpha * z)
def leaky_relu_derivative(z, alpha=0.01):
return np.where(z > 0, 1.0, alpha)
class PReLU:
"""Parametric ReLU — alpha is a learnable parameter."""
def __init__(self, init_alpha=0.25):
self.alpha = init_alpha # updated by optimizer during training
def forward(self, z):
return np.where(z > 0, z, self.alpha * z)
def backward(self, z, grad_output):
grad_z = np.where(z > 0, grad_output, self.alpha * grad_output)
grad_alpha = np.sum(np.where(z > 0, 0, grad_output * z)) # dL/d_alpha
return grad_z, grad_alpha
z = np.linspace(-4, 4, 300)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
alpha_values = [0.01, 0.1, 0.3]
for alpha in alpha_values:
axes[0].plot(z, leaky_relu(z, alpha), linewidth=2, label=f'alpha={alpha}')
axes[0].set_title('Leaky ReLU (various alpha)'); axes[0].legend(); axes[0].grid(alpha=0.3)
prelu = PReLU(init_alpha=0.25)
axes[1].plot(z, prelu.forward(z), linewidth=2.5, color='purple', label='PReLU (alpha=0.25 learned)')
axes[1].set_title('PReLU — learnable slope'); axes[1].legend(); axes[1].grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Quick comparison: what fraction of activations are zeroed out (dying-neuron risk)?
np.random.seed(0)
z_batch = np.random.randn(10000) - 1.5 # biased toward negative (simulates bad init)
relu_dead = (np.maximum(0, z_batch) == 0).mean()
leaky_dead = (leaky_relu(z_batch, 0.01) == 0).mean()
print(f"ReLU dead neurons: {relu_dead:.1%}")
print(f"Leaky ReLU dead neurons: {leaky_dead:.1%}")
# ReLU dead neurons: ~93% ← catastrophic
# Leaky ReLU dead neurons: 0.0% ← none, leak keeps gradient alive
ELU & SELU — smooth negative region
ELU Range: (−α, ∞) | Zero-centred: Approximately | Saturates: Only below at −α
Analogy (ELU): A soft-edged valve — instead of a hard kink at zero, the negative side curves smoothly to a floor of $-\alpha$. This reduces the bias shift that Leaky ReLU can cause and pushes the mean activation closer to zero, speeding up convergence.
$$\text{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha (e^z - 1) & z \leq 0 \end{cases}$$
SELU (Scaled ELU) adds two carefully chosen constants $\lambda$ and $\alpha$ (derived mathematically) so that activations are self-normalizing — the mean and variance of each layer's output are automatically kept near 0 and 1 respectively, without BatchNorm. This is the activation of choice for fully-connected deep networks where you want to avoid adding normalisation layers.
$$\text{SELU}(z) = \lambda \begin{cases} z & z > 0 \\ \alpha (e^z - 1) & z \leq 0 \end{cases} \quad \lambda \approx 1.0507,\ \alpha \approx 1.6733$$
import numpy as np
import matplotlib.pyplot as plt
def elu(z, alpha=1.0):
"""Smooth negative region, mean activation closer to zero."""
return np.where(z > 0, z, alpha * (np.exp(z) - 1))
# SELU constants are derived theoretically (Klambauer et al., 2017)
SELU_LAMBDA = 1.0507009873554804934193349852946
SELU_ALPHA = 1.6732632423543772848170429916717
def selu(z):
"""Self-normalizing ELU — keeps layer outputs near N(0,1)."""
return SELU_LAMBDA * np.where(z > 0, z, SELU_ALPHA * (np.exp(z) - 1))
z = np.linspace(-4, 4, 300)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for alpha in [0.5, 1.0, 2.0]:
axes[0].plot(z, elu(z, alpha), linewidth=2, label=f'ELU alpha={alpha}')
axes[0].set_title('ELU (various alpha)'); axes[0].legend(); axes[0].grid(alpha=0.3)
axes[1].plot(z, selu(z), linewidth=2.5, color='green', label='SELU')
axes[1].set_title('SELU — self-normalising'); axes[1].legend(); axes[1].grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Verify SELU self-normalisation property
np.random.seed(42)
x = np.random.randn(10000) # standard normal input
y = selu(x)
print(f"Input: mean={x.mean():.4f}, std={x.std():.4f}")
print(f"SELU: mean={y.mean():.4f}, std={y.std():.4f}")
# SELU: mean~-0.08, std~0.96 — close to N(0,1) after one pass
Caveat: SELU’s self-normalising guarantee only holds when the weights use LeCun normal initialisation (e.g. torch.nn.init.lecun_normal_) and the network uses AlphaDropout instead of standard dropout. Mixing SELU with BatchNorm or incorrect initialisation breaks the theoretical guarantee.
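To make this concrete, a quick sketch (width 128 and depth 10 are arbitrary choices, not from the original paper) checking that SELU plus LeCun normal initialisation keeps activations near mean 0 and standard deviation 1 across several layers:

```python
import numpy as np

SELU_LAMBDA, SELU_ALPHA = 1.0507009873554805, 1.6732632423543772

def selu(z):
    return SELU_LAMBDA * np.where(z > 0, z, SELU_ALPHA * (np.exp(z) - 1))

# Push a batch through 10 dense layers with LeCun normal initialisation
# (weights drawn with std = 1/sqrt(fan_in)) and watch mean/std stay near 0/1.
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 128))
for _ in range(10):
    W = rng.standard_normal((128, 128)) / np.sqrt(128)  # LeCun normal
    x = selu(x @ W)
print(f"after 10 SELU layers: mean={x.mean():+.3f}, std={x.std():.3f}")
# Values should stay close to 0 and 1; other init scales let them drift.
```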
GELU — Gaussian Error Linear Unit (Transformers)
Range: (−0.17, ∞) | Zero-centred: Approximately | Saturates: No
$$\text{GELU}(z) = z \cdot \Phi(z) \approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(z + 0.044715\,z^3\right)\right]\right)$$
Analogy: A probabilistic gate — instead of a hard or leaky switch, GELU asks "what is the probability that a Gaussian random variable is less than $z$?" and gates the signal by that probability. A neuron with a small positive input still passes some signal (proportional to its statistical likelihood of being signal rather than noise), while a neuron with a large positive input passes almost all of it.
GELU Powers BERT, GPT, and Most Modern LLMs
Google's BERT (2018) and OpenAI's GPT series both use GELU in their feed-forward sublayers. The smooth probabilistic gating empirically outperforms ReLU on NLP tasks, and GELU is now also the default in many vision transformers (ViT). In practice the tanh approximation above is often used for computational efficiency; PyTorch exposes both forms via F.gelu(x) and F.gelu(x, approximate='tanh').
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf # or use tanh approximation
def gelu_exact(z):
"""Exact GELU via the Gaussian CDF (requires scipy)."""
return z * 0.5 * (1.0 + erf(z / np.sqrt(2)))
def gelu_approx(z):
"""Tanh approximation used in PyTorch (matches torch.nn.functional.gelu)."""
return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(9, 4))
plt.plot(z, gelu_exact(z), linewidth=2.5, label='GELU (exact)')
plt.plot(z, gelu_approx(z), linewidth=2, linestyle='--', label='GELU (tanh approx)')
# Compare against ReLU
plt.plot(z, np.maximum(0, z), linewidth=1.5, linestyle=':', color='gray', label='ReLU')
plt.axhline(0, color='k', linewidth=0.5); plt.axvline(0, color='k', linewidth=0.5)
plt.title('GELU vs ReLU'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Key property: GELU can suppress small positive values (probabilistic gating)
for z_val in [-2, -0.5, 0, 0.5, 2]:
print(f"z={z_val:+.1f} GELU={gelu_exact(np.array([z_val]))[0]:+.4f} "
f"ReLU={max(0, z_val):.4f}")
# z=-0.5 GELU=-0.1543 ReLU=0.0000 ← GELU leaks slight negative signal
# z=+0.5 GELU=+0.3457 ReLU=0.5000 ← GELU gates down small positives
Swish / SiLU & Mish — smooth self-gated units
Swish / SiLU Range: (−0.28, ∞) | Zero-centred: Approximately | Saturates: No
$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1+e^{-z}} \qquad \text{Mish}(z) = z \cdot \tanh(\text{Softplus}(z)) = z \cdot \tanh(\ln(1+e^z))$$
Swish analogy: A self-gating highway — the input multiplies its own sigmoid, so large positive inputs pass through cleanly while small or negative inputs are smoothly attenuated. Discovered by Google Brain (2017) via neural architecture search, Swish outperforms ReLU on deep networks for image classification and language tasks.
Mish analogy: A smoother, slightly more aggressive version of Swish — it uses tanh(softplus(z)) as the gate, giving it a marginally richer gradient landscape. YOLOv4 and some recent object detection models prefer Mish because its slightly negative dip for small inputs helps prevent dead neurons. In practice, Swish and Mish are nearly interchangeable.
SiLU (Sigmoid Linear Unit) is identical to Swish with no learnable parameter. PyTorch calls it torch.nn.SiLU(); it is the default activation in EfficientNet and many diffusion model U-Nets.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
def swish(z, beta=1.0):
"""Swish / SiLU (beta=1). Also written as z * sigmoid(z)."""
return z * sigmoid(beta * z)
def softplus(z):
"""Smooth approximation to ReLU: log(1 + e^z), computed stably."""
return np.logaddexp(0.0, z) # stable for large positive and negative z
def mish(z):
"""Mish = z * tanh(softplus(z))."""
return z * np.tanh(softplus(z))
z = np.linspace(-4, 6, 300)
plt.figure(figsize=(10, 4))
plt.plot(z, swish(z), linewidth=2.5, label='Swish / SiLU')
plt.plot(z, mish(z), linewidth=2, linestyle='--', label='Mish')
plt.plot(z, np.maximum(0, z), linewidth=1.5, linestyle=':', color='gray', label='ReLU (reference)')
plt.axhline(0, color='k', linewidth=0.5)
plt.title('Swish / SiLU vs Mish vs ReLU'); plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
# Gradient comparison (Swish has a richer gradient landscape than ReLU)
def swish_derivative(z):
sig = sigmoid(z)
return sig + z * sig * (1 - sig)
print("Swish gradient at key points:")
for z_val in [-2, -0.5, 0, 1, 3]:
z_arr = np.array([z_val])
print(f" z={z_val:+.1f} swish'={swish_derivative(z_arr)[0]:+.4f}")
Softmax & Softplus — probability outputs and smooth approximation
Softmax is not used in hidden layers — it is exclusively an output-layer activation for multi-class classification. It converts a vector of raw scores (logits) into a valid probability distribution: all outputs are positive and sum to exactly 1.
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Analogy (Softmax): A tournament voting system — each contestant (class) gets votes proportional to the exponent of their score. The winner takes a large share, but every contestant gets at least some non-zero fraction of the total vote. This is why softmax is paired with cross-entropy loss in every multi-class classifier.
Softplus $= \ln(1 + e^z)$ is a smooth, everywhere-differentiable approximation of ReLU. It is mainly used in variational autoencoders (to produce positive variance predictions) and in models where smoothness at zero matters more than computational speed.
import numpy as np
import matplotlib.pyplot as plt
def softmax(z):
"""
Numerically stable softmax: subtract max before exp to prevent overflow.
Input: z — 1-D array of logits (raw scores)
Output: probability distribution summing to 1.0
"""
e = np.exp(z - z.max()) # subtract max for numerical stability
return e / e.sum()
def softplus(z):
"""Smooth, always-positive approximation to ReLU."""
return np.logaddexp(0.0, z) # numerically stable for any z
# Softmax demo: 5-class classifier (e.g., digit recognition 0–4)
logits = np.array([2.0, 1.0, 0.1, -0.5, 3.5]) # raw network outputs
probs = softmax(logits)
print("Logits → Softmax probabilities:")
for i, (logit, prob) in enumerate(zip(logits, probs)):
bar = '█' * int(prob * 40)
print(f" Class {i}: logit={logit:+.1f} P={prob:.3f} {bar}")
print(f" Sum = {probs.sum():.6f}")
# Softplus vs ReLU comparison
z = np.linspace(-4, 4, 300)
plt.figure(figsize=(8, 4))
plt.plot(z, softplus(z), linewidth=2.5, label='Softplus')
plt.plot(z, np.maximum(0, z), linewidth=2, linestyle='--', color='gray', label='ReLU')
plt.axhline(0, color='k', linewidth=0.5)
plt.title('Softplus vs ReLU — Softplus is smooth at the kink')
plt.legend(); plt.grid(alpha=0.3)
plt.tight_layout(); plt.show()
Framework note: standard cross-entropy losses (PyTorch's nn.CrossEntropyLoss, TensorFlow's tf.keras.losses.CategoricalCrossentropy(from_logits=True)) apply softmax internally during loss calculation. Pass raw logits, not the softmax output, to the loss function for better numerical stability.
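A small demonstration of why raw logits are preferred (the extreme logit values are contrived to trigger underflow):

```python
import numpy as np

# Why frameworks want raw logits: log(softmax(z)) computed in two steps can
# underflow to log(0) = -inf, while the fused log-softmax form
#   z - max(z) - log(sum(exp(z - max(z))))
# stays finite for the same inputs.
def log_softmax(z):
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([1000.0, 0.0, -1000.0])   # extreme but perfectly legal logits
probs = np.exp(logits - logits.max())
probs /= probs.sum()
with np.errstate(divide='ignore'):
    naive = np.log(probs)                    # two-step: softmax, then log
print("naive log(softmax):", naive)          # contains -inf (underflow)
print("fused log-softmax :", log_softmax(logits))  # all entries finite
```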
Choosing the Right Activation Function
With a dozen options available, the choice comes down to three questions: where in the network is the activation, what is the task, and how deep is the network?
| Situation | Recommended | Why |
|---|---|---|
| Default hidden layers (MLP/CNN) | ReLU | Fast, sparse, no vanishing gradient on positive side |
| Seeing dying neurons? | Leaky ReLU | Non-zero gradient everywhere; $\alpha$=0.01–0.1 |
| Deep MLP without BatchNorm | SELU | Self-normalises; use LeCun init + AlphaDropout |
| Transformer feed-forward layer | GELU | Used in BERT, GPT; probabilistic gating suits attention outputs |
| EfficientNet / diffusion U-Net | Swish / SiLU | Smooth, self-gated; matches neural architecture search results |
| Object detection (YOLO, etc.) | Mish | Slightly richer gradient than Swish for localisation tasks |
| LSTM / GRU hidden state | Tanh | Zero-centred, bounded — prevents state explosion over time |
| Binary classification output | Sigmoid | Outputs P(positive class) in (0,1) |
| Multi-class classification output | Softmax | Converts logits to valid probability distribution |
| Regression output | None (linear) | Unconstrained real-valued prediction |
| VAE variance head / positive output | Softplus | Smooth, always positive — models variance without clamping |
import numpy as np
# All activations in one self-contained reference block
def sigmoid(z): return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def tanh(z): return np.tanh(z)
def relu(z): return np.maximum(0, z)
def leaky_relu(z, alpha=0.01): return np.where(z > 0, z, alpha * z)
def elu(z, alpha=1.0): return np.where(z > 0, z, alpha * (np.exp(z) - 1))
SELU_L, SELU_A = 1.0507009873554805, 1.6732632423543772
def selu(z): return SELU_L * np.where(z > 0, z, SELU_A * (np.exp(z) - 1))
def gelu(z): return 0.5*z*(1+np.tanh(np.sqrt(2/np.pi)*(z+0.044715*z**3)))
def swish(z): return z * sigmoid(z)
def softplus(z): return np.logaddexp(0.0, z)
def mish(z): return z * np.tanh(softplus(z))
def softmax(z): e=np.exp(z-z.max()); return e/e.sum()
# Side-by-side evaluation at representative inputs
activations = {
'Sigmoid': sigmoid, 'Tanh': tanh,
'ReLU': relu, 'Leaky ReLU': leaky_relu,
'ELU': elu, 'SELU': selu,
'GELU': gelu, 'Swish/SiLU': swish,
'Mish': mish, 'Softplus': softplus,
}
test_values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(f"{'Function':<14} ", end='')
for v in test_values: print(f"z={v:+.0f} ", end='')
print()
print('-' * 75)
for name, fn in activations.items():
vals = fn(test_values)
print(f"{name:<14} ", end='')
for v in vals: print(f"{v:+7.3f} ", end='')
print()
Why Non-Linearity Matters
This is perhaps the most critical insight in neural network theory: without non-linear activations, no matter how many layers you stack, the entire network is equivalent to a single linear transformation. Let’s prove this mathematically and empirically.
The Linear Collapse Proof
If we have two layers without activation: $y = W_2 (W_1 x) = (W_2 W_1) x = W_{combined} x$. The composition of linear functions is still linear. Let’s demonstrate this with code:
import numpy as np
# PROOF: Without activations, N layers = 1 layer (linear collapse)
np.random.seed(42)
# Create 5 random weight matrices (simulating 5 layers)
input_dim = 4
hidden_dim = 8
output_dim = 2
W1 = np.random.randn(hidden_dim, input_dim)
W2 = np.random.randn(hidden_dim, hidden_dim)
W3 = np.random.randn(hidden_dim, hidden_dim)
W4 = np.random.randn(hidden_dim, hidden_dim)
W5 = np.random.randn(output_dim, hidden_dim)
# Forward pass through 5 layers WITHOUT activation
x = np.random.randn(input_dim)
layer1 = W1 @ x
layer2 = W2 @ layer1
layer3 = W3 @ layer2
layer4 = W4 @ layer3
output_5layers = W5 @ layer4
# The entire 5-layer network is just ONE matrix multiplication
W_combined = W5 @ W4 @ W3 @ W2 @ W1
output_1layer = W_combined @ x
print("=== Linear Collapse Demonstration ===")
print(f"\n5-layer output: {output_5layers}")
print(f"1-layer output: {output_1layer}")
print(f"\nDifference: {np.abs(output_5layers - output_1layer).max():.2e}")
print("(Effectively zero - just floating point noise)")
print("\n--- Conclusion ---")
print(f"5 layers ({input_dim}x{hidden_dim}x{hidden_dim}x{hidden_dim}x{hidden_dim}x{output_dim})")
print(f" = 1 layer ({input_dim}x{output_dim})")
print(f"\nW_combined shape: {W_combined.shape}")
print("Without activation, depth is MEANINGLESS!")
print("You cannot learn XOR, circles, or any non-linear pattern.")
Fixing XOR with Non-Linearity
The XOR problem (from Part 1) cannot be solved by any single linear boundary. But a network with just one hidden layer and ReLU activation can solve it perfectly:
import numpy as np
import matplotlib.pyplot as plt
# XOR problem: no single line can separate these points
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
# Manual solution: 2-layer network with ReLU
# Hidden layer: 2 neurons with hand-crafted weights
W1 = np.array([[1, 1], # neuron 1: detects (x1 + x2)
[1, 1]]) # neuron 2: same
b1 = np.array([0, -1]) # neuron 1 threshold: 0, neuron 2: -1
# Output layer: 1 neuron
W2 = np.array([[1, -2]]) # combines hidden outputs
b2 = np.array([0])
def relu(z):
return np.maximum(0, z)
print("=== Solving XOR with ReLU Network ===")
print(f"Architecture: 2 inputs -> 2 hidden (ReLU) -> 1 output")
print(f"\nW1 = {W1.tolist()}, b1 = {b1.tolist()}")
print(f"W2 = {W2.tolist()}, b2 = {b2.tolist()}")
print("\n" + "-" * 45)
print(f"{'Input':<12} {'Hidden (pre)':<15} {'Hidden (ReLU)':<15} {'Output':<8}")
print("-" * 45)
for i in range(4):
x = X_xor[i]
z1 = W1 @ x + b1 # pre-activation
h1 = relu(z1) # after ReLU
output = W2 @ h1 + b2 # final output
print(f"{x} -> {z1} -> {h1} -> {output[0]:.1f} (target: {y_xor[i]})")
# Visualize the hidden space transformation
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Original XOR space
colors = ['red' if y == 0 else 'blue' for y in y_xor]
axes[0].scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200,
edgecolors='black', linewidth=2, zorder=5)
axes[0].set_xlabel('x1', fontsize=12)
axes[0].set_ylabel('x2', fontsize=12)
axes[0].set_title('Original Space (NOT linearly separable)', fontsize=13)
axes[0].set_xlim(-0.5, 1.5)
axes[0].set_ylim(-0.5, 1.5)
axes[0].grid(True, alpha=0.3)
# Transformed hidden space
hidden_outputs = []
for x in X_xor:
z1 = W1 @ x + b1
h1 = relu(z1)
hidden_outputs.append(h1)
hidden_outputs = np.array(hidden_outputs)
axes[1].scatter(hidden_outputs[:, 0], hidden_outputs[:, 1], c=colors, s=200,
edgecolors='black', linewidth=2, zorder=5)
# Draw separating line in hidden space
h_range = np.linspace(-0.5, 2.5, 100)
sep_line = 0.5 * h_range # W2 @ h = 0 boundary
axes[1].plot(h_range, sep_line, 'g--', linewidth=2, label='Decision Boundary')
axes[1].set_xlabel('Hidden Neuron 1 Output', fontsize=12)
axes[1].set_ylabel('Hidden Neuron 2 Output', fontsize=12)
axes[1].set_title('Hidden Space (NOW linearly separable!)', fontsize=13)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('xor_nonlinearity.png', dpi=100, bbox_inches='tight')
plt.show()
print("\nReLU transforms the space so XOR becomes linearly separable!")
Layers: Input, Hidden, Output
Neural networks are organized into layers, each performing a specific transformation. Understanding layer types is essential for designing architectures:
- Input Layer: Not a computation layer — simply passes raw features to the network. Size equals number of input features.
- Hidden Layers: Where the learning happens. Each neuron computes a weighted sum + activation. Multiple hidden layers enable hierarchical feature learning.
- Output Layer: Produces the final prediction. Activation depends on the task (sigmoid for binary classification, softmax for multi-class, linear for regression).
flowchart LR
subgraph Input["Input Layer (2)"]
i1["x1"]
i2["x2"]
end
subgraph Hidden1["Hidden Layer 1 (3)"]
h1["h1"]
h2["h2"]
h3["h3"]
end
subgraph Hidden2["Hidden Layer 2 (2)"]
h4["h4"]
h5["h5"]
end
subgraph Output["Output Layer (1)"]
o1["y"]
end
i1 --> h1
i1 --> h2
i1 --> h3
i2 --> h1
i2 --> h2
i2 --> h3
h1 --> h4
h1 --> h5
h2 --> h4
h2 --> h5
h3 --> h4
h3 --> h5
h4 --> o1
h5 --> o1
Implementing a Dense Layer
Each dense (fully-connected) layer stores a weight matrix $\mathbf{W}$ of shape (outputs, inputs) and a bias vector $\mathbf{b}$ of shape (outputs,). The forward pass computes $\mathbf{a} = f(\mathbf{W} \cdot \mathbf{x} + \mathbf{b})$:
import numpy as np
class DenseLayer:
"""A fully-connected (dense) neural network layer."""
def __init__(self, n_inputs, n_neurons, activation='relu'):
# He initialization (std = sqrt(2/n_inputs)) — suits ReLU layers
scale = np.sqrt(2.0 / n_inputs)
self.weights = np.random.randn(n_neurons, n_inputs) * scale
self.biases = np.zeros(n_neurons)
self.activation = activation
# Store for later (backprop will need these)
self.last_input = None
self.last_z = None
self.last_output = None
def _activate(self, z):
"""Apply activation function."""
if self.activation == 'relu':
return np.maximum(0, z)
elif self.activation == 'sigmoid':
return 1.0 / (1.0 + np.exp(-z))
elif self.activation == 'tanh':
return np.tanh(z)
elif self.activation == 'linear':
return z
else:
raise ValueError(f"Unknown activation: {self.activation}")
def forward(self, x):
"""Forward pass: output = activation(W @ x + b)."""
self.last_input = x
self.last_z = self.weights @ x + self.biases
self.last_output = self._activate(self.last_z)
return self.last_output
def __repr__(self):
return (f"DenseLayer(inputs={self.weights.shape[1]}, "
f"neurons={self.weights.shape[0]}, "
f"activation='{self.activation}')")
# Create layers and demonstrate
np.random.seed(42)
layer1 = DenseLayer(n_inputs=3, n_neurons=4, activation='relu')
layer2 = DenseLayer(n_inputs=4, n_neurons=2, activation='sigmoid')
print("=== Layer Architecture ===")
print(f"Layer 1: {layer1}")
print(f" Weights shape: {layer1.weights.shape}")
print(f" Biases shape: {layer1.biases.shape}")
print(f" Total params: {layer1.weights.size + layer1.biases.size}")
print(f"\nLayer 2: {layer2}")
print(f" Weights shape: {layer2.weights.shape}")
print(f" Biases shape: {layer2.biases.shape}")
print(f" Total params: {layer2.weights.size + layer2.biases.size}")
# Forward pass through both layers
x = np.array([1.0, 0.5, -0.3])
h = layer1.forward(x)
y = layer2.forward(h)
print(f"\n=== Forward Pass ===")
print(f"Input: {x}")
print(f"After Layer 1: {h} (ReLU applied)")
print(f"After Layer 2: {y} (Sigmoid applied)")
print(f"\nTotal network parameters: "
f"{layer1.weights.size + layer1.biases.size + layer2.weights.size + layer2.biases.size}")
Forward Propagation Step-by-Step
Forward propagation is the process of passing input through every layer of the network to produce an output. In matrix notation, for layer $l$:
$$\mathbf{a}^{[l]} = g\left(\mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\right)$$

where $\mathbf{a}^{[0]} = \mathbf{x}$ (the input), $\mathbf{W}^{[l]}$ is the weight matrix of layer $l$, $\mathbf{b}^{[l]}$ is the bias vector, and $g$ is the activation function.
Step-by-Step with Intermediate Values
Let’s trace through a complete forward pass, printing every intermediate computation:
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)
# Define a small network: 3 inputs -> 4 hidden (ReLU) -> 2 output (sigmoid)
np.random.seed(123)
# Layer 1 parameters
W1 = np.random.randn(4, 3) * 0.5 # 4 neurons, 3 inputs each
b1 = np.zeros(4)
# Layer 2 parameters
W2 = np.random.randn(2, 4) * 0.5 # 2 neurons, 4 inputs each
b2 = np.zeros(2)
# Input
x = np.array([1.0, 0.5, -0.8])
print("=" * 60)
print("FORWARD PROPAGATION: Step-by-Step Trace")
print("=" * 60)
print(f"\nNetwork: 3 -> 4 (ReLU) -> 2 (Sigmoid)")
print(f"\nInput x = {x}")
# Step 1: Linear transformation in layer 1
z1 = W1 @ x + b1
print(f"\n--- Layer 1 (Hidden) ---")
print(f"W1 shape: {W1.shape}")
print(f"z1 = W1 @ x + b1")
print(f"z1 = {z1}")
# Step 2: Apply ReLU activation
a1 = relu(z1)
print(f"\na1 = ReLU(z1)")
print(f"a1 = {a1}")
print(f"(Negative values zeroed out by ReLU)")
# Step 3: Linear transformation in layer 2
z2 = W2 @ a1 + b2
print(f"\n--- Layer 2 (Output) ---")
print(f"W2 shape: {W2.shape}")
print(f"z2 = W2 @ a1 + b2")
print(f"z2 = {z2}")
# Step 4: Apply Sigmoid activation
a2 = sigmoid(z2)
print(f"\na2 = Sigmoid(z2)")
print(f"a2 = {a2}")
print(f"\n{'=' * 60}")
print(f"FINAL OUTPUT: {a2}")
print(f"{'=' * 60}")
# Verify dimensions at each step
print(f"\n--- Dimension Check ---")
print(f"x: {x.shape} (input)")
print(f"W1: {W1.shape} (layer 1 weights)")
print(f"z1: {z1.shape} (pre-activation 1)")
print(f"a1: {a1.shape} (post-activation 1)")
print(f"W2: {W2.shape} (layer 2 weights)")
print(f"z2: {z2.shape} (pre-activation 2)")
print(f"a2: {a2.shape} (final output)")
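The trace above processes one sample at a time. In practice, inputs are batched: stack samples as rows of a matrix `X` and compute `X @ W.T + b`, which applies the same layer to every row at once. A sketch with the same seed and weight shapes as the trace, so the first row should reproduce the single-sample output:

```python
import numpy as np

np.random.seed(123)
W1 = np.random.randn(4, 3) * 0.5
b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.5
b2 = np.zeros(2)

# Four samples stacked as rows: shape (4, 3); row 0 is the input from the trace
X = np.array([[1.0, 0.5, -0.8],
              [0.2, -1.0, 0.3],
              [0.0, 0.0, 0.0],
              [-0.5, 0.7, 1.2]])

Z1 = X @ W1.T + b1              # (4, 3) @ (3, 4) -> (4, 4)
A1 = np.maximum(0, Z1)          # ReLU, elementwise on the whole batch
Z2 = A1 @ W2.T + b2             # (4, 4) @ (4, 2) -> (4, 2)
A2 = 1.0 / (1.0 + np.exp(-Z2))  # sigmoid

print(A2.shape)                 # one 2-element output per sample
```

This is why the weight matrix is stored as (outputs, inputs): transposing it lets a whole batch flow through in one matrix multiply instead of a Python loop.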
Putting It Together: Multi-Layer Perceptron
Now let’s combine everything into a complete, configurable MLP class. This is the foundation we’ll build upon for training (backpropagation) in Part 3:
import numpy as np
class MLP:
    """Multi-Layer Perceptron with configurable architecture."""

    def __init__(self, layer_sizes, activations=None):
        """
        Args:
            layer_sizes: list of ints, e.g. [3, 4, 4, 2]
                (input=3, two hidden=4, output=2)
            activations: list of activation names for each layer
                (length = len(layer_sizes) - 1)
        """
        self.n_layers = len(layer_sizes) - 1
        # Default: ReLU for hidden, sigmoid for output
        if activations is None:
            activations = ['relu'] * (self.n_layers - 1) + ['sigmoid']
        self.activations = activations
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        for i in range(self.n_layers):
            n_in = layer_sizes[i]
            n_out = layer_sizes[i + 1]
            # Xavier/He initialization
            if activations[i] == 'relu':
                scale = np.sqrt(2.0 / n_in)  # He init for ReLU
            else:
                scale = np.sqrt(1.0 / n_in)  # Xavier init
            W = np.random.randn(n_out, n_in) * scale
            b = np.zeros(n_out)
            self.weights.append(W)
            self.biases.append(b)
        # Storage for forward pass values (needed for backprop)
        self.layer_inputs = []
        self.layer_outputs = []

    def _activate(self, z, activation):
        """Apply activation function."""
        if activation == 'relu':
            return np.maximum(0, z)
        elif activation == 'sigmoid':
            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        elif activation == 'tanh':
            return np.tanh(z)
        elif activation == 'linear':
            return z
        elif activation == 'leaky_relu':
            return np.where(z > 0, z, 0.01 * z)
        else:
            raise ValueError(f"Unknown activation: {activation}")

    def forward(self, x):
        """Forward pass through all layers."""
        self.layer_inputs = [x]
        self.layer_outputs = [x]
        a = x
        for i in range(self.n_layers):
            z = self.weights[i] @ a + self.biases[i]
            a = self._activate(z, self.activations[i])
            self.layer_inputs.append(z)
            self.layer_outputs.append(a)
        return a

    def predict(self, X):
        """Predict for multiple samples (batch)."""
        return np.array([self.forward(x) for x in X])

    def count_params(self):
        """Count total trainable parameters."""
        total = 0
        for W, b in zip(self.weights, self.biases):
            total += W.size + b.size
        return total

    def summary(self):
        """Print network architecture summary."""
        print("=" * 50)
        print("MLP Architecture Summary")
        print("=" * 50)
        for i in range(self.n_layers):
            n_in = self.weights[i].shape[1]
            n_out = self.weights[i].shape[0]
            params = self.weights[i].size + self.biases[i].size
            print(f"Layer {i+1}: {n_in} -> {n_out} "
                  f"({self.activations[i]}) "
                  f"| params: {params}")
        print("-" * 50)
        print(f"Total parameters: {self.count_params()}")
        print("=" * 50)
# Create and test the MLP
np.random.seed(42)
# Architecture: 2 inputs -> 4 hidden -> 3 hidden -> 1 output
mlp = MLP(
    layer_sizes=[2, 4, 3, 1],
    activations=['relu', 'relu', 'sigmoid']
)
mlp.summary()
# Test on XOR-like data
X_test = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print("\n=== Forward Pass on XOR Data ===")
for x in X_test:
    output = mlp.forward(x)
    print(f"Input: {x} -> Output: {output[0]:.4f}")
print("\nNote: Outputs are random (untrained network).")
print("After training (Part 3), this will solve XOR correctly!")
Testing the MLP on Simple Data
Let’s test our MLP on a slightly more complex dataset to verify it can produce different outputs for different inputs (even before training):
import numpy as np
import matplotlib.pyplot as plt
class MLP:
    """Multi-Layer Perceptron (repeated for independence)."""

    def __init__(self, layer_sizes, activations=None):
        self.n_layers = len(layer_sizes) - 1
        if activations is None:
            activations = ['relu'] * (self.n_layers - 1) + ['sigmoid']
        self.activations = activations
        self.weights = []
        self.biases = []
        for i in range(self.n_layers):
            n_in = layer_sizes[i]
            n_out = layer_sizes[i + 1]
            scale = np.sqrt(2.0 / n_in) if activations[i] == 'relu' else np.sqrt(1.0 / n_in)
            self.weights.append(np.random.randn(n_out, n_in) * scale)
            self.biases.append(np.zeros(n_out))

    def _activate(self, z, activation):
        if activation == 'relu':
            return np.maximum(0, z)
        elif activation == 'sigmoid':
            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        elif activation == 'tanh':
            return np.tanh(z)
        return z  # linear fallback

    def forward(self, x):
        a = x
        for i in range(self.n_layers):
            z = self.weights[i] @ a + self.biases[i]
            a = self._activate(z, self.activations[i])
        return a
# Create an MLP and visualize its decision landscape
np.random.seed(7)
mlp = MLP(layer_sizes=[2, 8, 4, 1], activations=['relu', 'relu', 'sigmoid'])
# Generate a grid of points
x_range = np.linspace(-3, 3, 80)
y_range = np.linspace(-3, 3, 80)
X_grid, Y_grid = np.meshgrid(x_range, y_range)
Z_grid = np.zeros_like(X_grid)
for i in range(80):
    for j in range(80):
        point = np.array([X_grid[i, j], Y_grid[i, j]])
        Z_grid[i, j] = mlp.forward(point)[0]
# Plot the untrained network decision landscape
plt.figure(figsize=(8, 6))
plt.contourf(X_grid, Y_grid, Z_grid, levels=20, cmap='RdYlBu_r')
plt.colorbar(label='Network Output (Sigmoid)')
plt.xlabel('x1', fontsize=12)
plt.ylabel('x2', fontsize=12)
plt.title('MLP Decision Landscape (Untrained, Random Weights)', fontsize=14)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.savefig('mlp_decision_landscape.png', dpi=100, bbox_inches='tight')
plt.show()
print("=== Untrained MLP Decision Landscape ===")
print("Even with random weights, the network creates")
print("NON-LINEAR boundaries thanks to ReLU activations.")
print(f"\nNetwork: 2 -> 8 -> 4 -> 1")
print(f"Total parameters: {sum(w.size for w in mlp.weights) + sum(b.size for b in mlp.biases)}")
print(f"Output range: [{Z_grid.min():.4f}, {Z_grid.max():.4f}]")
print("\nThe piecewise-linear regions are ReLU boundaries.")
print("Training will adjust these to match the target function.")
Exercise: Depth vs. Width
Create three MLPs with comparable parameter budgets but different shapes:
- Wide and shallow: [2, 32, 1] (129 params)
- Narrow and deep: [2, 4, 4, 4, 4, 1] (77 params)
- Balanced: [2, 8, 8, 1] (105 params)
Plot their decision landscapes. Which creates the most complex boundaries? This foreshadows the depth vs. width debate in deep learning.
What’s Next
We’ve built all the forward-pass machinery: neurons, weights, biases, activations, layers, and a complete MLP class. But our network produces random outputs — it doesn’t learn yet. In Part 3, we’ll teach it to learn by implementing:
- Loss Functions: Measuring how wrong the network is (MSE, cross-entropy)
- Gradient Descent: Adjusting weights to reduce error
- Backpropagation: The chain rule applied layer by layer to compute gradients
- Training Loop: Iterating forward → loss → backward → update
Next in the Series
In Part 3: How Neural Networks Learn, we’ll implement loss functions, gradient descent, and the backpropagation algorithm from scratch — transforming our MLP from a random function into an intelligent learning system.