
Self-Attention from Scratch

April 30, 2026 · Wasil Zafar · 18 min read

A practical NumPy capstone that implements scaled dot-product self-attention and makes every tensor shape visible.

Table of Contents

  1. Project Goal
  2. Implementation
  3. Shape Checks
  4. Extensions

Project Goal

Given token embeddings $X \in \mathbb{R}^{T \times d}$, compute $Q=XW_Q$, $K=XW_K$, $V=XW_V$, then apply $$\text{softmax}(QK^\top/\sqrt{d_k})V.$$

Success criterion: your output should have one contextual vector per input token, with shape $(T,d_k)$ for a single head.
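
To see why, trace the shapes (this just restates the definitions above): since $X \in \mathbb{R}^{T \times d}$ and $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$, the projections satisfy $Q, K, V \in \mathbb{R}^{T \times d_k}$, so

$$\underbrace{\operatorname{softmax}\!\left(QK^\top/\sqrt{d_k}\right)}_{T \times T}\;\underbrace{V}_{T \times d_k} \in \mathbb{R}^{T \times d_k},$$

i.e. one $d_k$-dimensional contextual vector per input token.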

Implementation

import numpy as np

np.random.seed(12)
T, d_model, d_k = 5, 8, 4  # sequence length, embedding width, head width

# Token embeddings and projection matrices (scaled init keeps activations well-behaved)
X = np.random.randn(T, d_model)
Wq = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wk = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wv = np.random.randn(d_model, d_k) / np.sqrt(d_model)

# Project embeddings into query, key, and value spaces
Q = X @ Wq
K = X @ Wk
V = X @ Wv

# Scaled dot-product logits, then a causal mask so token t only attends to positions <= t
scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -1e9

# Numerically stable row-wise softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# Each output row is a weighted average of the value vectors
context = weights @ V

print("Q:", Q.shape, "scores:", scores.shape, "context:", context.shape)
print("row sums:", np.round(weights.sum(axis=-1), 4))

Shape Checks

Tensor | Shape | Meaning
$Q, K, V$ | $(T, d_k)$ | Token projections
$QK^\top$ | $(T, T)$ | Attention logits
weights | $(T, T)$ | Probability distribution per query token
context | $(T, d_k)$ | Weighted value summaries
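
If you want this table to be executable, a few assertions appended to the script above will do it (a sketch added here, not part of the original code):

# Executable version of the shape table.
assert Q.shape == K.shape == V.shape == (T, d_k)
assert scores.shape == weights.shape == (T, T)
assert context.shape == (T, d_k)
# Every row of the attention weights is a probability distribution.
assert np.allclose(weights.sum(axis=-1), 1.0)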

Extensions

Challenge: Add Multi-Head Attention

Split $d_{model}$ into $h$ heads, run attention independently, concatenate outputs, and project back to $d_{model}$.
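
A minimal self-contained sketch of that extension is below, without the causal mask for brevity; the function name, the output projection Wo, and the head count h are illustrative choices, not from the article.

import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Split d_model into h heads, attend per head, concatenate, project back."""
    T, d_model = X.shape
    d_head = d_model // h

    # Project once, then reshape to (h, T, d_head) so each head works independently.
    Q = (X @ Wq).reshape(T, h, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(T, h, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(T, h, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention in every head at once: logits are (h, T, T).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # (h, T, d_head) -> concatenate heads back to (T, d_model), then project.
    heads = weights @ V
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo

np.random.seed(12)
T, d_model, h = 5, 8, 2
X = np.random.randn(T, d_model)
Wq, Wk, Wv, Wo = [np.random.randn(d_model, d_model) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (T, d_model)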