Project Goal
Given token embeddings $X \in \mathbb{R}^{T \times d}$, compute $Q=XW_Q$, $K=XW_K$, $V=XW_V$, then apply $$\text{softmax}(QK^\top/\sqrt{d_k})V.$$
Success criterion: your output should have one contextual vector per input token, with shape $(T,d_k)$ for a single head.
Implementation
```python
import numpy as np

np.random.seed(12)
T, d_model, d_k = 5, 8, 4  # sequence length, model width, head width

X = np.random.randn(T, d_model)  # token embeddings

# Projection matrices, scaled to keep activations near unit variance
Wq = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wk = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wv = np.random.randn(d_model, d_k) / np.sqrt(d_model)

Q = X @ Wq  # (T, d_k)
K = X @ Wk  # (T, d_k)
V = X @ Wv  # (T, d_k)

# Scaled dot-product attention logits
scores = Q @ K.T / np.sqrt(d_k)  # (T, T)

# Causal mask: each token attends only to itself and earlier positions
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -1e9

# Numerically stable softmax over each row
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

context = weights @ V  # (T, d_k): one contextual vector per token

print("Q:", Q.shape, "scores:", scores.shape, "context:", context.shape)
print("row sums:", np.round(weights.sum(axis=-1), 4))
```
Shape Checks
| Tensor | Shape | Meaning |
|---|---|---|
| $Q,K,V$ | $(T,d_k)$ | Token projections |
| $QK^\top$ | $(T,T)$ | Attention logits |
| weights | $(T,T)$ | Probability distribution per query token |
| context | $(T,d_k)$ | Weighted value summaries |
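The shapes in the table can be checked programmatically. A minimal self-contained sketch (regenerating small random tensors rather than reusing the script's variables, and omitting the causal mask since it does not change any shape):

```python
import numpy as np

T, d_model, d_k = 5, 8, 4
X = np.random.randn(T, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V

assert Q.shape == K.shape == V.shape == (T, d_k)  # token projections
assert scores.shape == weights.shape == (T, T)    # logits / probabilities
assert context.shape == (T, d_k)                  # weighted value summaries
assert np.allclose(weights.sum(axis=-1), 1.0)     # each row is a distribution
```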
Extensions
Add Heads
Split $d_{model}$ into $h$ heads with per-head width $d_k = d_{model}/h$, run scaled dot-product attention independently in each head, concatenate the $h$ outputs, and project back to $d_{model}$ with an output matrix.
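The steps above can be sketched as follows. This is a minimal unmasked illustration, not a tuned implementation; the per-head weight arrays `Wq`, `Wk`, `Wv` and the output projection `Wo` are hypothetical names chosen here, and the head loop could equally be vectorized with a reshape:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, h = 5, 8, 2
d_k = d_model // h  # per-head width

X = rng.standard_normal((T, d_model))

# One projection triple per head, plus a final output projection
Wq = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wk = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wv = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def softmax(z):
    # Numerically stable row-wise softmax
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]  # (T, d_k) each
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T)
    heads.append(weights @ V)                  # (T, d_k)

# Concatenate head outputs and project back to d_model
out = np.concatenate(heads, axis=-1) @ Wo      # (T, d_model)
print(out.shape)  # (5, 8)
```

Each head sees the full sequence but only a $d_k$-dimensional projection of it, so the total compute matches a single head of width $d_{model}$ while letting heads specialize.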