Project Goal
Given token embeddings $X \in \mathbb{R}^{T \times d}$, compute $Q=XW_Q$, $K=XW_K$, $V=XW_V$, then apply $$\text{softmax}(QK^\top/\sqrt{d_k})V.$$
Success criterion: your output should have one contextual vector per input token, with shape $(T,d_k)$ for a single head.
Implementation
```python
import numpy as np

np.random.seed(12)
T, d_model, d_k = 5, 8, 4  # sequence length, model width, head width

X = np.random.randn(T, d_model)  # token embeddings

# Projection matrices, scaled to keep activations near unit variance
Wq = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wk = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wv = np.random.randn(d_model, d_k) / np.sqrt(d_model)

Q = X @ Wq  # (T, d_k)
K = X @ Wk  # (T, d_k)
V = X @ Wv  # (T, d_k)

# Scaled dot-product attention logits
scores = Q @ K.T / np.sqrt(d_k)  # (T, T)

# Causal mask: each token attends only to itself and earlier positions
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -1e9

# Numerically stable softmax over each row
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

context = weights @ V  # (T, d_k): one contextual vector per token

print("Q:", Q.shape, "scores:", scores.shape, "context:", context.shape)
print("row sums:", np.round(weights.sum(axis=-1), 4))
```
Shape Checks
| Tensor | Shape | Meaning |
|---|---|---|
| $Q,K,V$ | $(T,d_k)$ | Token projections |
| $QK^\top$ | $(T,T)$ | Attention logits |
| weights | $(T,T)$ | Probability distribution per query token |
| context | $(T,d_k)$ | Weighted value summaries |
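The shapes in the table can be checked programmatically. A minimal self-contained sketch (regenerating small random tensors rather than reusing the script's variables, and omitting the causal mask since it does not change any shape):

```python
import numpy as np

T, d_model, d_k = 5, 8, 4
X = np.random.randn(T, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V

assert Q.shape == K.shape == V.shape == (T, d_k)  # token projections
assert scores.shape == weights.shape == (T, T)    # logits / probabilities
assert context.shape == (T, d_k)                  # weighted value summaries
assert np.allclose(weights.sum(axis=-1), 1.0)     # each row is a distribution
```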
Extensions
Add Heads
Split $d_{model}$ into $h$ heads with per-head width $d_k = d_{model}/h$, run scaled dot-product attention independently in each head, concatenate the $h$ outputs, and project back to $d_{model}$ with an output matrix.
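The steps above can be sketched as follows. This is a minimal unmasked illustration, not a tuned implementation; the per-head weight arrays `Wq`, `Wk`, `Wv` and the output projection `Wo` are hypothetical names chosen here, and the head loop could equally be vectorized with a reshape:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, h = 5, 8, 2
d_k = d_model // h  # per-head width

X = rng.standard_normal((T, d_model))

# One projection triple per head, plus a final output projection
Wq = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wk = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wv = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def softmax(z):
    # Numerically stable row-wise softmax
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]  # (T, d_k) each
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T)
    heads.append(weights @ V)                  # (T, d_k)

# Concatenate head outputs and project back to d_model
out = np.concatenate(heads, axis=-1) @ Wo      # (T, d_model)
print(out.shape)  # (5, 8)
```

Each head sees the full sequence but only a $d_k$-dimensional projection of it, so the total compute matches a single head of width $d_{model}$ while letting heads specialize.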