Metric Learning Foundations
The goal of metric learning is to learn an embedding function $f_\theta: \mathcal{X} \to \mathbb{R}^d$ such that similar inputs map to nearby points and dissimilar inputs map to distant points in the embedding space.
Distance & Similarity Measures
| Measure | Formula | Range | Used In |
|---|---|---|---|
| Euclidean distance | $d(u,v) = \|u - v\|_2$ | $[0, \infty)$ | Triplet loss, clustering |
| Cosine similarity | $\text{sim}(u,v) = \frac{u \cdot v}{\|u\|\|v\|}$ | $[-1, 1]$ | InfoNCE, CLIP |
| Dot product | $s(u,v) = u \cdot v$ | $(-\infty, \infty)$ | Attention, retrieval |
Why normalize? With L2-normalized embeddings ($\|u\| = \|v\| = 1$), cosine similarity equals the dot product, and Euclidean distance relates to cosine by $\|u-v\|^2 = 2(1 - \cos(u,v))$. Normalization prevents the model from "cheating" by simply increasing embedding magnitudes to reduce loss.
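The identities above are easy to check numerically. The following is a minimal sketch, assuming random 16-dimensional vectors; the variable names (`u_hat`, `v_hat`, `scaled_dot`, and so on) are illustrative and not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=16)
v = rng.normal(size=16)

# L2-normalize so that ||u_hat|| = ||v_hat|| = 1
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

# Cosine similarity of the raw vectors
cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# Dot product and squared Euclidean distance of the normalized vectors
dot_normalized = np.dot(u_hat, v_hat)
sq_dist_normalized = np.sum((u_hat - v_hat) ** 2)

# Scaling the raw vectors inflates the dot product but leaves the cosine unchanged
scaled_dot = np.dot(10 * u, 10 * v)
scaled_cos = scaled_dot / (np.linalg.norm(10 * u) * np.linalg.norm(10 * v))

print(f"cosine(u, v)            = {cos_uv:.6f}")
print(f"dot(u_hat, v_hat)       = {dot_normalized:.6f}")
print(f"||u_hat - v_hat||^2     = {sq_dist_normalized:.6f}")
print(f"2 * (1 - cosine)        = {2 * (1 - cos_uv):.6f}")
print(f"dot ratio after scaling = {scaled_dot / np.dot(u, v):.1f}")
```

The last line illustrates the "cheating" route: multiplying both embeddings by 10 multiplies the raw dot product by 100, while the cosine similarity does not move.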
Contrastive Loss (Siamese Networks)
The original contrastive loss for pairs $(x_i, x_j)$ with binary label $y_{ij} \in \{0, 1\}$ (1 = similar):
$$\mathcal{L}_{\text{contrastive}} = y_{ij}\,d(f(x_i), f(x_j))^2 + (1-y_{ij})\,\max(0, m - d(f(x_i), f(x_j)))^2$$
Similar pairs are pulled together (minimize distance). Dissimilar pairs are pushed apart, but only up to margin $m$; once they're farther than $m$, the loss is zero and no gradient flows.
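A minimal NumPy sketch of this pairwise loss follows, assuming Euclidean distance for $d$ and a margin of 1.0. The function name `contrastive_loss` and the toy embeddings are illustrative choices, not part of any library.

```python
import numpy as np

def contrastive_loss(emb_i, emb_j, y, margin=1.0):
    """Per-pair contrastive loss; y = 1 for similar pairs, 0 for dissimilar."""
    d = np.linalg.norm(emb_i - emb_j, axis=1)          # Euclidean distance per pair
    pull = y * d ** 2                                   # similar pairs: penalize any distance
    push = (1 - y) * np.maximum(0.0, margin - d) ** 2   # dissimilar pairs: penalize only inside the margin
    return pull + push

# Three pairs: similar at distance 0.5, dissimilar at 0.5, dissimilar at 5.0
emb_i = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
emb_j = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
y = np.array([1, 0, 0])

per_pair = contrastive_loss(emb_i, emb_j, y, margin=1.0)
print(per_pair)
```

The third pair is already farther apart than the margin, so its loss is zero and it would contribute no gradient.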
Triplet Loss
Derivation & Margin
Triplet loss operates on triplets $(a, p, n)$: an anchor, a positive (same class), and a negative (different class). The goal: the anchor should be closer to the positive than to the negative by at least a margin $\alpha$:
$$d(f(a), f(p)) + \alpha < d(f(a), f(n))$$
The loss penalizes violations of this constraint, with $d$ taken to be the squared Euclidean distance:
$$\boxed{\mathcal{L}_{\text{triplet}} = \max\left(0,\;\|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha\right)}$$
Geometry: For each anchor, the loss asks that every positive be closer than every negative by at least the margin $\alpha$ (measured here in squared distance). Equivalently, positives should fall inside a ball around the anchor and negatives outside it, with a gap of $\alpha$ between them. The loss is zero for "easy" triplets where the negative is already far enough away.
Gradient: When the loss is active (i.e., the margin constraint is violated), the gradients are as follows. A gradient-descent step moves each embedding against its gradient:
- $\nabla_{f(a)} \mathcal{L} = 2(f(n) - f(p))$: the update moves the anchor toward the positive and away from the negative
- $\nabla_{f(p)} \mathcal{L} = 2(f(p) - f(a))$: the update moves the positive toward the anchor
- $\nabla_{f(n)} \mathcal{L} = 2(f(a) - f(n))$: the update moves the negative away from the anchor
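The three gradient expressions can be checked against central finite differences. This is a sketch under stated assumptions: a single hand-picked 2D triplet for which the loss is active, and hypothetical helper names (`triplet_value`, `analytic_grads`, `numeric_grad`).

```python
import numpy as np

def triplet_value(a, p, n, alpha=1.0):
    """Triplet loss for one triplet, using squared Euclidean distances."""
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + alpha)

def analytic_grads(a, p, n):
    """Gradients w.r.t. anchor, positive, negative (valid when the loss is active)."""
    return 2 * (n - p), 2 * (p - a), 2 * (a - n)

def numeric_grad(fn, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function fn at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad

a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 1.2])

loss_before = triplet_value(a, p, n)  # 1.0 - 1.44 + 1.0 = 0.56, so the loss is active
g_a, g_p, g_n = analytic_grads(a, p, n)
num_a = numeric_grad(lambda x: triplet_value(x, p, n), a)
num_p = numeric_grad(lambda x: triplet_value(a, x, n), p)
num_n = numeric_grad(lambda x: triplet_value(a, p, x), n)

# One gradient-descent step on all three embeddings
lr = 0.1
loss_after = triplet_value(a - lr * g_a, p - lr * g_p, n - lr * g_n)
print(f"loss before: {loss_before:.4f}, after one step: {loss_after:.4f}")
```

Because the loss is quadratic in each embedding while active, the finite-difference and analytic gradients should agree to within floating-point error.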
Hard Negative Mining
Most triplets are "easy" (loss = 0) and provide no learning signal. Hard negative mining selects the most informative triplets:
| Strategy | Definition | Properties |
|---|---|---|
| Hardest negative | $n^* = \arg\min_n \|f(a) - f(n)\|$ | Can lead to collapsed embeddings early in training |
| Semi-hard negative | $\|f(a)-f(p)\| < \|f(a)-f(n)\| < \|f(a)-f(p)\| + \alpha$ | Active loss, avoids collapse — used in FaceNet |
| Random negative | Uniformly sampled from different class | Mostly easy triplets, slow learning |
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Compute the mean triplet loss for a batch of embeddings."""
    d_pos = np.sum((anchor - positive)**2, axis=1)  # ||f(a) - f(p)||^2
    d_neg = np.sum((anchor - negative)**2, axis=1)  # ||f(a) - f(n)||^2
    losses = np.maximum(0, d_pos - d_neg + margin)
    return losses.mean()

# Example: 4 triplets in 3D embedding space
np.random.seed(42)
dim = 3
batch_size = 4
# Anchors from class A
anchors = np.random.randn(batch_size, dim)
# Positives: close to anchors (same class)
positives = anchors + np.random.randn(batch_size, dim) * 0.3
# Negatives: typically farther from anchors (different class)
negatives = anchors + np.random.randn(batch_size, dim) * 2.0
margin = 1.0
loss = triplet_loss(anchors, positives, negatives, margin)

# Per-triplet analysis (squared distances, matching the loss)
d_pos = np.sum((anchors - positives)**2, axis=1)
d_neg = np.sum((anchors - negatives)**2, axis=1)
print(f"Triplet loss (margin={margin}): {loss:.4f}\n")
print(f"{'Triplet':>8} {'d(a,p)^2':>8} {'d(a,n)^2':>8} {'Active?':>8}")
for i in range(batch_size):
    # A triplet is active when it violates the margin constraint
    active = "YES" if d_pos[i] - d_neg[i] + margin > 0 else "no"
    print(f"{i:>8} {d_pos[i]:8.3f} {d_neg[i]:8.3f} {active:>8}")
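The mining strategies in the table above can be sketched for a single anchor as follows. The function `mine_negative` is hypothetical, and it follows the table's definitions with unsquared Euclidean distances. Picking the closest candidate inside the semi-hard band is an assumption made for this sketch; the table only defines the band itself.

```python
import numpy as np

def mine_negative(anchor, positive, candidates, margin=1.0, strategy="semi-hard"):
    """Return the index of a negative chosen from `candidates`, or None if none qualifies."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(candidates - anchor, axis=1)
    if strategy == "hardest":
        # Closest negative to the anchor, regardless of where the positive is
        return int(np.argmin(d_neg))
    if strategy == "semi-hard":
        # Farther than the positive, but still inside the margin band
        mask = (d_neg > d_pos) & (d_neg < d_pos + margin)
        if not mask.any():
            return None
        idx = np.flatnonzero(mask)
        # Assumption: among semi-hard candidates, take the closest one
        return int(idx[np.argmin(d_neg[idx])])
    raise ValueError(f"unknown strategy: {strategy}")

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])  # d(a, p) = 1.0
# Candidate negatives at distances 0.5, 1.5, 1.2 and 3.0 from the anchor
candidates = np.array([[0.5, 0.0], [1.5, 0.0], [1.2, 0.0], [3.0, 0.0]])

hardest_idx = mine_negative(anchor, positive, candidates, strategy="hardest")
semi_hard_idx = mine_negative(anchor, positive, candidates, strategy="semi-hard")
print(f"hardest: {hardest_idx}, semi-hard: {semi_hard_idx}")
```

Here the hardest negative sits closer to the anchor than the positive does, while the semi-hard choice lies just beyond the positive but within the margin.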
InfoNCE & NT-Xent
Information-Theoretic Derivation
InfoNCE (named after Noise-Contrastive Estimation, NCE) is derived from a lower bound on mutual information. Given an anchor $x$, one positive sample $x^+$, and $N-1$ negative samples $\{x_1^-, \ldots, x_{N-1}^-\}$, the loss is:
$$\boxed{\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(f(x), f(x^+)) / \tau)}{\exp(\text{sim}(f(x), f(x^+)) / \tau) + \sum_{j=1}^{N-1}\exp(\text{sim}(f(x), f(x_j^-)) / \tau)}}$$
Why this is a softmax cross-entropy: Recognize this as the negative log-probability of the positive being identified among all candidates. It is equivalent to an $N$-way classification problem where the model must identify which of the $N$ samples is the true positive.
Connection to mutual information: The InfoNCE loss provides a lower bound on the mutual information $I(X; X^+)$ between the anchor and positive views:
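$$I(X; X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$
Minimizing $\mathcal{L}_{\text{InfoNCE}}$ maximizes a lower bound on mutual information. The bound is tighter with more negatives ($N$ larger): because the loss is nonnegative, the bound can never exceed $\log N$, so a larger $N$ raises the most mutual information it can certify.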
NT-Xent (Normalized Temperature-scaled Cross Entropy) is the SimCLR variant of InfoNCE. For a batch of $N$ pairs (from $2N$ augmented views), each anchor $z_i$ has one positive $z_j$ and $2(N-1)$ negatives:
$$\mathcal{L}_{\text{NT-Xent}}^{(i,j)} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{k\neq i}\exp(\text{sim}(z_i, z_k)/\tau)}$$
Temperature Scaling
The temperature $\tau$ controls how "peaked" the similarity distribution is:
- $\tau \to 0$: only the hardest negatives contribute gradient (winner-take-all)
- $\tau \to \infty$: all negatives contribute equally (uniform weighting)
- $\tau = 0.07$ (the value CLIP uses to initialize its learnable temperature): sharp distribution, strong negative penalties
- $\tau = 0.5$ (a value reported in SimCLR experiments; the best setting depends on the setup): softer distribution
Mathematical effect: Dividing logits by $\tau$ before softmax is equivalent to raising the $\tau = 1$ softmax probabilities to the power $1/\tau$ and renormalizing. Low $\tau$ amplifies differences between similar and dissimilar pairs.
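The power relationship, including the renormalization step it needs, can be checked directly. This is a small sketch with made-up cosine similarities (positive first); the `softmax` helper is defined locally rather than taken from a library.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D array."""
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

# Illustrative cosine similarities: one positive followed by three negatives
sims = np.array([0.8, 0.5, 0.1, -0.2])
tau = 0.1

p_tau = softmax(sims / tau)        # temperature-scaled softmax
p_one = softmax(sims)              # softmax at tau = 1
p_power = p_one ** (1.0 / tau)     # raise to the power 1/tau ...
p_power = p_power / p_power.sum()  # ... and renormalize

print("softmax(sims / tau):      ", np.round(p_tau, 4))
print("renormalized power of p_1:", np.round(p_power, 4))

# Lower temperature concentrates mass; very high temperature approaches uniform
for t in [0.05, 0.5, 1e6]:
    print(f"tau={t:g}: max prob = {softmax(sims / t).max():.4f}")
```

Without the renormalization, the powered probabilities would not sum to one, so the equivalence holds only up to that normalizing constant.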
import numpy as np

def infonce_loss(anchor, positive, negatives, temperature=0.07):
    """
    Compute InfoNCE loss for a single anchor.

    Args:
        anchor: embedding vector, shape (d,)
        positive: positive embedding, shape (d,)
        negatives: negative embeddings, shape (N-1, d)
        temperature: scaling parameter
    Returns:
        Scalar loss value
    """
    # L2 normalize all embeddings
    anchor = anchor / np.linalg.norm(anchor)
    positive = positive / np.linalg.norm(positive)
    negatives = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    # Temperature-scaled cosine similarities (logits)
    sim_pos = np.dot(anchor, positive) / temperature
    sim_neg = np.dot(negatives, anchor) / temperature  # shape (N-1,)
    # InfoNCE = -log(exp(sim_pos) / (exp(sim_pos) + sum(exp(sim_neg))))
    # Use log-sum-exp trick for stability
    all_logits = np.concatenate([[sim_pos], sim_neg])
    max_logit = np.max(all_logits)
    log_sum_exp = max_logit + np.log(np.sum(np.exp(all_logits - max_logit)))
    loss = -(sim_pos - log_sum_exp)
    return loss

# Example: 128-dim embeddings, 1 positive + 255 negatives
np.random.seed(42)
d = 128
n_negatives = 255
anchor = np.random.randn(d)
positive = anchor + np.random.randn(d) * 0.3  # close to anchor
negatives = np.random.randn(n_negatives, d)   # random negatives

# Compare different temperatures
for tau in [0.01, 0.07, 0.5, 1.0]:
    loss = infonce_loss(anchor, positive, negatives, temperature=tau)
    print(f"tau={tau:.2f}: InfoNCE loss = {loss:.4f}")

# MI lower bound: I(X; X+) >= log(N) - L_InfoNCE
# (the bound is on the expected loss; this uses a single-anchor estimate)
N = n_negatives + 1
loss_tau = infonce_loss(anchor, positive, negatives, temperature=0.07)
mi_bound = np.log(N) - loss_tau
print(f"\nMI lower bound: >= {mi_bound:.4f} nats")
print(f"log(N) = log({N}) = {np.log(N):.4f} nats (maximum possible)")
CLIP Objective
CLIP (Contrastive Language-Image Pre-training) applies InfoNCE symmetrically across image and text modalities. Given a batch of $N$ (image, text) pairs, CLIP treats each matched pair as a positive. For each image, the $N-1$ texts from the other pairs serve as negatives, and likewise for each text with the $N-1$ other images.
Let $I_i = f_{\text{image}}(x_i)$ and $T_j = f_{\text{text}}(t_j)$ be L2-normalized embeddings. The similarity matrix is $S_{ij} = I_i \cdot T_j / \tau$. CLIP minimizes:
$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}}\right)$$
where:
$$\mathcal{L}_{\text{i2t}} = -\frac{1}{N}\sum_{i=1}^N \log\frac{\exp(S_{ii})}{\sum_{j=1}^N \exp(S_{ij})} \quad \text{(image-to-text matching)}$$
$$\mathcal{L}_{\text{t2i}} = -\frac{1}{N}\sum_{j=1}^N \log\frac{\exp(S_{jj})}{\sum_{i=1}^N \exp(S_{ij})} \quad \text{(text-to-image matching)}$$
This is equivalent to cross-entropy loss on the $N \times N$ similarity matrix with the identity as the target: each row and column should have its maximum on the diagonal.
import numpy as np

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    Compute symmetric CLIP contrastive loss.

    Args:
        image_embeddings: shape (N, d), L2-normalized
        text_embeddings: shape (N, d), L2-normalized
        temperature: temperature (a fixed float here; CLIP learns it during training)
    Returns:
        Scalar CLIP loss
    """
    N = len(image_embeddings)
    # Similarity matrix: S[i,j] = sim(image_i, text_j) / tau
    S = (image_embeddings @ text_embeddings.T) / temperature  # (N, N)
    # Labels: diagonal entries are positives (identity permutation)
    labels = np.arange(N)

    def cross_entropy(logits):
        # Row-wise cross-entropy against `labels`, using a stable log-sum-exp
        row_max = logits.max(axis=1, keepdims=True)
        log_sum_exp = np.log(np.sum(np.exp(logits - row_max), axis=1)) + row_max[:, 0]
        return np.mean(log_sum_exp - logits[np.arange(N), labels])

    # Image-to-text loss: each row should peak at the diagonal
    loss_i2t = cross_entropy(S)
    # Text-to-image loss: each column should peak at the diagonal
    loss_t2i = cross_entropy(S.T)
    return 0.5 * (loss_i2t + loss_t2i)

# Simulate a mini-batch of 8 image-text pairs
np.random.seed(42)
N, d = 8, 64
# Create embeddings where matched pairs are similar
shared = np.random.randn(N, d)
image_emb = shared + np.random.randn(N, d) * 0.2
text_emb = shared + np.random.randn(N, d) * 0.2
# L2 normalize
image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
loss = clip_loss(image_emb, text_emb, temperature=0.07)
print(f"CLIP loss: {loss:.4f}")
print(f"Random baseline (N={N}): {np.log(N):.4f}")
print("Perfect matching: ~0.0")

# Show similarity matrix diagonal vs off-diagonal (raw cosine, no temperature)
S = image_emb @ text_emb.T
print("\nSimilarity matrix stats:")
print(f"  Diagonal (pos pairs): mean={np.diag(S).mean():.3f}")
print(f"  Off-diagonal (neg):   mean={S[~np.eye(N, dtype=bool)].mean():.3f}")
Why Contrastive Learning Works
Alignment and Uniformity: Good contrastive representations satisfy two properties:
- Alignment: Positive pairs should map to nearby points: $\mathbb{E}_{(x,x^+)}\|f(x) - f(x^+)\|^2$ should be small
- Uniformity: Embeddings should be uniformly distributed on the hypersphere: $\log \mathbb{E}_{x,y}\exp(-2\|f(x) - f(y)\|^2)$ should be minimized
InfoNCE implicitly optimizes both: the numerator rewards alignment (high similarity to positive) while the denominator rewards uniformity (low similarity to all negatives).
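Both quantities can be estimated from a batch of embeddings. The sketch below assumes L2-normalized embeddings and evaluates the uniformity term with the factor 2 from the formula above (exposed as `t=2.0`). The function names `alignment` and `uniformity` and the synthetic data are illustrative only.

```python
import numpy as np

def normalize(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(z, z_pos):
    """Mean squared distance between positive pairs (lower is better)."""
    return np.mean(np.sum((z - z_pos) ** 2, axis=1))

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower is better)."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)  # each unordered pair once, no self-pairs
    return np.log(np.mean(np.exp(-t * sq_dists[upper])))

rng = np.random.default_rng(0)
z = normalize(rng.normal(size=(256, 32)))              # spread over the hypersphere
z_pos = normalize(z + 0.05 * rng.normal(size=z.shape))  # slightly perturbed positives
z_shuffled = np.roll(z, 1, axis=0)                      # mismatched "positives"
collapsed = normalize(np.ones((256, 32)) + 0.01 * rng.normal(size=(256, 32)))

print(f"alignment (true positives): {alignment(z, z_pos):.4f}")
print(f"alignment (mismatched):     {alignment(z, z_shuffled):.4f}")
print(f"uniformity (spread):        {uniformity(z):.4f}")
print(f"uniformity (collapsed):     {uniformity(collapsed):.4f}")
```

Spread-out embeddings should score a much lower (more negative) uniformity value than the nearly collapsed set, whose value sits close to zero.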
Dimensional collapse prevention: Without enough negatives or with too-easy augmentations, embeddings can collapse to a lower-dimensional subspace (or a single point). Uniformity on the hypersphere prevents this by requiring the embeddings to "spread out." Techniques like variance-invariance-covariance (VICReg) regularization address collapse more directly.
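One common way to look for dimensional collapse is to inspect the singular values of the centered embedding matrix. This diagnostic is not prescribed by the text above; it is a sketch, with a hypothetical helper `effective_dims` and an arbitrarily chosen relative tolerance.

```python
import numpy as np

def effective_dims(embeddings, rel_tol=1e-6):
    """Count singular values of the centered embeddings above rel_tol * the largest."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, sorted descending
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
# Embeddings that use all 16 dimensions
healthy = rng.normal(size=(200, 16))
# Embeddings confined to a 3-dimensional subspace of the 16-dimensional space
collapsed = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 16))

print(f"healthy:   {effective_dims(healthy)} of 16 dimensions in use")
print(f"collapsed: {effective_dims(collapsed)} of 16 dimensions in use")
```

In practice collapse is rarely this clean, so the full spectrum of singular values is usually more informative than a single thresholded count.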
Connection to downstream tasks: If two inputs are semantically similar (under relevant augmentations), their embeddings should be close. This creates a representation where linear probes can separate classes, because class boundaries align with distance in embedding space.
Practice Exercises
- Triplet loss gradient: Derive $\nabla_{f(a)}\mathcal{L}_{\text{triplet}}$ when the loss is active. Verify that the gradient pushes the anchor toward the positive and away from the negative.
- InfoNCE as classification: Show that InfoNCE is equivalent to $N$-way cross-entropy where the "correct class" is the positive sample. Write out the softmax probabilities explicitly.
- Temperature analysis: For cosine similarities $[0.9, 0.3, 0.2, 0.1]$ (positive first), compute the InfoNCE loss for $\tau \in \{0.01, 0.1, 1.0\}$. How does temperature affect which negatives contribute most to the gradient?
- MI bound tightness: Show that $I(X;X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$ by relating InfoNCE to the density ratio $\frac{p(x^+|x)}{p(x^+)}$.
- CLIP symmetry: Explain why CLIP uses both image→text and text→image losses. What could go wrong with only one direction?
- Coding challenge: Implement NT-Xent for a batch of 32 pairs (64 augmented views). Verify your implementation against a direct evaluation of the formula, and check the limiting case where every positive pair has cosine similarity 1.