Contrastive & Metric Learning Mathematics

Math Foundations: This extension builds on Part 6: Information Theory (mutual information, KL divergence), Part 7: Linear Algebra (cosine similarity, norms), and Part 9: ML Math (softmax, cross-entropy). It is the canonical reference for contrastive learning content across PyTorch Transfer Learning and AI in the Wild: Multimodal AI.

Metric Learning Foundations

The goal of metric learning is to learn an embedding function $f_\theta: \mathcal{X} \to \mathbb{R}^d$ such that similar inputs map to nearby points and dissimilar inputs map to distant points in the embedding space.

Distance & Similarity Measures

Measure	Formula	Range	Used In
Euclidean distance	$d(u,v) = \|u - v\|_2$	$[0, \infty)$	Triplet loss, clustering
Cosine similarity	$\text{sim}(u,v) = \frac{u \cdot v}{\|u\|\|v\|}$	$[-1, 1]$	InfoNCE, CLIP
Dot product	$s(u,v) = u \cdot v$	$(-\infty, \infty)$	Attention, retrieval

Why normalize? With L2-normalized embeddings ($\|u\| = \|v\| = 1$), cosine similarity equals the dot product, and Euclidean distance relates to cosine by $\|u-v\|^2 = 2(1 - \cos(u,v))$. Normalization prevents the model from "cheating" by simply increasing embedding magnitudes to reduce loss.
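The two identities in the paragraph above can be checked numerically. The following is a minimal sketch on random vectors; the dimension, seed, and variable names are illustrative assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(16)
v = rng.standard_normal(16)

# L2-normalize so both vectors lie on the unit hypersphere
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

# Cosine similarity of the raw vectors
cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# After normalization, the plain dot product should equal the cosine
dot_normalized = np.dot(u_hat, v_hat)

# Squared Euclidean distance between the normalized vectors
sq_dist = np.sum((u_hat - v_hat) ** 2)

print(f"cosine(u, v)        = {cos_uv:.6f}")
print(f"dot(u_hat, v_hat)   = {dot_normalized:.6f}")
print(f"||u_hat - v_hat||^2 = {sq_dist:.6f}")
print(f"2 * (1 - cosine)    = {2 * (1 - cos_uv):.6f}")
```

On the unit sphere, ranking neighbors by Euclidean distance and ranking them by cosine similarity therefore give the same order.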

Contrastive Loss (Siamese Networks)

The original contrastive loss for pairs $(x_i, x_j)$ with binary label $y_{ij} \in \{0, 1\}$ (1 = similar):

$$\mathcal{L}_{\text{contrastive}} = y_{ij}\,d(f(x_i), f(x_j))^2 + (1-y_{ij})\,\max(0, m - d(f(x_i), f(x_j)))^2$$

Similar pairs are pulled together (minimize distance). Dissimilar pairs are pushed apart, but only up to margin $m$ — once they're farther than $m$, the loss is zero and no gradient flows.
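As a rough illustration of the pull and push terms, the formula above can be sketched in NumPy. The function name `contrastive_loss` and the toy pairs are assumptions chosen so that each of the three cases (similar, dissimilar inside the margin, dissimilar outside the margin) appears once.

```python
import numpy as np

def contrastive_loss(emb_i, emb_j, y, margin=1.0):
    """Per-pair contrastive loss; y = 1 for similar pairs, 0 for dissimilar."""
    d = np.linalg.norm(emb_i - emb_j, axis=1)                  # Euclidean distance per pair
    pull = y * d ** 2                                          # similar pairs: shrink distance
    push = (1 - y) * np.maximum(0.0, margin - d) ** 2          # dissimilar pairs: active only inside margin
    return pull + push

# Toy pairs (hypothetical values): distances are 0.5, 0.5 and 5.0
emb_i = np.zeros((3, 2))
emb_j = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
y = np.array([1, 0, 0])

per_pair = contrastive_loss(emb_i, emb_j, y, margin=1.0)
for label, loss in zip(y, per_pair):
    print(f"y={label}: loss={loss:.4f}")
```

The third pair is dissimilar and already farther apart than the margin, so it contributes zero loss and therefore no gradient.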

Triplet Loss

Derivation & Margin

Triplet loss operates on triplets $(a, p, n)$: an anchor, a positive (same class), and a negative (different class). The goal: the anchor should be closer to the positive than to the negative by at least a margin $\alpha$:

$$\|f(a) - f(p)\|^2 + \alpha < \|f(a) - f(n)\|^2$$

The loss penalizes violations:

$$\boxed{\mathcal{L}_{\text{triplet}} = \max\left(0,\;\|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha\right)}$$

Geometry: For each anchor, the positive defines a ball around $f(a)$ with squared radius $\|f(a) - f(p)\|^2 + \alpha$. The loss is zero once the negative lies outside that ball, so "easy" triplets, where the negative is already far away, contribute no gradient.

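Gradient: When the loss is active (i.e., the margin constraint is violated), the gradients with respect to the three embeddings are listed below. A gradient-descent step moves each embedding along the negative gradient, which produces the motion described on each line: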

$\nabla_{f(a)} \mathcal{L} = 2(f(n) - f(p))$ — move anchor away from negative, toward positive
$\nabla_{f(p)} \mathcal{L} = 2(f(p) - f(a))$ — move positive toward anchor
$\nabla_{f(n)} \mathcal{L} = 2(f(a) - f(n))$ — move negative away from anchor
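One way to sanity-check the three gradients above is a finite-difference comparison. The sketch below evaluates the un-clipped objective (the expression inside the max), which matches the loss whenever the triplet is active; the helper names `triplet_value` and `numerical_grad` are illustrative assumptions.

```python
import numpy as np

def triplet_value(a, p, n, alpha=1.0):
    """Un-clipped triplet objective for one triplet, using squared distances."""
    return np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + alpha

def numerical_grad(fn, x, eps=1e-6):
    """Central finite-difference gradient of fn with respect to the vector x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
a, p, n = rng.standard_normal((3, 5))  # three random 5-dimensional embeddings

grad_a = numerical_grad(lambda x: triplet_value(x, p, n), a)
grad_p = numerical_grad(lambda x: triplet_value(a, x, n), p)
grad_n = numerical_grad(lambda x: triplet_value(a, p, x), n)

print("anchor   matches 2(n - p):", np.allclose(grad_a, 2 * (n - p), atol=1e-5))
print("positive matches 2(p - a):", np.allclose(grad_p, 2 * (p - a), atol=1e-5))
print("negative matches 2(a - n):", np.allclose(grad_n, 2 * (a - n), atol=1e-5))
```

Note that the anchor gradient does not depend on $f(a)$ itself: the two quadratic terms in $f(a)$ cancel, leaving only the difference between negative and positive.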

Hard Negative Mining

Most triplets are "easy" (loss = 0) and provide no learning signal. Hard negative mining selects the most informative triplets:

Strategy	Definition	Properties
Hardest negative	$n^* = \arg\min_n \|f(a) - f(n)\|$	Can lead to collapsed embeddings early in training
Semi-hard negative	$\|f(a)-f(p)\| < \|f(a)-f(n)\| < \|f(a)-f(p)\| + \alpha$	Active loss, avoids collapse; used in FaceNet
Random negative	Uniformly sampled from different class	Mostly easy triplets, slow learning

import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Compute triplet loss for a batch of embeddings."""
    d_pos = np.sum((anchor - positive)**2, axis=1)  # ||f(a) - f(p)||^2
    d_neg = np.sum((anchor - negative)**2, axis=1)  # ||f(a) - f(n)||^2
    losses = np.maximum(0, d_pos - d_neg + margin)
    return losses.mean()

# Example: 4 triplets in 3D embedding space
np.random.seed(42)
dim = 3
batch_size = 4

# Anchors from class A
anchors = np.random.randn(batch_size, dim)
# Positives: close to anchors (same class)
positives = anchors + np.random.randn(batch_size, dim) * 0.3
# Negatives: far from anchors (different class)
negatives = anchors + np.random.randn(batch_size, dim) * 2.0

margin = 1.0
loss = triplet_loss(anchors, positives, negatives, margin)

# Per-triplet analysis
d_pos = np.sum((anchors - positives)**2, axis=1)
d_neg = np.sum((anchors - negatives)**2, axis=1)

print(f"Triplet loss (margin={margin}): {loss:.4f}\n")
print(f"{'Triplet':>8} {'d(a,p)^2':>8} {'d(a,n)^2':>8} {'Active?':>8}")
for i in range(batch_size):
    active = "YES" if d_pos[i] - d_neg[i] + margin > 0 else "no"
    print(f"{i:>8} {d_pos[i]:8.3f} {d_neg[i]:8.3f} {active:>8}")
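The mining strategies from the table in this section can be sketched for a single anchor as follows. This is a simplified illustration, not FaceNet's actual batching procedure: the function name `mine_negative` is hypothetical, the distances are unsquared to match the table, and picking the closest candidate inside the semi-hard band is an assumption made here for determinism.

```python
import numpy as np

def mine_negative(anchor, positive, candidates, margin=1.0, strategy="semi-hard"):
    """Return the index of a mined negative in `candidates`, or None if none qualifies."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(candidates - anchor, axis=1)

    if strategy == "hardest":
        # Closest negative overall, regardless of where the positive sits
        return int(np.argmin(d_neg))

    if strategy == "semi-hard":
        # Farther than the positive, but still inside the margin band
        mask = (d_neg > d_pos) & (d_neg < d_pos + margin)
        if not mask.any():
            return None
        idx = np.flatnonzero(mask)
        return int(idx[np.argmin(d_neg[idx])])

    raise ValueError(f"unknown strategy: {strategy}")

# Toy 2-D example (hypothetical values): d(a,p) = 1, negatives at distance 0.5, 1.5, 3.0
anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])
candidates = np.array([[0.5, 0.0], [1.5, 0.0], [3.0, 0.0]])

hardest_idx = mine_negative(anchor, positive, candidates, strategy="hardest")
semi_hard_idx = mine_negative(anchor, positive, candidates, strategy="semi-hard")
print("hardest negative index:  ", hardest_idx)
print("semi-hard negative index:", semi_hard_idx)
```

Here the hardest negative sits closer to the anchor than the positive does, while the semi-hard choice is the one in the band between $d(a,p)$ and $d(a,p) + \alpha$. The negative at distance 3.0 is an easy one and would give zero loss.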

InfoNCE & NT-Xent

Information-Theoretic Derivation

InfoNCE, named after Noise-Contrastive Estimation (NCE), is derived from a lower bound on mutual information. Given an anchor $x$, one positive sample $x^+$, and $N-1$ negative samples $\{x_1^-, \ldots, x_{N-1}^-\}$, the loss is:

$$\boxed{\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(f(x), f(x^+)) / \tau)}{\exp(\text{sim}(f(x), f(x^+)) / \tau) + \sum_{j=1}^{N-1}\exp(\text{sim}(f(x), f(x_j^-)) / \tau)}}$$

Why this is a softmax cross-entropy: Recognize this as the negative log-probability of the positive being identified among all candidates. It is equivalent to an $N$-way classification problem where the model must identify which of the $N$ candidates is the true positive.

Connection to mutual information: The InfoNCE loss provides a lower bound on the mutual information $I(X; X^+)$ between the anchor and positive views:

$$I(X; X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$

Minimizing $\mathcal{L}_{\text{InfoNCE}}$ maximizes a lower bound on mutual information. Because the loss is non-negative, the bound can never exceed $\log N$, so more negatives (larger $N$) raise this ceiling and allow a tighter estimate.

NT-Xent (Normalized Temperature-scaled Cross Entropy) is the SimCLR variant of InfoNCE. For a batch of $N$ pairs (from $2N$ augmented views), each anchor $z_i$ has one positive $z_j$ and $2(N-1)$ negatives:

$$\mathcal{L}_{\text{NT-Xent}}^{(i,j)} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{k\neq i}\exp(\text{sim}(z_i, z_k)/\tau)}$$

Temperature Scaling

The temperature $\tau$ controls how "peaked" the similarity distribution is:

$\tau \to 0$: only the hardest negatives contribute gradient (winner-take-all)
$\tau \to \infty$: all negatives contribute equally (uniform attention)
$\tau = 0.07$ (CLIP's initial value; $\tau$ is then learned): sharp distribution, strong negative penalties
$\tau = 0.5$ (a commonly cited SimCLR setting): softer distribution

Mathematical effect: Dividing logits by $\tau$ before softmax is equivalent to raising the $\tau = 1$ softmax probabilities to the power $1/\tau$ and then renormalizing. Low $\tau$ amplifies differences between similar and dissimilar pairs.
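The equivalence between temperature scaling and "power, then renormalize" can be checked directly. The sketch below uses made-up similarity values and compares the two routes for a few temperatures; nothing here comes from a specific library's API beyond basic NumPy.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

# Hypothetical cosine similarities, positive first (illustrative values only)
sims = np.array([0.8, 0.5, 0.1, -0.2])
p_base = softmax(sims)  # distribution at tau = 1

results = {}
for tau in [0.05, 0.5, 5.0]:
    p_tau = softmax(sims / tau)          # temperature applied to the logits
    powered = p_base ** (1.0 / tau)      # same distribution via powers of p_base
    p_power = powered / powered.sum()    # renormalize so it sums to 1
    results[tau] = (p_tau, p_power)
    print(f"tau={tau:<4}: p(positive)={p_tau[0]:.4f}, "
          f"routes agree={np.allclose(p_tau, p_power)}")
```

As $\tau$ shrinks, the probability mass concentrates on the largest similarity; as it grows, the distribution flattens toward uniform.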

import numpy as np

def infonce_loss(anchor, positive, negatives, temperature=0.07):
    """
    Compute InfoNCE loss.
    
    Args:
        anchor: embedding vector, shape (d,)
        positive: positive embedding, shape (d,)
        negatives: negative embeddings, shape (N-1, d)
        temperature: scaling parameter
    
    Returns:
        Scalar loss value
    """
    # L2 normalize all embeddings
    anchor = anchor / np.linalg.norm(anchor)
    positive = positive / np.linalg.norm(positive)
    negatives = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    
    # Cosine similarities
    sim_pos = np.dot(anchor, positive) / temperature
    sim_neg = np.dot(negatives, anchor) / temperature  # shape (N-1,)
    
    # InfoNCE = -log(exp(sim_pos) / (exp(sim_pos) + sum(exp(sim_neg))))
    # Use log-sum-exp trick for stability
    all_logits = np.concatenate([[sim_pos], sim_neg])
    log_sum_exp = np.max(all_logits) + np.log(np.sum(np.exp(all_logits - np.max(all_logits))))
    loss = -(sim_pos - log_sum_exp)
    return loss

# Example: 128-dim embeddings, 1 positive + 255 negatives
np.random.seed(42)
d = 128
n_negatives = 255

anchor = np.random.randn(d)
positive = anchor + np.random.randn(d) * 0.3  # close to anchor
negatives = np.random.randn(n_negatives, d)     # random negatives

# Compare different temperatures
for tau in [0.01, 0.07, 0.5, 1.0]:
    loss = infonce_loss(anchor, positive, negatives, temperature=tau)
    print(f"tau={tau:.2f}: InfoNCE loss = {loss:.4f}")

# MI lower bound: I(X; X+) >= log(N) - L_InfoNCE
N = n_negatives + 1
best_loss = infonce_loss(anchor, positive, negatives, temperature=0.07)
mi_bound = np.log(N) - best_loss
print(f"\nMI lower bound: >= {mi_bound:.4f} nats")
print(f"log(N) = log({N}) = {np.log(N):.4f} nats (maximum possible)")

CLIP Objective

CLIP (Contrastive Language-Image Pre-training) applies InfoNCE symmetrically across image and text modalities. Given a batch of $N$ (image, text) pairs, CLIP treats each matched pair as a positive; for each image, the $N-1$ non-matching texts in the batch serve as negatives, and likewise for each text.

Let $I_i = f_{\text{image}}(x_i)$ and $T_j = f_{\text{text}}(t_j)$ be L2-normalized embeddings. The similarity matrix is $S_{ij} = I_i \cdot T_j / \tau$. CLIP minimizes:

$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}}\right)$$

where:

$$\mathcal{L}_{\text{i2t}} = -\frac{1}{N}\sum_{i=1}^N \log\frac{\exp(S_{ii})}{\sum_{j=1}^N \exp(S_{ij})} \quad \text{(image→text matching)}$$

$$\mathcal{L}_{\text{t2i}} = -\frac{1}{N}\sum_{j=1}^N \log\frac{\exp(S_{jj})}{\sum_{i=1}^N \exp(S_{ij})} \quad \text{(text→image matching)}$$

This is equivalent to cross-entropy loss on the $N \times N$ similarity matrix with the identity as the target: each row and column should have its maximum on the diagonal.

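Scaling insight: CLIP's batch size of 32,768 means each anchor has 32,767 negatives, which raises the ceiling $\log N$ of the InfoNCE bound and pushes the learned representations to be highly discriminative. The temperature is optimized jointly with the encoders: $\tau$ is initialized to $0.07$, so the logit scale $1/\tau$ that multiplies the similarities starts at $1/0.07 \approx 14.3$.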
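The relationship between the temperature and the logit scale can be made concrete with a small sketch. It assumes the log-parameterization described in the CLIP paper (a learned scalar whose exponential multiplies the similarities, capped so logits are never scaled by more than 100); the names `log_logit_scale`, `MAX_LOGIT_SCALE`, and `scaled_logits` are illustrative, and no training is performed here.

```python
import numpy as np

# Learned scalar, stored in log space; initialized so that tau = 0.07
log_logit_scale = np.log(1.0 / 0.07)
MAX_LOGIT_SCALE = 100.0  # cap on the multiplicative scale

def scaled_logits(image_emb, text_emb, log_scale):
    """Similarity matrix multiplied by the (capped) logit scale exp(log_scale)."""
    scale = min(np.exp(log_scale), MAX_LOGIT_SCALE)
    return scale * (image_emb @ text_emb.T)

rng = np.random.default_rng(0)
img = rng.standard_normal((4, 16))
txt = rng.standard_normal((4, 16))
img /= np.linalg.norm(img, axis=1, keepdims=True)  # L2-normalize rows
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = scaled_logits(img, txt, log_logit_scale)
print(f"initial logit scale 1/tau = {np.exp(log_logit_scale):.2f}")
print(f"equivalent temperature    = {1.0 / np.exp(log_logit_scale):.2f}")
print(f"largest |logit| in batch  = {np.abs(logits).max():.2f}")
```

Storing the scale in log space keeps it positive under unconstrained gradient updates, and the cap bounds how sharp the softmax can become.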

import numpy as np

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    Compute symmetric CLIP contrastive loss.
    
    Args:
        image_embeddings: shape (N, d), L2-normalized
        text_embeddings: shape (N, d), L2-normalized
        temperature: learnable temperature parameter
    
    Returns:
        Scalar CLIP loss
    """
    N = len(image_embeddings)
    
    # Similarity matrix: S[i,j] = sim(image_i, text_j) / tau
    S = (image_embeddings @ text_embeddings.T) / temperature  # (N, N)
    
    # Labels: diagonal entries are positives (identity permutation)
    labels = np.arange(N)
    
    # Image-to-text loss: each row should peak at diagonal
    # = cross_entropy(S, labels) along rows
    log_sum_exp_rows = np.log(np.sum(np.exp(S - S.max(axis=1, keepdims=True)), axis=1)) + S.max(axis=1)
    loss_i2t = np.mean(-S[np.arange(N), labels] + log_sum_exp_rows)
    
    # Text-to-image loss: each column should peak at diagonal
    log_sum_exp_cols = np.log(np.sum(np.exp(S.T - S.T.max(axis=1, keepdims=True)), axis=1)) + S.T.max(axis=1)
    loss_t2i = np.mean(-S.T[np.arange(N), labels] + log_sum_exp_cols)
    
    return 0.5 * (loss_i2t + loss_t2i)

# Simulate a mini-batch of 8 image-text pairs
np.random.seed(42)
N, d = 8, 64

# Create embeddings where matched pairs are similar
shared = np.random.randn(N, d)
image_emb = shared + np.random.randn(N, d) * 0.2
text_emb = shared + np.random.randn(N, d) * 0.2

# L2 normalize
image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

loss = clip_loss(image_emb, text_emb, temperature=0.07)
print(f"CLIP loss: {loss:.4f}")
print(f"Random baseline (N={N}): {np.log(N):.4f}")
print(f"Perfect matching: ~0.0")

# Show similarity matrix diagonal vs off-diagonal
S = image_emb @ text_emb.T
print(f"\nSimilarity matrix stats:")
print(f"  Diagonal (pos pairs):  mean={np.diag(S).mean():.3f}")
print(f"  Off-diagonal (neg):    mean={S[~np.eye(N, dtype=bool)].mean():.3f}")

Why Contrastive Learning Works

Alignment and Uniformity: Good contrastive representations satisfy two properties:

Alignment: Positive pairs should map to nearby points: $\mathbb{E}_{(x,x^+)}\|f(x) - f(x^+)\|^2$ should be small
Uniformity: Embeddings should be uniformly distributed on the hypersphere: $\log \mathbb{E}_{x,y}\exp(-2\|f(x) - f(y)\|^2)$ should be minimized

InfoNCE implicitly optimizes both: the numerator rewards alignment (high similarity to positive) while the denominator rewards uniformity (low similarity to all negatives).
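Both quantities can be estimated from a batch of embeddings. The sketch below implements the two expressions from the list above with the exponent coefficient 2 exposed as `t`; the synthetic "spread" and "collapsed" embeddings and all function names are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(x, x_pos):
    """Mean squared distance between each embedding and its positive."""
    return np.mean(np.sum((x - x_pos) ** 2, axis=1))

def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is more uniform)."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once, no self-pairs
    return np.log(np.mean(np.exp(-t * sq_dists[upper])))

rng = np.random.default_rng(0)
n, d = 256, 32

# Well-spread embeddings vs. embeddings bunched around a single direction
spread = l2_normalize(rng.standard_normal((n, d)))
collapsed = l2_normalize(np.ones((n, d)) + 0.01 * rng.standard_normal((n, d)))

# Positives: small perturbations of the spread embeddings
positives = l2_normalize(spread + 0.05 * rng.standard_normal((n, d)))

print(f"alignment (true positives): {alignment(spread, positives):.4f}")
print(f"alignment (mismatched):     {alignment(spread, spread[::-1]):.4f}")
print(f"uniformity (spread):        {uniformity(spread):.4f}")
print(f"uniformity (collapsed):     {uniformity(collapsed):.4f}")
```

The collapsed set would score very well on alignment alone, since every point is close to every other point, which is why the uniformity term is needed to rule it out.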

Dimensional collapse prevention: Without enough negatives or with too-easy augmentations, embeddings can collapse to a lower-dimensional subspace (or a single point). Uniformity on the hypersphere prevents this by requiring the embeddings to "spread out." Techniques like variance-invariance-covariance (VICReg) regularization address collapse more directly.
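A common diagnostic for dimensional collapse is to look at the singular values of the centered embedding matrix: directions the embeddings never use show up as near-zero singular values. The sketch below is one simple way to do this; the relative threshold `tol` and the helper names are assumptions, not a standard definition.

```python
import numpy as np

def singular_spectrum(embeddings):
    """Singular values of the centered embedding matrix, largest first."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    return np.linalg.svd(centered, compute_uv=False)

def numerical_rank(singular_values, tol=1e-3):
    """Count singular values above tol times the largest one (hypothetical threshold)."""
    return int(np.sum(singular_values > tol * singular_values[0]))

rng = np.random.default_rng(0)
n, d, k = 512, 32, 4

# Embeddings that use all d dimensions
full = rng.standard_normal((n, d))

# Embeddings confined to a k-dimensional subspace of the d-dimensional space
basis = rng.standard_normal((k, d))
collapsed = rng.standard_normal((n, k)) @ basis

rank_full = numerical_rank(singular_spectrum(full))
rank_collapsed = numerical_rank(singular_spectrum(collapsed))
print(f"dimensions in use (full):      {rank_full} of {d}")
print(f"dimensions in use (collapsed): {rank_collapsed} of {d}")
```

In practice the spectrum of a partially collapsed model decays gradually rather than dropping to zero, so plotting the sorted singular values on a log scale is often more informative than a single rank number.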

Connection to downstream tasks: If two inputs are semantically similar (under relevant augmentations), their embeddings should be close. This creates a representation where linear probes can separate classes, because class boundaries align with distance in embedding space.

Practice Exercises

Exercises: Cement the derivations above.

Triplet loss gradient: Derive $\nabla_{f(a)}\mathcal{L}_{\text{triplet}}$ when the loss is active. Verify that the gradient pushes the anchor toward the positive and away from the negative.
InfoNCE as classification: Show that InfoNCE is equivalent to $N$-way cross-entropy where the "correct class" is the positive sample. Write out the softmax probabilities explicitly.
Temperature analysis: For cosine similarities $[0.9, 0.3, 0.2, 0.1]$ (positive first), compute the InfoNCE loss for $\tau \in \{0.01, 0.1, 1.0\}$. How does temperature affect which negatives contribute most to the gradient?
MI bound tightness: Show that $I(X;X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$ by relating InfoNCE to the density ratio $\frac{p(x^+|x)}{p(x^+)}$.
CLIP symmetry: Explain why CLIP uses both image→text and text→image losses. What could go wrong with only one direction?
Coding challenge: Implement NT-Xent for a batch of 32 pairs (64 augmented views). Verify that your loss matches the formula when positives are maximally similar.
