
Contrastive & Metric Learning Mathematics

May 30, 2026 · Wasil Zafar · 30 min read

Contrastive learning teaches models what's similar and what's different, without labeled data. This extension derives the loss functions behind SimCLR, CLIP, and modern self-supervised learning from their information-theoretic foundations.

Table of Contents

  1. Metric Learning Foundations
  2. Triplet Loss
  3. InfoNCE & NT-Xent
  4. CLIP Objective
  5. Why Contrastive Learning Works
  6. Practice Exercises
Math Foundations: This extension builds on Part 6: Information Theory (mutual information, KL divergence), Part 7: Linear Algebra (cosine similarity, norms), and Part 9: ML Math (softmax, cross-entropy). It is the canonical reference for contrastive learning content across PyTorch Transfer Learning and AI in the Wild: Multimodal AI.

Metric Learning Foundations

The goal of metric learning is to learn an embedding function $f_\theta: \mathcal{X} \to \mathbb{R}^d$ such that similar inputs map to nearby points and dissimilar inputs map to distant points in the embedding space.

Distance & Similarity Measures

| Measure | Formula | Range | Used In |
|---|---|---|---|
| Euclidean distance | $d(u,v) = \lVert u - v \rVert_2$ | $[0, \infty)$ | Triplet loss, clustering |
| Cosine similarity | $\text{sim}(u,v) = \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$ | $[-1, 1]$ | InfoNCE, CLIP |
| Dot product | $s(u,v) = u \cdot v$ | $(-\infty, \infty)$ | Attention, retrieval |

Why normalize? With L2-normalized embeddings ($\|u\| = \|v\| = 1$), cosine similarity equals the dot product, and Euclidean distance relates to cosine by $\|u-v\|^2 = 2(1 - \cos(u,v))$. Normalization prevents the model from "cheating" by simply increasing embedding magnitudes to reduce loss.
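
The two identities above are easy to confirm numerically. The following is a minimal sketch on random vectors; the variable names and the dimension 16 are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=16)
v = rng.normal(size=16)

# L2-normalize so that ||u_hat|| = ||v_hat|| = 1
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
dot_normalized = np.dot(u_hat, v_hat)        # dot product of the unit vectors
sq_dist = np.sum((u_hat - v_hat) ** 2)       # squared Euclidean distance of the unit vectors

print(f"cosine similarity:   {cosine:.6f}")
print(f"dot of unit vectors: {dot_normalized:.6f}")
print(f"||u - v||^2:         {sq_dist:.6f}")
print(f"2 * (1 - cos):       {2 * (1 - cosine):.6f}")
```

The first two printed values agree, and so do the last two, which is why cosine-based and Euclidean-based losses rank neighbors identically once embeddings are normalized.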

Contrastive Loss (Siamese Networks)

The original contrastive loss for pairs $(x_i, x_j)$ with binary label $y_{ij} \in \{0, 1\}$ (1 = similar):

$$\mathcal{L}_{\text{contrastive}} = y_{ij}\,d(f(x_i), f(x_j))^2 + (1-y_{ij})\,\max(0, m - d(f(x_i), f(x_j)))^2$$

Similar pairs are pulled together (minimize distance). Dissimilar pairs are pushed apart, but only up to margin $m$ — once they're farther than $m$, the loss is zero and no gradient flows.
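
As a rough illustration of the pull and push terms, here is a small NumPy sketch of the pairwise loss. The function name `contrastive_loss` and the hand-picked 2-D pairs are assumptions made for this example only.

```python
import numpy as np

def contrastive_loss(emb_i, emb_j, y, margin=1.0):
    """Per-pair contrastive loss; y = 1 for similar pairs, 0 for dissimilar."""
    d = np.linalg.norm(emb_i - emb_j, axis=1)            # Euclidean distance per pair
    pull = y * d**2                                       # similar pairs: shrink the distance
    push = (1 - y) * np.maximum(0.0, margin - d)**2       # dissimilar pairs: active only inside the margin
    return pull + push

# Three hand-picked pairs with distances 0.5, 0.5 and 5.0
emb_i = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
emb_j = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
y = np.array([1, 0, 0])

losses = contrastive_loss(emb_i, emb_j, y, margin=1.0)
for k, (label, loss) in enumerate(zip(y, losses)):
    print(f"pair {k}: y={label}, loss={loss:.4f}")
```

The third pair is dissimilar and already farther apart than the margin, so its loss is zero and it would contribute no gradient.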

Triplet Loss

Derivation & Margin

Triplet loss operates on triplets $(a, p, n)$: an anchor, a positive (same class), and a negative (different class). The goal: the anchor should be closer to the positive than to the negative by at least a margin $\alpha$:

$$\|f(a) - f(p)\|^2 + \alpha < \|f(a) - f(n)\|^2$$

The loss penalizes violations:

$$\boxed{\mathcal{L}_{\text{triplet}} = \max\left(0,\;\|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha\right)}$$

Geometry: For each anchor-positive pair, the constraint defines a ball around the anchor with squared radius $\|f(a) - f(p)\|^2 + \alpha$, and every negative must lie outside it. The loss is zero for "easy" triplets where the negative is already outside this ball.

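Gradient: When the loss is active (i.e., the margin constraint is violated), the gradients are listed below. A gradient-descent step moves each embedding against its gradient, which produces the motion described next to each formula: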

  • $\nabla_{f(a)} \mathcal{L} = 2(f(n) - f(p))$ — move anchor away from negative, toward positive
  • $\nabla_{f(p)} \mathcal{L} = 2(f(p) - f(a))$ — move positive toward anchor
  • $\nabla_{f(n)} \mathcal{L} = 2(f(a) - f(n))$ — move negative away from anchor
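
These closed-form gradients can be checked against central finite differences. The sketch below uses a single hand-picked active triplet; `triplet_value` and `numerical_grad` are hypothetical helper names introduced for this check.

```python
import numpy as np

def triplet_value(a, p, n, alpha=1.0):
    """Triplet loss for a single (anchor, positive, negative) triple."""
    return max(0.0, np.sum((a - p)**2) - np.sum((a - n)**2) + alpha)

def numerical_grad(fn, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function fn at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad

# Active triplet: the negative is closer to the anchor than the positive
a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 0.5])

grad_a = numerical_grad(lambda x: triplet_value(x, p, n), a)
grad_p = numerical_grad(lambda x: triplet_value(a, x, n), p)
grad_n = numerical_grad(lambda x: triplet_value(a, p, x), n)

print("grad wrt anchor:  ", grad_a, " analytic:", 2 * (n - p))
print("grad wrt positive:", grad_p, " analytic:", 2 * (p - a))
print("grad wrt negative:", grad_n, " analytic:", 2 * (a - n))
```

Because the active loss is quadratic in each embedding, the finite-difference estimates match the analytic expressions up to floating-point rounding.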

Hard Negative Mining

Most triplets are "easy" (loss = 0) and provide no learning signal. Hard negative mining selects the most informative triplets:

| Strategy | Definition | Properties |
|---|---|---|
| Hardest negative | $n^* = \arg\min_n \lVert f(a) - f(n) \rVert$ | Can lead to collapsed embeddings early in training |
| Semi-hard negative | $\lVert f(a)-f(p) \rVert^2 < \lVert f(a)-f(n) \rVert^2 < \lVert f(a)-f(p) \rVert^2 + \alpha$ | Active loss, avoids collapse; used in FaceNet |
| Random negative | Uniformly sampled from a different class | Mostly easy triplets, slow learning |

import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Compute triplet loss for a batch of embeddings."""
    d_pos = np.sum((anchor - positive)**2, axis=1)  # ||f(a) - f(p)||^2
    d_neg = np.sum((anchor - negative)**2, axis=1)  # ||f(a) - f(n)||^2
    losses = np.maximum(0, d_pos - d_neg + margin)
    return losses.mean()

# Example: 4 triplets in 3D embedding space
np.random.seed(42)
dim = 3
batch_size = 4

# Anchors from class A
anchors = np.random.randn(batch_size, dim)
# Positives: close to anchors (same class)
positives = anchors + np.random.randn(batch_size, dim) * 0.3
# Negatives: far from anchors (different class)
negatives = anchors + np.random.randn(batch_size, dim) * 2.0

margin = 1.0
loss = triplet_loss(anchors, positives, negatives, margin)

# Per-triplet analysis
d_pos = np.sum((anchors - positives)**2, axis=1)
d_neg = np.sum((anchors - negatives)**2, axis=1)

print(f"Triplet loss (margin={margin}): {loss:.4f}\n")
print(f"{'Triplet':>8} {'d(a,p)^2':>9} {'d(a,n)^2':>9} {'Active?':>8}")
for i in range(batch_size):
    active = "YES" if d_pos[i] - d_neg[i] + margin > 0 else "no"
    print(f"{i:>8} {d_pos[i]:9.3f} {d_neg[i]:9.3f} {active:>8}")
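
The mining strategies from the table can be sketched for a single anchor-positive pair as follows. This is an illustrative sketch, not FaceNet's actual pipeline: it uses squared distances to match the loss, and the fallback to the hardest negative when no semi-hard candidate exists is an assumption of this example, as is the name `mine_negative`.

```python
import numpy as np

def mine_negative(anchor, positive, candidates, margin=1.0):
    """Pick a negative for one anchor-positive pair.

    Prefers the closest semi-hard negative; falls back to the hardest
    (closest) negative when no candidate is semi-hard (assumed policy).
    """
    d_pos = np.sum((anchor - positive)**2)
    d_neg = np.sum((candidates - anchor)**2, axis=1)    # squared distance to each candidate
    semi_hard = (d_neg > d_pos) & (d_neg < d_pos + margin)
    if semi_hard.any():
        idx = np.flatnonzero(semi_hard)
        return int(idx[np.argmin(d_neg[idx])]), "semi-hard"
    return int(np.argmin(d_neg)), "hardest"

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])      # squared distance to anchor: 1.0
candidates = np.array([
    [0.5, 0.0],    # squared distance 0.25: hard (closer than the positive)
    [1.2, 0.0],    # squared distance 1.44: semi-hard (between 1.0 and 2.0)
    [3.0, 0.0],    # squared distance 9.0: easy (loss is zero)
])

idx, kind = mine_negative(anchor, positive, candidates, margin=1.0)
print(f"selected candidate {idx} ({kind})")
```

In practice mining is usually done within each mini-batch using the current embeddings, so the selected negatives change as training progresses.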

InfoNCE & NT-Xent

Information-Theoretic Derivation

InfoNCE (the name comes from Noise-Contrastive Estimation) is derived from a lower bound on mutual information. Given an anchor $x$, one positive sample $x^+$, and $N-1$ negative samples $\{x_1^-, \ldots, x_{N-1}^-\}$, the loss is:

$$\boxed{\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(f(x), f(x^+)) / \tau)}{\exp(\text{sim}(f(x), f(x^+)) / \tau) + \sum_{j=1}^{N-1}\exp(\text{sim}(f(x), f(x_j^-)) / \tau)}}$$

Why this is a softmax cross-entropy: Recognize this as the negative log-probability of the positive being identified among all candidates. It is equivalent to an $N$-way classification problem where the model must identify which of $N$ samples is the true positive.

Connection to mutual information: The InfoNCE loss provides a lower bound on the mutual information $I(X; X^+)$ between the anchor and positive views:

$$I(X; X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$

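Minimizing $\mathcal{L}_{\text{InfoNCE}}$ maximizes this lower bound on mutual information. Because the loss is non-negative, the bound can never exceed $\log N$, so a larger $N$ raises the ceiling on how much mutual information the bound can certify. This is the sense in which the bound is tighter with more negatives.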

NT-Xent (Normalized Temperature-scaled Cross Entropy) is the SimCLR variant of InfoNCE. For a batch of $N$ inputs, two augmentations per input give $2N$ views, so each anchor $z_i$ has one positive $z_j$ (the other view of the same input) and $2(N-1)$ negatives:

$$\mathcal{L}_{\text{NT-Xent}}^{(i,j)} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{k\neq i}\exp(\text{sim}(z_i, z_k)/\tau)}$$
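
A compact NumPy sketch of this formula follows. It assumes a particular layout that the text does not specify: rows $2k$ and $2k+1$ of `z` hold the two views of input $k$. The function name `nt_xent` and the toy embeddings are also illustrative.

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """Mean NT-Xent loss over 2N views; rows 2k and 2k+1 are assumed to be a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit vectors, so dot = cosine
    logits = (z @ z.T) / temperature                      # (2N, 2N) similarity logits
    np.fill_diagonal(logits, -np.inf)                     # the 1[k != i] mask: drop self-similarity
    # Row-wise log-sum-exp with max subtraction for stability
    row_max = logits.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    partner = np.arange(len(z)) ^ 1                       # index of the paired view: 0<->1, 2<->3, ...
    pos_logit = logits[np.arange(len(z)), partner]
    return np.mean(lse - pos_logit)

# Two positive pairs with orthogonal content: cos = 1 within a pair, 0 across pairs
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss = nt_xent(z, temperature=0.5)

# Hand computation: one positive at cos = 1 and two negatives at cos = 0
closed_form = np.log(1 + 2 * np.exp(-1 / 0.5))
print(f"NT-Xent: {loss:.4f}, hand computation: {closed_form:.4f}")
```

The toy case is chosen so that every anchor sees the same logits, which makes the loss easy to verify by hand.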

Temperature Scaling

The temperature $\tau$ controls how "peaked" the similarity distribution is:

  • $\tau \to 0$: only the hardest negatives contribute gradient (winner-take-all)
  • $\tau \to \infty$: all negatives contribute equally (uniform weighting)
  • $\tau = 0.07$ (CLIP's initial value, learned during training): sharp distribution, strong negative penalties
  • $\tau = 0.5$ (a setting used in SimCLR experiments): softer distribution

Mathematical effect: Dividing logits by $\tau$ before softmax is equivalent to raising the softmax probabilities to the power $1/\tau$ and renormalizing. Low $\tau$ amplifies differences between similar and dissimilar pairs.
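
The equivalence between temperature scaling and raising probabilities to a power (followed by renormalization) can be checked directly. This is a small sketch; the logits are arbitrary illustrative values and `softmax` is a helper defined here.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)           # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

logits = np.array([0.8, 0.5, 0.1, -0.3])    # illustrative similarities, positive first
tau = 0.1

scaled = softmax(logits / tau)               # softmax of temperature-scaled logits
powered = softmax(logits) ** (1 / tau)       # probabilities raised to the power 1/tau ...
powered = powered / powered.sum()            # ... then renormalized

print("identical distributions:", np.allclose(scaled, powered))
for t in [0.05, 0.5, 5.0]:
    print(f"tau={t}: {np.round(softmax(logits / t), 3)}")
```

The loop prints the distribution at three temperatures, showing the mass concentrating on the largest logit as $\tau$ shrinks and flattening as it grows.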

import numpy as np

def infonce_loss(anchor, positive, negatives, temperature=0.07):
    """
    Compute InfoNCE loss.
    
    Args:
        anchor: embedding vector, shape (d,)
        positive: positive embedding, shape (d,)
        negatives: negative embeddings, shape (N-1, d)
        temperature: scaling parameter
    
    Returns:
        Scalar loss value
    """
    # L2 normalize all embeddings
    anchor = anchor / np.linalg.norm(anchor)
    positive = positive / np.linalg.norm(positive)
    negatives = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    
    # Cosine similarities
    sim_pos = np.dot(anchor, positive) / temperature
    sim_neg = np.dot(negatives, anchor) / temperature  # shape (N-1,)
    
    # InfoNCE = -log(exp(sim_pos) / (exp(sim_pos) + sum(exp(sim_neg))))
    # Use log-sum-exp trick for stability
    all_logits = np.concatenate([[sim_pos], sim_neg])
    log_sum_exp = np.max(all_logits) + np.log(np.sum(np.exp(all_logits - np.max(all_logits))))
    loss = -(sim_pos - log_sum_exp)
    return loss

# Example: 128-dim embeddings, 1 positive + 255 negatives
np.random.seed(42)
d = 128
n_negatives = 255

anchor = np.random.randn(d)
positive = anchor + np.random.randn(d) * 0.3  # close to anchor
negatives = np.random.randn(n_negatives, d)     # random negatives

# Compare different temperatures
for tau in [0.01, 0.07, 0.5, 1.0]:
    loss = infonce_loss(anchor, positive, negatives, temperature=tau)
    print(f"tau={tau:.2f}: InfoNCE loss = {loss:.4f}")

# MI lower bound: I(X; X+) >= log(N) - L_InfoNCE
N = n_negatives + 1
best_loss = infonce_loss(anchor, positive, negatives, temperature=0.07)
mi_bound = np.log(N) - best_loss
print(f"\nMI lower bound: >= {mi_bound:.4f} nats")
print(f"log(N) = log({N}) = {np.log(N):.4f} nats (maximum possible)")

CLIP Objective

CLIP (Contrastive Language-Image Pre-training) applies InfoNCE symmetrically across image and text modalities. Given a batch of $N$ (image, text) pairs, CLIP treats each matched pair as a positive and the $N^2 - N$ mismatched pairings as negatives: for each image, the $N-1$ other texts, and for each text, the $N-1$ other images.

Let $I_i = f_{\text{image}}(x_i)$ and $T_j = f_{\text{text}}(t_j)$ be L2-normalized embeddings. The similarity matrix is $S_{ij} = I_i \cdot T_j / \tau$. CLIP minimizes:

$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}}\right)$$

where:

$$\mathcal{L}_{\text{i2t}} = -\frac{1}{N}\sum_{i=1}^N \log\frac{\exp(S_{ii})}{\sum_{j=1}^N \exp(S_{ij})} \quad \text{(image→text matching)}$$

$$\mathcal{L}_{\text{t2i}} = -\frac{1}{N}\sum_{j=1}^N \log\frac{\exp(S_{jj})}{\sum_{i=1}^N \exp(S_{ij})} \quad \text{(text→image matching)}$$

This is equivalent to cross-entropy loss on the $N \times N$ similarity matrix with the identity as the target: each row and column should have its maximum on the diagonal.

Scaling insight: CLIP's batch size of 32,768 means each anchor has 32,767 negatives, which raises the ceiling of the InfoNCE bound to $\log N \approx 10.4$ nats and makes the learned representations highly discriminative. The temperature is optimized jointly with the encoders: $\tau$ is initialized to $0.07$, so the logit multiplier $1/\tau$ starts at about $14.3$.

import numpy as np

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    Compute symmetric CLIP contrastive loss.
    
    Args:
        image_embeddings: shape (N, d), L2-normalized
        text_embeddings: shape (N, d), L2-normalized
        temperature: temperature tau (a fixed scalar in this sketch; CLIP learns it)
    
    Returns:
        Scalar CLIP loss
    """
    N = len(image_embeddings)
    
    # Similarity matrix: S[i,j] = sim(image_i, text_j) / tau
    S = (image_embeddings @ text_embeddings.T) / temperature  # (N, N)
    
    # Labels: diagonal entries are positives (identity permutation)
    labels = np.arange(N)
    
    # Image-to-text loss: each row should peak at diagonal
    # = cross_entropy(S, labels) along rows
    log_sum_exp_rows = np.log(np.sum(np.exp(S - S.max(axis=1, keepdims=True)), axis=1)) + S.max(axis=1)
    loss_i2t = np.mean(-S[np.arange(N), labels] + log_sum_exp_rows)
    
    # Text-to-image loss: each column should peak at diagonal
    log_sum_exp_cols = np.log(np.sum(np.exp(S.T - S.T.max(axis=1, keepdims=True)), axis=1)) + S.T.max(axis=1)
    loss_t2i = np.mean(-S.T[np.arange(N), labels] + log_sum_exp_cols)
    
    return 0.5 * (loss_i2t + loss_t2i)

# Simulate a mini-batch of 8 image-text pairs
np.random.seed(42)
N, d = 8, 64

# Create embeddings where matched pairs are similar
shared = np.random.randn(N, d)
image_emb = shared + np.random.randn(N, d) * 0.2
text_emb = shared + np.random.randn(N, d) * 0.2

# L2 normalize
image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

loss = clip_loss(image_emb, text_emb, temperature=0.07)
print(f"CLIP loss: {loss:.4f}")
print(f"Random baseline (N={N}): {np.log(N):.4f}")
print(f"Perfect matching: ~0.0")

# Show similarity matrix diagonal vs off-diagonal
S = image_emb @ text_emb.T
print(f"\nSimilarity matrix stats:")
print(f"  Diagonal (pos pairs):  mean={np.diag(S).mean():.3f}")
print(f"  Off-diagonal (neg):    mean={S[~np.eye(N, dtype=bool)].mean():.3f}")
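
The learnable temperature mentioned in the scaling insight can be made concrete with a few lines of arithmetic. The sketch below assumes the scale is stored in log space and capped so that logits are never multiplied by more than 100, following the clipping described in the CLIP paper. The names `log_logit_scale` and `MAX_LOGIT_SCALE` are illustrative, and no gradient update is performed here.

```python
import numpy as np

# Assumption: the parameter lives in log space so the scale stays positive under updates.
log_logit_scale = np.log(1 / 0.07)            # initial parameter value
logit_scale = np.exp(log_logit_scale)         # multiplier applied to cosine similarities
tau = 1 / logit_scale                         # equivalent temperature

# Assumed cap: never scale logits by more than 100 (tau no smaller than 0.01)
MAX_LOGIT_SCALE = 100.0
logit_scale = min(logit_scale, MAX_LOGIT_SCALE)

cos_sim = np.array([0.9, 0.2, -0.1])          # illustrative cosine similarities
logits = logit_scale * cos_sim                # same values as cos_sim / tau

print(f"logit scale 1/tau = {logit_scale:.2f}, tau = {tau:.2f}")
print("logits:", np.round(logits, 2))
```

Multiplying by the scale and dividing by $\tau$ are the same operation, so the value 14.3 is the multiplier and 0.07 is the temperature.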

Why Contrastive Learning Works

Alignment and Uniformity: Good contrastive representations satisfy two properties:

  1. Alignment: Positive pairs should map to nearby points: $\mathbb{E}_{(x,x^+)}\|f(x) - f(x^+)\|^2$ should be small
  2. Uniformity: Embeddings should be uniformly distributed on the hypersphere: $\log \mathbb{E}_{x,y}\exp(-2\|f(x) - f(y)\|^2)$, with $x$ and $y$ drawn independently from the data, should be minimized

InfoNCE implicitly optimizes both: the numerator rewards alignment (high similarity to positive) while the denominator rewards uniformity (low similarity to all negatives).
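
Both quantities can be estimated from a batch of embeddings. The following is a minimal sketch on synthetic data; the function names, batch size, and noise levels are assumptions chosen for illustration, and the uniformity estimate averages over distinct pairs within the batch.

```python
import numpy as np

def alignment(z_a, z_b):
    """Mean squared distance between positive pairs (row k of z_a matches row k of z_b)."""
    return np.mean(np.sum((z_a - z_b)**2, axis=1))

def uniformity(z, t=2.0):
    """log of the mean of exp(-t * ||z_i - z_j||^2) over distinct pairs i < j."""
    sq_norms = np.sum(z**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * z @ z.T
    iu = np.triu_indices(len(z), k=1)
    return np.log(np.mean(np.exp(-t * np.maximum(sq_dists[iu], 0.0))))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
spread = normalize(rng.normal(size=(256, 32)))                       # roughly uniform on the sphere
clustered = normalize(np.ones(32) + 0.1 * rng.normal(size=(256, 32)))  # bunched around one direction
positives = normalize(spread + 0.05 * rng.normal(size=(256, 32)))    # slightly perturbed views

print(f"alignment (true pairs):     {alignment(spread, positives):.3f}")
print(f"alignment (mismatched):     {alignment(spread, spread[::-1]):.3f}")
print(f"uniformity (spread):        {uniformity(spread):.3f}")
print(f"uniformity (clustered):     {uniformity(clustered):.3f}")
```

Lower is better for both metrics: true pairs score lower on alignment than mismatched ones, and the spread-out embeddings score lower on uniformity than the clustered ones.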

Dimensional collapse prevention: Without enough negatives or with too-easy augmentations, embeddings can collapse to a lower-dimensional subspace (or a single point). Uniformity on the hypersphere prevents this by requiring the embeddings to "spread out." Techniques like variance-invariance-covariance (VICReg) regularization address collapse more directly.
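
One common way to look for dimensional collapse is to inspect the singular values of the centered embedding matrix. The sketch below counts singular values above a relative threshold; the threshold `tol`, the function name `numerical_rank`, and the synthetic data are assumptions for illustration rather than a standard diagnostic from the text.

```python
import numpy as np

def numerical_rank(z, tol=1e-3):
    """Number of singular values of the centered embeddings above tol * largest."""
    centered = z - z.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)     # singular values, descending
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
full = rng.normal(size=(512, 16))                     # occupies all 16 dimensions
basis = rng.normal(size=(3, 16))
collapsed = rng.normal(size=(512, 3)) @ basis         # confined to a 3-D subspace of R^16

print("dimensions used (healthy):  ", numerical_rank(full))
print("dimensions used (collapsed):", numerical_rank(collapsed))
```

Real embeddings rarely collapse this cleanly, so plotting the full singular value spectrum on a log scale is usually more informative than a single thresholded count.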

Connection to downstream tasks: If two inputs are semantically similar (under relevant augmentations), their embeddings should be close. This creates a representation where linear probes can separate classes, because class boundaries align with distance in embedding space.

Practice Exercises

Exercises: Cement the derivations above.
  1. Triplet loss gradient: Derive $\nabla_{f(a)}\mathcal{L}_{\text{triplet}}$ when the loss is active. Verify that the gradient pushes the anchor toward the positive and away from the negative.
  2. InfoNCE as classification: Show that InfoNCE is equivalent to $N$-way cross-entropy where the "correct class" is the positive sample. Write out the softmax probabilities explicitly.
  3. Temperature analysis: For cosine similarities $[0.9, 0.3, 0.2, 0.1]$ (positive first), compute the InfoNCE loss for $\tau \in \{0.01, 0.1, 1.0\}$. How does temperature affect which negatives contribute most to the gradient?
  4. MI bound tightness: Show that $I(X;X^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$ by relating InfoNCE to the density ratio $\frac{p(x^+|x)}{p(x^+)}$.
  5. CLIP symmetry: Explain why CLIP uses both image→text and text→image losses. What could go wrong with only one direction?
  6. Coding challenge: Implement NT-Xent for a batch of 32 pairs (64 augmented views). Verify your implementation against a hand computation for a case where every positive pair has cosine similarity 1.