Back to Math for AI Hub

Recommendation & Ranking Mathematics

May 30, 2026Wasil Zafar15 min read

Recommender systems are fundamentally matrix completion + ranking problems. This note derives the core mathematics: matrix factorization as optimization, collaborative filtering via SVD, and pairwise ranking losses (BPR, LambdaRank) that train modern recommendation models.

Table of Contents

  1. Matrix Factorization
  2. ALS Optimization
  3. SVD for Collaborative Filtering
  4. BPR: Bayesian Personalized Ranking
  5. LambdaRank
  6. Ranking Metrics (NDCG, MAP)
Prerequisites: Part 7: Linear Algebra (SVD, matrix decomposition), Part 9: ML Math (loss functions, regularization). This note is the canonical math reference for AI in the Wild: Recommender Systems.

Matrix Factorization

A user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ (m users, n items) is typically sparse — most entries are unobserved. Matrix factorization approximates $R$ as a product of low-rank factors:

$$\boxed{R \approx U V^T, \quad U \in \mathbb{R}^{m \times k}, \quad V \in \mathbb{R}^{n \times k}}$$

where $k \ll \min(m,n)$ is the embedding dimension. Each user $u$ gets a vector $\mathbf{u}_i \in \mathbb{R}^k$ and each item gets $\mathbf{v}_j \in \mathbb{R}^k$. The predicted rating is $\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j$.

Optimization objective (regularized, over observed entries only):

$$\min_{U, V} \sum_{(i,j) \in \Omega} (r_{ij} - \mathbf{u}_i^T\mathbf{v}_j)^2 + \lambda(\|\mathbf{u}_i\|^2 + \|\mathbf{v}_j\|^2)$$

where $\Omega$ is the set of observed ratings. This is non-convex in $(U, V)$ jointly, but convex in each alone — motivating Alternating Least Squares.

Alternating Least Squares (ALS)

ALS alternates between fixing $V$ (solving for $U$) and fixing $U$ (solving for $V$). When $V$ is fixed, the objective decomposes into $m$ independent ridge regression problems:

Update user $i$:

$$\mathbf{u}_i = (V_{\Omega_i}^T V_{\Omega_i} + \lambda I)^{-1} V_{\Omega_i}^T \mathbf{r}_i$$

where $V_{\Omega_i}$ contains rows of $V$ for items rated by user $i$, and $\mathbf{r}_i$ is the corresponding rating vector.

Update item $j$:

$$\mathbf{v}_j = (U_{\Omega_j}^T U_{\Omega_j} + \lambda I)^{-1} U_{\Omega_j}^T \mathbf{r}_j$$

Convergence guarantee: Each ALS iteration monotonically decreases the objective (block coordinate descent on a bi-convex problem). Converges to a local minimum. The $\lambda I$ regularization ensures the $k \times k$ matrix is always invertible.

Connection to SVD: Without regularization ($\lambda = 0$) and with all entries observed ($\Omega = \{1,\ldots,m\}\times\{1,\ldots,n\}$), the optimal rank-$k$ factorization is exactly the truncated SVD: $R \approx U_k \Sigma_k V_k^T$ (Eckart-Young theorem, see Part 7).

SVD for Collaborative Filtering

Truncated SVD on the centered rating matrix provides the optimal low-rank approximation (in Frobenius norm). For collaborative filtering:

  1. Mean-center: $\tilde{R}_{ij} = R_{ij} - \bar{r}_i$ (subtract user mean)
  2. Compute truncated SVD: $\tilde{R} \approx U_k \Sigma_k V_k^T$
  3. Predict: $\hat{r}_{ij} = \bar{r}_i + (U_k \Sigma_k V_k^T)_{ij}$

The $k$ singular values capture the $k$ most important "latent factors" — interpretable as genres, style preferences, or quality levels. This is the same SVD from Part 7 applied to user-item matrices.

import numpy as np

def als_matrix_factorization(R, k=10, lam=0.1, n_iters=20):
    """
    Alternating Least Squares for matrix factorization.
    
    Args:
        R: user-item matrix (0 = unobserved)
        k: embedding dimension
        lam: regularization strength
        n_iters: number of ALS iterations
    
    Returns:
        U (m x k), V (n x k) factor matrices
    """
    m, n = R.shape
    mask = (R != 0).astype(float)  # observed entries
    
    # Random initialization
    np.random.seed(42)
    U = np.random.randn(m, k) * 0.1
    V = np.random.randn(n, k) * 0.1
    
    for iteration in range(n_iters):
        # Fix V, solve for each user
        for i in range(m):
            rated = np.where(mask[i] > 0)[0]
            if len(rated) == 0:
                continue
            V_rated = V[rated]  # (|rated| x k)
            r_i = R[i, rated]   # (|rated|,)
            A = V_rated.T @ V_rated + lam * len(rated) * np.eye(k)
            b = V_rated.T @ r_i
            U[i] = np.linalg.solve(A, b)
        
        # Fix U, solve for each item
        for j in range(n):
            users = np.where(mask[:, j] > 0)[0]
            if len(users) == 0:
                continue
            U_users = U[users]
            r_j = R[users, j]
            A = U_users.T @ U_users + lam * len(users) * np.eye(k)
            b = U_users.T @ r_j
            V[j] = np.linalg.solve(A, b)
        
        # Compute loss on observed entries
        pred = U @ V.T
        loss = np.sum(mask * (R - pred)**2) + lam * (np.sum(U**2) + np.sum(V**2))
        if iteration % 5 == 0:
            print(f"  Iteration {iteration}: loss = {loss:.4f}")
    
    return U, V

# Example: small rating matrix (1-5 scale, 0 = unobserved)
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 0, 5]
], dtype=float)

print("ALS Matrix Factorization (k=2):")
U, V = als_matrix_factorization(R, k=2, lam=0.01, n_iters=20)

# Predict missing entries
predictions = U @ V.T
print(f"\nPredicted ratings (full matrix):")
print(np.round(predictions, 1))
print(f"\nPrediction for user 0, item 2 (unobserved): {predictions[0, 2]:.2f}")
print(f"Prediction for user 1, item 1 (unobserved): {predictions[1, 1]:.2f}")

BPR: Bayesian Personalized Ranking

For implicit feedback (clicks, views — no explicit ratings), we only know which items a user interacted with, not preference scores. BPR (Rendle et al., 2009) assumes: an interacted item should rank higher than a non-interacted one.

Pairwise objective: For user $u$ who interacted with item $i$ but not item $j$, maximize:

$$\boxed{\mathcal{L}_{\text{BPR}} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda \|\Theta\|^2}$$

where $\hat{x}_{uij} = \hat{r}_{ui} - \hat{r}_{uj}$ (predicted score difference), $\sigma$ is the sigmoid, and $D_S = \{(u,i,j) : i \in \Omega_u^+, j \notin \Omega_u^+\}$.

Derivation from Bayesian MAP: Assuming preferences are drawn from a Bernoulli with $P(i >_u j) = \sigma(\hat{x}_{uij})$ and a Gaussian prior on parameters:

$$\ln P(\Theta | >_u) \propto \sum_{(u,i,j)} \ln \sigma(\hat{x}_{uij}) - \lambda\|\Theta\|^2$$

Gradient for SGD:

$$\frac{\partial \mathcal{L}_{\text{BPR}}}{\partial \Theta} = \sum_{(u,i,j)} \frac{e^{-\hat{x}_{uij}}}{1 + e^{-\hat{x}_{uij}}} \cdot \frac{\partial \hat{x}_{uij}}{\partial \Theta} - 2\lambda\Theta$$

For matrix factorization: $\hat{x}_{uij} = \mathbf{u}^T(\mathbf{v}_i - \mathbf{v}_j)$, so $\frac{\partial}{\partial \mathbf{u}} = (1-\sigma(\hat{x}_{uij}))(\mathbf{v}_i - \mathbf{v}_j)$.

LambdaRank

LambdaRank (Burges 2006) defines gradients that directly optimize NDCG — even though NDCG is non-differentiable (it depends on discrete ranking positions).

Key idea: Define "lambda gradients" that weight pairwise swaps by their impact on NDCG:

$$\lambda_{ij} = \frac{-\sigma'(\hat{x}_{ij})}{1 + e^{\hat{x}_{ij}}} \cdot |\Delta\text{NDCG}_{ij}|$$

where $\Delta\text{NDCG}_{ij}$ is the change in NDCG from swapping items $i$ and $j$ in the ranking. The gradient for item $i$'s score is:

$$\frac{\partial \mathcal{C}}{\partial s_i} = \sum_{j: j \neq i} \lambda_{ij} - \lambda_{ji}$$

This "tricks" gradient descent into optimizing the (non-differentiable) ranking metric by attaching metric-aware weights to pairwise cross-entropy gradients.

Ranking Metrics (NDCG, MAP)

NDCG (Normalized Discounted Cumulative Gain):

$$\text{DCG}@K = \sum_{i=1}^K \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \quad \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}$$

where $rel_i$ is the relevance of the item at position $i$, and IDCG is the DCG of the ideal (perfect) ranking. NDCG $\in [0, 1]$.

MAP (Mean Average Precision):

$$\text{AP} = \frac{1}{|\text{relevant}|}\sum_{k=1}^n \text{Precision}@k \cdot \mathbb{1}[\text{item}_k \text{ is relevant}]$$ $$\text{MAP} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} \text{AP}(q)$$

Both metrics are position-sensitive and non-differentiable, which is why LambdaRank's trick of defining implicit gradients via swap impacts is so valuable.

Exercise
Compute NDCG@5

For a ranking with relevance scores [3, 2, 3, 0, 1] and ideal ranking [3, 3, 2, 1, 0], compute NDCG@5 manually. Verify: DCG = $\frac{7}{\log_2 2} + \frac{3}{\log_2 3} + \frac{7}{\log_2 4} + \frac{0}{\log_2 5} + \frac{1}{\log_2 6} \approx 13.35$.