Recommendation & Ranking Mathematics

Prerequisites: Part 7: Linear Algebra (SVD, matrix decomposition), Part 9: ML Math (loss functions, regularization). This note is the canonical math reference for AI in the Wild: Recommender Systems.

Matrix Factorization

A user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ (m users, n items) is typically sparse — most entries are unobserved. Matrix factorization approximates $R$ as a product of low-rank factors:

$$\boxed{R \approx U V^T, \quad U \in \mathbb{R}^{m \times k}, \quad V \in \mathbb{R}^{n \times k}}$$

where $k \ll \min(m,n)$ is the embedding dimension. Each user $u$ gets a vector $\mathbf{u}_i \in \mathbb{R}^k$ and each item gets $\mathbf{v}_j \in \mathbb{R}^k$. The predicted rating is $\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j$.

Optimization objective (regularized, over observed entries only):

$$\min_{U, V} \sum_{(i,j) \in \Omega} (r_{ij} - \mathbf{u}_i^T\mathbf{v}_j)^2 + \lambda(\|\mathbf{u}_i\|^2 + \|\mathbf{v}_j\|^2)$$

where $\Omega$ is the set of observed ratings. This is non-convex in $(U, V)$ jointly, but convex in each alone — motivating Alternating Least Squares.

Alternating Least Squares (ALS)

ALS alternates between fixing $V$ (solving for $U$) and fixing $U$ (solving for $V$). When $V$ is fixed, the objective decomposes into $m$ independent ridge regression problems:

Update user $i$:

$$\mathbf{u}_i = (V_{\Omega_i}^T V_{\Omega_i} + \lambda I)^{-1} V_{\Omega_i}^T \mathbf{r}_i$$

where $V_{\Omega_i}$ contains rows of $V$ for items rated by user $i$, and $\mathbf{r}_i$ is the corresponding rating vector.

Update item $j$:

$$\mathbf{v}_j = (U_{\Omega_j}^T U_{\Omega_j} + \lambda I)^{-1} U_{\Omega_j}^T \mathbf{r}_j$$

Convergence guarantee: Each ALS iteration monotonically decreases the objective (block coordinate descent on a bi-convex problem). Converges to a local minimum. The $\lambda I$ regularization ensures the $k \times k$ matrix is always invertible.

Connection to SVD: Without regularization ($\lambda = 0$) and with all entries observed ($\Omega = \{1,\ldots,m\}\times\{1,\ldots,n\}$), the optimal rank-$k$ factorization is exactly the truncated SVD: $R \approx U_k \Sigma_k V_k^T$ (Eckart-Young theorem, see Part 7).

SVD for Collaborative Filtering

Truncated SVD on the centered rating matrix provides the optimal low-rank approximation (in Frobenius norm). For collaborative filtering:

Mean-center: $\tilde{R}_{ij} = R_{ij} - \bar{r}_i$ (subtract user mean)
Compute truncated SVD: $\tilde{R} \approx U_k \Sigma_k V_k^T$
Predict: $\hat{r}_{ij} = \bar{r}_i + (U_k \Sigma_k V_k^T)_{ij}$

The $k$ singular values capture the $k$ most important "latent factors" — interpretable as genres, style preferences, or quality levels. This is the same SVD from Part 7 applied to user-item matrices.

import numpy as np

def als_matrix_factorization(R, k=10, lam=0.1, n_iters=20):
    """
    Alternating Least Squares for matrix factorization.
    
    Args:
        R: user-item matrix (0 = unobserved)
        k: embedding dimension
        lam: regularization strength
        n_iters: number of ALS iterations
    
    Returns:
        U (m x k), V (n x k) factor matrices
    """
    m, n = R.shape
    mask = (R != 0).astype(float)  # observed entries
    
    # Random initialization
    np.random.seed(42)
    U = np.random.randn(m, k) * 0.1
    V = np.random.randn(n, k) * 0.1
    
    for iteration in range(n_iters):
        # Fix V, solve for each user
        for i in range(m):
            rated = np.where(mask[i] > 0)[0]
            if len(rated) == 0:
                continue
            V_rated = V[rated]  # (|rated| x k)
            r_i = R[i, rated]   # (|rated|,)
            A = V_rated.T @ V_rated + lam * len(rated) * np.eye(k)
            b = V_rated.T @ r_i
            U[i] = np.linalg.solve(A, b)
        
        # Fix U, solve for each item
        for j in range(n):
            users = np.where(mask[:, j] > 0)[0]
            if len(users) == 0:
                continue
            U_users = U[users]
            r_j = R[users, j]
            A = U_users.T @ U_users + lam * len(users) * np.eye(k)
            b = U_users.T @ r_j
            V[j] = np.linalg.solve(A, b)
        
        # Compute loss on observed entries
        pred = U @ V.T
        loss = np.sum(mask * (R - pred)**2) + lam * (np.sum(U**2) + np.sum(V**2))
        if iteration % 5 == 0:
            print(f"  Iteration {iteration}: loss = {loss:.4f}")
    
    return U, V

# Example: small rating matrix (1-5 scale, 0 = unobserved)
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 0, 5]
], dtype=float)

print("ALS Matrix Factorization (k=2):")
U, V = als_matrix_factorization(R, k=2, lam=0.01, n_iters=20)

# Predict missing entries
predictions = U @ V.T
print(f"\nPredicted ratings (full matrix):")
print(np.round(predictions, 1))
print(f"\nPrediction for user 0, item 2 (unobserved): {predictions[0, 2]:.2f}")
print(f"Prediction for user 1, item 1 (unobserved): {predictions[1, 1]:.2f}")

BPR: Bayesian Personalized Ranking

For implicit feedback (clicks, views — no explicit ratings), we only know which items a user interacted with, not preference scores. BPR (Rendle et al., 2009) assumes: an interacted item should rank higher than a non-interacted one.

Pairwise objective: For user $u$ who interacted with item $i$ but not item $j$, maximize:

$$\boxed{\mathcal{L}_{\text{BPR}} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda \|\Theta\|^2}$$

where $\hat{x}_{uij} = \hat{r}_{ui} - \hat{r}_{uj}$ (predicted score difference), $\sigma$ is the sigmoid, and $D_S = \{(u,i,j) : i \in \Omega_u^+, j \notin \Omega_u^+\}$.

Derivation from Bayesian MAP: Assuming preferences are drawn from a Bernoulli with $P(i >_u j) = \sigma(\hat{x}_{uij})$ and a Gaussian prior on parameters:

$$\ln P(\Theta | >_u) \propto \sum_{(u,i,j)} \ln \sigma(\hat{x}_{uij}) - \lambda\|\Theta\|^2$$

Gradient for SGD:

$$\frac{\partial \mathcal{L}_{\text{BPR}}}{\partial \Theta} = \sum_{(u,i,j)} \frac{e^{-\hat{x}_{uij}}}{1 + e^{-\hat{x}_{uij}}} \cdot \frac{\partial \hat{x}_{uij}}{\partial \Theta} - 2\lambda\Theta$$

For matrix factorization: $\hat{x}_{uij} = \mathbf{u}^T(\mathbf{v}_i - \mathbf{v}_j)$, so $\frac{\partial}{\partial \mathbf{u}} = (1-\sigma(\hat{x}_{uij}))(\mathbf{v}_i - \mathbf{v}_j)$.

LambdaRank

LambdaRank (Burges 2006) defines gradients that directly optimize NDCG — even though NDCG is non-differentiable (it depends on discrete ranking positions).

Key idea: Define "lambda gradients" that weight pairwise swaps by their impact on NDCG:

$$\lambda_{ij} = \frac{-\sigma'(\hat{x}_{ij})}{1 + e^{\hat{x}_{ij}}} \cdot |\Delta\text{NDCG}_{ij}|$$

where $\Delta\text{NDCG}_{ij}$ is the change in NDCG from swapping items $i$ and $j$ in the ranking. The gradient for item $i$'s score is:

$$\frac{\partial \mathcal{C}}{\partial s_i} = \sum_{j: j \neq i} \lambda_{ij} - \lambda_{ji}$$

This "tricks" gradient descent into optimizing the (non-differentiable) ranking metric by attaching metric-aware weights to pairwise cross-entropy gradients.

Ranking Metrics (NDCG, MAP)

NDCG (Normalized Discounted Cumulative Gain):

$$\text{DCG}@K = \sum_{i=1}^K \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \quad \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}$$

where $rel_i$ is the relevance of the item at position $i$, and IDCG is the DCG of the ideal (perfect) ranking. NDCG $\in [0, 1]$.

MAP (Mean Average Precision):

$$\text{AP} = \frac{1}{|\text{relevant}|}\sum_{k=1}^n \text{Precision}@k \cdot \mathbb{1}[\text{item}_k \text{ is relevant}]$$ $$\text{MAP} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} \text{AP}(q)$$

Both metrics are position-sensitive and non-differentiable, which is why LambdaRank's trick of defining implicit gradients via swap impacts is so valuable.

Exercise

Compute NDCG@5

For a ranking with relevance scores [3, 2, 3, 0, 1] and ideal ranking [3, 3, 2, 1, 0], compute NDCG@5 manually. Verify: DCG = $\frac{7}{\log_2 2} + \frac{3}{\log_2 3} + \frac{7}{\log_2 4} + \frac{0}{\log_2 5} + \frac{1}{\log_2 6} \approx 13.35$.

Table of Contents