Matrix Factorization
A user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ (m users, n items) is typically sparse — most entries are unobserved. Matrix factorization approximates $R$ as a product of low-rank factors:
$$\boxed{R \approx U V^T, \quad U \in \mathbb{R}^{m \times k}, \quad V \in \mathbb{R}^{n \times k}}$$where $k \ll \min(m,n)$ is the embedding dimension. Each user $u$ gets a vector $\mathbf{u}_i \in \mathbb{R}^k$ and each item gets $\mathbf{v}_j \in \mathbb{R}^k$. The predicted rating is $\hat{r}_{ij} = \mathbf{u}_i^T \mathbf{v}_j$.
Optimization objective (regularized, over observed entries only):
$$\min_{U, V} \sum_{(i,j) \in \Omega} (r_{ij} - \mathbf{u}_i^T\mathbf{v}_j)^2 + \lambda(\|\mathbf{u}_i\|^2 + \|\mathbf{v}_j\|^2)$$where $\Omega$ is the set of observed ratings. This is non-convex in $(U, V)$ jointly, but convex in each alone — motivating Alternating Least Squares.
Alternating Least Squares (ALS)
ALS alternates between fixing $V$ (solving for $U$) and fixing $U$ (solving for $V$). When $V$ is fixed, the objective decomposes into $m$ independent ridge regression problems:
Update user $i$:
$$\mathbf{u}_i = (V_{\Omega_i}^T V_{\Omega_i} + \lambda I)^{-1} V_{\Omega_i}^T \mathbf{r}_i$$where $V_{\Omega_i}$ contains rows of $V$ for items rated by user $i$, and $\mathbf{r}_i$ is the corresponding rating vector.
Update item $j$:
$$\mathbf{v}_j = (U_{\Omega_j}^T U_{\Omega_j} + \lambda I)^{-1} U_{\Omega_j}^T \mathbf{r}_j$$Convergence guarantee: Each ALS iteration monotonically decreases the objective (block coordinate descent on a bi-convex problem). Converges to a local minimum. The $\lambda I$ regularization ensures the $k \times k$ matrix is always invertible.
SVD for Collaborative Filtering
Truncated SVD on the centered rating matrix provides the optimal low-rank approximation (in Frobenius norm). For collaborative filtering:
- Mean-center: $\tilde{R}_{ij} = R_{ij} - \bar{r}_i$ (subtract user mean)
- Compute truncated SVD: $\tilde{R} \approx U_k \Sigma_k V_k^T$
- Predict: $\hat{r}_{ij} = \bar{r}_i + (U_k \Sigma_k V_k^T)_{ij}$
The $k$ singular values capture the $k$ most important "latent factors" — interpretable as genres, style preferences, or quality levels. This is the same SVD from Part 7 applied to user-item matrices.
import numpy as np
def als_matrix_factorization(R, k=10, lam=0.1, n_iters=20):
"""
Alternating Least Squares for matrix factorization.
Args:
R: user-item matrix (0 = unobserved)
k: embedding dimension
lam: regularization strength
n_iters: number of ALS iterations
Returns:
U (m x k), V (n x k) factor matrices
"""
m, n = R.shape
mask = (R != 0).astype(float) # observed entries
# Random initialization
np.random.seed(42)
U = np.random.randn(m, k) * 0.1
V = np.random.randn(n, k) * 0.1
for iteration in range(n_iters):
# Fix V, solve for each user
for i in range(m):
rated = np.where(mask[i] > 0)[0]
if len(rated) == 0:
continue
V_rated = V[rated] # (|rated| x k)
r_i = R[i, rated] # (|rated|,)
A = V_rated.T @ V_rated + lam * len(rated) * np.eye(k)
b = V_rated.T @ r_i
U[i] = np.linalg.solve(A, b)
# Fix U, solve for each item
for j in range(n):
users = np.where(mask[:, j] > 0)[0]
if len(users) == 0:
continue
U_users = U[users]
r_j = R[users, j]
A = U_users.T @ U_users + lam * len(users) * np.eye(k)
b = U_users.T @ r_j
V[j] = np.linalg.solve(A, b)
# Compute loss on observed entries
pred = U @ V.T
loss = np.sum(mask * (R - pred)**2) + lam * (np.sum(U**2) + np.sum(V**2))
if iteration % 5 == 0:
print(f" Iteration {iteration}: loss = {loss:.4f}")
return U, V
# Example: small rating matrix (1-5 scale, 0 = unobserved)
R = np.array([
[5, 3, 0, 1, 0],
[4, 0, 0, 1, 0],
[1, 1, 0, 5, 4],
[0, 0, 5, 4, 0],
[0, 1, 4, 0, 5]
], dtype=float)
print("ALS Matrix Factorization (k=2):")
U, V = als_matrix_factorization(R, k=2, lam=0.01, n_iters=20)
# Predict missing entries
predictions = U @ V.T
print(f"\nPredicted ratings (full matrix):")
print(np.round(predictions, 1))
print(f"\nPrediction for user 0, item 2 (unobserved): {predictions[0, 2]:.2f}")
print(f"Prediction for user 1, item 1 (unobserved): {predictions[1, 1]:.2f}")
BPR: Bayesian Personalized Ranking
For implicit feedback (clicks, views — no explicit ratings), we only know which items a user interacted with, not preference scores. BPR (Rendle et al., 2009) assumes: an interacted item should rank higher than a non-interacted one.
Pairwise objective: For user $u$ who interacted with item $i$ but not item $j$, maximize:
$$\boxed{\mathcal{L}_{\text{BPR}} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda \|\Theta\|^2}$$where $\hat{x}_{uij} = \hat{r}_{ui} - \hat{r}_{uj}$ (predicted score difference), $\sigma$ is the sigmoid, and $D_S = \{(u,i,j) : i \in \Omega_u^+, j \notin \Omega_u^+\}$.
Derivation from Bayesian MAP: Assuming preferences are drawn from a Bernoulli with $P(i >_u j) = \sigma(\hat{x}_{uij})$ and a Gaussian prior on parameters:
$$\ln P(\Theta | >_u) \propto \sum_{(u,i,j)} \ln \sigma(\hat{x}_{uij}) - \lambda\|\Theta\|^2$$Gradient for SGD:
$$\frac{\partial \mathcal{L}_{\text{BPR}}}{\partial \Theta} = \sum_{(u,i,j)} \frac{e^{-\hat{x}_{uij}}}{1 + e^{-\hat{x}_{uij}}} \cdot \frac{\partial \hat{x}_{uij}}{\partial \Theta} - 2\lambda\Theta$$For matrix factorization: $\hat{x}_{uij} = \mathbf{u}^T(\mathbf{v}_i - \mathbf{v}_j)$, so $\frac{\partial}{\partial \mathbf{u}} = (1-\sigma(\hat{x}_{uij}))(\mathbf{v}_i - \mathbf{v}_j)$.
LambdaRank
LambdaRank (Burges 2006) defines gradients that directly optimize NDCG — even though NDCG is non-differentiable (it depends on discrete ranking positions).
Key idea: Define "lambda gradients" that weight pairwise swaps by their impact on NDCG:
$$\lambda_{ij} = \frac{-\sigma'(\hat{x}_{ij})}{1 + e^{\hat{x}_{ij}}} \cdot |\Delta\text{NDCG}_{ij}|$$where $\Delta\text{NDCG}_{ij}$ is the change in NDCG from swapping items $i$ and $j$ in the ranking. The gradient for item $i$'s score is:
$$\frac{\partial \mathcal{C}}{\partial s_i} = \sum_{j: j \neq i} \lambda_{ij} - \lambda_{ji}$$This "tricks" gradient descent into optimizing the (non-differentiable) ranking metric by attaching metric-aware weights to pairwise cross-entropy gradients.
Ranking Metrics (NDCG, MAP)
NDCG (Normalized Discounted Cumulative Gain):
$$\text{DCG}@K = \sum_{i=1}^K \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \quad \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}$$where $rel_i$ is the relevance of the item at position $i$, and IDCG is the DCG of the ideal (perfect) ranking. NDCG $\in [0, 1]$.
MAP (Mean Average Precision):
$$\text{AP} = \frac{1}{|\text{relevant}|}\sum_{k=1}^n \text{Precision}@k \cdot \mathbb{1}[\text{item}_k \text{ is relevant}]$$ $$\text{MAP} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} \text{AP}(q)$$Both metrics are position-sensitive and non-differentiable, which is why LambdaRank's trick of defining implicit gradients via swap impacts is so valuable.
Compute NDCG@5
For a ranking with relevance scores [3, 2, 3, 0, 1] and ideal ranking [3, 3, 2, 1, 0], compute NDCG@5 manually. Verify: DCG = $\frac{7}{\log_2 2} + \frac{3}{\log_2 3} + \frac{7}{\log_2 4} + \frac{0}{\log_2 5} + \frac{1}{\log_2 6} \approx 13.35$.