AI in the Wild
Part 5 of 24
About This Article
This article covers the recommender systems landscape end-to-end — from the formulation of the recommendation problem through classical collaborative and content-based filtering, matrix factorisation, deep two-tower architectures, sequential models, and the engineering of production recommendation pipelines at scale.
Recommender Systems Fundamentals
Recommender systems are arguably the highest-value ML application deployed at scale today.
Netflix estimates that its recommendation system saves over $1 billion per year through improved customer retention.
YouTube's recommendation engine drives more than 70% of watch time on the platform.
TikTok's For You Page, a recommendation system of exceptional sophistication,
is frequently credited with making the app the fastest-growing social platform in history.
Behind all of these systems is a deceptively simple problem statement:
given what we know about a user and a catalogue of items,
rank the items that user would most like to see next.
The two fundamental signals are explicit and implicit feedback.
Explicit feedback is direct expression of preference
— star ratings, thumbs up/down, explicit "not interested" signals.
It is clean and unambiguous but rare:
only a small fraction of users rate the things they consume.
Implicit feedback is inferred from behaviour
— clicks, streams, purchases, dwell time, scroll depth, add-to-cart events, skips.
It is abundant and requires no user effort, but it is noisy:
a click may reflect curiosity rather than genuine interest;
a long watch time may indicate the user fell asleep.
Critically, implicit feedback is positive-only
— we observe what users engaged with, but not what they actively disliked.
Distinguishing "not consumed because not interested" from "not consumed because not discovered"
is the core modelling challenge.
Key Insight:
The goal of a recommender system is not simply to predict what a user will rate highest in isolation,
but to surface items the user would genuinely enjoy that they would not have found on their own.
Novelty and serendipity — measured through metrics like intra-list diversity and catalogue coverage
— are often as commercially valuable as raw relevance accuracy,
because a system that only surfaces already-popular items
teaches users nothing and adds no discovery value.
The most commercially successful recommendation systems
(TikTok's For You Page, Spotify's Discover Weekly)
are celebrated precisely for their ability to surface unexpected items
that users genuinely love — not for their ability to recommend what users
already know they would like.
At scale, recommendation is decomposed into at least two stages for latency and compute reasons.
The retrieval (or candidate generation) stage must quickly narrow a catalogue of millions of items
down to hundreds of plausible candidates for a given user — typically in under 10 milliseconds.
The ranking stage then applies a heavier model to score those candidates with richer features
and contextual signals, reordering them before presentation.
This two-stage (and often three-stage, with a re-ranking phase for diversity and business rules)
decomposition is fundamental to every large production recommendation system,
from the architecture described in YouTube's "Deep Neural Networks for YouTube Recommendations" paper to Pinterest's PinSage.
Offline evaluation uses historical interaction data held out from training.
Precision@k measures the fraction of the top-k recommended items that the user actually interacted with.
Recall@k measures the fraction of all items the user interacted with that appear in the top-k list.
NDCG (Normalised Discounted Cumulative Gain) accounts for ranking position:
each relevant item's contribution is discounted by 1/log2(rank + 1), so an item at position 1 contributes far more than an item at position 10.
Mean Reciprocal Rank (MRR) focuses on the position of the first relevant item.
These metrics are computed on held-out interaction logs, typically using temporal splits
to respect the causal structure of the problem.
The disconnect between offline metrics and online business metrics is a persistent operational headache:
a model with better NDCG may produce lower click-through rate in an A/B test,
because the historical interactions it was evaluated against reflect the biases of the previous
recommendation system rather than true user preferences.
The Cold-Start Problem
Cold-start describes situations where the interaction matrix provides insufficient signal
for collaborative methods to work.
New user cold-start: a user who signed up five minutes ago has no interaction history.
New item cold-start: a newly uploaded podcast episode or a just-listed product
has received zero plays or purchases.
System cold-start: a new platform bootstrapping its recommendation system
before any interaction data has been collected.
Each requires a different mitigation strategy.
For new users, onboarding preference elicitation
— asking users to rate a curated set of seed items or select interest categories
— provides an immediate proxy for their taste profile.
For new items, content-based features
(title, description, category, genre, audio characteristics)
enable similarity-based recommendations before engagement data accumulates.
Popularity-based fallbacks serve as the default for both new users and new items,
with the risk of reinforcing popularity bias and reducing catalogue exploration.
Hybrid models that blend content features with collaborative signals provide a principled approach:
as interaction data accumulates,
the collaborative component's weight increases and the content component's weight decreases.
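One simple way to realise this shifting blend is to make the collaborative weight a smooth function of how much interaction data an item (or user) has accumulated. The sketch below is illustrative: the weighting function and the `half_life` constant are assumptions for exposition, not taken from any specific production system.

```python
def blended_score(cf_score, content_score, n_interactions, half_life=20):
    """Blend collaborative and content-based scores for one item.

    The collaborative weight rises smoothly from 0 (no interactions)
    towards 1 as interaction data accumulates; half_life is the
    interaction count at which both signals are weighted equally.
    (Hypothetical weighting scheme for illustration.)
    """
    w_cf = n_interactions / (n_interactions + half_life)
    return w_cf * cf_score + (1.0 - w_cf) * content_score

# A brand-new item relies entirely on its content score;
# a well-established item relies almost entirely on the CF score.
print(blended_score(0.9, 0.6, n_interactions=0))     # 0.6
print(blended_score(0.9, 0.6, n_interactions=20))    # 0.75
print(blended_score(0.9, 0.6, n_interactions=2000))  # ~0.897
```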
Collaborative Filtering
Collaborative filtering (CF) operates from a single elegant premise: users who agreed in the past will agree in the future. CF uses only the interaction matrix — no item metadata, no user demographics, no content features — making it domain-agnostic. A CF system trained on movie ratings can be transplanted, without modification, to books, songs, or e-commerce products. This generality is its greatest strength. Its greatest weaknesses are the cold-start problem (no interactions, no recommendations) and popularity bias: popular items collect disproportionate interaction signal, causing the model to over-recommend mainstream content and neglect the long tail.
Memory-Based CF
Memory-based CF computes recommendations directly from the interaction matrix
without learning a parametric model.
User-based CF finds the k most similar users to the target user
— measured by cosine similarity or Pearson correlation over their shared item ratings
— and recommends items those neighbours liked that the target user has not yet seen,
weighted by similarity score and neighbour rating.
Item-based CF inverts this logic: for each item the user has interacted with,
find items most similar to it (by how similarly they are rated across all users),
and recommend the most similar unseen items.
Amazon's original recommendation engine used item-based CF extensively,
and it remains in use today in hybrid combinations.
Item-based CF is preferred over user-based CF in most large-scale settings for two practical reasons.
First, item similarity is more temporally stable:
two films that are consistently co-watched remain similar across years,
while user preferences shift as tastes evolve.
Second, the item catalogue is typically far smaller than the user base
(millions of users, hundreds of thousands of items),
making item-item similarity matrices cheaper to compute and store.
Pre-computing top-k similar items for every item in the catalogue,
and refreshing this index periodically, is the standard production approach.
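The item-based scoring step described above can be sketched in a few lines of NumPy. This is a minimal, dense illustration — the toy ratings and function names are assumptions for exposition; production systems work with sparse matrices and precomputed top-k neighbour indices.

```python
import numpy as np

# Toy interaction matrix: rows = users, columns = items (0 = unobserved)
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Item-item cosine similarity over rating columns
norms = np.linalg.norm(R, axis=0, keepdims=True)
S = (R.T @ R) / (norms.T @ norms + 1e-12)

def item_based_scores(user_ratings, S):
    """Score every item as a similarity-weighted average of the
    user's observed ratings (mask already-rated items before
    recommending)."""
    rated = user_ratings > 0
    weights = S[:, rated]                 # (n_items, n_rated)
    scores = weights @ user_ratings[rated]
    scores /= np.abs(weights).sum(axis=1) + 1e-12
    return scores

scores = item_based_scores(R[1], S)       # user 1 has not rated items 1 and 2
unseen = np.where(R[1] == 0)[0]
best = unseen[np.argmax(scores[unseen])]
print("recommend item", best)
```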
Model-Based CF & Matrix Factorisation
Model-based CF learns a compact parametric representation of the interaction matrix.
Matrix factorisation, the dominant approach, decomposes the sparse user-item matrix R (shape |U| × |I|)
into two dense low-rank matrices:
a user embedding matrix P (shape |U| × k) and an item embedding matrix Q (shape |I| × k),
such that R ≈ P × Q^T.
The predicted rating for user u on item i is the dot product of their latent vectors: r̂_ui = p_u · q_i.
These k-dimensional latent factors capture abstract taste dimensions
— not explicitly interpretable, but measurably predictive.
Simon Funk's entry in the Netflix Prize competition (2006) demonstrated that
optimising over only the observed entries of R using stochastic gradient descent with L2 regularisation
substantially outperformed SVD applied to the zero-imputed dense matrix.
Alternating Least Squares (ALS) is an alternative optimisation strategy that is trivially parallelisable:
fix Q, solve for each row of P in closed form;
fix P, solve for each row of Q in closed form; repeat.
ALS is particularly suited to implicit feedback through the iALS extension:
rather than treating unobserved interactions as missing,
they are treated as negative signals with low confidence,
while observed interactions receive high confidence weights.
Bayesian Personalised Ranking (BPR) takes yet another approach,
optimising for the relative ordering of interacted vs. non-interacted items using pairwise loss
— directly targeting ranking quality rather than rating prediction accuracy.
Matrix Factorisation: Code Example
The following example builds a user-item rating matrix, applies Truncated SVD for matrix factorisation, reconstructs predicted ratings, and computes item-to-item cosine similarity from the learned item factors:
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Build user-item rating matrix
ratings = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'item_id': [101, 102, 103, 101, 104, 102, 103, 105],
    'rating': [5.0, 3.0, 4.0, 4.0, 5.0, 2.0, 5.0, 4.0]
})
user_item = ratings.pivot(index='user_id', columns='item_id', values='rating').fillna(0)
# Shape: (3 users x 5 items)

# Matrix factorisation: decompose into latent user/item factors
svd = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd.fit_transform(user_item)  # (3, 2) -- user embeddings
item_factors = svd.components_.T             # (5, 2) -- item embeddings

# Predict ratings: user_factors @ item_factors.T
predicted = user_factors @ item_factors.T
print("Predicted rating for user 1, item 104:", predicted[0, 3])  # rank-2 estimate of an unobserved rating

# Item-to-item similarity (used by Amazon's collaborative filtering)
item_sim = cosine_similarity(item_factors)
print("Items most similar to item 101:", np.argsort(-item_sim[0])[1:3])
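The BPR pairwise objective described earlier can also be sketched directly. This is a minimal NumPy version under stated assumptions: the embedding shapes, helper names, and regularisation constant are illustrative, not from a specific implementation.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(user_vec, pos_item_vec, neg_item_vec, reg=1e-4):
    """Bayesian Personalised Ranking loss over a batch of
    (user, interacted item, sampled negative item) triples.

    Maximises the log-probability that each interacted item outranks
    its sampled negative, plus L2 regularisation on the embeddings."""
    pos = np.sum(user_vec * pos_item_vec, axis=-1)
    neg = np.sum(user_vec * neg_item_vec, axis=-1)
    rank_loss = -np.mean(np.log(_sigmoid(pos - neg) + 1e-12))
    l2 = sum(np.sum(v ** 2) for v in (user_vec, pos_item_vec, neg_item_vec))
    return rank_loss + reg * l2

rng = np.random.default_rng(0)
u, i, j = (rng.normal(size=(32, 16)) for _ in range(3))
loss = bpr_loss(u, i, j)
print(loss)  # a positive scalar
```

Note that the loss depends only on score differences, never on absolute predicted ratings — which is exactly why BPR targets ranking quality rather than rating accuracy.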
Recommender Systems Across Industries
Recommender systems are deployed at scale across radically different industry contexts, each with its own catalogue structure, feedback signal, latency constraints, and ethical considerations. Understanding these differences is essential for designing systems that are fit for purpose rather than treating every recommendation problem as a variant of the Netflix movie problem.
E-Commerce & Retail
E-commerce recommendation must serve multiple funnel stages simultaneously: homepage personalisation (what to show a returning user before they have expressed any session intent), search result ranking (personalising the order of search results for a given query), product detail page cross-sells ("Customers also bought"), add-to-cart upsells, and post-purchase follow-ups. Each stage has different data availability, latency requirements, and business objectives. Amazon's patented item-to-item collaborative filtering — recommending items frequently co-purchased with the current item — remains the backbone of many e-commerce recommendation systems because it is fast, transparent, and naturally produces relevant cross-category recommendations. Modern systems layer on deep learning: a product tower encodes catalogue items using their images, descriptions, and metadata; a user tower encodes purchase history, browsing sessions, and demographic signals; and a real-time session feature captures current browse context.
Price sensitivity introduces a uniquely e-commerce challenge: a user who consistently buys budget products should not be shown premium recommendations, but a first-time luxury purchase should update their profile towards higher price points. Contextual bandits — one per price tier — handle this cleanly. Inventory constraints are another production complexity absent from media recommendation: recommending out-of-stock items, discontinued products, or items with only 1-2 units remaining requires real-time inventory signal integration in the feature store. Seasonality is dramatic: the recommendation system's sense of "similar items" shifts rapidly during holiday periods, and models retrained on year-round data may be poorly calibrated for peak shopping periods.
Streaming Media
Netflix, Spotify, and YouTube each have recommendation challenges shaped by their unique catalogue and engagement dynamics. Netflix's catalogue is relatively small (tens of thousands of titles) but each title requires significant commitment (a two-hour film, a 10-episode series), making the cost of a bad recommendation high and the exploration incentive strong. Spotify's catalogue is vast (100M+ tracks) but items are short (3–4 minutes), enabling rapid user feedback and exploration across many items per session. YouTube must recommend from a catalogue of 500 hours of video uploaded every minute — cold-start for new content is not a problem to solve once but a continuous operational challenge.
Netflix's recommendation system is built around user taste clusters — groups of users with similar preference profiles, estimated through matrix factorisation and updated weekly. The most famous algorithmic innovation Netflix disclosed is the Evidence and Artwork Personalisation system: the thumbnail image shown for each title is personalised per user. A user cluster that watches many romance films will see a thumbnail emphasising romantic moments from an action film; a cluster that watched many Nicolas Cage films will see his face prominently in the thumbnail. This multi-armed bandit system for artwork selection was reported to improve click-through by over 20% per title, demonstrating that presentation-layer personalisation can be as impactful as catalogue-level ranking.
News & Information
News recommendation differs from entertainment recommendation in its societal stakes. Entertainment recommendation optimising for engagement tends to produce serendipitous discovery of content the user enjoys. News recommendation optimising for engagement can amplify sensational, emotionally activating, and politically polarising content at the expense of informative, accurate, but less emotionally resonant reporting. Google News, Microsoft News, and Apple News have all implemented editorial constraints — blacklisting certain sources, downranking clickbait patterns, requiring geographic and topical diversity in news feeds — that deliberately reduce engagement metrics in exchange for information quality. This is a rare case of explicit business acceptance of recommendation metric degradation in service of a broader value. The practical implementation challenge is operationalising "information quality" as a computable signal: using signals like source reputation scores, factual consistency checks, linguistic quality classifiers, and editorial review pipelines alongside the core collaborative and content-based signals.
Content-Based Filtering
Content-based filtering recommends items similar to those the user has previously engaged with, using item features rather than other users' behaviour. This approach is entirely independent of the interaction matrix — it can make recommendations for a new user with a single known preference, and it can recommend newly uploaded items immediately. Feature engineering for content-based systems varies significantly by item type. For text items (news articles, product descriptions, academic papers), TF-IDF vectors capture keyword-based similarity, while sentence transformers (BERT, SBERT) capture semantic similarity that is robust to paraphrase and vocabulary variation. For music, acoustically computed features — tempo, key, mode, spectral centroid, timbre descriptors — can be supplemented with embeddings from audio neural networks trained on large music corpora. For movies, metadata graphs combining genre, cast, director, era, and critical reception enable multidimensional similarity computation.
A user profile in a content-based system is typically an aggregated representation of the items the user has interacted with — for instance, the average or weighted average of item embeddings, with recent interactions weighted more heavily. Recommendations are then items whose embeddings are most similar to this user profile vector. The approach is transparent and controllable: if a user reads five articles about climate policy, the profile vector is close to the climate policy cluster, and climate policy articles will be surfaced. The fundamental limitation is over-specialisation: a content-based system cannot recommend across category boundaries. A reader of science fiction novels will not be recommended a science fiction film, and a classical music listener will not discover jazz no matter how much their preferences evolve, unless they explicitly interact with cross-category content. Collaborative filtering does not have this boundary: a user similar to many jazz listeners who consistently listened to classical music will receive jazz recommendations even with zero prior jazz interactions.
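The recency-weighted user profile just described can be sketched as follows. The half-life constant, toy embeddings, and function name are illustrative assumptions:

```python
import numpy as np

def build_user_profile(item_embeddings, timestamps, half_life_days=30.0):
    """Exponentially recency-weighted average of the embeddings of
    items the user interacted with; newer interactions count more.
    Returns a unit-normalised profile vector."""
    timestamps = np.asarray(timestamps, dtype=float)
    age_days = timestamps.max() - timestamps
    weights = 0.5 ** (age_days / half_life_days)
    profile = (weights[:, None] * item_embeddings).sum(axis=0) / weights.sum()
    return profile / (np.linalg.norm(profile) + 1e-12)

# Three interacted items in a 4-d embedding space, most recent last
items = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
profile = build_user_profile(items, timestamps=[0, 15, 60])

# Recommend by cosine similarity between the profile and candidates
candidates = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0]])
sims = candidates @ profile
print("most similar candidate:", int(np.argmax(sims)))  # 1: recent interest wins
```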
Feedback Loop Warning: Recommender systems trained on historical interaction data amplify existing popularity biases in a self-reinforcing loop: popular items receive more recommendations, accumulate more interactions, appear more frequently in training data, and receive even stronger recommendations in the next model generation. Over time, catalogue coverage — the fraction of items that are ever recommended to anyone — shrinks, newer or niche items struggle to gain traction regardless of their quality, and user experience homogenises. Explicit diversity and exploration mechanisms — MMR-based re-ranking, exploration bonuses, catalogue coverage monitoring, and periodic exposure guarantees for new items — are necessary engineering interventions, not optional optimisations.
Hybrid Recommender Systems
Most production recommendation systems are hybrid: they combine collaborative and content-based signals in ways that mitigate the weaknesses of each. The integration can occur at different levels of the system. Feature-level hybridisation passes both collaborative features (user and item embeddings from matrix factorisation) and content features (item text, image, or metadata embeddings) as inputs to a shared deep neural network that learns to combine them. Model-level hybridisation maintains separate CF and content-based models and combines their output scores through a weighted ensemble or a meta-learner that is trained to predict which system should be trusted more for a given user-item pair. Cascade hybridisation uses one system to filter candidates and the other to rank them — a common pattern is content-based candidate retrieval followed by collaborative ranking.
The choice of hybridisation strategy depends on data availability and system constraints. When item content is rich and reliable (structured product attributes, well-labelled metadata), feature-level hybridisation produces strong representations even at launch. When interaction data is dense (millions of users, billions of interactions), pure collaborative systems and model-level hybrids that place more weight on collaborative signal tend to outperform content-heavy approaches. For platforms with significant new-item velocity, a cascade that uses content retrieval for items under a threshold age and collaborative retrieval for older items with sufficient interaction data provides a principled dynamic hybrid strategy. Pinterest's PinSage — a Graph Convolutional Network that propagates information through the pin-board graph using both content features and engagement signals — is the canonical example of a production system where the hybrid integration happens at the graph structure level rather than as a post-hoc ensemble.
Deep Learning for Recommendations
Deep learning transformed recommender systems in two ways. First, it enabled learning representations directly from raw input features — text descriptions, product images, audio signals, and behaviour sequences — without manual feature engineering. A product's embedding can be computed from its image and title using pre-trained encoders, immediately producing a high-quality representation even before any interaction data exists. Second, it enabled modelling complex non-linear user-item interactions that the dot product of matrix factorisation cannot capture. Wide & Deep Learning (Google, 2016), deployed in Google Play recommendations, combined a wide linear model (for memorisation of specific feature combinations) with a deep neural network (for generalisation across novel feature combinations). DeepFM and DCN (Deep Cross Network) added explicit feature crossing layers that learn high-order feature interactions systematically, particularly valuable in advertising click-through rate prediction.
Two-Tower Models
The two-tower (or dual-encoder) architecture has become the dominant approach for the retrieval stage of large-scale recommendation. The architecture is conceptually simple: a user tower maps user context (ID embedding, interaction history, demographics, session features) through a neural network to produce a user embedding vector u; an item tower maps item features (ID embedding, content features, metadata) to an item embedding vector v; the relevance score is the dot product u · v. During inference, because item embeddings do not depend on the query user, they can be pre-computed for the entire catalogue and indexed offline using approximate nearest neighbour (ANN) data structures — FAISS, Google's ScaNN, or managed services like Vertex AI Matching Engine — that retrieve the top-k most similar items to a query embedding in sub-millisecond time, even over billions of items.
Training uses sampled softmax with in-batch negatives: for each positive (user, item) interaction in a mini-batch, all other items in the batch serve as negatives. This is computationally efficient but can be misleading — popular items appear frequently as negatives and may be under-recommended as a result. Hard negative mining addresses this by explicitly sampling items the model currently ranks highly but are not positive examples, forcing the model to differentiate genuinely relevant items from deceptively plausible but irrelevant ones.
Two-Tower Model: Code Sketch
The following PyTorch sketch implements the two-tower architecture. In production, the user and item towers would incorporate richer feature sets — interaction history sequences, content embeddings, contextual signals — but the structural pattern is identical:
import torch
import torch.nn as nn

class TwoTowerRecommender(nn.Module):
    """Two-tower architecture: separate encoders for users and items."""

    def __init__(self, n_users, n_items, embedding_dim=64, hidden_dim=128):
        super().__init__()
        # User tower
        self.user_embed = nn.Embedding(n_users, embedding_dim)
        self.user_tower = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 64)
        )
        # Item tower (can also encode item features)
        self.item_embed = nn.Embedding(n_items, embedding_dim)
        self.item_tower = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 64)
        )

    def forward(self, user_ids, item_ids):
        u = self.user_tower(self.user_embed(user_ids))  # (batch, 64)
        v = self.item_tower(self.item_embed(item_ids))  # (batch, 64)
        # Dot product similarity (used as logit for binary cross-entropy)
        return (u * v).sum(dim=-1)  # (batch,)

# Training with in-batch negatives (each item in batch is a negative for other users)
# YouTube DNN, Pinterest, TikTok all use this pattern for candidate retrieval
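The in-batch-negatives training noted in the comment above amounts to a softmax over the batch similarity matrix: matched (user, item) pairs sit on the diagonal, and every other item in the batch serves as a negative. A hypothetical training-step sketch (the temperature value and batch sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.05):
    """Sampled-softmax loss with in-batch negatives.

    logits[i, j] is the similarity between user i and item j; the
    positive pair for user i is item i (the diagonal), so the target
    label for row i is simply i."""
    logits = user_emb @ item_emb.T / temperature  # (batch, batch)
    labels = torch.arange(user_emb.size(0))
    return F.cross_entropy(logits, labels)

u = F.normalize(torch.randn(8, 64), dim=-1)
v = F.normalize(torch.randn(8, 64), dim=-1)
loss = in_batch_softmax_loss(u, v)
print(loss)  # scalar training loss
```

In practice this would be combined with a log-popularity correction or hard negative mining to counter the popular-items-as-negatives bias described above.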
Case Study
How Spotify Moved from Matrix Factorisation to Two-Tower Models for Podcast Recommendations
Spotify launched podcast recommendations in 2019 using collaborative filtering with iALS — the same matrix factorisation approach that had powered their music recommendations successfully for years. The system worked well for popular podcasts with abundant interaction data, but the podcast catalogue presented a structural challenge that music did not: rapid content growth (thousands of new episodes daily), extreme newness distribution (most streams concentrated in the first 48 hours after upload), and a cold-start problem for new shows with no play history. iALS embeddings could not be computed for new podcasts until they had accumulated sufficient interactions.
The team shifted to a two-tower architecture: a podcast tower encoding audio content features (extracted via a pre-trained audio transformer) alongside metadata, and a user tower encoding listening history via a learned sequence encoder. Because the podcast tower was trained on content features rather than collaborative signal alone, embeddings for brand-new episodes could be computed immediately at upload time — before a single play had occurred. The transition reduced cold-start latency from days to zero and lifted click-through rate on new episode recommendations by 31% in a held-out A/B test.
Spotify
Two-Tower Model
Cold-Start
Session-Based & Sequential Recommendations
Many recommendation contexts lack persistent user identity or have users whose preferences are highly context-dependent within a session. A user browsing shoes for 20 minutes wants shoe recommendations; showing them music or books ignores the obvious session intent. Session-based and sequential recommendation models explicitly model the sequence of user actions as the primary context signal. GRU4Rec (Hidasi et al., 2015) applied Gated Recurrent Units to session sequences, treating the next item prediction task as a sequence-to-sequence problem and training with session-parallel mini-batches to handle variable-length sessions efficiently. Transformer-based sequential models have since become the standard. SASRec (Self-Attentive Sequential Recommendation) applies a causal transformer encoder to the user's interaction history, using attention to dynamically weight earlier interactions based on their relevance to the current prediction step. BERT4Rec uses bidirectional attention with masked item prediction to better capture long-range dependencies. These models power "Continue watching," "Related to your recent purchase," and "Based on your listening history" features that drive a disproportionate share of engagement on streaming platforms.
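The SASRec pattern — a causal transformer over the interaction sequence, scored against all item embeddings — can be sketched compactly. This is a structural sketch only: layer sizes, the tied output embedding, and the absence of padding handling are simplifying assumptions relative to the published architecture.

```python
import torch
import torch.nn as nn

class CausalSelfAttnRec(nn.Module):
    """SASRec-style sequential recommender: causal self-attention
    over the interaction history, next-item logits at each step."""

    def __init__(self, n_items, d_model=64, n_heads=2, max_len=50):
        super().__init__()
        self.item_embed = nn.Embedding(n_items, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, item_seq):                    # (batch, seq_len)
        seq_len = item_seq.size(1)
        pos = torch.arange(seq_len, device=item_seq.device)
        x = self.item_embed(item_seq) + self.pos_embed(pos)
        # Causal mask: position t may only attend to positions <= t
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.encoder(x, mask=mask)
        # Score each position's state against every item (tied weights)
        return h @ self.item_embed.weight.T         # (batch, seq_len, n_items)

model = CausalSelfAttnRec(n_items=1000)
logits = model(torch.randint(0, 1000, (4, 10)))
print(logits.shape)  # torch.Size([4, 10, 1000])
```

The logits at the final sequence position are the model's next-item predictions for "Continue watching"-style features.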
Case Study
TikTok's For You Page: The Recommendation System That Changed Social Media
TikTok's For You Page is the most consequential recommender system deployed in the past decade, and its design philosophy departs significantly from traditional approaches. Where most social platforms use a primarily social graph — what your friends share drives what you see — TikTok's FYP gives almost no weight to who you follow and almost all weight to your direct engagement signals: watch time, replays, likes, shares, comments, and profile visits. This means a new account with zero followers can receive millions of views on a single video if the initial sample of viewers responds positively, and a user with zero followed accounts receives a fully personalised feed from their very first session.
The FYP system operates at multiple timescales simultaneously. At the single-video level, each video plays automatically — removing the click barrier eliminates one of the biggest friction points that would otherwise suppress interaction signal. At the session level, the system monitors watch percentage (a user watching 100% of a 3-minute video is a stronger signal than a user watching 10% of a 30-second video, even though total watch time may be similar) and adjusts recommendations based on the user's engagement trajectory within the session. At the multi-session level, interest drift is handled gracefully: a user who watched cooking videos for two weeks and then abruptly stopped watching them will see cooking content gradually deprioritised and replaced with whatever the most recent engagement signals suggest. ByteDance has disclosed that the core architecture is a two-tower retrieval followed by a deep ranking model, with reinforcement learning used for the exploration policy that determines how aggressively to introduce content from novel categories into each user's feed.
TikTok
For You Page
Watch Time Optimisation
Graph Neural Networks for Recommendations
The user-item interaction matrix can be represented as a bipartite graph: users and items are nodes, and each observed interaction is an edge.
Graph Neural Networks (GNNs) can propagate information through this graph, enabling user and item embeddings to incorporate multi-hop neighbourhood information.
A user node aggregates information from the items they have interacted with; those item nodes in turn aggregate information from all users who interacted with them, including users the original user has never directly interacted with.
This neighbourhood aggregation captures higher-order collaborative signals that matrix factorisation's direct user-item factorisation cannot represent.
PinSage (Ying et al., Pinterest 2018) was the first industrial-scale GNN-based recommender, operating on a graph of 3 billion nodes (pins and boards) and 18 billion edges.
Unlike spectral GNNs that require the full graph to be resident in memory, PinSage uses random walk-based neighbourhood sampling to construct small, fixed-size neighbourhoods for each node, enabling mini-batch training on the full Pinterest graph.
The resulting item embeddings support visual and semantic similarity search, complementing the collaborative signal from interaction history.
LightGCN (He et al., 2020) simplified the GCN architecture for recommendation by removing the feature transformation and non-linear activation at each layer — the theoretical justification being that these components add parameters without improving collaborative filtering quality.
LightGCN's final embeddings are a weighted sum of the representations from all layers, capturing both local (one-hop) and global (multi-hop) graph structure, and it consistently outperforms matrix factorisation and standard GCN-based methods on collaborative filtering benchmarks.
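The LightGCN propagation rule is compact enough to sketch directly. The toy NumPy example below uses random, untrained embeddings on a 3-user × 4-item graph — it illustrates only the propagation and layer-averaging steps described above, not the full BPR-trained model:

```python
import numpy as np

# Toy interaction matrix: 3 users x 4 items (1 = observed interaction).
R = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
n_users, n_items = R.shape

# Adjacency of the user-item bipartite graph (users and items as one node set).
A = np.zeros((n_users + n_items, n_users + n_items))
A[:n_users, n_users:] = R
A[n_users:, :n_users] = R.T

# Symmetric normalisation: A_hat = D^{-1/2} A D^{-1/2}
deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Layer-0 embeddings (random here; learned via a BPR loss in the real model).
rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.1, size=(n_users + n_items, 8))]

# LightGCN propagation: no feature transform, no non-linearity.
for _ in range(3):
    layers.append(A_hat @ layers[-1])

# Final embedding = average over all layers, mixing one-hop and multi-hop signal.
E = np.mean(layers, axis=0)
user_emb, item_emb = E[:n_users], E[n_users:]
scores = user_emb @ item_emb.T   # predicted user-item affinities
print(scores.shape)  # (3, 4)
```

The absence of per-layer weight matrices and activations is exactly the simplification LightGCN argued for: propagation alone carries the collaborative signal.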
Key Insight: GNN-based recommendation is especially powerful for platforms with rich auxiliary graph structure beyond the user-item interaction graph: social graphs (who follows whom), knowledge graphs (item-entity relationships), and content similarity graphs (which items share topics, genres, or creators) can all be incorporated as additional edge types. Multi-relational GNNs that learn separate propagation rules for each edge type produce richer representations than systems that treat all relationships identically.
Evaluation Metrics
Choosing the right evaluation metric for a recommender system is not a technical detail — it is a design decision that encodes what the system should optimise for. Precision@K favours systems that put relevant items in the top-K with high density; Recall@K favours systems that capture a large fraction of all relevant items. NDCG rewards getting the most relevant items to the very top of the list. Coverage and novelty metrics capture the health of the catalogue and the discovery value the system provides. In production, multiple metrics are always monitored together, because any single metric can be gamed or can capture only part of the value being delivered.
Evaluation Metrics: Code Implementation
The following Python implementation covers the three most important recommendation evaluation metrics — Precision@K, Recall@K, and NDCG@K — with a worked example:
```python
import numpy as np

def precision_at_k(recommended, relevant, k=10):
    """Fraction of top-k recommendations that are relevant."""
    recommended_k = recommended[:k]
    return len(set(recommended_k) & set(relevant)) / k

def recall_at_k(recommended, relevant, k=10):
    """Fraction of relevant items found in top-k recommendations."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(recommended, relevant, k=10):
    """Normalised Discounted Cumulative Gain -- penalises relevant items ranked lower."""
    relevant_set = set(relevant)
    dcg = sum(1 / np.log2(i + 2) for i, item in enumerate(recommended[:k]) if item in relevant_set)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Example
recs = [105, 102, 101, 107, 103, 109, 106, 104, 108, 110]
true_likes = {101, 103, 105, 107}
print(f"Precision@10: {precision_at_k(recs, true_likes):.2f}")  # 0.40
print(f"Recall@10: {recall_at_k(recs, true_likes):.2f}")        # 1.00
print(f"NDCG@10: {ndcg_at_k(recs, true_likes):.4f}")            # 0.9047
```
Approaches & Metrics Comparison
The following tables summarise how the major recommendation approaches compare across key dimensions, and what each evaluation metric measures:
| Approach | Data Needed | Cold Start | Scalability | Interpretability | Used By |
|---|---|---|---|---|---|
| Collaborative Filtering | Interaction matrix only | Poor | Medium (ALS parallelisable) | Low ("users like you") | Amazon (original), GroupLens |
| Content-Based | Item features + user history | Good (new items) | High | High ("because you liked X") | Pandora, news apps |
| Hybrid | Both interactions + features | Good | Medium-High | Medium | Netflix, Spotify (legacy) |
| Two-Tower Neural | Interactions + rich features | Excellent (content tower) | Very High (ANN index) | Low (embedding space) | YouTube, Pinterest, TikTok, Spotify |

| Metric | What It Measures | Formula (simple) | When to Prioritise |
|---|---|---|---|
| Precision@K | Density of relevant items in top-K | \|relevant ∩ top-K\| / K | When user attention is scarce (short lists, ads) |
| Recall@K | Coverage of relevant items in top-K | \|relevant ∩ top-K\| / \|relevant\| | When missing relevant items has high cost (e-commerce search) |
| NDCG@K | Ranked quality — top positions weighted higher | DCG / IDCG | Primary metric for ranked recommendation lists |
| MAP | Average precision across all relevant items | Mean of AP@K over users | When full recall across the ranked list matters |
| Coverage | Fraction of catalogue ever recommended | \|recommended items\| / \|catalogue\| | Long-tail content, combating popularity bias |
| Novelty | How unknown recommended items are to the user | Avg. inverse popularity of recs | Discovery platforms, creative content, new releases |
Production Recommendation Pipelines
Large-scale recommendation pipelines follow a consistent multi-stage funnel structure. Candidate generation runs multiple retrieval sources in parallel: a two-tower ANN retrieval system (personalised to user), a session-based sequential model (short-term intent), popularity-based retrieval (trending and new items), editorial curation (manually selected items for promotions), and contextual retrieval (items similar to what the user is currently viewing). Each source contributes a few hundred candidates, producing a merged pool of perhaps 500–1,000 items. The ranking stage applies a heavier model — typically a deep neural network with hundreds of features per (user, item) pair, including pre-computed embeddings, interaction statistics, item quality signals, and real-time contextual features — to score and reorder these candidates.
The re-ranking stage applies business rules and optimisation objectives beyond raw relevance score: Maximal Marginal Relevance (MMR) injects diversity by penalising items that are too similar to already-selected items in the slate; freshness boosting increases the score of recently published content; policy filters enforce geographic rights restrictions, age-appropriateness rules, and copyright constraints. The final ranked list is assembled and served via a feature store — a low-latency key-value store (Feast, Tecton, or custom Redis-backed solutions) that pre-computes and serves user and item features with sub-millisecond retrieval times.
A/B testing recommender systems requires careful design: the standard approach of randomly assigning users to control and treatment suffers from network effects, and interleaving and counterfactual methods offer more sensitive, less contaminated alternatives. These evaluation challenges are treated in detail in the dedicated A/B testing section below.
The engagement optimisation vs. user wellbeing tension is the most ethically significant challenge in production recommendation. Optimising for short-term engagement metrics — clicks, watch time, session length — produces systems that surface emotionally activating, sensational, and potentially harmful content, because such content drives strong short-term engagement signals. Platforms have responded with mixed success: introducing friction before sharing, reducing recommendation of borderline content, and incorporating user survey data on "time well spent" into reward signals alongside engagement metrics. The technical tools exist to reduce harmful recommendation patterns; the challenge is primarily organisational and incentive-structural.
The Ranking Stage in Detail
The ranking stage applies a significantly richer feature set to the candidate pool than retrieval allows. A typical production ranking model consumes hundreds or thousands of features per (user, item) pair. User-side features: user ID embedding, demographic segment, subscription tier, interaction history statistics (number of items rated, average session length, days since last visit), long-term taste embeddings from the candidate generation stage, and short-term session features computed from the current browsing session. Item-side features: item ID embedding, content category, content embedding, popularity statistics (number of interactions globally and within the user's demographic), freshness (time since upload or publication), and quality signals (average rating, share rate, completion rate). Cross features: historical interaction count between this specific user-item pair (if any), historical interaction patterns between the user and the item's content category, and co-occurrence statistics.
Deep learning ranking models — DLRM (Meta's Deep Learning Recommendation Model), DCN v2 (Deep Cross Network), and similar architectures — are specifically designed for this feature structure, combining dense embeddings from high-cardinality categorical features (user ID, item ID, category) with dense features (interaction counts, timestamps, quality scores) through explicit feature crossing layers and deep MLP towers. These models are trained on click prediction, completion prediction, or conversion prediction objectives, with the ranking score computed as a weighted combination of multiple predicted outcomes: P(click) × 0.4 + P(completion | click) × 0.5 + P(share | click) × 0.1, where the weights are tuned to reflect the business value of each outcome.
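A minimal sketch of that weighted multi-objective combination follows; the function name, head names, weights, and example probabilities are all illustrative, not from any specific production system:

```python
# Hypothetical multi-head score combination mirroring
# P(click) x 0.4 + P(completion|click) x 0.5 + P(share|click) x 0.1.
def ranking_score(p_click, p_complete_given_click, p_share_given_click,
                  weights=(0.4, 0.5, 0.1)):
    """Weighted sum of predicted outcome probabilities -> final ranking score."""
    w_click, w_complete, w_share = weights
    return (w_click * p_click
            + w_complete * p_complete_given_click
            + w_share * p_share_given_click)

# (P(click), P(completion|click), P(share|click)) per candidate
candidates = {"long_video": (0.10, 0.80, 0.02), "clickbait": (0.25, 0.30, 0.01)}
ranked = sorted(candidates, key=lambda c: ranking_score(*candidates[c]),
                reverse=True)
print(ranked)  # ['long_video', 'clickbait'] -- completion weight beats raw CTR
```

Note how the completion-weighted score ranks the lower-CTR item first; this is precisely how the weight tuning encodes business value beyond clicks.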
A/B Testing & Counterfactual Evaluation
A/B testing recommender systems is harder than A/B testing static UI changes because the recommendation system's behaviour is coupled to user behaviour in complex ways. Standard A/B tests randomly assign users to control (current system) or treatment (new system) buckets. Because recommendation generates the content users see, and user engagement with that content generates the training data for the next model update, experiment outcomes can be contaminated by network effects — a viral piece of content promoted by the treatment system may generate social sharing that brings organic traffic into the control bucket, inflating the control's metrics. Long-term experiments are particularly sensitive to this: an engagement metric improvement observed in week 1 may reverse in week 8 as users adjust their behaviour to the new recommendation pattern.
Interleaving evaluations — a technique developed for search-engine evaluation and popularised for recommendations by Netflix — address short-term contamination by presenting users with a single combined ranked list where items from both algorithms are interleaved, then comparing which algorithm's items receive more user engagement. Because both algorithms' items are presented to the same users in the same session, interleaving is dramatically more statistically efficient than separate A/B buckets and is free from cross-bucket contamination. Counterfactual evaluation methods use importance sampling to estimate how a new recommendation policy would have performed on historical traffic collected under a different policy, enabling offline evaluation that is better aligned with online performance than static metrics like NDCG.
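A toy inverse-propensity-scoring (IPS) estimator illustrates the counterfactual idea on synthetic logs. Everything here is simulated; production estimators add refinements such as self-normalisation and doubly-robust corrections:

```python
import numpy as np

# Synthetic logged bandit data: which item was shown (under a uniform
# logging policy), its propensity, and the observed binary reward.
rng = np.random.default_rng(1)
n_items, n_logs = 5, 10_000
logged_actions = rng.integers(0, n_items, size=n_logs)
logging_propensity = np.full(n_logs, 1.0 / n_items)
# Item 3 is genuinely better: 30% reward rate vs 10% for everything else.
reward = (rng.random(n_logs) < np.where(logged_actions == 3, 0.30, 0.10)).astype(float)

def ips_estimate(target_probs, actions, propensity, rewards, clip=10.0):
    """IPS value estimate: mean of (pi_target(a) / pi_logging(a)) * r,
    with importance weights clipped to control variance."""
    weights = np.minimum(target_probs[actions] / propensity, clip)
    return float(np.mean(weights * rewards))

always_item_3 = np.zeros(n_items)
always_item_3[3] = 1.0   # target policy: always recommend item 3
est = ips_estimate(always_item_3, logged_actions, logging_propensity, reward)
print(round(est, 3))  # close to 0.30, the true value of the target policy
```

The estimator recovers the target policy's value from logs collected under a different policy — the core trick behind offline counterfactual evaluation.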
Production Pattern
Feature Store Architecture for Real-Time Recommendations
A feature store is the infrastructure backbone of a production recommendation system. Pre-computed user features (embedding vector, recent interaction counts, demographic segment) and item features (embedding vector, category, quality score, freshness) are written to a low-latency key-value store (typically Redis or DynamoDB) by an offline batch pipeline that runs every 1–24 hours. Real-time features — current session activity, live item popularity, time-of-day context — are computed and written by a streaming pipeline (Kafka + Flink or Spark Streaming) with sub-second latency. At serving time, the ranking model reads features for all candidates from the feature store in a single batch lookup, typically achieving <5ms total feature retrieval time for 500 candidates. The separation of feature computation from serving logic means the same features are available for both training (offline materialization) and serving (online retrieval), preventing train-serve skew — one of the most common causes of unexplained performance degradation in production ML systems.
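The pattern can be sketched in a few lines. This toy version uses an in-memory dict in place of Redis/DynamoDB, and the function names (`compute_user_features`, `materialize`, `serve_features`) are invented for illustration:

```python
online_store = {}   # stand-in for a low-latency KV store such as Redis

def compute_user_features(events):
    """Single source of truth for the feature logic, shared by the
    offline (training materialisation) and online (serving) paths."""
    return {"n_interactions": len(events),
            "last_seen_days": min((e["days_ago"] for e in events), default=999)}

def materialize(user_id, events):
    """Offline batch job: compute features and write them to the online store."""
    online_store[f"user:{user_id}"] = compute_user_features(events)

def serve_features(user_ids):
    """Online path: one batched lookup for all requested users."""
    return {u: online_store.get(f"user:{u}", {}) for u in user_ids}

materialize(42, [{"days_ago": 1}, {"days_ago": 7}])
print(serve_features([42]))  # {42: {'n_interactions': 2, 'last_seen_days': 1}}
```

Because training data is materialised by the same `compute_user_features` function that serving reads, offline and online feature values cannot silently diverge — the essence of skew prevention.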
Responsible Recommendation & Diversity Engineering
Recommender systems are not neutral information retrieval tools — they are opinion-shaping systems that determine what millions of people read, watch, buy, and believe. Designing them responsibly requires moving beyond engagement metrics to measure and optimise for user wellbeing, catalogue fairness, and information diversity.
Diversity-Aware Re-ranking
A recommendation list should not simply contain the top-k highest-relevance items if those items are all nearly identical. Maximal Marginal Relevance (MMR) is the standard algorithm for trading off relevance and diversity: given a set of already-selected items S, the next item to select is argmax_{i not in S} [λ · relevance(i) − (1−λ) · max_{j in S} sim(i, j)], where λ controls the relevance-diversity tradeoff. At λ=1, MMR reduces to pure relevance ranking. At λ=0, it selects the item least similar to any already-selected item. In practice, λ ∈ [0.7, 0.9] produces lists that are more diverse than pure relevance ranking while retaining high aggregate relevance.
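The MMR selection loop is straightforward to implement. The toy data below (three near-duplicate thrillers and one comedy, with invented similarity scores) shows the diversity effect:

```python
def mmr_rerank(candidates, relevance, sim, k=5, lam=0.8):
    """Maximal Marginal Relevance: greedily select the item maximising
    lam * relevance(i) - (1 - lam) * max_{j in selected} sim(i, j)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(i):
            max_sim = max((sim.get((i, j), sim.get((j, i), 0.0))
                           for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * max_sim
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy data: three near-duplicate thrillers and one comedy (scores invented).
relevance = {"thriller_1": 0.90, "thriller_2": 0.88,
             "thriller_3": 0.86, "comedy_1": 0.70}
sim = {("thriller_1", "thriller_2"): 0.95, ("thriller_1", "thriller_3"): 0.95,
       ("thriller_2", "thriller_3"): 0.95,
       ("thriller_1", "comedy_1"): 0.10, ("thriller_2", "comedy_1"): 0.10,
       ("thriller_3", "comedy_1"): 0.10}
print(mmr_rerank(list(relevance), relevance, sim, k=3, lam=0.7))
# ['thriller_1', 'comedy_1', 'thriller_2']
```

Pure relevance ranking would return the three thrillers; at λ=0.7, the similarity penalty promotes the comedy to second position while the top pick is unchanged.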
Topic diversity — ensuring recommendations cover multiple distinct topics or content categories — is implemented through constrained ranking: for a 10-item list, require at most 3 items from the same top-level category. Temporal diversity — avoiding the recommendation of similar items the user recently consumed — is implemented through a recency penalty: items from categories or creators the user interacted with in the last 7 days receive a lower recommendation score unless the user's historical pattern indicates repeat consumption preference (e.g., a daily news reader).
Fairness in Recommender Systems
Recommendation fairness has two sides: consumer-side fairness (the user receives equitable quality of recommendations regardless of demographic group) and provider-side fairness (content creators and producers receive equitable exposure opportunities regardless of attributes like gender, race, or production scale). These two fairness objectives can conflict: optimising for consumer-side relevance may concentrate exposure on a small number of highly popular providers, disadvantaging smaller or minority creators regardless of their content quality.
Measuring fairness in recommendation requires defining the relevant subgroups and fairness metric. Demographic parity — equal recommendation rates across groups — is easy to compute but may conflict with relevance if user preferences differ across groups. Equal opportunity — equal recall for positive items across groups — requires knowing which items are "positive" for users in each group, which is often unavailable. Calibrated fairness — ensuring the distribution of recommended items mirrors the distribution of items the user would enjoy, as estimated by a calibration model — is the most operationally practical approach. Pinterest, Airbnb, and LinkedIn have published their fairness evaluation frameworks, providing public benchmarks for the industry.
Key Insight: Popularity bias is simultaneously the biggest accuracy bug and the biggest fairness bug in recommender systems. A system that over-recommends popular items produces a homogenised user experience (accuracy problem), deprives niche content creators of fair exposure (provider-side fairness problem), and prevents users from discovering content they would genuinely enjoy but have never encountered (consumer-side serendipity problem). Addressing popularity bias is therefore aligned across accuracy, fairness, and business diversity objectives — a rare case where the ethical intervention also improves system performance.
LLMs in Recommender Systems
Large language models have entered the recommendation landscape in two distinct roles. The first is as feature encoders: using a pre-trained language model (BERT, Sentence-BERT, or a domain-specific encoder) to produce dense semantic representations of item descriptions, reviews, and user-generated content, which replace or supplement traditional ID embeddings in both retrieval and ranking models. This approach substantially improves the handling of new items with no interaction history, because a well-described product can immediately be placed in a meaningful position in the embedding space. The second role is as a generative recommendation interface: a conversational agent that elicits user preferences through natural language dialogue, generates ranked recommendation lists, and provides natural language explanations for its suggestions.
P5 (the "Pretrain, Personalized Prompt & Predict Paradigm", Geng et al., RecSys 2022) demonstrated that multiple recommendation tasks — rating prediction, sequential recommendation, explanation generation — can be unified into a single text-to-text framework by framing them as natural language generation tasks. GPT-4 and Claude can produce surprisingly coherent cold-start recommendations from purely text-based descriptions of user preferences, without any access to interaction history. However, LLM-based recommenders face fundamental limitations: they lack real-time knowledge of what a platform currently offers, they hallucinate item details and titles, and they cannot personalise to the fine-grained preference signals embedded in billions of interaction events. Production deployments typically use LLMs for re-ranking with natural language explanations and conversational preference elicitation, while retaining purpose-built retrieval and ranking models for the core recommendation pipeline.
Key Insight: The right mental model for LLMs in recommendation is as a reasoning and dialogue layer on top of a purpose-built ML system, not as a replacement for it. LLMs excel at explaining why an item is recommended, refining preferences through natural dialogue, and handling highly novel queries outside the training distribution. Specialised two-tower and sequential models excel at personalised retrieval at scale with sub-millisecond latency. Hybrid architectures that combine both are the emerging production pattern.
Production Monitoring & Drift Detection
Recommender systems degrade in production through several failure modes, all of which require active monitoring to detect. Distribution shift occurs when the statistical properties of user behaviour change — a seasonal shopping pattern, a viral trend, a major world event that dominates user attention. Feature drift occurs when item catalogue properties change: new categories are added, item metadata schema evolves, or content quality shifts. Training-serving skew is the most insidious failure mode: the feature computation in the offline training pipeline and the online serving pipeline diverge due to code inconsistencies, producing training examples the model was never exposed to at inference time.
Standard monitoring metrics for production recommendation systems include: click-through rate (CTR) as a proxy for top-1 relevance, average session length and return rate as proxies for long-term satisfaction, catalogue coverage (what fraction of the catalogue is being recommended to at least one user per week) as a health metric for long-tail content, and proportion of new items reaching threshold recommendation volume within 24 hours of upload as a cold-start pipeline health metric. Statistical tests — Population Stability Index (PSI) for feature drift, Jensen-Shannon divergence between current and baseline score distributions — are computed daily and alert on threshold breaches. Model performance shadow evaluation, where a new candidate model's rankings are compared against the live model on a shadow traffic stream without affecting users, enables safe pre-deployment validation.
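PSI is simple to compute: bin the baseline (training-time) distribution, then compare a current sample's bin proportions against it. The alert thresholds in the docstring are a common industry rule of thumb, not a formal standard:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current one.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant shift worth alerting on."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)   # floor to avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)   # training-time score distribution
drifted = rng.normal(0.8, 1.0, 50_000)    # shifted serving-time distribution
print(round(psi(baseline, baseline[:25_000]), 3))  # near 0: stable
print(round(psi(baseline, drifted), 3))            # > 0.25: drift alarm
```

The same function applies to individual feature distributions, making it a cheap daily check across the whole feature set.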
Production Pattern
Netflix Prize to Production: How the Winning Algorithm Was Never Deployed
The Netflix Prize, concluded in 2009, awarded $1 million to a team whose ensemble algorithm beat Netflix's Cinematch system by more than 10% RMSE on the held-out test set. The winning solution combined hundreds of models, including matrix factorisation variants, neighbourhood methods, and restricted Boltzmann machines, requiring weeks to retrain. Netflix ultimately did not deploy the winning algorithm. The prize's offline evaluation metric — root mean squared error on explicit ratings — turned out to be a poor proxy for what Netflix actually optimised for in production: subscriber retention and time spent in session, driven by recommendations in a streaming context where users rarely provide explicit ratings. The winning algorithm, optimised to predict withheld explicit ratings from the prize dataset (collected between 1998 and 2005), was not significantly better at retaining subscribers in 2009's streaming-oriented, implicit-feedback landscape. This episode is the canonical illustration that offline metric optimisation and online business value can be deeply decoupled — and that deployment infrastructure, latency constraints, and retraining cadence are as important as model accuracy in production system design.
Netflix Prize
Offline/Online Metric Gap
Production Considerations
Ethics, Filter Bubbles & Regulatory Considerations
Recommender systems that optimise for engagement metrics can create filter bubbles — closed information environments where users are predominantly exposed to content aligned with their existing preferences and beliefs, with minimal exposure to contrary or diverse perspectives. The psychological and societal consequences of recommendation-driven filter bubbles are actively debated: the empirical evidence suggests their effect on political polarisation is smaller than commonly claimed in popular accounts, but the effect on information diets — what news people see, what products they discover, what cultural content they consume — is substantial and measurable. The EU Digital Services Act (DSA), fully applicable since 2024, imposes transparency obligations on large platforms: very large platforms must offer at least one recommender option that is not based on profiling, and must publish reports explaining the main parameters of their recommender systems. The US Algorithmic Accountability Act, if enacted, would require impact assessments for automated decision systems including recommendation engines. These regulatory developments are accelerating the deployment of diversity-aware re-ranking, explanation interfaces, and user control mechanisms that were previously treated as optional features.
Scaling & Serving Recommender Systems
Serving recommendations at billion-user scale requires engineering solutions across the full software stack. Candidate retrieval must complete in under 10 milliseconds for tens of millions of items; ranking must score 500–1,000 candidates in under 20 milliseconds with models that consume hundreds of features per candidate; and the entire end-to-end latency from user request to recommendations displayed must be under 100–200 milliseconds to avoid detectable lag in the user interface.
Approximate Nearest Neighbour Indexing
Exact nearest neighbour search over a billion item embeddings would require comparing the query embedding to every item in the catalogue — computationally infeasible at serving time. Approximate nearest neighbour (ANN) algorithms trade a small accuracy loss for orders-of-magnitude speedup by organising the embedding space into index structures that prune the search space. FAISS (Facebook AI Similarity Search) implements three main index types: IVF (inverted file index) partitions the embedding space into clusters using k-means and searches only the closest clusters; HNSW (Hierarchical Navigable Small World graphs) builds a multi-layer proximity graph that enables logarithmic-time approximate search; and PQ (Product Quantisation) compresses embeddings to reduce memory footprint. In production, HNSW is a common choice for CPU serving because of its strong recall-latency tradeoff, while GPU serving typically relies on IVFFlat or IVF+PQ indices (FAISS's GPU implementation supports IVF-based indices but not HNSW). Google's ScaNN (Scalable Nearest Neighbours) and Spotify's Annoy are purpose-built ANN libraries that have outperformed FAISS on specific workloads. Managed vector search services — Google Vertex AI Matching Engine, Amazon OpenSearch Service's k-NN index, and Pinecone — abstract the infrastructure concerns and provide automatic index rebuilding as the item catalogue grows.
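The IVF idea can be illustrated in plain NumPy without the FAISS API: cluster the catalogue embeddings with k-means, keep an inverted list per cluster, and at query time scan only the few closest clusters. This is a conceptual sketch of the index structure, not FAISS itself:

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(5_000, 16)).astype(np.float32)   # toy item embeddings

def kmeans(x, k=32, iters=10):
    """Plain k-means: learns the coarse quantiser (cluster centroids)."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = x[assign == c].mean(axis=0)
    # final assignment against the updated centroids
    assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    return centroids, assign

centroids, assign = kmeans(items)
inverted_lists = {c: np.where(assign == c)[0] for c in range(len(centroids))}

def ivf_search(query, n_probe=4, top_k=5):
    """Scan only the n_probe closest clusters instead of the full catalogue."""
    probe = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.concatenate([inverted_lists[c] for c in probe])
    dists = ((items[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:top_k]]

# A query near item 123 retrieves it while scanning only 4 of 32 clusters.
query = items[123] + 0.01 * rng.normal(size=16).astype(np.float32)
print(ivf_search(query))  # item 123 comes back first
```

The `n_probe` parameter is the recall-latency knob: more probed clusters means higher recall and higher cost, exactly the tradeoff tuned in production IVF indices.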
Model Serving Infrastructure
Recommendation models are served as microservices behind a load balancer. Multiple stateless model server replicas (TensorFlow Serving, TorchServe, or Triton Inference Server) handle concurrent requests in parallel, with horizontal autoscaling triggered by CPU/GPU utilisation metrics. Batching — processing multiple user requests in a single model inference call — dramatically improves throughput at the cost of slightly higher per-request latency: a typical configuration batches requests within a 1–5ms window, improving GPU utilisation from ~10% for single-request serving to ~80% for batched serving. Request routing separates retrieval (vector database queries, parallelisable across multiple ANN index shards) from ranking (single model inference with all candidate features, serialisable per request) — these two stages have fundamentally different latency and compute profiles and are typically served by separate microservices with separate scaling policies.
Model versioning and rolling deployment for recommendation models follows the same principles as supervised models with one additional complexity: the retrieval index and the ranking model must be updated atomically if they share a joint training objective. If the item embeddings in the ANN index are updated before the ranking model is updated, the ranking model may receive item representations it was not trained with, producing degraded scores. Coordinated deployment pipelines that update the index and the ranking model in a single atomic operation — or that use a compatibility layer that keeps both the old and new embedding spaces in memory during the transition — are standard practice at large recommendation platforms.
Exercises
These exercises progress from basic similarity computations to implementing and comparing full recommendation systems. Use the MovieLens 100K dataset (freely available from GroupLens) for exercises 1–4.
Exercise 1
Beginner
Item-Item Similarity
Using the MovieLens 100K dataset, build the item-item cosine similarity matrix from the user-item rating matrix. For the movie "Toy Story" (movieId=1), list the 5 most similar movies by cosine similarity, and print their titles and similarity scores.
Success criterion: Top-5 similar movies are all clearly related to Toy Story (animated films, Pixar films, or family adventure films), with cosine similarity values between 0 and 1.
Exercise 2
Intermediate
User-Based CF with Evaluation
Implement user-based collaborative filtering using leave-one-out cross-validation: for each user, hold out their most recent rating, use the remaining ratings to find k=20 nearest neighbours, and recommend the top 10 items. Compute Precision@10 and Recall@10 averaged across all users.
Success criterion: Report mean Precision@10 and Recall@10. Compare your results against a popularity baseline (always recommend the 10 most-rated movies); CF should outperform the popularity baseline on both metrics.
Exercise 3
Intermediate
Matrix Factorisation vs Content-Based
Compare matrix factorisation (TruncatedSVD with k=50 latent factors) against a content-based approach (TF-IDF on movie genres and tags from the MovieLens dataset). For each approach: (1) evaluate NDCG@10 using temporal splits (train on ratings before 2000, test on ratings after 2000), (2) identify 3 "new users" with no pre-2000 ratings and check whether each approach can generate recommendations for them.
Success criterion: Document the NDCG@10 gap between the two approaches and explain which handles new users better and why.
Exercise 4
Advanced
Two-Tower Neural Recommender
Implement the two-tower architecture from the code example above for MovieLens. Train with in-batch negatives (batch size 256). Evaluate NDCG@10 on a held-out temporal test set. Compare against the TruncatedSVD matrix factorisation baseline from Exercise 3. Document your training setup (embedding dim, hidden dim, epochs, learning rate) and report final metrics for both approaches.
Success criterion: Successfully train the two-tower model for at least 10 epochs without divergence. Report NDCG@10 comparison table showing matrix factorisation vs. two-tower results.
Conclusion & Next Steps
Recommender systems represent one of the most commercially mature applications of machine learning, deployed at billion-user scale with measurable impact on revenue, engagement, and — increasingly — on information environments and user wellbeing. The technical progression from memory-based collaborative filtering through matrix factorisation to deep two-tower retrieval and sequential transformers reflects a consistent pattern: each generation leveraged increased data and compute to learn richer representations, reducing the hand-engineering burden while improving prediction quality. The problems that remain unsolved — cold start, popularity bias, filter bubbles, and the offline-online metric gap — are not primarily algorithmic: they require careful system design, principled evaluation methodology, and alignment between business incentives and user outcomes.
The next part of this series shifts to reinforcement learning, which provides the formal framework for the sequential decision-making problems that sit beneath recommender systems, autonomous agents, and the alignment of large language models. The exploration vs. exploitation tradeoff introduced here in the recommendation context is, at its core, an RL problem — and understanding RL's formalisms will deepen your intuition for why production recommendation systems are designed the way they are.
Next in the Series
In Part 6: Reinforcement Learning Applications, we move from recommendation into the formal framework of sequential decision-making — covering Q-learning, policy gradients, PPO, and the RLHF pipeline that transformed large language models from text predictors into genuinely useful AI assistants.
Continue This Series
Part 4: Computer Vision in the Real World
CNNs, Vision Transformers, object detection, and segmentation — the visual AI techniques that power product image search and visual recommendation.
Read Article
Part 6: Reinforcement Learning Applications
Q-learning, policy gradients, RLHF, and the RL techniques that underpin exploration strategies in production recommendation systems.
Read Article
Part 13: AI Agents & Agentic Workflows
Tool use, planning, memory, and multi-agent orchestration — advanced autonomous systems that extend recommendation into proactive personalised assistance.
Read Article