MLOps & Model Deployment
CI/CD for ML, feature stores, monitoring, drift detection
AI in the Wild
Part 20 of 24
About This Article
MLOps — the discipline of applying DevOps principles to machine learning — is what separates research projects from production systems. This article covers the complete ML production lifecycle: experiment tracking and reproducibility, feature stores, model registries, serving infrastructure, CI/CD pipelines, and the monitoring practices needed to keep models performing well as the world changes around them.
The MLOps Problem
In 2015, Google engineers Sculley et al. published "Hidden Technical Debt in Machine Learning Systems," arguably the most influential paper in applied ML engineering. Their central argument: the ML code that trains and serves a model typically represents only a small fraction of the total system complexity. The surrounding infrastructure — data pipelines, feature engineering, monitoring, serving, configuration management, and the feedback loops between components — creates technical debt that compounds over time and is far harder to pay off than the debt accumulated in traditional software systems. A model that performs excellently in offline evaluation can silently degrade in production for weeks before anyone notices, because the inputs to the model have drifted, the world it was trained to model has changed, or the serving environment has diverged from the training environment.
MLOps is the discipline that addresses this debt systematically. It applies the principles of DevOps — automation, reproducibility, continuous integration and delivery, monitoring, and collaboration between development and operations — to the ML lifecycle. The goal is not merely to deploy a model but to maintain a continuous pipeline that can retrain, evaluate, and deploy new model versions automatically when performance degrades, while keeping detailed records of every experiment, every dataset version, and every deployment decision.
Key Insight: The transition from "model that works in a notebook" to "model that works reliably in production" is not a deployment step — it is an engineering discipline that must be designed from the beginning. The earlier MLOps practices are adopted in a project, the lower the total cost of reaching and maintaining production-quality performance.
ML Technical Debt
The most common forms of ML technical debt in production systems are:
- Undeclared consumers — other systems that silently depend on the model's output schema, which breaks when the schema changes without announcement.
- Feedback loops — the model's outputs influencing future training data in ways that are not tracked or controlled.
- Data dependencies — upstream feature pipelines that evolve independently of the model, causing silent changes in the input distribution.
- Configuration debt — hyperparameters, preprocessing steps, and evaluation thresholds stored in ad hoc files rather than version-controlled configuration.
- Pipeline jungles — multiple overlapping data pipelines serving the same model, each maintained by a different team, with no authoritative source of truth for feature definitions.
MLOps Maturity Levels
Google's ML Engineering for Production guidelines define three maturity levels. Level 0 — the baseline — involves manual, script-driven processes: data scientists train models locally, hand off serialised model files to engineers, and monitoring is ad hoc. Most organisations start here, and many stay here longer than they should. Level 1 introduces automated ML pipelines: data ingestion, training, evaluation, and deployment are orchestrated by a pipeline tool (Kubeflow Pipelines, Apache Airflow, or a cloud-managed equivalent), and the pipeline is triggered automatically on new data. Model and data versioning are enforced. Level 2 adds CI/CD for the pipeline itself: changes to pipeline code are tested automatically before deployment, the model registry tracks all candidate and production models, and automated testing gates prevent performance regressions from reaching production.
Experiment Tracking & Model Registry
Reproducibility is the foundation of trustworthy ML. Without a systematic record of which code, which data version, which hyperparameters, and which environment produced each model, debugging degraded performance in production is guesswork. Experiment tracking tools capture this information automatically during training, creating a searchable database of all past experiments that enables scientific comparison, regression detection, and confident rollback.
MLflow Experiment Tracking
MLflow is the most widely adopted open-source experiment tracking and model lifecycle management platform. It provides four components: Tracking records parameters, metrics, tags, and artifacts for each training run; Projects packages ML code into reproducible, shareable units; Models provides a standard format for packaging and deploying models across frameworks; and Model Registry maintains a versioned store of production-grade models with lifecycle stage management. The following code demonstrates best-practice MLflow usage for a gradient boosting classifier:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score
import pandas as pd

# MLflow: reproducible experiment tracking
mlflow.set_experiment("churn-prediction-v2")

with mlflow.start_run(run_name="gbm-tuning-round-3"):
    # Log parameters
    params = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.05, "subsample": 0.8}
    mlflow.log_params(params)

    # Train model (X_train, y_train, X_val, y_val prepared upstream)
    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Log metrics
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    val_f1 = f1_score(y_val, model.predict(X_val))
    mlflow.log_metrics({"train_auc": train_auc, "val_auc": val_auc, "val_f1": val_f1})

    # Log model artifact and register it in the Model Registry
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="ChurnPredictor")

    # Log any file artifacts
    mlflow.log_artifact("feature_importance.png")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
    print(f"Val AUC: {val_auc:.4f} (vs baseline: 0.7823)")

# Compare runs programmatically
runs = mlflow.search_runs(experiment_names=["churn-prediction-v2"],
                          order_by=["metrics.val_auc DESC"])
print(runs[["run_id", "params.n_estimators", "metrics.val_auc"]].head())
Model Registry & Versioning
The model registry is the gatekeeper between experimentation and production. Every model that passes evaluation gates is registered with a version number, a link to the training run that produced it, and a lifecycle stage. The stages typically follow a pipeline: None (registered but not evaluated), Staging (passed evaluation gates, under A/B test or shadow mode), Production (serving live traffic), and Archived (superseded by a newer version). This lifecycle ensures that any production model can be rolled back to a previous version in seconds by simply changing the stage label, without re-deploying any code — the serving layer always loads the model tagged "Production" from the registry at startup.
Production Warning: Loading a model from the registry at every prediction request is catastrophically slow. Always load the model once at service startup and cache it in memory. The correct pattern is: load at startup using the "Production" alias, expose a /model/reload admin endpoint that refreshes the cached model, and use a health check endpoint that reports the loaded model version and its registry metadata. Never load from registry per-request.
Feature Stores
A feature store is a centralised system for creating, storing, sharing, and serving ML features — the pre-computed, transformed inputs that models receive. Without a feature store, feature engineering code is duplicated across training pipelines and serving systems, leading to subtle discrepancies in how features are computed between offline training and online serving — the training-serving skew that is one of the most common root causes of production model degradation.
Design Principles
A production feature store has two serving layers. The offline store (typically a data warehouse like BigQuery, Snowflake, or Redshift backed by a columnar storage format like Parquet) provides historical feature values for training and batch scoring. The online store (typically a low-latency key-value store like Redis, DynamoDB, or Cassandra) provides real-time feature lookups for online serving — feature values are precomputed and written to the online store so that model inference can retrieve them in single-digit milliseconds. Feast (open source), Tecton, and Hopsworks are the leading feature store platforms; all major cloud providers now offer managed equivalents.
Training-Serving Skew
Training-serving skew arises when the feature values seen at training time differ from those seen at inference time, even when the raw data has not changed. Common causes include: computing features differently in SQL (offline) versus Python (online); time zone inconsistencies in timestamp arithmetic; different null-handling conventions between the data warehouse and the serving cache; and feature transformation logic captured manually during experimentation but not included in the pipeline. The canonical defence is a single, version-controlled feature transformation function that is called both during dataset generation and during online serving — the feature store enforces this by requiring feature definitions to be registered before use and serving them from the same definition at both stages.
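The single-definition principle can be made concrete with a small sketch. The function name and UTC convention below are illustrative, not taken from any particular feature store API:

```python
from datetime import datetime, timezone

def days_since_last_purchase(last_purchase: datetime, now: datetime) -> int:
    """Single version-controlled feature definition, called by BOTH the
    offline dataset builder and the online serving path. All arithmetic
    is done on UTC dates to avoid timezone-dependent day boundaries."""
    last_utc = last_purchase.astimezone(timezone.utc)
    now_utc = now.astimezone(timezone.utc)
    return (now_utc.date() - last_utc.date()).days

# Offline: applied row-by-row during training-set generation.
# Online: applied to the same raw field fetched from the online store.
now = datetime(2024, 6, 15, 12, 0, tzinfo=timezone.utc)
last = datetime(2024, 5, 1, 23, 30, tzinfo=timezone.utc)
print(days_since_last_purchase(last, now))  # 45
```

Because both paths call the same function, any change to the day-boundary convention is a single reviewed commit rather than two independent edits.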
Anti-Pattern
The Feature Duplication Problem
A data science team builds a churn prediction model. The training pipeline computes "days_since_last_purchase" using pandas in UTC. The serving engineer implements the same feature via a SQL DATEDIFF query against the production database using the server's local timezone. For customers in UTC-8, transactions near midnight are credited to the previous day in one computation but not the other. The model, trained to flag 30+ days of inactivity as churn risk, now receives inconsistent feature values and produces systematically wrong predictions for a subset of customers — a bug invisible until a customer-facing analyst notices anomalous churn scores for a specific cohort. A feature store with a single registered feature definition would have prevented this entirely.
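The discrepancy is easy to reproduce. This hypothetical snippet computes the same feature both ways, mirroring the two teams' implementations:

```python
from datetime import datetime, timedelta, timezone

# A purchase just after midnight UTC on May 2nd...
purchase_utc = datetime(2024, 5, 2, 1, 30, tzinfo=timezone.utc)
scoring_time = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)

# Offline path (pandas in UTC): day difference on UTC calendar dates
offline_days = (scoring_time.date() - purchase_utc.date()).days

# Online path (SQL on a UTC-8 server): same instants, local calendar dates
local = timezone(timedelta(hours=-8))
online_days = (scoring_time.astimezone(local).date()
               - purchase_utc.astimezone(local).date()).days

print(offline_days, online_days)  # 30 31 — same customer, two feature values
```

With a "30+ days inactive" churn rule, one path flags this customer and the other does not, which is exactly the cohort-level inconsistency described above.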
Model Serving Infrastructure
Model serving is the engineering discipline of exposing trained models as reliable, low-latency, scalable API services. The serving layer must balance three competing demands: latency (predictions must arrive within the user experience budget — typically under 100ms for interactive applications); throughput (the service must handle peak traffic without degradation); and reliability (the service must be available even when individual components fail, and must degrade gracefully rather than catastrophically). Most serving implementations today use a combination of a model server behind a load balancer, with horizontal autoscaling triggered by request queue depth or CPU utilisation.
FastAPI Model Server
FastAPI has become the de facto standard for Python-based model serving. Its async request handling, automatic input validation via Pydantic, and OpenAPI documentation generation make it well-suited for ML APIs. The critical implementation discipline is loading the model once at service startup, not per request:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import mlflow.sklearn
import numpy as np
from typing import List
import time

app = FastAPI(title="Churn Prediction API", version="2.1.0")

# Load model at startup (not per-request!)
MODEL_URI = "models:/ChurnPredictor/Production"  # MLflow Model Registry
model = mlflow.sklearn.load_model(MODEL_URI)

class PredictionRequest(BaseModel):
    customer_id: str
    features: List[float]  # must match training feature order

class PredictionResponse(BaseModel):
    customer_id: str
    churn_probability: float
    prediction: str  # "churn" | "retain"
    model_version: str
    latency_ms: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    start = time.time()
    if len(request.features) != 15:  # validate feature count
        raise HTTPException(status_code=422, detail="Expected 15 features")
    features = np.array(request.features).reshape(1, -1)
    churn_prob = float(model.predict_proba(features)[0, 1])
    return PredictionResponse(
        customer_id=request.customer_id,
        churn_probability=round(churn_prob, 4),
        prediction="churn" if churn_prob > 0.5 else "retain",
        model_version="2.1.0",
        latency_ms=round((time.time() - start) * 1000, 2),
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model": MODEL_URI}
Serving Patterns
Production serving requires more than a single endpoint. Shadow mode deployment runs the new model in parallel with the production model, logging its predictions without serving them to users. Canary deployment routes a small percentage of traffic (e.g., 5%) to the new model and gradually increases the share as confidence grows. A/B testing randomly assigns users to model versions and compares business metrics across groups. Blue-green deployment maintains two identical serving environments and switches the load balancer instantaneously — enabling zero-downtime deployment and instant rollback. The choice of pattern depends on the cost of a bad prediction and the availability of labels for rapid evaluation.
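A common way to implement the canary split is deterministic hashing of a stable request key, so each user is pinned to one model version across requests rather than re-randomised per call. A minimal sketch (the 5% default and names are illustrative):

```python
import hashlib

def route_model(user_id: str, canary_pct: float = 5.0) -> str:
    """Deterministically assign a user to 'canary' or 'production'.
    Hashing the user id keeps assignment stable across requests,
    which keeps per-user behaviour consistent during the rollout."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_pct * 100 else "production"

# Sanity check: the observed split tracks the configured percentage
assignments = [route_model(f"user-{i}") for i in range(10_000)]
share = assignments.count("canary") / len(assignments)
print(f"canary share: {share:.1%}")  # close to 5%
```

Increasing `canary_pct` over time implements the gradual ramp described above; setting it to 0 is an instant rollback for canary traffic.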
CI/CD for Machine Learning
Continuous integration and delivery for ML extends the software CI/CD paradigm with ML-specific stages: data validation (verifying that new training data meets quality and distribution expectations), model evaluation (verifying that the candidate model meets performance thresholds), and model comparison (verifying that the candidate model outperforms or at minimum does not regress from the current production model). Each stage is a gate: a failed gate aborts the pipeline and triggers notifications to the responsible team.
Pipeline Stages
A mature ML CI/CD pipeline typically consists of five stages. Data validation uses Great Expectations, TFDV, or a custom suite to verify that new data satisfies invariants — expected null rates, value ranges, cardinality constraints, and distribution similarity to the reference dataset. Feature pipeline executes the feature engineering transformations registered in the feature store against the new data. Training executes the training script with the registered hyperparameter configuration, logging all parameters and metrics to the experiment tracker. Evaluation gate compares the candidate model against a predefined threshold (e.g., minimum AUC of 0.82) and against the current production model — if the candidate fails either gate, the pipeline fails. Deployment updates the model registry stage to "Production" and triggers a rolling restart of the serving fleet.
GitHub Actions Implementation
# .github/workflows/ml-pipeline.yml
# Automated ML pipeline: data validation → training → evaluation → deployment
name: ML Training Pipeline

on:
  push:
    branches: [main]
    paths: ['src/**', 'data/**', 'configs/**']

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Great Expectations data validation
        run: |
          pip install great_expectations
          great_expectations checkpoint run churn_data_checkpoint
          # Fails if data quality < 95% non-null, value ranges out of bounds, etc.

  train-and-evaluate:
    needs: validate-data
    runs-on: [self-hosted, gpu]  # GPU runner for training
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python src/train.py --config configs/gbm_v2.yaml
      - name: Evaluate model
        run: |
          python src/evaluate.py --metric val_auc --threshold 0.82
          # Fails CI if AUC < 0.82 (prevents regression)
      - name: Register model if better than production
        run: python src/register_model.py --compare-with production
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}

  deploy-to-staging:
    needs: train-and-evaluate
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: |
          kubectl set image deployment/churn-api churn-api=acr.io/churn:$GITHUB_SHA
          kubectl rollout status deployment/churn-api --timeout=5m
The key design principle is that each stage is independently testable and produces a versioned artifact. The data validation stage produces a validation report stored as a CI artifact. The training stage produces a registered model version in MLflow. The evaluation stage produces a metrics comparison report. The deployment stage produces a deployment record. Any stage can be re-run in isolation if it fails for infrastructure reasons, without re-running the expensive training stage from scratch.
Monitoring & Drift Detection
A model that performs well at launch can degrade silently over days, weeks, or months as the world it was trained to model changes. Monitoring in production ML means tracking not just infrastructure metrics (latency, throughput, error rates) but model health metrics — the statistical properties of model inputs, outputs, and (when available) outcomes. Without systematic monitoring, degraded performance may only surface through downstream business metrics or customer complaints, long after the root cause has become difficult to diagnose.
Drift Types
| Drift Type | Definition | Detection Method | Trigger for Retraining | Example |
| --- | --- | --- | --- | --- |
| Data Drift (Covariate Shift) | Distribution of input features P(X) changes, but P(Y\|X) stays the same | PSI (Population Stability Index), KS test, Jensen-Shannon divergence per feature | PSI > 0.25 on any key feature, or average PSI > 0.10 across feature set | Pandemic causes customer spending patterns to shift; new user acquisition channel brings different demographics |
| Concept Drift | Relationship between features and target P(Y\|X) changes | Monitored via degradation in accuracy metrics once labels available; use ADWIN or DDM on rolling window performance | Rolling AUC drops below alert threshold; abrupt change detected by DDM algorithm | Fraud patterns evolve after new fraud prevention measures; churn drivers change after competitor entry |
| Prediction Drift | Distribution of model output P(Ŷ) changes without labelled ground truth | Compare prediction score distribution today vs. reference period; alert on PSI of score histogram | Score PSI > 0.20; average predicted probability shifts by more than 3 percentage points | Model suddenly predicts high churn for almost everyone after a feature pipeline bug changes a key feature |
| Label Drift | Distribution of actual outcomes P(Y) changes | Monitor actual positive rate in labelled production data; chi-square test vs. reference period | Actual positive rate deviates > 20% from expected; expected calibration error exceeds threshold | Actual churn rate rises seasonally; economic downturn increases default rate above model's prior |
| Feature Schema Drift | Feature names, types, or cardinality change at the source | Schema validation at feature ingestion; alert on any schema change relative to registered feature definition | Any schema mismatch — fail serving immediately, fall back to previous model version | Upstream team renames a column; an enum feature gains a new category; a numeric feature becomes nullable |
Detection Methods
The Population Stability Index (PSI) is the most widely used drift metric in industry. PSI computes the sum of divergence terms between the reference distribution (training data) and the production distribution (recent serving data) bucketed into deciles. PSI below 0.10 indicates no significant drift; PSI between 0.10 and 0.25 indicates moderate drift requiring investigation; PSI above 0.25 indicates significant drift requiring model review or retraining. The Kolmogorov-Smirnov test and Jensen-Shannon divergence are statistically rigorous alternatives but less intuitive to communicate to business stakeholders. Evidently AI and Deepchecks are the leading open-source libraries for production drift monitoring; both integrate with standard MLOps stacks and can generate HTML reports, Slack alerts, and metric exports to Prometheus/Grafana dashboards.
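A minimal PSI implementation, assuming deciles derived from the training (reference) sample as described above; Evidently and Deepchecks ship production-grade equivalents of this calculation:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, buckets: int = 10) -> float:
    """Population Stability Index: bucket by reference-distribution deciles,
    then sum (actual% - expected%) * ln(actual% / expected%) over buckets."""
    edges = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full real line
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor avoids log(0) / division by zero for empty buckets
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 50_000)
print(f"no drift:   {psi(ref, rng.normal(0, 1, 50_000)):.3f}")    # typically well under 0.10
print(f"mean shift: {psi(ref, rng.normal(1.0, 1, 50_000)):.3f}")  # well above 0.25
```

In production the reference percentiles are computed once, at training time, and stored alongside the model, so every monitoring run compares against the same frozen buckets.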
Key Insight: Monitoring must cover both the model and the fairness properties documented in Part 19. A model can maintain overall AUC while its disparate impact ratio for a protected demographic group degrades significantly — because the data drift affects that group disproportionately. Add demographic slice metrics to the monitoring dashboard alongside aggregate performance metrics, and set independent drift alert thresholds for each demographic slice.
MLOps Tools Comparison
The MLOps tooling landscape has matured rapidly. The choice of platform depends on team size, existing cloud provider relationships, requirement for open-source ownership, and the degree of custom pipeline complexity required.
| Tool | Provider | Tracking | Registry | Serving | Auto-ML | OSS | Best For |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MLflow | Databricks / Community | Excellent | Excellent | Good | No | Yes (Apache 2) | Teams wanting full control; Databricks-native environments; any framework |
| Weights & Biases | W&B Inc. | Excellent | Good | Basic | Yes (Sweeps) | Partial (free tier) | Research teams; rich visualisations; hyperparameter sweeps at scale |
| Kubeflow Pipelines | Google / Community | Basic | Basic | Via KFServing | Via Katib | Yes (Apache 2) | Kubernetes-native teams; complex multi-step pipelines; on-premise deployments |
| Amazon SageMaker | AWS | Good | Good | Excellent | Excellent (Autopilot) | No | AWS-native teams; fully managed end-to-end; enterprise scale; compliance requirements |
| Vertex AI | Google Cloud | Good | Good | Excellent | Excellent (Vertex AutoML) | No | GCP-native teams; BigQuery ML integration; TensorFlow / JAX workloads |
| Azure Machine Learning | Microsoft | Good | Good | Good | Good (AutoML) | Partial (SDK OSS) | Azure/Office 365 ecosystems; regulated industries (HIPAA, FedRAMP); MLflow integration |
Production Patterns: Scaling & Reliability
Moving a model from a single serving endpoint to a production-grade system serving millions of requests requires addressing a set of engineering challenges that go beyond the model itself: horizontal scaling, request batching, model versioning, circuit breakers, and graceful degradation. These patterns come from traditional distributed systems engineering but require ML-specific adaptations because model inference has different performance characteristics than typical API computation — inference is CPU/GPU-bound rather than I/O-bound, latency is highly variable depending on input sequence length (for transformers), and the memory footprint of loaded models can be several gigabytes.
Horizontal Scaling and Load Balancing
A single model serving instance can handle a limited number of concurrent requests, determined by the inference latency and the number of available CPU or GPU cores. For a model with 50ms average inference latency running on a 4-core CPU instance with one in-flight request per core, simple capacity arithmetic (Little's Law: throughput = concurrency / latency, here 4 / 0.05s) gives a theoretical maximum of approximately 80 requests per second per instance. Horizontal scaling — running multiple instances behind a load balancer — provides near-linear throughput scaling. For stateless model serving (no per-user session state, no per-request model loading), horizontal scaling is straightforward: any instance can handle any request, and the load balancer can distribute requests round-robin or least-connections. The critical operational requirement is that all instances serve the same model version simultaneously — version skew, where different instances serve different model versions, is a common source of subtle production bugs where prediction behaviour appears non-deterministic from the caller's perspective.
Request Batching for GPU Serving
GPU inference is most efficient when requests are batched: rather than processing one request at a time, the serving layer accumulates a small number of requests (typically 4-32) and processes them together in a single forward pass. This amortises the GPU kernel launch overhead and achieves much higher GPU utilisation. The trade-off is latency: a request that arrives when the batch is not yet full must wait until the batch is complete before receiving a response. This is typically managed with a maximum batch wait time (e.g., 5ms) — if the batch is not full within the wait time, the partial batch is processed immediately. NVIDIA Triton Inference Server and TorchServe both implement dynamic batching with configurable batch size and timeout parameters. For CPU serving, batching is generally less beneficial and may actually increase latency for variable-length inputs (e.g., transformer sequences).
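The accumulate-or-timeout policy can be sketched as a toy micro-batcher. This is a simplification of what Triton or TorchServe implement internally; a real server would run the `poll` loop on a background thread and hand each flushed batch to a single forward pass:

```python
import time
from collections import deque

class MicroBatcher:
    """Toy dynamic batcher: flush when the batch is full, or when the
    oldest queued request has waited longer than max_wait_ms."""

    def __init__(self, max_batch_size: int = 8, max_wait_ms: float = 5.0):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = deque()
        self.first_arrival = None  # monotonic time of oldest queued request

    def submit(self, request):
        """Queue a request; return a full batch if this filled it, else None."""
        if not self.queue:
            self.first_arrival = time.monotonic()
        self.queue.append(request)
        if len(self.queue) >= self.max_batch_size:
            return self.flush()
        return None

    def poll(self):
        """Timer-loop hook: flush a partial batch once max_wait_ms elapses."""
        if self.queue and (time.monotonic() - self.first_arrival) * 1000 >= self.max_wait_ms:
            return self.flush()
        return None

    def flush(self):
        batch = list(self.queue)
        self.queue.clear()
        return batch  # handed to one batched forward pass

b = MicroBatcher(max_batch_size=4)
assert all(b.submit(i) is None for i in range(3))  # batch not yet full
print(b.submit(3))  # [0, 1, 2, 3]
```

The two parameters encode the trade-off from the text: `max_batch_size` caps throughput gains, while `max_wait_ms` caps the latency penalty a lone request can pay.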
Circuit Breakers and Graceful Degradation
A model serving endpoint that is overloaded, experiencing a software bug, or serving from a corrupted model version must fail gracefully rather than returning garbage predictions or timing out indefinitely. The circuit breaker pattern from distributed systems engineering applies directly: when the error rate on a model endpoint exceeds a threshold (typically 5-10% over a 30-second sliding window), the circuit opens and subsequent requests are immediately returned a predefined fallback response — typically a default prediction, a "service unavailable" HTTP 503 response, or a rule-based fallback. After a configurable timeout (e.g., 60 seconds), the circuit enters "half-open" state and allows a small number of probe requests through to test recovery. If the probe requests succeed, the circuit closes and full traffic resumes. If they fail, the timeout resets. This pattern prevents a struggling model server from cascading its failures to downstream systems and gives the serving team time to diagnose and resolve the issue without user-facing impact.
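The open / half-open / closed lifecycle described above can be captured in a minimal single-threaded sketch. The thresholds mirror the text; a production breaker would additionally need locking, metrics, and a bounded call log:

```python
class CircuitBreaker:
    """Minimal circuit breaker: open on high error rate over a sliding
    window, allow a probe after a cooldown ('half-open'), close on a
    successful probe. Timestamps are injected to keep the sketch testable."""

    def __init__(self, max_error_rate=0.10, window=30.0, cooldown=60.0, min_calls=20):
        self.max_error_rate = max_error_rate
        self.window = window        # sliding window in seconds
        self.cooldown = cooldown    # seconds before a probe is allowed
        self.min_calls = min_calls  # don't trip on tiny samples
        self.calls = []             # (timestamp, ok) pairs
        self.opened_at = None       # None = closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            return True             # half-open: let a probe request through
        return False                # open: caller serves the fallback instead

    def record(self, ok: bool, now: float) -> None:
        self.calls.append((now, ok))
        self.calls = [(t, s) for t, s in self.calls if now - t <= self.window]
        if self.opened_at is not None and ok:
            self.opened_at = None   # successful probe closes the circuit
            return
        errors = sum(1 for _, s in self.calls if not s)
        if len(self.calls) >= self.min_calls and errors / len(self.calls) > self.max_error_rate:
            self.opened_at = now    # trip (or re-trip, resetting the timeout)

cb = CircuitBreaker(min_calls=10)
for t in range(10):
    cb.record(ok=(t < 5), now=float(t))  # 50% errors over 10 calls: trips
print(cb.allow(now=10.0))  # False — circuit open, serve fallback
print(cb.allow(now=75.0))  # True — cooldown elapsed, half-open probe
```

Note how a failed probe re-trips the breaker and resets the timeout, matching the recovery behaviour described above.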
Production Pattern
Multi-Tier Model Serving Architecture
A production-grade ML serving architecture for a high-traffic recommendation or classification system typically consists of three tiers:
- Tier 1 — Edge cache / CDN: Cache deterministic predictions (same user + same context = same output) at the CDN edge. Cache hit rates of 20-40% are common for recommendation systems, providing sub-millisecond responses for cached results and dramatically reducing model server load.
- Tier 2 — Application server + feature retrieval: Fetches user and item features from the online feature store (Redis/DynamoDB) and routes to the model server. Adds circuit breaker logic and fallback responses. Handles authentication, rate limiting, and request logging.
- Tier 3 — Model server cluster: Horizontally scaled FastAPI or Triton instances behind a load balancer. Auto-scales based on request queue depth and GPU utilisation. All instances serve the model version tagged "Production" in the MLflow registry. Rolling deployments for zero-downtime updates.
Observability across all three tiers is essential: distributed tracing (using OpenTelemetry + Jaeger) ensures that a slow prediction can be traced back to whether the bottleneck was feature retrieval, model inference, or serialisation. Logging at each tier boundary captures the latency attribution for each stage, enabling targeted optimisation.
Model Performance Budgeting
Production ML systems must budget latency the same way financial systems budget cost. The user experience budget (e.g., "the search results page must load in under 200ms") is decomposed into a latency budget for each component: 10ms for the load balancer, 20ms for feature retrieval from the online store, 50ms for model inference, 10ms for result serialisation, 30ms for network round-trip, leaving 80ms of margin. When the model inference takes 75ms instead of 50ms due to increased input complexity, the budget is violated, and one of two things must happen: the model must be optimised (quantization, pruning, ONNX conversion), or the budget for another component must be reallocated. This explicit budgeting discipline prevents the common failure mode where ML models are added to latency-sensitive serving paths without accounting for their inference cost until they cause user-visible regressions in production.
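The budgeting discipline lends itself to an automated check. The component names and numbers below are the illustrative ones from the text, not a standard schema:

```python
# Hypothetical per-component budgets for a 200 ms page-load target
BUDGET_MS = {
    "load_balancer": 10,
    "feature_retrieval": 20,
    "model_inference": 50,
    "serialisation": 10,
    "network_rtt": 30,
}
TARGET_MS = 200

def check_budget(measured_ms: dict) -> list:
    """Return the components whose measured latency exceeds their budget."""
    return [name for name, spent in measured_ms.items()
            if spent > BUDGET_MS[name]]

# Inference regresses from 50 ms to 75 ms: the check names the offender
spent = dict(BUDGET_MS, model_inference=75)
print(check_budget(spent))                                  # ['model_inference']
print(f"margin: {TARGET_MS - sum(BUDGET_MS.values())} ms")  # margin: 80 ms
```

Running a check like this in CI against p95 latencies from a load test turns the budget from a design document into an enforced gate.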
Exercises
Beginner
Exercise 1: MLflow Experiment Comparison
Set up MLflow locally with mlflow server --host 0.0.0.0 --port 5000. Train three versions of a classifier on the same dataset, varying hyperparameters. Log all parameters, training metrics, and validation metrics to MLflow.
- Use the MLflow UI (localhost:5000) to compare runs. Which configuration achieves the best validation AUC?
- How do the training vs. validation AUC curves indicate overfitting across configurations?
- Register the best run's model to the MLflow Model Registry and set its stage to "Production".
- Write a script that programmatically loads the "Production" model and makes predictions on a held-out test set.
Intermediate
Exercise 2: FastAPI Model Server with Load Testing
Build a FastAPI model serving endpoint using the model from Exercise 1. Include a /predict POST endpoint, a /health GET endpoint, and input validation using Pydantic.
- Add latency logging that records the inference time in milliseconds for every request.
- Add a request counter metric exposed at /metrics in Prometheus format.
- Test with 100 concurrent requests using locust or an httpx async client. What are the p50, p95, and p99 latencies?
- Identify the bottleneck: is it the model inference, the input validation, or the serialisation step?
Advanced
Exercise 3: Full MLOps Pipeline with Drift Detection
Design and implement a complete MLOps pipeline for a tabular classification task covering all five stages: data validation, feature pipeline, training, evaluation gate, and deployment.
- Implement data validation using Great Expectations with at least five expectations (non-null rate, value range, uniqueness, distribution test, schema check).
- Set an evaluation gate that fails the pipeline if validation AUC is below a threshold OR if the candidate model is worse than the current production model by more than 1% AUC.
- Simulate data drift by gradually shifting one feature's distribution. Implement PSI monitoring using Evidently or a custom implementation. At what PSI value does model performance begin to degrade?
- Implement automated retraining triggered when PSI exceeds 0.25 on any monitored feature.
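For the custom-implementation route, PSI compares binned distributions of a reference (training) sample against a live sample. A sketch, using the 0.25 threshold from the exercise and a common 10-bin default:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index of `current` against `reference`.

    Bin edges come from reference quantiles; a small floor avoids log(0)
    for empty bins. Rule of thumb: <0.1 stable, 0.1-0.25 moderate drift,
    >0.25 significant drift (the retraining trigger in the exercise).
    """
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    # Widen the outer edges so out-of-range current values still land in a bin
    edges[0] = min(edges[0], current.min()) - 1e-9
    edges[-1] = max(edges[-1], current.max()) + 1e-9
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 10_000)
no_drift = psi(ref, rng.normal(0.0, 1.0, 10_000))
shifted = psi(ref, rng.normal(0.8, 1.0, 10_000))  # mean shifted by 0.8 std
print(f"no drift: {no_drift:.4f}  shifted: {shifted:.4f}")
```

Running a monitored feature through this function on a schedule, and comparing against 0.25, is the core of the automated retraining trigger.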
Data Lineage, Dataset Versioning, and Feature Stores
Reproducibility in ML depends on three interlocked components: code versioning (git), model versioning (MLflow Model Registry), and data versioning. Without data versioning, the same training code will produce different models depending on when it is run — because the underlying tables, feature pipelines, and label generation procedures change over time. Data lineage tracking makes explicit the full provenance chain: from raw data source through transformation steps to the training dataset used for each registered model version.
DVC: Data Version Control
DVC (Data Version Control) treats data artifacts the same way git treats source files: each dataset and model file gets a content-addressed hash, stored in a .dvc file that git tracks. The actual data lives in a remote store (S3, GCS, Azure Blob, or HDFS). This means the git repository remains lightweight while every model version is linked to the exact dataset hash that produced it — making any training run fully reproducible by checking out the code commit and running dvc pull.
# Initialise DVC in an existing git repo
git init
dvc init
git add .dvc/
git commit -m "initialise dvc"
# Track a large dataset file — DVC stores hash; git stores .dvc pointer
dvc add data/raw/transactions.parquet
git add data/raw/transactions.parquet.dvc data/raw/.gitignore
git commit -m "add raw transactions dataset v1.0"
# Configure S3 remote for data storage
dvc remote add -d s3remote s3://mycompany-dvc-store/mlops-demo
dvc remote modify s3remote region us-east-1
dvc push # upload to S3
# Define a reproducible pipeline (dvc.yaml)
cat > dvc.yaml << 'EOF'
stages:
  preprocess:
    cmd: python src/preprocess.py --input data/raw/transactions.parquet --output data/processed/features.parquet
    deps:
      - src/preprocess.py
      - data/raw/transactions.parquet
    outs:
      - data/processed/features.parquet
    params:
      - params.yaml:
          - preprocess.lookback_days
          - preprocess.feature_version
  train:
    cmd: python src/train.py --features data/processed/features.parquet --model models/classifier.pkl
    deps:
      - src/train.py
      - data/processed/features.parquet
    outs:
      - models/classifier.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
    params:
      - params.yaml:
          - train.n_estimators
          - train.max_depth
          - train.random_seed
  evaluate:
    cmd: python src/evaluate.py --model models/classifier.pkl --test data/processed/test.parquet --output metrics/eval_metrics.json
    deps:
      - src/evaluate.py
      - models/classifier.pkl
      - data/processed/test.parquet
    metrics:
      - metrics/eval_metrics.json:
          cache: false
EOF
# Run the full pipeline — only re-runs stages whose deps have changed
dvc repro
# Compare metrics across experiments
dvc metrics show
dvc metrics diff HEAD~1 HEAD
# Create an experiment branch (DVC Experiments)
dvc exp run --set-param train.n_estimators=200 --name exp-200trees
dvc exp run --set-param train.n_estimators=500 --name exp-500trees
dvc exp show --sort-by eval_auc # tabular comparison
dvc exp apply exp-500trees # promote best experiment to workspace
Feature Store Architecture
Feature stores solve the training-serving skew problem by providing a single source of truth for feature values: the same feature computation logic runs both offline (to produce training data) and online (to serve real-time inference). A production feature store has two main components:
- Offline store: A batch-oriented store (BigQuery, Snowflake, Hive, Delta Lake) that holds historical feature values for training and batch scoring. Features are computed by scheduled batch jobs (Spark, dbt, Airflow) and written with point-in-time correctness: each training example records the feature values that were observable at the time of the event, preventing label leakage from future data.
- Online store: A low-latency key-value store (Redis, DynamoDB, Bigtable, Cassandra) that holds the most recent feature values for real-time inference. Features are kept in sync with the offline store via a streaming pipeline (Kafka, Kinesis) or a periodic materialisation job. Read latency must be under 5–10ms to not dominate end-to-end serving latency.
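The point-in-time semantics can be illustrated with a toy pandas join (entity IDs, timestamps, and the feature name are invented for the example; `merge_asof` with `direction="backward"` gives the same "latest value observable at event time" semantics that a feature store's historical retrieval applies):

```python
import pandas as pd

# Labelled events and feature snapshots (when each value became observable)
events = pd.DataFrame({
    "entity_id": ["u1", "u2", "u1"],
    "event_time": pd.to_datetime(["2024-01-10", "2024-01-15", "2024-02-10"]),
})
features = pd.DataFrame({
    "entity_id": ["u1", "u2", "u1"],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-20", "2024-02-01"]),
    "avg_txn_30d": [12.0, 7.0, 18.0],
})

# Per event, pick the latest feature value at or before event_time
train = pd.merge_asof(
    events.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time", right_on="feature_time",
    by="entity_id", direction="backward",
)
print(train[["entity_id", "event_time", "avg_txn_30d"]])
```

The u2 row comes back NaN: its only feature snapshot postdates the event, so the point-in-time join correctly refuses to leak it into training.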
| Feature Store Component | Open Source Option | Managed Cloud Option | Primary Use | Latency Target |
| --- | --- | --- | --- | --- |
| Offline store | Feast + Spark + Delta Lake | SageMaker Feature Store (offline), Vertex Feature Store | Training dataset generation, batch scoring | Minutes to hours (batch) |
| Online store | Feast + Redis | SageMaker Feature Store (online), Vertex Feature Store | Real-time inference feature retrieval | <10ms p99 |
| Feature registry | Feast feature definitions (YAML) | Tecton, Hopsworks | Schema validation, discoverability, lineage | N/A (metadata) |
| Streaming features | Kafka + Faust/Flink + Redis | Kinesis + Lambda + DynamoDB | Near-real-time aggregations (last 5min, last 1hr) | <1min end-to-end freshness |
| Point-in-time joins | Feast historical retrieval, Hopsworks | Vertex Feature Store time-travel | Leak-free training data generation | Minutes (offline) |
Kubernetes Model Serving: Containers, Canary Deployments, and Autoscaling
For production ML serving at scale, containerised deployment on Kubernetes provides the resource management, traffic management, and operational tooling needed to run models reliably. A standard production serving pattern combines: (1) a FastAPI serving container built with a minimal base image; (2) Kubernetes Deployments and Services for lifecycle management; (3) an Ingress or Istio VirtualService for traffic routing; and (4) Horizontal Pod Autoscaler (HPA) for demand-responsive scaling.
# ── Dockerfile for the FastAPI model server ───────────────────────────────
cat > Dockerfile << 'EOF'
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/serve.py .
COPY models/classifier.pkl models/
# Non-root user for security
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser
EXPOSE 8080
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "2"]
EOF
# Build and push to container registry
docker build -t myregistry.io/fraud-model:v2.3.1 .
docker push myregistry.io/fraud-model:v2.3.1
# ── Kubernetes Deployment manifest ────────────────────────────────────────
cat > k8s/deployment-v2.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-v2
  labels:
    app: fraud-model
    version: v2.3.1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-model
      version: v2.3.1
  template:
    metadata:
      labels:
        app: fraud-model
        version: v2.3.1
    spec:
      containers:
        - name: model-server
          image: myregistry.io/fraud-model:v2.3.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          env:
            - name: MODEL_VERSION
              value: "v2.3.1"
            - name: LOG_LEVEL
              value: "INFO"
EOF
kubectl apply -f k8s/deployment-v2.yaml
# ── Horizontal Pod Autoscaler ──────────────────────────────────────────────
kubectl autoscale deployment fraud-model-v2 \
--cpu-percent=70 \
--min=3 \
--max=20
# Custom metric HPA (requests per second via Prometheus adapter)
cat > k8s/hpa-custom.yaml << 'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model-v2
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
EOF
# ── Canary deployment: 10% traffic to v2, 90% to v1 ─────────────────────
# Istio VirtualService for traffic splitting
cat > k8s/virtual-service.yaml << 'EOF'
# Subsets v1/v2 must be defined in a companion DestinationRule
# keyed on the pods' `version` label
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-model
spec:
  hosts:
    - fraud-model-svc
  http:
    # Header-based route lets testers force the canary
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: fraud-model-svc
            subset: v2
    # Weighted split: 90% stable (v1), 10% canary (v2)
    - route:
        - destination:
            host: fraud-model-svc
            subset: v1
          weight: 90
        - destination:
            host: fraud-model-svc
            subset: v2
          weight: 10
EOF
kubectl apply -f k8s/virtual-service.yaml
# Monitor canary metrics for 24h, then promote or rollback.
# promtool needs the server URL as its first argument; its instant-query
# output looks like "{labels} => VALUE @[timestamp]", so awk extracts field 3.
CANARY_ERROR_RATE=$(kubectl exec -n monitoring \
  deployment/prometheus -- \
  promtool query instant http://localhost:9090 \
  'rate(http_requests_total{version="v2.3.1",status=~"5.."}[5m]) / rate(http_requests_total{version="v2.3.1"}[5m])' \
  | awk '{print $3}')
if (( $(echo "$CANARY_ERROR_RATE > 0.01" | bc -l) )); then
  echo "Canary error rate ${CANARY_ERROR_RATE} exceeds 1% — rolling back"
  kubectl patch virtualservice fraud-model --type=json \
    -p='[{"op":"replace","path":"/spec/http/1/route/1/weight","value":0},{"op":"replace","path":"/spec/http/1/route/0/weight","value":100}]'
else
  echo "Canary healthy — promoting to 100%"
  kubectl set image deployment/fraud-model-v1 model-server=myregistry.io/fraud-model:v2.3.1
  kubectl delete deployment fraud-model-v2
fi
Model Serving SLOs: Define Service Level Objectives before deploying. Typical ML serving SLOs: p50 latency <50ms, p99 latency <200ms, availability 99.9% (8.7h downtime/year), error rate <0.1%. Canary deployments should run for a minimum of 24 hours (to cover daily traffic patterns) before promotion, with automated rollback triggered if any SLO is violated during the canary window.
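The SLO gate can be sketched as a check over a latency sample (thresholds taken from the text; the traffic sample is synthetic and illustrative):

```python
import numpy as np

def check_slos(latencies_ms, errors: int, total: int) -> dict:
    """Evaluate a latency sample and error count against the SLOs above."""
    return {
        "p50_under_50ms": bool(np.percentile(latencies_ms, 50) < 50),
        "p99_under_200ms": bool(np.percentile(latencies_ms, 99) < 200),
        "error_rate_under_0.1pct": (errors / total) < 0.001,
    }

rng = np.random.default_rng(1)
latencies = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)  # synthetic, in ms
print(check_slos(latencies, errors=3, total=10_000))
```

Wired into the canary window, any `False` in this dict would trigger the automated rollback path rather than promotion.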
Conclusion & Next Steps
MLOps is not a single tool or platform — it is a set of engineering disciplines applied to the unique challenges of ML systems: non-deterministic training, delayed feedback loops, data dependencies, and the need to maintain both predictive accuracy and fairness as the world changes. The five pillars are: reproducibility (experiment tracking and dataset versioning); consistency (feature stores eliminating training-serving skew); automation (CI/CD pipelines replacing manual model handoffs); reliability (serving infrastructure with health checks, validation, and graceful degradation); and observability (drift detection and fairness monitoring that surface problems before they reach customers).
The maturity of your MLOps practice directly determines how quickly you can iterate on model improvements and how quickly you can respond when models degrade. Organisations at MLOps Level 0 — manual, script-driven processes — can take weeks to get an improved model into production and may not notice degradation for months. Organisations at Level 2 — fully automated pipelines with automated evaluation gates and continuous monitoring — can release a new model version within hours of identifying a performance regression and automatically trigger retraining when drift is detected.
The fairness monitoring dimension connects directly back to Part 19: all demographic slice metrics identified in the Ethics Impact Assessment should be tracked in the production monitoring dashboard with the same alert thresholds as overall accuracy metrics. Model improvements should be evaluated not just on aggregate AUC but on the fairness Pareto frontier — ensuring that performance gains for the majority do not come at the cost of minority group performance.
Next in the Series
In Part 21: Edge AI & On-Device Intelligence, we cover model compression (quantization, pruning, distillation), hardware accelerators for edge inference, TFLite, CoreML, ONNX Runtime, and the engineering discipline of deploying AI to resource-constrained devices while maintaining acceptable accuracy.
Continue This Series
Part 19: AI Ethics & Bias Mitigation
Fairness metrics, dataset auditing, debiasing techniques, and participatory design — the ethics foundation that MLOps monitoring must continuously uphold.
Read Article
Part 21: Edge AI & On-Device Intelligence
Quantization, pruning, TFLite, ONNX, and deploying intelligent systems to resource-constrained hardware at the network edge.
Read Article
Part 22: AI Infrastructure, Hardware & Scaling
GPUs, TPUs, distributed training, mixed precision, and the hardware and infrastructure landscape for large-scale AI workloads.
Read Article