
MLOps & Model Deployment

March 30, 2026 · Wasil Zafar · 33 min read

Taking ML models from Jupyter notebooks to reliable production services is one of the hardest problems in applied AI. This article covers the full MLOps lifecycle: experiment tracking, model registries, CI/CD for ML, serving infrastructure, feature stores, and the monitoring practices that keep models accurate and fair over time.

Table of Contents

  1. The MLOps Problem
  2. Experiment Tracking & Model Registry
  3. Feature Stores
  4. Model Serving Infrastructure
  5. CI/CD for Machine Learning
  6. Monitoring & Drift Detection
  7. MLOps Tools Comparison
  8. Production Patterns: Scaling & Reliability
  9. Exercises
  10. Conclusion & Next Steps

AI in the Wild Part 20 of 24

About This Article

MLOps — the discipline of applying DevOps principles to machine learning — is what separates research projects from production systems. This article covers the complete ML production lifecycle: experiment tracking and reproducibility, feature stores, model registries, serving infrastructure, CI/CD pipelines, and the monitoring practices needed to keep models performing well as the world changes around them.

Tags: MLOps · Model Deployment · MLflow · CI/CD for ML · Drift Detection

The MLOps Problem

In 2015, Google engineers Sculley et al. published "Hidden Technical Debt in Machine Learning Systems," arguably the most influential paper in applied ML engineering. Their central argument: the ML code that trains and serves a model typically represents only a small fraction of the total system complexity. The surrounding infrastructure — data pipelines, feature engineering, monitoring, serving, configuration management, and the feedback loops between components — creates technical debt that compounds over time and is far harder to pay off than the debt accumulated in traditional software systems. A model that performs excellently in offline evaluation can silently degrade in production for weeks before anyone notices, because the inputs to the model have drifted, the world it was trained to model has changed, or the serving environment has diverged from the training environment.

MLOps is the discipline that addresses this debt systematically. It applies the principles of DevOps — automation, reproducibility, continuous integration and delivery, monitoring, and collaboration between development and operations — to the ML lifecycle. The goal is not merely to deploy a model but to maintain a continuous pipeline that can retrain, evaluate, and deploy new model versions automatically when performance degrades, while keeping detailed records of every experiment, every dataset version, and every deployment decision.

Key Insight: The transition from "model that works in a notebook" to "model that works reliably in production" is not a deployment step — it is an engineering discipline that must be designed from the beginning. The earlier MLOps practices are adopted in a project, the lower the total cost of reaching and maintaining production-quality performance.

ML Technical Debt

The most common forms of ML technical debt in production systems are:

  • Undeclared consumers — other systems that silently depend on the model's output schema and break when that schema changes without announcement.
  • Feedback loops — the model's outputs influencing future training data in ways that are not tracked or controlled.
  • Data dependencies — upstream feature pipelines that evolve independently of the model, causing silent changes in the input distribution.
  • Configuration debt — hyperparameters, preprocessing steps, and evaluation thresholds stored in ad hoc files rather than version-controlled configuration.
  • Pipeline jungles — multiple overlapping data pipelines serving the same model, each maintained by a different team, with no authoritative source of truth for feature definitions.

MLOps Maturity Levels

Google's ML Engineering for Production guidelines define three maturity levels. Level 0 — the baseline — involves manual, script-driven processes: data scientists train models locally, hand off serialised model files to engineers, and monitoring is ad hoc. Most organisations start here, and many stay here longer than they should. Level 1 introduces automated ML pipelines: data ingestion, training, evaluation, and deployment are orchestrated by a pipeline tool (Kubeflow Pipelines, Apache Airflow, or a cloud-managed equivalent), and the pipeline is triggered automatically on new data. Model and data versioning are enforced. Level 2 adds CI/CD for the pipeline itself: changes to pipeline code are tested automatically before deployment, the model registry tracks all candidate and production models, and automated testing gates prevent performance regressions from reaching production.

Experiment Tracking & Model Registry

Reproducibility is the foundation of trustworthy ML. Without a systematic record of which code, which data version, which hyperparameters, and which environment produced each model, debugging degraded performance in production is guesswork. Experiment tracking tools capture this information automatically during training, creating a searchable database of all past experiments that enables scientific comparison, regression detection, and confident rollback.

MLflow Experiment Tracking

MLflow is the most widely adopted open-source experiment tracking and model lifecycle management platform. It provides four components: Tracking records parameters, metrics, tags, and artifacts for each training run; Projects packages ML code into reproducible, shareable units; Models provides a standard format for packaging and deploying models across frameworks; and Model Registry maintains a versioned store of production-grade models with lifecycle stage management. The following code demonstrates best-practice MLflow usage for a gradient boosting classifier:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score
import pandas as pd

# MLflow: reproducible experiment tracking
# (assumes X_train, y_train, X_val, y_val have been prepared upstream)
mlflow.set_experiment("churn-prediction-v2")

with mlflow.start_run(run_name="gbm-tuning-round-3"):
    # Log parameters
    params = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.05, "subsample": 0.8}
    mlflow.log_params(params)

    # Train model
    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Log metrics
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    val_f1 = f1_score(y_val, model.predict(X_val))

    mlflow.log_metrics({"train_auc": train_auc, "val_auc": val_auc, "val_f1": val_f1})

    # Log model artifact
    mlflow.sklearn.log_model(model, artifact_path="model",
                              registered_model_name="ChurnPredictor")

    # Log any file artifacts
    mlflow.log_artifact("feature_importance.png")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
    print(f"Val AUC: {val_auc:.4f} (vs baseline: 0.7823)")

# Compare runs programmatically
runs = mlflow.search_runs(experiment_names=["churn-prediction-v2"],
                           order_by=["metrics.val_auc DESC"])
print(runs[["run_id", "params.n_estimators", "metrics.val_auc"]].head())

Model Registry & Versioning

The model registry is the gatekeeper between experimentation and production. Every model that passes evaluation gates is registered with a version number, a link to the training run that produced it, and a lifecycle stage. The stages typically follow a pipeline: None (registered but not evaluated), Staging (passed evaluation gates, under A/B test or shadow mode), Production (serving live traffic), and Archived (superseded by a newer version). This lifecycle ensures that any production model can be rolled back to a previous version in seconds by simply changing the stage label, without re-deploying any code — the serving layer always loads the model tagged "Production" from the registry at startup.
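The stage lifecycle described above can be sketched as a minimal in-memory registry. This is illustrative only (a real registry such as MLflow's persists versions, links each to its training run, and exposes stages over an API), but it shows the invariant the stage labels enforce: exactly one version serves production traffic, and rollback is just a label change.

```python
# Minimal in-memory sketch of registry stage transitions (illustrative only).
VALID_STAGES = {"None", "Staging", "Production", "Archived"}

class ModelRegistry:
    def __init__(self):
        self.versions = {}  # version number -> lifecycle stage

    def register(self, version):
        self.versions[version] = "None"

    def transition(self, version, stage):
        if stage not in VALID_STAGES:
            raise ValueError(f"Unknown stage: {stage}")
        if stage == "Production":
            # Archive any current Production model so exactly one version serves traffic
            for v, s in self.versions.items():
                if s == "Production":
                    self.versions[v] = "Archived"
        self.versions[version] = stage

    def production_version(self):
        return next((v for v, s in self.versions.items() if s == "Production"), None)

registry = ModelRegistry()
registry.register(6); registry.transition(6, "Production")
registry.register(7); registry.transition(7, "Staging")
registry.transition(7, "Production")     # promote: v6 auto-archived
print(registry.production_version())     # 7
registry.transition(6, "Production")     # rollback: just flip the label back
print(registry.production_version())     # 6
```

The serving layer never needs to know version numbers at all; it only ever asks for whichever version currently carries the "Production" label.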

Production Warning: Loading a model from the registry at every prediction request is catastrophically slow. Always load the model once at service startup and cache it in memory. The correct pattern is: load at startup using the "Production" alias, expose a /model/reload admin endpoint that refreshes the cached model, and use a health check endpoint that reports the loaded model version and its registry metadata. Never load from registry per-request.

Feature Stores

A feature store is a centralised system for creating, storing, sharing, and serving ML features — the pre-computed, transformed inputs that models receive. Without a feature store, feature engineering code is duplicated across training pipelines and serving systems, leading to subtle discrepancies in how features are computed between offline training and online serving — the training-serving skew that is one of the most common root causes of production model degradation.

Design Principles

A production feature store has two serving layers. The offline store (typically a data warehouse like BigQuery, Snowflake, or Redshift backed by a columnar storage format like Parquet) provides historical feature values for training and batch scoring. The online store (typically a low-latency key-value store like Redis, DynamoDB, or Cassandra) provides real-time feature lookups for online serving — feature values are precomputed and written to the online store so that model inference can retrieve them in single-digit milliseconds. Feast (open source), Tecton, and Hopsworks are the leading feature store platforms; all major cloud providers now offer managed equivalents.

Training-Serving Skew

Training-serving skew arises when the feature values seen at training time differ from those seen at inference time, even when the raw data has not changed. Common causes include: computing features differently in SQL (offline) versus Python (online); time zone inconsistencies in timestamp arithmetic; different null-handling conventions between the data warehouse and the serving cache; and feature transformation logic captured manually during experimentation but not included in the pipeline. The canonical defence is a single, version-controlled feature transformation function that is called both during dataset generation and during online serving — the feature store enforces this by requiring feature definitions to be registered before use and serving them from the same definition at both stages.
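A minimal sketch of that defence, using a hypothetical days_since_last_purchase feature: one version-controlled function, normalising all timestamps to UTC, called by both the offline dataset builder and the online serving path.

```python
from datetime import datetime, timezone, timedelta

# One definition used by BOTH the offline training pipeline and the online
# serving path. The feature name and dates are illustrative.
def days_since_last_purchase(last_purchase: datetime, now: datetime) -> int:
    # Normalise both timestamps to UTC so offline (warehouse) and online
    # (service) computations can never disagree on day boundaries.
    last_utc = last_purchase.astimezone(timezone.utc)
    now_utc = now.astimezone(timezone.utc)
    return (now_utc.date() - last_utc.date()).days

now = datetime(2026, 3, 30, 1, 0, tzinfo=timezone.utc)

# The same wall-clock instant expressed in two zones yields the same feature value
last_utc = datetime(2026, 2, 28, 7, 30, tzinfo=timezone.utc)
last_pst = datetime(2026, 2, 27, 23, 30, tzinfo=timezone(timedelta(hours=-8)))
print(days_since_last_purchase(last_utc, now))  # 30
print(days_since_last_purchase(last_pst, now))  # 30
```

Had the online path used local-time day boundaries, the second call would have disagreed with the first: exactly the skew the feature store exists to prevent.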

Anti-Pattern

The Feature Duplication Problem

A data science team builds a churn prediction model. The training pipeline computes "days_since_last_purchase" using pandas in UTC. The serving engineer implements the same feature via a SQL DATEDIFF query against the production database using the server's local timezone. For customers in UTC-8, transactions near midnight are credited to the previous day in one computation but not the other. The model, trained to flag 30+ days of inactivity as churn risk, now receives inconsistent feature values and produces systematically wrong predictions for a subset of customers — a bug invisible until a customer-facing analyst notices anomalous churn scores for a specific cohort. A feature store with a single registered feature definition would have prevented this entirely.

Model Serving Infrastructure

Model serving is the engineering discipline of exposing trained models as reliable, low-latency, scalable API services. The serving layer must balance three competing demands: latency (predictions must arrive within the user experience budget — typically under 100ms for interactive applications); throughput (the service must handle peak traffic without degradation); and reliability (the service must be available even when individual components fail, and must degrade gracefully rather than catastrophically). Most serving implementations today use a combination of a model server behind a load balancer, with horizontal autoscaling triggered by request queue depth or CPU utilisation.

FastAPI Model Server

FastAPI has become the de facto standard for Python-based model serving. Its async request handling, automatic input validation via Pydantic, and OpenAPI documentation generation make it well-suited for ML APIs. The critical implementation discipline is loading the model once at service startup, not per request:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import mlflow.sklearn
import numpy as np
from typing import List
import time

app = FastAPI(title="Churn Prediction API", version="2.1.0")

# Load model at startup (not per-request!)
MODEL_URI = "models:/ChurnPredictor/Production"  # MLflow Model Registry
model = mlflow.sklearn.load_model(MODEL_URI)

class PredictionRequest(BaseModel):
    customer_id: str
    features: List[float]  # must match training feature order

class PredictionResponse(BaseModel):
    customer_id: str
    churn_probability: float
    prediction: str  # "churn" | "retain"
    model_version: str
    latency_ms: float

# Note: a plain (non-async) handler lets FastAPI run this CPU-bound model call
# in its threadpool instead of blocking the event loop.
@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    start = time.time()

    if len(request.features) != 15:  # validate feature count
        raise HTTPException(status_code=422, detail="Expected 15 features")

    features = np.array(request.features).reshape(1, -1)
    churn_prob = float(model.predict_proba(features)[0, 1])

    return PredictionResponse(
        customer_id=request.customer_id,
        churn_probability=round(churn_prob, 4),
        prediction="churn" if churn_prob > 0.5 else "retain",
        model_version="2.1.0",
        latency_ms=round((time.time() - start) * 1000, 2)
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model": MODEL_URI}

Serving Patterns

Production serving requires more than a single endpoint. Shadow mode deployment runs the new model in parallel with the production model, logging its predictions without serving them to users. Canary deployment routes a small percentage of traffic (e.g., 5%) to the new model and gradually increases the share as confidence grows. A/B testing randomly assigns users to model versions and compares business metrics across groups. Blue-green deployment maintains two identical serving environments and switches the load balancer instantaneously — enabling zero-downtime deployment and instant rollback. The choice of pattern depends on the cost of a bad prediction and the availability of labels for rapid evaluation.
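As an illustration of canary routing, a common implementation hashes a stable request key (here a hypothetical user id) into buckets, so each user consistently sees one model version while the candidate receives roughly the configured share of traffic. The 5% share follows the example above.

```python
import hashlib

# Deterministic canary routing: a stable hash of the user id decides which
# model version serves the request, so a given user always sees the same version.
CANARY_SHARE = 0.05

def route_model(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "candidate" if bucket < CANARY_SHARE * 10_000 else "production"

assignments = [route_model(f"user-{i}") for i in range(100_000)]
share = assignments.count("candidate") / len(assignments)
print(f"candidate share: {share:.3f}")  # close to 0.05
```

Hash-based assignment (rather than random sampling per request) matters: it keeps each user's experience stable and makes per-group metric comparisons valid.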

CI/CD for Machine Learning

Continuous integration and delivery for ML extends the software CI/CD paradigm with ML-specific stages: data validation (verifying that new training data meets quality and distribution expectations), model evaluation (verifying that the candidate model meets performance thresholds), and model comparison (verifying that the candidate model outperforms or at minimum does not regress from the current production model). Each stage is a gate: a failed gate aborts the pipeline and triggers notifications to the responsible team.

Pipeline Stages

A mature ML CI/CD pipeline typically consists of five stages. Data validation uses Great Expectations, TFDV, or a custom suite to verify that new data satisfies invariants — expected null rates, value ranges, cardinality constraints, and distribution similarity to the reference dataset. Feature pipeline executes the feature engineering transformations registered in the feature store against the new data. Training executes the training script with the registered hyperparameter configuration, logging all parameters and metrics to the experiment tracker. Evaluation gate compares the candidate model against a predefined threshold (e.g., minimum AUC of 0.82) and against the current production model — if the candidate fails either gate, the pipeline fails. Deployment updates the model registry stage to "Production" and triggers a rolling restart of the serving fleet.
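A sketch of what the evaluation-gate logic might look like: the 0.82 absolute threshold comes from the text, while the 1-point regression allowance and the metric values are illustrative.

```python
# Evaluation gate: pass only if the candidate clears an absolute threshold
# AND does not regress from the current production model.
ABSOLUTE_THRESHOLD = 0.82
MAX_REGRESSION = 0.01  # candidate may be at most 1 point of AUC worse

def evaluation_gate(candidate_auc: float, production_auc: float) -> bool:
    if candidate_auc < ABSOLUTE_THRESHOLD:
        print(f"FAIL: candidate AUC {candidate_auc:.4f} below {ABSOLUTE_THRESHOLD}")
        return False
    if candidate_auc < production_auc - MAX_REGRESSION:
        print(f"FAIL: candidate regresses from production ({production_auc:.4f})")
        return False
    print("PASS: candidate cleared both gates")
    return True

print(evaluation_gate(candidate_auc=0.845, production_auc=0.838))  # True
```

In a CI pipeline this boolean would map to the process exit code, so a failed gate aborts the run and blocks deployment.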

GitHub Actions Implementation

# .github/workflows/ml-pipeline.yml
# Automated ML pipeline: data validation → training → evaluation → deployment

name: ML Training Pipeline

on:
  push:
    branches: [main]
    paths: ['src/**', 'data/**', 'configs/**']

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Great Expectations data validation
        run: |
          pip install great_expectations
          great_expectations checkpoint run churn_data_checkpoint
          # Fails if data quality < 95% non-null, value ranges out of bounds, etc.

  train-and-evaluate:
    needs: validate-data
    runs-on: [self-hosted, gpu]  # GPU runner for training
    steps:
      - name: Train model
        run: python src/train.py --config configs/gbm_v2.yaml

      - name: Evaluate model
        run: |
          python src/evaluate.py --metric val_auc --threshold 0.82
          # Fails CI if AUC < 0.82 (prevents regression)

      - name: Register model if better than production
        run: python src/register_model.py --compare-with production
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}

  deploy-to-staging:
    needs: train-and-evaluate
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: |
          kubectl set image deployment/churn-api churn-api=acr.io/churn:$GITHUB_SHA
          kubectl rollout status deployment/churn-api --timeout=5m

The key design principle is that each stage is independently testable and produces a versioned artifact. The data validation stage produces a validation report stored as a CI artifact. The training stage produces a registered model version in MLflow. The evaluation stage produces a metrics comparison report. The deployment stage produces a deployment record. Any stage can be re-run in isolation if it fails for infrastructure reasons, without re-running the expensive training stage from scratch.

Monitoring & Drift Detection

A model that performs well at launch can degrade silently over days, weeks, or months as the world it was trained to model changes. Monitoring in production ML means tracking not just infrastructure metrics (latency, throughput, error rates) but model health metrics — the statistical properties of model inputs, outputs, and (when available) outcomes. Without systematic monitoring, degraded performance may only surface through downstream business metrics or customer complaints, long after the root cause has become difficult to diagnose.

Drift Types

Data Drift (Covariate Shift)
  • Definition: the distribution of input features P(X) changes, but P(Y|X) stays the same.
  • Detection: PSI (Population Stability Index), KS test, or Jensen-Shannon divergence per feature.
  • Retraining trigger: PSI > 0.25 on any key feature, or average PSI > 0.10 across the feature set.
  • Example: a pandemic shifts customer spending patterns; a new user acquisition channel brings different demographics.

Concept Drift
  • Definition: the relationship between features and target P(Y|X) changes.
  • Detection: degradation in accuracy metrics once labels become available; ADWIN or DDM on rolling-window performance.
  • Retraining trigger: rolling AUC drops below the alert threshold, or an abrupt change is detected by the DDM algorithm.
  • Example: fraud patterns evolve after new fraud prevention measures; churn drivers change after a competitor enters the market.

Prediction Drift
  • Definition: the distribution of model output P(Ŷ) changes without labelled ground truth.
  • Detection: compare today's prediction score distribution with a reference period; alert on the PSI of the score histogram.
  • Retraining trigger: score PSI > 0.20, or the average predicted probability shifts by more than 3 percentage points.
  • Example: the model suddenly predicts high churn for almost everyone after a feature pipeline bug changes a key feature.

Label Drift
  • Definition: the distribution of actual outcomes P(Y) changes.
  • Detection: monitor the actual positive rate in labelled production data; chi-square test against the reference period.
  • Retraining trigger: the actual positive rate deviates more than 20% from expected, or expected calibration error exceeds its threshold.
  • Example: actual churn rate rises seasonally; an economic downturn pushes the default rate above the model's prior.

Feature Schema Drift
  • Definition: feature names, types, or cardinality change at the source.
  • Detection: schema validation at feature ingestion; alert on any schema change relative to the registered feature definition.
  • Retraining trigger: any schema mismatch — fail serving immediately and fall back to the previous model version.
  • Example: an upstream team renames a column; an enum feature gains a new category; a numeric feature becomes nullable.

Detection Methods

The Population Stability Index (PSI) is the most widely used drift metric in industry. PSI computes the sum of divergence terms between the reference distribution (training data) and the production distribution (recent serving data) bucketed into deciles. PSI below 0.10 indicates no significant drift; PSI between 0.10 and 0.25 indicates moderate drift requiring investigation; PSI above 0.25 indicates significant drift requiring model review or retraining. The Kolmogorov-Smirnov test and Jensen-Shannon divergence are statistically rigorous alternatives but less intuitive to communicate to business stakeholders. Evidently AI and Deepchecks are the leading open-source libraries for production drift monitoring; both integrate with standard MLOps stacks and can generate HTML reports, Slack alerts, and metric exports to Prometheus/Grafana dashboards.
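A minimal PSI implementation matching that description (decile buckets taken from the reference distribution, with a small epsilon for empty buckets) behaves as the thresholds suggest on synthetic data:

```python
import numpy as np

def population_stability_index(reference, production, n_bins=10):
    # Bucket edges come from the reference (training) distribution's deciles
    edges = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    ref_counts = np.histogram(reference, bins=edges)[0]
    prod_counts = np.histogram(production, bins=edges)[0]
    # Clip to a small epsilon to avoid division by zero / log of zero
    ref_pct = np.clip(ref_counts / len(reference), 1e-6, None)
    prod_pct = np.clip(prod_counts / len(production), 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 10_000)
same = rng.normal(0.0, 1.0, 10_000)      # no drift
shifted = rng.normal(0.5, 1.0, 10_000)   # mean shift of half a standard deviation

print(f"PSI (no drift):   {population_stability_index(reference, same):.3f}")
print(f"PSI (mean shift): {population_stability_index(reference, shifted):.3f}")
```

The undrifted sample lands comfortably below the 0.10 threshold, while the half-sigma mean shift pushes PSI into the investigate-or-retrain range.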

Key Insight: Monitoring must cover both the model and the fairness properties documented in Part 19. A model can maintain overall AUC while its disparate impact ratio for a protected demographic group degrades significantly — because the data drift affects that group disproportionately. Add demographic slice metrics to the monitoring dashboard alongside aggregate performance metrics, and set independent drift alert thresholds for each demographic slice.
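A toy illustration of slice-level monitoring: compute the positive-prediction rate per demographic slice and the ratio between the worst and best slice. The 0.8 alert threshold echoes the common four-fifths rule; the group labels and prediction values are made up.

```python
# Per-slice monitoring sketch: positive-prediction rate for each demographic
# slice, plus a disparate impact ratio with an alert threshold.
def slice_positive_rates(predictions, groups):
    rates = {}
    for g in set(groups):
        preds_g = [p for p, grp in zip(predictions, groups) if grp == g]
        rates[g] = sum(preds_g) / len(preds_g)
    return rates

predictions = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # binary churn flags (illustrative)
groups      = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

rates = slice_positive_rates(predictions, groups)
ratio = min(rates.values()) / max(rates.values())
print(rates, f"disparate impact ratio: {ratio:.2f}")
if ratio < 0.8:
    print("ALERT: slice-level disparity exceeds fairness threshold")
```

The same pattern extends to any per-slice metric (AUC, calibration, PSI): compute it per group, and alert on each slice independently rather than only on the aggregate.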

MLOps Tools Comparison

The MLOps tooling landscape has matured rapidly. The choice of platform depends on team size, existing cloud provider relationships, requirement for open-source ownership, and the degree of custom pipeline complexity required.

MLflow (Databricks / Community). Tracking: excellent; registry: excellent; serving: good; AutoML: no; open source: yes (Apache 2). Best for teams wanting full control, Databricks-native environments, and any framework.

Weights & Biases (W&B Inc.). Tracking: excellent; registry: good; serving: basic; AutoML: yes (Sweeps); open source: partial (free tier). Best for research teams, rich visualisations, and hyperparameter sweeps at scale.

Kubeflow Pipelines (Google / Community). Tracking: basic; registry: basic; serving: via KServe (formerly KFServing); AutoML: via Katib; open source: yes (Apache 2). Best for Kubernetes-native teams, complex multi-step pipelines, and on-premise deployments.

Amazon SageMaker (AWS). Tracking: good; registry: good; serving: excellent; AutoML: excellent (Autopilot); open source: no. Best for AWS-native teams, fully managed end-to-end workflows, enterprise scale, and compliance requirements.

Vertex AI (Google Cloud). Tracking: good; registry: good; serving: excellent; AutoML: excellent (Vertex AutoML); open source: no. Best for GCP-native teams, BigQuery ML integration, and TensorFlow / JAX workloads.

Azure Machine Learning (Microsoft). Tracking: good; registry: good; serving: good; AutoML: good; open source: partial (SDK OSS). Best for Azure/Microsoft 365 ecosystems, regulated industries (HIPAA, FedRAMP), and MLflow integration.

Production Patterns: Scaling & Reliability

Moving a model from a single serving endpoint to a production-grade system serving millions of requests requires addressing a set of engineering challenges that go beyond the model itself: horizontal scaling, request batching, model versioning, circuit breakers, and graceful degradation. These patterns come from traditional distributed systems engineering but require ML-specific adaptations because model inference has different performance characteristics than typical API computation — inference is CPU/GPU-bound rather than I/O-bound, latency is highly variable depending on input sequence length (for transformers), and the memory footprint of loaded models can be several gigabytes.

Horizontal Scaling and Load Balancing

A single model serving instance can handle a limited number of concurrent requests determined by the inference latency and the number of available CPU or GPU cores. For a model with 50ms average inference latency running on a 4-core CPU instance, simple throughput arithmetic (in the spirit of Little's Law) gives a theoretical maximum of approximately 80 requests per second per instance: 4 cores divided by 0.05 seconds per request. Horizontal scaling — running multiple instances behind a load balancer — provides linear throughput scaling. For stateless model serving (no per-user session state, no per-request model loading), horizontal scaling is straightforward: any instance can handle any request, and the load balancer can distribute requests round-robin or least-connections. The critical operational requirement is that all instances serve the same model version simultaneously — version skew, where different instances serve different model versions, is a common source of subtle production bugs where prediction behaviour appears non-deterministic from the caller's perspective.
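The capacity arithmetic above, plus the instance-count calculation it implies for a given peak load (the 1,000 req/s target and 30% headroom are illustrative):

```python
import math

# Back-of-envelope capacity estimate: with fully parallel, CPU-bound inference,
# each core completes 1 / latency requests per second.
cores = 4
inference_latency_s = 0.050

per_instance_rps = cores / inference_latency_s
print(f"max throughput per instance: {per_instance_rps:.0f} req/s")  # 80 req/s

# Instances needed to serve a peak load with headroom (targets are illustrative)
peak_rps, headroom = 1_000, 0.30
instances = math.ceil(peak_rps * (1 + headroom) / per_instance_rps)
print(f"instances required: {instances}")  # 17
```

Real throughput will fall short of this ceiling (request parsing, serialisation, and GC all consume core time), which is one reason the headroom factor exists.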

Request Batching for GPU Serving

GPU inference is most efficient when requests are batched: rather than processing one request at a time, the serving layer accumulates a small number of requests (typically 4-32) and processes them together in a single forward pass. This amortises the GPU kernel launch overhead and achieves much higher GPU utilisation. The trade-off is latency: a request that arrives when the batch is not yet full must wait until the batch is complete before receiving a response. This is typically managed with a maximum batch wait time (e.g., 5ms) — if the batch is not full within the wait time, the partial batch is processed immediately. NVIDIA Triton Inference Server and TorchServe both implement dynamic batching with configurable batch size and timeout parameters. For CPU serving, batching is generally less beneficial and may actually increase latency for variable-length inputs (e.g., transformer sequences).
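The accumulate-until-full-or-timeout behaviour can be sketched with a toy asyncio micro-batcher. The stand-in model simply doubles its input, and the batch size and 5ms wait are illustrative; production servers such as Triton implement this natively.

```python
import asyncio

MAX_BATCH = 8
MAX_WAIT_S = 0.005  # 5 ms maximum batch wait

def model_forward(batch):
    return [x * 2 for x in batch]  # stand-in for a batched model forward pass

async def batch_worker(queue):
    while True:
        item = await queue.get()                 # block until at least one request
        batch = [item]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:            # fill the batch until full or timed out
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break                             # partial batch: process immediately
        inputs, futures = zip(*batch)
        for fut, out in zip(futures, model_forward(list(inputs))):
            fut.set_result(out)                  # resolve each caller's future

async def predict(queue, x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    out = await asyncio.gather(*(predict(queue, i) for i in range(10)))
    worker.cancel()
    return out

results = asyncio.run(main())
print(results)
```

Ten concurrent requests are served as one full batch of eight plus a partial batch of two that is flushed when the 5ms wait expires: exactly the latency/throughput trade-off described above.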

Circuit Breakers and Graceful Degradation

A model serving endpoint that is overloaded, experiencing a software bug, or serving from a corrupted model version must fail gracefully rather than returning garbage predictions or timing out indefinitely. The circuit breaker pattern from distributed systems engineering applies directly: when the error rate on a model endpoint exceeds a threshold (typically 5-10% over a 30-second sliding window), the circuit opens and subsequent requests are immediately returned a predefined fallback response — typically a default prediction, a "service unavailable" HTTP 503 response, or a rule-based fallback. After a configurable timeout (e.g., 60 seconds), the circuit enters "half-open" state and allows a small number of probe requests through to test recovery. If the probe requests succeed, the circuit closes and full traffic resumes. If they fail, the timeout resets. This pattern prevents a struggling model server from cascading its failures to downstream systems and gives the serving team time to diagnose and resolve the issue without user-facing impact.
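A minimal sketch of that state machine; the thresholds and the "retain" fallback prediction are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, error_threshold=0.10, window_s=30.0, reset_timeout_s=60.0):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.reset_timeout_s = reset_timeout_s
        self.events = []            # (timestamp, succeeded) within the sliding window
        self.state = "closed"
        self.opened_at = None

    def _error_rate(self, now):
        self.events = [(t, ok) for t, ok in self.events if now - t <= self.window_s]
        if not self.events:
            return 0.0
        return sum(1 for _, ok in self.events if not ok) / len(self.events)

    def call(self, fn, fallback):
        now = time.monotonic()
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout_s:
                return fallback      # fail fast: no call reaches the model server
            self.state = "half-open"  # timeout elapsed: allow a probe request
        try:
            result = fn()
            self.events.append((now, True))
            if self.state == "half-open":
                self.state = "closed"  # probe succeeded: resume full traffic
            return result
        except Exception:
            self.events.append((now, False))
            if self.state == "half-open" or self._error_rate(now) > self.error_threshold:
                self.state = "open"
                self.opened_at = now
            return fallback

breaker = CircuitBreaker()
def failing_model():
    raise RuntimeError("model server down")

# Once the error rate exceeds the threshold, callers get the fallback instantly
outputs = [breaker.call(failing_model, fallback="retain") for _ in range(20)]
print(breaker.state, outputs.count("retain"))
```

Note that after the circuit opens, the failing model function is never invoked again until the reset timeout: that is what protects downstream systems from cascading failure.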

Production Pattern

Multi-Tier Model Serving Architecture

A production-grade ML serving architecture for a high-traffic recommendation or classification system typically consists of three tiers:

  • Tier 1 — Edge cache / CDN: Cache deterministic predictions (same user + same context = same output) at the CDN edge. Cache hit rates of 20-40% are common for recommendation systems, providing sub-millisecond responses for cached results and dramatically reducing model server load.
  • Tier 2 — Application server + feature retrieval: Fetches user and item features from the online feature store (Redis/DynamoDB) and routes to the model server. Adds circuit breaker logic and fallback responses. Handles authentication, rate limiting, and request logging.
  • Tier 3 — Model server cluster: Horizontally scaled FastAPI or Triton instances behind a load balancer. Auto-scales based on request queue depth and GPU utilisation. All instances serve the model version tagged "Production" in the MLflow registry. Rolling deployments for zero-downtime updates.

Observability across all three tiers is essential: distributed tracing (using OpenTelemetry + Jaeger) ensures that a slow prediction can be traced back to whether the bottleneck was feature retrieval, model inference, or serialisation. Logging at each tier boundary captures the latency attribution for each stage, enabling targeted optimisation.

Model Performance Budgeting

Production ML systems must budget latency the same way financial systems budget cost. The user experience budget (e.g., "the search results page must load in under 200ms") is decomposed into a latency budget for each component: 10ms for the load balancer, 20ms for feature retrieval from the online store, 50ms for model inference, 10ms for result serialisation, 30ms for network round-trip, leaving 80ms of margin. When model inference takes 75ms instead of 50ms due to increased input complexity, its component budget is violated and the margin shrinks, so one of two things must happen: the model must be optimised (quantization, pruning, ONNX conversion), or the budget for another component must be reallocated. This explicit budgeting discipline prevents the common failure mode where ML models are added to latency-sensitive serving paths without accounting for their inference cost until they cause user-visible regressions in production.
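The per-component budget above can be checked mechanically. Here the measured inference latency of 75ms (from the scenario in the text) is flagged against its 50ms budget; the other measured values are illustrative.

```python
# Latency budget check: compare measured per-component latency against budget.
USER_BUDGET_MS = 200

budget_ms = {"load_balancer": 10, "feature_retrieval": 20, "model_inference": 50,
             "serialisation": 10, "network_round_trip": 30}
measured_ms = {"load_balancer": 9, "feature_retrieval": 18, "model_inference": 75,
               "serialisation": 8, "network_round_trip": 28}

violations = {k: (measured_ms[k], budget_ms[k])
              for k in budget_ms if measured_ms[k] > budget_ms[k]}
print(f"total measured: {sum(measured_ms.values())} ms of {USER_BUDGET_MS} ms")
print(f"component violations (measured, budget): {violations}")
```

A check like this belongs in the monitoring dashboard alongside the drift metrics, so a budget violation surfaces before the total user-facing latency does.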

Exercises

Beginner

Exercise 1: MLflow Experiment Comparison

Set up MLflow locally with mlflow server --host 0.0.0.0 --port 5000. Train three versions of a classifier on the same dataset, varying hyperparameters. Log all parameters, training metrics, and validation metrics to MLflow.

  • Use the MLflow UI (localhost:5000) to compare runs. Which configuration achieves the best validation AUC?
  • How do the training vs. validation AUC curves indicate overfitting across configurations?
  • Register the best run's model to the MLflow Model Registry and set its stage to "Production".
  • Write a script that programmatically loads the "Production" model and makes predictions on a held-out test set.
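The last step can be sketched as follows; the registry name exercise1-classifier and the test-set path are placeholders to replace with your own, and the tracking server is the one started in the exercise:

```python
import pandas as pd

def production_uri(model_name: str) -> str:
    """Registry URI that resolves to the model currently staged as Production."""
    return f"models:/{model_name}/Production"

def predict_with_production_model(test_path: str = "data/test.parquet"):
    """Load the stage-tagged model from the registry and score a held-out set."""
    import mlflow  # tracking server from the exercise, assumed at localhost:5000
    mlflow.set_tracking_uri("http://localhost:5000")
    model = mlflow.pyfunc.load_model(production_uri("exercise1-classifier"))
    return model.predict(pd.read_parquet(test_path))
```

Because the script resolves the stage rather than a fixed version number, promoting a new model in the registry changes what this script serves without any code change.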
Intermediate

Exercise 2: FastAPI Model Server with Load Testing

Build a FastAPI model serving endpoint using the model from Exercise 1. Include a /predict POST endpoint, a /health GET endpoint, and input validation using Pydantic.

  • Add latency logging that records the inference time in milliseconds for every request.
  • Add a request counter metric exposed at /metrics in Prometheus format.
  • Test with 100 concurrent requests using locust or httpx async client. What is the p50, p95, and p99 latency?
  • Identify the bottleneck: is it the model inference, the input validation, or the serialisation step?
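Once per-request latencies are being logged, the percentile question can be answered with the standard library; the synthetic samples below (a fast majority plus a slow tail) are purely illustrative:

```python
import random
import statistics

def latency_percentiles(samples_ms: list) -> dict:
    """p50/p95/p99 from a list of per-request latencies in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# Synthetic latencies: 95% fast requests, 5% slow tail
random.seed(0)
samples = [random.gauss(40, 5) for _ in range(950)] + \
          [random.gauss(150, 20) for _ in range(50)]
print(latency_percentiles(samples))
```

Note how the p99 is driven almost entirely by the tail population, which is why averaging latencies hides exactly the behaviour the exercise asks you to find.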
Advanced

Exercise 3: Full MLOps Pipeline with Drift Detection

Design and implement a complete MLOps pipeline for a tabular classification task covering all five stages: data validation, feature pipeline, training, evaluation gate, and deployment.

  • Implement data validation using Great Expectations with at least five expectations (non-null rate, value range, uniqueness, distribution test, schema check).
  • Set an evaluation gate that fails the pipeline if validation AUC is below a threshold OR if the candidate model is worse than the current production model by more than 1% AUC.
  • Simulate data drift by gradually shifting one feature's distribution. Implement PSI monitoring using Evidently or a custom implementation. At what PSI value does model performance begin to degrade?
  • Implement automated retraining triggered when PSI exceeds 0.25 on any monitored feature.
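A custom PSI implementation can be as small as the sketch below, using quantile bins derived from the reference sample; the sample sizes and the 1-sigma shift are illustrative:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `current` against `reference`,
    binned by quantiles of the reference sample."""
    cuts = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.searchsorted(cuts, reference), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.searchsorted(cuts, current), minlength=bins) / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)   # guard against log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 10_000)
print(psi(reference, rng.normal(0, 1, 10_000)))  # no shift: near 0
print(psi(reference, rng.normal(1, 1, 10_000)))  # 1-sigma shift: far above the 0.25 trigger
```

Running this monitor per feature per day, and firing the retraining job when any feature crosses 0.25, implements the final bullet of the exercise.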

MLOps Pipeline Design Generator

Document your ML system's production pipeline design. Download as Word, Excel, PDF, or PowerPoint for engineering review and stakeholder alignment.


Data Lineage, Dataset Versioning, and Feature Stores

Reproducibility in ML depends on three interlocked components: code versioning (git), model versioning (MLflow Model Registry), and data versioning. Without data versioning, the same training code will produce different models depending on when it is run — because the underlying tables, feature pipelines, and label generation procedures change over time. Data lineage tracking makes explicit the full provenance chain: from raw data source through transformation steps to the training dataset used for each registered model version.

DVC: Data Version Control

DVC (Data Version Control) treats data artifacts the same way git treats source files: each dataset and model file gets a content-addressed hash, stored in a .dvc file that git tracks. The actual data lives in a remote store (S3, GCS, Azure Blob, or HDFS). This means the git repository remains lightweight while every model version is linked to the exact dataset hash that produced it — making any training run fully reproducible by checking out the code commit and running dvc pull.

Bash — DVC Dataset Versioning and Pipeline Definition
# Initialise DVC in an existing git repo
git init
dvc init
git add .dvc/
git commit -m "initialise dvc"

# Track a large dataset file — DVC stores hash; git stores .dvc pointer
dvc add data/raw/transactions.parquet
git add data/raw/transactions.parquet.dvc data/raw/.gitignore
git commit -m "add raw transactions dataset v1.0"

# Configure S3 remote for data storage
dvc remote add -d s3remote s3://mycompany-dvc-store/mlops-demo
dvc remote modify s3remote region us-east-1
dvc push   # upload to S3

# Define a reproducible pipeline (dvc.yaml)
cat > dvc.yaml << 'EOF'
stages:
  preprocess:
    cmd: python src/preprocess.py --input data/raw/transactions.parquet --output data/processed/features.parquet
    deps:
      - src/preprocess.py
      - data/raw/transactions.parquet
    outs:
      - data/processed/features.parquet
    params:
      - params.yaml:
          - preprocess.lookback_days
          - preprocess.feature_version

  train:
    cmd: python src/train.py --features data/processed/features.parquet --model models/classifier.pkl
    deps:
      - src/train.py
      - data/processed/features.parquet
    outs:
      - models/classifier.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
    params:
      - params.yaml:
          - train.n_estimators
          - train.max_depth
          - train.random_seed

  evaluate:
    cmd: python src/evaluate.py --model models/classifier.pkl --test data/processed/test.parquet --output metrics/eval_metrics.json
    deps:
      - src/evaluate.py
      - models/classifier.pkl
      - data/processed/test.parquet
    metrics:
      - metrics/eval_metrics.json:
          cache: false
EOF

# Run the full pipeline — only re-runs stages whose deps have changed
dvc repro

# Compare metrics across experiments
dvc metrics show
dvc metrics diff HEAD~1 HEAD

# Create an experiment branch (DVC Experiments)
dvc exp run --set-param train.n_estimators=200 --name exp-200trees
dvc exp run --set-param train.n_estimators=500 --name exp-500trees
dvc exp show --sort-by eval_auc   # tabular comparison
dvc exp apply exp-500trees        # promote best experiment to workspace

Feature Store Architecture

Feature stores solve the training-serving skew problem by providing a single source of truth for feature values: the same feature computation logic runs both offline (to produce training data) and online (to serve real-time inference). A production feature store has two main components:

  • Offline store: A batch-oriented store (BigQuery, Snowflake, Hive, Delta Lake) that holds historical feature values for training and batch scoring. Features are computed by scheduled batch jobs (Spark, dbt, Airflow) and written with point-in-time correctness: each training example records the feature values that were observable at the time of the event, preventing label leakage from future data.
  • Online store: A low-latency key-value store (Redis, DynamoDB, Bigtable, Cassandra) that holds the most recent feature values for real-time inference. Features are kept in sync with the offline store via a streaming pipeline (Kafka, Kinesis) or a periodic materialisation job. Read latency must stay under 5–10ms so that feature retrieval does not dominate end-to-end serving latency.
| Feature Store Component | Open Source Option | Managed Cloud Option | Primary Use | Latency Target |
| --- | --- | --- | --- | --- |
| Offline store | Feast + Spark + Delta Lake | SageMaker Feature Store (offline), Vertex Feature Store | Training dataset generation, batch scoring | Minutes to hours (batch) |
| Online store | Feast + Redis | SageMaker Feature Store (online), Vertex Feature Store | Real-time inference feature retrieval | <10ms p99 |
| Feature registry | Feast feature definitions (YAML) | Tecton, Hopsworks | Schema validation, discoverability, lineage | N/A (metadata) |
| Streaming features | Kafka + Faust/Flink + Redis | Kinesis + Lambda + DynamoDB | Near-real-time aggregations (last 5min, last 1hr) | <1min end-to-end freshness |
| Point-in-time joins | Feast historical retrieval, Hopsworks | Vertex Feature Store time-travel | Leak-free training data generation | Minutes (offline) |
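Point-in-time correctness is easiest to see in miniature. The sketch below uses pandas merge_asof rather than a feature-store API; the tables, column names, and dates are hypothetical:

```python
import pandas as pd

# Label events and periodic feature snapshots (hypothetical data)
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 0],
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
    "txn_count_30d": [3, 7, 2],
})

# For each event, take the latest feature row at or before event_time —
# never a future snapshot, so labels cannot leak into features
training = pd.merge_asof(
    events.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",
)
print(training[["user_id", "event_time", "txn_count_30d", "label"]])
```

The user-1 event on 2024-01-20 picks up the 2024-01-15 snapshot (value 7), not the earlier one; a naive join on user_id alone would have no such guarantee.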

Kubernetes Model Serving: Containers, Canary Deployments, and Autoscaling

For production ML serving at scale, containerised deployment on Kubernetes provides the resource management, traffic management, and operational tooling needed to run models reliably. A standard production serving pattern combines: (1) a FastAPI serving container built with a minimal base image; (2) Kubernetes Deployments and Services for lifecycle management; (3) an Ingress or Istio VirtualService for traffic routing; and (4) Horizontal Pod Autoscaler (HPA) for demand-responsive scaling.

Bash — Kubernetes Model Serving: Dockerfile, Deployment, HPA, Canary
# ── Dockerfile for the FastAPI model server ───────────────────────────────
cat > Dockerfile << 'EOF'
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/serve.py .
COPY models/classifier.pkl models/

# Non-root user for security
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

EXPOSE 8080
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "2"]
EOF

# Build and push to container registry
docker build -t myregistry.io/fraud-model:v2.3.1 .
docker push myregistry.io/fraud-model:v2.3.1

# ── Kubernetes Deployment manifest ────────────────────────────────────────
cat > k8s/deployment-v2.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-v2
  labels:
    app: fraud-model
    version: v2.3.1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-model
      version: v2.3.1
  template:
    metadata:
      labels:
        app: fraud-model
        version: v2.3.1
    spec:
      containers:
      - name: model-server
        image: myregistry.io/fraud-model:v2.3.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2000m"
            memory: "2Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        env:
        - name: MODEL_VERSION
          value: "v2.3.1"
        - name: LOG_LEVEL
          value: "INFO"
EOF

kubectl apply -f k8s/deployment-v2.yaml

# ── Horizontal Pod Autoscaler ──────────────────────────────────────────────
kubectl autoscale deployment fraud-model-v2 \
  --cpu-percent=70 \
  --min=3 \
  --max=20

# Custom metric HPA (requests per second via Prometheus adapter)
cat > k8s/hpa-custom.yaml << 'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model-v2
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
EOF

# ── Canary deployment: 10% traffic to v2, 90% to v1 ─────────────────────
# Istio VirtualService for traffic splitting
cat > k8s/virtual-service.yaml << 'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-model
spec:
  hosts:
  - fraud-model-svc
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: fraud-model-svc
        subset: v2
  - route:
    - destination:
        host: fraud-model-svc
        subset: v1
      weight: 90
    - destination:
        host: fraud-model-svc
        subset: v2
      weight: 10
EOF

kubectl apply -f k8s/virtual-service.yaml

# Monitor canary metrics for 24h, then promote or rollback
CANARY_ERROR_RATE=$(kubectl exec -n monitoring \
  deployment/prometheus -- \
  promtool query instant http://localhost:9090 \
  'rate(http_requests_total{version="v2.3.1",status=~"5.."}[5m]) / rate(http_requests_total{version="v2.3.1"}[5m])' \
  | awk '{print $3}')   # promtool prints '{labels} => value @[ts]'; keep the value

if (( $(echo "$CANARY_ERROR_RATE > 0.01" | bc -l) )); then
    echo "Canary error rate ${CANARY_ERROR_RATE} exceeds 1% — rolling back"
    kubectl patch virtualservice fraud-model --type=json \
      -p='[{"op":"replace","path":"/spec/http/1/route/1/weight","value":0},{"op":"replace","path":"/spec/http/1/route/0/weight","value":100}]'
else
    echo "Canary healthy — promoting to 100%"
    kubectl set image deployment/fraud-model-v1 model-server=myregistry.io/fraud-model:v2.3.1
    kubectl delete deployment fraud-model-v2
fi
Model Serving SLOs: Define Service Level Objectives before deploying. Typical ML serving SLOs: p50 latency <50ms, p99 latency <200ms, availability 99.9% (8.7h downtime/year), error rate <0.1%. Canary deployments should run for a minimum of 24 hours (to cover daily traffic patterns) before promotion, with automated rollback triggered if any SLO is violated during the canary window.
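The automated-rollback decision at the end of the canary window reduces to a comparison against the SLO table; a minimal sketch using the thresholds above (the observed metric values are illustrative):

```python
# SLO thresholds from the callout above (error rate <0.1% = 0.001)
SLOS = {"p50_ms": 50, "p99_ms": 200, "error_rate": 0.001}

def slo_violations(observed: dict) -> list:
    """Names of the SLOs that the observed canary metrics violate."""
    return [name for name, limit in SLOS.items() if observed[name] > limit]

healthy = {"p50_ms": 32, "p99_ms": 140, "error_rate": 0.0004}
degraded = {"p50_ms": 32, "p99_ms": 260, "error_rate": 0.0004}
print(slo_violations(healthy))    # [] -> promote
print(slo_violations(degraded))   # ['p99_ms'] -> roll back
```

In practice the observed values come from the same Prometheus queries used in the canary script, evaluated over the full 24-hour window rather than a single 5-minute rate.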

Conclusion & Next Steps

MLOps is not a single tool or platform — it is a set of engineering disciplines applied to the unique challenges of ML systems: non-deterministic training, delayed feedback loops, data dependencies, and the need to maintain both predictive accuracy and fairness as the world changes. The five pillars are: reproducibility (experiment tracking and dataset versioning); consistency (feature stores eliminating training-serving skew); automation (CI/CD pipelines replacing manual model handoffs); reliability (serving infrastructure with health checks, validation, and graceful degradation); and observability (drift detection and fairness monitoring that surface problems before they reach customers).

The maturity of your MLOps practice directly determines how quickly you can iterate on model improvements and how quickly you can respond when models degrade. Organisations at MLOps Level 0 — manual, script-driven processes — can take weeks to get an improved model into production and may not notice degradation for months. Organisations at Level 2 — fully automated pipelines with automated evaluation gates and continuous monitoring — can release a new model version within hours of identifying a performance regression and automatically trigger retraining when drift is detected.

The fairness monitoring dimension connects directly back to Part 19: all demographic slice metrics identified in the Ethics Impact Assessment should be tracked in the production monitoring dashboard with the same alert thresholds as overall accuracy metrics. Model improvements should be evaluated not just on aggregate AUC but on the fairness Pareto frontier — ensuring that performance gains for the majority do not come at the cost of minority group performance.

Next in the Series

In Part 21: Edge AI & On-Device Intelligence, we cover model compression (quantization, pruning, distillation), hardware accelerators for edge inference, TFLite, CoreML, ONNX Runtime, and the engineering discipline of deploying AI to resource-constrained devices while maintaining acceptable accuracy.
