
AI Infrastructure Control & Data Planes

May 15, 2026 Wasil Zafar 24 min read

"AI infrastructure is the newest and most demanding domain where control/data plane separation proves essential — orchestrating GPU clusters, managing model lifecycles, and routing inference requests are fundamentally different problems than executing tensor operations at wire speed."

Table of Contents

  1. AI Control Plane
  2. AI Data Plane
  3. GPU Scheduling as Control Plane
  4. Model Serving as Data Plane
  5. Training vs Inference Planes
  6. MLOps Pipeline Orchestration
  7. AI Gateway Pattern
  8. Vector Database Planes
  9. Key Takeaway

AI Control Plane

The AI control plane encompasses everything that decides what should happen in your AI infrastructure — without directly executing inference or training. It manages model lifecycles, schedules GPU resources, orchestrates workflows, and routes requests to appropriate model endpoints.

Core Insight: The AI control plane answers "which model, on which GPU, with what configuration, serving which traffic?" — decisions that determine how the data plane executes. Just as a network control plane computes routing tables without forwarding packets, the AI control plane orchestrates without computing tensors.

Key components of the AI control plane:

  • Model Registry — versioned storage of trained models with metadata (MLflow Model Registry, Weights & Biases, SageMaker Model Registry)
  • GPU Cluster Scheduler — decides which workloads run on which GPUs (the Kubernetes scheduler, with the NVIDIA device plugin and GPU Operator exposing GPUs as schedulable resources; Slurm)
  • Workflow Orchestration — coordinates multi-step ML pipelines (Kubeflow Pipelines, Airflow, Prefect)
  • Experiment Tracking — records hyperparameters, metrics, and artifacts (MLflow Tracking, Neptune, Comet)
  • Model Versioning — tracks which model version is deployed where (shadow, canary, production)
  • Deployment Routing — decides traffic split between model versions (A/B testing, canary, blue-green)
AI Infrastructure — Control Plane vs Data Plane Split
flowchart TB
    subgraph CP["AI Control Plane"]
        MR["Model Registry\n(Versioning)"]
        SCHED["GPU Scheduler\n(Resource Allocation)"]
        ORCH["Workflow Orchestrator\n(Pipeline Coordination)"]
        ROUTE["Deployment Router\n(Traffic Splitting)"]
        EXP["Experiment Tracker\n(Metrics & Params)"]
    end
    subgraph DP["AI Data Plane"]
        INF["Inference Servers\n(vLLM, Triton)"]
        TRAIN["Training Workers\n(PyTorch, JAX)"]
        EMB["Embedding Engines\n(Vector Generation)"]
        BATCH["Batch Processors\n(Offline Scoring)"]
    end
    MR -->|"Deploy model v2.1"| ROUTE
    SCHED -->|"Assign GPU 0,1 to job"| TRAIN
    SCHED -->|"Assign GPU 2,3 to serving"| INF
    ORCH -->|"Trigger training run"| TRAIN
    ROUTE -->|"Route 90% to v2.0, 10% to v2.1"| INF
    EXP -.->|"Log metrics"| TRAIN
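
The "Route 90% to v2.0, 10% to v2.1" edge in the diagram can be sketched as a small weighted router. This is a minimal illustration, not the routing logic of any particular product: the function name `pick_version` and the hash-based bucketing are assumptions chosen so that the same request ID always lands on the same version while the weights stay unchanged.

```python
import hashlib

def pick_version(request_id: str, weights: dict[str, int]) -> str:
    """Deterministically map a request ID to a model version by weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the ID onto [0, total) and walk the cumulative weights.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for version, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return version
    raise AssertionError("unreachable: bucket is always below total")

# Control plane decision: the split itself.
split = {"v2.0": 90, "v2.1": 10}

# Data plane effect: how requests are actually distributed.
counts = {version: 0 for version in split}
for i in range(10_000):
    counts[pick_version(f"req-{i}", split)] += 1
print(counts)
```

Changing the rollout means editing `split` only; the code that serves requests is untouched, which is the separation the diagram describes.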
                            

AI Data Plane

The AI data plane is where actual computation happens — tensor operations execute on GPUs, models process inputs and produce outputs, embeddings are generated, and batch jobs crunch through datasets. This is the "hot path" where latency matters and throughput is measured.

Key components of the AI data plane:

  • Inference Execution — running forward passes through neural networks at serving time
  • Tensor Computation — matrix multiplications, attention mechanisms, activation functions on GPU
  • Embedding Generation — converting text/images into vector representations
  • Batch Processing — offline scoring of large datasets through models
  • Model Serving — handling concurrent inference requests with batching and queuing
  • KV Cache Management — storing and retrieving key-value pairs for autoregressive generation
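Data Plane Performance: AI data plane operations are measured in tokens/second, time-to-first-token (TTFT), inter-token latency (ITL), and GPU utilization percentage. As a rough target rather than a universal benchmark, a well-optimized AI data plane sustains >80% GPU utilization under load with p99 inference latencies under 100ms; the right thresholds depend on model size, hardware, and the SLA being served.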
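
To make these metrics concrete, the sketch below derives TTFT, mean ITL, and tokens/second from per-token timestamps. The `RequestTrace` type and the sample timestamps are hypothetical; a real serving stack would collect these values from its own instrumentation.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival: float             # seconds: when the request was received
    token_times: list[float]   # seconds: when each output token was emitted

def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: delay between arrival and the first token."""
    return trace.token_times[0] - trace.arrival

def mean_itl(trace: RequestTrace) -> float:
    """Mean inter-token latency: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

def tokens_per_second(trace: RequestTrace) -> float:
    """Output tokens divided by the end-to-end request duration."""
    elapsed = trace.token_times[-1] - trace.arrival
    return len(trace.token_times) / elapsed

# Illustrative trace: first token after 250 ms, then one token every 62.5 ms.
trace = RequestTrace(arrival=0.0, token_times=[0.25, 0.3125, 0.375, 0.4375, 0.5])
print(ttft(trace), mean_itl(trace), tokens_per_second(trace))
```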

GPU Scheduling as Control Plane

GPU scheduling is a pure control plane function — it decides which workloads get which GPU resources, when, and with what constraints. The scheduler never executes a CUDA kernel; it allocates the resources that enable execution.

GPU Scheduling Decision Flow
flowchart TD
                A["New Workload Request"] --> B{"Workload Type?"}
                B -->|"Training"| C["Check GPU Memory\nRequirements"]
                B -->|"Inference"| D["Check Latency SLA"]
                B -->|"Batch"| E["Check Queue Priority"]
                C --> F{"Multi-GPU\nNeeded?"}
                F -->|"Yes"| G["Find Node with\nNVLink Topology"]
                F -->|"No"| H["Find Available\nSingle GPU"]
                D --> I{"MIG Partition\nSufficient?"}
                I -->|"Yes"| J["Assign MIG Slice"]
                I -->|"No"| K["Assign Full GPU"]
                E --> L["Queue Until\nResources Free"]
                G --> M["Bind GPUs to Pod"]
                H --> M
                J --> M
                K --> M
                L --> M
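
The decision flow above can be sketched as a pure function. This is only an illustration of the branching in the diagram: the `Workload` fields and the returned labels are hypothetical, and a real scheduler would also check actual capacity before binding GPUs to a pod.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    kind: str                     # "training", "inference", or "batch"
    gpus_needed: int = 1
    fits_mig_slice: bool = False  # would a MIG partition satisfy the latency SLA?

def placement_decision(w: Workload) -> str:
    """Return the placement branch the flowchart would take for this workload."""
    if w.kind == "training":
        # Multi-GPU jobs want NVLink-connected GPUs on one node.
        return "nvlink-node" if w.gpus_needed > 1 else "single-gpu"
    if w.kind == "inference":
        return "mig-slice" if w.fits_mig_slice else "full-gpu"
    if w.kind == "batch":
        return "queue-until-free"
    raise ValueError(f"unknown workload type: {w.kind!r}")

print(placement_decision(Workload(kind="training", gpus_needed=4)))
print(placement_decision(Workload(kind="inference", fits_mig_slice=True)))
```

Note that the function never touches a GPU. It only produces a decision that something else carries out.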
                            
# NVIDIA GPU Operator — scheduling configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-config
  namespace: gpu-operator
data:
  # MIG (Multi-Instance GPU) partitioning strategy
  mig-config: |
    version: v1
    mig-configs:
      # A100 80GB split for mixed workloads
      a100-mixed-workload:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1    # Training: half GPU
            "2g.20gb": 1    # Large inference
            "1g.10gb": 2    # Small inference models
        - devices: [1]
          mig-enabled: false  # Full GPU for large training

---
# Time-slicing policy for inference workloads
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-time-slicing
spec:
  devicePlugin:
    config:
      name: time-slicing-config
      default: inference-sharing
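---
# Priority-based GPU queuing. ClusterPolicy does not carry workload priorities;
# these are standard Kubernetes PriorityClass objects that pods reference via
# priorityClassName, which lets the scheduler preempt lower-priority pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-inference
value: 100
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training
value: 50
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-scoring
value: 10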
Architecture Pattern
MIG Partitioning as Control Plane Decision

NVIDIA's Multi-Instance GPU (MIG) technology is a physical manifestation of control plane decisions. The control plane decides how to partition A100/H100 GPUs into isolated instances, each with dedicated compute, memory, and cache. This decision is made at configuration time (and can be revised later, typically once the GPU has been drained of workloads), and the data plane then executes within these boundaries without knowing about the partitioning decisions above it.
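
As a rough illustration of the kind of check a control plane might run before applying a MIG layout, the sketch below verifies that a layout stays within an A100 80GB's slice budget (7 compute slices, 8 memory slices). The function name is hypothetical, and the check is necessary but not sufficient: real MIG placement has additional position constraints that this sketch ignores.

```python
# (compute slices, memory slices) consumed by each A100 80GB MIG profile.
PROFILE_COST = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8

def fits_slice_budget(layout: dict[str, int]) -> bool:
    """Check a {profile: count} layout against the GPU's total slice budget."""
    compute = sum(PROFILE_COST[profile][0] * n for profile, n in layout.items())
    memory = sum(PROFILE_COST[profile][1] * n for profile, n in layout.items())
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES

# The mixed-workload layout from the configuration above: 7 compute, 8 memory.
mixed = {"3g.40gb": 1, "2g.20gb": 1, "1g.10gb": 2}
print(fits_slice_budget(mixed))

# Two 3g.40gb instances already use all 8 memory slices, so adding a 1g.10gb fails.
print(fits_slice_budget({"3g.40gb": 2, "1g.10gb": 1}))
```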


Model Serving as Data Plane

Model serving is the quintessential AI data plane — it receives inference requests and produces predictions. The serving infrastructure's job is pure execution: load model weights into GPU memory, batch incoming requests, execute forward passes, and return results with minimal latency.

"""
vLLM Continuous Batching — AI Data Plane Configuration
Demonstrates the data plane concern: maximizing inference throughput
while meeting latency SLAs through continuous batching and PagedAttention.
"""
from vllm import LLM, SamplingParams

# Data plane configuration — optimized for throughput
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,          # Shard across 4 GPUs
    gpu_memory_utilization=0.90,     # Use 90% of GPU memory
    max_model_len=8192,              # Maximum sequence length
    enable_prefix_caching=True,      # Cache common prefixes
    # Continuous batching config
    max_num_batched_tokens=32768,    # Max tokens per batch
    max_num_seqs=256,                # Max concurrent sequences
    # PagedAttention KV cache
    block_size=16,                   # KV cache block size
    swap_space=4,                    # GB of CPU swap for KV cache
)

# Sampling parameters (per-request data plane config)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.1,
)

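# Execute inference: pure data plane operation
import time  # stdlib; used to time the whole batch

prompts = [
    "Explain control plane vs data plane in 3 sentences.",
    "What is the role of a GPU scheduler?",
]
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text[:100]}...")
    print("---")

# Aggregate throughput for the batch. output.metrics.finished_time is an
# absolute timestamp, not a duration, so wall-clock time is the divisor here.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Tokens/s (batch): {generated / elapsed:.1f}")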
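
The KV cache settings above matter because the cache grows with every token. The arithmetic below estimates its size; the model dimensions (80 layers, 8 KV heads, head dimension 128, 2-byte values) are assumptions taken from the commonly published Llama 3.1 70B configuration and should be verified against the model's own config file.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The factor of 2 covers one key vector and one value vector
    per layer per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Llama 3.1 70B dimensions with 16-bit cache values.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8,
                                     head_dim=128, bytes_per_value=2)
per_sequence = per_token * 8192      # one sequence at max_model_len=8192
worst_case = per_sequence * 256      # max_num_seqs=256, all at full length

GIB = 1024 ** 3
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence / GIB:.1f} GiB per full-length sequence")
print(f"{worst_case / GIB:.0f} GiB if every sequence were full length")
```

Under these assumptions the worst case is far larger than the memory of four GPUs, which suggests why a paged KV cache that allocates blocks on demand is preferable to reserving the maximum length for every sequence up front.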
Model Serving Pipeline — Data Plane Hot Path
flowchart LR
    REQ["Inference\nRequest"] --> TOK["Tokenizer\n(CPU)"]
    TOK --> BATCH["Continuous\nBatcher"]
    BATCH --> PREFILL["Prefill Phase\n(GPU Compute)"]
    PREFILL --> DECODE["Decode Phase\n(Autoregressive)"]
    DECODE --> KV["KV Cache\n(GPU Memory)"]
    KV --> DECODE
    DECODE --> DETOK["Detokenizer\n(CPU)"]
    DETOK --> RESP["Response\nStream"]
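
The "Continuous Batcher" stage can be illustrated with a toy simulation. This is a simplified sketch, not how vLLM implements scheduling: it ignores prefill, token budgets, and KV cache limits, and the function name and request sizes are hypothetical.

```python
from collections import deque

def continuous_batching(requests: dict[str, int], max_num_seqs: int) -> list[list[str]]:
    """Simulate decode steps; `requests` maps request ID to tokens to generate.

    Returns, for each step, the request IDs that were decoded in that step.
    """
    if any(tokens < 1 for tokens in requests.values()):
        raise ValueError("every request must generate at least one token")
    waiting = deque(requests.items())
    running: dict[str, int] = {}   # request ID -> tokens still to generate
    schedule: list[list[str]] = []
    while waiting or running:
        # Admit waiting requests as soon as a slot is free (the "continuous" part).
        while waiting and len(running) < max_num_seqs:
            req_id, tokens = waiting.popleft()
            running[req_id] = tokens
        schedule.append(list(running))
        # One decode step yields one token for every running sequence.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]   # finished sequences leave immediately
    return schedule

steps = continuous_batching({"a": 3, "b": 1, "c": 2}, max_num_seqs=2)
for i, batch in enumerate(steps, start=1):
    print(f"step {i}: {batch}")
```

Here request `c` takes over the slot that `b` frees after the first step, so all three requests finish in three steps. A static batcher that waited for the whole first batch to finish would need five.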
                            

Training vs Inference Planes

Training and inference represent two fundamentally different data plane profiles, each with distinct resource patterns, failure modes, and optimization strategies:

Comparison
Training Data Plane vs Inference Data Plane
| Dimension | Training | Inference |
|---|---|---|
| Batch Size | Large (thousands) | Small (1-64 dynamic) |
| Latency Tolerance | Hours/days acceptable | Milliseconds matter |
| GPU Utilization | Sustained 95%+ | Bursty 30-80% |
| Memory Pattern | Gradients + activations | KV cache + weights |
| Failure Recovery | Checkpoint + resume | Retry + failover |
| Scaling | Data/model parallelism | Replica autoscaling |
| Network | All-reduce collective | Request/response |
Critical Insight: Training and inference should NOT share the same control plane scheduling policies. Training workloads need sustained multi-GPU allocations with fault tolerance (checkpointing). Inference workloads need rapid scaling, load balancing, and preemption capability. Treating them identically leads to either wasted resources or SLA violations.
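
One way to keep the two policies separate is to declare them as distinct records that the scheduler looks up by workload class. The sketch below is illustrative only: the `SchedulingPolicy` fields are hypothetical, and the priority values are borrowed from the scheduling configuration earlier in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchedulingPolicy:
    priority: int             # higher values may preempt lower ones
    gang_schedule: bool       # all requested GPUs must be granted together
    autoscale_replicas: bool  # replica count follows request load
    recovery: str             # what happens after a failure or eviction

POLICIES = {
    "training": SchedulingPolicy(priority=50, gang_schedule=True,
                                 autoscale_replicas=False,
                                 recovery="checkpoint-resume"),
    "inference": SchedulingPolicy(priority=100, gang_schedule=False,
                                  autoscale_replicas=True,
                                  recovery="retry-failover"),
}

def may_preempt(incoming: str, running: str) -> bool:
    """True if the incoming workload class outranks the running one."""
    return POLICIES[incoming].priority > POLICIES[running].priority

print(may_preempt("inference", "training"))
print(may_preempt("training", "inference"))
```

Under this sketch, a preempted training job survives through its checkpoint, while inference never waits behind a long-running training allocation.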

MLOps Pipeline Orchestration

MLOps tools like Kubeflow and MLflow operate as control planes that orchestrate the training data plane. They define what runs, when, and with what parameters — but they never execute a gradient computation themselves.

MLOps Orchestration — Control Plane Coordinating Data Plane
flowchart TB
    subgraph CTRL["Control Plane (Kubeflow/MLflow)"]
        PIPE["Pipeline Definition\n(DAG of steps)"]
        PARAM["Hyperparameter\nSearch Space"]
        SCHED2["Step Scheduler"]
        MON["Metric Monitor\n(Early Stopping)"]
    end
    subgraph DATA["Data Plane (GPU Workers)"]
        PREP["Data Preprocessing\n(CPU/GPU)"]
        TRAIN2["Model Training\n(Multi-GPU)"]
        EVAL["Model Evaluation\n(Validation Set)"]
        REG["Model Registration\n(Artifact Store)"]
    end
    PIPE --> SCHED2
    SCHED2 -->|"Step 1"| PREP
    SCHED2 -->|"Step 2"| TRAIN2
    PARAM -->|"Trial config"| TRAIN2
    TRAIN2 -->|"Metrics"| MON
    MON -->|"Continue/Stop"| SCHED2
    SCHED2 -->|"Step 3"| EVAL
    SCHED2 -->|"Step 4"| REG
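
The "Metric Monitor (Early Stopping)" box makes a continue/stop decision from metrics the data plane reports. A minimal patience-based version of that decision might look like the following; the function name, the patience rule, and the sample losses are assumptions, not the behavior of Kubeflow or MLflow.

```python
def should_stop(val_losses: list[float], patience: int = 3,
                min_delta: float = 0.0) -> bool:
    """Stop when the last `patience` evaluations did not beat the earlier best.

    An evaluation counts as an improvement only if it is lower than the
    previous best by more than `min_delta`.
    """
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best >= best_before - min_delta

# Validation loss plateaus after the third evaluation.
print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]))
# Loss is still improving, so training continues.
print(should_stop([1.0, 0.9, 0.8, 0.7]))
```

The monitor only reads numbers and emits a decision; the GPU workers that produced those numbers never need to know the stopping rule.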
                            
# Argo Workflow (the engine Kubeflow Pipelines runs on): Control Plane Definition
# Orchestrates the training data plane without executing computation itself
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: llm-finetune-pipeline
spec:
  entrypoint: finetune-dag
  templates:
    - name: finetune-dag
      dag:
        tasks:
          - name: data-preparation
            template: preprocess
            arguments:
              parameters:
                - name: dataset
                  value: "s3://data/train.jsonl"
                - name: max-length
                  value: "4096"

          - name: training
            template: train-model
            dependencies: [data-preparation]
            arguments:
              parameters:
                - name: base-model
                  value: "meta-llama/Llama-3.1-8B"
                - name: learning-rate
                  value: "2e-5"
                - name: epochs
                  value: "3"
                - name: gpu-count
                  value: "4"

          - name: evaluation
            template: evaluate
            dependencies: [training]

          - name: deployment
            template: deploy-model
            dependencies: [evaluation]
            when: "{{tasks.evaluation.outputs.parameters.accuracy}} > 0.85"

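    - name: train-model
      # Data plane execution: actual GPU work
      inputs:
        parameters:            # declared so the DAG task's arguments bind
          - name: base-model
          - name: learning-rate
          - name: epochs
          - name: gpu-count
      container:
        image: training-worker:v2.1
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "128Gi"
        env:
          - name: NCCL_DEBUG
            value: "INFO"
    # The preprocess, evaluate, and deploy-model templates are omitted here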
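
What the orchestrator does with such a definition can be sketched in a few lines: resolve the dependency graph into an order, then hand each step to the data plane. The example below uses Python's standard `graphlib`. The `run_step` callable is a hypothetical stand-in for dispatching work to GPU workers, and the threshold mirrors the `when` condition in the manifest.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Task -> set of tasks it depends on, mirroring the `dependencies` fields.
DEPENDENCIES = {
    "data-preparation": set(),
    "training": {"data-preparation"},
    "evaluation": {"training"},
    "deployment": {"evaluation"},
}

def run_pipeline(run_step, accuracy_threshold: float = 0.85) -> list[str]:
    """Run tasks in dependency order and return the ones that executed.

    `run_step(task)` is a hypothetical callable that sends the task to the
    data plane and returns its output (the accuracy, for "evaluation").
    """
    outputs = {}
    executed = []
    for task in TopologicalSorter(DEPENDENCIES).static_order():
        if task == "deployment" and outputs["evaluation"] <= accuracy_threshold:
            continue  # same gate as the `when` clause: deploy only above 0.85
        outputs[task] = run_step(task)
        executed.append(task)
    return executed

def fake_step(task: str):
    # Stand-in for real work: only evaluation reports a value.
    return 0.91 if task == "evaluation" else None

print(run_pipeline(fake_step))
```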

AI Gateway Pattern

The AI Gateway is a modern architectural pattern that makes the control/data plane split explicit at the API level. It sits between consumers and model endpoints, with clear control plane responsibilities (routing, rate limiting, cost tracking) separate from data plane operations (proxying inference requests).

AI Gateway Control Plane: Routing decisions between model providers (OpenAI, Anthropic, local models), rate limiting per customer/API key, cost tracking and budget enforcement, fallback chains when primary models are unavailable, semantic caching of repeated queries.
AI Gateway Data Plane: Proxying the actual HTTP request to the model endpoint, streaming token responses back to the client, managing connection pools to inference servers, handling request/response transformation (unified API across providers).
# AI Gateway configuration — control plane routing rules
gateway:
  name: production-ai-gateway
  # Control plane: routing decisions
  routes:
    - name: chat-completions
      match:
        path: /v1/chat/completions
      backends:
        - provider: openai
          model: gpt-4o
          weight: 70          # 70% of traffic
          max_tokens_per_min: 100000
          cost_budget_daily: 500.00
        - provider: anthropic
          model: claude-sonnet-4-20250514
          weight: 20          # 20% of traffic
          max_tokens_per_min: 80000
        - provider: local-vllm
          model: llama-3.1-70b
          weight: 10          # 10% (cost optimization)
          endpoint: http://vllm-service:8000

  # Control plane: rate limiting
  rate_limits:
    - key: api_key
      requests_per_minute: 60
      tokens_per_minute: 50000
    - key: organization
      requests_per_minute: 1000
      cost_per_day: 2000.00

  # Control plane: fallback chain
  fallback:
    chain: [openai, anthropic, local-vllm]
    triggers:
      - error_rate > 0.05
      - latency_p99 > 5000ms
      - provider_status != healthy

  # Data plane: connection management
  connections:
    pool_size: 100
    timeout_ms: 30000
    retry_count: 2
    streaming: true
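
The fallback section of this configuration amounts to a small decision function. The sketch below is one possible reading of it, with the same three triggers; the `ProviderHealth` type and function names are hypothetical, and a production gateway would feed in measured health rather than hand-written values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProviderHealth:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    latency_p99_ms: float
    status: str = "healthy"

def is_tripped(h: ProviderHealth) -> bool:
    """True if any fallback trigger from the configuration fires."""
    return h.error_rate > 0.05 or h.latency_p99_ms > 5000 or h.status != "healthy"

def choose_provider(chain: list[str], health: dict[str, ProviderHealth]) -> Optional[str]:
    """Return the first provider in the chain with no trigger firing."""
    for name in chain:
        h = health.get(name)
        if h is not None and not is_tripped(h):
            return name
    return None  # every provider is unhealthy or unknown

chain = ["openai", "anthropic", "local-vllm"]
health = {
    "openai": ProviderHealth(error_rate=0.12, latency_p99_ms=900),      # too many errors
    "anthropic": ProviderHealth(error_rate=0.01, latency_p99_ms=1800),
    "local-vllm": ProviderHealth(error_rate=0.00, latency_p99_ms=400),
}
print(choose_provider(chain, health))
```

The decision is control plane work. Once a provider is chosen, proxying and streaming the response is left to the data plane.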

Vector Database Planes

Vector databases exhibit clear control/data plane separation that mirrors traditional database architecture but with AI-specific concerns around index management and similarity search execution.

Vector Database — Control vs Data Plane
flowchart TB
    subgraph VCP["Control Plane"]
        IDX["Index Management\n(Build HNSW/IVF)"]
        SHARD["Shard Placement\n(Rebalancing)"]
        META["Metadata Schema\n(Collection Config)"]
        REP["Replication\n(Consistency)"]
    end
    subgraph VDP["Data Plane"]
        INS["Vector Insertion\n(Write Path)"]
        SIM["Similarity Search\n(ANN Query)"]
        FILT["Filtered Search\n(Metadata + Vector)"]
        FETCH["Point Fetch\n(ID Lookup)"]
    end
    IDX -->|"Index params"| SIM
    SHARD -->|"Route to shard"| INS
    SHARD -->|"Scatter-gather"| SIM
    META -->|"Schema validation"| INS
    REP -->|"Sync replicas"| INS
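
The "Scatter-gather" edge can be illustrated with a small example: each shard returns its own top-k, and the results are merged into a global top-k. For simplicity this sketch uses exact brute-force cosine similarity rather than an ANN index, and the shard contents and function names are made up for the illustration.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_shard(shard: dict[str, list[float]], query: list[float], k: int):
    """Data plane work on one shard: score every vector, keep the best k."""
    scored = ((cosine(vec, query), vec_id) for vec_id, vec in shard.items())
    return heapq.nlargest(k, scored)

def scatter_gather(shards: list[dict[str, list[float]]], query: list[float], k: int) -> list[str]:
    """Query every shard, then merge the partial results into a global top-k."""
    partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return [vec_id for _, vec_id in heapq.nlargest(k, partials)]

# Which shard holds which vector is a placement decision made elsewhere.
shards = [
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
    {"c": [0.9, 0.1], "d": [-1.0, 0.0]},
]
print(scatter_gather(shards, query=[1.0, 0.0], k=2))
```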
                            
Deep Dive
Why Vector Index Building is Control Plane

Building an HNSW (Hierarchical Navigable Small World) index is a control plane operation: it decides the structure that enables fast similarity search. The index parameters (M, efConstruction, metric type) are architectural decisions that shape data plane performance. Once the index is built, queries traverse the graph without modifying it, and inserts extend it only within the parameters the control plane chose. This mirrors how a routing table (control plane output) enables packet forwarding (data plane operation) without the data plane needing to understand routing algorithms.

# Monitor AI infrastructure control/data plane health
# Control plane health — scheduling and orchestration
echo "=== Control Plane Health ==="
kubectl get pods -n kubeflow -l component=ml-pipeline
kubectl get pods -n mlflow -l app=mlflow-server
echo ""
echo "GPU Operator Status:"
kubectl get clusterpolicy -o jsonpath='{.items[0].status.state}'
echo ""
echo "Pending GPU requests (control plane backlog):"
kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits}{"\n"}{end}' | grep nvidia

echo ""
echo "=== Data Plane Health ==="
# Data plane health — inference servers
echo "vLLM Inference Servers:"
kubectl get pods -n inference -l app=vllm -o wide
echo ""
echo "GPU Utilization (data plane throughput):"
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,temperature.gpu \
  --format=csv,noheader,nounits
echo ""
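echo "Inference latency and token metrics:"
# Metric names assume vLLM's Prometheus exporter; adjust the pattern for other servers
kubectl exec -n inference deploy/vllm-server -- \
  curl -s localhost:8000/metrics | grep -E "time_to_first_token|e2e_request_latency|generation_tokens"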
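
Latency metrics scraped this way usually arrive as cumulative histogram buckets rather than ready-made percentiles. Assuming Prometheus-style buckets, the sketch below estimates p50/p95/p99 by linear interpolation inside the matching bucket, which is the same general approach as PromQL's `histogram_quantile`. The bucket values are invented for illustration.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs sorted by
    bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into +Inf; report last finite bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Invented end-to-end latency buckets: 100 requests in total.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}: {histogram_quantile(q, buckets):.3f}s")
```

The estimate is only as precise as the bucket boundaries, so percentile-based SLAs are easier to verify when bucket edges sit near the SLA threshold.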

Key Takeaway

AI Infrastructure is the Newest Control/Data Plane Domain

AI infrastructure is perhaps the most demanding modern application of control/data plane separation. The control plane must make complex resource allocation decisions across heterogeneous hardware (different GPU types, memory hierarchies, interconnects), while the data plane must achieve near-hardware-limit throughput for tensor operations. Organizations that conflate these concerns — running model orchestration on the same systems doing inference, or treating GPU scheduling as a simple Kubernetes problem — consistently hit scalability walls and operational complexity explosions.
