AI Control Plane
The AI control plane encompasses everything that decides what should happen in your AI infrastructure — without directly executing inference or training. It manages model lifecycles, schedules GPU resources, orchestrates workflows, and routes requests to appropriate model endpoints.
Key components of the AI control plane:
- Model Registry — versioned storage of trained models with metadata (MLflow Model Registry, Weights & Biases, SageMaker Model Registry)
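- GPU Cluster Scheduler — decides which workloads run on which GPUs (the Kubernetes scheduler working with the NVIDIA device plugin, Slurm); the NVIDIA GPU Operator installs the drivers and device plugin that expose GPUs to the scheduler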
- Workflow Orchestration — coordinates multi-step ML pipelines (Kubeflow Pipelines, Airflow, Prefect)
- Experiment Tracking — records hyperparameters, metrics, and artifacts (MLflow Tracking, Neptune, Comet)
- Model Versioning — tracks which model version is deployed where (shadow, canary, production)
- Deployment Routing — decides traffic split between model versions (A/B testing, canary, blue-green)
flowchart TB
subgraph CP["AI Control Plane"]
MR["Model Registry\n(Versioning)"]
SCHED["GPU Scheduler\n(Resource Allocation)"]
ORCH["Workflow Orchestrator\n(Pipeline Coordination)"]
ROUTE["Deployment Router\n(Traffic Splitting)"]
EXP["Experiment Tracker\n(Metrics & Params)"]
end
subgraph DP["AI Data Plane"]
INF["Inference Servers\n(vLLM, Triton)"]
TRAIN["Training Workers\n(PyTorch, JAX)"]
EMB["Embedding Engines\n(Vector Generation)"]
BATCH["Batch Processors\n(Offline Scoring)"]
end
MR -->|"Deploy model v2.1"| ROUTE
SCHED -->|"Assign GPU 0,1 to job"| TRAIN
SCHED -->|"Assign GPU 2,3 to serving"| INF
ORCH -->|"Trigger training run"| TRAIN
ROUTE -->|"Route 90% to v2.0, 10% to v2.1"| INF
TRAIN -.->|"Log metrics"| EXP
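The 90/10 split between v2.0 and v2.1 in the diagram can be sketched as a weighted random choice. This is a minimal illustration, not how any particular router implements it: the `ROUTES` table and the `pick_version` and `simulate` helpers are hypothetical names, and production routers often use sticky hashing on a user or session key instead of independent random draws.

```python
import random
from collections import Counter

# Hypothetical routing table mirroring the diagram's 90/10 canary split.
ROUTES = {"v2.0": 90, "v2.1": 10}


def pick_version(routes: dict, rng: random.Random) -> str:
    """Choose a model version with probability proportional to its weight."""
    versions = list(routes)
    weights = [routes[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


def simulate(n: int, seed: int = 0) -> Counter:
    """Route n synthetic requests and count how many land on each version."""
    rng = random.Random(seed)
    return Counter(pick_version(ROUTES, rng) for _ in range(n))


if __name__ == "__main__":
    total = 10_000
    counts = simulate(total)
    for version, count in sorted(counts.items()):
        print(f"{version}: {count / total:.1%}")
```

The routing table is the control plane artifact; the call to `pick_version` on each request is the only part that sits on the request path.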
AI Data Plane
The AI data plane is where actual computation happens — tensor operations execute on GPUs, models process inputs and produce outputs, embeddings are generated, and batch jobs crunch through datasets. This is the "hot path" where latency matters and throughput is measured.
Key components of the AI data plane:
- Inference Execution — running forward passes through neural networks at serving time
- Tensor Computation — matrix multiplications, attention mechanisms, activation functions on GPU
- Embedding Generation — converting text/images into vector representations
- Batch Processing — offline scoring of large datasets through models
- Model Serving — handling concurrent inference requests with batching and queuing
- KV Cache Management — storing and retrieving key-value pairs for autoregressive generation
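To make the KV cache item concrete, the sketch below estimates how much GPU memory the cache needs. It relies on the standard sizing rule (one key and one value vector per layer, per KV head, per token). The `ModelShape` class and the `ASSUMED_70B` values are illustrative assumptions for a 70B-class model with grouped-query attention, not figures taken from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention shape parameters needed to size a KV cache."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # fp16 or bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    """Bytes one token occupies: a K and a V vector per layer and KV head."""
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.dtype_bytes


def kv_cache_gib(shape: ModelShape, seq_len: int, num_seqs: int) -> float:
    """Total KV cache in GiB for num_seqs sequences of seq_len tokens each."""
    return kv_bytes_per_token(shape) * seq_len * num_seqs / 2**30


# Assumed shape for illustration only (70B-class, grouped-query attention).
ASSUMED_70B = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    per_token = kv_bytes_per_token(ASSUMED_70B)
    per_seq = kv_cache_gib(ASSUMED_70B, seq_len=8192, num_seqs=1)
    print(f"{per_token} bytes per token, {per_seq:.2f} GiB per 8192-token sequence")
```

Under these assumptions a single full-length sequence consumes a few GiB, which is why serving engines manage the cache in blocks instead of reserving the maximum for every request.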
GPU Scheduling as Control Plane
GPU scheduling is a pure control plane function — it decides which workloads get which GPU resources, when, and with what constraints. The scheduler never executes a CUDA kernel; it allocates the resources that enable execution.
flowchart TD
A["New Workload Request"] --> B{"Workload Type?"}
B -->|"Training"| C["Check GPU Memory\nRequirements"]
B -->|"Inference"| D["Check Latency SLA"]
B -->|"Batch"| E["Check Queue Priority"]
C --> F{"Multi-GPU\nNeeded?"}
F -->|"Yes"| G["Find Node with\nNVLink Topology"]
F -->|"No"| H["Find Available\nSingle GPU"]
D --> I{"MIG Partition\nSufficient?"}
I -->|"Yes"| J["Assign MIG Slice"]
I -->|"No"| K["Assign Full GPU"]
E --> L["Queue Until\nResources Free"]
G --> M["Bind GPUs to Pod"]
H --> M
J --> M
K --> M
L --> M
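The decision tree above can be sketched as a small placement function. Everything here is hypothetical: the `Workload` fields, the `MAX_MIG_SLICE_GB` threshold, and the use of memory footprint as the "MIG partition sufficient?" test are assumptions made for illustration, and a real scheduler would also weigh topology, quotas, and preemption.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TRAINING = "training"
    INFERENCE = "inference"
    BATCH = "batch"


@dataclass(frozen=True)
class Workload:
    kind: Kind
    gpus_needed: int = 1
    memory_gb: int = 0  # assumed footprint, used only for MIG sizing


# Hypothetical largest MIG slice offered for inference, in GB.
MAX_MIG_SLICE_GB = 20


def place(workload: Workload, free_gpus: int) -> str:
    """Return a placement decision following the flowchart's branches."""
    if workload.kind is Kind.TRAINING:
        # Multi-GPU training wants GPUs that share a fast interconnect.
        return "nvlink-node" if workload.gpus_needed > 1 else "single-gpu"
    if workload.kind is Kind.INFERENCE:
        # Small models fit in a MIG slice; larger ones need a whole GPU.
        return "mig-slice" if workload.memory_gb <= MAX_MIG_SLICE_GB else "full-gpu"
    # Batch work waits until capacity is free, then binds like any other job.
    return "single-gpu" if free_gpus > 0 else "queued"


if __name__ == "__main__":
    print(place(Workload(Kind.TRAINING, gpus_needed=4), free_gpus=8))
    print(place(Workload(Kind.INFERENCE, memory_gb=10), free_gpus=1))
    print(place(Workload(Kind.BATCH), free_gpus=0))
```

Note that the function only returns a decision. Nothing in it touches a GPU, which is the point of calling scheduling a control plane function.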
# NVIDIA GPU Operator: scheduling configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-config
  namespace: gpu-operator
data:
  # MIG (Multi-Instance GPU) partitioning strategy
  mig-config: |
    version: v1
    mig-configs:
      # A100 80GB split for mixed workloads
      a100-mixed-workload:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1   # Training: half GPU
            "2g.20gb": 1   # Large inference
            "1g.10gb": 2   # Small inference models
        - devices: [1]
          mig-enabled: false   # Full GPU for large training
---
# Time-slicing policy for inference workloads
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-time-slicing
spec:
  devicePlugin:
    config:
      name: time-slicing-config
      default: inference-sharing
  # Priority-based GPU queuing.
  # Illustrative only: `scheduling` is not a documented ClusterPolicy field.
  # In Kubernetes, priority and preemption are normally expressed with
  # PriorityClass objects referenced from pod specs.
  scheduling:
    preemption:
      enabled: true
    priorities:
      - name: critical-inference
        priority: 100
      - name: training
        priority: 50
      - name: batch-scoring
        priority: 10
MIG Partitioning as Control Plane Decision
NVIDIA's Multi-Instance GPU (MIG) technology is a physical manifestation of control plane decisions. The control plane decides how to partition A100/H100 GPUs into isolated instances (up to seven per GPU), each with dedicated compute, memory, and cache. The layout is set at configuration time and changes only when an operator reconfigures the GPU; the data plane then executes within these boundaries without knowing about the partitioning decisions above it.
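A control plane can sanity-check a proposed MIG layout before applying it. The sketch below checks only the totals, assuming an A100 80GB budget of 7 compute slices and 80 GB of memory. The helper names are hypothetical, and real MIG placement has additional rules about which profiles can sit next to each other that this check ignores.

```python
import re

# Assumed budget for one A100 80GB: 7 compute slices, 80 GB of memory.
COMPUTE_SLICES = 7
MEMORY_GB = 80

# Profile names look like "3g.40gb": <compute slices>g.<memory in GB>gb.
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def parse_profile(name: str) -> tuple:
    """Split a MIG profile name into (compute_slices, memory_gb)."""
    match = PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {name!r}")
    return int(match.group(1)), int(match.group(2))


def layout_fits(layout: dict) -> bool:
    """Check that the summed compute and memory stay within the GPU budget."""
    compute = memory = 0
    for profile, count in layout.items():
        slices, gigabytes = parse_profile(profile)
        compute += slices * count
        memory += gigabytes * count
    return compute <= COMPUTE_SLICES and memory <= MEMORY_GB


if __name__ == "__main__":
    mixed = {"3g.40gb": 1, "2g.20gb": 1, "1g.10gb": 2}
    print(layout_fits(mixed))                          # uses all 7 slices
    print(layout_fits({"3g.40gb": 2, "2g.20gb": 1}))   # needs 8 slices
```

The first layout is the `a100-mixed-workload` split from the ConfigMap earlier in this section, which exactly exhausts the budget under these assumptions.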
Model Serving as Data Plane
Model serving is the quintessential AI data plane — it receives inference requests and produces predictions. The serving infrastructure's job is pure execution: load model weights into GPU memory, batch incoming requests, execute forward passes, and return results with minimal latency.
"""
vLLM Continuous Batching: AI Data Plane Configuration
Demonstrates the data plane concern: maximizing inference throughput
while meeting latency SLAs through continuous batching and PagedAttention.
"""
import time

from vllm import LLM, SamplingParams

# Data plane configuration, optimized for throughput
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,          # Shard across 4 GPUs
    gpu_memory_utilization=0.90,     # Use 90% of GPU memory
    max_model_len=8192,              # Maximum sequence length
    enable_prefix_caching=True,      # Cache common prefixes
    # Continuous batching config
    max_num_batched_tokens=32768,    # Max tokens per batch
    max_num_seqs=256,                # Max concurrent sequences
    # PagedAttention KV cache
    block_size=16,                   # KV cache block size
    swap_space=4,                    # GB of CPU swap for KV cache
)

# Sampling parameters (per-request data plane config)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.1,
)

# Execute inference: pure data plane operation
prompts = [
    "Explain control plane vs data plane in 3 sentences.",
    "What is the role of a GPU scheduler?",
]

# Time the whole batch; finished_time in the request metrics is an absolute
# timestamp, so dividing a token count by it does not give tokens per second.
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = 0
for output in outputs:
    completion = output.outputs[0]
    total_tokens += len(completion.token_ids)
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {completion.text[:100]}...")
    print("---")

# Aggregate generation throughput over wall-clock time
print(f"Tokens/s: {total_tokens / elapsed:.1f}")
flowchart LR
REQ["Inference\nRequest"] --> TOK["Tokenizer\n(CPU)"]
TOK --> BATCH["Continuous\nBatcher"]
BATCH --> PREFILL["Prefill Phase\n(GPU Compute)"]
PREFILL --> DECODE["Decode Phase\n(Autoregressive)"]
DECODE --> KV["KV Cache\n(GPU Memory)"]
KV --> DECODE
DECODE --> DETOK["Detokenizer\n(CPU)"]
DETOK --> RESP["Response\nStream"]
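The batcher and decode loop in the diagram can be illustrated with a toy simulation. This is a simplified sketch of the continuous batching idea, not vLLM's scheduler: it ignores prefill, token budgets, and KV cache limits, and the `Request` class and `run_continuous_batching` function are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # decode steps still needed (assumed >= 1)


def run_continuous_batching(requests: list, max_num_seqs: int) -> dict:
    """Simulate decode steps; return the step at which each request finished."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into free slots before every step,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        step += 1
        # One decode step produces one token for every running sequence.
        for req in running:
            req.tokens_left -= 1
            if req.tokens_left <= 0:
                finished_at[req.name] = step
        # Finished sequences leave immediately and free their slot.
        running = [req for req in running if req.tokens_left > 0]
    return finished_at


if __name__ == "__main__":
    reqs = [Request("a", 3), Request("b", 1), Request("c", 2)]
    print(run_continuous_batching(reqs, max_num_seqs=2))
```

With two slots, request `c` takes over the slot that `b` frees after one step, so it finishes alongside `a`. A static batcher would have held `c` back until both `a` and `b` were done.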
Training vs Inference Planes
Training and inference represent two fundamentally different data plane profiles, each with distinct resource patterns, failure modes, and optimization strategies:
Training Data Plane vs Inference Data Plane
| Dimension | Training | Inference |
|---|---|---|
| Batch Size | Large (thousands) | Small (1-64 dynamic) |
| Latency Tolerance | Hours/days acceptable | Milliseconds matter |
| GPU Utilization | Sustained 95%+ | Bursty 30-80% |
| Memory Pattern | Gradients + activations | KV cache + weights |
| Failure Recovery | Checkpoint + resume | Retry + failover |
| Scaling | Data/model parallelism | Replica autoscaling |
| Network | All-reduce collective | Request/response |
MLOps Pipeline Orchestration
MLOps tools like Kubeflow and MLflow operate as control planes for the training data plane: Kubeflow Pipelines orchestrates the steps, while MLflow tracks runs and registers the resulting models. They define what runs, when, and with what parameters, but they never execute a gradient computation themselves.
flowchart TB
subgraph CTRL["Control Plane (Kubeflow/MLflow)"]
PIPE["Pipeline Definition\n(DAG of steps)"]
PARAM["Hyperparameter\nSearch Space"]
SCHED2["Step Scheduler"]
MON["Metric Monitor\n(Early Stopping)"]
end
subgraph DATA["Data Plane (GPU Workers)"]
PREP["Data Preprocessing\n(CPU/GPU)"]
TRAIN2["Model Training\n(Multi-GPU)"]
EVAL["Model Evaluation\n(Validation Set)"]
REG["Model Registration\n(Artifact Store)"]
end
PIPE --> SCHED2
SCHED2 -->|"Step 1"| PREP
SCHED2 -->|"Step 2"| TRAIN2
PARAM -->|"Trial config"| TRAIN2
TRAIN2 -->|"Metrics"| MON
MON -->|"Continue/Stop"| SCHED2
SCHED2 -->|"Step 3"| EVAL
SCHED2 -->|"Step 4"| REG
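The "Continue/Stop" edge from the metric monitor can be sketched as a patience-based early stopping rule. This is one common rule chosen for illustration, not something Kubeflow or MLflow prescribes; the function name and the `patience` and `min_delta` parameters are assumptions.

```python
def should_stop(val_losses: list, patience: int = 3, min_delta: float = 0.0) -> bool:
    """Return True when the last `patience` evaluations did not improve.

    "Improve" means beating the best earlier loss by more than min_delta.
    """
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best >= best_before - min_delta


if __name__ == "__main__":
    plateau = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]
    improving = [1.0, 0.9, 0.8, 0.7, 0.6]
    print(should_stop(plateau))    # loss stopped improving
    print(should_stop(improving))  # still getting better
```

The monitor only reads metrics that the training workers report and returns a decision; stopping the job is left to the scheduler.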
# Kubeflow Pipeline: Control Plane Definition
# (written as an Argo Workflow, the format Kubeflow Pipelines v1 compiles to)
# Orchestrates training data plane without executing computation
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: llm-finetune-pipeline
spec:
  entrypoint: finetune-dag
  templates:
    - name: finetune-dag
      dag:
        tasks:
          - name: data-preparation
            template: preprocess
            arguments:
              parameters:
                - name: dataset
                  value: "s3://data/train.jsonl"
                - name: max-length
                  value: "4096"
          - name: training
            template: train-model
            dependencies: [data-preparation]
            arguments:
              parameters:
                - name: base-model
                  value: "meta-llama/Llama-3.1-8B"
                - name: learning-rate
                  value: "2e-5"
                - name: epochs
                  value: "3"
                - name: gpu-count
                  value: "4"
          - name: evaluation
            template: evaluate
            dependencies: [training]
          - name: deployment
            template: deploy-model
            dependencies: [evaluation]
            when: "{{tasks.evaluation.outputs.parameters.accuracy}} > 0.85"
    # The preprocess, evaluate, and deploy-model templates are omitted here
    - name: train-model
      # Data plane execution: actual GPU work
      container:
        image: training-worker:v2.1
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "128Gi"
        env:
          - name: NCCL_DEBUG
            value: "INFO"
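What the orchestrator does with the `dependencies` fields above is, at its core, a topological sort. The sketch below reproduces that ordering with Python's standard `graphlib` module and applies the accuracy gate on the deployment task. The `DEPENDENCIES` table is copied from the workflow; the `execution_order` and `plan` helpers are hypothetical and do not reflect how Argo is implemented.

```python
from graphlib import TopologicalSorter

# Task -> set of tasks it depends on, copied from the workflow's DAG.
DEPENDENCIES = {
    "data-preparation": set(),
    "training": {"data-preparation"},
    "evaluation": {"training"},
    "deployment": {"evaluation"},
}


def execution_order(deps: dict) -> list:
    """Return tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def plan(deps: dict, accuracy: float, threshold: float = 0.85) -> list:
    """Drop the deployment task when its `when` guard would be false."""
    order = execution_order(deps)
    if not accuracy > threshold:
        order = [task for task in order if task != "deployment"]
    return order


if __name__ == "__main__":
    print(plan(DEPENDENCIES, accuracy=0.91))
    print(plan(DEPENDENCIES, accuracy=0.80))
```

In a real run the accuracy is only known after evaluation finishes, so the orchestrator evaluates the guard at that point instead of planning it up front as this sketch does.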
AI Gateway Pattern
The AI Gateway is a modern architectural pattern that makes the control/data plane split explicit at the API level. It sits between consumers and model endpoints, with clear control plane responsibilities (routing, rate limiting, cost tracking) separate from data plane operations (proxying inference requests).
# AI Gateway configuration: control plane routing rules
gateway:
  name: production-ai-gateway

  # Control plane: routing decisions
  routes:
    - name: chat-completions
      match:
        path: /v1/chat/completions
      backends:
        - provider: openai
          model: gpt-4o
          weight: 70   # 70% of traffic
          max_tokens_per_min: 100000
          cost_budget_daily: 500.00
        - provider: anthropic
          model: claude-sonnet-4-20250514
          weight: 20   # 20% of traffic
          max_tokens_per_min: 80000
        - provider: local-vllm
          model: llama-3.1-70b
          weight: 10   # 10% (cost optimization)
          endpoint: http://vllm-service:8000

  # Control plane: rate limiting
  rate_limits:
    - key: api_key
      requests_per_minute: 60
      tokens_per_minute: 50000
    - key: organization
      requests_per_minute: 1000
      cost_per_day: 2000.00

  # Control plane: fallback chain
  fallback:
    chain: [openai, anthropic, local-vllm]
    triggers:
      - error_rate > 0.05
      - latency_p99 > 5000ms
      - provider_status != healthy

  # Data plane: connection management
  connections:
    pool_size: 100
    timeout_ms: 30000
    retry_count: 2
    streaming: true
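The fallback chain and its three triggers can be sketched as a health check applied to each provider in order. This is an illustrative reading of the configuration above, not the behavior of any particular gateway product; the `ProviderHealth` class and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderHealth:
    error_rate: float      # fraction of failed requests in the window
    latency_p99_ms: float  # 99th percentile latency in milliseconds
    healthy: bool          # provider-reported status


def is_tripped(health: ProviderHealth) -> bool:
    """True when any of the config's fallback triggers fires."""
    return (
        health.error_rate > 0.05
        or health.latency_p99_ms > 5000
        or not health.healthy
    )


def choose_provider(chain: list, health: dict) -> Optional[str]:
    """Return the first provider in the chain with no trigger firing."""
    for name in chain:
        status = health.get(name)
        # A provider with no health data is skipped, not trusted.
        if status is not None and not is_tripped(status):
            return name
    return None


if __name__ == "__main__":
    chain = ["openai", "anthropic", "local-vllm"]
    health = {
        "openai": ProviderHealth(error_rate=0.10, latency_p99_ms=900, healthy=True),
        "anthropic": ProviderHealth(error_rate=0.01, latency_p99_ms=1200, healthy=True),
        "local-vllm": ProviderHealth(error_rate=0.00, latency_p99_ms=400, healthy=True),
    }
    print(choose_provider(chain, health))
```

Choosing the provider is a control plane decision made from aggregated health data; the request itself is still proxied by the data plane connection pool.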
Vector Database Planes
Vector databases exhibit clear control/data plane separation that mirrors traditional database architecture but with AI-specific concerns around index management and similarity search execution.
flowchart TB
subgraph VCP["Control Plane"]
IDX["Index Management\n(Build HNSW/IVF)"]
SHARD["Shard Placement\n(Rebalancing)"]
META["Metadata Schema\n(Collection Config)"]
REP["Replication\n(Consistency)"]
end
subgraph VDP["Data Plane"]
INS["Vector Insertion\n(Write Path)"]
SIM["Similarity Search\n(ANN Query)"]
FILT["Filtered Search\n(Metadata + Vector)"]
FETCH["Point Fetch\n(ID Lookup)"]
end
IDX -->|"Index params"| SIM
SHARD -->|"Route to shard"| INS
SHARD -->|"Scatter-gather"| SIM
META -->|"Schema validation"| INS
REP -->|"Sync replicas"| INS
Why Vector Index Building is Control Plane
Building an HNSW (Hierarchical Navigable Small World) index is a control plane operation: it decides the structure that enables fast similarity search. The index parameters (M, efConstruction, metric type) are architectural decisions that shape data plane performance. Once the index is built, similarity queries traverse the graph structure without modifying it. This mirrors how a routing table (control plane output) enables packet forwarding (data plane operation) without the data plane needing to understand routing algorithms.
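The build-then-traverse split can be shown with a much simpler structure than HNSW. The sketch below builds a single-layer nearest-neighbour graph and answers queries with a greedy walk. It is deliberately not HNSW (no layer hierarchy, no `efConstruction` candidate list, and the greedy walk can stop at a local minimum), and the function names and the parameter `m` are assumptions that only loosely correspond to HNSW's `M`.

```python
import math


def build_graph(points: list, m: int) -> dict:
    """Index build step: link each point to its m nearest neighbours."""
    graph = {}
    for i, point in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(point, points[j]))
        graph[i] = others[:m]
    return graph


def greedy_search(points: list, graph: dict, query: tuple, entry: int = 0) -> int:
    """Query step: move to a closer neighbour until none is closer.

    Only reads the graph; the structure built above is never modified.
    """
    current = entry
    while True:
        best = min(
            graph[current],
            key=lambda j: math.dist(points[j], query),
            default=current,
        )
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
    index = build_graph(pts, m=2)
    print(greedy_search(pts, index, query=(3.9, 0.0)))
```

The parameter `m` is fixed when the graph is built and determines how many hops a query needs, which is the sense in which index parameters are control plane decisions that shape data plane performance.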
# Monitor AI infrastructure control/data plane health
# Control plane health — scheduling and orchestration
echo "=== Control Plane Health ==="
kubectl get pods -n kubeflow -l component=ml-pipeline
kubectl get pods -n mlflow -l app=mlflow-server
echo ""
echo "GPU Operator Status:"
kubectl get clusterpolicy -o jsonpath='{.items[0].status.state}'
echo ""
echo "Pending GPU requests (control plane backlog):"
kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits}{"\n"}{end}' | grep nvidia
echo ""
echo "=== Data Plane Health ==="
# Data plane health — inference servers
echo "vLLM Inference Servers:"
kubectl get pods -n inference -l app=vllm -o wide
echo ""
echo "GPU Utilization (data plane throughput):"
# Run on a GPU node: nvidia-smi reports only that node's GPUs
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,temperature.gpu \
  --format=csv,noheader,nounits
echo ""
echo "Inference Latency (p50/p95/p99):"
# Metric names vary by vLLM version; adjust the grep pattern to match
kubectl exec -n inference deploy/vllm-server -- \
  curl -s localhost:8000/metrics | grep -E "request_latency|tokens_per_second"
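The CSV that the `nvidia-smi` query above prints can be turned into a simple health signal. The sketch below parses that four-column format and flags underused GPUs. The sample text is made up for illustration, not captured output, and the 30% threshold and helper names are assumptions. Fields that nvidia-smi reports as unavailable would need extra handling that is omitted here.

```python
import csv
import io

# Column order matches the --query-gpu list used in the script above.
FIELDS = ["index", "gpu_util", "mem_util", "temp_c"]


def parse_gpu_csv(text: str) -> list:
    """Parse `--format=csv,noheader,nounits` output into dicts of ints."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append({key: int(value) for key, value in zip(FIELDS, record)})
    return rows


def underused(rows: list, threshold: int = 30) -> list:
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [row["index"] for row in rows if row["gpu_util"] < threshold]


if __name__ == "__main__":
    # Made-up sample in the same shape as the nvidia-smi query output.
    sample = "0, 87, 45, 62\n1, 12, 3, 41\n"
    gpus = parse_gpu_csv(sample)
    print(underused(gpus))
```

A persistently underused GPU next to a backlog of pending GPU pods usually points at a control plane problem such as fragmentation or mis-sized requests, not at the data plane.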
Key Takeaway
AI Infrastructure is the Newest Control/Data Plane Domain
AI infrastructure is perhaps the most demanding modern application of control/data plane separation. The control plane must make complex resource allocation decisions across heterogeneous hardware (different GPU types, memory hierarchies, interconnects), while the data plane must achieve near-hardware-limit throughput for tensor operations. Organizations that conflate these concerns, for example by running model orchestration on the same systems that serve inference or by treating GPU scheduling as a generic Kubernetes problem, tend to hit scalability limits and growing operational complexity.