Observability Control vs Data Plane
Observability infrastructure splits the same way networking, storage, and security do: a control plane that makes decisions about telemetry policy, and a data plane that generates and transports the actual observability data.
flowchart TB
subgraph CP["Observability Control Plane"]
CONFIG["Telemetry Config\n(What to collect)"]
SAMPLE["Sampling Rules\n(What to keep)"]
ROUTE["Routing Rules\n(Where to send)"]
ALERT["Alert Rules\n(When to fire)"]
DASH["Dashboard Defs\n(How to show)"]
end
subgraph DP["Observability Data Plane"]
GEN["Data Generation\n(Metrics, Logs, Traces)"]
COLLECT["Collection\n(Scrape / Receive)"]
PROCESS["Processing\n(Filter, Transform)"]
EXPORT["Export\n(To backends)"]
GEN --> COLLECT
COLLECT --> PROCESS
PROCESS --> EXPORT
end
CONFIG -->|"Instrument"| GEN
SAMPLE -->|"Filter"| PROCESS
ROUTE -->|"Destination"| EXPORT
OpenTelemetry Collector
The OpenTelemetry (OTel) Collector is a clear example of a single component that combines control plane and data plane functions. Its configuration defines the control plane logic, while its runtime processes the data plane traffic.
Collector Pipeline Components
- Receivers (Data Plane) — ingest telemetry data from applications (OTLP, Jaeger, Prometheus, Zipkin formats)
- Processors (Control Logic) — apply filtering, attribute modification, sampling (including tail-based sampling decisions), and batching
- Exporters (Data Plane) — send processed telemetry to backends (Prometheus, Jaeger, Grafana Cloud, Datadog, etc.)
- Connectors — bridge between pipelines, allowing data to flow from one pipeline's exporter to another's receiver
flowchart LR
    subgraph RECV["Receivers (Data Ingestion)"]
        R1["OTLP gRPC\n(:4317)"]
        R2["OTLP HTTP\n(:4318)"]
        R3["Prometheus\n(scrape)"]
    end
    subgraph PROC["Processors (Control Logic)"]
        P1["Filter\n(drop unwanted)"]
        P2["Tail Sampling\n(keep interesting)"]
        P3["Attributes\n(enrich/transform)"]
        P4["Batch\n(group for efficiency)"]
        P1 --> P2
        P2 --> P3
        P3 --> P4
    end
    subgraph EXPORT["Exporters (Data Output)"]
        E1["Prometheus\nRemote Write"]
        E2["OTLP\n(Tempo/Jaeger)"]
        E3["Loki\n(Logs)"]
    end
    R1 --> P1
    R2 --> P1
    R3 --> P1
    P4 --> E1
    P4 --> E2
    P4 --> E3
# OpenTelemetry Collector configuration
# This YAML is the control plane: it defines what the data plane does
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  filter/drop-health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
  attributes/enrich:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-west-2-primary
        action: upsert

exporters:
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    # Processors run in the order listed. batch goes last so it only groups
    # data that survived filtering and sampling.
    traces:
      receivers: [otlp]
      processors: [filter/drop-health, tail_sampling, attributes/enrich, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [attributes/enrich, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [attributes/enrich, batch]
      exporters: [loki]
The Collector Config IS the Control Plane
In OpenTelemetry, the collector's YAML configuration IS the control plane definition — it declares what data to accept, how to process it, and where to send it. The collector runtime is the data plane that executes those declarations. This means observability teams can version-control their telemetry pipeline behavior (GitOps for observability), review changes in PRs, and deploy configuration updates independently of application code. The control plane is literally a YAML file.
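To make the split concrete, here is a minimal Python sketch of the same idea: a declarative config (the control plane) is compiled into a function that processes spans (the data plane). The `CONFIG` schema, the `build_pipeline` helper, and the dict-based span shape are all hypothetical and only illustrate the pattern; they are not the Collector's actual internals.

```python
from typing import Callable, Iterable

Span = dict  # hypothetical span shape: {"name": str, "attributes": {...}}

# "Control plane": a declarative description of pipeline behavior.
# The schema is invented for this sketch.
CONFIG = {
    "drop_routes": ["/healthz", "/readyz"],
    "enrich": {"environment": "production"},
}


def build_pipeline(config: dict) -> Callable[[Iterable[Span]], list[Span]]:
    """Compile declarative config into a span-processing function."""
    drop_routes = set(config.get("drop_routes", []))
    enrich = dict(config.get("enrich", {}))

    def run(spans: Iterable[Span]) -> list[Span]:
        out = []
        for span in spans:
            attrs = span.get("attributes", {})
            if attrs.get("http.route") in drop_routes:
                continue  # filter step: drop unwanted spans
            # enrich step: upsert the configured attributes
            out.append({**span, "attributes": {**attrs, **enrich}})
        return out

    return run


# "Data plane": the compiled function handles the actual traffic.
pipeline = build_pipeline(CONFIG)
result = pipeline([
    {"name": "GET /healthz", "attributes": {"http.route": "/healthz"}},
    {"name": "GET /orders", "attributes": {"http.route": "/orders"}},
])
```

Changing `CONFIG` changes what `pipeline` does without touching the code that emits spans, which is the property that makes the config reviewable and deployable on its own.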
Prometheus Architecture
Prometheus splits cleanly into control and data plane concerns:
Prometheus Control Plane
- Scrape configuration — defines WHAT to monitor (targets, intervals, relabeling)
- Recording rules — pre-compute expensive queries into new time series
- Alerting rules — define conditions that trigger notifications
- Service discovery — dynamically finds scrape targets (Kubernetes, Consul, DNS, file-based)
Prometheus Data Plane
- /metrics endpoints — applications expose metrics (data generation)
- Scrape engine — pulls metrics from targets at configured intervals
- TSDB — time-series database storing samples (2-hour blocks, compaction)
- Remote write — ships data to long-term storage (Thanos, Mimir, Cortex)
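As an illustration of the data generation side, the sketch below renders a counter in the Prometheus text exposition format that a /metrics endpoint returns. `render_counter` is a hypothetical helper written for this example; a real service would normally use a Prometheus client library, and this version does not escape special characters in label values.

```python
def render_counter(name: str, help_text: str,
                   samples: dict[tuple[tuple[str, str], ...], float]) -> str:
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Simplification: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {
        (("method", "GET"), ("status", "200")): 3,
        (("method", "GET"), ("status", "500")): 1,
    },
)
```

The scrape engine fetches text like this on every interval and appends each sample to the TSDB with the scrape timestamp.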
flowchart TB
subgraph PROM["Prometheus Server"]
subgraph CTRL["Control Plane Logic"]
SD["Service Discovery\n(Find targets)"]
RULES["Recording +\nAlerting Rules"]
RELABEL["Relabeling\n(Transform labels)"]
end
subgraph DATA["Data Plane Logic"]
SCRAPE["Scrape Engine\n(Pull /metrics)"]
TSDB["TSDB\n(Store samples)"]
QUERY["PromQL Engine\n(Query data)"]
RW["Remote Write\n(Ship to backends)"]
end
SD --> SCRAPE
RULES --> QUERY
SCRAPE --> TSDB
TSDB --> QUERY
TSDB --> RW
end
subgraph TARGETS["Targets (Data Generation)"]
T1["App 1\n/metrics"]
T2["App 2\n/metrics"]
T3["Node Exporter\n/metrics"]
end
SCRAPE -->|"GET /metrics"| T1
SCRAPE -->|"GET /metrics"| T2
SCRAPE -->|"GET /metrics"| T3
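# Prometheus scrape configuration: the control plane definition
# Tells Prometheus WHAT to scrape, HOW to find targets, WHEN to collect
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: us-west-2-primary
    environment: production

rule_files:
  - /etc/prometheus/rules/*.yaml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from prometheus.io/port.
      # Source labels are joined with ";" before the regex is applied.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # Assumption: node_exporter listens on port 9100 on every node;
      # without this, role: node points at the kubelet port instead
      - source_labels: [__address__]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?
        replacement: $1:9100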
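Relabeling is easier to reason about once the mechanics are spelled out: Prometheus joins the source label values with a separator (`;` by default), matches the fully anchored regex against the joined string, and then either keeps or drops the target (`keep`) or writes a replacement into the target label (`replace`). The Python sketch below models just those two actions. The `relabel` function is hypothetical, uses Python regex syntax rather than RE2, and writes backreferences as `\1` where Prometheus uses `$1`.

```python
import re
from typing import Optional


def relabel(labels: dict, source_labels: list, regex: str, action: str,
            target_label: str = "", replacement: str = r"\1",
            separator: str = ";") -> Optional[dict]:
    """Apply one simplified relabel rule; return None if the target is dropped."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # relabel regexes are fully anchored
    if action == "keep":
        return labels if match else None
    if action == "replace":
        if match is None:
            return labels  # no match: labels are left unchanged
        return {**labels, target_label: match.expand(replacement)}
    raise ValueError(f"unsupported action: {action}")


SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
PORT = "__meta_kubernetes_pod_annotation_prometheus_io_port"

target = {"__address__": "10.0.0.7:8080", SCRAPE: "true", PORT: "9102"}

kept = relabel(target, [SCRAPE], "true", "keep")
rewritten = relabel(kept, ["__address__", PORT],
                    r"([^:]+)(?::\d+)?;(\d+)", "replace",
                    target_label="__address__", replacement=r"\1:\2")
dropped = relabel({"__address__": "10.0.0.8:8080"}, [SCRAPE], "true", "keep")
```

Here `rewritten["__address__"]` becomes `10.0.0.7:9102`, while the pod without the scrape annotation is dropped before any scrape happens.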
Grafana Stack (Loki, Tempo, Mimir)
The Grafana observability stack separates control and data plane cleanly across its components:
- Grafana (Control Plane) — dashboards, alerting rules, data source configuration, team/folder permissions, notification channels
- Mimir (Data Plane — Metrics) — ingests, stores, and queries Prometheus-compatible metrics at scale
- Loki (Data Plane — Logs) — ingests, indexes (by labels only), and queries log streams
- Tempo (Data Plane — Traces) — ingests, stores, and queries distributed traces
# Grafana alerting rules: control plane definition
# Grafana evaluates these rules and fires alerts
apiVersion: 1
groups:
  - orgId: 1
    name: SLO Alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: "High Error Rate (>1% of requests)"
        condition: C
        data:
          # Query A: ratio of 5xx responses to all responses (0.0 to 1.0)
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
              instant: true
          # Expression C: threshold applied to the result of query A
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.01]
        noDataState: NoData
        execErrState: Error
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          # Query A is a ratio, so format it as a percentage
          summary: "Error rate is {{ humanizePercentage $values.A.Value }}"
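The interaction between the threshold and the `for` duration can be sketched in a few lines of Python. This is a simplified model, not Grafana's evaluator: `AlertState` and `evaluate` are hypothetical names, time is passed in explicitly as seconds, and the no-data case is treated as normal rather than as a separate state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    # Time (seconds) at which the condition first became true, if it is true now
    pending_since: Optional[float] = None


def evaluate(state: AlertState, now: float, errors_per_s: float,
             requests_per_s: float, threshold: float = 0.01,
             for_seconds: float = 300.0) -> str:
    """Return "normal", "pending", or "firing" for one evaluation tick."""
    ratio = errors_per_s / requests_per_s if requests_per_s > 0 else 0.0
    if ratio <= threshold:
        state.pending_since = None  # condition cleared: reset the timer
        return "normal"
    if state.pending_since is None:
        state.pending_since = now  # condition just became true
    held_for = now - state.pending_since
    return "firing" if held_for >= for_seconds else "pending"


state = AlertState()
first = evaluate(state, now=0, errors_per_s=2, requests_per_s=100)
later = evaluate(state, now=300, errors_per_s=2, requests_per_s=100)
recovered = evaluate(state, now=360, errors_per_s=0.5, requests_per_s=100)
```

A 2% error ratio is above the 1% threshold immediately, but the rule only moves from pending to firing once the condition has held for the full five minutes, which filters out short spikes.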
Sampling Strategies
Sampling is the quintessential control plane decision in observability — it determines how much data the data plane generates and transmits. The sampling strategy directly controls cost, storage volume, and query performance.
Types of Sampling
- Head-based sampling — decision made at span start (before outcome known). Simple: random percentage. Con: may drop interesting traces
- Tail-based sampling — decision made after the full trace completes. Can keep errors, high-latency traces, or traces with specific attributes. Con: every span of a trace must be buffered, on the same collector instance, until the decision is made
- Priority sampling — certain traffic always sampled (errors, canary, specific users), rest probabilistic
- Rate-limited sampling — cap at N traces/second regardless of traffic volume (predictable cost)
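Two of these strategies are small enough to sketch directly. The Python below shows a head-based sampler that hashes the trace ID, so every service makes the same decision for the same trace, and a token-bucket sampler that caps traces per second. Both `head_sample` and `RateLimitedSampler` are illustrative names rather than an SDK API, and the hash-to-bucket mapping is one reasonable choice among several.

```python
import hashlib


def head_sample(trace_id: str, percentage: float) -> bool:
    """Head-based decision: deterministic for a given trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percentage / 100.0


class RateLimitedSampler:
    """Token bucket: at most `traces_per_second` sampled traces on average."""

    def __init__(self, traces_per_second: float):
        self.rate = traces_per_second
        self.tokens = traces_per_second  # start with a full bucket
        self.last = 0.0

    def should_sample(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the bucket size
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


limiter = RateLimitedSampler(traces_per_second=2)
burst = [limiter.should_sample(now=0.0) for _ in range(3)]
after_refill = limiter.should_sample(now=1.0)
```

In `burst`, the first two requests are sampled and the third is rejected because the bucket is empty; one second later the bucket has refilled and sampling resumes.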
flowchart TB
subgraph HEAD["Head-Based Sampling"]
H1["Request arrives"]
H2{"Random < 10%?"}
H3["Sample: YES\n(Generate all spans)"]
H4["Sample: NO\n(Skip trace)"]
H1 --> H2
H2 -->|"Yes"| H3
H2 -->|"No"| H4
end
subgraph TAIL["Tail-Based Sampling"]
T1["Request arrives"]
T2["Generate ALL spans\n(buffer in collector)"]
T3["Trace completes"]
T4{"Error? Slow?\nImportant?"}
T5["Keep trace"]
T6["Drop trace"]
T1 --> T2
T2 --> T3
T3 --> T4
T4 -->|"Yes"| T5
T4 -->|"No (random 5%)"| T5
T4 -->|"No (95%)"| T6
end
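The tail-based decision in the diagram can be written as a single function over a completed trace. The sketch below assumes spans are plain dicts with `status`, `start_ms`, and `end_ms` keys, which is a hypothetical shape chosen for this example. The policies are combined with OR, so a trace is kept if any one of them matches.

```python
import random
from typing import Optional


def keep_trace(spans: list[dict], latency_threshold_ms: float = 2000.0,
               baseline_percentage: float = 10.0,
               rng: Optional[random.Random] = None) -> bool:
    """Decide whether to keep a completed, fully buffered trace."""
    if not spans:
        return False
    rng = rng or random.Random()
    # Policy 1: always keep traces that contain an error
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    # Policy 2: always keep traces slower than the threshold
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start > latency_threshold_ms:
        return True
    # Policy 3: keep a random baseline of everything else
    return rng.random() * 100.0 < baseline_percentage


error_trace = [{"status": "ERROR", "start_ms": 0, "end_ms": 50}]
slow_trace = [{"status": "OK", "start_ms": 0, "end_ms": 2500}]
fast_trace = [{"status": "OK", "start_ms": 0, "end_ms": 40}]

keeps_error = keep_trace(error_trace, baseline_percentage=0)
keeps_slow = keep_trace(slow_trace, baseline_percentage=0)
keeps_fast = keep_trace(fast_trace, baseline_percentage=0)
```

With the baseline set to zero, only the error and slow traces survive. The cost of this precision is that every span has to be held in memory until the trace is judged complete.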
Pipeline Architecture
Modern observability pipelines follow a multi-tier architecture: agents (close to data source) → collectors (aggregation layer) → backends (storage + query). Each tier has its own control/data plane split.
- Agents (per-host/per-pod) — lightweight data plane collection; configuration pushed from central control plane
- Collectors (regional/cluster) — aggregation, sampling, routing; control plane decisions with significant data plane throughput
- Backends (centralized) — storage, indexing, query serving; mostly data plane with control plane for retention/compaction
Why Two-Tier Collection Wins
A common anti-pattern is sending telemetry directly from agents to backends. With that tight coupling, any change to routing, sampling, or enrichment means reconfiguring and redeploying every agent. The two-tier pattern (agent → collector → backend) introduces a control plane boundary: collectors can change routing, sampling, and enrichment without touching agents. The collector layer acts as a dedicated control plane for your telemetry pipeline. It is where you implement sampling decisions, add metadata, filter noise, and route to multiple backends.
Cost Control via Control Plane
Observability cost scales with data plane volume: the bytes generated, transmitted, and stored. The control plane is your lever for managing costs without losing visibility.
Control Plane Cost Levers
- Sampling rate — reduce trace volume (10% baseline + 100% errors is typical)
- Metric cardinality — drop high-cardinality labels (user_id, request_id) before storage
- Log filtering — drop DEBUG/INFO in production, keep WARN/ERROR
- Aggregation — pre-aggregate metrics at collector (5s → 60s resolution)
- Retention policies — hot (7d full resolution), warm (30d downsampled), cold (1yr aggregated)
- Routing — send critical service telemetry to premium storage, background jobs to cheap storage
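A little arithmetic shows how much the first two levers matter. The Python sketch below computes the fraction of traces kept under a "baseline plus all errors" policy and the worst-case number of time series implied by a set of labels. The function names and the example numbers are illustrative assumptions, not measurements.

```python
import math


def kept_trace_fraction(error_rate: float, baseline_rate: float) -> float:
    """Fraction of traces kept when all errors plus a baseline sample are kept."""
    return error_rate + (1.0 - error_rate) * baseline_rate


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# 1% of traces contain errors; 10% of the remainder is kept as a baseline
fraction = kept_trace_fraction(error_rate=0.01, baseline_rate=0.10)

# Hypothetical metric with a high-cardinality user_id label
with_user_id = series_count({"method": 5, "status": 10, "user_id": 100_000})
without_user_id = series_count({"method": 5, "status": 10})
```

Under these assumptions roughly 11% of traces are stored, and dropping `user_id` reduces the worst case from 5,000,000 series to 50.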
# Check OTel Collector pipeline status (data plane health)
# Assumes the collector exposes its own metrics on :8888
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted

# Key metrics to monitor (data plane throughput)
# Exact names vary by Collector version (some releases append _total)
#   otelcol_receiver_accepted_spans      spans ingested
#   otelcol_processor_dropped_spans      spans dropped by processors (control plane effect)
#   otelcol_exporter_sent_spans          spans exported to backends
#   otelcol_exporter_send_failed_spans   export failures

# Estimate the effective sampling rate by comparing ingested and exported span counts
curl -s http://localhost:8888/metrics | grep -E "receiver_accepted_spans|exporter_sent_spans"
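To turn those counters into a single number, the metrics text can be parsed and the exported span count divided by the ingested span count. The Python sketch below does that for a sample payload. `sum_metric` is a hypothetical helper, the sample values are made up, and the parser assumes lines have no trailing timestamp.

```python
def sum_metric(metrics_text: str, name_prefix: str) -> float:
    """Sum all samples whose metric name starts with `name_prefix`."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Assumes "name{labels} value" with no trailing timestamp
        name_and_labels, _, value = line.rpartition(" ")
        metric_name = name_and_labels.split("{", 1)[0]
        if metric_name.startswith(name_prefix):
            total += float(value)
    return total


# Made-up sample of what the metrics endpoint might return
SAMPLE = """\
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 1000
otelcol_receiver_accepted_spans{receiver="otlp",transport="http"} 500
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="otlp/tempo"} 300
"""

accepted = sum_metric(SAMPLE, "otelcol_receiver_accepted_spans")
sent = sum_metric(SAMPLE, "otelcol_exporter_sent_spans")
effective_rate = sent / accepted if accepted else 0.0
```

Matching on a name prefix means the same code works whether or not the Collector version appends `_total` to counter names. Because these are cumulative counters, a production check would compare deltas over a time window rather than raw totals.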
Adaptive Sampling
Adaptive sampling is the most sophisticated observability control plane pattern — it dynamically adjusts data plane collection based on real-time conditions. The control plane monitors system state and modifies sampling rules on the fly.
How Adaptive Sampling Works
- Normal state — sample 5% of traces, aggregate metrics at 60s intervals
- Anomaly detected — control plane increases sampling to 50% for affected services
- Incident active — control plane sets 100% sampling for all services in the blast radius
- Recovery — gradually reduce back to baseline as error rates normalize
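These four states map naturally onto a small controller function. The Python sketch below picks the next sampling percentage from the current error rate. The thresholds (1% for an anomaly, 5% for an incident) and the 10-point recovery step are assumptions chosen for illustration; a real system would tune them per service.

```python
def target_sampling_percentage(error_rate: float, current: float, *,
                               anomaly_threshold: float = 0.01,
                               incident_threshold: float = 0.05,
                               baseline: float = 5.0,
                               step_down: float = 10.0) -> float:
    """Return the next trace sampling percentage for a service."""
    if error_rate >= incident_threshold:
        return 100.0  # incident active: capture everything
    if error_rate >= anomaly_threshold:
        return max(current, 50.0)  # anomaly: raise to 50%, never lower it here
    # Normal or recovering: step back toward the baseline gradually
    return max(baseline, current - step_down)


normal = target_sampling_percentage(0.001, current=5.0)
anomaly = target_sampling_percentage(0.02, current=5.0)
incident = target_sampling_percentage(0.10, current=50.0)

# Recovery: repeated evaluations with a healthy error rate
rate = 100.0
recovery_steps = []
for _ in range(11):
    rate = target_sampling_percentage(0.001, current=rate)
    recovery_steps.append(rate)
```

Stepping down gradually instead of snapping back to the baseline keeps detailed traces flowing for a while after the error rate normalizes, which helps confirm that a fix actually held.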
Feedback Loop: Alerting Drives Sampling
The most advanced observability architectures create a feedback loop: alerting rules (control plane) detect anomalies from metrics (data plane), then dynamically increase trace sampling (control plane adjustment) for the affected services, providing the detailed data needed to diagnose the issue. After resolution, sampling returns to baseline. This is a self-regulating system — the control plane optimizes the data plane based on the data plane's own output. It's observability observing itself.
Observability: The Pattern Applied to Itself
Observability is meta: it's the control/data plane pattern applied to monitoring control/data plane systems. Your OTel Collector config (control plane) determines what telemetry (data plane) your Kubernetes cluster's control plane and data plane emit. The separation at the observability layer mirrors the separation in the systems being observed. Master this pattern once, and you understand the architecture of monitoring, the architecture of what's being monitored, and the recursive relationship between them.