
Observability Control & Data Planes

May 15, 2026 Wasil Zafar 22 min read

Observability systems follow the control/data plane pattern: one layer decides WHAT to observe, HOW to sample, and WHERE to send telemetry (control), while another layer actually generates, collects, and transmits metrics, logs, and traces (data). Understanding this split is key to cost-effective observability at scale.

Table of Contents

  1. Observability Control vs Data Plane
  2. OpenTelemetry Collector
  3. Prometheus Architecture
  4. Grafana Stack (Loki, Tempo, Mimir)
  5. Sampling Strategies
  6. Pipeline Architecture
  7. Cost Control via Control Plane
  8. Adaptive Sampling

Observability Control vs Data Plane

Observability infrastructure has the same architectural DNA as networking, storage, and security: a control plane that makes decisions about telemetry policy, and a data plane that generates and transports the actual observability data.

Observability Control Plane: Telemetry configuration (what to instrument), sampling rules (what to keep vs discard), pipeline routing (where to send data), alerting rules (when to notify), dashboard definitions (how to visualize). It answers: "What should we observe? At what granularity? Where should the data go?"
Observability Data Plane: Actual metric generation (counters, gauges, histograms), log emission (structured events), trace span creation (distributed request tracking), data collection (scraping, receiving), transmission (to backends). It answers: "Generate this metric NOW. Emit this log. Create this span. Ship these bytes."
Observability Stack — Control & Data Plane Layers
flowchart TB
    subgraph CP["Observability Control Plane"]
        CONFIG["Telemetry Config\n(What to collect)"]
        SAMPLE["Sampling Rules\n(What to keep)"]
        ROUTE["Routing Rules\n(Where to send)"]
        ALERT["Alert Rules\n(When to fire)"]
        DASH["Dashboard Defs\n(How to show)"]
    end
    subgraph DP["Observability Data Plane"]
        GEN["Data Generation\n(Metrics, Logs, Traces)"]
        COLLECT["Collection\n(Scrape / Receive)"]
        PROCESS["Processing\n(Filter, Transform)"]
        EXPORT["Export\n(To backends)"]
        GEN --> COLLECT
        COLLECT --> PROCESS
        PROCESS --> EXPORT
    end
    CONFIG -->|"Instrument"| GEN
    SAMPLE -->|"Filter"| PROCESS
    ROUTE -->|"Destination"| EXPORT
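
The split in the diagram above can be sketched in a few lines of Python. This is only an illustration: `TelemetryPolicy`, `DataPlane`, and the backend names are hypothetical, not part of any real library. The policy object holds the decisions (what to drop, where to send), and the handler applies whatever policy is current to each span.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TelemetryPolicy:
    """Control plane: declarative decisions that change rarely."""
    drop_routes: frozenset = frozenset({"/healthz", "/readyz"})
    destination: str = "tempo"  # hypothetical backend name


@dataclass
class DataPlane:
    """Data plane: applies the current policy to every span it sees."""
    policy: TelemetryPolicy
    exported: list = field(default_factory=list)

    def handle(self, span: dict) -> bool:
        # Filter step: governed by the policy, executed once per span.
        if span.get("http.route") in self.policy.drop_routes:
            return False
        # Export step: the destination also comes from the policy.
        self.exported.append((self.policy.destination, span["name"]))
        return True


plane = DataPlane(TelemetryPolicy())
plane.handle({"name": "GET /orders", "http.route": "/orders"})
plane.handle({"name": "GET /healthz", "http.route": "/healthz"})

# Swapping the policy changes behavior without touching the handling code.
plane.policy = TelemetryPolicy(drop_routes=frozenset(), destination="jaeger")
plane.handle({"name": "GET /healthz", "http.route": "/healthz"})
print(plane.exported)
```

The health check is dropped under the first policy and exported under the second, even though `handle` itself never changed.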
                            

OpenTelemetry Collector

The OpenTelemetry (OTel) Collector is a clear example of a component that embodies both control and data plane functions: its configuration defines the control plane logic, while its runtime processes the data plane traffic.

Collector Pipeline Components

  • Receivers (Data Plane) — ingest telemetry data from applications (OTLP, Jaeger, Prometheus, Zipkin formats)
  • Processors (Control Logic) — apply filtering, sampling (including tail-based decisions), attribute modification, and batching
  • Exporters (Data Plane) — send processed telemetry to backends (Prometheus, Jaeger, Grafana Cloud, Datadog, etc.)
  • Connectors — bridge pipelines by acting as an exporter at the end of one pipeline and as a receiver at the start of another
OpenTelemetry Collector Pipeline Architecture
flowchart LR
    subgraph RECV["Receivers (Data Ingestion)"]
        R1["OTLP gRPC\n(:4317)"]
        R2["OTLP HTTP\n(:4318)"]
        R3["Prometheus\n(scrape)"]
    end
    subgraph PROC["Processors (Control Logic)"]
        P1["Filter\n(drop unwanted)"]
        P2["Tail Sampling\n(keep interesting)"]
        P3["Attributes\n(enrich/transform)"]
        P4["Batch\n(group for efficiency)"]
        P1 --> P2
        P2 --> P3
        P3 --> P4
    end
    subgraph EXPORT["Exporters (Data Output)"]
        E1["Prometheus\nRemote Write"]
        E2["OTLP\n(Tempo/Jaeger)"]
        E3["Loki\n(Logs)"]
    end
    R1 --> P1
    R2 --> P1
    R3 --> P1
    P4 --> E1
    P4 --> E2
    P4 --> E3
                            
# OpenTelemetry Collector Configuration
# This YAML IS the control plane — defining what the data plane does
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  filter/drop-health:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
  attributes/enrich:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-west-2-primary
        action: upsert

exporters:
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
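  # Note: recent Collector releases deprecate the dedicated loki exporter;
  # Loki 3.x can ingest OTLP directly through an otlphttp exporter instead.
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"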

service:
  pipelines:
    # Processor order matters: filter and sample first, batch last,
    # so that batches are built only from data that will be exported.
    traces:
      receivers: [otlp]
      processors: [filter/drop-health, tail_sampling, attributes/enrich, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [attributes/enrich, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [attributes/enrich, batch]
      exporters: [loki]
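
To make the `tail_sampling` policies in this configuration concrete, here is a simplified Python model of the decision, assuming a trace is kept as soon as any one policy votes to keep it. The span dictionary shape and the `keep_trace` function are hypothetical; the real processor works on buffered OTLP spans and its exact comparison semantics may differ.

```python
import random


def keep_trace(spans, latency_threshold_ms=2000, baseline_pct=10.0,
               rng=random.random):
    """Return True if any policy votes to keep the completed trace."""
    # Policy "errors-always": any span with an ERROR status keeps the trace.
    if any(s["status"] == "ERROR" for s in spans):
        return True
    # Policy "slow-requests": trace duration is measured from the earliest
    # span start to the latest span end.
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms > latency_threshold_ms:
        return True
    # Policy "probabilistic-sample": keep a random baseline of the rest.
    return rng() * 100.0 < baseline_pct


fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 120}]
slow_ok = [{"status": "OK", "start_ms": 0, "end_ms": 2500}]
failed = [{"status": "OK", "start_ms": 0, "end_ms": 50},
          {"status": "ERROR", "start_ms": 10, "end_ms": 40}]

print(keep_trace(failed), keep_trace(slow_ok))
```

The `rng` parameter exists only so the probabilistic branch can be made deterministic in tests.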
Architecture Insight
The Collector Config IS the Control Plane

In OpenTelemetry, the collector's YAML configuration IS the control plane definition — it declares what data to accept, how to process it, and where to send it. The collector runtime is the data plane that executes those declarations. This means observability teams can version-control their telemetry pipeline behavior (GitOps for observability), review changes in PRs, and deploy configuration updates independently of application code. The control plane is literally a YAML file.
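
One practical consequence of treating the configuration as a reviewable artifact is that it can be checked in CI before it is deployed. The sketch below is a hypothetical lint step, not a Collector feature: it assumes the YAML has already been parsed into a Python dictionary and only verifies that every component referenced by a pipeline is actually defined.

```python
def undefined_components(config: dict) -> list:
    """List pipeline references that have no matching component definition."""
    problems = []
    for name, pipeline in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section, {})
            for component in pipeline.get(section, []):
                if component not in defined:
                    problems.append(
                        f"{name}: {section[:-1]} '{component}' is not defined"
                    )
    return problems


# A trimmed-down config in the same shape as the YAML above.
config = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "tail_sampling": {}},
    "exporters": {"otlp/tempo": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["tail_sampling", "batch"],
                "exporters": ["otlp/tempo"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["loki"],  # referenced but never defined
            },
        }
    },
}

for problem in undefined_components(config):
    print(problem)
```

A pull request that introduces the broken `logs` pipeline would fail this check before the change reaches a running collector.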


Prometheus Architecture

Prometheus splits cleanly into control and data plane concerns:

Prometheus Control Plane

  • Scrape configuration — defines WHAT to monitor (targets, intervals, relabeling)
  • Recording rules — pre-compute expensive queries into new time series
  • Alerting rules — define conditions that trigger notifications
  • Service discovery — dynamically finds scrape targets (Kubernetes, Consul, DNS, file-based)

Prometheus Data Plane

  • /metrics endpoints — applications expose metrics (data generation)
  • Scrape engine — pulls metrics from targets at configured intervals
  • TSDB — time-series database storing samples (2-hour blocks, compaction)
  • Remote write — ships data to long-term storage (Thanos, Mimir, Cortex)
Prometheus — Scrape Config (Control) + Targets (Data)
flowchart TB
    subgraph PROM["Prometheus Server"]
        subgraph CTRL["Control Plane Logic"]
            SD["Service Discovery\n(Find targets)"]
            RULES["Recording +\nAlerting Rules"]
            RELABEL["Relabeling\n(Transform labels)"]
        end
        subgraph DATA["Data Plane Logic"]
            SCRAPE["Scrape Engine\n(Pull /metrics)"]
            TSDB["TSDB\n(Store samples)"]
            QUERY["PromQL Engine\n(Query data)"]
            RW["Remote Write\n(Ship to backends)"]
        end
        SD --> RELABEL
        RELABEL --> SCRAPE
        RULES --> QUERY
        SCRAPE --> TSDB
        TSDB --> QUERY
        TSDB --> RW
    end
    subgraph TARGETS["Targets (Data Generation)"]
        T1["App 1\n/metrics"]
        T2["App 2\n/metrics"]
        T3["Node Exporter\n/metrics"]
    end
    SCRAPE -->|"GET /metrics"| T1
    SCRAPE -->|"GET /metrics"| T2
    SCRAPE -->|"GET /metrics"| T3
                            
# Prometheus scrape configuration — the control plane definition
# Tells Prometheus WHAT to scrape, HOW to find targets, WHEN to collect
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: us-west-2-primary
    environment: production

rule_files:
  - /etc/prometheus/rules/*.yaml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
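      # Rewrite the scrape address to use the port from the pod annotation.
      # Source label values are joined with ';' before the regex is applied.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2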

  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
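
Relabeling rules are easier to reason about once the mechanics are visible. The following Python sketch emulates a `replace` action under the usual Prometheus rules: source label values are joined with `;`, the regex must match the whole joined string, and capture groups are substituted into the replacement. The `relabel_replace` helper is hypothetical and models only this one action; the address-rewriting regex used in the example is the widely used annotation-port pattern.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement,
                    target_label, separator=";"):
    """Emulate a Prometheus 'replace' relabel action on one label set."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # Prometheus anchors the regex
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate $1 / ${1} references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


pod = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
result = relabel_replace(
    pod,
    source_labels=["__address__",
                   "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    replacement="$1:$2",
    target_label="__address__",
)
print(result["__address__"])
```

A pod without the port annotation produces a joined value ending in `;`, the regex does not match, and the original address is kept.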

Grafana Stack (Loki, Tempo, Mimir)

The Grafana observability stack separates control and data plane cleanly across its components:

  • Grafana (Control Plane) — dashboards, alerting rules, data source configuration, team/folder permissions, notification channels
  • Mimir (Data Plane — Metrics) — ingests, stores, and queries Prometheus-compatible metrics at scale
  • Loki (Data Plane — Logs) — ingests, indexes (by labels only), and queries log streams
  • Tempo (Data Plane — Traces) — ingests, stores, and queries distributed traces
# Grafana alerting rules — Control Plane definition
# Grafana evaluates these rules and fires alerts
apiVersion: 1
groups:
  - orgId: 1
    name: SLO Alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: "High Error Rate (>1% of requests)"
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              refId: A
              instant: true  # one value per evaluation, so no reduce step is needed
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: A  # the threshold is applied to the result of query A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.01]  # the query returns a ratio, so 0.01 = 1%
        noDataState: NoData
        execErrState: Error
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate is {{ humanizePercentage $values.A.Value }}"
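
The interaction between the 1% threshold, the 1m evaluation interval, and `for: 5m` is easy to misread, so here is a simplified model of it. This is a sketch under stated assumptions, not Grafana's implementation: `alert_states` is a hypothetical function, each list element stands for one evaluation, and the rule is assumed to move from Pending to Alerting once the condition has held continuously for the full pending period.

```python
def alert_states(ratios, threshold=0.01, for_evals=5):
    """Map a series of per-evaluation error ratios to alert states.

    for_evals is the pending period expressed in evaluation intervals
    (5 for a 1m interval and for: 5m).
    """
    states = []
    breaching = 0  # consecutive evaluations above the threshold
    for ratio in ratios:
        breaching = breaching + 1 if ratio > threshold else 0
        if breaching == 0:
            states.append("Normal")
        elif breaching <= for_evals:
            states.append("Pending")
        else:
            states.append("Alerting")
    return states


# A two-minute blip never fires; a sustained breach does.
print(alert_states([0.0, 0.02, 0.02, 0.0]))
print(alert_states([0.02] * 7))
```

The pending period is what keeps short spikes from paging anyone, at the price of a delay before a real incident is reported.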

Sampling Strategies

Sampling is the quintessential control plane decision in observability — it determines how much data the data plane generates and transmits. The sampling strategy directly controls cost, storage volume, and query performance.

Types of Sampling

  • Head-based sampling — the decision is made when the trace starts, before the outcome is known, and is propagated to downstream services. Simple: typically a random percentage. Con: may drop interesting traces
  • Tail-based sampling — the decision is made after the full trace completes. Can keep errors, high-latency requests, or traces with specific attributes. Con: requires buffering all spans of a trace until the decision is made
  • Priority sampling — certain traffic is always sampled (errors, canary, specific users); the rest is sampled probabilistically
  • Rate-limited sampling — caps volume at N traces/second regardless of traffic (predictable cost)
Sampling Decision Flow — Head vs Tail-Based
flowchart TB
    subgraph HEAD["Head-Based Sampling"]
        H1["Request arrives"]
        H2{"Random < 10%?"}
        H3["Sample: YES\n(Generate all spans)"]
        H4["Sample: NO\n(Skip trace)"]
        H1 --> H2
        H2 -->|"Yes"| H3
        H2 -->|"No"| H4
    end
    subgraph TAIL["Tail-Based Sampling"]
        T1["Request arrives"]
        T2["Generate ALL spans\n(buffer in collector)"]
        T3["Trace completes"]
        T4{"Error? Slow?\nImportant?"}
        T5["Keep trace"]
        T6["Drop trace"]
        T1 --> T2
        T2 --> T3
        T3 --> T4
        T4 -->|"Yes"| T5
        T4 -->|"No (random 5%)"| T5
        T4 -->|"No (95%)"| T6
    end
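
Two of the strategies above are small enough to sketch directly. The code below is a hedged illustration, not any SDK's implementation: `head_sample` derives the keep/drop decision from the trace ID so that every service reaches the same answer for the same trace, and `RateLimiter` is a simple fixed-window cap on traces per second. Both names are hypothetical.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at trace start, using the low 64 bits of the trace ID."""
    value = int(trace_id_hex[-16:], 16)
    # Treat the ID bits as a uniform random number in [0, 2**64).
    return value < ratio * 2**64


class RateLimiter:
    """Allow at most max_per_second traces in each one-second window."""

    def __init__(self, max_per_second: int):
        self.max_per_second = max_per_second
        self.window = None
        self.count = 0

    def allow(self, now_seconds: float) -> bool:
        window = int(now_seconds)
        if window != self.window:
            # A new second has started: reset the counter.
            self.window, self.count = window, 0
        if self.count < self.max_per_second:
            self.count += 1
            return True
        return False


limiter = RateLimiter(max_per_second=2)
decisions = [limiter.allow(t) for t in (0.1, 0.2, 0.3, 1.0)]
print(decisions)
```

Because the head decision depends only on the trace ID and the ratio, no coordination between services is needed, which is exactly why it cannot take the outcome of the request into account.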
                            
The Sampling Economics: As an illustrative example, at 100K requests/second, storing every trace might cost $50K/month. With 10% head-based sampling, that drops to roughly $5K/month. With tail-based sampling (keep all errors plus a 5% baseline), it might be $8K/month, but with 100% error coverage. The control plane's sampling policy is effectively the cost knob for your observability data plane: set it too aggressively and you miss critical data, set it too loosely and you can overspend by 10x.
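
The arithmetic behind these figures can be checked with a short calculation. It assumes, as a simplification that the source does not state, that cost scales linearly with the fraction of traces kept and that collector buffering adds nothing. Under that assumption the $8K tail-based figure corresponds to keeping about 16% of traces, which would mean that roughly 11 to 12% of traces match an always-keep policy.

```python
FULL_COST = 50_000  # monthly cost of storing every trace, from the example


def monthly_cost(kept_fraction: float, full_cost: float = FULL_COST) -> float:
    """Cost under the linear assumption: pay only for what is kept."""
    return full_cost * kept_fraction


def tail_kept_fraction(always_keep: float, baseline: float) -> float:
    """Fraction kept when some traces are always kept and the rest sampled."""
    return always_keep + (1.0 - always_keep) * baseline


def implied_always_keep(kept_fraction: float, baseline: float) -> float:
    """Invert tail_kept_fraction to find the always-keep share."""
    return (kept_fraction - baseline) / (1.0 - baseline)


head_cost = monthly_cost(0.10)
share = implied_always_keep(8_000 / FULL_COST, baseline=0.05)
print(f"head-based 10%: ${head_cost:,.0f}/month")
print(f"always-keep share implied by $8K: {share:.1%}")
```

If the real error rate is closer to 1%, the same formula gives a kept fraction of about 6%, so the tail-based bill would be nearer $3K than $8K before buffering overhead.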

Pipeline Architecture

Modern observability pipelines follow a multi-tier architecture: agents (close to data source) → collectors (aggregation layer) → backends (storage + query). Each tier has its own control/data plane split.

  • Agents (per-host/per-pod) — lightweight data plane collection; configuration pushed from central control plane
  • Collectors (regional/cluster) — aggregation, sampling, routing; control plane decisions with significant data plane throughput
  • Backends (centralized) — storage, indexing, query serving; mostly data plane with control plane for retention/compaction
Pipeline Design
Why Two-Tier Collection Wins

A common anti-pattern is sending telemetry directly from agents to backends. This tight coupling means that any change to routing, sampling, or backends requires reconfiguring or redeploying every agent. The two-tier pattern (agent → collector → backend) introduces a control plane boundary: collectors can change routing, sampling, and enrichment without touching agents. The collector layer acts as a dedicated control point for your telemetry pipeline, where you implement sampling decisions, add metadata, filter noise, and route to multiple backends.
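
The decoupling argument can be shown with a toy routing table. Everything here is hypothetical (the endpoint, the backend names, and the `route` helper): the point is only that the agent's configuration contains a single collector address, while the routing policy lives in the collector and can change between versions.

```python
# What every agent knows: one collector address (hypothetical).
AGENT_CONFIG = {"endpoint": "collector.observability.svc:4317"}


def route(signal: dict, rules: list, default: list) -> list:
    """Return the backends for a signal using the first matching rule."""
    for predicate, backends in rules:
        if predicate(signal):
            return backends
    return default


# Version 1 of the collector's routing policy.
rules_v1 = [(lambda s: s["type"] == "trace", ["tempo"])]
# Version 2 adds a second trace backend and a log route; agents are untouched.
rules_v2 = [
    (lambda s: s["type"] == "trace", ["tempo", "jaeger"]),
    (lambda s: s["type"] == "log", ["loki"]),
]

span = {"type": "trace", "service": "checkout"}
print(route(span, rules_v1, default=["mimir"]))
print(route(span, rules_v2, default=["mimir"]))
```

Rolling out `rules_v2` is a collector-only change; `AGENT_CONFIG` is identical before and after.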


Cost Control via Control Plane

Observability cost is roughly proportional to data plane volume: the bytes generated, transmitted, and stored. The control plane is your lever for managing that cost without losing visibility.

Control Plane Cost Levers

  • Sampling rate — reduce trace volume (10% baseline + 100% errors is typical)
  • Metric cardinality — drop high-cardinality labels (user_id, request_id) before storage
  • Log filtering — drop DEBUG/INFO in production, keep WARN/ERROR
  • Aggregation — pre-aggregate metrics at collector (5s → 60s resolution)
  • Retention policies — hot (7d full resolution), warm (30d downsampled), cold (1yr aggregated)
  • Routing — send critical service telemetry to premium storage, background jobs to cheap storage
The 90/10 Rule of Observability Data: As a rule of thumb, in most organizations about 90% of observability value comes from about 10% of the data (errors, slow requests, SLO violations). The control plane's job is to identify and preserve that 10% while aggressively reducing the other 90%. This is why tail-based sampling combined with smart filtering tends to outperform brute-force "collect everything."
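
The metric cardinality lever in the list above is worth a concrete example, because its effect is multiplicative. The sketch below is a minimal illustration with a hypothetical `drop_labels` helper: it removes chosen labels from counter samples and sums the values that collapse onto the same remaining label set.

```python
def drop_labels(series, labels_to_drop):
    """Remove labels and merge counter series that become identical."""
    aggregated = {}
    for labels, value in series:
        kept = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in labels_to_drop
        ))
        # Summing is valid here because the samples are counters.
        aggregated[kept] = aggregated.get(kept, 0) + value
    return aggregated


series = [
    ({"route": "/orders", "user_id": "u1"}, 1),
    ({"route": "/orders", "user_id": "u2"}, 2),
    ({"route": "/orders", "user_id": "u3"}, 3),
    ({"route": "/cart", "user_id": "u1"}, 4),
]

reduced = drop_labels(series, {"user_id"})
print(len(series), "series before,", len(reduced), "after")
```

With real traffic the reduction is far larger: a `user_id` label multiplies the series count by the number of distinct users, and dropping it removes that factor entirely.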
# Check OTel Collector pipeline status (data plane health)
# Assumes the collector exposes its own metrics on :8888
# (metric names and suffixes such as _total vary between Collector versions)
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted

# Key metrics to monitor (data plane throughput)
# otelcol_receiver_accepted_spans: spans ingested
# otelcol_processor_dropped_spans: spans dropped by processors (control plane effect)
# otelcol_exporter_sent_spans: spans exported to backends
# otelcol_exporter_send_failed_spans: export failures

# Estimate the effective sampling rate as exporter_sent_spans / receiver_accepted_spans
curl -s http://localhost:8888/metrics | grep -E "receiver_accepted_spans|exporter_sent_spans"
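
The counters returned by the commands above can be turned into a single number with a few lines of Python. This is a sketch that assumes the metrics text uses the plain Prometheus exposition format without timestamps, and that the metric names match the ones listed above (optionally with a `_total` suffix); the sample text is made up for illustration.

```python
def counter_total(exposition: str, metric: str) -> float:
    """Sum a counter across all label sets in Prometheus text format."""
    total = 0.0
    for line in exposition.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in (metric, metric + "_total"):
            total += float(value)
    return total


def effective_sampling_rate(exposition: str) -> float:
    """Spans exported divided by spans received (0.0 if nothing received)."""
    accepted = counter_total(exposition, "otelcol_receiver_accepted_spans")
    sent = counter_total(exposition, "otelcol_exporter_sent_spans")
    return sent / accepted if accepted else 0.0


SAMPLE = """
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 1000
otelcol_receiver_accepted_spans{receiver="otlp",transport="http"} 500
otelcol_exporter_sent_spans{exporter="otlp/tempo"} 300
"""

print(f"effective sampling rate: {effective_sampling_rate(SAMPLE):.0%}")
```

Because these are cumulative counters, the ratio describes the collector's whole lifetime; comparing two snapshots taken a few minutes apart gives the current rate instead.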

Adaptive Sampling

Adaptive sampling is the most sophisticated observability control plane pattern — it dynamically adjusts data plane collection based on real-time conditions. The control plane monitors system state and modifies sampling rules on the fly.

How Adaptive Sampling Works

  • Normal state — sample 5% of traces, aggregate metrics at 60s intervals
  • Anomaly detected — control plane increases sampling to 50% for affected services
  • Incident active — control plane sets 100% sampling for all services in the blast radius
  • Recovery — gradually reduce back to baseline as error rates normalize
Advanced Pattern
Feedback Loop: Alerting Drives Sampling

The most advanced observability architectures create a feedback loop: alerting rules (control plane) detect anomalies from metrics (data plane), then dynamically increase trace sampling (control plane adjustment) for the affected services, providing the detailed data needed to diagnose the issue. After resolution, sampling returns to baseline. This is a self-regulating system — the control plane optimizes the data plane based on the data plane's own output. It's observability observing itself.
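
The feedback loop can be reduced to a small control function. The sketch below is one possible shape for it, not a reference design: `next_sampling_rate` and all of its thresholds are hypothetical, with the 5%, 50%, and 100% rates taken from the states listed earlier and the recovery modeled as stepping halfway back to baseline on each tick.

```python
def next_sampling_rate(current, error_rate, incident_active=False,
                       baseline=0.05, anomaly_rate=0.50,
                       anomaly_threshold=0.01, decay=0.5):
    """Choose the next trace sampling rate from the latest signals."""
    if incident_active:
        return 1.0  # incident: capture everything in the blast radius
    if error_rate > anomaly_threshold:
        # Anomaly: raise sampling, but never lower it mid-anomaly.
        return max(current, anomaly_rate)
    # Recovery: move part of the way back toward the baseline each tick.
    return max(baseline, baseline + (current - baseline) * decay)


rate = 0.05
timeline = []
signals = [
    (0.001, False),  # normal
    (0.030, False),  # anomaly detected
    (0.080, True),   # incident declared
    (0.002, False),  # recovering
    (0.001, False),
]
for error_rate, incident in signals:
    rate = next_sampling_rate(rate, error_rate, incident)
    timeline.append(round(rate, 4))
print(timeline)
```

In a real deployment the output of this function would be pushed to collectors as a configuration change, and the decay factor would be tuned so that sampling stays elevated long enough to confirm the fix.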

Key Takeaway
Observability: The Pattern Applied to Itself

Observability is meta: it's the control/data plane pattern applied to monitoring control/data plane systems. Your OTel Collector config (control plane) determines what telemetry (data plane) your Kubernetes cluster's control plane and data plane emit. The separation at the observability layer mirrors the separation in the systems being observed. Master this pattern once, and you understand the architecture of monitoring, the architecture of what's being monitored, and the recursive relationship between them.
