Grafana Deep Dive Part 11: Architecting an Observability Platform

The Platform Mindset

Building an observability platform is fundamentally different from deploying individual monitoring tools. A platform provides self-service capabilities to multiple teams, enforces consistent standards, abstracts infrastructure complexity, and scales gracefully as the organization grows. Think of it as the difference between each team running their own Prometheus instance versus providing a managed metrics service that handles ingestion, storage, querying, and visualization transparently.

                            
Key Insight: An observability platform is not a collection of tools; it is a product with internal customers. Treat it with the same product management discipline you’d apply to any customer-facing service: define personas, gather requirements, measure adoption, and iterate based on feedback.
                        

Why Build a Platform?

Organizations typically evolve through stages of observability adoption. Early on, individual teams install tools ad-hoc — one team uses Datadog, another deploys Prometheus manually, a third relies on cloud-native metrics. This fragmentation creates blind spots at service boundaries, inconsistent alerting thresholds, duplicated costs, and knowledge silos.

A centralized platform solves these problems by providing:

Unified telemetry pipeline — all services emit data through a consistent collection layer
Cross-team correlation — trace a request from frontend through 15 microservices without switching tools
Cost efficiency — shared infrastructure with centralized volume management and retention policies
Compliance & governance — consistent data access controls, audit trails, and retention enforcement
Developer velocity — teams onboard in minutes via self-service rather than weeks of infrastructure setup

The Observability Platform Team

Successful platforms require a dedicated team that operates the infrastructure and provides developer experience tooling. This team typically includes:


Observability Platform Team Composition

Role	Responsibilities	Typical Ratio
Platform Engineer	Infrastructure provisioning, scaling, upgrades, incident response for the platform itself	1 per 50–100 monitored services
Developer Experience Engineer	SDKs, instrumentation libraries, onboarding guides, internal documentation	1 per 200+ developers
Data Engineer	Pipeline optimization, cost analysis, retention policy management, data quality	1 per 500 TB/month ingested
Product Manager	Roadmap, stakeholder communication, adoption metrics, feature prioritization	1 per platform


Observability Maturity Model

Assess where your organization sits to determine the right level of platform investment:

Observability Maturity Levels

flowchart LR
    L1["Level 1<br/>Reactive<br/>Basic health checks"]
    L2["Level 2<br/>Proactive<br/>Metrics + alerts"]
    L3["Level 3<br/>Correlated<br/>Logs + traces + metrics"]
    L4["Level 4<br/>Predictive<br/>ML anomaly detection"]
    L5["Level 5<br/>Self-Healing<br/>Automated remediation"]
    L1 --> L2 --> L3 --> L4 --> L5

Most organizations building a Grafana platform are transitioning from Level 2 to Level 3 — moving from isolated metrics monitoring to correlated, multi-signal observability. The platform architecture decisions you make at this stage determine how easily you can progress to Levels 4 and 5.

Defining a Data Architecture

Before selecting tools or sizing infrastructure, define what data you need, how much you’ll generate, how long to keep it, and who owns it. These decisions drive every downstream architectural choice.

Telemetry Type Selection

Not every service needs every telemetry type. Define a tiering model based on service criticality:

# telemetry-tiers.yaml - Service instrumentation requirements
tiers:
  tier-1-critical:
    description: "Revenue-generating, customer-facing services"
    examples: ["payment-service", "checkout-api", "auth-service"]
    required_signals:
      - metrics: "RED + resource utilization + business KPIs"
      - logs: "Structured JSON, correlation IDs, request context"
      - traces: "100% sampling for errors, 10% head-based for success"
      - profiles: "Continuous CPU + memory profiling"
    retention:
      metrics: "13 months (for YoY comparison)"
      logs: "30 days hot, 90 days warm, 1 year cold"
      traces: "7 days full fidelity, 30 days sampled"
    slo_target: "99.95%"

  tier-2-important:
    description: "Internal services, batch processors, async workers"
    examples: ["email-sender", "report-generator", "data-pipeline"]
    required_signals:
      - metrics: "RED metrics + queue depth"
      - logs: "Structured JSON, error-level minimum"
      - traces: "5% head-based sampling"
    retention:
      metrics: "6 months"
      logs: "14 days hot, 30 days cold"
      traces: "3 days"
    slo_target: "99.9%"

  tier-3-best-effort:
    description: "Development tools, internal dashboards, experiments"
    examples: ["feature-flags-ui", "internal-wiki", "dev-sandbox"]
    required_signals:
      - metrics: "Basic UP/DOWN + request rate"
      - logs: "Error logs only"
    retention:
      metrics: "30 days"
      logs: "7 days"
    slo_target: "99.0%"
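
A tiering model is only useful if it can be checked. The sketch below assumes each service reports the set of signals it currently emits (the `emitted` argument is a hypothetical input, for example gathered from a service catalog) and returns whatever the tier still requires. The signal sets mirror the `required_signals` lists above.

```python
# Required signals per tier, mirroring telemetry-tiers.yaml above.
REQUIRED_SIGNALS = {
    "tier-1-critical": {"metrics", "logs", "traces", "profiles"},
    "tier-2-important": {"metrics", "logs", "traces"},
    "tier-3-best-effort": {"metrics", "logs"},
}


def missing_signals(tier: str, emitted: set[str]) -> set[str]:
    """Return the signals a service in `tier` must still add."""
    return REQUIRED_SIGNALS[tier] - emitted


if __name__ == "__main__":
    # Hypothetical service that only emits metrics and logs so far.
    print(sorted(missing_signals("tier-1-critical", {"metrics", "logs"})))
```

A check like this could run in CI or as part of onboarding, so gaps show up before a service reaches production.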

Cardinality & Volume Planning

Cardinality — the number of unique time series — is the primary cost driver for metrics backends like Mimir and Prometheus. A single poorly-labeled metric can generate millions of series and crash your cluster.

                            
Warning: A metric with labels user_id, request_path, and status_code across 1M users, 10K paths, and 5 status codes creates 50 billion potential series (1,000,000 × 10,000 × 5). Always use bounded label values. Replace high-cardinality identifiers with aggregated dimensions, for example a route template instead of the raw request path.
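
The arithmetic behind this warning is a product of distinct values per label. The following sketch computes that upper bound; the bounded label names and counts (`route_template`, `status_class`) are illustrative assumptions, not values from a real system.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Labels from the warning above.
unbounded = worst_case_series(
    {"user_id": 1_000_000, "request_path": 10_000, "status_code": 5}
)

# Hypothetical bounded alternative: route templates and status classes.
bounded = worst_case_series({"route_template": 200, "status_class": 5})

if __name__ == "__main__":
    print(f"unbounded labels: {unbounded:,} potential series")
    print(f"bounded labels:   {bounded:,} potential series")
```

Real series counts are usually well below the worst case because not every combination occurs, but the product is the number to budget against.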
                        

Use this formula for capacity estimation:

# Capacity estimation for Mimir/Prometheus
# Formula: active_series = services × metrics_per_service × label_combinations

# Example: 200 services, 50 metrics each, avg 20 label combinations
active_series=$((200 * 50 * 20))
echo "Estimated active series: $active_series"  # 200,000

# Storage estimate (assuming 2 bytes per sample, 15s scrape interval)
samples_per_day=$((active_series * 5760))  # 86400/15 = 5760 samples/day
bytes_per_day=$((samples_per_day * 2))
gb_per_day=$(echo "scale=2; $bytes_per_day / 1073741824" | bc)
echo "Storage per day: ${gb_per_day} GB"

# With 13 months (~395 days) of retention.
# Dividing by 1 makes bc apply scale=0 and drop the decimals.
echo "Total storage needed: $(echo "scale=0; $gb_per_day * 395 / 1" | bc) GB"

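For logs, traces, and profiles, volume planning is simpler because cost tracks bytes ingested rather than series count, but the numbers are larger. Typical ranges per signal: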

Component	Typical Volume	Cost Driver
Metrics (Mimir)	100K–10M active series	Cardinality (unique series count)
Logs (Loki)	10–500 GB/day	Ingestion volume (bytes/sec)
Traces (Tempo)	1–100 GB/day	Span count & sampling rate
Profiles (Pyroscope)	1–20 GB/day	Number of profiled services

Retention Policies

Design tiered storage to balance query performance against cost:

Tiered Storage Architecture

flowchart TD
    I["Ingestion Layer<br/>OTel Collector / Alloy"]
    H["Hot Storage<br/>SSD, < 7 days<br/>Fast queries"]
    W["Warm Storage<br/>HDD/S3, 7-90 days<br/>Acceptable latency"]
    C["Cold Storage<br/>S3 Glacier/Archive<br/>90+ days, slow access"]
    D["Delete<br/>Past retention window"]
    I --> H
    H -->|"Compaction & downsampling"| W
    W -->|"Lifecycle policy"| C
    C -->|"TTL expiry"| D

# Mimir compactor configuration for tiered retention
compactor:
  compaction_interval: 1h
  block_ranges:
    - 2h    # Level 1 blocks
    - 12h   # Level 2 blocks
    - 24h   # Level 3 blocks
  deletion_delay: 12h

# Default retention for tenants without an override
limits:
  compactor_blocks_retention_period: 395d   # 13 months

# Per-tenant retention overrides (runtime configuration file)
overrides:
  tenant_critical_services:
    max_global_series_per_user: 5000000
    ingestion_rate: 500000          # samples/sec
    compactor_blocks_retention_period: 395d
  tenant_development:
    max_global_series_per_user: 100000
    ingestion_rate: 50000
    compactor_blocks_retention_period: 30d
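
The tier boundaries in the diagram above reduce to a small lookup on data age. This is a minimal sketch of that logic, assuming the 7-day and 90-day boundaries from the diagram; how the exact boundary day is treated (here, day 7 already counts as warm) is an assumption.

```python
from datetime import timedelta

HOT_LIMIT = timedelta(days=7)    # "< 7 days" in the diagram
WARM_LIMIT = timedelta(days=90)  # "7-90 days" in the diagram


def storage_tier(age: timedelta, retention: timedelta) -> str:
    """Map the age of a block or chunk to a storage tier."""
    if age >= retention:
        return "delete"  # past the tenant's retention window
    if age < HOT_LIMIT:
        return "hot"
    if age < WARM_LIMIT:
        return "warm"
    return "cold"


if __name__ == "__main__":
    production = timedelta(days=395)  # 13-month retention
    for days in (3, 30, 200, 400):
        print(days, storage_tier(timedelta(days=days), production))
```

Checking retention first means a short-retention tenant (30 days, for instance) never reaches the cold tier at all, which is worth knowing when estimating archive storage costs.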

Data Ownership & Governance

Establish clear ownership boundaries for telemetry data:

Producers — service teams own the quality and correctness of their telemetry output
Pipeline — the platform team owns collection, routing, and transformation infrastructure
Storage — the platform team owns backends but tenants own their data within boundaries
Consumption — consuming teams own their dashboards and alerts but follow platform standards

Establishing System Architecture

Deployment Topology

The Grafana stack can be deployed in several topologies depending on scale and operational requirements:

Reference Architecture — Multi-Cluster Deployment

flowchart TD
    subgraph WC1["Workload Cluster 1"]
        A1["OTel Collector<br/>(DaemonSet)"]
        G1["Grafana Alloy<br/>(Gateway)"]
        A1 --> G1
    end
    subgraph WC2["Workload Cluster 2"]
        A2["OTel Collector<br/>(DaemonSet)"]
        G2["Grafana Alloy<br/>(Gateway)"]
        A2 --> G2
    end
    subgraph OC["Observability Cluster"]
        LB["Load Balancer"]
        MI["Mimir<br/>(Metrics)"]
        LO["Loki<br/>(Logs)"]
        TE["Tempo<br/>(Traces)"]
        PY["Pyroscope<br/>(Profiles)"]
        GR["Grafana<br/>(Visualization)"]
        LB --> MI
        LB --> LO
        LB --> TE
        LB --> PY
        GR --> MI
        GR --> LO
        GR --> TE
        GR --> PY
    end
    subgraph ST["Object Storage"]
        S3["S3 / GCS / Azure Blob"]
    end
    G1 --> LB
    G2 --> LB
    MI --> S3
    LO --> S3
    TE --> S3
    PY --> S3

Key topology decisions:

Pattern	When to Use	Trade-offs
Monolithic	< 100K series, single team	Simple ops, limited scale
Read/Write Split	100K–5M series, growing team	Independent scaling of reads vs writes
Microservices (Full)	> 5M series, multi-tenant	Maximum flexibility, complex operations
Grafana Cloud	Any scale, minimal ops team	Managed, cost per usage, less control
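
The table can be read as a simple decision rule. The sketch below encodes it under two assumptions that the table leaves open: either the series count or multi-tenancy alone is enough to justify full microservices mode, and a minimal operations team always points to Grafana Cloud.

```python
def recommend_topology(
    active_series: int,
    multi_tenant: bool = False,
    minimal_ops_team: bool = False,
) -> str:
    """Pick a deployment pattern using the thresholds from the table above."""
    if minimal_ops_team:
        return "Grafana Cloud"
    if active_series > 5_000_000 or multi_tenant:
        return "Microservices (Full)"
    if active_series >= 100_000:
        return "Read/Write Split"
    return "Monolithic"


if __name__ == "__main__":
    print(recommend_topology(50_000))
    print(recommend_topology(200_000))
    print(recommend_topology(8_000_000, multi_tenant=True))
```

Treat the output as a starting point for discussion; team skills and growth rate often matter as much as the current series count.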

Multi-Tenant Design

Multi-tenancy in the Grafana stack is achieved through tenant IDs propagated via HTTP headers. Each component enforces isolation at the data layer:

# Mimir multi-tenant configuration
multitenancy_enabled: true

# Tenant header used by all components
# X-Scope-OrgID header identifies the tenant
server:
  http_listen_port: 8080

distributor:
  ring:
    kvstore:
      store: memberlist
  # Per-tenant rate limiting
  instance_limits:
    max_ingestion_rate: 0        # Unlimited (use per-tenant overrides)

# Default limits applied to every tenant
limits:
  max_global_series_per_user: 1500000
  max_global_series_per_metric: 50000
  ingestion_rate: 200000
  ingestion_burst_size: 400000
  max_label_names_per_series: 30
  max_length_label_value: 2048

# Per-tenant overrides (runtime configuration file)
overrides:
  # Critical production tenant gets higher limits
  production:
    max_global_series_per_user: 10000000
    ingestion_rate: 1000000
    ingestion_burst_size: 2000000

  # Development tenant is constrained
  development:
    max_global_series_per_user: 200000
    ingestion_rate: 50000
    compactor_blocks_retention_period: 7d
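
Per-tenant overrides only need to list the limits that differ; everything else falls back to the defaults. The sketch below illustrates that merge with a subset of the values from the configuration above. It is a model of the behavior for reasoning about limits, not Mimir's own code.

```python
# Subset of the default limits from the configuration above.
DEFAULT_LIMITS = {
    "max_global_series_per_user": 1_500_000,
    "ingestion_rate": 200_000,
    "ingestion_burst_size": 400_000,
}

# Only the values each tenant overrides.
TENANT_OVERRIDES = {
    "production": {
        "max_global_series_per_user": 10_000_000,
        "ingestion_rate": 1_000_000,
        "ingestion_burst_size": 2_000_000,
    },
    "development": {
        "max_global_series_per_user": 200_000,
        "ingestion_rate": 50_000,
    },
}


def effective_limits(tenant: str) -> dict[str, int]:
    """Defaults first, then any tenant-specific overrides on top."""
    return {**DEFAULT_LIMITS, **TENANT_OVERRIDES.get(tenant, {})}


if __name__ == "__main__":
    for tenant in ("production", "development", "unknown-team"):
        print(tenant, effective_limits(tenant))
```

Note that the development tenant inherits the default burst size of 400,000 even though its sustained rate is lowered, which may or may not be the intent; listing the effective limits per tenant makes that kind of gap visible.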

                            
Tenant Isolation Strategy: Use the X-Scope-OrgID header for soft tenancy (shared infrastructure, logical separation) or deploy separate clusters per tenant for hard tenancy (complete isolation, higher cost). Most organizations start with soft tenancy and graduate critical tenants to dedicated infrastructure as they grow.
                        

Horizontal Scaling Patterns

Each component in the LGTM stack scales differently:

Mimir Scaling

# Mimir microservices mode - independent scaling per component
# Distributor: scales with ingestion rate (CPU-bound)
distributor:
  replicas: 3
  resources:
    requests: { cpu: "2", memory: "4Gi" }
    limits: { cpu: "4", memory: "8Gi" }

# Ingester: scales with active series (memory-bound)
ingester:
  replicas: 6
  resources:
    requests: { cpu: "2", memory: "16Gi" }
    limits: { cpu: "4", memory: "32Gi" }
  persistence:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi

# Querier: scales with query concurrency (CPU + memory)
querier:
  replicas: 4
  resources:
    requests: { cpu: "4", memory: "8Gi" }

# Store-gateway: scales with storage volume (memory for index cache)
store_gateway:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "16Gi" }
  persistence:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 200Gi

Loki Scaling

# Loki Simple Scalable Deployment (SSD mode)
# Read path: scales with query load
read:
  replicas: 3
  resources:
    requests: { cpu: "2", memory: "4Gi" }
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

# Write path: scales with ingestion volume
write:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "2Gi" }
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetMemoryUtilizationPercentage: 80

# Backend: compactor + index gateway
backend:
  replicas: 2
  resources:
    requests: { cpu: "1", memory: "4Gi" }

High Availability

Each component requires specific HA strategies:

Component	HA Mechanism	Minimum Replicas
Mimir Ingester	Replication factor 3, zone-aware	3 (across 3 AZs)
Loki Write	Replication factor 3	3
Tempo Ingester	Replication factor 3	3
Grafana	Stateless, shared database	2+
Alertmanager	Gossip-based clustering	3

# Zone-aware replication for Mimir ingesters
ingester:
  ring:
    replication_factor: 3
    zone_awareness_enabled: true
  # Ensure ingesters spread across availability zones
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: ingester
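
A replication factor of 3 matters because of quorum: a write is accepted once a majority of replicas acknowledge it. The sketch below shows the majority arithmetic that replicated hash rings of this kind typically use; confirm the exact behavior against each component's documentation before relying on it.

```python
def write_quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a write: a simple majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while writes still succeed."""
    return replication_factor - write_quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 3, 5):
        print(
            f"RF={rf}: quorum={write_quorum(rf)}, "
            f"tolerates {tolerated_failures(rf)} failed replica(s)"
        )
```

With zone-aware replication across three zones, each of the three replicas lands in a different zone, so losing one entire zone still leaves a quorum of two.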

Management & Automation

Capacity Planning

Observability platforms must monitor themselves. Deploy a separate “meta-monitoring” stack that watches the observability infrastructure:

# Meta-monitoring alerts for the observability platform
groups:
  - name: platform_capacity
    rules:
      - alert: MimirIngesterMemoryPressure
        expr: |
          container_memory_working_set_bytes{container="ingester"}
          / container_spec_memory_limit_bytes{container="ingester"} > 0.85
        for: 15m
        labels:
          severity: warning
          team: observability-platform
        annotations:
          summary: "Mimir ingester memory usage above 85%"
          runbook: "Scale ingesters or investigate high-cardinality tenants"

      - alert: LokiIngestionLagging
        expr: |
          rate(loki_distributor_bytes_received_total[5m])
          > rate(loki_ingester_chunks_flushed_total[5m]) * 1.2
        for: 10m
        labels:
          severity: critical
          team: observability-platform

      - alert: TenantCardinalityExplosion
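        # Aggregate per tenant. Each series is held by replication_factor
        # ingesters, so divide by 3 to approximate the unique series count.
        expr: |
          sum by (user) (cortex_ingester_active_series) / 3 > 2000000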
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tenant {{ $labels.user }} exceeding 2M active series"

Cost Optimization

The three largest cost centers for a Grafana observability platform are storage, compute (ingesters/queriers), and data transfer. Apply these strategies:


Cost Reduction Strategies by Impact

Strategy	Typical Savings	Effort
Reduce metric cardinality (drop unused labels)	30–60%	Medium
Implement tail-based trace sampling	40–70%	Medium
Tiered log retention (hot/warm/cold)	50–70%	Low
Downsampling old metrics (5m → 1h resolution)	20–40%	Low
Drop debug-level logs in production	20–50%	Low
Use recording rules for expensive queries	10–30% compute	Medium
Object storage lifecycle policies	30–50% storage	Low


# OTel Collector processor for cost optimization
processors:
  # Drop high-cardinality labels before sending to Mimir
  metricstransform:
    transforms:
      - include: http_request_duration_seconds
        action: update
        operations:
          # Keep only the bounded labels listed in label_set.
          # request_path is not listed, so it is aggregated away.
          - action: aggregate_labels
            label_set: [method, status_code, service]
            aggregation_type: sum

  # Tail-based sampling for traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample-remainder
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

  # Filter unnecessary log levels
  filter:
    logs:
      exclude:
        match_type: strict
        bodies:
          - "health check"
          - "readiness probe"
        severity_texts:
          - "DEBUG"
          - "TRACE"
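
The three tail-sampling policies above combine as a logical OR: a trace is kept if any policy matches. The sketch below models that decision, assuming a simplified `Trace` record (a hypothetical type for illustration) and a hash of the trace ID in place of the collector's own probabilistic sampler.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(
    trace: Trace,
    latency_threshold_ms: float = 2000,
    sample_pct: float = 5.0,
) -> bool:
    """Keep errors and slow requests; sample the remainder."""
    if trace.has_error:
        return True  # errors-always
    if trace.duration_ms > latency_threshold_ms:
        return True  # slow-requests
    # sample-remainder: map the trace ID to a bucket in [0, 10000)
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_pct * 100


if __name__ == "__main__":
    print(keep_trace(Trace("a1", has_error=True, duration_ms=12)))
    print(keep_trace(Trace("b2", has_error=False, duration_ms=3500)))
```

Because the decision needs the complete trace, all spans of a trace must reach the same collector instance, which is the reason for the `decision_wait` buffer in the configuration.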

Operational Runbooks

Every alert must have a corresponding runbook. Structure them consistently:

# runbook-template.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: runbook-mimir-ingester-memory
data:
  runbook.md: |
    # Mimir Ingester Memory Pressure

    ## Symptoms
    - Alert: MimirIngesterMemoryPressure fired
    - Ingester pods showing high memory usage (> 85%)
    - Potential OOMKill risk

    ## Investigation Steps
    1. Identify which tenant is causing growth:
       `sum by (user) (cortex_ingester_active_series) > 1000000`
    2. Check for cardinality spikes (active series is a gauge, so use delta rather than rate):
       `delta(cortex_ingester_active_series[1h]) > 10000`
    3. Review recent deployments that may have added labels

    ## Remediation
    - **Immediate**: Scale ingesters horizontally
      `kubectl scale statefulset mimir-ingester --replicas=N+2`
    - **Short-term**: Apply per-tenant series limits
    - **Long-term**: Work with offending team to reduce cardinality

    ## Escalation
    - If > 95% memory after scaling: Page on-call lead
    - If data loss suspected: Invoke incident process

Developing a Proof of Concept

Scoping the PoC

A well-scoped PoC validates key architectural decisions without over-investing. Target 2–4 weeks duration with 3–5 representative services:

1 high-traffic service — validates ingestion scale and query performance
1 multi-dependency service — validates distributed tracing correlation
1 batch/async service — validates log aggregation for non-HTTP workloads
Infrastructure layer — validates Kubernetes metrics collection

Success Metrics

Define measurable outcomes before starting:

Metric	Target	How to Measure
Mean Time to Detection (MTTD)	< 5 minutes	Inject known failure, measure alert latency
Mean Time to Investigate (MTTI)	< 15 minutes	Time from alert to identifying root cause
Developer Onboarding	< 30 minutes	New service emitting all 3 signals
Query Latency (p99)	< 3 seconds	Dashboard load time for 24h range
Data Completeness	> 99.5%	No gaps in metrics/traces for instrumented services
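
Writing the targets down as data makes the PoC review mechanical. The sketch below encodes the table above and compares it with measured values; the metric keys and the sample measurements are hypothetical names chosen for illustration.

```python
# Targets from the success metrics table: (comparison, threshold).
TARGETS = {
    "mttd_minutes": ("<", 5),
    "mtti_minutes": ("<", 15),
    "onboarding_minutes": ("<", 30),
    "query_p99_seconds": ("<", 3),
    "data_completeness_pct": (">", 99.5),
}


def evaluate(measured: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail per success metric."""
    results = {}
    for name, (op, target) in TARGETS.items():
        value = measured[name]
        results[name] = value < target if op == "<" else value > target
    return results


if __name__ == "__main__":
    # Hypothetical PoC measurements.
    poc = {
        "mttd_minutes": 3.5,
        "mtti_minutes": 22,
        "onboarding_minutes": 25,
        "query_p99_seconds": 2.1,
        "data_completeness_pct": 99.8,
    }
    for metric, passed in evaluate(poc).items():
        print(f"{metric}: {'PASS' if passed else 'FAIL'}")
```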

Phased Rollout Strategy

Platform Rollout Phases

flowchart LR
    P1["Phase 1<br/>Foundation<br/>2-4 weeks"]
    P2["Phase 2<br/>Early Adopters<br/>4-6 weeks"]
    P3["Phase 3<br/>Broad Adoption<br/>8-12 weeks"]
    P4["Phase 4<br/>Full Production<br/>Ongoing"]
    P1 -->|"3-5 services"| P2
    P2 -->|"20-30 services"| P3
    P3 -->|"All services"| P4

Phase 1: Deploy core stack, instrument PoC services, validate data quality
Phase 2: Onboard willing teams, build self-service tooling, establish standards
Phase 3: Mandatory adoption for new services, migration support for legacy
Phase 4: Advanced features (profiling, ML anomaly detection, self-healing)

Containerization & Virtualization

Kubernetes Deployment

The Grafana LGTM stack is designed for Kubernetes. Each component ships official Helm charts and can run as StatefulSets (ingesters, store-gateways) or Deployments (distributors, queriers, frontends):

# Deploy the full Grafana observability stack with Helm
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy Mimir (metrics backend)
helm install mimir grafana/mimir-distributed \
  --namespace observability \
  --create-namespace \
  --values mimir-values.yaml

# Deploy Loki (logs backend)
helm install loki grafana/loki \
  --namespace observability \
  --values loki-values.yaml

# Deploy Tempo (traces backend)
helm install tempo grafana/tempo-distributed \
  --namespace observability \
  --values tempo-values.yaml

# Deploy Grafana (visualization)
helm install grafana grafana/grafana \
  --namespace observability \
  --values grafana-values.yaml

# Deploy Alloy (collector)
helm install alloy grafana/alloy \
  --namespace observability \
  --values alloy-values.yaml

Helm Charts & Operators

For production deployments, use the distributed Helm charts which deploy components as separate microservices:

# mimir-values.yaml - Production configuration
global:
  extraEnvFrom:
    - secretRef:
        name: mimir-s3-credentials

mimir:
  structuredConfig:
    multitenancy_enabled: true
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          bucket_name: observability-mimir-blocks
          region: us-east-1

ingester:
  replicas: 6
  persistentVolume:
    enabled: true
    size: 100Gi
    storageClass: gp3
  zoneAwareReplication:
    enabled: true
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1c

distributor:
  replicas: 3

querier:
  replicas: 4

query_frontend:
  replicas: 2

compactor:
  replicas: 1
  persistentVolume:
    size: 200Gi

Resource Management

Proper resource requests and limits prevent noisy-neighbor problems and ensure predictable performance:

# Resource guidelines per component (per replica)
# Adjust based on your actual workload after PoC benchmarking

resources:
  mimir_ingester:
    requests: { cpu: "2", memory: "16Gi" }
    limits: { cpu: "4", memory: "32Gi" }
    notes: "Memory scales with active series. ~1KB per series."

  mimir_distributor:
    requests: { cpu: "2", memory: "2Gi" }
    limits: { cpu: "4", memory: "4Gi" }
    notes: "CPU-bound. Scales with samples/sec ingestion rate."

  loki_write:
    requests: { cpu: "1", memory: "2Gi" }
    limits: { cpu: "2", memory: "4Gi" }
    notes: "Memory scales with chunk buffer size."

  loki_read:
    requests: { cpu: "2", memory: "4Gi" }
    limits: { cpu: "4", memory: "8Gi" }
    notes: "CPU/memory scale with query complexity and parallelism."

  tempo_ingester:
    requests: { cpu: "1", memory: "4Gi" }
    limits: { cpu: "2", memory: "8Gi" }
    notes: "Memory scales with trace buffer before flush."

  grafana:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "2", memory: "2Gi" }
    notes: "Stateless. Scale replicas for concurrent users."
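
The "~1KB per series" note for ingesters turns into a quick sizing check. The sketch below assumes the article's rule of thumb of 1 KiB per in-memory series and that series are spread evenly across ingesters; real memory use also includes query and write-ahead-log overhead, so leave headroom. The 8 GiB budget in the example is an arbitrary illustrative value.

```python
import math

BYTES_PER_SERIES = 1024  # rule of thumb from the guidelines above


def ingester_memory_bytes(
    active_series: int, replication_factor: int, replicas: int
) -> float:
    """Approximate series memory held by each ingester."""
    series_per_ingester = active_series * replication_factor / replicas
    return series_per_ingester * BYTES_PER_SERIES


def min_ingesters(
    active_series: int, replication_factor: int, memory_budget_bytes: int
) -> int:
    """Smallest replica count that keeps series memory within the budget."""
    total = active_series * replication_factor * BYTES_PER_SERIES
    needed = math.ceil(total / memory_budget_bytes)
    # Never go below the replication factor itself.
    return max(replication_factor, needed)


if __name__ == "__main__":
    gib = 1024**3
    print(ingester_memory_bytes(200_000, 3, 6) / gib, "GiB per ingester")
    print(min_ingesters(10_000_000, 3, 8 * gib), "ingesters for 10M series")
```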

Setting the Right Access Levels

RBAC in Grafana

Grafana Enterprise and Grafana Cloud provide fine-grained role-based access control. Design roles around the observability personas identified in Part 1:

Role	Persona	Permissions
Platform Admin	Ophelia Operator	Full control: data sources, users, orgs, plugins, API keys
Team Lead	Masha Manager	Manage team folders, create/edit dashboards, manage alerts
Developer	Diego Developer	View all dashboards, edit team dashboards, create personal dashboards
Service Account	CI/CD pipelines	Provisioning: create/update dashboards and alerts via API
Viewer	Pelé Product / Stakeholders	View specific folders, no edit permissions

# Grafana RBAC configuration via provisioning
# File: provisioning/access-control/roles.yaml
apiVersion: 1
roles:
  - name: "team-developer"
    description: "Standard developer role for service teams"
    permissions:
      - action: "dashboards:read"
        scope: "folders:*"
      - action: "dashboards:write"
        scope: "folders:uid:team-${team_name}"
      - action: "dashboards:create"
        scope: "folders:uid:team-${team_name}"
      - action: "datasources:query"
        scope: "datasources:*"
      - action: "alerting.rules:read"
        scope: "folders:*"
      - action: "alerting.rules:write"
        scope: "folders:uid:team-${team_name}"

  - name: "platform-admin"
    description: "Full platform administration"
    permissions:
      - action: "*"
        scope: "*"
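
To reason about what a role grants, it helps to model the action and scope check. The sketch below uses simple glob matching as a stand-in; it is a simplified model for illustration and not Grafana's actual permission evaluator. The `developer_permissions` helper is hypothetical and expands the `team-developer` role above for one team.

```python
from fnmatch import fnmatchcase


def developer_permissions(team_name: str) -> list[dict[str, str]]:
    """Expand part of the team-developer role for a concrete team."""
    folder = f"folders:uid:team-{team_name}"
    return [
        {"action": "dashboards:read", "scope": "folders:*"},
        {"action": "dashboards:write", "scope": folder},
        {"action": "dashboards:create", "scope": folder},
        {"action": "datasources:query", "scope": "datasources:*"},
    ]


def is_allowed(permissions: list[dict[str, str]], action: str, scope: str) -> bool:
    """True if any permission matches both the action and the scope."""
    return any(
        fnmatchcase(action, p["action"]) and fnmatchcase(scope, p["scope"])
        for p in permissions
    )


if __name__ == "__main__":
    perms = developer_permissions("payments")
    print(is_allowed(perms, "dashboards:write", "folders:uid:team-payments"))
    print(is_allowed(perms, "dashboards:write", "folders:uid:team-catalog"))
```

A table-driven test like this, run against the provisioned role files, can catch a role that accidentally grants write access outside a team's folder.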

Data Source Permissions

Restrict which teams can query which data sources to enforce tenant isolation at the visualization layer:

# Data source provisioning with team-based access
apiVersion: 1
datasources:
  - name: "Mimir - Production"
    type: prometheus
    url: http://mimir-query-frontend.observability:8080/prometheus
    access: proxy
    jsonData:
      httpHeaderName1: "X-Scope-OrgID"
    secureJsonData:
      httpHeaderValue1: "production"
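    # Intent: only platform-admin and production teams can query.
    # Data source permissions are a Grafana Enterprise/Cloud feature and are
    # usually applied through the UI or the permissions API; treat the block
    # below as a sketch of the desired state.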
    permissions:
      - teamId: 1   # platform-admin
        permission: 2  # Admin
      - teamId: 5   # backend-team
        permission: 1  # Query

  - name: "Mimir - Development"
    type: prometheus
    url: http://mimir-query-frontend.observability:8080/prometheus
    access: proxy
    jsonData:
      httpHeaderName1: "X-Scope-OrgID"
    secureJsonData:
      httpHeaderValue1: "development"
    permissions:
      - teamId: 1
        permission: 2
      - teamId: 10  # all-developers
        permission: 1

Organization & Folder Structure

Design a folder hierarchy that maps to your organizational structure and access patterns:

Dashboard Folder Hierarchy

flowchart TD
    R["Root"]
    R --> PL["Platform<br/>(Platform team only)"]
    R --> SH["Shared<br/>(All viewers)"]
    R --> T1["Team: Payments<br/>(Team RBAC)"]
    R --> T2["Team: Catalog<br/>(Team RBAC)"]
    R --> T3["Team: Auth<br/>(Team RBAC)"]
    PL --> PL1["Infrastructure Health"]
    PL --> PL2["Platform SLOs"]
    PL --> PL3["Cost & Capacity"]
    SH --> SH1["Service Overview"]
    SH --> SH2["Business KPIs"]
    T1 --> T1A["Service Dashboards"]
    T1 --> T1B["Team Alerts"]
    T1 --> T1C["Debug / Ad-hoc"]

Sending Telemetry to Multiple Consumers

Fan-Out Architectures

Production observability platforms often need to send telemetry to multiple destinations — a primary Grafana stack for real-time monitoring, a data lake for long-term analytics, and potentially a security team’s SIEM:

# OTel Collector with fan-out to multiple backends
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 10000
    timeout: 5s

  # Clone metrics for different destinations
  routing:
    default_exporters: [prometheusremotewrite/mimir]
    table:
      - statement: route() where attributes["service.name"] == "payment-service"
        exporters: [prometheusremotewrite/mimir, prometheusremotewrite/security_siem]

exporters:
  # Primary: Grafana Mimir
  prometheusremotewrite/mimir:
    endpoint: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: production

  # Secondary: Data Lake (for ML / long-term analytics)
  otlp/datalake:
    endpoint: analytics-collector.data-team:4317
    compression: zstd

  # Security: SIEM for audit trails
  prometheusremotewrite/security_siem:
    endpoint: https://siem.internal/api/v1/metrics
    headers:
      Authorization: "Bearer ${SIEM_TOKEN}"

  # Logs fan-out
  loki/primary:
    endpoint: http://loki-distributor:3100/loki/api/v1/push
    headers:
      X-Scope-OrgID: production

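  loki/compliance:
    endpoint: http://loki-compliance:3100/loki/api/v1/push
    headers:
      X-Scope-OrgID: compliance

  # Traces: Tempo (referenced by the traces pipeline below).
  # The endpoint is an assumed in-cluster service name.
  otlp/tempo:
    endpoint: tempo-distributor:4317
    tls:
      insecure: true
    headers:
      X-Scope-OrgID: production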

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite/mimir, otlp/datalake]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki/primary, loki/compliance]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/datalake]

Sampling Strategies

When sending to multiple consumers with different fidelity requirements, apply sampling at the collector layer:

Consumer	Sampling Strategy	Rationale
Real-time alerting (Mimir)	100% of metrics	Alerts need complete data
Trace investigation (Tempo)	100% errors + 5–10% success	Must capture all failures
Data lake / ML training	1–5% uniform sample	Statistical significance sufficient
Compliance / audit	100% for regulated services	Regulatory requirement
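
The per-consumer rules in the table can be expressed as one routing function. The sketch below is a model of the decision, not collector configuration: the consumer names, the default rates (10% success traces, 5% data lake), and the per-consumer salted hash are assumptions chosen so that each destination samples independently.

```python
import hashlib


def _bucket(consumer: str, record_id: str) -> float:
    """Deterministic value in [0, 1), salted per consumer."""
    digest = hashlib.sha256(f"{consumer}:{record_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def destinations(
    record_id: str,
    is_error: bool,
    regulated: bool,
    success_rate: float = 0.10,
    datalake_rate: float = 0.05,
) -> list[str]:
    """Return the consumers that should receive this trace."""
    targets = []
    # Trace investigation: every error plus a share of successes.
    if is_error or _bucket("tempo", record_id) < success_rate:
        targets.append("tempo")
    # Data lake: uniform sample regardless of outcome.
    if _bucket("datalake", record_id) < datalake_rate:
        targets.append("datalake")
    # Compliance: everything from regulated services.
    if regulated:
        targets.append("compliance")
    return targets


if __name__ == "__main__":
    print(destinations("trace-42", is_error=True, regulated=True))
```

Salting the hash per consumer keeps the data lake sample from being a strict subset of the Tempo sample, which matters if the data lake is used for unbiased statistics.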

Data Pipeline Design

For high-volume environments, add a message queue between collectors and backends to absorb bursts and enable replay:

Buffered Pipeline with Kafka

flowchart LR
    A["OTel Collectors<br/>(Edge)"]
    K["Kafka / Pulsar<br/>(Buffer)"]
    C1["Consumer: Mimir<br/>(Real-time metrics)"]
    C2["Consumer: Loki<br/>(Real-time logs)"]
    C3["Consumer: Data Lake<br/>(Analytics)"]
    C4["Consumer: SIEM<br/>(Security)"]
    A --> K
    K --> C1
    K --> C2
    K --> C3
    K --> C4

                            
When to Add Kafka: Introduce a message queue when (1) ingestion rates exceed 1M events/sec, (2) you need replay capability for backfilling, (3) multiple independent consumers need the same data, or (4) you need to decouple producers from backends for independent scaling. For most organizations under 500 services, direct push from OTel Collector to backends is simpler and sufficient.
                        

Summary & Next Steps

Architecting an observability platform requires thinking beyond individual tools to consider the entire system holistically:

Platform mindset — treat observability as an internal product with clear personas, SLOs, and a roadmap
Data architecture — define telemetry tiers, cardinality budgets, and retention policies before deploying infrastructure
System architecture — choose the right deployment topology and scaling patterns for your scale
Multi-tenancy — use X-Scope-OrgID headers with per-tenant limits for cost control and isolation
Access control — design RBAC, data source permissions, and folder structures around team boundaries
Fan-out pipelines — route telemetry to multiple consumers with appropriate sampling per destination
Phased rollout — start with a PoC, validate success metrics, then expand systematically

Next in the Series

In Part 12: Real User Monitoring with Grafana, we’ll explore frontend observability — capturing Web Vitals, tracking user sessions with Grafana Faro, correlating frontend errors with backend traces, and building RUM dashboards that connect user experience to infrastructure health.
