The Platform Mindset
Building an observability platform is fundamentally different from deploying individual monitoring tools. A platform provides self-service capabilities to multiple teams, enforces consistent standards, abstracts infrastructure complexity, and scales gracefully as the organization grows. Think of it as the difference between each team running their own Prometheus instance versus providing a managed metrics service that handles ingestion, storage, querying, and visualization transparently.
Why Build a Platform?
Organizations typically evolve through stages of observability adoption. Early on, individual teams install tools ad-hoc — one team uses Datadog, another deploys Prometheus manually, a third relies on cloud-native metrics. This fragmentation creates blind spots at service boundaries, inconsistent alerting thresholds, duplicated costs, and knowledge silos.
A centralized platform solves these problems by providing:
- Unified telemetry pipeline — all services emit data through a consistent collection layer
- Cross-team correlation — trace a request from frontend through 15 microservices without switching tools
- Cost efficiency — shared infrastructure with centralized volume management and retention policies
- Compliance & governance — consistent data access controls, audit trails, and retention enforcement
- Developer velocity — teams onboard in minutes via self-service rather than weeks of infrastructure setup
The Observability Platform Team
Successful platforms require a dedicated team that operates the infrastructure and provides developer experience tooling. This team typically includes:
Observability Platform Team Composition
| Role | Responsibilities | Typical Ratio |
|---|---|---|
| Platform Engineer | Infrastructure provisioning, scaling, upgrades, incident response for the platform itself | 1 per 50–100 monitored services |
| Developer Experience Engineer | SDKs, instrumentation libraries, onboarding guides, internal documentation | 1 per 200+ developers |
| Data Engineer | Pipeline optimization, cost analysis, retention policy management, data quality | 1 per 500 TB/month ingested |
| Product Manager | Roadmap, stakeholder communication, adoption metrics, feature prioritization | 1 per platform |
Observability Maturity Model
Assess where your organization sits to determine the right level of platform investment:
flowchart LR
L1["Level 1
Reactive
Basic health checks"]
L2["Level 2
Proactive
Metrics + alerts"]
L3["Level 3
Correlated
Logs + traces + metrics"]
L4["Level 4
Predictive
ML anomaly detection"]
L5["Level 5
Self-Healing
Automated remediation"]
L1 --> L2 --> L3 --> L4 --> L5
Most organizations building a Grafana platform are transitioning from Level 2 to Level 3 — moving from isolated metrics monitoring to correlated, multi-signal observability. The platform architecture decisions you make at this stage determine how easily you can progress to Levels 4 and 5.
Defining a Data Architecture
Before selecting tools or sizing infrastructure, define what data you need, how much you’ll generate, how long to keep it, and who owns it. These decisions drive every downstream architectural choice.
Telemetry Type Selection
Not every service needs every telemetry type. Define a tiering model based on service criticality:
# telemetry-tiers.yaml - Service instrumentation requirements
tiers:
  tier-1-critical:
    description: "Revenue-generating, customer-facing services"
    examples: ["payment-service", "checkout-api", "auth-service"]
    required_signals:
      - metrics: "RED + resource utilization + business KPIs"
      - logs: "Structured JSON, correlation IDs, request context"
      - traces: "100% sampling for errors, 10% head-based for success"
      - profiles: "Continuous CPU + memory profiling"
    retention:
      metrics: "13 months (for YoY comparison)"
      logs: "30 days hot, 90 days warm, 1 year cold"
      traces: "7 days full fidelity, 30 days sampled"
    slo_target: "99.95%"
  tier-2-important:
    description: "Internal services, batch processors, async workers"
    examples: ["email-sender", "report-generator", "data-pipeline"]
    required_signals:
      - metrics: "RED metrics + queue depth"
      - logs: "Structured JSON, error-level minimum"
      - traces: "5% head-based sampling"
    retention:
      metrics: "6 months"
      logs: "14 days hot, 30 days cold"
      traces: "3 days"
    slo_target: "99.9%"
  tier-3-best-effort:
    description: "Development tools, internal dashboards, experiments"
    examples: ["feature-flags-ui", "internal-wiki", "dev-sandbox"]
    required_signals:
      - metrics: "Basic UP/DOWN + request rate"
      - logs: "Error logs only"
    retention:
      metrics: "30 days"
      logs: "7 days"
    slo_target: "99.0%"
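A tiering model like this is easiest to enforce when it can be checked mechanically, for example during onboarding. The following is a minimal sketch in Python, assuming the required signal types per tier from the model above; the `REQUIRED_SIGNALS` table and the `missing_signals` helper are illustrative names, not part of any Grafana tooling.

```python
# Required signal types per tier, mirroring the tiering model above.
REQUIRED_SIGNALS = {
    "tier-1-critical": {"metrics", "logs", "traces", "profiles"},
    "tier-2-important": {"metrics", "logs", "traces"},
    "tier-3-best-effort": {"metrics", "logs"},
}


def missing_signals(tier: str, emitted: set[str]) -> list[str]:
    """Return the required signals a service does not emit yet (sorted for stable output)."""
    return sorted(REQUIRED_SIGNALS[tier] - emitted)


# A tier-1 service that only emits metrics and logs still owes traces and profiles.
print(missing_signals("tier-1-critical", {"metrics", "logs"}))
```

A check like this could run in CI or in an onboarding script, so that gaps surface before a service is promoted to a higher tier.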
Cardinality & Volume Planning
Cardinality — the number of unique time series — is the primary cost driver for metrics backends like Mimir and Prometheus. A single poorly-labeled metric can generate millions of series and crash your cluster.
For example, a single metric labeled with user_id, request_path, and status_code across 1M users, 10K paths, and 5 status codes creates up to 50 billion potential series. Always use bounded label values, and replace high-cardinality identifiers with aggregated dimensions.
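The arithmetic behind that figure is a plain product of per-label cardinalities. The sketch below reproduces it and contrasts it with a bounded alternative; the `route_template` label and its count of 50 are hypothetical values chosen only for illustration.

```python
from math import prod


def potential_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


unbounded = potential_series(
    {"user_id": 1_000_000, "request_path": 10_000, "status_code": 5}
)
# Hypothetical bounded variant: route templates instead of raw paths, no user_id label.
bounded = potential_series({"route_template": 50, "status_code": 5})

print(f"unbounded: {unbounded:,} series, bounded: {bounded:,} series")
```

The real series count is usually lower than the upper bound because not every label combination occurs, but the product is the number to budget against.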
Use this formula for capacity estimation:
# Capacity estimation for Mimir/Prometheus
# Formula: active_series = services × metrics_per_service × label_combinations
# Example: 200 services, 50 metrics each, avg 20 label combinations
active_series=$((200 * 50 * 20))
echo "Estimated active series: $active_series" # 200,000
# Storage estimate (assuming 2 bytes per sample, 15s scrape interval)
samples_per_day=$((active_series * 5760)) # 86400/15 = 5760 samples/day
bytes_per_day=$((samples_per_day * 2))
gb_per_day=$(echo "scale=2; $bytes_per_day / 1073741824" | bc)
echo "Storage per day: ${gb_per_day} GB"
# With 13 month retention:
echo "Total storage needed: $(echo "scale=0; $gb_per_day * 395" | bc) GB"
For logs, traces, and profiles, volume planning is simpler, but the numbers are larger. Typical volumes and cost drivers per signal:
| Component | Typical Volume | Cost Driver |
|---|---|---|
| Metrics (Mimir) | 100K–10M active series | Cardinality (unique series count) |
| Logs (Loki) | 10–500 GB/day | Ingestion volume (bytes/sec) |
| Traces (Tempo) | 1–100 GB/day | Span count & sampling rate |
| Profiles (Pyroscope) | 1–20 GB/day | Number of profiled services |
Retention Policies
Design tiered storage to balance query performance against cost:
flowchart TD
I["Ingestion Layer
OTel Collector / Alloy"]
H["Hot Storage
SSD, < 7 days
Fast queries"]
W["Warm Storage
HDD/S3, 7-90 days
Acceptable latency"]
C["Cold Storage
S3 Glacier/Archive
90+ days, slow access"]
D["Delete
Past retention window"]
I --> H
H -->|"Compaction & downsampling"| W
W -->|"Lifecycle policy"| C
C -->|"TTL expiry"| D
# Mimir compactor and retention configuration
compactor:
  compaction_interval: 1h
  block_ranges:
    - 2h   # Level 1 blocks
    - 12h  # Level 2 blocks
    - 24h  # Level 3 blocks
  deletion_delay: 12h
# Default retention for all tenants (enforced by the compactor)
limits:
  compactor_blocks_retention_period: 395d  # 13 months
# Per-tenant retention overrides (runtime configuration)
overrides:
  tenant_critical_services:
    max_global_series_per_user: 5000000
    ingestion_rate: 500000  # samples/sec
    compactor_blocks_retention_period: 395d
  tenant_development:
    max_global_series_per_user: 100000
    ingestion_rate: 50000
    compactor_blocks_retention_period: 30d
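The age boundaries in the tiering diagram can be expressed as a small lookup, which is handy when reasoning about where a given query will land. This is a minimal sketch: the function name `storage_tier` is hypothetical, the 395-day window is the metrics retention used in this section, and treating each boundary as an inclusive lower bound is an assumption.

```python
from datetime import timedelta

# Boundaries taken from the tiering diagram; retention matches the 13-month window.
HOT_MAX = timedelta(days=7)
WARM_MAX = timedelta(days=90)
RETENTION = timedelta(days=395)


def storage_tier(age: timedelta) -> str:
    """Map the age of a block of telemetry to the tier it should live in."""
    if age < HOT_MAX:
        return "hot"
    if age < WARM_MAX:
        return "warm"
    if age < RETENTION:
        return "cold"
    return "delete"


for days in (1, 30, 200, 400):
    print(days, storage_tier(timedelta(days=days)))
```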
Data Ownership & Governance
Establish clear ownership boundaries for telemetry data:
- Producers — service teams own the quality and correctness of their telemetry output
- Pipeline — the platform team owns collection, routing, and transformation infrastructure
- Storage — the platform team owns backends but tenants own their data within boundaries
- Consumption — consuming teams own their dashboards and alerts but follow platform standards
Establishing System Architecture
Deployment Topology
The Grafana stack can be deployed in several topologies depending on scale and operational requirements:
flowchart TD
subgraph WC1["Workload Cluster 1"]
A1["OTel Collector
(DaemonSet)"]
G1["Grafana Alloy
(Gateway)"]
A1 --> G1
end
subgraph WC2["Workload Cluster 2"]
A2["OTel Collector
(DaemonSet)"]
G2["Grafana Alloy
(Gateway)"]
A2 --> G2
end
subgraph OC["Observability Cluster"]
LB["Load Balancer"]
MI["Mimir
(Metrics)"]
LO["Loki
(Logs)"]
TE["Tempo
(Traces)"]
PY["Pyroscope
(Profiles)"]
GR["Grafana
(Visualization)"]
LB --> MI
LB --> LO
LB --> TE
LB --> PY
GR --> MI
GR --> LO
GR --> TE
GR --> PY
end
subgraph ST["Object Storage"]
S3["S3 / GCS / Azure Blob"]
end
G1 --> LB
G2 --> LB
MI --> S3
LO --> S3
TE --> S3
PY --> S3
Key topology decisions:
| Pattern | When to Use | Trade-offs |
|---|---|---|
| Monolithic | < 100K series, single team | Simple ops, limited scale |
| Read/Write Split | 100K–5M series, growing team | Independent scaling of reads vs writes |
| Microservices (Full) | > 5M series, multi-tenant | Maximum flexibility, complex operations |
| Grafana Cloud | Any scale, minimal ops team | Managed, cost per usage, less control |
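The table reads naturally as a decision function. The sketch below encodes its thresholds directly; `suggest_topology` is a hypothetical helper, and a real decision would also weigh team skills, budget, and compliance constraints that the table does not capture.

```python
def suggest_topology(
    active_series: int,
    multi_tenant: bool = False,
    minimal_ops_team: bool = False,
) -> str:
    """Pick a deployment pattern using the thresholds from the topology table."""
    if minimal_ops_team:
        return "Grafana Cloud"
    if multi_tenant or active_series > 5_000_000:
        return "Microservices (Full)"
    if active_series >= 100_000:
        return "Read/Write Split"
    return "Monolithic"


print(suggest_topology(50_000))
print(suggest_topology(1_000_000))
print(suggest_topology(8_000_000))
```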
Multi-Tenant Design
Multi-tenancy in the Grafana stack is achieved through tenant IDs propagated via HTTP headers. Each component enforces isolation at the data layer:
# Mimir multi-tenant configuration
multitenancy_enabled: true
# The X-Scope-OrgID HTTP header identifies the tenant in every component
server:
  http_listen_port: 8080
distributor:
  ring:
    kvstore:
      store: memberlist
  # Per-instance limit (not per tenant)
  instance_limits:
    max_ingestion_rate: 0  # Unlimited (use per-tenant limits instead)
# Default limits applied to every tenant
limits:
  max_global_series_per_user: 1500000
  max_global_series_per_metric: 50000
  ingestion_rate: 200000
  ingestion_burst_size: 400000
  max_label_names_per_series: 30
  max_label_value_length: 2048
# Per-tenant overrides (runtime configuration)
overrides:
  # Critical production tenant gets higher limits
  production:
    max_global_series_per_user: 10000000
    ingestion_rate: 1000000
    ingestion_burst_size: 2000000
  # Development tenant is constrained
  development:
    max_global_series_per_user: 200000
    ingestion_rate: 50000
    compactor_blocks_retention_period: 7d
Use the X-Scope-OrgID header for soft tenancy (shared infrastructure, logical separation), or deploy separate clusters per tenant for hard tenancy (complete isolation, higher cost). Most organizations start with soft tenancy and graduate critical tenants to dedicated infrastructure as they grow.
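From a client's point of view, soft tenancy amounts to attaching the tenant header to every request. The sketch below builds the URL and headers for an instant query without sending it; the `tenant_query` helper is hypothetical, and the base URL is the query-frontend address used in the data source examples in this article.

```python
from urllib.parse import urlencode

TENANT_HEADER = "X-Scope-OrgID"


def tenant_query(base_url: str, tenant: str, promql: str) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for a Prometheus-style instant query scoped to one tenant."""
    if not tenant:
        raise ValueError("tenant ID must not be empty when multi-tenancy is enabled")
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    return url, {TENANT_HEADER: tenant}


url, headers = tenant_query(
    "http://mimir-query-frontend.observability:8080/prometheus",
    tenant="development",
    promql="up",
)
print(url)
print(headers)
```

The returned pair can be handed to any HTTP client, for instance `urllib.request.Request(url, headers=headers)`. The same header applies on the write path.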
Horizontal Scaling Patterns
Each component in the LGTM stack scales differently:
Mimir Scaling
# Mimir microservices mode - independent scaling per component
# Distributor: scales with ingestion rate (CPU-bound)
distributor:
  replicas: 3
  resources:
    requests: { cpu: "2", memory: "4Gi" }
    limits: { cpu: "4", memory: "8Gi" }
# Ingester: scales with active series (memory-bound)
ingester:
  replicas: 6
  resources:
    requests: { cpu: "2", memory: "16Gi" }
    limits: { cpu: "4", memory: "32Gi" }
  persistence:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
# Querier: scales with query concurrency (CPU + memory)
querier:
  replicas: 4
  resources:
    requests: { cpu: "4", memory: "8Gi" }
# Store-gateway: scales with storage volume (memory for index cache)
store_gateway:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "16Gi" }
  persistence:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 200Gi
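Because ingesters scale with active series, a first replica count can be derived from the series estimate and the replication factor. The following is a rough sketch only: `series_per_ingester` is a hypothetical per-replica capacity that should be replaced with a figure measured during the PoC, and rounding up to a multiple of the zone count is an assumption that keeps zones balanced.

```python
import math


def ingester_replicas(
    active_series: int,
    replication_factor: int = 3,
    series_per_ingester: int = 1_500_000,  # hypothetical capacity, benchmark your own
    zones: int = 3,
) -> int:
    """Estimate ingester replicas: every series is held in memory replication_factor times."""
    in_memory_series = active_series * replication_factor
    needed = math.ceil(in_memory_series / series_per_ingester)
    needed = max(needed, replication_factor)  # never fewer replicas than copies
    return math.ceil(needed / zones) * zones  # keep zones evenly populated


print(ingester_replicas(3_000_000))
print(ingester_replicas(200_000))
print(ingester_replicas(4_000_000))
```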
Loki Scaling
# Loki Simple Scalable Deployment (SSD mode)
# Read path: scales with query load
read:
  replicas: 3
  resources:
    requests: { cpu: "2", memory: "4Gi" }
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
# Write path: scales with ingestion volume
write:
  replicas: 3
  resources:
    requests: { cpu: "1", memory: "2Gi" }
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetMemoryUtilizationPercentage: 80
# Backend: compactor + index gateway
backend:
  replicas: 2
  resources:
    requests: { cpu: "1", memory: "4Gi" }
High Availability
Each component requires specific HA strategies:
| Component | HA Mechanism | Minimum Replicas |
|---|---|---|
| Mimir Ingester | Replication factor 3, zone-aware | 3 (across 3 AZs) |
| Loki Write | Replication factor 3 | 3 |
| Tempo Ingester | Replication factor 3 | 3 |
| Grafana | Stateless, shared database | 2+ |
| Alertmanager | Gossip-based clustering | 3 |
# Zone-aware replication for Mimir ingesters (Mimir configuration)
ingester:
  ring:
    replication_factor: 3
    zone_awareness_enabled: true
# Ensure ingesters spread across availability zones (Kubernetes pod spec)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: ingester
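The replica minimums in the table follow from quorum arithmetic: a write succeeds once a majority of the replicas acknowledge it. The sketch below shows that calculation; the function names are illustrative.

```python
def write_quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while writes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)


for rf in (1, 3, 5):
    print(rf, write_quorum(rf), tolerated_failures(rf))
```

With a replication factor of 3, one replica can be lost. When the three replicas sit in three different zones, that translates into surviving the loss of one whole zone.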
Management & Automation
Capacity Planning
Observability platforms must monitor themselves. Deploy a separate “meta-monitoring” stack that watches the observability infrastructure:
# Meta-monitoring alerts for the observability platform
groups:
  - name: platform_capacity
    rules:
      - alert: MimirIngesterMemoryPressure
        expr: |
          container_memory_working_set_bytes{container="ingester"}
            / container_spec_memory_limit_bytes{container="ingester"} > 0.85
        for: 15m
        labels:
          severity: warning
          team: observability-platform
        annotations:
          summary: "Mimir ingester memory usage above 85%"
          runbook: "Scale ingesters or investigate high-cardinality tenants"
      - alert: LokiIngestionLagging
        expr: |
          rate(loki_distributor_bytes_received_total[5m])
            > rate(loki_ingester_chunks_flushed_total[5m]) * 1.2
        for: 10m
        labels:
          severity: critical
          team: observability-platform
      - alert: TenantCardinalityExplosion
        # Aggregate per tenant across ingesters, then divide by the replication
        # factor (3), because each series is held by three ingesters
        expr: |
          sum by (user) (cortex_ingester_active_series) / 3 > 2000000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tenant {{ $labels.user }} exceeding 2M active series"
Cost Optimization
The three largest cost centers for a Grafana observability platform are storage, compute (ingesters/queriers), and data transfer. Apply these strategies:
Cost Reduction Strategies by Impact
| Strategy | Typical Savings | Effort |
|---|---|---|
| Reduce metric cardinality (drop unused labels) | 30–60% | Medium |
| Implement tail-based trace sampling | 40–70% | Medium |
| Tiered log retention (hot/warm/cold) | 50–70% | Low |
| Downsampling old metrics (5m → 1h resolution) | 20–40% | Low |
| Drop debug-level logs in production | 20–50% | Low |
| Use recording rules for expensive queries | 10–30% compute | Medium |
| Object storage lifecycle policies | 30–50% storage | Low |
# OTel Collector processors for cost optimization
processors:
  # Drop high-cardinality labels before sending to Mimir
  metricstransform:
    transforms:
      - include: http_request_duration_seconds
        action: update
        operations:
          # Keep only the listed labels; request_path and any other label
          # is aggregated away
          - action: aggregate_labels
            label_set: [method, status_code, service]
            aggregation_type: sum
  # Tail-based sampling for traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample-remainder
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  # Filter unnecessary logs: a record is dropped if ANY condition matches
  filter:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(body, "health check|readiness probe")'
        # DEBUG and TRACE records (severity below INFO, but not unset)
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'
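The tail-sampling policies above combine as a logical OR: a trace is kept if any policy votes to sample it. The sketch below mirrors that decision in plain Python to make the behavior easy to reason about. It is a simplified model, not the collector's implementation; the `Trace` type and `keep_trace` function are hypothetical, and the random source is injectable so the logic can be tested deterministically.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    has_error: bool
    duration_ms: float


def keep_trace(
    trace: Trace,
    rng: Callable[[], float] = random.random,
    latency_threshold_ms: float = 2000,
    sampling_percentage: float = 5.0,
) -> bool:
    """Keep a completed trace if any policy matches: error, slow, or random sample."""
    if trace.has_error:                          # errors-always
        return True
    if trace.duration_ms > latency_threshold_ms:  # slow-requests
        return True
    return rng() * 100 < sampling_percentage      # sample-remainder


print(keep_trace(Trace(has_error=True, duration_ms=12)))
print(keep_trace(Trace(has_error=False, duration_ms=2500)))
```

Because the decision needs the whole trace, it can only be made after the wait window has elapsed, which is why tail sampling costs collector memory that head sampling does not.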
Operational Runbooks
Every alert must have a corresponding runbook. Structure them consistently:
# runbook-template.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: runbook-mimir-ingester-memory
data:
  runbook.md: |
    # Mimir Ingester Memory Pressure
    ## Symptoms
    - Alert: MimirIngesterMemoryPressure fired
    - Ingester pods showing high memory usage (> 85%)
    - Potential OOMKill risk
    ## Investigation Steps
    1. Identify which tenant is causing growth:
       `sum by (user) (cortex_ingester_active_series) > 1000000`
    2. Check for cardinality spikes (the metric is a gauge, so use delta rather than rate):
       `delta(cortex_ingester_active_series[1h]) > 10000`
    3. Review recent deployments that may have added labels
    ## Remediation
    - **Immediate**: Scale ingesters horizontally (current replica count plus 2)
      `kubectl scale statefulset mimir-ingester --replicas=<current + 2>`
    - **Short-term**: Apply per-tenant series limits
    - **Long-term**: Work with offending team to reduce cardinality
    ## Escalation
    - If > 95% memory after scaling: Page on-call lead
    - If data loss suspected: Invoke incident process
Developing a Proof of Concept
Scoping the PoC
A well-scoped PoC validates key architectural decisions without over-investing. Target 2–4 weeks duration with 3–5 representative services:
- 1 high-traffic service — validates ingestion scale and query performance
- 1 multi-dependency service — validates distributed tracing correlation
- 1 batch/async service — validates log aggregation for non-HTTP workloads
- Infrastructure layer — validates Kubernetes metrics collection
Success Metrics
Define measurable outcomes before starting:
| Metric | Target | How to Measure |
|---|---|---|
| Mean Time to Detection (MTTD) | < 5 minutes | Inject known failure, measure alert latency |
| Mean Time to Investigate (MTTI) | < 15 minutes | Time from alert to identifying root cause |
| Developer Onboarding | < 30 minutes | New service emitting all 3 signals |
| Query Latency (p99) | < 3 seconds | Dashboard load time for 24h range |
| Data Completeness | > 99.5% | No gaps in metrics/traces for instrumented services |
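Of these targets, data completeness is the most straightforward to compute: compare the samples actually stored against the samples a scrape interval implies. The sketch below shows the arithmetic; the helper names and the received-sample count are made up for illustration.

```python
def expected_samples(window_seconds: int, scrape_interval_seconds: int, series: int) -> int:
    """Samples that should exist if every scrape in the window succeeded."""
    return (window_seconds // scrape_interval_seconds) * series


def completeness(received: int, expected: int) -> float:
    """Fraction of expected samples that were stored, capped at 1.0."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return min(received / expected, 1.0)


# One day at a 15s scrape interval for 100 series.
expected = expected_samples(86_400, 15, series=100)
ratio = completeness(573_500, expected)  # illustrative received count
print(f"expected={expected} completeness={ratio:.3%} meets_target={ratio > 0.995}")
```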
Phased Rollout Strategy
flowchart LR
P1["Phase 1
Foundation
2-4 weeks"]
P2["Phase 2
Early Adopters
4-6 weeks"]
P3["Phase 3
Broad Adoption
8-12 weeks"]
P4["Phase 4
Full Production
Ongoing"]
P1 -->|"3-5 services"| P2
P2 -->|"20-30 services"| P3
P3 -->|"All services"| P4
- Phase 1: Deploy core stack, instrument PoC services, validate data quality
- Phase 2: Onboard willing teams, build self-service tooling, establish standards
- Phase 3: Mandatory adoption for new services, migration support for legacy
- Phase 4: Advanced features (profiling, ML anomaly detection, self-healing)
Containerization & Virtualization
Kubernetes Deployment
The Grafana LGTM stack is designed for Kubernetes. Each component ships official Helm charts and can run as StatefulSets (ingesters, store-gateways) or Deployments (distributors, queriers, frontends):
# Deploy the full Grafana observability stack with Helm
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Deploy Mimir (metrics backend)
helm install mimir grafana/mimir-distributed \
--namespace observability \
--create-namespace \
--values mimir-values.yaml
# Deploy Loki (logs backend)
helm install loki grafana/loki \
--namespace observability \
--values loki-values.yaml
# Deploy Tempo (traces backend)
helm install tempo grafana/tempo-distributed \
--namespace observability \
--values tempo-values.yaml
# Deploy Grafana (visualization)
helm install grafana grafana/grafana \
--namespace observability \
--values grafana-values.yaml
# Deploy Alloy (collector)
helm install alloy grafana/alloy \
--namespace observability \
--values alloy-values.yaml
Helm Charts & Operators
For production deployments, use the distributed Helm charts which deploy components as separate microservices:
# mimir-values.yaml - Production configuration
global:
  extraEnvFrom:
    - secretRef:
        name: mimir-s3-credentials
mimir:
  structuredConfig:
    multitenancy_enabled: true
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-1.amazonaws.com
          bucket_name: observability-mimir-blocks
          region: us-east-1
ingester:
  replicas: 6
  persistentVolume:
    enabled: true
    size: 100Gi
    storageClass: gp3
  zoneAwareReplication:
    enabled: true
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1c
distributor:
  replicas: 3
querier:
  replicas: 4
query_frontend:
  replicas: 2
compactor:
  replicas: 1
  persistentVolume:
    size: 200Gi
Resource Management
Proper resource requests and limits prevent noisy-neighbor problems and ensure predictable performance:
# Resource guidelines per component (per replica)
# Adjust based on your actual workload after PoC benchmarking
resources:
  mimir_ingester:
    requests: { cpu: "2", memory: "16Gi" }
    limits: { cpu: "4", memory: "32Gi" }
    notes: "Memory scales with active series. ~1KB per series."
  mimir_distributor:
    requests: { cpu: "2", memory: "2Gi" }
    limits: { cpu: "4", memory: "4Gi" }
    notes: "CPU-bound. Scales with samples/sec ingestion rate."
  loki_write:
    requests: { cpu: "1", memory: "2Gi" }
    limits: { cpu: "2", memory: "4Gi" }
    notes: "Memory scales with chunk buffer size."
  loki_read:
    requests: { cpu: "2", memory: "4Gi" }
    limits: { cpu: "4", memory: "8Gi" }
    notes: "CPU/memory scale with query complexity and parallelism."
  tempo_ingester:
    requests: { cpu: "1", memory: "4Gi" }
    limits: { cpu: "2", memory: "8Gi" }
    notes: "Memory scales with trace buffer before flush."
  grafana:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "2", memory: "2Gi" }
    notes: "Stateless. Scale replicas for concurrent users."
Setting the Right Access Levels
RBAC in Grafana
Grafana Enterprise and Grafana Cloud provide fine-grained role-based access control. Design roles around the observability personas identified in Part 1:
| Role | Persona | Permissions |
|---|---|---|
| Platform Admin | Ophelia Operator | Full control: data sources, users, orgs, plugins, API keys |
| Team Lead | Masha Manager | Manage team folders, create/edit dashboards, manage alerts |
| Developer | Diego Developer | View all dashboards, edit team dashboards, create personal dashboards |
| Service Account | CI/CD pipelines | Provisioning: create/update dashboards and alerts via API |
| Viewer | Pelé Product / Stakeholders | View specific folders, no edit permissions |
# Grafana RBAC configuration via provisioning
# File: provisioning/access-control/roles.yaml
apiVersion: 1
roles:
  - name: "team-developer"
    description: "Standard developer role for service teams"
    permissions:
      - action: "dashboards:read"
        scope: "folders:*"
      - action: "dashboards:write"
        scope: "folders:uid:team-${team_name}"
      - action: "dashboards:create"
        scope: "folders:uid:team-${team_name}"
      - action: "datasources:query"
        scope: "datasources:*"
      # Alert rule actions use the "alert.rules" prefix
      - action: "alert.rules:read"
        scope: "folders:*"
      - action: "alert.rules:write"
        scope: "folders:uid:team-${team_name}"
  - name: "platform-admin"
    description: "Full platform administration"
    permissions:
      - action: "*"
        scope: "*"
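Each permission pairs an action with a scope, and a request is allowed when some permission covers both. The sketch below models that evaluation in a deliberately simplified form: it treats a trailing `*` as a prefix wildcard and is not Grafana's actual evaluator. The helper names and the `team-payments` folder UID are hypothetical.

```python
def scope_matches(granted: str, requested: str) -> bool:
    """Simplified matching: a trailing '*' acts as a prefix wildcard, otherwise exact."""
    if granted.endswith("*"):
        return requested.startswith(granted[:-1])
    return granted == requested


def is_allowed(permissions: list[tuple[str, str]], action: str, scope: str) -> bool:
    """True if any (action, scope) permission covers the requested action and scope."""
    return any(
        (granted_action == "*" or granted_action == action)
        and scope_matches(granted_scope, scope)
        for granted_action, granted_scope in permissions
    )


# The team-developer role, instantiated for a hypothetical payments team.
developer = [
    ("dashboards:read", "folders:*"),
    ("dashboards:write", "folders:uid:team-payments"),
]

print(is_allowed(developer, "dashboards:read", "folders:uid:team-catalog"))
print(is_allowed(developer, "dashboards:write", "folders:uid:team-catalog"))
```

A model like this is useful for unit-testing role definitions before they are provisioned, for example to confirm that a team role cannot write outside its own folder.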
Data Source Permissions
Restrict which teams can query which data sources to enforce tenant isolation at the visualization layer:
# Data source provisioning with team-based access
apiVersion: 1
datasources:
  - name: "Mimir - Production"
    type: prometheus
    url: http://mimir-query-frontend.observability:8080/prometheus
    access: proxy
    jsonData:
      httpHeaderName1: "X-Scope-OrgID"
    secureJsonData:
      httpHeaderValue1: "production"
    # Only platform-admin and production teams can query
    permissions:
      - teamId: 1  # platform-admin
        permission: 2  # Admin
      - teamId: 5  # backend-team
        permission: 1  # Query
  - name: "Mimir - Development"
    type: prometheus
    url: http://mimir-query-frontend.observability:8080/prometheus
    access: proxy
    jsonData:
      httpHeaderName1: "X-Scope-OrgID"
    secureJsonData:
      httpHeaderValue1: "development"
    permissions:
      - teamId: 1
        permission: 2
      - teamId: 10  # all-developers
        permission: 1
Organization & Folder Structure
Design a folder hierarchy that maps to your organizational structure and access patterns:
flowchart TD
R["Root"]
R --> PL["Platform
(Platform team only)"]
R --> SH["Shared
(All viewers)"]
R --> T1["Team: Payments
(Team RBAC)"]
R --> T2["Team: Catalog
(Team RBAC)"]
R --> T3["Team: Auth
(Team RBAC)"]
PL --> PL1["Infrastructure Health"]
PL --> PL2["Platform SLOs"]
PL --> PL3["Cost & Capacity"]
SH --> SH1["Service Overview"]
SH --> SH2["Business KPIs"]
T1 --> T1A["Service Dashboards"]
T1 --> T1B["Team Alerts"]
T1 --> T1C["Debug / Ad-hoc"]
Sending Telemetry to Multiple Consumers
Fan-Out Architectures
Production observability platforms often need to send telemetry to multiple destinations — a primary Grafana stack for real-time monitoring, a data lake for long-term analytics, and potentially a security team’s SIEM:
# OTel Collector with fan-out to multiple backends
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    send_batch_size: 10000
    timeout: 5s
  # Route payment-service metrics to the SIEM in addition to the default destinations
  routing:
    default_exporters: [prometheusremotewrite/mimir, otlp/datalake]
    table:
      - statement: route() where resource.attributes["service.name"] == "payment-service"
        exporters: [prometheusremotewrite/mimir, otlp/datalake, prometheusremotewrite/security_siem]
exporters:
  # Primary: Grafana Mimir
  prometheusremotewrite/mimir:
    endpoint: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: production
  # Secondary: Data Lake (for ML / long-term analytics)
  otlp/datalake:
    endpoint: analytics-collector.data-team:4317
    compression: zstd
  # Security: SIEM for audit trails
  prometheusremotewrite/security_siem:
    endpoint: https://siem.internal/api/v1/metrics
    headers:
      Authorization: "Bearer ${SIEM_TOKEN}"
  # Traces: Grafana Tempo (endpoint is an assumed in-cluster address; adjust to your deployment)
  otlp/tempo:
    endpoint: tempo-distributor:4317
    tls:
      insecure: true
  # Logs fan-out
  loki/primary:
    endpoint: http://loki-distributor:3100/loki/api/v1/push
    headers:
      X-Scope-OrgID: production
  loki/compliance:
    endpoint: http://loki-compliance:3100/loki/api/v1/push
    headers:
      X-Scope-OrgID: compliance
service:
  pipelines:
    metrics:
      receivers: [otlp]
      # The routing processor must come last and picks the exporters per resource
      processors: [batch, routing]
      exporters: [prometheusremotewrite/mimir, otlp/datalake, prometheusremotewrite/security_siem]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki/primary, loki/compliance]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/datalake]
Sampling Strategies
When sending to multiple consumers with different fidelity requirements, apply sampling at the collector layer:
| Consumer | Sampling Strategy | Rationale |
|---|---|---|
| Real-time alerting (Mimir) | 100% of metrics | Alerts need complete data |
| Trace investigation (Tempo) | 100% errors + 5–10% success | Must capture all failures |
| Data lake / ML training | 1–5% uniform sample | Statistical significance sufficient |
| Compliance / audit | 100% for regulated services | Regulatory requirement |
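One way to apply different rates per consumer is to hash the trace ID into a bucket and compare it against each consumer's rate, so that every collector instance makes the same decision for the same trace. The sketch below is a simplified illustration of that idea for traces. The function names, consumer labels, and the specific rates (10% for Tempo, 5% for the data lake) are assumptions picked from the ranges in the table.

```python
import zlib

BUCKETS = 10_000


def sampled(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace ID always lands in the same bucket."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % BUCKETS
    return bucket < rate * BUCKETS


def destinations(trace_id: str, has_error: bool, regulated: bool) -> list[str]:
    """Decide which consumers receive a trace under the per-consumer rates."""
    targets = []
    if has_error or sampled(trace_id, 0.10):  # all errors plus 10% of successes
        targets.append("tempo")
    if sampled(trace_id, 0.05):               # uniform 5% sample for analytics
        targets.append("datalake")
    if regulated:                             # full fidelity for regulated services
        targets.append("compliance")
    return targets


print(destinations("4bf92f3577b34da6a3ce929d0e0e4736", has_error=True, regulated=False))
```

Because both rates are evaluated against the same bucket, every trace chosen for the data lake is also kept in Tempo, which keeps the two stores joinable by trace ID.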
Data Pipeline Design
For high-volume environments, add a message queue between collectors and backends to absorb bursts and enable replay:
flowchart LR
A["OTel Collectors
(Edge)"]
K["Kafka / Pulsar
(Buffer)"]
C1["Consumer: Mimir
(Real-time metrics)"]
C2["Consumer: Loki
(Real-time logs)"]
C3["Consumer: Data Lake
(Analytics)"]
C4["Consumer: SIEM
(Security)"]
A --> K
K --> C1
K --> C2
K --> C3
K --> C4
Summary & Next Steps
Architecting an observability platform requires thinking beyond individual tools to consider the entire system holistically:
- Platform mindset: treat observability as an internal product with clear personas, SLOs, and a roadmap
- Data architecture: define telemetry tiers, cardinality budgets, and retention policies before deploying infrastructure
- System architecture: choose the right deployment topology and scaling patterns for your scale
- Multi-tenancy: use X-Scope-OrgID headers with per-tenant limits for cost control and isolation
- Access control: design RBAC, data source permissions, and folder structures around team boundaries
- Fan-out pipelines: route telemetry to multiple consumers with appropriate sampling per destination
- Phased rollout: start with a PoC, validate success metrics, then expand systematically
Next in the Series
In Part 12: Real User Monitoring with Grafana, we’ll explore frontend observability — capturing Web Vitals, tracking user sessions with Grafana Faro, correlating frontend errors with backend traces, and building RUM dashboards that connect user experience to infrastructure health.