Observability in a Nutshell
Observability is the ability to understand the internal state of a system by examining its external outputs. Traditional monitoring tells you that something is broken; observability helps you work out why it broke. In complex distributed systems you cannot predict every failure mode in advance, so observability gives you the tools to ask arbitrary questions of your running systems without deploying new code.
The three foundational properties of observability are:
- High cardinality — the ability to filter and group data by any dimension (user ID, request ID, service version, deployment region)
- High dimensionality — rich context attached to every telemetry signal (not just "error occurred" but who, what, where, when, and the full request context)
- Explorability — the ability to ask new questions without pre-defining dashboards or alerts. If you have to modify code to answer a question about production, your observability is incomplete.
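To make the three properties concrete, here is a minimal sketch that treats telemetry as "wide events" and groups them by whichever dimension a question calls for. The event fields and the `count_by` helper are hypothetical and chosen only for illustration; a real backend would run the equivalent query server-side.

```python
from collections import Counter
from typing import Any, Callable

# Hypothetical wide events: one record per request, with many dimensions attached.
EVENTS: list[dict[str, Any]] = [
    {"user_id": "usr_42", "region": "eu-west", "version": "1.4.2", "status": 500, "duration_ms": 910},
    {"user_id": "usr_42", "region": "eu-west", "version": "1.4.2", "status": 200, "duration_ms": 120},
    {"user_id": "usr_7", "region": "us-east", "version": "1.4.1", "status": 200, "duration_ms": 95},
    {"user_id": "usr_9", "region": "eu-west", "version": "1.4.2", "status": 500, "duration_ms": 870},
]


def is_error(event: dict[str, Any]) -> bool:
    """Treat any 5xx status as an error."""
    return event["status"] >= 500


def count_by(
    events: list[dict[str, Any]],
    dimension: str,
    where: Callable[[dict[str, Any]], bool] = lambda event: True,
) -> Counter:
    """Count events per value of `dimension`, keeping only events that match `where`."""
    return Counter(event[dimension] for event in events if where(event))


# Two different questions, answered without changing the instrumented service.
print(count_by(EVENTS, "version", is_error))
print(count_by(EVENTS, "user_id", is_error))
```

Because every event carries all of its dimensions, a new question only needs a new `dimension` or `where` argument, not a new deployment.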
Case Study: A Ship Passing Through the Panama Canal
Imagine a cargo ship navigating the Panama Canal. The canal authority must know the ship's position (metrics), read the captain's log entries (logs), and trace the ship's journey through each lock and lake segment (traces). They also monitor water levels, wind speed, and tug boat fuel (infrastructure metrics).
If the ship runs aground, they need to correlate: Was the water level low? Did the pilot make an error? Was there a mechanical failure? This requires all three telemetry types correlated together — that's observability.
Telemetry Types & Technologies
Modern observability rests on three classic pillars of telemetry: metrics, logs, and traces. Continuous profiling is increasingly treated as a fourth signal. Each provides a different lens into system behavior, and the real power comes from correlating across them.
Metrics
Metrics are numeric measurements collected at regular intervals. They are lightweight, aggregatable, and ideal for alerting and trend analysis. In the Grafana stack, metrics are stored in Mimir, a horizontally scalable, Prometheus-compatible backend for long-term storage.
Key characteristics of metrics:
- Fixed cost per time series regardless of traffic volume
- Ideal for alerting (SLO burn-rate alerts, threshold alerts)
- Native aggregation across dimensions (sum, avg, percentile)
- Retention measured in months to years
- Protocols: Prometheus exposition format, OTLP, StatsD, DogStatsD, SNMP
# Example: Prometheus scrape configuration
scrape_configs:
  - job_name: 'my-application'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-server:8080']
        labels:
          environment: 'production'
          team: 'platform'
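The scrape configuration above points Prometheus at a `/metrics` endpoint that returns plain text in the Prometheus exposition format. The following is a minimal sketch of what a counter looks like in that format, rendered by hand with the standard library. The metric name, help text, and label values are assumptions for illustration; in practice an official client library would normally generate this output.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: dict[tuple[tuple[str, str], ...], int],
) -> str:
    """Render one counter metric with labelled samples as exposition-format text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{escape_label_value(val)}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, keyed by their label set.
REQUESTS = {
    (("method", "GET"), ("status", "200")): 3,
    (("method", "POST"), ("status", "500")): 1,
}

print(render_counter("http_requests_total", "Total HTTP requests handled.", REQUESTS), end="")
```

Each distinct label combination becomes its own time series, which is why the cost of metrics scales with the number of series rather than with request volume.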
Logs
Logs are timestamped text records of discrete events. They provide the richest context of any telemetry type but are expensive to store and search at scale. In the Grafana stack, logs are stored in Loki, a log aggregation system that indexes only metadata labels (not the full text), which typically makes it cheaper to run than Elasticsearch-based solutions.
Log formats range from unstructured to fully structured:
- Unstructured: Free-form text (ERROR: Connection timeout to database at 10.0.1.5:5432)
- Semi-structured: Consistent format, but parsing requires a custom pattern ([2026-06-15 14:30:02] ERROR app.database - Connection timeout)
- Structured (JSON): Machine-parseable with typed fields, the gold standard for observability
{
  "timestamp": "2026-06-15T14:30:02.341Z",
  "level": "error",
  "service": "checkout-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Connection timeout to database",
  "attributes": {
    "db.host": "10.0.1.5",
    "db.port": 5432,
    "db.name": "orders",
    "retry_count": 3,
    "user_id": "usr_42"
  }
}
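One way to produce log lines shaped like the JSON record above is a custom formatter on top of Python's standard `logging` module. This is a sketch, not a recommended library: `JsonFormatter`, `build_logger`, and the convention of passing `trace_id` and `attributes` through `extra=` are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
            # Hypothetical convention: structured fields arrive via `extra={"attributes": {...}}`.
            "attributes": getattr(record, "attributes", {}),
        }
        # Copy trace correlation IDs only when the caller supplied them.
        for key in ("trace_id", "span_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout-service")
logger.error(
    "Connection timeout to database",
    extra={
        "trace_id": "abc123def456",
        "attributes": {"db.host": "10.0.1.5", "db.port": 5432, "retry_count": 3},
    },
)
```

Keeping `trace_id` as a top-level field is what later allows a log line to be joined to the trace it belongs to.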
Distributed Traces
A distributed trace follows a single request as it propagates through multiple services. Each unit of work is a span, and spans form a tree (or DAG) representing the full request lifecycle. In the Grafana stack, traces are stored in Tempo, a high-scale, cost-effective trace backend that requires no indexing beyond the trace ID.
flowchart LR
    A["API Gateway<br/>12ms"] --> B["Auth Service<br/>3ms"]
    A --> C["Order Service<br/>45ms"]
    C --> D["Payment Service<br/>120ms"]
    C --> E["Inventory Service<br/>28ms"]
    D --> F["Bank API<br/>95ms"]
    E --> G["Database<br/>12ms"]
    style A fill:#e8f4f4,stroke:#3B9797
    style D fill:#fff5f5,stroke:#BF092F
    style F fill:#fff5f5,stroke:#BF092F
Key tracing concepts:
- Trace: The complete journey of a request across all services
- Span: A single unit of work with start time, duration, attributes, and status
- Context propagation: Passing trace/span IDs between services via HTTP headers (W3C Trace Context, B3) or gRPC metadata
- Sampling: Collecting a subset of traces (head-based, tail-based, or adaptive) to control cost
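Context propagation and head-based sampling from the list above can be sketched with the W3C Trace Context `traceparent` header, whose version `00` layout is `version-traceid-parentid-flags` in lowercase hex. The helper names and the `SAMPLE_RATIO` value are hypothetical, and the ratio-based decision is one common approach rather than the only one. Real services would normally rely on an OpenTelemetry SDK instead of hand-rolled parsing.

```python
import re
import secrets
from typing import Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")
SAMPLE_RATIO = 0.1  # hypothetical: keep roughly 10% of new traces


def should_sample(trace_id: str, ratio: float) -> bool:
    """Head-based decision derived from the trace ID, so it is repeatable for a given trace."""
    return int(trace_id[16:], 16) < ratio * 2**64


def build_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Serialize trace context; the lowest flag bit marks the trace as sampled."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the parsed context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": bool(int(flags, 16) & 0x01)}


def continue_trace(incoming_header: Optional[str]) -> str:
    """Build the header for an outgoing call, continuing the caller's trace when possible."""
    parent = parse_traceparent(incoming_header) if incoming_header else None
    if parent is None:
        trace_id = secrets.token_hex(16)  # start a new trace
        sampled = should_sample(trace_id, SAMPLE_RATIO)
    else:
        trace_id, sampled = parent["trace_id"], parent["sampled"]
    return build_traceparent(trace_id, secrets.token_hex(8), sampled)


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(continue_trace(incoming))  # same trace ID, new span ID, sampled flag preserved
print(continue_trace(None))      # no incoming context: a new trace is started
```

The trace ID stays constant across hops while each service contributes its own span ID, which is what lets a backend reassemble the spans into one tree.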
Other Telemetry Types
Beyond the classic three pillars, modern observability includes:
- Profiles (Continuous Profiling): CPU, memory, and goroutine flame graphs showing exactly which functions consume resources. In the Grafana stack, profiles are stored in Pyroscope.
- Events: Discrete occurrences (deployments, config changes, feature flag flips) that correlate with metric anomalies
- Exemplars: Links from aggregated metrics to specific trace IDs, enabling drill-down from a latency spike to the exact slow request
- Real User Monitoring (RUM): Browser-side telemetry capturing Core Web Vitals, JavaScript errors, and user sessions via Grafana Faro
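Of the signals listed above, exemplars are the easiest to misread, so here is a small sketch of the idea: a latency histogram that remembers one trace ID per bucket. `HistogramWithExemplars` is a hypothetical class written for this example, and it keeps per-bucket counts rather than the cumulative counts a Prometheus histogram exposes.

```python
import bisect
from typing import Optional


class HistogramWithExemplars:
    """Latency histogram that keeps the most recent trace ID seen in each bucket."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)                    # bucket upper bounds, in seconds
        self.counts = [0] * (len(self.bounds) + 1)      # last slot is the +Inf bucket
        self.exemplars: list[Optional[tuple[str, float]]] = [None] * (len(self.bounds) + 1)

    def observe(self, value: float, trace_id: str) -> None:
        """Record a latency and remember which trace produced it."""
        index = bisect.bisect_left(self.bounds, value)  # first bucket with bound >= value
        self.counts[index] += 1
        self.exemplars[index] = (trace_id, value)

    def slowest_exemplar(self) -> Optional[tuple[str, float]]:
        """Return the exemplar from the highest non-empty bucket, if any."""
        for exemplar in reversed(self.exemplars):
            if exemplar is not None:
                return exemplar
        return None


latency = HistogramWithExemplars([0.1, 0.5, 1.0])
latency.observe(0.05, "trace-fast")
latency.observe(0.30, "trace-mid")
latency.observe(2.40, "trace-slow")

print(latency.counts)
print(latency.slowest_exemplar())
```

The aggregated counts are what a dashboard plots; the stored trace ID is what lets you jump from a spike in the slowest bucket to one concrete request.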
User Personas of Observers
Different roles interact with observability data in different ways. Understanding these personas helps design dashboards, alerts, and access controls appropriately:
- Diego Developer: Needs trace-level debugging, log search, and code-level profiling to understand why their service is slow or erroring
- Ophelia Operator: Monitors infrastructure health, manages alerts and on-call rotations, responds to incidents, needs overview dashboards
- Steven Service (SRE): Defines SLOs, manages error budgets, architects the observability platform, bridges dev and ops
- Pelé Product: Tracks feature adoption, user journeys, conversion funnels — needs business metrics derived from telemetry
- Masha Manager: Needs executive summaries, cost reports, SLA compliance dashboards, and incident postmortem insights
The Grafana Stack
Grafana Labs provides a comprehensive, open-source-first observability platform. The stack is modular — you can adopt individual components or deploy the full integrated suite.
The Core LGTM Stack
The acronym LGTM stands for Loki (logs), Grafana (visualization), Tempo (traces), and Mimir (metrics). Together with the collector layer, these form the core:
flowchart TD
    subgraph Applications
        A1[Service A] --> COLL
        A2[Service B] --> COLL
        A3[Service C] --> COLL
    end
    subgraph Collection
        COLL["Grafana Alloy<br/>OTel Collector"]
    end
    subgraph Storage
        COLL -->|metrics| MIMIR[Grafana Mimir]
        COLL -->|logs| LOKI[Grafana Loki]
        COLL -->|traces| TEMPO[Grafana Tempo]
        COLL -->|profiles| PYRO[Grafana Pyroscope]
    end
    subgraph Visualization
        MIMIR --> GF[Grafana]
        LOKI --> GF
        TEMPO --> GF
        PYRO --> GF
    end
    style COLL fill:#e8f4f4,stroke:#3B9797
    style GF fill:#f0f4f8,stroke:#16476A
    style MIMIR fill:#e8f4f4,stroke:#3B9797
    style LOKI fill:#e8f4f4,stroke:#3B9797
    style TEMPO fill:#e8f4f4,stroke:#3B9797
    style PYRO fill:#e8f4f4,stroke:#3B9797
| Component | Role | Query Language | Key Feature |
|---|---|---|---|
| Grafana | Visualization & alerting | N/A (queries backends) | Unified UI for all telemetry types |
| Mimir | Metrics storage | PromQL | Horizontally scalable Prometheus |
| Loki | Log aggregation | LogQL | Indexes only labels, not log content |
| Tempo | Trace storage | TraceQL | No indexing required, object storage backend |
| Pyroscope | Continuous profiling | FlameQL | Always-on profiling with minimal overhead |
| Alloy | Telemetry collector | N/A (configured with Alloy syntax, formerly River) | Unified collection for all signal types |
| Beyla | eBPF auto-instrumentation | N/A | Zero-code instrumentation via eBPF |
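To give a feel for the query languages in the table, the sketch below pairs one example query per backend with the HTTP endpoint it would typically be sent to. The base URLs, metric name, and label names are hypothetical, and the endpoint paths are quoted from memory of each project's HTTP API, so check them against the API reference for your version before relying on them.

```python
from urllib.parse import urlencode

# One illustrative query per language; metric and label names are assumptions.
QUERIES = {
    "promql": 'sum by (status) (rate(http_requests_total[5m]))',
    "logql": '{service="checkout-service"} | json | level="error"',
    "traceql": '{ resource.service.name = "checkout-service" && duration > 100ms }',
}


def query_url(base_url: str, path: str, params: dict[str, str]) -> str:
    """Join a base URL, an API path, and URL-encoded query parameters."""
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"


# Hypothetical hosts; the paths follow each backend's documented HTTP API.
URLS = {
    "mimir": query_url("http://mimir.example.internal", "/prometheus/api/v1/query",
                       {"query": QUERIES["promql"]}),
    "loki": query_url("http://loki.example.internal", "/loki/api/v1/query_range",
                      {"query": QUERIES["logql"], "limit": "100"}),
    "tempo": query_url("http://tempo.example.internal", "/api/search",
                       {"q": QUERIES["traceql"]}),
}

for backend, url in URLS.items():
    print(f"{backend}: {url}")
```

In day-to-day use Grafana issues these requests for you through its data sources; building the URLs by hand is mainly useful for scripts and for debugging a data source configuration.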
Grafana Enterprise Plugins
Grafana Enterprise extends the open-source platform with:
- Enterprise data sources: Oracle, SAP HANA, Snowflake, Databricks, ServiceNow, Splunk, Datadog
- Enhanced security: Data source permissions, team-level RBAC, audit logging, SAML/LDAP/OAuth SSO
- Reporting: Scheduled PDF/CSV report generation and email delivery
- Caching: Query caching layer for expensive data source queries
- Recorded queries: Precompute expensive queries on a schedule
Incident Response & Management (IRM)
Grafana's IRM suite provides end-to-end incident lifecycle management:
- Grafana Alerting: Unified alerting engine supporting Prometheus-style, Loki, and multi-dimensional alert rules with notification policies and silences
- Grafana OnCall: On-call scheduling, escalation chains, alert grouping, and integration with PagerDuty/Opsgenie/Slack
- Grafana Incident: Collaborative incident management with war rooms, timelines, roles, and automated postmortem generation
- Grafana SLO: Define and monitor Service Level Objectives with burn-rate alerting
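Burn-rate alerting, mentioned in the last item above, rests on simple arithmetic: the burn rate is the observed error ratio divided by the error budget (one minus the SLO target). The sketch below works through that calculation. The request counts and the 30-day window are illustrative assumptions, not values taken from any Grafana SLO configuration.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget implied by the SLO target."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. about 0.001 for a 99.9% target
    return (errors / total) / error_budget


def days_until_budget_exhausted(rate: float, window_days: float = 30.0) -> float:
    """At a constant burn rate, the whole budget lasts window_days / rate."""
    return float("inf") if rate <= 0 else window_days / rate


# Hypothetical hour of traffic: 50 failures out of 10,000 requests against a 99.9% SLO.
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget lasts: {days_until_budget_exhausted(rate):.1f} days")
```

A burn rate of 1 means the budget runs out exactly at the end of the window; alert rules usually fire at much higher multiples measured over short windows, so that fast-moving incidents page someone while slow drifts only open a ticket.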
Other Grafana Tools
- Grafana k6: Load testing and performance testing with JavaScript-based test scripts
- Grafana Synthetic Monitoring: Proactive endpoint and protocol checks from global locations
- Grafana Faro: Real User Monitoring (RUM) SDK for browser applications
- Grafana Machine Learning: Anomaly detection, forecasting, and automated incident investigation (Sift)
- Grafana Assistant: Natural language querying of observability data via LLM
Alternatives to the Grafana Stack
Understanding the competitive landscape helps you make informed architecture decisions:
| Category | Grafana Solution | Alternatives |
|---|---|---|
| Data Collection | Alloy, Beyla | OpenTelemetry Collector, Fluent Bit, Vector, Telegraf |
| Metrics Storage | Mimir | Prometheus, Thanos, Cortex, VictoriaMetrics, InfluxDB |
| Log Storage | Loki | Elasticsearch/OpenSearch, Splunk, Datadog Logs |
| Trace Storage | Tempo | Jaeger, Zipkin, Datadog APM, New Relic, Honeycomb |
| Profiling | Pyroscope | Parca, Polar Signals, Datadog Profiling |
| Visualization | Grafana | Kibana, Datadog, New Relic, Chronograf |
| Full Platform | Grafana Cloud | Datadog, New Relic, Dynatrace, Splunk Observability |
Deploying the Grafana Stack
The Grafana stack supports several deployment models:
- Grafana Cloud (SaaS): Fully managed service with a free tier (10K metrics series, 50GB logs, and 50GB traces at the time of writing). No infrastructure to manage. Best for teams that want to focus on using observability, not running it.
- Self-hosted (Kubernetes): Deploy via Helm charts (grafana/helm-charts). Full control over data residency, scaling, and cost. Requires operational expertise.
- Self-hosted (Docker Compose): Single-machine deployment for development, testing, or small-scale production. Simple but limited scaling.
- Hybrid: Self-hosted collection (Alloy) with Grafana Cloud storage. Keep data collection close to workloads while offloading storage/querying.
# Quick start: Deploy full LGTM stack locally with Docker
docker run --name lgtm -p 3000:3000 -p 4317:4317 -p 4318:4318 \
  grafana/otel-lgtm:latest
# Access Grafana at http://localhost:3000 (admin/admin)
# Send OTLP data to localhost:4317 (gRPC) or localhost:4318 (HTTP)

# Production: Deploy via Helm on Kubernetes
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install individual components (the first install creates the namespace if it is missing)
helm install mimir grafana/mimir-distributed -n observability --create-namespace
helm install loki grafana/loki -n observability
helm install tempo grafana/tempo-distributed -n observability
helm install grafana grafana/grafana -n observability
helm install alloy grafana/alloy -n observability
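With the local quick-start container running, one way to check the pipeline end to end is to post a single hand-built span to the OTLP/HTTP endpoint on port 4318. The sketch below uses only the standard library and the JSON encoding of OTLP. The service name, span name, and scope name are hypothetical, and a real application would use an OpenTelemetry SDK rather than building payloads by hand.

```python
import json
import secrets
import time
import urllib.request

# OTLP/HTTP traces endpoint exposed by the quick-start container above.
OTLP_TRACES_URL = "http://localhost:4318/v1/traces"


def build_span_payload(service_name: str, span_name: str, duration_ms: int) -> dict:
    """Build an OTLP/JSON export request containing one span that ends now."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [{"key": "service.name", "value": {"stringValue": service_name}}],
            },
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical instrumentation scope
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,                         # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }


def send(payload: dict, url: str = OTLP_TRACES_URL) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_span_payload("checkout-service", "manual-test-span", duration_ms=25)
print(json.dumps(payload, indent=2))
# send(payload)  # uncomment once the container is running
```

If the request is accepted, the span should become searchable by service name in Grafana's Explore view against the Tempo data source.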
Summary & Next Steps
In this first part of the Grafana Deep Dive track, we established the foundational concepts:
- Observability answers why systems fail, not just what failed
- Four telemetry types: metrics (Mimir), logs (Loki), traces (Tempo), profiles (Pyroscope)
- The Grafana stack is modular, open-source-first, and avoids vendor lock-in
- Different personas (developer, operator, SRE, product, management) need different views
- Deployment options range from fully managed cloud to self-hosted Kubernetes
Next in the Grafana Track
In Part 2: Instrumenting Applications & Infrastructure, we'll dive into common log formats, metric types and protocols, tracing best practices, and how to use OpenTelemetry libraries to instrument your applications efficiently.