Grafana Deep Dive Part 1: Introducing Observability & the Grafana Stack

Observability in a Nutshell

Observability is the ability to understand the internal state of a system by examining its external outputs. Traditional monitoring tells you what is broken; observability helps you work out why it broke. In complex distributed systems, you cannot predict every failure mode in advance. Observability gives you the tools to ask arbitrary questions of your running systems without deploying new code.

                            
The Core Distinction: Monitoring is about known-unknowns (things you expect might fail). Observability is about unknown-unknowns (novel failures you never anticipated). A well-observed system lets you debug issues you've never seen before.

The three foundational properties of observability are:

High cardinality — the ability to filter and group data by any dimension (user ID, request ID, service version, deployment region)
High dimensionality — rich context attached to every telemetry signal (not just "error occurred" but who, what, where, when, and the full request context)
Explorability — the ability to ask new questions without pre-defining dashboards or alerts. If you have to modify code to answer a question about production, your observability is incomplete.
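To make these properties concrete, here is a minimal sketch of asking a new question of existing telemetry without changing application code. The in-memory event list and its field names (user_id, region, version) are hypothetical; a real system would run the same kind of query against a telemetry backend.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, with rich context attached.
EVENTS = [
    {"user_id": "usr_1", "region": "eu-west", "version": "1.4.0", "status": 200, "duration_ms": 41},
    {"user_id": "usr_2", "region": "eu-west", "version": "1.4.1", "status": 500, "duration_ms": 930},
    {"user_id": "usr_3", "region": "us-east", "version": "1.4.1", "status": 500, "duration_ms": 870},
    {"user_id": "usr_1", "region": "us-east", "version": "1.4.0", "status": 200, "duration_ms": 38},
]


def error_rate_by(events, dimension):
    """Group events by any field and return the share of 5xx responses per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# The same data answers two different questions, chosen at query time.
print(error_rate_by(EVENTS, "region"))
print(error_rate_by(EVENTS, "version"))
```

Grouping by region shows nothing unusual, while grouping by version isolates the failing release. The point is that the dimension is picked when the question is asked, not when the code was instrumented.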

Case Study: A Ship Passing Through the Panama Canal

Imagine a cargo ship navigating the Panama Canal. The canal authority must know the ship's position (metrics), read the captain's log entries (logs), and trace the ship's journey through each lock and lake segment (traces). They also monitor water levels, wind speed, and tug boat fuel (infrastructure metrics).

If the ship runs aground, they need to correlate: Was the water level low? Did the pilot make an error? Was there a mechanical failure? This requires all three telemetry types correlated together — that's observability.

Telemetry Types & Technologies

Modern observability rests on four types of telemetry: metrics, logs, traces, and profiles. Each provides a different lens into system behavior, and the real power comes from correlating across all four.

Metrics

Metrics are numeric measurements collected at regular intervals. They are lightweight, aggregatable, and ideal for alerting and trend analysis. In the Grafana stack, metrics are stored in Mimir, a horizontally scalable, Prometheus-compatible long-term store.

                            
Metric Types: Counters (monotonically increasing values like request count), Gauges (point-in-time values like CPU usage), Histograms (distributions like request latency), and Summaries (pre-calculated quantiles).

Key characteristics of metrics:

Fixed cost per time series regardless of traffic volume (cost grows with the number of label combinations, not the request rate)
Ideal for alerting (SLO burn-rate alerts, threshold alerts)
Native aggregation across dimensions (sum, avg, percentile)
Retention measured in months to years
Protocols: Prometheus exposition format, OTLP, StatsD, DogStatsD, SNMP
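The metric types described above differ mainly in how their values are allowed to change. The following is a minimal, standard-library-only sketch of those semantics. The class names are illustrative and are not the API of any Prometheus client library. Summaries are left out because they compute quantiles on the client and cannot be aggregated across instances in the same way.

```python
import bisect


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down, e.g. CPU usage."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets, e.g. request latency in seconds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, Prometheus-style."""
        running, out = 0, []
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += n
            out.append((bound, running))
        return out


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
print(latency.cumulative())
```

Because histogram buckets are plain counters, they can be summed across instances before a quantile is estimated, which is why histograms are usually preferred for fleet-wide latency percentiles.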

# Example: Prometheus scrape configuration
scrape_configs:
  - job_name: 'my-application'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-server:8080']
        labels:
          environment: 'production'
          team: 'platform'
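For context, the scrape job above expects the target to serve plain text at /metrics. The sketch below renders one metric family in the Prometheus text exposition format. The function name and the sample metric are hypothetical, and label-value escaping is omitted for brevity, so treat it as an illustration rather than a replacement for a client library.

```python
def render_exposition(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_exposition(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "status": "200"}, 1027), ({"method": "POST", "status": "500"}, 3)],
)
print(body)
```

Each unique combination of metric name and labels in this output becomes one time series on the storage side.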

Logs

Logs are timestamped text records of discrete events. They provide the richest context of any telemetry type but are expensive to store and search at scale. In the Grafana stack, logs are stored in Loki, a log aggregation system that indexes only metadata labels (not full text). This keeps the index small and usually makes Loki much cheaper to operate than full-text systems such as Elasticsearch, at the cost of scanning log content at query time.
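The label-only indexing idea can be illustrated with a toy store. This is a conceptual sketch, not how Loki is implemented: the class and method names are invented, and real Loki stores compressed chunks in object storage. The query at the end is roughly analogous to the LogQL expression {app="checkout"} |= "timeout".

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: index label sets only, scan line content at query time."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Pick streams whose labels match the selector, then filter lines by substring."""
        wanted = set(selector.items())
        hits = []
        for label_set, lines in self.streams.items():
            if wanted <= label_set:  # cheap: label match decides which streams to open
                hits.extend(line for line in lines if contains in line)  # brute-force scan
        return hits


store = LabelIndexedStore()
store.push({"app": "checkout", "env": "prod"}, "ERROR Connection timeout to database")
store.push({"app": "checkout", "env": "prod"}, "INFO order placed")
store.push({"app": "search", "env": "prod"}, "ERROR timeout calling index")
print(store.query({"app": "checkout"}, contains="timeout"))
```

The trade-off is visible even at this scale: selecting streams is cheap, while content filtering is a linear scan over whatever the labels let through, so good label design matters.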

Log formats range from unstructured to fully structured:

Unstructured: Free-form text (ERROR: Connection timeout to database at 10.0.1.5:5432)
Semi-structured: Consistent format that still needs a custom parser (for example, a regex) to extract fields ([2026-06-15 14:30:02] ERROR app.database - Connection timeout)
Structured (JSON): Machine-parseable with typed fields — the gold standard for observability

{
  "timestamp": "2026-06-15T14:30:02.341Z",
  "level": "error",
  "service": "checkout-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Connection timeout to database",
  "attributes": {
    "db.host": "10.0.1.5",
    "db.port": 5432,
    "db.name": "orders",
    "retry_count": 3,
    "user_id": "usr_42"
  }
}
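One way to emit records shaped like the JSON above is a custom formatter for Python's standard logging module. This is a minimal sketch: the JsonFormatter class and the choice of fields are assumptions for illustration, and the timestamp is rendered with a +00:00 offset rather than a trailing Z.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(
                timespec="milliseconds"
            ),
            "level": record.levelname.lower(),
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        # Optional correlation IDs and attributes passed via the `extra=` argument.
        for key in ("trace_id", "span_id", "attributes"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Connection timeout to database",
    extra={
        "service": "checkout-service",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "attributes": {"db.host": "10.0.1.5", "retry_count": 3},
    },
)
```

Carrying trace_id in every log line is what later allows a jump from a log entry to the matching trace.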

Distributed Traces

A distributed trace follows a single request as it propagates through multiple services. Each unit of work is a span, and spans form a tree (or DAG) representing the full request lifecycle. In the Grafana stack, traces are stored in Tempo, a high-scale, cost-effective trace backend that requires no indexing beyond trace ID.

Distributed Trace: Order Checkout Flow

flowchart LR
    A["API Gateway<br/>12ms"] --> B["Auth Service<br/>3ms"]
    A --> C["Order Service<br/>45ms"]
    C --> D["Payment Service<br/>120ms"]
    C --> E["Inventory Service<br/>28ms"]
    D --> F["Bank API<br/>95ms"]
    E --> G["Database<br/>12ms"]

    style A fill:#e8f4f4,stroke:#3B9797
    style D fill:#fff5f5,stroke:#BF092F
    style F fill:#fff5f5,stroke:#BF092F
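A trace backend receives spans as flat records and rebuilds the tree from parent references. The sketch below does that for the checkout flow in the diagram. The span IDs are made up, and the durations are copied from the diagram as labels only, without assuming whether they include time spent in child spans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


SPANS = [
    Span("a", None, "API Gateway", 12),
    Span("b", "a", "Auth Service", 3),
    Span("c", "a", "Order Service", 45),
    Span("d", "c", "Payment Service", 120),
    Span("e", "c", "Inventory Service", 28),
    Span("f", "d", "Bank API", 95),
    Span("g", "e", "Database", 12),
]


def build_children(spans):
    """Map each parent span ID to the list of its direct children."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    return children


def render_tree(spans):
    """Return the trace as indented lines, one per span, depth-first."""
    children = build_children(spans)
    lines = []

    def walk(span, depth):
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms}ms)")
        for child in children.get(span.span_id, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


print("\n".join(render_tree(SPANS)))
```

The indented output is the textual equivalent of the waterfall view that trace UIs draw from the same parent and child relationships.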

Key tracing concepts:

Trace: The complete journey of a request across all services
Span: A single unit of work with start time, duration, attributes, and status
Context propagation: Passing trace/span IDs between services via HTTP headers (W3C Trace Context, B3) or gRPC metadata
Sampling: Collecting a subset of traces (head-based, tail-based, or adaptive) to control cost
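Context propagation with W3C Trace Context comes down to one HTTP header, traceparent, which carries a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte whose lowest bit records the sampling decision. Below is a simplified sketch of generating, parsing, and continuing that header. The function names are illustrative, only version 00 is handled, and production code should rely on an OpenTelemetry SDK propagator instead.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random trace ID and span ID, sampled flag set by the caller."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid values.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Continue a trace: keep the trace ID and sampling decision, mint a new span ID."""
    ctx = parse_traceparent(header)
    if ctx is None:
        return new_traceparent()  # broken or missing context: start a fresh trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

Because the sampled flag travels with the request, a head-based sampling decision made at the edge is honored by every downstream service.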

Other Telemetry Types

Beyond the classic three pillars, modern observability includes:

Profiles (Continuous Profiling): CPU, memory, and goroutine flame graphs showing exactly which functions consume resources. Grafana stores profiles in Pyroscope.
Events: Discrete occurrences (deployments, config changes, feature flag flips) that correlate with metric anomalies
Exemplars: Links from aggregated metrics to specific trace IDs, enabling drill-down from a latency spike to the exact slow request
Real User Monitoring (RUM): Browser-side telemetry capturing Core Web Vitals, JavaScript errors, and user sessions via Grafana Faro
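Of the signals above, exemplars are the easiest to show in code: each histogram bucket remembers a trace ID alongside its count. The following is a conceptual sketch with invented names and trace IDs, not the storage format used by Prometheus or Mimir.

```python
import bisect


class ExemplarHistogram:
    """Latency histogram that keeps the most recent trace ID seen in each bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)
        self.exemplars = [None] * len(self.bounds)  # (trace_id, value) per bucket

    def observe(self, value, trace_id=None):
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i] = (trace_id, value)

    def slowest_exemplar(self):
        """Return the exemplar from the highest bucket that has one, if any."""
        for exemplar in reversed(self.exemplars):
            if exemplar is not None:
                return exemplar
        return None


latency = ExemplarHistogram([0.1, 0.5, 1.0])
latency.observe(0.05, trace_id="trace-fast")
latency.observe(0.40, trace_id="trace-ok")
latency.observe(2.30, trace_id="trace-slow")
print(latency.slowest_exemplar())
```

When a dashboard shows a latency spike, the exemplar attached to the slow bucket is the link that opens one concrete slow request in the trace view.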

User Personas of Observers

Different roles interact with observability data in different ways. Understanding these personas helps design dashboards, alerts, and access controls appropriately:

Diego Developer: Needs trace-level debugging, log search, and code-level profiling to understand why their service is slow or erroring
Ophelia Operator: Monitors infrastructure health, manages alerts and on-call rotations, responds to incidents, needs overview dashboards
Steven Service (SRE): Defines SLOs, manages error budgets, architects the observability platform, bridges dev and ops
Pelé Product: Tracks feature adoption, user journeys, conversion funnels — needs business metrics derived from telemetry
Masha Manager: Needs executive summaries, cost reports, SLA compliance dashboards, and incident postmortem insights

The Grafana Stack

Grafana Labs provides a comprehensive, open-source-first observability platform. The stack is modular — you can adopt individual components or deploy the full integrated suite.

The Core LGTM Stack

The acronym LGTM stands for Loki (logs), Grafana (visualization), Tempo (traces), and Mimir (metrics). Together with the collector layer, these form the core:

Grafana LGTM Stack Architecture

flowchart TD
    subgraph Applications
        A1[Service A]
        A2[Service B]
        A3[Service C]
    end

    subgraph Collection
        COLL["Grafana Alloy<br/>OTel Collector"]
    end

    subgraph Storage
        MIMIR[Grafana Mimir]
        LOKI[Grafana Loki]
        TEMPO[Grafana Tempo]
        PYRO[Grafana Pyroscope]
    end

    subgraph Visualization
        GF[Grafana]
    end

    A1 --> COLL
    A2 --> COLL
    A3 --> COLL
    COLL -->|metrics| MIMIR
    COLL -->|logs| LOKI
    COLL -->|traces| TEMPO
    COLL -->|profiles| PYRO
    MIMIR --> GF
    LOKI --> GF
    TEMPO --> GF
    PYRO --> GF

    style COLL fill:#e8f4f4,stroke:#3B9797
    style GF fill:#f0f4f8,stroke:#16476A
    style MIMIR fill:#e8f4f4,stroke:#3B9797
    style LOKI fill:#e8f4f4,stroke:#3B9797
    style TEMPO fill:#e8f4f4,stroke:#3B9797
    style PYRO fill:#e8f4f4,stroke:#3B9797

Component	Role	Query Language	Key Feature
Grafana	Visualization & alerting	N/A (queries backends)	Unified UI for all telemetry types
Mimir	Metrics storage	PromQL	Horizontally scalable Prometheus
Loki	Log aggregation	LogQL	Indexes labels only, not full log text
Tempo	Trace storage	TraceQL	No indexing required, object storage backend
Pyroscope	Continuous profiling	FlameQL	Always-on profiling with minimal overhead
Alloy	Telemetry collector	Alloy configuration syntax (formerly River)	Unified collection for all signal types
Beyla	eBPF auto-instrumentation	N/A	Zero-code instrumentation via eBPF

Grafana Enterprise Plugins

Grafana Enterprise extends the open-source platform with:

Enterprise data sources: Oracle, SAP HANA, Snowflake, Databricks, ServiceNow, Splunk, Datadog
Enhanced security: Data source permissions, team-level RBAC, audit logging, SAML/LDAP/OAuth SSO
Reporting: Scheduled PDF/CSV report generation and email delivery
Caching: Query caching layer for expensive data source queries
Recorded queries: Precompute expensive queries on a schedule

Incident Response & Management (IRM)

Grafana's IRM suite provides end-to-end incident lifecycle management:

Grafana Alerting: Unified alerting engine supporting Prometheus-style, Loki, and multi-dimensional alert rules with notification policies and silences
Grafana OnCall: On-call scheduling, escalation chains, alert grouping, and integration with PagerDuty/Opsgenie/Slack
Grafana Incident: Collaborative incident management with war rooms, timelines, roles, and automated postmortem generation
Grafana SLO: Define and monitor Service Level Objectives with burn-rate alerting
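Burn-rate alerting rests on simple arithmetic: the burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below works through it. The function names are illustrative, and the 14.4 threshold is a commonly cited value for a 30-day window, used here as an assumption rather than a Grafana SLO default.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate, window_days=30):
    """Hours until the whole window's error budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_page(short_window_rate, long_window_rate, threshold=14.4):
    """Multi-window check: page only if both windows exceed the threshold."""
    return short_window_rate >= threshold and long_window_rate >= threshold


# A 99.9% SLO leaves a 0.1% error budget; 1.44% of requests failing burns it 14.4x too fast.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(round(rate, 1), round(hours_to_exhaustion(rate), 1))
```

Requiring both a short and a long window to exceed the threshold avoids paging on brief blips while still reacting quickly to sustained failures.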

Other Grafana Tools

Grafana k6: Load testing and performance testing with JavaScript-based test scripts
Grafana Synthetic Monitoring: Proactive endpoint and protocol checks from global locations
Grafana Faro: Real User Monitoring (RUM) SDK for browser applications
Grafana Machine Learning: Anomaly detection, forecasting, and automated incident investigation (Sift)
Grafana Assistant: Natural language querying of observability data via LLM

Alternatives to the Grafana Stack

Understanding the competitive landscape helps you make informed architecture decisions:

Category	Grafana Solution	Alternatives
Data Collection	Alloy, Beyla	OpenTelemetry Collector, Fluent Bit, Vector, Telegraf
Metrics Storage	Mimir	Prometheus, Thanos, Cortex, VictoriaMetrics, InfluxDB
Log Storage	Loki	Elasticsearch/OpenSearch, Splunk, Datadog Logs
Trace Storage	Tempo	Jaeger, Zipkin, Datadog APM, New Relic, Honeycomb
Profiling	Pyroscope	Parca, Polar Signals, Datadog Profiling
Visualization	Grafana	Kibana, Datadog, New Relic, Chronograf
Full Platform	Grafana Cloud	Datadog, New Relic, Dynatrace, Splunk Observability

                            
Why Grafana Wins on Flexibility: Unlike proprietary platforms, the core Grafana stack is open source, reduces vendor lock-in, supports 150+ data source plugins, and lets you mix commercial and OSS backends freely. With the Enterprise Datadog data source, for example, you can query Datadog metrics alongside Prometheus in the same dashboard.

Deploying the Grafana Stack

The Grafana stack supports multiple deployment models to fit any organization:

Grafana Cloud (SaaS): Fully managed service with a free tier (10K metrics series, 50GB logs, and 50GB traces at the time of writing; limits change, so check the current pricing page). Zero infrastructure management. Best for teams that want to focus on using observability, not running it.
Self-hosted (Kubernetes): Deploy via Helm charts (grafana/helm-charts). Full control over data residency, scaling, and cost. Requires operational expertise.
Self-hosted (Docker Compose): Single-machine deployment for development, testing, or small-scale production. Simple but limited scaling.
Hybrid: Self-hosted collection (Alloy) with Grafana Cloud storage. Keep data collection close to workloads while offloading storage/querying.

# Quick start: Deploy full LGTM stack locally with Docker
docker run --name lgtm -p 3000:3000 -p 4317:4317 -p 4318:4318 \
  grafana/otel-lgtm:latest

# Access Grafana at http://localhost:3000 (admin/admin)
# Send OTLP data to localhost:4317 (gRPC) or localhost:4318 (HTTP)

# Production: Deploy via Helm on Kubernetes
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install individual components into a dedicated namespace.
# --create-namespace creates "observability" on first use; production installs
# also need a values file per chart (object storage, replicas, retention).
helm install mimir grafana/mimir-distributed -n observability --create-namespace
helm install loki grafana/loki -n observability
helm install tempo grafana/tempo-distributed -n observability
helm install grafana grafana/grafana -n observability
helm install alloy grafana/alloy -n observability
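Once the local container from the Docker quick start is running, a hand-built OTLP/HTTP request is a quick way to confirm that traces reach the stack. The sketch below builds a one-span payload in the OTLP JSON encoding and posts it to the /v1/traces path on port 4318. The service name, span name, and helper functions are hypothetical, and an OpenTelemetry SDK is the better choice for real instrumentation.

```python
import json
import secrets
import time
import urllib.request


def build_otlp_trace(service_name, span_name, duration_ms):
    """Build a one-span OTLP/JSON trace export request body."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),  # 64-bit ints are JSON strings
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def send_trace(payload, endpoint="http://localhost:4318/v1/traces"):
    """POST the payload to an OTLP/HTTP endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# With the container running, uncomment to send a test span:
# print(send_trace(build_otlp_trace("demo-service", "hello-lgtm", 25)))
```

After sending, the span should be searchable in Grafana's Explore view with the Tempo data source, under the service name used in the payload.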

Summary & Next Steps

In this first part of the Grafana Deep Dive track, we established the foundational concepts:

Observability answers why systems fail, not just what failed
Four telemetry types: metrics (Mimir), logs (Loki), traces (Tempo), profiles (Pyroscope)
The Grafana stack is modular, open-source-first, and avoids vendor lock-in
Different personas (developer, operator, SRE, product, management) need different views
Deployment options range from fully managed cloud to self-hosted Kubernetes

Next in the Grafana Track

In Part 2: Instrumenting Applications & Infrastructure, we'll dive into common log formats, metric types and protocols, tracing best practices, and how to use OpenTelemetry libraries to instrument your applications efficiently.
