Grafana Deep Dive Part 1: Introducing Observability & the Grafana Stack

June 15, 2026 Wasil Zafar 25 min read

A comprehensive introduction to observability as a discipline and the Grafana ecosystem that supports it — from telemetry fundamentals (metrics, logs, traces, profiles) to the full LGTM stack architecture, user personas, deployment models, and how every Grafana component connects.

Table of Contents

  1. Observability in a Nutshell
  2. Telemetry Types & Technologies
  3. User Personas of Observers
  4. The Grafana Stack
  5. Alternatives to the Grafana Stack
  6. Deploying the Grafana Stack
  7. Summary & Next Steps

Observability in a Nutshell

Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike traditional monitoring, which tells you that something is broken, observability helps you work out why it broke, which is the information you need to fix it. In complex distributed systems, you cannot predict every failure mode in advance. Observability gives you the tools to ask arbitrary questions of your running systems without deploying new code.

The Core Distinction: Monitoring is about known-unknowns (things you expect might fail). Observability is about unknown-unknowns (novel failures you never anticipated). A well-observed system lets you debug issues you've never seen before.

The three foundational properties of observability are:

  • High cardinality — the ability to filter and group data by any dimension (user ID, request ID, service version, deployment region)
  • High dimensionality — rich context attached to every telemetry signal (not just "error occurred" but who, what, where, when, and the full request context)
  • Explorability — the ability to ask new questions without pre-defining dashboards or alerts. If you have to modify code to answer a question about production, your observability is incomplete.
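To make "filter and group by any dimension" concrete, here is a minimal sketch in Python. The events, field names, and the `error_rate_by` helper are all hypothetical and exist only for illustration; a real backend would run this kind of query over stored telemetry rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, several dimensions each.
EVENTS = [
    {"region": "eu-west", "version": "1.4.2", "user_id": "usr_42", "status": 500},
    {"region": "eu-west", "version": "1.4.1", "user_id": "usr_7", "status": 200},
    {"region": "us-east", "version": "1.4.2", "user_id": "usr_13", "status": 500},
    {"region": "us-east", "version": "1.4.1", "user_id": "usr_42", "status": 200},
]


def error_rate_by(events, dimension):
    """Group events by any field and return the share of 5xx responses per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# The same data answers two different questions without any new instrumentation.
by_region = error_rate_by(EVENTS, "region")
by_version = error_rate_by(EVENTS, "version")
print(by_region)
print(by_version)
```

In this made-up data set, grouping by region shows nothing unusual, while grouping by version isolates the failing release. The point is that the dimension was chosen at query time, not when the events were recorded.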

Case Study: A Ship Passing Through the Panama Canal

Imagine a cargo ship navigating the Panama Canal. The canal authority must know the ship's position (metrics), read the captain's log entries (logs), and trace the ship's journey through each lock and lake segment (traces). They also monitor water levels, wind speed, and tugboat fuel (infrastructure metrics).

If the ship runs aground, they need to correlate: Was the water level low? Did the pilot make an error? Was there a mechanical failure? This requires all three telemetry types correlated together — that's observability.

Telemetry Types & Technologies

Modern observability rests on four pillars of telemetry. Each provides a different lens into system behavior, and the real power comes from correlating across all four.

Metrics

Metrics are numeric measurements collected at regular intervals. They are lightweight, aggregatable, and ideal for alerting and trend analysis. Grafana stores metrics in Mimir (a horizontally scalable, long-term Prometheus-compatible storage).

Metric Types: Counters (monotonically increasing values like request count), Gauges (point-in-time values like CPU usage), Histograms (distributions like request latency), and Summaries (pre-calculated quantiles).
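The difference between these types is easiest to see in code. The following is a simplified sketch in plain Python, not the implementation used by any Prometheus client library; the class names and bucket bounds are assumptions chosen for illustration. Summaries are omitted because they need a streaming quantile estimator.

```python
import bisect


class Counter:
    """Monotonically increasing value, for example total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down, for example CPU usage."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket; each bucket has an inclusive upper bound."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the first bound >= value, so bounds are inclusive.
        index = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, Prometheus style."""
        running, result = 0, []
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


requests_total = Counter()
requests_total.inc()
requests_total.inc()

latency_seconds = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency_seconds.observe(seconds)

print(requests_total.value)
print(latency_seconds.cumulative())
```

Because histogram buckets are plain counters, they can be summed across instances before a quantile is estimated, which is not possible with the pre-calculated quantiles of a summary.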

Key characteristics of metrics:

  • Fixed cost per time series regardless of traffic volume
  • Ideal for alerting (SLO burn-rate alerts, threshold alerts)
  • Native aggregation across dimensions (sum, avg, percentile)
  • Retention measured in months to years
  • Protocols: Prometheus exposition format, OTLP, StatsD, DogStatsD, SNMP
# Example: Prometheus scrape configuration
scrape_configs:
  - job_name: 'my-application'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-server:8080']
        labels:
          environment: 'production'
          team: 'platform'
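For context, the `/metrics` path in a scrape configuration like this one returns plain text in the Prometheus exposition format. The sketch below renders that format by hand with the Python standard library. The metric name, help text, and sample values are made up, and in practice a client library would generate this output for you.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_text = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


exposition = render_metric(
    "http_requests_total",            # hypothetical metric name
    "Total HTTP requests.",
    "counter",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(exposition)
```

The labels set under `static_configs` (`environment`, `team`) do not appear in the application's output. Prometheus attaches them to every series from that target at scrape time.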

Logs

Logs are timestamped text records of discrete events. They provide the richest context of any telemetry type but are expensive to store and search at scale. Grafana stores logs in Loki, a log aggregation system that indexes only metadata labels (not the full text). This keeps the index small and typically makes Loki cheaper to operate than full-text-indexing systems such as Elasticsearch.

Log formats range from unstructured to fully structured:

  • Unstructured: Free-form text (ERROR: Connection timeout to database at 10.0.1.5:5432)
  • Semi-structured: Consistent format that a custom parser (for example a regular expression) can split into fields, but without typed or self-describing fields ([2026-06-15 14:30:02] ERROR app.database - Connection timeout)
  • Structured (JSON): Machine-parseable with typed fields — the gold standard for observability
{
  "timestamp": "2026-06-15T14:30:02.341Z",
  "level": "error",
  "service": "checkout-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Connection timeout to database",
  "attributes": {
    "db.host": "10.0.1.5",
    "db.port": 5432,
    "db.name": "orders",
    "retry_count": 3,
    "user_id": "usr_42"
  }
}
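One way to produce records shaped like the one above is a custom formatter for Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class and the convention of passing `service`, `trace_id`, `span_id`, and `attributes` through `extra` are assumptions for illustration, not an established API.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            # Fields below arrive via the `extra` argument of the log call.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(payload)


# Log to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "Connection timeout to database",
    extra={
        "service": "checkout-service",
        "trace_id": "abc123def456",
        "span_id": "789ghi",
        "attributes": {"db.host": "10.0.1.5", "db.port": 5432, "retry_count": 3},
    },
)

entry = json.loads(stream.getvalue())
print(entry["level"], entry["trace_id"], entry["attributes"]["db.port"])
```

Carrying `trace_id` on every log line is what later allows a jump from a log entry to the matching trace.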

Distributed Traces

A distributed trace follows a single request as it propagates through multiple services. Each unit of work is a span, and spans form a tree (or DAG) representing the full request lifecycle. Grafana stores traces in Tempo — a high-scale, cost-effective trace backend that requires no indexing beyond trace ID.

Distributed Trace: Order Checkout Flow
flowchart LR
    A[API Gateway<br/>12ms] --> B[Auth Service<br/>3ms]
    A --> C[Order Service<br/>45ms]
    C --> D[Payment Service<br/>120ms]
    C --> E[Inventory Service<br/>28ms]
    D --> F[Bank API<br/>95ms]
    E --> G[Database<br/>12ms]
    style A fill:#e8f4f4,stroke:#3B9797
    style D fill:#fff5f5,stroke:#BF092F
    style F fill:#fff5f5,stroke:#BF092F

Key tracing concepts:

  • Trace: The complete journey of a request across all services
  • Span: A single unit of work with start time, duration, attributes, and status
  • Context propagation: Passing trace/span IDs between services via HTTP headers (W3C Trace Context, B3) or gRPC metadata
  • Sampling: Collecting a subset of traces (head-based, tail-based, or adaptive) to control cost
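Context propagation and head-based sampling can be sketched together with the W3C Trace Context `traceparent` header, whose format is `version-traceid-parentid-flags`. The helper names below are hypothetical, and the sampling rule (comparing part of the trace ID against a ratio) is one common approach rather than the only one. Production code would normally rely on an OpenTelemetry SDK instead.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def should_sample(trace_id, ratio):
    """Head-based decision derived from the trace ID, so it is repeatable."""
    threshold = int(ratio * (1 << 64))
    # Use the low 8 bytes of the trace ID as a uniformly distributed number.
    return int(trace_id[16:], 16) < threshold


def start_trace(ratio=1.0):
    """Create the header for a brand-new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if should_sample(trace_id, ratio) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def continue_trace(traceparent):
    """Build the header a service sends downstream: same trace, new span ID."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = continue_trace(incoming)
print(incoming)
print(outgoing)
```

The trace ID and the sampled flag stay constant across hops while each service contributes its own span ID. Tail-based sampling differs in that the keep-or-drop decision is made after the whole trace has been collected.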

Other Telemetry Types

Beyond the classic three pillars, modern observability includes:

  • Profiles (Continuous Profiling): CPU, memory, and goroutine flame graphs showing exactly which functions consume resources. Grafana stores profiles in Pyroscope.
  • Events: Discrete occurrences (deployments, config changes, feature flag flips) that correlate with metric anomalies
  • Exemplars: Links from aggregated metrics to specific trace IDs, enabling drill-down from a latency spike to the exact slow request
  • Real User Monitoring (RUM): Browser-side telemetry capturing Core Web Vitals, JavaScript errors, and user sessions via Grafana Faro
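Of the items above, exemplars are the most mechanical, so a small sketch may help. Here a histogram remembers one trace ID per bucket, which is enough to jump from "the slow bucket grew" to a concrete trace. The `ExemplarHistogram` class, bucket bounds, and trace IDs are hypothetical; real client libraries and backends store exemplars differently.

```python
import bisect


class ExemplarHistogram:
    """Latency histogram that keeps the most recent (trace_id, value) per bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)
        self.exemplars = [None] * len(self.bounds)

    def observe(self, value, trace_id):
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        # Overwrite: only one exemplar per bucket is retained.
        self.exemplars[index] = (trace_id, value)

    def slowest_exemplar(self):
        """Return the exemplar from the highest non-empty bucket, or None."""
        for exemplar in reversed(self.exemplars):
            if exemplar is not None:
                return exemplar
        return None


latency = ExemplarHistogram([0.1, 0.5, 1.0])
latency.observe(0.08, "trace-a")
latency.observe(0.12, "trace-b")
latency.observe(2.40, "trace-c")
latency.observe(0.09, "trace-d")

print(latency.counts)
print(latency.slowest_exemplar())
```

The metric itself stays cheap because only one trace ID per bucket is stored, yet it still points at a specific slow request to open in the trace backend.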

User Personas of Observers

Different roles interact with observability data in different ways. Understanding these personas helps design dashboards, alerts, and access controls appropriately:

Who Uses Observability?
  • Diego Developer: Needs trace-level debugging, log search, and code-level profiling to understand why their service is slow or erroring
  • Ophelia Operator: Monitors infrastructure health, manages alerts and on-call rotations, responds to incidents, needs overview dashboards
  • Steven Service (SRE): Defines SLOs, manages error budgets, architects the observability platform, bridges dev and ops
  • Pelé Product: Tracks feature adoption, user journeys, conversion funnels — needs business metrics derived from telemetry
  • Masha Manager: Needs executive summaries, cost reports, SLA compliance dashboards, and incident postmortem insights

The Grafana Stack

Grafana Labs provides a comprehensive, open-source-first observability platform. The stack is modular — you can adopt individual components or deploy the full integrated suite.

The Core LGTM Stack

The acronym LGTM stands for Loki (logs), Grafana (visualization), Tempo (traces), and Mimir (metrics). Together with the collector layer, these form the core:

Grafana LGTM Stack Architecture
flowchart TD
    subgraph Applications
        A1[Service A] --> COLL
        A2[Service B] --> COLL
        A3[Service C] --> COLL
    end

    subgraph Collection
        COLL[Grafana Alloy<br/>OTel Collector]
    end

    subgraph Storage
        COLL -->|metrics| MIMIR[Grafana Mimir]
        COLL -->|logs| LOKI[Grafana Loki]
        COLL -->|traces| TEMPO[Grafana Tempo]
        COLL -->|profiles| PYRO[Grafana Pyroscope]
    end

    subgraph Visualization
        MIMIR --> GF[Grafana]
        LOKI --> GF
        TEMPO --> GF
        PYRO --> GF
    end

    style COLL fill:#e8f4f4,stroke:#3B9797
    style GF fill:#f0f4f8,stroke:#16476A
    style MIMIR fill:#e8f4f4,stroke:#3B9797
    style LOKI fill:#e8f4f4,stroke:#3B9797
    style TEMPO fill:#e8f4f4,stroke:#3B9797
    style PYRO fill:#e8f4f4,stroke:#3B9797

| Component | Role | Query Language | Key Feature |
|---|---|---|---|
| Grafana | Visualization & alerting | N/A (queries backends) | Unified UI for all telemetry types |
| Mimir | Metrics storage | PromQL | Horizontally scalable, Prometheus-compatible storage |
| Loki | Log aggregation | LogQL | Indexes only labels, not log content |
| Tempo | Trace storage | TraceQL | No indexing required, object storage backend |
| Pyroscope | Continuous profiling | FlameQL | Always-on profiling with minimal overhead |
| Alloy | Telemetry collector | N/A (configured with Alloy syntax, formerly River) | Unified collection for all signal types |
| Beyla | eBPF auto-instrumentation | N/A | Zero-code instrumentation via eBPF |

Grafana Enterprise Plugins

Grafana Enterprise extends the open-source platform with:

  • Enterprise data sources: Oracle, SAP HANA, Snowflake, Databricks, ServiceNow, Splunk, Datadog
  • Enhanced security: Data source permissions, team-level RBAC, audit logging, SAML/LDAP/OAuth SSO
  • Reporting: Scheduled PDF/CSV report generation and email delivery
  • Caching: Query caching layer for expensive data source queries
  • Recorded queries: Precompute expensive queries on a schedule

Incident Response & Management (IRM)

Grafana's IRM suite provides end-to-end incident lifecycle management:

  • Grafana Alerting: Unified alerting engine supporting Prometheus-style, Loki, and multi-dimensional alert rules with notification policies and silences
  • Grafana OnCall: On-call scheduling, escalation chains, alert grouping, and integration with PagerDuty/Opsgenie/Slack
  • Grafana Incident: Collaborative incident management with war rooms, timelines, roles, and automated postmortem generation
  • Grafana SLO: Define and monitor Service Level Objectives with burn-rate alerting
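Burn rate is simple arithmetic: the observed error ratio divided by the error budget, where the budget is 1 minus the SLO target. The sketch below works one example. The request counts, the 99.9% target, and the 30-day window are assumed values for illustration, not defaults of Grafana SLO.

```python
def burn_rate(errors, total, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target      # e.g. 0.001 for a 99.9% target
    observed_error_ratio = errors / total
    return observed_error_ratio / error_budget


def hours_until_budget_exhausted(rate, window_days=30):
    """Time to spend the whole budget if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# Assumed numbers: 72 failed requests out of 5000 in the last hour, 99.9% SLO.
rate = burn_rate(errors=72, total=5000, slo_target=0.999)
hours = hours_until_budget_exhausted(rate)
print(round(rate, 2), round(hours, 1))
```

Here the service burns its budget about 14.4 times faster than the SLO allows, so a 30-day budget would be gone in roughly 50 hours. A burn-rate alert fires when this ratio stays above a chosen threshold over a chosen lookback window.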

Other Grafana Tools

  • Grafana k6: Load testing and performance testing with JavaScript-based test scripts
  • Grafana Synthetic Monitoring: Proactive endpoint and protocol checks from global locations
  • Grafana Faro: Real User Monitoring (RUM) SDK for browser applications
  • Grafana Machine Learning: Anomaly detection, forecasting, and automated incident investigation (Sift)
  • Grafana Assistant: Natural language querying of observability data via LLM

Alternatives to the Grafana Stack

Understanding the competitive landscape helps you make informed architecture decisions:

| Category | Grafana Solution | Alternatives |
|---|---|---|
| Data Collection | Alloy, Beyla | OpenTelemetry Collector, Fluent Bit, Vector, Telegraf |
| Metrics Storage | Mimir | Prometheus, Thanos, Cortex, VictoriaMetrics, InfluxDB |
| Log Storage | Loki | Elasticsearch/OpenSearch, Splunk, Datadog Logs |
| Trace Storage | Tempo | Jaeger, Zipkin, Datadog APM, New Relic, Honeycomb |
| Profiling | Pyroscope | Parca, Polar Signals, Datadog Profiling |
| Visualization | Grafana | Kibana, Datadog, New Relic, Chronograf |
| Full Platform | Grafana Cloud | Datadog, New Relic, Dynatrace, Splunk Observability |

Why Grafana Wins on Flexibility: Unlike proprietary platforms, the Grafana stack is open-source, avoids vendor lock-in, supports 150+ data source plugins, and lets you mix commercial and OSS backends freely. You can query Datadog metrics alongside Prometheus in the same dashboard, although the Datadog data source is one of the Enterprise plugins.

Deploying the Grafana Stack

The Grafana stack supports multiple deployment models to fit any organization:

  • Grafana Cloud (SaaS): Fully managed service with a generous free tier (10K metrics series, 50GB logs, 50GB traces). Zero infrastructure management. Best for teams that want to focus on using observability, not running it.
  • Self-hosted (Kubernetes): Deploy via Helm charts (grafana/helm-charts). Full control over data residency, scaling, and cost. Requires operational expertise.
  • Self-hosted (Docker Compose): Single-machine deployment for development, testing, or small-scale production. Simple but limited scaling.
  • Hybrid: Self-hosted collection (Alloy) with Grafana Cloud storage. Keep data collection close to workloads while offloading storage/querying.
# Quick start: Deploy full LGTM stack locally with Docker
docker run --name lgtm -p 3000:3000 -p 4317:4317 -p 4318:4318 \
  grafana/otel-lgtm:latest

# Access Grafana at http://localhost:3000 (admin/admin)
# Send OTLP data to localhost:4317 (gRPC) or localhost:4318 (HTTP)

# Production: Deploy via Helm on Kubernetes
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create the target namespace once; the installs below fail if it is missing
kubectl create namespace observability

# Install individual components
helm install mimir grafana/mimir-distributed -n observability
helm install loki grafana/loki -n observability
helm install tempo grafana/tempo-distributed -n observability
helm install grafana grafana/grafana -n observability
helm install alloy grafana/alloy -n observability
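To check that the local Docker quick start is receiving data, you can post a single hand-built span to the OTLP/HTTP port using only the Python standard library. This is a rough smoke-test sketch: the service name, span name, and helper functions are made up, and it assumes the container accepts JSON-encoded OTLP at `/v1/traces` on port 4318. Real applications would use an OpenTelemetry SDK rather than building payloads by hand.

```python
import json
import secrets
import time
import urllib.request


def build_otlp_trace_payload(service_name, span_name, duration_ms):
    """Build a minimal OTLP/JSON trace export containing one span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),    # 32 hex characters
                    "spanId": secrets.token_hex(8),      # 16 hex characters
                    "name": span_name,
                    "kind": 1,                           # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def send_trace(payload, endpoint="http://localhost:4318/v1/traces"):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_otlp_trace_payload("smoke-test", "hello-lgtm", duration_ms=25)
print(json.dumps(payload, indent=2))
# With the container running, uncomment the next line to send the span:
# print(send_trace(payload))
```

If the request succeeds, the span should be searchable in Grafana through the Tempo data source under the service name `smoke-test`.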

Summary & Next Steps

In this first part of the Grafana Deep Dive track, we established the foundational concepts:

  • Observability answers why systems fail, not just what failed
  • Four telemetry types: metrics (Mimir), logs (Loki), traces (Tempo), profiles (Pyroscope)
  • The Grafana stack is modular, open-source-first, and avoids vendor lock-in
  • Different personas (developer, operator, SRE, product, management) need different views
  • Deployment options range from fully managed cloud to self-hosted Kubernetes

Next in the Grafana Track

In Part 2: Instrumenting Applications & Infrastructure, we'll dive into common log formats, metric types and protocols, tracing best practices, and how to use OpenTelemetry libraries to instrument your applications efficiently.