
Tool Deep Dive: Grafana Complete Guide

May 14, 2026 Wasil Zafar 20 min read

The definitive reference for Grafana — from dashboard design principles and panel types to advanced features like mixed data sources, transformations, template variables, provisioning, and dashboard-as-code with Grafonnet.

Table of Contents

  1. Dashboard Design Principles
  2. Panel Types & When to Use Each
  3. Variables & Templating
  4. Transformations
  5. Provisioning & Dashboard as Code
  6. Cross-Pillar Correlations
  7. Production Best Practices

Dashboard Design Principles

Effective dashboards follow a visual hierarchy — the most important information at the top, progressive detail as you scroll down:

  1. Row 1: Status overview — Stat panels showing current state (green/red), SLO compliance %, active alerts count
  2. Row 2: Golden Signals — Time series panels for request rate, error rate, latency, saturation
  3. Row 3: Service-specific — Panels unique to this service (queue depth, cache hit ratio, DB connections)
  4. Row 4: Infrastructure — CPU, memory, disk, network utilisation for underlying resources

The 5-Second Rule: A dashboard should communicate the system's health within 5 seconds of viewing. If someone needs to read query expressions or hover over data to understand whether the system is healthy, the dashboard needs redesigning.
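
The four-row hierarchy above can be sketched as a dashboard skeleton. This is a minimal illustration rather than a finished dashboard: it assumes Grafana's 24-column grid and the `gridPos` field of the dashboard JSON model, leaves out all queries, and uses panel titles borrowed from the list above. The function names and the dashboard title are hypothetical.

```python
import json

GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide


def row_of_panels(titles, panel_type, y, height):
    """Lay out one horizontal band of equally wide panels at vertical offset y."""
    width = GRID_WIDTH // len(titles)
    return [
        {
            "title": title,
            "type": panel_type,
            "gridPos": {"x": i * width, "y": y, "w": width, "h": height},
        }
        for i, title in enumerate(titles)
    ]


def build_layout():
    """Return a skeleton dashboard whose bands follow the four-row hierarchy."""
    bands = [
        # (panel titles, panel type, height in grid units)
        (["Status", "SLO compliance %", "Active alerts"], "stat", 4),
        (["Request rate", "Error rate", "Latency", "Saturation"], "timeseries", 8),
        (["Queue depth", "Cache hit ratio", "DB connections"], "timeseries", 8),
        (["CPU", "Memory", "Disk", "Network"], "timeseries", 6),
    ]
    panels, y = [], 0
    for titles, panel_type, height in bands:
        panels.extend(row_of_panels(titles, panel_type, y, height))
        y += height  # the next band starts directly below this one
    for panel_id, panel in enumerate(panels, start=1):
        panel["id"] = panel_id  # sequential ids, unique within the dashboard
    return {"title": "Service overview (sketch)", "panels": panels}


if __name__ == "__main__":
    print(json.dumps(build_layout(), indent=2))
```

Keeping the status band short (4 grid units here) leaves the golden signals visible without scrolling on most screens, which is what the 5-second rule depends on.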

Panel Types & When to Use Each

| Panel | Best For | Not For |
|-------|----------|---------|
| Time Series | Trends over time, rate metrics, latency | Current state, discrete categories |
| Stat | Single current value with colour thresholds | Showing trends or history |
| Gauge | Utilisation percentages (CPU, memory, disk) | Rates or counts |
| Bar Chart | Comparing categories (top endpoints, error counts by service) | Time-based data |
| Table | Multi-column data, alert details, top-N lists | Visual trend analysis |
| Heatmap | Distribution over time (histogram buckets, latency) | Simple metrics, gauges |
| Logs | Log lines correlated with metrics panel | Aggregate data |
| Traces | Distributed trace waterfall view | Non-trace data |
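
As an illustration of the Stat row in the table, the following sketch builds a stat panel with colour thresholds. It assumes the `fieldConfig.defaults.thresholds` layout of the dashboard JSON model, which may differ between Grafana versions, so compare it against a dashboard exported from your own instance. The helper name, the `status` label, and the threshold values are hypothetical.

```python
def stat_panel(title, expr, warn, crit, unit="percent"):
    """Build a stat panel that turns orange at `warn` and red at `crit`."""
    return {
        "title": title,
        "type": "stat",
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step, no lower bound
                        {"color": "orange", "value": warn},
                        {"color": "red", "value": crit},
                    ],
                },
            },
            "overrides": [],
        },
        "options": {
            "colorMode": "background",  # colour the whole tile, not just the number
            "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": False},
        },
    }


# Hypothetical error-ratio query; the `status` label is an assumption.
error_ratio = (
    '100 * sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
panel = stat_panel("Error rate", error_ratio, warn=1, crit=5)
```

A gauge for utilisation would look much the same with `"type": "gauge"` plus `min` and `max` in the field defaults.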

Variables & Templating

# Variable: $service — dynamically populated from Prometheus labels
# Type: Query
# Query: label_values(http_requests_total, service)
# Refresh: On time range change
# Multi-value: Enabled
# Include All option: Enabled

# Use in panel queries:
sum(rate(http_requests_total{service=~"$service"}[5m])) by (service)

# Variable: $namespace — cascading dependency on $cluster
# Query: label_values(kube_pod_info{cluster="$cluster"}, namespace)

# Variable: $interval (aggregation window that auto-scales with the time range; not the scrape interval)
# Type: Interval
# Values: 1m, 5m, 15m, 30m, 1h
# Auto: enabled, step count: 30
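
The same variables can be expressed as the `templating` section of a dashboard's JSON. The sketch below is an approximation of that model rather than an authoritative schema: field names such as `refresh`, `multi`, `includeAll`, `auto`, and `auto_count` reflect commonly exported dashboards and should be checked against your Grafana version. `PROM_UID` is a hypothetical data source UID, and the `$cluster` variable is assumed to be defined elsewhere.

```python
import json

PROM_UID = "prometheus-uid"  # hypothetical data source UID

# Refresh modes as stored in dashboard JSON:
# 0 = never, 1 = on dashboard load, 2 = on time range change
REFRESH_ON_TIME_RANGE_CHANGE = 2


def query_variable(name, query, multi=False, include_all=False):
    """A variable whose options come from a Prometheus label query."""
    return {
        "name": name,
        "type": "query",
        "datasource": {"type": "prometheus", "uid": PROM_UID},
        "query": query,
        "refresh": REFRESH_ON_TIME_RANGE_CHANGE,
        "multi": multi,
        "includeAll": include_all,
    }


def interval_variable(name, values, step_count=30):
    """An interval variable with the auto option enabled."""
    return {
        "name": name,
        "type": "interval",
        "query": ",".join(values),
        "auto": True,
        "auto_count": step_count,
    }


templating = {
    "list": [
        query_variable(
            "service",
            "label_values(http_requests_total, service)",
            multi=True,
            include_all=True,
        ),
        # Depends on $cluster, which is assumed to be defined elsewhere.
        query_variable(
            "namespace",
            'label_values(kube_pod_info{cluster="$cluster"}, namespace)',
        ),
        interval_variable("interval", ["1m", "5m", "15m", "30m", "1h"]),
    ]
}

if __name__ == "__main__":
    print(json.dumps(templating, indent=2))
```

Because `$service` is multi-value, panel queries need the regex matcher (`=~`) as in the query above; an equality matcher stops working once more than one value is selected.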

Transformations

| Transformation | Use Case |
|----------------|----------|
| Merge | Combine multiple queries into one table (e.g., metrics + labels from different sources) |
| Filter by value | Show only rows where error rate > threshold |
| Organize fields | Rename, reorder, or hide columns in table panels |
| Reduce | Collapse time series to single values (last, mean, max) for stat panels |
| Group by | Aggregate rows by a field (sum errors by service) |
| Join by field | SQL-like join between Prometheus and Loki query results |
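
To make the semantics concrete, here is a conceptual sketch of three of these transformations in plain Python. It is not Grafana's implementation and does not use any Grafana API; it only mirrors what Reduce, Group by, and an inner Join by field do to small in-memory data. All function names and sample values are hypothetical.

```python
from statistics import mean


def reduce_series(points, calc="last"):
    """Collapse [(timestamp, value), ...] to one number, as Reduce does for a stat panel."""
    values = [value for _, value in points]
    calcs = {"last": lambda vs: vs[-1], "mean": mean, "max": max}
    return calcs[calc](values)


def group_by_sum(rows, key, value):
    """Sum `value` per distinct `key`, e.g. errors by service."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


def join_by_field(left, right, field):
    """Inner-join two lists of row dicts on a shared field."""
    index = {row[field]: row for row in right}
    return [{**row, **index[row[field]]} for row in left if row[field] in index]


latency = [(1, 2.0), (2, 4.0), (3, 3.0)]
errors = [
    {"service": "api", "errors": 2},
    {"service": "api", "errors": 3},
    {"service": "web", "errors": 1},
]
metrics_rows = [{"service": "api", "error_rate": 0.02}, {"service": "web", "error_rate": 0.0}]
log_rows = [{"service": "api", "log_lines": 120}]

print(reduce_series(latency, "max"))                      # 4.0
print(group_by_sum(errors, "service", "errors"))          # {'api': 5, 'web': 1}
print(join_by_field(metrics_rows, log_rows, "service"))   # only the 'api' row survives
```

The join drops `web` because it has no matching row on the right-hand side; that is the inner-join behaviour to keep in mind when two data sources do not share exactly the same label values.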

Provisioning & Dashboard as Code

# provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''  # Leave empty so foldersFromFilesStructure can take effect
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true  # Subfolder = Grafana folder

# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: '15s'     # Scrape interval for $__rate_interval
      httpMethod: POST          # Allows longer queries
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo-uid
          matcherRegex: '"traceID":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'  # Links log lines to traces
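
Before relying on the derived field, it can help to check the `matcherRegex` against a real log line. The sketch below does this with Python's `re` module, which behaves the same as Grafana's matcher for a pattern this simple; treat that equivalence as an assumption for more complex patterns. The sample log lines and the trace ID are made up for illustration.

```python
import re

# Same pattern as the matcherRegex in the Loki derivedFields entry above.
MATCHER_REGEX = r'"traceID":"(\w+)"'


def extract_trace_id(log_line):
    """Return the first captured trace ID, or None if the pattern does not match."""
    match = re.search(MATCHER_REGEX, log_line)
    return match.group(1) if match else None


# Hypothetical JSON log lines.
compact = '{"level":"error","msg":"upstream timeout","traceID":"4bf92f3577b34da6a3ce929d0e0e4736"}'
spaced = '{"level": "error", "traceID": "4bf92f3577b34da6a3ce929d0e0e4736"}'

print(extract_trace_id(compact))  # 4bf92f3577b34da6a3ce929d0e0e4736
print(extract_trace_id(spaced))   # None
```

The second line shows how brittle the pattern is: a single space after the colon, or a differently named key, means no link is rendered. If your logger pretty-prints JSON, a pattern such as `"traceID":\s*"(\w+)"` may be a better fit.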

Cross-Pillar Correlations

Metrics → Logs → Traces: Grafana's power comes from correlating all three pillars in one view. Configure derived fields in Loki to extract trace IDs from log lines and link them to Tempo. Use exemplars in Prometheus to jump from a latency spike directly to a trace that was sampled during it. Build dashboards with side-by-side metrics and log panels that share the same time range and variables, so filtering by service narrows both at once.
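
A side-by-side metrics and logs row driven by one variable might be sketched as follows. This is an illustration under several assumptions: the data source UIDs are hypothetical, `http_request_duration_seconds_bucket` is a made-up histogram name, the Loki streams are assumed to carry a `service` label, and the `exemplar` flag on the Prometheus target only has an effect if the application emits exemplars and Prometheus stores them.

```python
PROM_DS = {"type": "prometheus", "uid": "prometheus-uid"}  # hypothetical UID
LOKI_DS = {"type": "loki", "uid": "loki-uid"}              # hypothetical UID


def correlated_row(service_var="$service", y=0):
    """Return a latency panel and a logs panel that sit side by side and share one variable."""
    latency_expr = (
        "histogram_quantile(0.99, sum by (le) ("
        'rate(http_request_duration_seconds_bucket{service=~"%s"}[$__rate_interval])))'
        % service_var
    )
    metrics = {
        "title": "p99 latency",
        "type": "timeseries",
        "datasource": PROM_DS,
        "gridPos": {"x": 0, "y": y, "w": 12, "h": 8},
        # exemplar=True asks Grafana to overlay exemplar points that link to traces
        "targets": [{"refId": "A", "expr": latency_expr, "exemplar": True}],
    }
    logs = {
        "title": "Logs",
        "type": "logs",
        "datasource": LOKI_DS,
        "gridPos": {"x": 12, "y": y, "w": 12, "h": 8},
        "targets": [{"refId": "A", "expr": '{service=~"%s"}' % service_var}],
    }
    return [metrics, logs]


metrics_panel, logs_panel = correlated_row()
```

Both panels inherit the dashboard time range by default, so zooming into a spike on the left narrows the log lines on the right to the same window.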

Production Best Practices

Checklist

Grafana Production Readiness

  1. Use provisioning (YAML files) for data sources and dashboards — never configure manually in production
  2. Use $__rate_interval instead of a hardcoded range such as [5m] in Prometheus rate() queries, so the window adapts to the scrape interval and the zoom level
  3. Use recording rules for expensive dashboard queries (never compute high-cardinality aggregations in Grafana)
  4. Keep dashboard definitions under version control and enable audit logging (audit logs are a Grafana Enterprise/Cloud feature)
  5. Set max data points per panel to prevent browser memory exhaustion
  6. Use folders and RBAC to organise dashboards by team/service
  7. Configure PostgreSQL as the Grafana database (not default SQLite) for HA
  8. Enable query caching for frequently viewed dashboards (a Grafana Enterprise/Cloud feature)
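
To see why item 2 matters, it helps to look at how `$__rate_interval` is derived. As documented for the Prometheus data source, it is the larger of `$__interval` plus the scrape interval and four times the scrape interval; the scrape interval comes from the data source's `timeInterval` setting. The function below is a small sketch of that formula, with a hypothetical name and values in seconds.

```python
def rate_interval(interval_s, scrape_interval_s):
    """Approximate $__rate_interval: max($__interval + scrape, 4 * scrape), in seconds."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: $__interval is small, so the floor of 4 scrapes applies.
print(rate_interval(interval_s=15, scrape_interval_s=15))   # 60

# Zoomed out: $__interval dominates, so no samples between points are skipped.
print(rate_interval(interval_s=120, scrape_interval_s=15))  # 135
```

A hardcoded `[5m]` over-smooths the zoomed-in case and under-samples very wide time ranges, while the computed window always spans at least four scrapes.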