Tool Deep Dive: Grafana Complete Guide

Dashboard Design Principles

Effective dashboards follow a visual hierarchy — the most important information at the top, progressive detail as you scroll down:

Row 1: Status overview — Stat panels showing current state (green/red), SLO compliance %, active alerts count
Row 2: Golden Signals — Time series panels for request rate, error rate, latency, saturation
Row 3: Service-specific — Panels unique to this service (queue depth, cache hit ratio, DB connections)
Row 4: Infrastructure — CPU, memory, disk, network utilisation for underlying resources
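The four-row hierarchy above can be sketched as a dashboard layout generator. This is a minimal illustration, assuming Grafana's 24-column grid and the `gridPos` fields (`h`, `w`, `x`, `y`) of the dashboard JSON model. The helper name `build_layout`, the panel titles, and the panel heights are hypothetical choices, not part of the original guide.

```python
GRID_WIDTH = 24  # Grafana lays panels out on a 24-column grid


def build_layout(rows, panel_height=8):
    """Lay out row headers and their panels top to bottom.

    `rows` is a list of (row_title, panel_type, panel_titles) tuples,
    ordered from most important to most detailed.
    """
    panels = []
    y = 0
    next_id = 1
    for row_title, panel_type, panel_titles in rows:
        # Row header: full width, one grid unit tall.
        panels.append({
            "id": next_id, "type": "row", "title": row_title, "collapsed": False,
            "gridPos": {"h": 1, "w": GRID_WIDTH, "x": 0, "y": y},
        })
        next_id += 1
        y += 1
        # Split the row evenly between its panels.
        width = GRID_WIDTH // len(panel_titles)
        for i, title in enumerate(panel_titles):
            panels.append({
                "id": next_id, "type": panel_type, "title": title,
                "gridPos": {"h": panel_height, "w": width, "x": i * width, "y": y},
            })
            next_id += 1
        y += panel_height
    return panels


# Hypothetical panel titles following the hierarchy described above.
ROWS = [
    ("Status overview", "stat", ["Current state", "SLO compliance %", "Active alerts"]),
    ("Golden Signals", "timeseries", ["Request rate", "Error rate", "Latency", "Saturation"]),
    ("Service-specific", "timeseries", ["Queue depth", "Cache hit ratio", "DB connections"]),
    ("Infrastructure", "timeseries", ["CPU", "Memory", "Disk", "Network"]),
]

panels = build_layout(ROWS)
dashboard = {"title": "Service overview (example)", "panels": panels}
```

Generating the layout from an ordered list keeps the "most important first" ordering explicit in code review, rather than relying on someone dragging panels into place.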

                            
The 5-Second Rule: A dashboard should communicate the system's health within 5 seconds of viewing. If someone needs to read query expressions or hover over data to understand whether the system is healthy, the dashboard needs redesigning.

Panel Types & When to Use Each

Panel	Best For	Not For
Time Series	Trends over time, rate metrics, latency	Current state, discrete categories
Stat	Single current value with colour thresholds	Showing trends or history
Gauge	Utilisation percentages (CPU, memory, disk)	Rates or counts
Bar Chart	Comparing categories (top endpoints, error counts by service)	Time-based data
Table	Multi-column data, alert details, top-N lists	Visual trend analysis
Heatmap	Distribution over time (histogram buckets, latency)	Simple metrics, gauges
Logs	Log lines shown alongside a related metrics panel	Aggregate data
Traces	Distributed trace waterfall view	Non-trace data
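To make the "Stat: single current value with colour thresholds" row concrete, here is a small sketch of how threshold steps map a value to a colour. It assumes the `thresholds` shape used in panel field config (a `mode` plus ordered `steps`, where the base step has a null value). The function `colour_for` and the SLO numbers are hypothetical, chosen only for illustration.

```python
# Threshold definition in the shape used by a panel's field config.
# The specific boundaries (99.0, 99.9) are illustrative assumptions.
SLO_THRESHOLDS = {
    "mode": "absolute",
    "steps": [
        {"color": "red", "value": None},     # base step: applies below every other bound
        {"color": "orange", "value": 99.0},
        {"color": "green", "value": 99.9},
    ],
}


def colour_for(value, steps):
    """Return the colour of the highest step whose lower bound the value meets.

    Steps are assumed to be sorted by ascending `value`, base step first.
    """
    colour = steps[0]["color"]
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            colour = step["color"]
    return colour


current_slo = 99.95
status_colour = colour_for(current_slo, SLO_THRESHOLDS["steps"])
```

A viewer then only has to read the colour, which is what makes a Stat panel suitable for the top status row.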

Variables & Templating

# Variable: $service — dynamically populated from Prometheus labels
# Type: Query
# Query: label_values(http_requests_total, service)
# Refresh: On time range change
# Multi-value: Enabled
# Include All option: Enabled

# Use in panel queries:
sum(rate(http_requests_total{service=~"$service"}[5m])) by (service)

# Variable: $namespace — cascading dependency on $cluster
# Query: label_values(kube_pod_info{cluster="$cluster"}, namespace)

# Variable: $interval (auto-scaled query interval; this is not the scrape interval)
# Type: Interval
# Values: 1m, 5m, 15m, 30m, 1h
# Auto: enabled, step count: 30
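The sketch below approximates what happens when a multi-value variable such as $service is substituted into a query that uses the `=~` regex matcher. It is a simplification under stated assumptions: the function `interpolate` is hypothetical, and Grafana's real escaping and formatting rules are more involved than `re.escape`.

```python
import re


def interpolate(query, variables):
    """Substitute $name placeholders in a query string.

    Lists are rendered as a regex alternation, e.g. (a|b), which is why
    the label matcher has to be =~ rather than = for multi-value variables.
    Unknown variables are left untouched.
    """
    def render(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)
        value = variables[name]
        if isinstance(value, (list, tuple)):
            return "(" + "|".join(re.escape(v) for v in value) + ")"
        return str(value)

    return re.sub(r"\$(\w+)", render, query)


panel_query = 'sum(rate(http_requests_total{service=~"$service"}[5m])) by (service)'
multi = interpolate(panel_query, {"service": ["checkout", "cart"]})

namespace_query = 'label_values(kube_pod_info{cluster="$cluster"}, namespace)'
single = interpolate(namespace_query, {"cluster": "prod"})
```

Note that the cascading $namespace query uses `cluster="$cluster"`, an exact match, so it only behaves as intended while $cluster holds a single value.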

Transformations

Transformation	Use Case
Merge	Combine multiple queries into one table (e.g., metrics + labels from different sources)
Filter by value	Show only rows where error rate > threshold
Organize fields	Rename, reorder, or hide columns in table panels
Reduce	Collapse time series to single values (last, mean, max) for stat panels
Group by	Aggregate rows by a field (sum errors by service)
Join by field	SQL-like join between Prometheus and Loki query results
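The semantics of three of these transformations (Reduce, Filter by value, Group by) can be sketched on plain Python data. This only illustrates what each one computes; it is not how Grafana implements them, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def reduce_series(points, calc="last"):
    """Collapse a list of (timestamp, value) points to a single number."""
    values = [value for _, value in points]
    calcs = {"last": lambda vs: vs[-1], "mean": mean, "max": max}
    return calcs[calc](values)


def filter_by_value(rows, field, threshold):
    """Keep only rows whose field exceeds the threshold."""
    return [row for row in rows if row[field] > threshold]


def group_by_sum(rows, key, field):
    """Sum a numeric field per distinct value of the key field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[field]
    return dict(totals)


# Illustrative sample data.
latency_points = [(0, 1.0), (60, 3.0), (120, 2.0)]
error_rows = [
    {"service": "api", "endpoint": "/pay", "errors": 4.0},
    {"service": "api", "endpoint": "/cart", "errors": 1.0},
    {"service": "web", "endpoint": "/", "errors": 1.0},
]

last_value = reduce_series(latency_points, "last")
noisy = filter_by_value(error_rows, "errors", 2.0)
errors_by_service = group_by_sum(error_rows, "service", "errors")
```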

Provisioning & Dashboard as Code

# provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''                # Leave empty so foldersFromFilesStructure can take effect
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true  # Subfolder = Grafana folder

# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: '15s'     # Scrape interval for $__rate_interval
      httpMethod: POST          # Allows longer queries
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo-uid
          matcherRegex: '"traceID":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'  # Links log lines to traces
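With a file provider like the one above, "dashboard as code" comes down to writing JSON files into the provisioned path, with each subdirectory mapping to a Grafana folder. The following is a minimal sketch, assuming a dashboard needs little more than a `uid`, a `title`, and a `panels` list. The helper `write_dashboard`, the folder name, and the dashboard contents are hypothetical, and a temporary directory stands in for /var/lib/grafana/dashboards.

```python
import json
import tempfile
from pathlib import Path


def write_dashboard(base_dir, folder, dashboard):
    """Write one dashboard JSON file under <base_dir>/<folder>/<uid>.json."""
    target_dir = Path(base_dir) / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{dashboard['uid']}.json"
    # Sorted keys and a trailing newline keep diffs stable in version control.
    target.write_text(json.dumps(dashboard, indent=2, sort_keys=True) + "\n")
    return target


# A temporary directory stands in for the provisioned dashboards path.
with tempfile.TemporaryDirectory() as base_dir:
    path = write_dashboard(
        base_dir,
        "payments",  # would appear as a Grafana folder named "payments"
        {"uid": "checkout-overview", "title": "Checkout overview", "panels": []},
    )
    relative = path.relative_to(base_dir).as_posix()
    loaded = json.loads(path.read_text())
```

Keeping one file per dashboard, named after its uid, makes it straightforward to review changes to a single dashboard in a pull request.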

Cross-Pillar Correlations

                            
Metrics → Logs → Traces: Grafana's power comes from correlating all three pillars in one view. Configure derived fields in Loki to extract trace IDs and link to Tempo. Use exemplars in Prometheus to jump from a latency spike directly to the trace that caused it. Build dashboards with side-by-side metrics and log panels sharing the same time range and variables.
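A derived field is essentially a regular expression applied to each log line, with the first capture group used as the link value. The sketch below applies the same matcherRegex as the Loki data source configuration to sample log lines. The log content and the function `extract_trace_id` are hypothetical; only the regex comes from the configuration above.

```python
import json
import re

# Same pattern as matcherRegex in the Loki data source configuration.
MATCHER_REGEX = r'"traceID":"(\w+)"'


def extract_trace_id(log_line):
    """Return the first captured trace ID, or None if the line has none."""
    match = re.search(MATCHER_REGEX, log_line)
    return match.group(1) if match else None


# Compact JSON (no space after the colon), which is what the regex expects.
compact_line = json.dumps(
    {"level": "error", "msg": "upstream timeout", "traceID": "4bf92f3577b34da6"},
    separators=(",", ":"),
)
# Default json.dumps output puts a space after the colon.
spaced_line = json.dumps({"traceID": "4bf92f3577b34da6"})

found = extract_trace_id(compact_line)
missed = extract_trace_id(spaced_line)
```

The second case shows a common pitfall: if the application logs JSON with a space after the colon, this regex matches nothing and no trace link appears, so the pattern should be checked against real log output.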
                        

Production Best Practices

Checklist

Grafana Production Readiness

Use provisioning (YAML files) for data sources and dashboards — never configure manually in production
Use $__rate_interval instead of a hardcoded [5m] range in Prometheus rate() queries
Use recording rules for expensive dashboard queries (avoid recomputing high-cardinality aggregations every time a dashboard loads)
Enable dashboard versioning and audit logging
Set max data points per panel to prevent browser memory exhaustion
Use folders and RBAC to organise dashboards by team/service
Configure PostgreSQL as the Grafana database (not the default SQLite) when running Grafana in HA
Enable caching for frequently viewed dashboards
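The recommendation to prefer $__rate_interval over a hardcoded range follows from how Grafana documents the variable: the larger of the query interval plus the scrape interval, and four times the scrape interval. A small sketch of that calculation, with a hypothetical function name and intervals expressed in seconds:

```python
def rate_interval(interval_s, scrape_interval_s):
    """Approximate $__rate_interval in seconds.

    interval_s:        the panel's calculated $__interval
    scrape_interval_s: the data source's configured scrape interval
                       (timeInterval in the provisioning file)
    """
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: a 15s step with a 15s scrape interval still yields a 60s window,
# so rate() always has several samples to work with.
zoomed_in = rate_interval(15, 15)

# Zoomed out: a 2m step with a 15s scrape interval widens the window to 135s,
# so no samples between steps are skipped.
zoomed_out = rate_interval(120, 15)
```

This is also why the data source's timeInterval should match the actual Prometheus scrape interval: the calculation depends on it.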

