Dashboard Design Principles
Effective dashboards follow a visual hierarchy — the most important information at the top, progressive detail as you scroll down:
- Row 1: Status overview — Stat panels showing current state (green/red), SLO compliance %, active alerts count
- Row 2: Golden Signals — Time series panels for request rate, error rate, latency, saturation
- Row 3: Service-specific — Panels unique to this service (queue depth, cache hit ratio, DB connections)
- Row 4: Infrastructure — CPU, memory, disk, network utilisation for underlying resources
The 5-Second Rule: A dashboard should communicate the system's health within 5 seconds of viewing. If someone needs to read query expressions or hover over data to understand whether the system is healthy, the dashboard needs redesigning.
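As a rough illustration of the four-row hierarchy, the sketch below lays panels out on Grafana's 24-column grid and emits a minimal dashboard JSON skeleton. The panel titles, row heights, and the `build_panels` helper are assumptions made for this example, not part of any Grafana API; only the `gridPos` fields (`h`, `w`, `x`, `y`) and the panel type names follow the dashboard JSON model.

```python
import json

GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide

# Hypothetical rows: (panel type, row height, panel titles), most important first.
ROWS = [
    ("stat", 4, ["Status", "SLO compliance %", "Active alerts"]),
    ("timeseries", 8, ["Request rate", "Error rate", "Latency", "Saturation"]),
    ("timeseries", 8, ["Queue depth", "Cache hit ratio", "DB connections"]),
    ("timeseries", 8, ["CPU", "Memory", "Disk", "Network"]),
]


def build_panels(rows):
    """Place each row's panels side by side, stacking rows top to bottom."""
    panels = []
    y = 0
    panel_id = 1
    for panel_type, height, titles in rows:
        width = GRID_WIDTH // len(titles)  # equal-width panels within a row
        for col, title in enumerate(titles):
            panels.append({
                "id": panel_id,
                "type": panel_type,
                "title": title,
                "gridPos": {"h": height, "w": width, "x": col * width, "y": y},
            })
            panel_id += 1
        y += height
    return panels


dashboard = {"title": "Service overview (sketch)", "panels": build_panels(ROWS)}

if __name__ == "__main__":
    print(json.dumps(dashboard, indent=2))
```

Keeping the stat row short (height 4 here) leaves the status overview and the golden signals visible without scrolling on most screens, which is what the 5-second rule asks for.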
Panel Types & When to Use Each
| Panel | Best For | Not For |
|---|---|---|
| Time Series | Trends over time, rate metrics, latency | Current state, discrete categories |
| Stat | Single current value with colour thresholds | Showing trends or history |
| Gauge | Utilisation percentages (CPU, memory, disk) | Rates or counts |
| Bar Chart | Comparing categories (top endpoints, error counts by service) | Time-based data |
| Table | Multi-column data, alert details, top-N lists | Visual trend analysis |
| Heatmap | Distribution over time (histogram buckets, latency) | Simple metrics, gauges |
| Logs | Log lines correlated with a metrics panel | Aggregate data |
| Traces | Distributed trace waterfall view | Non-trace data |
Variables & Templating
# Variable: $service — dynamically populated from Prometheus labels
# Type: Query
# Query: label_values(http_requests_total, service)
# Refresh: On time range change
# Multi-value: Enabled
# Include All option: Enabled
# Use in panel queries:
sum(rate(http_requests_total{service=~"$service"}[5m])) by (service)
# Variable: $namespace — cascading dependency on $cluster
# Query: label_values(kube_pod_info{cluster="$cluster"}, namespace)
# Variable: $interval, an auto-scaled aggregation window (not the scrape interval)
# Type: Interval
# Values: 1m, 5m, 15m, 30m, 1h
# Auto: enabled, step count: 30
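The panel query uses `=~` rather than `=` because a multi-value variable is expanded into a regex alternation before the query is sent to Prometheus. The sketch below approximates that expansion; `interpolate_regex` and `render_query` are hypothetical helpers, and Grafana's real escaping rules are more involved than `re.escape`.

```python
import re

# Panel query from the variable example above.
TEMPLATE = 'sum(rate(http_requests_total{service=~"$service"}[5m])) by (service)'


def interpolate_regex(values):
    """Approximate how selected values are rendered for a regex matcher."""
    escaped = [re.escape(value) for value in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def render_query(template, variables):
    """Substitute each $name placeholder with its interpolated value."""
    query = template
    for name, values in variables.items():
        query = query.replace("$" + name, interpolate_regex(values))
    return query


if __name__ == "__main__":
    # Hypothetical selection of two services in the dashboard dropdown.
    print(render_query(TEMPLATE, {"service": ["checkout", "payments"]}))
```

With an equality matcher (`service="$service"`), the same two-value selection would be compared as a literal string and match no series.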
Transformations
| Transformation | Use Case |
|---|---|
| Merge | Combine multiple queries into one table (e.g., metrics + labels from different sources) |
| Filter by value | Show only rows where error rate > threshold |
| Organize fields | Rename, reorder, or hide columns in table panels |
| Reduce | Collapse time series to single values (last, mean, max) for stat panels |
| Group by | Aggregate rows by a field (sum errors by service) |
| Join by field | SQL-like join between Prometheus and Loki query results |
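Transformations run in the browser on query results, after the data source has answered. As a conceptual sketch only (this is not Grafana code, and the series values are made up), Reduce followed by Filter by value behaves roughly like this:

```python
from statistics import mean

# Hypothetical per-service error-rate series (percent), oldest sample first.
SERIES = {
    "checkout": [0.5, 2.0, 4.5],
    "payments": [0.1, 0.2, 0.3],
}

CALCS = {"last": lambda values: values[-1], "mean": mean, "max": max}


def reduce_series(series, calc):
    """Collapse each series to one value, as Reduce does for a stat panel."""
    return {name: CALCS[calc](values) for name, values in series.items()}


def filter_by_value(rows, threshold):
    """Keep only rows whose value exceeds the threshold."""
    return {name: value for name, value in rows.items() if value > threshold}


if __name__ == "__main__":
    latest = reduce_series(SERIES, "last")
    print(filter_by_value(latest, 1.0))
```

Because this work happens client-side on every refresh, heavy aggregation is usually better pushed into the query or a recording rule, with transformations reserved for shaping the final table.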
Provisioning & Dashboard as Code
# provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: '' # Leave empty: a fixed folder name disables foldersFromFilesStructure
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true # Subfolder = Grafana folder
# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: '15s' # Scrape interval for $__rate_interval
      httpMethod: POST # Allows longer queries
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo-uid
          matcherRegex: '"traceID":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}' # Links log lines to traces
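A derived field only produces a link when its matcherRegex actually matches the log line, so it is worth checking the pattern against real log output before provisioning it. A minimal sketch, assuming JSON-formatted logs; the sample lines and the `extract_trace_id` helper are made up for illustration:

```python
import re

# Same pattern as matcherRegex in the Loki data source above.
MATCHER_REGEX = r'"traceID":"(\w+)"'


def extract_trace_id(log_line):
    """Return the captured trace ID, or None if the pattern does not match."""
    match = re.search(MATCHER_REGEX, log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    compact = '{"level":"error","msg":"timeout","traceID":"4bf92f3577b34da6"}'
    spaced = '{"level": "error", "traceID": "4bf92f3577b34da6"}'
    logfmt = 'level=error msg=timeout traceID=4bf92f3577b34da6'
    for line in (compact, spaced, logfmt):
        print(extract_trace_id(line))
```

Only the compact JSON line matches. A logger that emits a space after the colon, or logfmt output, needs a different pattern, otherwise the TraceID link silently never appears.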
Cross-Pillar Correlations
Metrics → Logs → Traces: Grafana's power comes from correlating all three pillars in one view. Configure derived fields in Loki to extract trace IDs and link to Tempo. Use exemplars in Prometheus to jump from a latency spike directly to the trace that caused it. Build dashboards with side-by-side metrics and log panels sharing the same time range and variables.
Production Best Practices
Grafana Production Readiness
- Use provisioning (YAML files) for data sources and dashboards — never configure manually in production
- Use $__rate_interval instead of a hardcoded [5m] in Prometheus queries
- Use recording rules for expensive dashboard queries (never compute high-cardinality aggregations in Grafana)
- Enable dashboard versioning and audit logging
- Set max data points per panel to prevent browser memory exhaustion
- Use folders and RBAC to organise dashboards by team/service
- Configure PostgreSQL as the Grafana database (not default SQLite) for HA
- Enable caching for frequently viewed dashboards
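To see why $__rate_interval is preferred over a fixed range, it helps to look at how it is derived. Grafana documents it as the larger of `$__interval` plus the scrape interval and four times the scrape interval. The sketch below reproduces that arithmetic; the function name and the 15-second default (taken from the timeInterval setting shown earlier) are assumptions for the example.

```python
def rate_interval(interval_s, scrape_interval_s=15):
    """Range in seconds for rate(): max($__interval + scrape, 4 * scrape)."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


if __name__ == "__main__":
    # Zoomed in: $__interval is small, so the 4x scrape-interval floor applies.
    print(rate_interval(15))
    # Zoomed out: the range grows with $__interval so no samples are skipped.
    print(rate_interval(300))
```

The floor guarantees that rate() always sees several samples per window, while the growing range at wide zoom levels avoids gaps between adjacent windows. A hardcoded [5m] gives neither property once the dashboard's step exceeds five minutes.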