Grafana Deep Dive Part 8: Displaying Data with Dashboards

Creating Your First Dashboard

Dashboards are the heart of Grafana — they transform raw telemetry data into actionable visual insights. A dashboard is a collection of panels arranged on a grid, each panel displaying a specific query result through a chosen visualization. Whether you’re monitoring infrastructure health, application performance, or business KPIs, the dashboard is where all your observability data comes together into a coherent narrative.

Dashboard Creation Workflow

flowchart LR
    A[Configure Data Source] --> B[Create Dashboard]
    B --> C[Add Panel]
    C --> D[Write Query]
    D --> E[Choose Visualization]
    E --> F[Configure Options]
    F --> G[Save Dashboard]
    G --> H{Need More Panels?}
    H -->|Yes| C
    H -->|No| I[Share / Publish]

Connecting Data Sources

Before creating any dashboard, you need at least one configured data source. Grafana supports over 150 data source plugins, but the most common for observability are Prometheus/Mimir (metrics), Loki (logs), and Tempo (traces). Each data source connection defines how Grafana communicates with your backend storage.

{
  "apiVersion": 1,
  "datasources": [
    {
      "name": "Mimir",
      "type": "prometheus",
      "uid": "mimir-prod",
      "url": "http://mimir-gateway.monitoring:8080/prometheus",
      "access": "proxy",
      "isDefault": true,
      "jsonData": {
        "httpMethod": "POST",
        "timeInterval": "15s",
        "exemplarTraceIdDestinations": [
          {
            "name": "traceID",
            "datasourceUid": "tempo-prod"
          }
        ]
      }
    },
    {
      "name": "Loki",
      "type": "loki",
      "uid": "loki-prod",
      "url": "http://loki-gateway.monitoring:3100",
      "access": "proxy",
      "jsonData": {
        "derivedFields": [
          {
            "matcherRegex": "traceID=(\\w+)",
            "name": "TraceID",
            "url": "$${__value.raw}",
            "datasourceUid": "tempo-prod"
          }
        ]
      }
    },
    {
      "name": "Tempo",
      "type": "tempo",
      "uid": "tempo-prod",
      "url": "http://tempo-gateway.monitoring:3200",
      "access": "proxy",
      "jsonData": {
        "tracesToLogsV2": {
          "datasourceUid": "loki-prod",
          "filterByTraceID": true
        },
        "tracesToMetrics": {
          "datasourceUid": "mimir-prod"
        },
        "serviceMap": {
          "datasourceUid": "mimir-prod"
        }
      }
    }
  ]
}
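Because the cross-references in this configuration are plain UID strings, a typo in one of them fails silently. The following is a minimal sketch, not part of Grafana itself, of a pre-deployment check that walks each data source's `jsonData` and reports any `datasourceUid` that no data source defines. The function names and the `tempo-typo` value are hypothetical, chosen for illustration.

```python
from typing import Any


def collect_uid_refs(node: Any) -> list[str]:
    """Recursively collect every string stored under a "datasourceUid" key."""
    refs: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "datasourceUid" and isinstance(value, str):
                refs.append(value)
            else:
                refs.extend(collect_uid_refs(value))
    elif isinstance(node, list):
        for item in node:
            refs.extend(collect_uid_refs(item))
    return refs


def find_dangling_refs(config: dict) -> dict[str, list[str]]:
    """Map each data source name to the UIDs it references that are not defined."""
    defined = {ds["uid"] for ds in config["datasources"]}
    dangling: dict[str, list[str]] = {}
    for ds in config["datasources"]:
        refs = collect_uid_refs(ds.get("jsonData", {}))
        missing = [uid for uid in refs if uid not in defined]
        if missing:
            dangling[ds["name"]] = missing
    return dangling


# Trimmed-down version of the configuration above, with one deliberate typo.
example = {
    "datasources": [
        {"name": "Mimir", "uid": "mimir-prod",
         "jsonData": {"exemplarTraceIdDestinations": [
             {"name": "traceID", "datasourceUid": "tempo-prod"}]}},
        {"name": "Loki", "uid": "loki-prod",
         "jsonData": {"derivedFields": [
             {"name": "TraceID", "datasourceUid": "tempo-typo"}]}},
        {"name": "Tempo", "uid": "tempo-prod",
         "jsonData": {"tracesToLogsV2": {"datasourceUid": "loki-prod"}}},
    ]
}

print(find_dangling_refs(example))  # {'Loki': ['tempo-typo']}
```

A check like this could run in CI before the provisioning files are deployed.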

                            
Key Insight: Always configure cross-references between data sources (exemplar links from metrics to traces, derived fields from logs to traces, traces-to-logs/metrics links). This enables seamless correlation across signals directly from your dashboard panels, which is the foundation of effective observability.

Panel Creation & Basic Queries

Creating a panel involves three fundamental steps: writing a query, selecting a visualization, and configuring display options. The query editor adapts based on your data source type — PromQL for Prometheus/Mimir, LogQL for Loki, and TraceQL for Tempo.

Here’s the JSON model of a basic panel querying CPU utilization:

{
  "title": "CPU Utilization by Instance",
  "type": "timeseries",
  "datasource": {
    "type": "prometheus",
    "uid": "mimir-prod"
  },
  "targets": [
    {
      "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
      "legendFormat": "{{instance}}",
      "refId": "A"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "min": 0,
      "max": 100,
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 70},
          {"color": "red", "value": 90}
        ]
      }
    }
  },
  "options": {
    "tooltip": {"mode": "multi"},
    "legend": {"displayMode": "table", "placement": "bottom", "calcs": ["mean", "max", "lastNotNull"]}
  }
}

Time Range Controls

Every dashboard has a global time range picker that controls the time window for all panels (unless individually overridden). Understanding time range controls is essential for effective troubleshooting:

Relative time ranges — now-1h, now-6h, now-24h, now-7d
Absolute time ranges — Exact start/end timestamps for incident forensics
Auto-refresh intervals — 5s, 10s, 30s, 1m, 5m for real-time monitoring
Time zone settings — Browser local, UTC, or specific timezone
Fiscal year quarters — For business-aligned time periods

Individual panels can override the dashboard time range using the Relative time field in the panel's query options, which is useful for showing “last 7 days” trends alongside real-time data.
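To make the relative time syntax concrete, here is a small sketch of how an expression such as `now-6h` resolves to an absolute timestamp. It is an illustration only and covers just the `now` and `now-<number><unit>` forms listed above; Grafana's own parser accepts more (for example rounding with `/d`), which this sketch deliberately rejects. The function name `resolve_relative` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Unit suffixes handled by this sketch, mapped to timedelta keyword arguments.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}
_PATTERN = re.compile(r"^now(?:-(\d+)([smhdw]))?$")


def resolve_relative(expr: str, now: datetime) -> datetime:
    """Resolve 'now' or 'now-<N><unit>' against a reference time."""
    match = _PATTERN.match(expr)
    if match is None:
        raise ValueError(f"unsupported time expression: {expr!r}")
    amount, unit = match.groups()
    if amount is None:  # plain "now"
        return now
    return now - timedelta(**{_UNITS[unit]: int(amount)})


reference = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
print(resolve_relative("now-6h", reference))  # 2024-01-08 06:00:00+00:00
print(resolve_relative("now-7d", reference))  # 2024-01-01 12:00:00+00:00
```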

Developing Your Dashboard Further

Once you have basic panels working, it’s time to organize and enrich your dashboard with structural elements, contextual annotations, and navigation links that transform a collection of panels into a coherent monitoring story.

Rows & Panel Organization

Rows are collapsible containers that group related panels. They provide visual hierarchy and allow users to focus on specific areas without being overwhelmed by information. A well-organized dashboard typically follows a top-down pattern: high-level overview at the top, detailed breakdowns below.

Recommended Dashboard Row Organization

flowchart TD
    subgraph "Row 1: Overview (always visible)"
        S1[Service Health Stat]
        S2[Error Rate Stat]
        S3[P99 Latency Stat]
        S4[Throughput Stat]
    end

    subgraph "Row 2: Traffic & Latency"
        T1[Request Rate Time Series]
        T2[Latency Distribution Heatmap]
    end

    subgraph "Row 3: Errors & Saturation"
        E1[Error Rate by Type]
        E2[Queue Depth / Saturation]
    end

    subgraph "Row 4: Resources (collapsed)"
        R1[CPU Usage]
        R2[Memory Usage]
        R3[Disk I/O]
        R4[Network I/O]
    end

    subgraph "Row 5: Logs (collapsed)"
        L1[Error Logs Panel]
    end

Panel sizing follows Grafana’s 24-column grid system. Common layouts include:

Full width (24 cols) — Logs panels, wide time series
Half width (12 cols) — Side-by-side comparisons
Third width (8 cols) — Three-panel rows
Quarter width (6 cols) — Stat panels in overview rows
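When dashboards are generated from code, the `gridPos` values for these layouts can be computed rather than typed by hand. The sketch below is one possible helper (the name `layout_row` is hypothetical) that places panels left to right on the 24-column grid and wraps to a new line when a panel would not fit.

```python
GRID_COLUMNS = 24  # width of Grafana's dashboard grid


def layout_row(widths: list[int], height: int, start_y: int = 0) -> list[dict]:
    """Assign gridPos values left to right, wrapping when a panel overflows."""
    positions = []
    x, y = 0, start_y
    for width in widths:
        if not 1 <= width <= GRID_COLUMNS:
            raise ValueError(f"panel width must be 1..{GRID_COLUMNS}, got {width}")
        if x + width > GRID_COLUMNS:  # no room left on this line
            x = 0
            y += height
        positions.append({"h": height, "w": width, "x": x, "y": y})
        x += width
    return positions


# Four quarter-width stat panels for an overview row.
for pos in layout_row([6, 6, 6, 6], height=4):
    print(pos)
```

Each returned dictionary can be assigned directly to a panel's `gridPos` field.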

Annotations

Annotations overlay contextual markers on time series panels — deployments, incidents, configuration changes, or any event that might correlate with metric changes. They’re invaluable for incident correlation: “Did the latency spike coincide with that deployment?”

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": {"type": "prometheus", "uid": "mimir-prod"},
        "enable": true,
        "iconColor": "blue",
        "expr": "changes(kube_deployment_status_observed_generation{namespace=\"production\"}[1m]) > 0",
        "titleFormat": "Deploy: {{deployment}}",
        "tagKeys": "namespace,deployment"
      },
      {
        "name": "Alerts",
        "datasource": {"type": "grafana", "uid": "-- Grafana --"},
        "enable": true,
        "iconColor": "red",
        "type": "alert"
      },
      {
        "name": "Incidents",
        "datasource": {"type": "loki", "uid": "loki-prod"},
        "enable": false,
        "iconColor": "orange",
        "expr": "{app=\"incident-bot\"} |= \"incident created\"",
        "titleFormat": "Incident"
      }
    ]
  }
}

Dashboard Links

Dashboard links create navigation pathways between related dashboards, enabling drill-down workflows. Grafana supports three types of links:

Dashboard links — Navigate to another dashboard, optionally passing current variable values and time range
URL links — Link to external systems (runbooks, wikis, incident management)
Data links — Panel-level links that pass the clicked data point’s value as a parameter

{
  "links": [
    {
      "title": "Service Detail",
      "type": "link",
      "url": "/d/service-detail/service-detail?var-service=${service}&from=${__from}&to=${__to}",
      "tooltip": "Drill into service-specific metrics",
      "icon": "external link"
    },
    {
      "title": "Related Dashboards",
      "type": "dashboards",
      "tags": ["production", "microservices"],
      "tooltip": "All production dashboards"
    }
  ]
}
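The first link above passes a variable and the time range through the query string. As a sketch of what such a URL looks like once values are filled in, the helper below builds it with the standard library. It assumes the usual Grafana convention that each variable is sent as a `var-<name>` parameter and that a multi-value variable repeats the parameter once per value; the function name `dashboard_link` is hypothetical.

```python
from urllib.parse import urlencode


def dashboard_link(uid: str, slug: str, variables: dict[str, list[str]],
                   time_from: str, time_to: str) -> str:
    """Build a dashboard URL carrying variable values and the time range."""
    params: list[tuple[str, str]] = []
    for name, values in variables.items():
        for value in values:  # multi-value variables repeat the parameter
            params.append((f"var-{name}", value))
    params.append(("from", time_from))
    params.append(("to", time_to))
    return f"/d/{uid}/{slug}?{urlencode(params)}"


url = dashboard_link("service-detail", "service-detail",
                     {"service": ["cart", "order svc"]}, "now-1h", "now")
print(url)
# /d/service-detail/service-detail?var-service=cart&var-service=order+svc&from=now-1h&to=now
```

Using `urlencode` keeps values with spaces or special characters from breaking the link.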

Using Visualizations in Grafana

Choosing the right visualization is critical — each panel type is optimized for specific data patterns and user questions. Grafana ships with 16+ built-in visualization types, each serving distinct analytical purposes.

Time Series

The time series panel is the most commonly used visualization in Grafana. It displays metric data points over time with configurable line styles, fill opacity, gradient modes, point visibility, and stacking options. It supports multiple Y-axes, series overrides, and tooltip modes (single, multi, hidden).

Best used for: CPU/memory usage trends, request rates, latency percentiles, any metric evolving over time.

Time Series Tips

Use gradient fill for single-series panels to emphasize magnitude
Enable stacking (normal or percent) to show composition over time
Set connect null values to handle gaps from scrape failures
Use exemplars overlay to link metric spikes to specific traces
Configure legend as table with calc values (mean, max, current) for quick reference

Stat, Gauge & Bar Chart

The stat panel shows a single numeric value with optional sparkline, color-coded by thresholds. Ideal for KPI overview rows (uptime %, current error rate, total requests). The gauge adds a visual indicator of where the current value falls within a defined range — perfect for resource utilization (0–100%). The bar chart displays categorical data as vertical or horizontal bars with grouping and stacking support.

Stat — Current service count, total errors today, uptime percentage
Gauge — CPU utilization, memory pressure, disk fullness
Bar chart — Top 10 endpoints by request count, error distribution by service

Table & Heatmap

The table visualization presents data in row/column format with sorting, filtering, cell coloring, and link support. It excels at displaying multi-dimensional data or inventory-style views (list of pods with their status, services with their SLO compliance).

The heatmap visualizes distribution over time — each cell represents a bucket of values for a time interval, colored by density. It’s the ideal choice for latency distributions (replacing percentile lines with full distribution visibility) and reveals patterns that percentiles hide.

{
  "title": "Request Latency Distribution",
  "type": "heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le)",
      "format": "heatmap",
      "legendFormat": "{{le}}"
    }
  ],
  "options": {
    "calculate": false,
    "yAxis": {"unit": "s", "reverse": false},
    "color": {"scheme": "Spectral", "steps": 64},
    "cellGap": 1,
    "tooltip": {"show": true}
  }
}
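Prometheus classic histogram buckets are cumulative: the `le="0.5"` series also counts everything in `le="0.1"`. A heatmap cell, by contrast, represents only the values that fall between two neighbouring bounds. The sketch below shows that de-accumulation step conceptually, with made-up numbers; it is not Grafana's implementation, and it assumes the lowest bucket starts at zero.

```python
import math


def per_bucket_rates(cumulative: dict[float, float]) -> list[tuple[float, float, float]]:
    """Convert cumulative `le` buckets into (lower, upper, rate) per bucket."""
    result = []
    prev_bound, prev_rate = 0.0, 0.0  # assumption: observations are non-negative
    for bound in sorted(cumulative):
        rate = cumulative[bound]
        result.append((prev_bound, bound, rate - prev_rate))
        prev_bound, prev_rate = bound, rate
    return result


# Illustrative per-second rates for one time interval, keyed by `le`.
sample = {0.1: 50.0, 0.5: 90.0, 1.0: 100.0, math.inf: 100.0}
for lower, upper, rate in per_bucket_rates(sample):
    print(f"{lower} to {upper}: {rate}")
```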

Geomap, Logs & Traces

The geomap panel renders data points on a world map. Use it for visualizing request origins, CDN performance by region, or infrastructure distribution across availability zones. It supports multiple layer types: markers, heatmap overlay, and route layers.

The logs panel integrates directly with Loki, displaying log lines with syntax highlighting, log level detection, and expandable log details. Combined with the derived fields configuration, clicking a trace ID in a log line navigates directly to the trace view.

The traces panel renders distributed traces from Tempo as waterfall diagrams showing span hierarchies, durations, and service boundaries. It supports filtering by span attributes and duration thresholds directly within the panel.

Flame Graph, Node Graph & Canvas

The flame graph panel visualizes profiling data from Pyroscope, showing function call hierarchies with CPU time or memory allocation. It enables developers to identify hot code paths directly from dashboards without switching tools.

The node graph panel displays service topology maps — nodes represent services and edges represent connections with metrics (request rate, error rate, latency). This is Grafana’s service map visualization, powered by Tempo’s service graph metrics.

The canvas panel provides a free-form layout where elements (icons, text, metric values) can be placed at arbitrary positions. It’s used for custom diagrams, architecture overviews, or floor plans with real-time data bindings.

                            
Canvas Use Cases: Network topology diagrams with live throughput overlays, data center floor plans showing server health, custom architectural views of your specific system with real-time metrics bound to each component.

Histogram, Pie Chart & State Timeline

The histogram panel shows the distribution of values as a bar chart with configurable bucket sizes — useful for understanding value distributions at a glance (response size distribution, batch job duration spread).

The pie chart shows proportional relationships — traffic split across services, error distribution by type, resource allocation by team. Use sparingly; tables or bar charts often communicate the same data more effectively.

The state timeline panel displays discrete states over time as colored bands — perfect for showing service health transitions (healthy → degraded → down → recovered), deployment rollout progress, or feature flag changes. Each state maps to a color for immediate visual pattern recognition.

Developing a Dashboard Purpose

Effective dashboards answer specific questions rather than displaying every available metric. Industry-proven methodologies provide frameworks for what to monitor and how to organize it. The choice of methodology depends on what you’re monitoring: resources (USE), services (RED), or SRE practices (Golden Signals).

Monitoring Methodology Selection

flowchart TD
    Q{What are you monitoring?}
    Q -->|Infrastructure Resources| USE[USE Method]
    Q -->|Request-Driven Services| RED[RED Method]
    Q -->|SRE Practice| GS[Golden Signals]
    Q -->|Business Outcomes| BM[Business Metrics]

    USE --> U1[Utilization]
    USE --> U2[Saturation]
    USE --> U3[Errors]

    RED --> R1[Rate]
    RED --> R2[Errors]
    RED --> R3[Duration]

    GS --> G1[Latency]
    GS --> G2[Traffic]
    GS --> G3[Errors]
    GS --> G4[Saturation]

    BM --> B1[Revenue Impact]
    BM --> B2[User Experience]
    BM --> B3[Conversion Rates]

USE Method Dashboard

The USE Method (Utilization, Saturation, Errors), developed by Brendan Gregg, targets infrastructure resources — CPUs, memory, disks, network interfaces, and any component with a capacity limit.

USE Method Metrics

Resource	Utilization	Saturation	Errors
CPU	`node_cpu_seconds_total` (idle complement)	`node_load1` / CPU count	Machine check exceptions
Memory	1 - `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes`	Swap usage, OOM kills	ECC errors
Disk	1 - `node_filesystem_avail_bytes` / `node_filesystem_size_bytes`	I/O queue depth (`node_disk_io_now`)	`node_disk_io_errors`
Network	Bandwidth utilization %	TCP retransmits, queue drops	`node_network_receive_errs_total`

RED Method Dashboard

The RED Method (Rate, Errors, Duration) focuses on request-driven services. Created by Tom Wilkie, it answers: “Is the service working?” For every service in your system, monitor these three signals:

Rate — Requests per second (rate(http_requests_total[5m]))
Errors — Failed requests per second (rate(http_requests_total{status=~"5.."}[5m]))
Duration — Latency distribution (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))

{
  "title": "RED Dashboard - Service Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
      "targets": [{"expr": "sum(rate(http_requests_total{service=\"$service\"}[$__rate_interval]))"}]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
      "targets": [{"expr": "sum(rate(http_requests_total{service=\"$service\",status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{service=\"$service\"}[$__rate_interval])) * 100"}],
      "fieldConfig": {"defaults": {"unit": "percent"}}
    },
    {
      "title": "Latency (p50 / p95 / p99)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
      "targets": [
        {"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le))", "legendFormat": "p50"},
        {"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le))", "legendFormat": "p95"},
        {"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le))", "legendFormat": "p99"}
      ],
      "fieldConfig": {"defaults": {"unit": "s"}}
    }
  ]
}
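The latency panel relies on `histogram_quantile`, which estimates a percentile from bucket counts rather than from raw observations. The sketch below reproduces the core idea for classic histograms, assuming linear interpolation inside the bucket that contains the target rank. It is a simplified illustration, not the Prometheus source, and the sample bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("a +Inf bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumption: lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


sample = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, sample))  # approximately 0.75
```

The interpolation explains why percentile accuracy depends on bucket boundaries: with the sample above, any p95 between 0.5s and 1.0s is an estimate, not a measurement.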

Golden Signals Dashboard

Google’s Four Golden Signals from the SRE book combine elements of both USE and RED: Latency, Traffic, Errors, and Saturation. This methodology works for any type of system and is the recommended starting point for teams adopting SRE practices.

Latency — Time to serve a request (distinguish successful vs. failed request latency)
Traffic — Demand on the system (requests/sec, sessions, transactions)
Errors — Rate of failed requests (explicit 5xx, implicit policy violations)
Saturation — How “full” the service is (queue depth, memory pressure, thread pool exhaustion)
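The advice in the Latency item, to separate successful from failed requests, is easy to see with numbers. The following sketch uses invented samples in which failures return almost instantly while successful requests are slow; the blended mean looks acceptable even though neither class is healthy. The helper name `mean_latency_by_class` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_class(samples: list[tuple[int, float]]) -> dict[str, float]:
    """Group (status_code, latency_seconds) samples and average each class."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for status, latency in samples:
        key = "failed" if status >= 500 else "successful"
        grouped[key].append(latency)
    return {cls: mean(values) for cls, values in grouped.items()}


# Invented data: 4 slow successes, 6 fast failures.
samples = [(200, 1.8), (200, 2.1), (200, 2.4), (200, 1.9),
           (503, 0.02), (503, 0.03), (503, 0.02),
           (503, 0.01), (503, 0.02), (503, 0.03)]

overall = mean(latency for _, latency in samples)
by_class = mean_latency_by_class(samples)
print(round(overall, 3))                 # 0.833
print(round(by_class["successful"], 2))  # 2.05
print(round(by_class["failed"], 3))      # 0.022
```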

                            
Critical Distinction: Track latency separately for successful and failed requests. A service returning errors quickly (low latency) may mask high latency for successful requests. An averaging approach hides both problems, so always use separate series or percentile breakdowns by status class.

Business Metrics Dashboard

Technical metrics tell you what is broken; business metrics tell you why it matters. A business metrics dashboard bridges the gap between engineering and stakeholders by showing revenue impact, user experience scores, and conversion funnels alongside the technical signals that affect them.

Common business metrics to display:

Orders per minute — Direct revenue indicator
Cart abandonment rate — Correlated with latency spikes
Active users / sessions — Traffic indicator in business terms
Payment success rate — Critical path health
Search result relevance — Product experience quality
Feature adoption rates — New feature rollout health

Advanced Dashboard Techniques

Moving beyond basic panels, Grafana’s advanced features enable dynamic, reusable dashboards that adapt to different environments, services, and time windows without manual editing.

Variables & Templating

Variables are the foundation of reusable dashboards. Instead of hardcoding label values in queries, variables create dropdown menus that dynamically filter all panels. Grafana supports several variable types:

Variable Types

Type	Source	Example
Query	Data source query	`label_values(up, instance)`
Custom	Manual comma-separated list	`production, staging, development`
Interval	Time interval options	`1m, 5m, 15m, 1h`
Data source	Available data sources by type	All Prometheus data sources
Text box	Free-form user input	Custom filter string
Constant	Hidden fixed value	Provisioned environment name

Chained variables create cascading filters where one variable’s selection constrains the next. For example, selecting a namespace filters the service dropdown to only show services in that namespace:

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 2,
        "sort": 1
      },
      {
        "name": "service",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, created_by_name)",
        "refresh": 2,
        "sort": 1,
        "multi": true,
        "includeAll": true
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\", created_by_name=~\"$service\"}, pod)",
        "refresh": 2,
        "sort": 1,
        "multi": true,
        "includeAll": true
      }
    ]
  }
}
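The cascade in this configuration can be pictured as successive filters over the label sets of `kube_pod_info`. The sketch below imitates that behaviour in memory; the `label_values` function here is a hypothetical stand-in for the Prometheus variable query of the same name, and the series data is invented.

```python
def label_values(series: list[dict[str, str]], label: str, **matchers: str) -> list[str]:
    """Return the sorted distinct values of `label` among matching series."""
    values = {
        s[label]
        for s in series
        if label in s and all(s.get(k) == v for k, v in matchers.items())
    }
    return sorted(values)


# Invented label sets standing in for kube_pod_info series.
series = [
    {"namespace": "production", "created_by_name": "cart", "pod": "cart-7f9c"},
    {"namespace": "production", "created_by_name": "orders", "pod": "orders-5d2a"},
    {"namespace": "staging", "created_by_name": "cart", "pod": "cart-1b3e"},
]

namespaces = label_values(series, "namespace")
services = label_values(series, "created_by_name", namespace="production")
pods = label_values(series, "pod", namespace="production", created_by_name="cart")
print(namespaces)  # ['production', 'staging']
print(services)    # ['cart', 'orders']
print(pods)        # ['cart-7f9c']
```

Each dropdown only offers values that exist under the selections made before it, which is what keeps chained variables from producing empty panels.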

Built-in variables provide automatic context:

$__interval — Automatically calculated based on time range and panel width (prevents over/under-sampling)
$__rate_interval — Safe interval for rate() functions (at least 4x scrape interval)
$__from / $__to — Current time range boundaries in epoch milliseconds
$__range — Duration of the current time range (e.g., 1h)
${__dashboard.uid} — Current dashboard UID for self-referencing links
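The two interval variables are derived values, and a rough calculation helps when reasoning about query resolution. The sketch below assumes the commonly documented rule that `$__rate_interval` is the larger of `$__interval` plus the scrape interval and four times the scrape interval. The `$__interval` approximation divides the time range by the panel's maximum data points and applies a minimum; Grafana additionally rounds to a convenient step, which is omitted here. Function names are hypothetical.

```python
def approx_interval(range_seconds: float, max_data_points: int,
                    min_interval_seconds: float) -> float:
    """Rough $__interval: time range per data point, floored at a minimum."""
    return max(range_seconds / max_data_points, min_interval_seconds)


def rate_interval(interval_seconds: float, scrape_interval_seconds: float) -> float:
    """$__rate_interval under the assumed rule described in the text above."""
    return max(interval_seconds + scrape_interval_seconds,
               4 * scrape_interval_seconds)


# One-hour range, 1200 data points, 15s scrape interval.
step = approx_interval(3600, 1200, 15)
print(step)                     # 15
print(rate_interval(step, 15))  # 60

# Seven-day range with the same panel: the step grows, and so does the window.
step = approx_interval(7 * 24 * 3600, 1200, 15)
print(step)                     # 504.0
print(rate_interval(step, 15))  # 519.0
```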

Transformations

Transformations process query results before visualization — enabling calculations, joins, filtering, and restructuring that would be difficult or impossible in the query language alone. They’re applied in sequence, forming a pipeline.

Common Transformations

Merge — Combine multiple queries into a single table (useful for joining metrics from different sources)
Filter by value — Show only rows matching a condition (e.g., error rate > 1%)
Calculate field — Create new fields using math (Field A / Field B * 100), binary operations, or reduce functions
Group by — Aggregate rows by a field with sum, mean, min, max, count, first, last
Sort by — Order results by any field ascending or descending
Join by field — SQL-style inner/outer join of multiple queries on a shared key
Series to rows — Convert multiple time series into a table format
Organize fields — Rename, reorder, or hide specific columns
Reduce — Collapse time series into single values (sum, mean, max, range)

Example: Creating a “Top Services by Error Rate” table using transformations:

{
  "title": "Top Services by Error Rate",
  "type": "table",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service)",
      "format": "table",
      "legendFormat": "{{service}}",
      "refId": "errors",
      "instant": true
    },
    {
      "expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)",
      "format": "table",
      "legendFormat": "{{service}}",
      "refId": "total",
      "instant": true
    }
  ],
  "transformations": [
    {"id": "merge", "options": {}},
    {
      "id": "calculateField",
      "options": {
        "mode": "binary",
        "reduce": {"reducer": "sum"},
        "binary": {"left": "Value #errors", "operator": "/", "right": "Value #total"},
        "alias": "Error Rate"
      }
    },
    {"id": "sortBy", "options": {"sort": [{"field": "Error Rate", "desc": true}]}},
    {"id": "filterByValue", "options": {"filters": [{"fieldName": "Error Rate", "config": {"id": "greater", "options": {"value": 0.001}}}], "type": "include", "match": "any"}}
  ]
}
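To show what this pipeline does to the data, the sketch below reimplements the four steps (merge, calculate field, sort, filter) on plain Python dictionaries. It mirrors the intent of the transformations rather than Grafana's internal data frame logic, and the sample rates are invented.

```python
def merge(errors: dict[str, float], total: dict[str, float]) -> list[dict]:
    """Join the two query results on service; missing error rates become 0."""
    return [{"service": s, "errors": errors.get(s, 0.0), "total": total[s]}
            for s in total]


def add_error_rate(rows: list[dict]) -> list[dict]:
    """Binary operation errors / total, guarding against zero traffic."""
    return [{**r, "Error Rate": r["errors"] / r["total"] if r["total"] else 0.0}
            for r in rows]


def sort_desc(rows: list[dict], field: str) -> list[dict]:
    return sorted(rows, key=lambda r: r[field], reverse=True)


def filter_greater(rows: list[dict], field: str, threshold: float) -> list[dict]:
    return [r for r in rows if r[field] > threshold]


errors = {"cart": 0.5, "orders": 0.0004}
total = {"cart": 10.0, "orders": 8.0, "users": 4.0}

# Transformations run in sequence, each consuming the previous output.
rows = merge(errors, total)
rows = add_error_rate(rows)
rows = sort_desc(rows, "Error Rate")
rows = filter_greater(rows, "Error Rate", 0.001)
print([(r["service"], r["Error Rate"]) for r in rows])  # [('cart', 0.05)]
```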

Mixed Data Sources in a Single Panel

Grafana supports querying multiple data sources within a single panel using the Mixed data source. This enables powerful correlations, such as plotting error-log rates from Loki alongside Prometheus metrics, or comparing CloudWatch metrics with self-hosted Mimir data in the same time series panel.

To use mixed data sources:

Select -- Mixed -- as the panel data source
Each query target independently selects its own data source
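Each query returns its own series, and all series are drawn on the panel's shared time axis (use transformations if the results need to be joined)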

                            
Pro Tip: Mixed data sources are particularly powerful for SLO dashboards that combine: application metrics (Prometheus) for error budget calculation, log counts (Loki) for error categorization, and external uptime monitors (CloudWatch Synthetics) for customer-facing availability, all in one unified view.

Links & Drilldowns

Effective dashboards form a navigation hierarchy. Data links on panels enable context-sensitive drill-downs — clicking a specific service in a table navigates to that service’s detail dashboard with all variables pre-populated.

{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "View Service Detail",
          "url": "/d/service-detail?var-service=${__data.fields.service}&from=${__from}&to=${__to}",
          "targetBlank": false
        },
        {
          "title": "View Traces",
          "url": "/explore?left={\"datasource\":\"tempo-prod\",\"queries\":[{\"queryType\":\"traceqlSearch\",\"filters\":[{\"id\":\"service-name\",\"value\":[\"${__data.fields.service}\"]}]}]}&from=${__from}&to=${__to}",
          "targetBlank": true
        },
        {
          "title": "View Logs",
          "url": "/explore?left={\"datasource\":\"loki-prod\",\"queries\":[{\"expr\":\"{service=\\\"${__data.fields.service}\\\"}|=\\\"error\\\"\"}]}&from=${__from}&to=${__to}",
          "targetBlank": true
        }
      ]
    }
  }
}

The drill-down pattern typically follows: Overview Dashboard (all services) → Service Dashboard (single service detail) → Instance Dashboard (single pod/container) → Explore (ad-hoc investigation).

Panel Overrides & Thresholds

Thresholds define color boundaries for values — green below 70%, yellow at 70-90%, red above 90%. They apply to stat panels (background color), gauges (arc color), time series (fill/line color), and tables (cell coloring).
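In absolute mode, a threshold list is read as "use the color of the highest step whose value the data point has reached", with the first step (value `null`) acting as the base color. A minimal sketch of that lookup, assuming the steps are sorted in ascending order as in the panel JSON earlier in this article:

```python
from typing import Optional


def threshold_color(steps: list[dict], value: float) -> str:
    """Return the color of the last step whose threshold `value` has reached."""
    color = steps[0]["color"]  # base step, conventionally with value None
    for step in steps:
        bound: Optional[float] = step["value"]
        if bound is None or value >= bound:
            color = step["color"]
    return color


steps = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 90},
]

for sample in (50, 70, 95):
    print(sample, threshold_color(steps, sample))
# 50 green
# 70 yellow
# 95 red
```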

Field overrides allow per-series or per-field customization within a panel. You can override colors, units, display names, axis placement, thresholds, and visualization options for specific fields matched by name, by regex, by field type, or by the query that returned them:

{
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": {"mode": "palette-classic"}
    },
    "overrides": [
      {
        "matcher": {"id": "byName", "options": "errors"},
        "properties": [
          {"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}},
          {"id": "custom.axisPlacement", "value": "right"},
          {"id": "unit", "value": "percentunit"},
          {"id": "custom.fillOpacity", "value": 10}
        ]
      },
      {
        "matcher": {"id": "byRegexp", "options": "/p99|p95/"},
        "properties": [
          {"id": "custom.lineStyle", "value": {"fill": "dash", "dash": [10, 5]}},
          {"id": "unit", "value": "s"}
        ]
      }
    ]
  }
}
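The way defaults and overrides combine can be sketched as a small resolver: start from the defaults, then apply the properties of every override whose matcher selects the field, in order. This is a simplified model that handles only the `byName` and `byRegexp` matchers used above and keeps property ids such as `custom.axisPlacement` as flat keys; it is not Grafana's implementation.

```python
import re


def matches(matcher: dict, field_name: str) -> bool:
    """Evaluate the two matcher kinds used in the example above."""
    if matcher["id"] == "byName":
        return field_name == matcher["options"]
    if matcher["id"] == "byRegexp":
        pattern = matcher["options"]
        # Strip the surrounding slashes of the /pattern/ notation.
        if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
            pattern = pattern[1:-1]
        return re.search(pattern, field_name) is not None
    return False


def resolve_field_config(field_config: dict, field_name: str) -> dict:
    """Defaults first, then each matching override's properties in order."""
    resolved = dict(field_config.get("defaults", {}))
    for override in field_config.get("overrides", []):
        if matches(override["matcher"], field_name):
            for prop in override["properties"]:
                resolved[prop["id"]] = prop["value"]
    return resolved


field_config = {
    "defaults": {"unit": "reqps"},
    "overrides": [
        {"matcher": {"id": "byName", "options": "errors"},
         "properties": [{"id": "unit", "value": "percentunit"},
                        {"id": "custom.axisPlacement", "value": "right"}]},
        {"matcher": {"id": "byRegexp", "options": "/p99|p95/"},
         "properties": [{"id": "unit", "value": "s"}]},
    ],
}

print(resolve_field_config(field_config, "errors")["unit"])    # percentunit
print(resolve_field_config(field_config, "p99")["unit"])       # s
print(resolve_field_config(field_config, "requests")["unit"])  # reqps
```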

Managing & Organizing Dashboards

As your dashboard collection grows, organization becomes critical. Grafana provides several mechanisms for managing dashboards at scale across teams and environments.

Folders & Permissions

Folders group related dashboards and provide access control boundaries. A common organizational pattern:

Infrastructure/ — Node, network, storage dashboards (SRE team)
Platform/ — Kubernetes, service mesh, message queues (Platform team)
Services/ — Per-service RED dashboards (Development teams)
Business/ — Revenue, user experience, SLO dashboards (Leadership)
Alerts/ — Alert-specific investigation dashboards (On-call)

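Permissions can be set at the folder level (inherited by all dashboards within) or overridden per dashboard. Each permission grants View, Edit, or Admin access and can be assigned to an organization role (Viewer, Editor, Admin), a team, or an individual user. Folder permissions also govern who can manage the alert rules stored in that folder.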

Playlists & Snapshots

Playlists cycle through multiple dashboards automatically, which makes them ideal for wall-mounted NOC displays. Configure rotation intervals (30s–5m) and select dashboards by tag or manual selection. Playlists can be started in kiosk mode, which hides navigation for a clean display.

Snapshots capture a dashboard’s current state (including data) as a static, shareable artifact. They’re invaluable for:

Sharing incident evidence with stakeholders who lack Grafana access
Preserving dashboard state at a specific point during a postmortem
Creating reports without requiring live data source connectivity

Library Panels

Library panels are reusable panel definitions shared across multiple dashboards. When you update a library panel, the change propagates to every dashboard that uses it. This is essential for maintaining consistency — a standardized “Service Health” panel used across 50 service dashboards should look and behave identically.

{
  "name": "Standard Service Health",
  "type": "stat",
  "model": {
    "targets": [
      {
        "expr": "sum(up{service=\"$service\"}) / count(up{service=\"$service\"}) * 100",
        "legendFormat": "Health"
      }
    ],
    "fieldConfig": {
      "defaults": {
        "unit": "percent",
        "min": 0,
        "max": 100,
        "thresholds": {
          "steps": [
            {"color": "red", "value": null},
            {"color": "yellow", "value": 95},
            {"color": "green", "value": 99}
          ]
        },
        "mappings": [
          {"type": "range", "options": {"from": 99, "to": 100, "result": {"text": "Healthy", "color": "green"}}},
          {"type": "range", "options": {"from": 95, "to": 99, "result": {"text": "Degraded", "color": "yellow"}}},
          {"type": "range", "options": {"from": 0, "to": 95, "result": {"text": "Critical", "color": "red"}}}
        ]
      }
    }
  }
}

                            
Library Panel Strategy: Create library panels for standardized components that appear across multiple dashboards: service health indicators, SLO burn rate gauges, resource utilization gauges, and annotation queries. This ensures consistent thresholds, units, and coloring across your entire monitoring estate.

Case Study: Building an Overall System View

Let’s bring everything together by building a comprehensive system overview dashboard for a microservices e-commerce platform. This dashboard serves as the “single pane of glass” for on-call engineers, answering: “Is the system healthy? If not, where should I look?”

Application Architecture

E-Commerce Platform Architecture

flowchart TD
    LB[Load Balancer] --> GW[API Gateway]
    GW --> US[User Service]
    GW --> PS[Product Service]
    GW --> CS[Cart Service]
    GW --> OS[Order Service]
    GW --> PY[Payment Service]

    US --> DB1[(Users DB)]
    PS --> DB2[(Products DB)]
    PS --> CH[Redis Cache]
    CS --> CH
    OS --> DB3[(Orders DB)]
    OS --> MQ[Message Queue]
    PY --> EXT[External Payment Provider]
    MQ --> NS[Notification Service]
    NS --> EM[Email Provider]
    NS --> SM[SMS Provider]

Dashboard Implementation

The system overview dashboard is organized into five rows, progressing from high-level health to detailed breakdowns:

Row 1: System Health Summary (always expanded) — Four stat panels showing overall availability, total error rate, P95 latency, and active user sessions. These panels use the library panel pattern with standardized thresholds.

Row 2: Service Topology — Node graph panel powered by Tempo’s service graph metrics, showing live request flow between services with error rates on edges and latency in node labels.

Row 3: Golden Signals by Service — A table panel with rows per service showing current rate, error percentage, P50/P95/P99 latency, and saturation. Color-coded cells with data links to per-service dashboards.

Row 4: Infrastructure Health (collapsed by default) — CPU, memory, disk, and network gauges for the Kubernetes cluster, with per-node breakdown available via variable selection.

Row 5: Recent Events (collapsed by default) — Combined logs panel showing recent errors across all services, filtered to ERROR and FATAL levels, with trace ID links.

Dashboard-as-Code Provisioning

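For production environments, dashboards should be version-controlled and provisioned as code. Grafana supports provisioning via JSON files, Terraform, or the Grafana Operator for Kubernetes. With file-based provisioning, a provider definition tells Grafana where to find the dashboard JSON files. Provider definitions are normally written in YAML; the JSON syntax shown here is also valid YAML: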

{
  "apiVersion": 1,
  "providers": [
    {
      "name": "platform-dashboards",
      "orgId": 1,
      "folder": "Platform",
      "type": "file",
      "disableDeletion": true,
      "editable": false,
      "options": {
        "path": "/etc/grafana/provisioning/dashboards/platform",
        "foldersFromFilesStructure": false
      }
    }
  ]
}
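Provider definitions are easy to get subtly wrong, so a small pre-deploy check can help. The sketch below is a hypothetical helper (`check_provider` is not part of Grafana) that inspects one parsed provider entry for two things: a missing `options.path`, and an explicit `folder` combined with `foldersFromFilesStructure`, which Grafana's provisioning documentation says must not be used together.

```python
def check_provider(provider):
    """Return a list of human-readable problems for one file provider entry."""
    problems = []
    options = provider.get("options", {})

    # A file provider needs a directory to read dashboards from.
    if provider.get("type", "file") == "file" and not options.get("path"):
        problems.append("file provider has no options.path")

    # Folder names come either from the provider or from the directory tree.
    if options.get("foldersFromFilesStructure") and provider.get("folder"):
        problems.append("folder must be empty when foldersFromFilesStructure is true")

    return problems


# Example entry (values are illustrative) that derives folders from directories.
example = {
    "name": "platform-dashboards",
    "type": "file",
    "folder": "",
    "options": {
        "path": "/etc/grafana/provisioning/dashboards/platform",
        "foldersFromFilesStructure": True,
    },
}

print(check_provider(example))
```

A check like this fits naturally into the same CI job that lints the dashboard JSON files.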

An abridged dashboard JSON for provisioning, showing the variables, annotations, Row 1 stat panels, and dashboard links:

{
  "uid": "system-overview",
  "title": "System Overview",
  "tags": ["production", "overview", "golden-signals"],
  "timezone": "browser",
  "refresh": "30s",
  "time": {"from": "now-1h", "to": "now"},
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(up{job=~\".+\"}, namespace)",
        "current": {"text": "production", "value": "production"}
      },
      {
        "name": "service",
        "type": "query",
        "query": "label_values(up{namespace=\"$namespace\"}, service)",
        "multi": true,
        "includeAll": true
      }
    ]
  },
  "annotations": {
    "list": [
      {"name": "Deployments", "datasource": {"uid": "mimir-prod"}, "enable": true, "iconColor": "blue", "expr": "changes(kube_deployment_status_observed_generation{namespace=\"$namespace\"}[2m]) > 0"},
      {"name": "Alerts", "datasource": {"uid": "-- Grafana --"}, "enable": true, "iconColor": "red", "type": "alert"}
    ]
  },
  "panels": [
    {
      "title": "System Availability",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
      "targets": [{"expr": "avg(up{namespace=\"$namespace\"}) * 100"}],
      "fieldConfig": {"defaults": {"unit": "percent", "thresholds": {"steps": [{"color": "red", "value": null}, {"color": "yellow", "value": 99}, {"color": "green", "value": 99.9}]}}}
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
      "targets": [{"expr": "sum(rate(http_requests_total{namespace=\"$namespace\",status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[$__rate_interval])) * 100"}],
      "fieldConfig": {"defaults": {"unit": "percent", "thresholds": {"steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 5}]}}}
    },
    {
      "title": "P95 Latency",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
      "targets": [{"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[$__rate_interval])) by (le))"}],
      "fieldConfig": {"defaults": {"unit": "s", "thresholds": {"steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.5}, {"color": "red", "value": 2}]}}}
    },
    {
      "title": "Active Sessions",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
      "targets": [{"expr": "sum(active_sessions{namespace=\"$namespace\"})"}],
      "fieldConfig": {"defaults": {"unit": "short"}}
    }
  ],
  "links": [
    {"title": "Infrastructure", "type": "link", "url": "/d/infra-overview", "icon": "bolt"},
    {"title": "SLO Dashboard", "type": "link", "url": "/d/slo-overview?var-namespace=$namespace", "icon": "chart-line"},
    {"title": "Incident Runbooks", "type": "link", "url": "https://wiki.example.com/runbooks", "icon": "book", "targetBlank": true}
  ]
}
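How the `thresholds` steps and `gridPos` values in the JSON above behave can be sketched in a few lines of Python. The function names `threshold_color` and `stat_row_positions` are hypothetical helpers, not Grafana APIs. The sketch assumes absolute thresholds with steps sorted in ascending order and a base step whose value is null (`None` in Python), as in the panels above.

```python
GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def threshold_color(steps, value):
    """Return the color of the highest step whose value the reading meets.

    The first step has value None and supplies the base color.
    """
    color = steps[0]["color"]
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


def stat_row_positions(count, height=4, y=0):
    """Split one grid row evenly between `count` panels."""
    width = GRID_COLUMNS // count
    return [{"h": height, "w": width, "x": i * width, "y": y} for i in range(count)]


availability_steps = [
    {"color": "red", "value": None},
    {"color": "yellow", "value": 99},
    {"color": "green", "value": 99.9},
]

print(threshold_color(availability_steps, 99.95))  # green
print(threshold_color(availability_steps, 99.5))   # yellow
print(threshold_color(availability_steps, 97.0))   # red
print([pos["x"] for pos in stat_row_positions(4)])  # [0, 6, 12, 18]
```

The four `x` offsets match the `gridPos` values of the stat panels, which is why each panel uses a width of 6.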

                            
Dashboard-as-Code Best Practices: Store dashboard JSON in Git alongside application code. Use CI/CD pipelines to validate JSON syntax, check for hardcoded data source UIDs (use variables instead), and deploy to Grafana via file provisioning, the HTTP API, or Terraform. Pin dashboard versions with the version field and set editable: false in production to prevent drift from the source of truth in Git.
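The CI checks mentioned above could look something like the following Python sketch. It is a minimal illustration rather than a complete linter: `validate_dashboard` and `find_hardcoded_uids` are hypothetical names, treating `uid` and `title` as required is an assumed team policy, and any data source `uid` that does not start with `$` and is not one of Grafana's built-in pseudo data sources is treated as hardcoded.

```python
import json

# Built-in pseudo data sources that are not environment-specific.
BUILTIN_UIDS = {"-- Grafana --", "-- Mixed --", "-- Dashboard --"}


def find_hardcoded_uids(node, path="$"):
    """Yield (json_path, uid) for data source references that are not variables."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict):
            uid = ds.get("uid")
            if isinstance(uid, str) and uid and not uid.startswith("$") and uid not in BUILTIN_UIDS:
                yield f"{path}.datasource", uid
        for key, child in node.items():
            yield from find_hardcoded_uids(child, f"{path}.{key}")
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from find_hardcoded_uids(child, f"{path}[{i}]")


def validate_dashboard(text):
    """Return a list of problems found in one dashboard JSON document."""
    try:
        dashboard = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(dashboard, dict):
        return ["top-level JSON value must be an object"]

    problems = [f"missing required field: {name}" for name in ("uid", "title") if not dashboard.get(name)]
    problems += [f"hardcoded data source uid {uid!r} at {where}" for where, uid in find_hardcoded_uids(dashboard)]
    return problems


sample = json.dumps({
    "uid": "system-overview",
    "title": "System Overview",
    "annotations": {"list": [{"name": "Deployments", "datasource": {"uid": "mimir-prod"}}]},
    "panels": [{"title": "Error Rate", "datasource": {"uid": "${datasource}"}}],
})

for problem in validate_dashboard(sample):
    print(problem)
```

In a pipeline, the script would read each dashboard file from the repository and exit with a non-zero status if any problems are reported.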
                        

Summary & Next Steps

Grafana dashboards transform raw telemetry into actionable observability. In this guide, we covered the full spectrum from creating your first panel to building production-grade system views:

Foundation — Data source configuration with cross-signal linking, panel creation, and time range management
Organization — Rows, annotations, and dashboard links create navigable monitoring stories
Visualizations — 16+ panel types each optimized for specific data patterns and user questions
Methodology — USE, RED, Golden Signals, and business metrics provide purpose-driven dashboard design
Advanced Features — Variables, transformations, mixed data sources, and overrides enable dynamic, reusable dashboards
Management — Folders, permissions, library panels, and dashboard-as-code ensure scalable governance
Production Pattern — A complete system overview dashboard demonstrating all concepts together

The key principle: every dashboard should have a clear purpose and audience. A dashboard for on-call triage looks fundamentally different from one built for capacity planning or executive reporting. Design your dashboards around the questions they need to answer, not around the metrics you happen to have.

Next in the Series

In Part 9: Managing Incidents Using Alerts, we’ll explore Grafana’s unified alerting system — configuring alert rules with multi-dimensional evaluation, notification policies with routing trees, contact points, silences, and the complete incident lifecycle from detection through resolution.
