Creating Your First Dashboard
Dashboards are the heart of Grafana — they transform raw telemetry data into actionable visual insights. A dashboard is a collection of panels arranged on a grid, each panel displaying a specific query result through a chosen visualization. Whether you’re monitoring infrastructure health, application performance, or business KPIs, the dashboard is where all your observability data comes together into a coherent narrative.
flowchart LR
A[Configure Data Source] --> B[Create Dashboard]
B --> C[Add Panel]
C --> D[Write Query]
D --> E[Choose Visualization]
E --> F[Configure Options]
F --> G[Save Dashboard]
G --> H{Need More Panels?}
H -->|Yes| C
H -->|No| I[Share / Publish]
Connecting Data Sources
Before creating any dashboard, you need at least one configured data source. Grafana supports over 150 data source plugins, but the most common for observability are Prometheus/Mimir (metrics), Loki (logs), and Tempo (traces). Each data source connection defines how Grafana communicates with your backend storage.
{
"apiVersion": 1,
"datasources": [
{
"name": "Mimir",
"type": "prometheus",
"uid": "mimir-prod",
"url": "http://mimir-gateway.monitoring:8080/prometheus",
"access": "proxy",
"isDefault": true,
"jsonData": {
"httpMethod": "POST",
"timeInterval": "15s",
"exemplarTraceIdDestinations": [
{
"name": "traceID",
"datasourceUid": "tempo-prod"
}
]
}
},
{
"name": "Loki",
"type": "loki",
"uid": "loki-prod",
"url": "http://loki-gateway.monitoring:3100",
"access": "proxy",
"jsonData": {
"derivedFields": [
{
"matcherRegex": "traceID=(\\w+)",
"name": "TraceID",
"url": "$${__value.raw}",
"datasourceUid": "tempo-prod"
}
]
}
},
{
"name": "Tempo",
"type": "tempo",
"uid": "tempo-prod",
"url": "http://tempo-gateway.monitoring:3200",
"access": "proxy",
"jsonData": {
"tracesToLogsV2": {
"datasourceUid": "loki-prod",
"filterByTraceID": true
},
"tracesToMetrics": {
"datasourceUid": "mimir-prod"
},
"serviceMap": {
"datasourceUid": "mimir-prod"
}
}
}
]
}
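The three data sources above reference each other by UID, so a typo in one `datasourceUid` silently breaks a cross-signal link. A minimal sketch of a pre-deployment check follows, assuming the provisioning config has already been loaded into a Python dict; the helper names `collect_uid_refs` and `find_dangling_refs` are hypothetical and not part of Grafana.

```python
from typing import Any, Dict, Set


def collect_uid_refs(node: Any) -> Set[str]:
    """Recursively collect every string stored under a 'datasourceUid' key."""
    refs: Set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "datasourceUid" and isinstance(value, str):
                refs.add(value)
            else:
                refs |= collect_uid_refs(value)
    elif isinstance(node, list):
        for item in node:
            refs |= collect_uid_refs(item)
    return refs


def find_dangling_refs(config: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Map each data source name to the UIDs it references that are not defined."""
    defined = {ds["uid"] for ds in config["datasources"]}
    dangling: Dict[str, Set[str]] = {}
    for ds in config["datasources"]:
        missing = collect_uid_refs(ds.get("jsonData", {})) - defined
        if missing:
            dangling[ds["name"]] = missing
    return dangling


# Trimmed copy of the configuration shown above.
config = {
    "datasources": [
        {"name": "Mimir", "uid": "mimir-prod", "jsonData": {
            "exemplarTraceIdDestinations": [{"name": "traceID", "datasourceUid": "tempo-prod"}]}},
        {"name": "Loki", "uid": "loki-prod", "jsonData": {
            "derivedFields": [{"name": "TraceID", "datasourceUid": "tempo-prod"}]}},
        {"name": "Tempo", "uid": "tempo-prod", "jsonData": {
            "tracesToLogsV2": {"datasourceUid": "loki-prod"},
            "serviceMap": {"datasourceUid": "mimir-prod"}}},
    ]
}

print(find_dangling_refs(config))  # prints {}
```

An empty result means every referenced UID resolves to a defined data source; running a check like this in CI is one possible way to catch broken links before Grafana loads the file.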
Panel Creation & Basic Queries
Creating a panel involves three fundamental steps: writing a query, selecting a visualization, and configuring display options. The query editor adapts based on your data source type — PromQL for Prometheus/Mimir, LogQL for Loki, and TraceQL for Tempo.
Here’s the JSON model of a basic panel querying CPU utilization:
{
"title": "CPU Utilization by Instance",
"type": "timeseries",
"datasource": {
"type": "prometheus",
"uid": "mimir-prod"
},
"targets": [
{
"expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"options": {
"tooltip": {"mode": "multi"},
"legend": {"displayMode": "table", "placement": "bottom", "calcs": ["mean", "max", "lastNotNull"]}
}
}
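To make the threshold semantics in the panel above concrete, here is a small sketch of how an absolute threshold list could be resolved to a color. It assumes the steps are sorted ascending with the base step (`"value": null`) first, and that a reading takes the color of the highest step it has reached; `threshold_color` is a hypothetical helper, not Grafana code.

```python
def threshold_color(steps, value):
    """Return the color of the highest step whose bound is <= value.

    `steps` mirrors the panel JSON: a list of {"color", "value"} dicts,
    ascending, where the first step has value None and acts as the base.
    """
    color = steps[0]["color"]
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


steps = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 90},
]

for reading in (42, 70, 95):
    print(reading, threshold_color(steps, reading))
```

With these steps, a CPU reading of 42 stays green, 70 turns yellow, and 95 turns red.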
Time Range Controls
Every dashboard has a global time range picker that controls the time window for all panels (unless individually overridden). Understanding time range controls is essential for effective troubleshooting:
- Relative time ranges — now-1h, now-6h, now-24h, now-7d
- Absolute time ranges — Exact start/end timestamps for incident forensics
- Auto-refresh intervals — 5s, 10s, 30s, 1m, 5m for real-time monitoring
- Time zone settings — Browser local, UTC, or specific timezone
- Fiscal year quarters — For business-aligned time periods
Individual panels can override the dashboard time range using the Relative time field in the panel's query options, which is useful for showing “last 7 days” trends alongside real-time data.
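As an illustration of what a relative range such as `now-6h` resolves to, the sketch below converts the simple `now-<N><unit>` form into a concrete timestamp. It is a deliberately reduced model: Grafana's own syntax also supports months, years, and rounding suffixes, which this hypothetical `resolve_relative` helper ignores.

```python
import re
from datetime import datetime, timedelta, timezone

# Supported unit suffixes and the matching timedelta keyword.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def resolve_relative(expr, now):
    """Resolve 'now' or 'now-<N><unit>' against the supplied reference time."""
    if expr == "now":
        return now
    match = re.fullmatch(r"now-(\d+)([smhdw])", expr)
    if match is None:
        raise ValueError(f"unsupported relative time: {expr!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{_UNITS[unit]: amount})


reference = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
print(resolve_relative("now-6h", reference).isoformat())  # prints 2024-01-08T06:00:00+00:00
print(resolve_relative("now-7d", reference).isoformat())  # prints 2024-01-01T12:00:00+00:00
```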
Developing Your Dashboard Further
Once you have basic panels working, it’s time to organize and enrich your dashboard with structural elements, contextual annotations, and navigation links that transform a collection of panels into a coherent monitoring story.
Rows & Panel Organization
Rows are collapsible containers that group related panels. They provide visual hierarchy and allow users to focus on specific areas without being overwhelmed by information. A well-organized dashboard typically follows a top-down pattern: high-level overview at the top, detailed breakdowns below.
flowchart TD
subgraph "Row 1: Overview (always visible)"
S1[Service Health Stat]
S2[Error Rate Stat]
S3[P99 Latency Stat]
S4[Throughput Stat]
end
subgraph "Row 2: Traffic & Latency"
T1[Request Rate Time Series]
T2[Latency Distribution Heatmap]
end
subgraph "Row 3: Errors & Saturation"
E1[Error Rate by Type]
E2[Queue Depth / Saturation]
end
subgraph "Row 4: Resources (collapsed)"
R1[CPU Usage]
R2[Memory Usage]
R3[Disk I/O]
R4[Network I/O]
end
subgraph "Row 5: Logs (collapsed)"
L1[Error Logs Panel]
end
Panel sizing follows Grafana’s 24-column grid system. Common layouts include:
- Full width (24 cols) — Logs panels, wide time series
- Half width (12 cols) — Side-by-side comparisons
- Third width (8 cols) — Three-panel rows
- Quarter width (6 cols) — Stat panels in overview rows
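The layouts above translate directly into `gridPos` values. The following sketch places panels left to right on the 24-column grid and wraps to a new line when a panel would overflow; `layout_row` is a hypothetical helper for generating dashboard JSON, not something Grafana provides.

```python
GRID_COLUMNS = 24


def layout_row(widths, height, y):
    """Return gridPos dicts for panels of the given widths, starting at row y."""
    positions = []
    x = 0
    for width in widths:
        if not 1 <= width <= GRID_COLUMNS:
            raise ValueError(f"width must be between 1 and {GRID_COLUMNS}")
        if x + width > GRID_COLUMNS:
            # Panel does not fit on the current line: wrap to the next one.
            x = 0
            y += height
        positions.append({"h": height, "w": width, "x": x, "y": y})
        x += width
    return positions


# Four quarter-width stat panels for an overview row.
for pos in layout_row([6, 6, 6, 6], height=4, y=0):
    print(pos)
```

The four stat panels land at x = 0, 6, 12, and 18 on the same line, which matches the quarter-width pattern in the list above.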
Annotations
Annotations overlay contextual markers on time series panels — deployments, incidents, configuration changes, or any event that might correlate with metric changes. They’re invaluable for incident correlation: “Did the latency spike coincide with that deployment?”
{
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": {"type": "prometheus", "uid": "mimir-prod"},
"enable": true,
"iconColor": "blue",
"expr": "changes(kube_deployment_status_observed_generation{namespace=\"production\"}[1m]) > 0",
"titleFormat": "Deploy: {{deployment}}",
"tagKeys": "namespace,deployment"
},
{
"name": "Alerts",
"datasource": {"type": "datasource", "uid": "-- Grafana --"},
"enable": true,
"iconColor": "red",
"type": "alert"
},
{
"name": "Incidents",
"datasource": {"type": "loki", "uid": "loki-prod"},
"enable": false,
"iconColor": "orange",
"expr": "{app=\"incident-bot\"} |= \"incident created\"",
"titleFormat": "Incident"
}
]
}
}
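The Deployments annotation above fires whenever the observed generation of a deployment changes. As a rough illustration of that idea (not of PromQL's exact range-window semantics), the sketch below scans an ordered list of samples and reports the timestamps where the value differs from the previous sample. The sample data and the `change_points` helper are hypothetical.

```python
def change_points(samples):
    """Return timestamps at which the value differs from the previous sample.

    `samples` is a list of (timestamp, value) tuples ordered by time.
    """
    return [
        ts
        for (_, previous), (ts, current) in zip(samples, samples[1:])
        if current != previous
    ]


# Observed generation of one deployment, sampled every 60 seconds.
generation = [(0, 3), (60, 3), (120, 4), (180, 4), (240, 5)]
print(change_points(generation))  # prints [120, 240]
```

Each returned timestamp corresponds to a point where an annotation marker would be expected on the time series panels.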
Dashboard Links
Dashboard links create navigation pathways between related dashboards, enabling drill-down workflows. Grafana supports three types of links:
- Dashboard links — Navigate to another dashboard, optionally passing current variable values and time range
- URL links — Link to external systems (runbooks, wikis, incident management)
- Data links — Panel-level links that pass the clicked data point’s value as a parameter
{
"links": [
{
"title": "Service Detail",
"type": "link",
"url": "/d/service-detail/service-detail?var-service=${service}&from=${__from}&to=${__to}",
"tooltip": "Drill into service-specific metrics",
"icon": "external link"
},
{
"title": "Related Dashboards",
"type": "dashboards",
"tags": ["production", "microservices"],
"tooltip": "All production dashboards"
}
]
}
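When links like the one above are generated by scripts rather than typed by hand, variable values need URL encoding. A minimal sketch, assuming the `/d/<uid>/<slug>` path shape and the `var-<name>` parameter convention used in the example; `dashboard_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def dashboard_url(uid, slug, variables, time_from, time_to):
    """Build a dashboard link that carries variable values and the time range."""
    params = []
    for name, value in variables.items():
        # Multi-value variables are passed as repeated var-<name> parameters.
        values = value if isinstance(value, list) else [value]
        params.extend((f"var-{name}", v) for v in values)
    params += [("from", time_from), ("to", time_to)]
    return f"/d/{uid}/{slug}?{urlencode(params)}"


print(dashboard_url("service-detail", "service-detail", {"service": "cart"}, "now-1h", "now"))
# prints /d/service-detail/service-detail?var-service=cart&from=now-1h&to=now
```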
Using Visualizations in Grafana
Choosing the right visualization is critical — each panel type is optimized for specific data patterns and user questions. Grafana ships with 16+ built-in visualization types, each serving distinct analytical purposes.
Time Series
The time series panel is the most commonly used visualization in Grafana. It displays metric data points over time with configurable line styles, fill opacity, gradient modes, point visibility, and stacking options. It supports multiple Y-axes, series overrides, and tooltip modes (single, multi, hidden).
Best used for: CPU/memory usage trends, request rates, latency percentiles, any metric evolving over time.
- Use gradient fill for single-series panels to emphasize magnitude
- Enable stacking (normal or percent) to show composition over time
- Set connect null values to handle gaps from scrape failures
- Use exemplars overlay to link metric spikes to specific traces
- Configure legend as table with calc values (mean, max, current) for quick reference
Stat, Gauge & Bar Chart
The stat panel shows a single numeric value with optional sparkline, color-coded by thresholds. Ideal for KPI overview rows (uptime %, current error rate, total requests). The gauge adds a visual indicator of where the current value falls within a defined range — perfect for resource utilization (0–100%). The bar chart displays categorical data as vertical or horizontal bars with grouping and stacking support.
- Stat — Current service count, total errors today, uptime percentage
- Gauge — CPU utilization, memory pressure, disk fullness
- Bar chart — Top 10 endpoints by request count, error distribution by service
Table & Heatmap
The table visualization presents data in row/column format with sorting, filtering, cell coloring, and link support. It excels at displaying multi-dimensional data or inventory-style views (list of pods with their status, services with their SLO compliance).
The heatmap visualizes distribution over time — each cell represents a bucket of values for a time interval, colored by density. It’s the ideal choice for latency distributions (replacing percentile lines with full distribution visibility) and reveals patterns that percentiles hide.
{
"title": "Request Latency Distribution",
"type": "heatmap",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}"
}
],
"options": {
"calculate": false,
"yAxis": {"unit": "s", "reverse": false},
"color": {"scheme": "Spectral", "steps": 64},
"cellGap": 1,
"tooltip": {"show": true}
}
}
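Prometheus histogram buckets are cumulative: the `le="0.5"` series counts every request that took at most 0.5 seconds, including those already counted in smaller buckets. A heatmap cell, by contrast, represents only the observations that fall between two adjacent bounds. The sketch below shows that conversion on made-up numbers; `per_bucket_counts` is a hypothetical helper.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative {le: count} data into per-bucket counts.

    Keys are the `le` label values as strings; '+Inf' sorts last.
    """
    ordered = sorted(cumulative.items(), key=lambda item: float(item[0]))
    result = []
    previous = 0
    for le, count in ordered:
        result.append((le, count - previous))
        previous = count
    return result


sample = {"0.1": 5, "0.5": 12, "1": 15, "+Inf": 16}
print(per_bucket_counts(sample))
# prints [('0.1', 5), ('0.5', 7), ('1', 3), ('+Inf', 1)]
```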
Geomap, Logs & Traces
The geomap panel renders data points on a world map. Use it for visualizing request origins, CDN performance by region, or infrastructure distribution across availability zones. It supports multiple layer types: markers, heatmap overlay, and route layers.
The logs panel integrates directly with Loki, displaying log lines with syntax highlighting, log level detection, and expandable log details. Combined with the derived fields configuration, clicking a trace ID in a log line navigates directly to the trace view.
The traces panel renders distributed traces from Tempo as waterfall diagrams showing span hierarchies, durations, and service boundaries. It supports filtering by span attributes and duration thresholds directly within the panel.
Flame Graph, Node Graph & Canvas
The flame graph panel visualizes profiling data from Pyroscope, showing function call hierarchies with CPU time or memory allocation. It enables developers to identify hot code paths directly from dashboards without switching tools.
The node graph panel displays service topology maps — nodes represent services and edges represent connections with metrics (request rate, error rate, latency). This is Grafana’s service map visualization, powered by Tempo’s service graph metrics.
The canvas panel provides a free-form layout where elements (icons, text, metric values) can be placed at arbitrary positions. It’s used for custom diagrams, architecture overviews, or floor plans with real-time data bindings.
Histogram, Pie Chart & State Timeline
The histogram panel shows the distribution of values as a bar chart with configurable bucket sizes — useful for understanding value distributions at a glance (response size distribution, batch job duration spread).
The pie chart shows proportional relationships — traffic split across services, error distribution by type, resource allocation by team. Use sparingly; tables or bar charts often communicate the same data more effectively.
The state timeline panel displays discrete states over time as colored bands — perfect for showing service health transitions (healthy → degraded → down → recovered), deployment rollout progress, or feature flag changes. Each state maps to a color for immediate visual pattern recognition.
Developing a Dashboard Purpose
Effective dashboards answer specific questions rather than displaying every available metric. Industry-proven methodologies provide frameworks for what to monitor and how to organize it. The choice of methodology depends on what you’re monitoring: resources (USE), services (RED), or SRE practices (Golden Signals).
flowchart TD
Q{What are you monitoring?}
Q -->|Infrastructure Resources| USE[USE Method]
Q -->|Request-Driven Services| RED[RED Method]
Q -->|SRE Practice| GS[Golden Signals]
Q -->|Business Outcomes| BM[Business Metrics]
USE --> U1[Utilization]
USE --> U2[Saturation]
USE --> U3[Errors]
RED --> R1[Rate]
RED --> R2[Errors]
RED --> R3[Duration]
GS --> G1[Latency]
GS --> G2[Traffic]
GS --> G3[Errors]
GS --> G4[Saturation]
BM --> B1[Revenue Impact]
BM --> B2[User Experience]
BM --> B3[Conversion Rates]
USE Method Dashboard
The USE Method (Utilization, Saturation, Errors), developed by Brendan Gregg, targets infrastructure resources — CPUs, memory, disks, network interfaces, and any component with a capacity limit.
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | node_cpu_seconds_total (idle complement) | node_load1 / CPU count | Machine check exceptions |
| Memory | 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | Swap usage, OOM kills | ECC errors |
| Disk | 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes | I/O queue depth (node_disk_io_now) | node_disk_io_errors |
| Network | Bandwidth utilization % | TCP retransmits, queue drops | node_network_receive_errs_total |
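Exporters typically report how much of a resource is still available rather than how much is used, so utilization is the complement of the free fraction. A tiny sketch of that arithmetic, using a hypothetical `utilization_percent` helper and made-up readings:

```python
def utilization_percent(available, total):
    """Used share of a resource as a percentage, from available and total readings."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (1 - available / total) * 100


# 4 GiB available out of 16 GiB total memory.
print(utilization_percent(4, 16))  # prints 75.0
```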
RED Method Dashboard
The RED Method (Rate, Errors, Duration) focuses on request-driven services. Created by Tom Wilkie, it answers: “Is the service working?” For every service in your system, monitor these three signals:
- Rate — Requests per second (rate(http_requests_total[5m]))
- Errors — Failed requests per second (rate(http_requests_total{status=~"5.."}[5m]))
- Duration — Latency distribution (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
{
"title": "RED Dashboard - Service Overview",
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
"targets": [{"expr": "sum(rate(http_requests_total{service=\"$service\"}[$__rate_interval]))"}]
},
{
"title": "Error Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
"targets": [{"expr": "sum(rate(http_requests_total{service=\"$service\",status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{service=\"$service\"}[$__rate_interval])) * 100"}],
"fieldConfig": {"defaults": {"unit": "percent"}}
},
{
"title": "Latency (p50 / p95 / p99)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
"targets": [
{"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le))", "legendFormat": "p50"},
{"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le))", "legendFormat": "p95"},
{"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"$service\"}[$__rate_interval])) by (le))", "legendFormat": "p99"}
],
"fieldConfig": {"defaults": {"unit": "s"}}
}
]
}
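The latency panel relies on `histogram_quantile`, which estimates a percentile by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for intuition only; it assumes at least two buckets, sorted ascending, with `+Inf` last, and it does not reproduce every edge case of the Prometheus function.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2:
        raise ValueError("need at least one finite bucket plus +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(q, round(histogram_quantile(q, latency_buckets), 3))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.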
Golden Signals Dashboard
Google’s Four Golden Signals from the SRE book combine elements of both USE and RED: Latency, Traffic, Errors, and Saturation. This methodology works for any type of system and is the recommended starting point for teams adopting SRE practices.
- Latency — Time to serve a request (distinguish successful vs. failed request latency)
- Traffic — Demand on the system (requests/sec, sessions, transactions)
- Errors — Rate of failed requests (explicit 5xx, implicit policy violations)
- Saturation — How “full” the service is (queue depth, memory pressure, thread pool exhaustion)
Business Metrics Dashboard
Technical metrics tell you what is broken; business metrics tell you why it matters. A business metrics dashboard bridges the gap between engineering and stakeholders by showing revenue impact, user experience scores, and conversion funnels alongside the technical signals that affect them.
Common business metrics to display:
- Orders per minute — Direct revenue indicator
- Cart abandonment rate — Correlated with latency spikes
- Active users / sessions — Traffic indicator in business terms
- Payment success rate — Critical path health
- Search result relevance — Product experience quality
- Feature adoption rates — New feature rollout health
Advanced Dashboard Techniques
Moving beyond basic panels, Grafana’s advanced features enable dynamic, reusable dashboards that adapt to different environments, services, and time windows without manual editing.
Variables & Templating
Variables are the foundation of reusable dashboards. Instead of hardcoding label values in queries, variables create dropdown menus that dynamically filter all panels. Grafana supports several variable types:
| Type | Source | Example |
|---|---|---|
| Query | Data source query | label_values(up, instance) |
| Custom | Manual comma-separated list | production, staging, development |
| Interval | Time interval options | 1m, 5m, 15m, 1h |
| Data source | Available data sources by type | All Prometheus data sources |
| Text box | Free-form user input | Custom filter string |
| Constant | Hidden fixed value | Provisioned environment name |
Chained variables create cascading filters where one variable’s selection constrains the next. For example, selecting a namespace filters the service dropdown to only show services in that namespace:
{
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"query": "label_values(kube_pod_info, namespace)",
"refresh": 2,
"sort": 1
},
{
"name": "service",
"type": "query",
"query": "label_values(kube_pod_info{namespace=\"$namespace\"}, created_by_name)",
"refresh": 2,
"sort": 1,
"multi": true,
"includeAll": true
},
{
"name": "pod",
"type": "query",
"query": "label_values(kube_pod_info{namespace=\"$namespace\", created_by_name=~\"$service\"}, pod)",
"refresh": 2,
"sort": 1,
"multi": true,
"includeAll": true
}
]
}
}
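To show what the chained queries above compute, the sketch below emulates `label_values` over a small in-memory list of label sets. The pod data is invented for illustration, and this `label_values` function is a local stand-in rather than a Grafana or Prometheus API.

```python
def label_values(series, label, **matchers):
    """Distinct, sorted values of `label` among series matching every matcher.

    A matcher value may be a single string or a list (multi-select variable).
    """
    def matches(labels):
        for key, wanted in matchers.items():
            allowed = wanted if isinstance(wanted, (list, set, tuple)) else [wanted]
            if labels.get(key) not in allowed:
                return False
        return True

    return sorted({s[label] for s in series if label in s and matches(s)})


# Hypothetical kube_pod_info label sets.
pods = [
    {"namespace": "production", "created_by_name": "cart", "pod": "cart-7d9f"},
    {"namespace": "production", "created_by_name": "orders", "pod": "orders-5c2a"},
    {"namespace": "staging", "created_by_name": "cart", "pod": "cart-1b3e"},
]

namespaces = label_values(pods, "namespace")
services = label_values(pods, "created_by_name", namespace="production")
pod_names = label_values(pods, "pod", namespace="production", created_by_name=["cart"])
print(namespaces, services, pod_names)
```

Each dropdown is derived from the selection before it, which is why changing the namespace narrows both the service and pod lists.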
Built-in variables provide automatic context:
- $__interval — Automatically calculated based on time range and panel width (prevents over/under-sampling)
- $__rate_interval — Safe interval for rate() functions (at least 4x scrape interval)
- $__from / $__to — Current time range boundaries in epoch milliseconds
- $__range — Duration of the current time range (e.g., 1h)
- ${__dashboard.uid} — Current dashboard UID for self-referencing links
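The two interval variables can be approximated with simple arithmetic. The sketch below assumes the interval is the time range divided by the panel's maximum data points, floored at the minimum interval, and that the rate interval is the larger of "interval plus one scrape interval" and "four scrape intervals". Grafana additionally rounds the interval to convenient values, which this sketch skips; the function names are hypothetical.

```python
def interval_seconds(range_seconds, max_data_points, min_interval):
    """Approximate $__interval: range divided by data points, floored at min_interval."""
    return max(range_seconds / max_data_points, min_interval)


def rate_interval_seconds(interval, scrape_interval):
    """Approximate $__rate_interval for a given interval and scrape interval."""
    return max(interval + scrape_interval, 4 * scrape_interval)


scrape = 15  # seconds, matching the timeInterval used in the data source config

one_hour = interval_seconds(3600, max_data_points=1200, min_interval=scrape)
print(one_hour, rate_interval_seconds(one_hour, scrape))  # prints 15 60

one_week = interval_seconds(7 * 24 * 3600, max_data_points=1200, min_interval=scrape)
print(one_week, rate_interval_seconds(one_week, scrape))  # prints 504.0 519.0
```

On short ranges the rate window never drops below four scrape intervals, so each window still contains enough samples for `rate()` to work with.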
Transformations
Transformations process query results before visualization — enabling calculations, joins, filtering, and restructuring that would be difficult or impossible in the query language alone. They’re applied in sequence, forming a pipeline.
- Merge — Combine multiple queries into a single table (useful for joining metrics from different sources)
- Filter by value — Show only rows matching a condition (e.g., error rate > 1%)
- Calculate field — Create new fields using math (Field A / Field B * 100), binary operations, or reduce functions
- Group by — Aggregate rows by a field with sum, mean, min, max, count, first, last
- Sort by — Order results by any field ascending or descending
- Join by field — SQL-style inner/outer join of multiple queries on a shared key
- Series to rows — Convert multiple time series into a table format
- Organize fields — Rename, reorder, or hide specific columns
- Reduce — Collapse time series into single values (sum, mean, max, range)
Example: Creating a “Top Services by Error Rate” table using transformations:
{
"title": "Top Services by Error Rate",
"type": "table",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service)",
"legendFormat": "{{service}}",
"refId": "errors",
"instant": true,
"format": "table"
},
{
"expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)",
"legendFormat": "{{service}}",
"refId": "total",
"instant": true,
"format": "table"
}
],
"transformations": [
{"id": "merge", "options": {}},
{
"id": "calculateField",
"options": {
"mode": "binary",
"reduce": {"reducer": "sum"},
"binary": {"left": "Value #errors", "operator": "/", "right": "Value #total"},
"alias": "Error Rate"
}
},
{"id": "sortBy", "options": {"sort": [{"field": "Error Rate", "desc": true}]}},
{"id": "filterByValue", "options": {"filters": [{"fieldName": "Error Rate", "config": {"id": "greater", "options": {"value": 0.001}}}], "type": "include", "match": "any"}}
]
}
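The pipeline above can be read as four ordinary data-processing steps. The following sketch mirrors them in plain Python on invented numbers, purely to show the order of operations; it is not how Grafana implements transformations, and `top_error_rates` is a hypothetical name.

```python
def top_error_rates(errors, total, min_rate=0.001):
    """Emulate merge -> calculate field -> sort by -> filter by value."""
    # 1. Merge: one row per service with both query results side by side.
    rows = [
        {"service": service, "errors": errors.get(service, 0.0), "total": requests}
        for service, requests in total.items()
    ]
    # 2. Calculate field: binary operation errors / total.
    for row in rows:
        row["error_rate"] = row["errors"] / row["total"] if row["total"] else 0.0
    # 3. Sort by: highest error rate first.
    rows.sort(key=lambda row: row["error_rate"], reverse=True)
    # 4. Filter by value: keep rows above the minimum rate.
    return [row for row in rows if row["error_rate"] > min_rate]


# Requests per second by service (made-up values).
errors = {"cart": 0.5, "orders": 0.2, "users": 0.0}
total = {"cart": 10.0, "orders": 20.0, "users": 5.0}

for row in top_error_rates(errors, total):
    print(row["service"], row["error_rate"])
```

Because transformations run in sequence, the filter step sees the calculated field only if it comes after the calculation.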
Mixed Data Sources in a Single Panel
Grafana supports querying multiple data sources within a single panel using the Mixed data source. This enables powerful correlations — overlaying deployment annotations from Loki on top of Prometheus metrics, or comparing CloudWatch metrics with self-hosted Mimir data in the same time series panel.
To use mixed data sources:
- Select -- Mixed -- as the panel data source
- Each query target independently selects its own data source
- Results are merged based on time alignment
Links & Drilldowns
Effective dashboards form a navigation hierarchy. Data links on panels enable context-sensitive drill-downs — clicking a specific service in a table navigates to that service’s detail dashboard with all variables pre-populated.
{
"fieldConfig": {
"defaults": {
"links": [
{
"title": "View Service Detail",
"url": "/d/service-detail?var-service=${__data.fields.service}&from=${__from}&to=${__to}",
"targetBlank": false
},
{
"title": "View Traces",
"url": "/explore?left={\"datasource\":\"tempo-prod\",\"queries\":[{\"queryType\":\"traceqlSearch\",\"filters\":[{\"id\":\"service-name\",\"value\":[\"${__data.fields.service}\"]}]}]}&from=${__from}&to=${__to}",
"targetBlank": true
},
{
"title": "View Logs",
"url": "/explore?left={\"datasource\":\"loki-prod\",\"queries\":[{\"expr\":\"{service=\\\"${__data.fields.service}\\\"}|=\\\"error\\\"\"}]}&from=${__from}&to=${__to}",
"targetBlank": true
}
]
}
}
}
The drill-down pattern typically follows: Overview Dashboard (all services) → Service Dashboard (single service detail) → Instance Dashboard (single pod/container) → Explore (ad-hoc investigation).
Panel Overrides & Thresholds
Thresholds define color boundaries for values — green below 70%, yellow at 70-90%, red above 90%. They apply to stat panels (background color), gauges (arc color), time series (fill/line color), and tables (cell coloring).
Field overrides allow per-series or per-field customization within a panel. You can override colors, units, display names, axis placement, thresholds, and visualization options for specific fields matched by name, regular expression, field type, or the query that returned them:
{
"fieldConfig": {
"defaults": {
"unit": "reqps",
"color": {"mode": "palette-classic"}
},
"overrides": [
{
"matcher": {"id": "byName", "options": "errors"},
"properties": [
{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}},
{"id": "custom.axisPlacement", "value": "right"},
{"id": "unit", "value": "percentunit"},
{"id": "custom.fillOpacity", "value": 10}
]
},
{
"matcher": {"id": "byRegexp", "options": "/p99|p95/"},
"properties": [
{"id": "custom.lineStyle", "value": {"fill": "dash", "dash": [10, 5]}},
{"id": "unit", "value": "s"}
]
}
]
}
}
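Conceptually, overrides are applied on top of the defaults for every field a matcher selects. The sketch below models only the two matchers used above (`byName` and `byRegexp`) and assumes later overrides win when several match; `apply_overrides` is a hypothetical helper, not Grafana's implementation.

```python
import re


def apply_overrides(field_name, field_config):
    """Return the effective properties for one field: defaults plus matching overrides."""
    props = dict(field_config.get("defaults", {}))
    for override in field_config.get("overrides", []):
        matcher = override["matcher"]
        if matcher["id"] == "byName":
            hit = field_name == matcher["options"]
        elif matcher["id"] == "byRegexp":
            # The pattern is written as /.../ in the panel JSON; drop the slashes.
            hit = re.search(matcher["options"].strip("/"), field_name) is not None
        else:
            hit = False  # other matcher types are out of scope for this sketch
        if hit:
            for prop in override["properties"]:
                props[prop["id"]] = prop["value"]
    return props


field_config = {
    "defaults": {"unit": "reqps"},
    "overrides": [
        {"matcher": {"id": "byName", "options": "errors"},
         "properties": [{"id": "unit", "value": "percentunit"},
                        {"id": "custom.axisPlacement", "value": "right"}]},
        {"matcher": {"id": "byRegexp", "options": "/p99|p95/"},
         "properties": [{"id": "unit", "value": "s"}]},
    ],
}

for name in ("requests", "errors", "p99"):
    print(name, apply_overrides(name, field_config))
```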
Managing & Organizing Dashboards
As your dashboard collection grows, organization becomes critical. Grafana provides several mechanisms for managing dashboards at scale across teams and environments.
Folders & Permissions
Folders group related dashboards and provide access control boundaries. A common organizational pattern:
- Infrastructure/ — Node, network, storage dashboards (SRE team)
- Platform/ — Kubernetes, service mesh, message queues (Platform team)
- Services/ — Per-service RED dashboards (Development teams)
- Business/ — Revenue, user experience, SLO dashboards (Leadership)
- Alerts/ — Alert-specific investigation dashboards (On-call)
Permissions can be set at the folder level (inherited by all dashboards within) or overridden per dashboard. Roles include Viewer, Editor, and Admin with granular control over who can view, edit, or manage alerts for each folder.
Playlists & Snapshots
Playlists cycle through multiple dashboards automatically, which makes them ideal for wall-mounted NOC displays. Configure rotation intervals (30s–5m) and select dashboards by tag or manual selection. Playlists can be started in kiosk mode, which hides navigation for a clean display.
Snapshots capture a dashboard’s current state (including data) as a static, shareable artifact. They’re invaluable for:
- Sharing incident evidence with stakeholders who lack Grafana access
- Preserving dashboard state at a specific point during a postmortem
- Creating reports without requiring live data source connectivity
Library Panels
Library panels are reusable panel definitions shared across multiple dashboards. When you update a library panel, the change propagates to every dashboard that uses it. This is essential for maintaining consistency — a standardized “Service Health” panel used across 50 service dashboards should look and behave identically.
{
"name": "Standard Service Health",
"type": "stat",
"model": {
"targets": [
{
"expr": "sum(up{service=\"$service\"}) / count(up{service=\"$service\"}) * 100",
"legendFormat": "Health"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 95},
{"color": "green", "value": 99}
]
},
"mappings": [
{"type": "range", "options": {"from": 99, "to": 100, "result": {"text": "Healthy", "color": "green"}}},
{"type": "range", "options": {"from": 95, "to": 99, "result": {"text": "Degraded", "color": "yellow"}}},
{"type": "range", "options": {"from": 0, "to": 95, "result": {"text": "Critical", "color": "red"}}}
]
}
}
}
}
Case Study: Building an Overall System View
Let’s bring everything together by building a comprehensive system overview dashboard for a microservices e-commerce platform. This dashboard serves as the “single pane of glass” for on-call engineers, answering: “Is the system healthy? If not, where should I look?”
Application Architecture
flowchart TD
LB[Load Balancer] --> GW[API Gateway]
GW --> US[User Service]
GW --> PS[Product Service]
GW --> CS[Cart Service]
GW --> OS[Order Service]
GW --> PY[Payment Service]
US --> DB1[(Users DB)]
PS --> DB2[(Products DB)]
PS --> CH[Redis Cache]
CS --> CH
OS --> DB3[(Orders DB)]
OS --> MQ[Message Queue]
PY --> EXT[External Payment Provider]
MQ --> NS[Notification Service]
NS --> EM[Email Provider]
NS --> SM[SMS Provider]
Dashboard Implementation
The system overview dashboard is organized into five rows, progressing from high-level health to detailed breakdowns:
Row 1: System Health Summary (always expanded) — Four stat panels showing overall availability, total error rate, P95 latency, and active user sessions. These panels use the library panel pattern with standardized thresholds.
Row 2: Service Topology — Node graph panel powered by Tempo’s service graph metrics, showing live request flow between services with error rates on edges and latency in node labels.
Row 3: Golden Signals by Service — A table panel with rows per service showing current rate, error percentage, P50/P95/P99 latency, and saturation. Color-coded cells with data links to per-service dashboards.
Row 4: Infrastructure Health (collapsed by default) — CPU, memory, disk, and network gauges for the Kubernetes cluster, with per-node breakdown available via variable selection.
Row 5: Recent Events (collapsed by default) — Combined logs panel showing recent errors across all services, filtered to ERROR and FATAL levels, with trace ID links.
Dashboard-as-Code Provisioning
For production environments, dashboards should be version-controlled and provisioned as code. Grafana supports provisioning via JSON files, Terraform, or the Grafana Operator for Kubernetes:
{
"apiVersion": 1,
"providers": [
{
"name": "platform-dashboards",
"orgId": 1,
"folder": "Platform",
"type": "file",
"disableDeletion": true,
"editable": false,
"options": {
"path": "/etc/grafana/provisioning/dashboards/platform",
"foldersFromFilesStructure": false
}
}
]
}
The complete dashboard JSON for provisioning:
{
"uid": "system-overview",
"title": "System Overview",
"tags": ["production", "overview", "golden-signals"],
"timezone": "browser",
"refresh": "30s",
"time": {"from": "now-1h", "to": "now"},
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"query": "label_values(up{job=~\".+\"}, namespace)",
"current": {"text": "production", "value": "production"}
},
{
"name": "service",
"type": "query",
"query": "label_values(up{namespace=\"$namespace\"}, service)",
"multi": true,
"includeAll": true
}
]
},
"annotations": {
"list": [
{"name": "Deployments", "datasource": {"uid": "mimir-prod"}, "enable": true, "iconColor": "blue", "expr": "changes(kube_deployment_status_observed_generation{namespace=\"$namespace\"}[2m]) > 0"},
{"name": "Alerts", "datasource": {"uid": "-- Grafana --"}, "enable": true, "iconColor": "red", "type": "alert"}
]
},
"panels": [
{
"title": "System Availability",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"targets": [{"expr": "avg(up{namespace=\"$namespace\"}) * 100"}],
"fieldConfig": {"defaults": {"unit": "percent", "thresholds": {"steps": [{"color": "red", "value": null}, {"color": "yellow", "value": 99}, {"color": "green", "value": 99.9}]}}}
},
{
"title": "Error Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"targets": [{"expr": "sum(rate(http_requests_total{namespace=\"$namespace\",status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[$__rate_interval])) * 100"}],
"fieldConfig": {"defaults": {"unit": "percent", "thresholds": {"steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 5}]}}}
},
{
"title": "P95 Latency",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"targets": [{"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[$__rate_interval])) by (le))"}],
"fieldConfig": {"defaults": {"unit": "s", "thresholds": {"steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 0.5}, {"color": "red", "value": 2}]}}}
},
{
"title": "Active Sessions",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
"targets": [{"expr": "sum(active_sessions{namespace=\"$namespace\"})"}],
"fieldConfig": {"defaults": {"unit": "short"}}
}
],
"links": [
{"title": "Infrastructure", "type": "link", "url": "/d/infra-overview", "icon": "bolt"},
{"title": "SLO Dashboard", "type": "link", "url": "/d/slo-overview?var-namespace=$namespace", "icon": "chart-line"},
{"title": "Incident Runbooks", "type": "link", "url": "https://wiki.example.com/runbooks", "icon": "book", "targetBlank": true}
]
}
version field and use editable: false in production to prevent drift from the source-of-truth in Git.
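One way to keep exported dashboards diff-friendly in Git is to strip the fields that change on every save before committing. The sketch below assumes the top-level `id` and `version` keys are the volatile ones and writes keys in sorted order so that diffs stay stable; `normalize_for_git` is a hypothetical helper and the key list should be adapted to your own export process.

```python
import json

# Top-level keys assumed to be instance-specific or incremented on each save.
VOLATILE_KEYS = ("id", "version")


def normalize_for_git(dashboard):
    """Return a stable JSON string for a dashboard dict, without volatile keys."""
    cleaned = {key: value for key, value in dashboard.items() if key not in VOLATILE_KEYS}
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


exported = {
    "id": 42,
    "uid": "system-overview",
    "version": 17,
    "title": "System Overview",
    "tags": ["production", "overview", "golden-signals"],
}
print(normalize_for_git(exported))
```

Running the same function over every exported file before commit means two exports of an unchanged dashboard produce identical text.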
Summary & Next Steps
Grafana dashboards transform raw telemetry into actionable observability. In this guide, we covered the full spectrum from creating your first panel to building production-grade system views:
- Foundation — Data source configuration with cross-signal linking, panel creation, and time range management
- Organization — Rows, annotations, and dashboard links create navigable monitoring stories
- Visualizations — 16+ panel types each optimized for specific data patterns and user questions
- Methodology — USE, RED, Golden Signals, and business metrics provide purpose-driven dashboard design
- Advanced Features — Variables, transformations, mixed data sources, and overrides enable dynamic, reusable dashboards
- Management — Folders, permissions, library panels, and dashboard-as-code ensure scalable governance
- Production Pattern — A complete system overview dashboard demonstrating all concepts together
The key principle: every dashboard should have a clear purpose and audience. A dashboard for on-call triage looks fundamentally different from one built for capacity planning or executive reporting. Design your dashboards around the questions they need to answer, not around the metrics you happen to have.
Next in the Series
In Part 9: Managing Incidents Using Alerts, we’ll explore Grafana’s unified alerting system — configuring alert rules with multi-dimensional evaluation, notification policies with routing trees, contact points, silences, and the complete incident lifecycle from detection through resolution.