Dashboard Design Principles
A dashboard is not a place to show every metric you can collect. A good dashboard answers a specific question for a specific audience in under 10 seconds. Bad dashboards are information dumps — walls of graphs that nobody looks at because nothing stands out.
The Dashboard Hierarchy
Organise dashboards in layers, from high-level overviews to detailed drill-downs:
flowchart TD
A[Level 0: Executive Overview\n3-5 SLO gauges, overall health] -->|Click SLO| B[Level 1: Service Overview\nGolden signals per service]
B -->|Click service| C[Level 2: Service Detail\nEndpoint-level metrics, error breakdown]
C -->|Click anomaly| D[Level 3: Debug\nTraces, logs, infrastructure metrics]
| Level | Audience | Purpose | Refresh Rate |
|---|---|---|---|
| Level 0 | Executives, on-call | "Is anything on fire?" | 30s |
| Level 1 | On-call, SREs | "Which service has a problem?" | 15s |
| Level 2 | Service owners | "What endpoint/operation is failing?" | 10s |
| Level 3 | Engineers debugging | "Show me the traces and logs for this error" | 5s |
Grafana Dashboard Best Practices
- Use variables: Template dashboards with $service, $environment, and $namespace variables so one dashboard works for all services
- Left-to-right flow: Place request rate → error rate → latency → saturation from left to right (follows the narrative: "How much traffic? How many errors? How fast? How full?")
- Stat panels first: Put single-stat panels at the top for instant status; time series below for trends
- Consistent units: Always label axes with units (ms, req/s, %, bytes). Never leave a graph unlabelled
- Threshold colours: Green → Yellow → Red thresholds on stat panels matching alert severity levels
- Annotations: Mark deployment events on time-series panels so you can correlate metrics changes with releases
- Links: Every panel should link to the drill-down dashboard or relevant log/trace query
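To make the threshold-colour rule concrete, here is a minimal Python sketch of how a stat panel could pick its colour. It assumes each threshold is the lower bound of its colour band; the function name threshold_colour and the sample values are illustrative, not part of Grafana.

```python
def threshold_colour(value, steps):
    """Return the colour of the highest step whose lower bound is <= value.

    steps maps a colour name to the lower bound of its band.
    """
    colour = None
    for name, lower_bound in sorted(steps.items(), key=lambda item: item[1]):
        if value >= lower_bound:
            colour = name
    return colour


# Error-rate bands in percent: yellow from 1%, red from 5% (illustrative values).
error_rate_steps = {"green": 0, "yellow": 1, "red": 5}

for rate in (0.4, 2.5, 7.0):
    print(rate, threshold_colour(rate, error_rate_steps))
```

If the yellow and red bounds are kept equal to the warning and critical alert thresholds, a red panel and a page describe the same condition.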
The Four Essential Dashboards
Dashboard 1: Service Overview (Golden Signals)
Every service needs a dashboard showing the Four Golden Signals (from Part 2). This is the Level 1 dashboard that on-call engineers look at first.
# Grafana dashboard panels (pseudo-config showing PromQL queries)

# Row 1: Stat panels (instant status)
- title: "Request Rate"
  type: stat
  query: sum(rate(http_requests_total{service="$service"}[5m]))
  unit: "req/s"
  thresholds: { green: 0, yellow: 1000, red: 5000 }
- title: "Error Rate"
  type: stat
  query: |
    sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="$service"}[5m])) * 100
  unit: "%"
  thresholds: { green: 0, yellow: 1, red: 5 }
- title: "p99 Latency"
  type: stat
  query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
  unit: "s"
  thresholds: { green: 0, yellow: 0.5, red: 2 }

# Row 2: Time series (trends)
- title: "Request Rate Over Time"
  type: timeseries
  query: sum(rate(http_requests_total{service="$service"}[5m])) by (method, route)
- title: "Error Rate Over Time"
  type: timeseries
  query: sum(rate(http_requests_total{service="$service",status=~"5.."}[5m])) by (route)
- title: "Latency Distribution Over Time"
  type: timeseries
  queries:
    - histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
    - histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
    - histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
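The latency panels rely on histogram_quantile, which estimates a quantile from cumulative bucket counts rather than from raw request timings. The Python sketch below illustrates the linear-interpolation idea under simplifying assumptions: buckets are sorted by upper bound, the last bucket is +Inf, and the first bucket starts at 0. The function name and bucket counts are made up, and this is not Prometheus's implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the open-ended bucket: report its lower edge.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Made-up counts for 1,000 requests: 900 finished within 0.1s, 980 within 0.5s, ...
buckets = [(0.1, 900), (0.5, 980), (2.0, 998), (math.inf, 1000)]

p99 = estimate_quantile(0.99, buckets)
print(f"estimated p99: {p99:.2f}s")
```

Because the estimate can only land between a bucket's bounds, placing bucket boundaries near the panel thresholds (0.5 s and 2 s in the p99 panel) makes the stat more trustworthy.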
Dashboard 2: Infrastructure
Shows resource consumption across compute, memory, disk, and network. Critical for capacity planning and detecting resource saturation before it causes user-facing impact.
Key panels: CPU usage per pod/node, memory usage vs limits, disk I/O and space, network throughput, container restart counts, pod scheduling failures.
Dashboard 3: SLO Burn Rate
Shows how fast each SLO is consuming its error budget. This is the most important operational dashboard — it answers "Are we on track to meet our reliability targets?"
# SLO burn rate queries
# 1h burn rate (fast burn — detecting acute incidents)
sum(rate(http_requests_total{service="$service",status=~"5.."}[1h]))
/ sum(rate(http_requests_total{service="$service"}[1h]))
/ (1 - 0.999) # SLO target = 99.9%
# 6h burn rate (slow burn — detecting gradual degradation)
sum(rate(http_requests_total{service="$service",status=~"5.."}[6h]))
/ sum(rate(http_requests_total{service="$service"}[6h]))
/ (1 - 0.999)
# Error budget remaining (30-day window)
1 - (
sum(increase(http_requests_total{service="$service",status=~"5.."}[30d]))
/ sum(increase(http_requests_total{service="$service"}[30d]))
) / (1 - 0.999)
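The arithmetic behind these queries is easier to check with concrete numbers. The Python sketch below mirrors the same formulas; the traffic figures and function names are invented for illustration only.

```python
# Worked arithmetic for burn rate and remaining error budget.
# All traffic numbers below are made up for illustration.

SLO_TARGET = 0.999                 # 99.9% availability, as in the queries above
ERROR_BUDGET = 1 - SLO_TARGET      # fraction of requests allowed to fail


def burn_rate(errors: float, total: float) -> float:
    """Observed error ratio divided by the error budget."""
    return (errors / total) / ERROR_BUDGET


def budget_remaining(errors: float, total: float) -> float:
    """Fraction of the error budget left over the measured window."""
    return 1 - burn_rate(errors, total)


# Last hour: 120 failed requests out of 20,000.
fast_burn = burn_rate(errors=120, total=20_000)

# Last 30 days: 4,000 failed requests out of 10,000,000.
remaining = budget_remaining(errors=4_000, total=10_000_000)

print(f"1h burn rate: {fast_burn:.1f}x")
print(f"30d budget remaining: {remaining:.0%}")
```

A burn rate of 1 means the budget runs out exactly at the end of the SLO window. The 6x rate in this example would exhaust a 30-day budget in 5 days if it continued.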
Dashboard 4: Business Metrics
Technical metrics exist to serve business outcomes. This dashboard bridges the gap:
- Revenue per minute: Correlate with deployment annotations to catch revenue-impacting bugs
- Checkout completion rate: Detect conversion drops from latency or error spikes
- User signups per hour: Detect registration flow failures
- API usage by customer tier: Detect customer-specific issues
Alerting Strategy
What to Alert On (and What NOT To)
| Alert On (Page-Worthy) | Don't Alert On (Dashboard Only) |
|---|---|
| SLO burn rate > threshold | Individual container restart |
| Error rate > 5% for 5+ minutes | CPU usage > 80% (use saturation alert instead) |
| p99 latency > 2s for 5+ minutes | Single failed health check |
| Zero request rate (complete outage) | Disk usage > 70% (alert at 85%+) |
| Certificate expiry < 7 days | Memory usage fluctuations |
| Database replication lag > 30s | Individual node going down (if HA) |
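The "for 5+ minutes" qualifier in the table is what separates a page-worthy condition from a transient blip. Below is a minimal Python sketch of the idea, assuming one sample per minute and made-up data. The function name sustained_breach is illustrative; in Prometheus alerting rules the "for" field plays this role.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if samples exceed threshold for min_consecutive samples in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Error rate in percent, one sample per minute (made-up data).
brief_spike = [1, 2, 9, 8, 1, 1, 1, 1]
real_incident = [1, 6, 7, 9, 8, 7, 6, 1]

print(sustained_breach(brief_spike, threshold=5, min_consecutive=5))    # False
print(sustained_breach(real_incident, threshold=5, min_consecutive=5))  # True
```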
Alert Severity Levels
| Severity | Response Time | Notification | Example |
|---|---|---|---|
| P1 — Critical | Immediate (wake people up) | Phone call + SMS + Slack | Complete service outage, data loss risk |
| P2 — High | Within 30 minutes | SMS + Slack | Significant error rate spike, SLO breach imminent |
| P3 — Warning | Within 4 hours (business hours) | Slack only | Elevated latency, disk approaching capacity |
| P4 — Info | Next business day | Ticket / dashboard | Certificate renewal needed, capacity planning |
Alertmanager Configuration
Prometheus Alertmanager handles alert deduplication, grouping, inhibition, silencing, and routing. It is the central brain that decides which human receives which alert, and how.
# alertmanager.yml: production routing configuration
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/XXXXX'

# Inhibition rules: suppress lower-severity alerts when higher ones fire
inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: [service, environment]   # Only within the same service + environment

# Routing tree. Sibling routes are evaluated in order and the first match wins
# (unless continue: true is set), so team routes are listed before severity routes.
route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 30s       # Wait 30s to batch the first notification for a new group
  group_interval: 5m    # Wait 5m before notifying about new alerts added to a group
  repeat_interval: 4h   # Re-send every 4h while the alert is still firing
  receiver: 'slack-warnings'   # Default: anything unmatched goes to Slack
  routes:
    # Team-specific routing
    - matchers: [team = "payments"]
      receiver: 'slack-payments-team'
      routes:
        - matchers: [severity = "critical"]
          receiver: 'pagerduty-payments'
    # P1: Critical → PagerDuty (wake people up)
    - matchers: [severity = "critical"]
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 15m
    # P2: High → PagerDuty (sent with severity: error)
    - matchers: [severity = "high"]
      receiver: 'pagerduty-high'
      repeat_interval: 1h
    # P3: Warning → Slack channel only
    - matchers: [severity = "warning"]
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: ''
        severity: critical
  - name: 'pagerduty-high'
    pagerduty_configs:
      - routing_key: ''
        severity: error
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }} — {{ .GroupLabels.service }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'slack-payments-team'
    slack_configs:
      - channel: '#team-payments-alerts'
        send_resolved: true
  - name: 'pagerduty-payments'
    pagerduty_configs:
      - routing_key: ''
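Alertmanager evaluates sibling routes in order and stops at the first match unless continue: true is set, so the order of routes changes who gets notified. The Python sketch below is a deliberately simplified model of that first-match walk (equality matchers only, no continue), not Alertmanager's code. The Route class and resolve function are hypothetical names.

```python
# Simplified model of first-match routing; not Alertmanager's actual code.
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)    # label -> required value
    routes: list = field(default_factory=list)   # child routes, in order

    def matches(self, labels: dict) -> bool:
        return all(labels.get(k) == v for k, v in self.match.items())


def resolve(route: Route, labels: dict) -> str:
    """Follow the first matching child at each level; return its receiver."""
    for child in route.routes:
        if child.matches(labels):
            return resolve(child, labels)
    return route.receiver


payments = Route("slack-payments-team", {"team": "payments"}, [
    Route("pagerduty-payments", {"severity": "critical"}),
])
by_severity = [
    Route("pagerduty-critical", {"severity": "critical"}),
    Route("pagerduty-high", {"severity": "high"}),
    Route("slack-warnings", {"severity": "warning"}),
]

alert = {"team": "payments", "severity": "critical"}

severity_first = Route("slack-warnings", routes=by_severity + [payments])
team_first = Route("slack-warnings", routes=[payments] + by_severity)

print(resolve(severity_first, alert))  # pagerduty-critical
print(resolve(team_first, alert))      # pagerduty-payments
```

In this model, a critical payments alert reaches the payments on-call only when the team route is listed ahead of the generic severity routes.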
Fighting Alert Fatigue
Alert fatigue is when engineers stop responding to alerts because they fire too often, are too noisy, or are not actionable. It is the single biggest failure mode in alerting — worse than no alerts at all, because it creates a false sense of safety.
The Alert Fatigue Cycle
Google's SRE guidance suggests capping on-call load at about two incidents per 12-hour shift, since each incident needs time for investigation, mitigation, and follow-up. Teams paged far more often than that tend to see response quality degrade, and eventually alerts start being ignored. The cycle looks like this:
- Too many alerts fire → engineer cannot investigate all of them
- Engineer learns that most alerts resolve themselves → starts ignoring alerts
- A real incident occurs → alert is ignored or response is delayed
- Incident escalates → outage impacts users → post-mortem identifies "alert fatigue" as a factor
Strategies to fight alert fatigue:
- Regular alert review: Every sprint, review all alerts that fired. Delete or tune alerts with >50% false positive rate
- SLO-based alerting: Alert on error budget burn rate instead of individual metric thresholds. One burn-rate alert per SLO replaces many overlapping per-metric alerts
- Inhibition rules: When a critical alert fires, suppress all related warning alerts (configured in Alertmanager)
- Grouping: Group related alerts into a single notification (the Alertmanager config above groups by alertname, service, and environment)
- Runbook links: Every alert must link to a runbook explaining what to do — if you cannot write a runbook, the alert may not be actionable
- Ownership: Every alert must have a clear owner team — alerts with no owner get ignored
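The sprint review in the first strategy can be partly automated. The Python sketch below assumes you can export a list of fired alerts with a flag for whether each one was actionable. The data, the noisy_alerts function, and the export format are all hypothetical.

```python
from collections import Counter

# Hypothetical export of one sprint's alert history: (alert name, was it actionable?)
fired = [
    ("HighErrorRate", True),
    ("HighErrorRate", True),
    ("DiskUsageHigh", False),
    ("DiskUsageHigh", False),
    ("DiskUsageHigh", True),
    ("PodRestarted", False),
]

FALSE_POSITIVE_LIMIT = 0.5  # the >50% rule from the review guideline


def noisy_alerts(history, limit=FALSE_POSITIVE_LIMIT):
    """Return {alert name: false positive rate} for alerts above the limit."""
    total = Counter(name for name, _ in history)
    false_pos = Counter(name for name, actionable in history if not actionable)
    return {
        name: false_pos[name] / total[name]
        for name in total
        if false_pos[name] / total[name] > limit
    }


for name, rate in sorted(noisy_alerts(fired).items()):
    print(f"{name}: {rate:.0%} false positives, tune or delete")
```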
On-Call & Incident Routing
flowchart TD
A[Prometheus Alert Rule\nFires based on PromQL] --> B[Alertmanager\nRoute + Deduplicate + Group]
B --> C{Severity?}
C -->|P1/P2| D[PagerDuty / Opsgenie\nPage on-call engineer]
C -->|P3| E[Slack Channel\nNotify team]
C -->|P4| F[Ticket System\nJira / Linear]
D --> G[On-Call Engineer\nAcknowledge within SLA]
G --> H[Investigate\nDashboards → Traces → Logs]
H --> I[Mitigate\nRollback / Scale / Hotfix]
I --> J[Post-Incident Review\nRunbook update + alert tuning]
Conclusion & Next Steps
Dashboards and alerts are the human interface to your observability data. Key takeaways from Part 7:
- Dashboard hierarchy: Build four levels from executive overview to debug detail — each with a clear audience and purpose
- Four essential dashboards: Service Golden Signals, Infrastructure, SLO Burn Rate, and Business Metrics
- SLO-based alerting on error budget burn rate is more effective than individual metric thresholds
- Alert severity determines notification channel — P1 wakes people up, P4 creates a ticket
- Alertmanager handles routing, grouping, inhibition, and silencing — configure it carefully
- Alert fatigue kills reliability — review alerts regularly, require runbooks, enforce ownership