Part 7: Visualization & Alerting

Dashboard Design Principles

A dashboard is not a place to show every metric you can collect. A good dashboard answers a specific question for a specific audience in under 10 seconds. Bad dashboards are information dumps — walls of graphs that nobody looks at because nothing stands out.

The Dashboard Hierarchy

Organise dashboards in layers, from high-level overviews to detailed drill-downs:

Dashboard Hierarchy — Drill-Down Model

flowchart TD
    A[Level 0: Executive Overview\n3-5 SLO gauges, overall health] -->|Click SLO| B[Level 1: Service Overview\nGolden signals per service]
    B -->|Click service| C[Level 2: Service Detail\nEndpoint-level metrics, error breakdown]
    C -->|Click anomaly| D[Level 3: Debug\nTraces, logs, infrastructure metrics]

Level	Audience	Purpose	Refresh Rate
Level 0	Executives, on-call	"Is anything on fire?"	30s
Level 1	On-call, SREs	"Which service has a problem?"	15s
Level 2	Service owners	"What endpoint/operation is failing?"	10s
Level 3	Engineers debugging	"Show me the traces and logs for this error"	5s

Grafana Dashboard Best Practices

Use variables: Template dashboards with $service, $environment, $namespace variables so one dashboard works for all services
Left-to-right flow: Place request rate → error rate → latency → saturation from left to right (follows the narrative: "How much traffic? How many errors? How fast? How full?")
Stat panels first: Put single-stat panels at the top for instant status; time series below for trends
Consistent units: Always label axes with units (ms, req/s, %, bytes). Never leave a graph unlabelled
Threshold colours: Green → Yellow → Red thresholds on stat panels matching alert severity levels
Annotations: Mark deployment events on time-series panels so you can correlate metrics changes with releases
Links: Every panel should link to the drill-down dashboard or relevant log/trace query
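
The threshold-colour rule can be sketched in a few lines of Python. This is a minimal illustration, not Grafana's implementation: it assumes thresholds are (lower bound, colour) steps and that a value takes the colour of the highest step it has reached. The function name and the sample steps are hypothetical.

```python
from bisect import bisect_right

def threshold_colour(value, steps):
    """Return the colour of the highest step whose lower bound <= value.

    steps: list of (lower_bound, colour), sorted by lower_bound ascending.
    """
    bounds = [bound for bound, _ in steps]
    index = bisect_right(bounds, value) - 1
    if index < 0:
        # Value is below the first step: fall back to the first colour.
        index = 0
    return steps[index][1]

# Hypothetical error-rate steps in percent: green from 0, yellow from 1, red from 5.
ERROR_RATE_STEPS = [(0, "green"), (1, "yellow"), (5, "red")]

print(threshold_colour(0.4, ERROR_RATE_STEPS))  # green
print(threshold_colour(1.0, ERROR_RATE_STEPS))  # yellow
print(threshold_colour(7.2, ERROR_RATE_STEPS))  # red
```

Keeping the steps in one place like this makes it easier to check that panel colours and alert severity thresholds use the same numbers.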

The Four Essential Dashboards

Dashboard 1: Service Overview (Golden Signals)

Every service needs a dashboard showing the Four Golden Signals (from Part 2). This is the Level 1 dashboard that on-call engineers look at first.

# Grafana dashboard panels (pseudo-config showing PromQL queries)

# Row 1: Stat panels (instant status)
- title: "Request Rate"
  type: stat
  query: sum(rate(http_requests_total{service="$service"}[5m]))
  unit: "req/s"
  thresholds: { green: 0, yellow: 1000, red: 5000 }

- title: "Error Rate"
  type: stat
  query: |
    sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="$service"}[5m])) * 100
  unit: "%"
  thresholds: { green: 0, yellow: 1, red: 5 }

- title: "p99 Latency"
  type: stat
  query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
  unit: "s"
  thresholds: { green: 0, yellow: 0.5, red: 2 }

# Row 2: Time series (trends)
- title: "Request Rate Over Time"
  type: timeseries
  query: sum(rate(http_requests_total{service="$service"}[5m])) by (method, route)

- title: "5xx Errors Over Time (per route)"
  type: timeseries
  query: sum(rate(http_requests_total{service="$service",status=~"5.."}[5m])) by (route)
  unit: "req/s"

- title: "Latency Distribution Over Time"
  type: timeseries
  queries:
    - histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
    - histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
    - histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
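
To make the latency panels easier to interpret, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general idea behind histogram_quantile. It is not Prometheus's actual implementation, and the bucket boundaries and counts are made-up sample data.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with (math.inf, total_count), like Prometheus `le` buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound

# Hypothetical request-duration buckets (seconds) with cumulative counts.
SAMPLE = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

print(estimate_quantile(0.50, SAMPLE))  # falls exactly at the 0.1 s boundary
print(estimate_quantile(0.95, SAMPLE))  # interpolated inside the 0.5-1.0 s bucket
print(estimate_quantile(0.99, SAMPLE))  # capped at 1.0 s, the highest finite bound
```

The last case is worth noting when choosing bucket boundaries: if the slowest requests land in the +Inf bucket, the reported p99 cannot exceed the largest finite bucket, so the panel understates tail latency.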

Dashboard 2: Infrastructure

Shows resource consumption across compute, memory, disk, and network. Critical for capacity planning and detecting resource saturation before it causes user-facing impact.

Key panels: CPU usage per pod/node, memory usage vs limits, disk I/O and space, network throughput, container restart counts, pod scheduling failures.

Dashboard 3: SLO Burn Rate

Shows how fast each SLO is consuming its error budget. This is the most important operational dashboard — it answers "Are we on track to meet our reliability targets?"

# SLO burn rate queries
# 1h burn rate (fast burn — detecting acute incidents)
sum(rate(http_requests_total{service="$service",status=~"5.."}[1h]))
/ sum(rate(http_requests_total{service="$service"}[1h]))
/ (1 - 0.999)  # SLO target = 99.9%

# 6h burn rate (slow burn — detecting gradual degradation)
sum(rate(http_requests_total{service="$service",status=~"5.."}[6h]))
/ sum(rate(http_requests_total{service="$service"}[6h]))
/ (1 - 0.999)

# Error budget remaining (30-day window)
1 - (
  sum(increase(http_requests_total{service="$service",status=~"5.."}[30d]))
  / sum(increase(http_requests_total{service="$service"}[30d]))
) / (1 - 0.999)
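
The arithmetic behind these queries can be checked with a small Python sketch. The function names and request counts are illustrative assumptions; treating a window with no traffic as a burn rate of 0 is also an assumption, not something the queries above do (they would return no usable value).

```python
def burn_rate(errors, total, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # No traffic: treated as no budget consumed (an assumption).
    allowed = 1 - slo_target
    return (errors / total) / allowed

def budget_remaining(errors, total, slo_target):
    """Fraction of error budget left, for counts covering the full SLO window."""
    return 1 - burn_rate(errors, total, slo_target)

# Hypothetical 1h window: 1,000,000 requests, 14,400 of them 5xx.
print(round(burn_rate(14_400, 1_000_000, 0.999), 2))          # 14.4

# Hypothetical 30-day window: 50,000,000 requests, 20,000 of them 5xx.
print(round(budget_remaining(20_000, 50_000_000, 0.999), 2))  # 0.6
```

A negative result from budget_remaining means the budget for the window is already overspent.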

                            
Burn Rate Interpretation: A burn rate of 1x means you are consuming error budget at exactly the rate that would exhaust it at the end of the SLO window (e.g., 30 days). A burn rate of 14.4x consumes 2% of a 30-day error budget in one hour and would exhaust the entire budget in about 50 hours (720 h / 14.4), so it is a critical incident. Alert on multiple windows: page on a fast burn (14.4x over 1h) or on a slow burn (6x over 6h).
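
As a worked check of these numbers, the sketch below converts a burn rate into time-to-exhaustion and budget consumed, assuming a constant burn rate and a 30-day window. The function names are hypothetical, and should_page reflects one reading of the multi-window rule (page if either window exceeds its threshold).

```python
def hours_to_exhaust(burn, window_days=30):
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn

def budget_consumed(burn, hours, window_days=30):
    """Fraction of the window's budget consumed after `hours` at this burn rate."""
    return burn * hours / (window_days * 24)

def should_page(burn_1h, burn_6h, fast_threshold=14.4, slow_threshold=6.0):
    """Page when either the fast-burn or the slow-burn condition holds."""
    return burn_1h >= fast_threshold or burn_6h >= slow_threshold

print(round(hours_to_exhaust(1.0)))         # 720 hours, i.e. the full 30 days
print(round(hours_to_exhaust(14.4)))        # 50 hours
print(round(budget_consumed(14.4, 1), 3))   # 0.02 -> 2% of the budget in 1 hour
print(round(budget_consumed(6.0, 6), 3))    # 0.05 -> 5% of the budget in 6 hours
print(should_page(burn_1h=3.0, burn_6h=7.5))  # True: slow burn only
```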
                        

Dashboard 4: Business Metrics

Technical metrics exist to serve business outcomes. This dashboard bridges the gap:

Revenue per minute: Correlate with deployment annotations to catch revenue-impacting bugs
Checkout completion rate: Detect conversion drops from latency or error spikes
User signups per hour: Detect registration flow failures
API usage by customer tier: Detect customer-specific issues

Alerting Strategy

What to Alert On (and What NOT To)

                            
The Golden Rule of Alerting: Every alert that pages a human should be actionable, urgent, and represent a real problem affecting users. If an alert fires and the correct response is "wait and see" or "nothing to do", it should not be a page; it should be a warning on a dashboard.
                        

Alert On (Page-Worthy)	Don't Alert On (Dashboard Only)
SLO burn rate > threshold	Individual container restart
Error rate > 5% for 5+ minutes	CPU usage > 80% (use saturation alert instead)
p99 latency > 2s for 5+ minutes	Single failed health check
Zero request rate (complete outage)	Disk usage > 70% (alert at 85%+)
Certificate expiry < 7 days	Memory usage fluctuations
Database replication lag > 30s	Individual node going down (if HA)
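
The "for 5+ minutes" qualifier in the left column is what separates a page-worthy condition from a transient spike. The Python sketch below shows the idea, similar in spirit to the `for:` clause of a Prometheus alerting rule but not its implementation; the function name and sample data are hypothetical.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """Return True if value > threshold held continuously for min_duration_s.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # Condition cleared: reset the timer.
    return False

# Hypothetical error-rate samples (percent), one per minute.
spike = [(i * 60, v) for i, v in enumerate([6, 7, 2, 1, 1, 1])]
outage = [(i * 60, v) for i, v in enumerate([6, 7, 2, 6, 6, 6, 6, 6, 6])]

print(sustained_breach(spike, threshold=5, min_duration_s=300))   # False
print(sustained_breach(outage, threshold=5, min_duration_s=300))  # True
```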

Alert Severity Levels

Severity	Response Time	Notification	Example
P1 — Critical	Immediate (wake people up)	Phone call + SMS + Slack	Complete service outage, data loss risk
P2 — High	Within 30 minutes	SMS + Slack	Significant error rate spike, SLO breach imminent
P3 — Warning	Within 4 hours (business hours)	Slack only	Elevated latency, disk approaching capacity
P4 — Info	Next business day	Ticket / dashboard	Certificate renewal needed, capacity planning

Alertmanager Configuration

Prometheus Alertmanager handles alert deduplication, grouping, inhibition, silencing, and routing. It is the central brain that decides which human receives which alert, and how.

# alertmanager.yml — Production routing configuration
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/XXXXX'

# Inhibition rules — suppress lower-severity alerts when higher ones fire
inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: [service, environment]  # Same service + environment

# Routing tree
route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 30s        # Wait 30s before the first notification so related alerts are batched
  group_interval: 5m     # Wait 5m before notifying about new alerts added to an existing group
  repeat_interval: 4h    # Re-send every 4h while the alert is still firing
  receiver: 'slack-warnings'  # Default: non-critical to Slack

  routes:
    # Team-specific routing comes first: sibling routes are evaluated in order,
    # and matching stops at the first match unless `continue: true` is set
    - matchers: [team = "payments"]
      receiver: 'slack-payments-team'
      routes:
        - matchers: [severity = "critical"]
          receiver: 'pagerduty-payments'

    # P1: Critical → PagerDuty (wake people up)
    - matchers: [severity = "critical"]
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 15m

    # P2: High → PagerDuty (lower urgency than P1)
    - matchers: [severity = "high"]
      receiver: 'pagerduty-high'
      repeat_interval: 1h

    # P3: Warning → Slack channel only
    - matchers: [severity = "warning"]
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: ''
        severity: critical

  - name: 'pagerduty-high'
    pagerduty_configs:
      - routing_key: ''
        severity: error

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }} — {{ .GroupLabels.service }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'slack-payments-team'
    slack_configs:
      - channel: '#team-payments-alerts'
        send_resolved: true

  - name: 'pagerduty-payments'
    pagerduty_configs:
      - routing_key: ''
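
To reason about which receiver an alert ends up with, it helps to model the routing tree. The Python sketch below is a simplified model under stated assumptions: matchers are reduced to exact label equality, and the Route class and resolve function are hypothetical names, not part of Alertmanager. It follows the documented behaviour that sibling routes are checked in order and matching stops at the first match unless `continue: true` is set.

```python
from dataclasses import dataclass, field

@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label -> required value
    routes: list = field(default_factory=list)  # child routes, in order
    continue_matching: bool = False             # models `continue: true`

def matches(route, labels):
    return all(labels.get(k) == v for k, v in route.match.items())

def resolve(route, labels):
    """Return the receivers for an alert, walking child routes in order."""
    receivers = []
    for child in route.routes:
        if matches(child, labels):
            receivers.extend(resolve(child, labels))
            if not child.continue_matching:
                break
    # No child matched: this node handles the alert itself.
    return receivers or [route.receiver]

payments = Route("slack-payments-team", {"team": "payments"},
                 routes=[Route("pagerduty-payments", {"severity": "critical"})])
critical = Route("pagerduty-critical", {"severity": "critical"})

severity_first = Route("slack-warnings", routes=[critical, payments])
team_first = Route("slack-warnings", routes=[payments, critical])

alert = {"team": "payments", "severity": "critical"}
print(resolve(severity_first, alert))  # ['pagerduty-critical']
print(resolve(team_first, alert))      # ['pagerduty-payments']
```

The two trees contain the same routes, but only the team-first ordering lets a critical payments alert reach the payments on-call receiver. Sibling order is therefore part of the configuration's meaning.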

Fighting Alert Fatigue

Alert fatigue is when engineers stop responding to alerts because they fire too often, are too noisy, or are not actionable. It is the single biggest failure mode in alerting — worse than no alerts at all, because it creates a false sense of safety.

Industry Research

The Alert Fatigue Cycle

Studies from Google SRE and PagerDuty show that teams receiving more than 2 actionable pages per on-call shift (12 hours) experience degraded response quality. Above 5 pages per shift, engineers begin ignoring alerts entirely.

Too many alerts fire → engineer cannot investigate all of them
Engineer learns that most alerts resolve themselves → starts ignoring alerts
A real incident occurs → alert is ignored or response is delayed
Incident escalates → outage impacts users → post-mortem identifies "alert fatigue" as a factor

Strategies to fight alert fatigue:

Regular alert review: Every sprint, review all alerts that fired. Delete or tune alerts with >50% false positive rate
SLO-based alerting: Alert on error budget burn rate instead of individual metric thresholds — this naturally deduplicates
Inhibition rules: When a critical alert fires, suppress all related warning alerts (configured in Alertmanager)
Grouping: Group related alerts into a single notification (Alertmanager groups by service + alertname)
Runbook links: Every alert must link to a runbook explaining what to do — if you cannot write a runbook, the alert may not be actionable
Ownership: Every alert must have a clear owner team — alerts with no owner get ignored
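
The inhibition strategy in the list above can be sketched in Python to show what "related" means in practice: a warning is muted only when a critical alert is firing with the same values for the labels listed under `equal`. This is a simplified model with exact-match labels; the function name and alert names are hypothetical.

```python
def inhibited(target, firing, source_match, target_match, equal):
    """True if `target` is muted by any firing alert under one inhibit rule."""
    if not all(target.get(k) == v for k, v in target_match.items()):
        return False  # The rule does not apply to this alert at all.
    for source in firing:
        if source is target:
            continue
        is_source = all(source.get(k) == v for k, v in source_match.items())
        same_scope = all(source.get(k) == target.get(k) for k in equal)
        if is_source and same_scope:
            return True
    return False

RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["service", "environment"],
}

firing = [
    {"alertname": "HighErrorRate", "severity": "critical",
     "service": "checkout", "environment": "prod"},
    {"alertname": "ElevatedLatency", "severity": "warning",
     "service": "checkout", "environment": "prod"},
    {"alertname": "ElevatedLatency", "severity": "warning",
     "service": "search", "environment": "prod"},
]

to_notify = [a for a in firing if not inhibited(a, firing, **RULE)]
for alert in to_notify:
    print(alert["alertname"], alert["service"])
# HighErrorRate checkout
# ElevatedLatency search
```

The checkout warning is suppressed because the checkout critical alert already tells the on-call engineer where to look. The search warning still goes out because it belongs to a different service.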

On-Call & Incident Routing

Alert → Incident → Resolution Pipeline

flowchart TD
    A[Prometheus Alert Rule\nFires based on PromQL] --> B[Alertmanager\nRoute + Deduplicate + Group]
    B --> C{Severity?}
    C -->|P1/P2| D[PagerDuty / Opsgenie\nPage on-call engineer]
    C -->|P3| E[Slack Channel\nNotify team]
    C -->|P4| F[Ticket System\nJira / Linear]
    D --> G[On-Call Engineer\nAcknowledge within SLA]
    G --> H[Investigate\nDashboards → Traces → Logs]
    H --> I[Mitigate\nRollback / Scale / Hotfix]
    I --> J[Post-Incident Review\nRunbook update + alert tuning]

                            
Runbook Template: Every paging alert should link to a runbook with: (1) What this alert means, (2) Likely causes, (3) Investigation steps (specific dashboard links, log queries), (4) Mitigation actions (rollback commands, scaling procedures), (5) Escalation path if unresolved in 30 minutes.
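
The runbook and ownership requirements are easy to enforce mechanically, for example in CI. The Python sketch below checks alert rules that have already been loaded into dictionaries shaped like Prometheus alerting rules (`alert`, `labels`, `annotations`). The `runbook_url` annotation and `team` label are common conventions assumed here, not something Prometheus requires, and the rule names and URL are hypothetical.

```python
PAGING_SEVERITIES = {"critical", "high"}

def lint_alert_rule(rule):
    """Return a list of problems found in one alert rule (a dict)."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if labels.get("severity") in PAGING_SEVERITIES:
        if not annotations.get("runbook_url"):
            problems.append("paging alert has no runbook_url annotation")
    if not labels.get("team"):
        problems.append("alert has no owning team label")
    return problems

rules = [
    {"alert": "HighErrorRate",
     "labels": {"severity": "critical", "team": "payments"},
     "annotations": {"runbook_url": "https://runbooks.example.com/high-error-rate"}},
    {"alert": "ElevatedLatency",
     "labels": {"severity": "high"},
     "annotations": {}},
]

for rule in rules:
    for problem in lint_alert_rule(rule):
        print(f"{rule['alert']}: {problem}")
# ElevatedLatency: paging alert has no runbook_url annotation
# ElevatedLatency: alert has no owning team label
```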
                        

Conclusion & Next Steps

Dashboards and alerts are the human interface to your observability data. Key takeaways from Part 7:

Dashboard hierarchy: Build four levels from executive overview to debug detail — each with a clear audience and purpose
Four essential dashboards: Service Golden Signals, Infrastructure, SLO Burn Rate, and Business Metrics
SLO-based alerting on error budget burn rate is more effective than individual metric thresholds
Alert severity determines notification channel — P1 wakes people up, P4 creates a ticket
Alertmanager handles routing, grouping, inhibition, and silencing — configure it carefully
Alert fatigue kills reliability — review alerts regularly, require runbooks, enforce ownership
