Grafana Deep Dive Part 15: Troubleshooting & Production Best Practices

Systematic Troubleshooting

Metrics → Logs → Traces Workflow

The most effective troubleshooting follows a systematic narrowing pattern: start broad with metrics, narrow with logs, pinpoint with traces, and reach for profiles when you still need to know why the code is slow. This is the “golden path” through the LGTM stack:

Systematic Debugging Workflow

flowchart TD
    A["Alert Fires<br/>Error rate > 5%"]
    M["Metrics: WHICH service?<br/>Dashboard shows order-service"]
    L["Logs: WHAT errors?<br/>LogQL: DatabaseException in order-service"]
    T["Traces: WHERE in the request?<br/>TraceQL: order-service → waiting for database connection"]
    P["Profiles: WHY is it slow?<br/>Pyroscope: connection pool exhausted"]
    F["Fix: Increase pool size<br/>Deploy → validate metrics recovered"]
    A --> M --> L --> T --> P --> F

# Step 1: Metrics — Identify the affected service
# PromQL: Which service has elevated error rate?
topk(5,
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
  * 100
)
# Result: order-service at 12% error rate

# Step 2: Logs — What errors are occurring?
# LogQL: Filter to the affected service and time window
{service="order-service"} | json | level="error"
  | line_format "{{.error_type}}: {{.message}}"
# Result: "DatabaseException: Connection pool exhausted (max: 10, active: 10)"

# Step 3: Traces — Find an affected request
# TraceQL: Find slow traces with errors in order-service
{resource.service.name="order-service" && status=error && duration>2s}
# Result: Trace shows 8s waiting for database connection

# Step 4: Profiles — Why is the pool exhausted?
# Pyroscope: Check goroutine/thread profile for order-service
# Profile type: goroutines (Go) or threads (Java)
# Result: 200 goroutines blocked on sql.DB.conn() — pool too small for load

# Step 5: Fix and validate
# Increase pool: max_open_conns: 50
# Monitor: error rate drops to 0.1% within 2 minutes
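
To make Step 1 concrete, the ratio that the PromQL query computes can be mirrored in a few lines of plain code. This is a minimal sketch, assuming the per-service request rates have already been fetched; the function name `top_error_rates` and the sample numbers are hypothetical and chosen only to match the 12% result above.

```python
from typing import Dict, List, Tuple


def top_error_rates(
    error_rps: Dict[str, float],
    total_rps: Dict[str, float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k services with the highest 5xx percentage.

    Services missing from error_rps are treated as having 0 errors.
    Services with zero traffic are skipped to avoid dividing by zero.
    """
    ratios = []
    for service, total in total_rps.items():
        if total <= 0:
            continue
        errors = error_rps.get(service, 0.0)
        ratios.append((service, errors / total * 100))
    # Highest error percentage first, like topk()
    ratios.sort(key=lambda item: item[1], reverse=True)
    return ratios[:k]


# Hypothetical sample rates (requests per second)
errors = {"order-service": 6.0, "cart-service": 0.1}
totals = {"order-service": 50.0, "cart-service": 40.0, "idle-service": 0.0}
ranked = top_error_rates(errors, totals)
```

With these inputs, `order-service` ranks first at roughly 12%, which is the signal that sends you to its logs in Step 2.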

Common Failure Scenarios

Common Failure Patterns & Resolution

Symptom	First Signal	Root Cause Pattern	Resolution
Latency spike (all services)	Metrics: p99 jump	Shared dependency (DB, cache, DNS)	Check infrastructure metrics for the shared resource
Error rate spike (one service)	Metrics: 5xx count	Bad deploy, config change, dependency failure	Check deployment annotations, then downstream service health
Gradual memory growth	Metrics: container memory	Memory leak, unbounded cache, goroutine leak	Pyroscope inuse_space profile over 4+ hours
Intermittent timeouts	Traces: long spans	Connection pool exhaustion, network issues, GC pauses	Correlate with GC metrics and pool utilization
Data inconsistency	Logs: warning messages	Race condition, retry storms, eventual consistency lag	Trace the specific failing requests end-to-end
Alert storm	Multiple alerts fire	Cascading failure from upstream dependency	Find the root service using dependency graph in traces

Grafana Debugging Tools

Grafana provides several built-in tools for efficient debugging:

Explore mode — ad-hoc queries across all data sources without saving dashboards
Split view — compare metrics on the left with logs or traces on the right
Trace-to-logs — click a trace span to jump directly to correlated logs in Loki
Logs-to-traces — click a traceID in log lines to open the full trace in Tempo
Exemplars — click data points on metric graphs to jump to specific trace exemplars
Flame graph — drill into Pyroscope profiles from a specific time range

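# Enable trace-to-logs, trace-to-metrics, and trace-to-profiles correlation
# via Grafana data source provisioning (Tempo data source settings).
# Note: newer Grafana releases also accept a tracesToLogsV2 key; check the
# Tempo data source docs for the version you run.
apiVersion: 1
datasources: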
  - name: Tempo
    type: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki-uid
        tags: ['service.name', 'k8s.pod.name']
        mappedTags: [{ key: 'service.name', value: 'service' }]
        mapTagNamesEnabled: true
        spanStartTimeShift: '-1m'
        spanEndTimeShift: '1m'
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: mimir-uid
        tags: [{ key: 'service.name', value: 'service' }]
        queries:
          - name: 'Request Rate'
            query: 'sum(rate(http_requests_total{$$__tags}[5m]))'
          - name: 'Error Rate'
            query: 'sum(rate(http_requests_total{status=~"5..",$$__tags}[5m]))'
      tracesToProfiles:
        datasourceUid: pyroscope-uid
        tags: [{ key: 'service.name', value: 'service_name' }]
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'

Production Anti-Patterns

Alert Fatigue

                            
Anti-Pattern: Alerting on every metric exceeding a threshold. Teams receive 50+ alerts daily, learn to ignore them, and miss critical issues buried in noise.

Solution: Alert on symptoms (user impact), not causes (CPU usage). Require several conditions to hold at once, such as a high error ratio and enough traffic for that ratio to be meaningful:

# BAD: Alert on CPU (cause, not symptom)
- alert: HighCPU
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  # This fires constantly and rarely indicates user impact

# GOOD: Alert on user-facing SLI degradation
- alert: OrderServiceDegraded
  expr: |
    (
      sum(rate(http_requests_total{service="order-service", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="order-service"}[5m]))
    ) > 0.01
    AND
    sum(rate(http_requests_total{service="order-service"}[5m])) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Order service error rate > 1% (with sufficient traffic)"

Dashboard Sprawl

                            
Anti-Pattern: 500+ dashboards with no ownership, naming convention, or hierarchy. Engineers can’t find the right dashboard during incidents, duplicates proliferate, and stale dashboards show misleading data.

Solution: Implement a dashboard taxonomy with ownership and lifecycle management:

Level	Purpose	Audience	Count Target
L0: Business	Revenue, conversions, user satisfaction	Leadership, product	3–5 total
L1: Service	RED metrics per service (Rate, Errors, Duration)	On-call engineers	1 per service
L2: Component	Database, cache, queue internals	Service owners	1 per component
L3: Debug	Ad-hoc investigation, temporary	Individual engineers	Auto-delete after 30 days
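
The L3 "auto-delete after 30 days" rule only works if something enforces it. Here is a minimal sketch of the selection logic, assuming you already have an inventory of dashboards with a level and a last-updated timestamp; the field names `title`, `level`, and `updated` are hypothetical, and in practice the inventory would come from the Grafana API or your provisioning repository.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_debug_dashboards(
    dashboards: List[Dict],
    now: datetime,
    max_age_days: int = 30,
) -> List[str]:
    """Return titles of L3 (debug) dashboards last updated before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        d["title"]
        for d in dashboards
        if d["level"] == "L3" and d["updated"] < cutoff
    ]


# Hypothetical inventory; field names are illustrative only
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
inventory = [
    {"title": "Order Service (L1)", "level": "L1",
     "updated": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"title": "tmp-latency-debug", "level": "L3",
     "updated": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"title": "tmp-pool-debug", "level": "L3",
     "updated": datetime(2024, 6, 20, tzinfo=timezone.utc)},
]
to_delete = stale_debug_dashboards(inventory, now)
```

Only the L3 dashboard untouched for more than 30 days is selected; the old L1 dashboard is left alone because the lifecycle rule applies to debug dashboards only.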

Missing Context

                            
Anti-Pattern: Telemetry exists but lacks the context needed to act on it. Logs say “request failed” without saying which request, from which user, to which endpoint.

Solution: Enforce mandatory context fields across all telemetry:

# Mandatory context for all telemetry signals
required_attributes:
  # Resource attributes (set once per service instance)
  resource:
    - service.name         # Which service
    - service.version      # Which version
    - deployment.environment  # prod/staging/dev
    - k8s.namespace.name   # Which namespace
    - k8s.pod.name         # Which pod (for debugging)

  # Span/log attributes (set per request)
  request:
    - http.method          # GET/POST/PUT
    - http.route           # /api/orders/{id} (NOT /api/orders/12345)
    - http.status_code     # 200, 500, etc.
    - user.id              # WHO is affected (hashed for privacy)

  # Error context
  error:
    - error.type           # Exception class name
    - error.message        # Human-readable message
    - error.stack_trace    # Full stack (logs only, not metrics)
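
A list of mandatory fields is only useful if it is checked. As a rough sketch of what such a check could look like (for example in a CI test or a collector-side validator), the hypothetical helper below reports which required attributes a telemetry record is missing. The attribute names are copied from the list above; the function and variable names are assumptions.

```python
from typing import Dict, Iterable, List

REQUIRED_ATTRIBUTES: Dict[str, List[str]] = {
    "resource": [
        "service.name", "service.version", "deployment.environment",
        "k8s.namespace.name", "k8s.pod.name",
    ],
    "request": ["http.method", "http.route", "http.status_code", "user.id"],
    "error": ["error.type", "error.message", "error.stack_trace"],
}


def missing_attributes(
    attributes: Dict[str, object],
    groups: Iterable[str] = ("resource", "request"),
) -> List[str]:
    """List required attribute names absent from a telemetry record."""
    missing = []
    for group in groups:
        for name in REQUIRED_ATTRIBUTES[group]:
            if name not in attributes:
                missing.append(name)
    return missing


# Hypothetical log record that only says "request failed"
record = {"service.name": "order-service", "http.method": "POST"}
gaps = missing_attributes(record)
```

Pass `groups=("resource", "request", "error")` for error-level records, where the error context is mandatory as well.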

Cost Traps

                            
Anti-Pattern: Unbounded data collection without cost awareness. One team adds user_id as a metric label, creating 10M new series overnight and tripling the Mimir bill.

Solution: Implement guardrails at multiple layers:

Collector level: OTel Collector processors that drop/aggregate high-cardinality data before ingestion
Backend level: Per-tenant limits in Mimir (max_global_series_per_user, ingestion_rate) and Loki (max_global_streams_per_user, ingestion_rate_mb)
Organizational level: Cost attribution dashboards showing spend by team, with budget alerts
Review process: New instrumentation requires review (like code review) for cardinality impact
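
For the review step, a back-of-the-envelope cardinality estimate is usually enough to catch the user_id mistake before it ships. The sketch below multiplies the number of distinct values per label to get an upper bound on series for one metric; the label counts are hypothetical, and real series counts are often lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def estimated_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for a single request counter
base = {"service": 20, "route": 50, "status": 5}
with_user_id = {**base, "user_id": 2_000}

before = estimated_series(base)          # 20 * 50 * 5
after = estimated_series(with_user_id)   # the same, times 2,000 users
```

Here the estimate goes from 5,000 series to 10 million with a single extra label, which is the kind of jump a cardinality review should block or redirect to logs and traces instead.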

Operational Checklists

New Service Onboarding Checklist

Observability Readiness Checklist

Category	Item	Validation
Metrics	RED metrics exposed (Rate, Errors, Duration)	PromQL query returns data for the service
Metrics	Resource metrics (CPU, memory, network)	cAdvisor/kubelet metrics visible
Metrics	Business metrics defined (if applicable)	Custom metrics documented and exposed
Logs	Structured JSON logging with correlation IDs	LogQL query returns parsed fields
Logs	Log levels configured (no DEBUG in prod)	Verify log volume is within budget
Traces	OTel SDK integrated with trace propagation	Traces visible in Tempo for the service
Traces	Span attributes include http.route, http.status_code	TraceQL filter works for the service
Alerts	SLO-based alerts defined (latency + error rate)	Alert rule exists in Grafana Alerting
Alerts	Runbook linked to each alert	Annotation URL resolves to runbook
Dashboard	L1 service dashboard created	Dashboard in correct team folder
On-call	Service registered in Grafana OnCall	Test alert routes to correct team
Ownership	Service ownership documented	Team label on all telemetry

Incident Response Checklist

# Incident Response Workflow
incident_response:
  detection:  # 0-2 minutes
    - Acknowledge alert in Grafana OnCall
    - Open service L1 dashboard
    - Check deployment annotations (was anything deployed?)
    - Identify blast radius (which services/users affected?)

  investigation:  # 2-15 minutes
    - Open Grafana Explore in split view
    - Left panel: metrics showing the anomaly
    - Right panel: logs filtered by time + service
    - If request-level issue: find trace ID from logs, open in Tempo
    - If performance issue: check Pyroscope profiles for the time window
    - Check upstream/downstream service health

  mitigation:  # 15-30 minutes
    - If bad deploy: rollback (kubectl rollout undo or revert PR)
    - If resource exhaustion: scale horizontally
    - If dependency failure: enable circuit breaker / fallback
    - If configuration issue: revert config change

  communication:
    - Update incident status in Grafana Incident
    - Post initial summary to stakeholder channel
    - Set severity level based on user impact
    - Provide estimated time to resolution

  resolution:
    - Confirm metrics return to baseline
    - Close incident with summary
    - Schedule blameless postmortem (within 48 hours)
    - Create follow-up tickets for permanent fixes

Weekly Platform Health Review

# Weekly review checklist for the observability platform team
weekly_review:
  capacity:
    - Check ingestion rates vs limits (should be < 70% of max)
    - Review active series count trend (growing > 10%/week needs investigation)
    - Verify storage utilization and retention policy execution
    - Check query performance (p99 dashboard load < 5s)

  cost:
    - Review per-tenant cost attribution dashboard
    - Identify top 3 cost growth drivers
    - Check for new high-cardinality metrics (> 100K series per metric)
    - Validate sampling rates are appropriate

  reliability:
    - Review meta-monitoring alerts (was the platform itself healthy?)
    - Check data completeness (any gaps in ingestion?)
    - Verify alert delivery latency (alert fired → notification received)
    - Test disaster recovery (can we restore from backup?)

  adoption:
    - Count new services onboarded this week
    - Review onboarding friction (any support tickets?)
    - Check dashboard creation/usage metrics
    - Gather feedback from engineering teams
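
The capacity thresholds in this checklist are simple enough to automate so the weekly meeting starts from a list of findings rather than a wall of graphs. This is a minimal sketch using the 70% ingestion and 10% weekly growth thresholds from above; `capacity_findings` and its parameters are hypothetical, and it assumes the limit and last week's series count are both greater than zero.

```python
from typing import List


def capacity_findings(
    ingestion_rate: float,
    ingestion_limit: float,
    series_last_week: int,
    series_now: int,
    max_utilization: float = 0.70,
    max_weekly_growth: float = 0.10,
) -> List[str]:
    """Return human-readable findings for the weekly capacity review."""
    findings = []
    # Ingestion should stay below 70% of the configured limit
    utilization = ingestion_rate / ingestion_limit
    if utilization >= max_utilization:
        findings.append(f"ingestion at {utilization:.0%} of limit")
    # Active series growing more than 10% per week needs investigation
    growth = (series_now - series_last_week) / series_last_week
    if growth > max_weekly_growth:
        findings.append(f"active series grew {growth:.0%} week over week")
    return findings


# Hypothetical weekly numbers
report = capacity_findings(
    ingestion_rate=80_000, ingestion_limit=100_000,
    series_last_week=1_000_000, series_now=1_050_000,
)
```

An empty list means the capacity section of the review can be skipped that week.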

Reference Architecture

Complete System Diagram

Enterprise Grafana Observability Reference Architecture

flowchart TD
    subgraph Apps["Application Layer"]
        S1["Microservices<br/>(OTel SDK)"]
        S2["Infrastructure<br/>(Node Exporter)"]
        S3["Frontend<br/>(Grafana Faro)"]
    end
    subgraph Collection["Collection Layer"]
        AL["Grafana Alloy<br/>(DaemonSet)"]
        OC["OTel Collector<br/>(Gateway)"]
    end
    subgraph Backend["Backend Layer"]
        MI["Mimir<br/>(Metrics)"]
        LO["Loki<br/>(Logs)"]
        TE["Tempo<br/>(Traces)"]
        PY["Pyroscope<br/>(Profiles)"]
    end
    subgraph Storage["Storage Layer"]
        S3S["Object Storage<br/>(S3/GCS/Azure)"]
    end
    subgraph Viz["Visualization & Action"]
        GR["Grafana<br/>(Dashboards)"]
        GA["Grafana Alerting<br/>(Rules Engine)"]
        GO["Grafana OnCall<br/>(Notifications)"]
        GI["Grafana Incident<br/>(Lifecycle)"]
    end
    S1 --> AL
    S2 --> AL
    S3 --> OC
    AL --> OC
    OC --> MI
    OC --> LO
    OC --> TE
    OC --> PY
    MI --> S3S
    LO --> S3S
    TE --> S3S
    PY --> S3S
    GR --> MI
    GR --> LO
    GR --> TE
    GR --> PY
    GA --> GR
    GA --> GO
    GO --> GI

Component Versions & Compatibility

Component	Recommended Version	Deployment Mode	Key Dependency
Grafana	11.x+	Deployment (stateless)	PostgreSQL / MySQL for state
Mimir	2.14+	Microservices (StatefulSets)	S3-compatible object storage
Loki	3.x+	Simple Scalable Deployment (SSD)	S3-compatible object storage
Tempo	2.6+	Distributed (microservices)	S3-compatible object storage
Pyroscope	1.8+	Single binary or microservices	S3-compatible object storage
Alloy	1.4+	DaemonSet + Gateway Deployment	Configured via .alloy files (Alloy syntax, formerly River)
OTel Collector	0.100+	DaemonSet + Gateway	Configured via YAML

Sizing Guide

Scale	Active Series	Log Volume	Services	Mimir Ingesters	Loki Write	Object Storage
Small	< 500K	< 50 GB/day	10–50	3 × 8GB RAM	3 × 4GB RAM	~2 TB/year
Medium	500K–5M	50–500 GB/day	50–200	6 × 16GB RAM	6 × 8GB RAM	~20 TB/year
Large	5M–50M	500 GB–5 TB/day	200–1000	12 × 32GB RAM	12 × 16GB RAM	~200 TB/year
XL	> 50M	> 5 TB/day	1000+	24+ × 64GB RAM	24+ × 32GB RAM	~2 PB/year
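
When the two sizing inputs disagree (few series but heavy logging, or the reverse), it is safer to size for the larger tier. The sketch below encodes the table's thresholds under that assumption; the function names are hypothetical, and values that sit exactly on a boundary (for example 5M series) are assigned to the higher tier, which is a choice made here rather than something the table specifies.

```python
from typing import List

TIERS = ["Small", "Medium", "Large", "XL"]
SERIES_LIMITS = [500_000, 5_000_000, 50_000_000]  # active series
LOG_GB_LIMITS = [50, 500, 5_000]                  # GB of logs per day


def _tier_index(value: float, limits: List[float]) -> int:
    """Index of the first tier whose upper limit is above the value."""
    for i, limit in enumerate(limits):
        if value < limit:
            return i
    return len(limits)


def sizing_tier(active_series: int, log_gb_per_day: float) -> str:
    """Pick the larger of the tiers implied by series count and log volume."""
    index = max(
        _tier_index(active_series, SERIES_LIMITS),
        _tier_index(log_gb_per_day, LOG_GB_LIMITS),
    )
    return TIERS[index]


# Hypothetical platform: modest metrics, heavy logging
tier = sizing_tier(active_series=300_000, log_gb_per_day=800)
```

In this example the series count alone would suggest Small, but 800 GB/day of logs puts the platform in the Large row for Loki sizing and object storage.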

Driving Adoption

Champions Program

Successful observability platforms are adopted bottom-up. Identify and empower “observability champions” in each team:

Identify: Find engineers who are naturally curious about system behavior and already use monitoring tools
Train: Provide advanced workshops on PromQL, LogQL, TraceQL, and dashboard design
Empower: Give them early access to new features and a direct channel to the platform team
Recognize: Highlight their contributions in engineering all-hands and internal blogs
Scale: Champions train their own teams, creating a multiplicative effect

Internal Documentation

The platform is only as good as its documentation. Maintain these living documents:

Document	Audience	Content
Getting Started Guide	New developers	5-minute quickstart: add SDK, see first data in Grafana
Instrumentation Standards	All engineers	Required attributes, naming conventions, sampling policies
Dashboard Cookbook	Service owners	Templates for L1/L2 dashboards with PromQL/LogQL examples
Alerting Playbook	On-call engineers	Alert definitions, escalation policies, runbook templates
Troubleshooting Guide	On-call engineers	Common failure patterns and resolution steps (this article!)
Platform Roadmap	All stakeholders	Upcoming features, migration plans, deprecations

Measuring Success

# Platform success metrics
adoption_metrics:
  coverage:
    - "% of production services emitting all 3 signals (metrics, logs, traces)"
    - target: "> 90% within 6 months of platform launch"

  usage:
    - "Weekly active Grafana users / total engineering headcount"
    - target: "> 60%"

  quality:
    - "Mean Time to Detection (MTTD) for production incidents"
    - target: "< 5 minutes (down from 15+ pre-platform)"

  efficiency:
    - "Mean Time to Resolution (MTTR) for P1 incidents"
    - target: "< 1 hour (down from 4+ pre-platform)"

  satisfaction:
    - "Quarterly developer experience survey score"
    - target: "> 4.0 / 5.0"

  cost:
    - "Observability cost per monitored service per month"
    - target: "< $50/service/month at scale"
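
The coverage metric is the easiest of these to compute automatically. As a minimal sketch, assuming you can produce a mapping from each production service to the set of signals it currently emits (the mapping and the name `signal_coverage` are hypothetical):

```python
from typing import Dict, Set

REQUIRED_SIGNALS = {"metrics", "logs", "traces"}


def signal_coverage(signals_by_service: Dict[str, Set[str]]) -> float:
    """Percent of services emitting metrics, logs, and traces."""
    if not signals_by_service:
        return 0.0
    complete = sum(
        1 for signals in signals_by_service.values()
        if REQUIRED_SIGNALS <= signals  # all three signals present
    )
    return complete / len(signals_by_service) * 100


# Hypothetical service inventory
inventory = {
    "order-service": {"metrics", "logs", "traces"},
    "cart-service": {"metrics", "logs"},
    "payment-service": {"metrics", "logs", "traces", "profiles"},
    "batch-export": {"metrics"},
}
coverage = signal_coverage(inventory)
```

Two of the four services emit all three signals, so coverage is 50%, well short of the 90% target. Extra signals such as profiles do not count against a service.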

Series Conclusion

Over 15 parts, we’ve built a comprehensive understanding of modern observability with the Grafana stack:

Part	Topic	Key Takeaway
1	The Observability Stack	LGTM architecture and how components interconnect
2	Instrumentation	OpenTelemetry as the universal instrumentation layer
3	Learning Environment	Docker Compose lab for hands-on practice
4	Loki & LogQL	Structured logging with efficient label-based queries
5	Mimir & PromQL	Time series metrics with powerful query language
6	Tempo & TraceQL	Distributed tracing for request flow analysis
7	Infrastructure & Cloud	Monitoring Kubernetes, cloud services, and infrastructure
8	Dashboards	Effective visualization design and dashboard patterns
9	Alerting & Incidents	SLO-based alerting with complete incident lifecycle
10	Infrastructure as Code	Terraform, Helm, Grafonnet, and GitOps workflows
11	Platform Architecture	Multi-tenant design, scaling, and cost optimization
12	Real User Monitoring	Grafana Faro for frontend-to-backend correlation
13	Pyroscope & k6	Continuous profiling and load testing
14	DevOps Observability	DORA metrics, canary analysis, and chaos engineering
15	Troubleshooting & Best Practices	Systematic debugging and production operational patterns

                            
Final Thought: Observability is not a destination but a continuous practice. The tools and techniques in this series give you the foundation, but the real value comes from building a culture where engineers instrument by default, investigate with data, and treat operational excellence as a first-class engineering discipline. Start small, iterate based on real incidents, and let production reality guide your platform evolution.
