Grafana Deep Dive Part 15: Troubleshooting & Production Best Practices

June 15, 2026 Wasil Zafar 30 min read

Bring together everything from Parts 1–14 into battle-tested production workflows. Master systematic troubleshooting with metrics-logs-traces correlation, avoid common anti-patterns, use operational checklists for day-2 operations, and reference a complete architecture diagram for enterprise Grafana deployments.

Table of Contents

  1. Systematic Troubleshooting
  2. Production Anti-Patterns
  3. Operational Checklists
  4. Reference Architecture
  5. Driving Adoption
  6. Series Conclusion

Systematic Troubleshooting

Metrics → Logs → Traces Workflow

The most effective troubleshooting follows a systematic narrowing pattern: start broad with metrics, narrow with logs, then pinpoint with traces. This is the “golden path” through the LGTM stack:

Systematic Debugging Workflow
flowchart TD
    A["Alert Fires<br/>Error rate > 5%"]
    M["Metrics: WHICH service?<br/>Dashboard shows order-service"]
    L["Logs: WHAT errors?<br/>LogQL: NullPointerException in PaymentHandler"]
    T["Traces: WHERE in the request?<br/>TraceQL: payment-service → database timeout"]
    P["Profiles: WHY is it slow?<br/>Pyroscope: connection pool exhausted"]
    F["Fix: Increase pool size<br/>Deploy → validate metrics recovered"]
    A --> M --> L --> T --> P --> F

# Step 1: Metrics — Identify the affected service
# PromQL: Which service has elevated error rate?
topk(5,
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
  * 100
)
# Result: order-service at 12% error rate

# Step 2: Logs — What errors are occurring?
# LogQL: Filter to the affected service and time window
{service="order-service"} | json | level="error"
  | line_format "{{.error_type}}: {{.message}}"
# Result: "DatabaseException: Connection pool exhausted (max: 10, active: 10)"

# Step 3: Traces — Find an affected request
# TraceQL: Find slow traces with errors in order-service
{resource.service.name="order-service" && status=error && duration>2s}
# Result: Trace shows 8s waiting for database connection

# Step 4: Profiles — Why is the pool exhausted?
# Pyroscope: Check goroutine/thread profile for order-service
# Profile type: goroutines (Go) or threads (Java)
# Result: 200 goroutines blocked on sql.DB.conn() — pool too small for load

# Step 5: Fix and validate
# Increase pool: max_open_conns: 50
# Monitor: error rate drops to 0.1% within 2 minutes
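
The arithmetic behind the Step 1 query can be sketched in plain Python. This is a minimal illustration, not how Mimir evaluates PromQL: the function name `top_error_rates` and the sample rates are hypothetical, and the inputs stand in for the per-service values that `rate(http_requests_total[5m])` would return.

```python
from typing import Dict, List, Tuple


def top_error_rates(
    totals: Dict[str, float],
    errors: Dict[str, float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k services with the highest 5xx percentage.

    Both dicts map service name -> requests per second.
    """
    percentages = []
    for service, total in totals.items():
        if total <= 0:
            continue  # no traffic, so a ratio is meaningless
        # Assumption: a service with no 5xx series counts as 0%.
        # PromQL would instead drop it from the result (no matching series).
        pct = errors.get(service, 0.0) / total * 100
        percentages.append((service, pct))
    percentages.sort(key=lambda item: item[1], reverse=True)
    return percentages[:k]


# Hypothetical sample rates (requests per second)
totals = {"order-service": 50.0, "cart-service": 80.0, "auth-service": 120.0}
errors = {"order-service": 6.0, "cart-service": 0.4}

ranked = top_error_rates(totals, errors, k=2)
print(ranked)
```

With these numbers order-service ranks first at roughly 12%, which matches the result quoted in Step 1.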

Common Failure Scenarios

Common Failure Patterns & Resolution

| Symptom | First Signal | Root Cause Pattern | Resolution |
|---|---|---|---|
| Latency spike (all services) | Metrics: p99 jump | Shared dependency (DB, cache, DNS) | Check infrastructure metrics for the shared resource |
| Error rate spike (one service) | Metrics: 5xx count | Bad deploy, config change, dependency failure | Check deployment annotations, then downstream service health |
| Gradual memory growth | Metrics: container memory | Memory leak, unbounded cache, goroutine leak | Pyroscope inuse_space profile over 4+ hours |
| Intermittent timeouts | Traces: long spans | Connection pool exhaustion, network issues, GC pauses | Correlate with GC metrics and pool utilization |
| Data inconsistency | Logs: warning messages | Race condition, retry storms, eventual consistency lag | Trace the specific failing requests end-to-end |
| Alert storm | Multiple alerts fire | Cascading failure from upstream dependency | Find the root service using dependency graph in traces |
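
For the alert-storm pattern, "find the root service" can be approximated as a small graph problem: among the services that are alerting, keep only those whose own dependencies are healthy. The sketch below assumes you already have a dependency map (for example, exported from a trace-derived service graph); the function name, service names, and map are hypothetical.

```python
from typing import Dict, List, Set


def root_alerting_services(
    depends_on: Dict[str, List[str]],
    alerting: Set[str],
) -> List[str]:
    """Return alerting services that have no alerting direct dependency."""
    roots = []
    for service in sorted(alerting):
        deps = depends_on.get(service, [])
        if not any(dep in alerting for dep in deps):
            roots.append(service)
    return roots


# Hypothetical dependency map: service -> services it calls
depends_on = {
    "frontend": ["order-service"],
    "order-service": ["payment-service", "postgres"],
    "payment-service": ["postgres"],
    "postgres": [],
}
alerting = {"frontend", "order-service", "payment-service", "postgres"}

print(root_alerting_services(depends_on, alerting))  # ['postgres']
```

This only looks at direct dependencies and returns nothing if the alerting services form a cycle, so treat the output as a starting point for investigation rather than a verdict.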

Grafana Debugging Tools

Grafana provides several built-in tools for efficient debugging:

  • Explore mode — ad-hoc queries across all data sources without saving dashboards
  • Split view — compare metrics on the left with logs or traces on the right
  • Trace-to-logs — click a trace span to jump directly to correlated logs in Loki
  • Logs-to-traces — click a traceID in log lines to open the full trace in Tempo
  • Exemplars — click data points on metric graphs to jump to specific trace exemplars
  • Flame graph — drill into Pyroscope profiles from a specific time range
# Enable trace-to-logs correlation in Grafana data source config
# Tempo data source settings:
datasources:
  - name: Tempo
    type: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki-uid
        tags: ['service.name', 'k8s.pod.name']
        mappedTags: [{ key: 'service.name', value: 'service' }]
        mapTagNamesEnabled: true
        spanStartTimeShift: '-1m'
        spanEndTimeShift: '1m'
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: mimir-uid
        tags: [{ key: 'service.name', value: 'service' }]
        queries:
          - name: 'Request Rate'
            query: 'sum(rate(http_requests_total{$$__tags}[5m]))'
          - name: 'Error Rate'
            query: 'sum(rate(http_requests_total{status=~"5..",$$__tags}[5m]))'
      tracesToProfiles:
        datasourceUid: pyroscope-uid
        tags: [{ key: 'service.name', value: 'service_name' }]
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
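
To make the trace-to-logs settings less abstract, here is a rough Python sketch of the kind of Loki query and time window they describe: mapped span tags become label matchers, the trace ID becomes a line filter, and the span window is widened by the two shift values. This is an illustration of the idea only, not Grafana's actual implementation; `logs_query_for_span` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple

# Mirrors mappedTags in the config above: span tag name -> Loki label name
TAG_MAPPING = {"service.name": "service"}


def logs_query_for_span(
    span_tags: Dict[str, str],
    trace_id: str,
    span_start: datetime,
    span_end: datetime,
    start_shift: timedelta = timedelta(minutes=-1),  # spanStartTimeShift
    end_shift: timedelta = timedelta(minutes=1),     # spanEndTimeShift
) -> Tuple[str, datetime, datetime]:
    """Build a LogQL query plus the shifted time range for one span."""
    matchers = [
        f'{label}="{span_tags[tag]}"'
        for tag, label in TAG_MAPPING.items()
        if tag in span_tags
    ]
    selector = "{" + ", ".join(matchers) + "}"
    # filterByTraceID: true -> keep only lines containing the trace ID
    query = f'{selector} |= "{trace_id}"'
    return query, span_start + start_shift, span_end + end_shift


start = datetime(2026, 6, 15, 12, 0, 0, tzinfo=timezone.utc)
query, window_start, window_end = logs_query_for_span(
    {"service.name": "order-service", "k8s.pod.name": "order-7d9f"},
    trace_id="4bf92f3577b34da6",
    span_start=start,
    span_end=start + timedelta(seconds=8),
)
print(query)  # {service="order-service"} |= "4bf92f3577b34da6"
print(window_start.isoformat(), window_end.isoformat())
```

The one-minute padding on each side matters in practice because log timestamps and span timestamps rarely line up exactly.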

Production Anti-Patterns

Alert Fatigue

Anti-Pattern: Alerting on every metric exceeding a threshold. Teams receive 50+ alerts daily, learn to ignore them, and miss critical issues buried in noise.

Solution: Alert on symptoms (user impact), not causes (CPU usage). Use multi-signal alerts that require multiple conditions:

# BAD: Alert on CPU (cause, not symptom)
- alert: HighCPU
  expr: node_cpu_utilization > 80
  # This fires constantly and rarely indicates user impact

# GOOD: Alert on user-facing SLI degradation
- alert: OrderServiceDegraded
  expr: |
    (
      sum(rate(http_requests_total{service="order-service", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="order-service"}[5m]))
    ) > 0.01
    AND
    sum(rate(http_requests_total{service="order-service"}[5m])) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Order service error rate > 1% (with sufficient traffic)"
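
The logic of the GOOD rule (error ratio above 1%, traffic above 10 requests per second, sustained for five minutes) can be simulated in a few lines of Python. This is a simplified model of how a `for:` clause behaves, written for illustration; the function names and sample data are hypothetical and it is not the Prometheus or Grafana rule evaluator.

```python
from typing import Iterable, List, Optional, Tuple

ERROR_RATIO_THRESHOLD = 0.01  # 1% of requests failing
MIN_REQUEST_RATE = 10.0       # requests per second
FOR_DURATION_S = 300          # "for: 5m"


def condition_met(error_rate: float, request_rate: float) -> bool:
    """Both signals must hold: enough traffic AND elevated error ratio."""
    if request_rate <= MIN_REQUEST_RATE:
        return False
    return error_rate / request_rate > ERROR_RATIO_THRESHOLD


def firing_times(samples: Iterable[Tuple[int, float, float]]) -> List[int]:
    """samples: (timestamp_s, error_rate, request_rate), in time order."""
    firing = []
    pending_since: Optional[int] = None
    for ts, err, req in samples:
        if condition_met(err, req):
            if pending_since is None:
                pending_since = ts  # alert enters the pending state
            if ts - pending_since >= FOR_DURATION_S:
                firing.append(ts)   # condition held for the full duration
        else:
            pending_since = None    # any recovery resets the timer
    return firing


# Hypothetical evaluations every 60 seconds
samples = [(0, 0.05, 2.0)]                               # 2.5% errors, but too little traffic
samples += [(t, 1.0, 50.0) for t in range(60, 421, 60)]  # 2% errors under real load
samples.append((480, 0.1, 50.0))                         # recovered

print(firing_times(samples))  # [360, 420]
```

The first sample shows why the traffic guard exists: a 2.5% error ratio on two requests per second is a single failed request every 20 seconds, which is rarely worth a page.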

Dashboard Sprawl

Anti-Pattern: 500+ dashboards with no ownership, naming convention, or hierarchy. Engineers can’t find the right dashboard during incidents, duplicates proliferate, and stale dashboards show misleading data.

Solution: Implement a dashboard taxonomy with ownership and lifecycle management:

| Level | Purpose | Audience | Count Target |
|---|---|---|---|
| L0: Business | Revenue, conversions, user satisfaction | Leadership, product | 3–5 total |
| L1: Service | RED metrics per service (Rate, Errors, Duration) | On-call engineers | 1 per service |
| L2: Component | Database, cache, queue internals | Service owners | 1 per component |
| L3: Debug | Ad-hoc investigation, temporary | Individual engineers | Auto-delete after 30 days |
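
The 30-day rule for L3 dashboards is easy to automate once dashboard metadata is available. The sketch below assumes each dashboard has already been fetched into a dict with `uid`, `level`, and `updated` keys; those field names are hypothetical, and how you tag the level (folder, tag, or naming prefix) is up to your own convention.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_DEBUG_AGE = timedelta(days=30)


def expired_debug_dashboards(dashboards: List[Dict], now: datetime) -> List[str]:
    """Return UIDs of L3 dashboards not updated within the last 30 days."""
    return [
        d["uid"]
        for d in dashboards
        if d["level"] == "L3" and now - d["updated"] > MAX_DEBUG_AGE
    ]


now = datetime(2026, 6, 15, tzinfo=timezone.utc)
dashboards = [
    {"uid": "dbg-old", "level": "L3", "updated": datetime(2026, 5, 1, tzinfo=timezone.utc)},
    {"uid": "dbg-new", "level": "L3", "updated": datetime(2026, 6, 1, tzinfo=timezone.utc)},
    {"uid": "svc-orders", "level": "L1", "updated": datetime(2025, 1, 1, tzinfo=timezone.utc)},
]

print(expired_debug_dashboards(dashboards, now))  # ['dbg-old']
```

Only L3 dashboards are candidates, so a long-untouched L1 dashboard is left alone and handled through its owning team instead.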

Missing Context

Anti-Pattern: Telemetry exists but lacks the context needed to act on it. Logs say “request failed” without saying which request, from which user, to which endpoint.

Solution: Enforce mandatory context fields across all telemetry:

# Mandatory context for all telemetry signals
required_attributes:
  # Resource attributes (set once per service instance)
  resource:
    - service.name         # Which service
    - service.version      # Which version
    - deployment.environment  # prod/staging/dev
    - k8s.namespace.name   # Which namespace
    - k8s.pod.name         # Which pod (for debugging)

  # Span/log attributes (set per request)
  request:
    - http.method          # GET/POST/PUT
    - http.route           # /api/orders/{id} (NOT /api/orders/12345)
    - http.status_code     # 200, 500, etc.
    - user.id              # WHO is affected (hashed for privacy)

  # Error context
  error:
    - error.type           # Exception class name
    - error.message        # Human-readable message
    - error.stack_trace    # Full stack (logs only, not metrics)
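
A list like this is only useful if something checks it. As a minimal sketch, the function below reports which mandatory attributes are missing from one span or log record; the attribute names come from the list above, while `missing_attributes` and the sample record are hypothetical. In practice a check like this might run in CI against sample telemetry or in a collector pipeline.

```python
from typing import Dict, List

REQUIRED = {
    "resource": [
        "service.name",
        "service.version",
        "deployment.environment",
        "k8s.namespace.name",
        "k8s.pod.name",
    ],
    "request": ["http.method", "http.route", "http.status_code", "user.id"],
}


def missing_attributes(resource_attrs: Dict, request_attrs: Dict) -> List[str]:
    """List required attribute names absent from the given record."""
    missing = [key for key in REQUIRED["resource"] if key not in resource_attrs]
    missing += [key for key in REQUIRED["request"] if key not in request_attrs]
    return missing


resource = {
    "service.name": "order-service",
    "service.version": "1.4.2",
    "deployment.environment": "prod",
    "k8s.namespace.name": "shop",
    "k8s.pod.name": "order-7d9f",
}
request = {
    "http.method": "POST",
    "http.route": "/api/orders/{id}",
    "http.status_code": 500,
}

print(missing_attributes(resource, request))  # ['user.id']
```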

Cost Traps

Anti-Pattern: Unbounded data collection without cost awareness. One team adds user_id as a metric label, creating 10M new series overnight and tripling the Mimir bill.

Solution: Implement guardrails at multiple layers:

  • Collector level: OTel Collector processors that drop/aggregate high-cardinality data before ingestion
  • Backend level: Per-tenant limits, such as max_global_series_per_user and ingestion_rate in Mimir, or max_global_streams_per_user and ingestion_rate_mb in Loki
  • Organizational level: Cost attribution dashboards showing spend by team, with budget alerts
  • Review process: New instrumentation requires review (like code review) for cardinality impact
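
A cardinality review can start with simple arithmetic: the worst-case series count for a metric is the product of the number of distinct values of each label. The sketch below uses hypothetical label counts to show how one extra label produces the 10M-series jump described in the anti-pattern.

```python
from math import prod
from typing import Dict


def estimated_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound: one series per combination of label values."""
    return prod(label_value_counts.values())


# Hypothetical distinct-value counts per label
baseline = {"service": 50, "route": 20, "status": 5}
with_user_id = dict(baseline, user_id=2000)

print(estimated_series(baseline))      # 5000
print(estimated_series(with_user_id))  # 10000000
```

Real series counts are usually lower because not every combination occurs, but the upper bound is what matters when deciding whether a new label is safe to ship.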

Operational Checklists

New Service Onboarding Checklist

Observability Readiness Checklist

| Category | Item | Validation |
|---|---|---|
| Metrics | RED metrics exposed (Rate, Errors, Duration) | PromQL query returns data for the service |
| Metrics | Resource metrics (CPU, memory, network) | cAdvisor/kubelet metrics visible |
| Metrics | Business metrics defined (if applicable) | Custom metrics documented and exposed |
| Logs | Structured JSON logging with correlation IDs | LogQL query returns parsed fields |
| Logs | Log levels configured (no DEBUG in prod) | Verify log volume is within budget |
| Traces | OTel SDK integrated with trace propagation | Traces visible in Tempo for the service |
| Traces | Span attributes include http.route, status_code | TraceQL filter works for the service |
| Alerts | SLO-based alerts defined (latency + error rate) | Alert rule exists in Grafana Alerting |
| Alerts | Runbook linked to each alert | Annotation URL resolves to runbook |
| Dashboard | L1 service dashboard created | Dashboard in correct team folder |
| On-call | Service registered in Grafana OnCall | Test alert routes to correct team |
| Ownership | Service ownership documented | Team label on all telemetry |

Incident Response Checklist

# Incident Response Workflow
incident_response:
  detection:  # 0-2 minutes
    - Acknowledge alert in Grafana OnCall
    - Open service L1 dashboard
    - Check deployment annotations (was anything deployed?)
    - Identify blast radius (which services/users affected?)

  investigation:  # 2-15 minutes
    - Open Grafana Explore in split view
    - Left panel: metrics showing the anomaly
    - Right panel: logs filtered by time + service
    - If request-level issue: find trace ID from logs, open in Tempo
    - If performance issue: check Pyroscope profiles for the time window
    - Check upstream/downstream service health

  mitigation:  # 15-30 minutes
    - If bad deploy: roll back (kubectl rollout undo or revert PR)
    - If resource exhaustion: scale horizontally
    - If dependency failure: enable circuit breaker / fallback
    - If configuration issue: revert config change

  communication:
    - Update incident status in Grafana Incident
    - Post initial summary to stakeholder channel
    - Set severity level based on user impact
    - Provide estimated time to resolution

  resolution:
    - Confirm metrics return to baseline
    - Close incident with summary
    - Schedule blameless postmortem (within 48 hours)
    - Create follow-up tickets for permanent fixes

Weekly Platform Health Review

# Weekly review checklist for the observability platform team
weekly_review:
  capacity:
    - Check ingestion rates vs limits (should be < 70% of max)
    - Review active series count trend (growing > 10%/week needs investigation)
    - Verify storage utilization and retention policy execution
    - Check query performance (p99 dashboard load < 5s)

  cost:
    - Review per-tenant cost attribution dashboard
    - Identify top 3 cost growth drivers
    - Check for new high-cardinality metrics (> 100K series per metric)
    - Validate sampling rates are appropriate

  reliability:
    - Review meta-monitoring alerts (was the platform itself healthy?)
    - Check data completeness (any gaps in ingestion?)
    - Verify alert delivery latency (alert fired → notification received)
    - Test disaster recovery (can we restore from backup?)

  adoption:
    - Count new services onboarded this week
    - Review onboarding friction (any support tickets?)
    - Check dashboard creation/usage metrics
    - Gather feedback from engineering teams
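
The two numeric capacity checks in this review (ingestion below 70% of the limit, active series growing no more than 10% per week) are straightforward to script. The sketch below is one hypothetical way to do it; the function name and inputs are assumptions, and the numbers would come from your own meta-monitoring queries.

```python
from typing import List

MAX_INGESTION_UTILIZATION = 0.70  # "should be < 70% of max"
MAX_WEEKLY_SERIES_GROWTH = 0.10   # "growing > 10%/week needs investigation"


def capacity_findings(
    ingestion_rate: float,
    ingestion_limit: float,
    series_last_week: int,
    series_now: int,
) -> List[str]:
    """Return a human-readable finding for each threshold that is breached."""
    findings = []
    utilization = ingestion_rate / ingestion_limit
    if utilization >= MAX_INGESTION_UTILIZATION:
        findings.append(f"ingestion at {utilization:.0%} of limit")
    growth = (series_now - series_last_week) / series_last_week
    if growth > MAX_WEEKLY_SERIES_GROWTH:
        findings.append(f"active series grew {growth:.0%} week over week")
    return findings


# Hypothetical tenant: 80K samples/s against a 100K limit, 1.0M -> 1.15M series
for finding in capacity_findings(80_000, 100_000, 1_000_000, 1_150_000):
    print(finding)
```

An empty list means both checks passed, which makes the function easy to wire into a weekly report or a chat reminder.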

Reference Architecture

Complete System Diagram

Enterprise Grafana Observability Reference Architecture
flowchart TD
    subgraph Apps["Application Layer"]
        S1["Microservices<br/>(OTel SDK)"]
        S2["Infrastructure<br/>(Node Exporter)"]
        S3["Frontend<br/>(Grafana Faro)"]
    end
    subgraph Collection["Collection Layer"]
        AL["Grafana Alloy<br/>(DaemonSet)"]
        OC["OTel Collector<br/>(Gateway)"]
    end
    subgraph Backend["Backend Layer"]
        MI["Mimir<br/>(Metrics)"]
        LO["Loki<br/>(Logs)"]
        TE["Tempo<br/>(Traces)"]
        PY["Pyroscope<br/>(Profiles)"]
    end
    subgraph Storage["Storage Layer"]
        S3S["Object Storage<br/>(S3/GCS/Azure)"]
    end
    subgraph Viz["Visualization & Action"]
        GR["Grafana<br/>(Dashboards)"]
        GA["Grafana Alerting<br/>(Rules Engine)"]
        GO["Grafana OnCall<br/>(Notifications)"]
        GI["Grafana Incident<br/>(Lifecycle)"]
    end
    S1 --> AL
    S2 --> AL
    S3 --> OC
    AL --> OC
    OC --> MI
    OC --> LO
    OC --> TE
    OC --> PY
    MI --> S3S
    LO --> S3S
    TE --> S3S
    PY --> S3S
    GR --> MI
    GR --> LO
    GR --> TE
    GR --> PY
    GA --> GR
    GA --> GO
    GO --> GI

Component Versions & Compatibility

| Component | Recommended Version | Deployment Mode | Key Dependency |
|---|---|---|---|
| Grafana | 11.x+ | Deployment (stateless) | PostgreSQL / MySQL for state |
| Mimir | 2.14+ | Microservices (StatefulSets) | S3-compatible object storage |
| Loki | 3.x+ | Simple Scalable (SSD mode) | S3-compatible object storage |
| Tempo | 2.6+ | Distributed (microservices) | S3-compatible object storage |
| Pyroscope | 1.8+ | Single binary or microservices | S3-compatible object storage |
| Alloy | 1.4+ | DaemonSet + Gateway Deployment | Configured via River files |
| OTel Collector | 0.100+ | DaemonSet + Gateway | Configured via YAML |

Sizing Guide

| Scale | Active Series | Log Volume | Services | Mimir Ingesters | Loki Write | Object Storage |
|---|---|---|---|---|---|---|
| Small | < 500K | < 50 GB/day | 10–50 | 3 × 8GB RAM | 3 × 4GB RAM | ~2 TB/year |
| Medium | 500K–5M | 50–500 GB/day | 50–200 | 6 × 16GB RAM | 6 × 8GB RAM | ~20 TB/year |
| Large | 5M–50M | 500 GB–5 TB/day | 200–1000 | 12 × 32GB RAM | 12 × 16GB RAM | ~200 TB/year |
| XL | > 50M | > 5 TB/day | 1000+ | 24+ × 64GB RAM | 24+ × 32GB RAM | ~2 PB/year |
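
If you want to surface the sizing tier on a capacity dashboard or in a planning script, the active-series column can be turned into a small lookup. The sketch below copies the thresholds and ingester counts from the sizing table; the function name is hypothetical, and treating the shared boundaries (500K, 5M, 50M) as described in the comments is an assumption, since the table does not say which tier owns them.

```python
from typing import Dict

# Ingester and write-node sizes copied from the sizing table
TIER_RESOURCES: Dict[str, Dict[str, str]] = {
    "Small": {"mimir_ingesters": "3 x 8GB RAM", "loki_write": "3 x 4GB RAM"},
    "Medium": {"mimir_ingesters": "6 x 16GB RAM", "loki_write": "6 x 8GB RAM"},
    "Large": {"mimir_ingesters": "12 x 32GB RAM", "loki_write": "12 x 16GB RAM"},
    "XL": {"mimir_ingesters": "24+ x 64GB RAM", "loki_write": "24+ x 32GB RAM"},
}


def sizing_tier(active_series: int) -> str:
    """Map an active series count to a tier name.

    Assumption: Small is strictly below 500K; the 5M and 50M upper
    bounds are treated as inclusive.
    """
    if active_series < 500_000:
        return "Small"
    if active_series <= 5_000_000:
        return "Medium"
    if active_series <= 50_000_000:
        return "Large"
    return "XL"


tier = sizing_tier(12_000_000)
print(tier, TIER_RESOURCES[tier])
```

Active series is only one of the table's dimensions, so a deployment with modest metrics but heavy log volume may still need the larger Loki tier.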

Driving Adoption

Champions Program

Successful observability platforms are adopted bottom-up. Identify and empower “observability champions” in each team:

  • Identify: Find engineers who are naturally curious about system behavior and already use monitoring tools
  • Train: Provide advanced workshops on PromQL, LogQL, TraceQL, and dashboard design
  • Empower: Give them early access to new features and a direct channel to the platform team
  • Recognize: Highlight their contributions in engineering all-hands and internal blogs
  • Scale: Champions train their own teams, creating a multiplicative effect

Internal Documentation

The platform is only as good as its documentation. Maintain these living documents:

| Document | Audience | Content |
|---|---|---|
| Getting Started Guide | New developers | 5-minute quickstart: add SDK, see first data in Grafana |
| Instrumentation Standards | All engineers | Required attributes, naming conventions, sampling policies |
| Dashboard Cookbook | Service owners | Templates for L1/L2 dashboards with PromQL/LogQL examples |
| Alerting Playbook | On-call engineers | Alert definitions, escalation policies, runbook templates |
| Troubleshooting Guide | On-call engineers | Common failure patterns and resolution steps (this article!) |
| Platform Roadmap | All stakeholders | Upcoming features, migration plans, deprecations |

Measuring Success

# Platform success metrics
adoption_metrics:
  coverage:
    - "% of production services emitting all 3 signals (metrics, logs, traces)"
    - target: "> 90% within 6 months of platform launch"

  usage:
    - "Weekly active Grafana users / total engineering headcount"
    - target: "> 60%"

  quality:
    - "Mean Time to Detection (MTTD) for production incidents"
    - target: "< 5 minutes (down from 15+ pre-platform)"

  efficiency:
    - "Mean Time to Resolution (MTTR) for P1 incidents"
    - target: "< 1 hour (down from 4+ pre-platform)"

  satisfaction:
    - "Quarterly developer experience survey score"
    - target: "> 4.0 / 5.0"

  cost:
    - "Observability cost per monitored service per month"
    - target: "< $50/service/month at scale"
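
The coverage and usage metrics are simple ratios once you have an inventory to compute them from. As a rough sketch, assuming a hypothetical inventory that maps each production service to the set of signals it emits:

```python
from typing import Dict, Set

REQUIRED_SIGNALS = {"metrics", "logs", "traces"}


def coverage_percent(signals_by_service: Dict[str, Set[str]]) -> float:
    """Percentage of services emitting all three signals."""
    if not signals_by_service:
        return 0.0
    complete = sum(
        1 for signals in signals_by_service.values() if REQUIRED_SIGNALS <= signals
    )
    return complete / len(signals_by_service) * 100


def weekly_active_ratio(active_users: int, headcount: int) -> float:
    """Weekly active Grafana users divided by engineering headcount."""
    return active_users / headcount


# Hypothetical inventory
inventory = {
    "order-service": {"metrics", "logs", "traces"},
    "cart-service": {"metrics", "logs", "traces"},
    "auth-service": {"metrics", "logs", "traces", "profiles"},
    "legacy-batch": {"logs"},
}

print(coverage_percent(inventory))                  # 75.0
print(weekly_active_ratio(130, 200) > 0.60)         # True
```

In this example coverage is 75%, short of the 90% target, and the inventory makes it obvious which service to onboard next.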

Series Conclusion

Over 15 parts, we’ve built a comprehensive understanding of modern observability with the Grafana stack:

| Part | Topic | Key Takeaway |
|---|---|---|
| 1 | The Observability Stack | LGTM architecture and how components interconnect |
| 2 | Instrumentation | OpenTelemetry as the universal instrumentation layer |
| 3 | Learning Environment | Docker Compose lab for hands-on practice |
| 4 | Loki & LogQL | Structured logging with efficient label-based queries |
| 5 | Mimir & PromQL | Time series metrics with powerful query language |
| 6 | Tempo & TraceQL | Distributed tracing for request flow analysis |
| 7 | Infrastructure & Cloud | Monitoring Kubernetes, cloud services, and infrastructure |
| 8 | Dashboards | Effective visualization design and dashboard patterns |
| 9 | Alerting & Incidents | SLO-based alerting with complete incident lifecycle |
| 10 | Infrastructure as Code | Terraform, Helm, Grafonnet, and GitOps workflows |
| 11 | Platform Architecture | Multi-tenant design, scaling, and cost optimization |
| 12 | Real User Monitoring | Grafana Faro for frontend-to-backend correlation |
| 13 | Pyroscope & k6 | Continuous profiling and load testing |
| 14 | DevOps Observability | DORA metrics, canary analysis, and chaos engineering |
| 15 | Troubleshooting & Best Practices | Systematic debugging and production operational patterns |

Final Thought: Observability is not a destination but a continuous practice. The tools and techniques in this series give you the foundation — but the real value comes from building a culture where engineers instrument by default, investigate with data, and treat operational excellence as a first-class engineering discipline. Start small, iterate based on real incidents, and let production reality guide your platform evolution.