Systematic Troubleshooting
Metrics → Logs → Traces Workflow
The most effective troubleshooting follows a systematic narrowing pattern: start broad with metrics, narrow with logs, pinpoint with traces, and, when the cause is still unclear, confirm it with profiles. This is the “golden path” through the LGTM stack:
flowchart TD
    A["Alert Fires<br/>Error rate > 5%"]
    M["Metrics: WHICH service?<br/>Dashboard shows order-service"]
    L["Logs: WHAT errors?<br/>LogQL: NullPointerException in PaymentHandler"]
    T["Traces: WHERE in the request?<br/>TraceQL: payment-service → database timeout"]
    P["Profiles: WHY is it slow?<br/>Pyroscope: connection pool exhausted"]
    F["Fix: Increase pool size<br/>Deploy → validate metrics recovered"]
    A --> M --> L --> T --> P --> F
# Step 1: Metrics — Identify the affected service
# PromQL: Which service has elevated error rate?
topk(5,
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
  * 100
)
# Result: order-service at 12% error rate
# Step 2: Logs — What errors are occurring?
# LogQL: Filter to the affected service and time window
{service="order-service"} | json | level="error"
| line_format "{{.error_type}}: {{.message}}"
# Result: "DatabaseException: Connection pool exhausted (max: 10, active: 10)"
# Step 3: Traces — Find an affected request
# TraceQL: Find slow traces with errors in order-service
{resource.service.name="order-service" && status=error && duration>2s}
# Result: Trace shows 8s waiting for database connection
# Step 4: Profiles — Why is the pool exhausted?
# Pyroscope: Check goroutine/thread profile for order-service
# Profile type: goroutines (Go) or threads (Java)
# Result: 200 goroutines blocked on sql.DB.conn() — pool too small for load
# Step 5: Fix and validate
# Increase pool: max_open_conns: 50
# Monitor: error rate drops to 0.1% within 2 minutes
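As a rough illustration of what the Step 1 query computes, the Python sketch below ranks services by their 5xx error percentage. The function name `top_error_rates` and the sample numbers are hypothetical; in a real deployment Mimir evaluates the PromQL server-side and nothing like this runs on the client.

```python
from typing import Dict, List, Tuple


def top_error_rates(
    total_rps: Dict[str, float],
    error_rps: Dict[str, float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k services with the highest 5xx error percentage."""
    rates = []
    for service, total in total_rps.items():
        if total <= 0:
            continue  # no traffic: the ratio is undefined, so skip it
        # Unlike the PromQL division, services without any 5xx series
        # are kept here and reported as 0%.
        errors = error_rps.get(service, 0.0)
        rates.append((service, errors / total * 100))
    # Sort descending by error percentage, similar in spirit to topk()
    return sorted(rates, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical sample data (requests per second)
total = {"order-service": 50.0, "cart-service": 80.0, "auth-service": 0.0}
errors = {"order-service": 6.0, "cart-service": 0.4}
ranking = top_error_rates(total, errors)
print(ranking)
```

With this sample data, order-service comes out on top, which mirrors the 12% result in Step 1.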
Common Failure Scenarios
Common Failure Patterns & Resolution
| Symptom | First Signal | Root Cause Pattern | Resolution |
|---|---|---|---|
| Latency spike (all services) | Metrics: p99 jump | Shared dependency (DB, cache, DNS) | Check infrastructure metrics for the shared resource |
| Error rate spike (one service) | Metrics: 5xx count | Bad deploy, config change, dependency failure | Check deployment annotations, then downstream service health |
| Gradual memory growth | Metrics: container memory | Memory leak, unbounded cache, goroutine leak | Compare Pyroscope inuse_space profiles taken over 4+ hours |
| Intermittent timeouts | Traces: long spans | Connection pool exhaustion, network issues, GC pauses | Correlate with GC metrics and pool utilization |
| Data inconsistency | Logs: warning messages | Race condition, retry storms, eventual consistency lag | Trace the specific failing requests end-to-end |
| Alert storm | Multiple alerts fire | Cascading failure from upstream dependency | Find the root service using dependency graph in traces |
Grafana Debugging Tools
Grafana provides several built-in tools for efficient debugging:
- Explore mode — ad-hoc queries across all data sources without saving dashboards
- Split view — compare metrics on the left with logs or traces on the right
- Trace-to-logs — click a trace span to jump directly to correlated logs in Loki
- Logs-to-traces — click a traceID in log lines to open the full trace in Tempo
- Exemplars — click data points on metric graphs to jump to specific trace exemplars
- Flame graph — drill into Pyroscope profiles from a specific time range
# Enable trace-to-logs correlation in Grafana data source config
# Tempo data source settings:
datasources:
  - name: Tempo
    type: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki-uid
        tags: ['service.name', 'k8s.pod.name']
        mappedTags: [{ key: 'service.name', value: 'service' }]
        mapTagNamesEnabled: true
        spanStartTimeShift: '-1m'
        spanEndTimeShift: '1m'
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: mimir-uid
        tags: [{ key: 'service.name', value: 'service' }]
        queries:
          - name: 'Request Rate'
            query: 'sum(rate(http_requests_total{$$__tags}[5m]))'
          - name: 'Error Rate'
            query: 'sum(rate(http_requests_total{status=~"5..",$$__tags}[5m]))'
      tracesToProfiles:
        datasourceUid: pyroscope-uid
        tags: [{ key: 'service.name', value: 'service_name' }]
        profileTypeId: 'process_cpu:cpu:nanoseconds:cpu:nanoseconds'
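To show what the tag mapping and time shifts in this configuration are for, here is a simplified Python model of the trace-to-logs jump. It is only a sketch: `build_log_query` and `shifted_window` are hypothetical helpers, and the exact query Grafana generates may differ from the string built here.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def build_log_query(
    span_attrs: Dict[str, str],
    mapped_tags: List[Tuple[str, str]],
    trace_id: str,
) -> str:
    """Build a LogQL query from span attributes.

    mapped_tags pairs a span attribute key with the Loki label it maps to,
    e.g. ("service.name", "service"). Attributes missing on the span are skipped.
    """
    matchers = [
        f'{label}="{span_attrs[key]}"'
        for key, label in mapped_tags
        if key in span_attrs
    ]
    # The line filter models filterByTraceID: keep only lines mentioning the trace
    return "{" + ", ".join(matchers) + "}" + f' |= "{trace_id}"'


def shifted_window(
    span_start: datetime,
    span_end: datetime,
    start_shift: timedelta = timedelta(minutes=-1),
    end_shift: timedelta = timedelta(minutes=1),
) -> Tuple[datetime, datetime]:
    """Widen the log search window, like spanStartTimeShift / spanEndTimeShift."""
    return span_start + start_shift, span_end + end_shift


# Hypothetical span attributes and trace ID
query = build_log_query(
    {"service.name": "order-service", "http.route": "/api/orders/{id}"},
    [("service.name", "service")],
    "4bf92f3577b34da6",
)
print(query)
```

The mapping matters because the span attribute is named `service.name` while the Loki label in the earlier LogQL example is `service`; without the rename, the generated selector would match nothing.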
Production Anti-Patterns
Alert Fatigue
Solution: Alert on symptoms (user impact), not causes (CPU usage). Use multi-signal alerts that require multiple conditions:
# BAD: Alert on CPU (cause, not symptom)
- alert: HighCPU
  expr: node_cpu_utilization > 80
  # This fires constantly and rarely indicates user impact

# GOOD: Alert on user-facing SLI degradation
- alert: OrderServiceDegraded
  expr: |
    (
      sum(rate(http_requests_total{service="order-service", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="order-service"}[5m]))
    ) > 0.01
    AND
    sum(rate(http_requests_total{service="order-service"}[5m])) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Order service error rate > 1% (with sufficient traffic)"
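The two-condition logic of the GOOD rule can be sketched in Python as follows. This is a simplification under stated assumptions: the function names are hypothetical, samples are assumed to arrive once per minute, and the `for: 5m` clause is approximated as "the condition held for the last five samples" rather than modelling the rule engine's pending state.

```python
from typing import List, Tuple


def condition_met(
    error_rps: float,
    total_rps: float,
    max_error_ratio: float = 0.01,
    min_rps: float = 10.0,
) -> bool:
    """Both signals must hold: enough traffic AND error ratio above threshold."""
    if total_rps <= min_rps:
        return False  # too little traffic for the ratio to be meaningful
    return error_rps / total_rps > max_error_ratio


def alert_fires(samples: List[Tuple[float, float]], hold: int = 5) -> bool:
    """samples holds (error_rps, total_rps) pairs, one per minute, oldest first.

    The alert fires only if the condition held for the last `hold` samples.
    """
    if len(samples) < hold:
        return False
    return all(condition_met(e, t) for e, t in samples[-hold:])


# 10% errors on 5 req/s: noisy ratio, not enough traffic to alert
print(condition_met(0.5, 5.0))
# 2% errors on 50 req/s, sustained for five minutes
print(alert_fires([(1.0, 50.0)] * 5))
```

The traffic floor is what keeps a single failed request on a quiet service at 3 a.m. from paging anyone.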
Dashboard Sprawl
Solution: Implement a dashboard taxonomy with ownership and lifecycle management:
| Level | Purpose | Audience | Count Target |
|---|---|---|---|
| L0: Business | Revenue, conversions, user satisfaction | Leadership, product | 3–5 total |
| L1: Service | RED metrics per service (Rate, Errors, Duration) | On-call engineers | 1 per service |
| L2: Component | Database, cache, queue internals | Service owners | 1 per component |
| L3: Debug | Ad-hoc investigation, temporary | Individual engineers | Auto-delete after 30 days |
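The L3 lifecycle rule lends itself to automation. The sketch below assumes a hypothetical dashboard inventory (a list of dicts with `title`, `level`, and `updated` fields, which is not a Grafana API shape) and interprets "after 30 days" as 30 days since the last update.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_debug_dashboards(
    dashboards: List[Dict], now: datetime, max_age_days: int = 30
) -> List[str]:
    """Return titles of L3 dashboards not updated within max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        d["title"]
        for d in dashboards
        if d["level"] == "L3" and d["updated"] < cutoff
    ]


# Hypothetical inventory; only L3 dashboards are candidates for deletion
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
inventory = [
    {"title": "Order Service RED", "level": "L1",
     "updated": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"title": "tmp pool debugging", "level": "L3",
     "updated": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"title": "tmp latency check", "level": "L3",
     "updated": datetime(2024, 6, 20, tzinfo=timezone.utc)},
]
expired = expired_debug_dashboards(inventory, now)
print(expired)
```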
Missing Context
Solution: Enforce mandatory context fields across all telemetry:
# Mandatory context for all telemetry signals
required_attributes:
  # Resource attributes (set once per service instance)
  resource:
    - service.name            # Which service
    - service.version         # Which version
    - deployment.environment  # prod/staging/dev
    - k8s.namespace.name      # Which namespace
    - k8s.pod.name            # Which pod (for debugging)
  # Span/log attributes (set per request)
  request:
    - http.method             # GET/POST/PUT
    - http.route              # /api/orders/{id} (NOT /api/orders/12345)
    - http.status_code        # 200, 500, etc.
    - user.id                 # WHO is affected (hashed for privacy)
  # Error context
  error:
    - error.type              # Exception class name
    - error.message           # Human-readable message
    - error.stack_trace       # Full stack (logs only, not metrics)
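One way to enforce these fields is a small validation step in CI or in a test harness. The sketch below is an assumption about how such a check might look, not part of any SDK: `missing_attributes` and `route_is_templated` are hypothetical helpers, and the raw-ID heuristic only catches purely numeric path segments.

```python
import re
from typing import Dict, List

REQUIRED_RESOURCE = (
    "service.name", "service.version", "deployment.environment",
    "k8s.namespace.name", "k8s.pod.name",
)
REQUIRED_REQUEST = ("http.method", "http.route", "http.status_code", "user.id")

# A path segment made only of digits suggests a raw ID instead of a template
RAW_ID_SEGMENT = re.compile(r"/\d+(?:/|$)")


def missing_attributes(resource: Dict[str, str], request: Dict[str, str]) -> List[str]:
    """List required keys absent from the resource or request attributes."""
    missing = [key for key in REQUIRED_RESOURCE if key not in resource]
    missing += [key for key in REQUIRED_REQUEST if key not in request]
    return missing


def route_is_templated(route: str) -> bool:
    """True when the route contains no purely numeric path segment."""
    return RAW_ID_SEGMENT.search(route) is None


# Hypothetical telemetry sample with two gaps
resource = {
    "service.name": "order-service",
    "service.version": "1.4.2",
    "deployment.environment": "prod",
    "k8s.namespace.name": "shop",
}
request = {"http.method": "GET", "http.route": "/api/orders/{id}", "http.status_code": "200"}
print(missing_attributes(resource, request))
print(route_is_templated("/api/orders/12345"))
```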
Cost Traps
Problem: a single high-cardinality label, for example user_id added as a metric label, can create 10M new series overnight and triple the Mimir bill.
Solution: Implement guardrails at multiple layers:
- Collector level: OTel Collector processors that drop/aggregate high-cardinality data before ingestion
- Backend level: Per-tenant limits in Mimir/Loki (max_global_series_per_user, ingestion_rate)
- Organizational level: Cost attribution dashboards showing spend by team, with budget alerts
- Review process: New instrumentation requires review (like code review) for cardinality impact
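For the review step, a back-of-the-envelope cardinality estimate is often enough to catch a problem before it ships. The sketch below uses the worst-case upper bound (the product of distinct values per label); the function names and the per-label threshold of 1,000 values are assumptions for illustration, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod
from typing import Dict, List


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


def risky_labels(label_cardinalities: Dict[str, int], max_values: int = 1000) -> List[str]:
    """Labels whose distinct-value count exceeds the review threshold."""
    return [name for name, count in label_cardinalities.items() if count > max_values]


# Hypothetical label set for http_requests_total
labels = {"service": 50, "status": 5, "method": 4}
before = worst_case_series(labels)

# Adding a per-user label multiplies the bound by the number of users
labels["user_id"] = 10_000
after = worst_case_series(labels)

print(before, after, risky_labels(labels))
```

Here a metric bounded at 1,000 series jumps to a bound of 10 million once `user_id` is added, which is the scale of incident described above.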
Operational Checklists
New Service Onboarding Checklist
Observability Readiness Checklist
| Category | Item | Validation |
|---|---|---|
| Metrics | RED metrics exposed (Rate, Errors, Duration) | PromQL query returns data for the service |
| Metrics | Resource metrics (CPU, memory, network) | cAdvisor/kubelet metrics visible |
| Metrics | Business metrics defined (if applicable) | Custom metrics documented and exposed |
| Logs | Structured JSON logging with correlation IDs | LogQL query returns parsed fields |
| Logs | Log levels configured (no DEBUG in prod) | Verify log volume is within budget |
| Traces | OTel SDK integrated with trace propagation | Traces visible in Tempo for the service |
| Traces | Span attributes include http.route, status_code | TraceQL filter works for the service |
| Alerts | SLO-based alerts defined (latency + error rate) | Alert rule exists in Grafana Alerting |
| Alerts | Runbook linked to each alert | Annotation URL resolves to runbook |
| Dashboard | L1 service dashboard created | Dashboard in correct team folder |
| On-call | Service registered in Grafana OnCall | Test alert routes to correct team |
| Ownership | Service ownership documented | Team label on all telemetry |
Incident Response Checklist
# Incident Response Workflow
incident_response:
  detection: # 0-2 minutes
    - Acknowledge alert in Grafana OnCall
    - Open service L1 dashboard
    - Check deployment annotations (was anything deployed?)
    - Identify blast radius (which services/users affected?)
  investigation: # 2-15 minutes
    - Open Grafana Explore in split view
    - Left panel: metrics showing the anomaly
    - Right panel: logs filtered by time + service
    - If request-level issue: find trace ID from logs, open in Tempo
    - If performance issue: check Pyroscope profiles for the time window
    - Check upstream/downstream service health
  mitigation: # 15-30 minutes
    - If bad deploy: rollback (kubectl rollout undo or revert PR)
    - If resource exhaustion: scale horizontally
    - If dependency failure: enable circuit breaker / fallback
    - If configuration issue: revert config change
  communication:
    - Update incident status in Grafana Incident
    - Post initial summary to stakeholder channel
    - Set severity level based on user impact
    - Provide estimated time to resolution
  resolution:
    - Confirm metrics return to baseline
    - Close incident with summary
    - Schedule blameless postmortem (within 48 hours)
    - Create follow-up tickets for permanent fixes
Weekly Platform Health Review
# Weekly review checklist for the observability platform team
weekly_review:
  capacity:
    - Check ingestion rates vs limits (should be < 70% of max)
    - Review active series count trend (growing > 10%/week needs investigation)
    - Verify storage utilization and retention policy execution
    - Check query performance (p99 dashboard load < 5s)
  cost:
    - Review per-tenant cost attribution dashboard
    - Identify top 3 cost growth drivers
    - Check for new high-cardinality metrics (> 100K series per metric)
    - Validate sampling rates are appropriate
  reliability:
    - Review meta-monitoring alerts (was the platform itself healthy?)
    - Check data completeness (any gaps in ingestion?)
    - Verify alert delivery latency (alert fired → notification received)
    - Test disaster recovery (can we restore from backup?)
  adoption:
    - Count new services onboarded this week
    - Review onboarding friction (any support tickets?)
    - Check dashboard creation/usage metrics
    - Gather feedback from engineering teams
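The two numeric capacity thresholds in this checklist are easy to script. The following is a minimal sketch, assuming the four inputs have already been pulled from the platform's own metrics; `capacity_findings` is a hypothetical helper and `series_last_week` is assumed to be greater than zero.

```python
from typing import List


def capacity_findings(
    ingest_rate: float,
    ingest_limit: float,
    series_last_week: int,
    series_now: int,
) -> List[str]:
    """Flag the capacity items from the weekly review that exceed their thresholds."""
    findings = []
    utilization = ingest_rate / ingest_limit
    if utilization >= 0.70:
        findings.append(f"ingestion at {utilization:.0%} of limit (target < 70%)")
    growth = (series_now - series_last_week) / series_last_week
    if growth > 0.10:
        findings.append(f"active series grew {growth:.0%} week over week (> 10%)")
    return findings


# Hypothetical week: ingestion is fine, series growth needs a closer look
for finding in capacity_findings(50_000, 100_000, 1_000_000, 1_200_000):
    print(finding)
```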
Reference Architecture
Complete System Diagram
flowchart TD
    subgraph Apps["Application Layer"]
        S1["Microservices<br/>(OTel SDK)"]
        S2["Infrastructure<br/>(Node Exporter)"]
        S3["Frontend<br/>(Grafana Faro)"]
    end
    subgraph Collection["Collection Layer"]
        AL["Grafana Alloy<br/>(DaemonSet)"]
        OC["OTel Collector<br/>(Gateway)"]
    end
    subgraph Backend["Backend Layer"]
        MI["Mimir<br/>(Metrics)"]
        LO["Loki<br/>(Logs)"]
        TE["Tempo<br/>(Traces)"]
        PY["Pyroscope<br/>(Profiles)"]
    end
    subgraph Storage["Storage Layer"]
        S3S["Object Storage<br/>(S3/GCS/Azure)"]
    end
    subgraph Viz["Visualization & Action"]
        GR["Grafana<br/>(Dashboards)"]
        GA["Grafana Alerting<br/>(Rules Engine)"]
        GO["Grafana OnCall<br/>(Notifications)"]
        GI["Grafana Incident<br/>(Lifecycle)"]
    end
    S1 --> AL
    S2 --> AL
    S3 --> OC
    AL --> OC
    OC --> MI
    OC --> LO
    OC --> TE
    OC --> PY
    MI --> S3S
    LO --> S3S
    TE --> S3S
    PY --> S3S
    GR --> MI
    GR --> LO
    GR --> TE
    GR --> PY
    GA --> GR
    GA --> GO
    GO --> GI
Component Versions & Compatibility
| Component | Recommended Version | Deployment Mode | Key Dependency |
|---|---|---|---|
| Grafana | 11.x+ | Deployment (stateless) | PostgreSQL / MySQL for state |
| Mimir | 2.14+ | Microservices (StatefulSets) | S3-compatible object storage |
| Loki | 3.x+ | Simple Scalable Deployment (SSD) | S3-compatible object storage |
| Tempo | 2.6+ | Distributed (microservices) | S3-compatible object storage |
| Pyroscope | 1.8+ | Single binary or microservices | S3-compatible object storage |
| Alloy | 1.4+ | DaemonSet + Gateway Deployment | Configured via Alloy configuration syntax (formerly River) |
| OTel Collector | 0.100+ | DaemonSet + Gateway | Configured via YAML |
Sizing Guide
| Scale | Active Series | Log Volume | Services | Mimir Ingesters | Loki Write | Object Storage |
|---|---|---|---|---|---|---|
| Small | < 500K | < 50 GB/day | 10–50 | 3 × 8GB RAM | 3 × 4GB RAM | ~2 TB/year |
| Medium | 500K–5M | 50–500 GB/day | 50–200 | 6 × 16GB RAM | 6 × 8GB RAM | ~20 TB/year |
| Large | 5M–50M | 500 GB–5 TB/day | 200–1000 | 12 × 32GB RAM | 12 × 16GB RAM | ~200 TB/year |
| XL | > 50M | > 5 TB/day | 1000+ | 24+ × 64GB RAM | 24+ × 32GB RAM | ~2 PB/year |
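Reading the table as a lookup, a deployment falls into the smallest tier whose limits it stays under on both axes. The sketch below encodes that reading; `sizing_tier` is a hypothetical helper, and placing exact boundary values (for example, exactly 5M series) in the larger tier is an assumption, since the table leaves boundaries ambiguous.

```python
from typing import List, Tuple

# (tier name, active series upper bound, log volume upper bound in GB/day)
TIERS: List[Tuple[str, int, int]] = [
    ("Small", 500_000, 50),
    ("Medium", 5_000_000, 500),
    ("Large", 50_000_000, 5_000),
]


def sizing_tier(active_series: int, log_gb_per_day: float) -> str:
    """Pick the smallest tier that fits both the series count and the log volume."""
    for name, max_series, max_gb in TIERS:
        if active_series < max_series and log_gb_per_day < max_gb:
            return name
    return "XL"


# Few series but heavy logging: the log volume pushes this into Medium
print(sizing_tier(300_000, 200))
```

Sizing by the more demanding dimension avoids under-provisioning Loki for a log-heavy fleet that happens to emit few metrics.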
Driving Adoption
Champions Program
Successful observability platforms are adopted bottom-up. Identify and empower “observability champions” in each team:
- Identify: Find engineers who are naturally curious about system behavior and already use monitoring tools
- Train: Provide advanced workshops on PromQL, LogQL, TraceQL, and dashboard design
- Empower: Give them early access to new features and a direct channel to the platform team
- Recognize: Highlight their contributions in engineering all-hands and internal blogs
- Scale: Champions train their own teams, creating a multiplicative effect
Internal Documentation
The platform is only as good as its documentation. Maintain these living documents:
| Document | Audience | Content |
|---|---|---|
| Getting Started Guide | New developers | 5-minute quickstart: add SDK, see first data in Grafana |
| Instrumentation Standards | All engineers | Required attributes, naming conventions, sampling policies |
| Dashboard Cookbook | Service owners | Templates for L1/L2 dashboards with PromQL/LogQL examples |
| Alerting Playbook | On-call engineers | Alert definitions, escalation policies, runbook templates |
| Troubleshooting Guide | On-call engineers | Common failure patterns and resolution steps (this article!) |
| Platform Roadmap | All stakeholders | Upcoming features, migration plans, deprecations |
Measuring Success
# Platform success metrics
adoption_metrics:
  coverage:
    - "% of production services emitting all 3 signals (metrics, logs, traces)"
    - target: "> 90% within 6 months of platform launch"
  usage:
    - "Weekly active Grafana users / total engineering headcount"
    - target: "> 60%"
  quality:
    - "Mean Time to Detection (MTTD) for production incidents"
    - target: "< 5 minutes (down from 15+ pre-platform)"
  efficiency:
    - "Mean Time to Resolution (MTTR) for P1 incidents"
    - target: "< 1 hour (down from 4+ pre-platform)"
  satisfaction:
    - "Quarterly developer experience survey score"
    - target: "> 4.0 / 5.0"
  cost:
    - "Observability cost per monitored service per month"
    - target: "< $50/service/month at scale"
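Two of these metrics, signal coverage and MTTR, can be computed from data most teams already have. The sketch below assumes a hypothetical service inventory mapping each service to the set of signals it emits, plus a list of incident durations in minutes; neither shape comes from a specific tool.

```python
from statistics import mean
from typing import Dict, Iterable, Set

REQUIRED_SIGNALS = {"metrics", "logs", "traces"}


def signal_coverage(services: Dict[str, Set[str]]) -> float:
    """Percentage of services emitting all three required signals."""
    if not services:
        return 0.0
    covered = sum(1 for signals in services.values() if REQUIRED_SIGNALS <= signals)
    return covered / len(services) * 100


def mean_minutes(durations_minutes: Iterable[float]) -> float:
    """Mean duration in minutes, e.g. time to resolution across P1 incidents."""
    return mean(durations_minutes)


# Hypothetical inventory: one service has not adopted tracing yet
fleet = {
    "order-service": {"metrics", "logs", "traces"},
    "cart-service": {"metrics", "logs", "traces", "profiles"},
    "auth-service": {"metrics", "logs"},
    "search-service": {"metrics", "logs", "traces"},
}
print(signal_coverage(fleet))
print(mean_minutes([30, 90, 45]))
```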
Series Conclusion
Over 15 parts, we’ve built a comprehensive understanding of modern observability with the Grafana stack:
| Part | Topic | Key Takeaway |
|---|---|---|
| 1 | The Observability Stack | LGTM architecture and how components interconnect |
| 2 | Instrumentation | OpenTelemetry as the universal instrumentation layer |
| 3 | Learning Environment | Docker Compose lab for hands-on practice |
| 4 | Loki & LogQL | Structured logging with efficient label-based queries |
| 5 | Mimir & PromQL | Time series metrics with powerful query language |
| 6 | Tempo & TraceQL | Distributed tracing for request flow analysis |
| 7 | Infrastructure & Cloud | Monitoring Kubernetes, cloud services, and infrastructure |
| 8 | Dashboards | Effective visualization design and dashboard patterns |
| 9 | Alerting & Incidents | SLO-based alerting with complete incident lifecycle |
| 10 | Infrastructure as Code | Terraform, Helm, Grafonnet, and GitOps workflows |
| 11 | Platform Architecture | Multi-tenant design, scaling, and cost optimization |
| 12 | Real User Monitoring | Grafana Faro for frontend-to-backend correlation |
| 13 | Pyroscope & k6 | Continuous profiling and load testing |
| 14 | DevOps Observability | DORA metrics, canary analysis, and chaos engineering |
| 15 | Troubleshooting & Best Practices | Systematic debugging and production operational patterns |