Part 15: Observability & Troubleshooting

May 14, 2026 Wasil Zafar 43 min read

You can't fix what you can't see. Observability — metrics, traces, and logs — gives you the ability to understand system behaviour from its outputs. In distributed systems, the question isn't "will things fail?" but "how fast can you detect and diagnose the failure?"

Table of Contents

  1. Three Pillars of Observability
  2. Prometheus & Metrics
  3. Grafana Dashboards
  4. Distributed Tracing
  5. Logging
  6. Kubernetes Troubleshooting
  7. Exercises
  8. Conclusion

Three Pillars of Observability

Metrics vs Logs vs Traces

The Three Pillars — Complementary Signals
flowchart LR
    subgraph Metrics ["Metrics (What)"]
        M1[Counters<br/>Gauges<br/>Histograms]
        M2[Low cardinality<br/>Aggregatable<br/>Cheap to store]
    end
    subgraph Logs ["Logs (Why)"]
        L1[Events<br/>Errors<br/>Audit trails]
        L2[High cardinality<br/>Context-rich<br/>Expensive to store]
    end
    subgraph Traces ["Traces (Where)"]
        T1[Request paths<br/>Latency breakdown<br/>Dependencies]
        T2[Per-request<br/>Cross-service<br/>Causal relationships]
    end
    Metrics -->|Alert triggers| Logs
    Logs -->|Correlate with| Traces
    Traces -->|Identify bottleneck| Metrics

| Pillar | Answers | Tools | Retention |
|---|---|---|---|
| Metrics | "How many requests? What's the error rate? Is CPU high?" | Prometheus, Datadog, CloudWatch | Months–years |
| Logs | "What happened? Why did it fail? What was the error message?" | Loki, EFK/ELK, CloudWatch Logs | Days–months |
| Traces | "Where did the request go? Which service is slow? What's the call chain?" | Jaeger, Tempo, Zipkin, X-Ray | Hours–days (sampled) |

Prometheus & Metrics

Architecture

Pull Model: Prometheus pulls (scrapes) metrics from its targets over HTTP. Each application exposes a /metrics endpoint in the Prometheus text format. This is the opposite of push-based systems (StatsD, Datadog Agent), where applications push metrics to a collector.
# ServiceMonitor: tell Prometheus what to scrape (via Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  labels:
    monitoring: enabled          # Matches Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payment               # Scrape Services with this label
  endpoints:
  - port: metrics                # Named port in the Service
    interval: 15s                # Scrape every 15 seconds
    path: /metrics               # Default path
  namespaceSelector:
    matchNames:
    - production
# What a /metrics endpoint looks like (Prometheus text format):
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/orders",status="200"} 14523
http_requests_total{method="GET",endpoint="/api/orders",status="500"} 23
http_requests_total{method="POST",endpoint="/api/orders",status="201"} 892

# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 9823
http_request_duration_seconds_bucket{le="0.05"} 12456
http_request_duration_seconds_bucket{le="0.1"} 13200
http_request_duration_seconds_bucket{le="0.5"} 14100
http_request_duration_seconds_bucket{le="1.0"} 14480
http_request_duration_seconds_bucket{le="+Inf"} 14523
http_request_duration_seconds_sum 1456.78
http_request_duration_seconds_count 14523

# TYPE memory_usage_bytes gauge
memory_usage_bytes{pod="payment-abc12"} 268435456
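
To make the pull model concrete, here is a minimal sketch of an application exposing its own /metrics endpoint using only the Python standard library. The `REQUEST_COUNTS` store, the `render_metrics` helper, and port 8000 are assumptions made for illustration; a real service would more likely use an official Prometheus client library, which also handles label escaping.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter store: {(method, endpoint, status): count}
REQUEST_COUNTS = {
    ("GET", "/api/orders", "200"): 14523,
    ("GET", "/api/orders", "500"): 23,
    ("POST", "/api/orders", "201"): 892,
}


def render_metrics(counts):
    """Render counters in the Prometheus text format (no label escaping)."""
    lines = ["# TYPE http_requests_total counter"]
    for (method, endpoint, status), value in sorted(counts.items()):
        labels = f'method="{method}",endpoint="{endpoint}",status="{status}"'
        lines.append(f"http_requests_total{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers scrapes: Prometheus issues GET /metrics on every interval."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

The application never needs to know where Prometheus runs. It only answers requests, which is why a ServiceMonitor that selects the right Service port is enough to start collecting.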

PromQL Essentials

# PromQL: the query language for Prometheus

# Request rate (requests per second over 5 minutes):
rate(http_requests_total[5m])

# Error rate percentage:
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# P99 latency (99th percentile):
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P95 latency by endpoint:
histogram_quantile(0.95,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)

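# CPU usage per pod (percentage of request):
# container!="" drops the pod-level cgroup series, which would double-count usage
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum by (pod) (kube_pod_container_resource_requests{resource="cpu"})
* 100

# Memory usage vs limit (aggregate both sides to shared labels so the series match):
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
* 100

# Top 5 pods by CPU:
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))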
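
The percentile queries above estimate quantiles from bucket counts instead of reading them directly. The sketch below reproduces that estimate (linear interpolation inside the bucket that contains the target rank) on the sample buckets from the /metrics output earlier. It is a simplified illustration, not Prometheus's implementation: it works on raw cumulative counts instead of rates, and it only accepts 0 < q <= 1.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan  # Need at least one finite bucket plus +Inf
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # Position of the target observation
    lower_bound, lower_count = 0.0, 0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if upper_bound == math.inf:
                # Target lies beyond the last finite bucket: report that bound
                return buckets[index - 1][0]
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Cumulative buckets copied from the example /metrics output
SAMPLE_BUCKETS = [
    (0.01, 9823), (0.05, 12456), (0.1, 13200),
    (0.5, 14100), (1.0, 14480), (math.inf, 14523),
]

p95 = histogram_quantile(0.95, SAMPLE_BUCKETS)
p99 = histogram_quantile(0.99, SAMPLE_BUCKETS)
```

With these buckets the P95 estimate lands at roughly 0.365 s, inside the 0.1 to 0.5 s bucket. Because the estimate can only be as precise as the bucket it falls into, bucket boundaries should sit close to the latency thresholds you alert on.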

Alerting

# PrometheusRule: alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-alerts
  labels:
    monitoring: enabled
spec:
  groups:
  - name: payment.slo
    rules:
    # High error rate:
    - alert: PaymentHighErrorRate
      expr: |
        sum(rate(http_requests_total{job="payment",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{job="payment"}[5m])) > 0.01
      for: 5m
      labels:
        severity: critical
        team: payments
      annotations:
        summary: "Payment service error rate > 1% for 5 minutes"
        runbook: "https://wiki.internal/runbooks/payment-errors"
    # High latency:
    - alert: PaymentHighLatency
      expr: |
        histogram_quantile(0.95,
          sum by (le) (rate(http_request_duration_seconds_bucket{job="payment"}[5m]))
        ) > 0.5
      for: 5m
      labels:
        severity: warning
        team: payments
      annotations:
        summary: "Payment P95 latency > 500ms"
    # Pod restarts:
    - alert: PodCrashLooping
      expr: |
        increase(kube_pod_container_status_restarts_total[15m]) > 5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} restarting > 5 times in 15m"
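
The `for:` field in these rules is easy to misread: a rule does not fire the first time its expression is true. The following sketch models the inactive, pending, and firing states for a single series, assuming evenly spaced evaluations. The function name and the sample data are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each (timestamp, value) evaluation.

    The alert is 'pending' while the condition holds but has not yet held
    for `for_seconds`, and 'firing' once it has held that long without a gap.
    """
    states = []
    active_since = None  # Timestamp at which the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # Any false evaluation resets the timer
            states.append("inactive")
    return states


# Error ratio evaluated once a minute; rule is "> 0.01 for 5m"
error_ratios = [0.005, 0.02, 0.03, 0.02, 0.02, 0.02, 0.02, 0.005]
samples = [(i * 60, ratio) for i, ratio in enumerate(error_ratios)]
states = alert_states(samples, threshold=0.01, for_seconds=300)
```

In this run the condition becomes true at t=60s but the alert only fires at t=360s, and a single healthy evaluation at t=420s clears it. That delay is the trade-off behind `for:`: fewer pages from short spikes, slower detection of real incidents.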

Grafana Dashboards

Golden Signals & RED Method

Google SRE Golden Signals:
  1. Latency — how long requests take (p50, p95, p99)
  2. Traffic — how much demand (requests/second)
  3. Errors — how many requests fail (error rate %)
  4. Saturation — how full the system is (CPU, memory, disk, queue depth)
# RED Method (for microservices — per service):
# Rate:     requests per second
# Errors:   error rate (% of requests failing)
# Duration: latency distribution (p50, p95, p99)

# USE Method (for infrastructure — per resource):
# Utilization: % of resource capacity being used
# Saturation:  amount of work queued/waiting
# Errors:      count of error events

# Key Grafana dashboard panels for a Kubernetes service:
# Row 1: Request Rate | Error Rate | P95 Latency | P99 Latency
# Row 2: CPU Usage vs Request | Memory Usage vs Limit | Pod Count
# Row 3: Pod Restarts | OOMKilled Events | Network I/O
# Row 4: Database Connections | Cache Hit Rate | Queue Depth
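
As a worked example of the RED method, the sketch below derives rate, error percentage, and latency percentiles from a window of raw request records. The record shape `(status_code, duration_seconds)` and the nearest-rank percentile are assumptions chosen for illustration; in practice these numbers come from PromQL over counters and histograms.

```python
import math


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, Duration for (status_code, duration_s) records."""
    total = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(duration for _, duration in requests)

    def percentile(p):
        # Nearest-rank percentile: smallest value covering p% of observations
        if not durations:
            return None
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    return {
        "rate_per_second": total / window_seconds,
        "error_percent": 100 * errors / total if total else 0.0,
        "p50_seconds": percentile(50),
        "p95_seconds": percentile(95),
    }


# Ten requests observed over a five-second window, two of them failing
window = [
    (200, 0.01), (200, 0.02), (200, 0.03), (200, 0.04), (200, 0.05),
    (200, 0.06), (200, 0.07), (200, 0.08), (500, 0.09), (503, 0.10),
]
summary = red_summary(window, window_seconds=5)
```

Here the service handles 2 requests per second with a 20% error rate, which maps directly onto the first dashboard row listed above.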

Distributed Tracing

OpenTelemetry

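OpenTelemetry (OTel) is the CNCF standard for collecting telemetry: a vendor-neutral set of APIs, SDKs, and the OTLP wire protocol covering metrics, traces, and logs, with SDKs available for most major languages. The OpenTelemetry Collector receives that telemetry, processes it, and forwards it to backends: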

# OpenTelemetry Collector: receive, process, export telemetry
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      resource:
        attributes:
        - key: cluster
          value: production
          action: upsert
    exporters:
      # Send traces to Jaeger:
      otlp/jaeger:
        endpoint: jaeger-collector.observability:4317
        tls:
          insecure: true
      # Send metrics to Prometheus:
      prometheus:
        endpoint: 0.0.0.0:8889
      # Send logs to Loki:
      loki:
        endpoint: http://loki.observability:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
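
The Collector only moves spans around. What ties spans from different services into one trace is context propagation, and OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header. The sketch below builds and parses that header by hand to show the mechanism. The helper names are hypothetical, and in a real service the OpenTelemetry SDK does this for you.

```python
import re
import secrets

# Version 00 format: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled=True):
    """Start a new trace: random trace id and span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # All-zero ids are invalid per the specification
    return trace_id, span_id, flags


def child_traceparent(incoming_header):
    """Header to send downstream: same trace id, fresh span id."""
    parsed = parse_traceparent(incoming_header)
    if parsed is None:
        return new_traceparent()  # Broken context: start a new trace
    trace_id, _, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Every service that forwards the header keeps the trace id and replaces the span id with its own, which is what lets a backend such as Jaeger reassemble the call chain.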

Trace Backends

| Backend | Storage | Strengths | Best For |
|---|---|---|---|
| Jaeger | Elasticsearch, Cassandra, Kafka | Rich UI, adaptive sampling | Production debugging |
| Tempo (Grafana) | Object storage (S3/GCS) | Cost-effective, Grafana-native | Grafana stack users |
| Zipkin | Elasticsearch, MySQL | Simple, lightweight | Small deployments |
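
Whichever backend you pick, traces are usually sampled to keep storage affordable. One common approach is ratio-based head sampling, where the keep-or-drop decision is derived from the trace id so that every service reaches the same verdict for the same trace. The sketch below illustrates the idea only. It is not the algorithm of any particular SDK or backend, and `should_sample` is a hypothetical name.

```python
def should_sample(trace_id_hex, ratio):
    """Deterministically keep roughly `ratio` of traces, based on the trace id."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace id as a uniform random number
    low_bits = int(trace_id_hex[-16:], 16)
    bound = int(ratio * (1 << 64))
    return low_bits < bound


# The same trace id always yields the same decision, in every service
decision_a = should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.10)
decision_b = should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.10)
```

Because the decision depends only on the trace id and the ratio, no coordination between services is needed, but rare failures can be dropped along with the healthy traffic. Tail-based or adaptive sampling addresses that at the cost of buffering spans before deciding.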

Logging

Structured Logging

// Structured log (JSON) — machine-parseable:
{
  "timestamp": "2026-05-14T10:30:45.123Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error": "insufficient_funds",
  "customer_id": "cust_12345",
  "amount": 99.99,
  "currency": "USD",
  "duration_ms": 234,
  "kubernetes": {
    "pod": "payment-abc12",
    "namespace": "production",
    "node": "worker-3"
  }
}

// vs Unstructured log — hard to search/filter:
// 2026-05-14 10:30:45 ERROR Payment processing failed for customer
// cust_12345: insufficient_funds (amount: $99.99)
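
Producing logs in this shape does not require a special library. Below is a minimal sketch using Python's standard `logging` module. `JsonFormatter`, `build_logger`, and the convention of passing business fields under `extra={"fields": ...}` are assumptions for this example, not an established API.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream, service="payment-service"):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(f"sketch.{service}")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # Stand-in for stdout, which the node agent would tail
log = build_logger(buffer)
log.error(
    "Payment processing failed",
    extra={"fields": {"trace_id": "abc123def456", "error": "insufficient_funds",
                      "customer_id": "cust_12345", "duration_ms": 234}},
)
```

Including `trace_id` in every line is what makes it possible to jump from a log entry to the matching trace.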

Log Aggregation

# Grafana Loki: lightweight log aggregation (labels, not full-text index)
# Deployed via Helm:
# helm install loki grafana/loki-stack --set promtail.enabled=true

# Loki uses labels for indexing (like Prometheus):
# {namespace="production", app="payment"}

# LogQL queries (similar to PromQL):
# All error logs from payment service:
# {app="payment"} |= "error"

# JSON parsing and filtering:
# {app="payment"} | json | level="error" | duration_ms > 1000

# Count errors per minute, per pod:
# sum by (pod) (count_over_time({app="payment"} |= "error" [1m]))
---
# Alternative: EFK Stack (Elasticsearch + Fluentd + Kibana)
# - Elasticsearch: full-text search, aggregations
# - Fluentd/Fluent Bit: log collection from pods
# - Kibana: visualization and search

# Fluent Bit DaemonSet config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Daemon       Off
        Log_Level    info
    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Parser       docker
        Tag          kube.*
    [FILTER]
        Name         kubernetes
        Match        kube.*
        Merge_Log    On
    [OUTPUT]
        Name         es
        Match        *
        Host         elasticsearch.logging.svc
        Port         9200
        Index        logs-%Y.%m.%d
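
To see what a LogQL pipeline such as `{app="payment"} | json | level="error" | duration_ms > 1000` does to each line, here is a rough local equivalent. It is a sketch for building intuition and says nothing about how Loki executes queries. `filter_logs` and the sample lines are hypothetical.

```python
import json


def filter_logs(lines, level="error", min_duration_ms=1000):
    """Parse JSON log lines, then keep slow entries at the given level."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)  # The "| json" stage
        except json.JSONDecodeError:
            continue  # Lines without parsed fields cannot match field filters
        if not isinstance(entry, dict):
            continue
        duration = entry.get("duration_ms")
        if (
            entry.get("level") == level  # The level="error" stage
            and isinstance(duration, (int, float))
            and duration > min_duration_ms  # The duration_ms > 1000 stage
        ):
            matches.append(entry)
    return matches


sample_lines = [
    '{"level": "error", "message": "Payment processing failed", "duration_ms": 2340}',
    '{"level": "error", "message": "Card declined", "duration_ms": 234}',
    '{"level": "info", "message": "Payment accepted", "duration_ms": 1800}',
    "2026-05-14 10:30:45 ERROR unstructured line",
]
slow_errors = filter_logs(sample_lines)
```

Only the first line survives all three stages. The unstructured line is dropped at the parsing stage, which is the practical argument for structured logging made earlier.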

Kubernetes Troubleshooting

Debugging Pods

# Systematic pod troubleshooting flow:

# 1. Check pod status:
kubectl get pods -n production
# NAME          READY   STATUS             RESTARTS   AGE
# payment-x1    1/1     Running            0          5d
# payment-x2    0/1     CrashLoopBackOff   5          10m  ← problem
# payment-x3    0/1     ImagePullBackOff   0          2m   ← problem

# 2. Describe for events:
kubectl describe pod payment-x2 -n production
# Last State:  Terminated
#   Reason:    OOMKilled
#   Exit Code: 137
# Events:
#   Warning  BackOff  2m  kubelet  Back-off restarting failed container

# 3. Check logs (current and previous crash):
kubectl logs payment-x2 -n production
kubectl logs payment-x2 -n production --previous  # Previous crash logs

# 4. Check resource usage:
kubectl top pod payment-x2 -n production
# NAME         CPU(cores)   MEMORY(bytes)
# payment-x2   450m         498Mi        ← near 512Mi limit = OOM

# 5. Exec into running pod (if possible):
kubectl exec -it payment-x1 -n production -- /bin/sh
# Check disk, network, processes, config files

# 6. Ephemeral debug container (for distroless/scratch images):
kubectl debug -it payment-x2 -n production --image=busybox --target=payment
# Shares PID namespace with target container — can inspect processes
# Common pod failure states and causes:

# CrashLoopBackOff:
# - Application crash (check logs --previous)
# - OOMKilled (increase memory limit)
# - Liveness probe failing (check probe config)
# - Missing config/secret (check env vars and mounts)

# ImagePullBackOff:
# - Wrong image name/tag
# - Private registry without imagePullSecrets
# - Image doesn't exist (was it pushed?)

# Pending:
# - Insufficient resources (no node fits requests)
# - Unsatisfiable nodeSelector/affinity
# - PVC not bound (storage issue)
# - Taint not tolerated

# Init:Error / Init:CrashLoopBackOff:
# - Init container failing (kubectl logs pod -c init-container-name)

# Terminating (stuck):
# - Finalizer blocking deletion
# - Force delete: kubectl delete pod X --grace-period=0 --force
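
The failure states above can be turned into a small triage lookup that suggests which command to run first. This is a sketch of the idea with hypothetical names (`TRIAGE`, `next_steps`). The commands are the ones used in this section, plus `kubectl get pvc` for the storage case, and `<init-container-name>` is a placeholder you fill in yourself.

```python
# First commands to try for each pod status, following the list above
TRIAGE = {
    "CrashLoopBackOff": [
        "kubectl logs {pod} -n {ns} --previous",
        "kubectl describe pod {pod} -n {ns}",
    ],
    "ImagePullBackOff": [
        "kubectl describe pod {pod} -n {ns}",
    ],
    "Pending": [
        "kubectl describe pod {pod} -n {ns}",
        "kubectl get pvc -n {ns}",
    ],
    "Init": [
        "kubectl logs {pod} -n {ns} -c <init-container-name>",
    ],
    "Terminating": [
        "kubectl describe pod {pod} -n {ns}",
    ],
}


def next_steps(status, pod, ns):
    """Return suggested commands for a pod status as shown by kubectl get pods."""
    # Init:Error and Init:CrashLoopBackOff share the same first step
    key = "Init" if status.startswith("Init:") else status
    templates = TRIAGE.get(key, ["kubectl describe pod {pod} -n {ns}"])
    return [template.format(pod=pod, ns=ns) for template in templates]


steps = next_steps("CrashLoopBackOff", "payment-x2", "production")
```

For the crashing pod from the earlier output, the first suggestion is to read the logs of the previous container instance, since the current one may not have produced any output yet.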

Debugging Networking

# Network debugging toolkit:

# 1. DNS resolution:
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup payment.production.svc.cluster.local
# Server:    10.96.0.10  (CoreDNS)
# Address:   10.96.0.10:53
# Name:      payment.production.svc.cluster.local
# Address:   10.100.45.67

# 2. Service connectivity:
kubectl run curl-test --image=curlimages/curl --rm -it -- \
  curl -v http://payment.production.svc:8080/health
# If timeout → check Network Policies, Service selector, endpoint

# 3. Check endpoints (does Service have backends?):
kubectl get endpoints payment -n production
# NAME      ENDPOINTS                           AGE
# payment   10.244.1.5:8080,10.244.2.8:8080    5d

# No endpoints? → Labels don't match Service selector:
kubectl get pods -n production -l app=payment --show-labels

# 4. Check Network Policies:
kubectl get networkpolicies -n production
kubectl describe networkpolicy default-deny -n production

# 5. Port-forward for direct testing:
kubectl port-forward svc/payment 8080:8080 -n production
# Access at http://localhost:8080 — bypasses ingress/network policies

# 6. Packet capture (advanced):
kubectl debug -it payment-x1 -n production --image=nicolaka/netshoot -- tcpdump -i eth0 port 8080
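
The "no endpoints" case in step 3 comes down to label matching: a pod backs a Service only if it carries every key and value in the Service's selector. The sketch below reproduces that check for equality-based selectors. The function names and the pod data are made up for illustration, and readiness is ignored.

```python
def selector_matches(selector, pod_labels):
    """True if the pod carries every label the Service selector requires."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def backends(selector, pods):
    """Names of pods that would back a Service with this selector."""
    if not selector:
        return []  # A Service without a selector gets no endpoints automatically
    return sorted(name for name, labels in pods.items()
                  if selector_matches(selector, labels))


pods = {
    "payment-x1": {"app": "payment", "version": "v2"},
    "payment-x2": {"app": "payment", "version": "v2"},
    "payments-canary": {"app": "payments", "version": "v3"},  # Typo in label
}

matched = backends({"app": "payment"}, pods)
too_strict = backends({"app": "payment", "tier": "backend"}, pods)
```

The canary pod is skipped because of a one-letter difference in its label, and the second selector matches nothing because no pod has a `tier` label. Both are typical causes of an empty endpoints list.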

Debugging Performance

# Performance investigation workflow:

# 1. Identify slow service (from traces or metrics):
#    P95 latency payment-service: 2.3s (SLO: <500ms)

# 2. Check resource saturation:
kubectl top pods -n production -l app=payment
# NAME         CPU(cores)   MEMORY(bytes)
# payment-x1   980m         450Mi   ← CPU near limit (1000m)
# payment-x2   950m         440Mi   ← CPU throttled!
# payment-x3   920m         430Mi

# 3. Check CPU throttling (cgroup metrics):
# container_cpu_cfs_throttled_periods_total high → increase CPU limit

# 4. Check if HPA is maxed:
kubectl get hpa payment-hpa -n production
# NAME          TARGETS   MINPODS   MAXPODS   REPLICAS
# payment-hpa   95%/70%   3         10        10    ← at max, still overloaded

# 5. Look at dependencies (database, cache):
# PromQL: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))
# If database P95 is high → slow queries causing upstream latency

# 6. Check for noisy neighbours on the same node:
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-3
# Other pods on the same node consuming resources?

# Resolution:
# - Increase CPU limits / pod count
# - Scale HPA maxReplicas
# - Optimise application (profiling, caching)
# - Add pod anti-affinity to spread load
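
Step 4 is easier to interpret with the Horizontal Pod Autoscaler's scaling rule in mind: desired replicas are the current replicas multiplied by the ratio of current to target metric, rounded up and then clamped to the configured bounds. The sketch below applies that rule to the numbers in the HPA output above. It is a simplification that ignores readiness, missing metrics, and stabilization windows, and the 10% tolerance is the documented default, assumed here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Return (wanted, applied) replica counts for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        wanted = current_replicas  # Close enough to target: no change
    else:
        wanted = math.ceil(current_replicas * ratio)
    # The HPA never scales outside its configured bounds
    applied = max(min_replicas, min(max_replicas, wanted))
    return wanted, applied


# payment-hpa: 95% utilisation against a 70% target, 10 replicas, max 10
wanted, applied = desired_replicas(10, 95, 70, min_replicas=3, max_replicas=10)
```

The rule asks for 14 replicas but only 10 are allowed, so the service stays overloaded until `maxReplicas` is raised or the per-pod load drops.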

Exercises

Exercise 1 — Prometheus Setup: Deploy kube-prometheus-stack (via Helm). Explore the built-in dashboards. Write PromQL queries to find: (a) cluster CPU utilisation, (b) top 5 memory-consuming pods, (c) pod restart count over last hour. Create a custom alert rule for high memory usage.
Exercise 2 — Distributed Tracing: Deploy Jaeger (all-in-one). Instrument a sample multi-service app with OpenTelemetry SDK. Make requests and explore traces in the Jaeger UI. Identify: which service is the bottleneck? What's the critical path latency?
Exercise 3 — Log Aggregation: Deploy Loki + Promtail (or Fluent Bit + Elasticsearch). Deploy an app that produces structured JSON logs. Query logs by label, filter by error level, and correlate with pod restarts shown in Prometheus metrics.
Exercise 4 — Troubleshooting Lab: Intentionally break a deployment in 3 ways: (a) set memory limit too low (OOMKill), (b) reference a non-existent image, (c) point health check at wrong port. Practice the systematic debugging flow to identify and fix each issue using only kubectl commands.

Conclusion

Observability is what makes distributed systems operable. Without it, you're debugging blind in production. The key principles:

  • Metrics for detecting problems (Prometheus + alerting → "something is wrong")
  • Traces for locating problems (OpenTelemetry → "the problem is in service X, function Y")
  • Logs for understanding problems (structured JSON → "the error was Z because of W")
  • Golden Signals (latency, traffic, errors, saturation) on every service dashboard
  • Systematic debugging (get pods, describe, logs, top, exec, debug) in that order

In Part 16, our finale, we'll cover the Cloud Native Ecosystem — CI/CD with GitOps, Helm charts, Kustomize, multi-cluster management, FinOps, and the CNCF landscape that ties everything together.