Three Pillars of Observability
Metrics vs Logs vs Traces
flowchart LR
    subgraph Metrics ["Metrics (What)"]
        M1["Counters<br/>Gauges<br/>Histograms"]
        M2["Low cardinality<br/>Aggregatable<br/>Cheap to store"]
    end
    subgraph Logs ["Logs (Why)"]
        L1["Events<br/>Errors<br/>Audit trails"]
        L2["High cardinality<br/>Context-rich<br/>Expensive to store"]
    end
    subgraph Traces ["Traces (Where)"]
        T1["Request paths<br/>Latency breakdown<br/>Dependencies"]
        T2["Per-request<br/>Cross-service<br/>Causal relationships"]
    end
    Metrics -->|Alert triggers| Logs
    Logs -->|Correlate with| Traces
    Traces -->|Identify bottleneck| Metrics
| Pillar | Questions it answers | Example tools | Typical retention |
|---|---|---|---|
| Metrics | "How many requests? What's the error rate? Is CPU high?" | Prometheus, Datadog, CloudWatch | Months–years |
| Logs | "What happened? Why did it fail? What was the error message?" | Loki, EFK/ELK, CloudWatch Logs | Days–months |
| Traces | "Where did the request go? Which service is slow? What's the call chain?" | Jaeger, Tempo, Zipkin, X-Ray | Hours–days (sampled) |
Prometheus & Metrics
Architecture
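Prometheus uses a pull model: it periodically scrapes each target over HTTP, and every target exposes a /metrics endpoint in the Prometheus text format. This is the opposite of push-based systems (StatsD, Datadog Agent), where applications push metrics to a collector.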
# ServiceMonitor: tell Prometheus what to scrape (via Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  labels:
    monitoring: enabled    # Matches Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payment         # Scrape Services with this label
  endpoints:
    - port: metrics        # Named port in the Service
      interval: 15s        # Scrape every 15 seconds
      path: /metrics       # Default path
  namespaceSelector:
    matchNames:
      - production
# What a /metrics endpoint looks like (Prometheus text format):
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/orders",status="200"} 14523
http_requests_total{method="GET",endpoint="/api/orders",status="500"} 23
http_requests_total{method="POST",endpoint="/api/orders",status="201"} 892
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 9823
http_request_duration_seconds_bucket{le="0.05"} 12456
http_request_duration_seconds_bucket{le="0.1"} 13200
http_request_duration_seconds_bucket{le="0.5"} 14100
http_request_duration_seconds_bucket{le="1.0"} 14480
http_request_duration_seconds_bucket{le="+Inf"} 14523
http_request_duration_seconds_sum 1456.78
http_request_duration_seconds_count 14523
# TYPE memory_usage_bytes gauge
memory_usage_bytes{pod="payment-abc12"} 268435456
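To make the text format above concrete, here is a minimal Python sketch that parses sample lines into name, labels, and value. It is a simplified illustration, not the official parser: it assumes label values contain no escaped quotes and that lines carry no timestamps. The names `parse_metrics` and `EXAMPLE` are hypothetical.

```python
import re

# Simplified pattern for one sample line: metric name, optional {labels}, value.
# Assumption: no escaped quotes inside label values and no trailing timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


EXAMPLE = """\
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/orders",status="200"} 14523
http_requests_total{method="GET",endpoint="/api/orders",status="500"} 23
# TYPE memory_usage_bytes gauge
memory_usage_bytes{pod="payment-abc12"} 268435456
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

Each unique combination of metric name and label values is a separate time series, which is why label cardinality drives storage cost.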
PromQL Essentials
# PromQL: the query language for Prometheus
# Request rate (requests per second over 5 minutes):
rate(http_requests_total[5m])
# Error rate percentage:
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# P99 latency (99th percentile), aggregated across all series:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# P95 latency by endpoint:
histogram_quantile(0.95,
sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)
# CPU usage per pod (percentage of request):
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum by (pod) (kube_pod_container_resource_requests{resource="cpu"})
* 100
# Memory usage vs limit (aggregate both sides so their label sets match):
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
* 100
# Top 5 pods by CPU:
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
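The histogram_quantile queries above estimate a percentile from cumulative buckets rather than from raw observations. The Python sketch below illustrates the idea: find the bucket containing the target rank, then interpolate linearly inside it. It is an approximation of the documented behaviour, not Prometheus's implementation. It uses the raw bucket counts from the earlier /metrics sample instead of per-second rates, and it assumes the lowest bucket starts at 0. The function name and the `BUCKETS` constant are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets [(upper_bound, count), ...]."""
    buckets = sorted(buckets)
    total = buckets[-1][1]              # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total                    # position of the target observation
    prev_bound, prev_count = 0.0, 0     # assumption: the lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound       # rank is in +Inf: report the highest finite bound
            if count == prev_count:
                return bound            # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts copied from the http_request_duration_seconds sample.
BUCKETS = [
    (0.01, 9823), (0.05, 12456), (0.1, 13200),
    (0.5, 14100), (1.0, 14480), (math.inf, 14523),
]

p99 = histogram_quantile(0.99, BUCKETS)
print(f"estimated p99: {p99:.4f}s")
```

With these counts the 99th-percentile rank falls in the (0.5, 1.0] bucket, so the estimate can only be as precise as that bucket is narrow. Bucket boundaries near your SLO thresholds give more useful percentiles.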
Alerting
# PrometheusRule: alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-alerts
  labels:
    monitoring: enabled
spec:
  groups:
    - name: payment.slo
      rules:
        # High error rate:
        - alert: PaymentHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="payment",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payment"}[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment service error rate > 1% for 5 minutes"
            runbook: "https://wiki.internal/runbooks/payment-errors"
        # High latency:
        - alert: PaymentHighLatency
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="payment"}[5m]))
            ) > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment P95 latency > 500ms"
        # Pod restarts (increase() counts restarts over the whole 15m window):
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} restarted more than 5 times in 15m"
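The `for: 5m` field in the rules above means the expression must stay true on every evaluation for five minutes before the alert fires, which filters out short spikes. The Python sketch below models that inactive, pending, firing progression. It is a simplified illustration under stated assumptions (a fixed 60-second evaluation interval and made-up error ratios), not Prometheus's rule engine, and the names `AlertState` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    state: str = "inactive"                 # inactive -> pending -> firing
    active_since: Optional[float] = None    # time the condition first became true


def evaluate(alert, condition_true, now, for_seconds):
    """Advance the alert state for one rule evaluation at time `now` (seconds)."""
    if not condition_true:
        # A single false evaluation resets the timer.
        alert.state, alert.active_since = "inactive", None
        return alert.state
    if alert.active_since is None:
        alert.active_since = now
    held = now - alert.active_since
    alert.state = "firing" if held >= for_seconds else "pending"
    return alert.state


# Hypothetical error ratios sampled once per minute; the threshold is 1%.
ratios = [0.005, 0.02, 0.03, 0.02, 0.004, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
alert = AlertState()
states = [
    evaluate(alert, ratio > 0.01, now=i * 60, for_seconds=300)
    for i, ratio in enumerate(ratios)
]
print(states)
```

In this run the first burst lasts only three evaluations and never fires. The dip at the fifth sample resets the timer, and the alert fires only once the second burst has lasted the full five minutes.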
Grafana Dashboards
Golden Signals & RED Method
The four Golden Signals, from Google's SRE book, are the baseline for any user-facing service:
- Latency — how long requests take (p50, p95, p99)
- Traffic — how much demand (requests/second)
- Errors — how many requests fail (error rate %)
- Saturation — how full the system is (CPU, memory, disk, queue depth)
# RED Method (for microservices — per service):
# Rate: requests per second
# Errors: error rate (% of requests failing)
# Duration: latency distribution (p50, p95, p99)
# USE Method (for infrastructure — per resource):
# Utilization: % of resource capacity being used
# Saturation: amount of work queued/waiting
# Errors: count of error events
# Key Grafana dashboard panels for a Kubernetes service:
# Row 1: Request Rate | Error Rate | P95 Latency | P99 Latency
# Row 2: CPU Usage vs Request | Memory Usage vs Limit | Pod Count
# Row 3: Pod Restarts | OOMKilled Events | Network I/O
# Row 4: Database Connections | Cache Hit Rate | Queue Depth
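To show what the RED panels actually compute, here is a small Python sketch that derives Rate, Errors, and Duration from a batch of request records. In practice these numbers come from PromQL over counters and histograms. This version works on raw records only to make the arithmetic visible. The `Request` type, the nearest-rank percentile, and the sample data are all assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code
    duration_s: float    # request latency in seconds


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_s):
    """Return Rate, Errors, and Duration figures for one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_rps": total / window_s,
        "error_pct": 100 * errors / total if total else 0.0,
        "p50_s": percentile(durations, 50) if durations else None,
        "p95_s": percentile(durations, 95) if durations else None,
    }


# Hypothetical window: 20 requests in 10 seconds, two of them server errors.
requests = [
    Request(status=500 if i in (5, 15) else 200, duration_s=i / 100)
    for i in range(1, 21)
]
summary = red_summary(requests, window_s=10)
print(summary)
```

For this sample the summary reports 2 requests per second and a 10% error rate. Showing p50 next to p95 on a dashboard separates a slow tail from a service that is slow for everyone.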
Distributed Tracing
OpenTelemetry
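OpenTelemetry (OTel) is the CNCF project that standardizes telemetry collection. It provides vendor-neutral APIs, per-language SDKs, and the OTLP wire protocol for metrics, traces, and logs. The Collector sits between your applications and your backends: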
# OpenTelemetry Collector: receive, process, export telemetry
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      resource:
        attributes:
          - key: cluster
            value: production
            action: upsert
    exporters:
      # Send traces to Jaeger:
      otlp/jaeger:
        endpoint: jaeger-collector.observability:4317
        tls:
          insecure: true
      # Send metrics to Prometheus:
      prometheus:
        endpoint: 0.0.0.0:8889
      # Send logs to Loki:
      loki:
        endpoint: http://loki.observability:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          # memory_limiter goes first and batch last, the recommended processor order
          processors: [memory_limiter, resource, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
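The Collector only sees connected traces if every service forwards trace context on its outgoing calls. OpenTelemetry SDKs do this automatically, by default with the W3C Trace Context `traceparent` header. The Python sketch below shows what that header carries and how a child call keeps the trace ID while minting a new span ID. It is an illustration of the header format only, not a replacement for the SDK propagators, and the function names are hypothetical.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming):
    """Header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # no valid context: start a new trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()
child = child_traceparent(root)
print(root)
print(child)
```

The shared trace ID is what lets a backend such as Jaeger or Tempo stitch spans from different services into one request path.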
Trace Backends
| Backend | Storage | Strengths | Best For |
|---|---|---|---|
| Jaeger | Elasticsearch/OpenSearch, Cassandra (Kafka as an ingestion buffer) | Rich UI, adaptive sampling | Production debugging |
| Tempo (Grafana) | Object storage (S3/GCS) | Cost-effective, Grafana-native | Grafana stack users |
| Zipkin | Elasticsearch, MySQL | Simple, lightweight | Small deployments |
Logging
Structured Logging
// Structured log (JSON) — machine-parseable:
{
  "timestamp": "2026-05-14T10:30:45.123Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error": "insufficient_funds",
  "customer_id": "cust_12345",
  "amount": 99.99,
  "currency": "USD",
  "duration_ms": 234,
  "kubernetes": {
    "pod": "payment-abc12",
    "namespace": "production",
    "node": "worker-3"
  }
}
// vs Unstructured log — hard to search/filter:
// 2026-05-14 10:30:45 ERROR Payment processing failed for customer
// cust_12345: insufficient_funds (amount: $99.99)
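One way to produce logs shaped like the JSON example is a custom formatter on Python's standard `logging` module. The sketch below is a minimal version under assumptions: the `JsonFormatter` and `build_logger` names are hypothetical, and structured fields are passed through a `fields` key in `extra`, which is a convention chosen here rather than a logging standard.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields passed as extra={"fields": {...}} are merged in.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()   # stands in for stdout, which the node's log agent would collect
log = build_logger("payment-service", buffer)
log.error(
    "Payment processing failed",
    extra={"fields": {
        "trace_id": "abc123def456",
        "error": "insufficient_funds",
        "customer_id": "cust_12345",
        "duration_ms": 234,
    }},
)
print(buffer.getvalue(), end="")
```

Including `trace_id` in every entry is what allows jumping from a log line to the matching trace.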
Log Aggregation
# Grafana Loki: lightweight log aggregation (labels, not full-text index)
# Deployed via Helm:
# helm install loki grafana/loki-stack --set promtail.enabled=true
# Loki uses labels for indexing (like Prometheus):
# {namespace="production", app="payment"}
# LogQL queries (similar to PromQL):
# All error logs from payment service:
# {app="payment"} |= "error"
# JSON parsing and filtering:
# {app="payment"} | json | level="error" | duration_ms > 1000
# Count errors per minute, per pod (rate() would return per-second values):
# sum by (pod) (count_over_time({app="payment"} |= "error" [1m]))
---
# Alternative: EFK Stack (Elasticsearch + Fluentd + Kibana)
# - Elasticsearch: full-text search, aggregations
# - Fluentd/Fluent Bit: log collection from pods
# - Kibana: visualization and search
# Fluent Bit DaemonSet config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush            5
        Daemon           Off
        Log_Level        info
    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        # Use the cri parser instead on containerd or CRI-O nodes
        Parser           docker
        Tag              kube.*
    [FILTER]
        Name             kubernetes
        Match            kube.*
        Merge_Log        On
    [OUTPUT]
        Name             es
        Match            *
        Host             elasticsearch.logging.svc
        Port             9200
        # Index takes a fixed name; Logstash_Format creates daily logs-YYYY.MM.DD indices
        Logstash_Format  On
        Logstash_Prefix  logs
Kubernetes Troubleshooting
Debugging Pods
# Systematic pod troubleshooting flow:
# 1. Check pod status:
kubectl get pods -n production
# NAME READY STATUS RESTARTS AGE
# payment-x1 1/1 Running 0 5d
# payment-x2 0/1 CrashLoopBackOff 5 10m ← problem
# payment-x3 0/1 ImagePullBackOff 0 2m ← problem
# 2. Describe for events:
kubectl describe pod payment-x2 -n production
# Last State:   Terminated
#   Reason:     OOMKilled
#   Exit Code:  137
# Events:
# Warning BackOff 2m kubelet Back-off restarting failed container
# 3. Check logs (current and previous crash):
kubectl logs payment-x2 -n production
kubectl logs payment-x2 -n production --previous # Previous crash logs
# 4. Check resource usage:
kubectl top pod payment-x2 -n production
# NAME CPU(cores) MEMORY(bytes)
# payment-x2 450m 498Mi ← near 512Mi limit = OOM
# 5. Exec into running pod (if possible):
kubectl exec -it payment-x1 -n production -- /bin/sh
# Check disk, network, processes, config files
# 6. Ephemeral debug container (for distroless/scratch images):
kubectl debug -it payment-x2 -n production --image=busybox --target=payment
# Shares PID namespace with target container — can inspect processes
# Common pod failure states and causes:
# CrashLoopBackOff:
# - Application crash (check logs --previous)
# - OOMKilled (increase memory limit)
# - Liveness probe failing (check probe config)
# - Missing config/secret (check env vars and mounts)
# ImagePullBackOff:
# - Wrong image name/tag
# - Private registry without imagePullSecrets
# - Image doesn't exist (was it pushed?)
# Pending:
# - Insufficient resources (no node fits requests)
# - Unsatisfiable nodeSelector/affinity
# - PVC not bound (storage issue)
# - Taint not tolerated
# Init:Error / Init:CrashLoopBackOff:
# - Init container failing (kubectl logs pod -c init-container-name)
# Terminating (stuck):
# - Finalizer blocking deletion
# - Force delete: kubectl delete pod X --grace-period=0 --force
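The first triage step, scanning `kubectl get pods` for anything unhealthy, is easy to script. The Python sketch below parses the tabular output and attaches a next step drawn from the failure states listed above. It is a rough helper under assumptions: it relies on the default column layout (NAME, READY, STATUS first), and the `NEXT_STEP` hints are a condensed, hypothetical mapping rather than an exhaustive one.

```python
HEALTHY = {"Running", "Completed"}

# Hypothetical hint table condensed from the failure states listed above.
NEXT_STEP = {
    "CrashLoopBackOff": "kubectl logs <pod> --previous; check OOMKilled, probes, config",
    "ImagePullBackOff": "check image name/tag and imagePullSecrets",
    "Pending": "kubectl describe pod <pod>; check resources, affinity, PVC, taints",
    "Terminating": "check finalizers before considering a force delete",
}


def find_problem_pods(kubectl_output):
    """Return (name, status, hint) for each pod that is unhealthy or not fully ready."""
    problems = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:                        # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue                              # not a pod row
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        fully_ready = status == "Completed" or ready_count == total
        if status in HEALTHY and fully_ready:
            continue
        if status.startswith("Init:"):
            hint = "kubectl logs <pod> -c <init-container-name>"
        else:
            hint = NEXT_STEP.get(status, "kubectl describe pod <pod>")
        problems.append((name, status, hint))
    return problems


SAMPLE = """
NAME         READY   STATUS             RESTARTS   AGE
payment-x1   1/1     Running            0          5d
payment-x2   0/1     CrashLoopBackOff   5          10m
payment-x3   0/1     ImagePullBackOff   0          2m
"""

for name, status, hint in find_problem_pods(SAMPLE):
    print(f"{name}: {status} -> {hint}")
```

A pod that reports Running but is not fully ready (for example 0/1) is also flagged, since a failing readiness probe removes it from Service endpoints without restarting it.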
Debugging Networking
# Network debugging toolkit:
# 1. DNS resolution:
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup payment.production.svc.cluster.local
# Server: 10.96.0.10 (CoreDNS)
# Address: 10.96.0.10:53
# Name: payment.production.svc.cluster.local
# Address: 10.100.45.67
# 2. Service connectivity:
kubectl run curl-test --image=curlimages/curl --rm -it -- \
curl -v http://payment.production.svc:8080/health
# If timeout → check Network Policies, Service selector, endpoint
# 3. Check endpoints (does Service have backends?):
kubectl get endpoints payment -n production
# NAME ENDPOINTS AGE
# payment 10.244.1.5:8080,10.244.2.8:8080 5d
# No endpoints? → Labels don't match Service selector:
kubectl get pods -n production -l app=payment --show-labels
# 4. Check Network Policies:
kubectl get networkpolicies -n production
kubectl describe networkpolicy default-deny -n production
# 5. Port-forward for direct testing:
kubectl port-forward svc/payment 8080:8080 -n production
# Access at http://localhost:8080 — bypasses ingress/network policies
# 6. Packet capture (advanced):
kubectl debug -it payment-x1 -n production --image=nicolaka/netshoot -- tcpdump -i eth0 port 8080
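Steps 1 and 2 above separate two different failures: the name does not resolve, or it resolves but nothing accepts the connection. The Python sketch below makes the same distinction from inside a pod using only the standard library. It is a minimal illustration: the `check_service` name is hypothetical, and it tests TCP reachability only, not HTTP health.

```python
import socket


def check_service(host, port, timeout=3.0):
    """Return a dict describing DNS resolution and TCP reachability for host:port."""
    result = {"resolved": None, "connected": False, "error": None}
    try:
        # Step 1: DNS. Fails here if the Service name or namespace is wrong.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = infos[0][4][0]
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result
    try:
        # Step 2: TCP. A timeout suggests a NetworkPolicy drop or no endpoints;
        # "connection refused" suggests nothing is listening on that port.
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result


# Example call from a debug pod (service name taken from the commands above):
# print(check_service("payment.production.svc.cluster.local", 8080))
```

A DNS failure points at CoreDNS or the Service name. A resolved address with a failed connection points at endpoints, ports, or Network Policies.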
Debugging Performance
# Performance investigation workflow:
# 1. Identify slow service (from traces or metrics):
# P95 latency payment-service: 2.3s (SLO: <500ms)
# 2. Check resource saturation:
kubectl top pods -n production -l app=payment
# NAME CPU(cores) MEMORY(bytes)
# payment-x1 980m 450Mi ← CPU near limit (1000m)
# payment-x2 950m 440Mi ← CPU throttled!
# payment-x3 920m 430Mi
# 3. Check CPU throttling (cgroup metrics):
# container_cpu_cfs_throttled_periods_total high → increase CPU limit
# 4. Check if HPA is maxed:
kubectl get hpa payment-hpa -n production
# NAME TARGETS MINPODS MAXPODS REPLICAS
# payment-hpa 95%/70% 3 10 10 ← at max, still overloaded
# 5. Look at dependencies (database, cache):
# PromQL: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))
# If database P95 is high → slow queries causing upstream latency
# 6. Check for noisy neighbours on the same node:
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-3
# Other pods on the same node consuming resources?
# Resolution:
# - Increase CPU limits / pod count
# - Scale HPA maxReplicas
# - Optimise application (profiling, caching)
# - Add pod anti-affinity to spread load
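Two numbers from this workflow are worth computing by hand: how many replicas the HPA would ask for if it were not capped, and what share of CPU periods were throttled. The Python sketch below applies the ratio formula from the Kubernetes HPA documentation to the `95%/70%` reading above, ignoring the tolerance band and stabilization window for simplicity. The throttling figures are hypothetical counter deltas, and both function names are assumptions.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas):
    """Return (uncapped, capped) replica counts from the HPA ratio formula."""
    # desired = ceil(current * currentMetric / targetMetric)
    uncapped = math.ceil(current_replicas * current_utilization / target_utilization)
    capped = max(min_replicas, min(max_replicas, uncapped))
    return uncapped, capped


def throttled_ratio(throttled_periods, total_periods):
    """Fraction of CFS periods in which the container was throttled."""
    return throttled_periods / total_periods if total_periods else 0.0


# Values from the HPA output above: 10 replicas at 95% against a 70% target.
uncapped, capped = desired_replicas(10, 95, 70, min_replicas=3, max_replicas=10)
print(f"HPA wants {uncapped} replicas but is capped at {capped}")

# Hypothetical 5-minute deltas of container_cpu_cfs_throttled_periods_total
# and container_cpu_cfs_periods_total for one container.
ratio = throttled_ratio(throttled_periods=450, total_periods=600)
print(f"throttled in {ratio:.0%} of periods")
```

Here the HPA would scale to 14 replicas if allowed, which gives a concrete starting point for raising maxReplicas rather than guessing.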
Exercises
Conclusion
Observability is what makes distributed systems operable. Without it, you're debugging blind in production. The key principles:
- Metrics for detecting problems (Prometheus + alerting → "something is wrong")
- Traces for locating problems (OpenTelemetry → "the problem is in service X, function Y")
- Logs for understanding problems (structured JSON → "the error was Z because of W")
- Golden Signals (latency, traffic, errors, saturation) on every service dashboard
- Systematic debugging — describe, events, logs, exec, port-forward — in that order
In Part 16, our finale, we'll cover the Cloud Native Ecosystem — CI/CD with GitOps, Helm charts, Kustomize, multi-cluster management, FinOps, and the CNCF landscape that ties everything together.