Module 33: System Telemetry
Three Pillars of Observability
Observability is built on three complementary signal types. Each answers different questions about system behavior, and together they provide the complete picture needed to understand, debug, and optimize production systems.
flowchart TD
O[Observability] --> M[Metrics]
O --> T[Traces]
O --> L[Logs]
M --> M1["What is happening?
Aggregated numerical data"]
M --> M2["Counters, Gauges, Histograms"]
M --> M3["Low cardinality, cheap to store"]
T --> T1["Where is it slow?
Request flow across services"]
T --> T2["Spans, Context Propagation"]
T --> T3["High cardinality, sampled"]
L --> L1["Why did it happen?
Detailed event records"]
L --> L2["Structured JSON, Correlation IDs"]
L --> L3["Highest volume, most expensive"]
Metrics: Aggregated Numerical Data
Metrics are numerical measurements collected at regular intervals. They are the cheapest signal to store and query, making them ideal for dashboards, alerting, and trend analysis. The Prometheus data model defines four metric types (counter, gauge, histogram, and summary); the three used most often are:
Counter: Monotonically increasing value (only goes up, resets on restart).
- Use for: total requests, total errors, bytes transferred
- Query pattern: rate(http_requests_total[5m]) gives requests per second
Gauge: Value that goes up and down (snapshot of current state).
- Use for: CPU usage, memory usage, queue depth, active connections
- Query pattern: node_memory_available_bytes gives the current value
Histogram: Distribution of values across configurable buckets.
- Use for: request latency, response sizes, batch processing times
- Query pattern: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) estimates p99 latency
Cardinality warning: every unique combination of label values is stored as its own time series. If you add a user_id label to a metric with 1M users, you create 1M time series, and this will kill your Prometheus instance. Keep label cardinality low (< 100 unique values per label). High-cardinality data belongs in traces or logs, not metrics.
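To make the cardinality arithmetic concrete, here is a minimal sketch that estimates the worst-case series count for one metric as the product of its label cardinalities. The label names and counts are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def estimate_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single metric such as http_requests_total.
low_cardinality = {"method": 5, "status": 8, "endpoint": 40}
with_user_id = {**low_cardinality, "user_id": 1_000_000}

print(estimate_series(low_cardinality))  # 1600
print(estimate_series(with_user_id))     # 1600000000
```

The point of the sketch is that cardinalities multiply: one unbounded label turns a small metric into billions of potential series.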
Distributed Traces: Request Flow Across Services
A distributed trace tracks a single request as it flows through multiple services. Each unit of work is a span — spans form a tree (parent-child relationships) that represents the full request execution path.
sequenceDiagram
participant Client
participant Gateway
participant OrderSvc
participant PaymentSvc
participant DB
Note over Client,DB: Trace ID: abc-123 propagated via W3C TraceContext headers
Client->>Gateway: POST /order [traceparent: abc-123]
activate Gateway
Note right of Gateway: Span: gateway (root)
Gateway->>OrderSvc: CreateOrder [traceparent: abc-123]
activate OrderSvc
Note right of OrderSvc: Span: create-order (child)
OrderSvc->>PaymentSvc: ChargeCard [traceparent: abc-123]
activate PaymentSvc
Note right of PaymentSvc: Span: charge-card (child)
PaymentSvc-->>OrderSvc: OK (200ms)
deactivate PaymentSvc
OrderSvc->>DB: INSERT order [traceparent: abc-123]
activate DB
Note right of DB: Span: db-insert (child)
DB-->>OrderSvc: OK (15ms)
deactivate DB
OrderSvc-->>Gateway: 201 Created
deactivate OrderSvc
Gateway-->>Client: 201 Created (280ms total)
deactivate Gateway
Key tracing concepts:
- Trace ID: Unique identifier for the entire request journey (128-bit, propagated via headers)
- Span ID: Unique identifier for one unit of work within the trace
- Parent Span ID: Links child spans to their parent, forming the tree
- W3C TraceContext: Standard header format: traceparent: 00-{trace-id}-{span-id}-{flags}
- Sampling: Not every request is traced; head-based (random %) or tail-based (keep interesting traces) sampling reduces cost
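The traceparent format above can be parsed and propagated in a few lines. The sketch below is a simplified illustration, not a full W3C TraceContext implementation: it accepts only the four-field layout and ignores tracestate. The names `TraceContext`, `parse_traceparent`, and `child_traceparent` are made up for this example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - span-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header.strip())
    if m is None or m["version"] == "ff":
        return None
    if set(m["trace_id"]) == {"0"} or set(m["span_id"]) == {"0"}:
        return None  # all-zero IDs are invalid
    sampled = bool(int(m["flags"], 16) & 0x01)
    return TraceContext(m["trace_id"], m["span_id"], sampled)


def child_traceparent(parent: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
print(child_traceparent(ctx))
```

Each service keeps the trace ID unchanged and replaces the span ID with its own, which is what lets the backend rebuild the span tree.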
Structured Logs: Detailed Event Records
Logs provide the highest-fidelity record of system behavior. Structured logging (JSON format with consistent fields) makes logs queryable and correlatable with metrics and traces.
Essential structured log fields:
- timestamp: ISO 8601 with timezone (preferred over Unix epoch for readability)
- level: ERROR, WARN, INFO, DEBUG
- service: which service emitted the log
- trace_id: correlates the log line with a distributed trace
- correlation_id: business-level request ID (survives async boundaries)
- message: human-readable description
- context: structured data (user_id, order_id, etc.)
{
"timestamp": "2026-05-15T14:23:01.847Z",
"level": "ERROR",
"service": "payment-service",
"instance": "payment-service-7b4f9-xk2p",
"trace_id": "abc123def456",
"span_id": "span-789",
"correlation_id": "order-98765",
"message": "Payment charge failed — card declined",
"error": {
"type": "PaymentDeclinedException",
"code": "CARD_DECLINED",
"provider_code": "do_not_honor"
},
"context": {
"user_id": "usr_12345",
"amount_cents": 9999,
"currency": "USD",
"card_last4": "4242",
"retry_attempt": 2,
"latency_ms": 340
}
}
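A log line like the one above can be produced with Python's standard logging module. This is a minimal sketch under a few assumptions: the `JsonFormatter` class is hypothetical, the optional fields are passed through the `extra=` argument, and the standard library reports the level as WARNING rather than WARN.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Optional correlation fields, supplied by the caller via `extra=`.
        for field in ("trace_id", "span_id", "correlation_id", "context"):
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payment-service"))
logger.addHandler(handler)

logger.error(
    "Payment charge failed: card declined",
    extra={
        "trace_id": "abc123def456",
        "correlation_id": "order-98765",
        "context": {"retry_attempt": 2, "latency_ms": 340},
    },
)
```

Keeping the field names identical across services is what makes the logs joinable with traces by trace_id.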
Module 34: Performance Modeling
USE Method (for Resources)
The USE Method (Brendan Gregg) provides a systematic approach to analyzing resource performance. For every resource (CPU, memory, disk, network, locks), check three things:
- U — Utilization: Percentage of time the resource is busy (0-100%). High utilization (>80%) indicates the resource is becoming a bottleneck.
- S — Saturation: Degree of extra work queued that the resource can't service. Any saturation > 0 means requests are waiting (queue depth, run queue length).
- E — Errors: Number of error events. Some errors indicate resource failure (disk I/O errors, network packet drops, ECC memory corrections).
flowchart TD
START[For Each Resource] --> CPU[CPU]
START --> MEM[Memory]
START --> DISK[Disk I/O]
START --> NET[Network]
CPU --> CPU_U["U: % busy
mpstat, top"]
CPU --> CPU_S["S: run queue length
vmstat r column"]
CPU --> CPU_E["E: machine check
perf, dmesg"]
MEM --> MEM_U["U: % used
free, /proc/meminfo"]
MEM --> MEM_S["S: swap activity
vmstat si/so"]
MEM --> MEM_E["E: OOM kills
dmesg, kmsg"]
DISK --> DISK_U["U: % busy
iostat %util"]
DISK --> DISK_S["S: queue depth
iostat avgqu-sz"]
DISK --> DISK_E["E: I/O errors
smartctl, dmesg"]
NET --> NET_U["U: bandwidth %
sar, nstat"]
NET --> NET_S["S: backlog drops
ss, netstat"]
NET --> NET_E["E: CRC, drops
ifconfig, ethtool"]
RED Method (for Services)
The RED Method (Tom Wilkie, inspired by Google's Four Golden Signals) applies to services rather than resources. For every service endpoint, measure:
- R — Rate: Requests per second. Tracks demand/throughput. Sudden drops indicate upstream failures; spikes indicate load events.
- E — Errors: Failed requests per second (or error rate %). Distinguish client errors (4xx) from server errors (5xx). Track by error type for prioritization.
- D — Duration: Response time distribution (not just average!). Track p50, p95, p99. P99 matters because 1% of users experience it — at 1M requests/day, that's 10,000 slow experiences.
Capacity Planning
Capacity planning answers: "How much infrastructure do we need to handle expected load with acceptable performance?" It bridges traffic forecasting, performance testing, and resource allocation.
The capacity planning process:
- Measure current demand: Peak RPS, p99 latency, resource utilization at peak
- Forecast future demand: Growth rate, seasonal patterns, planned events (launches, sales)
- Determine headroom requirement: Typically 30-50% spare capacity for burst absorption
- Load test to find ceiling: At what RPS does p99 latency exceed SLO? That's your current capacity ceiling.
- Calculate scaling needs: instances_needed = (forecasted_peak × (1 + headroom)) / capacity_per_instance, rounded up to a whole instance
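The scaling formula from the last step translates directly into code. The numbers below are hypothetical and only illustrate the arithmetic; the result is rounded up because a fractional instance cannot be provisioned.

```python
import math


def instances_needed(
    forecasted_peak_rps: float,
    headroom: float,
    capacity_per_instance_rps: float,
) -> int:
    """(forecasted_peak * (1 + headroom)) / capacity_per_instance, rounded up."""
    required_rps = forecasted_peak_rps * (1 + headroom)
    return math.ceil(required_rps / capacity_per_instance_rps)


# Hypothetical inputs: 12,000 RPS forecast peak, 40% headroom,
# and a load-tested ceiling of 900 RPS per instance.
print(instances_needed(12_000, 0.40, 900))  # 19
```

Here capacity_per_instance should come from the load test in the previous step, meaning the RPS at which p99 latency first exceeds the SLO.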
#!/bin/bash
# SLI queries using Prometheus — common patterns for RED method metrics
PROM_URL="http://prometheus:9090/api/v1/query"
echo "=== Service SLI Dashboard ==="
# Rate: Requests per second (last 5 minutes)
echo "--- Request Rate ---"
curl -s "$PROM_URL" --data-urlencode \
'query=sum(rate(http_requests_total{service="order-api"}[5m]))' | \
jq -r '.data.result[0].value[1] | tonumber | . * 100 | round / 100' | \
xargs -I {} echo " Current RPS: {}"
# Errors: Error rate percentage
echo "--- Error Rate ---"
curl -s "$PROM_URL" --data-urlencode \
'query=sum(rate(http_requests_total{service="order-api",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="order-api"}[5m])) * 100' | \
jq -r '.data.result[0].value[1] | tonumber | . * 100 | round / 100' | \
xargs -I {} echo " Error Rate: {}%"
# Duration: p99 latency
echo "--- P99 Latency ---"
curl -s "$PROM_URL" --data-urlencode \
'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="order-api"}[5m])) by (le))' | \
jq -r '.data.result[0].value[1] | tonumber | . * 1000 | round' | \
xargs -I {} echo " P99 Latency: {}ms"
# Saturation: Active connections / max connections
echo "--- Saturation ---"
curl -s "$PROM_URL" --data-urlencode \
'query=sum(db_connections_active{service="order-api"}) / sum(db_connections_max{service="order-api"}) * 100' | \
jq -r '.data.result[0].value[1] | tonumber | . * 100 | round / 100' | \
xargs -I {} echo " DB Pool Utilization: {}%"
Module 35: Production Behavior Analysis
Cascading Failures
A cascading failure occurs when the failure of one component causes dependent components to fail, which causes their dependents to fail, and so on — like dominoes. The signature pattern in telemetry:
- One service starts returning errors or timing out
- Callers retry aggressively, increasing load on the failing service
- Callers' thread pools fill with waiting requests (saturation)
- Callers can't serve their own requests (propagation)
- Callers' callers see timeouts → the cascade spreads upstream
Detection signals:
- Correlated timeout increases across multiple services (within seconds)
- Thread pool saturation spreading service-to-service
- Error rate spike that starts in one service and propagates upstream
- Latency p99 increasing everywhere while throughput drops
Retry Storms
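A retry storm occurs when many clients simultaneously retry failed requests, multiplying traffic. A service handling 1000 RPS whose clients retry each failure up to 3 times can face up to 4000 RPS if every request starts failing (1000 first attempts plus 3000 retries), exactly when it's least able to handle load. When retries are stacked across several layers of a call chain, the multipliers compound.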
Mitigation patterns:
- Exponential backoff with jitter: Spread retries over time, prevent synchronization
- Retry budgets: Limit total retries to 10% of successful requests
- Circuit breakers: Stop retrying entirely when failure rate exceeds threshold
- Adaptive concurrency: Reduce in-flight requests as latency increases (similar to TCP Vegas congestion control)
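The first two mitigations can be sketched in a few lines. This is an illustrative sketch, not a production client: `backoff_delay` uses the "full jitter" variant (a uniform delay up to the capped exponential), and the hypothetical `RetryBudget` class counts cumulatively, whereas a real implementation would track a sliding time window.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter backoff: uniform delay in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    return source.uniform(0, min(cap, base * 2 ** attempt))


class RetryBudget:
    """Permit retries only while they stay within `ratio` of successes."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.successes = 0
        self.retries = 0

    def record_success(self) -> None:
        self.successes += 1

    def try_spend(self) -> bool:
        """Return True and count the retry if the budget allows it."""
        if self.retries + 1 > self.successes * self.ratio:
            return False
        self.retries += 1
        return True


budget = RetryBudget(ratio=0.1)
for _ in range(20):
    budget.record_success()

# With 20 successes and a 10% budget, two retries are allowed.
print([budget.try_spend() for _ in range(3)])  # [True, True, False]
print(backoff_delay(attempt=3))  # random value between 0 and 0.8 seconds
```

Jitter keeps clients from retrying in lockstep, and the budget caps how much extra load retries can add when the whole fleet is failing.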
Traffic Amplification
Traffic amplification occurs when one external request generates many internal requests (fanout). A user search might query 20 shards, each shard queries 3 replicas, and each replica makes 2 index lookups. One user request becomes 20 × 3 × 2 = 120 index lookups at the leaf level, and 200 internal requests once the 20 shard queries and 60 replica queries are counted.
Fanout multiplier: internal_RPS / external_RPS. A multiplier of 50x means your internal infrastructure must handle 50x the user-visible traffic. Monitor this ratio — if it grows unexpectedly (new feature, changed query patterns), capacity planning breaks.
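The fanout arithmetic for the search example can be checked with a short sketch. The `fanout_calls` helper is hypothetical; it multiplies the branching factor level by level, so the last entry is the leaf-level count and the sum is the total number of internal calls for one external request.

```python
def fanout_calls(branching: list[int]) -> list[int]:
    """Calls issued at each level of the call tree for one external request."""
    calls = []
    level_total = 1
    for factor in branching:
        level_total *= factor
        calls.append(level_total)
    return calls


# 20 shards, 3 replicas per shard, 2 index lookups per replica.
levels = fanout_calls([20, 3, 2])
print(levels)       # [20, 60, 120]
print(sum(levels))  # 200
```

Whichever count you track as the multiplier, measure it the same way over time so that unexpected growth stands out.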
Module 36: Reliability Modeling
Availability Calculations
Availability measures the fraction of time a system is operational and serving requests correctly. It's typically expressed as "nines":
| Availability | Nines | Downtime/Year | Downtime/Month |
|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 min |
| 99.95% | Three and a half | 4.38 hours | 21.9 min |
| 99.99% | Four nines | 52.6 min | 4.38 min |
| 99.999% | Five nines | 5.26 min | 26.3 sec |
Composite availability for serial dependencies: If Service A (99.9%) calls Service B (99.9%), the combined availability is 99.9% × 99.9% ≈ 99.8%. Each additional serial dependency multiplies the availabilities together, so the failure probabilities roughly add up. Five services at 99.9% each yield about 99.5% combined, a drop from "three nines" to roughly 2.3 nines.
import math


def calculate_availability(uptime_hours: float, total_hours: float) -> dict:
    """Calculate availability metrics from uptime data."""
    availability = uptime_hours / total_hours
    # Nines = -log10 of the unavailable fraction (0.999 -> 3 nines)
    if availability >= 1.0:
        nines = float('inf')
    elif availability <= 0:
        nines = 0
    else:
        nines = -math.log10(1 - availability)
    return {
        "availability_percent": round(availability * 100, 4),
        "nines": round(nines, 2),
        # 8760 hours per year; 43800 minutes per average month (730 h)
        "downtime_hours_per_year": round((1 - availability) * 8760, 2),
        "downtime_minutes_per_month": round((1 - availability) * 43800, 2),
    }


def composite_serial(availabilities: list[float]) -> float:
    """Combined availability for serial (all must work) dependencies."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def composite_parallel(availabilities: list[float]) -> float:
    """Combined availability for parallel (any can work) redundancy."""
    # P(all fail) = product of (1 - a) for each component
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1 - a)
    return 1 - all_fail


# Example: System with 3 serial services
services = [0.999, 0.999, 0.999]  # Each at 99.9%
serial_avail = composite_serial(services)
print(f"Serial (3 services at 99.9%): {serial_avail*100:.3f}%")
# Output: 99.700% (about 2.5 nines, down from 3)

# Example: Database with 3 replicas (any one can serve reads)
replicas = [0.99, 0.99, 0.99]  # Each at 99%
parallel_avail = composite_parallel(replicas)
print(f"Parallel (3 replicas at 99%): {parallel_avail*100:.4f}%")
# Output: 99.9999% (six nines, up from two)

# Example: Full system availability calculation
result = calculate_availability(uptime_hours=8750, total_hours=8760)
print(f"\nSystem availability: {result['availability_percent']}%")
print(f"Nines: {result['nines']}")
print(f"Downtime per year: {result['downtime_hours_per_year']} hours")
print(f"Downtime per month: {result['downtime_minutes_per_month']} minutes")
Error Budgets & Burn Rate
An error budget is the complement of your SLO: it's how much unreliability you're allowed. If your SLO is 99.9% availability, your error budget is 0.1% (about 43 minutes of downtime per 30-day month). The error budget is "spent" by incidents, deployments, and maintenance.
Burn rate measures how fast you're consuming your error budget relative to the SLO window. A burn rate of 1x means you'll exactly exhaust the budget by end of window. A burn rate of 10x means you'll exhaust it in 1/10th the time — trigger an alert.
flowchart LR
subgraph "30-Day Error Budget"
B[Budget: 43.2 min]
B --> R1["1x burn rate
Normal — budget lasts full month"]
B --> R2["2x burn rate
Budget exhausted in 15 days"]
B --> R6["6x burn rate
Budget exhausted in 5 days
⚠️ WARN: Ticket"]
B --> R14["14.4x burn rate
Budget exhausted in 2 days
🚨 CRITICAL: Page"]
end
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Calculate error budget and burn rate for an SLO."""
    slo_target: float      # e.g., 0.999 for 99.9%
    window_days: int = 30  # SLO measurement window

    @property
    def budget_fraction(self) -> float:
        """Allowed error fraction (1 - SLO)."""
        return 1 - self.slo_target

    @property
    def budget_minutes(self) -> float:
        """Error budget in minutes for the window."""
        return self.budget_fraction * self.window_days * 24 * 60

    def burn_rate(self, error_rate: float) -> float:
        """Current burn rate given observed error rate.

        burn_rate = observed_error_rate / allowed_error_rate
        burn_rate of 1.0 means budget will last exactly the window.
        burn_rate of 10.0 means budget exhausted in 1/10 of window.
        """
        if self.budget_fraction == 0:
            return float('inf')
        return error_rate / self.budget_fraction

    def time_to_exhaustion_hours(self, error_rate: float) -> float:
        """Hours until budget is fully consumed at current rate."""
        br = self.burn_rate(error_rate)
        if br <= 0:
            return float('inf')
        return (self.window_days * 24) / br

    def budget_remaining(self, consumed_minutes: float) -> dict:
        """How much budget remains."""
        remaining = self.budget_minutes - consumed_minutes
        return {
            "total_budget_min": round(self.budget_minutes, 2),
            "consumed_min": round(consumed_minutes, 2),
            "remaining_min": round(max(0, remaining), 2),
            "remaining_percent": round(max(0, remaining / self.budget_minutes * 100), 1),
            "exhausted": remaining <= 0
        }


# Example: 99.9% SLO over 30 days
budget = ErrorBudget(slo_target=0.999, window_days=30)
print(f"SLO: {budget.slo_target:.1%}")
print(f"Error budget: {budget.budget_minutes:.1f} minutes ({budget.budget_fraction:.1%})")

# Current error rate is 0.5% (5x the budget)
current_error_rate = 0.005
br = budget.burn_rate(current_error_rate)
print(f"\nCurrent error rate: {current_error_rate:.1%}")
print(f"Burn rate: {br:.1f}x")
print(f"Time to exhaustion: {budget.time_to_exhaustion_hours(current_error_rate):.1f} hours")

# After a 20-minute incident
status = budget.budget_remaining(consumed_minutes=20)
print("\nAfter 20-min incident:")
print(f"  Remaining: {status['remaining_min']} min ({status['remaining_percent']}%)")
print(f"  Exhausted: {status['exhausted']}")
SLI / SLO / SLA
These three concepts form a hierarchy for defining and managing reliability:
SLI (Service Level Indicator): A quantitative measure of some aspect of service quality. It's what you measure — the raw metric.
- Availability SLI: successful_requests / total_requests
- Latency SLI: requests_under_300ms / total_requests (proportion within threshold)
- Throughput SLI: processed_jobs / submitted_jobs
- Correctness SLI: correct_responses / total_responses
SLO (Service Level Objective): A target value for an SLI over a time window. It's your internal reliability goal.
- "99.9% of requests complete successfully over 30 days"
- "95% of requests complete in under 300ms over 7 days"
- SLO creates the error budget: budget = 100% - SLO
SLA (Service Level Agreement): A business contract with consequences (credits, penalties) if the agreed service level is missed. SLAs should be less strict than SLOs: your internal target should be higher than your customer promise, so that an SLO miss warns you before the contract is breached.
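The relationship between the three terms can be shown with a small sketch: the SLI is the measured ratio, and the SLO and SLA are two thresholds it is compared against. The function names and the numbers are hypothetical.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """SLI as the proportion of good events; an empty window counts as good."""
    return 1.0 if total_events == 0 else good_events / total_events


def evaluate(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA."""
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA intact"
    return "within SLO"


# Hypothetical month: 999,200 successful requests out of 1,000,000,
# with an internal SLO of 99.95% and a customer-facing SLA of 99.9%.
availability = sli_ratio(good_events=999_200, total_events=1_000_000)
print(f"{availability:.4%}")                           # 99.9200%
print(evaluate(availability, slo=0.9995, sla=0.999))   # SLO missed, SLA intact
```

The gap between the two thresholds is the buffer: the SLO miss triggers internal action while the contract is still being met.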
# Prometheus alerting rules for SLO-based alerts (multi-window burn rate)
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 14.4x rate over 1 hour (exhausts 30-day budget in 2 days)
      - alert: HighErrorBurnRate_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "High error burn rate: budget exhausted in ~2 days"
          description: "Error rate {{ $value | humanizePercentage }} exceeds 14.4x burn rate threshold"
          runbook: "https://runbooks.internal/slo-budget-critical"

      # Slow burn: 6x rate over 6 hours (exhausts budget in 5 days)
      - alert: HighErrorBurnRate_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Elevated error burn rate: budget exhausted in ~5 days"
          description: "6h error rate {{ $value | humanizePercentage }} exceeds 6x burn rate"

      # Latency SLO: p99 > 500ms for 5 minutes
      - alert: LatencySLO_Breach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "P99 latency exceeds 500ms SLO threshold"
          description: "P99 latency is {{ $value | humanizeDuration }}"
Case Studies
Google SRE: Error Budgets in Practice
Google's SRE book formalized the error budget concept. The key insight: 100% reliability is the wrong target. Users can't tell the difference between 99.99% and 99.999% (their ISP, device, and WiFi introduce more unreliability), but the engineering cost difference is enormous.
Google's error budget policy:
- Define SLO: Product and SRE agree on a target (e.g., 99.95% availability)
- Measure SLI: Continuously track actual performance against the SLO
- Calculate remaining budget: budget = (1 - SLO) × window - consumed
- If budget remains: Ship features, deploy aggressively, innovate
- If budget exhausted: Freeze features, focus on reliability (postmortems, automation, testing)
Results: Teams with error budgets shipped 30% more features than teams without (they weren't over-cautious). Reliability actually improved because the policy made reliability investment a rational, data-driven decision rather than a political battle between "move fast" and "don't break things."
Multi-window burn rate alerts (Google's recommendation): Use two alert windows — a short window (1h) with high burn rate threshold (14.4x) for critical pages, and a long window (6h) with lower threshold (6x) for warning tickets. This catches both sudden incidents and slow degradation.
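The 14.4x and 6x thresholds are not arbitrary. They follow from asking how much of a 30-day budget may be spent inside the alert window: 2% in 1 hour and 5% in 6 hours, the figures commonly used with this scheme. The sketch below derives both and applies them; the function names are hypothetical, and the classification is simplified to one window per severity.

```python
def burn_rate_threshold(
    budget_fraction_consumed: float,
    alert_window_hours: float,
    slo_window_days: int = 30,
) -> float:
    """Burn rate that spends the given share of the budget within the alert window."""
    return budget_fraction_consumed * (slo_window_days * 24) / alert_window_hours


FAST_BURN = burn_rate_threshold(0.02, alert_window_hours=1)  # about 14.4
SLOW_BURN = burn_rate_threshold(0.05, alert_window_hours=6)  # about 6


def classify(error_rate_1h: float, error_rate_6h: float, slo_target: float = 0.999) -> str:
    """Map windowed error rates to an alert severity."""
    allowed = 1 - slo_target
    if error_rate_1h / allowed > FAST_BURN:
        return "page"
    if error_rate_6h / allowed > SLOW_BURN:
        return "ticket"
    return "ok"


print(f"{FAST_BURN:.1f}x, {SLOW_BURN:.1f}x")                 # 14.4x, 6.0x
print(classify(error_rate_1h=0.02, error_rate_6h=0.004))    # page
print(classify(error_rate_1h=0.005, error_rate_6h=0.008))   # ticket
print(classify(error_rate_1h=0.001, error_rate_6h=0.001))   # ok
```

A sudden outage trips the 1-hour check quickly, while a persistent low-level error rate only shows up in the 6-hour window.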
Honeycomb: Observability-Driven Development
Honeycomb pioneered "observability-driven development" — the practice of instrumenting code before shipping it to production, then using high-cardinality trace data to understand behavior rather than pre-defined dashboards.
Key principles:
- Events over metrics: Instead of pre-aggregated counters, store rich structured events with all context (user_id, feature_flag, build_version, query_plan). Query them on-demand.
- High cardinality is essential: You need to slice by user_id (millions of unique values) to debug "why is user X slow?" — metrics can't do this.
- BubbleUp pattern: Compare slow requests vs fast requests: what attributes differ? Automated analysis finds that slow requests all have region=eu-west and db_pool=replica-3.
- Instrument at deploy: Every deploy includes trace instrumentation. Developers own observability: they add spans around code they want to understand.
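The slow-versus-fast comparison can be approximated in plain Python. This is a toy sketch of the idea, not Honeycomb's implementation: it measures how much more often each attribute value appears among slow events than among fast ones. The event shape and the `attribute_skew` helper are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable


def attribute_skew(
    events: list[dict],
    is_slow: Callable[[dict], bool],
    min_diff: float = 0.5,
) -> list[tuple[tuple[str, str], float]]:
    """Attribute values over-represented in slow events, largest gap first."""
    slow = [e for e in events if is_slow(e)]
    fast = [e for e in events if not is_slow(e)]
    if not slow or not fast:
        return []

    def share(group: list[dict]) -> dict:
        counts = Counter((k, v) for e in group for k, v in e["attrs"].items())
        return {pair: n / len(group) for pair, n in counts.items()}

    slow_share, fast_share = share(slow), share(fast)
    diffs = [(pair, s - fast_share.get(pair, 0.0)) for pair, s in slow_share.items()]
    return sorted((d for d in diffs if d[1] >= min_diff), key=lambda d: -d[1])


# Hypothetical events: two slow requests share a region and a DB pool.
events = [
    {"duration_ms": 40, "attrs": {"region": "us-east", "db_pool": "replica-1"}},
    {"duration_ms": 35, "attrs": {"region": "eu-west", "db_pool": "replica-1"}},
    {"duration_ms": 50, "attrs": {"region": "us-east", "db_pool": "replica-2"}},
    {"duration_ms": 45, "attrs": {"region": "us-east", "db_pool": "replica-2"}},
    {"duration_ms": 900, "attrs": {"region": "eu-west", "db_pool": "replica-3"}},
    {"duration_ms": 1200, "attrs": {"region": "eu-west", "db_pool": "replica-3"}},
]

for pair, diff in attribute_skew(events, is_slow=lambda e: e["duration_ms"] > 500):
    print(pair, diff)
```

This only works when events carry rich, high-cardinality attributes, which is the reason the approach favors wide events over pre-aggregated metrics.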
Impact: Teams using observability-driven development reported 60% faster incident resolution (MTTR) because they could ask arbitrary questions about production behavior without waiting for new dashboards or metric instrumentation.
Conclusion & Next Steps
The key takeaways:
- Observability requires all three pillars. Metrics for alerting and trends, traces for request-level debugging, logs for root cause detail. Missing any one creates blind spots that extend incident resolution time.
- USE for resources, RED for services. Apply USE (Utilization, Saturation, Errors) to infrastructure components. Apply RED (Rate, Errors, Duration) to service endpoints. Together they pinpoint whether the problem is infrastructure capacity or application logic.
- Percentiles, not averages. Average latency hides the worst experiences. P99 latency affects 1% of users — at scale, that's thousands of people. Alert on p99, optimize for p95, report p50 as "typical experience."
- Cardinality is the cost driver. Every unique label combination is a time series. Keep metrics low-cardinality (tens of values per label). Put high-cardinality data (user_id, request_id) in traces and logs where storage is designed for it.
- Error budgets align incentives. Without error budgets, reliability is subjective ("is this reliable enough?"). With them, it's mathematical — you have X minutes of budget, each incident costs Y minutes. Ship fast when budget is healthy, fix reliability when it's low.
- Burn rate alerts catch both fast and slow problems. Short-window/high-threshold alerts page for sudden incidents. Long-window/low-threshold alerts ticket for gradual degradation. This multi-window approach prevents both alert fatigue and missed slow burns.
- Detect retry storms and cascading failures early. Monitor the ratio of retries to first-attempts. If retries suddenly dominate traffic, you have a retry storm. Monitor correlated timeout increases across services — if timeouts spread upstream, you have a cascade forming.
- SLAs should be weaker than SLOs. Your internal target (SLO) should be stricter than your customer contract (SLA). This gives you a buffer — you can miss your SLO without financial consequences, but it triggers internal reliability investment.
Next in the Series
In Part 17: Evolutionary Architecture & Conway's Law, we'll explore how organizational structure shapes system architecture, fitness functions for measuring architectural health, and strategies for managing technical debt as a first-class engineering concern.