
Part 16: Telemetry & Performance Modeling

May 15, 2026 Wasil Zafar 28 min read

"You can't fix what you can't see." Observability is the ability to understand a system's internal state from its external outputs. This module teaches you to instrument systems for deep visibility, model performance scientifically, and define reliability contracts that balance engineering investment against business needs.

Table of Contents

  1. Module 33: System Telemetry
  2. Module 34: Performance Modeling
  3. Module 35: Production Behavior Analysis
  4. Module 36: Reliability Modeling
  5. Case Studies
  6. Conclusion & Next Steps

Module 33: System Telemetry

Three Pillars of Observability

Observability is built on three complementary signal types. Each answers different questions about system behavior, and together they provide the complete picture needed to understand, debug, and optimize production systems.

Three Pillars of Observability
flowchart TD
    O[Observability] --> M[Metrics]
    O --> T[Traces]
    O --> L[Logs]

    M --> M1["What is happening?<br/>Aggregated numerical data"]
    M --> M2["Counters, Gauges, Histograms"]
    M --> M3["Low cardinality, cheap to store"]

    T --> T1["Where is it slow?<br/>Request flow across services"]
    T --> T2["Spans, Context Propagation"]
    T --> T3["High cardinality, sampled"]

    L --> L1["Why did it happen?<br/>Detailed event records"]
    L --> L2["Structured JSON, Correlation IDs"]
    L --> L3["Highest volume, most expensive"]

The Observability Mental Model: Metrics tell you something is wrong (alert fires). Traces tell you where it's wrong (which service, which operation). Logs tell you why it's wrong (the error message, the stack trace, the input that triggered it). You need all three — metrics for detection, traces for localization, logs for root cause.

Metrics: Aggregated Numerical Data

Metrics are numerical measurements collected at regular intervals. They are the cheapest signal to store and query, making them ideal for dashboards, alerting, and trend analysis. Prometheus defines four metric types (counter, gauge, histogram, and summary); the three you will reach for most often are:

Counter: Monotonically increasing value (only goes up, resets on restart).

  • Use for: total requests, total errors, bytes transferred
  • Query pattern: rate(http_requests_total[5m]) — requests per second
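
To make the "resets on restart" behavior concrete, here is a minimal sketch of a per-second rate computed from raw counter samples. It is a simplification under stated assumptions: the `(timestamp, value)` sample format and the function name are hypothetical, and Prometheus's own `rate()` also extrapolates to the window boundaries, which this sketch omits.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase over (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset: the counter is assumed
    to have restarted from zero, so the new value itself is the increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical samples: the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 160), (30, 220), (45, 30), (60, 90)]
print(counter_rate(samples))  # 3.5 requests/second
```

Because the reset is detected rather than subtracted, the restart does not produce a negative or wildly inflated rate.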

Gauge: Value that goes up and down (snapshot of current state).

  • Use for: CPU usage, memory usage, queue depth, active connections
  • Query pattern: node_memory_available_bytes — current value

Histogram: Distribution of values across configurable buckets.

  • Use for: request latency, response sizes, batch processing times
  • Query pattern: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
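
A quantile read from a histogram is an estimate, not an exact value. The sketch below shows the general idea, assuming cumulative buckets and linear interpolation inside the bucket that contains the target rank. The function name mirrors the PromQL one for readability, but this is an illustrative reimplementation, not Prometheus source code, and the bucket data is made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket whose
    count is the total number of observations.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # No finite upper edge to interpolate toward.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds (cumulative counts).
buckets = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(f"p99 is roughly {p99:.3f}s")
```

The result can only be as precise as the bucket boundaries: here the p99 falls somewhere between 0.25s and 0.5s, and the interpolation assumes observations are spread evenly inside that bucket.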
Cardinality Explosion: Every unique combination of label values creates a new time series. Adding a user_id label to a metric in a system with 1M users multiplies its series count by up to 1M, which can overwhelm a Prometheus instance. Keep label cardinality low (< 100 unique values per label). High-cardinality data belongs in traces or logs, not metrics.
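
The arithmetic behind the warning can be sketched in a few lines. The label names and counts are hypothetical, and the product is a worst-case upper bound, since not every combination of label values necessarily occurs in practice.

```python
from math import prod


def max_series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical labels on an HTTP request counter.
base_labels = {"method": 5, "status": 8, "endpoint": 40}
print(max_series_count(base_labels))  # 1600

# Adding a user_id label multiplies the bound by the number of users.
with_user_id = {**base_labels, "user_id": 1_000_000}
print(max_series_count(with_user_id))  # 1600000000
```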

Distributed Traces: Request Flow Across Services

A distributed trace tracks a single request as it flows through multiple services. Each unit of work is a span — spans form a tree (parent-child relationships) that represents the full request execution path.

Distributed Trace Propagation
sequenceDiagram
    participant Client
    participant Gateway
    participant OrderSvc
    participant PaymentSvc
    participant DB

    Note over Client,DB: Trace ID: abc-123 propagated via W3C TraceContext headers

    Client->>Gateway: POST /order [traceparent: abc-123]
    activate Gateway
    Note right of Gateway: Span: gateway (root)

    Gateway->>OrderSvc: CreateOrder [traceparent: abc-123]
    activate OrderSvc
    Note right of OrderSvc: Span: create-order (child)

    OrderSvc->>PaymentSvc: ChargeCard [traceparent: abc-123]
    activate PaymentSvc
    Note right of PaymentSvc: Span: charge-card (child)
    PaymentSvc-->>OrderSvc: OK (200ms)
    deactivate PaymentSvc

    OrderSvc->>DB: INSERT order [traceparent: abc-123]
    activate DB
    Note right of DB: Span: db-insert (child)
    DB-->>OrderSvc: OK (15ms)
    deactivate DB

    OrderSvc-->>Gateway: 201 Created
    deactivate OrderSvc
    Gateway-->>Client: 201 Created (280ms total)
    deactivate Gateway
                            

Key tracing concepts:

  • Trace ID: Unique identifier for the entire request journey (128-bit, propagated via headers)
  • Span ID: Unique identifier for one unit of work within the trace
  • Parent Span ID: Links child spans to their parent, forming the tree
  • W3C TraceContext: Standard header format: traceparent: 00-{trace-id}-{span-id}-{flags}
  • Sampling: Not every request is traced — head-based (random %) or tail-based (keep interesting traces) sampling reduces cost
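
As a rough illustration of context propagation, the sketch below builds a `traceparent` header for a root span and derives the header a service would send on an outgoing call. The helper names are hypothetical, and a real service would normally rely on an OpenTelemetry SDK rather than hand-rolled code like this.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool) -> str:
    """Start a new trace: fresh 128-bit trace ID and 64-bit span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID and flags, but mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        raise ValueError(f"malformed traceparent: {incoming!r}")
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent(sampled=True)
child = child_traceparent(root)
print(root)
print(child)
```

The trace ID stays constant across every hop, which is what lets a tracing backend stitch the spans back into one tree.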

Structured Logs: Detailed Event Records

Logs provide the highest-fidelity record of system behavior. Structured logging (JSON format with consistent fields) makes logs queryable and correlatable with metrics and traces.

Essential structured log fields:

  • timestamp: ISO 8601 with timezone (easier to read than a Unix epoch)
  • level: ERROR, WARN, INFO, DEBUG
  • service: which service emitted the log
  • trace_id: correlates the log with a distributed trace
  • correlation_id: business-level request ID (survives async boundaries)
  • message: human-readable description
  • context: structured data (user_id, order_id, etc.)
{
  "timestamp": "2026-05-15T14:23:01.847Z",
  "level": "ERROR",
  "service": "payment-service",
  "instance": "payment-service-7b4f9-xk2p",
  "trace_id": "abc123def456",
  "span_id": "span-789",
  "correlation_id": "order-98765",
  "message": "Payment charge failed — card declined",
  "error": {
    "type": "PaymentDeclinedException",
    "code": "CARD_DECLINED",
    "provider_code": "do_not_honor"
  },
  "context": {
    "user_id": "usr_12345",
    "amount_cents": 9999,
    "currency": "USD",
    "card_last4": "4242",
    "retry_attempt": 2,
    "latency_ms": 340
  }
}
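
One way to emit records shaped like the example above is a custom formatter on Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class and the logger name are hypothetical, only a subset of the fields is shown, and the timestamp is rendered with a `+00:00` offset rather than `Z` (both are valid ISO 8601).

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            # Fields passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payment-service"))
logger = logging.getLogger("payment-service-example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment charge failed: card declined",
    extra={
        "trace_id": "abc123def456",
        "correlation_id": "order-98765",
        "context": {"amount_cents": 9999, "currency": "USD", "retry_attempt": 2},
    },
)
```

Keeping the field names identical across services is what makes the logs joinable with traces by `trace_id`.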

Module 34: Performance Modeling

USE Method (for Resources)

The USE Method (Brendan Gregg) provides a systematic approach to analyzing resource performance. For every resource (CPU, memory, disk, network, locks), check three things:

  • U — Utilization: Percentage of time the resource is busy (0-100%). High utilization (>80%) indicates the resource is becoming a bottleneck.
  • S — Saturation: Degree of extra work queued that the resource can't service. Any saturation > 0 means requests are waiting (queue depth, run queue length).
  • E — Errors: Number of error events. Some errors indicate resource failure (disk I/O errors, network packet drops, ECC memory corrections).
USE Method Systematic Checklist
flowchart TD
    START[For Each Resource] --> CPU[CPU]
    START --> MEM[Memory]
    START --> DISK[Disk I/O]
    START --> NET[Network]

    CPU --> CPU_U["U: % busy<br/>mpstat, top"]
    CPU --> CPU_S["S: run queue length<br/>vmstat r column"]
    CPU --> CPU_E["E: machine check<br/>perf, dmesg"]

    MEM --> MEM_U["U: % used<br/>free, /proc/meminfo"]
    MEM --> MEM_S["S: swap activity<br/>vmstat si/so"]
    MEM --> MEM_E["E: OOM kills<br/>dmesg, kmsg"]

    DISK --> DISK_U["U: % busy<br/>iostat %util"]
    DISK --> DISK_S["S: queue depth<br/>iostat avgqu-sz"]
    DISK --> DISK_E["E: I/O errors<br/>smartctl, dmesg"]

    NET --> NET_U["U: bandwidth %<br/>sar, nstat"]
    NET --> NET_S["S: backlog drops<br/>ss, netstat"]
    NET --> NET_E["E: CRC, drops<br/>ifconfig, ethtool"]
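
The checklist can be expressed as a small function that takes one reading per resource and reports which of the three checks need attention. This is only a sketch: the `UseReading` type is hypothetical, the readings would come from the tools listed above, and the 80% threshold is the rule of thumb from this section rather than a universal constant.

```python
from dataclasses import dataclass


@dataclass
class UseReading:
    resource: str
    utilization_pct: float  # % of time busy, or % of capacity used
    saturation: float       # queued work (run queue length, queue depth, ...)
    errors: int             # error events in the sampling interval


def use_findings(reading: UseReading, utilization_threshold_pct: float = 80.0) -> list[str]:
    """Return a human-readable finding for each USE check that needs attention."""
    findings = []
    if reading.utilization_pct > utilization_threshold_pct:
        findings.append(f"{reading.resource}: utilization {reading.utilization_pct:.0f}% is high")
    if reading.saturation > 0:
        findings.append(f"{reading.resource}: work is queuing (saturation {reading.saturation})")
    if reading.errors > 0:
        findings.append(f"{reading.resource}: {reading.errors} error event(s)")
    return findings


# Hypothetical readings for two resources.
for reading in (UseReading("disk", 92.0, 4.0, 0), UseReading("cpu", 35.0, 0.0, 0)):
    print(reading.resource, use_findings(reading) or "no findings")
```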

RED Method (for Services)

The RED Method (Tom Wilkie, inspired by Google's Four Golden Signals) applies to services rather than resources. For every service endpoint, measure:

  • R (Rate): Requests per second. Tracks demand/throughput. Sudden drops indicate upstream failures; spikes indicate load events.
  • E (Errors): Failed requests per second (or error rate %). Distinguish client errors (4xx) from server errors (5xx). Track by error type for prioritization.
  • D (Duration): Response time distribution (not just average!). Track p50, p95, p99. P99 matters because 1% of requests are at least that slow: at 1M requests/day, that's 10,000 slow experiences.
USE vs RED — When to Apply: Use USE for infrastructure resources (CPU, memory, disk, network interfaces) — it finds hardware bottlenecks. Use RED for application services (APIs, microservices, databases) — it finds user-facing issues. Together they cover the full stack: RED tells you the user experience is degraded, USE tells you which resource constraint is causing it.
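
To see why Duration is tracked as a distribution, compare the mean with the percentiles on a skewed sample. The latency values below are invented for illustration, and the nearest-rank method is one of several common percentile definitions.

```python
import math
import statistics


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98% of requests take 50 ms, 2% take 2000 ms.
latencies_ms = [50] * 980 + [2000] * 20

print("mean:", statistics.mean(latencies_ms))  # 89
print("p50: ", percentile(latencies_ms, 50))   # 50
print("p95: ", percentile(latencies_ms, 95))   # 50
print("p99: ", percentile(latencies_ms, 99))   # 2000
```

The mean of 89 ms describes no real request: most are faster, and the slow tail is more than twenty times slower.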

Capacity Planning

Capacity planning answers: "How much infrastructure do we need to handle expected load with acceptable performance?" It bridges traffic forecasting, performance testing, and resource allocation.

The capacity planning process:

  1. Measure current demand: Peak RPS, p99 latency, resource utilization at peak
  2. Forecast future demand: Growth rate, seasonal patterns, planned events (launches, sales)
  3. Determine headroom requirement: Typically 30-50% spare capacity for burst absorption
  4. Load test to find ceiling: At what RPS does p99 latency exceed SLO? That's your current capacity ceiling.
  5. Calculate scaling needs: (forecasted_peak × (1 + headroom)) / capacity_per_instance = instances_needed
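
The formula in step 5 can be sketched directly, rounding up because a fractional instance cannot be provisioned. The function name and the input numbers are hypothetical.

```python
import math


def instances_needed(
    forecasted_peak_rps: float,
    headroom: float,
    capacity_per_instance_rps: float,
) -> int:
    """(forecasted_peak x (1 + headroom)) / capacity_per_instance, rounded up."""
    if capacity_per_instance_rps <= 0:
        raise ValueError("capacity_per_instance_rps must be positive")
    required_rps = forecasted_peak_rps * (1 + headroom)
    return math.ceil(required_rps / capacity_per_instance_rps)


# Hypothetical inputs: 12,000 RPS forecast, 50% headroom, and a load-tested
# ceiling of 800 RPS per instance.
print(instances_needed(12_000, 0.5, 800))  # 23
```

The per-instance capacity should come from the load test in step 4, not from a vendor spec sheet, since it is defined by where p99 latency crosses the SLO.
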
#!/bin/bash
# SLI queries using Prometheus: RED method metrics plus a saturation check

PROM_URL="http://prometheus:9090/api/v1/query"

echo "=== Service SLI Dashboard ==="

# Rate: Requests per second (last 5 minutes)
echo "--- Request Rate ---"
curl -s "$PROM_URL" --data-urlencode \
  'query=sum(rate(http_requests_total{service="order-api"}[5m]))' | \
  jq -r '.data.result[0].value[1] | tonumber | . * 100 | round / 100' | \
  xargs -I {} echo "  Current RPS: {}"

# Errors: Error rate percentage
echo "--- Error Rate ---"
curl -s "$PROM_URL" --data-urlencode \
  'query=sum(rate(http_requests_total{service="order-api",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="order-api"}[5m])) * 100' | \
  jq -r '.data.result[0].value[1] | tonumber | . * 100 | round / 100' | \
  xargs -I {} echo "  Error Rate: {}%"

# Duration: p99 latency
echo "--- P99 Latency ---"
curl -s "$PROM_URL" --data-urlencode \
  'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="order-api"}[5m])) by (le))' | \
  jq -r '.data.result[0].value[1] | tonumber | . * 1000 | round' | \
  xargs -I {} echo "  P99 Latency: {}ms"

# Saturation: Active connections / max connections
echo "--- Saturation ---"
curl -s "$PROM_URL" --data-urlencode \
  'query=sum(db_connections_active{service="order-api"}) / sum(db_connections_max{service="order-api"}) * 100' | \
  jq -r '.data.result[0].value[1] | tonumber | . * 100 | round / 100' | \
  xargs -I {} echo "  DB Pool Utilization: {}%"

Module 35: Production Behavior Analysis

Cascading Failures

A cascading failure occurs when the failure of one component causes dependent components to fail, which causes their dependents to fail, and so on — like dominoes. The signature pattern in telemetry:

  1. One service starts returning errors or timing out
  2. Callers retry aggressively, increasing load on the failing service
  3. Callers' thread pools fill with waiting requests (saturation)
  4. Callers can't serve their own requests (propagation)
  5. Callers' callers see timeouts → the cascade spreads upstream

Detection signals:

  • Correlated timeout increases across multiple services (within seconds)
  • Thread pool saturation spreading service-to-service
  • Error rate spike that starts in one service and propagates upstream
  • Latency p99 increasing everywhere while throughput drops

Retry Storms

A retry storm occurs when many clients simultaneously retry failed requests, multiplying traffic. A service handling 1000 RPS whose clients retry each failure up to 3 times can suddenly face 4000 RPS if every request fails, exactly when it's least able to handle load.

Retry Storm Amplification: If each client retries a failed request up to R times, a brief failure can multiply traffic by up to (1 + R). Layered retries compound: with client → gateway → service each making up to 3 attempts, a single failed request generates up to 3³ = 27 actual attempts. Detection pattern: traffic spikes 3-10x immediately after an error spike, without a corresponding increase in user activity.
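
The amplification arithmetic is easy to sketch. Both helpers below are hypothetical and compute worst cases, assuming every attempt fails and every layer uses its full allowance.

```python
from math import prod


def worst_case_rps(base_rps: float, retries_per_failure: int) -> float:
    """Load on a service if every request fails and is retried the maximum number of times."""
    return base_rps * (1 + retries_per_failure)


def worst_case_attempts(attempts_per_layer: list[int]) -> int:
    """Attempts reaching the innermost service for one user request with layered retries."""
    return prod(attempts_per_layer)


print(worst_case_rps(1000, 3))         # 4000
print(worst_case_attempts([3, 3, 3]))  # 27 (client, gateway, service)
```

Note the difference between the two inputs: 3 retries means 4 attempts in total, while a layer configured for 3 attempts contributes a factor of 3.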

Mitigation patterns:

  • Exponential backoff with jitter: Spread retries over time, prevent synchronization
  • Retry budgets: Limit total retries to 10% of successful requests
  • Circuit breakers: Stop retrying entirely when failure rate exceeds threshold
  • Adaptive concurrency: Reduce in-flight requests as latency increases (TCP Vegas-like)

Traffic Amplification

Traffic amplification occurs when one external request generates many internal requests (fanout). A user search might query 20 shards, each shard queries 3 replicas, and each replica makes 2 index lookups: one user request becomes 20 × 3 × 2 = 120 index lookups, on top of the 20 shard calls and 60 replica calls that lead to them.

Fanout multiplier: internal_RPS / external_RPS. A multiplier of 50x means your internal infrastructure must handle 50x the user-visible traffic. Monitor this ratio — if it grows unexpectedly (new feature, changed query patterns), capacity planning breaks.
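
A minimal sketch of the fanout arithmetic for the search example, counting calls at every level of the tree rather than only the leaf lookups. The function name is hypothetical.

```python
def calls_per_level(branching: list[int]) -> list[int]:
    """Internal calls at each level of a fanout tree for one external request."""
    calls, levels = 1, []
    for factor in branching:
        calls *= factor
        levels.append(calls)
    return levels


# 20 shards, 3 replicas per shard, 2 index lookups per replica.
levels = calls_per_level([20, 3, 2])
print(levels)       # [20, 60, 120]
print(sum(levels))  # 200 internal calls for one external request
```

Counted this way the fanout multiplier for the example is 200x, of which 120 are the leaf index lookups. Tracking the per-level numbers makes it easier to see which tier a new feature has inflated.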

Module 36: Reliability Modeling

Availability Calculations

Availability measures the fraction of time a system is operational and serving requests correctly. It's typically expressed as "nines":

Availability | Nines | Downtime/Year | Downtime/Month
99% | Two nines | 3.65 days | 7.3 hours
99.9% | Three nines | 8.77 hours | 43.8 min
99.95% | Three and a half | 4.38 hours | 21.9 min
99.99% | Four nines | 52.6 min | 4.38 min
99.999% | Five nines | 5.26 min | 26.3 sec

Composite availability for serial dependencies: If Service A (99.9%) calls Service B (99.9%), the combined availability is 99.9% × 99.9% = 99.8%. Each additional serial dependency multiplies the combined availability by another factor below 1, so the failure probabilities roughly add up. Five services at 99.9% each = 99.5% combined, a drop from "three nines" to about 2.3 nines.

import math


def calculate_availability(uptime_hours: float, total_hours: float) -> dict:
    """Calculate availability metrics from uptime data."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    availability = uptime_hours / total_hours

    # Number of nines: 99.9% -> 3.0, 99.99% -> 4.0
    if availability >= 1.0:
        nines = float('inf')
    elif availability <= 0:
        nines = 0
    else:
        nines = -math.log10(1 - availability)

    return {
        "availability_percent": round(availability * 100, 4),
        "nines": round(nines, 2),
        # 8760 hours per year; 43800 minutes per average month (730 h x 60)
        "downtime_hours_per_year": round((1 - availability) * 8760, 2),
        "downtime_minutes_per_month": round((1 - availability) * 43800, 2),
    }


def composite_serial(availabilities: list[float]) -> float:
    """Combined availability for serial (all must work) dependencies."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def composite_parallel(availabilities: list[float]) -> float:
    """Combined availability for parallel (any can work) redundancy."""
    # P(all fail) = product of (1-a) for each
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1 - a)
    return 1 - all_fail


# Example: System with 3 serial services
services = [0.999, 0.999, 0.999]  # Each at 99.9%
serial_avail = composite_serial(services)
print(f"Serial (3 services at 99.9%): {serial_avail*100:.3f}%")
# Output: 99.700%, down from three nines to about 2.5

# Example: Database with 3 replicas (any one can serve reads)
replicas = [0.99, 0.99, 0.99]  # Each at 99%
parallel_avail = composite_parallel(replicas)
print(f"Parallel (3 replicas at 99%): {parallel_avail*100:.4f}%")
# Output: 99.9999% — gained four nines!

# Example: Full system availability calculation
result = calculate_availability(uptime_hours=8750, total_hours=8760)
print(f"\nSystem availability: {result['availability_percent']}%")
print(f"Nines: {result['nines']}")
print(f"Downtime per year: {result['downtime_hours_per_year']} hours")
print(f"Downtime per month: {result['downtime_minutes_per_month']} minutes")

Error Budgets & Burn Rate

An error budget is the complement of your SLO: it's how much unreliability you're allowed. If your SLO is 99.9% availability, your error budget is 0.1% (about 43 minutes of downtime in a 30-day window). The error budget is "spent" by incidents, deployments, and maintenance.

Error Budget Policy: When the error budget is exhausted (too many incidents this month), engineering prioritizes reliability over features. This creates a natural balance: product teams want to ship fast (which risks reliability), SRE teams want stability. The error budget is the objective arbiter — spend it on features until it's gone, then fix reliability.

Burn rate measures how fast you're consuming your error budget relative to the SLO window. A burn rate of 1x means you'll exactly exhaust the budget by end of window. A burn rate of 10x means you'll exhaust it in 1/10th the time — trigger an alert.

Error Budget Burn Rate Alert Thresholds
flowchart LR
    subgraph "30-Day Error Budget"
        B[Budget: 43.2 min]
        B --> R1["1x burn rate<br/>Normal: budget lasts full month"]
        B --> R2["2x burn rate<br/>Budget exhausted in 15 days"]
        B --> R6["6x burn rate<br/>Budget exhausted in 5 days<br/>⚠️ WARN: Ticket"]
        B --> R14["14.4x burn rate<br/>Budget exhausted in 2 days<br/>🚨 CRITICAL: Page"]
    end

from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Calculate error budget and burn rate for an SLO."""
    slo_target: float         # e.g., 0.999 for 99.9%
    window_days: int = 30     # SLO measurement window

    @property
    def budget_fraction(self) -> float:
        """Allowed error fraction (1 - SLO)."""
        return 1 - self.slo_target

    @property
    def budget_minutes(self) -> float:
        """Error budget in minutes for the window."""
        return self.budget_fraction * self.window_days * 24 * 60

    def burn_rate(self, error_rate: float) -> float:
        """Current burn rate given observed error rate.

        burn_rate = observed_error_rate / allowed_error_rate
        burn_rate of 1.0 means budget will last exactly the window.
        burn_rate of 10.0 means budget exhausted in 1/10 of window.
        """
        if self.budget_fraction == 0:
            return float('inf')
        return error_rate / self.budget_fraction

    def time_to_exhaustion_hours(self, error_rate: float) -> float:
        """Hours until budget is fully consumed at current rate."""
        br = self.burn_rate(error_rate)
        if br <= 0:
            return float('inf')
        return (self.window_days * 24) / br

    def budget_remaining(self, consumed_minutes: float) -> dict:
        """How much budget remains."""
        remaining = self.budget_minutes - consumed_minutes
        return {
            "total_budget_min": round(self.budget_minutes, 2),
            "consumed_min": round(consumed_minutes, 2),
            "remaining_min": round(max(0, remaining), 2),
            "remaining_percent": round(max(0, remaining / self.budget_minutes * 100), 1),
            "exhausted": remaining <= 0
        }


# Example: 99.9% SLO over 30 days
budget = ErrorBudget(slo_target=0.999, window_days=30)
print(f"SLO: {budget.slo_target*100:.1f}%")
print(f"Error budget: {budget.budget_minutes:.1f} minutes ({budget.budget_fraction*100:.2f}%)")

# Current error rate is 0.5% (5x the budget)
current_error_rate = 0.005
br = budget.burn_rate(current_error_rate)
print(f"\nCurrent error rate: {current_error_rate*100}%")
print(f"Burn rate: {br:.1f}x")
print(f"Time to exhaustion: {budget.time_to_exhaustion_hours(current_error_rate):.1f} hours")

# After a 20-minute incident
status = budget.budget_remaining(consumed_minutes=20)
print(f"\nAfter 20-min incident:")
print(f"  Remaining: {status['remaining_min']} min ({status['remaining_percent']}%)")
print(f"  Exhausted: {status['exhausted']}")

SLI / SLO / SLA

These three concepts form a hierarchy for defining and managing reliability:

SLI (Service Level Indicator): A quantitative measure of some aspect of service quality. It's what you measure — the raw metric.

  • Availability SLI: successful_requests / total_requests
  • Latency SLI: requests_under_300ms / total_requests (proportion within threshold)
  • Throughput SLI: processed_jobs / submitted_jobs
  • Correctness SLI: correct_responses / total_responses

SLO (Service Level Objective): A target value for an SLI over a time window. It's your internal reliability goal.

  • "99.9% of requests complete successfully over 30 days"
  • "95% of requests complete in under 300ms over 7 days"
  • SLO creates the error budget: budget = 100% - SLO

SLA (Service Level Agreement): A business contract with consequences (credits, penalties) if the promised service level is missed. SLAs should be less strict than SLOs: your internal target should be higher than your customer promise, so an SLO miss warns you before the SLA is breached.
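
The relationship between the three can be sketched as a ratio (the SLI) compared against targets (the SLO and SLA). The `Request` type and the sample traffic are hypothetical, and counting only successful requests as "good" for the latency SLI is an assumption made here; other definitions exclude failed requests from the denominator instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    latency_ms: float


def availability_sli(requests: list[Request]) -> float:
    """successful_requests / total_requests, treating 5xx as failures."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300) -> float:
    """Share of all requests that succeeded and finished under the threshold."""
    good = sum(1 for r in requests if r.status < 500 and r.latency_ms < threshold_ms)
    return good / len(requests)


# Hypothetical window: 950 fast successes, 48 slow successes, 2 server errors.
window = [Request(200, 120)] * 950 + [Request(200, 450)] * 48 + [Request(503, 30)] * 2

availability = availability_sli(window)
latency = latency_sli(window)
print(f"availability SLI: {availability:.3f}, meets 99.9% SLO: {availability >= 0.999}")
print(f"latency SLI:      {latency:.3f}, meets 95% SLO:   {latency >= 0.95}")
```

In this sample the availability SLO of 99.9% is missed (0.998) while a looser contractual SLA of, say, 99.5% would still hold, which is the buffer the SLO-stricter-than-SLA rule is meant to create.
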

# Prometheus alerting rules for SLO-based alerts (burn-rate thresholds over two windows)
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 14.4x rate over 1 hour (exhausts 30-day budget in 2 days)
      - alert: HighErrorBurnRate_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "High error burn rate — budget exhausted in ~2 days"
          description: "Error rate {{ $value | humanizePercentage }} exceeds 14.4x burn rate threshold"
          runbook: "https://runbooks.internal/slo-budget-critical"

      # Slow burn: 6x rate over 6 hours (exhausts budget in 5 days)
      - alert: HighErrorBurnRate_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Elevated error burn rate — budget exhausted in ~5 days"
          description: "6h error rate {{ $value | humanizePercentage }} exceeds 6x burn rate"

      # Latency SLO: p99 > 500ms for 5 minutes
      - alert: LatencySLO_Breach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "P99 latency exceeds 500ms SLO threshold"
          description: "P99 latency is {{ $value | humanizeDuration }}"

Case Studies

Google SRE: Error Budgets in Practice

Case Study Google — Error Budget Policy

Google's SRE book formalized the error budget concept. The key insight: 100% reliability is the wrong target. Users can't tell the difference between 99.99% and 99.999% (their ISP, device, and WiFi introduce more unreliability), but the engineering cost difference is enormous.

Google's error budget policy:

  1. Define SLO: Product and SRE agree on a target (e.g., 99.95% availability)
  2. Measure SLI: Continuously track actual performance against the SLO
  3. Calculate remaining budget: budget = (1 - SLO) × window - consumed
  4. If budget remains: Ship features, deploy aggressively, innovate
  5. If budget exhausted: Freeze features, focus on reliability (postmortems, automation, testing)

Results: Teams with error budgets shipped 30% more features than teams without (they weren't over-cautious). Reliability actually improved because the policy made reliability investment a rational, data-driven decision rather than a political battle between "move fast" and "don't break things."

Multi-window burn rate alerts (Google's recommendation): Use two alert windows: a short window (1h) with a high burn-rate threshold (14.4x) for critical pages, and a long window (6h) with a lower threshold (6x) for warning tickets. This catches both sudden incidents and slow degradation.
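
Google's SRE Workbook adds one refinement to this scheme: each alert window is paired with a much shorter one (for example 5 minutes alongside 1 hour), and the alert fires only while both exceed the threshold, so it clears soon after the errors stop. A minimal sketch of that condition, with a hypothetical function name and the error rates passed in as plain numbers:

```python
def burn_rate_alert(
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo_target: float = 0.999,
    burn_rate_threshold: float = 14.4,
) -> bool:
    """Fire only if both windows burn budget faster than the threshold."""
    allowed_error_rate = 1 - slo_target
    threshold = burn_rate_threshold * allowed_error_rate
    return (
        long_window_error_rate > threshold
        and short_window_error_rate > threshold
    )


# Ongoing incident: 2% errors over the last hour and the last 5 minutes.
print(burn_rate_alert(0.02, 0.02))   # True

# Incident already over: the 1h window is still elevated, the 5m window is clean.
print(burn_rate_alert(0.02, 0.001))  # False
```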


Honeycomb: Observability-Driven Development

Case Study Honeycomb — High-Cardinality Observability

Honeycomb pioneered "observability-driven development" — the practice of instrumenting code before shipping it to production, then using high-cardinality trace data to understand behavior rather than pre-defined dashboards.

Key principles:

  • Events over metrics: Instead of pre-aggregated counters, store rich structured events with all context (user_id, feature_flag, build_version, query_plan). Query them on-demand.
  • High cardinality is essential: You need to slice by user_id (millions of unique values) to debug "why is user X slow?" — metrics can't do this.
  • BubbleUp pattern: Compare slow requests vs fast requests — what attributes differ? Automated analysis finds that slow requests all have region=eu-west and db_pool=replica-3.
  • Instrument at deploy: Every deploy includes trace instrumentation. Developers own observability — they add spans around code they want to understand.
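
The comparison idea behind the BubbleUp pattern can be approximated in a few lines: for every attribute value, compare how often it appears among slow events versus fast ones. This is an illustrative sketch with made-up events and a hypothetical function name, not Honeycomb's implementation.

```python
from collections import Counter


def attribute_skew(slow: list[dict], fast: list[dict]) -> list[tuple[tuple[str, str], float]]:
    """Rank (attribute, value) pairs by share among slow events minus share among fast."""
    if not slow or not fast:
        return []
    slow_counts = Counter(pair for event in slow for pair in event.items())
    fast_counts = Counter(pair for event in fast for pair in event.items())
    skew = {
        pair: slow_counts[pair] / len(slow) - fast_counts[pair] / len(fast)
        for pair in slow_counts.keys() | fast_counts.keys()
    }
    return sorted(skew.items(), key=lambda item: item[1], reverse=True)


# Hypothetical events, already split into slow and fast by a latency threshold.
slow_events = [
    {"region": "eu-west", "db_pool": "replica-3"},
    {"region": "eu-west", "db_pool": "replica-3"},
    {"region": "eu-west", "db_pool": "replica-3"},
]
fast_events = [
    {"region": "us-east", "db_pool": "replica-1"},
    {"region": "us-east", "db_pool": "replica-1"},
    {"region": "us-east", "db_pool": "replica-1"},
    {"region": "eu-west", "db_pool": "replica-1"},
]

ranked = attribute_skew(slow_events, fast_events)
for (attribute, value), score in ranked[:2]:
    print(f"{attribute}={value}: {score:+.2f}")
```

Here `db_pool=replica-3` ranks first because it appears in every slow event and no fast one, while `region=eu-west` ranks second since one fast request shares it.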

Impact: Teams using observability-driven development reported 60% faster incident resolution (MTTR) because they could ask arbitrary questions about production behavior without waiting for new dashboards or metric instrumentation.


Conclusion & Next Steps

The key takeaways:

  • Observability requires all three pillars. Metrics for alerting and trends, traces for request-level debugging, logs for root cause detail. Missing any one creates blind spots that extend incident resolution time.
  • USE for resources, RED for services. Apply USE (Utilization, Saturation, Errors) to infrastructure components. Apply RED (Rate, Errors, Duration) to service endpoints. Together they pinpoint whether the problem is infrastructure capacity or application logic.
  • Percentiles, not averages. Average latency hides the worst experiences. P99 latency describes the slowest 1% of requests, and at scale that's thousands of slow experiences. Alert on p99, optimize for p95, report p50 as "typical experience."
  • Cardinality is the cost driver. Every unique label combination is a time series. Keep metrics low-cardinality (tens of values per label). Put high-cardinality data (user_id, request_id) in traces and logs where storage is designed for it.
  • Error budgets align incentives. Without error budgets, reliability is subjective ("is this reliable enough?"). With them, it's mathematical — you have X minutes of budget, each incident costs Y minutes. Ship fast when budget is healthy, fix reliability when it's low.
  • Burn rate alerts catch both fast and slow problems. Short-window/high-threshold alerts page for sudden incidents. Long-window/low-threshold alerts ticket for gradual degradation. This multi-window approach prevents both alert fatigue and missed slow burns.
  • Detect retry storms and cascading failures early. Monitor the ratio of retries to first-attempts. If retries suddenly dominate traffic, you have a retry storm. Monitor correlated timeout increases across services — if timeouts spread upstream, you have a cascade forming.
  • SLAs should be weaker than SLOs. Your internal target (SLO) should be stricter than your customer contract (SLA). This gives you a buffer — you can miss your SLO without financial consequences, but it triggers internal reliability investment.

Next in the Series

In Part 17: Evolutionary Architecture & Conway's Law, we'll explore how organizational structure shapes system architecture, fitness functions for measuring architectural health, and strategies for managing technical debt as a first-class engineering concern.