Part 2: Metrics Fundamentals & the Four Golden Signals

What Is a Metric?

A metric is a numeric representation of system state measured over time. Unlike logs (which record discrete events) or traces (which record request journeys), metrics are aggregations — they summarise many events into a single number at a given point in time.

Consider the statement "1,247 HTTP requests per second." This is a metric. It does not tell you what any individual request contained, or which user made it, or what the response was. It tells you something quantitative about the overall system state at that moment.

Anatomy of a Metric

In modern monitoring systems (especially Prometheus-style), a metric has three components:

Component	Description	Example
Name	Identifies what is being measured	`http_requests_total`
Labels	Key-value pairs that add dimensions	`method="GET", status="200", path="/api/users"`
Value	The numeric measurement	`12845.0`

A complete metric data point also includes a timestamp. Together, a stream of (timestamp, value) pairs for a named metric with given labels forms a time series.

# Example Prometheus metric exposition format
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",path="/api/users"} 12845
http_requests_total{method="POST",status="201",path="/api/users"} 384
http_requests_total{method="GET",status="404",path="/api/items"} 27
http_requests_total{method="GET",status="500",path="/api/orders"} 3
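As a rough illustration of how the name, labels, and value combine into one line of the listing above, here is a minimal Python sketch. The helper `format_sample` is a hypothetical name chosen for this example, not part of any Prometheus client library, and it only covers the simple `name{labels} value` form (no timestamps, HELP, or TYPE lines).

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as name{k="v",...} value (illustrative helper only)."""

    def escape(label_value: str) -> str:
        # Label values escape backslash, double quote, and newline.
        return (
            label_value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    label_part = ",".join(f'{key}="{escape(val)}"' for key, val in labels.items())
    # Print whole numbers without a trailing ".0", matching the listing above.
    number = float(value)
    value_part = str(int(number)) if number.is_integer() else repr(number)
    if not labels:
        return f"{name} {value_part}"
    return f"{name}{{{label_part}}} {value_part}"


line = format_sample(
    "http_requests_total",
    {"method": "GET", "status": "200", "path": "/api/users"},
    12845,
)
print(line)  # http_requests_total{method="GET",status="200",path="/api/users"} 12845
```

Each distinct set of label values passed to this helper corresponds to a separate time series.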

Labels and the Cardinality Trap

Labels are powerful — they let you slice and dice your metrics. Instead of one flat "request count" number, you have a multi-dimensional view: count by method, by status code, by endpoint, by region, by customer tier.

                            
The Cardinality Trap: Every unique combination of label values creates a separate time series in your metrics database. If you add a label with high cardinality, such as user_id (millions of unique values) or request_id (billions), you will create millions or billions of time series. This is called a cardinality explosion, and it is one of the most common production issues with Prometheus. It can crash your monitoring system. Never use high-cardinality values as metric labels.
                        

Good label candidates (low cardinality):

method (GET, POST, PUT, DELETE — 4-6 values)
status_code (200, 201, 400, 404, 500 — ~10 values)
region (us-east-1, eu-west-1 — <20 values)
environment (prod, staging, dev — 3 values)

Bad label candidates (high cardinality):

user_id, customer_id — millions of values
request_id, trace_id — unique per request
url with query strings — unbounded
error_message — unbounded free text
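To make the arithmetic behind the trap concrete, the sketch below multiplies the number of distinct values per label to get a worst-case series count for a single metric. The cardinalities are illustrative figures in the spirit of the lists above, and the result is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(label_cardinalities.values())


# Illustrative cardinalities for low-cardinality labels.
low = {"method": 5, "status": 10, "region": 20, "environment": 3}
# The same metric after adding a single high-cardinality label.
high = {**low, "user_id": 1_000_000}

print(series_count(low))   # 3000
print(series_count(high))  # 3000000000
```

One extra label turns a few thousand series into billions, which is why the bad candidates above are best kept in logs or traces instead.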

The Four Metric Types

Prometheus defines four core metric types. Understanding them deeply is essential — choosing the wrong type leads to incorrect queries and misleading dashboards.

Counters — Always Going Up

A counter is a metric that only increases. It represents a cumulative count of events. Counters reset to zero only when the process restarts.

Examples:

Total HTTP requests served since startup
Total bytes sent or received
Total errors encountered
Total database queries executed

                            
Querying Counters: Raw counter values are rarely useful; what you want is the rate of change. In PromQL, rate(http_requests_total[5m]) gives you the per-second request rate averaged over the last 5 minutes. In NRQL, use the rate() or derivative() functions.
                        

# Prometheus: requests per second over 5 minutes
rate(http_requests_total{job="api"}[5m])

# Prometheus: error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100
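The idea behind a counter rate can be sketched in a few lines of Python. This is a simplified approximation, not the Prometheus implementation: it assumes any drop in value is a restart from zero and it does not extrapolate to the edges of the window as Prometheus does. The function name `simple_rate` and the sample data are hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter from (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so assume the process restarted from zero.
            increase += current
    return increase / elapsed


# Scraped every 15 seconds; the counter resets between t=30 and t=45.
samples = [(0, 100), (15, 400), (30, 700), (45, 50), (60, 350)]
print(round(simple_rate(samples), 2))  # 15.83
```

Handling the reset is what makes the result usable across restarts; subtracting the first raw value from the last would give a negative or misleading number.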

Gauges — Snapshots of Current State

A gauge is a metric that can go up or down. It represents the current value of something at a given moment — like a snapshot of system state.

Examples:

Current memory usage in bytes
Number of active connections
Current queue depth
CPU utilisation percentage
Current number of running goroutines or threads

# Example: Alert when memory usage exceeds 85%
# In Prometheus alerting rule:
# node_memory_Active_bytes / node_memory_MemTotal_bytes * 100 > 85

# Example Prometheus metric
node_memory_Active_bytes 2147483648
node_memory_MemTotal_bytes 8589934592
# Usage: 2147483648 / 8589934592 * 100 = 25%

Histograms — Distributions and Percentiles

A histogram samples observations (typically request durations or response sizes) and counts them in configurable buckets. It enables calculation of approximate percentiles on the server side.

A Prometheus histogram named http_request_duration_seconds exposes three kinds of time series: one _bucket series per configured bucket, plus a _sum and a _count series. Bucket counts are cumulative, so each bucket includes all observations from the smaller buckets:

http_request_duration_seconds_bucket{le="0.1"}: count of requests completing in ≤ 0.1s
http_request_duration_seconds_bucket{le="0.5"}: count of requests completing in ≤ 0.5s
http_request_duration_seconds_bucket{le="+Inf"}: count of all requests (same value as _count below)
http_request_duration_seconds_sum: sum of all observation values
http_request_duration_seconds_count: total number of observations

# PromQL: Calculate 95th percentile latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# PromQL: Average request duration over the last 5 minutes
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
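To show why histogram percentiles are approximate, here is a simplified Python sketch of the linear-interpolation idea behind bucket-based quantile estimation. It is an illustration under stated assumptions, not the Prometheus source: `estimate_quantile` is a hypothetical name, the bucket counts are made up, and the input is assumed to be cumulative, sorted by upper bound, and terminated by a +Inf bucket.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the answer
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The answer lies in the overflow bucket, so the best
                # available estimate is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for buckets le=0.1, 0.5, 1.0, +Inf.
buckets = [(0.1, 800), (0.5, 950), (1.0, 990), (math.inf, 1000)]
print(round(estimate_quantile(0.95, buckets), 3))  # 0.5
print(round(estimate_quantile(0.99, buckets), 3))  # 1.0
```

The estimate can only be as precise as the bucket that contains it, which is why bucket boundaries near the latencies you care about matter.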

                            
Choosing Histogram Buckets: Buckets should be chosen based on your SLO targets. If your latency SLO is p99 < 500ms, include buckets at 100ms, 250ms, 500ms, 1000ms, 2500ms, 5000ms. Without a bucket boundary near your SLO target, you cannot measure compliance accurately.
                        

Summaries — Client-Side Percentiles

Summaries are similar to histograms but calculate percentiles on the client side (in the instrumented application) rather than the server side. This avoids the bucket-interpolation error of histograms, so the reported quantiles are typically more accurate for a single instance, but it is less flexible: you cannot aggregate percentiles from multiple instances of a service.

                            
Histograms vs Summaries: In most modern observability setups, prefer histograms. They can be aggregated across multiple service instances and allow percentile calculation at query time with any quantile value. Summaries require you to specify quantiles at instrumentation time and cannot be meaningfully aggregated.
                        

Percentiles vs Averages — Why the Mean Misleads You

This is one of the most important concepts in performance monitoring. The arithmetic mean of response times almost always tells you an incomplete — often dangerously misleading — story.

Why Averages Lie

Mathematical Example

The Hidden Tail: When Average = 50ms Means Users Are Suffering

Imagine a service that handles 1,000 requests per minute. In one minute:

990 requests complete in 20ms
10 requests complete in 3,000ms (3 seconds)

Average response time = (990 × 20 + 10 × 3000) / 1000 = (19,800 + 30,000) / 1000 = 49.8ms

Your average latency dashboard shows ~50ms. Everything looks fine. But 10 users per minute (1% of traffic) are waiting 3 full seconds for a response. If this is a checkout flow, that is 10 frustrated, potentially abandoning customers every minute.

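The slowest 1% of requests all take 3,000ms, so every percentile above the 99th (p99.5, p99.9) reports 3,000ms, and p99 itself sits right at the edge of that cliff: one more slow request per minute pushes it from 20ms to 3,000ms. This tail is the signal your SLO should be tracking, not the mean.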
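The numbers in this example can be reproduced with a short Python sketch. It uses the nearest-rank definition of a percentile, which is one of several common definitions and is an assumption of this example; the helper name `nearest_rank` is hypothetical.

```python
import math
from statistics import mean, median


def nearest_rank(sorted_values: list[float], pct: float) -> float:
    """Smallest value with at least pct percent of the data at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# One minute of traffic: 990 fast requests and 10 slow ones.
latencies_ms = sorted([20] * 990 + [3000] * 10)

print(mean(latencies_ms))                        # 49.8
print(median(latencies_ms))                      # 20.0
print(nearest_rank(latencies_ms, 99.9))          # 3000
print(sum(1 for v in latencies_ms if v > 1000))  # 10
```

The mean and the median both look healthy, while the tail percentile exposes the ten requests that took three seconds.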


p50, p95, p99, p99.9 Explained

Percentiles (or quantiles) answer: "Within what response time do X% of requests complete?"

Percentile	Meaning	Typical Use
p50 (median)	Half of requests are faster than this	Typical user experience baseline
p95	95% of requests are faster than this; 5% are slower	Common SLO target for non-critical APIs
p99	99% of requests are faster than this; 1% are slower	Common SLO target for user-facing APIs
p99.9	99.9% of requests are faster; 0.1% are slower	SLO target for critical payment/auth flows

                            
Rule of Thumb: Monitor p99 for user-facing services. At 100 requests/second, roughly one request every second is at or beyond the p99 latency. At 10,000 requests/second, that is about 100 requests every second. At scale, tail latency matters enormously.
                        

The Four Golden Signals

In the Google SRE Book (one of the foundational texts of reliability engineering), the team describes the "Four Golden Signals" — the minimum viable set of metrics that give you meaningful visibility into any service's health. If you can only instrument four things, instrument these.

The Four Golden Signals

                                flowchart LR
                                    A[Service Health] --> B[Latency\nHow long?]
                                    A --> C[Traffic\nHow much?]
                                    A --> D[Errors\nHow many failing?]
                                    A --> E[Saturation\nHow full?]
                                    style B fill:#3B9797,color:#fff
                                    style C fill:#16476A,color:#fff
                                    style D fill:#BF092F,color:#fff
                                    style E fill:#132440,color:#fff

Signal 1: Latency

Latency measures how long it takes to serve a request. It directly correlates with user experience — slow responses frustrate users, cause SLO violations, and can cascade into broader system failures.

                            
Critical Latency Insight: Distinguish between successful request latency and failed request latency. A request that fails instantly (in 1ms with a 500 error) is very different from a request that times out (in 30 seconds with a 504). Tracking error latency separately helps diagnose whether errors are fast-failing or slow-timing-out; the latter is far more damaging to system health.
                        

# PromQL: p99 latency for successful requests only
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le)
)

# PromQL: p99 latency for failed (5xx) requests, to compare against the query above
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le)
)

Signal 2: Traffic

Traffic measures how much demand is being placed on the system. For web services this is usually requests per second; for streaming services it might be bytes per second; for databases it could be queries per second or transactions per second.

Traffic metrics are essential for:

Capacity planning — understanding growth trends and peak loads
Anomaly detection — sudden traffic drops can indicate upstream failures; sudden spikes can indicate attacks or viral events
Contextualising other signals — a 5% error rate at 10 RPS is very different from 5% at 10,000 RPS

# PromQL: Requests per second over 5-minute window
sum(rate(http_requests_total[5m]))

# PromQL: Traffic by endpoint and method
sum(rate(http_requests_total[5m])) by (method, path)

Signal 3: Errors

Errors measure the rate at which requests are failing. This includes explicit failures (HTTP 5xx responses, exceptions) and implicit failures (HTTP 200 responses that return wrong data, requests that complete but exceed latency SLOs).

                            
Explicit vs Implicit Errors: HTTP 500 errors are explicit failures, where the service knows the request failed. But a service can return HTTP 200 with stale, incorrect, or incomplete data. These "silent errors" are harder to detect but just as damaging. Instrument your application logic to emit metrics for business-level failures, not just HTTP status codes.
                        

# PromQL: Error rate as percentage of total traffic
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# Prometheus alerting rule: fire when the error rate stays above 1% for 5 minutes
# expr: 100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 1
# for: 5m
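The error-rate calculation above can be mirrored in Python to check the arithmetic by hand. This is a sketch under assumptions: the per-status request rates are invented, `error_percentage` is a hypothetical helper, and `re.fullmatch` stands in for the fully anchored regex matching that PromQL applies to `status=~"5.."`.

```python
import re


def error_percentage(rates_by_status: dict[str, float], pattern: str = "5..") -> float:
    """Percentage of traffic whose status code fully matches the pattern."""
    total = sum(rates_by_status.values())
    if total == 0:
        # Design choice for this sketch: report 0% when there is no traffic.
        return 0.0
    errors = sum(
        rate for status, rate in rates_by_status.items()
        if re.fullmatch(pattern, status)
    )
    return 100 * errors / total


# Hypothetical requests per second, keyed by HTTP status code.
rates = {"200": 970.0, "404": 20.0, "500": 8.0, "503": 2.0}
pct = error_percentage(rates)
print(pct)      # 1.0
print(pct > 1)  # False
```

Note that 404 responses are not counted, and that exactly 1% does not satisfy a strict `> 1` threshold.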

Signal 4: Saturation

Saturation measures how "full" the service is — how close to its resource limits it is operating. A service at 100% CPU saturation cannot handle more traffic. A connection pool at capacity will start queueing or rejecting requests. A disk at 100% utilisation will cause writes to fail.

Saturation metrics are leading indicators — a saturating resource predicts future failures before they occur. This makes saturation the most proactive of the four signals.

Resource	Saturation Metric	Warning Threshold
CPU	CPU utilisation %	> 80% sustained
Memory	Memory utilisation % or swap usage	> 85% or any swap
Disk	Disk space utilisation %, IOPS utilisation	> 80% space, > 70% IOPS
Network	Bandwidth utilisation %, packet loss	> 60% bandwidth, any packet loss
DB Connections	Connection pool utilisation %	> 75%
Thread Pool	Active threads / max threads	> 80%

# PromQL: CPU saturation
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# PromQL: Memory saturation
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# PromQL: Disk space saturation
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
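As a small worked example, the sketch below applies the memory formula from the query above and compares a few readings against warning thresholds taken from the table. The readings, the dictionary keys, and the function names are hypothetical; the threshold values come from the table.

```python
def memory_saturation_pct(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Same formula as the memory query: 100 * (1 - available / total)."""
    return 100 * (1 - mem_available_bytes / mem_total_bytes)


# Warning thresholds (percent) taken from the table above.
WARNING_THRESHOLDS_PCT = {"cpu": 80, "memory": 85, "disk_space": 80}


def over_threshold(readings_pct: dict[str, float]) -> list[str]:
    """Names of resources whose reading strictly exceeds its warning threshold."""
    return sorted(
        name for name, pct in readings_pct.items()
        if pct > WARNING_THRESHOLDS_PCT[name]
    )


# Hypothetical host: 1 GiB available out of 8 GiB total.
memory_pct = memory_saturation_pct(1 * 1024**3, 8 * 1024**3)
readings = {"cpu": 62.0, "memory": memory_pct, "disk_space": 80.0}

print(memory_pct)                # 87.5
print(over_threshold(readings))  # ['memory']
```

In practice the "sustained" part of a threshold matters too, which is what the `for:` clause of an alerting rule expresses.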

The RED Method

Coined by Tom Wilkie (then at Weaveworks, later of Grafana Labs), the RED method is a simplified framework for monitoring microservices that focuses on the user-observable behaviour of each service. RED stands for:

Rate — how many requests per second is the service handling?
Errors — how many of those requests are failing?
Duration — how long does each request take?

RED is essentially a service-level subset of the Four Golden Signals (omitting Saturation, which is more infrastructure-focused). It maps closely to how users experience a service: they care about throughput (Rate), reliability (Errors), and speed (Duration).

                            
RED vs USE vs Golden Signals: Use RED for service-level monitoring (each microservice endpoint). Use USE (Utilisation, Saturation, Errors) for resource-level monitoring (CPU, memory, disk). Use the Four Golden Signals when you want a unified framework that covers both.
                        

Hands-On Exercise

Applying RED to a Real Service

Take any API you operate (or think about a hypothetical checkout service) and answer:

Rate: How many requests per second does it handle at peak? At off-peak? What does a 10x spike look like?
Errors: What HTTP status codes does it return? What percentage are 4xx (client errors)? 5xx (server errors)? What non-HTTP errors can occur (timeouts, circuit breaks, partial failures)?
Duration: What is the p50, p95, p99 latency? What is your SLO target? Are any endpoints systematically slower?

Just answering these questions for every service you operate puts you ahead of most production teams. Instrument the answers, and you have a solid operational baseline.
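For the exercise above, a first pass at the three RED numbers can be computed from a plain list of request records. This is a minimal sketch: the `Request` record, the `red_summary` function, and the sample data are all hypothetical, percentiles use the nearest-rank definition, and only 5xx responses are counted as errors.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Rate, Errors, and Duration for the requests seen in one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)

    def percentile(pct: int) -> float:
        # Nearest-rank percentile over the sorted durations.
        rank = max(1, math.ceil(pct * len(durations) / 100))
        return durations[rank - 1]

    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_pct": 100 * server_errors / len(requests),
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
    }


# Hypothetical 60-second window of traffic for one endpoint.
window = [Request(200, 20)] * 18 + [Request(404, 15), Request(500, 900)]
summary = red_summary(window, window_seconds=60)
print(summary["error_pct"], summary["p50_ms"], summary["p99_ms"])  # 5.0 20 900
```

With only 20 requests in the window, a single slow request dominates p99, which is a reminder to read tail percentiles together with the request rate.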


Infrastructure Monitoring

Beyond application metrics, you need visibility into the infrastructure your services run on. Compute, storage, and network metrics form the foundation of your operational picture.

Compute Metrics

Key metrics to monitor for every server or container host:

CPU utilisation — total % across all cores; distinguish user vs system vs iowait
CPU load average — 1/5/15-minute averages; load > number of cores indicates saturation
Memory utilisation — used, cached, available; watch for OOM (out-of-memory) pressure
Process/thread count — can indicate resource leaks
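The load-average rule in the list above (load greater than the number of cores suggests saturation) is easy to express as a ratio. The sketch below uses made-up readings for an 8-core host, and `load_per_core` is a hypothetical helper name.

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Normalise a load average by core count; above 1.0 suggests saturation."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


# Hypothetical 1-, 5- and 15-minute load averages on an 8-core host.
readings = {"1m": 12.4, "5m": 9.6, "15m": 6.0}
cores = 8

for window, load in readings.items():
    ratio = load_per_core(load, cores)
    print(window, round(ratio, 2), "saturated" if ratio > 1 else "ok")
```

On Unix-like systems, `os.getloadavg()` and `os.cpu_count()` from the Python standard library can supply real inputs. A 1-minute value well above the 15-minute value, as in these readings, suggests load is rising rather than steady.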

Storage Metrics

Disk space — utilisation % per filesystem; alert at 80%, page at 90%
Disk IOPS — reads and writes per second; compare against disk's rated IOPS capacity
Disk throughput — bytes read/written per second
Disk I/O queue depth — requests waiting for the disk; high queue depth indicates I/O saturation
Disk latency — average I/O completion time; HDD: <10ms, SSD: <1ms, NVMe: <0.1ms

Network Metrics

Packet loss — any packet loss indicates network problems; even 0.1% loss can devastate TCP performance
Bandwidth utilisation — inbound and outbound bytes per second vs link capacity
Round-trip time (RTT) — latency between nodes; spikes indicate congestion or routing problems
Connection counts — established, TIME_WAIT, CLOSE_WAIT; high TIME_WAIT count can indicate connection handling issues
Network errors — dropped packets, retransmits, checksum errors

Conclusion & Next Steps

You now have a solid grounding in metrics — the quantitative backbone of any monitoring system. The key insights from Part 2:

Four metric types: counters (always increasing), gauges (current state), histograms (distributions, server-side percentiles), summaries (client-side percentiles that cannot be aggregated across instances)
Cardinality matters: Never use high-cardinality values as metric labels — it will crash your monitoring system
Percentiles, not averages: p99 latency is the signal that reflects your worst user experiences; averages hide tail latency
Four Golden Signals: Latency, Traffic, Errors, Saturation — instrument these for every service
RED method: Rate, Errors, Duration — a service-focused framework perfect for microservices monitoring
