What Is a Metric?
A metric is a numeric representation of system state measured over time. Unlike logs (which record discrete events) or traces (which record request journeys), metrics are aggregations — they summarise many events into a single number at a given point in time.
Consider the statement "1,247 HTTP requests per second." This is a metric. It does not tell you what any individual request contained, or which user made it, or what the response was. It tells you something quantitative about the overall system state at that moment.
Anatomy of a Metric
In modern monitoring systems (especially Prometheus-style), a metric has three components:
| Component | Description | Example |
|---|---|---|
| Name | Identifies what is being measured | http_requests_total |
| Labels | Key-value pairs that add dimensions | method="GET", status="200", path="/api/users" |
| Value | The numeric measurement | 12845.0 |
A complete metric data point also includes a timestamp. Together, a stream of (timestamp, value) pairs for a named metric with given labels forms a time series.
# Example Prometheus metric exposition format
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",path="/api/users"} 12845
http_requests_total{method="POST",status="201",path="/api/users"} 384
http_requests_total{method="GET",status="404",path="/api/items"} 27
http_requests_total{method="GET",status="500",path="/api/orders"} 3
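The data model described above (name plus labels identify a series; each series is a stream of timestamped values) can be sketched in a few lines of Python. This is only an illustration and not how Prometheus stores data internally; `store`, `series_key`, and `record` are hypothetical names, and the timestamps are made up.

```python
from collections import defaultdict

# Hypothetical in-memory store: one list of (timestamp, value) samples per series.
store = defaultdict(list)

def series_key(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))

def record(name, labels, timestamp, value):
    """Append one sample to the series identified by name and labels."""
    store[series_key(name, labels)].append((timestamp, value))

# Two samples of the same series, then one sample of a different series.
record("http_requests_total", {"method": "GET", "status": "200"}, 1000, 12845.0)
record("http_requests_total", {"method": "GET", "status": "200"}, 1015, 12901.0)
record("http_requests_total", {"method": "POST", "status": "201"}, 1000, 384.0)

print(len(store))  # number of distinct time series
```

Changing any label value produces a new key, and therefore a new time series, which is what makes label choice so important.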
Labels and the Cardinality Trap
Labels are powerful — they let you slice and dice your metrics. Instead of one flat "request count" number, you have a multi-dimensional view: count by method, by status code, by endpoint, by region, by customer tier.
The trap is that every unique combination of label values creates a separate time series. If you attach a label such as user_id (millions of unique values) or request_id (billions), you will create millions or billions of time series. This is called a cardinality explosion, and it is one of the most common production issues with Prometheus. It can crash your monitoring system. Never use high-cardinality values as metric labels.
Good label candidates (low cardinality):
- method (GET, POST, PUT, DELETE; 4-6 values)
- status_code (200, 201, 400, 404, 500; ~10 values)
- region (us-east-1, eu-west-1; <20 values)
- environment (prod, staging, dev; 3 values)
Bad label candidates (high cardinality):
- user_id, customer_id: millions of values
- request_id, trace_id: unique per request
- url with query strings: unbounded
- error_message: unbounded free text
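A rough way to see the effect of label choice is to multiply the number of distinct values per label, which gives an upper bound on the series count for one metric. The sketch below assumes illustrative value counts in the same range as the lists above; the one-million figure for user_id is an assumption.

```python
from math import prod

def series_count(label_values):
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(label_values.values())

# Assumed value counts per label, roughly matching the low-cardinality examples.
low = {"method": 5, "status_code": 10, "region": 20, "environment": 3}
# Same labels plus a user_id label with an assumed one million distinct users.
high = dict(low, user_id=1_000_000)

print(series_count(low))   # 3000
print(series_count(high))  # 3000000000
```

Not every combination occurs in practice, so the real count is usually lower, but a single unbounded label still multiplies everything else.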
The Four Metric Types
Prometheus defines four core metric types. Understanding them deeply is essential — choosing the wrong type leads to incorrect queries and misleading dashboards.
Counters — Always Going Up
A counter is a metric that only increases. It represents a cumulative count of events. Counters reset to zero only when the process restarts.
Examples:
- Total HTTP requests served since startup
- Total bytes sent or received
- Total errors encountered
- Total database queries executed
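Raw counter values are rarely useful on their own; what you usually query is their rate of change. In PromQL, rate(http_requests_total[5m]) gives you requests per second averaged over the last 5 minutes. In NRQL, use the derivative() or rate() functions.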
# Prometheus: requests per second over 5 minutes
rate(http_requests_total{job="api"}[5m])
# Prometheus: error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
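To make the idea behind rate() concrete, here is a simplified Python sketch that turns raw counter samples into a per-second rate and treats a drop in value as a counter reset. It is not Prometheus's implementation (for example, it does no extrapolation to the window edges); `per_second_rate` and the sample data are hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over the sampled window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (process restart),
    so the post-reset value counts as the increase since the reset.
    """
    if len(samples) < 2:
        return None  # not enough data to compute a rate
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

# Counter scraped every 15s; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 160), (30, 220), (45, 30), (60, 90)]
print(per_second_rate(samples))  # 3.5
```

Without the reset handling, the restart would show up as a large negative jump, which is one reason raw counter values should not be graphed directly.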
Gauges — Snapshots of Current State
A gauge is a metric that can go up or down. It represents the current value of something at a given moment — like a snapshot of system state.
Examples:
- Current memory usage in bytes
- Number of active connections
- Current queue depth
- CPU utilisation percentage
- Current number of running goroutines or threads
# Example: Alert when memory usage exceeds 85%
# In Prometheus alerting rule:
# node_memory_Active_bytes / node_memory_MemTotal_bytes * 100 > 85
# Example Prometheus metric
node_memory_Active_bytes 2147483648
node_memory_MemTotal_bytes 8589934592
# Usage: 2147483648 / 8589934592 * 100 = 25%
Histograms — Distributions and Percentiles
A histogram samples observations (typically request durations or response sizes) and counts them in configurable buckets. It enables calculation of approximate percentiles on the server side.
A Prometheus histogram named http_request_duration_seconds actually exposes three kinds of time series (cumulative buckets, a sum, and a count):
- http_request_duration_seconds_bucket{le="0.1"}: count of requests completing in ≤ 0.1s
- http_request_duration_seconds_bucket{le="0.5"}: count of requests completing in ≤ 0.5s
- http_request_duration_seconds_bucket{le="+Inf"}: total count (same as the _count series below)
- http_request_duration_seconds_sum: sum of all observation values
- http_request_duration_seconds_count: total number of observations
# PromQL: Calculate 95th percentile latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
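# PromQL: Average request duration over the last 5 minutes
# (dividing the raw _sum by the raw _count would give the average since process start)
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])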
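The reason histogram percentiles are approximate is that only bucket counts are stored, so a quantile has to be interpolated inside the bucket it falls into. The following Python sketch shows that general idea using linear interpolation over cumulative buckets. It is a simplification and not the exact histogram_quantile algorithm; `estimate_quantile` and the bucket counts are hypothetical.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets by linear interpolation.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
    ending with (float("inf"), total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into the open-ended bucket;
                # fall back to the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

# Cumulative counts for le="0.1", "0.5", "1.0", "+Inf" (made-up numbers).
buckets = [(0.1, 600), (0.5, 900), (1.0, 980), (float("inf"), 1000)]
print(estimate_quantile(0.95, buckets))  # about 0.81 seconds
```

The estimate can only be as precise as the bucket layout: here all that is really known is that the 95th percentile lies somewhere between 0.5s and 1.0s.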
Summaries — Client-Side Percentiles
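Summaries are similar to histograms but calculate percentiles on the client side (in the instrumented application) rather than the server side. This makes them more accurate for the quantiles configured in advance, because there is no bucket-boundary interpolation (although some client libraries still use a streaming estimate with a small error). The cost is flexibility: you cannot choose new quantiles at query time, and you cannot aggregate percentiles from multiple instances of a service.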
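The aggregation limitation is easy to demonstrate: averaging per-instance percentiles does not give the percentile of the combined traffic. The Python sketch below uses made-up latencies for two hypothetical instances and a nearest-rank percentile helper (one of several common percentile definitions).

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least a fraction q of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) - 1e-9))  # small tolerance for float error
    return ordered[rank - 1]

# Hypothetical latencies (ms): instance A is healthy, instance B has a slow tail.
instance_a = [10] * 100
instance_b = [10] * 90 + [2000] * 10

avg_of_p95 = (percentile(instance_a, 0.95) + percentile(instance_b, 0.95)) / 2
true_p95 = percentile(instance_a + instance_b, 0.95)

print(avg_of_p95)  # 1005.0
print(true_p95)    # 10
```

Histogram buckets do not have this problem, because bucket counts from different instances can simply be summed before the quantile is estimated.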
Percentiles vs Averages — Why the Mean Misleads You
This is one of the most important concepts in performance monitoring. The arithmetic mean of response times almost always tells you an incomplete — often dangerously misleading — story.
Why Averages Lie
The Hidden Tail: When Average = 50ms Means Users Are Suffering
Imagine a service that handles 1,000 requests per minute. In one minute:
- 990 requests complete in 20ms
- 10 requests complete in 3,000ms (3 seconds)
Average response time = (990 × 20 + 10 × 3000) / 1000 = (19,800 + 30,000) / 1000 = 49.8ms
Your average latency dashboard shows ~50ms. Everything looks fine. But 10 users per minute (1% of traffic) are waiting 3 full seconds for a response. If this is a checkout flow, that is 10 frustrated, potentially abandoning customers every minute.
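Because exactly 1% of requests are slow, the p99 sits right on the boundary between the two groups, and every percentile above it (p99.5, p99.9) is 3,000ms. Tail percentiles like these are the signal your SLO should be tracking, not the mean.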
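The arithmetic for this scenario can be checked with a short Python sketch. It uses the nearest-rank percentile definition, which is an assumption (monitoring tools differ in how they interpolate). Under that definition, having exactly 1% slow requests puts p99 right on the boundary between the fast and slow groups, so the sketch also reports p99.9, which lands clearly in the tail.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least a fraction q of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) - 1e-9))  # small tolerance for float error
    return ordered[rank - 1]

# One minute of traffic from the example: 990 fast requests, 10 slow ones (ms).
latencies = [20] * 990 + [3000] * 10

mean = sum(latencies) / len(latencies)
print(mean)                          # 49.8
print(percentile(latencies, 0.50))   # 20
print(percentile(latencies, 0.99))   # 20 (the 990th of 1,000 values)
print(percentile(latencies, 0.999))  # 3000
```

The mean lands at a value that no request actually experienced, while the high percentiles expose the 3-second tail directly.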
p50, p95, p99, p99.9 Explained
Percentiles (or quantiles) answer: "Within what response time do X% of requests complete?"
| Percentile | Meaning | Typical Use |
|---|---|---|
| p50 (median) | Half of requests are faster than this | Typical user experience baseline |
| p95 | 95% of requests are faster than this; 5% are slower | Common SLO target for non-critical APIs |
| p99 | 99% of requests are faster than this; 1% are slower | Common SLO target for user-facing APIs |
| p99.9 | 99.9% of requests are faster; 0.1% are slower | SLO target for critical payment/auth flows |
The Four Golden Signals
In the Google SRE Book (one of the foundational texts of reliability engineering), the team describes the "Four Golden Signals" — the minimum viable set of metrics that give you meaningful visibility into any service's health. If you can only instrument four things, instrument these.
flowchart LR
A[Service Health] --> B[Latency\nHow long?]
A --> C[Traffic\nHow much?]
A --> D[Errors\nHow many failing?]
A --> E[Saturation\nHow full?]
style B fill:#3B9797,color:#fff
style C fill:#16476A,color:#fff
style D fill:#BF092F,color:#fff
style E fill:#132440,color:#fff
Signal 1: Latency
Latency measures how long it takes to serve a request. It directly correlates with user experience — slow responses frustrate users, cause SLO violations, and can cascade into broader system failures.
# PromQL: p99 latency for successful requests only
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le)
)
# PromQL: p99 latency for failed (5xx) requests, to compare against the successful-request p99 above
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le)
)
Signal 2: Traffic
Traffic measures how much demand is being placed on the system. For web services this is usually requests per second; for streaming services it might be bytes per second; for databases it could be queries per second or transactions per second.
Traffic metrics are essential for:
- Capacity planning — understanding growth trends and peak loads
- Anomaly detection — sudden traffic drops can indicate upstream failures; sudden spikes can indicate attacks or viral events
- Contextualising other signals — a 5% error rate at 10 RPS is very different from 5% at 10,000 RPS
# PromQL: Requests per second over 5-minute window
sum(rate(http_requests_total[5m]))
# PromQL: Traffic by endpoint and method
sum(rate(http_requests_total[5m])) by (method, path)
Signal 3: Errors
Errors measure the rate at which requests are failing. This includes explicit failures (HTTP 5xx responses, exceptions) and implicit failures (HTTP 200 responses that return wrong data, requests that complete but exceed latency SLOs).
# PromQL: Error rate as percentage of total traffic
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Prometheus alerting rule: fire when the error rate exceeds 1% for 5 minutes
# expr: 100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 1
# for: 5m
Signal 4: Saturation
Saturation measures how "full" the service is — how close to its resource limits it is operating. A service at 100% CPU saturation cannot handle more traffic. A connection pool at capacity will start queueing or rejecting requests. A disk at 100% utilisation will cause writes to fail.
Saturation metrics are leading indicators — a saturating resource predicts future failures before they occur. This makes saturation the most proactive of the four signals.
| Resource | Saturation Metric | Warning Threshold |
|---|---|---|
| CPU | CPU utilisation % | > 80% sustained |
| Memory | Memory utilisation % or swap usage | > 85% or any swap |
| Disk | Disk space utilisation %, IOPS utilisation | > 80% space, > 70% IOPS |
| Network | Bandwidth utilisation %, packet loss | > 60% bandwidth, any packet loss |
| DB Connections | Connection pool utilisation % | > 75% |
| Thread Pool | Active threads / max threads | > 80% |
# PromQL: CPU saturation
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# PromQL: Memory saturation
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# PromQL: Disk space saturation
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
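As a rough illustration of how the warning thresholds in the table might be applied, the Python sketch below computes utilisation the same way as the PromQL expressions (100 * (1 - available / total)) and flags resources above their threshold. The resource names, readings, and helper functions are hypothetical; in practice this logic would normally live in alerting rules rather than application code.

```python
# Warning thresholds (percent) taken from the saturation table.
WARNING_THRESHOLDS = {"cpu": 80, "memory": 85, "disk_space": 80, "db_connections": 75}

def utilisation_percent(available, total):
    """Same shape as the PromQL form: 100 * (1 - available / total)."""
    return 100 * (1 - available / total)

def saturated(resource_percentages):
    """Return the names of resources whose utilisation exceeds its warning threshold."""
    return sorted(
        name for name, pct in resource_percentages.items()
        if pct > WARNING_THRESHOLDS[name]
    )

# Made-up readings: 1 GiB of 8 GiB memory available, 250 of 1000 disk units free.
readings = {
    "cpu": 91.0,
    "memory": utilisation_percent(available=1 * 2**30, total=8 * 2**30),  # 87.5
    "disk_space": utilisation_percent(available=250, total=1000),          # 75.0
}
print(saturated(readings))  # ['cpu', 'memory']
```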
The RED Method
Coined by Tom Wilkie (while at Weaveworks; he later joined Grafana Labs), the RED method is a simplified framework for monitoring microservices that focuses on the user-observable behaviour of each service. RED stands for:
- Rate — how many requests per second is the service handling?
- Errors — how many of those requests are failing?
- Duration — how long does each request take?
RED is essentially a service-level subset of the Four Golden Signals (omitting Saturation, which is more infrastructure-focused). It maps perfectly to how users experience a service: they care about throughput (Rate), reliability (Errors), and speed (Duration).
Applying RED to a Real Service
Take any API you operate (or think about a hypothetical checkout service) and answer:
- Rate: How many requests per second does it handle at peak? At off-peak? What does a 10x spike look like?
- Errors: What HTTP status codes does it return? What percentage are 4xx (client errors)? 5xx (server errors)? What non-HTTP errors can occur (timeouts, circuit breaks, partial failures)?
- Duration: What is the p50, p95, p99 latency? What is your SLO target? Are any endpoints systematically slower?
Just answering these questions for every service you operate puts you ahead of most production teams. Instrument the answers, and you have a solid operational baseline.
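One way to start answering these questions is to compute the three RED numbers from a window of request records. The Python sketch below is a minimal illustration under stated assumptions: the `Request` type, the `red_summary` function, and the sample data are all hypothetical, only 5xx responses are counted as errors, and percentiles use the nearest-rank definition.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request

def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration for one observation window."""
    if not requests:
        return None  # nothing observed in this window
    durations = sorted(r.duration_ms for r in requests)

    def pct(q):
        # Nearest-rank percentile with a small tolerance for float error.
        rank = max(1, math.ceil(q * len(durations) - 1e-9))
        return durations[rank - 1]

    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_pct": 100 * errors / len(requests),
        "p50_ms": pct(0.50),
        "p95_ms": pct(0.95),
        "p99_ms": pct(0.99),
    }

# A made-up 10-second window: 95 fast successes and 5 slow server errors.
window = [Request(200, 20.0)] * 95 + [Request(500, 900.0)] * 5
print(red_summary(window, window_seconds=10))
```

In a real service these values would come from counters and histograms rather than raw request lists, but the three questions being answered are the same.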
Infrastructure Monitoring
Beyond application metrics, you need visibility into the infrastructure your services run on. Compute, storage, and network metrics form the foundation of your operational picture.
Compute Metrics
Key metrics to monitor for every server or container host:
- CPU utilisation — total % across all cores; distinguish user vs system vs iowait
- CPU load average — 1/5/15-minute averages; load > number of cores indicates saturation
- Memory utilisation — used, cached, available; watch for OOM (out-of-memory) pressure
- Process/thread count — can indicate resource leaks
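The load-average rule of thumb above can be expressed as load divided by core count. The Python sketch below is a minimal illustration using the standard library; `load_per_core` is a hypothetical helper, and `os.getloadavg()` is only available on Unix-like systems, so the live reading is guarded.

```python
import os

def load_per_core(load_average, cores):
    """Normalise a load average by core count; values above 1.0 suggest saturation."""
    return load_average / cores

# os.getloadavg() exists only on Unix-like systems and may raise OSError.
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"1-minute load per core: {load_per_core(one_min, cores):.2f}")
    except OSError:
        print("load average not available on this system")
```

For example, a 1-minute load of 8.0 on a 4-core host gives 2.0 per core, meaning roughly twice as much runnable work as the CPUs can serve at once.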
Storage Metrics
- Disk space — utilisation % per filesystem; alert at 80%, page at 90%
- Disk IOPS — reads and writes per second; compare against disk's rated IOPS capacity
- Disk throughput — bytes read/written per second
- Disk I/O queue depth — requests waiting for the disk; high queue depth indicates I/O saturation
- Disk latency — average I/O completion time; HDD: <10ms, SSD: <1ms, NVMe: <0.1ms
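The "alert at 80%, page at 90%" rule for disk space can be sketched as a small classifier. This Python example uses `shutil.disk_usage` from the standard library to read the root filesystem; `disk_alert_level` and its threshold parameters are hypothetical names chosen for illustration.

```python
import shutil

def disk_alert_level(used_percent, alert_at=80.0, page_at=90.0):
    """Classify disk usage using the 80% alert / 90% page thresholds."""
    if used_percent >= page_at:
        return "page"
    if used_percent >= alert_at:
        return "alert"
    return "ok"

usage = shutil.disk_usage("/")  # named tuple: total, used, free (in bytes)
used_percent = 100 * usage.used / usage.total
print(f"{used_percent:.1f}% used -> {disk_alert_level(used_percent)}")
```

Two tiers keep the response proportionate: the lower threshold leaves time to clean up or expand storage, and the higher one signals that writes may soon fail.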
Network Metrics
- Packet loss — any packet loss indicates network problems; even 0.1% loss can devastate TCP performance
- Bandwidth utilisation — inbound and outbound bytes per second vs link capacity
- Round-trip time (RTT) — latency between nodes; spikes indicate congestion or routing problems
- Connection counts — established, TIME_WAIT, CLOSE_WAIT; high TIME_WAIT count can indicate connection handling issues
- Network errors — dropped packets, retransmits, checksum errors
Conclusion & Next Steps
You now have a solid grounding in metrics — the quantitative backbone of any monitoring system. The key insights from Part 2:
- Four metric types: counters (always increasing), gauges (current state), histograms (distributions, server-side percentiles), summaries (client-side percentiles, not aggregatable across instances)
- Cardinality matters: Never use high-cardinality values as metric labels — it will crash your monitoring system
- Percentiles, not averages: p99 latency is the signal that reflects your worst user experiences; averages hide tail latency
- Four Golden Signals: Latency, Traffic, Errors, Saturation — instrument these for every service
- RED method: Rate, Errors, Duration — a service-focused framework perfect for microservices monitoring