Introducing PromQL
PromQL (Prometheus Query Language) is the standard query language for time-series metrics in the Prometheus ecosystem. It powers dashboards in Grafana, alerting rules (evaluated by Prometheus or the Mimir ruler, with notifications routed through Alertmanager), and recording rules in both Prometheus and Grafana Mimir. Whether you're querying a local Prometheus instance or a multi-tenant Mimir cluster storing billions of active series, the query language remains identical.
PromQL was designed with a specific philosophy: metrics are multi-dimensional (identified by label sets), append-only (values accumulate over time), and queries should express relationships between time series naturally. Unlike SQL, which operates on tables, PromQL operates on vectors: collections of time series samples at specific points in time.
Feature Overview
PromQL provides four core capabilities that combine to express virtually any metrics query:
- Selectors — Choose which time series to operate on using metric names and label matchers
- Operators — Perform arithmetic, comparison, and logical operations between time series
- Functions — Transform, aggregate, or compute derived values from time series data
- Aggregations — Combine multiple time series into fewer series by grouping on label dimensions
Selectors & Matchers
A selector identifies which time series to query. Almost every PromQL expression is built around at least one selector:
# Simple metric name selector — selects all series with this name
http_requests_total
# Label matcher — equality
http_requests_total{job="api-server"}
# Label matcher — not equal
http_requests_total{job!="internal-scraper"}
# Label matcher — regex match
http_requests_total{method=~"GET|POST"}
# Label matcher — regex not-match
http_requests_total{status_code!~"2.."}
# Multiple matchers (AND logic)
http_requests_total{job="api-server", method="POST", status_code=~"5.."}
# Metric name is syntactic sugar for __name__ label
{__name__=~"http_requests_total|http_request_duration_seconds_.*"}
There are four matcher types: = (exact equality), != (not equal), =~ (regex match), and !~ (regex not-match). At least one matcher must not match the empty string — you cannot select all metrics without some constraint.
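The four matcher types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `matches` helper and its tuple format are hypothetical, and Python's `re` module stands in for the RE2 syntax Prometheus actually uses. It does show two details that often surprise people: regex matchers are fully anchored, and a missing label behaves like an empty string.

```python
import re

def matches(labels: dict, matchers: list) -> bool:
    """Return True if a series' label set satisfies every matcher (AND logic)."""
    for name, op, pattern in matchers:
        value = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = value == pattern
        elif op == "!=":
            ok = value != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, value) is not None  # fully anchored match
        elif op == "!~":
            ok = re.fullmatch(pattern, value) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True

series = {"__name__": "http_requests_total", "job": "api-server", "status_code": "503"}

print(matches(series, [("job", "=", "api-server"), ("status_code", "=~", "5..")]))  # True
print(matches(series, [("status_code", "=~", "5")]))  # False: "5" does not match all of "503"
```

Because of the anchoring, `status_code=~"5"` selects nothing for a `503` series; the pattern has to cover the whole value, as `5..` does.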
Operators
PromQL supports arithmetic, comparison, and logical operators between scalars, vectors, or combinations thereof:
# Arithmetic operators: + - * / % ^
# Error rate as a percentage (take rates first, then aggregate both sides so the vectors match)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Comparison operators: == != > < >= <=
# Filter: only series where value exceeds threshold
http_request_duration_seconds{quantile="0.99"} > 0.5
# bool modifier: return 0/1 instead of filtering
http_request_duration_seconds{quantile="0.99"} > bool 0.5
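# Vector matching with 'on' and 'ignoring'
# Per-method error ratio: aggregate both sides to one series per method, then match on 'method'
sum by (method) (rate(http_requests_total{status_code=~"5.."}[5m]))
/ on(method)
sum by (method) (rate(http_requests_total[5m]))
# group_left / group_right for many-to-one matching
# Many (job, method) error series on the left match a single per-job total on the right
sum by (job, method) (rate(http_requests_total{status_code=~"5.."}[5m]))
/ on(job) group_left
sum by (job) (rate(http_requests_total[5m]))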
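Use on(label1, label2) to specify which labels to match on, or ignoring(label1) to exclude labels from matching. The group_left and group_right modifiers enable many-to-one and one-to-many joins; without them, each side may contribute at most one series per matching label set.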
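To make the matching rule concrete, here is a minimal Python sketch of one-to-one division with on(). The `divide_on` function and its list-of-pairs vector representation are illustrative assumptions, not how the Prometheus engine is implemented, and error handling is simplified (PromQL would return Inf or NaN on division by zero, where Python raises).

```python
def divide_on(lhs, rhs, on):
    """One-to-one division of two instant vectors, matched on the `on` labels.

    Each vector is a list of (labels_dict, value) pairs.
    """
    def signature(labels):
        return tuple(labels.get(name, "") for name in on)

    rhs_by_sig = {}
    for labels, value in rhs:
        sig = signature(labels)
        if sig in rhs_by_sig:
            raise ValueError(f"many right-hand series share {sig}; aggregate first")
        rhs_by_sig[sig] = value

    result, seen = [], set()
    for labels, value in lhs:
        sig = signature(labels)
        if sig in seen:
            raise ValueError(f"many left-hand series share {sig}; use group_left")
        seen.add(sig)
        if sig in rhs_by_sig:  # series without a partner are silently dropped
            result.append((dict(zip(on, sig)), value / rhs_by_sig[sig]))
    return result

errors = [({"method": "GET", "status_code": "500"}, 2.0),
          ({"method": "POST", "status_code": "503"}, 1.0)]
totals = [({"method": "GET"}, 100.0), ({"method": "POST"}, 20.0), ({"method": "PUT"}, 5.0)]

ratios = divide_on(errors, totals, on=("method",))
print(ratios)
```

The PUT series has no partner on the left and disappears from the result, which mirrors how unmatched series drop out of a PromQL binary operation.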
Functions
PromQL provides dozens of built-in functions. The most commonly used categories:
# Rate functions (for counters)
rate(http_requests_total[5m]) # per-second rate over 5m
irate(http_requests_total[5m]) # instant rate (last two samples)
increase(http_requests_total[1h]) # total increase over 1h
# Aggregation over time (for gauges)
avg_over_time(node_load1[5m])
max_over_time(node_memory_MemFree_bytes[1h])
min_over_time(node_filesystem_avail_bytes[6h])
# Mathematical functions
abs(deriv(temperature_celsius[5m])) # deriv(), not rate(): temperature is a gauge
ceil(http_request_duration_seconds)
floor(available_disk_gb)
round(cpu_usage_percent, 0.01) # round to 2 decimal places
clamp(cpu_usage, 0, 100) # clamp between bounds
clamp_min(free_memory_bytes, 0) # floor at 0
ln(http_requests_total) # natural logarithm
# Label manipulation
label_replace(up, "short_instance", "$1", "instance", "(.*):.*")
label_join(up, "full_path", "/", "namespace", "pod")
# Time functions
time() # current Unix timestamp
timestamp(up) # timestamp of each sample
day_of_week() # 0=Sunday through 6=Saturday
hour() # hour of the day (0-23)
# Prediction and trending
predict_linear(node_disk_free_bytes[6h], 3600*24) # predict value in 24h
deriv(node_memory_MemAvailable_bytes[15m]) # per-second derivative (gauges only)
# Histogram functions
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Aggregations
Aggregation operators combine multiple time series by label dimensions:
# Basic aggregations
sum(rate(http_requests_total[5m])) # total across all series
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) # average idle fraction across CPUs and nodes
count(up == 1) # number of healthy targets
min(node_filesystem_avail_bytes) # minimum free space
max(container_memory_usage_bytes) # peak memory usage
stddev(http_request_duration_seconds) # standard deviation
# Group by specific labels with 'by'
sum by (job, method) (rate(http_requests_total[5m]))
# Group by excluding labels with 'without'
sum without (instance, pod) (rate(http_requests_total[5m]))
# TopK / BottomK
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
bottomk(3, node_filesystem_avail_bytes)
# count_values: count how many series share each distinct sample value
count_values("version", build_version)
# quantile — calculate quantile across series (not over time!)
quantile(0.95, rate(http_requests_total[5m]))
Writing PromQL
With the building blocks established, let's explore how to write production-grade PromQL queries for real-world monitoring scenarios.
Instant Vectors
An instant vector is a set of time series where each series has exactly one sample at a given evaluation timestamp. This is what you get from a basic selector or after applying a function:
# Instant vector — one value per series at current time
up{job="api-server"}
# Result:
# up{job="api-server", instance="10.0.0.1:8080"} => 1
# up{job="api-server", instance="10.0.0.2:8080"} => 1
# up{job="api-server", instance="10.0.0.3:8080"} => 0
# After applying rate() — still an instant vector (one value per series)
rate(http_requests_total{job="api-server"}[5m])
# Result:
# {job="api-server", method="GET", status="200"} => 142.5
# {job="api-server", method="POST", status="201"} => 23.8
# {job="api-server", method="GET", status="404"} => 0.3
Instant vectors are what Grafana plots on dashboards (one data point per evaluation interval) and what alert rules evaluate against thresholds.
Range Vectors
A range vector is a set of time series where each series contains multiple samples over a specified time window. Range vectors cannot be graphed directly — they must be passed to a function that reduces them to instant vectors:
# Range vector — raw samples over last 5 minutes
http_requests_total{job="api-server"}[5m]
# Result (cannot graph this directly!):
# {method="GET"} => [(t1, 1000), (t2, 1015), (t3, 1030), ...]
# Range vector durations: [30s] [1m] [5m] [15m] [1h] [6h] [1d] [7d]
# Offset modifier — look back in time
rate(http_requests_total[5m] offset 1h) # rate 1 hour ago
rate(http_requests_total[5m] offset 7d) # rate 1 week ago
# @ modifier — evaluate at a specific timestamp
rate(http_requests_total[5m] @ 1718447400) # rate at specific Unix time
rate(http_requests_total[5m] @ start()) # rate at query start time
rate(http_requests_total[5m] @ end()) # rate at query end time
Use [1m] for real-time alerting (noisy but fast), [5m] for dashboards (balanced), and [15m] or longer for capacity planning (smooth trends). The range should be at least 4× your scrape interval so that rate() still has enough samples when a scrape or two is missed.
rate(), irate() & increase()
These are the three essential functions for working with counters (monotonically increasing metrics that reset on restart):
# rate() — average per-second increase over the range window
# Best for: dashboards, alerting, most use cases
rate(http_requests_total{job="api-server"}[5m])
# irate() — instant rate using only the last two data points
# Best for: volatile, spiky metrics where you want to see peaks
# Caution: very sensitive to scrape timing, poor for alerting
irate(http_requests_total{job="api-server"}[5m])
# increase() — total increase over the range window
# Equivalent to rate() * seconds_in_range
# Best for: "how many requests in the last hour?"
increase(http_requests_total{job="api-server"}[1h])
rate() vs irate() — When to Use Each
| Aspect | rate() | irate() |
|---|---|---|
| Calculation | Average rate over full range | Rate between last 2 samples only |
| Sensitivity | Smooth, dampens spikes | Volatile, shows every spike |
| Alerting | Excellent (stable) | Poor (flapping) |
| Counter Reset | Handles gracefully | May produce artifacts |
| Dashboard Use | Default choice | Only for "zoom into spikes" |
| Recording Rules | Always use rate() | Never record irate() |
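The difference between the three functions is easier to see on concrete samples. The sketch below is a simplification under stated assumptions: it handles counter resets the way the text describes, but it divides by the time between the first and last sample and skips the extrapolation to the window edges that Prometheus performs. The function names mirror PromQL only for readability.

```python
def increase(samples):
    """Total counter increase across (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset; everything in the new value is fresh increase.
        total += cur - prev if cur >= prev else cur
    return total

def rate(samples):
    """Average per-second increase over the sampled span (no extrapolation)."""
    span = samples[-1][0] - samples[0][0]
    return increase(samples) / span

def irate(samples):
    """Per-second increase between the last two samples only."""
    return rate(samples[-2:])

# Scraped every 15s; the process restarts between t=30 and t=45.
samples = [(0, 1000.0), (15, 1015.0), (30, 1030.0), (45, 10.0), (60, 40.0)]

print(increase(samples))  # 70.0
print(rate(samples))      # about 1.17 requests per second
print(irate(samples))     # 2.0
```

The averaged rate smooths over the burst at the end of the window, while the instant rate reports only the final 15 seconds, which is why the table recommends rate() for alerting.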
histogram_quantile()
The histogram_quantile() function estimates quantiles from Prometheus histogram buckets, interpolating linearly within the bucket that contains the target rank. This is the primary way to compute latency percentiles (p50, p95, p99):
# P99 latency across all instances
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
)
# P95 latency grouped by endpoint
histogram_quantile(0.95,
sum by (le, handler) (
rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
)
)
# P50 (median) response time per method
histogram_quantile(0.50,
sum by (le, method) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# Multiple quantiles in one dashboard (use multiple panels or recording rules)
# Panel 1: histogram_quantile(0.50, sum by (le) (rate(...[5m])))
# Panel 2: histogram_quantile(0.90, sum by (le) (rate(...[5m])))
# Panel 3: histogram_quantile(0.99, sum by (le) (rate(...[5m])))
The le (less-than-or-equal) label MUST be preserved in the by clause of any aggregation inside histogram_quantile(). If you sum by (method) without including le, the bucket structure is destroyed and the result is meaningless.
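A rough Python sketch of the estimate shows why the bucket boundaries (the le values) matter. It assumes cumulative bucket counts, a lower bound of 0 for the first bucket, and 0 < q <= 1; the bucket numbers are made up for illustration, and the real function covers more edge cases (native histograms, negative buckets, NaN handling).

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket: report that bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan

buckets = [(0.01, 500), (0.05, 900), (0.1, 980), (0.5, 1000), (1.0, 1010), (math.inf, 1027)]

print(histogram_quantile(0.50, buckets))  # close to 0.01135
print(histogram_quantile(0.95, buckets))  # close to 0.0973
print(histogram_quantile(0.99, buckets))  # 1.0
```

In this data the p99 rank falls into the +Inf bucket, so the estimate is capped at the highest finite boundary (1.0). When that happens in practice, it usually means the bucket layout needs larger upper bounds.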
Recording Rules
Recording rules pre-compute expensive PromQL expressions and store the result as new time series. They reduce query latency for dashboards and enable multi-level aggregations that would be too slow to compute at query time:
# prometheus-rules.yaml or mimir-rules.yaml
groups:
  - name: api_server_recording_rules
    interval: 30s
    rules:
      # Pre-compute request rate by job and method
      - record: job_method:http_requests_total:rate5m
        expr: sum by (job, method) (rate(http_requests_total[5m]))
      # Pre-compute error ratio
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Pre-compute p99 latency by handler
      - record: handler:http_request_duration_seconds:p99_rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, handler) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      # Pre-compute cluster-wide CPU usage
      - record: cluster:node_cpu:ratio_rate5m
        expr: |
          1 - avg by (cluster) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )
Recording rule names follow the convention level:metric:operations. The level represents the aggregation labels (e.g., job, handler, cluster), the metric is the source metric name, and operations describe what was applied (e.g., rate5m, p99_rate5m, ratio_rate5m).
RED & USE Methods in PromQL
The RED method (Rate, Errors, Duration) monitors request-driven services. The USE method (Utilization, Saturation, Errors) monitors infrastructure resources. Here's how to implement both in PromQL:
# ═══════════════════════════════════════════
# RED METHOD — for services (API, web, microservices)
# ═══════════════════════════════════════════
# R — Rate: requests per second
sum by (service) (rate(http_requests_total[5m]))
# E — Errors: error rate as percentage
sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
* 100
# D — Duration: request latency (p99)
histogram_quantile(0.99,
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)
# ═══════════════════════════════════════════
# USE METHOD — for resources (CPU, memory, disk, network)
# ═══════════════════════════════════════════
# U — Utilization: fraction of resource being used
# CPU utilization (percentage busy)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk utilization
1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)
# S — Saturation: how overloaded the resource is
# CPU saturation (load average / CPU count)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
# Memory saturation (swap usage or OOM kills)
rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])
# Disk saturation (I/O queue depth)
rate(node_disk_io_time_weighted_seconds_total[5m])
# E — Errors: resource error events
# Disk errors
rate(node_disk_io_errs_total[5m])
# Network errors
rate(node_network_receive_errs_total[5m])
+ rate(node_network_transmit_errs_total[5m])
Exploring Data Collection & Metric Protocols
Metrics reach Prometheus and Mimir through various protocols and formats. Understanding each protocol's strengths helps you choose the right collection strategy for different infrastructure components.
StatsD & DogStatsD
StatsD is a lightweight UDP-based protocol originally created by Etsy for application-level metrics. Applications emit metrics as simple text messages over UDP, which a StatsD server aggregates before forwarding to a backend. DogStatsD is Datadog's extension adding tags (labels), histograms, service checks, and events.
# StatsD wire format: metric_name:value|type[|@sample_rate]
# Types: c=counter, g=gauge, ms=timer, h=histogram, s=set
# Counter — increment page views
page.views:1|c
# Gauge — current queue depth
queue.depth:42|g
# Timer — request duration in milliseconds
api.request.duration:320|ms
# Counter with sample rate (only 10% of calls actually send)
api.request.count:1|c|@0.1
# DogStatsD extension — adds tags (key:value pairs)
api.request.duration:320|ms|#method:GET,endpoint:/users,status:200
# DogStatsD histogram
api.response.size:2048|h|#service:checkout
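The wire format is simple enough to parse by hand. The following sketch (a hypothetical `parse_statsd` helper, not part of any StatsD library) splits one line into name, value, type, sample rate, and DogStatsD tags. It assumes well-formed single-metric lines and ignores gauge deltas such as `+5`.

```python
def parse_statsd(line: str) -> dict:
    """Parse one StatsD/DogStatsD line into its parts."""
    name_value, *sections = line.strip().split("|")
    name, _, raw_value = name_value.partition(":")
    if not name or not raw_value or not sections:
        raise ValueError(f"malformed StatsD line: {line!r}")
    metric = {
        "name": name,
        "value": float(raw_value),
        "type": sections[0],      # c, g, ms, h, or s
        "sample_rate": 1.0,
        "tags": {},
    }
    for section in sections[1:]:
        if section.startswith("@"):      # sample rate, e.g. @0.1
            metric["sample_rate"] = float(section[1:])
        elif section.startswith("#"):    # DogStatsD tags, e.g. #method:GET,status:200
            for tag in section[1:].split(","):
                key, _, val = tag.partition(":")
                metric["tags"][key] = val
    return metric

timer = parse_statsd("api.request.duration:320|ms|#method:GET,endpoint:/users,status:200")
sampled = parse_statsd("api.request.count:1|c|@0.1")

print(timer["tags"])                             # {'method': 'GET', 'endpoint': '/users', 'status': '200'}
print(sampled["value"] / sampled["sample_rate"])  # 10.0
```

The last line shows how a server compensates for sampling: a counter sent at a 10% sample rate is scaled up by dividing by the rate.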
To integrate StatsD metrics into Prometheus/Mimir, use the StatsD Exporter or Grafana Alloy:
// Grafana Alloy config: StatsD receiver
prometheus.exporter.statsd "default" {
  listen_udp = "0.0.0.0:9125"
  // Mapping rules live in a separate statsd_exporter-style YAML file (example path)
  mapping_config_path = "/etc/alloy/statsd-mapping.yaml"
}

# statsd-mapping.yaml: convert StatsD metric names to Prometheus format
mappings:
  - match: "api.request.duration.*"
    name: "api_request_duration_seconds"  # StatsD timers arrive in ms and are exported in seconds
    observer_type: histogram
    histogram_options:
      buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
    labels:
      method: "$1"
  - match: "page.views"
    name: "page_views_total"
    labels:
      type: "page_view"
OTLP (OpenTelemetry Protocol)
OTLP is the native protocol of OpenTelemetry, the CNCF standard for telemetry data. It transports metrics, logs, and traces in a single protocol, encoded as Protocol Buffers over gRPC or HTTP (HTTP also supports a JSON encoding). OTLP is the recommended approach for new instrumentation.
# OpenTelemetry Collector config: receive OTLP and export to Mimir
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  resource:
    attributes:
      - key: cluster
        value: "production-us-east-1"
        action: upsert
exporters:
  prometheusremotewrite:
    endpoint: "https://mimir.example.com/api/v1/push"
    headers:
      X-Scope-OrgID: "tenant-1"
    resource_to_telemetry_conversion:
      enabled: true
    tls:
      insecure: false
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheusremotewrite]
OTLP sums and histograms carry one of two aggregation temporalities: cumulative (a running total, like Prometheus counters) or delta (the difference since the last report). Gauges are point-in-time values and have no temporality. Mimir accepts OTLP natively via the /otlp/v1/metrics endpoint since Mimir 2.11.
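The relationship between cumulative and delta reporting is just a running sum and its inverse. The sketch below uses hypothetical helper names and ignores counter resets and missing points, which a real converter has to handle.

```python
from itertools import accumulate

def delta_to_cumulative(deltas):
    """Turn per-interval deltas into a running total (Prometheus-style counter values)."""
    return list(accumulate(deltas))

def cumulative_to_delta(values):
    """Turn a running total back into per-interval differences."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]

deltas = [5, 3, 0, 7]
cumulative = delta_to_cumulative(deltas)

print(cumulative)                       # [5, 8, 8, 15]
print(cumulative_to_delta(cumulative))  # [5, 3, 0, 7]
```

A cumulative stream survives a dropped report (the next point still carries the full total), whereas a lost delta point is lost for good. That trade-off is one reason Prometheus-style backends work with cumulative counters.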
Prometheus Exposition Format & Remote Write
The Prometheus exposition format is the text-based format that applications expose at their /metrics endpoint for scraping. Remote write is the protocol for pushing scraped data to remote storage backends like Mimir, Thanos, or Cortex.
# Prometheus text exposition format (at /metrics endpoint)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="201"} 342
http_requests_total{method="GET",status="404"} 17
# HELP http_request_duration_seconds Request latency histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 500
http_request_duration_seconds_bucket{le="0.05"} 900
http_request_duration_seconds_bucket{le="0.1"} 980
http_request_duration_seconds_bucket{le="0.5"} 1000
http_request_duration_seconds_bucket{le="1.0"} 1010
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 135.7
http_request_duration_seconds_count 1027
# HELP node_cpu_temperature_celsius Current CPU temperature
# TYPE node_cpu_temperature_celsius gauge
node_cpu_temperature_celsius{core="0"} 62.5
node_cpu_temperature_celsius{core="1"} 64.2
# Prometheus remote_write configuration: push to Mimir
# prometheus.yml
remote_write:
  - url: "https://mimir.example.com/api/v1/push"
    headers:
      X-Scope-OrgID: "tenant-1"
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      max_shards: 200
      min_backoff: 100ms
      max_backoff: 5s
    write_relabel_configs:
      # Drop high-cardinality metrics before sending
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
SNMP
SNMP (Simple Network Management Protocol) is the standard for monitoring network devices (routers, switches, firewalls, printers). The SNMP Exporter translates SNMP OID (Object Identifier) data into Prometheus metrics:
# generator.yml: input for the SNMP Exporter generator, which produces snmp.yml
auths:
  public_v2:
    version: 2
    community: public
modules:
  if_mib:
    walk:
      - 1.3.6.1.2.1.2.2     # ifTable: interface statistics
      - 1.3.6.1.2.1.31.1.1  # ifXTable: extended interface stats
    lookups:
      - source_indexes: [ifIndex]
        lookup: ifAlias     # use interface alias as label
      - source_indexes: [ifIndex]
        lookup: ifDescr     # use interface description as label
    overrides:
      ifSpeed:
        type: gauge
      ifHighSpeed:
        type: gauge
  cisco_device:
    walk:
      - 1.3.6.1.4.1.9.9.109 # Cisco CPU utilization
      - 1.3.6.1.4.1.9.9.48  # Cisco memory pool
# Prometheus scrape config for SNMP targets
scrape_configs:
  - job_name: 'snmp_network'
    scrape_interval: 60s
    scrape_timeout: 30s
    static_configs:
      - targets:
          - 10.0.0.1   # core-switch-01
          - 10.0.0.2   # core-switch-02
          - 10.0.0.10  # firewall-01
    metrics_path: /snmp
    params:
      module: [if_mib]
      auth: [public_v2]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116
Understanding Data Storage Architectures
Time-series storage engines have evolved significantly over the past decade. Understanding the architectural differences between Graphite, Prometheus, and Mimir helps you choose the right solution and optimize performance.
Graphite Architecture
Graphite (2006) was one of the first modern time-series databases. Its architecture consists of three components:
- Carbon: A daemon that receives metrics over TCP/UDP (line protocol: metric.path value timestamp) and writes them to Whisper files
- Whisper: A fixed-size, round-robin database file format. Each metric gets its own file on disk with pre-allocated space for multiple retention levels
- Graphite-Web: A Django web application providing the query API and basic dashboard rendering
Graphite's dot-delimited naming (servers.web01.cpu.user) predates the label-based model. While simple, this creates challenges with high-dimensionality queries and makes ad-hoc grouping difficult. Each unique metric path creates a separate Whisper file, leading to massive I/O amplification on disk-based systems.
Prometheus Architecture
Prometheus (2012, open-sourced 2015) introduced the label-based data model and a purpose-built time-series database (TSDB) designed for high ingestion rates and efficient queries:
flowchart TB
    subgraph Ingestion["Ingestion Path"]
        Scrape["Scrape Manager<br/>pulls /metrics"]
        WAL["Write-Ahead Log<br/>append-only, crash-safe"]
        Head["Head Block<br/>in-memory, last 2h"]
    end
    subgraph Compaction["Compaction & Storage"]
        Persist["Persist to Disk<br/>(every 2h)"]
        Block1["Block (2h)<br/>index + chunks"]
        Block2["Block (2h)<br/>index + chunks"]
        Compact["Compactor<br/>merge small blocks"]
        BigBlock["Compacted Block<br/>(longer range)"]
    end
    subgraph Query["Query Path"]
        QEngine["Query Engine<br/>evaluates PromQL"]
        Merge["Merge Results<br/>head + blocks"]
    end
    Scrape --> WAL
    WAL --> Head
    Head --> Persist
    Persist --> Block1
    Persist --> Block2
    Block1 --> Compact
    Block2 --> Compact
    Compact --> BigBlock
    QEngine --> Head
    QEngine --> Block1
    QEngine --> BigBlock
    Head --> Merge
    Block1 --> Merge
    BigBlock --> Merge
    Merge --> QEngine
Key components of Prometheus's local TSDB:
- WAL (Write-Ahead Log) — All incoming samples are immediately written to a sequential WAL on disk. This ensures durability even if Prometheus crashes before flushing to blocks. The WAL is replayed on startup to recover in-memory state.
- Head Block — The most recent ~2 hours of data lives in memory for fast writes and queries. Samples are appended to compressed in-memory chunks.
- Persistent Blocks — Every 2 hours, the head block is "cut" into an immutable, on-disk block. Each block contains its own index (label → series mapping) and compressed chunk files.
- Compaction — A background compactor merges small blocks into larger blocks, improving query efficiency and applying retention deletes. The maximum block duration is typically 31 days (or 10% of retention, whichever is smaller).
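The block arithmetic described above can be worked through in a few lines. This sketch assumes blocks are aligned to multiples of the 2-hour block range in Unix time and applies the "31 days or 10% of retention, whichever is smaller" rule from the text; the function names are illustrative only.

```python
from datetime import timedelta

BLOCK_RANGE = timedelta(hours=2)

def block_bounds(ts_seconds: int) -> tuple:
    """Return the [start, end) range of the 2h block containing a Unix timestamp."""
    width = int(BLOCK_RANGE.total_seconds())
    start = ts_seconds - ts_seconds % width
    return start, start + width

def max_block_duration(retention: timedelta) -> timedelta:
    """Largest block the compactor will produce for a given retention period."""
    return min(timedelta(days=31), retention / 10)

print(block_bounds(1718447400))                 # (1718445600, 1718452800)
print(max_block_duration(timedelta(days=15)))   # 1 day, 12:00:00
print(max_block_duration(timedelta(days=365)))  # 31 days, 0:00:00
```

With the default 15-day retention, compacted blocks top out at 36 hours, so retention deletes never have to discard much more data than strictly necessary.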
# Prometheus data directory structure
/prometheus/data/
├── wal/
│ ├── 00000001 # WAL segment files (sequential append)
│ ├── 00000002
│ └── 00000003
├── 01HQXYZ.../ # Block directory (ULID-named)
│ ├── meta.json # Block metadata (time range, stats)
│ ├── index # Label → series postings (inverted index)
│ ├── chunks/
│ │ └── 000001 # Compressed time-series samples
│ └── tombstones # Deletion markers
├── 01HQABC.../ # Another block (older time range)
│ ├── meta.json
│ ├── index
│ └── chunks/
└── lock # Process lock file
Limitations of standalone Prometheus: Single-node (no horizontal scaling), limited retention (typically 15-30 days on local disk), no multi-tenancy, no global query view across multiple Prometheus instances. These limitations are exactly what Mimir solves.
Mimir Architecture
Grafana Mimir is a horizontally-scalable, multi-tenant, long-term storage backend for Prometheus metrics. It accepts data via Prometheus remote write (and OTLP) and provides a Prometheus-compatible query API. Mimir can store years of data cost-effectively using object storage while handling millions of active series.
flowchart TB
    subgraph Clients["Data Sources"]
        Prom["Prometheus<br/>remote_write"]
        Alloy["Grafana Alloy<br/>remote_write"]
        OTel["OTel Collector<br/>OTLP"]
    end
    subgraph WritePath["Write Path"]
        Dist["Distributor<br/>validates, shards,<br/>replicates to ingesters"]
        Ing1["Ingester 1<br/>builds TSDB blocks<br/>in memory"]
        Ing2["Ingester 2<br/>builds TSDB blocks<br/>in memory"]
        Ing3["Ingester 3<br/>builds TSDB blocks<br/>in memory"]
    end
    subgraph ReadPath["Read Path"]
        QF["Query Frontend<br/>splits queries,<br/>caches results"]
        QS["Query Scheduler<br/>fair queuing<br/>across tenants"]
        Q1["Querier<br/>merges ingester +<br/>store-gateway data"]
    end
    subgraph LongTerm["Long-Term Storage"]
        SG["Store Gateway<br/>lazy-loads block<br/>indexes from storage"]
        Obj[("Object Storage<br/>S3 / GCS / Azure<br/>TSDB Blocks")]
    end
    subgraph Background["Background Processes"]
        Comp["Compactor<br/>merges blocks,<br/>deduplicates,<br/>applies retention"]
        Ruler["Ruler<br/>evaluates recording<br/>& alerting rules"]
    end
    Prom --> Dist
    Alloy --> Dist
    OTel --> Dist
    Dist --> Ing1
    Dist --> Ing2
    Dist --> Ing3
    Ing1 --> Obj
    Ing2 --> Obj
    Ing3 --> Obj
    QF --> QS
    QS --> Q1
    Q1 --> Ing1
    Q1 --> Ing2
    Q1 --> Ing3
    Q1 --> SG
    SG --> Obj
    Comp --> Obj
    Ruler --> Q1
Mimir's microservices architecture separates concerns cleanly:
Distributor
The entry point for all writes. Distributors validate incoming samples (correct timestamps, label lengths, series limits), hash each series to determine which ingesters should receive it (consistent hashing via a ring), and replicate writes to the configured replication factor (default: 3).
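The ring lookup can be illustrated with a small consistent-hashing sketch. Everything here is an assumption made for illustration: the `Ring` class, the SHA-256-based token function, and the ingester names are hypothetical, and Mimir's real ring adds zone awareness, health checks, and its own hash function.

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Map a string to a 32-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, ingesters, tokens_per_ingester=64):
        # Each ingester owns many tokens so that load spreads evenly.
        self._tokens = sorted(
            (token(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )

    def replicas(self, tenant, labels, replication_factor=3):
        """Walk clockwise from the series hash, collecting distinct ingesters."""
        key = tenant + "/" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        start = bisect.bisect_left(self._tokens, (token(key), ""))
        owners = []
        for offset in range(len(self._tokens)):
            _, name = self._tokens[(start + offset) % len(self._tokens)]
            if name not in owners:
                owners.append(name)
                if len(owners) == replication_factor:
                    break
        return owners

ring = Ring(["ingester-1", "ingester-2", "ingester-3", "ingester-4"])
series = {"__name__": "http_requests_total", "job": "api-server", "method": "GET"}
owners = ring.replicas("tenant-1", series)
print(owners)  # three distinct ingesters, stable across calls
```

Because the hash covers the tenant and the full label set, every sample of a given series lands on the same ingesters, while different series spread across the ring.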
Ingester
Ingesters hold recent data (typically 2 hours) in memory using an embedded Prometheus TSDB. They serve queries for recent data directly from memory (much faster than object storage). Periodically, they flush completed TSDB blocks to object storage. Ingesters use a WAL for crash recovery and a hash ring for ownership coordination.
Store Gateway
Store gateways provide access to historical (flushed) TSDB blocks in object storage. They lazy-load block indexes and chunk metadata into memory, enabling efficient range queries over months or years of data without downloading entire blocks. Store gateways use a compaction-aware sharding strategy to distribute blocks across instances.
Compactor
The compactor runs as a background process that merges small blocks into larger ones, deduplicates samples (from replication), and enforces retention policies. It operates directly on object storage, reading blocks, merging them, writing the result, and deleting the originals.
Query Frontend
An optional (but strongly recommended) component that sits in front of queriers. It splits large time-range queries into smaller sub-queries for parallelism, caches query results (in Redis/Memcached), aligns query ranges to caching boundaries, and retries failed sub-queries.
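Query splitting is mostly boundary arithmetic. The sketch below cuts a time range at multiples of a split interval (24 hours is assumed here as the interval) so that each sub-query covers one aligned window. It works at one-second granularity and leaves out the step alignment a real query frontend performs.

```python
def split_range(start: int, end: int, interval: int = 86400) -> list:
    """Split the inclusive range [start, end] at multiples of `interval` seconds."""
    parts = []
    cursor = start
    while cursor <= end:
        boundary = cursor - cursor % interval + interval  # next aligned boundary
        parts.append((cursor, min(boundary - 1, end)))
        cursor = boundary
    return parts

# A two-day query starting at 2024-06-15 10:30 UTC becomes three sub-queries.
for sub_start, sub_end in split_range(1718447400, 1718447400 + 2 * 86400):
    print(sub_start, sub_end)
```

Aligning the cuts to fixed boundaries is what makes caching effective: the middle sub-query covers exactly one full day, so its result can be reused by any later query that spans the same day.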
Query Scheduler
Provides fair queuing across tenants in multi-tenant deployments. Without the scheduler, a single tenant running expensive queries could starve other tenants of querier capacity.
Ruler
Evaluates recording rules and alerting rules on behalf of tenants. Each tenant's rules are evaluated independently, using the same PromQL engine as ad-hoc queries. Results from recording rules are written back as new series; alert results are forwarded to Alertmanager.
# Minimal Mimir configuration (mimir.yaml)
target: all  # Run all components in single binary (monolithic mode)
multitenancy_enabled: true
server:
  http_listen_port: 8080
  grpc_listen_port: 9095
distributor:
  ring:
    kvstore:
      store: memberlist
ingester:
  ring:
    replication_factor: 3
    kvstore:
      store: memberlist
blocks_storage:
  backend: s3
  s3:
    endpoint: s3.amazonaws.com
    bucket_name: mimir-blocks
    region: us-east-1
  tsdb:
    dir: /data/ingester
    block_ranges_period: [2h]
    retention_period: 24h  # how long ingesters keep blocks on local disk
store_gateway:
  sharding_ring:
    replication_factor: 1
compactor:
  data_dir: /data/compactor
  sharding_ring:
    kvstore:
      store: memberlist
limits:
  max_global_series_per_user: 1500000
  ingestion_rate: 200000  # samples/sec per tenant
  ingestion_burst_size: 400000
  compactor_blocks_retention_period: 365d  # 1 year retention
ruler:
  rule_path: /data/ruler
  alertmanager_url: http://alertmanager:9093
Prometheus vs Mimir — When to Choose Each
| Aspect | Prometheus (standalone) | Grafana Mimir |
|---|---|---|
| Scale | Single node, ~10M active series | Horizontally scalable, billions of series |
| Retention | Days to weeks (local disk) | Months to years (object storage) |
| Multi-tenancy | No | Native per-tenant isolation |
| HA | Dual Prometheus + dedup | Built-in replication (RF=3) |
| Global View | Requires federation or Thanos | Single query endpoint across all data |
| Operational Cost | Low (single binary) | Medium (microservices + object storage) |
| Best For | Small teams, single cluster | Platform teams, multi-cluster, enterprise |
Using Exemplars in Grafana
Exemplars are a powerful feature that bridges the gap between metrics and traces. An exemplar is a specific trace ID (or other identifying information) attached to a metric sample, representing a concrete example of a request that contributed to that metric value. They answer the question: "This p99 latency spiked — which specific request was slow?"
Linking Metrics to Traces
When your application records a histogram observation (e.g., request duration), it can simultaneously attach the trace ID of that specific request as an exemplar. Later, when viewing a latency spike on a dashboard, you can click the exemplar markers to jump directly to the trace that caused the spike.
# Python example: recording an exemplar with prometheus_client
# Note: prometheus_client only exposes exemplars in the OpenMetrics exposition format
import time

from opentelemetry import trace
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['method', 'endpoint', 'status'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def handle_request(request):
    # Get current trace context
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, '032x')
    # Time the request (process() stands in for your application's handler)
    start = time.time()
    response = process(request)
    duration = time.time() - start
    # Record metric WITH exemplar (trace_id links to the trace)
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).observe(duration, exemplar={'trace_id': trace_id})
    return response
// Go example: recording exemplars with prometheus/client_golang
import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency histogram",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"method", "endpoint", "status"},
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	// ... process request ...
	duration := time.Since(start).Seconds()
	// Get trace ID from current span context
	spanCtx := trace.SpanFromContext(r.Context()).SpanContext()
	// Record with exemplar
	// (in production, prefer a route template over r.URL.Path to bound cardinality)
	requestDuration.WithLabelValues(r.Method, r.URL.Path, "200").(prometheus.ExemplarObserver).ObserveWithExemplar(
		duration,
		prometheus.Labels{"trace_id": spanCtx.TraceID().String()},
	)
}
Configuring Exemplar Storage
Exemplars must be enabled in both the storage backend (Prometheus/Mimir) and Grafana:
# Prometheus: enable exemplar storage
# Requires starting Prometheus with --enable-feature=exemplar-storage
# prometheus.yml
storage:
  exemplars:
    max_exemplars: 100000  # circular buffer size
# Mimir: exemplar ingestion is disabled by default; enable it with a non-zero per-tenant limit
# mimir.yaml
limits:
  max_global_exemplars_per_user: 100000
# Grafana datasource provisioning: enable exemplars
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    url: http://mimir-query-frontend:8080/prometheus
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo  # Link to your Tempo datasource
          urlDisplayLabel: "View Trace"
Grafana reads the trace_id from the exemplar → opens the Tempo trace view → you see the exact slow request with all its spans, including which downstream service caused the delay.
Querying Exemplars in PromQL & Grafana
In Grafana, exemplars appear as small dots overlaid on time-series graphs. Enable them per-panel:
- Open panel edit mode
- In the query options, toggle "Exemplars" on
- Optionally add a filter query to limit which exemplars are shown
- Configure the trace ID field name (usually trace_id or traceID)
# PromQL query that works with exemplars
# The exemplar data is attached to the underlying histogram buckets
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))
)
# Exemplar API query (used internally by Grafana)
# GET /api/v1/query_exemplars?query=http_request_duration_seconds_bucket&start=...&end=...
# Response includes exemplar data:
# {
# "seriesLabels": {"method": "GET", "status": "200"},
# "exemplars": [
# {"labels": {"trace_id": "abc123def456"}, "value": 0.842, "timestamp": 1718447400}
# ]
# }
Best Practices
Operating Prometheus and Mimir at scale requires disciplined metric design, thoughtful recording rules, and proactive cardinality management. These practices prevent the most common production failures.
Cardinality Management
Cardinality is the number of unique time series in your system. Each unique combination of metric name + label values creates a separate time series. High cardinality is the #1 cause of Prometheus/Mimir performance issues and cost overruns.
# Check current cardinality in Prometheus/Mimir
# Total active series
prometheus_tsdb_head_series
# Series creation rate per second (churn; should be stable, not growing)
rate(prometheus_tsdb_head_series_created_total[5m])
# Top metrics by series count (Mimir-specific)
# Use the cardinality API: /api/v1/cardinality/label_values?label_names[]=__name__
# (or /api/v1/cardinality/label_names for distinct values per label),
# or the Mimir dashboard "Tenants" panel
# Find high-cardinality metric names with PromQL (expensive: touches every series)
count by (__name__) ({__name__=~".+"}) > 1000
# Specific metric cardinality check
count(http_requests_total)
count by (method) (http_requests_total)
count by (status_code) (http_requests_total)
count by (path) (http_requests_total) # ← likely the culprit!
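The per-label checks above boil down to asking which label takes the most distinct values. A minimal Python sketch of that idea follows; the label sets are synthetic (2 methods, 2 status codes, 50 user-specific paths) and the function names are made up for illustration, so treat it as a model of the reasoning rather than a replacement for the queries.

```python
from collections import defaultdict


def distinct_values_per_label(series):
    """Map each label name to the number of distinct values it takes."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def synthetic_series():
    """Hypothetical label sets: 2 methods x 2 status codes x 50 paths."""
    return [
        {"method": m, "status_code": s, "path": f"/users/{uid}/orders"}
        for m in ("GET", "POST")
        for s in ("200", "500")
        for uid in range(50)
    ]


series = synthetic_series()
counts = distinct_values_per_label(series)
# Rank labels so the one with the most distinct values comes first.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    print(f"{len(series)} series")
    for name, n in ranked:
        print(f"{name}: {n} distinct values")
```

The label at the top of the ranking is the one multiplying the series count, which is the same conclusion the count by (path) query points to.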
Cardinality Explosion — Real-World Example
A team added a path label containing the full URL path to their HTTP metrics. With 50,000 unique API endpoints (including user IDs in paths like /users/123/orders), this created:
5 methods × 10 status codes × 50,000 paths = 2,500,000 series — for a single service!
Fix: Use parameterized paths (/users/:id/orders) with at most 50-100 unique values, and keep the full path out of metric labels. Record it in log lines or trace attributes instead, and keep only the route template in metrics.
# ❌ BAD — unbounded path label
http_requests_total{method="GET", path="/users/12345/orders/67890"}
# ✅ GOOD — parameterized route label
http_requests_total{method="GET", route="/users/:id/orders/:order_id"}
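The sketch below shows one way the raw path could be collapsed into a route template before it is used as a label value, together with the worst-case series arithmetic from the example above. It assumes that dynamic segments are purely numeric IDs and replaces each with a generic :id placeholder; real routers usually expose the matched template directly, which is preferable when available. Both function names are hypothetical.

```python
import math
import re

# Assumption: dynamic path segments are purely numeric identifiers.
_ID_SEGMENT = re.compile(r"\d+")


def route_template(path):
    """Replace numeric path segments with a ':id' placeholder."""
    segments = [
        ":id" if _ID_SEGMENT.fullmatch(segment) else segment
        for segment in path.split("/")
    ]
    return "/".join(segments)


def worst_case_series(*distinct_values_per_label):
    """Upper bound on series count: the product of distinct values per label."""
    return math.prod(distinct_values_per_label)


if __name__ == "__main__":
    print(route_template("/users/12345/orders/67890"))
    # 5 methods x 10 status codes x 50,000 raw paths
    print(worst_case_series(5, 10, 50_000))
    # Same service with 100 route templates instead of raw paths
    print(worst_case_series(5, 10, 100))
```

With the same methods and status codes, bounding the route label at 100 values caps the metric at 5,000 series instead of 2,500,000.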
Label Naming Conventions
Consistent label naming across your organization ensures queries are portable and dashboards work across services:
# Recommended label conventions (common Prometheus/Kubernetes naming; OpenTelemetry uses dotted equivalents such as http.route)
# ═══════════════════════════════════════════
# Infrastructure labels (added by scrape config or Alloy)
cluster: "production-us-east-1" # Kubernetes cluster name
namespace: "checkout" # Kubernetes namespace
pod: "checkout-7b4f8c-xk2p9" # Pod name (high-cardinality but useful)
node: "ip-10-0-1-42" # Node name
container: "api" # Container name
# Target labels (attached by Prometheus at scrape time)
job: "checkout-api" # Scrape job / service identifier
instance: "10.0.1.42:8080" # host:port of the target
# Application labels (added in metric definition)
method: "GET" # HTTP method (GET, POST, PUT, DELETE)
status_code: "200" # HTTP status code
route: "/api/v1/checkout" # Parameterized route (NOT full URL)
handler: "CreateOrder" # Internal handler/function name
# ═══════════════════════════════════════════
# AVOID these patterns:
# ═══════════════════════════════════════════
# ❌ env: "prod-us-east-1" → split into env + region
# ❌ service_version: "v2.3.1-rc4" → too many values (use info metric instead)
# ❌ error_message: "connection..." → unbounded string values
# ❌ user_id: "usr_abc123" → millions of values
Recording Rules in Production
Production recording rules should be organized into groups by purpose and evaluated at appropriate intervals:
# recording-rules.yaml: production-grade recording rules
groups:
  # ─── SLI Recording Rules (foundation for SLOs) ───
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Availability SLI: proportion of successful requests
      - record: sli:http_requests:availability_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code!~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Latency SLI: proportion of requests faster than threshold
      - record: sli:http_requests:latency_rate5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
          /
          sum by (job) (rate(http_request_duration_seconds_count[5m]))
  # ─── Cluster-Level Aggregations ───
  - name: cluster_aggregations
    interval: 60s
    rules:
      - record: cluster:node_cpu:sum_rate5m
        expr: sum by (cluster) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      - record: cluster:node_memory:utilization
        expr: |
          1 - sum by (cluster) (node_memory_MemAvailable_bytes)
            / sum by (cluster) (node_memory_MemTotal_bytes)
  # ─── Cost Attribution (for multi-tenant chargeback) ───
  - name: cost_attribution
    interval: 300s # 5 minutes is sufficient for cost data
    rules:
      - record: namespace:container_cpu_usage:sum_rate5m
        expr: sum by (namespace, cluster) (rate(container_cpu_usage_seconds_total[5m]))
      - record: namespace:container_memory:avg_bytes
        expr: avg by (namespace, cluster) (container_memory_working_set_bytes)
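To show what the availability SLI rule computes, here is a simplified Python model of its expression: a per-series rate over a 5-minute window, summed by job, with non-5xx requests divided by all requests. The `availability_by_job` helper and the counter readings are invented for illustration, and the rate calculation ignores counter resets and extrapolation, which the real rate() function handles.

```python
import re
from collections import defaultdict

# Mirrors the status_code!~"5.." matcher: PromQL regexes are fully anchored.
SERVER_ERROR = re.compile(r"5..")


def availability_by_job(samples, window_seconds=300.0):
    """samples: (labels, first_value, last_value) readings of http_requests_total."""
    good = defaultdict(float)
    total = defaultdict(float)
    for labels, first, last in samples:
        # Simplified rate(): counter increase divided by window length.
        rate = (last - first) / window_seconds
        job = labels["job"]
        total[job] += rate
        if not SERVER_ERROR.fullmatch(labels["status_code"]):
            good[job] += rate
    return {job: good[job] / total[job] for job in total if total[job] > 0}


# Hypothetical counter readings at the start and end of a 5-minute window.
SAMPLES = [
    ({"job": "checkout-api", "status_code": "200"}, 1000.0, 3997.0),
    ({"job": "checkout-api", "status_code": "503"}, 10.0, 13.0),
    ({"job": "search-api", "status_code": "200"}, 500.0, 800.0),
]

result = availability_by_job(SAMPLES)

if __name__ == "__main__":
    for job, ratio in sorted(result.items()):
        print(f"{job}: {ratio:.4f}")
```

In this made-up window checkout-api serves 2,997 good requests out of 3,000, so its SLI sits right at a 99.9% target, while search-api has no errors at all.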
Alerting Rules
Well-designed alerting rules reduce noise and ensure on-call engineers receive actionable notifications:
# alerting-rules.yaml: production alerting rules
groups:
  - name: service_alerts
    rules:
      # ─── SLO-Based Alert (Multi-Window, Multi-Burn-Rate) ───
      # Fires when both the 5m and the 1h window burn error budget faster than
      # 14.4x the sustainable rate for a 99.9% SLO.
      # In production, record the 1h ratio as its own recording rule.
      - alert: SLOErrorBudgetBurn
        expr: |
          (
            sli:http_requests:availability_rate5m{job="checkout-api"} < (1 - 14.4 * 0.001)
            and
            (
              sum by (job) (rate(http_requests_total{job="checkout-api", status_code!~"5.."}[1h]))
              /
              sum by (job) (rate(http_requests_total{job="checkout-api"}[1h]))
            ) < (1 - 14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          team: checkout
          slo: availability
        annotations:
          summary: "{{ $labels.job }} burning error budget too fast"
          description: |
            Current availability is {{ $value | humanizePercentage }}.
            SLO target is 99.9%. Error budget is being consumed rapidly.
          runbook_url: "https://wiki.internal/runbooks/slo-burn"
          dashboard_url: "https://grafana.internal/d/slo-dashboard?var-job={{ $labels.job }}"
      # ─── Latency Degradation ───
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "{{ $labels.job }} p99 latency above 2s"
          description: "P99 latency is {{ $value | humanizeDuration }} for {{ $labels.job }}."
      # ─── Infrastructure Alert ───
      - alert: DiskSpaceCritical
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 3600*24) < 0
        for: 10m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "{{ $labels.instance }} disk will fill within 24 hours"
          description: |
            Based on the 6-hour trend, {{ $labels.instance }} mount {{ $labels.mountpoint }}
            will run out of space within 24 hours.
            Projected available space in 24 hours: {{ $value | humanize1024 }}B
            (a negative value means the disk is predicted to be full).
      # ─── Target Health Alert (scrape failure detection) ───
      - alert: TargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.instance }} is down"
          description: "Target has been unreachable for more than 3 minutes."
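The DiskSpaceCritical rule above relies on predict_linear, which fits a straight line through the samples in the range and extrapolates it forward. The Python sketch below reproduces that idea with an ordinary least-squares fit. It is a simplification: it extrapolates from the last sample's timestamp rather than the rule evaluation time, and the disk readings are invented.

```python
GIB = 2 ** 30


def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, evaluated
    seconds_ahead after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    t_future = samples[-1][0] + seconds_ahead
    return slope * t_future + intercept


# Hypothetical hourly readings over 6 hours.
# "draining" loses 1 GiB of free space per hour; "stable" stays at 20 GiB.
draining = [(h * 3600, (20 - h) * GIB) for h in range(7)]
stable = [(h * 3600, 20 * GIB) for h in range(7)]

draining_prediction = predict_linear(draining, 24 * 3600)
stable_prediction = predict_linear(stable, 24 * 3600)

if __name__ == "__main__":
    for name, prediction in (("draining", draining_prediction),
                             ("stable", stable_prediction)):
        state = "would fire" if prediction < 0 else "ok"
        print(f"{name}: {prediction / GIB:.1f} GiB projected, alert {state}")
```

The draining disk has 14 GiB free now but projects to a negative value 24 hours out, which is exactly the condition the rule's `< 0` comparison catches.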
- Use a for duration to avoid flapping: 2-5 minutes prevents transient spikes from paging.
- Include runbook URLs in annotations: the on-call engineer needs next steps, not just a description.
- Use recording rules in alert expressions: this keeps alerts readable and avoids expensive evaluations.
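The SLO-based alert above is described as a burn-rate alert, so it may help to see the arithmetic behind a burn-rate threshold. The sketch below assumes a 30-day SLO period and uses 14.4 as the example burn rate, a value commonly cited in SRE literature because it corresponds to spending 2% of a 30-day budget in one hour. The function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def availability_threshold(slo_target, burn):
    """Availability below which the given burn rate is exceeded."""
    return 1.0 - burn * (1.0 - slo_target)


def hours_to_exhaustion(burn, slo_period_hours=30 * 24):
    """Time until the whole error budget is spent at a constant burn rate."""
    return slo_period_hours / burn


SLO = 0.999   # 99.9% availability target, as in the alert above
FAST = 14.4   # assumed fast-burn factor (2% of a 30-day budget per hour)

if __name__ == "__main__":
    print(f"burn rate at 1% errors: {burn_rate(0.01, SLO):.1f}x")
    print(f"alert threshold: availability < {availability_threshold(SLO, FAST):.4f}")
    print(f"budget gone in {hours_to_exhaustion(FAST):.0f} hours at {FAST}x")
```

A burn rate of 1 means the budget lasts exactly the SLO period, so comparing availability directly against the 99.9% target pages on any consumption at all. Scaling the threshold by a burn rate is what separates a slow leak from an emergency.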
Summary & Next Steps
In this deep dive into metrics monitoring, we covered:
- PromQL fundamentals — Selectors, operators, functions, aggregations, and the distinction between instant and range vectors
- Writing production PromQL — rate/irate/increase, histogram_quantile, recording rules, and implementing RED/USE methods
- Data collection protocols — StatsD/DogStatsD for legacy apps, OTLP for modern instrumentation, Prometheus exposition for cloud-native, and SNMP for network devices
- Storage architectures — Graphite's file-per-metric model, Prometheus's local TSDB with WAL and compaction, and Mimir's horizontally-scalable microservices architecture
- Exemplars — Linking metrics to traces for instant correlation during incident response
- Best practices — Cardinality management, label naming conventions, recording rules organization, and alerting rule design
The key insight is that metrics monitoring at scale is as much about discipline as it is about technology. Cardinality explosions, missing recording rules, and noisy alerts are organizational problems that require conventions and governance — not just better tools.
Next in the Grafana Track
In Part 6: Distributed Tracing with Grafana Tempo & TraceQL, we'll explore the traces pillar — TraceQL query language, Tempo's architecture, span-level analysis, service graphs, and correlating traces with metrics and logs for complete observability.