Grafana Deep Dive Part 5: Monitoring with Metrics — Mimir & PromQL

Introducing PromQL

PromQL (Prometheus Query Language) is the standard query language for time-series metrics in the Prometheus ecosystem. It powers dashboards in Grafana, alerting rules evaluated by Prometheus and the Mimir ruler (Alertmanager then routes the resulting notifications), and recording rules in both Prometheus and Grafana Mimir. Whether you're querying a local Prometheus instance or a multi-tenant Mimir cluster storing billions of active series, the query language is the same.

PromQL was designed with a specific philosophy: metrics are multi-dimensional (identified by label sets), append-only (new samples are only ever appended to a series), and queries should express relationships between time series naturally. Unlike SQL, which operates on tables, PromQL operates on vectors: collections of time series samples at specific points in time.

Feature Overview

PromQL provides four core capabilities that combine to express virtually any metrics query:

Selectors — Choose which time series to operate on using metric names and label matchers
Operators — Perform arithmetic, comparison, and logical operations between time series
Functions — Transform, aggregate, or compute derived values from time series data
Aggregations — Combine multiple time series into fewer series by grouping on label dimensions

                            
Key Insight: PromQL's power comes from its ability to seamlessly combine selectors, functions, and aggregations in a single expression. A single PromQL query can select specific services, compute per-second rates, calculate percentiles, and aggregate across dimensions — all in one readable line.

Selectors & Matchers

A selector identifies which time series to query. Most PromQL expressions start with one:

# Simple metric name selector — selects all series with this name
http_requests_total

# Label matcher — equality
http_requests_total{job="api-server"}

# Label matcher — not equal
http_requests_total{job!="internal-scraper"}

# Label matcher — regex match
http_requests_total{method=~"GET|POST"}

# Label matcher — regex not-match
http_requests_total{status_code!~"2.."}

# Multiple matchers (AND logic)
http_requests_total{job="api-server", method="POST", status_code=~"5.."}

# Metric name is syntactic sugar for __name__ label
{__name__=~"http_requests_total|http_request_duration_seconds_.*"}

There are four matcher types: = (exact equality), != (not equal), =~ (regex match), and !~ (regex not-match). At least one matcher must not match the empty string — you cannot select all metrics without some constraint.
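The matching rules are easy to model. Below is a minimal Python sketch, assuming two documented PromQL behaviours: regex matchers are fully anchored (so `=~"GET"` does not match `GETX`), and a label missing from a series compares as the empty string. The helper names are hypothetical, and Python's `re` module stands in for Go's RE2 syntax, which differs in some edge cases.

```python
import re


def matches(labels: dict, name: str, op: str, value: str) -> bool:
    """Evaluate one PromQL-style label matcher against a series' label set."""
    actual = labels.get(name, "")  # an absent label compares as ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # anchored at both ends
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


def selects(labels: dict, matchers: list) -> bool:
    """A selector keeps a series only if every matcher holds (AND logic)."""
    return all(matches(labels, *m) for m in matchers)


def is_valid_selector(matchers: list) -> bool:
    """Reject selectors in which every matcher also matches the empty string."""
    return any(not matches({}, *m) for m in matchers)


series = {"__name__": "http_requests_total", "job": "api-server",
          "method": "POST", "status_code": "503"}
print(selects(series, [("job", "=", "api-server"),
                       ("method", "=~", "GET|POST"),
                       ("status_code", "=~", "5..")]))  # True
print(matches(series, "status_code", "=~", "5"))        # False: no partial match
print(matches(series, "team", "=", ""))                 # True: missing label is ""
print(is_valid_selector([("job", "=~", ".*")]))         # False
print(is_valid_selector([("__name__", "=", "up")]))     # True
```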

Operators

PromQL supports arithmetic, comparison, and logical operators between scalars, vectors, or combinations thereof:

# Arithmetic operators: + - * / % ^
# Error rate as a percentage (aggregate both sides so their label sets match)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Comparison operators: == != > < >= <=
# Filter: only series where value exceeds threshold
http_request_duration_seconds{quantile="0.99"} > 0.5

# bool modifier: return 0/1 instead of filtering
http_request_duration_seconds{quantile="0.99"} > bool 0.5

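# Vector matching with 'on' (or its inverse, 'ignoring')
# Divide request errors by total requests, matching on the 'method' label
# (aggregate first so each side has exactly one series per method)
sum by (method) (rate(http_requests_total{status_code=~"5.."}[5m]))
  / on(method)
sum by (method) (rate(http_requests_total[5m]))

# group_left / group_right for many-to-one matching
# Share of requests per 5xx status code within each job:
# many left-hand series (one per status_code) match one right-hand series per job
sum by (job, status_code) (rate(http_requests_total{status_code=~"5.."}[5m]))
  / on(job) group_left
sum by (job) (rate(http_requests_total[5m]))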

                            
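Vector Matching: When performing binary operations between two vectors, PromQL matches series by their label sets. Use on(label1, label2) to match only on the listed labels, or ignoring(label1) to exclude labels from matching. group_left and group_right enable many-to-one and one-to-many matching, where several series on the "many" side each match a single series on the other side; labels listed in the modifier, as in group_left(team), are copied from the "one" side into the result.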
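One-to-one matching can be sketched in a few lines of Python. The hypothetical `divide_on` helper keys each side by the `on(...)` labels, divides values whose keys appear on both sides, and raises an error when a key repeats on one side, which is the situation PromQL rejects unless you add `group_left`/`group_right` (and the reason both sides are usually aggregated first). It does not model `ignoring`, group modifiers, or metric-name handling.

```python
from collections import defaultdict


def divide_on(left, right, on):
    """Evaluate 'left / on(<labels>) right' for lists of (labels, value) pairs."""

    def index(side, name):
        groups = defaultdict(list)
        for labels, value in side:
            key = tuple((label, labels.get(label, "")) for label in on)
            groups[key].append(value)
        for key, values in groups.items():
            if len(values) > 1:
                # PromQL rejects this unless group_left/group_right is used
                raise ValueError(f"duplicate match key {key} on {name} side")
        return {key: values[0] for key, values in groups.items()}

    lhs, rhs = index(left, "left"), index(right, "right")
    # Only keys present on both sides produce output; results keep just the 'on' labels
    return {key: lhs[key] / rhs[key] for key in lhs.keys() & rhs.keys()}


errors = [({"method": "GET"}, 2.0), ({"method": "POST"}, 1.0)]
totals = [({"method": "GET"}, 100.0), ({"method": "POST"}, 20.0),
          ({"method": "PUT"}, 5.0)]
print(divide_on(errors, totals, ["method"]))  # GET -> 0.02, POST -> 0.05, PUT dropped
```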
                        

Functions

PromQL provides dozens of built-in functions. The most commonly used categories:

# Rate functions (for counters)
rate(http_requests_total[5m])          # per-second rate over 5m
irate(http_requests_total[5m])         # instant rate (last two samples)
increase(http_requests_total[1h])      # total increase over 1h

# Aggregation over time (for gauges)
avg_over_time(node_load1[5m])
max_over_time(node_memory_MemFree_bytes[1h])
min_over_time(node_filesystem_avail_bytes[6h])

# Mathematical functions
abs(delta(temperature_celsius[5m]))
ceil(http_request_duration_seconds)
floor(available_disk_gb)
round(cpu_usage_percent, 0.01)         # round to 2 decimal places
clamp(cpu_usage, 0, 100)              # clamp between bounds
clamp_min(free_memory_bytes, 0)       # floor at 0
ln(http_requests_total)               # natural logarithm

# Label manipulation
label_replace(up, "short_instance", "$1", "instance", "(.*):.*")
label_join(up, "full_path", "/", "namespace", "pod")

# Time functions
time()                                 # current Unix timestamp
timestamp(up)                          # timestamp of each sample
day_of_week()                          # 0=Sunday through 6=Saturday (UTC)
hour()                                 # hour of the day (0-23, UTC)

# Prediction and trending
predict_linear(node_filesystem_avail_bytes[6h], 3600*24)  # predict value in 24h
deriv(node_memory_MemAvailable_bytes[15m])                 # per-second derivative of a gauge

# Histogram functions
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
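predict_linear() and deriv() both rest on simple least-squares linear regression over the samples in the range: deriv() returns the fitted slope per second, and predict_linear() extends the fitted line forward. The Python sketch below reproduces that idea, assuming samples are `(unix_seconds, value)` pairs and that the line is anchored at the evaluation time `now`; the function names mirror PromQL but are hypothetical stand-ins, not Prometheus code.

```python
def linear_fit(samples, anchor):
    """Least-squares slope and fitted value at time 'anchor' for (t, v) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [t - anchor for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var                    # what deriv() reports (units per second)
    intercept = mean_y - slope * mean_x  # fitted value at t == anchor
    return slope, intercept


def predict_linear(samples, now, horizon_s):
    """Extrapolate the fitted line 'horizon_s' seconds past 'now'."""
    slope, value_now = linear_fit(samples, now)
    return value_now + slope * horizon_s


HOUR = 3600
# Free space shrinking by 1 GB per hour, sampled hourly over the last 6 hours
samples = [(h * HOUR, 100e9 - h * 1e9) for h in range(7)]
now = 6 * HOUR
print(round(predict_linear(samples, now, 24 * HOUR) / 1e9, 3))  # 70.0 (GB left in 24h)
```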

Aggregations

Aggregation operators combine multiple time series by label dimensions:

# Basic aggregations
sum(rate(http_requests_total[5m]))                    # total across all series
avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))    # average idle fraction across CPUs
count(up == 1)                                        # number of healthy targets
min(node_filesystem_avail_bytes)                      # minimum free space
max(container_memory_usage_bytes)                     # peak memory usage
stddev(http_request_duration_seconds)                 # standard deviation

# Group by specific labels with 'by'
sum by (job, method) (rate(http_requests_total[5m]))

# Group by excluding labels with 'without'
sum without (instance, pod) (rate(http_requests_total[5m]))

# TopK / BottomK
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
bottomk(3, node_filesystem_avail_bytes)

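# count_values: count series per distinct sample value (the value becomes a label)
count_values("value", up)                             # how many targets are up (1) vs down (0)

# Count the distinct values of a label instead
count(count by (version) (build_info))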

# quantile — calculate quantile across series (not over time!)
quantile(0.95, rate(http_requests_total[5m]))
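The difference between `by` and `without` comes down to which labels form each group's key: `by` keeps only the listed labels, while `without` keeps everything except the listed labels (and the metric name). A minimal Python sketch of that grouping, using hypothetical series:

```python
from collections import defaultdict


def sum_grouped(series, by=None, without=None):
    """Sum (labels, value) pairs the way 'sum by (...)' or 'sum without (...)' groups them."""
    out = defaultdict(float)
    for labels, value in series:
        if by is not None:
            key = {k: v for k, v in labels.items() if k in by}
        else:
            drop = set(without or ()) | {"__name__"}
            key = {k: v for k, v in labels.items() if k not in drop}
        out[tuple(sorted(key.items()))] += value
    return dict(out)


series = [
    ({"__name__": "http_requests_total", "job": "api", "method": "GET", "instance": "a"}, 10.0),
    ({"__name__": "http_requests_total", "job": "api", "method": "GET", "instance": "b"}, 5.0),
    ({"__name__": "http_requests_total", "job": "api", "method": "POST", "instance": "a"}, 2.0),
]
print(sum_grouped(series, by=["job", "method"]))
# {(('job', 'api'), ('method', 'GET')): 15.0, (('job', 'api'), ('method', 'POST')): 2.0}
print(sum_grouped(series, without=["instance"]))  # same groups for this input
```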

Writing PromQL

With the building blocks established, let's explore how to write production-grade PromQL queries for real-world monitoring scenarios.

Instant Vectors

An instant vector is a set of time series where each series has exactly one sample at a given evaluation timestamp. This is what you get from a basic selector or after applying a function:

# Instant vector — one value per series at current time
up{job="api-server"}
# Result:
# up{job="api-server", instance="10.0.0.1:8080"} => 1
# up{job="api-server", instance="10.0.0.2:8080"} => 1
# up{job="api-server", instance="10.0.0.3:8080"} => 0

# After applying rate() — still an instant vector (one value per series)
rate(http_requests_total{job="api-server"}[5m])
# Result:
# {job="api-server", method="GET", status="200"} => 142.5
# {job="api-server", method="POST", status="201"} => 23.8
# {job="api-server", method="GET", status="404"} => 0.3

Instant vectors are what alert rules evaluate against thresholds. Grafana plots them by issuing a range query, which evaluates the same expression at every step of the dashboard's time range (one data point per step).

Range Vectors

A range vector is a set of time series where each series contains multiple samples over a specified time window. Range vectors cannot be graphed directly — they must be passed to a function that reduces them to instant vectors:

# Range vector — raw samples over last 5 minutes
http_requests_total{job="api-server"}[5m]
# Result (cannot graph this directly!):
# {method="GET"} => [(t1, 1000), (t2, 1015), (t3, 1030), ...]

# Range vector durations: [30s] [1m] [5m] [15m] [1h] [6h] [1d] [7d]

# Offset modifier — look back in time
rate(http_requests_total[5m] offset 1h)     # rate 1 hour ago
rate(http_requests_total[5m] offset 7d)     # rate 1 week ago

# @ modifier — evaluate at a specific timestamp
rate(http_requests_total[5m] @ 1718447400)  # rate at specific Unix time
rate(http_requests_total[5m] @ start())     # rate at query start time
rate(http_requests_total[5m] @ end())       # rate at query end time

                            
Choosing Range Duration: The range window affects smoothness and responsiveness. Use [1m] for real-time alerting (noisy but fast), [5m] for dashboards (balanced), and [15m] or longer for capacity planning (smooth trends). The range should be at least 4× your scrape interval to ensure enough samples for accurate rate calculation.

rate(), irate() & increase()

These are the three essential functions for working with counters (monotonically increasing metrics that reset on restart):

# rate() — average per-second increase over the range window
# Best for: dashboards, alerting, most use cases
rate(http_requests_total{job="api-server"}[5m])

# irate() — instant rate using only the last two data points
# Best for: volatile, spiky metrics where you want to see peaks
# Caution: very sensitive to scrape timing, poor for alerting
irate(http_requests_total{job="api-server"}[5m])

# increase() — total increase over the range window
# Equivalent to rate() * seconds_in_range
# Best for: "how many requests in the last hour?"
increase(http_requests_total{job="api-server"}[1h])
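The counter-reset handling behind these functions can be sketched in Python. This simplified version treats any decrease between consecutive samples as a reset (so the new value itself is the increase), which is how PromQL interprets counter drops. Unlike the real rate() and increase(), it does not extrapolate to the edges of the range window, so its numbers will differ slightly from Prometheus output; the function names are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (t, v) samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts near zero, so the new value is the increase
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between first and last sample (no extrapolation)."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


def simple_irate(samples):
    """Per-second rate from only the last two samples, as irate() does."""
    (t1, v1), (t2, v2) = samples[-2:]
    delta = v2 - v1 if v2 >= v1 else v2
    return delta / (t2 - t1)


# Scraped every 15s; the counter resets between t=30 and t=45
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(counter_increase(samples))  # 100.0
print(simple_rate(samples))       # 100 / 60 s, about 1.667
print(simple_irate(samples))      # 2.0
```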

Comparison

rate() vs irate() — When to Use Each

Aspect	rate()	irate()
Calculation	Average rate over full range	Rate between last 2 samples only
Sensitivity	Smooth, dampens spikes	Volatile, shows every spike
Alerting	Excellent (stable)	Poor (flapping)
Counter Reset	Handles gracefully	May produce artifacts
Dashboard Use	Default choice	Only for "zoom into spikes"
Recording Rules	Always use rate()	Never record irate()


histogram_quantile()

The histogram_quantile() function calculates percentiles from Prometheus histogram buckets. This is the primary way to compute latency percentiles (p50, p95, p99):

# P99 latency across all instances (sum by (le) merges the per-instance buckets)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))
)

# P95 latency grouped by endpoint
histogram_quantile(0.95,
  sum by (le, handler) (
    rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
  )
)

# P50 (median) response time per method
histogram_quantile(0.50,
  sum by (le, method) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Multiple quantiles in one panel: add one query per quantile (or use recording rules)
# Panel 1: histogram_quantile(0.50, sum by (le) (rate(...[5m])))
# Panel 2: histogram_quantile(0.90, sum by (le) (rate(...[5m])))
# Panel 3: histogram_quantile(0.99, sum by (le) (rate(...[5m])))

                            
Critical Rule: The le (less-than-or-equal) label MUST be preserved in the by clause of any aggregation inside histogram_quantile(). If you sum by (method) without including le, the bucket structure is destroyed and the result is meaningless.
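The interpolation that histogram_quantile() performs over the `le` buckets can be sketched in Python. The sketch follows the documented approach: find the bucket that contains the target rank, interpolate linearly inside it (assuming a lower bound of 0 for the first bucket), and return the highest finite bound when the rank falls into the `+Inf` bucket. It restricts `q` to (0, 1] and skips edge cases such as non-monotonic bucket counts; the bucket values come from the exposition-format example later in this article.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from (upper_bound, cumulative_count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan  # a +Inf bucket is required
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank
    i = next(idx for idx, (_, count) in enumerate(buckets) if count >= rank)
    if i == len(buckets) - 1:
        return buckets[-2][0]  # rank is in the +Inf bucket: highest finite bound
    upper, count = buckets[i]
    if i == 0 and upper <= 0:
        return upper
    lower, prev_count = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Linear interpolation inside the chosen bucket
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


buckets = [(0.01, 500), (0.05, 900), (0.1, 980), (0.5, 1000),
           (1.0, 1010), (math.inf, 1027)]
print(histogram_quantile(0.5, buckets))   # about 0.01135
print(histogram_quantile(0.99, buckets))  # 1.0 (rank lands in the +Inf bucket)
```

Linear interpolation is also why bucket boundaries matter: the estimate can only be as precise as the bucket that contains the quantile.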
                        

Recording Rules

Recording rules pre-compute expensive PromQL expressions and store the result as new time series. They reduce query latency for dashboards and enable multi-level aggregations that would be too slow to compute at query time:

# prometheus-rules.yaml or mimir-rules.yaml
groups:
  - name: api_server_recording_rules
    interval: 30s
    rules:
      # Pre-compute request rate by job and method
      # (_total is dropped from the name once rate() is applied, per Prometheus naming guidance)
      - record: job_method:http_requests:rate5m
        expr: sum by (job, method) (rate(http_requests_total[5m]))

      # Pre-compute error ratio
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute p99 latency by handler
      - record: handler:http_request_duration_seconds:p99_rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, handler) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # Pre-compute cluster-wide CPU usage
      - record: cluster:node_cpu:ratio_rate5m
        expr: |
          1 - avg by (cluster) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

                            
Naming Convention: Recording rules follow the pattern level:metric:operations. The level represents the aggregation labels (e.g., job, handler, cluster), the metric is the source metric name, and operations describe what was applied (e.g., rate5m, p99_rate5m, ratio_rate5m).
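Because the convention is mechanical, it can be generated. This hypothetical Python helper builds a `level:metric:operations` name and checks it against the metric-name character rules; it also applies the Prometheus naming guidance of dropping a `_total` suffix once a rate has been taken (a heuristic here: any operation containing `rate`).

```python
import re

# Valid Prometheus metric names (colons are reserved for recording rules)
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def recording_rule_name(level_labels, metric, operations):
    """Build a 'level:metric:operations' recording rule name."""
    level = "_".join(level_labels)
    # Naming guidance: drop '_total' once a rate has been taken
    if "rate" in operations and metric.endswith("_total"):
        metric = metric[: -len("_total")]
    name = f"{level}:{metric}:{operations}"
    if not METRIC_NAME.fullmatch(name):
        raise ValueError(f"invalid metric name: {name}")
    return name


print(recording_rule_name(["job", "method"], "http_requests_total", "rate5m"))
# job_method:http_requests:rate5m
print(recording_rule_name(["cluster"], "node_cpu", "ratio_rate5m"))
# cluster:node_cpu:ratio_rate5m
```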
                        

RED & USE Methods in PromQL

The RED method (Rate, Errors, Duration) monitors request-driven services. The USE method (Utilization, Saturation, Errors) monitors infrastructure resources. Here's how to implement both in PromQL:

# ═══════════════════════════════════════════
# RED METHOD — for services (API, web, microservices)
# ═══════════════════════════════════════════

# R — Rate: requests per second
sum by (service) (rate(http_requests_total[5m]))

# E — Errors: error rate as percentage
sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
* 100

# D — Duration: request latency (p99)
histogram_quantile(0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# ═══════════════════════════════════════════
# USE METHOD — for resources (CPU, memory, disk, network)
# ═══════════════════════════════════════════

# U — Utilization: fraction of resource being used
# CPU utilization (fraction of time busy, 0 to 1)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk utilization
1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)

# S — Saturation: how overloaded the resource is
# CPU saturation (load average / CPU count)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

# Memory saturation (swap-in + swap-out pages per second)
rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])

# Disk saturation (I/O queue depth)
rate(node_disk_io_time_weighted_seconds_total[5m])

# E — Errors: resource error events
# Disk errors
rate(node_disk_io_errs_total[5m])

# Network errors
rate(node_network_receive_errs_total[5m])
+ rate(node_network_transmit_errs_total[5m])

Exploring Data Collection & Metric Protocols

Metrics reach Prometheus and Mimir through various protocols and formats. Understanding each protocol's strengths helps you choose the right collection strategy for different infrastructure components.

StatsD & DogStatsD

StatsD is a lightweight UDP-based protocol originally created by Etsy for application-level metrics. Applications emit metrics as simple text messages over UDP, which a StatsD server aggregates before forwarding to a backend. DogStatsD is Datadog's extension adding tags (labels), histograms, service checks, and events.

# StatsD wire format: metric_name:value|type[|@sample_rate]
# Types: c=counter, g=gauge, ms=timer, h=histogram, s=set

# Counter — increment page views
page.views:1|c

# Gauge — current queue depth
queue.depth:42|g

# Timer — request duration in milliseconds
api.request.duration:320|ms

# Counter with sample rate (only 10% of calls actually send)
api.request.count:1|c|@0.1

# DogStatsD extension — adds tags (key:value pairs)
api.request.duration:320|ms|#method:GET,endpoint:/users,status:200

# DogStatsD histogram
api.response.size:2048|h|#service:checkout
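The wire format is simple enough to parse in a few lines. This Python sketch handles the fields shown above (name, value, type, optional `@sample_rate`, and DogStatsD `#tags`); the `parse_statsd` name and the dictionary it returns are illustrative, and real servers also handle multi-metric packets, signed gauge deltas, and other extensions.

```python
def parse_statsd(line: str) -> dict:
    """Parse 'name:value|type[|@rate][|#k:v,...]' into a dictionary."""
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"malformed StatsD line: {line!r}")
    metric = {"name": name, "value": float(fields[0]), "type": fields[1],
              "sample_rate": 1.0, "tags": {}}
    for field in fields[2:]:
        if field.startswith("@"):
            metric["sample_rate"] = float(field[1:])
        elif field.startswith("#"):  # DogStatsD tags
            for tag in field[1:].split(","):
                key, _, value = tag.partition(":")
                metric["tags"][key] = value
    return metric


print(parse_statsd("api.request.duration:320|ms|#method:GET,endpoint:/users,status:200"))
sampled = parse_statsd("api.request.count:1|c|@0.1")
# With 10% sampling, each received increment stands for 1 / 0.1 = 10 increments
print(sampled["value"] / sampled["sample_rate"])
```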

To integrate StatsD metrics into Prometheus/Mimir, use the StatsD Exporter or Grafana Alloy:

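// Grafana Alloy config (config.alloy): StatsD receiver
prometheus.exporter.statsd "default" {
  listen_udp          = "0.0.0.0:9125"
  // Mapping rules use the statsd_exporter YAML format (file shown below; placeholder path)
  mapping_config_path = "/etc/alloy/statsd_mapping.yml"
}

// Scrape the embedded exporter and forward its samples to Mimir
prometheus.scrape "statsd" {
  targets    = prometheus.exporter.statsd.default.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"
  }
}

# statsd_mapping.yml: convert StatsD metric names to Prometheus format
mappings:
  - match: "api.request.duration.*"
    name: "api_request_duration_seconds"  # the exporter converts ms timers to seconds
    observer_type: histogram
    histogram_options:
      buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
    labels:
      method: "$1"
  - match: "page.views"
    name: "page_views_total"
    labels:
      type: "page_view"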

OTLP (OpenTelemetry Protocol)

OTLP is the native protocol of OpenTelemetry, the CNCF standard for telemetry data. It transports metrics, logs, and traces in a single protocol using Protocol Buffers over gRPC or over HTTP (with binary protobuf or JSON encoding). OTLP is the recommended approach for new instrumentation.

# OpenTelemetry Collector config — receive OTLP and export to Mimir
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  resource:
    attributes:
      - key: cluster
        value: "production-us-east-1"
        action: upsert

exporters:
  prometheusremotewrite:
    endpoint: "https://mimir.example.com/api/v1/push"
    headers:
      X-Scope-OrgID: "tenant-1"
    resource_to_telemetry_conversion:
      enabled: true
    tls:
      insecure: false

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheusremotewrite]

OTLP sums and histograms carry an aggregation temporality: cumulative (a running total since a start time, like Prometheus counters) or delta (the change since the previous report). Gauges are a separate point type that records point-in-time values and has no temporality. Mimir can also ingest OTLP natively through its /otlp/v1/metrics endpoint.
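Prometheus-style storage expects cumulative counters, so delta points must be summed into running totals somewhere before they are stored (the OpenTelemetry Collector contrib distribution includes a `deltatocumulative` processor for this). A minimal Python sketch of the conversion, keyed by series identity and assuming points arrive in order; a real converter must also handle start timestamps, out-of-order points, and stale series. The class name is hypothetical.

```python
class DeltaToCumulative:
    """Accumulate delta-temporality points into cumulative totals per series."""

    def __init__(self):
        self._totals = {}

    def add(self, series_key, delta):
        total = self._totals.get(series_key, 0.0) + delta
        self._totals[series_key] = total
        return total


conv = DeltaToCumulative()
get_key = ("http.server.requests", (("method", "GET"),))
print([conv.add(get_key, d) for d in [5, 3, 0, 7]])  # [5.0, 8.0, 8.0, 15.0]
```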

                            
OTLP Advantage: Unlike Prometheus scraping (pull-based), OTLP is push-based. This makes it ideal for short-lived processes (serverless functions, batch jobs, CLI tools) that may terminate before a scraper can reach them.

Prometheus Exposition Format & Remote Write

The Prometheus exposition format is the text-based format that applications expose at their /metrics endpoint for scraping. Remote write is the protocol for pushing scraped data to remote storage backends like Mimir, Thanos, or Cortex.

# Prometheus text exposition format (at /metrics endpoint)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="201"} 342
http_requests_total{method="GET",status="404"} 17

# HELP http_request_duration_seconds Request latency histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 500
http_request_duration_seconds_bucket{le="0.05"} 900
http_request_duration_seconds_bucket{le="0.1"} 980
http_request_duration_seconds_bucket{le="0.5"} 1000
http_request_duration_seconds_bucket{le="1.0"} 1010
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 135.7
http_request_duration_seconds_count 1027

# HELP node_cpu_temperature_celsius Current CPU temperature
# TYPE node_cpu_temperature_celsius gauge
node_cpu_temperature_celsius{core="0"} 62.5
node_cpu_temperature_celsius{core="1"} 64.2

# Prometheus remote_write configuration — push to Mimir
# prometheus.yml
remote_write:
  - url: "https://mimir.example.com/api/v1/push"
    headers:
      X-Scope-OrgID: "tenant-1"
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      max_shards: 200
      min_backoff: 100ms
      max_backoff: 5s
    write_relabel_configs:
      # Drop Go runtime metrics (go_*) before sending to reduce series volume
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
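Emitting the text format by hand is mostly string formatting; the subtle part is escaping label values, where backslash, double quote, and newline must be written as `\\`, `\"`, and `\n`. A minimal Python sketch (the helper names are hypothetical; in practice a client library such as prometheus_client generates this output):

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value) -> str:
    """Render one sample line: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    body = ",".join(f'{k}="{escape_label_value(str(v))}"' for k, v in labels.items())
    return f"{name}{{{body}}} {value}"


print("# HELP http_requests_total Total number of HTTP requests")
print("# TYPE http_requests_total counter")
print(format_sample("http_requests_total", {"method": "GET", "status": "200"}, 1027))
print(format_sample("http_requests_total", {"path": 'say "hi"\n'}, 1))
# http_requests_total{path="say \"hi\"\n"} 1
```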

SNMP

SNMP (Simple Network Management Protocol) is the standard for monitoring network devices (routers, switches, firewalls, printers). The SNMP Exporter translates SNMP OID (Object Identifier) data into Prometheus metrics:

# generator.yml: input to the SNMP Exporter generator, which produces snmp.yml
auths:
  public_v2:
    community: public
    version: 2

modules:
  if_mib:
    walk:
      - 1.3.6.1.2.1.2.2       # ifTable: interface statistics
      - 1.3.6.1.2.1.31.1.1    # ifXTable: extended interface stats
    lookups:
      - source_indexes: [ifIndex]
        lookup: ifAlias         # use interface alias as label
      - source_indexes: [ifIndex]
        lookup: ifDescr         # use interface description as label
    overrides:
      ifSpeed:
        type: gauge
      ifHighSpeed:
        type: gauge

  cisco_device:
    walk:
      - 1.3.6.1.4.1.9.9.109   # Cisco CPU utilization
      - 1.3.6.1.4.1.9.9.48    # Cisco memory pool

# Prometheus scrape config for SNMP targets
scrape_configs:
  - job_name: 'snmp_network'
    scrape_interval: 60s
    scrape_timeout: 30s
    static_configs:
      - targets:
          - 10.0.0.1    # core-switch-01
          - 10.0.0.2    # core-switch-02
          - 10.0.0.10   # firewall-01
    metrics_path: /snmp
    params:
      module: [if_mib]
      auth: [public_v2]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116
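The three relabel rules rewrite each target so that Prometheus scrapes the exporter instead of the device: the device address moves into the `target` URL parameter, is copied to `instance`, and `__address__` is replaced with the exporter's host:port. The Python sketch below replays those steps for one target to show the resulting scrape URL; it models only the default `replace` action, the function name is hypothetical, and Prometheus may order the query parameters differently.

```python
from urllib.parse import urlencode


def snmp_scrape_url(device, exporter="snmp-exporter:9116",
                    module="if_mib", auth="public_v2"):
    """Replay the relabel_configs above for one static target."""
    labels = {"__address__": device}
    # 1. source_labels [__address__] -> target_label __param_target
    labels["__param_target"] = labels["__address__"]
    # 2. source_labels [__param_target] -> target_label instance
    labels["instance"] = labels["__param_target"]
    # 3. no source labels: the fixed replacement is written to __address__
    labels["__address__"] = exporter
    # __param_<name> labels become URL parameters next to the static 'params'
    params = {"module": module, "auth": auth, "target": labels["__param_target"]}
    url = f"http://{labels['__address__']}/snmp?{urlencode(params)}"
    return url, labels["instance"]


print(snmp_scrape_url("10.0.0.1"))
# ('http://snmp-exporter:9116/snmp?module=if_mib&auth=public_v2&target=10.0.0.1', '10.0.0.1')
```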

Understanding Data Storage Architectures

Time-series storage engines have evolved significantly over the past decade. Understanding the architectural differences between Graphite, Prometheus, and Mimir helps you choose the right solution and optimize performance.

Graphite Architecture

Graphite (2006) was one of the first modern time-series databases. Its architecture consists of three components:

Carbon — A daemon that receives metrics over TCP/UDP (line protocol: metric.path value timestamp) and writes them to Whisper files
Whisper — A fixed-size, round-robin database file format. Each metric gets its own file on disk with pre-allocated space for multiple retention levels
Graphite-Web — A Django web application providing the query API and basic dashboard rendering

Graphite's dot-delimited naming (servers.web01.cpu.user) predates the label-based model. While simple, it encodes every dimension as a fixed position in the path, which makes high-dimensionality queries and ad-hoc grouping difficult. Each unique metric path also creates a separate Whisper file, so a large number of metrics turns into many small, scattered writes on disk-based systems.
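The difference between the two naming models is easiest to see side by side. In this sketch, a hypothetical positional template turns a dot-delimited Graphite path into a Prometheus-style metric name plus labels; with Graphite alone, grouping by host would instead require wildcarding path positions. The template syntax is an assumption for illustration, not a real tool's format.

```python
def graphite_to_labels(path, template):
    """Map 'servers.web01.cpu.user' onto a template like 'servers.<host>.cpu.<mode>'."""
    parts, slots = path.split("."), template.split(".")
    if len(parts) != len(slots):
        raise ValueError("path does not fit template")
    name_parts, labels = [], {}
    for part, slot in zip(parts, slots):
        if slot.startswith("<") and slot.endswith(">"):
            labels[slot[1:-1]] = part   # positional segment becomes a label
        elif slot == part:
            name_parts.append(part)     # literal segment stays in the name
        else:
            raise ValueError(f"{part!r} does not match {slot!r}")
    return "_".join(name_parts), labels


print(graphite_to_labels("servers.web01.cpu.user", "servers.<host>.cpu.<mode>"))
# ('servers_cpu', {'host': 'web01', 'mode': 'user'})
```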

                            
Legacy Note: Graphite is still widely deployed in organizations that adopted it before Prometheus existed. Many Grafana dashboards query Graphite backends. The migration path is typically: Graphite → Prometheus (scraping) → Mimir (long-term, multi-tenant).

Prometheus Architecture

Prometheus (2012, open-sourced 2015) introduced the label-based data model and a purpose-built time-series database (TSDB) designed for high ingestion rates and efficient queries:

Prometheus TSDB Architecture

flowchart TB
    subgraph Ingestion["Ingestion Path"]
        Scrape[Scrape Manager<br/>pulls /metrics]
        WAL[Write-Ahead Log<br/>append-only, crash-safe]
        Head[Head Block<br/>in-memory, last 2h]
    end

    subgraph Compaction["Compaction & Storage"]
        Persist["Persist to Disk<br/>(every 2h)"]
        Block1["Block (2h)<br/>index + chunks"]
        Block2["Block (2h)<br/>index + chunks"]
        Compact["Compactor<br/>merge small blocks"]
        BigBlock["Compacted Block<br/>(longer range)"]
    end

    subgraph Query["Query Path"]
        QEngine[Query Engine<br/>evaluates PromQL]
        Merge[Merge Results<br/>head + blocks]
    end

    Scrape --> WAL
    WAL --> Head
    Head --> Persist
    Persist --> Block1
    Persist --> Block2
    Block1 --> Compact
    Block2 --> Compact
    Compact --> BigBlock
    QEngine --> Head
    QEngine --> Block1
    QEngine --> BigBlock
    Head --> Merge
    Block1 --> Merge
    BigBlock --> Merge
    Merge --> QEngine

Key components of Prometheus's local TSDB:

WAL (Write-Ahead Log) — All incoming samples are immediately written to a sequential WAL on disk. This ensures durability even if Prometheus crashes before flushing to blocks. The WAL is replayed on startup to recover in-memory state.
Head Block — The most recent ~2 hours of data lives in memory for fast writes and queries. Samples are appended to compressed in-memory chunks.
Persistent Blocks — Every 2 hours, the head block is "cut" into an immutable, on-disk block. Each block contains its own index (label → series mapping) and compressed chunk files.
Compaction — A background compactor merges small blocks into larger blocks, improving query efficiency and applying retention deletes. The maximum block duration is typically 31 days (or 10% of retention, whichever is smaller).
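The compaction levels follow a simple rule: block ranges grow by a factor of 3 from the 2-hour minimum and stop at the maximum block duration. This Python sketch computes those levels for a given retention, assuming the defaults described above (2h minimum, 31-day cap, 10% of retention); it mirrors Prometheus's exponential block ranges in spirit but is not its code, and the lower clamp is an assumption of the sketch.

```python
HOUR = 3600


def block_ranges(retention_s, min_block_s=2 * HOUR, steps=10, factor=3):
    """Block durations (seconds) the compactor may produce for a retention period."""
    max_block = min(31 * 24 * HOUR, retention_s // 10)
    max_block = max(max_block, min_block_s)  # sketch assumption: never below 2h
    ranges, size = [], min_block_s
    for _ in range(steps):
        if size > max_block:
            break
        ranges.append(size)
        size *= factor
    return ranges


print([r // HOUR for r in block_ranges(15 * 24 * HOUR)])   # [2, 6, 18]
print([r // HOUR for r in block_ranges(365 * 24 * HOUR)])  # [2, 6, 18, 54, 162, 486]
```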

# Prometheus data directory structure
/prometheus/data/
├── wal/
│   ├── 00000001    # WAL segment files (sequential append)
│   ├── 00000002
│   └── 00000003
├── 01HQXYZ.../     # Block directory (ULID-named)
│   ├── meta.json   # Block metadata (time range, stats)
│   ├── index       # Label → series postings (inverted index)
│   ├── chunks/
│   │   └── 000001  # Compressed time-series samples
│   └── tombstones  # Deletion markers
├── 01HQABC.../     # Another block (older time range)
│   ├── meta.json
│   ├── index
│   └── chunks/
└── lock            # Process lock file

Limitations of standalone Prometheus: Single-node (no horizontal scaling), limited retention (typically 15-30 days on local disk), no multi-tenancy, no global query view across multiple Prometheus instances. These limitations are exactly what Mimir solves.

Mimir Architecture

Grafana Mimir is a horizontally-scalable, multi-tenant, long-term storage backend for Prometheus metrics. It accepts data via Prometheus remote write (and OTLP) and provides a Prometheus-compatible query API. Mimir can store years of data cost-effectively using object storage while handling millions of active series.
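Because the query API is Prometheus-compatible, any HTTP client can run PromQL against Mimir. This Python sketch only builds the request, assuming Mimir's default `/prometheus` HTTP prefix in front of the standard `/api/v1/query_range` endpoint and a tenant passed in the `X-Scope-OrgID` header; the host name and tenant are the placeholder values used elsewhere in this article, and the helper name is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def mimir_range_query(base_url, tenant, promql, start, end, step="60s"):
    """Build (but do not send) a Prometheus-compatible range query for Mimir."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    url = f"{base_url}/prometheus/api/v1/query_range?{params}"
    return Request(url, headers={"X-Scope-OrgID": tenant})


req = mimir_range_query(
    "https://mimir.example.com", "tenant-1",
    "sum by (job) (rate(http_requests_total[5m]))",
    start=1718443800, end=1718447400,
)
print(req.full_url)
print(req.get_header("X-scope-orgid"))  # urllib normalizes header-name case
# urllib.request.urlopen(req) would send it to a real cluster
```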

Grafana Mimir Architecture

flowchart TB
    subgraph Clients["Data Sources"]
        Prom[Prometheus<br/>remote_write]
        Alloy[Grafana Alloy<br/>remote_write]
        OTel[OTel Collector<br/>OTLP]
    end

    subgraph WritePath["Write Path"]
        Dist[Distributor<br/>validates, shards,<br/>replicates to ingesters]
        Ing1[Ingester 1<br/>builds TSDB blocks<br/>in memory]
        Ing2[Ingester 2<br/>builds TSDB blocks<br/>in memory]
        Ing3[Ingester 3<br/>builds TSDB blocks<br/>in memory]
    end

    subgraph ReadPath["Read Path"]
        QF[Query Frontend<br/>splits queries,<br/>caches results]
        QS[Query Scheduler<br/>fair queuing<br/>across tenants]
        Q1[Querier<br/>merges ingester +<br/>store-gateway data]
    end

    subgraph LongTerm["Long-Term Storage"]
        SG[Store Gateway<br/>lazy-loads block<br/>indexes from storage]
        Obj[(Object Storage<br/>S3 / GCS / Azure<br/>TSDB Blocks)]
    end

    subgraph Background["Background Processes"]
        Comp[Compactor<br/>merges blocks,<br/>deduplicates,<br/>applies retention]
        Ruler[Ruler<br/>evaluates recording<br/>& alerting rules]
    end

    Prom --> Dist
    Alloy --> Dist
    OTel --> Dist
    Dist --> Ing1
    Dist --> Ing2
    Dist --> Ing3
    Ing1 --> Obj
    Ing2 --> Obj
    Ing3 --> Obj
    QF --> QS
    QS --> Q1
    Q1 --> Ing1
    Q1 --> Ing2
    Q1 --> Ing3
    Q1 --> SG
    SG --> Obj
    Comp --> Obj
    Ruler --> Q1

Mimir's microservices architecture separates concerns cleanly:

Distributor

The entry point for all writes. Distributors validate incoming samples (timestamps, label names and lengths, per-tenant ingestion rate limits; per-tenant series limits are enforced later by the ingesters), hash each series to determine which ingesters should receive it (consistent hashing via a ring), and replicate writes to the configured replication factor (default: 3).
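A simplified model of the ring: every ingester owns several tokens on a 32-bit hash ring, the hash of a tenant plus series picks a position, and the distributor walks clockwise collecting distinct ingesters until it has replication-factor owners. This Python sketch uses CRC32 and seeded random tokens purely for illustration; Mimir's actual hash function, token allocation, and zone awareness differ, and the class is hypothetical.

```python
import bisect
import random
import zlib


class HashRing:
    def __init__(self, ingesters, tokens_per_ingester=128, seed=42):
        rng = random.Random(seed)
        # Each ingester owns several random positions (tokens) on the ring
        self._ring = sorted(
            (rng.getrandbits(32), name)
            for name in ingesters
            for _ in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in self._ring]

    def owners(self, tenant, series, replication_factor=3):
        """Walk clockwise from the series hash, collecting distinct ingesters."""
        key = zlib.crc32(f"{tenant}/{series}".encode())
        start = bisect.bisect_left(self._tokens, key)
        owners = []
        for step in range(len(self._ring)):
            _, name = self._ring[(start + step) % len(self._ring)]
            if name not in owners:
                owners.append(name)
                if len(owners) == replication_factor:
                    break
        return owners


ring = HashRing([f"ingester-{i}" for i in range(5)])
print(ring.owners("tenant-1", 'http_requests_total{job="api-server",method="GET"}'))
```

With a replication factor of 3, a write is accepted once a quorum (2 of the 3 owners) acknowledges it, so one slow or failed ingester does not block ingestion.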

Ingester

Ingesters hold recent data (typically 2 hours) in memory using an embedded Prometheus TSDB. They serve queries for recent data directly from memory (much faster than object storage). Periodically, they flush completed TSDB blocks to object storage. Ingesters use a WAL for crash recovery and a hash ring for ownership coordination.

Store Gateway

Store gateways provide access to historical (flushed) TSDB blocks in object storage. They lazy-load compact block index headers into memory, enabling efficient range queries over months or years of data without downloading entire blocks. Store gateways use a hash ring to shard blocks across instances, optionally with per-tenant shuffle sharding.

Compactor

The compactor runs as a background process that merges small blocks into larger ones, deduplicates samples (from replication), and enforces retention policies. It operates directly on object storage, reading blocks, merging them, writing the result, and deleting the originals.

Query Frontend

An optional (but strongly recommended) component that sits in front of queriers. It splits large time-range queries into smaller sub-queries for parallelism, caches query results (in Redis/Memcached), aligns query ranges to caching boundaries, and retries failed sub-queries.
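Splitting by time is what lets the frontend parallelize and cache: a multi-day range query becomes day-aligned sub-queries whose results can be cached independently. This sketch assumes a 24-hour split interval (Mimir's `-query-frontend.split-queries-by-interval` flag, 24h by default), aligns boundaries to multiples of the interval since the Unix epoch, and treats ranges as half-open; it ignores the step alignment the real frontend applies.

```python
DAY = 24 * 3600


def split_by_interval(start, end, interval=DAY):
    """Split [start, end) into sub-ranges that never cross an interval boundary."""
    if start >= end:
        return [(start, end)]
    ranges, cur = [], start
    while cur < end:
        boundary = (cur // interval + 1) * interval
        sub_end = min(boundary, end)
        ranges.append((cur, sub_end))
        cur = sub_end
    return ranges


start = 1718474400  # 2024-06-15 18:00 UTC
for sub_range in split_by_interval(start, start + 2 * DAY):
    print(sub_range)
```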

Query Scheduler

Provides fair queuing across tenants in multi-tenant deployments. Without the scheduler, a single tenant running expensive queries could starve other tenants of querier capacity.
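Fairness here means round-robin across tenants rather than first-in, first-out across all queries. A minimal Python sketch of that idea, with one queue per tenant and a rotation over tenants that have work; Mimir's scheduler adds per-tenant queue limits and limits on how many queriers a tenant can use, which this hypothetical class omits.

```python
from collections import OrderedDict, deque


class FairQueue:
    def __init__(self):
        self._queues = OrderedDict()  # tenant -> pending queries, in rotation order

    def enqueue(self, tenant, query):
        self._queues.setdefault(tenant, deque()).append(query)

    def dequeue(self):
        """Take one query from the tenant at the front of the rotation."""
        if not self._queues:
            return None
        tenant, queue = next(iter(self._queues.items()))
        query = queue.popleft()
        del self._queues[tenant]
        if queue:
            self._queues[tenant] = queue  # tenant moves to the back of the rotation
        return tenant, query


fq = FairQueue()
for q in ["a1", "a2", "a3"]:
    fq.enqueue("tenant-a", q)  # tenant-a floods the queue
fq.enqueue("tenant-b", "b1")
print([fq.dequeue()[1] for _ in range(4)])  # ['a1', 'b1', 'a2', 'a3']
```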

Ruler

Evaluates recording rules and alerting rules on behalf of tenants. Each tenant's rules are evaluated independently, using the same PromQL engine as ad-hoc queries. Results from recording rules are written back as new series; alert results are forwarded to Alertmanager.

# Minimal Mimir configuration (mimir.yaml)
target: all  # Run all components in single binary (monolithic mode)

multitenancy_enabled: true

server:
  http_listen_port: 8080
  grpc_listen_port: 9095

distributor:
  ring:
    kvstore:
      store: memberlist

ingester:
  ring:
    replication_factor: 3  # needs at least 3 ingesters, i.e. 3 replicas in monolithic mode
    kvstore:
      store: memberlist

blocks_storage:
  backend: s3
  s3:
    endpoint: s3.amazonaws.com
    bucket_name: mimir-blocks
    region: us-east-1
  tsdb:
    dir: /data/ingester
    block_ranges_period: [2h]
    retention_period: 24h  # how long ingesters keep blocks locally after upload

store_gateway:
  sharding_ring:
    replication_factor: 1

compactor:
  data_dir: /data/compactor
  sharding_ring:
    kvstore:
      store: memberlist

limits:
  max_global_series_per_user: 1500000
  ingestion_rate: 200000           # samples/sec per tenant
  ingestion_burst_size: 400000
  compactor_blocks_retention_period: 365d  # 1 year retention

ruler:
  rule_path: /data/ruler
  alertmanager_url: http://alertmanager:9093

Architecture

Prometheus vs Mimir — When to Choose Each

Aspect	Prometheus (standalone)	Grafana Mimir
Scale	Single node, ~10M active series	Horizontally scalable, billions of series
Retention	Days to weeks (local disk)	Months to years (object storage)
Multi-tenancy	No	Native per-tenant isolation
HA	Dual Prometheus + dedup	Built-in replication (RF=3)
Global View	Requires federation or Thanos	Single query endpoint across all data
Operational Cost	Low (single binary)	Medium (microservices + object storage)
Best For	Small teams, single cluster	Platform teams, multi-cluster, enterprise


Using Exemplars in Grafana

Exemplars are a powerful feature that bridges the gap between metrics and traces. An exemplar is a specific trace ID (or other identifying information) attached to a metric sample, representing a concrete example of a request that contributed to that metric value. They answer the question: "This p99 latency spiked — which specific request was slow?"

Linking Metrics to Traces

When your application records a histogram observation (e.g., request duration), it can simultaneously attach the trace ID of that specific request as an exemplar. Later, when viewing a latency spike on a dashboard, you can click the exemplar markers to jump directly to the trace that caused the spike.

# Python example: recording an exemplar with prometheus_client
import time

from opentelemetry import trace
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['method', 'endpoint', 'status'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def handle_request(request):
    # Time the request (process() stands in for your application logic)
    start = time.perf_counter()
    response = process(request)
    duration = time.perf_counter() - start

    # Attach the trace ID only when a valid span is active
    span_context = trace.get_current_span().get_span_context()
    exemplar = None
    if span_context.is_valid:
        exemplar = {'trace_id': format(span_context.trace_id, '032x')}

    # Record metric WITH exemplar (trace_id links to the trace);
    # exemplars are only exposed when scraped in OpenMetrics format
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).observe(duration, exemplar=exemplar)

    return response

// Go example — recording exemplars with prometheus/client_golang
import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/trace"
)

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency histogram",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "endpoint", "status"},
)

func init() {
    // Register the histogram. Exemplars are only exposed in OpenMetrics format, so
    // serve /metrics with promhttp.HandlerFor(prometheus.DefaultGatherer,
    // promhttp.HandlerOpts{EnableOpenMetrics: true}).
    prometheus.MustRegister(requestDuration)
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // ... process request ...
    duration := time.Since(start).Seconds()

    // Get trace ID from current span context
    spanCtx := trace.SpanFromContext(r.Context()).SpanContext()
    observer := requestDuration.WithLabelValues(r.Method, r.URL.Path, "200") // status hardcoded for brevity

    // Record with exemplar only when a trace is active
    if spanCtx.HasTraceID() {
        observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
            duration,
            prometheus.Labels{"trace_id": spanCtx.TraceID().String()},
        )
        return
    }
    observer.Observe(duration)
}

Configuring Exemplar Storage

Exemplars must be enabled in both the storage backend (Prometheus/Mimir) and Grafana:

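# Prometheus — enable exemplar storage
# Requires starting Prometheus with: --enable-feature=exemplar-storage
# prometheus.yml
storage:
  exemplars:
    max_exemplars: 100000  # in-memory circular buffer size

# Mimir — exemplar ingestion is disabled by default (the limit defaults to 0)
# Enable it through limits; exemplars are kept in ingester memory
limits:
  max_global_exemplars_per_user: 100000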

# Grafana datasource provisioning — enable exemplars
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    url: http://mimir-query-frontend:8080/prometheus
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo  # Link to your Tempo datasource
          urlDisplayLabel: "View Trace"

                            
                            The Correlation Flow: Dashboard shows latency spike → click exemplar dot → Grafana reads trace_id from exemplar → opens Tempo trace view → you see the exact slow request with all its spans, including which downstream service caused the delay.
                        

Querying Exemplars in PromQL & Grafana

In Grafana, exemplars appear as small dots overlaid on time-series graphs. Enable them per query:

Open panel edit mode
In the Prometheus query editor's Options section, toggle "Exemplars" on
Optionally add a filter query to limit which exemplars are shown
Make sure the data source's exemplar settings map the trace ID label (usually trace_id or traceID) to your tracing data source

# PromQL query that works with exemplars
# The exemplar data is attached to the underlying histogram buckets
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="api-server"}[5m]))
)

# Exemplar API query (used internally by Grafana)
# GET /api/v1/query_exemplars?query=http_request_duration_seconds_bucket&start=...&end=...

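# Response includes exemplar data (abridged):
# {
#   "status": "success",
#   "data": [{
#     "seriesLabels": {"__name__": "http_request_duration_seconds_bucket", "method": "GET", "status": "200", "le": "1.0"},
#     "exemplars": [
#       {"labels": {"trace_id": "abc123def456"}, "value": "0.842", "timestamp": 1718447400.0}
#     ]
#   }]
# }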
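To show how a client such as Grafana turns that response into trace links, here is a minimal Python sketch. It walks the `data` array of a `query_exemplars` response and collects trace IDs. It assumes the trace ID label is named `trace_id`, matching the datasource config above, and it works on an already-decoded JSON document. The HTTP call itself is left out.

```python
import json
from typing import Iterator, Tuple


def iter_trace_ids(response: dict, label: str = "trace_id") -> Iterator[Tuple[float, float, str]]:
    """Yield (timestamp, value, trace_id) for every exemplar that carries `label`."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    for series in response.get("data", []):
        for ex in series.get("exemplars", []):
            trace_id = ex.get("labels", {}).get(label)
            if trace_id is not None:
                # The API encodes sample values as strings.
                yield float(ex["timestamp"]), float(ex["value"]), trace_id


sample = json.loads("""
{"status": "success", "data": [{
  "seriesLabels": {"__name__": "http_request_duration_seconds_bucket", "le": "1.0"},
  "exemplars": [
    {"labels": {"trace_id": "abc123def456"}, "value": "0.842", "timestamp": 1718447400.0},
    {"labels": {"span_id": "ffee"}, "value": "0.120", "timestamp": 1718447410.0}
  ]}]}
""")

# Slowest exemplar first: a good candidate to open in Tempo.
for ts, value, trace_id in sorted(iter_trace_ids(sample), key=lambda e: e[1], reverse=True):
    print(f"{trace_id} took {value:.3f}s at {ts:.0f}")
```

The second exemplar is skipped because it has no `trace_id` label.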

                            
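                            Important: Exemplars are attached to counters and histogram buckets, not gauges, and they are only exposed when the target is scraped in OpenMetrics format. Prometheus keeps them in a fixed-size in-memory circular buffer. Mimir keeps them in ingester memory up to the per-tenant max_global_exemplars_per_user limit. In both cases they are not written to long-term blocks, so expect exemplars for recent data only. Very high-throughput services should sample which requests get exemplars rather than attaching one to every observation.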
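One simple sampling policy attaches an exemplar to at most one observation per label set per time window. The sketch below is a hypothetical helper: `ExemplarSampler` is not part of prometheus_client. It returns either an exemplar dict or `None`, and either value can be passed to `Histogram.observe(duration, exemplar=...)`. A time-based limit is only one option. Probabilistic sampling and favoring slow requests are common alternatives.

```python
import time
from typing import Callable, Dict, Hashable, Optional


class ExemplarSampler:
    """Allow at most one exemplar per key (e.g. a label set) every min_interval_s seconds.

    Not thread-safe; guard calls with a lock in multi-threaded servers.
    """

    def __init__(self, min_interval_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._last_emitted: Dict[Hashable, float] = {}

    def maybe_exemplar(self, key: Hashable, trace_id: str) -> Optional[Dict[str, str]]:
        now = self.clock()
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.min_interval_s:
            return None  # too soon: record the observation without an exemplar
        self._last_emitted[key] = now
        return {"trace_id": trace_id}


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
sampler = ExemplarSampler(min_interval_s=10.0, clock=lambda: fake_now[0])
key = ("GET", "/api/v1/checkout", "200")
print(sampler.maybe_exemplar(key, "trace-a"))  # {'trace_id': 'trace-a'}
fake_now[0] = 3.0
print(sampler.maybe_exemplar(key, "trace-b"))  # None
fake_now[0] = 12.0
print(sampler.maybe_exemplar(key, "trace-c"))  # {'trace_id': 'trace-c'}
```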
                        

Best Practices

Operating Prometheus and Mimir at scale requires disciplined metric design, thoughtful recording rules, and proactive cardinality management. These practices prevent the most common production failures.

Cardinality Management

Cardinality is the number of unique time series in your system. Each unique combination of metric name + label values creates a separate time series. High cardinality is one of the most common causes of Prometheus/Mimir performance problems and cost overruns.

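# Check current cardinality in Prometheus/Mimir
# Total active series (Prometheus self-metric; in Mimir, sum cortex_ingester_memory_series
# across ingesters and divide by the replication factor)
prometheus_tsdb_head_series

# Series creation rate per second (should be stable, not growing)
rate(prometheus_tsdb_head_series_created_total[5m])

# Top metrics by series count (Mimir-specific)
# Use /prometheus/api/v1/cardinality/label_values?label_names[]=__name__ (series per metric)
# or /prometheus/api/v1/cardinality/label_names (distinct values per label);
# both require cardinality analysis to be enabled for the tenant.
# The Mimir dashboard "Tenants" panel shows similar data.

# Find high-cardinality metrics with PromQL (expensive: touches every series)
count by (__name__) ({__name__=~".+"}) > 1000

# Specific metric cardinality check: total series, then distinct values per label
count(http_requests_total)
count(count by (method) (http_requests_total))
count(count by (status_code) (http_requests_total))
count(count by (path) (http_requests_total))  # ← likely the culprit!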
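The same analysis can be done offline on a list of label sets, for example one exported from the series API. The hypothetical helper below counts unique series and the distinct values of each label. The per-label figure is what `count(count by (label) (...))` computes in PromQL. It then ranks labels by how many values they contribute.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cardinality_report(series: Iterable[Dict[str, str]]) -> Tuple[int, List[Tuple[str, int]]]:
    """Return (unique series count, [(label, distinct values)] sorted descending)."""
    unique = set()
    values_per_label: Dict[str, set] = defaultdict(set)
    for labels in series:
        # A series is identified by its full, order-independent label set.
        unique.add(frozenset(labels.items()))
        for name, value in labels.items():
            values_per_label[name].add(value)
    ranked = sorted(((name, len(vals)) for name, vals in values_per_label.items()),
                    key=lambda item: item[1], reverse=True)
    return len(unique), ranked


# Hypothetical data: 2 methods x 2 status codes x 250 user-specific paths.
series = [
    {"__name__": "http_requests_total", "method": m, "status_code": s,
     "path": f"/users/{uid}/orders"}
    for m in ("GET", "POST")
    for s in ("200", "500")
    for uid in range(1, 251)
]
total, ranked = cardinality_report(series)
print(total)      # 1000
print(ranked[0])  # ('path', 250)
```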

Anti-Pattern

Cardinality Explosion — Real-World Example

A team added a path label containing the full URL path to their HTTP metrics. With 50,000 unique paths (mostly the same endpoints with different user IDs, like /users/123/orders), the worst case was:

5 methods × 10 status codes × 50,000 paths = 2,500,000 series — for a single service!
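The 2,500,000 figure is a worst-case upper bound: the product of the distinct values of every label on the metric. A one-line sketch lets you check a proposed label set before shipping it. Real cardinality is usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(distinct_values_per_label: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values_per_label.values())


print(worst_case_series({"method": 5, "status_code": 10, "path": 50_000}))  # 2500000
print(worst_case_series({"method": 5, "status_code": 10, "route": 100}))    # 5000
```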

Fix: Use parameterized paths (/users/:id/orders) with at most 50-100 unique values, and keep the full path in log lines or trace span attributes rather than in metric labels.

# ❌ BAD — unbounded path label
http_requests_total{method="GET", path="/users/12345/orders/67890"}

# ✅ GOOD — parameterized route label
http_requests_total{method="GET", route="/users/:id/orders/:order_id"}
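If the web framework does not expose the matched route template, one fallback is to normalize paths before using them as label values. The sketch below is a hypothetical helper that replaces purely numeric and UUID-like segments with `:id`. It cannot recover meaningful names like `:order_id`, so prefer the router's own template whenever one is available.

```python
import re

# Canonical 8-4-4-4-12 hex UUID segment.
_UUID = re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with ':id' to keep the label bounded."""
    path = path.split("?", 1)[0]  # query strings are unbounded too
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.fullmatch(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/users/12345/orders/67890"))  # /users/:id/orders/:id
print(normalize_path("/api/v1/checkout?cart=42"))   # /api/v1/checkout
```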


Label Naming Conventions

Consistent label naming across your organization ensures queries are portable and dashboards work across services:

# Recommended label conventions (common Prometheus/Kubernetes names; OpenTelemetry attributes such as http.route map onto these)
# ═══════════════════════════════════════════
# Infrastructure labels (added by scrape config or Alloy)
cluster: "production-us-east-1"      # Kubernetes cluster name
namespace: "checkout"                 # Kubernetes namespace
pod: "checkout-7b4f8c-xk2p9"        # Pod name (high-cardinality but useful)
node: "ip-10-0-1-42"                 # Node name
container: "api"                      # Container name

# Service labels (added automatically by Prometheus to every scrape target)
job: "checkout-api"                   # Scrape job / service identifier
instance: "10.0.1.42:8080"           # host:port of the target

# Application labels (added in metric definition)
method: "GET"                         # HTTP method (GET, POST, PUT, DELETE)
status_code: "200"                    # HTTP status code
route: "/api/v1/checkout"            # Parameterized route (NOT full URL)
handler: "CreateOrder"                # Internal handler/function name

# ═══════════════════════════════════════════
# AVOID these patterns:
# ═══════════════════════════════════════════
# ❌ env: "prod-us-east-1"           → split into env + region
# ❌ service_version: "v2.3.1-rc4"  → too many values (use info metric instead)
# ❌ error_message: "connection..."  → unbounded string values
# ❌ user_id: "usr_abc123"           → millions of values
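Conventions like these are easier to keep when they are checked automatically, for example in CI against the label names your instrumentation declares. The rules below are a hypothetical, minimal encoding of the AVOID list above plus the `path` anti-pattern. The 128-character value limit is an arbitrary assumption. The helper expects a label set without `__name__`. Extend the deny list and value checks to match your own conventions.

```python
import re
from typing import Dict, List

# Hypothetical deny list mirroring the "AVOID" examples and the path anti-pattern.
DENIED_LABELS = {
    "user_id": "millions of values",
    "error_message": "unbounded string values",
    "service_version": "use an info metric instead",
    "path": "use a parameterized 'route' label",
}
LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")  # classic Prometheus label name syntax


def lint_labels(labels: Dict[str, str], max_value_len: int = 128) -> List[str]:
    """Return human-readable problems found in one metric's label set (excluding __name__)."""
    problems = []
    for name, value in labels.items():
        if not LABEL_NAME.fullmatch(name):
            problems.append(f"{name}: invalid label name")
        elif name.startswith("__"):
            problems.append(f"{name}: names starting with '__' are reserved")
        if name in DENIED_LABELS:
            problems.append(f"{name}: {DENIED_LABELS[name]}")
        if len(value) > max_value_len:
            problems.append(f"{name}: value longer than {max_value_len} chars")
    return problems


print(lint_labels({"method": "GET", "route": "/api/v1/checkout"}))  # []
print(lint_labels({"user_id": "usr_abc123"}))  # ['user_id: millions of values']
```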

Recording Rules in Production

Production recording rules should be organized into groups by purpose and evaluated at appropriate intervals:

# recording-rules.yaml — Production-grade recording rules
groups:
  # ─── SLI Recording Rules (foundation for SLOs) ───
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Availability SLI: proportion of successful requests
      - record: sli:http_requests:availability_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code!~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

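      # Latency SLI: proportion of requests faster than threshold
      # le must match an existing bucket boundary (0.25 exists in the example
      # buckets above and in prometheus.DefBuckets; 0.3 does not)
      - record: sli:http_requests:latency_rate5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_bucket{le="0.25"}[5m]))
          /
          sum by (job) (rate(http_request_duration_seconds_count[5m]))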

  # ─── Cluster-Level Aggregations ───
  - name: cluster_aggregations
    interval: 60s
    rules:
      - record: cluster:node_cpu:sum_rate5m
        expr: sum by (cluster) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

      - record: cluster:node_memory:utilization
        expr: |
          1 - sum by (cluster) (node_memory_MemAvailable_bytes)
              / sum by (cluster) (node_memory_MemTotal_bytes)

  # ─── Cost Attribution (for multi-tenant chargeback) ───
  - name: cost_attribution
    interval: 300s  # 5 minutes is sufficient for cost data
    rules:
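      # container!="" drops cAdvisor's pod- and cgroup-level aggregate series,
      # which would otherwise double-count usage
      - record: namespace:container_cpu_usage:sum_rate5m
        expr: sum by (namespace, cluster) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

      - record: namespace:container_memory_working_set_bytes:sum
        expr: sum by (namespace, cluster) (container_memory_working_set_bytes{container!=""})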
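To make the two SLI ratios concrete, the sketch below computes them from counter increases over one window. These are the same quantities `rate()` divides by the window length, and that length cancels in the ratio. The numbers are made up for illustration. The latency SLI can only be evaluated at an `le` value that exists as a bucket boundary, so the helper refuses any other threshold.

```python
from typing import Dict


def availability_sli(total: float, errors_5xx: float) -> float:
    """Share of requests that were not 5xx."""
    return (total - errors_5xx) / total


def latency_sli(bucket_increase: Dict[float, float], total: float, threshold_s: float) -> float:
    """Share of requests at or below threshold_s, read from cumulative `le` buckets."""
    if threshold_s not in bucket_increase:
        raise KeyError(f"no bucket with le={threshold_s}; pick an existing boundary")
    return bucket_increase[threshold_s] / total


# Hypothetical 5-minute increases for one job (buckets are cumulative).
buckets = {0.1: 8_200, 0.25: 9_500, 0.5: 9_850, 1.0: 9_990, float("inf"): 10_000}
print(availability_sli(total=10_000, errors_5xx=12))         # 0.9988
print(latency_sli(buckets, total=10_000, threshold_s=0.25))  # 0.95
```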

Alerting Rules

Well-designed alerting rules reduce noise and ensure on-call engineers receive actionable notifications:

# alerting-rules.yaml — Production alerting rules
groups:
  - name: service_alerts
    rules:
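      # ─── SLO-Based Alert (Multi-Window Burn Rate, fast-burn tier) ───
      # Fires when the error budget burns at more than 14.4x the sustainable rate
      # over both a long (1h) and a short (5m) window. For a 99.9% SLO, a 14.4x
      # burn for 1h consumes 2% of a 30-day error budget. Slower tiers
      # (e.g. 6x over 6h) would be added as separate rules.
      - alert: SLOErrorBudgetBurn
        expr: |
          (
            1 - (
              sum by (job) (rate(http_requests_total{job="checkout-api", status_code!~"5.."}[1h]))
              /
              sum by (job) (rate(http_requests_total{job="checkout-api"}[1h]))
            )
          ) > (14.4 * 0.001)
          and
          (1 - sli:http_requests:availability_rate5m{job="checkout-api"}) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: checkout
          slo: availability
        annotations:
          summary: "{{ $labels.job }} burning error budget too fast"
          description: |
            1h error rate is {{ $value | humanizePercentage }}, more than 14.4x the
            0.1% error budget of the 99.9% SLO (confirmed over the last 5m).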
          runbook_url: "https://wiki.internal/runbooks/slo-burn"
          dashboard_url: "https://grafana.internal/d/slo-dashboard?var-job={{ $labels.job }}"

      # ─── Latency Degradation ───
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "{{ $labels.job }} p99 latency above 2s"
          description: "P99 latency is {{ $value | humanizeDuration }} for {{ $labels.job }}."

      # ─── Infrastructure Alert ───
      - alert: DiskSpaceCritical
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 3600*24) < 0
        for: 10m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "{{ $labels.instance }} disk will fill within 24 hours"
          description: |
            Based on the 6-hour trend, {{ $labels.instance }} mount {{ $labels.mountpoint }}
            is predicted to run out of space within 24 hours.
            Predicted available space in 24h: {{ $value | humanize1024 }}B

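      # ─── Target Down (failed scrapes) ───
      # up == 0 only fires for targets that are still discovered; to catch a job
      # disappearing entirely, add a separate absent(up{job="checkout-api"}) rule.
      - alert: TargetDown
        expr: up == 0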
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.instance }} is down"
          description: "Target has been unreachable for more than 3 minutes."
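The Python sketch below covers the arithmetic behind two of these rules. The burn-rate helpers show why a 14.4x threshold over a 1-hour window corresponds to spending 2% of a 30-day error budget. The 30-day SLO period is an assumption. The last function is a plain least-squares extrapolation in the spirit of `predict_linear`: it fits a line to the samples and projects it forward from the evaluation time. It is not Prometheus's exact implementation.

```python
from typing import Sequence, Tuple


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    return error_ratio / (1 - slo_target)


def budget_consumed(rate: float, window_h: float, slo_period_days: float = 30) -> float:
    """Fraction of the error budget spent if `rate` persists for window_h hours."""
    return rate * window_h / (slo_period_days * 24)


def predict_linear(samples: Sequence[Tuple[float, float]], eval_time: float,
                   horizon_s: float) -> float:
    """Least-squares line through (timestamp, value) samples, extrapolated horizon_s past eval_time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [t - eval_time for t, _ in samples]  # centre time on the evaluation timestamp
    ys = [v for _, v in samples]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # fitted value at eval_time
    return intercept + slope * horizon_s


print(burn_rate(error_ratio=0.0144, slo_target=0.999))  # about 14.4
print(budget_consumed(14.4, window_h=1))                 # about 0.02 (2% of the budget)

# Hypothetical disk losing 1 GiB/hour with 30 GiB free now (values in GiB):
# predicted free space 24h ahead is about 6 GiB, so DiskSpaceCritical would not fire.
samples = [(t, 36 - t / 3600) for t in range(0, 6 * 3600 + 1, 600)]
print(predict_linear(samples, eval_time=6 * 3600, horizon_s=24 * 3600))
```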

                            
                            Alert Design Principles: (1) Alert on symptoms, not causes: users care about latency, not CPU%. (2) Use a for: duration to avoid flapping; 2-5 minutes keeps transient spikes from paging. (3) Include runbook URLs in annotations: the on-call engineer needs next steps, not just a description. (4) Use recording rules in alert expressions: they keep alerts readable and avoid expensive evaluations.
                        

Summary & Next Steps

In this deep dive into metrics monitoring, we covered:

PromQL fundamentals — Selectors, operators, functions, aggregations, and the distinction between instant and range vectors
Writing production PromQL — rate/irate/increase, histogram_quantile, recording rules, and implementing RED/USE methods
Data collection protocols — StatsD/DogStatsD for legacy apps, OTLP for modern instrumentation, Prometheus exposition for cloud-native, and SNMP for network devices
Storage architectures — Graphite's file-per-metric model, Prometheus's local TSDB with WAL and compaction, and Mimir's horizontally-scalable microservices architecture
Exemplars — Linking metrics to traces for instant correlation during incident response
Best practices — Cardinality management, label naming conventions, recording rules organization, and alerting rule design

The key insight is that metrics monitoring at scale is as much about discipline as it is about technology. Cardinality explosions, missing recording rules, and noisy alerts are organizational problems that require conventions and governance — not just better tools.

Next in the Grafana Track

In Part 6: Distributed Tracing with Grafana Tempo & TraceQL, we'll explore the traces pillar — TraceQL query language, Tempo's architecture, span-level analysis, service graphs, and correlating traces with metrics and logs for complete observability.

Previous: Part 4: Looking at Logs — Loki & LogQL
Next: Part 6: Distributed Tracing — Tempo & TraceQL