Performance Fundamentals
Before optimizing, you need to understand how Prometheus consumes resources. Its performance characteristics are directly tied to four primary factors: active time series count, ingestion rate, query complexity, and TSDB operations (compaction, WAL replay).
Resource Consumption Model
Prometheus Resource Consumption Breakdown
| Resource | Primary Driver | Scaling Factor |
|---|---|---|
| Memory (heap) | Active series in head block | ~3–4 KiB per series |
| Memory (mmap) | Queried block data pages | Proportional to query time range |
| CPU | Ingestion + compaction + queries | Linear with samples/sec + rules |
| Disk I/O (write) | WAL appends + compaction | ~1–2 bytes per sample (compressed) |
| Disk I/O (read) | Queries touching persistent blocks | Proportional to query range |
| Network | Scrape traffic + remote write | ~1 KiB per target per scrape |
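As a rough worked example of the heap row above, head-block memory can be estimated from the active series count. The function name is illustrative and the 3 to 4 KiB bounds are just the table's rule of thumb, not measured values, so treat this as a back-of-the-envelope sketch.

```python
KIB = 1024
GIB = 1024 ** 3


def estimate_head_memory_gib(active_series, kib_low=3, kib_high=4):
    """Return a (low, high) estimate of head-block heap usage in GiB."""
    low = active_series * kib_low * KIB / GIB
    high = active_series * kib_high * KIB / GIB
    return low, high


low, high = estimate_head_memory_gib(2_000_000)
print(f"2M active series: roughly {low:.1f} to {high:.1f} GiB of heap")
```

Comparing this estimate with go_memstats_heap_inuse_bytes on a running server shows how well the rule of thumb fits a given workload.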
Key Self-Monitoring Metrics
# Essential metrics for monitoring Prometheus itself
# Head series count (primary memory driver)
prometheus_tsdb_head_series
# Ingestion rate (samples per second)
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Memory usage breakdown
process_resident_memory_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_alloc_bytes
# Compaction health
prometheus_tsdb_compactions_total
prometheus_tsdb_compaction_duration_seconds
prometheus_tsdb_compactions_failed_total
# WAL health
prometheus_tsdb_wal_corruptions_total
prometheus_tsdb_wal_truncate_duration_seconds
# Query performance
prometheus_engine_query_duration_seconds{quantile="0.99"}
prometheus_engine_queries_concurrent_max
prometheus_engine_query_samples_total
# Scrape performance
prometheus_target_scrape_pool_targets
prometheus_target_scrape_pool_sync_total
scrape_duration_seconds
scrape_samples_scraped
TSDB Tuning
WAL Configuration
The Write-Ahead Log (WAL) is Prometheus’s durability mechanism. Every sample is first written to the WAL before being committed to the head block. WAL configuration affects both startup time (replay) and steady-state performance:
# TSDB storage flags
# These are CLI flags, not prometheus.yml options
# --storage.tsdb.wal-segment-size=128MB # Default: 128MB per WAL segment
# --storage.tsdb.wal-compression # Enable WAL compression (recommended)
# --storage.tsdb.min-block-duration=2h # Head block minimum duration
# --storage.tsdb.max-block-duration=36h # Maximum block duration after compaction
# Example Kubernetes args
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=100GB'
- '--storage.tsdb.wal-compression'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=36h'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
Track how long WAL replay takes after a restart with prometheus_tsdb_data_replay_duration_seconds.
Head Block Optimization
# Head block chunks — check memory pressure
# If head chunks are too many, the GC pauses increase
prometheus_tsdb_head_chunks
# Out-of-order samples (indicates clock skew or client issues)
prometheus_tsdb_out_of_order_samples_total
# Series created/removed rate (high churn = high GC pressure)
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])
# Head GC duration
prometheus_tsdb_head_gc_duration_seconds
Reducing series churn: High series creation/removal rates (churn) cause excessive garbage collection. Common causes include Kubernetes pod restarts creating new pod labels, or applications using request IDs as label values.
Compaction Tuning
Compaction merges smaller blocks into larger ones, reducing the number of files Prometheus needs to query and reclaiming space from deleted or out-of-retention series:
# Monitor compaction health
# Duration should remain consistent — spikes indicate growing data
prometheus_tsdb_compaction_duration_seconds
# Block count — should stabilize, not grow indefinitely
prometheus_tsdb_blocks_loaded
# Compaction failures (often disk space or I/O issues)
prometheus_tsdb_compactions_failed_total
# Age of the oldest sample still in the head block
# (growing well past ~3h suggests head compaction is stalled)
time() - prometheus_tsdb_head_min_time_seconds
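To make the merge schedule concrete: blocks are compacted into progressively longer time ranges, capped by the maximum block duration. The sketch below reproduces that arithmetic for the 2h minimum and 36h maximum used in the flags earlier. The function name is illustrative, and the growth factor of 3 per level is an assumption based on current TSDB defaults rather than a documented guarantee.

```python
def compaction_ranges_hours(min_block_h=2, max_block_h=36, factor=3):
    """List the block durations (in hours) that compaction can produce."""
    ranges = []
    current = min_block_h
    while current <= max_block_h:
        ranges.append(current)
        current *= factor  # each level spans `factor` blocks of the previous one
    return ranges


print(compaction_ranges_hours())  # [2, 6, 18]
```

With these settings the next level would be 54h, which exceeds the 36h cap, so blocks stop growing at 18h.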
Retention Strategies
# Time-based retention (default)
--storage.tsdb.retention.time=30d
# Size-based retention (useful for fixed disk budgets)
--storage.tsdb.retention.size=200GB
# Combined: whichever limit is hit first
--storage.tsdb.retention.time=90d
--storage.tsdb.retention.size=500GB
# Check current disk usage
prometheus_tsdb_storage_blocks_bytes
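When choosing between the two retention limits, it helps to estimate how much disk a given time-based retention needs. A common rule of thumb multiplies retention, ingestion rate, and bytes per sample. The ingestion rate below is an assumed example value, and 2 bytes per sample is the upper end of the range quoted in the resource table.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Approximate on-disk block size for a time-based retention window."""
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


needed = estimate_disk_bytes(retention_days=30, samples_per_second=100_000)
print(f"30d at 100k samples/s: about {needed / 1e9:.0f} GB")
```

If the estimate is larger than the configured retention.size, the size limit is the one that will trigger first.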
Cardinality Management
Cardinality, the total number of unique time series, is the single most important factor affecting Prometheus performance. A metric with 5 labels that each take 10 values can produce up to 10^5 = 100,000 time series from a single metric name.
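The multiplication can be checked with a few lines of Python. The label names here are hypothetical, and the result is a worst-case bound, since real workloads rarely exercise every combination.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_value_counts.values())


labels = {"method": 10, "status": 10, "path": 10, "region": 10, "instance": 10}
print(worst_case_series(labels))  # 100000

# Adding one more label with 50 values multiplies the bound by 50.
labels["customer"] = 50
print(worst_case_series(labels))  # 5000000
```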
Identifying High Cardinality
# PromQL queries to find cardinality offenders
# Top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))
# Series count per job
count by (job) ({__name__=~".+"})
# Find metrics with high label cardinality
# (look for metrics with many unique label value combinations)
count by (__name__) ({__name__=~"http_request_duration.*"})
# Using the TSDB status API (available at /api/v1/status/tsdb)
# Returns top metric names by series count, top label names by value count
curl -s http://localhost:9090/api/v1/status/tsdb | jq .
# promtool tsdb analyze — detailed block analysis
promtool tsdb analyze /prometheus/data
# Output includes:
# - Block count and time range
# - Series count per block
# - Top 10 label names by number of values
# - Top 10 metric names by series count
# - Estimated cardinality per label pair
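The JSON from the TSDB status endpoint is easy to post-process. The sketch below ranks entries from one section of a response; the payload is an illustrative sample with made-up numbers, and the section names (seriesCountByMetricName, labelValueCountByLabelName) are assumed to match what your Prometheus version returns.

```python
import json

# Illustrative payload in the shape returned by /api/v1/status/tsdb.
SAMPLE = """
{"status": "success", "data": {
  "seriesCountByMetricName": [
    {"name": "kube_pod_status_phase", "value": 12891},
    {"name": "container_memory_usage_bytes", "value": 15234}
  ],
  "labelValueCountByLabelName": [
    {"name": "instance", "value": 2341},
    {"name": "pod", "value": 8912}
  ]
}}
"""


def top_offenders(payload, section, limit=5):
    """Return (name, value) pairs from one section, largest first."""
    entries = json.loads(payload)["data"].get(section, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:limit]]


for name, count in top_offenders(SAMPLE, "seriesCountByMetricName"):
    print(f"{count:>8}  {name}")
```

In practice the payload would come from the curl command shown earlier rather than an inline string.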
Reduction Strategies
- Drop unused metrics: Use metric_relabel_configs to drop metrics you never query
- Aggregate at source: Configure applications to emit pre-aggregated histograms instead of per-path metrics
- Remove high-cardinality labels: Drop labels like request_id, trace_id, user_id at scrape time
- Limit label values: Replace unbounded labels (URLs) with bounded alternatives (route patterns)
- Use recording rules: Pre-aggregate and drop the raw high-cardinality metrics via remote write relabeling
# metric_relabel_configs — applied AFTER scraping
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:8080']
    metric_relabel_configs:
      # Drop Go runtime metrics we don't use
      - source_labels: [__name__]
        regex: 'go_(gc|memstats)_.*'
        action: drop
      # Drop high-cardinality labels
      - regex: 'request_id|trace_id'
        action: labeldrop
      # Aggregate HTTP metrics by dropping instance-specific path params
      # /api/users/123 → /api/users/:id
      - source_labels: [path]
        regex: '/api/users/[0-9]+'
        target_label: path
        replacement: '/api/users/:id'
      # Drop entire histogram buckets we don't need
      - source_labels: [__name__, le]
        regex: 'http_request_duration_seconds_bucket;(0\.001|0\.0025|0\.0075)'
        action: drop
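The same path normalization can also be done in the application before the metric is ever exposed, which is the "limit label values" strategy from the list above. The following is a minimal sketch; the route table and the "other" fallback bucket are hypothetical choices, and unlike the relabel rule above it collapses every unknown path into a single value.

```python
import re

# Hypothetical route table: (compiled pattern, bounded label value).
ROUTES = [
    (re.compile(r"/api/users/[0-9]+"), "/api/users/:id"),
    (re.compile(r"/api/orders/[0-9a-f-]+"), "/api/orders/:id"),
]


def normalize_path(path):
    """Map a raw URL path to a bounded route label."""
    for pattern, route in ROUTES:
        # fullmatch mirrors relabel regexes, which are anchored at both ends
        if pattern.fullmatch(path):
            return route
    return "other"


print(normalize_path("/api/users/123"))       # /api/users/:id
print(normalize_path("/api/users/123/edit"))  # other
```

With this in place the path label can take at most len(ROUTES) + 1 distinct values, regardless of traffic.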
Query Optimization
Expensive Query Patterns
Expensive vs Optimized Query Patterns
| Anti-Pattern | Why Expensive | Optimized Alternative |
|---|---|---|
| {__name__=~".+"} | Loads ALL series into memory | Always scope with job/metric name |
| rate(x[30d]) | Loads 30 days of raw samples | Use recording rule with shorter range |
| count(up) without (instance) | Keeps every other label as a grouping key | count by (job) (up) |
| Nested label_replace | String ops on every sample | Relabeling at scrape time |
| histogram_quantile on raw | Processes all buckets per series | Recording rule for common quantiles |
Recording Rules for Performance
# Recording rules — pre-compute expensive queries
groups:
  - name: performance_recording_rules
    interval: 30s
    rules:
      # Pre-compute P50/P99 to avoid repeated histogram_quantile
      - record: job:http_request_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Pre-aggregate request rate across all instances
      - record: job:http_requests:rate5m
        expr: sum by (job, method, status) (rate(http_requests_total[5m]))
      # Error rate as ratio (used in many dashboards)
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
Query Limits & Timeouts
# CLI flags to protect Prometheus from expensive queries
--query.max-concurrency=20 # Max simultaneous queries (default: 20)
--query.timeout=2m # Max query execution time (default: 2m)
--query.max-samples=50000000 # Max samples a query can load (default: 50M)
--query.lookback-delta=5m # Staleness lookback window (default: 5m)
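To get a feel for when a query approaches the sample limit, multiply the number of matching series by the points each series contributes over the selected range. This is only a rough upper bound under an assumed 15s scrape interval; the engine's real accounting is more nuanced, and the function name is illustrative.

```python
MAX_SAMPLES = 50_000_000  # the --query.max-samples default quoted above


def estimated_samples(series, range_seconds, scrape_interval_seconds=15):
    """Approximate raw samples a range selector loads: series x points per series."""
    return series * (range_seconds // scrape_interval_seconds)


five_min = estimated_samples(series=10_000, range_seconds=5 * 60)
one_day = estimated_samples(series=10_000, range_seconds=24 * 3600)
print(five_min, five_min > MAX_SAMPLES)  # 200000 False
print(one_day, one_day > MAX_SAMPLES)    # 57600000 True
```

The same 10,000 series are cheap over a 5m window and over the limit at 1d, which is why long ranges on high-cardinality metrics are the usual trigger.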
Go Profiling with pprof
Prometheus is written in Go and exposes pprof profiling endpoints by default. These are invaluable for diagnosing CPU hotspots, memory leaks, and goroutine issues in production.
CPU Profiling
# Collect a 30-second CPU profile
curl -s 'http://localhost:9090/debug/pprof/profile?seconds=30' > cpu.prof
# Analyze with go tool pprof
go tool pprof -http=:8080 cpu.prof
# Opens web UI with flame graphs, call graphs, top functions
# Or use CLI for quick analysis
go tool pprof cpu.prof
(pprof) top 20
(pprof) web # Generate SVG call graph
(pprof) list compactBlocks # Source-level view of specific function
# Common CPU hotspots in Prometheus:
# 1. tsdb.(*Head).Appender — ingestion path
# 2. querier.Select — query evaluation
# 3. compactor.Compact — block compaction
# 4. scrape.(*scrapeLoop).append — scrape processing
Memory (Heap) Profiling
# Current heap allocation profile
curl -s http://localhost:9090/debug/pprof/heap > heap.prof
# Analyze — show allocations by size
go tool pprof -http=:8080 -alloc_space heap.prof
# Compare two heap profiles (before/after suspected leak)
go tool pprof -base heap_before.prof heap_after.prof
(pprof) top 20 -cum # Cumulative allocations
# In-use (resident) vs allocated (total historical)
go tool pprof -inuse_space heap.prof # What's currently held
go tool pprof -alloc_space heap.prof # What was ever allocated
# Common memory consumers:
# 1. tsdb.(*headChunk) — compressed samples in head
# 2. labels.Labels — label sets for active series
# 3. index.(*PostingsReader) — query-time index lookups
# 4. wal.(*Segment) — WAL buffers awaiting compaction
Goroutine Analysis
# Current goroutine dump
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt
# Goroutines grouped by identical stack, with counts
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=1' | head -50
# Watch goroutine count over time
# High goroutine count often means:
# - Stuck scrape connections (target not responding)
# - Remote write queue backup
# - Query handlers waiting on slow TSDB reads
go_goroutines # Monitor this metric
# Block profile: find contention points (empty unless block profiling is enabled in the running process)
curl -s http://localhost:9090/debug/pprof/block > block.prof
go tool pprof -http=:8080 block.prof
TSDB Analysis Tools
promtool tsdb Commands
# Analyze TSDB blocks — comprehensive overview
promtool tsdb analyze /prometheus/data
# Output example:
# Duration: 720h0m0s (30 days)
# Series: 4,523,891
# Samples: 12,847,291,340
# Chunks: 45,238,910
#
# Highest cardinality labels:
# "instance" with 2,341 unique values
# "pod" with 8,912 unique values
#
# Highest cardinality metric names:
# "container_memory_usage_bytes" with 15,234 series
# "kube_pod_status_phase" with 12,891 series
# List all blocks
promtool tsdb list /prometheus/data
# Dump raw samples from the TSDB (takes the data directory, not a block path; times are in ms)
promtool tsdb dump --min-time=0 --max-time=9999999999999 /prometheus/data
# Create snapshot for backup (via admin API)
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Creates /prometheus/data/snapshots/YYYYMMDDTHHMMSS-
# Clean tombstones (reclaim space from deleted series)
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
Block Inspection
# Each block directory contains:
# /prometheus/data/
# ├── 01HXYZ.../ # Block directory (ULID name)
# │ ├── meta.json # Block metadata (time range, stats)
# │ ├── index # Inverted index (label→series mapping)
# │ ├── chunks/ # Compressed sample data
# │ │ └── 000001 # Chunk files (max 512MB each)
# │ └── tombstones # Deleted series markers
# ├── wal/ # Write-Ahead Log
# │ ├── 00000001 # WAL segments (128MB each)
# │ └── checkpoint.00000042 # Last checkpoint
# └── lock # Process lock file
# Inspect block metadata
cat /prometheus/data/01HXYZ.../meta.json | jq .
# {
# "ulid": "01HXYZ...",
# "minTime": 1718000000000,
# "maxTime": 1718007200000,
# "stats": {
# "numSamples": 142891234,
# "numSeries": 4523891,
# "numChunks": 9047782
# },
# "compaction": {
# "level": 2,
# "sources": ["01HABC...", "01HDEF..."]
# }
# }
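The timestamps in meta.json are Unix milliseconds, so a block's time span and density are easy to derive. The sketch below uses the example values shown above as an inline string; the function name and the derived fields are illustrative.

```python
import json
from datetime import datetime, timezone

META = """
{"ulid": "01HXYZ...", "minTime": 1718000000000, "maxTime": 1718007200000,
 "stats": {"numSamples": 142891234, "numSeries": 4523891, "numChunks": 9047782}}
"""


def summarize_block(meta_json):
    """Derive start time, duration and sample density from block metadata."""
    meta = json.loads(meta_json)
    stats = meta["stats"]
    start = datetime.fromtimestamp(meta["minTime"] / 1000, tz=timezone.utc)
    return {
        "start_utc": start.isoformat(),
        "duration_hours": (meta["maxTime"] - meta["minTime"]) / 1000 / 3600,
        "samples_per_series": stats["numSamples"] / stats["numSeries"],
        "chunks_per_series": stats["numChunks"] / stats["numSeries"],
    }


summary = summarize_block(META)
print(summary["duration_hours"])  # 2.0
```

A block whose duration does not match one of the expected compaction ranges, or whose chunks-per-series ratio is unusually high, is worth a closer look.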
Systematic Troubleshooting
High Memory Usage
flowchart TD
A["High Memory Alert
process_resident_memory_bytes > threshold"] --> B{"Check head series count"}
B -->|"Growing"| C["Series churn?
Check created/removed rate"]
B -->|"Stable"| D["Check query memory
go_memstats_heap_inuse_bytes"]
C -->|"High churn"| E["Fix: relabeling to stabilize labels
Remove pod/container_id from long-lived metrics"]
C -->|"Normal churn"| F["Growing workload
Scale out (shard)"]
D -->|"Spikes with queries"| G["Find expensive queries
prometheus_engine_query_duration_seconds"]
D -->|"Steady high"| H["Check mmap'd blocks
Large query ranges loading block data"]
G --> I["Fix: Add recording rules
Reduce query.max-samples
Set query.timeout shorter"]
H --> J["Fix: Reduce retention
Use remote storage for long-range queries"]
Slow Queries
# Enable query logging to identify slow queries
# Config file option (global section), not a CLI flag: query_log_file: /prometheus/query.log
# Runtime info (goroutine count, retention, last config reload) is available via the API
curl -s http://localhost:9090/api/v1/status/runtimeinfo | jq .
# Find logged queries taking >10s
# (Requires query log enabled)
jq -c 'select(.stats.timings.evalTotalTime > 10) | {query: .params.query, duration: .stats.timings.evalTotalTime}' /prometheus/query.log
# Common slow query causes:
# 1. Large time range (7d+) on high-cardinality metric
# 2. Regex matchers on label values: {path=~".*api.*"}
# 3. Nested subqueries: rate(metric[5m])[1h:1m]
# 4. histogram_quantile over many series without pre-aggregation
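For anything beyond a one-off filter, the query log can be processed in a script. This sketch assumes each log line is a JSON object with the params.query and stats.timings.evalTotalTime fields used in the jq filter above; the two log lines are invented for illustration.

```python
import json

# Two illustrative log lines; real entries carry more fields.
LOG_LINES = [
    '{"params": {"query": "up"}, "stats": {"timings": {"evalTotalTime": 0.002}}}',
    '{"params": {"query": "rate(http_requests_total[7d])"},'
    ' "stats": {"timings": {"evalTotalTime": 14.3}}}',
]


def slow_queries(lines, threshold_seconds=10.0):
    """Return (duration, query) pairs above the threshold, slowest first."""
    slow = []
    for line in lines:
        entry = json.loads(line)
        duration = entry.get("stats", {}).get("timings", {}).get("evalTotalTime", 0.0)
        if duration > threshold_seconds:
            slow.append((duration, entry["params"]["query"]))
    return sorted(slow, reverse=True)


for duration, query in slow_queries(LOG_LINES):
    print(f"{duration:6.1f}s  {query}")
```

Reading lines from the actual log file instead of LOG_LINES is a matter of passing an open file handle, since the function only iterates over its input.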
Ingestion Lag
# Symptoms: prometheus_target_scrapes_exceeded_sample_limit_total increasing
# Or: scrape_samples_scraped showing fewer samples than expected
# Check if scrapes are failing or running close to the timeout
count by (job) (up == 0)
max by (job) (scrape_duration_seconds)
# Check if targets are returning too many samples
prometheus_target_scrapes_exceeded_sample_limit_total
# Adjust limits if legitimate
scrape_configs:
  - job_name: 'heavy-exporter'
    sample_limit: 100000 # Per-scrape sample cap; the default 0 means no limit
    scrape_timeout: 30s # Raised from the default 10s; must not exceed scrape_interval
# Remote write falling behind
prometheus_remote_storage_samples_pending
# If this grows continuously, increase max_shards or check backend
Conclusion
- Manage cardinality — the #1 lever for both memory and query performance
- Add recording rules — pre-compute expensive queries used in dashboards
- Tune TSDB flags — WAL compression, appropriate retention, block durations
- Set query limits — protect the server from runaway queries
- Profile under load — use pprof to find unexpected hotspots
- Scale out — shard when a single server is maxed (Part 7)