Prometheus Deep Dive Part 8: Optimizing & Debugging Prometheus

Performance Fundamentals

Before optimizing, you need to understand how Prometheus consumes resources. Its performance characteristics are directly tied to four primary factors: active time series count, ingestion rate, query complexity, and TSDB operations (compaction, WAL replay).

Resource Consumption Model

                            
Memory Rule of Thumb: Prometheus uses approximately 3–4 KiB per active time series in the head block. With 5 million active series, expect 15–20 GiB of heap usage for the head alone. Add 30–50% for query buffers, scrape buffers, and Go runtime overhead, which puts total RSS at roughly 20–30 GiB; size toward the upper end.
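The arithmetic behind this rule of thumb can be sketched in Python. The function name and default ranges below are illustrative assumptions taken from the figures in the text, not values reported by Prometheus; real per-series cost varies with label sizes and churn.

```python
KIB = 1024
GIB = 1024 ** 3

def estimate_memory_gib(active_series, kib_per_series=(3, 4), overhead=(0.30, 0.50)):
    """Return a (low, high) RSS estimate in GiB for a given head series count.

    kib_per_series and overhead are the rule-of-thumb ranges from the text,
    treated here as assumptions rather than measured values.
    """
    low_head = active_series * kib_per_series[0] * KIB
    high_head = active_series * kib_per_series[1] * KIB
    low = low_head * (1 + overhead[0]) / GIB
    high = high_head * (1 + overhead[1]) / GIB
    return low, high

low, high = estimate_memory_gib(5_000_000)
print(f"Estimated RSS: {low:.1f} to {high:.1f} GiB")
```

For 5 million series this lands at roughly 19 to 29 GiB, the same order as the planning figure above. Compare the estimate against process_resident_memory_bytes on a running server before relying on it.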
                        


Prometheus Resource Consumption Breakdown

Resource	Primary Driver	Scaling Factor
Memory (heap)	Active series in head block	~3–4 KiB per series
Memory (mmap)	Queried block data pages	Proportional to query time range
CPU	Ingestion + compaction + queries	Linear with samples/sec + rules
Disk I/O (write)	WAL appends + compaction	~1–2 bytes per sample (compressed)
Disk I/O (read)	Queries touching persistent blocks	Proportional to query range
Network	Scrape traffic + remote write	~1 KiB per target per scrape


Key Self-Monitoring Metrics

# Essential metrics for monitoring Prometheus itself

# Head series count (primary memory driver)
prometheus_tsdb_head_series

# Ingestion rate (samples per second)
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Memory usage breakdown
process_resident_memory_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_alloc_bytes

# Compaction health
prometheus_tsdb_compactions_total
prometheus_tsdb_compaction_duration_seconds
prometheus_tsdb_compactions_failed_total

# WAL health
prometheus_tsdb_wal_corruptions_total
prometheus_tsdb_wal_truncate_duration_seconds

# Query performance
prometheus_engine_query_duration_seconds{quantile="0.99"}
prometheus_engine_queries_concurrent_max
prometheus_engine_query_samples_total

# Scrape performance
prometheus_target_scrape_pool_targets
prometheus_target_scrape_pool_sync_total
scrape_duration_seconds
scrape_samples_scraped

TSDB Tuning

WAL Configuration

The Write-Ahead Log (WAL) is Prometheus’s durability mechanism. Every sample is first written to the WAL before being committed to the head block. WAL configuration affects both startup time (replay) and steady-state performance:

# TSDB storage settings
# These are CLI flags, not prometheus.yml options
# --storage.tsdb.wal-segment-size=128MB     # Default: 128MB per WAL segment
# --storage.tsdb.wal-compression             # WAL compression (enabled by default since v2.20)
# --storage.tsdb.min-block-duration=2h       # Head block minimum duration
# --storage.tsdb.max-block-duration=36h      # Maximum block duration after compaction

# Example Kubernetes args
args:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=30d'
  - '--storage.tsdb.retention.size=100GB'
  - '--storage.tsdb.wal-compression'
  - '--storage.tsdb.min-block-duration=2h'
  - '--storage.tsdb.max-block-duration=36h'
  - '--web.enable-lifecycle'
  - '--web.enable-admin-api'

                            
WAL Replay Time: After a crash or restart, Prometheus replays the WAL to rebuild the head block. With the default 2h min-block-duration, the WAL typically holds about two to three hours of data, all of which must be replayed. For instances with high ingestion rates (>500K samples/sec), replay can take 5–15 minutes. Monitor it with prometheus_tsdb_data_replay_duration_seconds.
                        

Head Block Optimization

# Head block chunks: check memory pressure
# A large number of head chunks increases heap usage and GC work
prometheus_tsdb_head_chunks

# Out-of-order samples (indicates clock skew or client issues)
prometheus_tsdb_out_of_order_samples_total

# Series created/removed rate (high churn = high GC pressure)
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])

# Head GC duration
prometheus_tsdb_head_gc_duration_seconds

Reducing series churn: High series creation/removal rates (churn) cause excessive garbage collection. Common causes include Kubernetes pod restarts creating new pod labels, or applications using request IDs as label values.
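To get a feel for how churn inflates the head, here is a minimal sketch under a simplifying assumption: every pod restart creates a fresh copy of that pod's series, and the old copies stay in the head until the next head truncation (taken here as a hypothetical 3-hour window). The function name and all numbers are illustrative.

```python
def head_series_with_churn(pods, series_per_pod, restarts_per_hour, head_window_hours=3.0):
    """Estimate head series when restarts leave stale series behind.

    Assumes each restart duplicates one pod's series and that stale series
    linger for head_window_hours (a hypothetical retention window for the head).
    """
    live = pods * series_per_pod
    stale = restarts_per_hour * head_window_hours * series_per_pod
    return int(live + stale)

stable = head_series_with_churn(pods=200, series_per_pod=500, restarts_per_hour=0)
churning = head_series_with_churn(pods=200, series_per_pod=500, restarts_per_hour=50)
print(stable, churning)
```

In this toy scenario the same 200 pods cost 75% more head series once 50 restarts per hour are added, which is why stabilizing labels usually pays off before adding memory.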

Compaction Tuning

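Compaction merges smaller blocks into larger ones, reducing the number of blocks Prometheus must open per query and reclaiming space from deleted (tombstoned) series. Retention is enforced separately, by deleting whole blocks once they fall outside the configured limits: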

# Monitor compaction health
# Duration should remain consistent — spikes indicate growing data
prometheus_tsdb_compaction_duration_seconds

# Block count — should stabilize, not grow indefinitely
prometheus_tsdb_blocks_loaded

# Compaction failures (often disk space or I/O issues)
prometheus_tsdb_compactions_failed_total

# Age of the oldest sample still in the head block
# (steady growth well past ~3h suggests head compaction is not completing)
time() - prometheus_tsdb_head_min_time_seconds

Retention Strategies

# Time-based retention (defaults to 15d when unset)
--storage.tsdb.retention.time=30d

# Size-based retention (useful for fixed disk budgets)
--storage.tsdb.retention.size=200GB

# Combined: whichever limit is hit first
--storage.tsdb.retention.time=90d
--storage.tsdb.retention.size=500GB

# Check current disk usage
prometheus_tsdb_storage_blocks_bytes
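When choosing between time-based and size-based retention, a back-of-the-envelope disk estimate helps. The sketch below multiplies retention, ingestion rate, and bytes per sample; the 2 bytes per sample default is an assumption at the upper end of the 1–2 byte range quoted earlier, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Approximate block storage needed for a given retention period."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample

def retention_days_for_budget(disk_bytes, samples_per_second, bytes_per_sample=2.0):
    """Approximate retention that fits into a fixed disk budget."""
    return disk_bytes / (SECONDS_PER_DAY * samples_per_second * bytes_per_sample)

needed = estimate_disk_bytes(retention_days=30, samples_per_second=100_000)
days = retention_days_for_budget(disk_bytes=200e9, samples_per_second=100_000)
print(f"30d at 100k samples/s: {needed / 1e9:.1f} GB")
print(f"200 GB budget lasts about {days:.1f} days")
```

The ingestion rate can be read from rate(prometheus_tsdb_head_samples_appended_total[5m]). Leave headroom on top of the estimate for the WAL and for temporary space used during compaction.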

Cardinality Management

Cardinality, the total number of unique time series, is the single most important factor affecting Prometheus performance. Series count grows multiplicatively with labels: a metric with 5 labels that each take 10 values can produce up to 10^5 = 100,000 time series under a single metric name.
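The multiplication is easy to check with a few lines of Python. The label names and value counts below are made up for illustration; the result is an upper bound, since real workloads rarely emit every combination.

```python
from math import prod

def max_series(label_value_counts):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_value_counts.values())

# Hypothetical metric with 5 labels of 10 values each
labels = {"method": 10, "status": 10, "route": 10, "region": 10, "instance": 10}
print(max_series(labels))

# Adding a single unbounded-looking label multiplies the bound again
labels["user_id"] = 1_000
print(max_series(labels))
```

The second call shows why one high-cardinality label matters more than many small ones: the bound jumps from 100,000 to 100,000,000.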

Identifying High Cardinality

# PromQL queries to find cardinality offenders

# Top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count per job
count by (job) ({__name__=~".+"})

# Find metrics with high label cardinality
# (look for metrics with many unique label value combinations)
count by (__name__) ({__name__=~"http_request_duration.*"})

# Using the TSDB status API (available at /api/v1/status/tsdb)
# Returns head statistics plus top-10 lists: series count per metric name,
# value count per label name, memory per label name, and series count per label pair
curl -s http://localhost:9090/api/v1/status/tsdb | jq .

# promtool tsdb analyze — detailed block analysis
promtool tsdb analyze /prometheus/data

# Output includes:
# - Block count and time range
# - Series count per block
# - Top 10 label names by number of values
# - Top 10 metric names by series count
# - Estimated cardinality per label pair
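The JSON returned by /api/v1/status/tsdb can also be ranked programmatically. The sketch below assumes the response carries lists of name/value objects under data.seriesCountByMetricName and data.labelValueCountByLabelName; the helper name is hypothetical and the sample payload is an illustration, not output from a real server.

```python
import json

def top_offenders(status_payload, key="seriesCountByMetricName", limit=5):
    """Return (name, value) pairs from one TSDB status list, largest first."""
    entries = status_payload["data"][key]
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:limit]]

# Illustrative payload in the assumed shape of the API response
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "seriesCountByMetricName": [
      {"name": "kube_pod_status_phase", "value": 12891},
      {"name": "container_memory_usage_bytes", "value": 15234}
    ],
    "labelValueCountByLabelName": [
      {"name": "instance", "value": 2341},
      {"name": "pod", "value": 8912}
    ]
  }
}
""")

print(top_offenders(SAMPLE))
print(top_offenders(SAMPLE, key="labelValueCountByLabelName", limit=1))
```

In practice the payload would come from the curl command above, saved to a file and loaded with json.load.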

Reduction Strategies

                            
Cardinality Reduction Techniques:
Drop unused metrics: Use metric_relabel_configs to drop metrics you never query
Aggregate at source: Configure applications to emit pre-aggregated histograms instead of per-path metrics
Remove high-cardinality labels: Drop labels like request_id, trace_id, user_id at scrape time
Limit label values: Replace unbounded labels (URLs) with bounded alternatives (route patterns)
Use recording rules: Pre-aggregate, then keep the raw high-cardinality series out of remote storage with write_relabel_configs

                        

# metric_relabel_configs — applied AFTER scraping
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:8080']
    metric_relabel_configs:
      # Drop Go runtime metrics we don't use
      - source_labels: [__name__]
        regex: 'go_(gc|memstats)_.*'
        action: drop

      # Drop high-cardinality labels
      - regex: 'request_id|trace_id'
        action: labeldrop

      # Aggregate HTTP metrics by dropping instance-specific path params
      # /api/users/123 → /api/users/:id
      - source_labels: [path]
        regex: '/api/users/[0-9]+'
        target_label: path
        replacement: '/api/users/:id'

      # Drop entire histogram buckets we don't need
      - source_labels: [__name__, le]
        regex: 'http_request_duration_seconds_bucket;(0\.001|0\.0025|0\.0075)'
        action: drop
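Relabel regexes are matched against the whole joined source value, not a substring, which is easy to get wrong. The Python sketch below imitates the replace and drop actions from the config above under stated simplifications: it uses Python's re module rather than RE2, joins source labels with ";", and treats the replacement as a literal string (no $1 references). The function names are hypothetical.

```python
import re

def relabel_replace(labels, source_labels, regex, target_label, replacement, separator=";"):
    """Imitate action: replace with a fully anchored regex and literal replacement."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    if re.fullmatch(regex, value) is None:
        return dict(labels)  # no match: labels unchanged
    updated = dict(labels)
    updated[target_label] = replacement
    return updated

def should_drop(labels, source_labels, regex, separator=";"):
    """Imitate action: drop, true when the joined value fully matches the regex."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, value) is not None

series = {"__name__": "http_requests_total", "path": "/api/users/123"}
print(relabel_replace(series, ["path"], r"/api/users/[0-9]+", "path", "/api/users/:id"))

bucket = {"__name__": "http_request_duration_seconds_bucket", "le": "0.001"}
bucket_regex = r"http_request_duration_seconds_bucket;(0\.001|0\.0025|0\.0075)"
print(should_drop(bucket, ["__name__", "le"], bucket_regex))
```

Because of the anchoring, a path such as /api/users/123/orders is left untouched by the rule above; it would need its own pattern.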

Query Optimization

Expensive Query Patterns


Expensive vs Optimized Query Patterns

Anti-Pattern	Why Expensive	Optimized Alternative
`{__name__=~".+"}`	Loads ALL series into memory	Always scope with job/metric name
`rate(x[30d])`	Loads 30 days of raw samples	Use recording rule with shorter range
`count(up) without (instance)`	`without` keeps every other label, so output groups multiply	`count by (job) (up)` when only `job` is needed
Nested `label_replace`	Regex string ops on every series at every evaluation step	Relabeling at scrape time
`histogram_quantile` on raw buckets	Processes all buckets per series	Recording rule for common quantiles


Recording Rules for Performance

# Recording rules — pre-compute expensive queries
groups:
  - name: performance_recording_rules
    interval: 30s
    rules:
      # Pre-compute P50 and P99 to avoid repeated histogram_quantile
      - record: job:http_request_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Pre-aggregate request rate across all instances
      - record: job:http_requests:rate5m
        expr: sum by (job, method, status) (rate(http_requests_total[5m]))

      # Error rate as ratio (used in many dashboards)
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

Query Limits & Timeouts

# CLI flags to protect Prometheus from expensive queries
--query.max-concurrency=20           # Max simultaneous queries (default: 20)
--query.timeout=2m                   # Max query execution time (default: 2m)
--query.max-samples=50000000         # Max samples a query can load (default: 50M)
--query.lookback-delta=5m            # Staleness lookback window (default: 5m)
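To judge whether a query is likely to hit --query.max-samples, a rough count of raw samples is often enough. The sketch below assumes a hypothetical 15-second scrape interval and simply multiplies series by samples per range; the engine's real accounting is more detailed, so treat the result as an order-of-magnitude check.

```python
def samples_loaded(series, range_seconds, scrape_interval_seconds=15):
    """Rough number of raw samples a range selector touches."""
    return series * (range_seconds // scrape_interval_seconds)

def max_series_within_limit(range_seconds, scrape_interval_seconds=15, max_samples=50_000_000):
    """How many series fit under the sample limit for a given range."""
    return max_samples // (range_seconds // scrape_interval_seconds)

THIRTY_DAYS = 30 * 86_400
FIVE_MINUTES = 5 * 60

print(samples_loaded(series=1_000, range_seconds=THIRTY_DAYS))
print(max_series_within_limit(THIRTY_DAYS))
print(max_series_within_limit(FIVE_MINUTES))
```

Under these assumptions a 30-day range exhausts the default 50M limit with fewer than 300 series, while a 5-minute range leaves room for millions, which is the reasoning behind moving long-range calculations into recording rules.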

Go Profiling with pprof

Prometheus is written in Go and exposes pprof profiling endpoints by default. These are invaluable for diagnosing CPU hotspots, memory leaks, and goroutine issues in production.

CPU Profiling

# Collect a 30-second CPU profile (quote the URL so the shell does not glob "?")
curl -s 'http://localhost:9090/debug/pprof/profile?seconds=30' > cpu.prof

# Analyze with go tool pprof
go tool pprof -http=:8080 cpu.prof
# Opens web UI with flame graphs, call graphs, top functions

# Or use CLI for quick analysis
go tool pprof cpu.prof
(pprof) top 20
(pprof) web                     # Generate SVG call graph
(pprof) list compactBlocks      # Source-level view of specific function

# Common CPU hotspots in Prometheus:
# 1. tsdb.(*Head).Appender — ingestion path
# 2. querier.Select — query evaluation
# 3. compactor.Compact — block compaction
# 4. scrape.(*scrapeLoop).append — scrape processing

Memory (Heap) Profiling

# Current heap allocation profile
curl -s http://localhost:9090/debug/pprof/heap > heap.prof

# Analyze — show allocations by size
go tool pprof -http=:8080 -alloc_space heap.prof

# Compare two heap profiles (before/after suspected leak)
go tool pprof -base heap_before.prof heap_after.prof
(pprof) top 20 -cum    # Cumulative allocations

# In-use (resident) vs allocated (total historical)
go tool pprof -inuse_space heap.prof     # What's currently held
go tool pprof -alloc_space heap.prof     # What was ever allocated

# Common memory consumers:
# 1. tsdb.(*headChunk) — compressed samples in head
# 2. labels.Labels — label sets for active series
# 3. index.(*PostingsReader) — query-time index lookups
# 4. wal.(*Segment) — WAL buffers awaiting compaction

Goroutine Analysis

# Full goroutine dump with one stack trace per goroutine
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt

# Goroutines grouped by identical stack, with a count per group
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=1' | head -50

# Watch goroutine count over time
# High goroutine count often means:
# - Stuck scrape connections (target not responding)
# - Remote write queue backup
# - Query handlers waiting on slow TSDB reads
go_goroutines   # Monitor this metric

# Block profile — find contention points
curl -s http://localhost:9090/debug/pprof/block > block.prof
go tool pprof -http=:8080 block.prof
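The grouped goroutine output (debug=1) can be summarized with a short script instead of reading it by eye. The sketch below assumes the usual text layout: a "N @ addresses" header per stack, followed by tab-separated "#" frame lines, with stacks separated by blank lines. The sample text is synthetic and only illustrates the assumed format; the function name is hypothetical.

```python
import re
from collections import Counter

STACK_HEADER = re.compile(r"^(\d+) @ ")

def goroutines_by_frame(profile_text):
    """Count goroutines per first non-runtime frame in a debug=1 profile."""
    counts = Counter()
    for record in profile_text.split("\n\n"):
        count, frames = None, []
        for line in record.splitlines():
            header = STACK_HEADER.match(line)
            if header:
                count = int(header.group(1))
            elif line.startswith("#\t"):
                fields = line.split("\t")
                if len(fields) >= 3:
                    # fields[2] looks like "pkg.Func+0x1b"; keep the function name
                    frames.append(fields[2].split("+0x")[0])
        if count is None or not frames:
            continue
        app_frames = [f for f in frames if not f.startswith("runtime.")]
        counts[app_frames[0] if app_frames else frames[0]] += count
    return counts

# Synthetic example in the assumed format (not captured from a real server)
SAMPLE = (
    "goroutine profile: total 7\n"
    "5 @ 0x43a2b6 0x44a6ac\n"
    "#\t0x43a2b5\truntime.gopark+0xd5\t/usr/local/go/src/runtime/proc.go:381\n"
    "#\t0x6f1a2b\tnet/http.(*persistConn).readLoop+0x1b\t/usr/local/go/src/net/http/transport.go:2213\n"
    "\n"
    "2 @ 0x43a2b6 0x40b1c5\n"
    "#\t0x43a2b5\truntime.gopark+0xd5\t/usr/local/go/src/runtime/proc.go:381\n"
    "#\t0x9c12ab\texample.com/pkg.worker+0x45\t/src/pkg/worker.go:42\n"
)

for frame, n in goroutines_by_frame(SAMPLE).most_common():
    print(n, frame)
```

Skipping runtime frames keeps every parked goroutine from collapsing into the same scheduler function, so the counts point at the code that is actually waiting.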

TSDB Analysis Tools

promtool tsdb Commands

# Analyze TSDB blocks — comprehensive overview
promtool tsdb analyze /prometheus/data

# Output example:
# Duration: 720h0m0s (30 days)
# Series: 4,523,891
# Samples: 12,847,291,340
# Chunks: 45,238,910
#
# Highest cardinality labels:
#   "instance" with 2,341 unique values
#   "pod" with 8,912 unique values
#
# Highest cardinality metric names:
#   "container_memory_usage_bytes" with 15,234 series
#   "kube_pod_status_phase" with 12,891 series

# List all blocks
promtool tsdb list /prometheus/data

# Dump the samples stored in the TSDB for a time range (output can be very large)
promtool tsdb dump --min-time=0 --max-time=9999999999999 /prometheus/data

# Create snapshot for backup (admin API; requires --web.enable-admin-api)
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Creates /prometheus/data/snapshots/YYYYMMDDTHHMMSS-

# Clean tombstones (reclaim space from deleted series)
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

Block Inspection

# Each block directory contains:
# /prometheus/data/
# ├── 01HXYZ.../               # Block directory (ULID name)
# │   ├── meta.json            # Block metadata (time range, stats)
# │   ├── index                # Inverted index (label→series mapping)
# │   ├── chunks/              # Compressed sample data
# │   │   └── 000001           # Chunk files (max 512MB each)
# │   └── tombstones           # Deleted series markers
# ├── wal/                     # Write-Ahead Log
# │   ├── 00000001             # WAL segments (128MB each)
# │   └── checkpoint.00000042  # Last checkpoint
# └── lock                     # Process lock file

# Inspect block metadata
cat /prometheus/data/01HXYZ.../meta.json | jq .
# {
#   "ulid": "01HXYZ...",
#   "minTime": 1718000000000,
#   "maxTime": 1718007200000,
#   "stats": {
#     "numSamples": 142891234,
#     "numSeries": 4523891,
#     "numChunks": 9047782
#   },
#   "compaction": {
#     "level": 2,
#     "sources": ["01HABC...", "01HDEF..."]
#   }
# }
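A few derived numbers make meta.json easier to interpret. The sketch below reads the fields shown above and computes the block duration and average density; the helper name is hypothetical, and the input reuses the example values from the listing rather than a real block.

```python
import json

def summarize_block(meta):
    """Derive duration and density figures from a block's meta.json content."""
    duration_ms = meta["maxTime"] - meta["minTime"]  # timestamps are in milliseconds
    stats = meta["stats"]
    return {
        "duration_hours": duration_ms / 3_600_000,
        "samples_per_series": stats["numSamples"] / stats["numSeries"],
        "samples_per_chunk": stats["numSamples"] / stats["numChunks"],
        "compaction_level": meta["compaction"]["level"],
    }

# Example values copied from the meta.json listing above
META = json.loads("""
{
  "ulid": "01HXYZ...",
  "minTime": 1718000000000,
  "maxTime": 1718007200000,
  "stats": {"numSamples": 142891234, "numSeries": 4523891, "numChunks": 9047782},
  "compaction": {"level": 2, "sources": ["01HABC...", "01HDEF..."]}
}
""")

for key, value in summarize_block(META).items():
    print(f"{key}: {value}")
```

A low samples-per-series figure for a block's duration can hint at short-lived series, which ties back to the churn metrics discussed earlier.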

Systematic Troubleshooting

High Memory Usage

High Memory Troubleshooting Flowchart

flowchart TD
    A["High Memory Alert
process_resident_memory_bytes > threshold"] --> B{"Check head series count"}
    B -->|"Growing"| C["Series churn?
Check created/removed rate"]
    B -->|"Stable"| D["Check query memory
go_memstats_heap_inuse_bytes"]

    C -->|"High churn"| E["Fix: relabeling to stabilize labels
Remove pod/container_id from long-lived metrics"]
    C -->|"Normal churn"| F["Growing workload
Scale out (shard)"]

    D -->|"Spikes with queries"| G["Find expensive queries
prometheus_engine_query_duration_seconds"]
    D -->|"Steady high"| H["Check mmap'd blocks
Large query ranges loading block data"]

    G --> I["Fix: Add recording rules
Reduce query.max-samples
Set query.timeout shorter"]
    H --> J["Fix: Reduce retention
Use remote storage for long-range queries"]

Slow Queries

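# Enable query logging to identify slow queries
# Set in prometheus.yml (a config option, not a CLI flag):
#   global:
#     query_log_file: /prometheus/query.log

# Check general runtime information (start time, goroutine count, retention);
# this endpoint does not expose the query log itself
curl -s http://localhost:9090/api/v1/status/runtimeinfo | jq .

# Find logged queries whose evaluation took more than 10s
# (requires the query log to be enabled; each line is a JSON object)
jq -c 'select(.stats.timings.evalTotalTime > 10)
       | {query: .params.query, duration: .stats.timings.evalTotalTime}' \
  /prometheus/query.log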
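For repeated analysis, the same filtering can be done in a small script. This sketch assumes each query log line is a JSON object with params.query and stats.timings.evalTotalTime (in seconds), as referenced above; the function name is hypothetical and the sample lines are trimmed illustrations, not real log output.

```python
import json

def slow_queries(log_lines, threshold_seconds=10.0):
    """Return (duration, query) pairs above the threshold, slowest first."""
    slow = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        duration = entry.get("stats", {}).get("timings", {}).get("evalTotalTime")
        query = entry.get("params", {}).get("query")
        if duration is not None and query is not None and duration > threshold_seconds:
            slow.append((duration, query))
    return sorted(slow, reverse=True)

# Trimmed, illustrative log lines in the assumed format
SAMPLE_LOG = [
    '{"params": {"query": "sum(rate(http_requests_total[5m]))"}, "stats": {"timings": {"evalTotalTime": 0.42}}}',
    '{"params": {"query": "count({__name__=~\\".+\\"})"}, "stats": {"timings": {"evalTotalTime": 12.7}}}',
    'not json',
]

for duration, query in slow_queries(SAMPLE_LOG):
    print(f"{duration:.1f}s  {query}")
```

Because the function accepts any iterable of lines, an open file handle for /prometheus/query.log can be passed in directly.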

# Common slow query causes:
# 1. Large time range (7d+) on high-cardinality metric
# 2. Regex matchers on label values: {path=~".*api.*"}
# 3. Nested subqueries: rate(metric[5m])[1h:1m]
# 4. histogram_quantile over many series without pre-aggregation

Ingestion Lag

# Symptoms: prometheus_target_scrapes_exceeded_sample_limit_total increasing
# Or: scrape_samples_scraped showing fewer samples than expected

# Check whether scrapes are failing or running close to scrape_timeout
count by (job) (up == 0)
quantile by (job) (0.99, scrape_duration_seconds)

# Check if targets are returning too many samples
prometheus_target_scrapes_exceeded_sample_limit_total

# Adjust limits if legitimate
scrape_configs:
  - job_name: 'heavy-exporter'
    sample_limit: 100000    # Per-scrape sample cap; the default 0 means no limit
    scrape_timeout: 30s     # Increase from default 10s (must not exceed scrape_interval)

# Remote write falling behind
prometheus_remote_storage_samples_pending
# If this grows continuously, increase max_shards or check backend

Conclusion

                            
Optimization Priorities (in order):
1. Manage cardinality: the #1 lever for both memory and query performance
2. Add recording rules: pre-compute expensive queries used in dashboards
3. Tune TSDB flags: WAL compression, appropriate retention, block durations
4. Set query limits: protect the server from runaway queries
5. Profile under load: use pprof to find unexpected hotspots
6. Scale out: shard when a single server is maxed (Part 7)

                        
