Tool Deep Dive: Prometheus Complete Guide

May 14, 2026 Wasil Zafar 22 min read

A comprehensive reference guide to Prometheus — advanced PromQL patterns, recording rules for performance, federation for multi-cluster setups, remote write for long-term storage, service discovery mechanisms, TSDB tuning, and high-availability architectures with Thanos and Cortex.

Table of Contents

  1. Prometheus Architecture
  2. Advanced PromQL
  3. Recording Rules
  4. Service Discovery
  5. Federation & Remote Write
  6. High Availability Patterns
  7. TSDB Storage Tuning
  8. Production Checklist

Prometheus Architecture

Prometheus Internal Architecture
flowchart TD
    A[Service Discovery\nK8s, Consul, DNS, file, EC2] --> B[Scrape Manager\nPull /metrics at interval]
    B --> C[TSDB\nTime Series Database]
    C --> D[PromQL Engine\nQuery evaluation]
    C --> E[Remote Write\nTo Thanos/Mimir/Cortex]
    D --> F[HTTP API\n/api/v1/query]
    F --> G[Grafana / API clients]
    C --> H[Rule Manager\nRecording + Alerting rules]
    H --> I[Alertmanager\nNotification routing]

Advanced PromQL Patterns

# Apdex score (Application Performance Index)
# Satisfied: < 300ms, Tolerating: 300ms-1.2s, Frustrated: > 1.2s
(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  + sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2
/ sum(rate(http_request_duration_seconds_count[5m]))

# Error budget remaining (30-day window, 99.9% SLO)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / sum(increase(http_requests_total[30d]))
) / (1 - 0.999)

# Top 5 endpoints by error rate
topk(5,
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)
  / sum(rate(http_requests_total[5m])) by (handler)
)

# Predict disk full in 4 hours using linear regression
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 4*3600) < 0

# p99 latency grouped by service and method (le must stay in the grouping for histogram_quantile)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, method)
)

# Rate of change of a gauge (derivative approximation)
deriv(process_resident_memory_bytes[5m])

Recording Rules

Recording rules pre-compute expensive queries at a fixed interval and store the results as new time series. They are essential for dashboards whose queries would otherwise be slow to evaluate at render time.

# recording-rules.yaml
groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Pre-compute error ratio per service
      - record: service:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)

      # Pre-compute p99 latency per service
      - record: service:http_latency_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

      # Pre-compute request rate per service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

  - name: resource_recording_rules
    interval: 60s
    rules:
      # CPU usage as a ratio of CPU requests, per pod
      # container!="" skips the pod-level cgroup series, which would otherwise be double counted
      - record: pod:container_cpu_usage:ratio
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
          / on(namespace, pod)
          sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod)
Recording Rule Naming Convention: Use the pattern level:metric:operations — e.g., service:http_error_ratio:rate5m. The level indicates the aggregation (service, pod, namespace), the metric is what is measured, and the operations describe the computation applied.
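
Once a ratio is recorded, alerts and dashboards can read the stored series directly instead of re-running the underlying rate() queries. The sketch below assumes the service:http_error_ratio:rate5m rule from above is loaded; the file name, alert name, 5% threshold, and severity label are illustrative assumptions, not values from this guide.

```yaml
# alerting-rules.yaml (hypothetical file name)
groups:
  - name: sli_alerts
    rules:
      - alert: HighServiceErrorRatio                   # hypothetical alert name
        # Reads the pre-computed series rather than the raw counters
        expr: service:http_error_ratio:rate5m > 0.05   # 5% threshold is an assumption
        for: 10m                                       # must hold for 10 minutes before firing
        labels:
          severity: page                               # assumed routing label
        annotations:
          summary: "{{ $labels.service }} error ratio above 5% for 10 minutes"
```

Because the recorded series carries only the service label, the alert stays cheap to evaluate regardless of how many raw series sit behind it.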

Service Discovery

| Mechanism | Use Case | Config Key |
|---|---|---|
| kubernetes_sd | K8s pods, services, endpoints, nodes | kubernetes_sd_configs |
| consul_sd | Consul-registered services | consul_sd_configs |
| ec2_sd | AWS EC2 instances by tag/filter | ec2_sd_configs |
| dns_sd | DNS SRV/A records | dns_sd_configs |
| file_sd | JSON/YAML files (static with hot reload) | file_sd_configs |
| static_configs | Fixed targets (dev/test) | static_configs |
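
To make the table concrete, here is a minimal sketch of two of these mechanisms in one scrape configuration. The job names and file path are assumptions, and the prometheus.io/scrape annotation is a widely used community convention rather than something Prometheus enforces.

```yaml
scrape_configs:
  - job_name: kubernetes-pods            # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                        # discover every pod through the Kubernetes API
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy namespace and pod name onto every scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

  - job_name: file-targets               # hypothetical job name
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json   # assumed path; edits are picked up without a restart
        refresh_interval: 1m
```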

Federation & Remote Write

Federation lets a global Prometheus scrape aggregated metrics from cluster-level instances. Remote write pushes metrics to long-term storage backends.

# Global Prometheus — federates from cluster Prometheus instances
scrape_configs:
  - job_name: 'federate-clusters'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - 'service:http_requests:rate5m'      # Only pull recording rules
        - 'service:http_error_ratio:rate5m'
        - 'service:http_latency_p99:rate5m'
    static_configs:
      - targets: ['prometheus-us-east:9090', 'prometheus-eu-west:9090']

# Remote write to Mimir for long-term storage (Thanos Receive uses /api/v1/receive instead)
remote_write:
  - url: "http://mimir.monitoring:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      max_shards: 10

High Availability Patterns

| Pattern | How It Works | Best For |
|---|---|---|
| Duplicate Pair | Two identical Prometheus servers scraping same targets | Simple HA, small scale |
| Thanos Sidecar | Sidecar uploads TSDB blocks to object storage; Thanos Query deduplicates | Multi-cluster, long retention |
| Grafana Mimir | Horizontally scalable write + read path with object storage | Large scale (10M+ series) |
| Cortex | Multi-tenant, horizontally scalable (Mimir predecessor) | Multi-tenant SaaS platforms |
| VictoriaMetrics | Drop-in replacement with clustering support | Cost-effective large scale |
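
Deduplicating a replica pair depends on each server tagging its data with a distinguishing external label. A minimal sketch, assuming the label names cluster and replica (a common convention, not a requirement) and made-up values:

```yaml
# prometheus.yml on replica A; replica B is identical except for the replica value
global:
  external_labels:
    cluster: us-east          # assumed cluster name, shared by both replicas
    replica: prometheus-a     # unique per replica; prometheus-b on the second server
```

With Thanos, the querier can then be started with --query.replica-label=replica so that series differing only in that label are merged at query time. Other backends use their own label conventions, so check the one you deploy.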

TSDB Storage Tuning

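# TSDB storage settings are command-line flags, not prometheus.yml keys
# retention.time: keep 15 days locally
# retention.size: or cap by size, whichever limit is reached first
# min/max-block-duration: minimum and maximum block span used for compaction
# wal-compression: compress the WAL (roughly halves WAL size)
prometheus \
  --storage.tsdb.path=/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=36h \
  --storage.tsdb.wal-compression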
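Memory Planning: As a rule of thumb, Prometheus uses roughly 3KB of RAM per active time series, though the exact figure varies with label count and series churn. At 1 million active series, expect about 3GB of RAM for the TSDB head alone, plus query overhead. Monitor prometheus_tsdb_head_series and plan capacity accordingly.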
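
One lever for staying inside that memory budget is to drop series at scrape time, before they reach the TSDB head. The following is a sketch only: the job, target, metric patterns, and label name are all hypothetical and should be replaced with whatever your own cardinality analysis identifies.

```yaml
scrape_configs:
  - job_name: app                          # hypothetical job
    static_configs:
      - targets: ['app:8080']              # hypothetical target
    metric_relabel_configs:
      # Drop whole metric families that no dashboard or alert uses
      - source_labels: [__name__]
        regex: "go_gc_.*|go_memstats_.*"   # example pattern, adjust to your metrics
        action: drop
      # Remove a high-cardinality label from all remaining series
      - regex: "request_id"                # hypothetical label name
        action: labeldrop
```

Use labeldrop with care: if the removed label was the only thing distinguishing two series, they collide after relabeling and the scrape reports duplicate samples.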

Production Checklist

Prometheus Production Readiness

  1. Recording rules for all dashboard queries (never query raw high-cardinality metrics in Grafana)
  2. WAL compression enabled (--storage.tsdb.wal-compression; on by default since v2.20)
  3. Remote write configured for long-term retention (Thanos/Mimir)
  4. Service discovery configured (not static targets)
  5. Relabel configs to drop unnecessary labels/metrics
  6. Alert on Prometheus itself: scrape failures, TSDB head series count, WAL corruption
  7. Persistent volume for TSDB data (not ephemeral storage)
  8. Resource limits set: 2-4 CPU, 4-16GB RAM depending on series count
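
Item 6 can be sketched as a small rule group covering the three signals it names. The alert names, durations, and the 2 million series threshold are assumptions to tune for your environment; the metrics themselves are ones Prometheus exposes about itself.

```yaml
groups:
  - name: prometheus_self_monitoring
    rules:
      - alert: TargetDown                         # scrape failures
        expr: up == 0
        for: 5m
      - alert: PrometheusHeadSeriesHigh           # TSDB head series count
        expr: prometheus_tsdb_head_series > 2e6   # assumed threshold, size it to your RAM
        for: 15m
      - alert: PrometheusWALCorruption            # WAL corruption
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
```

These rules are best evaluated by a second Prometheus, or at least routed through an independent channel, since a server that is down cannot alert on itself.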