Tool Deep Dive: Prometheus Complete Guide

Prometheus Architecture

Prometheus Internal Architecture

flowchart TD
    A[Service Discovery\nK8s, Consul, DNS, file, EC2] --> B[Scrape Manager\nPull /metrics at interval]
    B --> C[TSDB\nTime Series Database]
    C --> D[PromQL Engine\nQuery evaluation]
    C --> E[Remote Write\nTo Thanos/Mimir/Cortex]
    D --> F[HTTP API\n/api/v1/query]
    F --> G[Grafana / Alertmanager]
    C --> H[Rule Manager\nRecording + Alerting rules]
    H --> I[Alertmanager\nNotification routing]

Advanced PromQL Patterns

# Apdex score (Application Performance Index)
# Satisfied: < 300ms, Tolerating: 300ms-1.2s, Frustrated: > 1.2s
(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  + sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2
/ sum(rate(http_request_duration_seconds_count[5m]))

# Error budget remaining (30-day window, 99.9% SLO)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / sum(increase(http_requests_total[30d]))
) / (1 - 0.999)

# Top 5 endpoints by error rate
topk(5,
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)
  / sum(rate(http_requests_total[5m])) by (handler)
)

# Matches when the root filesystem is predicted to run out of free space within 4 hours
# (linear regression over the last 6h of samples)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 4*3600) < 0

# p99 latency grouped by service and method
# (le must stay in the grouping so histogram_quantile can read the buckets)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, method)
)

# Rate of change of a gauge (derivative approximation)
deriv(process_resident_memory_bytes[5m])
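The arithmetic behind the Apdex, error-budget, and predict_linear queries above can be checked offline. The following is a minimal Python sketch, not Prometheus code: the function names and sample numbers are made up for illustration, and predict_linear_sketch fits an ordinary least-squares line and extrapolates from the last sample, which only approximates Prometheus (it extrapolates from the query evaluation time).

```python
def apdex(satisfied_bucket, tolerating_bucket, total):
    """Apdex from cumulative histogram buckets.

    satisfied_bucket:  requests counted in le="0.3"
    tolerating_bucket: requests counted in le="1.2" (cumulative, so it
                       already includes the satisfied requests)
    """
    if total == 0:
        return float("nan")
    return (satisfied_bucket + tolerating_bucket) / 2 / total


def error_budget_remaining(errors, total, slo=0.999):
    """Fraction of the error budget left: 1 - error_ratio / allowed_error_ratio."""
    if total == 0:
        return 1.0
    return 1 - (errors / total) / (1 - slo)


def predict_linear_sketch(samples, horizon_seconds):
    """Least-squares line over (timestamp, value) pairs, extrapolated
    horizon_seconds past the last sample (an approximation, see above)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_seconds)


# Illustrative numbers only.
print(apdex(800, 950, 1000))  # 0.875
free_gb = [(0, 100.0), (3600, 80.0), (7200, 60.0)]  # losing 20 GB per hour
print(predict_linear_sketch(free_gb, 4 * 3600) < 0)  # True: predicted full within 4h
```

Because histogram buckets are cumulative, adding the two buckets and halving counts each satisfied request once and each tolerating request as one half, which is why the query divides by 2 before dividing by the total.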

Recording Rules

Recording rules pre-compute expensive queries at a fixed interval and store the results as new time series. They are essential for dashboards whose queries would otherwise be slow to evaluate every time a panel loads.

# recording-rules.yaml
groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Pre-compute error ratio per service
      - record: service:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)

      # Pre-compute p99 latency per service
      - record: service:http_latency_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

      # Pre-compute request rate per service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

  - name: resource_recording_rules
    interval: 60s
    rules:
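      # CPU usage ratio per pod (usage / CPU requests)
      # container!="" skips cAdvisor's pod-level aggregate series, which would double-count usage
      - record: pod:container_cpu_usage:ratio
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
          / on(namespace, pod) group_left()
          sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod)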

                            
Recording Rule Naming Convention: Use the pattern level:metric:operations, for example service:http_error_ratio:rate5m. The level indicates the aggregation (service, pod, namespace), the metric is what is measured, and the operations describe the computation applied.
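The convention is easy to lint before rules reach production. The following is a minimal Python sketch: the helper name parse_rule_name and the regular expression are assumptions for illustration, and the pattern is deliberately stricter than what Prometheus itself accepts as a metric name.

```python
import re

# level:metric:operations -- exactly three colon-separated parts, each made of
# lowercase letters, digits, and underscores. This is a house-style check,
# not the full grammar of valid Prometheus metric names.
RULE_NAME = re.compile(r"^[a-z_][a-z0-9_]*:[a-z_][a-z0-9_]*:[a-z_][a-z0-9_]*$")


def parse_rule_name(name):
    """Return (level, metric, operations), or None if the name breaks the convention."""
    if not RULE_NAME.match(name):
        return None
    level, metric, operations = name.split(":")
    return level, metric, operations


print(parse_rule_name("service:http_error_ratio:rate5m"))
print(parse_rule_name("http_requests_total"))  # None: no level or operations part
```

A check like this could run in CI over every record: entry in the rule files, so that misnamed rules are caught before a dashboard depends on them.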
                        

Service Discovery

Mechanism	Use Case	Config Key
kubernetes_sd	K8s pods, services, endpoints, nodes	`kubernetes_sd_configs`
consul_sd	Consul-registered services	`consul_sd_configs`
ec2_sd	AWS EC2 instances by tag/filter	`ec2_sd_configs`
dns_sd	DNS SRV/A records	`dns_sd_configs`
file_sd	JSON/YAML files (static with hot reload)	`file_sd_configs`
static_configs	Fixed targets (dev/test)	`static_configs`
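For file_sd, the target file is a JSON (or YAML) list of target groups, each with a targets list and an optional labels map. The following is a minimal Python sketch of generating such a file's contents; the host names, labels, and the helper name build_file_sd are hypothetical.

```python
import json


def build_file_sd(groups):
    """Render (targets, labels) pairs in the JSON layout read by file_sd_configs."""
    return json.dumps(
        [{"targets": sorted(targets), "labels": labels} for targets, labels in groups],
        indent=2,
    )


# Hypothetical hosts and labels, for illustration only.
doc = build_file_sd([
    (
        ["app-2.example.internal:8080", "app-1.example.internal:8080"],
        {"env": "dev", "team": "payments"},
    ),
])
print(doc)
```

Writing the output to a path matched by the files list under file_sd_configs lets Prometheus pick up target changes without a restart, which is the hot reload noted in the table.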

Federation & Remote Write

Federation lets a global Prometheus scrape aggregated metrics from cluster-level instances. Remote write pushes metrics to long-term storage backends.

# Global Prometheus — federates from cluster Prometheus instances
scrape_configs:
  - job_name: 'federate-clusters'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - 'service:http_requests:rate5m'      # Only pull recording rules
        - 'service:http_error_ratio:rate5m'
        - 'service:http_latency_p99:rate5m'
    static_configs:
      - targets: ['prometheus-us-east:9090', 'prometheus-eu-west:9090']

# Remote write to Thanos/Mimir for long-term storage
remote_write:
  - url: "http://mimir.monitoring:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      max_shards: 10
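When debugging federation, it can help to request the /federate endpoint by hand with the same match[] selectors the global Prometheus sends. The following is a minimal Python sketch that only builds the URL; the helper name federate_url is an assumption, and the host comes from the example config above.

```python
from urllib.parse import urlencode


def federate_url(target, selectors, scheme="http"):
    """Build a /federate URL with one match[] parameter per selector."""
    # doseq=True expands the list into repeated match[] parameters.
    query = urlencode({"match[]": selectors}, doseq=True)
    return f"{scheme}://{target}/federate?{query}"


selectors = [
    "service:http_requests:rate5m",
    "service:http_error_ratio:rate5m",
    "service:http_latency_p99:rate5m",
]
print(federate_url("prometheus-us-east:9090", selectors))
```

Fetching the printed URL with curl or a browser shows exactly which series the cluster-level instance exposes for federation, in the text exposition format.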

High Availability Patterns

Pattern	How It Works	Best For
Duplicate Pair	Two identical Prometheus servers scraping same targets	Simple HA, small scale
Thanos Sidecar	Sidecar uploads TSDB blocks to object storage; Thanos Query deduplicates	Multi-cluster, long retention
Grafana Mimir	Horizontally scalable write + read path with object storage	Large scale (10M+ series)
Cortex	Multi-tenant, horizontally scalable (Mimir predecessor)	Multi-tenant SaaS platforms
VictoriaMetrics	Drop-in replacement with clustering support	Cost-effective large scale

TSDB Storage Tuning

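# TSDB storage settings are command-line flags, not prometheus.yml keys
# retention.time / retention.size: keep 15 days locally, or cap by size (whichever limit is hit first)
# min/max-block-duration: advanced flags; leave at defaults unless you know you need to change them
# wal-compression: roughly halves WAL size; enabled by default since v2.20
prometheus \
  --storage.tsdb.path=/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=36h \
  --storage.tsdb.wal-compression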

                            
Memory Planning: As a rule of thumb, Prometheus uses roughly 3KB of RAM per active time series. At 1 million active series, expect about 3GB of RAM for the TSDB head alone, plus query overhead. Monitor prometheus_tsdb_head_series and plan capacity accordingly.
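The rule of thumb turns into a one-line estimate. The following is a minimal Python sketch: the constant and function name are assumptions, 3KB is taken as 3,000 bytes, and the result excludes query overhead.

```python
BYTES_PER_SERIES = 3_000  # assumption: the ~3KB per active series rule of thumb


def estimate_head_memory_gb(active_series, bytes_per_series=BYTES_PER_SERIES):
    """Rough RAM estimate (decimal GB) for the TSDB head, excluding query overhead."""
    return active_series * bytes_per_series / 1e9


print(estimate_head_memory_gb(1_000_000))  # 3.0
```

Feeding the current value of prometheus_tsdb_head_series into this estimate gives a baseline; real usage varies with label sizes and churn, so treat the result as a starting point for capacity planning rather than a limit.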
                        

Production Checklist

Prometheus Production Readiness

Recording rules for all dashboard queries (never query raw high-cardinality metrics in Grafana)
WAL compression enabled (--storage.tsdb.wal-compression)
Remote write configured for long-term retention (Thanos/Mimir)
Service discovery configured (not static targets)
Relabel configs to drop unnecessary labels/metrics
Alert on Prometheus itself: scrape failures, TSDB head series count, WAL corruption
Persistent volume for TSDB data (not ephemeral storage)
Resource limits set: 2-4 CPU, 4-16GB RAM depending on series count
