Prometheus Architecture
Prometheus Internal Architecture
flowchart TD
A[Service Discovery\nK8s, Consul, DNS, file, EC2] --> B[Scrape Manager\nPull /metrics at interval]
B --> C[TSDB\nTime Series Database]
C --> D[PromQL Engine\nQuery evaluation]
C --> E[Remote Write\nTo Thanos/Mimir/Cortex]
D --> F[HTTP API\n/api/v1/query]
F --> G[Grafana / API clients]
C --> H[Rule Manager\nRecording + Alerting rules]
H --> I[Alertmanager\nNotification routing]
Advanced PromQL Patterns
# Apdex score (Application Performance Index)
# Satisfied: < 300ms, Tolerating: 300ms-1.2s, Frustrated: > 1.2s
(
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
+ sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2
/ sum(rate(http_request_duration_seconds_count[5m]))
# Error budget remaining (30-day window, 99.9% SLO)
1 - (
sum(increase(http_requests_total{status=~"5.."}[30d]))
/ sum(increase(http_requests_total[30d]))
) / (1 - 0.999)
# Top 5 endpoints by error rate
topk(5,
sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)
/ sum(rate(http_requests_total[5m])) by (handler)
)
# Predict disk full in 4 hours using linear regression
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 4*3600) < 0
# p99 latency grouped by service and method (le must stay in the grouping)
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, method)
)
# Rate of change of a gauge (derivative approximation)
deriv(process_resident_memory_bytes[5m])
Recording Rules
Recording rules pre-compute expensive queries and store the results as new time series. They are essential for dashboards whose queries would otherwise be slow to evaluate at load time.
# recording-rules.yaml
groups:
  - name: sli_recording_rules
    interval: 30s
    rules:
      # Pre-compute error ratio per service
      - record: service:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
      # Pre-compute p99 latency per service
      - record: service:http_latency_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
      # Pre-compute request rate per service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
  - name: resource_recording_rules
    interval: 60s
    rules:
      # CPU usage ratio per pod
      - record: pod:container_cpu_usage:ratio
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
          / on(namespace, pod) group_left()
          sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod)
Recording Rule Naming Convention: Use the pattern level:metric:operations, e.g., service:http_error_ratio:rate5m. The level indicates the aggregation (service, pod, namespace), the metric is what is measured, and the operations describe the computation applied.
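Once a ratio is recorded under a stable name, alerting rules can reference it directly instead of repeating the full expression. The fragment below is a minimal sketch, not a recommended policy: the group name, alert name, 1% threshold, 10-minute hold, and severity label are all illustrative assumptions.

```yaml
# alerting-rules.yaml (hypothetical file name)
groups:
  - name: sli_alerts                  # hypothetical group name
    rules:
      - alert: HighErrorRatio         # hypothetical alert name
        # Reuses the series produced by the recording rule above
        expr: service:http_error_ratio:rate5m > 0.01   # assumed 1% threshold
        for: 10m                      # assumed hold time before firing
        labels:
          severity: page              # assumed routing label
        annotations:
          summary: "{{ $labels.service }} error ratio above 1% for 10 minutes"
```

Because the alert reads a pre-computed series, each evaluation is a cheap lookup, not a fresh aggregation over every http_requests_total series.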
Service Discovery
| Mechanism | Use Case | Config Key |
|---|---|---|
| kubernetes_sd | K8s pods, services, endpoints, nodes | kubernetes_sd_configs |
| consul_sd | Consul-registered services | consul_sd_configs |
| ec2_sd | AWS EC2 instances by tag/filter | ec2_sd_configs |
| dns_sd | DNS SRV/A records | dns_sd_configs |
| file_sd | JSON/YAML files (static with hot reload) | file_sd_configs |
| static_configs | Fixed targets (dev/test) | static_configs |
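To show how a discovery mechanism and relabeling fit together, here is a sketch of a scrape job using kubernetes_sd_configs with the pod role. The job name and the prometheus.io/scrape annotation are assumptions; the annotation is a widely used convention, not something Prometheus interprets on its own.

```yaml
scrape_configs:
  - job_name: kubernetes-pods         # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                     # discover every pod in the cluster
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true" (convention, assumed)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy discovery metadata onto the scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The __meta_* labels exist only during relabeling, so anything needed at query time has to be copied to a regular label as shown.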
Federation & Remote Write
Federation lets a global Prometheus scrape aggregated metrics from cluster-level instances. Remote write pushes metrics to long-term storage backends.
# Global Prometheus: federates from cluster Prometheus instances
scrape_configs:
  - job_name: 'federate-clusters'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - 'service:http_requests:rate5m'   # Only pull recording rules
        - 'service:http_error_ratio:rate5m'
        - 'service:http_latency_p99:rate5m'
    static_configs:
      - targets: ['prometheus-us-east:9090', 'prometheus-eu-west:9090']
# Remote write to Thanos/Mimir for long-term storage
remote_write:
  - url: "http://mimir.monitoring:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      max_shards: 10
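Remote write sends every sample by default. If the long-term store only needs aggregated series, write_relabel_configs can filter what leaves the server. This is a sketch under the assumption that only series following the service: and pod: recording-rule prefixes used earlier should be shipped; adjust the regex to your own naming.

```yaml
remote_write:
  - url: "http://mimir.monitoring:9009/api/v1/push"
    write_relabel_configs:
      # Keep only recorded series; the regex is anchored, so it must match the whole name
      - source_labels: [__name__]
        regex: "(service|pod):.+"     # assumed prefixes, taken from the rules above
        action: keep
```

Filtering at the sender reduces both network traffic and the active series count in the backend, at the cost of losing raw series for ad hoc long-range queries.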
High Availability Patterns
| Pattern | How It Works | Best For |
|---|---|---|
| Duplicate Pair | Two identical Prometheus servers scraping same targets | Simple HA, small scale |
| Thanos Sidecar | Sidecar uploads TSDB blocks to object storage; Thanos Query deduplicates | Multi-cluster, long retention |
| Grafana Mimir | Horizontally scalable write + read path with object storage | Large scale (10M+ series) |
| Cortex | Multi-tenant, horizontally scalable (Mimir predecessor) | Multi-tenant SaaS platforms |
| VictoriaMetrics | Drop-in replacement with clustering support | Cost-effective large scale |
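For the Duplicate Pair and Thanos Sidecar patterns, each replica usually carries a distinguishing external label so the query layer can deduplicate. The fragment below is a sketch: the cluster and replica label names and values are assumptions, and the second server would be identical except for replica: B.

```yaml
# prometheus.yml on replica A (replica B differs only in the replica label)
global:
  external_labels:
    cluster: us-east                  # hypothetical cluster name
    replica: A                        # hypothetical replica identifier
alerting:
  alert_relabel_configs:
    # Drop the replica label from alerts so Alertmanager sees both
    # replicas' alerts as identical and deduplicates them
    - action: labeldrop
      regex: replica
```

With Thanos, the querier is then started with --query.replica-label=replica so that series differing only in that label are merged at query time.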
TSDB Storage Tuning
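# Prometheus storage flags (passed on the command line; these are not prometheus.yml keys)
--storage.tsdb.path=/prometheus/data       # Data directory
--storage.tsdb.retention.time=15d          # Keep 15 days locally
--storage.tsdb.retention.size=50GB         # Or cap by size (whichever limit is hit first applies)
--storage.tsdb.min-block-duration=2h       # Minimum block duration
--storage.tsdb.max-block-duration=36h      # Maximum block duration (for compaction)
--storage.tsdb.wal-compression             # Compress WAL (on by default in recent releases; roughly halves WAL size)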
Memory Planning: As a rule of thumb, Prometheus uses ~3KB of RAM per active time series. At 1 million active series, expect ~3GB of RAM for the TSDB head alone, plus query overhead. Monitor prometheus_tsdb_head_series and plan capacity accordingly.
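The rule of thumb can be turned into a recorded estimate to graph next to actual memory usage. This is a rough sketch: the group and record names are hypothetical, and the 3KB factor is the approximation from the paragraph above, not a measured value.

```yaml
groups:
  - name: prometheus_capacity         # hypothetical group name
    rules:
      # Estimated head memory in bytes: active series x ~3KB (approximation)
      - record: instance:prometheus_tsdb_head_series:estimated_bytes
        expr: prometheus_tsdb_head_series * 3 * 1024
```

At 1 million active series this evaluates to 3,072,000,000 bytes, in line with the ~3GB figure. Comparing it against process_resident_memory_bytes for the Prometheus process shows how far real usage, including query overhead, departs from the estimate.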
Production Checklist
Prometheus Production Readiness
- Recording rules for all dashboard queries (never query raw high-cardinality metrics in Grafana)
- WAL compression enabled (--storage.tsdb.wal-compression)
- Remote write configured for long-term retention (Thanos/Mimir)
- Service discovery configured (not static targets)
- Relabel configs to drop unnecessary labels/metrics
- Alert on Prometheus itself: scrape failures, TSDB head series count, WAL corruption
- Persistent volume for TSDB data (not ephemeral storage)
- Resource limits set: 2-4 CPU, 4-16GB RAM depending on series count
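The "alert on Prometheus itself" item can be sketched as a small rule group covering the three signals named in the checklist. Alert names, thresholds, hold times, and severity labels are assumptions to adapt; the metrics themselves (up, prometheus_tsdb_head_series, prometheus_tsdb_wal_corruptions_total) are exposed by Prometheus.

```yaml
groups:
  - name: prometheus_self_monitoring  # hypothetical group name
    rules:
      # Scrape failures: a target has been unreachable for 5 minutes
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning           # assumed routing label
      # Head series count: threshold is an assumption, size it to available RAM
      - alert: PrometheusHeadSeriesHigh
        expr: prometheus_tsdb_head_series > 2e6
        for: 15m
        labels:
          severity: warning
      # WAL corruption: any corruption recorded in the last hour
      - alert: PrometheusWALCorruption
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels:
          severity: critical
```

These rules only fire while the server evaluating them is healthy, so in an HA pair it is common to have each replica scrape and alert on the other.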