Prometheus Scaling Limits
Prometheus was designed as a single-server monitoring system with a local TSDB. This design makes it simple to operate but introduces hard limits as your infrastructure grows. Understanding where those limits lie is essential before choosing a scaling strategy.
Single-Server Ceiling
A well-tuned Prometheus server on modern hardware (16 cores, 64 GiB RAM, fast NVMe SSD) can typically handle:
Single Prometheus Server Practical Limits
| Metric | Comfortable | Maximum | Bottleneck |
|---|---|---|---|
| Active time series | 5–8 million | ~15 million | Memory (head block) |
| Ingestion rate | 500K samples/sec | ~1M samples/sec | CPU (appends + compaction) |
| Scrape targets | 5,000 | ~20,000 | Scrape goroutines + network |
| Rule evaluations | 5,000 rules | ~20,000 rules | CPU (PromQL evaluation) |
| Retention (local) | 15–30 days | ~90 days | Disk I/O + compaction time |
When to Scale Out
The key insight is that scaling Prometheus is not about making one server bigger — it’s about distributing the workload across multiple Prometheus instances, each responsible for a subset of the monitoring surface.
Functional Sharding
Functional sharding is the simplest and most commonly recommended approach. Each Prometheus instance is responsible for a logical domain — a team, a namespace, a service tier, or a functional area.
Sharding by Team / Namespace
flowchart TD
subgraph Platform["Platform Team"]
PP[Prometheus
platform-prom]
PT1[K8s Control Plane]
PT2[Ingress Controllers]
PT3[Cert Manager]
end
subgraph Payments["Payments Team"]
PAY[Prometheus
payments-prom]
PAYT1[Payment Service]
PAYT2[Fraud Detection]
PAYT3[Payment Gateway]
end
subgraph Data["Data Platform"]
DP[Prometheus
data-prom]
DPT1[Kafka Clusters]
DPT2[Spark Jobs]
DPT3[Data Pipelines]
end
subgraph Global["Global View"]
FED[Federation
Prometheus]
end
PP --> PT1 & PT2 & PT3
PAY --> PAYT1 & PAYT2 & PAYT3
DP --> DPT1 & DPT2 & DPT3
PP & PAY & DP --> FED
Sharding by Concern
An alternative functional split separates by monitoring concern rather than team ownership:
- Infrastructure Prometheus — Node exporters, kube-state-metrics, cAdvisor
- Application Prometheus — Service-level metrics (HTTP, gRPC, custom business metrics)
- Synthetic Prometheus — Blackbox exporter probes, external availability checks
Configuration Patterns
# Prometheus instance: payments-prom
# Only scrapes targets in the payments namespace
global:
scrape_interval: 15s
external_labels:
cluster: production-us-east-1
shard: payments # Identifies this shard in federation
scrape_configs:
# Auto-discover pods in payments namespace
- job_name: 'payments-pods'
kubernetes_sd_configs:
- role: pod
namespaces:
names: ['payments', 'payments-staging']
relabel_configs:
# Only scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: 'true'
# Use annotation for custom port
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: (.+)
replacement: ${1}
# Preserve namespace and pod labels
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
# ServiceMonitors for payments team
- job_name: 'payments-services'
kubernetes_sd_configs:
- role: endpoints
namespaces:
names: ['payments']
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_monitoring]
action: keep
regex: 'true'
Hashmod Sharding
When you need to split a single large scrape job across multiple Prometheus instances (e.g., 10,000 node exporters that exceed one server’s capacity), hashmod sharding distributes targets deterministically using consistent hashing.
How Hashmod Works
The hashmod relabeling action computes a hash of a source label value, takes modulo N (number of shards), and stores the result. Each Prometheus instance then keeps only targets where the hash result matches its shard index.
flowchart LR
subgraph Targets["All Targets (10,000 nodes)"]
T["hash(__address__) mod 3"]
end
subgraph S0["Shard 0 (~3,333 targets)"]
P0[prometheus-shard-0]
end
subgraph S1["Shard 1 (~3,333 targets)"]
P1[prometheus-shard-1]
end
subgraph S2["Shard 2 (~3,334 targets)"]
P2[prometheus-shard-2]
end
T -->|"mod=0"| P0
T -->|"mod=1"| P1
T -->|"mod=2"| P2
Configuration
# Shard 0 of 3 — each shard uses identical config except SHARD_INDEX
# Deploy via Helm with shard index as environment variable
global:
scrape_interval: 30s
external_labels:
cluster: production
__replica__: shard-0 # For deduplication in long-term storage
scrape_configs:
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: node
relabel_configs:
# Standard node exporter relabeling
- source_labels: [__meta_kubernetes_node_name]
target_label: instance
# === HASHMOD SHARDING ===
# Step 1: Compute hash of __address__, mod 3
- source_labels: [__address__]
modulus: 3
target_label: __tmp_hash
action: hashmod
# Step 2: Keep only targets where hash == this shard's index
- source_labels: [__tmp_hash]
regex: "0" # Change to "1" for shard-1, "2" for shard-2
action: keep
# Kubernetes StatefulSet for hashmod shards
# Each replica gets its shard index from the pod ordinal
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus-sharded
spec:
replicas: 3
serviceName: prometheus-sharded
template:
spec:
containers:
- name: prometheus
image: prom/prometheus:v2.53.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=15d'
- '--web.enable-lifecycle'
env:
- name: SHARD_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: storage
mountPath: /prometheus
volumeClaimTemplates:
- metadata:
name: storage
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 200Gi
storageClassName: gp3
Tradeoffs & Pitfalls
- Pro: Even distribution, stateless shard assignment, easy to add/remove shards
- Pro: Each shard is independent — failure affects only 1/N of targets
- Con: Changing shard count redistributes ALL targets (brief gap in data)
- Con: Cross-shard queries impossible without federation or remote storage
- Con: Recording rules that span targets across shards produce incomplete results
Federation
Federation allows one Prometheus server to scrape selected time series from another Prometheus server’s /federate endpoint. It’s primarily used to create aggregated global views from distributed shards.
Hierarchical Federation
In hierarchical federation, leaf Prometheus instances scrape targets directly, while a higher-level “global” Prometheus federates aggregated metrics from the leaves:
# Global Prometheus — federates from shard Prometheus instances
# Only pulls pre-aggregated recording rules, NOT raw metrics
global:
scrape_interval: 60s # Longer interval — we're pulling aggregates
external_labels:
cluster: global
scrape_configs:
# Federate from each shard
- job_name: 'federate-shards'
honor_labels: true # Preserve original labels from shards
metrics_path: '/federate'
params:
'match[]':
# Only federate recording rules (namespace-level aggregates)
- '{__name__=~"namespace:.+"}'
# And critical alerts-relevant metrics
- '{__name__=~"job:.+"}'
# Plus cluster-level SLO indicators
- '{__name__=~"slo:.+"}'
static_configs:
- targets:
- 'prometheus-payments:9090'
- 'prometheus-platform:9090'
- 'prometheus-data:9090'
labels:
federated: 'true'
# Also federate from infrastructure shards
- job_name: 'federate-infra'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{__name__=~"node:.+"}' # Node-level aggregates
- '{__name__=~"cluster:.+"}' # Cluster-wide metrics
- 'up' # Target health
kubernetes_sd_configs:
- role: service
namespaces:
names: ['monitoring']
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_app]
regex: prometheus-infra-shard
action: keep
# Recording rules on leaf Prometheus (payments-prom)
# These are what get federated to global
groups:
- name: namespace_aggregates
interval: 30s
rules:
# Request rate per service
- record: namespace:http_requests_total:rate5m
expr: |
sum by (namespace, service, method, status_code) (
rate(http_requests_total[5m])
)
# P99 latency per service
- record: namespace:http_duration_seconds:p99
expr: |
histogram_quantile(0.99,
sum by (namespace, service, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# Error ratio per service
- record: namespace:http_errors:ratio_rate5m
expr: |
sum by (namespace, service) (rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum by (namespace, service) (rate(http_requests_total[5m]))
Cross-Service Federation
Cross-service federation is used when one team needs specific metrics from another team’s Prometheus without a full global view:
# SRE team federates SLO-relevant metrics from service teams
scrape_configs:
- job_name: 'federate-slo-metrics'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
# Only the specific SLI metrics needed for SLO calculation
- 'http_requests_total{job="payment-api"}'
- 'http_request_duration_seconds_bucket{job="payment-api"}'
- 'grpc_server_handled_total{job="order-service"}'
static_configs:
- targets: ['prometheus-payments:9090']
Federation Pitfalls
- Staleness: Federated data is always at least one scrape interval behind. A 60s federation interval means up to 60s staleness.
- Label conflicts: Always use
honor_labels: trueto preserve original labels. Without it, the federating Prometheus overwritesinstanceandjoblabels. - Cardinality explosion: Federating
{__name__=~".+"}(all metrics) defeats the purpose — the global Prometheus must handle the combined cardinality of all leaves. - Single point of failure: If the global Prometheus goes down, cross-shard dashboards break. Use HA for the global instance too.
High Availability
Prometheus HA is conceptually simple: run two (or more) identical Prometheus instances scraping the same targets. Both collect the same data independently — there is no leader election, no replication protocol, no consensus algorithm.
HA Pairs Pattern
flowchart TD
subgraph Targets["Scrape Targets"]
T1[Service A]
T2[Service B]
T3[Service C]
end
subgraph HA["HA Pair"]
P1["Prometheus
replica=prom-0"]
P2["Prometheus
replica=prom-1"]
end
subgraph Query["Query Layer"]
TH[Thanos Query
or Mimir]
DEDUP["Deduplication
by __replica__ label"]
end
T1 & T2 & T3 --> P1
T1 & T2 & T3 --> P2
P1 --> TH
P2 --> TH
TH --> DEDUP
# HA Pair — Prometheus replica 0
# Identical config on replica 1, only external_labels differ
global:
scrape_interval: 15s
external_labels:
cluster: production
__replica__: prom-0 # prom-1 on the other instance
# Both replicas have IDENTICAL scrape_configs
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: 'true'
# Both remote-write to the same long-term store
remote_write:
- url: https://mimir.internal/api/v1/push
headers:
X-Scope-OrgID: production
queue_config:
max_samples_per_send: 2000
batch_send_deadline: 5s
Query-Time Deduplication
When both replicas remote-write to Mimir or Thanos, the query layer deduplicates by the __replica__ label. The deduplication strategy picks one replica’s data per time window, preferring the one with fewer gaps:
# Thanos Query — deduplication configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-query
spec:
template:
spec:
containers:
- name: thanos-query
image: quay.io/thanos/thanos:v0.35.0
args:
- query
- --log.level=info
- --query.replica-label=__replica__ # Dedup on this label
- --query.auto-downsampling
- --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc
- --store=dnssrv+_grpc._tcp.prometheus-ha.monitoring.svc
ports:
- containerPort: 10902
name: http
- containerPort: 10901
name: grpc
HA Alertmanager Cluster
With HA Prometheus pairs, both replicas fire the same alerts. Alertmanager handles this via its gossip-based clustering protocol (using Hashicorp’s Memberlist):
# Alertmanager cluster — 3 instances for HA
# They gossip to deduplicate notifications
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: alertmanager
spec:
replicas: 3
template:
spec:
containers:
- name: alertmanager
image: prom/alertmanager:v0.27.0
args:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--cluster.listen-address=0.0.0.0:9094'
# Peer discovery via DNS
- '--cluster.peer=alertmanager-0.alertmanager:9094'
- '--cluster.peer=alertmanager-1.alertmanager:9094'
- '--cluster.peer=alertmanager-2.alertmanager:9094'
# Settle time before sending notifications
- '--cluster.settle-timeout=60s'
ports:
- containerPort: 9093 # HTTP API
- containerPort: 9094 # Cluster gossip
# Both Prometheus replicas send alerts to ALL Alertmanager instances
# Alertmanager cluster deduplicates internally
alerting:
alert_relabel_configs:
# Drop the replica label from alerts — both replicas fire identical alerts
- action: labeldrop
regex: __replica__
alertmanagers:
- kubernetes_sd_configs:
- role: endpoints
namespaces:
names: ['monitoring']
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
regex: alertmanager
action: keep
- source_labels: [__meta_kubernetes_endpoint_port_name]
regex: http
action: keep
Remote Write Fan-Out
Remote write allows Prometheus to push samples to one or more remote endpoints in parallel. This is the foundation for centralizing data from distributed shards into long-term storage.
Write-Path Relabeling
# Remote write with selective forwarding
# Only send aggregated metrics to the global store (cost optimization)
remote_write:
# Primary: Send everything to team's Mimir tenant
- url: https://mimir.internal/api/v1/push
headers:
X-Scope-OrgID: payments-team
queue_config:
max_samples_per_send: 2000
batch_send_deadline: 5s
max_shards: 30
# Secondary: Send only aggregates to global view
- url: https://mimir-global.internal/api/v1/push
headers:
X-Scope-OrgID: global
write_relabel_configs:
# Only forward recording rules (pre-aggregated)
- source_labels: [__name__]
regex: '(namespace|job|cluster|slo):.*'
action: keep
queue_config:
max_samples_per_send: 1000
batch_send_deadline: 10s
max_shards: 10
Queue Configuration Tuning
Remote Write Queue Parameters
| Parameter | Default | Recommended (High Volume) | Effect |
|---|---|---|---|
max_shards | 200 | 30–50 | Max concurrent write goroutines |
min_shards | 1 | 5–10 | Min concurrent writes (faster ramp-up) |
max_samples_per_send | 2000 | 2000–5000 | Batch size per request |
batch_send_deadline | 5s | 5s | Max wait before sending partial batch |
capacity | 2500 | 10000 | Per-shard buffer capacity |
retry_on_http_429 | true | true | Retry on rate-limit responses |
# Key metrics to monitor remote write health
# Pending samples in the WAL waiting to be sent
prometheus_remote_storage_samples_pending
# Failed sends (should be near zero)
rate(prometheus_remote_storage_samples_failed_total[5m])
# Highest timestamp successfully sent vs current time (lag)
prometheus_remote_storage_highest_timestamp_in_seconds
- prometheus_remote_storage_queue_highest_sent_timestamp_seconds
# Shard scaling — are we at max shards?
prometheus_remote_storage_shards
prometheus_remote_storage_shards_max
Combined Patterns
Reference Architecture
A production deployment typically combines multiple patterns. Here’s a reference architecture for a mid-size organization (500+ microservices, 50M+ active series):
flowchart TD
subgraph Leaf["Leaf Layer (Per-Team Shards)"]
direction LR
LP1["Platform
HA Pair"]
LP2["Payments
HA Pair"]
LP3["Data
HA Pair"]
LP4["Frontend
HA Pair"]
end
subgraph LTS["Long-Term Storage"]
MIMIR[Grafana Mimir
Multi-Tenant]
end
subgraph Query["Query Layer"]
GF[Grafana
Dashboards]
end
subgraph Alert["Alert Layer"]
AM["Alertmanager
HA Cluster (3)"]
end
LP1 & LP2 & LP3 & LP4 -->|"remote_write"| MIMIR
LP1 & LP2 & LP3 & LP4 -->|"alerts"| AM
GF -->|"PromQL"| MIMIR
Decision Matrix
When to Use Each Pattern
| Pattern | Use When | Avoid When |
|---|---|---|
| Functional Sharding | Clear team/domain boundaries; independent ownership wanted | Single team owns everything; <5M series total |
| Hashmod Sharding | One massive job exceeds single-server capacity (e.g., 20K nodes) | Cross-target recording rules needed; few targets |
| Hierarchical Federation | Global dashboards needed without central long-term store | High-cardinality raw metrics needed globally |
| HA Pairs | Always — any production deployment | Cost-constrained dev/staging environments |
| Remote Write + Mimir/Thanos | Global query view; retention >30d; multi-cluster | Single small cluster; cost sensitivity |
Conclusion
Scaling Prometheus is not a single decision but a combination of complementary patterns applied based on your specific constraints:
- Start with functional sharding — it aligns with team ownership and is simplest to operate
- Always deploy HA pairs — the cost of a second replica is trivial compared to monitoring blind spots
- Federate aggregates, not raw metrics — recording rules on leaves, federation of pre-aggregated series
- Use remote write for global view — Mimir/Thanos provide better global querying than federation
- Hashmod is a last resort — use it only when a single job truly exceeds one server
- Monitor your monitoring — track WAL size, ingestion rate, remote write lag, and query latency