Prometheus Deep Dive Part 7: Sharding, Federation & High Availability

June 15, 2026 Wasil Zafar 30 min read

A single Prometheus server has practical limits. Learn when and how to scale horizontally with functional sharding, hashmod-based sharding, hierarchical federation, and high-availability pairs — building distributed metric collection that remains reliable at enterprise scale.

Table of Contents

  1. Prometheus Scaling Limits
  2. Functional Sharding
  3. Hashmod Sharding
  4. Federation
  5. High Availability
  6. Remote Write Fan-Out
  7. Combined Patterns
  8. Conclusion

Prometheus Scaling Limits

Prometheus was designed as a single-server monitoring system with a local TSDB. This design makes it simple to operate but introduces hard limits as your infrastructure grows. Understanding where those limits lie is essential before choosing a scaling strategy.

Single-Server Ceiling

A well-tuned Prometheus server on modern hardware (16 cores, 64 GiB RAM, fast NVMe SSD) can typically handle:

Single Prometheus Server Practical Limits

| Metric | Comfortable | Maximum | Bottleneck |
|---|---|---|---|
| Active time series | 5–8 million | ~15 million | Memory (head block) |
| Ingestion rate | 500K samples/sec | ~1M samples/sec | CPU (appends + compaction) |
| Scrape targets | 5,000 | ~20,000 | Scrape goroutines + network |
| Rule evaluations | 5,000 rules | ~20,000 rules | CPU (PromQL evaluation) |
| Retention (local) | 15–30 days | ~90 days | Disk I/O + compaction time |

When to Scale Out

Scale-Out Signals: If you observe any of these consistently, it’s time to distribute:
  • Ingestion rate above 80% of capacity
  • Head series count approaching 10M
  • P99 query latency above 10s for dashboard queries
  • WAL replay taking more than 10 minutes after a restart
  • Compaction falling behind (block count keeps increasing)
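
Most of these signals can be watched through Prometheus's own metrics. The rule file below is a sketch rather than a tuned rule set: the group and alert names are invented for illustration, the 400K threshold assumes the 500K samples/sec "comfortable" figure from the table above as the capacity, and the latency rule covers every query the engine evaluates, not only dashboard queries. Compaction lag is left out because no single metric captures it cleanly.

```yaml
groups:
  - name: prometheus_scale_out_signals      # hypothetical group name
    rules:
      # Ingestion rate above 80% of an assumed 500K samples/sec capacity
      - alert: PrometheusIngestionNearCapacity
        expr: sum without (type) (rate(prometheus_tsdb_head_samples_appended_total[5m])) > 400000
        for: 30m

      # Head series count approaching 10M
      - alert: PrometheusHeadSeriesHigh
        expr: prometheus_tsdb_head_series > 1e7
        for: 30m

      # P99 evaluation time above 10s (all queries, including rule evaluations)
      - alert: PrometheusSlowQueries
        expr: prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.99"} > 10
        for: 15m

      # Last WAL/data replay after restart took longer than 10 minutes
      - alert: PrometheusSlowReplay
        expr: prometheus_tsdb_data_replay_duration_seconds > 600
```

The `for` durations are arbitrary starting points; the intent is to alert on sustained pressure rather than short spikes.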

The key insight is that scaling Prometheus is not about making one server bigger — it’s about distributing the workload across multiple Prometheus instances, each responsible for a subset of the monitoring surface.

Functional Sharding

Functional sharding is the simplest and most commonly recommended approach. Each Prometheus instance is responsible for a logical domain — a team, a namespace, a service tier, or a functional area.

Sharding by Team / Namespace

Functional Sharding by Team
flowchart TD
    subgraph Platform["Platform Team"]
        PP[Prometheus<br/>platform-prom]
        PT1[K8s Control Plane]
        PT2[Ingress Controllers]
        PT3[Cert Manager]
    end
    subgraph Payments["Payments Team"]
        PAY[Prometheus<br/>payments-prom]
        PAYT1[Payment Service]
        PAYT2[Fraud Detection]
        PAYT3[Payment Gateway]
    end
    subgraph Data["Data Platform"]
        DP[Prometheus<br/>data-prom]
        DPT1[Kafka Clusters]
        DPT2[Spark Jobs]
        DPT3[Data Pipelines]
    end
    subgraph Global["Global View"]
        FED[Federation<br/>Prometheus]
    end
    PP --> PT1 & PT2 & PT3
    PAY --> PAYT1 & PAYT2 & PAYT3
    DP --> DPT1 & DPT2 & DPT3
    PP & PAY & DP --> FED

Sharding by Concern

An alternative functional split separates by monitoring concern rather than team ownership:

  • Infrastructure Prometheus — Node exporters, kube-state-metrics, cAdvisor
  • Application Prometheus — Service-level metrics (HTTP, gRPC, custom business metrics)
  • Synthetic Prometheus — Blackbox exporter probes, external availability checks
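
As an illustration of the concern-based split, here is a minimal sketch of what the synthetic shard's configuration might look like. It uses the usual Blackbox exporter relabeling pattern; the shard name, probe URL, and exporter address are placeholders, not values from a real deployment.

```yaml
# Prometheus instance: synthetic-prom (hypothetical name)
global:
  scrape_interval: 30s
  external_labels:
    cluster: production-us-east-1
    shard: synthetic                 # assumed shard label value

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]             # module must exist in the exporter's config
    static_configs:
      - targets:
          - https://example.com/healthz    # placeholder URL to probe
    relabel_configs:
      # Pass the probed URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the Blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter address
```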

Configuration Patterns

# Prometheus instance: payments-prom
# Only scrapes targets in the payments namespace
global:
  scrape_interval: 15s
  external_labels:
    cluster: production-us-east-1
    shard: payments    # Identifies this shard in federation

scrape_configs:
  # Auto-discover pods in payments namespace
  - job_name: 'payments-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['payments', 'payments-staging']

    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

      # Use annotation for custom port: keep the host, swap in the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

      # Preserve namespace and pod labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

  # Service endpoints for the payments team (plain endpoints discovery, not a ServiceMonitor CRD)
  - job_name: 'payments-services'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: ['payments']
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_monitoring]
        action: keep
        regex: 'true'

Hashmod Sharding

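When you need to split a single large scrape job across multiple Prometheus instances (e.g., 10,000 node exporters that exceed one server’s capacity), hashmod sharding distributes targets deterministically by hashing a label value and taking the result modulo the number of shards. This is plain modulo hashing, not consistent hashing, which matters when the shard count changes.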

How Hashmod Works

The hashmod relabeling action computes a hash of a source label value, takes modulo N (number of shards), and stores the result. Each Prometheus instance then keeps only targets where the hash result matches its shard index.

Hashmod Sharding (3 Shards)
flowchart LR
    subgraph Targets["All Targets (10,000 nodes)"]
        T["hash(__address__) mod 3"]
    end

    subgraph S0["Shard 0 (~3,333 targets)"]
        P0[prometheus-shard-0]
    end
    subgraph S1["Shard 1 (~3,333 targets)"]
        P1[prometheus-shard-1]
    end
    subgraph S2["Shard 2 (~3,334 targets)"]
        P2[prometheus-shard-2]
    end

    T -->|"mod=0"| P0
    T -->|"mod=1"| P1
    T -->|"mod=2"| P2
                            

Configuration

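# Shard 0 of 3: every shard uses the same config except for the shard index
# Render one config per shard (e.g. via Helm); Prometheus does not expand
# environment variables inside relabel rules
global:
  scrape_interval: 30s
  external_labels:
    cluster: production
    shard: "0"    # Identifies this shard. Do not use a replica label here:
                  # shards hold different data and must never be deduplicated
                  # against each other in long-term storage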

scrape_configs:
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node

    relabel_configs:
      # Standard node exporter relabeling
      - source_labels: [__meta_kubernetes_node_name]
        target_label: instance

      # === HASHMOD SHARDING ===
      # Step 1: Compute hash of __address__, mod 3
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod

      # Step 2: Keep only targets where hash == this shard's index
      - source_labels: [__tmp_hash]
        regex: "0"          # Change to "1" for shard-1, "2" for shard-2
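        action: keep

# Kubernetes StatefulSet for hashmod shards
# Each replica gets its shard index from the pod ordinal
# (the pod-index label is set automatically on Kubernetes 1.28+)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-sharded
spec:
  replicas: 3
  serviceName: prometheus-sharded
  selector:
    matchLabels:
      app: prometheus-sharded
  template:
    metadata:
      labels:
        app: prometheus-sharded
    spec: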
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=15d'
            - '--web.enable-lifecycle'
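          env:
            # SHARD_INDEX still has to be rendered into prometheus.yml
            # (e.g. by an init container) before Prometheus starts
            - name: SHARD_INDEX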
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
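            - name: storage
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-sharded-config   # assumed ConfigMap holding prometheus.yml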
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi
        storageClassName: gp3

Tradeoffs & Pitfalls

Hashmod Considerations:
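  • Pro: Even distribution and stateless shard assignment, with no coordination between shards
  • Pro: Each shard is independent — failure affects only 1/N of targets
  • Con: Changing the shard count reassigns most targets to a different shard, so each target’s history ends up split across shards (and scrapes may briefly gap during the rollout)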
  • Con: Cross-shard queries impossible without federation or remote storage
  • Con: Recording rules that span targets across shards produce incomplete results
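
The last pitfall has a common workaround: each shard records a partial aggregate over the targets it owns, and the partials are summed wherever all shards are visible (a federating Prometheus or remote storage). The sketch below assumes node exporter's `node_cpu_seconds_total` metric; the group and rule names are hypothetical. It only works for aggregations that combine cleanly, such as sums and counts, not for quantiles or averages computed per shard.

```yaml
# Rule file on every hashmod shard: partial sum over the targets this shard owns
groups:
  - name: shard_partials                 # hypothetical group name
    rules:
      - record: node:node_cpu_seconds:rate5m_partial
        expr: sum by (mode) (rate(node_cpu_seconds_total[5m]))
---
# Rule file on the global Prometheus: combine the partials from all shards
groups:
  - name: cluster_totals                 # hypothetical group name
    rules:
      - record: cluster:node_cpu_seconds:rate5m
        expr: sum by (mode) (node:node_cpu_seconds:rate5m_partial)
```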

Federation

Federation allows one Prometheus server to scrape selected time series from another Prometheus server’s /federate endpoint. It’s primarily used to create aggregated global views from distributed shards.

Hierarchical Federation

In hierarchical federation, leaf Prometheus instances scrape targets directly, while a higher-level “global” Prometheus federates aggregated metrics from the leaves:

# Global Prometheus — federates from shard Prometheus instances
# Only pulls pre-aggregated recording rules, NOT raw metrics
global:
  scrape_interval: 60s    # Longer interval — we're pulling aggregates
  external_labels:
    cluster: global

scrape_configs:
  # Federate from each shard
  - job_name: 'federate-shards'
    honor_labels: true     # Preserve original labels from shards
    metrics_path: '/federate'
    params:
      'match[]':
        # Only federate recording rules (namespace-level aggregates)
        - '{__name__=~"namespace:.+"}'
        # And critical alerts-relevant metrics
        - '{__name__=~"job:.+"}'
        # Plus cluster-level SLO indicators
        - '{__name__=~"slo:.+"}'
    static_configs:
      - targets:
          - 'prometheus-payments:9090'
          - 'prometheus-platform:9090'
          - 'prometheus-data:9090'
        labels:
          federated: 'true'

  # Also federate from infrastructure shards
  - job_name: 'federate-infra'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"node:.+"}'       # Node-level aggregates
        - '{__name__=~"cluster:.+"}'    # Cluster-wide metrics
        - 'up'                          # Target health
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names: ['monitoring']
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: prometheus-infra-shard
        action: keep

The Golden Rule of Federation: Only federate aggregated metrics (recording rules). Never federate raw high-cardinality series — you’ll simply move the cardinality problem to the global Prometheus. Use recording rules on leaves to pre-aggregate, then federate the aggregates.

# Recording rules on leaf Prometheus (payments-prom)
# These are what get federated to global
groups:
  - name: namespace_aggregates
    interval: 30s
    rules:
      # Request rate per service
      - record: namespace:http_requests_total:rate5m
        expr: |
          sum by (namespace, service, method, status_code) (
            rate(http_requests_total[5m])
          )

      # P99 latency per service
      - record: namespace:http_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (namespace, service, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # Error ratio per service
      - record: namespace:http_errors:ratio_rate5m
        expr: |
          sum by (namespace, service) (rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum by (namespace, service) (rate(http_requests_total[5m]))

Cross-Service Federation

Cross-service federation is used when one team needs specific metrics from another team’s Prometheus without a full global view:

# SRE team federates SLO-relevant metrics from service teams
scrape_configs:
  - job_name: 'federate-slo-metrics'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Only the specific SLI metrics needed for SLO calculation
        - 'http_requests_total{job="payment-api"}'
        - 'http_request_duration_seconds_bucket{job="payment-api"}'
        - 'grpc_server_handled_total{job="order-service"}'
    static_configs:
      - targets: ['prometheus-payments:9090']

Federation Pitfalls

  • Staleness: Federated data lags the leaf by up to one federation scrape interval, on top of the leaf’s own scrape and rule-evaluation intervals. A 60s federation interval adds up to 60s of delay.
  • Label conflicts: Always use honor_labels: true to preserve original labels. Without it, the federating Prometheus attaches its own job and instance labels and renames the originals to exported_job and exported_instance.
  • Cardinality explosion: Federating {__name__=~".+"} (all metrics) defeats the purpose — the global Prometheus must handle the combined cardinality of all leaves.
  • Single point of failure: If the global Prometheus goes down, cross-shard dashboards break. Use HA for the global instance too.
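
Several of these pitfalls can be caught from the global Prometheus itself, because every federation scrape produces the usual per-target metrics (`up`, `scrape_samples_scraped`, `scrape_duration_seconds`). The rules below are a sketch: the job-name pattern matches the federation jobs shown earlier, while the alert names and thresholds are illustrative guesses that depend on how many aggregates you federate.

```yaml
groups:
  - name: federation_health              # hypothetical group name
    rules:
      # A leaf's /federate endpoint is unreachable
      - alert: FederationSourceDown
        expr: up{job=~"federate-.+"} == 0
        for: 5m

      # A single federation scrape returns more series than budgeted,
      # which usually means a match[] selector is too broad
      - alert: FederationPullTooLarge
        expr: scrape_samples_scraped{job=~"federate-.+"} > 50000
        for: 15m

      # Federation scrapes take a large share of the 60s interval
      - alert: FederationScrapeSlow
        expr: scrape_duration_seconds{job=~"federate-.+"} > 30
        for: 15m
```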

High Availability

Prometheus HA is conceptually simple: run two (or more) identical Prometheus instances scraping the same targets. Both collect the same data independently — there is no leader election, no replication protocol, no consensus algorithm.

HA Pairs Pattern

Prometheus HA Pair Architecture
flowchart TD
    subgraph Targets["Scrape Targets"]
        T1[Service A]
        T2[Service B]
        T3[Service C]
    end

    subgraph HA["HA Pair"]
        P1["Prometheus<br/>replica=prom-0"]
        P2["Prometheus<br/>replica=prom-1"]
    end

    subgraph Query["Query Layer"]
        TH[Thanos Query<br/>or Mimir]
        DEDUP["Deduplication<br/>by __replica__ label"]
    end

    T1 & T2 & T3 --> P1
    T1 & T2 & T3 --> P2
    P1 --> TH
    P2 --> TH
    TH --> DEDUP

# HA Pair — Prometheus replica 0
# Identical config on replica 1, only external_labels differ
global:
  scrape_interval: 15s
  external_labels:
    cluster: production
    __replica__: prom-0    # prom-1 on the other instance

# Both replicas have IDENTICAL scrape_configs
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

# Both remote-write to the same long-term store
remote_write:
  - url: https://mimir.internal/api/v1/push
    headers:
      X-Scope-OrgID: production
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s

Query-Time Deduplication

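Because both replicas produce the same series, something downstream must deduplicate on the __replica__ label. Thanos Query does this at query time: for each series it reads from one replica and switches to the other when it detects a gap. Grafana Mimir instead deduplicates on the write path: when its HA tracker is enabled, it accepts samples from one elected replica per cluster and drops the rest. The Thanos Query side looks like this: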

# Thanos Query — deduplication configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.35.0
          args:
            - query
            - --log.level=info
            - --query.replica-label=__replica__     # Dedup on this label
            - --query.auto-downsampling
            # --endpoint supersedes the deprecated --store flag
            - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc
            - --endpoint=dnssrv+_grpc._tcp.prometheus-ha.monitoring.svc
          ports:
            - containerPort: 10902
              name: http
            - containerPort: 10901
              name: grpc

Why Not Active-Passive? Unlike databases, Prometheus HA uses active-active. Both replicas scrape independently because: (1) scraping is cheap compared to query load, (2) no failover delay — both are always warm, (3) no split-brain risk since there’s no shared state, (4) slight timestamp differences between replicas are handled by deduplication.

HA Alertmanager Cluster

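With HA Prometheus pairs, both replicas fire the same alerts. Alertmanager handles this in two steps: alerts with identical label sets collapse into one, and clustered Alertmanager instances gossip their notification state (using HashiCorp’s memberlist library) so that only one notification goes out: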

# Alertmanager cluster — 3 instances for HA
# They gossip to deduplicate notifications
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
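spec:
  replicas: 3
  serviceName: alertmanager   # Headless Service behind the alertmanager-N.alertmanager DNS names
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec: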
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.27.0
          args:
            - '--config.file=/etc/alertmanager/alertmanager.yml'
            - '--storage.path=/alertmanager'
            - '--cluster.listen-address=0.0.0.0:9094'
            # Static peer list, resolved through the per-pod DNS names
            - '--cluster.peer=alertmanager-0.alertmanager:9094'
            - '--cluster.peer=alertmanager-1.alertmanager:9094'
            - '--cluster.peer=alertmanager-2.alertmanager:9094'
            # Maximum time to wait for the cluster to settle before evaluating notifications
            - '--cluster.settle-timeout=60s'
          ports:
            - containerPort: 9093    # HTTP API
              name: http             # Matched by the port-name relabel rule on the Prometheus side
            - containerPort: 9094    # Cluster gossip
              name: cluster

# Both Prometheus replicas send alerts to ALL Alertmanager instances
# Alertmanager cluster deduplicates internally
alerting:
  alert_relabel_configs:
    # Drop the replica label from alerts — both replicas fire identical alerts
    - action: labeldrop
      regex: __replica__

  alertmanagers:
    - kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names: ['monitoring']
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_endpoint_port_name]
          regex: http
          action: keep

Remote Write Fan-Out

Remote write allows Prometheus to push samples to one or more remote endpoints in parallel. This is the foundation for centralizing data from distributed shards into long-term storage.

Write-Path Relabeling

# Remote write with selective forwarding
# Only send aggregated metrics to the global store (cost optimization)
remote_write:
  # Primary: Send everything to team's Mimir tenant
  - url: https://mimir.internal/api/v1/push
    headers:
      X-Scope-OrgID: payments-team
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      max_shards: 30

  # Secondary: Send only aggregates to global view
  - url: https://mimir-global.internal/api/v1/push
    headers:
      X-Scope-OrgID: global
    write_relabel_configs:
      # Only forward recording rules (pre-aggregated)
      - source_labels: [__name__]
        regex: '(namespace|job|cluster|slo):.*'
        action: keep
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 10s
      max_shards: 10

Queue Configuration Tuning

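Remote Write Queue Parameters

| Parameter | Default | Recommended (High Volume) | Effect |
|---|---|---|---|
| max_shards | 50 | 30–50 | Max concurrent write goroutines |
| min_shards | 1 | 5–10 | Min concurrent writes (faster ramp-up) |
| max_samples_per_send | 2000 | 2000–5000 | Batch size per request |
| batch_send_deadline | 5s | 5s | Max wait before sending partial batch |
| capacity | 10000 | 10000 | Per-shard buffer capacity |
| retry_on_http_429 | false | true | Retry on rate-limit responses |

Defaults shown are those documented for recent Prometheus 2.x releases; older releases shipped different values, so confirm against the configuration reference for your version.
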
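
Applied to a single endpoint, the recommended column might translate into something like the fragment below. Treat it as a starting point under assumptions: the URL is the one used elsewhere in this article, and the throughput estimate in the comments assumes roughly 100ms per write request, which you should measure rather than trust.

```yaml
remote_write:
  - url: https://mimir.internal/api/v1/push
    queue_config:
      min_shards: 5               # start with some parallelism instead of ramping up from 1
      max_shards: 50
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      capacity: 10000             # per shard, so up to 50 * 10000 samples can sit in buffers
      retry_on_http_429: true
      # Rough throughput ceiling = max_shards * max_samples_per_send / request latency
      # e.g. 50 * 2000 / 0.1s = 1,000,000 samples/sec (assuming ~100ms per request)
```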
# Key metrics to monitor remote write health
# Samples queued in the remote write shards, waiting to be sent
prometheus_remote_storage_samples_pending

# Failed sends (should be near zero)
rate(prometheus_remote_storage_samples_failed_total[5m])

# Lag in seconds: newest timestamp ingested minus newest timestamp sent, per remote endpoint
# (the first metric has no remote_name/url labels, so the match must ignore them)
prometheus_remote_storage_highest_timestamp_in_seconds
  - ignoring(remote_name, url) group_right
prometheus_remote_storage_queue_highest_sent_timestamp_seconds

# Shard scaling — are we at max shards?
prometheus_remote_storage_shards
prometheus_remote_storage_shards_max
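
The same metrics can back alerts. The sketch below is modeled on the kind of rules shipped in the upstream Prometheus monitoring mixin, but the group name, alert names, and the 120s threshold here are illustrative and should be adapted to your own tolerance for lag.

```yaml
groups:
  - name: remote_write_health            # hypothetical group name
    rules:
      # Remote write is more than two minutes behind ingestion
      - alert: RemoteWriteBehind
        expr: |
          (
            max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
            - ignoring(remote_name, url) group_right
            max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
          ) > 120
        for: 15m

      # The queue wants more shards than max_shards allows
      - alert: RemoteWriteShardsSaturated
        expr: |
          max_over_time(prometheus_remote_storage_shards_desired[5m])
          > max_over_time(prometheus_remote_storage_shards_max[5m])
        for: 15m
```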

Combined Patterns

Reference Architecture

A production deployment typically combines multiple patterns. Here’s a reference architecture for a mid-size organization (500+ microservices, 50M+ active series):

Combined Scaling Architecture
flowchart TD
    subgraph Leaf["Leaf Layer (Per-Team Shards)"]
        direction LR
        LP1["Platform<br/>HA Pair"]
        LP2["Payments<br/>HA Pair"]
        LP3["Data<br/>HA Pair"]
        LP4["Frontend<br/>HA Pair"]
    end
    subgraph LTS["Long-Term Storage"]
        MIMIR[Grafana Mimir<br/>Multi-Tenant]
    end
    subgraph Query["Query Layer"]
        GF[Grafana<br/>Dashboards]
    end
    subgraph Alert["Alert Layer"]
        AM["Alertmanager<br/>HA Cluster (3)"]
    end
    LP1 & LP2 & LP3 & LP4 -->|"remote_write"| MIMIR
    LP1 & LP2 & LP3 & LP4 -->|"alerts"| AM
    GF -->|"PromQL"| MIMIR
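
To make the diagram concrete, here is a sketch of what one replica of the Payments HA pair could look like when the earlier pieces are combined. It reuses label names, URLs, and tenant IDs from the examples in this article; the rule file path and the static Alertmanager target list are assumptions, and scrape_configs are omitted because they match the functional sharding example.

```yaml
# payments-prom, replica 0 (replica 1 differs only in __replica__)
global:
  scrape_interval: 15s
  external_labels:
    cluster: production
    shard: payments
    __replica__: prom-0

rule_files:
  - /etc/prometheus/rules/*.yml          # assumed path for recording and alerting rules

alerting:
  alert_relabel_configs:
    # Both replicas must send identical alerts so Alertmanager can deduplicate them
    - action: labeldrop
      regex: __replica__
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0.alertmanager:9093
            - alertmanager-1.alertmanager:9093
            - alertmanager-2.alertmanager:9093

remote_write:
  - url: https://mimir.internal/api/v1/push
    headers:
      X-Scope-OrgID: payments-team       # one tenant per team
```

One design point to check: if several HA pairs write into the same tenant, write-path deduplication that keys on the cluster label would need that label to differ per pair, so that one pair's elected replica does not cause another pair's samples to be dropped.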

Decision Matrix

When to Use Each Pattern

| Pattern | Use When | Avoid When |
|---|---|---|
| Functional Sharding | Clear team/domain boundaries; independent ownership wanted | Single team owns everything; <5M series total |
| Hashmod Sharding | One massive job exceeds single-server capacity (e.g., 20K nodes) | Cross-target recording rules needed; few targets |
| Hierarchical Federation | Global dashboards needed without central long-term store | High-cardinality raw metrics needed globally |
| HA Pairs | Always: any production deployment | Cost-constrained dev/staging environments |
| Remote Write + Mimir/Thanos | Global query view; retention >30d; multi-cluster | Single small cluster; cost sensitivity |

Conclusion

Scaling Prometheus is not a single decision but a combination of complementary patterns applied based on your specific constraints:

Key Takeaways:
  • Start with functional sharding — it aligns with team ownership and is simplest to operate
  • Always deploy HA pairs — the cost of a second replica is trivial compared to monitoring blind spots
  • Federate aggregates, not raw metrics — recording rules on leaves, federation of pre-aggregated series
  • Use remote write for global view — Mimir/Thanos provide better global querying than federation
  • Hashmod is a last resort — use it only when a single job truly exceeds one server
  • Monitor your monitoring — track WAL size, ingestion rate, remote write lag, and query latency