Prometheus Deep Dive Part 7: Sharding, Federation & High Availability

Prometheus Scaling Limits

Prometheus was designed as a single-server monitoring system with a local TSDB. This design makes it simple to operate but introduces hard limits as your infrastructure grows. Understanding where those limits lie is essential before choosing a scaling strategy.

Single-Server Ceiling

A well-tuned Prometheus server on modern hardware (16 cores, 64 GiB RAM, fast NVMe SSD) can typically handle:

Capacity Reference

Single Prometheus Server Practical Limits

Metric	Comfortable	Maximum	Bottleneck
Active time series	5–8 million	~15 million	Memory (head block)
Ingestion rate	500K samples/sec	~1M samples/sec	CPU (appends + compaction)
Scrape targets	5,000	~20,000	Scrape goroutines + network
Rule evaluations	5,000 rules	~20,000 rules	CPU (PromQL evaluation)
Retention (local)	15–30 days	~90 days	Disk I/O + compaction time

Capacity PlanningInfrastructure

When to Scale Out

                            
                            Scale-Out Signals: If you observe any of these consistently, it’s time to distribute: ingestion rate >80% of capacity; head series count approaching 10M; query latency P99 >10s for dashboard queries; WAL replay time >10 minutes after restart; compaction falling behind (increasing block count).
                        

The key insight is that scaling Prometheus is not about making one server bigger — it’s about distributing the workload across multiple Prometheus instances, each responsible for a subset of the monitoring surface.

Functional Sharding

Functional sharding is the simplest and most commonly recommended approach. Each Prometheus instance is responsible for a logical domain — a team, a namespace, a service tier, or a functional area.

Sharding by Team / Namespace

Functional Sharding by Team

flowchart TD
    subgraph Platform["Platform Team"]
        PP[Prometheus
platform-prom]
        PT1[K8s Control Plane]
        PT2[Ingress Controllers]
        PT3[Cert Manager]
    end

    subgraph Payments["Payments Team"]
        PAY[Prometheus
payments-prom]
        PAYT1[Payment Service]
        PAYT2[Fraud Detection]
        PAYT3[Payment Gateway]
    end

    subgraph Data["Data Platform"]
        DP[Prometheus
data-prom]
        DPT1[Kafka Clusters]
        DPT2[Spark Jobs]
        DPT3[Data Pipelines]
    end

    subgraph Global["Global View"]
        FED[Federation
Prometheus]
    end

    PP --> PT1 & PT2 & PT3
    PAY --> PAYT1 & PAYT2 & PAYT3
    DP --> DPT1 & DPT2 & DPT3
    PP & PAY & DP --> FED

Sharding by Concern

An alternative functional split separates by monitoring concern rather than team ownership:

Infrastructure Prometheus — Node exporters, kube-state-metrics, cAdvisor
Application Prometheus — Service-level metrics (HTTP, gRPC, custom business metrics)
Synthetic Prometheus — Blackbox exporter probes, external availability checks

Configuration Patterns

# Prometheus instance: payments-prom
# Only scrapes targets in the payments namespace
global:
  scrape_interval: 15s
  external_labels:
    cluster: production-us-east-1
    shard: payments    # Identifies this shard in federation

scrape_configs:
  # Auto-discover pods in payments namespace
  - job_name: 'payments-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['payments', 'payments-staging']

    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

      # Use annotation for custom port
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)
        replacement: ${1}

      # Preserve namespace and pod labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

  # ServiceMonitors for payments team
  - job_name: 'payments-services'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: ['payments']
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_monitoring]
        action: keep
        regex: 'true'

Hashmod Sharding

When you need to split a single large scrape job across multiple Prometheus instances (e.g., 10,000 node exporters that exceed one server’s capacity), hashmod sharding distributes targets deterministically using consistent hashing.

How Hashmod Works

The hashmod relabeling action computes a hash of a source label value, takes modulo N (number of shards), and stores the result. Each Prometheus instance then keeps only targets where the hash result matches its shard index.

Hashmod Sharding (3 Shards)

flowchart LR
    subgraph Targets["All Targets (10,000 nodes)"]
        T["hash(__address__) mod 3"]
    end

    subgraph S0["Shard 0 (~3,333 targets)"]
        P0[prometheus-shard-0]
    end
    subgraph S1["Shard 1 (~3,333 targets)"]
        P1[prometheus-shard-1]
    end
    subgraph S2["Shard 2 (~3,334 targets)"]
        P2[prometheus-shard-2]
    end

    T -->|"mod=0"| P0
    T -->|"mod=1"| P1
    T -->|"mod=2"| P2

Configuration

# Shard 0 of 3 — each shard uses identical config except SHARD_INDEX
# Deploy via Helm with shard index as environment variable
global:
  scrape_interval: 30s
  external_labels:
    cluster: production
    __replica__: shard-0    # For deduplication in long-term storage

scrape_configs:
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node

    relabel_configs:
      # Standard node exporter relabeling
      - source_labels: [__meta_kubernetes_node_name]
        target_label: instance

      # === HASHMOD SHARDING ===
      # Step 1: Compute hash of __address__, mod 3
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod

      # Step 2: Keep only targets where hash == this shard's index
      - source_labels: [__tmp_hash]
        regex: "0"          # Change to "1" for shard-1, "2" for shard-2
        action: keep

# Kubernetes StatefulSet for hashmod shards
# Each replica gets its shard index from the pod ordinal
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-sharded
spec:
  replicas: 3
  serviceName: prometheus-sharded
  template:
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=15d'
            - '--web.enable-lifecycle'
          env:
            - name: SHARD_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi
        storageClassName: gp3

Tradeoffs & Pitfalls

                            
                            Hashmod Considerations:
                            Pro: Even distribution, stateless shard assignment, easy to add/remove shards
Pro: Each shard is independent — failure affects only 1/N of targets
Con: Changing shard count redistributes ALL targets (brief gap in data)
Con: Cross-shard queries impossible without federation or remote storage
Con: Recording rules that span targets across shards produce incomplete results

                        

Federation

Federation allows one Prometheus server to scrape selected time series from another Prometheus server’s /federate endpoint. It’s primarily used to create aggregated global views from distributed shards.

Hierarchical Federation

In hierarchical federation, leaf Prometheus instances scrape targets directly, while a higher-level “global” Prometheus federates aggregated metrics from the leaves:

# Global Prometheus — federates from shard Prometheus instances
# Only pulls pre-aggregated recording rules, NOT raw metrics
global:
  scrape_interval: 60s    # Longer interval — we're pulling aggregates
  external_labels:
    cluster: global

scrape_configs:
  # Federate from each shard
  - job_name: 'federate-shards'
    honor_labels: true     # Preserve original labels from shards
    metrics_path: '/federate'
    params:
      'match[]':
        # Only federate recording rules (namespace-level aggregates)
        - '{__name__=~"namespace:.+"}'
        # And critical alerts-relevant metrics
        - '{__name__=~"job:.+"}'
        # Plus cluster-level SLO indicators
        - '{__name__=~"slo:.+"}'
    static_configs:
      - targets:
          - 'prometheus-payments:9090'
          - 'prometheus-platform:9090'
          - 'prometheus-data:9090'
        labels:
          federated: 'true'

  # Also federate from infrastructure shards
  - job_name: 'federate-infra'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"node:.+"}'       # Node-level aggregates
        - '{__name__=~"cluster:.+"}'    # Cluster-wide metrics
        - 'up'                          # Target health
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names: ['monitoring']
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: prometheus-infra-shard
        action: keep

                            
                            The Golden Rule of Federation: Only federate aggregated metrics (recording rules). Never federate raw high-cardinality series — you’ll simply move the cardinality problem to the global Prometheus. Use recording rules on leaves to pre-aggregate, then federate the aggregates.
                        

# Recording rules on leaf Prometheus (payments-prom)
# These are what get federated to global
groups:
  - name: namespace_aggregates
    interval: 30s
    rules:
      # Request rate per service
      - record: namespace:http_requests_total:rate5m
        expr: |
          sum by (namespace, service, method, status_code) (
            rate(http_requests_total[5m])
          )

      # P99 latency per service
      - record: namespace:http_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (namespace, service, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # Error ratio per service
      - record: namespace:http_errors:ratio_rate5m
        expr: |
          sum by (namespace, service) (rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum by (namespace, service) (rate(http_requests_total[5m]))

Cross-Service Federation

Cross-service federation is used when one team needs specific metrics from another team’s Prometheus without a full global view:

# SRE team federates SLO-relevant metrics from service teams
scrape_configs:
  - job_name: 'federate-slo-metrics'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Only the specific SLI metrics needed for SLO calculation
        - 'http_requests_total{job="payment-api"}'
        - 'http_request_duration_seconds_bucket{job="payment-api"}'
        - 'grpc_server_handled_total{job="order-service"}'
    static_configs:
      - targets: ['prometheus-payments:9090']

Federation Pitfalls

Staleness: Federated data is always at least one scrape interval behind. A 60s federation interval means up to 60s staleness.
Label conflicts: Always use honor_labels: true to preserve original labels. Without it, the federating Prometheus overwrites instance and job labels.
Cardinality explosion: Federating {__name__=~".+"} (all metrics) defeats the purpose — the global Prometheus must handle the combined cardinality of all leaves.
Single point of failure: If the global Prometheus goes down, cross-shard dashboards break. Use HA for the global instance too.

High Availability

Prometheus HA is conceptually simple: run two (or more) identical Prometheus instances scraping the same targets. Both collect the same data independently — there is no leader election, no replication protocol, no consensus algorithm.

HA Pairs Pattern

Prometheus HA Pair Architecture

flowchart TD
    subgraph Targets["Scrape Targets"]
        T1[Service A]
        T2[Service B]
        T3[Service C]
    end

    subgraph HA["HA Pair"]
        P1["Prometheus
replica=prom-0"]
        P2["Prometheus
replica=prom-1"]
    end

    subgraph Query["Query Layer"]
        TH[Thanos Query
or Mimir]
        DEDUP["Deduplication
by __replica__ label"]
    end

    T1 & T2 & T3 --> P1
    T1 & T2 & T3 --> P2
    P1 --> TH
    P2 --> TH
    TH --> DEDUP

# HA Pair — Prometheus replica 0
# Identical config on replica 1, only external_labels differ
global:
  scrape_interval: 15s
  external_labels:
    cluster: production
    __replica__: prom-0    # prom-1 on the other instance

# Both replicas have IDENTICAL scrape_configs
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

# Both remote-write to the same long-term store
remote_write:
  - url: https://mimir.internal/api/v1/push
    headers:
      X-Scope-OrgID: production
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s

Query-Time Deduplication

When both replicas remote-write to Mimir or Thanos, the query layer deduplicates by the __replica__ label. The deduplication strategy picks one replica’s data per time window, preferring the one with fewer gaps:

# Thanos Query — deduplication configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  template:
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.35.0
          args:
            - query
            - --log.level=info
            - --query.replica-label=__replica__     # Dedup on this label
            - --query.auto-downsampling
            - --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc
            - --store=dnssrv+_grpc._tcp.prometheus-ha.monitoring.svc
          ports:
            - containerPort: 10902
              name: http
            - containerPort: 10901
              name: grpc

                            
                            Why Not Active-Passive? Unlike databases, Prometheus HA uses active-active. Both replicas scrape independently because: (1) scraping is cheap compared to query load, (2) no failover delay — both are always warm, (3) no split-brain risk since there’s no shared state, (4) slight timestamp differences between replicas are handled by deduplication.
                        

HA Alertmanager Cluster

With HA Prometheus pairs, both replicas fire the same alerts. Alertmanager handles this via its gossip-based clustering protocol (using Hashicorp’s Memberlist):

# Alertmanager cluster — 3 instances for HA
# They gossip to deduplicate notifications
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.27.0
          args:
            - '--config.file=/etc/alertmanager/alertmanager.yml'
            - '--storage.path=/alertmanager'
            - '--cluster.listen-address=0.0.0.0:9094'
            # Peer discovery via DNS
            - '--cluster.peer=alertmanager-0.alertmanager:9094'
            - '--cluster.peer=alertmanager-1.alertmanager:9094'
            - '--cluster.peer=alertmanager-2.alertmanager:9094'
            # Settle time before sending notifications
            - '--cluster.settle-timeout=60s'
          ports:
            - containerPort: 9093    # HTTP API
            - containerPort: 9094    # Cluster gossip

# Both Prometheus replicas send alerts to ALL Alertmanager instances
# Alertmanager cluster deduplicates internally
alerting:
  alert_relabel_configs:
    # Drop the replica label from alerts — both replicas fire identical alerts
    - action: labeldrop
      regex: __replica__

  alertmanagers:
    - kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names: ['monitoring']
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_endpoint_port_name]
          regex: http
          action: keep

Remote Write Fan-Out

Remote write allows Prometheus to push samples to one or more remote endpoints in parallel. This is the foundation for centralizing data from distributed shards into long-term storage.

Write-Path Relabeling

# Remote write with selective forwarding
# Only send aggregated metrics to the global store (cost optimization)
remote_write:
  # Primary: Send everything to team's Mimir tenant
  - url: https://mimir.internal/api/v1/push
    headers:
      X-Scope-OrgID: payments-team
    queue_config:
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      max_shards: 30

  # Secondary: Send only aggregates to global view
  - url: https://mimir-global.internal/api/v1/push
    headers:
      X-Scope-OrgID: global
    write_relabel_configs:
      # Only forward recording rules (pre-aggregated)
      - source_labels: [__name__]
        regex: '(namespace|job|cluster|slo):.*'
        action: keep
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 10s
      max_shards: 10

Queue Configuration Tuning

Configuration

Remote Write Queue Parameters

Parameter	Default	Recommended (High Volume)	Effect
`max_shards`	200	30–50	Max concurrent write goroutines
`min_shards`	1	5–10	Min concurrent writes (faster ramp-up)
`max_samples_per_send`	2000	2000–5000	Batch size per request
`batch_send_deadline`	5s	5s	Max wait before sending partial batch
`capacity`	2500	10000	Per-shard buffer capacity
`retry_on_http_429`	true	true	Retry on rate-limit responses

PerformanceTuning

# Key metrics to monitor remote write health
# Pending samples in the WAL waiting to be sent
prometheus_remote_storage_samples_pending

# Failed sends (should be near zero)
rate(prometheus_remote_storage_samples_failed_total[5m])

# Highest timestamp successfully sent vs current time (lag)
prometheus_remote_storage_highest_timestamp_in_seconds
  - prometheus_remote_storage_queue_highest_sent_timestamp_seconds

# Shard scaling — are we at max shards?
prometheus_remote_storage_shards
prometheus_remote_storage_shards_max

Combined Patterns

Reference Architecture

A production deployment typically combines multiple patterns. Here’s a reference architecture for a mid-size organization (500+ microservices, 50M+ active series):

Combined Scaling Architecture

flowchart TD
    subgraph Leaf["Leaf Layer (Per-Team Shards)"]
        direction LR
        LP1["Platform
HA Pair"]
        LP2["Payments
HA Pair"]
        LP3["Data
HA Pair"]
        LP4["Frontend
HA Pair"]
    end

    subgraph LTS["Long-Term Storage"]
        MIMIR[Grafana Mimir
Multi-Tenant]
    end

    subgraph Query["Query Layer"]
        GF[Grafana
Dashboards]
    end

    subgraph Alert["Alert Layer"]
        AM["Alertmanager
HA Cluster (3)"]
    end

    LP1 & LP2 & LP3 & LP4 -->|"remote_write"| MIMIR
    LP1 & LP2 & LP3 & LP4 -->|"alerts"| AM
    GF -->|"PromQL"| MIMIR

Decision Matrix

Decision Guide

When to Use Each Pattern

Pattern	Use When	Avoid When
Functional Sharding	Clear team/domain boundaries; independent ownership wanted	Single team owns everything; <5M series total
Hashmod Sharding	One massive job exceeds single-server capacity (e.g., 20K nodes)	Cross-target recording rules needed; few targets
Hierarchical Federation	Global dashboards needed without central long-term store	High-cardinality raw metrics needed globally
HA Pairs	Always — any production deployment	Cost-constrained dev/staging environments
Remote Write + Mimir/Thanos	Global query view; retention >30d; multi-cluster	Single small cluster; cost sensitivity

ArchitectureDecision Making

Conclusion

Scaling Prometheus is not a single decision but a combination of complementary patterns applied based on your specific constraints:

                            
                            Key Takeaways:
                            Start with functional sharding — it aligns with team ownership and is simplest to operate
Always deploy HA pairs — the cost of a second replica is trivial compared to monitoring blind spots
Federate aggregates, not raw metrics — recording rules on leaves, federation of pre-aggregated series
Use remote write for global view — Mimir/Thanos provide better global querying than federation
Hashmod is a last resort — use it only when a single job truly exceeds one server
Monitor your monitoring — track WAL size, ingestion rate, remote write lag, and query latency

                        

Previous Part 6: Alerting & Alertmanager Next Part 8: Optimizing & Debugging

Prometheus Deep Dive Part 7: Sharding, Federation & High Availability

Table of Contents

Prometheus Scaling Limits

Single-Server Ceiling

Single Prometheus Server Practical Limits

When to Scale Out

Functional Sharding

Sharding by Team / Namespace

Sharding by Concern

Configuration Patterns

Hashmod Sharding

How Hashmod Works

Configuration

Tradeoffs & Pitfalls

Federation

Hierarchical Federation

Cross-Service Federation

Federation Pitfalls

High Availability

HA Pairs Pattern

Query-Time Deduplication

HA Alertmanager Cluster

Remote Write Fan-Out

Write-Path Relabeling

Queue Configuration Tuning

Remote Write Queue Parameters

Combined Patterns

Reference Architecture

Decision Matrix

When to Use Each Pattern

Conclusion

Related Articles in This Series

Part 6: Effective Alerting & Alertmanager

Part 8: Optimizing & Debugging Prometheus

Part 10: Remote Storage Systems