
Prometheus Deep Dive Part 5: Service Discovery

June 15, 2026 Wasil Zafar 30 min read

In dynamic environments, targets come and go constantly. Master Prometheus service discovery — from Kubernetes SD roles and the power of relabeling to Consul, DNS, file-based, and HTTP service discovery. Learn to build custom SD providers for any infrastructure.

Table of Contents

  1. Service Discovery Overview
  2. Kubernetes Service Discovery
  3. Relabeling Deep Dive
  4. Other SD Mechanisms
  5. Conclusion & What’s Next

Service Discovery Overview

Static vs Dynamic

Prometheus needs to know where to scrape metrics. In static environments, you list targets manually. In dynamic environments (Kubernetes, cloud, service mesh), targets appear and disappear constantly — requiring automated discovery:

# Static targets - suitable for fixed infrastructure
scrape_configs:
  - job_name: 'database-servers'
    static_configs:
      - targets:
          - 'db-primary.internal:9104'
          - 'db-replica-1.internal:9104'
          - 'db-replica-2.internal:9104'
        labels:
          env: production
          team: data

# Dynamic targets - Kubernetes discovers pods automatically
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    # Relabeling filters and transforms discovered targets
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Target Lifecycle

Target Discovery & Scrape Lifecycle
flowchart TD
    SD["Service Discovery<br/>(watches K8s API, Consul, etc.)"]
    TG["Target Groups<br/>(raw targets + __meta labels)"]
    RL["relabel_configs<br/>(filter, transform, drop)"]
    AT["Active Targets<br/>(final scrape list)"]
    SC["Scrape Loop<br/>(GET /metrics every interval)"]
    MR["metric_relabel_configs<br/>(filter/modify scraped metrics)"]
    TSDB["TSDB Storage"]
    SD -->|"discover"| TG
    TG -->|"apply"| RL
    RL -->|"keep/drop"| AT
    AT -->|"scrape"| SC
    SC -->|"post-process"| MR
    MR -->|"ingest"| TSDB
    RL -->|"action: drop"| X1["Discarded<br/>(never scraped)"]
    MR -->|"action: drop"| X2["Discarded<br/>(scraped but not stored)"]

Kubernetes Service Discovery

SD Roles

Reference

Kubernetes SD Roles

Role          | Discovers                        | Target Address             | Use Case
node          | Cluster nodes                    | NodeIP:KubeletPort         | Node exporter, kubelet metrics
pod           | All pods (all containers)        | PodIP:ContainerPort        | Application metrics (annotation-based)
service       | Services                         | ServiceDNSName:ServicePort | Blackbox probing service endpoints
endpoints     | Endpoints (pods behind services) | PodIP:TargetPort           | ServiceMonitor pattern (most common)
endpointslice | EndpointSlices (modern)          | PodIP:TargetPort           | Same as endpoints, better scaling
ingress       | Ingress resources                | IngressHost:80/443         | Blackbox probing external URLs

__meta_kubernetes Labels

Kubernetes SD injects rich metadata as __meta_kubernetes_* labels. These are available during relabeling but are NOT stored in TSDB (double-underscore labels are dropped after relabeling):

# Key __meta labels for pod role:
__meta_kubernetes_pod_name                    # Pod name
__meta_kubernetes_pod_ip                      # Pod IP address
__meta_kubernetes_pod_node_name               # Node the pod runs on
__meta_kubernetes_namespace                   # Pod namespace
__meta_kubernetes_pod_label_<name>            # Pod labels (dots → underscores)
__meta_kubernetes_pod_annotation_<name>       # Pod annotations
__meta_kubernetes_pod_container_name          # Container name
__meta_kubernetes_pod_container_port_name     # Named port
__meta_kubernetes_pod_container_port_number   # Port number
__meta_kubernetes_pod_ready                   # "true" if pod is ready
__meta_kubernetes_pod_phase                   # Running, Pending, etc.

# Key __meta labels for node role:
__meta_kubernetes_node_name                   # Node name
__meta_kubernetes_node_label_<name>           # Node labels
__meta_kubernetes_node_address_InternalIP     # Internal IP
__meta_kubernetes_node_address_Hostname       # Hostname
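
The "dots → underscores" note above generalizes: label and annotation keys often contain characters such as `.` and `/` that are not valid in a Prometheus label name. The sketch below assumes that every character outside `[a-zA-Z0-9_]` is replaced with an underscore. The function names are hypothetical helpers for illustration, not part of any Prometheus API.

```javascript
// Sketch: how a Kubernetes label or annotation key ends up inside a
// __meta_kubernetes_* label name.
// Assumption: every character outside [a-zA-Z0-9_] becomes an underscore.
function sanitizeLabelName(name) {
  return name.replace(/[^a-zA-Z0-9_]/g, '_');
}

// Hypothetical helper: meta label name for a pod annotation key.
function podAnnotationMetaLabel(annotationKey) {
  return `__meta_kubernetes_pod_annotation_${sanitizeLabelName(annotationKey)}`;
}

// Hypothetical helper: meta label name for a pod label key.
function podLabelMetaLabel(labelKey) {
  return `__meta_kubernetes_pod_label_${sanitizeLabelName(labelKey)}`;
}

console.log(podAnnotationMetaLabel('prometheus.io/scrape'));
console.log(podLabelMetaLabel('app.kubernetes.io/name'));
```

This is why an annotation written as `prometheus.io/scrape` is referenced in relabel rules as `__meta_kubernetes_pod_annotation_prometheus_io_scrape`.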

Practical Examples

# Pattern 1: Annotation-based pod discovery (classic Prometheus pattern)
# Pods opt-in by adding annotations:
#   prometheus.io/scrape: "true"
#   prometheus.io/port: "8080"
#   prometheus.io/path: "/metrics"
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified: keep the host part of __address__ and
      # swap in the annotated port (source label values are joined with ';')
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Add namespace and pod labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
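
To see what a port-rewriting `replace` rule does to a discovered target, it can help to model the action in a few lines. The sketch below is a simplified model, not Prometheus code. It assumes that source label values are joined with `;`, that the regex must match the whole joined string, and that `${N}` in the replacement refers to capture group N. `relabelReplace` and the rule field names are hypothetical.

```javascript
// Simplified model of the `replace` relabel action (not Prometheus code).
function relabelReplace(labels, rule) {
  // Join the source label values with ';' (missing labels count as empty).
  const joined = rule.sourceLabels.map(name => labels[name] ?? '').join(';');
  // Relabel regexes are treated as anchored on both ends.
  const anchored = new RegExp(`^(?:${rule.regex})$`);
  const match = anchored.exec(joined);
  if (match === null) {
    return labels; // no match: the target label is left untouched
  }
  // Expand ${1}, ${2}, ... with the captured groups.
  const value = rule.replacement.replace(/\$\{(\d+)\}/g, (_, n) => match[Number(n)] ?? '');
  return { ...labels, [rule.targetLabel]: value };
}

const portRule = {
  sourceLabels: ['__address__', '__meta_kubernetes_pod_annotation_prometheus_io_port'],
  regex: '([^:]+)(?::\\d+)?;(\\d+)',
  replacement: '${1}:${2}',
  targetLabel: '__address__',
};

// Pod with a port annotation: host is kept, port is swapped.
const annotated = relabelReplace({
  __address__: '10.0.0.5:3000',
  __meta_kubernetes_pod_annotation_prometheus_io_port: '8080',
}, portRule);
console.log(annotated.__address__); // 10.0.0.5:8080

// Pod without the annotation: the regex does not match, address unchanged.
const plain = relabelReplace({ __address__: '10.0.0.7:3000' }, portRule);
console.log(plain.__address__); // 10.0.0.7:3000
```

Both the discovered address and the annotation have to be source labels here. A rule that reads only the port annotation and writes `${1}` into `__address__` would leave the bare port, which is not a scrapeable host:port pair.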

Relabeling Deep Dive

relabel_configs vs metric_relabel_configs

Critical Distinction:
  • relabel_configs — runs BEFORE scraping. Decides which targets to scrape and sets target labels. Operates on __meta_* and __address__ labels. Dropping here = target never scraped = saves CPU
  • metric_relabel_configs — runs AFTER scraping. Filters or modifies individual metrics from the scrape response. Dropping here = metric scraped but not stored = saves disk/memory but still costs scrape CPU
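
The difference between the two stages can be sketched as a small pipeline. This is a simplified model with hypothetical data and function names. It only illustrates where each filter sits relative to the scrape, not how Prometheus implements it.

```javascript
// Simplified model of the two relabeling stages (hypothetical data).
const discoveredTargets = [
  { __address__: '10.0.0.5:8080', __meta_kubernetes_namespace: 'production' },
  { __address__: '10.0.0.6:8080', __meta_kubernetes_namespace: 'dev' },
];

// Stage 1 (relabel_configs, action: keep): runs before any scrape happens.
function keepTargets(targets, sourceLabel, regex) {
  const anchored = new RegExp(`^(?:${regex})$`);
  return targets.filter(t => anchored.test(t[sourceLabel] ?? ''));
}

// Hypothetical scrape: every active target returns the same two series.
function scrape(target) {
  return [
    { __name__: 'http_requests_total', instance: target.__address__ },
    { __name__: 'go_gc_duration_seconds', instance: target.__address__ },
  ];
}

// Stage 2 (metric_relabel_configs, action: drop): runs on scraped samples.
function dropSamples(samples, sourceLabel, regex) {
  const anchored = new RegExp(`^(?:${regex})$`);
  return samples.filter(s => !anchored.test(s[sourceLabel] ?? ''));
}

const activeTargets = keepTargets(discoveredTargets, '__meta_kubernetes_namespace', 'production|staging');
const scraped = activeTargets.flatMap(scrape);
const stored = dropSamples(scraped, '__name__', 'go_.*');

// One target scraped, two samples fetched, one sample stored.
console.log(activeTargets.length, scraped.length, stored.length); // 1 2 1
```

The `dev` target costs nothing after discovery because it is never scraped, while the `go_gc_duration_seconds` sample is still fetched and parsed before it is thrown away.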

Actions: keep, drop, replace, labelmap, hashmod

# Action: keep - only keep targets where regex matches
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    action: keep
    regex: (production|staging)   # Only monitor prod and staging

# Action: drop - discard targets where regex matches
  - source_labels: [__meta_kubernetes_pod_label_app]
    action: drop
    regex: (test-.*|debug-.*)     # Skip test/debug pods

# Action: replace - regex extract and set target label
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    action: replace
    target_label: app
    regex: (.+)
    replacement: ${1}

# Action: labelmap - copy __meta labels to target labels by regex
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: ${1}              # All pod labels become target labels

# Action: hashmod - hash the source labels modulo N to shard targets
  - source_labels: [__address__]
    modulus: 3                      # 3 Prometheus shards
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: 0                        # This shard handles bucket 0
    action: keep

# Action: labeldrop / labelkeep - remove/keep labels by regex
metric_relabel_configs:
  - action: labeldrop
    regex: (pod_template_hash|controller_revision_hash)  # Remove noisy labels
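
The hashmod rules above can be modelled to show how each shard ends up with a disjoint slice of the targets. The sketch assumes the bucket is derived from an MD5 hash of the joined source label values, reduced modulo `modulus`. The exact byte handling is illustrative and may not match Prometheus bit for bit, so the bucket numbers it prints should not be read as what a real server would assign. `hashmod` and `targetsForShard` are hypothetical names.

```javascript
const crypto = require('node:crypto');

// Sketch of hashmod-style sharding (illustrative, not Prometheus code).
// Assumption: bucket = (64 bits taken from an MD5 digest) mod modulus.
function hashmod(value, modulus) {
  const digest = crypto.createHash('md5').update(value).digest();
  const low64 = digest.readBigUInt64BE(8); // last 8 bytes of the 16-byte digest
  return Number(low64 % BigInt(modulus));
}

// Every shard evaluates the same rule and keeps only its own bucket.
function targetsForShard(addresses, shard, modulus) {
  return addresses.filter(addr => hashmod(addr, modulus) === shard);
}

const addresses = ['10.0.0.1:9100', '10.0.0.2:9100', '10.0.0.3:9100', '10.0.0.4:9100'];
const shardCount = 3;
for (let shard = 0; shard < shardCount; shard++) {
  console.log(`shard ${shard}:`, targetsForShard(addresses, shard, shardCount));
}
```

Because the bucket is a plain modulo, every target lands in exactly one shard, but changing `modulus` reassigns most targets. Plan for that churn when adding or removing shards.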

Common Relabeling Patterns

# Pattern: Replace __address__ port (scrape different port than discovered)
relabel_configs:
  - source_labels: [__address__]
    action: replace
    regex: '([^:]+)(?::\d+)?'
    replacement: '${1}:9090'
    target_label: __address__

# Pattern: Set metrics_path from annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)

# Pattern: Add static label to all metrics from a job
  - target_label: environment
    replacement: production

# Pattern: Extract version from pod name (api-v2-abc123 → v2)
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    regex: '.*-(v\d+)-.*'
    replacement: '${1}'
    target_label: version

Other SD Mechanisms

Consul SD

# Consul service discovery - for non-Kubernetes environments
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.internal:8500'
        services: []              # Empty = discover ALL services
        tags: ['prometheus']      # Only services with this tag
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
      - source_labels: [__meta_consul_tags]
        regex: '.*,production,.*'
        action: keep
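
The `'.*,production,.*'` regex above relies on how the tag list is exposed. The sketch below assumes the tags are joined with the tag separator (`,` by default) and that the joined string is itself wrapped in the separator, so every tag is delimited on both sides. `consulTagsLabel` and `hasTag` are hypothetical helpers.

```javascript
// Sketch: the shape of __meta_consul_tags and why the regex needs the commas.
// Assumption: tags are joined with ',' and the result is wrapped in ','.
function consulTagsLabel(tags, separator = ',') {
  return separator + tags.join(separator) + separator;
}

// Relabel regexes are fully anchored, hence the leading and trailing '.*'.
function hasTag(tagsLabel, tag) {
  return new RegExp(`^(?:.*,${tag},.*)$`).test(tagsLabel);
}

const label = consulTagsLabel(['prometheus', 'production']);
console.log(label);                                               // ,prometheus,production,
console.log(hasTag(label, 'production'));                         // true
console.log(hasTag(',prometheus,production-eu,', 'production'));  // false
```

Matching on `,production,` rather than `production` avoids accidentally keeping services tagged `production-eu` or `non-production`.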

DNS SD

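# DNS service discovery - SRV or A records
# `type` applies to every name in a dns_sd_configs entry, so SRV and A
# lookups need separate entries
scrape_configs:
  - job_name: 'dns-discovery'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.api.internal'    # SRV record (port comes from the record)
        type: SRV
        refresh_interval: 30s
      - names:
          - 'metrics.api.internal'             # A record
        type: A         # or AAAA
        port: 9090      # Required for A/AAAA records
        refresh_interval: 30s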
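
The reason `port` matters only for A/AAAA lookups is visible in the shape of the answers. The sketch below uses hypothetical record data in the shape Node's `dns.resolveSrv` and `dns.resolve4` return. It does not perform any network lookups, and IPv6 bracket handling is left out.

```javascript
// Sketch: turning DNS answers into scrape targets (hypothetical record data).
// SRV answers carry their own port, so no port setting is needed.
function targetsFromSrv(records) {
  return records.map(r => `${r.name}:${r.port}`);
}

// A answers carry only an address, so the port must come from the SD config.
function targetsFromA(addresses, port) {
  return addresses.map(ip => `${ip}:${port}`);
}

const srvAnswers = [
  { name: 'api-1.api.internal', port: 9100, priority: 10, weight: 5 },
  { name: 'api-2.api.internal', port: 9101, priority: 10, weight: 5 },
];
const aAnswers = ['10.0.3.1', '10.0.3.2'];

console.log(targetsFromSrv(srvAnswers));
console.log(targetsFromA(aAnswers, 9090));
```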

File-Based SD

# File-based SD - read targets from JSON/YAML files
# Prometheus watches files for changes (inotify)
scrape_configs:
  - job_name: 'file-discovery'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 5m      # Fallback re-read in case a file change is missed

Example JSON target file:
[
  {
    "targets": ["10.0.1.5:9090", "10.0.1.6:9090"],
    "labels": {
      "env": "production",
      "team": "platform",
      "service": "api-gateway"
    }
  },
  {
    "targets": ["10.0.2.10:9090"],
    "labels": {
      "env": "staging",
      "team": "platform",
      "service": "api-gateway"
    }
  }
]
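
A target file like the one above is usually produced by a script rather than written by hand. The following is a minimal sketch of such a generator. The `inventory` list, its field names, and the output path are hypothetical. It writes to a temporary file and renames it into place so that a reader is unlikely to see a half-written file, which is a common precaution rather than something the article prescribes.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Group inventory entries into file_sd target groups by env and service.
function toTargetGroups(inventory) {
  const groups = new Map();
  for (const host of inventory) {
    const key = `${host.env}/${host.service}`;
    if (!groups.has(key)) {
      groups.set(key, { targets: [], labels: { env: host.env, service: host.service } });
    }
    groups.get(key).targets.push(`${host.ip}:${host.port}`);
  }
  return [...groups.values()];
}

// Write to a temp file next to the destination, then rename it into place.
// The '.tmp' suffix keeps the temp file out of a '*.json' glob.
function writeTargetsFile(filePath, targetGroups) {
  const tmpPath = `${filePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(targetGroups, null, 2));
  fs.renameSync(tmpPath, filePath);
}

// Hypothetical inventory; in practice this would come from a CMDB or API.
const inventory = [
  { ip: '10.0.1.5', port: 9090, env: 'production', service: 'api-gateway' },
  { ip: '10.0.1.6', port: 9090, env: 'production', service: 'api-gateway' },
  { ip: '10.0.2.10', port: 9090, env: 'staging', service: 'api-gateway' },
];

const outFile = path.join(os.tmpdir(), 'file-sd-example.json');
writeTargetsFile(outFile, toTargetGroups(inventory));
console.log(fs.readFileSync(outFile, 'utf8'));
```

Run from cron or a systemd timer, a script along these lines keeps the target files current without touching the Prometheus configuration itself.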

HTTP SD (Custom Providers)

HTTP SD (Prometheus 2.28+) lets you build a custom service discovery provider as any HTTP server that returns target JSON. This is the most flexible approach for non-standard infrastructure:

# HTTP service discovery configuration
scrape_configs:
  - job_name: 'custom-http-sd'
    http_sd_configs:
      - url: 'http://sd-provider.internal:8080/targets'
        refresh_interval: 30s
        # Optional authentication
        authorization:
          type: Bearer
          credentials: 'my-secret-token'

// Simple HTTP SD provider (Node.js example)
// Returns a JSON array of target groups (same format as file_sd)
const express = require('express');
const app = express();

// Hypothetical stub: replace with a query to your CMDB, cloud API, or registry
async function getInstancesFromCMDB() {
  return [
    { ip: '10.0.1.5', metricsPort: 9090, serviceName: 'api-gateway',
      environment: 'production', region: 'eu-west-1', team: 'platform' }
  ];
}

app.get('/targets', async (req, res) => {
  try {
    const instances = await getInstancesFromCMDB();

    const targetGroups = instances.map(inst => ({
      targets: [`${inst.ip}:${inst.metricsPort}`],
      labels: {
        __metrics_path__: inst.metricsPath || '/metrics',
        service: inst.serviceName,
        env: inst.environment,
        region: inst.region,
        owner: inst.team
      }
    }));

    // res.json sends HTTP 200 with Content-Type: application/json
    res.json(targetGroups);
  } catch (err) {
    // On a failed fetch Prometheus keeps using its current target list
    res.status(500).json({ error: 'target lookup failed' });
  }
});

app.listen(8080);

When to Use HTTP SD: Use HTTP SD when your targets live in a system Prometheus doesn’t natively support: a custom CMDB, a proprietary cloud platform, a legacy inventory database, or when you need complex business logic to determine which targets to scrape (e.g., only instances that passed health checks in the last 5 minutes).

Choosing a Service Discovery Mechanism
flowchart TD
    Q1{"Where do targets live?"}
    Q1 -->|Kubernetes| K8S["kubernetes_sd_configs<br/>(native, recommended)"]
    Q1 -->|Consul/Nomad| CON["consul_sd_configs"]
    Q1 -->|AWS/GCP/Azure| CLOUD["ec2/gce/azure SD"]
    Q1 -->|DNS SRV records| DNS["dns_sd_configs"]
    Q1 -->|Fixed list| STATIC["static_configs"]
    Q1 -->|Custom system| CUSTOM{"Can you write<br/>an HTTP endpoint?"}
    CUSTOM -->|Yes| HTTP["http_sd_configs"]
    CUSTOM -->|No| FILE["file_sd_configs<br/>(cron job writes JSON)"]

Conclusion & What’s Next

Service discovery is the bridge between Prometheus and your dynamic infrastructure. Key takeaways:

  • Kubernetes SD with endpoints role is the most common production pattern
  • relabel_configs runs before scraping (saves CPU); metric_relabel_configs runs after (saves storage)
  • The keep/drop actions are your primary filtering tools
  • hashmod enables sharding targets across multiple Prometheus instances
  • HTTP SD is the universal escape hatch for any custom infrastructure

Next in the Series

In Part 6: Effective Alerting & Alertmanager, we’ll turn PromQL queries into actionable alerts with proper routing, grouping, inhibition rules, and multi-channel notification.