Prometheus Deep Dive Part 1: Observability, Monitoring & Prometheus

June 15, 2026 Wasil Zafar 28 min read

From Google’s Borgmon to CNCF’s second graduated project — trace the evolution of modern monitoring systems, establish the vocabulary of observability, and understand why Prometheus became the de facto standard for cloud-native metrics collection and alerting.

Table of Contents

  1. The Evolution of Monitoring
  2. Observability Terminology
  3. Prometheus’ Role in Observability
  4. Design Philosophy & Tradeoffs
  5. Prometheus vs Other Systems
  6. Conclusion & What’s Next

The Evolution of Monitoring

To understand where Prometheus fits in the monitoring landscape, we need to trace the lineage of monitoring systems from their earliest forms to the cloud-native era. Each generation solved the problems of its time while creating the constraints that the next generation would overcome.

Early Monitoring Systems

The history of system monitoring stretches back to the earliest networked computers, but the modern era begins with a few pivotal systems:

Monitoring System Evolution

| Era | System | Paradigm | Limitations |
|------|--------|----------|-------------|
| 1988 | SNMP | Agent-based polling | Limited to network devices, complex MIBs |
| 1999 | Nagios | Check-based (up/down) | No time series, configuration explosion at scale |
| 2006 | Graphite | Push metrics + dashboards | No labels/dimensions, hierarchical naming |
| 2008 | Borgmon (Google internal) | Pull-based, label-dimensional, rules | Proprietary, never open-sourced |
| 2012 | Prometheus (SoundCloud) | Pull-based, multi-dimensional, PromQL | Single-node TSDB, 15-day default retention |
| 2013 | InfluxDB | Push-based time series DB | Clustering complexity, commercial features gated |
| 2017 | Thanos / Cortex | Prometheus long-term storage | Operational complexity |
| 2022 | Grafana Mimir | Horizontally scalable Prometheus | Requires object storage infrastructure |
| 2023 | OpenTelemetry Metrics | Vendor-neutral telemetry pipeline | Still maturing, ecosystem fragmentation |

The Nagios era (late 1990s–2010s) defined monitoring as “Is it up or down?” Nagios and its forks (Icinga, Shinken) excelled at host and service checks but lacked time-series storage. You knew something was broken, but understanding why required SSH-ing into machines and reading logs manually.

The Graphite era (2006–2015) introduced time-series metrics collection via StatsD and carbon. Teams could finally graph metric values over time. But Graphite used hierarchical dot-notation naming (servers.web01.cpu.user) which made ad-hoc querying across dimensions nearly impossible. Want CPU usage by region? You needed a completely different metric path.
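
To make the naming problem concrete, here is a minimal Python sketch. It is not Graphite or Prometheus code; the metric names, hosts, regions, and values are all hypothetical. It stores the same three CPU readings once as dotted paths and once as labeled series, then asks for "CPU by region".

```python
# Hypothetical sample data: the same three CPU readings in both naming styles.
graphite = {
    "servers.web01.cpu.user": 0.42,
    "servers.web02.cpu.user": 0.37,
    "servers.db01.cpu.user": 0.71,
}

labeled = [
    ({"__name__": "cpu_user", "host": "web01", "region": "eu"}, 0.42),
    ({"__name__": "cpu_user", "host": "web02", "region": "us"}, 0.37),
    ({"__name__": "cpu_user", "host": "db01", "region": "eu"}, 0.71),
]


def avg_by(series, label):
    """Average values grouped by one label, whatever the other labels are."""
    groups = {}
    for labels, value in series:
        groups.setdefault(labels[label], []).append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# With labels, "CPU by region" is just a grouping over data that already exists.
by_region = avg_by(labeled, "region")

# With dotted paths, region is not part of the name at all: the only
# dimensions available are the ones baked into the path positions.
hosts_in_paths = sorted(path.split(".")[1] for path in graphite)
```

In the labeled form, adding a new dimension later means adding a label. In the path form it means inventing a new path layout and re-emitting every metric under it.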

Google’s Borgmon

Inside Google, the Borg cluster manager (predecessor to Kubernetes) had its own monitoring system: Borgmon. First described publicly in Google’s 2016 SRE book, Borgmon introduced several revolutionary concepts:

Borgmon’s Key Innovations:
  • Multi-dimensional data model — metrics identified by name + key-value label pairs, not hierarchical paths
  • Pull-based collection — Borgmon scrapes targets rather than targets pushing to it
  • Powerful query language — algebraic expressions over time series for dashboards and alerting
  • Rules-based alerting — alerts defined as expressions, not threshold checks
  • Service discovery integration — automatically finds targets from the cluster scheduler

These same principles became the foundation of Prometheus. Matt Proud and Julius Volz, both ex-Googlers who joined SoundCloud, brought Borgmon’s philosophy to the open-source world.

Birth at SoundCloud

In 2012, SoundCloud was rapidly adopting microservices and found existing monitoring tools inadequate. Nagios couldn’t handle the dynamic nature of containerized services, and Graphite’s hierarchical naming was too rigid for multi-dimensional queries.

Matt Proud and Julius Volz began building Prometheus as an internal project, drawing directly from their experience with Borgmon at Google. The key design decisions made at SoundCloud:

  • Written in Go — single binary, easy deployment, no external dependencies
  • Pull-based scraping — Prometheus actively fetches metrics from instrumented targets
  • Local TSDB storage — no external database required; everything self-contained
  • Label-based data model — every metric has arbitrary key-value pairs for dimensional queries
  • PromQL — a functional query language purpose-built for time series aggregation
  • Alerting via expressions: an alerting rule is a PromQL query, and each series it returns becomes an alert

Prometheus was open-sourced in January 2015 and quickly gained traction as the Kubernetes ecosystem was forming. The timing was perfect — Kubernetes needed a monitoring system that understood dynamic service discovery, and Prometheus needed a platform that generated the kind of ephemeral, labeled workloads it was designed to monitor.

CNCF Graduation & Ecosystem

Prometheus joined the Cloud Native Computing Foundation (CNCF) in May 2016 as its second hosted project after Kubernetes itself. It graduated in August 2018, signifying production readiness and a healthy governance model.

Prometheus Project Timeline
timeline
    title Prometheus Milestones
    2012 : Development begins at SoundCloud
    2015 : Open-sourced (v0.1)
    2016 : Joins CNCF
         : Prometheus 1.0 release
    2017 : Prometheus 2.0 (new TSDB)
         : Thanos project announced
    2018 : CNCF Graduation
         : Remote write protocol formalized
    2019 : Cortex donated to CNCF
    2020 : OpenMetrics standardization
    2022 : Grafana Mimir open-sourced
         : Native histograms introduced
    2023 : Prometheus 2.47+ (UTF-8 metrics)
    2024 : OpenTelemetry Prometheus receiver GA
         : Prometheus 3.0 (OTLP native ingestion)

Today, Prometheus has over 55,000 GitHub stars, 900+ contributors, and forms the metrics backbone for the majority of Kubernetes deployments worldwide. Its exposition format became the basis for the OpenMetrics standard (RFC draft), and its remote-write protocol is the de facto standard for metrics ingestion across the ecosystem.

Observability Terminology

Monitoring vs Observability

These terms are often used interchangeably, but they represent fundamentally different philosophies:

Monitoring tells you when something is wrong. It answers predefined questions: “Is the service up?”, “Is latency above threshold?”, “Are errors increasing?” You must know what to ask in advance.
Observability lets you ask arbitrary questions about your system’s internal state by examining its external outputs. It answers questions you didn’t anticipate: “Why are requests from region X, for user segment Y, using API version Z, experiencing 3x normal latency?”

Prometheus primarily enables monitoring (predefined metric collection and alerting), but its multi-dimensional data model and PromQL make it significantly more powerful than traditional monitoring tools. Combined with logs (Loki) and traces (Tempo), it forms a complete observability system.

The Three Pillars + Profiles

Modern observability is built on four telemetry signal types, each offering a different lens into system behavior:

The Four Signals of Observability
flowchart LR
    subgraph Signals["Telemetry Signals"]
        M["Metrics<br/>What happened?<br/>Numeric aggregates<br/>over time"]
        L["Logs<br/>What details?<br/>Discrete events<br/>with context"]
        T["Traces<br/>Where specifically?<br/>Request flow<br/>across services"]
        P["Profiles<br/>Why resource usage?<br/>CPU/memory<br/>at code level"]
    end
    M -->|"Prometheus<br/>Mimir"| D["Dashboards<br/>& Alerts"]
    L -->|"Loki<br/>Elasticsearch"| D
    T -->|"Tempo<br/>Jaeger"| D
    P -->|"Pyroscope<br/>pprof"| D

| Signal | Prometheus Role | Example | Cardinality |
|--------|-----------------|---------|-------------|
| Metrics | Primary: collection, storage, querying, alerting | http_requests_total{method="GET", status="200"} | Low (aggregated) |
| Logs | Indirect: Loki uses PromQL-like LogQL | {job="api"} \|= "error" \| json | High (per-event) |
| Traces | Indirect: exemplars link metrics to traces | Trace ID embedded in histogram bucket | Very high (per-request) |
| Profiles | Complementary: Pyroscope correlates with metrics | CPU flame graph for high-latency period | Very high (per-function) |

Metric Types & Semantics

Prometheus defines four core metric types, each with distinct semantics that determine how they should be queried:

# Counter - monotonically increasing value (resets on restart)
# USE: request counts, bytes sent, errors total
# QUERY: Always use rate() or increase() - raw value is meaningless
http_requests_total{method="GET", handler="/api/users", status="200"} 142857

# Gauge - value that can go up and down
# USE: temperature, memory usage, active connections, queue depth
# QUERY: Direct value is meaningful; use avg_over_time(), max_over_time()
node_memory_MemAvailable_bytes 8589934592

# Histogram - samples observations into configurable buckets
# USE: request duration, response size - anything where distribution matters
# QUERY: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.25"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1.0"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423.67
http_request_duration_seconds_count 144320

# Summary - client-side calculated quantiles (pre-aggregated)
# USE: When you need accurate quantiles from a single instance and don't need to aggregate across instances
# LIMITATION: Cannot aggregate quantiles across multiple instances
go_gc_duration_seconds{quantile="0.5"} 0.000235
go_gc_duration_seconds{quantile="0.9"} 0.000892
go_gc_duration_seconds{quantile="0.99"} 0.003401
go_gc_duration_seconds_sum 4.293820
go_gc_duration_seconds_count 18232
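
The advice to always wrap counters in rate() or increase() follows from how resets are handled. The sketch below is a simplified Python illustration of that idea, not the PromQL implementation: it assumes a drop in value means the process restarted from zero, and it skips the extrapolation to window boundaries that PromQL applies. The sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets.

    A drop in value is treated as a restart: the counter is assumed to have
    started again from zero, so the whole new value counts as increase.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def counter_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Hypothetical scrapes every 15 s; the process restarts before t=45.
scrapes = [(0, 1000.0), (15, 1030.0), (30, 1060.0), (45, 20.0), (60, 50.0)]
increase = counter_increase(scrapes)  # 30 + 30 + 20 + 30
rate = counter_rate(scrapes)          # increase spread over 60 seconds
```

Reading the raw value at t=60 would suggest 50 requests in total, while the reset-aware increase recovers 110 for the window.
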
Common Mistake: Using a summary when you should use a histogram. Summary quantiles cannot be aggregated across instances: the P99 of P99s is NOT the global P99. Prefer histograms for request latency in distributed systems. The main exception is a single instance that needs accurate quantiles with zero server-side computation.
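
A small numeric experiment shows why averaging quantiles goes wrong. The Python sketch below estimates a quantile from cumulative buckets using linear interpolation, which is the general idea behind histogram_quantile() for classic histograms (edge cases are simplified here). The two instances and their bucket counts are hypothetical: one busy and fast, one quiet and slow.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    rank = q * buckets[-1][1]  # target position among all observations
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            span = count - lower_count
            if math.isinf(upper_bound) or span == 0:
                return lower_bound  # cannot interpolate here
            fraction = (rank - lower_count) / span
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


def merge(*histograms):
    """Sum cumulative counts per bucket boundary, like sum by (le)."""
    merged = {}
    for buckets in histograms:
        for upper_bound, count in buckets:
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return list(merged.items())


INF = float("inf")
# Hypothetical instances: 1000 mostly fast requests vs. 100 slower ones.
fast = [(0.1, 900), (0.5, 990), (1.0, 1000), (2.0, 1000), (INF, 1000)]
slow = [(0.1, 10), (0.5, 40), (1.0, 90), (2.0, 100), (INF, 100)]

p99_fast = histogram_quantile(0.99, fast)
p99_slow = histogram_quantile(0.99, slow)
avg_of_p99s = (p99_fast + p99_slow) / 2
global_p99 = histogram_quantile(0.99, merge(fast, slow))
```

Averaging the two per-instance values gives the quiet instance the same weight as the busy one, so the result lands well above the P99 computed from the merged buckets. Buckets can be summed before the quantile is taken; pre-computed summary quantiles cannot.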

Labels & Dimensions

Labels are the heart of Prometheus’ power. Every unique combination of metric name + label key-value pairs creates a distinct time series:

# These are FOUR distinct time series:
http_requests_total{method="GET", handler="/api/users", status="200"}   → series 1
http_requests_total{method="GET", handler="/api/users", status="500"}   → series 2
http_requests_total{method="POST", handler="/api/users", status="201"}  → series 3
http_requests_total{method="GET", handler="/api/orders", status="200"}  → series 4

# Cardinality = unique combinations of all label values
# For this metric: methods(4) × handlers(50) × statuses(10) = 2,000 series
# Add an instance label with 100 pods: 2,000 × 100 = 200,000 series!

# GOOD labels (bounded cardinality):
http_requests_total{method="GET", status="200", service="user-api", environment="production"}

# BAD labels (unbounded cardinality - will kill your Prometheus):
http_requests_total{user_id="abc123", request_id="req-789", trace_id="..."}

The cardinality of a metric is the total number of unique time series it creates. High cardinality is the primary scaling challenge in Prometheus deployments. We’ll explore cardinality management in depth in Part 8 (Optimizing & Debugging).
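
The arithmetic above is easy to reproduce. This Python sketch computes the worst-case series count as the product of distinct values per label, and counts the series actually observed as the number of distinct label sets. The concrete label values are placeholders chosen only to match the counts in the example (4 methods, 50 handlers, 10 statuses, 100 pods).

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound for one metric: the product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


def observed_series(samples):
    """Number of distinct label sets seen; each one is a separate series."""
    return len({tuple(sorted(labels.items())) for labels in samples})


# Placeholder values sized to match the example counts.
labels = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "handler": [f"/api/route{i}" for i in range(50)],
    "status": [str(code) for code in range(200, 210)],
}
per_instance = worst_case_series(labels)

labels["instance"] = [f"pod-{i}" for i in range(100)]
fleet_wide = worst_case_series(labels)

# Repeated samples with identical labels land in the same series.
seen = observed_series([
    {"method": "GET", "handler": "/api/users", "status": "200"},
    {"method": "GET", "handler": "/api/users", "status": "500"},
    {"method": "GET", "handler": "/api/users", "status": "200"},
])
```

The product is an upper bound, since real traffic rarely exercises every combination. The same multiplication shows why one unbounded label such as user_id removes the bound entirely.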

Prometheus’ Role in Observability

The Pull-Based Model

Prometheus’ most distinctive architectural choice is its pull-based (scrape) model. Instead of applications pushing metrics to a central collector, Prometheus actively fetches metrics from HTTP endpoints exposed by targets:

Pull vs Push Models
flowchart TD
    subgraph Pull["Pull Model (Prometheus)"]
        P[Prometheus Server]
        T1[Target: /metrics]
        T2[Target: /metrics]
        T3[Target: /metrics]
        P -->|"GET /metrics<br/>every 15s"| T1
        P -->|"GET /metrics<br/>every 15s"| T2
        P -->|"GET /metrics<br/>every 15s"| T3
    end
    subgraph Push["Push Model (StatsD/InfluxDB)"]
        C[Collector/DB]
        A1[App]
        A2[App]
        A3[App]
        A1 -->|"Push UDP/TCP"| C
        A2 -->|"Push UDP/TCP"| C
        A3 -->|"Push UDP/TCP"| C
    end

Advantages of pull:

  • Easier to detect “target down” — if a scrape fails, Prometheus knows immediately; with push, silence is ambiguous
  • No backpressure on applications — targets serve metrics on demand, they don’t need retry/buffer logic
  • Central control of scrape frequency — one configuration change affects all targets
  • Simpler firewall rules — only Prometheus needs outbound access to targets
  • Development convenience — curl the /metrics endpoint directly to debug instrumentation

Disadvantages of pull:

  • Short-lived jobs — batch jobs that finish before the next scrape miss data (solved by Pushgateway)
  • Network boundaries — Prometheus must reach all targets (solved by federation/remote-write)
  • Event-based metrics — not ideal for high-frequency events where every occurrence matters
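
The pull model can be sketched end to end with the Python standard library. This is an illustration under stated assumptions, not how Prometheus or its client libraries are implemented: the metric name demo_scrapes_served_total is made up, and the parser handles only bare `name value` lines (no timestamps, no HELP/TYPE metadata). The target exposes its current values over HTTP, and the collector fetches them and records up=0 when the fetch fails.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SERVED = 0  # hypothetical in-process counter


class MetricsHandler(BaseHTTPRequestHandler):
    """Target side: expose current metric values as plain text on /metrics."""

    def do_GET(self):
        global REQUESTS_SERVED
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUESTS_SERVED += 1
        body = f"demo_scrapes_served_total {REQUESTS_SERVED}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass


def scrape(url, timeout=2.0):
    """Collector side: fetch and parse one target; record up=0 on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            text = resp.read().decode()
    except OSError:  # connection refused, timeout, HTTP error
        return {"up": 0.0}
    samples = {"up": 1.0}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/metrics"

first = scrape(url)
second = scrape(url)
server.shutdown()
server.server_close()
after_shutdown = scrape(url)  # target gone: the failed scrape is the signal
```

The target keeps no buffer and no retry logic; it only answers when asked. When it disappears, the collector learns that from its own failed request rather than from an absence of pushes.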

Architecture Overview

Prometheus Core Architecture
flowchart TD
    subgraph Targets["Monitored Targets"]
        App1["Application<br/>/metrics endpoint"]
        App2["Node Exporter<br/>/metrics"]
        App3["Kubernetes API<br/>Service Discovery"]
    end
    subgraph Prom["Prometheus Server"]
        SD["Service Discovery"]
        SC["Scrape Manager"]
        TSDB["Local TSDB<br/>(time series storage)"]
        RE["Rule Engine<br/>(recording + alerting rules)"]
        API["HTTP API<br/>(PromQL queries)"]
    end
    subgraph Downstream["Downstream"]
        AM["Alertmanager<br/>(dedup, routing, silencing)"]
        GF["Grafana<br/>(dashboards)"]
        RW["Remote Write<br/>(Mimir, Thanos, VictoriaMetrics)"]
    end
    App3 --> SD
    SD --> SC
    SC -->|"scrape /metrics"| App1
    SC -->|"scrape /metrics"| App2
    SC --> TSDB
    TSDB --> RE
    TSDB --> API
    RE -->|"fire alerts"| AM
    API -->|"query"| GF
    TSDB -->|"remote_write"| RW

Ecosystem Components

The Prometheus server itself ships as a single binary, but a working deployment is an ecosystem of purpose-built components:

| Component | Purpose | When You Need It |
|-----------|---------|------------------|
| Prometheus Server | Scraping, storage, query, rules | Always (the core) |
| Alertmanager | Alert routing, deduplication, silencing | Any production alerting |
| Pushgateway | Metrics bridge for batch/short-lived jobs | Cron jobs, CI builds, Lambda functions |
| Node Exporter | Linux host metrics (CPU, memory, disk, network) | Any Linux infrastructure |
| Blackbox Exporter | Probing endpoints (HTTP, TCP, DNS, ICMP) | Synthetic monitoring, uptime checks |
| Client Libraries | Instrument application code (Go, Java, Python, etc.) | Custom application metrics |
| Exporters (100+) | Translate third-party metrics to Prometheus format | MySQL, Redis, Kafka, AWS, etc. |

Design Philosophy & Tradeoffs

Reliability Over Accuracy

Prometheus makes a deliberate tradeoff: reliability of the monitoring system itself over perfect accuracy of every data point. This manifests in several ways:

Prometheus Design Principles:
  • Each Prometheus server is independent — no clustering required for basic operation. If one server fails, others continue working
  • Eventual consistency is acceptable — a missed scrape or small gap in data is preferable to a monitoring system that crashes under load
  • Simple over complex — the core binary has zero external dependencies (no ZooKeeper, no Kafka, no Cassandra)
  • Local decision-making — alerting rules evaluate locally; alerts fire even if the network to downstream is degraded

This philosophy stems from a core insight: your monitoring system must be more reliable than the systems it monitors. A monitoring system with complex distributed consensus requirements will fail in exactly the scenarios where you need it most — during network partitions, infrastructure failures, and cascading outages.

Local Storage by Default

Prometheus stores all time series data in a local Time Series Database (TSDB) on disk. This is simultaneously its greatest strength and primary limitation:

| Strength | Limitation |
|----------|------------|
| Zero external dependencies | Limited by local disk capacity |
| Fast queries (local SSD) | Data loss if disk fails (mitigate with RAID/replication) |
| Simple operations | No global query view across instances |
| Predictable performance | Typically 15–30 day retention |

For most teams, the local TSDB is sufficient. When you outgrow it, the remote write protocol lets you replicate data to long-term storage systems (Mimir, Thanos, VictoriaMetrics) — covered in Parts 10 and 11 of this track.
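
Whether the local TSDB is sufficient is largely a disk-sizing question. The Python sketch below applies the rule of thumb from the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample. The 2 bytes per sample default is an assumption at the upper end of the commonly quoted 1 to 2 byte range, and the workload numbers are hypothetical.

```python
def needed_disk_bytes(retention_days, active_series, scrape_interval_s,
                      bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds x samples/second x bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 200,000 active series scraped every 15 s,
# kept for the default 15 days.
estimate = needed_disk_bytes(retention_days=15,
                             active_series=200_000,
                             scrape_interval_s=15)
estimate_gb = estimate / 1e9
```

For this workload the estimate comes to roughly 35 GB. Treat it as a starting point for capacity planning, since actual usage depends on how well the samples compress and on series churn.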

What Prometheus Is Not

Understanding Prometheus’ boundaries prevents misuse and disappointment:

Prometheus is NOT:
  • An event logging system — it stores aggregated metrics, not individual events. Use Loki for logs
  • A long-term storage system — default retention is 15 days. Use Mimir/Thanos for years of data
  • A source of 100% accurate per-request data for billing (values are sampled at scrape intervals, and functions like rate() extrapolate to window boundaries)
  • A distributed database — each server is independent. Use federation or remote storage for global views
  • An anomaly detection engine — it provides the data; ML-based detection requires additional tooling
  • A dashboarding tool — Grafana is the standard visualization layer

Prometheus vs Other Systems

vs Graphite

| Aspect | Prometheus | Graphite |
|--------|------------|----------|
| Data Model | Multi-dimensional (labels) | Hierarchical (dot-notation) |
| Collection | Pull (scrape) | Push (StatsD/carbon) |
| Query Language | PromQL (functional) | Graphite functions (pipe-based) |
| Storage | Local TSDB (compressed) | Whisper files (fixed-size) |
| Service Discovery | Native (K8s, Consul, DNS, etc.) | None (manual configuration) |
| Alerting | Built-in rules + Alertmanager | Requires external (Grafana alerts) |
| Scalability | Single server; scale via sharding/remote-write | Relay + carbon-cache clustering |

vs InfluxDB

| Aspect | Prometheus | InfluxDB |
|--------|------------|----------|
| License | Apache 2.0 (fully open) | MIT (OSS) / Proprietary (Cloud) |
| Data Model | Metrics only (labels) | Tags + fields (richer but complex) |
| Collection | Pull | Push (line protocol) |
| Query Language | PromQL | Flux / InfluxQL |
| Use Case | Monitoring & alerting | General time series (IoT, analytics) |
| Ecosystem | Massive (CNCF, K8s native) | Smaller, self-contained |
| Clustering | External (Mimir, Thanos) | Enterprise only (proprietary) |

vs Datadog & Commercial APM

| Aspect | Prometheus | Datadog / New Relic / Dynatrace |
|--------|------------|---------------------------------|
| Cost | Free (infrastructure costs only) | Per-host/per-metric pricing ($15-$50/host/month) |
| Data Residency | Your infrastructure | Vendor’s cloud (compliance concern) |
| Operational Burden | You manage it | Fully managed SaaS |
| Customization | Complete control | Limited to vendor features |
| Integration Depth | Deep K8s/cloud-native | Broad but shallower per-tool |
| Vendor Lock-in | None (OpenMetrics standard) | High (proprietary agents, query languages) |

When to Choose Prometheus: Organizations with Kubernetes infrastructure, engineering teams comfortable with open-source operations, cost-sensitive environments, or strict data residency requirements. Prometheus is the right choice when you want full control and the long-term ability to avoid vendor lock-in.

Conclusion & What’s Next

Prometheus didn’t emerge in a vacuum. It’s the open-source crystallization of Google’s decade-long internal monitoring experience, adapted for the cloud-native era. Its design philosophy — reliability over accuracy, simplicity over features, pull over push — makes it uniquely suited to monitoring dynamic, containerized infrastructure.

Key takeaways from this foundational part:

  • Prometheus descends directly from Google’s Borgmon via ex-Googlers at SoundCloud
  • The multi-dimensional label model replaced rigid hierarchical naming
  • Pull-based collection makes “target down” detection trivial
  • Four metric types (counter, gauge, histogram, summary) have distinct query semantics
  • Cardinality (unique label combinations) is the primary scaling constraint
  • Prometheus is one component in a larger observability ecosystem (metrics, logs, traces, profiles)

Next in the Series

In Part 2: Deploying Prometheus to Kubernetes, we’ll set up a complete Prometheus deployment using the kube-prometheus-stack Helm chart, configure service discovery for Kubernetes workloads, and build the lab environment we’ll use throughout the rest of this track.