The Evolution of Monitoring
To understand where Prometheus fits in the monitoring landscape, we need to trace the lineage of monitoring systems from their earliest forms to the cloud-native era. Each generation solved the problems of its time while creating the constraints that the next generation would overcome.
Early Monitoring Systems
The history of system monitoring stretches back to the earliest networked computers, but the modern era begins with a few pivotal systems:
Monitoring System Evolution
| Era | System | Paradigm | Limitations |
|---|---|---|---|
| 1988 | SNMP | Agent-based polling | Limited to network devices, complex MIBs |
| 1999 | Nagios | Check-based (up/down) | No time series, configuration explosion at scale |
| 2003 | Borgmon (Google internal) | Pull-based, label-dimensional, rules | Proprietary, never open-sourced |
| 2006 | Graphite | Push metrics + dashboards | No labels/dimensions, hierarchical naming |
| 2012 | Prometheus (SoundCloud) | Pull-based, multi-dimensional, PromQL | Single-node TSDB, 15-day default retention |
| 2013 | InfluxDB | Push-based time series DB | Clustering complexity, commercial features gated |
| 2017 | Thanos / Cortex | Prometheus long-term storage | Operational complexity |
| 2022 | Grafana Mimir | Horizontally scalable Prometheus | Requires object storage infrastructure |
| 2023 | OpenTelemetry Metrics | Vendor-neutral telemetry pipeline | Still maturing, ecosystem fragmentation |
The Nagios era (late 1990s–2010s) defined monitoring as “Is it up or down?” Nagios and its forks (Icinga, Shinken) excelled at host and service checks but lacked time-series storage. You knew something was broken, but understanding why required SSH-ing into machines and reading logs manually.
The Graphite era (2006–2015) introduced time-series metrics collection via StatsD and carbon. Teams could finally graph metric values over time. But Graphite used hierarchical dot-notation naming (servers.web01.cpu.user) which made ad-hoc querying across dimensions nearly impossible. Want CPU usage by region? You needed a completely different metric path.
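To make the difference concrete, here is a minimal sketch in Python. The host names, the `region` label, and the `avg_by` helper are illustrative assumptions, not part of Graphite or Prometheus. The same three CPU readings are stored once as dot-notation paths and once as labeled samples, and only the labeled form can be grouped by a dimension that the path does not encode.

```python
from collections import defaultdict

# Hierarchical naming: the path encodes host and metric, but not region,
# so "CPU by region" cannot be answered without inventing new paths.
graphite_samples = {
    "servers.web01.cpu.user": 0.42,
    "servers.web02.cpu.user": 0.58,
    "servers.web03.cpu.user": 0.30,
}

# Label-based naming: every dimension is an independent key-value pair.
labeled_samples = [
    ({"__name__": "cpu_user", "host": "web01", "region": "eu"}, 0.42),
    ({"__name__": "cpu_user", "host": "web02", "region": "eu"}, 0.58),
    ({"__name__": "cpu_user", "host": "web03", "region": "us"}, 0.30),
]


def avg_by(samples, metric, label):
    """Average one metric grouped by a single label (hypothetical helper)."""
    groups = defaultdict(list)
    for labels, value in samples:
        if labels.get("__name__") == metric and label in labels:
            groups[labels[label]].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


cpu_by_region = avg_by(labeled_samples, "cpu_user", "region")
cpu_by_host = avg_by(labeled_samples, "cpu_user", "host")
print(cpu_by_region)
```

Grouping by `host` instead of `region` needs no change to how the data is stored, which is the flexibility the hierarchical model lacks.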
Google’s Borgmon
Inside Google, the Borg cluster manager (predecessor to Kubernetes) had its own monitoring system: Borgmon. First described publicly in Google’s 2016 SRE book, Borgmon introduced several revolutionary concepts:
- Multi-dimensional data model — metrics identified by name + key-value label pairs, not hierarchical paths
- Pull-based collection — Borgmon scrapes targets rather than targets pushing to it
- Powerful query language — algebraic expressions over time series for dashboards and alerting
- Rules-based alerting — alerts defined as expressions, not threshold checks
- Service discovery integration — automatically finds targets from the cluster scheduler
These same principles became the foundation of Prometheus. Matt Proud and Julius Volz, both ex-Googlers who joined SoundCloud, brought Borgmon’s philosophy to the open-source world.
Birth at SoundCloud
In 2012, SoundCloud was rapidly adopting microservices and found existing monitoring tools inadequate. Nagios couldn’t handle the dynamic nature of containerized services, and Graphite’s hierarchical naming was too rigid for multi-dimensional queries.
Matt Proud and Julius Volz began building Prometheus as an internal project, drawing directly from their experience with Borgmon at Google. The key design decisions made at SoundCloud:
- Written in Go — single binary, easy deployment, no external dependencies
- Pull-based scraping — Prometheus actively fetches metrics from instrumented targets
- Local TSDB storage — no external database required; everything self-contained
- Label-based data model — every metric has arbitrary key-value pairs for dimensional queries
- PromQL — a functional query language purpose-built for time series aggregation
- Alerting via expressions — alerts are PromQL queries that evaluate to true/false
Prometheus was publicly announced in January 2015 and quickly gained traction as the Kubernetes ecosystem was forming. The timing was perfect: Kubernetes needed a monitoring system that understood dynamic service discovery, and Prometheus needed a platform that generated the kind of ephemeral, labeled workloads it was designed to monitor.
CNCF Graduation & Ecosystem
Prometheus joined the Cloud Native Computing Foundation (CNCF) in May 2016 as its second hosted project after Kubernetes itself. It graduated in August 2018, signifying production readiness and a healthy governance model.
timeline
title Prometheus Milestones
2012 : Development begins at SoundCloud
2015 : Open-sourced (v0.1)
2016 : Joins CNCF
: Prometheus 1.0 release
2017 : Prometheus 2.0 (new TSDB)
: Thanos project announced
2018 : CNCF Graduation
: Remote write protocol formalized
2019 : Cortex donated to CNCF
2020 : OpenMetrics standardization
2022 : Grafana Mimir open-sourced
: Native histograms introduced
2023 : Prometheus 2.47+ (UTF-8 metrics)
2024 : OpenTelemetry Prometheus receiver GA
: Prometheus 3.0 (OTLP native ingestion)
Today, Prometheus has over 55,000 GitHub stars and 900+ contributors, and it serves as the metrics backbone for a large share of Kubernetes deployments. Its exposition format became the basis for the OpenMetrics specification (published as an IETF Internet-Draft), and its remote-write protocol is the de facto standard for metrics ingestion across the ecosystem.
Observability Terminology
Monitoring vs Observability
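These terms are often used interchangeably, but they represent different philosophies. Monitoring checks a system against failure modes you anticipated, using predefined metrics, dashboards, and alerts. Observability is the ability to ask new questions about a system’s internal state from the telemetry it already emits, including questions nobody planned for.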
Prometheus primarily enables monitoring (predefined metric collection and alerting), but its multi-dimensional data model and PromQL make it significantly more powerful than traditional monitoring tools. Combined with logs (Loki) and traces (Tempo), it forms a complete observability system.
The Three Pillars + Profiles
Modern observability is built on four telemetry signal types, each offering a different lens into system behavior:
flowchart LR
subgraph Signals["Telemetry Signals"]
M["Metrics
What happened?
Numeric aggregates
over time"]
L["Logs
What details?
Discrete events
with context"]
T["Traces
Where specifically?
Request flow
across services"]
P["Profiles
Why resource usage?
CPU/memory
at code level"]
end
M -->|"Prometheus
Mimir"| D["Dashboards
& Alerts"]
L -->|"Loki
Elasticsearch"| D
T -->|"Tempo
Jaeger"| D
P -->|"Pyroscope
pprof"| D
| Signal | Prometheus Role | Example | Cardinality |
|---|---|---|---|
| Metrics | Primary — collection, storage, querying, alerting | http_requests_total{method="GET", status="200"} | Low (aggregated) |
| Logs | Indirect — Loki uses PromQL-like LogQL | {job="api"} \|= "error" \| json | High (per-event) |
| Traces | Indirect — exemplars link metrics to traces | Trace ID embedded in histogram bucket | Very high (per-request) |
| Profiles | Complementary — Pyroscope correlates with metrics | CPU flame graph for high-latency period | Very high (per-function) |
Metric Types & Semantics
Prometheus defines four core metric types, each with distinct semantics that determine how they should be queried:
# Counter - monotonically increasing value (resets on restart)
# USE: request counts, bytes sent, errors total
# QUERY: Always use rate() or increase() - raw value is meaningless
http_requests_total{method="GET", handler="/api/users", status="200"} 142857
# Gauge - value that can go up and down
# USE: temperature, memory usage, active connections, queue depth
# QUERY: Direct value is meaningful; use avg_over_time(), max_over_time()
node_memory_MemAvailable_bytes 8589934592
# Histogram - samples observations into configurable buckets
# USE: request duration, response size - anything where distribution matters
# QUERY: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.25"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1.0"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423.67
http_request_duration_seconds_count 144320
# Summary - client-side calculated quantiles (pre-aggregated)
# USE: When you need exact quantiles but don't need to aggregate across instances
# LIMITATION: Cannot aggregate quantiles across multiple instances
go_gc_duration_seconds{quantile="0.5"} 0.000235
go_gc_duration_seconds{quantile="0.9"} 0.000892
go_gc_duration_seconds{quantile="0.99"} 0.003401
go_gc_duration_seconds_sum 4.293820
go_gc_duration_seconds_count 18232
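The advice to query counters through rate() or increase() follows from how resets are handled. The sketch below is a simplified illustration in Python, not Prometheus’ actual implementation: it treats any drop in value as a process restart and ignores the boundary extrapolation that the real functions perform. The sample data and function names are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples.

    Any decrease is treated as a counter reset, so the post-reset value
    is counted as growth from zero.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # counter restarted from zero
    return total


def simple_rate(samples):
    """Average per-second increase over the sampled window (no extrapolation)."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 s; the process restarts before the fourth sample.
scrapes = [(0, 1000.0), (15, 1030.0), (30, 1060.0), (45, 20.0), (60, 50.0)]

naive_delta = scrapes[-1][1] - scrapes[0][1]  # negative: misleading
increase = counter_increase(scrapes)
per_second = simple_rate(scrapes)
print(naive_delta, increase, per_second)
```

Subtracting the first raw value from the last gives a negative number here, while the reset-aware calculation recovers the increase across the restart.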
Common mistake: using a summary when you should use a histogram. Summary quantiles cannot be aggregated across instances; the P99 of P99s is NOT the global P99. Prefer histograms for request latency in distributed systems. The main exception is a single instance that needs precise quantiles with zero server-side computation.
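A small numeric experiment shows why averaging per-instance quantiles goes wrong while summing histogram buckets does not. This is a sketch under stated assumptions: the two instances and their latencies are invented, and `quantile_from_buckets` only approximates the idea behind histogram_quantile() (linear interpolation inside the bucket that holds the target rank) rather than reproducing Prometheus’ code.

```python
import math

BOUNDS = [0.1, 0.25, 0.5, 1.0, math.inf]  # bucket upper bounds (the `le` values)


def exact_quantile(observations, q):
    """Exact quantile of raw observations, as a summary computes client-side."""
    ordered = sorted(observations)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[min(index, len(ordered) - 1)]


def cumulative_buckets(observations):
    """Cumulative count of observations at or below each bound."""
    return [sum(1 for o in observations if o <= bound) for bound in BOUNDS]


def quantile_from_buckets(q, cumulative):
    """Estimate a quantile by interpolating within the bucket holding the rank."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(BOUNDS[i]):
                return BOUNDS[i - 1]  # rank falls in +Inf: report highest finite bound
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            if count == below:
                return BOUNDS[i]
            return lower + (BOUNDS[i] - lower) * (rank - below) / (count - below)
    return math.nan


# Hypothetical traffic: one busy, fast instance and one quiet, slow instance.
fast_instance = [0.05] * 900
slow_instance = [0.9] * 100

averaged_p99 = (exact_quantile(fast_instance, 0.99) + exact_quantile(slow_instance, 0.99)) / 2
true_p99 = exact_quantile(fast_instance + slow_instance, 0.99)

merged = [a + b for a, b in zip(cumulative_buckets(fast_instance), cumulative_buckets(slow_instance))]
merged_p99 = quantile_from_buckets(0.99, merged)
print(averaged_p99, true_p99, merged_p99)
```

The averaged summary quantiles land far from the true global P99 because they ignore how much traffic each instance served. The merged buckets keep that weighting, so the estimate stays inside the correct bucket, with accuracy limited only by bucket width.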
Labels & Dimensions
Labels are the heart of Prometheus’ power. Every unique combination of metric name + label key-value pairs creates a distinct time series:
# These are FOUR distinct time series:
http_requests_total{method="GET", handler="/api/users", status="200"}    # series 1
http_requests_total{method="GET", handler="/api/users", status="500"}    # series 2
http_requests_total{method="POST", handler="/api/users", status="201"}   # series 3
http_requests_total{method="GET", handler="/api/orders", status="200"}   # series 4
# Cardinality = unique combinations of all label values
# For this metric: methods(4) × handlers(50) × statuses(10) = 2,000 series
# Add an instance label with 100 pods: 2,000 × 100 = 200,000 series!
# GOOD labels (bounded cardinality):
http_requests_total{method="GET", status="200", service="user-api", environment="production"}
# BAD labels (unbounded cardinality - will kill your Prometheus):
http_requests_total{user_id="abc123", request_id="req-789", trace_id="..."}
The cardinality of a metric is the total number of unique time series it creates. High cardinality is the primary scaling challenge in Prometheus deployments. We’ll explore cardinality management in depth in Part 8 (Optimizing & Debugging).
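The arithmetic above can be sketched as a quick estimator. The helper names are hypothetical; the label counts are the ones used in the example. Note that the product is a worst-case bound, since not every combination of label values necessarily occurs in practice, so the second helper counts the distinct label sets actually observed.

```python
from math import prod


def series_upper_bound(label_value_counts):
    """Worst-case series count: product of distinct values per label."""
    return prod(label_value_counts.values())


def observed_series(label_sets):
    """Number of distinct label combinations actually seen."""
    return len({frozenset(labels.items()) for labels in label_sets})


per_instance = series_upper_bound({"method": 4, "handler": 50, "status": 10})
with_instances = series_upper_bound(
    {"method": 4, "handler": 50, "status": 10, "instance": 100}
)

seen = observed_series([
    {"method": "GET", "handler": "/api/users", "status": "200"},
    {"method": "GET", "handler": "/api/users", "status": "500"},
    {"method": "POST", "handler": "/api/users", "status": "201"},
    {"method": "GET", "handler": "/api/orders", "status": "200"},
    {"method": "GET", "handler": "/api/users", "status": "200"},  # duplicate
])
print(per_instance, with_instances, seen)
```

Running the estimator before adding a label is a cheap way to see whether a new dimension multiplies the series count by ten or by ten thousand.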
Prometheus’ Role in Observability
The Pull-Based Model
Prometheus’ most distinctive architectural choice is its pull-based (scrape) model. Instead of applications pushing metrics to a central collector, Prometheus actively fetches metrics from HTTP endpoints exposed by targets:
flowchart TD
subgraph Pull["Pull Model (Prometheus)"]
P[Prometheus Server]
T1[Target: /metrics]
T2[Target: /metrics]
T3[Target: /metrics]
P -->|"GET /metrics
every 15s"| T1
P -->|"GET /metrics
every 15s"| T2
P -->|"GET /metrics
every 15s"| T3
end
subgraph Push["Push Model (StatsD/InfluxDB)"]
C[Collector/DB]
A1[App]
A2[App]
A3[App]
A1 -->|"Push UDP/TCP"| C
A2 -->|"Push UDP/TCP"| C
A3 -->|"Push UDP/TCP"| C
end
Advantages of pull:
- Easier to detect “target down” — if a scrape fails, Prometheus knows immediately; with push, silence is ambiguous
- No backpressure on applications — targets serve metrics on demand, they don’t need retry/buffer logic
- Central control of scrape frequency — one configuration change affects all targets
- Simpler firewall rules — only Prometheus needs outbound access to targets
- Development convenience — curl the /metrics endpoint directly to debug instrumentation
Disadvantages of pull:
- Short-lived jobs — batch jobs that finish before the next scrape miss data (solved by Pushgateway)
- Network boundaries — Prometheus must reach all targets (solved by federation/remote-write)
- Event-based metrics — not ideal for high-frequency events where every occurrence matters
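The pull model can be sketched with nothing but the Python standard library. This is an illustration, not how Prometheus or its client libraries are implemented: the metric name `demo_scrapes_total`, the handler class, and the `scrape` function are all assumptions made for the example. A target exposes a text endpoint, a scraper fetches it, and a failed fetch is recorded as the target being down.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    scrapes_served = 0  # hypothetical counter kept by the target process

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        MetricsHandler.scrapes_served += 1
        body = (
            "# TYPE demo_scrapes_total counter\n"
            f"demo_scrapes_total {MetricsHandler.scrapes_served}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def scrape(url, timeout=2.0):
    """Fetch one target; return (up, samples) with up=1 on success, 0 on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            text = resp.read().decode()
    except OSError:
        return 0, {}
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return 1, samples


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
target = f"http://127.0.0.1:{server.server_port}/metrics"

up, samples = scrape(target)   # target is running
server.shutdown()
server.server_close()
down, _ = scrape(target)       # target is gone: the failed scrape itself is the signal
print(up, samples, down)
```

The target needs no knowledge of the scraper and no retry or buffering logic, and the scraper learns that the target is down from the failed request rather than from an absence of pushed data.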
Architecture Overview
flowchart TD
subgraph Targets["Monitored Targets"]
App1["Application
/metrics endpoint"]
App2["Node Exporter
/metrics"]
App3["Kubernetes API
Service Discovery"]
end
subgraph Prom["Prometheus Server"]
SD["Service Discovery"]
SC["Scrape Manager"]
TSDB["Local TSDB
(time series storage)"]
RE["Rule Engine
(recording + alerting rules)"]
API["HTTP API
(PromQL queries)"]
end
subgraph Downstream["Downstream"]
AM["Alertmanager
(dedup, routing, silencing)"]
GF["Grafana
(dashboards)"]
RW["Remote Write
(Mimir, Thanos, VictoriaMetrics)"]
end
App3 --> SD
SD --> SC
SC -->|"scrape /metrics"| App1
SC -->|"scrape /metrics"| App2
SC --> TSDB
TSDB --> RE
TSDB --> API
RE -->|"fire alerts"| AM
API -->|"query"| GF
TSDB -->|"remote_write"| RW
Ecosystem Components
The Prometheus server ships as a single binary, but the project as a whole is an ecosystem of purpose-built components:
| Component | Purpose | When You Need It |
|---|---|---|
| Prometheus Server | Scraping, storage, query, rules | Always — the core |
| Alertmanager | Alert routing, deduplication, silencing | Any production alerting |
| Pushgateway | Metrics bridge for batch/short-lived jobs | Cron jobs, CI builds, Lambda functions |
| Node Exporter | Linux host metrics (CPU, memory, disk, network) | Any Linux infrastructure |
| Blackbox Exporter | Probing endpoints (HTTP, TCP, DNS, ICMP) | Synthetic monitoring, uptime checks |
| Client Libraries | Instrument application code (Go, Java, Python, etc.) | Custom application metrics |
| Exporters (100+) | Translate third-party metrics to Prometheus format | MySQL, Redis, Kafka, AWS, etc. |
Design Philosophy & Tradeoffs
Reliability Over Accuracy
Prometheus makes a deliberate tradeoff: reliability of the monitoring system itself over perfect accuracy of every data point. This manifests in several ways:
- Each Prometheus server is independent — no clustering required for basic operation. If one server fails, others continue working
- Eventual consistency is acceptable — a missed scrape or small gap in data is preferable to a monitoring system that crashes under load
- Simple over complex — the core binary has zero external dependencies (no ZooKeeper, no Kafka, no Cassandra)
- Local decision-making — alerting rules evaluate locally; alerts fire even if the network to downstream is degraded
This philosophy stems from a core insight: your monitoring system must be more reliable than the systems it monitors. A monitoring system with complex distributed consensus requirements will fail in exactly the scenarios where you need it most — during network partitions, infrastructure failures, and cascading outages.
Local Storage by Default
Prometheus stores all time series data in a local Time Series Database (TSDB) on disk. This is simultaneously its greatest strength and primary limitation:
| Strength | Limitation |
|---|---|
| Zero external dependencies | Limited by local disk capacity |
| Fast queries (local SSD) | Data loss if disk fails (mitigate with RAID/replication) |
| Simple operations | No global query view across instances |
| Predictable performance | Typically 15–30 day retention |
For most teams, the local TSDB is sufficient. When you outgrow it, the remote write protocol lets you replicate data to long-term storage systems (Mimir, Thanos, VictoriaMetrics) — covered in Parts 10 and 11 of this track.
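To judge whether the local TSDB is enough, a rough disk estimate helps. The sketch below uses the capacity-planning rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample). The figure of 2 bytes per sample is an assumption at the upper end of the commonly quoted 1–2 byte range, and the workload numbers are hypothetical.

```python
def disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 200,000 active series, 15 s scrape interval, 15-day retention.
estimate = disk_bytes(active_series=200_000, scrape_interval_s=15, retention_days=15)
doubled_retention = disk_bytes(200_000, 15, 30)
print(f"{estimate / 1e9:.1f} GB")
```

The estimate scales linearly with each input, so doubling retention or series count doubles the disk requirement. Treat the result as an order-of-magnitude guide and leave headroom for the write-ahead log and compaction.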
What Prometheus Is Not
Understanding Prometheus’ boundaries prevents misuse and disappointment. Prometheus is not:
- An event logging system — it stores aggregated metrics, not individual events. Use Loki for logs
- A long-term storage system — default retention is 15 days. Use Mimir/Thanos for years of data
- A source of 100% accurate per-request billing data, since values are sampled at each scrape and functions such as rate() extrapolate at window boundaries rather than counting every event
- A distributed database — each server is independent. Use federation or remote storage for global views
- An anomaly detection engine — it provides the data; ML-based detection requires additional tooling
- A dashboarding tool — Grafana is the standard visualization layer
Prometheus vs Other Systems
vs Graphite
| Aspect | Prometheus | Graphite |
|---|---|---|
| Data Model | Multi-dimensional (labels) | Hierarchical (dot-notation) |
| Collection | Pull (scrape) | Push (StatsD/carbon) |
| Query Language | PromQL (functional) | Graphite functions (pipe-based) |
| Storage | Local TSDB (compressed) | Whisper files (fixed-size) |
| Service Discovery | Native (K8s, Consul, DNS, etc.) | None (manual configuration) |
| Alerting | Built-in rules + Alertmanager | Requires external (Grafana alerts) |
| Scalability | Single server; scale via sharding/remote-write | Relay + carbon-cache clustering |
vs InfluxDB
| Aspect | Prometheus | InfluxDB |
|---|---|---|
| License | Apache 2.0 (fully open) | MIT (OSS) / Proprietary (Cloud) |
| Data Model | Metrics only (labels) | Tags + fields (richer but complex) |
| Collection | Pull | Push (line protocol) |
| Query Language | PromQL | Flux / InfluxQL |
| Use Case | Monitoring & alerting | General time series (IoT, analytics) |
| Ecosystem | Massive (CNCF, K8s native) | Smaller, self-contained |
| Clustering | External (Mimir, Thanos) | Enterprise only (proprietary) |
vs Datadog & Commercial APM
| Aspect | Prometheus | Datadog / New Relic / Dynatrace |
|---|---|---|
| Cost | Free (infrastructure costs only) | Per-host/per-metric pricing ($15-$50/host/month) |
| Data Residency | Your infrastructure | Vendor’s cloud (compliance concern) |
| Operational Burden | You manage it | Fully managed SaaS |
| Customization | Complete control | Limited to vendor features |
| Integration Depth | Deep K8s/cloud-native | Broad but shallower per-tool |
| Vendor Lock-in | None (OpenMetrics standard) | High (proprietary agents, query languages) |
Conclusion & What’s Next
Prometheus didn’t emerge in a vacuum. It’s the open-source crystallization of Google’s decade-long internal monitoring experience, adapted for the cloud-native era. Its design philosophy — reliability over accuracy, simplicity over features, pull over push — makes it uniquely suited to monitoring dynamic, containerized infrastructure.
Key takeaways from this foundational part:
- Prometheus descends directly from Google’s Borgmon via ex-Googlers at SoundCloud
- The multi-dimensional label model replaced rigid hierarchical naming
- Pull-based collection makes “target down” detection trivial
- Four metric types (counter, gauge, histogram, summary) have distinct query semantics
- Cardinality (unique label combinations) is the primary scaling constraint
- Prometheus is one component in a larger observability ecosystem (metrics, logs, traces, profiles)
Next in the Series
In Part 2: Deploying Prometheus to Kubernetes, we’ll set up a complete Prometheus deployment using the kube-prometheus-stack Helm chart, configure service discovery for Kubernetes workloads, and build the lab environment we’ll use throughout the rest of this track.