The Time Series Data Model
Anatomy of a Time Series
Every piece of data in Prometheus is a time series — a stream of timestamped values belonging to the same metric and label set. The data model is deceptively simple but immensely powerful:
# A single sample (data point):
# metric_name{label1="value1", label2="value2"} float64_value timestamp_ms
# Concrete example - one sample at one moment in time:
http_requests_total{method="POST", handler="/api/orders", status="201", instance="10.0.1.5:8080", job="order-service"} 48923 1718467200000
# This decomposes to:
# Metric Name: http_requests_total
# Labels: {method="POST", handler="/api/orders", status="201", instance="10.0.1.5:8080", job="order-service"}
# Value: 48923 (float64)
# Timestamp: 1718467200000 (Unix milliseconds = 2024-06-15T16:00:00Z)
The metric name is itself stored as a reserved label named __name__. Internally, http_requests_total{method="GET"} is stored as {__name__="http_requests_total", method="GET"}.
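To make the `__name__` equivalence concrete, here is a minimal Python sketch. The helpers `to_label_set` and `series_key` are hypothetical names, not part of Prometheus; the sketch only illustrates that a series is identified by its complete label set, regardless of the order in which labels are written.

```python
def to_label_set(metric_name, labels):
    """Fold the metric name into the label map as the __name__ label."""
    label_set = dict(labels)
    label_set["__name__"] = metric_name
    return label_set


def series_key(label_set):
    """Build an order-independent identity: label pairs sorted by name."""
    return tuple(sorted(label_set.items()))


a = to_label_set("http_requests_total", {"method": "GET", "job": "api-server"})
b = {"job": "api-server", "__name__": "http_requests_total", "method": "GET"}

# The same label pairs in a different order identify the same series.
print(series_key(a) == series_key(b))
# Changing any label value yields a different series.
print(series_key(a) == series_key({**b, "method": "POST"}))
```

This is also why adding a label with many distinct values multiplies the number of series: every new value produces a new key.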
Naming Conventions
# Prometheus metric naming conventions:
# Format: library_name_unit_suffix
# GOOD - clear, follows conventions:
http_request_duration_seconds_bucket # histogram bucket (seconds)
http_request_duration_seconds_sum # histogram sum
http_request_duration_seconds_count # histogram count
node_cpu_seconds_total # counter (seconds, total suffix)
node_memory_MemAvailable_bytes # gauge (bytes unit)
process_open_fds # gauge (no unit = count)
# BAD - avoid these patterns:
request_latency # no unit! seconds? ms? unclear
requestLatency_ms # camelCase, non-standard unit suffix
http.requests.count # dots not allowed (use underscores)
http_request_total_count # redundant (total IS the suffix for counters)
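Some of these conventions can be checked mechanically. The sketch below is an illustrative linter, not an official tool: `lint_metric_name` and the `NON_BASE_UNITS` list are assumptions made for this example, and the regular expression is the classic (pre-UTF-8) character set for metric names. A missing unit, as in `request_latency`, cannot be detected reliably this way because unitless names such as `process_open_fds` are legitimate.

```python
import re

# Classic character set for Prometheus metric names.
VALID_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
# Illustrative, incomplete list of suffixes that are not base units.
NON_BASE_UNITS = ("_ms", "_millis", "_milliseconds", "_kb", "_mb", "_minutes")


def lint_metric_name(name):
    """Return a list of convention problems; an empty list means no findings."""
    problems = []
    if not VALID_NAME.match(name):
        problems.append("invalid characters")
    if name.endswith(NON_BASE_UNITS):
        problems.append("non-base unit suffix")
    if name.endswith("_total_count"):
        problems.append("redundant suffix")
    return problems


for name in ("node_cpu_seconds_total", "requestLatency_ms",
             "http.requests.count", "http_request_total_count"):
    print(name, lint_metric_name(name))
```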
Sample Structure in Memory
flowchart TD
subgraph Series["Time Series Identity"]
ID["Series ID: 42
Fingerprint: 0xABCDEF12"]
Labels["Labels Map:
__name__ = http_requests_total
method = GET
status = 200
job = api-server"]
end
subgraph Samples["Samples (append-only)"]
S1["t=1718467200000, v=1000.0"]
S2["t=1718467215000, v=1003.0"]
S3["t=1718467230000, v=1007.0"]
S4["t=1718467245000, v=1012.0"]
S5["... (every 15s scrape)"]
end
subgraph Chunk["Chunk (120 samples max)"]
ENC["Delta-of-delta timestamps
XOR (Gorilla) values
~1.5 bytes/sample average"]
end
ID --> Samples
Labels --> ID
Samples --> Chunk
TSDB Architecture Overview
The Prometheus TSDB (introduced in Prometheus 2.0, 2017) was designed by Fabian Reinartz to solve three problems simultaneously: fast ingestion of millions of samples per second, efficient compression for storage, and millisecond query responses over large time ranges.
flowchart LR
subgraph Ingest["Ingestion"]
SC["Scrape
Samples"]
end
subgraph WAL["Write-Ahead Log"]
W1["WAL Segment 1
(128 MB)"]
W2["WAL Segment 2
(128 MB)"]
W3["WAL Segment N..."]
end
subgraph Head["Head Block (in-memory)"]
HC["Active Chunks
(last 2h of data)"]
MM["Memory-Mapped
Chunks (older)"]
end
subgraph Persist["Persistent Blocks (on-disk)"]
B1["Block 01DB...
2h range
mint → maxt"]
B2["Block 01DC...
2h range"]
B3["Block 01DD...
(compacted)
4h+ range"]
end
SC -->|"1. Write"| WAL
WAL -->|"2. Replay on
restart"| Head
SC -->|"3. Append"| HC
HC -->|"4. Cut every 2h"| Persist
MM -->|"5. Already
persisted"| Persist
B1 & B2 -->|"6. Compaction"| B3
Write-Ahead Log (WAL)
Every incoming sample is first written to the WAL before being added to the in-memory head block. This ensures durability — if Prometheus crashes, it can replay the WAL to recover recent data:
# WAL directory structure
data/
├── wal/
│ ├── 00000001 # WAL segment (up to 128 MB each)
│ ├── 00000002
│ ├── 00000003 # Currently active segment
│ └── checkpoint.00000001/ # Compressed checkpoint
│ └── 00000000
# WAL contains three record types:
# 1. Series records - new time series (labels → seriesID mapping)
# 2. Sample records - timestamp + value for existing series
# 3. Tombstone records - deletion markers
# Analyze churn and label cardinality of the most recent block (this reports on blocks, not on the WAL)
promtool tsdb analyze /path/to/data/
# WAL replay time on restart (depends on WAL size):
# 128 MB WAL segment ≈ 2-5 seconds replay
# Multiple segments = linear replay time
promtool tsdb clean-tombstones, or (3) In extreme cases, delete the entire WAL directory and accept data loss since the last persisted block.
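The write-then-apply ordering behind the WAL can be illustrated with a toy log. This is a sketch under simplifying assumptions, not the Prometheus record format: `ToyWAL`, `ingest`, and `recover` are hypothetical names, records are JSON lines rather than binary segments, and there is no segmentation or checkpointing.

```python
import json
import os
import tempfile


class ToyWAL:
    """Append-only log: every record reaches disk before memory is updated."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # Make the record durable before returning.

    def replay(self):
        """Yield every logged record in the order it was written."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


def ingest(wal, head, series_id, ts, value):
    wal.append({"series": series_id, "t": ts, "v": value})  # 1. Log first.
    head.setdefault(series_id, []).append((ts, value))      # 2. Then memory.


def recover(wal):
    """Rebuild the in-memory head purely from the log."""
    head = {}
    for rec in wal.replay():
        head.setdefault(rec["series"], []).append((rec["t"], rec["v"]))
    return head


wal = ToyWAL(os.path.join(tempfile.mkdtemp(), "00000001"))
head = {}
ingest(wal, head, 42, 1718467200000, 1000.0)
ingest(wal, head, 42, 1718467215000, 1003.0)

head = None  # Simulate a crash: the in-memory head is gone.
recovered = recover(wal)
print(recovered)
```

Because replay reads every record back, recovery time grows with the amount of WAL on disk, which matches the linear replay behavior noted above.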
The Head Block
The head block holds all recently ingested data in memory. It contains roughly the most recent 2 to 3 hours of samples for every active time series. Once the head spans 1.5 times --storage.tsdb.min-block-duration (default: 2h, so 3h of data), the oldest 2 hours get “cut” into a persistent block on disk.
# Head block metrics (query in Prometheus)
prometheus_tsdb_head_series # Active time series in head
prometheus_tsdb_head_samples_appended_total # Samples ingested
prometheus_tsdb_head_chunks # Number of chunks in memory
prometheus_tsdb_head_chunks_created_total # Chunks created
prometheus_tsdb_head_gc_duration_seconds # GC duration
# Memory usage estimate for head block:
# Each active series ≈ 3-4 KB in memory
# 100,000 series × 4 KB = ~400 MB base memory for head
# Plus WAL buffers, query buffers, etc.
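The estimate above is simple enough to turn into a helper for capacity planning. This is a rough sketch: `estimate_head_memory_bytes` is a hypothetical function, and the 4 KB per-series figure is the rule of thumb quoted above, not a measured constant.

```python
def estimate_head_memory_bytes(active_series, bytes_per_series=4 * 1024):
    """Base head-block memory only; excludes WAL and query buffers."""
    return active_series * bytes_per_series


for n in (100_000, 1_000_000, 5_000_000):
    mib = estimate_head_memory_bytes(n) / (1024 * 1024)
    print(f"{n:>9,} series -> ~{mib:,.0f} MiB")
```

Comparing the result against `prometheus_tsdb_head_series` and `process_resident_memory_bytes` on a running server is a reasonable way to calibrate the per-series figure for your own workload.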
Persistent Blocks
Once the head block is cut, the data becomes a persistent block — an immutable, self-contained directory on disk. Each block covers a specific time range and contains its own index, chunks, and metadata:
# Block directory structure
data/
├── 01HQKR5M7S8TQR4JNWVPZKM3QN/ # Block ULID (time-sortable)
│ ├── meta.json # Block metadata (time range, stats)
│ ├── index # Series index (label → series mapping)
│ ├── chunks/
│ │ └── 000001 # Chunk data (compressed samples)
│ └── tombstones # Deletion markers
├── 01HQMGT8N2RDP5XYZAB7CDEF01/
│ ├── meta.json
│ ├── index
│ ├── chunks/
│ │ └── 000001
│ └── tombstones
└── wal/
// meta.json - Block metadata example
{
"ulid": "01HQKR5M7S8TQR4JNWVPZKM3QN",
"minTime": 1718380800000,
"maxTime": 1718388000000,
"stats": {
"numSamples": 28543921,
"numSeries": 142350,
"numChunks": 284700
},
"compaction": {
"level": 1,
"sources": ["01HQKR5M7S8TQR4JNWVPZKM3QN"]
},
"version": 1
}
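A few lines of Python are enough to turn these fields into something readable. The sketch below assumes only the fields shown in the example above; `describe_block` is a hypothetical helper, and in practice the dict would come from reading a block's `meta.json` from disk.

```python
import json
from datetime import datetime, timezone


def describe_block(meta):
    """Summarize a parsed meta.json dict: time range and samples per series."""
    start = datetime.fromtimestamp(meta["minTime"] / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(meta["maxTime"] / 1000, tz=timezone.utc)
    stats = meta["stats"]
    return {
        "start": start.isoformat(),
        "end": end.isoformat(),
        "hours": (meta["maxTime"] - meta["minTime"]) / 3_600_000,
        "samples_per_series": stats["numSamples"] / stats["numSeries"],
    }


meta = json.loads("""
{"ulid": "01HQKR5M7S8TQR4JNWVPZKM3QN",
 "minTime": 1718380800000, "maxTime": 1718388000000,
 "stats": {"numSamples": 28543921, "numSeries": 142350, "numChunks": 284700}}
""")
summary = describe_block(meta)
print(summary)
```

For the example metadata this reports a two-hour block, consistent with a level 1 block cut directly from the head.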
Block Structure on Disk
Directory Layout
# Examine your Prometheus data directory
ls -la /prometheus/data/
# See block time ranges
for d in /prometheus/data/01*/; do
  # jq converts the millisecond timestamps to ISO 8601 itself, so no bc/date/xargs is needed
  echo "$d: $(jq -r '[.minTime, .maxTime] | map(. / 1000 | floor | todate) | join(" -> ")' "$d/meta.json")"
done
# Check total storage usage
du -sh /prometheus/data/
du -sh /prometheus/data/wal/
du -sh /prometheus/data/01*/chunks/
Chunks File Format
Each chunk stores up to 120 samples for a single time series. Chunks are the fundamental unit of I/O — when Prometheus reads data for a query, it reads entire chunks, not individual samples:
Chunk Binary Layout
| Field | Size | Description |
|---|---|---|
| Encoding byte | 1 byte | 0x01 = XOR encoding (standard) |
| Num samples | 2 bytes (big-endian uint16) | Number of samples in chunk (typically up to 120) |
| First timestamp | Variable (varint) | Absolute timestamp (int64 ms) |
| First value | 8 bytes | Absolute float64 value |
| Delta-of-delta timestamps | Variable | Bit-packed with variable-width prefix codes |
| XOR values | Variable | Gorilla-style XOR encoding |
Average chunk size: ~200–300 bytes for 120 samples (vs 1,920 bytes uncompressed), a compression ratio of roughly 6:1 to 10:1
Index File Structure
The index file is the key to fast PromQL queries. It maps label pairs to series IDs and series IDs to chunk locations — enabling Prometheus to resolve complex label matchers without scanning all data:
flowchart TD
subgraph Index["Index File Sections"]
SYM["Symbol Table
All unique strings
(label names + values)"]
SER["Series Section
Series ID → labels + chunk refs"]
LI["Label Index
label_name → [all values]"]
PL["Posting Lists
label_pair → [series IDs]"]
PLO["Posting List Offsets
Quick lookup table"]
TOC["Table of Contents
Section offsets"]
end
subgraph Query["Query: {job='api', status='500'}"]
Q1["1. Find posting list
for job='api'
→ [1, 3, 7, 12, 45...]"]
Q2["2. Find posting list
for status='500'
→ [3, 12, 28, 45...]"]
Q3["3. Intersect lists
→ [3, 12, 45...]"]
Q4["4. Look up chunk refs
for each series ID"]
end
PL --> Q1
PL --> Q2
Q1 & Q2 --> Q3
Q3 --> SER
SER --> Q4
Chunk Encoding & Compression
Delta-of-Delta Encoding for Timestamps
Prometheus uses delta-of-delta encoding for timestamps, exploiting the fact that scrape intervals are highly regular. If scrapes happen every 15 seconds, the delta between timestamps is always ~15000ms, and the delta-of-delta is near zero:
# Timestamp encoding example:
# Raw timestamps (shown in seconds for readability): 1000, 1015, 1030, 1045, 1060, 1075
# Deltas: 15, 15, 15, 15, 15
# Delta-of-deltas: 0, 0, 0, 0
# When delta-of-delta = 0: encode as single bit (0)
# Regular scrapes → nearly free timestamp storage!
# Irregular timestamps (jitter/missed scrapes):
# Raw: 1000, 1015, 1032, 1045, 1061, 1075
# Deltas: 15, 17, 13, 16, 14
# DoD: 2, -4, 3, -2
# These require more bits but are still compact (a short prefix code plus a fixed-width bit field)
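The arithmetic in the listing above can be reproduced in a few lines. This sketch computes only the deltas and delta-of-deltas; it does not perform the bit-level packing, and `delta_of_delta` is a hypothetical helper name.

```python
def delta_of_delta(timestamps):
    """Return (deltas, delta-of-deltas) for a list of timestamps."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return deltas, dods


regular = [1000, 1015, 1030, 1045, 1060, 1075]
jittery = [1000, 1015, 1032, 1045, 1061, 1075]

print(delta_of_delta(regular))  # ([15, 15, 15, 15, 15], [0, 0, 0, 0])
print(delta_of_delta(jittery))  # ([15, 17, 13, 16, 14], [2, -4, 3, -2])
```

The regular series reduces to a run of zeros, which is why evenly spaced scrapes cost about one bit per timestamp.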
Gorilla Encoding for Values
For float64 values, Prometheus uses Gorilla compression (from Facebook’s 2015 paper). It XORs consecutive values and encodes only the meaningful bits that changed:
# Gorilla XOR encoding for float64 values:
#
# Value 1: 72.0 → IEEE 754: 0 10000000101 0010000000000000000000000000000000000000000000000000
# Value 2: 73.0 → IEEE 754: 0 10000000101 0010010000000000000000000000000000000000000000000000
#
# XOR result: 0 00000000000 0000010000000000000000000000000000000000000000000000
# Only one mantissa bit differs (17 leading zeros, 46 trailing zeros)
#
# Encoding: store leading zeros count + meaningful bits count + meaningful bits
# Result: just a few bits instead of 64 bits per value!
#
# Best case (value unchanged): 1 bit (just a zero flag)
# Typical case: 10-20 bits per value
# Worst case (totally different value): 77 bits (2 control bits + 5 + 6 header bits + 64 value bits)
# Compression efficiency depends on value patterns:
# - Counters (incrementing): excellent (small XOR differences)
# - Gauges (stable): excellent (many unchanged values)
# - Gauges (noisy): good (similar magnitude, different bits)
# - Random values: poor (full 64-bit differences)
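To see why similar values compress well, the XOR step can be tried directly on float bit patterns. The sketch below reports only how many bits are "meaningful" after the XOR; it does not emit the actual bitstream, and `float_bits` and `xor_summary` are hypothetical helper names.

```python
import struct


def float_bits(x):
    """Return the 64-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_summary(prev, curr):
    """XOR two floats; return (leading zeros, trailing zeros, meaningful bits)."""
    x = float_bits(prev) ^ float_bits(curr)
    if x == 0:
        return (64, 0, 0)  # Identical values need only a single control bit.
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1  # Index of the lowest set bit.
    return (leading, trailing, 64 - leading - trailing)


print(xor_summary(72.0, 72.0))  # (64, 0, 0)
print(xor_summary(72.0, 73.0))  # (17, 46, 1)
print(xor_summary(72.0, 0.1))   # Unrelated values leave many meaningful bits.
```

Only the meaningful bits, plus a small header describing where they sit, need to be stored for each value.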
Compression Ratios
TSDB Compression Efficiency
| Metric Type | Raw Size/Sample | Compressed Size/Sample | Ratio | Notes |
|---|---|---|---|---|
| Counter (monotonic) | 16 bytes | 1.2–1.5 bytes | ~12:1 | Regular increments compress extremely well |
| Gauge (stable) | 16 bytes | 1.0–1.3 bytes | ~14:1 | Unchanged values = 1 bit each |
| Gauge (noisy) | 16 bytes | 2.0–3.0 bytes | ~6:1 | CPU/memory with fluctuation |
| Histogram (buckets) | 16 bytes | 1.5–2.0 bytes | ~9:1 | Multiple series per histogram |
| Average across all types | 16 bytes | 1.5–2.0 bytes | ~8–10:1 | |
Compaction Process
When Compaction Runs
Compaction merges smaller blocks into larger ones, reducing the number of blocks Prometheus must search during queries and enabling more efficient compression across longer time ranges:
flowchart LR
subgraph L1["Level 1 (2h blocks)"]
B1["Block A
00:00-02:00"]
B2["Block B
02:00-04:00"]
B3["Block C
04:00-06:00"]
B4["Block D
06:00-08:00"]
end
subgraph L2["Level 2 (merged)"]
M1["Block AB
00:00-04:00"]
M2["Block CD
04:00-08:00"]
end
subgraph L3["Level 3 (merged)"]
F1["Block ABCD
00:00-08:00"]
end
B1 & B2 -->|"compact"| M1
B3 & B4 -->|"compact"| M2
M1 & M2 -->|"compact"| F1
# Compaction metrics to monitor
prometheus_tsdb_compactions_total # Total compactions run
prometheus_tsdb_compaction_duration_seconds # Duration per compaction
prometheus_tsdb_blocks_loaded # Currently loaded blocks
prometheus_tsdb_size_retentions_total # Times blocks were deleted by size-based retention
# TSDB flags controlling compaction:
# --storage.tsdb.min-block-duration=2h # Minimum block time range
# --storage.tsdb.max-block-duration=36h # Maximum block time range (defaults to 10% of retention: 36h for 15d)
# Max block duration is automatically: min(31d, retention/10)
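The default from the last comment can be worked through for a few retention settings. This is a sketch of the stated formula only; `default_max_block_duration` is a hypothetical helper, not a Prometheus API.

```python
from datetime import timedelta


def default_max_block_duration(retention):
    """min(31d, retention/10), as described in the comment above."""
    return min(timedelta(days=31), retention / 10)


print(default_max_block_duration(timedelta(days=15)))   # 1 day, 12:00:00
print(default_max_block_duration(timedelta(days=90)))   # 9 days, 0:00:00
print(default_max_block_duration(timedelta(days=365)))  # 31 days, 0:00:00
```

With the default 15-day retention this gives the 36h value shown in the flag example.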
Tombstones & Deletion
Prometheus supports deleting data via the Admin API. Rather than rewriting blocks, it creates tombstone files that mark time ranges as deleted. The actual data is removed during the next compaction:
# Delete series data via Admin API (must enable --web.enable-admin-api)
# Delete all data for a specific metric in a time range
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series' \
-d 'match[]={__name__="expensive_metric_to_remove"}' \
-d 'start=2026-06-01T00:00:00Z' \
-d 'end=2026-06-10T00:00:00Z'
# Force compaction to reclaim disk space immediately
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
# WARNING: Deletion is permanent after clean_tombstones!
# The tombstones file can be manually deleted to "undo" before compaction
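Conceptually, a tombstone is just a deleted time interval attached to a series, and reads skip anything inside it. The sketch below illustrates that idea only; `apply_tombstones` is a hypothetical function, and treating both interval ends as inclusive is an assumption of this example.

```python
def apply_tombstones(samples, tombstones):
    """Hide samples whose timestamp falls inside any deleted [start, end] interval."""
    return [
        (t, v) for t, v in samples
        if not any(start <= t <= end for start, end in tombstones)
    ]


samples = [(1000, 1.0), (2000, 2.0), (3000, 3.0), (4000, 4.0)]
tombstones = [(1500, 3000)]  # Interval marked as deleted.

visible = apply_tombstones(samples, tombstones)
print(visible)       # Samples a query would see.
print(len(samples))  # The underlying data is still there until it is rewritten.
```

This is why a delete call returns quickly but frees no disk space by itself: the stored samples are untouched until the data is rewritten without them.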
Retention Enforcement
# Retention is enforced in two ways (whichever triggers first):
# Time-based retention (default: 15 days)
--storage.tsdb.retention.time=15d
# Size-based retention (delete oldest blocks when exceeded)
--storage.tsdb.retention.size=50GB
# Both can be combined:
# "Keep 30 days OR 100GB, whichever limit is hit first"
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=100GB
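The "whichever limit is hit first" rule can be sketched as a single pass over blocks from newest to oldest. This is a simplified illustration, not the Prometheus implementation: `blocks_to_delete` and the block dicts are hypothetical, age is measured against the newest block, and the WAL's contribution to total size is ignored.

```python
def blocks_to_delete(blocks, retention_ms=None, max_bytes=None):
    """Return ULIDs of blocks that violate the time or the size limit."""
    newest_first = sorted(blocks, key=lambda b: b["maxTime"], reverse=True)
    if not newest_first:
        return []
    newest_time = newest_first[0]["maxTime"]
    doomed = []
    total_bytes = 0
    for block in newest_first:
        total_bytes += block["size"]
        too_old = (retention_ms is not None
                   and newest_time - block["maxTime"] >= retention_ms)
        # total_bytes only grows, so every older block is over the limit too.
        too_big = max_bytes is not None and total_bytes > max_bytes
        if too_old or too_big:
            doomed.append(block["ulid"])
    return doomed


blocks = [
    {"ulid": "A", "maxTime": 30, "size": 40},  # Oldest.
    {"ulid": "B", "maxTime": 60, "size": 40},
    {"ulid": "C", "maxTime": 90, "size": 40},  # Newest.
]
print(blocks_to_delete(blocks, retention_ms=50))                 # Time limit only.
print(blocks_to_delete(blocks, max_bytes=100))                   # Size limit only.
print(blocks_to_delete(blocks, retention_ms=20, max_bytes=100))  # Both limits.
```

Retention works at block granularity, so data can outlive the configured time until the whole block containing it has aged out.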
Index Structure Deep Dive
Posting Lists
The posting list is an inverted index mapping each label pair to the set of series IDs containing that pair. This is the same inverted-index technique search engines use to map terms to document IDs:
# Conceptual posting list structure:
#
# Label Pair → Series IDs (sorted)
# ─────────────────────────────────────────────
# job="api-server" → [1, 2, 3, 4, 5, 6, 7, 8]
# job="worker" → [9, 10, 11, 12, 13, 14]
# method="GET" → [1, 3, 5, 7, 9, 11, 13]
# method="POST" → [2, 4, 6, 8, 10, 12, 14]
# status="200" → [1, 2, 3, 4, 9, 10, 11, 12]
# status="500" → [5, 6, 7, 8, 13, 14]
#
# Query: {job="api-server", status="500"}
# 1. Posting list for job="api-server" → [1, 2, 3, 4, 5, 6, 7, 8]
# 2. Posting list for status="500" → [5, 6, 7, 8, 13, 14]
# 3. Intersect (sorted merge) → [5, 6, 7, 8]
# 4. Result: 4 matching series (very fast!)
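The sorted-merge intersection from the walkthrough above can be written as a classic two-pointer loop. This is a minimal sketch using the same example data; `intersect_sorted` and `select` are hypothetical names, and real posting lists are read from the index file rather than held in a dict.

```python
def intersect_sorted(a, b):
    """Intersect two sorted ID lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


postings = {
    ("job", "api-server"): [1, 2, 3, 4, 5, 6, 7, 8],
    ("job", "worker"): [9, 10, 11, 12, 13, 14],
    ("method", "GET"): [1, 3, 5, 7, 9, 11, 13],
    ("status", "500"): [5, 6, 7, 8, 13, 14],
}


def select(matchers):
    """AND together the posting lists of every (label, value) matcher."""
    lists = [postings.get(pair, []) for pair in matchers]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result


print(select([("job", "api-server"), ("status", "500")]))
print(select([("job", "api-server"), ("status", "500"), ("method", "GET")]))
```

Keeping the lists sorted is what makes the merge linear; no hashing or random access into the full series set is required.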
Label Value Index
The label index provides fast lookup of all values for a given label name — used by the /api/v1/label/<name>/values endpoint and autocomplete in Grafana:
# Label index structure:
# label_name → [value1, value2, value3, ...]
#
# __name__ → ["http_requests_total", "node_cpu_seconds_total", "up", ...]
# job → ["api-server", "worker", "node-exporter", "prometheus", ...]
# instance → ["10.0.1.5:8080", "10.0.1.6:8080", "10.0.2.3:9100", ...]
# method → ["GET", "POST", "PUT", "DELETE", "PATCH"]
# status → ["200", "201", "204", "400", "401", "403", "404", "500", "503"]
# Query all label values for 'job':
curl -s 'http://localhost:9090/api/v1/label/job/values' | jq '.data'
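The label index itself is easy to picture as a map from label name to a sorted set of values. The sketch below builds one from a handful of in-memory label sets; `build_label_index` is a hypothetical helper and says nothing about the on-disk encoding.

```python
from collections import defaultdict


def build_label_index(series):
    """Map each label name to the sorted list of values seen across all series."""
    index = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            index[name].add(value)
    return {name: sorted(values) for name, values in index.items()}


series = [
    {"__name__": "http_requests_total", "job": "api-server", "method": "GET"},
    {"__name__": "http_requests_total", "job": "api-server", "method": "POST"},
    {"__name__": "up", "job": "worker"},
]
label_index = build_label_index(series)
print(label_index["job"])     # Values an autocomplete for 'job' would offer.
print(label_index["method"])
```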
Query Resolution Path
- Parse PromQL expression into AST
- Identify label matchers from selectors
- For each block overlapping the query time range:
  - a. Look up posting lists for each matcher
  - b. Intersect posting lists (AND logic) or union (OR logic)
  - c. For each matching series, find chunk references
  - d. Read and decompress relevant chunks
- Merge results across blocks (head + persistent)
- Apply PromQL functions (rate, sum, etc.)
- Return result
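The block-level part of this path (steps a to d plus the merge) can be sketched over toy in-memory blocks. Everything here is an assumption made for illustration: `query` and the block dict layout are hypothetical, samples are stored uncompressed, and PromQL parsing and function evaluation are left out.

```python
def query(blocks, matchers, start, end):
    """Resolve equality matchers against each overlapping block and merge samples."""
    merged = {}
    for block in blocks:
        if block["maxTime"] < start or block["minTime"] > end:
            continue  # Block does not overlap the query range.
        # Steps a and b: look up and intersect posting lists (AND of all matchers).
        ids = None
        for pair in matchers:
            posting = set(block["postings"].get(pair, ()))
            ids = posting if ids is None else ids & posting
        # Steps c and d: follow each series to its samples, keep those in range.
        for sid in sorted(ids or ()):
            entry = block["series"][sid]
            in_range = [(t, v) for t, v in entry["samples"] if start <= t <= end]
            # Series IDs are local to a block, so merge by label set instead.
            merged.setdefault(entry["labels"], []).extend(in_range)
    return {labels: sorted(samples) for labels, samples in merged.items()}


err = (("job", "api"), ("status", "500"))
ok = (("job", "api"), ("status", "200"))
block1 = {
    "minTime": 0, "maxTime": 99,
    "postings": {("job", "api"): [1, 2], ("status", "500"): [2]},
    "series": {1: {"labels": ok, "samples": [(10, 1.0)]},
               2: {"labels": err, "samples": [(10, 5.0), (50, 6.0)]}},
}
block2 = {
    "minTime": 100, "maxTime": 199,
    "postings": {("job", "api"): [7], ("status", "500"): [7]},
    "series": {7: {"labels": err, "samples": [(110, 7.0)]}},
}

result = query([block1, block2], [("job", "api"), ("status", "500")], 40, 150)
print(result)
```

Note that the same series has a different ID in each block, which is why the merge step keys on the label set.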
Memory-Mapped Chunks
Since Prometheus 2.19, completed chunks in the head block are flushed to disk and memory-mapped (mmap) instead of being held on the heap; chunks in persisted blocks were already memory-mapped. In both cases the operating system’s page cache decides which chunks are in memory and which stay on disk:
# Memory-mapped chunks reduce Prometheus memory usage dramatically
# Before 2.19: every completed head chunk stayed on the process heap
# After: OS page cache manages hot/cold data automatically
# Monitor mmap behavior:
prometheus_tsdb_head_chunks_storage_size_bytes # Size of the on-disk head chunks directory
process_resident_memory_bytes # Total RSS
process_virtual_memory_bytes # Includes mmap regions
# The difference between virtual and resident memory
# shows how much block data is mapped but not in RAM:
# virtual - resident ≈ cold mmap data on disk
# Practical impact: A Prometheus with 500GB on disk might show
# 500GB virtual memory but only 8GB resident (actual RAM used)
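The gap between mapped and resident memory is easier to understand with a small mmap experiment. This sketch writes fixed-width records to a temporary file and reads one of them through a mapping; the file name and the 16-byte record layout are assumptions for illustration, since real chunks are compressed.

```python
import mmap
import os
import struct
import tempfile

# Write 1,000 fixed-width (timestamp, value) records to a stand-in chunk file.
path = os.path.join(tempfile.mkdtemp(), "000001")
record = struct.Struct(">qd")  # int64 timestamp + float64 value = 16 bytes
with open(path, "wb") as f:
    for i in range(1000):
        f.write(record.pack(1718467200000 + i * 15000, float(i)))

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The whole file is addressable, but only the pages backing this
        # read need to be brought into RAM by the operating system.
        t, v = record.unpack_from(mm, 500 * record.size)
        mapped_bytes = len(mm)

print(t, v, mapped_bytes)
```

The mapping counts toward virtual memory in full, while resident memory grows only as pages are actually touched.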
TSDB Configuration
# Key TSDB-related Prometheus flags:
# (set via command-line args or Helm values)
# Storage path
--storage.tsdb.path=/prometheus/data
# Retention
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=0 # 0 = disabled (time-based only)
# Block durations
--storage.tsdb.min-block-duration=2h # Don't change unless you know why
--storage.tsdb.max-block-duration=36h # Defaults to min(31d, retention/10)
# WAL configuration
--storage.tsdb.wal-segment-size=128MB # Default, rarely needs changing
--storage.tsdb.wal-compression # WAL compression (saves ~50% WAL disk; enabled by default since v2.20)
# Head chunks
--storage.tsdb.head-chunks-write-queue-size=0 # Async chunk writes (0=sync)
# Out-of-order ingestion (Prometheus 2.39+) is set in prometheus.yml rather than by flag:
#   storage:
#     tsdb:
#       out_of_order_time_window: 30m   # Accept samples up to 30m late
# No-lockfile (for shared/readonly mounts)
--storage.tsdb.no-lockfile=false
To accept late-arriving samples, set out_of_order_time_window to the maximum expected delay.
Conclusion & What’s Next
The Prometheus TSDB is a masterpiece of systems engineering — achieving 8–10x compression ratios while maintaining sub-second query performance across millions of time series. Key takeaways:
- A time series = metric name + labels + ordered (timestamp, value) pairs
- The WAL provides crash recovery; the head block holds recent data in memory
- Chunks use Gorilla/XOR encoding for ~1.5 bytes per sample (vs 16 bytes raw)
- Posting lists enable millisecond label-based lookups across millions of series
- Compaction merges blocks over time for more efficient storage and queries
- Memory-mapped chunks let the OS page cache manage hot/cold data transparently
Next in the Series
In Part 4: Mastering PromQL, we’ll harness the full power of Prometheus’ query language — from instant vectors and range vectors through aggregation operators, binary operations, and the recording rules that tame complex queries for dashboard performance.