Prometheus Deep Dive Part 3: The Prometheus Data Model & TSDB

The Time Series Data Model

Anatomy of a Time Series

Every piece of data in Prometheus is a time series — a stream of timestamped values belonging to the same metric and label set. The data model is deceptively simple but immensely powerful:

# A single sample (data point):
# metric_name{label1="value1", label2="value2"} float64_value timestamp_ms

# Concrete example - one sample at one moment in time:
http_requests_total{method="POST", handler="/api/orders", status="201", instance="10.0.1.5:8080", job="order-service"} 48923 1718467200000

# This decomposes to:
# Metric Name:  http_requests_total
# Labels:       {method="POST", handler="/api/orders", status="201", instance="10.0.1.5:8080", job="order-service"}
# Value:        48923 (float64)
# Timestamp:    1718467200000 (Unix milliseconds = 2024-06-15T16:00:00Z)

                            
                            Identity Rule: A time series is uniquely identified by its metric name plus the complete set of label key-value pairs. The metric name is technically just another label: __name__. Internally, http_requests_total{method="GET"} is stored as {__name__="http_requests_total", method="GET"}.
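The identity rule can be illustrated with a minimal Python sketch. The helper name `series_key` is hypothetical and this is not how Prometheus hashes series internally; it only shows that the metric name behaves like one more label and that label order is irrelevant.

```python
def series_key(metric_name, labels):
    """Return a hashable identity for a series: all labels plus __name__, sorted."""
    full = dict(labels)
    full["__name__"] = metric_name  # the metric name is stored as an ordinary label
    return tuple(sorted(full.items()))

a = series_key("http_requests_total", {"method": "GET", "job": "api"})
b = series_key("http_requests_total", {"job": "api", "method": "GET"})
c = series_key("http_requests_total", {"method": "POST", "job": "api"})

print(a == b)  # True: label order does not matter
print(a == c)  # False: a different label value is a different series
```

Changing any single label value yields a new key, which is why high-cardinality labels multiply the number of series.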
                        

Naming Conventions

# Prometheus metric naming conventions:
# Format: library_name_unit_suffix

# GOOD - clear, follows conventions:
http_request_duration_seconds_bucket    # histogram bucket (seconds)
http_request_duration_seconds_sum       # histogram sum
http_request_duration_seconds_count     # histogram count
node_cpu_seconds_total                  # counter (seconds, total suffix)
node_memory_MemAvailable_bytes          # gauge (bytes unit)
process_open_fds                        # gauge (no unit = count)

# BAD - avoid these patterns:
request_latency                  # no unit! seconds? ms? unclear
requestLatency_ms                # camelCase, non-standard unit suffix
http.requests.count              # dots not allowed (use underscores)
http_request_total_count         # redundant (total IS the suffix for counters)
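Some of these conventions can be checked mechanically. The sketch below is a hypothetical linter, not an official tool: `NAME_RE` is the classic metric-name character set from the Prometheus data model documentation, while the `NON_BASE_UNITS` list is an illustrative assumption and far from exhaustive.

```python
import re

# Classic metric-name character set (letters, digits, underscores, colons).
NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
# Assumed, non-exhaustive list of suffixes that are not base units.
NON_BASE_UNITS = ("_ms", "_millis", "_milliseconds", "_kb", "_mb", "_minutes")

def lint_metric_name(name):
    """Return a list of findings; an empty list means nothing was flagged."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("invalid characters")
    if name.lower().endswith(NON_BASE_UNITS):
        problems.append("non-base unit suffix")
    if name.endswith("_total_count"):
        problems.append("redundant suffix")
    return problems

for n in ["node_cpu_seconds_total", "requestLatency_ms",
          "http.requests.count", "http_request_total_count"]:
    print(n, lint_metric_name(n))
```

A check like this cannot detect a missing unit (as in request_latency), so it complements code review rather than replacing it.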

Sample Structure in Memory

Time Series Data Structure

flowchart TD
    subgraph Series["Time Series Identity"]
        ID["Series ID: 42
Fingerprint: 0xABCDEF12"]
        Labels["Labels Map:
__name__ = http_requests_total
method = GET
status = 200
job = api-server"]
    end

    subgraph Samples["Samples (append-only)"]
        S1["t=1718467200000, v=1000.0"]
        S2["t=1718467215000, v=1003.0"]
        S3["t=1718467230000, v=1007.0"]
        S4["t=1718467245000, v=1012.0"]
        S5["... (every 15s scrape)"]
    end

    subgraph Chunk["Chunk (120 samples max)"]
        ENC["Delta-of-delta timestamps
XOR-encoded values (Gorilla)
~1.5 bytes/sample average"]
    end

    ID --> Samples
    Labels --> ID
    Samples --> Chunk

TSDB Architecture Overview

The Prometheus TSDB (introduced in Prometheus 2.0, 2017) was designed by Fabian Reinartz to solve three problems simultaneously: fast ingestion of millions of samples per second, efficient compression for storage, and millisecond query responses over large time ranges.

TSDB Write & Storage Pipeline

flowchart LR
    subgraph Ingest["Ingestion"]
        SC["Scrape
Samples"]
    end

    subgraph WAL["Write-Ahead Log"]
        W1["WAL Segment 1
(128 MB)"]
        W2["WAL Segment 2
(128 MB)"]
        W3["WAL Segment N..."]
    end

    subgraph Head["Head Block (in-memory)"]
        HC["Active Chunks
(last 2h of data)"]
        MM["Memory-Mapped
Chunks (older)"]
    end

    subgraph Persist["Persistent Blocks (on-disk)"]
        B1["Block 01DB...
2h range
mint → maxt"]
        B2["Block 01DC...
2h range"]
        B3["Block 01DD...
(compacted)
4h+ range"]
    end

    SC -->|"1. Write"| WAL
    WAL -->|"2. Replay on
restart"| Head
    SC -->|"3. Append"| HC
    HC -->|"4. Cut every 2h"| Persist
    MM -->|"5. Already
persisted"| Persist
    B1 & B2 -->|"6. Compaction"| B3

Write-Ahead Log (WAL)

Every incoming sample is first written to the WAL before being added to the in-memory head block. This ensures durability — if Prometheus crashes, it can replay the WAL to recover recent data:

# WAL directory structure
data/
├── wal/
│   ├── 00000001    # WAL segment (up to 128 MB each)
│   ├── 00000002
│   ├── 00000003    # Currently active segment
│   └── checkpoint.00000001/  # Compressed checkpoint
│       └── 00000000

# The WAL's three core record types (newer releases add others, such as exemplars):
# 1. Series records - new time series (labels → seriesID mapping)
# 2. Sample records - timestamp + value for existing series
# 3. Tombstone records - deletion markers

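# Analyze series churn and label cardinality in the latest block
promtool tsdb analyze /path/to/data/

# WAL health is exposed through metrics, for example:
prometheus_tsdb_wal_corruptions_total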

# WAL replay time on restart (depends on WAL size):
# 128 MB WAL segment ≈ 2-5 seconds replay
# Multiple segments = linear replay time
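The write-then-apply ordering and the replay step can be sketched with a toy log. This is only an illustration of the idea: the class `ToyWAL` and its JSON-lines format are assumptions made for the example, whereas the real WAL is a segmented binary format with checkpoints.

```python
import json
import os
import tempfile

class ToyWAL:
    """Append-only log: each record is flushed to disk before it is applied in memory."""

    def __init__(self, path):
        self.path = path
        self.head = {}        # series_id -> list of (timestamp_ms, value)
        self._replay()        # rebuild in-memory state from the log
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self.head.setdefault(rec["series"], []).append((rec["t"], rec["v"]))

    def append(self, series_id, t, v):
        self._log.write(json.dumps({"series": series_id, "t": t, "v": v}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())   # durable first, visible second
        self.head.setdefault(series_id, []).append((t, v))

    def close(self):
        self._log.close()

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = ToyWAL(path)
wal.append(42, 1718467200000, 1000.0)
wal.append(42, 1718467215000, 1003.0)
wal.close()                  # simulate a restart

recovered = ToyWAL(path)     # state comes back purely from the log
print(recovered.head[42])
recovered.close()
```

Because replay reads every record in order, startup time grows with the amount of log on disk, which matches the linear replay behavior noted above.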

                            
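WAL Corruption: WAL segments can become corrupted through disk errors or a kill mid-write (for example, by the OOM killer). On startup Prometheus tries to repair the WAL by discarding data from the first corrupted record onward; if the repair fails, it may refuse to start. Manual recovery options: (1) delete the corrupted segment and every later segment, which loses the samples they held, or (2) in extreme cases, delete the entire WAL directory and accept the loss of all data since the last persisted block.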
                        

The Head Block

The head block holds all recently ingested data in memory. With default settings it spans roughly the most recent 2 to 3 hours of samples for every active time series. Once the head covers more than 1.5 times --storage.tsdb.min-block-duration (default: 2h), the oldest 2 hours are “cut” into a persistent block on disk.

# Head block metrics (query in Prometheus)
prometheus_tsdb_head_series              # Active time series in head
prometheus_tsdb_head_samples_appended_total  # Samples ingested
prometheus_tsdb_head_chunks              # Number of chunks in memory
prometheus_tsdb_head_chunks_created_total # Chunks created
prometheus_tsdb_head_gc_duration_seconds  # GC duration

# Memory usage estimate for head block:
# Each active series ≈ 3-4 KB in memory
# 100,000 series × 4 KB = ~400 MB base memory for head
# Plus WAL buffers, query buffers, etc.
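The rule of thumb above can be turned into a quick calculator. This is a rough sketch: the function name is hypothetical, and the per-series figure is the article's estimate rather than a measured constant.

```python
def estimate_head_memory_mb(active_series, kb_per_series=4.0):
    """Rough head-block memory in MB (decimal), from a per-series rule of thumb."""
    return active_series * kb_per_series / 1000.0

print(estimate_head_memory_mb(100_000))         # 400.0
print(estimate_head_memory_mb(2_000_000, 3.0))  # 6000.0
```

Feeding it the live value of prometheus_tsdb_head_series gives a first-order sizing figure before WAL and query buffers are added.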

Persistent Blocks

Once the head block is cut, the data becomes a persistent block — an immutable, self-contained directory on disk. Each block covers a specific time range and contains its own index, chunks, and metadata:

# Block directory structure
data/
├── 01HQKR5M7S8TQR4JNWVPZKM3QN/   # Block ULID (time-sortable)
│   ├── meta.json          # Block metadata (time range, stats)
│   ├── index              # Series index (label → series mapping)
│   ├── chunks/
│   │   └── 000001         # Chunk data (compressed samples)
│   └── tombstones         # Deletion markers
├── 01HQMGT8N2RDP5XYZAB7CDEF01/
│   ├── meta.json
│   ├── index
│   ├── chunks/
│   │   └── 000001
│   └── tombstones
└── wal/

// meta.json - Block metadata example
{
  "ulid": "01HQKR5M7S8TQR4JNWVPZKM3QN",
  "minTime": 1718380800000,
  "maxTime": 1718388000000,
  "stats": {
    "numSamples": 28543921,
    "numSeries": 142350,
    "numChunks": 284700
  },
  "compaction": {
    "level": 1,
    "sources": ["01HQKR5M7S8TQR4JNWVPZKM3QN"]
  },
  "version": 1
}
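A short Python sketch shows how the fields in meta.json can be read. It embeds the example document above as a string; the helper name `summarize_block` is an assumption for illustration.

```python
import json
from datetime import datetime, timezone

META = '''{"ulid": "01HQKR5M7S8TQR4JNWVPZKM3QN",
 "minTime": 1718380800000, "maxTime": 1718388000000,
 "stats": {"numSamples": 28543921, "numSeries": 142350, "numChunks": 284700},
 "compaction": {"level": 1, "sources": ["01HQKR5M7S8TQR4JNWVPZKM3QN"]},
 "version": 1}'''

def summarize_block(meta_json):
    """Convert the millisecond time range to UTC and derive a few ratios."""
    meta = json.loads(meta_json)
    stats = meta["stats"]

    def to_utc(ms):
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()

    return {
        "start": to_utc(meta["minTime"]),
        "end": to_utc(meta["maxTime"]),
        "hours": (meta["maxTime"] - meta["minTime"]) / 3_600_000,
        "samples_per_chunk": round(stats["numSamples"] / stats["numChunks"], 1),
    }

summary = summarize_block(META)
print(summary)
```

The two-hour span and compaction level 1 together indicate a block freshly cut from the head that has not yet been merged with its neighbors.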

Block Structure on Disk

Directory Layout

# Examine your Prometheus data directory
ls -la /prometheus/data/

# See block time ranges
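# (assumes jq and GNU date; meta.json stores times in milliseconds)
for d in /prometheus/data/01*/; do
  min=$(jq -r '.minTime / 1000 | floor' "${d}meta.json")
  max=$(jq -r '.maxTime / 1000 | floor' "${d}meta.json")
  echo "$d: $(date -u -d "@$min" '+%Y-%m-%d %H:%M') to $(date -u -d "@$max" '+%Y-%m-%d %H:%M') UTC"
done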

# Check total storage usage
du -sh /prometheus/data/
du -sh /prometheus/data/wal/
du -sh /prometheus/data/01*/chunks/

Chunks File Format

Each chunk stores up to 120 samples for a single time series. Chunks are the fundamental unit of I/O — when Prometheus reads data for a query, it reads entire chunks, not individual samples:

Internal Format

Chunk Binary Layout

Field	Size	Description
Encoding byte	1 byte	`0x01` = XOR encoding (standard)
Num samples	2 bytes (uint16, big-endian)	Number of samples in chunk (typically up to 120)
First timestamp	Variable (varint)	Absolute timestamp (int64 ms)
First value	8 bytes	Absolute float64 value
Delta-of-delta timestamps	Variable	Bit-packed delta-of-delta (the first delta is stored as a uvarint)
XOR values	Variable	Gorilla-style XOR encoding

Average chunk size: ~200–300 bytes for 120 samples (vs 1,920 bytes uncompressed) = roughly 6:1 to 10:1 compression


Index File Structure

The index file is the key to fast PromQL queries. It maps label pairs to series IDs and series IDs to chunk locations — enabling Prometheus to resolve complex label matchers without scanning all data:

Index File Logical Structure

flowchart TD
    subgraph Index["Index File Sections"]
        SYM["Symbol Table
All unique strings
(label names + values)"]
        SER["Series Section
Series ID → labels + chunk refs"]
        LI["Label Index
label_name → [all values]"]
        PL["Posting Lists
label_pair → [series IDs]"]
        PLO["Posting List Offsets
Quick lookup table"]
        TOC["Table of Contents
Section offsets"]
    end

    subgraph Query["Query: {job='api', status='500'}"]
        Q1["1. Find posting list
for job='api'
→ [1, 3, 7, 12, 45...]"]
        Q2["2. Find posting list
for status='500'
→ [3, 12, 28, 45...]"]
        Q3["3. Intersect lists
→ [3, 12, 45...]"]
        Q4["4. Look up chunk refs
for each series ID"]
    end

    PL --> Q1
    PL --> Q2
    Q1 & Q2 --> Q3
    Q3 --> SER
    SER --> Q4

Chunk Encoding & Compression

Delta-of-Delta Encoding for Timestamps

Prometheus uses delta-of-delta encoding for timestamps, exploiting the fact that scrape intervals are highly regular. If scrapes happen every 15 seconds, the delta between timestamps is always ~15000ms, and the delta-of-delta is near zero:

# Timestamp encoding example:
# Raw timestamps (ms): 1000, 1015, 1030, 1045, 1060, 1075
# Deltas:                    15,   15,   15,   15,   15
# Delta-of-deltas:                 0,    0,    0,    0

# When delta-of-delta = 0: encode as single bit (0)
# Regular scrapes → nearly free timestamp storage!

# Irregular timestamps (jitter/missed scrapes):
# Raw: 1000, 1015, 1032, 1045, 1061, 1075
# Deltas:    15,   17,   13,   16,   14
# DoD:             2,   -4,    3,   -2
# These require more bits but are still compact (variable-width bit buckets)
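The arithmetic in the example above can be reproduced with a few lines of Python. This sketch only computes and reverses the delta-of-delta values; it does not perform the bit-level packing, and the function names are illustrative.

```python
def delta_of_delta(timestamps):
    """Return (first_timestamp, first_delta, [delta-of-deltas]); needs >= 2 timestamps."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def decode(first, first_delta, dods):
    """Rebuild the original timestamps from the encoded form."""
    out = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out

regular = [1000, 1015, 1030, 1045, 1060, 1075]
jittery = [1000, 1015, 1032, 1045, 1061, 1075]

print(delta_of_delta(regular))  # (1000, 15, [0, 0, 0, 0])
print(delta_of_delta(jittery))  # (1000, 15, [2, -4, 3, -2])
```

Only the first timestamp and first delta are stored in full; everything after that is a small correction, which is where the savings come from.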

Gorilla Encoding for Values

For float64 values, Prometheus uses Gorilla compression (from Facebook’s 2015 paper). It XORs consecutive values and encodes only the meaningful bits that changed:

# Gorilla XOR encoding for float64 values:
#
# Value 1: 72.0  → IEEE 754: 0 10000000101 0010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
# Value 2: 72.5  → IEEE 754: 0 10000000101 0010 0010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
#
# XOR result:                0 00000000000 0000 0010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
#                                                 ^ only this bit differs!
#
# Encoding: store leading zeros count + meaningful bits count + meaningful bits
# Result: just a few bits instead of 64 bits per value!
#
# Best case (value unchanged): 1 bit (just a zero flag)
# Typical case: 10-20 bits per value
# Worst case (totally different value): 77 bits (2 control bits + 5 + 6 header bits + 64 value bits)

# Compression efficiency depends on value patterns:
# - Counters (incrementing): excellent (small XOR differences)
# - Gauges (stable): excellent (many unchanged values)
# - Gauges (noisy): good (similar magnitude, different bits)
# - Random values: poor (full 64-bit differences)
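The XOR step is easy to inspect directly. The following Python sketch computes the leading-zero count, meaningful-bit count, and trailing-zero count for a pair of float64 values; it illustrates the quantities the encoder works with and is not Prometheus's encoder (the function names are assumptions).

```python
import struct

def float_bits(x):
    """Return the IEEE 754 bit pattern of a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_window(prev, cur):
    """Return (leading_zeros, meaningful_bits, trailing_zeros) of prev XOR cur,
    or None when the two values are bit-identical."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return None
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1   # index of the lowest set bit
    return leading, 64 - leading - trailing, trailing

print(hex(float_bits(72.0)))    # 0x4052000000000000
print(xor_window(72.0, 72.5))   # (18, 1, 45)
print(xor_window(72.0, 72.0))   # None
```

For 72.0 followed by 72.5, a single meaningful bit survives the XOR, so only a short header plus that bit would need to be written.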

Compression Ratios

Benchmarks

TSDB Compression Efficiency

Metric Type	Raw Size/Sample	Compressed Size/Sample	Ratio	Notes
Counter (monotonic)	16 bytes	1.2–1.5 bytes	~12:1	Regular increments compress extremely well
Gauge (stable)	16 bytes	1.0–1.3 bytes	~14:1	Unchanged values = 1 bit each
Gauge (noisy)	16 bytes	2.0–3.0 bytes	~6:1	CPU/memory with fluctuation
Histogram (buckets)	16 bytes	1.5–2.0 bytes	~9:1	Multiple series per histogram
Average across all types	16 bytes	1.5–2.0 bytes	~8–10:1
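These per-sample figures feed directly into capacity planning. The sketch below applies the sizing formula from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample); the function name and the example workload are assumptions.

```python
def disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=1.5):
    """Estimate on-disk block size for a steady workload."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86_400 * samples_per_second * bytes_per_sample

# 100,000 series scraped every 15s, kept for 15 days at ~1.5 bytes/sample
gb = disk_bytes(100_000, 15, 15) / 1e9
print(round(gb, 2))  # about 12.96 GB
```

The estimate excludes the WAL and index overhead, so it is best treated as a lower bound.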


Compaction Process

When Compaction Runs

Compaction merges smaller blocks into larger ones, reducing the number of blocks Prometheus must search during queries and enabling more efficient compression across longer time ranges:

Block Compaction Levels

flowchart LR
    subgraph L1["Level 1 (2h blocks)"]
        B1["Block A
00:00-02:00"]
        B2["Block B
02:00-04:00"]
        B3["Block C
04:00-06:00"]
        B4["Block D
06:00-08:00"]
    end

    subgraph L2["Level 2 (merged)"]
        M1["Block AB
00:00-04:00"]
        M2["Block CD
04:00-08:00"]
    end

    subgraph L3["Level 3 (merged)"]
        F1["Block ABCD
00:00-08:00"]
    end

    B1 & B2 -->|"compact"| M1
    B3 & B4 -->|"compact"| M2
    M1 & M2 -->|"compact"| F1

# Compaction metrics to monitor
prometheus_tsdb_compactions_total         # Total compactions run
prometheus_tsdb_compaction_duration_seconds # Duration per compaction
prometheus_tsdb_blocks_loaded             # Currently loaded blocks
prometheus_tsdb_size_retentions_total     # Times blocks were deleted by size-based retention

# TSDB flags controlling compaction:
# --storage.tsdb.min-block-duration=2h   # Minimum block time range
# --storage.tsdb.max-block-duration=36h  # Maximum block time range
# If unset, max block duration defaults to min(31d, retention/10); 15d retention gives 36h
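The default cap on block size is simple to compute. A minimal sketch, assuming the min(31d, retention/10) rule stated above; the function name is hypothetical.

```python
from datetime import timedelta

def default_max_block_duration(retention):
    """Return min(31 days, 10% of retention) as a timedelta."""
    return min(timedelta(days=31), retention / 10)

print(default_max_block_duration(timedelta(days=15)))   # 1 day, 12:00:00
print(default_max_block_duration(timedelta(days=365)))  # 31 days, 0:00:00
```

With the default 15-day retention this yields 36 hours, which is where the 36h figure comes from.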

Tombstones & Deletion

Prometheus supports deleting data via the Admin API. Rather than rewriting blocks, it creates tombstone files that mark time ranges as deleted. The actual data is removed during the next compaction:

# Delete series data via Admin API (must enable --web.enable-admin-api)
# Delete all data for a specific metric in a time range
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series' \
  -d 'match[]={__name__="expensive_metric_to_remove"}' \
  -d 'start=2026-06-01T00:00:00Z' \
  -d 'end=2026-06-10T00:00:00Z'

# Force compaction to reclaim disk space immediately
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'

# WARNING: Deletion is permanent after clean_tombstones!
# The tombstones file can be manually deleted to "undo" before compaction

Retention Enforcement

# Retention is enforced in two ways (whichever triggers first):

# Time-based retention (default: 15 days)
--storage.tsdb.retention.time=15d

# Size-based retention (delete oldest blocks when exceeded)
--storage.tsdb.retention.size=50GB

# Both can be combined:
# "Keep 30 days OR 100GB, whichever limit is hit first"
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=100GB
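How the two limits interact can be sketched as a block-selection routine. This is a simplified model, not Prometheus's implementation: it measures age against a supplied `now_ms`, works on whole blocks only, and uses a hypothetical tuple layout of (ulid, max_time_ms, size_bytes).

```python
def blocks_to_delete(blocks, now_ms, retention_ms, max_bytes):
    """Return ULIDs to drop, newest blocks being kept first.

    Time rule: drop blocks whose newest sample is older than the retention window.
    Size rule: keep newest blocks until the byte budget is used up (0 disables it).
    """
    drop, used = [], 0
    for ulid, max_t, size in sorted(blocks, key=lambda b: b[1], reverse=True):
        too_old = now_ms - max_t > retention_ms
        too_big = max_bytes > 0 and used + size > max_bytes
        if too_old or too_big or drop:   # once one block goes, all older ones go too
            drop.append(ulid)
        else:
            used += size
    return drop

blocks = [("A", 1000, 40), ("B", 2000, 40), ("C", 3000, 40)]
print(blocks_to_delete(blocks, 3500, 10_000, 100))  # ['A']  (size limit hit)
print(blocks_to_delete(blocks, 3500, 1000, 0))      # ['B', 'A']  (time limit hit)
```

Because deletion happens per block, data can outlive the configured retention by up to one block's duration.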

Index Structure Deep Dive

Posting Lists

The posting list is an inverted index mapping each label pair to the set of series IDs containing that pair. It is the same technique search engines use to map terms to document IDs:

# Conceptual posting list structure:
#
# Label Pair              → Series IDs (sorted)
# ─────────────────────────────────────────────
# job="api-server"        → [1, 2, 3, 4, 5, 6, 7, 8]
# job="worker"            → [9, 10, 11, 12, 13, 14]
# method="GET"            → [1, 3, 5, 7, 9, 11, 13]
# method="POST"           → [2, 4, 6, 8, 10, 12, 14]
# status="200"            → [1, 2, 3, 4, 9, 10, 11, 12]
# status="500"            → [5, 6, 7, 8, 13, 14]
#
# Query: {job="api-server", status="500"}
# 1. Posting list for job="api-server"  → [1, 2, 3, 4, 5, 6, 7, 8]
# 2. Posting list for status="500"      → [5, 6, 7, 8, 13, 14]
# 3. Intersect (sorted merge)           → [5, 6, 7, 8]
# 4. Result: 4 matching series (very fast!)
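The sorted-merge intersection in step 3 can be written as a two-pointer walk. This is a minimal sketch over plain Python lists using the conceptual data above; the real index stores postings in a compact binary form.

```python
def intersect(a, b):
    """Intersect two sorted lists of series IDs in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

postings = {
    ("job", "api-server"): [1, 2, 3, 4, 5, 6, 7, 8],
    ("status", "500"): [5, 6, 7, 8, 13, 14],
    ("method", "GET"): [1, 3, 5, 7, 9, 11, 13],
}

errors = intersect(postings[("job", "api-server")], postings[("status", "500")])
print(errors)                                         # [5, 6, 7, 8]
print(intersect(errors, postings[("method", "GET")])) # [5, 7]
```

Keeping the lists sorted is what makes this a single linear pass, and each extra matcher only narrows the running result.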

Label Value Index

The label index provides fast lookup of all values for a given label name — used by the /api/v1/label/<name>/values endpoint and autocomplete in Grafana:

# Label index structure:
# label_name → [value1, value2, value3, ...]
#
# __name__  → ["http_requests_total", "node_cpu_seconds_total", "up", ...]
# job       → ["api-server", "worker", "node-exporter", "prometheus", ...]
# instance  → ["10.0.1.5:8080", "10.0.1.6:8080", "10.0.2.3:9100", ...]
# method    → ["GET", "POST", "PUT", "DELETE", "PATCH"]
# status    → ["200", "201", "204", "400", "401", "403", "404", "500", "503"]

# Query all label values for 'job':
curl -s 'http://localhost:9090/api/v1/label/job/values' | jq '.data'
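Both structures can be derived from the same pass over the series set. The sketch below builds a label-value index and posting lists from a small, made-up set of series; the function name and sample data are assumptions for illustration.

```python
from collections import defaultdict

def build_indexes(series):
    """series: {series_id: {label: value}}. Returns (postings, label_values)."""
    postings = defaultdict(list)
    values = defaultdict(set)
    for sid in sorted(series):                 # ascending IDs keep postings sorted
        for name, value in series[sid].items():
            postings[(name, value)].append(sid)
            values[name].add(value)
    return dict(postings), {k: sorted(v) for k, v in values.items()}

series = {
    1: {"__name__": "up", "job": "api-server"},
    2: {"__name__": "up", "job": "worker"},
    3: {"__name__": "http_requests_total", "job": "api-server", "method": "GET"},
}
postings, label_values = build_indexes(series)
print(label_values["job"])               # ['api-server', 'worker']
print(postings[("job", "api-server")])   # [1, 3]
```

The label-value side answers "which values exist for this label", while the postings side answers "which series carry this pair".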

Query Resolution Path

                            
PromQL Query Execution Path:
1. Parse the PromQL expression into an AST
2. Identify label matchers from selectors
3. For each block overlapping the query time range:
    a. Look up posting lists for each matcher
    b. Intersect the posting lists across matchers (AND logic); union the lists of every value a regex matcher accepts
    c. For each matching series, find chunk references
    d. Read and decompress relevant chunks
4. Merge results across blocks (head + persistent)
5. Apply PromQL functions (rate, sum, etc.)
6. Return the result

                        

Memory-Mapped Chunks

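In the 2.x TSDB, chunks in persistent blocks are read through memory-mapped files (mmap) rather than loaded entirely into RAM. Since Prometheus 2.19, full chunks in the head block are also flushed to disk and memory-mapped, leaving only the chunk currently being written in the heap. The operating system’s page cache decides which chunks are in memory and which are on disk: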

# Memory-mapping full head chunks reduces Prometheus heap usage dramatically
# Before 2.19: every head chunk was held in the process heap
# Since 2.19: full head chunks live in chunks_head/ and the OS page cache manages hot/cold data

# Monitor mmap behavior:
prometheus_tsdb_head_chunks_storage_size_bytes   # Size of the chunks_head directory on disk
process_resident_memory_bytes                     # Total RSS
process_virtual_memory_bytes                      # Includes mmap regions

# The difference between virtual and resident memory
# shows how much block data is mapped but not in RAM:
# virtual - resident ≈ cold mmap data on disk

# Practical impact: A Prometheus with 500GB on disk might show
# 500GB virtual memory but only 8GB resident (actual RAM used)
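The effect of memory mapping can be observed with Python's standard mmap module. This sketch writes a throwaway file of float64 values and reads one of them through a mapping; the file name and layout are made up for the example and have nothing to do with Prometheus's chunk format.

```python
import mmap
import os
import struct
import tempfile

# Write 100,000 big-endian float64 values to a scratch file (800,000 bytes).
path = os.path.join(tempfile.mkdtemp(), "chunks.bin")
with open(path, "wb") as f:
    for i in range(100_000):
        f.write(struct.pack(">d", float(i)))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size = len(mm)                     # whole file is addressable...
    (value,) = struct.unpack_from(">d", mm, 8 * 12_345)  # ...but only touched pages are loaded
    mm.close()

print(size, value)  # 800000 12345.0
```

The mapping counts toward virtual memory in full, while resident memory grows only for the pages actually read, which is the gap described above.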

TSDB Configuration

# Key TSDB-related Prometheus flags:
# (set via command-line args or Helm values)

# Storage path
--storage.tsdb.path=/prometheus/data

# Retention
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=0        # 0 = disabled (time-based only)

# Block durations
--storage.tsdb.min-block-duration=2h   # Don't change unless you know why
--storage.tsdb.max-block-duration=36h  # Auto-capped at retention/10

# WAL configuration
--storage.tsdb.wal-segment-size=128MB  # Default, rarely needs changing
--storage.tsdb.wal-compression         # Enable WAL compression (saves ~50% WAL disk)

# Head chunks
--storage.tsdb.head-chunks-write-queue-size=0  # Async chunk writes (0=sync)

# Out-of-order ingestion (Prometheus 2.39+) is configured in prometheus.yml, not via a flag:
#   storage:
#     tsdb:
#       out_of_order_time_window: 30m  # Accept samples up to 30m late

# No-lockfile (for shared/readonly mounts)
--storage.tsdb.no-lockfile=false

                            
Out-of-Order Ingestion: Available since Prometheus 2.39, this allows accepting samples with timestamps older than the most recent sample for a series. It is useful for remote-write receivers, Prometheus agent mode, and environments with clock skew. Set out_of_order_time_window (under storage.tsdb in prometheus.yml) to the maximum expected delay.
                        

Conclusion & What’s Next

The Prometheus TSDB is a masterpiece of systems engineering — achieving 8–10x compression ratios while maintaining sub-second query performance across millions of time series. Key takeaways:

A time series = metric name + labels + ordered (timestamp, value) pairs
The WAL provides crash recovery; the head block holds recent data in memory
Chunks use Gorilla/XOR encoding for ~1.5 bytes per sample (vs 16 bytes raw)
Posting lists enable millisecond label-based lookups across millions of series
Compaction merges blocks over time for more efficient storage and queries
Memory-mapped chunks let the OS page cache manage hot/cold data transparently

Next in the Series

In Part 4: Mastering PromQL, we’ll harness the full power of Prometheus’ query language — from instant vectors and range vectors through aggregation operators, binary operations, and the recording rules that tame complex queries for dashboard performance.
