
Tool Deep Dive: Jaeger Complete Guide

May 14, 2026 Wasil Zafar 18 min read

A comprehensive guide to Jaeger — the open-source distributed tracing platform. From architecture internals and deployment strategies to storage selection, sampling configuration, and production-ready operations.

Table of Contents

  1. Jaeger Architecture
  2. Deployment Strategies
  3. Storage Backends
  4. Sampling Strategies
  5. Query & Analysis
  6. Production Checklist
  7. Related Posts

Jaeger Architecture

Jaeger, originally developed at Uber and now a CNCF graduated project, provides end-to-end distributed tracing for microservices. Its architecture follows a pipeline model where trace data flows from instrumented applications through collection infrastructure into queryable storage.

Jaeger Architecture — Data Flow Pipeline
flowchart LR
    A[Application<br/>OTel SDK] -->|UDP/gRPC| B[Jaeger Agent<br/>Local Daemon]
    B -->|gRPC| C[Jaeger Collector<br/>Validation & Indexing]
    C -->|Write| D[(Storage<br/>Cassandra / ES / Kafka)]
    D -->|Read| E[Jaeger Query<br/>API Service]
    E -->|Serve| F[Jaeger UI<br/>Web Interface]
    C -->|Optional| G[Kafka<br/>Buffer]
    G -->|Consume| H[Jaeger Ingester<br/>Storage Writer]
    H -->|Write| D

Component Responsibilities

Application SDK (OpenTelemetry) — instruments application code, creates spans with context propagation, and exports trace data. Modern deployments use the OpenTelemetry SDK directly rather than legacy Jaeger client libraries.

Jaeger Agent — a lightweight daemon running as a sidecar or host agent that receives spans over UDP, batches them, and forwards to the collector. In OTel-native deployments, the OTel Collector replaces this component.

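Jaeger Collector — receives spans from agents or directly from SDKs, validates them, enriches them with metadata, serves sampling strategies to clients (remote and adaptive sampling), and writes to storage. Tail-based sampling is not performed here; it is typically handled upstream, for example by the OTel Collector's tail sampling processor.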

Storage Backend — persists trace data for querying. Supports Cassandra, Elasticsearch, Kafka (as intermediate buffer), and Badger (embedded). Each backend has distinct performance characteristics.

Jaeger Query — serves the Jaeger UI and exposes a gRPC/REST API for trace retrieval, search, and dependency graph generation.

Jaeger UI — React-based web interface for searching traces, viewing span timelines, comparing traces, and visualizing service dependency graphs.
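Context propagation, mentioned under the Application SDK above, is what lets spans from different services join into one trace. The sketch below illustrates the idea with the W3C traceparent header format (version, trace ID, span ID, flags). It is a simplified illustration, not the OpenTelemetry SDK's implementation; the function names are hypothetical, and a real SDK also validates edge cases such as all-zero IDs.

```python
import re
import secrets

# traceparent layout: version "00", 32 hex trace ID, 16 hex span ID, 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent_header: str) -> str:
    """Continue the caller's trace: keep trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        # Missing or malformed header: start a fresh trace instead of failing
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()             # header set by the first service
outgoing = child_traceparent(incoming)   # header the next hop forwards
print(incoming)
print(outgoing)
```

Both headers share the same trace ID, which is the key Jaeger uses to assemble spans from separate services into a single timeline.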

Deployment Strategies

Jaeger supports multiple deployment topologies depending on scale, reliability requirements, and operational maturity. Choose the simplest model that meets your needs and evolve as traffic grows.

| Strategy | Components | Best For | Considerations |
| --- | --- | --- | --- |
| All-in-One | Single binary (agent + collector + query + UI + in-memory storage) | Development, testing, demos | No persistence, not for production |
| Production | Separate agent, collector, query, storage | Production workloads up to ~50K spans/sec | Scale collector horizontally, requires storage cluster |
| Streaming | Collector → Kafka → Ingester → Storage | High-throughput (100K+ spans/sec), decoupled pipeline | Kafka adds buffer and replay capability, higher ops overhead |
| OpenTelemetry-native | OTel Collector replaces Jaeger Agent, exports to Jaeger Collector or directly to storage | Multi-signal pipelines, vendor-agnostic instrumentation | Recommended for new deployments, unified config for metrics/logs/traces |

All-in-One Docker Compose

The quickest way to run Jaeger locally for development and testing:

# docker-compose.yml — Jaeger All-in-One for development
version: "3.8"

services:
  jaeger:
    image: jaegertracing/all-in-one:1.54
    container_name: jaeger
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=badger
      - BADGER_EPHEMERAL=false
      - BADGER_DIRECTORY_VALUE=/badger/data
      - BADGER_DIRECTORY_KEY=/badger/key
    ports:
      - "6831:6831/udp"    # Jaeger Agent — Thrift compact
      - "6832:6832/udp"    # Jaeger Agent — Thrift binary
      - "4317:4317"        # OTLP gRPC receiver
      - "4318:4318"        # OTLP HTTP receiver
      - "14268:14268"      # Collector HTTP — spans
      - "14269:14269"      # Admin port: health check and /metrics
      - "16686:16686"      # Jaeger UI
      - "16687:16687"      # Admin port of a standalone jaeger-query (all-in-one uses 14269)
    volumes:
      - jaeger_data:/badger
    restart: unless-stopped

volumes:
  jaeger_data:
    driver: local

Start with docker compose up -d and access the UI at http://localhost:16686. The Badger storage provides local persistence that survives container restarts.
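To confirm the container accepts traces, a single hand-built span can be posted to the OTLP HTTP receiver on port 4318. The following is a minimal smoke-test sketch using only the standard library. The service name, span name, and helper functions are hypothetical, and the payload follows the OTLP JSON encoding (hex-encoded IDs, nanosecond timestamps as strings).

```python
import json
import secrets
import time
import urllib.request

# OTLP HTTP receiver as mapped in the compose file above
OTLP_HTTP_URL = "http://localhost:4318/v1/traces"


def build_test_span(service_name: str, span_name: str, duration_ms: int = 50) -> dict:
    """Build an OTLP/JSON export request containing one span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,                         # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def send_span(payload: dict, url: str = OTLP_HTTP_URL) -> int:
    """POST the payload to the OTLP HTTP endpoint and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_test_span("smoke-test", "hello-jaeger")
print(json.dumps(payload, indent=2))
```

With the container running, calling send_span(payload) should return a 2xx status, after which the smoke-test service should appear in the UI's service dropdown.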

Storage Backends

Storage selection is the most impactful architectural decision for Jaeger. Each backend offers different trade-offs across write throughput, query flexibility, operational complexity, and cost.

| Backend | Write Performance | Query Flexibility | Scale Model | Best For |
| --- | --- | --- | --- | --- |
| Cassandra | Excellent (LSM-tree optimized) | Limited (predefined indexes only) | Horizontal (add nodes linearly) | Write-heavy workloads, predictable query patterns |
| Elasticsearch | Good (bulk indexing) | Excellent (full-text, tag filtering, aggregations) | Horizontal (shard-based) | Ad-hoc search, complex filtering, tag-based queries |
| Kafka + Flink | Excellent (buffered, backpressure-safe) | N/A (intermediate buffer only) | Horizontal (partition-based) | Buffering between collector and storage, replay capability |
| Badger | Good (SSD-optimized) | Basic | Single-node only | Development, small deployments, embedded use cases |

Choosing Your Storage Backend

Storage Selection Decision Framework: Start with Elasticsearch if you need flexible querying and your team already runs an ES cluster. Choose Cassandra if write throughput is your primary concern and queries follow predictable patterns (by service, operation, traceID). Add Kafka as an intermediate buffer when you exceed 50K spans/sec or need decoupled ingestion for reliability. Use Badger only for development, CI pipelines, or single-node edge deployments.
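The decision framework above can be written down as a small function, which makes the rules easy to review and adjust. This is only a sketch of the guidance in this section: the Workload fields and function name are hypothetical, and the 50K spans/sec threshold is the rule of thumb quoted above, not a hard limit.

```python
from dataclasses import dataclass

# Rule-of-thumb threshold from the text above
KAFKA_THRESHOLD_SPANS_PER_SEC = 50_000


@dataclass
class Workload:
    spans_per_sec: int
    write_heavy_predictable_queries: bool = False  # lookups by service, operation, traceID
    single_node: bool = False                      # dev, CI, or edge deployment
    needs_decoupled_ingestion: bool = False


def recommend_storage(w: Workload) -> list[str]:
    """Return the suggested pipeline, ordered from ingestion to final storage."""
    if w.single_node:
        return ["badger"]
    # Elasticsearch is the starting point; Cassandra wins for write-heavy,
    # predictable query patterns
    backend = "cassandra" if w.write_heavy_predictable_queries else "elasticsearch"
    pipeline = [backend]
    if w.spans_per_sec > KAFKA_THRESHOLD_SPANS_PER_SEC or w.needs_decoupled_ingestion:
        pipeline.insert(0, "kafka")  # buffer in front of the storage backend
    return pipeline


print(recommend_storage(Workload(spans_per_sec=80_000, write_heavy_predictable_queries=True)))
```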

Sampling Strategies

Sampling determines which traces are captured and stored. Without sampling, high-traffic services generate unsustainable data volumes. Jaeger supports multiple strategies that can be combined and dynamically adjusted.

| Strategy | Mechanism | Configuration | Use Case |
| --- | --- | --- | --- |
| Const | Always sample (1) or never sample (0) | param: 1 | Development, low-traffic services |
| Probabilistic | Random percentage of traces | param: 0.1 (10%) | Steady baseline coverage |
| Rate Limiting | Fixed traces per second per service | param: 2.0 (2/sec) | Cost-controlled budgets |
| Remote | Polling central config from collector | Served by collector endpoint | Centralized control without redeployment |
| Adaptive | Auto-adjusts per-operation rates to meet target throughput | sampling.target-samples-per-second | Mixed-traffic services, automatic optimization |
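To make the probabilistic and rate-limiting mechanisms concrete, here is a simplified sketch of both. It is illustrative only and not Jaeger's implementation: the class names are hypothetical, the probabilistic sampler compares the trace ID against a boundary so that every service reaches the same decision for a given trace, and the rate limiter is a plain token bucket with an injectable clock.

```python
import time

MAX_ID = (1 << 63) - 1  # compare on the low 63 bits of the trace ID


class ProbabilisticSampler:
    """Sample a fixed fraction of traces, deterministically per trace ID."""

    def __init__(self, rate: float):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.boundary = int(rate * MAX_ID)

    def is_sampled(self, trace_id: int) -> bool:
        # IDs are random, so the share below the boundary approximates `rate`
        return (trace_id & MAX_ID) < self.boundary


class RateLimitingSampler:
    """Token bucket: allow at most `traces_per_second` sampled traces."""

    def __init__(self, traces_per_second: float, clock=time.monotonic):
        self.rate = traces_per_second
        self.capacity = max(traces_per_second, 1.0)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def is_sampled(self) -> bool:
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


baseline = ProbabilisticSampler(0.1)   # param: 0.1 from the table
budget = RateLimitingSampler(2.0)      # param: 2.0 from the table
print(baseline.is_sampled(12345), budget.is_sampled())
```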

Sampling Configuration

Configure sampling strategies centrally on the Jaeger Collector for remote and adaptive sampling:

jaeger-sampling.json (collector sampling configuration, served to agents/SDKs). The strategies file is JSON, so it cannot carry inline comments:

{
  "service_strategies": [
    { "service": "order-service", "type": "probabilistic", "param": 0.5 },
    { "service": "payment-service", "type": "ratelimiting", "param": 10 },
    { "service": "frontend-gateway", "type": "probabilistic", "param": 0.01 }
  ],
  "default_strategy": { "type": "probabilistic", "param": 0.1 }
}

Here order-service samples 50% of traces, payment-service is capped at 10 traces per second, the high-traffic frontend-gateway samples 1%, and unlisted services fall back to a 10% baseline.

Adaptive sampling (Jaeger 1.27+) is not part of this file. It is enabled on the collector itself and tuned with options such as --sampling.target-samples-per-second (for example 1) and --sampling.initial-sampling-probability (for example 0.001), after which the collector adjusts per-operation rates automatically.

Pass this configuration to the collector with the --sampling.strategies-file flag. Services poll the collector's sampling endpoint at regular intervals to fetch their assigned strategy without requiring redeployment.
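The lookup the collector performs when a service asks for its strategy can be sketched as follows. This is an illustration of the per-service and default fallback logic, not Jaeger's code. The key names service_strategies and default_strategy follow the JSON strategies-file layout and should be checked against your Jaeger version; the final fallback value is an assumption.

```python
# Same values as the per-service strategies in this section
CONFIG = {
    "service_strategies": [
        {"service": "order-service", "type": "probabilistic", "param": 0.5},
        {"service": "payment-service", "type": "ratelimiting", "param": 10.0},
        {"service": "frontend-gateway", "type": "probabilistic", "param": 0.01},
    ],
    "default_strategy": {"type": "probabilistic", "param": 0.1},
}

# Assumed last-resort strategy when the file defines no default
FALLBACK = {"type": "probabilistic", "param": 0.001}


def resolve_strategy(config: dict, service: str) -> dict:
    """Return the strategy for `service`, falling back to the default."""
    for entry in config.get("service_strategies", []):
        if entry["service"] == service:
            return {"type": entry["type"], "param": entry["param"]}
    return config.get("default_strategy", FALLBACK)


for name in ("payment-service", "inventory-service"):
    print(name, resolve_strategy(CONFIG, name))
```

A service that is not listed, such as the hypothetical inventory-service here, receives the default strategy.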

Query & Analysis

Jaeger UI provides powerful trace exploration capabilities. Mastering its query features dramatically reduces mean-time-to-diagnosis for distributed system issues.

Effective Trace Search Patterns

Service + Operation filtering — narrow results to specific endpoints. Combine with duration filters (minDuration, maxDuration) to find slow requests immediately.

Tag-based search — query by span tags like http.status_code=500, error=true, or custom business tags like user.tier=premium. Elasticsearch backend enables full-text search across tag values.

Trace comparison — select two traces of the same operation (one fast, one slow) and use Jaeger's diff view to identify exactly which spans diverge in duration or structure.

Dependency graph — the System Architecture tab shows a service dependency DAG derived from trace data, with call counts between services. In-memory storage computes the graph on the fly; with Cassandra or Elasticsearch, it is produced by the separate spark-dependencies job.
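The search patterns above can also be scripted against the HTTP endpoint the Jaeger UI itself calls. The sketch below only builds the search URL. Note the assumptions: /api/traces is the UI's internal API rather than a stable public contract, the parameter names reflect what the UI sends, and the helper name is hypothetical.

```python
import json
import time
from urllib.parse import urlencode


def build_trace_search_url(base_url, service, *, operation=None, tags=None,
                           min_duration=None, max_duration=None,
                           lookback_seconds=3600, limit=20, now_us=None):
    """Build a trace search URL; the time window is given in microseconds."""
    end_us = now_us if now_us is not None else int(time.time() * 1_000_000)
    start_us = end_us - lookback_seconds * 1_000_000
    params = {"service": service, "start": start_us, "end": end_us, "limit": limit}
    if operation:
        params["operation"] = operation
    if tags:
        # Tags are passed as one JSON object of string values
        params["tags"] = json.dumps(tags, separators=(",", ":"))
    if min_duration:
        params["minDuration"] = min_duration   # e.g. "500ms"
    if max_duration:
        params["maxDuration"] = max_duration   # e.g. "5s"
    return f"{base_url.rstrip('/')}/api/traces?{urlencode(params)}"


# Slow, failing requests for one service in the last hour
print(build_trace_search_url(
    "http://localhost:16686",
    "order-service",
    tags={"error": "true"},
    min_duration="500ms",
))
```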

Trace-to-Logs Correlation

Trace-to-Logs Correlation: Embed the trace_id and span_id in every log line, for example through your logging framework's MDC (Mapped Diagnostic Context) on the JVM. To link from the Jaeger UI to Grafana Loki or Elasticsearch logs filtered by trace ID, define linkPatterns in the Jaeger UI configuration file (loaded with --query.ui-config). This gives a direct investigation flow: find a slow trace, then click through to the exact log output for that request across all services. A structured log line carrying both IDs looks like this:
{
  "timestamp": "2026-05-14T10:32:15.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "span789xyz",
  "message": "Payment gateway timeout after 5000ms",
  "http.method": "POST",
  "http.url": "/api/v1/payments/charge",
  "http.status_code": 504
}
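One way to produce log lines like the one above is a logging filter that copies the active trace context onto every record. The sketch below uses only the standard library and keeps the context in a contextvars variable. That variable and the helper names are hypothetical stand-ins: in an instrumented service the IDs would come from the tracing SDK's current span rather than being set by hand.

```python
import contextvars
import io
import json
import logging
import time

# Stand-in for the tracing SDK's notion of the current span
_trace_ctx = contextvars.ContextVar(
    "trace_ctx", default={"trace_id": None, "span_id": None}
)


def set_trace_context(trace_id: str, span_id: str) -> None:
    _trace_ctx.set({"trace_id": trace_id, "span_id": span_id})


class TraceContextFilter(logging.Filter):
    """Attach the current trace and span IDs to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = _trace_ctx.get()
        record.trace_id = ctx["trace_id"]
        record.span_id = ctx["span_id"]
        return True


class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # emit UTC timestamps

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        stamp = self.formatTime(record, "%Y-%m-%dT%H:%M:%S")
        return json.dumps({
            "timestamp": f"{stamp}.{int(record.msecs):03d}Z",
            "level": record.levelname,
            "service": self.service_name,
            "trace_id": record.trace_id,
            "span_id": record.span_id,
            "message": record.getMessage(),
        })


def build_logger(stream, service_name: str) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter(service_name))
    logger = logging.getLogger(f"{service_name}.traced")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
logger = build_logger(buffer, "order-service")
set_trace_context("abc123def456", "span789xyz")
logger.error("Payment gateway timeout after %dms", 5000)
print(buffer.getvalue().strip())
```

Because the filter runs on the handler, existing logger calls stay unchanged and every line gains the two correlation fields.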

Production Checklist


Jaeger Production Deployment Checklist

  1. Enable TLS — configure mTLS between all Jaeger components (agent → collector, collector → storage, query → storage) and rotate the certificates regularly
  2. Configure adaptive sampling — set target samples per second to control storage costs while ensuring coverage across all operations
  3. Deploy multiple collectors — run 3+ collector replicas behind a load balancer for high availability and horizontal throughput scaling
  4. Set retention policies — configure TTL on storage (Cassandra: the default_time_to_live table option, ES: ILM policies) to auto-expire old traces (7-14 days typical)
  5. Monitor Jaeger itself — expose collector/query metrics to Prometheus, alert on dropped spans (jaeger_collector_spans_dropped_total), queue saturation, and storage write errors
  6. Add Kafka buffer — for workloads exceeding 50K spans/sec, insert Kafka between collector and storage to absorb bursts and enable replay on storage failures
  7. Configure resource limits — set memory limits on collectors (queue bounded by --collector.queue-size), CPU requests on query service, and storage IOPS reservations
  8. Enable span enrichment — add environment, region, and deployment version tags at the collector level using --collector.tags for consistent cross-service filtering
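For item 5, a quick way to reason about dropped spans is to compare the dropped counter with the received counter from the collector's /metrics output. The sketch below parses Prometheus text exposition with the standard library. The sample text is illustrative rather than real collector output, the label sets are assumptions, and in production this ratio would normally be a Prometheus alert rule instead of a script.

```python
# Illustrative scrape output; label names and values are assumptions
SAMPLE_METRICS = """\
# TYPE jaeger_collector_spans_dropped_total counter
jaeger_collector_spans_dropped_total{host="collector-0"} 25
# TYPE jaeger_collector_spans_received_total counter
jaeger_collector_spans_received_total{svc="order-service",transport="grpc"} 4000
jaeger_collector_spans_received_total{svc="payment-service",transport="grpc"} 1000
"""


def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum all series of one metric (assumes no trailing timestamps)."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


def dropped_ratio(exposition_text: str) -> float:
    """Fraction of received spans that the collector dropped."""
    received = sum_metric(exposition_text, "jaeger_collector_spans_received_total")
    dropped = sum_metric(exposition_text, "jaeger_collector_spans_dropped_total")
    return dropped / received if received else 0.0


ratio = dropped_ratio(SAMPLE_METRICS)
print(f"dropped ratio: {ratio:.3%}")
```

A sustained non-zero ratio usually means the collector queue or the storage backend cannot keep up, which is the signal for adding collector replicas or the Kafka buffer from item 6.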