Tool Deep Dive: Jaeger Complete Guide

Jaeger Architecture

Jaeger, originally developed at Uber and now a CNCF graduated project, provides end-to-end distributed tracing for microservices. Its architecture follows a pipeline model where trace data flows from instrumented applications through collection infrastructure into queryable storage.

Jaeger Architecture — Data Flow Pipeline

flowchart LR
    A["Application<br/>OTel SDK"] -->|UDP/gRPC| B["Jaeger Agent<br/>Local Daemon"]
    B -->|gRPC| C["Jaeger Collector<br/>Validation & Indexing"]
    C -->|Write| D[("Storage<br/>Cassandra / ES / Kafka")]
    D -->|Read| E["Jaeger Query<br/>API Service"]
    E -->|Serve| F["Jaeger UI<br/>Web Interface"]
    C -->|Optional| G["Kafka<br/>Buffer"]
    G -->|Consume| H["Jaeger Ingester<br/>Storage Writer"]
    H -->|Write| D

Component Responsibilities

Application SDK (OpenTelemetry) — instruments application code, creates spans with context propagation, and exports trace data. Modern deployments use the OpenTelemetry SDK directly rather than legacy Jaeger client libraries.

Jaeger Agent — a lightweight daemon running as a sidecar or host agent that receives spans over UDP, batches them, and forwards to the collector. In OTel-native deployments, the OTel Collector replaces this component.

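Jaeger Collector — receives spans from agents or directly from SDKs, validates traces, enriches with metadata, serves remote and adaptive sampling strategies to clients, and writes to storage. It does not make tail-based sampling decisions; those are typically made upstream, for example by the OTel Collector's tail sampling processor.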

Storage Backend — persists trace data for querying. Supports Cassandra, Elasticsearch, Kafka (as intermediate buffer), and Badger (embedded). Each backend has distinct performance characteristics.

Jaeger Query — serves the Jaeger UI and exposes a gRPC/REST API for trace retrieval, search, and dependency graph generation.

Jaeger UI — React-based web interface for searching traces, viewing span timelines, comparing traces, and visualizing service dependency graphs.
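Context propagation is what lets spans created in different services end up in the same trace. The sketch below illustrates the idea using the W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry SDKs. The `SpanContext` class and helper names are hypothetical and exist only for illustration; a real service would rely on the SDK's propagators rather than hand-rolled parsing, and this simplified version skips some validation the specification requires (such as rejecting all-zero IDs).

```python
import re
import secrets
from dataclasses import dataclass

# traceparent layout: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical minimal span context (not an OpenTelemetry class)."""
    trace_id: str   # 32 hex characters, shared by every span in the trace
    span_id: str    # 16 hex characters, unique per span
    sampled: bool   # sampling decision carried along with the request


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace with random IDs."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace ID and sampling flag, but gets a new span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: SpanContext) -> str:
    """Serialize the context into the header sent to the downstream service."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header: str) -> SpanContext:
    """Parse the header received from the upstream service."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent header: {header!r}")
    _version, trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A downstream service would call `from_traceparent` on the incoming header and then `child_of` on the result, so its spans share the caller's trace ID.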

Deployment Strategies

Jaeger supports multiple deployment topologies depending on scale, reliability requirements, and operational maturity. Choose the simplest model that meets your needs and evolve as traffic grows.

Strategy	Components	Best For	Considerations
All-in-One	Single binary (agent + collector + query + UI + in-memory storage)	Development, testing, demos	No persistence, not for production
Production	Separate agent, collector, query, storage	Production workloads up to ~50K spans/sec	Scale collector horizontally, requires storage cluster
Streaming	Collector → Kafka → Ingester → Storage	High-throughput (100K+ spans/sec), decoupled pipeline	Kafka adds buffer and replay capability, higher ops overhead
OpenTelemetry-native	OTel Collector replaces Jaeger Agent, exports to Jaeger Collector or directly to storage	Multi-signal pipelines, vendor-agnostic instrumentation	Recommended for new deployments, unified config for metrics/logs/traces

All-in-One Docker Compose

The quickest way to run Jaeger locally for development and testing:

# docker-compose.yml — Jaeger All-in-One for development
version: "3.8"

services:
  jaeger:
    image: jaegertracing/all-in-one:1.54
    container_name: jaeger
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=badger
      - BADGER_EPHEMERAL=false
      - BADGER_DIRECTORY_VALUE=/badger/data
      - BADGER_DIRECTORY_KEY=/badger/key
    ports:
      - "6831:6831/udp"    # Jaeger Agent — Thrift compact
      - "6832:6832/udp"    # Jaeger Agent — Thrift binary
      - "4317:4317"        # OTLP gRPC receiver
      - "4318:4318"        # OTLP HTTP receiver
      - "14268:14268"      # Collector HTTP — spans
      - "14269:14269"      # Admin port: health check and metrics
      - "16686:16686"      # Jaeger UI
      - "16687:16687"      # Query admin port (standalone jaeger-query; all-in-one serves admin on 14269)
    volumes:
      - jaeger_data:/badger
    restart: unless-stopped

volumes:
  jaeger_data:
    driver: local

Start with docker compose up -d and access the UI at http://localhost:16686. The Badger storage provides local persistence that survives container restarts.
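One way to confirm the stack is receiving data is to post a single hand-built span to the OTLP HTTP receiver on port 4318. The following is a minimal sketch using only the Python standard library, assuming the receiver accepts OTLP/JSON at `/v1/traces`. The service name `smoke-test`, the scope name, and the function names are made up for this example; real applications would export through the OpenTelemetry SDK instead.

```python
import json
import secrets
import time
import urllib.request


def build_otlp_payload(service_name: str, span_name: str, duration_ms: int = 50) -> dict:
    """Build an OTLP/JSON trace export request containing one span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),   # 32 hex characters
                    "spanId": secrets.token_hex(8),     # 16 hex characters
                    "name": span_name,
                    "kind": 1,                          # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def send_test_span(endpoint: str = "http://localhost:4318/v1/traces") -> int:
    """POST one span to the OTLP HTTP receiver and return the HTTP status code."""
    body = json.dumps(build_otlp_payload("smoke-test", "hello-jaeger")).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `send_test_span()` while the container is running should make a `smoke-test` service appear in the UI's service dropdown after a refresh.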

Storage Backends

Storage selection is the most impactful architectural decision for Jaeger. Each backend offers different trade-offs across write throughput, query flexibility, operational complexity, and cost.

Backend	Write Performance	Query Flexibility	Scale Model	Best For
Cassandra	Excellent (LSM-tree optimized)	Limited (predefined indexes only)	Horizontal — add nodes linearly	Write-heavy workloads, predictable query patterns
Elasticsearch	Good (bulk indexing)	Excellent (full-text, tag filtering, aggregations)	Horizontal — shard-based	Ad-hoc search, complex filtering, tag-based queries
Kafka + Flink	Excellent (buffered, backpressure-safe)	N/A (intermediate buffer only)	Horizontal — partition-based	Buffering between collector and storage, replay capability
Badger	Good (SSD-optimized)	Basic	Single-node only	Development, small deployments, embedded use cases

Choosing Your Storage Backend

                            
Storage Selection Decision Framework: Start with Elasticsearch if you need flexible querying and your team already runs an ES cluster. Choose Cassandra if write throughput is your primary concern and queries follow predictable patterns (by service, operation, or trace ID). Add Kafka as an intermediate buffer when you exceed 50K spans/sec or need decoupled ingestion for reliability. Use Badger only for development, CI pipelines, or single-node edge deployments.
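The decision framework can be written down as a small function, which is sometimes useful for capacity-planning scripts or design reviews. This is a simplified sketch of the rules stated above, not an official sizing tool: the `Workload` fields and the `recommend_storage` name are hypothetical, and the 50K spans/sec threshold is the figure quoted in this guide.

```python
from dataclasses import dataclass

# Threshold quoted in this guide for adding a Kafka buffer.
KAFKA_THRESHOLD_SPANS_PER_SEC = 50_000


@dataclass
class Workload:
    """Hypothetical description of a tracing workload."""
    spans_per_sec: int
    needs_adhoc_queries: bool               # flexible tag search and filtering
    dev_or_single_node: bool = False        # development, CI, or edge deployment
    needs_decoupled_ingestion: bool = False


def recommend_storage(workload: Workload) -> list[str]:
    """Return the storage components in pipeline order, following the framework."""
    if workload.dev_or_single_node:
        return ["badger"]

    # Flexible querying favors Elasticsearch; otherwise favor write throughput.
    backend = "elasticsearch" if workload.needs_adhoc_queries else "cassandra"
    components = [backend]

    # Kafka sits in front of the backend as a buffer, it does not replace it.
    if (workload.spans_per_sec > KAFKA_THRESHOLD_SPANS_PER_SEC
            or workload.needs_decoupled_ingestion):
        components.insert(0, "kafka")
    return components
```

For example, `recommend_storage(Workload(80_000, needs_adhoc_queries=False))` yields a Kafka buffer in front of Cassandra, while a development workload resolves to Badger alone.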
                        

Sampling Strategies

Sampling determines which traces are captured and stored. Without sampling, high-traffic services generate unsustainable data volumes. Jaeger supports multiple strategies that can be combined and dynamically adjusted.
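A rough back-of-the-envelope calculation shows why unsampled tracing becomes expensive. The sketch below multiplies request rate, spans per trace, span size, and retention; every input is an assumption to replace with your own measurements, and the 500-byte span size in the usage note is purely illustrative, not a Jaeger figure.

```python
SECONDS_PER_DAY = 86_400


def stored_bytes(requests_per_sec: float,
                 spans_per_trace: float,
                 bytes_per_span: float,
                 retention_days: float,
                 sample_rate: float = 1.0) -> float:
    """Estimate raw span bytes held in storage at steady state.

    Ignores index overhead, replication, and compression, all of which
    change the real number considerably.
    """
    spans_per_sec = requests_per_sec * spans_per_trace * sample_rate
    return spans_per_sec * bytes_per_span * SECONDS_PER_DAY * retention_days


def to_terabytes(num_bytes: float) -> float:
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return num_bytes / 1e12
```

With the illustrative inputs of 2,000 requests/sec, 20 spans per trace, 500 bytes per span, and 7 days of retention, `to_terabytes(stored_bytes(2000, 20, 500, 7))` comes to about 12.1 TB; passing `sample_rate=0.1` reduces that to roughly 1.2 TB.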

Strategy	Mechanism	Configuration	Use Case
Const	Always sample (1) or never sample (0)	`param: 1`	Development, low-traffic services
Probabilistic	Random percentage of traces	`param: 0.1` (10%)	Steady baseline coverage
Rate Limiting	Fixed traces per second per service	`param: 2.0` (2/sec)	Cost-controlled budgets
Remote	Polling central config from collector	Served by collector endpoint	Centralized control without redeployment
Adaptive	Auto-adjusts per-operation rates to meet target throughput	`sampling.target-samples-per-second`	Mixed-traffic services, automatic optimization

Sampling Configuration

Configure per-service sampling strategies centrally on the Jaeger Collector. The strategies file is JSON, so the intent of each entry is noted here rather than inline: order-service samples 50% of traces, payment-service is capped at 10 traces/sec, the high-traffic frontend-gateway samples 1%, and unlisted services fall back to a 10% baseline.

Adaptive sampling (Jaeger 1.27+) is not configured in this file. It is enabled on the collector with SAMPLING_CONFIG_TYPE=adaptive and tuned with collector flags such as --sampling.target-samples-per-second (1.0 in this setup) and --sampling.initial-sampling-probability (0.001 in this setup).

jaeger-sampling.json:

{
  "service_strategies": [
    { "service": "order-service", "type": "probabilistic", "param": 0.5 },
    { "service": "payment-service", "type": "ratelimiting", "param": 10.0 },
    { "service": "frontend-gateway", "type": "probabilistic", "param": 0.01 }
  ],
  "default_strategy": { "type": "probabilistic", "param": 0.1 }
}
Pass this configuration to the collector with the --sampling.strategies-file flag. Services poll the collector's sampling endpoint at regular intervals to fetch their assigned strategy without requiring redeployment.
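To make the lookup behavior concrete, here is a small sketch of how a strategy could be resolved for a given service: an exact match on the service name wins, otherwise the default applies. It assumes the JSON layout of Jaeger's strategies file (`service_strategies` plus `default_strategy`) and reuses the values from this section. The `resolve_strategy` function is hypothetical and only mimics what the collector does when it answers a client's poll.

```python
import json

# Same services and rates as the configuration in this section.
STRATEGIES_JSON = """
{
  "service_strategies": [
    {"service": "order-service", "type": "probabilistic", "param": 0.5},
    {"service": "payment-service", "type": "ratelimiting", "param": 10.0},
    {"service": "frontend-gateway", "type": "probabilistic", "param": 0.01}
  ],
  "default_strategy": {"type": "probabilistic", "param": 0.1}
}
"""


def resolve_strategy(config: dict, service: str) -> dict:
    """Return the sampling strategy for a service, falling back to the default."""
    for entry in config.get("service_strategies", []):
        if entry["service"] == service:
            return {"type": entry["type"], "param": entry["param"]}
    return dict(config["default_strategy"])


config = json.loads(STRATEGIES_JSON)
payment = resolve_strategy(config, "payment-service")    # rate limited, 10 traces/sec
unlisted = resolve_strategy(config, "inventory-service")  # not listed, gets the default
```

Here `inventory-service` is a made-up name used to show the fallback path.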

Query & Analysis

The Jaeger UI supports search by service, operation, tags, and duration, along with trace comparison and a dependency view. Knowing these features well shortens mean time to diagnosis for distributed-system issues.

Effective Trace Search Patterns

Service + Operation filtering — narrow results to specific endpoints. Combine with duration filters (minDuration, maxDuration) to find slow requests immediately.

Tag-based search — query by span tags like http.status_code=500, error=true, or custom business tags like user.tier=premium. Elasticsearch backend enables full-text search across tag values.

Trace comparison — select two traces of the same operation (one fast, one slow) and use Jaeger's diff view to identify exactly which spans diverge in duration or structure.

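Dependency graph — the System Architecture tab renders a service dependency DAG derived from trace data, showing which services call each other and how often, without manual configuration. With Cassandra or Elasticsearch storage, the dependency links are typically computed by the separate spark-dependencies batch job.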

Trace-to-Logs Correlation

                            
Trace-to-Logs Correlation: Embed the trace_id and span_id in every log line, for example through your logging framework's MDC (Mapped Diagnostic Context). To link from the Jaeger UI to Grafana Loki or Elasticsearch logs filtered by trace ID, define linkPatterns in the Jaeger UI configuration file (passed to the query service with --query.ui-config); the --query.additional-headers flag only adds HTTP response headers and does not create links. This gives a direct investigation flow: find a slow trace, then click through to the exact log output for that request across all services.
                        

{
  "timestamp": "2026-05-14T10:32:15.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "span789xyz",
  "message": "Payment gateway timeout after 5000ms",
  "http.method": "POST",
  "http.url": "/api/v1/payments/charge",
  "http.status_code": 504
}
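A log line like the one above can be produced with a logging filter that copies the current trace context onto each record. The sketch below uses only the Python standard library and a `contextvars` variable as a stand-in for the tracing SDK's current span; the names `set_trace_context`, `TraceContextFilter`, `JsonFormatter`, and `build_logger` are hypothetical. In a real service the IDs would come from the OpenTelemetry SDK's active span rather than being set by hand.

```python
import contextvars
import json
import logging

# Stand-in for the tracing SDK's "current span" (an assumption for this sketch).
_trace_ctx = contextvars.ContextVar("trace_ctx", default=None)


def set_trace_context(trace_id: str, span_id: str) -> None:
    """Record the IDs of the span handling the current request."""
    _trace_ctx.set({"trace_id": trace_id, "span_id": span_id})


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = _trace_ctx.get() or {}
        record.trace_id = ctx.get("trace_id", "")
        record.span_id = ctx.get("span_id", "")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
            "message": record.getMessage(),
        })


def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes trace-correlated JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    handler.addFilter(TraceContextFilter())
    logger.handlers = [handler]
    return logger
```

Because the context lives in a `ContextVar`, each thread or asyncio task sees its own trace IDs, which is the same property MDC provides in Java logging frameworks.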

Production Checklist


Jaeger Production Deployment Checklist

Enable TLS — configure mTLS between all Jaeger components (agent → collector, collector → storage, query → storage) and automate certificate rotation
Configure adaptive sampling — set target samples per second to control storage costs while ensuring coverage across all operations
Deploy multiple collectors — run 3+ collector replicas behind a load balancer for high availability and horizontal throughput scaling
Set retention policies — configure TTL on storage (Cassandra: the default_time_to_live table option; ES: ILM policies) to auto-expire old traces (7-14 days is typical)
Monitor Jaeger itself — expose collector/query metrics to Prometheus, alert on dropped spans (jaeger_collector_spans_dropped_total), queue saturation, and storage write errors
Add Kafka buffer — for workloads exceeding 50K spans/sec, insert Kafka between collector and storage to absorb bursts and enable replay on storage failures
Configure resource limits — set memory limits on collectors (queue bounded by --collector.queue-size), CPU requests on query service, and storage IOPS reservations
Enable span enrichment — add environment, region, and deployment version tags at the collector level using --collector.tags for consistent cross-service filtering
