Jaeger Architecture
Jaeger, originally developed at Uber and now a CNCF graduated project, provides end-to-end distributed tracing for microservices. Its architecture follows a pipeline model where trace data flows from instrumented applications through collection infrastructure into queryable storage.
flowchart LR
    A[Application<br/>OTel SDK] -->|UDP/gRPC| B[Jaeger Agent<br/>Local Daemon]
    B -->|gRPC| C[Jaeger Collector<br/>Validation & Indexing]
    C -->|Write| D[(Storage<br/>Cassandra / ES / Kafka)]
    D -->|Read| E[Jaeger Query<br/>API Service]
    E -->|Serve| F[Jaeger UI<br/>Web Interface]
    C -->|Optional| G[Kafka<br/>Buffer]
    G -->|Consume| H[Jaeger Ingester<br/>Storage Writer]
    H -->|Write| D
Component Responsibilities
Application SDK (OpenTelemetry) — instruments application code, creates spans with context propagation, and exports trace data. Modern deployments use the OpenTelemetry SDK directly rather than legacy Jaeger client libraries.
Jaeger Agent — a lightweight daemon running as a sidecar or host agent that receives spans over UDP, batches them, and forwards to the collector. In OTel-native deployments, the OTel Collector replaces this component.
Jaeger Collector — receives spans from agents or directly from SDKs, validates them, enriches them with metadata, serves sampling strategies to agents and SDKs (including adaptive sampling), and writes to storage.
Storage Backend — persists trace data for querying. Supports Cassandra, Elasticsearch, Kafka (as intermediate buffer), and Badger (embedded). Each backend has distinct performance characteristics.
Jaeger Query — serves the Jaeger UI and exposes a gRPC/REST API for trace retrieval, search, and dependency graph generation.
Jaeger UI — React-based web interface for searching traces, viewing span timelines, comparing traces, and visualizing service dependency graphs.
Deployment Strategies
Jaeger supports multiple deployment topologies depending on scale, reliability requirements, and operational maturity. Choose the simplest model that meets your needs and evolve as traffic grows.
| Strategy | Components | Best For | Considerations |
|---|---|---|---|
| All-in-One | Single binary (agent + collector + query + UI + in-memory storage) | Development, testing, demos | No persistence, not for production |
| Production | Separate agent, collector, query, storage | Production workloads up to ~50K spans/sec | Scale collector horizontally, requires storage cluster |
| Streaming | Collector → Kafka → Ingester → Storage | High-throughput (100K+ spans/sec), decoupled pipeline | Kafka adds buffer and replay capability, higher ops overhead |
| OpenTelemetry-native | OTel Collector replaces Jaeger Agent, exports to Jaeger Collector or directly to storage | Multi-signal pipelines, vendor-agnostic instrumentation | Recommended for new deployments, unified config for metrics/logs/traces |
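As a rough illustration of the table, the choice can be written as a small decision function. This is only a sketch: the function name and the `production` flag are hypothetical, and the 50K spans/sec cut-over is the approximate figure quoted in this article, not a hard limit.

```python
def choose_deployment(spans_per_sec: float, production: bool = True) -> str:
    """Suggest a topology from the table above (thresholds are approximate)."""
    if not production:
        # Single binary with in-memory storage: development, testing, demos.
        return "All-in-One"
    if spans_per_sec > 50_000:
        # Collector -> Kafka -> Ingester -> Storage absorbs bursts.
        return "Streaming"
    # Separate collector, query, and storage; scale collectors horizontally.
    return "Production"


for load in (500, 40_000, 120_000):
    print(f"{load} spans/sec -> {choose_deployment(load)}")
```

The OpenTelemetry-native option is orthogonal to this choice: it changes what sits in front of the collector, not how the storage path scales.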
All-in-One Docker Compose
The quickest way to run Jaeger locally for development and testing:
# docker-compose.yml — Jaeger All-in-One for development
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:1.54
    container_name: jaeger
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=badger
      - BADGER_EPHEMERAL=false
      - BADGER_DIRECTORY_VALUE=/badger/data
      - BADGER_DIRECTORY_KEY=/badger/key
    ports:
      - "6831:6831/udp"   # Jaeger Agent — Thrift compact
      - "6832:6832/udp"   # Jaeger Agent — Thrift binary
      - "4317:4317"       # OTLP gRPC receiver
      - "4318:4318"       # OTLP HTTP receiver
      - "14268:14268"     # Collector HTTP — spans
      - "14269:14269"     # Collector admin/health
      - "16686:16686"     # Jaeger UI
      - "16687:16687"     # Query admin/health
    volumes:
      - jaeger_data:/badger
    restart: unless-stopped
volumes:
  jaeger_data:
    driver: local
Start with docker compose up -d and access the UI at http://localhost:16686. The Badger storage provides local persistence that survives container restarts.
Storage Backends
Storage selection is the most impactful architectural decision for Jaeger. Each backend offers different trade-offs across write throughput, query flexibility, operational complexity, and cost.
| Backend | Write Performance | Query Flexibility | Scale Model | Best For |
|---|---|---|---|---|
| Cassandra | Excellent (LSM-tree optimized) | Limited (predefined indexes only) | Horizontal — add nodes linearly | Write-heavy workloads, predictable query patterns |
| Elasticsearch | Good (bulk indexing) | Excellent (full-text, tag filtering, aggregations) | Horizontal — shard-based | Ad-hoc search, complex filtering, tag-based queries |
| Kafka | Excellent (buffered, backpressure-safe) | N/A (intermediate buffer only) | Horizontal — partition-based | Buffering between collector and storage, replay capability |
| Badger | Good (SSD-optimized) | Basic | Single-node only | Development, small deployments, embedded use cases |
Choosing Your Storage Backend
Sampling Strategies
Sampling determines which traces are captured and stored. Without sampling, high-traffic services generate unsustainable data volumes. Jaeger supports multiple strategies that can be combined and dynamically adjusted.
| Strategy | Mechanism | Configuration | Use Case |
|---|---|---|---|
| Const | Always sample (1) or never sample (0) | param: 1 | Development, low-traffic services |
| Probabilistic | Random percentage of traces | param: 0.1 (10%) | Steady baseline coverage |
| Rate Limiting | Fixed traces per second per service | param: 2.0 (2/sec) | Cost-controlled budgets |
| Remote | Polling central config from collector | Served by collector endpoint | Centralized control without redeployment |
| Adaptive | Auto-adjusts per-operation rates to meet target throughput | sampling.target-samples-per-second | Mixed-traffic services, automatic optimization |
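To make the probabilistic and rate-limiting mechanisms concrete, here is a minimal sketch of each. It is not Jaeger's implementation: the class names, the 63-bit ID mask, and the injectable `clock` parameter are assumptions made for illustration. The probabilistic sampler compares the trace ID against a boundary, and the rate limiter is a plain token bucket.

```python
import random
import time


class ProbabilisticSampler:
    """Keep roughly `rate` of all traces, decided from the trace ID itself."""

    MAX_ID = (1 << 63) - 1

    def __init__(self, rate: float) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.boundary = int(rate * self.MAX_ID)

    def is_sampled(self, trace_id: int) -> bool:
        # Deterministic: the same trace ID always yields the same decision.
        return (trace_id & self.MAX_ID) <= self.boundary


class RateLimitingSampler:
    """Token bucket that admits at most `max_per_second` traces per second."""

    def __init__(self, max_per_second: float, clock=time.monotonic) -> None:
        self.rate = max_per_second
        self.capacity = max(max_per_second, 1.0)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def is_sampled(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


sampler = ProbabilisticSampler(0.1)
kept = sum(sampler.is_sampled(random.getrandbits(63)) for _ in range(10_000))
print(f"probabilistic sampler kept {kept} of 10000 traces")
```

The probabilistic sampler's volume grows with traffic, while the rate limiter caps volume regardless of traffic, which is why the table pairs the latter with cost-controlled budgets.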
Sampling Configuration
Configure sampling strategies centrally on the Jaeger Collector for remote and adaptive sampling:
{
  "service_strategies": [
    { "service": "order-service", "type": "probabilistic", "param": 0.5 },
    { "service": "payment-service", "type": "ratelimiting", "param": 10.0 },
    { "service": "frontend-gateway", "type": "probabilistic", "param": 0.01 }
  ],
  "default_strategy": { "type": "probabilistic", "param": 0.1 }
}
The collector reads the strategies file as JSON, so it cannot carry comments. This example samples 50% of order-service traces, caps payment-service at 10 traces per second, samples 1% of the high-traffic frontend-gateway, and applies a 10% baseline to every unlisted service.
Pass this configuration to the collector with the --sampling.strategies-file flag. Services poll the collector's sampling endpoint at regular intervals to fetch their assigned strategy without requiring redeployment.
Adaptive sampling (Jaeger 1.27+) is not configured in this file. It is enabled on the collector with SAMPLING_CONFIG_TYPE=adaptive and tuned with collector flags such as --sampling.target-samples-per-second=1.0 and --sampling.initial-sampling-probability=0.001, after which the collector adjusts per-operation rates automatically.
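The lookup that decides which strategy a polling service receives can be pictured as a per-service match with a fallback to the default. The sketch below only illustrates those semantics under that assumption; `resolve_strategy` and `inventory-service` are hypothetical names, and the values mirror the example configuration above.

```python
from typing import Any

Strategy = dict[str, Any]


def resolve_strategy(
    service: str,
    service_strategies: list[Strategy],
    default_strategy: Strategy,
) -> Strategy:
    """Return the strategy for `service`, falling back to the default."""
    for entry in service_strategies:
        if entry.get("service") == service:
            return {"type": entry["type"], "param": entry["param"]}
    # Unlisted services inherit the baseline strategy.
    return dict(default_strategy)


SERVICE_STRATEGIES = [
    {"service": "order-service", "type": "probabilistic", "param": 0.5},
    {"service": "payment-service", "type": "ratelimiting", "param": 10.0},
    {"service": "frontend-gateway", "type": "probabilistic", "param": 0.01},
]
DEFAULT_STRATEGY = {"type": "probabilistic", "param": 0.1}

print(resolve_strategy("payment-service", SERVICE_STRATEGIES, DEFAULT_STRATEGY))
print(resolve_strategy("inventory-service", SERVICE_STRATEGIES, DEFAULT_STRATEGY))
```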
Query & Analysis
Jaeger UI supports filtering by service, operation, tag, and duration, and it can compare traces side by side. Knowing these query features shortens mean-time-to-diagnosis for distributed system issues.
Effective Trace Search Patterns
Service + Operation filtering — narrow results to specific endpoints. Combine with duration filters (minDuration, maxDuration) to find slow requests immediately.
Tag-based search — query by span tags like http.status_code=500, error=true, or custom business tags like user.tier=premium. Elasticsearch backend enables full-text search across tag values.
Trace comparison — select two traces of the same operation (one fast, one slow) and use Jaeger's diff view to identify exactly which spans diverge in duration or structure.
Dependency graph — the System Architecture tab auto-generates a service dependency DAG from trace data, showing call frequencies and error rates between services without manual configuration.
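The same filters can be scripted against the HTTP endpoint that the UI itself calls. The sketch below only builds the search URL. It assumes the `/api/traces` path and the parameter names `service`, `operation`, `tags`, `minDuration`, `maxDuration`, and `limit`; this endpoint is internal to the UI rather than a stable public API, so verify the names against your Jaeger version. The helper name is hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def build_trace_search_url(
    base_url: str,
    service: str,
    operation: Optional[str] = None,
    tags: Optional[dict] = None,
    min_duration: Optional[str] = None,
    max_duration: Optional[str] = None,
    limit: int = 20,
) -> str:
    """Build a trace search URL for the Jaeger Query service (assumed API)."""
    params = {"service": service, "limit": str(limit)}
    if operation:
        params["operation"] = operation
    if tags:
        # Tags are sent as one JSON-encoded object of string keys and values.
        params["tags"] = json.dumps(tags, separators=(",", ":"))
    if min_duration:
        params["minDuration"] = min_duration  # for example "500ms" or "1.5s"
    if max_duration:
        params["maxDuration"] = max_duration
    return f"{base_url.rstrip('/')}/api/traces?{urlencode(params)}"


print(
    build_trace_search_url(
        "http://localhost:16686",
        "order-service",
        tags={"error": "true"},
        min_duration="500ms",
    )
)
```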
Trace-to-Logs Correlation
Include trace_id and span_id in every log line using your logging framework's MDC (Mapped Diagnostic Context). Configure linkPatterns in the Jaeger UI configuration file (loaded with --query.ui-config) to link from Jaeger UI directly to Grafana Loki or Elasticsearch logs filtered by trace ID. This creates a direct investigation flow: find a slow trace → click through to see the exact log output for that request across all services.
{
  "timestamp": "2026-05-14T10:32:15.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "span789xyz",
  "message": "Payment gateway timeout after 5000ms",
  "http.method": "POST",
  "http.url": "/api/v1/payments/charge",
  "http.status_code": 504
}
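One way to produce log lines of this shape with Python's standard logging module is sketched below. The context variables stand in for whatever your tracing SDK exposes, so `current_trace_id`, `current_span_id`, and `build_logger` are hypothetical names. A real service would read the IDs from the active span instead of setting them by hand.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical holders for the active trace context (assumption of this sketch).
current_trace_id = contextvars.ContextVar("trace_id", default="")
current_span_id = contextvars.ContextVar("span_id", default="")


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
            "message": record.getMessage(),
        })


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        handler.addFilter(TraceContextFilter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


current_trace_id.set("abc123def456")
current_span_id.set("span789xyz")
build_logger("order-service").error("Payment gateway timeout after %dms", 5000)
```

Attaching the IDs in a filter keeps call sites unchanged: application code logs as usual and every line becomes searchable by trace ID.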
Production Checklist
Jaeger Production Deployment Checklist
- Enable TLS — configure mTLS between all Jaeger components (agent → collector, collector → storage, query → storage) using certificate rotation
- Configure adaptive sampling — set target samples per second to control storage costs while ensuring coverage across all operations
- Deploy multiple collectors — run 3+ collector replicas behind a load balancer for high availability and horizontal throughput scaling
- Set retention policies — configure TTL on storage (Cassandra: default_time_to_live, ES: ILM policies) to auto-expire old traces (7-14 days typical)
- Monitor Jaeger itself — expose collector/query metrics to Prometheus, alert on dropped spans (jaeger_collector_spans_dropped_total), queue saturation, and storage write errors
- Add Kafka buffer — for workloads exceeding 50K spans/sec, insert Kafka between collector and storage to absorb bursts and enable replay on storage failures
- Configure resource limits — set memory limits on collectors (queue bounded by --collector.queue-size), CPU requests on query service, and storage IOPS reservations
- Enable span enrichment — add environment, region, and deployment version tags at the collector level using --collector.tags for consistent cross-service filtering
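For the monitoring item, a quick check outside Prometheus can read the collector's metrics page and total the dropped-span counter. The sketch below assumes the metrics are served in Prometheus text format at http://localhost:14269/metrics (the collector admin port used earlier in this article). The function names are hypothetical and the parser handles only simple sample lines.

```python
from urllib.request import urlopen


def read_counter_total(metrics_text: str, metric_name: str) -> float:
    """Sum every sample of `metric_name` in Prometheus text exposition format."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            value_part = line[line.rindex("}") + 1:]
        else:
            name, _, value_part = line.partition(" ")
        if name != metric_name:
            continue
        fields = value_part.split()
        if fields:
            total += float(fields[0])  # an optional timestamp may follow
    return total


def dropped_spans(metrics_url: str = "http://localhost:14269/metrics") -> float:
    """Fetch the metrics page (assumed URL) and total the dropped spans."""
    with urlopen(metrics_url, timeout=5) as response:
        text = response.read().decode("utf-8")
    return read_counter_total(text, "jaeger_collector_spans_dropped_total")
```

A non-zero and growing total suggests the collector queue is saturated, which is the signal for adding collector replicas or a Kafka buffer.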