Why Distributed Tracing Exists
The Microservices Visibility Gap
In a monolithic application, every function call during a request happens within a single process. If something goes wrong, a stack trace shows you the complete call chain — from the entry point to the error. Debugging is straightforward because all the information lives in one place.
In a microservice architecture, this information is shattered across dozens of processes running on different machines. A single user request might touch:
- An API gateway (authentication, rate limiting)
- A frontend service (request routing)
- A user service (profile lookup)
- A product service (item details)
- A pricing service (dynamic pricing calculation)
- An inventory service (availability check)
- A recommendation service (related items)
- A cache layer (Redis lookups)
- Multiple database queries across multiple databases
When this request takes 5 seconds instead of the usual 200ms, which service is responsible? Without tracing, you are left correlating timestamps across individual service logs — a tedious, error-prone process that scales poorly under incident pressure.
What Distributed Tracing Solves
Specific problems tracing solves:
- Latency diagnosis: Which service or database call is the bottleneck?
- Error attribution: Which downstream service caused the user-facing error?
- Dependency mapping: What does this service actually call? (Often different from what the docs say)
- Concurrency issues: Are downstream calls happening in parallel or sequentially?
- Retry visibility: How many retries occurred and which ones succeeded?
- Cross-team debugging: Team A and Team B can both see the same trace without needing access to each other's logs
Traces, Spans & Hierarchies
Anatomy of a Trace
A trace represents the entire journey of a single request through your system. It is uniquely identified by a trace_id — a 128-bit random identifier generated at the entry point.
A trace is composed of spans. Each span represents a single unit of work — an HTTP call, a database query, a message publish, or any meaningful operation. Spans have parent-child relationships forming a tree (or DAG) that represents the execution hierarchy.
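To make the identifier scheme concrete, here is a minimal standard-library sketch of how spans in one trace relate to each other. The `Span` class and the helper names are hypothetical and not part of any tracing SDK; the 64-bit span ID size is taken from the W3C `parent-id` field described later in this section.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


def new_trace_id() -> str:
    """Return a random 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random 64-bit span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


@dataclass
class Span:
    """Hypothetical, minimal span record used only for illustration."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=new_span_id)
    parent_span_id: Optional[str] = None  # None marks the root span

    def child(self, name: str) -> "Span":
        """Create a child span that shares this span's trace_id."""
        return Span(name=name, trace_id=self.trace_id, parent_span_id=self.span_id)


# The trace_id is generated once, at the entry point, and inherited by every child.
root = Span(name="GET /api/checkout", trace_id=new_trace_id())
cart = root.child("Get Cart Items")
db_read = cart.child("DB Read Items")

for span in (root, cart, db_read):
    print(span.name, span.trace_id, span.span_id, span.parent_span_id)
```

Every span printed above carries the same `trace_id`, while `parent_span_id` links each span to the one that created it. A backend needs exactly these two fields to rebuild the tree.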
gantt
title Distributed Trace – GET /api/checkout
dateFormat X
axisFormat %L ms
section API Gateway
Authentication :0, 20
Route to Cart :20, 25
section Cart Service
Get Cart Items :25, 80
DB Read Items :30, 60
section Pricing Service
Calculate Total :80, 150
Apply Discounts :90, 130
DB Read Prices :95, 120
section Payment Service
Charge Card :150, 350
Call Stripe API :160, 320
Each row in this diagram is a span. The hierarchical relationships are:
- The root span (API Gateway → Authentication) is the parent of the entire trace
- "Get Cart Items" is a child span of "Route to Cart"
- "DB Read Items" is a child span of "Get Cart Items"
- The Payment Service's "Charge Card" → "Call Stripe API" chain shows the deepest nesting
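One way to turn a span tree like this into a latency diagnosis is to compute each span's self time, meaning its own duration minus the time covered by its direct children. The sketch below uses the timings from the diagram (in milliseconds). The parent links inside the Pricing Service are an assumption inferred from the nesting of the timings, and spans whose parent the text does not state are left with `parent=None`. It also assumes that the children of a span do not overlap each other.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None  # parent span name; None when not stated


# Timings copied from the diagram; some parent links are assumptions (see above).
SPANS: List[Span] = [
    Span("Authentication", 0, 20),
    Span("Route to Cart", 20, 25),
    Span("Get Cart Items", 25, 80, parent="Route to Cart"),
    Span("DB Read Items", 30, 60, parent="Get Cart Items"),
    Span("Calculate Total", 80, 150),
    Span("Apply Discounts", 90, 130, parent="Calculate Total"),  # assumed
    Span("DB Read Prices", 95, 120, parent="Apply Discounts"),   # assumed
    Span("Charge Card", 150, 350),
    Span("Call Stripe API", 160, 320, parent="Charge Card"),
]


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Return each span's duration minus the time covered by its direct children."""
    by_name = {s.name: s for s in spans}
    result = {s.name: s.end_ms - s.start_ms for s in spans}
    for s in spans:
        if s.parent is None:
            continue
        p = by_name[s.parent]
        # Only count the part of the child that falls inside the parent's interval.
        overlap = max(0, min(s.end_ms, p.end_ms) - max(s.start_ms, p.start_ms))
        result[p.name] -= overlap
    return result


times = self_times(SPANS)
bottleneck = max(times, key=times.get)
print(f"Largest self time: {bottleneck} ({times[bottleneck]} ms)")
```

With these timings the outbound Stripe call dominates. That is the kind of answer timestamp correlation across separate logs makes hard to reach.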
Span Attributes & Events
Every span carries structured metadata that makes it searchable and meaningful:
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "a1b2c3d4e5f67890",
  "name": "HTTP GET /api/users/123",
  "kind": "CLIENT",
  "start_time": "2026-05-14T14:23:45.123456Z",
  "end_time": "2026-05-14T14:23:45.189456Z",
  "duration_ms": 66,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "https://user-service.internal/api/users/123",
    "http.status_code": 200,
    "http.response_content_length": 1247,
    "net.peer.name": "user-service.internal",
    "net.peer.port": 443,
    "service.name": "cart-service",
    "service.version": "2.4.1",
    "deployment.environment": "production"
  },
  "events": [
    {
      "name": "cache_miss",
      "timestamp": "2026-05-14T14:23:45.125000Z",
      "attributes": { "cache.key": "user:123", "cache.backend": "redis" }
    }
  ]
}
Key span fields:
| Field | Purpose |
|---|---|
| trace_id | Links all spans in a single request together |
| span_id | Unique identifier for this specific span |
| parent_span_id | The parent span (null for root span) |
| name | Human-readable operation name |
| kind | CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL |
| attributes | Key-value metadata (HTTP details, DB info, etc.) |
| events | Timestamped annotations (cache miss, retry, etc.) |
| status | OK, ERROR, or UNSET; whether the operation succeeded |
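The timing fields in the example span are related by simple arithmetic: the duration is the end time minus the start time, and an event's timestamp can be read as an offset from the span start. The following sketch derives both from the ISO 8601 strings shown above; the helper names are illustrative only.

```python
from datetime import datetime, timedelta


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp that ends in 'Z' (UTC)."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def elapsed_ms(start_time: str, end_time: str) -> float:
    """Return the elapsed time between two timestamps in milliseconds."""
    return (parse_utc(end_time) - parse_utc(start_time)) / timedelta(milliseconds=1)


span_start = "2026-05-14T14:23:45.123456Z"
span_end = "2026-05-14T14:23:45.189456Z"
cache_miss_at = "2026-05-14T14:23:45.125000Z"

print("duration_ms:", elapsed_ms(span_start, span_end))
print("cache_miss offset_ms:", elapsed_ms(span_start, cache_miss_at))
```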
Span Status & Error Recording
When a span represents a failed operation, it should be marked with an error status and include exception details as an event:
import stripe  # provides stripe.error.CardError
from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

# stripe_client is assumed to be a Stripe client configured elsewhere in the service.


def charge_card(amount, card_token):
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "USD")
        try:
            result = stripe_client.charges.create(
                amount=amount,
                source=card_token
            )
            span.set_attribute("payment.charge_id", result.id)
            return result
        except stripe.error.CardError as e:
            # Mark the span as failed and attach the exception as a span event
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
Context Propagation
Context propagation is the mechanism by which trace context (trace_id, span_id, sampling decision) is transmitted between services. Without propagation, each service would create isolated traces — and you could not reconstruct the full request journey.
W3C Trace Context Standard
The W3C Trace Context specification defines two HTTP headers for propagating trace context:
# W3C Trace Context headers
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
#            |  trace-id (128-bit hex)           parent-id        flags (01 = sampled)
#            version                             (64-bit hex)
tracestate: vendor1=value1,vendor2=value2
# Optional vendor-specific trace data
The traceparent header format:
| Field | Size | Description |
|---|---|---|
| version | 2 hex | Always "00" for current spec |
| trace-id | 32 hex | Unique identifier for the entire trace |
| parent-id | 16 hex | Span ID of the calling service's current span |
| trace-flags | 2 hex | Sampling decision (01 = sampled, 00 = not sampled) |
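The field layout in the table maps directly onto a small parser. The sketch below is a simplified, standard-library illustration that handles only the four-field layout shown here; production code would normally rely on a tracing SDK's propagator. The `TraceParent` tuple and `parse_traceparent` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Four lowercase-hex fields separated by dashes: version, trace-id, parent-id, flags.
_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    if match["version"] == "ff":  # "ff" is not a valid version
        return None
    # All-zero IDs are invalid according to the specification.
    if match["trace_id"] == "0" * 32 or match["parent_id"] == "0" * 16:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)  # lowest bit is the sampled flag
    return TraceParent(match["version"], match["trace_id"], match["parent_id"], sampled)


print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

Returning `None` for malformed input lets the caller fall back to starting a new trace instead of propagating a corrupt context.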
Propagation Across Different Transports
Context must be propagated across all communication boundaries between services:
| Transport | Propagation Mechanism |
|---|---|
| HTTP/REST | traceparent / tracestate HTTP headers |
| gRPC | gRPC metadata (key-value pairs, typically carrying the same traceparent / tracestate values) |
| Message queues (Kafka, RabbitMQ) | Message headers / properties |
| GraphQL | HTTP headers (same as REST) |
| Database queries | SQL comments (e.g., /*trace_id=abc*/) |
| Background jobs | Job metadata / serialised context |
If any service in the request path fails to propagate context (that is, read the incoming traceparent and pass it to outgoing requests), the trace breaks at that point. Downstream services will create new root traces with no connection to the original request. This is one of the most common reasons traces appear incomplete in production. Ensure every service, proxy, and gateway propagates trace headers.
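What "propagating" means for a single hop can be sketched in a few lines: keep the incoming trace-id and flags, replace the parent-id with the ID of the span this service creates, and pass tracestate through unchanged. This is a hand-rolled illustration with a hypothetical `outgoing_headers` helper. It assumes header names are already lower-cased, and it marks a brand-new root trace as sampled purely for simplicity. In practice an OpenTelemetry propagator does this work.

```python
import secrets
from typing import Dict


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build trace headers for a downstream call, continuing the caller's trace."""
    parts = incoming.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        version, trace_id, _caller_span_id, flags = parts
    else:
        # No usable context arrived: this service becomes the root of a new trace.
        version, trace_id, flags = "00", secrets.token_hex(16), "01"

    # The span created by THIS service becomes the parent of the downstream span.
    current_span_id = secrets.token_hex(8)
    headers = {"traceparent": f"{version}-{trace_id}-{current_span_id}-{flags}"}
    if "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]  # forwarded unchanged
    return headers


incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor1=value1,vendor2=value2",
}
print(outgoing_headers(incoming))  # same trace-id, new parent-id
print(outgoing_headers({}))        # no context: a new, disconnected root trace
```

The second call shows the failure mode described above: a hop that receives no context can only start a fresh trace.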
Baggage — User-Defined Context
Beyond trace/span IDs, you often want to propagate business-relevant data across service boundaries — like user ID, tenant ID, or feature flag variants. OpenTelemetry Baggage provides this:
from opentelemetry import baggage
from opentelemetry.propagate import extract, inject

# Set baggage at the entry point (API gateway)
ctx = baggage.set_baggage("user.id", "usr_12345")
ctx = baggage.set_baggage("tenant.id", "acme-corp", context=ctx)
ctx = baggage.set_baggage("experiment.variant", "checkout-v2", context=ctx)

# Inject into outgoing HTTP headers
headers = {}
inject(headers, context=ctx)
# headers now contains: baggage: user.id=usr_12345,tenant.id=acme-corp,experiment.variant=checkout-v2

# A downstream service extracts the context from the incoming headers, then reads baggage
# (auto-instrumentation usually performs the extract step for you):
incoming_ctx = extract(headers)
user_id = baggage.get_baggage("user.id", context=incoming_ctx)
# Returns: "usr_12345"
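On the wire, baggage is a single comma-separated header of key=value pairs with percent-encoded values. The sketch below shows that encoding with the standard library only. It is a simplified illustration with hypothetical helper names; it ignores size limits and discards optional per-entry properties.

```python
from typing import Dict
from urllib.parse import quote, unquote


def encode_baggage(entries: Dict[str, str]) -> str:
    """Serialise entries into a baggage header value."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> Dict[str, str]:
    """Parse a baggage header value back into a dictionary."""
    entries: Dict[str, str] = {}
    for member in header.split(","):
        key, sep, value = member.partition("=")
        if not sep:
            continue  # skip malformed members that have no "="
        value = value.split(";", 1)[0]  # drop optional properties after ";"
        entries[key.strip()] = unquote(value.strip())
    return entries


header_value = encode_baggage({
    "user.id": "usr_12345",
    "tenant.id": "acme-corp",
    "experiment.variant": "checkout-v2",
})
print(header_value)
print(decode_baggage(header_value)["user.id"])
```

Because baggage travels with every downstream request, it is best kept small and free of sensitive data.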
Sampling Strategies
Tracing every request is expensive — storing and indexing every span across every service generates massive data volumes. Sampling selectively records only a subset of traces while maintaining statistical validity.
| Strategy | How It Works | Best For |
|---|---|---|
| Head-based (probabilistic) | Decision made at trace start; e.g., sample 10% of all traces randomly | High-traffic services where volume is the constraint |
| Tail-based | Decision made after trace completes; keep traces with errors or high latency | Capturing all interesting traces (errors, slow requests) |
| Rate limiting | Keep at most N traces per second per service | Predictable storage cost |
| Always-on for errors | Sample 100% of traces that contain errors, 1% of successful traces | Debugging reliability while controlling cost |
| Priority-based | Important endpoints (checkout, login) sampled at higher rates | Ensuring visibility into critical paths |
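The difference between head-based and tail-based decisions can be sketched as two small functions. The head-based sampler below derives its decision from the trace ID, so every service that sees the same trace makes the same choice without coordination. The tail-based function decides only after all spans are available. Function names, the span dictionary keys, and the default thresholds are all hypothetical; real deployments would normally configure this in the SDK or a collector.

```python
from typing import Dict, List


def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-based decision using the low 64 bits of the trace ID."""
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def tail_keep(spans: List[Dict], trace_id: str,
              slow_ms: float = 1000.0, healthy_ratio: float = 0.01) -> bool:
    """Tail-based decision made once the whole trace has been collected."""
    if any(span["status"] == "ERROR" for span in spans):
        return True  # always keep traces that contain an error
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= slow_ms:
        return True  # always keep slow traces
    return head_sample(trace_id, healthy_ratio)  # sample the healthy remainder


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
healthy = [{"status": "OK", "start_ms": 0, "end_ms": 200}]
failed = [{"status": "OK", "start_ms": 0, "end_ms": 200},
          {"status": "ERROR", "start_ms": 50, "end_ms": 120}]

print("head-based @ 10%:", head_sample(trace_id, 0.10))
print("tail-based, healthy trace:", tail_keep(healthy, trace_id))
print("tail-based, failed trace:", tail_keep(failed, trace_id))
```

The trade-off is operational: the tail-based function needs every span of a trace buffered in one place before it can decide, while the head-based function needs no state at all.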
Tracing Storage at Scale
Consider a system handling 10,000 requests/second where each request generates 15 spans (average microservice depth). Each span is approximately 500 bytes of data:
- 100% sampling: 10,000 × 15 × 500 bytes = 75 MB/second ≈ 6.5 TB/day
- 10% sampling: 7.5 MB/second ≈ 650 GB/day
- 1% sampling: 750 KB/second ≈ 65 GB/day
With tail-based sampling keeping all error traces + 1% healthy traces, you might store ~100 GB/day while never missing an interesting trace. This is the sweet spot for most production systems.
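The estimate is easy to rerun for your own traffic. The sketch below reproduces the calculation using decimal units (1 MB = 10^6 bytes) and 86,400 seconds per day. The function name and parameters are illustrative, and the 500-byte span size is the rough average assumed above, not a measured value.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400


def span_volume(requests_per_s: float, spans_per_request: float,
                bytes_per_span: float, sample_rate: float) -> Tuple[float, float]:
    """Return (bytes per second, bytes per day) of stored span data."""
    bytes_per_s = requests_per_s * spans_per_request * bytes_per_span * sample_rate
    return bytes_per_s, bytes_per_s * SECONDS_PER_DAY


for rate in (1.0, 0.10, 0.01):
    per_s, per_day = span_volume(10_000, 15, 500, rate)
    print(f"{rate:>5.0%} sampling: {per_s / 1e6:g} MB/s, {per_day / 1e9:,.0f} GB/day")
```

Storage backends add indexing and replication overhead on top of the raw span bytes, so treat the result as a lower bound.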
Tracing Backends
Tracing backends receive, store, index, and visualise trace data. The major options:
| Backend | Architecture | Query Language | Best For |
|---|---|---|---|
| Jaeger | Collector → Storage (Cassandra/ES/Badger) → Query → UI | Tag-based search | Self-hosted, Kubernetes-native, CNCF graduated |
| Grafana Tempo | Distributor → Ingester → Object Storage (S3/GCS) | TraceQL | Cost-efficient (object storage), Grafana ecosystem |
| Zipkin | Collector → Storage (ES/Cassandra/MySQL) → API → UI | Simple API queries | Lightweight, easy to start with |
| AWS X-Ray | Managed service (SDK → Daemon → X-Ray Service) | Filter expressions | AWS-native workloads |
| Azure App Insights | Managed service (SDK → Ingestion → Log Analytics) | KQL | Azure-native workloads |
Conclusion & Next Steps
Distributed tracing is the most powerful tool in your observability arsenal for understanding microservice behaviour. Key takeaways from Part 5:
- Traces represent a single request's journey; spans represent individual operations within that journey
- Context propagation via W3C Trace Context headers (traceparent) is what connects spans across service boundaries
- Every service, proxy, and gateway must propagate trace headers; a single gap breaks the trace
- Baggage propagates business context (user ID, tenant ID) across all services
- Sampling controls cost — tail-based sampling captures all errors while sampling healthy traces
- Tempo and Jaeger are the leading open-source tracing backends for different cost/complexity trade-offs