Part 5: Distributed Tracing & Context Propagation

Why Distributed Tracing Exists

The Microservices Visibility Gap

In a monolithic application, every function call during a request happens within a single process. If something goes wrong, a stack trace shows you the complete call chain — from the entry point to the error. Debugging is straightforward because all the information lives in one place.

In a microservice architecture, this information is shattered across dozens of processes running on different machines. A single user request might touch:

An API gateway (authentication, rate limiting)
A frontend service (request routing)
A user service (profile lookup)
A product service (item details)
A pricing service (dynamic pricing calculation)
An inventory service (availability check)
A recommendation service (related items)
A cache layer (Redis lookups)
Multiple database queries across multiple databases

When this request takes 5 seconds instead of the usual 200ms, which service is responsible? Without tracing, you are left correlating timestamps across individual service logs — a tedious, error-prone process that scales poorly under incident pressure.

What Distributed Tracing Solves

                            
Distributed tracing answers: for any given request, what did it do, where did it go, how long did each step take, and where did it fail? It reconstructs the full execution timeline across all services.
                        

Specific problems tracing solves:

Latency diagnosis: Which service or database call is the bottleneck?
Error attribution: Which downstream service caused the user-facing error?
Dependency mapping: What does this service actually call? (Often different from what the docs say)
Concurrency issues: Are downstream calls happening in parallel or sequentially?
Retry visibility: How many retries occurred and which ones succeeded?
Cross-team debugging: Team A and Team B can both see the same trace without needing access to each other's logs

Traces, Spans & Hierarchies

Anatomy of a Trace

A trace represents the entire journey of a single request through your system. It is uniquely identified by a trace_id — a 128-bit random identifier generated at the entry point.

A trace is composed of spans. Each span represents a single unit of work — an HTTP call, a database query, a message publish, or any meaningful operation. Spans have parent-child relationships forming a tree (or DAG) that represents the execution hierarchy.

Trace Structure — A Request Through 4 Services

                                gantt
                                    title Distributed Trace – GET /api/checkout
                                    dateFormat X
                                    axisFormat %L ms

                                    section API Gateway
                                    Authentication     :0, 20
                                    Route to Cart      :20, 25

                                    section Cart Service
                                    Get Cart Items     :25, 80
                                    DB Read Items      :30, 60

                                    section Pricing Service
                                    Calculate Total    :80, 150
                                    Apply Discounts    :90, 130
                                    DB Read Prices     :95, 120

                                    section Payment Service
                                    Charge Card        :150, 350
                                    Call Stripe API    :160, 320

Each row in this diagram is a span. The hierarchical relationships are:

The root span (API Gateway → Authentication) has no parent; every other span in the trace descends from it
"Get Cart Items" is a child span of "Route to Cart"
"DB Read Items" is a child span of "Get Cart Items"
The Payment Service's "Call Stripe API" is a child span of "Charge Card", while the Pricing Service's "Calculate Total" → "Apply Discounts" → "DB Read Prices" chain shows the deepest nesting
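As a rough sketch, the hierarchy can be modelled with nothing more than a parent ID on each span. The timings below come from the diagram; the span IDs are made up, and the parent links for the Pricing and Payment spans are assumptions, since the diagram does not state them.

```python
from collections import defaultdict

# (span_id, parent_span_id, name, start_ms, end_ms)
# Span IDs are hypothetical; parent links for s5 and s8 are assumed.
SPANS = [
    ("s1", None, "Authentication", 0, 20),
    ("s2", "s1", "Route to Cart", 20, 25),
    ("s3", "s2", "Get Cart Items", 25, 80),
    ("s4", "s3", "DB Read Items", 30, 60),
    ("s5", "s2", "Calculate Total", 80, 150),
    ("s6", "s5", "Apply Discounts", 90, 130),
    ("s7", "s6", "DB Read Prices", 95, 120),
    ("s8", "s2", "Charge Card", 150, 350),
    ("s9", "s8", "Call Stripe API", 160, 320),
]


def build_children(spans):
    """Index spans by parent ID so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)
    return children


def render(spans):
    """Return one indented line per span, depth-first, ordered by start time."""
    children = build_children(spans)
    lines = []

    def walk(parent_id, depth):
        ordered = sorted(children[parent_id], key=lambda s: s[3])
        for span_id, _, name, start, end in ordered:
            lines.append(f"{'  ' * depth}{name} ({end - start} ms)")
            walk(span_id, depth + 1)

    walk(None, 0)
    return lines


def max_depth(spans):
    """Depth of the deepest span, counting the root span as depth 1."""
    children = build_children(spans)

    def depth(span_id):
        return 1 + max((depth(c[0]) for c in children[span_id]), default=0)

    return max(depth(root[0]) for root in children[None])


print("\n".join(render(SPANS)))
print("max depth:", max_depth(SPANS))
```

Tracing UIs do essentially this when they draw a waterfall view: group spans by `trace_id`, link them by `parent_span_id`, and order siblings by start time.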

Span Attributes & Events

Every span carries structured metadata that makes it searchable and meaningful:

{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "a1b2c3d4e5f67890",
  "name": "HTTP GET /api/users/123",
  "kind": "CLIENT",
  "start_time": "2026-05-14T14:23:45.123456Z",
  "end_time": "2026-05-14T14:23:45.189456Z",
  "duration_ms": 66,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "https://user-service.internal/api/users/123",
    "http.status_code": 200,
    "http.response_content_length": 1247,
    "net.peer.name": "user-service.internal",
    "net.peer.port": 443,
    "service.name": "cart-service",
    "service.version": "2.4.1",
    "deployment.environment": "production"
  },
  "events": [
    {
      "name": "cache_miss",
      "timestamp": "2026-05-14T14:23:45.125000Z",
      "attributes": { "cache.key": "user:123", "cache.backend": "redis" }
    }
  ]
}

Key span fields:

Field	Purpose
`trace_id`	Links all spans in a single request together
`span_id`	Unique identifier for this specific span
`parent_span_id`	The parent span (null for root span)
`name`	Human-readable operation name
`kind`	CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL
`attributes`	Key-value metadata (HTTP details, DB info, etc.)
`events`	Timestamped annotations (cache miss, retry, etc.)
`status`	OK, ERROR, or UNSET — whether the operation succeeded
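To make the fields concrete, here is a minimal sketch of a span record in plain Python. The `Span` class and `parse_ts` helper are illustrative names, not part of any tracing library; the values are copied from the JSON example above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    kind: str
    start_time: str
    end_time: str
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        """Duration is derived from the two timestamps, not stored separately."""
        delta = parse_ts(self.end_time) - parse_ts(self.start_time)
        return delta / timedelta(milliseconds=1)

    @property
    def is_root(self) -> bool:
        """A root span is one without a parent."""
        return self.parent_span_id is None


span = Span(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    parent_span_id="a1b2c3d4e5f67890",
    name="HTTP GET /api/users/123",
    kind="CLIENT",
    start_time="2026-05-14T14:23:45.123456Z",
    end_time="2026-05-14T14:23:45.189456Z",
    attributes={"http.method": "GET", "http.status_code": 200},
)
print(span.duration_ms, span.is_root)  # 66.0 False
```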

Span Status & Error Recording

When a span represents a failed operation, it should be marked with an error status and include exception details as an event:

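import stripe
from opentelemetry import trace

# Assumes stripe.api_key has been configured elsewhere at startup
tracer = trace.get_tracer("payment-service")

def charge_card(amount, card_token):
    # amount is in the smallest currency unit (cents for USD)
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "USD")

        try:
            result = stripe.Charge.create(
                amount=amount,
                currency="usd",
                source=card_token
            )
            span.set_attribute("payment.charge_id", result.id)
            return result
        except stripe.error.CardError as e:
            # Mark the span as failed and attach the exception as a span event.
            # start_as_current_span also records uncaught exceptions by default;
            # the explicit calls are shown here to make the mechanism visible.
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise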

Context Propagation

Context propagation is the mechanism by which trace context (trace_id, span_id, sampling decision) is transmitted between services. Without propagation, each service would create isolated traces — and you could not reconstruct the full request journey.

W3C Trace Context Standard

The W3C Trace Context specification defines two HTTP headers for propagating trace context:

# W3C Trace Context headers
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
#            |  |                                |                trace-flags (01 = sampled)
#            |  |                                parent-id (64-bit)
#            |  trace-id (128-bit hex)
#            version

tracestate: vendor1=value1,vendor2=value2
# Optional vendor-specific trace data

The traceparent header format:

Field	Size	Description
`version`	2 hex	Always "00" for current spec
`trace-id`	32 hex	Unique identifier for the entire trace
`parent-id`	16 hex	Span ID of the calling service's current span
`trace-flags`	2 hex	Sampling decision (01 = sampled, 00 = not sampled)
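The field layout in the table can be checked with a small parser. This is a minimal sketch using only the standard library; `parse_traceparent` and `TraceParent` are illustrative names, and the parser only accepts the four-field layout described above.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of trace-flags carries the sampling decision.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.parent_id, ctx.sampled)
```

A receiver that gets `None` back would typically ignore the header and start a new trace rather than fail the request.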

Propagation Across Different Transports

Context must be propagated across all communication boundaries between services:

Transport	Propagation Mechanism
HTTP/REST	`traceparent` / `tracestate` HTTP headers
gRPC	gRPC metadata (`traceparent` / `tracestate` keys; the older OpenCensus convention used a binary `grpc-trace-bin` key)
Message queues (Kafka, RabbitMQ)	Message headers / properties
GraphQL	HTTP headers (same as REST)
Database queries	SQL comments (e.g., `/* trace_id=abc */`)
Background jobs	Job metadata / serialised context

                            
Critical Rule: If any service in the chain fails to propagate context (does not read the incoming traceparent and pass it to outgoing requests), the trace breaks at that point. Downstream services will create new root traces with no connection to the original request. This is one of the most common reasons traces appear incomplete in production. Ensure every service, proxy, and gateway propagates trace headers.
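In practice an instrumentation library does this for you, but the rule itself is small enough to sketch by hand. The function below is a hypothetical helper, not a library API: it assumes header names are already lower-cased and that a service starting a new trace always samples it.

```python
import secrets


def outgoing_headers(incoming: dict) -> dict:
    """Build trace headers for a downstream call from the incoming headers."""
    parts = incoming.get("traceparent", "").split("-")
    if len(parts) == 4:
        # Continue the caller's trace: keep trace-id and flags.
        _version, trace_id, _caller_span_id, flags = parts
        continued = True
    else:
        # No usable context: this service becomes the root of a new trace.
        trace_id, flags, continued = secrets.token_hex(16), "01", False

    # The parent-id sent downstream is this service's own span, not the caller's.
    span_id = secrets.token_hex(8)
    headers = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

    # Vendor state is passed through untouched when the trace is continued.
    if continued and "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]
    return headers


incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor1=value1",
}
print(outgoing_headers(incoming))
```

The trace-id stays constant across every hop while the parent-id changes at each one, which is what lets the backend stitch the spans into a tree.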
                        

Baggage — User-Defined Context

Beyond trace/span IDs, you often want to propagate business-relevant data across service boundaries — like user ID, tenant ID, or feature flag variants. OpenTelemetry Baggage provides this:

from opentelemetry import baggage
from opentelemetry.propagate import extract, inject

# Set baggage at the entry point (API gateway); each call returns a new Context
ctx = baggage.set_baggage("user.id", "usr_12345")
ctx = baggage.set_baggage("tenant.id", "acme-corp", context=ctx)
ctx = baggage.set_baggage("experiment.variant", "checkout-v2", context=ctx)

# Inject into outgoing HTTP headers
headers = {}
inject(headers, context=ctx)
# headers now contains: baggage: user.id=usr_12345,tenant.id=acme-corp,experiment.variant=checkout-v2

# A downstream service first extracts the context from the incoming headers,
# then reads baggage from that context:
incoming_ctx = extract(headers)
user_id = baggage.get_baggage("user.id", incoming_ctx)
# Returns: "usr_12345"

                            
Baggage vs Span Attributes: Baggage propagates across service boundaries (all downstream services can read it). Span attributes are local to a single span. Use baggage for data needed by multiple services (tenant ID, user ID). Use span attributes for operation-specific context (HTTP method, DB table name).
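On the wire, baggage is just a comma-separated `baggage` header of `key=value` members with percent-encoded values. The following standard-library sketch shows that encoding; `encode_baggage` and `decode_baggage` are illustrative helpers, not OpenTelemetry functions, and optional member properties (after `;`) are simply dropped.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialise key/value pairs into a baggage header value."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )


def decode_baggage(header: str) -> dict:
    """Parse a baggage header value back into a dict, skipping malformed members."""
    entries = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0].strip()  # drop optional properties
        if "=" not in key_value:
            continue
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


header = encode_baggage({"user.id": "usr_12345", "tenant.id": "Acme Corp, Inc."})
print(header)
print(decode_baggage(header))
```

Because every entry travels in a header on every downstream call, baggage is best kept small and free of sensitive values.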
                        

Sampling Strategies

Tracing every request is expensive — storing and indexing every span across every service generates massive data volumes. Sampling selectively records only a subset of traces while maintaining statistical validity.

Strategy	How It Works	Best For
Head-based (probabilistic)	Decision made at trace start; e.g., sample 10% of all traces randomly	High-traffic services where volume is the constraint
Tail-based	Decision made after trace completes; keep traces with errors or high latency	Capturing all interesting traces (errors, slow requests)
Rate limiting	Keep at most N traces per second per service	Predictable storage cost
Always-on for errors	Sample 100% of traces that contain errors, 1% of successful traces	Debugging reliability while controlling cost
Priority-based	Important endpoints (checkout, login) sampled at higher rates	Ensuring visibility into critical paths
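A head-based probabilistic decision is usually derived from the trace ID rather than from a fresh random number, so that every service reaches the same verdict for the same trace. The sketch below illustrates that idea under the assumption that trace IDs are uniformly random; `should_sample` is a hypothetical helper, not a library function.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based decision derived from the trace ID, so all services agree."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform random number.
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Keep the trace if that number falls within the first `ratio` of the range.
    return low_64_bits < round(ratio * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for ratio in (0.1, 0.5, 0.7, 1.0):
    print(ratio, should_sample(trace_id, ratio))
```

The result is then carried to downstream services in the `trace-flags` field, so they do not need to repeat the calculation.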

Cost Calculation

Tracing Storage at Scale

Consider a system handling 10,000 requests/second where each request generates 15 spans (average microservice depth). Each span is approximately 500 bytes of data:

100% sampling: 10,000 × 15 × 500 bytes = 75 MB/second ≈ 6.5 TB/day
10% sampling: 7.5 MB/second ≈ 650 GB/day
1% sampling: 750 KB/second ≈ 65 GB/day
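The arithmetic is simple enough to script for your own traffic numbers. This sketch uses decimal units (1 MB = 10^6 bytes) and counts raw span bytes only; index overhead, replication, and compression are not modelled.

```python
SECONDS_PER_DAY = 86_400


def trace_volume(requests_per_sec, spans_per_request, bytes_per_span, sample_percent=100):
    """Return (bytes per second, bytes per day) of stored span data."""
    per_second = requests_per_sec * spans_per_request * bytes_per_span * sample_percent / 100
    return per_second, per_second * SECONDS_PER_DAY


for percent in (100, 10, 1):
    per_second, per_day = trace_volume(10_000, 15, 500, percent)
    print(f"{percent:>3}%: {per_second / 1e6:g} MB/s, {per_day / 1e9:g} GB/day")
```

For the example system this gives 75 MB/second and 6,480 GB/day at 100% sampling, with the 10% and 1% rows scaling down linearly.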

With tail-based sampling that keeps all error traces plus 1% of healthy traces, you might store around 100 GB/day while retaining nearly every interesting trace. This is the sweet spot for most production systems.
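The keep-or-drop rule behind such a policy can be sketched as a single function that runs once all spans of a trace have been collected. The span dictionary shape, the 1,000 ms latency threshold, and the `keep_trace` name are assumptions for illustration.

```python
def keep_trace(spans, latency_threshold_ms=1000, healthy_ratio=0.01):
    """Tail-based decision, made after every span of the trace has arrived."""
    # Rule 1: always keep traces that contain an error.
    if any(span["status"] == "ERROR" for span in spans):
        return True
    # Rule 2: always keep traces slower than the threshold.
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start > latency_threshold_ms:
        return True
    # Rule 3: keep a fixed share of healthy traces, keyed on the trace ID.
    low_64_bits = int(spans[0]["trace_id"], 16) & ((1 << 64) - 1)
    return low_64_bits < round(healthy_ratio * (1 << 64))


TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
healthy = [{"trace_id": TRACE_ID, "status": "OK", "start_ms": 0, "end_ms": 350}]
failed = [{"trace_id": TRACE_ID, "status": "ERROR", "start_ms": 0, "end_ms": 350}]
slow = [{"trace_id": TRACE_ID, "status": "OK", "start_ms": 0, "end_ms": 5000}]
print(keep_trace(healthy), keep_trace(failed), keep_trace(slow))
```

The cost of this approach is that spans must be buffered somewhere until the trace completes, which is why tail-based sampling usually runs in a collector tier rather than inside each service.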


Tracing Backends

Tracing backends receive, store, index, and visualise trace data. The major options:

Backend	Architecture	Query Language	Best For
Jaeger	Collector → Storage (Cassandra/ES/Badger) → Query → UI	Tag-based search	Self-hosted, Kubernetes-native, CNCF graduated
Grafana Tempo	Distributor → Ingester → Object Storage (S3/GCS)	TraceQL	Cost-efficient (object storage), Grafana ecosystem
Zipkin	Collector → Storage (ES/Cassandra/MySQL) → API → UI	Simple API queries	Lightweight, easy to start with
AWS X-Ray	Managed service (SDK → Daemon → X-Ray Service)	Filter expressions	AWS-native workloads
Azure App Insights	Managed service (SDK → Ingestion → Log Analytics)	KQL	Azure-native workloads

                            
Tempo + Grafana: Grafana Tempo has gained adoption quickly because it stores traces in cheap object storage (S3, GCS, Azure Blob) without requiring a dedicated index database. Combined with Grafana's trace visualisation, it provides production-grade tracing at a fraction of the cost of Elasticsearch-backed solutions.
                        

Conclusion & Next Steps

Distributed tracing is the most powerful tool in your observability arsenal for understanding microservice behaviour. Key takeaways from Part 5:

Traces represent a single request's journey; spans represent individual operations within that journey
Context propagation via W3C Trace Context headers (traceparent) is what connects spans across service boundaries
Every service, proxy, and gateway must propagate trace headers — one gap breaks the trace
Baggage propagates business context (user ID, tenant ID) across all services
Sampling controls cost — tail-based sampling captures all errors while sampling healthy traces
Tempo and Jaeger are the leading open-source tracing backends for different cost/complexity trade-offs


