
Part 5: Distributed Tracing & Context Propagation

May 14, 2026 Wasil Zafar 19 min read

In microservice architectures, a single user request traverses many services. Distributed tracing gives you the complete picture: every service a request touched, how long each hop took, and exactly where failures occurred. For distributed systems, it is often the most effective debugging tool available.

Table of Contents

  1. Why Distributed Tracing Exists
  2. Traces, Spans & Hierarchies
  3. Context Propagation
  4. Sampling Strategies
  5. Tracing Backends
  6. Conclusion & Next Steps

Why Distributed Tracing Exists

The Microservices Visibility Gap

In a monolithic application, every function call during a request happens within a single process. If something goes wrong, a stack trace shows you the complete call chain — from the entry point to the error. Debugging is straightforward because all the information lives in one place.

In a microservice architecture, this information is shattered across dozens of processes running on different machines. A single user request might touch:

  • An API gateway (authentication, rate limiting)
  • A frontend service (request routing)
  • A user service (profile lookup)
  • A product service (item details)
  • A pricing service (dynamic pricing calculation)
  • An inventory service (availability check)
  • A recommendation service (related items)
  • A cache layer (Redis lookups)
  • Multiple database queries across multiple databases

When this request takes 5 seconds instead of the usual 200ms, which service is responsible? Without tracing, you are left correlating timestamps across individual service logs — a tedious, error-prone process that scales poorly under incident pressure.

What Distributed Tracing Solves

Distributed tracing answers: For any given request — what did it do, where did it go, how long did each step take, and where did it fail? It reconstructs the full execution timeline across all services.

Specific problems tracing solves:

  • Latency diagnosis: Which service or database call is the bottleneck?
  • Error attribution: Which downstream service caused the user-facing error?
  • Dependency mapping: What does this service actually call? (Often different from what the docs say)
  • Concurrency issues: Are downstream calls happening in parallel or sequentially?
  • Retry visibility: How many retries occurred and which ones succeeded?
  • Cross-team debugging: Team A and Team B can both see the same trace without needing access to each other's logs

Traces, Spans & Hierarchies

Anatomy of a Trace

A trace represents the entire journey of a single request through your system. It is uniquely identified by a trace_id — a 128-bit random identifier generated at the entry point.

A trace is composed of spans. Each span represents a single unit of work — an HTTP call, a database query, a message publish, or any meaningful operation. Spans have parent-child relationships forming a tree (or DAG) that represents the execution hierarchy.
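
The trace/span relationship can be sketched in plain Python without any tracing SDK. The `Span` dataclass and the `new_trace_id`, `children_by_parent`, and `render` helpers below are hypothetical names used only for illustration; real SDKs generate and link these IDs for you.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; not the OpenTelemetry SDK class."""
    trace_id: str
    name: str
    parent_span_id: Optional[str] = None
    # 8 random bytes = 64 bits, rendered as 16 hex characters
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))


def new_trace_id() -> str:
    # 16 random bytes = 128 bits, rendered as 32 hex characters
    return secrets.token_hex(16)


def children_by_parent(spans):
    """Group spans by parent_span_id so the tree can be walked top-down."""
    index = {}
    for span in spans:
        index.setdefault(span.parent_span_id, []).append(span)
    return index


def render(spans):
    """Return the span tree as indented lines, root first."""
    index = children_by_parent(spans)
    lines = []

    def walk(parent_id, depth):
        for span in index.get(parent_id, []):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


trace_id = new_trace_id()
root = Span(trace_id, "GET /api/checkout")
cart = Span(trace_id, "Get Cart Items", parent_span_id=root.span_id)
db = Span(trace_id, "DB Read Items", parent_span_id=cart.span_id)
pay = Span(trace_id, "Charge Card", parent_span_id=root.span_id)

tree_lines = render([root, cart, db, pay])
print("\n".join(tree_lines))
```

Every span shares the same trace_id, and the tree is recovered purely from parent_span_id links, which is essentially what a tracing backend does when it assembles spans reported by different services.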

Trace Structure — A Request Through 4 Services
                                gantt
                                    title Distributed Trace – GET /api/checkout
                                    dateFormat X
                                    axisFormat %L ms

                                    section API Gateway
                                    Authentication     :0, 20
                                    Route to Cart      :20, 25

                                    section Cart Service
                                    Get Cart Items     :25, 80
                                    DB Read Items      :30, 60

                                    section Pricing Service
                                    Calculate Total    :80, 150
                                    Apply Discounts    :90, 130
                                    DB Read Prices     :95, 120

                                    section Payment Service
                                    Charge Card        :150, 350
                                    Call Stripe API    :160, 320
                            

Each row in this diagram is a span. The hierarchical relationships are:

  • The root span (API Gateway → Authentication) is the parent of the entire trace
  • "Get Cart Items" is a child span of "Route to Cart"
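  • "DB Read Items" is a child span of "Get Cart Items"
  • The Pricing Service's "Calculate Total" → "Apply Discounts" → "DB Read Prices" chain shows the deepest nesting, while the Payment Service's "Charge Card" → "Call Stripe API" pair accounts for most of the total latency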
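
One way to read a span hierarchy like this is to compute each span's self time: its duration minus the time spent in its direct children. The sketch below uses the start and end offsets from the Pricing and Payment rows of the diagram. The parent links are an assumption inferred from how the intervals nest, and `self_times` is a hypothetical helper, not part of any tracing library.

```python
# (name, parent, start_ms, end_ms) taken from the diagram rows.
# Parent links are inferred from interval nesting (an assumption).
SPANS = [
    ("Calculate Total", None, 80, 150),
    ("Apply Discounts", "Calculate Total", 90, 130),
    ("DB Read Prices", "Apply Discounts", 95, 120),
    ("Charge Card", None, 150, 350),
    ("Call Stripe API", "Charge Card", 160, 320),
]


def self_times(spans):
    """Self time = span duration minus the duration of its direct children.

    Assumes the children of one parent do not overlap each other,
    which holds for the rows above.
    """
    result = {name: end - start for name, _, start, end in spans}
    for _, parent, start, end in spans:
        if parent is not None:
            result[parent] -= end - start
    return result


times = self_times(SPANS)
slowest = max(times, key=times.get)
print(times)
print("largest self time:", slowest)
```

Ranking spans by self time rather than total duration points at the operation that actually consumed the time, instead of at a parent that was merely waiting on its children.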

Span Attributes & Events

Every span carries structured metadata that makes it searchable and meaningful:

{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "a1b2c3d4e5f67890",
  "name": "HTTP GET /api/users/123",
  "kind": "CLIENT",
  "start_time": "2026-05-14T14:23:45.123456Z",
  "end_time": "2026-05-14T14:23:45.189456Z",
  "duration_ms": 66,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "https://user-service.internal/api/users/123",
    "http.status_code": 200,
    "http.response_content_length": 1247,
    "net.peer.name": "user-service.internal",
    "net.peer.port": 443,
    "service.name": "cart-service",
    "service.version": "2.4.1",
    "deployment.environment": "production"
  },
  "events": [
    {
      "name": "cache_miss",
      "timestamp": "2026-05-14T14:23:45.125000Z",
      "attributes": { "cache.key": "user:123", "cache.backend": "redis" }
    }
  ]
}

Key span fields:

| Field | Purpose |
| --- | --- |
| trace_id | Links all spans in a single request together |
| span_id | Unique identifier for this specific span |
| parent_span_id | The parent span (null for root span) |
| name | Human-readable operation name |
| kind | CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL |
| attributes | Key-value metadata (HTTP details, DB info, etc.) |
| events | Timestamped annotations (cache miss, retry, etc.) |
| status | OK, ERROR, or UNSET (whether the operation succeeded) |
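
To show how these fields are typically consumed, here is a small sketch that parses a trimmed copy of the span above and derives its duration, root status, and event offsets using only the standard library. The `parse_ts` and `summarise` helpers are hypothetical names; a real backend performs this kind of derivation at ingest or query time.

```python
import json
from datetime import datetime, timedelta

# Trimmed copy of the example span shown earlier
span_json = """
{
  "name": "HTTP GET /api/users/123",
  "parent_span_id": "a1b2c3d4e5f67890",
  "start_time": "2026-05-14T14:23:45.123456Z",
  "end_time": "2026-05-14T14:23:45.189456Z",
  "status": { "code": "OK" },
  "events": [
    { "name": "cache_miss", "timestamp": "2026-05-14T14:23:45.125000Z" }
  ]
}
"""

ONE_MS = timedelta(milliseconds=1)


def parse_ts(value: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so replace it with an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarise(span: dict) -> dict:
    start = parse_ts(span["start_time"])
    end = parse_ts(span["end_time"])
    return {
        "duration_ms": (end - start) / ONE_MS,
        "is_root": span.get("parent_span_id") is None,
        "failed": span["status"]["code"] == "ERROR",
        # Offset of each event from the start of the span, in milliseconds
        "event_offsets_ms": {
            event["name"]: (parse_ts(event["timestamp"]) - start) / ONE_MS
            for event in span.get("events", [])
        },
    }


summary = summarise(json.loads(span_json))
print(summary)
```

The event offset shows where inside the span the cache miss happened, which is the practical difference between an event (a point in time) and an attribute (a property of the whole span).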

Span Status & Error Recording

When a span represents a failed operation, it should be marked with an error status and include exception details as an event:

import stripe  # assumes stripe.api_key is configured elsewhere
from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

def charge_card(amount, card_token):
    # amount is in the smallest currency unit (cents for USD)
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "USD")

        try:
            result = stripe.Charge.create(
                amount=amount,
                currency="usd",
                source=card_token,
            )
            span.set_attribute("payment.charge_id", result.id)
            return result
        except stripe.error.CardError as e:
            # Mark the span as failed and attach the exception as a span event
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise

Context Propagation

Context propagation is the mechanism by which trace context (trace_id, span_id, sampling decision) is transmitted between services. Without propagation, each service would create isolated traces — and you could not reconstruct the full request journey.

W3C Trace Context Standard

The W3C Trace Context specification defines two HTTP headers for propagating trace context:

# W3C Trace Context headers
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
#            |  trace-id (128-bit hex)           parent-id        flags
#            version                             (64-bit hex)     (01 = sampled)

tracestate: vendor1=value1,vendor2=value2
# Optional vendor-specific trace data

The traceparent header format:

| Field | Size | Description |
| --- | --- | --- |
| version | 2 hex | Always "00" for current spec |
| trace-id | 32 hex | Unique identifier for the entire trace |
| parent-id | 16 hex | Span ID of the calling service's current span |
| trace-flags | 2 hex | Sampling decision (01 = sampled, 00 = not sampled) |
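
The header layout above is simple enough to parse by hand. The following is a minimal sketch for version 00 headers only; `TraceParent`, `parse_traceparent`, and `format_traceparent` are hypothetical names, and production services would normally rely on their tracing SDK's propagator instead.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the spec
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of trace-flags is the sampled flag
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(parsed)
```

Returning None for a malformed header, rather than raising, mirrors the usual handling: a service that receives an unusable traceparent ignores it and starts a fresh trace.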

Propagation Across Different Transports

Context must be propagated across all communication boundaries between services:

| Transport | Propagation Mechanism |
| --- | --- |
| HTTP/REST | traceparent / tracestate HTTP headers |
| gRPC | gRPC metadata (traceparent / tracestate as metadata keys) |
| Message queues (Kafka, RabbitMQ) | Message headers / properties |
| GraphQL | HTTP headers (same as REST) |
| Database queries | SQL comments (e.g., /*trace_id=abc*/) |
| Background jobs | Job metadata / serialised context |

Critical Rule: If any service in the chain fails to propagate context (that is, it does not read the incoming traceparent and pass it on to outgoing requests), the trace breaks at that point. Downstream services will start new root traces with no connection to the original request. This is one of the most common reasons traces appear incomplete in production. Ensure every service, proxy, and gateway propagates trace headers.
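
What "propagating context" means for a single hop can be sketched with the standard library alone. The hypothetical `continue_trace` function below keeps the incoming trace ID and flags, substitutes this service's own span ID as the new parent-id, and forwards tracestate unchanged. In practice the tracing SDK does this for you (in OpenTelemetry Python, via the `extract` and `inject` functions in `opentelemetry.propagate`); the validation here is deliberately minimal.

```python
import secrets


def continue_trace(incoming: dict) -> dict:
    """Build headers for a downstream call made while handling `incoming`."""
    headers = {key.lower(): value for key, value in incoming.items()}
    parts = headers.get("traceparent", "").split("-")

    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        # Same trace, same sampling decision
        trace_id, flags = parts[1], parts[3]
    else:
        # No usable context: this hop becomes a new root and the original
        # trace is broken here. Sampling the new trace is an assumption.
        trace_id, flags = secrets.token_hex(16), "01"

    span_id = secrets.token_hex(8)  # ID of this service's own span
    outgoing = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    if "tracestate" in headers:
        outgoing["tracestate"] = headers["tracestate"]
    return outgoing


incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor1=value1",
}
outgoing = continue_trace(incoming)
broken = continue_trace({})  # what a downstream service sees after a gap
print(outgoing)
print(broken)
```

The second call illustrates the failure mode: with no incoming traceparent, the service has no choice but to mint a new trace ID, so its spans can never be joined to the original request.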

Baggage — User-Defined Context

Beyond trace/span IDs, you often want to propagate business-relevant data across service boundaries — like user ID, tenant ID, or feature flag variants. OpenTelemetry Baggage provides this:

from opentelemetry import baggage
from opentelemetry.propagate import extract, inject

# Set baggage at the entry point (API gateway)
ctx = baggage.set_baggage("user.id", "usr_12345")
ctx = baggage.set_baggage("tenant.id", "acme-corp", context=ctx)
ctx = baggage.set_baggage("experiment.variant", "checkout-v2", context=ctx)

# Inject into outgoing HTTP headers
headers = {}
inject(headers, context=ctx)
# headers now contains: baggage: user.id=usr_12345,tenant.id=acme-corp,experiment.variant=checkout-v2

# A downstream service first extracts the context from the incoming headers
# (auto-instrumentation usually does this step), then reads baggage from it
incoming_ctx = extract(headers)
user_id = baggage.get_baggage("user.id", context=incoming_ctx)
# Returns: "usr_12345"

Baggage vs Span Attributes: Baggage propagates across service boundaries (all downstream services can read it). Span attributes are local to a single span. Use baggage for data needed by multiple services (tenant ID, user ID). Use span attributes for operation-specific context (HTTP method, DB table name). Because baggage travels in request headers on every hop, keep it small and free of sensitive data.
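
On the wire, baggage is a single header holding comma-separated key=value pairs with percent-encoded values. The sketch below shows that encoding with the standard library; `encode_baggage` and `decode_baggage` are hypothetical helpers that ignore size limits and drop optional ";property" metadata, so treat them as an illustration of the format rather than a full implementation.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict) -> str:
    """Serialise entries as a baggage header value."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )


def decode_baggage(header: str) -> dict:
    """Parse a baggage header value back into a dict."""
    entries = {}
    for member in header.split(","):
        # Drop optional ";property" metadata that may follow the value
        pair = member.split(";", 1)[0]
        if "=" not in pair:
            continue
        key, _, value = pair.partition("=")
        entries[key.strip()] = unquote(value.strip())
    return entries


header_value = encode_baggage({
    "user.id": "usr_12345",
    "tenant.id": "acme-corp",
    "tenant.name": "Acme Corp",  # the space gets percent-encoded
})
decoded = decode_baggage(header_value)
print(header_value)
print(decoded)
```

Seeing the raw header makes the cost visible: every entry is repeated on every outgoing request, which is why baggage is best limited to a handful of short values.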

Sampling Strategies

Tracing every request is expensive — storing and indexing every span across every service generates massive data volumes. Sampling selectively records only a subset of traces while maintaining statistical validity.

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Head-based (probabilistic) | Decision made at trace start; e.g., sample 10% of all traces randomly | High-traffic services where volume is the constraint |
| Tail-based | Decision made after trace completes; keep traces with errors or high latency | Capturing all interesting traces (errors, slow requests) |
| Rate limiting | Keep at most N traces per second per service | Predictable storage cost |
| Always-on for errors | Sample 100% of traces that contain errors, 1% of successful traces | Debugging reliability while controlling cost |
| Priority-based | Important endpoints (checkout, login) sampled at higher rates | Ensuring visibility into critical paths |

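
A head-based decision is usually derived from the trace ID itself, so that every service reaches the same verdict without coordination. The sketch below is similar in spirit to ratio-based samplers such as OpenTelemetry's TraceIdRatioBased, but `should_sample`, `sample_for_route`, and the route ratios are hypothetical and exist only to illustrate the head-based and priority-based rows.

```python
import random


def should_sample(trace_id: str, ratio: float) -> bool:
    """Head-based decision computed from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform random number
    value = int(trace_id[-16:], 16)
    return value < int(ratio * (1 << 64))


# Priority-based sampling: hypothetical per-route ratios
ROUTE_RATIOS = {"/checkout": 1.0, "/login": 0.5}
DEFAULT_RATIO = 0.10


def sample_for_route(trace_id: str, route: str) -> bool:
    return should_sample(trace_id, ROUTE_RATIOS.get(route, DEFAULT_RATIO))


# Rough check of the sampling rate on synthetic trace IDs (seeded for repeatability)
rng = random.Random(42)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(20_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

Because the decision is a pure function of the trace ID, a trace is either recorded by every service or by none, which avoids partially sampled traces.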
Cost Calculation

Tracing Storage at Scale

Consider a system handling 10,000 requests/second where each request generates 15 spans (average microservice depth). Each span is approximately 500 bytes of data:

  • 100% sampling: 10,000 × 15 × 500 bytes = 75 MB/second ≈ 6.48 TB/day
  • 10% sampling: 7.5 MB/second ≈ 648 GB/day
  • 1% sampling: 750 KB/second ≈ 64.8 GB/day
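
The same arithmetic can be reproduced for other traffic profiles with a few lines of Python. This sketch assumes decimal units (1 MB = 10^6 bytes) and 86,400 seconds per day; `storage_per_day` is a hypothetical helper, and the 500-byte span size is the rough estimate used above rather than a measured value.

```python
SECONDS_PER_DAY = 86_400


def storage_per_day(requests_per_sec, spans_per_request, bytes_per_span, sample_ratio):
    """Return (bytes per second, bytes per day) of stored span data."""
    bytes_per_sec = (
        requests_per_sec * spans_per_request * bytes_per_span * sample_ratio
    )
    return bytes_per_sec, bytes_per_sec * SECONDS_PER_DAY


results = {}
for ratio in (1.0, 0.10, 0.01):
    per_sec, per_day = storage_per_day(10_000, 15, 500, ratio)
    results[ratio] = (per_sec, per_day)
    print(f"{ratio:>4.0%}: {per_sec / 1e6:.2f} MB/s, {per_day / 1e9:,.1f} GB/day")
```

Plugging in your own request rate and average span count gives a first-order storage estimate before index overhead, replication, and compression are taken into account.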

With tail-based sampling keeping all error traces + 1% healthy traces, you might store ~100 GB/day while never missing an interesting trace. This is the sweet spot for most production systems.
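
A tail-based policy of this kind can be expressed as a function over a completed trace. The sketch below is an illustration under stated assumptions: the span dictionaries, the `keep_trace` name, and the 1,000 ms latency threshold are all hypothetical. In production this logic typically runs in a collector tier that buffers all spans of a trace before deciding, not in application code.

```python
def keep_trace(spans, latency_threshold_ms=1000.0, healthy_ratio=0.01):
    """Decide, after the trace has completed, whether to store it."""
    # Rule 1: keep every trace that contains an error
    if any(span["status"] == "ERROR" for span in spans):
        return True

    # Rule 2: keep slow traces (threshold is an assumed value)
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start >= latency_threshold_ms:
        return True

    # Rule 3: keep a small, deterministic share of healthy traces
    low_bits = int(spans[0]["trace_id"][-16:], 16)
    return low_bits < int(healthy_ratio * (1 << 64))


def make_span(trace_id, status, start_ms, end_ms):
    return {"trace_id": trace_id, "status": status,
            "start_ms": start_ms, "end_ms": end_ms}


unlucky_id = "f" * 32  # low bits too high to pass the 1% healthy sample

error_trace = [make_span(unlucky_id, "OK", 0, 50),
               make_span(unlucky_id, "ERROR", 10, 40)]
slow_trace = [make_span(unlucky_id, "OK", 0, 5000)]
healthy_trace = [make_span(unlucky_id, "OK", 0, 200)]

print(keep_trace(error_trace), keep_trace(slow_trace), keep_trace(healthy_trace))
```

The trade-off is that every span must be held somewhere until the trace finishes, so tail-based sampling moves cost from long-term storage to short-term buffering.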


Tracing Backends

Tracing backends receive, store, index, and visualise trace data. The major options:

| Backend | Architecture | Query Language | Best For |
| --- | --- | --- | --- |
| Jaeger | Collector → Storage (Cassandra/ES/Badger) → Query → UI | Tag-based search | Self-hosted, Kubernetes-native, CNCF graduated |
| Grafana Tempo | Distributor → Ingester → Object Storage (S3/GCS) | TraceQL | Cost-efficient (object storage), Grafana ecosystem |
| Zipkin | Collector → Storage (ES/Cassandra/MySQL) → API → UI | Simple API queries | Lightweight, easy to start with |
| AWS X-Ray | Managed service (SDK → Daemon → X-Ray Service) | Filter expressions | AWS-native workloads |
| Azure App Insights | Managed service (SDK → Ingestion → Log Analytics) | KQL | Azure-native workloads |

Tempo + Grafana: Grafana Tempo keeps costs low by storing traces in cheap object storage (S3, GCS, Azure Blob) without requiring a dedicated index database. Combined with Grafana's trace visualisation, it can provide production-grade tracing at a lower storage cost than Elasticsearch-backed solutions.

Conclusion & Next Steps

Distributed tracing is the most powerful tool in your observability arsenal for understanding microservice behaviour. Key takeaways from Part 5:

  • Traces represent a single request's journey; spans represent individual operations within that journey
  • Context propagation via W3C Trace Context headers (traceparent) is what connects spans across service boundaries
  • Every service, proxy, and gateway must propagate trace headers — one gap breaks the trace
  • Baggage propagates business context (user ID, tenant ID) across all services
  • Sampling controls cost — tail-based sampling captures all errors while sampling healthy traces
  • Tempo and Jaeger are the leading open-source tracing backends for different cost/complexity trade-offs