Part 6: OpenTelemetry — The Modern Observability Standard

Why OpenTelemetry?

The Fragmentation Problem

Before OpenTelemetry, if you wanted to instrument your application you had to choose between competing, incompatible solutions:

Prometheus client libraries for metrics
Jaeger client libraries for tracing (based on OpenTracing)
Vendor-specific SDKs (Datadog agent, New Relic agent, Dynatrace OneAgent)
OpenCensus (Google's instrumentation library)
Zipkin libraries for Zipkin-native tracing

Each had its own API, its own data format, and its own export targets. Switching from Jaeger to Datadog meant ripping out one SDK and replacing it with another — across every service. Worse, each SDK only covered one or two signal types (tracing but not metrics, metrics but not logs).

                            
Vendor Lock-In: Before OTel, instrumentation code was deeply coupled to your observability vendor. Switching vendors required touching every service, modifying instrumentation code, and redeploying everything. OpenTelemetry eliminates this by separating instrumentation (how you generate data) from export (where you send data).

The OpenTelemetry Promise

OpenTelemetry (OTel) merges OpenTracing and OpenCensus into a single, vendor-neutral standard. Its promise:

                            
Instrument once, export anywhere. Write your instrumentation code using the OTel API. Configure the OTel SDK to export to Prometheus, Jaeger, Tempo, Datadog, New Relic, or any OTLP-compatible backend. Switch backends by changing configuration, not code.

OTel is a CNCF incubating project (the 2nd most active CNCF project after Kubernetes) with SDKs for 11+ languages and broad vendor support.

OTel Architecture

Three Signals — Unified

OTel provides a unified framework for all three telemetry signals:

Signal	OTel API	Data Model	Maturity
Traces	`TracerProvider`, `Tracer`, `Span`	Spans with attributes, events, links	Stable
Metrics	`MeterProvider`, `Meter`, `Counter`, `Histogram`	Counters, gauges, histograms	Stable
Logs	`LoggerProvider`, `Logger`	Log records with trace context	Stable (specification); SDK maturity varies by language

The key innovation: all three signals share the same context propagation system. The trace ID and span ID of the active span are available to the logging pipeline, so once a log bridge or logging instrumentation is enabled, log entries carry trace context without changes at each log call.

Core Components

OpenTelemetry Architecture

                                flowchart TD
                                    subgraph Application
                                        A[OTel API\nVendor-neutral interfaces] --> B[OTel SDK\nConfiguration + Processing]
                                        C[Auto-Instrumentation\nLibrary hooks] --> A
                                    end
                                    B -->|OTLP| D[OTel Collector\nReceive → Process → Export]
                                    D -->|Prometheus remote_write| E[Prometheus / Mimir]
                                    D -->|OTLP| F[Tempo / Jaeger]
                                    D -->|OTLP| G[Loki]
                                    D -->|Vendor API| H[Datadog / New Relic / Splunk]

Component	Role	Where It Runs
OTel API	Vendor-neutral interfaces for creating spans, metrics, logs	Application code
OTel SDK	Implementation of the API; configures exporters, processors, samplers	Application runtime
Auto-Instrumentation	Automatically instruments common libraries (HTTP, DB, gRPC) without code changes	Application runtime
OTLP	OpenTelemetry Protocol — the wire format for transmitting telemetry data	Network (gRPC or HTTP)
OTel Collector	Receives, processes (filter, enrich, sample), and exports telemetry to backends	Sidecar, DaemonSet, or standalone gateway

OTLP — The Universal Wire Protocol

OTLP (OpenTelemetry Protocol) is a general-purpose telemetry data delivery protocol. It supports gRPC and HTTP/protobuf transports. OTLP is now the recommended protocol for transmitting telemetry from applications to backends.

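# OTLP endpoints (the two OTEL_EXPORTER_OTLP_ENDPOINT settings are alternatives; set one)
# gRPC (default port 4317):
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# HTTP/protobuf (default port 4318):
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

# Signal-specific endpoints (gRPC shown; they override the generic endpoint.
# Over HTTP, give the full path, e.g. http://otel-collector:4318/v1/traces):
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://otel-collector:4317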
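
For the HTTP transport, the generic and signal-specific variables are resolved differently: the generic endpoint is treated as a base URL and gets a per-signal path appended, while a signal-specific endpoint is used as written. The sketch below models that rule with the standard library only. `resolve_http_endpoint` is a hypothetical helper for illustration, not an SDK function, and the `log-gateway` host is an invented example value.

```python
from urllib.parse import urlsplit, urlunsplit

# Default per-signal paths used by OTLP over HTTP.
SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def resolve_http_endpoint(signal: str, env: dict) -> str:
    """Return the URL an OTLP/HTTP exporter would post to for one signal."""
    specific = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        # Signal-specific values are used exactly as written.
        return specific
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
    parts = urlsplit(base)
    # The generic endpoint is a base URL: append the per-signal path.
    path = parts.path.rstrip("/") + "/" + SIGNAL_PATHS[signal]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4318",
    "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT": "http://log-gateway:4318/v1/logs",
}
traces_url = resolve_http_endpoint("traces", env)
logs_url = resolve_http_endpoint("logs", env)
print(traces_url)
print(logs_url)
```

This is why a signal-specific HTTP endpoint without `/v1/traces` (or the matching path) tends to fail: nothing is appended on the exporter's behalf.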

SDK & Manual Instrumentation

Python Setup — Complete Working Example

# Install OTel packages:
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource

# 1. Define service identity
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "2.4.1",
    "deployment.environment": "production"
})

# 2. Configure tracing
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(tracer_provider)

# 3. Configure metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317"),
    export_interval_millis=10000  # Export every 10 seconds
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# 4. Get tracer and meter
tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")

print("OpenTelemetry configured successfully")

Creating Custom Spans

from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def process_order(order_id, items):
    # Create a span for the entire order processing
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.item_count", len(items))

        # Nested span for validation
        with tracer.start_as_current_span("validate_order") as validate_span:
            validate_span.set_attribute("validation.rules_checked", 5)
            is_valid = validate_items(items)
            validate_span.set_attribute("validation.passed", is_valid)

        # Nested span for payment
        with tracer.start_as_current_span("charge_payment") as payment_span:
            payment_span.set_attribute("payment.method", "credit_card")
            total = sum(item["price"] for item in items)
            payment_span.set_attribute("payment.amount_usd", total)

            try:
                charge_result = process_payment(total)
                payment_span.set_attribute("payment.status", "success")
            except Exception as e:
                payment_span.set_status(
                    trace.Status(trace.StatusCode.ERROR, str(e))
                )
                payment_span.record_exception(e)
                raise

        span.add_event("order_completed", {
            "order.id": order_id,
            "order.total": total
        })
        return {"status": "completed", "order_id": order_id}

# Placeholder functions for the example
def validate_items(items):
    return True

def process_payment(amount):
    return {"charged": amount}

# Example usage
result = process_order("ORD-123", [{"name": "Widget", "price": 29.99}])
print(result)

Creating Custom Metrics

from opentelemetry import metrics

meter = metrics.get_meter("order-service")

# Counter — tracks cumulative totals
orders_counter = meter.create_counter(
    name="orders_total",
    description="Total number of orders processed",
    unit="1"
)

# Histogram — tracks distributions (latency, sizes)
order_duration = meter.create_histogram(
    name="order_processing_duration_ms",
    description="Time to process an order in milliseconds",
    unit="ms"
)

# Up-Down Counter — tracks values that go up and down
active_orders = meter.create_up_down_counter(
    name="active_orders",
    description="Number of orders currently being processed",
    unit="1"
)

# Usage in application code
import time

def process_order_with_metrics(order_id, items):
    attrs = {"order.type": "standard"}
    active_orders.add(1, attrs)
    start = time.perf_counter()  # monotonic clock: safe for measuring durations

    try:
        # ... process order ...
        result = {"status": "completed", "order_id": order_id}
        orders_counter.add(1, {**attrs, "order.status": "success"})
        return result
    except Exception:
        orders_counter.add(1, {**attrs, "order.status": "failed"})
        raise
    finally:
        # Runs on both success and failure, so every order is timed
        # and the in-flight count always returns to its previous value.
        duration_ms = (time.perf_counter() - start) * 1000
        order_duration.record(duration_ms, attrs)
        active_orders.add(-1, attrs)

result = process_order_with_metrics("ORD-456", [{"name": "Gadget", "price": 49.99}])
print(result)

Auto-Instrumentation — Zero Code Changes

OTel auto-instrumentation automatically hooks into popular libraries (HTTP clients, database drivers, web frameworks) and generates traces and metrics without you writing any instrumentation code.

# Python: Install auto-instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run your app with auto-instrumentation:
opentelemetry-instrument \
  --service_name order-service \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py

# This automatically instruments: Flask, Django, FastAPI, requests,
# urllib3, psycopg2, pymongo, redis, grpcio, and 40+ more libraries

                            
Auto + Manual: Auto-instrumentation and manual instrumentation are complementary. Auto-instrumentation covers framework-level operations (HTTP requests, DB queries). Manual instrumentation covers business logic (order processing, payment flow). Use both together for complete visibility.

Auto-Instrumentation Coverage

What Gets Instrumented Automatically

With auto-instrumentation enabled, OTel automatically creates spans for:

Inbound HTTP requests: Every request to your Flask/Django/FastAPI app → server span with HTTP method, status, route
Outbound HTTP requests: Every call via requests/urllib3/httpx → client span with target URL, status
Database queries: Every query via psycopg2/pymongo/mysql-connector → span with SQL statement, DB name
Redis operations: Every GET/SET/DEL → span with Redis command
gRPC calls: Every inbound/outbound gRPC call → span with service/method
Message queue operations: Kafka produce/consume, RabbitMQ publish/consume

All of this happens without writing a single line of instrumentation code. You just add the auto-instrumentation agent and configure the exporter.
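
Conceptually, this works because the instrumentation wraps library entry points at startup. The toy sketch below shows the general wrapping idea using only the standard library. It is not how the OTel instrumentation packages are implemented in detail: `http_client`, `instrument`, and `recorded_spans` are all invented for illustration.

```python
import functools
import time
import types

# Stand-in for a third-party client library the application already uses.
http_client = types.SimpleNamespace(get=lambda url: {"url": url, "status": 200})

recorded_spans = []  # where this toy "SDK" keeps finished spans


def instrument(module, func_name: str, span_name: str) -> None:
    """Replace module.func_name with a wrapper that records a span per call."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            recorded_spans.append({
                "name": span_name,
                "args": args,
                "duration_s": time.perf_counter() - start,
            })

    setattr(module, func_name, wrapper)


# Application code is unchanged; only the library entry point is patched.
instrument(http_client, "get", "HTTP GET")
response = http_client.get("http://inventory/items")
print(response["status"], recorded_spans[0]["name"])
```

The call site still reads `http_client.get(...)`, which is the sense in which no application code changes.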


The OTel Collector

Collector Architecture

The OTel Collector is a vendor-agnostic telemetry pipeline that receives, processes, and exports telemetry data. It sits between your applications and your backends, providing a centralised point for transformation, filtering, and routing.

OTel Collector Pipeline

                                flowchart LR
                                    subgraph Receivers
                                        A[OTLP\nPort 4317/4318]
                                        B[Prometheus\nScrape targets]
                                        C[Jaeger\nPort 14250]
                                    end
                                    subgraph Processors
                                        D[Batch\nGroup for efficiency]
                                        E[Filter\nDrop unwanted data]
                                        F[Attributes\nEnrich metadata]
                                        G[Tail Sampling\nKeep interesting traces]
                                    end
                                    subgraph Exporters
                                        H[OTLP → Tempo]
                                        I[Prometheus\nremote_write → Mimir]
                                        J[Loki → Logs]
                                    end
                                    A & B & C --> E --> F --> G --> D --> H & I & J
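
The receive, process, export flow can be modelled as a chain of plain functions. The sketch below is a conceptual model only (the Collector itself is a Go binary driven by YAML): spans are plain dicts, and `drop_health_checks`, `add_environment`, `batches`, and `run_pipeline` are hypothetical names chosen for this example.

```python
from typing import Callable, Iterable, List

Span = dict  # a span is modelled as a plain dict in this sketch
Processor = Callable[[List[Span]], List[Span]]


def drop_health_checks(spans: List[Span]) -> List[Span]:
    """Filter processor: discard spans for probe endpoints."""
    return [s for s in spans if s.get("http.route") not in {"/health", "/readyz"}]


def add_environment(spans: List[Span]) -> List[Span]:
    """Attributes processor: enrich every span with deployment metadata."""
    return [{**s, "deployment.environment": "production"} for s in spans]


def batches(spans: List[Span], size: int) -> Iterable[List[Span]]:
    """Batch processor: group spans so each export call carries several."""
    for i in range(0, len(spans), size):
        yield spans[i:i + size]


def run_pipeline(received: List[Span], processors: List[Processor], batch_size: int):
    for process in processors:
        received = process(received)
    return list(batches(received, batch_size))


received = [
    {"name": "GET /orders", "http.route": "/orders"},
    {"name": "GET /health", "http.route": "/health"},
    {"name": "POST /orders", "http.route": "/orders"},
    {"name": "GET /readyz", "http.route": "/readyz"},
    {"name": "GET /items", "http.route": "/items"},
]
exported = run_pipeline(received, [drop_health_checks, add_environment], batch_size=2)
batch_names = [[s["name"] for s in batch] for batch in exported]
print(batch_names)
```

Filtering before batching means dropped spans never occupy space in an export batch, which is the usual reason batching is placed last.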

Collector Configuration

# otel-collector-config.yaml: a production-oriented starting point (enable TLS and tune limits for your environment)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Scrape Prometheus metrics from applications
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:8888']  # Collector's own metrics

processors:
  # Batch telemetry for efficient export
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048
    timeout: 5s

  # Add resource attributes to all telemetry
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: k8s.cluster.name
        value: prod-us-east-1
        action: upsert

  # Filter out noisy spans (e.g., health checks)
  filter/traces:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/readyz"'

  # Memory limiter to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256

exporters:
  # Export traces to Tempo
  otlp/tempo:
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true

  # Export metrics to Prometheus/Mimir
  prometheusremotewrite:
    endpoint: http://mimir.monitoring.svc.cluster.local:9009/api/v1/push

  # Export logs to Loki over OTLP (assumes Loki 3.x with its native OTLP
  # endpoint; the dedicated `loki` exporter is deprecated in recent
  # Collector contrib releases)
  otlphttp/loki:
    endpoint: http://loki.monitoring.svc.cluster.local:3100/otlp

  # Debug exporter for troubleshooting
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/traces, resource, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/loki]

  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
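
A common mistake in configurations like this is naming a component in a pipeline without defining it in the matching top-level section. The sketch below checks for that on a hand-written dict that mirrors a trimmed version of the YAML. `undefined_components` is a hypothetical helper, and the `otlphttp/logs` entry is deliberately left undefined to show a failure.

```python
def undefined_components(config: dict) -> list:
    """List (pipeline, section, name) for pipeline entries with no definition."""
    missing = []
    for pipeline, spec in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section, {})
            for name in spec.get(section, []):
                if name not in defined:
                    missing.append((pipeline, section, name))
    return missing


# Hand-written dict mirroring a trimmed version of the YAML; values omitted.
config = {
    "receivers": {"otlp": {}},
    "processors": {"memory_limiter": {}, "batch": {}},
    "exporters": {"otlp/tempo": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlp/tempo"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlphttp/logs"],  # deliberately not defined above
            },
        }
    },
}

problems = undefined_components(config)
print(problems)
```

A check along these lines fits naturally in CI, so a broken pipeline reference is caught before the Collector is rolled out.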

Production Deployment Patterns

Pattern 1: Agent + Gateway

Run a lightweight OTel Collector as a DaemonSet (one per node) that forwards to a central Gateway Collector for processing and export. This reduces per-application configuration and provides a single choke point for sampling decisions.

Pattern 2: Sidecar

Run an OTel Collector as a sidecar container in each pod. This provides isolation between tenants in multi-tenant systems and allows per-service export configuration. Higher resource cost than DaemonSet.

Pattern 3: Direct Export

Applications export directly to backends (no Collector). Simpler architecture but loses the benefits of centralised processing, filtering, and sampling. Only suitable for small deployments or development environments.

                            
Recommended: Agent + Gateway is the standard production pattern. DaemonSet agents handle local collection and buffering. Gateway handles global processing (tail-based sampling, enrichment) and export. This gives you resilience (agents buffer during backend outages) and flexibility (change backends at the Gateway without touching agents).
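
Tail-based sampling is a good illustration of why a central Gateway helps: the keep-or-drop decision is made per trace, after its spans have arrived, so all spans of a trace need to reach the same place. The sketch below is a simplified model under stated assumptions: spans are plain dicts, the trace is assumed complete, and `tail_sample` and the 500 ms threshold are hypothetical choices for this example.

```python
from collections import defaultdict


def tail_sample(spans, latency_threshold_ms: float):
    """Keep whole traces that contain an error or a slow span; drop the rest."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    kept = []
    for trace_id, trace_spans in by_trace.items():
        has_error = any(s["status"] == "ERROR" for s in trace_spans)
        is_slow = any(s["duration_ms"] > latency_threshold_ms for s in trace_spans)
        if has_error or is_slow:
            kept.extend(trace_spans)  # the decision applies to the whole trace
    return kept


spans = [
    {"trace_id": "t1", "name": "GET /orders", "status": "OK", "duration_ms": 12},
    {"trace_id": "t1", "name": "SELECT orders", "status": "OK", "duration_ms": 4},
    {"trace_id": "t2", "name": "POST /orders", "status": "OK", "duration_ms": 30},
    {"trace_id": "t2", "name": "charge_payment", "status": "ERROR", "duration_ms": 25},
    {"trace_id": "t3", "name": "GET /report", "status": "OK", "duration_ms": 950},
]
kept = tail_sample(spans, latency_threshold_ms=500)
kept_traces = sorted({s["trace_id"] for s in kept})
print(kept_traces)
```

If the two spans of `t2` landed on different Collectors, neither could see the full picture, which is why this step belongs at the Gateway rather than on per-node agents.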
                        

Conclusion & Next Steps

OpenTelemetry is the future of observability instrumentation. Key takeaways from Part 6:

Instrument once, export anywhere: OTel decouples instrumentation from backends — switch observability vendors by changing config, not code
Three unified signals: Traces, metrics, and logs share context propagation, so log entries can carry trace IDs once logging is bridged to OTel
Auto-instrumentation covers common libraries (40+ in Python) with zero code changes; combine with manual instrumentation for business logic
OTLP is the universal wire protocol — every major backend now supports it
The OTel Collector is a vendor-agnostic pipeline for receiving, processing, and exporting telemetry
Agent + Gateway is the recommended production deployment pattern
