Tool Deep Dive: OpenTelemetry Collector Complete Guide

May 14, 2026 Wasil Zafar 20 min read

The OpenTelemetry Collector is the Swiss Army knife of observability — a vendor-neutral proxy that receives, processes, and exports metrics, logs, and traces. This deep dive covers every aspect of Collector configuration from pipeline design to production deployment patterns.

Table of Contents

  1. Collector Architecture
  2. Receivers
  3. Processors
  4. Exporters
  5. Pipeline Configuration
  6. Deployment Patterns
  7. Performance Tuning
  8. Production Checklist
  9. Related Posts

Collector Architecture

The OpenTelemetry Collector is a vendor-agnostic agent that can receive, process, and export telemetry data. It removes the need for running multiple agents/collectors and provides a unified pipeline for metrics, traces, and logs. The Collector runs as a standalone binary and is configured via a single YAML file that defines receivers, processors, exporters, and the pipelines that connect them.

OpenTelemetry Collector Architecture — Pipeline Data Flow
flowchart LR
    subgraph Sources
        A1[Application<br/>OTLP SDK]
        A2[Prometheus<br/>Targets]
        A3[Log Files]
        A4[Kafka]
    end
    subgraph Receivers
        R1[otlp<br/>gRPC + HTTP]
        R2[prometheus<br/>Scraper]
        R3[filelog<br/>Tailer]
        R4[kafka<br/>Consumer]
    end
    subgraph Processors
        P1[memory_limiter]
        P2[batch]
        P3[attributes]
        P4[filter]
        P5[tail_sampling]
    end
    subgraph Exporters
        E1[otlp<br/>Collector/Backend]
        E2[prometheusremotewrite<br/>Mimir/Thanos]
        E3[loki<br/>Log Storage]
        E4[debug<br/>Stdout]
    end
    A1 --> R1
    A2 --> R2
    A3 --> R3
    A4 --> R4
    R1 --> P1
    R2 --> P1
    R3 --> P1
    R4 --> P1
    P1 --> P2
    P2 --> P3
    P3 --> P4
    P4 --> P5
    P5 --> E1
    P5 --> E2
    P5 --> E3
    P5 --> E4

Pipeline Model

The Collector supports three signal types, each with independent pipelines:

  • Metrics pipeline — receives metric data (counters, gauges, histograms), processes it (aggregation, filtering, relabeling), and exports to metric backends (Prometheus, Mimir, Datadog)
  • Traces pipeline — receives span data from instrumented applications, processes it (sampling, attribute enrichment), and exports to tracing backends (Jaeger, Tempo, Zipkin)
  • Logs pipeline — receives log records from files, syslog, or OTLP, processes them (parsing, severity mapping), and exports to log backends (Loki, Elasticsearch, Splunk)

Each pipeline is defined independently in the service.pipelines section. A single receiver or exporter can participate in multiple pipelines, and multiple pipelines of the same signal type can run concurrently with different processing rules.
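
As a minimal sketch of that wiring, the configuration below feeds a single otlp receiver into three pipelines, two of which handle traces with different rules. The endpoint `tempo.example.internal:4317` and the pipeline name `traces/debug` are placeholders chosen for illustration, not values from a real deployment.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  debug:
    verbosity: basic
  otlp/backend:
    endpoint: tempo.example.internal:4317  # hypothetical backend address
    tls:
      insecure: true                       # plaintext, for local experiments only

service:
  pipelines:
    # Main traces pipeline: batched and sent to the backend
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/backend]
    # Second traces pipeline fed by the same receiver, unprocessed, for inspection
    traces/debug:
      receivers: [otlp]
      exporters: [debug]
    # The same receiver also serves a metrics pipeline
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```

The `type/name` form (`traces/debug`) is what allows more than one pipeline of the same signal type to coexist.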

Receivers

Receivers define how data gets into the Collector. They listen on network ports, scrape endpoints, or tail files. Each receiver is configured once in the top-level receivers: section and then referenced by name in one or more pipelines.

| Receiver | Protocol | Use Case |
| --- | --- | --- |
| otlp | gRPC (4317) + HTTP (4318) | Primary receiver for OTLP-instrumented applications; accepts metrics, traces, and logs |
| prometheus | HTTP scrape | Scrape Prometheus-format metrics from /metrics endpoints using standard scrape configs |
| filelog | File I/O | Tail log files with configurable parsing (regex, JSON, syslog) and multiline support |
| kafka | Kafka consumer | Consume telemetry from Kafka topics; useful for buffered/decoupled architectures |
| hostmetrics | System calls | Collect host-level metrics: CPU, memory, disk, network, filesystem, processes |
| k8s_cluster | Kubernetes API | Collect cluster-level events and metrics: node conditions, pod phases, deployments |

Receiver Configuration

# Receivers section of otel-collector-config.yaml
receivers:
  # OTLP receiver — primary ingestion point
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]

  # Prometheus receiver — scrape Prometheus targets
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          scrape_interval: 15s
          static_configs:
            - targets: ['node-exporter:9100']
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

  # Filelog receiver — tail application logs
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/otc-container/*.log
    start_at: end
    include_file_path: true
    operators:
      - type: router
        routes:
          - output: json-parser
            expr: 'body matches "^\\{"'
      - id: json-parser
        type: json_parser
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'

  # Host metrics receiver — system-level metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory: {}
      disk: {}
      network: {}
      filesystem: {}

Processors

Processors transform, filter, and enrich telemetry data as it flows through the pipeline. They execute in the order defined in the pipeline's processors: list — order matters significantly for correctness and performance.

| Processor | Purpose | Key Configuration |
| --- | --- | --- |
| batch | Reduce export calls by batching telemetry into larger payloads | send_batch_size, timeout |
| memory_limiter | Prevent OOM by applying backpressure when memory exceeds limits | limit_mib, spike_limit_mib, check_interval |
| attributes | Add, remove, or modify attributes on spans, metrics, or logs | actions: [insert, update, delete, hash] |
| filter | Drop unwanted telemetry based on attribute conditions | metrics.exclude, traces.span, logs.exclude |
| tail_sampling | Intelligent trace sampling based on complete trace data | policies, decision_wait, num_traces |
| resource | Add resource attributes (cluster, environment, service info) | attributes: [{key, value, action}] |
| transform | Apply OTTL expressions for complex transformations | trace_statements, metric_statements, log_statements |
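
The transform processor is the only entry in this list without a configuration example elsewhere in this guide, so here is a small sketch of what OTTL statements can look like. The attribute name `alert.candidate` is made up for illustration, and the exact statement syntax can vary between Collector releases, so treat this as a starting point to check against the version you run.

```yaml
processors:
  transform:
    error_mode: ignore   # log statement errors and keep processing
    trace_statements:
      - context: span
        statements:
          # Cap every span attribute value at 4096 characters
          - truncate_all(attributes, 4096)
          # Remove a sensitive header attribute if it is present
          - delete_key(attributes, "http.request.header.authorization")
    log_statements:
      - context: log
        statements:
          # Flag error-level records (hypothetical attribute name)
          - set(attributes["alert.candidate"], true) where severity_number >= SEVERITY_NUMBER_ERROR
```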

Processor Ordering & Configuration

# Processors section — order in pipeline definition matters!
processors:
  # Memory limiter — ALWAYS first in pipeline
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

  # Batch processor — improves compression and reduces network calls
  batch:
    send_batch_size: 8192
    send_batch_max_size: 16384
    timeout: 5s

  # Resource processor — add environment metadata
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-east-1-primary
        action: upsert

  # Attributes processor — enrich/redact span attributes
  attributes:
    actions:
      - key: db.statement
        action: hash
      - key: http.request.header.authorization
        action: delete
      - key: deployment.version
        value: "v2.4.1"
        action: upsert

  # Filter processor — drop noisy/unwanted telemetry
  filter:
    error_mode: ignore
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*_test_.*"
          - "go_gc_.*"
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'

  # Tail sampling — keep errors + slow traces + sample rest
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

Processor ordering is critical. Always place memory_limiter first so it can refuse data before any later processor buffers it. Place batch last, just before the exporters, so batches are built from data that has already been filtered and sampled and compress efficiently. A recommended order: memory_limiter → resource → attributes → filter → tail_sampling → batch.

Exporters

Exporters send processed telemetry to one or more backends. A single pipeline can fan out to multiple exporters, enabling you to send the same data to different systems (e.g., traces to both Jaeger and a long-term archive). Each exporter manages its own retry queue and connection pool.
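
A minimal sketch of such a fan-out, assuming a Jaeger instance that accepts OTLP over gRPC and a local directory for the archive. Both `jaeger-collector.example.internal:4317` and the file path are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.example.internal:4317  # hypothetical address
    tls:
      insecure: true                                  # plaintext, for illustration only
  file/archive:
    path: /var/lib/otelcol/archive/traces.json        # hypothetical path

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Every span that reaches the end of the pipeline goes to both exporters
      exporters: [otlp/jaeger, file/archive]
```

Because each exporter keeps its own queue, a slow or unavailable archive target should not by itself stop delivery to the tracing backend.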

| Exporter | Protocol | Use Case |
| --- | --- | --- |
| otlp | gRPC (use otlphttp for OTLP over HTTP) | Forward to another Collector (gateway pattern) or OTLP-native backends (Tempo, Mimir) |
| prometheusremotewrite | HTTP POST | Write metrics to Prometheus-compatible backends (Prometheus, Mimir, Thanos, Cortex) |
| loki | HTTP POST | Send logs to Grafana Loki with label extraction from resource/log attributes |
| debug | Stdout | Print telemetry to console for development and troubleshooting (replaces deprecated logging) |
| file | File I/O | Write telemetry to disk in JSON format for offline analysis or debugging |

Exporter Configuration

# Exporters section of otel-collector-config.yaml
exporters:
  # OTLP exporter — forward to gateway Collector or backend
  otlp:
    endpoint: otel-gateway.monitoring.svc:4317
    tls:
      insecure: false
      cert_file: /etc/otel/tls/client.crt
      key_file: /etc/otel/tls/client.key
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  # Prometheus Remote Write — metrics to Mimir/Thanos
  prometheusremotewrite:
    endpoint: http://mimir.monitoring.svc:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true
    headers:
      X-Scope-OrgID: "production"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

  # Loki exporter — logs to Grafana Loki
  loki:
    endpoint: http://loki.monitoring.svc:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        service.namespace: "namespace"
        k8s.container.name: "container"
      attributes:
        severity: ""
        level: ""

  # Debug exporter — stdout for development
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

Pipeline Configuration

The service section ties everything together by defining named pipelines. Each pipeline specifies its signal type (metrics, traces, or logs), its receivers, processors (in order), and exporters. This is where the Collector's full power emerges — you can define multiple pipelines per signal type with different processing rules.

Complete Working Configuration

# otel-collector-config.yaml — Production-ready full configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: end
    include_file_path: true
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'

processors:
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

  batch:
    send_batch_size: 8192
    timeout: 5s

  resource:
    attributes:
      - key: cluster
        value: us-east-1-primary
        action: upsert
      - key: environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline-sampling
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/traces:
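    endpoint: tempo.monitoring.svc:4317
    tls:
      # Plaintext gRPC: acceptable only on a trusted in-cluster network.
      # Otherwise set insecure: false and configure certificates.
      insecure: true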

  prometheusremotewrite:
    endpoint: http://mimir.monitoring.svc:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true

  loki:
    endpoint: http://loki.monitoring.svc:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        k8s.namespace.name: "namespace"

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

  pipelines:
    # Metrics pipeline: OTLP + Prometheus → batch → Remote Write
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]

    # Traces pipeline: OTLP → tail sampling → OTLP export
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/traces]

    # Logs pipeline: Filelog → batch → Loki
    logs:
      receivers: [filelog]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]

Named exporters: Use the exporter_type/name syntax (e.g., otlp/traces) to define multiple instances of the same exporter type with different configurations. For example, otlp/traces can point at Tempo while a second OTLP exporter instance, declared under a different name, points at another OTLP-capable backend.

Deployment Patterns

The Collector can be deployed in several patterns depending on your scale, reliability requirements, and processing needs. The three primary patterns — Agent, Gateway, and Sidecar — are often combined in production architectures.

| Pattern | Kubernetes Deployment | Scope | Best For |
| --- | --- | --- | --- |
| Agent | DaemonSet (one per node) | Per-node collection | Host metrics, log collection, low-overhead forwarding to gateway |
| Gateway | Deployment (2-3 replicas, HPA) | Centralized processing | Heavy processing (tail sampling, OTTL transforms), cross-service correlation, fan-out to multiple backends |
| Sidecar | Sidecar container (per pod) | Per-application | Application-specific processing, PII redaction, compliance isolation, multi-tenant environments |

Agent + Gateway Pattern (Recommended)

Agent + Gateway Deployment Pattern
flowchart TD
    subgraph Node1["Node 1"]
        App1[App Pod A] -->|OTLP| Agent1[OTel Agent<br/>DaemonSet]
        App2[App Pod B] -->|OTLP| Agent1
        Logs1["/var/log/pods/"] -->|filelog| Agent1
    end
    subgraph Node2["Node 2"]
        App3[App Pod C] -->|OTLP| Agent2[OTel Agent<br/>DaemonSet]
        App4[App Pod D] -->|OTLP| Agent2
        Logs2["/var/log/pods/"] -->|filelog| Agent2
    end
    subgraph Gateway["Gateway Deployment (2+ replicas)"]
        GW1[OTel Gateway 1<br/>Tail Sampling<br/>OTTL Transforms]
        GW2[OTel Gateway 2<br/>Tail Sampling<br/>OTTL Transforms]
    end
    Agent1 -->|OTLP gRPC| GW1
    Agent1 -->|OTLP gRPC| GW2
    Agent2 -->|OTLP gRPC| GW1
    Agent2 -->|OTLP gRPC| GW2
    GW1 --> Mimir[Mimir<br/>Metrics]
    GW1 --> Tempo[Tempo<br/>Traces]
    GW1 --> Loki[Loki<br/>Logs]
    GW2 --> Mimir
    GW2 --> Tempo
    GW2 --> Loki

When to use each pattern:
  • Agent only — small clusters (<20 nodes), simple forwarding, no tail sampling needed
  • Agent + Gateway — production clusters, tail sampling required, heavy transformations, fan-out to multiple backends
  • Sidecar — multi-tenant platforms requiring isolation, compliance-heavy workloads needing per-service PII redaction, service meshes without sidecars
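
One detail of the Agent + Gateway pattern deserves a sketch. When the gateway tier runs several replicas and performs tail sampling, all spans of a trace need to land on the same replica. A common approach is to have agents export through the contrib distribution's loadbalancing exporter, keyed on trace ID. The headless Service name below is an assumption about how the gateway is exposed, and the memory values are illustrative.

```yaml
# Agent (DaemonSet) side: route spans to gateway replicas by trace ID
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100

exporters:
  loadbalancing:
    routing_key: traceID          # spans sharing a trace ID go to the same backend
    protocol:
      otlp:
        tls:
          insecure: true          # plaintext, for illustration only
    resolver:
      dns:
        # Hypothetical headless Service that resolves to each gateway pod IP
        hostname: otel-gateway-headless.monitoring.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [loadbalancing]
```

A regular ClusterIP Service would hide the individual pods behind one address, which is why this sketch assumes a headless Service for the DNS resolver.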

Performance Tuning

The Collector's performance depends heavily on correct configuration of the batch processor, memory limiter, and exporter queues. Misconfigured settings lead to either OOM kills (memory too loose) or data loss from backpressure (queues too small).

| Component | Setting | Default | Production Recommendation |
| --- | --- | --- | --- |
| batch | send_batch_size | 8192 | 8192–16384 (higher = fewer exports, more memory) |
| batch | timeout | 200ms | 5s (flush even if batch not full) |
| memory_limiter | limit_mib | 0 (disabled) | 80% of container memory limit (e.g., 512 for a 640 MiB limit) |
| memory_limiter | spike_limit_mib | 20% of limit_mib | 25% of limit_mib (e.g., 128 for a 512 limit) |
| memory_limiter | check_interval | 0s | 1–5s (lower = more responsive, slightly more CPU) |
| exporters | sending_queue.queue_size | 1000 | 5000–10000 (buffer during backend outages) |
| exporters | retry_on_failure.max_elapsed_time | 300s | 300–600s (5–10 min retry window) |

Memory Limiter — The Most Important Processor

The memory_limiter processor is non-negotiable in production. Without it, a traffic spike or backend outage causes the Collector to buffer unbounded data in memory until it's OOM-killed — losing all in-flight telemetry. Configure it as the FIRST processor in every pipeline, with limit_mib set to 80% of your container's memory limit. When the limit is hit, the Collector applies backpressure to receivers (refusing new data) rather than crashing.

# Memory limiter tuning for a 1 GiB container
processors:
  memory_limiter:
    # Hard limit: above this, the processor also forces garbage collection
    limit_mib: 800          # 80% of 1024 MiB container
    # Spike headroom: data is refused once usage passes the soft limit,
    # which is limit_mib - spike_limit_mib = 600 MiB here
    spike_limit_mib: 200    # 25% of limit_mib
    # How often to check memory usage
    check_interval: 2s

# Corresponding Kubernetes resource limits
# resources:
#   limits:
#     memory: 1Gi
#   requests:
#     memory: 512Mi
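
If the container limit is likely to change between environments, the same ratios can be expressed as percentages instead of fixed MiB values. This is a sketch using the memory_limiter's percentage-based settings, which are intended for Linux systems with cgroups. The numbers mirror the 80% / 25% guidance above rather than any measured optimum.

```yaml
processors:
  memory_limiter:
    check_interval: 2s
    # Percentages of the total memory available to the process
    limit_percentage: 80
    spike_limit_percentage: 25
```

With this form, resizing the pod's memory limit does not require a matching edit to the Collector configuration.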

Production Checklist

Production Readiness

OpenTelemetry Collector Production Deployment Checklist

  1. Enable memory_limiter as first processor — set limit_mib to 80% of container memory limit in every pipeline to prevent OOM kills and uncontrolled data loss
  2. Configure exporter retry queues — enable sending_queue with adequate queue_size (5000+) to buffer during backend outages without dropping telemetry
  3. Deploy Agent + Gateway pattern — use DaemonSet agents for lightweight collection and Deployment gateways (2+ replicas with HPA) for heavy processing like tail sampling
  4. Monitor the Collector itself — expose service.telemetry.metrics on port 8888 and alert on otelcol_exporter_send_failed_spans, otelcol_processor_refused_spans, and memory usage
  5. Use tail_sampling on Gateway only — tail sampling requires seeing complete traces, which only works on a centralized gateway where all spans for a trace arrive at the same instance
  6. Secure with TLS and authentication — enable TLS on OTLP receivers/exporters, use mTLS between agents and gateways, and add bearer token authentication for external ingestion endpoints
  7. Version-pin the Collector image — use specific tags (e.g., otel/opentelemetry-collector-contrib:0.98.0) not :latest; test upgrades in staging before production rollout
  8. Implement graceful shutdown: set Kubernetes terminationGracePeriodSeconds (e.g., 30) high enough that the Collector, which begins an orderly shutdown on SIGTERM, can flush in-flight data before the pod is killed
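
To make items 1, 4, and 6 concrete, here is a sketch of a gateway-side configuration that requires client certificates on the OTLP receiver and exposes the Collector's own metrics. The certificate paths are placeholders, and the debug exporter stands in for a real backend.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/tls/server.crt    # hypothetical paths
          key_file: /etc/otel/tls/server.key
          client_ca_file: /etc/otel/tls/ca.crt   # require and verify client certs (mTLS)

processors:
  memory_limiter:
    check_interval: 2s
    limit_mib: 800
    spike_limit_mib: 200
  batch:

exporters:
  debug:
    verbosity: basic   # placeholder for a real backend exporter

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888   # scrape this endpoint to monitor the Collector itself
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
```

Agents connecting to this receiver would present their own certificate through the exporter's `tls.cert_file` and `tls.key_file` settings, as in the OTLP exporter example earlier in this guide.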