Tool Deep Dive: OpenTelemetry Collector Complete Guide

Collector Architecture

The OpenTelemetry Collector is a vendor-agnostic agent that can receive, process, and export telemetry data. It removes the need for running multiple agents/collectors and provides a unified pipeline for metrics, traces, and logs. The Collector runs as a standalone binary and is configured via a single YAML file that defines receivers, processors, exporters, and the pipelines that connect them.

OpenTelemetry Collector Architecture — Pipeline Data Flow

flowchart LR
    subgraph Sources
        A1[Application<br/>OTLP SDK]
        A2[Prometheus<br/>Targets]
        A3[Log Files]
        A4[Kafka]
    end

    subgraph Receivers
        R1[otlp<br/>gRPC + HTTP]
        R2[prometheus<br/>Scraper]
        R3[filelog<br/>Tailer]
        R4[kafka<br/>Consumer]
    end

    subgraph Processors
        P1[memory_limiter]
        P2[batch]
        P3[attributes]
        P4[filter]
        P5[tail_sampling]
    end

    subgraph Exporters
        E1[otlp<br/>Collector/Backend]
        E2[prometheusremotewrite<br/>Mimir/Thanos]
        E3[loki<br/>Log Storage]
        E4[debug<br/>Stdout]
    end

    A1 --> R1
    A2 --> R2
    A3 --> R3
    A4 --> R4

    R1 --> P1
    R2 --> P1
    R3 --> P1
    R4 --> P1

    P1 --> P2
    P2 --> P3
    P3 --> P4
    P4 --> P5

    P5 --> E1
    P5 --> E2
    P5 --> E3
    P5 --> E4

Pipeline Model

The Collector supports three signal types, each with independent pipelines:

Metrics pipeline — receives metric data (counters, gauges, histograms), processes it (aggregation, filtering, relabeling), and exports to metric backends (Prometheus, Mimir, Datadog)
Traces pipeline — receives span data from instrumented applications, processes it (sampling, attribute enrichment), and exports to tracing backends (Jaeger, Tempo, Zipkin)
Logs pipeline — receives log records from files, syslog, or OTLP, processes them (parsing, severity mapping), and exports to log backends (Loki, Elasticsearch, Splunk)

Each pipeline is defined independently in the service.pipelines section. A single receiver or exporter can participate in multiple pipelines, and multiple pipelines of the same signal type can run concurrently with different processing rules.
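
To illustrate the last point, here is a minimal sketch of two traces pipelines that share one receiver but apply different processing. The pipeline names traces/sampled and traces/debug are hypothetical, and the fragment assumes that the otlp receiver, the memory_limiter, tail_sampling, and batch processors, and the otlp and debug exporters are all defined elsewhere in the same file.

```yaml
service:
  pipelines:
    # Sampled traces go to the tracing backend
    traces/sampled:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]

    # Same receiver, no sampling, printed locally for troubleshooting
    traces/debug:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
```

Both pipelines receive every span from the shared receiver, so the debug pipeline sees unsampled data while the backend only sees what tail sampling keeps.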

Receivers

Receivers define how data gets into the Collector. They listen on network ports, scrape endpoints, or tail files. Each receiver is configured once in the top-level receivers: section and then referenced by name in one or more pipelines.

Receiver	Protocol	Use Case
`otlp`	gRPC (4317) + HTTP (4318)	Primary receiver for OTLP-instrumented applications — accepts metrics, traces, and logs
`prometheus`	HTTP scrape	Scrape Prometheus-format metrics from /metrics endpoints using standard scrape configs
`filelog`	File I/O	Tail log files with configurable parsing (regex, JSON, syslog) and multiline support
`kafka`	Kafka consumer	Consume telemetry from Kafka topics — useful for buffered/decoupled architectures
`hostmetrics`	System calls	Collect host-level metrics: CPU, memory, disk, network, filesystem, processes
`k8s_cluster`	Kubernetes API	Collect cluster-level events and metrics: node conditions, pod phases, deployments

Receiver Configuration

# Receivers section of otel-collector-config.yaml
receivers:
  # OTLP receiver — primary ingestion point
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]

  # Prometheus receiver — scrape Prometheus targets
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          scrape_interval: 15s
          static_configs:
            - targets: ['node-exporter:9100']
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

  # Filelog receiver — tail application logs
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/otc-container/*.log
    start_at: end
    include_file_path: true
    operators:
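      # Route JSON-looking lines to the parser. Entries that match no route
      # are dropped unless the router also sets a `default` output.
      - type: router
        routes:
          - output: json-parser
            expr: 'body matches "^\\{"'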
      - id: json-parser
        type: json_parser
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'

  # Host metrics receiver — system-level metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory: {}
      disk: {}
      network: {}
      filesystem: {}

Processors

Processors transform, filter, and enrich telemetry data as it flows through the pipeline. They execute in the order defined in the pipeline's processors: list — order matters significantly for correctness and performance.

Processor	Purpose	Key Configuration
`batch`	Reduce export calls by batching telemetry into larger payloads	`send_batch_size`, `timeout`
`memory_limiter`	Prevent OOM by applying backpressure when memory exceeds limits	`limit_mib`, `spike_limit_mib`, `check_interval`
`attributes`	Add, remove, or modify attributes on spans, metrics, or logs	`actions: [insert, update, delete, hash]`
`filter`	Drop unwanted telemetry based on attribute conditions	`metrics.exclude`, `traces.span`, `logs.exclude`
`tail_sampling`	Intelligent trace sampling based on complete trace data	`policies`, `decision_wait`, `num_traces`
`resource`	Add resource attributes (cluster, environment, service info)	`attributes: [{key, value, action}]`
`transform`	Apply OTTL expressions for complex transformations	`trace_statements`, `metric_statements`, `log_statements`

Processor Ordering & Configuration

# Processors section — order in pipeline definition matters!
processors:
  # Memory limiter — ALWAYS first in pipeline
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

  # Batch processor — improves compression and reduces network calls
  batch:
    send_batch_size: 8192
    send_batch_max_size: 16384
    timeout: 5s

  # Resource processor — add environment metadata
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-east-1-primary
        action: upsert

  # Attributes processor — enrich/redact span attributes
  attributes:
    actions:
      - key: db.statement
        action: hash
      - key: http.request.header.authorization
        action: delete
      - key: deployment.version
        value: "v2.4.1"
        action: upsert

  # Filter processor — drop noisy/unwanted telemetry
  filter:
    error_mode: ignore
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*_test_.*"
          - "go_gc_.*"
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'

  # Tail sampling — keep errors + slow traces + sample rest
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

                            
Processor ordering is critical. Always place memory_limiter first to prevent OOM before any data buffering occurs. Place batch last (before export) to maximize compression efficiency. A recommended order: memory_limiter → resource → attributes → filter → tail_sampling → batch.
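
A sketch of how that recommended order would look in a pipeline definition, assuming the processors configured above plus an otlp receiver and an otlp exporter defined elsewhere in the file:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last; sampling runs after enrichment
      # and filtering so it only sees spans that are worth keeping
      processors: [memory_limiter, resource, attributes, filter, tail_sampling, batch]
      exporters: [otlp]
```

The order of the top-level processors: section has no effect on execution; only the list inside each pipeline does.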
                        

Exporters

Exporters send processed telemetry to one or more backends. A single pipeline can fan out to multiple exporters, enabling you to send the same data to different systems (e.g., traces to both Jaeger and a long-term archive). Each exporter manages its own retry queue and connection pool.
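
As a rough sketch of fan-out, the fragment below sends the same traces to Jaeger over OTLP and to a local file archive. The exporter instance names, the Jaeger address, and the archive path are hypothetical; the otlp receiver and the two processors are assumed to be defined elsewhere in the file.

```yaml
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.monitoring.svc:4317   # hypothetical address
    tls:
      insecure: true
  file/archive:
    path: /var/lib/otelcol/archive/traces.json       # hypothetical path

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # Every exporter listed here receives a copy of the pipeline's data
      exporters: [otlp/jaeger, file/archive]
```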

Exporter	Protocol	Use Case
`otlp`	gRPC / HTTP	Forward to another Collector (gateway pattern) or OTLP-native backends (Tempo, Mimir)
`prometheusremotewrite`	HTTP POST	Write metrics to Prometheus-compatible backends (Prometheus, Mimir, Thanos, Cortex)
`loki`	HTTP POST	Send logs to Grafana Loki with label extraction from resource/log attributes
`debug`	Stdout	Print telemetry to console for development and troubleshooting (replaces deprecated `logging`)
`file`	File I/O	Write telemetry to disk in JSON format for offline analysis or debugging

Exporter Configuration

# Exporters section of otel-collector-config.yaml
exporters:
  # OTLP exporter — forward to gateway Collector or backend
  otlp:
    endpoint: otel-gateway.monitoring.svc:4317
    tls:
      insecure: false
      cert_file: /etc/otel/tls/client.crt
      key_file: /etc/otel/tls/client.key
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

  # Prometheus Remote Write — metrics to Mimir/Thanos
  prometheusremotewrite:
    endpoint: http://mimir.monitoring.svc:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true
    headers:
      X-Scope-OrgID: "production"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

  # Loki exporter — logs to Grafana Loki
  # Note: the labels block below is the older configuration style. Newer
  # contrib releases are configured through the loki.resource.labels and
  # loki.attribute.labels hint attributes instead, so verify this against
  # the exporter README for the version you run.
  loki:
    endpoint: http://loki.monitoring.svc:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        service.namespace: "namespace"
        k8s.container.name: "container"
      attributes:
        severity: ""
        level: ""

  # Debug exporter — stdout for development
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

Pipeline Configuration

The service section ties everything together by defining named pipelines. Each pipeline specifies its signal type (metrics, traces, or logs), its receivers, processors (in order), and exporters. This is where the Collector's full power emerges — you can define multiple pipelines per signal type with different processing rules.

Complete Working Configuration

# otel-collector-config.yaml — Production-ready full configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: end
    include_file_path: true
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'

processors:
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

  batch:
    send_batch_size: 8192
    timeout: 5s

  resource:
    attributes:
      - key: cluster
        value: us-east-1-primary
        action: upsert
      - key: environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline-sampling
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/traces:
    endpoint: tempo.monitoring.svc:4317
    tls:
      insecure: true

  prometheusremotewrite:
    endpoint: http://mimir.monitoring.svc:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true

  loki:
    # Note: the labels block is the older configuration style; verify it
    # against the exporter README for the version you run.
    endpoint: http://loki.monitoring.svc:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        k8s.namespace.name: "namespace"

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

  pipelines:
    # Metrics pipeline: OTLP + Prometheus → batch → Remote Write
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]

    # Traces pipeline: OTLP → tail sampling → OTLP export
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/traces]

    # Logs pipeline: Filelog → batch → Loki
    logs:
      receivers: [filelog]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]

                            
Named exporters: Use the exporter_type/name syntax (e.g., otlp/traces) to define multiple instances of the same exporter type with different configurations. This lets you, for example, run one OTLP exporter for Tempo and a second one for another OTLP-capable backend or gateway, each with its own endpoint and TLS settings.
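
A minimal sketch of two instances of the same exporter type. The otlp/traces instance reuses the Tempo endpoint from the configuration above; the otlp/metrics instance and its endpoint are hypothetical and stand in for any OTLP-capable metrics backend.

```yaml
exporters:
  # Instance 1: traces to Tempo
  otlp/traces:
    endpoint: tempo.monitoring.svc:4317
    tls:
      insecure: true

  # Instance 2: metrics to an OTLP-capable backend (hypothetical endpoint)
  otlp/metrics:
    endpoint: metrics-backend.monitoring.svc:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/metrics]
```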
                        

Deployment Patterns

The Collector can be deployed in several patterns depending on your scale, reliability requirements, and processing needs. The three primary patterns — Agent, Gateway, and Sidecar — are often combined in production architectures.

Pattern	Kubernetes Deployment	Scope	Best For
Agent	DaemonSet (one per node)	Per-node collection	Host metrics, log collection, low-overhead forwarding to gateway
Gateway	Deployment (2-3 replicas, HPA)	Centralized processing	Heavy processing (tail sampling, OTTL transforms), cross-service correlation, fan-out to multiple backends
Sidecar	Sidecar container (per pod)	Per-application	Application-specific processing, PII redaction, compliance isolation, multi-tenant environments

Agent + Gateway Pattern (Recommended)

Agent + Gateway Deployment Pattern

flowchart TD
    subgraph Node1["Node 1"]
        App1[App Pod A] -->|OTLP| Agent1[OTel Agent<br/>DaemonSet]
        App2[App Pod B] -->|OTLP| Agent1
        Logs1["/var/log/pods/"] -->|filelog| Agent1
    end

    subgraph Node2["Node 2"]
        App3[App Pod C] -->|OTLP| Agent2[OTel Agent<br/>DaemonSet]
        App4[App Pod D] -->|OTLP| Agent2
        Logs2["/var/log/pods/"] -->|filelog| Agent2
    end

    subgraph Gateway["Gateway Deployment (2+ replicas)"]
        GW1[OTel Gateway 1<br/>Tail Sampling<br/>OTTL Transforms]
        GW2[OTel Gateway 2<br/>Tail Sampling<br/>OTTL Transforms]
    end

    Agent1 -->|OTLP gRPC| GW1
    Agent1 -->|OTLP gRPC| GW2
    Agent2 -->|OTLP gRPC| GW1
    Agent2 -->|OTLP gRPC| GW2

    GW1 --> Mimir[Mimir<br/>Metrics]
    GW1 --> Tempo[Tempo<br/>Traces]
    GW1 --> Loki[Loki<br/>Logs]
    GW2 --> Mimir
    GW2 --> Tempo
    GW2 --> Loki

                            
When to use each pattern:
Agent only — small clusters (<20 nodes), simple forwarding, no tail sampling needed
Agent + Gateway — production clusters, tail sampling required, heavy transformations, fan-out to multiple backends
Sidecar — multi-tenant platforms requiring isolation, compliance-heavy workloads needing per-service PII redaction, service meshes without sidecars
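
With more than one gateway replica, tail sampling only works if every span of a trace reaches the same replica. A common way to arrange this is the contrib loadbalancing exporter on the agents, routing by trace ID. The sketch below assumes a headless Service named otel-gateway-headless (a hypothetical name) that resolves to the gateway pod IPs, and plaintext traffic inside the cluster; the exporter's DNS resolver falls back to port 4317 when no port is given.

```yaml
# Agent-side fragment: route spans to gateway replicas by trace ID
exporters:
  loadbalancing:
    routing_key: traceID          # keep all spans of a trace on one gateway
    protocol:
      otlp:
        tls:
          insecure: true          # assumption: plaintext inside the cluster
    resolver:
      dns:
        # hypothetical headless Service in front of the gateway pods
        hostname: otel-gateway-headless.monitoring.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```

Metrics and logs do not need this routing and can keep using a plain otlp exporter pointed at the gateway Service.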

Performance Tuning

The Collector's performance depends heavily on correct configuration of the batch processor, memory limiter, and exporter queues. Misconfigured settings lead to either OOM kills (memory limits set too loosely) or dropped telemetry (exporter queues too small to ride out a backend outage).

Component	Setting	Default	Production Recommendation
batch	`send_batch_size`	8192	8192–16384 (higher = fewer exports, more memory)
batch	`timeout`	200ms	5s (flush even if batch not full)
memory_limiter	`limit_mib`	0 (disabled)	80% of container memory limit (e.g., 512 for 640MiB limit)
memory_limiter	`spike_limit_mib`	20% of `limit_mib`	25% of `limit_mib` (e.g., 128 for 512 limit)
memory_limiter	`check_interval`	0s	1–5s (lower = more responsive, slightly more CPU)
exporters	`sending_queue.queue_size`	1000	5000–10000 (buffer during backend outages)
exporters	`retry_on_failure.max_elapsed_time`	300s	300–600s (5–10 min retry window)

Memory Limiter — The Most Important Processor

                            
The memory_limiter processor is non-negotiable in production. Without it, a traffic spike or backend outage causes the Collector to buffer unbounded data in memory until it's OOM-killed, losing all in-flight telemetry. Configure it as the FIRST processor in every pipeline, with limit_mib set to 80% of your container's memory limit. Once memory use passes the soft limit (limit_mib minus spike_limit_mib), the Collector applies backpressure to receivers (refusing new data) rather than crashing.
                        

# Memory limiter tuning for a 1GiB container
processors:
  memory_limiter:
    # Hard limit: above this the processor forces garbage collection
    limit_mib: 800          # 80% of 1024 MiB container
    # Spike headroom: data is refused once usage passes the soft limit,
    # limit_mib - spike_limit_mib = 600 MiB
    spike_limit_mib: 200    # 25% of limit_mib
    # How often to check memory usage
    check_interval: 2s

# Corresponding Kubernetes resource limits
# resources:
#   limits:
#     memory: 1Gi
#   requests:
#     memory: 512Mi
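
The same sizing can also be expressed relative to available memory rather than in fixed MiB, which avoids editing the config whenever the container limit changes. This is a sketch that assumes a Linux container with cgroups, where the processor can detect the memory limit. Note that both percentages are shares of total memory, so 20 here approximates the 200 MiB spike value above.

```yaml
processors:
  memory_limiter:
    check_interval: 2s
    # Percentages of total available memory (the container limit under
    # cgroups) instead of fixed MiB values
    limit_percentage: 80          # about 819 MiB of a 1 GiB container
    spike_limit_percentage: 20    # about 205 MiB of a 1 GiB container
```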

Production Checklist

Production Readiness

OpenTelemetry Collector Production Deployment Checklist

Enable memory_limiter as first processor — set limit_mib to 80% of container memory limit in every pipeline to prevent OOM kills and uncontrolled data loss
Configure exporter retry queues — enable sending_queue with adequate queue_size (5000+) to buffer during backend outages without dropping telemetry
Deploy Agent + Gateway pattern — use DaemonSet agents for lightweight collection and Deployment gateways (2+ replicas with HPA) for heavy processing like tail sampling
Monitor the Collector itself — expose service.telemetry.metrics on port 8888 and alert on otelcol_exporter_send_failed_spans, otelcol_processor_refused_spans, and memory usage
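Use tail_sampling on Gateway only — tail sampling requires seeing complete traces, so all spans of a trace must arrive at the same Collector instance; with more than one gateway replica, route spans by trace ID (for example, with the loadbalancing exporter on the agents) rather than relying on ordinary round-robin load balancing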
Secure with TLS and authentication — enable TLS on OTLP receivers/exporters, use mTLS between agents and gateways, and add bearer token authentication for external ingestion endpoints
Version-pin the Collector image — use specific tags (e.g., otel/opentelemetry-collector-contrib:0.98.0) not :latest; test upgrades in staging before production rollout
Implement graceful shutdown — set Kubernetes terminationGracePeriodSeconds (e.g., 30) long enough for the Collector to flush batches and drain exporter queues after it receives SIGTERM, before the pod is killed
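
For the TLS and authentication item, a hedged sketch of a gateway-side OTLP receiver that requires client certificates (mTLS) and a bearer token. The certificate paths and the OTLP_INGEST_TOKEN environment variable are hypothetical; bearertokenauth refers to the contrib extension of that name.

```yaml
extensions:
  bearertokenauth:
    # hypothetical environment variable holding the shared token
    token: ${env:OTLP_INGEST_TOKEN}

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/tls/server.crt
          key_file: /etc/otel/tls/server.key
          # Setting a client CA makes the server verify client certificates
          client_ca_file: /etc/otel/tls/ca.crt
        auth:
          authenticator: bearertokenauth

service:
  # Extensions must be enabled here as well as configured above
  extensions: [bearertokenauth]
```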
