Platform Deep Dive: Datadog

Platform Overview

Datadog provides a single pane of glass across your entire technology stack. Unlike open-source alternatives that require assembling multiple tools, Datadog delivers infrastructure monitoring, APM, logs, synthetics, security, and more within a single unified platform with consistent UX and correlated data.

Datadog Platform Architecture

flowchart TD
    A[Applications & Services] --> B[Datadog Agent]
    B --> C[Datadog SaaS Platform]
    C --> D[Infrastructure Monitoring]
    C --> E[APM & Tracing]
    C --> F[Log Management]
    C --> G[Synthetics & RUM]
    C --> H[Security Monitoring]

    style A fill:#3B9797,color:#fff
    style B fill:#132440,color:#fff
    style C fill:#BF092F,color:#fff
    style D fill:#16476A,color:#fff
    style E fill:#16476A,color:#fff
    style F fill:#16476A,color:#fff
    style G fill:#16476A,color:#fff
    style H fill:#16476A,color:#fff

Product Modules

Module	Purpose	Key Features
Infrastructure	Host & container monitoring	450+ integrations, host maps, live processes
APM	Distributed tracing & profiling	Service map, flame graphs, continuous profiler
Logs	Centralized log management	Pipelines, facets, patterns, archives
Synthetics	Proactive testing	API tests, browser tests, private locations
RUM	Real User Monitoring	Session replay, core web vitals, error tracking
Security	SIEM & threat detection	Cloud SIEM, CSPM, application security
CI Visibility	Pipeline observability	Test performance, flaky test detection

Datadog Agent

The Datadog Agent is the cornerstone of data collection. It is a lightweight process that runs on your hosts, collects metrics, traces, and logs, and forwards them to the Datadog platform. Agent 7 is written in Go and embeds a Python 3 runtime for running integration checks.

Agent Configuration

# /etc/datadog-agent/datadog.yaml
api_key: YOUR_DATADOG_API_KEY
site: datadoghq.com

# Enable log collection
logs_enabled: true

# APM configuration
apm_config:
  enabled: true
  apm_dd_url: https://trace.agent.datadoghq.com
  env: production
  max_traces_per_second: 200

# Process monitoring
process_config:
  process_collection:
    enabled: true
  container_collection:
    enabled: true

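# Network Performance Monitoring
# Note: the Agent reads this block from /etc/datadog-agent/system-probe.yaml,
# not from datadog.yaml; it is listed here with the other settings for reference.
network_config:
  enabled: true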

# Tags applied to all metrics from this host
tags:
  - env:production
  - team:platform
  - service:checkout-api
  - region:us-east-1

Agent Deployment Options

Method	Use Case	Data Collected
Datadog Agent	VMs, bare metal, containers	Metrics, traces, logs, processes, NPM
DogStatsD	Custom metric submission	Custom metrics (counters, gauges, histograms)
Serverless Forwarder	AWS Lambda, Azure Functions	Logs, enhanced metrics, traces
Agentless (API)	Cloud integrations, SaaS tools	Cloud metrics via API polling
Cluster Agent	Kubernetes environments	Cluster-level metrics, external metrics, admission controller
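
DogStatsD accepts plain-text datagrams over UDP, so the custom-metric row above can be illustrated without any client library. The sketch below builds one datagram in the name:value|type|#tags wire format and shows how it might be sent to an Agent listening on the default port 8125. The metric name checkout.orders.count and both helper functions are hypothetical; in practice the official client library would normally handle this.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None, sample_rate=None):
    """Build one DogStatsD datagram: name:value|type|@rate|#tag1,tag2."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; UDP gives no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


# "c" is a counter; "g" (gauge) and "h" (histogram) use the same shape.
line = format_dogstatsd(
    "checkout.orders.count", 1, "c",
    tags=["env:production", "service:checkout-api"],
)
print(line)
# send_metric(line)  # uncomment on a host that runs the Agent
```

Because the transport is UDP, a slow or missing Agent never blocks the application, which is the main reason this path is used for high-frequency custom metrics.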

Kubernetes Deployment

# Datadog Agent Helm values.yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  site: datadoghq.com

  logs:
    enabled: true
    containerCollectAll: true

  apm:
    portEnabled: true

  processAgent:
    enabled: true
    processCollection: true

  networkMonitoring:
    enabled: true

clusterAgent:
  enabled: true
  metricsProvider:
    enabled: true
  admissionController:
    enabled: true
    mutateUnlabelled: false

Infrastructure Monitoring

Datadog's infrastructure monitoring provides real-time visibility across your entire stack with 450+ out-of-the-box integrations. Host maps give you a visual representation of your infrastructure, colored and grouped by any tag.

Key Integrations

Integration	Metrics Collected	Setup Method
AWS	EC2, RDS, ELB, S3, Lambda, ECS, EKS metrics	CloudFormation / Terraform (API polling + CloudWatch)
Kubernetes	Pod, node, deployment, DaemonSet, ReplicaSet metrics	Helm chart / DaemonSet + Cluster Agent
Docker	Container CPU, memory, I/O, network, image stats	Agent auto-discovery via Docker socket
NGINX	Requests/sec, connections, response codes, upstream	Agent check + stub_status module
PostgreSQL	Connections, locks, query performance, replication lag	Agent check + read-only DB user
Redis	Commands/sec, memory, hit rate, connected clients	Agent check + INFO command access

Tagging Strategy

Effective tagging is critical for Datadog ROI. Tags enable filtering, grouping, and correlating data across all products.

                            
Unified Tagging Convention: Apply consistent tags across metrics, traces, and logs. The three essential tag keys are env (production/staging/dev), service (application name), and version (deployment version). These enable Datadog's unified service tagging for seamless correlation.
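
A convention like this is easier to keep if it is checked automatically, for example in CI before a config change ships. The following is a minimal sketch, not a Datadog feature: the function name missing_unified_tags is hypothetical, and it only assumes that tags are written as key:value strings.

```python
REQUIRED_TAG_KEYS = ("env", "service", "version")


def missing_unified_tags(tags):
    """Return the required tag keys that do not appear in a key:value tag list."""
    present = {tag.split(":", 1)[0] for tag in tags if ":" in tag}
    return [key for key in REQUIRED_TAG_KEYS if key not in present]


host_tags = [
    "env:production",
    "team:platform",
    "service:checkout-api",
    "region:us-east-1",
]
print(missing_unified_tags(host_tags))
```

Run against the host tags from the Agent configuration shown earlier, the check reports that version is missing, which is typical because version is usually set per deployment rather than per host.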
                        

APM & Distributed Tracing

Datadog APM provides end-to-end distributed tracing with automatic instrumentation for most languages. The service map visualizes dependencies, while flame graphs reveal exactly where latency occurs within a request.
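
To make the flame graph idea concrete, the sketch below computes the "self time" of each span in a trace, meaning the time not spent in its direct children. This is an illustration of the concept rather than Datadog's implementation: the Span class, the span names, and the durations are all made up, and it assumes child spans run one after another rather than in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans):
    """Return {span name: duration minus the total duration of direct children}."""
    child_total = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms
    return {s.name: s.duration_ms - child_total[s.span_id] for s in spans}


trace = [
    Span("a", None, "http.request", 120.0),
    Span("b", "a", "postgres.query", 70.0),
    Span("c", "a", "redis.get", 10.0),
]
for name, ms in self_times(trace).items():
    print(f"{name}: {ms:.0f} ms self time")
```

In this made-up trace the database query accounts for 70 of the 120 ms, which is the kind of answer a flame graph gives at a glance.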

Auto-Instrumentation (Python)

# Install the Datadog tracing library
pip install ddtrace

# Run your application with auto-instrumentation
ddtrace-run python app.py

# Or with specific configuration
DD_SERVICE=checkout-api \
DD_ENV=production \
DD_VERSION=1.4.2 \
DD_TRACE_SAMPLE_RATE=1.0 \
DD_PROFILING_ENABLED=true \
DD_LOGS_INJECTION=true \
ddtrace-run gunicorn myapp.wsgi:application --bind 0.0.0.0:8000

Custom Instrumentation

# Example: custom span with ddtrace in Python
# File: payment_service.py

from ddtrace import tracer

@tracer.wrap(service="payment-service", resource="process_payment")
def process_payment(order_id, amount):
    """Process a payment with custom span attributes."""
    span = tracer.current_span()
    span.set_tag("order.id", order_id)
    span.set_tag("payment.amount", amount)
    span.set_tag("payment.currency", "USD")

    # Business logic here; charge_credit_card is assumed to be defined elsewhere
    result = charge_credit_card(order_id, amount)

    span.set_tag("payment.status", result.status)
    return result

                            
Continuous Profiler: Datadog's always-on profiler captures CPU, memory, lock contention, and I/O profiles in production with <2% overhead. It correlates profiles directly with traces: click any slow span to see the exact code path consuming time. This eliminates the need for ad-hoc profiling sessions during incidents.
                        

Key APM Features

Feature	Description
Service Map	Auto-generated dependency graph showing request flow, error rates, and latency between services
Flame Graphs	Visual breakdown of time spent in each span within a trace
Trace Search	Query traces by tags, duration, status, or any custom attribute (15-day retention)
Error Tracking	Automatic grouping of similar errors with stack traces and impacted users
Deployment Tracking	Compare latency, error rate, and request volume across versions

Log Management

Datadog Log Management ingests, processes, and stores logs at scale. Logs are parsed via pipelines, enriched with tags, and correlated with traces and infrastructure metrics for unified troubleshooting.

                            
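Cost Warning (Log Volume): Log volume is the #1 cost driver for most Datadog deployments. At $0.10/GB ingested and $1.70 per million log events indexed, an uncontrolled application logging 50 GB/day can cost $150+/day in log fees alone. Ingestion is only about $5/day; the bulk comes from indexing, since at an assumed average of roughly 500 bytes per event, 50 GB is about 100 million events, or about $170/day. Use exclusion filters, sampling, and log pipelines aggressively to control costs.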
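
The arithmetic behind the warning above can be sketched as a small estimator. It uses the two prices quoted there; the average event size of 500 bytes and the indexed_fraction parameter are assumptions added for illustration, and real invoices depend on plan and contract terms.

```python
INGEST_USD_PER_GB = 0.10
INDEX_USD_PER_MILLION_EVENTS = 1.70


def daily_log_cost(gb_per_day, avg_event_bytes, indexed_fraction=1.0):
    """Estimate daily ingestion and indexing cost in USD."""
    ingest = gb_per_day * INGEST_USD_PER_GB
    events = gb_per_day * 1e9 / avg_event_bytes
    index = events * indexed_fraction / 1e6 * INDEX_USD_PER_MILLION_EVENTS
    return {
        "ingest": round(ingest, 2),
        "index": round(index, 2),
        "total": round(ingest + index, 2),
    }


# Everything indexed versus only 20% indexed after exclusion filters.
print(daily_log_cost(50, 500))
print(daily_log_cost(50, 500, indexed_fraction=0.2))
```

Under these assumptions, indexing everything comes to about $175/day, while indexing one log in five drops it to about $39/day, which is why exclusion filters matter far more than the ingestion fee.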
                        

Log Collection Configuration

# /etc/datadog-agent/conf.d/app.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: checkout-api
    source: python
    tags:
      - env:production
      - team:platform

  - type: docker
    service: payment-service
    source: java
    log_processing_rules:
      # Exclude health check noise
      - type: exclude_at_match
        name: exclude_healthchecks
        pattern: "GET /health"
      # Multi-line log aggregation
      - type: multi_line
        name: java_stacktrace
        pattern: '^\d{4}-\d{2}-\d{2}'

Log Pipeline Processing

Pipelines transform raw logs into structured, queryable data:

Grok Parser — Extract structured fields from unstructured log lines
Date Remapper — Set the official log timestamp from a parsed field
Status Remapper — Map log severity (INFO, WARN, ERROR)
Service Remapper — Assign the correct service name
Category Processor — Classify logs into categories based on content
Trace ID Remapper — Link logs to APM traces for correlation
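
The parsing step at the start of a pipeline can be pictured with an ordinary regular expression. The sketch below is an analogy only: Datadog's Grok Parser has its own rule syntax, and the log line layout, field names, and sample line here are invented for the example.

```python
import re

# Named groups play the role of extracted attributes.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) "
    r"(?P<status>[A-Z]+) "
    r"\[(?P<service>[\w-]+)\] "
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


raw = "2024-05-01T12:00:00Z ERROR [checkout-api] payment declined order_id=1234"
print(parse_line(raw))
```

Once fields such as timestamp, status, and service exist as attributes, the remappers in the list above have something to promote to the log's official date, severity, and service.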

Index Management

Use multiple indexes with different retention periods to control costs:

Index	Filter	Retention	Purpose
errors-critical	`status:error OR status:critical`	30 days	Incident investigation
security-audit	`source:auth OR @action:login`	90 days	Compliance & audit
application-default	`*` (catch-all)	7 days	General troubleshooting
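
The ordering of this table matters. The sketch below models the routing under the assumption that filters are evaluated top to bottom and a log is stored in the first index it matches, so the catch-all has to come last. The predicates are simplified Python stand-ins for the filter queries in the table, not Datadog query syntax.

```python
# (index name, predicate) pairs in evaluation order; the catch-all is last.
INDEXES = [
    ("errors-critical", lambda log: log.get("status") in {"error", "critical"}),
    ("security-audit", lambda log: log.get("source") == "auth" or log.get("action") == "login"),
    ("application-default", lambda log: True),
]


def route_to_index(log):
    """Return the name of the first index whose predicate accepts the log."""
    for name, matches in INDEXES:
        if matches(log):
            return name
    return None


print(route_to_index({"status": "error", "source": "auth"}))
print(route_to_index({"status": "info", "source": "auth"}))
print(route_to_index({"status": "info", "source": "python"}))
```

Note that an error log from the auth source lands in errors-critical with 30-day retention, not in the 90-day audit index; if audit logs must always be kept for 90 days, that index would need to be evaluated first.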

Dashboards & Monitors

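Datadog dashboards come in two layouts: ordered layouts (historically called Timeboards; time-synchronized, ideal for troubleshooting) and free-form layouts (historically called Screenboards; ideal for status displays and NOC screens). The layout_type field in a dashboard's JSON definition selects between them.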

Monitor Types

Monitor Type	Triggers On	Best For
Metric	Threshold or change in metric value	CPU > 90%, error rate > 5%
APM	Service latency, error rate, or hit rate	p99 latency > 500ms on checkout service
Log	Log count matching a query	More than 100 errors in 5 minutes
Composite	Boolean combination of other monitors	Alert only when BOTH high CPU AND high error rate
SLO	Error budget consumption rate	Burning through monthly error budget too fast
Anomaly	Deviation from historical baseline	Traffic drop or unexpected spike detection
Forecast	Predicted threshold breach	Disk will be full in 48 hours
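
The SLO row is the least self-explanatory entry in the table, so here is a short sketch of the burn-rate calculation that such monitors are built on. It uses the standard definition (observed error rate divided by the error budget); the function names and the 30-day window are assumptions for the example.

```python
def burn_rate(observed_error_rate, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return observed_error_rate / error_budget


def hours_to_exhaustion(rate, window_days=30):
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


rate = burn_rate(observed_error_rate=0.01, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in about {hours_to_exhaustion(rate):.0f} hours")
```

A 1% error rate against a 99.9% target burns budget about ten times faster than sustainable, so a 30-day budget would last roughly three days. A burn rate of 1 means the budget lasts exactly the window.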

                            
Composite Monitors: These are Datadog's most powerful alerting primitive. Instead of getting paged for every CPU spike (which might be normal during deployments), create a composite that only fires when high CPU AND high error rate AND NOT during a maintenance window. This dramatically reduces alert fatigue.
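
The boolean logic described above can be written out as a truth table. This is only a model of the condition, with hypothetical input names; in Datadog the inputs would be the states of existing monitors.

```python
from itertools import product


def composite_alert(high_cpu, high_error_rate, in_maintenance):
    """Fire only when both symptoms are present outside a maintenance window."""
    return high_cpu and high_error_rate and not in_maintenance


# Enumerate all eight input combinations and keep the ones that page.
firing = [combo for combo in product([False, True], repeat=3) if composite_alert(*combo)]
print(firing)
```

Of the eight possible combinations only one pages anyone, which is the point: a CPU spike alone, or errors during planned maintenance, stay silent.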
                        

Dashboard JSON Definition

{
  "title": "Service Overview - Checkout API",
  "widgets": [
    {
      "definition": {
        "type": "timeseries",
        "title": "Request Rate",
        "requests": [
          {
            "q": "sum:trace.http.request.hits{service:checkout-api}.as_rate()",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "query_value",
        "title": "P99 Latency",
        "requests": [
          {
            "q": "p99:trace.http.request.duration{service:checkout-api}",
            "aggregator": "avg"
          }
        ],
        "precision": 0
      }
    }
  ],
  "layout_type": "ordered"
}
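
When the same dashboard is needed for many services, it can help to generate the JSON rather than copy it. The sketch below rebuilds the definition above from a service name using only the standard library. The helper names are hypothetical, and submitting the result to Datadog (via the API, Terraform, or the UI import) is deliberately left out.

```python
import json


def timeseries_widget(title, query, display_type="line"):
    """Timeseries widget with a single query."""
    return {"definition": {
        "type": "timeseries",
        "title": title,
        "requests": [{"q": query, "display_type": display_type}],
    }}


def query_value_widget(title, query, aggregator="avg", precision=0):
    """Single-number widget."""
    return {"definition": {
        "type": "query_value",
        "title": title,
        "requests": [{"q": query, "aggregator": aggregator}],
        "precision": precision,
    }}


def build_service_dashboard(service, display_name):
    scope = f"{{service:{service}}}"  # renders as {service:checkout-api}
    return {
        "title": f"Service Overview - {display_name}",
        "widgets": [
            timeseries_widget("Request Rate", f"sum:trace.http.request.hits{scope}.as_rate()", "bars"),
            query_value_widget("P99 Latency", f"p99:trace.http.request.duration{scope}"),
        ],
        "layout_type": "ordered",
    }


dashboard = build_service_dashboard("checkout-api", "Checkout API")
print(json.dumps(dashboard, indent=2))
```

Keeping the widget builders separate from the service-specific part means a new service only needs one call to build_service_dashboard.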

Cost Optimization

Datadog pricing is consumption-based. Without governance, costs can spiral quickly — especially with custom metrics and log ingestion. Understanding the cost model is essential for production deployments.

Cost Factors

Cost Factor	Pricing Model	Typical Trap
Infrastructure Hosts	$15-23/host/month (Pro/Enterprise)	Container-per-host counting in Kubernetes
Custom Metrics	$0.05/metric/month (first 100 free per host)	Cardinality explosion from unbounded tags
Log Ingestion	$0.10/GB ingested	Debug logging left on in production
Log Indexing	$1.70/million events (15-day retention)	Indexing all logs instead of sampling
APM Spans	$36/host/month + ingested spans	Tracing every request without sampling
Synthetics	$5/10K API tests, $12/1K browser tests	Running tests every minute from all regions

Cost Reduction Strategies

Tag-based metric filtering — Use the Metrics without Limits feature to choose which tags stay queryable on each metric, dropping unnecessary tag combinations while keeping aggregated data available
Log exclusion filters — Filter out health checks, debug logs, and known noisy patterns before indexing (still searchable in Live Tail)
APM trace sampling — Use head-based sampling at 10-20% for high-throughput services; Datadog retains error and high-latency traces automatically
Log archives to S3/GCS — Archive all logs to cloud storage ($0.02/GB) for compliance, only index what you actively query
Custom metric governance — Audit metrics monthly; remove unused metrics and restrict high-cardinality tags (user_id, request_id)
Committed-use contracts — Annual commits provide 20-40% discounts vs. on-demand; forecast usage based on 3-month trailing data
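
Head-based sampling, mentioned in the list above, means the keep-or-drop decision is made once, when the trace starts, and every downstream service honours it. A minimal sketch of one common way to do this is a deterministic hash of the trace ID. The multiplier constant and function name below are illustrative choices, not a description of how Datadog's tracers are implemented.

```python
HASH_MULTIPLIER = 1111111111111111111  # arbitrary large odd constant
ID_SPACE = 2 ** 64


def keep_trace(trace_id, sample_rate):
    """Deterministic decision: the same trace ID always gets the same answer."""
    hashed = (trace_id * HASH_MULTIPLIER) % ID_SPACE
    return hashed < sample_rate * ID_SPACE


kept = sum(keep_trace(trace_id, 0.15) for trace_id in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because the decision depends only on the trace ID, any service that sees the same ID reaches the same verdict, so sampled traces stay complete instead of missing spans.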

                            
Custom Metric Explosion: A single metric name with high-cardinality tags can generate thousands of custom metrics. For example, http.requests{endpoint, method, status, user_id} with 100 endpoints × 4 methods × 10 statuses × 10,000 users = 40 million unique metric time series. Always validate tag cardinality before deploying new instrumentation.
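
The multiplication in the example above is easy to automate as a pre-deployment check. The sketch below computes the worst-case number of time series from per-tag cardinalities; it is an upper bound, since not every combination necessarily occurs, and the function name is hypothetical.

```python
from math import prod


def max_series(tag_cardinalities):
    """Upper bound on time series: the product of distinct values per tag."""
    return prod(tag_cardinalities.values())


tags = {"endpoint": 100, "method": 4, "status": 10, "user_id": 10_000}
print(f"with user_id:    {max_series(tags):,}")

# Dropping the unbounded tag shows how much it contributes.
bounded = {key: count for key, count in tags.items() if key != "user_id"}
print(f"without user_id: {max_series(bounded):,}")
```

Removing the single user_id tag takes the bound from 40,000,000 to 4,000, which is why unbounded identifiers belong in logs or traces rather than in metric tags.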
                        

When to Choose Datadog

Platform Assessment

Datadog: Strengths & Limitations

Strengths

Unified platform — Single pane of glass across metrics, traces, logs, and synthetics with seamless correlation
450+ integrations — Out-of-the-box dashboards and checks for virtually every technology
Zero infrastructure management — Fully managed SaaS; no clusters to scale, upgrade, or maintain
Excellent UX — Intuitive interface, fast search, powerful query language
AI/ML features — Anomaly detection, forecasting, Watchdog auto-discovery of issues
Enterprise features — RBAC, audit trails, SSO, data residency options

Limitations

Cost at scale — Can become expensive (>$100K/year) for large deployments without governance
Vendor lock-in — Proprietary query language, agent, and APIs make migration difficult
Custom metric pricing — Cardinality-based billing punishes high-dimensional data
Data retention limits — Metrics are retained for 15 months, but indexed logs and spans default to 15 days; longer retention costs more
No self-hosted option — Data leaves your network; some regulated industries can't use it

Best For

Teams that want unified observability without managing infrastructure
Organizations with 50-5,000 hosts running diverse technology stacks
Companies that value developer velocity over infrastructure cost optimization
Environments running on AWS/GCP/Azure with cloud-native architectures

