Platform Overview
Datadog provides a single pane of glass across your entire technology stack. Unlike open-source alternatives that require assembling multiple tools, Datadog delivers infrastructure monitoring, APM, logs, synthetics, security, and more within a single unified platform with consistent UX and correlated data.
flowchart TD
A[Applications & Services] --> B[Datadog Agent]
B --> C[Datadog SaaS Platform]
C --> D[Infrastructure Monitoring]
C --> E[APM & Tracing]
C --> F[Log Management]
C --> G[Synthetics & RUM]
C --> H[Security Monitoring]
style A fill:#3B9797,color:#fff
style B fill:#132440,color:#fff
style C fill:#BF092F,color:#fff
style D fill:#16476A,color:#fff
style E fill:#16476A,color:#fff
style F fill:#16476A,color:#fff
style G fill:#16476A,color:#fff
style H fill:#16476A,color:#fff
Product Modules
| Module | Purpose | Key Features |
|---|---|---|
| Infrastructure | Host & container monitoring | 450+ integrations, host maps, live processes |
| APM | Distributed tracing & profiling | Service map, flame graphs, continuous profiler |
| Logs | Centralized log management | Pipelines, facets, patterns, archives |
| Synthetics | Proactive testing | API tests, browser tests, private locations |
| RUM | Real User Monitoring | Session replay, core web vitals, error tracking |
| Security | SIEM & threat detection | Cloud SIEM, CSPM, application security |
| CI Visibility | Pipeline observability | Test performance, flaky test detection |
Datadog Agent
The Datadog Agent is the cornerstone of data collection. It's a lightweight process that runs on your hosts, collecting metrics, traces, and logs and forwarding them to the Datadog platform. Agent 7 (latest) is written in Go with Python-based check runners.
Agent Configuration
# /etc/datadog-agent/datadog.yaml
api_key: YOUR_DATADOG_API_KEY
site: datadoghq.com

# Enable log collection
logs_enabled: true

# APM configuration
apm_config:
  enabled: true
  apm_dd_url: https://trace.agent.datadoghq.com
  env: production
  max_traces_per_second: 200

# Process monitoring
process_config:
  process_collection:
    enabled: true
  container_collection:
    enabled: true

# Network Performance Monitoring
# Note: the Agent reads this block from /etc/datadog-agent/system-probe.yaml, not datadog.yaml
network_config:
  enabled: true

# Tags applied to all metrics from this host
tags:
  - env:production
  - team:platform
  - service:checkout-api
  - region:us-east-1
Agent Deployment Options
| Method | Use Case | Data Collected |
|---|---|---|
| Datadog Agent | VMs, bare metal, containers | Metrics, traces, logs, processes, NPM |
| DogStatsD | Custom metric submission | Custom metrics (counters, gauges, histograms) |
| Serverless Forwarder | AWS Lambda, Azure Functions | Logs, enhanced metrics, traces |
| Agentless (API) | Cloud integrations, SaaS tools | Cloud metrics via API polling |
| Cluster Agent | Kubernetes environments | Cluster-level metrics, external metrics, admission controller |
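To make the DogStatsD row concrete, here is a minimal sketch of the plain-text datagram a client sends to the Agent over UDP. The metric name `checkout.orders` and the helper names are hypothetical. In practice you would normally use an official client library rather than format datagrams by hand.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None, sample_rate=None):
    """Build one DogStatsD datagram: name:value|type[|@rate][|#tag1,tag2]."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; 8125 is the Agent's default DogStatsD port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Counter ("c") incremented by 1; "g" is a gauge and "h" a histogram.
line = format_dogstatsd(
    "checkout.orders", 1, "c", tags=["env:production", "service:checkout-api"]
)
print(line)
```

Calling `send_metric(line)` on a host with a running Agent would submit the counter. Because the transport is UDP, the call does not confirm delivery.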
Kubernetes Deployment
# Datadog Agent Helm values.yaml
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  site: datadoghq.com
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    portEnabled: true
  processAgent:
    enabled: true
    processCollection: true
  networkMonitoring:
    enabled: true

clusterAgent:
  enabled: true
  metricsProvider:
    enabled: true
  admissionController:
    enabled: true
    mutateUnlabelled: false
Infrastructure Monitoring
Datadog's infrastructure monitoring provides real-time visibility across your entire stack with 450+ out-of-the-box integrations. Host maps give you a visual representation of your infrastructure, colored and grouped by any tag.
Key Integrations
| Integration | Metrics Collected | Setup Method |
|---|---|---|
| AWS | EC2, RDS, ELB, S3, Lambda, ECS, EKS metrics | CloudFormation / Terraform (API polling + CloudWatch) |
| Kubernetes | Pod, node, deployment, DaemonSet, ReplicaSet metrics | Helm chart / DaemonSet + Cluster Agent |
| Docker | Container CPU, memory, I/O, network, image stats | Agent auto-discovery via Docker socket |
| NGINX | Requests/sec, connections, response codes, upstream | Agent check + stub_status module |
| PostgreSQL | Connections, locks, query performance, replication lag | Agent check + read-only DB user |
| Redis | Commands/sec, memory, hit rate, connected clients | Agent check + INFO command access |
Tagging Strategy
Effective tagging is critical for Datadog ROI. Tags enable filtering, grouping, and correlating data across all products.
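At a minimum, apply three reserved tags everywhere: env (production/staging/dev), service (application name), and version (deployment version). These enable Datadog's unified service tagging, which correlates metrics, traces, and logs for the same service and release.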
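A deployment pipeline can check for these three tags before rollout. The helper below is a hypothetical sketch, not part of any Datadog tooling. It assumes tags are plain `key:value` strings like those in the Agent configuration.

```python
REQUIRED_TAG_KEYS = ("env", "service", "version")


def missing_unified_tags(tags):
    """Return the required tag keys absent from a list of 'key:value' tags."""
    present = {tag.split(":", 1)[0] for tag in tags if ":" in tag}
    return [key for key in REQUIRED_TAG_KEYS if key not in present]


host_tags = ["env:production", "team:platform", "service:checkout-api", "region:us-east-1"]
print(missing_unified_tags(host_tags))  # ['version']
```

Here the host tags from the Agent configuration carry env and service but no version, so the check flags the missing key.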
APM & Distributed Tracing
Datadog APM provides end-to-end distributed tracing with automatic instrumentation for most languages. The service map visualizes dependencies, while flame graphs reveal exactly where latency occurs within a request.
Auto-Instrumentation (Python)
# Install the Datadog tracing library
pip install ddtrace
# Run your application with auto-instrumentation
ddtrace-run python app.py
# Or with specific configuration
DD_SERVICE=checkout-api \
DD_ENV=production \
DD_VERSION=1.4.2 \
DD_TRACE_SAMPLE_RATE=1.0 \
DD_PROFILING_ENABLED=true \
DD_LOGS_INJECTION=true \
ddtrace-run gunicorn myapp.wsgi:application --bind 0.0.0.0:8000
Custom Instrumentation
# Example: custom span with ddtrace in Python
# File: payment_service.py
from ddtrace import tracer


@tracer.wrap(service="payment-service", resource="process_payment")
def process_payment(order_id, amount):
    """Process a payment with custom span attributes."""
    span = tracer.current_span()
    span.set_tag("order.id", order_id)
    span.set_tag("payment.amount", amount)
    span.set_tag("payment.currency", "USD")

    # Business logic here (charge_credit_card is assumed to be defined elsewhere)
    result = charge_credit_card(order_id, amount)
    span.set_tag("payment.status", result.status)
    return result
Key APM Features
| Feature | Description |
|---|---|
| Service Map | Auto-generated dependency graph showing request flow, error rates, and latency between services |
| Flame Graphs | Visual breakdown of time spent in each span within a trace |
| Trace Search | Query traces by tags, duration, status, or any custom attribute (15-day retention) |
| Error Tracking | Automatic grouping of similar errors with stack traces and impacted users |
| Deployment Tracking | Compare latency, error rate, and request volume across versions |
Log Management
Datadog Log Management ingests, processes, and stores logs at scale. Logs are parsed via pipelines, enriched with tags, and correlated with traces and infrastructure metrics for unified troubleshooting.
Log Collection Configuration
# /etc/datadog-agent/conf.d/app.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: checkout-api
    source: python
    tags:
      - env:production
      - team:platform

  - type: docker
    service: payment-service
    source: java
    log_processing_rules:
      # Exclude health check noise
      - type: exclude_at_match
        name: exclude_healthchecks
        pattern: "GET /health"
      # Multi-line log aggregation
      - type: multi_line
        name: java_stacktrace
        pattern: '^\d{4}-\d{2}-\d{2}'
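The two processing rules can be tried offline with a short simulation. This is only a sketch using Python's `re` module with the same patterns. The Agent has its own implementation, and the order used here (aggregate first, then exclude) is an assumption made for illustration.

```python
import re

NEW_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2}")  # multi_line pattern from the config
EXCLUDE = re.compile(r"GET /health")           # exclude_at_match pattern


def process_lines(lines):
    """Merge continuation lines into the preceding entry, then drop excluded entries."""
    entries = []
    for line in lines:
        if NEW_ENTRY.search(line) or not entries:
            entries.append(line)          # a dated line starts a new log entry
        else:
            entries[-1] += "\n" + line    # stack trace lines join the previous entry
    return [entry for entry in entries if not EXCLUDE.search(entry)]


raw = [
    "2024-05-01 12:00:00 INFO GET /health 200",
    "2024-05-01 12:00:01 ERROR payment failed",
    "java.lang.IllegalStateException: card declined",
    "    at com.example.Pay.charge(Pay.java:42)",
    "2024-05-01 12:00:02 INFO GET /orders 200",
]
for entry in process_lines(raw):
    print(repr(entry))
```

The health check line is dropped, and the three lines of the Java stack trace end up in a single entry.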
Log Pipeline Processing
Pipelines transform raw logs into structured, queryable data:
- Grok Parser — Extract structured fields from unstructured log lines
- Date Remapper — Set the official log timestamp from a parsed field
- Status Remapper — Map log severity (INFO, WARN, ERROR)
- Service Remapper — Assign the correct service name
- Category Processor — Classify logs into categories based on content
- Trace ID Remapper — Link logs to APM traces for correlation
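The processors above can be pictured as a chain of small transformations. The sketch below imitates a grok-style parse followed by date, status, and service remapping on a hypothetical log format. The regular expression, field names, and status map are assumptions for illustration, not Datadog's grok syntax.

```python
import re
from datetime import datetime, timezone

# Hypothetical format: "<date> <time> <LEVEL> [<service>] <message>"
LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) \[(?P<svc>[\w-]+)\] (?P<msg>.*)$")
STATUS_MAP = {"INFO": "info", "WARN": "warn", "WARNING": "warn", "ERROR": "error"}


def run_pipeline(raw):
    """Parse one raw line into a structured event (parse, then remap fields)."""
    m = LINE.match(raw)
    if m is None:
        # Unparsed lines keep the raw message; "info" is an assumed fallback status.
        return {"message": raw, "status": "info"}
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "timestamp": ts.isoformat(),                      # date remapper
        "status": STATUS_MAP.get(m["level"], "info"),     # status remapper
        "service": m["svc"],                              # service remapper
        "message": m["msg"],
    }


event = run_pipeline("2024-05-01 12:00:01 ERROR [checkout-api] payment failed")
print(event)
```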
Index Management
Use multiple indexes with different retention periods to control costs:
| Index | Filter | Retention | Purpose |
|---|---|---|---|
| errors-critical | status:error OR status:critical | 30 days | Incident investigation |
| security-audit | source:auth OR @action:login | 90 days | Compliance & audit |
| application-default | * (catch-all) | 7 days | General troubleshooting |
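Index order matters because a log is stored in the first index whose filter it matches. The sketch below models the three indexes from the table with plain Python predicates. It does not parse Datadog's query syntax, and the event field names are assumptions.

```python
# (index name, retention in days, filter predicate), evaluated top to bottom
INDEXES = [
    ("errors-critical", 30, lambda e: e.get("status") in {"error", "critical"}),
    ("security-audit", 90, lambda e: e.get("source") == "auth" or e.get("action") == "login"),
    ("application-default", 7, lambda e: True),  # catch-all must stay last
]


def route(event):
    """Return (index_name, retention_days) of the first index whose filter matches."""
    for name, retention_days, matches in INDEXES:
        if matches(event):
            return name, retention_days
    raise LookupError("no index matched")  # unreachable while a catch-all is listed


print(route({"status": "error", "source": "auth"}))
print(route({"status": "info", "source": "auth"}))
print(route({"status": "info", "source": "nginx"}))
```

In this model an error log from the auth source lands in errors-critical with 30-day retention, not in the 90-day audit index. If compliance requires the longer retention, the index order needs review.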
Dashboards & Monitors
Datadog dashboards come in two layouts: Timeboards (time-synchronized, ideal for troubleshooting; `layout_type: ordered` in the API) and Screenboards (free-form layout, ideal for status displays and NOC screens; `layout_type: free`).
Monitor Types
| Monitor Type | Triggers On | Best For |
|---|---|---|
| Metric | Threshold or change in metric value | CPU > 90%, error rate > 5% |
| APM | Service latency, error rate, or hit rate | p99 latency > 500ms on checkout service |
| Log | Log count matching a query | More than 100 errors in 5 minutes |
| Composite | Boolean combination of other monitors | Alert only when BOTH high CPU AND high error rate |
| SLO | Error budget consumption rate | Burning through monthly error budget too fast |
| Anomaly | Deviation from historical baseline | Traffic drop or unexpected spike detection |
| Forecast | Predicted threshold breach | Disk will be full in 48 hours |
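The SLO row depends on the idea of a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below shows that arithmetic with made-up numbers. It illustrates the general formula and is not Datadog's monitor implementation.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def days_until_budget_exhausted(rate, window_days=30):
    """At a constant burn rate, the whole budget lasts window_days / rate."""
    return float("inf") if rate <= 0 else window_days / rate


# Hypothetical last hour: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(50, 10_000, 0.999)
print(round(rate, 2), round(days_until_budget_exhausted(rate), 1))
```

A burn rate of 1 uses up the budget exactly at the end of the window. In this example the rate is about 5, so a 30-day budget would be gone in roughly 6 days.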
Dashboard JSON Definition
{
  "title": "Service Overview - Checkout API",
  "widgets": [
    {
      "definition": {
        "type": "timeseries",
        "title": "Request Rate",
        "requests": [
          {
            "q": "sum:trace.http.request.hits{service:checkout-api}.as_rate()",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "query_value",
        "title": "P99 Latency",
        "requests": [
          {
            "q": "p99:trace.http.request.duration{service:checkout-api}",
            "aggregator": "avg"
          }
        ],
        "precision": 0
      }
    }
  ],
  "layout_type": "ordered"
}
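If several services need the same overview, the definition can be generated instead of copied. The function below is a hypothetical helper that reproduces the JSON structure shown above for any service name. It only builds the document. Submitting it to the Datadog API is left out.

```python
import json


def service_overview(service, title_name):
    """Build the two-widget dashboard definition for one service."""
    scope = f"{{service:{service}}}"  # renders as {service:<name>}
    return {
        "title": f"Service Overview - {title_name}",
        "widgets": [
            {"definition": {
                "type": "timeseries",
                "title": "Request Rate",
                "requests": [{"q": f"sum:trace.http.request.hits{scope}.as_rate()",
                              "display_type": "bars"}],
            }},
            {"definition": {
                "type": "query_value",
                "title": "P99 Latency",
                "requests": [{"q": f"p99:trace.http.request.duration{scope}",
                              "aggregator": "avg"}],
                "precision": 0,
            }},
        ],
        "layout_type": "ordered",
    }


dashboard = service_overview("checkout-api", "Checkout API")
print(json.dumps(dashboard, indent=2))
```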
Cost Optimization
Datadog pricing is consumption-based. Without governance, costs can spiral quickly — especially with custom metrics and log ingestion. Understanding the cost model is essential for production deployments.
Cost Factors
| Cost Factor | Pricing Model | Typical Trap |
|---|---|---|
| Infrastructure Hosts | $15-23/host/month (Pro/Enterprise) | Container-per-host counting in Kubernetes |
| Custom Metrics | $0.05/metric/month (first 100 free per host) | Cardinality explosion from unbounded tags |
| Log Ingestion | $0.10/GB ingested | Debug logging left on in production |
| Log Indexing | $1.70/million events (15-day retention) | Indexing all logs instead of sampling |
| APM Spans | $36/host/month + ingested spans | Tracing every request without sampling |
| Synthetics | $5/10K API tests, $12/1K browser tests | Running tests every minute from all regions |
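A rough monthly estimate can be built from the figures in the table. The sketch below uses those prices as given, with the low end of the host price range as the default. Treat them as assumptions and confirm current pricing before budgeting. The workload numbers are invented.

```python
def estimate_monthly_cost(hosts, custom_metrics, log_gb, indexed_events_millions,
                          host_price=15.0):
    """Return a per-category cost breakdown in USD using the table's prices."""
    free_metrics = 100 * hosts                       # first 100 custom metrics per host are free
    billable_metrics = max(0, custom_metrics - free_metrics)
    return {
        "infrastructure": hosts * host_price,
        "custom_metrics": billable_metrics * 0.05,
        "log_ingestion": log_gb * 0.10,
        "log_indexing": indexed_events_millions * 1.70,
    }


# Hypothetical workload: 50 hosts, 20,000 custom metrics, 2 TB of logs, 500M indexed events.
breakdown = estimate_monthly_cost(50, 20_000, 2_000, 500)
total = sum(breakdown.values())
print({k: round(v, 2) for k, v in breakdown.items()}, round(total, 2))
```

For this workload, custom metrics and log indexing each cost at least as much as the hosts themselves, so governance effort is usually best spent there.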
Cost Reduction Strategies
- Tag-based metric filtering — Use metrics_without_limits (Datadog's Metrics without Limits feature) to drop unnecessary tag combinations while keeping aggregated data queryable
- Log exclusion filters — Filter out health checks, debug logs, and known noisy patterns before indexing (still searchable in Live Tail)
- APM trace sampling — Use head-based sampling at 10-20% for high-throughput services; Datadog retains error and high-latency traces automatically
- Log archives to S3/GCS — Archive all logs to cloud storage ($0.02/GB) for compliance, only index what you actively query
- Custom metric governance — Audit metrics monthly; remove unused metrics and restrict high-cardinality tags (user_id, request_id)
- Committed-use contracts — Annual commits provide 20-40% discounts vs. on-demand; forecast usage based on 3-month trailing data
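Head-based sampling means the keep-or-drop decision is made once, when the trace starts, and derived from the trace ID so that every service reaches the same answer. The sketch below illustrates that idea with a generic hash. It is not the algorithm Datadog's tracers use, and in a real deployment the rate would be set through the tracer's configuration (for example `DD_TRACE_SAMPLE_RATE`).

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Deterministic decision: the same trace_id always gives the same answer."""
    digest = hashlib.sha256(str(trace_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


# Roughly 15% of 10,000 hypothetical trace IDs are kept.
kept = sum(keep_trace(trace_id, 0.15) for trace_id in range(10_000))
print(kept)
```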
Cardinality trap: a metric such as http.requests{endpoint, method, status, user_id} with 100 endpoints × 4 methods × 10 statuses × 10,000 users produces up to 40 million unique time series. Always validate tag cardinality before deploying new instrumentation.
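The worst-case series count is simply the product of each tag's distinct values, so it is easy to check before shipping a new metric. The sketch below redoes the arithmetic for the example above and shows the effect of dropping the user_id tag. The dictionary of tag cardinalities is illustrative input.

```python
from math import prod

# Distinct values per tag for the http.requests example
tag_values = {"endpoint": 100, "method": 4, "status": 10, "user_id": 10_000}


def series_count(cardinalities):
    """Upper bound on unique time series: product of per-tag cardinalities."""
    return prod(cardinalities.values())


with_user_id = series_count(tag_values)
without_user_id = series_count({k: v for k, v in tag_values.items() if k != "user_id"})
print(with_user_id, without_user_id)  # 40000000 4000
```

Removing the one unbounded tag cuts the upper bound from 40 million series to 4,000.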
When to Choose Datadog
Datadog: Strengths & Limitations
Strengths
- Unified platform — Single pane of glass across metrics, traces, logs, and synthetics with seamless correlation
- 450+ integrations — Out-of-the-box dashboards and checks for virtually every technology
- Zero infrastructure management — Fully managed SaaS; no clusters to scale, upgrade, or maintain
- Excellent UX — Intuitive interface, fast search, powerful query language
- AI/ML features — Anomaly detection, forecasting, Watchdog auto-discovery of issues
- Enterprise features — RBAC, audit trails, SSO, data residency options
Limitations
- Cost at scale — Can become expensive (>$100K/year) for large deployments without governance
- Vendor lock-in — Proprietary query language, agent, and APIs make migration difficult
- Custom metric pricing — Cardinality-based billing punishes high-dimensional data
- Data retention limits — Default 15 months for metrics and 15 days for indexed logs and spans; pay more for longer retention
- No self-hosted option — Data leaves your network; some regulated industries can't use it
Best For
- Teams that want unified observability without managing infrastructure
- Organizations with 50-5,000 hosts running diverse technology stacks
- Companies that value developer velocity over infrastructure cost optimization
- Environments running on AWS/GCP/Azure with cloud-native architectures