The LGTM Stack
Grafana Cloud is built on four open-source projects — collectively called the LGTM stack — each purpose-built for a specific telemetry signal. Unlike monolithic platforms that use a single proprietary backend, Grafana Labs maintains dedicated, horizontally-scalable systems optimized for each data type.
flowchart TD
A[Applications & Infrastructure] --> B[OTel Collector / Grafana Agent / Alloy]
B --> C[Grafana Mimir — Metrics]
B --> D[Grafana Loki — Logs]
B --> E[Grafana Tempo — Traces]
C --> F[Grafana Dashboards]
D --> F
E --> F
F --> G[Alertmanager / Grafana OnCall]
style A fill:#3B9797,color:#fff
style B fill:#132440,color:#fff
style C fill:#BF092F,color:#fff
style D fill:#16476A,color:#fff
style E fill:#16476A,color:#fff
style F fill:#3B9797,color:#fff
style G fill:#132440,color:#fff
| Component | Signal | Query Language | License |
|---|---|---|---|
| Grafana Mimir | Metrics | PromQL | AGPLv3 |
| Grafana Loki | Logs | LogQL | AGPLv3 |
| Grafana Tempo | Traces | TraceQL | AGPLv3 |
| Grafana | Visualization | n/a | AGPLv3 |
| Grafana Alloy | Collection (Agent) | n/a (Alloy configuration syntax, formerly River) | Apache 2.0 |
The critical architectural advantage is that every protocol is open. Mimir speaks Prometheus remote_write, Loki ingests logs through its HTTP push API and OTLP (with syslog, Fluentd, and Fluent Bit feeding it through collectors and plugins), and Tempo ingests OpenTelemetry, Jaeger, and Zipkin natively. You can migrate away from Grafana Cloud at any time without rewriting instrumentation, because the same agents and SDKs work with self-hosted versions.
Grafana Mimir (Managed Metrics)
Grafana Mimir is the metrics backend powering Grafana Cloud: a horizontally scalable, multi-tenant, Prometheus-compatible time-series database. If you already know PromQL, you know how to query Mimir. It is designed for very large workloads (Grafana Labs reports testing it at 1 billion active series) while keeping query performance consistent.
Prometheus vs Mimir Comparison
| Aspect | Prometheus (Self-Hosted) | Grafana Mimir (Grafana Cloud) |
|---|---|---|
| Retention | Limited by local disk (typically 15-30 days) | 13 months default, configurable up to 2 years |
| High Availability | Manual — dual Prometheus + deduplication | Built-in — triple replication across zones |
| Horizontal Scaling | Not native: requires federation/Thanos/Cortex | Native: tested by Grafana Labs at 1 billion active series |
| Query Performance | Degrades with cardinality and range | Query sharding across workers; consistent latency |
| Global View | Requires federation for cross-cluster queries | Single pane across all clusters and regions |
| Maintenance | Manual upgrades, compaction tuning, storage management | Fully managed — zero operational overhead |
| Cost | Infrastructure + engineering time | Per active series pricing ($8/1K series/month) |
Sending Metrics to Grafana Cloud
// Grafana Alloy configuration for remote_write to Grafana Cloud
// (Alloy comments use //; newer Alloy releases prefer sys.env() over env())
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"

    basic_auth {
      username = env("GRAFANA_CLOUD_PROM_USER")
      password = env("GRAFANA_CLOUD_API_KEY")
    }

    // queue_config is a sub-block of endpoint, not of the component itself
    queue_config {
      max_samples_per_send = 1000
      batch_send_deadline  = "5s"
      min_backoff          = "30ms"
      max_backoff          = "5s"
    }
  }
}
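To make the retry settings concrete, the sketch below computes the delay sequence implied by min_backoff and max_backoff. It assumes the client doubles the delay after each failed send and caps it at max_backoff, which is the behavior documented for Prometheus remote write. The function name backoff_schedule is illustrative and not part of Alloy.

```python
def backoff_schedule(min_backoff_ms: int, max_backoff_ms: int, retries: int) -> list[int]:
    """Return retry delays in milliseconds, doubling each time up to the cap."""
    delays = []
    delay = min_backoff_ms
    for _ in range(retries):
        delays.append(delay)
        # Double the delay for the next retry, never exceeding max_backoff
        delay = min(delay * 2, max_backoff_ms)
    return delays


# Values taken from the queue_config block above: 30ms initial, 5s cap
schedule = backoff_schedule(30, 5000, 10)
print(schedule)
```

With these values the delay reaches the 5s cap on the ninth retry, so a prolonged outage settles into one attempt every five seconds per shard.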
Grafana Loki (Managed Logs)
Grafana Loki takes a fundamentally different approach to log storage than Elasticsearch or Splunk. Instead of full-text indexing every log line, Loki only indexes labels (metadata like pod name, namespace, service) and stores log content in compressed chunks on cheap object storage. This makes it orders of magnitude cheaper to operate at scale.
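The label-only indexing idea can be sketched with a toy in-memory store. This is a simplified model for intuition, not Loki's actual storage format: the class ToyLogStore and its methods are hypothetical. The point it illustrates is that a query first narrows down streams by label match, then scans only the raw lines in those streams.

```python
from collections import defaultdict


class ToyLogStore:
    """Toy model: index only label sets, keep raw lines in per-stream chunks."""

    def __init__(self) -> None:
        # The "index": one entry per unique label set, pointing at its chunk
        self.streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        wanted = set(selector.items())
        results = []
        for label_set, chunk in self.streams.items():
            # Step 1: cheap label match against the small index
            if wanted <= label_set:
                # Step 2: brute-force scan, but only over matching chunks
                results.extend(line for line in chunk if contains in line)
        return results


store = ToyLogStore()
store.push({"namespace": "production", "app": "checkout-service"}, "error: payment declined")
store.push({"namespace": "production", "app": "checkout-service"}, "order placed")
store.push({"namespace": "staging", "app": "checkout-service"}, "error: timeout")

matches = store.query({"namespace": "production"}, contains="error")
print(matches)
```

The index stays small because it grows with the number of distinct label sets, not with log volume. The trade-off is that a query with a broad selector has to scan a lot of chunk data.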
LogQL Query Examples
# Basic log stream selection by labels
{namespace="production", app="checkout-service"}
# Filter log lines containing specific text
{namespace="production", app="checkout-service"} |= "error"
# Parse JSON and filter on an extracted field
{namespace="production"} | json | status_code >= 500
# Log-based metrics: per-second rate of error lines, averaged over a 5-minute window
rate({namespace="production", app="checkout-service"} |= "error" [5m])
# Top 5 pods by error rate
topk(5, sum by(pod) (rate({namespace="production"} |= "error" [5m])))
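It helps to see what rate(... [5m]) actually returns. LogQL's rate is the number of matching entries in the range divided by the range length in seconds, so the result is per second, not per minute. The sketch below reproduces that arithmetic on made-up timestamps; log_rate is a hypothetical helper, not a Loki API.

```python
def log_rate(timestamps: list[float], now: float, window_s: float) -> float:
    """Per-second rate of entries whose timestamp falls in (now - window_s, now]."""
    in_window = [t for t in timestamps if now - window_s < t <= now]
    return len(in_window) / window_s


# 60 matching error lines, one every 5 seconds, all inside the last 5 minutes
error_times = [1001.0 + 5 * i for i in range(60)]
per_second = log_rate(error_times, now=1300.0, window_s=300.0)
per_minute = per_second * 60  # multiply by 60 if you want errors per minute

print(per_second, per_minute)
```

If you want a per-minute number directly in a dashboard, multiplying the rate by 60 or using count_over_time with a [1m] range gives the same information.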
Grafana Alloy Log Collection Config
// Grafana Alloy: collect logs from Kubernetes pods
// Discover pods to tail (referenced below as discovery.kubernetes.pods.targets)
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  // Send lines through the processing pipeline before writing them
  forward_to = [loki.process.pipeline.receiver]
}

// Add labels and process log lines
loki.process "pipeline" {
  forward_to = [loki.write.grafana_cloud.receiver]

  stage.json {
    expressions = {
      level     = "level",
      message   = "msg",
      trace_id  = "trace_id",
      // Extracted so that stage.timestamp below has a source value
      timestamp = "timestamp",
    }
  }

  stage.labels {
    values = {
      level = "",
    }
  }

  stage.timestamp {
    source = "timestamp"
    format = "RFC3339"
  }
}

// Write to Grafana Cloud Loki endpoint
loki.write "grafana_cloud" {
  endpoint {
    url = "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push"

    basic_auth {
      username = env("GRAFANA_CLOUD_LOKI_USER")
      password = env("GRAFANA_CLOUD_API_KEY")
    }
  }
}
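If the pipeline stages feel abstract, the following sketch mimics what the json, labels, and timestamp stages do to a single log line. It is an approximation for intuition only: process_line is a hypothetical function, and it assumes each line is a JSON object carrying level, msg, trace_id, and an RFC3339 timestamp field.

```python
import json
from datetime import datetime, timezone


def process_line(raw: str) -> dict:
    """Mimic the json, labels, and timestamp stages for one log line."""
    doc = json.loads(raw)
    # stage.json: copy selected fields into the extracted map
    extracted = {
        "level": doc.get("level"),
        "message": doc.get("msg"),
        "trace_id": doc.get("trace_id"),
        "timestamp": doc.get("timestamp"),
    }
    # stage.labels: promote only `level` to an indexed label
    labels = {"level": extracted["level"]}
    # stage.timestamp: use the parsed RFC3339 value as the entry time
    ts = datetime.fromisoformat(extracted["timestamp"].replace("Z", "+00:00"))
    return {"labels": labels, "timestamp": ts.astimezone(timezone.utc), "line": raw}


entry = process_line(
    '{"level": "error", "msg": "payment failed", '
    '"trace_id": "abc123", "timestamp": "2024-05-01T12:00:00Z"}'
)
print(entry["labels"], entry["timestamp"].isoformat())
```

Note that trace_id is extracted but deliberately not promoted to a label: a label with one value per request would create a new stream for every trace and inflate the index.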
Grafana Tempo (Managed Traces)
Grafana Tempo is a distributed tracing backend that stores traces on object storage without any indexing. It relies on trace IDs for direct lookups, and the newer TraceQL query language for searching across trace attributes. Like Loki, this architecture makes it extremely cost-effective at scale.
Tempo vs Jaeger Comparison
| Aspect | Jaeger (Self-Hosted) | Grafana Tempo (Grafana Cloud) |
|---|---|---|
| Storage Backend | Elasticsearch, Cassandra, or BadgerDB | Object storage (S3/GCS) — no database needed |
| Indexing | Full span indexing in Elasticsearch | No indexing — ID-based lookup + TraceQL search |
| Query Language | Tag-based search UI | TraceQL — expressive query language with span filtering |
| Cost at Scale | High — Elasticsearch storage and compute costs | Low — object storage only (~$0.02/GB/month) |
| Ingestion Protocols | Jaeger, Zipkin, OpenTelemetry | Jaeger, Zipkin, OpenTelemetry, Kafka |
| Retention | Limited by storage capacity | 30 days default (configurable) |
| Service Graph | Requires separate dependency calculation | Built-in service graph from span data |
TraceQL Query Examples
# Find traces where checkout service had errors
{ resource.service.name = "checkout-service" && status = error }
# Find slow database spans (> 500ms)
{ span.db.system = "postgresql" && duration > 500ms }
# Find traces crossing service boundaries with high latency
{ resource.service.name = "api-gateway" && duration > 2s } >> { resource.service.name = "payment-service" }
# Search by custom span attribute
{ span.http.status_code >= 500 && resource.deployment.environment = "production" }
# Aggregate: count error spans per service over time (TraceQL metrics query)
{ status = error } | count_over_time() by (resource.service.name)
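A detail that is easy to miss in TraceQL: conditions inside one pair of braces must all hold on the same span. The sketch below imitates that matching rule over in-memory data. The Span class and matching_traces function are hypothetical stand-ins, not Tempo's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def matching_traces(traces: dict[str, list[Span]], predicate) -> list[str]:
    """Return IDs of traces in which at least one span satisfies the predicate."""
    return [tid for tid, spans in traces.items() if any(predicate(s) for s in spans)]


traces = {
    "t1": [Span("api-gateway", 40), Span("orders", 820, {"db.system": "postgresql"})],
    "t2": [Span("api-gateway", 35), Span("orders", 120, {"db.system": "postgresql"})],
    "t3": [Span("search", 900, {"db.system": "elasticsearch"})],
}

# Rough equivalent of: { span.db.system = "postgresql" && duration > 500ms }
slow_pg = matching_traces(
    traces,
    lambda s: s.attributes.get("db.system") == "postgresql" and s.duration_ms > 500,
)
print(slow_pg)
```

Trace t3 has a slow span and t2 has a PostgreSQL span, but only t1 has a single span that is both, so it is the only match.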
Grafana Alerting
Grafana Cloud provides a unified alerting system that can evaluate alert rules against any data source — Mimir (Prometheus), Loki (logs), Tempo (traces), and even external sources like Elasticsearch or CloudWatch. Alert rules are defined using the same query languages (PromQL, LogQL) and managed alongside dashboards.
Alerting Components Comparison
| Component | Role | Key Features | Best For |
|---|---|---|---|
| Grafana Alerting | Alert rule evaluation & routing | Multi-source rules, recording rules, silences, mute timings | Unified alerting across all data sources |
| Alertmanager | Alert deduplication & grouping | Routing trees, inhibition, grouping, repeat intervals | Complex routing logic for Prometheus-style alerts |
| Grafana OnCall | Incident response & escalation | On-call schedules, escalation chains, auto-acknowledge, mobile app | PagerDuty-style on-call management within Grafana |
| Grafana IRM | Incident lifecycle management | Declare incidents, timelines, postmortems, Slack integration | Full incident response lifecycle tracking |
Alert Rule Example (PromQL)
# Grafana Alert Rule: High Error Rate
apiVersion: 1
groups:
  - orgId: 1
    name: production-alerts
    folder: SRE
    interval: 1m
    rules:
      - uid: high-error-rate
        title: "High Error Rate - Checkout Service"
        condition: C
        data:
          - refId: A
            datasourceUid: grafana-cloud-prom
            # Query window in seconds relative to evaluation time
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: |
                sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="checkout"}[5m]))
              instant: true
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              # Apply the threshold to the result of query A
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]
        noDataState: NoData
        execErrState: Error
        for: 5m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout error rate above 5%"
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"
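The interaction between interval: 1m and for: 5m is worth spelling out. The sketch below simulates it under one assumption: a rule becomes Pending on the first breaching evaluation and only starts firing once the condition has held for the full pending period. The evaluate function is illustrative, not Grafana's scheduler.

```python
def evaluate(ratios: list[float], threshold: float, for_evals: int) -> list[str]:
    """Return the rule state after each evaluation.

    for_evals is the pending period in evaluation intervals
    (for: 5m with interval: 1m gives 5).
    """
    states = []
    breaching = 0  # consecutive evaluations above the threshold
    for ratio in ratios:
        if ratio > threshold:
            breaching += 1
            # The first breach starts the clock; fire once for_evals intervals have passed
            states.append("Alerting" if breaching > for_evals else "Pending")
        else:
            breaching = 0
            states.append("Normal")
    return states


# One error ratio per minute: 5xx requests divided by all requests
ratios = [0.01, 0.06, 0.07, 0.08, 0.06, 0.09, 0.07, 0.02]
states = evaluate(ratios, threshold=0.05, for_evals=5)
print(states)
```

A single dip below the threshold resets the counter, which is why a pending period filters out short spikes at the cost of a slower page.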
Synthetic Monitoring & k6
Grafana Cloud includes synthetic monitoring for probing endpoints from global locations, plus native k6 Cloud integration for load testing. Synthetic checks run from Grafana's global probe network (or your private probes) and report availability, latency, and certificate status directly into Grafana dashboards.
Synthetic Check Types
| Check Type | Purpose | Frequency | Example Use Case |
|---|---|---|---|
| HTTP | Endpoint availability & response validation | 10s – 120s intervals | Monitor API health endpoint from 20+ global locations |
| Ping (ICMP) | Network-level reachability | 10s – 60s intervals | Verify server reachability and measure packet loss |
| DNS | DNS resolution validation | 30s – 120s intervals | Detect DNS propagation issues or hijacking |
| TCP | Port connectivity | 10s – 60s intervals | Verify database port is open and responsive |
| Traceroute | Network path analysis | 60s – 300s intervals | Identify network hops causing latency spikes |
| Multi-step (Scripted) | Complex user flows via k6 scripts | 60s – 600s intervals | Login → Browse → Add to Cart → Checkout flow |
| Browser (k6) | Real browser rendering (Chromium) | 60s – 600s intervals | Measure Core Web Vitals (LCP, INP, CLS) synthetically |
k6 Cloud Integration
Grafana acquired k6 (the popular open-source load testing tool) and integrated it directly into Grafana Cloud. You can run load tests from the CLI, visualize results in Grafana dashboards, and set performance thresholds as pass/fail gates in CI/CD pipelines.
# Run a k6 load test and stream results to Grafana Cloud
k6 run --out cloud script.js
// Example k6 script for load testing
// (save as script.js)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up
    { duration: '5m', target: 100 }, // Stay at 100 users
    { duration: '2m', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests under 500ms
    http_req_failed: ['rate<0.01'],   // Less than 1% failures
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
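To see how the two thresholds act as a pass/fail gate, here is the same logic written out over raw samples. This is a sketch with assumptions: it uses a nearest-rank percentile, while k6's own percentile calculation may interpolate slightly differently, and the function names are hypothetical.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def thresholds_pass(durations_ms: list[float], failed: list[bool]) -> bool:
    """Mirror p(95)<500 on durations and rate<0.01 on failures."""
    failure_rate = sum(failed) / len(failed)
    return p95(durations_ms) < 500 and failure_rate < 0.01


durations = [120.0] * 95 + [450.0] * 4 + [900.0]
healthy = thresholds_pass(durations, [False] * 100)
flaky = thresholds_pass(durations, [True] * 2 + [False] * 98)
print(healthy, flaky)
```

The single 900ms outlier does not break the latency threshold because p95 ignores the slowest 5% of requests, but a 2% failure rate fails the run regardless of latency.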
Pricing & Free Tier
Grafana Cloud stands out with one of the most generous free tiers in the observability market — and transparent, usage-based pricing that avoids per-seat or per-host models for the core platform.
Tier Comparison
| Feature | Free | Pro ($0 base + usage) | Advanced (Custom) |
|---|---|---|---|
| Metrics (Active Series) | 10,000 series | $8 per 1,000 series/month | Volume discounts |
| Logs | 50 GB/month | $0.50 per GB | Volume discounts |
| Traces | 50 GB/month | $0.50 per GB | Volume discounts |
| Metrics Retention | 14 days | 13 months (configurable) | Up to 2 years |
| Logs/Traces Retention | 14 days | 30 days (configurable) | Custom |
| Users | 3 (free forever) | Unlimited (included) | Unlimited |
| Synthetic Monitoring | 5 checks | Included with usage pricing | Custom |
| Grafana OnCall | Included (limited) | Included | Included |
| Support | Community | Standard (business hours) | Premium 24/7 |
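A quick way to sanity-check a bill is to apply the table's unit prices to your own volumes. The sketch below does that, assuming the Pro tier charges only for usage above the free allotments listed in the table. That assumption, and the function name monthly_cost, are mine, so check the current pricing page before relying on the numbers.

```python
def monthly_cost(active_series: int, logs_gb: float, traces_gb: float) -> float:
    """Estimate a monthly bill in USD from the unit prices in the tier table."""
    # Assumed included usage: 10,000 series, 50 GB logs, 50 GB traces
    billable_series = max(0, active_series - 10_000)
    billable_logs = max(0.0, logs_gb - 50)
    billable_traces = max(0.0, traces_gb - 50)
    return (
        billable_series / 1000 * 8.0   # $8 per 1,000 active series
        + billable_logs * 0.50         # $0.50 per GB of logs
        + billable_traces * 0.50       # $0.50 per GB of traces
    )


estimate = monthly_cost(active_series=60_000, logs_gb=250, traces_gb=100)
print(f"${estimate:.2f}")
```

In this example metrics account for $400 of the $525 total, which is typical: active series are usually the first thing to optimize.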
Cost Optimization Strategies
- Reduce active series cardinality — Use recording rules to pre-aggregate high-cardinality metrics and drop unused labels at the agent level. A 10x reduction in series count translates directly to 10x cost reduction.
- Use Adaptive Metrics — Grafana Cloud's built-in feature identifies metrics that aren't queried by any dashboard or alert rule. Automatically aggregate or drop unused series to eliminate waste.
- Filter logs at the agent — Drop debug/trace-level logs before they reach Loki. Use Alloy's pipeline stages to filter, redact, or sample verbose log streams.
- Implement trace sampling — Use tail-based sampling in Grafana Alloy to keep only interesting traces (errors, high-latency, specific services) and drop routine traces.
- Leverage the free tier strategically — Use the free tier for staging/dev environments. Reserve paid capacity for production workloads only.
- Set usage alerts — Configure billing alerts in Grafana Cloud to notify before usage spikes cause bill shock. Set hard limits per data source.
Use the grafanacloud_instance_active_series metric to monitor your usage.
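The cardinality advice above is easier to act on with numbers. In the worst case, the series count for one metric is the product of the distinct values of each label, so dropping a single high-cardinality label can shrink it dramatically. The label names and counts below are invented for illustration, and the $8 per 1,000 series price is the one quoted in the tier table.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


labels = {"service": 20, "status_code": 5, "pod": 50, "endpoint": 40}

before = series_count(labels)
# Drop the per-pod label at the agent, for example by aggregating it away
after = series_count({k: v for k, v in labels.items() if k != "pod"})

cost_before = before / 1000 * 8
cost_after = after / 1000 * 8
print(before, after, cost_before, cost_after)
```

Real workloads rarely hit the full product because not every label combination occurs, but the multiplicative effect is why one unbounded label can dominate a bill.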
When to Choose Grafana Cloud
Grafana Cloud: Strengths & Limitations
Strengths
- No vendor lock-in — All protocols are open standards (Prometheus remote_write, OTLP, Jaeger, syslog). Switch providers or go self-hosted anytime without rewriting instrumentation.
- Generous free tier — 10K metrics series, 50 GB logs, 50 GB traces — enough for small production workloads at zero cost
- Cost-effective at scale — Loki and Tempo use object storage; no expensive Elasticsearch or dedicated trace databases needed
- Open-source ecosystem — Thousands of community dashboards, exporters, and integrations; massive Grafana plugin library
- No per-user pricing — Unlimited users on Pro tier; teams of 50+ don't pay seat fees
- Self-hosted escape hatch — Run the same LGTM stack on your own infrastructure if regulations or costs require it
Limitations
- Assembly required — Unlike Datadog's unified UI, you configure separate data sources, dashboards, and alerting rules; steeper initial setup
- Multiple query languages — PromQL for metrics, LogQL for logs, TraceQL for traces; no single unified query language across all signals
- Log search limitations — Loki's label-only indexing means grep-style full-text searches across unfiltered logs are slow compared to Elasticsearch
- APM is newer — Application Observability (APM) features are less mature than Datadog or New Relic's decade-old APM offerings
- Enterprise features gated — RBAC, audit logs, and advanced security require Advanced tier (custom pricing)
Best For
- Teams already using Prometheus and Grafana who want managed infrastructure without lock-in
- Cost-conscious organizations that need observability at scale without per-host or per-seat pricing
- OpenTelemetry-first shops that want an OTel-native backend for all signals
- Multi-cloud or hybrid environments where open standards enable portability
- Organizations with regulatory requirements that may need to self-host in the future
Platform Comparison Summary
With Datadog, New Relic, and Grafana Cloud each explored in depth, the following table consolidates the comparison to guide your platform selection:
| Dimension | Datadog | New Relic | Grafana Cloud |
|---|---|---|---|
| Pricing Model | Per-host + per-GB + add-ons | Per-GB ingest + per-user seats | Per-active-series + per-GB (logs/traces) |
| Free Tier | 14-day trial only | 100 GB/month + 1 full user (forever) | 10K series + 50 GB logs + 50 GB traces (forever) |
| Vendor Lock-In | High: proprietary agents, query syntax, integrations | Medium: NRQL proprietary, but OTel supported | Low: all open standards, self-hosted option exists |
| Protocol Support | Proprietary + OTel (growing) | Proprietary + OTel (native OTLP) | Native OTel, Prometheus, Jaeger, Zipkin, syslog |
| Self-Hosted Option | No | No | Yes — full LGTM stack is open source |
| Query Language | Datadog query syntax (proprietary) | NRQL (SQL-like, proprietary) | PromQL + LogQL + TraceQL (open) |
| APM Maturity | Excellent — 10+ years, auto-instrumentation | Excellent — 15+ years, deep language support | Growing — Application Observability is newer |
| Key Strength | Unified UX, 750+ integrations, AI/ML | NRQL flexibility, generous free tier, entity model | Open standards, no lock-in, cost at scale |
| Key Weakness | Expensive at scale, billing unpredictability | User seat costs, 8-day default retention | Assembly required, multiple query languages |
| Ideal For | Teams wanting polished UX with budget for premium tooling | Query-driven teams wanting full-stack at predictable cost | Open-source-first teams wanting portability and cost control |