Platform Deep Dive: Grafana Cloud

The LGTM Stack

Grafana Cloud is built on four open-source projects — collectively called the LGTM stack — each purpose-built for a specific telemetry signal. Unlike monolithic platforms that use a single proprietary backend, Grafana Labs maintains dedicated, horizontally-scalable systems optimized for each data type.

Grafana Cloud LGTM Architecture

flowchart TD
    A[Applications & Infrastructure] --> B[OTel Collector / Grafana Agent / Alloy]
    B --> C[Grafana Mimir — Metrics]
    B --> D[Grafana Loki — Logs]
    B --> E[Grafana Tempo — Traces]
    C --> F[Grafana Dashboards]
    D --> F
    E --> F
    F --> G[Alertmanager / Grafana OnCall]

    style A fill:#3B9797,color:#fff
    style B fill:#132440,color:#fff
    style C fill:#BF092F,color:#fff
    style D fill:#16476A,color:#fff
    style E fill:#16476A,color:#fff
    style F fill:#3B9797,color:#fff
    style G fill:#132440,color:#fff

Component	Signal	Query Language	License
Grafana Mimir	Metrics	PromQL	AGPLv3
Grafana Loki	Logs	LogQL	AGPLv3
Grafana Tempo	Traces	TraceQL	AGPLv3
Grafana	Visualization	n/a	AGPLv3
Grafana Alloy	Collection (Agent)	Alloy configuration syntax (formerly River)	Apache 2.0

The critical architectural advantage is that every protocol is open. Mimir accepts Prometheus remote_write; Loki accepts logs through its HTTP push API and OTLP, with syslog and Fluentd/Fluent Bit supported through collectors and output plugins; and Tempo ingests OpenTelemetry, Jaeger, and Zipkin natively. You can migrate away from Grafana Cloud without rewriting instrumentation, because the same agents and SDKs work with the self-hosted versions.

Grafana Mimir (Managed Metrics)

Grafana Mimir is the metrics backend powering Grafana Cloud — a horizontally scalable, multi-tenant Prometheus-compatible time-series database. If you already know PromQL, you know how to query Mimir. It's designed to handle billions of active series with consistent query performance.

Prometheus vs Mimir Comparison

Aspect	Prometheus (Self-Hosted)	Grafana Mimir (Grafana Cloud)
Retention	Limited by local disk (typically 15-30 days)	13 months default, configurable up to 2 years
High Availability	Manual — dual Prometheus + deduplication	Built-in — triple replication across zones
Horizontal Scaling	Not native — requires federation/Thanos/Cortex	Native — scales to billions of active series
Query Performance	Degrades with cardinality and range	Query sharding across workers; consistent latency
Global View	Requires federation for cross-cluster queries	Single pane across all clusters and regions
Maintenance	Manual upgrades, compaction tuning, storage management	Fully managed — zero operational overhead
Cost	Infrastructure + engineering time	Per active series pricing ($8/1K series/month)

Native PromQL Support: Mimir is PromQL-compatible, so existing Prometheus queries, recording rules, and alert rules work without modification. Grafana dashboards built on those queries carry over as well, which keeps migration effort for the query layer close to zero.

Sending Metrics to Grafana Cloud

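// Grafana Alloy configuration for remote_write to Grafana Cloud
// (Alloy syntax uses // for comments)
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"

    basic_auth {
      // Credentials are read from environment variables;
      // newer Alloy releases also offer sys.env() for this
      username = env("GRAFANA_CLOUD_PROM_USER")
      password = env("GRAFANA_CLOUD_API_KEY")
    }

    // queue_config is a sub-block of endpoint, not of the component
    queue_config {
      max_samples_per_send = 1000
      batch_send_deadline  = "5s"
      min_backoff          = "30ms"
      max_backoff          = "5s"
    }
  }
}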

Grafana Loki (Managed Logs)

Grafana Loki takes a fundamentally different approach to log storage than Elasticsearch or Splunk. Instead of full-text indexing every log line, Loki only indexes labels (metadata like pod name, namespace, service) and stores log content in compressed chunks on cheap object storage. This makes it orders of magnitude cheaper to operate at scale.
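
To make the label-only indexing idea concrete, here is a toy model in Python. It is a sketch for intuition only, not Loki's implementation: the class name `TinyLabelIndexedStore` and its methods are hypothetical, and real Loki stores compressed chunks on object storage rather than Python lists.

```python
from collections import defaultdict


class TinyLabelIndexedStore:
    """Toy model of label-only indexing: labels are indexed, log lines are not."""

    def __init__(self):
        # One index entry per unique label set (a "stream"), not per log line.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        key = frozenset(labels.items())
        self.streams[key].append(line)

    def query(self, selector, contains=""):
        """Select streams by label match, then scan only their lines for text."""
        wanted = set(selector.items())
        results = []
        for key, lines in self.streams.items():
            if wanted <= key:  # cheap step: match on indexed labels
                # expensive step: brute-force scan of unindexed content
                results.extend(line for line in lines if contains in line)
        return results


store = TinyLabelIndexedStore()
store.push({"namespace": "production", "app": "checkout-service"}, "error: card declined")
store.push({"namespace": "production", "app": "checkout-service"}, "order placed")
store.push({"namespace": "staging", "app": "checkout-service"}, "error: timeout")

matches = store.query({"namespace": "production"}, contains="error")
index_entries = len(store.streams)
```

Three log lines produce only two index entries, one per label set. The text filter runs only over the streams the labels selected, which is why narrow label selectors matter so much for query speed.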

LogQL Query Examples

# Basic log stream selection by labels
{namespace="production", app="checkout-service"}

# Filter log lines containing specific text
{namespace="production", app="checkout-service"} |= "error"

# Parse JSON logs, then filter on an extracted field
{namespace="production"} | json | status_code >= 500

# Log-based metric: per-second rate of error lines, averaged over 5 minutes
rate({namespace="production", app="checkout-service"} |= "error" [5m])

# Top 5 pods by error rate
topk(5, sum by(pod) (rate({namespace="production"} |= "error" [5m])))
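
The `rate(... [5m])` queries above return a per-second rate, not a per-minute count. The following Python sketch mimics that arithmetic on an in-memory list; the function name `loki_style_rate` and the sample data are hypothetical, and the exact window-boundary handling is an assumption rather than a statement about Loki internals.

```python
def loki_style_rate(entries, needle, now, window_seconds):
    """Per-second rate of lines containing `needle` within the trailing window.

    entries: iterable of (unix_timestamp_seconds, line) tuples.
    """
    start = now - window_seconds
    matching = sum(
        1 for ts, line in entries if start < ts <= now and needle in line
    )
    return matching / window_seconds


entries = [
    (1010, "error: db timeout"),
    (1100, "ok"),
    (1200, "error: db timeout"),
    (1290, "error: refused"),
    (600, "error: too old"),  # outside the 5-minute window
]

rate = loki_style_rate(entries, "error", now=1300, window_seconds=300)
errors_per_minute = rate * 60  # convert if a per-minute figure is wanted
```

Three matching lines in a 300-second window give a rate of 0.01 per second. Multiply by 60, or use `count_over_time` with a `[1m]` range in LogQL, when a per-minute count is what the dashboard should show.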

Grafana Alloy Log Collection Config

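// Grafana Alloy: collect logs from Kubernetes pods

// Discover pods through the Kubernetes API
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail pod logs and hand them to the processing pipeline
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.process.pipeline.receiver]
}

// Add labels and process log lines
loki.process "pipeline" {
  forward_to = [loki.write.grafana_cloud.receiver]

  // Extract fields from JSON log lines, including the timestamp used below
  stage.json {
    expressions = {
      level     = "level",
      message   = "msg",
      trace_id  = "trace_id",
      timestamp = "timestamp",
    }
  }

  // Promote only the low-cardinality "level" field to a label
  stage.labels {
    values = {
      level = "",
    }
  }

  stage.timestamp {
    source = "timestamp"
    format = "RFC3339"
  }
}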

// Write to Grafana Cloud Loki endpoint
loki.write "grafana_cloud" {
  endpoint {
    url = "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push"

    basic_auth {
      username = env("GRAFANA_CLOUD_LOKI_USER")
      password = env("GRAFANA_CLOUD_API_KEY")
    }
  }
}
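
For readers curious what reaches the push endpoint, the sketch below builds a JSON body in the shape Loki's `/loki/api/v1/push` API accepts. The helper name `build_loki_push_payload` is hypothetical. Alloy itself normally sends a compressed protobuf encoding, so this JSON form is shown only because it is easier to read.

```python
import json
import time


def build_loki_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for Loki's push API: one stream with several lines."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [<unix epoch in nanoseconds, as a string>, <log line>].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_loki_push_payload(
    {"namespace": "production", "app": "checkout-service"},
    ["order placed", "error: card declined"],
    now_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)  # send with Content-Type: application/json
```

The labels travel once per stream while the lines travel as plain values, which mirrors the split between the indexed labels and the unindexed content described earlier.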

Cost Advantage Over Elasticsearch: Loki indexes only labels, not log content. This means the index is typically 100-1000x smaller than an equivalent Elasticsearch deployment. Storage costs drop dramatically because compressed log chunks sit on object storage (S3/GCS) at roughly $0.02/GB/month. The trade-off: grep-style queries over broad, loosely filtered streams are slower than indexed full-text search, while queries narrowed by labels stay fast.

Grafana Tempo (Managed Traces)

Grafana Tempo is a distributed tracing backend that stores traces on object storage without any indexing. It relies on trace IDs for direct lookups, and the newer TraceQL query language for searching across trace attributes. Like Loki, this architecture makes it extremely cost-effective at scale.

Tempo vs Jaeger Comparison

Aspect	Jaeger (Self-Hosted)	Grafana Tempo (Grafana Cloud)
Storage Backend	Elasticsearch, Cassandra, or BadgerDB	Object storage (S3/GCS) — no database needed
Indexing	Full span indexing in Elasticsearch	No indexing — ID-based lookup + TraceQL search
Query Language	Tag-based search UI	TraceQL — expressive query language with span filtering
Cost at Scale	High — Elasticsearch storage and compute costs	Low — object storage only (~$0.02/GB/month)
Ingestion Protocols	Jaeger, Zipkin, OpenTelemetry	Jaeger, Zipkin, OpenTelemetry, Kafka
Retention	Limited by storage capacity	30 days default (configurable)
Service Graph	Requires separate dependency calculation	Built-in service graph from span data

TraceQL Query Examples

# Find traces where checkout service had errors
{ resource.service.name = "checkout-service" && status = error }

# Find slow database spans (> 500ms)
{ span.db.system = "postgresql" && duration > 500ms }

# Find traces crossing service boundaries with high latency
{ resource.service.name = "api-gateway" && duration > 2s } >> { resource.service.name = "payment-service" }

# Search by custom span attribute
{ span.http.status_code >= 500 && resource.deployment.environment = "production" }

# Aggregate: count error spans per service over time (TraceQL metrics)
{ status = error } | count_over_time() by (resource.service.name)

Grafana Alerting

Grafana Cloud provides a unified alerting system that can evaluate alert rules against any data source — Mimir (Prometheus), Loki (logs), Tempo (traces), and even external sources like Elasticsearch or CloudWatch. Alert rules are defined using the same query languages (PromQL, LogQL) and managed alongside dashboards.

Alerting Components Comparison

Component	Role	Key Features	Best For
Grafana Alerting	Alert rule evaluation & routing	Multi-source rules, recording rules, silences, mute timings	Unified alerting across all data sources
Alertmanager	Alert deduplication & grouping	Routing trees, inhibition, grouping, repeat intervals	Complex routing logic for Prometheus-style alerts
Grafana OnCall	Incident response & escalation	On-call schedules, escalation chains, auto-acknowledge, mobile app	PagerDuty-style on-call management within Grafana
Grafana IRM	Incident lifecycle management	Declare incidents, timelines, postmortems, Slack integration	Full incident response lifecycle tracking

Multi-Source Alert Rules: Unlike single-backend alerting (for example, Prometheus alert rules, which evaluate only PromQL before handing alerts to Alertmanager for routing), Grafana Alerting can combine signals from different backends in a single alert rule. For example: fire an alert when the error rate from Mimir exceeds 5% AND error logs from Loki contain "database connection refused", correlating metrics and logs in one condition.
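
The combined condition described above reduces to a logical AND over two independently queried signals. Here is a minimal Python sketch of that logic; the function `should_fire` and its inputs are hypothetical stand-ins for the results of a PromQL query and a LogQL query, not Grafana's expression engine.

```python
def should_fire(error_ratio, recent_log_lines, ratio_threshold=0.05,
                needle="database connection refused"):
    """Fire only when BOTH the metric condition and the log condition hold."""
    metric_breached = error_ratio > ratio_threshold
    log_matched = any(needle in line for line in recent_log_lines)
    return metric_breached and log_matched


both = should_fire(0.08, ["pq: database connection refused (10.0.0.5:5432)"])
metric_only = should_fire(0.08, ["request completed"])
logs_only = should_fire(0.01, ["database connection refused"])
```

Requiring both signals reduces noise: a spike in errors with an unrelated cause, or a single stray log line during normal traffic, does not page anyone.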
                        

Alert Rule Example (PromQL)

# Grafana Alert Rule — High Error Rate
apiVersion: 1
groups:
  - orgId: 1
    name: production-alerts
    folder: SRE
    interval: 1m
    rules:
      - uid: high-error-rate
        title: "High Error Rate - Checkout Service"
        condition: C
        data:
          - refId: A
            datasourceUid: grafana-cloud-prom
            # Query window in seconds, relative to evaluation time
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: |
                sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="checkout"}[5m]))
              instant: true
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              # Input to the threshold: the result of query A
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]
        noDataState: NoData
        execErrState: Error
        for: 5m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout error rate above 5%"
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"
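
The interplay between `interval: 1m` and `for: 5m` in the rule above is easy to misread, so here is a small Python sketch of the pending period. The function `alert_states` is hypothetical and the exact boundary handling is an assumption; it only illustrates that a breach must persist before the rule fires.

```python
def alert_states(breaches, pending_evals):
    """Map per-evaluation breach booleans to alert states.

    The rule moves to Firing once the breach has persisted for
    `pending_evals` evaluation intervals beyond the first breaching evaluation.
    """
    states = []
    streak = 0
    for breached in breaches:
        streak = streak + 1 if breached else 0
        if streak == 0:
            states.append("Normal")
        elif streak <= pending_evals:
            states.append("Pending")
        else:
            states.append("Firing")
    return states


# One evaluation per minute; `for: 5m` corresponds to 5 intervals.
timeline = alert_states([False] + [True] * 6 + [False, True], pending_evals=5)
```

A single healthy evaluation resets the streak, so a flapping error rate has to breach continuously for the full pending period before anyone is notified.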

Synthetic Monitoring & k6

Grafana Cloud includes synthetic monitoring for probing endpoints from global locations, plus native k6 Cloud integration for load testing. Synthetic checks run from Grafana's global probe network (or your private probes) and report availability, latency, and certificate status directly into Grafana dashboards.

Synthetic Check Types

Check Type	Purpose	Frequency	Example Use Case
HTTP	Endpoint availability & response validation	10s – 120s intervals	Monitor API health endpoint from 20+ global locations
Ping (ICMP)	Network-level reachability	10s – 60s intervals	Verify server reachability and measure packet loss
DNS	DNS resolution validation	30s – 120s intervals	Detect DNS propagation issues or hijacking
TCP	Port connectivity	10s – 60s intervals	Verify database port is open and responsive
Traceroute	Network path analysis	60s – 300s intervals	Identify network hops causing latency spikes
Multi-step (Scripted)	Complex user flows via k6 scripts	60s – 600s intervals	Login → Browse → Add to Cart → Checkout flow
Browser (k6)	Real browser rendering (Chromium)	60s – 600s intervals	Measure Core Web Vitals (LCP, INP, CLS) synthetically

k6 Cloud Integration

Grafana Labs acquired k6, the popular open-source load testing tool, in 2021 and integrated it directly into Grafana Cloud. You can run load tests from the CLI, visualize results in Grafana dashboards, and set performance thresholds as pass/fail gates in CI/CD pipelines.

# Shell: run a k6 load test and stream results to Grafana Cloud
k6 run --out cloud script.js

// JavaScript: example k6 script for load testing
// (save as script.js)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% of requests under 500ms
    http_req_failed: ['rate<0.01'],    // Less than 1% failures
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
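
The `thresholds` block is what turns a load test into a pass/fail gate. This Python sketch reproduces the two checks from the script on a list of recorded results. The helper names are hypothetical, and the nearest-rank percentile is an assumption: k6's own percentile calculation may differ slightly.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumption; k6 may compute it differently)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_thresholds(durations_ms, failed_flags):
    """Evaluate the two thresholds from the script against recorded results."""
    p95 = percentile_nearest_rank(durations_ms, 95)
    failure_rate = sum(failed_flags) / len(failed_flags)
    return {
        "http_req_duration p(95)<500": p95 < 500,
        "http_req_failed rate<0.01": failure_rate < 0.01,
    }


durations = [120] * 94 + [450] * 4 + [900] * 2  # 100 requests
failed = [False] * 99 + [True]                  # exactly 1% failures
results = evaluate_thresholds(durations, failed)
passed = all(results.values())
```

In this sample the latency threshold passes, but the failure threshold does not, because exactly 1% is not strictly less than 1%. A CI pipeline gating on these thresholds would fail the build.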

Pricing & Free Tier

Grafana Cloud stands out with one of the most generous free tiers in the observability market — and transparent, usage-based pricing that avoids per-seat or per-host models for the core platform.

Tier Comparison

Feature	Free	Pro ($0 base + usage)	Advanced (Custom)
Metrics (Active Series)	10,000 series	$8 per 1,000 series/month	Volume discounts
Logs	50 GB/month	$0.50 per GB	Volume discounts
Traces	50 GB/month	$0.50 per GB	Volume discounts
Metrics Retention	14 days	13 months (configurable)	Up to 2 years
Logs/Traces Retention	14 days	30 days (configurable)	Custom
Users	3 (free forever)	Unlimited (included)	Unlimited
Synthetic Monitoring	5 checks	Included with usage pricing	Custom
Grafana OnCall	Included (limited)	Included	Included
Support	Community	Standard (business hours)	Premium 24/7
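
As a rough way to read the table, the sketch below estimates a monthly usage bill from the list prices shown. The function `estimate_monthly_cost` is hypothetical, and it assumes the free allowances are deducted before usage is billed. Actual Grafana Cloud invoices may include platform fees or other line items not modeled here.

```python
def estimate_monthly_cost(active_series, logs_gb, traces_gb,
                          free_series=10_000, free_logs_gb=50, free_traces_gb=50,
                          usd_per_1k_series=8.0, usd_per_gb=0.50):
    """Estimate usage charges in USD; all rates are taken from the table above."""
    billable_series = max(0, active_series - free_series)
    billable_logs = max(0, logs_gb - free_logs_gb)
    billable_traces = max(0, traces_gb - free_traces_gb)
    return {
        "metrics": billable_series / 1000 * usd_per_1k_series,
        "logs": billable_logs * usd_per_gb,
        "traces": billable_traces * usd_per_gb,
    }


bill = estimate_monthly_cost(active_series=110_000, logs_gb=250, traces_gb=50)
total = sum(bill.values())
```

With these inputs, metrics dominate the bill at 800 of the 900 USD total, which is why most of the optimization advice focuses on active series.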

Cost Optimization Strategies

Reduce active series cardinality — Use recording rules to pre-aggregate high-cardinality metrics and drop unused labels at the agent level. A 10x reduction in series count translates directly to 10x cost reduction.
Use Adaptive Metrics — Grafana Cloud's built-in feature identifies metrics that aren't queried by any dashboard or alert rule. Automatically aggregate or drop unused series to eliminate waste.
Filter logs at the agent — Drop debug/trace-level logs before they reach Loki. Use Alloy's pipeline stages to filter, redact, or sample verbose log streams.
Implement trace sampling — Use tail-based sampling in Grafana Alloy to keep only interesting traces (errors, high-latency, specific services) and drop routine traces.
Leverage the free tier strategically — Use the free tier for staging/dev environments. Reserve paid capacity for production workloads only.
Set usage alerts — Configure billing alerts in Grafana Cloud to notify before usage spikes cause bill shock. Set hard limits per data source.
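
The trace sampling strategy in the list above can be sketched as a keep-or-drop decision made after a whole trace has been seen. The function `keep_trace`, the span dictionary shape, and the thresholds are hypothetical. In practice this logic is configured in the collector rather than written by hand.

```python
import hashlib


def keep_trace(trace_id, spans, latency_threshold_ms=2000, baseline_percent=10):
    """Tail-based decision: keep errors and slow traces, sample the rest."""
    if any(span.get("status") == "error" for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) > latency_threshold_ms:
        return True
    # Deterministic baseline sample of routine traces, keyed on the trace ID
    # so every collector instance makes the same decision for the same trace.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < baseline_percent


error_kept = keep_trace("trace-a", [{"duration_ms": 20, "status": "error"}])
slow_kept = keep_trace("trace-b", [{"duration_ms": 2500}])
routine_dropped = keep_trace("trace-c", [{"duration_ms": 20}], baseline_percent=0)
```

Keeping every error and slow trace while sampling a small share of routine ones preserves the traces people actually open during incidents and cuts ingested volume for the rest.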

Active Series = Primary Cost Factor: For metrics, the billing unit is the "active series": a unique combination of metric name and label values that has received a sample in the last 15 minutes. High-cardinality labels (user IDs, request IDs, UUIDs) can explode your series count from thousands to millions. Always review cardinality before production rollout. Use the grafanacloud_instance_active_series metric to monitor your usage.
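
The jump from thousands to millions follows from simple multiplication: the worst-case series count for one metric is the product of the distinct values of each label. Here is a short Python sketch with made-up label counts (all figures are illustrative assumptions).

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


baseline = worst_case_series({"service": 20, "status": 5, "method": 4})
with_user_id = worst_case_series(
    {"service": 20, "status": 5, "method": 4, "user_id": 10_000}
)
growth_factor = with_user_id // baseline
```

Adding a single `user_id` label with 10,000 values turns a 400-series metric into one with up to 4,000,000 series. Real counts are usually below this bound because not every combination occurs, but the bound shows where to look first.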
                        

When to Choose Grafana Cloud

Platform Assessment

Grafana Cloud: Strengths & Limitations

Strengths

No vendor lock-in — All protocols are open standards (Prometheus remote_write, OTLP, Jaeger, syslog). Switch providers or go self-hosted anytime without rewriting instrumentation.
Generous free tier — 10K metrics series, 50 GB logs, 50 GB traces — enough for small production workloads at zero cost
Cost-effective at scale — Loki and Tempo use object storage; no expensive Elasticsearch or dedicated trace databases needed
Open-source ecosystem — Thousands of community dashboards, exporters, and integrations; massive Grafana plugin library
No per-user pricing — Unlimited users on Pro tier; teams of 50+ don't pay seat fees
Self-hosted escape hatch — Run the same LGTM stack on your own infrastructure if regulations or costs require it

Limitations

Assembly required — Unlike Datadog's unified UI, you configure separate data sources, dashboards, and alerting rules; steeper initial setup
Multiple query languages — PromQL for metrics, LogQL for logs, TraceQL for traces; no single unified query language across all signals
Log search limitations — Loki's label-only indexing means grep-style full-text searches across unfiltered logs are slow compared to Elasticsearch
APM is newer — Application Observability (APM) features are less mature than Datadog or New Relic's decade-old APM offerings
Enterprise features gated — RBAC, audit logs, and advanced security require Advanced tier (custom pricing)

Best For

Teams already using Prometheus and Grafana who want managed infrastructure without lock-in
Cost-conscious organizations that need observability at scale without per-host or per-seat pricing
OpenTelemetry-first shops that want an OTel-native backend for all signals
Multi-cloud or hybrid environments where open standards enable portability
Organizations with regulatory requirements that may need to self-host in the future

Open Source · No Lock-In · Cost-Effective · Usage-Based

Platform Comparison Summary

Having explored Datadog, New Relic, and Grafana Cloud in depth, here's a consolidated comparison to guide your platform selection decision:

Dimension	Datadog	New Relic	Grafana Cloud
Pricing Model	Per-host + per-GB + add-ons	Per-GB ingest + per-user seats	Per-active-series + per-GB (logs/traces)
Free Tier	14-day trial only	100 GB/month + 1 full user (forever)	10K series + 50 GB logs + 50 GB traces (forever)
Vendor Lock-In	High: proprietary agents, query syntax, integrations	Medium: NRQL is proprietary, but OTel is supported	Low: all open standards, self-hosted option exists
Protocol Support	Proprietary + OTel (growing)	Proprietary + OTel (native OTLP)	Native OTel, Prometheus, Jaeger, Zipkin; syslog via collectors
Self-Hosted Option	No	No	Yes: full LGTM stack is open source
Query Language	Proprietary query syntax	NRQL (SQL-like, proprietary)	PromQL + LogQL + TraceQL (open)
APM Maturity	Excellent: 10+ years, auto-instrumentation	Excellent: 15+ years, deep language support	Growing: Application Observability is newer
Key Strength	Unified UX, 750+ integrations, AI/ML	NRQL flexibility, generous free tier, entity model	Open standards, no lock-in, cost at scale
Key Weakness	Expensive at scale, billing unpredictability	User seat costs, 8-day default retention	Assembly required, multiple query languages
Ideal For	Teams wanting polished UX with budget for premium tooling	Query-driven teams wanting full-stack at predictable cost	Open-source-first teams wanting portability and cost control
