
Platform Deep Dive: Grafana Cloud

May 14, 2026 Wasil Zafar 18 min read

Grafana Cloud is the managed version of the open-source LGTM stack — Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics. It offers the flexibility of open-source tools with the convenience of a managed platform, plus a generous free tier and no vendor lock-in since all protocols are open standards.

Table of Contents

  1. The LGTM Stack
  2. Grafana Mimir (Managed Metrics)
  3. Grafana Loki (Managed Logs)
  4. Grafana Tempo (Managed Traces)
  5. Grafana Alerting
  6. Synthetic Monitoring & k6
  7. Pricing & Free Tier
  8. When to Choose Grafana Cloud
  9. Platform Comparison Summary

The LGTM Stack

Grafana Cloud is built on four open-source projects, collectively called the LGTM stack: three purpose-built backends, one per telemetry signal, plus Grafana for visualization. Unlike monolithic platforms that use a single proprietary backend, Grafana Labs maintains dedicated, horizontally scalable systems optimized for each data type.

Grafana Cloud LGTM Architecture
flowchart TD
    A[Applications & Infrastructure] --> B[OTel Collector / Grafana Agent / Alloy]
    B --> C[Grafana Mimir — Metrics]
    B --> D[Grafana Loki — Logs]
    B --> E[Grafana Tempo — Traces]
    C --> F[Grafana Dashboards]
    D --> F
    E --> F
    F --> G[Alertmanager / Grafana OnCall]

    style A fill:#3B9797,color:#fff
    style B fill:#132440,color:#fff
    style C fill:#BF092F,color:#fff
    style D fill:#16476A,color:#fff
    style E fill:#16476A,color:#fff
    style F fill:#3B9797,color:#fff
    style G fill:#132440,color:#fff
                            
| Component | Signal | Query Language | Open-Source License |
| --- | --- | --- | --- |
| Grafana Mimir | Metrics | PromQL | AGPLv3 |
| Grafana Loki | Logs | LogQL | AGPLv3 |
| Grafana Tempo | Traces | TraceQL | AGPLv3 |
| Grafana | Visualization | n/a (queries each data source in its own language) | AGPLv3 |
| Grafana Alloy | Collection (agent) | Alloy configuration syntax (formerly River) | Apache 2.0 |

The critical architectural advantage: every protocol is open. Mimir speaks Prometheus remote_write; Loki ingests logs through its HTTP push API and OTLP, with syslog and Fluentd sources handled by collectors such as Alloy, Promtail, and the Fluentd/Fluent Bit plugins; and Tempo ingests OpenTelemetry, Jaeger, and Zipkin natively. You can migrate away from Grafana Cloud at any time without rewriting instrumentation, because the same agents and SDKs work with self-hosted versions.

Grafana Mimir (Managed Metrics)

Grafana Mimir is the metrics backend powering Grafana Cloud — a horizontally scalable, multi-tenant Prometheus-compatible time-series database. If you already know PromQL, you know how to query Mimir. It's designed to handle billions of active series with consistent query performance.

Prometheus vs Mimir Comparison

| Aspect | Prometheus (Self-Hosted) | Grafana Mimir (Grafana Cloud) |
| --- | --- | --- |
| Retention | Limited by local disk (typically 15-30 days) | 13 months default, configurable up to 2 years |
| High Availability | Manual: dual Prometheus + deduplication | Built-in: triple replication across zones |
| Horizontal Scaling | Not native: requires federation/Thanos/Cortex | Native: scales to billions of active series |
| Query Performance | Degrades with cardinality and range | Query sharding across workers; consistent latency |
| Global View | Requires federation for cross-cluster queries | Single pane across all clusters and regions |
| Maintenance | Manual upgrades, compaction tuning, storage management | Fully managed: zero operational overhead |
| Cost | Infrastructure + engineering time | Per active series pricing ($8/1K series/month) |

Native PromQL Support: Mimir is 100% PromQL-compatible. Every existing Prometheus query, recording rule, and alert rule works without modification, so your Grafana dashboards carry over with zero migration effort for the query layer.

Sending Metrics to Grafana Cloud

// Grafana Alloy configuration for remote_write to Grafana Cloud
// (Alloy config files use // for comments, not #)
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"

    basic_auth {
      // Newer Alloy releases spell this sys.env(...); env(...) is the older form
      username = env("GRAFANA_CLOUD_PROM_USER")
      password = env("GRAFANA_CLOUD_API_KEY")
    }

    // queue_config is nested inside the endpoint block it tunes
    queue_config {
      max_samples_per_send = 1000
      batch_send_deadline  = "5s"
      min_backoff          = "30ms"
      max_backoff          = "5s"
    }
  }
}
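
To make the `min_backoff` and `max_backoff` settings concrete, here is a minimal Python sketch of a retry schedule. It assumes the sender doubles its wait after every failed send until it reaches the cap, which is how Prometheus-style remote write is usually described. The function name `backoff_schedule` is hypothetical and is not part of Alloy.

```python
def backoff_schedule(min_backoff_ms: int, max_backoff_ms: int, retries: int) -> list[int]:
    """Return the wait (in ms) before each retry, doubling from min up to the cap.

    Assumption: plain doubling with a hard cap; a real client may add jitter.
    """
    waits = []
    wait = min_backoff_ms
    for _ in range(retries):
        waits.append(wait)
        wait = min(wait * 2, max_backoff_ms)
    return waits


# Values taken from the queue_config above: 30ms minimum, 5s maximum
schedule = backoff_schedule(30, 5000, 10)
print(schedule)  # [30, 60, 120, 240, 480, 960, 1920, 3840, 5000, 5000]
```

Under these assumptions the sender reaches the 5-second ceiling after eight failed attempts, so a short outage is retried aggressively while a long one settles into one attempt every five seconds.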

Grafana Loki (Managed Logs)

Grafana Loki takes a fundamentally different approach to log storage than Elasticsearch or Splunk. Instead of full-text indexing every log line, Loki only indexes labels (metadata like pod name, namespace, service) and stores log content in compressed chunks on cheap object storage. This makes it orders of magnitude cheaper to operate at scale.
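
The label-only indexing idea can be illustrated with a toy in-memory store. This is a sketch of the concept, not Loki's implementation: the class `TinyLabelIndexedStore` and its methods are hypothetical, `zlib` stands in for Loki's chunk compression, and a dictionary stands in for object storage.

```python
import zlib
from collections import defaultdict


class TinyLabelIndexedStore:
    """Toy model: only label sets are indexed; log content lives in compressed chunks."""

    def __init__(self):
        # "Index": frozenset of (label, value) pairs -> list of compressed chunks
        self.streams = defaultdict(list)

    def push(self, labels: dict, lines: list) -> None:
        key = frozenset(labels.items())
        chunk = zlib.compress("\n".join(lines).encode("utf-8"))
        self.streams[key].append(chunk)

    def query(self, selector: dict, contains: str = "") -> list:
        wanted = set(selector.items())
        hits = []
        for key, chunks in self.streams.items():
            if not wanted <= key:
                continue  # stream skipped via the label index, nothing decompressed
            for chunk in chunks:
                # Content is never indexed, so matching means decompress-and-scan
                for line in zlib.decompress(chunk).decode("utf-8").split("\n"):
                    if contains in line:
                        hits.append(line)
        return hits


store = TinyLabelIndexedStore()
store.push({"namespace": "production", "app": "checkout-service"},
           ["level=info msg=ok", "level=error msg=payment-failed"])
store.push({"namespace": "staging", "app": "checkout-service"},
           ["level=error msg=test-failure"])
print(store.query({"namespace": "production"}, contains="error"))
# ['level=error msg=payment-failed']
```

The sketch shows where the cost saving and the trade-off both come from: the index holds one entry per label set rather than per word, and any text filter has to scan the chunks that the label selector left in play.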

LogQL Query Examples

# Basic log stream selection by labels
{namespace="production", app="checkout-service"}

# Filter log lines containing specific text
{namespace="production", app="checkout-service"} |= "error"

# Parse JSON log lines, then filter on an extracted field
{namespace="production"} | json | status_code >= 500

# Log-based metric: per-second rate of error lines, averaged over 5 minutes
# (multiply by 60 for errors per minute)
rate({namespace="production", app="checkout-service"} |= "error" [5m])

# Top 5 pods by error rate
topk(5, sum by(pod) (rate({namespace="production"} |= "error" [5m])))

Grafana Alloy Log Collection Config

// Grafana Alloy: collect logs from Kubernetes pods

// Discover pods; this component provides the targets referenced below
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  // Send through the processing pipeline rather than straight to loki.write
  forward_to = [loki.process.pipeline.receiver]
}

// Add labels and process log lines
loki.process "pipeline" {
  forward_to = [loki.write.grafana_cloud.receiver]

  stage.json {
    expressions = {
      level     = "level",
      message   = "msg",
      trace_id  = "trace_id",
      timestamp = "timestamp",  // extracted so stage.timestamp can read it
    }
  }

  stage.labels {
    values = {
      level = "",
    }
  }

  stage.timestamp {
    source = "timestamp"
    format = "RFC3339"
  }
}

// Write to Grafana Cloud Loki endpoint
loki.write "grafana_cloud" {
  endpoint {
    url = "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push"

    basic_auth {
      username = env("GRAFANA_CLOUD_LOKI_USER")
      password = env("GRAFANA_CLOUD_API_KEY")
    }
  }
}

Cost Advantage Over Elasticsearch: Loki indexes only labels, not log content. This means the index is typically 100-1000x smaller than that of an equivalent Elasticsearch deployment. Storage costs drop dramatically because compressed log chunks sit on object storage (S3/GCS) at ~$0.02/GB/month. The trade-off: grep-style queries that scan unfiltered log content are slower than indexed full-text search, while queries narrowed by labels stay fast because they touch only the matching streams.

Grafana Tempo (Managed Traces)

Grafana Tempo is a distributed tracing backend that stores traces on object storage without any indexing. It relies on trace IDs for direct lookups, and the newer TraceQL query language for searching across trace attributes. Like Loki, this architecture makes it extremely cost-effective at scale.

Tempo vs Jaeger Comparison

| Aspect | Jaeger (Self-Hosted) | Grafana Tempo (Grafana Cloud) |
| --- | --- | --- |
| Storage Backend | Elasticsearch, Cassandra, or BadgerDB | Object storage (S3/GCS), no database needed |
| Indexing | Full span indexing in Elasticsearch | No indexing: ID-based lookup + TraceQL search |
| Query Language | Tag-based search UI | TraceQL: expressive query language with span filtering |
| Cost at Scale | High: Elasticsearch storage and compute costs | Low: object storage only (~$0.02/GB/month) |
| Ingestion Protocols | Jaeger, Zipkin, OpenTelemetry | Jaeger, Zipkin, OpenTelemetry, Kafka |
| Retention | Limited by storage capacity | 30 days default (configurable) |
| Service Graph | Requires separate dependency calculation | Built-in service graph from span data |

TraceQL Query Examples

# Find traces where checkout service had errors
{ resource.service.name = "checkout-service" && status = error }

# Find slow database spans (> 500ms)
{ span.db.system = "postgresql" && duration > 500ms }

# Find traces crossing service boundaries with high latency
{ resource.service.name = "api-gateway" && duration > 2s } >> { resource.service.name = "payment-service" }

# Search by custom span attribute
{ span.http.status_code >= 500 && resource.deployment.environment = "production" }

# Aggregate: count error spans per service over time (TraceQL metrics query)
{ status = error } | count_over_time() by (resource.service.name)

Grafana Alerting

Grafana Cloud provides a unified alerting system that can evaluate alert rules against any data source — Mimir (Prometheus), Loki (logs), Tempo (traces), and even external sources like Elasticsearch or CloudWatch. Alert rules are defined using the same query languages (PromQL, LogQL) and managed alongside dashboards.

Alerting Components Comparison

| Component | Role | Key Features | Best For |
| --- | --- | --- | --- |
| Grafana Alerting | Alert rule evaluation & routing | Multi-source rules, recording rules, silences, mute timings | Unified alerting across all data sources |
| Alertmanager | Alert deduplication & grouping | Routing trees, inhibition, grouping, repeat intervals | Complex routing logic for Prometheus-style alerts |
| Grafana OnCall | Incident response & escalation | On-call schedules, escalation chains, auto-acknowledge, mobile app | PagerDuty-style on-call management within Grafana |
| Grafana IRM | Incident lifecycle management | Declare incidents, timelines, postmortems, Slack integration | Full incident response lifecycle tracking |

Multi-Source Alert Rules: Unlike Prometheus-native alerting, where rules are evaluated by Prometheus itself against PromQL only (Alertmanager just routes the resulting alerts), Grafana Alerting can combine signals from different backends in a single alert rule. For example: fire an alert when the error rate from Mimir exceeds 5% AND error logs from Loki contain "database connection refused", correlating metrics and logs in one condition.
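
The combined condition in that example boils down to a logical AND over two query results. The Python sketch below shows only that logic; `should_fire` and its parameters are hypothetical names, and in Grafana the two inputs would come from a Mimir query and a Loki query rather than from function arguments.

```python
def should_fire(error_ratio: float,
                recent_log_lines: list,
                ratio_threshold: float = 0.05,
                needle: str = "database connection refused") -> bool:
    """Fire only when the metric breach and the log pattern occur together."""
    metric_breached = error_ratio > ratio_threshold
    log_matched = any(needle in line for line in recent_log_lines)
    return metric_breached and log_matched


logs = ["level=error msg=database connection refused host=db-1"]
print(should_fire(0.08, logs))   # True: both signals agree
print(should_fire(0.08, []))     # False: errors are high, but not for this reason
print(should_fire(0.01, logs))   # False: the log line appears, but impact is low
```

Requiring both signals trades a little sensitivity for much less noise: a single stray log line or an unrelated error spike no longer pages anyone on its own.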

Alert Rule Example (PromQL)

# Grafana Alert Rule — High Error Rate
apiVersion: 1
groups:
  - orgId: 1
    name: production-alerts
    folder: SRE
    interval: 1m
    rules:
      - uid: high-error-rate
        title: "High Error Rate - Checkout Service"
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: grafana-cloud-prom
            model:
              refId: A
              expr: |
                sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="checkout"}[5m]))
              instant: true
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: A  # the threshold is applied to the result of query A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]
        noDataState: NoData
        execErrState: Error
        for: 5m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout error rate above 5%"
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"
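
The interaction between `interval: 1m` and `for: 5m` in the rule above can be sketched as a small state machine. This is a simplified model, assuming the rule moves to Pending on the first breach and to Firing once the breach has lasted the full `for` duration; the function `alert_states` is hypothetical and ignores NoData and Error handling.

```python
def alert_states(ratios, threshold=0.05, for_evaluations=5):
    """Map one error ratio per evaluation to Normal / Pending / Firing.

    for_evaluations = `for` duration divided by the evaluation interval (5m / 1m).
    """
    states = []
    breached_streak = 0
    for ratio in ratios:
        if ratio > threshold:
            breached_streak += 1
            # The first breach starts the clock; fire once 5 more intervals have passed
            states.append("Firing" if breached_streak > for_evaluations else "Pending")
        else:
            breached_streak = 0
            states.append("Normal")
    return states


# One value per minute, illustrative data only
ratios = [0.01, 0.06, 0.07, 0.08, 0.09, 0.06, 0.07, 0.02]
print(alert_states(ratios))
# ['Normal', 'Pending', 'Pending', 'Pending', 'Pending', 'Pending', 'Firing', 'Normal']
```

The pending window is what keeps a one-minute blip from paging the on-call engineer, at the price of a five-minute delay before a sustained breach fires.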

Synthetic Monitoring & k6

Grafana Cloud includes synthetic monitoring for probing endpoints from global locations, plus native k6 Cloud integration for load testing. Synthetic checks run from Grafana's global probe network (or your private probes) and report availability, latency, and certificate status directly into Grafana dashboards.

Synthetic Check Types

| Check Type | Purpose | Frequency | Example Use Case |
| --- | --- | --- | --- |
| HTTP | Endpoint availability & response validation | 10s – 120s intervals | Monitor API health endpoint from 20+ global locations |
| Ping (ICMP) | Network-level reachability | 10s – 60s intervals | Verify server reachability and measure packet loss |
| DNS | DNS resolution validation | 30s – 120s intervals | Detect DNS propagation issues or hijacking |
| TCP | Port connectivity | 10s – 60s intervals | Verify database port is open and responsive |
| Traceroute | Network path analysis | 60s – 300s intervals | Identify network hops causing latency spikes |
| Multi-step (Scripted) | Complex user flows via k6 scripts | 60s – 600s intervals | Login → Browse → Add to Cart → Checkout flow |
| Browser (k6) | Real browser rendering (Chromium) | 60s – 600s intervals | Measure Core Web Vitals (LCP, INP, CLS) synthetically |

k6 Cloud Integration

Grafana acquired k6 (the popular open-source load testing tool) and integrated it directly into Grafana Cloud. You can run load tests from the CLI, visualize results in Grafana dashboards, and set performance thresholds as pass/fail gates in CI/CD pipelines.

# Shell: run a k6 load test locally and stream results to Grafana Cloud
k6 run --out cloud script.js

// JavaScript: example k6 script for load testing
// (save as script.js)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% of requests under 500ms
    http_req_failed: ['rate<0.01'],    // Less than 1% failures
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
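
To show how the two thresholds in the script act as pass/fail gates, here is a Python sketch that evaluates them over a list of recorded requests. It uses a nearest-rank percentile, which is an assumption; k6's own percentile calculation may interpolate differently. The names `percentile` and `evaluate_thresholds` are hypothetical.

```python
def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct (assumed method, not k6's)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def evaluate_thresholds(durations_ms, failed_flags):
    """Mirror the script's thresholds: p(95)<500 and rate<0.01."""
    p95 = percentile(durations_ms, 95)
    failure_rate = sum(failed_flags) / len(failed_flags)
    return {
        "http_req_duration p(95)<500": p95 < 500,
        "http_req_failed rate<0.01": failure_rate < 0.01,
    }


# 100 illustrative requests: mostly fast, a few slow, one failure
durations = [120] * 90 + [450] * 8 + [900] * 2
failed = [False] * 99 + [True]
print(evaluate_thresholds(durations, failed))
# {'http_req_duration p(95)<500': True, 'http_req_failed rate<0.01': False}
```

Note that one failure in 100 requests is a rate of exactly 0.01, which does not satisfy the strict `rate<0.01` comparison, so this run would fail the gate even though the latency threshold passes.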

Pricing & Free Tier

Grafana Cloud stands out with one of the most generous free tiers in the observability market — and transparent, usage-based pricing that avoids per-seat or per-host models for the core platform.

Tier Comparison

| Feature | Free | Pro ($0 base + usage) | Advanced (Custom) |
| --- | --- | --- | --- |
| Metrics (Active Series) | 10,000 series | $8 per 1,000 series/month | Volume discounts |
| Logs | 50 GB/month | $0.50 per GB | Volume discounts |
| Traces | 50 GB/month | $0.50 per GB | Volume discounts |
| Metrics Retention | 13 months | 13 months (configurable) | Up to 2 years |
| Logs/Traces Retention | 14 days | 30 days (configurable) | Custom |
| Users | 3 (free forever) | Unlimited (included) | Unlimited |
| Synthetic Monitoring | 5 checks | Included with usage pricing | Custom |
| Grafana OnCall | Included (limited) | Included | Included |
| Support | Community | Standard (business hours) | Premium 24/7 |
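
A quick way to reason about these numbers is a back-of-the-envelope estimator. The Python sketch below uses the rates and allowances from the table and assumes, without confirmation from the table itself, that the free allowances are deducted before Pro usage is billed. The function name `estimate_monthly_cost` and the workload figures are hypothetical.

```python
def estimate_monthly_cost(active_series, logs_gb, traces_gb,
                          free_series=10_000, free_logs_gb=50, free_traces_gb=50,
                          usd_per_1k_series=8.0, usd_per_gb=0.50):
    """Rough monthly estimate in USD. Assumes free allowances are subtracted first."""
    billable_series = max(0, active_series - free_series)
    billable_logs = max(0, logs_gb - free_logs_gb)
    billable_traces = max(0, traces_gb - free_traces_gb)
    return {
        "metrics": billable_series / 1000 * usd_per_1k_series,
        "logs": billable_logs * usd_per_gb,
        "traces": billable_traces * usd_per_gb,
    }


# Hypothetical workload: 60K series, 250 GB logs, 120 GB traces per month
costs = estimate_monthly_cost(60_000, 250, 120)
print(costs)                 # {'metrics': 400.0, 'logs': 100.0, 'traces': 35.0}
print(sum(costs.values()))   # 535.0
```

In this example metrics account for roughly three quarters of the bill, which is why series count is usually the first thing to look at when costs grow.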

Cost Optimization Strategies

  1. Reduce active series cardinality: Use recording rules to pre-aggregate high-cardinality metrics and drop unused labels at the agent level. Because metrics are billed per active series, a 10x reduction in billable series cuts the metrics portion of the bill by roughly 10x.
  2. Use Adaptive Metrics: Grafana Cloud's built-in feature identifies metrics that aren't queried by any dashboard or alert rule. Automatically aggregate or drop unused series to eliminate waste.
  3. Filter logs at the agent: Drop debug/trace-level logs before they reach Loki. Use Alloy's pipeline stages to filter, redact, or sample verbose log streams.
  4. Implement trace sampling: Use tail-based sampling in Grafana Alloy to keep only interesting traces (errors, high-latency, specific services) and drop routine traces.
  5. Leverage the free tier strategically: Use the free tier for staging/dev environments. Reserve paid capacity for production workloads only.
  6. Set usage alerts: Configure billing alerts in Grafana Cloud to notify before usage spikes cause bill shock. Set hard limits per data source.
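
The tail-based sampling strategy from the list above comes down to deciding per trace, after all of its spans have arrived, whether it is worth keeping. The Python sketch below shows one possible policy; it is not Alloy's API, and `keep_trace`, the span dictionary shape, the 2-second threshold, and the 5% baseline rate are all hypothetical choices.

```python
import random


def keep_trace(spans, latency_threshold_ms=2000, baseline_rate=0.05, rng=random.random):
    """Decide on a complete trace: keep errors and slow traces, sample the rest."""
    if any(span["status"] == "error" for span in spans):
        return True                      # always keep traces containing an error
    if max(span["duration_ms"] for span in spans) > latency_threshold_ms:
        return True                      # always keep high-latency traces
    return rng() < baseline_rate         # keep a small share of routine traces


error_trace = [{"status": "ok", "duration_ms": 40}, {"status": "error", "duration_ms": 15}]
slow_trace = [{"status": "ok", "duration_ms": 2500}]
routine_trace = [{"status": "ok", "duration_ms": 80}]

print(keep_trace(error_trace))                       # True
print(keep_trace(slow_trace))                        # True
print(keep_trace(routine_trace, rng=lambda: 0.5))    # False: not in the sampled 5%
```

The decision has to wait for the whole trace because an error or slow span may arrive last, which is what separates tail-based sampling from head-based sampling done at the first span.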
Active Series = Primary Cost Factor: For metrics, the billing unit is "active series" — a unique combination of metric name + label values that has received a sample in the last 15 minutes. High-cardinality labels (user IDs, request IDs, UUIDs) can explode your series count from thousands to millions. Always review cardinality before production rollout. Use grafanacloud_instance_active_series metric to monitor your usage.
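
The cardinality explosion described above is plain multiplication, as the Python sketch below shows. It computes an upper bound that assumes every combination of label values actually occurs, prices it at the $8 per 1,000 series rate quoted in this article, and ignores any free allowance. The label names and counts are illustrative.

```python
from math import prod


def estimate_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def monthly_cost_usd(series: int, usd_per_1k_series: float = 8.0) -> float:
    return series / 1000 * usd_per_1k_series


base = {"service": 20, "status_code": 5, "method": 4}
with_user_id = {**base, "user_id": 10_000}   # one high-cardinality label added

print(estimate_series(base), monthly_cost_usd(estimate_series(base)))
# 400 3.2
print(estimate_series(with_user_id), monthly_cost_usd(estimate_series(with_user_id)))
# 4000000 32000.0
```

A single `user_id` label turns a 400-series metric into a potential 4 million series, which is why such identifiers belong in logs or traces rather than in metric labels.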

When to Choose Grafana Cloud

Platform Assessment

Grafana Cloud: Strengths & Limitations

Strengths
  • No vendor lock-in — All protocols are open standards (Prometheus remote_write, OTLP, Jaeger, syslog). Switch providers or go self-hosted anytime without rewriting instrumentation.
  • Generous free tier — 10K metrics series, 50 GB logs, 50 GB traces — enough for small production workloads at zero cost
  • Cost-effective at scale — Loki and Tempo use object storage; no expensive Elasticsearch or dedicated trace databases needed
  • Open-source ecosystem — Thousands of community dashboards, exporters, and integrations; massive Grafana plugin library
  • No per-user pricing — Unlimited users on Pro tier; teams of 50+ don't pay seat fees
  • Self-hosted escape hatch — Run the same LGTM stack on your own infrastructure if regulations or costs require it
Limitations
  • Assembly required — Unlike Datadog's unified UI, you configure separate data sources, dashboards, and alerting rules; steeper initial setup
  • Multiple query languages — PromQL for metrics, LogQL for logs, TraceQL for traces; no single unified query language across all signals
  • Log search limitations — Loki's label-only indexing means grep-style full-text searches across unfiltered logs are slow compared to Elasticsearch
  • APM is newer — Application Observability (APM) features are less mature than Datadog or New Relic's decade-old APM offerings
  • Enterprise features gated — RBAC, audit logs, and advanced security require Advanced tier (custom pricing)
Best For
  • Teams already using Prometheus and Grafana who want managed infrastructure without lock-in
  • Cost-conscious organizations that need observability at scale without per-host or per-seat pricing
  • OpenTelemetry-first shops that want an OTel-native backend for all signals
  • Multi-cloud or hybrid environments where open standards enable portability
  • Organizations with regulatory requirements that may need to self-host in the future
Tags: Open Source, No Lock-In, Cost-Effective, Usage-Based

Platform Comparison Summary

Having explored Datadog, New Relic, and Grafana Cloud in depth, here's a consolidated comparison to guide your platform selection decision:

| Dimension | Datadog | New Relic | Grafana Cloud |
| --- | --- | --- | --- |
| Pricing Model | Per-host + per-GB + add-ons | Per-GB ingest + per-user seats | Per-active-series + per-GB (logs/traces) |
| Free Tier | 14-day trial only | 100 GB/month + 1 full user (forever) | 10K series + 50 GB logs + 50 GB traces (forever) |
| Vendor Lock-In | High: proprietary agents, query syntax, integrations | Medium: NRQL proprietary, but OTel supported | Low: all open standards, self-hosted option exists |
| Protocol Support | Proprietary + OTel (growing) | Proprietary + OTel (native OTLP) | Native OTel, Prometheus, Jaeger, Zipkin, syslog |
| Self-Hosted Option | No | No | Yes: full LGTM stack is open source |
| Query Language | Proprietary Datadog query syntax | NRQL (SQL-like, proprietary) | PromQL + LogQL + TraceQL (open) |
| APM Maturity | Excellent: 10+ years, auto-instrumentation | Excellent: 15+ years, deep language support | Growing: Application Observability is newer |
| Key Strength | Unified UX, 750+ integrations, AI/ML | NRQL flexibility, generous free tier, entity model | Open standards, no lock-in, cost at scale |
| Key Weakness | Expensive at scale, billing unpredictability | User seat costs, 8-day default retention | Assembly required, multiple query languages |
| Ideal For | Teams wanting polished UX with budget for premium tooling | Query-driven teams wanting full-stack at predictable cost | Open-source-first teams wanting portability and cost control |