Monitoring Kubernetes with Grafana
Kubernetes is the dominant container orchestration platform, and monitoring it effectively requires collecting telemetry at multiple layers: node metrics, pod and container statistics, cluster-level events, and application-generated signals. The OpenTelemetry Collector provides a comprehensive set of receivers specifically designed for Kubernetes environments, feeding data into Grafana’s LGTM stack (Loki, Grafana, Tempo, Mimir).
flowchart TD
subgraph K8s Cluster
KA[Kubernetes Attributes Processor]
KS[Kubeletstats Receiver]
FL[Filelog Receiver]
KC[Cluster Receiver]
KO[Object Receiver]
PR[Prometheus Receiver]
HM[Host Metrics Receiver]
end
subgraph OTel Collector
R[Receivers] --> P[Processors]
P --> E[Exporters]
end
KS --> R
FL --> R
KC --> R
KO --> R
PR --> R
HM --> R
KA -.-> P
subgraph Grafana Stack
M[Mimir - Metrics]
L[Loki - Logs]
T[Tempo - Traces]
G[Grafana - Visualization]
end
E --> M
E --> L
E --> T
M --> G
L --> G
T --> G
Kubernetes Attributes Processor
The k8sattributes processor is the cornerstone of Kubernetes observability. It automatically enriches telemetry data (metrics, logs, and traces) with Kubernetes metadata by correlating the source IP address of incoming telemetry with the Kubernetes API. This enrichment enables powerful cross-signal correlation in Grafana — you can jump from a slow trace to the pod’s CPU metrics to its container logs without manual context switching.
The processor adds attributes such as:
- k8s.pod.name and k8s.pod.uid: Identify the exact pod instance
- k8s.namespace.name: Namespace isolation context
- k8s.deployment.name / k8s.statefulset.name: Workload ownership
- k8s.node.name: Node placement information
- k8s.container.name: Container within multi-container pods
- Pod labels and annotations: Custom metadata from your deployment manifests
# OTel Collector configuration - Kubernetes Attributes Processor
processors:
  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.node.name
        - k8s.container.name
        - container.id
        - container.image.name
        - container.image.tag
      labels:
        - tag_name: app.label.team
          key: team
          from: pod
        - tag_name: app.label.version
          key: app.kubernetes.io/version
          from: pod
      annotations:
        - tag_name: app.annotation.config-hash
          key: checksum/config
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
The pod_association configuration defines how the processor matches incoming telemetry to pods. It tries multiple strategies in order: first by the k8s.pod.ip resource attribute (set by other receivers), then by pod UID, and finally by the connection source IP. This cascading approach ensures enrichment works regardless of how telemetry arrives at the Collector.
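The processor can only attach metadata it is allowed to read from the API server. The following is a minimal RBAC sketch, assuming the Collector runs under a ServiceAccount named otel-collector in a namespace named observability; both names, and the ClusterRole name, are placeholders rather than values from this article.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8sattributes   # hypothetical name
rules:
  # Pod, namespace, and node lookups used for enrichment
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "watch", "list"]
  # ReplicaSets are read to resolve k8s.deployment.name from a pod's owner
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-k8sattributes   # hypothetical name
subjects:
  - kind: ServiceAccount
    name: otel-collector               # assumed ServiceAccount name
    namespace: observability           # assumed namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-k8sattributes
```

If enrichment silently produces no attributes, missing permissions of this kind are a common cause worth checking first.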
Kubeletstats Receiver
The Kubeletstats Receiver collects node, pod, container, and volume metrics directly from the Kubelet’s /stats/summary API endpoint on each node. This provides the foundational infrastructure metrics you need for capacity planning and resource optimization: CPU usage, memory consumption, filesystem utilization, and network I/O at every level of the Kubernetes resource hierarchy.
# OTel Collector configuration - Kubeletstats Receiver
receivers:
  kubeletstats:
    collection_interval: 20s
    auth_type: "serviceAccount"
    endpoint: "https://${env:KUBE_NODE_NAME}:10250"
    insecure_skip_verify: true
    # Metric groups to collect
    metric_groups:
      - node # Node-level CPU, memory, filesystem, network
      - pod # Pod-level aggregates
      - container # Per-container metrics
      - volume # PersistentVolume usage
    # Optional: extra metadata for enrichment
    extra_metadata_labels:
      - container.id
      - k8s.volume.type
    metrics:
      # Node metrics
      k8s.node.cpu.utilization:
        enabled: true
      k8s.node.memory.available:
        enabled: true
      k8s.node.filesystem.available:
        enabled: true
      k8s.node.network.io:
        enabled: true
      # Container metrics
      k8s.container.cpu_limit_utilization:
        enabled: true
      k8s.container.memory_limit_utilization:
        enabled: true
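This configuration assumes the Collector runs as a DaemonSet with KUBE_NODE_NAME set from the pod's spec.nodeName field. The ServiceAccount needs read permission on nodes/stats, granted through a ClusterRole and ClusterRoleBinding.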
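The two requirements above can be sketched as follows. This is an illustrative fragment, not a complete manifest: the container name, ServiceAccount name, and ClusterRole name are assumptions. The nodes/proxy rule is included because the configuration above uses extra_metadata_labels and the limit-utilization metrics, which read additional pod data from the kubelet.

```yaml
# Fragment of the Collector DaemonSet (names are placeholders)
spec:
  template:
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: otel-collector
          env:
            # Downward API: expose the node this pod is scheduled on
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-kubeletstats   # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["nodes/stats"]
    verbs: ["get"]
  # Needed for extra_metadata_labels and *_limit_utilization metrics
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
```

Bind the ClusterRole to the Collector's ServiceAccount with a ClusterRoleBinding, as for any other cluster-scoped permission.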
Filelog Receiver (Container Logs)
The Filelog Receiver tails container log files from the node filesystem. In Kubernetes, container stdout/stderr is written to /var/log/pods/ (or /var/log/containers/ via symlinks). The receiver parses these log files, extracts Kubernetes metadata from the file path, and forwards structured log entries to Loki or any logs backend.
# OTel Collector configuration - Filelog Receiver for K8s
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # Exclude collector's own logs to prevent feedback loops
      - /var/log/pods/observability_otel-collector*/**/*.log
    start_at: end
    include_file_path: true
    include_file_name: false
    operators:
      # Route each line to the parser for its runtime format
      - type: router
        id: get-format
        routes:
          - output: parser-docker
            expr: 'body matches "^\\{"'
          - output: parser-cri
            expr: 'body matches "^[^ Z]+ "'
          - output: parser-containerd
            expr: 'body matches "^[^ Z]+Z"'
      # CRI-O format parser (timestamp carries a numeric UTC offset)
      - type: regex_parser
        id: parser-cri
        regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract-metadata
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.999999999Z07:00'
      # Docker JSON format parser
      - type: json_parser
        id: parser-docker
        output: extract-metadata
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # containerd format parser (timestamp ends in Z)
      - type: regex_parser
        id: parser-containerd
        regex: '^(?P<time>[^ Z]+Z) (?P<stream>stdout|stderr) (?P<flags>[^ ]*) ?(?P<log>.*)$'
        output: extract-metadata
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # Extract K8s metadata from file path
      - type: regex_parser
        id: extract-metadata
        regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\/]+)\/'
        parse_from: attributes["log.file.path"]
      # Move extracted fields to resource attributes
      - type: move
        from: attributes.namespace
        to: resource["k8s.namespace.name"]
      - type: move
        from: attributes.pod_name
        to: resource["k8s.pod.name"]
      - type: move
        from: attributes.container_name
        to: resource["k8s.container.name"]
      - type: move
        from: attributes.uid
        to: resource["k8s.pod.uid"]
The Filelog Receiver handles all three container runtime log formats (Docker JSON, CRI-O, containerd) by routing log lines through format-specific parsers. The regex extraction of Kubernetes metadata from file paths follows the standard /var/log/pods/<namespace>_<pod>_<uid>/<container>/ convention, ensuring logs are automatically correlated with the correct pod and container.
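Because the receiver reads files from the node, the Collector pod needs the host's log directory mounted. A minimal sketch of the relevant DaemonSet fragment follows; the container and volume names are placeholders, and nodes using the Docker runtime may also need /var/lib/docker/containers mounted, since the files under /var/log/pods can be symlinks into it.

```yaml
# Fragment of the Collector DaemonSet (names are placeholders)
spec:
  template:
    spec:
      containers:
        - name: otel-collector
          volumeMounts:
            # Same path inside the container as on the host, so the
            # include globs and the path-parsing regex line up
            - name: varlogpods
              mountPath: /var/log/pods
              readOnly: true
      volumes:
        - name: varlogpods
          hostPath:
            path: /var/log/pods
```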
Kubernetes Cluster Receiver
The Kubernetes Cluster Receiver collects cluster-level metrics and entity events from the Kubernetes API server. Unlike the Kubeletstats Receiver (which reports what resources pods are using), the Cluster Receiver reports what resources are defined and their state — replica counts, resource requests/limits, pod phases, and condition statuses.
# OTel Collector configuration - Kubernetes Cluster Receiver
receivers:
  k8s_cluster:
    auth_type: serviceAccount
    collection_interval: 30s
    node_conditions_to_report:
      - Ready
      - MemoryPressure
      - DiskPressure
      - PIDPressure
    allocatable_types_to_report:
      - cpu
      - memory
      - storage
      - ephemeral-storage
    metadata_collection_interval: 5m
    # Resource attributes to report
    resource_attributes:
      k8s.deployment.name:
        enabled: true
      k8s.namespace.name:
        enabled: true
      k8s.node.name:
        enabled: true
      k8s.pod.name:
        enabled: true
Key metrics produced by the Cluster Receiver include:
- k8s.deployment.desired / k8s.deployment.available: Replica health
- k8s.pod.phase: Running, Pending, Failed, Succeeded
- k8s.container.restarts: CrashLoopBackOff detection
- k8s.node.condition: Node readiness and pressure conditions
- k8s.resource_quota.hard_limit / k8s.resource_quota.used: Quota consumption
Kubernetes Object Receiver
The Kubernetes Object Receiver collects objects from the Kubernetes API server and converts them into log records; its most common use is capturing Kubernetes Events. Events are ephemeral API objects that record significant occurrences such as pod scheduling, image pulls, OOMKills, failed probes, and scaling events. Capturing them as logs in Loki enables historical analysis and alerting on cluster-level events.
# OTel Collector configuration - Kubernetes Object Receiver
receivers:
  k8sobjects:
    auth_type: serviceAccount
    objects:
      - name: events
        mode: watch
        namespaces: [default, production, staging]
        group: events.k8s.io
      - name: events
        mode: pull
        namespaces: [kube-system]
        interval: 60s
The receiver supports two modes: watch streams events in real-time via the Kubernetes watch API, while pull periodically lists events at a configured interval. Use watch for production namespaces where real-time alerting matters, and pull for system namespaces where near-real-time is acceptable.
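Both modes need matching API permissions: pull relies on list, and watch relies on watch. A minimal ClusterRole sketch for the events configuration above might look like this; the role name is a placeholder, and both the core group and events.k8s.io are listed because the example reads events from each.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8sobjects   # hypothetical name
rules:
  - apiGroups: ["", "events.k8s.io"]
    resources: ["events"]
    verbs: ["get", "list", "watch"]
```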
Prometheus Receiver (Scraping Pod Metrics)
The Prometheus Receiver implements Prometheus-compatible scraping within the OTel Collector. It discovers and scrapes metrics from pods annotated with prometheus.io/scrape: "true" and from any application endpoint exposing /metrics. ServiceMonitor resources are not read by the receiver itself; they require the OpenTelemetry Operator's Target Allocator. This bridges the Prometheus ecosystem into the OpenTelemetry pipeline, allowing you to collect application metrics alongside infrastructure telemetry.
# OTel Collector configuration - Prometheus Receiver
receivers:
  prometheus:
    config:
      scrape_configs:
        # Scrape pods with prometheus.io annotations
        - job_name: 'kubernetes-pods'
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Only scrape pods with annotation prometheus.io/scrape=true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            # Use custom port from annotation, keeping the pod's host part.
            # "$$" escapes "$" so the Collector does not treat it as an env var.
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              target_label: __address__
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
            # Use custom path from annotation
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            # Add namespace label
            - source_labels: [__meta_kubernetes_namespace]
              action: replace
              target_label: namespace
            # Add pod name label
            - source_labels: [__meta_kubernetes_pod_name]
              action: replace
              target_label: pod
        # Scrape kube-state-metrics
        - job_name: 'kube-state-metrics'
          scrape_interval: 30s
          static_configs:
            - targets: ['kube-state-metrics.kube-system:8080']
        # Scrape node-exporter (if deployed)
        - job_name: 'node-exporter'
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: endpoints
          relabel_configs:
            - source_labels: [__meta_kubernetes_endpoints_name]
              action: keep
              regex: node-exporter
Host Metrics Receiver
The Host Metrics Receiver collects system-level metrics from the host machine — the underlying node in a Kubernetes context. It provides detailed CPU, memory, disk, and network metrics at the operating system level, complementing the Kubeletstats Receiver with lower-level visibility into node health.
# OTel Collector configuration - Host Metrics Receiver
receivers:
  hostmetrics:
    collection_interval: 30s
    root_path: /hostfs # Mount host filesystem at /hostfs in container
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk:
        include:
          devices: ["^sd.*", "^nvme.*"]
          match_type: regexp # match_type accepts strict or regexp
      filesystem:
        exclude_mount_points:
          mount_points: ["/dev/*", "/proc/*", "/sys/*"]
          match_type: regexp
        exclude_fs_types:
          fs_types: ["autofs", "binfmt_misc", "bpf", "cgroup2",
                     "configfs", "debugfs", "devpts", "devtmpfs",
                     "fusectl", "hugetlbfs", "mqueue", "nsfs",
                     "overlay", "proc", "procfs", "pstore",
                     "rpc_pipefs", "securityfs", "selinuxfs",
                     "squashfs", "sysfs", "tracefs", "tmpfs"]
          match_type: strict
      load: {}
      network:
        include:
          interfaces: ["^eth.*", "^ens.*"]
          match_type: regexp
      processes: {}
      process:
        include:
          match_type: regexp
          names: ["kubelet", "containerd", "dockerd"]
        mute_process_exe_error: true
        mute_process_io_error: true
| Receiver | Deployment | Signal | Primary Use Case |
|---|---|---|---|
| Kubeletstats | DaemonSet | Metrics | Pod/container CPU, memory, network, volumes |
| Filelog | DaemonSet | Logs | Container stdout/stderr log collection |
| Cluster | Deployment (1 replica) | Metrics | Replica counts, pod phases, node conditions |
| Object | Deployment (1 replica) | Logs | Kubernetes Events as log records |
| Prometheus | DaemonSet or Deployment | Metrics | Application metrics from /metrics endpoints |
| Host Metrics | DaemonSet | Metrics | OS-level CPU, memory, disk, network |
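Receivers and processors do nothing until a pipeline references them. The sketch below shows how the node-level components from this section might be wired together in the DaemonSet Collector. The exporter endpoints are hypothetical service addresses, the /otlp paths assume Mimir and Loki are configured to accept OTLP over HTTP, and memory_limiter and batch are assumed to be defined under processors elsewhere in the same config.

```yaml
# DaemonSet Collector: node-level receivers (endpoints are placeholders)
exporters:
  otlphttp/mimir:
    endpoint: http://mimir-gateway.monitoring.svc:8080/otlp
  otlphttp/loki:
    endpoint: http://loki-gateway.monitoring.svc:3100/otlp

service:
  pipelines:
    metrics:
      receivers: [kubeletstats, hostmetrics]
      # memory_limiter first, batch last; enrichment in between
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [filelog]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/loki]
```

The single-replica Deployment Collector would follow the same shape, with k8s_cluster in its metrics pipeline and k8sobjects in its logs pipeline, so that cluster-wide data is not collected once per node.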
Visualizing AWS Telemetry with Grafana Cloud
Amazon Web Services provides CloudWatch as its native monitoring service, collecting metrics, logs, and traces from over 80 AWS services automatically. Grafana integrates with AWS through the CloudWatch data source plugin, which queries CloudWatch Metrics and CloudWatch Logs, and through the companion X-Ray data source plugin for traces, so you can build dashboards on AWS telemetry without duplicating data into a separate backend.
CloudWatch Data Source Configuration
The CloudWatch data source connects Grafana to your AWS account using IAM authentication. For Grafana Cloud, the recommended approach is to use an AWS IAM Role with cross-account trust, allowing Grafana to assume the role without storing long-lived access keys.
# Grafana provisioning - CloudWatch data source
apiVersion: 1
datasources:
  - name: Amazon CloudWatch
    type: cloudwatch
    access: proxy
    uid: cloudwatch-prod
    jsonData:
      authType: default # Uses instance IAM role / env credentials
      defaultRegion: us-east-1
      # For cross-account:
      # authType: keys
      # assumeRoleArn: arn:aws:iam::123456789012:role/GrafanaCloudWatchRole
      customMetricsNamespaces: "CustomApp,MyService"
      logsTimeout: "30m"
    # For explicit keys (less secure, avoid in production):
    # secureJsonData:
    #   accessKey: "AKIA..."
    #   secretKey: "..."
The IAM role or user that Grafana uses needs these permissions: cloudwatch:GetMetricData, cloudwatch:ListMetrics, cloudwatch:GetMetricStatistics, logs:StartQuery, logs:GetQueryResults, logs:GetLogEvents, xray:BatchGetTraces, and xray:GetTraceSummaries. Use an external ID in the trust policy for Grafana Cloud deployments.
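One way to express that role is a CloudFormation template. The sketch below is illustrative only: the principal account ID and the external ID are placeholders that must be replaced with the values shown in your Grafana Cloud AWS integration settings, and the role name simply mirrors the ARN used in the provisioning example.

```yaml
# CloudFormation sketch (account ID and external ID are placeholders)
Resources:
  GrafanaCloudWatchRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: GrafanaCloudWatchRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: arn:aws:iam::111111111111:root   # placeholder account
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: "replace-with-your-external-id"
      Policies:
        - PolicyName: grafana-cloudwatch-read
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - cloudwatch:GetMetricData
                  - cloudwatch:ListMetrics
                  - cloudwatch:GetMetricStatistics
                  - logs:StartQuery
                  - logs:GetQueryResults
                  - logs:GetLogEvents
                  - xray:BatchGetTraces
                  - xray:GetTraceSummaries
                Resource: "*"
```

The external ID condition means the role can only be assumed by a caller that presents that ID, which guards against another tenant of the same service reusing your role ARN.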
CloudWatch Metrics
CloudWatch Metrics covers every AWS service with namespace-organized metrics. Common namespaces include AWS/EC2 (instance metrics), AWS/ECS (container service), AWS/Lambda (serverless functions), AWS/RDS (databases), AWS/ELB (load balancers), and AWS/S3 (storage). Each namespace exposes dimensions for filtering (instance ID, function name, etc.) and statistics for aggregation (Average, Sum, Maximum, Minimum, p99).
In Grafana, the CloudWatch metrics query editor provides:
- Namespace selection: Browse available metric namespaces
- Metric name: Auto-complete from the selected namespace
- Dimensions: Filter by resource ID, name, or tag
- Statistics: Average, Sum, Min, Max, SampleCount, or extended statistics (percentiles)
- Period: Aggregation granularity (60s, 300s, etc.)
- Math expressions: Combine metrics with CloudWatch Metric Math (METRICS("m1") / METRICS("m2") * 100)
{
"namespace": "AWS/Lambda",
"metricName": "Duration",
"dimensions": {
"FunctionName": ["payment-processor", "order-service"]
},
"statistics": ["p99", "Average"],
"period": "300",
"region": "us-east-1",
"matchExact": true
}
X-Ray Traces
AWS X-Ray provides distributed tracing for applications running on AWS services. Grafana queries it through the X-Ray data source plugin, which is installed separately from the CloudWatch data source and lets you search traces by service name, response time, error status, and custom annotations. This enables end-to-end request tracing across Lambda functions, API Gateway, ECS tasks, and other AWS services directly within Grafana.
X-Ray queries use filter expressions:
# Find slow requests to the payment service
service("payment-service") AND responsetime > 5
# Find error traces in the last hour
service("order-api") AND fault = true AND annotation.environment = "production"
# Trace specific request patterns
http.url CONTAINS "/api/v2/orders" AND http.method = "POST" AND responsetime > 2
CloudWatch Logs
CloudWatch Logs Insights provides a purpose-built query language for searching and analyzing log groups. From Grafana, you write Logs Insights queries that execute against CloudWatch and return results directly in your dashboards — no need to ship logs to a separate system for basic querying.
# CloudWatch Logs Insights query - Error analysis
fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception/
| stats count(*) as errorCount by bin(5m)
| sort @timestamp desc
# Lambda cold starts analysis
filter @type = "REPORT"
| stats avg(@duration) as avgDuration,
max(@duration) as maxDuration,
avg(@initDuration) as avgColdStart
by bin(1h)
# Top 10 slowest API requests
fields @timestamp, @message
| filter @message like /API_LATENCY/
| parse @message "latency=* ms" as latency
| sort latency desc
| limit 10
Monitoring GCP with Grafana
Google Cloud Platform provides Cloud Monitoring (formerly Stackdriver) as its native observability service. Grafana’s Google Cloud Monitoring data source enables querying GCP metrics using the Monitoring Query Language (MQL) or the visual query builder, bringing GCP infrastructure and application metrics into unified Grafana dashboards alongside data from other cloud providers.
Cloud Monitoring Data Source
The Google Cloud Monitoring data source authenticates using a GCP service account with the monitoring.viewer role. For Grafana Cloud, you can use Workload Identity Federation to avoid managing JSON key files.
# Grafana provisioning - Google Cloud Monitoring data source
apiVersion: 1
datasources:
  - name: Google Cloud Monitoring
    type: stackdriver
    access: proxy
    uid: gcp-monitoring-prod
    jsonData:
      authenticationType: gce # Uses GCE metadata (when on GCP)
      defaultProject: my-gcp-project-id
      # Optional: specify additional projects for cross-project queries
      # For a service account key (JWT), use these jsonData fields instead:
      # authenticationType: jwt
      # clientEmail: grafana-monitoring@my-project.iam.gserviceaccount.com
      # tokenUri: https://oauth2.googleapis.com/token
    # For JWT authentication, the private key goes in secureJsonData:
    # secureJsonData:
    #   privateKey: |
    #     -----BEGIN PRIVATE KEY-----
    #     ...
    #     -----END PRIVATE KEY-----
The service account needs the roles/monitoring.viewer role for reading metrics, roles/logging.viewer for Cloud Logging queries, and roles/cloudtrace.user for Cloud Trace access. For cross-project monitoring, grant these roles at the organization or folder level.
Query Editor & Dashboards
Grafana’s GCP query editor supports two modes: the visual builder (dropdown-based metric selection) and MQL (Monitoring Query Language) for complex queries. MQL provides full programmatic control over metric selection, alignment, aggregation, and filtering.
Visual Builder steps:
- Select Service (e.g., Compute Engine, Cloud SQL, GKE)
- Choose Metric (e.g., compute.googleapis.com/instance/cpu/utilization)
- Add Filters by resource labels (zone, instance_name) or metric labels
- Set Group By for aggregation dimensions
- Choose Alignment function and period
- Select Cross-series Reducer (mean, sum, max, count)
MQL Examples:
# GKE container CPU utilization by namespace
fetch k8s_container
| metric 'kubernetes.io/container/cpu/core_usage_time'
| align rate(1m)
| every 1m
| group_by [resource.namespace_name], [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]
# Cloud SQL connection count with alerting threshold
fetch cloudsql_database
| metric 'cloudsql.googleapis.com/database/network/connections'
| align mean(5m)
| every 5m
| group_by [resource.database_id]
# Cloud Run request latency percentiles
fetch cloud_run_revision
| metric 'run.googleapis.com/request_latencies'
| align delta(1m)
| every 1m
| group_by [resource.service_name],
[value_request_latencies_percentile: percentile(value.request_latencies, 99)]
GCP metrics are organized by monitored resource type (the resource being measured) and metric type (what aspect is being measured). Common resource types include gce_instance, k8s_container, cloudsql_database, cloud_run_revision, and gcs_bucket. Grafana’s auto-complete and documentation links help navigate GCP’s extensive metric catalog.
Monitoring Azure with Grafana
Microsoft Azure provides Azure Monitor as its comprehensive observability platform, encompassing metrics, logs (via Log Analytics), traces (via Application Insights), and alerts. Grafana’s Azure Monitor data source provides deep integration with all Azure Monitor capabilities, including the ability to query across multiple subscriptions and workspaces.
Azure Monitor Data Source Configuration
The Azure Monitor data source authenticates using an Azure Active Directory (Microsoft Entra ID) service principal or managed identity. For self-hosted Grafana running on Azure VMs or AKS, managed identity is the recommended zero-credential approach.
# Grafana provisioning - Azure Monitor data source
apiVersion: 1
datasources:
  - name: Azure Monitor
    type: grafana-azure-monitor-datasource
    access: proxy
    uid: azure-monitor-prod
    jsonData:
      # Authentication method
      azureAuthType: msi # Managed Identity (recommended on Azure)
      # For App Registration (service principal):
      # azureAuthType: clientsecret
      # tenantId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      # clientId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      # cloudName: azuremonitor # azuremonitor | azuremonitorchina | azuremonitorusgov
      # Default subscription for metric queries
      subscriptionId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      # Log Analytics default workspace
      logAnalyticsDefaultWorkspace: "/subscriptions/xxx/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-prod"
      # Application Insights (optional)
      appInsightsAppId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    # For client secret authentication:
    # secureJsonData:
    #   clientSecret: "your-client-secret"
The identity Grafana authenticates as needs the Monitoring Reader role at the subscription or resource group level for metrics, Log Analytics Reader for log queries, and Application Insights Component Reader for APM data. Use Azure Policy to ensure these roles are assigned consistently across subscriptions.
Azure Monitor Query Editor
The Azure Monitor data source supports four distinct query types, each optimized for a different Azure Monitor subsystem:
1. Metrics Query — Azure Monitor Metrics (platform metrics for all Azure resources):
{
"queryType": "Azure Monitor",
"subscription": "prod-subscription-id",
"resourceGroup": "rg-production",
"metricNamespace": "Microsoft.Web/sites",
"resourceName": "my-web-app",
"metricName": "HttpResponseTime",
"aggregation": "Average",
"timeGrain": "PT5M",
"dimensionFilters": [
{ "dimension": "Instance", "operator": "eq", "filters": ["web-01", "web-02"] }
]
}
2. Logs Query — Azure Log Analytics (KQL — Kusto Query Language):
// Application errors by severity over time
AppExceptions
| where TimeGenerated > ago(24h)
| summarize ErrorCount = count() by bin(TimeGenerated, 1h), SeverityLevel
| order by TimeGenerated desc
// Container CPU utilization in AKS
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
| render timechart
// Network flow analysis
AzureNetworkAnalytics_CL
| where FlowStatus_s == "A" // Allowed flows
| summarize TotalBytes = sum(InboundBytes_d + OutboundBytes_d)
by bin(TimeGenerated, 1h), DestIP_s
| top 10 by TotalBytes desc
3. Traces Query — Application Insights distributed traces:
// Slow dependency calls that also failed (duration is in milliseconds)
dependencies
| where duration > 5000
| where success == false
| project timestamp, target, name, duration, resultCode
| order by duration desc
| take 50
// End-to-end transaction search
union requests, dependencies, exceptions
| where operation_Id == "abc123-trace-id"
| order by timestamp asc
4. Azure Resource Graph — Query Azure resource inventory and configuration:
// Find all VMs not running
Resources
| where type == "microsoft.compute/virtualmachines"
| where properties.extended.instanceView.powerState.displayStatus != "VM running"
| project name, resourceGroup, location, properties.extended.instanceView.powerState.displayStatus
Azure Monitor Dashboards
Grafana provides pre-built dashboard templates for common Azure scenarios. These can be imported from the Grafana dashboard registry and customized for your environment:
- Azure VM Overview — CPU, memory, disk IOPS, network throughput across all VMs
- Azure Kubernetes Service (AKS) — Cluster health, node pools, pod status, container insights
- Azure SQL Database — DTU/vCore utilization, deadlocks, query performance
- Azure App Service — HTTP response codes, response time, instance health
- Azure Functions — Execution count, duration, failures, queue depth
- Azure Storage Accounts — Transaction volume, latency, capacity trends
- Azure Networking — Load Balancer health, Application Gateway metrics, NSG flow logs
- Azure Cost Overview — Cost by resource group, service, and tag (via Cost Management APIs)
Dashboard variables in Azure Monitor support dynamic population from subscription lists, resource groups, and resource names — enabling a single dashboard template to monitor any Azure resource by changing dropdown selections.
Best Practices & Approaches
Managing infrastructure monitoring across Kubernetes clusters and multiple cloud providers requires thoughtful architecture to avoid tool sprawl, alert fatigue, runaway costs, and fragmented visibility. The following best practices address the most common challenges in multi-cloud observability.
Multi-Cloud Observability Strategy
flowchart LR
subgraph AWS
CW[CloudWatch]
XR[X-Ray]
end
subgraph GCP
CM[Cloud Monitoring]
CT[Cloud Trace]
end
subgraph Azure
AM[Azure Monitor]
AI[App Insights]
end
subgraph Kubernetes
OC[OTel Collector]
end
subgraph Grafana Platform
G[Grafana]
M[Mimir]
L[Loki]
T[Tempo]
end
CW -->|CloudWatch DS| G
XR -->|X-Ray DS| G
CM -->|Cloud Monitoring DS| G
CT -->|Cloud Trace DS| G
AM -->|Azure Monitor DS| G
AI -->|Azure Monitor DS| G
OC -->|OTLP| M
OC -->|OTLP| L
OC -->|OTLP| T
M --> G
L --> G
T --> G
A successful multi-cloud monitoring strategy follows these principles:
- Single pane of glass: Grafana serves as the unified visualization layer regardless of where data originates. Users should never need to switch between cloud provider consoles for day-to-day monitoring.
- Vendor-native for platform services: Use CloudWatch for Lambda, Cloud Monitoring for Cloud Run, Azure Monitor for Azure Functions. These integrations are maintained by the cloud providers and read platform metrics directly from the source, with no extra collection hop.
- OpenTelemetry for applications: Instrument your own code with OpenTelemetry to ensure portability. Application telemetry flows through the OTel Collector into your own backends (Mimir/Loki/Tempo), avoiding vendor lock-in.
- Consistent labeling taxonomy: Establish naming conventions for environment (env: prod|staging|dev), service (service.name), team ownership (team), and cost center across all providers.
- Centralized alerting: Define all alert rules in Grafana Alerting rather than in individual cloud provider alerting systems. This provides unified notification routing and on-call management.
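For telemetry that passes through the OTel Collector, one way to enforce the labeling taxonomy is to stamp the agreed attributes on everything a given Collector handles. The sketch below uses the resource processor; the attribute keys other than service.name and the example values are assumptions chosen for illustration, not a standard this article prescribes.

```yaml
processors:
  # Hypothetical taxonomy for one cluster; adjust keys and values to
  # whatever convention your organization agrees on
  resource/taxonomy:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
      - key: team
        value: payments
        action: upsert
      - key: cost_center
        value: cc-1234
        action: upsert
```

Data queried in place from CloudWatch, Cloud Monitoring, or Azure Monitor never passes through this processor, so the same taxonomy has to be applied there through resource tags and labels on the cloud side.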
Unified Dashboards
Effective multi-cloud dashboards abstract away provider-specific details and present a service-centric view:
- Service Health Overview — RED metrics (Rate, Errors, Duration) for each service regardless of where it runs. Use mixed data sources to combine Lambda invocations (CloudWatch), GKE pod latency (Cloud Monitoring), and AKS response time (Azure Monitor) on a single dashboard.
- Infrastructure Cost Dashboard — Aggregate cost signals from AWS Cost Explorer, GCP Billing, and Azure Cost Management using their respective Grafana plugins.
- SLO Dashboard — Define Service Level Objectives that span providers. An SLO for “99.9% order processing success” might combine error rates from AWS Lambda (order intake), GCP Cloud Run (payment processing), and Azure Functions (notification delivery).
- Capacity Planning — Kubernetes resource utilization across EKS, GKE, and AKS clusters displayed with consistent units and time ranges. Use Grafana template variables to switch between clusters.
# Example: Mixed data source dashboard variable
# Structure of the dashboard JSON, shown as YAML for readability:
templating:
  list:
    - name: cloud_provider
      type: custom
      options:
        - text: AWS (us-east-1)
          value: cloudwatch-prod
        - text: GCP (us-central1)
          value: gcp-monitoring-prod
        - text: Azure (eastus)
          value: azure-monitor-prod
      current:
        text: AWS (us-east-1)
        value: cloudwatch-prod
Cost Management
Monitoring costs grow with cardinality (unique label combinations), retention duration, and query frequency. Apply these cost controls across your infrastructure monitoring:
Kubernetes Telemetry
- Filter at the Collector: Use OTel processors to drop metrics from non-production namespaces (filterprocessor) or reduce cardinality by removing high-cardinality labels (attributesprocessor)
- Aggregate before export: Use the metricstransformprocessor to combine per-pod metrics into per-deployment aggregates for less critical workloads
- Tiered collection intervals: 15s for production-critical, 60s for development/staging
Cloud Provider Costs
- CloudWatch: GetMetricData is billed at roughly $0.01 per 1,000 metrics requested, so every dashboard refresh has a cost. For high-volume scenarios, consider CloudWatch Metric Streams, which are billed per metric update rather than per query
- GCP Cloud Monitoring: Read API calls beyond the free monthly allotment are billed at roughly $0.01 per 1,000 calls (check current pricing for the exact allotment). Use alignment periods ≥ 60s to reduce call volume
- Azure Monitor: Logs ingestion into Log Analytics is the primary cost driver. Use Data Collection Rules to filter logs before ingestion and apply Basic tier for non-critical log tables
Grafana-Side Optimizations
- Dashboard query caching — Enable query caching for dashboards with many viewers but slow-changing data
- Reduce auto-refresh frequency — Default to 30s or 1m refresh instead of 5s for operational dashboards
- Recording rules — Pre-compute expensive aggregations as recording rules in Mimir rather than computing them at query time on every dashboard load
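The recording-rule idea can be sketched as a Prometheus-format rule group, which is the format Mimir's ruler loads. The metric and label names below are assumptions about how kubeletstats metrics and resource attributes appear after OTLP ingestion; check the actual series names in your Mimir tenant before using them.

```yaml
groups:
  - name: k8s-deployment-aggregates   # hypothetical group name
    interval: 1m
    rules:
      # Pre-compute per-deployment memory so dashboards query one
      # series per deployment instead of summing every pod on load
      - record: namespace_deployment:k8s_pod_memory_working_set_bytes:sum
        expr: |
          sum by (k8s_namespace_name, k8s_deployment_name) (
            k8s_pod_memory_working_set_bytes
          )
```

Dashboards then query the recorded series name directly, which shifts the aggregation cost from every panel load to one evaluation per interval.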
# OTel Collector - Cost optimization processors
processors:
  # Drop metrics from non-production namespaces
  filter/drop-nonprod:
    metrics:
      exclude:
        match_type: regexp
        resource_attributes:
          - key: k8s.namespace.name
            value: "^(dev|test|sandbox)-.*"
  # Remove high-cardinality labels before export.
  # k8s.pod.uid and container.id are resource attributes, so the resource
  # processor is used here; the attributes processor only edits data point,
  # span, and log record attributes.
  resource/reduce-cardinality:
    attributes:
      - key: k8s.pod.uid
        action: delete
      - key: container.id
        action: delete
  # Batch before export
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
  # Memory limiter to prevent OOM (place it first in each pipeline)
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
    spike_limit_mib: 128
Summary & Next Steps
Infrastructure monitoring with Grafana spans the full spectrum from Kubernetes cluster internals to multi-cloud platform services. The key architectural decisions are:
- Kubernetes: Deploy the OTel Collector as a DaemonSet (node-level receivers) and a single-replica Deployment (cluster-level receivers). The k8sattributes processor is non-negotiable for cross-signal correlation.
- AWS: Query CloudWatch in-place via Grafana's native data source. Use IAM roles (not access keys) and consider CloudWatch Metric Streams for high-volume scenarios.
- GCP: Use the Cloud Monitoring data source with MQL for advanced queries. Leverage GCE metadata authentication when running on GCP.
- Azure: The Azure Monitor data source covers metrics, Log Analytics (KQL), Application Insights traces, and Resource Graph queries. Use managed identity for zero-credential authentication.
- Multi-cloud: Keep cloud-native metrics in their native stores, centralize only application telemetry, and use Grafana as the unified visualization and alerting layer.
Next in the Series
In Part 8: Displaying Data with Dashboards, we’ll explore Grafana’s dashboard authoring capabilities — panel types, template variables, annotations, repeating rows, library panels, and dashboard-as-code workflows for managing dashboards at scale.