Grafana Deep Dive Part 7: Infrastructure Monitoring — Kubernetes, AWS, GCP & Azure

Monitoring Kubernetes with Grafana

Kubernetes is the dominant container orchestration platform, and monitoring it effectively requires collecting telemetry at multiple layers: node metrics, pod and container statistics, cluster-level events, and application-generated signals. The OpenTelemetry Collector provides a comprehensive set of receivers specifically designed for Kubernetes environments, feeding data into Grafana’s LGTM stack (Loki, Grafana, Tempo, Mimir).

Kubernetes Monitoring Architecture with OTel Collector

flowchart TD
    subgraph K8s Cluster
        KA[Kubernetes Attributes Processor]
        KS[Kubeletstats Receiver]
        FL[Filelog Receiver]
        KC[Cluster Receiver]
        KO[Object Receiver]
        PR[Prometheus Receiver]
        HM[Host Metrics Receiver]
    end

    subgraph OTel Collector
        R[Receivers] --> P[Processors]
        P --> E[Exporters]
    end

    KS --> R
    FL --> R
    KC --> R
    KO --> R
    PR --> R
    HM --> R
    KA -.-> P

    subgraph Grafana Stack
        M[Mimir - Metrics]
        L[Loki - Logs]
        T[Tempo - Traces]
        G[Grafana - Visualization]
    end

    E --> M
    E --> L
    E --> T
    M --> G
    L --> G
    T --> G

                            
Key Insight: Running the OpenTelemetry Collector as a DaemonSet on every node covers node-local telemetry (node and container metrics, container logs), while cluster-level receivers run in a separate single-replica Deployment. Each receiver handles a specific telemetry domain, from node-level CPU/memory to cluster-level events, and the Kubernetes Attributes Processor enriches all signals with pod, namespace, and deployment metadata for unified correlation in Grafana.
                        

Kubernetes Attributes Processor

The k8sattributes processor is the cornerstone of Kubernetes observability. It automatically enriches telemetry data (metrics, logs, and traces) with Kubernetes metadata by correlating the source IP address of incoming telemetry with the Kubernetes API. This enrichment enables powerful cross-signal correlation in Grafana — you can jump from a slow trace to the pod’s CPU metrics to its container logs without manual context switching.

The processor adds attributes such as:

k8s.pod.name and k8s.pod.uid — Identify the exact pod instance
k8s.namespace.name — Namespace isolation context
k8s.deployment.name / k8s.statefulset.name — Workload ownership
k8s.node.name — Node placement information
k8s.container.name — Container within multi-container pods
Pod labels and annotations — Custom metadata from your deployment manifests

# OTel Collector configuration - Kubernetes Attributes Processor
processors:
  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.node.name
        - k8s.container.name
        - container.id
        - container.image.name
        - container.image.tag
      labels:
        - tag_name: app.label.team
          key: team
          from: pod
        - tag_name: app.label.version
          key: app.kubernetes.io/version
          from: pod
      annotations:
        - tag_name: app.annotation.config-hash
          key: checksum/config
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection

The pod_association configuration defines how the processor matches incoming telemetry to pods. It tries multiple strategies in order: first by the k8s.pod.ip resource attribute (set by other receivers), then by pod UID, and finally by the connection source IP. This cascading approach ensures enrichment works regardless of how telemetry arrives at the Collector.
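
The processor can only enrich telemetry if the Collector's ServiceAccount is allowed to read the objects it looks up. A minimal RBAC sketch follows; the ClusterRole name is hypothetical, and the resource list is an assumption that should be trimmed or extended to match the metadata you actually extract:

```yaml
# Hypothetical name; bind this role to the Collector's ServiceAccount
# with a ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8sattributes
rules:
  # Pod, namespace, and node metadata
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
  # ReplicaSets are read to resolve k8s.deployment.name from a pod's owner
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
```

Without these permissions the processor typically starts but leaves the k8s.* attributes empty, so missing enrichment is worth checking against RBAC first.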

Kubeletstats Receiver

The Kubeletstats Receiver collects node, pod, container, and volume metrics directly from the Kubelet’s /stats/summary API endpoint on each node. This provides the foundational infrastructure metrics you need for capacity planning and resource optimization: CPU usage, memory consumption, filesystem utilization, and network I/O at every level of the Kubernetes resource hierarchy.

# OTel Collector configuration - Kubeletstats Receiver
receivers:
  kubeletstats:
    collection_interval: 20s
    auth_type: "serviceAccount"
    endpoint: "https://${env:KUBE_NODE_NAME}:10250"
    insecure_skip_verify: true
    # Metric groups to collect
    metric_groups:
      - node        # Node-level CPU, memory, filesystem, network
      - pod         # Pod-level aggregates
      - container   # Per-container metrics
      - volume      # PersistentVolume usage
    # Optional: extra metadata for enrichment
    extra_metadata_labels:
      - container.id
      - k8s.volume.type
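    # Optional metrics to enable explicitly
    metrics:
      # Node metrics
      # (recent Collector releases deprecate k8s.node.cpu.utilization in favor of
      #  k8s.node.cpu.usage; check the receiver docs for your version)
      k8s.node.cpu.utilization: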
        enabled: true
      k8s.node.memory.available:
        enabled: true
      k8s.node.filesystem.available:
        enabled: true
      k8s.node.network.io:
        enabled: true
      # Container metrics
      k8s.container.cpu_limit_utilization:
        enabled: true
      k8s.container.memory_limit_utilization:
        enabled: true

                            
Deployment Note: The Kubeletstats Receiver must run on every node to collect node-local metrics. Deploy the OTel Collector as a DaemonSet with KUBE_NODE_NAME set from the spec.nodeName field. The ServiceAccount needs read access to nodes/stats, granted through a ClusterRole and a ClusterRoleBinding.
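
The two requirements in the note above can be sketched as follows. The ClusterRole name is hypothetical, and the nodes/proxy rule is an assumption based on the receiver's documentation, which asks for it when limit/request utilization metrics or extra_metadata_labels are enabled:

```yaml
# Fragment of the DaemonSet container spec: expose the node name to the Collector.
env:
  - name: KUBE_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-kubeletstats   # hypothetical name
rules:
  # Read access to the kubelet /stats/summary data
  - apiGroups: [""]
    resources: ["nodes/stats"]
    verbs: ["get"]
  # Assumed to be needed for *_limit_utilization metrics and extra_metadata_labels
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
```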
                        

Filelog Receiver (Container Logs)

The Filelog Receiver tails container log files from the node filesystem. In Kubernetes, container stdout/stderr is written to /var/log/pods/ (or /var/log/containers/ via symlinks). The receiver parses these log files, extracts Kubernetes metadata from the file path, and forwards structured log entries to Loki or any logs backend.

# OTel Collector configuration - Filelog Receiver for K8s
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # Exclude collector's own logs to prevent feedback loops
      - /var/log/pods/observability_otel-collector*/**/*.log
    start_at: end
    include_file_path: true
    include_file_name: false
    operators:
      # Route each line to the parser for its container runtime format
      - type: router
        id: get-format
        routes:
          - output: parser-docker
            expr: 'body matches "^\\{"'
          - output: parser-cri
            expr: 'body matches "^[^ Z]+ "'
          - output: parser-containerd
            expr: 'body matches "^[^ ]+ [^ ]+ [^ ]+ "'
      # CRI-O format parser (timestamp carries a numeric UTC offset, no trailing Z)
      - type: regex_parser
        id: parser-cri
        regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        # Skip the other parsers and continue with path-based metadata extraction
        output: extract-metadata
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.999999999Z07:00'
      # Docker JSON format parser
      - type: json_parser
        id: parser-docker
        output: extract-metadata
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # containerd format parser (UTC timestamp ending in Z)
      - type: regex_parser
        id: parser-containerd
        regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<flags>[^ ]+) (?P<log>.*)$'
        output: extract-metadata
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # Extract K8s metadata from file path
      - type: regex_parser
        id: extract-metadata
        regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\/]+)\/'
        parse_from: attributes["log.file.path"]
      # Move extracted fields to resource attributes
      - type: move
        from: attributes.namespace
        to: resource["k8s.namespace.name"]
      - type: move
        from: attributes.pod_name
        to: resource["k8s.pod.name"]
      - type: move
        from: attributes.container_name
        to: resource["k8s.container.name"]
      - type: move
        from: attributes.uid
        to: resource["k8s.pod.uid"]

The Filelog Receiver handles all three container runtime log formats (Docker JSON, CRI-O, containerd) by routing log lines through format-specific parsers. The regex extraction of Kubernetes metadata from file paths follows the standard /var/log/pods/<namespace>_<pod>_<uid>/<container>/ convention, ensuring logs are automatically correlated with the correct pod and container.
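
The receiver can only tail these files if the host's log directory is mounted into the Collector pod. A minimal sketch of the relevant DaemonSet fragment, with hypothetical container and volume names:

```yaml
# Fragment of the DaemonSet pod spec (names are placeholders).
containers:
  - name: otel-collector
    volumeMounts:
      - name: varlogpods
        mountPath: /var/log/pods
        readOnly: true          # the receiver only reads log files
volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
```

On nodes that still use the Docker runtime, the files under /var/log/pods are symlinks, so the directory they point to (commonly /var/lib/docker/containers) usually needs to be mounted the same way.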

Kubernetes Cluster Receiver

The Kubernetes Cluster Receiver collects cluster-level metrics and entity events from the Kubernetes API server. Unlike the Kubeletstats Receiver (which reports what resources pods are using), the Cluster Receiver reports what resources are defined and their state — replica counts, resource requests/limits, pod phases, and condition statuses.

# OTel Collector configuration - Kubernetes Cluster Receiver
receivers:
  k8s_cluster:
    auth_type: serviceAccount
    collection_interval: 30s
    node_conditions_to_report:
      - Ready
      - MemoryPressure
      - DiskPressure
      - PIDPressure
    allocatable_types_to_report:
      - cpu
      - memory
      - storage
      - ephemeral-storage
    metadata_collection_interval: 5m
    # Resource attributes to attach to the emitted metrics
    resource_attributes:
      k8s.deployment.name:
        enabled: true
      k8s.namespace.name:
        enabled: true
      k8s.node.name:
        enabled: true
      k8s.pod.name:
        enabled: true

Key metrics produced by the Cluster Receiver include:

k8s.deployment.desired / k8s.deployment.available — Replica health
k8s.pod.phase — Running, Pending, Failed, Succeeded
k8s.container.restarts — CrashLoopBackOff detection
k8s.node.condition — Node readiness and pressure conditions
k8s.resource_quota.hard_limit / k8s.resource_quota.used — Quota consumption

                            
Important: Deploy the Cluster Receiver as a single-replica Deployment (not a DaemonSet). It watches the API server centrally, so running multiple replicas produces duplicate metrics. The ServiceAccount requires cluster-wide read access to pods, nodes, deployments, replicasets, statefulsets, jobs, cronjobs, daemonsets, and resource quotas.
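
The read access listed above might be granted with a ClusterRole along these lines. This is a sketch: the role name is hypothetical, and the receiver's own documentation lists further resources (for example namespaces and autoscalers) that you may need depending on the metrics you enable:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-cluster   # hypothetical name
rules:
  # Core API group
  - apiGroups: [""]
    resources: ["pods", "nodes", "resourcequotas"]
    verbs: ["get", "list", "watch"]
  # Workload controllers
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
  # Batch workloads
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
```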
                        

Kubernetes Object Receiver

The Kubernetes Object Receiver collects objects from the Kubernetes API server and emits them as log records. Its most common use is capturing Kubernetes Events: ephemeral API objects that record significant occurrences such as pod scheduling, image pulls, OOMKills, failed probes, and scaling events. Because the API server retains Events only briefly, storing them as logs in Loki enables historical analysis and alerting on cluster-level events.

# OTel Collector configuration - Kubernetes Object Receiver
receivers:
  k8sobjects:
    auth_type: serviceAccount
    objects:
      - name: events
        mode: watch
        namespaces: [default, production, staging]
        group: events.k8s.io
      - name: events
        mode: pull
        namespaces: [kube-system]
        interval: 60s

The receiver supports two modes: watch streams events in real-time via the Kubernetes watch API, while pull periodically lists events at a configured interval. Use watch for production namespaces where real-time alerting matters, and pull for system namespaces where near-real-time is acceptable.
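
Both cluster-level receivers are usually wired into the same single-replica Collector. A hedged sketch of that Collector's pipelines follows; the exporter names and hostnames are placeholders, and it assumes Mimir and Loki expose their OTLP HTTP ingestion under an /otlp path prefix:

```yaml
# Pipelines for the single-replica "cluster" Collector.
# Hostnames below are placeholders, not real endpoints.
exporters:
  otlphttp/mimir:
    endpoint: http://mimir.example.internal/otlp
  otlphttp/loki:
    endpoint: http://loki.example.internal/otlp
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster]     # cluster state metrics
      exporters: [otlphttp/mimir]
    logs:
      receivers: [k8sobjects]      # Kubernetes Events as log records
      exporters: [otlphttp/loki]
```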

Prometheus Receiver (Scraping Pod Metrics)

The Prometheus Receiver implements Prometheus-compatible scraping within the OTel Collector. It discovers and scrapes metrics from pods annotated with prometheus.io/scrape: "true" and from any other /metrics endpoint you describe in scrape_configs. ServiceMonitor and PodMonitor resources are not read by the receiver itself; they require the OpenTelemetry Operator's Target Allocator. This bridges the Prometheus ecosystem into the OpenTelemetry pipeline, allowing you to collect application metrics alongside infrastructure telemetry.

# OTel Collector configuration - Prometheus Receiver
receivers:
  prometheus:
    config:
      scrape_configs:
        # Scrape pods with prometheus.io annotations
        - job_name: 'kubernetes-pods'
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Only scrape pods with annotation prometheus.io/scrape=true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            # Use custom port from annotation: keep the pod IP, swap in the annotated port
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              target_label: __address__
              regex: ([^:]+)(?::\d+)?;(\d+)
              # "$$" escapes "$" so the Collector does not expand it as an env variable
              replacement: $$1:$$2
            # Use custom path from annotation
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            # Add namespace label
            - source_labels: [__meta_kubernetes_namespace]
              action: replace
              target_label: namespace
            # Add pod name label
            - source_labels: [__meta_kubernetes_pod_name]
              action: replace
              target_label: pod

        # Scrape kube-state-metrics
        - job_name: 'kube-state-metrics'
          scrape_interval: 30s
          static_configs:
            - targets: ['kube-state-metrics.kube-system:8080']

        # Scrape node-exporter (if deployed)
        - job_name: 'node-exporter'
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: endpoints
          relabel_configs:
            - source_labels: [__meta_kubernetes_endpoints_name]
              action: keep
              regex: node-exporter
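
For the annotation-based job to pick up a workload, its pod template needs the matching annotations. A small sketch, where the port and path values are examples rather than defaults you can rely on:

```yaml
# Pod template fragment; annotation values must be strings, hence the quotes.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"      # example port
    prometheus.io/path: "/metrics"  # example path
```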

Host Metrics Receiver

The Host Metrics Receiver collects system-level metrics from the host machine — the underlying node in a Kubernetes context. It provides detailed CPU, memory, disk, and network metrics at the operating system level, complementing the Kubeletstats Receiver with lower-level visibility into node health.

# OTel Collector configuration - Host Metrics Receiver
receivers:
  hostmetrics:
    collection_interval: 30s
    root_path: /hostfs    # Mount host filesystem at /hostfs in container
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk:
        include:
          devices: ["^sd.*", "^nvme.*"]
          match_type: regexp    # filters accept strict or regexp, not glob
      filesystem:
        exclude_mount_points:
          mount_points: ["^/dev/.*", "^/proc/.*", "^/sys/.*"]
          match_type: regexp
        exclude_fs_types:
          fs_types: ["autofs", "binfmt_misc", "bpf", "cgroup2",
                     "configfs", "debugfs", "devpts", "devtmpfs",
                     "fusectl", "hugetlbfs", "mqueue", "nsfs",
                     "overlay", "proc", "procfs", "pstore",
                     "rpc_pipefs", "securityfs", "selinuxfs",
                     "squashfs", "sysfs", "tracefs", "tmpfs"]
          match_type: strict
      load: {}
      network:
        include:
          interfaces: ["^eth.*", "^ens.*"]
          match_type: regexp
      processes: {}
      process:
        include:
          match_type: regexp
          names: ["kubelet", "containerd", "dockerd"]
        mute_process_exe_error: true
        mute_process_io_error: true
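
The root_path setting only works if the host's root filesystem is actually mounted at that path inside the Collector container. A minimal sketch of the DaemonSet fragment, with hypothetical container and volume names:

```yaml
# Fragment of the DaemonSet pod spec (names are placeholders).
containers:
  - name: otel-collector
    volumeMounts:
      - name: hostfs
        mountPath: /hostfs              # must match root_path in the receiver
        readOnly: true
        mountPropagation: HostToContainer
volumes:
  - name: hostfs
    hostPath:
      path: /
```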

Receiver Comparison: Choosing the Right Receiver for Each Use Case

Receiver	Deployment	Signal	Primary Use Case
Kubeletstats	DaemonSet	Metrics	Pod/container CPU, memory, network, volumes
Filelog	DaemonSet	Logs	Container stdout/stderr log collection
Cluster	Deployment (1 replica)	Metrics	Replica counts, pod phases, node conditions
Object	Deployment (1 replica)	Logs	Kubernetes Events as log records
Prometheus	DaemonSet or Deployment	Metrics	Application metrics from /metrics endpoints
Host Metrics	DaemonSet	Metrics	OS-level CPU, memory, disk, network

Summary: DaemonSet for node-level receivers, Deployment for cluster-level receivers, k8sattributes enriches all signals.
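
To show how the node-level pieces fit together, here is a hedged sketch of the service section for the DaemonSet Collector. It assumes the kubeletstats, hostmetrics, prometheus, and filelog receivers and the k8sattributes processor from the sections above are defined in the same file; the exporter names and hostnames are placeholders:

```yaml
# Node-level (DaemonSet) Collector pipelines. Hostnames are placeholders.
receivers:
  otlp:                           # traces sent by instrumented workloads
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlphttp/mimir:
    endpoint: http://mimir.example.internal/otlp
  otlphttp/loki:
    endpoint: http://loki.example.internal/otlp
  otlp/tempo:
    endpoint: tempo.example.internal:4317
    tls:
      insecure: true              # plaintext gRPC; suitable for testing only
service:
  pipelines:
    metrics:
      receivers: [kubeletstats, hostmetrics, prometheus]
      processors: [k8sattributes, batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [filelog]
      processors: [k8sattributes, batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp/tempo]
```

Placing k8sattributes before batch in every pipeline means each signal is enriched while its connection context is still available, which matters for the connection-based pod association.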

Visualizing AWS Telemetry with Grafana Cloud

Amazon Web Services provides CloudWatch as its native monitoring service, collecting metrics and logs from over 80 AWS services automatically. Grafana integrates with AWS through the CloudWatch data source plugin, which queries CloudWatch Metrics and CloudWatch Logs, and through a separate X-Ray data source plugin for traces. Both query data in place, so you can build Grafana dashboards without duplicating AWS telemetry into a separate backend.

CloudWatch Data Source Configuration

The CloudWatch data source connects Grafana to your AWS account using IAM authentication. For Grafana Cloud, the recommended approach is to use an AWS IAM Role with cross-account trust, allowing Grafana to assume the role without storing long-lived access keys.

# Grafana provisioning - CloudWatch data source
apiVersion: 1
datasources:
  - name: Amazon CloudWatch
    type: cloudwatch
    access: proxy
    uid: cloudwatch-prod
    jsonData:
      authType: default          # Uses instance IAM role / env credentials
      defaultRegion: us-east-1
      # For cross-account access, add a role for Grafana to assume:
      # assumeRoleArn: arn:aws:iam::123456789012:role/GrafanaCloudWatchRole
      customMetricsNamespaces: "CustomApp,MyService"
      logsTimeout: "30m"
    # For explicit keys (less secure, avoid in production):
    # secureJsonData:
    #   accessKey: "AKIA..."
    #   secretKey: "..."

                            
IAM Best Practice: Create a dedicated IAM role with a minimal policy granting only cloudwatch:GetMetricData, cloudwatch:ListMetrics, cloudwatch:GetMetricStatistics, logs:StartQuery, logs:GetQueryResults, logs:GetLogEvents, xray:BatchGetTraces, and xray:GetTraceSummaries. Use an external ID in the trust policy for Grafana Cloud deployments.
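
The permission list above could be expressed as a CloudFormation managed policy. This is a sketch only: the logical ID is hypothetical, the action list simply mirrors the note, and some editor features (such as browsing log groups) may need additional read actions:

```yaml
# CloudFormation fragment; attach the policy to the role Grafana assumes.
Resources:
  GrafanaReadOnlyPolicy:          # hypothetical logical ID
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - cloudwatch:GetMetricData
              - cloudwatch:ListMetrics
              - cloudwatch:GetMetricStatistics
              - logs:StartQuery
              - logs:GetQueryResults
              - logs:GetLogEvents
              - xray:BatchGetTraces
              - xray:GetTraceSummaries
            Resource: "*"
```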
                        

CloudWatch Metrics

CloudWatch Metrics covers every AWS service with namespace-organized metrics. Common namespaces include AWS/EC2 (instance metrics), AWS/ECS (container service), AWS/Lambda (serverless functions), AWS/RDS (databases), AWS/ELB (load balancers), and AWS/S3 (storage). Each namespace exposes dimensions for filtering (instance ID, function name, etc.) and statistics for aggregation (Average, Sum, Maximum, Minimum, p99).

In Grafana, the CloudWatch metrics query editor provides:

Namespace selection — Browse available metric namespaces
Metric name — Auto-complete from the selected namespace
Dimensions — Filter by resource ID, name, or tag
Statistics — Average, Sum, Min, Max, SampleCount, or extended statistics (percentiles)
Period — Aggregation granularity (60s, 300s, etc.)
Math expressions — Combine metrics with CloudWatch Metric Math (METRICS("m1") / METRICS("m2") * 100)

{
  "namespace": "AWS/Lambda",
  "metricName": "Duration",
  "dimensions": {
    "FunctionName": ["payment-processor", "order-service"]
  },
  "statistics": ["p99", "Average"],
  "period": "300",
  "region": "us-east-1",
  "matchExact": true
}

X-Ray Traces

AWS X-Ray provides distributed tracing for applications running on AWS services. Grafana queries it through the X-Ray data source plugin, which is installed separately from the CloudWatch data source and lets you search traces by service name, response time, error status, and custom annotations. This enables end-to-end request tracing across Lambda functions, API Gateway, ECS tasks, and other AWS services directly within Grafana.

X-Ray queries use filter expressions:

# Find slow requests to the payment service
service("payment-service") AND responsetime > 5

# Find faulted (5xx) traces tagged as production; the time range comes from the dashboard time picker
service("order-api") AND fault = true AND annotation.environment = "production"

# Trace specific request patterns
http.url CONTAINS "/api/v2/orders" AND http.method = "POST" AND responsetime > 2
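
These expressions run against the X-Ray data source, which is provisioned much like the CloudWatch one. A minimal sketch, assuming the plugin is already installed and that credentials come from the default provider chain; the name and uid are placeholders:

```yaml
# Grafana provisioning - X-Ray data source (sketch)
apiVersion: 1
datasources:
  - name: AWS X-Ray
    type: grafana-x-ray-datasource
    access: proxy
    uid: xray-prod                # hypothetical UID
    jsonData:
      authType: default           # instance role / environment credentials
      defaultRegion: us-east-1
```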

CloudWatch Logs

CloudWatch Logs Insights provides a purpose-built query language for searching and analyzing log groups. From Grafana, you write Logs Insights queries that execute against CloudWatch and return results directly in your dashboards — no need to ship logs to a separate system for basic querying.

# CloudWatch Logs Insights query - Error analysis
fields @timestamp, @message, @logStream
| filter @message like /ERROR|Exception/
| stats count(*) as errorCount by bin(5m)
| sort @timestamp desc

# Lambda cold starts analysis
filter @type = "REPORT"
| stats avg(@duration) as avgDuration,
        max(@duration) as maxDuration,
        avg(@initDuration) as avgColdStart
  by bin(1h)

# Top 10 slowest API requests
fields @timestamp, @message
| filter @message like /API_LATENCY/
| parse @message "latency=* ms" as latency
| sort latency desc
| limit 10

Monitoring GCP with Grafana

Google Cloud Platform provides Cloud Monitoring (formerly Stackdriver) as its native observability service. Grafana’s Google Cloud Monitoring data source enables querying GCP metrics using the Monitoring Query Language (MQL) or the visual query builder, bringing GCP infrastructure and application metrics into unified Grafana dashboards alongside data from other cloud providers.

Cloud Monitoring Data Source

The Google Cloud Monitoring data source authenticates using a GCP service account with the monitoring.viewer role. For Grafana Cloud, you can use Workload Identity Federation to avoid managing JSON key files.

# Grafana provisioning - Google Cloud Monitoring data source
apiVersion: 1
datasources:
  - name: Google Cloud Monitoring
    type: stackdriver
    access: proxy
    uid: gcp-monitoring-prod
    jsonData:
      authenticationType: gce     # Uses GCE metadata (when on GCP)
      defaultProject: my-gcp-project-id
      # For a service account key, use these jsonData values instead:
      # authenticationType: jwt
      # clientEmail: grafana-monitoring@my-project.iam.gserviceaccount.com
      # tokenUri: https://oauth2.googleapis.com/token
    # With JWT authentication, only the private key goes in secureJsonData:
    # secureJsonData:
    #   privateKey: |
    #     -----BEGIN PRIVATE KEY-----
    #     ...
    #     -----END PRIVATE KEY-----

                            
Service Account Permissions: The GCP service account needs the roles/monitoring.viewer role for reading metrics, roles/logging.viewer for Cloud Logging queries, and roles/cloudtrace.user for Cloud Trace access. For cross-project monitoring, grant these roles at the organization or folder level.
                        

Query Editor & Dashboards

Grafana’s GCP query editor supports two modes: the visual builder (dropdown-based metric selection) and MQL (Monitoring Query Language) for complex queries. MQL provides full programmatic control over metric selection, alignment, aggregation, and filtering.

Visual Builder steps:

Select Service (e.g., Compute Engine, Cloud SQL, GKE)
Choose Metric (e.g., compute.googleapis.com/instance/cpu/utilization)
Add Filters by resource labels (zone, instance_name) or metric labels
Set Group By for aggregation dimensions
Choose Alignment function and period
Select Cross-series Reducer (mean, sum, max, count)

MQL Examples:

# GKE container CPU usage (cores) by namespace
fetch k8s_container
| metric 'kubernetes.io/container/cpu/core_usage_time'
| align rate(1m)
| every 1m
| group_by [resource.namespace_name], [value_core_usage_time_aggregate: aggregate(value.core_usage_time)]

# Cloud SQL connection count per database (set the alert threshold in Grafana Alerting)
fetch cloudsql_database
| metric 'cloudsql.googleapis.com/database/network/connections'
| align mean(5m)
| every 5m
| group_by [resource.database_id]

# Cloud Run request latency percentiles
fetch cloud_run_revision
| metric 'run.googleapis.com/request_latencies'
| align delta(1m)
| every 1m
| group_by [resource.service_name],
    [value_request_latencies_percentile: percentile(value.request_latencies, 99)]

GCP metrics are organized by monitored resource type (the resource being measured) and metric type (what aspect is being measured). Common resource types include gce_instance, k8s_container, cloudsql_database, cloud_run_revision, and gcs_bucket. Grafana’s auto-complete and documentation links help navigate GCP’s extensive metric catalog.

Monitoring Azure with Grafana

Microsoft Azure provides Azure Monitor as its comprehensive observability platform, encompassing metrics, logs (via Log Analytics), traces (via Application Insights), and alerts. Grafana’s Azure Monitor data source provides deep integration with all Azure Monitor capabilities, including the ability to query across multiple subscriptions and workspaces.

Azure Monitor Data Source Configuration

The Azure Monitor data source authenticates using an Azure Active Directory (Microsoft Entra ID) service principal or managed identity. For self-hosted Grafana running on Azure VMs or AKS, managed identity is the recommended zero-credential approach.

# Grafana provisioning - Azure Monitor data source
apiVersion: 1
datasources:
  - name: Azure Monitor
    type: grafana-azure-monitor-datasource
    access: proxy
    uid: azure-monitor-prod
    jsonData:
      # Authentication method
      azureAuthType: msi          # Managed Identity (recommended on Azure)
      # For App Registration (service principal):
      # azureAuthType: clientsecret
      # tenantId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      # clientId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      # cloudName: azuremonitor    # azuremonitor | azuremonitorchina | azuremonitorusgov

      # Default subscription for metric queries
      subscriptionId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

      # Log Analytics default workspace
      logAnalyticsDefaultWorkspace: "/subscriptions/xxx/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-prod"

      # Application Insights (optional)
      appInsightsAppId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    # For client secret authentication:
    # secureJsonData:
    #   clientSecret: "your-client-secret"

                            
Azure RBAC: The service principal or managed identity needs the Monitoring Reader role at the subscription or resource group level for metrics, Log Analytics Reader for log queries, and Application Insights Component Reader for APM data. Use Azure Policy to ensure these roles are assigned consistently across subscriptions.
                        

Azure Monitor Query Editor

The Azure Monitor data source supports four distinct query types, each optimized for a different Azure Monitor subsystem:

1. Metrics Query — Azure Monitor Metrics (platform metrics for all Azure resources):

{
  "queryType": "Azure Monitor",
  "subscription": "prod-subscription-id",
  "resourceGroup": "rg-production",
  "metricNamespace": "Microsoft.Web/sites",
  "resourceName": "my-web-app",
  "metricName": "HttpResponseTime",
  "aggregation": "Average",
  "timeGrain": "PT5M",
  "dimensionFilters": [
    { "dimension": "Instance", "operator": "eq", "filters": ["web-01", "web-02"] }
  ]
}

2. Logs Query — Azure Log Analytics (KQL — Kusto Query Language):

// Application errors by severity over time
AppExceptions
| where TimeGenerated > ago(24h)
| summarize ErrorCount = count() by bin(TimeGenerated, 1h), SeverityLevel
| order by TimeGenerated desc

// Container CPU utilization in AKS
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
| render timechart

// Network flow analysis
AzureNetworkAnalytics_CL
| where FlowStatus_s == "A"  // Allowed flows
| summarize TotalBytes = sum(InboundBytes_d + OutboundBytes_d)
  by bin(TimeGenerated, 1h), DestIP_s
| top 10 by TotalBytes desc

3. Traces Query — Application Insights distributed traces:

// Slow, failed dependency calls (duration is in milliseconds)
dependencies
| where duration > 5000
| where success == false
| project timestamp, target, name, duration, resultCode
| order by duration desc
| take 50

// End-to-end transaction search
union requests, dependencies, exceptions
| where operation_Id == "abc123-trace-id"
| order by timestamp asc

4. Azure Resource Graph — Query Azure resource inventory and configuration:

// Find all VMs not running
Resources
| where type == "microsoft.compute/virtualmachines"
| where properties.extended.instanceView.powerState.displayStatus != "VM running"
| project name, resourceGroup, location, properties.extended.instanceView.powerState.displayStatus

Azure Monitor Dashboards

Grafana provides pre-built dashboard templates for common Azure scenarios. These can be imported from the Grafana dashboard registry and customized for your environment:

Dashboard Templates: Pre-Built Azure Monitoring Dashboards

Azure VM Overview — CPU, memory, disk IOPS, network throughput across all VMs
Azure Kubernetes Service (AKS) — Cluster health, node pools, pod status, container insights
Azure SQL Database — DTU/vCore utilization, deadlocks, query performance
Azure App Service — HTTP response codes, response time, instance health
Azure Functions — Execution count, duration, failures, queue depth
Azure Storage Accounts — Transaction volume, latency, capacity trends
Azure Networking — Load Balancer health, Application Gateway metrics, NSG flow logs
Azure Cost Overview — Cost by resource group, service, and tag (via Cost Management APIs)

Template Variables: Multi-Subscription Auto-Discovery

Dashboard variables in Azure Monitor support dynamic population from subscription lists, resource groups, and resource names — enabling a single dashboard template to monitor any Azure resource by changing dropdown selections.

Best Practices & Approaches

Managing infrastructure monitoring across Kubernetes clusters and multiple cloud providers requires thoughtful architecture to avoid tool sprawl, alert fatigue, runaway costs, and fragmented visibility. The following best practices address the most common challenges in multi-cloud observability.

Multi-Cloud Observability Strategy

Multi-Cloud Observability Architecture

flowchart LR
    subgraph AWS
        CW[CloudWatch]
        XR[X-Ray]
    end

    subgraph GCP
        CM[Cloud Monitoring]
        CT[Cloud Trace]
    end

    subgraph Azure
        AM[Azure Monitor]
        AI[App Insights]
    end

    subgraph Kubernetes
        OC[OTel Collector]
    end

    subgraph Grafana Platform
        G[Grafana]
        M[Mimir]
        L[Loki]
        T[Tempo]
    end

    CW -->|CloudWatch DS| G
    XR -->|X-Ray DS| G
    CM -->|Cloud Monitoring DS| G
    CT -->|Cloud Trace DS| G
    AM -->|Azure Monitor DS| G
    AI -->|Azure Monitor DS| G
    OC -->|OTLP| M
    OC -->|OTLP| L
    OC -->|OTLP| T
    M --> G
    L --> G
    T --> G

A successful multi-cloud monitoring strategy follows these principles:

Single pane of glass — Grafana serves as the unified visualization layer regardless of where data originates. Users should never need to switch between cloud provider consoles for day-to-day monitoring.
Vendor-native for platform services — Use CloudWatch for Lambda, Cloud Monitoring for Cloud Run, Azure Monitor for Azure Functions. The native services receive platform metrics directly from the provider, so there is no export pipeline for you to run, and Grafana queries them in place through the matching data source.
OpenTelemetry for applications — Instrument your own code with OpenTelemetry to ensure portability. Application telemetry flows through the OTel Collector into your own backends (Mimir/Loki/Tempo), avoiding vendor lock-in.
Consistent labeling taxonomy — Establish naming conventions for environment (env: prod|staging|dev), service (service.name), team ownership (team), and cost center across all providers.
Centralized alerting — Define all alert rules in Grafana Alerting rather than in individual cloud provider alerting systems. This provides unified notification routing and on-call management.
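
For telemetry that passes through the OTel Collector, the labeling taxonomy can be enforced centrally instead of relying on every team to set it. A minimal sketch using the resource processor; the processor instance name and the values are hypothetical and would differ per cluster or environment:

```yaml
processors:
  resource/taxonomy:              # hypothetical instance name
    attributes:
      # upsert sets the attribute whether or not it already exists
      - key: env
        value: prod               # one of prod | staging | dev
        action: upsert
      - key: team
        value: payments           # example owning team
        action: upsert
```

The processor takes effect only once it is listed in the processors of each pipeline. Telemetry queried in place from CloudWatch, Cloud Monitoring, or Azure Monitor needs the same convention applied through each provider's own tagging.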

                            
Anti-Pattern: Do not attempt to ship all cloud provider metrics into Mimir/Prometheus. Cloud-native metrics (AWS CloudWatch, GCP Cloud Monitoring, Azure Monitor) are best queried in-place via their respective Grafana data sources. Duplicating this data is expensive and introduces staleness. Only centralize metrics from sources you fully control (Kubernetes workloads, custom applications).
                        

Unified Dashboards

Effective multi-cloud dashboards abstract away provider-specific details and present a service-centric view:

Service Health Overview — RED metrics (Rate, Errors, Duration) for each service regardless of where it runs. Use mixed data sources to combine Lambda invocations (CloudWatch), GKE pod latency (Cloud Monitoring), and AKS response time (Azure Monitor) on a single dashboard.
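Infrastructure Cost Dashboard — Aggregate cost signals from AWS Cost Explorer, GCP Billing, and Azure Cost Management. Data source coverage for billing data varies by provider, so confirm a maintained plugin exists for each or query the providers’ billing exports instead.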
SLO Dashboard — Define Service Level Objectives that span providers. An SLO for “99.9% order processing success” might combine error rates from AWS Lambda (order intake), GCP Cloud Run (payment processing), and Azure Functions (notification delivery).
Capacity Planning — Kubernetes resource utilization across EKS, GKE, and AKS clusters displayed with consistent units and time ranges. Use Grafana template variables to switch between clusters.

# Example: custom dashboard variable for switching between cloud data sources
# Shown as YAML for readability; Grafana stores the same structure as JSON under templating.list
templating:
  list:
    - name: cloud_provider
      type: custom
      options:
        - text: AWS (us-east-1)
          value: cloudwatch-prod
        - text: GCP (us-central1)
          value: gcp-monitoring-prod
        - text: Azure (eastus)
          value: azure-monitor-prod
      current:
        text: AWS (us-east-1)
        value: cloudwatch-prod
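
For the variable above to resolve, data sources with those exact names have to exist. One way to keep the names stable across environments is to provision them from a file. This is a sketch only: the names come from the variable example, the regions and authentication modes are assumptions chosen to match the credential-free approaches recommended in this article, and the jsonData keys should be checked against the provisioning reference for your Grafana version.

```yaml
apiVersion: 1

datasources:
  - name: cloudwatch-prod
    type: cloudwatch
    jsonData:
      authType: default          # use the IAM role from the SDK credential chain
      defaultRegion: us-east-1

  - name: gcp-monitoring-prod
    type: stackdriver            # plugin id of the Cloud Monitoring data source
    jsonData:
      authenticationType: gce    # GCE metadata server credentials

  - name: azure-monitor-prod
    type: grafana-azure-monitor-datasource
    jsonData:
      azureAuthType: msi         # managed identity; key name may differ by version
      subscriptionId: <subscription-id>   # placeholder, supply your own
```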

Cost Management

Monitoring costs grow with cardinality (unique label combinations), retention duration, and query frequency. Apply these cost controls across your infrastructure monitoring:

Cost Optimization Strategies

Kubernetes Telemetry

Filter at the Collector — Use OTel processors to drop metrics from non-production namespaces (filter processor) or reduce cardinality by removing high-cardinality labels (attributes processor for data point attributes, resource processor for resource attributes such as k8s.pod.uid)
Aggregate before export — Use the metricstransform processor to combine per-pod metrics into per-deployment aggregates for less critical workloads
Tiered collection intervals — 15s for production-critical, 60s for development/staging
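
Tiered intervals can be sketched with two scrape jobs in the Prometheus receiver, split by namespace. This assumes the same namespace naming used elsewhere in this article (non-production namespaces prefixed with dev-, test-, or sandbox-); the job names are hypothetical.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Production pods: everything that is NOT a non-prod namespace, every 15s.
        - job_name: pods-prod
          scrape_interval: 15s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_namespace]
              regex: (dev|test|sandbox)-.*
              action: drop

        # Development/staging pods: same discovery, scraped every 60s.
        - job_name: pods-nonprod
          scrape_interval: 60s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_namespace]
              regex: (dev|test|sandbox)-.*
              action: keep
```

Moving a workload from 15s to 60s cuts its sample volume to a quarter without losing the series entirely, which is a gentler option than dropping non-production metrics outright.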

Cloud Provider Costs

CloudWatch — GetMetricData is billed at $0.01 per 1,000 metrics requested, so cost scales with panel count, viewers, and refresh rate. For high-volume scenarios, consider CloudWatch Metric Streams, which is billed per metric update rather than per query
GCP Cloud Monitoring — Read API calls beyond the monthly free allotment cost $0.01/1,000 calls (check the current pricing page for the allotment). Use alignment periods ≥ 60s so each query returns fewer points
Azure Monitor — Logs ingestion into Log Analytics is the primary cost driver. Use Data Collection Rules to filter logs before ingestion and apply Basic tier for non-critical log tables
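
To see how refresh rate drives CloudWatch query cost, here is a rough worked estimate using the $0.01 per 1,000 metrics rate quoted above. The dashboard shape is an assumption for illustration: 20 panels with 5 metrics each, one browser session open continuously for a 30-day month, and no query caching.

```latex
\begin{align*}
\text{metrics per refresh} &= 20 \times 5 = 100 \\
\text{refreshes per month (30s)} &= 2 \times 60 \times 24 \times 30 = 86{,}400 \\
\text{cost (30s)} &= \frac{100 \times 86{,}400}{1{,}000} \times \$0.01 = \$86.40 \\[4pt]
\text{refreshes per month (5s)} &= 12 \times 60 \times 24 \times 30 = 518{,}400 \\
\text{cost (5s)} &= \frac{100 \times 518{,}400}{1{,}000} \times \$0.01 = \$518.40
\end{align*}
```

Under these assumptions the cost scales linearly with refresh frequency and with the number of concurrent viewers, which is why refresh defaults and query caching matter as much as the choice of metrics.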

Grafana-Side Optimizations

Dashboard query caching — Enable query caching for dashboards with many viewers but slow-changing data
Reduce auto-refresh frequency — Default to 30s or 1m refresh instead of 5s for operational dashboards
Recording rules — Pre-compute expensive aggregations as recording rules in Mimir rather than computing them at query time on every dashboard load
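
A recording rule group for Mimir might look like the sketch below. It uses the standard Prometheus rule file format, which the Mimir ruler accepts. The source metric name and the service_name and env labels are assumptions about how your application metrics are named; substitute your own.

```yaml
groups:
  - name: service_red
    interval: 1m
    rules:
      # Request rate per service, pre-aggregated across pods.
      - record: service:http_requests:rate5m
        expr: |
          sum by (service_name, env) (
            rate(http_server_request_duration_seconds_count[5m])
          )

      # p95 latency per service, computed once instead of on every dashboard load.
      - record: service:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service_name, env) (
              rate(http_server_request_duration_seconds_bucket[5m])
            )
          )
```

Dashboards then query the recorded series directly, so the cost of the aggregation is paid once per evaluation interval rather than once per viewer per refresh.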

Filter early, aggregate aggressively, cache smartly.

# OTel Collector - Cost optimization processors
processors:
  # Drop metrics from non-production namespaces
  filter/drop-nonprod:
    metrics:
      exclude:
        match_type: regexp
        resource_attributes:
          - key: k8s.namespace.name
            value: "^(dev|test|sandbox)-.*"

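  # Remove high-cardinality resource attributes before export.
  # k8s.pod.uid and container.id are resource attributes, so this uses the
  # resource processor; the attributes processor does not touch resource attributes.
  resource/reduce-cardinality:
    attributes:
      - key: k8s.pod.uid
        action: delete
      - key: container.id
        action: delete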

  # Batch before export (compression is configured on the exporter, not here)
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s

  # Memory limiter to prevent OOM (place it first in each pipeline)
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
    spike_limit_mib: 128
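
Processors only take effect once they are listed in a pipeline, and their order matters. The sketch below shows one reasonable ordering for a metrics pipeline. The otlp receiver and otlphttp exporter are assumptions and must be defined elsewhere in the same config, as must the k8sattributes processor.

```yaml
service:
  pipelines:
    metrics:
      receivers: [otlp]            # assumed receiver
      processors:
        - memory_limiter           # first, so it can apply back-pressure early
        - k8sattributes            # adds k8s.namespace.name before the filter reads it
        - filter/drop-nonprod      # drop unwanted data before doing further work
        # list your cardinality-reduction processor here, after the filter
        - batch                    # last, so only surviving data is batched
      exporters: [otlphttp]        # assumed exporter
```

Filtering before cardinality reduction and batching avoids spending CPU and memory on data that will be dropped anyway.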

Summary & Next Steps

Infrastructure monitoring with Grafana spans the full spectrum from Kubernetes cluster internals to multi-cloud platform services. The key architectural decisions are:

Kubernetes — Deploy the OTel Collector as a DaemonSet (node-level receivers) and a single-replica Deployment (cluster-level receivers). The k8sattributes processor is non-negotiable for cross-signal correlation.
AWS — Query CloudWatch in-place via Grafana’s native data source. Use IAM roles (not access keys) and consider CloudWatch Metric Streams for high-volume scenarios.
GCP — Use the Cloud Monitoring data source with MQL, or with PromQL now that Google recommends it over MQL, for advanced queries. Leverage GCE metadata authentication when running on GCP.
Azure — The Azure Monitor data source covers metrics, Log Analytics (KQL), Application Insights traces, and Resource Graph queries. Use managed identity for zero-credential authentication.
Multi-cloud — Keep cloud-native metrics in their native stores, centralize only application telemetry, and use Grafana as the unified visualization and alerting layer.

Next in the Series

In Part 8: Displaying Data with Dashboards, we’ll explore Grafana’s dashboard authoring capabilities — panel types, template variables, annotations, repeating rows, library panels, and dashboard-as-code workflows for managing dashboards at scale.
