
Part 8: Kubernetes Observability

May 14, 2026 Wasil Zafar 21 min read

Kubernetes introduces unique observability challenges — ephemeral pods, dynamic scheduling, multi-layer networking, and a complex control plane. This part covers every layer of K8s monitoring: from the API server and etcd to individual containers, using kube-state-metrics, cAdvisor, and the OTel Operator.

Table of Contents

  1. Kubernetes Observability Layers
  2. Control Plane Monitoring
  3. Node & Pod Monitoring
  4. OTel Operator for Kubernetes
  5. kube-prometheus-stack
  6. Conclusion & Next Steps

Kubernetes Observability Layers

Kubernetes is not a single system — it is layers of systems orchestrating containers. Each layer needs different telemetry:

Kubernetes Observability Stack: Layer Model

flowchart TD
    A[Layer 4: Application\nBusiness metrics, request traces, app logs] --> B[Layer 3: Container\nCPU, memory, restarts, OOMKills]
    B --> C[Layer 2: Pod / Workload\nReady replicas, rollout status, HPA state]
    C --> D[Layer 1: Node\nNode CPU, disk, network, kubelet health]
    D --> E[Layer 0: Control Plane\nAPI server, etcd, scheduler, controller manager]
                            
| Layer | Metrics Source | Key Signals |
| --- | --- | --- |
| Control Plane | API server /metrics, etcd /metrics | Request latency, etcd leader changes, watch cache size |
| Node | node_exporter, kubelet /metrics | CPU, memory, disk, network, kubelet health |
| Pod / Workload | kube-state-metrics | Desired vs ready replicas, pod phase, restart count |
| Container | cAdvisor (embedded in kubelet) | Container CPU/memory usage vs limits, OOMKills |
| Application | OTel SDK, Prometheus client | Request rate, error rate, latency, business metrics |

Control Plane Monitoring

API Server Metrics

The API server is the gateway to all cluster operations. If it is slow or overloaded, everything suffers — deployments fail, pods do not get scheduled, and kubectl commands time out.

# Critical API server metrics to monitor
# Request rate by verb and resource
sum(rate(apiserver_request_total[5m])) by (verb, resource)

# Request latency (p99 for mutating operations)
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE"}[5m])) by (le, verb)
)

# Request error rate (5xx responses)
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/ sum(rate(apiserver_request_total[5m]))

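# Watch count (a high count can indicate inefficient controllers or clients)
# Note: apiserver_registered_watchers is deprecated in recent Kubernetes releases;
# apiserver_longrunning_requests{verb="WATCH"} is the usual replacement there.
apiserver_registered_watchers

# In-flight requests (currently being executed, split by request_kind: mutating / readOnly)
apiserver_current_inflight_requests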

Alert: API server p99 latency above 5s for mutating requests indicates that the cluster is overloaded. Common causes: too many watches, slow etcd, node overcommit. Treat this as a P1 alert, because the cluster cannot reliably process deployments or scheduling decisions.
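To make the p99 query above less of a black box, here is a minimal Python sketch of the linear interpolation that histogram_quantile() applies to cumulative buckets. It is a simplified illustration, not Prometheus's actual implementation, and the bucket boundaries and counts in example_buckets are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    Simplified sketch of the interpolation PromQL uses for classic histograms.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is the highest finite upper bound.
                return buckets[i - 1][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative request counts per latency bucket (seconds).
example_buckets = [(0.1, 900), (0.5, 980), (1.0, 990), (5.0, 998), (math.inf, 1000)]

for q in (0.99, 0.995, 0.999):
    print(q, histogram_quantile(q, example_buckets))
```

The last quantile shows why bucket layout matters for the 5s alert threshold: once the requested rank falls into the +Inf bucket, the estimate is capped at the highest finite bound, so a histogram whose largest bucket is 5s can never report a p99 above 5s.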

etcd Health Monitoring

etcd is the cluster's brain — all state lives there. If etcd is unhealthy, the cluster is unhealthy.

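# etcd critical metrics
# Leader changes over the last hour (should be 0 in steady state)
increase(etcd_server_leader_changes_seen_total[1h])

# WAL fsync duration (p99: how long it takes to persist a write to the write-ahead log)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# Backend commit duration (p99: how long it takes to commit to the on-disk database)
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# Database size (etcd has a default 2GB limit)
etcd_mvcc_db_total_size_in_bytes

# Proposals committed vs pending (healthy: committed keeps rising while pending stays near zero)
rate(etcd_server_proposals_committed_total[5m])
etcd_server_proposals_pending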

# gRPC request rate to etcd
sum(rate(grpc_server_handled_total{grpc_service="etcdserverpb.KV"}[5m])) by (grpc_method)
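Leader changes are exposed as an ever-growing counter, so what matters is how much it grew over a window, not its raw value. The sketch below shows the core idea behind that calculation in Python, including how a counter reset (for example after an etcd restart) is handled. It is a simplification: PromQL's increase() also extrapolates to the window boundaries, which this sketch ignores, and the sample values are made up.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples, tolerating resets."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter dropped, so the process restarted and began again
            # from zero: everything in the current sample is new increase.
            increase += cur
    return increase


# Hypothetical scrapes of etcd_server_leader_changes_seen_total over one hour.
scrapes = [12, 12, 13, 0, 1, 2]
print(counter_increase(scrapes))
```

With these samples the counter grew by one before the restart and by two after it, which is why comparing only the first and last raw values would give a misleading (negative) result.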

Scheduler & Controller Manager

The scheduler places pods on nodes; the controller manager reconciles desired state with actual state. Monitor both for scheduling failures and reconciliation delays:

# Scheduler: pod scheduling latency
histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))

# Scheduler: unschedulable pods (no node can accept them)
scheduler_pending_pods{queue="unschedulable"}

# Controller Manager: work queue depth (backlog of reconciliation work)
workqueue_depth{name=~"deployment|replicaset|statefulset"}

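# Controller Manager: queue latency (p99 time an item waits in the work queue before it is processed)
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{name="deployment"}[5m])) by (le))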

Node & Pod Monitoring

kube-state-metrics (KSM)

kube-state-metrics generates Prometheus metrics from the Kubernetes API: it turns cluster state into numbers. It reports the state of Kubernetes objects (desired replicas, ready replicas, pod phase, restart counts), not resource usage (that is cAdvisor's job).

# Pod status — are all pods in the deployment running?
kube_deployment_status_replicas_ready{deployment="order-service"}
kube_deployment_spec_replicas{deployment="order-service"}

# Pods in CrashLoopBackOff (container keeps crashing and restarting)
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}

# Pod restart count (high restarts = unstable application)
kube_pod_container_status_restarts_total

# OOMKilled containers (ran out of memory)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

# Pods stuck in Pending (scheduling issues)
kube_pod_status_phase{phase="Pending"}

# HPA current vs maximum replicas (current == max means the HPA cannot scale out further)
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_spec_max_replicas

# PVC status (bound vs pending)
kube_persistentvolumeclaim_status_phase
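The first pair of queries above only becomes useful once desired and ready replicas are compared per deployment. A minimal Python sketch of that comparison follows; the deployment names and replica counts are invented, and in practice the two dictionaries would come from kube_deployment_spec_replicas and kube_deployment_status_replicas_ready.

```python
def degraded_deployments(spec_replicas, ready_replicas):
    """Return {deployment: (ready, desired)} where ready < desired."""
    degraded = {}
    for name, desired in spec_replicas.items():
        ready = ready_replicas.get(name, 0)  # no series yet counts as zero ready
        if ready < desired:
            degraded[name] = (ready, desired)
    return degraded


# Hypothetical values keyed by deployment name.
spec = {"order-service": 3, "payment-service": 2}
ready = {"order-service": 3, "payment-service": 1}
print(degraded_deployments(spec, ready))
```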

cAdvisor & kubelet Metrics

cAdvisor (Container Advisor) is embedded in the kubelet and provides per-container resource usage metrics. These are the metrics that tell you whether containers are actually using the resources they requested.

# Container CPU usage vs request vs limit
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod, container)
kube_pod_container_resource_requests{resource="cpu"}
kube_pod_container_resource_limits{resource="cpu"}

# Container memory usage vs limit (critical for OOMKill detection)
container_memory_working_set_bytes{namespace="production"}
kube_pod_container_resource_limits{resource="memory"}

# Memory usage ratio (alert when > 90% of limit)
# container!="" drops the pod-level cgroup series, which have no matching limit
container_memory_working_set_bytes{container!=""}
/ on(namespace, pod, container)
kube_pod_container_resource_limits{resource="memory"}

# Network I/O per pod
sum(rate(container_network_receive_bytes_total[5m])) by (pod)
sum(rate(container_network_transmit_bytes_total[5m])) by (pod)

# Container filesystem usage
container_fs_usage_bytes{device=~"^/dev/.*"}
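The memory usage ratio above relies on matching two metrics by their (namespace, pod, container) labels. As a rough illustration of what that join does, the Python sketch below divides usage by limit for every label set present on both sides and flags containers above 90%. The pod names and byte values are hypothetical.

```python
MIB = 2 ** 20


def memory_usage_ratio(usage, limits):
    """Divide usage by limit for label sets that exist on both sides.

    Both arguments map (namespace, pod, container) -> bytes.
    """
    ratios = {}
    for key, used in usage.items():
        limit = limits.get(key)
        if limit:  # containers without a memory limit produce no result
            ratios[key] = used / limit
    return ratios


def near_limit(ratios, threshold=0.9):
    """Label sets whose usage exceeds the given fraction of the limit."""
    return sorted(key for key, ratio in ratios.items() if ratio > threshold)


# Hypothetical working-set and limit values.
usage = {
    ("production", "order-service-abc", "order-service"): 480 * MIB,
    ("production", "order-service-abc", "sidecar"): 50 * MIB,
}
limits = {("production", "order-service-abc", "order-service"): 512 * MIB}

print(near_limit(memory_usage_ratio(usage, limits)))
```

The sidecar has no limit, so it drops out of the result entirely, just as a series with no match on the right-hand side drops out of the PromQL division. Containers without limits therefore need a separate check.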

Essential Kubernetes Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| Pod CrashLoopBackOff | Container restarting > 5 times in 10 min | P2 |
| Pod OOMKilled | Container terminated with OOMKilled reason | P2 |
| Deployment Replicas Mismatch | Ready replicas < desired for > 10 min | P2 |
| Node Not Ready | Node condition NotReady for > 5 min | P1 |
| etcd Leader Changes | > 3 leader changes in 1 hour | P1 |
| API Server Errors | 5xx rate > 1% for > 5 min | P1 |
| Persistent Volume Full | PVC usage > 90% | P2 |
| HPA at Max | Current replicas = max replicas for > 15 min | P3 |
| Pods Pending | Pods in Pending phase for > 5 min | P3 |
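A few rows of this table can be expressed as a PrometheusRule object for the Prometheus Operator. The Python sketch below builds one and prints it as JSON, which kubectl accepts in place of YAML. The rule names, the manifest name, and the release: monitoring label are assumptions for illustration (the label is only needed if your Prometheus selects rules by it), and the expressions are one plausible translation of the conditions, not the only one.

```python
import json


def alert_rule(name, expr, severity, for_=None):
    """Build one alerting rule; 'for' is omitted when no hold duration applies."""
    rule = {"alert": name, "expr": expr, "labels": {"severity": severity}}
    if for_:
        rule["for"] = for_
    return rule


RULES = [
    alert_rule(
        "DeploymentReplicasMismatch",
        "kube_deployment_status_replicas_ready < kube_deployment_spec_replicas",
        "P2",
        for_="10m",
    ),
    alert_rule(
        "EtcdLeaderChanges",
        "increase(etcd_server_leader_changes_seen_total[1h]) > 3",
        "P1",
    ),
    alert_rule(
        "APIServerErrors",
        'sum(rate(apiserver_request_total{code=~"5.."}[5m]))'
        " / sum(rate(apiserver_request_total[5m])) > 0.01",
        "P1",
        for_="5m",
    ),
    alert_rule(
        "PodsPending",
        'kube_pod_status_phase{phase="Pending"} == 1',
        "P3",
        for_="5m",
    ),
]

manifest = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {
        "name": "k8s-essential-alerts",      # hypothetical name
        "namespace": "monitoring",
        "labels": {"release": "monitoring"},  # assumed rule selector label
    },
    "spec": {"groups": [{"name": "kubernetes-essentials", "rules": RULES}]},
}

print(json.dumps(manifest, indent=2))
```

Saved as a script, the output could be piped into kubectl apply -f - once the expressions and severities have been reviewed against your own thresholds.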

The OTel Operator for Kubernetes

The OpenTelemetry Operator is a Kubernetes operator that manages OTel Collectors and auto-instrumentation injection. It enables zero-code observability for Kubernetes workloads.

# Install the OTel Operator
# helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
# helm install otel-operator open-telemetry/opentelemetry-operator

# 1. Deploy an OTel Collector as a DaemonSet
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  mode: daemonset  # One collector per node
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_size: 1024
        timeout: 5s
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp/tempo:
        endpoint: tempo.monitoring.svc.cluster.local:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://mimir.monitoring.svc.cluster.local:9009/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
---
# 2. Auto-instrumentation injection: define how pods are instrumented
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation
  namespace: production
spec:
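  exporter:
    # The operator exposes a collector named "otel-collector" through a Service
    # named "otel-collector-collector". Python auto-instrumentation exports
    # OTLP over HTTP by default, hence port 4318 rather than 4317.
    endpoint: http://otel-collector-collector.monitoring.svc.cluster.local:4318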
  propagators:
    - tracecontext
    - baggage
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

---
# 3. Annotate your deployment to enable auto-instrumentation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
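spec:
  selector:
    matchLabels:
      app: order-service  # illustrative label; must match the template labels
  template:
    metadata:
      labels:
        app: order-service
      annotations: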
        # This single annotation auto-instruments the Python application
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: order-service
          image: order-service:2.4.1
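
Zero-Code Instrumentation: With the OTel Operator, adding instrumentation.opentelemetry.io/inject-python: "true" as a pod annotation automatically injects the OTel auto-instrumentation agent when the pod is created. The pod then emits traces and metrics for the HTTP, gRPC, and database libraries that the agent supports, without any code changes. Supported languages: Python, Java, Node.js, .NET, Go. Note that existing pods must be restarted to pick up the annotation, and that Go support works differently from the others (it relies on eBPF and needs elevated privileges).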
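One quick way to confirm that injection happened is to check, from inside the container, whether the standard OpenTelemetry SDK environment variables are set. The sketch below assumes the operator configures the agent through OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_RESOURCE_ATTRIBUTES; these names come from the OpenTelemetry specification, but the exact set injected can vary by operator version, so treat the list as an assumption.

```python
import os

# Standard OTel SDK variables assumed to be set by the injected configuration.
EXPECTED_VARS = (
    "OTEL_SERVICE_NAME",
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_RESOURCE_ATTRIBUTES",
)


def missing_otel_env(env=None):
    """Return the expected OTel variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


missing = missing_otel_env()
if missing:
    print("Not injected or incomplete, missing:", ", ".join(missing))
else:
    print("OTel environment looks configured")
```

If the variables are missing, common things to check are whether the Instrumentation resource lives in the same namespace as the pod and whether the pod was recreated after the annotation was added.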

kube-prometheus-stack — The Standard K8s Monitoring Stack

The kube-prometheus-stack Helm chart is the de facto standard for deploying Kubernetes monitoring. It bundles everything you need:

| Component | Purpose |
| --- | --- |
| Prometheus | Metrics collection and storage |
| Alertmanager | Alert routing and notification |
| Grafana | Dashboards and visualisation |
| kube-state-metrics | Kubernetes object state metrics |
| node-exporter | Node hardware and OS metrics |
| Prometheus Operator | CRDs for managing Prometheus config (ServiceMonitor, PrometheusRule) |

# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

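After installation, you get 20+ pre-built Grafana dashboards covering: Kubernetes cluster overview, node details, pod metrics, namespace resource usage, API server performance, etcd health, CoreDNS, and kubelet metrics. The adminPassword=admin value in the command is a placeholder for local testing; set a real secret before exposing Grafana to anyone else.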
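Once the stack is running, the Prometheus HTTP API is a convenient way to check that metrics are actually arriving. The following Python sketch builds an instant query URL for the /api/v1/query endpoint and parses the response. It assumes Prometheus is reachable at http://localhost:9090 (for example through kubectl port-forward), and the job label in the usage comment is an assumption about how the chart names its scrape jobs.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Convert an instant-vector response into {sorted label tuple: float}."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    result = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        result[labels] = float(series["value"][1])  # value is [timestamp, "string"]
    return result


def query(base_url, promql, timeout=10):
    """Run an instant query and return the parsed series."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Usage (requires a reachable Prometheus, so it is left commented out):
# for labels, value in query("http://localhost:9090", 'up{job="kube-state-metrics"}').items():
#     print(dict(labels), value)
```

A value of 1 for each expected target in the up series is a reasonable first smoke test before moving on to dashboards and alerts.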

Production Checklist

Kubernetes Observability Deployment Checklist

  1. Deploy kube-prometheus-stack (Prometheus + Grafana + Alertmanager + KSM + node-exporter)
  2. Deploy OTel Operator + DaemonSet Collector for application telemetry
  3. Deploy Loki for centralised log collection (Fluent Bit DaemonSet → Loki)
  4. Deploy Tempo for distributed trace storage
  5. Configure Grafana data sources: Prometheus, Loki, Tempo
  6. Enable auto-instrumentation via OTel Operator annotations
  7. Configure Alertmanager routing to PagerDuty/Slack
  8. Create SLO burn rate alerts for critical services
  9. Verify dashboards: cluster overview, node details, service golden signals
  10. Run load test and validate metrics, logs, and traces flow end-to-end

Conclusion & Next Steps

Kubernetes observability requires monitoring at every layer — from the control plane to individual containers. Key takeaways from Part 8:

  • Five layers of K8s observability: control plane, node, pod/workload, container, and application
  • kube-state-metrics provides object-state metrics (desired vs ready replicas, pod phase, HPA status); cAdvisor provides actual resource usage
  • Control plane monitoring (API server, etcd, scheduler) is critical — if the control plane is unhealthy, nothing works
  • The OTel Operator enables zero-code auto-instrumentation via pod annotations
  • kube-prometheus-stack is the standard Helm chart for deploying the complete monitoring stack
  • Always set up alerts for CrashLoopBackOff, OOMKills, node NotReady, and API server errors