Part 8: Kubernetes Observability

Kubernetes Observability Layers

Kubernetes is not a single system — it is layers of systems orchestrating containers. Each layer needs different telemetry:

Kubernetes Observability Stack — Layer Model

flowchart TD
    A[Layer 4: Application\nBusiness metrics, request traces, app logs] --> B[Layer 3: Container\nCPU, memory, restarts, OOMKills]
    B --> C[Layer 2: Pod / Workload\nReady replicas, rollout status, HPA state]
    C --> D[Layer 1: Node\nNode CPU, disk, network, kubelet health]
    D --> E[Layer 0: Control Plane\nAPI server, etcd, scheduler, controller manager]

Layer	Metrics Source	Key Signals
Control Plane	API server /metrics, etcd /metrics	Request latency, etcd leader changes, watch cache size
Node	node_exporter, kubelet /metrics	CPU, memory, disk, network, kubelet health
Pod / Workload	kube-state-metrics	Desired vs ready replicas, pod phase, restart count
Container	cAdvisor (embedded in kubelet)	Container CPU/memory usage vs limits, OOMKills
Application	OTel SDK, Prometheus client	Request rate, error rate, latency, business metrics

Control Plane Monitoring

API Server Metrics

The API server is the gateway to all cluster operations. If it is slow or overloaded, everything suffers — deployments fail, pods do not get scheduled, and kubectl commands time out.

# Critical API server metrics to monitor
# Request rate by verb and resource
sum(rate(apiserver_request_total[5m])) by (verb, resource)

# Request latency (p99 for mutating operations)
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE"}[5m])) by (le, verb)
)

# Request error rate (5xx responses)
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/ sum(rate(apiserver_request_total[5m]))

# Watch count (high watch counts indicate controller inefficiency)
apiserver_registered_watchers

# In-flight requests (currently executing, not queued), split into mutating vs read-only
sum(apiserver_current_inflight_requests) by (request_kind)

                            
Alert: API server p99 latency > 5s for mutating requests indicates the cluster is overloaded. Common causes: too many watches, a slow etcd, node overcommit. This is a P1 alert, because the cluster cannot process deployments or scheduling decisions.
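
To make the p99 threshold concrete, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets, which is the idea behind histogram_quantile(). It is a simplified illustration, not Prometheus's exact implementation, and the bucket boundaries and counts in sample_buckets are made-up values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch of the linear interpolation used by PromQL's
    histogram_quantile(); the lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for mutating request durations (seconds).
sample_buckets = [(1.0, 900), (5.0, 980), (10.0, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample_buckets)
print(f"p99 = {p99:.2f}s, alert = {p99 > 5}")
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the bucket boundaries are spaced around the alert threshold.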
                        

etcd Health Monitoring

etcd is the cluster's brain — all state lives there. If etcd is unhealthy, the cluster is unhealthy.

# etcd critical metrics
# Leader changes over the last hour (should be 0 in steady state)
increase(etcd_server_leader_changes_seen_total[1h])

# WAL fsync duration (p99: how long it takes to persist Raft log entries to disk)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# Database size (etcd has a default 2GB limit)
etcd_mvcc_db_total_size_in_bytes

# Proposals committed (a counter, so use rate) vs pending (a gauge); healthy: pending stays near zero
rate(etcd_server_proposals_committed_total[5m])
etcd_server_proposals_pending

# gRPC request rate to etcd
sum(rate(grpc_server_handled_total{grpc_service="etcdserverpb.KV"}[5m])) by (grpc_method)
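
Counters such as etcd_server_leader_changes_seen_total only ever grow until the process restarts, so a useful signal is the increase over a window rather than the raw value. The Python sketch below shows the reset-tolerant idea behind that calculation; the sample values are invented, and unlike PromQL's increase() it does not extrapolate to the window boundaries.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples, tolerating resets.

    A drop between two samples is treated as a restart from zero, so the
    value observed after the reset counts in full.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


# Hypothetical scrapes of the leader-change counter over one hour;
# the drop from 5 to 1 represents an etcd member restart.
leader_changes = [4, 4, 5, 1, 2]
changes_last_hour = counter_increase(leader_changes)
print(changes_last_hour, changes_last_hour > 3)
```

With these samples the window contains three leader changes, which sits exactly at, but not above, a "more than 3 per hour" threshold.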

Scheduler & Controller Manager

The scheduler places pods on nodes; the controller manager reconciles desired state with actual state. Monitor both for scheduling failures and reconciliation delays:

# Scheduler: pod scheduling latency
histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))

# Scheduler: unschedulable pods (no node can accept them)
scheduler_pending_pods{queue="unschedulable"}

# Controller Manager: work queue depth (backlog of reconciliation work)
workqueue_depth{name=~"deployment|replicaset|statefulset"}

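# Controller Manager: reconciliation queue latency (p99 time items wait in the queue before processing)
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{name="deployment"}[5m])) by (le))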

Node & Pod Monitoring

kube-state-metrics (KSM)

kube-state-metrics generates Prometheus metrics from the Kubernetes API: it turns cluster state into numbers. It reports the state of Kubernetes objects (desired versus ready replicas, pod phase, restart counts), not resource usage (that is cAdvisor's job).

# Pod status — are all pods in the deployment running?
kube_deployment_status_replicas_ready{deployment="order-service"}
kube_deployment_spec_replicas{deployment="order-service"}

# Pods in CrashLoopBackOff (container keeps crashing and restarting)
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}

# Pod restart count (high restarts = unstable application)
kube_pod_container_status_restarts_total

# OOMKilled containers (ran out of memory)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

# Pods stuck in Pending (scheduling issues)
kube_pod_status_phase{phase="Pending"}

# HPA current vs desired vs maximum replicas
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_status_desired_replicas
kube_horizontalpodautoscaler_spec_max_replicas

# PVC status (bound vs pending)
kube_persistentvolumeclaim_status_phase
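
As a rough illustration of how the replica metrics above can be compared, the Python sketch below takes two instant-vector results in the shape returned by the Prometheus HTTP API (the data.result list of a /api/v1/query response) and lists deployments with fewer ready replicas than desired. The function names and sample series are hypothetical, and fetching the data over HTTP is left out.

```python
def index_by_deployment(vector):
    """Map (namespace, deployment) -> float value for an instant-vector result."""
    return {
        (s["metric"].get("namespace", ""), s["metric"]["deployment"]): float(s["value"][1])
        for s in vector
    }


def under_replicated(desired_vector, ready_vector):
    """Return (namespace, deployment, ready, desired) where ready < desired."""
    desired = index_by_deployment(desired_vector)
    ready = index_by_deployment(ready_vector)
    return sorted(
        (ns, name, int(ready.get((ns, name), 0)), int(want))
        for (ns, name), want in desired.items()
        if ready.get((ns, name), 0) < want
    )


# Hypothetical results for kube_deployment_spec_replicas and
# kube_deployment_status_replicas_ready (values arrive as strings).
desired_sample = [
    {"metric": {"namespace": "production", "deployment": "order-service"}, "value": [1700000000, "3"]},
    {"metric": {"namespace": "production", "deployment": "cart-service"}, "value": [1700000000, "2"]},
]
ready_sample = [
    {"metric": {"namespace": "production", "deployment": "order-service"}, "value": [1700000000, "1"]},
    {"metric": {"namespace": "production", "deployment": "cart-service"}, "value": [1700000000, "2"]},
]
print(under_replicated(desired_sample, ready_sample))
```

Keying on both namespace and deployment avoids mixing up same-named deployments in different namespaces.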

cAdvisor & kubelet Metrics

cAdvisor (Container Advisor) is embedded in the kubelet and provides per-container resource usage metrics. These are the metrics that tell you whether containers are actually using the resources they requested.

# Container CPU usage vs request vs limit
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod, container)
kube_pod_container_resource_requests{resource="cpu"}
kube_pod_container_resource_limits{resource="cpu"}

# Container memory usage vs limit (critical for OOMKill detection)
container_memory_working_set_bytes{namespace="production"}
kube_pod_container_resource_limits{resource="memory"}

# Memory usage ratio (alert when > 90% of limit)
container_memory_working_set_bytes
/ on(namespace, pod, container)
kube_pod_container_resource_limits{resource="memory"}

# Network I/O per pod
sum(rate(container_network_receive_bytes_total[5m])) by (pod)
sum(rate(container_network_transmit_bytes_total[5m])) by (pod)

# Container filesystem usage
container_fs_usage_bytes{device=~"^/dev/.*"}
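
The memory usage ratio above joins two metrics on (namespace, pod, container). The following Python sketch mimics that one-to-one matching with plain dictionaries, under the assumption that each key appears at most once per side; the pod names and byte values are invented for illustration.

```python
MIB = 2 ** 20


def memory_limit_ratio(usage, limits, threshold=0.9):
    """Return {key: ratio} for containers using more than `threshold` of their limit.

    Both inputs are dicts keyed by (namespace, pod, container) with values in bytes.
    Containers without a limit are skipped, just as the PromQL division
    produces no result when the right-hand side has no matching series.
    """
    over = {}
    for key, used in usage.items():
        limit = limits.get(key)
        if not limit:
            continue
        ratio = used / limit
        if ratio > threshold:
            over[key] = ratio
    return over


# Hypothetical working-set and limit values.
usage_sample = {
    ("production", "order-service-abc", "order-service"): 480 * MIB,
    ("production", "cart-service-xyz", "cart-service"): 100 * MIB,
    ("production", "batch-job-1", "worker"): 300 * MIB,  # no memory limit set
}
limits_sample = {
    ("production", "order-service-abc", "order-service"): 512 * MIB,
    ("production", "cart-service-xyz", "cart-service"): 256 * MIB,
}
print(memory_limit_ratio(usage_sample, limits_sample))
```

Note that the container without a limit never appears in the output, so a separate check is needed to find workloads that have no memory limit at all.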

Essential Kubernetes Alerts

Alert	Condition	Severity
Pod CrashLoopBackOff	Container restarting > 5 times in 10 min	P2
Pod OOMKilled	Container terminated with OOMKilled reason	P2
Deployment Replicas Mismatch	Ready replicas < desired for > 10 min	P2
Node Not Ready	Node condition NotReady for > 5 min	P1
etcd Leader Changes	> 3 leader changes in 1 hour	P1
API Server Errors	5xx rate > 1% for > 5 min	P1
Persistent Volume Full	PVC usage > 90%	P2
HPA at Max	Current replicas = max replicas for > 15 min	P3
Pods Pending	Pods in Pending phase for > 5 min	P3
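
Most conditions in the table only fire after they have held for some time. The Python sketch below illustrates that "must hold continuously" behaviour, similar in spirit to the for: clause of a Prometheus alerting rule. The function name, sample timestamps, and replica counts are assumptions made for the example.

```python
def firing_at(samples, predicate, hold_seconds):
    """Return the first timestamp at which the alert would fire, or None.

    samples: list of (timestamp_seconds, value) tuples in time order.
    The predicate must be true on every sample for at least hold_seconds;
    a single healthy sample resets the pending timer.
    """
    pending_since = None
    for ts, value in samples:
        if predicate(value):
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold_seconds:
                return ts
        else:
            pending_since = None
    return None


# Hypothetical ready-replica counts sampled every 60s for a deployment that wants 3.
ready_samples = [(0, 3), (60, 2), (120, 2), (180, 3)] + [(t, 2) for t in range(240, 901, 60)]
fired = firing_at(ready_samples, lambda ready: ready < 3, hold_seconds=600)
print(fired)
```

The brief dip between 60s and 120s never fires because the deployment recovers at 180s; only the sustained mismatch starting at 240s crosses the 10 minute hold.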

The OTel Operator for Kubernetes

The OpenTelemetry Operator is a Kubernetes operator that manages OTel Collectors and auto-instrumentation injection. It enables zero-code observability for Kubernetes workloads.

# Install the OTel Operator
# helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
# helm install otel-operator open-telemetry/opentelemetry-operator

# 1. Deploy an OTel Collector as a DaemonSet
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  mode: daemonset  # One collector per node
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_size: 1024
        timeout: 5s
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp/tempo:
        endpoint: tempo.monitoring.svc.cluster.local:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://mimir.monitoring.svc.cluster.local:9009/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]

---
# 2. Define an Instrumentation resource that configures auto-instrumentation for annotated pods
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation
  namespace: production
spec:
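  exporter:
    # The operator exposes the collector through a Service named <name>-collector.
    # Python auto-instrumentation exports OTLP over HTTP by default, hence port 4318.
    endpoint: http://otel-collector-collector.monitoring.svc.cluster.local:4318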
  propagators:
    - tracecontext
    - baggage
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

---
# 3. Annotate your deployment to enable auto-instrumentation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
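spec:
  selector:
    matchLabels:
      app: order-service  # illustrative label; use your own selector
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        # This single annotation auto-instruments the Python application
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: order-service
          image: order-service:2.4.1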

                            
Zero-Code Instrumentation: With the OTel Operator, adding the annotation instrumentation.opentelemetry.io/inject-python: "true" to a pod template automatically injects the OTel auto-instrumentation agent when the pod is created. The pod starts generating traces and metrics for HTTP, gRPC, and database operations in supported libraries without any code changes. Supported languages: Python, Java, Node.js, .NET, Go.
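
The tracecontext propagator configured in the Instrumentation resource carries trace identity between services in a traceparent HTTP header. As a minimal sketch of what that header contains, the Python function below parses the version 00 format defined by the W3C Trace Context specification; the function name is hypothetical, and real services would rely on the OTel SDK rather than parsing this by hand.

```python
import re

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Parse a version-00 style traceparent header into a dict, or return None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


example = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(example)
```

Every service that forwards this header with the same trace ID contributes spans to one distributed trace, which is what lets a trace backend such as Tempo stitch requests together across pods.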
                        

kube-prometheus-stack — The Standard K8s Monitoring Stack

The kube-prometheus-stack Helm chart is the de facto standard for deploying Kubernetes monitoring. It bundles everything you need:

Component	Purpose
Prometheus	Metrics collection and storage
Alertmanager	Alert routing and notification
Grafana	Dashboards and visualisation
kube-state-metrics	Kubernetes object state metrics
node-exporter	Node hardware and OS metrics
Prometheus Operator	CRDs for managing Prometheus config (ServiceMonitor, PrometheusRule)

# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

After installation, you get 20+ pre-built Grafana dashboards covering: Kubernetes cluster overview, node details, pod metrics, namespace resource usage, API server performance, etcd health, CoreDNS, and kubelet metrics.

Production Checklist

Kubernetes Observability Deployment Checklist

Deploy kube-prometheus-stack (Prometheus + Grafana + Alertmanager + KSM + node-exporter)
Deploy OTel Operator + DaemonSet Collector for application telemetry
Deploy Loki for centralised log collection (Fluent Bit DaemonSet → Loki)
Deploy Tempo for distributed trace storage
Configure Grafana data sources: Prometheus, Loki, Tempo
Enable auto-instrumentation via OTel Operator annotations
Configure Alertmanager routing to PagerDuty/Slack
Create SLO burn rate alerts for critical services
Verify dashboards: cluster overview, node details, service golden signals
Run load test and validate metrics, logs, and traces flow end-to-end


Conclusion & Next Steps

Kubernetes observability requires monitoring at every layer — from the control plane to individual containers. Key takeaways from Part 8:

Five layers of K8s observability: control plane, node, pod/workload, container, and application
kube-state-metrics provides object-state metrics (desired vs ready replicas, pod phase, HPA status); cAdvisor provides actual resource usage
Control plane monitoring (API server, etcd, scheduler) is critical — if the control plane is unhealthy, nothing works
The OTel Operator enables zero-code auto-instrumentation via pod annotations
kube-prometheus-stack is the standard Helm chart for deploying the complete monitoring stack
Always set up alerts for CrashLoopBackOff, OOMKills, node NotReady, and API server errors
