Kubernetes Observability Layers
Kubernetes is not a single system — it is layers of systems orchestrating containers. Each layer needs different telemetry:
flowchart TD
A[Layer 4: Application\nBusiness metrics, request traces, app logs] --> B[Layer 3: Container\nCPU, memory, restarts, OOMKills]
B --> C[Layer 2: Pod / Workload\nReady replicas, rollout status, HPA state]
C --> D[Layer 1: Node\nNode CPU, disk, network, kubelet health]
D --> E[Layer 0: Control Plane\nAPI server, etcd, scheduler, controller manager]
| Layer | Metrics Source | Key Signals |
|---|---|---|
| Control Plane | API server /metrics, etcd /metrics | Request latency, etcd leader changes, watch cache size |
| Node | node_exporter, kubelet /metrics | CPU, memory, disk, network, kubelet health |
| Pod / Workload | kube-state-metrics | Desired vs ready replicas, pod phase, restart count |
| Container | cAdvisor (embedded in kubelet) | Container CPU/memory usage vs limits, OOMKills |
| Application | OTel SDK, Prometheus client | Request rate, error rate, latency, business metrics |
Control Plane Monitoring
API Server Metrics
The API server is the gateway to all cluster operations. If it is slow or overloaded, everything suffers — deployments fail, pods do not get scheduled, and kubectl commands time out.
# Critical API server metrics to monitor
# Request rate by verb and resource
sum(rate(apiserver_request_total[5m])) by (verb, resource)
# Request latency (p99 for mutating operations)
histogram_quantile(0.99,
sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE"}[5m])) by (le, verb)
)
# Request error rate (5xx responses)
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/ sum(rate(apiserver_request_total[5m]))
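# Long-running requests such as watches (a steadily growing WATCH count often points to inefficient controllers or clients)
# Older clusters expose this as apiserver_registered_watchers, which has since been deprecated
apiserver_longrunning_requests{verb="WATCH"}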
# In-flight requests (requests currently being served, not a queue; split by request_kind into mutating and readOnly)
apiserver_current_inflight_requests
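The latency and error-ratio queries above can be expensive to evaluate on every dashboard refresh, so one common approach is to precompute them as recording rules. The following is a minimal sketch, assuming the Prometheus Operator's PrometheusRule CRD is installed. The resource name, group name, recorded series names, and the `release: monitoring` label are illustrative assumptions, not names that any chart defines for you.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apiserver-recording-rules      # hypothetical name
  namespace: monitoring                # assumed namespace
  labels:
    release: monitoring                # assumption: Prometheus selects rules by Helm release label
spec:
  groups:
    - name: apiserver.recording        # hypothetical group name
      interval: 1m
      rules:
        # p99 latency of mutating requests, one series per verb
        - record: verb:apiserver_request_duration_seconds:p99_5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE"}[5m])) by (le, verb)
            )
        # Fraction of all requests answered with a 5xx status code
        - record: cluster:apiserver_request_errors:ratio_rate5m
          expr: |
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
            sum(rate(apiserver_request_total[5m]))
```

Dashboards and alerts can then read the recorded series directly instead of re-aggregating the raw histogram buckets each time.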
etcd Health Monitoring
etcd is the cluster's brain — all state lives there. If etcd is unhealthy, the cluster is unhealthy.
# etcd critical metrics
# Leader changes over the last hour (the raw metric is a cumulative counter; the increase should be 0 in steady state)
increase(etcd_server_leader_changes_seen_total[1h])
# WAL fsync duration (p99: how long it takes to persist Raft log entries to disk)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
# Database size (etcd has a default 2GB limit)
etcd_mvcc_db_total_size_in_bytes
# Proposals committed vs pending (healthy: steady commit rate, pending close to 0)
rate(etcd_server_proposals_committed_total[5m])
# Pending proposals is a gauge, so it is read directly rather than wrapped in rate()
etcd_server_proposals_pending
# gRPC request rate to etcd
sum(rate(grpc_server_handled_total{grpc_service="etcdserverpb.KV"}[5m])) by (grpc_method)
Scheduler & Controller Manager
The scheduler places pods on nodes; the controller manager reconciles desired state with actual state. Monitor both for scheduling failures and reconciliation delays:
# Scheduler: scheduling algorithm latency (p99 time spent choosing a node for a pod)
histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))
# Scheduler: unschedulable pods (no node can accept them)
scheduler_pending_pods{queue="unschedulable"}
# Controller Manager: work queue depth (backlog of reconciliation work)
workqueue_depth{name=~"deployment|replicaset|statefulset"}
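# Controller Manager: p99 time items wait in the work queue before reconciliation starts
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{name="deployment"}[5m])) by (le))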
Node & Pod Monitoring
kube-state-metrics (KSM)
kube-state-metrics generates Prometheus metrics from the Kubernetes API: it turns cluster state into numbers. It reports the state of Kubernetes objects, both desired (spec) and observed (status), but not resource usage (that is cAdvisor's job).
# Pod status — are all pods in the deployment running?
kube_deployment_status_replicas_ready{deployment="order-service"}
kube_deployment_spec_replicas{deployment="order-service"}
# Pods in CrashLoopBackOff (container keeps crashing and restarting)
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
# Pod restart count (high restarts = unstable application)
kube_pod_container_status_restarts_total
# OOMKilled containers (ran out of memory)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
# Pods stuck in Pending (scheduling issues)
kube_pod_status_phase{phase="Pending"}
# HPA current vs desired replicas, plus the configured maximum
kube_horizontalpodautoscaler_status_current_replicas
kube_horizontalpodautoscaler_status_desired_replicas
kube_horizontalpodautoscaler_spec_max_replicas
# PVC status (bound vs pending)
kube_persistentvolumeclaim_status_phase
cAdvisor & kubelet Metrics
cAdvisor (Container Advisor) is embedded in the kubelet and provides per-container resource usage metrics. These are the metrics that tell you whether containers are actually using the resources they requested.
# Container CPU usage vs request vs limit
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod, container)
kube_pod_container_resource_requests{resource="cpu"}
kube_pod_container_resource_limits{resource="cpu"}
# Container memory usage vs limit (critical for OOMKill detection)
container_memory_working_set_bytes{namespace="production"}
kube_pod_container_resource_limits{resource="memory"}
# Memory usage ratio (alert when > 90% of limit)
container_memory_working_set_bytes
/ on(namespace, pod, container)
kube_pod_container_resource_limits{resource="memory"}
# Network I/O per pod
sum(rate(container_network_receive_bytes_total[5m])) by (pod)
sum(rate(container_network_transmit_bytes_total[5m])) by (pod)
# Container filesystem usage
container_fs_usage_bytes{device=~"^/dev/.*"}
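The memory usage ratio above can be turned into an alert that fires before the kernel OOM-kills a container. This is a sketch under stated assumptions: it relies on the Prometheus Operator's PrometheusRule CRD, and the alert name, the `for: 10m` hold time, the P2 severity, and the `release: monitoring` label are illustrative choices rather than values taken from the article.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: container-memory-alerts        # hypothetical name
  namespace: monitoring                # assumed namespace
  labels:
    release: monitoring                # assumption: Prometheus selects rules by Helm release label
spec:
  groups:
    - name: container.memory           # hypothetical group name
      rules:
        - alert: ContainerMemoryNearLimit   # hypothetical alert name
          # Drop cAdvisor's pod-level aggregate series (empty container label)
          # and the pause container, then divide usage by the configured limit.
          # Containers without a memory limit have no right-hand series and are skipped.
          expr: |
            (
              container_memory_working_set_bytes{container!="", container!="POD"}
              / on(namespace, pod, container) group_left()
              kube_pod_container_resource_limits{resource="memory"}
            ) > 0.9
          for: 10m                     # assumed hold time to avoid flapping on short spikes
          labels:
            severity: P2
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is above 90% of its memory limit"
```

The `group_left()` modifier keeps the match valid even if cAdvisor exposes several series per container that differ only in labels such as `id` or `image`.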
Essential Kubernetes Alerts
| Alert | Condition | Severity |
|---|---|---|
| Pod CrashLoopBackOff | Container restarting > 5 times in 10 min | P2 |
| Pod OOMKilled | Container terminated with OOMKilled reason | P2 |
| Deployment Replicas Mismatch | Ready replicas < desired for > 10 min | P2 |
| Node Not Ready | Node condition NotReady for > 5 min | P1 |
| etcd Leader Changes | > 3 leader changes in 1 hour | P1 |
| API Server Errors | 5xx rate > 1% for > 5 min | P1 |
| Persistent Volume Full | PVC usage > 90% | P2 |
| HPA at Max | Current replicas = max replicas for > 15 min | P3 |
| Pods Pending | Pods in Pending phase for > 5 min | P3 |
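Several rows of the table above can be expressed as Prometheus alerting rules. The sketch below assumes the Prometheus Operator's PrometheusRule CRD and kube-state-metrics are installed. The alert names, resource name, and `release: monitoring` label are illustrative, while the thresholds and severities are copied from the table.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: essential-kubernetes-alerts    # hypothetical name
  namespace: monitoring                # assumed namespace
  labels:
    release: monitoring                # assumption: Prometheus selects rules by Helm release label
spec:
  groups:
    - name: kubernetes.essential       # hypothetical group name
      rules:
        # Container restarted more than 5 times in 10 minutes
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
          labels:
            severity: P2
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
        # Ready replicas below desired for more than 10 minutes
        - alert: DeploymentReplicasMismatch
          expr: kube_deployment_status_replicas_ready < kube_deployment_spec_replicas
          for: 10m
          labels:
            severity: P2
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.deployment }} has fewer ready replicas than desired"
        # Node has not reported Ready for more than 5 minutes
        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready", status="true"} == 0
          for: 5m
          labels:
            severity: P1
          annotations:
            summary: "Node {{ $labels.node }} is not Ready"
        # More than 3 etcd leader changes within one hour
        - alert: EtcdFrequentLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
          labels:
            severity: P1
          annotations:
            summary: "etcd member {{ $labels.instance }} saw more than 3 leader changes in the last hour"
        # API server 5xx ratio above 1% for more than 5 minutes
        - alert: APIServerHighErrorRate
          expr: |
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
            sum(rate(apiserver_request_total[5m]))
            > 0.01
          for: 5m
          labels:
            severity: P1
          annotations:
            summary: "More than 1% of API server requests are failing with 5xx"
```

The remaining rows (OOMKilled, Persistent Volume Full, HPA at Max, Pods Pending) follow the same pattern: a PromQL condition, an optional `for` duration, and a severity label that Alertmanager can route on.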
The OTel Operator for Kubernetes
The OpenTelemetry Operator is a Kubernetes operator that manages OTel Collectors and auto-instrumentation injection. It enables zero-code observability for Kubernetes workloads.
# Install the OTel Operator
# helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
# helm install otel-operator open-telemetry/opentelemetry-operator
# 1. Deploy an OTel Collector as a DaemonSet
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  mode: daemonset # One collector per node
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_size: 1024
        timeout: 5s
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp/tempo:
        endpoint: tempo.monitoring.svc.cluster.local:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://mimir.monitoring.svc.cluster.local:9009/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
---
# 2. Auto-instrumentation injection: define how annotated pods are instrumented
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation
  namespace: production
spec:
  exporter:
    # The Operator exposes a collector through a Service named <name>-collector.
    # Python auto-instrumentation exports OTLP over HTTP by default, hence port 4318.
    endpoint: http://otel-collector-collector.monitoring.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
---
# 3. Annotate your deployment to enable auto-instrumentation
# (excerpt: selector, pod labels, and other required Deployment fields are omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  template:
    metadata:
      annotations:
        # This single annotation auto-instruments the Python application
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: order-service
          image: order-service:2.4.1
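Adding the instrumentation.opentelemetry.io/inject-python: "true" annotation to the pod template tells the Operator to inject the OTel auto-instrumentation agent when the pod is created. The pod then emits traces and metrics for the HTTP, gRPC, and database libraries the agent supports, without any code changes. Injection happens at pod creation, so pods that are already running are instrumented only after they are recreated. Supported languages: Python, Java, Node.js, .NET, and Go, each with its own inject-<language> annotation.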
kube-prometheus-stack — The Standard K8s Monitoring Stack
The kube-prometheus-stack Helm chart is the de facto standard for deploying Kubernetes monitoring. It bundles everything you need:
| Component | Purpose |
|---|---|
| Prometheus | Metrics collection and storage |
| Alertmanager | Alert routing and notification |
| Grafana | Dashboards and visualisation |
| kube-state-metrics | Kubernetes object state metrics |
| node-exporter | Node hardware and OS metrics |
| Prometheus Operator | CRDs for managing Prometheus config (ServiceMonitor, PrometheusRule) |
# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.adminPassword=admin \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
After installation, you get 20+ pre-built Grafana dashboards covering: Kubernetes cluster overview, node details, pod metrics, namespace resource usage, API server performance, etcd health, CoreDNS, and kubelet metrics.
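The bundled dashboards cover cluster components only. To have the same Prometheus scrape an application's own metrics endpoint, the usual mechanism is a ServiceMonitor, one of the CRDs listed in the table above. A minimal sketch follows. The `app: order-service` Service label, the `http-metrics` port name, and the 30s interval are assumptions for illustration, and the `release: monitoring` label assumes the chart's default behaviour of selecting ServiceMonitors by Helm release name.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service
  namespace: production
  labels:
    release: monitoring          # assumption: matches the Helm release name from the install command
spec:
  # Select the Service (not the pods) that fronts the application
  selector:
    matchLabels:
      app: order-service         # assumed label on the Service
  endpoints:
    - port: http-metrics         # assumed name of the Service port exposing metrics
      path: /metrics
      interval: 30s              # assumed scrape interval
```

If the target does not appear in the Prometheus UI under Status > Targets, the first things to check are the `release` label and whether the port name matches the Service definition exactly.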
Kubernetes Observability Deployment Checklist
- Deploy kube-prometheus-stack (Prometheus + Grafana + Alertmanager + KSM + node-exporter)
- Deploy OTel Operator + DaemonSet Collector for application telemetry
- Deploy Loki for centralised log collection (Fluent Bit DaemonSet → Loki)
- Deploy Tempo for distributed trace storage
- Configure Grafana data sources: Prometheus, Loki, Tempo
- Enable auto-instrumentation via OTel Operator annotations
- Configure Alertmanager routing to PagerDuty/Slack
- Create SLO burn rate alerts for critical services
- Verify dashboards: cluster overview, node details, service golden signals
- Run load test and validate metrics, logs, and traces flow end-to-end
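For the SLO burn rate item in the checklist, a common pattern is a multi-window alert that fires only when the error budget is burning fast over both a long and a short window. The sketch below assumes a 99.9% availability SLO over 30 days and a hypothetical counter `http_requests_total` with `service` and `code` labels. Substitute whatever request metric your application actually exposes. The alert name, resource name, and labels are likewise illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: order-service-slo-alerts       # hypothetical name
  namespace: monitoring                # assumed namespace
  labels:
    release: monitoring                # assumption: Prometheus selects rules by Helm release label
spec:
  groups:
    - name: slo.order-service          # hypothetical group name
      rules:
        - alert: OrderServiceErrorBudgetFastBurn   # hypothetical alert name
          # Error budget for a 99.9% SLO is 0.001. A burn rate of 14.4 sustained
          # for 1 hour consumes 2% of a 30-day budget (14.4 * 1h / 720h = 0.02).
          # The 5m window makes the alert clear quickly once the errors stop.
          expr: |
            (
              sum(rate(http_requests_total{service="order-service", code=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{service="order-service"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{service="order-service", code=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="order-service"}[5m]))
            ) > (14.4 * 0.001)
          labels:
            severity: P1
          annotations:
            summary: "order-service is burning its 30-day error budget at more than 14.4x the sustainable rate"
```

A slower-burn companion rule (for example 6x over 6h and 30m) is usually paired with this one at a lower severity so that gradual degradation is also caught.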
Conclusion & Next Steps
Kubernetes observability requires monitoring at every layer — from the control plane to individual containers. Key takeaways from Part 8:
- Five layers of K8s observability: control plane, node, pod/workload, container, and application
- kube-state-metrics provides object-state metrics (desired vs ready replicas, pod phase, HPA status); cAdvisor provides actual resource usage
- Control plane monitoring (API server, etcd, scheduler) is critical — if the control plane is unhealthy, nothing works
- The OTel Operator enables zero-code auto-instrumentation via pod annotations
- kube-prometheus-stack is the standard Helm chart for deploying the complete monitoring stack
- Always set up alerts for CrashLoopBackOff, OOMKills, node NotReady, and API server errors