Cluster Operations & Reliability - Part 13

Cluster Upgrades

Version Skew Policy

Kubernetes follows a strict version skew policy that dictates which component versions are compatible during rolling upgrades:

Component	Allowed Skew	Example (API Server 1.30)
kube-apiserver	Reference (newest)	1.30
kubelet	apiserver −3 versions	1.27–1.30
kube-controller-manager	apiserver −1 version	1.29–1.30
kube-scheduler	apiserver −1 version	1.29–1.30
kube-proxy	apiserver −3 versions	1.27–1.30
kubectl	apiserver ±1 version	1.29–1.31
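
The skew table can also be read as a simple rule on minor version numbers. The following is a minimal Python sketch of that rule, not part of any Kubernetes tooling; the names `ALLOWED_SKEW`, `minor`, and `within_skew` are hypothetical and only encode the rows above, assuming all versions share the same major number.

```python
# Allowed minor-version skew relative to kube-apiserver, taken from the table above.
# Each value is (max minor versions older, max minor versions newer).
ALLOWED_SKEW = {
    "kubelet": (3, 0),
    "kube-controller-manager": (1, 0),
    "kube-scheduler": (1, 0),
    "kube-proxy": (3, 0),
    "kubectl": (1, 1),
}


def minor(version):
    """Return the minor number from a version such as '1.30' or 'v1.30.2'."""
    return int(version.lstrip("v").split(".")[1])


def within_skew(component, component_version, apiserver_version):
    """True if the component version is compatible with the API server version."""
    older, newer = ALLOWED_SKEW[component]
    # Positive diff means the component is older than the API server.
    diff = minor(apiserver_version) - minor(component_version)
    return -newer <= diff <= older


print(within_skew("kubelet", "1.27", "1.30"))         # True
print(within_skew("kube-scheduler", "1.28", "1.30"))  # False
print(within_skew("kubectl", "1.31", "1.30"))         # True
```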

                            
Upgrade Rule: Always upgrade one minor version at a time (1.28→1.29→1.30, never 1.28→1.30). Upgrade the control plane first, then the workers. Read the changelog for each version: deprecated APIs and removed features can break workloads.
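
As an illustration of the one-minor-at-a-time rule, the sketch below lists the intermediate versions an upgrade has to pass through. `upgrade_path` is a hypothetical helper written for this article, and it assumes plain `major.minor` version strings.

```python
def upgrade_path(current, target):
    """List every minor version to step through, in order, ending at the target."""
    major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are handled")
    return [f"{major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.28", "1.30"))  # ['1.29', '1.30']
```

Each entry in the returned list corresponds to one full pass of the upgrade procedure (control plane, then workers).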
                        

Upgrade Procedure

# Kubernetes cluster upgrade procedure (kubeadm):

# Step 1: Upgrade control plane (one node at a time if HA)
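# Check available versions
# (the pkgs.k8s.io package repositories are per minor version, so point the
# apt source at the v1.30 repository before running apt-get update):
apt-cache policy kubeadm | head -20

# Upgrade kubeadm (unhold/hold applies if the package was pinned with apt-mark hold):
apt-mark unhold kubeadm
apt-get update
apt-get install -y kubeadm=1.30.0-1.1
apt-mark hold kubeadm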

# Verify upgrade plan:
kubeadm upgrade plan
# [upgrade/config] Making sure the configuration is correct:
# ...
# Components that must be upgraded manually after upgrade:
# COMPONENT   CURRENT   TARGET
# kubelet     v1.29.4   v1.30.0
# 
# Upgrade to the latest stable version:
# COMPONENT                 CURRENT   TARGET
# kube-apiserver            v1.29.4   v1.30.0
# kube-controller-manager   v1.29.4   v1.30.0
# kube-scheduler            v1.29.4   v1.30.0
# etcd                      3.5.12    3.5.15

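# Apply the upgrade on the first control plane node
# (on additional control plane nodes, run "kubeadm upgrade node" instead):
kubeadm upgrade apply v1.30.0

# Drain the control plane node, then upgrade kubelet and kubectl on it:
kubectl drain master-1 --ignore-daemonsets
apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
systemctl daemon-reload
systemctl restart kubelet
kubectl uncordon master-1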

# Step 2: Upgrade worker nodes (one at a time)
# From the control plane, drain the worker:
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# On the worker node:
apt-get update
apt-get install -y kubeadm=1.30.0-1.1
kubeadm upgrade node
apt-get install -y kubelet=1.30.0-1.1
systemctl daemon-reload
systemctl restart kubelet

# Uncordon the worker:
kubectl uncordon worker-1

# Verify:
kubectl get nodes
# NAME       STATUS   VERSION
# master-1   Ready    v1.30.0
# worker-1   Ready    v1.30.0
# worker-2   Ready    v1.29.4  ← still needs upgrade

Managed K8s Upgrades

# Managed Kubernetes: control plane upgrades are provider-managed

# AWS EKS:
aws eks update-cluster-version --name production --kubernetes-version 1.30
# EKS upgrades control plane first, then you upgrade node groups:
aws eks update-nodegroup-version --cluster-name production \
  --nodegroup-name workers --kubernetes-version 1.30

# Azure AKS:
az aks upgrade --resource-group myRG --name production --kubernetes-version 1.30
# AKS can surge-upgrade (create extra nodes, drain old ones)

# GCP GKE:
gcloud container clusters upgrade production --master --cluster-version 1.30
gcloud container clusters upgrade production --node-pool default-pool

# GKE auto-upgrades: maintenance windows + surge settings
# Control plane: automatic during maintenance window
# Node pools: configurable surge (maxSurge, maxUnavailable)

Node Management

Drain & Cordon

# Cordon: mark a node as unschedulable (existing pods stay)
kubectl cordon worker-3
# node/worker-3 cordoned

kubectl get nodes
# NAME       STATUS                     ROLES
# worker-3   Ready,SchedulingDisabled   <none>

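# Drain: evict all pods from a node (respects PDBs)
#   --ignore-daemonsets      skip DaemonSet-managed pods (their controller would recreate them)
#   --delete-emptydir-data   allow evicting pods that use emptyDir (the data is deleted)
#   --grace-period=60        give each pod 60s to terminate
#   --timeout=300s           fail if the drain takes longer than 5 minutes
# A comment after a trailing backslash breaks the command, so the flags are documented above.
kubectl drain worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s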

# What happens during drain:
# 1. Node marked SchedulingDisabled (cordoned)
# 2. Each pod is evicted (respecting PDBs)
# 3. ReplicaSets/Deployments create replacement pods on other nodes
# 4. Eviction waits for pod's terminationGracePeriodSeconds
# 5. If PDB blocks eviction, drain waits (until --timeout)

# Uncordon: allow scheduling again
kubectl uncordon worker-3

Taints & Tolerations

Taints repel pods from nodes. Tolerations allow specific pods to schedule on tainted nodes. Together they control which workloads run where:

# Add taints to nodes:
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl taint nodes spot-node-1 cloud.google.com/gke-spot=true:PreferNoSchedule
kubectl taint nodes dedicated-1 dedicated=ml-team:NoExecute

# Taint effects:
# NoSchedule       — new pods won't schedule (existing stay)
# PreferNoSchedule — try to avoid, but allow if necessary
# NoExecute        — evict existing pods that don't tolerate

# Remove a taint (note the trailing minus):
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule-

# Pod with tolerations:
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
  # Tolerate GPU nodes:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  # Tolerate dedicated nodes (for any value):
  - key: "dedicated"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600   # Stay for 1h after taint applied
  containers:
  - name: training
    image: ml-training:v1.0
    resources:
      limits:
        nvidia.com/gpu: 1
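
To make the matching rules concrete, here is a simplified Python sketch of how a toleration is compared against a taint. It is an illustration only, not the scheduler's code; `tolerates` and `schedulable` are hypothetical names, and the dictionaries mirror the fields used in the manifests above.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # A toleration with no effect matches taints of every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(tolerations, taints):
    """A pod can land on a node if every hard taint is tolerated."""
    # PreferNoSchedule is a soft preference, so it never blocks scheduling.
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


ml_training = [
    {"key": "nvidia.com/gpu", "operator": "Equal", "value": "present", "effect": "NoSchedule"},
    {"key": "dedicated", "operator": "Exists", "effect": "NoExecute"},
]
gpu_node = [{"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}]
dedicated_node = [{"key": "dedicated", "value": "ml-team", "effect": "NoExecute"}]

print(schedulable(ml_training, gpu_node))        # True
print(schedulable(ml_training, dedicated_node))  # True
print(schedulable([], gpu_node))                 # False
```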

Node Affinity

# Node affinity: attract pods TO specific nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive
spec:
  replicas: 3
  selector:
    matchLabels:
      app: latency-sensitive
  template:
    metadata:
      labels:
        app: latency-sensitive
    spec:
      affinity:
        # Hard requirement: must schedule on SSD nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disk-type
                operator: In
                values: ["ssd", "nvme"]
          # Soft preference: prefer zone-a
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]
        # Pod anti-affinity: spread across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["latency-sensitive"]
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: my-app:v1.0

Resource Management

Requests vs Limits

                            
The Two Resources:
Requests: the amount reserved for the container. The scheduler uses requests to place pods, so a 256Mi request means "only place me on a node that still has 256Mi unreserved."
Limits: the enforced maximum. A container that exceeds its memory limit is killed by the kernel OOM killer (OOMKilled); a container that reaches its CPU limit is throttled, not killed. A 512Mi limit means "kill me if I try to use more."

# Resource specification:
containers:
- name: app
  image: my-app:v1.0
  resources:
    requests:
      memory: "256Mi"    # Scheduling guarantee
      cpu: "250m"        # 0.25 CPU cores guaranteed
    limits:
      memory: "512Mi"    # OOMKilled above this
      cpu: "1000m"       # Throttled above 1 core

# CPU units:
# 1000m = 1 core = 1 vCPU (AWS) = 1 hyperthread
# 100m = 0.1 core (one-tenth of a CPU)
# 500m = half a core

# Memory units:
# Ki, Mi, Gi (binary: 1024-based)
# k, M, G (decimal: 1000-based; kilo is a lowercase k)
# 256Mi = 268,435,456 bytes
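
The unit rules above can be checked with a small converter. This is a minimal Python sketch with hypothetical helper names; it handles only the common suffixes listed here and is not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_cpu_millicores(quantity):
    """Convert a CPU quantity such as '250m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def parse_memory_bytes(quantity):
    """Convert a memory quantity such as '256Mi' or '256M' to bytes."""
    # Two-letter binary suffixes are checked before one-letter decimal ones.
    for suffix, factor in BINARY.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-2]) * factor)
    for suffix, factor in DECIMAL.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-1]) * factor)
    return int(Decimal(quantity))


print(parse_cpu_millicores("0.5"))  # 500
print(parse_memory_bytes("256Mi"))  # 268435456
print(parse_memory_bytes("256M"))   # 256000000
```

The last two lines show why the suffix matters: 256Mi is about 4.9% larger than 256M.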

QoS Classes

Kubernetes assigns a Quality of Service class to each pod based on its resource configuration. This determines eviction priority when a node is under memory pressure:

QoS Class	Condition	Eviction Priority	Use Case
Guaranteed	Every container sets CPU and memory requests and limits, with requests == limits	Last to be evicted	Critical production workloads
Burstable	At least one container sets a request or limit, but the pod does not meet the Guaranteed criteria	Evicted when usage exceeds requests	Most workloads
BestEffort	No container sets any requests or limits	First to be evicted	Batch jobs, non-critical

# Check QoS class of pods:
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'
# payment-abc12     Guaranteed
# web-def34         Burstable
# batch-ghi56       BestEffort

# When node memory pressure occurs (kubelet eviction):
# 1. BestEffort pods evicted first
# 2. Burstable pods using > request evicted next
# 3. Guaranteed pods evicted last (only if node truly exhausted)
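
The classification rules can be sketched as a short function. This is an illustrative approximation with a hypothetical name (`qos_class`); it compares quantities as written, so it assumes requests and limits use the same notation (for example "500m" on both sides rather than "500m" and "0.5").

```python
def qos_class(containers):
    """Classify a pod from the resources blocks of its containers."""
    has_any = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            has_any = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # When only a limit is set, the request defaults to the limit.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


fixed = {"resources": {"requests": {"cpu": "500m", "memory": "512Mi"},
                       "limits": {"cpu": "500m", "memory": "512Mi"}}}
bursty = {"resources": {"requests": {"cpu": "250m", "memory": "256Mi"},
                        "limits": {"cpu": "1000m", "memory": "512Mi"}}}

print(qos_class([fixed]))          # Guaranteed
print(qos_class([fixed, bursty]))  # Burstable
print(qos_class([{}]))             # BestEffort
```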

ResourceQuotas

ResourceQuotas limit the total resources a namespace can consume — preventing one team from monopolising cluster capacity:

# ResourceQuota: limit total resources in a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: backend
spec:
  hard:
    # Compute limits:
    requests.cpu: "20"           # Total CPU requests
    requests.memory: "40Gi"      # Total memory requests
    limits.cpu: "40"             # Total CPU limits
    limits.memory: "80Gi"        # Total memory limits
    # Object count limits:
    pods: "50"                   # Max pods
    services: "20"              # Max Services
    persistentvolumeclaims: "30" # Max PVCs
    secrets: "100"              # Max Secrets
    configmaps: "100"           # Max ConfigMaps
    # Storage limits:
    requests.storage: "500Gi"    # Total PVC storage
    fast-ssd.storageclass.storage.k8s.io/requests.storage: "200Gi"

# View quota usage:
kubectl describe resourcequota team-backend-quota -n backend
# Name:                   team-backend-quota
# Resource                Used    Hard
# --------                ----    ----
# requests.cpu            12      20
# requests.memory         28Gi    40Gi
# limits.cpu              24      40
# limits.memory           56Gi    80Gi
# pods                    32      50
# persistentvolumeclaims  12      30

LimitRanges

# LimitRange: set defaults and constraints PER POD/CONTAINER
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: backend
spec:
  limits:
  - type: Container
    default:           # Applied if no limits specified
      memory: "256Mi"
      cpu: "500m"
    defaultRequest:    # Applied if no requests specified
      memory: "128Mi"
      cpu: "100m"
    min:               # Minimum allowed
      memory: "64Mi"
      cpu: "50m"
    max:               # Maximum allowed
      memory: "4Gi"
      cpu: "4"
  - type: Pod
    max:
      memory: "8Gi"
      cpu: "8"
  - type: PersistentVolumeClaim
    min:
      storage: "1Gi"
    max:
      storage: "100Gi"
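
The Container entry above does two things: it fills in missing values and it rejects out-of-range ones. The following Python sketch models that behaviour in simplified form. `LIMIT_RANGE` and `apply_limit_range` are hypothetical names, values are expressed in millicores and MiB, and the Pod and PersistentVolumeClaim entries are not modelled.

```python
# Values from the Container entry above, as millicores (cpu) and MiB (memory).
LIMIT_RANGE = {
    "default":        {"cpu": 500,  "memory": 256},
    "defaultRequest": {"cpu": 100,  "memory": 128},
    "min":            {"cpu": 50,   "memory": 64},
    "max":            {"cpu": 4000, "memory": 4096},
}


def apply_limit_range(resources, limit_range=LIMIT_RANGE):
    """Fill in defaults, then validate against min and max."""
    explicit_requests = resources.get("requests", {})
    explicit_limits = resources.get("limits", {})
    result = {"requests": {}, "limits": {}}
    for name in ("cpu", "memory"):
        limit = explicit_limits.get(name, limit_range["default"][name])
        if name in explicit_requests:
            request = explicit_requests[name]
        elif name in explicit_limits:
            request = explicit_limits[name]  # request falls back to the explicit limit
        else:
            request = limit_range["defaultRequest"][name]
        if request < limit_range["min"][name]:
            raise ValueError(f"{name} request {request} is below the minimum")
        if limit > limit_range["max"][name]:
            raise ValueError(f"{name} limit {limit} is above the maximum")
        if request > limit:
            raise ValueError(f"{name} request {request} exceeds limit {limit}")
        result["requests"][name] = request
        result["limits"][name] = limit
    return result


print(apply_limit_range({}))
# {'requests': {'cpu': 100, 'memory': 128}, 'limits': {'cpu': 500, 'memory': 256}}
```

Note the third check: a container that sets only a memory request of 512 would receive the default 256 limit and be rejected, which is a common surprise with LimitRange defaults.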

Disruption Management

PodDisruptionBudgets

A PodDisruptionBudget (PDB) protects workloads during voluntary disruptions (node drain, cluster upgrade, autoscaler scale-down) by limiting how many pods can be unavailable simultaneously:

# PDB: ensure minimum availability during disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-pdb
spec:
  # Option A: minimum available
  minAvailable: 2             # At least 2 pods must remain running
  # Option B: maximum unavailable
  # maxUnavailable: 1         # At most 1 pod can be down at a time
  # Option C: percentage
  # minAvailable: "75%"       # 75% of desired must remain
  selector:
    matchLabels:
      app: payment

# PDB protects against:
# ✓ kubectl drain (drain waits until PDB allows eviction)
# ✓ Cluster autoscaler (won't scale down if it would violate PDB)
# ✓ Node upgrades (respects PDB during rolling upgrade)
# ✓ Voluntary eviction API calls

# PDB does NOT protect against:
# ✗ Node crashes (involuntary disruptions)
# ✗ Pod OOMKilled (resource limits)
# ✗ Application crashes

# Check PDB status:
kubectl get pdb
# NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# payment-pdb   2               N/A               1                     5d
# redis-pdb     N/A             1                 1                     5d

# "ALLOWED DISRUPTIONS: 1" means one more pod can be evicted right now
# If it shows 0, drain will block until a pod becomes available

# Common PDB patterns:
# Stateless (3+ replicas):  maxUnavailable: 1
# Database (3 replicas):    minAvailable: 2
# Singleton (1 replica):    maxUnavailable: 0  ← blocks all drains!
# Batch processing:         no PDB needed
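
The ALLOWED DISRUPTIONS column follows from simple arithmetic: healthy pods minus the number that must stay healthy. The sketch below reproduces that calculation; `allowed_disruptions` and `scaled` are hypothetical names, and percentages are rounded up as described in the Kubernetes documentation.

```python
def scaled(value, total):
    """Resolve an integer or a percentage string such as '75%' against total."""
    if isinstance(value, str) and value.endswith("%"):
        return (int(value[:-1]) * total + 99) // 100  # integer ceiling
    return int(value)


def allowed_disruptions(expected, healthy, min_available=None, max_unavailable=None):
    """Number of pods that may be evicted right now without violating the PDB."""
    if min_available is not None:
        desired_healthy = scaled(min_available, expected)
    else:
        desired_healthy = expected - scaled(max_unavailable, expected)
    return max(0, healthy - desired_healthy)


print(allowed_disruptions(3, 3, min_available=2))      # 1
print(allowed_disruptions(3, 2, min_available=2))      # 0
print(allowed_disruptions(1, 1, max_unavailable=0))    # 0
print(allowed_disruptions(4, 4, min_available="75%"))  # 1
```

The second call shows why a drain can block: once one replica is down, a minAvailable of 2 out of 3 leaves no room for another eviction.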

                            
PDB Gotcha: A PDB with maxUnavailable: 0, or with minAvailable equal to the replica count, blocks ALL voluntary disruptions, including node drains and cluster upgrades. Drains hang until they time out, so the cluster cannot be upgraded cleanly. Always allow at least 1 disruption for production workloads.
                        

Priority & Preemption

# PriorityClass: define scheduling priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000          # Higher number = higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical production services — will preempt lower priority pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-processing
value: 100
globalDefault: false
preemptionPolicy: Never   # Never preempt other pods
description: "Batch jobs — evicted first when resources scarce"
---
# Using priority in a Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      priorityClassName: critical-production
      containers:
      - name: payment
        image: payment:v2.0

# What happens when cluster is full:
# 1. High-priority pod can't schedule (no resources)
# 2. Scheduler identifies lower-priority pods to preempt
# 3. Lower-priority pods are terminated (graceful shutdown)
# 4. High-priority pod schedules on freed resources
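
The four steps above can be approximated in a few lines. This is a deliberately simplified sketch for a single node and a single resource (CPU in millicores); the real scheduler also weighs PDBs, affinity, and several candidate nodes. `pick_victims` and the pod dictionaries are hypothetical.

```python
def pick_victims(pending, running, free_cpu):
    """Return names of pods to preempt so the pending pod fits, or None if impossible."""
    if pending.get("preemptionPolicy", "PreemptLowerPriority") == "Never":
        return None
    needed = pending["cpu"] - free_cpu
    victims = []
    # Consider the lowest-priority pods first.
    for pod in sorted(running, key=lambda p: p["priority"]):
        if needed <= 0:
            break
        if pod["priority"] >= pending["priority"]:
            break  # only strictly lower-priority pods may be preempted
        victims.append(pod["name"])
        needed -= pod["cpu"]
    return victims if needed <= 0 else None


running = [
    {"name": "web", "priority": 1000000, "cpu": 1000},
    {"name": "batch-a", "priority": 100, "cpu": 500},
    {"name": "batch-b", "priority": 100, "cpu": 500},
]
payment = {"name": "payment", "priority": 1000000, "cpu": 1000}

print(pick_victims(payment, running, free_cpu=200))  # ['batch-a', 'batch-b']
```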

Autoscaling

Kubernetes Autoscaling Layers

flowchart TD
    subgraph Application Layer
        HPA["HPA<br/>Scale pods horizontally"]
        VPA["VPA<br/>Resize pod resources"]
    end
    subgraph Infrastructure Layer
        CA["Cluster Autoscaler<br/>Add/remove nodes"]
    end
    subgraph Event Layer
        KEDA["KEDA<br/>Scale from 0 on events"]
    end
    
    HPA -->|Needs more nodes| CA
    VPA -->|Needs bigger nodes| CA
    KEDA -->|Needs nodes for scale-up| CA
    CA -->|Provisions capacity| HPA
    CA -->|Provisions capacity| VPA

Horizontal Pod Autoscaler (HPA)

# HPA: scale pods based on metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up
      policies:
      - type: Percent
        value: 100                       # Double pods per scale-up
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5min before scaling down
      policies:
      - type: Pods
        value: 2                         # Remove max 2 pods per period
        periodSeconds: 60
  metrics:
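  # CPU-based scaling (utilization is a percentage of the CPU request,
  # so the target's containers must set resources.requests.cpu):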
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70          # Target 70% CPU usage
  # Memory-based scaling:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metrics (from Prometheus):
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"            # Target 1000 RPS per pod

# HPA status:
kubectl get hpa
# NAME          REFERENCE                    TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
# payment-hpa   Deployment/payment-service   68%/70%, 45%/80%   3         20        5          2d

# Detailed status:
kubectl describe hpa payment-hpa
# Metrics:
#   "cpu" resource utilization (percentage of request):  68% (250m) / 70%
#   "memory" resource utilization (percentage):          45% / 80%
# Min replicas: 3, Max replicas: 20, Current: 5
# Conditions:
#   AbleToScale: True (ready for new scale)
#   ScalingActive: True (HPA can scale)
#   ScalingLimited: False (not at min/max)
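
The replica count HPA aims for comes from the documented formula desired = ceil(current × currentMetric / targetMetric), evaluated per metric with the largest result winning. The sketch below applies it to the status shown above. `desired_replicas` is a hypothetical name, the 10% tolerance is the documented default, and the scaleUp/scaleDown behavior policies are not modelled.

```python
import math


def desired_replicas(current, metrics, min_replicas, max_replicas, tolerance=0.1):
    """metrics is a list of (current_value, target_value) pairs."""
    proposals = []
    for value, target in metrics:
        ratio = value / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current)  # close enough to target: no change
        else:
            proposals.append(math.ceil(current * ratio))
    # The metric asking for the most replicas wins, then min/max are applied.
    return max(min_replicas, min(max_replicas, max(proposals)))


# 5 replicas, CPU at 68% of a 70% target, memory at 45% of an 80% target.
print(desired_replicas(5, [(68, 70), (45, 80)], 3, 20))   # 5
# CPU doubles to 140%: the Deployment should double as well.
print(desired_replicas(5, [(140, 70), (45, 80)], 3, 20))  # 10
```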

Vertical Pod Autoscaler (VPA)

# VPA: automatically adjust resource requests/limits
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Auto"    # Auto, Recreate, Initial, Off
    # Auto: evict pods to apply new resources
    # Initial: only apply to new pods
    # Off: only recommend (don't change anything)
  resourcePolicy:
    containerPolicies:
    - containerName: payment
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      controlledResources: ["cpu", "memory"]

# VPA recommendations:
kubectl describe vpa payment-vpa
# Recommendation:
#   Container Recommendations:
#     Container Name: payment
#     Lower Bound:    Cpu: 150m, Memory: 200Mi
#     Target:         Cpu: 350m, Memory: 450Mi   ← use these
#     Uncapped Target: Cpu: 350m, Memory: 450Mi
#     Upper Bound:    Cpu: 800m, Memory: 1Gi

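# WARNING: do not let VPA and HPA act on the same resource metric (CPU or memory).
# VPA changes requests, and HPA computes utilization as a percentage of requests,
# so each controller keeps invalidating the other's decision.
# Safer combination: VPA restricted to memory (controlledResources: ["memory"])
# with HPA on CPU or custom metrics. As written, payment-vpa (cpu + memory) and
# payment-hpa (cpu + memory) would conflict.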
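
A little arithmetic shows why the two autoscalers interfere. The sketch below uses hypothetical helpers: `cap` clamps a recommendation to the minAllowed/maxAllowed bounds, and `cpu_utilization` computes the percentage HPA would see. The numbers are illustrative, reusing the 250m usage from the HPA status earlier.

```python
def cap(value, lower, upper):
    """Clamp a recommendation to the minAllowed/maxAllowed bounds."""
    return max(lower, min(upper, value))


def cpu_utilization(usage_m, request_m):
    """Utilization as HPA sees it: usage as a percentage of the request."""
    return round(100 * usage_m / request_m)


# The uncapped 350m target is inside the 100m..4000m bounds, so it is unchanged.
print(cap(350, 100, 4000))        # 350

# Same 250m of real usage, two different requests:
print(cpu_utilization(250, 350))  # 71
print(cpu_utilization(250, 500))  # 50
```

With a 70% HPA target, the first request leaves the replica count alone, while the second makes HPA scale in even though actual load has not changed.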

Cluster Autoscaler

# Cluster Autoscaler: add/remove nodes based on demand

# Scale UP triggers:
# - Pod is Pending (unschedulable) due to insufficient resources
# - Cluster Autoscaler evaluates which node group can fit the pod
# - Provisions a new node (takes 1-5 minutes depending on cloud)

# Scale DOWN triggers:
# - Node utilization below threshold (default 50%) for 10 minutes
# - All pods on the node can be moved elsewhere
# - No PDB violations from moving pods
# - No pods with local storage (unless configured)
# - No system-critical pods that can't be moved

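# Configuration (AWS EKS example):
# The node group's scaling bounds are the range Cluster Autoscaler works within;
# the autoscaler itself is deployed separately into the cluster.
# <subnet-ids> and <node-role-arn> are placeholders for required values.
aws eks create-nodegroup --cluster-name production \
  --nodegroup-name workers \
  --subnets <subnet-ids> \
  --node-role <node-role-arn> \
  --scaling-config minSize=3,maxSize=20,desiredSize=5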

# Cluster Autoscaler respects:
# ✓ PodDisruptionBudgets
# ✓ Pod anti-affinity rules
# ✓ Node taints and tolerations
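# ✓ Pods with local storage (the node is kept unless the pod is annotated
#   cluster-autoscaler.kubernetes.io/safe-to-evict: "true")
# ✓ Pods without a controller (nodes running standalone pods are not removed)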
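
The utilization threshold in the scale-down conditions above is based on requests, not on live usage. The sketch below shows that calculation under stated assumptions: utilization is the larger of the CPU and memory request ratios against the node's allocatable capacity, and the 0.5 threshold is the default quoted above. `node_utilization` and `below_threshold` are hypothetical names, and the other scale-down conditions (PDBs, local storage, unmovable pods) are not modelled.

```python
def node_utilization(pod_requests, allocatable):
    """Fraction of the node reserved by requests (CPU in millicores, memory in MiB)."""
    cpu = sum(p["cpu"] for p in pod_requests) / allocatable["cpu"]
    memory = sum(p["memory"] for p in pod_requests) / allocatable["memory"]
    return max(cpu, memory)


def below_threshold(pod_requests, allocatable, threshold=0.5):
    """True if the node is a scale-down candidate on utilization alone."""
    return node_utilization(pod_requests, allocatable) < threshold


node = {"cpu": 4000, "memory": 16384}
quiet = [{"cpu": 500, "memory": 2048}, {"cpu": 1000, "memory": 4096}]
busy = quiet + [{"cpu": 1000, "memory": 1024}]

print(node_utilization(quiet, node))  # 0.375
print(below_threshold(quiet, node))   # True
print(below_threshold(busy, node))    # False
```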

KEDA (Kubernetes Event-Driven Autoscaling)

# KEDA: scale from/to zero based on event sources
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor      # Deployment name
  minReplicaCount: 0           # Scale to zero when idle!
  maxReplicaCount: 50
  pollingInterval: 15          # Check every 15 seconds
  cooldownPeriod: 300          # Wait 5min before scaling to 0
  triggers:
  # Scale based on queue length:
  - type: rabbitmq
    metadata:
      queueName: orders
      host: amqp://rabbitmq.default.svc:5672
      mode: QueueLength        # Scale on queued messages (replaces the deprecated queueLength field)
      value: "10"              # Target 10 messages per pod
  # Scale based on Kafka lag:
  - type: kafka
    metadata:
      bootstrapServers: kafka.default.svc:9092
      consumerGroup: order-consumers
      topic: orders
      lagThreshold: "100"
  # Scale based on Prometheus metric:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: pending_orders
      query: sum(pending_orders_total)
      threshold: "50"
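
For the queue trigger, the target of 10 messages per pod translates into a replica count by division, bounded by minReplicaCount and maxReplicaCount. The sketch below illustrates that arithmetic only; `keda_replicas` is a hypothetical name, and polling, cooldown, and the combination of several triggers are not modelled.

```python
import math


def keda_replicas(queue_length, per_replica, min_replicas=0, max_replicas=50):
    """Replica count for a queue backlog, given a per-pod message target."""
    if queue_length == 0:
        return min_replicas  # idle: fall back to the minimum (zero here)
    wanted = math.ceil(queue_length / per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(keda_replicas(0, 10))     # 0
print(keda_replicas(25, 10))    # 3
print(keda_replicas(1000, 10))  # 50
```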

Autoscaling Decision Guide

When to Use Each Autoscaler

Scenario	Autoscaler	Why
Web API under varying load	HPA (CPU/RPS)	Scale pods with traffic
Memory-hungry app, hard to predict	VPA	Right-size without manual tuning
Queue processor (bursty)	KEDA	Scale to 0 when idle, burst on demand
Pods pending, no node capacity	Cluster Autoscaler	Add nodes for pending pods
Event-driven microservices	KEDA + CA	KEDA scales pods, CA adds nodes
Stable workload, cost optimisation	VPA (Off mode)	Get recommendations, apply manually


Exercises

                            
Exercise 1 (Resource Right-Sizing): Deploy an application without resource requests/limits. Generate load with a stress tool. Install VPA in "Off" mode and observe its recommendations. Apply the recommended values and verify the QoS class changes from BestEffort to Burstable or Guaranteed.
                        

                            
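Exercise 2 (PDB & Drain): Deploy a 3-replica Deployment spread across three nodes with a PDB (minAvailable: 2). Drain a node hosting one replica and observe that it succeeds. Now drain a second node before the replacement pod becomes Ready (for example, while it is still Pending for lack of capacity) and observe that the drain blocks because evicting another pod would violate the PDB. If the replacement is already Ready, the second drain succeeds too. Understand why this protects availability.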
                        

                            
Exercise 3 (HPA Under Load): Deploy a CPU-intensive application with HPA (target 50% CPU, min 2, max 10). Generate load with kubectl run load-gen --image=busybox -- /bin/sh -c "while true; do wget -q -O- http://app-svc; done". Watch the HPA scale up. Remove the load and watch it scale down after the stabilisation window.
                        

                            
Exercise 4 (Namespace Governance): Create a namespace with ResourceQuota (10 pods, 4 CPU, 8Gi RAM) and LimitRange (default 256Mi/100m per container). Try deploying workloads that exceed the quota. Observe which requests are rejected and why. Test what happens when you don't set resource requests (LimitRange defaults should apply).
                        

Conclusion

Cluster operations and reliability engineering is what separates a development cluster from production infrastructure. Key takeaways:

Upgrades: One minor version at a time, control plane first, workers second, always with PDB protection
Resource management: Set requests for scheduling, limits for protection, use quotas for multi-tenancy
Disruption protection: PDBs ensure minimum availability during voluntary disruptions
Autoscaling: HPA for horizontal pod scaling, VPA for right-sizing, Cluster Autoscaler for infrastructure, KEDA for event-driven
Node management: Taints, tolerations, and affinity control where workloads run

In Part 14, we'll tackle Kubernetes Security — RBAC, Pod Security Standards, network policies, secrets management, admission controllers, and supply chain security.
