Cluster Upgrades
Version Skew Policy
Kubernetes follows a strict version skew policy that dictates which component versions are compatible during rolling upgrades:
| Component | Allowed Skew | Example (API Server 1.30) |
|---|---|---|
| kube-apiserver | Reference (newest) | 1.30 |
| kubelet | apiserver −3 versions | 1.27–1.30 |
| kube-controller-manager | apiserver −1 version | 1.29–1.30 |
| kube-scheduler | apiserver −1 version | 1.29–1.30 |
| kube-proxy | apiserver −3 versions | 1.27–1.30 |
| kubectl | apiserver ±1 version | 1.29–1.31 |
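The table can be turned into a quick pre-upgrade check. The following is a minimal sketch that encodes only the rows above; the names `MAX_LAG`, `minor`, and `skew_ok` are illustrative, and it assumes every component shares the same major version.

```python
# Maximum number of minor versions a component may lag behind kube-apiserver,
# copied from the table above. kubectl is handled separately because it may
# also be one minor version newer than the API server.
MAX_LAG = {
    "kubelet": 3,
    "kube-proxy": 3,
    "kube-controller-manager": 1,
    "kube-scheduler": 1,
}


def minor(version: str) -> int:
    """Extract the minor number from a version such as '1.30' or 'v1.29.4'."""
    return int(version.lstrip("v").split(".")[1])


def skew_ok(component: str, component_version: str, apiserver_version: str) -> bool:
    """Return True if the component version is within the allowed skew."""
    lag = minor(apiserver_version) - minor(component_version)
    if component == "kubectl":
        return -1 <= lag <= 1
    # Every other component must not be newer than the API server.
    return 0 <= lag <= MAX_LAG[component]


print(skew_ok("kubelet", "1.27", "1.30"))         # True
print(skew_ok("kube-scheduler", "1.28", "1.30"))  # False
```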
Upgrade Procedure
# Kubernetes cluster upgrade procedure (kubeadm):
# Step 1: Upgrade control plane (one node at a time if HA)
# Check available versions:
apt-cache policy kubeadm | head -20
# Upgrade kubeadm:
apt-get update
apt-get install -y kubeadm=1.30.0-1.1
# Verify upgrade plan:
kubeadm upgrade plan
# [upgrade/config] Making sure the configuration is correct:
# ...
# Components that must be upgraded manually after upgrade:
# COMPONENT CURRENT TARGET
# kubelet v1.29.4 v1.30.0
#
# Upgrade to the latest stable version:
# COMPONENT CURRENT TARGET
# kube-apiserver v1.29.4 v1.30.0
# kube-controller-manager v1.29.4 v1.30.0
# kube-scheduler v1.29.4 v1.30.0
# etcd 3.5.12 3.5.15
# Apply the upgrade:
kubeadm upgrade apply v1.30.0
# Upgrade kubelet and kubectl on control plane node:
apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
systemctl daemon-reload
systemctl restart kubelet
# Step 2: Upgrade worker nodes (one at a time)
# From the control plane, drain the worker:
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# On the worker node:
apt-get update
apt-get install -y kubeadm=1.30.0-1.1
kubeadm upgrade node
apt-get install -y kubelet=1.30.0-1.1
systemctl daemon-reload
systemctl restart kubelet
# Uncordon the worker:
kubectl uncordon worker-1
# Verify:
kubectl get nodes
# NAME STATUS VERSION
# master-1 Ready v1.30.0
# worker-1 Ready v1.30.0
# worker-2 Ready v1.29.4 ← still needs upgrade
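Because kubeadm upgrades move one minor version at a time, a cluster that is several releases behind has to repeat the whole procedure once per minor version. A small sketch of that planning step, assuming plain `major.minor` arithmetic (the helper name `upgrade_path` is hypothetical):

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """List the minor versions to step through, in order, to reach the target."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are sketched here")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.27", "1.30"))  # ['1.28', '1.29', '1.30']
```

Each entry in the returned list corresponds to one full pass of the control-plane and worker steps above.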
Managed K8s Upgrades
# Managed Kubernetes: control plane upgrades are provider-managed
# AWS EKS:
aws eks update-cluster-version --name production --kubernetes-version 1.30
# EKS upgrades control plane first, then you upgrade node groups:
aws eks update-nodegroup-version --cluster-name production \
--nodegroup-name workers --kubernetes-version 1.30
# Azure AKS:
az aks upgrade --resource-group myRG --name production --kubernetes-version 1.30
# AKS can surge-upgrade (create extra nodes, drain old ones)
# GCP GKE:
gcloud container clusters upgrade production --master --cluster-version 1.30
gcloud container clusters upgrade production --node-pool default-pool
# GKE auto-upgrades: maintenance windows + surge settings
# Control plane: automatic during maintenance window
# Node pools: configurable surge (maxSurge, maxUnavailable)
Node Management
Drain & Cordon
# Cordon: mark a node as unschedulable (existing pods stay)
kubectl cordon worker-3
# node/worker-3 cordoned
kubectl get nodes
# NAME STATUS ROLES
# worker-3 Ready,SchedulingDisabled <none>
# Drain: evict all pods from a node (respects PDBs)
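# --ignore-daemonsets: skip DaemonSet-managed pods (their controller would recreate them)
# --delete-emptydir-data: evict pods that use emptyDir volumes (their data is deleted)
# --grace-period=60: give each pod 60s to terminate
# --timeout=300s: fail if the drain takes longer than 5 minutes
kubectl drain worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s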
# What happens during drain:
# 1. Node marked SchedulingDisabled (cordoned)
# 2. Each pod is evicted (respecting PDBs)
# 3. ReplicaSets/Deployments create replacement pods on other nodes
# 4. Eviction waits for pod's terminationGracePeriodSeconds
# 5. If PDB blocks eviction, drain waits (until --timeout)
# Uncordon: allow scheduling again
kubectl uncordon worker-3
Taints & Tolerations
Taints repel pods from nodes. Tolerations allow specific pods to schedule on tainted nodes. Together they control which workloads run where:
# Add taints to nodes:
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl taint nodes spot-node-1 cloud.google.com/gke-spot=true:PreferNoSchedule
kubectl taint nodes dedicated-1 dedicated=ml-team:NoExecute
# Taint effects:
# NoSchedule — new pods won't schedule (existing stay)
# PreferNoSchedule — try to avoid, but allow if necessary
# NoExecute — evict existing pods that don't tolerate
# Remove a taint (note the trailing minus):
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule-
# Pod with tolerations:
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  tolerations:
    # Tolerate GPU nodes:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
    # Tolerate dedicated nodes (for any value):
    - key: "dedicated"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 3600   # Stay for 1h after taint applied
  containers:
    - name: training
      image: ml-training:v1.0
      resources:
        limits:
          nvidia.com/gpu: 1
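The matching rule between one toleration and one taint can be sketched in a few lines. This is a simplified model rather than the scheduler's actual code; the function name `tolerates` and the dict layout are assumptions that mirror the YAML fields above.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # A toleration without an effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists ignores the value; an empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: both key and value must match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Equal",
                  "value": "present", "effect": "NoSchedule"}
print(tolerates(gpu_toleration, gpu_taint))  # True
```

A pod is kept off a node when at least one of the node's `NoSchedule` or `NoExecute` taints is matched by none of the pod's tolerations.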
Node Affinity
# Node affinity: attract pods TO specific nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive
spec:
  replicas: 3
  selector:
    matchLabels:
      app: latency-sensitive
  template:
    metadata:
      labels:
        app: latency-sensitive
    spec:
      affinity:
        # Hard requirement: must schedule on SSD nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: disk-type
                    operator: In
                    values: ["ssd", "nvme"]
          # Soft preference: prefer zone-a
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["us-east-1a"]
        # Pod anti-affinity: spread across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["latency-sensitive"]
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: my-app:v1.0
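The hard requirement above is evaluated per node against its labels: expressions inside one term are ANDed, and the terms themselves are ORed. A minimal sketch of that evaluation follows, covering only the `In`, `NotIn`, `Exists`, and `DoesNotExist` operators (the numeric `Gt`/`Lt` operators are left out, and the function names are illustrative).

```python
def matches_expression(labels: dict[str, str], expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        # A node without the label also satisfies NotIn.
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {op}")


def node_eligible(labels: dict[str, str], node_selector_terms: list[dict]) -> bool:
    """Terms are ORed; the expressions within a term are ANDed."""
    return any(
        all(matches_expression(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )


terms = [{"matchExpressions": [
    {"key": "disk-type", "operator": "In", "values": ["ssd", "nvme"]},
]}]
print(node_eligible({"disk-type": "ssd"}, terms))  # True
print(node_eligible({"disk-type": "hdd"}, terms))  # False
```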
Resource Management
Requests vs Limits
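- Requests: the guaranteed minimum. The scheduler uses this value to place pods, and only considers nodes with that much unreserved capacity. A 256Mi request means "guarantee me 256Mi is available."
- Limits: the enforced maximum. A container that exceeds its memory limit is killed by the kernel OOM killer (reported as OOMKilled), while CPU usage above the limit is throttled rather than killed. A 512Mi limit means "kill me if I try to use more."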
# Resource specification:
containers:
  - name: app
    image: my-app:v1.0
    resources:
      requests:
        memory: "256Mi"   # Scheduling guarantee
        cpu: "250m"       # 0.25 CPU cores guaranteed
      limits:
        memory: "512Mi"   # OOMKilled above this
        cpu: "1000m"      # Throttled above 1 core
# CPU units:
# 1000m = 1 core = 1 vCPU (AWS) = 1 hyperthread
# 100m = 0.1 core (one-tenth of a CPU)
# 500m = half a core
# Memory units:
# Ki, Mi, Gi (binary: 1024-based)
# k, M, G (decimal: 1000-based; note the lowercase k)
# 256Mi = 268,435,456 bytes
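The unit arithmetic is easy to verify by hand. The sketch below converts a handful of quantity strings to bytes and millicores; it covers only the suffixes listed above, not the full Kubernetes quantity grammar, and the helper names are illustrative.

```python
from decimal import Decimal

# Multipliers for the memory suffixes listed above.
MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,   # binary
    "k": 1000, "M": 1000**2, "G": 1000**3,      # decimal
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


print(parse_memory("256Mi"))     # 268435456
print(parse_cpu_millis("250m"))  # 250
print(parse_cpu_millis("0.5"))   # 500
```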
QoS Classes
Kubernetes assigns a Quality of Service class to each pod based on its resource configuration. This determines eviction priority when a node is under memory pressure:
| QoS Class | Condition | Eviction Priority | Use Case |
|---|---|---|---|
| Guaranteed | Every container sets CPU and memory requests equal to its limits | Last to be evicted | Critical production workloads |
| Burstable | At least one container sets a request or limit, but the pod does not meet the Guaranteed criteria | Evicted after BestEffort, starting with pods exceeding their requests | Most workloads |
| BestEffort | No container sets any requests or limits | First to be evicted | Batch jobs, non-critical |
# Check QoS class of pods:
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'
# payment-abc12 Guaranteed
# web-def34 Burstable
# batch-ghi56 BestEffort
# When node memory pressure occurs (kubelet eviction):
# 1. BestEffort pods evicted first
# 2. Burstable pods using > request evicted next
# 3. Guaranteed pods evicted last (only if node truly exhausted)
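The classification rules can be written out as a short function. This is a sketch of the rules in the table, not the kubelet's implementation: it compares quantity strings literally (so `"500m"` and `"0.5"` would count as different), and the name `qos_class` is an assumption.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' 'requests' and 'limits' dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit, so only the limit is mandatory.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "250m", "memory": "256Mi"},
                  "limits": {"cpu": "1000m", "memory": "512Mi"}}]))  # Burstable
print(qos_class([{}]))                                               # BestEffort
```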
ResourceQuotas
ResourceQuotas limit the total resources a namespace can consume — preventing one team from monopolising cluster capacity:
# ResourceQuota: limit total resources in a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: backend
spec:
  hard:
    # Compute limits:
    requests.cpu: "20"               # Total CPU requests
    requests.memory: "40Gi"          # Total memory requests
    limits.cpu: "40"                 # Total CPU limits
    limits.memory: "80Gi"            # Total memory limits
    # Object count limits:
    pods: "50"                       # Max pods
    services: "20"                   # Max Services
    persistentvolumeclaims: "30"     # Max PVCs
    secrets: "100"                   # Max Secrets
    configmaps: "100"                # Max ConfigMaps
    # Storage limits:
    requests.storage: "500Gi"        # Total PVC storage
    fast-ssd.storageclass.storage.k8s.io/requests.storage: "200Gi"
# View quota usage:
kubectl describe resourcequota team-backend-quota -n backend
# Name: team-backend-quota
# Resource Used Hard
# -------- ---- ----
# requests.cpu 12 20
# requests.memory 28Gi 40Gi
# limits.cpu 24 40
# limits.memory 56Gi 80Gi
# pods 32 50
# persistentvolumeclaims 12 30
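Quota enforcement is an admission-time check: a new object is rejected if adding its usage to the current total would exceed any hard limit. A simplified sketch of that comparison, using plain numbers (CPU in cores, memory in Gi) instead of quantity strings; the function name `quota_violations` is hypothetical.

```python
def quota_violations(used: dict, hard: dict, new_pod: dict) -> list[str]:
    """Return the quota entries the new pod would exceed (empty list if it fits)."""
    return [
        name
        for name, amount in new_pod.items()
        if name in hard and used.get(name, 0) + amount > hard[name]
    ]


# Figures taken from the describe output above.
used = {"requests.cpu": 12, "requests.memory": 28, "pods": 32}
hard = {"requests.cpu": 20, "requests.memory": 40, "pods": 50}

print(quota_violations(used, hard, {"requests.cpu": 2, "requests.memory": 8, "pods": 1}))
# []
print(quota_violations(used, hard, {"requests.cpu": 2, "requests.memory": 16, "pods": 1}))
# ['requests.memory']
```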
LimitRanges
# LimitRange: set defaults and constraints PER POD/CONTAINER
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: backend
spec:
  limits:
    - type: Container
      default:              # Applied if no limits specified
        memory: "256Mi"
        cpu: "500m"
      defaultRequest:       # Applied if no requests specified
        memory: "128Mi"
        cpu: "100m"
      min:                  # Minimum allowed
        memory: "64Mi"
        cpu: "50m"
      max:                  # Maximum allowed
        memory: "4Gi"
        cpu: "4"
    - type: Pod
      max:
        memory: "8Gi"
        cpu: "8"
    - type: PersistentVolumeClaim
      min:
        storage: "1Gi"
      max:
        storage: "100Gi"
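The Container entry above does two things at admission: it fills in missing values, then rejects anything outside the min/max bounds. The sketch below models that for a single container, with CPU in millicores and memory in Mi. It assumes that a container which sets a limit but no request gets a request equal to its limit; the function name and dict layout are illustrative.

```python
# Values from the Container entry above (CPU in millicores, memory in Mi).
LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 256},
    "defaultRequest": {"cpu": 100, "memory": 128},
    "min": {"cpu": 50, "memory": 64},
    "max": {"cpu": 4000, "memory": 4096},
}


def apply_limit_range(requests: dict, limits: dict, lr: dict = LIMIT_RANGE):
    """Fill in defaults, then validate; raises ValueError on a violation."""
    final_limits = {**lr["default"], **limits}
    final_requests = {}
    for res in ("cpu", "memory"):
        if res in requests:
            final_requests[res] = requests[res]
        elif res in limits:
            # Assumption: an explicit limit without a request sets request = limit.
            final_requests[res] = limits[res]
        else:
            final_requests[res] = lr["defaultRequest"][res]
    for res in ("cpu", "memory"):
        if final_requests[res] < lr["min"][res]:
            raise ValueError(f"{res} request below minimum")
        if final_limits[res] > lr["max"][res]:
            raise ValueError(f"{res} limit above maximum")
    return final_requests, final_limits


print(apply_limit_range({}, {}))
# ({'cpu': 100, 'memory': 128}, {'cpu': 500, 'memory': 256})
```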
Disruption Management
PodDisruptionBudgets
A PodDisruptionBudget (PDB) protects workloads during voluntary disruptions (node drain, cluster upgrade, autoscaler scale-down) by limiting how many pods can be unavailable simultaneously:
# PDB: ensure minimum availability during disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-pdb
spec:
  # Option A: minimum available
  minAvailable: 2            # At least 2 pods must remain running
  # Option B: maximum unavailable
  # maxUnavailable: 1        # At most 1 pod can be down at a time
  # Option C: percentage
  # minAvailable: "75%"      # 75% of desired must remain
  selector:
    matchLabels:
      app: payment
# PDB protects against:
# ✓ kubectl drain (drain waits until PDB allows eviction)
# ✓ Cluster autoscaler (won't scale down if it would violate PDB)
# ✓ Node upgrades (respects PDB during rolling upgrade)
# ✓ Voluntary eviction API calls
# PDB does NOT protect against:
# ✗ Node crashes (involuntary disruptions)
# ✗ Pod OOMKilled (resource limits)
# ✗ Application crashes
# Check PDB status:
kubectl get pdb
# NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
# payment-pdb 2 N/A 1 5d
# redis-pdb N/A 1 1 5d
# "ALLOWED DISRUPTIONS: 1" means one more pod can be evicted right now
# If it shows 0, drain will block until a pod becomes available
# Common PDB patterns:
# Stateless (3+ replicas): maxUnavailable: 1
# Database (3 replicas): minAvailable: 2
# Singleton (1 replica): maxUnavailable: 0 ← blocks all drains!
# Batch processing: no PDB needed
maxUnavailable: 0 or minAvailable equal to replicas will block ALL voluntary disruptions — including node drains and cluster upgrades. Your cluster becomes un-upgradeable. Always allow at least 1 disruption for production workloads.
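The ALLOWED DISRUPTIONS column is simple arithmetic: healthy pods minus the number that must stay healthy, never below zero. A sketch of that calculation, assuming percentages are rounded up against the expected replica count (function names are illustrative):

```python
def scaled(value, total: int) -> int:
    """Resolve an integer or a percentage string such as '75%' against total."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (percent * total + 99) // 100  # integer ceiling
    return int(value)


def allowed_disruptions(expected: int, healthy: int,
                        min_available=None, max_unavailable=None) -> int:
    """Number of pods that may be evicted right now without violating the PDB."""
    if min_available is not None:
        desired_healthy = scaled(min_available, expected)
    else:
        desired_healthy = expected - scaled(max_unavailable, expected)
    return max(0, healthy - desired_healthy)


print(allowed_disruptions(3, 3, min_available=2))    # 1
print(allowed_disruptions(3, 2, min_available=2))    # 0, drain blocks
print(allowed_disruptions(1, 1, max_unavailable=0))  # 0, the singleton trap
```

The last call shows why a singleton with `maxUnavailable: 0` can never be drained: the result is zero no matter how healthy the pod is.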
Priority & Preemption
# PriorityClass: define scheduling priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000                  # Higher number = higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical production services; will preempt lower-priority pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-processing
value: 100
globalDefault: false
preemptionPolicy: Never         # Never preempt other pods
description: "Batch jobs; preempted first when resources are scarce"
---
# Using priority in a Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:                     # Required: must match the template labels
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      priorityClassName: critical-production
      containers:
        - name: payment
          image: payment:v2.0
# What happens when cluster is full:
# 1. High-priority pod can't schedule (no resources)
# 2. Scheduler identifies lower-priority pods to preempt
# 3. Lower-priority pods are terminated (graceful shutdown)
# 4. High-priority pod schedules on freed resources
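Step 2 can be illustrated with a deliberately simplified victim selection on one node: take pods with lower priority than the incoming pod, lowest first, until enough CPU is freed. The real scheduler weighs more factors (PDBs, multiple nodes, multiple resources), so treat this as a sketch; `pick_victims` and the tuple layout are assumptions.

```python
def pick_victims(pods, needed_cpu: int, incoming_priority: int):
    """pods: list of (name, priority, cpu_millis) running on one node.

    Returns the names to preempt, or None if preemption cannot free enough CPU.
    """
    candidates = sorted(
        (p for p in pods if p[1] < incoming_priority),
        key=lambda p: p[1],  # lowest priority first
    )
    victims, freed = [], 0
    for name, _priority, cpu in candidates:
        if freed >= needed_cpu:
            break
        victims.append(name)
        freed += cpu
    return victims if freed >= needed_cpu else None


node_pods = [
    ("batch-1", 100, 500),
    ("batch-2", 100, 500),
    ("payment-abc", 1000000, 500),
]
print(pick_victims(node_pods, needed_cpu=800, incoming_priority=1000000))
# ['batch-1', 'batch-2']
```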
Autoscaling
flowchart TD
    subgraph Application Layer
        HPA["HPA<br/>Scale pods horizontally"]
        VPA["VPA<br/>Resize pod resources"]
    end
    subgraph Infrastructure Layer
        CA["Cluster Autoscaler<br/>Add/remove nodes"]
    end
    subgraph Event Layer
        KEDA["KEDA<br/>Scale from 0 on events"]
    end
    HPA -->|Needs more nodes| CA
    VPA -->|Needs bigger nodes| CA
    KEDA -->|Needs nodes for scale-up| CA
    CA -->|Provisions capacity| HPA
    CA -->|Provisions capacity| VPA
Horizontal Pod Autoscaler (HPA)
# HPA: scale pods based on metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up
      policies:
        - type: Percent
          value: 100                    # Double pods per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5min before scaling down
      policies:
        - type: Pods
          value: 2                      # Remove max 2 pods per period
          periodSeconds: 60
  metrics:
    # CPU-based scaling:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70        # Target 70% CPU usage
    # Memory-based scaling:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metrics (from Prometheus):
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"          # Target 1000 RPS per pod
# HPA status:
kubectl get hpa
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# payment-hpa Deployment/payment-service 68%/70%,45%/80% 3 20 5 2d
# Detailed status:
kubectl describe hpa payment-hpa
# Metrics:
# "cpu" resource utilization (percentage of request): 68% (250m) / 70%
# "memory" resource utilization (percentage): 45% / 80%
# Min replicas: 3, Max replicas: 20, Current: 5
# Conditions:
# AbleToScale: True (ready for new scale)
# ScalingActive: True (HPA can scale)
# ScalingLimited: False (not at min/max)
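The replica count the HPA aims for follows the documented formula `desired = ceil(current * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance of 1.0 (10% by default) and clamped to the min/max bounds. A sketch for a single metric, ignoring the `behavior` policies and stabilization windows (the function name is illustrative):

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Apply the HPA scaling formula for one metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(5, 68, 70, 3, 20))   # 5  (68%/70% is within tolerance)
print(desired_replicas(5, 140, 70, 3, 20))  # 10 (CPU at twice the target)
print(desired_replicas(5, 20, 70, 3, 20))   # 3  (clamped to minReplicas)
```

When several metrics are configured, as in the manifest above, the HPA evaluates each one and uses the largest resulting replica count.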
Vertical Pod Autoscaler (VPA)
# VPA: automatically adjust resource requests/limits
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Auto"    # Auto, Recreate, Initial, Off
    # Auto: evict pods to apply new resources
    # Initial: only apply to new pods
    # Off: only recommend (don't change anything)
  resourcePolicy:
    containerPolicies:
      - containerName: payment
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        controlledResources: ["cpu", "memory"]
# VPA recommendations:
kubectl describe vpa payment-vpa
# Recommendation:
# Container Recommendations:
# Container Name: payment
# Lower Bound: Cpu: 150m, Memory: 200Mi
# Target: Cpu: 350m, Memory: 450Mi ← use these
# Uncapped Target: Cpu: 350m, Memory: 450Mi
# Upper Bound: Cpu: 800m, Memory: 1Gi
# WARNING: VPA and HPA should NOT target the same metric
# VPA changes requests → HPA uses requests for CPU% calculation
# Safe combination: VPA for memory + HPA for CPU or custom metrics
Cluster Autoscaler
# Cluster Autoscaler: add/remove nodes based on demand
# Scale UP triggers:
# - Pod is Pending (unschedulable) due to insufficient resources
# - Cluster Autoscaler evaluates which node group can fit the pod
# - Provisions a new node (takes 1-5 minutes depending on cloud)
# Scale DOWN triggers:
# - Node utilization below threshold (default 50%) for 10 minutes
# - All pods on the node can be moved elsewhere
# - No PDB violations from moving pods
# - No pods with local storage (unless configured)
# - No system-critical pods that can't be moved
# Configuration (AWS EKS example):
# Node group with autoscaling bounds
# (--node-role and --subnets are also required; omitted here for brevity):
aws eks create-nodegroup --cluster-name production \
  --nodegroup-name workers \
  --scaling-config minSize=3,maxSize=20,desiredSize=5
# Cluster Autoscaler respects:
# ✓ PodDisruptionBudgets
# ✓ Pod anti-affinity rules
# ✓ Node taints and tolerations
# ✓ Pods with local storage annotations
# ✓ Pods with controller (won't remove standalone pods)
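The utilization test in the scale-down triggers can be sketched as follows. This assumes utilization is measured as the sum of pod requests divided by the node's allocatable capacity, taking the larger of the CPU and memory ratios; the remaining conditions (the 10-minute wait, PDBs, movable pods) are not modelled, and the function name is hypothetical.

```python
def below_scale_down_threshold(cpu_requested: int, cpu_allocatable: int,
                               mem_requested: int, mem_allocatable: int,
                               threshold: float = 0.5) -> bool:
    """Return True if the node's request-based utilization is under the threshold."""
    utilization = max(cpu_requested / cpu_allocatable,
                      mem_requested / mem_allocatable)
    return utilization < threshold


# CPU in millicores, memory in Mi.
print(below_scale_down_threshold(1000, 4000, 2048, 16384))  # True  (25% CPU, 12.5% memory)
print(below_scale_down_threshold(1000, 4000, 12288, 16384)) # False (75% memory)
```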
KEDA (Kubernetes Event-Driven Autoscaling)
# KEDA: scale from/to zero based on event sources
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor       # Deployment name
  minReplicaCount: 0            # Scale to zero when idle!
  maxReplicaCount: 50
  pollingInterval: 15           # Check every 15 seconds
  cooldownPeriod: 300           # Wait 5min before scaling to 0
  triggers:
    # Scale based on queue length:
    - type: rabbitmq
      metadata:
        queueName: orders
        host: amqp://rabbitmq.default.svc:5672
        queueLength: "10"       # 1 pod per 10 messages
    # Scale based on Kafka lag:
    - type: kafka
      metadata:
        bootstrapServers: kafka.default.svc:9092
        consumerGroup: order-consumers
        topic: orders
        lagThreshold: "100"
    # Scale based on Prometheus metric:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: pending_orders
        query: sum(pending_orders_total)
        threshold: "50"
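For the queue trigger above, "1 pod per 10 messages" translates to a ceiling division clamped to the replica bounds. A minimal sketch of that arithmetic, ignoring the polling interval and cooldown period (the function name is illustrative):

```python
import math


def replicas_for_queue(queue_depth: int, messages_per_replica: int = 10,
                       min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Replica count implied by a queue depth and a per-replica target."""
    desired = math.ceil(queue_depth / messages_per_replica)
    return max(min_replicas, min(max_replicas, desired))


print(replicas_for_queue(0))     # 0  (idle: scale to zero)
print(replicas_for_queue(25))    # 3
print(replicas_for_queue(1000))  # 50 (capped at maxReplicaCount)
```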
When to Use Each Autoscaler
| Scenario | Autoscaler | Why |
|---|---|---|
| Web API under varying load | HPA (CPU/RPS) | Scale pods with traffic |
| Memory-hungry app, hard to predict | VPA | Right-size without manual tuning |
| Queue processor (bursty) | KEDA | Scale to 0 when idle, burst on demand |
| Pods pending, no node capacity | Cluster Autoscaler | Add nodes for pending pods |
| Event-driven microservices | KEDA + CA | KEDA scales pods, CA adds nodes |
| Stable workload, cost optimisation | VPA (Off mode) | Get recommendations, apply manually |
Exercises
minAvailable: 2). Drain a node hosting one replica — observe it succeeds. Now drain a second node — observe the drain blocks because it would violate the PDB. Understand why this protects availability.
kubectl run load-gen --image=busybox -- /bin/sh -c "while true; do wget -q -O- http://app-svc; done". Watch the HPA scale up. Remove the load and watch it scale down after the stabilisation window.
Conclusion
Cluster operations and reliability engineering are what separate a development cluster from production infrastructure. Key takeaways:
- Upgrades: One minor version at a time, control plane first, workers second, always with PDB protection
- Resource management: Set requests for scheduling, limits for protection, use quotas for multi-tenancy
- Disruption protection: PDBs ensure minimum availability during voluntary disruptions
- Autoscaling: HPA for horizontal pod scaling, VPA for right-sizing, Cluster Autoscaler for infrastructure, KEDA for event-driven
- Node management: Taints, tolerations, and affinity control where workloads run
In Part 14, we'll tackle Kubernetes Security — RBAC, Pod Security Standards, network policies, secrets management, admission controllers, and supply chain security.