API Machinery
The Kubernetes API server is not a monolithic endpoint — it's a collection of API groups, each containing versioned resources. Understanding this structure is essential for writing controllers, debugging API interactions, and working with CRDs.
API Groups & Versions
Every Kubernetes resource belongs to an API group. The "core" group (also called the legacy group) contains fundamental resources like Pods, Services, and ConfigMaps. Newer resources live in named groups like apps, batch, and networking.k8s.io.
| API Group | Path | Key Resources | Stable Version |
|---|---|---|---|
| core (legacy) | /api/v1 | Pod, Service, ConfigMap, Secret, PV, PVC, Namespace | v1 |
| apps | /apis/apps/v1 | Deployment, StatefulSet, DaemonSet, ReplicaSet | v1 |
| batch | /apis/batch/v1 | Job, CronJob | v1 |
| networking.k8s.io | /apis/networking.k8s.io/v1 | Ingress, IngressClass, NetworkPolicy | v1 |
| rbac.authorization.k8s.io | /apis/rbac.authorization.k8s.io/v1 | Role, ClusterRole, RoleBinding, ClusterRoleBinding | v1 |
| storage.k8s.io | /apis/storage.k8s.io/v1 | StorageClass, CSIDriver, VolumeAttachment | v1 |
| autoscaling | /apis/autoscaling/v2 | HorizontalPodAutoscaler | v2 |
| policy | /apis/policy/v1 | PodDisruptionBudget | v1 |
# List all API groups available on the cluster
kubectl api-versions
# Explore resources in a specific group
kubectl api-resources --api-group=apps
# NAME SHORTNAMES APIVERSION NAMESPACED KIND
# deployments deploy apps/v1 true Deployment
# daemonsets ds apps/v1 true DaemonSet
# replicasets rs apps/v1 true ReplicaSet
# statefulsets sts apps/v1 true StatefulSet
# View the raw API discovery document
kubectl get --raw /apis | jq '.groups[].name'
# List all resources in the core group
kubectl api-resources --api-group=""
GVR vs GVK
Two fundamental concepts identify resources in Kubernetes — GVK (Group-Version-Kind) identifies a type of resource, while GVR (Group-Version-Resource) identifies the REST path to access it.
Kind is singular and CamelCase (Deployment), while Resource is plural and lowercase (deployments).
# GVK example: apps/v1, Kind=Deployment
# GVR example: apps/v1, Resource=deployments
# The REST path for a namespaced resource follows this pattern:
# /apis/{group}/{version}/namespaces/{namespace}/{resource}/{name}
# /apis/apps/v1/namespaces/default/deployments/nginx
# For core group resources, the path omits /apis and the group:
# /api/v1/namespaces/default/pods/my-pod
# Access a specific deployment via raw API
kubectl get --raw /apis/apps/v1/namespaces/default/deployments/nginx | jq '.metadata.name'
# Subresources are accessed as nested paths:
# /apis/apps/v1/namespaces/default/deployments/nginx/status
# /apis/apps/v1/namespaces/default/deployments/nginx/scale
kubectl get --raw /apis/apps/v1/namespaces/default/deployments/nginx/scale | jq .
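The path rules above can be sketched as a small Go function. This is a minimal illustration using only the standard library; the GVR struct and RESTPath helper are hypothetical names for this example (client-go has its own schema.GroupVersionResource type, which is not used here).

```go
package main

import (
	"fmt"
	"strings"
)

// GVR is a hypothetical stand-in for a Group-Version-Resource triple.
type GVR struct {
	Group    string // "" for the core (legacy) group
	Version  string
	Resource string // plural, lowercase, e.g. "deployments"
}

// RESTPath builds the URL path for a resource. An empty namespace means
// cluster-scoped; an empty name yields the collection (list) path.
// Subresources such as "status" or "scale" are appended after the name.
func RESTPath(gvr GVR, namespace, name string, subresource ...string) string {
	parts := []string{}
	if gvr.Group == "" {
		// Core group: /api/{version}/...
		parts = append(parts, "api", gvr.Version)
	} else {
		// Named groups: /apis/{group}/{version}/...
		parts = append(parts, "apis", gvr.Group, gvr.Version)
	}
	if namespace != "" {
		parts = append(parts, "namespaces", namespace)
	}
	parts = append(parts, gvr.Resource)
	if name != "" {
		parts = append(parts, name)
		parts = append(parts, subresource...)
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	deployments := GVR{Group: "apps", Version: "v1", Resource: "deployments"}
	pods := GVR{Version: "v1", Resource: "pods"}

	fmt.Println(RESTPath(deployments, "default", "nginx"))
	fmt.Println(RESTPath(deployments, "default", "nginx", "scale"))
	fmt.Println(RESTPath(pods, "default", "my-pod"))
}
```

The resulting strings can be passed directly to kubectl get --raw.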
Resource Versioning & Optimistic Concurrency
Every Kubernetes object carries a metadata.resourceVersion field — an opaque string (typically the etcd revision number) that changes on every update. This enables optimistic concurrency control: the API server rejects updates where the submitted resourceVersion doesn't match the current stored version.
If you receive a 409 Conflict response, it means another process modified the resource since you last read it. The correct pattern is: read → modify → write. If a conflict occurs, re-read the object and retry. Never cache resourceVersion across long intervals.
# Observe resourceVersion changing on updates
kubectl get pod nginx -o jsonpath='{.metadata.resourceVersion}'
# Output: 15234
kubectl label pod nginx env=prod
kubectl get pod nginx -o jsonpath='{.metadata.resourceVersion}'
# Output: 15235 (changed; the value is opaque, so don't rely on it increasing by exactly 1)
# Demonstrate optimistic concurrency conflict
# Terminal 1: Get the resource
kubectl get deployment nginx -o yaml > deployment.yaml
# Terminal 2: Modify it (changes resourceVersion on server)
kubectl scale deployment nginx --replicas=5
# Terminal 1: Try to apply stale version
kubectl apply -f deployment.yaml
# Error: the object has been modified; please apply your changes to the latest version
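The read → modify → write loop can be simulated without a cluster. The sketch below is an in-memory stand-in, not the API server's implementation: Store, Object and UpdateWithRetry are hypothetical names, and resourceVersion is modeled as an int only to keep the example short (real values are opaque strings). In real controllers, client-go's retry.RetryOnConflict helper plays the role of UpdateWithRetry.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict models the API server's 409 Conflict response.
var ErrConflict = errors.New("409 Conflict: object has been modified")

// Object is a toy resource. Real resourceVersions are opaque strings.
type Object struct {
	ResourceVersion int
	Replicas        int
}

// Store is an in-memory stand-in for the API server's storage layer.
type Store struct{ current Object }

func (s *Store) Get() Object { return s.current }

// Update succeeds only if the caller's copy is based on the stored version.
func (s *Store) Update(obj Object) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	obj.ResourceVersion++
	s.current = obj
	return nil
}

// UpdateWithRetry re-reads the object and re-applies mutate on every conflict.
func UpdateWithRetry(s *Store, attempts int, mutate func(*Object)) error {
	for i := 0; i < attempts; i++ {
		obj := s.Get() // always start from the latest stored state
		mutate(&obj)
		err := s.Update(obj)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return ErrConflict
}

func main() {
	s := &Store{current: Object{ResourceVersion: 1, Replicas: 3}}

	stale := s.Get() // "Terminal 1" reads the object

	// "Terminal 2" scales the deployment, bumping the stored version.
	_ = UpdateWithRetry(s, 3, func(o *Object) { o.Replicas = 5 })

	// "Terminal 1" writes its stale copy and is rejected.
	stale.Replicas = 4
	fmt.Println("stale write:", s.Update(stale))

	// Re-reading before writing resolves the conflict.
	err := UpdateWithRetry(s, 3, func(o *Object) { o.Replicas = 4 })
	fmt.Println("retried write:", err, "replicas:", s.Get().Replicas)
}
```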
API Discovery & Subresources
The API server exposes discovery endpoints that allow clients to dynamically determine which resources and operations are available. This is how kubectl auto-completes resource types and how controllers discover CRDs at runtime.
Subresources are secondary endpoints on a resource that handle specific operations:
| Subresource | Path Suffix | Purpose | Used By |
|---|---|---|---|
| /status | …/pods/nginx/status | Separate RBAC for spec vs status updates | Controllers updating status without touching spec |
| /scale | …/deployments/nginx/scale | Uniform scaling interface | HPA, kubectl scale |
| /log | …/pods/nginx/log | Stream container logs | kubectl logs |
| /exec | …/pods/nginx/exec | Execute commands in container | kubectl exec |
| /portforward | …/pods/nginx/portforward | Tunnel TCP connections | kubectl port-forward |
| /eviction | …/pods/nginx/eviction | Graceful pod eviction respecting PDBs | kubectl drain, descheduler |
The Informer Pattern
Informers are the backbone of the Kubernetes controller ecosystem. They solve a critical problem: how do you keep track of cluster state without overwhelming the API server with repeated list requests?
SharedInformers
Without informers, every controller watching Deployments would independently list and watch the same resources — 10 controllers would mean 10x the API load. SharedInformers ensure a single watch connection per resource type per process, with events fanned out to all registered handlers.
flowchart LR
API[API Server] -->|List + Watch| R[Reflector]
R -->|Objects| DF[DeltaFIFO Queue]
DF -->|Pop| I[Informer]
I -->|Store| IX[Indexer / Cache]
I -->|Notify| EH1[Event Handler 1]
I -->|Notify| EH2[Event Handler 2]
I -->|Notify| EH3[Event Handler 3]
EH1 -->|Enqueue Key| WQ[Work Queue]
EH2 -->|Enqueue Key| WQ
WQ -->|Dequeue| RC[Reconcile Loop]
RC -->|Read| IX
RC -->|Write| API
ListWatch Mechanism
The ListWatch mechanism is a two-phase synchronization protocol:
- List — On startup, fetch all existing objects of the watched type. This populates the initial cache.
- Watch — After listing, open a long-lived HTTP streaming connection. The API server pushes change events (ADDED, MODIFIED, DELETED) as they occur.
# Observe a watch stream directly (raw HTTP)
kubectl get --raw '/api/v1/namespaces/default/pods?watch=true&resourceVersion=0' &
# You'll see NDJSON events like:
# {"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"nginx",...},...}}
# {"type":"MODIFIED","object":{"kind":"Pod","metadata":{"name":"nginx",...},...}}
# {"type":"DELETED","object":{"kind":"Pod","metadata":{"name":"nginx",...},...}}
# The watch starts from the given resourceVersion
# If the version is too old, you get 410 Gone → triggers a full re-list
The BOOKMARK event type (opt-in) allows the server to periodically update the client's resourceVersion without sending full objects, reducing re-list data after reconnection.
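To make the watch phase concrete, here is a rough sketch of how a client might fold the NDJSON event stream into a local cache. It uses only the standard library; WatchEvent, ApplyWatch and ReplayEvents are hypothetical names, the cache stores just the resourceVersion per key, and the real Reflector additionally handles re-listing on 410 Gone, which is omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// WatchEvent mirrors only the fields of a watch event this sketch needs.
type WatchEvent struct {
	Type   string `json:"type"`
	Object struct {
		Metadata struct {
			Namespace       string `json:"namespace"`
			Name            string `json:"name"`
			ResourceVersion string `json:"resourceVersion"`
		} `json:"metadata"`
	} `json:"object"`
}

// ApplyWatch reads newline-delimited JSON events and applies them to cache
// (key "namespace/name" -> resourceVersion). It returns the last
// resourceVersion seen, which a client would use to resume the watch.
func ApplyWatch(r io.Reader, cache map[string]string) (string, error) {
	dec := json.NewDecoder(r)
	last := ""
	for {
		var ev WatchEvent // fresh value per event so no fields carry over
		if err := dec.Decode(&ev); err == io.EOF {
			return last, nil
		} else if err != nil {
			return last, err
		}
		m := ev.Object.Metadata
		key := m.Namespace + "/" + m.Name
		switch ev.Type {
		case "ADDED", "MODIFIED":
			cache[key] = m.ResourceVersion
		case "DELETED":
			delete(cache, key)
		case "BOOKMARK":
			// Carries only a resourceVersion; the cache is unchanged.
		}
		if m.ResourceVersion != "" {
			last = m.ResourceVersion
		}
	}
}

// ReplayEvents is a convenience wrapper that feeds a string through ApplyWatch.
func ReplayEvents(ndjson string) (map[string]string, string, error) {
	cache := map[string]string{}
	last, err := ApplyWatch(strings.NewReader(ndjson), cache)
	return cache, last, err
}

func main() {
	stream := `{"type":"ADDED","object":{"metadata":{"namespace":"default","name":"nginx","resourceVersion":"100"}}}
{"type":"MODIFIED","object":{"metadata":{"namespace":"default","name":"nginx","resourceVersion":"105"}}}
{"type":"BOOKMARK","object":{"metadata":{"resourceVersion":"120"}}}`

	cache, last, err := ReplayEvents(stream)
	fmt.Println(cache, last, err)
}
```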
Local Cache & Indexers
The Informer maintains a thread-safe local cache (the Indexer) that mirrors the API server's state. Controllers read from this cache instead of hitting the API server, providing microsecond-latency lookups for objects that would otherwise require network round-trips.
# The cache supports indexed lookups — common indexes include:
# - By namespace: cache.MetaNamespaceIndexFunc
# - By node name: custom indexer for pod-to-node mapping
# - By label: custom indexer for label-based queries
# Pseudocode showing cache interaction
# controller reads from cache (fast, no API call):
# pod, exists, err := podIndexer.GetByKey("default/nginx")
#
# controller writes to API server (network call):
# _, err := clientset.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{})
Event Handlers & Resync
Event handlers are callbacks registered with the informer that fire when objects change. The three handler types map to watch event types:
# Go pseudocode — registering event handlers on a SharedInformer
# informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
#     AddFunc: func(obj interface{}) {
#         key, _ := cache.MetaNamespaceKeyFunc(obj)
#         workqueue.Add(key) // enqueue "namespace/name"
#     },
#     UpdateFunc: func(oldObj, newObj interface{}) {
#         key, _ := cache.MetaNamespaceKeyFunc(newObj)
#         workqueue.Add(key)
#     },
#     DeleteFunc: func(obj interface{}) {
#         key, _ := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
#         workqueue.Add(key)
#     },
# })
# Resync period: periodically re-queues ALL objects from cache
# Purpose: catch missed events, ensure eventual consistency
# Typical value: 30s to 10m (longer = less load, shorter = faster convergence)
# informer.AddEventHandlerWithResyncPeriod(handler, 5*time.Minute)
Controller Architecture
Every Kubernetes controller follows the same architectural pattern: observe desired state (spec), compare with actual state, and take action to reconcile. This pattern is what makes Kubernetes self-healing and declarative.
The Generic Controller Pattern
flowchart TD
INF[SharedInformer] -->|Event| EH[Event Handler]
EH -->|"key: namespace/name"| WQ[Work Queue]
WQ -->|Dequeue| W[Worker Goroutine]
W --> R{Reconcile}
R -->|Read desired state| CACHE[Informer Cache]
R -->|Read actual state| EXT[External System / API]
R -->|Diff| D{Desired == Actual?}
D -->|Yes| DONE[Done — Requeue after interval]
D -->|No| ACT[Take Action]
ACT -->|Create/Update/Delete| API[API Server]
ACT -->|Success| DONE
ACT -->|Error| RETRY[Requeue with backoff]
RETRY --> WQ
# Go pseudocode — the canonical controller reconciliation loop
#
# func (c *Controller) Run(workers int, stopCh <-chan struct{}) {
#     defer c.workqueue.ShutDown()
#
#     // Wait for informer caches to sync before processing
#     if !cache.WaitForCacheSync(stopCh, c.podsSynced, c.deploymentsSynced) {
#         return
#     }
#
#     // Launch worker goroutines
#     for i := 0; i < workers; i++ {
#         go wait.Until(c.runWorker, time.Second, stopCh)
#     }
#     <-stopCh
# }
#
# func (c *Controller) runWorker() {
#     for c.processNextWorkItem() {}
# }
#
# func (c *Controller) processNextWorkItem() bool {
#     key, shutdown := c.workqueue.Get()
#     if shutdown { return false }
#     defer c.workqueue.Done(key)
#
#     err := c.reconcile(key.(string))
#     if err == nil {
#         c.workqueue.Forget(key) // success: reset retry count
#         return true
#     }
#
#     // Requeue with rate limiting (exponential backoff)
#     c.workqueue.AddRateLimited(key)
#     return true
# }
Work Queues & Rate Limiting
The work queue is the decoupling layer between event observation and reconciliation. It provides three critical guarantees:
- Deduplication — If an object changes 10 times before it's processed, only one reconciliation occurs (using the latest state from cache)
- Rate Limiting — Failed reconciliations are retried with exponential backoff (default: 5ms base, 1000s max)
- Ordering — Items are processed in insertion order (FIFO), but re-queued items respect rate limits
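The deduplication guarantee in the list above can be illustrated with a toy queue. This is a simplified sketch, not client-go's workqueue: DedupQueue is a hypothetical type, it is not safe for concurrent use, and it omits the "currently processing" set that the real queue uses to defer re-adds of an in-flight key.

```go
package main

import "fmt"

// DedupQueue is a FIFO queue of keys that holds each key at most once.
type DedupQueue struct {
	order   []string        // keys in insertion order
	pending map[string]bool // keys currently waiting in the queue
}

func NewDedupQueue() *DedupQueue {
	return &DedupQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *DedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key; ok is false when the queue is empty.
func (q *DedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many distinct keys are waiting.
func (q *DedupQueue) Len() int { return len(q.order) }

func main() {
	q := NewDedupQueue()
	// Ten rapid updates to the same object collapse into one work item.
	for i := 0; i < 10; i++ {
		q.Add("default/nginx")
	}
	q.Add("default/redis")
	fmt.Println("items waiting:", q.Len())
	for key, ok := q.Get(); ok; key, ok = q.Get() {
		fmt.Println("reconcile", key)
	}
}
```

Because the queue carries only keys, the reconciler reads the latest object state from the cache when it finally runs, which is why collapsing ten events into one item loses nothing.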
# Go pseudocode — work queue rate limiter configuration
#
# // Default rate limiter: exponential backoff + overall rate limit
# rateLimiter := workqueue.NewMaxOfRateLimiter(
#     // Per-item exponential backoff: 5ms base, doubles each retry, max 1000s
#     workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
#     // Overall rate: max 10 items/second, burst of 100
#     &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
# )
#
# queue := workqueue.NewRateLimitingQueue(rateLimiter)
#
# // After 5 consecutive failures for the same key:
# // Retry delays: 5ms → 10ms → 20ms → 40ms → 80ms (exponential)
Leader Election for HA Controllers
In production, controllers run with multiple replicas for high availability. But only one replica should actively reconcile at any time — this is achieved through leader election using a Kubernetes Lease object.
If the leader fails to renew the lease within leaseDuration, a standby replica acquires leadership. Typical settings: lease duration 15s, renew deadline 10s, retry period 2s.
# Leader election Lease object (created automatically by the framework)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-controller-leader
  namespace: kube-system
spec:
  holderIdentity: controller-pod-abc123  # Current leader's pod name
  leaseDurationSeconds: 15               # How long lease is valid
  acquireTime: "2026-05-14T10:00:00Z"
  renewTime: "2026-05-14T10:00:12Z"      # Last renewal timestamp
  leaseTransitions: 3                    # Number of leader changes
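The decision each replica makes on every retry period can be sketched as a pure function over the Lease fields shown above. This is an illustrative simplification, not client-go's leaderelection package: Lease and TryAcquireOrRenew are hypothetical names, timestamps are plain Unix seconds, and the real implementation relies on the API server's optimistic concurrency to ensure only one candidate's update wins.

```go
package main

import "fmt"

// Lease holds the subset of Lease spec fields relevant to the decision.
type Lease struct {
	Holder          string // holderIdentity
	DurationSeconds int64  // leaseDurationSeconds
	RenewTime       int64  // renewTime as Unix seconds
	Transitions     int    // leaseTransitions
}

// TryAcquireOrRenew reports whether candidate holds the lease after the call.
// The current holder always renews; other candidates may take over only when
// the lease is unheld or has not been renewed within its duration.
func TryAcquireOrRenew(l *Lease, candidate string, now int64) bool {
	if l.Holder == candidate {
		l.RenewTime = now
		return true
	}
	expired := now > l.RenewTime+l.DurationSeconds
	if l.Holder != "" && !expired {
		return false // someone else holds a valid lease
	}
	if l.Holder != "" {
		l.Transitions++ // leadership moves to a different identity
	}
	l.Holder = candidate
	l.RenewTime = now
	return true
}

func main() {
	lease := &Lease{Holder: "controller-pod-abc123", DurationSeconds: 15, RenewTime: 100}

	fmt.Println(TryAcquireOrRenew(lease, "controller-pod-xyz789", 110)) // lease still valid
	fmt.Println(TryAcquireOrRenew(lease, "controller-pod-abc123", 112)) // leader renews
	// The leader crashes and stops renewing; the standby wins once the lease expires.
	fmt.Println(TryAcquireOrRenew(lease, "controller-pod-xyz789", 128))
	fmt.Println(lease.Holder, lease.Transitions)
}
```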
Scheduler Internals
The kube-scheduler assigns pods to nodes through a pluggable framework. The scheduling decision happens in two phases: filtering (which nodes can run the pod) and scoring (which node is best).
The Scheduling Framework
flowchart LR
subgraph "Scheduling Cycle (per pod)"
QS[QueueSort] --> PF[PreFilter]
PF --> F[Filter]
F --> PFS[PostFilter]
PFS --> PS[PreScore]
PS --> S[Score]
S --> NS[NormalizeScore]
NS --> R[Reserve]
end
subgraph "Binding Cycle (async)"
R --> PB[PreBind]
PB --> B[Bind]
B --> POB[PostBind]
end
The scheduling framework defines extension points where plugins hook into the decision process. Each extension point runs a set of plugins that can filter nodes out, score remaining candidates, or perform binding operations.
Filter & Score Plugins
| Plugin | Phase | Purpose |
|---|---|---|
| NodeResourcesFit | Filter + Score | Ensures node has sufficient CPU/memory; scores by resource utilization balance |
| NodeAffinity | Filter + Score | Enforces requiredDuringScheduling and scores preferredDuringScheduling affinities |
| PodTopologySpread | Filter + Score | Distributes pods across failure domains (zones, nodes) based on topology constraints |
| InterPodAffinity | Filter + Score | Enforces pod affinity/anti-affinity rules (co-location and spreading) |
| TaintToleration | Filter + Score | Filters nodes with untolerated NoSchedule/NoExecute taints; prefers nodes with fewer untolerated PreferNoSchedule taints |
| VolumeBinding | Filter + Reserve | Checks PV availability in the node's zone; reserves PV bindings |
| NodePorts | Filter | Filters nodes where requested hostPorts are already in use |
| ImageLocality | Score | Prefers nodes that already have the required container images cached |
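The two-phase decision (filter, then score) can be sketched in a few lines. The scoring formula below is invented for illustration and is not the formula used by NodeResourcesFit or ImageLocality; Node, Pod, Filter, Score and Schedule are hypothetical names, and nodes are assumed to have a non-zero CPU capacity.

```go
package main

import "fmt"

// Node describes the resources a candidate node has left.
type Node struct {
	Name             string
	CapacityMilliCPU int64 // assumed > 0
	FreeMilliCPU     int64
	FreeMemoryMiB    int64
	Images           map[string]bool // images already cached on the node
}

// Pod describes what the pending pod requests.
type Pod struct {
	Image     string
	MilliCPU  int64
	MemoryMiB int64
}

// Filter reports whether the node can run the pod at all.
func Filter(p Pod, n Node) bool {
	return p.MilliCPU <= n.FreeMilliCPU && p.MemoryMiB <= n.FreeMemoryMiB
}

// Score ranks a feasible node: more CPU headroom after placement is better,
// plus a flat bonus when the image is already cached (illustrative weights).
func Score(p Pod, n Node) int64 {
	score := (n.FreeMilliCPU - p.MilliCPU) * 100 / n.CapacityMilliCPU
	if n.Images[p.Image] {
		score += 20
	}
	return score
}

// Schedule returns the best-scoring feasible node, or false if none fits.
func Schedule(p Pod, nodes []Node) (string, bool) {
	best, bestScore, found := "", int64(0), false
	for _, n := range nodes {
		if !Filter(p, n) {
			continue // filtered out: contributes to a FailedScheduling message
		}
		if s := Score(p, n); !found || s > bestScore {
			best, bestScore, found = n.Name, s, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "node-a", CapacityMilliCPU: 4000, FreeMilliCPU: 500, FreeMemoryMiB: 4096},
		{Name: "node-b", CapacityMilliCPU: 4000, FreeMilliCPU: 2000, FreeMemoryMiB: 4096},
		{Name: "node-c", CapacityMilliCPU: 4000, FreeMilliCPU: 1500, FreeMemoryMiB: 4096,
			Images: map[string]bool{"nginx:1.27": true}},
	}
	pod := Pod{Image: "nginx:1.27", MilliCPU: 1000, MemoryMiB: 512}
	fmt.Println(Schedule(pod, nodes))
}
```

In this example node-a is filtered out for lack of CPU, and node-c beats node-b because the image bonus outweighs its smaller headroom.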
# Debug scheduler decisions for a pending pod
kubectl get events --field-selector reason=FailedScheduling -n default
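# Verbose scheduler output (scheduler must be started with -v=10)
# In self-managed clusters (e.g. kubeadm), check the scheduler logs directly;
# managed control planes usually expose them through the provider's logging service instead: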
kubectl logs -n kube-system -l component=kube-scheduler --tail=50
# View which scheduler profiles are active
kubectl get configmap -n kube-system kube-scheduler-config -o yaml 2>/dev/null || \
echo "Scheduler uses default configuration (no custom ConfigMap)"
# Check which node the pod was bound to (empty while the pod is still pending)
kubectl get pod nginx -o jsonpath='{.spec.nodeName}'
Preemption & Priority Classes
When no node can accommodate a pod, the scheduler can preempt (evict) lower-priority pods to make room. This is controlled by PriorityClasses — cluster-wide objects that assign integer priorities to pods.
# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-infrastructure
value: 1000000  # Higher number = higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # or "Never" to disable preemption
description: "For critical infrastructure components (monitoring, logging)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
globalDefault: false
preemptionPolicy: Never  # This class will never preempt others
description: "Low-priority batch jobs that can wait"
---
# Using a PriorityClass in a pod
apiVersion: v1
kind: Pod
metadata:
  name: critical-monitor
spec:
  priorityClassName: critical-infrastructure
  containers:
  - name: monitor
    image: prometheus:latest
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
Use preemptionPolicy: Never for batch jobs that should wait rather than evict production workloads.
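As a rough mental model of preemption on a single node, the sketch below picks victims from the lowest priority upward until the incoming pod fits. It is deliberately naive: the real scheduler also weighs PodDisruptionBudgets, compares candidate nodes, and tries to minimize the number and priority of victims. RunningPod and SelectVictims are hypothetical names, and only CPU is considered.

```go
package main

import (
	"fmt"
	"sort"
)

// RunningPod is a pod with its priority and CPU request.
type RunningPod struct {
	Name     string
	Priority int32
	MilliCPU int64
}

// SelectVictims returns the pods to evict so that incoming fits on the node.
// ok is false when the pod cannot be placed, either because its
// preemptionPolicy is Never (canPreempt == false) or because evicting every
// lower-priority pod still would not free enough CPU.
func SelectVictims(freeMilliCPU int64, running []RunningPod, incoming RunningPod, canPreempt bool) (victims []string, ok bool) {
	if incoming.MilliCPU <= freeMilliCPU {
		return nil, true // fits without evicting anything
	}
	if !canPreempt {
		return nil, false
	}
	// Only strictly lower-priority pods are eligible; copy to avoid
	// reordering the caller's slice.
	candidates := []RunningPod{}
	for _, p := range running {
		if p.Priority < incoming.Priority {
			candidates = append(candidates, p)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	for _, p := range candidates {
		if freeMilliCPU >= incoming.MilliCPU {
			break
		}
		victims = append(victims, p.Name)
		freeMilliCPU += p.MilliCPU
	}
	if freeMilliCPU < incoming.MilliCPU {
		return nil, false
	}
	return victims, true
}

func main() {
	running := []RunningPod{
		{Name: "web", Priority: 1000, MilliCPU: 800},
		{Name: "batch-1", Priority: 100, MilliCPU: 500},
		{Name: "batch-2", Priority: 100, MilliCPU: 500},
	}
	incoming := RunningPod{Name: "critical-monitor", Priority: 1000000, MilliCPU: 1000}
	fmt.Println(SelectVictims(200, running, incoming, true))
}
```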
etcd Deep Dive
etcd is Kubernetes' brain — a distributed key-value store that holds all cluster state. Understanding its internals is crucial for operating large clusters reliably.
Key Structure
Kubernetes stores objects in etcd under a well-defined key hierarchy. The default prefix is /registry/:
# etcd key structure (requires direct etcd access — NOT recommended in production)
# /registry/{resource-plural}/{namespace}/{name}
# /registry/pods/default/nginx
# /registry/deployments/kube-system/coredns
# /registry/namespaces/default
# /registry/clusterroles/admin (cluster-scoped — no namespace segment)
# /registry/services/specs/default/kubernetes
# /registry/secrets/default/my-secret
# In production, access etcd via the API server — direct etcd access bypasses RBAC
# Only use etcdctl for disaster recovery or debugging:
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get /registry/pods/default/nginx --print-value-only | \
auger decode # Decode protobuf to readable format
# List all keys under a prefix
ETCDCTL_API=3 etcdctl get /registry/ --prefix --keys-only | head -20
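Built-in resources are stored in etcd as protobuf-encoded binary, so the raw values are not human-readable. The auger tool can decode these binary values for debugging. The API server handles conversion between protobuf (storage) and JSON/YAML (client-facing).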
Compaction & Defragmentation
etcd is an MVCC (Multi-Version Concurrency Control) store — it keeps a history of all key revisions. Without periodic maintenance, this history grows unbounded and degrades performance.
# Check etcd database size and alarm status
ETCDCTL_API=3 etcdctl endpoint status --write-out=table
# +-------------------+--------+---------+---------+-----------+...
# | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER |
# +-------------------+--------+---------+---------+-----------+...
# | https://...:2379 | abc123 | 3.5.12 | 4.2 GB | true |
# Compact old revisions (keep only latest 10000 revisions)
REVISION=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compact $((REVISION - 10000))
# Defragment to reclaim disk space (compaction marks revisions as deleted but doesn't free space)
ETCDCTL_API=3 etcdctl defrag --endpoints=https://127.0.0.1:2379
# Check for NOSPACE alarm (etcd stops accepting writes at quota limit)
ETCDCTL_API=3 etcdctl alarm list
# Clear NOSPACE alarm after resolving (compact + defrag)
ETCDCTL_API=3 etcdctl alarm disarm
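Why compaction is needed becomes clearer with a toy multi-version store. The sketch below is not etcd's implementation: MVCC, Put, Get and Compact are hypothetical names, everything lives in memory, and defragmentation (returning freed pages to the filesystem) is not modeled at all.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrCompacted = errors.New("requested revision has been compacted")
	ErrNotFound  = errors.New("key not found at that revision")
)

type version struct {
	rev   int64
	value string
}

// MVCC keeps every version of every key until Compact is called.
type MVCC struct {
	rev       int64 // store-wide revision counter
	compacted int64 // reads below this revision are rejected
	history   map[string][]version
}

func NewMVCC() *MVCC { return &MVCC{history: map[string][]version{}} }

// Put appends a new version and returns the new store revision.
func (s *MVCC) Put(key, value string) int64 {
	s.rev++
	s.history[key] = append(s.history[key], version{s.rev, value})
	return s.rev
}

// Get returns the value of key as of revision rev (0 means latest).
func (s *MVCC) Get(key string, rev int64) (string, error) {
	if rev == 0 {
		rev = s.rev
	}
	if rev < s.compacted {
		return "", ErrCompacted
	}
	val, found := "", false
	for _, v := range s.history[key] {
		if v.rev <= rev {
			val, found = v.value, true
		}
	}
	if !found {
		return "", ErrNotFound
	}
	return val, nil
}

// Compact drops versions that were superseded at or before rev, keeping the
// newest version at or below rev so reads at rev still succeed.
func (s *MVCC) Compact(rev int64) {
	for key, vs := range s.history {
		keepFrom := 0
		for i, v := range vs {
			if v.rev <= rev {
				keepFrom = i
			}
		}
		s.history[key] = vs[keepFrom:]
	}
	s.compacted = rev
}

// Versions counts all stored versions across all keys.
func (s *MVCC) Versions() int {
	n := 0
	for _, vs := range s.history {
		n += len(vs)
	}
	return n
}

func main() {
	s := NewMVCC()
	key := "/registry/pods/default/nginx"
	s.Put(key, "v1")
	s.Put(key, "v2")
	rev := s.Put(key, "v3")
	fmt.Println("versions before compaction:", s.Versions())
	s.Compact(rev)
	fmt.Println("versions after compaction:", s.Versions())
	_, err := s.Get(key, 1)
	fmt.Println("read at revision 1:", err)
}
```

A watch that tries to resume from a compacted revision hits the same condition, which is what surfaces to Kubernetes clients as 410 Gone.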
Performance Tuning
| Parameter | Default | Recommended | Impact |
|---|---|---|---|
| --quota-backend-bytes | 2 GB | 8 GB (max recommended) | Maximum DB size before NOSPACE alarm |
| --auto-compaction-mode | periodic | revision | "revision" keeps last N revisions; "periodic" compacts every M hours |
| --auto-compaction-retention | 0 (disabled) | 10000 (revisions) or "5m" (periodic) | How much history to keep |
| --snapshot-count | 100000 | 10000 for faster recovery | Transactions between snapshots (WAL compaction trigger) |
| Disk type | — | SSD with fsync < 10ms | etcd is fsync-heavy; spinning disks cause leader election storms |
| Network latency | — | < 10ms between peers | High latency causes frequent elections and inconsistency |
Garbage Collection & Finalizers
When you delete a Deployment, its ReplicaSets and Pods are cleaned up automatically. This isn't magic — it's the garbage collector following owner references through the object graph.
Owner References
Owner references create a directed acyclic graph (DAG) of object ownership. When a parent is deleted, the garbage collector identifies and removes all dependents.
flowchart TD
D[Deployment<br/>nginx] -->|owns| RS[ReplicaSet<br/>nginx-7d4f8b]
RS -->|owns| P1[Pod<br/>nginx-7d4f8b-abc]
RS -->|owns| P2[Pod<br/>nginx-7d4f8b-def]
RS -->|owns| P3[Pod<br/>nginx-7d4f8b-ghi]
D -->|delete| GC{Garbage Collector}
GC -->|"cascade: background"| RS
GC -->|"cascade: foreground"| P1
GC -->|"cascade: foreground"| P2
GC -->|"cascade: foreground"| P3
# Owner reference as seen on a ReplicaSet owned by a Deployment
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-7d4f8b6c9f
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
    uid: a1b2c3d4-e5f6-7890-abcd-ef1234567890
    controller: true          # This is THE managing controller
    blockOwnerDeletion: true  # Participates in foreground cascading delete
# Inspect owner references on a pod
kubectl get pod nginx-7d4f8b6c9f-abc12 -o jsonpath='{.metadata.ownerReferences}' | jq .
# Delete with explicit cascade policy
kubectl delete deployment nginx --cascade=foreground # Wait for all dependents to be deleted first
kubectl delete deployment nginx --cascade=background # Delete parent immediately, GC cleans up async (default)
kubectl delete deployment nginx --cascade=orphan # Delete parent only, leave dependents running
Cascading Deletion Modes
| Mode | Behavior | Use Case |
|---|---|---|
| Background (default) | Parent is deleted immediately. GC deletes dependents asynchronously. | Normal operations — fast deletion, eventual cleanup |
| Foreground | Parent enters "deletion in progress" state. GC deletes all dependents first, then removes parent. | When you need to guarantee cleanup before proceeding (migrations, replacements) |
| Orphan | Parent is deleted. Dependents are left running with ownerReferences cleared. | Replacing a Deployment while keeping existing pods running |
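The graph walk behind cascading deletion can be sketched as follows. This is a simplified model, not the garbage collector's code: Obj and Collect are hypothetical names, objects are identified by UID only, and the timing difference between background and foreground modes is not modeled. One real rule it does capture is that a dependent is collected only once all of its owners are gone.

```go
package main

import "fmt"

// Obj is an object with the UIDs of its owners (from ownerReferences).
type Obj struct {
	UID    string
	Owners []string
}

// Collect returns, in discovery order, the UIDs of every object that becomes
// garbage once root is deleted. An object is collected only when all of its
// owners have been deleted; objects without owners are never collected.
func Collect(objs []Obj, root string) []string {
	deleted := map[string]bool{root: true}
	var order []string
	for changed := true; changed; {
		changed = false
		for _, o := range objs {
			if deleted[o.UID] || len(o.Owners) == 0 {
				continue
			}
			allGone := true
			for _, owner := range o.Owners {
				if !deleted[owner] {
					allGone = false
					break
				}
			}
			if allGone {
				deleted[o.UID] = true
				order = append(order, o.UID)
				changed = true
			}
		}
	}
	return order
}

func main() {
	objs := []Obj{
		{UID: "deploy/nginx"},
		{UID: "rs/nginx-7d4f8b", Owners: []string{"deploy/nginx"}},
		{UID: "pod/nginx-7d4f8b-abc", Owners: []string{"rs/nginx-7d4f8b"}},
		{UID: "pod/nginx-7d4f8b-def", Owners: []string{"rs/nginx-7d4f8b"}},
		{UID: "pod/unrelated"},
	}
	fmt.Println(Collect(objs, "deploy/nginx"))
}
```

Orphan mode corresponds to stripping the deleted owner from each dependent's Owners list before the walk, which leaves the dependents without owners and therefore uncollected.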
Finalizers
Finalizers are metadata strings that block object deletion until a controller removes them. They provide a hook for cleanup logic that must complete before an object can be garbage collected.
When you delete an object that has finalizers, the API server sets metadata.deletionTimestamp but does NOT remove the object from etcd. The object remains "terminating" until all finalizers are removed. A controller watches for the deletionTimestamp, performs cleanup (external resources, DNS records, cloud load balancers), then removes its finalizer. Once all finalizers are cleared, the API server removes the object.
# Common finalizer patterns in Kubernetes
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
spec:
  finalizers:  # Namespace finalizers live in spec.finalizers, not metadata.finalizers
  - kubernetes  # Ensures all resources in the namespace are deleted first
---
# Custom finalizer on a CRD instance
apiVersion: databases.example.com/v1
kind: PostgresCluster
metadata:
  name: prod-db
  finalizers:
  - databases.example.com/cleanup  # Controller deletes cloud DB before allowing object removal
spec:
  instances: 3
  storage:
    size: 100Gi
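The deletion lifecycle with finalizers can be modeled with a tiny in-memory store. This is a sketch for intuition only: Store, Object, Delete and RemoveFinalizer are hypothetical names, the Deleting flag stands in for metadata.deletionTimestamp, and the second finalizer name in the example is invented.

```go
package main

import "fmt"

// Object is a stored object; Deleting stands in for a set deletionTimestamp.
type Object struct {
	Name       string
	Finalizers []string
	Deleting   bool
}

// Store is a toy replacement for the API server's storage.
type Store map[string]*Object

// Delete removes the object immediately if it has no finalizers; otherwise it
// only marks the object as terminating.
func (s Store) Delete(name string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	if len(obj.Finalizers) == 0 {
		delete(s, name)
		return
	}
	obj.Deleting = true
}

// RemoveFinalizer is what a controller calls after finishing its cleanup.
// Removing the last finalizer from a terminating object deletes it.
func (s Store) RemoveFinalizer(name, finalizer string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	kept := obj.Finalizers[:0]
	for _, f := range obj.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	obj.Finalizers = kept
	if obj.Deleting && len(kept) == 0 {
		delete(s, name)
	}
}

func main() {
	s := Store{"prod-db": &Object{
		Name:       "prod-db",
		Finalizers: []string{"databases.example.com/cleanup", "example.com/backup"},
	}}

	s.Delete("prod-db")
	fmt.Println("after delete, still present:", s["prod-db"] != nil)

	s.RemoveFinalizer("prod-db", "databases.example.com/cleanup")
	fmt.Println("one finalizer left, still present:", s["prod-db"] != nil)

	s.RemoveFinalizer("prod-db", "example.com/backup")
	fmt.Println("all finalizers removed, still present:", s["prod-db"] != nil)
}
```

An object stuck in "Terminating" is simply one whose controller never reached the RemoveFinalizer step.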
# Debug stuck namespace deletion (common issue — stuck finalizers)
kubectl get namespace stuck-ns -o json | jq '.spec.finalizers'
# Output: ["kubernetes"]
# Check which resources still exist in the namespace
kubectl api-resources --verbs=list --namespaced -o name | \
xargs -n 1 kubectl get --show-kind --ignore-not-found -n stuck-ns
# DANGEROUS: Force-remove finalizer (only when cleanup is confirmed done)
kubectl get namespace stuck-ns -o json | \
jq '.spec.finalizers = []' | \
kubectl replace --raw "/api/v1/namespaces/stuck-ns/finalize" -f -
Admission Controllers
Admission controllers intercept API requests after authentication and authorization but before persistence to etcd. They can validate requests (rejecting invalid ones) or mutate them (injecting defaults, sidecars, labels).
Mutating & Validating Webhooks
The admission pipeline processes webhooks in two phases: mutating webhooks run first (and can modify the object), then validating webhooks run on the final mutated object (and can only accept or reject).
| Type | Can Modify Object? | Ordering | Common Uses |
|---|---|---|---|
| MutatingAdmissionWebhook | Yes (via JSON patch) | Runs first; webhooks are called serially, and their relative order should not be relied on | Sidecar injection (Istio), default resource limits, label injection |
| ValidatingAdmissionWebhook | No (accept/reject only) | Runs after all mutating webhooks; webhooks are called in parallel | Policy enforcement (OPA/Gatekeeper), image allowlists, naming conventions |
Built-in Admission Controllers
| Controller | Type | What It Does |
|---|---|---|
| NamespaceLifecycle | Validating | Prevents creating objects in terminating/non-existent namespaces |
| LimitRanger | Mutating + Validating | Injects default resource requests/limits, rejects pods exceeding limits |
| ResourceQuota | Validating | Rejects requests that would exceed namespace resource quotas |
| PodSecurity | Validating | Enforces Pod Security Standards (restricted, baseline, privileged) |
| ServiceAccount | Mutating | Injects default service account token volume mount |
| DefaultStorageClass | Mutating | Adds default StorageClass to PVCs without one |
| MutatingAdmissionWebhook | Mutating | Calls external mutating webhook services |
| ValidatingAdmissionWebhook | Validating | Calls external validating webhook services |
Writing Custom Admission Webhooks
A custom webhook is a web server that receives AdmissionReview requests from the API server and returns an AdmissionReview whose response field (an AdmissionResponse) carries the verdict. Here's the configuration that tells the API server when to invoke your webhook:
# ValidatingWebhookConfiguration — reject pods without resource limits
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: require-resource-limits
webhooks:
- name: require-limits.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  timeoutSeconds: 5
  failurePolicy: Fail  # Reject request if webhook is unavailable
  matchPolicy: Equivalent
  clientConfig:
    service:
      name: admission-webhook
      namespace: webhook-system
      path: /validate-resource-limits
    caBundle: LS0tLS1CRUdJTi...  # Base64-encoded CA cert
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  namespaceSelector:
    matchExpressions:
    - key: enforce-limits
      operator: In
      values: ["true"]  # Only applies to namespaces with this label
---
# MutatingWebhookConfiguration: inject sidecar container
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
- name: sidecar.inject.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  reinvocationPolicy: IfNeeded  # Re-invoke if another mutating webhook changed the object
  timeoutSeconds: 10
  failurePolicy: Ignore  # Don't block pod creation if webhook is down
  clientConfig:
    service:
      name: sidecar-injector
      namespace: injection-system
      path: /inject
    caBundle: LS0tLS1CRUdJTi...
  rules:
  - operations: ["CREATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  objectSelector:
    matchLabels:
      inject-sidecar: "true"  # Only pods with this label
If your webhook is unavailable while failurePolicy: Fail is set, ALL matching API requests will be rejected, including system-critical pods. Always use failurePolicy: Ignore during initial rollout and narrow your namespaceSelector to exclude kube-system. Set short timeoutSeconds (3-5s) to avoid blocking the API server.
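On the server side, a validating webhook for the require-resource-limits configuration might look roughly like the sketch below. It uses only the standard library and hand-written structs that mirror a small subset of the admission.k8s.io/v1 AdmissionReview schema; the type names, Decide and handleValidate are assumptions for this example (a real webhook would normally import the k8s.io/api types). The API server only calls webhooks over HTTPS, so the handler would be served with http.ListenAndServeTLS using a certificate that the caBundle trusts.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Hand-written subset of the admission.k8s.io/v1 AdmissionReview schema.
type AdmissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *AdmissionRequest  `json:"request,omitempty"`
	Response   *AdmissionResponse `json:"response,omitempty"`
}

type AdmissionRequest struct {
	UID    string          `json:"uid"`
	Object json.RawMessage `json:"object"` // the pod being created or updated
}

type AdmissionResponse struct {
	UID     string  `json:"uid"` // must echo the request UID
	Allowed bool    `json:"allowed"`
	Status  *Status `json:"status,omitempty"`
}

type Status struct {
	Message string `json:"message"`
}

// podView decodes only the fields needed to check resource limits.
type podView struct {
	Spec struct {
		Containers []struct {
			Name      string `json:"name"`
			Resources struct {
				Limits map[string]interface{} `json:"limits"`
			} `json:"resources"`
		} `json:"containers"`
	} `json:"spec"`
}

// Decide rejects pods that have any container without resource limits.
func Decide(req AdmissionRequest) AdmissionResponse {
	resp := AdmissionResponse{UID: req.UID, Allowed: true}
	var pod podView
	if err := json.Unmarshal(req.Object, &pod); err != nil {
		resp.Allowed = false
		resp.Status = &Status{Message: "cannot decode pod: " + err.Error()}
		return resp
	}
	for _, c := range pod.Spec.Containers {
		if len(c.Resources.Limits) == 0 {
			resp.Allowed = false
			resp.Status = &Status{Message: fmt.Sprintf("container %q has no resource limits", c.Name)}
			return resp
		}
	}
	return resp
}

// handleValidate wraps Decide in the AdmissionReview request/response envelope.
func handleValidate(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var review AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
		http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
		return
	}
	resp := Decide(*review.Request)
	out := AdmissionReview{APIVersion: "admission.k8s.io/v1", Kind: "AdmissionReview", Response: &resp}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(out)
}

func main() {
	// A real server would register handleValidate on /validate-resource-limits
	// and call http.ListenAndServeTLS. Here we only exercise the decision logic.
	req := AdmissionRequest{UID: "abc-123", Object: []byte(`{"spec":{"containers":[{"name":"app"}]}}`)}
	out, _ := json.Marshal(Decide(req))
	fmt.Println(string(out))
}
```

Keeping the decision in a pure function such as Decide makes the policy easy to unit-test without a cluster or TLS setup.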
# Debug admission webhook issues
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
# Check if a webhook is rejecting requests
kubectl get events --field-selector reason=FailedCreate
# Warning FailedCreate admission webhook "require-limits.example.com" denied the request: ...
# Inspect the webhook's own logs for incoming requests and errors
kubectl logs -n webhook-system deploy/admission-webhook --tail=20
# Temporarily disable a misbehaving webhook
kubectl delete validatingwebhookconfiguration require-resource-limits
Exercises
Explore the Kubernetes API Surface
Using only kubectl get --raw and jq, complete the following tasks:
- List all API groups and their preferred versions
- Find the GVR for CronJobs and construct the full REST path to list CronJobs in the production namespace
- Retrieve the /scale subresource of a Deployment and modify its replica count via kubectl replace --raw
- Demonstrate optimistic concurrency by creating a conflict: two concurrent writes to the same ConfigMap
Build a Manual Reconciliation Loop
Simulate the controller pattern using shell scripting:
- Write a bash script that watches for ConfigMap changes using kubectl get --raw ...?watch=true
- On each ADDED/MODIFIED event, extract the ConfigMap's data and write it to a local file
- Implement a simple exponential backoff: if writing fails, wait 1s, 2s, 4s, 8s before retrying
- Add deduplication: if the same key appears multiple times in the queue before processing, skip duplicates
Diagnose and Fix Scheduling Failures
Create scenarios that demonstrate each scheduler failure mode:
- Create a pod requesting 128 CPUs — diagnose the FailedScheduling event and identify which filter rejected it
- Create two PriorityClasses (high=1000, low=100) and demonstrate preemption: fill a node with low-priority pods, then submit a high-priority pod
- Configure a custom scheduler profile that disables the ImageLocality scoring plugin and verify the behavior difference
Implement Custom Garbage Collection and Admission Control
Combine finalizers with admission webhooks:
- Create a ConfigMap with a custom finalizer. Write a controller (bash loop) that watches for deletionTimestamp, performs "cleanup" (logs a message), then removes the finalizer
- Deploy a simple validating webhook (using a language of choice) that rejects pods using the latest image tag
- Demonstrate the orphan cascade policy: delete a Deployment with --cascade=orphan and verify pods continue running without owner references
Conclusion
Understanding Kubernetes internals — API machinery, informers, controllers, the scheduler, etcd, garbage collection, and admission controllers — transforms your mental model from "kubectl does things" to understanding the precise mechanisms behind every operation. This knowledge is essential for:
- Debugging complex failures — stuck deletions (finalizers), scheduling loops (plugin conflicts), stale state (informer cache lag)
- Building operators — implementing the informer→queue→reconcile pattern correctly
- Performance tuning — etcd compaction, informer resync intervals, webhook timeouts
- Operating at scale — understanding why clusters with 5,000+ nodes need careful etcd tuning and rate limiting
Next in the Series
In Part 12: CRDs & Operators, we'll apply everything from this article to build custom resources and controllers — extending Kubernetes with your own domain-specific APIs using the Operator pattern.