Kubernetes Internals - Part 11

API Machinery

The Kubernetes API server is not a monolithic endpoint — it's a collection of API groups, each containing versioned resources. Understanding this structure is essential for writing controllers, debugging API interactions, and working with CRDs.

API Groups & Versions

                            
Definition: An API Group is a logical collection of related resources that are versioned together. Groups allow the Kubernetes API to evolve without breaking backward compatibility: resources can be added, deprecated, and removed within a group independently of other groups.
                        

Every Kubernetes resource belongs to an API group. The "core" group (also called the legacy group) contains fundamental resources like Pods, Services, and ConfigMaps. Newer resources live in named groups like apps, batch, and networking.k8s.io.

API Group	Path	Key Resources	Stable Version
`core` (legacy)	`/api/v1`	Pod, Service, ConfigMap, Secret, PV, PVC, Namespace	v1
`apps`	`/apis/apps/v1`	Deployment, StatefulSet, DaemonSet, ReplicaSet	v1
`batch`	`/apis/batch/v1`	Job, CronJob	v1
`networking.k8s.io`	`/apis/networking.k8s.io/v1`	Ingress, IngressClass, NetworkPolicy	v1
`rbac.authorization.k8s.io`	`/apis/rbac.authorization.k8s.io/v1`	Role, ClusterRole, RoleBinding, ClusterRoleBinding	v1
`storage.k8s.io`	`/apis/storage.k8s.io/v1`	StorageClass, CSIDriver, VolumeAttachment	v1
`autoscaling`	`/apis/autoscaling/v2`	HorizontalPodAutoscaler	v2
`policy`	`/apis/policy/v1`	PodDisruptionBudget	v1

# List every API group/version served by the cluster
kubectl api-versions

# Explore resources in a specific group
kubectl api-resources --api-group=apps
# NAME          SHORTNAMES   APIVERSION   NAMESPACED   KIND
# deployments   deploy       apps/v1      true         Deployment
# daemonsets    ds           apps/v1      true         DaemonSet
# replicasets   rs           apps/v1      true         ReplicaSet
# statefulsets  sts          apps/v1      true         StatefulSet

# View the raw API discovery document
kubectl get --raw /apis | jq '.groups[].name'

# List all resources in the core group
kubectl api-resources --api-group=""

GVR vs GVK

Two fundamental concepts identify resources in Kubernetes — GVK (Group-Version-Kind) identifies a type of resource, while GVR (Group-Version-Resource) identifies the REST path to access it.

                            
Key Insight: GVK is what a resource is (its type in the type system). GVR is how you access it (its REST endpoint). The mapping between them is handled by a REST mapper. Kind is typically singular and PascalCase (Deployment), while Resource is plural and lowercase (deployments).
                        

# GVK example: apps/v1, Kind=Deployment
# GVR example: apps/v1, Resource=deployments

# The REST path for a namespaced resource follows this pattern:
# /apis/{group}/{version}/namespaces/{namespace}/{resource}/{name}
# /apis/apps/v1/namespaces/default/deployments/nginx

# For core group resources, the path omits /apis and the group:
# /api/v1/namespaces/default/pods/my-pod

# Access a specific deployment via raw API
kubectl get --raw /apis/apps/v1/namespaces/default/deployments/nginx | jq '.metadata.name'

# Subresources are accessed as nested paths:
# /apis/apps/v1/namespaces/default/deployments/nginx/status
# /apis/apps/v1/namespaces/default/deployments/nginx/scale
kubectl get --raw /apis/apps/v1/namespaces/default/deployments/nginx/scale | jq .
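The path rules above can be sketched in a few lines of Go. This is a minimal illustration using only the standard library; the `GVR` type and its `Path` method are names invented for this example and are not the client-go types.

```go
package main

import (
	"fmt"
	"strings"
)

// GVR identifies a REST resource: group, version, and plural resource name.
// Hypothetical type for illustration; client-go has its own equivalent.
type GVR struct {
	Group, Version, Resource string
}

// Path builds the REST path for an object. An empty group selects the legacy
// /api prefix; an empty namespace yields a cluster-scoped path; an empty name
// yields the collection path.
func (g GVR) Path(namespace, name string) string {
	parts := []string{}
	if g.Group == "" {
		parts = append(parts, "api", g.Version)
	} else {
		parts = append(parts, "apis", g.Group, g.Version)
	}
	if namespace != "" {
		parts = append(parts, "namespaces", namespace)
	}
	parts = append(parts, g.Resource)
	if name != "" {
		parts = append(parts, name)
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	deployments := GVR{Group: "apps", Version: "v1", Resource: "deployments"}
	pods := GVR{Version: "v1", Resource: "pods"}
	fmt.Println(deployments.Path("default", "nginx"))
	fmt.Println(pods.Path("default", "my-pod"))
}
```

Subresources such as `/status` or `/scale` would simply be appended as one more path segment after the name.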

Resource Versioning & Optimistic Concurrency

Every Kubernetes object carries a metadata.resourceVersion field — an opaque string (typically the etcd revision number) that changes on every update. This enables optimistic concurrency control: the API server rejects updates where the submitted resourceVersion doesn't match the current stored version.

                            
Conflict Resolution: When you receive a 409 Conflict response, it means another process modified the resource since you last read it. The correct pattern is: read → modify → write. If conflict occurs, re-read and retry. Never cache resourceVersion across long intervals.
                        

# Observe resourceVersion changing on updates
kubectl get pod nginx -o jsonpath='{.metadata.resourceVersion}'
# Output: 15234

kubectl label pod nginx env=prod
kubectl get pod nginx -o jsonpath='{.metadata.resourceVersion}'
# Output: 15235 (changed; the value is opaque and not guaranteed to increase by exactly 1)

# Demonstrate optimistic concurrency conflict
# Terminal 1: Get the resource
kubectl get deployment nginx -o yaml > deployment.yaml
# Terminal 2: Modify it (changes resourceVersion on server)
kubectl scale deployment nginx --replicas=5
# Terminal 1: Try to apply stale version
kubectl apply -f deployment.yaml
# Error: the object has been modified; please apply your changes to the latest version
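The read → modify → write → retry pattern can be sketched without a cluster. The Go program below is a simulation under stated assumptions: `store` is a hypothetical single-object stand-in for the API server, and `ResourceVersion` is modeled as an integer for brevity even though real clients must treat it as an opaque string.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict plays the role of the API server's 409 Conflict response.
var errConflict = errors.New("conflict: object has been modified")

// object is a stand-in for a Kubernetes object with only the fields needed here.
type object struct {
	ResourceVersion int // an int only for this sketch; really an opaque string
	Replicas        int
}

// store is a hypothetical single-object "server" that enforces the
// resourceVersion check on every update.
type store struct{ current object }

func (s *store) get() object { return s.current }

func (s *store) update(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict // the caller's copy is stale
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// updateWithRetry re-reads the object and re-applies mutate until the write
// succeeds, a non-conflict error occurs, or attempts run out.
func updateWithRetry(s *store, attempts int, mutate func(*object)) error {
	for i := 0; i < attempts; i++ {
		o := s.get() // always start from a fresh read
		mutate(&o)
		err := s.update(o)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return errConflict
}

func main() {
	s := &store{current: object{ResourceVersion: 1, Replicas: 1}}
	stale := s.get()
	_ = s.update(object{ResourceVersion: 1, Replicas: 5}) // a concurrent writer wins
	stale.Replicas = 3
	fmt.Println("stale write:", s.update(stale))
	fmt.Println("retry:", updateWithRetry(s, 3, func(o *object) { o.Replicas = 3 }))
	fmt.Println("final:", s.get())
}
```

The key design point is that the mutation is expressed as a function applied to a fresh read on every attempt, rather than as a pre-built object that is resubmitted.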

API Discovery & Subresources

The API server exposes discovery endpoints that allow clients to dynamically determine which resources and operations are available. This is how kubectl auto-completes resource types and how controllers discover CRDs at runtime.

Subresources are secondary endpoints on a resource that handle specific operations:

Subresource	Path Suffix	Purpose	Used By
`/status`	`…/pods/nginx/status`	Separate RBAC for spec vs status updates	Controllers updating status without touching spec
`/scale`	`…/deployments/nginx/scale`	Uniform scaling interface	HPA, kubectl scale
`/log`	`…/pods/nginx/log`	Stream container logs	kubectl logs
`/exec`	`…/pods/nginx/exec`	Execute commands in container	kubectl exec
`/portforward`	`…/pods/nginx/portforward`	Tunnel TCP connections	kubectl port-forward
`/eviction`	`…/pods/nginx/eviction`	Graceful pod eviction respecting PDBs	kubectl drain, descheduler

The Informer Pattern

Informers are the backbone of the Kubernetes controller ecosystem. They solve a critical problem: how do you keep track of cluster state without overwhelming the API server with repeated list requests?

SharedInformers

                            
Definition: A SharedInformer is a shared, in-process cache of Kubernetes objects that is kept in sync with the API server via a ListWatch. Multiple controllers can share the same informer instance, multiplexing event notifications without multiplying API server load.
                        

Without informers, every controller watching Deployments would independently list and watch the same resources — 10 controllers would mean 10x the API load. SharedInformers ensure a single watch connection per resource type per process, with events fanned out to all registered handlers.

Informer Architecture — From API Server to Controller

flowchart LR
    API[API Server] -->|List + Watch| R[Reflector]
    R -->|Objects| DF[DeltaFIFO Queue]
    DF -->|Pop| I[Informer]
    I -->|Store| IX[Indexer / Cache]
    I -->|Notify| EH1[Event Handler 1]
    I -->|Notify| EH2[Event Handler 2]
    I -->|Notify| EH3[Event Handler 3]
    EH1 -->|Enqueue Key| WQ[Work Queue]
    EH2 -->|Enqueue Key| WQ
    WQ -->|Dequeue| RC[Reconcile Loop]
    RC -->|Read| IX
    RC -->|Write| API

ListWatch Mechanism

The ListWatch mechanism is a two-phase synchronization protocol:

List — On startup, fetch all existing objects of the watched type. This populates the initial cache.
Watch — After listing, open a long-lived HTTP streaming connection. The API server pushes change events (ADDED, MODIFIED, DELETED) as they occur.

# Observe a watch stream directly (raw HTTP)
kubectl get --raw '/api/v1/namespaces/default/pods?watch=true&resourceVersion=0' &

# You'll see NDJSON events like:
# {"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"nginx",...},...}}
# {"type":"MODIFIED","object":{"kind":"Pod","metadata":{"name":"nginx",...},...}}
# {"type":"DELETED","object":{"kind":"Pod","metadata":{"name":"nginx",...},...}}

# The watch starts from the given resourceVersion
# If the version is too old, you get 410 Gone → triggers a full re-list
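What a client does with such a stream can be sketched as follows. This is a simplified illustration rather than the client-go Reflector: the `watchEvent` struct keeps only a few metadata fields, the cache maps `namespace/name` to the last seen `resourceVersion`, and the sample stream is made up for the example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// watchEvent mirrors only the parts of a watch event this sketch needs.
type watchEvent struct {
	Type   string `json:"type"`
	Object struct {
		Metadata struct {
			Namespace       string `json:"namespace"`
			Name            string `json:"name"`
			ResourceVersion string `json:"resourceVersion"`
		} `json:"metadata"`
	} `json:"object"`
}

// applyEvents folds newline-delimited watch events into cache and returns the
// last resourceVersion seen, which is where a reconnecting watch would resume.
func applyEvents(r io.Reader, cache map[string]string) (string, error) {
	lastRV := ""
	sc := bufio.NewScanner(r) // default 64 KiB line limit; large objects need a bigger buffer
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var ev watchEvent
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return lastRV, err
		}
		m := ev.Object.Metadata
		key := m.Namespace + "/" + m.Name
		switch ev.Type {
		case "ADDED", "MODIFIED":
			cache[key] = m.ResourceVersion
		case "DELETED":
			delete(cache, key)
		}
		// Other types (such as BOOKMARK) leave the cache alone and only advance lastRV.
		lastRV = m.ResourceVersion
	}
	return lastRV, sc.Err()
}

// applyStream is a convenience wrapper for in-memory input.
func applyStream(stream string, cache map[string]string) (string, error) {
	return applyEvents(strings.NewReader(stream), cache)
}

// sample is an invented stream in the same shape as the events shown above.
const sample = `{"type":"ADDED","object":{"metadata":{"namespace":"default","name":"nginx","resourceVersion":"10"}}}
{"type":"MODIFIED","object":{"metadata":{"namespace":"default","name":"nginx","resourceVersion":"12"}}}
{"type":"ADDED","object":{"metadata":{"namespace":"default","name":"web","resourceVersion":"13"}}}
{"type":"BOOKMARK","object":{"metadata":{"resourceVersion":"20"}}}
{"type":"DELETED","object":{"metadata":{"namespace":"default","name":"nginx","resourceVersion":"21"}}}`

func main() {
	cache := map[string]string{}
	rv, err := applyStream(sample, cache)
	fmt.Println(cache, rv, err)
}
```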

                            
Watch Bookmark: Watch connections can be interrupted by network issues, API server restarts, or timeouts. The informer handles this automatically by restarting the watch from the last resourceVersion it saw; if that version has already been compacted away, the server answers 410 Gone and the informer falls back to a full re-list. The BOOKMARK event type (opt-in via allowWatchBookmarks=true) lets the server periodically advance the client's resourceVersion without sending full objects, which makes a 410 Gone and the resulting re-list less likely after reconnection.
                        

Local Cache & Indexers

The Informer maintains a thread-safe local cache (the Indexer) that mirrors the API server's state. Controllers read from this cache instead of hitting the API server, providing microsecond-latency lookups for objects that would otherwise require network round-trips.

# The cache supports indexed lookups — common indexes include:
# - By namespace: cache.MetaNamespaceIndexFunc
# - By node name: custom indexer for pod-to-node mapping
# - By label: custom indexer for label-based queries

# Pseudocode showing cache interaction
# controller reads from cache (fast, no API call):
#   pod, exists, err := podIndexer.GetByKey("default/nginx")
#
# controller writes to API server (network call):
#   _, err := clientset.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{})
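To make the idea of an indexed cache concrete, here is a toy version written with the standard library only. It is a sketch, not the client-go Indexer: `indexedCache`, `upsert`, and `keysOnNode` are hypothetical names, and only one secondary index (pods by node name) is maintained.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// pod carries only the fields this sketch indexes on.
type pod struct {
	Namespace, Name, NodeName string
}

// indexedCache is a toy thread-safe store with one secondary index (by node).
type indexedCache struct {
	mu     sync.RWMutex
	items  map[string]pod             // "namespace/name" -> pod
	byNode map[string]map[string]bool // node name -> set of keys
}

func newIndexedCache() *indexedCache {
	return &indexedCache{items: map[string]pod{}, byNode: map[string]map[string]bool{}}
}

// upsert stores p and keeps the node index consistent if the pod moved.
func (c *indexedCache) upsert(p pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := p.Namespace + "/" + p.Name
	if old, ok := c.items[key]; ok {
		delete(c.byNode[old.NodeName], key) // drop the stale index entry
	}
	c.items[key] = p
	if c.byNode[p.NodeName] == nil {
		c.byNode[p.NodeName] = map[string]bool{}
	}
	c.byNode[p.NodeName][key] = true
}

// getByKey is the primary lookup, analogous to GetByKey("default/nginx").
func (c *indexedCache) getByKey(key string) (pod, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.items[key]
	return p, ok
}

// keysOnNode answers "which pods run on this node?" without scanning all items.
func (c *indexedCache) keysOnNode(node string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	keys := make([]string, 0, len(c.byNode[node]))
	for k := range c.byNode[node] {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	c := newIndexedCache()
	c.upsert(pod{Namespace: "default", Name: "nginx", NodeName: "node-1"})
	c.upsert(pod{Namespace: "default", Name: "web", NodeName: "node-1"})
	c.upsert(pod{Namespace: "default", Name: "nginx", NodeName: "node-2"}) // rescheduled
	fmt.Println(c.keysOnNode("node-1"), c.keysOnNode("node-2"))
}
```

The read lock on lookups is what lets many worker goroutines query the cache concurrently while a single writer applies watch events.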

Event Handlers & Resync

Event handlers are callbacks registered with the informer that fire when objects change. The three handler types map to watch event types:

# Go pseudocode — registering event handlers on a SharedInformer
# informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
#     AddFunc: func(obj interface{}) {
#         key, _ := cache.MetaNamespaceKeyFunc(obj)
#         workqueue.Add(key)  // enqueue "namespace/name"
#     },
#     UpdateFunc: func(oldObj, newObj interface{}) {
#         key, _ := cache.MetaNamespaceKeyFunc(newObj)
#         workqueue.Add(key)
#     },
#     DeleteFunc: func(obj interface{}) {
#         key, _ := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
#         workqueue.Add(key)
#     },
# })

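# Resync period: periodically re-delivers ALL objects in the cache to the handlers as updates
# Purpose: retry reconciliations that were dropped or failed silently, and correct drift in
#          external state. A resync replays the local cache; it does not re-list from the API server.
# Typical value: 30s to 10m (longer = less load, shorter = faster convergence)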
# informer.AddEventHandlerWithResyncPeriod(handler, 5*time.Minute)

                            
Why Informers Matter: Without informers, 50 controllers polling a cluster with 10,000 pods would issue 50 full list calls per sync interval and transfer 500,000 pod objects in total, whether or not anything changed. With SharedInformers, each process holds one watch connection per resource type, and the API server sends events only when something changes.
                        

Controller Architecture

Every Kubernetes controller follows the same architectural pattern: observe desired state (spec), compare with actual state, and take action to reconcile. This pattern is what makes Kubernetes self-healing and declarative.

The Generic Controller Pattern

Generic Controller Reconciliation Loop

flowchart TD
    INF[SharedInformer] -->|Event| EH[Event Handler]
    EH -->|"key: namespace/name"| WQ[Work Queue]
    WQ -->|Dequeue| W[Worker Goroutine]
    W --> R{Reconcile}
    R -->|Read desired state| CACHE[Informer Cache]
    R -->|Read actual state| EXT[External System / API]
    R -->|Diff| D{Desired == Actual?}
    D -->|Yes| DONE[Done — Requeue after interval]
    D -->|No| ACT[Take Action]
    ACT -->|Create/Update/Delete| API[API Server]
    ACT -->|Success| DONE
    ACT -->|Error| RETRY[Requeue with backoff]
    RETRY --> WQ

# Go pseudocode — the canonical controller reconciliation loop
#
# func (c *Controller) Run(workers int, stopCh <-chan struct{}) {
#     defer c.workqueue.ShutDown()
#
#     // Wait for informer caches to sync before processing
#     if !cache.WaitForCacheSync(stopCh, c.podsSynced, c.deploymentsSynced) {
#         return
#     }
#
#     // Launch worker goroutines
#     for i := 0; i < workers; i++ {
#         go wait.Until(c.runWorker, time.Second, stopCh)
#     }
#     <-stopCh
# }
#
# func (c *Controller) runWorker() {
#     for c.processNextWorkItem() {}
# }
#
# func (c *Controller) processNextWorkItem() bool {
#     key, shutdown := c.workqueue.Get()
#     if shutdown { return false }
#     defer c.workqueue.Done(key)
#
#     err := c.reconcile(key.(string))
#     if err == nil {
#         c.workqueue.Forget(key)  // success — reset retry count
#         return true
#     }
#
#     // Requeue with rate limiting (exponential backoff)
#     c.workqueue.AddRateLimited(key)
#     return true
# }

Work Queues & Rate Limiting

The work queue is the decoupling layer between event observation and reconciliation. It provides three critical guarantees:

Deduplication — If an object changes 10 times before it's processed, only one reconciliation occurs (using the latest state from cache)
Rate Limiting — Failed reconciliations are retried with exponential backoff (default: 5ms base, 1000s max)
Ordering — Items are processed in insertion order (FIFO), but re-queued items respect rate limits
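The deduplication guarantee comes from simple bookkeeping, sketched below. This is a single-goroutine illustration with hypothetical names (`dedupQueue`, `add`, `get`, `done`); a real work queue adds locking, blocking reads, and shutdown handling on top of the same idea.

```go
package main

import "fmt"

// dedupQueue tracks keys waiting to be processed ("dirty") separately from
// keys a worker currently owns ("processing"). No locking: sketch only.
type dedupQueue struct {
	order      []string
	dirty      map[string]bool
	processing map[string]bool
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// add enqueues key unless it is already pending.
func (q *dedupQueue) add(key string) {
	if q.dirty[key] {
		return // already pending: collapse into the existing entry
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // a worker owns it; done() will re-queue it afterwards
	}
	q.order = append(q.order, key)
}

// get hands the oldest pending key to a worker.
func (q *dedupQueue) get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.dirty, key)
	q.processing[key] = true
	return key, true
}

// done releases the key and re-queues it if it changed while being processed.
func (q *dedupQueue) done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

func main() {
	q := newDedupQueue()
	for i := 0; i < 10; i++ {
		q.add("default/nginx") // ten change events for the same object
	}
	q.add("default/web")
	fmt.Println("pending:", q.order)
	key, _ := q.get()
	q.add(key) // object changes again while a worker is reconciling it
	fmt.Println("pending while processing:", q.order)
	q.done(key)
	fmt.Println("pending after done:", q.order)
}
```

Because a key that is being processed is never handed out a second time, two workers cannot reconcile the same object concurrently.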

# Go pseudocode — work queue rate limiter configuration
#
# // Default rate limiter: exponential backoff + overall rate limit
# rateLimiter := workqueue.NewMaxOfRateLimiter(
#     // Per-item exponential backoff: 5ms base, doubles each retry, max 1000s
#     workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
#     // Overall rate: max 10 items/second, burst of 100
#     &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
# )
#
# queue := workqueue.NewRateLimitingQueue(rateLimiter)
#
# // After 5 consecutive failures for the same key:
# // Retry delays: 5ms → 10ms → 20ms → 40ms → 80ms (exponential)
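The delay sequence in the comment above follows from `base * 2^failures`, capped at the maximum. A small standalone sketch of that arithmetic, with `backoff` and `defaultBackoff` as illustrative names rather than library functions:

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns base * 2^failures, capped at limit. The cap is checked
// before each doubling so the duration can never overflow.
func backoff(base, limit time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		if d >= limit/2 {
			return limit
		}
		d *= 2
	}
	if d > limit {
		return limit
	}
	return d
}

// defaultBackoff uses the 5ms base and 1000s cap quoted in the text.
func defaultBackoff(failures int) time.Duration {
	return backoff(5*time.Millisecond, 1000*time.Second, failures)
}

func main() {
	for failures := 0; failures < 5; failures++ {
		fmt.Printf("retry %d after %v\n", failures+1, defaultBackoff(failures))
	}
	fmt.Println("after many failures:", defaultBackoff(40))
}
```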

Leader Election for HA Controllers

In production, controllers run with multiple replicas for high availability. But only one replica should actively reconcile at any time — this is achieved through leader election using a Kubernetes Lease object.

                            
How It Works: Replicas compete to acquire a Lease (older releases also supported ConfigMap and Endpoints locks). The winner becomes the leader and runs reconciliation loops. Other replicas stand by, watching the lease. If the leader fails to renew within the leaseDuration, a standby replica acquires leadership. Typical settings: lease duration 15s, renew deadline 10s, retry period 2s.
                        

# Leader election Lease object (created automatically by the framework)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-controller-leader
  namespace: kube-system
spec:
  holderIdentity: controller-pod-abc123    # Current leader's pod name
  leaseDurationSeconds: 15                  # How long lease is valid
  acquireTime: "2026-05-14T10:00:00Z"
  renewTime: "2026-05-14T10:00:12Z"         # Last renewal timestamp
  leaseTransitions: 3                       # Number of leader changes
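The decision each replica makes when it looks at the Lease can be sketched as a pure function. This is a simplified model under stated assumptions: the `lease` struct and `tryAcquireOrRenew` are hypothetical, clocks are assumed to agree, and the write back to the API server (which is itself protected by optimistic concurrency) is left out.

```go
package main

import (
	"fmt"
	"time"
)

// lease holds the subset of Lease spec fields that drive the decision.
type lease struct {
	Holder      string
	RenewTime   time.Time
	Duration    time.Duration
	Transitions int
}

// tryAcquireOrRenew returns the lease as it should be written back and
// whether candidate is the leader afterwards.
func tryAcquireOrRenew(l lease, candidate string, now time.Time) (lease, bool) {
	expired := now.After(l.RenewTime.Add(l.Duration))
	switch {
	case l.Holder == candidate:
		l.RenewTime = now // the current leader renews
		return l, true
	case l.Holder == "" || expired:
		l.Holder = candidate // take over a free or expired lease
		l.RenewTime = now
		l.Transitions++
		return l, true
	default:
		return l, false // someone else holds a valid lease: stay on standby
	}
}

// t0 and at are helpers for building timestamps relative to the YAML example.
var t0 = time.Date(2026, 5, 14, 10, 0, 0, 0, time.UTC)

func at(seconds int) time.Time { return t0.Add(time.Duration(seconds) * time.Second) }

// newLease builds a 15-second lease renewed at the given offset from t0.
func newLease(holder string, renewedAt int) lease {
	return lease{Holder: holder, RenewTime: at(renewedAt), Duration: 15 * time.Second}
}

func main() {
	l := newLease("controller-pod-abc123", 12)
	_, ok := tryAcquireOrRenew(l, "controller-pod-xyz789", at(20))
	fmt.Println("standby at +20s becomes leader:", ok)
	l, ok = tryAcquireOrRenew(l, "controller-pod-xyz789", at(30))
	fmt.Println("standby at +30s becomes leader:", ok, "holder:", l.Holder)
}
```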

Scheduler Internals

The kube-scheduler assigns pods to nodes through a pluggable framework. The scheduling decision happens in two phases: filtering (which nodes can run the pod) and scoring (which node is best).

The Scheduling Framework

Scheduler Plugin Pipeline

flowchart LR
    subgraph "Scheduling Cycle (per pod)"
        QS[QueueSort] --> PF[PreFilter]
        PF --> F[Filter]
        F --> PFS[PostFilter]
        PFS --> PS[PreScore]
        PS --> S[Score]
        S --> NS[NormalizeScore]
        NS --> R[Reserve]
        R --> PM[Permit]
    end
    subgraph "Binding Cycle (async)"
        PM --> PB[PreBind]
        PB --> B[Bind]
        B --> POB[PostBind]
    end

The scheduling framework defines extension points where plugins hook into the decision process. Each extension point runs a set of plugins that can filter nodes out, score remaining candidates, or perform binding operations. PostFilter plugins run only when Filter leaves no feasible node; this is where preemption is attempted.

Filter & Score Plugins

Plugin	Phase	Purpose
`NodeResourcesFit`	Filter + Score	Ensures the node has sufficient CPU/memory; scores by the configured strategy (LeastAllocated by default)
`NodeAffinity`	Filter + Score	Enforces requiredDuringScheduling and scores preferredDuringScheduling affinities
`PodTopologySpread`	Filter + Score	Distributes pods across failure domains (zones, nodes) based on topology constraints
`InterPodAffinity`	Filter + Score	Enforces pod affinity/anti-affinity rules (co-location and spreading)
`TaintToleration`	Filter + Score	Filters nodes with untolerated NoSchedule/NoExecute taints; scores higher for nodes with fewer untolerated PreferNoSchedule taints
`VolumeBinding`	Filter + Reserve	Checks PV availability in the node's zone; reserves PV bindings
`NodePorts`	Filter	Filters nodes where requested hostPorts are already in use
`ImageLocality`	Score	Prefers nodes that already have the required container images cached
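The two-phase decision (filter, then score) can be illustrated with a deliberately small model. The sketch below is not the scheduler's plugin API: it hard-codes one filter (resource fit) and one score (a least-allocated style percentage), and all type and function names are invented for the example. Node capacities are assumed to be greater than zero.

```go
package main

import "fmt"

// node describes capacity and currently free resources.
type node struct {
	Name                     string
	CapMilliCPU, CapMemMiB   int64
	FreeMilliCPU, FreeMemMiB int64
}

// request is what the pod asks for.
type request struct{ MilliCPU, MemMiB int64 }

// fits is the filter phase: can the node run the pod at all?
func fits(n node, r request) bool {
	return n.FreeMilliCPU >= r.MilliCPU && n.FreeMemMiB >= r.MemMiB
}

// score is the scoring phase: the percentage of capacity that would remain
// free after placement, averaged over CPU and memory (0..100).
func score(n node, r request) int64 {
	cpu := (n.FreeMilliCPU - r.MilliCPU) * 100 / n.CapMilliCPU
	mem := (n.FreeMemMiB - r.MemMiB) * 100 / n.CapMemMiB
	return (cpu + mem) / 2
}

// schedule returns the highest-scoring feasible node, or false if none fits.
func schedule(nodes []node, r request) (string, bool) {
	best, bestScore := "", int64(-1)
	for _, n := range nodes {
		if !fits(n, r) {
			continue // filtered out
		}
		if s := score(n, r); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best, bestScore >= 0
}

// exampleNodes returns made-up cluster state for the demo.
func exampleNodes() []node {
	return []node{
		{Name: "node-a", CapMilliCPU: 4000, CapMemMiB: 8192, FreeMilliCPU: 500, FreeMemMiB: 1024},
		{Name: "node-b", CapMilliCPU: 4000, CapMemMiB: 8192, FreeMilliCPU: 3000, FreeMemMiB: 6144},
		{Name: "node-c", CapMilliCPU: 8000, CapMemMiB: 16384, FreeMilliCPU: 2000, FreeMemMiB: 4096},
	}
}

func main() {
	name, ok := schedule(exampleNodes(), request{MilliCPU: 1000, MemMiB: 2048})
	fmt.Println(name, ok)
	name, ok = schedule(exampleNodes(), request{MilliCPU: 128000, MemMiB: 2048})
	fmt.Println(name, ok)
}
```

In the real framework each plugin contributes its own filter verdict or normalized score, and the per-plugin scores are combined with configurable weights.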

# Debug scheduler decisions for a pending pod
kubectl get events --field-selector reason=FailedScheduling -n default

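# Verbose scheduler output (scheduler must be started with -v=10)
# On self-managed clusters (e.g. kubeadm) the scheduler runs as a static pod; managed control planes usually expose these logs through the provider instead: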
kubectl logs -n kube-system -l component=kube-scheduler --tail=50

# View which scheduler profiles are active
kubectl get configmap -n kube-system kube-scheduler-config -o yaml 2>/dev/null || \
  echo "Scheduler uses default configuration (no custom ConfigMap)"

# Check which node the scheduler bound the pod to (the binding is recorded in spec.nodeName, not in annotations)
kubectl get pod nginx -o jsonpath='{.spec.nodeName}'

Preemption & Priority Classes

When no node can accommodate a pod, the scheduler can preempt (evict) lower-priority pods to make room. This is controlled by PriorityClasses — cluster-wide objects that assign integer priorities to pods.

# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-infrastructure
value: 1000000              # Higher number = higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority   # or "Never" to disable preemption
description: "For critical infrastructure components (monitoring, logging)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
globalDefault: false
preemptionPolicy: Never     # This class will never preempt others
description: "Low-priority batch jobs that can wait"
---
# Using a PriorityClass in a pod
apiVersion: v1
kind: Pod
metadata:
  name: critical-monitor
spec:
  priorityClassName: critical-infrastructure
  containers:
  - name: monitor
    image: prom/prometheus:latest
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"

                            
Preemption Warning: Preemption is disruptive: evicted pods are terminated with their grace period and must be rescheduled elsewhere. Set PodDisruptionBudgets on critical workloads, but note that the scheduler honors them only on a best-effort basis when choosing preemption victims. Use preemptionPolicy: Never for batch jobs that should wait rather than evict production workloads.
                        

etcd Deep Dive

etcd is Kubernetes' brain — a distributed key-value store that holds all cluster state. Understanding its internals is crucial for operating large clusters reliably.

Key Structure

Kubernetes stores objects in etcd under a well-defined key hierarchy. The default prefix is /registry/:

# etcd key structure (requires direct etcd access — NOT recommended in production)
# /registry/{resource-plural}/{namespace}/{name}
# /registry/pods/default/nginx
# /registry/deployments/kube-system/coredns
# /registry/namespaces/default
# /registry/clusterroles/admin        (cluster-scoped — no namespace segment)
# /registry/services/specs/default/kubernetes
# /registry/secrets/default/my-secret

# In production, access etcd via the API server — direct etcd access bypasses RBAC
# Only use etcdctl for disaster recovery or debugging:
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry/pods/default/nginx --print-value-only | \
  auger decode    # Decode protobuf to readable format

# List all keys under a prefix
ETCDCTL_API=3 etcdctl get /registry/ --prefix --keys-only | head -20

                            
Serialization Format: Kubernetes stores built-in objects in etcd using Protocol Buffers (protobuf) by default, not JSON; custom resources defined by CRDs are stored as JSON. Protobuf is ~2x smaller and ~5x faster to serialize/deserialize than JSON. The auger tool can decode these binary values for debugging. The API server handles conversion between the storage encoding and the JSON/YAML that clients see.
                        

Compaction & Defragmentation

etcd is an MVCC (Multi-Version Concurrency Control) store — it keeps a history of all key revisions. Without periodic maintenance, this history grows unbounded and degrades performance.

# Check etcd database size and alarm status
ETCDCTL_API=3 etcdctl endpoint status --write-out=table
# +-------------------+--------+---------+---------+-----------+...
# |     ENDPOINT      |   ID   | VERSION | DB SIZE | IS LEADER |
# +-------------------+--------+---------+---------+-----------+...
# | https://...:2379  | abc123 | 3.5.12  | 4.2 GB  |   true    |

# Compact old revisions (keep only latest 10000 revisions)
REVISION=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compact $((REVISION - 10000))

# Defragment to reclaim disk space (compaction marks revisions as deleted but doesn't free space)
ETCDCTL_API=3 etcdctl defrag --endpoints=https://127.0.0.1:2379

# Check for NOSPACE alarm (etcd stops accepting writes at quota limit)
ETCDCTL_API=3 etcdctl alarm list

# Clear NOSPACE alarm after resolving (compact + defrag)
ETCDCTL_API=3 etcdctl alarm disarm

                            
NOSPACE Alarm: When etcd's database exceeds its quota (default 2GB, recommended max 8GB), it raises a NOSPACE alarm and rejects all write operations. This makes the cluster effectively read-only: no pod scheduling, no deployments, no scaling. Monitor etcd DB size and set up alerts at 80% capacity. Recovery: compact + defrag + alarm disarm.
                        

Performance Tuning

Parameter	Default	Recommended	Impact
`--quota-backend-bytes`	2 GB	8 GB (max recommended)	Maximum DB size before NOSPACE alarm
`--auto-compaction-mode`	periodic	revision	"revision" keeps last N revisions; "periodic" compacts every M hours
`--auto-compaction-retention`	0 (disabled)	10000 (revisions) or "5m" (periodic)	How much history to keep
`--snapshot-count`	100000	10000 for faster recovery	Transactions between snapshots (WAL compaction trigger)
Disk type	—	SSD with fsync < 10ms	etcd is fsync-heavy; spinning disks cause leader election storms
Network latency	—	< 10ms between peers	High latency causes missed heartbeats, frequent leader elections, and slow or timed-out writes

Garbage Collection & Finalizers

When you delete a Deployment, its ReplicaSets and Pods are cleaned up automatically. This isn't magic — it's the garbage collector following owner references through the object graph.

Owner References

Owner references create a directed acyclic graph (DAG) of object ownership. When a parent is deleted, the garbage collector identifies and removes all dependents.

Garbage Collection — Owner Reference Chain

flowchart TD
    D[Deployment<br/>nginx] -->|owns| RS[ReplicaSet<br/>nginx-7d4f8b]
    RS -->|owns| P1[Pod<br/>nginx-7d4f8b-abc]
    RS -->|owns| P2[Pod<br/>nginx-7d4f8b-def]
    RS -->|owns| P3[Pod<br/>nginx-7d4f8b-ghi]
    D -->|delete| GC{Garbage Collector}
    GC -->|"cascade: background"| RS
    GC -->|"cascade: foreground"| P1
    GC -->|"cascade: foreground"| P2
    GC -->|"cascade: foreground"| P3

# Owner reference as seen on a ReplicaSet owned by a Deployment
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-7d4f8b6c9f
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
    uid: a1b2c3d4-e5f6-7890-abcd-ef1234567890
    controller: true          # This is THE managing controller
    blockOwnerDeletion: true  # Participates in foreground cascading delete

# Inspect owner references on a pod
kubectl get pod nginx-7d4f8b6c9f-abc12 -o jsonpath='{.metadata.ownerReferences}' | jq .

# Delete with explicit cascade policy
kubectl delete deployment nginx --cascade=foreground   # Wait for all dependents to be deleted first
kubectl delete deployment nginx --cascade=background   # Delete parent immediately, GC cleans up async (default)
kubectl delete deployment nginx --cascade=orphan       # Delete parent only, leave dependents running

Cascading Deletion Modes

Mode	Behavior	Use Case
Background (default)	Parent is deleted immediately. GC deletes dependents asynchronously.	Normal operations — fast deletion, eventual cleanup
Foreground	Parent enters "deletion in progress" state. GC deletes all dependents first, then removes parent.	When you need to guarantee cleanup before proceeding (migrations, replacements)
Orphan	Parent is deleted. Dependents are left running with ownerReferences cleared.	Replacing a Deployment while keeping existing pods running
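All three modes depend on the same step: finding every object that transitively points at the deleted owner. A minimal sketch of that traversal follows, assuming each object has at most one owner (the real garbage collector deletes a dependent only once all of its owners are gone). The `obj` type and `dependentsOf` function are illustrative names.

```go
package main

import "fmt"

// obj is a stand-in for object metadata: a UID, a name, and owner UIDs
// taken from metadata.ownerReferences.
type obj struct {
	UID    string
	Name   string
	Owners []string
}

// dependentsOf returns the names of all objects reachable from rootUID by
// following owner references backwards, in breadth-first order.
func dependentsOf(rootUID string, all []obj) []string {
	// Invert the owner references: owner UID -> direct dependents.
	children := map[string][]obj{}
	for _, o := range all {
		for _, owner := range o.Owners {
			children[owner] = append(children[owner], o)
		}
	}
	var names []string
	seen := map[string]bool{rootUID: true}
	queue := []string{rootUID}
	for len(queue) > 0 {
		uid := queue[0]
		queue = queue[1:]
		for _, c := range children[uid] {
			if seen[c.UID] {
				continue
			}
			seen[c.UID] = true
			names = append(names, c.Name)
			queue = append(queue, c.UID)
		}
	}
	return names
}

// exampleObjects models the Deployment -> ReplicaSet -> Pods chain from the
// diagram, plus one unrelated pod. UIDs are shortened placeholders.
func exampleObjects() []obj {
	return []obj{
		{UID: "d1", Name: "nginx"},
		{UID: "r1", Name: "nginx-7d4f8b", Owners: []string{"d1"}},
		{UID: "p1", Name: "nginx-7d4f8b-abc", Owners: []string{"r1"}},
		{UID: "p2", Name: "nginx-7d4f8b-def", Owners: []string{"r1"}},
		{UID: "p9", Name: "unrelated-pod"},
	}
}

func main() {
	fmt.Println(dependentsOf("d1", exampleObjects()))
}
```

Background deletion would remove the owner first and then this list; foreground deletion would process the list before the owner; orphaning would clear the owner references instead of deleting.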

Finalizers

Finalizers are metadata strings that block object deletion until a controller removes them. They provide a hook for cleanup logic that must complete before an object can be garbage collected.

                            
How Finalizers Work: When you delete an object with finalizers, Kubernetes sets metadata.deletionTimestamp but does NOT remove the object from etcd. The object remains "terminating" until all finalizers are removed. A controller watches for the deletionTimestamp, performs cleanup (external resources, DNS records, cloud load balancers), then removes its finalizer. Once all finalizers are cleared, the API server removes the object from etcd.
                        

# Common finalizer patterns in Kubernetes
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
spec:
  finalizers:
  - kubernetes    # Ensures all resources in the namespace are deleted first

---
# Custom finalizer on a CRD instance
apiVersion: databases.example.com/v1
kind: PostgresCluster
metadata:
  name: prod-db
  finalizers:
  - databases.example.com/cleanup    # Controller deletes cloud DB before allowing object removal
spec:
  instances: 3
  storage:
    size: 100Gi

# Debug stuck namespace deletion (common issue — stuck finalizers)
kubectl get namespace stuck-ns -o json | jq '.spec.finalizers'
# Output: ["kubernetes"]

# Check which resources still exist in the namespace
kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -n 1 kubectl get --show-kind --ignore-not-found -n stuck-ns

# DANGEROUS: Force-remove finalizer (only when cleanup is confirmed done)
kubectl get namespace stuck-ns -o json | \
  jq '.spec.finalizers = []' | \
  kubectl replace --raw "/api/v1/namespaces/stuck-ns/finalize" -f -
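On the controller side, finalizer handling usually reduces to one branch in the reconcile function. The sketch below models it without any Kubernetes dependency: `resource` is a hypothetical stand-in for object metadata, `Deleting` stands in for a non-nil `deletionTimestamp`, and the returned boolean only indicates that no finalizers remain (at which point the API server, not the controller, removes the object).

```go
package main

import (
	"errors"
	"fmt"
)

// cleanupFinalizer reuses the example finalizer name from the YAML above.
const cleanupFinalizer = "databases.example.com/cleanup"

// errCleanup is a sample failure used in the demo.
var errCleanup = errors.New("cloud API unavailable")

// resource is a stand-in for the metadata a controller inspects.
type resource struct {
	Name       string
	Deleting   bool // stands in for metadata.deletionTimestamp != nil
	Finalizers []string
}

func has(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// remove returns a copy of list without s, leaving the input untouched.
func remove(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}

// reconcile returns the updated resource and whether all finalizers are gone.
func reconcile(r resource, cleanup func(name string) error) (resource, bool, error) {
	if !r.Deleting {
		// Live object: make sure our finalizer is present before doing any work.
		if !has(r.Finalizers, cleanupFinalizer) {
			r.Finalizers = append(r.Finalizers, cleanupFinalizer)
		}
		return r, false, nil
	}
	if has(r.Finalizers, cleanupFinalizer) {
		if err := cleanup(r.Name); err != nil {
			return r, false, err // keep the finalizer so the cleanup is retried
		}
		r.Finalizers = remove(r.Finalizers, cleanupFinalizer)
	}
	return r, len(r.Finalizers) == 0, nil
}

func main() {
	db, _, _ := reconcile(resource{Name: "prod-db"}, nil) // cleanup is not called for live objects
	fmt.Println("live:", db.Finalizers)
	db.Deleting = true
	_, _, err := reconcile(db, func(string) error { return errCleanup })
	fmt.Println("cleanup failed, still blocked:", err)
	db, gone, _ := reconcile(db, func(string) error { return nil })
	fmt.Println("cleanup ok, removable:", gone, db.Finalizers)
}
```

Cleanup functions should be idempotent, since a failure after partial cleanup leads to the same function being called again on the next retry.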

Admission Controllers

Admission controllers intercept API requests after authentication and authorization but before persistence to etcd. They can validate requests (rejecting invalid ones) or mutate them (injecting defaults, sidecars, labels).

Mutating & Validating Webhooks

The admission pipeline processes webhooks in two phases: mutating webhooks run first (and can modify the object), then validating webhooks run on the final mutated object (and can only accept or reject).

Type	Can Modify Object?	Ordering	Common Uses
MutatingAdmissionWebhook	Yes (via JSON patch)	Runs first; webhooks are called serially, and their relative order should not be relied on	Sidecar injection (Istio), default resource limits, label injection
ValidatingAdmissionWebhook	No (accept/reject only)	Runs after all mutating webhooks	Policy enforcement (OPA/Gatekeeper), image allowlists, naming conventions

Built-in Admission Controllers

Controller	Type	What It Does
`NamespaceLifecycle`	Validating	Prevents creating objects in terminating/non-existent namespaces
`LimitRanger`	Mutating + Validating	Injects default resource requests/limits, rejects pods exceeding limits
`ResourceQuota`	Validating	Rejects requests that would exceed namespace resource quotas
`PodSecurity`	Validating	Enforces Pod Security Standards (restricted, baseline, privileged)
`ServiceAccount`	Mutating + Validating	Assigns the default ServiceAccount when none is set and injects the token volume mount
`DefaultStorageClass`	Mutating	Adds default StorageClass to PVCs without one
`MutatingAdmissionWebhook`	Mutating	Calls external mutating webhook services
`ValidatingAdmissionWebhook`	Validating	Calls external validating webhook services

Writing Custom Admission Webhooks

A custom webhook is a web server that receives AdmissionReview requests from the API server and returns an AdmissionReview whose response field carries the verdict. Here's the configuration that tells the API server when to invoke your webhook:

# ValidatingWebhookConfiguration — reject pods without resource limits
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: require-resource-limits
webhooks:
- name: require-limits.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  timeoutSeconds: 5
  failurePolicy: Fail            # Reject request if webhook is unavailable
  matchPolicy: Equivalent
  clientConfig:
    service:
      name: admission-webhook
      namespace: webhook-system
      path: /validate-resource-limits
    caBundle: LS0tLS1CRUdJTi...   # Base64-encoded CA cert
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  namespaceSelector:
    matchExpressions:
    - key: enforce-limits
      operator: In
      values: ["true"]           # Only applies to namespaces with this label
---
# MutatingWebhookConfiguration — inject sidecar container
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
- name: sidecar.inject.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  reinvocationPolicy: IfNeeded    # Re-invoke if another mutating webhook changed the object
  timeoutSeconds: 10
  failurePolicy: Ignore           # Don't block pod creation if webhook is down
  clientConfig:
    service:
      name: sidecar-injector
      namespace: injection-system
      path: /inject
    caBundle: LS0tLS1CRUdJTi...
  rules:
  - operations: ["CREATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  objectSelector:
    matchLabels:
      inject-sidecar: "true"      # Only pods with this label
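The server behind the `/validate-resource-limits` path could look roughly like the sketch below. It uses only the Go standard library and hand-written structs that cover a small subset of the AdmissionReview fields; a production webhook would normally use the `k8s.io/api/admission/v1` types and must serve HTTPS with a certificate the `caBundle` trusts. The function names (`review`, `handler`, `allowed`) are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// Trimmed-down AdmissionReview types: only the fields this sketch reads or writes.
type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID    string          `json:"uid"`
	Object json.RawMessage `json:"object"`
}

type admissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Status  *status `json:"status,omitempty"`
}

type status struct {
	Message string `json:"message"`
}

// podView picks out the container resource limits from a Pod.
type podView struct {
	Spec struct {
		Containers []struct {
			Name      string `json:"name"`
			Resources struct {
				Limits map[string]interface{} `json:"limits"`
			} `json:"resources"`
		} `json:"containers"`
	} `json:"spec"`
}

// review denies any pod that has a container without resource limits.
func review(body []byte) ([]byte, error) {
	var in admissionReview
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, err
	}
	if in.Request == nil {
		return nil, errors.New("AdmissionReview has no request")
	}
	var pod podView
	if err := json.Unmarshal(in.Request.Object, &pod); err != nil {
		return nil, err
	}
	resp := &admissionResponse{UID: in.Request.UID, Allowed: true} // the UID must be echoed back
	for _, c := range pod.Spec.Containers {
		if len(c.Resources.Limits) == 0 {
			resp.Allowed = false
			resp.Status = &status{Message: "container " + c.Name + " has no resource limits"}
			break
		}
	}
	return json.Marshal(admissionReview{APIVersion: "admission.k8s.io/v1", Kind: "AdmissionReview", Response: resp})
}

// handler adapts review to net/http; register it on a TLS server to use it.
func handler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err == nil {
		body, err = review(body)
	}
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_, _ = w.Write(body)
}

// allowed runs review and extracts the verdict, for demos and tests.
func allowed(reviewJSON string) (bool, error) {
	out, err := review([]byte(reviewJSON))
	if err != nil {
		return false, err
	}
	var ar admissionReview
	if err := json.Unmarshal(out, &ar); err != nil {
		return false, err
	}
	return ar.Response.Allowed, nil
}

func main() {
	// Serving would be: http.HandleFunc("/validate-resource-limits", handler)
	// followed by http.ListenAndServeTLS with real certificate files.
	ok, err := allowed(`{"request":{"uid":"1","object":{"spec":{"containers":[{"name":"app"}]}}}}`)
	fmt.Println("pod without limits allowed:", ok, err)
}
```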

                            
Webhook Availability: If your webhook is unavailable and failurePolicy: Fail is set, ALL matching API requests will be rejected, including system-critical pods. Always use failurePolicy: Ignore during initial rollout and narrow your namespaceSelector to exclude kube-system. Set short timeoutSeconds (3-5s) to avoid blocking the API server.
                        

# Debug admission webhook issues
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Check if a webhook is rejecting requests
kubectl get events --field-selector reason=FailedCreate
# Warning  FailedCreate  admission webhook "require-limits.example.com" denied the request: ...

# Check the webhook server's own logs to confirm the API server's requests are arriving
kubectl logs -n webhook-system deploy/admission-webhook --tail=20

# Temporarily disable a misbehaving webhook
kubectl delete validatingwebhookconfiguration require-resource-limits

Exercises

Exercise 1 API Exploration

Explore the Kubernetes API Surface

Using only kubectl get --raw and jq, complete the following tasks:

List all API groups and their preferred versions
Find the GVR for CronJobs and construct the full REST path to list CronJobs in the production namespace
Retrieve the /scale subresource of a Deployment and modify its replica count via kubectl replace --raw
Demonstrate optimistic concurrency by creating a conflict — two concurrent writes to the same ConfigMap

API groups GVR resourceVersion subresources

Exercise 2 Controller Simulation

Build a Manual Reconciliation Loop

Simulate the controller pattern using shell scripting:

Write a bash script that watches for ConfigMap changes using kubectl get --raw ...?watch=true
On each ADDED/MODIFIED event, extract the ConfigMap's data and write it to a local file
Implement a simple exponential backoff: if writing fails, wait 1s, 2s, 4s, 8s before retrying
Add deduplication: if the same key appears multiple times in the queue before processing, skip duplicates

watch reconciliation backoff idempotency
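
The backoff step can be sketched as a generic retry wrapper. This is one possible shape, not the only one: the function name `retry_with_backoff` and the `SLEEP_CMD` override (which lets the delays be stubbed out while testing) are hypothetical. It makes one initial attempt and then retries after waiting 1s, 2s, 4s, and 8s, giving up after the fifth failure.

```shell
# Run a command, retrying on failure after 1s, 2s, 4s, and 8s.
# SLEEP_CMD is a hypothetical override so tests can skip the real delays.
retry_with_backoff() {
  local delay=1 max_delay=8
  local sleep_cmd="${SLEEP_CMD:-sleep}"
  while true; do
    if "$@"; then
      return 0
    fi
    # All four waits (1, 2, 4, 8) have been used: stop retrying.
    if [ "$delay" -gt "$max_delay" ]; then
      echo "giving up: $*" >&2
      return 1
    fi
    echo "failed, retrying in ${delay}s: $*" >&2
    "$sleep_cmd" "$delay"
    delay=$((delay * 2))
  done
}
```

In the exercise, the wrapped command would be whatever writes the ConfigMap data to disk, for example `retry_with_backoff cp "$tmpfile" "$outfile"`. Because the write may run more than once, it should be idempotent.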

Exercise 3 Scheduler Debugging

Diagnose and Fix Scheduling Failures

Create scenarios that demonstrate each scheduler failure mode:

Create a pod requesting 128 CPUs — diagnose the FailedScheduling event and identify which filter rejected it
Create two PriorityClasses (high=1000, low=100) and demonstrate preemption: fill a node with low-priority pods, then submit a high-priority pod
Configure a custom scheduler profile that disables the ImageLocality scoring plugin and verify the behavior difference

scheduling preemption priority plugins

Exercise 4 Finalizer & Webhook Lab

Implement Custom Garbage Collection and Admission Control

Combine finalizers with admission webhooks:

Create a ConfigMap with a custom finalizer. Write a controller (bash loop) that watches for deletionTimestamp, performs "cleanup" (logs a message), then removes the finalizer
Deploy a simple validating webhook (in a language of your choice) that rejects pods whose containers use the latest image tag
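Demonstrate the orphan cascade policy: delete a Deployment with --cascade=orphan, then verify that its ReplicaSet and pods keep running and that the ReplicaSet no longer carries an owner reference to the Deployment (the pods remain owned by the ReplicaSet)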

finalizers webhooks garbage collection cascade

Conclusion

Understanding Kubernetes internals (API machinery, informers, controllers, the scheduler, etcd, garbage collection, and admission controllers) replaces the mental model of "kubectl does things" with a precise picture of the mechanisms behind every operation. This knowledge is essential for:

Debugging complex failures — stuck deletions (finalizers), scheduling loops (plugin conflicts), stale state (informer cache lag)
Building operators — implementing the informer→queue→reconcile pattern correctly
Performance tuning — etcd compaction, informer resync intervals, webhook timeouts
Operating at scale — understanding why clusters with 5,000+ nodes need careful etcd tuning and rate limiting

Next in the Series

In Part 12: CRDs & Operators, we'll apply everything from this article to build custom resources and controllers — extending Kubernetes with your own domain-specific APIs using the Operator pattern.



