
Phase 16: Performance & Optimization

February 6, 2026 · Wasil Zafar · 30 min read

Make your OS fast with caching strategies, scheduler optimization, memory access patterns, and profiling tools.

Table of Contents

  1. Introduction
  2. Caching Strategies
  3. Scheduler Tuning
  4. Memory Optimization
  5. Profiling Tools
  6. What You Can Build
  7. Next Steps

Introduction: Performance Matters

Phase 16 Goals: By the end of this phase, your OS will have disk and page caching, an optimized scheduler, efficient memory allocation, and profiling tools to measure and improve performance.

We've built a functional operating system over 15 phases. But "works" isn't the same as "works well". An OS that freezes when copying files or stutters during video playback isn't usable. Performance optimization is what transforms a toy OS into something production-ready.

╔═════════════════════════════════════════════════════════════════════════════╗
                    MEMORY & STORAGE LATENCY HIERARCHY                        
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Access Time          Component              Typical Size               
  ─────────────          ───────────────          ────────────────       
                                                                             
  ~1 ns               L1 Cache                32-64 KB                   
  ~4 ns               L2 Cache                256 KB - 1 MB              
  ~10 ns              L3 Cache                8-64 MB                    
  ~100 ns             RAM (DDR4/5)            8-128 GB                   
  ~10,000 ns          NVMe SSD                256 GB - 8 TB              
  ~10,000,000 ns      HDD (Spinning)          1-20 TB                    
                                                                             
  HDD is 10,000,000x slower than L1 cache!                                  
  Caching transforms disk access from unusable to instant.                 
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
Key Insight: Making an OS fast requires understanding where time is spent. Caching reduces disk latency, scheduler tuning improves responsiveness, and memory optimizations reduce allocation overhead.

Performance Mindset

Before optimizing, you need to measure. Donald Knuth famously said: "Premature optimization is the root of all evil." The corollary is: profile first, optimize second.

  Principle          Meaning                              Example
  ─────────          ───────                              ───────
  Measure First      Profile before optimizing            Find that 90% of time is in disk I/O, not CPU
  Amdahl's Law       Speedup limited by serial portions   10% serial code limits speedup to 10x max
  Locality           Access nearby data together          Array of structs vs struct of arrays
  Avoid Copies       Zero-copy where possible             Pass pointers, use DMA for disk I/O
  Batch Operations   Amortize overhead over many items    Write 100 sectors at once, not 1 at a time
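Amdahl's law from the table is worth making concrete. Here is a minimal sketch in integer math (the function name and fixed-point scaling are illustrative, not from any real kernel):

```c
#include <assert.h>

/* Amdahl's law: speedup = 1 / (serial + (1 - serial) / n).
   Integer version: serial fraction given in percent, result in
   hundredths of x (e.g. 400 = 4.00x) to avoid floating point. */
static unsigned long amdahl_speedup_x100(unsigned serial_pct, unsigned long n) {
    unsigned long serial   = serial_pct * 100;      // parts-per-10000
    unsigned long parallel = 10000 - serial;
    unsigned long denom    = serial + parallel / n; // parts-per-10000
    return (10000UL * 100) / denom;                 // hundredths of x
}
```

Even with a million cores, 10% serial code caps the speedup at 10.00x; a perfectly parallel workload on 4 cores gets the full 4.00x.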

Common Bottlenecks

In an OS, the major performance bottlenecks are:

╔═════════════════════════════════════════════════════════════════════════════╗
                     OS PERFORMANCE BOTTLENECKS                               
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    1. DISK I/O ─ Uncached disk reads are killers (~10ms each!)          
       → Solution: Block cache, read-ahead, async I/O                    
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    2. CONTEXT SWITCHES ─ Task switching overhead adds up                
       → Solution: Smart scheduler, longer time slices where OK          
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    3. MEMORY ALLOCATION ─ malloc/free are surprisingly slow            
       → Solution: Slab allocator, object pools                          
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    4. TLB MISSES ─ Page table walks are expensive (~100 cycles)       
       → Solution: Huge pages, minimize mappings, locality               
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    5. LOCK CONTENTION ─ Threads waiting on each other                  
       → Solution: Lock-free structures, finer-grained locks             
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝

Caching Strategies

Caching is the single most important performance optimization in an OS. By keeping frequently accessed data in memory, we avoid slow disk operations. A good cache can make your OS feel 100x faster!

╔═════════════════════════════════════════════════════════════════════════════╗
                       OS CACHING LAYERS                                     
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Application
       │ read("/home/user/file.txt")
       ▼
  ┌─────────────────────────────┐
  │       Page Cache            │  File data cached as memory pages
  │   (virtual file pages)      │  Serves most read() calls from RAM
  └─────────────┬───────────────┘
                │ Page miss → lookup block number
                ▼
  ┌─────────────────────────────┐
  │       Block Cache           │  Raw disk sectors cached in RAM
  │   (disk sectors/blocks)     │  Indexed by (device, block_num)
  └─────────────┬───────────────┘
                │ Block miss → actual disk I/O
                ▼
  ┌─────────────────────────────┐
  │       Disk Driver           │  AHCI/NVMe actually reads from disk
  │  (SLOW! Avoid at all cost)  │  HDD: ~10ms, SSD: ~0.1ms
  └─────────────────────────────┘
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝

Block Cache

The block cache stores recently accessed disk sectors in memory. Before reading from disk, we check the cache—a hit returns instantly (nanoseconds vs. milliseconds).

/* Block Cache Entry */
typedef struct cache_entry {
    uint64_t block_num;         // Disk block number
    uint8_t* data;              // Block data (512/4096 bytes)
    uint32_t flags;             // Dirty, valid, etc.
    uint64_t access_time;       // For LRU eviction
    struct cache_entry* next;   // Hash chain
    struct cache_entry* lru_prev;
    struct cache_entry* lru_next;
} cache_entry_t;

/* Block Cache */
#define CACHE_SIZE 1024  // Number of cached blocks
#define HASH_SIZE  256

typedef struct {
    cache_entry_t* hash[HASH_SIZE];
    cache_entry_t* lru_head;
    cache_entry_t* lru_tail;
    uint32_t hits;
    uint32_t misses;
} block_cache_t;

/* Get cached block or read from disk */
cache_entry_t* cache_get(block_cache_t* cache, uint64_t block) {
    uint32_t hash = block % HASH_SIZE;
    
    // Search hash chain
    for (cache_entry_t* e = cache->hash[hash]; e; e = e->next) {
        if (e->block_num == block) {
            cache->hits++;
            lru_touch(cache, e);  // Move to front
            return e;
        }
    }
    
    // Cache miss - evict an LRU victim (or allocate) and read from disk
    cache->misses++;
    cache_entry_t* entry = lru_evict_or_alloc(cache);
    entry->block_num = block;
    entry->lru_prev = entry->lru_next = NULL;  // Unlinked; lru_touch re-inserts at head
    disk_read(block, entry->data);
    
    // Insert into hash chain and LRU list
    entry->next = cache->hash[hash];
    cache->hash[hash] = entry;
    lru_touch(cache, entry);
    
    return entry;
}
Cache Hit Rate: Track hits / (hits + misses). A good block cache achieves 90%+ hit rate for typical workloads. Even 90% means only 10% of accesses hit disk!
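The impact of hit rate can be quantified with the standard effective-access-time formula: latency = hit_rate × t_hit + miss_rate × t_miss. A tiny illustrative helper (integer math, names are mine):

```c
#include <assert.h>

/* Effective access time in nanoseconds, given a hit rate in percent,
   the cache-hit latency, and the miss (disk) latency. */
static unsigned long effective_latency_ns(unsigned hit_pct,
                                          unsigned long hit_ns,
                                          unsigned long miss_ns) {
    return (hit_pct * hit_ns + (100UL - hit_pct) * miss_ns) / 100;
}
```

With 100ns hits and 10ms (10,000,000ns) HDD misses, a 90% hit rate gives ~1ms average latency, while 99% gives ~0.1ms. Going from 90% to 99% is a further 10x speedup, which is why cache sizing and eviction policy matter so much.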

Page Cache

While the block cache stores raw sectors, the page cache stores file data as memory pages. This integrates with virtual memory—a memory-mapped file is just pages in the page cache!

/* Page Cache - Indexed by (inode, offset) */
typedef struct page_cache_entry {
    struct inode* inode;        // File inode
    uint64_t offset;            // Page offset within file
    void* page;                 // 4KB page of file data
    uint32_t flags;             // PG_DIRTY, PG_UPTODATE, etc.
    atomic_t ref_count;         // Reference count
    struct list_head lru;       // LRU list linkage
    struct hlist_node hash;     // Hash table linkage
} page_cache_entry_t;

/* Find or create page in cache */
page_cache_entry_t* page_cache_lookup(struct inode* inode, uint64_t offset) {
    uint64_t page_offset = offset & ~0xFFFULL;  // Align to 4KB page
    uint32_t hash = hash_page(inode, page_offset);
    
    // Search hash bucket
    spin_lock(&page_cache_lock);
    page_cache_entry_t* entry;
    hlist_for_each_entry(entry, &page_hash[hash], hash) {
        if (entry->inode == inode && entry->offset == page_offset) {
            atomic_inc(&entry->ref_count);
            lru_touch(entry);  // Move to front of LRU
            spin_unlock(&page_cache_lock);
            return entry;
        }
    }
    spin_unlock(&page_cache_lock);
    
    // Cache miss - allocate new page and read from disk
    entry = alloc_page_cache_entry();
    entry->inode = inode;
    entry->offset = page_offset;
    entry->page = alloc_page();
    
    // Read file data into page (uses block cache internally)
    inode->i_op->readpage(inode, entry->page, page_offset);
    
    // Insert into cache
    spin_lock(&page_cache_lock);
    hlist_add_head(&entry->hash, &page_hash[hash]);
    list_add(&entry->lru, &page_lru_list);
    spin_unlock(&page_cache_lock);
    
    return entry;
}

/* Memory-mapped file read - just returns cached page */
void* mmap_read(struct file* file, uint64_t offset) {
    page_cache_entry_t* entry = page_cache_lookup(file->f_inode, offset);
    return (uint8_t*)entry->page + (offset & 0xFFF);  // Pointer within the cached page
}

LRU Eviction

When the cache is full, we need to evict old entries. LRU (Least Recently Used) is the classic algorithm—evict the entry that hasn't been accessed for the longest time.

╔═════════════════════════════════════════════════════════════════════════════╗
                     LRU CACHE OPERATION                                      
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Doubly-Linked List (head = most recent, tail = least recent)            
                                                                             
  HEAD                                                         TAIL        
    │                                                           │          
    ▼                                                           ▼          
  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
  │Blk 5│───│Blk 2│───│Blk 9│───│Blk 1│───│Blk 7│───│Blk 3│
  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘   └─────┘
  newest                                              oldest
                                                                             
  Operation: Access Block 1                                                
  ──────────────────────────                                                   
  1. Remove Block 1 from current position                                    
  2. Insert Block 1 at HEAD                                                   
                                                                             
  HEAD                                                         TAIL        
    │                                                           │          
    ▼                                                           ▼          
  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
  │Blk 1│───│Blk 5│───│Blk 2│───│Blk 9│───│Blk 7│───│Blk 3│
  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘   └─────┘
  ↑ moved                                             evict me
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
/* LRU List Operations - O(1) */
void lru_touch(block_cache_t* cache, cache_entry_t* entry) {
    // Remove from current position
    if (entry->lru_prev) entry->lru_prev->lru_next = entry->lru_next;
    if (entry->lru_next) entry->lru_next->lru_prev = entry->lru_prev;
    if (cache->lru_tail == entry) cache->lru_tail = entry->lru_prev;
    
    // Insert at head
    entry->lru_prev = NULL;
    entry->lru_next = cache->lru_head;
    if (cache->lru_head) cache->lru_head->lru_prev = entry;
    cache->lru_head = entry;
    if (!cache->lru_tail) cache->lru_tail = entry;
}

cache_entry_t* lru_evict(block_cache_t* cache) {
    cache_entry_t* victim = cache->lru_tail;
    if (!victim) return NULL;
    
    // Remove from tail
    cache->lru_tail = victim->lru_prev;
    if (cache->lru_tail) cache->lru_tail->lru_next = NULL;
    else cache->lru_head = NULL;
    
    // If dirty, write back to disk first
    if (victim->flags & CACHE_DIRTY) {
        disk_write(victim->block_num, victim->data);
    }
    
    // Remove from hash chain
    cache_hash_remove(cache, victim);
    
    return victim;
}
Write-Back vs Write-Through:
  • Write-Through: Every write immediately goes to disk. Safe but slow.
  • Write-Back: Writes stay in cache (marked dirty), flushed later. Fast but requires careful fsync.
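The write-back flush path (what fsync ultimately triggers) walks the cache and writes every dirty entry out. A minimal standalone sketch, using pared-down versions of the structures above and an in-memory array standing in for disk_write; the name cache_flush is mine:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_DIRTY 0x2

/* Pared-down cache entry so the flush path can run standalone. */
typedef struct cache_entry {
    uint64_t block_num;
    uint8_t  data[512];
    uint32_t flags;
    struct cache_entry* lru_next;
} cache_entry_t;

static uint8_t fake_disk[16][512];  // Stand-in for the real disk

static void disk_write(uint64_t block, const uint8_t* buf) {
    memcpy(fake_disk[block], buf, 512);
}

/* Flush all dirty entries to disk, marking them clean. */
static int cache_flush(cache_entry_t* lru_head) {
    int flushed = 0;
    for (cache_entry_t* e = lru_head; e; e = e->lru_next) {
        if (e->flags & CACHE_DIRTY) {
            disk_write(e->block_num, e->data);  // Write back dirty block
            e->flags &= ~CACHE_DIRTY;           // Now clean in cache
            flushed++;
        }
    }
    return flushed;
}
```

A real kernel would also flush periodically from a background thread (like Linux's writeback workers) so dirty data isn't lost long after the write() returned.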

Scheduler Tuning

The scheduler determines which process runs when. A bad scheduler makes the system feel sluggish even with a fast CPU. Good scheduling balances throughput (total work done) with latency (responsiveness to user input).

╔═════════════════════════════════════════════════════════════════════════════╗
                     SCHEDULER DESIGN TRADEOFFS                               
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  ┌─────────────────────────────────┐     ┌──────────────────────────────┐  
          THROUGHPUT                            LATENCY                
    Total work completed/second           Time to respond to events     
                                                                        
    • Long time slices                    • Short time slices           
    • Batch processing                    • Interactive priority        
    • Minimize context switches           • Preemption                   
  └─────────────────────────────────┘     └──────────────────────────────┘  
                                                                             
  Server (compile farm): Throughput priority                               
  Desktop (user interaction): Latency priority                             
  Real-time (audio): Hard latency guarantees                               
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝

Priority Queues (MLFQ)

Multi-Level Feedback Queue (MLFQ) is a classic scheduler that automatically categorizes processes. New processes start at high priority (assumed interactive). If they use their full time slice, they're demoted (likely CPU-bound batch). If they yield early, they stay high (likely I/O-bound interactive).

/* Multi-Level Feedback Queue Scheduler */
#define NUM_PRIORITIES 4
#define QUANTUM_MS 10

typedef struct {
    task_t* queues[NUM_PRIORITIES];
    uint32_t quantum[NUM_PRIORITIES];
} mlfq_scheduler_t;

mlfq_scheduler_t scheduler = {
    .quantum = {5, 10, 20, 40}  // ms per priority level
};

/* Schedule next task */
task_t* mlfq_schedule(void) {
    // Pick from highest priority non-empty queue
    for (int pri = 0; pri < NUM_PRIORITIES; pri++) {
        if (scheduler.queues[pri]) {
            task_t* task = scheduler.queues[pri];
            scheduler.queues[pri] = task->next;
            return task;
        }
    }
    return idle_task;
}

/* Handle quantum expiration */
void mlfq_quantum_expired(task_t* task) {
    // Demote to lower priority (if not already lowest)
    if (task->priority < NUM_PRIORITIES - 1) {
        task->priority++;
    }
    
    // Re-queue
    task->next = scheduler.queues[task->priority];
    scheduler.queues[task->priority] = task;
}
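One caveat: as written, a demoted CPU-bound task can starve forever if interactive tasks keep arriving. The classic MLFQ fix is a periodic priority boost that moves every task back to the top queue. A self-contained sketch (minimal task_t defined here for illustration; the boost interval is a tunable, often around 1 second):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PRIORITIES 4

typedef struct task {
    int priority;
    struct task* next;
} task_t;

typedef struct {
    task_t* queues[NUM_PRIORITIES];
} mlfq_scheduler_t;

/* Periodic priority boost: move ALL tasks back to the top queue so
   demoted CPU-bound tasks cannot be starved indefinitely. */
static void mlfq_boost(mlfq_scheduler_t* s) {
    for (int pri = 1; pri < NUM_PRIORITIES; pri++) {
        while (s->queues[pri]) {
            task_t* t = s->queues[pri];
            s->queues[pri] = t->next;
            t->priority = 0;             // Back to highest priority
            t->next = s->queues[0];
            s->queues[0] = t;
        }
    }
}
```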

Completely Fair Scheduler (CFS)

Linux's Completely Fair Scheduler (CFS) uses a red-black tree to give each task a "fair" share of CPU. Instead of fixed priority queues, CFS tracks virtual runtime (vruntime)—the amount of CPU time a task "should" have received.

╔═════════════════════════════════════════════════════════════════════════════╗
                  CFS RED-BLACK TREE (sorted by vruntime)                    
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
                           ┌───────────┐                                    
                             Task C                                       
                            vrun=500                                      
                           └─────┬─────┘                                    
                      ┌──────────┴──────────┐                               
                      ▼                     ▼                               
                ┌───────────┐         ┌───────────┐                      
                  Task A              Task E                         
                 vrun=200            vrun=800                        
                └─────┬─────┘         └───────────┘                      
           ┌──────────┴──────────┐                                          
           ▼                     ▼                                          
     ┌───────────┐         ┌───────────┐                               
       Task B   ←LEFTMOST   Task D                                  
      vrun=100            vrun=350                                 
     └───────────┘         └───────────┘                               
      ^ NEXT TO RUN                                                        
                                                                             
  • Leftmost node = smallest vruntime = most "starved" for CPU             
  • After running, task's vruntime increases, moves right in tree          
  • Nice value affects HOW FAST vruntime increases (lower nice = slower)   
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
/* CFS Scheduler - Red-Black Tree based */
typedef struct cfs_rq {
    struct rb_root tasks_timeline;  // Red-black tree of tasks
    struct rb_node* leftmost;       // Task with smallest vruntime
    uint64_t min_vruntime;          // Minimum vruntime in tree
    uint32_t nr_running;            // Number of runnable tasks
} cfs_rq_t;

typedef struct cfs_task {
    struct task_struct* task;
    struct cfs_rq* cfs_rq;          // Owning run queue (needed to update min_vruntime)
    struct rb_node run_node;        // RB-tree node
    uint64_t vruntime;              // Virtual runtime
    uint64_t weight;                // Nice-based weight (nice 0 = 1024)
} cfs_task_t;

/* Nice to weight conversion (nice 0 = 1024) */
const uint32_t nice_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

/* CFS: Pick task with smallest vruntime */
cfs_task_t* cfs_pick_next(cfs_rq_t* rq) {
    if (!rq->leftmost) return NULL;
    return rb_entry(rq->leftmost, cfs_task_t, run_node);
}

/* CFS: Update vruntime after running */
void cfs_update_vruntime(cfs_task_t* se, uint64_t delta_exec) {
    // Higher weight = vruntime increases slower = more CPU time
    // vruntime += delta_exec * (NICE_0_WEIGHT / se->weight)
    uint64_t weighted = (delta_exec * 1024) / se->weight;
    se->vruntime += weighted;
    
    // Update min_vruntime (never go backwards)
    if (se->vruntime > se->cfs_rq->min_vruntime) {
        se->cfs_rq->min_vruntime = se->vruntime;
    }
}

/* CFS: Insert task into tree */
void cfs_enqueue(cfs_rq_t* rq, cfs_task_t* se) {
    // New task starts at current min_vruntime (fair start)
    if (se->vruntime < rq->min_vruntime) {
        se->vruntime = rq->min_vruntime;
    }
    
    // Insert into RB-tree, sorted by vruntime
    rb_insert(&rq->tasks_timeline, &se->run_node, cfs_compare_vruntime);
    
    // Update leftmost if needed
    if (!rq->leftmost || se->vruntime < cfs_get_vruntime(rq->leftmost)) {
        rq->leftmost = &se->run_node;
    }
    rq->nr_running++;
}
Why vruntime is Brilliant: CFS doesn't need complex priority queues. A task that used 50ms of CPU will have higher vruntime than one that used 10ms—so the 10ms task runs next. The "nice" value affects the rate at which vruntime increases.
  Scheduler     Data Structure       Complexity         Best For
  ─────────     ──────────────       ──────────         ────────
  Round-Robin   Queue                O(1)               Simple, equal priority
  MLFQ          Multiple Queues      O(1)               Auto-detects interactive
  CFS           Red-Black Tree       O(log n)           General-purpose fairness
  EDF           Sorted by deadline   O(n) or O(log n)   Real-time systems
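To see how nice values translate into CPU share, here is the vruntime arithmetic as a runnable sketch, using weights from the table above (nice 0 = 1024, nice 5 = 335) and integer math as in the kernel snippet; times here are microseconds:

```c
#include <assert.h>
#include <stdint.h>

/* vruntime accrual: vruntime += delta_exec * NICE_0_WEIGHT / weight. */
static uint64_t vruntime_delta(uint64_t delta_exec_us, uint64_t weight) {
    return delta_exec_us * 1024 / weight;
}
```

If a nice-0 task and a nice-5 task each run 10ms, the nice-0 task accrues 10,000 units of vruntime while the nice-5 task accrues about 30,567 (a factor of 1024/335 ≈ 3.06). Since CFS always runs the smallest vruntime, the nice-5 task ends up with roughly a third of the CPU.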

Memory Optimization

Memory management has hidden costs that cripple performance when ignored. Every memory access triggers address translation, and every allocation has overhead. Optimizing these paths makes a huge difference.

TLB Efficiency

The Translation Lookaside Buffer (TLB) caches recent page table lookups. A TLB hit is ~1 cycle; a TLB miss requires a page table walk (~100 cycles on x86-64 with 4-level paging). Since nearly every instruction touches memory, TLB efficiency is critical!

╔═════════════════════════════════════════════════════════════════════════════╗
                     TLB HIT vs MISS                                         
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  TLB HIT (~1 cycle)
  ┌───────────┐    hit!    ┌──────────────────┐
  │    CPU    │ ─────────▶ │    TLB Cache     │ ──────▶ Physical Address
  │ (virtual) │            │(64-1024 entries) │
  └───────────┘            └──────────────────┘

  TLB MISS (~100+ cycles)
  ┌───────────┐    miss    ┌──────────────────┐
  │    CPU    │ ─────────▶ │    TLB Cache     │
  │ (virtual) │            └────────┬─────────┘
  └───────────┘                     │ Page Table Walk (4 memory accesses!)
                                    ▼
  ┌─────┐──▶┌─────┐──▶┌─────┐──▶┌─────┐──▶ Physical Address
  │PML4 │   │PDPT │   │ PD  │   │ PT  │
  └─────┘   └─────┘   └─────┘   └─────┘
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
TLB Size Matters: A typical CPU has only 64-1024 TLB entries. With 4KB pages, that covers only 256KB-4MB of address space. Access scattered memory and you'll TLB miss constantly!

Huge Pages

Huge pages (2MB or 1GB) cover more memory with fewer TLB entries. One 2MB huge page covers 512× as much memory as a 4KB page—perfect for large allocations!

  Page Size   TLB Coverage (64 entries)   Page Table Levels             Use Case
  ─────────   ─────────────────────────   ─────────────────             ────────
  4 KB        256 KB                      4 levels (PML4→PDPT→PD→PT)    General purpose
  2 MB        128 MB                      3 levels (PML4→PDPT→PD)       Large heaps, databases
  1 GB        64 GB                       2 levels (PML4→PDPT)          Scientific computing, VMs
/* Huge Page Allocation */
#define PAGE_SIZE_4K   0x1000       // 4 KB
#define PAGE_SIZE_2M   0x200000     // 2 MB
#define PAGE_SIZE_1G   0x40000000   // 1 GB

/* Page table entry flags for huge pages */
#define PTE_HUGE       (1 << 7)     // PS bit - Page Size (makes PD entry point to 2MB page)

/* Allocate 2MB huge page */
void* alloc_huge_page_2m(void) {
    // Find 2MB-aligned physical frame
    uint64_t phys = alloc_physical_2m_aligned();
    if (!phys) return NULL;
    
    // Map at PD level (skip PT level entirely)
    uint64_t virt = find_free_vm_range(PAGE_SIZE_2M);
    
    // Set huge page flag in PD entry
    uint64_t* pd_entry = get_pd_entry(virt);
    *pd_entry = phys | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE;
    
    return (void*)virt;
}

/* When to use huge pages */
// ✓ Large contiguous allocations (>2MB)
// ✓ Memory-mapped files
// ✓ Database buffer pools
// ✓ Scientific computing arrays
// ✗ Small allocations (wastes memory)
// ✗ Fragmented access patterns

Slab Allocator

The kernel allocates many small, fixed-size objects (task structs, file descriptors, inodes). A general-purpose allocator like malloc has overhead for each allocation. The slab allocator pre-allocates "slabs" of memory divided into fixed-size slots.

╔═════════════════════════════════════════════════════════════════════════════╗
                     SLAB ALLOCATOR STRUCTURE                                
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  kmem_cache "task_struct" (object_size = 2048 bytes)                      
  ┌────────────────────────────────────────────────────────┐                
    partial ──▶ ┌───────────────────────────────────┐
                │  Slab 1 (has free slots)          │
                │ ┌────┬────┬────┬────┬────┬────┐   │
                │ │used│free│used│used│free│free│   │
                │ └────┴────┴────┴────┴────┴────┘   │
                └───────────────────────────────────┘

    full ─────▶ ┌───────────────────────────────────┐
                │  Slab 2 (all used)                │
                │ ┌────┬────┬────┬────┬────┬────┐   │
                │ │used│used│used│used│used│used│   │
                │ └────┴────┴────┴────┴────┴────┘   │
                └───────────────────────────────────┘
  └────────────────────────────────────────────────────────┘                
                                                                             
  • Allocation: Find partial slab, flip bit in bitmap, return pointer      
  • Free: Clear bit in bitmap, move to partial list if was full            
  • No fragmentation—all objects same size!                                
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
/* Slab Allocator for Fixed-Size Objects */
typedef struct slab {
    void*  base;            // Slab memory base
    size_t object_size;     // Size of each object
    size_t num_objects;     // Objects per slab
    uint64_t* bitmap;       // Allocation bitmap
    struct slab* next;      // Next slab in cache
} slab_t;

typedef struct {
    const char* name;
    size_t object_size;
    slab_t* slabs;
    slab_t* partial;        // Slabs with free space
    uint64_t allocations;
    uint64_t frees;
} slab_cache_t;

/* Allocate object from slab cache */
void* slab_alloc(slab_cache_t* cache) {
    // Find slab with free space
    for (slab_t* slab = cache->partial; slab; slab = slab->next) {
        int idx = find_first_zero_bit(slab->bitmap, slab->num_objects);
        if (idx >= 0) {
            set_bit(slab->bitmap, idx);
            cache->allocations++;
            return (uint8_t*)slab->base + idx * cache->object_size;
        }
    }
    
    // No space - create new slab
    slab_t* new_slab = create_slab(cache);
    set_bit(new_slab->bitmap, 0);
    new_slab->next = cache->partial;
    cache->partial = new_slab;
    cache->allocations++;
    
    return new_slab->base;
}
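The free path mentioned in the diagram is just as simple: compute the slot index from the pointer and clear its bit. A self-contained sketch, using a pared-down slab (a single uint64_t bitmap, so up to 64 objects) rather than the bitmap array above:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Pared-down slab: one 64-bit bitmap, up to 64 objects. */
typedef struct slab {
    uint8_t* base;
    size_t   object_size;
    size_t   num_objects;
    uint64_t bitmap;            // 1 bit per slot, 1 = allocated
} slab_t;

/* Free an object: derive its slot index from the pointer and clear
   the bit. O(1), and no size argument needed -- the slab knows it. */
static int slab_free(slab_t* slab, void* obj) {
    size_t off = (uint8_t*)obj - slab->base;
    if (off % slab->object_size != 0) return -1;     // Not a slot boundary
    size_t idx = off / slab->object_size;
    if (idx >= slab->num_objects) return -1;         // Out of range
    if (!(slab->bitmap & (1ULL << idx))) return -1;  // Double free
    slab->bitmap &= ~(1ULL << idx);                  // Mark slot free
    return 0;
}
```

Note that the pointer itself identifies the object's size class, which is why kernel slab APIs (like Linux's kmem_cache_free) don't need a length parameter.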

Profiling Tools

You can't optimize what you can't measure. Modern CPUs provide hardware performance counters that track cache misses, branch mispredictions, and cycles with zero overhead. Building profiling into your OS is essential for finding bottlenecks.

Performance Counters (PMC)

Performance Monitoring Counters (PMC) are special CPU registers that count hardware events. x86 CPUs have 4-8 general-purpose counters that can track various events.

  Counter Event          What It Measures           Optimization Insight
  ─────────────          ────────────────           ────────────────────
  CPU_CYCLES             Total clock cycles         Overall time spent
  INSTRUCTIONS_RETIRED   Instructions completed     IPC (Instr/Cycle) efficiency
  LLC_MISSES             Last-level cache misses    Memory bandwidth issues
  BRANCH_MISPREDICT      Wrong branch predictions   Unpredictable code paths
  DTLB_MISS              Data TLB misses            Need huge pages?
/* Read CPU Performance Counter (RDPMC) */
static inline uint64_t read_pmc(uint32_t counter) {
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* Read Time Stamp Counter */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Simple profiling macro */
#define PROFILE_START(name) uint64_t _prof_##name = read_tsc()
#define PROFILE_END(name) do { \
    uint64_t _end = read_tsc(); \
    kprintf("%s: %lu cycles\n", #name, _end - _prof_##name); \
} while(0)

/* Configure PMC to count cache misses */
void setup_pmc_cache_misses(void) {
    // PMC setup via MSRs (Model-Specific Registers)
    // Event 0x2E = LLC_MISSES, Umask 0x41
    uint64_t event_select = 0x412E;  // LLC misses
    event_select |= (1 << 16);       // USR - count in user mode
    event_select |= (1 << 17);       // OS - count in kernel mode
    event_select |= (1 << 22);       // EN - enable counting
    
    wrmsr(MSR_PERFEVTSEL0, event_select);
    wrmsr(MSR_PMC0, 0);  // Reset counter
}

uint64_t read_cache_misses(void) {
    return read_pmc(0);  // Read PMC0
}

Kernel Tracing

Kernel tracing records events as they happen—syscalls, interrupts, context switches—creating a timeline you can analyze. This is invaluable for finding why the system "feels slow."

/* Simple Ring Buffer for Kernel Tracing */
#define TRACE_BUFFER_SIZE 65536

typedef struct trace_event {
    uint64_t timestamp;     // TSC timestamp
    uint32_t event_type;    // TRACE_SYSCALL, TRACE_IRQ, etc.
    uint32_t cpu;           // Which CPU
    uint32_t pid;           // Process ID
    uint64_t arg1, arg2;    // Event-specific data
} trace_event_t;

typedef struct trace_buffer {
    trace_event_t events[TRACE_BUFFER_SIZE];
    volatile uint32_t head;
    volatile uint32_t tail;
    bool enabled;
} trace_buffer_t;

trace_buffer_t trace_buf;

/* Record trace event (called from kernel code) */
void trace_record(uint32_t type, uint64_t arg1, uint64_t arg2) {
    if (!trace_buf.enabled) return;
    
    // TRACE_BUFFER_SIZE is a power of two, so the modulo wrap is cheap;
    // when the buffer fills, the oldest events are overwritten (tail is
    // advanced by the reader that drains the buffer)
    uint32_t idx = atomic_inc(&trace_buf.head) % TRACE_BUFFER_SIZE;
    trace_event_t* e = &trace_buf.events[idx];
    
    e->timestamp = read_tsc();
    e->event_type = type;
    e->cpu = get_cpu_id();
    e->pid = current->pid;
    e->arg1 = arg1;
    e->arg2 = arg2;
}

/* Trace points in kernel */
#define TRACE_SYSCALL_ENTER  1
#define TRACE_SYSCALL_EXIT   2
#define TRACE_IRQ_ENTER      3
#define TRACE_IRQ_EXIT       4
#define TRACE_SCHED_SWITCH   5
#define TRACE_PAGE_FAULT     6

/* Example: Trace syscall entry. C cannot forward "..." to another
   variadic call, so pass the six syscall argument registers explicitly
   (the x86-64 ABI allows up to six). */
long syscall_handler(long num, long a1, long a2, long a3,
                     long a4, long a5, long a6) {
    trace_record(TRACE_SYSCALL_ENTER, num, a1);
    
    long result = do_syscall(num, a1, a2, a3, a4, a5, a6);
    
    trace_record(TRACE_SYSCALL_EXIT, num, result);
    return result;
}
Visualization: Export trace data to a file, then visualize with tools like Chrome's chrome://tracing or Perfetto. You'll see exactly what the kernel was doing at each microsecond!

What You Can Build

Phase 16 Project: A fast OS! Your system now has disk caching for fast file access, an optimized scheduler for responsive multitasking, efficient memory allocation, and tools to measure performance.

Let's build a performance benchmark suite that measures all the optimizations we've implemented:

/* performance_benchmark.c - Comprehensive OS Performance Suite */
#include "cache.h"
#include "scheduler.h"
#include "memory.h"
#include "profiling.h"

typedef struct benchmark_result {
    const char* name;
    uint64_t cycles;
    uint64_t cache_misses;
    uint64_t operations;
    double ops_per_second;
} benchmark_result_t;

/* Benchmark 1: Block Cache Performance */
benchmark_result_t bench_block_cache(void) {
    const uint64_t NUM_OPS = 100000;
    uint64_t start_cycles = read_tsc();
    uint64_t start_misses = read_cache_misses();
    
    // First pass: cold cache (all misses)
    for (uint64_t i = 0; i < 1000; i++) {
        cache_get(&block_cache, i);
    }
    
    // Second pass: warm cache (should hit)
    for (uint64_t i = 0; i < NUM_OPS; i++) {
        cache_get(&block_cache, i % 1000);  // Access within cache size
    }
    
    uint64_t end_cycles = read_tsc();
    uint64_t end_misses = read_cache_misses();
    
    benchmark_result_t result = {
        .name = "Block Cache",
        .cycles = end_cycles - start_cycles,
        .cache_misses = end_misses - start_misses,
        .operations = NUM_OPS,
        .ops_per_second = (double)NUM_OPS * cpu_freq / (end_cycles - start_cycles)  // cpu_freq in Hz; cast avoids integer division
    };
    
    kprintf("[BENCH] %s: %lu ops, %lu cycles, %.2f ops/sec, hit_rate=%.1f%%\n",
            result.name, result.operations, result.cycles, result.ops_per_second,
            100.0 * block_cache.hits / (block_cache.hits + block_cache.misses));
    
    return result;
}

/* Benchmark 2: Slab Allocator vs Generic Malloc */
benchmark_result_t bench_slab_allocator(void) {
    const uint64_t NUM_ALLOCS = 50000;
    void* ptrs[1000];
    
    // Benchmark slab allocator
    uint64_t slab_start = read_tsc();
    for (uint64_t i = 0; i < NUM_ALLOCS; i++) {
        ptrs[i % 1000] = slab_alloc(&task_cache);  // Pre-sized for task_struct
        if (i >= 1000) slab_free(&task_cache, ptrs[(i - 1000) % 1000]);
    }
    uint64_t slab_cycles = read_tsc() - slab_start;
    
    // Benchmark generic malloc
    uint64_t malloc_start = read_tsc();
    for (uint64_t i = 0; i < NUM_ALLOCS; i++) {
        ptrs[i % 1000] = kmalloc(sizeof(task_t));
        if (i >= 1000) kfree(ptrs[(i - 1000) % 1000]);
    }
    uint64_t malloc_cycles = read_tsc() - malloc_start;
    
    kprintf("[BENCH] Slab: %lu cycles, Malloc: %lu cycles (%.1fx faster)\n",
            slab_cycles, malloc_cycles, (double)malloc_cycles / slab_cycles);
    
    return (benchmark_result_t){
        .name = "Slab Allocator",
        .cycles = slab_cycles,
        .operations = NUM_ALLOCS
    };
}

/* Benchmark 3: Scheduler Latency */
benchmark_result_t bench_scheduler_latency(void) {
    const int NUM_SAMPLES = 1000;
    uint64_t latencies[NUM_SAMPLES];
    
    for (int i = 0; i < NUM_SAMPLES; i++) {
        uint64_t start = read_tsc();
        yield();  // Voluntary context switch
        latencies[i] = read_tsc() - start;
    }
    
    // Calculate statistics
    uint64_t sum = 0, min = UINT64_MAX, max = 0;
    for (int i = 0; i < NUM_SAMPLES; i++) {
        sum += latencies[i];
        if (latencies[i] < min) min = latencies[i];
        if (latencies[i] > max) max = latencies[i];
    }
    
    kprintf("[BENCH] Context Switch: avg=%lu, min=%lu, max=%lu cycles\n",
            sum / NUM_SAMPLES, min, max);
    
    return (benchmark_result_t){
        .name = "Context Switch",
        .cycles = sum / NUM_SAMPLES,
        .operations = NUM_SAMPLES
    };
}

/* Benchmark 4: TLB/Huge Page Impact */
benchmark_result_t bench_huge_pages(void) {
    const size_t SIZE = 64 * 1024 * 1024;  // 64 MB
    
    // Allocate with 4KB pages
    uint64_t start_4k = read_tsc();
    uint64_t start_tlb = read_pmc_tlb_misses();
    char* mem_4k = vmalloc(SIZE);  // Regular pages
    for (size_t i = 0; i < SIZE; i += 4096) {
        mem_4k[i] = 1;  // Touch each page
    }
    uint64_t cycles_4k = read_tsc() - start_4k;
    uint64_t tlb_4k = read_pmc_tlb_misses() - start_tlb;
    vfree(mem_4k);
    
    // Allocate with 2MB huge pages
    uint64_t start_2m = read_tsc();
    start_tlb = read_pmc_tlb_misses();
    char* mem_2m = vmalloc_huge(SIZE);  // Huge pages
    for (size_t i = 0; i < SIZE; i += 4096) {
        mem_2m[i] = 1;
    }
    uint64_t cycles_2m = read_tsc() - start_2m;
    uint64_t tlb_2m = read_pmc_tlb_misses() - start_tlb;
    vfree(mem_2m);
    
    kprintf("[BENCH] 4KB pages: %lu cycles, %lu TLB misses\n", cycles_4k, tlb_4k);
    kprintf("[BENCH] 2MB pages: %lu cycles, %lu TLB misses (%.1fx faster)\n",
            cycles_2m, tlb_2m, (double)cycles_4k / cycles_2m);
    
    return (benchmark_result_t){
        .name = "Huge Pages",
        .cycles = cycles_2m,
        .operations = SIZE / 4096
    };
}

/* Run all benchmarks */
void run_performance_suite(void) {
    kprintf("\n=== WasilOS Performance Benchmark Suite ===\n\n");
    
    // Enable performance counters
    setup_pmc_cache_misses();
    
    benchmark_result_t results[4];
    results[0] = bench_block_cache();
    results[1] = bench_slab_allocator();
    results[2] = bench_scheduler_latency();
    results[3] = bench_huge_pages();
    
    kprintf("\n=== Benchmark Complete ===\n");
}

Exercises

Exercise 1: Implement Read-Ahead

When reading block N, also prefetch blocks N+1, N+2, etc. into the cache. Most file access is sequential, so this dramatically improves performance.

void cache_readahead(block_cache_t* cache, uint64_t block, int count) {
    for (int i = 1; i <= count; i++) {
        // Async prefetch without blocking
        cache_prefetch(cache, block + i);
    }
}

Exercise 2: Implement CLOCK Algorithm

CLOCK is an approximation of LRU that's cheaper to implement. Instead of moving entries on every access, it uses a "second chance" bit.
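One possible shape for the second-chance sweep, as a sketch: clock_state_t, clock_touch, clock_evict, and NFRAMES are illustrative names, assuming a fixed array of page frames.

```c
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 8

typedef struct {
    bool valid[NFRAMES];   /* frame currently holds a page */
    bool used[NFRAMES];    /* reference bit: set on access */
    size_t hand;           /* clock hand position */
} clock_state_t;

/* On access: an O(1) bit set -- no list manipulation, unlike strict LRU */
void clock_touch(clock_state_t *c, size_t frame) {
    c->used[frame] = true;
}

/* On eviction: sweep the hand, clearing reference bits ("second chance")
   until a frame with a clear bit is found */
size_t clock_evict(clock_state_t *c) {
    for (;;) {
        size_t f = c->hand;
        c->hand = (c->hand + 1) % NFRAMES;
        if (!c->valid[f]) return f;          /* free frame: use it */
        if (c->used[f]) c->used[f] = false;  /* recently used: spare it once */
        else return f;                       /* victim */
    }
}
```

A frame survives at most one full sweep without being touched, which is why CLOCK tracks LRU closely at a fraction of the bookkeeping cost.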

Exercise 3: Add Nice Support

Implement the nice() syscall that allows processes to adjust their priority. Higher nice values = lower priority = slower vruntime increase.
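The vruntime side of this can be sketched as follows, assuming the CFS scheduler built earlier in this phase. Each nice step scales CPU share by about 1.25x (weight = 1024 / 1.25^nice); the five table entries match Linux's sched_prio_to_weight values for nice -2..+2, truncated here for brevity. calc_delta_vruntime is an illustrative name.

```c
#include <stdint.h>

/* CFS-style weights for nice -2..+2 (full table spans -20..19) */
static const unsigned int nice_to_weight[] = {
    /* -2 */ 1586, /* -1 */ 1277, /* 0 */ 1024, /* +1 */ 820, /* +2 */ 655,
};

/* vruntime advances more slowly for heavier (lower-nice) tasks, so the
   scheduler picks them more often:
   delta_vruntime = delta_exec * 1024 / weight */
uint64_t calc_delta_vruntime(uint64_t delta_exec, int nice) {
    unsigned int weight = nice_to_weight[nice + 2];  /* valid for -2..+2 */
    return delta_exec * 1024 / weight;
}
```

At nice 0 vruntime tracks real execution time exactly; a nice +2 task accumulates vruntime faster than wall time and therefore runs less.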

Exercise 4: Build a Profiler

Create a simple sampling profiler that uses timer interrupts to record which function is executing. Build a histogram of where time is spent.
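The core of the sampling step can be sketched like this, assuming the timer handler can pass in the interrupted instruction pointer; profile_sample, text_start, and the bucket sizing are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define PROF_BUCKETS 1024
#define BUCKET_SHIFT 6   /* 64-byte address buckets */

static uint32_t prof_hist[PROF_BUCKETS];

/* Called on each timer tick with the interrupted RIP; text_start is
   the base of the kernel's text segment (an assumed symbol). Hot code
   shows up as tall buckets after a few seconds of sampling. */
void profile_sample(uintptr_t rip, uintptr_t text_start) {
    size_t bucket = ((rip - text_start) >> BUCKET_SHIFT) % PROF_BUCKETS;
    prof_hist[bucket]++;
}
```

Mapping a hot bucket back to a function name just requires a sorted symbol table (the kernel's own, or one loaded from the build artifacts).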

╔═════════════════════════════════════════════════════════════════════════════╗
                    PHASE 16 → PHASE 17 TRANSITION                           
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Phase 16 Complete:                                                        
  ┌─────────────────────────────────────────────────────────────────────┐  
   ✓ Block cache (90%+ hit rate)                                         
   ✓ Page cache (memory-mapped files)                                    
   ✓ LRU eviction policy                                                 
   ✓ MLFQ scheduler (priority queues)                                    
   ✓ CFS scheduler (red-black tree)                                      
   ✓ TLB optimization                                                    
   ✓ Huge pages (2MB/1GB)                                                
   ✓ Slab allocator                                                      
   ✓ Performance counters (PMC)                                          
   ✓ Kernel tracing                                                      
  └─────────────────────────────────────────────────────────────────────┘  


  Phase 17 Preview - Stability, Security & Finishing:                      
  ┌─────────────────────────────────────────────────────────────────────┐  
   • Kernel debugging (stack traces, panic handling)                     
   • Security hardening (ASLR, stack canaries, NX)                       
   • User/kernel separation                                              
   • Complete bootable OS image                                          
  └─────────────────────────────────────────────────────────────────────┘  
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
Concept          What We Built                Performance Impact
──────────────   ──────────────────────────   ──────────────────────────────
Block Cache      Hash table + LRU list        10,000x faster than disk
Page Cache       File pages in memory         Enables mmap()
CFS Scheduler    Red-black tree by vruntime   Fair + responsive
Huge Pages       2MB/1GB page support         10-50% faster for large allocs
Slab Allocator   Fixed-size object pools      10x faster than malloc
Profiling        PMC + tracing                Find what to optimize

Next Steps

With performance optimized, it's time to make the OS stable and secure. In Phase 17, we'll add debugging tools, implement security hardening, and put the finishing touches on our complete operating system.
