
Phase 16: Performance & Optimization

February 6, 2026 · Wasil Zafar · 30 min read

Make your OS fast with caching strategies, scheduler optimization, memory access patterns, and profiling tools.

Table of Contents

  1. Introduction
  2. Caching Strategies
  3. Scheduler Tuning
  4. Memory Optimization
  5. Profiling Tools
  6. What You Can Build
  7. Next Steps

Introduction: Performance Matters

Phase 16 Goals: By the end of this phase, your OS will have disk and page caching, an optimized scheduler, efficient memory allocation, and profiling tools to measure and improve performance.

We've built a functional operating system over 15 phases. But "works" isn't the same as "works well". An OS that freezes when copying files or stutters during video playback isn't usable. Performance optimization is what transforms a toy OS into something production-ready.

╔═════════════════════════════════════════════════════════════════════════════╗
                    MEMORY & STORAGE LATENCY HIERARCHY                        
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Access Time          Component              Typical Size               
  ─────────────          ───────────────          ────────────────       
                                                                             
  ~1 ns               L1 Cache                32-64 KB                   
  ~4 ns               L2 Cache                256 KB - 1 MB              
  ~10 ns              L3 Cache                8-64 MB                    
  ~100 ns             RAM (DDR4/5)            8-128 GB                   
  ~10,000 ns          NVMe SSD                256 GB - 8 TB              
  ~10,000,000 ns      HDD (Spinning)          1-20 TB                    
                                                                             
  HDD is 10,000,000x slower than L1 cache!                                  
  Caching transforms disk access from unusable to instant.                 
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
Key Insight: Making an OS fast requires understanding where time is spent. Caching reduces disk latency, scheduler tuning improves responsiveness, and memory optimizations reduce allocation overhead.

Performance Mindset

Before optimizing, you need to measure. Donald Knuth famously said: "Premature optimization is the root of all evil." The corollary is: profile first, optimize second.

  Principle          Meaning                              Example
  ─────────          ───────                              ───────
  Measure First      Profile before optimizing            Find that 90% of time is in disk I/O, not CPU
  Amdahl's Law       Speedup limited by serial portions   10% serial code limits speedup to 10x max
  Locality           Access nearby data together          Array of structs vs struct of arrays
  Avoid Copies       Zero-copy where possible             Pass pointers, use DMA for disk I/O
  Batch Operations   Amortize overhead over many items    Write 100 sectors at once, not 1 at a time
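Amdahl's law from the table is worth making concrete. Here is a minimal sketch in integer math (the function name and fixed-point scaling are illustrative, not from any real kernel):

```c
#include <assert.h>

/* Amdahl's law: speedup = 1 / (serial + (1 - serial) / n).
   Integer version: serial fraction given in percent, result in
   hundredths of x (e.g. 400 = 4.00x) to avoid floating point. */
static unsigned long amdahl_speedup_x100(unsigned serial_pct, unsigned long n) {
    unsigned long serial   = serial_pct * 100;      // parts-per-10000
    unsigned long parallel = 10000 - serial;
    unsigned long denom    = serial + parallel / n; // parts-per-10000
    return (10000UL * 100) / denom;                 // hundredths of x
}
```

Even with a million cores, 10% serial code caps the speedup at 10.00x; a perfectly parallel workload on 4 cores gets the full 4.00x.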

Common Bottlenecks

In an OS, the major performance bottlenecks are:

╔═════════════════════════════════════════════════════════════════════════════╗
                     OS PERFORMANCE BOTTLENECKS                               
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    1. DISK I/O ─ Uncached disk reads are killers (~10ms each!)          
       → Solution: Block cache, read-ahead, async I/O                    
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    2. CONTEXT SWITCHES ─ Task switching overhead adds up                
       → Solution: Smart scheduler, longer time slices where OK          
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    3. MEMORY ALLOCATION ─ malloc/free are surprisingly slow            
       → Solution: Slab allocator, object pools                          
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    4. TLB MISSES ─ Page table walks are expensive (~100 cycles)       
       → Solution: Huge pages, minimize mappings, locality               
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
  ┌───────────────────────────────────────────────────────────────────┐  
    5. LOCK CONTENTION ─ Threads waiting on each other                  
       → Solution: Lock-free structures, finer-grained locks             
  └───────────────────────────────────────────────────────────────────┘  
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝

Caching Strategies

Caching is the single most important performance optimization in an OS. By keeping frequently accessed data in memory, we avoid slow disk operations. A good cache can make your OS feel 100x faster!

╔═════════════════════════════════════════════════════════════════════════════╗
                       OS CACHING LAYERS                                     
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Application
       │ read("/home/user/file.txt")
       ▼
  ┌─────────────────────────────┐
  │       Page Cache            │  File data cached as memory pages
  │   (virtual file pages)      │  Serves most read() calls from RAM
  └─────────────┬───────────────┘
                │ Page miss → lookup block number
                ▼
  ┌─────────────────────────────┐
  │       Block Cache           │  Raw disk sectors cached in RAM
  │   (disk sectors/blocks)     │  Indexed by (device, block_num)
  └─────────────┬───────────────┘
                │ Block miss → actual disk I/O
                ▼
  ┌─────────────────────────────┐
  │       Disk Driver           │  AHCI/NVMe actually reads from disk
  │  (SLOW! Avoid at all cost)  │  HDD: ~10ms, SSD: ~0.1ms
  └─────────────────────────────┘
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝

Block Cache

The block cache stores recently accessed disk sectors in memory. Before reading from disk, we check the cache—a hit returns instantly (nanoseconds vs. milliseconds).

/* Block Cache Entry */
typedef struct cache_entry {
    uint64_t block_num;         // Disk block number
    uint8_t* data;              // Block data (512/4096 bytes)
    uint32_t flags;             // Dirty, valid, etc.
    uint64_t access_time;       // For LRU eviction
    struct cache_entry* next;   // Hash chain
    struct cache_entry* lru_prev;
    struct cache_entry* lru_next;
} cache_entry_t;

/* Block Cache */
#define CACHE_SIZE 1024  // Number of cached blocks
#define HASH_SIZE  256

typedef struct {
    cache_entry_t* hash[HASH_SIZE];
    cache_entry_t* lru_head;
    cache_entry_t* lru_tail;
    uint32_t hits;
    uint32_t misses;
} block_cache_t;

/* Get cached block or read from disk */
cache_entry_t* cache_get(block_cache_t* cache, uint64_t block) {
    uint32_t hash = block % HASH_SIZE;
    
    // Search hash chain
    for (cache_entry_t* e = cache->hash[hash]; e; e = e->next) {
        if (e->block_num == block) {
            cache->hits++;
            lru_touch(cache, e);  // Move to front
            return e;
        }
    }
    
    // Cache miss - evict an LRU victim (or allocate) and read from disk
    cache->misses++;
    cache_entry_t* entry = lru_evict_or_alloc(cache);
    entry->block_num = block;
    entry->lru_prev = entry->lru_next = NULL;  // Unlinked; lru_touch re-inserts at head
    disk_read(block, entry->data);
    
    // Insert into hash chain and LRU list
    entry->next = cache->hash[hash];
    cache->hash[hash] = entry;
    lru_touch(cache, entry);
    
    return entry;
}
Cache Hit Rate: Track hits / (hits + misses). A good block cache achieves 90%+ hit rate for typical workloads. Even 90% means only 10% of accesses hit disk!
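The impact of hit rate can be quantified with the standard effective-access-time formula: latency = hit_rate × t_hit + miss_rate × t_miss. A tiny illustrative helper (integer math, names are mine):

```c
#include <assert.h>

/* Effective access time in nanoseconds, given a hit rate in percent,
   the cache-hit latency, and the miss (disk) latency. */
static unsigned long effective_latency_ns(unsigned hit_pct,
                                          unsigned long hit_ns,
                                          unsigned long miss_ns) {
    return (hit_pct * hit_ns + (100UL - hit_pct) * miss_ns) / 100;
}
```

With 100ns hits and 10ms (10,000,000ns) HDD misses, a 90% hit rate gives ~1ms average latency, while 99% gives ~0.1ms. Going from 90% to 99% is a further 10x speedup, which is why cache sizing and eviction policy matter so much.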

Page Cache

While the block cache stores raw sectors, the page cache stores file data as memory pages. This integrates with virtual memory—a memory-mapped file is just pages in the page cache!

/* Page Cache - Indexed by (inode, offset) */
typedef struct page_cache_entry {
    struct inode* inode;        // File inode
    uint64_t offset;            // Page offset within file
    void* page;                 // 4KB page of file data
    uint32_t flags;             // PG_DIRTY, PG_UPTODATE, etc.
    atomic_t ref_count;         // Reference count
    struct list_head lru;       // LRU list linkage
    struct hlist_node hash;     // Hash table linkage
} page_cache_entry_t;

/* Find or create page in cache */
page_cache_entry_t* page_cache_lookup(struct inode* inode, uint64_t offset) {
    uint64_t page_offset = offset & ~0xFFFULL;  // Align to 4KB page
    uint32_t hash = hash_page(inode, page_offset);
    
    // Search hash bucket
    spin_lock(&page_cache_lock);
    page_cache_entry_t* entry;
    hlist_for_each_entry(entry, &page_hash[hash], hash) {
        if (entry->inode == inode && entry->offset == page_offset) {
            atomic_inc(&entry->ref_count);
            lru_touch(entry);  // Move to front of LRU
            spin_unlock(&page_cache_lock);
            return entry;
        }
    }
    spin_unlock(&page_cache_lock);
    
    // Cache miss - allocate new page and read from disk
    entry = alloc_page_cache_entry();
    entry->inode = inode;
    entry->offset = page_offset;
    entry->page = alloc_page();
    
    // Read file data into page (uses block cache internally)
    inode->i_op->readpage(inode, entry->page, page_offset);
    
    // Insert into cache
    spin_lock(&page_cache_lock);
    hlist_add_head(&entry->hash, &page_hash[hash]);
    list_add(&entry->lru, &page_lru_list);
    spin_unlock(&page_cache_lock);
    
    return entry;
}

/* Memory-mapped file read - just returns cached page */
void* mmap_read(struct file* file, uint64_t offset) {
    page_cache_entry_t* entry = page_cache_lookup(file->f_inode, offset);
    return (uint8_t*)entry->page + (offset & 0xFFF);  // Pointer within the cached page
}

LRU Eviction

When the cache is full, we need to evict old entries. LRU (Least Recently Used) is the classic algorithm—evict the entry that hasn't been accessed for the longest time.

╔═════════════════════════════════════════════════════════════════════════════╗
                     LRU CACHE OPERATION                                      
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Doubly-Linked List (head = most recent, tail = least recent)            
                                                                             
  HEAD                                                         TAIL        
    │                                                           │          
    ▼                                                           ▼          
  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
  │Blk 5│───│Blk 2│───│Blk 9│───│Blk 1│───│Blk 7│───│Blk 3│
  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘   └─────┘
  newest                                              oldest
                                                                             
  Operation: Access Block 1                                                
  ──────────────────────────                                                   
  1. Remove Block 1 from current position                                    
  2. Insert Block 1 at HEAD                                                   
                                                                             
  HEAD                                                         TAIL        
    │                                                           │          
    ▼                                                           ▼          
  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
  │Blk 1│───│Blk 5│───│Blk 2│───│Blk 9│───│Blk 7│───│Blk 3│
  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘   └─────┘
  ↑ moved                                             evict me
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
/* LRU List Operations - O(1) */
void lru_touch(block_cache_t* cache, cache_entry_t* entry) {
    // Remove from current position
    if (entry->lru_prev) entry->lru_prev->lru_next = entry->lru_next;
    if (entry->lru_next) entry->lru_next->lru_prev = entry->lru_prev;
    if (cache->lru_tail == entry) cache->lru_tail = entry->lru_prev;
    
    // Insert at head
    entry->lru_prev = NULL;
    entry->lru_next = cache->lru_head;
    if (cache->lru_head) cache->lru_head->lru_prev = entry;
    cache->lru_head = entry;
    if (!cache->lru_tail) cache->lru_tail = entry;
}

cache_entry_t* lru_evict(block_cache_t* cache) {
    cache_entry_t* victim = cache->lru_tail;
    if (!victim) return NULL;
    
    // Remove from tail
    cache->lru_tail = victim->lru_prev;
    if (cache->lru_tail) cache->lru_tail->lru_next = NULL;
    else cache->lru_head = NULL;
    
    // If dirty, write back to disk first
    if (victim->flags & CACHE_DIRTY) {
        disk_write(victim->block_num, victim->data);
    }
    
    // Remove from hash chain
    cache_hash_remove(cache, victim);
    
    return victim;
}
Write-Back vs Write-Through:
  • Write-Through: Every write immediately goes to disk. Safe but slow.
  • Write-Back: Writes stay in cache (marked dirty), flushed later. Fast but requires careful fsync.
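The write-back flush path (what fsync ultimately triggers) walks the cache and writes every dirty entry out. A minimal standalone sketch, using pared-down versions of the structures above and an in-memory array standing in for disk_write; the name cache_flush is mine:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_DIRTY 0x2

/* Pared-down cache entry so the flush path can run standalone. */
typedef struct cache_entry {
    uint64_t block_num;
    uint8_t  data[512];
    uint32_t flags;
    struct cache_entry* lru_next;
} cache_entry_t;

static uint8_t fake_disk[16][512];  // Stand-in for the real disk

static void disk_write(uint64_t block, const uint8_t* buf) {
    memcpy(fake_disk[block], buf, 512);
}

/* Flush all dirty entries to disk, marking them clean. */
static int cache_flush(cache_entry_t* lru_head) {
    int flushed = 0;
    for (cache_entry_t* e = lru_head; e; e = e->lru_next) {
        if (e->flags & CACHE_DIRTY) {
            disk_write(e->block_num, e->data);  // Write back dirty block
            e->flags &= ~CACHE_DIRTY;           // Now clean in cache
            flushed++;
        }
    }
    return flushed;
}
```

A real kernel would also flush periodically from a background thread (like Linux's writeback workers) so dirty data isn't lost long after the write() returned.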

Scheduler Tuning

The scheduler determines which process runs when. A bad scheduler makes the system feel sluggish even with a fast CPU. Good scheduling balances throughput (total work done) with latency (responsiveness to user input).

╔═════════════════════════════════════════════════════════════════════════════╗
                     SCHEDULER DESIGN TRADEOFFS                               
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  ┌─────────────────────────────────┐     ┌──────────────────────────────┐  
          THROUGHPUT                            LATENCY                
    Total work completed/second           Time to respond to events     
                                                                        
    • Long time slices                    • Short time slices           
    • Batch processing                    • Interactive priority        
    • Minimize context switches           • Preemption                   
  └─────────────────────────────────┘     └──────────────────────────────┘  
                                                                             
  Server (compile farm): Throughput priority                               
  Desktop (user interaction): Latency priority                             
  Real-time (audio): Hard latency guarantees                               
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝

Priority Queues (MLFQ)

Multi-Level Feedback Queue (MLFQ) is a classic scheduler that automatically categorizes processes. New processes start at high priority (assumed interactive). If they use their full time slice, they're demoted (likely CPU-bound batch). If they yield early, they stay high (likely I/O-bound interactive).

/* Multi-Level Feedback Queue Scheduler */
#define NUM_PRIORITIES 4
#define QUANTUM_MS 10

typedef struct {
    task_t* queues[NUM_PRIORITIES];
    uint32_t quantum[NUM_PRIORITIES];
} mlfq_scheduler_t;

mlfq_scheduler_t scheduler = {
    .quantum = {5, 10, 20, 40}  // ms per priority level
};

/* Schedule next task */
task_t* mlfq_schedule(void) {
    // Pick from highest priority non-empty queue
    for (int pri = 0; pri < NUM_PRIORITIES; pri++) {
        if (scheduler.queues[pri]) {
            task_t* task = scheduler.queues[pri];
            scheduler.queues[pri] = task->next;
            return task;
        }
    }
    return idle_task;
}

/* Handle quantum expiration */
void mlfq_quantum_expired(task_t* task) {
    // Demote to lower priority (if not already lowest)
    if (task->priority < NUM_PRIORITIES - 1) {
        task->priority++;
    }
    
    // Re-queue
    task->next = scheduler.queues[task->priority];
    scheduler.queues[task->priority] = task;
}
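One caveat: as written, a demoted CPU-bound task can starve forever if interactive tasks keep arriving. The classic MLFQ fix is a periodic priority boost that moves every task back to the top queue. A self-contained sketch (minimal task_t defined here for illustration; the boost interval is a tunable, often around 1 second):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PRIORITIES 4

typedef struct task {
    int priority;
    struct task* next;
} task_t;

typedef struct {
    task_t* queues[NUM_PRIORITIES];
} mlfq_scheduler_t;

/* Periodic priority boost: move ALL tasks back to the top queue so
   demoted CPU-bound tasks cannot be starved indefinitely. */
static void mlfq_boost(mlfq_scheduler_t* s) {
    for (int pri = 1; pri < NUM_PRIORITIES; pri++) {
        while (s->queues[pri]) {
            task_t* t = s->queues[pri];
            s->queues[pri] = t->next;
            t->priority = 0;             // Back to highest priority
            t->next = s->queues[0];
            s->queues[0] = t;
        }
    }
}
```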

Completely Fair Scheduler (CFS)

Linux's Completely Fair Scheduler (CFS) uses a red-black tree to give each task a "fair" share of CPU. Instead of fixed priority queues, CFS tracks virtual runtime (vruntime)—the amount of CPU time a task "should" have received.

╔═════════════════════════════════════════════════════════════════════════════╗
                  CFS RED-BLACK TREE (sorted by vruntime)                    
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
                           ┌───────────┐                                    
                             Task C                                       
                            vrun=500                                      
                           └─────┬─────┘                                    
                      ┌──────────┴──────────┐                               
                      ▼                     ▼                               
                ┌───────────┐         ┌───────────┐                      
                  Task A              Task E                         
                 vrun=200            vrun=800                        
                └─────┬─────┘         └───────────┘                      
           ┌──────────┴──────────┐                                          
           ▼                     ▼                                          
     ┌───────────┐         ┌───────────┐                               
       Task B   ←LEFTMOST   Task D                                  
      vrun=100            vrun=350                                 
     └───────────┘         └───────────┘                               
      ^ NEXT TO RUN                                                        
                                                                             
  • Leftmost node = smallest vruntime = most "starved" for CPU             
  • After running, task's vruntime increases, moves right in tree          
  • Nice value affects HOW FAST vruntime increases (lower nice = slower)   
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
/* CFS Scheduler - Red-Black Tree based */
typedef struct cfs_rq {
    struct rb_root tasks_timeline;  // Red-black tree of tasks
    struct rb_node* leftmost;       // Task with smallest vruntime
    uint64_t min_vruntime;          // Minimum vruntime in tree
    uint32_t nr_running;            // Number of runnable tasks
} cfs_rq_t;

typedef struct cfs_task {
    struct task_struct* task;
    struct cfs_rq* cfs_rq;          // Owning run queue (needed to update min_vruntime)
    struct rb_node run_node;        // RB-tree node
    uint64_t vruntime;              // Virtual runtime
    uint64_t weight;                // Nice-based weight (nice 0 = 1024)
} cfs_task_t;

/* Nice to weight conversion (nice 0 = 1024) */
const uint32_t nice_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

/* CFS: Pick task with smallest vruntime */
cfs_task_t* cfs_pick_next(cfs_rq_t* rq) {
    if (!rq->leftmost) return NULL;
    return rb_entry(rq->leftmost, cfs_task_t, run_node);
}

/* CFS: Update vruntime after running */
void cfs_update_vruntime(cfs_task_t* se, uint64_t delta_exec) {
    // Higher weight = vruntime increases slower = more CPU time
    // vruntime += delta_exec * (NICE_0_WEIGHT / se->weight)
    uint64_t weighted = (delta_exec * 1024) / se->weight;
    se->vruntime += weighted;
    
    // Update min_vruntime (never go backwards)
    if (se->vruntime > se->cfs_rq->min_vruntime) {
        se->cfs_rq->min_vruntime = se->vruntime;
    }
}

/* CFS: Insert task into tree */
void cfs_enqueue(cfs_rq_t* rq, cfs_task_t* se) {
    // New task starts at current min_vruntime (fair start)
    if (se->vruntime < rq->min_vruntime) {
        se->vruntime = rq->min_vruntime;
    }
    
    // Insert into RB-tree, sorted by vruntime
    rb_insert(&rq->tasks_timeline, &se->run_node, cfs_compare_vruntime);
    
    // Update leftmost if needed
    if (!rq->leftmost || se->vruntime < cfs_get_vruntime(rq->leftmost)) {
        rq->leftmost = &se->run_node;
    }
    rq->nr_running++;
}
Why vruntime is Brilliant: CFS doesn't need complex priority queues. A task that used 50ms of CPU will have higher vruntime than one that used 10ms—so the 10ms task runs next. The "nice" value affects the rate at which vruntime increases.
  Scheduler     Data Structure       Complexity         Best For
  ─────────     ──────────────       ──────────         ────────
  Round-Robin   Queue                O(1)               Simple, equal priority
  MLFQ          Multiple Queues      O(1)               Auto-detects interactive
  CFS           Red-Black Tree       O(log n)           General-purpose fairness
  EDF           Sorted by deadline   O(n) or O(log n)   Real-time systems
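To see how nice values translate into CPU share, here is the vruntime arithmetic as a runnable sketch, using weights from the table above (nice 0 = 1024, nice 5 = 335) and integer math as in the kernel snippet; times here are microseconds:

```c
#include <assert.h>
#include <stdint.h>

/* vruntime accrual: vruntime += delta_exec * NICE_0_WEIGHT / weight. */
static uint64_t vruntime_delta(uint64_t delta_exec_us, uint64_t weight) {
    return delta_exec_us * 1024 / weight;
}
```

If a nice-0 task and a nice-5 task each run 10ms, the nice-0 task accrues 10,000 units of vruntime while the nice-5 task accrues about 30,567 (a factor of 1024/335 ≈ 3.06). Since CFS always runs the smallest vruntime, the nice-5 task ends up with roughly a third of the CPU.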

Memory Optimization

Memory management has hidden costs that cripple performance when ignored. Every memory access triggers address translation, and every allocation has overhead. Optimizing these paths makes a huge difference.

TLB Efficiency

The Translation Lookaside Buffer (TLB) caches recent page table lookups. A TLB hit is ~1 cycle; a TLB miss requires a page table walk (~100 cycles on x86-64 with 4-level paging). Since nearly every instruction touches memory, TLB efficiency is critical!

╔═════════════════════════════════════════════════════════════════════════════╗
                     TLB HIT vs MISS                                         
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  TLB HIT (~1 cycle)
  ┌───────────┐    hit!    ┌──────────────────┐
  │    CPU    │ ─────────▶ │    TLB Cache     │ ──────▶ Physical Address
  │ (virtual) │            │(64-1024 entries) │
  └───────────┘            └──────────────────┘

  TLB MISS (~100+ cycles)
  ┌───────────┐    miss    ┌──────────────────┐
  │    CPU    │ ─────────▶ │    TLB Cache     │
  │ (virtual) │            └────────┬─────────┘
  └───────────┘                     │ Page Table Walk (4 memory accesses!)
                                    ▼
  ┌─────┐──▶┌─────┐──▶┌─────┐──▶┌─────┐──▶ Physical Address
  │PML4 │   │PDPT │   │ PD  │   │ PT  │
  └─────┘   └─────┘   └─────┘   └─────┘
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
TLB Size Matters: A typical CPU has only 64-1024 TLB entries. With 4KB pages, that covers only 256KB-4MB of address space. Access scattered memory and you'll TLB miss constantly!

Huge Pages

Huge pages (2MB or 1GB) cover more memory with fewer TLB entries. One 2MB huge page covers 512× as much memory as a 4KB page—perfect for large allocations!

  Page Size   TLB Coverage (64 entries)   Page Table Levels             Use Case
  ─────────   ─────────────────────────   ─────────────────             ────────
  4 KB        256 KB                      4 levels (PML4→PDPT→PD→PT)    General purpose
  2 MB        128 MB                      3 levels (PML4→PDPT→PD)       Large heaps, databases
  1 GB        64 GB                       2 levels (PML4→PDPT)          Scientific computing, VMs
/* Huge Page Allocation */
#define PAGE_SIZE_4K   0x1000       // 4 KB
#define PAGE_SIZE_2M   0x200000     // 2 MB
#define PAGE_SIZE_1G   0x40000000   // 1 GB

/* Page table entry flags for huge pages */
#define PTE_HUGE       (1 << 7)     // PS bit - Page Size (makes PD entry point to 2MB page)

/* Allocate 2MB huge page */
void* alloc_huge_page_2m(void) {
    // Find 2MB-aligned physical frame
    uint64_t phys = alloc_physical_2m_aligned();
    if (!phys) return NULL;
    
    // Map at PD level (skip PT level entirely)
    uint64_t virt = find_free_vm_range(PAGE_SIZE_2M);
    
    // Set huge page flag in PD entry
    uint64_t* pd_entry = get_pd_entry(virt);
    *pd_entry = phys | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE;
    
    return (void*)virt;
}

/* When to use huge pages */
// ✓ Large contiguous allocations (>2MB)
// ✓ Memory-mapped files
// ✓ Database buffer pools
// ✓ Scientific computing arrays
// ✗ Small allocations (wastes memory)
// ✗ Fragmented access patterns

Slab Allocator

The kernel allocates many small, fixed-size objects (task structs, file descriptors, inodes). A general-purpose allocator like malloc has overhead for each allocation. The slab allocator pre-allocates "slabs" of memory divided into fixed-size slots.

╔═════════════════════════════════════════════════════════════════════════════╗
                     SLAB ALLOCATOR STRUCTURE                                
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  kmem_cache "task_struct" (object_size = 2048 bytes)                      
  ┌────────────────────────────────────────────────────────┐                
    partial ──▶ ┌───────────────────────────────────┐
                │  Slab 1 (has free slots)          │
                │ ┌────┬────┬────┬────┬────┬────┐   │
                │ │used│free│used│used│free│free│   │
                │ └────┴────┴────┴────┴────┴────┘   │
                └───────────────────────────────────┘

    full ─────▶ ┌───────────────────────────────────┐
                │  Slab 2 (all used)                │
                │ ┌────┬────┬────┬────┬────┬────┐   │
                │ │used│used│used│used│used│used│   │
                │ └────┴────┴────┴────┴────┴────┘   │
                └───────────────────────────────────┘
  └────────────────────────────────────────────────────────┘                
                                                                             
  • Allocation: Find partial slab, flip bit in bitmap, return pointer      
  • Free: Clear bit in bitmap, move to partial list if was full            
  • No fragmentation—all objects same size!                                
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
/* Slab Allocator for Fixed-Size Objects */
typedef struct slab {
    void*  base;            // Slab memory base
    size_t object_size;     // Size of each object
    size_t num_objects;     // Objects per slab
    uint64_t* bitmap;       // Allocation bitmap
    struct slab* next;      // Next slab in cache
} slab_t;

typedef struct {
    const char* name;
    size_t object_size;
    slab_t* slabs;
    slab_t* partial;        // Slabs with free space
    uint64_t allocations;
    uint64_t frees;
} slab_cache_t;

/* Allocate object from slab cache */
void* slab_alloc(slab_cache_t* cache) {
    // Find slab with free space
    for (slab_t* slab = cache->partial; slab; slab = slab->next) {
        int idx = find_first_zero_bit(slab->bitmap, slab->num_objects);
        if (idx >= 0) {
            set_bit(slab->bitmap, idx);
            cache->allocations++;
            return (uint8_t*)slab->base + idx * cache->object_size;
        }
    }
    
    // No space - create new slab
    slab_t* new_slab = create_slab(cache);
    set_bit(new_slab->bitmap, 0);
    new_slab->next = cache->partial;
    cache->partial = new_slab;
    cache->allocations++;
    
    return new_slab->base;
}
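The free path mentioned in the diagram is just as simple: compute the slot index from the pointer and clear its bit. A self-contained sketch, using a pared-down slab (a single uint64_t bitmap, so up to 64 objects) rather than the bitmap array above:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Pared-down slab: one 64-bit bitmap, up to 64 objects. */
typedef struct slab {
    uint8_t* base;
    size_t   object_size;
    size_t   num_objects;
    uint64_t bitmap;            // 1 bit per slot, 1 = allocated
} slab_t;

/* Free an object: derive its slot index from the pointer and clear
   the bit. O(1), and no size argument needed -- the slab knows it. */
static int slab_free(slab_t* slab, void* obj) {
    size_t off = (uint8_t*)obj - slab->base;
    if (off % slab->object_size != 0) return -1;     // Not a slot boundary
    size_t idx = off / slab->object_size;
    if (idx >= slab->num_objects) return -1;         // Out of range
    if (!(slab->bitmap & (1ULL << idx))) return -1;  // Double free
    slab->bitmap &= ~(1ULL << idx);                  // Mark slot free
    return 0;
}
```

Note that the pointer itself identifies the object's size class, which is why kernel slab APIs (like Linux's kmem_cache_free) don't need a length parameter.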

Profiling Tools

You can't optimize what you can't measure. Modern CPUs provide hardware performance counters that track cache misses, branch mispredictions, and cycles with zero overhead. Building profiling into your OS is essential for finding bottlenecks.

Performance Counters (PMC)

Performance Monitoring Counters (PMC) are special CPU registers that count hardware events. x86 CPUs have 4-8 general-purpose counters that can track various events.

  Counter Event          What It Measures           Optimization Insight
  ─────────────          ────────────────           ────────────────────
  CPU_CYCLES             Total clock cycles         Overall time spent
  INSTRUCTIONS_RETIRED   Instructions completed     IPC (Instr/Cycle) efficiency
  LLC_MISSES             Last-level cache misses    Memory bandwidth issues
  BRANCH_MISPREDICT      Wrong branch predictions   Unpredictable code paths
  DTLB_MISS              Data TLB misses            Need huge pages?
/* Read CPU Performance Counter (RDPMC) */
static inline uint64_t read_pmc(uint32_t counter) {
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* Read Time Stamp Counter */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Simple profiling macro */
#define PROFILE_START(name) uint64_t _prof_##name = read_tsc()
#define PROFILE_END(name) do { \
    uint64_t _end = read_tsc(); \
    kprintf("%s: %lu cycles\n", #name, _end - _prof_##name); \
} while(0)

/* Configure PMC to count cache misses */
void setup_pmc_cache_misses(void) {
    // PMC setup via MSRs (Model-Specific Registers)
    // Event 0x2E = LLC_MISSES, Umask 0x41
    uint64_t event_select = 0x412E;  // LLC misses
    event_select |= (1 << 16);       // USR - count in user mode
    event_select |= (1 << 17);       // OS - count in kernel mode
    event_select |= (1 << 22);       // EN - enable counting
    
    wrmsr(MSR_PERFEVTSEL0, event_select);
    wrmsr(MSR_PMC0, 0);  // Reset counter
}

uint64_t read_cache_misses(void) {
    return read_pmc(0);  // Read PMC0
}

Kernel Tracing

Kernel tracing records events as they happen—syscalls, interrupts, context switches—creating a timeline you can analyze. This is invaluable for finding why the system "feels slow."

/* Simple Ring Buffer for Kernel Tracing */
#define TRACE_BUFFER_SIZE 65536

typedef struct trace_event {
    uint64_t timestamp;     // TSC timestamp
    uint32_t event_type;    // TRACE_SYSCALL, TRACE_IRQ, etc.
    uint32_t cpu;           // Which CPU
    uint32_t pid;           // Process ID
    uint64_t arg1, arg2;    // Event-specific data
} trace_event_t;

typedef struct trace_buffer {
    trace_event_t events[TRACE_BUFFER_SIZE];
    volatile uint32_t head;
    volatile uint32_t tail;
    bool enabled;
} trace_buffer_t;

trace_buffer_t trace_buf;

/* Record trace event (called from kernel code) */
void trace_record(uint32_t type, uint64_t arg1, uint64_t arg2) {
    if (!trace_buf.enabled) return;
    
    // TRACE_BUFFER_SIZE is a power of two, so the modulo wrap is cheap;
    // when the buffer fills, the oldest events are overwritten (tail is
    // advanced by the reader that drains the buffer)
    uint32_t idx = atomic_inc(&trace_buf.head) % TRACE_BUFFER_SIZE;
    trace_event_t* e = &trace_buf.events[idx];
    
    e->timestamp = read_tsc();
    e->event_type = type;
    e->cpu = get_cpu_id();
    e->pid = current->pid;
    e->arg1 = arg1;
    e->arg2 = arg2;
}

/* Trace points in kernel */
#define TRACE_SYSCALL_ENTER  1
#define TRACE_SYSCALL_EXIT   2
#define TRACE_IRQ_ENTER      3
#define TRACE_IRQ_EXIT       4
#define TRACE_SCHED_SWITCH   5
#define TRACE_PAGE_FAULT     6

/* Example: Trace syscall entry. C cannot forward "..." to another
   variadic call, so pass the six syscall argument registers explicitly
   (the x86-64 ABI allows up to six). */
long syscall_handler(long num, long a1, long a2, long a3,
                     long a4, long a5, long a6) {
    trace_record(TRACE_SYSCALL_ENTER, num, a1);
    
    long result = do_syscall(num, a1, a2, a3, a4, a5, a6);
    
    trace_record(TRACE_SYSCALL_EXIT, num, result);
    return result;
}
Visualization: Export trace data to a file, then visualize with tools like Chrome's chrome://tracing or Perfetto. You'll see exactly what the kernel was doing at each microsecond!

What You Can Build

Phase 16 Project: A fast OS! Your system now has disk caching for fast file access, an optimized scheduler for responsive multitasking, efficient memory allocation, and tools to measure performance.

Let's build a performance benchmark suite that measures all the optimizations we've implemented:

/* performance_benchmark.c - Comprehensive OS Performance Suite */
#include "cache.h"
#include "scheduler.h"
#include "memory.h"
#include "profiling.h"

typedef struct benchmark_result {
    const char* name;
    uint64_t cycles;
    uint64_t cache_misses;
    uint64_t operations;
    double ops_per_second;
} benchmark_result_t;

/* Benchmark 1: Block Cache Performance */
benchmark_result_t bench_block_cache(void) {
    const uint64_t NUM_OPS = 100000;
    uint64_t start_cycles = read_tsc();
    uint64_t start_misses = read_cache_misses();
    
    // First pass: cold cache (all misses)
    for (uint64_t i = 0; i < 1000; i++) {
        cache_get(&block_cache, i);
    }
    
    // Second pass: warm cache (should hit)
    for (uint64_t i = 0; i < NUM_OPS; i++) {
        cache_get(&block_cache, i % 1000);  // Access within cache size
    }
    
    uint64_t end_cycles = read_tsc();
    uint64_t end_misses = read_cache_misses();
    
    benchmark_result_t result = {
        .name = "Block Cache",
        .cycles = end_cycles - start_cycles,
        .cache_misses = end_misses - start_misses,
        .operations = NUM_OPS,
        .ops_per_second = (double)NUM_OPS * cpu_freq / (end_cycles - start_cycles)  // cpu_freq in Hz; cast avoids integer division
    };
    
    kprintf("[BENCH] %s: %lu ops, %lu cycles, %.2f ops/sec, hit_rate=%.1f%%\n",
            result.name, result.operations, result.cycles, result.ops_per_second,
            100.0 * block_cache.hits / (block_cache.hits + block_cache.misses));
    
    return result;
}

/* Benchmark 2: Slab Allocator vs Generic Malloc */
benchmark_result_t bench_slab_allocator(void) {
    const uint64_t NUM_ALLOCS = 50000;
    void* ptrs[1000];
    
    // Benchmark slab allocator
    uint64_t slab_start = read_tsc();
    for (uint64_t i = 0; i < NUM_ALLOCS; i++) {
        ptrs[i % 1000] = slab_alloc(&task_cache);  // Pre-sized for task_struct
        if (i >= 1000) slab_free(&task_cache, ptrs[(i - 1000) % 1000]);
    }
    uint64_t slab_cycles = read_tsc() - slab_start;
    
    // Benchmark generic malloc
    uint64_t malloc_start = read_tsc();
    for (uint64_t i = 0; i < NUM_ALLOCS; i++) {
        ptrs[i % 1000] = kmalloc(sizeof(task_t));
        if (i >= 1000) kfree(ptrs[(i - 1000) % 1000]);
    }
    uint64_t malloc_cycles = read_tsc() - malloc_start;
    
    kprintf("[BENCH] Slab: %lu cycles, Malloc: %lu cycles (%.1fx faster)\n",
            slab_cycles, malloc_cycles, (double)malloc_cycles / slab_cycles);
    
    return (benchmark_result_t){
        .name = "Slab Allocator",
        .cycles = slab_cycles,
        .operations = NUM_ALLOCS
    };
}

/* Benchmark 3: Scheduler Latency */
benchmark_result_t bench_scheduler_latency(void) {
    const int NUM_SAMPLES = 1000;
    uint64_t latencies[NUM_SAMPLES];
    
    for (int i = 0; i < NUM_SAMPLES; i++) {
        uint64_t start = read_tsc();
        yield();  // Voluntary context switch
        latencies[i] = read_tsc() - start;
    }
    
    // Calculate statistics
    uint64_t sum = 0, min = UINT64_MAX, max = 0;
    for (int i = 0; i < NUM_SAMPLES; i++) {
        sum += latencies[i];
        if (latencies[i] < min) min = latencies[i];
        if (latencies[i] > max) max = latencies[i];
    }
    
    kprintf("[BENCH] Context Switch: avg=%lu, min=%lu, max=%lu cycles\n",
            sum / NUM_SAMPLES, min, max);
    
    return (benchmark_result_t){
        .name = "Context Switch",
        .cycles = sum / NUM_SAMPLES,
        .operations = NUM_SAMPLES
    };
}

/* Benchmark 4: TLB/Huge Page Impact */
benchmark_result_t bench_huge_pages(void) {
    const size_t SIZE = 64 * 1024 * 1024;  // 64 MB
    
    // Allocate with 4KB pages
    uint64_t start_4k = read_tsc();
    uint64_t start_tlb = read_pmc_tlb_misses();
    char* mem_4k = vmalloc(SIZE);  // Regular pages
    for (size_t i = 0; i < SIZE; i += 4096) {
        mem_4k[i] = 1;  // Touch each page
    }
    uint64_t cycles_4k = read_tsc() - start_4k;
    uint64_t tlb_4k = read_pmc_tlb_misses() - start_tlb;
    vfree(mem_4k);
    
    // Allocate with 2MB huge pages
    uint64_t start_2m = read_tsc();
    start_tlb = read_pmc_tlb_misses();
    char* mem_2m = vmalloc_huge(SIZE);  // Huge pages
    for (size_t i = 0; i < SIZE; i += 4096) {
        mem_2m[i] = 1;
    }
    uint64_t cycles_2m = read_tsc() - start_2m;
    uint64_t tlb_2m = read_pmc_tlb_misses() - start_tlb;
    vfree(mem_2m);
    
    kprintf("[BENCH] 4KB pages: %lu cycles, %lu TLB misses\n", cycles_4k, tlb_4k);
    kprintf("[BENCH] 2MB pages: %lu cycles, %lu TLB misses (%.1fx faster)\n",
            cycles_2m, tlb_2m, (double)cycles_4k / cycles_2m);
    
    return (benchmark_result_t){
        .name = "Huge Pages",
        .cycles = cycles_2m,
        .operations = SIZE / 4096
    };
}

/* Run all benchmarks */
void run_performance_suite(void) {
    kprintf("\n=== WasilOS Performance Benchmark Suite ===\n\n");
    
    // Enable performance counters
    setup_pmc_cache_misses();
    
    benchmark_result_t results[4];
    results[0] = bench_block_cache();
    results[1] = bench_slab_allocator();
    results[2] = bench_scheduler_latency();
    results[3] = bench_huge_pages();
    
    kprintf("\n=== Benchmark Complete ===\n");
}

Exercises

Exercise 1: Implement Read-Ahead

When reading block N, also prefetch blocks N+1, N+2, etc. into the cache. Most file access is sequential, so this dramatically improves performance.

void cache_readahead(block_cache_t* cache, uint64_t block, int count) {
    for (int i = 1; i <= count; i++) {
        // Async prefetch without blocking
        cache_prefetch(cache, block + i);
    }
}

Exercise 2: Implement CLOCK Algorithm

CLOCK is an approximation of LRU that's cheaper to implement. Instead of moving entries on every access, it uses a "second chance" bit.
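One possible shape for the second-chance sweep, as a sketch: clock_state_t, clock_touch, clock_evict, and NFRAMES are illustrative names, assuming a fixed array of page frames.

```c
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 8

typedef struct {
    bool valid[NFRAMES];   /* frame currently holds a page */
    bool used[NFRAMES];    /* reference bit: set on access */
    size_t hand;           /* clock hand position */
} clock_state_t;

/* On access: an O(1) bit set -- no list manipulation, unlike strict LRU */
void clock_touch(clock_state_t *c, size_t frame) {
    c->used[frame] = true;
}

/* On eviction: sweep the hand, clearing reference bits ("second chance")
   until a frame with a clear bit is found */
size_t clock_evict(clock_state_t *c) {
    for (;;) {
        size_t f = c->hand;
        c->hand = (c->hand + 1) % NFRAMES;
        if (!c->valid[f]) return f;          /* free frame: use it */
        if (c->used[f]) c->used[f] = false;  /* recently used: spare it once */
        else return f;                       /* victim */
    }
}
```

A frame survives at most one full sweep without being touched, which is why CLOCK tracks LRU closely at a fraction of the bookkeeping cost.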

Exercise 3: Add Nice Support

Implement the nice() syscall that allows processes to adjust their priority. Higher nice values = lower priority = slower vruntime increase.
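The vruntime side of this can be sketched as follows, assuming the CFS scheduler built earlier in this phase. Each nice step scales CPU share by about 1.25x (weight = 1024 / 1.25^nice); the five table entries match Linux's sched_prio_to_weight values for nice -2..+2, truncated here for brevity. calc_delta_vruntime is an illustrative name.

```c
#include <stdint.h>

/* CFS-style weights for nice -2..+2 (full table spans -20..19) */
static const unsigned int nice_to_weight[] = {
    /* -2 */ 1586, /* -1 */ 1277, /* 0 */ 1024, /* +1 */ 820, /* +2 */ 655,
};

/* vruntime advances more slowly for heavier (lower-nice) tasks, so the
   scheduler picks them more often:
   delta_vruntime = delta_exec * 1024 / weight */
uint64_t calc_delta_vruntime(uint64_t delta_exec, int nice) {
    unsigned int weight = nice_to_weight[nice + 2];  /* valid for -2..+2 */
    return delta_exec * 1024 / weight;
}
```

At nice 0 vruntime tracks real execution time exactly; a nice +2 task accumulates vruntime faster than wall time and therefore runs less.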

Exercise 4: Build a Profiler

Create a simple sampling profiler that uses timer interrupts to record which function is executing. Build a histogram of where time is spent.
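The core of the sampling step can be sketched like this, assuming the timer handler can pass in the interrupted instruction pointer; profile_sample, text_start, and the bucket sizing are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define PROF_BUCKETS 1024
#define BUCKET_SHIFT 6   /* 64-byte address buckets */

static uint32_t prof_hist[PROF_BUCKETS];

/* Called on each timer tick with the interrupted RIP; text_start is
   the base of the kernel's text segment (an assumed symbol). Hot code
   shows up as tall buckets after a few seconds of sampling. */
void profile_sample(uintptr_t rip, uintptr_t text_start) {
    size_t bucket = ((rip - text_start) >> BUCKET_SHIFT) % PROF_BUCKETS;
    prof_hist[bucket]++;
}
```

Mapping a hot bucket back to a function name just requires a sorted symbol table (the kernel's own, or one loaded from the build artifacts).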

╔═════════════════════════════════════════════════════════════════════════════╗
                    PHASE 16 → PHASE 17 TRANSITION                           
╠═════════════════════════════════════════════════════════════════════════════╣
                                                                             
  Phase 16 Complete:                                                        
  ┌─────────────────────────────────────────────────────────────────────┐  
   ✓ Block cache (90%+ hit rate)                                         
   ✓ Page cache (memory-mapped files)                                    
   ✓ LRU eviction policy                                                 
   ✓ MLFQ scheduler (priority queues)                                    
   ✓ CFS scheduler (red-black tree)                                      
   ✓ TLB optimization                                                    
   ✓ Huge pages (2MB/1GB)                                                
   ✓ Slab allocator                                                      
   ✓ Performance counters (PMC)                                          
   ✓ Kernel tracing                                                      
  └─────────────────────────────────────────────────────────────────────┘  


  Phase 17 Preview - Stability, Security & Finishing:                      
  ┌─────────────────────────────────────────────────────────────────────┐  
   • Kernel debugging (stack traces, panic handling)                     
   • Security hardening (ASLR, stack canaries, NX)                       
   • User/kernel separation                                              
   • Complete bootable OS image                                          
  └─────────────────────────────────────────────────────────────────────┘  
                                                                             
╚═════════════════════════════════════════════════════════════════════════════╝
Concept          What We Built                Performance Impact
──────────────   ──────────────────────────   ──────────────────────────────
Block Cache      Hash table + LRU list        10,000x faster than disk
Page Cache       File pages in memory         Enables mmap()
CFS Scheduler    Red-black tree by vruntime   Fair + responsive
Huge Pages       2MB/1GB page support         10-50% faster for large allocs
Slab Allocator   Fixed-size object pools      10x faster than malloc
Profiling        PMC + tracing                Find what to optimize

Next Steps

With performance optimized, it's time to make the OS stable and secure. In Phase 17, we'll add debugging tools, implement security hardening, and put the finishing touches on our complete operating system.
