Introduction: Performance Matters
Phase 16 Goals: By the end of this phase, your OS will have disk and page caching, an optimized scheduler, efficient memory allocation, and profiling tools to measure and improve performance.
Phase 0: Orientation & Big Picture
OS fundamentals, kernel architectures, learning path
Phase 1: How a Computer Starts
BIOS/UEFI, boot sequence, dev environment
Phase 2: Real Mode - First Steps
Real mode, bootloader, BIOS interrupts
Phase 3: Entering Protected Mode
GDT, 32-bit mode, C code execution
Phase 4: Display, Input & Output
VGA text mode, keyboard handling
Phase 5: Interrupts & CPU Control
IDT, ISRs, PIC programming
Phase 6: Memory Management
Paging, virtual memory, heap allocator
Phase 7: Disk Access & Filesystems
Block devices, FAT, VFS layer
Phase 8: Processes & User Mode
Task switching, system calls, user space
Phase 9: ELF Loading & Executables
ELF format, program loading
Phase 10: Standard Library & Shell
C library, command-line shell
Phase 11: 64-Bit Long Mode
x86-64, 64-bit paging, modern architecture
Phase 12: Modern Booting with UEFI
UEFI boot services, memory maps
Phase 13: Graphics & GUI Systems
Framebuffer, windowing, drawing
Phase 14: Advanced Input & Timing
Mouse, high-precision timers
Phase 15: Hardware Discovery & Drivers
PCI, device drivers, NVMe
Phase 16: Performance & Optimization
Caching, scheduler tuning
You Are Here
Phase 17: Stability, Security & Finishing
Debugging, hardening, completion
We've built a functional operating system over 15 phases. But "works" isn't the same as "works well". An OS that freezes when copying files or stutters during video playback isn't usable. Performance optimization is what transforms a toy OS into something production-ready.
╔═════════════════════════════════════════════════════════════════════════════╗
║ MEMORY & STORAGE LATENCY HIERARCHY ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ Access Time Component Typical Size ║
║ ───────────── ─────────────── ──────────────── ║
║ ║
║ ~1 ns L1 Cache 32-64 KB ║
║ ~4 ns L2 Cache 256 KB - 1 MB ║
║ ~10 ns L3 Cache 8-64 MB ║
║ ~100 ns RAM (DDR4/5) 8-128 GB ║
║ ~10,000 ns NVMe SSD 256 GB - 8 TB ║
║ ~10,000,000 ns HDD (Spinning) 1-20 TB ║
║ ║
║ HDD is 10,000,000x slower than L1 cache! ║
║ Caching transforms disk access from unusable to instant. ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
Key Insight: Making an OS fast requires understanding where time is spent. Caching reduces disk latency, scheduler tuning improves responsiveness, and memory optimizations reduce allocation overhead.
Performance Mindset
Before optimizing, you need to measure. Donald Knuth famously said: "Premature optimization is the root of all evil." The corollary is: profile first, optimize second.
| Principle | Meaning | Example |
|---|---|---|
| Measure First | Profile before optimizing | Find that 90% of time is in disk I/O, not CPU |
| Amdahl's Law | Speedup limited by serial portions | 10% serial code limits speedup to 10x max |
| Locality | Access nearby data together | Array of structs vs struct of arrays |
| Avoid Copies | Zero-copy where possible | Pass pointers, use DMA for disk I/O |
| Batch Operations | Amortize overhead over many items | Write 100 sectors at once, not 1 at a time |
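The Amdahl's Law row is easy to verify numerically. A minimal sketch, using the 10% serial figure from the table:

```c
/* Amdahl's Law: overall speedup with n cores when a fraction of the
 * work is inherently serial. speedup = 1 / (s + (1 - s) / n) */
static double amdahl_speedup(double serial_fraction, int n_cores) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores);
}
```

With 10% serial code, 16 cores give only 6.4x, and even a million cores cannot push the speedup past 10x — which is why shrinking the serial portion (e.g. finer-grained locking) matters more than adding cores.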
Common Bottlenecks
In an OS, the major performance bottlenecks are:
╔═════════════════════════════════════════════════════════════════════════════╗
║ OS PERFORMANCE BOTTLENECKS ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ ┌───────────────────────────────────────────────────────────────────┐ ║
║ │ 1. DISK I/O ─ Uncached disk reads are killers (~10ms each!) │ ║
║ │ → Solution: Block cache, read-ahead, async I/O │ ║
║ └───────────────────────────────────────────────────────────────────┘ ║
║ ║
║ ┌───────────────────────────────────────────────────────────────────┐ ║
║ │ 2. CONTEXT SWITCHES ─ Task switching overhead adds up │ ║
║ │ → Solution: Smart scheduler, longer time slices where OK │ ║
║ └───────────────────────────────────────────────────────────────────┘ ║
║ ║
║ ┌───────────────────────────────────────────────────────────────────┐ ║
║ │ 3. MEMORY ALLOCATION ─ malloc/free are surprisingly slow │ ║
║ │ → Solution: Slab allocator, object pools │ ║
║ └───────────────────────────────────────────────────────────────────┘ ║
║ ║
║ ┌───────────────────────────────────────────────────────────────────┐ ║
║ │ 4. TLB MISSES ─ Page table walks are expensive (~100 cycles) │ ║
║ │ → Solution: Huge pages, minimize mappings, locality │ ║
║ └───────────────────────────────────────────────────────────────────┘ ║
║ ║
║ ┌───────────────────────────────────────────────────────────────────┐ ║
║ │ 5. LOCK CONTENTION ─ Threads waiting on each other │ ║
║ │ → Solution: Lock-free structures, finer-grained locks │ ║
║ └───────────────────────────────────────────────────────────────────┘ ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
Caching Strategies
Caching is the single most important performance optimization in an OS. By keeping frequently accessed data in memory, we avoid slow disk operations. A good cache can make your OS feel 100x faster!
╔═════════════════════════════════════════════════════════════════════════════╗
║ OS CACHING LAYERS ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ Application ║
║ │ ║
║ ▼ read("/home/user/file.txt") ║
║ ┌─────────────────────────────┐ ║
║ │ Page Cache │ File data cached as memory pages ║
║ │ (virtual file pages) │ Serves most read() calls from RAM ║
║ └─────────────┬───────────────┘ ║
║ │ ║
║ ▼ Page miss → lookup block number ║
║ ┌─────────────────────────────┐ ║
║ │ Block Cache │ Raw disk sectors cached in RAM ║
║ │ (disk sectors/blocks) │ Indexed by (device, block_num) ║
║ └─────────────┬───────────────┘ ║
║ │ ║
║ ▼ Block miss → actual disk I/O ║
║ ┌─────────────────────────────┐ ║
║ │ Disk Driver │ AHCI/NVMe actually reads from disk ║
║ │ (SLOW! Avoid at all cost)│ HDD: ~10ms, SSD: ~0.1ms ║
║ └─────────────────────────────┘ ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
Block Cache
The block cache stores recently accessed disk sectors in memory. Before reading from disk, we check the cache—a hit returns instantly (nanoseconds vs. milliseconds).
/* Block Cache Entry */
typedef struct cache_entry {
uint64_t block_num; // Disk block number
uint8_t* data; // Block data (512/4096 bytes)
uint32_t flags; // Dirty, valid, etc.
uint64_t access_time; // For LRU eviction
struct cache_entry* next; // Hash chain
struct cache_entry* lru_prev;
struct cache_entry* lru_next;
} cache_entry_t;
/* Block Cache */
#define CACHE_SIZE 1024 // Number of cached blocks
#define HASH_SIZE 256
typedef struct {
cache_entry_t* hash[HASH_SIZE];
cache_entry_t* lru_head;
cache_entry_t* lru_tail;
uint32_t hits;
uint32_t misses;
} block_cache_t;
/* Get cached block or read from disk */
cache_entry_t* cache_get(block_cache_t* cache, uint64_t block) {
uint32_t hash = block % HASH_SIZE;
// Search hash chain
for (cache_entry_t* e = cache->hash[hash]; e; e = e->next) {
if (e->block_num == block) {
cache->hits++;
lru_touch(cache, e); // Move to front
return e;
}
}
// Cache miss - read from disk
cache->misses++;
cache_entry_t* entry = lru_evict_or_alloc(cache);
entry->block_num = block;
entry->flags = 0; // Fresh entry: clear stale dirty/valid bits
disk_read(block, entry->data);
// Insert into hash chain and at the head of the LRU list
entry->next = cache->hash[hash];
cache->hash[hash] = entry;
entry->lru_prev = entry->lru_next = NULL;
lru_touch(cache, entry);
return entry;
}
Cache Hit Rate: Track hits / (hits + misses). A good block cache achieves 90%+ hit rate for typical workloads. Even 90% means only 10% of accesses hit disk!
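The hit-rate note can be made concrete with the effective access time formula. A quick sketch using rough figures from the latency table above (100 ns for a RAM hit, 10 ms for an HDD miss):

```c
/* Effective access time of a cached read path, in nanoseconds.
 * t_hit: cost of a cache hit; t_miss: cost of going to disk. */
static double effective_access_ns(double hit_rate, double t_hit, double t_miss) {
    return hit_rate * t_hit + (1.0 - hit_rate) * t_miss;
}
```

At 90% hits the average read still costs about 1 ms — dominated entirely by the 10% of misses. Pushing the hit rate to 99% cuts the average by roughly another 10x, which is why chasing the last few percent of hit rate pays off.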
Page Cache
While the block cache stores raw sectors, the page cache stores file data as memory pages. This integrates with virtual memory—a memory-mapped file is just pages in the page cache!
/* Page Cache - Indexed by (inode, offset) */
typedef struct page_cache_entry {
struct inode* inode; // File inode
uint64_t offset; // Page offset within file
void* page; // 4KB page of file data
uint32_t flags; // PG_DIRTY, PG_UPTODATE, etc.
atomic_t ref_count; // Reference count
struct list_head lru; // LRU list linkage
struct hlist_node hash; // Hash table linkage
} page_cache_entry_t;
/* Find or create page in cache */
page_cache_entry_t* page_cache_lookup(struct inode* inode, uint64_t offset) {
uint64_t page_offset = offset & ~0xFFFULL; // Align down to 4KB page boundary
uint32_t hash = hash_page(inode, page_offset);
// Search hash bucket
spin_lock(&page_cache_lock);
page_cache_entry_t* entry;
hlist_for_each_entry(entry, &page_hash[hash], hash) {
if (entry->inode == inode && entry->offset == page_offset) {
atomic_inc(&entry->ref_count);
lru_touch(entry); // Move to front of LRU
spin_unlock(&page_cache_lock);
return entry;
}
}
spin_unlock(&page_cache_lock);
// Cache miss - allocate new page and read from disk
entry = alloc_page_cache_entry();
entry->inode = inode;
entry->offset = page_offset;
entry->page = alloc_page();
// Read file data into page (uses block cache internally)
inode->i_op->readpage(inode, entry->page, page_offset);
// Insert into cache
spin_lock(&page_cache_lock);
hlist_add_head(&entry->hash, &page_hash[hash]);
list_add(&entry->lru, &page_lru_list);
spin_unlock(&page_cache_lock);
return entry;
}
/* Memory-mapped file read - just returns cached page */
void* mmap_read(struct file* file, uint64_t offset) {
page_cache_entry_t* entry = page_cache_lookup(file->f_inode, offset);
return (uint8_t*)entry->page + (offset & 0xFFF); // Return pointer within page
}
LRU Eviction
When the cache is full, we need to evict old entries. LRU (Least Recently Used) is the classic algorithm—evict the entry that hasn't been accessed for the longest time.
╔═════════════════════════════════════════════════════════════════════════════╗
║ LRU CACHE OPERATION ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ Doubly-Linked List (head = most recent, tail = least recent) ║
║ ║
║ HEAD TAIL ║
║ │ │ ║
║ ▼ ▼ ║
║ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ║
║ │Blk 5│───│Blk 2│───│Blk 9│───│Blk 1│───│Blk 7│───│Blk 3│ ║
║ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ ║
║ newest oldest ║
║ ║
║ Operation: Access Block 1 ║
║ ────────────────────────── ║
║ 1. Remove Block 1 from current position ║
║ 2. Insert Block 1 at HEAD ║
║ ║
║ HEAD TAIL ║
║ │ │ ║
║ ▼ ▼ ║
║ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ║
║ │Blk 1│───│Blk 5│───│Blk 2│───│Blk 9│───│Blk 7│───│Blk 3│ ║
║ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ ║
║ ↑ moved evict me ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
/* LRU List Operations - O(1) */
void lru_touch(block_cache_t* cache, cache_entry_t* entry) {
// Remove from current position
if (entry->lru_prev) entry->lru_prev->lru_next = entry->lru_next;
if (entry->lru_next) entry->lru_next->lru_prev = entry->lru_prev;
if (cache->lru_tail == entry) cache->lru_tail = entry->lru_prev;
// Insert at head
entry->lru_prev = NULL;
entry->lru_next = cache->lru_head;
if (cache->lru_head) cache->lru_head->lru_prev = entry;
cache->lru_head = entry;
if (!cache->lru_tail) cache->lru_tail = entry;
}
cache_entry_t* lru_evict(block_cache_t* cache) {
cache_entry_t* victim = cache->lru_tail;
if (!victim) return NULL;
// Remove from tail
cache->lru_tail = victim->lru_prev;
if (cache->lru_tail) cache->lru_tail->lru_next = NULL;
else cache->lru_head = NULL;
// If dirty, write back to disk first
if (victim->flags & CACHE_DIRTY) {
disk_write(victim->block_num, victim->data);
}
// Remove from hash chain
cache_hash_remove(cache, victim);
return victim;
}
Write-Back vs Write-Through:
- Write-Through: Every write immediately goes to disk. Safe but slow.
- Write-Back: Writes stay in cache (marked dirty), flushed later. Fast but requires careful fsync.
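A write-back cache therefore needs a flush path for fsync and periodic writeback. A minimal sketch, with `cache_entry_t` trimmed to the fields the flush touches and `disk_write` stubbed out:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_DIRTY 0x1

/* Trimmed to the fields the flush path touches */
typedef struct cache_entry {
    uint64_t block_num;
    uint32_t flags;
    struct cache_entry* lru_next;
} cache_entry_t;

static int disk_writes = 0;                       /* stub for real disk I/O */
static void disk_write(uint64_t blk) { (void)blk; disk_writes++; }

/* Write-back flush: walk the cache and persist every dirty block,
 * clearing the dirty bit. This is what fsync() and the periodic
 * writeback timer would call. Returns the number of blocks flushed. */
int cache_sync(cache_entry_t* head) {
    int flushed = 0;
    for (cache_entry_t* e = head; e; e = e->lru_next) {
        if (e->flags & CACHE_DIRTY) {
            disk_write(e->block_num);
            e->flags &= ~CACHE_DIRTY;             /* clean again */
            flushed++;
        }
    }
    return flushed;
}
```

A second `cache_sync` pass right after the first should flush nothing — every entry is clean until the next write dirties it again.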
Scheduler Tuning
The scheduler determines which process runs when. A bad scheduler makes the system feel sluggish even with a fast CPU. Good scheduling balances throughput (total work done) with latency (responsiveness to user input).
╔═════════════════════════════════════════════════════════════════════════════╗
║ SCHEDULER DESIGN TRADEOFFS ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ ┌─────────────────────────────────┐ ┌──────────────────────────────┐ ║
║ │ THROUGHPUT │ │ LATENCY │ ║
║ │ Total work completed/second │ │ Time to respond to events │ ║
║ │ │ │ │ ║
║ │ • Long time slices │ │ • Short time slices │ ║
║ │ • Batch processing │ │ • Interactive priority │ ║
║ │ • Minimize context switches │ │ • Preemption │ ║
║ └─────────────────────────────────┘ └──────────────────────────────┘ ║
║ ║
║ Server (compile farm): Throughput priority ║
║ Desktop (user interaction): Latency priority ║
║ Real-time (audio): Hard latency guarantees ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
Priority Queues (MLFQ)
Multi-Level Feedback Queue (MLFQ) is a classic scheduler that automatically categorizes processes. New processes start at high priority (assumed interactive). If they use their full time slice, they're demoted (likely CPU-bound batch). If they yield early, they stay high (likely I/O-bound interactive).
/* Multi-Level Feedback Queue Scheduler */
#define NUM_PRIORITIES 4
#define QUANTUM_MS 10
typedef struct {
task_t* queues[NUM_PRIORITIES];
uint32_t quantum[NUM_PRIORITIES];
} mlfq_scheduler_t;
mlfq_scheduler_t scheduler = {
.quantum = {5, 10, 20, 40} // ms per priority level
};
/* Schedule next task */
task_t* mlfq_schedule(void) {
// Pick from highest priority non-empty queue
for (int pri = 0; pri < NUM_PRIORITIES; pri++) {
if (scheduler.queues[pri]) {
task_t* task = scheduler.queues[pri];
scheduler.queues[pri] = task->next;
return task;
}
}
return idle_task;
}
/* Handle quantum expiration */
void mlfq_quantum_expired(task_t* task) {
// Demote to lower priority (if not already lowest)
if (task->priority < NUM_PRIORITIES - 1) {
task->priority++;
}
// Re-queue
task->next = scheduler.queues[task->priority];
scheduler.queues[task->priority] = task;
}
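One refinement the demotion rule above needs is a periodic priority boost: without it, a CPU-bound task demoted to the bottom queue can starve indefinitely behind a steady stream of interactive tasks. A sketch, using a `task_t` trimmed to the fields involved:

```c
#include <stddef.h>

#define NUM_PRIORITIES 4

/* Trimmed task: only the fields the boost touches */
typedef struct task {
    int priority;
    struct task* next;
} task_t;

typedef struct {
    task_t* queues[NUM_PRIORITIES];
} mlfq_t;

/* Periodic priority boost: every second or so, move every task back to
 * the top queue so demoted batch tasks cannot starve forever. */
void mlfq_boost(mlfq_t* s) {
    for (int pri = 1; pri < NUM_PRIORITIES; pri++) {
        while (s->queues[pri]) {
            task_t* t = s->queues[pri];
            s->queues[pri] = t->next;    /* unlink from lower queue */
            t->priority = 0;
            t->next = s->queues[0];      /* push onto top queue */
            s->queues[0] = t;
        }
    }
}
```

After the boost, everything restarts at high priority and the normal demotion rule re-sorts interactive from batch tasks within a few quanta.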
Completely Fair Scheduler (CFS)
Linux's Completely Fair Scheduler (CFS) uses a red-black tree to give each task a "fair" share of CPU. Instead of fixed priority queues, CFS tracks virtual runtime (vruntime)—the amount of CPU time a task "should" have received.
╔═════════════════════════════════════════════════════════════════════════════╗
║ CFS RED-BLACK TREE (sorted by vruntime) ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ ┌───────────┐ ║
║ │ Task C │ ║
║ │ vrun=500 │ ║
║ └─────┬─────┘ ║
║ ┌──────────┴──────────┐ ║
║ ▼ ▼ ║
║ ┌───────────┐ ┌───────────┐ ║
║ │ Task A │ │ Task E │ ║
║ │ vrun=200 │ │ vrun=800 │ ║
║ └─────┬─────┘ └───────────┘ ║
║ ┌──────────┴──────────┐ ║
║ ▼ ▼ ║
║ ┌───────────┐ ┌───────────┐ ║
║ │ Task B │←LEFTMOST │ Task D │ ║
║ │ vrun=100 │ │ vrun=350 │ ║
║ └───────────┘ └───────────┘ ║
║ ^ NEXT TO RUN ║
║ ║
║ • Leftmost node = smallest vruntime = most "starved" for CPU ║
║ • After running, task's vruntime increases, moves right in tree ║
║ • Nice value affects HOW FAST vruntime increases (lower nice = slower) ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
/* CFS Scheduler - Red-Black Tree based */
typedef struct cfs_rq {
struct rb_root tasks_timeline; // Red-black tree of tasks
struct rb_node* leftmost; // Task with smallest vruntime
uint64_t min_vruntime; // Minimum vruntime in tree
uint32_t nr_running; // Number of runnable tasks
} cfs_rq_t;
typedef struct cfs_task {
struct task_struct* task;
struct cfs_rq* cfs_rq; // Owning runqueue (needed for min_vruntime updates)
struct rb_node run_node; // RB-tree node
uint64_t vruntime; // Virtual runtime
uint64_t weight; // Nice-based weight (nice 0 = 1024)
} cfs_task_t;
/* Nice to weight conversion (nice 0 = 1024) */
const uint32_t nice_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};
/* CFS: Pick task with smallest vruntime */
cfs_task_t* cfs_pick_next(cfs_rq_t* rq) {
if (!rq->leftmost) return NULL;
return rb_entry(rq->leftmost, cfs_task_t, run_node);
}
/* CFS: Update vruntime after running */
void cfs_update_vruntime(cfs_task_t* se, uint64_t delta_exec) {
// Higher weight = vruntime increases slower = more CPU time
// vruntime += delta_exec * (NICE_0_WEIGHT / se->weight)
uint64_t weighted = (delta_exec * 1024) / se->weight;
se->vruntime += weighted;
// Update min_vruntime (never go backwards)
if (se->vruntime > se->cfs_rq->min_vruntime) {
se->cfs_rq->min_vruntime = se->vruntime;
}
}
/* CFS: Insert task into tree */
void cfs_enqueue(cfs_rq_t* rq, cfs_task_t* se) {
// New task starts at current min_vruntime (fair start)
if (se->vruntime < rq->min_vruntime) {
se->vruntime = rq->min_vruntime;
}
// Insert into RB-tree, sorted by vruntime
rb_insert(&rq->tasks_timeline, &se->run_node, cfs_compare_vruntime);
// Update leftmost if needed
if (!rq->leftmost || se->vruntime < cfs_get_vruntime(rq->leftmost)) {
rq->leftmost = &se->run_node;
}
rq->nr_running++;
}
Why vruntime is Brilliant: CFS doesn't need complex priority queues. A task that used 50ms of CPU will have higher vruntime than one that used 10ms—so the 10ms task runs next. The "nice" value affects the rate at which vruntime increases.
| Scheduler | Data Structure | Complexity | Best For |
|---|---|---|---|
| Round-Robin | Queue | O(1) | Simple, equal priority |
| MLFQ | Multiple Queues | O(1) | Auto-detects interactive |
| CFS | Red-Black Tree | O(log n) | General-purpose fairness |
| EDF | Sorted by deadline | O(n) or O(log n) | Real-time systems |
Memory Optimization
Memory management has hidden costs that cripple performance when ignored. Every memory access triggers address translation, and every allocation has overhead. Optimizing these paths makes a huge difference.
TLB Efficiency
The Translation Lookaside Buffer (TLB) caches recent page table lookups. A TLB hit costs ~1 cycle; a TLB miss requires a page table walk (~100 cycles on x86-64 with 4-level paging). Since every memory access needs a translation, TLB efficiency is critical!
╔═════════════════════════════════════════════════════════════════════════════╗
║ TLB HIT vs MISS ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ TLB HIT (~1 cycle) ║
║ ┌───────────┐ hit! ┌────────────────┐ ║
║ │ CPU │─────────▶│ TLB Cache │──────▶ Physical Address ║
║ │ (virtual) │ │ (64-1024 entries)│ ║
║ └───────────┘ └────────────────┘ ║
║ ║
║ TLB MISS (~100+ cycles) ║
║ ┌───────────┐ miss ┌────────────────┐ ║
║ │ CPU │─────────▶│ TLB Cache │ ║
║ │ (virtual) │ │ │ ║
║ └───────────┘ └───────┬────────┘ ║
║ │ ║
║ ▼ Page Table Walk (4 memory accesses!) ║
║ ┌─────┐──▶┌─────┐──▶┌─────┐──▶┌─────┐──▶ Physical Address ║
║ │ PML4│ │ PDPT│ │ PD │ │ PT │ ║
║ └─────┘ └─────┘ └─────┘ └─────┘ ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
TLB Size Matters: A typical CPU has only 64-1024 TLB entries. With 4KB pages, that covers only 256KB-4MB of address space. Access scattered memory and you'll TLB miss constantly!
Huge Pages
Huge pages (2MB or 1GB) cover more memory with fewer TLB entries. One 2MB huge page covers 512× as much memory as a 4KB page—perfect for large allocations!
| Page Size | TLB Coverage (64 entries) | Page Table Levels | Use Case |
|---|---|---|---|
| 4 KB | 256 KB | 4 levels (PML4→PDPT→PD→PT) | General purpose |
| 2 MB | 128 MB | 3 levels (PML4→PDPT→PD) | Large heaps, databases |
| 1 GB | 64 GB | 2 levels (PML4→PDPT) | Scientific computing, VMs |
/* Huge Page Allocation */
#define PAGE_SIZE_4K 0x1000 // 4 KB
#define PAGE_SIZE_2M 0x200000 // 2 MB
#define PAGE_SIZE_1G 0x40000000 // 1 GB
/* Page table entry flags for huge pages */
#define PTE_HUGE (1 << 7) // PS bit - Page Size (makes PD entry point to 2MB page)
/* Allocate 2MB huge page */
void* alloc_huge_page_2m(void) {
// Find 2MB-aligned physical frame
uint64_t phys = alloc_physical_2m_aligned();
if (!phys) return NULL;
// Map at PD level (skip PT level entirely)
uint64_t virt = find_free_vm_range(PAGE_SIZE_2M);
// Set huge page flag in PD entry
uint64_t* pd_entry = get_pd_entry(virt);
*pd_entry = phys | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE;
return (void*)virt;
}
/* When to use huge pages */
// ✓ Large contiguous allocations (>2MB)
// ✓ Memory-mapped files
// ✓ Database buffer pools
// ✓ Scientific computing arrays
// ✗ Small allocations (wastes memory)
// ✗ Fragmented access patterns
Slab Allocator
The kernel allocates many small, fixed-size objects (task structs, file descriptors, inodes). A general-purpose allocator like malloc has overhead for each allocation. The slab allocator pre-allocates "slabs" of memory divided into fixed-size slots.
╔═════════════════════════════════════════════════════════════════════════════╗
║ SLAB ALLOCATOR STRUCTURE ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ kmem_cache "task_struct" (object_size = 2048 bytes) ║
║ ┌────────────────────────────────────────────────────────┐ ║
║ │ partial ──▶ ┌───────────────────────────────────┐ │ ║
║ │ │ Slab 1 (has free slots) │ │ ║
║ │ │ ┌────┬────┬────┬────┬────┬────┐ │ │ ║
║ │ │ │used│free│used│used│free│free│ │ │ ║
║ │ │ └────┴────┴────┴────┴────┴────┘ │ │ ║
║ │ └───────────────────────────────────┘ │ ║
║ │ │ ║
║ │ full ─────▶ ┌───────────────────────────────────┐ │ ║
║ │ │ Slab 2 (all used) │ │ ║
║ │ │ ┌────┬────┬────┬────┬────┬────┐ │ │ ║
║ │ │ │used│used│used│used│used│used│ │ │ ║
║ │ │ └────┴────┴────┴────┴────┴────┘ │ │ ║
║ │ └───────────────────────────────────┘ │ ║
║ └────────────────────────────────────────────────────────┘ ║
║ ║
║ • Allocation: Find partial slab, flip bit in bitmap, return pointer ║
║ • Free: Clear bit in bitmap, move to partial list if was full ║
║ • No fragmentation—all objects same size! ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
/* Slab Allocator for Fixed-Size Objects */
typedef struct slab {
void* base; // Slab memory base
size_t object_size; // Size of each object
size_t num_objects; // Objects per slab
uint64_t* bitmap; // Allocation bitmap
struct slab* next; // Next slab in cache
} slab_t;
typedef struct {
const char* name;
size_t object_size;
slab_t* slabs;
slab_t* partial; // Slabs with free space
uint64_t allocations;
uint64_t frees;
} slab_cache_t;
/* Allocate object from slab cache */
void* slab_alloc(slab_cache_t* cache) {
// Find slab with free space
for (slab_t* slab = cache->partial; slab; slab = slab->next) {
int idx = find_first_zero_bit(slab->bitmap, slab->num_objects);
if (idx >= 0) {
set_bit(slab->bitmap, idx);
cache->allocations++;
return (uint8_t*)slab->base + idx * cache->object_size;
}
}
// No space - create new slab
slab_t* new_slab = create_slab(cache);
set_bit(new_slab->bitmap, 0);
new_slab->next = cache->partial;
cache->partial = new_slab;
cache->allocations++;
return new_slab->base;
}
Profiling Tools
You can't optimize what you can't measure. Modern CPUs provide hardware performance counters that track cache misses, branch mispredictions, and cycles with essentially zero runtime overhead. Building profiling into your OS is essential for finding bottlenecks.
Performance Counters (PMC)
Performance Monitoring Counters (PMC) are special CPU registers that count hardware events. x86 CPUs have 4-8 general-purpose counters that can track various events.
| Counter Event | What It Measures | Optimization Insight |
|---|---|---|
| CPU_CYCLES | Total clock cycles | Overall time spent |
| INSTRUCTIONS_RETIRED | Instructions completed | IPC (Instr/Cycle) efficiency |
| LLC_MISSES | Last-level cache misses | Memory bandwidth issues |
| BRANCH_MISPREDICT | Wrong branch predictions | Unpredictable code paths |
| DTLB_MISS | Data TLB misses | Need huge pages? |
/* Read CPU Performance Counter (RDPMC) */
static inline uint64_t read_pmc(uint32_t counter) {
uint32_t lo, hi;
__asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
return ((uint64_t)hi << 32) | lo;
}
/* Read Time Stamp Counter */
static inline uint64_t read_tsc(void) {
uint32_t lo, hi;
__asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)hi << 32) | lo;
}
/* Simple profiling macro */
#define PROFILE_START(name) uint64_t _prof_##name = read_tsc()
#define PROFILE_END(name) do { \
uint64_t _end = read_tsc(); \
kprintf("%s: %lu cycles\n", #name, _end - _prof_##name); \
} while(0)
/* Configure PMC to count cache misses */
void setup_pmc_cache_misses(void) {
// PMC setup via MSRs (Model-Specific Registers)
// Event 0x2E = LLC_MISSES, Umask 0x41
uint64_t event_select = 0x412E; // LLC misses
event_select |= (1 << 16); // USR - count in user mode
event_select |= (1 << 17); // OS - count in kernel mode
event_select |= (1 << 22); // EN - enable counting
wrmsr(MSR_PERFEVTSEL0, event_select);
wrmsr(MSR_PMC0, 0); // Reset counter
}
uint64_t read_cache_misses(void) {
return read_pmc(0); // Read PMC0
}
Kernel Tracing
Kernel tracing records events as they happen—syscalls, interrupts, context switches—creating a timeline you can analyze. This is invaluable for finding why the system "feels slow."
/* Simple Ring Buffer for Kernel Tracing */
#define TRACE_BUFFER_SIZE 65536
typedef struct trace_event {
uint64_t timestamp; // TSC timestamp
uint32_t event_type; // TRACE_SYSCALL, TRACE_IRQ, etc.
uint32_t cpu; // Which CPU
uint32_t pid; // Process ID
uint64_t arg1, arg2; // Event-specific data
} trace_event_t;
typedef struct trace_buffer {
trace_event_t events[TRACE_BUFFER_SIZE];
volatile uint32_t head;
volatile uint32_t tail;
bool enabled;
} trace_buffer_t;
trace_buffer_t trace_buf;
/* Record trace event (called from kernel code) */
void trace_record(uint32_t type, uint64_t arg1, uint64_t arg2) {
if (!trace_buf.enabled) return;
uint32_t idx = atomic_fetch_add(&trace_buf.head, 1) % TRACE_BUFFER_SIZE; // Reserve a unique slot
trace_event_t* e = &trace_buf.events[idx];
e->timestamp = read_tsc();
e->event_type = type;
e->cpu = get_cpu_id();
e->pid = current->pid;
e->arg1 = arg1;
e->arg2 = arg2;
}
/* Trace points in kernel */
#define TRACE_SYSCALL_ENTER 1
#define TRACE_SYSCALL_EXIT 2
#define TRACE_IRQ_ENTER 3
#define TRACE_IRQ_EXIT 4
#define TRACE_SCHED_SWITCH 5
#define TRACE_PAGE_FAULT 6
/* Example: Trace syscall entry (args passed explicitly; C cannot forward ...) */
long syscall_handler(long num, long a1, long a2, long a3) {
trace_record(TRACE_SYSCALL_ENTER, num, 0);
long result = do_syscall(num, a1, a2, a3);
trace_record(TRACE_SYSCALL_EXIT, num, result);
return result;
}
Visualization: Export trace data to a file, then visualize with tools like Chrome's chrome://tracing or Perfetto. You'll see exactly what the kernel was doing at each microsecond!
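For the export step, Chrome's Trace Event Format is just a JSON array of objects with `name`, `ph`, `ts`, `pid`, and `tid` fields. A sketch of formatting one completed event (`"ph":"X"` marks a duration event; `ts` and `dur` are in microseconds, so converting TSC ticks is assumed to happen before this call):

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Format one trace event in Chrome's Trace Event Format. Writing one
 * of these per trace_event_t (comma-separated, inside "[...]") yields
 * a file that chrome://tracing and Perfetto can load directly. */
int trace_event_to_json(char* buf, size_t len, const char* name,
                        uint64_t ts_us, uint64_t dur_us,
                        uint32_t pid, uint32_t cpu) {
    return snprintf(buf, len,
        "{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%llu,\"dur\":%llu,"
        "\"pid\":%u,\"tid\":%u}",
        name, (unsigned long long)ts_us, (unsigned long long)dur_us,
        pid, cpu);
}
```

Mapping each CPU to a `tid` (as done here) makes the viewer render one timeline row per core, which is exactly what you want for spotting scheduler behavior.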
What You Can Build
Phase 16 Project: A fast OS! Your system now has disk caching for fast file access, an optimized scheduler for responsive multitasking, efficient memory allocation, and tools to measure performance.
Let's build a performance benchmark suite that measures all the optimizations we've implemented:
/* performance_benchmark.c - Comprehensive OS Performance Suite */
#include "cache.h"
#include "scheduler.h"
#include "memory.h"
#include "profiling.h"
typedef struct benchmark_result {
const char* name;
uint64_t cycles;
uint64_t cache_misses;
uint64_t operations;
double ops_per_second;
} benchmark_result_t;
/* Benchmark 1: Block Cache Performance */
benchmark_result_t bench_block_cache(void) {
const uint64_t NUM_OPS = 100000;
uint64_t start_cycles = read_tsc();
uint64_t start_misses = read_cache_misses();
// First pass: cold cache (all misses)
for (uint64_t i = 0; i < 1000; i++) {
cache_get(&block_cache, i);
}
// Second pass: warm cache (should hit)
for (uint64_t i = 0; i < NUM_OPS; i++) {
cache_get(&block_cache, i % 1000); // Access within cache size
}
uint64_t end_cycles = read_tsc();
uint64_t end_misses = read_cache_misses();
benchmark_result_t result = {
.name = "Block Cache",
.cycles = end_cycles - start_cycles,
.cache_misses = end_misses - start_misses,
.operations = NUM_OPS,
.ops_per_second = (double)NUM_OPS / ((double)(end_cycles - start_cycles) / cpu_freq)
};
kprintf("[BENCH] %s: %lu ops, %lu cycles, %.2f ops/sec, hit_rate=%.1f%%\n",
result.name, result.operations, result.cycles, result.ops_per_second,
100.0 * block_cache.hits / (block_cache.hits + block_cache.misses));
return result;
}
/* Benchmark 2: Slab Allocator vs Generic Malloc */
benchmark_result_t bench_slab_allocator(void) {
const uint64_t NUM_ALLOCS = 50000;
void* ptrs[1000];
// Benchmark slab allocator (free the slot's old object *before* reusing it)
uint64_t slab_start = read_tsc();
for (uint64_t i = 0; i < NUM_ALLOCS; i++) {
if (i >= 1000) slab_free(&task_cache, ptrs[i % 1000]);
ptrs[i % 1000] = slab_alloc(&task_cache); // Pre-sized for task_struct
}
uint64_t slab_cycles = read_tsc() - slab_start;
// Benchmark generic malloc
uint64_t malloc_start = read_tsc();
for (uint64_t i = 0; i < NUM_ALLOCS; i++) {
if (i >= 1000) kfree(ptrs[i % 1000]);
ptrs[i % 1000] = kmalloc(sizeof(task_t));
}
uint64_t malloc_cycles = read_tsc() - malloc_start;
kprintf("[BENCH] Slab: %lu cycles, Malloc: %lu cycles (%.1fx faster)\n",
slab_cycles, malloc_cycles, (double)malloc_cycles / slab_cycles);
return (benchmark_result_t){
.name = "Slab Allocator",
.cycles = slab_cycles,
.operations = NUM_ALLOCS
};
}
/* Benchmark 3: Scheduler Latency */
benchmark_result_t bench_scheduler_latency(void) {
const int NUM_SAMPLES = 1000;
uint64_t latencies[NUM_SAMPLES];
for (int i = 0; i < NUM_SAMPLES; i++) {
uint64_t start = read_tsc();
yield(); // Voluntary context switch
latencies[i] = read_tsc() - start;
}
// Calculate statistics
uint64_t sum = 0, min = UINT64_MAX, max = 0;
for (int i = 0; i < NUM_SAMPLES; i++) {
sum += latencies[i];
if (latencies[i] < min) min = latencies[i];
if (latencies[i] > max) max = latencies[i];
}
kprintf("[BENCH] Context Switch: avg=%lu, min=%lu, max=%lu cycles\n",
sum / NUM_SAMPLES, min, max);
return (benchmark_result_t){
.name = "Context Switch",
.cycles = sum / NUM_SAMPLES,
.operations = NUM_SAMPLES
};
}
/* Benchmark 4: TLB/Huge Page Impact */
benchmark_result_t bench_huge_pages(void) {
const size_t SIZE = 64 * 1024 * 1024; // 64 MB
// Allocate with 4KB pages
uint64_t start_4k = read_tsc();
uint64_t start_tlb = read_pmc_tlb_misses();
char* mem_4k = vmalloc(SIZE); // Regular pages
for (size_t i = 0; i < SIZE; i += 4096) {
mem_4k[i] = 1; // Touch each page
}
uint64_t cycles_4k = read_tsc() - start_4k;
uint64_t tlb_4k = read_pmc_tlb_misses() - start_tlb;
vfree(mem_4k);
// Allocate with 2MB huge pages
uint64_t start_2m = read_tsc();
start_tlb = read_pmc_tlb_misses();
char* mem_2m = vmalloc_huge(SIZE); // Huge pages
for (size_t i = 0; i < SIZE; i += 4096) {
mem_2m[i] = 1;
}
uint64_t cycles_2m = read_tsc() - start_2m;
uint64_t tlb_2m = read_pmc_tlb_misses() - start_tlb;
vfree(mem_2m);
kprintf("[BENCH] 4KB pages: %lu cycles, %lu TLB misses\n", cycles_4k, tlb_4k);
kprintf("[BENCH] 2MB pages: %lu cycles, %lu TLB misses (%.1fx faster)\n",
cycles_2m, tlb_2m, (double)cycles_4k / cycles_2m);
return (benchmark_result_t){
.name = "Huge Pages",
.cycles = cycles_2m,
.operations = SIZE / 4096
};
}
/* Run all benchmarks */
void run_performance_suite(void) {
kprintf("\n=== WasilOS Performance Benchmark Suite ===\n\n");
// Enable performance counters
setup_pmc_cache_misses();
benchmark_result_t results[4];
results[0] = bench_block_cache();
results[1] = bench_slab_allocator();
results[2] = bench_scheduler_latency();
results[3] = bench_huge_pages();
kprintf("\n=== Benchmark Complete ===\n");
}
Exercises
Exercise 1: Implement Read-Ahead
When reading block N, also prefetch blocks N+1, N+2, etc. into the cache. Most file access is sequential, so this dramatically improves performance.
void cache_readahead(block_cache_t* cache, uint64_t block, int count) {
for (int i = 1; i <= count; i++) {
// Async prefetch without blocking
cache_prefetch(cache, block + i);
}
}
Exercise 2: Implement CLOCK Algorithm
CLOCK is an approximation of LRU that's cheaper to implement. Instead of moving entries on every access, it uses a "second chance" bit.
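A sketch of the eviction sweep, using a hypothetical fixed frame array (`N_FRAMES` and the field names are illustrative, not from the chapter's cache code):

```c
#include <stdint.h>
#include <stdbool.h>

#define N_FRAMES 8

typedef struct {
    uint64_t block;
    bool referenced;   /* set on every access instead of moving a list node */
    bool valid;
} clock_frame_t;

static clock_frame_t frames[N_FRAMES];
static int hand = 0;   /* the clock hand sweeps the frames circularly */

/* CLOCK eviction: frames with the reference bit set get a "second
 * chance" (bit cleared, hand moves on); the first frame with the bit
 * clear is the victim. Approximates LRU with no work on the hit path
 * beyond setting a bit. */
int clock_evict(void) {
    for (;;) {
        if (!frames[hand].valid || !frames[hand].referenced) {
            int victim = hand;
            hand = (hand + 1) % N_FRAMES;
            return victim;
        }
        frames[hand].referenced = false;   /* second chance granted */
        hand = (hand + 1) % N_FRAMES;
    }
}
```

Note the loop always terminates: a full sweep clears every reference bit, so the second pass must find a victim.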
Exercise 3: Add Nice Support
Implement the nice() syscall that allows processes to adjust their priority. Higher nice values = lower priority = lower weight, so vruntime increases faster and the task receives less CPU.
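A sketch of the syscall body, reusing the weight table from the CFS section (the trimmed `task_t` here carries only the fields involved):

```c
#include <stdint.h>

#define NICE_MIN (-20)
#define NICE_MAX 19

/* Trimmed task: nice level plus the weight CFS derives from it */
typedef struct {
    int nice;
    uint32_t weight;
} task_t;

/* Same table as the CFS section: nice 0 = weight 1024 */
static const uint32_t nice_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,  3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,   335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,    36,    29,    23,    18,    15,
};

/* nice() syscall body: clamp the requested value, then look up the new
 * weight. A higher nice means a smaller weight, so vruntime advances
 * faster and the task receives less CPU. Returns the clamped nice. */
int sys_nice(task_t* t, int nice) {
    if (nice < NICE_MIN) nice = NICE_MIN;
    if (nice > NICE_MAX) nice = NICE_MAX;
    t->nice = nice;
    t->weight = nice_to_weight[nice - NICE_MIN];
    return nice;
}
```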
Exercise 4: Build a Profiler
Create a simple sampling profiler that uses timer interrupts to record which function is executing. Build a histogram of where time is spent.
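A sketch of the sampling half: the timer interrupt hands the saved instruction pointer to a bucketed histogram (`KERNEL_BASE` and the bucket size are assumptions for illustration):

```c
#include <stdint.h>

#define HIST_BUCKETS 1024
#define BUCKET_SHIFT 6                 /* 64-byte buckets */
#define KERNEL_BASE  0x100000ULL       /* assumed kernel load address */

static uint32_t hist[HIST_BUCKETS];

/* Called from the timer interrupt with the interrupted RIP (the value
 * the CPU saved on the interrupt stack frame). Each sample bumps one
 * bucket; hot functions show up as tall buckets when the histogram is
 * dumped and matched against the kernel's symbol map. */
void profiler_sample(uint64_t rip) {
    uint64_t bucket = (rip - KERNEL_BASE) >> BUCKET_SHIFT;
    if (bucket < HIST_BUCKETS)         /* out-of-range RIPs wrap huge and are dropped */
        hist[bucket]++;
}
```

Because sampling happens at a fixed rate, bucket counts are directly proportional to time spent — the statistical trick behind every sampling profiler.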
╔═════════════════════════════════════════════════════════════════════════════╗
║ PHASE 16 → PHASE 17 TRANSITION ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ ║
║ Phase 16 Complete: ║
║ ┌─────────────────────────────────────────────────────────────────────┐ ║
║ │ ✓ Block cache (90%+ hit rate) │ ║
║ │ ✓ Page cache (memory-mapped files) │ ║
║ │ ✓ LRU eviction policy │ ║
║ │ ✓ MLFQ scheduler (priority queues) │ ║
║ │ ✓ CFS scheduler (red-black tree) │ ║
║ │ ✓ TLB optimization │ ║
║ │ ✓ Huge pages (2MB/1GB) │ ║
║ │ ✓ Slab allocator │ ║
║ │ ✓ Performance counters (PMC) │ ║
║ │ ✓ Kernel tracing │ ║
║ └─────────────────────────────────────────────────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ Phase 17 Preview - Stability, Security & Finishing: ║
║ ┌─────────────────────────────────────────────────────────────────────┐ ║
║ │ • Kernel debugging (stack traces, panic handling) │ ║
║ │ • Security hardening (ASLR, stack canaries, NX) │ ║
║ │ • User/kernel separation │ ║
║ │ • Complete bootable OS image │ ║
║ └─────────────────────────────────────────────────────────────────────┘ ║
║ ║
╚═════════════════════════════════════════════════════════════════════════════╝
| Concept | What We Built | Performance Impact |
|---|---|---|
| Block Cache | Hash table + LRU list | 10,000x faster than disk |
| Page Cache | File pages in memory | Enables mmap() |
| CFS Scheduler | Red-black tree by vruntime | Fair + responsive |
| Huge Pages | 2MB/1GB page support | 10-50% faster for large allocs |
| Slab Allocator | Fixed-size object pools | 10x faster than malloc |
| Profiling | PMC + tracing | Find what to optimize |
Next Steps
With performance optimized, it's time to make the OS stable and secure. In Phase 17, we'll add debugging tools, implement security hardening, and put the finishing touches on our complete operating system.
Continue the Series
Phase 15: Hardware Discovery & Drivers
Review PCI enumeration and driver framework.
Phase 17: Stability, Security & Finishing
Complete the OS with debugging and security.
Phase 0: Orientation & Big Picture
Revisit the fundamentals and see how far you've come.