x86 Assembly Series Part 2: x86 CPU Architecture Overview

February 6, 2026 Wasil Zafar 30 min read

Explore the evolution of x86 from the 8086 to modern x86-64 processors. Understand CISC philosophy, execution modes, privilege rings, CPU pipelines, and the microcode layer that powers modern processors.

x86 Evolution
CISC Philosophy
- CISC vs RISC
- Design Implications
Execution Modes
Privilege Rings
- Ring 0-3 Model
- Ring Transitions
CPU Internals

x86 Evolution: From 8086 to x86-64

                        
                        Historical Context: The x86 architecture has maintained backward compatibility for over 45 years, making it one of the most successful processor architectures in computing history.
                    

The 8086 Origins (1978)

Intel 8086 Specifications

Word Size: 16-bit registers and data bus
Address Bus: 20-bit (1 MB addressable memory)
Registers: AX, BX, CX, DX, SI, DI, BP, SP
Segmentation: Segment:Offset addressing
Clock Speed: 5-10 MHz

The 8086 established the instruction set that all x86 processors still support today.

IA-32: The 32-bit Era (80386+)

The 80386 (1985) introduced 32-bit computing to x86:

Figure: x86 architecture evolution timeline, from the 16-bit 8086 (1978) through the 32-bit 80386 to modern 64-bit x86-64, showing key milestones in register width, addressing, and features.

32-bit registers (EAX, EBX, etc.)
4 GB addressable memory
32-bit Protected Mode with ring-based security (a 16-bit Protected Mode debuted on the 80286)
Virtual memory with paging
Hardware task switching

x86-64 / AMD64 (2003)

AMD's bold move to extend x86 to 64-bit (rather than adopting Intel's Itanium) gave us today's dominant architecture:

Key x86-64 Features

64-bit general-purpose registers: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP
8 new registers: R8-R15 (finally, more than 8 GPRs!)
Virtual address space: 48-bit (256 TB) currently, 57-bit with LA57
RIP-relative addressing: Position-independent code is easy
Flat memory model: No more segment:offset headaches (mostly)
Simplified privilege model in practice: all four rings still exist, but 64-bit operating systems use only Ring 0 and Ring 3
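
The address-space figures in the list above follow directly from the bit widths. As a quick arithmetic check, here is a small Python sketch; the helper name `address_space_bytes` is made up for illustration, and the sizes are expressed in binary units (TiB, PiB), which is what "256 TB" means in this context.

```python
# Sizes of the address spaces named above, computed from the bit widths.
def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits

TIB = 1 << 40  # one tebibyte
PIB = 1 << 50  # one pebibyte

print(address_space_bytes(20), "bytes with the 8086's 20-bit bus (1 MB)")
print(address_space_bytes(48) // TIB, "TiB with 48-bit virtual addresses")
print(address_space_bytes(57) // PIB, "PiB with 57-bit virtual addresses (LA57)")
```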

Register Naming Convention:

64-bit  32-bit  16-bit  8-bit (low)  8-bit (high)
RAX     EAX     AX      AL           AH
RBX     EBX     BX      BL           BH
RCX     ECX     CX      CL           CH
RDX     EDX     DX      DL           DH
RSI     ESI     SI      SIL          -
RDI     EDI     DI      DIL          -
R8      R8D     R8W     R8B          -
R9      R9D     R9W     R9B          -
...and so on for R10-R15
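
The naming table describes overlapping views of one 64-bit register. The Python sketch below models that with bit masks; `register_views` is a hypothetical helper, not part of any toolchain, and it only shows which bits each name covers.

```python
def register_views(rax: int) -> dict:
    """Return the named sub-register views of a 64-bit RAX value."""
    rax &= 0xFFFFFFFFFFFFFFFF          # keep 64 bits
    return {
        "RAX": rax,                    # bits 63..0
        "EAX": rax & 0xFFFFFFFF,       # bits 31..0
        "AX": rax & 0xFFFF,            # bits 15..0
        "AL": rax & 0xFF,              # bits 7..0
        "AH": (rax >> 8) & 0xFF,       # bits 15..8
    }

for name, value in register_views(0x1122334455667788).items():
    print(f"{name:>3} = {value:#x}")
```

The same masks apply to RBX, RCX, and RDX; registers such as RSI or R8 have no high-byte (AH-style) view.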

Long Mode Sub-Modes

Sub-Mode	Description	Use Case
64-bit Mode	Full 64-bit OS and applications	Modern operating systems
Compatibility Mode	Run 32-bit apps on 64-bit OS	WoW64 (Windows on Windows 64)

CISC Philosophy

CISC vs RISC

Architecture Philosophies

CISC (x86)	RISC (ARM)
Complex, variable-length instructions	Simple, fixed-length instructions
Memory operands in most instructions	Load/Store architecture
Microcode for complex instructions	Mostly hardwired control
Fewer registers	Many registers

Design Implications

CISC architecture profoundly affects how you write and think about assembly:

Figure: CISC vs RISC design philosophies. x86 CISC uses complex variable-length instructions with direct memory operands, while RISC uses simple fixed-length load/store instructions.

Memory Operands Anywhere

Unlike RISC (load/store), x86 allows memory operands directly in arithmetic:

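; CISC style - memory in arithmetic
add dword [count], 5    ; Add 5 directly to a 32-bit memory location (NASM needs the size)
mul dword [factor]      ; Multiply EAX by memory value; result in EDX:EAX

; A RISC (ARM-style) equivalent needs separate load and store steps:
ldr r0, =count          ; Load the address of count
ldr r1, [r0]            ; Load
add r1, r1, #5          ; Compute
str r1, [r0]            ; Store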

Rich Instruction Set

; String operations
rep movsb               ; Copy RCX bytes from [RSI] to [RDI]
rep stosq               ; Fill RCX quadwords at RDI with RAX
repnz scasb             ; Search for AL in string at RDI

; Complex addressing
mov eax, [rbx + rsi*4 + 16]   ; Array[i] with base offset

; Atomic operations  
lock cmpxchg [mutex], ecx     ; Compare-and-swap for synchronization
lock xadd [counter], eax      ; Atomic fetch-and-add
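
The complex addressing line above packs a small formula into one operand: base + index × scale + displacement. A minimal Python sketch of that calculation follows; `effective_address` is an illustrative name, and the register values in the example are arbitrary assumptions.

```python
def effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """Compute base + index*scale + disp, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4, and 8")
    return (base + index * scale + disp) & 0xFFFFFFFFFFFFFFFF

# mov eax, [rbx + rsi*4 + 16] with rbx = 0x1000 and rsi = 3 (assumed values)
print(hex(effective_address(base=0x1000, index=3, scale=4, disp=16)))
```

With a scale of 4, the index steps through 32-bit array elements, and the displacement skips a 16-byte header in front of the array.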

Trade-offs for Assembly Programmers

Advantage	Disadvantage
Fewer instructions for same task	Complex instruction encoding (1-15 bytes)
Powerful addressing modes	Harder to predict timing
Rich built-in operations (REP, LOOP)	Microcode overhead for complex ops
Direct memory manipulation	Limited registers (historical)

                        
Modern Reality: Today's x86 CPUs are RISC-like internally. The decoder translates complex CISC instructions into simple micro-operations (μops) that run on an out-of-order core, so you keep the convenient CISC programming model while the hardware executes simple operations.
                    

CPU Execution Modes

                        
                        Key Insight: Modern x86 processors boot in Real Mode (for BIOS compatibility), then transition to Protected Mode (for 32-bit OS) or Long Mode (for 64-bit OS). Understanding these modes is essential for bootloader and kernel development.
                    

x86 CPU Execution Modes

graph TD
    RM["Real Mode<br/>16-bit | 1 MB Address Space<br/>No Protection | Direct HW Access"]
    PM["Protected Mode<br/>32-bit | 4 GB Address Space<br/>Ring Protection | Paging"]
    LM["Long Mode (x86-64)<br/>64-bit | 256 TB Virtual<br/>4-Level Paging | RIP-relative"]
    VM["Virtual 8086 Mode<br/>Real Mode emulation<br/>inside Protected Mode"]

    RM -->|"Set PE bit in CR0"| PM
    PM -->|"Enable PAE, set EFER.LME, then CR0.PG"| LM
    PM -->|"Set VM flag"| VM
    VM -->|"Clear VM flag"| PM

    style RM fill:#fff5f5,stroke:#BF092F
    style PM fill:#f0f4f8,stroke:#16476A
    style LM fill:#e8f4f4,stroke:#3B9797
    style VM fill:#f8f9fa,stroke:#666

Real Mode

Real Mode Characteristics

Address Space: 1 MB (20-bit addresses)
Segmentation: Segment × 16 + Offset
Protection: None (direct hardware access)
Use Case: BIOS, bootloaders, DOS compatibility
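
The Segment × 16 + Offset rule is easy to check by hand. The Python sketch below assumes 8086-style behavior where the result wraps at 1 MB (20 address lines); `real_mode_address` is a hypothetical helper name.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Translate a real-mode segment:offset pair to a 20-bit physical address."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    return linear & 0xFFFFF            # assumption: wrap at 1 MB like the 8086

# Two different segment:offset pairs can name the same byte.
print(hex(real_mode_address(0x07C0, 0x0000)))   # 0x7c00
print(hex(real_mode_address(0x0000, 0x7C00)))   # 0x7c00
```

This aliasing is why bootloader code normalizes its segment registers early: the BIOS may hand over control as either 0x07C0:0x0000 or 0x0000:0x7C00.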

Protected Mode

Protected Mode first appeared on the 80286 as a 16-bit mode; the 80386 extended it to 32 bits and added paging. It is where 32-bit operating systems live:

Global Descriptor Table (GDT)

The GDT defines memory segments with protection attributes:

GDT Entry (8 bytes each), from the most significant bits down:
  Bits 63-56: Base[31:24]
  Bits 55-52: Flags (G, D/B, L, AVL)
  Bits 51-48: Limit[19:16]
  Bits 47-40: Access byte (P, DPL, S, Type)
  Bits 39-32: Base[23:16]
  Bits 31-16: Base[15:0]
  Bits 15-0:  Limit[15:0]

Typical GDT layout:
  Index 0: Null descriptor (required)
  Index 1: Kernel Code (Ring 0, Execute)
  Index 2: Kernel Data (Ring 0, Read/Write)
  Index 3: User Code   (Ring 3, Execute)
  Index 4: User Data   (Ring 3, Read/Write)
  Index 5: TSS         (Task State Segment)

Entering Protected Mode (Bootloader Pattern)

; Minimal GDT for entering protected mode
gdt_start:
    dq 0                        ; Null descriptor (index 0)
gdt_code:                       ; Code segment descriptor (index 1)
    dw 0xFFFF                   ; Limit 0-15
    dw 0                        ; Base 0-15
    db 0                        ; Base 16-23
    db 10011010b                ; Access: Present, Ring 0, Code, Readable
    db 11001111b                ; Flags: 4K granularity, 32-bit
    db 0                        ; Base 24-31
gdt_data:                       ; Data segment descriptor (index 2)
    dw 0xFFFF
    dw 0
    db 0
    db 10010010b                ; Access: Present, Ring 0, Data, Writable
    db 11001111b
    db 0
gdt_end:

gdt_descriptor:
    dw gdt_end - gdt_start - 1  ; GDT size - 1
    dd gdt_start                ; GDT address

; Switch to protected mode
enter_protected:
    cli                         ; Disable interrupts
    lgdt [gdt_descriptor]       ; Load GDT
    mov eax, cr0
    or eax, 1                   ; Set PE (Protection Enable) bit
    mov cr0, eax
    jmp 0x08:protected_start    ; Far jump to flush pipeline, load CS
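
To see how the `dw`/`db` lines above map onto the 8-byte descriptor format, the following Python sketch packs the same fields with `struct`. The function `encode_gdt_entry` is illustrative only; the base, limit, access, and flag values are the ones used in the listing.

```python
import struct

def encode_gdt_entry(base: int, limit: int, access: int, flags: int) -> bytes:
    """Pack one 8-byte segment descriptor in the order the CPU expects."""
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                # Limit 0-15
        base & 0xFFFF,                                 # Base 0-15
        (base >> 16) & 0xFF,                           # Base 16-23
        access & 0xFF,                                 # Access byte
        ((flags & 0xF) << 4) | ((limit >> 16) & 0xF),  # Flags + Limit 16-19
        (base >> 24) & 0xFF,                           # Base 24-31
    )

code = encode_gdt_entry(base=0, limit=0xFFFFF, access=0b10011010, flags=0b1100)
data = encode_gdt_entry(base=0, limit=0xFFFFF, access=0b10010010, flags=0b1100)
print(code.hex())   # ffff0000009acf00
print(data.hex())   # ffff00000092cf00
```

The byte `0xCF` in both descriptors is the flags nibble (4K granularity, 32-bit) sharing a byte with the top four limit bits, which is why the listing writes it as `11001111b`.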

                        
                        Protected Mode Gotchas:
                        Can't use BIOS interrupts (they're 16-bit real mode code)
Must set up an IDT before enabling interrupts
Segment registers hold selectors, not segment bases
The far jump after setting CR0.PE is mandatory to load CS properly
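
Because segment registers hold selectors in Protected Mode, the `0x08` in the far jump is a table index plus two small fields rather than an address. A short Python sketch of the decoding, with `decode_selector` as a hypothetical helper name:

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit segment selector into its three fields."""
    return {
        "index": (selector >> 3) & 0x1FFF,           # descriptor table entry number
        "table": "LDT" if selector & 0b100 else "GDT",
        "rpl": selector & 0b11,                      # requested privilege level
    }

print(decode_selector(0x08))   # kernel code in the typical layout
print(decode_selector(0x1B))   # index 3 with RPL 3: user code in that layout
```

Multiplying the index by 8 gives the byte offset of the descriptor inside the table, which is why consecutive selectors for Ring 0 are 0x08, 0x10, 0x18, and so on.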

                    

Long Mode (64-bit)

Long Mode is the native mode for 64-bit x86 processors. The conventional route reaches it through Protected Mode:

Entering Long Mode (From Protected Mode)

; Prerequisites:
; 1. Already in Protected Mode with paging disabled
; 2. 4-level page tables set up, with pml4_table below 4 GB
; 3. GDT selector 0x08 refers to a 64-bit code segment (L=1, D=0)
; (PAE is enabled by the code below.)

enter_long_mode:
    ; Enable PAE in CR4
    mov eax, cr4
    or eax, (1 << 5)            ; Set PAE bit
    mov cr4, eax
    
    ; Load PML4 table address into CR3
    mov eax, pml4_table         ; Page-Map Level-4 Table
    mov cr3, eax
    
    ; Enable Long Mode in EFER MSR
    mov ecx, 0xC0000080         ; EFER MSR number
    rdmsr
    or eax, (1 << 8)            ; Set LME (Long Mode Enable)
    wrmsr
    
    ; Enable paging (this activates Long Mode)
    mov eax, cr0
    or eax, (1 << 31)           ; Set PG (Paging) bit
    mov cr0, eax
    
    ; Far jump to 64-bit code segment
    jmp 0x08:long_mode_start

[bits 64]
long_mode_start:
    ; Now in 64-bit mode!
    mov rsp, stack_top
    call kernel_main

64-bit Addressing

Feature	32-bit Protected	64-bit Long
Virtual Address	32-bit (4 GB)	48-bit (256 TB) canonical
Physical Address	32-bit (36 with PAE)	52-bit (4 PB)
Page Tables	2-level (or 3 with PAE)	4-level (5 with LA57)
Segments	Full segmentation	Flat model (FS/GS for TLS)
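
The "4-level" entry in the table means a 48-bit virtual address is cut into four 9-bit table indexes plus a 12-bit page offset. The Python sketch below shows that split, assuming 4 KiB pages and 4-level paging; the function and key names are made up for illustration.

```python
def split_virtual_address(vaddr: int) -> dict:
    """Break a 48-bit virtual address into 4-level paging indexes (4 KiB pages)."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,    # bits 47..39
        "pdpt": (vaddr >> 30) & 0x1FF,    # bits 38..30
        "pd": (vaddr >> 21) & 0x1FF,      # bits 29..21
        "pt": (vaddr >> 12) & 0x1FF,      # bits 20..12
        "offset": vaddr & 0xFFF,          # bits 11..0
    }

print(split_virtual_address(0x0000000000401000))
print(split_virtual_address(0xFFFF800000000000))
```

Each index selects one of 512 eight-byte entries, so every table fills exactly one 4 KiB page.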

                        
Canonical Addresses: With 4-level paging, only the low 48 bits of a virtual address are translated. Bits 48-63 must all equal bit 47 (sign extension), so they are either all 0s or all 1s. This creates a "canonical hole" in the middle of the address space: the kernel typically lives in high addresses (0xFFFF...), user space in low addresses (0x0000...).
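
The canonical rule can be stated as a one-line check: bit 47 and every bit above it must agree. Here is a minimal Python sketch; `is_canonical` and its `va_bits` parameter are assumptions for illustration, with 48 as the default and 57 usable for LA57.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if the bits above the implemented width are a sign extension."""
    addr &= 0xFFFFFFFFFFFFFFFF
    upper = addr >> (va_bits - 1)                 # top implemented bit and above
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones

print(is_canonical(0x00007FFFFFFFFFFF))   # True: top of the low half
print(is_canonical(0x0000800000000000))   # False: inside the hole
print(is_canonical(0xFFFF800000000000))   # True: bottom of the high half
```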
                    

Privilege Rings

The Ring Model (0-3)

x86 Privilege Levels

Ring 0 (Kernel): Full hardware access, OS kernel code
Ring 1: Device drivers (rarely used)
Ring 2: Device drivers (rarely used)
Ring 3 (User): Application code, restricted access

Most modern OSes use only Ring 0 (kernel) and Ring 3 (user). Hypervisors run in what is informally called "Ring -1": a separate host mode provided by hardware virtualization (Intel VT-x, AMD-V) rather than an actual ring number.

x86 Privilege Ring Model

graph TD
    R0["Ring 0: Kernel<br/>Full hardware access<br/>All instructions allowed"]
    R1["Ring 1: Device Drivers<br/>(Rarely used in modern OS)"]
    R2["Ring 2: Device Drivers<br/>(Rarely used in modern OS)"]
    R3["Ring 3: User Applications<br/>Restricted access<br/>Must use syscalls for I/O"]

    R0 --- R1
    R1 --- R2
    R2 --- R3

    R3 -->|"INT 0x80 / SYSCALL"| R0
    R0 -->|"IRET / SYSRET"| R3

    style R0 fill:#BF092F,stroke:#132440,color:#fff
    style R1 fill:#16476A,stroke:#132440,color:#fff
    style R2 fill:#3B9797,stroke:#132440,color:#fff
    style R3 fill:#e8f4f4,stroke:#3B9797

Ring Transitions

Code transitions between privilege levels through controlled gates:

Figure: x86 privilege ring model. Ring 0 (kernel) and Ring 3 (user) transition via syscall/sysret, interrupt gates, and call gates.

User to Kernel (Ring 3 → Ring 0)

; Modern syscall instruction (64-bit Linux)
; Arguments: RDI, RSI, RDX, R10, R8, R9
; Syscall number: RAX
; Return value: RAX

mov rax, 1              ; sys_write
mov rdi, 1              ; fd = stdout
mov rsi, message        ; buffer
mov rdx, 13             ; length
syscall                 ; RING 3 → RING 0 → RING 3

; What happens:
; 1. CPU saves RIP to RCX, RFLAGS to R11
; 2. Loads kernel CS:RIP from STAR/LSTAR MSRs
; 3. Sets CPL to 0 (kernel mode)
; 4. Kernel handler executes
; 5. sysret instruction returns to user mode

Transition Mechanisms

Mechanism	Direction	Use Case
`syscall` / `sysret`	User ↔ Kernel	Fast system calls (64-bit)
`sysenter` / `sysexit`	User ↔ Kernel	Fast system calls (32-bit)
`int 0x80`	User → Kernel	Legacy Linux syscall
`int 0x2e`	User → Kernel	Legacy Windows syscall
Hardware interrupt	Any → Kernel	Device events, timer
Exception (fault/trap)	Any → Kernel	Page fault, divide error
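
High-level languages cross the same boundary through thin wrappers. As a rough illustration, the Python sketch below uses `os.write` and `os.read`, which on Linux end up in the `write` and `read` system calls; the pipe and the 13-byte message are assumptions chosen so the example does not depend on the terminal.

```python
import os

def write_through_pipe(payload: bytes) -> bytes:
    """Send bytes through a kernel pipe and read them back."""
    read_fd, write_fd = os.pipe()
    try:
        written = os.write(write_fd, payload)    # user -> kernel -> user
        return os.read(read_fd, written)         # and once more for the read
    finally:
        os.close(read_fd)
        os.close(write_fd)

message = b"Hello, world\n"                      # 13 bytes, like the RDX length above
print(write_through_pipe(message))
```

Running this script under `strace` should show the `pipe`, `write`, `read`, and `close` calls among the interpreter's own startup system calls.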

Exercise: Observe Ring Transitions

# Count syscalls made by a program
strace -c ls /

# Illustrative output (abridged; real strace -c output also has seconds, usecs/call, and errors columns):
# % time     calls     syscall
# 67.50%       101     read
# 12.00%        43     write
#  5.00%        25     openat
#  ...each call is a Ring 3 → 0 → 3 transition!

CPU Internals

Instruction Pipeline

Modern CPUs process instructions through a multi-stage pipeline to increase throughput:

Classic 5-Stage Pipeline:

  ┌───────┐   ┌───────┐   ┌───────┐   ┌───────┐   ┌───────┐
  │ Fetch │ → │Decode │ → │Execute│ → │Memory │ → │Write- │
  │       │   │       │   │       │   │ Access│   │ back  │
  └───────┘   └───────┘   └───────┘   └───────┘   └───────┘
  
  Clock  1    2    3    4    5    6    7    8    9
  Inst1  IF   ID   EX   MEM  WB
  Inst2       IF   ID   EX   MEM  WB
  Inst3            IF   ID   EX   MEM  WB
  Inst4                 IF   ID   EX   MEM  WB
  
  Throughput: 1 instruction per cycle (ideally)
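
The timing chart above follows a simple rule for an ideal pipeline with no stalls: instruction i enters stage s at clock i + s, so n instructions finish in stages + n - 1 cycles. A small Python sketch under that no-hazard assumption, with hypothetical helper names:

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")

def ideal_pipeline_cycles(instructions: int, stages: int = len(STAGES)) -> int:
    """Total cycles for an ideal pipeline with no stalls or flushes."""
    if instructions <= 0:
        return 0
    return stages + instructions - 1

def schedule(instructions: int) -> dict:
    """Map each instruction number to the clock at which it occupies each stage."""
    return {
        i + 1: {name: i + 1 + s for s, name in enumerate(STAGES)}
        for i in range(instructions)
    }

print(ideal_pipeline_cycles(4))    # 8, matching the chart
print(schedule(4)[4])              # Inst4 reaches WB at clock 8
```

Without pipelining the same four instructions would need 4 × 5 = 20 cycles, which is where the throughput gain comes from.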

Pipeline Stages Explained

Stage	Action	Relevant to Assembly
Fetch (IF)	Read instruction bytes from memory/cache	Code alignment matters for cache lines
Decode (ID)	Parse instruction, read registers	Simpler instructions decode faster
Execute (EX)	Perform computation (ALU, FPU, etc.)	Some instructions take multiple cycles
Memory (MEM)	Load/store data from memory	Memory operations are slow (cache helps)
Writeback (WB)	Write results to registers	Register dependencies cause stalls

                        
                        Pipeline Hazards:
                        Data hazard: add rax, rbx; sub rcx, rax — second instruction needs result from first
Control hazard: Branches disrupt the pipeline (branch prediction helps)
Structural hazard: Multiple instructions need same hardware unit

Instruction Decoder

x86's variable-length instructions (1-15 bytes) create a decoding challenge:

Instruction Length Examples:

nop                          ; 90                    (1 byte)
ret                          ; C3                    (1 byte)
mov eax, 1                   ; B8 01 00 00 00        (5 bytes)
mov rax, 0x1122334455667788  ; 48 B8 88 77 66 55 44 33 22 11  (10 bytes, full 64-bit immediate)
lock cmpxchg [rdi+rcx*8], rax ; F0 48 0F B1 04 CF   (6 bytes with prefix)

The decoder must:
1. Find instruction boundaries (no fixed size!)
2. Handle prefixes (legacy prefixes such as 66h, 67h, LOCK, and REP, plus REX, VEX, and EVEX)
3. Parse ModR/M and SIB bytes
4. Extract immediate values
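
Step 3 can be tried by hand on the last example above, where `04 CF` are the ModR/M and SIB bytes of `lock cmpxchg [rdi+rcx*8], rax`. The Python sketch below only splits the bit fields; the function names are illustrative, and it ignores the REX extension bits, which are zero in that encoding.

```python
def decode_modrm(byte: int) -> dict:
    """Split a ModR/M byte into its mod, reg, and rm fields."""
    return {"mod": (byte >> 6) & 0b11, "reg": (byte >> 3) & 0b111, "rm": byte & 0b111}

def decode_sib(byte: int) -> dict:
    """Split a SIB byte into scale factor, index register, and base register."""
    return {
        "scale": 1 << ((byte >> 6) & 0b11),
        "index": (byte >> 3) & 0b111,
        "base": byte & 0b111,
    }

print(decode_modrm(0x04))   # rm = 4 means "a SIB byte follows"
print(decode_sib(0xCF))     # scale 8, index 1 (RCX), base 7 (RDI)
```

The `reg` field of 0 names RAX, so the two bytes together describe the whole operand pair.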

Decoder Design

Modern Intel/AMD CPUs use multiple parallel decoders:

Intel Skylake Decoder:

┌──────────────────────────────────────────────┐
│          Instruction Fetch (16 bytes/cycle)  │
└──────────────────────────────────────────────┘
                      │
                      ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Decode│ │Decode│ │Decode│ │Decode│ │Complex│
│  1   │ │  2   │ │  3   │ │  4   │ │Decode │
│simple│ │simple│ │simple│ │simple│ │(MSROM)│
└──────┘ └──────┘ └──────┘ └──────┘ └───────┘
  ↓                                    ↓
 1 µop                              Multiple
 each                                 µops

Simple: Decodes instructions that produce 1 micro-op (possibly fused)
Complex: Decodes instructions of up to 4 µops; longer sequences are fetched from the microcode ROM (MSROM)

                        
Assembly Optimization Tip: Prefer instructions that decode to single µops. Avoid complex instructions like LOOP, which on most Intel cores decodes to multiple µops and is slower than a manual DEC + JNZ sequence.
                    

Microcode Layer

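Every x86 instruction is decoded into one or more simple, RISC-like operations called micro-operations (µops). Hardwired decoders handle the simple cases, while the most complex instructions are expanded from a microcode ROM: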

µop Translation Examples

; Simple instruction → 1 µop
add rax, rbx             ; 1 µop: add

; Memory source operand → load + add (micro-fused into 1 µop on many cores)
add rax, [rbx]           ; 2 µops unfused: load + add

; Memory destination (read-modify-write) → about 4 µops
add dword [count], 5     ; load + add + store-address + store-data

; Complex string operation → microcoded
rep movsb                ; µop count depends on the CPU and the copy length
                         ; (internally loops in microcode)

Why Microcode Matters

Aspect	Hardwired	Microcoded
Speed	Single-cycle (ideal)	Multiple cycles
Examples	`add`, `mov`, `xor`	`div`, `cpuid`, `rep movs`
Updatable	No (silicon)	Yes (microcode updates)

Exercise: Count µops with perf

# Profile µops on Intel
perf stat -e uops_issued.any,uops_executed.thread ./program

# Sample output:
# 1,500,000,000   uops_issued.any
# 1,200,000,000   uops_executed.thread  
# 1.0 seconds elapsed

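# Issued and executed counts differ because of micro-fusion and µops eliminated at rename.
# To estimate work discarded after mispredicts, compare issued µops with retired µops
# (for example uops_retired.retire_slots; event names vary by microarchitecture).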

                        
Security: Microcode updates can deliver mitigations for some CPU vulnerabilities, such as the Spectre variant 2 controls. Check your current microcode version on Linux with: grep microcode /proc/cpuinfo
                    

Next Steps

With a solid understanding of CPU architecture, the next part dives deep into the registers, the CPU's working storage for assembly programming.
