
x86 Assembly Series Part 12: SIMD – SSE, AVX, AVX-512

February 6, 2026 Wasil Zafar 40 min read

Master SIMD (Single Instruction, Multiple Data) programming: SSE 128-bit vectors, AVX 256-bit operations, AVX-512 for maximum throughput, and practical use cases for parallel data processing.

Table of Contents

  1. SIMD Fundamentals
  2. Vector Registers
  3. SSE Operations
  4. AVX & AVX2
  5. AVX-512
  6. Memory Alignment
  7. Practical Examples

Vector Registers

XMM Registers (128-bit) - SSE


XMM Register Layout

XMM0-XMM15 (x86-64) | XMM0-XMM7 (x86-32)
128 bits = 4 × float OR 2 × double OR 16 × byte OR 8 × word ...

|  float[3]  |  float[2]  |  float[1]  |  float[0]  |
|  127-96    |   95-64    |   63-32    |    31-0    |

YMM Registers (256-bit) - AVX

; YMM0-YMM15: 256-bit (8 floats or 4 doubles)
vmovaps ymm0, [array]     ; Load 8 floats
vaddps ymm0, ymm0, ymm1   ; Add 8 floats in parallel

ZMM Registers (512-bit) - AVX-512

; ZMM0-ZMM31: 512-bit (16 floats or 8 doubles)
vmovaps zmm0, [array]     ; Load 16 floats (64-byte aligned!)
vaddps zmm0, zmm1, zmm2   ; Add 16 floats in parallel

; Mask registers k0-k7 (opmask)
kmovw k1, eax             ; Load mask from integer
vaddps zmm0 {k1}, zmm1, zmm2  ; Masked addition (only lanes where k1=1)
Extension   Registers   Width     Floats   Doubles
SSE         XMM0-15     128-bit   4        2
AVX/AVX2    YMM0-15     256-bit   8        4
AVX-512     ZMM0-31     512-bit   16       8

SSE Operations

Packed Floating-Point

; Packed single-precision (4 × 32-bit float)
movaps xmm0, [vec1]       ; Load aligned
movups xmm0, [vec1]       ; Load unaligned
addps xmm0, xmm1          ; XMM0[0-3] += XMM1[0-3]
mulps xmm0, xmm1          ; Multiply 4 floats
divps xmm0, xmm1          ; Divide 4 floats
sqrtps xmm0, xmm1         ; Square root of 4 floats

; Packed double-precision (2 × 64-bit double)
movapd xmm0, [dvec]
addpd xmm0, xmm1          ; Add 2 doubles

Packed Integer

; Packed byte operations (16 × 8-bit)
paddb xmm0, xmm1          ; Add 16 bytes (wrap around)
paddusb xmm0, xmm1        ; Add 16 bytes (unsigned saturation)
paddsb xmm0, xmm1         ; Add 16 bytes (signed saturation)
psubb xmm0, xmm1          ; Subtract 16 bytes

; Packed word operations (8 × 16-bit)
paddw xmm0, xmm1          ; Add 8 words
pmullw xmm0, xmm1         ; Multiply 8 words (low half)
pmulhw xmm0, xmm1         ; Multiply 8 words (high half, signed)

; Packed dword operations (4 × 32-bit)
paddd xmm0, xmm1          ; Add 4 dwords
psubd xmm0, xmm1          ; Subtract 4 dwords
pmulld xmm0, xmm1         ; Multiply 4 dwords (SSE4.1)

; Packed quadword (2 × 64-bit)
paddq xmm0, xmm1          ; Add 2 qwords

; Bitwise operations (all bits)
pand xmm0, xmm1           ; XMM0 = XMM0 & XMM1
por xmm0, xmm1            ; XMM0 = XMM0 | XMM1
pxor xmm0, xmm1           ; XMM0 = XMM0 ^ XMM1
pandn xmm0, xmm1          ; XMM0 = ~XMM0 & XMM1
Saturation vs Wraparound: paddb wraps (255+1=0). paddusb saturates (255+1=255). Use saturation for audio/image processing where clipping is needed.

AVX & AVX2

VEX Prefix: AVX uses 3-operand syntax (non-destructive). SSE: addps xmm0, xmm1 (XMM0 modified). AVX: vaddps xmm0, xmm1, xmm2 (XMM0 = XMM1 + XMM2).
; AVX 256-bit operations (8 floats)
vmovaps ymm0, [array]
vaddps ymm0, ymm1, ymm2   ; YMM0 = YMM1 + YMM2
vmulps ymm0, ymm1, [mem]  ; YMM0 = YMM1 * [mem]

; AVX2 256-bit integer operations
vpaddd ymm0, ymm1, ymm2   ; Add 8 × 32-bit integers

AVX-512

AVX-512 adds 512-bit operations, 32 registers, and powerful mask registers (k0-k7):

; Load and operate on 16 floats at once
vmovaps zmm0, [array]           ; Load 64 bytes (16 floats)
vaddps zmm0, zmm1, zmm2         ; ZMM0 = ZMM1 + ZMM2

; Masked operations - only process some lanes
kxnorw k1, k1, k1               ; k1 = all 1s (enable all lanes)
kmovw k2, eax                   ; k2 from integer mask

vaddps zmm0 {k2}, zmm1, zmm2    ; Add only where k2 bits are set
vaddps zmm0 {k2}{z}, zmm1, zmm2 ; Same, but zero masked lanes

; Broadcast (replicate scalar to all lanes)
vbroadcastss zmm0, [scalar]     ; All 16 floats = scalar value

; Ternary logic (any 3-input boolean function!)
vpternlogd zmm0, zmm1, zmm2, 0xCA  ; Complex bitwise operation

AVX-512 Feature Subsets

AVX-512 has many subsets. Check CPU support with CPUID:

  • AVX-512F: Foundation (512-bit ops, mask regs)
  • AVX-512VL: Vector Length (use AVX-512 features on XMM/YMM)
  • AVX-512BW: Byte and Word operations
  • AVX-512DQ: Doubleword and Quadword
  • AVX-512VNNI: Vector Neural Network Instructions

Memory Alignment

SIMD loads/stores have strict alignment requirements:

Instruction        Requires           Penalty for Misalignment
movaps / movapd    16-byte aligned    #GP fault (crash!)
movups / movupd    Any                Slight performance hit
vmovaps ymm        32-byte aligned    #GP fault
vmovups ymm        Any                Performance hit
vmovaps zmm        64-byte aligned    #GP fault

section .data
    align 16
    sse_array: times 4 dd 1.0    ; 16-byte aligned for SSE
    
    align 32
    avx_array: times 8 dd 1.0    ; 32-byte aligned for AVX
    
    align 64
    avx512_array: times 16 dd 1.0 ; 64-byte aligned for AVX-512

section .bss
    align 32
    result: resd 8                ; Aligned output buffer

section .text
    ; Safe: aligned load
    movaps xmm0, [sse_array]     ; OK - 16-byte aligned
    
    ; Dangerous: unaligned load with aligned instruction
    ; movaps xmm0, [sse_array + 4]  ; CRASH! Not 16-byte aligned
    
    ; Safe: use unaligned instruction
    movups xmm0, [sse_array + 4]  ; Works, slightly slower
Stack Alignment: Local arrays on the stack may not be naturally aligned. Force alignment with and rsp, -32 (saving the old rsp first so you can restore it), or reserve space with sub rsp, N and then mask the pointer with and rsp, -32.

Practical Examples

Array Sum (SSE)

; Sum an array of floats using SSE
; rdi = array pointer, rsi = count (multiple of 4)
; Returns sum in xmm0
array_sum_sse:
    xorps xmm0, xmm0          ; Accumulator = 0
    
.loop:
    cmp rsi, 0
    jle .done
    
    addps xmm0, [rdi]         ; Add 4 floats
    add rdi, 16               ; Advance pointer
    sub rsi, 4                ; Decrement count
    jmp .loop
    
.done:
    ; Horizontal add: xmm0 = [a, b, c, d]
    movhlps xmm1, xmm0        ; xmm1 = [c, d, ?, ?]
    addps xmm0, xmm1          ; xmm0 = [a+c, b+d, ...]
    movaps xmm1, xmm0
    shufps xmm1, xmm1, 0x55   ; xmm1 = [b+d, b+d, ...]
    addss xmm0, xmm1          ; Final sum in xmm0[0]
    ret

Dot Product (AVX)

; Dot product of two float arrays (length = multiple of 8)
; rdi = array1, rsi = array2, rdx = count
; Returns in xmm0
dot_product_avx:
    vxorps ymm0, ymm0, ymm0   ; Accumulator
    
.loop:
    cmp rdx, 0
    jle .reduce
    
    vmovaps ymm1, [rdi]       ; Load 8 floats from array1
    vmovaps ymm2, [rsi]       ; Load 8 floats from array2
    vfmadd231ps ymm0, ymm1, ymm2  ; ymm0 += ymm1 * ymm2 (FMA)
    
    add rdi, 32
    add rsi, 32
    sub rdx, 8
    jmp .loop
    
.reduce:
    ; Reduce 8 floats to 1
    vextractf128 xmm1, ymm0, 1    ; Upper 128 bits
    vaddps xmm0, xmm0, xmm1       ; Add upper and lower
    vhaddps xmm0, xmm0, xmm0      ; Horizontal add
    vhaddps xmm0, xmm0, xmm0      ; Final sum
    
    vzeroupper                    ; Clear upper YMM bits (avoid penalty)
    ret

Exercise: SIMD Image Brightness

Increase brightness of grayscale image by adding a constant to all pixels:

; Add brightness to image (saturating)
; rdi = pixel buffer, rsi = pixel count, dl = brightness delta
adjust_brightness:
    movd xmm1, edx            ; Delta byte in lane 0 (low byte = dl)
    pxor xmm2, xmm2           ; Zero shuffle mask
    pshufb xmm1, xmm2         ; Broadcast byte 0 to all 16 lanes (SSSE3)
    
.loop:
    cmp rsi, 16
    jl .done
    
    movdqu xmm0, [rdi]        ; Load 16 pixels
    paddusb xmm0, xmm1        ; Add with saturation
    movdqu [rdi], xmm0        ; Store result
    
    add rdi, 16
    sub rsi, 16
    jmp .loop
.done:
    ret