
x86 Assembly Series Part 12: SIMD – SSE, AVX, AVX-512

February 6, 2026 Wasil Zafar 40 min read

Master SIMD (Single Instruction, Multiple Data) programming: SSE 128-bit vectors, AVX 256-bit operations, AVX-512 for maximum throughput, and practical use cases for parallel data processing.

Table of Contents

  1. SIMD Fundamentals
  2. Vector Registers
  3. SSE Operations
  4. AVX & AVX2
  5. AVX-512
  6. Memory Alignment
  7. Practical Examples

Vector Registers

XMM Registers (128-bit) - SSE


XMM Register Layout

XMM0-XMM15 (x86-64) | XMM0-XMM7 (x86-32)
128 bits = 4 × float OR 2 × double OR 16 × byte OR 8 × word ...

|  float[3]  |  float[2]  |  float[1]  |  float[0]  |
|  127-96    |   95-64    |   63-32    |    31-0    |

YMM Registers (256-bit) - AVX

; YMM0-YMM15: 256-bit (8 floats or 4 doubles)
vmovaps ymm0, [array]     ; Load 8 floats
vaddps ymm0, ymm0, ymm1   ; Add 8 floats in parallel

ZMM Registers (512-bit) - AVX-512

; ZMM0-ZMM31: 512-bit (16 floats or 8 doubles)
vmovaps zmm0, [array]     ; Load 16 floats (64-byte aligned!)
vaddps zmm0, zmm1, zmm2   ; Add 16 floats in parallel

; Mask registers k0-k7 (opmask)
kmovw k1, eax             ; Load mask from integer
vaddps zmm0 {k1}, zmm1, zmm2  ; Masked addition (only lanes where k1=1)
Extension   Registers   Width     Floats   Doubles
SSE         XMM0-15     128-bit   4        2
AVX/AVX2    YMM0-15     256-bit   8        4
AVX-512     ZMM0-31     512-bit   16       8

SSE Operations

Packed Floating-Point

; Packed single-precision (4 × 32-bit float)
movaps xmm0, [vec1]       ; Load aligned
movups xmm0, [vec1]       ; Load unaligned
addps xmm0, xmm1          ; XMM0[0-3] += XMM1[0-3]
mulps xmm0, xmm1          ; Multiply 4 floats
divps xmm0, xmm1          ; Divide 4 floats
sqrtps xmm0, xmm1         ; Square root of 4 floats

; Packed double-precision (2 × 64-bit double)
movapd xmm0, [dvec]
addpd xmm0, xmm1          ; Add 2 doubles

Packed Integer

; Packed byte operations (16 × 8-bit)
paddb xmm0, xmm1          ; Add 16 bytes (wrap around)
paddusb xmm0, xmm1        ; Add 16 bytes (unsigned saturation)
paddsb xmm0, xmm1         ; Add 16 bytes (signed saturation)
psubb xmm0, xmm1          ; Subtract 16 bytes

; Packed word operations (8 × 16-bit)
paddw xmm0, xmm1          ; Add 8 words
pmullw xmm0, xmm1         ; Multiply 8 words (low half)
pmulhw xmm0, xmm1         ; Multiply 8 words (high half, signed)

; Packed dword operations (4 × 32-bit)
paddd xmm0, xmm1          ; Add 4 dwords
psubd xmm0, xmm1          ; Subtract 4 dwords
pmulld xmm0, xmm1         ; Multiply 4 dwords (SSE4.1)

; Packed quadword (2 × 64-bit)
paddq xmm0, xmm1          ; Add 2 qwords

; Bitwise operations (all bits)
pand xmm0, xmm1           ; XMM0 = XMM0 & XMM1
por xmm0, xmm1            ; XMM0 = XMM0 | XMM1
pxor xmm0, xmm1           ; XMM0 = XMM0 ^ XMM1
pandn xmm0, xmm1          ; XMM0 = ~XMM0 & XMM1
Saturation vs Wraparound: paddb wraps (255+1=0). paddusb saturates (255+1=255). Use saturation for audio/image processing where clipping is needed.

AVX & AVX2

VEX Prefix: AVX uses 3-operand syntax (non-destructive). SSE: addps xmm0, xmm1 (XMM0 modified). AVX: vaddps xmm0, xmm1, xmm2 (XMM0 = XMM1 + XMM2).
; AVX 256-bit operations (8 floats)
vmovaps ymm0, [array]
vaddps ymm0, ymm1, ymm2   ; YMM0 = YMM1 + YMM2
vmulps ymm0, ymm1, [mem]  ; YMM0 = YMM1 * [mem]

; AVX2 256-bit integer operations
vpaddd ymm0, ymm1, ymm2   ; Add 8 × 32-bit integers

AVX-512

AVX-512 adds 512-bit operations, 32 registers, and powerful mask registers (k0-k7):

; Load and operate on 16 floats at once
vmovaps zmm0, [array]           ; Load 64 bytes (16 floats)
vaddps zmm0, zmm1, zmm2         ; ZMM0 = ZMM1 + ZMM2

; Masked operations - only process some lanes
kxnorw k1, k1, k1               ; k1 = all 1s (enable all lanes)
kmovw k2, eax                   ; k2 from integer mask

vaddps zmm0 {k2}, zmm1, zmm2    ; Add only where k2 bits are set
vaddps zmm0 {k2}{z}, zmm1, zmm2 ; Same, but zero masked lanes

; Broadcast (replicate scalar to all lanes)
vbroadcastss zmm0, [scalar]     ; All 16 floats = scalar value

; Ternary logic (any 3-input boolean function!)
vpternlogd zmm0, zmm1, zmm2, 0xCA  ; Complex bitwise operation

AVX-512 Feature Subsets

AVX-512 has many subsets. Check CPU support with CPUID:

  • AVX-512F: Foundation (512-bit ops, mask regs)
  • AVX-512VL: Vector Length (use AVX-512 features on XMM/YMM)
  • AVX-512BW: Byte and Word operations
  • AVX-512DQ: Doubleword and Quadword
  • AVX-512VNNI: Vector Neural Network Instructions

Memory Alignment

SIMD loads/stores have strict alignment requirements:

Instruction        Requires           Penalty for Misalignment
movaps / movapd    16-byte aligned    #GP fault (crash!)
movups / movupd    Any                Slight performance hit
vmovaps ymm        32-byte aligned    #GP fault
vmovups ymm        Any                Performance hit
vmovaps zmm        64-byte aligned    #GP fault

section .data
    align 16
    sse_array: times 4 dd 1.0    ; 16-byte aligned for SSE
    
    align 32
    avx_array: times 8 dd 1.0    ; 32-byte aligned for AVX
    
    align 64
    avx512_array: times 16 dd 1.0 ; 64-byte aligned for AVX-512

section .bss
    align 32
    result: resd 8                ; Aligned output buffer

section .text
    ; Safe: aligned load
    movaps xmm0, [sse_array]     ; OK - 16-byte aligned
    
    ; Dangerous: unaligned load with aligned instruction
    ; movaps xmm0, [sse_array + 4]  ; CRASH! Not 16-byte aligned
    
    ; Safe: use unaligned instruction
    movups xmm0, [sse_array + 4]  ; Works, slightly slower
Stack Alignment: Local arrays on the stack may not be naturally aligned. Force alignment with and rsp, -32 (saving the old rsp first so you can restore it), or reserve space with sub rsp, N and then mask the pointer with and rsp, -32.

Practical Examples

Array Sum (SSE)

; Sum an array of floats using SSE
; rdi = array pointer, rsi = count (multiple of 4)
; Returns sum in xmm0
array_sum_sse:
    xorps xmm0, xmm0          ; Accumulator = 0
    
.loop:
    cmp rsi, 0
    jle .done
    
    addps xmm0, [rdi]         ; Add 4 floats
    add rdi, 16               ; Advance pointer
    sub rsi, 4                ; Decrement count
    jmp .loop
    
.done:
    ; Horizontal add: xmm0 = [a, b, c, d]
    movhlps xmm1, xmm0        ; xmm1 = [c, d, ?, ?]
    addps xmm0, xmm1          ; xmm0 = [a+c, b+d, ...]
    movaps xmm1, xmm0
    shufps xmm1, xmm1, 0x55   ; xmm1 = [b+d, b+d, ...]
    addss xmm0, xmm1          ; Final sum in xmm0[0]
    ret

Dot Product (AVX)

; Dot product of two float arrays (length = multiple of 8)
; rdi = array1, rsi = array2, rdx = count
; Returns in xmm0
dot_product_avx:
    vxorps ymm0, ymm0, ymm0   ; Accumulator
    
.loop:
    cmp rdx, 0
    jle .reduce
    
    vmovaps ymm1, [rdi]       ; Load 8 floats from array1
    vmovaps ymm2, [rsi]       ; Load 8 floats from array2
    vfmadd231ps ymm0, ymm1, ymm2  ; ymm0 += ymm1 * ymm2 (FMA)
    
    add rdi, 32
    add rsi, 32
    sub rdx, 8
    jmp .loop
    
.reduce:
    ; Reduce 8 floats to 1
    vextractf128 xmm1, ymm0, 1    ; Upper 128 bits
    vaddps xmm0, xmm0, xmm1       ; Add upper and lower
    vhaddps xmm0, xmm0, xmm0      ; Horizontal add
    vhaddps xmm0, xmm0, xmm0      ; Final sum
    
    vzeroupper                    ; Clear upper YMM bits (avoid penalty)
    ret

Exercise: SIMD Image Brightness

Increase brightness of grayscale image by adding a constant to all pixels:

; Add brightness to image (saturating)
; rdi = pixel buffer, rsi = pixel count, dl = brightness delta
adjust_brightness:
    movd xmm1, edx            ; Delta byte in lane 0 (low byte = dl)
    pxor xmm2, xmm2           ; Zero shuffle mask
    pshufb xmm1, xmm2         ; Broadcast byte 0 to all 16 lanes (SSSE3)
    
.loop:
    cmp rsi, 16
    jl .done
    
    movdqu xmm0, [rdi]        ; Load 16 pixels
    paddusb xmm0, xmm1        ; Add with saturation
    movdqu [rdi], xmm0        ; Store result
    
    add rdi, 16
    sub rsi, 16
    jmp .loop
.done:
    ret