Vector Registers
XMM Registers (128-bit) - SSE
Reference
XMM Register Layout
XMM0-XMM15 (x86-64) | XMM0-XMM7 (x86-32)
128 bits = 4 × float OR 2 × double OR 16 × byte OR 8 × word ...
| float[3] | float[2] | float[1] | float[0] |
| 127-96 | 95-64 | 63-32 | 31-0 |
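The overlay of views on the same 128 bits can be modeled in C with a union, assuming little-endian x86 (float[0] occupies bits 31-0, i.e. the lowest 4 bytes). A sketch, with a hypothetical type name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit XMM register: the same 16 bytes
 * viewed as 4 floats, 2 doubles, 16 bytes, or 8 words. */
typedef union {
    float    f32[4];   /* 4 x single-precision */
    double   f64[2];   /* 2 x double-precision */
    uint8_t  b[16];    /* 16 x byte            */
    uint16_t w[8];     /* 8 x word             */
} xmm_t;
```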
YMM Registers (256-bit) - AVX
; YMM0-YMM15: 256-bit (8 floats or 4 doubles)
vmovaps ymm0, [array] ; Load 8 floats
vaddps ymm0, ymm0, ymm1 ; Add 8 floats in parallel
ZMM Registers (512-bit) - AVX-512
; ZMM0-ZMM31: 512-bit (16 floats or 8 doubles)
vmovaps zmm0, [array] ; Load 16 floats (64-byte aligned!)
vaddps zmm0, zmm1, zmm2 ; Add 16 floats in parallel
; Mask registers k0-k7 (opmask)
kmovw k1, eax ; Load mask from integer
vaddps zmm0 {k1}, zmm1, zmm2 ; Masked addition (only lanes where k1=1)
| Extension | Registers | Width | Floats | Doubles |
|-----------|-----------|---------|--------|---------|
| SSE | XMM0-15 | 128-bit | 4 | 2 |
| AVX/AVX2 | YMM0-15 | 256-bit | 8 | 4 |
| AVX-512 | ZMM0-31 | 512-bit | 16 | 8 |
SSE Operations
Packed Floating-Point
; Packed single-precision (4 × 32-bit float)
movaps xmm0, [vec1] ; Load aligned
movups xmm0, [vec1] ; Load unaligned
addps xmm0, xmm1 ; XMM0[0-3] += XMM1[0-3]
mulps xmm0, xmm1 ; Multiply 4 floats
divps xmm0, xmm1 ; Divide 4 floats
sqrtps xmm0, xmm1 ; Square root of 4 floats
; Packed double-precision (2 × 64-bit double)
movapd xmm0, [dvec]
addpd xmm0, xmm1 ; Add 2 doubles
Packed Integer
; Packed byte operations (16 × 8-bit)
paddb xmm0, xmm1 ; Add 16 bytes (wrap around)
paddusb xmm0, xmm1 ; Add 16 bytes (unsigned saturation)
paddsb xmm0, xmm1 ; Add 16 bytes (signed saturation)
psubb xmm0, xmm1 ; Subtract 16 bytes
; Packed word operations (8 × 16-bit)
paddw xmm0, xmm1 ; Add 8 words
pmullw xmm0, xmm1 ; Multiply 8 words (low half)
pmulhw xmm0, xmm1 ; Multiply 8 words (high half, signed)
; Packed dword operations (4 × 32-bit)
paddd xmm0, xmm1 ; Add 4 dwords
psubd xmm0, xmm1 ; Subtract 4 dwords
pmulld xmm0, xmm1 ; Multiply 4 dwords (SSE4.1)
; Packed quadword (2 × 64-bit)
paddq xmm0, xmm1 ; Add 2 qwords
; Bitwise operations (all bits)
pand xmm0, xmm1 ; XMM0 = XMM0 & XMM1
por xmm0, xmm1 ; XMM0 = XMM0 | XMM1
pxor xmm0, xmm1 ; XMM0 = XMM0 ^ XMM1
pandn xmm0, xmm1 ; XMM0 = ~XMM0 & XMM1
Saturation vs Wraparound: paddb wraps around (255 + 1 = 0), while paddusb saturates (255 + 1 = 255). Use the saturating forms for audio and image processing, where clamping to the value range is preferable to wrapping past it.
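A scalar C model of the difference (hypothetical helper names; paddb and paddusb apply exactly this per byte across all 16 lanes):

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound add, as paddb does per lane: result is modulo 256. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Unsigned saturating add, as paddusb does per lane: clamp to 255. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > 255 ? 255 : (uint8_t)sum;
}
```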
AVX & AVX2
VEX Prefix: AVX instructions use a non-destructive 3-operand syntax. SSE: addps xmm0, xmm1 overwrites XMM0 (XMM0 += XMM1). AVX: vaddps xmm0, xmm1, xmm2 writes XMM0 = XMM1 + XMM2, leaving both source registers intact.
; AVX 256-bit operations (8 floats)
vmovaps ymm0, [array]
vaddps ymm0, ymm1, ymm2 ; YMM0 = YMM1 + YMM2
vmulps ymm0, ymm1, [mem] ; YMM0 = YMM1 * [mem]
; AVX2 256-bit integer operations
vpaddd ymm0, ymm1, ymm2 ; Add 8 × 32-bit integers
AVX-512
AVX-512 adds 512-bit operations, 32 registers, and powerful mask registers (k0-k7):
; Load and operate on 16 floats at once
vmovaps zmm0, [array] ; Load 64 bytes (16 floats)
vaddps zmm0, zmm1, zmm2 ; ZMM0 = ZMM1 + ZMM2
; Masked operations - only process some lanes
kxnorw k1, k1, k1 ; k1 = all 1s (enable all lanes)
kmovw k2, eax ; k2 from integer mask
vaddps zmm0 {k2}, zmm1, zmm2 ; Add only where k2 bits are set
vaddps zmm0 {k2}{z}, zmm1, zmm2 ; Same, but zero masked lanes
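The difference between merging ({k}) and zeroing ({k}{z}) masking can be modeled lane by lane in portable C. This is a sketch of the semantics with a hypothetical function name, not of the instruction itself:

```c
#include <assert.h>
#include <stdint.h>

/* Model of vaddps dst{k}[{z}], a, b over n lanes: lanes whose mask
 * bit is set get a+b; other lanes either keep dst's old value
 * (merging) or become 0 (zeroing). */
static void masked_add(float *dst, const float *a, const float *b,
                       uint16_t k, int zero_masked, int n) {
    for (int i = 0; i < n; i++) {
        if (k & (1u << i))
            dst[i] = a[i] + b[i];
        else if (zero_masked)
            dst[i] = 0.0f;
        /* else: merging - dst[i] is left unchanged */
    }
}
```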
; Broadcast (replicate scalar to all lanes)
vbroadcastss zmm0, [scalar] ; All 16 floats = scalar value
; Ternary logic (any 3-input boolean function!)
vpternlogd zmm0, zmm1, zmm2, 0xCA ; Bitwise select: zmm0 = (zmm0 & zmm1) | (~zmm0 & zmm2)
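The imm8 of vpternlogd is a truth table: for each bit position, the three input bits A, B, C form an index (A<<2 | B<<1 | C), and that bit of the immediate is the output. A small portable C model (hypothetical function name, 32-bit for brevity):

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate any 3-input boolean function bitwise, the way vpternlogd
 * does: index the imm8 truth table with (a<<2 | b<<1 | c) per bit. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1u) << 2)
                     | (((b >> i) & 1u) << 1)
                     |  ((c >> i) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << i;
    }
    return r;
}
```

With imm = 0xCA this yields bitwise select, (a & b) | (~a & c); imm = 0x96 yields three-way XOR.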
AVX-512 Feature Subsets
AVX-512 has many subsets. Check CPU support with CPUID:
- AVX-512F: Foundation (512-bit ops, mask regs)
- AVX-512VL: Vector Length (use AVX-512 features on XMM/YMM)
- AVX-512BW: Byte and Word operations
- AVX-512DQ: Doubleword and Quadword
- AVX-512VNNI: Vector Neural Network Instructions
Memory Alignment
SIMD loads/stores have strict alignment requirements:
| Instruction | Alignment required | Penalty for misalignment |
|-----------------|--------------------|--------------------------|
| movaps / movapd | 16-byte | #GP fault (crash!) |
| movups / movupd | Any | Slight performance hit |
| vmovaps ymm | 32-byte | #GP fault |
| vmovups ymm | Any | Performance hit |
| vmovaps zmm | 64-byte | #GP fault |
| vmovups zmm | Any | Performance hit |
section .data
align 16
sse_array: times 4 dd 1.0 ; 16-byte aligned for SSE
align 32
avx_array: times 8 dd 1.0 ; 32-byte aligned for AVX
align 64
avx512_array: times 16 dd 1.0 ; 64-byte aligned for AVX-512
section .bss
align 32
result: resd 8 ; Aligned output buffer
section .text
; Safe: aligned load
movaps xmm0, [sse_array] ; OK - 16-byte aligned
; Dangerous: unaligned load with aligned instruction
; movaps xmm0, [sse_array + 4] ; CRASH! Not 16-byte aligned
; Safe: use unaligned instruction
movups xmm0, [sse_array + 4] ; Works, slightly slower
Stack Alignment: Local arrays on the stack are not automatically aligned. Force alignment with and rsp, -32 (after saving the original rsp so it can be restored), or over-allocate with sub rsp, N and then round the pointer down with and.
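On the heap, C11 aligned_alloc (or POSIX posix_memalign) provides the guarantees these aligned loads need. A sketch with a hypothetical helper name; note that aligned_alloc requires the size to be a multiple of the alignment:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate a float buffer aligned for AVX (32-byte) loads,
 * rounding the byte size up to a multiple of the alignment. */
static float *alloc_avx_floats(size_t count) {
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 31) & ~(size_t)31;
    return (float *)aligned_alloc(32, rounded);
}
```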
Practical Examples
Array Sum (SSE)
; Sum an array of floats using SSE
; rdi = array pointer, rsi = count (multiple of 4)
; Returns sum in xmm0
array_sum_sse:
xorps xmm0, xmm0 ; Accumulator = 0
.loop:
test rsi, rsi ; Count exhausted?
jle .done
addps xmm0, [rdi] ; Add 4 floats
add rdi, 16 ; Advance pointer
sub rsi, 4 ; Decrement count
jmp .loop
.done:
; Horizontal add: xmm0 = [a, b, c, d]
movhlps xmm1, xmm0 ; xmm1 = [c, d, ?, ?]
addps xmm0, xmm1 ; xmm0 = [a+c, b+d, ...]
movaps xmm1, xmm0
shufps xmm1, xmm1, 0x55 ; xmm1 = [b+d, b+d, ...]
addss xmm0, xmm1 ; Final sum in xmm0[0]
ret
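The same routine is commonly written in C with SSE intrinsics. A sketch assuming an x86-64 compiler and count a multiple of 4 (_mm_loadu_ps is used so the input need not be 16-byte aligned); the reduction mirrors the movhlps/shufps sequence above:

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Sum a float array 4 lanes at a time, then reduce the 4 partial
 * sums to one scalar. */
static float array_sum_sse(const float *a, size_t count) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < count; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    __m128 hi = _mm_movehl_ps(acc, acc);      /* [c, d, c, d]      */
    acc = _mm_add_ps(acc, hi);                /* [a+c, b+d, ...]   */
    hi  = _mm_shuffle_ps(acc, acc, 0x55);     /* lane 1 everywhere */
    acc = _mm_add_ss(acc, hi);                /* total in lane 0   */
    return _mm_cvtss_f32(acc);
}
```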
Dot Product (AVX)
; Dot product of two float arrays (length = multiple of 8)
; rdi = array1, rsi = array2, rdx = count
; Returns in xmm0
dot_product_avx:
vxorps ymm0, ymm0, ymm0 ; Accumulator
.loop:
test rdx, rdx ; Count exhausted?
jle .reduce
vmovaps ymm1, [rdi] ; Load 8 floats from array1
vmovaps ymm2, [rsi] ; Load 8 floats from array2
vfmadd231ps ymm0, ymm1, ymm2 ; ymm0 += ymm1 * ymm2 (FMA)
add rdi, 32
add rsi, 32
sub rdx, 8
jmp .loop
.reduce:
; Reduce 8 floats to 1
vextractf128 xmm1, ymm0, 1 ; Upper 128 bits
vaddps xmm0, xmm0, xmm1 ; Add upper and lower
vhaddps xmm0, xmm0, xmm0 ; Horizontal add
vhaddps xmm0, xmm0, xmm0 ; Final sum
vzeroupper ; Clear upper YMM bits (avoid penalty)
ret
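The structure of that loop, per-lane accumulation followed by one reduction at the end, can be sketched in portable C. This is an 8-lane scalar model of the YMM accumulator (hypothetical function name), not the actual instructions:

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with 8 scalar accumulators standing in for the 8 YMM
 * lanes: accumulate per lane (the vfmadd231ps step), then reduce. */
static float dot_product_lanes(const float *a, const float *b, size_t count) {
    float acc[8] = {0};
    for (size_t i = 0; i < count; i += 8)
        for (int lane = 0; lane < 8; lane++)
            acc[lane] += a[i + lane] * b[i + lane];
    float sum = 0.0f;        /* the vextractf128/vhaddps reduction */
    for (int lane = 0; lane < 8; lane++)
        sum += acc[lane];
    return sum;
}
```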
Exercise: SIMD Image Brightness
Increase the brightness of a grayscale image by adding a constant to every pixel:
; Add brightness to image (saturating)
; rdi = pixel buffer, rsi = pixel count, dl = brightness delta
adjust_brightness:
movd xmm1, edx ; Low byte = brightness delta
pxor xmm2, xmm2 ; Zeroed shuffle control
pshufb xmm1, xmm2 ; Broadcast byte 0 to all 16 lanes (SSSE3)
.loop:
cmp rsi, 16 ; Fewer than 16 pixels left?
jl .done ; (tail pixels left as an exercise)
movdqu xmm0, [rdi] ; Load 16 pixels
paddusb xmm0, xmm1 ; Add with saturation
movdqu [rdi], xmm0 ; Store result
add rdi, 16
sub rsi, 16
jmp .loop
.done:
ret
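A scalar C reference is handy for checking the SIMD routine against. This sketch (hypothetical name) applies the same saturating semantics as paddusb, one pixel at a time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: add delta to every pixel, clamping at 255 -
 * the per-byte behavior paddusb applies to 16 pixels at once. */
static void adjust_brightness_ref(uint8_t *pixels, size_t count, uint8_t delta) {
    for (size_t i = 0; i < count; i++) {
        unsigned sum = (unsigned)pixels[i] + delta;
        pixels[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
}
```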