Back to Technology

ARM Assembly Part 2: ARM32 Instruction Set Fundamentals

February 19, 2026 Wasil Zafar 20 min read

Explore the 32-bit ARM instruction set in depth: ARM vs Thumb encoding modes, the register file and CPSR flags register, data processing instructions, load/store operations, the powerful barrel shifter, and ARM's unique predicated conditional execution.

Table of Contents

  1. Introduction
  2. ARM vs Thumb Modes
  3. Register File & CPSR
  4. Data Processing Instructions
  5. The Barrel Shifter
  6. Load/Store Instructions
  7. Conditional Execution
  8. Conclusion & Next Steps

Introduction

Series Overview: This is Part 2 of our 28-part ARM Assembly Mastery Series. Part 1 covered ARM architecture history and core concepts. Now we dive into the 32-bit ARM instruction set — the foundation of billions of embedded and mobile devices still in use today.

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 2
1
Architecture History & Core Concepts
ARMv1→v9, RISC philosophy, profiles
2
ARM32 Instruction Set Fundamentals
ARM vs Thumb, registers, CPSR, barrel shifter
You Are Here
3
AArch64 Registers, Addressing & Data Movement
X/W regs, addressing modes, load/store pairs
4
Arithmetic, Logic & Bit Manipulation
ADD/SUB, bitfield extract/insert, CLZ
5
Branching, Loops & Conditional Execution
Branch types, link register, jump tables
6
Stack, Subroutines & AAPCS
Calling conventions, prologue/epilogue
7
Memory Model, Caches & Barriers
Weak ordering, DMB/DSB/ISB, TLB
8
NEON & Advanced SIMD
Vector ops, intrinsics, media processing
9
SVE & SVE2 Scalable Vector Extensions
Predicate regs, gather/scatter, HPC/ML
10
Floating-Point & VFP Instructions
IEEE-754, scalar FP, rounding modes
11
Exception Levels, Interrupts & Vector Tables
EL0–EL3, GIC, fault debugging
12
MMU, Page Tables & Virtual Memory
Stage-1 translation, permissions, huge pages
13
TrustZone & ARM Security Extensions
Secure monitor, world switching, TF-A
14
Cortex-M Assembly & Bare-Metal Embedded
NVIC, SysTick, linker scripts, low-power
15
Cortex-A System Programming & Boot
EL3→EL1 transitions, MMU setup, PSCI
16
Apple Silicon & macOS ABI
ARM64e PAC, Mach-O, dyld, perf counters
17
Inline Assembly, GCC/Clang & C Interop
Constraints, clobbers, compiler interaction
18
Performance Profiling & Micro-Optimization
Pipeline hazards, PMU, benchmarking
19
Reverse Engineering & ARM Binary Analysis
ELF, disassembly, CFR, iOS/Android quirks
20
Building a Bare-Metal OS Kernel
Bootloader, UART, scheduler, context switch
21
ARM Microarchitecture Deep Dive
OOO pipelines, reorder buffers, branch predict
22
Virtualization Extensions
EL2 hypervisor, stage-2 translation, KVM
23
Debugging & Tooling Ecosystem
GDB, OpenOCD/JTAG, ETM/ITM, QEMU
24
Linkers, Loaders & Binary Format Internals
ELF deep dive, relocations, PIC, crt0
25
Cross-Compilation & Build Systems
GCC/Clang toolchains, CMake, firmware gen
26
ARM in Real Systems
Android, FreeRTOS/Zephyr, U-Boot, TF-A
27
Security Research & Exploitation
ASLR, PAC attacks, ROP/JOP, kernel exploit
28
Emerging ARMv9 & Future Directions
MTE, SME, confidential compute, AI accel

ARM32 Overview

ARM32, formally known as AArch32, is the 32-bit instruction set architecture that dominated ARM computing from 1985 until the AArch64 transition began in 2013. Even today, billions of embedded devices, IoT sensors, and legacy systems run ARM32 code. Every Cortex-M microcontroller (the most widely shipped ARM core family) runs exclusively in AArch32's Thumb-2 mode.

Understanding ARM32 is essential for several reasons:

  • Embedded dominance: All Cortex-M firmware (STM32, nRF52, RP2040) uses Thumb-2, a subset of AArch32
  • Legacy codebases: Millions of lines of 32-bit ARM code exist in production — Android NDK libraries, game engines, signal processing
  • AArch64 foundation: Many AArch64 concepts (condition flags, load/store architecture, barrel shifting) originated in ARM32
  • Reverse engineering: Security analysis of IoT firmware requires ARM32/Thumb disassembly skills

Instruction Width & Encoding

ARM32 has a unique characteristic among ISAs: multiple instruction encodings for the same architecture. The CPU can switch between them at runtime, and a single binary can mix both:

Encoding Width Code Density Features Typical Use
ARM Fixed 32-bit Lower (~1.4× Thumb) Full: conditional execution, barrel shifter on every op Performance-critical inner loops
Thumb Fixed 16-bit Highest Limited: R0–R7 only, no conditionals, no barrel shift Legacy compact code
Thumb-2 Mixed 16/32-bit High (~70% of ARM) Near-full: conditionals via IT, barrel shifter, wide registers Modern embedded (Cortex-M default)

ARM vs Thumb Modes

The dual-mode nature of ARM32 is one of its most distinctive features. Think of it like a bilingual speaker who switches languages depending on the audience — ARM mode is the verbose, expressive language, while Thumb is the concise, compact one.

ARM Mode (32-bit)

In ARM mode, every instruction is exactly 32 bits (4 bytes) wide. This fixed width provides several advantages:

  • Full conditional execution: Every instruction has a 4-bit condition field, allowing any instruction to be predicated (executed only if flags match)
  • Full register access: All 16 registers (R0–R15) available in every instruction
  • Inline barrel shifter: The second operand of data processing instructions can be shifted/rotated at zero additional cost
  • Uniform decoding: Fixed width makes hardware decode simpler and enables constant-time fetch

The trade-off is code size: ARM mode code is typically 30–40% larger than equivalent Thumb code, consuming more instruction cache and flash memory.

@ ARM mode example: Absolute value in 2 instructions (no branch!)
    CMP   R0, #0           @ Compare R0 with zero
    RSBLT R0, R0, #0       @ If R0 < 0: R0 = 0 - R0 (negate)
    @ This conditional execution is unique to ARM mode

Thumb Mode (16-bit)

Thumb mode encodes instructions in 16 bits (2 bytes), roughly halving code size compared to ARM mode. This was critical when ARM7TDMI-era devices had only 32–256 KB of flash memory.

Thumb restrictions versus ARM mode:

  • Limited registers: Most instructions can only access R0–R7 (the "low registers"); R8–R15 available only through MOV, ADD, CMP
  • No conditional execution: Only branch instructions can be conditional
  • No barrel shifter inline: Shift operations require separate instructions
  • 2-operand format: Most ALU instructions use Rd = Rd op Rm format (destination must be a source)
Analogy

The SMS vs Email Analogy

Think of ARM mode like email — you can write detailed messages with formatting, attachments, and CC lists. Thumb mode is like SMS from the 2000s — you have 160 characters, so you abbreviate everything. The message gets through, but you sacrifice expressiveness for brevity. Thumb-2 is like modern messaging with both short texts and longer rich messages.

Thumb-2 Mode (Mixed)

Introduced with ARMv6T2 and perfected in ARMv7, Thumb-2 is the best of both worlds. It extends Thumb with 32-bit wide instructions while keeping the 16-bit encodings for common operations. The CPU identifies wide instructions by their first halfword's bit pattern.

Thumb-2 advantages:

  • Near-ARM expressiveness: Full register access, barrel shifter, IT blocks for conditional execution
  • Thumb-class density: Code size is roughly 26% smaller than ARM mode
  • Cortex-M exclusive: Cortex-M processors only support Thumb/Thumb-2 (no ARM mode at all)
  • Performance parity: On cores with 32-bit bus, Thumb-2 matches ARM mode performance
@ Thumb-2 wide instruction example (32-bit encoding in Thumb state)
    MOVW  R0, #0x1234       @ 32-bit Thumb-2: load lower 16 bits
    MOVT  R0, #0x5678       @ 32-bit Thumb-2: load upper 16 bits
    @ R0 now contains 0x56781234

@ Thumb-2 narrow instructions (16-bit encoding, same behavior)
    MOVS  R0, #42           @ 16-bit Thumb: move immediate to low register
    ADDS  R1, R0, R2        @ 16-bit Thumb: 3-register add

Interworking & BX/BLX

ARM processors can switch between ARM and Thumb states at runtime using interworking instructions. The mechanism is elegant: the least significant bit (LSB) of the target address controls the mode:

  • LSB = 0: Target is ARM mode code (address must be 4-byte aligned)
  • LSB = 1: Target is Thumb mode code (the bit is stripped; code is 2-byte aligned)
@ Interworking examples
    .arm                     @ Currently in ARM mode
    LDR   R0, =thumb_func   @ Load address of Thumb function (+1 in LSB)
    BX    R0                 @ Branch and eXchange: switch to Thumb mode

    .thumb                   @ Now in Thumb mode
thumb_func:
    MOVS  R0, #1
    BX    LR                 @ Return; LR's LSB determines caller's mode

@ Branch-Link-Exchange: call + mode switch in one instruction
    BLX   arm_function       @ Call ARM function from Thumb (sets LR, clears T)
ABI Note: The ARM EABI (Embedded Application Binary Interface) requires that function pointers always have the correct LSB set. If you're calling a Thumb function at address 0x8000, the pointer must be 0x8001. The linker handles this automatically for direct calls, but when manually constructing function pointers (e.g., in jump tables), forgetting the LSB is a classic bug that causes a processor fault.
Key Insight: The CPU mode is determined by the T bit in CPSR. BX Rn switches mode based on Rn[0]: LSB=1 enters Thumb, LSB=0 enters ARM. This is how function pointers work transparently across modes.
@ ARM mode: switch to Thumb and call a function
    BX   r0          @ Jump to address in r0; T-bit set by r0[0]
    BLX  r1          @ Branch-with-link-and-exchange: call + mode switch

@ Check current mode (1 = Thumb, 0 = ARM) via CPSR T bit
    MRS  r0, CPSR
    TST  r0, #0x20   @ T bit is bit 5 of CPSR
    BNE  thumb_mode

Register File & CPSR

The ARM32 register file is the programmer's primary workspace. Understanding every register's role and constraints is essential for writing correct assembly code.

General-Purpose Registers (R0–R15)

ARM32 provides 16 general-purpose 32-bit registers, labeled R0 through R15. While all can hold data, several have architectural or conventional roles:

Register AAPCS Name Role Preserved?
R0–R3a1–a4Arguments / return values / scratchNo (caller-saved)
R4–R11v1–v8Local variablesYes (callee-saved)
R12IPIntra-procedure scratch (linker veneer)No
R13SPStack PointerYes (must be restored)
R14LRLink Register (return address)No (overwritten by BL)
R15PCProgram CounterN/A (architectural)
Key Point: In Thumb mode, most 16-bit instructions can only access "low registers" (R0–R7). High registers (R8–R15) require special MOV, ADD, or CMP encodings. This is why Thumb code tends to be register-constrained compared to ARM mode.

Special Registers (SP, LR, PC)

Three registers have special architectural behavior that every ARM programmer must understand:

R13 / SP (Stack Pointer): Points to the current top of the stack. The ARM architecture defines SP as growing downward (toward lower addresses). Each processor mode has its own banked SP, allowing interrupt handlers to use their own stack without corrupting the user stack.

R14 / LR (Link Register): When a BL (Branch with Link) instruction executes, the return address is saved into LR. Functions return by branching to LR: BX LR or loading LR into PC: MOV PC, LR. If a function calls another function, it must save LR first (typically via PUSH {LR}).

R15 / PC (Program Counter): Reading PC returns the address of the current instruction plus 8 bytes in ARM mode (plus 4 in Thumb) due to the pipeline. This offset is a frequent source of confusion:

@ PC-relative addressing — the +8 offset matters!
    .arm
    MOV  R0, PC            @ R0 = address of this instruction + 8
    @ If this instruction is at 0x1000, R0 = 0x1008

@ Common PC-relative pattern: load from literal pool
    LDR  R0, [PC, #offset] @ Loads from (PC+8+offset)
    @ The assembler calculates 'offset' accounting for +8

CPSR & SPSR Flags

The Current Program Status Register (CPSR) is a 32-bit register that controls processor state and records the result of operations. It has four groups of fields:

Registers CPSR
CPSR Bit Layout Reference
Bit:  31  30  29  28  27  26:25  24  23:20  19:16  15:10  9  8  7  6  5  4:0
      N   Z   C   V   Q   IT[1:0] J  [RAZ]  GE     IT     E  A  I  F  T  Mode

Condition Flags (31:28):
  N = Negative (result bit 31)     Z = Zero (result == 0)
  C = Carry (unsigned overflow)    V = oVerflow (signed overflow)

Control Bits:
  T = Thumb state (1=Thumb, 0=ARM)
  I = IRQ disable    F = FIQ disable    A = Async abort disable
  E = Endianness (0=LE, 1=BE)

Mode Bits (4:0):
  10000 = User     10001 = FIQ      10010 = IRQ
  10011 = SVC      10111 = Abort    11011 = Undef    11111 = System

The SPSR (Saved Program Status Register) automatically saves the CPSR when an exception occurs, allowing the exception handler to restore the original state when returning.

@ Reading and modifying CPSR
    MRS  R0, CPSR          @ Read CPSR into R0
    BIC  R0, R0, #0x80     @ Clear IRQ disable bit (enable IRQ)
    MSR  CPSR_c, R0        @ Write back only control field bits

@ Check condition flags after CMP
    CMP  R0, R1            @ Sets N, Z, C, V based on R0 - R1
    MRS  R2, CPSR          @ R2 now contains the flag state
    AND  R2, R2, #0xF0000000  @ Isolate NZCV bits

Banked Registers & Modes

ARM32 has 7 processor modes, and several registers are banked — meaning each mode has its own physical copy. When the processor switches modes (e.g., on an interrupt), the banked registers instantly contain mode-specific values without needing to save/restore:

Mode CPSR Mode Bits Banked Registers When Entered
User10000None (base set)Normal application code
FIQ10001R8–R12, SP, LR, SPSRFast interrupt
IRQ10010SP, LR, SPSRNormal interrupt
SVC10011SP, LR, SPSRSWI / SVC instruction
Abort10111SP, LR, SPSRMemory fault
Undefined11011SP, LR, SPSRUndefined instruction
System11111Same as UserPrivileged User-mode registers
FIQ Fast Path: FIQ mode banks seven registers (R8–R14), meaning a fast interrupt handler can use R8–R12 as scratch without saving anything. This design made ARM32 exceptionally efficient for latency-critical interrupt handling in early embedded systems. AArch64 replaced this mechanism with the more uniform EL0–EL3 exception level model.

Data Processing Instructions

ARM32's data processing instructions all share a common 32-bit encoding format. They operate exclusively on registers (load/store architecture), and the second operand can be a register, shifted register, or rotated immediate.

Arithmetic: ADD, SUB, RSB, ADC

ARM provides a complete set of arithmetic operations. Note the distinction between forward and reverse operations, and how the carry flag enables multi-word arithmetic:

@ Basic arithmetic
    ADD  R0, R1, R2        @ R0 = R1 + R2
    ADD  R0, R1, #100      @ R0 = R1 + 100
    SUB  R0, R1, R2        @ R0 = R1 - R2
    RSB  R0, R1, #0        @ R0 = 0 - R1   (negate: Reverse Subtract)
    RSB  R0, R1, R2        @ R0 = R2 - R1   (useful when R2 is complex)

@ Multi-word (64-bit) addition: [R1:R0] = [R3:R2] + [R5:R4]
    ADDS R0, R2, R4        @ Low 32 bits; S sets Carry flag
    ADC  R1, R3, R5        @ High 32 bits + Carry from low add

@ Multiply instructions
    MUL  R0, R1, R2        @ R0 = R1 * R2 (bottom 32 bits)
    MLA  R0, R1, R2, R3    @ R0 = (R1 * R2) + R3 (multiply-accumulate)
    UMULL R0, R1, R2, R3   @ [R1:R0] = R2 * R3 (unsigned 64-bit result)
    SMULL R0, R1, R2, R3   @ [R1:R0] = R2 * R3 (signed 64-bit result)
Why RSB? In a 3-operand instruction SUB Rd, Rn, Op2, the immediate/shifted value is always Op2 (the second operand). If you want Rd = immediate - Rn, you can't just swap operands because Op2 is fixed on the right. RSB (Reverse Subtract) solves this: RSB R0, R1, #100 computes R0 = 100 - R1.
@ Arithmetic examples
    ADD  r0, r1, r2        @ r0 = r1 + r2
    ADDS r0, r1, r2        @ r0 = r1 + r2 ; update CPSR flags
    SUB  r0, r1, #10       @ r0 = r1 - 10
    RSB  r0, r1, #0        @ r0 = 0 - r1  (negate)
    ADC  r2, r2, r3        @ r2 = r2 + r3 + Carry (64-bit add high word)

Logical: AND, ORR, EOR, BIC

Bitwise logical operations are the foundation of low-level programming — essential for hardware register manipulation, bit masking, and flag management:

@ Bitwise logical operations
    AND  R0, R1, #0xFF     @ R0 = R1 & 0xFF  (mask lower byte)
    ORR  R0, R1, #0x80     @ R0 = R1 | 0x80  (set bit 7)
    EOR  R0, R1, R2        @ R0 = R1 ^ R2    (XOR / toggle bits)
    BIC  R0, R1, #0x03     @ R0 = R1 & ~0x03 (clear bits 0-1)

@ Common patterns
    EOR  R0, R0, R0        @ R0 = 0           (clear register)
    ORR  R0, R0, #(1<<5)   @ Set bit 5        (set T bit in CPSR)
    BIC  R0, R0, #(1<<7)   @ Clear bit 7      (enable IRQ in CPSR)
    EOR  R0, R1, R2        @ Toggle: flip bits in R1 where R2 has 1s
Real World

Bit Manipulation in Hardware Drivers

When configuring a UART peripheral on an STM32 microcontroller, you frequently use these patterns:

@ Enable UART1 clock in RCC register (typical Cortex-M pattern)
    LDR  R0, =0x40021018   @ RCC_APB2ENR address
    LDR  R1, [R0]          @ Read current value
    ORR  R1, R1, #(1<<14)  @ Set USART1EN bit (bit 14)
    STR  R1, [R0]          @ Write back — UART1 clock now enabled

@ Configure baud rate: clear bits [3:0], then set new value
    LDR  R0, =0x40011008   @ USART1_BRR address
    LDR  R1, [R0]
    BIC  R1, R1, #0x0F     @ Clear mantissa fraction bits
    ORR  R1, R1, #0x09     @ Set new fraction value
    STR  R1, [R0]

Comparison: CMP, CMN, TST, TEQ

Comparison instructions set the CPSR flags without storing a result. They always update flags (no S suffix needed):

Instruction Operation Equivalent To Use Case
CMP Rn, Op2Rn − Op2 (flags only)SUBS but discard resultCompare two values
CMN Rn, Op2Rn + Op2 (flags only)ADDS but discard resultCompare with negated value
TST Rn, Op2Rn AND Op2 (flags only)ANDS but discard resultTest if specific bits are set
TEQ Rn, Op2Rn EOR Op2 (flags only)EORS but discard resultTest equality without affecting C/V
@ Comparison patterns
    CMP  R0, #10           @ Set flags for R0 - 10
    BEQ  equal_case        @ Branch if R0 == 10 (Z flag set)
    BGT  greater_case      @ Branch if R0 > 10 (signed)

    TST  R0, #0x01         @ Test if bit 0 is set (odd number check)
    BNE  is_odd            @ Branch if bit 0 = 1 (Z flag clear)

    TEQ  R0, R1            @ Test if R0 == R1 without affecting C flag
    BEQ  are_equal

Move: MOV, MVN, MOVW, MOVT

Move instructions copy values into registers. The evolution from MOV to MOVW/MOVT reflects a key ARM32 design challenge: loading arbitrary 32-bit constants.

@ Basic moves
    MOV  R0, R1            @ R0 = R1
    MOV  R0, #255          @ R0 = 255 (8-bit immediate, rotated)
    MVN  R0, #0            @ R0 = ~0 = 0xFFFFFFFF (all bits set)
    MVN  R0, R1            @ R0 = ~R1 (bitwise NOT)

@ Loading any 32-bit constant (ARMv6T2+ / Thumb-2)
    MOVW R0, #0xDEAD       @ R0 = 0x0000DEAD (lower 16 bits)
    MOVT R0, #0xBEEF       @ R0 = 0xBEEFDEAD (upper 16 bits set)

@ Older method: LDR pseudo-instruction (literal pool)
    LDR  R0, =0xBEEFDEAD  @ Assembler places constant in memory
                            @ and generates: LDR R0, [PC, #offset]
@ Loading a 32-bit constant using MOVW + MOVT
    MOVW r0, #0xDEAD       @ r0 = 0x0000DEAD
    MOVT r0, #0xBEEF       @ r0 = 0xBEEFDEAD

The Barrel Shifter

The barrel shifter is ARM32's secret weapon — a hardware unit that can shift or rotate the second operand of any data processing instruction at zero additional cost. This means an instruction like ADD R0, R1, R2, LSL #3 computes R0 = R1 + (R2 × 8) in a single cycle. No other mainstream architecture offers this flexibility.

Analogy

The Combo Tool Analogy

Imagine a power tool that combines a drill and a screwdriver. You can drill a hole and drive a screw in the same motion. ARM's barrel shifter is similar — it lets you shift/rotate a value and perform an ALU operation in the same instruction cycle. In x86, you'd need two separate instructions (a shift, then an add).

LSL, LSR, ASR, ROR, RRX

ARM32 supports five shift/rotate types, each specified by a 2-bit type code and either a 5-bit immediate or a register for the shift amount:

Shift Name Operation C Flag Common Use
LSL #nLogical Shift LeftShift left, fill with zerosLast bit shifted outMultiply by 2n
LSR #nLogical Shift RightShift right, fill with zerosLast bit shifted outUnsigned divide by 2n
ASR #nArithmetic Shift RightShift right, replicate sign bitLast bit shifted outSigned divide by 2n
ROR #nRotate RightBits wrap around from LSB to MSBLast bit rotatedBit permutation, crypto
RRXRotate Right eXtended33-bit rotate through C flagOld bit 0Multi-word shifts

Shifter Operands in Instructions

The barrel shifter applies to the second operand (Op2) of every data processing instruction. This creates powerful single-instruction idioms:

@ Barrel shifter examples — all single-cycle instructions
    ADD  R0, R1, R2, LSL #3    @ R0 = R1 + (R2 * 8)   — array indexing
    SUB  R0, R1, R2, ASR #1    @ R0 = R1 - (R2 / 2)   — signed halving
    MOV  R0, R1, ROR #16       @ R0 = swap halfwords of R1
    RSB  R0, R1, R1, LSL #4    @ R0 = R1*16 - R1 = R1*15
    ADD  R0, R1, R1, LSL #2    @ R0 = R1 + R1*4 = R1*5

@ Multiply by 7 using barrel shifter (faster than MUL on early ARM)
    RSB  R0, R1, R1, LSL #3    @ R0 = R1*8 - R1 = R1*7

@ Shift by register value (one extra cycle on some cores)
    AND  R0, R0, R1, LSR R2    @ R0 = R0 & (R1 >> R2)

Immediate Rotate Encoding

ARM32's immediate encoding is one of its most clever — and confusing — design choices. The 12-bit immediate field encodes a 32-bit constant as:

  • An 8-bit value (range 0–255)
  • Rotated right by 2 × (4-bit rotation field) (0, 2, 4, ... 30 positions)
12-bit immediate field:
┌────────────┬────────────────────────┐
│ rot (4 bit)│    imm8 (8 bit)       │
└────────────┴────────────────────────┘

Value = imm8 ROR (2 × rot)

Examples of encodable constants:
  0xFF       → imm8=0xFF, rot=0   (no rotation)
  0xFF0      → imm8=0xFF, rot=14  (0xFF rotated right by 28)
  0x3FC      → imm8=0xFF, rot=15  (0xFF rotated right by 30)
  0xC000003F → imm8=0xFF, rot=1   (0xFF rotated right by 2)

NOT encodable (requires MOVW+MOVT or literal pool):
  0x101      → can't represent as rotated 8-bit value
  0xFFFF     → exceeds any single rotation
  0x1234     → no rotation of 8-bit value yields this
Common Pitfall: Not all 32-bit constants are directly encodable as ARM32 immediates. If the assembler reports "invalid constant after fixup", use one of these alternatives: MOVW+MOVT (ARMv6T2+), LDR Rd, =constant (literal pool), or decompose the constant mathematically (e.g., MOV R0, #0x100 + ORR R0, R0, #0x34 for 0x134).

Load/Store Instructions

As a pure load/store architecture, ARM32 uses dedicated instructions for all memory access. The ALU never touches memory directly — you must explicitly load data into registers, process it, then store results back.

LDR, STR, LDRB, STRB

The LDR/STR family handles data transfers of different sizes:

Instruction Size Sign Extension Description
LDR / STR32-bit (word)N/AFull word load/store
LDRH / STRH16-bit (halfword)Zero-extendedUnsigned halfword
LDRSH16-bit (halfword)Sign-extendedSigned halfword to 32-bit register
LDRB / STRB8-bit (byte)Zero-extendedUnsigned byte
LDRSB8-bit (byte)Sign-extendedSigned byte to 32-bit register
LDRD / STRD64-bit (double)N/ATwo registers, 8-byte aligned

Pre/Post-Index Addressing

ARM32 supports three addressing modes that differ in when the offset modifies the base register:

@ Offset addressing (base + offset, no writeback)
    LDR  R0, [R1]           @ R0 = Mem[R1]
    LDR  R0, [R1, #16]      @ R0 = Mem[R1 + 16]
    LDR  R0, [R1, R2]       @ R0 = Mem[R1 + R2]
    LDR  R0, [R1, R2, LSL #2] @ R0 = Mem[R1 + R2*4] (array of ints)

@ Pre-indexed addressing (update base BEFORE load)
    LDR  R0, [R1, #4]!      @ R1 = R1 + 4; then R0 = Mem[R1]
    @ The ! means "write back to R1"

@ Post-indexed addressing (update base AFTER load)
    LDR  R0, [R1], #4       @ R0 = Mem[R1]; then R1 = R1 + 4
    @ Useful for iterating through arrays: load, then advance pointer
Memory Trick: Post-indexed addressing with LDR R0, [R1], #4 is perfect for array traversal — it loads the current element and advances the pointer in a single instruction. This is equivalent to the C expression *ptr++.

LDM/STM Block Transfers

Load/Store Multiple instructions transfer a set of registers to/from consecutive memory addresses in a single instruction. They have four addressing variants:

Variant Name Stack Equivalent Description
LDMIA / STMIAIncrement AfterPop / Push (FD)Start at base, increment after each transfer
LDMIB / STMIBIncrement BeforePop / Push (ED)Increment first, then transfer
LDMDA / STMDADecrement AfterPop / Push (FA)Start at base, decrement after
LDMDB / STMDBDecrement BeforePush / Pop (FD)Decrement first, then transfer
@ Save multiple registers to stack (descending, pre-decrement)
    STMDB SP!, {R0-R3, LR}  @ Push R0,R1,R2,R3,LR; SP decremented

@ Restore multiple registers from stack
    LDMIA SP!, {R0-R3, PC}  @ Pop into R0,R1,R2,R3,PC; return from function

@ Block memory copy using LDM/STM (16 bytes per iteration)
loop:
    LDMIA R0!, {R4-R7}     @ Load 4 words from source, advance R0
    STMIA R1!, {R4-R7}     @ Store 4 words to dest, advance R1
    SUBS  R2, R2, #16      @ Decrement byte count
    BGT   loop              @ Repeat if more data

PUSH/POP

PUSH and POP are syntactic aliases that make stack operations more readable:

@ PUSH and POP are aliases for STMDB/LDMIA with SP!
    PUSH {R4-R7, LR}        @ Equivalent to: STMDB SP!, {R4-R7, LR}
    @ ... function body ...
    POP  {R4-R7, PC}        @ Equivalent to: LDMIA SP!, {R4-R7, PC}
    @ Loading LR into PC returns from the function

@ Why POP {..., PC} works for returning:
@ The BL instruction saved the return address in LR.
@ PUSH saved LR to the stack.
@ POP loads that saved value directly into PC → jumps to return address.
@ Function prologue/epilogue using PUSH/POP
    PUSH {r4-r7, lr}        @ Save callee-saved regs + return address
    @ ... function body ...
    POP  {r4-r7, pc}        @ Restore regs; load LR into PC to return

Conditional Execution

Conditional execution is ARM32's most iconic feature and a key differentiator from x86. In ARM mode, every instruction carries a 4-bit condition code field (bits [31:28]). If the current CPSR flags don't match the condition, the instruction behaves as a NOP — it's skipped but still takes one cycle. This eliminates short branches, which cause pipeline stalls on all processors.

Condition Codes (EQ, NE, GT, etc.)

ARM32 defines 15 condition codes (plus AL = Always, the default):

Conditional Execution ARM32
Condition Code Quick Reference
Code Suffix Meaning Flags Tested
0000EQEqual / ZeroZ=1
0001NENot Equal / Non-zeroZ=0
0010CS/HSCarry Set / Unsigned ≥C=1
0011CC/LOCarry Clear / Unsigned <C=0
0100MIMinus / NegativeN=1
0101PLPlus / Positive or ZeroN=0
0110VSOverflow SetV=1
0111VCOverflow ClearV=0
1000HIUnsigned HigherC=1 & Z=0
1001LSUnsigned Lower or SameC=0 | Z=1
1010GESigned ≥N=V
1011LTSigned <N≠V
1100GTSigned >Z=0 & N=V
1101LESigned ≤Z=1 | N≠V
1110ALAlways (default)
@ Practical conditional execution patterns

@ 1. Branch-free max(R0, R1) → R0
    CMP   R0, R1
    MOVLT R0, R1           @ If R0 < R1, replace R0 with R1

@ 2. Branch-free clamp to range [0, 255]
    CMP   R0, #0
    MOVLT R0, #0           @ If R0 < 0: R0 = 0
    CMP   R0, #255
    MOVGT R0, #255         @ If R0 > 255: R0 = 255

@ 3. Branch-free GCD (Euclidean algorithm)
gcd:
    CMP   R0, R1
    SUBGT R0, R0, R1       @ If R0 > R1: R0 -= R1
    SUBLE R1, R1, R0       @ If R0 <= R1: R1 -= R0
    BNE   gcd              @ Repeat while R0 != R1

IT Block in Thumb-2

Since Thumb instructions don't have a condition code field, Thumb-2 introduced the IT (If-Then) instruction. IT creates a block of up to 4 conditionally executed instructions:

@ IT block syntax: IT{x{y{z}}} cond
@ T = Then (same condition), E = Else (opposite condition)

@ Example: max(R0, R1) in Thumb-2
    CMP   R0, R1
    ITE   GT               @ If-Then-Else: GT
    MOVGT R0, R1           @ Then: (skipped if GT — this is wrong pattern)
    MOVLE R0, R1           @ Else: actually we want:

@ Correct max pattern:
    CMP   R0, R1
    IT    LT               @ If Less Than:
    MOVLT R0, R1           @ Then: R0 = R1 (R1 was larger)

@ 4-instruction IT block example:
    CMP   R0, #0
    ITTEE GT               @ If GT, Then, Else, Else
    ADDGT R1, R1, #1       @ Then: increment
    ADDGT R2, R2, #1       @ Then: increment
    SUBLE R1, R1, #1       @ Else: decrement
    SUBLE R2, R2, #1       @ Else: decrement
Assembler Assist: Most modern assemblers (GCC, Clang) generate IT instructions automatically when you use conditional suffixes in Thumb-2 mode. You write MOVLT R0, R1 and the assembler inserts the required IT LT before it. Explicit IT blocks are mainly needed for hand-written assembly or when analyzing disassembly output.

The S Suffix & Flag Updates

By default, ARM data processing instructions do not update the CPSR condition flags. You must explicitly request flag updates by appending the S suffix:

@ Without S: flags unchanged (can't use condition codes after)
    ADD  R0, R1, R2        @ R0 = R1 + R2; NZCV flags untouched

@ With S: flags updated (enables conditional execution)
    ADDS R0, R1, R2        @ R0 = R1 + R2; N,Z,C,V flags set

@ Comparison always updates flags (S is implicit)
    CMP  R0, R1            @ Always sets flags (no S needed)
    TST  R0, #0xFF         @ Always sets flags

@ Pattern: loop counter with flag-setting subtraction
    SUBS R2, R2, #1        @ Decrement and set Z flag when R2 reaches 0
    BNE  loop              @ Branch back if not zero
Design Rationale: Why doesn't every instruction set flags by default? Because flag-setting creates dependencies. If ADDS updates C, and a later ADC reads C, the pipeline must stall until ADDS completes. By making flag-setting opt-in, ARM allows the compiler to avoid unnecessary dependencies, enabling better instruction-level parallelism on superscalar cores.

Exercises & Practice

Exercise 1

Barrel Shifter Multiplication

Write ARM32 instructions using only ADD, SUB, RSB, and barrel shifter (no MUL) to compute:

  1. R0 = R1 × 10
  2. R0 = R1 × 17
  3. R0 = R1 × 100 (hint: 100 = 4 × 25 = 4 × (32 - 8 + 1))
Exercise 2

Conditional Execution Challenge

Convert this C function to ARM32 assembly using no branches (only conditional execution):

int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

Hint: Use CMP, then MOVGT/MOVLT/MOVEQ.

Exercise 3

Immediate Encoding Puzzle

For each constant, determine if it's encodable as an ARM32 immediate (8-bit value rotated right by even amount). If not, show how to load it with MOVW+MOVT:

  1. 0x3FC
  2. 0x102
  3. 0xAB00
  4. 0xF000000F

Conclusion & Next Steps

In this deep dive into ARM32, you've mastered the fundamental building blocks of the 32-bit instruction set:

  • ARM vs Thumb modes: The trade-off between full-featured 32-bit ARM and compact 16-bit Thumb, unified by Thumb-2
  • Register file: 16 general-purpose registers with special roles for SP, LR, and PC, plus the critical CPSR flags register
  • Banked registers: How ARM32's 7 processor modes each maintain their own SP, LR, and SPSR for efficient context switching
  • Data processing: Arithmetic, logical, comparison, and move instructions with the powerful inline barrel shifter
  • Immediate encoding: The clever but tricky 8-bit rotated immediate scheme and its constraints
  • Load/store: Flexible addressing modes including pre/post-indexing and block transfers
  • Conditional execution: ARM32's signature feature — predicated instruction execution that eliminates short branches

These ARM32 concepts form the historical foundation upon which AArch64 was built. In the next part, we'll transition to the modern 64-bit world and explore how AArch64 reimagined the register file, addressing modes, and instruction encoding for the next era of ARM computing.

Next in the Series

In Part 3: AArch64 Registers, Addressing & Data Movement, we transition to the 64-bit world: the expanded X/W register file, system registers, flexible addressing modes, load/store pair instructions, and SIMD/FP register organisation.

Technology