ARM Assembly Part 2: ARM32 Instruction Set Fundamentals

Introduction

                        
                        Series Overview: This is Part 2 of our 28-part ARM Assembly Mastery Series. Part 1 covered ARM architecture history and core concepts. Now we dive into the 32-bit ARM instruction set — the foundation of billions of embedded and mobile devices still in use today.
                    

ARM32 Overview

ARM32 is the 32-bit ARM instruction set architecture, called AArch32 since ARMv8. It dominated ARM computing from 1985 until the AArch64 transition began in 2013. Even today, billions of embedded devices, IoT sensors, and legacy systems run ARM32 code. Every Cortex-M microcontroller (the most widely shipped ARM core family) executes only the Thumb instruction set: a small Thumb subset on Cortex-M0/M0+ and full Thumb-2 on Cortex-M3 and later.

Understanding ARM32 is essential for several reasons:

Embedded dominance: All Cortex-M firmware (STM32, nRF52, RP2040) is Thumb code, and Cortex-M3 and later cores use the full Thumb-2 instruction set
Legacy codebases: Millions of lines of 32-bit ARM code exist in production — Android NDK libraries, game engines, signal processing
AArch64 foundation: Many AArch64 concepts (condition flags, load/store architecture, barrel shifting) originated in ARM32
Reverse engineering: Security analysis of IoT firmware requires ARM32/Thumb disassembly skills

Instruction Width & Encoding

ARM32 has an unusual characteristic among ISAs: several instruction encodings for the same architecture. On cores that implement both ARM and Thumb state, the CPU can switch between them at runtime, and a single binary can mix them:

Encoding	Width	Code Density	Features	Typical Use
ARM	Fixed 32-bit	Lowest (code is ~1.4× the size of Thumb)	Full: conditional execution, barrel shifter on every data-processing op	Performance-critical inner loops
Thumb	Fixed 16-bit	Highest	Limited: mostly R0–R7, only branches are conditional, no inline barrel shift	Legacy compact code
Thumb-2	Mixed 16/32-bit	High (code is ~70–75% the size of ARM)	Near-full: conditionals via IT, barrel shifter, all registers	Modern embedded (Cortex-M default)

ARM vs Thumb Modes

The dual-mode nature of ARM32 is one of its most distinctive features. Think of it like a bilingual speaker who switches languages depending on the audience — ARM mode is the verbose, expressive language, while Thumb is the concise, compact one.

Figure: Comparison of ARM (32-bit) and Thumb (16-bit) instruction encoding widths and their trade-offs in code density versus expressiveness

ARM Mode (32-bit)

In ARM mode, every instruction is exactly 32 bits (4 bytes) wide. This fixed width provides several advantages:

Full conditional execution: Almost every instruction has a 4-bit condition field, allowing it to be predicated (executed only if the flags match)
Full register access: All 16 registers (R0–R15) available in every instruction
Inline barrel shifter: The second operand of data processing instructions can be shifted/rotated in the same instruction (at no extra cycle cost on classic cores when the shift amount is an immediate)
Uniform decoding: Fixed width makes hardware decode simpler and enables constant-time fetch

The trade-off is code size: ARM mode code is typically 30–40% larger than equivalent Thumb code, consuming more instruction cache and flash memory.

@ ARM mode example: Absolute value in 2 instructions (no branch!)
    CMP   R0, #0           @ Compare R0 with zero
    RSBLT R0, R0, #0       @ If R0 < 0: R0 = 0 - R0 (negate)
    @ Every ARM-mode instruction carries its own condition field like this;
    @ Thumb-2 needs an IT instruction to get the same effect

Thumb Mode (16-bit)

Thumb mode encodes instructions in 16 bits (2 bytes), half the width of an ARM instruction. Because some operations take more Thumb instructions, the net code-size saving is closer to 30% than 50%. This was critical when ARM7TDMI-era devices had only 32–256 KB of flash memory.

Thumb restrictions versus ARM mode:

Limited registers: Most instructions can only access R0–R7 (the "low registers"); R8–R15 are reachable only through a few encodings of MOV, ADD, CMP, and BX
No conditional execution: Only branch instructions can be conditional
No barrel shifter inline: Shift operations require separate instructions
2-operand format: Most ALU instructions use Rd = Rd op Rm format (destination must be a source)

Analogy

The SMS vs Email Analogy

Think of ARM mode like email — you can write detailed messages with formatting, attachments, and CC lists. Thumb mode is like SMS from the 2000s — you have 160 characters, so you abbreviate everything. The message gets through, but you sacrifice expressiveness for brevity. Thumb-2 is like modern messaging with both short texts and longer rich messages.

Thumb-2 Mode (Mixed)

Introduced with ARMv6T2 and perfected in ARMv7, Thumb-2 is the best of both worlds. It extends Thumb with 32-bit wide instructions while keeping the 16-bit encodings for common operations. The CPU identifies wide instructions by their first halfword's bit pattern.
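The width check described above can be sketched in a few lines of Python. This is a minimal model, not decoder code from any toolchain: the function name `thumb_instruction_width` is made up for illustration, and the rule it applies (the top five bits of the first halfword being 0b11101, 0b11110, or 0b11111 mark a 32-bit encoding) is the one given in the ARMv7 architecture manual as far as I know.

```python
def thumb_instruction_width(first_halfword):
    """Return the width in bytes (2 or 4) of a Thumb instruction.

    Only the first halfword is needed: its top five bits tell the
    decoder whether a second halfword follows.
    """
    top5 = (first_halfword >> 11) & 0x1F
    # These three prefixes are reserved for the first half of a
    # 32-bit Thumb-2 encoding; everything else is a 16-bit instruction.
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


print(thumb_instruction_width(0x4770))  # BX LR (16-bit encoding)
print(thumb_instruction_width(0xF241))  # first halfword of MOVW R0, #0x1234
```

Because the decision needs only five bits, a core can fetch halfword by halfword and still know immediately whether to wait for a second one.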

Thumb-2 advantages:

Near-ARM expressiveness: Full register access, barrel shifter, IT blocks for conditional execution
Thumb-class density: Code size is roughly 25–30% smaller than ARM mode
Cortex-M requirement: Cortex-M processors support only Thumb/Thumb-2 (no ARM mode at all)
Performance: Thumb-2 code typically runs close to ARM-mode speed, and can be faster when instruction fetch bandwidth or cache size is the bottleneck

@ Thumb-2 wide instruction example (32-bit encoding in Thumb state)
    MOVW  R0, #0x1234       @ 32-bit Thumb-2: load lower 16 bits
    MOVT  R0, #0x5678       @ 32-bit Thumb-2: load upper 16 bits
    @ R0 now contains 0x56781234

@ Thumb-2 narrow instructions (16-bit encoding, same behavior)
    MOVS  R0, #42           @ 16-bit Thumb: move immediate to low register
    ADDS  R1, R0, R2        @ 16-bit Thumb: 3-register add

Interworking & BX/BLX

ARM processors can switch between ARM and Thumb states at runtime using interworking instructions. The mechanism is elegant: the least significant bit (LSB) of the target address controls the mode:

LSB = 0: Target is ARM mode code (address must be 4-byte aligned)
LSB = 1: Target is Thumb mode code (the bit is stripped; code is 2-byte aligned)

@ Interworking examples
    .arm                     @ Currently in ARM mode
    LDR   R0, =thumb_func   @ Load address of Thumb function (+1 in LSB)
    BX    R0                 @ Branch and eXchange: switch to Thumb mode

    .thumb                   @ Now in Thumb mode
    .thumb_func              @ Mark the next label as Thumb so its address gets bit 0 set
thumb_func:
    MOVS  R0, #1
    BX    LR                 @ Return; LR's LSB determines caller's mode

@ Branch-Link-Exchange: call + mode switch in one instruction
    BLX   arm_function       @ Call ARM function from Thumb (sets LR, clears T)

                        
ABI Note: The ARM EABI (Embedded Application Binary Interface) requires that function pointers always have the correct LSB set. If you're calling a Thumb function at address 0x8000, the pointer must be 0x8001. The linker handles this automatically for symbols marked as Thumb functions, but when manually constructing function pointers (e.g., in jump tables), forgetting the LSB is a classic bug. The core then tries to execute Thumb code as ARM instructions, which usually ends in a fault (on Cortex-M, which has no ARM state, a UsageFault or HardFault).
                    

                        
Key Insight: The instruction set state is determined by the T bit in CPSR. BX Rn switches state based on Rn[0]: LSB=1 enters Thumb, LSB=0 enters ARM. This is how function pointers work transparently across both states.
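To make the LSB rule concrete, here is a small Python model of what BX does with a register value. It is only a sketch: `bx_target` is a hypothetical helper, and rejecting a misaligned ARM target is my simplification (the architecture calls that case unpredictable rather than defining an error).

```python
def bx_target(pointer):
    """Model BX Rn: return (state, branch address) for a 32-bit pointer."""
    pointer &= 0xFFFFFFFF
    if pointer & 1:
        # Bit 0 set: enter Thumb state; the bit itself is not part of the address.
        return ("Thumb", pointer & 0xFFFFFFFE)
    if pointer & 2:
        # Simplification: ARM instructions must be word aligned.
        raise ValueError("ARM-state target must be 4-byte aligned")
    return ("ARM", pointer)


state, address = bx_target(0x8001)   # function pointer to Thumb code at 0x8000
print(state, hex(address))
```

The same function pointer value therefore carries both pieces of information the caller needs: where to branch and which decoder to use once it gets there.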
                    

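@ Register-based interworking: the target address is in a register
    BX   r0          @ Jump to address in r0; T bit is set from r0[0]
    BLX  r1          @ Branch-with-link-and-exchange: call + state switch

@ In an exception handler: was the interrupted code in Thumb state?
@ Test SPSR, not CPSR: on ARMv7 the T bit reads as zero through MRS CPSR.
    MRS  r0, SPSR
    TST  r0, #0x20   @ T bit is bit 5
    BNE  was_thumb   @ was_thumb is a label defined elsewhere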

Register File & CPSR

The ARM32 register file is the programmer's primary workspace. Understanding every register's role and constraints is essential for writing correct assembly code.

Figure: ARM32 register file layout showing R0–R15, their AAPCS roles, and the distinction between caller-saved and callee-saved registers

General-Purpose Registers (R0–R15)

ARM32 provides 16 general-purpose 32-bit registers, labeled R0 through R15. While all can hold data, several have architectural or conventional roles:

Register	AAPCS Name	Role	Preserved?
R0–R3	a1–a4	Arguments / return values / scratch	No (caller-saved)
R4–R11	v1–v8	Local variables	Yes (callee-saved)
R12	IP	Intra-procedure scratch (linker veneer)	No
R13	SP	Stack Pointer	Yes (must be restored)
R14	LR	Link Register (return address)	No (overwritten by BL)
R15	PC	Program Counter	N/A (architectural)

                        
                        Key Point: In Thumb mode, most 16-bit instructions can only access "low registers" (R0–R7). High registers (R8–R15) require special MOV, ADD, or CMP encodings. This is why Thumb code tends to be register-constrained compared to ARM mode.
                    

Special Registers (SP, LR, PC)

Three registers have special architectural behavior that every ARM programmer must understand:

R13 / SP (Stack Pointer): Points to the current top of the stack. The AAPCS calling convention uses a full-descending stack, which grows downward (toward lower addresses); the hardware itself supports other stack types through the LDM/STM variants. Each exception mode has its own banked SP, allowing interrupt handlers to use their own stack without corrupting the user stack.

R14 / LR (Link Register): When a BL (Branch with Link) instruction executes, the return address is saved into LR. Functions return either by branching to LR (BX LR) or by copying LR into PC (MOV PC, LR). BX LR is preferred because it also restores the caller's ARM/Thumb state. If a function calls another function, it must save LR first (typically via PUSH {LR}).

R15 / PC (Program Counter): Reading PC returns the address of the current instruction plus 8 bytes in ARM mode (plus 4 in Thumb) due to the pipeline. This offset is a frequent source of confusion:

@ PC-relative addressing — the +8 offset matters!
    .arm
    MOV  R0, PC            @ R0 = address of this instruction + 8
    @ If this instruction is at 0x1000, R0 = 0x1008

@ Common PC-relative pattern: load from literal pool
    LDR  R0, [PC, #offset] @ Loads from (address of this instruction + 8 + offset)
    @ The assembler calculates 'offset' accounting for +8
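The offset arithmetic is easy to get wrong by hand, so here is a tiny Python model of it. The helper names are hypothetical. One detail goes beyond the text above and is stated from my reading of the architecture manual: PC-relative literal loads use the PC value rounded down to a multiple of 4, which matters in Thumb code where instructions can sit at addresses that are only 2-byte aligned.

```python
def pc_read_value(instr_addr, thumb):
    """Value an instruction at instr_addr sees when it reads PC."""
    return (instr_addr + (4 if thumb else 8)) & 0xFFFFFFFF


def literal_address(instr_addr, offset, thumb):
    """Address accessed by LDR Rd, [PC, #offset] at instr_addr."""
    # The base is the PC value aligned down to a word boundary.
    base = pc_read_value(instr_addr, thumb) & 0xFFFFFFFC
    return (base + offset) & 0xFFFFFFFF


print(hex(pc_read_value(0x1000, thumb=False)))      # ARM: instruction address + 8
print(hex(literal_address(0x1002, 8, thumb=True)))  # Thumb: aligned (0x1002 + 4) + 8
```

In practice you rarely compute these by hand: writing `LDR R0, =constant` or `LDR R0, label` lets the assembler do this subtraction for you.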

CPSR & SPSR Flags

The Current Program Status Register (CPSR) is a 32-bit register that controls processor state and records the result of operations. It has four groups of fields:

CPSR Bit Layout Reference

Bit:  31  30  29  28  27  26:25    24  23:20  19:16    15:10    9  8  7  6  5  4:0
      N   Z   C   V   Q   IT[1:0]  J   [RAZ]  GE[3:0]  IT[7:2]  E  A  I  F  T  Mode

Condition Flags (31:28):
  N = Negative (result bit 31)     Z = Zero (result == 0)
  C = Carry (unsigned overflow on add; NOT borrow on subtract)    V = oVerflow (signed overflow)

Control Bits:
  T = Thumb state (1=Thumb, 0=ARM)
  I = IRQ disable    F = FIQ disable    A = Async abort disable
  E = Endianness (0=LE, 1=BE)

Mode Bits (4:0):
  10000 = User     10001 = FIQ      10010 = IRQ
  10011 = SVC      10111 = Abort    11011 = Undef    11111 = System

The SPSR (Saved Program Status Register) automatically saves the CPSR when an exception occurs, allowing the exception handler to restore the original state when returning.
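When you meet a raw CPSR or SPSR value in a debugger, decoding it by hand against the layout above gets tedious. The sketch below does the same lookup in Python. It covers only the fields and the seven modes listed in this section; `decode_cpsr` and `MODE_NAMES` are illustrative names, not part of any tool.

```python
MODE_NAMES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "SVC",
    0b10111: "Abort", 0b11011: "Undefined", 0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR/SPSR value into flags, mask bits, and mode name."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,   # 1 = IRQ disabled
        "F": (cpsr >> 6) & 1,   # 1 = FIQ disabled
        "T": (cpsr >> 5) & 1,   # 1 = Thumb state
        "mode": MODE_NAMES.get(cpsr & 0x1F, "unknown"),
    }


# 0x600000D3: Z and C set, IRQ and FIQ masked, ARM state, Supervisor mode
print(decode_cpsr(0x600000D3))
```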

@ Reading and modifying CPSR
    MRS  R0, CPSR          @ Read CPSR into R0
    BIC  R0, R0, #0x80     @ Clear IRQ disable bit (enable IRQ)
    MSR  CPSR_c, R0        @ Write back only control field bits

@ Check condition flags after CMP
    CMP  R0, R1            @ Sets N, Z, C, V based on R0 - R1
    MRS  R2, CPSR          @ R2 now contains the flag state
    AND  R2, R2, #0xF0000000  @ Isolate NZCV bits

Banked Registers & Modes

ARM32 has seven classic processor modes (the Security and Virtualization Extensions later added Monitor and Hyp), and several registers are banked, meaning each mode has its own physical copy. When the processor switches modes (e.g., on an interrupt), the banked registers instantly contain mode-specific values without needing to save/restore:

Mode	CPSR Mode Bits	Banked Registers	When Entered
User	10000	None (base set)	Normal application code
FIQ	10001	R8–R12, SP, LR, SPSR	Fast interrupt
IRQ	10010	SP, LR, SPSR	Normal interrupt
SVC	10011	SP, LR, SPSR	SWI / SVC instruction
Abort	10111	SP, LR, SPSR	Memory fault
Undefined	11011	SP, LR, SPSR	Undefined instruction
System	11111	None (shares the User registers, no SPSR)	Privileged code that needs the User-mode register set

                        
                        FIQ Fast Path: FIQ mode banks seven registers (R8–R14), meaning a fast interrupt handler can use R8–R12 as scratch without saving anything. This design made ARM32 exceptionally efficient for latency-critical interrupt handling in early embedded systems. AArch64 replaced this mechanism with the more uniform EL0–EL3 exception level model.
                    

Data Processing Instructions

ARM32's data processing instructions all share a common 32-bit encoding format. They operate exclusively on registers (load/store architecture), and the second operand can be a register, shifted register, or rotated immediate.

Figure: ARM32 data processing instruction encoding format showing the condition field, opcode, operand registers, and shifter operand fields

Arithmetic: ADD, SUB, RSB, ADC

ARM provides a complete set of arithmetic operations. Note the distinction between forward and reverse operations, and how the carry flag enables multi-word arithmetic:

@ Basic arithmetic
    ADD  R0, R1, R2        @ R0 = R1 + R2
    ADD  R0, R1, #100      @ R0 = R1 + 100
    SUB  R0, R1, R2        @ R0 = R1 - R2
    RSB  R0, R1, #0        @ R0 = 0 - R1   (negate: Reverse Subtract)
    RSB  R0, R1, R2        @ R0 = R2 - R1   (useful when the minuend is an immediate or needs a shift)

@ Multi-word (64-bit) addition: [R1:R0] = [R3:R2] + [R5:R4]
    ADDS R0, R2, R4        @ Low 32 bits; S sets Carry flag
    ADC  R1, R3, R5        @ High 32 bits + Carry from low add

@ Multiply instructions
    MUL  R0, R1, R2        @ R0 = R1 * R2 (bottom 32 bits)
    MLA  R0, R1, R2, R3    @ R0 = (R1 * R2) + R3 (multiply-accumulate)
    UMULL R0, R1, R2, R3   @ [R1:R0] = R2 * R3 (unsigned 64-bit result)
    SMULL R0, R1, R2, R3   @ [R1:R0] = R2 * R3 (signed 64-bit result)

                        
                        Why RSB? In a 3-operand instruction SUB Rd, Rn, Op2, the immediate/shifted value is always Op2 (the second operand). If you want Rd = immediate - Rn, you can't just swap operands because Op2 is fixed on the right. RSB (Reverse Subtract) solves this: RSB R0, R1, #100 computes R0 = 100 - R1.
                    

@ Arithmetic examples
    ADD  r0, r1, r2        @ r0 = r1 + r2
    ADDS r0, r1, r2        @ r0 = r1 + r2 ; update CPSR flags
    SUB  r0, r1, #10       @ r0 = r1 - 10
    RSB  r0, r1, #0        @ r0 = 0 - r1  (negate)
    ADC  r2, r2, r3        @ r2 = r2 + r3 + Carry (64-bit add high word)
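The ADDS/ADC pairing is easier to trust once you see the carry move between the two halves. The following Python sketch models it with plain integers. The function names mirror the mnemonics but are otherwise my own, and the model tracks only the carry flag, not N, Z, or V.

```python
MASK32 = 0xFFFFFFFF


def adds(a, b):
    """32-bit add that also returns the carry flag, like ADDS."""
    total = (a & MASK32) + (b & MASK32)
    return total & MASK32, total >> 32


def adc(a, b, carry):
    """32-bit add with carry-in, like ADC (flags not modelled)."""
    return ((a & MASK32) + (b & MASK32) + carry) & MASK32


def add64(lo_a, hi_a, lo_b, hi_b):
    """64-bit addition built from two 32-bit steps."""
    lo, carry = adds(lo_a, lo_b)    # ADDS R0, R2, R4
    hi = adc(hi_a, hi_b, carry)     # ADC  R1, R3, R5
    return lo, hi


# 0x00000001_FFFFFFFF + 0x00000002_00000001 = 0x00000004_00000000
lo, hi = add64(0xFFFFFFFF, 0x1, 0x1, 0x2)
print(hex(hi), hex(lo))
```

Leaving the S off the first add is the classic mistake: the second step then consumes whatever carry value an earlier instruction happened to leave behind.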

Logical: AND, ORR, EOR, BIC

Bitwise logical operations are the foundation of low-level programming — essential for hardware register manipulation, bit masking, and flag management:

@ Bitwise logical operations
    AND  R0, R1, #0xFF     @ R0 = R1 & 0xFF  (mask lower byte)
    ORR  R0, R1, #0x80     @ R0 = R1 | 0x80  (set bit 7)
    EOR  R0, R1, R2        @ R0 = R1 ^ R2    (XOR / toggle bits)
    BIC  R0, R1, #0x03     @ R0 = R1 & ~0x03 (clear bits 0-1)

@ Common patterns
    EOR  R0, R0, R0        @ R0 = 0           (clear register)
    ORR  R0, R0, #(1<<5)   @ Set bit 5        (e.g. the T bit position in a saved PSR value)
    BIC  R0, R0, #(1<<7)   @ Clear bit 7      (the I mask bit in a CPSR value read with MRS)
    EOR  R0, R1, R2        @ Toggle: flip bits in R1 where R2 has 1s

Real World

Bit Manipulation in Hardware Drivers

When configuring a UART peripheral on an STM32 microcontroller, you frequently use these patterns:

@ Enable USART1 clock in RCC (register addresses are for the STM32F1 family)
    LDR  R0, =0x40021018   @ RCC_APB2ENR address
    LDR  R1, [R0]          @ Read current value
    ORR  R1, R1, #(1<<14)  @ Set USART1EN bit (bit 14)
    STR  R1, [R0]          @ Write back; USART1 clock now enabled

@ Configure baud rate: clear bits [3:0], then set new value
    LDR  R0, =0x40013808   @ USART1_BRR address (USART1 base 0x40013800 + 0x08)
    LDR  R1, [R0]
    BIC  R1, R1, #0x0F     @ Clear DIV_Fraction bits [3:0]
    ORR  R1, R1, #0x09     @ Set new fraction value
    STR  R1, [R0]

Comparison: CMP, CMN, TST, TEQ

Comparison instructions set the CPSR flags without storing a result. They always update flags (no S suffix needed):

Instruction	Operation	Equivalent To	Use Case
`CMP Rn, Op2`	Rn − Op2 (flags only)	SUBS but discard result	Compare two values
`CMN Rn, Op2`	Rn + Op2 (flags only)	ADDS but discard result	Compare with negated value
`TST Rn, Op2`	Rn AND Op2 (flags only)	ANDS but discard result	Test if specific bits are set
`TEQ Rn, Op2`	Rn EOR Op2 (flags only)	EORS but discard result	Test equality without affecting C/V

@ Comparison patterns
    CMP  R0, #10           @ Set flags for R0 - 10
    BEQ  equal_case        @ Branch if R0 == 10 (Z flag set)
    BGT  greater_case      @ Branch if R0 > 10 (signed)

    TST  R0, #0x01         @ Test if bit 0 is set (odd number check)
    BNE  is_odd            @ Branch if bit 0 = 1 (Z flag clear)

    TEQ  R0, R1            @ Test if R0 == R1 without affecting C flag
    BEQ  are_equal
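To see exactly which flags CMP produces, here is a Python model of the flag computation for a 32-bit subtraction. It is a sketch with a made-up name, `cmp_flags`, but the rules it applies are the ones stated above: N is bit 31 of the result, Z means the result is zero, C is the inverse of borrow, and V signals signed overflow.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(rn, op2):
    """Return (N, Z, C, V) as set by CMP Rn, Op2, i.e. by Rn - Op2."""
    rn &= MASK32
    op2 &= MASK32
    result = (rn - op2) & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(rn >= op2)                       # C = 1 means "no borrow"
    # Signed overflow: the operands differ in sign and the result's sign
    # differs from Rn's sign.
    v = ((rn ^ op2) & (rn ^ result)) >> 31
    return n, z, c, v


print(cmp_flags(10, 10))          # equal values
print(cmp_flags(5, 10))           # smaller minus larger
print(cmp_flags(0x80000000, 1))   # most negative int compared with 1
```

The last case is why signed conditions test N against V rather than N alone: the subtraction overflows, so N on its own would claim the most negative integer is greater than 1.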

Move: MOV, MVN, MOVW, MOVT

Move instructions copy values into registers. The evolution from MOV to MOVW/MOVT reflects a key ARM32 design challenge: loading arbitrary 32-bit constants.

@ Basic moves
    MOV  R0, R1            @ R0 = R1
    MOV  R0, #255          @ R0 = 255 (8-bit immediate, rotated)
    MVN  R0, #0            @ R0 = ~0 = 0xFFFFFFFF (all bits set)
    MVN  R0, R1            @ R0 = ~R1 (bitwise NOT)

@ Loading any 32-bit constant (ARMv6T2+ / Thumb-2)
    MOVW R0, #0xDEAD       @ R0 = 0x0000DEAD (lower 16 bits)
    MOVT R0, #0xBEEF       @ R0 = 0xBEEFDEAD (upper 16 bits set)

@ Older method: LDR pseudo-instruction (literal pool)
    LDR  R0, =0xBEEFDEAD  @ Assembler places constant in memory
                            @ and generates: LDR R0, [PC, #offset]

The Barrel Shifter

The barrel shifter is ARM32's secret weapon: a hardware unit that can shift or rotate the second operand of any data processing instruction, at no extra cycle cost on classic cores when the shift amount is an immediate. This means an instruction like ADD R0, R1, R2, LSL #3 computes R0 = R1 + (R2 × 8) in a single instruction. Few other mainstream architectures offer this on every ALU instruction.

Figure: ARM32 barrel shifter operation showing how LSL, LSR, ASR, ROR, and RRX modify the second operand at zero additional cycle cost

Analogy

The Combo Tool Analogy

Imagine a power tool that combines a drill and a screwdriver. You can drill a hole and drive a screw in the same motion. ARM's barrel shifter is similar: it lets you shift/rotate a value and perform an ALU operation in the same instruction. On x86, LEA can fold a scale of 2, 4, or 8 into an add, but an arbitrary shift or rotate followed by an ALU operation needs two separate instructions.

LSL, LSR, ASR, ROR, RRX

ARM32 supports five shift/rotate types. Four of them (LSL, LSR, ASR, ROR) are selected by a 2-bit type code with either a 5-bit immediate or a register for the shift amount; RRX reuses the ROR code with an immediate amount of 0:

Shift	Name	Operation	C Flag	Common Use
`LSL #n`	Logical Shift Left	Shift left, fill with zeros	Last bit shifted out	Multiply by 2ⁿ
`LSR #n`	Logical Shift Right	Shift right, fill with zeros	Last bit shifted out	Unsigned divide by 2ⁿ
`ASR #n`	Arithmetic Shift Right	Shift right, replicate sign bit	Last bit shifted out	Signed divide by 2ⁿ (rounds toward −∞)
`ROR #n`	Rotate Right	Bits wrap around from LSB to MSB	Last bit rotated	Bit permutation, crypto
`RRX`	Rotate Right eXtended	33-bit rotate through C flag	Old bit 0	Multi-word shifts
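The table is compact, so here is each operation written out in Python, returning both the shifted value and the carry-out from the "C Flag" column. This is a minimal model restricted to shift amounts 1 to 31. The edge cases (amounts of 0 and 32) have their own rules in the architecture manual and are deliberately left out. Function names simply follow the mnemonics.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, n):
    """Logical shift left by n (1..31): returns (result, carry_out)."""
    return (value << n) & MASK32, (value >> (32 - n)) & 1


def lsr(value, n):
    """Logical shift right by n (1..31)."""
    return value >> n, (value >> (n - 1)) & 1


def asr(value, n):
    """Arithmetic shift right by n (1..31): the sign bit is replicated."""
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> n) & MASK32, (value >> (n - 1)) & 1


def ror(value, n):
    """Rotate right by n (1..31): carry_out is the new bit 31."""
    result = ((value >> n) | (value << (32 - n))) & MASK32
    return result, result >> 31


def rrx(value, carry_in):
    """Rotate right by one through the carry flag (33-bit rotate)."""
    return (carry_in << 31) | (value >> 1), value & 1


print(hex(ror(0x12345678, 16)[0]))   # halfword swap
print(hex(asr(0xFFFFFFF9, 1)[0]))    # -7 >> 1 gives -4, not -3
```

The second example shows the rounding difference from C's signed division, which truncates toward zero and would give -3.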

Shifter Operands in Instructions

The barrel shifter applies to the second operand (Op2) of every data processing instruction. This creates powerful single-instruction idioms:

@ Barrel shifter examples: each is one instruction (one cycle on classic cores)
    ADD  R0, R1, R2, LSL #3    @ R0 = R1 + (R2 * 8)   (array indexing)
    SUB  R0, R1, R2, ASR #1    @ R0 = R1 - (R2 >> 1)  (signed halving, rounds toward -inf)
    MOV  R0, R1, ROR #16       @ R0 = swap halfwords of R1
    RSB  R0, R1, R1, LSL #4    @ R0 = R1*16 - R1 = R1*15
    ADD  R0, R1, R1, LSL #2    @ R0 = R1 + R1*4 = R1*5

@ Multiply by 7 using barrel shifter (faster than MUL on early ARM)
    RSB  R0, R1, R1, LSL #3    @ R0 = R1*8 - R1 = R1*7

@ Shift by register value (one extra cycle on some cores)
    AND  R0, R0, R1, LSR R2    @ R0 = R0 & (R1 >> R2)

Immediate Rotate Encoding

ARM32's immediate encoding is one of its most clever — and confusing — design choices. The 12-bit immediate field encodes a 32-bit constant as:

An 8-bit value (range 0–255)
Rotated right by 2 × (4-bit rotation field) (0, 2, 4, ... 30 positions)

12-bit immediate field:
┌────────────┬────────────────────────┐
│ rot (4 bit)│    imm8 (8 bit)       │
└────────────┴────────────────────────┘

Value = imm8 ROR (2 × rot)

Examples of encodable constants:
  0xFF       → imm8=0xFF, rot=0   (no rotation)
  0xFF0      → imm8=0xFF, rot=14  (0xFF rotated right by 28)
  0x3FC      → imm8=0xFF, rot=15  (0xFF rotated right by 30)
  0xC000003F → imm8=0xFF, rot=1   (0xFF rotated right by 2)

NOT encodable (requires MOVW+MOVT or literal pool):
  0x101      → can't represent as rotated 8-bit value
  0xFFFF     → needs 16 significant bits, more than imm8 can hold
  0x1234     → no rotation of 8-bit value yields this
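Checking encodability by eye is error-prone, so it can help to brute-force the 16 possible rotations. The sketch below does that in Python. `encode_arm_immediate` is a hypothetical helper, not an assembler API: it returns the first (rot, imm8) pair that works, or None when the constant cannot be expressed.

```python
MASK32 = 0xFFFFFFFF


def encode_arm_immediate(value):
    """Return (rot, imm8) such that value == imm8 ROR (2 * rot), or None."""
    value &= MASK32
    for rot in range(16):
        amount = 2 * rot
        # Rotating LEFT by the same amount undoes the ROR applied at decode
        # time; if what remains fits in 8 bits, the constant is encodable.
        imm8 = ((value << amount) | (value >> (32 - amount))) & MASK32
        if imm8 <= 0xFF:
            return rot, imm8
    return None


for constant in (0xFF, 0xFF0, 0x3FC, 0xC000003F, 0x101, 0xFFFF, 0x1234):
    print(hex(constant), encode_arm_immediate(constant))
```

Running it reproduces the two lists above: the first four constants each yield imm8 = 0xFF with rot = 0, 14, 15, and 1, and the last three yield None.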

                        
Common Pitfall: Not all 32-bit constants are directly encodable as ARM32 immediates. If the assembler reports "invalid constant after fixup", use one of these alternatives: MOVW+MOVT (ARMv6T2+), LDR Rd, =constant (literal pool), or build the constant from encodable pieces (e.g., MOV R0, #0x1200 followed by ORR R0, R0, #0x34 for 0x1234).
                    

Load/Store Instructions

As a pure load/store architecture, ARM32 uses dedicated instructions for all memory access. The ALU never touches memory directly — you must explicitly load data into registers, process it, then store results back.

LDR, STR, LDRB, STRB

The LDR/STR family handles data transfers of different sizes:

Instruction	Size	Sign Extension	Description
`LDR / STR`	32-bit (word)	N/A	Full word load/store
`LDRH / STRH`	16-bit (halfword)	Zero-extended	Unsigned halfword
`LDRSH`	16-bit (halfword)	Sign-extended	Signed halfword to 32-bit register
`LDRB / STRB`	8-bit (byte)	Zero-extended	Unsigned byte
`LDRSB`	8-bit (byte)	Sign-extended	Signed byte to 32-bit register
`LDRD / STRD`	64-bit (double)	N/A	Two registers, 8-byte aligned

Pre/Post-Index Addressing

ARM32 supports three addressing modes that differ in when the offset modifies the base register:

@ Offset addressing (base + offset, no writeback)
    LDR  R0, [R1]           @ R0 = Mem[R1]
    LDR  R0, [R1, #16]      @ R0 = Mem[R1 + 16]
    LDR  R0, [R1, R2]       @ R0 = Mem[R1 + R2]
    LDR  R0, [R1, R2, LSL #2] @ R0 = Mem[R1 + R2*4] (array of ints)

@ Pre-indexed addressing (update base BEFORE load)
    LDR  R0, [R1, #4]!      @ R1 = R1 + 4; then R0 = Mem[R1]
    @ The ! means "write back to R1"

@ Post-indexed addressing (update base AFTER load)
    LDR  R0, [R1], #4       @ R0 = Mem[R1]; then R1 = R1 + 4
    @ Useful for iterating through arrays: load, then advance pointer

                        
                        Memory Trick: Post-indexed addressing with LDR R0, [R1], #4 is perfect for array traversal — it loads the current element and advances the pointer in a single instruction. This is equivalent to the C expression *ptr++.
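The three modes differ only in which address is used and whether the base register changes, which a short Python model makes explicit. This is a sketch under simple assumptions: memory is a dictionary from word addresses to values, registers are a dictionary keyed by name, and the helper names are mine.

```python
def ldr_offset(mem, regs, rt, rn, imm):
    """LDR Rt, [Rn, #imm]: base register left unchanged."""
    regs[rt] = mem[regs[rn] + imm]


def ldr_pre_indexed(mem, regs, rt, rn, imm):
    """LDR Rt, [Rn, #imm]!: update the base first, then load from it."""
    regs[rn] = regs[rn] + imm
    regs[rt] = mem[regs[rn]]


def ldr_post_indexed(mem, regs, rt, rn, imm):
    """LDR Rt, [Rn], #imm: load from the old base, then update it."""
    address = regs[rn]
    regs[rn] = address + imm
    regs[rt] = mem[address]


memory = {0x100: 11, 0x104: 22, 0x108: 33}
regs = {"R0": 0, "R1": 0x100}

# Walk the array the way *ptr++ would: three post-indexed loads.
for _ in range(3):
    ldr_post_indexed(memory, regs, "R0", "R1", 4)
    print(regs["R0"], hex(regs["R1"]))
```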
                    

LDM/STM Block Transfers

Load/Store Multiple instructions transfer a set of registers to/from consecutive memory addresses in a single instruction. They have four addressing variants:

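Variant	Name	Stack Equivalent	Description
`LDMIA / STMIA`	Increment After	LDMFD (pop) / STMEA (push)	Start at base, increment after each transfer
`LDMIB / STMIB`	Increment Before	LDMED (pop) / STMFA (push)	Increment first, then transfer
`LDMDA / STMDA`	Decrement After	LDMFA (pop) / STMED (push)	Start at base, decrement after each transfer
`LDMDB / STMDB`	Decrement Before	LDMEA (pop) / STMFD (push)	Decrement first, then transfer

In the stack names, F/E says whether SP points at a Full or an Empty slot, and D/A says whether the stack grows Descending or Ascending. The AAPCS stack is full descending, so it pairs STMDB (STMFD) for push with LDMIA (LDMFD) for pop.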

@ Save multiple registers to stack (descending, pre-decrement)
    STMDB SP!, {R0-R3, LR}  @ Push R0,R1,R2,R3,LR; SP decremented

@ Restore multiple registers from stack
    LDMIA SP!, {R0-R3, PC}  @ Pop into R0,R1,R2,R3,PC; return from function

@ Block memory copy using LDM/STM (16 bytes per iteration)
loop:
    LDMIA R0!, {R4-R7}     @ Load 4 words from source, advance R0
    STMIA R1!, {R4-R7}     @ Store 4 words to dest, advance R1
    SUBS  R2, R2, #16      @ Decrement byte count
    BGT   loop              @ Repeat if more data
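The four variants are easiest to compare by listing the addresses they touch. The Python sketch below computes them for a given base and register count. The name `block_transfer_addresses` is hypothetical. It relies on one rule worth stating explicitly: whatever the variant, the lowest-numbered register always goes to the lowest address.

```python
def block_transfer_addresses(mode, base, count):
    """Return (addresses, written_back_base) for LDM/STM with `count` registers.

    addresses[0] belongs to the lowest-numbered register in the list.
    """
    size = 4 * count
    if mode == "IA":      # Increment After
        start, new_base = base, base + size
    elif mode == "IB":    # Increment Before
        start, new_base = base + 4, base + size
    elif mode == "DA":    # Decrement After
        start, new_base = base - size + 4, base - size
    elif mode == "DB":    # Decrement Before
        start, new_base = base - size, base - size
    else:
        raise ValueError("mode must be IA, IB, DA or DB")
    return [start + 4 * i for i in range(count)], new_base


# STMDB SP!, {R0-R3, LR} with SP = 0x2000, then the matching LDMIA SP!
pushed, sp = block_transfer_addresses("DB", 0x2000, 5)
popped, sp_after = block_transfer_addresses("IA", sp, 5)
print([hex(a) for a in pushed], hex(sp))
print(pushed == popped, hex(sp_after))
```

The push and the pop visit the same five addresses and leave SP where it started, which is why STMDB and LDMIA form a matched pair.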

PUSH/POP

PUSH and POP are syntactic aliases that make stack operations more readable:

@ PUSH and POP are aliases for STMDB/LDMIA with SP!
    PUSH {R4-R7, LR}        @ Equivalent to: STMDB SP!, {R4-R7, LR}
    @ ... function body ...
    POP  {R4-R7, PC}        @ Equivalent to: LDMIA SP!, {R4-R7, PC}
    @ Loading the saved LR value into PC returns from the function

@ Why POP {..., PC} works for returning:
@ The BL instruction saved the return address in LR.
@ PUSH saved LR to the stack.
@ POP loads that saved value directly into PC → jumps to return address.

Conditional Execution

Conditional execution is ARM32's most iconic feature and a key differentiator from x86, which offers only a few conditional instructions such as CMOVcc and SETcc. In ARM mode, almost every instruction carries a 4-bit condition code field (bits [31:28]). If the current CPSR flags don't match the condition, the instruction behaves as a NOP: it has no effect but still occupies an issue slot (one cycle on classic cores). This removes short branches, which cost a pipeline refill on simple cores and can be mispredicted on cores with branch prediction.

Condition Codes (EQ, NE, GT, etc.)

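ARM32 defines 15 condition codes: 14 flag tests plus AL (Always, the default when no suffix is written). The sixteenth encoding, 1111, is reserved for unconditional instructions: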

Condition Code Quick Reference

Code	Suffix	Meaning	Flags Tested
0000	EQ	Equal / Zero	Z=1
0001	NE	Not Equal / Non-zero	Z=0
0010	CS/HS	Carry Set / Unsigned ≥	C=1
0011	CC/LO	Carry Clear / Unsigned <	C=0
0100	MI	Minus / Negative	N=1
0101	PL	Plus / Positive or Zero	N=0
0110	VS	Overflow Set	V=1
0111	VC	Overflow Clear	V=0
1000	HI	Unsigned Higher	C=1 & Z=0
1001	LS	Unsigned Lower or Same	C=0 | Z=1
1010	GE	Signed ≥	N=V
1011	LT	Signed <	N≠V
1100	GT	Signed >	Z=0 & N=V
1101	LE	Signed ≤	Z=1 | N≠V
1110	AL	Always (default)	—
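The "Flags Tested" column translates directly into code. The Python sketch below encodes the table as a dictionary of predicates over (N, Z, C, V), so you can check what a given condition does for a given flag state. The names `CONDITIONS` and `condition_passed` are illustrative only.

```python
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}
# HS and LO are alternative spellings of CS and CC.
CONDITIONS["HS"] = CONDITIONS["CS"]
CONDITIONS["LO"] = CONDITIONS["CC"]


def condition_passed(cond, n, z, c, v):
    """True if an instruction with suffix `cond` executes under these flags."""
    return CONDITIONS[cond](n, z, c, v)


# Flags after CMP R0, R1 with R0 = 0x80000000 and R1 = 1: N=0, Z=0, C=1, V=1
flags = (0, 0, 1, 1)
print(condition_passed("LT", *flags))   # signed view: most negative int < 1
print(condition_passed("HI", *flags))   # unsigned view: 0x80000000 > 1
```

The same flag state answers both the signed and the unsigned question, which is why a single CMP can feed either family of condition codes.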

@ Practical conditional execution patterns

@ 1. Branch-free max(R0, R1) → R0
    CMP   R0, R1
    MOVLT R0, R1           @ If R0 < R1, replace R0 with R1

@ 2. Branch-free clamp to range [0, 255]
    CMP   R0, #0
    MOVLT R0, #0           @ If R0 < 0: R0 = 0
    CMP   R0, #255
    MOVGT R0, #255         @ If R0 > 255: R0 = 255

@ 3. GCD by repeated subtraction (Euclid), loop body without branches
@    Assumes R0 and R1 are both positive on entry
gcd:
    CMP   R0, R1
    SUBGT R0, R0, R1       @ If R0 > R1: R0 -= R1
    SUBLT R1, R1, R0       @ If R0 < R1: R1 -= R0
    BNE   gcd              @ Repeat while R0 != R1; on exit R0 = R1 = gcd

IT Block in Thumb-2

Since Thumb instructions don't have a condition code field, Thumb-2 introduced the IT (If-Then) instruction. IT creates a block of up to 4 conditionally executed instructions:

@ IT block syntax: IT{x{y{z}}} cond
@ T = Then (same condition), E = Else (opposite condition)

@ Example: R2 = max(R0, R1) in Thumb-2 using If-Then-Else
    CMP   R0, R1
    ITE   GT               @ If-Then-Else on GT
    MOVGT R2, R0           @ Then (R0 > R1):  R2 = R0
    MOVLE R2, R1           @ Else (R0 <= R1): R2 = R1

@ In-place max needs only one conditional move:
    CMP   R0, R1
    IT    LT               @ If Less Than:
    MOVLT R0, R1           @ Then: R0 = R1 (R1 was larger)

@ 4-instruction IT block example:
    CMP   R0, #0
    ITTEE GT               @ If GT, Then, Else, Else
    ADDGT R1, R1, #1       @ Then: increment
    ADDGT R2, R2, #1       @ Then: increment
    SUBLE R1, R1, #1       @ Else: decrement
    SUBLE R2, R2, #1       @ Else: decrement
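Reading an IT mnemonic means mapping each T or E onto the instruction it governs. The Python sketch below performs that expansion so you can check which suffix each slot must carry. `it_block_conditions` and the `OPPOSITE` table are names I made up. The model assumes a condition other than AL, since AL has no opposite to use for an Else slot.

```python
OPPOSITE = {
    "EQ": "NE", "NE": "EQ", "CS": "CC", "CC": "CS",
    "MI": "PL", "PL": "MI", "VS": "VC", "VC": "VS",
    "HI": "LS", "LS": "HI", "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}


def it_block_conditions(mnemonic, cond):
    """Expand e.g. ("ITTEE", "GT") into the suffix required on each instruction."""
    pattern = mnemonic.upper()
    suffix = pattern[2:]
    if not pattern.startswith("IT") or len(suffix) > 3:
        raise ValueError("expected IT followed by at most three T/E letters")
    if any(letter not in "TE" for letter in suffix):
        raise ValueError("only T (then) and E (else) are allowed")
    if cond not in OPPOSITE:
        raise ValueError("unsupported condition: " + cond)
    # The first instruction always uses the base condition.
    conditions = [cond]
    for letter in suffix:
        conditions.append(cond if letter == "T" else OPPOSITE[cond])
    return conditions


print(it_block_conditions("ITTEE", "GT"))
print(it_block_conditions("ITE", "GT"))
```

The first call yields GT, GT, LE, LE, matching the ADDGT/ADDGT/SUBLE/SUBLE sequence in the four-instruction example above.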

                        
Assembler Assist: The GNU assembler can generate IT instructions for you, but only when asked. With -mimplicit-it=thumb (or -mimplicit-it=always) you write MOVLT R0, R1 and the assembler inserts the required IT LT before it. With the default setting, a conditional instruction outside an IT block in Thumb code is rejected with an error. Explicit IT blocks therefore remain the portable choice for hand-written assembly, and you will always see them in disassembly output.
                    

The S Suffix & Flag Updates

By default, ARM data processing instructions do not update the CPSR condition flags. You must explicitly request flag updates by appending the S suffix:

@ Without S: flags unchanged (can't use condition codes after)
    ADD  R0, R1, R2        @ R0 = R1 + R2; NZCV flags untouched

@ With S: flags updated (enables conditional execution)
    ADDS R0, R1, R2        @ R0 = R1 + R2; N,Z,C,V flags set

@ Comparison always updates flags (S is implicit)
    CMP  R0, R1            @ Always sets flags (no S needed)
    TST  R0, #0xFF         @ Always sets flags

@ Pattern: loop counter with flag-setting subtraction
    SUBS R2, R2, #1        @ Decrement and set Z flag when R2 reaches 0
    BNE  loop              @ Branch back if not zero

                        
                        Design Rationale: Why doesn't every instruction set flags by default? Because flag-setting creates dependencies. If ADDS updates C, and a later ADC reads C, the pipeline must stall until ADDS completes. By making flag-setting opt-in, ARM allows the compiler to avoid unnecessary dependencies, enabling better instruction-level parallelism on superscalar cores.
                    

Exercises & Practice

Exercise 1

Barrel Shifter Multiplication

Write ARM32 instructions using only ADD, SUB, RSB, and barrel shifter (no MUL) to compute:

R0 = R1 × 10
R0 = R1 × 17
R0 = R1 × 100 (hint: 100 = 4 × 25 = 4 × (32 - 8 + 1))

Exercise 2

Conditional Execution Challenge

Convert this C function to ARM32 assembly using no branches (only conditional execution):

int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}

Hint: Use CMP, then MOVGT/MOVLT/MOVEQ.

Exercise 3

Immediate Encoding Puzzle

For each constant, determine if it's encodable as an ARM32 immediate (8-bit value rotated right by even amount). If not, show how to load it with MOVW+MOVT:

0x3FC
0x102
0xAB00
0xF000000F

Conclusion & Next Steps

In this deep dive into ARM32, you've mastered the fundamental building blocks of the 32-bit instruction set:

ARM vs Thumb modes: The trade-off between full-featured 32-bit ARM and compact 16-bit Thumb, unified by Thumb-2
Register file: 16 general-purpose registers with special roles for SP, LR, and PC, plus the critical CPSR flags register
Banked registers: How ARM32's exception modes each keep their own SP, LR, and SPSR (with FIQ also banking R8–R12) for efficient context switching, while User and System share one register set
Data processing: Arithmetic, logical, comparison, and move instructions with the powerful inline barrel shifter
Immediate encoding: The clever but tricky 8-bit rotated immediate scheme and its constraints
Load/store: Flexible addressing modes including pre/post-indexing and block transfers
Conditional execution: ARM32's signature feature — predicated instruction execution that eliminates short branches

These ARM32 concepts form the historical foundation upon which AArch64 was built. In the next part, we'll transition to the modern 64-bit world and explore how AArch64 reimagined the register file, addressing modes, and instruction encoding for the next era of ARM computing.
