Explore ARM's evolutionary journey from ARMv1 to ARMv9, understand RISC design philosophy, discover the diverse ARM ecosystem with Cortex profiles, and master foundational concepts that underpin all ARM assembly programming.
Every time you pick up a smartphone, tap your smartwatch, ask a voice assistant a question, or drive a modern car, you are interacting with an ARM processor. ARM (originally Acorn RISC Machine, later Advanced RISC Machines) is the most widely deployed processor architecture in human history, with over 250 billion chips shipped to date. Unlike Intel, whose x86 chips dominate desktops and servers and are built in Intel's own fabs, ARM is an intellectual property (IP) licensing company: it designs processor architectures that hundreds of companies then build into silicon.
This article is your gateway into understanding ARM at the deepest level. Whether you're an embedded engineer working with Cortex-M microcontrollers, a systems programmer targeting Linux on Cortex-A, or a security researcher analyzing ARM binaries, the foundational concepts covered here will serve as your bedrock throughout the entire 28-part series.
ARM isn't just another processor architecture — it's the dominant one by volume. Consider these staggering numbers:
Think of processor design like running a kitchen:
This simplicity-by-design is the core of RISC philosophy: fewer, simpler instructions that execute in predictable time, enabling aggressive pipelining and power efficiency.
The RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer) debate has shaped processor design for four decades. Understanding these philosophies is essential for anyone writing assembly code.
Intel's x86 architecture follows the CISC approach: provide the programmer with powerful, multi-step instructions. A single x86 instruction like REP MOVSB can copy an entire block of memory. The hardware handles the complexity internally.
ARM takes the opposite approach. The RISC design principles that ARM follows include:
| RISC Principle | What It Means | ARM Implementation |
|---|---|---|
| Fixed-width instructions | Every instruction is the same size | 32 bits (ARM mode) or 16/32 bits (Thumb-2) |
| Load/Store architecture | Only load/store instructions access memory; ALU works on registers only | LDR/STR for memory; ADD/SUB operate on registers |
| Large register file | Many general-purpose registers reduce memory accesses | 16 registers (AArch32), 31 registers (AArch64) |
| Simple addressing modes | Limited, regular memory addressing patterns | Base + offset, base + register, pre/post-index |
| Single-cycle execution | Most instructions complete in one clock cycle | Pipeline designed for single-cycle throughput |
Adding two numbers from memory and storing the result:
// x86 (CISC) - can operate directly on memory
// ADD reads its second operand from memory and adds it to a register
add eax, [ebx] // One instruction: load + add in one step
mov [ecx], eax // Store result
// ARM (RISC) - load/store architecture, explicit steps
LDR R0, [R1] // Step 1: Load value from memory into register
ADD R0, R0, R2 // Step 2: Add registers (no memory access)
STR R0, [R3] // Step 3: Store result back to memory
ARM needs more instructions for the same work, but each one is simple and uniform in size and timing. That uniformity lets the pipeline process them efficiently, so the total execution time is often comparable to, or better than, the shorter CISC sequence.
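To make the load/store split concrete, here is a minimal Python sketch of a toy register machine that runs the three-instruction ARM sequence above. The `run` helper, the tuple-based instruction format, and the addresses and values are all hypothetical, chosen only for illustration. It models a single idea: ADD works purely on registers, while LDR and STR are the only operations that touch memory.

```python
# Toy load/store machine. The instruction tuples and run() are illustrative
# assumptions, not a model of any real ARM core.

def run(program, regs, mem):
    """Execute (op, *operands) tuples; return how many memory accesses occurred."""
    mem_accesses = 0
    for op, *args in program:
        if op == "LDR":              # LDR Rd, [Rn]   -> Rd = mem[Rn]
            rd, rn = args
            regs[rd] = mem[regs[rn]]
            mem_accesses += 1
        elif op == "STR":            # STR Rd, [Rn]   -> mem[Rn] = Rd
            rd, rn = args
            mem[regs[rn]] = regs[rd]
            mem_accesses += 1
        elif op == "ADD":            # ADD Rd, Rn, Rm -> Rd = (Rn + Rm) mod 2^32
            rd, rn, rm = args
            regs[rd] = (regs[rn] + regs[rm]) & 0xFFFFFFFF
        else:
            raise ValueError(f"unknown op {op}")
    return mem_accesses

# Hypothetical starting state: R1 and R3 hold addresses, R2 holds the addend.
regs = {"R0": 0, "R1": 0x100, "R2": 5, "R3": 0x104}
mem = {0x100: 37, 0x104: 0}
program = [
    ("LDR", "R0", "R1"),
    ("ADD", "R0", "R0", "R2"),
    ("STR", "R0", "R3"),
]
accesses = run(program, regs, mem)
print(mem[0x104], accesses)
```

Only two of the three instructions access memory; the arithmetic step never does, which is what keeps each instruction's timing predictable.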
ARM's journey from a skunkworks project at a British computer company to the world's most prolific processor architecture spans four decades of relentless innovation. Each generation built upon the last, adding capabilities while maintaining the core RISC efficiency that made ARM successful.
The ARM story begins at Acorn Computers in Cambridge, England. In 1983, Acorn needed a processor for their next-generation BBC Micro computer, but existing chips were too slow and expensive.
The Problem: Acorn Computers needed a 32-bit processor for their next BBC Micro. The Motorola 68000 was too expensive, and Intel's 80286 was too power-hungry.
The Solution: Engineers Sophie Wilson and Steve Furber designed their own processor from scratch, inspired by the Berkeley RISC papers. Wilson wrote the instruction set simulator in BBC BASIC in a single evening.
The Result: ARM1 (1985) worked correctly on the first silicon — an extraordinary achievement. It consumed only 0.1 watts (the 80286 consumed 3.5 watts) and could execute 4 MIPS at 8 MHz. The simplicity of RISC design had paid off spectacularly.
Legacy: This power efficiency DNA, established in the very first chip, would define ARM's advantage for the next four decades.
ARMv1 (1985): The original 32-bit RISC processor with 26-bit address space (64 MB addressable). Only used in prototypes.
ARMv2 (1987): Added multiply instructions and coprocessor support. Powered the Acorn Archimedes, one of the first ARM-based commercial computers.
ARMv3 (1992): Expanded to full 32-bit addressing (4 GB address space) and moved the status flags out of the program counter into separate Program Status Registers (CPSR/SPSR). The SWP atomic swap instruction, used to build synchronization primitives, had already arrived with ARMv2a.
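The address-space figures quoted for these early versions follow directly from the address width. As a quick check, assuming byte addressing, the arithmetic looks like this (the `addressable_bytes` helper is a name made up for this sketch):

```python
# Addressable memory for an N-bit byte address is 2**N bytes.

def addressable_bytes(address_bits):
    return 2 ** address_bits

MIB = 2 ** 20
GIB = 2 ** 30

print(addressable_bytes(26) // MIB, "MB")  # ARMv1/ARMv2: 26-bit addresses
print(addressable_bytes(32) // GIB, "GB")  # ARMv3 onward: 32-bit addresses
```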
The mid-1990s brought ARM into the mobile revolution. ARM Ltd. was spun off from Acorn in 1990 as a joint venture with Apple (which used ARM in the Newton PDA) and VLSI Technology.
ARMv4T / ARM7TDMI (1994): The breakthrough chip. The "T" stands for Thumb, a 16-bit instruction set that compresses code to roughly 65% of the size of the equivalent ARM code while maintaining most of the performance. This was critical for memory-constrained devices. The ARM7TDMI became the best-selling ARM core of all time, found in the Nokia 6110, Game Boy Advance, and iPod.
ARMv5 (1997): Added DSP (Digital Signal Processing) extensions for multimedia codecs (the ARMv5TE variant). Introduced the BLX instruction for smoother ARM/Thumb interworking. The ARMv5TEJ variant introduced the Jazelle extension for Java bytecode execution.
ARMv6 (2002): Introduced SIMD (Single Instruction Multiple Data) extensions for parallel data processing, for example operating on four 8-bit values packed into one 32-bit register at once. ARMv6 also added mixed-endian support. The ARM1136 was the first ARMv6 core; later ARMv6 variants added TrustZone security technology, and the ARM1176JZF-S core powered the original Raspberry Pi.
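To illustrate what "four 8-bit values at once" means, the following Python sketch models a lane-wise byte add, loosely based on the ARMv6 `UADD8` instruction. It is a simplified model under stated assumptions: the function name `uadd8` mirrors the mnemonic, but the GE flag updates that the real instruction performs are ignored, and the input values are arbitrary.

```python
# Lane-wise add of four unsigned bytes packed into 32-bit words.
# Simplified model: the real UADD8 also sets the GE flags, omitted here.

def uadd8(a, b):
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        # Each lane wraps modulo 256; no carry leaks into the next lane.
        result |= ((x + y) & 0xFF) << shift
    return result

a, b = 0x01020304, 0x10FF2030
packed = uadd8(a, b)
plain = (a + b) & 0xFFFFFFFF   # ordinary 32-bit add for comparison

print(f"{packed:#010x}")  # lanes added independently
print(f"{plain:#010x}")   # carry from 0x02 + 0xFF spills into the top byte
```

The two results differ only in the top byte, where the ordinary add lets a carry cross the lane boundary and the packed add does not.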
ARMv7 is where ARM truly conquered mobile computing. This architecture introduced the three Cortex profiles that persist today and added NEON Advanced SIMD for multimedia processing.
Key ARMv7 innovations:
ARMv8-A was ARM's most dramatic architectural leap — introducing a completely new 64-bit instruction set called AArch64 while maintaining backward compatibility with 32-bit code through an AArch32 execution state.
Context: In September 2013, Apple surprised the entire industry by shipping the A7 chip in the iPhone 5s — the world's first 64-bit ARM processor in a smartphone. Competitors were caught flat-footed.
Technical Impact: The move to AArch64 nearly doubled the general-purpose registers from 16 to 31, enabled addressing of more than 4 GB of RAM, and provided a clean new instruction set free of decades of legacy baggage.
Industry Reaction: Qualcomm and Samsung scrambled to release their own 64-bit ARM chips. Within two years, every flagship smartphone was 64-bit ARM.
Lesson: AArch64 wasn't just a wider data path — it was a fundamentally better ISA with more registers, cleaner encoding, and modern security features baked in.
Major ARMv8 features:
ARMv9 is the current generation, building on ARMv8 with a focus on security, AI acceleration, and scalable vector processing for data centers and HPC.
Key ARMv9 additions:
| Version | Year | Key Feature | Landmark Product |
|---|---|---|---|
| ARMv1 | 1985 | First RISC silicon | ARM1 prototype |
| ARMv4T | 1994 | Thumb instruction set | ARM7TDMI, GBA, iPod |
| ARMv6 | 2002 | SIMD, TrustZone | ARM1136, ARM1176 (Raspberry Pi 1) |
| ARMv7-A | 2004 | Thumb-2, NEON, big.LITTLE | Cortex-A9, iPhone 4S |
| ARMv8-A | 2011 | AArch64 (64-bit), Crypto | Apple A7, AWS Graviton |
| ARMv9-A | 2021 | CCA, MTE, SVE2, SME | Cortex-X4, Dimensity 9300 |
Understanding how ARM makes money is essential to understanding the ARM ecosystem. Unlike Intel or AMD, which sell finished processors, ARM built its business without selling chips of its own. Instead, it operates one of the most successful IP licensing businesses in technology history.
Think of ARM like an architecture firm that designs blueprints for buildings (processor designs) but never builds the buildings themselves. Clients buy the blueprints and construct the buildings (chips) in their own factories (fabs). ARM earns revenue from:
This model means ARM's revenue scales directly with the global demand for silicon — the more chips shipped worldwide, the more ARM earns.
ARM offers several licensing tiers:
The breadth of ARM's licensee ecosystem is staggering:
| Company | License Type | Notable Chips | Market |
|---|---|---|---|
| Apple | Architecture | A17 Pro, M4 | Phones, laptops, servers |
| Qualcomm | Architecture | Snapdragon 8 Gen 3 | Android phones, PCs |
| Samsung | Architecture | Exynos 2400 | Phones, IoT |
| AWS/Amazon | Processor | Graviton 4 | Cloud servers |
| NVIDIA | Architecture | Grace CPU | AI/HPC servers |
| MediaTek | Processor | Dimensity 9300 | Phones, IoT, autos |
| STMicroelectronics | Processor | STM32 series | Embedded/MCU |
| NXP | Processor | i.MX RT series | Automotive, industrial |
| Raspberry Pi | Processor | RP2040 (Cortex-M0+) | Education, maker |
Since ARMv7, ARM has organized its processor designs into three distinct profiles, each optimized for specific workloads. Think of them as three different vehicle types: motorcycles (M), rally cars (R), and luxury sedans (A).
The Cortex-M family is designed for deeply embedded systems where cost, power, and real-time determinism are critical. These processors run bare-metal firmware or lightweight RTOS (Real-Time Operating Systems) like FreeRTOS or Zephyr.
The Cortex-R family targets applications requiring hard real-time guarantees — where missing a deadline could be catastrophic. These sit between the simplicity of Cortex-M and the complexity of Cortex-A.
Typical applications: Automotive brake controllers (ISO 26262 ASIL-D), hard drive/SSD controllers, 5G modem basebands, medical device real-time processing, industrial robot safety systems.
The Cortex-A family powers the devices we interact with most: smartphones, tablets, laptops, smart TVs, and increasingly, cloud servers.
| Feature | Cortex-M | Cortex-R | Cortex-A |
|---|---|---|---|
| Purpose | Microcontroller | Real-time | Application |
| Clock Speed | Up to ~800 MHz | Up to ~1.5 GHz | Up to ~3.5 GHz |
| Pipeline | 2–8 stages | 8–11 stages | 11–17+ stages |
| Memory | MPU (no MMU) | MPU (optional cache) | Full MMU |
| OS Support | Bare metal / RTOS | RTOS / bare metal | Linux / Android / Windows |
| ISA | Thumb-2 only | ARM + Thumb-2 | AArch64 + AArch32 |
| Power | μW to mW | mW to W | W (multi-watt) |
| Typical Die Size | 0.01–0.1 mm² | 0.1–1 mm² | 1–10+ mm² |
| Cost Range | $0.10–$5 | $2–$20 | $5–$200+ |
ARMv8 introduced the concept of execution states. A single processor can operate in two fundamentally different states, AArch32 and AArch64, each with its own instruction set, register model, and exception handling.
AArch32 is the legacy 32-bit execution state, backward-compatible with ARMv7 and earlier:
// AArch32 example: Conditional execution (no branching needed!)
CMP R0, #10 // Compare R0 with 10
ADDGT R1, R1, #1 // If R0 > 10, add 1 to R1
SUBLE R1, R1, #1 // If R0 <= 10, subtract 1 from R1
AArch64 is the modern 64-bit execution state introduced in ARMv8-A. It's a clean-sheet design, not merely a 64-bit extension of AArch32:
// AArch64 example: Same logic, different style
// (most A64 instructions cannot be conditionally executed, so select instead)
CMP X0, #10 // Compare X0 with 10
ADD X2, X1, #1 // Compute X1 + 1
SUB X3, X1, #1 // Compute X1 - 1
CSEL X1, X2, X3, GT // Select: X1 = (X0 > 10) ? X2 : X3
// For the increment-only case, CINC X1, X1, GT does X1++ if GT.
// There is no CSUB instruction, which is why the CSEL pattern is used here.
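The CMP and conditional-select sequence above depends on how the N, Z, C and V flags encode a signed comparison. The Python sketch below models that under simple assumptions: registers are 64-bit unsigned integers, and the helper names (`cmp_flags`, `cond_gt`, `csel`, `update`) are invented for this illustration. The GT condition is true when Z is clear and N equals V.

```python
MASK64 = (1 << 64) - 1

def cmp_flags(a, b):
    """Flags (N, Z, C, V) produced by CMP a, b, which computes a - b."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = result >> 63                      # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                       # carry set means no unsigned borrow
    sa, sb = a >> 63, b >> 63
    v = int(sa != sb and (result >> 63) != sa)  # signed overflow
    return n, z, c, v

def cond_gt(flags):
    n, z, c, v = flags
    return z == 0 and n == v              # signed greater-than

def csel(xn, xm, cond):
    return xn if cond else xm             # CSEL Xd, Xn, Xm, cond

def update(x0, x1):
    """X1 = (X0 > 10) ? X1 + 1 : X1 - 1, as in the assembly above."""
    gt = cond_gt(cmp_flags(x0, 10))
    return csel((x1 + 1) & MASK64, (x1 - 1) & MASK64, gt)

print(update(11, 5), update(10, 5))
# -1 as a 64-bit pattern is a huge unsigned number, but GT is a signed test.
print(update(-1 & MASK64, 5))
```

Both candidate results are computed before the select, which mirrors the branch-free style of the AArch64 sequence.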
Before diving into instruction details in the following parts, let's establish several foundational concepts that ARM assembly programmers encounter constantly.
Endianness refers to the byte order in which multi-byte data is stored in memory. Consider the 32-bit value 0x12345678:
| Mode | Address +0 | Address +1 | Address +2 | Address +3 |
|---|---|---|---|---|
| Little-endian | 0x78 | 0x56 | 0x34 | 0x12 |
| Big-endian | 0x12 | 0x34 | 0x56 | 0x78 |
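// Check EL1 data endianness in AArch64 by reading SCTLR_EL1
// (SCTLR_EL1 is only readable at EL1 or higher, not from EL0)
MRS X0, SCTLR_EL1 // Read System Control Register (EL1)
AND X0, X0, #(1 << 25) // Isolate EE bit (bit 25)
// If X0 == 0, EL1 data accesses are little-endian
// If X0 != 0, EL1 data accesses are big-endian
// EL0 data endianness is controlled separately by the E0E bit (bit 24)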
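The byte-order table above can be reproduced on any machine with Python's standard `struct` module, which is a convenient way to check your understanding without ARM hardware. The `as_hex` and `ee_is_big_endian` helpers and the sample register values are hypothetical; the second helper mirrors the bit-25 test from the SCTLR_EL1 snippet.

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)  # "<" = little-endian, "I" = unsigned 32-bit
big = struct.pack(">I", value)     # ">" = big-endian

def as_hex(data):
    """Format bytes in increasing address order."""
    return " ".join(f"{b:02x}" for b in data)

print("little:", as_hex(little))
print("big:   ", as_hex(big))

# Reading little-endian bytes with the wrong byte order scrambles the value.
misread = struct.unpack(">I", little)[0]
print(hex(misread))

# Bit test equivalent to masking with (1 << 25) and comparing with zero.
EE_BIT = 25

def ee_is_big_endian(sctlr_value):
    return (sctlr_value >> EE_BIT) & 1 == 1

# Made-up register values, used only to exercise the bit test.
print(ee_is_big_endian(0), ee_is_big_endian(1 << EE_BIT))
```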
ARM processors use Exception Levels (EL0–EL3) to control what software can do. Think of it as a building with four floors, where higher floors have more access:
Transitions between levels occur through exceptions (interrupts, system calls, page faults). Software at EL0 can request kernel services via the SVC (Supervisor Call) instruction, which transitions to EL1.
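// System call from EL0 to EL1 (AArch64 Linux)
// Assumes `message` labels a 13-byte string defined elsewhere in the program
MOV X8, #64 // Syscall number for write() goes in X8
MOV X0, #1 // Argument 1: file descriptor (stdout)
LDR X1, =message // Argument 2: buffer address
MOV X2, #13 // Argument 3: length in bytes
SVC #0 // Trigger exception → EL1 kernel handler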
A processor pipeline is like an assembly line in a factory. Instead of completing one instruction fully before starting the next, the processor overlaps multiple instructions at different stages of execution.
The classic ARM pipeline (ARM7TDMI) had just 3 stages:
| Clock cycle | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Instr 1 | F | D | E | | |
| Instr 2 | | F | D | E | |
| Instr 3 | | | F | D | E |
F = Fetch (get instruction from memory)
D = Decode (figure out what instruction does)
E = Execute (perform the operation)
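The overlap in the 3-stage diagram can be reproduced with a few lines of Python. This is a sketch of an ideal pipeline only: it assumes one instruction enters per cycle and ignores stalls, branches, and memory waits, and the helper names `pipeline_schedule` and `total_cycles` are invented here. Under those assumptions, N instructions on an S-stage pipeline finish in S + N - 1 cycles instead of S × N.

```python
STAGES = ["F", "D", "E"]  # Fetch, Decode, Execute

def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each instruction number to {cycle: stage} for an ideal pipeline."""
    schedule = {}
    for i in range(num_instructions):
        # Instruction i+1 enters Fetch in cycle i+1, then advances one stage per cycle.
        schedule[i + 1] = {i + 1 + s: name for s, name in enumerate(stages)}
    return schedule

def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to drain the pipeline when nothing stalls."""
    return num_stages + num_instructions - 1

sched = pipeline_schedule(3)
for instr, slots in sched.items():
    row = [slots.get(cycle, ".") for cycle in range(1, total_cycles(3) + 1)]
    print(f"Instr {instr}: " + " ".join(row))

print("pipelined:", total_cycles(3), "cycles; unpipelined:", 3 * len(STAGES))
```

For long instruction streams the fixed start-up cost of S - 1 cycles becomes negligible, so throughput approaches one instruction per cycle.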
Modern Cortex-A cores have 11–17+ pipeline stages with out-of-order execution, allowing multiple instructions to enter and complete simultaneously. The Cortex-X4, for example, can dispatch up to 10 operations per cycle across its execution units.
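In the ARM (A32) and AArch64 (A64) instruction sets, every instruction is encoded as a fixed-width 32-bit word; Thumb-2 mixes 16-bit and 32-bit encodings. Understanding encoding helps with debugging, reverse engineering, and appreciating design constraints.
AArch32 ARM data-processing encoding (32 bits):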
| Bits | 31–28 | 27–26 | 25 | 24–21 | 20 | 19–16 | 15–12 | 11–0 |
|---|---|---|---|---|---|---|---|---|
| Field | Cond | 0 0 | I | Opcode | S | Rn | Rd | Operand2 |
Cond (4 bits): Condition code (EQ, NE, GT, LT, AL=always)
I (1 bit): Immediate flag (1=immediate operand, 0=register)
Opcode (4): Operation (0000=AND, 0100=ADD, 1101=MOV, etc.)
S (1 bit): Set condition flags (1=update CPSR)
Rn (4 bits): First operand register
Rd (4 bits): Destination register
Operand2 (12): Second operand (immediate or shifted register)
Example: Encoding ADDS R1, R2, R3 (Add R2+R3, store in R1, set flags):
Cond=1110 (AL, always execute)
0 0
I=0 (register operand)
Opcode=0100 (ADD)
S=1 (set flags)
Rn=0010 (R2)
Rd=0001 (R1)
Operand2=000000000011 (R3, no shift)
Binary: 1110 00 0 0100 1 0010 0001 000000000011
Hex: 0xE0921003
# Verify with an assembler (on an ARM system or cross-assembler):
echo "ADDS R1, R2, R3" | arm-none-eabi-as -o test.o -
arm-none-eabi-objdump -d test.o
# Output: e0921003 adds r1, r2, r3
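If no ARM toolchain is at hand, the same field packing can be checked with a short Python sketch. It covers only what the worked example uses: data-processing instructions with an unshifted register as Operand2 (I = 0), plus the handful of condition codes and opcodes listed above. The function name `encode_dp_register` and its parameters are hypothetical.

```python
# Condition codes and opcodes limited to those named in the field list above.
COND = {"EQ": 0b0000, "NE": 0b0001, "LT": 0b1011, "GT": 0b1100, "AL": 0b1110}
OPCODE = {"AND": 0b0000, "ADD": 0b0100, "MOV": 0b1101}

def encode_dp_register(op, rd, rn, rm, cond="AL", set_flags=False):
    """Encode an A32 data-processing instruction with Operand2 = Rm, no shift."""
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 15:
            raise ValueError("register number must be 0..15")
    word = COND[cond] << 28          # bits 31-28: condition
    # bits 27-26 are 00 and I (bit 25) is 0 for a register operand
    word |= OPCODE[op] << 21         # bits 24-21: opcode
    word |= int(set_flags) << 20     # bit 20: S
    word |= rn << 16                 # bits 19-16: first operand register
    word |= rd << 12                 # bits 15-12: destination register
    word |= rm                       # bits 11-0: Rm with a zero shift
    return word

# ADDS R1, R2, R3
adds = encode_dp_register("ADD", rd=1, rn=2, rm=3, set_flags=True)
print(f"0x{adds:08X}")
```

Immediate operands and shifted registers need extra handling of the Operand2 field, which this sketch deliberately leaves out.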
ARM publishes comprehensive documentation through the ARM Architecture Reference Manual (ARM ARM). Here's how to navigate the key references:
| Document | Content | When to Use |
|---|---|---|
| ARM ARM (DDI 0487) | Complete ISA specification, all encodings | Authoritative instruction reference |
| Cortex-A TRM | Specific core implementation details | Core-specific features, pipeline details |
| AMBA/AXI spec | Bus interconnect protocols | Memory system, DMA design |
| AAPCS/PCS | Procedure Call Standard | Calling conventions, ABI compliance |
| GIC specification | Generic Interrupt Controller | Interrupt configuration |
Consider a task that copies 16 bytes from one memory location to another:
Manually encode the following AArch32 instruction to hexadecimal:
MOVEQ R5, R3 // Move R3 to R5, only if Zero flag is set
Hints:
Answer: Work through the encoding format from the Instruction Encoding section above.
For each application, choose the most appropriate Cortex profile (M, R, or A) and justify your choice:
In this foundational article, we've traced ARM's remarkable journey from a skunkworks project at Acorn Computers in 1985 to the world's most widely deployed processor architecture. You now understand:
With this foundation established, you're ready to get hands-on with actual instruction sets. In Part 2, we'll dive deep into the ARM32 (AArch32) instruction set — the architecture that powered a decade of smartphones and still runs on billions of embedded devices today.
In Part 2: ARM32 (AArch32) Instruction Set Fundamentals, we'll dive into the ARM32 instruction set, exploring ARM vs Thumb modes, the register model, conditional execution, and the unique immediate encoding quirks that make ARM32 assembly distinctive.