
ARM Assembly Part 17: Inline Assembly, GCC/Clang & C Interop

April 30, 2026 • Wasil Zafar • 21 min read

Inline assembly bridges the gap between high-level C and precise hardware control. GCC Extended Asm and Clang's compatible implementation give you surgical register allocation, zero-overhead memory barriers, and access to hardware instructions the compiler never generates — all without the ABI overhead of a full assembly file.

Table of Contents

  1. Introduction & When to Use Inline Asm
  2. Basic asm() Syntax
  3. Extended asm() — Operands & Constraints
  4. Memory Barriers via Inline Asm
  5. Atomic Operations (LDXR/STXR)
  6. NEON & Compiler Intrinsics from C
  7. Common Pitfalls & Debugging
  8. Case Study: Linux Kernel Inline Asm
  9. Hands-On Exercises
  10. Conclusion & Next Steps

Introduction & When to Use Inline Asm

Series Overview: Part 17 of 28. After mastering standalone assembly files and platform ABIs in Parts 14–16, we now integrate assembly directly into C/C++ source. Inline asm is the tool of choice for Linux kernel barrier macros, RTOS context-switch snippets, and performance-critical inner loops where the compiler's autovectoriser falls short.

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 17
1
Architecture History & Core Concepts
ARMv1→v9, RISC philosophy, profiles
2
ARM32 Instruction Set Fundamentals
ARM vs Thumb, registers, CPSR, barrel shifter
3
AArch64 Registers, Addressing & Data Movement
X/W regs, addressing modes, load/store pairs
4
Arithmetic, Logic & Bit Manipulation
ADD/SUB, bitfield extract/insert, CLZ
5
Branching, Loops & Conditional Execution
Branch types, link register, jump tables
6
Stack, Subroutines & AAPCS
Calling conventions, prologue/epilogue
7
Memory Model, Caches & Barriers
Weak ordering, DMB/DSB/ISB, TLB
8
NEON & Advanced SIMD
Vector ops, intrinsics, media processing
9
SVE & SVE2 Scalable Vector Extensions
Predicate regs, gather/scatter, HPC/ML
10
Floating-Point & VFP Instructions
IEEE-754, scalar FP, rounding modes
11
Exception Levels, Interrupts & Vector Tables
EL0–EL3, GIC, fault debugging
12
MMU, Page Tables & Virtual Memory
Stage-1 translation, permissions, huge pages
13
TrustZone & ARM Security Extensions
Secure monitor, world switching, TF-A
14
Cortex-M Assembly & Bare-Metal Embedded
NVIC, SysTick, linker scripts, low-power
15
Cortex-A System Programming & Boot
EL3→EL1 transitions, MMU setup, PSCI
16
Apple Silicon & macOS ABI
ARM64e PAC, Mach-O, dyld, perf counters
17
Inline Assembly, GCC/Clang & C Interop
Constraints, clobbers, compiler interaction
You Are Here
18
Performance Profiling & Micro-Optimization
Pipeline hazards, PMU, benchmarking
19
Reverse Engineering & ARM Binary Analysis
ELF, disassembly, CFR, iOS/Android quirks
20
Building a Bare-Metal OS Kernel
Bootloader, UART, scheduler, context switch
21
ARM Microarchitecture Deep Dive
OOO pipelines, reorder buffers, branch predict
22
Virtualization Extensions
EL2 hypervisor, stage-2 translation, KVM
23
Debugging & Tooling Ecosystem
GDB, OpenOCD/JTAG, ETM/ITM, QEMU
24
Linkers, Loaders & Binary Format Internals
ELF deep dive, relocations, PIC, crt0
25
Cross-Compilation & Build Systems
GCC/Clang toolchains, CMake, firmware gen
26
ARM in Real Systems
Android, FreeRTOS/Zephyr, U-Boot, TF-A
27
Security Research & Exploitation
ASLR, PAC attacks, ROP/JOP, kernel exploit
28
Emerging ARMv9 & Future Directions
MTE, SME, confidential compute, AI accel

Use inline assembly when: you need a single hardware instruction the compiler won't emit (e.g., clz, rbit, dmb); you need precise register placement without function call overhead; or you need compiler-visible effects on memory ordering. Avoid inline asm for anything longer than ~10 instructions — a separate .S file is cleaner, testable, and debuggable.
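
Often the cleanest answer is no inline asm at all: GCC and Clang builtins already lower to single instructions like clz and rev. A quick sketch (portable C, no assembly required — the function names here are illustrative wrappers, not a standard API):

```c
#include <stdint.h>

/* Builtins the compiler lowers to single AArch64 instructions,
 * making inline asm unnecessary for these cases:
 *   __builtin_clzll      -> clz   (count leading zeros)
 *   __builtin_bswap32    -> rev   (byte reverse)
 *   __builtin_popcountll -> NEON cnt+addv or a software fallback */
int clz64(uint64_t x) {
    return __builtin_clzll(x);       /* undefined behavior for x == 0 */
}

uint32_t bswap32(uint32_t x) {
    return __builtin_bswap32(x);
}
```

If a builtin exists, prefer it: the compiler can constant-fold, schedule, and inline it, none of which it can do through an asm block.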

Real-World Analogy — Margin Notes in a Book: Think of inline assembly like handwritten margin notes in a printed book. The book (C code) follows grammar rules and is typeset by the publisher (compiler). Occasionally you scribble a precise note in the margin — a formula, a shortcut, or a correction the publisher can't express. The note is powerful but fragile: if the publisher re-typesets (compiler recompiles with different optimizations), your margin note might end up on the wrong page. Constraints (=r, "memory") are like numbered sticky tabs that tell the publisher exactly which page and line your note references, so it survives any re-layout. Without them, your handwritten wisdom ends up next to the wrong paragraph.

Basic asm() Syntax

// Basic asm — no operands, just an instruction string
// Useful for assembly directives or ultra-simple one-liners
// Basic asm is implicitly volatile (GCC won't delete it), but it declares
// no inputs, outputs, or clobbers, so the compiler may cache values in
// registers or move surrounding code across it — rarely what you want

void disable_irq(void) {
    asm("msr daifset, #2");    // Disable IRQ (sets DAIF.I)
}

void enable_irq(void) {
    asm("msr daifclr, #2");    // Enable IRQ (clears DAIF.I)
}

// Basic asm with multiple instructions (semicolon or \n separated)
void wfi_sleep(void) {
    asm("dsb sy\n\t"          // Data sync barrier
        "wfi");               // Wait for interrupt
}

Extended asm() — Operands & Constraints

Extended asm has the form: asm [volatile] ("template" : [outputs] : [inputs] : [clobbers]). The template uses %0, %1, … as placeholders for operands listed in order: outputs first, then inputs. On AArch64 use %w0 for 32-bit (W-register) access and %x0 for explicit 64-bit.
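
A minimal sketch of the width modifiers in action (AArch64 only; add_w and add_x are illustrative wrappers, not from any library):

```c
#include <stdint.h>

#if defined(__aarch64__)
/* %w0 selects the 32-bit W view of the operand's register;
 * %x0 (or bare %0 on a 64-bit operand) selects the 64-bit X view. */
static inline uint32_t add_w(uint32_t a, uint32_t b) {
    uint32_t r;
    asm("add %w0, %w1, %w2" : "=r"(r) : "r"(a), "r"(b));
    return r;
}

static inline uint64_t add_x(uint64_t a, uint64_t b) {
    uint64_t r;
    asm("add %x0, %x1, %x2" : "=r"(r) : "r"(a), "r"(b));
    return r;
}
#endif
```

Mixing these up is pitfall #5 below: a bare %0 on a 64-bit variable always names the X register, even inside an instruction you meant to be 32-bit.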

Output Constraints

#include <stdint.h>

// "=r" — any general-purpose register, write-only output
static inline uint64_t read_cntvct(void) {
    uint64_t val;
    asm volatile("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

// "+r" — read-write register (input AND output to same register)
static inline uint32_t bit_reverse(uint32_t x) {
    asm volatile("rbit %w0, %w0" : "+r"(x));
    return x;
}

// "=&r" — early-clobber: output register must not overlap any input
// Required when output may be written before all inputs are consumed
static inline void str_exclusive(uint32_t val, volatile uint32_t *addr,
                                  int *status) {
    asm volatile("stxr  %w0, %w2, [%1]"
                 : "=&r"(*status)          // =& prevents overlap with %1/%2
                 : "r"(addr), "r"(val)
                 : "memory");
}

Input Constraints

// "r" — any GP register (input)
// "i" — immediate integer (compile-time constant only)
// "m" — memory operand (produces address for asm use)
// "n" — numeric constant (immediate, no register allocated)

static inline int count_leading_zeros(uint64_t x) {
    int result;
    asm("clz %0, %1" : "=r"(result) : "r"(x));
    return result;
}

// Using "i" constraint for immediate (shift amount must be a compile-time
// constant: this only compiles when each call is inlined with a literal
// argument, so __attribute__((always_inline)) is commonly added)
static inline uint32_t rotate_right(uint32_t x, int shift) {
    // Compiler substitutes the literal integer into the template
    uint32_t out;
    asm("ror %w0, %w1, #%2"
        : "=r"(out)
        : "r"(x), "i"(shift));
    return out;
}

// Tied input (digit = output index): reuse same register
static inline uint32_t set_bit(uint32_t val, uint32_t bit) {
    asm("orr %w0, %w0, %w1"
        : "+r"(val)            // %0: read-write (tied input)
        : "r"(1u << bit));
    return val;
}

Clobbers & Memory

// Clobbers tell the compiler which registers/flags the asm modifies
// WITHOUT expressing them as outputs — so compiler saves/restores them

// "cc" clobber: asm modifies condition flags
// Note the %w modifiers — these are 32-bit int operands
static inline int signed_saturate(int x, int limit) {
    int result;
    asm volatile(
        "cmp  %w1, %w2\n\t"
        "csel %w0, %w1, %w2, lt"  // Select min(x, limit)
        : "=r"(result)
        : "r"(x), "r"(limit)
        : "cc");                  // CMP modifies NZCV flags
    return result;
}

// "memory" clobber: asm reads/writes memory not named in operands
// Forces compiler to flush/reload all live values around the asm
#define mb()  asm volatile("dmb ish"   : : : "memory")
#define wmb() asm volatile("dmb ishst" : : : "memory")
#define rmb() asm volatile("dmb ishld" : : : "memory")
#define isb() asm volatile("isb"       : : : "memory")

// Combine register + memory clobbers: simplified context switch (x19-x24 only;
// a real one also saves x25-x28, FP, and LR — and belongs in a .S file or naked
// function, since moving SP inside inline asm in an ordinary C function is fragile)
void switch_context(uint64_t *save_sp, uint64_t *load_sp) {
    asm volatile(
        "sub  sp, sp, #48\n\t"          // Push callee-saved regs on current stack
        "stp  x19, x20, [sp, #0]\n\t"
        "stp  x21, x22, [sp, #16]\n\t"
        "stp  x23, x24, [sp, #32]\n\t"
        "mov  x9, sp\n\t"
        "str  x9, [%0]\n\t"             // *save_sp = current SP
        "mov  sp, %1\n\t"               // SP = *load_sp (new task's saved SP)
        "ldp  x19, x20, [sp, #0]\n\t"   // Pop regs from the new stack
        "ldp  x21, x22, [sp, #16]\n\t"
        "ldp  x23, x24, [sp, #32]\n\t"
        "add  sp, sp, #48"
        :
        : "r"(save_sp), "r"(*load_sp)
        : "x9","x19","x20","x21","x22","x23","x24","memory","cc"
    );
}

Memory Barriers via Inline Asm

// Linux kernel barrier pattern (arch/arm64/include/asm/barrier.h style)
#define smp_mb()     asm volatile("dmb ish"   : : : "memory")
#define smp_wmb()    asm volatile("dmb ishst" : : : "memory")
#define smp_rmb()    asm volatile("dmb ishld" : : : "memory")
#define dma_wmb()    asm volatile("dmb oshst" : : : "memory")  // outer shareable

// READ_ONCE and WRITE_ONCE equivalents (prevent compiler caching, tearing,
// and elimination of the access); the kernel's real versions are volatile
// casts, not inline asm
#define READ_ONCE(x)        (*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)  (*(volatile typeof(x) *)&(x) = (val))

// Acquire/release atomics using LDAR/STLR
static inline uint64_t load_acquire(const volatile uint64_t *addr) {
    uint64_t val;
    asm volatile("ldar %0, [%1]" : "=r"(val) : "r"(addr) : "memory");
    return val;
}

static inline void store_release(volatile uint64_t *addr, uint64_t val) {
    asm volatile("stlr %1, [%0]" : : "r"(addr), "r"(val) : "memory");
}
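
Putting load-acquire/store-release to work: a sketch of the classic one-shot message handoff between two cores. The publish/consume helpers are hypothetical names; shown here single-threaded, but in real use the two functions run on different cores:

```c
#include <stdint.h>

#if defined(__aarch64__)
static volatile uint64_t payload;
static volatile uint64_t ready;   /* 0 = empty, 1 = payload valid */

/* Producer (core A): write the payload, then release the flag.
 * STLR orders the payload store before the flag store. */
static void publish(uint64_t value) {
    payload = value;
    asm volatile("stlr %1, [%0]"
                 : : "r"(&ready), "r"((uint64_t)1) : "memory");
}

/* Consumer (core B): acquire the flag, then read the payload.
 * LDAR orders the flag load before the payload load. */
static uint64_t consume(void) {
    uint64_t flag;
    do {
        asm volatile("ldar %0, [%1]" : "=r"(flag) : "r"(&ready) : "memory");
    } while (flag == 0);
    return payload;
}
#endif
```

Without the acquire/release pair (plain LDR/STR), ARM's weak memory model permits the consumer to observe the flag before the payload — exactly the bug class the barriers exist to prevent.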

Atomic Operations (LDXR/STXR)

// CAS (Compare And Swap) using exclusive monitor
// AArch64 also has ARMv8.1 LSE atomics: CAS, SWP, LDADD (preferred where available)
static inline int atomic_cmpxchg32(volatile int *ptr, int old_val, int new_val) {
    int result;
    int tmp;
    asm volatile(
    "1: ldxr  %w0, [%2]\n\t"       // Load exclusive: result = *ptr
        "cmp   %w0, %w3\n\t"       // result == old_val?
        "b.ne  2f\n\t"             // No: bail out
        "stxr  %w1, %w4, [%2]\n\t" // Yes: try store exclusive
        "cbnz  %w1, 1b\n\t"        // Retry if store exclusive failed
    "2: clrex"                     // Clear exclusive monitor (needed on the bail-out path; harmless after a successful store)
        : "=&r"(result), "=&r"(tmp)
        : "r"(ptr), "r"(old_val), "r"(new_val)
        : "cc", "memory"
    );
    return result;                  // Returns old value (Linux cmpxchg convention)
}

// ARMv8.1 LSE: preferred on Cortex-A55+, Neoverse, Apple Silicon
// Requires -march=armv8.1-a+lse or target_feature attribute
#ifdef __ARM_FEATURE_ATOMICS
static inline int atomic_cmpxchg32_lse(volatile int *ptr, int old_val, int new_val) {
    asm volatile("cas %w0, %w2, [%1]"
                 : "+r"(old_val)
                 : "r"(ptr), "r"(new_val)
                 : "memory");
    return old_val;
}
#endif
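
The exclusive-monitor pattern generalizes beyond compare-and-swap. As a sketch, an unconditional fetch-and-add built from LDXR/STXR (atomic_fetch_add32 is a hypothetical helper, following the same constraint rules as the cmpxchg above):

```c
#include <stdint.h>

#if defined(__aarch64__)
/* Atomically *ptr += add; returns the old value.
 * All three outputs are early-clobber ("=&r") because they are written
 * before the inputs %3/%4 are last read. */
static inline int atomic_fetch_add32(volatile int *ptr, int add) {
    int old, newv, status;
    asm volatile(
    "1: ldxr  %w0, [%3]\n\t"       // old = *ptr (exclusive)
        "add   %w1, %w0, %w4\n\t"  // newv = old + add
        "stxr  %w2, %w1, [%3]\n\t" // try to store newv
        "cbnz  %w2, 1b"            // retry if exclusive store failed
        : "=&r"(old), "=&r"(newv), "=&r"(status)
        : "r"(ptr), "r"(add)
        : "memory");
    return old;
}
#endif
```

On ARMv8.1+ the whole loop collapses to a single LDADD instruction, which is why the kernel prefers LSE atomics when available.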

NEON & Compiler Intrinsics from C

// Header: <arm_neon.h> — must be #included explicitly; NEON is baseline on
// AArch64, while ARM32 requires -mfpu=neon
#include <arm_neon.h>
#include <stdint.h>

// Horizontal sum of 4 floats using intrinsics (no inline asm needed)
float sum4(const float *arr) {
    float32x4_t v = vld1q_f32(arr);       // Load 4 floats into NEON register
    float32x2_t lo = vget_low_f32(v);
    float32x2_t hi = vget_high_f32(v);
    float32x2_t sum = vpadd_f32(lo, hi);   // Pairwise add
    sum = vpadd_f32(sum, sum);             // Final pair add
    return vget_lane_f32(sum, 0);
}

// Dot product using FMLA (fused multiply-accumulate)
// Processes 4 elements per iteration; trailing n % 4 elements are skipped
float dot_product(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);      // acc += va * vb (FMLA)
    }
    return vaddvq_f32(acc);                // Horizontal sum (AArch64 ADDV)
}
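
The loop above quietly drops trailing elements when n is not a multiple of 4. A sketch of the same kernel with a scalar tail so any n is handled (dot_product_any is a hypothetical name, AArch64 only):

```c
#if defined(__aarch64__)
#include <arm_neon.h>

float dot_product_any(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4)              /* vector body: 4 at a time */
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float sum = vaddvq_f32(acc);            /* horizontal add of 4 lanes */
    for (; i < n; i++)                      /* scalar tail: 0-3 elements */
        sum += a[i] * b[i];
    return sum;
}
#endif
```

The vector and scalar partial sums may round differently from a pure scalar loop; for most media/DSP workloads that is acceptable, but it matters if bit-exact results are required.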

// Mixing intrinsics and inline asm for timing (CNTVCT_EL0 not in arm_neon.h)
static inline uint64_t read_timer(void) {
    uint64_t t;
    asm volatile("mrs %0, cntvct_el0" : "=r"(t));
    return t;
}

Common Pitfalls & Debugging

Top 5 Inline Asm Mistakes:
  1. Missing "memory" clobber on barrier macros — compiler reorders C loads/stores around your barrier.
  2. Missing "cc" clobber when asm sets flags — compiler uses stale condition codes from before the asm block.
  3. Omitting volatile on asm with side effects — compiler may hoist/sink/eliminate the block.
  4. Relying on basic asm for ordering — basic asm declares no inputs, outputs, or clobbers, so even though it is implicitly volatile, the compiler may cache values in registers or move code across it.
  5. Wrong register width for AArch64 — writing %0 where %w0 is needed substitutes the X-register name, silently producing a 64-bit operation (or an assembler error) in what you meant to be a 32-bit instruction.
Key Insight: The "memory" clobber is a compiler-level ordering guarantee, not a hardware barrier. asm volatile("" : : : "memory") (a compiler fence) prevents the compiler from reordering loads/stores across it, but the CPU can still execute them out-of-order. For SMP correctness on ARM's weakly-ordered memory model, you need both: the compiler fence AND a hardware barrier instruction (dmb ish). The Linux smp_mb() macro provides both in a single asm volatile("dmb ish":::"memory") statement — the instruction is the hardware barrier, the clobber is the compiler barrier.

Case Study: Linux Kernel's Inline Assembly

Kernel • Production • Real-World
Linux arm64 Barrier Macros — From arch/arm64/include/asm/barrier.h

The Linux kernel's arm64 port contains over 200 inline assembly statements across barrier definitions, atomic operations, and context switching. The barrier macros demonstrate every constraint pattern covered in this article:

  • smp_mb(): asm volatile("dmb ish" ::: "memory") — hardware + compiler barrier combined. The "memory" clobber prevents GCC from reordering any C memory operation across this point, while dmb ish prevents the CPU from doing the same.
  • __cmpxchg_case_##sz: Template-generated LDXR/STXR loops with "=&r" early-clobber on the status register. Before ARMv8.1 LSE atomics, every atomic_cmpxchg() in the kernel expanded to this pattern.
  • cpu_switch_to(): The context switch (arch/arm64/kernel/entry.S) saves the callee-saved registers (x19–x28, x29/FP, SP) with STP; because C callers treat it as clobbering everything, the compiler never caches values across a task switch.
  • READ_ONCE/WRITE_ONCE: Use volatile casts (backed by compiler barriers where needed) to force the compiler to actually load/store the variable, preventing dead-store elimination or speculative reloading.

Key lesson: The kernel relies on inline asm not for performance but for correctness — expressing hardware semantics (barrier ordering, exclusive monitors, system register access) that C cannot represent. Every clobber and constraint is the difference between a working SMP system and a silent data-corruption bug.

History • Evolution
The GCC Extended Asm Origin Story

GCC's inline assembly syntax dates back to the 1980s, when Richard Stallman needed a way to embed VAX and 68000 instructions in C code for the GNU system. The original asm() was "basic" — just a string pasted into the output. The constraint system ("=r", "m", clobbers) was added in GCC 2.x to solve a critical problem: register allocation. Without constraints, the compiler had no idea which registers the assembly block used, leading to silent corruption when the compiler reused the same register for a C variable.

ARM support in GCC arrived in the early 1990s (arm-linux target), and the constraint letters were extended for ARM-specific needs. The %w0 / %x0 width modifiers for AArch64 were added with GCC 4.8 (2013) when AArch64 was first supported. Clang adopted GCC's exact syntax for compatibility, making inline asm one of the few areas where both compilers are fully interchangeable.

Hands-On Exercises

Exercise 1 (Beginner)
Constraint Practice: Bit Manipulation Wrappers

Write three inline assembly wrapper functions in C that use AArch64 instructions the compiler rarely emits:

  1. uint32_t reverse_bytes(uint32_t x) — use REV %w0, %w1 with "=r" output and "r" input constraints
  2. int popcount64(uint64_t x) — use FMOV d0, %1; CNT v0.8b, v0.8b; ADDV b0, v0.8b; FMOV %w0, s0 (moves to NEON, counts bits per byte, horizontal sum, moves back — remember to list v0 in the clobbers)
  3. uint64_t extract_bits(uint64_t x, int lsb, int width) — use UBFX %0, %1, %2, %3 with "i" constraints for the immediate operands

Verify: Compile with aarch64-linux-gnu-gcc -O2 -S and confirm the compiler emitted only the intended instructions for each function (no spills, no extra moves).

Exercise 2 (Intermediate)
Spinlock with Acquire/Release Semantics

Implement a ticket spinlock using inline assembly with correct memory ordering:

  1. Define a struct spinlock { uint32_t owner; uint32_t next; }
  2. spin_lock(): Use LDAXR + ADD + STXR to atomically increment next and get your ticket number. Then spin with LDAR on owner until it matches your ticket.
  3. spin_unlock(): Use STLR (store-release) to increment owner, releasing the next waiter.
  4. Mark all asm blocks volatile with "memory" clobber.

Bonus: Add WFE/SEV hints — WFE in the spin loop (saves power), SEV after unlock (wakes waiters).

Exercise 3 (Advanced)
Inline Asm vs Intrinsics Benchmark

Compare the performance and code quality of three approaches for a 256-element float array sum:

  1. Pure C loop: for (int i = 0; i < 256; i++) sum += arr[i]; compiled with -O2 -march=armv8-a
  2. NEON intrinsics: Use vld1q_f32 + vaddq_f32 with 4 accumulators, then vaddvq_f32 for horizontal reduction
  3. Inline assembly: Hand-written NEON using ld1 {v0.4s-v3.4s}, [%1], #64 with fadd accumulation

Measure: Use CNTVCT_EL0 (from the timer reading inline asm earlier) to capture minimum cycles over 10,000 iterations. Compare the -S output instruction counts. On modern Cortex-A cores with autovectorization, approach 1 should match approach 2 — demonstrating when inline asm is unnecessary.
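
As a starting point for the measurement harness, a hedged sketch of the minimum-of-N timing loop around the plain-C version. The names cycles and min_ticks are hypothetical, and this assumes EL0 access to CNTVCT_EL0 (which Linux and macOS enable):

```c
#include <stdint.h>

#if defined(__aarch64__)
static inline uint64_t cycles(void) {
    uint64_t t;
    /* ISB prevents speculative execution from reading the counter early */
    asm volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(t));
    return t;
}

/* Approach 1: plain C loop the compiler is free to autovectorize */
static float sum256_c(const float *arr) {
    float s = 0.0f;
    for (int i = 0; i < 256; i++) s += arr[i];
    return s;
}

/* Minimum over many runs filters out interrupts, migrations, and cold
 * caches. Result is in generic-timer ticks (CNTFRQ_EL0 Hz), NOT core
 * clock cycles — convert or compare ratios, don't read them as cycles. */
static uint64_t min_ticks(int iters) {
    static float arr[256];              /* zero-initialized input */
    uint64_t best = ~0ull;
    for (int i = 0; i < iters; i++) {
        uint64_t t0 = cycles();
        volatile float sink = sum256_c(arr);  /* volatile defeats DCE */
        (void)sink;
        uint64_t dt = cycles() - t0;
        if (dt < best) best = dt;
    }
    return best;
}
#endif
```

Swap sum256_c for the intrinsics and inline-asm variants to complete the comparison; keep the harness identical so only the kernel under test changes.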

Conclusion & Next Steps

We covered inline assembly for ARM: basic vs extended asm syntax, output/input/clobber constraints with =r, +r, =&r, r, i, m and cc/memory, atomic LDXR/STXR loops and LSE CAS, memory barrier macros, LDAR/STLR acquire-release patterns, and NEON intrinsics as a higher-level alternative. Through the Linux kernel case study, we saw how production code uses every constraint pattern for correctness on SMP systems — and through the exercises, you practiced writing constraint-correct wrappers, implementing a ticket spinlock, and benchmarking inline asm against compiler autovectorization.

Next in the Series

In Part 18: Performance Profiling & Micro-Optimization, we move to measurement: PMU hardware counters, perf stat, pipeline hazard analysis, instruction throughput vs latency tables, and micro-benchmarking methodology for assembly routines.
