ARM Assembly Part 17: Inline Assembly, GCC/Clang & C Interop

Introduction & When to Use Inline Asm

                        
                        Series Overview: Part 17 of 28. After mastering standalone assembly files and platform ABIs in Parts 14–16, we now integrate assembly directly into C/C++ source. Inline asm is the tool of choice for Linux kernel barrier macros, RTOS context-switch snippets, and performance-critical inner loops where the compiler's autovectoriser falls short.
                    

ARM Assembly Mastery

Your 28-step learning path • Currently on Step 17

Inline Assembly, GCC/Clang & C Interop

Constraints, clobbers, compiler interaction

You Are Here

Performance Profiling & Micro-Optimization

Pipeline hazards, PMU, benchmarking

Reverse Engineering & ARM Binary Analysis

ELF, disassembly, CFR, iOS/Android quirks

Building a Bare-Metal OS Kernel

Bootloader, UART, scheduler, context switch

ARM Microarchitecture Deep Dive

OOO pipelines, reorder buffers, branch predict

Virtualization Extensions

EL2 hypervisor, stage-2 translation, KVM

Debugging & Tooling Ecosystem

GDB, OpenOCD/JTAG, ETM/ITM, QEMU

Linkers, Loaders & Binary Format Internals

ELF deep dive, relocations, PIC, crt0

Cross-Compilation & Build Systems

GCC/Clang toolchains, CMake, firmware gen

ARM in Real Systems

Android, FreeRTOS/Zephyr, U-Boot, TF-A

Security Research & Exploitation

ASLR, PAC attacks, ROP/JOP, kernel exploit

Emerging ARMv9 & Future Directions

MTE, SME, confidential compute, AI accel

Use inline assembly when: you need a single hardware instruction the compiler won't emit (e.g., clz, rbit, dmb); you need precise register placement without function call overhead; or you need compiler-visible effects on memory ordering. Avoid inline asm for anything longer than ~10 instructions — a separate .S file is cleaner, testable, and debuggable.

Decision flowchart for when to use inline assembly versus separate files or intrinsics — Decision flowchart: use inline asm for single hardware instructions, separate .S files for longer routines, and intrinsics for NEON/SVE

                        
                        Real-World Analogy — Margin Notes in a Book: Think of inline assembly like handwritten margin notes in a printed book. The book (C code) follows grammar rules and is typeset by the publisher (compiler). Occasionally you scribble a precise note in the margin — a formula, a shortcut, or a correction the publisher can't express. The note is powerful but fragile: if the publisher re-typesets (compiler recompiles with different optimizations), your margin note might end up on the wrong page. Constraints (=r, "memory") are like numbered sticky tabs that tell the publisher exactly which page and line your note references, so it survives any re-layout. Without them, your handwritten wisdom ends up next to the wrong paragraph.
                    

Basic asm() Syntax

// Basic asm — no operands, just instruction string
// Primarily for assembly directives or ultra-simple one-liners
// GCC may reorder or eliminate basic asm blocks — rarely what you want

void disable_irq(void) {
    asm("msr daifset, #2");    // Disable IRQ (sets DAIF.I)
}

void enable_irq(void) {
    asm("msr daifclr, #2");    // Enable IRQ (clears DAIF.I)
}

// Basic asm with multiple instructions (semicolon or \n separated)
void wfi_sleep(void) {
    asm("dsb sy\n\t"          // Data sync barrier
        "wfi");               // Wait for interrupt
}

Extended asm() — Operands & Constraints

Extended asm has the form: asm [volatile] ("template" : [outputs] : [inputs] : [clobbers]). The template uses %0, %1, … as placeholders for operands listed in order: outputs first, then inputs. On AArch64 use %w0 for 32-bit (W-register) access and %x0 for explicit 64-bit.

Extended asm operand and constraint syntax showing outputs, inputs, and clobbers — Extended asm() syntax layout: template string with %0/%1 placeholders mapped to output constraints, input constraints, and clobber lists

Output Constraints

#include <stdint.h>

// "=r" — any general-purpose register, write-only output
static inline uint64_t read_cntvct(void) {
    uint64_t val;
    asm volatile("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

// "+r" — read-write register (input AND output to same register)
static inline uint32_t bit_reverse(uint32_t x) {
    asm volatile("rbit %w0, %w0" : "+r"(x));
    return x;
}

// "=&r" — early-clobber: output register must not overlap any input
// Required when output may be written before all inputs are consumed
static inline void str_exclusive(uint32_t val, volatile uint32_t *addr,
                                  int *status) {
    asm volatile("stxr  %w0, %w2, [%1]"
                 : "=&r"(*status)          // =& prevents overlap with %1/%2
                 : "r"(addr), "r"(val)
                 : "memory");
}

Input Constraints

// "r" — any GP register (input)
// "i" — immediate integer (compile-time constant only)
// "m" — memory operand (produces address for asm use)
// "n" — numeric constant (immediate, no register allocated)

static inline int count_leading_zeros(uint64_t x) {
    int result;
    asm("clz %0, %1" : "=r"(result) : "r"(x));
    return result;
}

// Using "i" constraint for immediate (shift amount must be constant)
static inline uint32_t rotate_right(uint32_t x, int shift) {
    // Compiler substitutes literal integer into template at compile time
    uint32_t out;
    asm("ror %w0, %w1, #%2"
        : "=r"(out)
        : "r"(x), "i"(shift));
    return out;
}

// Tied input (digit = output index): reuse same register
static inline uint32_t set_bit(uint32_t val, uint32_t bit) {
    asm("orr %w0, %w0, %w1"
        : "+r"(val)            // %0: read-write (tied input)
        : "r"(1u << bit));
    return val;
}

Clobbers & Memory

// Clobbers tell the compiler which registers/flags the asm modifies
// WITHOUT expressing them as outputs — so compiler saves/restores them

// "cc" clobber: asm modifies condition flags
static inline int signed_saturate(int x, int limit) {
    int result;
    asm volatile(
        "cmp  %1, %2\n\t"
        "csel %0, %1, %2, lt"    // Select min(x, limit)
        : "=r"(result)
        : "r"(x), "r"(limit)
        : "cc");                  // CMP modifies NZCV flags
    return result;
}

// "memory" clobber: asm reads/writes memory not named in operands
// Forces compiler to flush/reload all live values around the asm
#define mb()  asm volatile("dmb ish"   : : : "memory")
#define wmb() asm volatile("dmb ishst" : : : "memory")
#define rmb() asm volatile("dmb ishld" : : : "memory")
#define isb() asm volatile("isb"       : : : "memory")

// Combine register + memory clobbers: context switch saves x19-x28 etc.
void switch_context(uint64_t *save_sp, uint64_t *load_sp) {
    asm volatile(
        "stp  x19, x20, [%0, #0]\n\t"
        "stp  x21, x22, [%0, #16]\n\t"
        "stp  x23, x24, [%0, #32]\n\t"
        "mov  %0, sp\n\t"               // Save current SP
        "mov  sp, %1\n\t"               // Load new SP
        "ldp  x19, x20, [sp, #0]\n\t"
        "ldp  x21, x22, [sp, #16]\n\t"
        "ldp  x23, x24, [sp, #32]"
        :
        : "r"(save_sp), "r"(*load_sp)
        : "x19","x20","x21","x22","x23","x24","memory","cc"
    );
}

Memory Barriers via Inline Asm

ARM memory barrier instructions DMB, DSB, ISB and their ordering guarantees — ARM memory barrier hierarchy: DMB orders memory accesses, DSB ensures completion before next instruction, ISB flushes the pipeline

// Linux kernel barrier pattern (arch/arm64/include/asm/barrier.h style)
#define smp_mb()     asm volatile("dmb ish"   : : : "memory")
#define smp_wmb()    asm volatile("dmb ishst" : : : "memory")
#define smp_rmb()    asm volatile("dmb ishld" : : : "memory")
#define dma_wmb()    asm volatile("dmb oshst" : : : "memory")  // outer shareable

// READ_ONCE and WRITE_ONCE equivalents (prevent compiler reordering)
#define READ_ONCE(x) \
    ({ typeof(x) __val; asm volatile("" : "=r"(__val) : "0"(x)); __val; })
#define WRITE_ONCE(x, val) \
    asm volatile("" : : "r"((typeof(x))(val)) : "memory")

// Acquire/release atomics using LDAR/STLR
static inline uint64_t load_acquire(const volatile uint64_t *addr) {
    uint64_t val;
    asm volatile("ldar %0, [%1]" : "=r"(val) : "r"(addr) : "memory");
    return val;
}

static inline void store_release(volatile uint64_t *addr, uint64_t val) {
    asm volatile("stlr %1, [%0]" : : "r"(addr), "r"(val) : "memory");
}

Atomic Operations (LDXR/STXR)

LDXR/STXR exclusive monitor flow for atomic compare-and-swap operations — Exclusive monitor CAS flow: LDXR acquires exclusive access, STXR attempts conditional store — retry loop on contention until success

// CAS (Compare And Swap) using exclusive monitor
// AArch64 also has ARMv8.1 LSE atomics: CAS, SWP, LDADD (preferred where available)
static inline int atomic_cmpxchg32(volatile int *ptr, int old_val, int new_val) {
    int result;
    int tmp;
    asm volatile(
    "1: ldxr  %w0, [%2]\n\t"       // Load exclusive: result = *ptr
        "cmp   %w0, %w3\n\t"       // result == old_val?
        "b.ne  2f\n\t"             // No: bail out
        "stxr  %w1, %w4, [%2]\n\t" // Yes: try store exclusive
        "cbnz  %w1, 1b\n\t"        // Retry if store exclusive failed
    "2: clrex"                     // Clear exclusive if bailed out
        : "=&r"(result), "=&r"(tmp)
        : "r"(ptr), "r"(old_val), "r"(new_val)
        : "cc", "memory"
    );
    return result;                  // Returns old value (Linux cmpxchg convention)
}

// ARMv8.1 LSE: preferred on Cortex-A55+, Neoverse, Apple Silicon
// Requires -march=armv8.1-a+lse or target_feature attribute
#ifdef __ARM_FEATURE_ATOMICS
static inline int atomic_cmpxchg32_lse(volatile int *ptr, int old_val, int new_val) {
    asm volatile("cas %w0, %w2, [%1]"
                 : "+r"(old_val)
                 : "r"(ptr), "r"(new_val)
                 : "memory");
    return old_val;
}
#endif

NEON & Compiler Intrinsics from C

NEON intrinsics mapping from C function calls to SIMD vector register operations — NEON intrinsics map C-level function calls directly to SIMD vector instructions operating on 128-bit V registers

// Header: <arm_neon.h> — included automatically with -mfpu=neon or -march=armv8-a
#include <arm_neon.h>
#include <stdint.h>

// Horizontal sum of 4 floats using intrinsics (no inline asm needed)
float sum4(const float *arr) {
    float32x4_t v = vld1q_f32(arr);       // Load 4 floats into NEON register
    float32x2_t lo = vget_low_f32(v);
    float32x2_t hi = vget_high_f32(v);
    float32x2_t sum = vpadd_f32(lo, hi);   // Pairwise add
    sum = vpadd_f32(sum, sum);             // Final pair add
    return vget_lane_f32(sum, 0);
}

// Dot product using FMLA (fused multiply-accumulate)
float dot_product(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i < n - 3; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);      // acc += va * vb (FMLA)
    }
    return vaddvq_f32(acc);                // Horizontal sum (ARMv8.1 addv)
}

// Mixing intrinsics and inline asm for timing (CNTVCT_EL0 not in arm_neon.h)
#include <time.h>
static inline uint64_t read_timer(void) {
    uint64_t t;
    asm volatile("mrs %0, cntvct_el0" : "=r"(t));
    return t;
}

AArch32 vs AArch64 NEON Compiler Flags

                        
                        AArch32 NEON: -mfpu=neon-fp-armv8 — On AArch64, NEON/Advanced SIMD is always available and no special compiler flag is needed. On AArch32 (32-bit ARM mode), you must explicitly enable NEON through the -mfpu flag. The correct value depends on your target architecture:
                        -mfpu=neon — ARMv7-A NEON (no FP16, no FMA, VFPv3)
-mfpu=neon-vfpv4 — ARMv7-A NEON with VFPv4 (adds FMA: VFMA.F32)
-mfpu=neon-fp-armv8 — ARMv8-A AArch32 NEON (adds FP16 conversion, rounding, VMAXNM/VMINNM). This is what you use on Cortex-A53/A72/A75/A76 when compiling 32-bit code.
-mfpu=crypto-neon-fp-armv8 — Same as above plus ARMv8 cryptography extensions (AES, SHA)

                    

# AArch32 compilation with NEON FMA support (Cortex-A53 in 32-bit mode)
arm-linux-gnueabihf-gcc -O2 -march=armv8-a -mfpu=neon-fp-armv8 -mfloat-abi=hard \
    -c neon_dot.c -o neon_dot.o

# The intrinsic vfmaq_f32() maps to VFMA.F32 on AArch32:
#   float32x4_t vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c);
#   AArch32 → VFMA.F32  q0, q1, q2   (VFPv4+ required)
#   AArch64 → FMLA      v0.4s, v1.4s, v2.4s

# AArch64 — no -mfpu needed; NEON is mandatory
aarch64-linux-gnu-gcc -O2 -march=armv8-a -c neon_dot.c -o neon_dot.o

                        
                        Key Insight: The -mfloat-abi=hard flag is equally important on AArch32 — it tells GCC to pass floating-point arguments in VFP/NEON registers (S0–S15, D0–D7) instead of integer registers. Without it, every NEON intrinsic call wastes cycles moving data between integer and SIMD registers. On AArch64, the hard-float ABI is the only option — there is no soft-float AArch64 ABI.
                    

Common Pitfalls & Debugging

                        
                        Top 5 Inline Asm Mistakes:
                        Missing "memory" clobber on barrier macros — compiler reorders C loads/stores around your barrier.
Missing "cc" clobber when asm sets flags — compiler uses stale condition codes from before the asm block.
Omitting volatile on asm with side effects — compiler may hoist/sink/eliminate the block.
Using basic asm for kernel code — GCC is allowed to delete unreferenced basic asm blocks in optimized builds.
Wrong register width for AArch64 — using %0 in a 32-bit instruction on a 64-bit variable silently generates W-register form or fails to assemble.

                    

                        
                        Key Insight: The "memory" clobber is a compiler-level ordering guarantee, not a hardware barrier. asm volatile("" : : : "memory") (a compiler fence) prevents the compiler from reordering loads/stores across it, but the CPU can still execute them out-of-order. For SMP correctness on ARM's weakly-ordered memory model, you need both: the compiler fence AND a hardware barrier instruction (dmb ish). The Linux smp_mb() macro provides both in a single asm volatile("dmb ish":::"memory") statement — the instruction is the hardware barrier, the clobber is the compiler barrier.
                    

Case Study: Linux Kernel's Inline Assembly

KernelProductionReal-World

Linux arm64 Barrier Macros — From `arch/arm64/include/asm/barrier.h`

The Linux kernel's arm64 port contains over 200 inline assembly statements across barrier definitions, atomic operations, and context switching. The barrier macros demonstrate every constraint pattern covered in this article:

smp_mb(): asm volatile("dmb ish" ::: "memory") — hardware + compiler barrier combined. The "memory" clobber prevents GCC from reordering any C memory operation across this point, while dmb ish prevents the CPU from doing the same.
__cmpxchg_case_##sz: Template-generated LDXR/STXR loops with "=&r" early-clobber on the status register. Before ARMv8.1 LSE atomics, every atomic_cmpxchg() in the kernel expanded to this pattern.
cpu_switch_to() in entry.S calls to inline asm: Context switch saves callee-saved registers (x19–x28, x29/FP, SP) using STP with explicit register clobber lists, ensuring the compiler doesn't cache any values across a task switch.
READ_ONCE/WRITE_ONCE: Uses an empty inline asm statement with constraints to force the compiler to actually load/store the variable, preventing dead-store elimination or speculative reloading.

Key lesson: The kernel relies on inline asm not for performance but for correctness — expressing hardware semantics (barrier ordering, exclusive monitors, system register access) that C cannot represent. Every clobber and constraint is the difference between a working SMP system and a silent data-corruption bug.

HistoryEvolution

The GCC Extended Asm Origin Story

GCC's inline assembly syntax dates back to the 1980s, when Richard Stallman needed a way to embed PDP-11 and VAX instructions in C code for the GNU system. The original asm() was "basic" — just a string pasted into the output. The constraint system ("=r", "m", clobbers) was added in GCC 2.x to solve a critical problem: register allocation. Without constraints, the compiler had no idea which registers the assembly block used, leading to silent corruption when the compiler reused the same register for a C variable.

ARM support in GCC arrived in the early 1990s (arm-linux target), and the constraint letters were extended for ARM-specific needs. The %w0 / %x0 width modifiers for AArch64 were added with GCC 4.8 (2013) when AArch64 was first supported. Clang adopted GCC's exact syntax for compatibility, making inline asm one of the few areas where both compilers are fully interchangeable.

Hands-On Exercises

Exercise 1Beginner

Constraint Practice: Bit Manipulation Wrappers

Write three inline assembly wrapper functions in C that use AArch64 instructions the compiler rarely emits:

uint32_t reverse_bytes(uint32_t x) — use REV %w0, %w1 with "=r" output and "r" input constraints
int popcount64(uint64_t x) — use FMOV d0, %1; CNT v0.8b, v0.8b; ADDV b0, v0.8b; FMOV %w0, s0 (moves to NEON, counts bits per byte, horizontal sum, moves back)
uint64_t extract_bits(uint64_t x, int lsb, int width) — use UBFX %0, %1, %2, %3 with "i" constraints for the immediate operands

Verify: Compile with aarch64-linux-gnu-gcc -O2 -S and confirm the compiler emitted exactly one instruction per function (no spills, no extra moves).

Exercise 2Intermediate

Spinlock with Acquire/Release Semantics

Implement a ticket spinlock using inline assembly with correct memory ordering:

Define a struct spinlock { uint32_t owner; uint32_t next; }
spin_lock(): Use LDAXR + ADD + STXR to atomically increment next and get your ticket number. Then spin with LDAXRH on owner until it matches your ticket.
spin_unlock(): Use STLR (store-release) to increment owner, releasing the next waiter.
Mark all asm blocks volatile with "memory" clobber.

Bonus: Add WFE/SEV hints — WFE in the spin loop (saves power), SEV after unlock (wakes waiters).

Exercise 3Advanced

Inline Asm vs Intrinsics Benchmark

Compare the performance and code quality of three approaches for a 256-element float array sum:

Pure C loop: for (int i = 0; i < 256; i++) sum += arr[i]; compiled with -O2 -march=armv8-a
NEON intrinsics: Use vld1q_f32 + vaddq_f32 with 4 accumulators, then vaddvq_f32 for horizontal reduction
Inline assembly: Hand-written NEON using ld1 {v0.4s-v3.4s}, [%1], #64 with fadd accumulation

Measure: Use CNTVCT_EL0 (from the timer reading inline asm earlier) to capture minimum cycles over 10,000 iterations. Compare the -S output instruction counts. On modern Cortex-A cores with autovectorization, approach 1 should match approach 2 — demonstrating when inline asm is unnecessary.

Conclusion & Next Steps

We covered inline assembly for ARM: basic vs extended asm syntax, output/input/clobber constraints with =r, +r, =&r, r, i, m and cc/memory, atomic LDXR/STXR loops and LSE CAS, memory barrier macros, LDAR/STLR acquire-release patterns, and NEON intrinsics as a higher-level alternative. Through the Linux kernel case study, we saw how production code uses every constraint pattern for correctness on SMP systems — and through the exercises, you practiced writing constraint-correct wrappers, implementing a ticket spinlock, and benchmarking inline asm against compiler autovectorization.

Cookie Consent

ARM Assembly Part 17: Inline Assembly, GCC/Clang & C Interop

Table of Contents

Introduction & When to Use Inline Asm

ARM Assembly Mastery

Architecture History & Core Concepts

ARM32 Instruction Set Fundamentals

AArch64 Registers, Addressing & Data Movement

Arithmetic, Logic & Bit Manipulation

Branching, Loops & Conditional Execution

Stack, Subroutines & AAPCS

Memory Model, Caches & Barriers

NEON & Advanced SIMD

SVE & SVE2 Scalable Vector Extensions

Floating-Point & VFP Instructions

Exception Levels, Interrupts & Vector Tables

MMU, Page Tables & Virtual Memory

TrustZone & ARM Security Extensions

Cortex-M Assembly & Bare-Metal Embedded

Cortex-A System Programming & Boot

Apple Silicon & macOS ABI