
USB Part 15: Bare-Metal USB

March 31, 2026 Wasil Zafar ~18 min read

Write a USB device driver from scratch without TinyUSB — understand the STM32 USB FSDEV hardware registers, endpoint FIFO management, interrupt-driven enumeration, control endpoint state machine, and what TinyUSB abstracts away from you.

Table of Contents

  1. Why Write Bare-Metal USB?
  2. STM32 FSDEV Hardware Overview
  3. Packet Memory (PMA) Management
  4. USB Interrupts on STM32
  5. Control Endpoint 0 State Machine
  6. Implementing GetDescriptor
  7. SetAddress Procedure
  8. Bulk Endpoint Implementation
  9. What TinyUSB Gives You
  10. Practical Exercises
  11. Bare-Metal USB Design Generator
  12. Conclusion & Next Steps
Series Context: This is Part 15 of 17. You should have completed Parts 1–14, with solid TinyUSB experience including custom vendor class drivers (Part 14). This article deliberately removes the TinyUSB safety net — the goal is understanding, not production firmware. For production work, continue using TinyUSB.

USB Development Mastery

Your 17-step learning path • Currently on Step 15

  1. USB Fundamentals: USB system architecture, transfer types, host/device model, protocol stack
  2. Electrical & Hardware Layer: D+/D- signalling, pull-ups, connectors, USB-C, STM32 USB peripherals
  3. Protocol & Enumeration: Enumeration sequence, USB packets, descriptors, endpoint concepts
  4. USB Device Classes: HID, CDC, MSC, MIDI, Audio, composite devices, vendor class
  5. TinyUSB Deep Dive: Stack architecture, execution model, STM32 integration, descriptor callbacks
  6. CDC Virtual COM Port: CDC class, bulk transfers, printf over USB, baud rate handling
  7. HID Keyboard & Mouse: HID descriptors, report format, keyboard/mouse/gamepad implementation
  8. USB Mass Storage: MSC class, SCSI commands, FATFS integration, RAM disk
  9. Composite Devices: Multiple classes, IAD descriptor, CDC+HID, CDC+MSC
  10. Debugging USB: Wireshark capture, protocol analyser, enumeration debugging, common failures
  11. RTOS + USB Integration: FreeRTOS + TinyUSB, task priorities, thread-safe communication
  12. Advanced USB Topics: Host mode, OTG, isochronous, USB audio, USB video
  13. Performance & Optimisation: DMA, zero-copy buffers, throughput maximisation, latency tuning
  14. Custom USB Class Drivers: Vendor class, WinUSB, binary protocol, Python libusb host, reusable driver
  15. Bare-Metal USB (You Are Here): STM32 FSDEV registers, PMA, EP0 state machine, USB interrupts from scratch
  16. Security in USB: BadUSB attacks, device authentication, secure firmware, USB firewall
  17. USB Hardware Design: PCB layout, differential pairs, impedance matching, EMI, USB-C PD

Why Write Bare-Metal USB?

If TinyUSB handles USB correctly on dozens of MCU families with a clean, well-tested API, why would you ever write USB firmware from scratch? There are four legitimate reasons — and "to prove you can" is not one of them:

  • Educational value: Reading TinyUSB source code is one thing. Writing a working EP0 state machine yourself is another. The permanent understanding of what actually happens when your STM32 receives a SETUP packet is worth hours of study.
  • Extreme resource constraints: TinyUSB requires a minimum of roughly 8–12 KB of flash. On a USB-capable part such as the STM32F042F4, with 16 KB of flash and an existing application, there may be no room. A minimal bare-metal USB HID driver can enumerate in under 3 KB.
  • Custom timing requirements: Some applications need to service USB interrupts with precise, guaranteed latency that a full USB stack cannot provide because the stack has its own internal state machine overhead.
  • Debugging TinyUSB: When TinyUSB fails to enumerate on a custom board and you suspect a hardware issue, being able to reduce the problem to a 100-line bare-metal test that directly controls USB_EPnR registers isolates firmware issues from hardware issues instantly.

Comparison: Bare-Metal vs USB Stacks

Aspect | Bare-Metal (This Article) | TinyUSB | STM32 HAL USB Middleware
Flash footprint | 2–5 KB (minimal HID) | 8–15 KB | 20–50 KB
Code flexibility | Total: you control everything | High, via callbacks and class drivers | Medium: CubeMX-generated skeleton
Development time | Days to weeks for a working device | Hours for standard class | Hours with CubeMX
Correctness risk | Very high: every byte is your responsibility | Low: well-tested | Medium: known bugs in some ST versions
Multi-MCU portability | None: completely MCU-specific | Excellent: 30+ MCU families | STM32 only
Best for | Learning, size-critical, custom timing | All new projects | CubeMX-based existing projects
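Warning: The code in this article targets the 16-bit STM32 FSDEV peripheral (STM32F0, STM32F1, STM32F3, STM32L0 families). The USB peripheral on STM32G0 is a related design with 32-bit registers and a different packet-memory layout, so the register code here needs adapting before it will work there. None of it applies to the STM32 OTG_FS or OTG_HS peripherals (STM32F4, F7, H7, etc.), which have a completely different register architecture. Check your reference manual before applying any register-level code.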

STM32 FSDEV Hardware Overview

The STM32 USB Full Speed Device (FSDEV) peripheral is not the same as the OTG_FS peripheral. FSDEV is a simpler, dedicated device-only USB controller found in the cost-optimised STM32 families. Key hardware facts:

  • Device-only: No host mode, no OTG. This is a peripheral device — it appears as a USB device when connected to a host. It cannot act as a USB host.
  • Full Speed only: 12 Mbit/s. No High Speed support.
  • Dedicated packet memory (PMA): 512 bytes on STM32F1 (shared with CAN), 1024 bytes on F0 and L0 parts; confirm the size in your reference manual. This SRAM is shared between the CPU and the USB hardware and must be carefully partitioned.
  • 8 configurable endpoints: EP0 through EP7. Each can be configured as Control, Bulk, Interrupt, or Isochronous.
  • No DMA engine: Unlike OTG_HS, FSDEV has no DMA. All data moves between PMA and system SRAM via CPU-executed copy loops.

Register Map Overview

/* STM32 FSDEV register map (base address varies by family) */
/* STM32F103: USB base = 0x40005C00, PMA base = 0x40006000 */
/* STM32F042: USB base = 0x40005C00, PMA base = 0x40006000 */
/* Other families may place the PMA elsewhere: take both addresses */
/* from the memory map in the reference manual.                    */
#define USB_BASE   0x40005C00UL

/* --- Per-endpoint registers: USB_EPnR (n = 0..7) --- */
/* Each EPnR is 16-bit wide, at address USB_BASE + n*4   */
#define USB_EP0R   (*(volatile uint16_t *)(USB_BASE + 0x00))
#define USB_EP1R   (*(volatile uint16_t *)(USB_BASE + 0x04))
#define USB_EP2R   (*(volatile uint16_t *)(USB_BASE + 0x08))
#define USB_EP3R   (*(volatile uint16_t *)(USB_BASE + 0x0C))
/* ... EP4R through EP7R follow at +0x10, +0x14, +0x18, +0x1C */

/* --- Global control registers --- */
#define USB_CNTR   (*(volatile uint16_t *)(USB_BASE + 0x40))  /* Control register */
#define USB_ISTR   (*(volatile uint16_t *)(USB_BASE + 0x44))  /* Interrupt status register */
#define USB_FNR    (*(volatile uint16_t *)(USB_BASE + 0x48))  /* Frame number register */
#define USB_DADDR  (*(volatile uint16_t *)(USB_BASE + 0x4C))  /* Device address register */
#define USB_BTABLE (*(volatile uint16_t *)(USB_BASE + 0x50))  /* Buffer table address */

/* --- EPnR bit definitions --- */
#define EP_CTR_RX    (1u << 15)  /* Correct Transfer RX: write 0 to clear, writing 1 has no effect */
#define EP_DTOG_RX   (1u << 14)  /* Data toggle RX */
#define EP_STAT_RX   (3u << 12)  /* Status bits RX: DISABLED/STALL/NAK/VALID */
#define EP_SETUP     (1u << 11)  /* Set by HW on SETUP packet reception */
#define EP_TYPE      (3u <<  9)  /* Endpoint type: BULK/CONTROL/ISO/INTERRUPT */
#define EP_KIND      (1u <<  8)  /* Endpoint kind (DBL_BUF for bulk, STATUS_OUT for ctrl) */
#define EP_CTR_TX    (1u <<  7)  /* Correct Transfer TX: write 0 to clear, writing 1 has no effect */
#define EP_DTOG_TX   (1u <<  6)  /* Data toggle TX */
#define EP_STAT_TX   (3u <<  4)  /* Status bits TX: DISABLED/STALL/NAK/VALID */
#define EP_ADDR      (0xFu)      /* Endpoint address (lower 4 bits) */

/* EP_STAT values (used in both STAT_TX and STAT_RX fields) */
#define EP_STAT_DISABLED  0x0
#define EP_STAT_STALL     0x1
#define EP_STAT_NAK       0x2
#define EP_STAT_VALID     0x3

/* EP_TYPE values */
#define EP_TYPE_BULK      0x0
#define EP_TYPE_CONTROL   0x1
#define EP_TYPE_ISO       0x2
#define EP_TYPE_INTERRUPT 0x3

/* USB_CNTR bit definitions */
#define CNTR_CTRM    (1u << 15)  /* Correct transfer interrupt mask */
#define CNTR_ERRM    (1u << 13)  /* Error interrupt mask */
#define CNTR_SOFM    (1u << 9)   /* Start of frame interrupt mask */
#define CNTR_RESETM  (1u << 10)  /* USB reset interrupt mask */
#define CNTR_SUSPM   (1u << 11)  /* Suspend interrupt mask */
#define CNTR_WKUPM   (1u << 12)  /* Wakeup interrupt mask */
#define CNTR_FRES    (1u <<  0)  /* Force USB reset */
#define CNTR_PDWN    (1u <<  1)  /* Power down */

/* USB_ISTR bit definitions */
#define ISTR_CTR     (1u << 15)  /* Correct transfer */
#define ISTR_RESET   (1u << 10)  /* USB reset */
#define ISTR_SUSP    (1u << 11)  /* Suspend */
#define ISTR_WKUP    (1u << 12)  /* Wakeup */
#define ISTR_SOF     (1u <<  9)  /* Start of frame */
#define ISTR_ERR     (1u << 13)  /* Error */
#define ISTR_DIR     (1u <<  4)  /* Direction of last correct transfer */
#define ISTR_EP_ID   (0xFu)      /* Endpoint ID of last correct transfer */

/* USB_DADDR bit definitions */
#define DADDR_EF     (1u <<  7)  /* USB enable function */
#define DADDR_ADD    (0x7Fu)     /* Device address (7 bits) */
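
The EPnR fields listed above do not all behave like ordinary read/write bits, which is the source of most bare-metal FSDEV bugs. CTR_RX and CTR_TX are cleared by writing 0 and ignore a written 1; the DTOG and STAT fields flip when 1 is written and ignore a written 0; only EP_TYPE, EP_KIND and the address field are plain read/write. The following host-side sketch models that behaviour so the write rules can be checked on a PC. The function names `epnr_hw_write` and `epnr_value_for_stat_tx` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Field masks, matching the EPnR bit positions defined above */
#define M_CTR     0x8080u  /* CTR_RX | CTR_TX: write 0 clears, write 1 keeps    */
#define M_TOGGLE  0x7070u  /* DTOG_RX, STAT_RX, DTOG_TX, STAT_TX: write 1 flips */
#define M_RW      0x070Fu  /* EP_TYPE, EP_KIND, address: plain read/write       */
#define M_SETUP   0x0800u  /* SETUP: read-only                                  */
#define M_STAT_TX 0x0030u

/* Model (an assumption for illustration) of the value the register holds
   after the CPU writes `written` while it currently holds `current`. */
uint16_t epnr_hw_write(uint16_t current, uint16_t written)
{
    uint16_t rw     = written & M_RW;
    uint16_t toggle = (uint16_t)((current ^ written) & M_TOGGLE); /* 1 flips, 0 keeps */
    uint16_t ctr    = (uint16_t)(current & written & M_CTR);      /* 0 clears, 1 keeps */
    uint16_t setup  = current & M_SETUP;
    return (uint16_t)(rw | toggle | ctr | setup);
}

/* Value to write so that only STAT_TX changes, ending up equal to `stat` (0..3). */
uint16_t epnr_value_for_stat_tx(uint16_t current, uint16_t stat)
{
    uint16_t v = current & (M_RW | M_STAT_TX); /* other toggle bits written as 0  */
    v ^= (uint16_t)(stat << 4);                /* current XOR desired = bits to flip */
    v |= M_CTR;                                /* 1 keeps any pending CTR flag    */
    return v;
}
```

With a register value of 0xB220 (CTR_RX pending, STAT_RX = VALID, control type, STAT_TX = NAK), the safe write produces 0xB230: only STAT_TX changed. Writing 0xB230 directly, as one would for a normal register, leaves 0x8210 in the model: STAT_RX drops to DISABLED and STAT_TX lands on STALL.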

Packet Memory (PMA) Management

The Packet Memory Area (PMA) is the most confusing aspect of the STM32 FSDEV peripheral. It is a block of dedicated SRAM that is accessible by both the USB hardware and the CPU, but with a peculiar access constraint: it is 16 bits wide. On the STM32F1 (and some older F3 parts) each 16-bit PMA word sits in its own 32-bit slot of CPU address space, so the CPU sees data only in the lower 16 bits of every slot and a 64-byte PMA buffer spans 128 bytes of CPU address space. On F0 and L0 parts the PMA is contiguous and is accessed with plain 16-bit reads and writes. The code in this section uses the F1-style layout.
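
To make the address arithmetic concrete, here is a small sketch that converts a PMA byte offset (the value the USB hardware uses) into the address the CPU must use. The `stride` parameter is an assumption of this example: 2 for the layout where each 16-bit word occupies a 32-bit slot, 1 for contiguous PMA. The base address constant is the F0/F1 value and would differ on other families.

```c
#include <stdint.h>

#define PMA_BASE_ADDR 0x40006000UL  /* assumed F0/F1-style PMA base address */

/* CPU address of the 16-bit word that holds PMA byte offset `pma_off`. */
uint32_t pma_cpu_addr(uint16_t pma_off, uint32_t stride)
{
    uint32_t word_aligned = (uint32_t)pma_off & ~(uint32_t)1u; /* PMA is 16-bit wide */
    return (uint32_t)(PMA_BASE_ADDR + word_aligned * stride);
}

/* Number of CPU address bytes spanned by a PMA buffer of `len_bytes`. */
uint32_t pma_cpu_span(uint16_t len_bytes, uint32_t stride)
{
    return (uint32_t)len_bytes * stride;
}
```

For example, PMA offset 0x40 maps to CPU address 0x40006080 with stride 2 and to 0x40006040 with stride 1, and a 64-byte buffer spans 128 or 64 bytes of CPU address space respectively.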

Buffer Descriptor Table (BDT)

The PMA begins with the Buffer Descriptor Table. The BTABLE register points to its start (usually 0x0000, meaning the BDT is at the beginning of PMA). The BDT contains one 8-byte entry per endpoint (up to 8 endpoints = 64 bytes for the BDT itself):

/*
 * PMA layout (total: 512 bytes on typical FSDEV):
 *
 * Offset 0x000: Buffer Descriptor Table (BDT)
 *   EP0 TX descriptor (4 bytes): ADDR_TX, COUNT_TX
 *   EP0 RX descriptor (4 bytes): ADDR_RX, COUNT_RX
 *   EP1 TX descriptor (4 bytes)
 *   EP1 RX descriptor (4 bytes)
 *   ... (up to EP7 = 8 × 8 = 64 bytes total for BDT)
 *
 * Offset 0x040: EP0 TX buffer  (64 bytes = 0x40)
 * Offset 0x080: EP0 RX buffer  (64 bytes = 0x40)
 * Offset 0x0C0: EP1 TX buffer  (64 bytes = 0x40)
 * Offset 0x100: EP1 RX buffer  (64 bytes = 0x40)
 * ... etc.
 */

/* PMA base address in CPU address space */
#define PMA_BASE   0x40006000UL

/*
 * PMA access macro for the F1-style layout: each 16-bit PMA word occupies
 * 32 bits of CPU address space, so a PMA byte offset is doubled to get the
 * CPU offset. On parts with contiguous 16-bit PMA, drop the "* 2u".
 */
#define PMA_WORD(byte_off)   (*(volatile uint16_t *)(PMA_BASE + (byte_off) * 2u))

/* BDT entry layout, in PMA byte offsets (NOT CPU offsets) */
/* EP n TX descriptor: BTABLE + n*8 + 0 (ADDR_TX), + 2 (COUNT_TX) */
/* EP n RX descriptor: BTABLE + n*8 + 4 (ADDR_RX), + 6 (COUNT_RX) */
#define EP_TX_ADDR(n)   PMA_WORD(USB_BTABLE + (n)*8u + 0u)
#define EP_TX_COUNT(n)  PMA_WORD(USB_BTABLE + (n)*8u + 2u)
#define EP_RX_ADDR(n)   PMA_WORD(USB_BTABLE + (n)*8u + 4u)
#define EP_RX_COUNT(n)  PMA_WORD(USB_BTABLE + (n)*8u + 6u)

/* PMA buffer allocation — planned layout for a minimal CDC device */
#define PMA_BDT_BASE     0x00   /* BDT: 2 endpoints × 8 bytes = 16 bytes */
#define PMA_EP0_TX_BASE  0x10   /* EP0 TX buffer at PMA offset 0x10 (64 bytes) */
#define PMA_EP0_RX_BASE  0x50   /* EP0 RX buffer at PMA offset 0x50 (64 bytes) */
#define PMA_EP1_TX_BASE  0x90   /* EP1 TX buffer (bulk IN) */
#define PMA_EP1_RX_BASE  0xD0   /* EP1 RX buffer (bulk OUT) */
/* Total: 0xD0 + 0x40 = 0x110 = 272 bytes — fits in 512-byte PMA */

/* Copy data from system SRAM to PMA buffer (16-bit transfer, word-aligned) */
void pma_write(uint16_t pma_offset, const uint8_t *src, uint16_t len)
{
    volatile uint16_t *dst = (volatile uint16_t *)(PMA_BASE + pma_offset * 2);
    while (len >= 2) {
        *dst++ = src[0] | ((uint16_t)src[1] << 8);
        dst++;          /* Skip the unused upper half of the 32-bit slot (F1-style layout) */
        src += 2;
        len -= 2;
    }
    if (len == 1) {
        *dst = src[0]; /* Last odd byte */
    }
}

/* Copy data from PMA to system SRAM */
void pma_read(uint16_t pma_offset, uint8_t *dst, uint16_t len)
{
    volatile uint16_t *src = (volatile uint16_t *)(PMA_BASE + pma_offset * 2);
    while (len >= 2) {
        uint16_t w = *src;
        src += 2;       /* Skip upper 16 bits */
        dst[0] = (uint8_t)(w & 0xFF);
        dst[1] = (uint8_t)(w >> 8);
        dst += 2;
        len -= 2;
    }
    if (len == 1) {
        *dst = (uint8_t)(*src & 0xFF);
    }
}
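
The hand-planned PMA offsets above are easy to get wrong once endpoints are added or resized. As an alternative, a tiny bump allocator can hand out PMA byte offsets and report when the memory is exhausted. This is a sketch under stated assumptions: the type `pma_alloc_t`, the functions `pma_alloc_init` and `pma_alloc`, and the 512-byte `PMA_SIZE_BYTES` are all hypothetical names and values chosen for this example.

```c
#include <stdint.h>

#define PMA_SIZE_BYTES 512u     /* assumed PMA size; check the reference manual */
#define PMA_ALLOC_FAIL 0xFFFFu  /* returned when the request does not fit */

typedef struct {
    uint16_t next;              /* next free PMA byte offset */
} pma_alloc_t;

/* Reserve the buffer descriptor table for `num_eps` endpoints (8 bytes each),
   assuming BTABLE = 0 so the table starts at PMA offset 0. */
void pma_alloc_init(pma_alloc_t *a, uint16_t num_eps)
{
    a->next = (uint16_t)(num_eps * 8u);
}

/* Return the PMA byte offset of a new buffer of `size` bytes. */
uint16_t pma_alloc(pma_alloc_t *a, uint16_t size)
{
    uint16_t aligned = (uint16_t)((size + 1u) & ~1u);  /* keep buffers 16-bit aligned */
    if ((uint32_t)a->next + aligned > PMA_SIZE_BYTES) {
        return PMA_ALLOC_FAIL;
    }
    uint16_t off = a->next;
    a->next = (uint16_t)(a->next + aligned);
    return off;
}
```

Initialised for two endpoints and asked for four 64-byte buffers, the allocator returns 0x10, 0x50, 0x90 and 0xD0, which matches the planned layout in the listing above.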

COUNT_RX Field — Block Size Encoding

The COUNT_RX field in the BDT is not simply "number of bytes the buffer can hold". Its upper bits encode the buffer size in blocks. With BL_SIZE=0 the 5-bit NUM_BLOCK field counts 2-byte blocks (size = NUM_BLOCK × 2, up to 62 bytes). With BL_SIZE=1 it counts 32-byte blocks with an offset of one (size = (NUM_BLOCK + 1) × 32). The lower 10 bits are filled in by the hardware with the received byte count. This encoding tells the USB hardware the maximum buffer size so it knows when to raise an overflow error:

/* COUNT_RX layout: BL_SIZE (bit 15), NUM_BLOCK (bits 14:10), COUNT (bits 9:0) */
/* BL_SIZE=0: 2-byte blocks, NUM_BLOCK = size/2; n must be even and <= 62     */
#define COUNT_RX_BLKS(n)  (((n)/2) << 10)
/* BL_SIZE=1: 32-byte blocks, NUM_BLOCK = size/32 - 1                         */
/* For 64 bytes exactly: BL_SIZE=1, NUM_BLOCK=1 -> 0x8400                     */
#define COUNT_RX_64B      ((1u << 15) | (1u << 10))
#define COUNT_RX_64       COUNT_RX_64B              /* Alias for the same value */
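
The block-size encoding is easier to trust once it can be round-tripped. The sketch below encodes an arbitrary buffer size into a COUNT_RX value and decodes it again, following the BL_SIZE and block-count rules of the reference manual (2-byte blocks, or 32-byte blocks with a count offset of one). The function names are hypothetical, and the 512-byte upper limit is an assumption chosen to fit the smallest PMA.

```c
#include <stdint.h>

/* Encode an RX buffer size into the upper bits of COUNT_RX.
   Returns 0 when the size cannot be represented. */
uint16_t count_rx_encode(uint16_t size)
{
    if (size >= 2u && size <= 62u && (size % 2u) == 0u) {
        return (uint16_t)((size / 2u) << 10);                    /* BL_SIZE = 0 */
    }
    if (size >= 32u && size <= 512u && (size % 32u) == 0u) {
        return (uint16_t)(0x8000u | ((size / 32u - 1u) << 10));  /* BL_SIZE = 1 */
    }
    return 0u;
}

/* Decode the buffer capacity in bytes from a COUNT_RX value. */
uint16_t count_rx_capacity(uint16_t count_rx)
{
    uint16_t num_block = (uint16_t)((count_rx >> 10) & 0x1Fu);
    if (count_rx & 0x8000u) {
        return (uint16_t)((num_block + 1u) * 32u);
    }
    return (uint16_t)(num_block * 2u);
}
```

A 64-byte buffer encodes to 0x8400 and an 8-byte buffer to 0x1000. Note that 0x8000 on its own already means 32 bytes, because the 32-byte block count starts at one block.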

USB Interrupts on STM32

The FSDEV peripheral shares its interrupt vector with CAN on STM32F1 devices (USB_LP_CAN1_RX0_IRQn on the F103). On STM32F0 and L0 it has its own vector (USB_IRQn). The interrupt service routine must examine the USB_ISTR register to determine which event triggered the interrupt.

/* USB interrupt handler for STM32F0/L0 (on F103 name it USB_LP_CAN1_RX0_IRQHandler) */
void USB_IRQHandler(void)
{
    uint16_t istr = USB_ISTR;

    /* ── USB bus RESET ── */
    if (istr & ISTR_RESET) {
        USB_ISTR = ~ISTR_RESET;         /* Clear by writing 0 to the bit */
        usb_handle_reset();
    }

    /* ── SUSPEND ── */
    if (istr & ISTR_SUSP) {
        USB_ISTR = ~ISTR_SUSP;
        /* Enter low-power mode if desired */
        USB_CNTR |= (1u << 3);         /* FSUSP bit */
    }

    /* ── WAKEUP ── */
    if (istr & ISTR_WKUP) {
        USB_CNTR &= ~(1u << 3);        /* Clear FSUSP */
        USB_ISTR = ~ISTR_WKUP;
    }

    /* ── CORRECT TRANSFER (CTR) — most important ── */
    while (USB_ISTR & ISTR_CTR) {
        uint8_t  ep_id  = (uint8_t)(USB_ISTR & ISTR_EP_ID);
        uint8_t  dir    = (USB_ISTR & ISTR_DIR) ? 1u : 0u;
        /* dir=1: OUT or SETUP (host wrote to device); dir=0: IN (device wrote to host) */

        if (ep_id == 0) {
            usb_handle_ep0(dir);
        } else {
            usb_handle_epn(ep_id, dir);
        }
        /* CTR is cleared by clearing CTR_RX or CTR_TX in the EPnR register,
           NOT in ISTR. The while loop continues until all CTR events are processed. */
    }

    /* ── SOF (start of frame, every 1 ms) ── */
    if (istr & ISTR_SOF) {
        USB_ISTR = ~ISTR_SOF;
        /* Optional: 1 ms tick for application timing */
    }
}

/* Initialise the USB peripheral after clock enable */
void usb_init(void)
{
    /* 1. Enable USB clock (RCC) — device-specific, not shown */

    /* 2. Exit power-down mode */
    USB_CNTR = CNTR_FRES;              /* Keep in reset, clear PDWN */
    /* Wait at least 1 µs for analog section to stabilise */
    for (volatile int i = 0; i < 72; i++); /* ~1 µs at 72 MHz */

    /* 3. Clear reset */
    USB_CNTR = 0;

    /* 4. Clear any pending interrupts */
    USB_ISTR = 0;

    /* 5. Set BTABLE — BDT at PMA offset 0 */
    USB_BTABLE = PMA_BDT_BASE;

    /* 6. Enable interrupts: CTR, RESET, SUSP, WKUP */
    USB_CNTR = CNTR_CTRM | CNTR_RESETM | CNTR_SUSPM | CNTR_WKUPM;

    /* 7. Enable USB interrupt in NVIC */
    NVIC_SetPriority(USB_IRQn, 2);
    NVIC_EnableIRQ(USB_IRQn);

    /* 8. Connect the D+ pull-up. F1 parts need an external 1.5 kΩ pull-up
          (often switched by a GPIO); F0/L0 parts have an internal one: */
    /* USB->BCDR |= USB_BCDR_DPPU; */
}
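
The interrupt handler above decodes USB_ISTR inline, which makes the dispatch logic hard to test without hardware. One option is to factor the decoding into a pure function that can run on a PC. The sketch below does that using the ISTR bit positions from the register map; `istr_event_t` and `istr_decode` are hypothetical names introduced for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    ctr;     /* correct transfer pending */
    bool    reset;   /* bus reset detected */
    bool    susp;    /* suspend request */
    bool    wkup;    /* wakeup request */
    uint8_t ep_id;   /* endpoint of the pending transfer, valid when ctr is set */
    bool    is_out;  /* DIR = 1: OUT or SETUP; DIR = 0: IN */
} istr_event_t;

/* Decode a snapshot of USB_ISTR into individual events. */
istr_event_t istr_decode(uint16_t istr)
{
    istr_event_t e;
    e.ctr    = (istr & (1u << 15)) != 0u;
    e.reset  = (istr & (1u << 10)) != 0u;
    e.susp   = (istr & (1u << 11)) != 0u;
    e.wkup   = (istr & (1u << 12)) != 0u;
    e.ep_id  = (uint8_t)(istr & 0x0Fu);
    e.is_out = (istr & (1u << 4)) != 0u;
    return e;
}
```

A snapshot of 0x8011 decodes to a correct transfer on endpoint 1 in the OUT direction, and 0x0400 decodes to a bus reset with no transfer pending.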

Control Endpoint 0 State Machine

EP0 is mandatory on all USB devices and handles all control transfers — enumeration, SetAddress, GetDescriptor, SetConfiguration, and class-specific control requests. The control transfer protocol has three stages: SETUP, DATA (optional), and STATUS. Getting the state machine wrong causes the host to fail enumeration silently.

The EP0 state machine has six working states; the code adds an EP0_STALL state for rejected requests:

State | Meaning | Transition Triggers
IDLE | Waiting for a SETUP packet | → SETUP_RX when SETUP packet arrives
SETUP_RX | SETUP packet received, decode request | → DATA_TX (IN request with data), DATA_RX (OUT request with data), STATUS_TX (any request with no data stage)
DATA_TX | Sending descriptor/data to host (IN direction) | → STATUS_RX after all data sent
DATA_RX | Receiving data from host (OUT direction) | → STATUS_TX after all data received
STATUS_TX / STATUS_RX | Zero-length status packet exchange (IN after an OUT or no-data request, OUT after an IN data stage) | → IDLE after status ACK
/* ep0_state_machine.c — complete control endpoint 0 implementation */

typedef enum {
    EP0_IDLE,
    EP0_SETUP_RX,
    EP0_DATA_TX,
    EP0_DATA_RX,
    EP0_STATUS_TX,
    EP0_STATUS_RX,
    EP0_STALL
} ep0_state_t;

static ep0_state_t  ep0_state      = EP0_IDLE;
static uint8_t      ep0_setup[8];           /* Received SETUP packet */
static const uint8_t *ep0_tx_data  = NULL;  /* Pointer to descriptor/data to send */
static uint16_t     ep0_tx_len     = 0;     /* Remaining bytes to send */
static uint16_t     ep0_tx_sent    = 0;     /* Bytes already sent */
static uint8_t      ep0_pending_addr = 0;   /* New USB address (SET_ADDRESS) */

/* Access USB_EPnR by endpoint number */
#define USB_EPR(n)   (*(volatile uint16_t *)(USB_BASE + (n) * 4u))
/* Plain read/write fields that must be written back unchanged */
#define EP_RW_MASK   (EP_TYPE | EP_KIND | EP_ADDR)

/* Helper: change STAT_TX without disturbing the toggle or CTR bits.
   STAT and DTOG bits flip when 1 is written, so we write (current XOR desired)
   to STAT_TX and 0 to every other toggle bit. CTR bits clear on 0, so write 1. */
static void ep_set_stat_tx(uint8_t ep, uint16_t stat)
{
    uint16_t reg = USB_EPR(ep);         /* Read current value */
    reg &= (EP_RW_MASK | EP_STAT_TX);   /* Drop DTOG_RX/TX and STAT_RX (0 = no toggle) */
    reg ^= (uint16_t)(stat << 4);       /* Bits that differ from the desired state */
    reg |= (EP_CTR_RX | EP_CTR_TX);     /* Writing 1 leaves pending CTR flags alone */
    USB_EPR(ep) = reg;
}

static void ep_set_stat_rx(uint8_t ep, uint16_t stat)
{
    uint16_t reg = USB_EPR(ep);
    reg &= (EP_RW_MASK | EP_STAT_RX);
    reg ^= (uint16_t)(stat << 12);
    reg |= (EP_CTR_RX | EP_CTR_TX);
    USB_EPR(ep) = reg;
}

/* Initialise EP0 after USB reset */
void usb_handle_reset(void)
{
    /* Reset device address to 0 */
    USB_DADDR = DADDR_EF | 0;

    /* Configure EP0 as control endpoint */
    USB_EP0R = (EP_TYPE_CONTROL << 9) | 0;  /* CONTROL type, EP address 0 */

    /* Set BDT entries for EP0 */
    EP_TX_ADDR(0)  = PMA_EP0_TX_BASE;
    EP_TX_COUNT(0) = 0;
    EP_RX_ADDR(0)  = PMA_EP0_RX_BASE;
    EP_RX_COUNT(0) = COUNT_RX_64B;          /* 64-byte RX buffer */

    /* Set EP0 TX to NAK (no data to send yet), RX to VALID (ready to receive SETUP) */
    ep_set_stat_tx(0, EP_STAT_NAK);
    ep_set_stat_rx(0, EP_STAT_VALID);

    ep0_state = EP0_IDLE;
    ep0_pending_addr = 0;
}

/* Called from the USB IRQ when a CTR event occurs on EP0 */
void usb_handle_ep0(uint8_t dir)
{
    uint16_t ep0r = USB_EP0R;

    if (ep0r & EP_CTR_RX) {
        /* RX event: SETUP or OUT data received */
        if (ep0r & EP_SETUP) {
            /* SETUP packet received — read 8 bytes from PMA */
            pma_read(PMA_EP0_RX_BASE, ep0_setup, 8);
            /* Clear CTR_RX: write 0 to it, 1 to CTR_TX, and 0 to all toggle bits */
            USB_EP0R = (ep0r & (EP_TYPE | EP_KIND | EP_ADDR)) | EP_CTR_TX;
            ep0_handle_setup();
        } else {
            /* OUT packet: data stage (DATA_RX) or the host's status ZLP (STATUS_RX) */
            uint16_t count = EP_RX_COUNT(0) & 0x3FF;
            (void)count; /* Read received data here; not shown for brevity */
            USB_EP0R = (ep0r & (EP_TYPE | EP_KIND | EP_ADDR)) | EP_CTR_TX;
            if (ep0_state == EP0_STATUS_RX) ep0_state = EP0_IDLE;
            ep_set_stat_rx(0, EP_STAT_VALID); /* Ready for more */
        }
        }
    }

    if (ep0r & EP_CTR_TX) {
        /* TX event: host acknowledged IN packet */
        /* Clear CTR_TX only: 1 to CTR_RX, 0 to all toggle bits */
        USB_EP0R = (ep0r & (EP_TYPE | EP_KIND | EP_ADDR)) | EP_CTR_RX;

        if (ep0_state == EP0_DATA_TX) {
            ep0_continue_tx();  /* Send next chunk */
        } else if (ep0_state == EP0_STATUS_TX) {
            /* Status ZLP sent — apply any pending address change */
            if (ep0_pending_addr) {
                USB_DADDR = DADDR_EF | ep0_pending_addr;
                ep0_pending_addr = 0;
            }
            ep0_state = EP0_IDLE;
            ep_set_stat_rx(0, EP_STAT_VALID);
        }
    }
}

/* Handle a received SETUP packet */
void ep0_handle_setup(void)
{
    uint8_t  bmRequestType = ep0_setup[0];
    uint8_t  bRequest      = ep0_setup[1];
    uint16_t wValue        = ep0_setup[2] | ((uint16_t)ep0_setup[3] << 8);
    uint16_t wIndex        = ep0_setup[4] | ((uint16_t)ep0_setup[5] << 8);
    uint16_t wLength       = ep0_setup[6] | ((uint16_t)ep0_setup[7] << 8);

    /* Standard device requests */
    if ((bmRequestType & 0x60) == 0x00) {  /* Type = Standard */
        switch (bRequest) {
            case 0x05:  /* SET_ADDRESS */
                ep0_pending_addr = (uint8_t)(wValue & 0x7F);
                ep0_tx_len = 0;
                ep0_state  = EP0_STATUS_TX;
                EP_TX_COUNT(0) = 0;
                ep_set_stat_tx(0, EP_STAT_VALID); /* Send ZLP status */
                ep_set_stat_rx(0, EP_STAT_NAK);
                break;

            case 0x06:  /* GET_DESCRIPTOR */
                ep0_handle_get_descriptor(wValue, wIndex, wLength);
                break;

            case 0x09:  /* SET_CONFIGURATION */
                ep0_handle_set_configuration(wValue);
                break;

            default:
                /* Stall unsupported standard requests */
                ep_set_stat_tx(0, EP_STAT_STALL);
                ep_set_stat_rx(0, EP_STAT_STALL);
                break;
        }
    } else {
        /* Class or vendor requests — application-specific */
        ep_set_stat_tx(0, EP_STAT_STALL);
        ep_set_stat_rx(0, EP_STAT_STALL);
    }
}
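
The first decision after a SETUP packet arrives is which stage follows it, and that decision depends only on two fields of the packet. The sketch below isolates it as a pure function: parse the eight raw bytes, then choose the next stage from wLength and the direction bit of bmRequestType. The type and function names (`setup_packet_t`, `setup_parse`, `setup_first_stage`) are hypothetical and not part of the driver above.

```c
#include <stdint.h>

typedef struct {
    uint8_t  bmRequestType;
    uint8_t  bRequest;
    uint16_t wValue;
    uint16_t wIndex;
    uint16_t wLength;
} setup_packet_t;

typedef enum {
    STAGE_DATA_IN,    /* device sends data, host ends with a status OUT */
    STAGE_DATA_OUT,   /* host sends data, device ends with a status IN  */
    STAGE_STATUS_IN   /* no data stage: device sends the status ZLP     */
} first_stage_t;

/* Decode the 8 little-endian bytes of a SETUP packet. */
setup_packet_t setup_parse(const uint8_t raw[8])
{
    setup_packet_t s;
    s.bmRequestType = raw[0];
    s.bRequest      = raw[1];
    s.wValue        = (uint16_t)(raw[2] | ((uint16_t)raw[3] << 8));
    s.wIndex        = (uint16_t)(raw[4] | ((uint16_t)raw[5] << 8));
    s.wLength       = (uint16_t)(raw[6] | ((uint16_t)raw[7] << 8));
    return s;
}

/* Choose the stage that follows the SETUP stage. */
first_stage_t setup_first_stage(const setup_packet_t *s)
{
    if (s->wLength == 0u) {
        return STAGE_STATUS_IN;
    }
    return (s->bmRequestType & 0x80u) ? STAGE_DATA_IN : STAGE_DATA_OUT;
}
```

A GET_DESCRIPTOR(Device) request with the bytes 80 06 00 01 00 00 40 00 yields a data IN stage of up to 64 bytes, while SET_ADDRESS (00 05 07 00 00 00 00 00) has no data stage and goes straight to the status IN packet.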

Implementing GetDescriptor

GetDescriptor is the most frequently called control request during enumeration. The host calls it multiple times: first for the device descriptor, then for the configuration descriptor (first a short 9-byte read to get wTotalLength, then the full configuration), then for string descriptors. Your implementation must handle short transfers (when wLength is less than the descriptor length) and the ZLP rule correctly.

/* Descriptor type codes in the high byte of wValue for GET_DESCRIPTOR */
#define DESC_TYPE_DEVICE        0x01
#define DESC_TYPE_CONFIGURATION 0x02
#define DESC_TYPE_STRING        0x03
#define DESC_TYPE_INTERFACE     0x04
#define DESC_TYPE_ENDPOINT      0x05
#define DESC_TYPE_BOS           0x0F

/* Static descriptors — defined in descriptors.c, extern here */
extern const uint8_t  device_descriptor[18];
extern const uint8_t  configuration_descriptor[CONFIG_TOTAL_LEN];
extern const uint8_t *string_descriptors[];
extern const uint8_t  num_string_descriptors;

void ep0_handle_get_descriptor(uint16_t wValue, uint16_t wIndex, uint16_t wLength)
{
    uint8_t  desc_type  = (uint8_t)(wValue >> 8);
    uint8_t  desc_index = (uint8_t)(wValue & 0xFF);
    const uint8_t *desc_ptr = NULL;
    uint16_t       desc_len = 0;

    switch (desc_type) {
        case DESC_TYPE_DEVICE:
            desc_ptr = device_descriptor;
            desc_len = sizeof(device_descriptor);
            break;

        case DESC_TYPE_CONFIGURATION:
            if (desc_index == 0) {
                desc_ptr = configuration_descriptor;
                desc_len = CONFIG_TOTAL_LEN;
            }
            break;

        case DESC_TYPE_STRING:
            if (desc_index < num_string_descriptors && string_descriptors[desc_index]) {
                desc_ptr = string_descriptors[desc_index];
                desc_len = desc_ptr[0]; /* First byte is bLength */
            }
            break;

        default:
            break;
    }

    if (desc_ptr == NULL) {
        /* Unsupported descriptor — stall */
        ep_set_stat_tx(0, EP_STAT_STALL);
        ep_set_stat_rx(0, EP_STAT_STALL);
        return;
    }

    /* Clamp to wLength (host may request fewer bytes than the full descriptor) */
    if (desc_len > wLength) desc_len = wLength;

    /* Start the DATA_TX phase */
    ep0_tx_data = desc_ptr;
    ep0_tx_len  = desc_len;
    ep0_tx_sent = 0;
    ep0_state   = EP0_DATA_TX;

    ep0_continue_tx();
}

/* Send the next chunk of the IN data stage (up to 64 bytes per packet) */
void ep0_continue_tx(void)
{
    static uint8_t zlp_sent;   /* Terminating ZLP already queued for this transfer */
    uint16_t wLength = ep0_setup[6] | ((uint16_t)ep0_setup[7] << 8);
    uint16_t chunk   = ep0_tx_len - ep0_tx_sent;
    if (chunk > 64) chunk = 64;

    if (ep0_tx_sent == 0) zlp_sent = 0;   /* First call of a new transfer */

    if (chunk == 0) {
        /* ZLP rule: if the data ends on a full 64-byte packet and is shorter
           than wLength, the host cannot tell that the transfer is over, so
           one zero-length packet must follow the last data packet. */
        if (!zlp_sent && ep0_tx_len > 0 &&
            ep0_tx_len < wLength && (ep0_tx_len % 64u) == 0u) {
            zlp_sent = 1;
            EP_TX_COUNT(0) = 0;
            ep_set_stat_tx(0, EP_STAT_VALID);
            ep_set_stat_rx(0, EP_STAT_NAK);
            return;
        }

        /* All data sent: move to STATUS_RX (host will send a ZLP status OUT) */
        ep0_state = EP0_STATUS_RX;
        EP_TX_COUNT(0) = 0;
        ep_set_stat_rx(0, EP_STAT_VALID);
        ep_set_stat_tx(0, EP_STAT_NAK);
        return;
    }

    /* Copy next chunk to PMA TX buffer */
    pma_write(PMA_EP0_TX_BASE, ep0_tx_data + ep0_tx_sent, chunk);
    EP_TX_COUNT(0) = chunk;
    ep0_tx_sent   += chunk;

    /* A final chunk shorter than 64 bytes ends the transfer by itself; the
       CTR_TX interrupt for it calls this function again with nothing left. */
    ep_set_stat_tx(0, EP_STAT_VALID); /* Arm TX: hardware sends the packet */
    ep_set_stat_rx(0, EP_STAT_NAK);
}
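
The short-transfer and ZLP rules are easiest to verify as plain arithmetic, separate from the register code. The sketch below computes, for a given descriptor length, requested wLength and maximum packet size, how many full packets go out, how long the final short packet is, and whether a terminating zero-length packet is needed. `in_plan_t` and `ep0_plan_in` are hypothetical names for this illustration, and `mps` is assumed to be non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t total;         /* bytes actually returned to the host */
    uint16_t full_packets;  /* number of packets of exactly mps bytes */
    uint16_t last_len;      /* length of the final short packet, 0 if none */
    bool     needs_zlp;     /* a zero-length packet must end the data stage */
} in_plan_t;

/* Plan the IN data stage of a control transfer. `mps` must be non-zero. */
in_plan_t ep0_plan_in(uint16_t desc_len, uint16_t wLength, uint16_t mps)
{
    in_plan_t p;
    p.total        = (desc_len < wLength) ? desc_len : wLength;
    p.full_packets = (uint16_t)(p.total / mps);
    p.last_len     = (uint16_t)(p.total % mps);
    /* ZLP only when the data ends on a packet boundary AND the host asked
       for more than it is getting; otherwise the host already knows the end. */
    p.needs_zlp    = (p.total > 0u) && (p.last_len == 0u) && (p.total < wLength);
    return p;
}
```

An 18-byte device descriptor requested with wLength = 64 is a single short packet with no ZLP. A 64-byte descriptor requested with wLength = 255 needs one full packet followed by a ZLP, but the same descriptor requested with wLength = 64 needs none, because the host receives exactly what it asked for.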

SetAddress Procedure

SetAddress is the one control request with a timing requirement that catches almost every first-time bare-metal implementer. The USB specification mandates that the new device address takes effect after the status stage of the SetAddress request is complete, not immediately after the SETUP packet is received. The status stage is still addressed to the old address (0 during enumeration). If you write the new address to USB_DADDR while handling the SETUP packet, the device stops answering address 0, never responds to the host's status-stage IN token, and enumeration fails.

/*
 * SetAddress correct timing:
 *
 * 1. Host sends SETUP (bRequest=SET_ADDRESS, wValue=new_addr)
 * 2. Device stores new_addr in ep0_pending_addr
 * 3. Device sends STATUS ZLP (IN direction, zero length)
 * 4. Host ACKs the status ZLP — this is the CTR_TX interrupt
 * 5. NOW (in the CTR_TX handler) we write new_addr to USB_DADDR
 *
 * This is already implemented in ep0_handle_setup() and the CTR_TX
 * handler above. Here is the isolated logic for clarity:
 */

/* Called in SETUP handler for SET_ADDRESS: */
void handle_set_address_setup(uint8_t new_addr)
{
    /* DO NOT set USB_DADDR here — the address is not valid yet */
    ep0_pending_addr = new_addr;

    /* Prepare zero-length status packet */
    EP_TX_COUNT(0) = 0;
    ep0_state = EP0_STATUS_TX;
    ep_set_stat_tx(0, EP_STAT_VALID);   /* Send ZLP */
    ep_set_stat_rx(0, EP_STAT_NAK);
}

/* Called in CTR_TX handler when status ZLP is acknowledged: */
void handle_set_address_complete(void)
{
    /* NOW it is safe to apply the new address */
    if (ep0_pending_addr != 0) {
        USB_DADDR = DADDR_EF | ep0_pending_addr;
        ep0_pending_addr = 0;
    }
    ep0_state = EP0_IDLE;
    ep_set_stat_rx(0, EP_STAT_VALID);   /* Ready for next SETUP */
}

/*
 * Common bug: writing USB_DADDR in the SETUP handler.
 * The host runs the status stage against the old address: its IN token
 * is still addressed to 0. If the device has already switched to the new
 * address, it ignores that token, the status ZLP is never sent, and
 * enumeration times out.
 */
The Address Timing Rule: For SET_ADDRESS only, the new address becomes active after the handshake of the status stage, not after the SETUP stage. For all other control requests (including SET_CONFIGURATION), the action takes effect immediately. This is a USB spec exception that exists for only this one request.

Bulk Endpoint Implementation

After successful enumeration (device descriptor, configuration descriptor, SET_CONFIGURATION), the host can use the non-EP0 endpoints. Here is a complete bulk endpoint implementation for EP1 IN and EP1 OUT, implementing a simple loopback:

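/* bulk_endpoint.c: EP1 IN (device to host) and EP1 OUT (host to device) */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>   /* memcpy */

static uint8_t bulk_rx_buf[64];
static uint8_t bulk_tx_buf[64];
static bool    bulk_tx_busy = false;

/* Application callbacks, defined at the end of this file */
void on_bulk_rx(const uint8_t *data, uint16_t len);
void on_bulk_tx_complete(void);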

/* Configure EP1 as bulk after SET_CONFIGURATION */
void ep1_init(void)
{
    /* Set EP1 type = BULK, address = 1 */
    USB_EP1R = (EP_TYPE_BULK << 9) | 1;

    /* Set BDT addresses */
    EP_TX_ADDR(1)  = PMA_EP1_TX_BASE;
    EP_TX_COUNT(1) = 0;
    EP_RX_ADDR(1)  = PMA_EP1_RX_BASE;
    EP_RX_COUNT(1) = COUNT_RX_64B;

    /* Reset both data toggles to DATA0, set STAT_TX = NAK (nothing to send yet)
       and STAT_RX = VALID (ready to receive) in one write. DTOG and STAT bits
       flip when 1 is written, so each field gets (current XOR desired); the
       CTR bits are written as 1 so no pending flag is cleared. */
    uint16_t reg = USB_EP1R;
    USB_EP1R = (uint16_t)((reg & (EP_TYPE | EP_KIND | EP_ADDR))
             | EP_CTR_RX | EP_CTR_TX
             | (reg & (EP_DTOG_TX | EP_DTOG_RX))          /* Flip only if set */
             | ((reg & EP_STAT_TX) ^ (EP_STAT_NAK   << 4))
             | ((reg & EP_STAT_RX) ^ (EP_STAT_VALID << 12)));

    bulk_tx_busy = false;
}

/* Send data on EP1 IN (device to host) */
bool ep1_send(const uint8_t *data, uint16_t len)
{
    if (bulk_tx_busy) return false;  /* Previous TX not complete */
    if (len > 64) len = 64;

    pma_write(PMA_EP1_TX_BASE, data, len);
    EP_TX_COUNT(1) = len;
    bulk_tx_busy   = true;

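    /* Arm EP1 TX: write (current XOR VALID) to STAT_TX, 0 to the other toggle
       bits, and 1 to both CTR bits so no pending flag is cleared */
    uint16_t reg = USB_EP1R;
    USB_EP1R = (uint16_t)((reg & (EP_TYPE | EP_KIND | EP_ADDR))
             | EP_CTR_RX | EP_CTR_TX
             | ((reg & EP_STAT_TX) ^ (EP_STAT_VALID << 4)));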

    return true;
}

/* Called from USB IRQ when EP1 has a CTR event */
void usb_handle_epn(uint8_t ep_id, uint8_t dir)
{
    if (ep_id == 1) {
        uint16_t ep1r = USB_EP1R;

        if (ep1r & EP_CTR_RX) {
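            /* OUT data received from host */
            uint16_t count = EP_RX_COUNT(1) & 0x3FF;
            if (count > sizeof(bulk_rx_buf)) count = sizeof(bulk_rx_buf);
            pma_read(PMA_EP1_RX_BASE, bulk_rx_buf, count);

            /* Clear CTR_RX: write 0 to it, 1 to CTR_TX, 0 to all toggle bits */
            USB_EP1R = (ep1r & (EP_TYPE | EP_KIND | EP_ADDR)) | EP_CTR_TX;

            /* Application callback: process received data */
            on_bulk_rx(bulk_rx_buf, count);

            /* Re-arm EP1 RX for the next packet (hardware left STAT_RX at NAK) */
            uint16_t reg = USB_EP1R;
            USB_EP1R = (uint16_t)((reg & (EP_TYPE | EP_KIND | EP_ADDR))
                     | EP_CTR_RX | EP_CTR_TX
                     | ((reg & EP_STAT_RX) ^ (EP_STAT_VALID << 12)));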
        }

        if (ep1r & EP_CTR_TX) {
            /* IN data acknowledged by host: clear CTR_TX only */
            USB_EP1R = (ep1r & (EP_TYPE | EP_KIND | EP_ADDR)) | EP_CTR_RX;
            bulk_tx_busy = false;

            /* Application callback: TX complete, may send next chunk */
            on_bulk_tx_complete();
        }
    }
}

/* Simple loopback: echo received data back */
void on_bulk_rx(const uint8_t *data, uint16_t len)
{
    /* Copy to TX buffer and send */
    memcpy(bulk_tx_buf, data, len);
    ep1_send(bulk_tx_buf, len);
}

void on_bulk_tx_complete(void) { /* Nothing to do for loopback */ }

/*
 * DATA0/DATA1 toggle:
 * The USB spec requires alternating DATA0/DATA1 packet IDs.
 * The STM32 FSDEV hardware handles DTOG automatically for normal transfers:
 * it toggles DTOG_TX after each successful IN transfer (CTR_TX)
 * and DTOG_RX after each successful OUT transfer (CTR_RX).
 * You only need to manually control DTOG in error recovery scenarios
 * (e.g., after a STALL, you must reset DTOG to DATA0).
 */
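
The loopback above only ever sends one packet, but the TX-complete callback is the natural place to continue a longer transfer. The sketch below shows one way to split a buffer into 64-byte packets, keeping the hardware out of the picture so the logic can be tested on a PC. `bulk_stream_t`, `bulk_stream_start` and `bulk_stream_next` are hypothetical names. Ending a transfer that is an exact multiple of 64 bytes with a zero-length packet is an assumption about the host-side protocol, not something every bulk protocol requires.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BULK_MPS 64u   /* bulk max packet size at Full Speed */

typedef struct {
    const uint8_t *data;        /* next byte to send */
    size_t         remaining;   /* bytes not yet handed out */
    bool           zlp_pending; /* transfer must end with a zero-length packet */
} bulk_stream_t;

/* Begin a transfer of `len` bytes. */
void bulk_stream_start(bulk_stream_t *s, const uint8_t *data, size_t len)
{
    s->data        = data;
    s->remaining   = len;
    s->zlp_pending = (len > 0u) && (len % BULK_MPS == 0u);
}

/* Fetch the next packet to send. Returns false when the transfer is complete. */
bool bulk_stream_next(bulk_stream_t *s, const uint8_t **pkt, uint16_t *pkt_len)
{
    if (s->remaining > 0u) {
        size_t n = (s->remaining > BULK_MPS) ? BULK_MPS : s->remaining;
        *pkt         = s->data;
        *pkt_len     = (uint16_t)n;
        s->data     += n;
        s->remaining -= n;
        return true;
    }
    if (s->zlp_pending) {
        s->zlp_pending = false;
        *pkt     = s->data;
        *pkt_len = 0u;
        return true;
    }
    return false;
}
```

In the driver, the first packet would be passed to ep1_send() when the transfer starts, and each following packet from on_bulk_tx_complete() until bulk_stream_next() returns false. A 130-byte buffer goes out as 64, 64 and 2 bytes; a 128-byte buffer as 64, 64 and a zero-length packet.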

What TinyUSB Gives You

After implementing bare-metal EP0 and a bulk endpoint, you have a new appreciation for what TinyUSB handles on your behalf. Here is a side-by-side comparison of what you wrote vs what TinyUSB gives you for free:

| Concern | Bare-Metal (your code) | TinyUSB (automatic) |
| --- | --- | --- |
| EP0 state machine | ~200 lines of careful C, multiple subtle bugs possible | 0 lines; fully handled internally |
| Descriptor handling | Manual switch/case, pointer arithmetic, length clamping | Callbacks: tud_descriptor_device_cb() etc. |
| PMA management | Manual allocation, 16-bit word access with MCU-specific stride, pma_read/pma_write | Fully automatic |
| DTOG (data toggle) | Hardware automatic for normal transfers, manual for error recovery | Fully automatic including error recovery |
| SetAddress timing | Must remember to defer DADDR write to CTR_TX handler | Handled correctly in TinyUSB core |
| ZLP rule | Must calculate and send manually | Automatic |
| Suspend/Resume | Manual manipulation of the FSUSP bit in CNTR | Callbacks: tud_suspend_cb(), tud_resume_cb() |
| EPnR write safety | XOR-based write to avoid flipping the toggle bits (STAT, DTOG); must get right for every write | Abstracted in dcd_stm32_fsdev.c |
| Multi-MCU portability | Zero; completely MCU-specific | 30+ MCU families, same application code |
| Class drivers | Must implement all class logic manually | CDC, HID, MSC, Audio, MIDI, vendor; all ready |

Conclusion from this exercise: TinyUSB's src/portable/st/stm32_fsdev/dcd_stm32_fsdev.c is approximately 600 lines of carefully written hardware driver code. The EP0 state machine in src/device/usbd.c is another ~800 lines. You just wrote a simplified version of that in this article. Use TinyUSB for production. Study bare-metal to understand production.
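
Two of the EP0 rows in the comparison, SetAddress timing and the ZLP rule, are small enough to sketch as hardware-independent logic. This is a minimal illustration under stated assumptions: `ep0_ctx_t` and the three functions are hypothetical names, and the `daddr` field merely stands in for the USB_DADDR register (bit 7 is EF, bits 6:0 the address) so the logic can be exercised on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical EP0 bookkeeping; `daddr` stands in for USB_DADDR. */
typedef struct {
    uint8_t pending_addr;  /* address from the SETUP packet, not yet active */
    bool    addr_pending;
    uint8_t daddr;         /* bit 7 = EF (enable), bits 6:0 = device address */
} ep0_ctx_t;

/* SETUP stage of SET_ADDRESS: remember the value, leave DADDR alone. */
void ep0_on_set_address(ep0_ctx_t *c, uint16_t wValue)
{
    c->pending_addr = (uint8_t)(wValue & 0x7Fu);
    c->addr_pending = true;
}

/* CTR_TX on EP0: the status-stage ZLP was ACKed at the old address,
 * so the new address can take effect now. */
void ep0_on_status_in_done(ep0_ctx_t *c)
{
    if (c->addr_pending) {
        c->daddr = (uint8_t)(0x80u | c->pending_addr);
        c->addr_pending = false;
    }
}

/*
 * Does a control IN data stage need a trailing zero-length packet?
 *   len     - bytes the device will actually send
 *   wLength - bytes the host requested in the SETUP packet
 *   mps     - EP0 max packet size
 */
bool ep0_in_needs_zlp(uint16_t len, uint16_t wLength, uint16_t mps)
{
    assert(mps != 0);
    if (len >= wLength) {
        return false;        /* host received all it asked for and stops */
    }
    return (len % mps) == 0; /* short reply that ends on a packet boundary */
}
```

The ZLP check assumes `len` has already been clamped to `wLength`. A 64-byte reply to a 255-byte request on a 64-byte EP0 needs the extra packet, while an 18-byte device descriptor does not, because its final packet is already short.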

When Bare-Metal Is Still the Right Answer

Despite the comparison above, there are legitimate production scenarios for bare-metal USB:

  • USB DFU bootloader in <4 KB: A USB DFU Class bootloader that fits in the first 4 KB of flash (before any application) must be extremely compact. libopencm3's USB stack and similar minimal stacks achieve this. TinyUSB does not fit.
  • Single-purpose fixed-protocol device: A USB HID keyboard that only ever sends one 8-byte report can implement EP0 (enumeration only) and EP1 interrupt IN in under 2 KB with a stripped bare-metal approach.
  • Custom RTOS without TinyUSB port: If your firmware runs on a proprietary RTOS that has no TinyUSB port, a bare-metal USB driver may be the only option that avoids the effort of porting the stack.

Practical Exercises

Exercise 1 Beginner

Bare-Metal Enumeration

On an STM32F103 or STM32F042 development board, implement bare-metal USB enumeration without any USB library. Your goal is to reach the point where Device Manager (Windows) or lsusb (Linux) shows your device with the correct VID/PID, manufacturer string, and product string. Implement only EP0 control transfers: GetDescriptor (device descriptor, configuration descriptor, string descriptors) and SetAddress. Use Wireshark/USBPcap to capture and verify that each request is handled correctly. Log all received SETUP packets over UART for debugging.

STM32 FSDEV · EP0 State Machine · USB Enumeration · Wireshark

Exercise 2 Intermediate

Add a Bulk Endpoint Loopback

Extend the bare-metal enumerator from Exercise 1 to include EP1 bulk IN and EP1 bulk OUT. After successful SetConfiguration, configure EP1 as described in Section 8. Write a Python pyusb script that sends 100 payloads of random size (1–64 bytes) and verifies the loopback echo. Deliberately introduce a DTOG error (set DTOG to the wrong state) and observe in Wireshark how the host and device recover (or fail to recover). Document the DTOG recovery procedure.

Bulk Endpoint · DTOG Toggle · PMA Management · Error Recovery

Exercise 3 Advanced

Implement CDC Without TinyUSB

Implement a complete USB CDC-ACM (virtual COM port) device from scratch on STM32F0 using only bare-metal register access. You will need: EP0 (control + standard CDC class requests SET_LINE_CODING, GET_LINE_CODING, SET_CONTROL_LINE_STATE), EP1 bulk IN (data from device to host), EP1 bulk OUT (data from host to device), and EP2 interrupt IN (CDC notification endpoint, required by spec even if unused). The device should enumerate as a CDC serial port on Windows (using the inbox usbser.sys), accept data from a terminal program, and echo it back with the bytes uppercased. Measure the throughput in bytes/second and compare it against a TinyUSB CDC device on the same hardware.

CDC-ACM · Bare-Metal · Class Requests · Multi-Endpoint · Throughput Measurement
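
As a starting point for Exercise 3, SET_LINE_CODING and GET_LINE_CODING both carry the same 7-byte Line Coding structure in the data stage. The sketch below shows one way to convert it to and from a C struct without relying on struct packing. The type and function names are hypothetical, chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_CODING_LEN 7u  /* wLength of SET/GET_LINE_CODING */

/* Decoded form of the CDC Line Coding structure. */
typedef struct {
    uint32_t baud;       /* dwDTERate, bits per second (little-endian on the wire) */
    uint8_t  stop_bits;  /* bCharFormat: 0 = 1 stop bit, 1 = 1.5, 2 = 2 */
    uint8_t  parity;     /* bParityType: 0 none, 1 odd, 2 even, 3 mark, 4 space */
    uint8_t  data_bits;  /* bDataBits: 5, 6, 7, 8 or 16 */
} line_coding_t;

/* Parse the 7 bytes received in the SET_LINE_CODING data stage. */
void line_coding_decode(const uint8_t raw[LINE_CODING_LEN], line_coding_t *out)
{
    assert(raw != NULL && out != NULL);
    out->baud = (uint32_t)raw[0]
              | ((uint32_t)raw[1] << 8)
              | ((uint32_t)raw[2] << 16)
              | ((uint32_t)raw[3] << 24);
    out->stop_bits = raw[4];
    out->parity    = raw[5];
    out->data_bits = raw[6];
}

/* Build the 7 bytes returned in the GET_LINE_CODING data stage. */
void line_coding_encode(const line_coding_t *in, uint8_t raw[LINE_CODING_LEN])
{
    assert(raw != NULL && in != NULL);
    raw[0] = (uint8_t)(in->baud & 0xFFu);
    raw[1] = (uint8_t)((in->baud >> 8) & 0xFFu);
    raw[2] = (uint8_t)((in->baud >> 16) & 0xFFu);
    raw[3] = (uint8_t)((in->baud >> 24) & 0xFFu);
    raw[4] = in->stop_bits;
    raw[5] = in->parity;
    raw[6] = in->data_bits;
}
```

Assembling the fields byte by byte keeps the code independent of compiler padding and host endianness. A virtual COM port that ignores the settings still needs to store whatever the host last wrote, so that GET_LINE_CODING returns it unchanged.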

Bare-Metal USB Design Generator

Document your bare-metal USB design — target MCU, peripheral type, PMA size, endpoint plan, device classes, and implementation notes. Download as Word, Excel, PDF, or PPTX for design documentation or learning records.

Conclusion & Next Steps

Implementing bare-metal USB on an STM32 FSDEV peripheral is one of the most technically demanding exercises in embedded firmware development. After completing this article and its exercises, you now understand:

  • The STM32 FSDEV register architecture — USB_EPnR, USB_CNTR, USB_ISTR, USB_DADDR, USB_BTABLE and their precise semantics, including the XOR-based write protocol for EPnR.
  • PMA management: the 16-bit-wide SRAM, the Buffer Descriptor Table layout, the 32-bit CPU access stride on F1 (F0 parts pack the 16-bit words contiguously), and the COUNT_RX block encoding.
  • The EP0 control state machine — five states, the SETUP-DATA-STATUS phases, ZLP rules, and the stall mechanism for unsupported requests.
  • The SetAddress timing trap — the single most common bare-metal USB bug, now permanently understood.
  • Bulk endpoint operation — DTOG, STAT_TX/RX, CTR flags, and the PMA copy path for TX and RX.
  • Why TinyUSB is the right choice for production — it correctly handles all of the above plus DTOG recovery, suspend/resume, all standard requests, class drivers, and multi-MCU portability. The bare-metal path exists to give you the mental model that makes you a better user of TinyUSB — not to replace it.
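
The COUNT_RX block encoding from the list above is easy to get wrong, so here is a compact recap as a host-testable sketch. The function names are hypothetical. The encoder assumes the classic FSDEV limits (2-byte blocks up to 62 bytes, 32-byte blocks up to 512 bytes), and the offset helper assumes the F1-style layout in which each 16-bit PMA word occupies 4 bytes of CPU address space.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Encode BL_SIZE (bit 15) and NUM_BLOCK (bits 14:10) of COUNTn_RX for a
 * receive buffer of at least `size` bytes.
 *   BL_SIZE = 0: blocks of 2 bytes,  NUM_BLOCK = number of blocks (1..31)
 *   BL_SIZE = 1: blocks of 32 bytes, NUM_BLOCK = number of blocks - 1
 */
uint16_t pma_count_rx_encode(uint16_t size)
{
    assert(size >= 1 && size <= 512);
    if (size <= 62) {
        uint16_t blocks = (uint16_t)((size + 1u) / 2u);   /* round up to even */
        return (uint16_t)(blocks << 10);
    } else {
        uint16_t blocks = (uint16_t)((size + 31u) / 32u); /* round up to 32 */
        return (uint16_t)(0x8000u | ((blocks - 1u) << 10));
    }
}

/* Number of bytes actually received: the low 10 bits of COUNTn_RX. */
uint16_t pma_count_rx_received(uint16_t count_rx_reg)
{
    return (uint16_t)(count_rx_reg & 0x03FFu);
}

/* F1-style stride (assumption): PMA byte offset -> CPU byte offset. */
uint32_t pma_cpu_offset_f1(uint16_t pma_offset)
{
    return (uint32_t)pma_offset * 2u;
}
```

With this encoding a 64-byte bulk buffer yields 0x8400 and an 8-byte buffer yields 0x1000. The upper bits are written once at configuration time, while the low 10 bits are updated by hardware after every OUT packet.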

Next in the Series

In Part 16: Security in USB, we examine USB from an adversarial perspective: BadUSB attacks (malicious firmware on commodity USB devices), USB keystroke injection, device authentication mechanisms (USB PD authentication, DFU security), secure firmware update over USB, and practical USB firewall solutions for both end users and embedded system designers. Understanding USB security is essential before any USB-connected product reaches market.
