Series Context: This is Part 13 of 17 in the USB Development Mastery series. Parts 1–12 covered USB fundamentals through advanced topics including DFU and OTG. This part is dedicated entirely to performance — understanding the theoretical ceiling, then systematically approaching it through DMA, double buffering, and careful data path design.
| Part | Title | Topics | Status |
|---|---|---|---|
| 1 | USB Fundamentals | USB system architecture, transfer types, host/device model, protocol stack | Completed |
| 2 | Electrical & Hardware Layer | D+/D- signalling, pull-ups, connectors, USB-C, STM32 USB peripherals | Completed |
| 3 | Protocol & Enumeration | Enumeration sequence, USB packets, descriptors, endpoint concepts | Completed |
| 4 | USB Device Classes | HID, CDC, MSC, MIDI, Audio, composite devices, vendor class | Completed |
| 5 | TinyUSB Deep Dive | Stack architecture, execution model, STM32 integration, descriptor callbacks | Completed |
| 6 | CDC Virtual COM Port | CDC class, bulk transfers, printf over USB, baud rate handling | Completed |
| 7 | HID Keyboard & Mouse | HID descriptors, report format, keyboard/mouse/gamepad implementation | Completed |
| 8 | USB Mass Storage | MSC class, SCSI commands, FATFS integration, RAM disk | Completed |
| 9 | Composite Devices | Multiple classes, IAD descriptor, CDC+HID, CDC+MSC | Completed |
| 10 | Debugging USB | Wireshark capture, protocol analyser, enumeration debugging, common failures | Completed |
| 11 | RTOS + USB Integration | FreeRTOS + TinyUSB, task priorities, thread-safe communication | Completed |
| 12 | Advanced USB Topics | UAC2 audio, DFU bootloader, OTG host mode, hubs, suspend, USB PD, SuperSpeed | Completed |
| 13 | Performance & Optimisation | DMA, zero-copy buffers, throughput maximisation, latency tuning, benchmarking | You Are Here |
| 14 | Custom USB Class Drivers | Vendor class, writing descriptors, OS driver interaction | Upcoming |
| 15 | Bare-Metal USB | Direct register programming, writing USB stack from scratch, PHY timing | Upcoming |
| 16 | Security in USB | BadUSB attacks, device authentication, secure firmware, USB firewall | Upcoming |
| 17 | USB Hardware Design | PCB layout, differential pairs, impedance matching, EMI, USB-C PD | Upcoming |
USB Throughput Theory
Before optimising USB performance, you need an accurate mental model of where the theoretical ceiling is and why measured throughput always falls below it. The gap between "480 Mbps" and what your device actually delivers is not a bug — it is a consequence of USB's protocol overhead, shared-bus scheduling, and the host's polling behaviour.
Gross Bandwidth vs Net Throughput
The raw bit rates advertised for USB speeds are gross figures that include NRZI encoding, bit stuffing, sync fields, PIDs, CRCs, and inter-packet gaps. The actual user data throughput is always lower. For bulk transfers specifically:
| USB Speed | Gross Bit Rate | Theoretical Net Bulk | Achievable Measured | Overhead Reason |
|---|---|---|---|---|
| Full Speed | 12 Mbps | ~1.2 MB/s | 0.8–1.0 MB/s | Token/handshake packets, SOF, ACK/NAK |
| High Speed (CDC) | 480 Mbps | ~53 MB/s | 25–38 MB/s | Microframe overhead, host scheduling, CDC buffering |
| High Speed (vendor bulk class) | 480 Mbps | ~53 MB/s | 35–42 MB/s | Reduced CDC overhead, direct bulk, DMA |
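The "theoretical net bulk" column can be approximated from first principles: divide the gross bit rate by the bits one transaction occupies on the wire. A minimal sketch — the per-transaction overhead figure used below is a rough approximation, not a spec-exact value:

```c
#include <stdint.h>

/* Approximate net bulk throughput in bytes/s, ignoring bus turnaround
 * and scheduling gaps. overhead_bits covers sync, PID, token, CRC and
 * handshake fields per transaction (approximate). */
static uint64_t net_bulk_bytes_per_s(uint64_t gross_bps,
                                     uint32_t payload_bytes,
                                     uint32_t overhead_bits)
{
    uint64_t bits_per_txn = (uint64_t)payload_bytes * 8u + overhead_bits;
    uint64_t txns_per_s   = gross_bps / bits_per_txn;
    return txns_per_s * payload_bytes;
}
```

With ~80 overhead bits per transaction, Full Speed works out to roughly 1.3 MB/s and High Speed to just under 60 MB/s — slightly above the ~53 MB/s figure in the table because microframe SOF and packing overhead are ignored here.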
Per-Microframe Budget
High Speed USB divides time into microframes of 125 µs, and each second contains 8,000 microframes. The theoretical maximum bulk data per second is calculated from the fraction of each microframe available to bulk transactions after control and isochronous reservations. Bulk gets whatever bandwidth remains after higher-priority transfers have been scheduled.
```c
/*
 * HS USB per-microframe budget calculation:
 *
 * Total HS bandwidth:     480,000,000 bits/s
 * Microframes per second: 8,000
 * Bits per microframe:    480,000,000 / 8,000 = 60,000 bits = 7,500 bytes
 *
 * Per-microframe overhead:
 *   SOF packet:        ~8 bytes
 *   Token + handshake: ~4 bytes per transaction
 *   Inter-packet gaps: ~5 bytes equivalent
 *
 * With a single 512-byte bulk packet per microframe:
 *   Net data:       512 bytes
 *   Net throughput: 512 * 8,000 = ~4 MB/s (one transaction/microframe)
 *
 * With host pipelining (multiple IN transactions per microframe):
 *   Host issues further IN tokens without waiting for software.
 *   Up to ~13 512-byte transactions fit in a 125 us microframe:
 *   theoretical ceiling 512 * 13 * 8,000 = ~53 MB/s.
 *   Real hosts typically schedule ~7: 512 * 7 * 8,000 = ~28 MB/s.
 *
 * Reaching 35-42 MB/s requires:
 *   - Device always has data ready (zero NAK rate)
 *   - DMA feeding endpoint buffer with no CPU intervention
 *   - 512-byte wMaxPacketSize (not 64-byte FS legacy)
 *   - Host driver pipeline depth >= 4 outstanding requests
 */
```
SOF Overhead at Full Speed: Each SOF packet is 32 bits (sync + PID + frame number + CRC5) — roughly 2.7 µs per 1 ms frame, or about 0.3% overhead. The real FS ceiling loss comes from per-transaction packets: a bulk IN transaction carries an IN token (32 bits), a DATA0 packet (8-bit sync + 8-bit PID + 64×8 data bits + 16-bit CRC = 544 bits), and an ACK (16 bits) — 592 bits on the wire for 512 bits of payload. That is ~86% efficiency, giving a ceiling of 12 Mbps × 86% / 8 ≈ 1.3 MB/s. Bit stuffing and bus scheduling further reduce the sustained rate to roughly 1 MB/s for a single bulk endpoint.
Endpoint Buffer Size Impact
The single most impactful configuration change for USB throughput is packet size. Moving from Full Speed's 64-byte maximum bulk packet to High Speed's 512-byte maximum is a direct 8× throughput multiplier for the same transaction count. But even within a given speed, software buffer sizes determine how efficiently the endpoint can stream data.
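The amortisation effect is easy to quantify: the per-transaction overhead is roughly fixed, so protocol efficiency rises as payload grows. A sketch (the ~80-bit overhead figure is an approximation):

```c
/* Protocol efficiency of one bulk transaction: payload bits divided
 * by total bits on the wire, for a fixed per-transaction overhead. */
static double bulk_efficiency(unsigned payload_bytes, unsigned overhead_bits)
{
    double payload_bits = payload_bytes * 8.0;
    return payload_bits / (payload_bits + overhead_bits);
}
```

A 64-byte packet spends ~13% of its wire time on overhead (512/592 ≈ 0.86), a 512-byte packet only ~2% (4096/4176 ≈ 0.98) — so larger packets win twice: more payload per transaction and less protocol tax per byte.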
Max Bulk Packet Sizes by Speed
| USB Speed | Max Bulk Packet | Max Interrupt Packet | Max Isochronous Packet | Control EP0 Max |
|---|---|---|---|---|
| Low Speed (1.5 Mbps) | N/A (no bulk) | 8 bytes | N/A | 8 bytes |
| Full Speed (12 Mbps) | 64 bytes | 64 bytes | 1023 bytes | 8, 16, 32, or 64 bytes |
| High Speed (480 Mbps) | 512 bytes | 1024 bytes | 1024 bytes (×3 = 3072) | 64 bytes |
TinyUSB Buffer Size Configuration
```c
/* tusb_config.h — critical buffer size settings */

/* CDC endpoint buffer — determines how much data can be queued. */
/* FS: use 512 (8 x 64-byte packets). HS: use 4096 (8 x 512-byte). */
#define CFG_TUD_CDC_EP_BUFSIZE   512            /* FS default */
/* #define CFG_TUD_CDC_EP_BUFSIZE 4096 */       /* HS recommended */

/* CDC RX/TX FIFO sizes — application-side buffers. */
/* Should be 4-8x the EP buffer size to absorb processing bursts. */
#define CFG_TUD_CDC_RX_BUFSIZE   (4 * CFG_TUD_CDC_EP_BUFSIZE)
#define CFG_TUD_CDC_TX_BUFSIZE   (4 * CFG_TUD_CDC_EP_BUFSIZE)

/* MSC buffer size — must match or exceed the SCSI READ(10) block size. */
/* SDMMC sector = 512 bytes. Use 4096 for 8-sector read bursts. */
#define CFG_TUD_MSC_EP_BUFSIZE   512            /* FS */
/* #define CFG_TUD_MSC_EP_BUFSIZE 4096 */       /* HS, 8 sectors/xfer */

/* Vendor bulk endpoint buffers */
#define CFG_TUD_VENDOR_EPSIZE      512          /* HS max packet size */
#define CFG_TUD_VENDOR_RX_BUFSIZE  (16 * 512)   /* 8 KB RX FIFO */
#define CFG_TUD_VENDOR_TX_BUFSIZE  (16 * 512)   /* 8 KB TX FIFO */

/* Impact of going from 64-byte to 512-byte packets, at the same
 * (illustrative, host-dependent) transaction rate:
 *   64-byte:  1,000 transactions/s * 64  = 64 KB/s
 *   512-byte: 1,000 transactions/s * 512 = 512 KB/s
 * Improvement: 8x — purely from the wMaxPacketSize change. */
```
Double Buffering Bulk Endpoints
The fundamental throughput bottleneck in a single-buffered USB endpoint is the dead time between transactions: after the host reads buffer A, there is a gap before buffer A is refilled and presented for the next IN transaction. During this gap, the host issues an IN token and receives a NAK — wasted bus time. Double buffering eliminates this gap by maintaining two alternating buffers: while the host reads buffer A, the device fills buffer B. When the host finishes buffer A, buffer B is immediately available with fresh data.
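Before mapping the scheme onto any particular peripheral, it helps to see the alternation as plain software state. A hardware-agnostic sketch (names and buffer size are illustrative):

```c
#include <stdint.h>
#include <string.h>

#define PP_BUF_SIZE 512

typedef struct {
    uint8_t  buf[2][PP_BUF_SIZE];
    uint16_t len[2];
    uint8_t  fill_idx; /* buffer the producer fills next */
} pingpong_t;

/* Producer side: fill the free buffer while the other is in flight. */
static void pp_fill(pingpong_t *pp, const uint8_t *data, uint16_t len)
{
    memcpy(pp->buf[pp->fill_idx], data, len);
    pp->len[pp->fill_idx] = len;
}

/* When the previous buffer has been consumed: hand the freshly filled
 * buffer over and flip so the producer refills the other one.
 * Returns the index of the buffer now being transmitted. */
static uint8_t pp_swap(pingpong_t *pp)
{
    uint8_t ready = pp->fill_idx;
    pp->fill_idx ^= 1u;
    return ready;
}
```

The STM32 FSDEV hardware described below performs this flip in silicon: its SW_BUF bit plays the role of `fill_idx`.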
STM32 FSDEV Double-Buffer Mode
The STM32 USB Full Speed Device (FSDEV) peripheral — used on STM32F0, F1, F3, L0, L1 series — supports hardware double-buffering for bulk endpoints via the USB_EP_KIND bit in the endpoint register. When set, the endpoint uses two separate packet buffers in USB SRAM, toggled automatically by hardware after each successful transaction.
```c
/* STM32 FSDEV double-buffer bulk endpoint setup */

/* USB BTABLE (Buffer Descriptor Table) in USB SRAM. */
/* Each endpoint has 4 x 16-bit entries: */
/*   BUF0_ADDR  — buffer 0 base address in USB SRAM */
/*   BUF0_COUNT — TX: bytes to send; RX: buffer capacity */
/*   BUF1_ADDR  — buffer 1 base address */
/*   BUF1_COUNT — same as above for buffer 1 */
#define EP_BUF0_ADDR 0x0040 /* USB SRAM offset for buffer 0 */
#define EP_BUF1_ADDR 0x0080 /* USB SRAM offset for buffer 1 */
#define EP_BUF_SIZE  64     /* FS max packet size */

void setup_double_buffer_bulk_ep(uint8_t ep_num) {
    /* For a double-buffered TX endpoint, the "RX" BTABLE fields are */
    /* repurposed to describe buffer 1. */
    PCD_SET_EP_TX_ADDRESS(USB, ep_num, EP_BUF0_ADDR); /* buffer 0 */
    PCD_SET_EP_RX_ADDRESS(USB, ep_num, EP_BUF1_ADDR); /* buffer 1 */
    /* Set USB_EP_KIND bit to enable double-buffer mode */
    PCD_SET_EP_KIND(USB, ep_num);
    /* Initialise both buffer counts to zero */
    PCD_SET_EP_TX_CNT(USB, ep_num, 0);
    PCD_SET_EP_RX_CNT(USB, ep_num, 0);
    /* Enable the double-buffered bulk TX endpoint */
    PCD_SET_EP_TX_STATUS(USB, ep_num, USB_EP_TX_VALID);
}

/* Hardware buffer-toggle rule: */
/* After each IN transaction, hardware flips the SW_BUF bit (bit 14, */
/* DTOG_RX, which doubles as SW_BUF for double-buffered TX). */
/* Firmware checks SW_BUF to know which buffer to fill next. */
void fill_next_double_buffer(uint8_t ep_num, uint8_t *data, uint16_t len) {
    uint32_t sw_buf = (PCD_GET_ENDPOINT(USB, ep_num) >> 14) & 1u;
    if (sw_buf == 0) {
        /* Fill buffer 0 — its count lives in the TX count field */
        USB_WritePMA(USB, data, EP_BUF0_ADDR, len);
        PCD_SET_EP_TX_CNT(USB, ep_num, len);
    } else {
        /* Fill buffer 1 — its count lives in the RX count field */
        USB_WritePMA(USB, data, EP_BUF1_ADDR, len);
        PCD_SET_EP_RX_CNT(USB, ep_num, len);
    }
}

/* TinyUSB transparency: */
/* TinyUSB's stm32_fsdev port driver handles double-buffering */
/* automatically for bulk endpoints. No application changes */
/* are needed — the throughput improvement is automatic. */
/* On STM32 OTG cores (F4, H7), deep TX FIFOs provide equivalent */
/* pipelining without explicit double-buffer configuration. */
```
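On the OTG side, the equivalent tuning knob is how the shared endpoint FIFO RAM is split: the OTG_FS core has 320 32-bit words (1.25 KB) shared between one global RX FIFO and the per-IN-endpoint TX FIFOs, and a deep bulk-IN TX FIFO is what buys the pipelining. A sizing sketch (the split shown is illustrative, not a universal recommendation):

```c
#include <stdint.h>

#define OTG_FS_FIFO_WORDS 320u /* 1.25 KB of shared FIFO RAM on OTG_FS */

/* Check that a planned FIFO split fits: one global RX FIFO plus one
 * TX FIFO per IN endpoint, all sized in 32-bit words. */
static int fifo_plan_fits(uint16_t rx_words,
                          const uint16_t *tx_words, unsigned n_tx)
{
    uint32_t total = rx_words;
    for (unsigned i = 0; i < n_tx; i++)
        total += tx_words[i];
    return total <= OTG_FS_FIFO_WORDS;
}
```

For example RX = 128 words, EP0 TX = 16 words, bulk-IN TX = 128 words uses 272 of 320: the 128-word (512-byte) bulk TX FIFO holds eight 64-byte FS packets, so the core keeps transmitting while firmware refills.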
DMA for Zero-Copy Transfers
The biggest CPU-side bottleneck in high-throughput USB transfers is the memory copy between the application buffer and the USB peripheral's packet memory area (PMA/SRAM). On a 168 MHz Cortex-M4 without DMA, a memcpy() of 512 bytes to USB SRAM consumes approximately 3–4 µs — enough time for the host to issue an IN token and receive a NAK, wasting a full transaction slot. DMA eliminates this cost entirely: the USB DMA engine reads directly from the application buffer in main SRAM, with zero CPU involvement.
STM32 OTG DMA Mode (OTG_HS Core)
```c
/* Enable USB OTG DMA mode. Note: only the OTG_HS core (STM32F4's
 * OTG_HS, STM32H7) has an internal DMA engine — OTG_FS does not. */
void usb_otg_dma_enable(void) {
    /* Enable DMA in the AHB Global Configuration Register */
    USB_OTG_HS->GAHBCFG |= USB_OTG_GAHBCFG_DMAEN;
    /* Set burst type — INCR4 balances AHB bus utilisation */
    USB_OTG_HS->GAHBCFG &= ~USB_OTG_GAHBCFG_HBSTLEN_Msk;
    USB_OTG_HS->GAHBCFG |= USB_OTG_GAHBCFG_HBSTLEN_1; /* INCR4 */
}

/* CFG_TUSB_MEM_SECTION places buffers in DMA-accessible SRAM. */
/* On STM32H7: USB DMA can access SRAM1/SRAM2 but NOT DTCM RAM. */
/* Place DMA buffers explicitly to avoid silent DMA failures. */
CFG_TUSB_MEM_SECTION CFG_TUSB_MEM_ALIGN
static uint8_t cdc_tx_dma_buf[4096];
CFG_TUSB_MEM_SECTION CFG_TUSB_MEM_ALIGN
static uint8_t cdc_rx_dma_buf[4096];

/* Cortex-M7 cache coherency for DMA transfers: */
/* STM32H7 has 16 KB L1 D-cache enabled by default in CubeIDE. */
/* CPU writes land in the cache — DMA reading RAM sees stale data. */
void prepare_tx_for_dma(const uint8_t *src, uint32_t len) {
    memcpy(cdc_tx_dma_buf, src, len);
    /* Clean D-cache: write dirty cache lines back to SRAM */
    SCB_CleanDCache_by_Addr((uint32_t *)cdc_tx_dma_buf,
                            (int32_t)((len + 31u) & ~31u));
}

void invalidate_rx_after_dma(uint32_t len) {
    /* Invalidate D-cache: force the CPU to re-read from SRAM. */
    /* Must be called BEFORE reading DMA-received data. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)cdc_rx_dma_buf,
                                 (int32_t)((len + 31u) & ~31u));
}
```
Measured Throughput: CPU Copy vs DMA
| Configuration | MCU | USB Speed | CPU Copy MB/s | DMA MB/s | Improvement |
|---|---|---|---|---|---|
| CDC Bulk IN | STM32F407 @ 168 MHz | FS (12 Mbps) | 0.95 | 1.02 | ~7% (bus-limited) |
| CDC Bulk IN | STM32F407 @ 168 MHz | HS (480 Mbps) | 18 | 32 | ~78% |
| Vendor Bulk IN | STM32H743 @ 480 MHz | HS (480 Mbps) | 26 | 40 | ~54% |
| MSC Read10 | STM32F407 @ 168 MHz | HS (480 Mbps) | 14 | 22 | ~57% |

The STM32F407 High Speed rows assume the OTG_HS core with an external ULPI PHY; the OTG_FS core has no internal DMA engine, so the DMA column applies to OTG_HS only.
DMA at Full Speed: For Full Speed USB (12 Mbps), DMA provides minimal throughput benefit because the USB bus itself is the bottleneck — the CPU keeps up with 64-byte packets at 1 MB/s without effort. DMA becomes essential at High Speed where the USB bus can consume data faster than a CPU memcpy can supply it. If your application runs at Full Speed, focus on reducing NAK rates and ensuring the endpoint buffer is always pre-filled before the host polls it.
Measuring USB Throughput
Accurate throughput measurement requires discipline on both sides. Exclude connection setup time from the measurement window and measure over a multi-second window to average out bus scheduling noise. Always warm up the connection by discarding the first 100 KB before starting the timer.
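That discipline can be encoded directly in the measurement code. A sketch of a warmup-aware throughput accumulator — the 100 KB warmup figure mirrors the advice above; names are illustrative:

```c
#include <stdint.h>

#define WARMUP_BYTES (100u * 1024u) /* discard the first 100 KB */

typedef struct {
    uint64_t bytes;       /* bytes counted inside the window */
    uint32_t warmup_left; /* bytes still to discard          */
    uint64_t t_start_ms;  /* set once warmup completes       */
} bench_t;

static void bench_init(bench_t *b)
{
    b->bytes = 0; b->warmup_left = WARMUP_BYTES; b->t_start_ms = 0;
}

/* Feed every transferred chunk; returns 1 once timing has started. */
static int bench_feed(bench_t *b, uint32_t len, uint64_t now_ms)
{
    if (b->warmup_left > 0) {
        uint32_t skip = len < b->warmup_left ? len : b->warmup_left;
        b->warmup_left -= skip;
        len -= skip;
        if (b->warmup_left == 0)
            b->t_start_ms = now_ms; /* timer starts after warmup */
    }
    b->bytes += len;
    return b->warmup_left == 0;
}

/* Average throughput in bytes/s over the window since warmup ended. */
static uint64_t bench_rate_bps(const bench_t *b, uint64_t now_ms)
{
    uint64_t dt = now_ms - b->t_start_ms;
    return dt ? (b->bytes * 1000u) / dt : 0;
}
```

The same bookkeeping works on either end of the link; on the device, `now_ms` would come from a millisecond tick such as HAL_GetTick().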
Device Firmware: TX Benchmark
```c
/* Continuous TX flood — fill the CDC buffer every main-loop iteration */
static uint8_t tx_buf[4096];

void usb_tx_benchmark_task(void) {
    if (!tud_cdc_connected()) return;
    /* Only write when a full block fits — maximises packet fill */
    if (tud_cdc_write_available() >= sizeof(tx_buf)) {
        tud_cdc_write(tx_buf, sizeof(tx_buf));
        tud_cdc_write_flush();
    }
}
```
Host-side pyserial benchmark (`pip install pyserial`; the port name is an example):
```python
import serial, time

s = serial.Serial('COM12', 115200, timeout=1)  # baud rate is ignored by CDC
s.reset_input_buffer()
for _ in range(25):                    # warmup: discard ~100 KB
    s.read(4096)
n = 0
t = time.perf_counter()
while time.perf_counter() - t < 10:    # 10-second measurement window
    n += len(s.read(4096))
print(f'{n / (time.perf_counter() - t) / 1e6:.2f} MB/s')
```
Common measurement pitfalls: (1) measuring from first byte rather than after warmup — initial latency inflates the time denominator; (2) including enumeration in the window; (3) measuring RTT (round-trip) for a one-directional benchmark.
TX Path Optimization (Device to Host)
The TX path — data flowing from device firmware to the USB host — has several common implementation patterns that severely limit throughput. Understanding the TinyUSB CDC TX buffer model and the correct flush strategy is the foundation of TX path optimisation.
CDC Write Buffer Strategy
TinyUSB CDC maintains an internal FIFO between application writes and the USB endpoint. The key insight: tud_cdc_write() copies data into this FIFO but does not immediately initiate a USB transfer. The transfer is initiated when either (a) tud_cdc_write_flush() is called, or (b) the FIFO fills up. Calling tud_cdc_write_flush() after every byte is the single most common throughput killer.
```c
/* WRONG: flush after every small write — catastrophic for throughput */
void bad_send_data(const uint8_t *data, uint32_t len) {
    for (uint32_t i = 0; i < len; i++) {
        tud_cdc_write(&data[i], 1);
        tud_cdc_write_flush(); /* USB transaction for EVERY BYTE */
    }
}

/* CORRECT: batch into large writes, flush once at the end */
void good_send_data(const uint8_t *data, uint32_t len) {
    uint32_t written = 0;
    while (written < len) {
        uint32_t avail = tud_cdc_write_available();
        if (avail == 0) {
            tud_task(); /* Process USB events to drain the TX FIFO */
            continue;
        }
        uint32_t chunk = (len - written < avail) ? len - written : avail;
        tud_cdc_write(data + written, chunk);
        written += chunk;
    }
    tud_cdc_write_flush(); /* Single flush at the end */
}

/* BEST: circular buffer + DMA feeding the CDC write path. */
/* ADC DMA fills ring_buf[] continuously (circular mode); the USB */
/* task drains ring_buf into the CDC TX FIFO. Size must be a power */
/* of two for the index masks below. */
#define RING_BUF_SIZE (16 * 1024)
static uint8_t  ring_buf[RING_BUF_SIZE];
static uint32_t ring_head = 0; /* DMA write position */
static uint32_t ring_tail = 0; /* USB read position */

/* DMA callbacks: half-complete means the first half is ready (head */
/* at the midpoint); transfer-complete means the second half is */
/* ready and the DMA write pointer has wrapped back to 0. */
void adc_dma_half_cb(void) { ring_head = RING_BUF_SIZE / 2; }
void adc_dma_full_cb(void) { ring_head = 0; }

void usb_tx_drain_task(void) {
    if (!tud_cdc_connected()) return;
    uint32_t avail = tud_cdc_write_available();
    if (avail == 0) return;
    uint32_t head  = ring_head; /* Atomic 32-bit read on Cortex-M */
    uint32_t bytes = (head - ring_tail + RING_BUF_SIZE) & (RING_BUF_SIZE - 1u);
    if (bytes == 0) return;
    uint32_t chunk = bytes < avail ? bytes : avail;
    /* Handle wrap-around */
    if (ring_tail + chunk <= RING_BUF_SIZE) {
        tud_cdc_write(ring_buf + ring_tail, chunk);
    } else {
        uint32_t first = RING_BUF_SIZE - ring_tail;
        tud_cdc_write(ring_buf + ring_tail, first);
        tud_cdc_write(ring_buf, chunk - first);
    }
    ring_tail = (ring_tail + chunk) & (RING_BUF_SIZE - 1u);
    tud_cdc_write_flush();
}
```
The tud_cdc_write_available() function returns remaining space in the CDC TX FIFO. For maximum throughput, keep the TX FIFO at least 50% full at all times by writing large chunks rather than byte-by-byte.
RX Path Optimization (Host to Device)
The RX path — data flowing from the USB host into the device firmware — has its own set of bottlenecks. The most common mistake is reading data one byte at a time from the CDC RX buffer, which causes the TinyUSB stack to generate a NAK for incoming packets until the application has drained the previous data. This stalls the host's bulk transfer pipeline.
```c
/* WRONG: read byte-by-byte — stalls the pipeline with NAKs */
void bad_rx_handler(void) {
    uint8_t byte;
    while (tud_cdc_available()) {
        tud_cdc_read(&byte, 1);
        process_byte(byte); /* Slow per-byte processing */
    }
}

/* CORRECT: drain the available data in full-buffer reads */
void tud_cdc_rx_cb(uint8_t itf) {
    uint8_t buf[CFG_TUD_CDC_EP_BUFSIZE]; /* 512 for HS, 64 for FS */
    uint32_t count;
    /* Read ALL available bytes, a full buffer at a time */
    while ((count = tud_cdc_n_read(itf, buf, sizeof(buf))) > 0) {
        /* Hand off the entire chunk to the application */
        app_rx_handler(buf, count);
    }
}

/* If the application cannot keep up with the RX rate: */
/* a USB NAK (Negative Acknowledge) is the correct behaviour. */
/* TinyUSB automatically NAKs when its RX FIFO is full. */
/* Do NOT discard data — leave it queued and let back-pressure work. */

/* Flow control using a circular buffer */
#define APP_RX_BUF_SIZE 8192
static uint8_t  app_rx_buf[APP_RX_BUF_SIZE];
static uint32_t app_rx_head = 0;
static uint32_t app_rx_tail = 0;

void tud_cdc_rx_cb_buffered(uint8_t itf) {
    uint8_t pkt[512];
    /* Check ring space FIRST, then read at most that much. Bytes */
    /* left unread stay in TinyUSB's FIFO; when it fills, TinyUSB */
    /* NAKs the host's next OUT packet — back-pressure, no loss. */
    uint32_t free_space = (app_rx_tail - app_rx_head - 1 + APP_RX_BUF_SIZE)
                          % APP_RX_BUF_SIZE;
    uint32_t want = free_space < sizeof(pkt) ? free_space : sizeof(pkt);
    if (want == 0) return; /* ring full — process some data first */
    uint32_t count = tud_cdc_n_read(itf, pkt, want);
    /* Copy the packet into the ring buffer */
    for (uint32_t i = 0; i < count; i++) {
        app_rx_buf[app_rx_head] = pkt[i];
        app_rx_head = (app_rx_head + 1) % APP_RX_BUF_SIZE;
    }
}
```
When the application processing rate falls below the USB RX rate, TinyUSB's flow control kicks in automatically: the RX FIFO fills, and TinyUSB NAKs the host's next OUT data packet. The host retries — no data is lost. This back-pressure mechanism is correct behaviour, not a bug.
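The consumer side of such a ring needs matching drain logic. A sketch — clamped to the contiguous run so the caller gets a flat pointer with no extra copy (sizes and names are illustrative):

```c
#include <stdint.h>
#include <string.h>

#define APP_RX_BUF_SIZE 8192u
static uint8_t  app_rx_buf[APP_RX_BUF_SIZE];
static uint32_t app_rx_head; /* producer (USB callback) */
static uint32_t app_rx_tail; /* consumer (application)  */

static uint32_t app_rx_pending(void)
{
    return (app_rx_head - app_rx_tail + APP_RX_BUF_SIZE) % APP_RX_BUF_SIZE;
}

/* Copy up to max_len buffered bytes into dst; returns bytes copied.
 * Clamped to the contiguous run — call again to get the wrapped part. */
static uint32_t app_rx_drain(uint8_t *dst, uint32_t max_len)
{
    uint32_t n = app_rx_pending();
    if (n > max_len) n = max_len;
    uint32_t contig = APP_RX_BUF_SIZE - app_rx_tail;
    if (n > contig) n = contig;
    if (n) {
        memcpy(dst, &app_rx_buf[app_rx_tail], n);
        app_rx_tail = (app_rx_tail + n) % APP_RX_BUF_SIZE;
    }
    return n;
}
```

Draining in the main loop (or a dedicated task) frees ring space, which in turn lets the RX callback accept more data — completing the back-pressure chain from host to application.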
MSC Throughput Optimization
USB Mass Storage throughput is a chain of three components: USB bulk transfer rate, MCU processing (SCSI command decode), and storage medium read/write speed. Optimising only the USB link without addressing SDMMC throughput leaves the majority of performance gain on the table.
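Whether the stages overlap determines how the chain composes: if firmware reads a block from storage and only then transmits it, the per-byte times add; if the next read overlaps the current transmit (double buffering), the slowest stage alone sets the rate. A sketch of the arithmetic (rates in MB/s are illustrative):

```c
/* Two-stage pipeline throughput in MB/s.
 * Sequential (read a block, then send it): per-byte times add, so
 *   rate = 1 / (1/sd + 1/usb)
 * Overlapped (read the next block while the current one transmits):
 *   rate = min(sd, usb) */
static double pipeline_sequential(double sd_mbs, double usb_mbs)
{
    return 1.0 / (1.0 / sd_mbs + 1.0 / usb_mbs);
}

static double pipeline_overlapped(double sd_mbs, double usb_mbs)
{
    return sd_mbs < usb_mbs ? sd_mbs : usb_mbs;
}
```

With a 12 MB/s card and a 22 MB/s USB link, the sequential chain manages only ~7.8 MB/s while the overlapped one reaches the full 12 MB/s — which is why the pre-fetch/double-buffer idea later in this section matters.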
```c
/* MSC READ(10) callback — called when the host issues a READ(10). */
/* TinyUSB passes its own endpoint buffer (CFG_TUD_MSC_EP_BUFSIZE */
/* bytes) as `buffer`; it must be 4-byte aligned and live in */
/* DMA-accessible SRAM, e.g. when you override the placement: */
CFG_TUSB_MEM_SECTION CFG_TUSB_MEM_ALIGN
static uint8_t msc_buf[4096]; /* 8 x 512-byte sectors per transfer */

int32_t tud_msc_read10_cb(uint8_t lun, uint32_t lba,
                          uint32_t offset, void *buffer,
                          uint32_t bufsize) {
    (void)lun; (void)offset;
    /* Use SDMMC DMA for a zero-copy read into the MSC buffer. */
    /* hsd is the HAL SD handle generated by CubeMX. */
    if (HAL_SD_ReadBlocks_DMA(&hsd, (uint8_t *)buffer,
                              lba, bufsize / 512) != HAL_OK) {
        return -1;
    }
    /* Wait for transfer complete (or use the callback for async) */
    uint32_t t = HAL_GetTick();
    while (HAL_SD_GetCardState(&hsd) != HAL_SD_CARD_TRANSFER) {
        if (HAL_GetTick() - t > 500) return -1;
    }
    /* On Cortex-M7: invalidate cache before TinyUSB reads buffer */
    SCB_InvalidateDCache_by_Addr((uint32_t *)buffer,
                                 (int32_t)((bufsize + 31u) & ~31u));
    return (int32_t)bufsize;
}

/* SDMMC configuration for maximum throughput: */
/* STM32F407: SDIO peripheral, 4-bit wide bus, 48 MHz clock. */
/* Achievable: ~12 MB/s read from a Class 10 microSD. */
/* Combined USB HS + SDMMC: USB HS bulk can sustain ~22 MB/s, */
/* so the SD card at ~12 MB/s is the limiting factor. */

/* Pre-fetch optimisation: read the next block while the current */
/* one transmits. TinyUSB calls tud_msc_read10_cb per READ(10) */
/* chunk; pre-fetch is complex with TinyUSB but achievable with a */
/* double buffer. */
```
| MCU | USB Speed | Storage | Measured Read | Measured Write | Bottleneck |
|---|---|---|---|---|---|
| STM32F407 @ 168 MHz | HS 480 Mbps | SDMMC Class 10 | 10.5 MB/s | 8.2 MB/s | SD card |
| STM32H743 @ 480 MHz | HS 480 Mbps | SDMMC UHS-I | 22 MB/s | 18 MB/s | USB bulk pipelining |
| STM32F407 @ 168 MHz | FS 12 Mbps | SDMMC Class 10 | 0.95 MB/s | 0.85 MB/s | USB Full Speed |
| RP2040 @ 133 MHz | FS 12 Mbps | SPI Flash 80 MHz | 0.9 MB/s | 0.6 MB/s | USB Full Speed |
Profiling & Bottleneck Identification
When USB throughput is lower than expected, the bottleneck is in one of three places: the USB bus itself (too many NAKs), the MCU's processing pipeline (CPU too slow to feed the endpoint), or the storage/peripheral (SD card, ADC, memory bus). Identifying which one requires measurement — guessing leads to optimising the wrong thing.
GPIO Toggle Timing Method
```c
/* GPIO timing method: toggle a spare GPIO at callback entry/exit. */
/* Measure HIGH/LOW times with an oscilloscope or logic analyser. */
#define PROFILE_GPIO_PORT GPIOC
#define PROFILE_GPIO_PIN  GPIO_PIN_13

static inline void profile_high(void) {
    PROFILE_GPIO_PORT->BSRR = PROFILE_GPIO_PIN;
}
static inline void profile_low(void) {
    /* Reset via BSRR's upper half — the BRR register is absent on */
    /* some families (e.g. STM32F4). */
    PROFILE_GPIO_PORT->BSRR = (uint32_t)PROFILE_GPIO_PIN << 16;
}

/* In tud_cdc_rx_cb: */
void tud_cdc_rx_cb(uint8_t itf) {
    profile_high(); /* GPIO goes HIGH when the callback starts */
    uint8_t buf[512];
    uint32_t count = tud_cdc_n_read(itf, buf, sizeof(buf));
    app_process_rx(buf, count);
    profile_low();  /* GPIO goes LOW when the callback returns */
}

/* Oscilloscope interpretation: */
/*   HIGH duration = callback execution time */
/*   LOW duration between callbacks = time the USB bus was idle */
/*   If LOW duration is very short: the MCU is the bottleneck */
/*   If HIGH duration is very short: the USB bus is the bottleneck */

/* DWT cycle counter for sub-µs timing without a GPIO. */
/* Present on most Cortex-M3/M4/M7 parts (optional in the core */
/* spec, absent on M0/M0+). On some Cortex-M7 parts the DWT must */
/* first be unlocked by writing 0xC5ACCE55 to the lock access */
/* register at 0xE0001FB0. */
void dwt_init(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;
}

static inline uint32_t dwt_get_cycles(void) {
    return DWT->CYCCNT;
}
static inline uint32_t dwt_cycles_to_us(uint32_t cycles) {
    return cycles / (SystemCoreClock / 1000000u);
}

/* Usage: */
void measure_callback_duration(void) {
    uint32_t t0 = dwt_get_cycles();
    do_usb_operation();
    uint32_t elapsed_us = dwt_cycles_to_us(dwt_get_cycles() - t0);
    printf("Operation took %lu us\n", (unsigned long)elapsed_us);
}
```
A well-optimised High Speed USB device shows callback execution times (GPIO HIGH) of 5–15 µs for a 512-byte packet, with very short idle gaps. If idle gaps are long (50+ µs) even though data is waiting to be sent, the MCU's data path is too slow to refill the endpoint — DMA is needed. If callback times are short and the device is never the one stalling, yet throughput is still low, the bottleneck is USB host driver scheduling on the PC side.
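The interpretation rules above reduce to a duty-cycle calculation that firmware can perform itself from averaged GPIO or DWT measurements. A sketch — the threshold percentages are rule-of-thumb heuristics, not spec figures:

```c
#include <stdint.h>

typedef enum {
    BOTTLENECK_MCU,         /* callback busy almost all the time */
    BOTTLENECK_BUS_OR_HOST, /* bus mostly idle between callbacks */
    BOTTLENECK_BALANCED
} bottleneck_t;

/* Classify from averaged busy (GPIO HIGH) and idle (GPIO LOW)
 * times in microseconds. */
static bottleneck_t classify_bottleneck(uint32_t busy_us, uint32_t idle_us)
{
    uint32_t duty_pct = busy_us * 100u / (busy_us + idle_us);
    if (duty_pct > 85u) return BOTTLENECK_MCU;
    if (duty_pct < 15u) return BOTTLENECK_BUS_OR_HOST;
    return BOTTLENECK_BALANCED;
}
```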
Exercises
Exercise 1 (Beginner): Measure Baseline CDC Throughput
Using a Full Speed TinyUSB CDC device: (a) implement the TX flood firmware (usb_tx_benchmark_task()) that continuously fills the CDC TX buffer; (b) run the Python pyserial benchmark script on the host, measuring with a 10-second window and 100 KB warmup; (c) record the baseline throughput in MB/s; (d) change CFG_TUD_CDC_TX_BUFSIZE from 512 to 2048 bytes and remeasure — document the improvement; (e) try calling tud_cdc_write_flush() after every 64-byte write versus after every 512-byte write and compare the throughput difference.
Tags: CDC Throughput, Buffer Sizing, Benchmarking
Exercise 2 (Intermediate): Profile USB TX Callback with GPIO and Logic Analyser
Set up GPIO toggle profiling on your MCU: (a) configure a spare GPIO pin as fast push-pull output; (b) set the GPIO HIGH at the start of tud_cdc_rx_cb() and LOW at the end; (c) capture 10 ms of GPIO waveform on a logic analyser while running the loopback benchmark; (d) measure the average HIGH (callback active) and LOW (idle) times; (e) calculate the duty cycle — what percentage of time is the MCU processing vs idle?; (f) identify whether the bottleneck is USB bus latency or MCU processing by analysing the waveform pattern.
Tags: GPIO Profiling, Logic Analyser, Bottleneck Analysis
Exercise 3 (Advanced): Implement DMA-Driven ADC to USB Streaming at Maximum Throughput
Design a complete high-throughput streaming pipeline: (a) configure an ADC in continuous DMA mode on a Cortex-M4/M7 MCU, sampling at the maximum rate the USB can sustain (for FS: ~100 kSPS at 16-bit; for HS: ~2 MSPS); (b) use a ping-pong (double buffer) DMA setup — DMA fills buffer A while USB sends buffer B; (c) implement cache flush in the DMA half-complete callback before TinyUSB reads buffer A; (d) measure achieved throughput and compare to the theoretical maximum; (e) add DWT cycle counter instrumentation to measure total latency from ADC sample to USB packet transmission; (f) identify and eliminate any idle gaps in the TX pipeline.
Tags: DMA Streaming, Ping-Pong Buffers, ADC to USB, Cache Coherency
Conclusion & Next Steps
Part 13 has given you a complete toolkit for USB performance analysis and optimisation. The key lessons to take forward:
- Know the ceiling. USB Full Speed bulk caps at ~1 MB/s. High Speed bulk can reach 35–42 MB/s with optimal buffering and DMA. Start by understanding what is theoretically possible.
- Packet size is the biggest single lever. Switching from 64-byte (FS) to 512-byte (HS) bulk packets delivers an 8× throughput gain with no code change — only hardware and descriptor changes required.
- DMA is essential at High Speed. At Full Speed, CPU memcpy keeps up easily. At High Speed, DMA eliminates the CPU copy bottleneck and unlocks the last 40–80% of available throughput.
- Flush strategy dominates CDC TX throughput. Calling tud_cdc_write_flush() after every byte reduces throughput from 18 MB/s to under 1 MB/s. Batch large writes and flush once at the end of each burst.
- Drain the RX FIFO completely in one call. Reading byte-by-byte via tud_cdc_read() stalls the USB pipeline. Always read all available bytes in a single tud_cdc_n_read() call.
- MSC performance is usually SD-card limited. USB HS can sustain 22+ MB/s; the bottleneck is almost always the storage medium. SDMMC DMA in 4-bit HS mode is essential for maximising MSC write throughput.
- Measure with GPIO toggles first. One GPIO toggle per callback entry/exit, captured on a logic analyser for 10 ms, immediately reveals whether the bottleneck is USB bus time or MCU processing time — no guessing required.
Next in the Series
In Part 14: Custom USB Class Drivers, we move beyond standard classes to design custom vendor-specific USB drivers. We will write a complete custom class descriptor set, implement Microsoft OS descriptors (WCID/BOS) for driver-free Windows operation, build a libusb-based host application, and explore when a custom class is the right choice versus extending a standard one.
Related Articles in This Series
Part 12: Advanced USB Topics
USB Audio Class 2.0, DFU bootloader design, OTG host mode, hub support, suspend/resume, USB PD, and SuperSpeed overview.
Part 6: CDC Virtual COM Port
CDC class fundamentals, bulk transfer mechanics, and the TinyUSB CDC API — the foundation for the TX/RX path optimisations in this article.
Part 14: Custom USB Class Drivers
Vendor-specific USB class implementation, Microsoft OS descriptors for Windows, and libusb host application development.