Series Context: This is Part 13 of 17 in the USB Development Mastery series. Parts 1–12 covered USB fundamentals through advanced topics including DFU and OTG. This part is dedicated entirely to performance — understanding the theoretical ceiling, then systematically approaching it through DMA, double buffering, and careful data path design.
| Part | Title | Topics | Status |
|---|---|---|---|
| 1 | USB Fundamentals | USB system architecture, transfer types, host/device model, protocol stack | Completed |
| 2 | Electrical & Hardware Layer | D+/D- signalling, pull-ups, connectors, USB-C, STM32 USB peripherals | Completed |
| 3 | Protocol & Enumeration | Enumeration sequence, USB packets, descriptors, endpoint concepts | Completed |
| 4 | USB Device Classes | HID, CDC, MSC, MIDI, Audio, composite devices, vendor class | Completed |
| 5 | TinyUSB Deep Dive | Stack architecture, execution model, STM32 integration, descriptor callbacks | Completed |
| 6 | CDC Virtual COM Port | CDC class, bulk transfers, printf over USB, baud rate handling | Completed |
| 7 | HID Keyboard & Mouse | HID descriptors, report format, keyboard/mouse/gamepad implementation | Completed |
| 8 | USB Mass Storage | MSC class, SCSI commands, FATFS integration, RAM disk | Completed |
| 9 | Composite Devices | Multiple classes, IAD descriptor, CDC+HID, CDC+MSC | Completed |
| 10 | Debugging USB | Wireshark capture, protocol analyser, enumeration debugging, common failures | Completed |
| 11 | RTOS + USB Integration | FreeRTOS + TinyUSB, task priorities, thread-safe communication | Completed |
| 12 | Advanced USB Topics | UAC2 audio, DFU bootloader, OTG host mode, hubs, suspend, USB PD, SuperSpeed | Completed |
| 13 | Performance & Optimisation | DMA, zero-copy buffers, throughput maximisation, latency tuning, benchmarking | You Are Here |
| 14 | Custom USB Class Drivers | Vendor class, writing descriptors, OS driver interaction | Upcoming |
| 15 | Bare-Metal USB | Direct register programming, writing USB stack from scratch, PHY timing | Upcoming |
| 16 | Security in USB | BadUSB attacks, device authentication, secure firmware, USB firewall | Upcoming |
| 17 | USB Hardware Design | PCB layout, differential pairs, impedance matching, EMI, USB-C PD | Upcoming |
USB Throughput Theory
Before optimising USB performance, you need an accurate mental model of where the theoretical ceiling is and why measured throughput always falls below it. The gap between "480 Mbps" and what your device actually delivers is not a bug — it is a consequence of USB's protocol overhead, shared-bus scheduling, and the host's polling behaviour.
Gross Bandwidth vs Net Throughput
The raw bit rates advertised for USB speeds are gross figures that include NRZI encoding, bit stuffing, sync fields, PIDs, CRCs, and inter-packet gaps. The actual user data throughput is always lower. For bulk transfers specifically:
| USB Speed | Gross Bit Rate | Theoretical Net Bulk | Achievable Measured | Overhead Reason |
|---|---|---|---|---|
| Full Speed | 12 Mbps | ~1.2 MB/s | 0.8–1.0 MB/s | Token/handshake packets, SOF, ACK/NAK |
| High Speed (CDC) | 480 Mbps | ~53 MB/s | 25–38 MB/s | Microframe overhead, host scheduling, CDC buffering |
| High Speed (vendor bulk class) | 480 Mbps | ~53 MB/s | 35–42 MB/s | Reduced CDC overhead, direct bulk, DMA |
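The "theoretical net bulk" column can be approximated from first principles: divide the gross bit rate by the bits one transaction occupies on the wire. A minimal sketch — the per-transaction overhead figure used below is a rough approximation, not a spec-exact value:

```c
#include <stdint.h>

/* Approximate net bulk throughput in bytes/s, ignoring bus turnaround
 * and scheduling gaps. overhead_bits covers sync, PID, token, CRC and
 * handshake fields per transaction (approximate). */
static uint64_t net_bulk_bytes_per_s(uint64_t gross_bps,
                                     uint32_t payload_bytes,
                                     uint32_t overhead_bits)
{
    uint64_t bits_per_txn = (uint64_t)payload_bytes * 8u + overhead_bits;
    uint64_t txns_per_s   = gross_bps / bits_per_txn;
    return txns_per_s * payload_bytes;
}
```

With ~80 overhead bits per transaction, Full Speed works out to roughly 1.3 MB/s and High Speed to just under 60 MB/s — slightly above the ~53 MB/s figure in the table because microframe SOF and packing overhead are ignored here.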
Per-Microframe Budget
High Speed USB divides time into microframes of 125 µs, and each second contains 8,000 microframes. The theoretical maximum bulk data per second is calculated from the fraction of each microframe available to bulk transactions after control and isochronous reservations. Bulk gets whatever bandwidth remains after higher-priority transfers have been scheduled.
```c
/*
 * HS USB per-microframe budget calculation:
 *
 * Total HS bandwidth:     480,000,000 bits/s
 * Microframes per second: 8,000
 * Bits per microframe:    480,000,000 / 8,000 = 60,000 bits = 7,500 bytes
 *
 * Per-microframe overhead:
 *   SOF packet:        ~8 bytes
 *   Token + handshake: ~4 bytes per transaction
 *   Inter-packet gaps: ~5 bytes equivalent
 *
 * With a single 512-byte bulk packet per microframe:
 *   Net data:       512 bytes
 *   Net throughput: 512 * 8,000 = ~4 MB/s (one transaction/microframe)
 *
 * With host pipelining (multiple IN transactions per microframe):
 *   Host issues further IN tokens without waiting for software.
 *   Up to ~13 512-byte transactions fit in a 125 us microframe:
 *   theoretical ceiling 512 * 13 * 8,000 = ~53 MB/s.
 *   Real hosts typically schedule ~7: 512 * 7 * 8,000 = ~28 MB/s.
 *
 * Reaching 35-42 MB/s requires:
 *   - Device always has data ready (zero NAK rate)
 *   - DMA feeding endpoint buffer with no CPU intervention
 *   - 512-byte wMaxPacketSize (not 64-byte FS legacy)
 *   - Host driver pipeline depth >= 4 outstanding requests
 */
```
SOF Overhead at Full Speed: Each SOF packet is 32 bits (sync + PID + frame number + CRC5) — roughly 2.7 µs per 1 ms frame, or about 0.3% overhead. The real FS ceiling loss comes from per-transaction packets: a bulk IN transaction carries an IN token (32 bits), a DATA0 packet (8-bit sync + 8-bit PID + 64×8 data bits + 16-bit CRC = 544 bits), and an ACK (16 bits) — 592 bits on the wire for 512 bits of payload. That is ~86% efficiency, giving a ceiling of 12 Mbps × 86% / 8 ≈ 1.3 MB/s. Bit stuffing and bus scheduling further reduce the sustained rate to roughly 1 MB/s for a single bulk endpoint.
Endpoint Buffer Size Impact
The single most impactful configuration change for USB throughput is packet size. Moving from Full Speed's 64-byte maximum bulk packet to High Speed's 512-byte maximum is a direct 8× throughput multiplier for the same transaction count. But even within a given speed, software buffer sizes determine how efficiently the endpoint can stream data.
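The amortisation effect is easy to quantify: the per-transaction overhead is roughly fixed, so protocol efficiency rises as payload grows. A sketch (the ~80-bit overhead figure is an approximation):

```c
/* Protocol efficiency of one bulk transaction: payload bits divided
 * by total bits on the wire, for a fixed per-transaction overhead. */
static double bulk_efficiency(unsigned payload_bytes, unsigned overhead_bits)
{
    double payload_bits = payload_bytes * 8.0;
    return payload_bits / (payload_bits + overhead_bits);
}
```

A 64-byte packet spends ~13% of its wire time on overhead (512/592 ≈ 0.86), a 512-byte packet only ~2% (4096/4176 ≈ 0.98) — so larger packets win twice: more payload per transaction and less protocol tax per byte.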
Max Bulk Packet Sizes by Speed
| USB Speed | Max Bulk Packet | Max Interrupt Packet | Max Isochronous Packet | Control EP0 Max |
|---|---|---|---|---|
| Low Speed (1.5 Mbps) | N/A (no bulk) | 8 bytes | N/A | 8 bytes |
| Full Speed (12 Mbps) | 64 bytes | 64 bytes | 1023 bytes | 8, 16, 32, or 64 bytes |
| High Speed (480 Mbps) | 512 bytes | 1024 bytes | 1024 bytes (×3 = 3072) | 64 bytes |
TinyUSB Buffer Size Configuration
```c
/* tusb_config.h — critical buffer size settings */

/* CDC endpoint buffer — determines how much data can be queued. */
/* FS: use 512 (8 x 64-byte packets). HS: use 4096 (8 x 512-byte). */
#define CFG_TUD_CDC_EP_BUFSIZE   512            /* FS default */
/* #define CFG_TUD_CDC_EP_BUFSIZE 4096 */       /* HS recommended */

/* CDC RX/TX FIFO sizes — application-side buffers. */
/* Should be 4-8x the EP buffer size to absorb processing bursts. */
#define CFG_TUD_CDC_RX_BUFSIZE   (4 * CFG_TUD_CDC_EP_BUFSIZE)
#define CFG_TUD_CDC_TX_BUFSIZE   (4 * CFG_TUD_CDC_EP_BUFSIZE)

/* MSC buffer size — must match or exceed the SCSI READ(10) block size. */
/* SDMMC sector = 512 bytes. Use 4096 for 8-sector read bursts. */
#define CFG_TUD_MSC_EP_BUFSIZE   512            /* FS */
/* #define CFG_TUD_MSC_EP_BUFSIZE 4096 */       /* HS, 8 sectors/xfer */

/* Vendor bulk endpoint buffers */
#define CFG_TUD_VENDOR_EPSIZE      512          /* HS max packet size */
#define CFG_TUD_VENDOR_RX_BUFSIZE  (16 * 512)   /* 8 KB RX FIFO */
#define CFG_TUD_VENDOR_TX_BUFSIZE  (16 * 512)   /* 8 KB TX FIFO */

/* Impact of going from 64-byte to 512-byte packets, at the same
 * (illustrative, host-dependent) transaction rate:
 *   64-byte:  1,000 transactions/s * 64  = 64 KB/s
 *   512-byte: 1,000 transactions/s * 512 = 512 KB/s
 * Improvement: 8x — purely from the wMaxPacketSize change. */
```
Double Buffering Bulk Endpoints
The fundamental throughput bottleneck in a single-buffered USB endpoint is the dead time between transactions: after the host reads buffer A, there is a gap before buffer A is refilled and presented for the next IN transaction. During this gap, the host issues an IN token and receives a NAK — wasted bus time. Double buffering eliminates this gap by maintaining two alternating buffers: while the host reads buffer A, the device fills buffer B. When the host finishes buffer A, buffer B is immediately available with fresh data.
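Before mapping the scheme onto any particular peripheral, it helps to see the alternation as plain software state. A hardware-agnostic sketch (names and buffer size are illustrative):

```c
#include <stdint.h>
#include <string.h>

#define PP_BUF_SIZE 512

typedef struct {
    uint8_t  buf[2][PP_BUF_SIZE];
    uint16_t len[2];
    uint8_t  fill_idx; /* buffer the producer fills next */
} pingpong_t;

/* Producer side: fill the free buffer while the other is in flight. */
static void pp_fill(pingpong_t *pp, const uint8_t *data, uint16_t len)
{
    memcpy(pp->buf[pp->fill_idx], data, len);
    pp->len[pp->fill_idx] = len;
}

/* When the previous buffer has been consumed: hand the freshly filled
 * buffer over and flip so the producer refills the other one.
 * Returns the index of the buffer now being transmitted. */
static uint8_t pp_swap(pingpong_t *pp)
{
    uint8_t ready = pp->fill_idx;
    pp->fill_idx ^= 1u;
    return ready;
}
```

The STM32 FSDEV hardware described below performs this flip in silicon: its SW_BUF bit plays the role of `fill_idx`.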
STM32 FSDEV Double-Buffer Mode
The STM32 USB Full Speed Device (FSDEV) peripheral — used on STM32F0, F1, F3, L0, L1 series — supports hardware double-buffering for bulk endpoints via the USB_EP_KIND bit in the endpoint register. When set, the endpoint uses two separate packet buffers in USB SRAM, toggled automatically by hardware after each successful transaction.
```c
/* STM32 FSDEV double-buffer bulk endpoint setup */

/* USB BTABLE (Buffer Descriptor Table) in USB SRAM. */
/* Each endpoint has 4 x 16-bit entries: */
/*   BUF0_ADDR  — buffer 0 base address in USB SRAM */
/*   BUF0_COUNT — TX: bytes to send; RX: buffer capacity */
/*   BUF1_ADDR  — buffer 1 base address */
/*   BUF1_COUNT — same as above for buffer 1 */
#define EP_BUF0_ADDR 0x0040 /* USB SRAM offset for buffer 0 */
#define EP_BUF1_ADDR 0x0080 /* USB SRAM offset for buffer 1 */
#define EP_BUF_SIZE  64     /* FS max packet size */

void setup_double_buffer_bulk_ep(uint8_t ep_num) {
    /* For a double-buffered TX endpoint, the "RX" BTABLE fields are */
    /* repurposed to describe buffer 1. */
    PCD_SET_EP_TX_ADDRESS(USB, ep_num, EP_BUF0_ADDR); /* buffer 0 */
    PCD_SET_EP_RX_ADDRESS(USB, ep_num, EP_BUF1_ADDR); /* buffer 1 */
    /* Set USB_EP_KIND bit to enable double-buffer mode */
    PCD_SET_EP_KIND(USB, ep_num);
    /* Initialise both buffer counts to zero */
    PCD_SET_EP_TX_CNT(USB, ep_num, 0);
    PCD_SET_EP_RX_CNT(USB, ep_num, 0);
    /* Enable the double-buffered bulk TX endpoint */
    PCD_SET_EP_TX_STATUS(USB, ep_num, USB_EP_TX_VALID);
}

/* Hardware buffer-toggle rule: */
/* After each IN transaction, hardware flips the SW_BUF bit (bit 14, */
/* DTOG_RX, which doubles as SW_BUF for double-buffered TX). */
/* Firmware checks SW_BUF to know which buffer to fill next. */
void fill_next_double_buffer(uint8_t ep_num, uint8_t *data, uint16_t len) {
    uint32_t sw_buf = (PCD_GET_ENDPOINT(USB, ep_num) >> 14) & 1u;
    if (sw_buf == 0) {
        /* Fill buffer 0 — its count lives in the TX count field */
        USB_WritePMA(USB, data, EP_BUF0_ADDR, len);
        PCD_SET_EP_TX_CNT(USB, ep_num, len);
    } else {
        /* Fill buffer 1 — its count lives in the RX count field */
        USB_WritePMA(USB, data, EP_BUF1_ADDR, len);
        PCD_SET_EP_RX_CNT(USB, ep_num, len);
    }
}

/* TinyUSB transparency: */
/* TinyUSB's stm32_fsdev port driver handles double-buffering */
/* automatically for bulk endpoints. No application changes */
/* are needed — the throughput improvement is automatic. */
/* On STM32 OTG cores (F4, H7), deep TX FIFOs provide equivalent */
/* pipelining without explicit double-buffer configuration. */
```
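On the OTG side, the equivalent tuning knob is how the shared endpoint FIFO RAM is split: the OTG_FS core has 320 32-bit words (1.25 KB) shared between one global RX FIFO and the per-IN-endpoint TX FIFOs, and a deep bulk-IN TX FIFO is what buys the pipelining. A sizing sketch (the split shown is illustrative, not a universal recommendation):

```c
#include <stdint.h>

#define OTG_FS_FIFO_WORDS 320u /* 1.25 KB of shared FIFO RAM on OTG_FS */

/* Check that a planned FIFO split fits: one global RX FIFO plus one
 * TX FIFO per IN endpoint, all sized in 32-bit words. */
static int fifo_plan_fits(uint16_t rx_words,
                          const uint16_t *tx_words, unsigned n_tx)
{
    uint32_t total = rx_words;
    for (unsigned i = 0; i < n_tx; i++)
        total += tx_words[i];
    return total <= OTG_FS_FIFO_WORDS;
}
```

For example RX = 128 words, EP0 TX = 16 words, bulk-IN TX = 128 words uses 272 of 320: the 128-word (512-byte) bulk TX FIFO holds eight 64-byte FS packets, so the core keeps transmitting while firmware refills.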
DMA for Zero-Copy Transfers
The biggest CPU-side bottleneck in high-throughput USB transfers is the memory copy between the application buffer and the USB peripheral's packet memory area (PMA/SRAM). On a 168 MHz Cortex-M4 without DMA, a memcpy() of 512 bytes to USB SRAM consumes approximately 3–4 µs — enough time for the host to issue an IN token and receive a NAK, wasting a full transaction slot. DMA eliminates this cost entirely: the USB DMA engine reads directly from the application buffer in main SRAM, with zero CPU involvement.
STM32 OTG DMA Mode (OTG_HS Core)
```c
/* Enable USB OTG DMA mode. Note: only the OTG_HS core (STM32F4's
 * OTG_HS, STM32H7) has an internal DMA engine — OTG_FS does not. */
void usb_otg_dma_enable(void) {
    /* Enable DMA in the AHB Global Configuration Register */
    USB_OTG_HS->GAHBCFG |= USB_OTG_GAHBCFG_DMAEN;
    /* Set burst type — INCR4 balances AHB bus utilisation */
    USB_OTG_HS->GAHBCFG &= ~USB_OTG_GAHBCFG_HBSTLEN_Msk;
    USB_OTG_HS->GAHBCFG |= USB_OTG_GAHBCFG_HBSTLEN_1; /* INCR4 */
}

/* CFG_TUSB_MEM_SECTION places buffers in DMA-accessible SRAM. */
/* On STM32H7: USB DMA can access SRAM1/SRAM2 but NOT DTCM RAM. */
/* Place DMA buffers explicitly to avoid silent DMA failures. */
CFG_TUSB_MEM_SECTION CFG_TUSB_MEM_ALIGN
static uint8_t cdc_tx_dma_buf[4096];
CFG_TUSB_MEM_SECTION CFG_TUSB_MEM_ALIGN
static uint8_t cdc_rx_dma_buf[4096];

/* Cortex-M7 cache coherency for DMA transfers: */
/* STM32H7 has 16 KB L1 D-cache enabled by default in CubeIDE. */
/* CPU writes land in the cache — DMA reading RAM sees stale data. */
void prepare_tx_for_dma(const uint8_t *src, uint32_t len) {
    memcpy(cdc_tx_dma_buf, src, len);
    /* Clean D-cache: write dirty cache lines back to SRAM */
    SCB_CleanDCache_by_Addr((uint32_t *)cdc_tx_dma_buf,
                            (int32_t)((len + 31u) & ~31u));
}

void invalidate_rx_after_dma(uint32_t len) {
    /* Invalidate D-cache: force the CPU to re-read from SRAM. */
    /* Must be called BEFORE reading DMA-received data. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)cdc_rx_dma_buf,
                                 (int32_t)((len + 31u) & ~31u));
}
```
Measured Throughput: CPU Copy vs DMA
| Configuration | MCU | USB Speed | CPU Copy MB/s | DMA MB/s | Improvement |
|---|---|---|---|---|---|
| CDC Bulk IN | STM32F407 @ 168 MHz | FS (12 Mbps) | 0.95 | 1.02 | ~7% (bus-limited) |
| CDC Bulk IN | STM32F407 @ 168 MHz | HS (480 Mbps) | 18 | 32 | ~78% |
| Vendor Bulk IN | STM32H743 @ 480 MHz | HS (480 Mbps) | 26 | 40 | ~54% |
| MSC Read10 | STM32F407 @ 168 MHz | HS (480 Mbps) | 14 | 22 | ~57% |

The STM32F407 High Speed rows assume the OTG_HS core with an external ULPI PHY; the OTG_FS core has no internal DMA engine, so the DMA column applies to OTG_HS only.
DMA at Full Speed: For Full Speed USB (12 Mbps), DMA provides minimal throughput benefit because the USB bus itself is the bottleneck — the CPU keeps up with 64-byte packets at 1 MB/s without effort. DMA becomes essential at High Speed where the USB bus can consume data faster than a CPU memcpy can supply it. If your application runs at Full Speed, focus on reducing NAK rates and ensuring the endpoint buffer is always pre-filled before the host polls it.
Measuring USB Throughput
Accurate throughput measurement requires discipline on both sides. Exclude connection setup time from the measurement window and measure over a multi-second window to average out bus scheduling noise. Always warm up the connection by discarding the first 100 KB before starting the timer.
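That discipline can be encoded directly in the measurement code. A sketch of a warmup-aware throughput accumulator — the 100 KB warmup figure mirrors the advice above; names are illustrative:

```c
#include <stdint.h>

#define WARMUP_BYTES (100u * 1024u) /* discard the first 100 KB */

typedef struct {
    uint64_t bytes;       /* bytes counted inside the window */
    uint32_t warmup_left; /* bytes still to discard          */
    uint64_t t_start_ms;  /* set once warmup completes       */
} bench_t;

static void bench_init(bench_t *b)
{
    b->bytes = 0; b->warmup_left = WARMUP_BYTES; b->t_start_ms = 0;
}

/* Feed every transferred chunk; returns 1 once timing has started. */
static int bench_feed(bench_t *b, uint32_t len, uint64_t now_ms)
{
    if (b->warmup_left > 0) {
        uint32_t skip = len < b->warmup_left ? len : b->warmup_left;
        b->warmup_left -= skip;
        len -= skip;
        if (b->warmup_left == 0)
            b->t_start_ms = now_ms; /* timer starts after warmup */
    }
    b->bytes += len;
    return b->warmup_left == 0;
}

/* Average throughput in bytes/s over the window since warmup ended. */
static uint64_t bench_rate_bps(const bench_t *b, uint64_t now_ms)
{
    uint64_t dt = now_ms - b->t_start_ms;
    return dt ? (b->bytes * 1000u) / dt : 0;
}
```

The same bookkeeping works on either end of the link; on the device, `now_ms` would come from a millisecond tick such as HAL_GetTick().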
Device Firmware: TX Benchmark
```c
/* Continuous TX flood — fill the CDC buffer every main-loop iteration */
static uint8_t tx_buf[4096];

void usb_tx_benchmark_task(void) {
    if (!tud_cdc_connected()) return;
    /* Only write when a full block fits — maximises packet fill */
    if (tud_cdc_write_available() >= sizeof(tx_buf)) {
        tud_cdc_write(tx_buf, sizeof(tx_buf));
        tud_cdc_write_flush();
    }
}
```
Host-side pyserial benchmark (`pip install pyserial`; the port name is an example):
```python
import serial, time

s = serial.Serial('COM12', 115200, timeout=1)  # baud rate is ignored by CDC
s.reset_input_buffer()
for _ in range(25):                    # warmup: discard ~100 KB
    s.read(4096)
n = 0
t = time.perf_counter()
while time.perf_counter() - t < 10:    # 10-second measurement window
    n += len(s.read(4096))
print(f'{n / (time.perf_counter() - t) / 1e6:.2f} MB/s')
```
Common measurement pitfalls: (1) measuring from first byte rather than after warmup — initial latency inflates the time denominator; (2) including enumeration in the window; (3) measuring RTT (round-trip) for a one-directional benchmark.
TX Path Optimization (Device to Host)
The TX path — data flowing from device firmware to the USB host — has several common implementation patterns that severely limit throughput. Understanding the TinyUSB CDC TX buffer model and the correct flush strategy is the foundation of TX path optimisation.
CDC Write Buffer Strategy
TinyUSB CDC maintains an internal FIFO between application writes and the USB endpoint. The key insight: tud_cdc_write() copies data into this FIFO but does not immediately initiate a USB transfer. The transfer is initiated when either (a) tud_cdc_write_flush() is called, or (b) the FIFO fills up. Calling tud_cdc_write_flush() after every byte is the single most common throughput killer.
```c
/* WRONG: flush after every small write — catastrophic for throughput */
void bad_send_data(const uint8_t *data, uint32_t len) {
    for (uint32_t i = 0; i < len; i++) {
        tud_cdc_write(&data[i], 1);
        tud_cdc_write_flush(); /* USB transaction for EVERY BYTE */
    }
}

/* CORRECT: batch into large writes, flush once at the end */
void good_send_data(const uint8_t *data, uint32_t len) {
    uint32_t written = 0;
    while (written < len) {
        uint32_t avail = tud_cdc_write_available();
        if (avail == 0) {
            tud_task(); /* Process USB events to drain the TX FIFO */
            continue;
        }
        uint32_t chunk = (len - written < avail) ? len - written : avail;
        tud_cdc_write(data + written, chunk);
        written += chunk;
    }
    tud_cdc_write_flush(); /* Single flush at the end */
}

/* BEST: circular buffer + DMA feeding the CDC write path. */
/* ADC DMA fills ring_buf[] continuously (circular mode); the USB */
/* task drains ring_buf into the CDC TX FIFO. Size must be a power */
/* of two for the index masks below. */
#define RING_BUF_SIZE (16 * 1024)
static uint8_t  ring_buf[RING_BUF_SIZE];
static uint32_t ring_head = 0; /* DMA write position */
static uint32_t ring_tail = 0; /* USB read position */

/* DMA callbacks: half-complete means the first half is ready (head */
/* at the midpoint); transfer-complete means the second half is */
/* ready and the DMA write pointer has wrapped back to 0. */
void adc_dma_half_cb(void) { ring_head = RING_BUF_SIZE / 2; }
void adc_dma_full_cb(void) { ring_head = 0; }

void usb_tx_drain_task(void) {
    if (!tud_cdc_connected()) return;
    uint32_t avail = tud_cdc_write_available();
    if (avail == 0) return;
    uint32_t head  = ring_head; /* Atomic 32-bit read on Cortex-M */
    uint32_t bytes = (head - ring_tail + RING_BUF_SIZE) & (RING_BUF_SIZE - 1u);
    if (bytes == 0) return;
    uint32_t chunk = bytes < avail ? bytes : avail;
    /* Handle wrap-around */
    if (ring_tail + chunk <= RING_BUF_SIZE) {
        tud_cdc_write(ring_buf + ring_tail, chunk);
    } else {
        uint32_t first = RING_BUF_SIZE - ring_tail;
        tud_cdc_write(ring_buf + ring_tail, first);
        tud_cdc_write(ring_buf, chunk - first);
    }
    ring_tail = (ring_tail + chunk) & (RING_BUF_SIZE - 1u);
    tud_cdc_write_flush();
}
```
The tud_cdc_write_available() function returns remaining space in the CDC TX FIFO. For maximum throughput, keep the TX FIFO at least 50% full at all times by writing large chunks rather than byte-by-byte.
RX Path Optimization (Host to Device)
The RX path — data flowing from the USB host into the device firmware — has its own set of bottlenecks. The most common mistake is reading data one byte at a time from the CDC RX buffer, which causes the TinyUSB stack to generate a NAK for incoming packets until the application has drained the previous data. This stalls the host's bulk transfer pipeline.
```c
/* WRONG: read byte-by-byte — stalls the pipeline with NAKs */
void bad_rx_handler(void) {
    uint8_t byte;
    while (tud_cdc_available()) {
        tud_cdc_read(&byte, 1);
        process_byte(byte); /* Slow per-byte processing */
    }
}

/* CORRECT: drain the available data in full-buffer reads */
void tud_cdc_rx_cb(uint8_t itf) {
    uint8_t buf[CFG_TUD_CDC_EP_BUFSIZE]; /* 512 for HS, 64 for FS */
    uint32_t count;
    /* Read ALL available bytes, a full buffer at a time */
    while ((count = tud_cdc_n_read(itf, buf, sizeof(buf))) > 0) {
        /* Hand off the entire chunk to the application */
        app_rx_handler(buf, count);
    }
}

/* If the application cannot keep up with the RX rate: */
/* a USB NAK (Negative Acknowledge) is the correct behaviour. */
/* TinyUSB automatically NAKs when its RX FIFO is full. */
/* Do NOT discard data — leave it queued and let back-pressure work. */

/* Flow control using a circular buffer */
#define APP_RX_BUF_SIZE 8192
static uint8_t  app_rx_buf[APP_RX_BUF_SIZE];
static uint32_t app_rx_head = 0;
static uint32_t app_rx_tail = 0;

void tud_cdc_rx_cb_buffered(uint8_t itf) {
    uint8_t pkt[512];
    /* Check ring space FIRST, then read at most that much. Bytes */
    /* left unread stay in TinyUSB's FIFO; when it fills, TinyUSB */
    /* NAKs the host's next OUT packet — back-pressure, no loss. */
    uint32_t free_space = (app_rx_tail - app_rx_head - 1 + APP_RX_BUF_SIZE)
                          % APP_RX_BUF_SIZE;
    uint32_t want = free_space < sizeof(pkt) ? free_space : sizeof(pkt);
    if (want == 0) return; /* ring full — process some data first */
    uint32_t count = tud_cdc_n_read(itf, pkt, want);
    /* Copy the packet into the ring buffer */
    for (uint32_t i = 0; i < count; i++) {
        app_rx_buf[app_rx_head] = pkt[i];
        app_rx_head = (app_rx_head + 1) % APP_RX_BUF_SIZE;
    }
}
```
When the application processing rate falls below the USB RX rate, TinyUSB's flow control kicks in automatically: the RX FIFO fills, and TinyUSB NAKs the host's next OUT data packet. The host retries — no data is lost. This back-pressure mechanism is correct behaviour, not a bug.
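The consumer side of such a ring needs matching drain logic. A sketch — clamped to the contiguous run so the caller gets a flat pointer with no extra copy (sizes and names are illustrative):

```c
#include <stdint.h>
#include <string.h>

#define APP_RX_BUF_SIZE 8192u
static uint8_t  app_rx_buf[APP_RX_BUF_SIZE];
static uint32_t app_rx_head; /* producer (USB callback) */
static uint32_t app_rx_tail; /* consumer (application)  */

static uint32_t app_rx_pending(void)
{
    return (app_rx_head - app_rx_tail + APP_RX_BUF_SIZE) % APP_RX_BUF_SIZE;
}

/* Copy up to max_len buffered bytes into dst; returns bytes copied.
 * Clamped to the contiguous run — call again to get the wrapped part. */
static uint32_t app_rx_drain(uint8_t *dst, uint32_t max_len)
{
    uint32_t n = app_rx_pending();
    if (n > max_len) n = max_len;
    uint32_t contig = APP_RX_BUF_SIZE - app_rx_tail;
    if (n > contig) n = contig;
    if (n) {
        memcpy(dst, &app_rx_buf[app_rx_tail], n);
        app_rx_tail = (app_rx_tail + n) % APP_RX_BUF_SIZE;
    }
    return n;
}
```

Draining in the main loop (or a dedicated task) frees ring space, which in turn lets the RX callback accept more data — completing the back-pressure chain from host to application.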
MSC Throughput Optimization
USB Mass Storage throughput is a chain of three components: USB bulk transfer rate, MCU processing (SCSI command decode), and storage medium read/write speed. Optimising only the USB link without addressing SDMMC throughput leaves the majority of performance gain on the table.
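Whether the stages overlap determines how the chain composes: if firmware reads a block from storage and only then transmits it, the per-byte times add; if the next read overlaps the current transmit (double buffering), the slowest stage alone sets the rate. A sketch of the arithmetic (rates in MB/s are illustrative):

```c
/* Two-stage pipeline throughput in MB/s.
 * Sequential (read a block, then send it): per-byte times add, so
 *   rate = 1 / (1/sd + 1/usb)
 * Overlapped (read the next block while the current one transmits):
 *   rate = min(sd, usb) */
static double pipeline_sequential(double sd_mbs, double usb_mbs)
{
    return 1.0 / (1.0 / sd_mbs + 1.0 / usb_mbs);
}

static double pipeline_overlapped(double sd_mbs, double usb_mbs)
{
    return sd_mbs < usb_mbs ? sd_mbs : usb_mbs;
}
```

With a 12 MB/s card and a 22 MB/s USB link, the sequential chain manages only ~7.8 MB/s while the overlapped one reaches the full 12 MB/s — which is why the pre-fetch/double-buffer idea later in this section matters.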
```c
/* MSC READ(10) callback — called when the host issues a READ(10). */
/* TinyUSB passes its own endpoint buffer (CFG_TUD_MSC_EP_BUFSIZE */
/* bytes) as `buffer`; it must be 4-byte aligned and live in */
/* DMA-accessible SRAM, e.g. when you override the placement: */
CFG_TUSB_MEM_SECTION CFG_TUSB_MEM_ALIGN
static uint8_t msc_buf[4096]; /* 8 x 512-byte sectors per transfer */

int32_t tud_msc_read10_cb(uint8_t lun, uint32_t lba,
                          uint32_t offset, void *buffer,
                          uint32_t bufsize) {
    (void)lun; (void)offset;
    /* Use SDMMC DMA for a zero-copy read into the MSC buffer. */
    /* hsd is the HAL SD handle generated by CubeMX. */
    if (HAL_SD_ReadBlocks_DMA(&hsd, (uint8_t *)buffer,
                              lba, bufsize / 512) != HAL_OK) {
        return -1;
    }
    /* Wait for transfer complete (or use the callback for async) */
    uint32_t t = HAL_GetTick();
    while (HAL_SD_GetCardState(&hsd) != HAL_SD_CARD_TRANSFER) {
        if (HAL_GetTick() - t > 500) return -1;
    }
    /* On Cortex-M7: invalidate cache before TinyUSB reads buffer */
    SCB_InvalidateDCache_by_Addr((uint32_t *)buffer,
                                 (int32_t)((bufsize + 31u) & ~31u));
    return (int32_t)bufsize;
}

/* SDMMC configuration for maximum throughput: */
/* STM32F407: SDIO peripheral, 4-bit wide bus, 48 MHz clock. */
/* Achievable: ~12 MB/s read from a Class 10 microSD. */
/* Combined USB HS + SDMMC: USB HS bulk can sustain ~22 MB/s, */
/* so the SD card at ~12 MB/s is the limiting factor. */

/* Pre-fetch optimisation: read the next block while the current */
/* one transmits. TinyUSB calls tud_msc_read10_cb per READ(10) */
/* chunk; pre-fetch is complex with TinyUSB but achievable with a */
/* double buffer. */
```
| MCU | USB Speed | Storage | Measured Read | Measured Write | Bottleneck |
|---|---|---|---|---|---|
| STM32F407 @ 168 MHz | HS 480 Mbps | SDMMC Class 10 | 10.5 MB/s | 8.2 MB/s | SD card |
| STM32H743 @ 480 MHz | HS 480 Mbps | SDMMC UHS-I | 22 MB/s | 18 MB/s | USB bulk pipelining |
| STM32F407 @ 168 MHz | FS 12 Mbps | SDMMC Class 10 | 0.95 MB/s | 0.85 MB/s | USB Full Speed |
| RP2040 @ 133 MHz | FS 12 Mbps | SPI Flash 80 MHz | 0.9 MB/s | 0.6 MB/s | USB Full Speed |
Profiling & Bottleneck Identification
When USB throughput is lower than expected, the bottleneck is in one of three places: the USB bus itself (too many NAKs), the MCU's processing pipeline (CPU too slow to feed the endpoint), or the storage/peripheral (SD card, ADC, memory bus). Identifying which one requires measurement — guessing leads to optimising the wrong thing.
GPIO Toggle Timing Method
```c
/* GPIO timing method: toggle a spare GPIO at callback entry/exit. */
/* Measure HIGH/LOW times with an oscilloscope or logic analyser. */
#define PROFILE_GPIO_PORT GPIOC
#define PROFILE_GPIO_PIN  GPIO_PIN_13

static inline void profile_high(void) {
    PROFILE_GPIO_PORT->BSRR = PROFILE_GPIO_PIN;
}
static inline void profile_low(void) {
    /* Reset via BSRR's upper half — the BRR register is absent on */
    /* some families (e.g. STM32F4). */
    PROFILE_GPIO_PORT->BSRR = (uint32_t)PROFILE_GPIO_PIN << 16;
}

/* In tud_cdc_rx_cb: */
void tud_cdc_rx_cb(uint8_t itf) {
    profile_high(); /* GPIO goes HIGH when the callback starts */
    uint8_t buf[512];
    uint32_t count = tud_cdc_n_read(itf, buf, sizeof(buf));
    app_process_rx(buf, count);
    profile_low();  /* GPIO goes LOW when the callback returns */
}

/* Oscilloscope interpretation: */
/*   HIGH duration = callback execution time */
/*   LOW duration between callbacks = time the USB bus was idle */
/*   If LOW duration is very short: the MCU is the bottleneck */
/*   If HIGH duration is very short: the USB bus is the bottleneck */

/* DWT cycle counter for sub-µs timing without a GPIO. */
/* Present on most Cortex-M3/M4/M7 parts (optional in the core */
/* spec, absent on M0/M0+). On some Cortex-M7 parts the DWT must */
/* first be unlocked by writing 0xC5ACCE55 to the lock access */
/* register at 0xE0001FB0. */
void dwt_init(void) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;
}

static inline uint32_t dwt_get_cycles(void) {
    return DWT->CYCCNT;
}
static inline uint32_t dwt_cycles_to_us(uint32_t cycles) {
    return cycles / (SystemCoreClock / 1000000u);
}

/* Usage: */
void measure_callback_duration(void) {
    uint32_t t0 = dwt_get_cycles();
    do_usb_operation();
    uint32_t elapsed_us = dwt_cycles_to_us(dwt_get_cycles() - t0);
    printf("Operation took %lu us\n", (unsigned long)elapsed_us);
}
```
A well-optimised High Speed USB device shows callback execution times (GPIO HIGH) of 5–15 µs for a 512-byte packet, with very short idle gaps. If idle gaps are long (50+ µs) even though data is waiting to be sent, the MCU's data path is too slow to refill the endpoint — DMA is needed. If callback times are short and the device is never the one stalling, yet throughput is still low, the bottleneck is USB host driver scheduling on the PC side.
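The interpretation rules above reduce to a duty-cycle calculation that firmware can perform itself from averaged GPIO or DWT measurements. A sketch — the threshold percentages are rule-of-thumb heuristics, not spec figures:

```c
#include <stdint.h>

typedef enum {
    BOTTLENECK_MCU,         /* callback busy almost all the time */
    BOTTLENECK_BUS_OR_HOST, /* bus mostly idle between callbacks */
    BOTTLENECK_BALANCED
} bottleneck_t;

/* Classify from averaged busy (GPIO HIGH) and idle (GPIO LOW)
 * times in microseconds. */
static bottleneck_t classify_bottleneck(uint32_t busy_us, uint32_t idle_us)
{
    uint32_t duty_pct = busy_us * 100u / (busy_us + idle_us);
    if (duty_pct > 85u) return BOTTLENECK_MCU;
    if (duty_pct < 15u) return BOTTLENECK_BUS_OR_HOST;
    return BOTTLENECK_BALANCED;
}
```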
Exercises
Exercise 1 (Beginner): Measure Baseline CDC Throughput
Using a Full Speed TinyUSB CDC device: (a) implement the TX flood firmware (usb_tx_benchmark_task()) that continuously fills the CDC TX buffer; (b) run the Python pyserial benchmark script on the host, measuring with a 10-second window and 100 KB warmup; (c) record the baseline throughput in MB/s; (d) change CFG_TUD_CDC_TX_BUFSIZE from 512 to 2048 bytes and remeasure — document the improvement; (e) try calling tud_cdc_write_flush() after every 64-byte write versus after every 512-byte write and compare the throughput difference.
Tags: CDC Throughput, Buffer Sizing, Benchmarking
Exercise 2 (Intermediate): Profile USB TX Callback with GPIO and Logic Analyser
Set up GPIO toggle profiling on your MCU: (a) configure a spare GPIO pin as fast push-pull output; (b) set the GPIO HIGH at the start of tud_cdc_rx_cb() and LOW at the end; (c) capture 10 ms of GPIO waveform on a logic analyser while running the loopback benchmark; (d) measure the average HIGH (callback active) and LOW (idle) times; (e) calculate the duty cycle — what percentage of time is the MCU processing vs idle?; (f) identify whether the bottleneck is USB bus latency or MCU processing by analysing the waveform pattern.
Tags: GPIO Profiling, Logic Analyser, Bottleneck Analysis
Exercise 3 (Advanced): Implement DMA-Driven ADC to USB Streaming at Maximum Throughput
Design a complete high-throughput streaming pipeline: (a) configure an ADC in continuous DMA mode on a Cortex-M4/M7 MCU, sampling at the maximum rate the USB can sustain (for FS: ~100 kSPS at 16-bit; for HS: ~2 MSPS); (b) use a ping-pong (double buffer) DMA setup — DMA fills buffer A while USB sends buffer B; (c) implement cache flush in the DMA half-complete callback before TinyUSB reads buffer A; (d) measure achieved throughput and compare to the theoretical maximum; (e) add DWT cycle counter instrumentation to measure total latency from ADC sample to USB packet transmission; (f) identify and eliminate any idle gaps in the TX pipeline.
Tags: DMA Streaming, Ping-Pong Buffers, ADC to USB, Cache Coherency
Conclusion & Next Steps
Part 13 has given you a complete toolkit for USB performance analysis and optimisation. The key lessons to take forward:
- Know the ceiling. USB Full Speed bulk caps at ~1 MB/s. High Speed bulk can reach 35–42 MB/s with optimal buffering and DMA. Start by understanding what is theoretically possible.
- Packet size is the biggest single lever. Switching from 64-byte (FS) to 512-byte (HS) bulk packets delivers an 8× throughput gain with no code change — only hardware and descriptor changes required.
- DMA is essential at High Speed. At Full Speed, CPU memcpy keeps up easily. At High Speed, DMA eliminates the CPU copy bottleneck and unlocks the last 40–80% of available throughput.
- Flush strategy dominates CDC TX throughput. Calling tud_cdc_write_flush() after every byte reduces throughput from 18 MB/s to under 1 MB/s. Batch large writes and flush once at the end of each burst.
- Drain the RX FIFO completely in one call. Reading byte-by-byte via tud_cdc_read() stalls the USB pipeline. Always read all available bytes in a single tud_cdc_n_read() call.
- MSC performance is usually SD-card limited. USB HS can sustain 22+ MB/s; the bottleneck is almost always the storage medium. SDMMC DMA in 4-bit HS mode is essential for maximising MSC write throughput.
- Measure with GPIO toggles first. One GPIO toggle per callback entry/exit, captured on a logic analyser for 10 ms, immediately reveals whether the bottleneck is USB bus time or MCU processing time — no guessing required.
Next in the Series
In Part 14: Custom USB Class Drivers, we move beyond standard classes to design custom vendor-specific USB drivers. We will write a complete custom class descriptor set, implement Microsoft OS descriptors (WCID/BOS) for driver-free Windows operation, build a libusb-based host application, and explore when a custom class is the right choice versus extending a standard one.
Related Articles in This Series
Part 12: Advanced USB Topics
USB Audio Class 2.0, DFU bootloader design, OTG host mode, hub support, suspend/resume, USB PD, and SuperSpeed overview.
Part 6: CDC Virtual COM Port
CDC class fundamentals, bulk transfer mechanics, and the TinyUSB CDC API — the foundation for the TX/RX path optimisations in this article.
Part 14: Custom USB Class Drivers
Vendor-specific USB class implementation, Microsoft OS descriptors for Windows, and libusb host application development.