Part 24: Performance Analysis — CPU, Memory, Disk & Network

The USE Method

The USE Method, created by Brendan Gregg, provides a systematic framework for performance analysis. For every resource (CPU, memory, disk, network, etc.), check three things: Utilization, Saturation, and Errors. This prevents you from randomly running tools and instead gives you a directed, exhaustive methodology.

USE Method Decision Tree

flowchart TD
    A["For Each Resource\n(CPU, Memory, Disk, Network)"] --> U{"Check Utilization\n(% time busy)"}
    U -->|High| S1{"Check Saturation\n(queue depth / waiting)"}
    U -->|Low| E1{"Check Errors\n(device errors, drops)"}
    S1 -->|Saturated| B["⚠️ Bottleneck Found\nResource overloaded"]
    S1 -->|Not saturated| E1
    E1 -->|Errors present| C["⚠️ Error-Induced\nDegraded performance"]
    E1 -->|No errors| D["✅ Resource OK\nMove to next resource"]
    B --> FIX["Tune, scale, or offload"]
    C --> FIX

            
Utilization vs Saturation: High utilization alone is not a problem: a CPU running at 90% may be perfectly healthy if there's no queuing. Saturation is the real indicator of trouble: it means requests are waiting (queued) because the resource cannot service them fast enough. A disk at 95% utilization with 0 queue depth is fine. A disk at 70% utilization with an average queue depth of 15 is in serious trouble.
        

Utilization

Utilization measures the percentage of time a resource is busy servicing work. For time-based resources (CPU, disk), this is straightforward. For capacity-based resources (memory, bandwidth), it's the fraction of total capacity in use.

Saturation

Saturation measures the degree to which a resource has extra work it can't service — typically queued work. For CPUs, this is the run queue length. For disks, it's the I/O queue depth. For memory, it's paging/swapping activity. Saturation directly correlates with latency spikes.

Errors

Errors are events that caused failures — failed disk reads, network packet drops, ECC memory corrections, TCP retransmissions. Some errors degrade performance silently (retries, retransmissions) without full failure, making them easy to miss.

Resource	Utilization Tool	Saturation Tool	Errors Tool
CPU	`mpstat`, `top`, `sar -u`	`vmstat` (r column), `sar -q`	`dmesg` (machine check exceptions), `perf stat` (CPU-specific error counters, where available)
Memory	`free -m`, `sar -r`	`vmstat` (si/so columns), `sar -B`	`dmesg` (OOM killer, ECC errors)
Disk	`iostat -xz` (%util)	`iostat -x` (avgqu-sz, named aqu-sz in newer sysstat), `sar -d`	`smartctl`, `dmesg` (I/O errors)
Network	`sar -n DEV`, `ip -s link`	`ss -tim` (retransmits), `ip -s link` (overruns)	`ip -s link` (errors, drops), `ethtool -S`
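
The decision tree above can be sketched as a small shell function. This is an illustration only: the name `use_verdict` and the thresholds (70% utilization, queue depth above 1) are assumptions picked for the example, and sensible thresholds differ per resource.

```shell
#!/bin/sh
# use_verdict UTIL_PCT QUEUE_DEPTH ERROR_COUNT   (integer arguments)
# Hypothetical helper that walks the USE decision tree for one resource.
# The 70% and "queue > 1" thresholds are illustrative assumptions.
use_verdict() {
    util=$1 queue=$2 errors=$3
    if [ "$util" -ge 70 ] && [ "$queue" -gt 1 ]; then
        echo "bottleneck"        # busy AND work is queuing
    elif [ "$errors" -gt 0 ]; then
        echo "error-induced"     # not saturated, but errors are present
    else
        echo "ok"                # move on to the next resource
    fi
}

use_verdict 95 0 0     # busy but nothing queued   -> ok
use_verdict 70 15 0    # moderate util, deep queue -> bottleneck
use_verdict 10 0 3     # idle but reporting errors -> error-induced
```

The three calls mirror the examples in the text: high utilization without queuing is fine, while queuing is what marks a bottleneck.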

CPU Analysis

mpstat — Per-CPU Breakdown

mpstat breaks per-CPU utilization down into user, system, iowait, irq, soft, steal, and idle time. This reveals whether load is balanced across cores or concentrated on one.

# Per-CPU utilization every 1 second, 5 samples
mpstat -P ALL 1 5

# Key columns:
# %usr    — time in user-space code
# %sys    — time in kernel code (syscalls; interrupt time is reported separately as %irq and %soft)
# %iowait — CPU idle while waiting for disk I/O
# %steal  — time stolen by hypervisor (VM contention)
# %idle   — genuinely idle

# High %iowait on all CPUs = usually a storage bottleneck, not CPU
# High %sys on one CPU = possible lock contention; high %irq or %soft on one CPU = interrupt affinity imbalance
# %steal > 5% = noisy neighbor on shared VM, consider requesting a dedicated host
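
The interpretation rules above can be applied mechanically. The sketch below is a hypothetical filter (`flag_mpstat` is not part of sysstat) that reads mpstat output on stdin and prints the CPUs whose %steal or %iowait exceeds a threshold. It locates columns by header name because the column layout shifts with locale and sysstat version. The default thresholds of 5% steal and 20% iowait are assumptions.

```shell
#!/bin/sh
# flag_mpstat [STEAL_MAX] [IOWAIT_MAX]
# Hypothetical helper: reads `mpstat -P ALL` output on stdin.
# Default thresholds (5% steal, 20% iowait) are illustrative assumptions.
flag_mpstat() {
    awk -v steal_max="${1:-5}" -v iowait_max="${2:-20}" '
        # Header row: record the position of every column name.
        /%idle/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows: skip blank lines, the banner, and the Average: footer.
        ("CPU" in col) && NF >= col["%idle"] && $1 != "Average:" {
            cpu = $(col["CPU"])
            if ($(col["%steal"]) + 0 > steal_max + 0)
                print "cpu " cpu ": steal " $(col["%steal"]) "%"
            if ($(col["%iowait"]) + 0 > iowait_max + 0)
                print "cpu " cpu ": iowait " $(col["%iowait"]) "%"
        }'
}

# Usage (requires sysstat):
#   mpstat -P ALL 1 5 | flag_mpstat
```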

perf — Hardware Performance Counters

perf is Linux's primary profiling tool. It accesses CPU hardware performance counters (PMCs) and can trace any kernel/user function, count cache misses, branch mispredictions, and sample call stacks at configurable frequencies.

# Count hardware events for a command
perf stat -d ls /tmp
# Shows: task-clock, cycles, instructions (with IPC), branches, branch-misses;
# -d adds L1 data cache and last-level cache loads and misses

# Profile a running process (sample stacks at 99 Hz for 30s)
# pgrep -d, joins multiple matching PIDs with commas, the list format -p expects
perf record -F 99 -g -p "$(pgrep -d, myapp)" -- sleep 30

# View the profile (interactive TUI)
perf report --sort comm,dso,symbol

# Top-like view of hottest functions system-wide
perf top -g

# Count context switches for a process
perf stat -e context-switches -p "$(pgrep -d, myapp)" -- sleep 10

# Trace specific syscalls (modern glibc opens files via openat)
perf trace -p "$(pgrep -d, myapp)" -e openat,read,write -- sleep 5

Flame Graphs

Flame graphs (invented by Brendan Gregg) visualize profiled stack traces as stacked frames. The y-axis is stack depth, the x-axis is sorted alphabetically (it is not time), and each frame's width is the proportion of samples in which it appears. Wider frames mean more CPU time, so the code paths that dominate CPU usage stand out immediately.

# Generate a CPU flame graph (requires github.com/brendangregg/FlameGraph)
# Step 1: Record stacks
perf record -F 99 -a -g -- sleep 30

# Step 2: Convert to folded stacks
perf script | stackcollapse-perf.pl > out.folded

# Step 3: Generate SVG flame graph
flamegraph.pl out.folded > flamegraph.svg

# Open in browser — interactive, searchable
# Click frames to zoom, Ctrl+F to search function names

# One-liner (if FlameGraph repo is in PATH)
perf record -F 99 -a -g -- sleep 10 && \
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu-flame.svg
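
To make the "width equals share of samples" idea concrete, here is a minimal sketch that reads folded stacks (the `frame;frame;frame count` lines written by stackcollapse-perf.pl) and reports what share of samples each leaf function accounts for. The function name `leaf_share` is an assumption for illustration and is not part of the FlameGraph repository.

```shell
#!/bin/sh
# leaf_share: hypothetical helper, reads folded stacks on stdin and prints
# the percentage of samples whose top-of-stack (leaf) frame is each function.
leaf_share() {
    awk '
        {
            count = $NF                     # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)      # strip the trailing count
            n = split(stack, frames, ";")   # frames[n] is the leaf frame
            leaf[frames[n]] += count
            total += count
        }
        END {
            for (f in leaf) printf "%.1f%% %s\n", 100 * leaf[f] / total, f
        }' | sort -rn
}

# Sample folded stacks: "read" is on-CPU in 90 of 100 samples.
printf '%s\n' 'main;parse;read 30' 'main;parse 10' 'main;render;read 60' | leaf_share
```

On real data the same summary comes from `leaf_share < out.folded`. It only ranks leaf frames, so it complements the flame graph rather than replacing it.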

Memory Analysis

free — Memory Overview

free gives a snapshot of total, used, free, shared, buffers/cache, and available memory. The available column (not free) is what matters — it includes memory that can be reclaimed from caches.

# Memory overview in MiB (use free -h for human-readable units)
free -m
#               total    used    free    shared  buff/cache  available
# Mem:          15884    6234    1205      432       8445       8918
# Swap:          8192     128    8064

# Key insight: "free" is misleading — Linux uses free RAM as disk cache
# "available" = memory available for new allocations without swapping
# If available is low AND swap is active → memory pressure

# Watch memory over time (every 2 seconds)
free -m -s 2
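
Since the available figure is the one that matters, it can be handy to reduce it to a single percentage. The sketch below reads the MemTotal and MemAvailable fields from /proc/meminfo. The function name `mem_available_pct` and the optional file argument (used here to make the function testable) are assumptions for illustration.

```shell
#!/bin/sh
# mem_available_pct [FILE]
# Hypothetical helper: prints MemAvailable as a whole percentage of MemTotal.
# FILE defaults to /proc/meminfo; the argument exists only for testing.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%d\n", 100 * avail / total }
    ' "${1:-/proc/meminfo}"
}

# Print the current value when running on Linux.
if [ -r /proc/meminfo ]; then
    mem_available_pct
fi
```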

vmstat — Virtual Memory Statistics

vmstat reports processes, memory, paging, block I/O, traps, and CPU activity. It's the single most useful command for spotting whether a system is CPU-bound, memory-bound, or I/O-bound.

# Report every 1 second, 5 samples (skip first line — it's averages since boot)
vmstat 1 5

# Key columns:
# r  — runnable processes (running or waiting for CPU) → CPU saturation if r > nCPU
# b  — processes in uninterruptible sleep (waiting for I/O) → I/O saturation
# si — swap in from disk (KB/s) → memory pressure
# so — swap out to disk (KB/s) → memory pressure (CRITICAL if sustained)
# us — user CPU time
# sy — system CPU time
# wa — I/O wait time
# st — steal time (VM)

# Quick diagnosis:
# High r, low wa → CPU-bound
# High b, high wa → I/O-bound
# High si/so → memory-bound (swapping)
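
The quick-diagnosis rules can be expressed as a filter over vmstat output. This is a sketch under stated assumptions: `vmstat_diagnose` is a hypothetical name, "high wa" is taken to mean 20% or more, and "high r" means more runnable processes than the CPU count passed as the first argument. Columns are located by header name, and the first sample (averages since boot) is skipped.

```shell
#!/bin/sh
# vmstat_diagnose NCPU
# Hypothetical helper: reads `vmstat 1 N` output on stdin and prints one
# verdict per sample. The wa >= 20 threshold is an illustrative assumption.
vmstat_diagnose() {
    awk -v ncpu="${1:-1}" '
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }   # column-name header
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }                   # banner or non-data line
        ++sample == 1 { next }                                       # first sample = since-boot averages
        {
            verdict = "ok"
            if ($(col["si"]) + $(col["so"]) > 0)
                verdict = "memory-bound (swapping)"
            else if ($(col["b"]) + 0 > 0 && $(col["wa"]) + 0 >= 20)
                verdict = "I/O-bound"
            else if ($(col["r"]) + 0 > ncpu + 0 && $(col["wa"]) + 0 < 20)
                verdict = "CPU-bound"
            print verdict
        }'
}

# Usage: vmstat 1 5 | vmstat_diagnose "$(nproc)"
```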

/proc/meminfo — Detailed Breakdown

# Detailed memory breakdown
head -20 /proc/meminfo

# Important fields:
grep -E "MemTotal|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Dirty|Slab" /proc/meminfo

# MemTotal     — physical RAM
# MemAvailable — estimated memory available (kernel's calculation)
# Buffers      — block device metadata cache
# Cached       — page cache (file data)
# Dirty        — pages modified but not yet written to disk
# Slab         — kernel data structure caches
# SwapFree     — available swap (low = trouble)

# Track page faults (minor = page already in RAM, major = page read from disk)
sar -B 1 5
# pgpgin/s, pgpgout/s — KB paged in/out from disk per second
# fault/s             — total page faults per second (minor + major)
# majflt/s            — major faults (required disk I/O) → high = memory pressure

Disk I/O Analysis

iostat — Device Utilization & Latency

iostat reports per-device I/O statistics including throughput, IOPS, queue depth, and latency. The -x flag gives extended statistics with the critical %util and await columns.

# Extended per-device stats every 1 second, 5 samples (skip idle devices)
iostat -xz 1 5

# Key columns:
# r/s, w/s      — reads/writes per second (IOPS)
# rkB/s, wkB/s  — read/write throughput (KB/s)
# await         — average I/O latency (ms) including queue time (dropped in newer sysstat; use r_await/w_await)
# r_await       — read latency; w_await — write latency
# avgqu-sz      — average queue depth → saturation indicator (named aqu-sz in newer sysstat)
# %util         — percentage of time device was busy

# Interpretation:
# %util > 80% on HDD → likely saturated (seek-bound)
# %util > 95% on SSD → check await (SSDs handle parallelism better)
# await > 10ms on SSD → something is wrong (healthy SSD: <1ms)
# avgqu-sz (aqu-sz) > 1 consistently → I/O queuing (saturation)
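
Because the queue-depth column has had two names, a script that checks for saturation should find it by header rather than by position. The sketch below is a hypothetical filter (`iostat_saturated` is not a sysstat tool) that accepts either name and prints devices whose queue depth exceeds a threshold, defaulting to the "greater than 1" rule above.

```shell
#!/bin/sh
# iostat_saturated [QUEUE_MAX]
# Hypothetical helper: reads `iostat -x` output on stdin and prints devices
# whose average queue depth exceeds QUEUE_MAX (default 1).
iostat_saturated() {
    awk -v qmax="${1:-1}" '
        # Header row: map column names, accepting either queue-depth name.
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) col[$i] = i
            q = ("aqu-sz" in col) ? col["aqu-sz"] : col["avgqu-sz"]
            next
        }
        # Device rows: short lines (blank, avg-cpu) fail the NF check.
        q && NF >= q && $q + 0 > qmax + 0 {
            print $1, "queue=" $q, "util=" $(col["%util"])
        }'
}

# Usage (iostat -y omits the since-boot report):
#   iostat -xzy 1 5 | iostat_saturated
```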

iotop — Per-Process I/O

# Show which processes are doing I/O (requires root)
sudo iotop -aoP

# Flags:
# -a  — accumulated I/O since iotop started
# -o  — only show processes doing I/O
# -P  — show processes (not threads)

# One-shot batch snapshot of the processes currently doing I/O (first 15 lines, headers included)
sudo iotop -boPn 1 | head -15

# Alternative: pidstat for I/O per process
pidstat -d 1 5
# kB_rd/s — KB read per second per process
# kB_wr/s — KB written per second per process

blktrace — Block Layer Tracing

# Trace block I/O on /dev/sda for 10 seconds (-w sets the run time)
sudo blktrace -d /dev/sda -w 10 -o - | blkparse -i - | head -50

# Fields: device (major,minor), CPU, sequence, timestamp, PID, action, RWBS flags, sector + size, process
# Actions: Q=queued, G=get request, I=inserted, D=dispatched, C=completed

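# Generate I/O latency histogram
sudo biolatency-bpfcc 10 1
# Shows distribution of I/O latencies (a bcc tool: packaged as bpfcc-tools on Debian/Ubuntu,
# installed as biolatency under /usr/share/bcc/tools on other distributions)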

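# Alternative: ioping for single-request latency measurement
ioping -c 10 /dev/sda    # Random read latency (ioping's default access pattern)
ioping -c 10 -L /dev/sda # Sequential read latency (-L also raises the request size)
ioping -R /dev/sda       # Seek-rate test (random-read IOPS)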

Network Performance

sar — System Activity Reporter (Network)

sar collects and reports system activity including network interface throughput, TCP statistics, and socket counts. It reads from /var/log/sa/ (or /var/log/sysstat/ on Debian and Ubuntu) for historical data or samples live.

# Network interface throughput (packets and bytes per second)
sar -n DEV 1 5
# rxpck/s, txpck/s — packets received/transmitted per second
# rxkB/s, txkB/s   — KB received/transmitted per second
# Compare against link speed (e.g., 1Gbps = ~125 MB/s max)

# TCP statistics (connections, retransmits)
sar -n TCP,ETCP 1 5
# active/s   — new outgoing TCP connections per second
# passive/s  — new incoming TCP connections per second
# retrans/s  — TCP retransmits per second, reported by the ETCP keyword (sign of packet loss or congestion)

# Socket statistics
sar -n SOCK 1 5
# totsck — total sockets in use
# tcp-tw — TIME_WAIT sockets (high = connection churn)

# Historical data (what happened between 03:00 and 04:00 on the 12th? saDD = day of month)
sar -n DEV -f /var/log/sa/sa12 -s 03:00:00 -e 04:00:00
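
When sar is not installed, a rough retransmission ratio can be read straight from the kernel's cumulative TCP counters. The sketch below parses the `Tcp:` header and value lines in /proc/net/snmp and prints RetransSegs as a percentage of OutSegs since boot. The function name `tcp_retrans_pct` and the optional file argument are assumptions for illustration. Because the counters are cumulative, this shows a long-run average rather than the per-second rate sar reports.

```shell
#!/bin/sh
# tcp_retrans_pct [FILE]
# Hypothetical helper: prints retransmitted segments as a percentage of all
# segments sent since boot. FILE defaults to /proc/net/snmp.
tcp_retrans_pct() {
    awk '
        # First "Tcp:" line holds the field names, the second holds the values.
        $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        $1 == "Tcp:" {
            sent = $(col["OutSegs"]) + 0
            retrans = $(col["RetransSegs"]) + 0
            if (sent > 0) printf "%.2f\n", 100 * retrans / sent
        }' "${1:-/proc/net/snmp}"
}

# Print the current value when running on Linux.
if [ -r /proc/net/snmp ]; then
    tcp_retrans_pct
fi
```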

iperf3 — Network Throughput Testing

iperf3 measures maximum achievable bandwidth between two endpoints. Run a server on one host and a client on another; the client saturates the link and reports throughput, plus jitter and packet loss in UDP mode.

# === Server side ===
iperf3 -s
# Listening on port 5201

# === Client side (TCP throughput test) ===
iperf3 -c 192.168.1.100 -t 30
# Runs for 30 seconds, reports bandwidth per second + average

# === Client side (UDP throughput + jitter + loss) ===
iperf3 -c 192.168.1.100 -u -b 1G -t 10
# -u = UDP mode, -b 1G = target 1 Gbps
# Reports: bandwidth, jitter, lost/total datagrams

# Parallel streams (useful for testing multi-queue NICs)
iperf3 -c 192.168.1.100 -P 4 -t 20

# Reverse mode (server sends, client receives)
iperf3 -c 192.168.1.100 -R -t 10

Bandwidth Testing & Diagnostics

# Check current link speed and duplex
ethtool eth0 | grep -i "speed\|duplex"

# View interface errors and drops
ip -s link show eth0
# Look for: RX errors, dropped, overruns (= NIC buffer overflow = saturation)

# View TCP socket memory and retransmissions
ss -tim state established | head -20
# Look for: retrans (retransmit count), rtt (round-trip time)

# View network connection queue depths
ss -tlnp
# Recv-Q = current backlog (connections waiting for accept)
# Send-Q = max backlog configured
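
Since Recv-Q and Send-Q on a listening socket are the current and maximum accept backlog, comparing them flags services that are close to dropping connections. The sketch below is a hypothetical filter (`listen_pressure` is an invented name, and the 80% default is an assumption) over `ss -tln` output.

```shell
#!/bin/sh
# listen_pressure [FRACTION]
# Hypothetical helper: reads `ss -tln` output on stdin and prints listening
# sockets whose accept queue is at least FRACTION (default 0.8) full.
listen_pressure() {
    awk -v frac="${1:-0.8}" '
        # Columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        $1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= frac * $3 {
            printf "%s accept queue %d/%d\n", $4, $2, $3
        }'
}

# Usage: ss -tln | listen_pressure
```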

# Quick DNS latency check
dig +stats google.com | grep "Query time"

The 60-Second Analysis Checklist

            
Brendan Gregg's 60-Second Checklist: These 10 commands give you a high-level picture of system health in under a minute. Run them first on any performance investigation. They'll tell you whether the problem is CPU, memory, disk, or network, and whether it's current or historical. Only after this triage do you dive deeper with specialised tools.
        

# === The 60-Second Analysis (run these first, always) ===

# 1. Load averages — is the system overloaded? (1/5/15 min averages)
uptime

# 2. Kernel errors — OOM kills, hardware errors, disk failures
dmesg -T | tail -20

# 3. System-wide CPU, memory, I/O, swap activity
vmstat 1 5

# 4. Per-CPU utilization breakdown
mpstat -P ALL 1 5

# 5. Per-process CPU usage
pidstat 1 5

# 6. Per-device I/O utilization and latency
iostat -xz 1 5

# 7. Memory usage (focus on "available", not "free")
free -m

# 8. Network interface throughput
sar -n DEV 1 5

# 9. TCP connection rates and retransmits (retrans/s comes from the ETCP keyword)
sar -n TCP,ETCP 1 5

# 10. Top processes (CPU, memory, state)
top -bn 1 | head -20
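
On a freshly provisioned host some of these tools may be missing, so a wrapper that skips absent commands avoids aborting the triage halfway. The sketch below is one possible arrangement, not Gregg's own script: `run_step` and `checklist` are hypothetical names, and step 9 uses `TCP,ETCP` so that retransmits are included.

```shell
#!/bin/sh
# run_step LABEL TOOL COMMAND
# Hypothetical helper: prints a banner, then runs COMMAND via sh -c if TOOL
# is installed, otherwise notes that the step was skipped.
run_step() {
    label=$1 tool=$2 cmd=$3
    printf '=== %s ===\n' "$label"
    if command -v "$tool" >/dev/null 2>&1; then
        sh -c "$cmd"
    else
        printf 'skipped: %s is not installed\n' "$tool"
    fi
}

# The ten checklist commands, in order. Defined here but not run automatically.
checklist() {
    run_step "1. load averages"        uptime  'uptime'
    run_step "2. kernel messages"      dmesg   'dmesg -T | tail -20'
    run_step "3. system-wide activity" vmstat  'vmstat 1 5'
    run_step "4. per-CPU breakdown"    mpstat  'mpstat -P ALL 1 5'
    run_step "5. per-process CPU"      pidstat 'pidstat 1 5'
    run_step "6. per-device I/O"       iostat  'iostat -xz 1 5'
    run_step "7. memory"               free    'free -m'
    run_step "8. interface throughput" sar     'sar -n DEV 1 5'
    run_step "9. TCP rates"            sar     'sar -n TCP,ETCP 1 5'
    run_step "10. top processes"       top     'top -bn 1 | head -20'
}

# Usage: checklist 2>&1 | tee triage.txt
```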

Load Testing with stress-ng

stress-ng is a workload generator that stresses specific subsystems — useful for validating monitoring, testing autoscaling, and reproducing performance issues in controlled environments.

# Stress 4 CPU cores for 60 seconds (matrix multiplication workload)
stress-ng --cpu 4 --cpu-method matrixprod --timeout 60s --metrics

# Stress memory: allocate and touch 2GB across 2 workers
stress-ng --vm 2 --vm-bytes 2G --vm-method all --timeout 30s

# Stress disk I/O: 4 workers doing sequential writes
stress-ng --hdd 4 --hdd-bytes 1G --timeout 30s

# Stress network (loopback): 2 socket workers
stress-ng --sock 2 --timeout 30s

# Combined stress (realistic mixed workload)
stress-ng --cpu 2 --vm 1 --vm-bytes 1G --hdd 2 --timeout 60s --metrics

Cloud Native

Container Performance Analysis

In containerised environments, resource limits (cgroups) add a layer of indirection. A container may appear idle at the host level but be CPU-throttled within its cgroup. Key tools: docker stats for live resource usage, kubectl top pods for Kubernetes, the cgroup's cpu.stat file for throttling counts (/sys/fs/cgroup/cpu/cpu.stat on cgroup v1, /sys/fs/cgroup/cpu.stat on cgroup v2), and Prometheus + Grafana for historical observability. Always check nr_throttled and the throttled time (throttled_time on v1, in nanoseconds; throttled_usec on v2, in microseconds). If these are growing, the container needs more CPU quota or fewer concurrent tasks.


# Container CPU throttling (cgroup v2)
cat /sys/fs/cgroup/cpu.stat
# nr_throttled — number of times the cgroup was throttled
# throttled_usec — total time spent throttled (microseconds)

# Docker container stats (live)
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"

# Kubernetes pod resource usage
kubectl top pods --sort-by=cpu -n default
kubectl top nodes

# Check if pod is being OOM-killed
kubectl describe pod myapp | grep -A 5 "Last State"
# Look for: OOMKilled, Exit Code 137
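
Raw throttle counts are easier to judge as a ratio of enforcement periods. The sketch below reads nr_periods and nr_throttled from a cpu.stat file and prints the percentage of periods in which the cgroup was throttled. The function name `throttle_pct` is an assumption for illustration. The default path is the cgroup v2 location, and a v1 path can be passed as the argument.

```shell
#!/bin/sh
# throttle_pct [FILE]
# Hypothetical helper: prints the share of CFS periods in which the cgroup
# was throttled. FILE defaults to the cgroup v2 path /sys/fs/cgroup/cpu.stat.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "0.0"    # no CPU limit set, or no periods elapsed yet
        }
    ' "${1:-/sys/fs/cgroup/cpu.stat}"
}

# Usage inside a container:
#   throttle_pct                                  # cgroup v2
#   throttle_pct /sys/fs/cgroup/cpu/cpu.stat      # cgroup v1
```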

Exercises

# Exercise 1: Baseline your system (run the 60-second checklist)
uptime && vmstat 1 3 && mpstat -P ALL 1 3 && free -m

# Exercise 2: Generate CPU load and observe with mpstat
stress-ng --cpu 2 --timeout 20s &
mpstat -P ALL 1 10
# Observe: which CPUs are busy? Is %usr or %sys dominant?

# Exercise 3: Check your system for memory pressure
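vmstat 1 5 | awk 'NR>3 {if ($7>0 || $8>0) print "SWAP ACTIVE: si="$7, "so="$8}'
# NR>3 skips the two header lines and the first sample (averages since boot)
# If si/so are > 0 in the remaining samples, you have memory pressure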

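# Exercise 4: Find which process is doing the most disk I/O
# Keeps only the per-process "Average:" rows and ranks them by read + write KB/s
# (assumes the UID, PID, kB_rd/s, kB_wr/s column order of current sysstat)
sudo env LC_ALL=C pidstat -d 1 5 | awk '$1 == "Average:" && $3 ~ /^[0-9]+$/ {print $4 + $5, $3, $NF}' | sort -rn | head -10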

# Exercise 5: Measure network throughput to localhost
iperf3 -s &
sleep 1
iperf3 -c 127.0.0.1 -t 5
kill %1

Conclusion

Performance analysis is where all 24 parts of this series converge. Understanding how hardware works (Parts 1–3) tells you why CPUs stall on cache misses. Knowing how the kernel manages memory (Parts 4–6) explains why swap kills latency. Understanding filesystems and disk I/O (Parts 7–9) reveals why await spikes. Grasping networking (Parts 10–14) makes retransmission rates meaningful. And process management, debugging, and system calls (Parts 15–23) give you the tools to trace any problem from symptom to root cause.

The USE method is your starting framework — Utilization, Saturation, Errors for every resource. The 60-second checklist is your first response. Flame graphs are your deepest weapon. Together, they transform performance analysis from guessing into engineering.

This series has taken you from transistors to system calls, from boot sequences to flame graphs. You now have the mental model to understand any Linux system — not just use it, but reason about it from first principles. Every abstraction has a cost, every layer has a purpose, and performance is the discipline of knowing where those costs accumulate.
