Part 23: Linux Debugging & Troubleshooting Tools

Debugging Methodology

Before reaching for any tool, effective debugging follows a systematic methodology. Random poking wastes time. The best debuggers follow a disciplined loop: reproduce the problem, observe symptoms, form a hypothesis, instrument to measure, verify the hypothesis, fix, and prevent recurrence.

Systematic Debugging Methodology

flowchart TD
    A["Reproduce\n(Can you trigger it reliably?)"] --> B["Observe Symptoms\n(What exactly is wrong?)"]
    B --> C["Hypothesize\n(What could cause this?)"]
    C --> D["Instrument / Measure\n(strace, tcpdump, lsof, logs)"]
    D --> E{"Hypothesis\nConfirmed?"}
    E -->|Yes| F["Fix\n(Minimal targeted change)"]
    E -->|No| C
    F --> G["Prevent Recurrence\n(Test, monitor, document)"]

Problem Category	Primary Tool	Secondary Tools
CPU	top / htop	perf, mpstat, pidstat
Memory	/proc/meminfo, free	vmstat, slabtop, pmap
Disk I/O	iostat, iotop	blktrace, /proc/diskstats
Network	tcpdump, ss	ip, nstat, /proc/net/*
Process	lsof, /proc/[pid]/*	strace, pstree, nsenter
Syscalls	strace	ltrace, perf trace
Kernel	dmesg	journalctl, ftrace, bpftrace

strace — System Call Tracing

strace intercepts and records every system call a process makes — every open(), read(), write(), connect(), mmap(), and ioctl(). It's the single most useful tool for understanding what a program is actually doing at the kernel boundary. If a process hangs, crashes, or behaves unexpectedly, strace reveals the truth.

            
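            Production Warning: strace adds significant overhead (10–100× slowdown on syscall-heavy workloads) because it uses ptrace() to stop the process on every syscall. In production, always use -e trace= to limit scope, e.g. -e trace=network to trace only network calls, or -e trace=openat,read,write for file I/O. Without further options the filter mainly reduces output, since ptrace still stops on every syscall; on strace 5.3 or newer, adding --seccomp-bpf together with -f lets untraced syscalls run without a stop (it does not apply when attaching with -p). For lower overhead in production, consider perf trace or bpftrace instead.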
        

Common Patterns

# Trace a command from start to finish
strace ls /tmp 2>&1 | tail -20

# Attach to a running process (requires root or same user)
sudo strace -p 1234

# Trace only file-related syscalls
strace -e trace=file ls /tmp 2>&1 | head -20
# Shows: openat(), stat(), access(), etc.

# Trace only network syscalls (connect, send, recv, etc.)
strace -e trace=network curl -s https://example.com 2>&1 | grep -E "connect|sendto|recvfrom"

# Follow child processes (essential for forking servers)
strace -f -e trace=process nginx 2>&1 | head -30
# -f follows forks, -e trace=process shows fork/exec/wait

# Show string arguments up to 200 chars (default is 32)
strace -s 200 -e trace=write python3 -c "print('hello world')" 2>&1

# Write output to file (avoids mixing with program output)
# Modern glibc opens files with openat(), so trace=open alone would miss them
strace -o /tmp/trace.log -e trace=openat,read,write cat /etc/hostname
cat /tmp/trace.log

Performance Tracing

# Time each syscall — find what's slow
strace -T -e trace=file ls /usr/bin 2>&1 | sort -t'<' -k2 -rn | head -10
# -T appends time spent in each syscall: <0.000123>

# Count syscalls by type (summary mode: no per-call output, but still ptrace-based, so not free)
strace -c ls /tmp 2>&1
# Output: % time, seconds, usecs/call, calls, errors, syscall
# Great for identifying which syscall dominates

# Timestamp each syscall (for correlating with logs)
strace -t -e trace=network curl -s https://example.com 2>&1 | head -10
# -t = wall-clock time (1 s resolution), -tt = wall-clock time with microseconds, -ttt = seconds since the epoch with microseconds

# Trace a process that's stuck (find what it's blocking on)
# pgrep -o picks the oldest matching PID, so strace gets exactly one process
sudo strace -p "$(pgrep -o -f "my-stuck-app")" 2>&1 | head -5
# Common findings: poll(), futex(), read() on a socket = waiting for I/O
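
The -T output above can also be aggregated per syscall rather than sorted line by line. The following is a minimal sketch, not a replacement for strace -c: the function name summarize_strace_times is hypothetical, and it assumes the default line format without -t or -r timestamp prefixes.

```shell
#!/bin/sh
# Sum the time reported by `strace -T` per syscall, slowest total first.
# Reads strace output on stdin and expects lines shaped like
#   name(args) = result <seconds>
# Lines without a trailing <seconds> field (signals, exit notices) are skipped.
summarize_strace_times() {
    awk '
        {
            # With -f, lines may start with "[pid N] " or "N "; drop that prefix.
            sub(/^\[pid +[0-9]+\] +/, "")
            sub(/^[0-9]+ +/, "")
        }
        match($0, /<[0-9]+\.[0-9]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2)
            name = $0
            sub(/\(.*/, "", name)
            # Keep only plain syscall names; "<... resumed>" fragments are ignored.
            if (name ~ /^[a-z_0-9]+$/) {
                total[name] += secs
                calls[name]++
            }
        }
        END {
            for (n in total)
                printf "%.6f %d %s\n", total[n], calls[n], n
        }
    ' | sort -rn
}
```

A possible use is to record a trace with strace -f -T -o /tmp/trace.log followed by the command, then run summarize_strace_times < /tmp/trace.log. Each output line is total seconds, call count, and syscall name.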

lsof — List Open Files

lsof (List Open Files) shows every file descriptor held by every process. On Linux, "everything is a file" — sockets, pipes, devices, and regular files all appear as file descriptors. lsof reveals what resources a process is using.

# All open files for a specific process
lsof -p $(pgrep -f nginx | head -1) | head -20

# All network connections for a process
lsof -i -a -p $(pgrep -f nginx | head -1)

# Find what process is using a specific port
lsof -i :8080
# Shows PID, user, FD type, protocol, and connection state

# Find deleted files still held open (common disk space issue!)
lsof +L1
# Files with link count 0 = deleted but still open
# The space won't be freed until the process closes the FD

# All files open under a directory tree (find what's preventing unmount)
lsof +D /var/log
# +D recurses into subdirectories and can be slow on large trees

# TCP connections in ESTABLISHED state for a user
# -a ANDs the selections; without it lsof ORs them and lists every file the user has open
lsof -a -i TCP -s TCP:ESTABLISHED -u www-data

# Count lsof rows per process (a rough FD-leak signal; rows also include cwd, txt, and mem entries)
lsof -u www-data 2>/dev/null | awk '{print $1, $2}' | sort | uniq -c | sort -rn | head -10
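
The "deleted but still open" case that lsof +L1 reports is easy to reproduce safely. This sketch assumes Linux with /proc mounted; descriptor number 9 and the temporary file are arbitrary choices made for the demonstration.

```shell
#!/bin/sh
# Reproduce a deleted-but-still-open file and locate it through /proc.
tmpfile=$(mktemp)
exec 9>"$tmpfile"            # hold the file open on descriptor 9
echo "some log data" >&9
rm "$tmpfile"                # unlink: the name is gone, the inode is not

# Child processes inherit descriptor 9, so /proc/self/fd/9 refers to the same
# open file. For another process you would read /proc/<pid>/fd/<n> instead.
# The link target gains a " (deleted)" suffix once the last name is removed.
target=$(readlink /proc/self/fd/9)
echo "fd 9 -> $target"

# The data stays readable through /proc until the descriptor is closed.
held_bytes=$(wc -c < /proc/self/fd/9)
echo "bytes still held: $held_bytes"

exec 9>&-                    # closing the descriptor lets the kernel free the space
```

When the holding process cannot be restarted, truncating the file through its /proc/<pid>/fd/<n> entry is a commonly used way to release the space, at the cost of losing the contents.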

tcpdump — Packet Capture

tcpdump captures raw packets on a network interface. It's the definitive tool for network debugging — you see exactly what bytes are on the wire. Combined with Wireshark (for GUI analysis of pcap files), it answers "did the packet actually leave?", "did we get a response?", and "what was in it?".

            
            Security Warning: tcpdump captures raw packet content, including HTTP request/response bodies. Passwords, API keys, session tokens, and other sensitive data in unencrypted (HTTP, non-TLS) traffic will be visible in captures. Never store pcap files containing production traffic in insecure locations. Use BPF filters to restrict which packets are captured, use -s (snapshot length) to truncate each packet to its headers, and delete captures after analysis.
        

# Capture HTTP traffic on port 80 (show ASCII content)
sudo tcpdump -i eth0 -A 'tcp port 80' -c 20

# Capture traffic to/from a specific host
sudo tcpdump -i any host 10.0.0.5 -nn -c 50
# -nn = don't resolve hostnames or ports (faster)

# Capture DNS queries (UDP port 53)
sudo tcpdump -i any 'udp port 53' -nn -c 10

# Filter by source and destination
sudo tcpdump -i eth0 'src 192.168.1.100 and dst port 443' -c 20

# Write to pcap file for Wireshark analysis
sudo tcpdump -i eth0 -w /tmp/capture.pcap 'tcp port 8080' -c 1000
# Later: wireshark /tmp/capture.pcap  (or tcpdump -r /tmp/capture.pcap)

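# Show only initial TCP SYN packets (new connections)
# Masking SYN|ACK also matches SYNs that carry ECN flags and excludes SYN-ACK replies
sudo tcpdump -i any 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' -nn -c 10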

# Capture with timestamps and packet sizes (no content)
sudo tcpdump -i eth0 -tttt -q 'tcp port 443' -c 20
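
For longer captures, limiting the bytes stored per packet and rotating output files keeps both disk usage and exposure of payload data down. The helper below only prints a command line so it can be reviewed before running it with sudo. The function name build_capture_cmd, the 96-byte snapshot length, and the file sizes are assumptions chosen for illustration; 96 bytes covers typical Ethernet, IP, and TCP headers but can still include the first few payload bytes.

```shell
#!/bin/sh
# Build a tcpdump command line that truncates packets and rotates files.
# The interface, filter, and output path are illustrative placeholders.
build_capture_cmd() {
    iface=$1      # e.g. eth0
    filter=$2     # BPF expression, e.g. "tcp port 443"
    outfile=$3    # e.g. /tmp/headers.pcap
    # -s 96   snapshot length: store at most 96 bytes of each packet
    # -C 100  start a new file after roughly 100 million bytes
    # -W 5    keep at most 5 files, overwriting the oldest
    # -nn     no host or port name resolution
    printf 'tcpdump -i %s -nn -s 96 -C 100 -W 5 -w %s %s\n' \
        "$iface" "$outfile" "'$filter'"
}

build_capture_cmd eth0 "tcp port 443" /tmp/headers.pcap
```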

ss & netstat — Socket Statistics

ss (socket statistics) is the modern replacement for netstat. It's faster (reads directly from kernel via netlink) and shows more information — including TCP internal state, timer details, and memory usage per socket.

# Listening TCP sockets with process info
ss -tlnp
# -t = TCP, -l = listening only, -n = numeric, -p = process (use -tanp for all connections)

# Established connections with internal TCP info
ss -ti state established | head -30
# -i shows cwnd, rtt, retrans, and similar fields; add -o for timers and -m for socket memory

# Count connections by state (NR>1 skips the header row)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Find all connections to a specific destination
ss -tn dst 10.0.0.5

# Show socket memory usage (detect buffer bloat)
ss -tm state established | grep -A 1 "mem:" | head -20

# Listening sockets with backlog info
ss -tlnp | column -t
# Recv-Q = pending connections in accept queue
# Send-Q = max backlog size

# All UNIX domain sockets
ss -xlnp | head -20

# Connections with timer info (detect retransmit storms)
ss -to state established | head -20
# Shows keepalive/retransmit timers per connection
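
The Recv-Q and Send-Q columns for listening sockets can be checked automatically. This is a minimal sketch that reads ss -tln output on stdin; the function name accept_queue_pressure and the 80% threshold are assumptions, not values taken from ss.

```shell
#!/bin/sh
# Flag listening sockets whose accept queue is at least 80% of the backlog.
# Expected columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
accept_queue_pressure() {
    awk '$1 == "LISTEN" && $3 > 0 && $2 >= 0.8 * $3 {
        printf "%s queue=%d backlog=%d\n", $4, $2, $3
    }'
}
```

Typical use would be ss -tln | accept_queue_pressure. Empty output means no listener is close to its backlog at that instant, so a single sample can miss short bursts.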

dmesg & journalctl — Kernel and System Logs

dmesg shows kernel ring buffer messages — hardware events, driver issues, OOM kills, filesystem errors, and network stack messages. journalctl queries the systemd journal, which aggregates logs from all services, the kernel, and user processes with structured metadata.

# Recent kernel messages (newest last)
dmesg -T | tail -30
# -T = human-readable timestamps

# Kernel errors and warnings only
dmesg --level=err,warn | tail -20

# OOM (Out Of Memory) killer events
dmesg | grep -i "oom\|killed process\|out of memory"

# Hardware/driver errors
dmesg | grep -iE "error|fault|fail" | tail -20

# Follow kernel messages in real time
dmesg -w
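
OOM-kill lines are easier to scan once reduced to the fields that matter. The sketch below assumes the common "Killed process <pid> (<comm>) total-vm:...kB, anon-rss:...kB" wording; the exact format differs between kernel versions, and the function name parse_oom_kills is hypothetical.

```shell
#!/bin/sh
# Extract PID, command name, and anonymous RSS from kernel OOM-kill lines.
# Reads kernel log text on stdin; lines that do not match are dropped.
parse_oom_kills() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\(.*\)) total-vm:[0-9]*kB, anon-rss:\([0-9]*\)kB.*/pid=\1 comm=\2 anon_rss_kb=\3/p'
}
```

It can be fed from either log source, for example dmesg | parse_oom_kills or journalctl -k | parse_oom_kills.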

# journalctl — systemd journal queries

# Logs for a specific service (most common use)
journalctl -u nginx.service --since "1 hour ago" --no-pager | tail -30

# Follow logs in real time (like tail -f)
journalctl -u myapp.service -f

# Kernel messages only (equivalent to dmesg)
journalctl -k --since "10 min ago"

# Errors from the current boot only
journalctl -b 0 --priority=err
# --priority=err shows err and more severe levels (crit, alert, emerg)
# Levels: emerg, alert, crit, err, warning, notice, info, debug

# Logs from previous boot (useful after a crash)
journalctl -b -1 --priority=err

# Logs for a specific PID
journalctl _PID=1234 --since "2 hours ago"

# JSON output for programmatic analysis
journalctl -u nginx.service --output=json-pretty | head -30

# Disk usage of journal
journalctl --disk-usage

/proc Filesystem Deep Dive

The /proc filesystem is a virtual filesystem that exposes kernel data structures as files. Every running process has a directory at /proc/[pid]/ containing its memory maps, file descriptors, command line, environment, and status. System-wide information lives in /proc/meminfo, /proc/cpuinfo, /proc/net/*, etc.

# Pick a process to inspect (using your shell's PID as example)
PID=$$

# Command line that started the process
cat /proc/$PID/cmdline | tr '\0' ' ' ; echo

# Current working directory
ls -la /proc/$PID/cwd

# Environment variables
cat /proc/$PID/environ | tr '\0' '\n' | head -10

# Memory map (shared libraries, heap, stack, mmap regions)
cat /proc/$PID/maps | head -20
# Format: address perms offset dev inode pathname

# Process status (state, memory, threads, capabilities)
cat /proc/$PID/status | grep -E "^(Name|State|Pid|VmRSS|VmSize|Threads|voluntary)"

# Open file descriptors (what FDs point to)
ls -la /proc/$PID/fd | head -20
# 0=stdin, 1=stdout, 2=stderr, 3+=opened files/sockets

# File descriptor limits
cat /proc/$PID/limits | grep "open files"
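
The last two commands can be combined into a single check of descriptor usage against the limit. This is a sketch assuming Linux /proc; the function name fd_usage is hypothetical, and reading another user's /proc/<pid>/fd requires root.

```shell
#!/bin/sh
# Report how many descriptors a process holds relative to its soft limit.
fd_usage() {
    pid=$1
    used=$(ls "/proc/$pid/fd" | wc -l)
    # "Max open files" row: the 4th field is the soft limit, the 5th the hard limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid used=$used soft_limit=$soft"
}

fd_usage $$
```

A count that keeps climbing toward the soft limit across repeated runs is a stronger leak signal than any single reading.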

# System-wide /proc files

# Memory overview
cat /proc/meminfo | head -10
# MemTotal, MemFree, MemAvailable, Buffers, Cached, SwapTotal

# CPU info
cat /proc/cpuinfo | grep "model name" | head -1
nproc   # Number of CPUs

# Network statistics
cat /proc/net/tcp | head -5       # Raw TCP socket table
cat /proc/net/sockstat            # Socket allocation summary
cat /proc/net/dev                 # Per-interface packet/byte counters

# Disk I/O statistics
cat /proc/diskstats | grep -E "sda|nvme" | head -5

# System load average (1, 5, 15 min)
cat /proc/loadavg

# Uptime in seconds
cat /proc/uptime
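
The raw table in /proc/net/tcp prints addresses and ports in hexadecimal, which is why ss is usually more convenient. As a worked example of the encoding, the sketch below decodes one IPv4 ADDR:PORT field. It assumes a little-endian host (x86, most ARM systems), where the four address bytes appear in reverse order; the function name decode_proc_net_addr is hypothetical.

```shell
#!/bin/sh
# Decode an IPv4 "ADDR:PORT" pair as printed in /proc/net/tcp.
decode_proc_net_addr() {
    hex_addr=${1%:*}
    hex_port=${1#*:}
    o1=$(printf '%s' "$hex_addr" | cut -c7-8)   # last hex pair is the first octet
    o2=$(printf '%s' "$hex_addr" | cut -c5-6)
    o3=$(printf '%s' "$hex_addr" | cut -c3-4)
    o4=$(printf '%s' "$hex_addr" | cut -c1-2)
    # The port is plain hexadecimal, most significant digit first.
    echo "$((0x$o1)).$((0x$o2)).$((0x$o3)).$((0x$o4)):$((0x$hex_port))"
}

decode_proc_net_addr 0100007F:1F90   # prints 127.0.0.1:8080
```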

Debugging Scenario

Debugging a Container That Won't Start

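A container keeps restarting with exit code 137. That code means the process was terminated by SIGKILL (128 + 9), which is what the kernel OOM killer sends, although a manual kill -9 or docker kill produces the same code, so confirm the cause before acting. Here's the systematic approach: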

1. Check container logs: docker logs --tail 50 myapp — often reveals the application error before the kill.

2. Inspect container state: docker inspect myapp | jq '.[0].State' — look for OOMKilled: true, exit codes, and restart count.

3. Check kernel OOM events: dmesg | grep -i "oom\|killed process" — confirms the kernel killed it and shows memory usage at kill time.

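4. Enter the container's namespaces while it is running: sudo nsenter -t $(docker inspect -f '{{.State.Pid}}' myapp) -m -p -n -- /bin/sh, then check which processes are consuming memory (for example via /proc/[pid]/status). Note that /proc/meminfo inside a container normally reports host-wide totals rather than the cgroup limit, and .State.Pid is 0 while the container is stopped.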

5. Check resource limits: docker inspect myapp | jq '.[0].HostConfig.Memory' reports the memory limit in bytes (0 means no limit). Compare it to what the app actually needs (VmRSS from /proc/[pid]/status).

Resolution: Either increase the memory limit, fix the memory leak in the application, or tune JVM/runtime heap sizes to fit within the cgroup limit.
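
To see how close a container is to its limit before the next kill, the cgroup's own counters are more reliable than /proc/meminfo. The sketch below assumes cgroup v2 file names (memory.current and memory.max); the directory is passed as a parameter because its location depends on the runtime and cgroup driver, and the function name cgroup_mem_headroom is hypothetical.

```shell
#!/bin/sh
# Compare current memory usage with the limit for one cgroup v2 directory.
cgroup_mem_headroom() {
    dir=$1
    current=$(cat "$dir/memory.current")
    max=$(cat "$dir/memory.max")
    if [ "$max" = "max" ]; then
        # cgroup v2 writes the literal string "max" when no limit is set.
        echo "current=$current limit=unlimited"
    else
        echo "current=$current limit=$max used_pct=$(( $current * 100 / $max ))"
    fi
}
```

Inside a container on a cgroup v2 host, the container's own cgroup is usually visible at /sys/fs/cgroup, so cgroup_mem_headroom /sys/fs/cgroup is a reasonable first attempt.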


Exercises

# Exercise 1: Trace what files a command opens
strace -e trace=openat cat /etc/hostname 2>&1

# Exercise 2: Find your shell's open file descriptors
ls -la /proc/$$/fd

# Exercise 3: Count TCP connections by state on your system (NR>1 skips the header row)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Exercise 4: Check for OOM kill events in kernel log
dmesg | grep -i "oom\|killed process" | tail -5

# Exercise 5: Inspect /proc for your current shell
cat /proc/$$/status | grep -E "^(Name|State|VmRSS|Threads)"

# Exercise 6: View recent errors from any systemd service
journalctl --priority=err --since "1 hour ago" --no-pager | tail -10

# Exercise 7: Count open files per command name (top consumers)
lsof 2>/dev/null | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn | head -5

Conclusion & Next Steps

Linux debugging is a craft built on systematic methodology and the right tool for the right layer: strace for syscall-level truth, lsof for resource inspection, tcpdump for network packets, ss for socket state, dmesg/journalctl for kernel and service logs, and /proc for deep process introspection. The methodology matters more than the tools — reproduce, observe, hypothesize, instrument, verify, fix, prevent. With these tools and this discipline, no production issue stays mysterious for long.
