
Part 11: TCP/IP & UDP Deep Dive

May 13, 2026 · Wasil Zafar · 19 min read

How TCP establishes reliable connections, manages flow and congestion, guarantees ordered delivery — and when UDP's simplicity is the better choice.

Table of Contents

  1. TCP Overview
  2. TCP Connection States
  3. Flow Control
  4. Congestion Control
  5. UDP — User Datagram Protocol
  6. Socket Programming
  7. Exercises
  8. Conclusion

TCP Overview

TCP (Transmission Control Protocol) is a connection-oriented, reliable, ordered, byte-stream protocol that sits at Layer 4 (Transport) of the OSI model. It provides guarantees that the network layer (IP) does not: data arrives in order, nothing is lost, duplicates are eliminated, and the sender doesn't overwhelm the receiver.

Every TCP connection is identified by a 4-tuple: (source IP, source port, destination IP, destination port). A single server IP:port can handle thousands of concurrent connections because each client uses a different source IP:port combination.
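
To make the 4-tuple concrete, here is a minimal sketch that opens one loopback connection and prints the tuple as seen from each side. The helper name connection_tuples is hypothetical, and port 0 is used so the kernel picks a free port.

```python
import socket

def connection_tuples():
    """Open a loopback TCP connection and return the 4-tuple seen by each side."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            conn, _ = listener.accept()
            with conn:
                # (local IP, local port, remote IP, remote port) from each side
                client_view = client.getsockname() + client.getpeername()
                server_view = conn.getsockname() + conn.getpeername()
                return client_view, server_view

client_view, server_view = connection_tuples()
print("client:", client_view)
print("server:", server_view)
```

The two views contain the same four values with local and remote swapped, which is why the kernel can tell connections apart even when they all target the same server IP:port.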

Key Insight: TCP is not a single feature — it's a collection of mechanisms: connection management (handshake/teardown), reliable delivery (sequence numbers + ACKs + retransmission), flow control (receiver window), congestion control (slow start + AIMD), and ordered delivery (reassembly buffer). Understanding each mechanism independently makes TCP debuggable.
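
As an illustration of the ordered-delivery mechanism, the sketch below models a reassembly buffer in a few lines. It is a simplification, not how a kernel implements it: the function name reassemble is hypothetical, segments are assumed never to overlap partially, and sequence-number wraparound is ignored.

```python
def reassemble(segments, initial_seq=0):
    """Deliver payloads in sequence order, buffering gaps and dropping duplicates.

    segments: iterable of (seq, payload) pairs in arrival order, where seq is
    the byte offset of the first payload byte.
    """
    next_seq = initial_seq   # next byte the application is waiting for
    pending = {}             # out-of-order segments, keyed by sequence number
    delivered = []
    for seq, payload in segments:
        if seq < next_seq or seq in pending:
            continue         # duplicate of data already delivered or buffered
        pending[seq] = payload
        # Drain every segment that is now contiguous with delivered data.
        while next_seq in pending:
            chunk = pending.pop(next_seq)
            delivered.append(chunk)
            next_seq += len(chunk)
    return b"".join(delivered)

arrivals = [(0, b"he"), (5, b"wor"), (2, b"llo"), (2, b"llo"), (8, b"ld")]
print(reassemble(arrivals))   # b'helloworld'
```

The segment at offset 5 arrives early and waits in the buffer until offset 2 fills the gap; the second copy of offset 2 is discarded as a duplicate.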

The Three-Way Handshake

Before any data flows, TCP establishes a connection via a three-way handshake. This synchronises sequence numbers and confirms both sides are ready.

TCP Three-Way Handshake
sequenceDiagram
    participant C as Client
    participant S as Server

    Note over S: Server calls listen() — enters LISTEN state
    C->>S: SYN (seq=x)
    Note over C: Client enters SYN_SENT state
    S->>C: SYN-ACK (seq=y, ack=x+1)
    Note over S: Server enters SYN_RCVD state
    C->>S: ACK (ack=y+1)
    Note over C,S: Connection ESTABLISHED — data can flow
            
# Capture the three-way handshake with tcpdump
# Terminal 1: capture SYN/ACK-flagged packets on port 8080
sudo tcpdump -i lo -nn 'tcp port 8080 and (tcp[tcpflags] & (tcp-syn|tcp-ack) != 0)' -c 6 &

# Terminal 2: start a simple server
python3 -c "
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('127.0.0.1', 8080))
s.listen(1)
print('Listening on :8080')
conn, addr = s.accept()
print(f'Connection from {addr}')
conn.send(b'hello')
conn.close()
s.close()
" &
sleep 1

# Terminal 3: connect as client
python3 -c "
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 8080))
data = s.recv(1024)
print(f'Received: {data}')
s.close()
"
# tcpdump output shows: SYN → SYN-ACK → ACK (the handshake)

Connection Teardown

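TCP uses a four-way teardown because each direction is closed independently. Either side can initiate the close. The initiator enters TIME_WAIT for 2×MSL (Maximum Segment Lifetime) so that delayed segments from the old connection cannot be mistaken for data on a new connection that reuses the same 4-tuple, and so that the final ACK can be retransmitted if it is lost. RFC 793 suggests an MSL of 2 minutes; Linux hard-codes the TIME_WAIT duration at 60 seconds.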

TCP Four-Way Teardown
sequenceDiagram
    participant A as Initiator (Active Close)
    participant B as Responder (Passive Close)

    A->>B: FIN (I'm done sending)
    Note over A: FIN_WAIT_1
    B->>A: ACK (Got your FIN)
    Note over A: FIN_WAIT_2
    Note over B: CLOSE_WAIT (app may still send data)
    B->>A: FIN (I'm done too)
    Note over B: LAST_ACK
    A->>B: ACK (Got your FIN)
    Note over A: TIME_WAIT (2×MSL ≈ 60s)
    Note over B: CLOSED
    Note over A: CLOSED (after timeout)
            
TIME_WAIT Accumulation: On hosts that open many outbound connections (load balancers, API gateways, proxies), thousands of sockets in TIME_WAIT can exhaust the ephemeral ports (default range: 32768–60999, about 28K ports) available toward a given destination. Mitigations: enable tcp_tw_reuse (allows reusing TIME_WAIT sockets for new outgoing connections), widen the ephemeral port range, or use connection pooling (HTTP keep-alive, database connection pools) so that connections are not created and destroyed so frequently.
# View TCP connection states on your system
ss -tan state time-wait | wc -l       # Count TIME_WAIT connections
ss -tan state established | wc -l     # Count ESTABLISHED connections
ss -s                                   # Summary of all socket states

# View all states at once
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Kernel tunables for TIME_WAIT
sysctl net.ipv4.tcp_tw_reuse           # 1 = reuse TIME_WAIT sockets for new outgoing connections
sysctl net.ipv4.ip_local_port_range    # Ephemeral port range
sysctl net.ipv4.tcp_fin_timeout        # FIN_WAIT_2 timeout (not TIME_WAIT)
sysctl net.ipv4.tcp_max_tw_buckets     # Max TIME_WAIT sockets system-wide
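
A rough back-of-the-envelope calculation shows why TIME_WAIT accumulation becomes a limit. This sketch assumes the default port range and the 60-second TIME_WAIT quoted above, one source IP, one destination IP:port, and no tcp_tw_reuse; the function name is hypothetical.

```python
def max_connection_rate(port_low, port_high, time_wait_s):
    """Return (usable ports, sustainable new connections per second).

    Each closed connection holds its source port for time_wait_s seconds,
    so the port pool is recycled at most once per TIME_WAIT period.
    """
    ports = port_high - port_low + 1
    return ports, ports / time_wait_s

ports, rate = max_connection_rate(32768, 60999, 60)
print(f"{ports} ephemeral ports -> about {rate:.0f} new connections/s")
```

Under these assumptions the ceiling is roughly 470 short-lived connections per second to a single destination, which is low enough for a busy proxy to hit.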

TCP Connection States

| State | Side | Meaning |
|---|---|---|
| LISTEN | Server | Waiting for incoming SYN (server called listen()) |
| SYN_SENT | Client | SYN sent, waiting for SYN-ACK |
| SYN_RCVD | Server | SYN received, SYN-ACK sent, waiting for ACK |
| ESTABLISHED | Both | Connection open, data flowing |
| FIN_WAIT_1 | Initiator | FIN sent, waiting for ACK |
| FIN_WAIT_2 | Initiator | ACK received for our FIN, waiting for peer's FIN |
| CLOSE_WAIT | Responder | Peer sent FIN, we ACKed; app should call close() |
| LAST_ACK | Responder | We sent our FIN, waiting for final ACK |
| TIME_WAIT | Initiator | Both FINs exchanged; waiting 2×MSL before fully closing |
| CLOSED | Both | Connection fully terminated |

Debugging Pattern

CLOSE_WAIT Leak — The Silent Connection Killer

If you see thousands of connections in CLOSE_WAIT, it means the remote side sent FIN (they closed), your application received it, but your code never called close() on the socket. This is a resource leak — each CLOSE_WAIT socket holds a file descriptor and kernel memory. Common causes: missing finally: conn.close() in exception handling, connection pool not properly returning connections, or the application not reading EOF from the socket. Fix: ensure every socket/connection is closed in a try/finally or with block.
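
A minimal sketch of the fix, assuming a handler that simply reads until EOF: the with block guarantees close() runs whether the loop ends normally or an exception is raised. The name handle is hypothetical, and socket.socketpair() stands in for a real client/server pair.

```python
import socket

def handle(conn):
    """Read until the peer closes, then close our side too."""
    received = bytearray()
    with conn:                   # close() runs on normal exit and on exceptions
        while True:
            chunk = conn.recv(4096)
            if not chunk:        # b"" means the peer sent FIN (EOF)
                break
            received.extend(chunk)
    return bytes(received)

# Demo: a connected pair of stream sockets standing in for client and server.
peer, ours = socket.socketpair()
peer.sendall(b"bye")
peer.close()                     # the peer initiates the close
data = handle(ours)
print(data, ours.fileno())       # fileno() is -1 once the socket is closed
```

Without the with block (or an equivalent try/finally), an exception inside the loop would leave the socket open and the connection stuck in CLOSE_WAIT.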

Tags: CLOSE_WAIT, Resource Leak, Debugging

Flow Control

TCP flow control prevents the sender from overwhelming the receiver. Each ACK carries a receive window (rwnd) — the number of bytes the receiver is willing to accept. The sender cannot have more than rwnd bytes of unacknowledged data in flight.

# View TCP window sizes on established connections
ss -ti state established | head -30
# Look for: wscale (window scaling factor), rcv_space (receive buffer)

# View and tune receive/send buffer sizes
sysctl net.ipv4.tcp_rmem    # min default max (receive buffer)
sysctl net.ipv4.tcp_wmem    # min default max (send buffer)
# Typical tcp_rmem defaults: 4096 131072 6291456 (4KB / 128KB / 6MB)
# Typical tcp_wmem defaults: 4096 16384 4194304 (4KB / 16KB / 4MB)

# View a specific socket's buffer sizes
ss -tim 'dst 8.8.8.8' 2>/dev/null | head -10
# mem: shows sk_buff memory usage for the socket

# Window scaling (RFC 7323) allows windows up to 1GB
# Enabled by default on modern Linux
sysctl net.ipv4.tcp_window_scaling   # 1 = enabled
Bandwidth-Delay Product (BDP): For a link with 100 Mbps bandwidth and 50 ms RTT, the BDP is 100 Mbps × 0.05 s = 5 Mbit = 625 KB. To fully utilise this link, the TCP window must be at least 625 KB. If the receive buffer is smaller than the BDP, throughput is limited by the window size, not the link capacity. This is why buffer tuning matters for long-distance, high-bandwidth transfers.
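
The BDP arithmetic, and its inverse (the throughput ceiling imposed by a given window), can be checked with a small sketch. The function names are hypothetical, and integer units (bits per second, milliseconds) are used to avoid floating-point rounding.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: bits/s x seconds / 8."""
    return bandwidth_bps * rtt_ms // 8000

def window_limited_bps(window_bytes, rtt_ms):
    """Maximum throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 * 1000 // rtt_ms

print(bdp_bytes(100_000_000, 50))       # 625000 bytes = 625 KB
print(window_limited_bps(65_535, 50))   # 10485600 bits/s, about 10.5 Mbps
```

The second line shows what happens without window scaling: a 64 KB window on a 50 ms path caps throughput near 10.5 Mbps regardless of link capacity.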

Congestion Control

Flow control prevents overwhelming the receiver. Congestion control prevents overwhelming the network. TCP maintains a congestion window (cwnd) — the number of bytes the sender can send before receiving an ACK. The actual send window is min(cwnd, rwnd).

TCP Congestion Control Phases
flowchart LR
    SS["Slow Start\ncwnd doubles each RTT\n(exponential growth)"]
    CA["Congestion Avoidance\ncwnd += 1 MSS per RTT\n(linear growth — AIMD)"]
    FR["Fast Recovery\ncwnd halved\n(on 3 dup ACKs)"]
    TO["Timeout\nssthresh = cwnd/2\ncwnd = 1 MSS\n(back to Slow Start)"]

    SS -->|cwnd reaches ssthresh| CA
    CA -->|3 duplicate ACKs| FR
    FR -->|Recovery complete| CA
    CA -->|Timeout| TO
    SS -->|Timeout| TO
    TO --> SS
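
The phase transitions can be sketched as a toy simulation. This is a simplified Reno-style model, not CUBIC or BBR and not the kernel's implementation: cwnd is counted in whole MSS units, one step is one RTT, and the event names "dupack" and "timeout" are hypothetical labels for the two loss signals.

```python
def simulate_cwnd(rounds, ssthresh, loss_events):
    """Return cwnd (in MSS) at the start of each RTT.

    loss_events maps an RTT index to "dupack" (3 duplicate ACKs)
    or "timeout" (retransmission timer expired).
    """
    cwnd = 1
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        event = loss_events.get(rtt)
        if event == "timeout":
            ssthresh = max(cwnd // 2, 2)   # remember half the old window first
            cwnd = 1                       # then restart from slow start
        elif event == "dupack":
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh                # multiplicative decrease
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh) # slow start: exponential growth
        else:
            cwnd += 1                      # congestion avoidance: additive increase
    return history

print(simulate_cwnd(10, 8, {6: "dupack"}))   # [1, 2, 4, 8, 9, 10, 11, 5, 6, 7]
print(simulate_cwnd(8, 8, {4: "timeout"}))   # [1, 2, 4, 8, 9, 1, 2, 4]
```

The first run shows the AIMD sawtooth after duplicate ACKs; the second shows the much harsher reset to 1 MSS after a timeout.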
            
# View the current congestion control algorithm
sysctl net.ipv4.tcp_congestion_control   # Usually "cubic" (default on Linux)

# Available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# cubic reno bbr (if bbr module loaded)

# BBR (Bottleneck Bandwidth and RTT) — Google's algorithm
# Better for high-latency, lossy networks (cloud, WAN)
# sudo sysctl -w net.ipv4.tcp_congestion_control=bbr  # Switch to BBR

# View per-connection congestion info
ss -ti state established | grep -E "cwnd|ssthresh|rtt" | head -10
# cwnd: current congestion window (in MSS units)
# ssthresh: slow start threshold
# rtt: round-trip time (smoothed/variance)

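# Monitor retransmissions (a rising count is a sign of loss or congestion)
nstat -az | grep -i retrans                      # System-wide counters (nstat ships with iproute2)
ss -ti state established | grep -c 'retrans:'    # Connections currently reporting retransmissions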

UDP — User Datagram Protocol

UDP is connectionless, unreliable, and unordered — and that's its strength. No handshake, no state, no retransmission, no flow or congestion control. Each datagram is independent. This makes UDP faster and simpler for applications that can tolerate or handle loss themselves.

| Feature | TCP | UDP |
|---|---|---|
| Connection | Connection-oriented (handshake) | Connectionless (fire and forget) |
| Reliability | Guaranteed delivery (retransmission) | Best-effort (no retransmission) |
| Ordering | Ordered byte stream | No ordering guarantee |
| Flow control | Receive window | None |
| Congestion control | Slow start, AIMD, etc. | None (application must handle) |
| Overhead | 20-byte header + state | 8-byte header, no state |
| Use cases | HTTP, SSH, SMTP, databases | DNS, NTP, gaming, video streaming, QUIC |

# Send/receive UDP datagrams with netcat
# Terminal 1: UDP listener
nc -u -l 9999 &

# Terminal 2: Send UDP datagram
echo "hello UDP" | nc -u -w 1 127.0.0.1 9999

# View UDP socket statistics
ss -unap                 # UDP sockets with process info
ss -s | grep UDP         # UDP summary

# DNS uses UDP port 53 (most queries fit in one datagram)
dig +short google.com    # This uses UDP by default
dig +tcp google.com      # Force TCP for DNS (for large responses)

# View UDP receive/send errors (indicates packet loss)
cat /proc/net/snmp | grep Udp:
# InErrors = packets dropped (buffer full, checksum fail)
# RcvbufErrors = dropped due to receive buffer overflow
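
For comparison with the TCP examples, here is a minimal UDP round trip in Python over loopback. It is a sketch: the helper name udp_echo_once is hypothetical, port 0 lets the kernel choose a free port, and the timeout exists only so the demo cannot hang if a datagram were dropped.

```python
import socket

def udp_echo_once(message: bytes, timeout_s: float = 2.0) -> bytes:
    """Send one datagram to a local UDP socket and return its reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        server.bind(("127.0.0.1", 0))      # port 0: kernel picks a free port
        server.settimeout(timeout_s)
        client.settimeout(timeout_s)

        # No listen(), accept(), or connect(): the first packet carries data.
        client.sendto(message, server.getsockname())
        data, client_addr = server.recvfrom(2048)   # one recvfrom = one datagram
        server.sendto(b"ACK: " + data, client_addr)
        reply, _ = client.recvfrom(2048)
        return reply

print(udp_echo_once(b"hello UDP"))   # b'ACK: hello UDP'
```

Note that message boundaries are preserved: each recvfrom() returns exactly one datagram, unlike TCP's byte stream.
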
Modern Protocols

QUIC — UDP-Based Reliable Transport

HTTP/3 runs on QUIC, a protocol built on top of UDP that provides TCP-like reliability, built-in TLS 1.3 encryption, and multiplexed streams, typically implemented in user space. Why not just use TCP? Because TCP delivers a single ordered byte stream, one lost packet stalls every HTTP/2 stream multiplexed on that connection (head-of-line blocking). QUIC tracks loss per stream, so a lost packet in one stream doesn't block the others. QUIC also completes a fresh handshake in 1 RTT and supports 0-RTT resumption (vs 2-3 RTTs for TCP plus TLS). Google, Cloudflare, and Meta all serve a large share of their traffic over QUIC today.

Tags: QUIC, HTTP/3, Head-of-Line Blocking

Socket Programming

A socket is the kernel's abstraction for a network endpoint. The socket API (socket, bind, listen, accept, connect, send, recv, close) is the interface between application code and the TCP/UDP stack.

# === TCP Server (Python) ===
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('0.0.0.0', 8080))      # All interfaces; use 127.0.0.1 for local-only testing
    server.listen(5)                    # Backlog = max connections waiting for accept()
    print("TCP server listening on :8080")

    conn, addr = server.accept()        # Blocks until a client connects
    with conn:                          # Closes the connection even if an exception is raised
        print(f"Connection from {addr}")
        data = conn.recv(1024)          # Returns up to 1024 bytes (possibly fewer)
        print(f"Received: {data.decode()}")
        conn.sendall(b"ACK: " + data)   # Echo back with prefix; sendall() sends every byte

# === TCP Client (Python) ===
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
    client.connect(('127.0.0.1', 8080))   # Performs the three-way handshake
    client.sendall(b"Hello from client")
    response = client.recv(1024)
    print(f"Server replied: {response.decode()}")
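
Because TCP is a byte stream, a single recv(1024) may return only part of a message or several messages glued together. One common way to handle this is length-prefixed framing, sketched below. The helper names and the 4-byte big-endian length header are assumptions chosen for illustration, not part of any standard; socket.socketpair() stands in for a real connection.

```python
import socket
import struct

def send_msg(sock, payload: bytes) -> None:
    """Send a 4-byte big-endian length header followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock) -> bytes:
    """Read one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Demo: two messages sent back to back are recovered as two messages.
a, b = socket.socketpair()
with a, b:
    send_msg(a, b"hello")
    send_msg(a, b"world!")
    print(recv_msg(b), recv_msg(b))   # b'hello' b'world!'
```

The loop in recv_exact is the important part: it makes the reader independent of how the kernel happens to segment or coalesce the stream.
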
# View the accept queue of the listening socket
ss -tlnp | grep 8080
# For LISTEN sockets: Recv-Q = connections waiting in the accept queue
#                     Send-Q = maximum backlog size

# View the system-wide max backlog
sysctl net.core.somaxconn        # Default: 4096 (was 128 on older kernels)

# Per-socket options you'll encounter
# SO_REUSEADDR   — allow binding to a port in TIME_WAIT
# SO_KEEPALIVE   — send keepalive probes on idle TCP connections
# TCP_NODELAY    — disable Nagle's algorithm (send immediately, no batching)
# TCP_QUICKACK   — disable delayed ACKs

# View keepalive settings
sysctl net.ipv4.tcp_keepalive_time     # Seconds before first probe (default: 7200 = 2hr)
sysctl net.ipv4.tcp_keepalive_intvl    # Seconds between probes (default: 75)
sysctl net.ipv4.tcp_keepalive_probes   # Number of probes before drop (default: 9)
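
The socket options listed above can also be set per socket from application code. The sketch below assumes illustrative values (60 s idle, 10 s interval, 5 probes) and a hypothetical helper name; the TCP_KEEP* constants are not available on every platform, so each one is applied only if Python's socket module exposes it.

```python
import socket

def tuned_socket():
    """Create a TCP socket with keepalive enabled and Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Per-socket overrides of the system-wide keepalive sysctls.
    # Values are illustrative assumptions, not recommendations.
    for name, value in (("TCP_KEEPIDLE", 60), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 5)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock

sock = tuned_socket()
print("keepalive:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
print("nodelay:  ", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

Setting these per socket is usually preferable to changing the sysctls, since the 2-hour system default suits some connections and not others.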

Exercises

# Exercise 1: Observe TCP states on your system
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Exercise 2: Capture a TCP handshake
# Start the capture first (needs root), then the listener, then the client
sudo tcpdump -i lo -nn 'tcp port 9999' -c 10 &
nc -l 9999 &
sleep 1
echo "test" | nc -w 1 127.0.0.1 9999

# Exercise 3: Measure TIME_WAIT sockets
ss -tan state time-wait | wc -l

# Exercise 4: Check your congestion control algorithm
sysctl net.ipv4.tcp_congestion_control

# Exercise 5: View TCP retransmissions (cumulative since boot; watch how fast it grows, not the absolute value)
grep '^Tcp:' /proc/net/snmp | tail -1 | awk '{print "Retransmits:", $13}'

Conclusion & Next Steps

TCP's reliability comes from the three-way handshake, sequence numbers, ACKs, retransmission, flow control (receive window), and congestion control (cwnd). TIME_WAIT and CLOSE_WAIT are the TCP states you'll debug most often. UDP trades all these guarantees for simplicity and speed, and modern protocols like QUIC build reliability on top of UDP in user space. Socket programming is the universal API for both, and understanding the state machine behind every connect() and close() makes network debugging tractable.