TCP Overview
TCP (Transmission Control Protocol) is a connection-oriented, reliable, ordered, byte-stream protocol that sits at Layer 4 (Transport) of the OSI model. It provides guarantees that the network layer (IP) does not: data arrives in order, nothing is lost, duplicates are eliminated, and the sender doesn't overwhelm the receiver.
Every TCP connection is identified by a 4-tuple: (source IP, source port, destination IP, destination port). A single server IP:port can handle thousands of concurrent connections because each client uses a different source IP:port combination.
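The 4-tuple idea can be observed directly on loopback. The following is a minimal sketch, assuming a Unix-like host with Python 3; the helper name `four_tuples` is made up for illustration. It opens several client connections to one listening port and prints each connection's 4-tuple as seen from the client side.

```python
import socket

def four_tuples(n_clients=2):
    """Open n_clients loopback connections to one listening port and
    return each connection's (src_ip, src_port, dst_ip, dst_port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))       # port 0: let the kernel pick a free port
    server.listen(n_clients)
    server_addr = server.getsockname()

    clients, accepted, tuples = [], [], []
    try:
        for _ in range(n_clients):
            c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            c.connect(server_addr)      # handshake completes via the listen backlog
            conn, _ = server.accept()
            clients.append(c)
            accepted.append(conn)
            # getsockname() is the local (source) end, getpeername() the remote end
            tuples.append(c.getsockname() + c.getpeername())
    finally:
        for s in clients + accepted + [server]:
            s.close()
    return tuples

for t in four_tuples():
    print(t)  # same destination IP:port every time, different source port
```

Every tuple shares the same destination IP and port; only the kernel-assigned source port differs, which is what lets the server tell the connections apart.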
The Three-Way Handshake
Before any data flows, TCP establishes a connection via a three-way handshake. This synchronises sequence numbers and confirms both sides are ready.
sequenceDiagram
participant C as Client
participant S as Server
Note over S: Server calls listen() — enters LISTEN state
C->>S: SYN (seq=x)
Note over C: Client enters SYN_SENT state
S->>C: SYN-ACK (seq=y, ack=x+1)
Note over S: Server enters SYN_RCVD state
C->>S: ACK (ack=y+1)
Note over C,S: Connection ESTABLISHED — data can flow
# Capture the three-way handshake with tcpdump
# Terminal 1: capture TCP packets on port 8080 (start this before the server and client)
sudo tcpdump -i lo -nn 'tcp port 8080 and (tcp[tcpflags] & (tcp-syn|tcp-ack) != 0)' -c 6 &
# Terminal 2: start a simple server
python3 -c "
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('127.0.0.1', 8080))
s.listen(1)
print('Listening on :8080')
conn, addr = s.accept()
print(f'Connection from {addr}')
conn.sendall(b'hello')
conn.close()
s.close()
" &
sleep 1
# Terminal 3: connect as client
python3 -c "
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 8080))
data = s.recv(1024)
print(f'Received: {data}')
s.close()
"
# tcpdump output shows: SYN → SYN-ACK → ACK (the handshake)
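The sequence-number arithmetic in the handshake can also be written out as a small model. This is only a sketch of the bookkeeping, not a packet generator: the function name and the dictionary layout are assumptions made for illustration. Sequence numbers are 32-bit values, so the `+1` wraps around at 2^32.

```python
import random

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around

def three_way_handshake(client_isn=None, server_isn=None):
    """Return the three handshake segments as dicts.

    client_isn (x) and server_isn (y) are the initial sequence numbers;
    random values are used when they are not supplied."""
    x = random.getrandbits(32) if client_isn is None else client_isn
    y = random.getrandbits(32) if server_isn is None else server_isn
    syn = {"flags": "SYN", "seq": x}
    # The SYN consumes one sequence number, so the server acknowledges x + 1.
    syn_ack = {"flags": "SYN-ACK", "seq": y, "ack": (x + 1) % SEQ_MOD}
    # The client acknowledges the server's SYN the same way.
    ack = {"flags": "ACK", "seq": (x + 1) % SEQ_MOD, "ack": (y + 1) % SEQ_MOD}
    return [syn, syn_ack, ack]

for segment in three_way_handshake(client_isn=1000, server_isn=5000):
    print(segment)
```

In a real tcpdump capture with `-nn`, relative sequence numbers are shown by default, so the same pattern appears as seq 0 / ack 1 rather than the raw initial values.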
Connection Teardown
TCP uses a four-way teardown because each direction is closed independently. Either side can initiate the close. The initiator enters TIME_WAIT for 2×MSL (Maximum Segment Lifetime) so that late packets cannot confuse a new connection on the same 4-tuple and so that a lost final ACK can be resent. On Linux the TIME_WAIT period is fixed at 60 seconds.
sequenceDiagram
participant A as Initiator (Active Close)
participant B as Responder (Passive Close)
A->>B: FIN (I'm done sending)
Note over A: FIN_WAIT_1
B->>A: ACK (Got your FIN)
Note over A: FIN_WAIT_2
Note over B: CLOSE_WAIT — app still sending data
B->>A: FIN (I'm done too)
Note over B: LAST_ACK
A->>B: ACK (Got your FIN)
Note over A: TIME_WAIT (2×MSL ≈ 60s)
Note over B: CLOSED
Note over A: CLOSED (after timeout)
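Because each direction closes independently, an application can finish sending while still receiving. The sketch below, assuming a Unix-like host and a made-up helper name `half_close_demo`, uses `shutdown(SHUT_WR)` on the active side to send its FIN early; the passive side sees EOF, is then in CLOSE_WAIT, and can still send a reply before closing.

```python
import socket
import threading

def half_close_demo():
    """Client half-closes after sending; server replies after seeing EOF."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    def responder():
        conn, _ = server.accept()
        with conn:                         # leaving this block sends our FIN
            chunks = []
            while True:
                data = conn.recv(1024)
                if not data:               # b"" means the peer's FIN arrived
                    break
                chunks.append(data)
            # Passive side is now in CLOSE_WAIT and may still send.
            conn.sendall(b"got " + b"".join(chunks))

    t = threading.Thread(target=responder)
    t.start()
    with socket.create_connection(server.getsockname()) as client:
        client.sendall(b"request")
        client.shutdown(socket.SHUT_WR)    # send FIN, keep the read side open
        reply = b""
        while True:
            data = client.recv(1024)
            if not data:                   # server's FIN: both directions closed
                break
            reply += data
    t.join()
    server.close()
    return reply

print(half_close_demo())
```

Running `ss -tan` shortly afterwards should show the client's port in TIME_WAIT, since the client was the side that sent the first FIN.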
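TIME_WAIT sockets can pile up on hosts that open many short-lived outbound connections and can exhaust ephemeral ports. Common mitigations: enable tcp_tw_reuse (allows reusing TIME_WAIT sockets for new outgoing connections), increase the ephemeral port range, or use connection pooling (HTTP keep-alive, database connection pools) to avoid creating and destroying connections frequently.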
# View TCP connection states on your system
ss -tan state time-wait | wc -l # Count TIME_WAIT connections
ss -tan state established | wc -l # Count ESTABLISHED connections
ss -s # Summary of all socket states
# View all states at once
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
# Kernel tunables for TIME_WAIT
sysctl net.ipv4.tcp_tw_reuse # 1 = allow reuse (recommended)
sysctl net.ipv4.ip_local_port_range # Ephemeral port range
sysctl net.ipv4.tcp_fin_timeout # FIN_WAIT_2 timeout (not TIME_WAIT)
sysctl net.ipv4.tcp_max_tw_buckets # Max TIME_WAIT sockets system-wide
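A rough worked estimate shows why TIME_WAIT matters for outbound-heavy hosts. The sketch below assumes the common Linux default ephemeral range of 32768 to 60999 and a 60-second TIME_WAIT; both are assumptions to check against your own `sysctl` output. The limit applies per destination IP:port, because only the source port varies in the 4-tuple.

```python
def max_new_connections_per_second(port_low=32768, port_high=60999, time_wait_s=60):
    """Upper bound on sustained new outbound connections per second to ONE
    destination IP:port, if every closed connection holds its source port
    in TIME_WAIT for time_wait_s seconds and ports are never reused early."""
    ports = port_high - port_low + 1
    return ports / time_wait_s

rate = max_new_connections_per_second()
print(f"about {rate:.0f} new connections per second")   # 28232 ports / 60 s

# Widening the range (hypothetical values) raises the ceiling proportionally.
print(f"about {max_new_connections_per_second(1024, 65535):.0f} with a wider range")
```

This is a back-of-the-envelope bound only; tcp_tw_reuse and connection pooling change the picture by reusing ports or avoiding new connections altogether.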
TCP Connection States
| State | Side | Meaning |
|---|---|---|
| LISTEN | Server | Waiting for incoming SYN (server called listen()) |
| SYN_SENT | Client | SYN sent, waiting for SYN-ACK |
| SYN_RCVD | Server | SYN received, SYN-ACK sent, waiting for ACK |
| ESTABLISHED | Both | Connection open — data flowing |
| FIN_WAIT_1 | Initiator | FIN sent, waiting for ACK |
| FIN_WAIT_2 | Initiator | ACK received for our FIN, waiting for peer's FIN |
| CLOSE_WAIT | Responder | Peer sent FIN, we ACKed — app should call close() |
| LAST_ACK | Responder | We sent our FIN, waiting for final ACK |
| TIME_WAIT | Initiator | Both FINs exchanged — waiting 2×MSL before fully closing |
| CLOSED | Both | Connection fully terminated |
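On Linux the same states are exposed in `/proc/net/tcp` as hexadecimal codes in the `st` column. The following sketch counts sockets per state; the `SAMPLE` text is fabricated for illustration, and the helper name `count_states` is an assumption. The state names are mapped onto the names used in the table above.

```python
from collections import Counter

# Hex codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RCVD",
    "04": "FIN_WAIT_1", "05": "FIN_WAIT_2", "06": "TIME_WAIT",
    "07": "CLOSED", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

# Illustrative sample only (addresses are hex IP:port, here 127.0.0.1:8080).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 11111
   1: 0100007F:1F90 0100007F:D431 01 00000000:00000000 00:00000000 00000000  1000        0 11112
   2: 0100007F:D431 0100007F:1F90 01 00000000:00000000 00:00000000 00000000  1000        0 11113
   3: 0100007F:D432 0100007F:1F90 06 00000000:00000000 03:00000E10 00000000     0        0 0
"""

print(dict(count_states(SAMPLE)))
```

On a Linux host, `count_states(open("/proc/net/tcp").read())` gives the live IPv4 counts; IPv6 sockets are listed separately in `/proc/net/tcp6`.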
CLOSE_WAIT Leak — The Silent Connection Killer
If you see thousands of connections in CLOSE_WAIT, it means the remote side sent FIN (they closed), your application received it, but your code never called close() on the socket. This is a resource leak — each CLOSE_WAIT socket holds a file descriptor and kernel memory. Common causes: missing finally: conn.close() in exception handling, connection pool not properly returning connections, or the application not reading EOF from the socket. Fix: ensure every socket/connection is closed in a try/finally or with block.
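One way to avoid the leak described above is to tie the socket's lifetime to a `with` block and to treat an empty `recv()` as the signal to finish. This is a minimal sketch, assuming a Unix-like host; `tcp_pair` and `handle` are hypothetical helper names, not part of any library.

```python
import socket

def tcp_pair():
    """Return (client, server_side) sockets connected over loopback TCP."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    listener.close()
    return client, conn

def handle(conn):
    """Read until EOF, then always close, even if processing raises."""
    received = b""
    with conn:                        # closes on normal exit and on exceptions
        while True:
            data = conn.recv(4096)
            if not data:              # b"" = peer sent FIN; we are in CLOSE_WAIT
                break
            received += data
    return received                   # conn is closed here, so our FIN was sent

client, conn = tcp_pair()
client.sendall(b"bye")
client.close()                        # remote side closes first
print(handle(conn), conn.fileno())    # fileno() is -1 once the socket is closed
```

If `handle` skipped the `with` block and simply returned after the loop, the server-side socket would stay in CLOSE_WAIT until the object was garbage-collected or the process exited.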
Flow Control
TCP flow control prevents the sender from overwhelming the receiver. Every segment the receiver sends, including each ACK, advertises a receive window (rwnd): the number of bytes the receiver is currently willing to accept. The sender cannot have more than rwnd bytes of unacknowledged data in flight.
# View TCP window sizes on established connections
ss -ti state established | head -30
# Look for: wscale (window scaling factor), rcv_space (receive buffer)
# View and tune receive/send buffer sizes
sysctl net.ipv4.tcp_rmem # min default max (receive buffer)
sysctl net.ipv4.tcp_wmem # min default max (send buffer)
# Typical tcp_rmem default on recent kernels: 4096 131072 6291456 (4KB / 128KB / 6MB); tcp_wmem values differ
# View a specific socket's buffer sizes
ss -tim 'dst 8.8.8.8' 2>/dev/null | head -10
# mem: shows sk_buff memory usage for the socket
# Window scaling (RFC 7323) allows windows up to 1GB
# Enabled by default on modern Linux
sysctl net.ipv4.tcp_window_scaling # 1 = enabled
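Since at most rwnd bytes can be in flight per round trip, the receive window puts a ceiling on throughput of roughly rwnd / RTT. The sketch below works through that arithmetic; the function name and the example RTT of 100 ms are assumptions chosen for illustration, and congestion control is ignored.

```python
def max_throughput_mbps(rwnd_bytes, rtt_seconds):
    """Upper bound on throughput (Mbit/s) when at most rwnd_bytes can be
    unacknowledged per round trip. Ignores congestion control and loss."""
    return rwnd_bytes * 8 / rtt_seconds / 1e6

# Without window scaling the window field tops out at 65535 bytes.
print(f"{max_throughput_mbps(65535, 0.100):.2f} Mbit/s with a 64KB window at 100 ms RTT")

# With scaling and a 6MB buffer the same path allows far more.
print(f"{max_throughput_mbps(6291456, 0.100):.2f} Mbit/s with a 6MB window at 100 ms RTT")
```

This is why window scaling matters on long, fast paths: without it, a 100 ms link is capped at a few Mbit/s regardless of the available bandwidth.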
Congestion Control
Flow control prevents overwhelming the receiver. Congestion control prevents overwhelming the network. TCP maintains a congestion window (cwnd) — the number of bytes the sender can send before receiving an ACK. The actual send window is min(cwnd, rwnd).
flowchart LR
SS["Slow Start\ncwnd doubles each RTT\n(exponential growth)"]
CA["Congestion Avoidance\ncwnd += 1 MSS per RTT\n(linear growth — AIMD)"]
FR["Fast Recovery\ncwnd halved\n(on 3 dup ACKs)"]
TO["Timeout\nssthresh = cwnd/2\nthen cwnd = 1 MSS\n(back to Slow Start)"]
SS -->|cwnd reaches ssthresh| CA
CA -->|3 duplicate ACKs| FR
FR -->|Recovery complete| CA
CA -->|Timeout| TO
SS -->|Timeout| TO
TO --> SS
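The transitions in the diagram can be traced with a toy simulation. This is a simplified Reno-style sketch under stated assumptions: one event per RTT, cwnd counted in whole MSS units, a floor of 2 MSS on ssthresh, and Fast Recovery collapsed into a single step. It is not how any real kernel implements CUBIC or BBR.

```python
def simulate_cwnd(events, ssthresh=8):
    """Return cwnd (in MSS) after each RTT.

    events: list of "ack" (a loss-free RTT), "dup3" (3 duplicate ACKs)
    or "timeout" (retransmission timer fired)."""
    cwnd = 1
    history = []
    for ev in events:
        if ev == "timeout":
            ssthresh = max(cwnd // 2, 2)   # remember half of the old window first
            cwnd = 1                       # then restart from Slow Start
        elif ev == "dup3":
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh                # halve and continue linearly
        elif ev == "ack":
            if cwnd < ssthresh:
                cwnd = min(cwnd * 2, ssthresh)   # Slow Start: double per RTT
            else:
                cwnd += 1                        # Congestion Avoidance: +1 MSS
        else:
            raise ValueError(f"unknown event: {ev}")
        history.append(cwnd)
    return history

events = ["ack"] * 5 + ["dup3"] + ["ack"] * 2 + ["timeout"] + ["ack"] * 3
print(simulate_cwnd(events))
# [2, 4, 8, 9, 10, 5, 6, 7, 1, 2, 3, 4]
```

The trace shows the familiar sawtooth: exponential growth up to ssthresh, linear growth after it, a halving on duplicate ACKs, and a full reset on timeout.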
# View the current congestion control algorithm
sysctl net.ipv4.tcp_congestion_control # Usually "cubic" (default on Linux)
# Available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# cubic reno bbr (if bbr module loaded)
# BBR (Bottleneck Bandwidth and RTT) — Google's algorithm
# Better for high-latency, lossy networks (cloud, WAN)
# sudo sysctl -w net.ipv4.tcp_congestion_control=bbr # Switch to BBR
# View per-connection congestion info
ss -ti state established | grep -E "cwnd|ssthresh|rtt" | head -10
# cwnd: current congestion window (in MSS units)
# ssthresh: slow start threshold
# rtt: round-trip time (smoothed/variance)
# Monitor retransmissions (sign of congestion)
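nstat -az | grep -i retrans          # kernel counters such as TcpRetransSegs (nstat ships with iproute2)
ss -ti | grep -o 'retrans:[0-9/]*'   # per-connection retransmit counters, shown only where retransmits occurred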
UDP — User Datagram Protocol
UDP is connectionless, unreliable, and unordered — and that's its strength. No handshake, no state, no retransmission, no flow or congestion control. Each datagram is independent. This makes UDP faster and simpler for applications that can tolerate or handle loss themselves.
| Feature | TCP | UDP |
|---|---|---|
| Connection | Connection-oriented (handshake) | Connectionless (fire and forget) |
| Reliability | Guaranteed delivery (retransmission) | Best-effort (no retransmission) |
| Ordering | Ordered byte stream | No ordering guarantee |
| Flow control | Receive window | None |
| Congestion control | Slow start, AIMD, etc. | None (application must handle) |
| Overhead | 20-byte header (up to 60 with options) + state | 8-byte header, no state |
| Use cases | HTTP, SSH, SMTP, databases | DNS, NTP, gaming, video streaming, QUIC |
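The contrast with TCP shows up in the API as well: there is no `listen()`, `accept()` or `connect()` step, and each `sendto()` produces one datagram that arrives whole or not at all. The sketch below assumes a loopback interface, where small datagrams are delivered in practice even though UDP itself promises nothing; `udp_roundtrip` is a hypothetical helper name.

```python
import socket

def udp_roundtrip(messages):
    """Send each message as one datagram over loopback and collect what arrives."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))      # port 0: kernel picks a free port
    receiver.settimeout(1.0)             # a lost datagram raises a timeout, no retry
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    received = []
    try:
        for msg in messages:
            sender.sendto(msg, receiver.getsockname())   # no handshake first
        for _ in messages:
            data, _addr = receiver.recvfrom(2048)        # one call = one datagram
            received.append(data)
    finally:
        sender.close()
        receiver.close()
    return received

print(udp_roundtrip([b"hello", b"UDP", b"datagrams"]))
```

Message boundaries are preserved: three `sendto()` calls yield three separate `recvfrom()` results, unlike TCP, where the bytes could be merged or split arbitrarily.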
# Send/receive UDP datagrams with netcat
# Terminal 1: UDP listener
nc -u -l 9999 &
# Terminal 2: Send UDP datagram
echo "hello UDP" | nc -u -w 1 127.0.0.1 9999
# View UDP socket statistics
ss -unap # UDP sockets with process info
ss -s | grep UDP # UDP summary
# DNS uses UDP port 53 (most queries fit in one datagram)
dig +short google.com # This uses UDP by default
dig +tcp google.com # Force TCP for DNS (for large responses)
# View UDP receive/send errors (indicates packet loss)
cat /proc/net/snmp | grep Udp:
# InErrors = packets dropped (buffer full, checksum fail)
# RcvbufErrors = dropped due to receive buffer overflow
QUIC — UDP-Based Reliable Transport
HTTP/3 runs on QUIC, a protocol built on top of UDP that provides TCP-like reliability, TLS 1.3 encryption, and multiplexed streams, typically implemented in user space. Why not just use TCP? Because TCP delivers a single ordered byte stream, one lost packet stalls every stream multiplexed on that connection (head-of-line blocking, as seen with HTTP/2). QUIC tracks ordering per stream, so a lost packet in one stream doesn't block the others. QUIC also supports 0-RTT connection resumption, compared with 2-3 RTTs for a fresh TCP+TLS handshake. Large operators such as Google, Cloudflare, and Meta serve a substantial share of their traffic over QUIC.
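The head-of-line blocking difference can be illustrated with a toy model. The sketch below is not QUIC's wire behaviour; it only compares what a receiver could hand to the application when one packet is missing, under the assumption that TCP enforces one global order while a QUIC-like transport enforces order per stream. The function names and packet tuples are made up for illustration.

```python
def deliverable_tcp(packets, lost):
    """Single ordered byte stream: nothing after the first gap is delivered."""
    out = []
    for seq, _stream, data in packets:
        if seq in lost:
            break                     # everything behind the gap must wait
        out.append(data)
    return out

def deliverable_per_stream(packets, lost):
    """Per-stream ordering: only the stream that lost a packet is blocked."""
    blocked = set()
    out = []
    for seq, stream, data in packets:
        if seq in lost:
            blocked.add(stream)       # this stream waits for a retransmission
        elif stream not in blocked:
            out.append(data)
    return out

# (packet number, stream id, payload); packet 2 is lost in transit.
packets = [(1, "A", "a1"), (2, "B", "b1"), (3, "A", "a2"),
           (4, "B", "b2"), (5, "C", "c1")]
lost = {2}

print(deliverable_tcp(packets, lost))         # ['a1']
print(deliverable_per_stream(packets, lost))  # ['a1', 'a2', 'c1']
```

Streams A and C make progress in the per-stream model while B waits; in the single-stream model everything behind the lost packet is held back.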
Socket Programming
A socket is the kernel's abstraction for a network endpoint. The socket API (socket, bind, listen, accept, connect, send, recv, close) is the interface between application code and the TCP/UDP stack.
# === TCP Server (Python) ===
import socket
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 8080))
server.listen(5) # Backlog = max pending connections
print("TCP server listening on :8080")
conn, addr = server.accept() # Blocks until client connects
print(f"Connection from {addr}")
data = conn.recv(1024) # Read up to 1024 bytes
print(f"Received: {data.decode()}")
conn.sendall(b"ACK: " + data) # Echo back with prefix (sendall retries until every byte is sent)
conn.close()
server.close()
# === TCP Client (Python) ===
import socket
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', 8080))
client.sendall(b"Hello from client")
response = client.recv(1024)
print(f"Server replied: {response.decode()}")
client.close()
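Because TCP is a byte stream, a single `recv(1024)` may return only part of a message or several messages glued together. A common way to recover message boundaries is a length prefix. The sketch below is one possible framing scheme, not part of the socket API: the 4-byte big-endian header and the helper names `send_msg`, `recv_exact` and `recv_msg` are assumptions for illustration. It uses `socket.socketpair()` only to get two connected stream sockets in one process.

```python
import socket
import struct

def send_msg(sock, payload):
    """Send one message: 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    """Loop on recv() until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:                 # peer closed before the message was complete
            raise ConnectionError("peer closed before full message arrived")
        buf += chunk
    return bytes(buf)

def recv_msg(sock):
    """Receive one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

a, b = socket.socketpair()            # two connected stream sockets
send_msg(a, b"Hello")
send_msg(a, b"from client")
print(recv_msg(b), recv_msg(b))       # two separate messages, boundaries intact
a.close()
b.close()
```

The same three helpers work unchanged on the `conn` and `client` sockets from the server and client snippets above.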
# View a listening socket's accept queue (the SYN queue is not shown here)
ss -tlnp | grep 8080
# Recv-Q = connections waiting in accept queue
# Send-Q = maximum backlog size
# View the system-wide max backlog
sysctl net.core.somaxconn # Default: 4096 (was 128 on older kernels)
# Per-socket options you'll encounter
# SO_REUSEADDR — allow binding to a port in TIME_WAIT
# SO_KEEPALIVE — send keepalive probes on idle TCP connections
# TCP_NODELAY — disable Nagle's algorithm (send immediately, no batching)
# TCP_QUICKACK — disable delayed ACKs
# View keepalive settings
sysctl net.ipv4.tcp_keepalive_time # Seconds before first probe (default: 7200 = 2hr)
sysctl net.ipv4.tcp_keepalive_intvl # Seconds between probes (default: 75)
sysctl net.ipv4.tcp_keepalive_probes # Number of probes before drop (default: 9)
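The per-socket options listed above can be set from application code with `setsockopt()`. The sketch below assumes Python 3; the helper name `configure` and the timing values are illustrative. `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are per-socket overrides of the three sysctls and are not available on every platform, so they are guarded with `hasattr`.

```python
import socket

def configure(sock, idle=60, interval=10, probes=5):
    """Enable keepalive and disable Nagle's algorithm on a TCP socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Per-socket keepalive timing (Linux and some other systems only).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # failed probes before drop
    return sock

s = configure(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print("keepalive on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
print("nodelay on:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
s.close()
```

Setting the timing per socket is usually preferable to changing the system-wide sysctls, since the 2-hour default is often too long for connections that cross NAT devices or load balancers.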
Exercises
# Exercise 1: Observe TCP states on your system
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
# Exercise 2: Capture a TCP handshake
# Terminal 1: start the capture first (needs root)
sudo tcpdump -i lo -nn 'tcp port 9999' -c 10 &
# Terminal 2: start a listener
nc -l 9999 &
# Terminal 3: connect as a client
echo "test" | nc -w 1 127.0.0.1 9999
# Exercise 3: Measure TIME_WAIT sockets
ss -tan state time-wait | wc -l
# Exercise 4: Check your congestion control algorithm
sysctl net.ipv4.tcp_congestion_control
# Exercise 5: View TCP retransmissions (cumulative since boot; watch how fast it grows rather than expecting 0)
awk '/^Tcp:/ && $2 ~ /^[0-9]/ {print "Retransmits:", $13}' /proc/net/snmp
Conclusion & Next Steps
TCP's reliability comes from the three-way handshake, sequence numbers, ACKs, retransmission, flow control (receive window), and congestion control (cwnd). TIME_WAIT and CLOSE_WAIT are the TCP states you'll debug most often, along with SYN floods, which show up as a pile of sockets in SYN_RCVD. UDP trades all these guarantees for simplicity and speed, and modern protocols like QUIC build reliability on top of UDP in user space. The socket API is the common interface to both, and understanding the state machine behind every connect() and close() makes network debugging tractable.