BGP — The Internet's Control Plane
BGP (Border Gateway Protocol) is the control plane protocol of the internet. It is the mechanism by which roughly 75,000 autonomous systems (ASes), each operated by a different organization, agree on how to reach every network prefix on the planet. Without BGP, the internet would be a collection of isolated networks.
Key BGP characteristics as a control plane protocol:
- Inter-AS routing — Operates between autonomous systems (organizations), not within them
- Policy-driven — Routing decisions based on business relationships (customer, peer, provider), not just shortest path
- Path-vector — Carries the full AS path to each destination, enabling loop detection and policy filtering
- TCP-based — Runs over TCP port 179, ensuring reliable delivery of routing updates
- Incremental updates — After the initial full-table exchange when a session comes up, only changes (announcements and withdrawals) are sent
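The loop-detection property of a path-vector protocol can be illustrated with a short sketch. The function names and AS numbers are hypothetical and chosen for illustration. The rule itself is standard eBGP behaviour: a router prepends its own AS when advertising to an external peer, and it discards any update whose AS path already contains its own AS.

```python
def accept_update(local_as: int, as_path: list[int]) -> bool:
    """Reject an update whose AS path already contains our own AS (a loop)."""
    return local_as not in as_path


def advertise(local_as: int, as_path: list[int]) -> list[int]:
    """Prepend the local AS before sending the route to an eBGP neighbor."""
    return [local_as, *as_path]


path_from_peer = [65002, 65003]                # learned by AS 65001
print(accept_update(65001, path_from_peer))    # True
sent = advertise(65001, path_from_peer)
print(sent)                                    # [65001, 65002, 65003]
print(accept_update(65003, sent))              # False: 65003 sees itself in the path
```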
BGP Path Selection Algorithm
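When BGP receives multiple paths to the same prefix, it applies a strict, ordered decision process to select the single best path that gets installed in the RIB. The flowchart shows the main steps in simplified form; vendor implementations add further steps (Cisco, for example, checks its proprietary weight attribute before local preference):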
flowchart TD
A[Multiple Paths\nto Same Prefix] --> B{Highest\nLocal Preference?}
B -->|Tie| C{Shortest\nAS Path?}
B -->|Winner| Z[Install in RIB]
C -->|Tie| D{Lowest\nOrigin Type?}
C -->|Winner| Z
D -->|Tie| E{Lowest\nMED?}
D -->|Winner| Z
E -->|Tie| F{eBGP over\niBGP?}
E -->|Winner| Z
F -->|Tie| G{Lowest\nIGP Metric\nto Next-Hop?}
F -->|Winner| Z
G -->|Tie| H{Oldest\nRoute?}
G -->|Winner| Z
H -->|Tie| I{Lowest\nRouter ID?}
H -->|Winner| Z
I --> Z
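The decision process in the flowchart amounts to comparing paths on an ordered tuple of attributes. The sketch below is one way to express that, not a vendor implementation. The `Path` class and its field names are assumptions made for this example. It also simplifies real behaviour: MED is compared across all paths here, whereas routers by default compare MED only between paths from the same neighboring AS.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}  # lower is preferred


@dataclass(frozen=True)
class Path:
    local_pref: int
    as_path: tuple[int, ...]
    origin: str
    med: int
    is_ebgp: bool
    igp_metric: int
    age_seconds: int
    router_id: str


def preference_key(p: Path) -> tuple:
    """Smaller tuple wins; each element mirrors one step of the flowchart."""
    return (
        -p.local_pref,                   # highest local preference
        len(p.as_path),                  # shortest AS path
        ORIGIN_RANK[p.origin],           # lowest origin type
        p.med,                           # lowest MED
        0 if p.is_ebgp else 1,           # eBGP over iBGP
        p.igp_metric,                    # lowest IGP metric to next hop
        -p.age_seconds,                  # oldest route
        int(IPv4Address(p.router_id)),   # lowest router ID
    )


def best_path(paths: list[Path]) -> Path:
    return min(paths, key=preference_key)


short = Path(100, (65002, 65010), "igp", 0, True, 10, 500, "10.0.0.1")
preferred = Path(200, (65003, 65004, 65010), "igp", 0, True, 10, 100, "10.0.0.2")
# Local preference is evaluated first, so the longer path wins here.
print(best_path([short, preferred]) == preferred)  # True
```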
! BGP neighbor configuration (Cisco IOS)
! Establishing BGP peering sessions is pure control plane activity
router bgp 65001
 ! 65001 is this router's own AS number
 bgp router-id 1.1.1.1
 ! eBGP neighbor (external, different AS), peered over the directly connected link
 neighbor 203.0.113.1 remote-as 65002
 neighbor 203.0.113.1 description "Upstream ISP - Transit Provider"
 ! Placeholder secret; enables TCP MD5 authentication of the session
 neighbor 203.0.113.1 password SECRET123
 ! iBGP neighbor (internal, same AS), sourced from the loopback
 neighbor 10.0.0.2 remote-as 65001
 neighbor 10.0.0.2 description "iBGP peer - Core Router 2"
 neighbor 10.0.0.2 update-source Loopback0
 ! Address family configuration
 address-family ipv4 unicast
  neighbor 203.0.113.1 activate
  neighbor 10.0.0.2 activate
  neighbor 10.0.0.2 next-hop-self
  ! Advertise our networks to the world
  network 198.51.100.0 mask 255.255.255.0
  network 198.51.101.0 mask 255.255.255.0
  ! Apply route-map policy to neighbor
  neighbor 203.0.113.1 route-map UPSTREAM-IN in
  neighbor 203.0.113.1 route-map UPSTREAM-OUT out
 exit-address-family
! Prefix list referenced by UPSTREAM-OUT
ip prefix-list OUR-PREFIXES seq 5 permit 198.51.100.0/24
ip prefix-list OUR-PREFIXES seq 10 permit 198.51.101.0/24
! Route-map for policy enforcement
route-map UPSTREAM-IN permit 10
 ! Set local preference for routes from this provider (100 is the default)
 set local-preference 100
 ! Tag as "learned from transit provider" for downstream policy
 set community 65001:300 additive
route-map UPSTREAM-OUT permit 10
 ! Only advertise our own prefixes upstream
 match ip address prefix-list OUR-PREFIXES
Route Reflectors & Communities
Route Reflectors
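A router does not re-advertise a route learned from one iBGP peer to another iBGP peer, so by default every iBGP router must peer with every other router (full mesh) for all routers to see all routes. With N routers, that is N×(N-1)/2 sessions; with 100 routers, 4,950 sessions. Route reflectors remove this requirement: a reflector is allowed to pass iBGP-learned routes on to its clients, so it acts as a centralized route distribution point, a mini control plane within the control plane.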
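The session arithmetic is easy to check. The sketch below compares a full mesh with a route-reflector design under one assumed layout: every client peers with each reflector, and the reflectors are fully meshed among themselves. The function names and the default of two reflectors are illustrative choices.

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of iBGP sessions in a full mesh: N * (N - 1) / 2."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers: int, reflectors: int = 2) -> int:
    """Clients peer with every reflector; reflectors are meshed with each other."""
    clients = routers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


print(full_mesh_sessions(100))         # 4950
print(route_reflector_sessions(100))   # 197
```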
BGP Communities
Communities are tags attached to routes that enable policy signaling between ASes. They're the "metadata" of the control plane:
- 65001:100 — "Learned from customer" (prefer highly)
- 65001:200 — "Learned from peer" (normal preference)
- 65001:300 — "Learned from transit provider" (least preferred)
- NO_EXPORT — "Don't advertise outside this AS"
- NO_ADVERTISE — "Don't advertise to any peer"
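On the wire, a standard community is a single 32-bit value: the upper 16 bits conventionally hold an AS number and the lower 16 bits an operator-defined value. The sketch below converts between the `ASN:value` notation and that integer. The helper names are hypothetical; the two well-known values are the standard reserved ones.

```python
WELL_KNOWN = {"NO_EXPORT": 0xFFFFFF01, "NO_ADVERTISE": 0xFFFFFF02}


def encode_community(text: str) -> int:
    """Convert 'ASN:value' or a well-known name to its 32-bit representation."""
    if text in WELL_KNOWN:
        return WELL_KNOWN[text]
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError(f"both halves must fit in 16 bits: {text}")
    return (asn << 16) | value


def decode_community(raw: int) -> str:
    """Convert a 32-bit community back to its readable form."""
    for name, number in WELL_KNOWN.items():
        if number == raw:
            return name
    return f"{raw >> 16}:{raw & 0xFFFF}"


print(hex(encode_community("65001:100")))   # 0xfde90064
print(decode_community(0xFFFFFF01))         # NO_EXPORT
```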
OSPF — Intra-Domain Link-State
While BGP handles inter-AS routing, OSPF (Open Shortest Path First) handles routing within a single organization. It is a link-state protocol: every router maintains a complete topology map of its area and independently computes shortest paths using Dijkstra's SPF algorithm.
BGP vs OSPF — Different Control Plane Roles
BGP is a policy protocol — it selects paths based on business relationships and operator preferences. It converges slowly (seconds to minutes) but handles 950K+ prefixes globally. OSPF is an optimization protocol — it finds the mathematically shortest path through a network. It converges fast (sub-second with tuning) but operates within a single administrative domain. Together, they populate the RIB that becomes the FIB for the data plane.
OSPF Areas & SPF Algorithm
flowchart TB
subgraph AREA0["Area 0 (Backbone)"]
ABR1[ABR 1\nArea Border Router]
ABR2[ABR 2\nArea Border Router]
BBR1[Backbone\nRouter 1]
BBR2[Backbone\nRouter 2]
ABR1 --- BBR1
BBR1 --- BBR2
BBR2 --- ABR2
ABR1 --- ABR2
end
subgraph AREA1["Area 1 (Engineering)"]
R1[Router 1]
R2[Router 2]
R3[Router 3]
R1 --- R2
R2 --- R3
end
subgraph AREA2["Area 2 (Data Center)"]
R4[Router 4]
R5[Router 5]
R6[Router 6]
R4 --- R5
R5 --- R6
end
ABR1 --- R1
ABR2 --- R4
OSPF divides large networks into areas to limit the scope of the link-state database and SPF calculations:
- Area 0 (Backbone) — All areas must connect to Area 0. Inter-area traffic transits through it
- Regular areas — Contain a complete LSDB of their own topology, receive summarized routes from other areas
- ABRs (Area Border Routers) — Sit between areas, keep a separate LSDB per attached area, and advertise (optionally summarized) routes from one area into another
- DR/BDR (Designated/Backup DR) — Elected on multi-access segments to reduce flooding overhead
! OSPF area configuration (Cisco IOS)
! Each area maintains its own link-state database
router ospf 1
 router-id 1.1.1.1
 ! Interfaces in Area 0 (backbone)
 network 10.0.0.0 0.0.0.255 area 0
 network 10.0.1.0 0.0.0.255 area 0
 ! Interfaces in Area 1
 network 10.1.0.0 0.0.255.255 area 1
 ! Summarize Area 1 routes at the ABR
 area 1 range 10.1.0.0 255.255.0.0
 ! Fast convergence tuning
 timers throttle spf 50 200 5000
 ! Initial SPF delay: 50ms
 ! Min hold between SPF runs: 200ms
 ! Max hold: 5000ms
! Sub-second failure detection on the interface
interface GigabitEthernet0/0
 ! Option 1: BFD (timers are an assumption: 50ms x 3, roughly 150ms detection)
 bfd interval 50 min_rx 50 multiplier 3
 ip ospf bfd
 ! Option 2: OSPF fast hellos
 ip ospf dead-interval minimal hello-multiplier 4
 ! Dead interval = 1 second (4 x 250ms hellos)
SPF Algorithm (Dijkstra's)
Each OSPF router runs SPF independently on its link-state database to compute a shortest-path tree rooted at itself. The output is a set of (destination, next-hop, cost) tuples that are installed in the RIB. When a link fails:
- Adjacent router detects failure (dead timer expires or BFD triggers)
- Router floods an updated LSA (Link-State Advertisement) into the area
- All routers in the area receive the LSA and update their LSDB
- All routers independently re-run SPF on the updated topology
- New best routes are installed in RIB → pushed to FIB → data plane adapts
IS-IS — ISP Backbone Protocol
IS-IS (Intermediate System to Intermediate System) is functionally similar to OSPF — both are link-state protocols that use SPF. IS-IS is preferred by many large ISPs because:
- Protocol-agnostic — Runs directly on Layer 2 (not IP), so it works even when IP is misconfigured
- TLV extensibility — Type-Length-Value encoding makes it easy to add new features without protocol redesign
- Proven at massive scale — Backbone networks with thousands of routers
- Simpler area design — Level 1 (intra-area) and Level 2 (inter-area) with fewer restrictions than OSPF
- Multi-topology support — Can run different topologies for IPv4 and IPv6 simultaneously
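The extensibility argument is easiest to see in a parser. IS-IS TLVs use a one-byte type and a one-byte length, so a receiver can step over any type it does not understand. The sketch below parses a hand-built byte string. Types 1 (Area Addresses) and 137 (Dynamic Hostname) are real IS-IS TLV codes; type 250 is used here only as a stand-in for a type this parser does not know, and the function names are hypothetical.

```python
KNOWN_TYPES = {1: "Area Addresses", 137: "Dynamic Hostname"}


def parse_tlvs(data: bytes) -> list[tuple[int, bytes]]:
    """Split a byte string into (type, value) pairs using 1-byte type and length."""
    tlvs = []
    offset = 0
    while offset + 2 <= len(data):
        tlv_type, length = data[offset], data[offset + 1]
        value = data[offset + 2 : offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated TLV")
        tlvs.append((tlv_type, value))
        offset += 2 + length
    return tlvs


def describe(tlvs: list[tuple[int, bytes]]) -> list[str]:
    """Name the TLVs we understand and skip over the rest without failing."""
    return [KNOWN_TYPES.get(t, f"unknown type {t} (ignored)") for t, _ in tlvs]


packet = (
    bytes([1, 4, 3, 0x49, 0x00, 0x01])   # Area Addresses: one 3-byte area, 49.0001
    + bytes([137, 2]) + b"R1"            # Dynamic Hostname: "R1"
    + bytes([250, 1, 0xFF])              # a type this parser has never seen
)
for line in describe(parse_tlvs(packet)):
    print(line)
```

Because the length field tells the parser how far to skip, a newer router can add a TLV without breaking older neighbors.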
From Protocols to Forwarding
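Multiple routing protocols may provide routes to the same destination. The router selects among them using administrative distance, a protocol trustworthiness ranking in which the lowest value wins. The values below are Cisco defaults; other vendors use different numbers: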
| Protocol | Admin Distance | Role |
|---|---|---|
| Connected | 0 | Directly attached networks |
| Static | 1 | Manually configured routes |
| eBGP | 20 | External BGP (inter-AS) |
| OSPF | 110 | Internal link-state |
| IS-IS | 115 | Internal link-state |
| iBGP | 200 | Internal BGP (same AS) |
After administrative distance selects the protocol, the winning route enters the RIB. From there, it's programmed into the FIB — the hardware forwarding table used by the data plane for every packet decision.
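A minimal sketch of that selection step follows, using the distances from the table. The tuple layout and function name are assumptions made for this example, and it considers only candidates for one identical prefix.

```python
ADMIN_DISTANCE = {
    "connected": 0,
    "static": 1,
    "ebgp": 20,
    "ospf": 110,
    "isis": 115,
    "ibgp": 200,
}


def select_for_rib(candidates: list[tuple[str, str]]) -> tuple[str, str]:
    """Pick the (protocol, next_hop) candidate with the lowest administrative distance."""
    return min(candidates, key=lambda candidate: ADMIN_DISTANCE[candidate[0]])


candidates = [("ibgp", "10.0.0.2"), ("ospf", "10.0.0.1")]
print(select_for_rib(candidates))                                 # ('ospf', '10.0.0.1')
print(select_for_rib(candidates + [("ebgp", "203.0.113.1")]))     # ('ebgp', '203.0.113.1')
```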
Convergence & Route Stability
flowchart LR
A[Link Failure\nOccurs] -->|"Detection\n(BFD: 50ms\nHellos: 30-40s)"| B[Failure\nDetected]
B -->|"Flooding\n(10-100ms per hop)"| C[All Routers\nNotified]
C -->|"SPF Calculation\n(1-50ms)"| D[New Routes\nComputed]
D -->|"RIB Update\n(1-10ms)"| E[RIB\nUpdated]
E -->|"FIB Programming\n(10-100ms)"| F[Data Plane\nConverged]
style A fill:#BF092F,color:#fff
style F fill:#3B9797,color:#fff
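Adding up the stages in the diagram shows why failure detection dominates the budget. The sketch below sums the per-stage ranges from the flowchart for a given detection time and number of flooding hops. The three-hop figure and the 40-second hello-based detection time are assumptions picked for the example.

```python
# (min_ms, max_ms) per stage, taken from the flowchart
STAGES_MS = {
    "flooding_per_hop": (10, 100),
    "spf": (1, 50),
    "rib_update": (1, 10),
    "fib_programming": (10, 100),
}


def convergence_range_ms(detection_ms: int, flooding_hops: int) -> tuple[int, int]:
    """Return the (best case, worst case) end-to-end convergence time in ms."""
    low = detection_ms + flooding_hops * STAGES_MS["flooding_per_hop"][0]
    high = detection_ms + flooding_hops * STAGES_MS["flooding_per_hop"][1]
    for stage in ("spf", "rib_update", "fib_programming"):
        low += STAGES_MS[stage][0]
        high += STAGES_MS[stage][1]
    return low, high


print(convergence_range_ms(50, 3))        # (92, 510) with BFD detection
print(convergence_range_ms(40_000, 3))    # (40042, 40460) with hello-based detection
```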
Route Flapping and Dampening
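When a link oscillates between up and down rapidly (flapping), each state change triggers BGP withdrawals and re-announcements that propagate across the internet. Route dampening suppresses flapping routes by penalizing instability: each flap adds a fixed penalty to the route, and the accumulated penalty decays exponentially over time. Once the penalty exceeds a suppress threshold, the route is no longer used or advertised; it is reinstated only after the penalty has decayed below a lower reuse threshold.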
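The penalty arithmetic behind dampening can be sketched in a few lines. The constants below (1000 per flap, suppress at 2000, reuse at 750, 15-minute half-life) are commonly cited vendor defaults and are treated here as assumptions; real deployments tune them. The function names are hypothetical.

```python
import math

PENALTY_PER_FLAP = 1000
SUPPRESS_LIMIT = 2000
REUSE_LIMIT = 750
HALF_LIFE_MIN = 15


def decayed(penalty: float, minutes: float) -> float:
    """Penalty halves every HALF_LIFE_MIN minutes of stability."""
    return penalty * 0.5 ** (minutes / HALF_LIFE_MIN)


def is_suppressed(penalty: float) -> bool:
    return penalty > SUPPRESS_LIMIT


def minutes_until_reuse(penalty: float) -> float:
    """Time for the penalty to decay to the reuse limit (0 if already below it)."""
    if penalty <= REUSE_LIMIT:
        return 0.0
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_LIMIT)


penalty = 3 * PENALTY_PER_FLAP            # three flaps in quick succession
print(is_suppressed(penalty))             # True
print(minutes_until_reuse(penalty))       # 30.0
print(decayed(penalty, 30))               # 750.0
```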
BGP Security — Hijacking & RPKI
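BGP was designed in an era of trust. Sessions between neighbors can be authenticated, but the protocol has no built-in way to verify that an AS is actually authorized to announce a given prefix, so any AS can announce any prefix. This makes BGP hijacking a control plane attack with devastating data plane consequences: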
BGP Hijacking — A Control Plane Attack
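In a BGP hijack, an attacker announces someone else's IP prefix from their own AS. Because BGP has no native origin validation, other routers may accept the malicious announcement and route traffic to the attacker. Notable incidents: in 2008, Pakistan Telecom announced a more specific route for part of YouTube's address space while implementing a domestic block, and the leaked announcement made YouTube unreachable for much of the internet for about two hours. China Telecom has been accused of routing US traffic through China. The main countermeasure is RPKI (Resource Public Key Infrastructure): address holders publish cryptographically signed Route Origin Authorizations (ROAs) stating which AS may originate a prefix, so routers can reject announcements from an unauthorized origin. RPKI origin validation checks only the originating AS, not the rest of the AS path.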
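Route origin validation reduces to a small lookup. A route is "valid" if some covering ROA matches its origin AS and its prefix length does not exceed the ROA's maximum length. It is "invalid" if ROAs cover it but none match, and "not-found" if no ROA covers it. The sketch below follows that logic with Python's `ipaddress` module. The `ROA` tuple, the documentation prefixes, and the private AS numbers are illustrative assumptions, not real registry data.

```python
from ipaddress import ip_network
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: str
    max_length: int
    asn: int


def validate(prefix: str, origin_as: int, roas: list[ROA]) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if route.prefixlen <= roa.max_length and origin_as == roa.asn:
            return "valid"
    return "invalid" if covered else "not-found"


roas = [ROA("203.0.113.0/24", max_length=24, asn=65001)]
print(validate("203.0.113.0/24", 65001, roas))     # valid
print(validate("203.0.113.0/24", 64666, roas))     # invalid: wrong origin AS
print(validate("203.0.113.128/25", 65001, roas))   # invalid: more specific than max_length
print(validate("192.0.2.0/24", 65001, roas))       # not-found: no covering ROA
```

The third case shows why the maximum length matters: without it, a hijacker could win simply by announcing a more specific prefix.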
Modern BGP in Data Centers
BGP has evolved far beyond its original internet routing role. Many modern data centers run eBGP as the only routing protocol in Clos (leaf-spine) fabrics:
flowchart TB
subgraph SPINE["Spine Layer (eBGP)"]
S1[Spine 1\nAS 65100]
S2[Spine 2\nAS 65200]
S3[Spine 3\nAS 65300]
end
subgraph LEAF["Leaf Layer (eBGP)"]
L1[Leaf 1\nAS 65001]
L2[Leaf 2\nAS 65002]
L3[Leaf 3\nAS 65003]
L4[Leaf 4\nAS 65004]
end
subgraph SERVERS["Servers"]
SV1[Servers]
SV2[Servers]
SV3[Servers]
SV4[Servers]
end
L1 --- S1
L1 --- S2
L1 --- S3
L2 --- S1
L2 --- S2
L2 --- S3
L3 --- S1
L3 --- S2
L3 --- S3
L4 --- S1
L4 --- S2
L4 --- S3
SV1 --- L1
SV2 --- L2
SV3 --- L3
SV4 --- L4
Why BGP in the data center (instead of OSPF):
- Every link is eBGP — Each switch gets its own AS number, making every link an inter-AS link
- ECMP load balancing — Multiple equal-cost paths through different spines
- No SPF storms — Link failures don't trigger network-wide recalculation
- Policy at every hop — Fine-grained traffic engineering capabilities
- Proven at massive scale — Hyperscale operators such as Microsoft and Facebook have publicly described eBGP-based Clos fabrics, and the design is documented in RFC 7938
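ECMP across the spines is usually done per flow rather than per packet, so that packets of one connection stay in order. The sketch below illustrates the idea by hashing a flow's 5-tuple to choose a spine. Real switches use hardware hash functions; SHA-256 here is only a stand-in, and the spine names and flow values are hypothetical.

```python
import hashlib

SPINES = ["spine1", "spine2", "spine3"]


def pick_next_hop(flow: tuple, next_hops: list[str]) -> str:
    """Map a flow (src IP, dst IP, protocol, src port, dst port) to one next hop."""
    key = "|".join(str(field) for field in flow).encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return next_hops[bucket % len(next_hops)]


flow = ("10.1.1.10", "10.2.2.20", 6, 49152, 443)
# The same flow always hashes to the same spine while the next-hop set is unchanged.
print(pick_next_hop(flow, SPINES) == pick_next_hop(flow, SPINES))   # True
# If a spine is withdrawn, the flow is rehashed over the remaining ones.
print(pick_next_hop(flow, SPINES[:2]) in SPINES[:2])                # True
```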
BGP for Kubernetes
BGP is increasingly used to advertise Kubernetes service IPs and pod CIDRs to the physical network:
# MetalLB BGP configuration for Kubernetes
# Advertises LoadBalancer service IPs via BGP to the network fabric
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: leaf-switch-peer
  namespace: metallb-system
spec:
  myASN: 65010 # Kubernetes cluster's AS
  peerASN: 65001 # Leaf switch's AS
  peerAddress: 10.0.0.1 # Leaf switch IP
  holdTime: 90s
  keepaliveTime: 30s
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 198.51.100.0/24 # IPs to assign to LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
  communities:
    - 65001:100 # Tag as "internal service"
# Verify BGP sessions from a Kubernetes perspective
# Calico uses BGP to distribute pod network routes
# Check BGP peering status (Calico)
calicoctl node status
# Sample output:
# IPv4 BGP status
# +--------------+-------------------+-------+----------+-------------+
# | PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
# +--------------+-------------------+-------+----------+-------------+
# | 10.0.0.1 | node-to-node mesh | up | 08:15:00 | Established |
# | 10.0.0.2 | node-to-node mesh | up | 08:15:01 | Established |
# | 10.0.0.3 | node-to-node mesh | up | 08:15:01 | Established |
# +--------------+-------------------+-------+----------+-------------+
# Check BGP summary on the physical leaf switch
show bgp summary
# Sample output:
# Neighbor AS MsgRcvd MsgSent Up/Down State/PfxRcd
# 10.0.0.10 65010 1205 1198 2d03h 48
# 10.0.0.11 65010 1180 1175 2d03h 48
# 10.0.0.12 65010 1195 1190 2d03h 48
# Total number of neighbors: 3, Prefixes received: 144