Storage Control vs Data Plane
Every distributed storage system faces the same fundamental challenge: managing the metadata (where things are, how they're replicated, their consistency state) separately from the data operations (actual reads and writes of bytes). This maps cleanly to the control/data plane pattern.
flowchart TB
subgraph CP["Storage Control Plane"]
META["Metadata Service\n(namespace, directory tree)"]
PLACE["Placement Engine\n(where to store)"]
REPL["Replication Manager\n(copy orchestration)"]
HEALTH["Health Monitor\n(failure detection)"]
META --> PLACE
PLACE --> REPL
HEALTH --> PLACE
end
subgraph DP["Storage Data Plane"]
WRITE["Write Path\n(client → storage nodes)"]
READ["Read Path\n(storage nodes → client)"]
XFER["Data Transfer\n(replication traffic)"]
VERIFY["Integrity Check\n(checksums)"]
end
CP -->|"Placement map"| DP
DP -->|"Health status"| CP
Ceph Architecture
Ceph is the canonical example of control/data plane separation in open-source storage. Its control plane (MON + MGR + MDS) handles cluster state and metadata, while its data plane (OSDs) handles actual object storage and retrieval.
Ceph Control Plane Components
- MON (Monitor) — maintains cluster map (OSD map, MON map, PG map, CRUSH map); Paxos consensus for consistency
- MGR (Manager) — provides monitoring, orchestration, and cluster management interfaces
- MDS (Metadata Server) — manages filesystem namespace for CephFS (POSIX metadata: directories, permissions, timestamps)
- CRUSH algorithm — deterministic placement algorithm; clients can compute object location without querying a central directory
Ceph Data Plane Components
- OSD (Object Storage Daemon) — typically one per storage device; handles actual reads/writes, replication, recovery, rebalancing
- Placement Groups (PGs) — logical grouping of objects mapped to OSDs via CRUSH; enables efficient rebalancing
flowchart TB
CLIENT["Client\n(librbd / CephFS / RGW)"]
subgraph CP["Control Plane"]
MON1["MON 1"]
MON2["MON 2"]
MON3["MON 3"]
MGR["MGR\n(Dashboard, Metrics)"]
MDS["MDS\n(CephFS metadata)"]
MON1 <--> MON2
MON2 <--> MON3
end
subgraph DP["Data Plane (OSDs)"]
OSD1["OSD.0\n/dev/sda"]
OSD2["OSD.1\n/dev/sdb"]
OSD3["OSD.2\n/dev/sdc"]
OSD4["OSD.3\n/dev/sdd"]
end
CLIENT -->|"1. Get cluster map"| MON1
CLIENT -->|"2. CRUSH compute"| CLIENT
CLIENT -->|"3. Direct I/O"| OSD1
OSD1 -->|"Replicate"| OSD2
OSD1 -->|"Replicate"| OSD3
OSD1 -.->|"Heartbeat"| MON1
CRUSH — The Algorithm That Eliminates the Metadata Bottleneck
Ceph's CRUSH (Controlled Replication Under Scalable Hashing) algorithm is a deterministic placement function: given an object name and the cluster map, ANY client can independently compute which OSDs store that object. This means the control plane (MONs) doesn't need to be consulted for every I/O operation. Clients fetch the cluster map, refresh it only when the map changes (each map revision carries an epoch number), and otherwise go directly to OSDs. This is why Ceph scales: the data plane operates independently of the control plane for normal operations.
# Check Ceph cluster health (control plane status)
ceph status
# View the CRUSH map (placement rules — control plane)
ceph osd crush dump | head -50
# Check OSD status (data plane nodes)
ceph osd tree
# View placement group distribution
ceph pg stat
# Monitor OSD performance (data plane metrics)
ceph osd perf
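The "any client can compute placement" idea can be sketched in a few lines. The sketch below is not CRUSH itself: it swaps in rendezvous (highest-random-weight) hashing as a simpler stand-in and ignores failure domains and device weights. The names `stable_hash`, `object_to_pg`, `pg_to_osds`, `locate`, and the shape of `cluster_map` are illustrative assumptions, not Ceph APIs.

```python
import hashlib


def stable_hash(*parts: str) -> int:
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.sha256("/".join(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Step 1: hash the object name into one of pg_num placement groups."""
    return stable_hash("obj", object_name) % pg_num


def pg_to_osds(pg_id: int, osds: list, replicas: int) -> list:
    """Step 2: rank OSDs by a per-(PG, OSD) score and keep the top `replicas`.

    This is rendezvous hashing, used here as a stand-in for CRUSH.
    """
    ranked = sorted(osds, key=lambda osd: stable_hash("pg", str(pg_id), osd), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, cluster_map: dict) -> list:
    """Compute the OSDs for an object using only the name and the map."""
    pg_id = object_to_pg(object_name, cluster_map["pg_num"])
    return pg_to_osds(pg_id, cluster_map["osds"], cluster_map["replicas"])


# Hypothetical cluster map; in Ceph this would come from a MON.
cluster_map = {"pg_num": 128, "replicas": 3, "osds": ["osd.0", "osd.1", "osd.2", "osd.3"]}

# Two independent "clients" holding the same map reach the same answer
# without asking any central directory.
client_a = locate("vm-image-42", cluster_map)
client_b = locate("vm-image-42", dict(cluster_map))
print(client_a == client_b, client_a)
```

Because the result depends only on the object name and the map, the only control plane interaction a client needs is keeping its copy of the map current.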
HDFS Architecture
Hadoop Distributed File System (HDFS) has one of the clearest control/data plane separations among storage systems: the NameNode IS the control plane (file system namespace + block locations), and DataNodes ARE the data plane (block storage + retrieval).
NameNode (Control Plane)
- Stores the entire filesystem namespace in memory (directory tree, file→block mapping, block→DataNode mapping)
- Handles all metadata operations: open, close, rename, mkdir, ls
- Manages block replication: decides which DataNodes get copies
- Processes DataNode heartbeats and block reports
- Single point of failure (mitigated by HA with standby NameNode + JournalNodes)
DataNode (Data Plane)
- Stores actual data blocks on local disks (default 128MB blocks)
- Serves read/write requests directly to clients
- Performs block replication on NameNode instructions
- Reports block inventory to NameNode via periodic block reports
- Sends heartbeats to NameNode every 3 seconds
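Heartbeat processing is a control plane job: the NameNode only records when it last heard from each DataNode and compares that against a timeout. Here is a minimal sketch of that bookkeeping, assuming the stock HDFS defaults (`dfs.heartbeat.interval` of 3 seconds and `dfs.namenode.heartbeat.recheck-interval` of 5 minutes). The `HeartbeatTracker` class is a hypothetical illustration, not Hadoop code.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 3    # dfs.heartbeat.interval (default)
RECHECK_INTERVAL_S = 300    # dfs.namenode.heartbeat.recheck-interval (default, 5 min)

# HDFS declares a DataNode dead after 2 * recheck + 10 * heartbeat intervals.
DEAD_TIMEOUT_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S  # 630 seconds


@dataclass
class HeartbeatTracker:
    """Hypothetical stand-in for the NameNode's heartbeat bookkeeping."""
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Called whenever a heartbeat arrives from `node`."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list:
        """Nodes whose last heartbeat is older than the dead timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > DEAD_TIMEOUT_S)


tracker = HeartbeatTracker()
tracker.record("dn1", now=0)
tracker.record("dn2", now=0)
tracker.record("dn1", now=600)      # dn1 keeps reporting, dn2 goes silent
print(tracker.dead_nodes(now=631))
```

With the defaults, a silent DataNode is treated as dead after 10 minutes 30 seconds, at which point the NameNode can schedule re-replication of its blocks.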
sequenceDiagram
participant C as Client
participant NN as NameNode (Control)
participant DN1 as DataNode 1 (Data)
participant DN2 as DataNode 2 (Data)
participant DN3 as DataNode 3 (Data)
Note over C,DN3: Write Path
C->>NN: Create file /data/log.txt
NN->>C: Block locations [DN1, DN2, DN3]
C->>DN1: Write Block 1
DN1->>DN2: Pipeline replicate
DN2->>DN3: Pipeline replicate
DN3->>DN2: ACK
DN2->>DN1: ACK
DN1->>C: ACK (all replicas written)
C->>NN: Complete file
Note over C,DN3: Read Path
C->>NN: Open file /data/log.txt
NN->>C: Block locations [DN1, DN2, DN3]
C->>DN1: Read Block 1 (nearest replica)
# Cluster capacity and per-DataNode status, as reported by the NameNode
hdfs dfsadmin -report
# View filesystem namespace (control plane metadata)
hdfs dfs -ls /user/hadoop/
# Print the rack topology of DataNodes known to the NameNode
hdfs dfsadmin -printTopology
# View block distribution for a file
hdfs fsck /user/hadoop/data.csv -files -blocks -locations
# Ask a DataNode to send a fresh block report to the NameNode
# (the argument is the DataNode IPC address; 9867 is the Hadoop 3 default IPC port, 9866 is the data transfer port)
hdfs dfsadmin -triggerBlockReport localhost:9867
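The write path from the sequence diagram can be modelled in a few lines to show where the plane boundary sits: the NameNode hands out block locations, and the bytes travel only between the client and DataNodes. This is a simplified sketch. The `NameNode` and `DataNode` classes here are toy stand-ins, and the "first N nodes" placement is an assumption (real HDFS placement is rack-aware).

```python
class DataNode:
    """Data plane: stores block bytes and forwards them down the pipeline."""

    def __init__(self, name: str):
        self.name = name
        self.blocks = {}

    def write(self, block_id: str, data: bytes, downstream: list) -> list:
        self.blocks[block_id] = data
        # Forward to the next node, if any, and collect its acknowledgements.
        acks = downstream[0].write(block_id, data, downstream[1:]) if downstream else []
        return acks + [self.name]   # acks travel back up the pipeline


class NameNode:
    """Control plane: tracks which DataNodes hold which block, never the bytes."""

    def __init__(self, datanodes: list, replication: int = 3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}   # path -> list of (block_id, [DataNode names])

    def allocate_block(self, path: str):
        block_id = f"{path}#blk{len(self.block_map.get(path, []))}"
        targets = self.datanodes[: self.replication]   # toy placement policy
        self.block_map.setdefault(path, []).append((block_id, [d.name for d in targets]))
        return block_id, targets


dns = [DataNode("DN1"), DataNode("DN2"), DataNode("DN3")]
nn = NameNode(dns)

# Client: ask the control plane where to write, then stream to the data plane.
block_id, pipeline = nn.allocate_block("/data/log.txt")
acks = pipeline[0].write(block_id, b"hello", pipeline[1:])
print(acks)
```

The acknowledgement list comes back in reverse pipeline order, which mirrors how the last DataNode in the pipeline acknowledges first and the client hears from the first DataNode only.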
S3 Internals
Amazon S3 is one of the world's largest object storage systems. While its internals are proprietary, AWS has publicly described enough of the architecture to understand the control/data plane separation.
S3 Control Plane
- Metadata service — stores object keys, versions, ACLs, storage class, encryption metadata
- Placement service — decides which physical storage nodes hold object data
- Consistency layer — since December 2020, provides strong read-after-write consistency (previously, overwrites and deletes were only eventually consistent)
- Lifecycle manager — transitions objects between storage classes, handles expiration
S3 Data Plane
- Storage nodes — actual disk arrays storing object data chunks
- Erasure coding — data is split and coded across multiple disks/AZs for durability
- Transfer acceleration — edge locations for upload/download optimization
- Multipart upload — parallel data ingestion for large objects
flowchart LR
CLIENT["Client"] --> LB["Load Balancer"]
LB --> FE["Front-End\n(Auth + Routing)"]
FE --> META["Metadata Service\n(Control Plane)"]
FE --> STORE["Storage Nodes\n(Data Plane)"]
META -->|"Object location"| FE
subgraph STORAGE["Data Plane — Storage Layer"]
STORE --> AZ1["AZ-1 Shards"]
STORE --> AZ2["AZ-2 Shards"]
STORE --> AZ3["AZ-3 Shards"]
end
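To make the erasure coding idea concrete, here is a deliberately simple sketch using a single XOR parity shard: `k` data shards plus one parity shard, tolerating the loss of any one shard. S3's actual coding scheme is not public, and production systems generally use codes that survive multiple losses, so treat this only as an illustration of "split and code across locations". All function names are illustrative.

```python
def split(data: bytes, k: int) -> list:
    """Split data into k equal-length shards, zero-padding the tail."""
    shard_len = -(-len(data) // k)   # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]


def xor_shards(shards: list) -> bytes:
    """Byte-wise XOR of equal-length shards."""
    out = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            out[i] ^= byte
    return bytes(out)


def encode(data: bytes, k: int) -> list:
    """Return k data shards followed by one parity shard."""
    shards = split(data, k)
    return shards + [xor_shards(shards)]


def recover(shards: list) -> list:
    """Rebuild the single missing shard (marked None) from the survivors."""
    missing = shards.index(None)
    survivors = [s for s in shards if s is not None]
    repaired = list(shards)
    repaired[missing] = xor_shards(survivors)
    return repaired


data = b"object payload bytes"
coded = encode(data, k=4)     # 4 data shards + 1 parity shard

damaged = list(coded)
damaged[2] = None             # simulate losing one disk or location
repaired = recover(damaged)

restored = b"".join(repaired[:4])[: len(data)]
print(restored == data)
```

The storage overhead here is 5/4 of the original size, compared with 3x for triple replication, which is the usual motivation for coding over plain copies.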
How S3 Achieved Strong Consistency
For years, S3 provided only eventual consistency for overwrites and deletes. In December 2020, AWS announced strong read-after-write consistency at no extra cost. According to AWS's public description, the key was redesigning the control plane's metadata layer: a new witness component lets the metadata service confirm that it reflects the latest write before responding to a read. The change centered on the logic for tracking "which version is current" rather than on how object bytes are stored, which makes it a good example of improving system behavior mainly through the control plane.
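S3's internals are not public, so the following is only a toy model of the general idea: a cached metadata entry is served only after a lightweight witness confirms it is still the latest version. The classes `Witness` and `MetadataFrontend`, and the way versions are numbered, are hypothetical and do not describe AWS's implementation.

```python
class Witness:
    """Hypothetical component that remembers the latest version written per key."""

    def __init__(self):
        self.latest = {}

    def note_write(self, key: str, version: int) -> None:
        self.latest[key] = version

    def is_current(self, key: str, version: int) -> bool:
        return self.latest.get(key) == version


class MetadataFrontend:
    """Hypothetical metadata node with a local cache in front of a shared store."""

    def __init__(self, store: dict, witness: Witness):
        self.store = store        # shared durable metadata: key -> (version, location)
        self.witness = witness
        self.cache = {}

    def put(self, key: str, location: str) -> None:
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, location)
        self.witness.note_write(key, version)

    def get(self, key: str):
        cached = self.cache.get(key)
        # Serve from cache only if the witness agrees the entry is current.
        if cached is None or not self.witness.is_current(key, cached[0]):
            cached = self.store[key]
            self.cache[key] = cached
        return cached


store, witness = {}, Witness()
fe_a = MetadataFrontend(store, witness)
fe_b = MetadataFrontend(store, witness)

fe_a.put("photos/cat.jpg", "shard-1")
first = fe_b.get("photos/cat.jpg")      # fe_b now caches version 1
fe_a.put("photos/cat.jpg", "shard-2")   # overwrite through a different frontend
second = fe_b.get("photos/cat.jpg")     # stale cache is detected and refreshed
print(first, second)
```

Nothing in this model touches object bytes; the stronger guarantee comes entirely from how the metadata path decides which version is current.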
Database Replication as Control/Data Plane
Database replication maps naturally to the control/data plane pattern. A coordinator decides replication topology and consistency guarantees (control plane), while replicas execute actual data operations (data plane).
- Primary/Replica topology — primary decides write ordering (control), replicas apply the write log (data)
- Consensus protocols — Raft/Paxos leader decides commit order (control), followers persist entries (data)
- Sharding coordinator — decides which shard owns a key range (control), shards serve reads/writes (data)
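The primary/replica split can be illustrated with a minimal log-shipping sketch: the primary's only "control" decision is the order of writes, and replicas mechanically apply that order. This is an assumption-laden toy (no failures, no acknowledgements, no consensus), and the `Primary` and `Replica` classes are illustrative rather than taken from any particular database.

```python
class Replica:
    """Data plane: applies log entries in the order the primary assigned."""

    def __init__(self):
        self.state = {}
        self.applied = 0   # index of the last applied entry (1-based)

    def apply(self, log: list) -> None:
        for index, key, value in log[self.applied:]:
            self.state[key] = value
            self.applied = index


class Primary:
    """Control plane role: decides the total order of writes."""

    def __init__(self, replicas: list):
        self.log = []
        self.replicas = replicas

    def write(self, key: str, value) -> int:
        index = len(self.log) + 1
        self.log.append((index, key, value))
        return index

    def ship(self) -> None:
        """Send the log to every replica; each applies only what it is missing."""
        for replica in self.replicas:
            replica.apply(self.log)


replicas = [Replica(), Replica()]
primary = Primary(replicas)

primary.write("x", 1)
primary.write("x", 2)
primary.write("y", 3)
primary.ship()
primary.ship()   # shipping again is harmless: nothing new to apply

print([r.state for r in replicas])
```

Because every replica applies the same ordered log, they converge on the same state without coordinating with each other.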
CSI in Kubernetes
The Container Storage Interface (CSI) brings storage control/data plane separation into Kubernetes. The CSI driver splits into a controller plugin (control plane: provisioning, attaching) and a node plugin (data plane: mounting, formatting).
flowchart TB
subgraph K8S_CP["Kubernetes Control Plane"]
PVC["PersistentVolumeClaim"]
SC["StorageClass"]
PV["PersistentVolume"]
end
subgraph CSI_CP["CSI Control Plane"]
PROV["Provisioner\n(CreateVolume)"]
ATTACH["Attacher\n(ControllerPublishVolume)"]
end
subgraph CSI_DP["CSI Data Plane (per Node)"]
STAGE["NodeStageVolume\n(format + mount to global)"]
PUB["NodePublishVolume\n(bind mount to pod)"]
end
PVC --> SC
SC --> PROV
PROV -->|"Create disk"| STORAGE["Cloud Storage API"]
ATTACH -->|"Attach to node"| STORAGE
STAGE -->|"Format + mount"| DISK["Block Device"]
PUB -->|"Bind to pod"| POD["Pod Filesystem"]
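The four RPCs in the diagram have to happen in order, and each belongs to exactly one of the two plugins. The sketch below models that ordering as a small state machine. The RPC names come from the CSI specification, while the `State` enum and `Volume` class are hypothetical teaching aids rather than part of any CSI library.

```python
from enum import Enum


class State(Enum):
    ABSENT = 0
    CREATED = 1
    ATTACHED = 2
    STAGED = 3
    PUBLISHED = 4


# (RPC name, plugin that serves it, required state, resulting state)
RPCS = [
    ("CreateVolume", "controller", State.ABSENT, State.CREATED),
    ("ControllerPublishVolume", "controller", State.CREATED, State.ATTACHED),
    ("NodeStageVolume", "node", State.ATTACHED, State.STAGED),
    ("NodePublishVolume", "node", State.STAGED, State.PUBLISHED),
]


class Volume:
    """Tracks one volume through the provisioning and mount sequence."""

    def __init__(self):
        self.state = State.ABSENT
        self.history = []

    def call(self, rpc: str) -> None:
        for name, plugin, required, result in RPCS:
            if name == rpc:
                if self.state is not required:
                    raise RuntimeError(f"{rpc} requires {required.name}, volume is {self.state.name}")
                self.state = result
                self.history.append((plugin, name))
                return
        raise ValueError(f"unknown RPC: {rpc}")


vol = Volume()
for rpc in ["CreateVolume", "ControllerPublishVolume", "NodeStageVolume", "NodePublishVolume"]:
    vol.call(rpc)
print(vol.state.name, vol.history)
```

The first two calls are served by the controller plugin and talk to the storage backend's API; the last two run on the node where the pod is scheduled and touch the local device and mount table.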
# StorageClass — tells CSI control plane HOW to provision
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com # CSI controller plugin
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
# PersistentVolumeClaim — requests storage from control plane
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
# CSI Driver deployment — split into controller and node components
# Trimmed for illustration: args, RBAC, and the shared CSI socket volume are omitted
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ebs-csi-controller # Control plane: a few replicas per cluster, not one per node
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ebs-csi-controller
  template:
    metadata:
      labels:
        app: ebs-csi-controller # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: csi-provisioner # Watches PVCs, calls CreateVolume
          image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.0
        - name: csi-attacher # Watches VolumeAttachments, calls ControllerPublishVolume
          image: registry.k8s.io/sig-storage/csi-attacher:v4.4.0
        - name: ebs-plugin # Talks to AWS EC2 API
          image: amazon/aws-ebs-csi-driver:v1.25.0
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebs-csi-node # Data plane — runs on EVERY node
spec:
  selector:
    matchLabels:
      app: ebs-csi-node
  template:
    metadata:
      labels:
        app: ebs-csi-node # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: ebs-plugin # Handles NodeStage + NodePublish (mount operations)
          image: amazon/aws-ebs-csi-driver:v1.25.0
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.0
Performance — Metadata as Bottleneck
In storage systems that rely on a centralized metadata service, the control plane (metadata operations) usually becomes the bottleneck before the data plane (raw I/O). This is because:
- Metadata is centralized — a finite number of metadata servers handle all namespace operations
- Data is distributed — adding more storage nodes increases data throughput roughly linearly
- Metadata operations require consistency — must be serialized or use consensus
- Data operations can be parallel — different objects on different nodes are independent
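A back-of-the-envelope model makes the four points above concrete: data capacity grows with node count, metadata capacity does not, and the cluster delivers whichever is smaller. The function and every number below are hypothetical, chosen only to show the shape of the curve.

```python
def cluster_ops_per_sec(nodes: int,
                        data_ops_per_node: int,
                        metadata_ops_cap: int,
                        metadata_ops_per_request: float = 1.0) -> float:
    """Throughput is capped by whichever plane saturates first.

    metadata_ops_per_request is the average number of metadata lookups each
    client request needs (1.0 means every request asks the metadata service).
    """
    data_capacity = nodes * data_ops_per_node
    metadata_capacity = metadata_ops_cap / metadata_ops_per_request
    return min(data_capacity, metadata_capacity)


# Hypothetical figures: 5,000 ops/s per storage node, 100,000 metadata ops/s in total.
for n in (10, 20, 40):
    print(n, cluster_ops_per_sec(n, 5_000, 100_000))

# If clients can compute placement themselves and only rarely consult the
# metadata service (here: once per 100 requests), the ceiling moves far out.
print(40, cluster_ops_per_sec(40, 5_000, 100_000, metadata_ops_per_request=0.01))
```

In this model, doubling from 20 to 40 nodes adds no throughput until the metadata cost per request is reduced, which is the effect that client-side placement schemes such as CRUSH aim for.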
Cross-Cutting Patterns
Universal Storage Control/Data Plane Patterns
| Pattern | Control Plane | Data Plane |
|---|---|---|
| Placement | Decides location (CRUSH, hash ring) | Stores at location |
| Replication | Decides replica count & placement | Copies bytes between nodes |
| Recovery | Detects failure, plans re-replication | Reads surviving copies, writes new ones |
| Rebalancing | Computes new placement map | Migrates data to new locations |
| Consistency | Defines consistency model (strong/eventual) | Implements read/write quorums |
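The last row of the table can be made concrete with the classic quorum condition: with N replicas, a read of R replicas is guaranteed to see the latest write to W replicas when R + W > N. The sketch below checks that rule against a worst-case simulation in which the read deliberately picks the replicas least likely to have the write. The function names are illustrative.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Control plane rule: read and write sets must share at least one replica."""
    return r + w > n


def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """Data plane worst case: write hits the first w replicas, read hits the last r."""
    versions = [0] * n
    for i in range(w):
        versions[i] = 1                 # the new version lands on w replicas
    read_set = versions[n - r:]         # adversarial choice of r replicas
    return max(read_set) == 1           # reader keeps the highest version it saw


for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 4)]:
    print(n, r, w, quorums_overlap(n, r, w), read_sees_latest_write(n, r, w))
```

The control plane picks N, R, and W; the data plane only has to contact that many replicas and compare versions.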