Storage Systems Control & Data Planes

Storage Control vs Data Plane

Every distributed storage system faces the same fundamental challenge: managing the metadata (where things are, how they're replicated, their consistency state) separately from the data operations (actual reads and writes of bytes). This maps cleanly to the control/data plane pattern.

                            
Storage Control Plane: Metadata management, placement decisions, replication orchestration, failure detection, data rebalancing, namespace management, access control policy. It answers: "Where should this data go? How many copies? What happens when a node fails?"

                            
Storage Data Plane: Actual reads/writes, data transfer between nodes, checksum verification, compression/decompression, encryption at rest, disk I/O operations. It answers: "Read these bytes from disk. Write these bytes. Verify integrity. Transfer to replica."

Storage Control Plane vs Data Plane — Universal Pattern

flowchart TB
    subgraph CP["Storage Control Plane"]
        META["Metadata Service\n(namespace, directory tree)"]
        PLACE["Placement Engine\n(where to store)"]
        REPL["Replication Manager\n(copy orchestration)"]
        HEALTH["Health Monitor\n(failure detection)"]
        META --> PLACE
        PLACE --> REPL
        HEALTH --> PLACE
    end
    subgraph DP["Storage Data Plane"]
        WRITE["Write Path\n(client → storage nodes)"]
        READ["Read Path\n(storage nodes → client)"]
        XFER["Data Transfer\n(replication traffic)"]
        VERIFY["Integrity Check\n(checksums)"]
    end
    CP -->|"Placement map"| DP
    DP -->|"Health status"| CP
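The split in the diagram can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular system implements it: the class names, the least-loaded placement rule, and the use of CRC32 as the checksum are all hypothetical choices made for illustration.

```python
import zlib


class ControlPlane:
    """Owns metadata only: which nodes are healthy and where each key lives."""

    def __init__(self, nodes, replicas=2):
        self.healthy = sorted(nodes)
        self.replicas = replicas
        self.placement_map = {}  # key -> list of node names

    def place(self, key):
        # Pick the least-loaded healthy nodes (ties broken by name).
        load = {node: 0 for node in self.healthy}
        for nodes in self.placement_map.values():
            for node in nodes:
                if node in load:
                    load[node] += 1
        chosen = sorted(self.healthy, key=lambda n: (load[n], n))[: self.replicas]
        self.placement_map[key] = chosen
        return chosen

    def locate(self, key):
        return self.placement_map[key]


class DataPlane:
    """Owns bytes only: stores payloads with a CRC32 and verifies it on read."""

    def __init__(self):
        self.disks = {}  # node -> {key: (payload, crc)}

    def write(self, nodes, key, payload):
        crc = zlib.crc32(payload)
        for node in nodes:
            self.disks.setdefault(node, {})[key] = (payload, crc)

    def read(self, nodes, key):
        for node in nodes:
            payload, crc = self.disks[node][key]
            if zlib.crc32(payload) == crc:  # integrity check before returning
                return payload
        raise IOError(f"no intact replica of {key!r}")


control = ControlPlane(["node-a", "node-b", "node-c"])
data = DataPlane()
targets = control.place("photo.jpg")          # control plane: decide
data.write(targets, "photo.jpg", b"\x89PNG")  # data plane: move bytes
restored = data.read(control.locate("photo.jpg"), "photo.jpg")
```

Note that the control plane never sees the payload and the data plane never decides where anything goes; the placement list is the only thing that crosses the boundary.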

Ceph Architecture

Ceph is the canonical example of control/data plane separation in open-source storage. Its control plane (MON + MGR + MDS) handles cluster state and metadata, while its data plane (OSDs) handles actual object storage and retrieval.

Ceph Control Plane Components

MON (Monitor) — maintains cluster map (OSD map, MON map, PG map, CRUSH map); Paxos consensus for consistency
MGR (Manager) — provides monitoring, orchestration, and cluster management interfaces
MDS (Metadata Server) — manages filesystem namespace for CephFS (POSIX metadata: directories, permissions, timestamps)
CRUSH algorithm — deterministic placement algorithm; clients can compute object location without querying a central directory

Ceph Data Plane Components

OSD (Object Storage Daemon) — one per physical disk; handles actual reads/writes, replication, recovery, rebalancing
Placement Groups (PGs) — logical grouping of objects mapped to OSDs via CRUSH; enables efficient rebalancing
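A minimal sketch of the first half of that mapping, object name to placement group, might look like the following. Ceph uses its own hash function and modulo variant; CRC32, plain modulo, and the `PG_NUM` value here are stand-ins chosen only to keep the example self-contained.

```python
import zlib


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Map an object name to a placement group id (stand-in hash, plain modulo)."""
    return zlib.crc32(object_name.encode()) % pg_num


PG_NUM = 128  # hypothetical pool setting
names = [f"obj-{i}" for i in range(10_000)]
pgs = {name: object_to_pg(name, PG_NUM) for name in names}
```

The point of the indirection is bookkeeping: the cluster tracks at most 128 PG-to-OSD mappings here instead of 10,000 per-object mappings, and rebalancing moves whole PGs.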

Ceph Architecture — MON/MGR (Control) + OSD (Data)

flowchart TB
    CLIENT["Client\n(librbd / CephFS / RGW)"]
    subgraph CP["Control Plane"]
        MON1["MON 1"]
        MON2["MON 2"]
        MON3["MON 3"]
        MGR["MGR\n(Dashboard, Metrics)"]
        MDS["MDS\n(CephFS metadata)"]
        MON1 <--> MON2
        MON2 <--> MON3
    end
    subgraph DP["Data Plane (OSDs)"]
        OSD1["OSD.0\n/dev/sda"]
        OSD2["OSD.1\n/dev/sdb"]
        OSD3["OSD.2\n/dev/sdc"]
        OSD4["OSD.3\n/dev/sdd"]
    end
    CLIENT -->|"1. Get cluster map"| MON1
    CLIENT -->|"2. CRUSH compute"| CLIENT
    CLIENT -->|"3. Direct I/O"| OSD1
    OSD1 -->|"Replicate"| OSD2
    OSD1 -->|"Replicate"| OSD3
    OSD1 -.->|"Heartbeat"| MON1

Key Insight

CRUSH — The Algorithm That Eliminates the Metadata Bottleneck

Ceph's CRUSH (Controlled Replication Under Scalable Hashing) algorithm is a deterministic placement function: given an object name and the current cluster map, any client can independently compute which OSDs store that object. The control plane (MONs) therefore does not need to be consulted for every I/O operation. Clients fetch the cluster map from a monitor, refresh it only when the cluster changes, and otherwise talk directly to OSDs. This is why Ceph scales: the data plane operates independently of the control plane for normal operations.
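The core property, that every client computes the same answer from the same map, can be illustrated with rendezvous (highest-random-weight) hashing. This is deliberately not CRUSH: it ignores weights, failure domains, and placement rules. The `place` function and OSD names are hypothetical.

```python
import hashlib


def place(pg_id: int, osds: list[str], replicas: int = 3) -> list[str]:
    """Rank OSDs by a hash of (pg_id, osd) and return the top `replicas`."""

    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(osds, key=score, reverse=True)[:replicas]


cluster_map = ["osd.0", "osd.1", "osd.2", "osd.3"]
client_a = place(42, cluster_map)
client_b = place(42, list(reversed(cluster_map)))  # another client, same map
```

Both clients arrive at the same OSD list without asking anyone, and removing an OSD that was not selected leaves the result unchanged, which is the behavior that keeps data movement small when the cluster changes.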


# Check Ceph cluster health (control plane status)
ceph status

# View the CRUSH map (placement rules — control plane)
ceph osd crush dump | head -50

# Check OSD status (data plane nodes)
ceph osd tree

# View placement group distribution
ceph pg stat

# Monitor OSD performance (data plane metrics)
ceph osd perf

HDFS Architecture

Hadoop Distributed File System (HDFS) has one of the clearest control/data plane separations of any storage system: the NameNode is the control plane (file system namespace and block locations), and the DataNodes are the data plane (block storage and retrieval).

NameNode (Control Plane)

Stores the entire filesystem namespace in memory (directory tree, file→block mapping, block→DataNode mapping)
Handles all metadata operations: open, close, rename, mkdir, ls
Manages block replication: decides which DataNodes get copies
Processes DataNode heartbeats and block reports
Single point of failure (mitigated by HA with standby NameNode + JournalNodes)

DataNode (Data Plane)

Stores actual data blocks on local disks (default 128MB blocks)
Serves read/write requests directly to clients
Performs block replication on NameNode instructions
Reports block inventory to NameNode via periodic block reports
Sends heartbeats to NameNode every 3 seconds
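Heartbeat-based failure detection can be sketched as follows. The 3-second interval comes from the list above; the dead-node rule (2 × recheck interval + 10 × heartbeat interval) and the 300-second recheck interval are assumptions based on commonly documented HDFS defaults, so check them against your Hadoop version. The `HeartbeatTracker` class is hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3   # from the DataNode list above
RECHECK_INTERVAL_S = 300   # assumed default recheck interval (not from this article)


def dead_timeout_s(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Seconds of silence after which a DataNode is treated as dead (assumed rule)."""
    return 2 * recheck_s + 10 * heartbeat_s


class HeartbeatTracker:
    """Control-plane view of DataNode liveness, driven only by heartbeats."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # datanode -> timestamp of last heartbeat

    def heartbeat(self, datanode, now_s):
        self.last_seen[datanode] = now_s

    def dead_nodes(self, now_s):
        return sorted(
            dn for dn, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


tracker = HeartbeatTracker(dead_timeout_s())
tracker.heartbeat("dn1", now_s=0)
tracker.heartbeat("dn2", now_s=0)
tracker.heartbeat("dn2", now_s=600)
dead = tracker.dead_nodes(now_s=700)  # dn1 has been silent for 700 s
```

Under these assumptions the timeout works out to 630 seconds, so a single missed heartbeat does not trigger re-replication; only sustained silence does.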

HDFS NameNode (Control) / DataNode (Data) Architecture

sequenceDiagram
    participant C as Client
    participant NN as NameNode (Control)
    participant DN1 as DataNode 1 (Data)
    participant DN2 as DataNode 2 (Data)
    participant DN3 as DataNode 3 (Data)

    Note over C,DN3: Write Path
    C->>NN: Create file /data/log.txt
    NN->>C: Block locations [DN1, DN2, DN3]
    C->>DN1: Write Block 1
    DN1->>DN2: Pipeline replicate
    DN2->>DN3: Pipeline replicate
    DN3-->>DN2: ACK
    DN2-->>DN1: ACK
    DN1-->>C: ACK (all replicas written)
    C->>NN: Complete file

    Note over C,DN3: Read Path
    C->>NN: Open file /data/log.txt
    NN->>C: Block locations [DN1, DN2, DN3]
    C->>DN1: Read Block 1 (nearest replica)
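The write pipeline can be modeled as a chain in which each DataNode stores the block, forwards it downstream, and acknowledges upstream only after its successor has acknowledged. The sketch below is a simplified in-memory model; `pipeline_write` and the `storage` dictionary are hypothetical, and real HDFS streams packets rather than whole blocks.

```python
def pipeline_write(block: bytes, datanodes: list, storage: dict) -> list:
    """Write `block` along the pipeline; return acks in the order they travel back."""
    if not datanodes:
        return []
    head, rest = datanodes[0], datanodes[1:]
    storage.setdefault(head, []).append(block)              # local write
    downstream_acks = pipeline_write(block, rest, storage)  # forward to next node
    return downstream_acks + [head]                         # ack after downstream acks


storage = {}
acks = pipeline_write(b"block-1", ["DN1", "DN2", "DN3"], storage)
```

The client sends the data once, to the first DataNode only; the replication traffic stays inside the data plane, and the NameNode is not involved until the file is completed.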

# Cluster report from the NameNode: capacity plus live/dead DataNodes
hdfs dfsadmin -report

# View filesystem namespace (control plane metadata)
hdfs dfs -ls /user/hadoop/

# Print the rack/DataNode topology as the NameNode sees it (data plane nodes)
hdfs dfsadmin -printTopology

# View block distribution for a file
hdfs fsck /user/hadoop/data.csv -files -blocks -locations

# Ask a DataNode to send a fresh block report to the NameNode.
# The argument is the DataNode's IPC address (port 9867 by default in Hadoop 3), not its data-transfer port.
hdfs dfsadmin -triggerBlockReport localhost:9867

                            
The NameNode Bottleneck: Because all metadata operations go through the NameNode, it becomes the scalability bottleneck in HDFS. A single NameNode can handle roughly 100K–500K file and directory operations per second. For clusters with billions of small files, this control plane becomes the limiting factor rather than the data plane, since DataNodes scale linearly. This led to HDFS Federation, which partitions the namespace across multiple NameNodes.
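Because the NameNode keeps the whole namespace in memory, small files also cost heap, not just operations per second. The back-of-the-envelope estimate below assumes roughly 150 bytes of heap per namespace object (file or block); that figure is a frequently quoted rule of thumb, not a number from this article, and the function name is hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a figure from this article


def namenode_heap_gib(files: int, blocks_per_file: int = 1,
                      bytes_per_object: int = BYTES_PER_OBJECT) -> float:
    """Rough NameNode heap needed to hold `files` and their blocks in memory."""
    objects = files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * bytes_per_object / 2**30


small_files = namenode_heap_gib(1_000_000_000)         # 1 billion single-block files
packed_files = namenode_heap_gib(1_000_000_000 // 128)  # same count packed 128 to a file
```

Under that assumption, one billion tiny files need roughly 279 GiB of heap, while packing them 128 to a file brings the estimate down to about 2 GiB. The data plane stores the same bytes either way; only the control plane's burden changes.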
                        

S3 Internals

Amazon S3 is the world's largest object storage system. While its internals are proprietary, AWS has revealed enough architecture to understand the control/data plane separation.

S3 Control Plane

Metadata service — stores object keys, versions, ACLs, storage class, encryption metadata
Placement service — decides which physical storage nodes hold object data
Consistency layer — since 2020, provides strong read-after-write consistency (previously eventual)
Lifecycle manager — transitions objects between storage classes, handles expiration

S3 Data Plane

Storage nodes — actual disk arrays storing object data chunks
Erasure coding — data is split and coded across multiple disks/AZs for durability
Transfer acceleration — edge locations for upload/download optimization
Multipart upload — parallel data ingestion for large objects
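To make the erasure coding idea concrete, here is the simplest possible code: k data shards plus one XOR parity shard, which can rebuild any single lost shard. This is a stand-in for illustration only. Production systems typically use Reed-Solomon-style codes that tolerate multiple losses, and the source does not describe S3's actual scheme. The function names are hypothetical.

```python
def encode(data: bytes, k: int) -> list:
    """Split data into k equal shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytes(shard_len)
    for shard in shards:
        parity = bytes(a ^ b for a, b in zip(parity, shard))
    return shards + [parity]


def recover(shards: list, missing: int) -> bytes:
    """Rebuild the shard at index `missing` by XOR-ing all surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != missing]
    rebuilt = bytes(len(survivors[0]))
    for shard in survivors:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, shard))
    return rebuilt


original = b"object payload stored in S3"
shards = encode(original, k=4)   # 4 data shards + 1 parity shard
lost = shards[2]
shards[2] = None                 # simulate a failed disk
rebuilt = recover(shards, missing=2)
```

The storage overhead here is 5/4 of the original size rather than the 3× of triple replication, which is the usual motivation for erasure coding in large object stores.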

S3 Request Flow — Control Plane (Metadata) + Data Plane (Storage)

flowchart LR
    CLIENT["Client"] --> LB["Load Balancer"]
    LB --> FE["Front-End\n(Auth + Routing)"]
    FE --> META["Metadata Service\n(Control Plane)"]
    FE --> STORE["Storage Nodes\n(Data Plane)"]
    META -->|"Object location"| FE
    subgraph STORAGE["Data Plane — Storage Layer"]
        STORE --> AZ1["AZ-1 Shards"]
        STORE --> AZ2["AZ-2 Shards"]
        STORE --> AZ3["AZ-3 Shards"]
    end

Architecture Insight

How S3 Achieved Strong Consistency

For years, S3 provided only eventual consistency for overwrites and deletes. In December 2020, AWS announced strong read-after-write consistency at no extra cost. The key was redesigning the control plane's metadata layer — they built a new witness system that ensures the metadata service always reflects the latest write before responding to reads. The data plane didn't need to change; only the control plane logic for tracking "which version is current" was upgraded. This is a perfect example of improving system behavior by modifying only the control plane.


Database Replication as Control/Data Plane

Database replication maps naturally to the control/data plane pattern. A coordinator decides replication topology and consistency guarantees (control plane), while replicas execute actual data operations (data plane).

Primary/Replica topology — primary decides write ordering (control), replicas apply the write log (data)
Consensus protocols — Raft/Paxos leader decides commit order (control), followers persist entries (data)
Sharding coordinator — decides which shard owns a key range (control), shards serve reads/writes (data)
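For the consensus case, the leader's control decision reduces to one question: what is the highest log index stored on a majority of nodes? A minimal sketch, assuming a Raft-style `match_index` table that includes the leader itself. Real Raft additionally requires the entry to belong to the leader's current term, which is omitted here. Node names are hypothetical.

```python
def commit_index(match_index: dict) -> int:
    """Highest log index replicated on a majority of nodes (leader included)."""
    ranked = sorted(match_index.values(), reverse=True)
    majority = len(ranked) // 2 + 1
    return ranked[majority - 1]


replicated = {
    "leader": 7,
    "follower-1": 7,
    "follower-2": 5,
    "follower-3": 4,
    "follower-4": 2,
}
committed = commit_index(replicated)
```

With five nodes a majority is three, and the third-highest index is 5, so entries up to index 5 are committed. Followers only persist and report; deciding what counts as committed stays with the leader.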

CSI in Kubernetes

The Container Storage Interface (CSI) brings storage control/data plane separation into Kubernetes. The CSI driver splits into a controller plugin (control plane: provisioning, attaching) and a node plugin (data plane: mounting, formatting).

CSI Driver Architecture — Controller (Control) vs Node (Data)

flowchart TB
    subgraph K8S_CP["Kubernetes Control Plane"]
        PVC["PersistentVolumeClaim"]
        SC["StorageClass"]
        PV["PersistentVolume"]
    end
    subgraph CSI_CP["CSI Control Plane"]
        PROV["Provisioner\n(CreateVolume)"]
        ATTACH["Attacher\n(ControllerPublish)"]
    end
    subgraph CSI_DP["CSI Data Plane (per Node)"]
        STAGE["NodeStageVolume\n(format + mount to global)"]
        PUB["NodePublishVolume\n(bind mount to pod)"]
    end
    PVC --> SC
    SC --> PROV
    PROV -->|"Create disk"| STORAGE["Cloud Storage API"]
    ATTACH -->|"Attach to node"| STORAGE
    STAGE -->|"Format + mount"| DISK["Block Device"]
    PUB -->|"Bind to pod"| POD["Pod Filesystem"]
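The ordering of the four RPCs in the diagram can be captured as a small state machine. The RPC names are those used by the CSI specification; the `VolumeLifecycle` class is a hypothetical illustration, and some drivers skip the controller-publish or node-stage step depending on their advertised capabilities.

```python
CSI_STEPS = [
    ("CreateVolume", "controller"),
    ("ControllerPublishVolume", "controller"),
    ("NodeStageVolume", "node"),
    ("NodePublishVolume", "node"),
]


class VolumeLifecycle:
    """Enforces the provision -> attach -> stage -> publish order for one volume."""

    def __init__(self):
        self.done = []

    def call(self, rpc: str) -> str:
        if len(self.done) == len(CSI_STEPS):
            raise RuntimeError("volume is already published")
        expected, plugin = CSI_STEPS[len(self.done)]
        if rpc != expected:
            raise RuntimeError(f"{rpc} called before {expected}")
        self.done.append(rpc)
        return plugin  # which plugin (controller or node) serves this RPC


volume = VolumeLifecycle()
served_by = [volume.call(rpc) for rpc, _ in CSI_STEPS]
```

The first two calls are served by the controller plugin and can run anywhere in the cluster; the last two must run on the node where the pod is scheduled.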

# StorageClass — tells CSI control plane HOW to provision
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com  # CSI controller plugin
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
# PersistentVolumeClaim — requests storage from control plane
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi

# CSI Driver deployment: split into controller and node components
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ebs-csi-controller  # Control plane: cluster-wide, not per node
spec:
  replicas: 2               # Two replicas for availability
  selector:
    matchLabels:
      app: ebs-csi-controller
  template:
    metadata:
      labels:
        app: ebs-csi-controller  # Must match spec.selector
    spec:
      containers:
        - name: csi-provisioner   # Watches PVCs, calls CreateVolume
          image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.0
        - name: csi-attacher      # Watches VolumeAttachments, calls ControllerPublish
          image: registry.k8s.io/sig-storage/csi-attacher:v4.4.0
        - name: ebs-plugin        # Talks to AWS EC2 API
          image: amazon/aws-ebs-csi-driver:v1.25.0
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebs-csi-node  # Data plane: runs on EVERY node
spec:
  selector:
    matchLabels:
      app: ebs-csi-node
  template:
    metadata:
      labels:
        app: ebs-csi-node  # Must match spec.selector
    spec:
      containers:
        - name: ebs-plugin  # Handles NodeStage + NodePublish (mount operations)
          image: amazon/aws-ebs-csi-driver:v1.25.0
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.0

Performance — Metadata as Bottleneck

In storage systems, the control plane (metadata operations) almost always becomes the bottleneck before the data plane (raw I/O). This is because:

Metadata is centralized — a finite number of metadata servers handle all namespace operations
Data is distributed — adding more storage nodes linearly increases data throughput
Metadata operations require consistency — must be serialized or use consensus
Data operations can be parallel — different objects on different nodes are independent

                            
The Metadata Tax: In HDFS, creating a file requires ~5ms of NameNode processing, and opening a file for read takes ~2ms, while the actual data transfer runs at line rate (10+ Gbps). For workloads with millions of small files, more time goes to metadata than to data, so the control plane dominates latency. This is one reason object stores (S3, Ceph RGW) tend to outperform file systems for massive-scale workloads: their flat namespaces have simpler metadata models.
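Plugging the figures quoted above into a quick calculation shows how sharply the balance shifts with file size. The sketch assumes a flat 5 ms create cost and a fully utilized 10 Gbps link, and ignores disk and replication time; `metadata_share` is a hypothetical helper.

```python
CREATE_MS = 5.0    # NameNode create cost quoted above
LINK_GBPS = 10.0   # line rate quoted above


def metadata_share(file_bytes: int, metadata_ms: float = CREATE_MS,
                   link_gbps: float = LINK_GBPS) -> float:
    """Fraction of total write latency spent on the metadata operation."""
    transfer_ms = file_bytes * 8 / (link_gbps * 1e9) * 1e3
    return metadata_ms / (metadata_ms + transfer_ms)


small = metadata_share(64 * 1024)          # 64 KiB file
large = metadata_share(128 * 1024 * 1024)  # one full 128 MiB block
```

For the 64 KiB file, metadata accounts for about 99% of the latency; for a full 128 MiB block it drops to under 5%. The same control plane cost is negligible or dominant depending entirely on how much data each metadata operation amortizes over.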
                        

Cross-Cutting Patterns

Pattern Summary

Universal Storage Control/Data Plane Patterns

Pattern	Control Plane	Data Plane
Placement	Decides location (CRUSH, hash ring)	Stores at location
Replication	Decides replica count & placement	Copies bytes between nodes
Recovery	Detects failure, plans re-replication	Reads surviving copies, writes new ones
Rebalancing	Computes new placement map	Migrates data to new locations
Consistency	Defines consistency model (strong/eventual)	Implements read/write quorums


