
Cloud Provider Control Planes — AWS, Azure & GCP

May 15, 2026 Wasil Zafar 24 min read

"Every cloud API call is a control plane operation. You're not creating a VM — you're declaring to a massive distributed control system that a VM should exist, and the control plane orchestrates physical infrastructure to make it so."

Table of Contents

  1. Cloud APIs as Control Plane
  2. AWS Control Plane Architecture
  3. Azure Resource Manager (ARM)
  4. GCP's Borg-Inspired Architecture
  5. Control Plane vs Data Plane Availability
  6. Control Plane Outages — Lessons Learned
  7. VPC as SDN — Cloud Networking
  8. Architectural Implications

Cloud APIs as Control Plane

Every interaction with a cloud provider follows the control/data plane pattern. When you call aws ec2 run-instances or click "Create VM" in the Azure portal, you're issuing a control plane operation. The API doesn't create a VM — it declares to a distributed control system that a VM should exist. The control plane then orchestrates physical infrastructure (the data plane) to realize that declaration.
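The "declare, then converge" idea can be sketched as a toy reconciliation loop. This is a minimal illustration under stated assumptions, not how any provider implements its orchestration: the `ControlPlane` class, its method names, and the resource specs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Toy control plane: stores desired state and converges actual state toward it."""
    desired: dict = field(default_factory=dict)  # name -> spec (what should exist)
    actual: dict = field(default_factory=dict)   # name -> spec (what the data plane runs)

    def declare(self, name: str, spec: dict) -> None:
        # An API call only records intent; nothing is provisioned yet.
        self.desired[name] = spec

    def delete(self, name: str) -> None:
        self.desired.pop(name, None)

    def reconcile(self) -> list:
        """Compare desired and actual state; return the actions taken."""
        actions = []
        for name, spec in self.desired.items():
            if name not in self.actual:
                self.actual[name] = dict(spec)
                actions.append(("create", name))
            elif self.actual[name] != spec:
                self.actual[name] = dict(spec)
                actions.append(("update", name))
        for name in list(self.actual):
            if name not in self.desired:
                del self.actual[name]
                actions.append(("delete", name))
        return actions


cp = ControlPlane()
cp.declare("web-1", {"size": "medium"})
print(cp.actual)       # {} : intent recorded, nothing running yet
print(cp.reconcile())  # [('create', 'web-1')]
print(cp.reconcile())  # [] : already converged
```

The second `reconcile()` call doing nothing is the point: the API records what should exist, and the orchestrator acts only on the difference between that and what actually exists.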

Cloud Three-Tier Model: Management / Control / Data
flowchart TD
    subgraph "Management Plane"
        CONSOLE["Web Console\n(Portal/Console)"]
        CLI["CLI Tools\n(aws/az/gcloud)"]
        SDK["SDKs\n(boto3, Azure SDK)"]
        IAC["IaC Tools\n(Terraform, ARM, CDK)"]
    end

    subgraph "Control Plane"
        API["Cloud APIs\n(Regional endpoints)"]
        ORCH["Orchestrators\n(Placement, Scheduling)"]
        STATE["State Store\n(Resource metadata)"]
    end

    subgraph "Data Plane"
        COMPUTE["Compute\n(Hypervisors, VMs)"]
        NETWORK["Network\n(Virtual switches, routers)"]
        STORAGE["Storage\n(Disks, object stores)"]
    end

    CONSOLE --> API
    CLI --> API
    SDK --> API
    IAC --> API
    API --> ORCH
    ORCH --> STATE
    ORCH --> COMPUTE
    ORCH --> NETWORK
    ORCH --> STORAGE

    style API fill:#BF092F,color:#fff
    style ORCH fill:#BF092F,color:#fff
    style COMPUTE fill:#3B9797,color:#fff
    style NETWORK fill:#3B9797,color:#fff
    style STORAGE fill:#3B9797,color:#fff
                            
Key Distinction: The management plane (console, CLI, Terraform) is how humans interact with the control plane. The control plane is the API and orchestration layer that decides what should happen. The data plane is the infrastructure that does the work. Many cloud incidents are control plane incidents: you can't create or modify resources, but existing workloads on the data plane keep serving traffic.
# Every CLI command is a control plane API call
# These are ALL control plane operations:

# AWS: Create a VM (control plane tells Nitro hypervisor to launch instance)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.medium \
  --subnet-id subnet-6e7f829e \
  --security-group-ids sg-903004f8 \
  --key-name my-keypair

# Azure: Create a VM (ARM orchestrates compute fabric)
az vm create \
  --resource-group myRG \
  --name myVM \
  --image Ubuntu2204 \
  --size Standard_D2s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

# GCP: Create a VM (Borg-derived scheduler places workload)
gcloud compute instances create my-instance \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud

# The VM running your application = data plane
# The API that created it = control plane
# SSH into the VM and serve HTTP traffic = data plane operation
# Resize the VM via API = control plane operation

AWS Control Plane Architecture

AWS describes many of its control planes as cell-based: a service's control plane is divided into independent cells, so a failure in one cell does not spread to the others. This limits the "blast radius" of any single failure.

AWS EC2 — Control Plane vs Data Plane
flowchart TD
    subgraph "EC2 Control Plane (Regional)"
        EC2_API["EC2 API\n(RunInstances, DescribeInstances)"]
        PLACEMENT["Placement Service\n(AZ selection, capacity)"]
        METADATA["Instance Metadata\n(State tracking)"]
    end

    subgraph "EC2 Data Plane (Per-AZ)"
        NITRO["Nitro Hypervisors\n(Custom hardware)"]
        NITRO_CARD["Nitro Cards\n(Network, EBS, security)"]
        VM["Customer VMs\n(Running workloads)"]
    end

    EC2_API --> PLACEMENT
    PLACEMENT --> NITRO
    NITRO --> NITRO_CARD
    NITRO_CARD --> VM

    style EC2_API fill:#BF092F,color:#fff
    style PLACEMENT fill:#BF092F,color:#fff
    style NITRO fill:#3B9797,color:#fff
    style VM fill:#3B9797,color:#fff
                            

Cellular Architecture & Blast Radius

AWS's key architectural principle: no single failure should take down an entire service. Each control plane is divided into cells (typically per-AZ or per-partition), and failures are contained within a cell:

  • Cell isolation — each cell has its own database, queue, and worker fleet
  • No shared state — cells don't communicate with each other during normal operation
  • Shuffle sharding — each customer is assigned a pseudo-random subset of workers, so two customers rarely share exactly the same set and one customer's bad workload cannot take everyone down
  • Static stability — the data plane continues operating even if the control plane is completely unavailable
Static Stability Principle: AWS designs data planes to continue operating indefinitely without control plane contact. A running EC2 instance doesn't need the EC2 API to keep running. EBS volumes continue serving I/O without the EBS control plane. This is why "can't launch new instances" is a different (less severe) outage category than "existing instances stopped" — the former is control plane, the latter is data plane.
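To make the shuffle sharding idea concrete, here is a small simulation. It is a sketch under assumptions: the worker count, shard size, customer names, and the `shard_for` helper are hypothetical, and real services use their own assignment schemes.

```python
import hashlib
import random
from itertools import combinations


def shard_for(customer_id: str, workers: int, shard_size: int) -> frozenset:
    """Deterministically pick `shard_size` of `workers` worker indices for a customer."""
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)  # same customer always gets the same shard
    return frozenset(rng.sample(range(workers), shard_size))


WORKERS, SHARD_SIZE = 8, 2
customers = [f"customer-{i}" for i in range(100)]
shards = {c: shard_for(c, WORKERS, SHARD_SIZE) for c in customers}

# A "poison" customer takes down every worker in its own shard.
poisoned = shards["customer-0"]
fully_down = [c for c, s in shards.items() if s <= poisoned]
partially_hit = [c for c, s in shards.items() if s & poisoned and not s <= poisoned]

print(f"possible shards: {len(list(combinations(range(WORKERS), SHARD_SIZE)))}")  # 28
print(f"customers fully down: {len(fully_down)}")
print(f"customers degraded but still served: {len(partially_hit)}")
```

With 8 workers and shards of 2 there are 28 distinct shards, so only customers who happen to draw the same pair as the poisoned one lose all their workers. Customers who share one worker are degraded but can retry on their other worker.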

AWS Nitro — Hardware Data Plane

The Nitro system offloads network, storage, and security processing to dedicated hardware cards. This means the data plane (VM compute, network I/O, disk I/O) runs on purpose-built silicon, completely independent of the control plane software:

  • Nitro Card for VPC — hardware-accelerated packet processing, encapsulation, security groups
  • Nitro Card for EBS — hardware-accelerated block storage I/O, encryption
  • Nitro Security Chip — hardware root of trust, firmware verification
  • Nitro Hypervisor — lightweight hypervisor with near-bare-metal performance

Azure Resource Manager (ARM)

Azure Resource Manager is Azure's centralized control plane. Every Azure resource — VMs, databases, networks, storage accounts — is created, updated, and deleted through ARM. It provides a single consistent API layer over hundreds of resource providers.

// Azure ARM template: declarative control plane definition
// ARM processes this template and orchestrates resource providers
// (ARM templates accept // comments; strict JSON parsers do not)
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.Compute/virtualMachines",
      "apiVersion": "2023-07-01",
      "name": "webServer01",
      "location": "eastus",
      "properties": {
        "hardwareProfile": {
          "vmSize": "Standard_D2s_v3"
        },
        "osProfile": {
          "computerName": "webServer01",
          "adminUsername": "azureuser"
        },
        "storageProfile": {
          "imageReference": {
            "publisher": "Canonical",
            "offer": "0001-com-ubuntu-server-jammy",
            "sku": "22_04-lts",
            "version": "latest"
          }
        },
        "networkProfile": {
          "networkInterfaces": [
            {
              "id": "[resourceId('Microsoft.Network/networkInterfaces', 'webNIC')]"
            }
          ]
        }
      }
    }
  ]
}

ARM Architecture

ARM's architecture separates concerns into layers:

  • Frontend — authenticates requests (Microsoft Entra ID, formerly Azure AD), authorizes them (RBAC), applies rate limits, and routes them to resource providers
  • Resource Providers (RPs) — each Azure service (Compute, Network, Storage) has its own RP that handles CRUD operations
  • Regional deployment engines — orchestrate multi-resource deployments with dependency resolution
  • Async operations — long-running operations return immediately with a tracking URL; the control plane works asynchronously
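The async pattern in the last bullet (submit, get a tracking handle, poll) can be sketched with a simulated API. `FakeArm` is a hypothetical stand-in, not an Azure client, and the operation ID format and status strings are assumptions chosen for illustration.

```python
import itertools


class FakeArm:
    """Hypothetical stand-in for an async control plane API (not a real Azure client)."""

    def __init__(self, polls_until_done: int = 3):
        self._ids = itertools.count(1)
        self._remaining = {}
        self._polls_until_done = polls_until_done

    def submit(self, resource: str) -> str:
        # Returns at once with a tracking handle; the work happens later.
        op_id = f"operations/{next(self._ids)}"
        self._remaining[op_id] = self._polls_until_done
        return op_id

    def status(self, op_id: str) -> str:
        # Each poll moves the simulated operation one step closer to completion.
        if self._remaining[op_id] > 0:
            self._remaining[op_id] -= 1
            return "InProgress"
        return "Succeeded"


def wait_for(api: FakeArm, op_id: str, max_polls: int = 10) -> tuple:
    """Poll until the operation finishes; return (final status, polls used)."""
    for attempt in range(1, max_polls + 1):
        state = api.status(op_id)
        if state != "InProgress":
            return state, attempt
        # A real client would sleep here, usually with backoff.
    return "TimedOut", max_polls


api = FakeArm()
op = api.submit("webServer01")
print(op)                 # operations/1
print(wait_for(api, op))  # ('Succeeded', 4)
```

The bounded poll count matters in practice: a client that waits forever on a stalled operation has turned a control plane slowdown into its own outage.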
Architecture Pattern
ARM's Dependency Graph — Parallel Orchestration

When you deploy an ARM template with 20 resources, ARM builds a dependency graph and deploys resources in parallel where possible. If a VM depends on a NIC which depends on a VNet, ARM deploys the VNet first, then the NIC, then the VM. But if you have 5 independent VNets, ARM deploys all 5 simultaneously. This is the control plane performing intelligent orchestration — it understands resource relationships and optimizes deployment order for speed while respecting dependencies.
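The ordering described above is a topological sort grouped into parallel "waves". The sketch below uses Python's standard `graphlib`; the resource names and the `deployment_waves` helper are hypothetical, and ARM's real engine is certainly more involved than this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# resource -> set of resources it depends on (hypothetical names)
depends_on = {
    "vnet-a": set(),
    "vnet-b": set(),
    "nic": {"vnet-a"},
    "vm": {"nic"},
}


def deployment_waves(graph: dict) -> list:
    """Group resources into waves; everything in one wave can deploy in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all dependencies already satisfied
        waves.append(ready)
        sorter.done(*ready)                 # finishing a wave unlocks its dependents
    return waves


print(deployment_waves(depends_on))
# [['vnet-a', 'vnet-b'], ['nic'], ['vm']]
```

Both VNets land in the first wave because nothing blocks them, while the NIC and VM wait for their prerequisites. A circular dependency is rejected up front rather than deadlocking the deployment.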

Tags: Orchestration, DAG, Parallelism

GCP's Borg-Inspired Architecture

Google Cloud's control plane design is widely described as shaped by Borg, Google's internal cluster management system for its production workloads. Architectural traits associated with that lineage:

  • Global resources — several GCP resources are global (a VPC network spans regions, for example), in contrast to AWS's region-first design; Compute Engine instances are still zonal
  • Declarative intent — GCP strongly favors declarative APIs (define the end state, let the system converge)
  • Centralized scheduling — compute placement decisions are made by schedulers with a broad view of available capacity
  • Live migration — the control plane can move a running VM to another physical host during maintenance, usually without a reboot
# GCP control plane interaction — declarative resource creation
# The control plane handles scheduling, placement, and provisioning

# Create instance (control plane operation)
gcloud compute instances create web-server \
  --zone=us-central1-a \
  --machine-type=n2-standard-4 \
  --network-interface=network=default,subnet=default \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD

# The --maintenance-policy=MIGRATE flag tells the control plane:
# "During host maintenance, live-migrate this VM to another host"
# This is a control plane decision executed on the data plane

# Check control plane health for a specific zone
gcloud compute zones describe us-central1-a --format="value(status)"
# UP = the zone is reported as available (a successful response also shows the API is reachable)

# View operations (control plane activity log)
gcloud compute operations list --filter="zone:us-central1-a" --limit=5
# Each operation is a recorded control plane action:
# NAME       TYPE      TARGET       STATUS
# op-12345   insert    web-server   DONE
# op-12346   start     web-server   DONE

Control Plane vs Data Plane Availability

Control plane and data plane availability are separate properties, and provider SLAs are written per service, so check what each one actually covers. Understanding this distinction is critical for architecture decisions:

Availability Comparison
Control Plane vs Data Plane SLAs

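Control Plane (APIs): Figures around 99.99% are often quoted, but commitments vary by service, so confirm against the provider's SLA for the specific API. The control plane creates, updates, and deletes resources. An outage means you can't manage resources, but existing ones keep running.

Data Plane (Running workloads): Depends on configuration. As illustrative tiers: a single-AZ VM around 99.9%, a multi-AZ deployment around 99.99%, and a multi-region design targeting 99.999%. Treat these as rough orders of magnitude and check the current SLA documents. An outage means your actual application is down.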

Critical insight: A control plane outage is annoying (can't deploy changes). A data plane outage is catastrophic (users can't reach your service). Design for data plane resilience first.
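The availability tiers above translate into concrete downtime budgets, and they combine in predictable ways. The sketch below is plain arithmetic that assumes failures are independent, which real systems only approximate. The function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """Every component is required: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Any one replica is enough: the system fails only if every replica fails."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1 - a)
    return 1 - all_fail


for a in (0.999, 0.9999, 0.99999):
    print(f"{a:.3%} -> {downtime_minutes_per_year(a):.1f} min/year")

print(f"two 99.9% replicas in parallel: {parallel(0.999, 0.999):.6f}")
print(f"99.99% app calling a 99.9% dependency per request: {serial(0.9999, 0.999):.6f}")
```

Three, four, and five nines work out to roughly 526, 53, and 5 minutes of downtime per year. The last line shows why request-path dependencies matter: a 99.99% application that calls a 99.9% service on every request is itself only about 99.89% available.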

Tags: SLA, Availability, Resilience
# Check control plane health across providers

# AWS: AWS Health API (open EC2 events for your account)
# Requires a Business, Enterprise On-Ramp, or Enterprise support plan
aws health describe-events \
  --filter "services=EC2,eventStatusCodes=open" \
  --region us-east-1

# Azure: VM instance view (control plane view of data plane status)
# az resource show does not return the instance view by default
az vm get-instance-view \
  --resource-group myRG \
  --name myVM \
  --query "instanceView.statuses[].displayStatus"
# "VM running" = data plane healthy
# API response itself = control plane healthy

# GCP: Compute zone status
gcloud compute zones list --filter="status=UP" --format="table(name,status)"

# Key principle: If you CAN run this command and get a response,
# the control plane is working.
# Whether the resource itself is healthy = data plane question.

Control Plane Outages — Lessons Learned

Major cloud outages illustrate the control/data plane separation clearly. When the control plane fails but data plane continues, existing workloads survive:

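AWS S3 Outage (us-east-1, February 2017)

What Happened: While debugging the S3 billing system, an operator ran a command that removed more servers than intended. The removed capacity supported two S3 subsystems: the index subsystem, which holds object metadata and location information, and the placement subsystem, which allocates storage for new objects. Because the index subsystem sits in the path of every GET, LIST, PUT, and DELETE, S3 in us-east-1 could not serve those requests until both subsystems were restarted, and AWS services that depend on S3 (including new EC2 instance launches) were impaired as well. The stored objects themselves were not lost, and content already cached downstream, for example at a CDN edge, could still be served from cache. The lesson for the control/data plane model: a metadata service that sits in the request path behaves like part of the data plane, so its failure becomes a data plane outage.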

Azure AD Control Plane Outage

When Azure Active Directory (now Microsoft Entra ID), the authentication control plane, has issues, workloads holding valid cached tokens keep running because the data plane operates on previously issued credentials. New authentication requests fail (control plane), but existing sessions persist until their tokens expire (data plane). This is why token lifetimes and caching strategies matter for resilience.

Lessons for Architecture

  • Cache control plane state locally — DNS resolvers cache records, load balancers cache target endpoints, and applications cache credentials
  • Design for control plane unavailability — your system should continue serving traffic even if you can't modify it
  • Avoid control plane dependencies in request path — don't call IAM on every request; validate cached tokens locally
  • Use multiple regions — regional control plane failures don't affect other regions' data planes
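The first three lessons share one mechanism: keep a local copy and keep serving it when a refresh fails. Below is a minimal sketch, assuming a single cached value and an injectable clock so it can be exercised without waiting. `StaleOnErrorCache`, `fetch_config`, and the endpoint string are hypothetical.

```python
import time
from typing import Callable, Optional


class StaleOnErrorCache:
    """Cache one control plane lookup; serve the last good value if a refresh fails."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> str:
        now = self._clock()
        fresh = self._fetched_at is not None and now - self._fetched_at < self._ttl
        if fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except Exception:
            if self._fetched_at is None:
                raise  # nothing cached yet, so there is nothing safe to serve
            # Control plane unreachable: keep serving the stale value.
        return self._value


# Usage with a fake clock and a fetch function that starts failing.
now = [0.0]
outage = [False]


def fetch_config() -> str:
    if outage[0]:
        raise ConnectionError("control plane unavailable")
    return "endpoint=10.0.1.5"


cache = StaleOnErrorCache(fetch_config, ttl_seconds=30, clock=lambda: now[0])
print(cache.get())  # endpoint=10.0.1.5
outage[0] = True
now[0] = 120.0      # TTL long expired
print(cache.get())  # endpoint=10.0.1.5 (stale, but traffic keeps flowing)
```

Serving stale data is a deliberate trade-off. It suits endpoints and configuration well, while for credentials you would normally cap how long an expired value may be reused.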

VPC as SDN — Cloud Networking

Cloud networking (VPC/VNet) is SDN at massive scale. The control plane (route tables, security groups, NACLs) configures the data plane (virtual switches on hypervisors that forward packets):

Cloud Networking — Control Plane vs Data Plane
flowchart TD
    subgraph "Networking Control Plane"
        RT["Route Tables\n(where to send traffic)"]
        SG["Security Groups\n(what traffic to allow)"]
        LB_CTRL["Load Balancer Config\n(targets, health checks)"]
        DNS_CTRL["DNS Records\n(name resolution)"]
    end

    subgraph "Networking Data Plane"
        VSWITCH["Virtual Switches\n(packet forwarding)"]
        FW["Stateful Firewalls\n(packet filtering)"]
        LB_DATA["Load Balancer Nodes\n(connection routing)"]
        DNS_DATA["DNS Resolvers\n(query responses)"]
    end

    RT -->|"programs"| VSWITCH
    SG -->|"programs"| FW
    LB_CTRL -->|"configures"| LB_DATA
    DNS_CTRL -->|"configures"| DNS_DATA

    style RT fill:#BF092F,color:#fff
    style SG fill:#BF092F,color:#fff
    style VSWITCH fill:#3B9797,color:#fff
    style FW fill:#3B9797,color:#fff
                            
// Azure VNet: declarative networking control plane
// Fragment of an ARM template's resources array (apiVersion omitted for brevity)
// Azure's network fabric (data plane) executes it
{
  "type": "Microsoft.Network/virtualNetworks",
  "name": "production-vnet",
  "location": "eastus",
  "properties": {
    "addressSpace": {
      "addressPrefixes": ["10.0.0.0/16"]
    },
    "subnets": [
      {
        "name": "web-tier",
        "properties": {
          "addressPrefix": "10.0.1.0/24",
          "networkSecurityGroup": {
            "id": "[resourceId('Microsoft.Network/networkSecurityGroups', 'web-nsg')]"
          },
          "routeTable": {
            "id": "[resourceId('Microsoft.Network/routeTables', 'web-routes')]"
          }
        }
      },
      {
        "name": "app-tier",
        "properties": {
          "addressPrefix": "10.0.2.0/24"
        }
      }
    ]
  }
}
Security Groups as Control Plane: When you add a rule to a security group, that's a control plane operation. The rule propagates to all hypervisors hosting VMs in that security group — that propagation is the control plane programming the data plane. The data plane (hypervisor-level packet filtering) then enforces the rule on every packet without consulting the control plane again. This is why security group changes are "eventually consistent" — there's propagation delay between control plane write and data plane enforcement.
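The propagation window described above can be modeled in a few lines. This is a toy model under assumptions: the `Host` and `SecurityGroupControlPlane` classes are hypothetical and reduce a rule to an allowed port, which is far simpler than real security group semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Data plane: filters packets using only its local copy of the rules."""
    allowed_ports: set = field(default_factory=set)
    rules_version: int = 0

    def allows(self, port: int) -> bool:
        return port in self.allowed_ports  # no call back to the control plane


@dataclass
class SecurityGroupControlPlane:
    """Control plane: owns the authoritative rule set and pushes it to hosts."""
    allowed_ports: set = field(default_factory=set)
    version: int = 0

    def authorize(self, port: int) -> None:
        self.allowed_ports.add(port)
        self.version += 1

    def push_to(self, host: Host) -> None:
        host.allowed_ports = set(self.allowed_ports)
        host.rules_version = self.version


sg = SecurityGroupControlPlane()
host_a, host_b = Host(), Host()

sg.authorize(443)
sg.push_to(host_a)         # host_b has not received the update yet

print(host_a.allows(443))  # True
print(host_b.allows(443))  # False: still inside the propagation window
sg.push_to(host_b)
print(host_b.allows(443))  # True: converged
```

Because each host decides from its local copy, packet filtering keeps working if the control plane disappears. The cost is the brief window in which hosts disagree.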

Architectural Implications

Understanding that cloud providers are control/data plane systems has profound implications for how you design applications:

Design for Data Plane Independence

  • Don't put control plane calls in the request path — fetching secrets from Key Vault on every request ties your app's availability and latency to that service
  • Cache aggressively — DNS, credentials, and configuration should be cached locally with TTLs
  • Use data plane endpoints — data operations such as S3 GET/PUT on objects are generally designed for higher availability than management operations such as CreateBucket
  • Pre-provision capacity — don't rely on auto-scaling during a control plane outage; have baseline capacity pre-deployed
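Pre-provisioning comes down to simple arithmetic: size each zone so the surviving zones can carry peak load with no scaling action. The sketch below assumes load spreads evenly across zones; the traffic figures and the `instances_per_zone` helper are hypothetical.

```python
import math


def instances_per_zone(peak_rps: float, rps_per_instance: float,
                       zones: int, zones_lost: int = 1) -> int:
    """Instances to run in each zone so the survivors carry peak load without scaling."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    total_needed = math.ceil(peak_rps / rps_per_instance)
    return math.ceil(total_needed / surviving)


peak, per_instance, zones = 9000, 500, 3
n = instances_per_zone(peak, per_instance, zones)
print(f"{n} instances per zone, {n * zones} total")  # 9 instances per zone, 27 total
print(f"overhead vs. no headroom: {n * zones / math.ceil(peak / per_instance):.2f}x")  # 1.50x
```

With three zones, tolerating the loss of one costs 50% extra capacity. That is the price of not needing the control plane, in the form of auto-scaling, during the incident.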

Understand Blast Radius

  • Regional control planes — most AWS and Azure service control planes are deployed per region, so a control plane failure in us-east-1 should not by itself affect eu-west-1
  • Global control planes — some services (IAM, DNS, CDN) have global control planes; their failure can affect management operations in every region
  • Zonal data planes — within a region, data plane failures are typically contained to a single AZ
Key Takeaway
Cloud Providers Are the Largest Control/Data Plane Systems Ever Built

Every cloud provider is, at its core, a massive distributed control system. The APIs are the control plane: they accept declarations of desired state and orchestrate physical infrastructure to realize them. The hardware (servers, switches, disks) is the data plane, doing the actual work of running workloads. This separation helps explain cloud pricing (control plane operations are often free or cheap, while data plane usage is where most of the cost lies), outage patterns (control plane failures tend to be more frequent but survivable; data plane failures are rarer but far more damaging), and design best practices (minimize control plane dependencies in your application's critical path).

Tags: Cloud Architecture, Distributed Systems, Resilience