Why Multi-Cloud
Multi-cloud is one of the most debated topics in modern infrastructure. Analyst reports routinely claim that around 90% of enterprises have a multi-cloud strategy, yet on closer inspection most have accidental multi-cloud: different teams adopted different providers independently, with no unified architecture, no shared tooling, and no intentional design.
True multi-cloud architecture is an intentional design decision to distribute workloads across two or more cloud providers with a coherent strategy for networking, identity, data, and operations.
Definitions That Matter
| Term | Definition | Example |
|---|---|---|
| Multi-Cloud | Using 2+ public cloud providers intentionally | ML on GCP, core app on AWS |
| Hybrid Cloud | Combining public cloud with on-premises/private cloud | AWS + on-prem data center |
| Multi-Region | Same provider, different geographic regions | AWS us-east-1 + eu-west-1 |
| Poly-Cloud | Multi-cloud without a unifying strategy | Each team picks their own cloud |
When Multi-Cloud Is Worth It
Multi-cloud is justified when at least one of these conditions is true:
- Regulatory compliance — Data sovereignty requires specific providers in specific regions
- Best-of-breed necessity — Critical workloads genuinely need unique capabilities (GCP BigQuery for analytics, AWS for broadest service catalog)
- Acquisition integration — Merging companies on different providers with no compelling reason to migrate
- Disaster recovery — True provider-level resilience (extremely rare requirement)
- Negotiation leverage — Credible alternative keeps pricing competitive (works at very large scale)
flowchart TB
subgraph Single["Single Cloud (Multi-Region)"]
direction TB
S1[Region A] --- S2[Region B]
S1 --- S3[Region C]
end
subgraph Multi["Multi-Cloud"]
direction TB
M1[AWS<br/>Primary Compute] --- M2[GCP<br/>ML & Analytics]
M1 --- M3[Azure<br/>Enterprise Apps]
M2 --- M3
end
subgraph Hybrid["Hybrid Cloud"]
direction TB
H1[Public Cloud<br/>AWS/Azure] --- H2[On-Premises<br/>Data Center]
H2 --- H3[Edge<br/>IoT Devices]
end
Multi-Cloud Strategy Patterns
Not all multi-cloud is the same. The pattern you choose determines your architecture, tooling requirements, and operational burden. Understanding these patterns is the first step toward intentional multi-cloud design.
Pattern 1: Best-of-Breed
Use each cloud provider for what it does best. This is the most common intentional multi-cloud pattern and often the most pragmatic.
# Example: Best-of-breed workload mapping
workloads:
  compute_and_networking:
    provider: aws
    reason: "Broadest service catalog, mature VPC"
    services: [EKS, Lambda, API Gateway, CloudFront]
  machine_learning:
    provider: gcp
    reason: "TPUs, Vertex AI, BigQuery ML integration"
    services: [Vertex AI, BigQuery, Cloud Storage]
  enterprise_apps:
    provider: azure
    reason: "Active Directory, Office 365, Power Platform"
    services: [Azure AD, Logic Apps, Power Automate]
  data_analytics:
    provider: gcp
    reason: "BigQuery performance, Looker integration"
    services: [BigQuery, Dataflow, Looker]
Pattern 2: Active-Passive DR
Run production on one cloud, maintain a warm standby on another. This provides genuine provider-level resilience but at significant cost and operational complexity.
Pattern 3: Workload Distribution
Different business units or application tiers run on different clouds, with cross-cloud integration at defined boundaries.
Pattern 4: Cloud-Agnostic
Design workloads to run identically on any cloud provider. This demands heavy abstraction (Kubernetes, Terraform, cloud-agnostic databases) and typically sacrifices cloud-native optimization.
| Pattern | Complexity | Cloud Optimization | Portability | Best For |
|---|---|---|---|---|
| Best-of-Breed | Medium | High | Low | Leveraging unique capabilities |
| Active-Passive DR | High | High (primary) | Medium | Provider-level resilience |
| Workload Distribution | Medium | Medium-High | Low | Org-level autonomy |
| Cloud-Agnostic | Very High | Low | High | Exit strategy, vendor leverage |
flowchart TD
A[Need Multi-Cloud?] -->|Regulatory| B[Workload Distribution]
A -->|Best capabilities| C[Best-of-Breed]
A -->|Provider failure DR| D[Active-Passive]
A -->|Full portability| E[Cloud-Agnostic]
A -->|No strong reason| F[Stay Single Cloud]
B --> G[Define integration boundaries]
C --> H[Map workloads to providers]
D --> I[Accept 2x infrastructure cost]
E --> J[Accept lowest common denominator]
F --> K[Multi-region for resilience]
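The decision flow above can also be written as a lookup table, which is handy when the choice has to be recorded in an architecture decision record or checked in a review script. This is a minimal sketch: the `Driver` enum and `recommend` function are hypothetical names, and the mapping simply restates the flowchart.

```python
from enum import Enum


class Driver(Enum):
    REGULATORY = "regulatory"
    BEST_CAPABILITIES = "best capabilities"
    PROVIDER_FAILURE_DR = "provider failure DR"
    FULL_PORTABILITY = "full portability"
    NONE = "no strong reason"


# (pattern, consequence to accept) for each driver, copied from the flowchart.
RECOMMENDATION = {
    Driver.REGULATORY: ("Workload Distribution", "Define integration boundaries"),
    Driver.BEST_CAPABILITIES: ("Best-of-Breed", "Map workloads to providers"),
    Driver.PROVIDER_FAILURE_DR: ("Active-Passive", "Accept 2x infrastructure cost"),
    Driver.FULL_PORTABILITY: ("Cloud-Agnostic", "Accept lowest common denominator"),
    Driver.NONE: ("Stay Single Cloud", "Multi-region for resilience"),
}


def recommend(driver: Driver) -> str:
    """Return the pattern and its follow-up step for a given driver."""
    pattern, next_step = RECOMMENDATION[driver]
    return f"{pattern}: {next_step}"


if __name__ == "__main__":
    for driver in Driver:
        print(f"{driver.value:>20} -> {recommend(driver)}")
```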
Abstraction Layers
The key to managing multi-cloud complexity is abstraction — creating layers that hide provider-specific details behind consistent interfaces. The right abstraction layer depends on what you are abstracting.
| Layer | Tool | What It Abstracts | Trade-off |
|---|---|---|---|
| Infrastructure | Terraform | Provisioning APIs across clouds | Provider-specific resources still differ |
| Compute | Kubernetes | Container orchestration | Managed K8s differs per cloud |
| Infrastructure Control Plane | Crossplane | Cloud resources as K8s objects | Additional control plane complexity |
| Networking | Service Mesh (Istio) | Service-to-service communication | Operational overhead of mesh |
| Secrets | HashiCorp Vault | Secrets management across clouds | Additional infrastructure to manage |
| Policy | OPA/Gatekeeper | Authorization and compliance | Policy language learning curve |
| Observability | Datadog/Grafana | Metrics, logs, traces across clouds | Vendor cost or self-hosted complexity |
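At the application level, the same idea shows up as coding against an interface rather than a provider SDK. The sketch below is illustrative only: `ObjectStore`, `InMemoryStore`, and `archive_report` are hypothetical names, and the in-memory class stands in for real adapters that would wrap S3, Azure Blob Storage, or Cloud Storage clients.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """The consistent interface the application sees."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a provider adapter; a real one would call the cloud SDK."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: ObjectStore, report_id: str, body: bytes) -> str:
    """Application code depends only on the ObjectStore interface."""
    key = f"reports/{report_id}"
    store.put(key, body)
    return key


if __name__ == "__main__":
    for provider in ("aws", "azure", "gcp"):
        store = InMemoryStore(provider)
        key = archive_report(store, "2024-q1", b"quarterly numbers")
        print(provider, key, store.get(key))
```

The trade-off column in the table applies here too: the interface can only expose what every provider supports, so provider-specific features need an escape hatch.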
Crossplane for Cloud-Agnostic Infrastructure
Crossplane extends Kubernetes with cloud resource management. You define cloud resources as Kubernetes custom resources, and Crossplane controllers reconcile them against cloud APIs.
# Crossplane Composite Resource Definition
# Abstracts a "database" concept across clouds
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
name: xdatabases.platform.example.com
spec:
group: platform.example.com
names:
kind: XDatabase
plural: xdatabases
versions:
- name: v1alpha1
served: true
referenceable: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
parameters:
type: object
properties:
engine:
type: string
enum: [postgres, mysql]
size:
type: string
enum: [small, medium, large]
region:
type: string
required: [engine, size, region]
# Crossplane Composition - AWS Implementation
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
name: xdatabases.aws.platform.example.com
labels:
provider: aws
spec:
compositeTypeRef:
apiVersion: platform.example.com/v1alpha1
kind: XDatabase
resources:
- name: rds-instance
base:
apiVersion: rds.aws.crossplane.io/v1alpha1
kind: DBInstance
spec:
forProvider:
dbInstanceClass: db.t3.medium
engine: postgres
engineVersion: "15"
masterUsername: dbadmin # "admin" is a reserved name for RDS PostgreSQL
allocatedStorage: 20
publiclyAccessible: false
providerConfigRef:
name: aws-provider
# Crossplane Composition - Azure Implementation
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
name: xdatabases.azure.platform.example.com
labels:
provider: azure
spec:
compositeTypeRef:
apiVersion: platform.example.com/v1alpha1
kind: XDatabase
resources:
- name: azure-db
base:
apiVersion: dbforpostgresql.azure.crossplane.io/v1alpha1
kind: FlexibleServer
spec:
forProvider:
version: "15"
skuName: Standard_B1ms
storageMb: 32768
administratorLogin: dbadmin # Azure rejects reserved logins such as "admin"
providerConfigRef:
name: azure-provider
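Both Compositions above hard-code the instance size even though the XRD exposes a `size` parameter. In Crossplane that translation is normally done with patches inside the Composition; the Python sketch below only illustrates the mapping logic and is not Crossplane syntax. The size table is an assumption: only `db.t3.medium` and `Standard_B1ms` appear in the original manifests, and the other entries are illustrative picks.

```python
# Hypothetical size table: only db.t3.medium and Standard_B1ms come from the
# Compositions above; the remaining values are illustrative.
SIZE_MAP = {
    "aws": {"small": "db.t3.small", "medium": "db.t3.medium", "large": "db.t3.large"},
    "azure": {"small": "Standard_B1ms", "medium": "Standard_B2s", "large": "Standard_D2s_v3"},
}

# Field each provider's managed resource uses for the instance size.
SIZE_FIELD = {"aws": "dbInstanceClass", "azure": "skuName"}


def render_patch(provider: str, size: str) -> dict:
    """Return the forProvider fragment a size patch would produce."""
    try:
        sku = SIZE_MAP[provider][size]
    except KeyError:
        raise ValueError(f"unsupported provider/size: {provider}/{size}") from None
    return {"spec": {"forProvider": {SIZE_FIELD[provider]: sku}}}


if __name__ == "__main__":
    print(render_patch("aws", "medium"))
    print(render_patch("azure", "small"))
```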
flowchart TB
subgraph App["Application Layer"]
A1[Microservices]
A2[APIs]
A3[ML Pipelines]
end
subgraph Abstraction["Abstraction Layer"]
B1[Kubernetes - Compute]
B2[Istio - Networking]
B3[Vault - Secrets]
B4[OPA - Policy]
B5[Crossplane - Resources]
end
subgraph Infra["Infrastructure Layer"]
C1[Terraform - Provisioning]
end
subgraph Clouds["Cloud Providers"]
D1[AWS]
D2[Azure]
D3[GCP]
end
App --> Abstraction
Abstraction --> Infra
Infra --> Clouds
Multi-Cloud Networking
Networking is the hardest problem in multi-cloud architecture. Each cloud has its own networking model, IP address scheme, firewall rules, and connectivity options. Connecting them securely and performantly requires careful planning.
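One planning step that is cheap to automate is checking that the address ranges assigned to each cloud do not overlap, since overlapping CIDRs cannot be routed across a VPN or interconnect without NAT. A minimal sketch using Python's standard `ipaddress` module follows; the function name `find_overlaps` is hypothetical, and the example ranges are the ones used elsewhere in this article.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    networks = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    plan = {"aws": "10.0.0.0/16", "azure": "10.1.0.0/16", "gcp": "10.2.0.0/16"}
    print(find_overlaps(plan))  # []

    bad_plan = {"aws": "10.0.0.0/16", "azure": "10.0.128.0/17"}
    print(find_overlaps(bad_plan))  # [('aws', 'azure')]
```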
Cross-Cloud Connectivity Options
| Method | Bandwidth | Latency | Cost | Setup Complexity |
|---|---|---|---|---|
| Site-to-Site VPN | 1-2 Gbps | Variable (internet) | Low | Medium |
| Cloud Interconnect | 10-100 Gbps | Low (dedicated) | High | High |
| SD-WAN Overlay | Variable | Optimized | Medium | Medium |
| Service Mesh (Istio) | Application-level | Depends on transport | Low | High |
| API Gateway Federation | API-level | HTTP overhead | Low | Low |
Terraform: AWS-to-Azure VPN Tunnel
# AWS VPN Gateway
resource "aws_vpn_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "aws-to-azure-vpn-gw"
Environment = var.environment
ManagedBy = "terraform"
}
}
resource "aws_customer_gateway" "azure" {
bgp_asn = 65515 # Azure default ASN
ip_address = azurerm_public_ip.vpn_gw.ip_address
type = "ipsec.1"
tags = {
Name = "azure-customer-gw"
}
}
resource "aws_vpn_connection" "to_azure" {
vpn_gateway_id = aws_vpn_gateway.main.id
customer_gateway_id = aws_customer_gateway.azure.id
type = "ipsec.1"
static_routes_only = false
tunnel1_preshared_key = var.vpn_preshared_key
tunnel1_ike_versions = ["ikev2"]
tags = {
Name = "aws-to-azure-vpn"
}
}
# Azure VPN Gateway
resource "azurerm_virtual_network_gateway" "main" {
name = "azure-to-aws-vpn-gw"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
type = "Vpn"
vpn_type = "RouteBased"
sku = "VpnGw2"
active_active = false
enable_bgp = true
ip_configuration {
name = "vpn-gw-config"
public_ip_address_id = azurerm_public_ip.vpn_gw.id
private_ip_address_allocation = "Dynamic"
subnet_id = azurerm_subnet.gateway.id
}
bgp_settings {
asn = 65515
}
}
resource "azurerm_local_network_gateway" "aws" {
name = "aws-local-gw"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
gateway_address = aws_vpn_connection.to_azure.tunnel1_address
address_space = [var.aws_vpc_cidr]
bgp_settings {
asn = 64512 # AWS default ASN
bgp_peering_address = aws_vpn_connection.to_azure.tunnel1_vgw_inside_address # AWS-side inside tunnel IP
}
}
resource "azurerm_virtual_network_gateway_connection" "to_aws" {
name = "azure-to-aws-connection"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
type = "IPsec"
virtual_network_gateway_id = azurerm_virtual_network_gateway.main.id
local_network_gateway_id = azurerm_local_network_gateway.aws.id
shared_key = var.vpn_preshared_key
enable_bgp = true
}
DNS & Global Load Balancing
# Multi-cloud DNS with Route 53 as primary
# Health checks monitor endpoints on both clouds
resource "aws_route53_health_check" "aws_primary" {
fqdn = "aws-app.internal.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
tags = {
Name = "aws-primary-health"
}
}
resource "aws_route53_health_check" "azure_secondary" {
fqdn = "azure-app.internal.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
tags = {
Name = "azure-secondary-health"
}
}
resource "aws_route53_record" "app_primary" {
zone_id = var.hosted_zone_id
name = "app.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary-aws"
health_check_id = aws_route53_health_check.aws_primary.id
alias {
name = aws_lb.main.dns_name
zone_id = aws_lb.main.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "app_secondary" {
zone_id = var.hosted_zone_id
name = "app.example.com"
type = "A"
ttl = 60
failover_routing_policy {
type = "SECONDARY"
}
set_identifier = "secondary-azure"
health_check_id = aws_route53_health_check.azure_secondary.id
# Route 53 alias records can only target AWS resources, so the Azure
# endpoint is published as a plain A record instead of an alias.
# var.azure_public_ip is an assumed variable holding the Azure frontend's static IP.
records = [var.azure_public_ip]
}
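The failover behaviour configured above can be modelled in a few lines, which helps when reasoning about how long an outage lasts before traffic moves. This is a simplified sketch with hypothetical names (`Endpoint`, `resolve`): it counts consecutive failed checks against `failure_threshold`, marks an endpoint healthy again after a single passing check, and answers with the primary when both endpoints are unhealthy. The real service's recovery rules are more involved.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    failure_threshold: int = 3  # mirrors failure_threshold in the health checks
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> None:
        """Update the failure streak with the result of one health check."""
        if check_passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def resolve(primary: Endpoint, secondary: Endpoint) -> str:
    """Answer with the primary unless only the secondary is healthy."""
    if primary.healthy or not secondary.healthy:
        return primary.name
    return secondary.name


if __name__ == "__main__":
    aws = Endpoint("primary-aws")
    azure = Endpoint("secondary-azure")
    for _ in range(3):
        aws.record(check_passed=False)
    print(resolve(aws, azure))  # secondary-azure
```

With a 10-second request interval and a threshold of 3, this model needs roughly 30 seconds of failed checks before failover, plus whatever time resolvers take to expire cached answers.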
flowchart TB
Users[Global Users] --> GLB[Global Load Balancer<br/>Cloudflare/Route53]
GLB --> AWS_LB[AWS ALB<br/>us-east-1]
GLB --> AZ_LB[Azure Front Door<br/>East US]
subgraph AWS["AWS VPC (10.0.0.0/16)"]
AWS_LB --> AWS_APP[EKS Cluster]
AWS_APP --> AWS_DB[(RDS PostgreSQL)]
end
subgraph Azure["Azure VNet (10.1.0.0/16)"]
AZ_LB --> AZ_APP[AKS Cluster]
AZ_APP --> AZ_DB[(Azure Database)]
end
AWS_APP <-->|VPN Tunnel<br/>IPsec/BGP| AZ_APP
AWS_DB <-->|Cross-Cloud<br/>Replication| AZ_DB
Multi-Cloud Identity & Security
In a multi-cloud environment, identity is the new perimeter. Each cloud has its own IAM system (AWS IAM, Microsoft Entra ID, formerly Azure Active Directory, and GCP IAM), and federating identities across them is essential for both human operators and machine-to-machine communication.
Federated Identity Architecture
# AWS: Trust Azure AD as OIDC identity provider
resource "aws_iam_openid_connect_provider" "azure_ad" {
url = "https://login.microsoftonline.com/${var.azure_tenant_id}/v2.0"
client_id_list = [var.azure_app_client_id]
thumbprint_list = [var.azure_ad_thumbprint]
}
# IAM Role that Azure workloads can assume via OIDC
resource "aws_iam_role" "azure_cross_cloud" {
name = "azure-cross-cloud-access"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.azure_ad.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
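StringEquals = {
# Condition keys use the issuer without the "https://" scheme
"${replace(aws_iam_openid_connect_provider.azure_ad.url, "https://", "")}:aud" = var.azure_app_client_id
"${replace(aws_iam_openid_connect_provider.azure_ad.url, "https://", "")}:sub" = var.azure_managed_identity_object_id
}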
}
}
]
})
}
resource "aws_iam_role_policy_attachment" "cross_cloud_s3" {
role = aws_iam_role.azure_cross_cloud.name
policy_arn = aws_iam_policy.cross_cloud_s3_access.arn
}
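Conceptually, the trust policy above tells AWS to accept a token only if its issuer is the registered provider and its `aud` and `sub` claims match the configured values. The sketch below shows that comparison on an already-decoded claim set. It deliberately skips signature and expiry validation, which the real service performs, and the function name and sample values are hypothetical.

```python
def may_assume_role(claims: dict, issuer: str, client_id: str, subject: str) -> bool:
    """Mimic the StringEquals conditions of the role's trust policy.

    Assumes the token signature and expiry have already been verified.
    """
    return (
        claims.get("iss") == issuer
        and claims.get("aud") == client_id
        and claims.get("sub") == subject
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    issuer = "https://login.microsoftonline.com/TENANT_ID/v2.0"
    token_claims = {"iss": issuer, "aud": "APP_CLIENT_ID", "sub": "MANAGED_IDENTITY_OBJECT_ID"}

    print(may_assume_role(token_claims, issuer, "APP_CLIENT_ID", "MANAGED_IDENTITY_OBJECT_ID"))
    print(may_assume_role(token_claims, issuer, "APP_CLIENT_ID", "SOME_OTHER_IDENTITY"))
```

Pinning `sub` as well as `aud` matters: without it, any identity in the tenant that can obtain a token for the application could assume the role.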
HashiCorp Vault for Centralized Secrets
# Vault configuration for multi-cloud secrets
resource "vault_mount" "aws_secrets" {
path = "aws"
type = "aws"
description = "AWS dynamic credentials"
}
resource "vault_aws_secret_backend_role" "deploy" {
backend = vault_mount.aws_secrets.path
name = "deploy-role"
credential_type = "iam_user"
policy_document = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:*", "ec2:*", "eks:*"]
Resource = "*"
}
]
})
}
# vault_azure_secret_backend mounts and configures the Azure secrets engine itself,
# so a separate vault_mount at the same path would conflict with it.
resource "vault_azure_secret_backend" "main" {
path = "azure"
description = "Azure dynamic credentials"
subscription_id = var.azure_subscription_id
tenant_id = var.azure_tenant_id
client_id = var.vault_azure_client_id
client_secret = var.vault_azure_client_secret
}
Unified Policy with OPA
# OPA Rego policy: enforce tagging across all clouds
# File: policies/multi-cloud-tagging.rego
package multicloud.tagging
import future.keywords.in
# Required tags for all cloud resources
required_tags := {"environment", "team", "cost-center", "managed-by"}
# Check AWS resources
deny[msg] {
input.provider == "aws"
resource := input.planned_values.root_module.resources[_]
tags := object.get(resource.values, "tags", {})
missing := required_tags - {key | tags[key]}
count(missing) > 0
msg := sprintf("AWS resource %s missing tags: %v", [resource.address, missing])
}
# Check Azure resources
deny[msg] {
input.provider == "azure"
resource := input.planned_values.root_module.resources[_]
tags := object.get(resource.values, "tags", {})
missing := required_tags - {key | tags[key]}
count(missing) > 0
msg := sprintf("Azure resource %s missing tags: %v", [resource.address, missing])
}
# Check GCP resources
deny[msg] {
input.provider == "gcp"
resource := input.planned_values.root_module.resources[_]
labels := object.get(resource.values, "labels", {})
missing := required_tags - {key | labels[key]}
count(missing) > 0
msg := sprintf("GCP resource %s missing labels: %v", [resource.address, missing])
}
| Approach | Scope | Identity Type | Complexity | Best For |
|---|---|---|---|---|
| OIDC Federation | Cross-cloud workloads | Machine identity | Medium | Service-to-service auth |
| SAML via IdP | Human access | User identity | Low | Console/portal SSO |
| Vault Dynamic Creds | All credentials | Both | High | Short-lived, audited access |
| SPIFFE/SPIRE | Workload identity | Machine identity | High | Zero-trust service mesh |
Multi-Cloud Data Strategies
Data gravity is the concept that applications tend to move toward where their data resides. In multi-cloud, data gravity is one of the strongest forces determining your architecture — moving compute is easy, moving petabytes of data is not.
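A quick back-of-the-envelope calculation makes the point concrete. The sketch below estimates how long a bulk transfer takes over a dedicated link; the function name and the `efficiency` parameter (the fraction of nominal bandwidth actually achieved) are assumptions for illustration.

```python
def transfer_days(size_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link of link_gbps gigabits per second."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    PETABYTE = 10**15
    print(f"{transfer_days(PETABYTE, 10):.1f} days")  # 9.3 days
    print(f"{transfer_days(PETABYTE, 1):.1f} days")   # 92.6 days
```

Even at a sustained 10 Gbps with no protocol overhead, one petabyte takes more than nine days to move, before any egress charges are counted.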
Cross-Cloud Database Replication
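# CockroachDB multi-cloud topology (illustrative sketch)
# CockroachDB can span clouds, but the topology block below only conveys the
# intended node placement; it is not a field of the CrdbCluster operator schema.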
apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
name: multi-cloud-crdb
spec:
dataStore:
pvc:
spec:
storageClassName: premium-rwo
resources:
requests:
storage: 100Gi
nodes: 9
topology:
- cloud: aws
region: us-east-1
zones: [us-east-1a, us-east-1b, us-east-1c]
nodes: 3
- cloud: azure
region: eastus
zones: [1, 2, 3]
nodes: 3
- cloud: gcp
region: us-east4
zones: [us-east4-a, us-east4-b, us-east4-c]
nodes: 3
# Kafka/Confluent Cloud for cross-cloud event streaming
resource "confluent_kafka_cluster" "multi_cloud" {
display_name = "multi-cloud-events"
availability = "MULTI_ZONE"
cloud = "AWS"
region = "us-east-1"
dedicated {
cku = 2
}
}
# Cluster linking for cross-cloud replication
resource "confluent_cluster_link" "aws_to_azure" {
link_name = "aws-to-azure-mirror"
source_kafka_cluster {
id = confluent_kafka_cluster.multi_cloud.id
rest_endpoint = confluent_kafka_cluster.multi_cloud.rest_endpoint
}
destination_kafka_cluster {
id = confluent_kafka_cluster.azure_cluster.id
rest_endpoint = confluent_kafka_cluster.azure_cluster.rest_endpoint
}
}
| Approach | Consistency | Latency | Cost | Use Case |
|---|---|---|---|---|
| CockroachDB | Strong (serializable) | Cross-region penalty | High | Multi-cloud OLTP |
| Kafka Cluster Linking | Eventual | Seconds | Medium | Event streaming |
| Object Storage Sync | Eventual | Minutes | Transfer + storage | Data lake mirroring |
| Database CDC | Eventual | Seconds | Low | Read replicas |
| API-level Sync | Application-defined | Variable | Low | Selective data sharing |
Multi-Cloud with Terraform
Terraform is the most widely adopted tool for multi-cloud infrastructure because it supports 3,000+ providers through a single workflow. However, using multiple providers in one project requires careful state and dependency management.
Multi-Provider Project Structure
# Recommended directory structure for multi-cloud Terraform
multi-cloud-infra/
├── modules/
│ ├── networking/
│ │ ├── aws/ # AWS-specific networking
│ │ ├── azure/ # Azure-specific networking
│ │ └── interface.tf # Shared input/output contract
│ ├── compute/
│ │ ├── aws/
│ │ ├── azure/
│ │ └── interface.tf
│ └── dns/
│ └── main.tf # Cloud-agnostic DNS module
├── environments/
│ ├── dev/
│ │ ├── main.tf
│ │ ├── providers.tf
│ │ └── terraform.tfvars
│ ├── staging/
│ └── production/
├── shared/
│ ├── vpn-connections/ # Cross-cloud connectivity
│ └── dns-zones/ # Global DNS management
└── terragrunt.hcl # DRY configuration
# providers.tf - Multi-cloud provider configuration
terraform {
required_version = ">= 1.6"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.80"
}
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "my-org-terraform-state"
key = "multi-cloud/production/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = var.environment
ManagedBy = "terraform"
Project = "multi-cloud-platform"
}
}
}
provider "azurerm" {
features {}
subscription_id = var.azure_subscription_id
}
provider "google" {
project = var.gcp_project_id
region = var.gcp_region
}
# main.tf - Multi-cloud deployment orchestration
locals {
aws_vpc_cidr = "10.0.0.0/16"
azure_vnet_cidr = "10.1.0.0/16"
gcp_vpc_cidr = "10.2.0.0/16"
}
# AWS Infrastructure
module "aws_networking" {
source = "../../modules/networking/aws"
vpc_cidr = local.aws_vpc_cidr
environment = var.environment
}
module "aws_compute" {
source = "../../modules/compute/aws"
vpc_id = module.aws_networking.vpc_id
subnet_ids = module.aws_networking.private_subnet_ids
}
# Azure Infrastructure
module "azure_networking" {
source = "../../modules/networking/azure"
vnet_cidr = local.azure_vnet_cidr
environment = var.environment
}
module "azure_compute" {
source = "../../modules/compute/azure"
vnet_id = module.azure_networking.vnet_id
subnet_id = module.azure_networking.private_subnet_id
}
# Cross-Cloud Connectivity
module "vpn_aws_to_azure" {
source = "../../shared/vpn-connections"
aws_vpc_id = module.aws_networking.vpc_id
aws_vpc_cidr = local.aws_vpc_cidr
azure_vnet_id = module.azure_networking.vnet_id
azure_vnet_cidr = local.azure_vnet_cidr
azure_gateway_subnet_id = module.azure_networking.gateway_subnet_id
vpn_preshared_key = var.vpn_preshared_key
}
# Outputs for cross-referencing
output "aws_endpoint" {
value = module.aws_compute.load_balancer_dns
}
output "azure_endpoint" {
value = module.azure_compute.load_balancer_ip
}
flowchart LR
subgraph Plan["terraform plan"]
P1[AWS Provider] --> P2[Plan AWS Resources]
P3[Azure Provider] --> P4[Plan Azure Resources]
P5[GCP Provider] --> P6[Plan GCP Resources]
end
subgraph Apply["terraform apply"]
A1[Create AWS VPC] --> A2[Create Azure VNet]
A2 --> A3[Create VPN Connection]
A3 --> A4[Create GCP VPC]
A4 --> A5[Deploy K8s Clusters]
end
subgraph State["State Management"]
S1[Single State File<br/>Multi-Provider]
S2[Split State<br/>Per Provider]
end
Plan --> Apply
Apply --> State
Multi-Cloud Kubernetes
Kubernetes is the most common compute abstraction layer for multi-cloud. Running workloads across EKS, AKS, and GKE provides a consistent deployment target regardless of the underlying cloud — but multi-cluster management introduces its own complexity.
Multi-Cluster Management Tools
| Tool | Vendor | Model | Best For | Complexity |
|---|---|---|---|---|
| Rancher | SUSE | Central management plane | Multi-cluster lifecycle | Medium |
| Azure Arc | Microsoft | Azure control plane extension | Azure-centric hybrid | Medium |
| Anthos | Google | GCP control plane extension | GCP-centric hybrid | High |
| Kubefed | CNCF | Federation v2 CRDs | Resource propagation | High |
| Loft/vCluster | Loft | Virtual clusters | Multi-tenancy | Low |
| Skupper | Red Hat | Application networking | Cross-cluster services | Low |
GitOps for Multi-Cluster Deployments
# ArgoCD ApplicationSet for multi-cluster deployment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: multi-cloud-app
namespace: argocd
spec:
generators:
- clusters:
selector:
matchLabels:
environment: production
template:
metadata:
name: "app-{{name}}"
spec:
project: default
source:
repoURL: https://github.com/org/multi-cloud-app.git
targetRevision: main
path: "deploy/overlays/{{metadata.labels.cloud}}"
destination:
server: "{{server}}"
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
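To see what the cluster generator does with this template, the following sketch reproduces its behaviour in plain Python: select registered clusters whose labels match, then substitute each cluster's values into the name, path, and destination. It is a simplified model, not Argo CD code. The function name is hypothetical, and the staging cluster entry is made up to show the selector filtering.

```python
def render_applications(clusters: list[dict], match_labels: dict[str, str]) -> list[dict]:
    """Render one Application per cluster whose labels match the selector."""
    apps = []
    for cluster in clusters:
        labels = cluster["labels"]
        if all(labels.get(k) == v for k, v in match_labels.items()):
            apps.append({
                "name": f"app-{cluster['name']}",
                "path": f"deploy/overlays/{labels['cloud']}",
                "server": cluster["server"],
            })
    return apps


if __name__ == "__main__":
    clusters = [
        {"name": "aws-production", "server": "https://eks-cluster.us-east-1.eks.amazonaws.com",
         "labels": {"environment": "production", "cloud": "aws"}},
        {"name": "azure-production", "server": "https://aks-cluster.eastus.azmk8s.io",
         "labels": {"environment": "production", "cloud": "azure"}},
        # Hypothetical cluster that the selector should skip.
        {"name": "gcp-staging", "server": "https://gke-staging.example.com",
         "labels": {"environment": "staging", "cloud": "gcp"}},
    ]
    for app in render_applications(clusters, {"environment": "production"}):
        print(app)
```

The `cloud` label does double duty here: it selects the Kustomize overlay, so a cluster registered without it would fail to render.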
# Kustomize overlay for AWS-specific configuration
# deploy/overlays/aws/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- target:
kind: Deployment
name: api-server
patch: |-
- op: add
path: /spec/template/spec/containers/0/env/-
value:
name: CLOUD_PROVIDER
value: "aws"
- op: add
path: /spec/template/spec/containers/0/env/-
value:
name: OBJECT_STORAGE_ENDPOINT
value: "s3.us-east-1.amazonaws.com"
- op: add
path: /spec/template/spec/serviceAccountName
value: app-irsa-sa
- target:
kind: Service
name: api-server
patch: |-
- op: add
path: /metadata/annotations
value:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
# Kustomize overlay for Azure-specific configuration
# deploy/overlays/azure/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- target:
kind: Deployment
name: api-server
patch: |-
- op: add
path: /spec/template/spec/containers/0/env/-
value:
name: CLOUD_PROVIDER
value: "azure"
- op: add
path: /spec/template/spec/containers/0/env/-
value:
name: OBJECT_STORAGE_ENDPOINT
value: "https://storageaccount.blob.core.windows.net"
- op: add
path: /spec/template/metadata/labels/azure.workload.identity~1use
value: "true"
- target:
kind: Service
name: api-server
patch: |-
- op: add
path: /metadata/annotations
value:
service.beta.kubernetes.io/azure-load-balancer-internal: "false"
# ArgoCD cluster registration
# Register multiple clusters for multi-cloud GitOps
apiVersion: v1
kind: Secret
metadata:
name: aws-production
namespace: argocd
labels:
argocd.argoproj.io/secret-type: cluster
environment: production
cloud: aws
region: us-east-1
type: Opaque
stringData:
name: aws-production
server: https://eks-cluster.us-east-1.eks.amazonaws.com
config: |
{
"execProviderConfig": {
"command": "aws",
"args": ["eks", "get-token", "--cluster-name", "production"],
"apiVersion": "client.authentication.k8s.io/v1beta1"
}
}
---
apiVersion: v1
kind: Secret
metadata:
name: azure-production
namespace: argocd
labels:
argocd.argoproj.io/secret-type: cluster
environment: production
cloud: azure
region: eastus
type: Opaque
stringData:
name: azure-production
server: https://aks-cluster.eastus.azmk8s.io
config: |
{
"execProviderConfig": {
"command": "kubelogin",
"args": ["get-token", "--login", "azurecli", "--server-id", "APP_ID"],
"apiVersion": "client.authentication.k8s.io/v1beta1"
}
}
Cost & Governance
Multi-cloud cost management is considerably harder than single-cloud. Each provider has its own pricing model, its own discount mechanisms (AWS Reserved Instances, GCP Committed Use Discounts, Azure Reservations), and its own billing API. Without unified governance, costs become hard to attribute and tend to grow unnoticed.
FinOps Practices for Multi-Cloud
# Unified tagging policy across clouds
# Enforced via OPA/Sentinel at deploy time
tagging_standard:
  required:
    - key: "cost-center"
      format: "CC-[0-9]{4}"
      description: "Financial cost center code"
    - key: "environment"
      values: [dev, staging, production]
    - key: "team"
      description: "Owning team name"
    - key: "service"
      description: "Service/application name"
    - key: "managed-by"
      values: [terraform, crossplane, manual]
  provider_mapping:
    aws:
      tag_key_format: "PascalCase"
      example: "CostCenter: CC-1234"
    azure:
      tag_key_format: "camelCase"
      example: "costCenter: CC-1234"
    gcp:
      label_key_format: "lowercase-hyphen"
      example: "cost-center: cc-1234"
# Terraform: enforce tagging via variable validation
variable "tags" {
type = map(string)
validation {
condition = alltrue([
contains(keys(var.tags), "cost-center"),
contains(keys(var.tags), "environment"),
contains(keys(var.tags), "team"),
contains(keys(var.tags), "service"),
can(regex("^CC-[0-9]{4}$", var.tags["cost-center"]))
])
error_message = "Tags must include cost-center (CC-XXXX format), environment, team, and service."
}
}
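The per-provider key formats in the tagging standard are easy to get wrong by hand, so it can help to derive them from one canonical tag set. The sketch below is one possible approach, with hypothetical function names: it converts canonical lowercase-hyphen keys to PascalCase for AWS and camelCase for Azure, and lowercases values for GCP to match the `cc-1234` example in the standard.

```python
def provider_key(canonical: str, provider: str) -> str:
    """Convert a canonical lowercase-hyphen key to the provider's key format."""
    words = canonical.split("-")
    if provider == "aws":      # PascalCase
        return "".join(w.capitalize() for w in words)
    if provider == "azure":    # camelCase
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if provider == "gcp":      # lowercase-hyphen, unchanged
        return canonical
    raise ValueError(f"unknown provider: {provider}")


def provider_tags(tags: dict[str, str], provider: str) -> dict[str, str]:
    """Translate a canonical tag set into provider-specific tags or labels."""
    return {
        provider_key(key, provider): value.lower() if provider == "gcp" else value
        for key, value in tags.items()
    }


if __name__ == "__main__":
    canonical = {"cost-center": "CC-1234", "environment": "production", "managed-by": "terraform"}
    for provider in ("aws", "azure", "gcp"):
        print(provider, provider_tags(canonical, provider))
```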
| Tool | Type | Clouds Supported | Best For |
|---|---|---|---|
| Kubecost | K8s-native cost | All (via K8s) | Container workload costs |
| Infracost | Pre-deploy estimation | AWS, Azure, GCP | Cost in PR reviews |
| CloudHealth | Cloud cost management | AWS, Azure, GCP | Enterprise FinOps |
| Spot.io | Cost optimization | AWS, Azure, GCP | Spot/preemptible automation |
| Vantage | Cost visibility | AWS, Azure, GCP, Datadog | Developer-friendly reporting |
| OpenCost | CNCF standard | All (via K8s) | Open-source K8s cost allocation |
Common Anti-Patterns
Multi-cloud failures are more common than successes. Recognizing anti-patterns early avoids wasted infrastructure spending and years of accumulated technical debt.
Hands-On Exercises
Design a Multi-Cloud Strategy
Scenario: A fintech company processes payments on AWS (for PCI compliance tooling), runs their data analytics on GCP BigQuery, and has acquired a company with all infrastructure on Azure. They need a unified strategy.
Tasks:
- Identify which multi-cloud pattern best fits this scenario
- Design the networking topology (VPN or interconnect between clouds)
- Define the identity federation strategy
- Create a unified tagging/labeling standard
- Propose a centralized observability approach
- Estimate the monthly cross-cloud data transfer cost for 5 TB/day
# Calculate cross-cloud transfer costs
# AWS egress to internet: $0.09/GB (first 10 TB)
# Azure ingress: free
# Estimate for 5 TB/day = 150 TB/month
# Single quotes keep the shell from expanding $0 and $1 inside the amounts
echo 'AWS egress cost: 150000 GB * $0.09 = $13,500/month'
echo 'With Direct Connect: 150000 GB * $0.02 = $3,000/month'
echo 'Savings with dedicated interconnect: $10,500/month'
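The estimate above applies the first-tier rate to the whole volume, which overstates the bill, because internet egress is usually priced in descending tiers. The sketch below shows the tiered calculation. The tier boundaries and rates are assumptions chosen for illustration and should be replaced with the provider's current price list.

```python
# Assumed tier table (GB per month, USD per GB); verify against current pricing.
ASSUMED_TIERS = [
    (10_000, 0.09),
    (40_000, 0.085),
    (100_000, 0.07),
    (None, 0.05),  # everything beyond the tiers above
]


def tiered_egress_cost(volume_gb: float, tiers=ASSUMED_TIERS) -> float:
    """Charge each slice of the monthly volume at its own tier rate."""
    remaining = volume_gb
    total = 0.0
    for size, rate in tiers:
        billed = remaining if size is None else min(remaining, size)
        total += billed * rate
        remaining -= billed
        if remaining <= 0:
            break
    return round(total, 2)


if __name__ == "__main__":
    monthly_gb = 5_000 * 30  # 5 TB/day for 30 days
    print(tiered_egress_cost(monthly_gb))   # 11300.0
    print(round(monthly_gb * 0.09, 2))      # 13500.0 (flat first-tier rate)
```

Under these assumed tiers the monthly figure drops from $13,500 to $11,300, which narrows the gap with the dedicated interconnect option.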
Build a Multi-Cloud Terraform Project
Objective: Create a Terraform project that provisions a VPC on AWS and a VNet on Azure, then connects them with a VPN tunnel.
# Exercise: Complete this multi-cloud configuration
# File: main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.80"
}
}
}
provider "aws" {
region = "us-east-1"
}
provider "azurerm" {
features {}
}
# Task 1: Create AWS VPC with CIDR 10.0.0.0/16
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = { Name = "multi-cloud-vpc" }
}
# Task 2: Create Azure Resource Group and VNet with CIDR 10.1.0.0/16
resource "azurerm_resource_group" "main" {
name = "multi-cloud-rg"
location = "East US"
}
resource "azurerm_virtual_network" "main" {
name = "multi-cloud-vnet"
address_space = ["10.1.0.0/16"]
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
}
# Task 3: Create subnets on both sides
# Task 4: Set up VPN gateways on both sides
# Task 5: Establish the VPN connection
# Task 6: Verify connectivity with a test VM on each side
Validation:
# Validate Terraform configuration
terraform init
terraform validate
terraform plan -out=multicloud.plan
# Check resource count
terraform show multicloud.plan | grep "Plan:"
# Expected: "Plan: X to add, 0 to change, 0 to destroy."
Set Up Cross-Cloud VPN Connectivity
Objective: Configure IPsec VPN between AWS and GCP with BGP dynamic routing.
# GCP side: Create HA VPN gateway
resource "google_compute_ha_vpn_gateway" "to_aws" {
name = "ha-vpn-to-aws"
network = google_compute_network.main.id
region = "us-central1"
}
resource "google_compute_router" "vpn_router" {
name = "vpn-router"
network = google_compute_network.main.id
region = "us-central1"
bgp {
asn = 64513
advertise_mode = "CUSTOM"
advertised_groups = ["ALL_SUBNETS"]
}
}
# Task: Create the following resources
# 1. google_compute_external_vpn_gateway (represents AWS VPN endpoint)
# 2. google_compute_vpn_tunnel (2 tunnels for HA)
# 3. google_compute_router_interface (BGP interface)
# 4. google_compute_router_peer (BGP peer configuration)
# AWS side: Create matching VPN infrastructure
# 1. aws_vpn_gateway
# 2. aws_customer_gateway (points to GCP external IP)
# 3. aws_vpn_connection (with BGP enabled)
# Verify VPN tunnel status
# AWS
aws ec2 describe-vpn-connections \
--filters "Name=tag:Name,Values=aws-to-gcp-vpn" \
--query "VpnConnections[].VgwTelemetry[].Status"
# GCP
gcloud compute vpn-tunnels describe ha-vpn-to-aws-tunnel-0 \
--region=us-central1 \
--format="value(status)"
# Expected output: AWS reports "UP" for each tunnel; GCP reports "ESTABLISHED"
Implement Multi-Cloud Monitoring
Objective: Set up a unified Grafana dashboard that displays metrics from AWS CloudWatch, Azure Monitor, and GCP Cloud Monitoring.
# Grafana data sources for multi-cloud monitoring
# grafana/provisioning/datasources/multi-cloud.yaml
apiVersion: 1
datasources:
- name: AWS CloudWatch
type: cloudwatch
access: proxy
jsonData:
authType: keys
defaultRegion: us-east-1
secureJsonData:
accessKey: ${AWS_ACCESS_KEY}
secretKey: ${AWS_SECRET_KEY}
- name: Azure Monitor
type: grafana-azure-monitor-datasource
access: proxy
jsonData:
cloudName: azuremonitor
tenantId: ${AZURE_TENANT_ID}
clientId: ${AZURE_CLIENT_ID}
subscriptionId: ${AZURE_SUBSCRIPTION_ID}
secureJsonData:
clientSecret: ${AZURE_CLIENT_SECRET}
- name: GCP Cloud Monitoring
type: stackdriver
access: proxy
jsonData:
authenticationType: gce
defaultProject: ${GCP_PROJECT_ID}
{
"dashboard": {
"title": "Multi-Cloud Overview",
"panels": [
{
"title": "AWS EC2 CPU Utilization",
"datasource": "AWS CloudWatch",
"targets": [{
"namespace": "AWS/EC2",
"metricName": "CPUUtilization",
"statistics": ["Average"],
"period": "300"
}]
},
{
"title": "Azure VM CPU Percentage",
"datasource": "Azure Monitor",
"targets": [{
"resourceGroup": "production-rg",
"metricDefinition": "Microsoft.Compute/virtualMachines",
"metricName": "Percentage CPU",
"aggregation": "Average"
}]
},
{
"title": "GCP GCE CPU Usage",
"datasource": "GCP Cloud Monitoring",
"targets": [{
"metricType": "compute.googleapis.com/instance/cpu/utilization",
"filters": ["resource.type=\"gce_instance\""]
}]
}
]
}
}
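One detail to handle when placing these panels side by side: the AWS and Azure CPU metrics are reported as percentages, while the GCP `cpu/utilization` metric is a fraction, typically between 0.0 and 1.0. A small normalization step, sketched below with a hypothetical function name, keeps thresholds and alerts comparable. In Grafana the same effect can be had by setting the panel unit appropriately.

```python
def cpu_percent(source: str, value: float) -> float:
    """Normalize a CPU reading from any of the three clouds to a percentage."""
    if source == "gcp":
        return value * 100.0   # fraction of allocated CPU -> percent
    if source in ("aws", "azure"):
        return value           # already a percentage
    raise ValueError(f"unknown source: {source}")


if __name__ == "__main__":
    readings = [("aws", 41.5), ("azure", 63.0), ("gcp", 0.42)]
    for source, raw in readings:
        print(f"{source:>5}: {cpu_percent(source, raw):.1f}%")
```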
Conclusion & Next Steps
Multi-cloud architecture is a powerful tool when applied intentionally, and a costly burden when adopted accidentally. The patterns, tooling, and practices in this article provide a framework for making informed decisions about when and how to distribute workloads across providers.
Key takeaways:
- Strategy first — Choose a multi-cloud pattern (best-of-breed, active-passive, workload distribution, or cloud-agnostic) based on actual requirements, not buzzword compliance
- Abstract operations, not capabilities — Terraform, Kubernetes, and service mesh abstract how you deploy; avoid abstracting away what each cloud uniquely offers
- Networking is the hard part — Cross-cloud VPN/interconnect, DNS federation, and global load balancing require significant planning and testing
- Federate identity early — OIDC federation and centralized secrets management (Vault) are prerequisites, not afterthoughts
- Respect data gravity — Compute should move to data, not the other way around; cross-cloud transfer costs add up fast
- Unified tooling is essential — GitOps (ArgoCD), observability (Grafana), and policy (OPA) must span all clouds consistently
- Budget for complexity — Multi-cloud requires 2-3x the operational expertise of single-cloud; staff accordingly
Next in the Series
In Part 17: Service Mesh & Advanced Networking, we dive into Istio, Envoy proxies, mutual TLS, traffic management, circuit breaking, and the advanced networking patterns that enable secure, observable communication across multi-cloud Kubernetes clusters.