What is Progressive Delivery?
Progressive delivery is the practice of releasing software changes to a small subset of users first, analysing the impact, and then gradually expanding to the full audience. It builds on continuous delivery by adding fine-grained control over who sees a change and when — transforming deployments from binary "all or nothing" events into controlled experiments.
Think of it like launching a new menu item at a restaurant chain. You wouldn't roll it out to every location simultaneously. You'd start with a few pilot stores, measure customer response, refine the recipe, then expand city by city. Progressive delivery applies exactly this logic to software releases.
Why Traditional Deployments Fail at Scale
Traditional deployment models assume a binary state: the old version is running, then a switch flips and the new version replaces it. This creates several failure modes at scale:
- Blast radius — A bug in the new version affects 100% of users simultaneously
- Slow feedback — Problems may not surface until thousands of users are impacted
- Rollback latency — Reverting takes minutes or hours while users experience degraded service
- No experimentation — You can't compare the old and new versions side by side in production
- Deploy fear — Teams avoid deploying on Fridays, before holidays, or during peak traffic
Progressive delivery addresses each of these failure modes by making every release incremental, observable, and automatically reversible, which removes most of the reason for "deploy fear".
flowchart LR
A["Big Bang<br/>Deploy"] --> B["Rolling<br/>Update"]
B --> C["Blue-Green<br/>Deploy"]
C --> D["Canary<br/>Release"]
D --> E["Feature<br/>Flags"]
E --> F["A/B<br/>Testing"]
style A fill:#fff5f5,stroke:#BF092F,color:#132440
style B fill:#f0f4f8,stroke:#16476A,color:#132440
style C fill:#f0f4f8,stroke:#16476A,color:#132440
style D fill:#e8f4f4,stroke:#3B9797,color:#132440
style E fill:#e8f4f4,stroke:#3B9797,color:#132440
style F fill:#e8f4f4,stroke:#3B9797,color:#132440
Deployment Strategies Compared
Before diving into tooling, let's understand the core strategies. Each trades off between safety, speed, resource cost, and complexity.
Blue-Green Deployments
Blue-green maintains two identical production environments. At any time, one ("blue") serves live traffic while the other ("green") is idle or running the new version. Switching traffic is a single routing change — typically updating a load balancer or DNS record.
flowchart TD
LB["Load Balancer"] --> Blue["Blue (v1.0)<br/>LIVE"]
LB -.-> Green["Green (v1.1)<br/>STANDBY"]
Green -->|"Smoke tests pass"| Switch["Switch Traffic"]
Switch --> LB2["Load Balancer"]
LB2 --> Green2["Green (v1.1)<br/>LIVE"]
LB2 -.-> Blue2["Blue (v1.0)<br/>STANDBY"]
style Blue fill:#f0f4f8,stroke:#16476A,color:#132440
style Green fill:#e8f4f4,stroke:#3B9797,color:#132440
style Switch fill:#e8f4f4,stroke:#3B9797,color:#132440
style Blue2 fill:#f0f4f8,stroke:#16476A,color:#132440
style Green2 fill:#e8f4f4,stroke:#3B9797,color:#132440
style LB fill:#f0f4f8,stroke:#16476A,color:#132440
style LB2 fill:#f0f4f8,stroke:#16476A,color:#132440
Advantages: Instant rollback (switch back to blue), zero downtime, full environment testing before traffic hits it.
Disadvantages: Doubles infrastructure cost, database migrations require careful handling, no gradual traffic shifting.
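The single-switch behaviour described above can be sketched as a tiny in-memory model. This is only an illustration: `BlueGreenRouter`, `promote`, and `rollback` are hypothetical names, and a real setup would change a load balancer target or DNS record rather than a field on an object.

```javascript
// Illustrative in-memory model of a blue-green switch; not a real load balancer API.
class BlueGreenRouter {
  constructor(blueVersion, greenVersion) {
    this.environments = { blue: blueVersion, green: greenVersion };
    this.live = 'blue'; // blue serves traffic first
  }

  // Name of the environment that is not receiving live traffic.
  get standby() {
    return this.live === 'blue' ? 'green' : 'blue';
  }

  // Switch traffic to standby only if the smoke test accepts its version.
  promote(smokeTest) {
    const candidate = this.standby;
    if (!smokeTest(this.environments[candidate])) {
      return false; // live environment is left untouched
    }
    this.live = candidate;
    return true;
  }

  // Rollback is the same single switch in the opposite direction.
  rollback() {
    this.live = this.standby;
  }

  // Version that a request would reach right now.
  route() {
    return this.environments[this.live];
  }
}

const router = new BlueGreenRouter('v1.0', 'v1.1');
router.promote((version) => version === 'v1.1');
console.log(router.route()); // v1.1
router.rollback();
console.log(router.route()); // v1.0
```

Because promotion and rollback are the same operation, rollback costs no more than the original switch, which is where the "instant rollback" advantage comes from.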
Canary Deployments
Named after the canary in a coal mine, this strategy routes a small percentage of production traffic to the new version while the majority continues hitting the stable release. If the canary shows healthy metrics, traffic is gradually increased.
Google's Canary Analysis at Scale
Google runs canary analysis on virtually every production change. Its internal system, the Canary Analysis Service (CAS), compares metrics between the canary and baseline populations using statistical tests. An illustrative staged rollout follows 1% → 5% → 25% → 50% → 100%, with automated analysis at each stage. If the canary's error rate exceeds a threshold or latency degrades beyond a configurable limit, the rollout automatically pauses and alerts the on-call engineer.
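The staged progression with a gate at each step can be sketched as a small loop. This is a toy model under stated assumptions, not Google's system: `runStagedRollout`, the `observe` callback, and the threshold values are all hypothetical.

```javascript
// Traffic percentages for each stage, taken from the progression in the text.
const STAGES = [1, 5, 25, 50, 100];

// observe(percent) is a hypothetical callback returning canary metrics
// measured while `percent` of traffic is on the new version.
function runStagedRollout(observe, limits = { maxErrorRate: 0.01, maxLatencyMs: 500 }) {
  let lastHealthyPercent = 0;
  for (const percent of STAGES) {
    const { errorRate, latencyMs } = observe(percent);
    const unhealthy =
      errorRate > limits.maxErrorRate || latencyMs > limits.maxLatencyMs;
    if (unhealthy) {
      // Pause instead of continuing; a real system would also alert on-call.
      return { status: 'paused', lastHealthyPercent, failedAtPercent: percent };
    }
    lastHealthyPercent = percent;
  }
  return { status: 'promoted', lastHealthyPercent: 100, failedAtPercent: null };
}

// Example: the canary looks fine until it receives 25% of traffic.
const result = runStagedRollout((percent) =>
  percent >= 25
    ? { errorRate: 0.04, latencyMs: 180 }
    : { errorRate: 0.002, latencyMs: 120 });
console.log(result.status, result.lastHealthyPercent, result.failedAtPercent);
// paused 5 25
```

The point of the small early stages is visible in the example: the regression only shows up under more load, yet at most 25% of users were exposed to it.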
Rolling Updates
Rolling updates replace instances of the old version one at a time (or in batches). Kubernetes uses this as its default Deployment strategy. While simple, rolling updates have a key limitation: during the update, both old and new versions serve traffic simultaneously with no control over the ratio.
# Kubernetes rolling update configuration
# kubectl apply -f deployment-rolling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # 2 extra pods during update
      maxUnavailable: 1  # At most 1 pod unavailable
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
        version: v1.1.0
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v1.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
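To make the effect of maxSurge and maxUnavailable concrete, here is a small sketch that works out the pod-count bounds they imply. The helper names are hypothetical. The rounding rule (percentages round up for maxSurge and down for maxUnavailable) follows the Kubernetes Deployment documentation.

```javascript
// Resolve an absolute number or a percentage string such as "25%".
function resolveBound(value, replicas, roundUp) {
  if (typeof value === 'number') {
    return value;
  }
  const scaled = (parseFloat(value) / 100) * replicas;
  return roundUp ? Math.ceil(scaled) : Math.floor(scaled);
}

// Upper bound on total pods and lower bound on available pods during the update.
function rollingUpdateBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolveBound(maxSurge, replicas, true);             // rounds up
  const unavailable = resolveBound(maxUnavailable, replicas, false); // rounds down
  return {
    maxTotalPods: replicas + surge,
    minAvailablePods: replicas - unavailable,
  };
}

// Values from the manifest above: replicas 6, maxSurge 2, maxUnavailable 1.
const bounds = rollingUpdateBounds(6, 2, 1);
console.log(bounds.maxTotalPods, bounds.minAvailablePods); // 8 5
```

These bounds limit how fast the update proceeds, but they say nothing about how requests are divided between old and new pods, which is the limitation the text points out.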
Argo Rollouts — Progressive Delivery for Kubernetes
Argo Rollouts is a Kubernetes controller that provides advanced deployment strategies — canary, blue-green, and experimentation — as first-class Kubernetes resources. It replaces the standard Deployment resource with a Rollout resource that supports fine-grained traffic management, automated analysis, and promotion gates.
# Install Argo Rollouts controller
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# Verify installation
kubectl get pods -n argo-rollouts
# NAME READY STATUS RESTARTS AGE
# argo-rollouts-controller-xxx 1/1 Running 0 30s
# Install the kubectl plugin for CLI management
# Linux (amd64). On macOS, download kubectl-argo-rollouts-darwin-amd64 (or -darwin-arm64) instead
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
# Verify plugin
kubectl argo rollouts version
Canary Rollout with Argo Rollouts
A canary rollout gradually shifts traffic from the stable version to the canary. Argo Rollouts integrates with ingress controllers (NGINX, ALB) and service meshes (Istio, Linkerd) for precise traffic splitting.
# canary-rollout.yaml: progressive canary with analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 5
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v1.2.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
  strategy:
    canary:
      # Traffic routing via NGINX Ingress
      canaryService: web-app-canary
      stableService: web-app-stable
      trafficRouting:
        nginx:
          stableIngress: web-app-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
      # Step-by-step rollout
      steps:
        - setWeight: 5          # 5% traffic to canary
        - pause:
            duration: 5m        # Wait 5 minutes
        - analysis:
            templates:
              - templateName: success-rate
            # Pass the canary ReplicaSet's pod-template hash to the template
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 20         # 20% traffic
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 50         # 50% traffic
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-check
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
      # Full promotion happens automatically after last step
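# analysis-template.yaml: automated canary analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  # Declared so that {{args.canary-hash}} resolves. The Rollout's analysis
  # step has to supply the value, for example via podTemplateHashValue.
  args:
    - name: canary-hash
  metrics:
    - name: success-rate
      # Query Prometheus for the canary's success rate
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Pod names look like <rollout>-<pod-template-hash>-<suffix>, and
          # Prometheus regex matchers are fully anchored, hence the ".*-" prefix
          query: |
            sum(rate(http_requests_total{
              app="web-app",
              status=~"2..",
              pod=~".*-{{args.canary-hash}}-.*"
            }[5m])) /
            sum(rate(http_requests_total{
              app="web-app",
              pod=~".*-{{args.canary-hash}}-.*"
            }[5m]))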
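How count, successCondition, and failureLimit interact can be sketched with a simplified model: take up to `count` measurements, and fail the run as soon as the number of failed measurements exceeds `failureLimit`. This is a reading of the template semantics under simplifying assumptions (no errors, no inconclusive results), and `evaluateAnalysis` is a hypothetical helper, not part of Argo Rollouts.

```javascript
// Simplified model of an analysis run over a list of metric measurements.
function evaluateAnalysis(measurements, { count, failureLimit, isSuccess }) {
  let failures = 0;
  for (const value of measurements.slice(0, count)) {
    if (!isSuccess(value)) {
      failures += 1;
    }
    if (failures > failureLimit) {
      // More failed measurements than allowed: stop early.
      return { phase: 'Failed', failures };
    }
  }
  return { phase: 'Successful', failures };
}

// Settings mirroring the success-rate template: 5 measurements,
// success means a ratio of at least 0.95, up to 3 failures tolerated.
const settings = { count: 5, failureLimit: 3, isSuccess: (ratio) => ratio >= 0.95 };

const run = evaluateAnalysis([0.99, 0.9, 0.97, 0.93, 0.96], settings);
console.log(run.phase, run.failures); // Successful 2
```

Note how tolerant these particular settings are: with five measurements and a failure limit of three, the run only fails when four of the five measurements fall below 95%.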
Blue-Green Rollout with Argo Rollouts
# bluegreen-rollout.yaml: blue-green with automated promotion
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v2.0.0
          ports:
            - containerPort: 8080
  strategy:
    blueGreen:
      activeService: web-app-active
      previewService: web-app-preview
      # Automatically promote after analysis passes
      autoPromotionEnabled: true
      autoPromotionSeconds: 300
      # Scale down old version after promotion
      scaleDownDelaySeconds: 60
      # Run analysis before promotion
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: web-app-preview
# Monitor rollout status with the Argo Rollouts CLI
kubectl argo rollouts get rollout web-app -n production --watch
# Manually promote a paused rollout
kubectl argo rollouts promote web-app -n production
# Abort a rollout (triggers automatic rollback)
kubectl argo rollouts abort web-app -n production
# Retry a failed rollout
kubectl argo rollouts retry rollout web-app -n production
# List all rollouts in the namespace
kubectl argo rollouts list rollouts -n production
Feature Flags — Decoupling Deploy from Release
Feature flags (also called feature toggles) are conditional statements in your code that control whether a feature is visible to users. They let you merge code into the main branch and deploy it to production without exposing it — then toggle it on for specific users, regions, or percentages at any time.
flowchart TD
App["Application"] --> SDK["Flag SDK"]
SDK --> Cache["Local Cache"]
SDK -->|"Poll / Stream"| FMS["Flag Management<br/>Service"]
FMS --> Store["Flag Store<br/>(DB / Config)"]
FMS --> Rules["Targeting Rules<br/>(User, %, Region)"]
Dashboard["Admin Dashboard"] --> FMS
style App fill:#f0f4f8,stroke:#16476A,color:#132440
style SDK fill:#e8f4f4,stroke:#3B9797,color:#132440
style FMS fill:#e8f4f4,stroke:#3B9797,color:#132440
style Dashboard fill:#e8f4f4,stroke:#3B9797,color:#132440
style Cache fill:#f0f4f8,stroke:#16476A,color:#132440
style Store fill:#f0f4f8,stroke:#16476A,color:#132440
style Rules fill:#f0f4f8,stroke:#16476A,color:#132440
Types of Feature Flags
| Flag Type | Lifespan | Use Case | Example |
|---|---|---|---|
| Release Flag | Days–Weeks | Control feature rollout to users | Show new checkout flow to 10% of users |
| Experiment Flag | Weeks–Months | A/B testing and data collection | Compare two recommendation algorithms |
| Ops Flag | Permanent | Circuit breakers and kill switches | Disable expensive search during peak load |
| Permission Flag | Permanent | Entitlements and access control | Enable premium features for paying customers |
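A percentage rollout such as "10% of users" is usually implemented with deterministic bucketing, so a given user gets the same answer on every request. The sketch below shows one common approach, hashing the flag key and user ID into a bucket from 0 to 99. The function names are hypothetical, and commercial flag services use their own hashing schemes.

```javascript
const { createHash } = require('node:crypto');

// Map a (flag, user) pair to a stable bucket in the range 0..99.
function bucketFor(flagKey, userId) {
  const digest = createHash('sha256').update(`${flagKey}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function isEnabled(flagKey, userId, rolloutPercent) {
  return bucketFor(flagKey, userId) < rolloutPercent;
}

// Count how many of 10,000 synthetic users land in a 10% rollout.
let enabledCount = 0;
for (let i = 0; i < 10000; i += 1) {
  if (isEnabled('new-checkout-flow', `user-${i}`, 10)) {
    enabledCount += 1;
  }
}
// Close to 1000; the exact count depends on the hash values.
console.log(`${enabledCount} of 10000 users enabled`);
```

Including the flag key in the hash input means different flags select different user subsets, and raising the percentage only ever adds users: anyone enabled at 10% stays enabled at 25%.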
Feature Flag Lifecycle
Feature flags accumulate technical debt if not managed. Every flag should have a defined lifecycle:
# .feature-flags/new-checkout.yaml: flag definition with lifecycle metadata
name: new-checkout-flow
description: "Redesigned checkout with single-page layout"
owner: team-payments
type: release
created: 2026-05-01
expected-removal: 2026-06-15
status: active
# Targeting rules
targeting:
  # Stage 1: Internal dogfooding
  - segment: internal-employees
    enabled: true
    since: 2026-05-01
  # Stage 2: Beta users
  - segment: beta-program
    enabled: true
    since: 2026-05-08
  # Stage 3: Percentage rollout
  - percentage: 25
    enabled: true
    since: 2026-05-15
  # Stage 4: Full rollout (flag becomes candidate for removal)
  - percentage: 100
    enabled: true
    target-date: 2026-06-01
# Cleanup tracking
cleanup:
  jira-ticket: PAY-4521
  removal-deadline: 2026-06-15
  code-references:
    - src/checkout/CheckoutPage.tsx:42
    - src/checkout/CheckoutPage.tsx:87
    - tests/checkout.test.ts:15
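Lifecycle metadata like this only helps if something checks it. As a minimal sketch, a CI job could load the flag definitions and report active flags whose removal deadline has passed. The objects below stand in for parsed flag files, and the field names (`removalDeadline`, `status`) and the `findOverdueFlags` helper are assumptions for illustration.

```javascript
// Return the names of active flags whose removal deadline is before `today`.
function findOverdueFlags(flags, today) {
  return flags
    .filter((flag) => flag.status === 'active')
    .filter((flag) => new Date(flag.removalDeadline) < today)
    .map((flag) => flag.name);
}

// Stand-ins for parsed flag definitions (hypothetical data).
const flags = [
  { name: 'new-checkout-flow', status: 'active', removalDeadline: '2026-06-15' },
  { name: 'search-kill-switch', status: 'active', removalDeadline: '2027-01-01' },
  { name: 'old-banner', status: 'archived', removalDeadline: '2026-01-01' },
];

const overdue = findOverdueFlags(flags, new Date('2026-06-20'));
console.log(overdue); // [ 'new-checkout-flow' ]
```

Failing the build, or opening a ticket, when this list is non-empty is one way to keep flag debt from accumulating unnoticed.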
// Example: Feature flag implementation in Node.js
// Uses the OpenFeature SDK for vendor-neutral flag evaluation
const { OpenFeature } = require('@openfeature/server-sdk');
const { LaunchDarklyProvider } = require('@launchdarkly/openfeature-node-server');

// Initialize the provider (runs once at startup)
const provider = new LaunchDarklyProvider('sdk-key-here');
OpenFeature.setProvider(provider);

// Get a client for evaluation
const client = OpenFeature.getClient();

// Evaluate a boolean flag with user context
async function handleCheckout(req, res) {
  const context = {
    targetingKey: req.user.id,
    email: req.user.email,
    country: req.user.country,
    plan: req.user.subscriptionPlan
  };

  const useNewCheckout = await client.getBooleanValue(
    'new-checkout-flow',
    false, // default value if flag evaluation fails
    context
  );

  // renderNewCheckout and renderLegacyCheckout are the application's own handlers
  if (useNewCheckout) {
    return renderNewCheckout(req, res);
  }
  return renderLegacyCheckout(req, res);
}

console.log("Feature flag evaluation ready");
Analysis-Driven Delivery
The most powerful aspect of progressive delivery is automated analysis — letting metrics decide whether a release is safe to promote. Instead of a human watching dashboards, analysis templates define success criteria that are evaluated automatically during each rollout step.
Metrics Providers Integration
# Prometheus analysis: error rate check
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 2m
      count: 3
      successCondition: result[0] <= 0.01
      failureCondition: result[0] > 0.05
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
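Because this template sets both successCondition and failureCondition, a measurement can satisfy neither, which leaves a band between 1% and 5% where the result is inconclusive. The sketch below classifies an error rate against the two thresholds from the template. `classifyErrorRate` is a hypothetical helper that models the outcome, not Argo Rollouts code.

```javascript
// Classify one error-rate measurement against the template's two conditions:
//   successCondition: result[0] <= 0.01
//   failureCondition: result[0] > 0.05
function classifyErrorRate(errorRate, successMax = 0.01, failureMin = 0.05) {
  if (errorRate > failureMin) {
    return 'Failed';
  }
  if (errorRate <= successMax) {
    return 'Successful';
  }
  // Neither condition holds: the measurement does not decide the rollout.
  return 'Inconclusive';
}

// Error rate as the query computes it: 5xx requests divided by all requests.
const errorRate = 12 / 400; // 0.03
console.log(classifyErrorRate(0.004)); // Successful
console.log(classifyErrorRate(errorRate)); // Inconclusive
console.log(classifyErrorRate(0.08)); // Failed
```

An inconclusive band is useful when the metric is noisy: clear passes and clear failures are handled automatically, and only the ambiguous middle needs a human decision.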
# Datadog analysis: latency p99 check
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
    - name: p99-latency
      interval: 3m
      count: 3
      successCondition: result <= 500
      failureLimit: 2
      provider:
        datadog:
          apiVersion: v2
          query: |
            avg:trace.http.request.duration.by_resource_service.99p{
              service:web-app,
              env:production
            }.rollup(avg, 300)
Intuit's Automated Canary Analysis
Intuit processes over 1 billion financial transactions annually. Their progressive delivery system uses Argo Rollouts with custom analysis templates that compare canary pods against baseline pods across 47 different metrics — including error rates, latency percentiles, CPU usage, and business metrics like transaction success rates. A canary must pass all 47 metric checks across three consecutive analysis windows before automatic promotion. This system reduced production incidents from new deployments by 74% in the first year of adoption.
Flagger — Service Mesh Progressive Delivery
Flagger is a progressive delivery operator that automates canary deployments using service mesh traffic shifting (Istio, Linkerd, App Mesh) or ingress controller weighting (NGINX, Contour, Gloo). While Argo Rollouts replaces the Deployment resource, Flagger works alongside existing Deployments: it creates a primary copy of the target Deployment (named <name>-primary), treats the original as the canary, and shifts traffic between the two automatically.
Automated Canary with Flagger
# flagger-canary.yaml: automated canary with Istio
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-app
  namespace: production
spec:
  # Reference the existing Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  # Istio virtual service for traffic routing
  service:
    port: 8080
    targetPort: 8080
    gateways:
      - public-gateway.istio-system.svc.cluster.local
    hosts:
      - app.example.com
  analysis:
    # Canary analysis schedule
    interval: 1m
    threshold: 5      # Max failed checks before rollback
    maxWeight: 50     # Max canary traffic percentage
    stepWeight: 10    # Traffic increment per step
    # Prometheus metrics checks
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99     # Minimum 99% success rate
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500    # Max 500ms p99 latency
        interval: 1m
    # Webhook for load testing during canary
    webhooks:
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://web-app-canary.production:8080/"
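The interplay of stepWeight, maxWeight, and threshold can be sketched as a small simulation: each passing check raises the canary weight by one step, each failing check holds the weight and counts toward the rollback threshold. This is a deliberately simplified model of the schedule, not Flagger's actual control loop, and `simulateCanary` is a hypothetical helper.

```javascript
// checks is a list of booleans, one per analysis interval (true = metrics passed).
function simulateCanary(checks, { stepWeight, maxWeight, threshold }) {
  let weight = 0;
  let failedChecks = 0;
  for (const passed of checks) {
    if (!passed) {
      failedChecks += 1;
      if (failedChecks >= threshold) {
        // Too many failed checks: send all traffic back to the primary.
        return { outcome: 'rolled-back', weight: 0, failedChecks };
      }
      continue; // hold the current weight and re-check next interval
    }
    weight += stepWeight;
    if (weight >= maxWeight) {
      return { outcome: 'promoted', weight: maxWeight, failedChecks };
    }
  }
  return { outcome: 'in-progress', weight, failedChecks };
}

// Values from the manifest above.
const analysis = { stepWeight: 10, maxWeight: 50, threshold: 5 };

const happyPath = simulateCanary([true, true, true, true, true], analysis);
console.log(happyPath.outcome, happyPath.weight); // promoted 50
```

With a 1m interval, five passing checks are needed to reach maxWeight, so the fastest possible rollout under these settings takes about five minutes.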
A/B Testing & Experimentation
A/B testing extends progressive delivery into product experimentation. Instead of simply checking infrastructure metrics (error rates, latency), A/B tests measure business outcomes — conversion rates, revenue per session, engagement metrics, or user satisfaction scores.
Traffic Splitting for Experiments
# Argo Rollouts canary used as an A/B test, with header-based routing
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-experiment
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: myregistry/checkout:v3.0.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        # Routes created by setHeaderRoute must be declared as managed routes
        managedRoutes:
          - name: internal-test
        istio:
          virtualService:
            name: checkout-vsvc
          destinationRule:
            name: checkout-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        # Route header-based traffic (internal testers)
        - setHeaderRoute:
            name: internal-test
            match:
              - headerName: X-Experiment
                headerValue:
                  exact: new-checkout
        - pause: {}             # Wait for manual analysis
        # Percentage-based split for real users
        - setWeight: 50
        - pause:
            duration: 24h       # Run experiment for 24 hours
        - analysis:
            templates:
              - templateName: conversion-rate-analysis
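The article does not show what the conversion-rate-analysis template checks. One common way to compare conversion rates between the stable and canary groups is a two-proportion z-test, sketched below. The function name and the sample numbers are hypothetical, and a real experiment also needs a sample size and duration planned in advance.

```javascript
// Two-proportion z-test: how many standard errors separate the two
// conversion rates, assuming both groups share one underlying rate.
function twoProportionZ(controlConversions, controlTotal, variantConversions, variantTotal) {
  const controlRate = controlConversions / controlTotal;
  const variantRate = variantConversions / variantTotal;
  const pooledRate =
    (controlConversions + variantConversions) / (controlTotal + variantTotal);
  const standardError = Math.sqrt(
    pooledRate * (1 - pooledRate) * (1 / controlTotal + 1 / variantTotal));
  return (variantRate - controlRate) / standardError;
}

// Hypothetical data: 100 of 1000 conversions on stable, 150 of 1000 on canary.
const z = twoProportionZ(100, 1000, 150, 1000);

// |z| > 1.96 corresponds to significance at the 5% level (two-sided).
const significant = Math.abs(z) > 1.96;
console.log(z.toFixed(2), significant); // 3.38 true
```

A 50/50 split, as in the rollout above, gives both groups the same sample size, which keeps the standard error as small as possible for a given amount of traffic.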
Production Patterns
Dark Launches
A dark launch deploys new code to production and processes real traffic through it — but discards the results. The user never sees the output of the new code path; it's only used to validate performance, resource consumption, and correctness under real load.
// Dark launch pattern: shadow execution with output comparison
// New recommendation engine runs in shadow mode
// Assumed to be defined elsewhere in the application:
// legacyRecommendationEngine, shadowRecommendationEngine, and a
// StatsD-style `metrics` client
const express = require('express');
const app = express();

async function getRecommendations(userId) {
  // Primary path: serves the response
  const primaryResult = await legacyRecommendationEngine(userId);

  // Dark launch: runs in the background without being awaited, so it adds
  // no latency to the response. The result is discarded, but metrics and
  // errors are tracked.
  shadowRecommendationEngine(userId)
    .then(shadowResult => {
      // Compare outputs for correctness validation
      const match = JSON.stringify(primaryResult.ids) ===
        JSON.stringify(shadowResult.ids);

      // Emit comparison metrics (not user-facing)
      metrics.increment('recommendations.shadow.executed');
      metrics.gauge('recommendations.shadow.match_rate',
        match ? 1 : 0);
      metrics.histogram('recommendations.shadow.latency_ms',
        shadowResult.latencyMs);
    })
    .catch(err => {
      // Shadow failures are logged but never affect users
      metrics.increment('recommendations.shadow.errors');
      console.error('Shadow recommendation failed:', err.message);
    });

  return primaryResult;
}

console.log("Dark launch pattern initialized");
Automated Rollback Strategies
Progressive delivery is only as safe as its rollback mechanism. Every strategy should have automated rollback triggers:
# Comprehensive rollback configuration for Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v2.5.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      canaryService: web-app-canary
      stableService: web-app-stable
      # Abort and rollback on analysis failure
      abortScaleDownDelaySeconds: 30
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-check
              - templateName: error-budget
            args:
              - name: service-name
                value: web-app
        - setWeight: 30
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-check
              - templateName: saturation-check
        - setWeight: 60
        - pause:
            duration: 15m
        - analysis:
            templates:
              - templateName: full-analysis-suite
Always set failureLimit and failureCondition in analysis templates. Set abortScaleDownDelaySeconds to give in-flight requests time to drain. Use scaleDownDelaySeconds in blue-green to keep the old version warm for fast rollback. Never rely solely on manual rollback: automated analysis should catch problems within minutes, not hours.
Conclusion & Next Steps
Progressive delivery transforms software releases from risky, all-or-nothing events into controlled, observable experiments. By combining deployment strategies (canary, blue-green), traffic management (Argo Rollouts, Flagger), feature flags (LaunchDarkly, OpenFeature), and automated analysis (Prometheus, Datadog), teams can ship faster with dramatically lower risk.
The key principles to carry forward:
- Decouple deploy from release — Code reaches production before users see it. Feature flags and traffic routing control visibility.
- Automate analysis — Define success criteria in analysis templates. Let metrics decide promotions, not humans watching dashboards.
- Manage flag lifecycle — Every feature flag has a creation date, owner, and removal deadline. Track flag debt like technical debt.
- Start with canary — Begin with simple weight-based canary releases before adding A/B testing or experimentation frameworks.
- Rollback is the default — Design for failure. Every rollout should abort automatically if analysis fails.
Next in the Series
In Part 12: GitOps at Scale, we'll explore monorepo vs polyrepo strategies, multi-environment promotion workflows, multi-cluster GitOps with ApplicationSets, and managing hundreds of microservices through Git-driven infrastructure.