Part 16: Deployment Strategies & Progressive Delivery

Introduction

Deployment is the moment where your carefully crafted code meets the real world. Every line of code you write, every test you run, every review you conduct — it all builds toward this single event: placing new software into production where real users will interact with it.

The goal is deceptively simple: get new code into production without breaking anything. In practice, this is one of the hardest problems in software engineering because production systems are complex, interconnected, and serving real traffic 24/7.

                            
Key Insight: The best deployment strategy is the one that makes deployment boring. If your deploys are exciting or stressful, your strategy needs improvement. Elite teams deploy hundreds of times per day with zero drama because their strategy provides safety through automation, observability, and reversibility.

The Three Deployment Goals

Every deployment strategy is evaluated against three goals, often in tension with each other:

Zero downtime — users never experience an outage during deployment
Instant rollback — if something goes wrong, revert within seconds, not minutes
Observability — know immediately whether the new version is healthy

Different strategies make different tradeoffs between these goals, plus additional factors like infrastructure cost, operational complexity, and team expertise. This article covers every major strategy so you can choose the right one for your context.

In-Place Deployment (Recreate Strategy)

The simplest deployment strategy: stop the old version, start the new version. This is Kubernetes' Recreate strategy and the default behaviour of many traditional deployment tools.

Recreate Deployment Strategy

sequenceDiagram
    participant LB as Load Balancer
    participant V1 as Version 1 (Old)
    participant V2 as Version 2 (New)
    Note over LB,V1: Normal traffic flow
    LB->>V1: User requests
    Note over V1: Stop all instances
    V1--xLB: Instances terminated
    Note over LB: DOWNTIME WINDOW
    Note over V2: Start new instances
    V2->>LB: Health check passes
    LB->>V2: Resume traffic
    Note over LB,V2: Normal traffic flow

How It Works

Take all instances of the current version offline
Deploy the new version to the same infrastructure
Start the new instances and wait for health checks to pass
Route traffic to the new version

The critical problem is obvious: there is a window where no instances are serving traffic. The downtime duration depends on how long it takes to start the new version — typically seconds to minutes.

# Kubernetes Recreate strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: Recreate
  selector:              # Required in apps/v1; must match the template labels
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2.0.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

When Recreate Is Acceptable

Development and testing environments — nobody cares about downtime in dev
Batch processing jobs — no real-time traffic to interrupt
Database migrations requiring exclusive access — sometimes you genuinely need to stop the application
Stateful applications with single-writer constraints — e.g., legacy systems that cannot tolerate two versions running simultaneously
Cost-constrained environments — no budget for duplicate infrastructure

                            
Warning: If you are using Recreate in production for user-facing services, you are accepting downtime as a business decision. For most modern web applications, this is unacceptable. Upgrade to rolling updates at minimum.

Rolling Updates

Rolling updates replace instances of the old version with the new version one at a time (or in small batches). At any point during the deployment, some instances are running the old version and some are running the new version. This eliminates downtime because there are always healthy instances serving traffic.

Rolling Update Progression

flowchart LR
    subgraph "Step 1"
        A1[v1] --- A2[v1] --- A3[v1] --- A4[v1]
    end
    subgraph "Step 2"
        B1[v2] --- B2[v1] --- B3[v1] --- B4[v1]
    end
    subgraph "Step 3"
        C1[v2] --- C2[v2] --- C3[v1] --- C4[v1]
    end
    subgraph "Step 4"
        D1[v2] --- D2[v2] --- D3[v2] --- D4[v2]
    end

Mechanics of a Rolling Update

Start a new instance with the updated version
Wait for health checks to confirm the new instance is ready
Route traffic to the new instance
Drain and terminate one old instance
Repeat until all instances are running the new version
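The loop described above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the fleet is an array of version strings, and `isHealthy` is a hypothetical stand-in for a real readiness check, not part of any orchestrator's API.

```javascript
// Simulates a rolling update that replaces one instance per iteration.
// isHealthy is a hypothetical callback standing in for a real health check.
function rollingUpdate(fleet, newVersion, isHealthy) {
  const current = fleet.slice();
  const history = [current.slice()];

  while (current.some((version) => version !== newVersion)) {
    // Start a new instance and wait for its health check to pass.
    if (!isHealthy(newVersion)) {
      // The rollout stops here; the remaining old instances keep serving.
      return { completed: false, history };
    }
    // The new instance takes traffic and one old instance is drained.
    const oldIndex = current.findIndex((version) => version !== newVersion);
    current[oldIndex] = newVersion;
    history.push(current.slice());
  }
  return { completed: true, history };
}

const result = rollingUpdate(['v1', 'v1', 'v1', 'v1'], 'v2', () => true);
console.log(result.history.map((step) => step.join(' ')).join('\n'));
```

Each entry in `history` corresponds to one stage of the rollout. If the health check fails, the function returns early with the old instances untouched, which is the property that makes rolling updates safer than Recreate.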

Configuration Parameters

Parameter	Description	Example
maxUnavailable	Maximum number of instances that can be unavailable during update	25% or 1
maxSurge	Maximum number of extra instances created above desired count	25% or 1
minReadySeconds	Time a new instance must be ready before proceeding	30
progressDeadlineSeconds	Time before a stalled rollout is considered failed	600
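To see how `maxUnavailable` and `maxSurge` bound a rollout, the arithmetic can be sketched as follows. The function names are hypothetical; the rounding rule (percentages round down for `maxUnavailable` and up for `maxSurge`) follows the Kubernetes Deployment documentation.

```javascript
// Resolves an absolute count (number) or a percentage string such as "25%"
// against the desired replica count.
function resolveCount(value, replicas, roundUp) {
  if (typeof value === 'number') return value;
  const exact = (parseFloat(value) * replicas) / 100; // "25%" of 10 -> 2.5
  return roundUp ? Math.ceil(exact) : Math.floor(exact);
}

// Returns the fewest pods that stay available and the most pods that may
// exist at once during the update.
function rolloutBounds(replicas, maxUnavailable, maxSurge) {
  const unavailable = resolveCount(maxUnavailable, replicas, false); // rounds down
  const surge = resolveCount(maxSurge, replicas, true);              // rounds up
  return {
    minAvailable: replicas - unavailable,
    maxTotal: replicas + surge,
  };
}

console.log(rolloutBounds(10, '25%', '25%'));
console.log(rolloutBounds(4, 1, 1));
```

With 10 replicas and both parameters at 25%, at least 8 pods stay available and at most 13 exist at once. With 4 replicas and both set to 1, the bounds are 3 and 5.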

Kubernetes Rolling Update Configuration

# Kubernetes RollingUpdate strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # At most 1 pod unavailable
      maxSurge: 1          # At most 1 extra pod during update
  selector:                # Required in apps/v1; must match the template labels
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2.0.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
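The probes in this manifest poll `/health` on port 8080. A minimal sketch of what the application side of that contract could look like is shown here, assuming a plain Node.js HTTP server. `createHealthHandler` and the `isReady` callback are hypothetical names chosen for illustration.

```javascript
// Returns a request handler compatible with Node's built-in http server.
// isReady is a hypothetical callback that reports whether the app's
// dependencies (database, caches, and so on) are usable.
function createHealthHandler(isReady) {
  return (req, res) => {
    if (req.url !== '/health') {
      res.writeHead(404);
      res.end();
      return;
    }
    const ready = isReady();
    // An httpGet probe treats any status from 200 to 399 as success.
    res.writeHead(ready ? 200 : 503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: ready ? 'ok' : 'unavailable' }));
  };
}

// Wiring it up (left as a comment so the sketch does not open a port):
//   require('node:http').createServer(createHealthHandler(() => true)).listen(8080);
```

Returning 503 until dependencies are usable is what lets the readiness probe keep traffic away from a pod that has started but cannot yet serve requests.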

Rollback Triggers

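Kubernetes stops replacing old pods when new pods fail their readiness checks, so a bad rollout stalls instead of spreading; once progressDeadlineSeconds elapses, the rollout is reported as failed. Kubernetes does not revert to the previous version on its own, so the rollback itself is triggered manually or by your automation: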

# Check rollout status
kubectl rollout status deployment/my-app

# Rollback to previous version
kubectl rollout undo deployment/my-app

# Rollback to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3

# View rollout history
kubectl rollout history deployment/my-app

                            
Best Practice: Always configure readinessProbe on your containers. Without it, Kubernetes considers a pod ready as soon as it starts, which can route traffic to instances that haven't finished initialising. A failing readiness probe prevents traffic from reaching unhealthy pods and halts the rolling update.

Blue-Green Deployments

Blue-green deployment maintains two identical production environments — called "blue" and "green." At any time, only one environment serves live traffic (say, blue). You deploy the new version to the idle environment (green), verify it thoroughly, then switch all traffic from blue to green in one atomic operation.

Blue-Green Deployment Architecture

flowchart TD
    Users[Users] --> LB[Load Balancer / DNS]
    LB -->|"Active (100%)"| Blue[Blue Environment - v1.0]
    LB -.->|"Idle (0%)"| Green[Green Environment - v2.0]
    Blue --> DB[(Shared Database)]
    Green --> DB
    Note1["Switch: Change LB target\nfrom Blue → Green"] -.-> LB

The Blue-Green Process

Deploy v2.0 to Green — the idle environment receives the new version
Run smoke tests against Green — verify the deployment is healthy before exposing users
Switch traffic — update the load balancer or DNS to point at Green
Monitor — watch error rates, latency, and business metrics
If problems occur — switch back to Blue (instant rollback)
If stable — Blue becomes the idle environment for the next deployment
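The decision logic of this process can be sketched in a few lines. This is an illustration only: `router` is a hypothetical object holding the name of the live environment, and `smokeTest` and `isHealthyAfterSwitch` are injected callbacks standing in for real test suites and monitoring.

```javascript
// Performs one blue-green cutover and reports what happened.
// All collaborators are hypothetical stand-ins passed in by the caller.
function blueGreenCutover(router, { smokeTest, isHealthyAfterSwitch }) {
  const previous = router.active;
  const candidate = previous === 'blue' ? 'green' : 'blue';

  // Verify the idle environment before any user sees it.
  if (!smokeTest(candidate)) {
    return { active: router.active, outcome: 'aborted' };
  }

  // Switch all traffic in a single step.
  router.active = candidate;

  // Monitor, and switch back if the new version misbehaves.
  if (!isHealthyAfterSwitch(candidate)) {
    router.active = previous;
    return { active: router.active, outcome: 'rolled-back' };
  }

  // The previous environment is now idle and ready for the next deployment.
  return { active: router.active, outcome: 'promoted' };
}

const router = { active: 'blue' };
console.log(blueGreenCutover(router, {
  smokeTest: () => true,
  isHealthyAfterSwitch: () => true,
}));
```

Because the previous environment is left untouched until the next deployment, rolling back is just the same switch in the opposite direction.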

Traffic Switching Mechanisms

Mechanism	Switch Speed	Granularity	Considerations
DNS-based	Minutes (TTL dependent)	All-or-nothing	Simple but slow; DNS caching can delay switch
Load balancer	Seconds	All-or-nothing or weighted	Fast; requires an LB that supports target group switching
Service mesh	Sub-second	Percentage-based	Most flexible; requires Istio/Linkerd infrastructure
Kubernetes Service	Seconds	Label selector switch	Simple if already on Kubernetes

# Kubernetes blue-green via Service selector switch
# Step 1: Service points to blue
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue    # Switch this to "green" to cutover
  ports:
  - port: 80
    targetPort: 8080

---
# Step 2: Deploy green alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
      version: green
  template:
    metadata:
      labels:
        app: my-app
        version: green
    spec:
      containers:
      - name: my-app
        image: my-app:v2.0.0

Cost Implications

The primary drawback of blue-green is double the infrastructure cost. You maintain two full production environments. Mitigation strategies include:

Use auto-scaling to keep the idle environment at minimum capacity until switch
Scale up the idle environment just before the switch, scale down after stability is confirmed
Use spot/preemptible instances for the idle environment during testing
Share stateful backing services (databases, caches) between environments instead of duplicating them

Case Study

Amazon's Blue-Green at Scale

Amazon pioneered blue-green deployments at massive scale. Their internal deployment system, Apollo, manages deployments across millions of hosts. Each service maintains two deployment groups. When deploying, the new version is deployed to the inactive group, verified via automated tests and canary checks, then traffic is shifted via internal load balancers. The ability to instantly revert to the previous group has prevented countless customer-facing incidents. Amazon reported that this approach, combined with automated rollback triggered by CloudWatch alarms, reduced their mean time to recovery from hours to single-digit minutes.


Canary Releases

Named after the coal miners' practice of taking canaries into mines to detect toxic gases, canary releases route a small percentage of production traffic to the new version. If the canary is healthy (no elevated error rates, latency within bounds), traffic is gradually increased. If problems emerge, the canary is terminated and all traffic returns to the stable version.

Canary Release Traffic Progression

flowchart TD
    A["Deploy canary (1% traffic)"] --> B{Metrics healthy?}
    B -->|Yes| C["Increase to 5%"]
    C --> D{Metrics healthy?}
    D -->|Yes| E["Increase to 25%"]
    E --> F{Metrics healthy?}
    F -->|Yes| G["Promote to 100%"]
    B -->|No| H["Rollback immediately"]
    D -->|No| H
    F -->|No| H

Canary Analysis Metrics

Automated canary analysis compares the canary version against the stable baseline using key metrics:

Error rate — HTTP 5xx responses, exception counts, error logs
Latency — P50, P95, P99 response times
Saturation — CPU, memory, connection pool utilisation
Business metrics — conversion rates, cart abandonment, revenue per request

Tools for Canary Deployments

Tool	Platform	Key Feature
Argo Rollouts	Kubernetes	Native K8s CRD with automated analysis
Flagger	Kubernetes + Service Mesh	Progressive delivery with Istio/Linkerd/Contour
Spinnaker	Multi-cloud	Full deployment orchestration with Kayenta analysis
AWS CodeDeploy	AWS	Managed canary with CloudWatch integration

# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  # selector and pod template omitted for brevity; a real Rollout needs them
  strategy:
    canary:
      steps:
      - setWeight: 5        # 5% traffic to canary
      - pause:
          duration: 5m      # Wait 5 minutes
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 25       # Increase to 25%
      - pause:
          duration: 10m
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50       # Increase to 50%
      - pause:
          duration: 10m
      - setWeight: 100      # Full promotion

---
# Analysis template: checks error rate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 60s
    successCondition: result[0] >= 0.99
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

Feature Flags

Feature flags (also called feature toggles) decouple deployment from release. You deploy code that contains new functionality, but the functionality is hidden behind a conditional check. You can then enable it for specific users, a percentage of traffic, or all users — without deploying again.

Flag Types

Flag Type	Lifespan	Purpose	Example
Release flag	Days to weeks	Hide incomplete features until ready	`new-checkout-flow`
Experiment flag	Days to weeks	A/B test different implementations	`pricing-page-variant-b`
Ops flag	Permanent	Circuit breaker or kill switch	`disable-email-notifications`
Permission flag	Permanent	Entitlement or access control	`premium-analytics-dashboard`

Implementation Pattern

// Feature flag implementation using a flag service
const featureFlags = require('./feature-flag-client');

async function handleCheckout(req, res) {
  const userId = req.user.id;

  // Check if user should see new checkout flow
  const useNewCheckout = await featureFlags.isEnabled(
    'new-checkout-flow',
    { userId, country: req.user.country, plan: req.user.plan }
  );

  if (useNewCheckout) {
    return renderNewCheckoutFlow(req, res);
  }

  return renderLegacyCheckoutFlow(req, res);
}

// Flag evaluation can use:
// - User ID (target specific users)
// - Percentage (random 5% of traffic)
// - Attributes (country, plan, device)
// - Environment (staging vs production)
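Percentage-based evaluation is usually made deterministic so that a given user gets the same answer on every request. One common way to do this, sketched here as an assumption rather than as any vendor's actual algorithm, is to hash the flag name together with the user ID into a bucket from 0 to 99. The function names are hypothetical.

```javascript
const crypto = require('node:crypto');

// Maps (flagName, userId) to a stable bucket in the range [0, 100).
// Including the flag name means each flag selects a different slice of users.
function bucketFor(flagName, userId) {
  const digest = crypto
    .createHash('sha256')
    .update(`${flagName}:${userId}`)
    .digest();
  return digest.readUInt32BE(0) % 100;
}

// True when the user falls inside the rollout percentage for this flag.
function isEnabledForPercentage(flagName, userId, percentage) {
  return bucketFor(flagName, userId) < percentage;
}

console.log(isEnabledForPercentage('new-checkout-flow', 'user-42', 5));
```

Because the bucket never changes for a given user, raising the percentage from 5 to 25 only adds users; nobody who already had the feature loses it.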

Feature Flag Tools

Tool	Type	Strengths
LaunchDarkly	SaaS	Real-time flag updates, targeting rules, audit trail, SDKs for 25+ languages
Unleash	Open source	Self-hosted, strategy-based activation, GitOps-friendly
Flagsmith	Open source / SaaS	Remote config + flags, segment-based targeting
Split.io	SaaS	Statistical engine for experiment flags, impact analysis

                            
Feature Flag Debt: Flags that outlive their purpose become technical debt. Establish a hygiene practice: every release flag should have an expiry date. After the feature is fully launched, remove the flag and the old code path. Stale flags increase complexity and make the codebase harder to reason about.

A/B Testing

A/B testing (also called split testing) randomly assigns users to groups and compares how different implementations perform, with groups large enough for the observed difference to be statistically significant. Unlike canary releases (which test for safety), A/B tests measure business outcomes: which version converts better, retains more users, or generates more revenue.

A/B Testing vs Canary Releases

Dimension	Canary Release	A/B Test
Goal	Validate that new code doesn't break anything	Determine which variant performs better
Metrics	Error rates, latency, CPU	Conversion, revenue, engagement
Duration	Minutes to hours	Days to weeks (for statistical significance)
Traffic split	Increases over time (1% → 100%)	Fixed (typically 50/50)
Outcome	Promote or rollback	Choose winner based on data

Statistical Significance

An A/B test is only valid when the results are statistically significant — meaning the observed difference is unlikely to be due to random chance. Key concepts:

Sample size — enough users must see each variant (typically thousands)
Confidence level — typically 95% (p-value < 0.05)
Effect size — the minimum detectable difference you care about
Duration — run the test long enough to cover weekly patterns (at least 1-2 business cycles)
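To make the role of sample size concrete, here is a sketch of a pooled two-proportion z-test, one standard way to compare two conversion rates. The numbers are invented for illustration, and a real experiment platform may use a different statistical method.

```javascript
// Pooled two-proportion z statistic for conversion counts in variants A and B.
function twoProportionZ(conversionsA, usersA, conversionsB, usersB) {
  const rateA = conversionsA / usersA;
  const rateB = conversionsB / usersB;
  const pooled = (conversionsA + conversionsB) / (usersA + usersB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / usersA + 1 / usersB),
  );
  return (rateB - rateA) / standardError;
}

// Critical value for a two-sided test at the 5% significance level.
const Z_CRITICAL_95 = 1.96;

function describe(z) {
  return Math.abs(z) > Z_CRITICAL_95 ? 'significant' : 'not significant';
}

// 5.0% vs 5.8% conversion with 10,000 users per variant.
const largeSample = twoProportionZ(500, 10000, 580, 10000);
// 5% vs 6% conversion with only 100 users per variant.
const smallSample = twoProportionZ(5, 100, 6, 100);

console.log(largeSample.toFixed(2), describe(largeSample));
console.log(smallSample.toFixed(2), describe(smallSample));
```

The first comparison gives a z statistic of roughly 2.5 and clears the 1.96 threshold. The second shows a larger relative lift but stays far below the threshold, because 100 users per variant is too few to separate the difference from noise.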

                            
Common Mistake: Ending an A/B test early because one variant "looks better" after a few hours. This is called peeking and leads to false positives. Always define your required sample size upfront and commit to running the test to completion.

Dark Launches

A dark launch deploys new code to production and exercises it with real traffic, but the results are never shown to users. The old code path still serves the actual response. The new code path runs in parallel (or asynchronously), and its output is compared against the old path or simply discarded after measurement.

Use Cases for Dark Launches

Performance validation — will the new code handle production load without degradation?
Correctness verification — compare outputs of old and new implementations under real data
Database migration testing — write to both old and new databases, compare results
ML model validation — run new model predictions alongside production model, measure accuracy

// Dark launch pattern: shadow execution
async function searchProducts(query) {
  // Primary path: serves the actual response
  const primaryResult = await legacySearchEngine.search(query);

  // Dark path: new implementation, result is measured and then discarded
  // Not awaited, so it adds no latency to the user's request
  const darkStart = Date.now();
  newSearchEngine.search(query)
    .then(darkResult => {
      // Compare results for correctness
      metrics.recordComparison({
        query,
        primaryCount: primaryResult.length,
        darkCount: darkResult.length,
        overlap: calculateOverlap(primaryResult, darkResult),
        // Measured here, because the result list carries no timing information
        darkLatencyMs: Date.now() - darkStart
      });
    })
    .catch(err => {
      // Dark path errors never affect users
      metrics.recordDarkFailure({ query, error: err.message });
    });

  // Always return the primary result
  return primaryResult;
}
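The snippet above calls `calculateOverlap` without defining it. One possible implementation, offered as a sketch, is the Jaccard similarity of the result IDs. It assumes each result object has a unique `id` field, which the original code does not specify.

```javascript
// Jaccard similarity of result IDs: shared IDs divided by all distinct IDs.
// Assumes every result object has a unique `id` field (hypothetical shape).
function calculateOverlap(primaryResult, darkResult) {
  const primaryIds = new Set(primaryResult.map((item) => item.id));
  const darkIds = new Set(darkResult.map((item) => item.id));

  // Two empty result sets are treated as identical.
  if (primaryIds.size === 0 && darkIds.size === 0) return 1;

  let shared = 0;
  for (const id of primaryIds) {
    if (darkIds.has(id)) shared += 1;
  }
  return shared / (primaryIds.size + darkIds.size - shared);
}

console.log(calculateOverlap(
  [{ id: 1 }, { id: 2 }, { id: 3 }],
  [{ id: 2 }, { id: 3 }, { id: 4 }],
));
```

A value of 1 means both engines returned the same set of items and 0 means they share nothing. This measure ignores ranking order, so a search rewrite that cares about ordering would need a stricter comparison.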

Case Study

GitHub's Scientist Library

GitHub open-sourced their dark launch framework as the Scientist library (written in Ruby, with community ports to Python and other languages). When rewriting critical code paths (like their permissions system), they used Scientist to run both implementations simultaneously on every request. The new implementation's output was compared against the old one, and any discrepancies were logged and investigated. This allowed them to rewrite core systems with confidence: before switching over, they had evidence that the new code produced the same results as the old code under real production conditions. The library tracked mismatch rates, latency differences, and exception counts, giving engineers visibility into the new code's behaviour without any user impact.


Progressive Delivery

Progressive delivery is the orchestration of multiple strategies into a unified, automated promotion pipeline. Rather than choosing one strategy, you combine them in sequence: deploy dark code behind a flag → enable for 1% as a canary → expand to 10% → run A/B test at 50% → promote to 100% GA — all automated with SLO-based gates.

The Progressive Delivery Pipeline

Deploy — code ships to production behind a feature flag (users see nothing)
Internal testing — enable for internal employees (dogfooding)
Canary (1%) — enable for 1% of external traffic, monitor SLOs
Expand (10%) — if canary metrics pass, increase to 10%
Expand (50%) — optionally run A/B test at this point
GA (100%) — full rollout to all users
Cleanup — remove feature flag, delete old code path

At every step, automated analysis monitors error rates, latency, and business metrics. If any SLO is breached, the system automatically reverts to the previous step.
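The promote-or-revert rule can be sketched as a tiny state machine. The stage names below are labels invented for this illustration, and `slosHealthy` stands in for the outcome of whatever automated analysis gates each step.

```javascript
// Ordered rollout stages, mirroring the pipeline steps before cleanup.
// The stage names are hypothetical labels for this sketch.
const STAGES = [
  'deployed-dark',
  'internal',
  'canary-1',
  'expand-10',
  'expand-50',
  'ga-100',
];

// Returns the stage to move to after one evaluation window.
function nextStage(current, slosHealthy) {
  const index = STAGES.indexOf(current);
  if (index === -1) {
    throw new Error(`unknown stage: ${current}`);
  }
  if (!slosHealthy) {
    // An SLO breach reverts to the previous step (or stays at the first).
    return STAGES[Math.max(index - 1, 0)];
  }
  // Healthy metrics advance one step (or stay at full rollout).
  return STAGES[Math.min(index + 1, STAGES.length - 1)];
}

console.log(nextStage('canary-1', true));
console.log(nextStage('expand-10', false));
```

Keeping the rule this simple makes the pipeline's behaviour predictable: exposure only ever changes by one step at a time, in either direction.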

                            
Industry Trend: Progressive delivery is becoming the standard for large-scale deployments. Companies like Netflix, LinkedIn, and Spotify all use variations of this pattern. The key enabler is the combination of feature flag infrastructure + canary analysis + automated SLO gates.

Database Schema Migrations

Database schema changes are the hardest part of zero-downtime deployment. Application code is stateless and easily replaced, but database schemas are shared state that both the old and new application versions must work with simultaneously during a rolling deployment.

The Expand-and-Contract Pattern

The solution is to never make breaking schema changes in one step. Instead, use a three-phase approach:

Expand-and-Contract Migration Pattern

flowchart LR
    subgraph "Phase 1: Expand"
        A[Add new column\nkeep old column\nboth nullable]
    end
    subgraph "Phase 2: Migrate"
        B[Deploy app v2\nwrites to both columns\nbackfill old data]
    end
    subgraph "Phase 3: Contract"
        C[Remove old column\nafter all code uses new]
    end
    A --> B --> C

Safe Migration Rules

Never rename a column — add new column, migrate data, drop old column
Never change a column type — add new typed column, migrate data, drop old
Never add a NOT NULL column without a default — existing rows will violate the constraint
Never drop a column that old code still reads — ensure all running code has been updated first
Always make migrations reversible — every up migration should have a corresponding down

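-- Example: Renaming "username" to "display_name" safely

-- Phase 1: Expand (add the new nullable column)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Phase 2: Migrate
-- Deploy app v2 first: it writes to both columns and reads display_name,
-- falling back to username for rows that are not yet backfilled
-- Then backfill the rows written before v2 (in batches on large tables)
UPDATE users SET display_name = username WHERE display_name IS NULL;

-- Phase 3: Contract
-- Deploy a version that no longer references username and wait until
-- every instance runs it; only then remove the old column
ALTER TABLE users DROP COLUMN username;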
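While both columns exist, the application has to tolerate either one being the source of truth. A minimal sketch of that transitional behaviour follows; the function names and the plain-object row shape are assumptions for illustration, not tied to any particular ORM or driver.

```javascript
// Transitional write: populate both columns so that old and new application
// versions each find the value where they expect it.
function buildUserUpdate(displayName) {
  return { username: displayName, display_name: displayName };
}

// Transitional read: prefer the new column, but fall back to the old one
// for rows that have not been backfilled yet (display_name is still NULL).
function readDisplayName(row) {
  return row.display_name ?? row.username;
}

console.log(buildUserUpdate('ada'));
console.log(readDisplayName({ username: 'grace', display_name: null }));
```

Once the backfill has finished and no running version reads `username`, both the dual write and the fallback can be deleted along with the old column.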

                            
Critical Warning: Each phase must be a separate deployment. You cannot expand and contract in the same release because during a rolling update, some pods will be running the old code (which expects the old column) while new pods are running the updated code. The expand phase ensures both versions can coexist.

Strategy Comparison

Strategy	Downtime	Rollback Speed	Resource Cost	Complexity	Observability Requirement
Recreate	Yes	Slow (redeploy old)	Low (1× infra)	Very low	Minimal
Rolling Update	No	Medium (rollback pods)	Low (1× + surge)	Low	Health checks required
Blue-Green	No	Instant (switch back)	High (2× infra)	Medium	Smoke tests + monitoring
Canary	No	Fast (kill canary)	Low (+1 instance)	High	Sophisticated metrics + analysis
Feature Flags	No	Instant (toggle off)	None (same deploy)	Medium (flag management)	Flag-aware metrics
A/B Test	No	Instant (toggle off)	None (same deploy)	High (statistical analysis)	Business metrics + significance
Dark Launch	No	N/A (not user-facing)	Low (CPU for shadow)	Medium	Comparison metrics
Progressive	No	Automated (SLO gates)	Medium	Very high	Full observability stack

Exercises

                            
Exercise 1 — Choose a Strategy: Your team runs an e-commerce platform with 50,000 concurrent users. You need to deploy a new payment processing system. Which deployment strategy (or combination) would you choose? Justify your choice considering rollback requirements, downtime tolerance, and the criticality of the payment system.

Exercise 2 — Design a Canary Pipeline: Write an Argo Rollouts manifest that deploys a canary with the following progression: 2% for 5 minutes → 10% for 10 minutes → 50% for 15 minutes → 100%. Include an AnalysisTemplate that checks both error rate (<1%) and P99 latency (<500ms).

Exercise 3 — Safe Database Migration: You need to rename a table column from email_address to primary_email on a table with 10 million rows, while maintaining zero downtime. Write the three-phase migration plan including SQL statements and the application code changes needed at each phase.

Exercise 4 — Feature Flag Architecture: Design a feature flag system for your team. Define: (a) where flag definitions are stored, (b) how flags are evaluated at runtime, (c) how stale flags are identified and cleaned up, and (d) how you would audit who changed a flag and when.

Conclusion & Next Steps

Deployment strategy is not a one-size-fits-all decision. The right choice depends on your traffic volume, risk tolerance, infrastructure budget, team expertise, and the criticality of the system being deployed. Most mature organisations combine multiple strategies — using rolling updates as the default, feature flags for risky features, and canary analysis for critical services.

The key principles to remember: always have a rollback path, always monitor what you deploy, and always decouple deployment from release. If you internalise these three principles, you can adapt to any deployment tool or platform.

Next in the Series

In Part 17: Release Engineering & GitOps, we will explore how to manage releases at scale — semantic versioning, automated changelogs, GitOps with Argo CD and Flux, release trains, and the governance processes that keep large teams shipping safely.
