
Part 16: Deployment Strategies & Progressive Delivery

May 13, 2026 Wasil Zafar 44 min read

The moment of truth in software delivery — getting new code into production without breaking anything. This article covers every major deployment strategy, from simple in-place updates to sophisticated progressive delivery pipelines that combine canary releases, feature flags, and automated rollback.

Table of Contents

  1. Introduction
  2. In-Place (Recreate)
  3. Rolling Updates
  4. Blue-Green Deployments
  5. Canary Releases
  6. Feature Flags
  7. A/B Testing
  8. Dark Launches
  9. Progressive Delivery
  10. Database Schema Migrations
  11. Strategy Comparison
  12. Exercises
  13. Conclusion & Next Steps

Introduction

Deployment is the moment where your carefully crafted code meets the real world. Every line of code you write, every test you run, every review you conduct — it all builds toward this single event: placing new software into production where real users will interact with it.

The goal is deceptively simple: get new code into production without breaking anything. In practice, this is one of the hardest problems in software engineering because production systems are complex, interconnected, and serving real traffic 24/7.

Key Insight: The best deployment strategy is the one that makes deployment boring. If your deploys are exciting or stressful, your strategy needs improvement. Elite teams deploy hundreds of times per day with zero drama because their strategy provides safety through automation, observability, and reversibility.

The Three Deployment Goals

Every deployment strategy is evaluated against three goals, often in tension with each other:

  • Zero downtime — users never experience an outage during deployment
  • Instant rollback — if something goes wrong, revert within seconds, not minutes
  • Observability — know immediately whether the new version is healthy

Different strategies make different tradeoffs between these goals, plus additional factors like infrastructure cost, operational complexity, and team expertise. This article covers every major strategy so you can choose the right one for your context.

In-Place Deployment (Recreate Strategy)

The simplest deployment strategy: stop the old version, start the new version. This is Kubernetes' Recreate strategy and the default behaviour of many traditional deployment tools.

Recreate Deployment Strategy
sequenceDiagram
    participant LB as Load Balancer
    participant V1 as Version 1 (Old)
    participant V2 as Version 2 (New)
    Note over LB,V1: Normal traffic flow
    LB->>V1: User requests
    Note over V1: Stop all instances
    V1--xLB: Instances terminated
    Note over LB: DOWNTIME WINDOW
    Note over V2: Start new instances
    V2->>LB: Health check passes
    LB->>V2: Resume traffic
    Note over LB,V2: Normal traffic flow
                            

How It Works

  1. Take all instances of the current version offline
  2. Deploy the new version to the same infrastructure
  3. Start the new instances and wait for health checks to pass
  4. Route traffic to the new version

The critical problem is obvious: there is a window where no instances are serving traffic. The downtime duration depends on how long it takes to start the new version — typically seconds to minutes.

# Kubernetes Recreate strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: Recreate
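  selector:              # required by apps/v1; must match the pod labels
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app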
    spec:
      containers:
      - name: my-app
        image: my-app:v2.0.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

When Recreate Is Acceptable

  • Development and testing environments — nobody cares about downtime in dev
  • Batch processing jobs — no real-time traffic to interrupt
  • Database migrations requiring exclusive access — sometimes you genuinely need to stop the application
  • Stateful applications with single-writer constraints — e.g., legacy systems that cannot tolerate two versions running simultaneously
  • Cost-constrained environments — no budget for duplicate infrastructure
Warning: If you are using Recreate in production for user-facing services, you are accepting downtime as a business decision. For most modern web applications, this is unacceptable. Upgrade to rolling updates at minimum.

Rolling Updates

Rolling updates replace instances of the old version with the new version one at a time (or in small batches). At any point during the deployment, some instances are running the old version and some are running the new version. This eliminates downtime because there are always healthy instances serving traffic.

Rolling Update Progression
flowchart LR
    subgraph "Step 1"
        A1[v1] --- A2[v1] --- A3[v1] --- A4[v1]
    end
    subgraph "Step 2"
        B1[v2] --- B2[v1] --- B3[v1] --- B4[v1]
    end
    subgraph "Step 3"
        C1[v2] --- C2[v2] --- C3[v1] --- C4[v1]
    end
    subgraph "Step 4"
        D1[v2] --- D2[v2] --- D3[v2] --- D4[v2]
    end
                            

Mechanics of a Rolling Update

  1. Start a new instance with the updated version
  2. Wait for health checks to confirm the new instance is ready
  3. Route traffic to the new instance
  4. Drain and terminate one old instance
  5. Repeat until all instances are running the new version
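The steps above can be sketched as a small simulation. This is a minimal model assuming one surge instance and one drain per iteration; `simulateRollingUpdate` is a hypothetical helper, not part of any Kubernetes API.

```javascript
// Minimal simulation of a surge-one, drain-one rolling update.
// simulateRollingUpdate is a hypothetical helper for illustration only.
function simulateRollingUpdate(replicas) {
  const timeline = [{ v1: replicas, v2: 0 }];
  let v1 = replicas;
  let v2 = 0;

  while (v1 > 0) {
    v2 += 1; // steps 1-3: start a new instance, wait for readiness, route traffic
    timeline.push({ v1, v2 }); // temporarily one instance above the desired count
    v1 -= 1; // step 4: drain and terminate one old instance
    timeline.push({ v1, v2 });
  }
  return timeline;
}

for (const { v1, v2 } of simulateRollingUpdate(4)) {
  console.log(`v1=${v1} v2=${v2} total=${v1 + v2}`);
}
```

In this model the total never drops below the desired replica count, which is why the rollout causes no downtime.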

Configuration Parameters

| Parameter | Description | Example |
|---|---|---|
| maxUnavailable | Maximum number of instances that can be unavailable during update | 25% or 1 |
| maxSurge | Maximum number of extra instances created above desired count | 25% or 1 |
| minReadySeconds | Time a new instance must be ready before proceeding | 30 |
| progressDeadlineSeconds | Time before a stalled rollout is considered failed | 600 |
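To see what maxUnavailable and maxSurge mean in pod counts, the sketch below resolves them against a replica count. It relies on Kubernetes rounding percentages down for maxUnavailable and up for maxSurge; the function names are hypothetical.

```javascript
// Resolve an absolute count or a percentage string against the replica count.
// Kubernetes rounds maxSurge up and maxUnavailable down for percentages.
function resolveValue(value, replicas, roundUp) {
  if (typeof value === 'number') return value;
  const fraction = (parseFloat(value) / 100) * replicas;
  return roundUp ? Math.ceil(fraction) : Math.floor(fraction);
}

// Hypothetical helper: the pod-count envelope a rollout stays within.
function rollingUpdateBounds(replicas, maxUnavailable, maxSurge) {
  const unavailable = resolveValue(maxUnavailable, replicas, false);
  const surge = resolveValue(maxSurge, replicas, true);
  return {
    minAvailable: replicas - unavailable, // fewest pods serving traffic
    maxTotal: replicas + surge,           // most pods existing at once
  };
}

console.log(rollingUpdateBounds(4, 1, 1));
console.log(rollingUpdateBounds(10, '25%', '25%'));
```

With 10 replicas at 25% each, the rollout keeps at least 8 pods available and never runs more than 13 in total.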

Kubernetes Rolling Update Configuration

# Kubernetes RollingUpdate strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # At most 1 pod unavailable
      maxSurge: 1          # At most 1 extra pod during update
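  selector:              # required by apps/v1; must match the pod labels
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app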
    spec:
      containers:
      - name: my-app
        image: my-app:v2.0.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

Rollback Triggers

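Kubernetes stalls a rolling update when new pods fail their readiness checks: the old pods keep serving, and once progressDeadlineSeconds elapses the rollout is marked as failed. It does not revert on its own, so you trigger the rollback yourself (or delegate it to a controller such as Argo Rollouts):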

# Check rollout status
kubectl rollout status deployment/my-app

# Rollback to previous version
kubectl rollout undo deployment/my-app

# Rollback to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3

# View rollout history
kubectl rollout history deployment/my-app
Best Practice: Always configure readinessProbe on your containers. Without it, Kubernetes considers a pod ready as soon as it starts, which can route traffic to instances that haven't finished initialising. A failing readiness probe prevents traffic from reaching unhealthy pods and halts the rolling update.
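On the application side, the readiness probe needs an endpoint that only reports success once startup has finished. A minimal sketch, assuming a plain Node.js HTTP server; the `state` object and helper names are hypothetical.

```javascript
const http = require('http');

// Hypothetical readiness state: flipped to true once startup work is done.
const state = { ready: false };

// Status code for /health. Kubernetes treats 200-399 as success,
// so 503 keeps the pod out of the Service until it is ready.
function healthStatus(appState) {
  return appState.ready ? 200 : 503;
}

function createServer(appState) {
  return http.createServer((req, res) => {
    if (req.url === '/health') {
      res.writeHead(healthStatus(appState));
      res.end();
      return;
    }
    res.writeHead(200);
    res.end('hello');
  });
}

// Call this after initialisation (cache warm-up, DB connections) completes.
function markReady(appState) {
  appState.ready = true;
}

// To serve on the port used in the manifests:
//   createServer(state).listen(8080);
//   ...then markReady(state) once startup has finished.
```

Keeping the readiness decision in one small function makes it easy to extend later, for example to report not-ready while the process is draining before shutdown.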

Blue-Green Deployments

Blue-green deployment maintains two identical production environments — called "blue" and "green." At any time, only one environment serves live traffic (say, blue). You deploy the new version to the idle environment (green), verify it thoroughly, then switch all traffic from blue to green in one atomic operation.

Blue-Green Deployment Architecture
flowchart TD
    Users[Users] --> LB[Load Balancer / DNS]
    LB -->|"Active (100%)"| Blue[Blue Environment - v1.0]
    LB -.->|"Idle (0%)"| Green[Green Environment - v2.0]
    Blue --> DB[(Shared Database)]
    Green --> DB
    Note1["Switch: Change LB target\nfrom Blue → Green"] -.-> LB
                            

The Blue-Green Process

  1. Deploy v2.0 to Green — the idle environment receives the new version
  2. Run smoke tests against Green — verify the deployment is healthy before exposing users
  3. Switch traffic — update the load balancer or DNS to point at Green
  4. Monitor — watch error rates, latency, and business metrics
  5. If problems occur — switch back to Blue (instant rollback)
  6. If stable — Blue becomes the idle environment for the next deployment
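The process above boils down to moving a single pointer between two environments. The sketch below models that in memory; `BlueGreenRouter` and its methods are hypothetical names, and a real switch would update a load balancer or Service rather than a field.

```javascript
// Hypothetical in-memory model of a blue-green router.
class BlueGreenRouter {
  constructor() {
    this.environments = { blue: 'v1.0', green: null };
    this.active = 'blue';
  }

  get idle() {
    return this.active === 'blue' ? 'green' : 'blue';
  }

  get liveVersion() {
    return this.environments[this.active];
  }

  // Step 1: deploy the new version to the idle environment only.
  deployToIdle(version) {
    this.environments[this.idle] = version;
  }

  // Steps 2-3: switch traffic only if the smoke test on the idle side passes.
  switchIfHealthy(smokeTest) {
    if (!smokeTest(this.environments[this.idle])) return false;
    this.active = this.idle;
    return true;
  }

  // Step 5: rollback is the same pointer flip in the other direction.
  rollback() {
    this.active = this.idle;
  }
}

const router = new BlueGreenRouter();
router.deployToIdle('v2.0');
router.switchIfHealthy((version) => version === 'v2.0');
console.log(router.active, router.liveVersion);
router.rollback();
console.log(router.active, router.liveVersion);
```

Because the previous version is still deployed on the other side, rollback costs no more than the original cutover.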

Traffic Switching Mechanisms

| Mechanism | Switch Speed | Granularity | Considerations |
|---|---|---|---|
| DNS-based | Minutes (TTL dependent) | All-or-nothing | Simple but slow; DNS caching can delay switch |
| Load balancer | Seconds | All-or-nothing or weighted | Fast and widely available; requires an LB that supports target group switching |
| Service mesh | Sub-second | Percentage-based | Most flexible; requires Istio/Linkerd infrastructure |
| Kubernetes Service | Seconds | Label selector switch | Simple if already on Kubernetes |

# Kubernetes blue-green via Service selector switch
# Step 1: Service points to blue
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue    # Switch this to "green" to cut over
  ports:
  - port: 80
    targetPort: 8080

---
# Step 2: Deploy green alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
      version: green
  template:
    metadata:
      labels:
        app: my-app
        version: green
    spec:
      containers:
      - name: my-app
        image: my-app:v2.0.0

Cost Implications

The primary drawback of blue-green is double the infrastructure cost. You maintain two full production environments. Mitigation strategies include:

  • Use auto-scaling to keep the idle environment at minimum capacity until switch
  • Scale up the idle environment just before the switch, scale down after stability is confirmed
  • Use spot/preemptible instances for the idle environment during testing
  • Share stateful infrastructure (databases, caches) between environments so that only the stateless compute tier is duplicated
Case Study

Amazon's Blue-Green at Scale

Amazon pioneered blue-green deployments at massive scale. Their internal deployment system, Apollo, manages deployments across millions of hosts. Each service maintains two deployment groups. When deploying, the new version is deployed to the inactive group, verified via automated tests and canary checks, then traffic is shifted via internal load balancers. The ability to instantly revert to the previous group has prevented countless customer-facing incidents. Amazon reported that this approach, combined with automated rollback triggered by CloudWatch alarms, reduced their mean time to recovery from hours to single-digit minutes.


Canary Releases

Named after the coal miners' practice of taking canaries into mines to detect toxic gases, canary releases route a small percentage of production traffic to the new version. If the canary is healthy (no elevated error rates, latency within bounds), traffic is gradually increased. If problems emerge, the canary is terminated and all traffic returns to the stable version.

Canary Release Traffic Progression
flowchart TD
    A["Deploy canary (1% traffic)"] --> B{Metrics healthy?}
    B -->|Yes| C["Increase to 5%"]
    C --> D{Metrics healthy?}
    D -->|Yes| E["Increase to 25%"]
    E --> F{Metrics healthy?}
    F -->|Yes| G["Promote to 100%"]
    B -->|No| H["Rollback immediately"]
    D -->|No| H
    F -->|No| H
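
The progression in the diagram can be expressed as a short loop. This is a sketch under simple assumptions: `checkHealth` is a hypothetical callback that returns true when the canary's metrics are within bounds, and the weights are the ones shown in the diagram.

```javascript
// Traffic weights from the diagram; checkHealth is a hypothetical callback
// that returns true when canary metrics are within bounds at that weight.
const CANARY_STEPS = [1, 5, 25, 100];

function runCanary(checkHealth, steps = CANARY_STEPS) {
  const history = [];
  for (const weight of steps) {
    history.push(weight);
    // The final step is the promotion itself, so it is not gated here.
    if (weight < 100 && !checkHealth(weight)) {
      return { outcome: 'rolled-back', finalWeight: 0, history };
    }
  }
  return { outcome: 'promoted', finalWeight: 100, history };
}

console.log(runCanary(() => true));
console.log(runCanary((weight) => weight < 25));
```

A failed check at any gate sends all traffic back to the stable version, so the blast radius is limited to the weight reached at that point.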
                            

Canary Analysis Metrics

Automated canary analysis compares the canary version against the stable baseline using key metrics:

  • Error rate — HTTP 5xx responses, exception counts, error logs
  • Latency — P50, P95, P99 response times
  • Saturation — CPU, memory, connection pool utilisation
  • Business metrics — conversion rates, cart abandonment, revenue per request
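
A minimal sketch of such a comparison, covering error rate and P99 latency only. The threshold values and field names are assumptions for illustration; real analysis tools typically add statistical tests on top of fixed thresholds.

```javascript
// Hypothetical thresholds; tune them to your own SLOs.
const THRESHOLDS = {
  maxErrorRateIncrease: 0.005, // absolute increase allowed over baseline
  maxP99LatencyRatio: 1.2,     // canary P99 may be at most 20% slower
};

function analyseCanary(baseline, canary, thresholds = THRESHOLDS) {
  const failures = [];
  if (canary.errorRate - baseline.errorRate > thresholds.maxErrorRateIncrease) {
    failures.push('error-rate');
  }
  if (canary.p99LatencyMs > baseline.p99LatencyMs * thresholds.maxP99LatencyRatio) {
    failures.push('p99-latency');
  }
  return { healthy: failures.length === 0, failures };
}

const baseline = { errorRate: 0.002, p99LatencyMs: 300 };
console.log(analyseCanary(baseline, { errorRate: 0.003, p99LatencyMs: 320 }));
console.log(analyseCanary(baseline, { errorRate: 0.02, p99LatencyMs: 500 }));
```

Comparing against a live baseline rather than a fixed number keeps the check meaningful when overall traffic conditions change during the rollout.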

Tools for Canary Deployments

| Tool | Platform | Key Feature |
|---|---|---|
| Argo Rollouts | Kubernetes | Native K8s CRD with automated analysis |
| Flagger | Kubernetes + Service Mesh | Progressive delivery with Istio/Linkerd/Contour |
| Spinnaker | Multi-cloud | Full deployment orchestration with Kayenta analysis |
| AWS CodeDeploy | AWS | Managed canary with CloudWatch integration |

# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 5        # ~5% to canary; without a traffic router this is approximated by replica count
      - pause:
          duration: 5m      # Wait 5 minutes
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 25       # Increase to 25%
      - pause:
          duration: 10m
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50       # Increase to 50%
      - pause:
          duration: 10m
      - setWeight: 100      # Full promotion

---
# Analysis template: checks error rate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 60s
    count: 5                # bound the run; with only an interval set, the metric repeats indefinitely
    successCondition: result[0] >= 0.99
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

Feature Flags

Feature flags (also called feature toggles) decouple deployment from release. You deploy code that contains new functionality, but the functionality is hidden behind a conditional check. You can then enable it for specific users, a percentage of traffic, or all users — without deploying again.

Flag Types

| Flag Type | Lifespan | Purpose | Example |
|---|---|---|---|
| Release flag | Days to weeks | Hide incomplete features until ready | new-checkout-flow |
| Experiment flag | Days to weeks | A/B test different implementations | pricing-page-variant-b |
| Ops flag | Permanent | Circuit breaker or kill switch | disable-email-notifications |
| Permission flag | Permanent | Entitlement or access control | premium-analytics-dashboard |

Implementation Pattern

// Feature flag implementation using a flag service
const featureFlags = require('./feature-flag-client');

async function handleCheckout(req, res) {
  const userId = req.user.id;

  // Check if user should see new checkout flow
  const useNewCheckout = await featureFlags.isEnabled(
    'new-checkout-flow',
    { userId, country: req.user.country, plan: req.user.plan }
  );

  if (useNewCheckout) {
    return renderNewCheckoutFlow(req, res);
  }

  return renderLegacyCheckoutFlow(req, res);
}

// Flag evaluation can use:
// - User ID (target specific users)
// - Percentage (random 5% of traffic)
// - Attributes (country, plan, device)
// - Environment (staging vs production)
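
For percentage rollouts, a per-request random draw would flip the same user between variants. One common approach, sketched here as an assumption about how a flag client might do it, is to hash the flag key and user ID into a stable bucket. The function names are hypothetical.

```javascript
const crypto = require('crypto');

// Map (flagKey, userId) to a stable bucket in [0, 100).
// Including the flag key gives each flag an independent split.
function bucketFor(flagKey, userId) {
  const digest = crypto
    .createHash('sha256')
    .update(`${flagKey}:${userId}`)
    .digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function isInRollout(flagKey, userId, percentage) {
  return bucketFor(flagKey, userId) < percentage;
}

console.log(isInRollout('new-checkout-flow', 'user-42', 5));
console.log(isInRollout('new-checkout-flow', 'user-42', 100));
```

Because the bucket is fixed per user, raising the percentage from 5 to 25 only adds users; nobody who already had the feature loses it.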

Feature Flag Tools

| Tool | Type | Strengths |
|---|---|---|
| LaunchDarkly | SaaS | Real-time flag updates, targeting rules, audit trail, SDKs for 25+ languages |
| Unleash | Open source | Self-hosted, strategy-based activation, GitOps-friendly |
| Flagsmith | Open source / SaaS | Remote config + flags, segment-based targeting |
| Split.io | SaaS | Statistical engine for experiment flags, impact analysis |

Feature Flag Debt: Flags that outlive their purpose become technical debt. Establish a hygiene practice: every release flag should have an expiry date. After the feature is fully launched, remove the flag and the old code path. Stale flags increase complexity and make the codebase harder to reason about.
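
The expiry-date practice is easy to automate. A minimal sketch, assuming flags are kept in a registry with an `expiresAt` field; the registry shape is hypothetical and the flag names come from the table of flag types.

```javascript
// Hypothetical flag registry entries; the field names are illustrative.
const flags = [
  { key: 'new-checkout-flow', type: 'release', expiresAt: '2026-03-01' },
  { key: 'pricing-page-variant-b', type: 'experiment', expiresAt: '2026-09-01' },
  { key: 'disable-email-notifications', type: 'ops', expiresAt: null },
];

// Permanent flags (ops, permission) carry no expiry and are never stale.
function findStaleFlags(registry, now) {
  return registry
    .filter((flag) => flag.expiresAt !== null && new Date(flag.expiresAt) < now)
    .map((flag) => flag.key);
}

console.log(findStaleFlags(flags, new Date('2026-05-13')));
```

Running a check like this in CI turns flag cleanup from a good intention into a failing build.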

A/B Testing

A/B testing (also called split testing) randomly assigns users to groups that are large enough to yield statistically significant results, then compares how different implementations perform. Unlike canary releases, which test for safety, A/B tests measure business outcomes: which version converts better, retains more users, or generates more revenue.

A/B Testing vs Canary Releases

| Dimension | Canary Release | A/B Test |
|---|---|---|
| Goal | Validate that new code doesn't break anything | Determine which variant performs better |
| Metrics | Error rates, latency, CPU | Conversion, revenue, engagement |
| Duration | Minutes to hours | Days to weeks (for statistical significance) |
| Traffic split | Increases over time (1% → 100%) | Fixed (typically 50/50) |
| Outcome | Promote or rollback | Choose winner based on data |

Statistical Significance

An A/B test is only valid when the results are statistically significant — meaning the observed difference is unlikely to be due to random chance. Key concepts:

  • Sample size — enough users must see each variant (typically thousands)
  • Confidence level — typically 95% (p-value < 0.05)
  • Effect size — the minimum detectable difference you care about
  • Duration — run the test long enough to cover weekly patterns (at least 1-2 business cycles)
Common Mistake: Ending an A/B test early because one variant "looks better" after a few hours. This is called peeking and leads to false positives. Always define your required sample size upfront and commit to running the test to completion.
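
To define the sample size upfront, the standard two-proportion approximation is a reasonable starting point. The sketch below assumes a two-sided 5% significance level and 80% statistical power; the power target is an assumption that the text above does not specify, and the function name is hypothetical.

```javascript
// Standard normal quantiles for a two-sided 5% significance level and
// 80% power. The power target is an assumption, not from the article.
const Z_ALPHA = 1.96;
const Z_BETA = 0.8416;

// Approximate users needed per variant to detect a change in conversion
// rate from baselineRate to targetRate (two-proportion z-test).
function sampleSizePerVariant(baselineRate, targetRate) {
  const variance =
    baselineRate * (1 - baselineRate) + targetRate * (1 - targetRate);
  const effect = targetRate - baselineRate;
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / effect ** 2);
}

console.log(sampleSizePerVariant(0.10, 0.12));
console.log(sampleSizePerVariant(0.10, 0.11));
```

Detecting a lift from 10% to 12% needs roughly 3,800 users per variant under these assumptions, and halving the effect size roughly quadruples the requirement, which is why small improvements take so long to confirm.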

Dark Launches

A dark launch deploys new code to production and exercises it with real traffic, but the results are never shown to users. The old code path still serves the actual response. The new code path runs in parallel (or asynchronously), and its output is compared against the old path or simply discarded after measurement.

Use Cases for Dark Launches

  • Performance validation — will the new code handle production load without degradation?
  • Correctness verification — compare outputs of old and new implementations under real data
  • Database migration testing — write to both old and new databases, compare results
  • ML model validation — run new model predictions alongside production model, measure accuracy
// Dark launch pattern: shadow execution
async function searchProducts(query) {
  // Primary path: serves the actual response
  const primaryResult = await legacySearchEngine.search(query);

  // Dark path: new implementation, result discarded
  // Not awaited, so it adds no latency to the user-facing response
  const darkStart = Date.now();
  newSearchEngine.search(query)
    .then(darkResult => {
      // Compare results for correctness
      metrics.recordComparison({
        query,
        primaryCount: primaryResult.length,
        darkCount: darkResult.length,
        overlap: calculateOverlap(primaryResult, darkResult),
        darkLatencyMs: Date.now() - darkStart
      });
    })
    })
    .catch(err => {
      // Dark path errors never affect users
      metrics.recordDarkFailure({ query, error: err.message });
    });

  // Always return the primary result
  return primaryResult;
}
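
The `calculateOverlap` helper is left undefined above. One possible implementation, offered as a sketch, is the Jaccard index over result IDs; it assumes each result carries a stable `id` field.

```javascript
// Hypothetical implementation of the calculateOverlap helper.
// Assumes each result object has a stable `id` field.
function calculateOverlap(primaryResults, darkResults) {
  const primaryIds = new Set(primaryResults.map((r) => r.id));
  const darkIds = new Set(darkResults.map((r) => r.id));
  const union = new Set([...primaryIds, ...darkIds]);
  if (union.size === 0) return 1; // two empty result sets agree fully

  let shared = 0;
  for (const id of primaryIds) {
    if (darkIds.has(id)) shared += 1;
  }
  return shared / union.size; // Jaccard index in [0, 1]
}

console.log(calculateOverlap([{ id: 1 }, { id: 2 }], [{ id: 2 }, { id: 3 }]));
```

This measure ignores ranking order. If result order matters for search quality, a rank-aware comparison would be a better fit.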
Case Study

GitHub's Scientist Library

GitHub open-sourced their dark launch framework as Scientist, a Ruby library that the community has since ported to Python and other languages. When rewriting critical code paths (like their permissions system), they used Scientist to run the old and new implementations side by side on production requests. The new implementation's output was compared against the old one, and any discrepancies were logged and investigated. This allowed them to rewrite core systems with confidence: before switching over, they had evidence that the new code matched the old code's results under real production conditions. The library tracked mismatch rates, latency differences, and exception counts, giving engineers visibility into the new code's behaviour without any user impact.


Progressive Delivery

Progressive delivery is the orchestration of multiple strategies into a unified, automated promotion pipeline. Rather than choosing one strategy, you combine them in sequence: deploy dark code behind a flag → enable for 1% as a canary → expand to 10% → run A/B test at 50% → promote to 100% GA — all automated with SLO-based gates.

The Progressive Delivery Pipeline

  1. Deploy — code ships to production behind a feature flag (users see nothing)
  2. Internal testing — enable for internal employees (dogfooding)
  3. Canary (1%) — enable for 1% of external traffic, monitor SLOs
  4. Expand (10%) — if canary metrics pass, increase to 10%
  5. Expand (50%) — optionally run A/B test at this point
  6. GA (100%) — full rollout to all users
  7. Cleanup — remove feature flag, delete old code path

At every step, automated analysis monitors error rates, latency, and business metrics. If any SLO is breached, the system automatically reverts to the previous step.
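
The gate-and-revert behaviour can be sketched as a walk through an ordered list of stages. The stage names and the `sloHealthy` callback are hypothetical; in this simplified model the pipeline stops after a revert so a human can investigate.

```javascript
// Stages from the pipeline above, in promotion order (names are illustrative).
const STAGES = ['deployed', 'internal', 'canary-1', 'expand-10', 'expand-50', 'ga-100'];

// sloHealthy is a hypothetical callback: (stage) => boolean.
function runPipeline(sloHealthy) {
  const visited = [STAGES[0]];
  for (let index = 0; index < STAGES.length - 1; index += 1) {
    const next = STAGES[index + 1];
    visited.push(next);
    if (!sloHealthy(next)) {
      // SLO breached at this step: revert to the previous step and stop.
      visited.push(STAGES[index]);
      return { final: STAGES[index], visited };
    }
  }
  return { final: STAGES[STAGES.length - 1], visited };
}

console.log(runPipeline(() => true));
console.log(runPipeline((stage) => stage !== 'expand-10'));
```

In the second run the breach at 10% drops exposure back to the 1% canary instead of disabling the feature entirely, which preserves the signal from the stage that was healthy.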

Industry Trend: Progressive delivery is becoming the standard for large-scale deployments. Companies like Netflix, LinkedIn, and Spotify all use variations of this pattern. The key enabler is the combination of feature flag infrastructure + canary analysis + automated SLO gates.

Database Schema Migrations

Database schema changes are the hardest part of zero-downtime deployment. Application code is stateless and easily replaced, but database schemas are shared state that both the old and new application versions must work with simultaneously during a rolling deployment.

The Expand-and-Contract Pattern

The solution is to never make breaking schema changes in one step. Instead, use a three-phase approach:

Expand-and-Contract Migration Pattern
flowchart LR
    subgraph "Phase 1: Expand"
        A[Add new column\nkeep old column\nboth nullable]
    end
    subgraph "Phase 2: Migrate"
        B[Deploy app v2\nwrites to both columns\nbackfill old data]
    end
    subgraph "Phase 3: Contract"
        C[Remove old column\nafter all code uses new]
    end
    A --> B --> C
                            

Safe Migration Rules

  1. Never rename a column — add new column, migrate data, drop old column
  2. Never change a column type — add new typed column, migrate data, drop old
  3. Never add a NOT NULL column without a default — existing rows will violate the constraint
  4. Never drop a column that old code still reads — ensure all running code has been updated first
  5. Always make migrations reversible — every up migration should have a corresponding down
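-- Example: Renaming "username" to "display_name" safely

-- Phase 1: Expand (add the new nullable column; old code ignores it)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Phase 2: Migrate
-- Deploy app v2: writes to both columns, reads display_name
-- and falls back to username when display_name is NULL.
-- Wait until all instances are v2, then backfill existing rows
-- (run this in batches on large tables to avoid long locks):
UPDATE users SET display_name = username WHERE display_name IS NULL;

-- Deploy app v3: reads and writes display_name only.
-- Wait until all instances are v3.

-- Phase 3: Contract (remove the old column)
ALTER TABLE users DROP COLUMN username;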
Critical Warning: Each phase must be a separate deployment. You cannot expand and contract in the same release because during a rolling update, some pods will be running the old code (which expects the old column) while new pods are running the updated code. The expand phase ensures both versions can coexist.
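
On the application side, the migrate phase needs code that tolerates both columns. A minimal sketch of v2 data-access helpers, using a plain object in place of a database row; the helper names and the read fallback are assumptions for illustration.

```javascript
// Hypothetical v2 data-access helpers for the migrate phase.
// `row` stands in for a record from the users table.

// v2 writes both columns, so v1 instances (reading username) and
// v2 instances (reading display_name) see the same value.
function writeName(row, name) {
  return { ...row, username: name, display_name: name };
}

// v2 prefers the new column and falls back to the old one for rows
// written by v1 or not yet backfilled.
function readName(row) {
  return row.display_name ?? row.username;
}

const legacyRow = { id: 1, username: 'ada', display_name: null };
console.log(readName(legacyRow));
console.log(readName(writeName(legacyRow, 'Ada L.')));
```

The fallback read covers rows that old instances write while the rollout is still in progress, before the new column has been populated for them.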

Strategy Comparison

| Strategy | Downtime | Rollback Speed | Resource Cost | Complexity | Observability Requirement |
|---|---|---|---|---|---|
| Recreate | Yes | Slow (redeploy old) | Low (1× infra) | Very low | Minimal |
| Rolling Update | No | Medium (rollback pods) | Low (1× + surge) | Low | Health checks required |
| Blue-Green | No | Instant (switch back) | High (2× infra) | Medium | Smoke tests + monitoring |
| Canary | No | Fast (kill canary) | Low (+1 instance) | High | Sophisticated metrics + analysis |
| Feature Flags | No | Instant (toggle off) | None (same deploy) | Medium (flag management) | Flag-aware metrics |
| A/B Test | No | Instant (toggle off) | None (same deploy) | High (statistical analysis) | Business metrics + significance |
| Dark Launch | No | N/A (not user-facing) | Low (CPU for shadow) | Medium | Comparison metrics |
| Progressive | No | Automated (SLO gates) | Medium | Very high | Full observability stack |

Exercises

Exercise 1 — Choose a Strategy: Your team runs an e-commerce platform with 50,000 concurrent users. You need to deploy a new payment processing system. Which deployment strategy (or combination) would you choose? Justify your choice considering rollback requirements, downtime tolerance, and the criticality of the payment system.
Exercise 2 — Design a Canary Pipeline: Write an Argo Rollouts manifest that deploys a canary with the following progression: 2% for 5 minutes → 10% for 10 minutes → 50% for 15 minutes → 100%. Include an AnalysisTemplate that checks both error rate (<1%) and P99 latency (<500ms).
Exercise 3 — Safe Database Migration: You need to rename a table column from email_address to primary_email on a table with 10 million rows, while maintaining zero downtime. Write the three-phase migration plan including SQL statements and the application code changes needed at each phase.
Exercise 4 — Feature Flag Architecture: Design a feature flag system for your team. Define: (a) where flag definitions are stored, (b) how flags are evaluated at runtime, (c) how stale flags are identified and cleaned up, and (d) how you would audit who changed a flag and when.

Conclusion & Next Steps

Deployment strategy is not a one-size-fits-all decision. The right choice depends on your traffic volume, risk tolerance, infrastructure budget, team expertise, and the criticality of the system being deployed. Most mature organisations combine multiple strategies — using rolling updates as the default, feature flags for risky features, and canary analysis for critical services.

The key principles to remember: always have a rollback path, always monitor what you deploy, and always decouple deployment from release. If you internalise these three principles, you can adapt to any deployment tool or platform.

Next in the Series

In Part 17: Release Engineering & GitOps, we will explore how to manage releases at scale — semantic versioning, automated changelogs, GitOps with Argo CD and Flux, release trains, and the governance processes that keep large teams shipping safely.