Introduction
Deployment is the moment where your carefully crafted code meets the real world. Every line of code you write, every test you run, every review you conduct — it all builds toward this single event: placing new software into production where real users will interact with it.
The goal is deceptively simple: get new code into production without breaking anything. In practice, this is one of the hardest problems in software engineering because production systems are complex, interconnected, and serving real traffic 24/7.
The Three Deployment Goals
Every deployment strategy is evaluated against three goals, often in tension with each other:
- Zero downtime — users never experience an outage during deployment
- Instant rollback — if something goes wrong, revert within seconds, not minutes
- Observability — know immediately whether the new version is healthy
Different strategies make different tradeoffs between these goals, plus additional factors like infrastructure cost, operational complexity, and team expertise. This article covers every major strategy so you can choose the right one for your context.
In-Place Deployment (Recreate Strategy)
The simplest deployment strategy: stop the old version, start the new version. This is Kubernetes' Recreate strategy and the default behaviour of many traditional deployment tools.
sequenceDiagram
participant LB as Load Balancer
participant V1 as Version 1 (Old)
participant V2 as Version 2 (New)
Note over LB,V1: Normal traffic flow
LB->>V1: User requests
Note over V1: Stop all instances
V1--xLB: Instances terminated
Note over LB: DOWNTIME WINDOW
Note over V2: Start new instances
V2->>LB: Health check passes
LB->>V2: Resume traffic
Note over LB,V2: Normal traffic flow
How It Works
- Take all instances of the current version offline
- Deploy the new version to the same infrastructure
- Start the new instances and wait for health checks to pass
- Route traffic to the new version
The critical problem is obvious: there is a window where no instances are serving traffic. The downtime duration depends on how long it takes to start the new version — typically seconds to minutes.
# Kubernetes Recreate strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: Recreate
  selector:                # Required for apps/v1 Deployments
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app        # Must match spec.selector
    spec:
      containers:
        - name: my-app
          image: my-app:v2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
When Recreate Is Acceptable
- Development and testing environments — nobody cares about downtime in dev
- Batch processing jobs — no real-time traffic to interrupt
- Database migrations requiring exclusive access — sometimes you genuinely need to stop the application
- Stateful applications with single-writer constraints — e.g., legacy systems that cannot tolerate two versions running simultaneously
- Cost-constrained environments — no budget for duplicate infrastructure
Rolling Updates
Rolling updates replace instances of the old version with the new version one at a time (or in small batches). At any point during the deployment, some instances are running the old version and some are running the new version. This eliminates downtime because there are always healthy instances serving traffic.
flowchart LR
subgraph "Step 1"
A1[v1] --- A2[v1] --- A3[v1] --- A4[v1]
end
subgraph "Step 2"
B1[v2] --- B2[v1] --- B3[v1] --- B4[v1]
end
subgraph "Step 3"
C1[v2] --- C2[v2] --- C3[v1] --- C4[v1]
end
subgraph "Step 4"
D1[v2] --- D2[v2] --- D3[v2] --- D4[v2]
end
Mechanics of a Rolling Update
- Start a new instance with the updated version
- Wait for health checks to confirm the new instance is ready
- Route traffic to the new instance
- Drain and terminate one old instance
- Repeat until all instances are running the new version
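The loop above can be sketched as a small simulation. This is an illustrative model only, not how any orchestrator implements it: the function name `simulateRollingUpdate` and its snapshot format are hypothetical, and it assumes a surge of one extra instance and that every new instance passes its health check.

```javascript
// Simulate a rolling update that surges by one instance at a time.
// Returns a snapshot of the fleet before the update and after each replacement.
function simulateRollingUpdate(instanceCount, oldVersion, newVersion) {
  const fleet = Array(instanceCount).fill(oldVersion);
  const snapshots = [fleet.slice()];
  for (let i = 0; i < instanceCount; i++) {
    // Start a new instance; in this model it becomes ready immediately.
    fleet.push(newVersion);
    // Drain and terminate one old instance.
    fleet.splice(fleet.indexOf(oldVersion), 1);
    snapshots.push(fleet.slice());
  }
  return snapshots;
}

for (const snapshot of simulateRollingUpdate(4, 'v1', 'v2')) {
  console.log(snapshot.join(' '));
}
```

Every snapshot holds the full instance count, which is the property that keeps capacity constant while versions are mixed.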
Configuration Parameters
| Parameter | Description | Example |
|---|---|---|
| maxUnavailable | Maximum number of instances that can be unavailable during update | 25% or 1 |
| maxSurge | Maximum number of extra instances created above desired count | 25% or 1 |
| minReadySeconds | Time a new instance must be ready before proceeding | 30 |
| progressDeadlineSeconds | Time before a stalled rollout is considered failed | 600 |
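To see what maxUnavailable and maxSurge mean for capacity, the sketch below works out the lowest number of available pods and the highest total pod count during an update. The helper names are hypothetical. It relies on the Kubernetes rule that a percentage for maxUnavailable is rounded down and a percentage for maxSurge is rounded up.

```javascript
// Resolve an absolute count (number) or a percentage string such as '25%'
// against the desired replica count.
function resolveCount(value, replicas, roundUp) {
  if (typeof value === 'number') return value;
  const fraction = (parseFloat(value) / 100) * replicas;
  return roundUp ? Math.ceil(fraction) : Math.floor(fraction);
}

// Capacity bounds that hold throughout a rolling update.
function rollingUpdateBounds(replicas, maxUnavailable, maxSurge) {
  const unavailable = resolveCount(maxUnavailable, replicas, false); // rounds down
  const surge = resolveCount(maxSurge, replicas, true); // rounds up
  return {
    minAvailable: replicas - unavailable,
    maxTotal: replicas + surge,
  };
}

console.log(rollingUpdateBounds(10, '25%', '25%'));
// { minAvailable: 8, maxTotal: 13 }
```

With 10 replicas and 25% for both settings, 2.5 rounds down to 2 unavailable pods and up to 3 surge pods, so the update runs with between 8 available and 13 total pods.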
Kubernetes Rolling Update Configuration
# Kubernetes RollingUpdate strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # At most 1 pod unavailable
      maxSurge: 1          # At most 1 extra pod during update
  selector:                # Required for apps/v1 Deployments
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app        # Must match spec.selector
    spec:
      containers:
        - name: my-app
          image: my-app:v2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
Rollback Triggers
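Kubernetes stalls a rolling update when new pods fail their readiness checks: it stops replacing old pods, and once progressDeadlineSeconds elapses it marks the rollout as failed. It does not revert to the previous version on its own, so you trigger the rollback manually or from your own automation: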
# Check rollout status
kubectl rollout status deployment/my-app
# Rollback to previous version
kubectl rollout undo deployment/my-app
# Rollback to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3
# View rollout history
kubectl rollout history deployment/my-app
Always configure a readinessProbe on your containers. Without it, Kubernetes considers a pod ready as soon as its containers start, which can route traffic to instances that haven't finished initialising. A failing readiness probe prevents traffic from reaching unhealthy pods and halts the rolling update.
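On the application side, the probe needs an endpoint that reports success only once the service can take traffic. The following is a minimal sketch using Node's built-in http module. The `appState` flag and the function names are hypothetical, and a real service would decide readiness from its own startup work.

```javascript
const http = require('http');

// Hypothetical readiness flag. A real service would set `ready` to true only
// after its startup work (config loaded, connections opened) has finished.
const appState = { ready: false };

// Answers /health with 200 when ready and 503 while still starting.
function healthHandler(req, res) {
  if (req.url !== '/health') {
    res.writeHead(404);
    res.end();
    return;
  }
  const status = appState.ready ? 200 : 503;
  res.writeHead(status, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: appState.ready ? 'ok' : 'starting' }));
}

// Build the server without listening, so the caller chooses the port.
function createHealthServer() {
  return http.createServer(healthHandler);
}
```

Calling `createHealthServer().listen(8080)` would serve the path and port used in the probe configuration above. An HTTP probe treats a 503 response as a failure, so the pod stays out of rotation until `appState.ready` is set.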
Blue-Green Deployments
Blue-green deployment maintains two identical production environments — called "blue" and "green." At any time, only one environment serves live traffic (say, blue). You deploy the new version to the idle environment (green), verify it thoroughly, then switch all traffic from blue to green in one atomic operation.
flowchart TD
Users[Users] --> LB[Load Balancer / DNS]
LB -->|"Active (100%)"| Blue[Blue Environment - v1.0]
LB -.->|"Idle (0%)"| Green[Green Environment - v2.0]
Blue --> DB[(Shared Database)]
Green --> DB
Note1["Switch: Change LB target\nfrom Blue → Green"] -.-> LB
The Blue-Green Process
- Deploy v2.0 to Green — the idle environment receives the new version
- Run smoke tests against Green — verify the deployment is healthy before exposing users
- Switch traffic — update the load balancer or DNS to point at Green
- Monitor — watch error rates, latency, and business metrics
- If problems occur — switch back to Blue (instant rollback)
- If stable — Blue becomes the idle environment for the next deployment
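The process above can be sketched as a single cutover function. This is a simplified model under stated assumptions: the router is an in-memory object standing in for a load balancer, and `smokeTest` and `isHealthy` are hypothetical callbacks that a real pipeline would back with test suites and monitoring queries.

```javascript
// In-memory stand-in for a load balancer; `active` names the live environment.
function createRouter(active) {
  return { active };
}

// Verify the idle environment, switch traffic to it, and switch back
// if post-switch monitoring reports a problem.
async function blueGreenCutover(router, idle, { smokeTest, isHealthy }) {
  const previous = router.active;
  if (!(await smokeTest(idle))) {
    // Users were never exposed, so there is nothing to roll back.
    return { switched: false, active: previous, reason: 'smoke tests failed' };
  }
  router.active = idle; // Switch traffic.
  if (!(await isHealthy(idle))) {
    router.active = previous; // Rollback: point back at the old environment.
    return { switched: false, active: previous, reason: 'rolled back' };
  }
  return { switched: true, active: idle, reason: 'stable' };
}
```

The rollback is just a second assignment to `router.active`, which is why blue-green can revert so quickly: the old environment is still running and untouched.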
Traffic Switching Mechanisms
| Mechanism | Switch Speed | Granularity | Considerations |
|---|---|---|---|
| DNS-based | Minutes (TTL dependent) | All-or-nothing | Simple but slow; DNS caching can delay switch |
| Load balancer | Seconds | All-or-nothing or weighted | Fastest option; requires LB supporting target group switching |
| Service mesh | Sub-second | Percentage-based | Most flexible; requires Istio/Linkerd infrastructure |
| Kubernetes Service | Seconds | Label selector switch | Simple if already on Kubernetes |
# Kubernetes blue-green via Service selector switch
# Step 1: Service points to blue
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue          # Switch this to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080
---
# Step 2: Deploy green alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
      version: green
  template:
    metadata:
      labels:
        app: my-app
        version: green
    spec:
      containers:
        - name: my-app
          image: my-app:v2.0.0
Cost Implications
The primary drawback of blue-green is double the infrastructure cost. You maintain two full production environments. Mitigation strategies include:
- Use auto-scaling to keep the idle environment at minimum capacity until switch
- Scale up the idle environment just before the switch, scale down after stability is confirmed
- Use spot/preemptible instances for the idle environment during testing
- Share stateful backing services (databases, caches) between environments instead of duplicating them
Amazon's Blue-Green at Scale
Amazon pioneered blue-green deployments at massive scale. Their internal deployment system, Apollo, manages deployments across millions of hosts. Each service maintains two deployment groups. When deploying, the new version is deployed to the inactive group, verified via automated tests and canary checks, then traffic is shifted via internal load balancers. The ability to instantly revert to the previous group has prevented countless customer-facing incidents. Amazon reported that this approach, combined with automated rollback triggered by CloudWatch alarms, reduced their mean time to recovery from hours to single-digit minutes.
Canary Releases
Named after the coal miners' practice of taking canaries into mines to detect toxic gases, canary releases route a small percentage of production traffic to the new version. If the canary is healthy (no elevated error rates, latency within bounds), traffic is gradually increased. If problems emerge, the canary is terminated and all traffic returns to the stable version.
flowchart TD
A["Deploy canary (1% traffic)"] --> B{Metrics healthy?}
B -->|Yes| C["Increase to 5%"]
C --> D{Metrics healthy?}
D -->|Yes| E["Increase to 25%"]
E --> F{Metrics healthy?}
F -->|Yes| G["Promote to 100%"]
B -->|No| H["Rollback immediately"]
D -->|No| H
F -->|No| H
Canary Analysis Metrics
Automated canary analysis compares the canary version against the stable baseline using key metrics:
- Error rate — HTTP 5xx responses, exception counts, error logs
- Latency — P50, P95, P99 response times
- Saturation — CPU, memory, connection pool utilisation
- Business metrics — conversion rates, cart abandonment, revenue per request
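A comparison of this kind can be sketched as a function that checks the canary against the baseline and returns a verdict. The metric names, the `limits` object, and the threshold values are all hypothetical. Real tools apply statistical tests over many samples rather than a single comparison.

```javascript
// Compare canary metrics with the stable baseline and decide what to do.
// limits.maxErrorRateIncrease: allowed absolute increase in error rate
// limits.maxLatencyRatio: allowed canary/baseline P99 latency ratio
// limits.maxCpuUtilisation: absolute ceiling for canary CPU (0 to 1)
function analyzeCanary(baseline, canary, limits) {
  const failures = [];
  if (canary.errorRate > baseline.errorRate + limits.maxErrorRateIncrease) {
    failures.push('errorRate');
  }
  if (canary.p99LatencyMs > baseline.p99LatencyMs * limits.maxLatencyRatio) {
    failures.push('p99LatencyMs');
  }
  if (canary.cpuUtilisation > limits.maxCpuUtilisation) {
    failures.push('cpuUtilisation');
  }
  return { verdict: failures.length === 0 ? 'promote' : 'rollback', failures };
}

const limits = { maxErrorRateIncrease: 0.005, maxLatencyRatio: 1.2, maxCpuUtilisation: 0.8 };
const baseline = { errorRate: 0.002, p99LatencyMs: 300, cpuUtilisation: 0.55 };
const canary = { errorRate: 0.02, p99LatencyMs: 320, cpuUtilisation: 0.6 };
console.log(analyzeCanary(baseline, canary, limits));
// { verdict: 'rollback', failures: [ 'errorRate' ] }
```

Measuring the canary relative to the baseline, rather than against fixed numbers alone, keeps the check meaningful when overall traffic conditions shift during the rollout.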
Tools for Canary Deployments
| Tool | Platform | Key Feature |
|---|---|---|
| Argo Rollouts | Kubernetes | Native K8s CRD with automated analysis |
| Flagger | Kubernetes + Service Mesh | Progressive delivery with Istio/Linkerd/Contour |
| Spinnaker | Multi-cloud | Full deployment orchestration with Kayenta analysis |
| AWS CodeDeploy | AWS | Managed canary with CloudWatch integration |
# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  # selector and pod template (or a workloadRef) omitted for brevity
  strategy:
    canary:
      steps:
        - setWeight: 5           # 5% traffic to canary
        - pause:
            duration: 5m         # Wait 5 minutes
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25          # Increase to 25%
        - pause:
            duration: 10m
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50          # Increase to 50%
        - pause:
            duration: 10m
        - setWeight: 100         # Full promotion
---
# Analysis template: checks success rate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      count: 5                   # Stop after 5 measurements; with only an interval the metric runs indefinitely
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
Feature Flags
Feature flags (also called feature toggles) decouple deployment from release. You deploy code that contains new functionality, but the functionality is hidden behind a conditional check. You can then enable it for specific users, a percentage of traffic, or all users — without deploying again.
Flag Types
| Flag Type | Lifespan | Purpose | Example |
|---|---|---|---|
| Release flag | Days to weeks | Hide incomplete features until ready | new-checkout-flow |
| Experiment flag | Days to weeks | A/B test different implementations | pricing-page-variant-b |
| Ops flag | Permanent | Circuit breaker or kill switch | disable-email-notifications |
| Permission flag | Permanent | Entitlement or access control | premium-analytics-dashboard |
Implementation Pattern
// Feature flag implementation using a flag service
const featureFlags = require('./feature-flag-client');

async function handleCheckout(req, res) {
  const userId = req.user.id;
  // Check if user should see new checkout flow
  const useNewCheckout = await featureFlags.isEnabled(
    'new-checkout-flow',
    { userId, country: req.user.country, plan: req.user.plan }
  );
  if (useNewCheckout) {
    return renderNewCheckoutFlow(req, res);
  }
  return renderLegacyCheckoutFlow(req, res);
}

// Flag evaluation can use:
// - User ID (target specific users)
// - Percentage (random 5% of traffic)
// - Attributes (country, plan, device)
// - Environment (staging vs production)
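Percentage targeting is usually made sticky, so that a given user gets the same answer on every request. One common way to do that is to hash the flag key and user ID into a bucket. The sketch below assumes that approach. The function names are hypothetical and are not the API of any particular flag service.

```javascript
const crypto = require('crypto');

// Map a (flagKey, userId) pair to a stable bucket from 0 to 99.
// Including the flag key means different flags select different user subsets.
function rolloutBucket(flagKey, userId) {
  const digest = crypto.createHash('sha256').update(`${flagKey}:${userId}`).digest();
  // Interpret the first 4 bytes as an unsigned integer, then reduce to 100 buckets.
  return digest.readUInt32BE(0) % 100;
}

// True when the user falls inside the rollout percentage (0 to 100).
function isInRollout(flagKey, userId, percentage) {
  return rolloutBucket(flagKey, userId) < percentage;
}
```

Raising the percentage from 5 to 25 keeps every user who was already enabled, because their bucket number does not change. Only new buckets are added.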
Feature Flag Tools
| Tool | Type | Strengths |
|---|---|---|
| LaunchDarkly | SaaS | Real-time flag updates, targeting rules, audit trail, SDKs for 25+ languages |
| Unleash | Open source | Self-hosted, strategy-based activation, GitOps-friendly |
| Flagsmith | Open source / SaaS | Remote config + flags, segment-based targeting |
| Split.io | SaaS | Statistical engine for experiment flags, impact analysis |
A/B Testing
A/B testing (also called split testing) randomly assigns users to groups, each large enough to produce statistically significant results, and compares how different implementations perform. Unlike canary releases, which test for safety, A/B tests measure business outcomes: which version converts better, retains more users, or generates more revenue.
A/B Testing vs Canary Releases
| Dimension | Canary Release | A/B Test |
|---|---|---|
| Goal | Validate that new code doesn't break anything | Determine which variant performs better |
| Metrics | Error rates, latency, CPU | Conversion, revenue, engagement |
| Duration | Minutes to hours | Days to weeks (for statistical significance) |
| Traffic split | Increases over time (1% → 100%) | Fixed (typically 50/50) |
| Outcome | Promote or rollback | Choose winner based on data |
Statistical Significance
An A/B test is only valid when the results are statistically significant — meaning the observed difference is unlikely to be due to random chance. Key concepts:
- Sample size — enough users must see each variant (typically thousands)
- Confidence level — typically 95%, meaning a difference counts as significant only when p < 0.05
- Effect size — the minimum detectable difference you care about
- Duration — run the test long enough to cover weekly patterns (at least 1-2 business cycles)
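For conversion-rate experiments, one standard way to check significance is a two-proportion z-test. The sketch below assumes that test and a two-sided 95% confidence level, for which the critical value is 1.96. The function names and the example figures are illustrative only.

```javascript
// z-statistic for the difference between two conversion rates,
// using the pooled proportion for the standard error.
function twoProportionZ(controlConversions, controlUsers, variantConversions, variantUsers) {
  const p1 = controlConversions / controlUsers;
  const p2 = variantConversions / variantUsers;
  const pooled = (controlConversions + variantConversions) / (controlUsers + variantUsers);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / controlUsers + 1 / variantUsers));
  return (p2 - p1) / standardError;
}

// Two-sided test at the 95% confidence level.
const Z_CRITICAL_95 = 1.96;
function isSignificant(z) {
  return Math.abs(z) > Z_CRITICAL_95;
}

// The same 10% vs 11% conversion rates at two sample sizes.
const large = twoProportionZ(1000, 10000, 1100, 10000);
const small = twoProportionZ(100, 1000, 110, 1000);
console.log(large.toFixed(2), isSignificant(large)); // 2.31 true
console.log(small.toFixed(2), isSignificant(small)); // 0.73 false
```

The identical lift is significant with 10,000 users per variant but not with 1,000, which is why sample size has to be planned before the test starts.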
Dark Launches
A dark launch deploys new code to production and exercises it with real traffic, but the results are never shown to users. The old code path still serves the actual response. The new code path runs in parallel (or asynchronously), and its output is compared against the old path or simply discarded after measurement.
Use Cases for Dark Launches
- Performance validation — will the new code handle production load without degradation?
- Correctness verification — compare outputs of old and new implementations under real data
- Database migration testing — write to both old and new databases, compare results
- ML model validation — run new model predictions alongside production model, measure accuracy
// Dark launch pattern: shadow execution
async function searchProducts(query) {
  // Primary path: serves the actual response
  const primaryResult = await legacySearchEngine.search(query);

  // Dark path: new implementation, result discarded
  // Runs asynchronously to avoid adding latency
  newSearchEngine.search(query)
    .then(darkResult => {
      // Compare results for correctness
      metrics.recordComparison({
        query,
        primaryCount: primaryResult.length,
        darkCount: darkResult.length,
        overlap: calculateOverlap(primaryResult, darkResult),
        darkLatency: darkResult.latencyMs
      });
    })
    .catch(err => {
      // Dark path errors never affect users
      metrics.recordDarkFailure({ query, error: err.message });
    });

  // Always return the primary result
  return primaryResult;
}
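The `calculateOverlap` helper is not defined in the example above. One plausible version is sketched here, assuming each result is an object with an `id` field and measuring overlap as the Jaccard index (shared items divided by distinct items across both lists). Both choices are assumptions. A ranking-aware comparison may suit search results better.

```javascript
// Hypothetical helper: Jaccard overlap between two result lists, matched by id.
// Returns a number from 0 (no shared items) to 1 (same set of items).
function calculateOverlap(primaryResult, darkResult) {
  const primaryIds = new Set(primaryResult.map(item => item.id));
  const darkIds = new Set(darkResult.map(item => item.id));
  // Two empty result sets agree completely.
  if (primaryIds.size === 0 && darkIds.size === 0) return 1;
  let shared = 0;
  for (const id of primaryIds) {
    if (darkIds.has(id)) shared++;
  }
  const union = primaryIds.size + darkIds.size - shared;
  return shared / union;
}

console.log(calculateOverlap(
  [{ id: 1 }, { id: 2 }, { id: 3 }],
  [{ id: 2 }, { id: 3 }, { id: 4 }]
)); // 0.5
```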
GitHub's Scientist Library
GitHub open-sourced their dark launch framework as the Scientist library (written in Ruby, with ports available for other languages). When rewriting critical code paths (like their permissions system), they used Scientist to run both implementations simultaneously on every request. The new implementation's output was compared against the old one, and any discrepancies were logged and investigated. This allowed them to rewrite core systems with confidence: before switching over, they had evidence that the new code produced identical results to the old code under real production conditions. The library tracked mismatch rates, latency differences, and exception counts, giving engineers complete visibility into the new code's behaviour without any user impact.
Progressive Delivery
Progressive delivery is the orchestration of multiple strategies into a unified, automated promotion pipeline. Rather than choosing one strategy, you combine them in sequence: deploy dark code behind a flag → enable for 1% as a canary → expand to 10% → run A/B test at 50% → promote to 100% GA — all automated with SLO-based gates.
The Progressive Delivery Pipeline
- Deploy — code ships to production behind a feature flag (users see nothing)
- Internal testing — enable for internal employees (dogfooding)
- Canary (1%) — enable for 1% of external traffic, monitor SLOs
- Expand (10%) — if canary metrics pass, increase to 10%
- Expand (50%) — optionally run A/B test at this point
- GA (100%) — full rollout to all users
- Cleanup — remove feature flag, delete old code path
At every step, automated analysis monitors error rates, latency, and business metrics. If any SLO is breached, the system automatically reverts to the previous step.
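The gating logic can be sketched as a loop over stages that stops and reverts as soon as a check fails. This is a simplified model: the stage list and the `checkSlo` callback are hypothetical, and a real pipeline would evaluate SLOs over a monitoring window instead of a single call.

```javascript
// Walk through rollout stages in order. After each traffic change, run the
// SLO check; on a breach, revert to the previous stage's percentage and stop.
async function runProgressiveRollout(stages, checkSlo) {
  let percent = 0; // Share of traffic currently on the new version.
  const history = [];
  for (const stage of stages) {
    const previous = percent;
    percent = stage.percent;
    const healthy = await checkSlo(stage);
    history.push({ stage: stage.name, percent: stage.percent, healthy });
    if (!healthy) {
      percent = previous; // Revert to the previous step.
      return { completed: false, percent, history };
    }
  }
  return { completed: true, percent, history };
}

const stages = [
  { name: 'canary', percent: 1 },
  { name: 'expand-10', percent: 10 },
  { name: 'expand-50', percent: 50 },
  { name: 'ga', percent: 100 },
];
```

If the check fails at the 50% stage, the function returns with `percent` back at 10, matching the rule that a breach reverts to the previous step.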
Database Schema Migrations
Database schema changes are the hardest part of zero-downtime deployment. Application code is stateless and easily replaced, but database schemas are shared state that both the old and new application versions must work with simultaneously during a rolling deployment.
The Expand-and-Contract Pattern
The solution is to never make breaking schema changes in one step. Instead, use a three-phase approach:
flowchart LR
subgraph "Phase 1: Expand"
A[Add new column\nkeep old column\nboth nullable]
end
subgraph "Phase 2: Migrate"
B[Deploy app v2\nwrites to both columns\nbackfill old data]
end
subgraph "Phase 3: Contract"
C[Remove old column\nafter all code uses new]
end
A --> B --> C
Safe Migration Rules
- Never rename a column — add new column, migrate data, drop old column
- Never change a column type — add new typed column, migrate data, drop old
- Never add a NOT NULL column without a default — existing rows will violate the constraint
- Never drop a column that old code still reads — ensure all running code has been updated first
- Always make migrations reversible — every up migration should have a corresponding down
-- Example: renaming "username" to "display_name" safely
-- Phase 1: Expand (add the new column, nullable)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Phase 2: Migrate
-- Deploy app v2 first: writes to both columns, reads display_name with a fallback to username
-- Once all instances are v2, backfill the rows written before dual writes began
UPDATE users SET display_name = username WHERE display_name IS NULL;
-- Deploy a follow-up release that reads and writes display_name only
-- Wait until no running instance still uses username
-- Phase 3: Contract (remove the old column)
ALTER TABLE users DROP COLUMN username;
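On the application side, each phase of the migration comes down to which columns a release reads and writes. The sketch below models that for the username to display_name rename. The phase names and both helper functions are hypothetical, and a real application would pass these column maps to its own data-access layer.

```javascript
// Columns a release writes, by migration phase:
// 'legacy' is the old code, 'dual-write' runs while both columns exist,
// 'new-only' is the code that must be fully deployed before the contract step.
function buildUserWrite(phase, name) {
  switch (phase) {
    case 'legacy':
      return { username: name };
    case 'dual-write':
      return { username: name, display_name: name };
    case 'new-only':
      return { display_name: name };
    default:
      throw new Error(`unknown phase: ${phase}`);
  }
}

// During the dual-write phase, read the new column but fall back to the old
// one for rows that have not been backfilled yet.
function readDisplayName(row) {
  return row.display_name ?? row.username;
}

console.log(buildUserWrite('dual-write', 'Ada'));
// { username: 'Ada', display_name: 'Ada' }
```

Dropping the old column is only safe once every running instance is in the `new-only` phase, because a dual-write release would still try to write to it.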
Strategy Comparison
| Strategy | Downtime | Rollback Speed | Resource Cost | Complexity | Observability Requirement |
|---|---|---|---|---|---|
| Recreate | Yes | Slow (redeploy old) | Low (1× infra) | Very low | Minimal |
| Rolling Update | No | Medium (rollback pods) | Low (1× + surge) | Low | Health checks required |
| Blue-Green | No | Instant (switch back) | High (2× infra) | Medium | Smoke tests + monitoring |
| Canary | No | Fast (kill canary) | Low (+1 instance) | High | Sophisticated metrics + analysis |
| Feature Flags | No | Instant (toggle off) | None (same deploy) | Medium (flag management) | Flag-aware metrics |
| A/B Test | No | Instant (toggle off) | None (same deploy) | High (statistical analysis) | Business metrics + significance |
| Dark Launch | No | N/A (not user-facing) | Low (CPU for shadow) | Medium | Comparison metrics |
| Progressive | No | Automated (SLO gates) | Medium | Very high | Full observability stack |
Exercises
Design a migration that renames the column email_address to primary_email on a table with 10 million rows, while maintaining zero downtime. Write the three-phase migration plan including SQL statements and the application code changes needed at each phase.
Conclusion & Next Steps
Deployment strategy is not a one-size-fits-all decision. The right choice depends on your traffic volume, risk tolerance, infrastructure budget, team expertise, and the criticality of the system being deployed. Most mature organisations combine multiple strategies — using rolling updates as the default, feature flags for risky features, and canary analysis for critical services.
The key principles to remember: always have a rollback path, always monitor what you deploy, and always decouple deployment from release. If you internalise these three principles, you can adapt to any deployment tool or platform.
Next in the Series
In Part 17: Release Engineering & GitOps, we will explore how to manage releases at scale — semantic versioning, automated changelogs, GitOps with Argo CD and Flux, release trains, and the governance processes that keep large teams shipping safely.