Introduction
For decades, software teams struggled with a fundamental question: How do we know if we're doing well? Lines of code? Story points completed? Number of deploys? These metrics either incentivised the wrong behaviour or measured activity rather than outcomes.
In 2014, Dr. Nicole Forsgren, Jez Humble, and Gene Kim began the DORA (DevOps Research and Assessment) research program. Over seven years and 36,000+ survey respondents across every industry, they identified exactly four metrics that reliably predict both software delivery performance and organisational performance (profitability, market share, customer satisfaction).
These aren't arbitrary KPIs. They are statistically validated predictors: the research tests predictive relationships, not just correlations, between these metrics and organisational outcomes. Teams that excel at these four metrics deliver software faster, more reliably, and with fewer defects, and their organisations are more likely to outperform competitors commercially.
The Accelerate Research
The findings were published in Accelerate: The Science of Lean Software and DevOps (2018) and in the annual State of DevOps Reports (2014–present, now published by the DORA team at Google Cloud). Key findings include:
- Elite performers deploy 973x more frequently than low performers
- Lead time for elite teams is 6,570x shorter than low performers
- Change failure rate is 3x lower for elite performers
- Recovery time is 6,570x faster for elite performers
- High-performing teams are 2x more likely to exceed organisational goals
- These capabilities predict commercial outcomes, not just technical ones
The research methodology uses cluster analysis to identify natural performance groups, structural equation modelling to test predictive relationships between capabilities and outcomes, and survey-based measurement validated against objective data. It is among the most rigorous bodies of empirical research in software engineering.
The Four Key Metrics
DORA measures delivery performance along two axes: throughput (how fast you deliver value) and stability (how reliably you deliver value). Two metrics measure each axis:
quadrantChart
title DORA Metrics Framework
x-axis "Low Throughput" --> "High Throughput"
y-axis "Low Stability" --> "High Stability"
quadrant-1 "Elite Performers"
quadrant-2 "Reliable but Slow"
quadrant-3 "Low Performers"
quadrant-4 "Fast but Fragile"
"Deployment Frequency": [0.85, 0.8]
"Lead Time": [0.75, 0.7]
"Change Failure Rate": [0.3, 0.85]
"MTTR": [0.25, 0.75]
Deployment Frequency (DF)
Definition: How often your organisation deploys code to production (or releases it to end users).
Deployment frequency is the clearest indicator of batch size. High deployment frequency means small batches, which means lower risk per change, faster feedback, and easier debugging when something goes wrong. If you deploy 100 lines of code and something breaks, you know exactly where to look. If you deploy 10,000 lines, good luck.
| Performance Tier | Deployment Frequency | Batch Size Implication |
|---|---|---|
| Elite | Multiple times per day (on demand) | Individual commits, feature flags |
| High | Once per day to once per week | Small feature branches, a few days' work |
| Medium | Once per week to once per month | Sprint-sized releases |
| Low | Once per month to once per six months | Large batches, release trains |
Lead Time for Changes (LT)
Definition: The time from when code is committed to when it is running in production and available to users.
Lead time measures the efficiency of your entire delivery pipeline — from the moment a developer finishes writing code to the moment it creates value for users. It encompasses code review, CI pipeline execution, approval gates, staging environments, and deployment processes.
| Performance Tier | Lead Time | Where Time is Spent |
|---|---|---|
| Elite | Less than 1 hour | Automated pipeline, trunk-based dev |
| High | 1 day to 1 week | Code review + automated deploy |
| Medium | 1 week to 1 month | Manual testing, approval committees |
| Low | 1 month to 6 months | Change advisory boards, manual deployments |
Change Failure Rate (CFR)
Definition: The percentage of deployments that cause a failure in production — requiring a rollback, hotfix, patch, or incident response.
Change failure rate measures the quality of your delivery process. A low CFR means your testing, review, and deployment practices catch problems before they reach users. It doesn't mean you never fail — it means you fail infrequently and predictably.
| Performance Tier | Change Failure Rate | What This Means |
|---|---|---|
| Elite | 0–15% | Extensive automated testing, canary deployments |
| High | 16–30% | Good test coverage, some gaps in integration |
| Medium | 16–30% | Variable testing, reactive quality practices |
| Low | 46–60% | Insufficient testing, large batches mask root causes |
Mean Time to Recovery (MTTR)
Definition: How quickly a service is restored after an incident or degradation — from detection to resolution.
MTTR measures your resilience. Failures are inevitable; what matters is how quickly you detect, diagnose, and recover. Elite teams prioritise MTTR over MTBF (Mean Time Between Failures) because in complex systems, preventing all failures is impossible — but fast recovery is achievable.
| Performance Tier | MTTR | Key Enablers |
|---|---|---|
| Elite | Less than 1 hour | Automated rollback, observability, runbooks |
| High | Less than 1 day | On-call rotation, basic monitoring, manual rollback |
| Medium | 1 day to 1 week | Escalation-heavy, limited observability |
| Low | 1 week to 6 months | No rollback capability, manual processes |
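The case for prioritising recovery time over failure prevention can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function name and the sample figures are assumptions, not DORA data.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails about once every 200 hours, takes 4 hours to restore.
baseline = availability(mtbf_hours=200, mttr_hours=4)

# Option A: halve recovery time. Option B: double the time between failures.
faster_recovery = availability(mtbf_hours=200, mttr_hours=2)
fewer_failures = availability(mtbf_hours=400, mttr_hours=4)

print(f"baseline:        {baseline:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
print(f"fewer failures:  {fewer_failures:.4%}")
```

Under this formula, halving MTTR buys the same availability as doubling MTBF. Which of the two is cheaper depends on the system; the argument above is that in complex systems faster recovery is usually the more achievable lever.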
Performance Tiers
DORA identifies four performance clusters using statistical analysis. These aren't arbitrary divisions — they represent natural groupings found in the data:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | Multiple/day | Daily–Weekly | Weekly–Monthly | Monthly–Biannually |
| Lead Time | < 1 hour | 1 day – 1 week | 1 week – 1 month | 1 month – 6 months |
| Change Failure Rate | 0–15% | 16–30% | 16–30% | 46–60% |
| MTTR | < 1 hour | < 1 day | 1 day – 1 week | 1 week – 6 months |
The Widening Gap
Year over year, the State of DevOps Reports show that the gap between elite and low performers is growing. Elite teams continue accelerating while low performers stagnate. The effect compounds: organisations that invest in delivery capabilities keep widening their lead over those that don't.
The 2023 report found that elite performers now represent approximately 18% of respondents (up from 7% in 2018), suggesting that more teams are achieving elite status — but the bar keeps rising as elite teams continue improving.
Measuring DORA Metrics
Measurement should be automated and objective. Self-reported figures for an individual team are prone to recall and optimism bias, so pull the data directly from your systems:
flowchart LR
A[Git Repository] -->|Commit timestamps| E[Metrics Engine]
B[CI/CD Platform] -->|Build/Deploy events| E
C[Incident Manager] -->|Incident open/close| E
D[Monitoring/Alerting] -->|Detection time| E
E --> F[Dashboard]
E --> G[Trend Analysis]
E --> H[Team Comparisons]
style E fill:#3B9797,color:#fff
style F fill:#f8f9fa,stroke:#333
Data Sources by Metric
| Metric | Data Source | Calculation |
|---|---|---|
| DF | CI/CD deployment events | Count of successful production deploys / time period |
| LT | Git + CI/CD timestamps | Median(deploy_time - first_commit_time) per change |
| CFR | Incident tracker + deploy logs | Failed deploys / total deploys × 100% |
| MTTR | Incident tracker | Median(resolved_time - detected_time) across incidents |
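The CFR calculation in the table needs each deploy labelled as failed or not, which in practice means joining incident records to deploy logs. The following is a minimal sketch of one such join, assuming a simple time-window heuristic. The rule itself, the function name `attribute_incidents`, and the 24-hour default are illustrative assumptions rather than part of DORA's definitions; teams that link incidents to deploys explicitly in their tracker should use that link instead.

```python
import datetime


def attribute_incidents(deploys, incidents, window_hours=24):
    """Return copies of the deploys with a 'caused_incident' flag.

    Assumption: an incident is blamed on the most recent deploy that
    happened at most `window_hours` before the incident was detected.
    """
    window = datetime.timedelta(hours=window_hours)
    ordered = sorted(deploys, key=lambda d: d['timestamp'])
    result = [dict(d, caused_incident=False) for d in ordered]
    for incident in incidents:
        detected = incident['detected_at']
        # Deploys at or before detection that fall inside the window.
        candidates = [d for d in result
                      if d['timestamp'] <= detected
                      and detected - d['timestamp'] <= window]
        if candidates:
            candidates[-1]['caused_incident'] = True  # most recent candidate
    return result


# Hypothetical sample data
sample_deploys = [
    {'timestamp': datetime.datetime(2026, 5, 13, 14, 0)},
    {'timestamp': datetime.datetime(2026, 5, 12, 10, 0)},
]
sample_incidents = [
    {'detected_at': datetime.datetime(2026, 5, 12, 10, 5)},
]

for deploy in attribute_incidents(sample_deploys, sample_incidents):
    print(deploy['timestamp'], deploy['caused_incident'])
```

A time window will misattribute incidents when deploys are frequent or when a fault surfaces late, so treat the resulting CFR as an estimate and review the flagged deploys periodically.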
Tools & Dashboards
import datetime
import statistics


def calculate_dora_metrics(deploys, incidents, period_days=30):
    """
    Calculate DORA metrics from deployment and incident data.

    Args:
        deploys: List of dicts with 'timestamp', 'commit_time', 'caused_incident'
        incidents: List of dicts with 'detected_at', 'resolved_at'
        period_days: Measurement period in days
    """
    # 1. Deployment Frequency (deploys per day over the period)
    deployment_frequency = len(deploys) / period_days
    df_label = classify_df(deployment_frequency)

    # 2. Lead Time for Changes (median, in hours)
    lead_times = []
    for deploy in deploys:
        lt = (deploy['timestamp'] - deploy['commit_time']).total_seconds() / 3600
        lead_times.append(lt)
    # Note: with no data the value falls back to 0, which classifies as Elite.
    median_lead_time = statistics.median(lead_times) if lead_times else 0

    # 3. Change Failure Rate (percentage of deploys that caused an incident)
    failed_deploys = sum(1 for d in deploys if d['caused_incident'])
    change_failure_rate = (failed_deploys / len(deploys) * 100) if deploys else 0

    # 4. Time to Recovery (median rather than mean, in hours)
    recovery_times = []
    for incident in incidents:
        rt = (incident['resolved_at'] - incident['detected_at']).total_seconds() / 3600
        recovery_times.append(rt)
    median_mttr = statistics.median(recovery_times) if recovery_times else 0

    return {
        'deployment_frequency': {
            'value': round(deployment_frequency, 2),
            'unit': 'deploys/day',
            'tier': df_label
        },
        'lead_time': {
            'value': round(median_lead_time, 1),
            'unit': 'hours',
            'tier': classify_lt(median_lead_time)
        },
        'change_failure_rate': {
            'value': round(change_failure_rate, 1),
            'unit': '%',
            'tier': classify_cfr(change_failure_rate)
        },
        'mttr': {
            'value': round(median_mttr, 1),
            'unit': 'hours',
            'tier': classify_mttr(median_mttr)
        }
    }
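

def classify_df(deploys_per_day):
    # Elite is "multiple times per day"; exactly once per day falls in High.
    if deploys_per_day > 1:
        return 'Elite'
    elif deploys_per_day >= 1/7:   # at least once per week
        return 'High'
    elif deploys_per_day >= 1/30:  # at least once per month
        return 'Medium'
    return 'Low'


def classify_lt(hours):
    if hours < 1:
        return 'Elite'
    elif hours < 168:  # 1 week
        return 'High'
    elif hours < 720:  # 1 month (30 days)
        return 'Medium'
    return 'Low'


def classify_cfr(percentage):
    # The tier table gives High and Medium the same 16-30% band, so CFR
    # alone cannot separate them; that band is reported as High here.
    # Anything above 30% is treated as Low.
    if percentage <= 15:
        return 'Elite'
    elif percentage <= 30:
        return 'High'
    return 'Low'


def classify_mttr(hours):
    if hours < 1:
        return 'Elite'
    elif hours < 24:   # 1 day
        return 'High'
    elif hours < 168:  # 1 week
        return 'Medium'
    return 'Low'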


# Example usage with sample data
deploys = [
    {'timestamp': datetime.datetime(2026, 5, 13, 14, 0),
     'commit_time': datetime.datetime(2026, 5, 13, 12, 30),
     'caused_incident': False},
    {'timestamp': datetime.datetime(2026, 5, 13, 16, 0),
     'commit_time': datetime.datetime(2026, 5, 13, 14, 45),
     'caused_incident': False},
    {'timestamp': datetime.datetime(2026, 5, 12, 10, 0),
     'commit_time': datetime.datetime(2026, 5, 12, 8, 0),
     'caused_incident': True},
]
incidents = [
    {'detected_at': datetime.datetime(2026, 5, 12, 10, 5),
     'resolved_at': datetime.datetime(2026, 5, 12, 10, 35)},
]

metrics = calculate_dora_metrics(deploys, incidents, period_days=7)
for name, data in metrics.items():
    print(f"{name}: {data['value']} {data['unit']} ({data['tier']})")
Purpose-built DORA measurement tools include:
- Four Keys (Google) — Open-source DORA dashboard using BigQuery and Cloud Build events
- LinearB — Git analytics platform with built-in DORA tracking
- Sleuth — Deployment tracking focused specifically on DORA metrics
- Jellyfish — Engineering management platform with DORA and business alignment
- Faros AI — Open-source connector that aggregates data from 50+ tools
- Backstage + plugins — Internal developer portal with DORA metric widgets
Beyond DORA
While the four key metrics remain the foundation, the research community has expanded measurement to address additional dimensions of software delivery and developer experience.
The Reliability Metric (5th DORA Metric)
The DORA team also tracks a fifth metric, operational performance. Introduced as Availability in the 2018 report and broadened to Reliability in 2021, it asks whether a team meets or exceeds its reliability targets (SLOs). This acknowledges that operational performance matters alongside delivery speed. A team deploying 100 times per day but consistently missing SLOs isn't performing well.
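Checking a reliability target can be as simple as comparing measured availability against the SLO and tracking how much of the permitted failure allowance (the error budget) has been spent. The sketch below assumes a request-based availability SLO; the function name, the field names, and the traffic figures are hypothetical.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Compare measured availability with an SLO over one window.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    allowed_failures = (1 - slo_target) * total_requests
    measured = 1 - failed_requests / total_requests
    budget_used = (failed_requests / allowed_failures
                   if allowed_failures else float('inf'))
    return {
        'availability': measured,
        'meets_slo': measured >= slo_target,
        'budget_used': budget_used,  # 1.0 means the budget is fully spent
    }


# Hypothetical month: 1,000,000 requests against a 99.9% SLO
good_month = error_budget_report(0.999, 1_000_000, 400)
bad_month = error_budget_report(0.999, 1_000_000, 1_500)

for label, report in (('good', good_month), ('bad', bad_month)):
    print(f"{label}: availability={report['availability']:.4%} "
          f"meets_slo={report['meets_slo']} "
          f"budget_used={report['budget_used']:.0%}")
```

Reporting budget consumption alongside the pass/fail result gives earlier warning than the boolean alone, since a team can see the budget draining before the SLO is actually missed.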
The SPACE Framework
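Published in 2021 by Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler (researchers at GitHub, the University of Victoria, and Microsoft Research), SPACE measures developer productivity across five dimensions: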
| Dimension | What It Measures | Example Metrics |
|---|---|---|
| Satisfaction | Developer happiness and fulfilment | Survey scores, eNPS, retention |
| Performance | Outcomes and quality of work | Customer impact, reliability, code quality |
| Activity | Volume of actions (use carefully) | PRs merged, deployments, code reviews completed |
| Communication | Collaboration and knowledge flow | PR review time, documentation, onboarding speed |
| Efficiency | Minimal friction and interruptions | Flow state time, context switches, wait time |
Flow Metrics
Flow metrics (from Mik Kersten's Project to Product) measure value stream efficiency:
- Flow Time — Total time from work item creation to delivery (includes wait time)
- Flow Efficiency — Active work time / total flow time × 100% (typically 15–40%)
- Flow Load — Work items in progress (correlates with lead time via Little's Law)
- Flow Velocity — Number of items completed per time period
- Flow Distribution — Ratio of features vs defects vs debt vs risk work
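The flow metrics above reduce to a few lines of arithmetic. The sketch below shows flow efficiency, the Little's Law relationship between flow load and flow time, and flow distribution. All function names and sample values are illustrative assumptions, not figures from Project to Product.

```python
from collections import Counter


def flow_efficiency(active_hours, total_flow_hours):
    """Share of flow time spent actively working, as a percentage."""
    return active_hours / total_flow_hours * 100


def expected_flow_time_days(flow_load, flow_velocity_per_day):
    """Little's Law: average time in system = average WIP / average throughput."""
    return flow_load / flow_velocity_per_day


def flow_distribution(items):
    """Percentage of completed items by type (feature, defect, debt, risk)."""
    counts = Counter(item['type'] for item in items)
    total = sum(counts.values())
    return {kind: n / total * 100 for kind, n in counts.items()}


# Hypothetical work item: 12 hours of active work inside 60 hours of flow time
print(f"Flow efficiency: {flow_efficiency(12, 60):.0f}%")

# Hypothetical team: 30 items in progress, 3 items finished per day
print(f"Expected flow time: {expected_flow_time_days(30, 3):.1f} days")
# Halving work in progress halves expected flow time at the same velocity
print(f"With half the load: {expected_flow_time_days(15, 3):.1f} days")

completed = [{'type': 'feature'}, {'type': 'feature'},
             {'type': 'defect'}, {'type': 'debt'}]
print(flow_distribution(completed))
```

Little's Law holds for long-run averages in a stable system, so the flow time it predicts is a planning estimate rather than a guarantee for any single work item.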
Improving Each Metric
Improving Deployment Frequency
The path from monthly to daily (or multiple daily) deployments requires changes across code, process, and culture:
- Trunk-based development — Short-lived branches (hours, not weeks). Merge to main daily
- Feature flags — Decouple deployment from release. Deploy dark features that can be enabled independently
- Automated deployments — Every merge to main triggers production deployment (with gates)
- Smaller batches — Break large features into independently deployable increments
- Decouple services — Independent deployability means one team's changes don't block another's
# GitOps: Automated deployment on merge to main
# ArgoCD Application watching main branch
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/payment-service
    targetRevision: main
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
Improving Lead Time
Lead time is the sum of all wait times and processing times in your pipeline. To reduce it, identify and eliminate bottlenecks:
import datetime


# Lead time breakdown analysis
def analyse_lead_time(change):
    """Break down where time is spent in the delivery pipeline."""
    stages = {
        'coding': change['pr_opened'] - change['first_commit'],
        'review_wait': change['first_review'] - change['pr_opened'],
        'review_cycles': change['approved'] - change['first_review'],
        'ci_pipeline': change['ci_complete'] - change['approved'],
        'deploy_wait': change['deploy_start'] - change['ci_complete'],
        'deployment': change['deploy_complete'] - change['deploy_start'],
    }
    total = sum(stages.values(), datetime.timedelta())

    print(f"Total Lead Time: {total}")
    print("\nBreakdown:")
    for stage, duration in stages.items():
        pct = duration / total * 100       # timedelta / timedelta gives a float
        bar = '█' * int(pct / 2)
        print(f"  {stage:20s}: {str(duration):>15s} ({pct:.0f}%) {bar}")

    # Identify biggest bottleneck
    bottleneck = max(stages, key=stages.get)
    print(f"\nBottleneck: {bottleneck} ({stages[bottleneck]})")
    return stages


# Example: typical team before optimisation
sample_change = {
    'first_commit': datetime.datetime(2026, 5, 10, 9, 0),
    'pr_opened': datetime.datetime(2026, 5, 10, 11, 0),
    'first_review': datetime.datetime(2026, 5, 11, 14, 0),   # Next day!
    'approved': datetime.datetime(2026, 5, 12, 10, 0),
    'ci_complete': datetime.datetime(2026, 5, 12, 10, 45),
    'deploy_start': datetime.datetime(2026, 5, 13, 9, 0),    # Next deploy window
    'deploy_complete': datetime.datetime(2026, 5, 13, 9, 15),
}
analyse_lead_time(sample_change)
Common lead time improvements:
- Reduce review wait time — Set team SLAs for PR review (e.g., first review within 4 hours)
- Parallelise CI — Run tests concurrently; aim for sub-10-minute pipelines
- Eliminate manual gates — Replace change advisory boards with automated policy checks
- Deploy continuously — Remove "deploy windows"; deploy any time CI passes
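A review SLA is only useful if breaches are visible. The sketch below flags pull requests whose first review arrived later than the agreed limit. It reuses the `pr_opened` and `first_review` field names from the breakdown example; the `id` field, the function name, and the sample data are hypothetical, and the calculation uses wall-clock time with no allowance for weekends or working hours.

```python
import datetime


def review_sla_breaches(pull_requests, sla_hours=4):
    """Return (id, wait) pairs for PRs whose first review exceeded the SLA."""
    sla = datetime.timedelta(hours=sla_hours)
    breaches = []
    for pr in pull_requests:
        wait = pr['first_review'] - pr['pr_opened']
        if wait > sla:
            breaches.append((pr['id'], wait))
    return breaches


# Hypothetical sample data
sample_prs = [
    {'id': 101,
     'pr_opened': datetime.datetime(2026, 5, 10, 11, 0),
     'first_review': datetime.datetime(2026, 5, 11, 14, 0)},
    {'id': 102,
     'pr_opened': datetime.datetime(2026, 5, 10, 9, 0),
     'first_review': datetime.datetime(2026, 5, 10, 10, 30)},
]

for pr_id, wait in review_sla_breaches(sample_prs):
    print(f"PR {pr_id} waited {wait} for first review")
```

In keeping with the guidance on metrics elsewhere in this article, a report like this is best reviewed at team level to spot systemic delays, not used to single out individual reviewers.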
Improving Change Failure Rate
- Comprehensive automated testing — Unit, integration, contract, and end-to-end tests in CI
- Canary deployments — Route 1–5% of traffic to new version; monitor before full rollout
- Progressive delivery — Ring-based deployments, starting with internal users
- Feature flags with kill switches — Instant disable without redeployment
- Smaller batch sizes — Fewer changes per deployment = fewer failure modes
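Two of the practices above, percentage-based exposure and kill switches, can be sketched in a few lines. The class below is a minimal in-memory illustration under stated assumptions: the `FeatureFlag` name, its methods, and the hash-bucketing scheme are hypothetical and do not reflect any particular feature-flag product, which would also persist flag state and distribute changes to running services.

```python
import hashlib


class FeatureFlag:
    """In-memory flag with a percentage rollout and a kill switch."""

    def __init__(self, name, rollout_percent=0, enabled=True):
        self.name = name
        self.rollout_percent = rollout_percent  # 0-100
        self.enabled = enabled                  # kill switch state

    def _bucket(self, user_id):
        # Hash flag name + user id so each user lands in a stable bucket 0-99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_on(self, user_id):
        if not self.enabled:
            return False
        return self._bucket(user_id) < self.rollout_percent

    def kill(self):
        """Disable the feature for everyone without a redeploy."""
        self.enabled = False


# Expose a hypothetical feature to roughly 5% of users
checkout_flag = FeatureFlag('new-checkout', rollout_percent=5)
exposed = sum(checkout_flag.is_on(user_id) for user_id in range(1000))
print(f"Users exposed at 5%: {exposed} of 1000")

# Something looks wrong in monitoring: flip the kill switch
checkout_flag.kill()
print(f"Users exposed after kill: "
      f"{sum(checkout_flag.is_on(user_id) for user_id in range(1000))}")
```

Hashing the user id rather than sampling randomly keeps each user's experience stable across requests, which makes failures easier to reproduce and attribute.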
Improving MTTR
- Observability — Distributed tracing, structured logging, real-time dashboards
- Automated rollback — One-click (or zero-click) revert to last known good state
- Runbooks — Pre-written incident response procedures for known failure modes
- Incident response automation — PagerDuty/Opsgenie escalation, auto-creation of war rooms
- Chaos engineering — Practice recovery regularly so it's muscle memory during real incidents
# Argo Rollouts: Automated canary with auto-rollback
# (selector and pod template omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5            # 5% traffic to canary
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: payment-api
        - setWeight: 25           # 25% if analysis passes
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: latency-check   # template not shown here
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100          # Full rollout
  # A failed analysis step aborts the rollout and returns traffic to the stable version.
  # rollbackWindow lets a rollback to one of the last 2 revisions skip the canary steps.
  rollbackWindow:
    revisions: 2
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name          # supplied by the Rollout's analysis step
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
Anti-Patterns in Metrics
Metrics are powerful — but they can be weaponised. Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." Here's how teams game DORA metrics and why it's destructive:
| Anti-Pattern | What It Looks Like | Why It's Harmful |
|---|---|---|
| Empty Deploys | Deploying trivial changes to inflate DF | Measures activity, not value delivery |
| Skipping Reviews | Auto-merging to reduce lead time | Trades quality for speed numbers |
| Hiding Incidents | Not reporting failures to keep CFR low | Destroys learning and trust |
| Premature Closure | Closing incidents before full resolution | Artificially lowers MTTR, problems recur |
| Individual Rankings | Using metrics to compare/rank developers | Kills collaboration, incentivises gaming |
| Target Fixation | Setting rigid targets ("deploy 10x/day or else") | Context-free targets drive wrong behaviour |
Case Studies
Etsy: From Fortnightly Releases to 50+ Deploys Per Day
In 2009, Etsy deployed once every two weeks with a dedicated "deploy day" that took 4+ hours and frequently caused outages. By 2014, they averaged 50+ deploys per day with a change failure rate below 5%. How?
Key changes: (1) Eliminated the release branch — all developers committed to trunk. (2) Built "Deployinator" — a one-click deploy tool any engineer could use. (3) Adopted feature flags for everything — deploy code dark, enable separately. (4) Invested heavily in monitoring — every deploy correlated with real-time metrics dashboards. (5) Made deploy responsibility distributed — the person who wrote the code deploys it.
Result: Deployment time dropped from 4 hours to 15 minutes. Incident rate dropped 80%. Engineer satisfaction improved dramatically because "deploy fear" disappeared. Revenue grew 40% YoY during the transformation period.
Capital One: DORA-Driven Transformation in Financial Services
Capital One faced a common financial services challenge: heavy regulation demanding stability, but business requirements demanding speed. They adopted DORA metrics as their transformation compass.
Approach: (1) Baseline measurement across 300+ engineering teams. (2) Identified that 80% of lead time was wait time (approvals, manual testing, deploy windows). (3) Replaced manual change advisory boards with automated policy-as-code. (4) Invested in test automation to replace manual QA gates. (5) Moved to immutable infrastructure and blue-green deployments for safe rollbacks.
Result over 3 years: Deployment frequency improved from monthly to daily for most teams. Lead time dropped from 2 weeks to under 1 day. MTTR improved from days to under 2 hours. All while maintaining regulatory compliance and passing SOC2 audits with automated evidence collection.
Building a Metrics Program
Implementing DORA metrics isn't a one-day project. It's a cultural shift that requires careful introduction:
Phase 1: Baseline (Weeks 1–4)
- Identify data sources (CI/CD platform, incident tracker, source control)
- Instrument measurement collection (automated, not manual)
- Calculate current state for all four metrics
- Share results transparently with the team (no blame, just facts)
Phase 2: Understand (Weeks 5–8)
- Discuss what the numbers mean with the team
- Identify the biggest constraint (usually lead time or MTTR)
- Map the value stream to find where time is lost
- Generate improvement hypotheses collaboratively
Phase 3: Improve (Ongoing)
- Pick one metric to improve first (the biggest constraint)
- Run time-boxed experiments (2–4 weeks)
- Measure the impact of each change
- Celebrate improvements publicly
- Iterate — once one metric improves, address the next constraint
{
  "metrics_program": {
    "team": "payments-squad",
    "baseline_date": "2026-04-01",
    "baseline": {
      "deployment_frequency": "2 per week",
      "lead_time_hours": 72,
      "change_failure_rate_pct": 22,
      "mttr_hours": 4.5
    },
    "current_tier": "High",
    "target_tier": "Elite",
    "focus_metric": "lead_time",
    "improvement_hypothesis": "Reducing PR review wait time from 18h to 4h will cut lead time by 40%",
    "experiment": {
      "intervention": "Implement async code review SLA: first review within 4 hours",
      "start_date": "2026-05-13",
      "duration_weeks": 4,
      "success_criteria": "Median lead time drops below 24 hours"
    },
    "review_cadence": "Bi-weekly metrics retro"
  }
}
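At the end of a time-boxed experiment, the success criterion in a record like the one above can be checked mechanically. The sketch below compares median lead time before and during the experiment against the 24-hour target. The function name and both samples of lead times are hypothetical.

```python
import statistics


def evaluate_experiment(baseline_hours, experiment_hours, target_hours):
    """Compare median lead time before and during an experiment."""
    before = statistics.median(baseline_hours)
    after = statistics.median(experiment_hours)
    return {
        'before': before,
        'after': after,
        'change_pct': (after - before) / before * 100,
        'success': after < target_hours,
    }


# Hypothetical per-change lead times in hours
baseline_lead_times = [70, 72, 80, 65, 74]
experiment_lead_times = [30, 22, 20, 26, 18]

outcome = evaluate_experiment(baseline_lead_times, experiment_lead_times,
                              target_hours=24)
print(f"Median lead time: {outcome['before']}h -> {outcome['after']}h "
      f"({outcome['change_pct']:.0f}%), success={outcome['success']}")
```

With only a handful of changes in each period, a shift in the median can be noise, so it is worth checking how many changes each period contains before celebrating or abandoning an intervention.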
Exercises
Given the following data for a team over the past 30 days: 45 production deployments, median commit-to-deploy time of 6.5 hours, 4 deployments caused incidents, incidents resolved in 25min, 40min, 2hr, and 45min respectively. Calculate all four DORA metrics, classify the team's tier for each, and determine their overall performance category. What single improvement would have the biggest impact?
A team has the following lead time breakdown: Coding (2h), PR wait (22h), Review (3h), CI pipeline (45min), Deploy approval (8h), Deployment (10min). Total: ~36 hours. (1) Identify the two biggest bottlenecks. (2) Propose specific interventions for each. (3) Estimate the new lead time if your interventions succeed. (4) What tier would the team move to?
Your team is currently at "Medium" tier (weekly deploys, 2-week lead time, 25% CFR, 18-hour MTTR). Design a 12-week improvement plan targeting "High" tier. For each metric: (1) Define the target value, (2) List 2-3 specific technical or process changes, (3) Identify dependencies and risks, (4) Define how you'll measure progress weekly.
Choose your team's actual CI/CD system (GitHub Actions, GitLab CI, Jenkins, etc.) and design a measurement pipeline that automatically calculates DORA metrics. Document: (1) Which events/webhooks provide the data, (2) How you'll store historical data, (3) How you'll visualise trends (tool choice + dashboard mockup), (4) How you'll handle edge cases (reverts, hotfixes, maintenance deploys).
Conclusion & Next Steps
DORA metrics give you an evidence-based compass for software delivery improvement. They tell you where you are, where you can go, and — when measured over time — whether your investments in tooling, process, and culture are actually working.
The key lessons: speed and stability are not tradeoffs — they reinforce each other. Smaller batches reduce risk rather than increasing it. Metrics are for learning, not punishment. And the gap between elite and low performers is growing — the time to invest in delivery performance is now.
Remember: you can't improve what you don't measure, but you can destroy what you measure badly. Use DORA metrics with wisdom — as a mirror for self-reflection, not a weapon for judgement.
Next in the Series
In Part 26: Reliability & Observability, we'll build on MTTR by learning how to instrument systems for deep observability — distributed tracing, SLOs, error budgets, and the practices that make sub-hour recovery possible.