The DevOps Observability Loop
The DevOps infinity loop (Plan → Code → Build → Test → Release → Deploy → Operate → Monitor) generates telemetry at every stage. The most effective engineering organizations use observability data not just for operations, but as the primary feedback mechanism that informs planning, design, and development decisions.
flowchart LR
P["Plan
SLO budgets inform
feature priorities"]
C["Code
Instrumentation
as part of dev"]
B["Build
Pipeline metrics
build health"]
T["Test
Performance gates
k6 thresholds"]
R["Release
Canary analysis
feature flags"]
D["Deploy
Deployment annotations
change tracking"]
O["Operate
Dashboards, alerts
incident response"]
M["Monitor
SLI/SLO tracking
error budgets"]
P --> C --> B --> T --> R --> D --> O --> M
M -->|"Feedback loop"| P
DORA Metrics
The DevOps Research and Assessment (DORA) team identified four key metrics that predict software delivery performance. All four can be measured with observability data:
| Metric | Elite | High | Medium | Low | How to Measure |
|---|---|---|---|---|---|
| Deployment Frequency | On-demand (multiple/day) | Weekly–monthly | Monthly–quarterly | Quarterly+ | Count deployment annotations per service |
| Lead Time for Changes | < 1 hour | 1 day–1 week | 1 week–1 month | 1–6 months | Commit timestamp → deployment annotation timestamp |
| Change Failure Rate | 0–15% | 16–30% | 31–45% | 46–60% | Deploys followed by rollback or incident / total deploys |
| Mean Time to Recovery | < 1 hour | < 1 day | < 1 week | > 1 week | Incident start → resolution (from Grafana Incident) |
# PromQL queries for DORA metrics dashboard
# Assumes annotation events are exported to Prometheus as a metric
# named grafana_annotation_created with a tags label
# Deployment Frequency (deployments per day, per service)
sum by (service) (
  count_over_time(
    grafana_annotation_created{tags=~".*deploy.*"}[24h]
  )
)
# Change Failure Rate (% of deploys followed by a rollback, over 7 days)
sum(
  count_over_time(grafana_annotation_created{tags=~".*rollback.*"}[7d])
) /
sum(
  count_over_time(grafana_annotation_created{tags=~".*deploy.*"}[7d])
) * 100
# Mean Time to Recovery (average incident duration)
avg(
  grafana_incident_duration_seconds{status="resolved"}
)
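The same measurements can also be computed outside PromQL, for example in a reporting script. The sketch below is illustrative only: the record fields (`commitAt`, `deployedAt`, `rolledBack`) are hypothetical names, timestamps are assumed to be epoch milliseconds, and the median is used for lead time as one reasonable choice of average.

```javascript
// Sketch only: record fields are hypothetical, timestamps are epoch milliseconds.
function median(sortedValues) {
  const n = sortedValues.length;
  if (n === 0) return null;
  const mid = Math.floor(n / 2);
  return n % 2 === 1
    ? sortedValues[mid]
    : (sortedValues[mid - 1] + sortedValues[mid]) / 2;
}

// deploys: [{ commitAt, deployedAt, rolledBack }], windowDays: length of the reporting window
function doraSummary(deploys, windowDays) {
  const total = deploys.length;
  const rolledBack = deploys.filter((d) => d.rolledBack).length;
  const leadTimesHours = deploys
    .map((d) => (d.deployedAt - d.commitAt) / 3600000)
    .sort((a, b) => a - b);
  return {
    deploysPerDay: total / windowDays,
    changeFailureRatePct: total === 0 ? 0 : (rolledBack * 100) / total,
    medianLeadTimeHours: median(leadTimesHours),
  };
}

const HOUR_MS = 3600000;
const exampleDeploys = [
  { commitAt: 0, deployedAt: 1 * HOUR_MS, rolledBack: false },
  { commitAt: 0, deployedAt: 2 * HOUR_MS, rolledBack: true },
  { commitAt: 0, deployedAt: 3 * HOUR_MS, rolledBack: false },
  { commitAt: 0, deployedAt: 4 * HOUR_MS, rolledBack: false },
];
console.log(doraSummary(exampleDeploys, 2));
```

Treating a rollback as the failure signal mirrors the change failure rate query above. Teams that also count incidents would need to join in incident records.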
Observability at Each Stage
Deployment Tracking & Annotations
Grafana Annotations API
Deployment annotations create vertical markers on Grafana dashboards, making it trivial to correlate metric changes with code changes:
# Create a deployment annotation from CI/CD pipeline
curl -X POST https://grafana.example.com/api/annotations \
-H "Authorization: Bearer ${GRAFANA_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"time": '$(date +%s000)',
"tags": ["deploy", "order-service", "v2.4.1"],
"text": "Deployed order-service v2.4.1\nCommit: abc123\nAuthor: jane.doe\nChanges: Fix payment retry logic, Add new discount endpoint"
}'
# GitHub Actions step to create deployment annotation
- name: Annotate Deployment in Grafana
  if: success()
  run: |
    curl -s -X POST "${{ secrets.GRAFANA_URL }}/api/annotations" \
      -H "Authorization: Bearer ${{ secrets.GRAFANA_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d '{
        "time": '"$(date +%s000)"',
        "timeEnd": '"$(date +%s000)"',
        "tags": ["deploy", "${{ github.event.repository.name }}", "${{ github.sha }}"],
        "text": "Deploy ${{ github.event.repository.name }}@${{ github.sha }}\nTriggered by: ${{ github.actor }}\nBranch: ${{ github.ref_name }}\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
      }'
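Hand-quoted JSON in shell is easy to break when a commit message or author name contains quotes. As a hedged alternative, the annotation body can be built in a small script and serialized with `JSON.stringify`. The helper name `buildDeployAnnotation` and its parameters are illustrative assumptions. Only the `time`, `tags`, and `text` fields come from the examples above.

```javascript
// Sketch only: buildDeployAnnotation is a hypothetical helper, not part of any Grafana SDK.
function buildDeployAnnotation({ service, version, commit, author, nowMs = Date.now() }) {
  return {
    time: nowMs, // epoch milliseconds, same unit as $(date +%s000)
    tags: ['deploy', service, version],
    text: `Deployed ${service} ${version}\nCommit: ${commit}\nAuthor: ${author}`,
  };
}

// JSON.stringify escapes quotes and newlines, so free-form text stays valid JSON.
const annotationBody = JSON.stringify(
  buildDeployAnnotation({
    service: 'order-service',
    version: 'v2.4.1',
    commit: 'abc123',
    author: 'jane.doe',
    nowMs: 1700000000000,
  })
);
console.log(annotationBody);
```

The resulting string can be sent as the request body of the same `POST /api/annotations` call with whichever HTTP client the pipeline already uses.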
Change-to-Impact Correlation
The real power of deployment annotations emerges when correlating them with service metrics. When a latency spike occurs, overlay the annotation layer to immediately identify “what changed?”:
# PromQL: Show error rate with deployment annotations
# Panel 1: Error rate (%)
# sum() on both sides so 5xx traffic is divided by total traffic,
# not matched series-by-series against itself
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
# Panel 2: Deployment annotations (overlay)
# In Grafana: Dashboard Settings → Annotations → Add
# Query: tags = deploy AND service = order-service
# Show as: Vertical lines with tags
# Automated regression check:
# "Is the error rate more than 2x what it was 1 hour ago?"
# The expression does not reference deployments itself; read it together
# with the annotation overlay to see whether a deploy preceded the increase.
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
)
> 2 * (
  sum(rate(http_requests_total{status=~"5.."}[5m] offset 1h))
  / sum(rate(http_requests_total[5m] offset 1h))
)
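To tie a regression to a specific deploy rather than to a fixed one-hour offset, the comparison can be done per annotation. The following is a minimal sketch under stated assumptions: `deploys` and `samples` are hypothetical arrays already fetched from the annotation and metrics APIs, with timestamps in epoch milliseconds, and the 30-minute window and 2x factor are example values.

```javascript
// Sketch only: input shapes are assumptions.
// deploys: [{ t, tag }], samples: [{ t, errorRate }]
function deploysWithRegression(deploys, samples, { windowMs = 30 * 60 * 1000, factor = 2 } = {}) {
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return deploys.filter((d) => {
    const before = samples
      .filter((s) => s.t >= d.t - windowMs && s.t < d.t)
      .map((s) => s.errorRate);
    const after = samples
      .filter((s) => s.t >= d.t && s.t <= d.t + windowMs)
      .map((s) => s.errorRate);
    // Without data on both sides of the deploy there is nothing to compare.
    if (before.length === 0 || after.length === 0) return false;
    // Flag the deploy if the worst post-deploy sample exceeds factor x the pre-deploy mean.
    return Math.max(...after) > factor * mean(before);
  });
}
```

Comparing against the window immediately before each deploy keeps the baseline local, which avoids false positives from daily traffic patterns that a fixed offset can pick up.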
CI/CD Pipeline Monitoring
Pipeline Metrics
Treat your CI/CD pipelines as production services with their own SLOs. Export pipeline metrics to Prometheus/Mimir:
# Pipeline metrics to track
pipeline_metrics:
  - name: pipeline_duration_seconds
    type: histogram
    description: "Time from commit to production deploy"
    labels: [service, branch, result]
  - name: pipeline_stage_duration_seconds
    type: histogram
    description: "Duration of each pipeline stage"
    labels: [service, stage, result]
    # stages: build, unit_test, integration_test, security_scan, deploy_staging, e2e_test, deploy_prod
  - name: pipeline_runs_total
    type: counter
    description: "Total pipeline executions"
    labels: [service, result, trigger]
    # result: success, failure, cancelled
    # trigger: push, pr, schedule, manual
  - name: pipeline_test_results
    type: gauge
    description: "Test execution results"
    labels: [service, suite, result]
    # suite: unit, integration, e2e, performance
  - name: pipeline_flaky_tests_total
    type: counter
    description: "Tests that flipped pass/fail without code changes"
    labels: [service, test_name]
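The flaky-test counter needs a definition of "flipped without code changes". One possible reading, sketched below, is a test that both passed and failed on the same commit. The record shape (`commit`, `testName`, `passed`) is a hypothetical export format, not something a specific CI system is known to emit.

```javascript
// Sketch only: runs is a hypothetical list of test results, [{ commit, testName, passed }].
function findFlakyTests(runs) {
  const outcomes = new Map(); // (commit, test) -> set of observed outcomes
  for (const { commit, testName, passed } of runs) {
    const key = JSON.stringify([commit, testName]);
    if (!outcomes.has(key)) outcomes.set(key, { testName, seen: new Set() });
    outcomes.get(key).seen.add(Boolean(passed));
  }
  const flaky = new Set();
  for (const { testName, seen } of outcomes.values()) {
    // Both true and false on the same commit means the result changed with no code change.
    if (seen.size > 1) flaky.add(testName);
  }
  return [...flaky].sort();
}
```

Each name returned here would increment `pipeline_flaky_tests_total` for its `test_name` label.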
Build & Test Observability
Export CI/CD telemetry using OpenTelemetry for trace-based pipeline analysis:
# OpenTelemetry CI/CD integration (GitHub Actions example)
# .github/workflows/ci.yml
# Assumes otel-cli is already installed on the runner
name: CI Pipeline
on: push
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    env:
      OTEL_EXPORTER_OTLP_ENDPOINT: "https://otel-collector.example.com:4317"
      OTEL_SERVICE_NAME: "ci-pipeline"
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: |
          # Wrap build in an OTel span
          otel-cli exec --name "build" --kind server -- \
            docker build -t app:${{ github.sha }} .
      - name: Unit Tests
        run: |
          otel-cli exec --name "unit-tests" --kind server -- \
            go test ./... -v -count=1
      - name: Integration Tests
        run: |
          # Compose v2 syntax; the standalone docker-compose binary may not be present on hosted runners
          otel-cli exec --name "integration-tests" --kind server -- \
            docker compose -f docker-compose.test.yml up --abort-on-container-exit
      - name: Security Scan
        run: |
          otel-cli exec --name "security-scan" --kind server -- \
            trivy image app:${{ github.sha }}
CI/CD Health Dashboard Panels
| Panel | Metric | Target |
|---|---|---|
| Pipeline Success Rate | success / total * 100 | > 95% |
| Mean Build Duration | sum(rate(pipeline_duration_seconds_sum[7d])) / sum(rate(pipeline_duration_seconds_count[7d])) | < 10 min |
| Flaky Test Rate | flaky_tests / total_tests * 100 | < 2% |
| Deploy Frequency (7d) | count(deploys) over 7d | Trending up |
| Slowest Stages | p95(stage_duration) by stage | Identify bottlenecks |
| Queue Wait Time | time_queued_seconds | < 5 min |
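Two of the panels above, success rate and p95 duration, can be sanity-checked against raw run records. This is a rough sketch that assumes a hypothetical list of runs with `result` and `durationSeconds` fields and uses the nearest-rank percentile method, which may differ slightly from the interpolated value that `histogram_quantile` reports.

```javascript
// Sketch only: nearest-rank percentile over raw values.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// runs: [{ result: 'success' | 'failure' | 'cancelled', durationSeconds }]
function pipelineHealth(runs) {
  const total = runs.length;
  const succeeded = runs.filter((r) => r.result === 'success').length;
  return {
    successRatePct: total === 0 ? null : (succeeded * 100) / total,
    p95DurationSeconds: percentile(runs.map((r) => r.durationSeconds), 95),
  };
}
```

A result such as `successRatePct: 95` would sit exactly on the boundary of the "> 95%" target, so the comparison operator used in the alert matters.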
Progressive Delivery
Canary Deployments
Canary deployments route a small percentage of traffic to the new version while observability data validates its health. If metrics degrade, the canary is rolled back automatically:
flowchart TD
D["Deploy v2.4.1
to 5% of traffic"]
O["Observe for 10 min
Compare vs baseline"]
C{{"Canary healthy?
Error rate < 1%?
Latency p95 < 800ms?"}}
P1["Promote to 25%"]
P2["Promote to 50%"]
P3["Promote to 100%"]
R["Rollback to v2.4.0"]
A["Alert on-call team"]
D --> O --> C
C -->|"Yes"| P1 --> O
P1 --> C
C -->|"Yes (25%)"| P2
P2 --> O
C -->|"Yes (50%)"| P3
C -->|"No"| R --> A
# Flagger canary analysis with Grafana metrics
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  progressDeadlineSeconds: 600
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5    # Max failed checks before rollback
    maxWeight: 50   # Max traffic percentage
    stepWeight: 10  # Increment per step
    metrics:
      # Query Mimir for error rate
      - name: error-rate
        templateRef:
          name: error-rate
          namespace: observability
        thresholdRange:
          max: 1      # Rollback if error rate > 1%
      # Query Mimir for latency
      - name: latency-p99
        templateRef:
          name: latency-p99
          namespace: observability
        thresholdRange:
          max: 1500   # Rollback if p99 > 1500ms
---
# Metric template querying Prometheus/Mimir
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: observability
spec:
  provider:
    type: prometheus
    address: http://mimir-query-frontend:8080/prometheus
  query: |
    100 - (
      sum(rate(http_requests_total{service="{{ target }}", status!~"5.."}[2m]))
      / sum(rate(http_requests_total{service="{{ target }}"}[2m]))
      * 100
    )
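The interaction between `stepWeight`, `maxWeight`, and `threshold` can be easier to follow as a small state machine. The sketch below is a simplified model of the progression described above, not Flagger's actual implementation. The state shape and phase names are assumptions made for illustration.

```javascript
// Sketch only: a simplified model of canary progression, one call per analysis interval.
// state: { phase: 'progressing' | 'promoted' | 'rolled_back', weight, failedChecks }
function nextCanaryState(state, checkPassed, { stepWeight = 10, maxWeight = 50, threshold = 5 } = {}) {
  if (state.phase !== 'progressing') return state; // terminal states do not change
  if (!checkPassed) {
    const failedChecks = state.failedChecks + 1;
    return failedChecks >= threshold
      ? { phase: 'rolled_back', weight: 0, failedChecks }
      : { ...state, failedChecks };
  }
  if (state.weight >= maxWeight) {
    // Healthy at maximum canary weight: shift all traffic to the new version.
    return { phase: 'promoted', weight: 100, failedChecks: state.failedChecks };
  }
  return { ...state, weight: Math.min(state.weight + stepWeight, maxWeight) };
}

// Walk a canary through a run of healthy checks.
let canary = { phase: 'progressing', weight: 0, failedChecks: 0 };
for (let i = 0; i < 6; i++) {
  canary = nextCanaryState(canary, true);
  console.log(canary.phase, canary.weight);
}
```

With the values from the manifest, five healthy intervals take the canary from 0% to 50%, and the next healthy check promotes it.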
Feature Flag Observability
Feature flags decouple deployment from release. Monitor flag impact by correlating flag state with application metrics:
// Instrument feature flag evaluation
import { faro } from '@grafana/faro-web-sdk';

function evaluateFlag(flagName, userId) {
  const variant = featureFlagClient.evaluate(flagName, userId);
  // Emit metric for flag evaluation
  faro.api.pushMeasurement({
    type: 'feature_flag',
    values: { evaluation_count: 1 },
    context: {
      flag_name: flagName,
      variant: variant,
      user_segment: getUserSegment(userId),
    },
  });
  return variant;
}

// In your application:
const showNewCheckout = evaluateFlag('new-checkout-flow', user.id);
if (showNewCheckout === 'variant_b') {
  renderNewCheckout();
} else {
  renderLegacyCheckout();
}
# PromQL: Compare conversion rate between flag variants
# Variant A (control) conversion rate
sum(rate(checkout_completed_total{flag_variant="control"}[1h]))
/ sum(rate(checkout_started_total{flag_variant="control"}[1h])) * 100
# Variant B (new checkout) conversion rate
sum(rate(checkout_completed_total{flag_variant="variant_b"}[1h]))
/ sum(rate(checkout_started_total{flag_variant="variant_b"}[1h])) * 100
# Difference in errors per second between variants (absolute, not normalized by traffic share)
sum(rate(http_errors_total{flag_variant="variant_b"}[1h]))
- sum(rate(http_errors_total{flag_variant="control"}[1h]))
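For an offline check of the same comparison, the per-variant conversion rates and their difference can be computed from raw counts. This is a sketch with a hypothetical input shape (`started` and `completed` counts keyed by variant). It reports the difference in percentage points and makes no claim about statistical significance.

```javascript
// Sketch only: counts is a hypothetical object, { [variant]: { started, completed } }.
function conversionRates(counts) {
  const rates = {};
  for (const [variant, { started, completed }] of Object.entries(counts)) {
    // null rather than NaN when a variant has received no traffic yet
    rates[variant] = started === 0 ? null : (completed * 100) / started;
  }
  return rates;
}

// Difference in percentage points between treatment and control.
function liftPoints(rates, control, treatment) {
  if (rates[control] == null || rates[treatment] == null) return null;
  return rates[treatment] - rates[control];
}

const checkoutRates = conversionRates({
  control: { started: 200, completed: 20 },
  variant_b: { started: 200, completed: 30 },
});
console.log(checkoutRates, liftPoints(checkoutRates, 'control', 'variant_b'));
```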
Automated Rollback
Use Grafana alerting rules to trigger automated rollbacks when deployment health degrades:
# Alert rule that triggers rollback webhook
groups:
  - name: deployment_safety
    rules:
      - alert: DeploymentDegradation
        # sum() on both sides of the ratio so the result is the overall error
        # fraction and both operands of "and" carry the same (empty) label set
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > 0.05
          and
          count(
            grafana_annotation_created{tags=~".*deploy.*"}
            unless grafana_annotation_created{tags=~".*deploy.*"} offset 30m
          ) > 0
        for: 3m
        labels:
          severity: critical
          action: auto-rollback
        annotations:
          summary: "Error rate > 5% within 30 minutes of deployment"
          runbook: "Automated rollback triggered. Verify in Grafana."
# Alertmanager webhook receiver for auto-rollback
receivers:
  - name: 'rollback-webhook'
    webhook_configs:
      - url: 'https://cd-system.internal/api/v1/rollback'
        send_resolved: false
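On the receiving side, the CD system has to decide what to roll back from the webhook body. The sketch below reads the `alerts` array that Alertmanager sends to webhook receivers and keeps only firing alerts labeled `action: auto-rollback`. The `service` label is an assumption: the rule above would need to aggregate by service, or attach the label statically, for it to be present.

```javascript
// Sketch only: assumes each alert carries a `service` label identifying what to roll back.
function servicesToRollBack(payload) {
  const services = new Set();
  for (const alert of payload.alerts ?? []) {
    const labels = alert.labels ?? {};
    if (alert.status === 'firing' && labels.action === 'auto-rollback' && labels.service) {
      services.add(labels.service);
    }
  }
  return [...services].sort();
}

const examplePayload = {
  status: 'firing',
  alerts: [
    { status: 'firing', labels: { alertname: 'DeploymentDegradation', action: 'auto-rollback', service: 'order-service' } },
    { status: 'firing', labels: { alertname: 'HighLatency', severity: 'warning', service: 'cart-service' } },
    { status: 'resolved', labels: { alertname: 'DeploymentDegradation', action: 'auto-rollback', service: 'payment-service' } },
  ],
};
console.log(servicesToRollBack(examplePayload));
```

Filtering on the explicit `action` label keeps ordinary paging alerts from triggering a rollback if they are ever routed to the same receiver.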
Chaos Engineering & Observability
Principles of Chaos
Chaos engineering uses controlled experiments to uncover systemic weaknesses. Observability is the measurement layer that determines whether the system behaved as expected during the experiment:
- Hypothesis: “If payment-service loses one replica, latency stays under 500ms and no orders fail”
- Steady state: Capture baseline metrics (p95 latency, error rate, throughput)
- Introduce chaos: Kill a pod, inject network delay, fill disk
- Observe: Compare metrics during chaos vs baseline using Grafana
- Conclude: Did the hypothesis hold? If not, what failed?
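The "Conclude" step can be made mechanical by encoding the hypothesis as explicit limits and checking the observed metrics against them. This is a minimal sketch: the field names are hypothetical, and the limits mirror the example hypothesis above (p95 latency under 500 ms, zero failed orders).

```javascript
// Sketch only: field names are illustrative, values would come from Grafana or Prometheus queries.
function evaluateHypothesis(baseline, observed, limits) {
  const violations = [];
  if (observed.p95LatencyMs >= limits.maxP95LatencyMs) {
    violations.push(`p95 latency ${observed.p95LatencyMs}ms is not under ${limits.maxP95LatencyMs}ms`);
  }
  if (observed.failedOrders > limits.maxFailedOrders) {
    violations.push(`${observed.failedOrders} orders failed (limit ${limits.maxFailedOrders})`);
  }
  return {
    held: violations.length === 0,
    violations,
    // Reported for context even when the hypothesis holds.
    latencyDeltaMs: observed.p95LatencyMs - baseline.p95LatencyMs,
  };
}

const chaosResult = evaluateHypothesis(
  { p95LatencyMs: 180, failedOrders: 0 },
  { p95LatencyMs: 420, failedOrders: 0 },
  { maxP95LatencyMs: 500, maxFailedOrders: 0 }
);
console.log(chaosResult);
```

Recording the delta against the baseline is useful even for passing experiments, since a hypothesis can hold while the safety margin shrinks.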
Observing Chaos Experiments
# LitmusChaos experiment with Grafana observability
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-payment
spec:
  appinfo:
    appns: 'production'
    applabel: 'app=payment-service'
  chaosServiceAccount: chaos-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '120'    # 2 minutes of chaos
            - name: CHAOS_INTERVAL
              value: '30'     # Kill a pod every 30 seconds
            - name: FORCE
              value: 'false'  # Graceful termination
        probe:
          # Validate via Prometheus/Mimir query during chaos
          - name: "latency-within-slo"
            type: "promProbe"
            mode: "Continuous"
            runProperties:
              probeTimeout: 5
              retry: 2
              interval: 10
            promProbe/inputs:
              endpoint: "http://mimir-query-frontend:8080/prometheus"
              # Aggregate buckets by le so histogram_quantile sees one histogram
              query: |
                histogram_quantile(0.95,
                  sum by (le) (rate(http_request_duration_seconds_bucket{service="payment-service"}[1m]))
                )
              # Probe passes while p95 latency stays under 0.5 seconds
              comparator:
                criteria: "<"
                value: "0.5"
Summary & Next Steps
- DORA metrics — measure deployment frequency, lead time, change failure rate, and MTTR with observability data to track engineering effectiveness
- Deployment annotations — mark deployments on Grafana dashboards for instant change-to-impact correlation
- CI/CD monitoring — treat pipelines as production services with their own SLOs, metrics, and dashboards
- Canary analysis — automated promotion/rollback based on observability queries comparing canary vs baseline metrics
- Feature flags — correlate flag variants with application metrics for data-driven release decisions
- Chaos engineering — use observability as the measurement layer for chaos experiments, validating system resilience
Next in the Series
In Part 15: Troubleshooting & Production Best Practices, we’ll conclude the Grafana track with advanced troubleshooting workflows, production anti-patterns to avoid, operational checklists, and a complete reference architecture bringing together everything from Parts 1–14.