What is AIOps?
AIOps (Artificial Intelligence for IT Operations) applies machine learning and data analytics to operations data — metrics, logs, traces, events — to automate detection, diagnosis, and resolution of operational issues. It evolves traditional monitoring from "alert when a threshold is crossed" to "detect anomalous patterns that humans would miss."
Think of traditional monitoring like a smoke detector — it triggers when smoke reaches a specific level. AIOps is like a fire prevention system that analyses electrical patterns, temperature trends, and humidity changes to predict where a fire might start before any smoke appears.
The Three Pillars + Events
flowchart TD
Metrics["Metrics<br/>(Prometheus)"] --> AIOps["AIOps Engine<br/>Correlation & ML"]
Logs["Logs<br/>(Loki / ELK)"] --> AIOps
Traces["Traces<br/>(Jaeger / Tempo)"] --> AIOps
Events["Events<br/>(K8s / CloudWatch)"] --> AIOps
AIOps --> Alert["Smart Alerts"]
AIOps --> Root["Root Cause<br/>Analysis"]
AIOps --> Auto["Auto<br/>Remediation"]
style Metrics fill:#e8f4f4,stroke:#3B9797,color:#132440
style Logs fill:#e8f4f4,stroke:#3B9797,color:#132440
style Traces fill:#e8f4f4,stroke:#3B9797,color:#132440
style Events fill:#e8f4f4,stroke:#3B9797,color:#132440
style AIOps fill:#f0f4f8,stroke:#16476A,color:#132440
style Alert fill:#fff5f5,stroke:#BF092F,color:#132440
style Root fill:#f0f4f8,stroke:#16476A,color:#132440
style Auto fill:#e8f4f4,stroke:#3B9797,color:#132440
Anomaly Detection
Traditional threshold-based alerting fails for dynamic systems. A request latency of 200ms might be normal at 2am but anomalous at 2pm. Anomaly detection uses statistical models and ML to learn "normal" patterns and flag deviations.
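To make the 2am versus 2pm point concrete, here is a minimal sketch of a time-of-day baseline with a z-score test. It is an illustration under stated assumptions, not a production detector: the function names (`build_hourly_baseline`, `zscore`, `is_anomalous`) and the sample latencies are hypothetical, and the "model" is just a per-hour mean and standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_hourly_baseline(samples):
    """Turn (hour_of_day, latency_ms) history into a per-hour (mean, std dev)."""
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def zscore(baseline, hour, latency_ms):
    """Number of standard deviations between the observation and that hour's mean."""
    avg, std = baseline[hour]
    if std == 0:
        # A flat baseline has no spread; any deviation counts as infinitely unusual.
        return 0.0 if latency_ms == avg else float("inf")
    return (latency_ms - avg) / std


def is_anomalous(baseline, hour, latency_ms, threshold=3.0):
    """Flag observations more than `threshold` standard deviations from normal."""
    return abs(zscore(baseline, hour, latency_ms)) > threshold


# Hypothetical history: requests are slow at 2am and fast at 2pm.
history = [(2, v) for v in (190, 200, 210, 205, 195)] + \
          [(14, v) for v in (78, 80, 82, 81, 79)]
baseline = build_hourly_baseline(history)

print(is_anomalous(baseline, 2, 200))   # False: 200ms matches the 2am baseline
print(is_anomalous(baseline, 14, 200))  # True: 200ms is far outside the 2pm baseline
```

A single static threshold cannot give both answers for the same 200ms reading; keying the baseline on hour of day is the simplest way to encode "normal depends on when".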
Prometheus Recording Rules for Anomaly Detection
# prometheus-anomaly-rules.yaml
# Statistical anomaly detection using recording rules
groups:
  - name: anomaly-detection
    interval: 1m
    rules:
      # Calculate rolling average (7-day baseline)
      - record: http_request_duration_avg_7d
        expr: |
          avg_over_time(
            rate(http_request_duration_seconds_sum[5m])[7d:1h]
          ) /
          avg_over_time(
            rate(http_request_duration_seconds_count[5m])[7d:1h]
          )
      # Calculate rolling standard deviation
      # (the ratio is parenthesised so the subquery applies to the whole expression)
      - record: http_request_duration_stddev_7d
        expr: |
          stddev_over_time(
            (
              rate(http_request_duration_seconds_sum[5m])
              / rate(http_request_duration_seconds_count[5m])
            )[7d:1h]
          )
      # Z-score anomaly detection (alert if > 3 standard deviations)
      - record: http_request_duration_zscore
        expr: |
          (
            rate(http_request_duration_seconds_sum[5m])
            / rate(http_request_duration_seconds_count[5m])
            - http_request_duration_avg_7d
          ) / http_request_duration_stddev_7d
  - name: anomaly-alerts
    rules:
      - alert: LatencyAnomaly
        expr: abs(http_request_duration_zscore) > 3
        for: 5m
        labels:
          severity: warning
          team: "{{ $labels.team }}"
        annotations:
          summary: "Latency anomaly detected for {{ $labels.service }}"
          description: |
            Z-score: {{ $value | printf "%.2f" }}
            Current latency deviates more than 3 standard deviations
            from the 7-day baseline. This may indicate a performance
            regression or unusual traffic pattern.
Predictive Alerting & Noise Reduction
Intelligent Alert Grouping
Alert fatigue is one of the most common operational failure modes. When on-call engineers receive 500 alerts in an hour, they stop reading them. Intelligent grouping correlates related alerts into a single incident, so one underlying failure produces one notification instead of hundreds.
# alertmanager-config.yaml: intelligent grouping and routing
global:
  resolve_timeout: 5m
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # Wait 30s to batch related alerts
  group_interval: 5m    # Wait 5m before sending updates
  repeat_interval: 4h   # Re-send unresolved alerts every 4h
  routes:
    # Critical alerts: immediate PagerDuty escalation
    - matchers:
        - severity = critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      continue: true
    # Silence known flappy alerts during maintenance
    - matchers:
        - alertname =~ "NodeNotReady|PodCrashLooping"
        - maintenance_window = active
      receiver: 'null'
    # Route by team label
    - matchers:
        - team = payments
      receiver: 'team-payments-slack'
    - matchers:
        - team = frontend
      receiver: 'team-frontend-slack'
inhibit_rules:
  # If a cluster is unreachable, suppress warning/info alerts from that cluster
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - severity =~ "warning|info"
    equal: ['cluster']
receivers:
  # Fallback for alerts that match no child route (no integrations configured here)
  - name: 'default-receiver'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key_file: /etc/alertmanager/pagerduty-key
        severity: critical
  - name: 'team-payments-slack'
    slack_configs:
      - api_url_file: /etc/alertmanager/slack-webhook
        channel: '#payments-alerts'
  - name: 'team-frontend-slack'
    slack_configs:
      - api_url_file: /etc/alertmanager/slack-webhook
        channel: '#frontend-alerts'
  - name: 'null'
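The effect of `group_by` and the inhibit rule can be sketched in a few lines of Python. This is a simplified model for intuition only and not how Alertmanager is implemented: it ignores timing (`group_wait`, `group_interval`) and routing, and the helper names and sample alerts are hypothetical.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "cluster", "service")  # mirrors group_by in the config


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts that share the same value for every group_by label."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibit(alerts, source_name="ClusterUnreachable",
            target_severities=("warning", "info")):
    """Drop low-severity alerts from any cluster that has a firing source alert."""
    down = {a["cluster"] for a in alerts if a["alertname"] == source_name}
    return [
        a for a in alerts
        if not (a["cluster"] in down
                and a["severity"] in target_severities
                and a["alertname"] != source_name)
    ]


# Hypothetical firing alerts
alerts = [
    {"alertname": "PodCrashLooping", "cluster": "eu-1", "service": "payments",
     "severity": "warning", "pod": "pay-1"},
    {"alertname": "PodCrashLooping", "cluster": "eu-1", "service": "payments",
     "severity": "warning", "pod": "pay-2"},
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "payments",
     "severity": "critical"},
    {"alertname": "ClusterUnreachable", "cluster": "us-1", "service": "control-plane",
     "severity": "critical"},
    {"alertname": "PodCrashLooping", "cluster": "us-1", "service": "frontend",
     "severity": "warning", "pod": "web-1"},
]

remaining = inhibit(alerts)
groups = group_alerts(remaining)
print(len(alerts), "alerts ->", len(groups), "notifications")
```

The two crash-looping payment pods collapse into one group, and the warning from the unreachable cluster is suppressed because the cluster-level alert already explains it.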
LinkedIn's Alert Noise Reduction
LinkedIn's infrastructure team was drowning in 15,000+ alerts per week across their 3,000+ microservices. Their AIOps initiative implemented three layers of noise reduction: (1) statistical deduplication that grouped 15,000 raw alerts into ~800 unique incidents, (2) ML-based correlation that linked related incidents across services (reducing to ~200 root causes), and (3) automated severity classification that correctly triaged 94% of incidents without human intervention. The result: on-call engineers went from 200+ daily pages to 15, with a 40% reduction in MTTR.
Chaos Engineering
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Instead of waiting for failures to happen, you deliberately inject failures to discover weaknesses before they cause outages.
Litmus ChaosEngine for Kubernetes
# Install Litmus Chaos (CNCF project)
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
# Verify installation
kubectl get pods -n litmus
# NAME READY STATUS RESTARTS AGE
# litmus-server-xxx 1/1 Running 0 1m
# chaos-operator-xxx 1/1 Running 0 1m
echo "Litmus Chaos installed successfully"
# chaos-experiment-pod-delete.yaml
# Experiment: Kill random pods and verify service recovery
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-service-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'      # Run for 60 seconds
            - name: CHAOS_INTERVAL
              value: '10'      # Delete a pod every 10 seconds
            - name: FORCE
              value: 'false'   # Graceful termination
            - name: PODS_AFFECTED_PERC
              value: '50'      # Kill 50% of pods
        probe:
          - name: health-check
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-service.production:8080/healthz
              method:
                get:
                  criteria: ==
                  responseCode: '200'
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              retry: 3
              interval: 10s
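The probe is what turns "kill some pods" into an experiment with a pass/fail verdict. The sketch below models that idea in Python: a health check is evaluated once per interval for the duration of the chaos, each evaluation may retry, and a single exhausted evaluation fails the run. This is an assumed, simplified model for illustration, not Litmus's actual probe implementation; `evaluate_probe` and `experiment_verdict` are hypothetical names, and status codes are canned rather than fetched over HTTP.

```python
from itertools import repeat


def evaluate_probe(attempt, retries, expected=200):
    """One probe evaluation: the first try plus up to `retries` further tries."""
    for _ in range(1 + retries):
        if attempt() == expected:
            return True
    return False


def experiment_verdict(attempt, duration_s, interval_s, retries):
    """Evaluate the probe once per interval; any failed evaluation fails the run."""
    for _ in range(duration_s // interval_s):
        if not evaluate_probe(attempt, retries):
            return "Fail"
    return "Pass"


# Service that blips occasionally but recovers within the retry budget
flaky = iter([200, 503, 200, 200, 503, 503, 200, 200, 200])
print(experiment_verdict(lambda: next(flaky), duration_s=60, interval_s=10, retries=3))

# Service that never recovers while pods are being deleted
down = repeat(503)
print(experiment_verdict(lambda: next(down), duration_s=60, interval_s=10, retries=3))
```

The first run passes because every blip recovers within the retries; the second fails on the first evaluation, which is the signal that the recovery mechanism needs work before a real outage tests it.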
# chaos-experiment-network-latency.yaml
# Experiment: Inject network latency between services
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-gateway
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '120'
            - name: NETWORK_LATENCY
              value: '500'          # Add 500ms latency
            - name: JITTER
              value: '100'          # ±100ms jitter
            - name: DESTINATION_IPS
              value: '10.0.0.0/8'   # Target internal traffic
            - name: CONTAINER_RUNTIME
              value: containerd
Self-Healing Infrastructure
Kubernetes Self-Healing Patterns
# PodDisruptionBudget: maintain availability during disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: payment-service
# HPA with custom metrics: scale based on business metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
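To see how the HPA above arrives at a replica count, the sketch below applies the documented scaling formula, desired = ceil(current replicas × current value / target value), to each metric, takes the largest proposal, and then applies the scale-up policy and the min/max bounds from the manifest. It is a simplified model: stabilization windows and the scale-down policy are ignored, the 10% tolerance is the Kubernetes default rather than something set in the manifest, and the function names are hypothetical.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply the HPA ratio formula to one metric, skipping changes within tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_replicas(current_replicas, metrics,
                     min_replicas=3, max_replicas=20, max_pods_added=4):
    """metrics is a list of (current_value, target_value) pairs, one per HPA metric."""
    # With several metrics, the largest proposal wins.
    proposal = max(desired_for_metric(current_replicas, cur, tgt)
                   for cur, tgt in metrics)
    # scaleUp policy from the manifest: add at most 4 pods per period.
    proposal = min(proposal, current_replicas + max_pods_added)
    # Never leave the minReplicas..maxReplicas range.
    return max(min_replicas, min(max_replicas, proposal))


# 5 replicas, CPU at 90% (target 70%), 250 req/s per pod (target 100)
print(desired_replicas(5, [(90, 70), (250, 100)]))  # 9
```

The request-rate metric alone asks for 13 replicas, but the scale-up policy limits this step to 5 + 4 = 9; the remaining pods would be added in later periods if load stays high.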
Runbook Automation
Event-Driven Remediation
# Kubernetes Event-Driven Autoscaling (KEDA) with remediation
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: orders_queue_depth
        query: |
          sum(orders_pending_total{namespace="production"})
        threshold: "100"    # Target roughly 100 pending orders per replica
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: error_rate_high
        # The query returns the raw 5xx ratio. Filtering it with "> 0.05" against a
        # threshold of 1 would never request more than one replica.
        query: |
          sum(rate(http_requests_total{status=~"5..",app="order-processor"}[5m]))
          / sum(rate(http_requests_total{app="order-processor"}[5m]))
        threshold: "0.05"   # Requested replicas grow as the error ratio rises above 5%
Conclusion & Next Steps
AIOps and intelligent automation represent the next evolution of operations — moving from reactive incident response to proactive, predictive, and self-healing systems. The key is not to automate everything at once, but to start with the highest-impact, lowest-risk automations and expand systematically.
- Start with observability — You can't automate what you can't see. Implement comprehensive metrics, logs, and traces before adding ML.
- Reduce alert noise first — Intelligent grouping and correlation have immediate impact on on-call quality of life.
- Practice chaos regularly — Scheduled chaos experiments build confidence in recovery mechanisms before real failures occur.
- Automate incrementally — Begin with "suggest" mode (recommend actions), then "semi-auto" (human approval), then full automation for well-understood scenarios.
- Always have a kill switch — Every automated remediation needs a circuit breaker and escalation path.
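The last two points can be combined into one small control structure. The sketch below is a hypothetical illustration, not taken from any particular tool: a `Remediator` that runs in suggest, semi-auto, or full-auto mode, with an operator kill switch and a circuit breaker that escalates to a human after repeated failures. All names and the failure limit are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"      # only recommend the action
    SEMI_AUTO = "semi-auto"  # run once a human approves
    FULL_AUTO = "full-auto"  # run without approval


@dataclass
class Remediator:
    mode: Mode
    max_failures: int = 2        # breaker opens after this many consecutive failures
    kill_switch: bool = False    # operator-controlled global off switch
    consecutive_failures: int = 0
    log: list = field(default_factory=list)

    def handle(self, action, approved=False):
        """Decide what to do with a proposed fix; action() returns True on success."""
        if self.kill_switch or self.consecutive_failures >= self.max_failures:
            return self._record("escalate")
        if self.mode is Mode.SUGGEST:
            return self._record("suggested")
        if self.mode is Mode.SEMI_AUTO and not approved:
            return self._record("awaiting-approval")
        ok = action()
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self._record("remediated" if ok else "failed")

    def _record(self, outcome):
        self.log.append(outcome)
        return outcome


bot = Remediator(mode=Mode.FULL_AUTO)
print(bot.handle(lambda: True))   # remediated
print(bot.handle(lambda: False))  # failed
print(bot.handle(lambda: False))  # failed
print(bot.handle(lambda: True))   # escalate: breaker is open, a human takes over
```

Promoting a runbook from suggest to full-auto then becomes a one-field change, and the breaker ensures that an automation which keeps failing stops acting and pages someone instead.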
Next in the Series
In Part 16: Enterprise Platform Architecture, we'll explore designing platforms at organisational scale — multi-team governance, platform strategy, API management, compliance automation, and building a platform organisation.