Part 15: AIOps & Intelligent Automation

May 15, 2026 Wasil Zafar 30 min read

Harness ML-driven operations — anomaly detection, predictive alerting, chaos engineering, automated incident response, and self-healing infrastructure for resilient systems at scale.

Table of Contents

  1. Introduction to AIOps
  2. Anomaly Detection
  3. Predictive Alerting
  4. Chaos Engineering
  5. Self-Healing Infrastructure
  6. Runbook Automation
  7. Conclusion & Next Steps

What is AIOps?

AIOps (Artificial Intelligence for IT Operations) applies machine learning and data analytics to operations data — metrics, logs, traces, events — to automate detection, diagnosis, and resolution of operational issues. It evolves traditional monitoring from "alert when a threshold is crossed" to "detect anomalous patterns that humans would miss."

Think of traditional monitoring like a smoke detector — it triggers when smoke reaches a specific level. AIOps is like a fire prevention system that analyses electrical patterns, temperature trends, and humidity changes to predict where a fire might start before any smoke appears.

Key Insight: AIOps doesn't replace human operators — it amplifies them. By handling noise reduction (filtering 10,000 alerts to 3 actionable incidents), pattern recognition (correlating metrics across 200 microservices), and automated remediation (restarting crashed pods, scaling under load), AIOps frees engineers to focus on systemic improvements rather than firefighting.

The Three Pillars + Events

AIOps Data Sources
flowchart TD
    Metrics["Metrics<br/>(Prometheus)"] --> AIOps["AIOps Engine<br/>Correlation & ML"]
    Logs["Logs<br/>(Loki / ELK)"] --> AIOps
    Traces["Traces<br/>(Jaeger / Tempo)"] --> AIOps
    Events["Events<br/>(K8s / CloudWatch)"] --> AIOps
    AIOps --> Alert["Smart Alerts"]
    AIOps --> Root["Root Cause<br/>Analysis"]
    AIOps --> Auto["Auto<br/>Remediation"]
    style Metrics fill:#e8f4f4,stroke:#3B9797,color:#132440
    style Logs fill:#e8f4f4,stroke:#3B9797,color:#132440
    style Traces fill:#e8f4f4,stroke:#3B9797,color:#132440
    style Events fill:#e8f4f4,stroke:#3B9797,color:#132440
    style AIOps fill:#f0f4f8,stroke:#16476A,color:#132440
    style Alert fill:#fff5f5,stroke:#BF092F,color:#132440
    style Root fill:#f0f4f8,stroke:#16476A,color:#132440
    style Auto fill:#e8f4f4,stroke:#3B9797,color:#132440

Anomaly Detection

Traditional threshold-based alerting fails for dynamic systems. A request latency of 200ms might be normal at 2am but anomalous at 2pm. Anomaly detection uses statistical models and ML to learn "normal" patterns and flag deviations.
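To make the 2am versus 2pm point concrete, here is a minimal sketch of a time-aware baseline in Python. It assumes latency samples arrive as `(hour_of_day, latency_ms)` pairs; the function names and the sample history are hypothetical, and a production detector would need far more data and a more robust model than a per-hour mean and standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_hourly_baseline(samples):
    """Group (hour_of_day, latency_ms) samples into a per-hour (mean, stddev) baseline."""
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def zscore(baseline, hour, latency_ms):
    """Return how many standard deviations latency_ms is from that hour's mean."""
    mu, sigma = baseline[hour]
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is infinitely unusual.
        return 0.0 if latency_ms == mu else float("inf")
    return (latency_ms - mu) / sigma


def is_anomalous(baseline, hour, latency_ms, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the hourly mean."""
    return abs(zscore(baseline, hour, latency_ms)) > threshold


# Hypothetical history: slow batch traffic at 02:00, fast interactive traffic at 14:00.
history = [(2, v) for v in (190, 200, 210, 195, 205)] + \
          [(14, v) for v in (78, 80, 82, 79, 81)]
baseline = build_hourly_baseline(history)

print(is_anomalous(baseline, 2, 200))   # 200ms at 02:00 sits on the baseline mean
print(is_anomalous(baseline, 14, 200))  # 200ms at 14:00 is far outside the baseline
```

The same 200ms reading is treated as normal at 02:00 and anomalous at 14:00, which a single static threshold cannot express.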

Prometheus Recording Rules for Anomaly Detection

# prometheus-anomaly-rules.yaml
# Statistical anomaly detection using recording rules
groups:
  - name: anomaly-detection
    interval: 1m
    rules:
      # Calculate rolling average (7-day baseline)
      - record: http_request_duration_avg_7d
        expr: |
          avg_over_time(
            rate(http_request_duration_seconds_sum[5m])[7d:1h]
          ) /
          avg_over_time(
            rate(http_request_duration_seconds_count[5m])[7d:1h]
          )

      # Calculate rolling standard deviation
      # (the ratio is parenthesised so the subquery covers the whole expression)
      - record: http_request_duration_stddev_7d
        expr: |
          stddev_over_time(
            (
              rate(http_request_duration_seconds_sum[5m])
              / rate(http_request_duration_seconds_count[5m])
            )[7d:1h]
          )

      # Z-score anomaly detection (alert if > 3 standard deviations)
      - record: http_request_duration_zscore
        expr: |
          (
            rate(http_request_duration_seconds_sum[5m])
            / rate(http_request_duration_seconds_count[5m])
            - http_request_duration_avg_7d
          ) / http_request_duration_stddev_7d

  - name: anomaly-alerts
    rules:
      - alert: LatencyAnomaly
        expr: abs(http_request_duration_zscore) > 3
        for: 5m
        labels:
          severity: warning
          team: "{{ $labels.team }}"
        annotations:
          summary: "Latency anomaly detected for {{ $labels.service }}"
          description: |
            Z-score: {{ $value | printf "%.2f" }}
            Current latency deviates more than 3 standard deviations
            from the 7-day baseline. This may indicate a performance
            regression or unusual traffic pattern.

Predictive Alerting & Noise Reduction

Intelligent Alert Grouping

Alert fatigue is one of the most common operational failure modes. When on-call engineers receive 500 alerts in an hour, they stop reading them. Intelligent grouping correlates related alerts into a single incident.
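The core of grouping can be sketched in a few lines of Python. This is an illustration of the idea only, not how Alertmanager is implemented: the `group_alerts` helper, the choice of grouping labels, and the sample alerts are all assumptions made for the example.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "cluster", "service")  # assumed grouping labels


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alert label dicts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts that share every grouping label land in the same incident.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts: three pods of one service failing, plus one unrelated alert.
alerts = [
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "checkout", "pod": "checkout-a"},
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "checkout", "pod": "checkout-b"},
    {"alertname": "HighLatency", "cluster": "eu-1", "service": "checkout", "pod": "checkout-c"},
    {"alertname": "DiskFull", "cluster": "eu-1", "service": "search", "pod": "search-a"},
]

incidents = group_alerts(alerts)
for key, members in incidents.items():
    print(key, len(members))
```

Four raw alerts collapse into two incidents because the per-pod label is deliberately left out of the grouping key.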

# alertmanager-config.yaml — Intelligent grouping and routing
global:
  resolve_timeout: 5m

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # Wait 30s to batch related alerts
  group_interval: 5m     # Wait 5m before sending updates
  repeat_interval: 4h    # Re-send unresolved alerts every 4h

  routes:
    # Critical alerts — immediate PagerDuty escalation
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      continue: true

    # Silence known flappy alerts during maintenance
    - matchers:
        - alertname =~ "NodeNotReady|PodCrashLooping"
        - maintenance_window = "active"
      receiver: 'null'

    # Route by team label
    - match:
        team: payments
      receiver: 'team-payments-slack'
    - match:
        team: frontend
      receiver: 'team-frontend-slack'

inhibit_rules:
  # If a cluster is down, suppress all pod-level alerts
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - severity =~ "warning|info"
    equal: ['cluster']

receivers:
  - name: 'default-receiver'   # referenced by the root route; attach notifier configs here
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key_file: /etc/alertmanager/pagerduty-key
        severity: critical
  - name: 'team-payments-slack'
    slack_configs:
      - api_url_file: /etc/alertmanager/slack-webhook
        channel: '#payments-alerts'
  - name: 'team-frontend-slack'
    slack_configs:
      - api_url_file: /etc/alertmanager/slack-webhook
        channel: '#frontend-alerts'
  - name: 'null'

Case Study: LinkedIn

LinkedIn's Alert Noise Reduction

LinkedIn's infrastructure team was drowning in 15,000+ alerts per week across their 3,000+ microservices. Their AIOps initiative implemented three layers of noise reduction: (1) statistical deduplication that grouped 15,000 raw alerts into ~800 unique incidents, (2) ML-based correlation that linked related incidents across services (reducing to ~200 root causes), and (3) automated severity classification that correctly triaged 94% of incidents without human intervention. The result: on-call engineers went from 200+ daily pages to 15, with a 40% reduction in MTTR.

15K+ weekly alerts → 15 daily pages | 3,000+ services | 40% MTTR reduction

Chaos Engineering

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Instead of waiting for failures to happen, you deliberately inject failures to discover weaknesses before they cause outages.

Definition: Chaos engineering is not about breaking things randomly. It's a scientific approach: form a hypothesis about system behaviour under stress, design an experiment to test it, measure the impact, and use the results to improve resilience. Every experiment has a blast radius, duration, and abort condition.
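The structure described above (hypothesis, blast radius, duration, abort condition) can be sketched as a small Python harness. Everything here is hypothetical: the `ChaosExperiment` fields, the `run_experiment` helper, and the canned error rates stand in for real fault injection and monitoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    hypothesis: str
    blast_radius_pct: int               # share of instances the fault may touch
    duration_steps: int                 # how many observation steps to run
    abort_if: Callable[[float], bool]   # abort condition on the observed metric


def run_experiment(exp, inject_fault, rollback, observe):
    """Inject a fault, watch the metric, and roll back on abort or completion."""
    inject_fault(exp.blast_radius_pct)
    try:
        for step in range(exp.duration_steps):
            value = observe(step)
            if exp.abort_if(value):
                return f"aborted at step {step}: metric={value}"
        return "hypothesis held"
    finally:
        rollback()  # always undo the fault, even when aborting


# Hypothetical stand-ins for real fault injection and monitoring.
log = []
error_rates = [0.01, 0.02, 0.09, 0.01]

exp = ChaosExperiment(
    hypothesis="Error rate stays below 5% while 25% of pods are killed",
    blast_radius_pct=25,
    duration_steps=4,
    abort_if=lambda error_rate: error_rate > 0.05,
)
result = run_experiment(
    exp,
    inject_fault=lambda pct: log.append(f"inject {pct}%"),
    rollback=lambda: log.append("rollback"),
    observe=lambda step: error_rates[step],
)
print(result)
print(log)
```

Putting the rollback in a `finally` clause is the design choice that matters: the fault is removed whether the hypothesis holds, the abort condition fires, or the harness itself raises.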

Litmus ChaosEngine for Kubernetes

# Install Litmus Chaos (CNCF project)
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml

# Verify installation
kubectl get pods -n litmus
# NAME                                    READY   STATUS    RESTARTS   AGE
# litmus-server-xxx                       1/1     Running   0          1m
# chaos-operator-xxx                      1/1     Running   0          1m

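# Report success only once the pods are actually Ready
kubectl wait --for=condition=Ready pods --all -n litmus --timeout=180s \
  && echo "Litmus Chaos installed successfully"

# chaos-experiment-pod-delete.yaml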
# Experiment: Kill random pods and verify service recovery
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-service-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'         # Run for 60 seconds
            - name: CHAOS_INTERVAL
              value: '10'         # Seconds between successive pod-delete rounds
            - name: FORCE
              value: 'false'      # Graceful termination
            - name: PODS_AFFECTED_PERC
              value: '50'         # Kill 50% of pods
        probe:
          - name: health-check
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-service.production:8080/healthz
              method:
                get:
                  criteria: ==
                  responseCode: '200'
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              retry: 3
              interval: 10s

# chaos-experiment-network-latency.yaml
# Experiment: Inject network latency between services
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-gateway
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '120'
            - name: NETWORK_LATENCY
              value: '500'        # Add 500ms latency
            - name: JITTER
              value: '100'        # ±100ms jitter
            - name: DESTINATION_IPS
              value: '10.0.0.0/8' # Target internal traffic
            - name: CONTAINER_RUNTIME
              value: containerd

Self-Healing Infrastructure

Kubernetes Self-Healing Patterns

# PodDisruptionBudget — maintain availability during disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  minAvailable: 2    # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: payment-service

# HPA with custom metrics — scale based on business metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods               # needs a custom metrics API adapter (e.g. prometheus-adapter)
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
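
To show how an autoscaler turns metric readings into a replica count, here is a rough Python sketch of the ratio rule documented for the Horizontal Pod Autoscaler: each metric proposes `ceil(currentReplicas * currentValue / targetValue)`, the largest proposal wins, and the result is clamped to the min/max bounds. The sketch deliberately ignores tolerance, stabilization windows, and scaling policies, and the metric readings are made up.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value):
    """Core HPA ratio: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_replicas(current_replicas, metrics, min_replicas=3, max_replicas=20):
    """Take the largest per-metric proposal, then clamp to [min, max]."""
    proposal = max(
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics
    )
    return max(min_replicas, min(max_replicas, proposal))


# Hypothetical readings: (current, target) for CPU utilisation % and requests/s per pod.
print(desired_replicas(4, [(90, 70), (250, 100)]))    # request rate dominates
print(desired_replicas(4, [(20, 70), (30, 100)]))     # quiet period, floor applies
print(desired_replicas(10, [(95, 70), (400, 100)]))   # heavy load, ceiling applies
```

With the `behavior` block above, a large jump such as the first case would not happen in one step: the scale-up policy caps growth at 4 pods per 60-second period.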

Runbook Automation

Event-Driven Remediation

# Kubernetes Event-Driven Autoscaling (KEDA) with remediation
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: orders_queue_depth
        query: |
          sum(orders_pending_total{namespace="production"})
        threshold: "100"    # Scale up when queue > 100
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: error_rate_high
        query: |
          sum(rate(http_requests_total{status=~"5..",app="order-processor"}[5m]))
          / sum(rate(http_requests_total{app="order-processor"}[5m]))
        threshold: "0.05"   # Target error ratio of 5%; scale out as the ratio rises above it

Automation Safety: Every automated remediation must have a blast radius limit and a circuit breaker. If auto-remediation triggers more than 3 times in 10 minutes for the same incident, stop and escalate to a human. Automated healing that runs in a loop can amplify failures instead of fixing them: cascading restarts, runaway scaling, or resource exhaustion.
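The "3 times in 10 minutes" rule can be sketched as a sliding-window circuit breaker in Python. The `RemediationBreaker` class and the incident key format are hypothetical, and timestamps are passed in explicitly to keep the example deterministic; a real controller might use `time.monotonic()` and persist the state.

```python
from collections import defaultdict, deque


class RemediationBreaker:
    """Allow at most max_attempts remediations per incident within window_s seconds."""

    def __init__(self, max_attempts=3, window_s=600):
        self.max_attempts = max_attempts
        self.window_s = window_s
        self._attempts = defaultdict(deque)  # incident key -> recent attempt times

    def allow(self, incident_key, now):
        attempts = self._attempts[incident_key]
        # Forget attempts that have fallen out of the sliding window.
        while attempts and now - attempts[0] >= self.window_s:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False  # breaker open: stop and escalate to a human
        attempts.append(now)
        return True


breaker = RemediationBreaker()
decisions = [
    breaker.allow("payment-service/CrashLoop", t) for t in (0, 60, 120, 180, 700)
]
print(decisions)
```

The fourth attempt inside the window is refused, and remediation is allowed again only after earlier attempts age out. Rejected attempts are not recorded, so a noisy trigger cannot hold the breaker open indefinitely.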

Conclusion & Next Steps

AIOps and intelligent automation represent the next evolution of operations — moving from reactive incident response to proactive, predictive, and self-healing systems. The key is not to automate everything at once, but to start with the highest-impact, lowest-risk automations and expand systematically.

  • Start with observability — You can't automate what you can't see. Implement comprehensive metrics, logs, and traces before adding ML.
  • Reduce alert noise first — Intelligent grouping and correlation have immediate impact on on-call quality of life.
  • Practice chaos regularly — Scheduled chaos experiments build confidence in recovery mechanisms before real failures occur.
  • Automate incrementally — Begin with "suggest" mode (recommend actions), then "semi-auto" (human approval), then full automation for well-understood scenarios.
  • Always have a kill switch — Every automated remediation needs a circuit breaker and escalation path.
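
The incremental path in the list above (suggest, then semi-auto, then full automation, always behind a kill switch) can be sketched as a simple gate in Python. The `Mode` enum, the `handle` function, and the returned strings are illustrative assumptions, not part of any specific tool.

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"      # only recommend the action
    SEMI_AUTO = "semi-auto"  # run after a human approves
    FULL_AUTO = "full-auto"  # run without approval


def handle(action, mode, approved=False, kill_switch=False):
    """Decide what to do with a proposed remediation under the current mode."""
    if kill_switch:
        # The kill switch overrides every mode, including full automation.
        return f"escalate: automation disabled, page on-call about '{action}'"
    if mode is Mode.SUGGEST:
        return f"suggest: '{action}'"
    if mode is Mode.SEMI_AUTO and not approved:
        return f"pending approval: '{action}'"
    return f"execute: '{action}'"


print(handle("restart pod", Mode.SUGGEST))
print(handle("restart pod", Mode.SEMI_AUTO))
print(handle("restart pod", Mode.SEMI_AUTO, approved=True))
print(handle("restart pod", Mode.FULL_AUTO, kill_switch=True))
```

Keeping the mode as per-scenario configuration lets a team promote one well-understood remediation to full automation while everything else stays in suggest mode.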

Next in the Series

In Part 16: Enterprise Platform Architecture, we'll explore designing platforms at organisational scale — multi-team governance, platform strategy, API management, compliance automation, and building a platform organisation.