Prometheus Deep Dive Part 9: Systems Monitoring with the Node Exporter

Node Exporter Overview

The Prometheus Node Exporter exposes hardware and OS-level metrics from *nix kernels. It reads from /proc, /sys, and other kernel pseudo-filesystems to provide hundreds of metrics covering CPU, memory, disk, network, filesystem, and more.

Architecture & Deployment

# Kubernetes DaemonSet — runs on every node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9100'
    spec:
      hostPID: true
      hostNetwork: true    # Access host network metrics
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.1
          args:
            - '--path.rootfs=/host'
            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--collector.textfile.directory=/host/var/lib/node_exporter/textfile'
            - '--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)'
            - '--collector.netclass.ignored-devices=^(veth.*|docker.*|br-.*)$'
            - '--collector.systemd'
            - '--no-collector.mdadm'         # Disable unused collectors
            - '--no-collector.infiniband'
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
              mountPropagation: HostToContainer
          resources:
            limits:
              cpu: 250m
              memory: 180Mi
            requests:
              cpu: 100m
              memory: 128Mi
      volumes:
        - name: rootfs
          hostPath:
            path: /
      tolerations:
        - effect: NoSchedule
          operator: Exists

Enabling/Disabling Collectors


Default Collectors (Enabled)

Collector	Metrics Prefix	Source
cpu	`node_cpu_*`	/proc/stat
meminfo	`node_memory_*`	/proc/meminfo
diskstats	`node_disk_*`	/proc/diskstats
filesystem	`node_filesystem_*`	statfs()
netdev	`node_network_*`	/proc/net/dev
loadavg	`node_load*`	/proc/loadavg
textfile	(custom)	*.prom files
uname	`node_uname_info`	uname()
time	`node_time_*`	clock_gettime()
conntrack	`node_nf_conntrack*`	/proc/sys/net/netfilter
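Each collector is toggled with a `--collector.<name>` or `--no-collector.<name>` flag, as the DaemonSet args above do for systemd, mdadm, and infiniband. As a minimal sketch, the hypothetical helper `collector_flags` below (not part of Node Exporter) turns two space-separated lists into those flags:

```shell
#!/usr/bin/env bash
# Sketch: build Node Exporter collector flags from two lists.
# collector_flags is a hypothetical helper name used for illustration.
collector_flags() {
  local enable="$1" disable="$2" name
  # Collectors to switch on explicitly.
  for name in $enable; do
    printf '%s%s\n' '--collector.' "$name"
  done
  # Collectors to switch off.
  for name in $disable; do
    printf '%s%s\n' '--no-collector.' "$name"
  done
}

# Mirrors the DaemonSet: enable systemd, disable mdadm and infiniband.
collector_flags "systemd" "mdadm infiniband"
```

Another option for a minimal metric set is to pass `--collector.disable-defaults` and then enable only the collectors you need.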


CPU Collector

Key Metrics & Modes

The CPU collector exposes node_cpu_seconds_total, a counter of the cumulative seconds each logical CPU has spent in each mode:

# CPU modes exposed by node_cpu_seconds_total{mode="..."}
# user     — Time in user space (applications)
# system   — Time in kernel space (syscalls, drivers)
# idle     — Idle time (waiting for work)
# iowait   — Waiting for I/O completion
# irq      — Servicing hardware interrupts
# softirq  — Servicing software interrupts
# steal    — Time stolen by hypervisor (VMs)
# nice     — Low-priority user space processes
# guest    — Running virtual CPUs for guests
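Because the counter accumulates seconds, utilization comes from the difference between two samples divided by the time between them. The sketch below works through that arithmetic for a single core. `cpu_busy_percent` is a hypothetical helper and the sample values are made up. The real `rate()` function also handles counter resets and extrapolation, which this ignores.

```shell
#!/usr/bin/env bash
# Sketch: the arithmetic behind "100 - idle rate * 100" for one core.
# cpu_busy_percent is a hypothetical helper. Arguments: two samples of the
# idle-seconds counter and the number of seconds between them.
cpu_busy_percent() {
  local idle_start="$1" idle_end="$2" interval="$3"
  awk -v a="$idle_start" -v b="$idle_end" -v dt="$interval" \
    'BEGIN { printf "%.1f\n", 100 - ((b - a) / dt) * 100 }'
}

# A core that was idle for 240 of the last 300 seconds was 20% busy.
cpu_busy_percent 1000 1240 300
```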

Essential PromQL Queries

# Overall CPU utilization (all cores, all modes except idle)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-mode breakdown (useful for identifying bottleneck type)
avg by (instance, mode) (rate(node_cpu_seconds_total[5m])) * 100

# CPU saturation — load average vs CPU count
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

# iowait specifically (indicates disk bottleneck)
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100

# Steal time (VM neighbor noise / overcommit)
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100

# Number of CPUs per node
count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

Alerting Rules

# Production alerting rules for CPU
groups:
  - name: node_cpu_alerts
    rules:
      - alert: HighCpuUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage above 85% for 15 minutes (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: CpuSaturation
        expr: |
          node_load15 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "CPU saturated on {{ $labels.instance }}"
          description: "15-min load average is {{ $value | printf \"%.1f\" }}x the CPU count"

      - alert: HighStealTime
        expr: |
          avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High steal time on {{ $labels.instance }}"
          description: "{{ $value | printf \"%.1f\" }}% steal — noisy neighbor or overcommitted host"

Memory Collector

Key Metrics

# Memory metrics from /proc/meminfo
node_memory_MemTotal_bytes        # Total physical RAM
node_memory_MemFree_bytes         # Completely free (unused)
node_memory_MemAvailable_bytes    # Available for allocation (includes reclaimable)
node_memory_Buffers_bytes         # Disk buffer cache
node_memory_Cached_bytes          # Page cache
node_memory_SwapTotal_bytes       # Total swap space
node_memory_SwapFree_bytes        # Free swap
node_memory_Slab_bytes            # Kernel slab allocator
node_memory_SReclaimable_bytes    # Reclaimable slab memory
node_memory_CommitLimit_bytes     # Overcommit limit
node_memory_Committed_AS_bytes    # Memory committed by all processes

PromQL Patterns

# Actual memory usage (most accurate)
# Uses MemAvailable which accounts for reclaimable cache
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Used memory: total minus free, buffers, page cache, and reclaimable slab
node_memory_MemTotal_bytes
  - node_memory_MemFree_bytes
  - node_memory_Buffers_bytes
  - node_memory_Cached_bytes
  - node_memory_SReclaimable_bytes

# Swap usage (any swap usage may indicate memory pressure)
# Evaluates to NaN on hosts with no swap configured (SwapTotal is 0)
(1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100

# OOM kill rate in kills/sec (the oom_kill counter requires kernel 4.13+)
rate(node_vmstat_oom_kill[5m])

# Memory pressure — major page faults (require disk I/O)
rate(node_vmstat_pgmajfault[5m])
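To sanity-check the MemAvailable-based usage figure on a host, the same ratio can be computed directly from a meminfo-format file. This is a minimal sketch; `mem_used_percent` is a hypothetical helper. /proc/meminfo reports values in kB while the exporter converts them to bytes, but the ratio is the same either way.

```shell
#!/usr/bin/env bash
# Sketch: compute (1 - MemAvailable / MemTotal) * 100 from a meminfo-format file.
# mem_used_percent is a hypothetical helper; the path defaults to /proc/meminfo.
mem_used_percent() {
  local file="${1:-/proc/meminfo}"
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", (1 - avail / total) * 100
      else exit 1
    }
  ' "$file"
}

# Usage on a Linux host:
#   mem_used_percent
```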

# Memory alerting rules
groups:
  - name: node_memory_alerts
    rules:
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}"
          description: "Memory usage {{ $value | printf \"%.1f\" }}% — available: {{ with printf \"node_memory_MemAvailable_bytes{instance='%s'}\" $labels.instance | query }}{{ . | first | value | humanize1024 }}{{ end }}"

      - alert: SwapUsageHigh
        expr: |
          (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100 > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage on {{ $labels.instance }}"
          description: "{{ $value | printf \"%.0f\" }}% swap in use — memory pressure likely"

Disk & Filesystem Collectors

Disk I/O Metrics

# Disk I/O metrics from /proc/diskstats
node_disk_reads_completed_total      # Completed read operations
node_disk_writes_completed_total     # Completed write operations
node_disk_read_bytes_total           # Bytes read
node_disk_written_bytes_total        # Bytes written
node_disk_read_time_seconds_total    # Time spent reading
node_disk_write_time_seconds_total   # Time spent writing
node_disk_io_time_seconds_total      # Time spent doing I/O (utilization)
node_disk_io_time_weighted_seconds_total  # Weighted I/O time (queue depth)

# Disk utilization (% of time doing I/O)
rate(node_disk_io_time_seconds_total{device!~"dm-.*"}[5m]) * 100

# Average read latency in seconds per operation
# (use the write_time and writes_completed counters for write latency)
rate(node_disk_read_time_seconds_total[5m])
  / rate(node_disk_reads_completed_total[5m])

# IOPS
rate(node_disk_reads_completed_total[5m])
  + rate(node_disk_writes_completed_total[5m])

# Throughput (bytes/second)
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])

# Average queue depth (saturation indicator)
rate(node_disk_io_time_weighted_seconds_total[5m])
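The latency query divides time spent reading by reads completed. The sketch below does the same with two snapshots of /proc/diskstats, where field 3 is the device name, field 4 is reads completed, and field 7 is milliseconds spent reading. `avg_read_latency_ms` and the snapshot file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: average read latency between two /proc/diskstats snapshots.
# avg_read_latency_ms is a hypothetical helper.
# Fields used: $3 device name, $4 reads completed, $7 ms spent reading.
avg_read_latency_ms() {
  local dev="$1" before="$2" after="$3"
  awk -v dev="$dev" '
    $3 == dev && FNR == NR { r0 = $4; t0 = $7; next }   # first snapshot
    $3 == dev              { r1 = $4; t1 = $7 }         # second snapshot
    END {
      if (r1 - r0 <= 0) exit 1                          # no reads in the window
      printf "%.2f\n", (t1 - t0) / (r1 - r0)
    }
  ' "$before" "$after"
}

# Usage on a Linux host:
#   cat /proc/diskstats > before.txt; sleep 60; cat /proc/diskstats > after.txt
#   avg_read_latency_ms sda before.txt after.txt
```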

Filesystem Metrics

# Filesystem metrics from statfs()
node_filesystem_size_bytes          # Total filesystem size
node_filesystem_avail_bytes         # Available space (non-root)
node_filesystem_free_bytes          # Free space (includes root reserved)
node_filesystem_files               # Total inodes
node_filesystem_files_free          # Free inodes
node_filesystem_readonly            # Read-only flag

# Filesystem usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Predict when filesystem will be full (linear extrapolation)
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0

# Inode usage (often overlooked until 100%)
(1 - node_filesystem_files_free / node_filesystem_files) * 100
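predict_linear() fits a straight line to the samples in the range and extends it into the future. The sketch below shows the least-squares idea on a handful of samples. `predict_value` is a hypothetical helper, the sample values are made up, and it treats the last sample's timestamp as "now", whereas Prometheus extrapolates from the evaluation time.

```shell
#!/usr/bin/env bash
# Sketch: least-squares extrapolation in the spirit of predict_linear().
# predict_value is a hypothetical helper. It reads "timestamp value" pairs
# on stdin and prints the fitted value HORIZON seconds after the last sample.
predict_value() {
  local horizon="$1"
  awk -v h="$horizon" '
    NF == 2 {
      if (n == 0) t0 = $1          # shift timestamps to keep the sums small
      x = $1 - t0
      n++; sx += x; sy += $2; sxx += x * x; sxy += x * $2; last = x
    }
    END {
      denom = n * sxx - sx * sx
      if (n < 2 || denom == 0) exit 1
      slope = (n * sxy - sx * sy) / denom
      intercept = (sy - slope * sx) / n
      printf "%.0f\n", intercept + slope * (last + h)
    }
  '
}

# Available space drops by 10 units per hour; 10 hours ahead the fit goes negative.
printf '0 100\n3600 90\n7200 80\n' | predict_value 36000
```

A negative result corresponds to the `< 0` condition in the query: the fitted line crosses zero available bytes before the horizon.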

# Disk & filesystem alerting
groups:
  - name: node_disk_alerts
    rules:
      - alert: DiskWillFillIn24h
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
          and node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk filling on {{ $labels.instance }}:{{ $labels.mountpoint }}"
          description: "Filesystem {{ $labels.mountpoint }} predicted to fill within 24 hours"

      - alert: DiskSpaceCritical
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk 95%+ full on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: InodeExhaustion
        expr: |
          (1 - node_filesystem_files_free / node_filesystem_files) > 0.90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Inode exhaustion on {{ $labels.instance }}:{{ $labels.mountpoint }}"

Network Collector

Network Device Metrics

# Network interface metrics from /proc/net/dev
node_network_receive_bytes_total        # Bytes received
node_network_transmit_bytes_total       # Bytes transmitted
node_network_receive_packets_total      # Packets received
node_network_transmit_packets_total     # Packets transmitted
node_network_receive_errs_total         # Receive errors
node_network_transmit_errs_total        # Transmit errors
node_network_receive_drop_total         # Dropped incoming
node_network_transmit_drop_total        # Dropped outgoing

# Bandwidth utilization (bits/sec)
rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*"}[5m]) * 8
rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*"}[5m]) * 8

# Packet error rate
rate(node_network_receive_errs_total[5m])
  / rate(node_network_receive_packets_total[5m]) * 100

# Link speed and state (from the netclass collector, /sys/class/net)
node_network_speed_bytes    # Negotiated link speed in bytes/sec
node_network_up             # Interface operational state (1=up)

Conntrack & Sockets

# Connection tracking (critical for firewalls/load balancers)
node_nf_conntrack_entries            # Current tracked connections
node_nf_conntrack_entries_limit      # Maximum connections allowed

# Conntrack utilization (approaching 100% = dropped connections)
node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100

# TCP socket state (from /proc/net/sockstat)
node_sockstat_TCP_tw         # TIME_WAIT sockets
node_sockstat_TCP_alloc      # Allocated sockets
node_sockstat_sockets_used   # Total sockets in use
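The conntrack gauges mirror two kernel files, nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter. As a rough sketch for checking the utilization figure by hand, the hypothetical helper `conntrack_percent` reads both and prints the percentage:

```shell
#!/usr/bin/env bash
# Sketch: conntrack table utilization from the kernel's own counters.
# conntrack_percent is a hypothetical helper; the directory argument exists
# so the function can be pointed at test data.
conntrack_percent() {
  local dir="${1:-/proc/sys/net/netfilter}" count max
  count=$(cat "$dir/nf_conntrack_count") || return 1
  max=$(cat "$dir/nf_conntrack_max") || return 1
  awk -v c="$count" -v m="$max" \
    'BEGIN { if (m <= 0) exit 1; printf "%.1f\n", c / m * 100 }'
}

# Usage on a Linux host with the conntrack module loaded:
#   conntrack_percent
```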

Textfile Collector

The textfile collector reads .prom files from a configured directory, exposing their contents as Prometheus metrics. This is the primary mechanism for exposing custom metrics from cron jobs, scripts, or applications that can’t serve an HTTP endpoint.

Setup & Configuration

# Enable with directory flag
--collector.textfile.directory=/var/lib/node_exporter/textfile

# Create the directory
mkdir -p /var/lib/node_exporter/textfile

# Write metrics in Prometheus exposition format
# File MUST have .prom extension
cat > /var/lib/node_exporter/textfile/backup_status.prom << 'EOF'
# HELP backup_last_success_timestamp_seconds Unix timestamp of last successful backup
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds{job="database",target="postgres-main"} 1718452800
# HELP backup_size_bytes Size of last backup in bytes
# TYPE backup_size_bytes gauge
backup_size_bytes{job="database",target="postgres-main"} 5368709120
# HELP backup_duration_seconds Duration of last backup
# TYPE backup_duration_seconds gauge
backup_duration_seconds{job="database",target="postgres-main"} 342.5
EOF

Common Patterns

#!/bin/bash
# Run from cron every 5 minutes, for example via an entry in
# /etc/cron.d/node-exporter-textfile.
# Each file is written under a temporary name and renamed into place so the
# collector never reads a partially written file.
TEXTFILE_DIR=/var/lib/node_exporter/textfile

# SSL certificate expiry
CERT_EXPIRY=$(echo | openssl s_client -connect myapp.example.com:443 2>/dev/null | \
  openssl x509 -noout -enddate | cut -d= -f2)
CERT_EPOCH=$(date -d "${CERT_EXPIRY}" +%s)

cat > "${TEXTFILE_DIR}/ssl_expiry.prom.$$" << EOF
# HELP ssl_certificate_expiry_seconds Unix timestamp when cert expires
# TYPE ssl_certificate_expiry_seconds gauge
ssl_certificate_expiry_seconds{domain="myapp.example.com"} ${CERT_EPOCH}
EOF
mv "${TEXTFILE_DIR}/ssl_expiry.prom.$$" "${TEXTFILE_DIR}/ssl_expiry.prom"

# Package update count
UPDATES=$(apt list --upgradable 2>/dev/null | grep -c upgradable)
cat > "${TEXTFILE_DIR}/apt_updates.prom.$$" << EOF
# HELP node_apt_upgradable_packages Number of packages with available updates
# TYPE node_apt_upgradable_packages gauge
node_apt_upgradable_packages ${UPDATES}
EOF
mv "${TEXTFILE_DIR}/apt_updates.prom.$$" "${TEXTFILE_DIR}/apt_updates.prom"

# Custom application health (from script/API call)
HTTP_CODE=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health)
cat > "${TEXTFILE_DIR}/app_health.prom.$$" << EOF
# HELP app_health_check_status HTTP status code from health endpoint
# TYPE app_health_check_status gauge
app_health_check_status{app="my-service"} ${HTTP_CODE}
EOF
mv "${TEXTFILE_DIR}/app_health.prom.$$" "${TEXTFILE_DIR}/app_health.prom"

                            
Textfile Gotchas: Always write to a temp file and mv it into place atomically to avoid partial reads. Never put timestamps on textfile metrics; the collector does not support them, and Prometheus assigns the scrape time. The metric node_textfile_mtime_seconds reports when each file was last modified, so alert on a stale mtime rather than relying on the metric value alone.
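The write-then-rename advice can be wrapped in a small helper so every job publishes its file the same way. This is a sketch rather than an official pattern. `write_prom_atomic` is a hypothetical name, and it assumes the temp file can live in the textfile directory itself. The temp name does not end in .prom, so the collector ignores it until the rename.

```shell
#!/usr/bin/env bash
# Sketch: publish a .prom file atomically so the collector never sees a partial file.
# write_prom_atomic is a hypothetical helper. It reads metrics on stdin.
write_prom_atomic() {
  local target="$1" tmp
  # Create the temp file next to the target so mv is a rename on one filesystem.
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  # mktemp creates the file with mode 0600; relax it so the exporter user can read it.
  if cat > "$tmp" && chmod 0644 "$tmp" && mv "$tmp" "$target"; then
    return 0
  fi
  rm -f "$tmp"
  return 1
}

# Usage:
#   printf 'my_job_last_run_timestamp_seconds %s\n' "$(date +%s)" |
#     write_prom_atomic /var/lib/node_exporter/textfile/my_job.prom
```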
                        

Advanced Collectors

systemd Collector

# Enable systemd collector
--collector.systemd
--collector.systemd.unit-include="(nginx|postgresql|redis|docker)\.service"

# Metrics exposed:
node_systemd_unit_state{name="nginx.service", state="active"}    # 1 if in this state
node_systemd_unit_state{name="nginx.service", state="failed"}    # 1 if failed
node_systemd_timer_last_trigger_seconds                          # Last timer trigger time

# Alert on service failure
node_systemd_unit_state{state="failed"} == 1

Hardware (IPMI, hwmon, thermal)

# Hardware temperature monitoring
node_hwmon_temp_celsius                          # Temperature sensors
node_thermal_zone_temp                           # CPU thermal zones
node_cooling_device_cur_state                    # Cooling device state

# Power supply (laptop/UPS)
node_power_supply_energy_watthour
node_power_supply_online

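# IPMI (requires ipmitool + root access)
# Node Exporter has no built-in IPMI collector and no --collector.ipmi flag.
# Metrics like these typically come from a textfile collector script that wraps
# ipmitool; the separate ipmi_exporter is an alternative with its own metric names.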
node_ipmi_temperature_celsius{name="CPU1 Temp"}
node_ipmi_fan_speed_rpm{name="FAN1"}
node_ipmi_power_watts{name="System Board"}

Conclusion

                            
Node Exporter Best Practices:
Deploy as DaemonSet with hostNetwork: true and hostPID: true for complete visibility
Filter filesystem mounts — exclude tmpfs, overlay, and container-internal mounts
Use textfile collector for custom metrics (backup status, cert expiry, package updates)
Enable systemd collector to monitor critical services
Disable unused collectors to reduce scrape time and cardinality
Use recording rules for common dashboard queries (CPU %, memory %, disk predictions)
Alert on predictions (predict_linear) not just thresholds for disk and memory

                        
