CentOS Management and Troubleshooting - Alerting Options for System Issues and RAID Failures

Introduction: Proactive System Management

Running CentOS in production requires more than just installation and configuration. When critical issues arise—disk failures, RAID degradation, memory exhaustion, or service outages—you need immediate notification and clear troubleshooting procedures.

This guide covers comprehensive monitoring and alerting strategies for CentOS systems, with particular focus on RAID array health and hardware failures. We will explore multiple notification channels, from simple email alerts to integrated dashboard solutions.

Why Monitoring Matters

The Cost of Unnoticed Failures

Issue Type                   | Downtime Cost      | Detection Method
-----------------------------+--------------------+--------------------
Single disk failure (RAID 1) | Zero (if detected) | SMART monitoring
RAID degradation             | Hours to days      | Array scrubbing
Memory exhaustion            | Minutes            | Process monitoring
Full filesystem              | Variable           | Capacity alerts
Service crash                | Until restart      | Process supervision

Without monitoring, a single RAID 1 disk failure can go unnoticed while the array keeps running without redundancy. It escalates to data loss when the second disk fails, which can happen within weeks or months of the first.

Core Monitoring Components

1. RAID Array Monitoring (mdadm)

Linux software RAID requires active monitoring through the mdadm utility.

Checking RAID Status

# Overall RAID status
cat /proc/mdstat

# Detailed array information
mdadm --detail /dev/md0
mdadm --detail /dev/md1

# Check all arrays
mdadm --detail --scan

# Examine individual disks
mdadm --examine /dev/sda1
mdadm --examine /dev/sdb1
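
When reading /proc/mdstat, the bracketed field at the end of each array's block-count line shows one character per member: U for an active device and _ for a missing or failed one, so [UU] is healthy and [U_] is degraded. As a minimal sketch of scripting that check (the function name degraded_arrays and its optional file argument are illustrative assumptions, not part of mdadm):

```shell
# List md arrays whose member-status field contains "_" (a missing member).
# Reads /proc/mdstat by default; pass another file to check saved output.
degraded_arrays() {
    local mdstat="${1:-/proc/mdstat}"
    awk '
        /^md[^ ]* *:/ { array = $1 }          # remember the current array name
        /\[[U_]+\]/ {                         # line carrying the [UU] / [U_] field
            if (array != "" && match($0, /\[[U_]*_[U_]*\]/)) {
                print array
            }
        }
    ' "$mdstat"
}

# Typical use:
# degraded_arrays            # checks the live /proc/mdstat
# degraded_arrays saved.txt  # checks a saved copy
```

The function prints nothing when every array is healthy, which makes it easy to use in a conditional such as `[ -n "$(degraded_arrays)" ]`.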

Automated RAID Monitoring Script

Create /usr/local/bin/raid-check.sh:

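#!/bin/bash
# RAID health check script: alerts by email and Slack when an md array is degraded

MAILTO="admin@example.com"
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Capture the current RAID status
RAID_STATUS=$(cat /proc/mdstat)

# "mdadm --detail --scan" prints only ARRAY definitions, so query each array
# individually and keep its name, state line, and any failed or removed members
FAILED_DISKS=$(
    for MD in $(awk '/^md/ {print "/dev/" $1}' /proc/mdstat); do
        mdadm --detail "$MD" | grep -iE "^/dev/|state :|faulty|removed|degraded"
    done
)

# A "_" inside the member-status field (for example [U_] or [_U]) marks a missing member
if echo "$RAID_STATUS" | grep -Eq '\[[U_]*_[U_]*\]'; then
    SUBJECT="CRITICAL: RAID Degraded on $(hostname)"
    MESSAGE=$(printf 'RAID array has degraded. Check immediately.\n\nStatus:\n%s\n\nFailed details:\n%s' \
        "$RAID_STATUS" "$FAILED_DISKS")

    # Email alert
    printf '%s\n' "$MESSAGE" | mail -s "$SUBJECT" "$MAILTO"

    # Slack alert: escape backslashes and quotes and encode newlines as \n
    # so the multi-line status remains valid JSON
    PAYLOAD=$(printf '%s\n%s\n' "$SUBJECT" "$MESSAGE" | tr '\t' ' ' |
        sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' | awk '{printf "%s\\n", $0}')
    curl -fsS -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\":warning: $PAYLOAD\"}" \
        "$SLACK_WEBHOOK"

    # Exit with error code for monitoring systems
    exit 1
fi

# Log rebuild progress; mdstat reports it as "recovery = 12.6%" or "resync = 12.6%"
PROGRESS=$(echo "$RAID_STATUS" | grep -oE '(recovery|resync) *= *[0-9.]+%' | head -n 1)
if [ -n "$PROGRESS" ]; then
    echo "RAID array rebuilding: $PROGRESS" | logger -t raid-check
fi

echo "RAID check passed"
exit 0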

Make the script executable and schedule it to run every five minutes:

chmod +x /usr/local/bin/raid-check.sh
echo "*/5 * * * * root /usr/local/bin/raid-check.sh" >> /etc/crontab
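
Because the cron entry runs the check every five minutes, a degraded array would trigger a fresh email and Slack message on every run until the disk is replaced. One way to limit this is to alert only when the state changes. The sketch below assumes a hypothetical state file; the should_alert helper and the /var/run/raid-check.state path are illustrative and not part of mdadm or of raid-check.sh.

```shell
# Return success (0) only when the current state differs from the state
# recorded on the previous run, then record the current state.
# Usage: should_alert <state_file> <current_state>
should_alert() {
    local state_file="$1" current="$2" previous=""
    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    printf '%s\n' "$current" > "$state_file"
    [ "$current" != "$previous" ]
}

# Example use inside a check script (the path is an assumption):
# if should_alert /var/run/raid-check.state "degraded"; then
#     ...send the email and Slack notification...
# fi
```

Recording a second state such as "ok" on healthy runs means the same helper also fires once when the array recovers, which is a convenient all-clear signal.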

2. SMART Disk Health Monitoring

Monitor physical disk health before RAID failure occurs.

Installing smartmontools

sudo dnf install -y smartmontools
sudo systemctl enable smartd
sudo systemctl start smartd
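
Once smartmontools is installed, smartctl -H prints an overall health verdict for a disk. The sketch below reduces that output to a single word for use in scripts. The function name smart_verdict is an illustrative assumption, and the matched strings are the ones smartctl commonly prints for ATA and NVMe devices (PASSED) and SCSI devices (OK), so verify them against your own drives.

```shell
# Reduce "smartctl -H" output (read from stdin) to OK, FAILING, or UNKNOWN.
smart_verdict() {
    awk '
        /self-assessment test result:/ { verdict = ($NF == "PASSED") ? "OK" : "FAILING" }
        /SMART Health Status:/         { verdict = ($NF == "OK") ? "OK" : "FAILING" }
        END { print (verdict == "" ? "UNKNOWN" : verdict) }
    '
}

# Typical use (smartctl needs root):
# smartctl -H /dev/sda | smart_verdict
```

UNKNOWN covers cases where smartctl could not read the device at all, for example when it is run without root privileges.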

SMART Configuration

Edit /etc/smartd.conf:

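# Email notifications for errors, delivered through the custom handler script
DEFAULT -m admin@example.com -M exec /usr/local/bin/smart-alert.sh

# Monitor all disks
# -a: monitor all SMART attributes; -o on: automatic offline testing; -S on: attribute autosave
# -s: self-test schedule, short test daily at 02:00 and long test on Saturdays at 03:00
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03)

# NVMe drives (the ATA-only -o and -S directives are omitted)
/dev/nvme0 -a -m admin@example.com
/dev/nvme1 -a -m admin@example.com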

SMART alert handler (/usr/local/bin/smart-alert.sh, which must be executable):

#!/bin/bash
# SMART alert notification script

MESSAGE="$SMARTD_MESSAGE"
DISK="$SMARTD_DEVICE"

# Send to multiple channels
echo "$MESSAGE" | mail -s "SMART Alert: $DISK" admin@example.com
echo "$(date): SMART alert for $DISK: $MESSAGE" >> /var/log/smart-alerts.log

# Optional: Push notification via Pushover
curl -s \
--form-string "token=YOUR_APP_TOKEN" \
--form-string "user=YOUR_USER_KEY" \
--form-string "title=SMART Alert" \
--form-string "message=Disk $DISK: $MESSAGE" \
https://api.pushover.net/1/messages.json

3. System Resource Monitoring

Memory and CPU Alerts

#!/bin/bash
# /usr/local/bin/resource-check.sh

MEMORY_THRESHOLD=90   # percent of RAM in use
CPU_THRESHOLD=90      # percent of capacity per core, based on load average
DISK_THRESHOLD=85     # percent of filesystem capacity

# Check memory usage (used / total)
MEMORY_USAGE=$(free | awk '/^Mem:/ {printf "%.0f", $3 / $2 * 100.0}')
if [ "$MEMORY_USAGE" -gt "$MEMORY_THRESHOLD" ]; then
    echo "High memory usage: ${MEMORY_USAGE}%" | \
        mail -s "ALERT: Memory ${MEMORY_USAGE}% on $(hostname)" admin@example.com
fi

# Check CPU load (5-minute average, the second field of /proc/loadavg)
CPU_LOAD=$(awk '{print $2}' /proc/loadavg)
# Scale by 100 so the comparison can use integer arithmetic without bc
CPU_INT=$(awk '{printf "%.0f", $2 * 100}' /proc/loadavg)
if [ "$CPU_INT" -gt "$((CPU_THRESHOLD * $(nproc)))" ]; then
    echo "High CPU load: $CPU_LOAD" | \
        mail -s "ALERT: CPU Load $CPU_LOAD on $(hostname)" admin@example.com
fi

# Check disk usage on real block devices only; matching the device column
# skips tmpfs mounts such as /dev/shm that a plain grep for /dev/ would catch
df -P | awk 'NR > 1 && $1 ~ /^\/dev\// {sub(/%/, "", $5); print $1, $5}' |
while read -r DISK USAGE; do
    if [ "$USAGE" -gt "$DISK_THRESHOLD" ]; then
        echo "Disk $DISK at ${USAGE}% capacity" | \
            mail -s "ALERT: Disk Space ${USAGE}% on $(hostname)" admin@example.com
    fi
done
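
The load-average comparison is easier to reason about with a worked example: a load of 3.60 on a 4-core machine means the cores are about 90% occupied on average, while the same load on a single core means heavy queuing. A minimal sketch of that per-core normalization follows; the load_percent_per_core helper is hypothetical and introduced only for illustration.

```shell
# Print the load average as a rounded percentage of total core capacity.
# Usage: load_percent_per_core <load_average> <core_count>
load_percent_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.0f\n", load / cores * 100 }'
}

# Worked examples:
# load_percent_per_core 3.60 4    -> 90
# load_percent_per_core 8 4       -> 200 (twice as much work as the cores can run)
#
# Live value on the current host:
# load_percent_per_core "$(awk '{print $2}' /proc/loadavg)" "$(nproc)"
```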

Notification Boards and Dashboards

Option 1: Simple Web Dashboard

Create a status dashboard using a lightweight web server.

Installing and Configuring

sudo dnf install -y httpd php
sudo systemctl enable httpd
sudo systemctl start httpd

Create /var/www/html/status.php:

<!DOCTYPE html>
<html>
<head>
<title>CentOS System Status</title>
<style>
body { font-family: Arial, sans-serif; margin: 40px; }
.status-ok { color: #00aa00; }
.status-warning { color: #ff8800; }
.status-critical { color: #cc0000; }
pre { background: #f5f5f5; padding: 10px; border-left: 4px solid #ccc; }
</style>
</head>
<body>
<h1>CentOS System Status - <?php echo gethostname(); ?></h1>

<h2>RAID Status</h2>
<pre><?php echo htmlspecialchars(shell_exec('cat /proc/mdstat')); ?></pre>

<h2>Disk Usage</h2>
<pre><?php echo htmlspecialchars(shell_exec('df -h')); ?></pre>

<h2>Memory Status</h2>
<pre><?php echo htmlspecialchars(shell_exec('free -h')); ?></pre>

<h2>SMART Health</h2>
<pre><?php echo htmlspecialchars((string) shell_exec('smartctl --scan | while read -r disk _; do echo "=== $disk ==="; smartctl -H "$disk" 2>/dev/null || echo "N/A"; done')); ?></pre>

<p>Last updated: <?php echo date('Y-m-d H:i:s'); ?></p>
</body>
</html>

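Access via: http://your-server-ip/status.php

This page exposes system details to anyone who can reach it, so restrict access to trusted networks. Note also that smartctl needs root privileges, so the SMART section will not show health data while PHP runs as the unprivileged Apache user.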

Option 2: Cockpit Web Console

Red Hat’s Cockpit provides comprehensive system management with built-in alerting capabilities.

Installation

sudo dnf install -y cockpit cockpit-storaged cockpit-pcp
sudo systemctl enable --now cockpit.socket

Features

  • Real-time performance graphs
  • Storage management with RAID health
  • Service status and control
  • Terminal access
  • Log viewer with filtering

Access: https://your-server-ip:9090

Option 3: Integrated Monitoring with Prometheus + Grafana

For enterprise-grade monitoring, deploy