CentOS Management and Troubleshooting - Alerting Options for System Issues and RAID Failures

cybernative

2026-03-03

alerting, automation, centos, centos-stream, hardware-monitoring, monitoring, raid, server-management, system-administration, troubleshooting

Introduction: Proactive System Management

Running CentOS in production requires more than just installation and configuration. When critical issues arise—disk failures, RAID degradation, memory exhaustion, or service outages—you need immediate notification and clear troubleshooting procedures.

This guide covers comprehensive monitoring and alerting strategies for CentOS systems, with particular focus on RAID array health and hardware failures. We will explore multiple notification channels, from simple email alerts to integrated dashboard solutions.

Why Monitoring Matters

The Cost of Unnoticed Failures

Issue Type	Downtime Cost	Detection Method
Single disk failure (RAID 1)	Zero (if detected)	SMART monitoring
RAID degradation	Hours to days	Array scrubbing
Memory exhaustion	Minutes	Process monitoring
Full filesystem	Variable	Capacity alerts
Service crash	Until restart	Process supervision

Without monitoring, a single RAID 1 disk failure goes unnoticed and can escalate to data loss when the second disk fails, sometimes within weeks or months of the first.

Core Monitoring Components

1. RAID Array Monitoring (mdadm)

Linux software RAID requires active monitoring through the mdadm utility.

Checking RAID Status

# Overall RAID status
cat /proc/mdstat

# Detailed array information
mdadm --detail /dev/md0
mdadm --detail /dev/md1

# Check all arrays
mdadm --detail --scan

# Examine individual disks
mdadm --examine /dev/sda1
mdadm --examine /dev/sdb1
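When reading /proc/mdstat by eye, the member map at the end of each status line is the part to watch: [UU] means every member is up, while an underscore such as [U_] marks a missing member. A minimal sketch of extracting the affected array names follows; the function name degraded_arrays is illustrative, not part of mdadm.

```shell
#!/bin/bash
# List md arrays whose member map (e.g. [U_]) shows a missing member.
# Reads mdstat-formatted text from the file given as $1
# (defaults to /proc/mdstat). Hypothetical helper, not an mdadm feature.
degraded_arrays() {
    awk '
        # An array header looks like: "md0 : active raid1 sdb1[1] sda1[0]"
        /^md[0-9a-z_]* :/ { name = $1 }
        # The status line ends with a map such as [UU] or [U_]
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}

# Usage on a live system:
#   degraded_arrays
```

The pattern requires at least one underscore between the brackets, so healthy maps like [UU] and unrelated bracketed text like [12KB] are ignored.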

Automated RAID Monitoring Script

Create /usr/local/bin/raid-check.sh:

#!/bin/bash

# RAID health check script
MAILTO="admin@example.com"
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

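# Check RAID status
RAID_STATUS=$(cat /proc/mdstat)
# "mdadm --detail --scan" prints only ARRAY lines, so query each array for its state
FAILED_DISKS=$(mdadm --detail --scan | awk '/^ARRAY/ {print $2}' | xargs -r -n1 mdadm --detail | grep -iE "degraded|faulty|removed" || true)

# Check for issues: a "_" inside the member map (e.g. [U_]) marks a missing disk
if echo "$RAID_STATUS" | grep -qE '\[[U_]*_[U_]*\]'; then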
    SUBJECT="CRITICAL: RAID Degraded on $(hostname)"
    MESSAGE="RAID array has degraded. Check immediately.\n\nStatus:\n${RAID_STATUS}\n\nFailed details:\n${FAILED_DISKS}"
    
    # Email alert
    echo -e "$MESSAGE" | mail -s "$SUBJECT" "$MAILTO"
    
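    # Slack alert (one-line summary only: the raw mdstat output contains
    # newlines that would break the JSON payload; full details go by email)
    curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\":warning: $SUBJECT\"}" \
        "$SLACK_WEBHOOK"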
    
    # Exit with error code for monitoring systems
    exit 1
fi

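# Check for syncing/rebuilding
if echo "$RAID_STATUS" | grep -q "recovery\|resync"; then
    # /proc/mdstat prints progress as "recovery = 12.6%" (spaces around "=")
    echo "RAID array rebuilding: $(echo "$RAID_STATUS" | grep -oP '(recovery|resync)\s*=\s*\K[0-9.]+' | head -n1)%" | logger
fi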

echo "RAID check passed"
exit 0

Make executable and schedule:

chmod +x /usr/local/bin/raid-check.sh
echo "*/5 * * * * root /usr/local/bin/raid-check.sh" >> /etc/crontab
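With a five-minute schedule, a degraded array triggers a new email and Slack message on every run until it is repaired. One way to limit that is a small timestamp-based throttle. The sketch below is an assumption layered on top of the script, not part of it: the helper name should_alert and the state directory /var/tmp/alert-state are both hypothetical.

```shell
#!/bin/bash
# Return 0 (send the alert) if no alert for KEY was recorded within the
# last INTERVAL seconds; otherwise return 1 (stay quiet).
# State is one timestamp file per key under STATE_DIR (hypothetical path).
STATE_DIR="${STATE_DIR:-/var/tmp/alert-state}"

should_alert() {
    local key="$1" interval="${2:-3600}"
    local stamp="$STATE_DIR/$key.last" now last
    mkdir -p "$STATE_DIR"
    now=$(date +%s)
    # A missing stamp file counts as "never alerted"
    last=$(cat "$stamp" 2>/dev/null || echo 0)
    if [ $((now - last)) -ge "$interval" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}

# Usage inside raid-check.sh, around the mail and curl calls:
#   if should_alert raid-degraded 3600; then ...send notifications...; fi
```

Deleting the stamp file once the array is healthy again would make the next failure alert immediately.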

2. SMART Disk Health Monitoring

Monitor physical disk health before RAID failure occurs.

Installing smartmontools

sudo dnf install -y smartmontools
sudo systemctl enable smartd
sudo systemctl start smartd
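Once smartmontools is installed, smartctl -H /dev/sda gives a quick manual health verdict. For scripting, that output can be reduced to a single word. The following is a rough sketch that assumes the two result lines commonly printed by smartctl (the "self-assessment test result" line for ATA-style devices and "SMART Health Status" for SCSI/SAS); the function name smart_verdict is illustrative.

```shell
#!/bin/bash
# Reduce "smartctl -H" output (read from stdin) to a single word.
# Assumed input lines:
#   SMART overall-health self-assessment test result: PASSED
#   SMART Health Status: OK
# Prints UNKNOWN when neither line is present.
smart_verdict() {
    awk -F': *' '
        /self-assessment test result|SMART Health Status/ {
            print $2; found = 1; exit
        }
        END { if (!found) print "UNKNOWN" }
    '
}

# Usage (needs root):
#   smartctl -H /dev/sda | smart_verdict
```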

SMART Configuration

Edit /etc/smartd.conf:

# Email notifications for errors
DEFAULT -m admin@example.com -M exec /usr/local/bin/smart-alert.sh

# Monitor all disks
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03)

# NVMe drives (ATA-only options such as -o and -S do not apply)
/dev/nvme0 -a -m admin@example.com
/dev/nvme1 -a -m admin@example.com

Smart alert handler (/usr/local/bin/smart-alert.sh):

#!/bin/bash
# SMART alert notification script

MESSAGE="$SMARTD_MESSAGE"
DISK="$SMARTD_DEVICE"

# Send to multiple channels
echo "$MESSAGE" | mail -s "SMART Alert: $DISK" admin@example.com
echo "$(date): SMART alert for $DISK: $MESSAGE" >> /var/log/smart-alerts.log

# Optional: Push notification via Pushover
curl -s \
    --form-string "token=YOUR_APP_TOKEN" \
    --form-string "user=YOUR_USER_KEY" \
    --form-string "title=SMART Alert" \
    --form-string "message=Disk $DISK: $MESSAGE" \
    https://api.pushover.net/1/messages.json
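The handler relies on environment variables that smartd exports to a -M exec script, such as SMARTD_DEVICE, SMARTD_MESSAGE and SMARTD_FAILTYPE. Setting them by hand makes it possible to dry-run the message formatting without waiting for a real disk event. A minimal sketch, assuming a hypothetical helper named format_smart_alert:

```shell
#!/bin/bash
# Build a one-line alert from the variables smartd exports to a
# "-M exec" script. The defaults keep the function usable in a dry run
# where smartd is not the caller.
format_smart_alert() {
    local device="${SMARTD_DEVICE:-unknown-device}"
    local failtype="${SMARTD_FAILTYPE:-unknown}"
    local message="${SMARTD_MESSAGE:-no message supplied}"
    printf '[%s] %s (%s): %s\n' "$(uname -n)" "$device" "$failtype" "$message"
}

# Dry run from a shell:
#   SMARTD_DEVICE=/dev/sda SMARTD_FAILTYPE=EmailTest \
#       SMARTD_MESSAGE="dry run" format_smart_alert
```

The same approach works for the full handler: exporting the three variables and running /usr/local/bin/smart-alert.sh exercises the mail, log and Pushover paths end to end.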

3. System Resource Monitoring

Memory and CPU Alerts

#!/bin/bash
# /usr/local/bin/resource-check.sh

MEMORY_THRESHOLD=90
CPU_THRESHOLD=90
DISK_THRESHOLD=85

# Check memory usage
MEMORY_USAGE=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
if [ "$MEMORY_USAGE" -gt "$MEMORY_THRESHOLD" ]; then
    echo "High memory usage: ${MEMORY_USAGE}%" | \
        mail -s "ALERT: Memory ${MEMORY_USAGE}% on $(hostname)" admin@example.com
fi

# Check CPU load (5-minute average, compared against CPU_THRESHOLD percent per core)
CPU_LOAD=$(awk '{print $2}' /proc/loadavg)
CPU_INT=$(echo "$CPU_LOAD * 100 / 1" | bc)
if [ "$CPU_INT" -gt "$((CPU_THRESHOLD * $(nproc)))" ]; then
    echo "High CPU load: $CPU_LOAD" | \
        mail -s "ALERT: CPU Load $CPU_LOAD on $(hostname)" admin@example.com
fi

# Check disk usage (df -P keeps each filesystem on one line; matching on
# the device column skips tmpfs mounts such as /dev/shm)
df -P | awk '$1 ~ /^\/dev\// {print $1, $5}' | while read -r DISK USAGE; do
    USAGE=${USAGE%\%}
    if [ "$USAGE" -gt "$DISK_THRESHOLD" ]; then
        echo "Disk $DISK at ${USAGE}% capacity" | \
            mail -s "ALERT: Disk Space ${USAGE}% on $(hostname)" admin@example.com
    fi
done
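The memory check above counts the "used" column from free. An alternative is to base the percentage on MemAvailable in /proc/meminfo, the kernel's estimate of memory that can be given to new workloads without swapping, so reclaimable cache does not trip the alert. This is a sketch of that variant rather than a drop-in replacement; the function name mem_used_pct is an assumption.

```shell
#!/bin/bash
# Print used memory as an integer percentage, computed from MemTotal and
# MemAvailable in meminfo-formatted text ($1, default /proc/meminfo).
# Returns non-zero if MemTotal is missing.
mem_used_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total > 0) printf "%d\n", (total - avail) * 100 / total
            else exit 1
        }
    ' "${1:-/proc/meminfo}"
}

# Usage in resource-check.sh:
#   MEMORY_USAGE=$(mem_used_pct)
```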

Notification Boards and Dashboards

Option 1: Simple Web Dashboard

Create a status dashboard using a lightweight web server.

Installing and Configuring

sudo dnf install -y httpd php
sudo systemctl enable httpd
sudo systemctl start httpd

Create /var/www/html/status.php:

<!DOCTYPE html>
<html>
<head>
    <title>CentOS System Status</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 40px; }
        .status-ok { color: #00aa00; }
        .status-warning { color: #ff8800; }
        .status-critical { color: #cc0000; }
        pre { background: #f5f5f5; padding: 10px; border-left: 4px solid #ccc; }
    </style>
</head>
<body>
    <h1>CentOS System Status - <?php echo gethostname(); ?></h1>
    
    <h2>RAID Status</h2>
    <pre><?php echo htmlspecialchars(shell_exec('cat /proc/mdstat')); ?></pre>
    
    <h2>Disk Usage</h2>
    <pre><?php echo htmlspecialchars(shell_exec('df -h')); ?></pre>
    
    <h2>Memory Status</h2>
    <pre><?php echo htmlspecialchars(shell_exec('free -h')); ?></pre>
    
    <h2>SMART Health</h2>
    <pre><?php echo htmlspecialchars(shell_exec('smartctl --scan | while read line; do disk=$(echo $line | cut -d" " -f1); echo "=== $disk ==="; smartctl -H $disk 2>/dev/null || echo \"N/A\"; done')); ?></pre>
    
    <p>Last updated: <?php echo date('Y-m-d H:i:s'); ?></p>
</body>
</html>

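Access via: http://your-server-ip/status.php

Note that smartctl needs root privileges to query drives, so the SMART Health section will typically show N/A when the page runs as the web server user. The page also exposes system details, so restrict access to trusted networks.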
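Because the page shells out on every request and smartctl requires root, one alternative is to generate a plain-text snapshot from root's crontab and let httpd serve the file. The sketch below mirrors the four sections of the PHP page; the function names and the output path /var/www/html/status.txt are assumptions for illustration.

```shell
#!/bin/bash
# Print a titled section with a command's output, or "N/A" when the
# command is missing or fails.
section() {
    local title="$1"; shift
    echo "=== $title ==="
    "$@" 2>/dev/null || echo "N/A"
    echo
}

# Per-disk SMART health, failing cleanly when smartctl is unavailable.
smart_health() {
    command -v smartctl >/dev/null || return 1
    smartctl --scan | while read -r disk _; do
        echo "--- $disk ---"
        smartctl -H "$disk"
    done
}

# Assemble the same sections the PHP page shows.
render_status() {
    section "RAID Status"   cat /proc/mdstat
    section "Disk Usage"    df -h
    section "Memory Status" free -h
    section "SMART Health"  smart_health
    echo "Last updated: $(date '+%Y-%m-%d %H:%M:%S')"
}

# Usage from root's crontab (hypothetical output path):
#   render_status > /var/www/html/status.txt
```

Serving a static file also means the web server no longer needs permission to execute shell commands.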

Option 2: Cockpit Web Console (Recommended)

Red Hat’s Cockpit provides comprehensive system management with built-in alerting capabilities.

Installation

sudo dnf install -y cockpit cockpit-storaged cockpit-pcp
sudo systemctl enable --now cockpit.socket

Features

Real-time performance graphs
Storage management with RAID health
Service status and control
Terminal access
Log viewer with filtering

Access: https://your-server-ip:9090

Option 3: Integrated Monitoring with Prometheus + Grafana

For enterprise-grade monitoring, deploy