SuperBuilder Team

Build a Self-Healing Server with OpenClaw

Tags: openclaw, server automation, self-healing, devops, tutorial

What if your server could fix itself? Not just alert you when something goes wrong, but actually diagnose the problem, apply the right fix, and tell you about it afterward. With OpenClaw, you can build exactly that.

This tutorial walks you through creating a self-healing server setup where your AI agent monitors system health, detects common issues, applies fixes automatically, and sends you a report via email.

Self-healing server architecture

What We Are Building

A system where OpenClaw:

  1. Monitors server metrics every 5 minutes (disk, memory, CPU, services)
  2. Detects common problems (disk full, service crashed, high load, SSL expiry)
  3. Diagnoses the root cause using AI reasoning
  4. Fixes the issue automatically when safe
  5. Notifies you via email with a full report of what happened

The key principle: fix what is safe to fix, alert on everything else.

Prerequisites

Before you start, you will need:

- A Linux server (the commands below assume Ubuntu/Debian with systemd)
- OpenClaw installed and running on that server
- An Inbounter account and API key for email notifications
- Root or targeted sudo access for the openclaw user (see the FAQ below)

Step 1: Configure the SOUL.md

Add a server operations section to your SOUL.md:

## Server Operations Role

You are responsible for monitoring and maintaining this server. Your priorities:

1. Keep services running
2. Prevent disk and memory exhaustion
3. Maintain security (SSL certificates, updates)
4. Log everything you do

### Auto-Fix Rules (SAFE to do without asking)
- Clear apt/package manager cache when disk > 85%
- Delete old log files (> 30 days) when disk > 85%
- Restart a crashed service (max 3 times, then escalate)
- Renew SSL certificates via certbot
- Clear /tmp files older than 7 days

### Escalate Rules (DO NOT auto-fix, notify admin)
- Disk usage above 95% after cleanup
- Service crashes more than 3 times in 1 hour
- Unusual processes or network connections
- Failed security updates
- Any issue you are not confident about

### Notification Rules
- Auto-fixes: Send summary email via Inbounter
- Escalations: Send urgent email via Inbounter with "URGENT" in subject
- Weekly: Send a health summary every Sunday at 8 AM

### Safety Rules
- NEVER delete user data to free disk space
- NEVER modify firewall rules automatically
- NEVER update the kernel without admin approval
- NEVER restart the server (only restart individual services)
- Always log commands to /var/log/openclaw-ops.log before executing

Step 2: Create Health Check Scripts

Create a set of scripts that the agent can run to gather system information.

System Health Script

#!/bin/bash
# /home/openclaw/scripts/health-check.sh

echo "=== System Health Report ==="
echo "Date: $(date)"
echo ""

echo "=== Disk Usage ==="
df -h / /home /var /tmp 2>/dev/null | grep -v tmpfs
echo ""

echo "=== Memory ==="
free -h
echo ""

echo "=== CPU Load ==="
uptime
echo ""

echo "=== Top Processes (by memory) ==="
ps aux --sort=-%mem | head -6
echo ""

echo "=== Top Processes (by CPU) ==="
ps aux --sort=-%cpu | head -6
echo ""

echo "=== Services Status ==="
for svc in nginx postgresql redis openclaw; do
    if systemctl is-active --quiet "$svc" 2>/dev/null; then
        echo "  $svc: RUNNING"
    else
        echo "  $svc: STOPPED"
    fi
done
echo ""

echo "=== SSL Certificates ==="
for domain in $(ls /etc/letsencrypt/live/ 2>/dev/null); do
    expiry=$(openssl x509 -enddate -noout -in "/etc/letsencrypt/live/$domain/cert.pem" 2>/dev/null | cut -d= -f2)
    if [ -n "$expiry" ]; then
        echo "  $domain: expires $expiry"
    fi
done
echo ""

echo "=== Recent Errors (last 30 min) ==="
journalctl --since "30 minutes ago" --priority=err --no-pager | tail -10
echo ""

echo "=== Open Connections ==="
echo "  $(ss -tuln | grep -c LISTEN) listening ports"
echo ""

echo "=== Uptime ==="
uptime -p

Make it executable:

chmod +x /home/openclaw/scripts/health-check.sh

Health check script output example

Step 3: Set Up the Monitoring Cron Job

Schedule OpenClaw to run health checks automatically:

# Every 5 minutes: quick check
openclaw cron add --name "quick-check" "*/5 * * * *" \
  "Run /home/openclaw/scripts/health-check.sh and analyze the output.
   
   CHECK FOR:
   1. Disk usage above 85% on any partition
   2. Memory usage above 90%
   3. CPU load average above 4.0 (for a 4-core server)
   4. Any service in STOPPED state
   5. SSL certificates expiring within 14 days
   
   IF everything is normal: Do nothing. No notification needed.
   
   IF an issue is detected:
   - Apply auto-fix if it matches the Auto-Fix Rules in SOUL.md
   - Log the fix to /var/log/openclaw-ops.log
   - Send a summary email via Inbounter to admin@company.com
   
   IF the issue requires escalation:
   - Send an URGENT email via Inbounter to admin@company.com
   - Include the full health check output and your diagnosis"
# Every hour: deeper analysis
openclaw cron add --name "hourly-analysis" "0 * * * *" \
  "Run a deeper analysis:
   1. Check /var/log/syslog for unusual patterns
   2. Check for failed SSH login attempts
   3. Verify all critical ports are responding
   4. Check if any process is consuming excessive resources
   5. Verify backup job completed (check /var/log/backup.log)
   
   Only notify if something unusual is found."
# Weekly report
openclaw cron add --name "weekly-health-report" "0 8 * * 0" \
  "Generate a weekly server health report:
   - Uptime statistics
   - Average resource usage
   - Auto-fixes performed this week
   - Escalations raised
   - Disk usage trend
   - Top 5 resource-consuming processes (average)
   
   Send the report via Inbounter to admin@company.com with subject
   'Weekly Server Health Report - [date]'"

Step 4: Define Auto-Fix Procedures

Instruct the agent on specific remediation steps for common issues.

Disk Space Recovery

# Add to SOUL.md

### Disk Space Recovery Procedure
When disk usage exceeds 85%:

1. Check what is using space:
   `du -sh /var/log/* /tmp/* /var/cache/* | sort -rh | head -20`

2. Safe cleanup actions (in order):
   a. Clear apt cache: `sudo apt clean`
   b. Remove old kernels: `sudo apt autoremove -y`
   c. Clear journal logs: `sudo journalctl --vacuum-time=7d`
   d. Delete old log files: `sudo find /var/log -name "*.gz" -mtime +30 -delete`
   e. Clear /tmp: `sudo find /tmp -type f -mtime +7 -delete`
   f. Clear Docker unused images: `docker system prune -f` (if Docker is installed)

3. After cleanup, re-check disk usage.
4. If still above 90%, escalate to admin.
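The steps above can be collected into a single script for the agent to call. This is a minimal sketch, assuming an apt-based system and a hypothetical path /home/openclaw/scripts/disk-cleanup.sh; nothing destructive runs unless the script is invoked with --run:

```shell
#!/bin/bash
# /home/openclaw/scripts/disk-cleanup.sh (hypothetical path)
# Sketch of the disk space recovery procedure above. Cleanup only runs
# when invoked with --run, so the agent can check usage without side effects.
set -u

THRESHOLD=85

usage_pct() {
    # Print the integer usage percentage for a mount point, e.g. "87"
    df --output=pcent "$1" | tail -1 | tr -d ' %'
}

if [ "${1:-}" = "--run" ]; then
    before=$(usage_pct /var)
    if [ "$before" -gt "$THRESHOLD" ]; then
        # Safe cleanup actions, in the order listed above
        sudo apt clean
        sudo apt autoremove -y
        sudo journalctl --vacuum-time=7d
        sudo find /var/log -name "*.gz" -mtime +30 -delete
        sudo find /tmp -type f -mtime +7 -delete
        after=$(usage_pct /var)
        echo "Cleanup done: /var went from ${before}% to ${after}%"
        if [ "$after" -gt 90 ]; then
            echo "ESCALATE: still above 90% after cleanup"
        fi
    else
        echo "/var at ${before}% - no cleanup needed"
    fi
fi
```

The --run guard doubles as a crude dry-run mode: calling the script with no arguments exercises the usage check without touching anything.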

Service Restart

### Service Restart Procedure
When a service is in STOPPED state:

1. Check why it stopped: `journalctl -u [service] --since "1 hour ago" --no-pager | tail -30`
2. Attempt restart: `sudo systemctl restart [service]`
3. Wait 10 seconds, check status: `sudo systemctl status [service]`
4. If restart succeeds, log it and send notification
5. If restart fails:
   - Try once more after 30 seconds
   - If still failing, escalate to admin with the error logs
6. Track restart count: if a service restarts 3+ times in 1 hour, escalate

SSL Certificate Renewal

### SSL Certificate Renewal
When a certificate expires within 14 days:

1. Attempt renewal: `sudo certbot renew --cert-name [domain]`
2. If successful, reload nginx: `sudo systemctl reload nginx`
3. Verify: `echo | openssl s_client -servername [domain] -connect [domain]:443 2>/dev/null | openssl x509 -noout -enddate`
4. If renewal fails, escalate with the certbot error output
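The expiry check that decides when to attempt renewal can be sketched as follows. days_until and check_cert are hypothetical helpers, and the certificate paths follow the certbot layout used above:

```shell
#!/bin/bash
# Sketch: compute days until expiry from the string openssl prints,
# and only attempt renewal inside the 14-day window.

days_until() {
    # Days from now until a date string such as "Jun  1 12:00:00 2026 GMT"
    # (the format printed by `openssl x509 -enddate`).
    local expiry_epoch now_epoch
    expiry_epoch=$(date -d "$1" +%s) || return 1
    now_epoch=$(date +%s)
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

check_cert() {
    local domain="$1" enddate days
    enddate=$(openssl x509 -enddate -noout \
        -in "/etc/letsencrypt/live/$domain/cert.pem" | cut -d= -f2)
    days=$(days_until "$enddate") || return 1
    if [ "$days" -le 14 ]; then
        echo "$domain expires in $days days - attempting renewal"
        sudo certbot renew --cert-name "$domain" && sudo systemctl reload nginx
    else
        echo "$domain OK ($days days left)"
    fi
}
```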

Auto-fix decision tree

Step 5: Set Up Email Notifications via Inbounter

Configure your agent to send notifications through Inbounter:

# config.yaml
notifications:
  email:
    provider: "inbounter"
    api_key: "${INBOUNTER_API_KEY}"
    default_to: "admin@company.com"
    from: "server-agent@yourdomain.com"

The agent will use the email skill to send notifications. Example messages it will generate:

Auto-fix notification:

Subject: [Auto-Fix] Disk cleanup performed on production server

At 14:35 UTC, disk usage on /var reached 87%.

Actions taken:
- Cleared apt cache: freed 1.2 GB
- Removed old journals: freed 800 MB
- Deleted old log archives: freed 400 MB

Current disk usage: 62%

No further action needed.

Escalation notification:

Subject: [URGENT] PostgreSQL service crashed on production server

At 14:35 UTC, PostgreSQL was found in STOPPED state.

Diagnosis:
- Error log shows: "FATAL: could not map anonymous shared memory: Cannot allocate memory"
- System memory is at 94% usage
- Top process: java (PID 1234) using 6.2 GB RAM

Actions taken:
- Attempted restart: FAILED (same error)
- Second attempt after 30s: FAILED

Recommended actions:
1. Investigate the Java process (PID 1234) consuming excessive memory
2. Consider increasing server RAM or adding swap
3. Restart PostgreSQL after memory issue is resolved

Full health check output attached.

Email notification examples

Step 6: Logging and Audit Trail

Every action the agent takes should be logged:

# Create the ops log
sudo touch /var/log/openclaw-ops.log
sudo chown openclaw:openclaw /var/log/openclaw-ops.log

Add to SOUL.md:

### Logging
Before executing ANY command, log it:

echo "[$(date)] ACTION: [description] | CMD: [command]" >> /var/log/openclaw-ops.log

After execution, log the result:

echo "[$(date)] RESULT: [outcome]" >> /var/log/openclaw-ops.log

Example log entries:

[2026-04-05 14:35:12] ACTION: Disk cleanup - clearing apt cache | CMD: sudo apt clean
[2026-04-05 14:35:15] RESULT: Success - freed 1.2 GB
[2026-04-05 14:35:16] ACTION: Disk cleanup - clearing old journals | CMD: sudo journalctl --vacuum-time=7d
[2026-04-05 14:35:18] RESULT: Success - freed 800 MB
[2026-04-05 14:35:20] ACTION: Post-cleanup disk check | CMD: df -h /var
[2026-04-05 14:35:20] RESULT: /var at 62% - within acceptable range
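These two logging rules can be wrapped in a small helper that your scripts source, so the before/after entries are never forgotten. A sketch; log_action, log_result, and run_logged are illustrative names, not part of OpenClaw:

```shell
#!/bin/bash
# Sketch: log every command before and after execution, matching the
# log format shown above.
OPS_LOG="${OPS_LOG:-/var/log/openclaw-ops.log}"

log_action() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] ACTION: $1 | CMD: $2" >> "$OPS_LOG"
}

log_result() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] RESULT: $1" >> "$OPS_LOG"
}

run_logged() {
    # Usage: run_logged "description" command [args...]
    local desc="$1"; shift
    log_action "$desc" "$*"
    if "$@"; then
        log_result "Success"
    else
        log_result "FAILED (exit $?)"
        return 1
    fi
}
```

For example, `run_logged "Disk cleanup - clearing apt cache" sudo apt clean` produces the ACTION/RESULT pair shown above.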

Step 7: Testing Your Setup

Simulate Disk Full

# Create a large temporary file
dd if=/dev/zero of=/tmp/test-fill bs=1M count=5000

# Wait for the next health check (or trigger manually)
openclaw cron run --name "quick-check"

# Verify the agent detected and cleaned up
cat /var/log/openclaw-ops.log | tail -10

# Clean up test file
rm /tmp/test-fill

Simulate Service Crash

# Stop a non-critical service
sudo systemctl stop redis

# Trigger health check
openclaw cron run --name "quick-check"

# Check if it was restarted
sudo systemctl status redis

Verify Email Notifications

# Trigger a test notification
openclaw run "Send a test notification email via Inbounter to admin@company.com 
  with subject 'Test: Self-Healing Server Alert' and body 
  'This is a test notification from your self-healing server setup.'"

Advanced: Multi-Server Monitoring

If you manage multiple servers, your OpenClaw agent can monitor them remotely via SSH:

#!/bin/bash
# /home/openclaw/scripts/remote-health.sh

SERVERS=("web1:192.168.1.10" "web2:192.168.1.11" "db1:192.168.1.20")

for entry in "${SERVERS[@]}"; do
    name="${entry%%:*}"
    ip="${entry##*:}"
    echo "=== $name ($ip) ==="
    if ! ssh -o ConnectTimeout=5 "openclaw@$ip" '/home/openclaw/scripts/health-check.sh' 2>/dev/null; then
        echo "  CONNECTION FAILED"
    fi
    echo ""
done

Then schedule it:

openclaw cron add --name "multi-server-check" "*/10 * * * *" \
  "Run /home/openclaw/scripts/remote-health.sh and analyze all servers.
   Report any issues found, specifying which server has the problem."

Multi-server monitoring setup

Safety Guardrails

Self-healing is powerful but dangerous if misconfigured. Implement these safeguards:

1. Rate Limit Auto-Fixes

# SOUL.md
### Rate Limits
- Maximum 5 auto-fix actions per hour
- Maximum 3 service restarts per service per hour
- If limits exceeded, stop auto-fixing and escalate everything
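One way to enforce the hourly cap is to count recent ACTION entries in the ops log before each fix. A sketch, assuming the [YYYY-mm-dd HH:MM:SS] timestamp format shown in Step 6; fixes_last_hour is an illustrative helper name:

```shell
#!/bin/bash
# Sketch: rate-limit auto-fixes by counting ACTION log entries
# from the last hour.
OPS_LOG="${OPS_LOG:-/var/log/openclaw-ops.log}"

fixes_last_hour() {
    local cutoff
    cutoff=$(date -d '1 hour ago' '+%Y-%m-%d %H:%M:%S')
    [ -r "$OPS_LOG" ] || { echo 0; return; }
    awk -v cutoff="$cutoff" '
        /ACTION:/ { if (substr($0, 2, 19) >= cutoff) n++ }
        END { print n + 0 }
    ' "$OPS_LOG"
}

if [ "$(fixes_last_hour)" -ge 5 ]; then
    echo "Rate limit hit - escalating instead of auto-fixing" >&2
fi
```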

2. Dry Run Mode

Test your setup in dry-run mode first:

# config.yaml
ops:
  dry_run: true  # Log what would be done without executing

3. Kill Switch

If the agent starts causing problems, stop it immediately:

openclaw cron pause --all
openclaw stop

4. Undo Log

Log enough information to undo each action if needed:

# SOUL.md
### Undo Logging
For each auto-fix action, also log the undo command:

echo "[$(date)] UNDO: [undo command]" >> /var/log/openclaw-ops.log

Frequently Asked Questions

Is it safe to give an AI agent sudo access?

Use targeted sudoers rules instead of full sudo:

# /etc/sudoers.d/openclaw
openclaw ALL=(ALL) NOPASSWD: /usr/bin/apt clean
openclaw ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nginx
openclaw ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart postgresql
openclaw ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart redis
openclaw ALL=(ALL) NOPASSWD: /usr/bin/journalctl --vacuum-*
openclaw ALL=(ALL) NOPASSWD: /usr/bin/certbot renew *

How do I prevent the agent from making things worse?

Strict SOUL.md rules, rate limits, and a conservative auto-fix list. Start with only disk cleanup and service restarts. Add more fixes gradually as you build confidence.

Can I use this for production servers?

Yes, with caution. Start with monitoring-only (no auto-fix) for 2-4 weeks. Review the alerts. Then enable auto-fix for the most common, safest fixes.

How much does this cost in API tokens?

A health check every 5 minutes uses approximately 3,000 tokens per check. At Claude Sonnet pricing, that is about $3-5/month. Most checks will find nothing wrong and use fewer tokens.

Can I combine this with existing monitoring (Datadog, Grafana)?

Yes. Use OpenClaw as the "intelligent response layer" that receives alerts from your existing monitoring and decides what to do. Forward Grafana alerts to OpenClaw via webhook.

What if the agent itself goes down?

Use an external health check service (UptimeRobot, Healthchecks.io) to monitor the OpenClaw health endpoint. If it goes down, you get notified independently.
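As a sketch of that external watchdog, a plain cron heartbeat also works: ping your check's unique URL every few minutes and let the service alert you when pings stop. hc-ping.com is Healthchecks.io's ping endpoint; the file path and UUID below are placeholders:

```shell
# /etc/cron.d/openclaw-heartbeat (hypothetical file)
# Ping the external monitor every 5 minutes; if pings stop because the
# server or agent is down, Healthchecks.io notifies you independently.
*/5 * * * * openclaw curl -fsS -m 10 --retry 3 https://hc-ping.com/your-check-uuid > /dev/null
```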
