SuperBuilder Team

How to Monitor OpenClaw Agent Performance

openclaw · monitoring · performance · grafana · logging · devops

A running OpenClaw agent is not a "set and forget" system. Skills break, API keys expire, conversations fail silently, and token costs creep up. Proper monitoring catches these issues before they affect your experience. This guide covers every aspect of monitoring OpenClaw.

OpenClaw monitoring architecture

Built-In Monitoring

openclaw status

The quickest health check:

openclaw status

# Output:
# OpenClaw v0.9.6
# Status:        running
# Uptime:        14d 3h 22m
# Model:         claude-sonnet-4-20250514
# Memory DB:     82 MB (1,234 entries)
# Active channels: telegram (connected), email (connected)
# Skills:        12 installed, 12 healthy
# Cron jobs:     5 active, 0 paused
# Last message:  2 minutes ago
# CPU:           3.2%
# Memory:        312 MB

openclaw logs

Real-time log streaming:

# Follow all logs
openclaw logs -f

# Filter by component
openclaw logs --filter "channel" --since 1h
openclaw logs --filter "skill" --since 1h
openclaw logs --filter "error" --since 24h
openclaw logs --filter "llm" --since 1h

# Show only errors and warnings
openclaw logs --level warn --since 24h

openclaw doctor

A comprehensive diagnostic check:

openclaw doctor

# Output:
# [PASS] Node.js version: 20.11.0
# [PASS] OpenClaw version: 0.9.6 (latest)
# [PASS] Database: healthy, 82 MB
# [PASS] Anthropic API: connected, key valid
# [PASS] Telegram channel: connected
# [WARN] Email channel: webhook URL returns 404
# [PASS] Skills: 12/12 healthy
# [PASS] Cron: service running, 5 jobs active
# [PASS] Disk space: 12 GB free (60% available)
# [WARN] Memory usage: 312 MB (approaching 512 MB limit)
#
# 2 warnings found. Run `openclaw doctor --fix` to attempt auto-repair.

openclaw doctor output example

Key Metrics to Track

1. Response Time

How long does it take from receiving a message to sending a response?

openclaw metrics response-time --period 24h

# Output:
# Average:  4.2s
# P50:      3.1s
# P90:      8.4s
# P99:      15.2s
# Min:      1.8s
# Max:      22.1s

What to watch for: a rising P99, or a widening gap between P50 and P99. That gap usually means intermittent slowdowns (a flaky skill or API throttling) rather than uniformly slow responses.

2. Success Rate

What percentage of messages get a successful response?

openclaw metrics success-rate --period 7d

# Output:
# Total messages:  342
# Successful:      331 (96.8%)
# Failed:          11 (3.2%)
#
# Failure breakdown:
#   API rate limit:    5
#   Skill error:       3
#   Timeout:           2
#   Parse error:       1

Target: 95%+ success rate. Below 90% indicates a systemic issue.
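The 95% threshold is easy to script against. A minimal sketch, with sample counts hardcoded for illustration (in practice you would parse them from the `openclaw metrics success-rate` output):

```shell
#!/bin/bash
# check-success-rate.sh -- flag when the success rate drops below a threshold.
# Sample counts are hardcoded for illustration.
TOTAL=342
SUCCESS=331
THRESHOLD=95

# awk handles the floating-point math that plain bash cannot
RATE=$(awk -v s="$SUCCESS" -v t="$TOTAL" 'BEGIN { printf "%.1f", s / t * 100 }')

if awk -v r="$RATE" -v th="$THRESHOLD" 'BEGIN { exit !(r < th) }'; then
  echo "ALERT: success rate ${RATE}% is below ${THRESHOLD}%"
else
  echo "OK: success rate ${RATE}%"
fi
```

The same pattern works for any threshold check: compute the number with awk, then branch on it.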

3. Token Usage

Track daily token consumption to manage costs:

openclaw metrics tokens --period 30d --daily

# Output:
# Date        Input Tokens  Output Tokens  Total     Cost
# 2026-04-05  45,230        12,340         57,570    $0.22
# 2026-04-04  52,100        15,890         67,990    $0.26
# 2026-04-03  38,900        10,200         49,100    $0.19
# ...
# Monthly:    1,234,567     345,678        1,580,245 $6.04
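To sanity-check the reported costs, multiply token counts by your provider's per-million-token rates. A sketch with placeholder rates and sample counts (substitute your provider's current pricing; these rates are illustrative, not the ones behind the table above):

```shell
#!/bin/bash
# estimate-cost.sh -- rough cost estimate from token counts.
# Token counts and rates below are placeholders.
INPUT_TOKENS=50000
OUTPUT_TOKENS=10000
INPUT_RATE=3.00    # $ per million input tokens (placeholder)
OUTPUT_RATE=15.00  # $ per million output tokens (placeholder)

COST=$(awk -v i="$INPUT_TOKENS" -v o="$OUTPUT_TOKENS" \
           -v ri="$INPUT_RATE" -v ro="$OUTPUT_RATE" \
           'BEGIN { printf "%.2f", (i * ri + o * ro) / 1000000 }')
echo "Estimated daily cost: \$${COST}"
```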

4. Skill Health

Monitor which skills are succeeding and failing:

openclaw metrics skills --period 7d

# Output:
# Skill              Calls  Success  Failed  Avg Time
# web-search         45     44       1       2.1s
# file-manager       32     32       0       0.3s
# email-send         18     17       1       1.5s
# image-gen          8      6        2       8.4s
# shell-exec         156    154      2       0.8s

5. Channel Uptime

Track connectivity for each messaging channel:

openclaw metrics channels --period 7d

# Output:
# Channel     Uptime    Disconnects  Avg Reconnect
# telegram    99.8%     1            3.2s
# email       99.9%     0            N/A
# whatsapp    97.2%     4            12.4s

Key metrics dashboard

Setting Up Grafana Dashboards

For visual monitoring, export OpenClaw metrics to Grafana via Prometheus.

Step 1: Enable Metrics Export

# config.yaml
metrics:
  enabled: true
  exporter: "prometheus"
  port: 9090
  path: "/metrics"

Step 2: Configure Prometheus

# prometheus.yml
scrape_configs:
  - job_name: 'openclaw'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']

Step 3: Create Grafana Dashboard

Import the community OpenClaw dashboard or create your own with these panels:

Response Time (Time Series)

histogram_quantile(0.95, rate(openclaw_response_duration_seconds_bucket[5m]))

Success Rate (Gauge)

rate(openclaw_messages_success_total[1h]) / rate(openclaw_messages_total[1h]) * 100

Token Usage (Bar Chart)

increase(openclaw_tokens_total[1d])

Active Channels (Status Panel)

openclaw_channel_connected
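You can also alert from Prometheus itself rather than relying on OpenClaw's built-in alerts. A sketch of a rule file, assuming the same metric names as the panels above:

```yaml
# alert-rules.yml -- sketch only; metric names assumed from the panels above
groups:
  - name: openclaw
    rules:
      - alert: OpenClawLowSuccessRate
        expr: |
          rate(openclaw_messages_success_total[1h])
            / rate(openclaw_messages_total[1h]) < 0.95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "OpenClaw success rate below 95% for 15 minutes"
```

Load it with a `rule_files` entry in prometheus.yml.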

Docker Compose with Monitoring Stack

version: "3.8"
services:
  openclaw:
    image: openclaw/openclaw:latest
    ports:
      - "127.0.0.1:3080:3080"
      - "127.0.0.1:9090:9090"
    volumes:
      - ./config:/app/config
      - ./data:/app/data

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "127.0.0.1:9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "127.0.0.1:3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

Grafana dashboard example

Health Checks

HTTP Health Endpoint

OpenClaw exposes a health check endpoint:

curl http://localhost:3080/health

# Response:
# {
#   "status": "healthy",
#   "uptime": 1234567,
#   "version": "0.9.6",
#   "components": {
#     "database": "healthy",
#     "llm_api": "healthy",
#     "channels": {
#       "telegram": "connected",
#       "email": "connected"
#     }
#   }
# }

Automated Health Monitoring

Use a simple script to check health and alert on failures:

#!/bin/bash
# health-check.sh

HEALTH_URL="http://localhost:3080/health"
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")

if [ "$RESPONSE" != "200" ]; then
    # Alert via Inbounter email
    curl -X POST https://api.inbounter.com/v1/email/send \
      -H "Authorization: Bearer ${INBOUNTER_API_KEY}" \
      -H "Content-Type: application/json" \
      -d '{
        "to": "admin@company.com",
        "subject": "OpenClaw Health Check Failed",
        "body": "Health check returned HTTP '"$RESPONSE"' at '"$(date)"'"
      }'
fi

Schedule this with system cron:

# Check every 5 minutes
*/5 * * * * /home/user/scripts/health-check.sh

Heartbeat Monitoring

For a more robust approach, set up a heartbeat that alerts you when it stops:

openclaw cron add --name "heartbeat" "*/5 * * * *" \
  "Send a heartbeat ping. This confirms I am running. 
   If this message does not arrive within 10 minutes, 
   something is wrong."

Configure Inbounter to expect the heartbeat and alert you if it misses a window.
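If you would rather not depend on an external service, the same idea works locally as a dead man's switch: one job touches a file on each heartbeat, and a second job alerts when the file goes stale. A minimal sketch (the file path and the alert action are placeholders):

```shell
#!/bin/bash
# heartbeat-check.sh -- dead man's switch sketch. A separate heartbeat job
# should `touch` $HEARTBEAT_FILE; this script alerts when the file goes stale.
HEARTBEAT_FILE="${HEARTBEAT_FILE:-/tmp/openclaw-heartbeat}"
MAX_AGE=600  # seconds (10 minutes)

touch "$HEARTBEAT_FILE"  # demo only: simulates a fresh heartbeat; remove in production

NOW=$(date +%s)
# GNU stat (-c) with a BSD/macOS fallback (-f)
LAST=$(stat -c %Y "$HEARTBEAT_FILE" 2>/dev/null || stat -f %m "$HEARTBEAT_FILE")
AGE=$(( NOW - LAST ))

if [ "$AGE" -gt "$MAX_AGE" ]; then
  echo "ALERT: no heartbeat for ${AGE}s"  # replace with an email/SMS call
else
  echo "OK: last heartbeat ${AGE}s ago"
fi
```

Schedule it from system cron, so it keeps running even if the agent's own scheduler is down.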

Alerting

Built-In Alerts

alerts:
  - name: "high-error-rate"
    condition: "error_rate > 0.10"
    period: "1h"
    channels: ["telegram"]
    message: "Error rate above 10% in the last hour"
  
  - name: "high-latency"
    condition: "p95_latency > 15"
    period: "15m"
    channels: ["telegram", "email"]
    message: "P95 response time above 15 seconds"
  
  - name: "budget-warning"
    condition: "daily_cost > 5.00"
    period: "1d"
    channels: ["email"]
    message: "Daily API cost exceeded $5.00"
  
  - name: "channel-disconnect"
    condition: "channel_disconnected"
    channels: ["email"]
    message: "Channel {{channel}} disconnected"

Email and SMS Alerts via Inbounter

For critical alerts, email or SMS is more reliable than Telegram (which itself might be the failing channel):

alerts:
  providers:
    inbounter:
      api_key: "${INBOUNTER_API_KEY}"
      email: "admin@company.com"
      sms: "+1234567890"  # For critical alerts only
  
  routing:
    warning: ["telegram"]
    error: ["telegram", "inbounter_email"]
    critical: ["telegram", "inbounter_email", "inbounter_sms"]

Alerting flow diagram

Log Management

Log Rotation

Prevent logs from filling your disk:

logging:
  file: "/var/log/openclaw/openclaw.log"
  max_size: "50MB"
  max_files: 10
  compress: true

Or use logrotate on Linux:

# /etc/logrotate.d/openclaw
/home/openclaw/.openclaw/logs/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    copytruncate
}

Structured Logging

Enable JSON-formatted logs for easier parsing:

logging:
  format: "json"
  fields:
    - timestamp
    - level
    - component
    - user
    - message
    - duration

This enables integration with log aggregation tools (Loki, ELK stack, Datadog).

Log Analysis

Quick commands for common log analysis:

# Most common errors today
openclaw logs --level error --since 24h --format json | \
  jq -r '.message' | sort | uniq -c | sort -rn | head -10

# Slowest responses
openclaw logs --filter "response_time" --since 24h --format json | \
  jq -r '. | select(.duration > 10) | "\(.timestamp) \(.duration)s \(.message)"'

# Failed skills
openclaw logs --filter "skill_error" --since 7d --format json | \
  jq -r '.skill_name' | sort | uniq -c | sort -rn

Performance Optimization

If monitoring reveals performance issues, here are the most common fixes:

High Response Time

# Check if it's the API or your server
openclaw metrics response-time --breakdown

# If API is slow: switch to a faster model or provider
# If server is slow: check CPU and memory
openclaw status --resources

Memory Leaks

# Track memory over time
watch -n 60 'openclaw status --resources | grep Memory'

# If memory keeps growing, restart periodically
openclaw cron add --name "weekly-restart" "0 3 * * 0" \
  "Notify admin that a scheduled restart is about to happen, 
   then restart the agent."

Database Optimization

# Check database size and health
openclaw db stats

# Compact the database
openclaw db vacuum

# Clear old data
openclaw memory prune --older-than 90d

Performance optimization checklist

Monitoring Checklist

Use this checklist to ensure your monitoring is complete:

- openclaw doctor runs clean, or every warning is understood
- Response time, success rate, and token usage reviewed at least weekly
- The /health endpoint polled by cron or an external monitor
- A heartbeat that alerts you through a channel independent of the agent
- Alerts routed by severity (warning, error, critical)
- Log rotation configured so logs cannot fill the disk

Frequently Asked Questions

How do I monitor OpenClaw if I am away from my computer?

Set up Telegram or email alerts via Inbounter. You will receive notifications on your phone when something needs attention.

What is a good baseline for response time?

With Claude Sonnet, expect 3-8 seconds for typical queries. Anything consistently above 15 seconds warrants investigation.

How much does monitoring add to my server costs?

Prometheus + Grafana add about 200-300 MB of RAM. On a $5+ VPS, this is manageable. If you are on minimal hardware, stick with openclaw status and cron-based alerts.

Can I monitor multiple OpenClaw instances from one dashboard?

Yes. Point Prometheus at multiple OpenClaw instances and Grafana will show all of them. Use labels to distinguish between instances.

Should I monitor token costs in real time?

Daily monitoring is usually sufficient. Set alerts for when daily spending exceeds your threshold so you catch anomalies quickly.

How do I know if my agent is performing well?

Track these three numbers: success rate (target: 95%+), average response time (target: under 8s), and daily token cost (target: within budget). If all three are green, your agent is healthy.
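Those three checks combine naturally into a single traffic-light script. A sketch with hardcoded sample values (in practice, parse them from the `openclaw metrics` commands shown earlier):

```shell
#!/bin/bash
# agent-health.sh -- traffic-light check over the three headline numbers.
# Values are hardcoded samples for illustration.
SUCCESS_RATE=96.8   # percent  (target: >= 95)
AVG_RESPONSE=4.2    # seconds  (target: <= 8)
DAILY_COST=0.22     # dollars
BUDGET=5.00         # dollars per day

STATUS="healthy"
awk -v v="$SUCCESS_RATE" 'BEGIN { exit !(v >= 95) }' || STATUS="degraded"
awk -v v="$AVG_RESPONSE" 'BEGIN { exit !(v <= 8) }'  || STATUS="degraded"
awk -v c="$DAILY_COST" -v b="$BUDGET" 'BEGIN { exit !(c <= b) }' || STATUS="degraded"

echo "Agent status: $STATUS"
```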

Download for Mac