model-watchdog

March 15, 2026

Auto-rollback for AI agent config changes. Zero dependencies beyond Python 3.8+.

The Problem

I changed my AI agent's model config from claude-opus-4-5 to claude-opus-4-6 without checking if the installed software version supported it. The agent went down for 10 hours while I was asleep.

This tool watches your agent's health endpoint. When it detects repeated failures, it automatically rolls back the config and restarts the service. It also saves a "last known good" backup whenever the config changes and the agent is healthy.

Quick Start

# Probe http://localhost:18789/health every 30s
# Roll back after 3 failures in 3 minutes
python3 watchdog.py

# Custom config
python3 watchdog.py --config watchdog.yaml

# One-shot health check (for CI/scripts)
python3 watchdog.py --check-once
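
For a sense of what a single probe involves, here is a minimal sketch using only the standard library. The function names `matches_expectation` and `check_once` are hypothetical, and the rule "exact status code plus body substring" is an assumption; the actual matching logic in watchdog.py may differ.

```python
import urllib.error
import urllib.request


def matches_expectation(status, body, expected_status=200, expected_body="ok"):
    """Assumed health rule: exact status code and the expected text somewhere in the body."""
    return status == expected_status and expected_body in body


def check_once(url="http://localhost:18789/health", timeout_sec=5):
    """Probe the endpoint a single time; network errors and timeouts count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_sec) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return matches_expectation(resp.status, body)
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a URLError subclass, so error responses land here too.
        return False
```

A CI step could turn the boolean into an exit code, for example `sys.exit(0 if check_once() else 1)`. Whether --check-once uses that exact convention is not stated here.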

How It Works

  1. Probe your agent's health endpoint every N seconds
  2. On K failures within M minutes → roll back the config and restart the service
  3. When agent is healthy after config change → update the "good backup"
  4. Alert via Telegram, Slack, Discord, or any HTTP webhook

Agent healthy with new config → save as "good backup"
        ↓
Config changes (model upgrade, etc.)
        ↓
Agent starts failing
        ↓
K failures in M minutes → rollback to good backup → restart
        ↓
Alert sent → agent back online
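
The "K failures in M minutes" rule can be sketched as a sliding window over failure timestamps. This is an illustration, not the tool's actual code: the class name `FailureWindow` is hypothetical, and it assumes failures do not need to be consecutive, so a successful probe does not clear the window.

```python
import time
from collections import deque


class FailureWindow:
    """Track probe failures and report when K of them fall within M seconds."""

    def __init__(self, failures=3, window_sec=180):
        self.failures = failures
        self.window_sec = window_sec
        self._stamps = deque()

    def record_failure(self, now=None):
        """Record one failure; return True when the rollback threshold is reached."""
        now = time.monotonic() if now is None else now
        self._stamps.append(now)
        # Drop failures that have aged out of the window.
        while self._stamps and now - self._stamps[0] > self.window_sec:
            self._stamps.popleft()
        return len(self._stamps) >= self.failures

    def reset(self):
        """Forget past failures, e.g. right after a rollback."""
        self._stamps.clear()
```

With the default 30-second probe interval, three failed probes in a row reach the threshold about a minute after the first failure.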

Config

Generate a sample config:

python3 watchdog.py --dump-config > watchdog.yaml

Key options (shown as JSON, which loads without PyYAML):

{
  "probe": {
    "url": "http://localhost:18789/health",
    "timeout_sec": 5,
    "expected_status": 200,
    "expected_body": "ok"
  },
  "thresholds": {
    "failures": 3,
    "window_sec": 180,
    "probe_interval_sec": 30
  },
  "rollback": {
    "config_path": "~/.openclaw/openclaw.json",
    "backup_path": "~/.openclaw/openclaw.json.watchdog-good",
    "restart_cmd": "systemctl --user restart openclaw-gateway",
    "restart_wait_sec": 10
  },
  "alerts": {
    "telegram_bot_token": "...",
    "telegram_chat_id": "..."
  }
}
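
To show how the rollback settings fit together, here is a rough sketch of the backup and restore steps. The function names `save_good_backup` and `roll_back` are hypothetical, and comparing file contents to detect a config change is an assumption; watchdog.py may detect changes another way.

```python
import filecmp
import os
import shlex
import shutil
import subprocess
import time


def save_good_backup(config_path, backup_path):
    """Copy the config to the backup if it differs; call only while the agent is healthy."""
    config_path = os.path.expanduser(config_path)
    backup_path = os.path.expanduser(backup_path)
    if os.path.exists(backup_path) and filecmp.cmp(config_path, backup_path, shallow=False):
        return False  # config unchanged since the last good backup
    shutil.copy2(config_path, backup_path)
    return True


def roll_back(config_path, backup_path, restart_cmd, restart_wait_sec=10):
    """Restore the last known good config, restart the service, then wait for it to come up."""
    config_path = os.path.expanduser(config_path)
    backup_path = os.path.expanduser(backup_path)
    if not os.path.exists(backup_path):
        return False  # nothing to restore
    shutil.copy2(backup_path, config_path)
    # Split the command string into arguments instead of passing it to a shell.
    subprocess.run(shlex.split(restart_cmd), check=True)
    time.sleep(restart_wait_sec)
    return True
```

The "~" in config_path and backup_path is expanded explicitly because Python's file functions do not do that on their own.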

Run as a Service

# Install as systemd user service
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/model-watchdog.service << EOF
[Unit]
Description=model-watchdog AI agent health monitor
After=network.target

[Service]
ExecStart=/usr/bin/python3 /path/to/watchdog.py --config /path/to/watchdog.yaml
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
EOF

systemctl --user enable --now model-watchdog
systemctl --user status model-watchdog

Works With

  • OpenClaw (default config paths)
  • Any AI agent with an HTTP health endpoint
  • Any service with a config file + restart command

Why No Dependencies?

Agents running 24/7 on minimal VPS installs shouldn't need a pip install to stay alive. This is a single Python file, standard library only.

Optional: pip install pyyaml for YAML config support (JSON works without it).
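
One way an optional dependency like this can be handled is sketched below. The `load_config` helper is hypothetical, and trying JSON first regardless of file extension is an assumption, not necessarily what watchdog.py does.

```python
import json
import os

try:
    import yaml  # optional: only available after `pip install pyyaml`
except ImportError:
    yaml = None


def load_config(path):
    """Parse the file as JSON first; fall back to YAML only if PyYAML is installed."""
    with open(os.path.expanduser(path), encoding="utf-8") as fh:
        text = fh.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if yaml is None:
            raise RuntimeError(f"{path} is not valid JSON and PyYAML is not installed")
        return yaml.safe_load(text)
```

Under this approach a file named watchdog.yaml that contains JSON still loads on a machine with nothing but the standard library.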

License

MIT