news.md

June 28, 2026 · View on GitHub

🔥🔥🔥 News (Pacific Time)

June 28, 2026 (latest): New accept-edits permission mode — the middle ground between auto and accept-all — plus an exposed plan mode and corrected permission docs. Previously auto asked before every file edit while accept-all ran everything (destructive shell commands included), with no setting in between — awkward for a coding session where you trust the model to edit files but want a prompt before it runs commands. accept-edits fills that gap: it auto-runs Write/Edit/NotebookEdit but still prompts for any non-allow-listed Bash, so a git push --force or rm is never run silently (and the host-destroying hard denylist — rm -rf /, mkfs, dd to a raw disk, fork bombs — still applies at execution time, in every mode). Implemented as one branch in _check_permission; reads and Bash keep the auto rules. Two cleanups shipped alongside: the already-implemented plan mode (read-only analysis, writes refused except to the plan file) is now listed in the /permissions menu and tab-completion instead of being reachable only via /plan; and the system prompt's "Safe vs Unsafe" section — which wrongly told the model that auto auto-approves checkpoint-protected edits, and that unsafe ops are asked "even under accept-all" — was rewritten to match the real _check_permission behavior (auto prompts for all edits; accept-all does not prompt; only the hard denylist blocks unconditionally). README, the 7 i18n READMEs, docs/guides/features.md and docs/guides/reference.md all list the five modes. See docs/guides/reference.md.
June 28, 2026: Memory staleness is now anchored to verification, not file mtime — and the agent is told how to keep memories fresh. Two parts. (1) The bug (PR #150). MemorySearch rewrites a memory file to bump last_used_at, which advanced its mtime; both the retrieval recency score and the ⚠ stale warning were derived from that mtime, so a single read of a stale, never-re-verified memory reset its recency to ~1.0 and suppressed its "verify against current code" warning — the "stale-but-confident" failure the design warns against, and worst for the most-retrieved (most-likely-acted-on) memories. The fix adds a last_verified frontmatter field (defaults to created); staleness/recency now come from verified_epoch (last_verified → created → mtime fallback for legacy files), never raw mtime. touch_last_used preserves mtime and never writes last_verified, so a read can't look like a write. A new MemoryVerify tool / mark_verified() is the only thing that refreshes the clock, called after the agent re-checks the claim against the environment — integral to the fix, since removing the broken implicit refresh (read = fresh) requires providing the correct explicit one. 7 regression tests encode the bug (tests/test_memory_staleness.py); existing memory tests stay green. (2) Follow-up — activating the dormant half. Merging #150 alone left the explicit-refresh path unused: the memory system prompt told the model to verify a memory before acting on it but never mentioned the MemoryVerify tool, so no memory ever got re-verified and still-correct old memories would keep the stale flag and decay in ranking forever. The prompt now explicitly instructs: after confirming a claim still holds, call MemoryVerify (the only thing that clears the flag / restores ranking); if it no longer holds, MemorySave (overwrite) or MemoryDelete instead. And the always-injected memory manifest — which still sorted and aged by mtime — is now anchored to verified_epoch too, matching how MemorySearch ranks (legacy files without date fields fall back to mtime, so they're unchanged). See docs/guides/features.md · docs/guides/reference.md.
June 23, 2026 (v3.5.83): Documentation slimmed to its essentials, a native desktop app brought into the repo, and the version-string format unified. Three threads of housekeeping. (1) README / docs trim. The top-level README had grown a verbose multi-paragraph News block and several reference dumps that duplicated the guides. Each News item is now one sentence + a [Details](docs/news.md) link (the full write-ups stay here); the 59-model Atlas Cloud list moved to docs/guides/usage.md (Option D) leaving only the 3-line example in the README; and the FAQ dropped to its three highest-value entries (MCP, Ollama tool calls, macOS PATH) with the rest pointed at docs/guides/faq.md. Net README ~551 → ~516 lines with no content lost — everything trimmed already lived in docs/. (2) Native desktop app (desktop/). A thin Electron shell that launches cheetahclaws --web --no-auth as a localhost sidecar, parses its Chat UI: http://…/chat ready line, and points a BrowserWindow at it — so the production web UI becomes a native window with nothing reimplemented. It couples to the rest only through that CLI contract (verified: npm run smoke launches the real server against this repo and confirms /chat / /health all serve), and scripts/build-app.sh can freeze the server with PyInstaller into a self-contained .dmg/.exe/.AppImage needing neither Node nor Python on the user's machine. Surfaced from three entry points (top-of-README callout, Web UI section, Documentation table). Remaining for a shippable installer: code signing / notarization. (3) Version-string format unified. Historical release notes mixed v3.05.x and v3.5.x; all ~166 occurrences across the README, the 7 i18n translations, docs/news.md, the guides, the demo/cast generators, and the recorded .cast/.svg banners are now the canonical v3.5.x. Version bumped 3.5.82 → 3.5.83 in pyproject.toml. Full suite green (2449 passed, 3 skipped); the desktop sidecar smoke test passes against this repo's web server. Not a breaking change — no runtime behavior changed.
June 16, 2026: All internal modules move into a single cheetahclaws package. Previously the importable modules lived flat at the top level (config.py, daemon/, kernel/, mcp_client/, providers.py, …). That works when you run from the repo dir but breaks once CheetahClaws is installed and launched from its entry point: a generic top-level name like config or daemon gets shadowed by whatever else is on sys.path — another project's config/ directory, the PyPI python-daemon package — and cheetahclaws dies at startup with ImportError: cannot import name … from 'config' (unknown location). (An earlier pass that merely dropped a cc_ prefix from four of these modules re-introduced exactly this collision, which the prefix had originally been added to prevent — so this change supersedes it.) The fix is the standard one: own a single namespace. All 21 single-file modules and 20 sub-packages now live under cheetahclaws/ and are imported as cheetahclaws.<name>; the entry script cheetahclaws.py became cheetahclaws/cli.py, with a deliberately light cheetahclaws/__init__.py (defines VERSION, lazily proxies CLI entry symbols via PEP 562 __getattr__ so importing a submodule never drags in the heavy CLI) and a cheetahclaws/__main__.py for python -m cheetahclaws. Imports were rewritten across all 448 .py files — 1269 from NAME + 126 import NAME + 41 dotted import NAME.sub statements, 118 string patch / mock / import_module targets, subprocess -m argv paths, the modular plugin f-string loaders, the voice/video back-compat shims, and embedded driver-script strings — all prefixed with cheetahclaws., using whole-word matching so RPC method names, filenames, and unrelated tokens were left alone. pyproject.toml now ships a single cheetahclaws* package (no py-modules) with entry point cheetahclaws.cli:main; agent_templates/ moved into the package so it ships as data. Triage of the move surfaced and fixed seven regression classes — kernel/daemon subprocess -m argv paths, the test_packaging import contract, the voice shim's submodule registration, the daemon e2e launcher, tests that patched the package object instead of the cli module, tests with hardcoded repo-root data paths, and a sys.modules stub-restore leak between test_research and test_setup_wizard. Breaking only for code that imports CheetahClaws internals directly — import kernel → from cheetahclaws import kernel, from mcp_client.client import get_mcp_manager → from cheetahclaws.mcp_client.client import get_mcp_manager; the cheetahclaws CLI, python -m cheetahclaws, the Web UI, and all bridges are unaffected. Verified end-to-end: python -m cheetahclaws --version and from cheetahclaws import config both work from outside the repo (the original crash), a built wheel contains cheetahclaws/* with all data files (web, prompts, agent_templates) and zero bare top-level modules, and the full suite is 2449 passed, 3 skipped, 0 failed.
June 6, 2026 (v3.5.82): macOS install reliably puts cheetahclaws on PATH, and local Ollama models that emit tool calls as text now actually execute them. Two fixes reported in issue #131. (1) Install / PATH on macOS. On macOS the installer creates a dedicated venv (~/.cheetahclaws-venv) and sources it, so the post-install verification if command -v cheetahclaws succeeded inside the script's own activated shell — it printed "cheetahclaws is on PATH" and short-circuited past the entire rc-file block, including the touch ~/.zshrc that was supposed to create the file. Result: ~/.zshrc was never created/updated, and in a fresh terminal (no venv active) the binary was unreachable, so users had to hunt for the install location by hand. The verification step no longer trusts the venv-polluted command -v: it confirms the binary at the expected BIN_DIR, then (for venv installs) symlinks only the cheetahclaws entry point into ~/.local/bin — pipx-style, so the venv's python/pip never get prepended to PATH and can't shadow the user's own — creates the right rc file if missing (~/.zshrc for zsh, ~/.bash_profile for bash on macOS, config.fish for fish), and appends the exposure dir to PATH there. The fish branch now also writes fish (set -gx PATH …) syntax instead of export, and the reload hint points bash-on-macOS at .bash_profile (scripts/install.sh). (2) Ollama tool calls (the "model just keeps talking" bug). The Ollama streaming path (stream_ollama) only read tool calls from Ollama's structured message.tool_calls field, whereas the OpenAI-compatible cloud path (stream_openai_compat) also recovers tool calls a model emits as text via _find_native_tool_marker + _extract_native_tool_calls. Many local models — Qwen-coder, Gemma, Mistral — emit calls as <tool_call>{…}</tool_call> / <|tool_call|>… / [TOOL_CALLS][…] inside content; on the Ollama path that markup was streamed straight to the screen as chat and never executed, so the agent loop saw no tool calls and ended the turn — exactly the reported "tool-calling-style chat that never runs." stream_ollama now mirrors the cloud path: when a native marker appears in the streamed content it buffers from that point (so the user never sees raw markup), and at end-of-stream parses the buffer into real tool calls (falling back to surfacing the buffered text if parsing fails, so nothing is silently swallowed). Note: Ollama's native /api/chat does not accept a tool_choice parameter, so the fix is the text-format recovery, not a request-param change. Existing provider + cache-token suites stay green. See docs/guides/usage.md · docs/guides/faq.md.
June 5, 2026 (v3.5.82): User-controllable token / cost budgets — set a spend cap; on hit the session auto-saves and you can resume or raise it. The quota engine (quota.py: per-session + per-day token/cost counters, enforced before each model call) already existed but had no friendly surface — you had to know four config keys (session_token_budget / session_cost_budget / daily_token_budget / daily_cost_budget) and there was no way to see how close you were, no warning before the wall, and the hard stop printed a bare [Quota exceeded]. This adds the UX layer on top of the unchanged engine: a /budget command — no args shows usage vs every budget as colored bars + percentages; /budget \$5 sets a session cost cap (the $ means USD), /budget 200k a session token cap (parses 200k / 1.5m / 200000), /budget daily \$20 / /budget daily 2m the daily caps, and /budget clear removes all. A --budget \$5 / --budget 200k startup flag sets the session cap at launch. Proximity warnings fire at the end of any turn that crosses ≥80% (yellow) / ≥95% (red) of a cap, so the wall never arrives by surprise. On hit the agent now yields a QuotaPause event (instead of a plain text line): the REPL auto-saves the session (session_latest.json + daily backup, the same path /resume reads) and prints a friendly next-steps block — raise the same cap or remove it (/budget clear) then resend, or restart later and /resume. So a long task that runs out of budget is never lost: you analyze, adjust, and continue. Tight enforcement (no surprise overshoot): the check projects the next request's input (compaction.estimate_tokens) and stops before the call if it would cross the cap, and clamps that call's max_tokens to the remaining headroom (quota.output_room) — so a single tool-heavy turn can't blow 40k→49k past the budget the way a pure "already-spent ≥ limit" check let it. One budget per scope: setting a cap replaces the other unit for that scope (/budget \$5 after /budget 200k switches the session cap to cost rather than stacking), so a leftover token cap can't silently keep blocking after you switch to a $ cap. Unit-matched hint: QuotaExceeded / QuotaPause carry which cap broke (key/scope/unit/limit), so the "raise it" suggestion is in the right unit — a token cap shows /budget 40k, a daily cost cap shows /budget daily \$40 — instead of a generic $ amount that wouldn't lift a token cap. New helpers quota.parse_budget / fmt_amount / usage_vs_limits / warnings / output_room; command in commands/core.py:cmd_budget; QuotaPause in agent.py; REPL handling + --budget in cheetahclaws.py; 42-case tests/test_budget.py (isolated quota dir, incl. a regression that the hint matches the breached unit and that switching units clears the stale cap). The daemon's conservative serve-mode defaults (200k tok / $2 per session, 2M / $20 per day) are unchanged — interactive stays unlimited by default, the server stays guard-railed. See docs/guides/features.md · docs/guides/reference.md.
June 5, 2026 (v3.5.82): Adaptive Markdown streaming — live output that stays correct on every device. In-place Rich Live redraw is great on capable terminals but breaks elsewhere: it was disabled wholesale over SSH (so SSH users got raw tokens with no formatting), and where it did run it could leave duplicate or stale frames — on macOS Terminal (which can't erase above the scroll boundary), over laggy network PTYs, or with wide CJK / emoji text whose display width a naive line-count gets wrong. The renderer now selects a streaming tier per device in ui.render.auto_stream_mode(config): live — full in-place redraw, only on terminals known to handle cursor-up (local TTYs, and modern emulators even over SSH: iTerm2, WezTerm, Windows Terminal, VSCode, kitty, Alacritty, Ghostty, detected via TERM_PROGRAM / TERM / WT_SESSION / KITTY_WINDOW_ID / ALACRITTY_WINDOW_ID / WEZTERM_PANE); commit — append-only progressive Markdown, the safe default for unknown-SSH / Apple Terminal / pipes / non-TTY, where each completed block (split on blank lines, respecting open code fences so a fenced block renders atomically) is rendered and printed permanently and the cursor is never moved, making a duplicate frame structurally impossible regardless of terminal, latency, or character width; plain — raw tokens, only when rich is unavailable. The append-only floor is provably duplication-free; live is progressive enhancement on top. Override with /config stream_mode=live|commit|plain (legacy boolean /config rich_live=true|false still works → live/commit). Implemented in ui/render.py (set_stream_mode / auto_stream_mode / _safe_commit_point / _commit_stream / _commit_flush), wired in at REPL start in cheetahclaws.py, with a 26-case test suite in tests/test_stream_modes.py (device routing, code-fence-aware block boundaries, append-only commit, and a regression asserting commit mode emits zero cursor sequences even on a TTY with CJK text). Two related UX items shipped alongside: /context is now a visual grid — a Claude-Code-style 20×10 cell grid of context-window usage, colored and broken down by category (system prompt / system tools / memory files / skills / messages / free space) with per-category token counts and percentages, adapting to the model's real context window and falling back to #/. on non-UTF-8 terminals (commands/core.py:cmd_context); and deepseek-v4-flash is registered at its 1M context window in providers._MODEL_CONTEXT_LIMITS (overriding the 128K deepseek provider default, which still applies to deepseek-chat / deepseek-v4-pro), so the prompt %, /context, and the compaction trigger all reflect the true 1M window. See docs/guides/features.md · docs/guides/reference.md.
June 4, 2026 (v3.5.81): Claude-Code-style quiet output — hide tool execution, show one summary line per turn. Long analysis turns used to scroll the terminal with a ⚙ Bash(...) line and a ✓ → N lines (… chars) line for every tool call, and the permission prompt dumped the entire inline script (e.g. a 60-line python3 << 'PYEOF' heredoc). A new quiet mode (on by default) suppresses the per-tool lines — the spinner conveys live activity and a single summary line is emitted at the tool→text boundary, sitting just above the reply (Read 2 files, ran 3 shell commands), the way Claude Code does. Errors and denials still surface so a mid-turn failure is never silent. In quiet mode the permission prompt also collapses a multi-line command to one line (Run: python3 << 'PYEOF' … (+59 行)) instead of printing the whole script. /verbose overrides quiet (full per-tool lines + inputs + token counts); toggle with /quiet, or launch with --show-tools (alias --no-quiet). The startup banner gains an Output: quiet / Output: full line so the active mode is visible at a glance. Live status line: the spinner now shows elapsed time plus a running output-token estimate (Thinking… (7s · ↓ 435 tokens)) — char-based, since providers only report real usage at the end — and each quiet turn closes with a real-usage footer ✻ Worked for 7.2s · ↑ 1.2k · ↓ 435 built from the true TurnDone counts. Implemented in ui/render.py (turn-level tool accumulator + turn_summary_line(), spinner token meter, print_turn_stats()), wired through the REPL event loop in cheetahclaws.py, with the /quiet toggle in commands/config_cmd.py. See docs/guides/features.md.
June 4, 2026: Context-window override — the prompt % and compaction now follow a settable context length. The prompt's context-usage % (and the compaction trigger) derive from the model's context window, which previously could only be a hardcoded provider default — and max_tokens (the OUTPUT cap) doesn't change it, so /config max_tokens=… left the % unchanged (a common point of confusion). New per-session key context_window (/config context_window=<N>, 0 = model default) overrides it, kept deliberately distinct from max_tokens. A single parser (providers.context_window_override) feeds the prompt %, /context, the compaction trigger, and the per-call output-token cap, so all four stay consistent; it is bidirectional — a smaller value forces earlier compaction, a larger value corrects a stale default. The value is read live each prompt, so switching model or context_window updates the % with no restart. /config warns when the value exceeds the model's real window (which would disable compaction and let the API reject oversized prompts). No-op when unset, so existing behavior is unchanged. See docs/guides/reference.md.
June 4, 2026: Rich Live streaming — long responses stay live via a bounded tail window. Large streamed responses that would overflow the terminal's redraw area could leave duplicate or stale frames behind on some emulators (macOS Terminal, etc.), because Rich Live redraws the whole accumulated output in place and the cursor can't reach content that has scrolled into the scrollback. Building on the per-response fallback from PR #133, Rich Live now keeps the live region bounded to the viewport: a short response is shown in full, but once it would overflow, only the last screenful of rendered lines (a tail window) is redrawn — so the Live region can never exceed the terminal and cannot leave stale frames. The complete output is committed once when the response finishes (including on Ctrl-C, since the REPL flushes on interrupt), so the head that scrolled out of the window is never lost. Plain streaming is kept only as a safety net (precise render failed, or the terminal is too small to bound a window). A cheap per-line wrap estimate short-circuits the expensive full render_lines() measurement while a response stays well under the limit, so normal responses pay no extra Markdown re-render per chunk. Adds focused tests covering full-frame streaming, the full→tail transition, tail-window commit-on-flush, real Segments rendering, and both safety-net fallbacks. See docs/guides/features.md.
May 31, 2026: QQ bot bridge — /qq connects cheetahclaws to QQ groups + C2C private chats (PR #121). Uses the official qq-botpy WebSocket + HTTP SDK (pip install "cheetahclaws[qq]"). botpy's async client runs on a dedicated asyncio event loop inside a daemon thread, bridged to the synchronous main thread via thread-safe queues. Handles on_group_at_message_create (group @-mentions, prefix stripped) and on_c2c_message_create (private). Since QQ has no message-edit API, replies stream as new messages every ~2 s (2000-char chunking) instead of updating a placeholder; passive replies reference the original msg_id/event_id within QQ's 5-minute window, then fall back to active pushes. Per-target FIFO job queues, slash-command passthrough, !jobs/!retry/!cancel remote control, image input, and permission prompts scoped to the originating chat (no cross-chat approvals). A supervisor reconnects with exponential backoff (2 s → 120 s). Secret handling matches the hardening standard below: $QQ_SECRET (recommended) > REPL arg (deprecated, warns + scrubs history) > config; env-supplied secrets never touch ~/.cheetahclaws/config.json. /qq <appid>, /qq, /qq stop|status|logout. Two follow-up fixes over the original PR: image downloads moved off the event loop into loop.run_in_executor (a blocking urlopen would freeze the WebSocket heartbeat for up to 30 s), and the secret no longer gets written to disk unconditionally. See docs/guides/bridges.md.
May 12, 2026 (v3.5.80): (security-hardening branch): Two-round security hardening sweep — CRITICAL + HIGH findings from the in-repo code review. Lands a cluster of fixes that close real attack surfaces opened by the recent rapid feature growth. Zero regressions across the full 2347-test suite.

Bot tokens off argv / readline history. cmd_telegram and cmd_slack now accept a single-arg form (/telegram <chat_id> / /slack <channel_id>) and read the bot token from $TELEGRAM_BOT_TOKEN / $SLACK_BOT_TOKEN. Env-supplied tokens never get persisted to ~/.cheetahclaws/config.json; only tokens that actually came in via the deprecated REPL-arg path are saved on disk. New bridges.scrub_token_from_history(token) walks readline.get_history_item backwards and removes any in-memory entry that embeds the token the moment we know its value. Bridge supervisors get a token=/channel= kwarg so the env-sourced token can flow to the worker thread without ever sitting on the config dict — _slack_start_bridge(config, *, token, channel). Telegram already passed the token explicitly to _tg_supervisor. WeChat is unaffected (QR-scan token, never in argv).

Web UI CSRF — double-submit cookie. Server mints ccsrf=<24B>; Path=/; SameSite=Strict; Max-Age=86400 (non-HttpOnly) on every connection that arrives without one. _handle_connection gates POST/PUT/PATCH/DELETE on a matching X-CSRF-Token request header (rejection: 403 csrf token mismatch). Exempt: /api/auth/{bootstrap,register,login,logout,api/auth} — they establish the session that later carries the cookie. New web/static/js/csrf.js monkey-patches window.fetch so every state-changing request automatically echoes the cookie value; loaded as the first script in chat.html, the inline terminal script in _build_html, and lab.html. Test harness (tests/test_web_api.py:_client) gains an httpx event hook that mirrors the browser behaviour. SameSite=Strict on the JWT cookie remains the first-line defence; CSRF is the second line.

Web terminal session ownership. _PtySession(owner_uid=...) records the creator's JWT sub at /api/session time. _check_pty_owner(session, cookie) is consulted at /api/stream / /api/input / /api/resize — any other authenticated user trying to reach a known sid gets 403 not session owner. Password-only mode (no JWT) keeps owner_uid=None and skips the check, preserving the shared-secret model. Closes the trivial-sid-hijack hole in multi-user web deployments.

Bash hard-denylist. Eight regexes in tools/shell.py:_BASH_HARD_DENY refuse host-destroying patterns regardless of permission_mode — rm -rf / and its --recursive/--force variants, rm -rf /*, mkfs.*, dd of=/dev/{sd,hd,nvme,vd,mmcblk,xvd}, > /dev/{sd,hd,...}, chmod -R 777 /, chown -R <user> /, and the classic :(){ :|:& };: fork bomb. Hits the Bash tool, the REPL !cmd escape, and every bridge's !cmd path. Plus NUL-byte + control-char + 64 KB length rejection on every Bash invocation.

Filesystem credential denylist. tools/security.py:_check_path_allowed now refuses access to a small denylist by default — SSH private keys (~/.ssh/id_*), ~/.aws, ~/.gnupg, ~/.kube, ~/.docker, ~/.netrc, ~/.pgpass, /etc/shadow, /etc/gshadow, /etc/sudoers*, /root. Public-by-convention SSH files (config, known_hosts, authorized_keys) remain readable. Set CHEETAHCLAWS_FS_NO_SANDBOX=1 to bypass when intentionally auditing your own secrets. Independent of allowed_root, which still works as the strict-mode toggle for multi-user daemon deployments.

Plugin loader hardening. Two new env switches in plugin/loader.py: CHEETAHCLAWS_DISABLE_PLUGINS=1 (kill switch) and CHEETAHCLAWS_PLUGIN_ALLOWLIST=a,b,c (whitelist). EXTERNAL-scope plugins (loaded via $CHEETAHCLAWS_PLUGIN_PATH) print a one-time stderr warning on first load so a stolen env-var-set doesn't silently execute. Module path resolution now uses Path.resolve() + relative_to(install_dir) to confine a malicious manifest's "tools": ["../../etc/passwd_loader"] style entry.

MCP env sanitisation. mcp_client/client.py:_sanitized_mcp_env strips a fixed set of process-hijack keys (LD_PRELOAD, LD_LIBRARY_PATH, LD_AUDIT, DYLD_INSERT_LIBRARIES, DYLD_LIBRARY_PATH, PYTHONPATH, PYTHONSTARTUP, PYTHONHOME, PYTHONEXECUTABLE, NODE_OPTIONS, NODE_PATH, BASH_ENV, ENV) from any env map an .mcp.json config supplies. Dropped keys print a one-line stderr notice. Bypass: CHEETAHCLAWS_MCP_TRUST_ENV=1. Closes a real local-priv-esc path on a host with multiple MCP server configs of varying trust.

macOS daemon peer-cred. daemon/auth.py:get_peer_uid now branches on sys.platform: Linux keeps SO_PEERCRED, macOS / *BSD goes through ctypes-loaded getpeereid(2). Closes a long-standing TODO that effectively reduced macOS Unix-socket auth to token-only (a stolen daemon-token implied full RCE without peer-uid validation).

Smaller fixes folded in. Web JWT secret loader rewritten with O_CREAT \| O_EXCL + 0o600 + post-write mode verification (refuses to read a world-readable secret file; auto-falls-back to in-memory secret if chmod can't be enforced; override with CHEETAHCLAWS_WEB_SECRET). Terminal one-time password from secrets.token_urlsafe(6)[:6] (~30 bits, online-bruteable) to secrets.token_urlsafe(32) (~190 bits). config.save_config strips permission_mode=accept-all before persisting — once-confirmed escape hatches no longer outlive the session that set them. session_store.save_session wrapped in a module-level Lock + explicit BEGIN IMMEDIATE / ROLLBACK so two threads writing the same session_id no longer silently drop one set of changes. agent_runner.py err_msg initialised before the try block (defends against a NameError on first iteration if _handle_permission_request returns "error"); quota.QuotaExceeded matched by isinstance instead of class-name string. compaction.compact_messages wraps stream_auxiliary in try/except + falls back to the original messages instead of crashing the agent loop. providers._recover_args_from_text caps the regex scan window to the last 32 KB of accumulated text (was scanning ~100 KB+ on every tool call). context.get_git_info + get_claude_md get TTL caches (30 s / 10 s, keyed by cwd) so the per-turn git rev-parse / status / log and CLAUDE.md re-read stop showing up in profiles. mcp_client/client.py reader loops use dict.pop() instead of in+index so a late response after a timeout doesn't race the request side. tool_registry._cache_key adds session_id dimension so a Read(/etc/...) cached for one session never leaks to another. session_store.search_sessions LIKE-fallback path escapes %/_/\ before interpolation.

Frontend XSS audit. Existing _esc (textContent-→-innerHTML) and _renderMd (HTML-tag-strip → marked) cover all user/model content paths. One deep-trust hole closed: web/static/js/settings.js:_renderModels previously injected server-supplied model names directly into an onclick="app.selectModel('${full}')" attribute — now uses data-model + a delegated click handler, so a malicious model registry entry cannot break out of the JS string literal.

Defaults you can flip. CHEETAHCLAWS_BRIDGE_TERMINAL=0 hard-disables the bridge !cmd shell entirely (default 1, owner-bound by chat_id whitelist anyway). CHEETAHCLAWS_FS_NO_SANDBOX=1 lifts the credential denylist. CHEETAHCLAWS_DISABLE_PLUGINS=1 / CHEETAHCLAWS_PLUGIN_ALLOWLIST=… / CHEETAHCLAWS_MCP_TRUST_ENV=1 control plugin + MCP behaviour. Full reference in docs/guides/security.md. All 12 CRITICAL + 10 HIGH items from the review now closed (4 of those 22 turned out to be review misjudgements — _all_errors init, permission double-answer race, _broadcast iter race, and the QuotaExceeded classname check was a real fix but the surrounding "shell injection in REPL !command" was reclassified as user-typed-input not RCE). Architecture refactor items (cheetahclaws.py / providers.py God-object split, sentinel state machine) deliberately left for a separate decision — they're shape changes, not bug fixes.
May 12, 2026 (daemon/f-4-followups-f-6-9 branch): Daemon foundation roadmap finished — all nine F-1…F-9 items in RFC 0002 now LANDED. Closes the remaining four scope items end-to-end (≈1500 LoC of code + ≈900 LoC of tests + docs). Drilldown:

F-4 #2 — Bridge notify forwarding. The subprocess-runner reader loop's notify IPC branch used to drop the payload on the floor (F-6/7/8 didn't exist yet). Now it routes through daemon.bridge_supervisor.notify(kind, text). The runner can target a specific bridge via msg["bridge"] (e.g. "telegram") or omit it for a "*" broadcast. agent_runner_notify events on the bus carry {name, run_id, bridge, delivered, text[:500]} so observers can audit deliveries. Empty-text frames are silently dropped (common during agent shutdown).

F-4 #3 — Restart policy. New RestartPolicy dataclass: mode (none | on-crash), max_restarts, backoff_base_s, backoff_cap_s, backoff_jitter_s. Frozen + a pure next_delay(restart_count) so the decision matrix is unit-testable. agent.start accepts the five fields flat (validation rejects cap < base which would clamp every attempt down to a useless ceiling). On a crash the reader's finally arms a threading.Timer(delay, _do_restart, ...); the Timer respawns via a swappable spawner hook (_RESTART_SPAWNER for tests) and carries restart_count forward. stop() cancels the Timer before the kill ladder, and the same _unregister(name, expected=handle) identity check protects against a Timer-fired respawn racing past a deliberate stop. Bus events: agent_runner_restart_scheduled, agent_runner_restart, agent_runner_restart_failed, agent_runner_restart_exhausted.

F-6 / F-7 / F-8 Phase 1 — Telegram / Slack / WeChat in daemon. Single daemon/bridge_supervisor.py owns lifecycle for all three kinds, gated per-bridge by feature flags (CHEETAHCLAWS_ENABLE_F6/7/8, default off, REPL is byte-for-byte unchanged until the operator opts in). The Phase 1 worker invokes today's bridges/<kind>.py:_<kind>_supervisor unchanged — same HTTP code, same reconnect/backoff, just owned by a daemon thread instead of a REPL one. Outbound bridge.notify(kind, text) dispatches via the per-kind sender (_tg_send / _slack_send / _wx_send); F-4 #2 plugs straight into it. Persistence in the F-2 bridges SQLite table (kind, enabled, config_json with secrets redacted, last_poll_at, last_error); bridge.list merges live workers with rows from previous daemon runs so disabled bridges remain visible in daemon status. Wire surface: bridge.{start,stop,list,send,status} RPCs in daemon/bridge_methods.py. F-7 depends on F-6 (shared scaffolding); F-8 the same. WeChat keeps a clear-error path for missing token/base_url since the QR-login handshake is still REPL-driven (/wechat login).

F-6 Phase 2 — Inbound refactor. When bridge.start daemon_phase2=True is passed, the legacy supervisor is bypassed for a slim daemon-driven loop: (a) outbound subscriber on the event bus, filters session_outbound events by session_id (tg:<chat_id> / sl:<channel> / wc:<user_id>) + target_bridges, calls handle.sender for delivery; (b) per-kind inbound poller (_phase2_telegram_inbound / _phase2_slack_inbound / _phase2_wechat_inbound) that re-uses today's HTTP helpers but publishes session_inbound on every new phone message instead of calling session_ctx.run_query. The agent driver — REPL, Web, or a future automation client — subscribes to session_inbound, runs the agent, calls session.reply(session_id, text, target_bridges?) for outbound chunks. Three new RPCs in daemon/session_methods.py: session.send, session.reply, session.list_recent. Permission requests born inside a bridge-driven turn route only back to the originating bridge via the existing PermissionStore originator stamp (<kind>:<session_id>).

F-9 — Cost-guardrail defaults + per-runner quota-pause. Headless cheetahclaws serve now sets four conservative defaults (session_token_budget=200_000, session_cost_budget=\$2, daily_token_budget=2_000_000, daily_cost_budget=\$20) via _apply_serve_defaults; REPL --in-process keeps None (unlimited) for back-compat. New system.status RPC returns {budgets, runners, bridges} so daemon status prints the live ceilings. agent.resume(budget_overrides, name?) merges overrides into daemon_state.config and (when name is supplied) calls runner_supervisor.resume(name) to deliver a resume IPC frame to a paused runner. The hook itself: a new pre-iter quota.check_quota raises into _on_quota_exceeded; the base impl is a no-op (REPL keeps today's behaviour where agent.run catches internally and yields a quota text), while _PipeAgentRunner overrides it to ship a paused_budget IPC frame, set status, and block on _resume_event.wait(). Supervisor reader publishes quota_warn + flips agent_runs.status='paused_budget'. On resume, runner sends resumed IPC, supervisor publishes agent_runner_resumed + flips status back to running. Control loop's stop handler also sets _resume_event so a stop arriving while paused unblocks cleanly.

Post-implementation audit fixed 5 real bugs in the new code. (1) _phase2_wechat_inbound used wrong field names (messages / fromUserName / msgId / syncKey instead of msgs / from_user_id / message_id / get_updates_buf per bridges/wechat.py:411). (2) _phase2_slack_inbound initialized cursor to None, so the first poll would replay the channel's recent backlog — fixed to seed at current wall-clock time (matches bridges/slack.py:_slack_poll_loop). (3) _phase2_telegram_inbound long-polled with timeout=25 s, meaning stop() had to wait up to 25 s for the HTTP call to return before observing stop_event — dropped to 5 s. (4) _unregister(name) was identity-blind; a Timer-fired _do_restart racing with stop() could see its freshly-spawned successor handle silently popped (orphaning the subprocess). Added an optional expected=handle identity check applied at every terminal stop site (runner_supervisor + bridge_supervisor have the symmetric fix). (5) _safe_cfg only matched token / secret keys; since bridge.start merges daemon_state.config into the bridge config, provider API keys (anthropic_api_key, etc.) and password / auth_* fields could bleed through to bridges SQLite rows and SSE events — extended to (token, secret, api_key, apikey, password, passwd, auth). Two new regression tests pin both.

Full repo suite (three independent runs): 2347 passing, 3 skipped (env-gated live LiteLLM tests), 0 failed, ~3:32 each. ~90 new daemon-specific tests across test_daemon_runner_{restart_policy,notify_routing,quota_pause}.py, test_daemon_{bridge_supervisor,bridge_methods,bridge_phase2,session_methods,f9_budgets}.py. RFC 0002 + docs/architecture.md §Daemon updated to reflect all of F-1 → F-9 landed. Details: RFC 0002.
May 12, 2026 (fix/litellm-provider-followup branch): litellm/ provider follow-up to PR #119 — make litellm a real optional dep, fix ledger / streaming, and wire it into the CLI / Web UI path. PR #119 (RheagalFire) introduced kernel/runner/llm/litellm_provider.py so CheetahClaws could route to 100+ LLM providers behind one SDK, but a careful re-review against the merge surfaced four classes of integration gap that the 12 mocked unit tests didn't catch. The follow-up branch (fix/litellm-provider-followup, 2 commits, 9 files, +1093/-229) fixes all of them and lands the docs the original PR was missing. (1) Dependency classification — description said optional, diff put it in core. Pyproject's [project] dependencies had grown a litellm>=1.60.0,<2.0.0 line, and requirements.txt's core block matched; every pip install cheetahclaws was force-pulling litellm and its transitive chain (tokenizers, tiktoken, pinned pydantic versions). Moved to [project.optional-dependencies] under a new litellm extra, also added to all; requirements.txt now only documents the optional install via a comment. Backed up by a test_litellm_is_optional_dependency regression. (2) Not reachable through either user path. kernel/runner/llm/__main__.py:_select_provider only knew mock / scripted / anthropic, and the top-level providers.PROVIDERS registry (which the CLI + Web UI consult to resolve --model <X>) had no litellm entry at all, so end-to-end the new class was reachable only by direct Python import. Added a litellm branch to _select_provider (reads CC_LLM_API_KEY as an optional explicit override), a PROVIDERS["litellm"] entry with type: "litellm", and a new stream_litellm() generator in providers.py mirroring stream_openai_compat's shape — yields TextChunk per delta then AssistantTurn at end. The dispatcher in providers.stream() branches on prov["type"] == "litellm". bare_model("litellm/openai/gpt-4o") strips only the first /, leaving openai/gpt-4o — exactly what litellm.completion(model=...) expects. (3) Streaming silently zeroed the ledger. stream() returned tokens_input=0, tokens_output=0, tool_calls=(), finish_reason="stop" unconditionally. The kernel runner emits charge IPC messages from those fields and gates RFC 0022 tool dispatch on response.is_tool_use, so every streamed call bypassed quota and lost any tool_use the model emitted. Fix passes stream_options={"include_usage": True} to litellm.completion and reassembles the chunk list with litellm.stream_chunk_builder(chunks, messages=...) so the synthesized final response carries real token counts, tool_calls, and finish_reason. Two regression tests pin the contract (test_stream_emits_deltas_and_returns_usage, test_stream_preserves_tool_calls); a third (test_cost_unknown_set_when_chunk_builder_fails) covers the fallback when the builder returns None on very old litellm versions. (4) cost_micro hard-coded to 0 — quota free pass. Both __call__ and stream() returned cost_micro=0 regardless of model. Switched to litellm.completion_cost(completion_response=resp, model=model) which uses litellm's per-model price table (covers 100+ providers, kept in sync upstream); convert USD → micro-USD via the same * 1_000_000 factor AnthropicProvider uses. On completion_cost raising (unknown model) or returning None, the response carries metadata["cost_unknown"]=True so the ledger can distinguish a real $0 (Ollama, free NIM tier) from an unpriced call. Exception mapping. try: ... except Exception: raise ProviderUnavailable(...) swallowed every error class into "their fault" — 401s, malformed requests and connection timeouts all looked the same to the runner. New _map_exception reads self._litellm.exceptions.{AuthenticationError, BadRequestError, NotFoundError, UnsupportedParamsError} and re-raises those as ProviderInvalidRequest ("your fault"); everything else stays ProviderUnavailable so the runner may retry. Reads exception classes off the already-imported self._litellm module (instead of from litellm import exceptions) so the mapper stays testable without a real SDK installed. Lazy import. Top-level import litellm violated the module-level contract in kernel/runner/llm/__init__.py ("imported lazily so the absence of an SDK doesn't break this module's import") — every place that imported the runner's LLM package was implicitly importing litellm. Refactored to an _ensure_litellm() first-use pattern matching AnthropicProvider._ensure_client, with a test_module_imports_without_litellm that strongly verifies the property (the local dev env doesn't have litellm installed — the test passes). Self-review caught 5 more bugs before pushing. (a) _parse_tool_calls called tc.function.name outside the try block — a malformed tool_call with function=None would crash the whole response instead of the single bad call; fixed by getattr chain + continue-on-empty-name. (b) json.loads("null") and json.loads("[1,2]") return None / list, which trip LlmResponse.__post_init__'s isinstance(tc["input"], dict) validator; fixed by coercing non-dict to {}. (c) Same JSON-non-dict bug in providers.stream_litellm's streaming tool-call assembly; same isinstance guard. (d) The streaming fallback (when stream_chunk_builder returns None) emitted metadata={} instead of {"cost_unknown": True}, breaking ledger consistency. (e) tests/e2e_litellm_provider.py's fixture's try/except ImportError was dead code once the import was lazy — would confusingly fail on real assertions rather than pytest.skip if CC_LITELLM_E2E=1 was set on a box without litellm. Replaced with an explicit _ensure_litellm() probe + pytest.skip on ProviderUnavailable. 6 new defensive tests pin all five fixes. Tests. 23 unit tests in tests/test_litellm_provider.py (was 12 mocked-only) — covers lazy import, registry wiring (both _select_provider and providers.PROVIDERS), cost computation with cost_unknown fallback, streaming usage + tool_calls preservation, exception class mapping (AuthenticationError → ProviderInvalidRequest), and 6 defensive tool-call parsing regressions. New tests/e2e_litellm_provider.py mirrors the 3 live-API tests the PR body claimed but never committed (basic call, streaming, system prompt steering); skipif-gated on CC_LITELLM_E2E=1 AND per-provider credentials so CI / dev runs don't accidentally bill. Full non-e2e suite: 2222 / 2222 passing, zero regressions (up from 2154 baseline). Docs. New section in docs/guides/recipes.md under Section 1, between the vLLM/custom/ walkthrough and Section 2 — covers Bedrock SigV4, Azure deployment routing, Vertex service-account JWTs with concrete env-var setup, plus a 5-row troubleshooting table mirroring the existing vLLM one (litellm not installed, drop_params masking, cost_unknown semantics, Bedrock 401 region mismatch, Azure 403 stale api_version). README gains a pip install ".[litellm]" line in Optional extras, three Supported Models table rows (Bedrock / Azure / Vertex via litellm), and a dedicated LiteLLM (AWS Bedrock / Azure / Vertex AI) subsection under Closed-Source API Models with concrete invocation examples and an explicit pointer toward custom/ for plain OpenAI-shaped endpoints so users don't pull litellm when they don't need it. i18n READMEs (CN/JP/ES/DE/PT) intentionally left for the maintainer's translation cadence. Branch: fix/litellm-provider-followup (2 commits — abc3357 code + tests + recipes, f5f364d README), open for review against main.
May 11, 2026 (daemon/f-4 branch): F-4 skeleton — agent_runner becomes a supervised subprocess (RFC 0002). The fourth piece of the daemon foundation roadmap lands as a feature-flagged skeleton on the daemon/f-4 branch. Today each /agent <template> runner lives in a Python thread inside the REPL / web server process — one rogue runner can OOM-kill or hang the whole thing. F-4 makes each runner its own python -m agent_runner --pipe subprocess under daemon supervision so a leak, infinite loop, segfault, or kill -9 on the runner becomes an observable event (agent_runner_crash on the daemon event bus) instead of a process-wide failure. Components: (1) daemon/runner_supervisor.py (~650 LoC) — start / stop / stop_all / get / list_all, 3-phase stop (IPC stop → SIGTERM after 2 s → SIGKILL after another 3 s, bounded ≤ 5 s as required by the RFC acceptance criteria), background reader thread per runner pumping iteration_done / permission_request / notify / log IPC messages, crash classification on EOF, and best-effort writes to F-2's agent_runs + agent_iterations SQLite tables (INSERT OR IGNORE makes iteration re-delivery idempotent; last_iteration UPDATE never regresses). (2) daemon/runner_ipc.py — thin re-export of kernel.runner.ipc.JsonLineChannel so the kernel-side and daemon-side runners share one wire-format implementation (avoids the duplicate-fix-twice trap). (3) daemon/agent_methods.py — four JSON-RPC methods agent.start / agent.stop / agent.list / agent.status registered alongside the F-3 monitor.* family, with full param validation (TypeError → -32602 INVALID_PARAMS via daemon.rpc). (4) agent_runner.py gains a --pipe entry point: _pipe_main reads init from stdin, builds a _PipeAgentRunner subclass that bridges send_fn → IPC notify and _persist_record → IPC iteration_done, then drives the existing _run_loop body so all stagnation-detection / circuit-breaker / dup-summary logic from the threaded path is preserved unchanged. (5) start_runner / stop_runner / stop_all now dispatch on agent_runner_subprocess config key or CHEETAHCLAWS_ENABLE_F4=1 env var; default off, Windows always thread-mode. Self-review caught and fixed 3 real bugs before pushing: (a) reader-thread race (started before _register + DB insert) reordered; (b) malformed-message orphan (a null iteration field unwound the reader → finally classified crashed but subprocess kept running) — wrapped per-message dispatch in try/except + hard-kill in finally if proc still alive; (c) pre-handshake log+exit IPC on template-not-found that supervisor misread as the ready reply, switched to stderr + non-zero exit so the handshake EOF surfaces a clean error. Tests: 27 new (test_daemon_runner_supervisor.py 19 + test_daemon_agent_methods.py 10 — handshake, graceful stop ≤ 5 s, SIGKILL escalation on hung runner, external SIGKILL crash detection, IPC shim identity, 9 SQLite persistence cases incl. duplicate-delivery idempotency, 2 malformed-input safety-net regressions, RPC param validation for all 4 methods, end-to-end list → status → stop with an inline runner). 104 / 104 passing across F-4 + daemon + kernel + existing agent_runner tests, zero regressions. Still TODO before flipping from "skeleton" to "MERGED": permission routing through daemon/permission.py (currently auto-approves), bridge notify forwarding (waiting on F-6/7/8), restart policy, e2e test with the real python -m agent_runner against a tiny template. Branch: daemon/f-4. RFC: docs/RFC/0002-daemon-foundation-roadmap.md.
May 10, 2026 (v3.5.79): Web Chat UI session organization + headless-bridges slash handler + stale-session reaper crash fix. Three threads of work merged into a single release. Bridges / headless deploys (#84 follow-up): Telegram / Slack / WeChat /help, /monitor, /model, /status produced zero response in Docker / --web deploys because _start_headless_bridges() only wired run_query and agent_state on the shared session_ctx — never handle_slash. The bridge poll loops gate on if slash_cb: and fell through to continue before the 📩 Telegram: log line, so the failure was invisible in docker compose logs -f. Fix: extracted the slash handler (originally inlined in repl()) into a module-level factory _make_bridge_slash_handler(state, config, run_query); both REPL and headless paths now use it (single source of truth, no future drift between modes). Stale-session reaper crash: web/api.py:reap_stale_chat_sessions() called remove_chat_session(sid) without the user_id the function now requires for ownership-check parity — every reaper tick raised TypeError, killing the daemon thread, so stale ChatSession objects accumulated forever in the in-memory cache. Fix: capture (sid, user_id) pairs from the cached ChatSession objects under _chat_lock, then apply outside the lock. Web UI session organization: five-feature bundle layered on top — folders + drag-drop + Move-to context menu, ChatGPT-style active-folder context (click a folder name → + New and direct-typing both drop new sessions into that folder, with a Chat · in <Folder> topbar breadcrumb), batch select with Select-all-respecting-search-filter, batch delete + combined-Markdown export (chats-N-sessions.md), and a 4-px draggable sidebar divider with localStorage persistence. Backend adds a folders table, chat_sessions.folder_id nullable FK, in-place PRAGMA table_info + ALTER TABLE migration in init_db(), and 5 new HTTP endpoints (GET/POST /api/folders, PATCH/DELETE /api/folders/{id}, PATCH /api/sessions/{id}/folder). Also rolled in: issue #111 (handle_slash_sync / handle_slash_stream no longer double-broadcast to WS) and --web --model X persistence. Tests: +16 new across test_web_api.py (folder CRUD, batch ops, reaper regression) and the new test_bridge_slash_handler.py (5 cases pinning the headless handler contract). Full suite: 2154 / 2154 passing, zero regressions. User-side guide: docs/guides/web-ui.md.
May 10, 2026: Web Chat UI fixes — slash commands no longer reply twice; --web --model X actually applies the model. Two related issues that surfaced when wiring a self-hosted vLLM endpoint into the Chat UI. (1) Issue #111 — slash commands duplicated in Chat UI but not in terminal. web/api.py:handle_slash_sync was both returning events inline in the HTTP response and broadcasting the same events to the WS subscribers of the same client; chat.js then iterated data.events AND fired _handleEvent from ws.onmessage, rendering every reply twice. Same bug in handle_slash_stream for SSE-streamed long commands (/brainstorm, /worker, /agent, /plan). Both helpers now deliver events through a single channel — HTTP/SSE only — so _handleEvent runs exactly once per event. Background-thread events (sentinel flows, agent runs) are unaffected: by the time the worker thread emits, _broadcast is already restored to the live WS broadcaster in finally. (2) --web --model X was silently ignored. The CLI override branch only ran in the interactive-REPL path; the if args.web: branch loaded config straight from disk and started the server, so python cheetahclaws.py --web --model custom/qwen2.5-72b would happily boot but every request handler reloaded ~/.cheetahclaws/config.json with the previous model name (e.g. gemma-4-31B-it), producing a confusing 404: model does not exist against the new endpoint. Fix: cheetahclaws.py now persists args.model to config before calling start_web_server, matching the documented behavior; provider:model → provider/model normalization is identical to the REPL path. User-side guide: docs/guides/web-ui.md (Troubleshooting + Architecture notes updated).
May 10, 2026: Small-context local models survive large workloads — 4-part fix: ctx cap, auto-fanout, stagnation-stop, output paths under ~/.cheetahclaws/. Repro that motivated the work: running /agent → 1 (Research Assistant) on a 6.6 MB PDF (AutoRedTeamer.pdf — ~70k tokens of extracted text) with custom/qwen2.5-72b (32k ctx). Old behavior: 400 BadRequest "context length 32768"; the agent_runner kept polling the template every 2 s; the model produced 1500+ identical "task complete" summaries before anything stopped it. New behavior, four cooperating layers: (1) Per-model context-window registry + dynamic max_tokens cap (providers._MODEL_CONTEXT_LIMITS + get_model_context_window + dynamic_cap_max_tokens) — covers Qwen 2.5/3, Llama 3.x, Mistral/Mixtral, Phi, Gemma, DeepSeek local variants; _fetch_custom_model_limit now backfills PROVIDERS["custom"]["context_limit"] so compaction sees the live /v1/models value; per-call shrink based on actual prompt size keeps input + output + 1024 safety ≤ ctx. compaction.get_context_limit gains an optional config arg so custom-endpoint detection works on the very first turn. (2) Auto-fanout for oversize tool outputs (multi_agent/fanout.py) — when a single tool result (Read on a huge PDF, Grep over a giant tree, WebFetch of a long article) exceeds 0.4 × ctx_window, split into chunks at paragraph boundaries with token-overlap, dispatch parallel sub-LLM map calls (one per chunk, default cap 5 subagents), merge with a single reduce call; substitutes the merged summary in conversation history instead of letting the next API call overflow. Hooked at the tool-result append site in agent.py; transparent UX prints [Auto-fanout: <Tool> returned ~N chars (>threshold) → dispatching K parallel sub-summaries]. Configurable: auto_fanout_enabled / _threshold / _max_subagents / _chunk_overlap_tokens. (3) Stagnation-stop in agent_runner.py — when the model emits the same summary N iterations in a row (default 3, whitespace/case-normalized), stop the loop with a clear notification instead of burning thousands of API calls; configurable via auto_agent_dup_summary_limit (0 disables). (4) Agent output paths under ~/.cheetahclaws/ — /agent wizard now resolves relative output filenames (e.g. research_notes.md) to absolute paths under ~/.cheetahclaws/agents/<name>/output/ instead of CWD; AgentRunner exposes runner.output_dir, eagerly mkdir'd; Summary block + post-start info show the resolved path in green; absolute paths pass through unchanged. Tests: +47 new (fanout 23, ctx cap 18, dup-stop 13, output paths 8). Full suite: 2139 passing, zero regressions. User-side guide: docs/guides/extensions.md.
May 9, 2026: Read tool auto-redirects on overflow — defense-in-depth for the case where model ignores the template instruction. Re-running the same /agent + autodan.pdf failure showed two real-world problems with the prior fix: (1) The user was running the pip-installed binary (/home/shangdinggu/anaconda3/bin/cheetahclaws), not the source tree. New tools / templates added to source had no effect. (2) Even if the user reinstalled, qwen2.5-72b would likely still call Read instead of SummarizeLargeFile — models default to familiar tools no matter what the template says. The fix moves the routing decision into the Read tool itself. (a) New _maybe_redirect_to_summarize helper (tools/files.py). When Read or ReadPDF would return content too large to safely fit in the next API call, it instead returns a short redirect message like [ReadTooLarge: file is too large — call SummarizeLargeFile with file_path='X' instead] PREVIEW: …. The model sees the redirect, calls SummarizeLargeFile, gets a chunked-and-merged summary back. The raw content never enters the API call. (b) CJK-aware token estimation. CJK content tokenizes at ~1 token per character (vs ~2.8 chars/token for English). New _is_cjk_heavy() heuristic: ≥20% CJK characters → use 1:1 char-to-token estimate. A 24K-char Chinese file is 24K tokens, not 8.6K, and now triggers redirect on a 32K-context model. (c) Conservative ceiling for unreliable provider declarations. custom/<model> provider declares 128K context by default but the underlying model is often 32K (qwen2.5-72b, llama 3 8B, etc.). New safe_ctx = min(declared_ctx, 30000) caps the threshold at 30K tokens regardless of provider claims — the redirect now fires on the user's exact ~25K-token PDF case (would NOT have fired with the unconditional 128K ceiling, which is exactly the bug). (d) Wrapped Read registration (tools/__init__.py). New _read_with_overflow_check lambda calls _maybe_redirect_to_summarize after _read returns; for results <8KB it skips (not worth the check). ReadPDF gets the same treatment inline in _read_pdf. Why this works even on the old install: as soon as the user updates tools/files.py and tools/__init__.py, the redirect fires regardless of whether SummarizeLargeFile / template changes are present. The redirect's prose tells the model exactly which tool to call and with what args. Tests: 14 new pytest cases (tests/test_read_overflow_redirect.py) — CJK detection (English / Chinese / Japanese / mixed-minority / empty), threshold logic (small file → no redirect; user's exact failure case → redirect with right pointer; CJK at lower char count triggers vs same chars in English; conservative ceiling protects against overconfident provider; preview included for context). Plus 2 integration tests via execute_tool("Read", ...) confirming the wrapper applies the redirect end-to-end. 2077 targeted regression tests pass (2063 prior + 14 new), zero regressions across the whole repo.
May 9, 2026: Multi-agent map-reduce SummarizeLargeFile tool — solves the "file too big for model context" problem at the source. Re-running the same /agent + autodan.pdf failure case showed the SAFETY_BUFFER bumps were still band-aids — even with 2500-token buffer the prompt re-tokenization sometimes ate ~1K, leaving no margin. The real fix: when a file is too big for the model's context, chunk it and run multiple sub-LLM agents in parallel then merge. This makes file size irrelevant. (a) New SummarizeLargeFile(file_path, focus="") tool (tools/files.py). Reads any-size file (PDF / txt / md / code), estimates tokens, and: if it fits in (model_ctx - 8.5K_reserved) tokens → single-shot summary; otherwise → splits into N chunks (number adaptive to file size: 200KB on 32K-context model → ~4 chunks; 200KB on 200K-context → 2 chunks), summarizes each chunk in parallel via ThreadPoolExecutor (up to 8 workers), then a reduce step merges all chunk summaries into one unified output. Per-chunk failures are logged inline as [chunk N: error] markers so one flaky source doesn't sink the whole job. Returns the final summary as the tool result. Registered with read_only=True, concurrent_safe=True. (b) /summarize <path> [focus] slash command (commands/advanced.py:cmd_summarize). Thin wrapper around the same helper for direct user invocation — handy for quickly summarizing a paper or large code file without spinning up a full /agent flow. (c) research_assistant.md template updated. Step 2 of "each iteration" now tells the agent to prefer SummarizeLargeFile over Read for academic papers (handles chunking + never overflows context regardless of length). Falls back to Read for tiny (< 5KB) files. (d) Quick band-aid: SAFETY_BUFFER 1000 → 2500 in _try_reduce_output_cap_from_error. Even with the new tool, output-cap auto-reduction is still useful for the rare case where Read is called on a moderately big file. The 2500-token (~7.6% of 32K) buffer now absorbs the +1K vLLM decoder-priming variance we observed in the wild. Tests: 18 new pytest cases (tests/test_summarize_large_file.py) — token estimator parametrized cases, chunk planner adaptiveness (small file → 1 chunk; size scales monotonically; larger context → fewer chunks; chunks have overlap; chunks cover all content), file reader dispatch (text / missing / directory rejected), full pipeline (small → single-shot, big → map-reduce with N≥3 map calls + 1 reduce), tool registration + schema check. 2063 targeted regression tests pass (2045 prior + 18 new), zero regressions. Golden prompt fixture regenerated for the new /summarize command in the help index.
May 9, 2026: Two follow-up fixes after re-running the same /agent failure case. The previous patch wasn't enough — running the user's exact scenario again still showed: 1st call prompt 24577 + cap 8192 = 32769 fail → my auto-reduction fired → 2nd call prompt 24778 + cap 7991 = 32769 fail again. The prompt grew by 201 tokens between attempts (provider re-tokenized differently on retry), exactly eating the 200-token safety buffer. AND the agent_runner's consecutive-failure detector kept resetting because agent.py alternates between [Failed ...] and [Circuit breaker ...] markers, so signature-matched counter went 1 → 1 → 1 → 1 forever. (a) Bumped SAFETY_BUFFER 200 → 1000 in _try_reduce_output_cap_from_error. ~3% headroom on a 32K window absorbs provider-side tokenization variance. User's case: new safe cap = 32768 - 24577 - 1000 = 7191, which actually fits even after the prompt grows. (b) agent_runner now counts ANY failure, not just signature-matched. New parallel counter consecutive_any_failures increments on ANY [Failed] / [Circuit breaker] marker regardless of signature; trips at 4 consecutive iterations. The [Failed → Circuit breaker → Failed → ...] alternation now stops the agent at iteration 4 instead of looping forever. Updated stop-message clarifies whether the trip was "same identical failures" or "consecutive mixed failures". 8 existing tests updated for new buffer + 2045 targeted regression tests pass.
May 9, 2026: Three fixes for the context-overflow + circuit-breaker doom loop. User report: /ssj 15 → Research Assistant pointed at a large PDF, model qwen2.5-72b (32K context), output cap 8192, prompt 24577 input tokens → total 32769 → 1 token over the limit. Every API call returned the same BadRequestError. The retry loop hit the same error 5 times in 60s → circuit breaker opened (120s cooldown). After cooldown the agent runner retried with the SAME config → re-opened the breaker → cycle continued forever, generating hundreds of circuit_open_skip log lines. Three coordinated fixes break the loop. (a) agent.py auto-reduces output cap on context overflow. New _try_reduce_output_cap_from_error parses the explicit token counts from the error message (max=32768, requested=8192, prompt=24577) and computes a safe new cap = model_max - prompt_tokens - 200_buffer. In the user's case: 32768 - 24577 - 200 = 7991, which fits. The retry uses the new cap WITHOUT consuming the attempt budget; bounded to ONE auto-reduction per turn so a true overflow (prompt itself too big to fit any reasonable output) eventually surfaces. Tolerant regex matches both OpenAI-style and Anthropic-style overflow messages. Falls through to existing _force_compact path if numbers can't be parsed or the safe cap < 256. (b) agent_runner.py stops after N consecutive identical failures. Track each iteration's failure signature (the [Failed ...] or [Circuit breaker ...] marker text from agent.py's output, capped at 80 chars). When 3 in a row match, stop the agent with a clear notify message naming the underlying error. Prevents the doom loop where a fundamentally broken request (context too big for compaction to fix, missing API key, unauthorized model) keeps re-running every 2s for hours. (c) agent_runner.py honors circuit-breaker cooldown. When iteration text contains [Circuit breaker OPEN ... Cooldown: Xs], parse Xs and wait that long (capped at 5 min) instead of the configured 2s interval before next iteration. Avoids 60+ wasted iterations per single 120s cooldown. Tests: 8 new pytest cases (tests/test_context_overflow_recovery.py) — parser reproduces user's exact failure → 7991 cap, no-op when current cap already fits, give-up when safe cap < 256, OpenAI vs Anthropic phrasing tolerance, regex match for circuit-breaker cooldown extraction, regex match for [Failed / [Circuit breaker markers in real outputs. 2045 targeted regression tests pass (2037 prior + 8 new), zero regressions.
May 9, 2026: /brainstorm v2: programmatic backstops + ranked synthesis + --bg background mode. Three coordinated additions that make brainstorm output usable even when the lead model is weak (qwen2.5 etc.) and let users keep working while the debate runs. (a) Programmatic action-plan filter (commands/advanced.py). Two new helpers _extract_ban_keywords(opening) and _filter_action_plan(synthesis_md, ban_keywords). After _lead_synthesis returns, the action plan is regex-scanned (case-insensitive substring) against a built-in default ban list — consult an advisor, diversify your portfolio, monitor regularly, 考虑, 咨询, 定期监控, 多元化, 咨询财务顾问, 分散投资, 关注市场动态 and dozens more, English + Chinese — PLUS topic-specific bans extracted from quoted strings ("..." / 「...」) in the lead's own opening. Matched items are dropped with a _(programmatic self-check removed N action(s))_ note appended. Deterministic — runs regardless of whether the lead model actually executed its prompt-side SELF-CHECK instruction. The user-reported failure case where qwen2.5 banned "consult an advisor" in the opening but still wrote "明天与财务顾问讨论" as Action Plan item #10 is now caught at the code level. (b) Ranked synthesis enforcement. The _lead_synthesis prompt's ## Consensus section is renamed to ## Ranked Consensus with a mandatory **Ranked by: <metric>** header (metric extracted from the user's topic — "highest expected return" / "best refactor impact" / etc.) and items must be numbered with a → Why this rank: <one sentence> line. Programmatic backstop _consensus_is_ranked regex-checks for ≥2 numbered items in the section; if missing, ONE fallback LLM call asks the lead to rank. If the fallback also fails to produce a ranking, the original ships unchanged (no crash). (c) Background mode --bg (or --background). New flag spawns a daemon thread, returns the REPL immediately. Stage progress (Lead opening, Round 2/3 (cross-examination), Synthesis) prints from the thread and interleaves with the user's typing — acceptable trade-off for a freed REPL. New /brainstorm status subcommand shows all in-flight bg brainstorms with their current stage + elapsed time + output path. Implementation uses recursion: when --bg is set, the thread re-enters cmd_brainstorm with _bg_recursion=True markers in config that bypass the interactive prompts (which would block on stdin) and suppress the TODO-generation sentinel (no REPL is listening for it). Module-level _BG_BRAINSTORMS dict is mutex-locked so /brainstorm status reads a clean snapshot. Finished brainstorms older than 1h are pruned from status to keep the list useful; running ones never prune regardless of age. Tests: 27 new pytest cases (tests/test_brainstorm_v2_advanced.py) — ban-keyword extraction (defaults + opening-quoted), action-plan filter (English + Chinese + no-section + all-clean), ranking detector (proper / unranked-bullets / no-section / single-item), _ensure_consensus_is_ranked (no-op when ranked + LLM call when not + keep-original on LLM failure), --bg flag parsing (7 cases including --background alias + flag-position-tolerance + --bgmode not matching), bg registry (register/set_stage/complete/snapshot + sort + 1h-prune-finished + keep-running-regardless). 2037 targeted regression tests pass (2010 prior + 27 new), zero regressions across the whole repo. Doc: docs/guides/brainstorm.md adds --bg row to the flag table + new "Programmatic backstops on the synthesis" section + tip "use --bg for long debates so you can keep working".
May 9, 2026: /brainstorm --ground: pre-fetch real /research data so personas debate against facts. Closes the biggest remaining gap in the brainstorm pipeline. Until now /brainstorm was pure-reasoning (no_tools=True on every persona) — fine for design / refactor / strategy questions, but useless for data-hungry topics like stocks / current events / recent news where personas would confidently invent prices and tickers from training memory. New --ground (or --ground=N for top-N cap, clamped to [3, 50], default 15) runs research.aggregator.research() on the topic BEFORE the debate starts, formats the top results as a compact ### GROUNDING DATA markdown block, and inlines that into the snapshot every persona / lead opening / lead synthesis sees. Persona round-1 instructions gain "you MUST cite specific results by [N] when your claim relates to one — do not invent figures the data doesn't show." Lead opening detects the grounding block and anchors the agenda to it ("forbid any claim that contradicts the grounding data without citing it"). Lead synthesis takes a new grounding= kwarg and the prompt requires every consensus claim to trace to either a [N] result OR a specific persona claim — un-traceable claims must be DROPPED. Failure-tolerant: any exception from the research aggregator (network, missing API keys, all sources 429) is caught silently — _fetch_grounding returns "" and the brainstorm continues un-grounded with a logged warning. Cost: 10-30s for the fetch, but cached for 24h via the existing /research SQLite cache so back-to-back runs on the same topic are basically free. Composes cleanly with --rounds, --lead, --models. SSJ interactive flow gains a new Ground in /research data first? [y/N] prompt right after Rounds; default N so existing usage is unchanged. Tests: 18 new pytest cases (tests/test_brainstorm_grounding.py) — 8 flag-parse cases including bound-clamping + four-flag composition, brief-formatting shape + sort + char-budget + empty-results, three fetch-graceful-degradation paths (raises / empty brief / happy path), backward-compat for _lead_synthesis(grounding=). 2010 targeted regression tests pass (1992 prior + 18 new), zero regressions across the whole repo. Doc: docs/guides/brainstorm.md "Data-hungry topics" section rewritten with examples + tip "always pass --ground for any topic touching the real world".
May 9, 2026: /brainstorm output-quality guards — fix 5 real bugs surfaced from a live transcript. Reviewing brainstorm_outputs/brainstorm_20260509_000935.md exposed five concrete failures the structural changes alone didn't catch. (a) All persona letters were P — letter, name = get_identity(persona_name[0].upper()) and persona dict keys are p1/p2/…, so every Agent ended up labeled P ("Agent P quoting Agent P attacking Agent P"). Letters now come from a stable persona_identity map keyed by index → A, B, C, D, E… (capped at Z). (b) Same persona's NAME re-rolled every round because get_identity was called fresh and Faker is random — round 1's "Riley Torres" became round 2's "Alex Lopez". persona_identity is sealed once before the rounds loop. (c) Round 2+ challenges were verbatim copy-paste — qwen2.5 saw the first persona's CHALLENGE block in history and cloned it (8 of 10 round-2/3 challenges in the failing transcript were >95% identical). New _extract_challenge_blocks + _jaccard_similarity + _is_redundant_challenge (threshold 0.7) guards: when a round-2+ persona's CHALLENGE is too similar to a prior one, the lead force-regenerates ONCE with explicit "pick a different target / different angle" nudge; if still redundant, the contribution is kept but tagged _[lead note: contribution flagged as redundant]_ so the synthesizer can ignore it. (d) Lead synthesis self-contradicted itself — listed "consult an advisor" in What Was Filler then included "明天与财务顾问讨论" as Action Plan item #10. _lead_synthesis now takes the lead's own opening text as context and the prompt explicitly forces a SELF-CHECK before writing the action plan: "if any action matches a banned escape hatch, REWRITE or DELETE." (e) Weak lead models silently produce flat output — qwen2.5-72b leading qwen2.5-72b is the same model on both sides with no real moderation. New _is_weak_lead_model family check (qwen / qwq / gemma / phi-3 / mistral-7b / llama-3.2 / kimi-7b / minimax-text / abab / etc.); when triggered, prints a one-line warning suggesting --lead claude-opus-4-7 or the free --lead nim/deepseek-ai/deepseek-r1. Never silently overrides — just informs. Plus a new docs/guides/brainstorm.md "When NOT to use /brainstorm" section: the panel runs with no_tools=True so it can't pull live data — bad fit for stocks / current events / repo-specific code; good fit for architecture decisions / refactor strategy / risk assessment / API design. Tests: 28 new pytest cases (extraction + Jaccard + redundancy + weak-lead + synthesis-with-opening). 297 targeted regression tests pass.
May 9, 2026: /brainstorm round 2+ becomes adversarial cross-examination. Previous round-2+ prompt asked personas to "engage with what others said" but that was too soft — weak models defaulted to "agree-and-extend" or just continued their own line, producing N rounds of polite parallel monologues instead of a real debate. Three coordinated changes flip round 2+ into mandatory adversarial mode. (a) Persona round-2+ prompt rewrite (commands/advanced.py:call_persona). Each persona MUST: quote a specific claim from another agent verbatim (by letter), attack a specific weakness (data wrong / mechanism doesn't produce outcome / confounder ignored / claim un-falsifiable / contradicts stronger claim), AND propose a falsifiable counter-claim with a specific number/date/named entity. Structured format ### [CHALLENGE → Agent X] so weak models can follow. Politeness ("great point", "I agree, and would add", restating without attacking) is explicitly FORBIDDEN. Synthesis is the lead's job, not the persona's. (b) Round-aware lead probe (_lead_probe). Round 1 keeps the existing concrete-vs-vague check. Round 2+ uses a different probe that fires on DODGES — a polite agreement, a synthesis, or a defense-only reply that doesn't quote and attack another agent earns a probe demanding "Agent X said '...'. Attack it or accept it — your call, but commit. Quote and refute, don't dodge." (c) Lead opening warns about cross-examination upfront. Opening prompt now ends with explicit rule: "in any round after the first, each expert MUST quote a specific claim from another expert and either attack with a counter-claim OR explicitly accept it. Polite agreement counts as a dodge." UI label changes too — ── Round 2/3 (adversarial cross-examination — agents must attack each other's claims) ──. Tests: 3 new round-aware probe cases (round-2 polite-agreement gets probed; round-2 real challenge passes; round-1 still uses old vague check — captured so a future round-2 change can't regress round 1). 269 targeted regression tests pass.
May 8, 2026: /ssj brainstorm: interactive Rounds prompt. Tiny UX follow-up to the multi-round /brainstorm landing — /ssj → 1 (Brainstorm) now asks Rounds [1=monologues, 2=critique (default), 3-6=more debate] > right after the existing "How many agents?" prompt, so SSJ users can dial in debate depth without remembering the --rounds N CLI flag. Behaviour: when the user invokes /brainstorm --rounds 3 … directly via the slash-command line, the explicit value wins and the prompt is skipped (no double-asking). Telegram / web bridge sessions still skip the prompt entirely (no interactive input channel) and use the documented default of 2 rounds.
May 8, 2026: /brainstorm: real multi-round debate + tighter post-Write contract. Two follow-up fixes after the lead-moderator landing. (a) Multi-round debate (commands/advanced.py). Previous flow ran every persona exactly once — even with the lead moderator, that's three monologues stapled together, not a debate. New --rounds N flag (default 2, capped to [1, 6]) wraps the persona iteration in an outer rounds loop. Round 1 is initial positions (existing prompt). Round 2+ uses a different system prompt that explicitly forbids repeating: "Read the prior debate. Pick 1-2 specific claims from OTHER agents that you disagree with, can sharpen, or that change your view. Quote and engage. Do NOT re-list your round-1 ideas." Lead probes still fire after each persona in each non-final round. The synthesis prompt's transcript is rebuilt from brainstorm_history directly so adding new header rows can't mis-slice it again. Composes with --lead <model> and --models a,b,c: /brainstorm --rounds 3 --lead claude-opus-4-7 --models gpt-5,nim/deepseek-ai/deepseek-r1 redesign auth. (b) Tighter TODO prompt (cheetahclaws.py). The previous "do not echo / do not Read" prompt didn't stop qwen2.5 from Write → echo content as text → Bash ls to verify (with truncated path due to vLLM streaming) → echo content again. New prompt is numbered STRICT RULES: call Write EXACTLY ONCE; do NOT call Read; do NOT call Bash to verify; do NOT echo file content after Write; after Write succeeds, your turn ENDS. Both REPL and Telegram handlers updated. Tests: 9 new pytest cases (--rounds parser including bound-clamping + non-numeric rejection + three-flag composition). 266 targeted regression tests pass. The Bash-args truncation symptom (ls /srv/.../cheetahcla cut mid-path) is a vLLM hermes-parser streaming bug at the model server, not fixable on the client side; the tighter prompt avoids the Bash call entirely.
May 8, 2026: Three fixes for /monitor + /research stability — multi-word topics + aggregator deadlock + REPL Ctrl+C. Two distinct bugs reported on a /ssj → 17 (Trend Track) flow with the topic "Agent OS Benchmark". (a) Topic truncated to first word (commands/monitor_cmd.py:_parse_subscribe_args). The previous parser did args.split() and treated the FIRST whitespace token as the topic, dropping the rest. So /subscribe research:7d:Agent OS Benchmark daily became topic=research:7d:Agent + the rest was either silently dropped or mis-classified as flags. The new rule: walk left-to-right, peel off --flag tokens into channels, then if the LAST remaining token is in _VALID_SCHEDULES it's the schedule — everything before joined by single spaces is the topic. Correctly handles ai_research, ai_research weekly, custom:quantum computing weekly, research:7d:Agent OS Benchmark daily, research:7d:Agent OS Benchmark (default schedule), and edge cases. 12 new pytest cases (tests/test_subscribe_parser.py). (b) Aggregator deadlocked on slow source then killed REPL on Ctrl+C (research/aggregator.py:190). The with concurrent.futures.ThreadPoolExecutor(...) context manager calls shutdown(wait=True) on __exit__, which BLOCKS waiting for any in-flight worker to finish. When as_completed(timeout=...) fires its TimeoutError because one source is hung on a stuck socket, control unwinds into the __exit__ and joins the hung thread. Then the user Ctrl+Cs to escape, the KeyboardInterrupt fires during the join, and Python's atexit hook _python_exit ALSO joins the same threads — double-blocking, then atexit kills the process and the user is dumped to bash. Fix: switch to manual try/finally with shutdown(wait=False, cancel_futures=True) (Python 3.9+) so partial results return immediately; the hung worker keeps running as a daemon thread and dies silently with the process. Both _cf.TimeoutError and KeyboardInterrupt paths now mark unfinished sources with a status entry ("timeout (aggregator deadline exceeded)" or "interrupted by user") instead of dropping them silently. (c) REPL: Ctrl+C during a slow slash command killed the process (cheetahclaws.py:1368). The REPL did result = handle_slash(user_input, state, config) with NO try/except, so a KeyboardInterrupt during /monitor run, /research, /trading backtest, etc. unwound the call stack all the way to main() → sys.exit() → atexit. Fix: wrap the REPL slash dispatch in try / except KeyboardInterrupt → print '(command interrupted)' → continue so Ctrl+C cancels the command and returns to the prompt. Also wrapped the SSJ inner re-dispatches at lines 1420/1430 (__ssj_passthrough__ and __ssj_cmd__) so Ctrl+C from inside a slow SSJ-launched command bounces back to the SSJ menu instead of killing the REPL. 257 targeted regression tests pass.
May 8, 2026: /brainstorm gets a real lead moderator + read-only tool dedup. Two coordinated changes that turn /brainstorm from "round-robin echo chamber that produces filler advice" into "moderated debate with a structured master plan", and stop weak models from re-Reading the same file twice. (a) Lead moderator (commands/advanced.py). Three new in-process stages (no main-agent invocation, no tool calls — the whole pipeline lives inside cmd_brainstorm): (i) Opening — lead frames the agenda, names the concrete artifact this debate must produce (e.g. "specific tickers with thesis, not 'consider semiconductors'"), and lists 2-3 cheap escape hatches that will be REJECTED ("consult an advisor", "diversify", "monitor regularly"). The opening becomes the persona system-prompt's "DEBATE ANCHOR" so every persona writes against the same bar. (ii) Probe — after each persona speaks, lead reads their contribution and either replies NO_PROBE (concrete enough) or asks one ≤25-word follow-up that demands a specific commitment; the persona then gets one more swing answering the probe. (iii) Synthesis — lead produces the final master plan with four named sections (Consensus / Dissents / Concrete Action Plan / What Was Filler), with the consensus matrix tagging each claim with the agent letters that backed it. New --lead <model> flag lets you point lead at a stronger model than the default (/brainstorm --lead claude-opus-4-7 --models gpt-5,deepseek-r1 redesign auth). Composes cleanly with the existing --models a,b,c flag. (b) Eliminates the duplicate-Read bug. The previous flow returned a sentinel that asked the main agent to Read the brainstorm file and synthesize — qwen2.5 + vLLM cheerfully Read it twice and echoed the entire 4 KB master plan as text twice (also writing a different much shorter content via Write — a separate tool-call truncation issue). The new sentinel inlines the lead's master plan directly in the TODO-generation prompt, so the main agent only writes the TODO file. No Read, no rewrite. The old _save_synthesis step is now a no-op (everything is written inside cmd_brainstorm). (c) Read-only tool dedup (agent.py). Defense-in-depth even outside brainstorm: when the model fires Read/Glob/Grep/WebFetch/WebSearch with identical args twice within a single run(), the 2nd call is short-circuited — execute_tool is skipped (saves time), ToolStart/ToolEnd UI yields are suppressed (no ⚙ Read(...) printed twice), a brief [deduped Read: already in context] text marker is yielded so the user still knows what happened, and a synthetic [deduped] reminder is appended as the tool_result so the model sees "you already called this; use the content already in your context" — both nudging the model AND keeping the OpenAI/Anthropic tool_calls ↔ tool_response pairing valid. Write/Edit/Bash are explicitly NOT deduped (those can be intentional rewrites). Tests: 19 new pytest cases (8 lead helpers + 4 dedup integration via fake provider stream + 7 flag-parse). 245 targeted regression tests pass.
May 8, 2026: /ssj brainstorm hot-fixes — absolute path in synthesis prompt + tool dispatch hardened against empty args. Two bugs surfaced when a user ran /ssj → 1 (Brainstorm) on custom/qwen2.5-72b. (a) commands/advanced.py:244 — synthesis prompt leaked a relative path. The brainstorm synthesizer was injecting out_file (a Path resolved relative to cwd) into the model's prompt as brainstorm_outputs/brainstorm_<ts>.md. The model — obeying the system prompt's "always use absolute paths" rule — invented an absolute prefix and guessed wrong (in this case …/PR/cheetahclaws/brainstorm_outputs/…, a stale sibling source tree it had never been told existed). Read failed, the synthesis ran on no actual evidence. Fix: out_file.resolve() before formatting + an explicit "use this path verbatim, do NOT prepend any directory" line. (b) tools/init.py:459-471 — permission-prompt description used inputs['file_path'] not inputs.get(...). When a weak model fired a tool_call with empty arguments (qwen2.5 + vLLM hermes-parser is a documented offender — see "Be agentic on every model" entry above), the wrapper raised KeyError: 'file_path' before the registered ToolDef's friendly "Error: missing required parameter 'file_path'" lambda ever ran. The user saw Error executing Write: KeyError: 'file_path' and the model couldn't self-correct. Fix: .get(..., '<missing path>') for Write/Edit/NotebookEdit description, .get('command', '') or '' for Bash, so the inner ToolDef's friendly error always reaches the model. Bash's _is_safe_bash already tolerates empty input. Tests: 9 new pytest cases (tests/test_tool_dispatch_robustness.py) — empty args on Write/Edit/Read/Bash/NotebookEdit must return a friendly string and never leak KeyError to the agent loop. 226 targeted regression tests pass.
May 8, 2026: NVIDIA NIM free-tier provider + 429 cascade fallback + multi-model /brainstorm. Three small, focused additions — borrowed selectively from sibling forks (Falcon for NIM, Dulus for the multi-model debate idea) — that lower the barrier to entry for users without paid API keys and tighten epistemic diversity in brainstorming. (a) NIM provider (providers.py). New nim entry registered against https://integrate.api.nvidia.com/v1 (build.nvidia.com — free signup, no payment info), curated 10-model chain (deepseek-r1, deepseek-v3.1, llama-3.3-70b, llama-3.1-405b, nemotron-70b, mixtral-8x22b, qwen2.5-72b, qwen2.5-coder-32b, phi-3-medium, gemma-2-27b). All listed in COSTS as $0 so the UI doesn't show "unknown" for free-tier usage. Invocation: cheetahclaws --model nim/<vendor>/<model> — the double-prefix preserves NIM's upstream <vendor>/<name> form through detect_provider + bare_model. (b) 429 cascade fallback (agent.py). When a NIM model returns rate-limit (ErrorCategory.RATE_LIMIT), the agent loop calls nim_next_model() to pick the next model in the curated chain and retries — without consuming a regular retry slot. Capped at _NIM_FALLBACK_LIMIT = 3 swaps per turn so a fully-throttled tier can't busy-loop; after the cap, falls through to the standard exponential-backoff retry path. Disabled by setting nim_auto_fallback=False in config. Other providers (anthropic / openai / etc.) are not affected — the swap is gated by detect_provider() == "nim". (c) Multi-model /brainstorm (commands/advanced.py). New --models a,b,c flag distributes models round-robin across personas (/brainstorm --models claude-opus-4-7,gpt-5,nim/deepseek-ai/deepseek-r1 redesign auth) so a 5-persona session alternates 1, 2, 3, 1, 2 instead of running every persona on the same model. Single-model brainstorm is an echo chamber — different model families have different training data and blind spots, so multi-model debate buys real epistemic diversity. Each persona's section in the output Markdown is tagged with the model that produced it (## 🏗️ Architect _(via gpt-5)_) so the synthesizer can weight by source. Borrowed in spirit from Dulus's RoundtableAgent; the existing /brainstorm flow is unchanged when --models is omitted. Tests: 21 new pytest cases (tests/test_nim_provider.py 12 + tests/test_brainstorm_models_flag.py 9) covering provider registration, chain cycling (cycle-through + wraparound + unknown-model head fallback), 429 swap-then-succeed, fallback-cap-then-fallthrough, fallback-disabled honor, non-NIM no-leak, flag parsing across --models a,b,c / --models=a,b,c / flag-at-end / provider-prefixed IDs / single model. 217 targeted regression tests pass, zero regressions. Skipped by design: ia-web-parser's WebToolParser — Cheetahclaws' existing _extract_native_tool_calls already covers 4 marker formats (Gemma official + asymmetric, Hermes, Mistral) plus channel-tagged form and args recovery, so the streaming-vs-buffered UX delta wasn't worth the duplication.
May 8, 2026 (earlier): "Be agentic on every model" pass — explore-first prompt + qwen overlay + runtime auto-nudge. A user reported cheetahclaws --model custom/qwen2.5-72b replying "please tell me which file you mean" when handed a directory path, instead of just ls-ing it. Three coordinated defenses, layered so any one of them is enough to fix the failure mode on any model: (a) prompts/base/default.md — new "Investigate Before Asking" section + softened Stop Conditions. Every model now gets explicit "default to action over conversation" framing: a directory is not "missing information", it's an invitation to enumerate; AskUserQuestion is reserved for genuine post-exploration ambiguity (intent that no ls/Glob/Read could disambiguate), never as a substitute for a tool call. (b) prompts/overlays/qwen.md — new family overlay (10 lines, cites the Qwen function-calling guide). Qwen / QwQ chat-tuned models hedge by default ("could you specify…"); the overlay overrides that with "treat every concrete noun the user names — path, filename, URL, function, command, error string — as an instruction to investigate it with a tool, not echo it back as a question." Registered in _OVERLAY_RULES for all qwen / qwq model IDs regardless of runtime (DashScope / Ollama / vLLM / OpenRouter all match). (c) agent.py runtime auto-nudge — model-agnostic safety net. New _looks_like_investigation() heuristic detects absolute-path tokens in the user message (URL-stripped to avoid false positives on https://host/path); if the heuristic fires AND the model's first reply is text-only with zero tool calls, the loop injects a one-shot [system reminder] use your tools, don't ask for what was given message into history and continues. Bounded to one nudge per run() invocation so it can never cause a loop — second text-only reply always falls through to break. The nudge fires on conversion to the OpenAI/Anthropic format as a normal user-role message and is invisible in the rendered UI (yielded events drive the display, not state.messages). Tests: 13 new pytest cases (tests/test_agent_nudge.py) — heuristic positives/negatives across English + Chinese + URL-only + relative-path + bare greeting; loop integration via fake provider stream verifying nudge fires, doesn't fire without path, fires at most once. 89 prompt + 196 targeted regression tests pass, zero regressions. Docs updated: prompts/README.md overlay table + Known Gaps, docs/architecture.md overlays tree + agent-loop step (h), docs/contributor_guide.md overlay enumeration. The three layers compose: strong models (Claude/Gemini) read the new default rule but already behaved this way; mid-tier models (GPT/DeepSeek/Kimi) get a clearer prompt-level instruction; weak models (qwen2.5/QwQ) get prompt + overlay + runtime nudge stacked. Even on a model that ignores the prompt entirely, the runtime nudge gives one free retry before the user has to intervene.
May 8, 2026 (earlier): Agent-OS layer (kernel/) reaches v1.0 — 27 RFCs shipped, 1771 tests passing, zero regressions on the legacy REPL/bridges path. What started as a daemon foundation (RFC 0001/0002) is now a single-node agent operating system: AgentProcess + EventLog (0003), Capability model (0005), per-agent ResourceLedger with first_breach signal (0006), priority Scheduler with admission filter (0007), RLIMIT + bubblewrap Sandbox (0008), Mailbox + topic pub/sub (0009), AgentRegistry (0010), AgentFS unified VFS (0011), Observability + Prometheus exposition (0012), and a frozen 58-method JSON-RPC contract with CI drift guard (0013). On top of that substrate: F-4 Subprocess agent runner (0016), WorkerLoop scheduler↔supervisor glue (0017), Bridge mirror that wires Telegram/WeChat/Slack into kernel.mbox without touching bridges/ (0018), LLM runner MVP (0019), DialogueOrchestrator for multi-turn (0020), Tool Dispatch + Permission Routing (0021), LLM Tool Calling Integration (0022), defense-in-depth tools — Exec (argv-only, RLIMITed, env scrubbed; 0023), Glob+List (0024), Fetch (SSRF + DNS-rebind + redirect-leak defended; 0025) — three streaming layers (IPC chunks 0026, LLM token streaming 0027, Exec line streaming 0028, Fetch body streaming 0029), and three new built-in inspectors (Diff 0030, AST 0031, Git 0032). All kernel code lives in kernel/ and is gated behind --enable-kernel — default CheetahClaws CLI / REPL / bridges / web UI are byte-for-byte unchanged. Operators introspect via cheetahclaws kernel summary | info | agents | proc <pid> | events | queue | registry | methods | prometheus. Kernel SQLite schema is forward-only (v1 → v7). RFC 0014 multi-tenant + RFC 0015 cluster remain explicitly parked. Full overview: docs/agent-os.md. Each design note in docs/RFC/.
May 8, 2026: F-2/F-3 follow-ups + CI unblock (feature/fix-f2). Two-commit branch on top of #101's daemon foundation (F-2 SQLite persistence + F-3 monitor in daemon). (a) CI unblock (fix(ci)). Main has been red since 9c01237d (the trading-agent #99 merge) — tests/test_packaging.py::test_required_module_imports[modular.trading.ml] (the regression test added for issue #97) caught that modular/trading/ml/features.py and modular/trading/portfolio.py import numpy at module top while numpy is in the [trading] extra, not core deps. So pip install . (no extras) shipped a wheel where import modular.trading.ml blew up. PR #100 and #101 both inherited the red. Fix: dead import numpy as np removed from features.py; stacker.py defers numpy to inside train() and predict_proba() past the early-return paths so the diagnostic-only callers (train(too_few_rows), predict_proba(missing_model)) still work without the heavy stack; portfolio.py gates the numpy import behind try/except so module import succeeds and runtime callers raise on first use as before. test_trading_advanced.py and test_trading_discovery.py get pytest.mark.skipif markers on tests that genuinely need numpy / scipy / sklearn / pandas at runtime — skip cleanly on lean CI installs, run as before on full installs. Verified in a clean venv with only [web,autosuggest] (the exact CI install): 1075 passed, 11 skipped; with [all] extras: 1086 passed, no regressions. (b) F-2/F-3 follow-ups (fix(daemon)). Five issues found during the #101 review that the merged code didn't address: (i) daemon/cli.py:cmd_serve started monitor.scheduler.start(...) before the listener bound — order matters because if a due subscription fires before the daemon is reachable, an LLM/network error in fetch/summarize/deliver surfaces in the log before the user sees the listening line, and external clients can't yet act on the resulting monitor_report SSE event; moved past the bind + discovery write. (ii) monitor/scheduler.py had no defense against the daemon coming up after REPL /monitor start fired — both schedulers would race on last_run_at and double-fire subscriptions; added _foreign_daemon_running() step-aside check at every loop tick (REPL-side instances bow out when a daemon registers ownership), with owned_by_daemon=True flag the daemon passes to opt out of the check on its own scheduler. (iii) EventBus.publish was synchronous=FULL (SQLite default) → every event was an fsync per commit, ~305 μs each; for streaming agent output (text_chunk events at dozens/sec) that's a real disk-IO concern. daemon/schema.py now sets PRAGMA synchronous=NORMAL on init + every thread-local connection — safe under WAL (only the most recent transactions can be lost on hard kernel crash, which for a 24h-pruned event log is fine), microbenchmark drops to 39 μs/publish (~8×). (iv) The PR description said the JSON files were "kept readable for one release as fallback", but no fallback read path actually exists — jobs.py and monitor/store.py migration is fundamentally one-way once the schema_meta marker is set. Updated docstrings + docs/architecture.md to make the one-way semantics explicit and tell users how to redo a migration if needed. (v) docs/RFC/0002-daemon-foundation-roadmap.md F-2/F-3 marked OPEN → MERGED #101 + follow-ups (#fix-f2), with a new "Follow-ups" subsection under each. Branch: feature/fix-f2.
May 8, 2026: Two production fixes — Gemma 4 native tool-call interceptor + issue #97 (pip install . shipping a broken wheel). Two unrelated bugs that both blocked end users on the v3.1 release. (a) Gemma 4 native tool-call interceptor (providers.py). When users run cheetahclaws against gemma-4-31B-it via vLLM, the model emits its native <|tool_call>call:NAME{json}<tool_call|> format instead of the Hermes/JSON envelope vLLM's --tool-call-parser hermes expects. vLLM doesn't recognise the format → leaves it in delta.content → cheetahclaws yields it as TextChunk → terminal shows raw <|tool_call>call:Research{topic:<\|"\|>...<\|"\|>}<tool_call\|> garbage instead of a coherent answer. The interceptor in stream_openai_compat now watches the streamed text for any of four native tool-call openers (Gemma official <|tool_call|>, Gemma 4 asymmetric <|tool_call>, Hermes <tool_call>, Mistral [TOOL_CALLS]); on detection it (i) yields the pre-marker text as a clean TextChunk, (ii) stops yielding text and switches into buffer mode, (iii) at end-of-stream tries three parser branches against the buffer (Gemma's call:NAME{json}, JSON envelope with name/arguments, Mistral's array form) and adds successful matches to tool_calls. Also normalises Gemma's <|"|> → " quote escaping. If no parser matches, falls back to yielding the buffered raw text so users see something rather than a silent stall. Tests: 16 new pytest cases (tests/test_native_tool_intercept.py) covering marker detection (4 variants), 3 parser branches, robustness (empty buffer / unparseable garbage / multi-call buffer), and end-to-end streaming via mocked OpenAI client (verifies pre-marker text yielded as TextChunk + <|tool_call> tokens NOT in any TextChunk + tool_call appears in AssistantTurn). (b) Issue #97 — pip install . produces a broken wheel (pyproject.toml, deleted memory.py, tests/test_packaging.py). Reported by @albertcheng on Windows + Python 3.13: cheetahclaws.exe crashed at startup with ModuleNotFoundError: No module named 'prompts'. Root cause: a name collision in pyproject.toml — memory was listed in BOTH py-modules (referring to a 11-line backward-compat shim memory.py that re-exports from the memory/ package) AND packages (the real memory/ directory). Python's import system always prefers the package directory over a same-named .py file, so the shim was dead code; setuptools ≥ 75 on Windows treats this dual-registration as a hard error and silently drops unrelated packages from the wheel build — which is how prompts/ went missing. Fix: deleted the dead memory.py shim, removed memory from py-modules, and replaced the manual packages = [...] list with [tool.setuptools.packages.find] + wildcard include patterns so future sub-packages auto-discover. This also caught a separate latent bug — the four sub-packages added in the v3.1 trading discovery layer (modular.trading.alt_data, modular.trading.broker, modular.trading.discover, modular.trading.ml) were missing from the manual packages = [...] list and would have been excluded from production wheels even after a successful build. Tests: 29 new pytest cases (tests/test_packaging.py) — config sanity (no module/package name collision allowed; memory.py shim must not be re-introduced; pyproject.toml must use find not manual list), discovery walk (every top-level dir with __init__.py is reachable from find's include patterns or explicitly excluded), and the exact issue #97 failure reproduction (parametrised import test for 24 modules including prompts, prompts.select, all four new modular.trading.* sub-packages, and the cheetahclaws entry point — fails the build if any can't be imported). Verified locally: rebuilt wheel after fix contains all 31 packages including prompts/ and the four new sub-packages. 1005 passing (976 baseline + 16 native-tool-intercept + 29 packaging = 1005), zero regressions. CONTRIBUTING.md updated with explicit packaging discipline notes: never put a name in both py-modules and packages, sub-packages auto-discover via find, only top-level packages need a new include pattern.
May 8, 2026 (later): /trading v3.1 — automatic candidate discovery + composite ranking + anomaly detector + market monitor with bridge alerts. Closes the biggest gap in v3: previously you had to feed the agent symbols (/trading analyze NVDA); now it actively scans a universe and finds candidates for you. Four orthogonal discovery scanners ship: (a) insider_cluster — SEC EDGAR Form 4 cluster detector, flags tickers with ≥3 officer / 10%-holder filings in 30 days, surfaces SEC URLs so user can verify direction; (b) earnings_beat — yfinance earnings_dates surprise extractor, requires ≥10% beat AND post-print continuation (filters out the pop-and-fade pattern); (c) momentum_quality — factor intersection over the new factors.py (momentum = 6m return + 50d>200d trend confirmation; quality = ROE − 0.3·D/E + 2·op-margin; both min-max normalised + composite-scored); (d) sector_rotation — ranks SPDR Select sector ETFs by 1m+3m return, surfaces top holdings of the leaders. The orchestrator (discover/orchestrator.py) merges per-symbol hits across all four sources with weighted aggregation (insider 1.0, earnings 0.9, mom-qual 0.7, sector 0.5) AND a +0.5 confluence bonus when ≥2 sources flag the same ticker. New CLI: /trading discover [insider|earnings|momentum-quality|sector|all] [--universe sp100|sectors] [--add-watchlist N] — the --add-watchlist flag auto-promotes the top N hits to your watchlist for downstream /trading scan / /trading analyze. New /trading rank composite-ranks candidates by 0.5×factor + 0.3×discovery + ±0.1 calibration-tilt; output is a triage table for "which names deserve a real /trading analyze". New /trading factors [SYMS] shows raw momentum/quality/low-vol scores with a 24h disk cache at ~/.cheetahclaws/trading/factors_cache.json (S&P 100 takes ~1-2 min to scan, parallel ThreadPoolExecutor with 4 workers). New /trading anomaly [SYMS] runs three independent checks per ticker: volume spike (today vs 90d median ratio ≥ 2×), price gap (open vs prior close ≥ 3%), volatility regime z-score (5d realised vol vs 90d distribution ≥ 2σ). New /trading monitor scan runs one full monitoring cycle — anomaly detection + stop-loss/take-profit hits on open paper trades + earnings within 3 days for any held position + new SEC Form 4 filings since last scan (delta detection persisted in ~/.cheetahclaws/trading/monitor_state.db); --notify [telegram] [slack] [wechat] dispatches structured alerts (severity-tagged: critical/warning/info) through cheetahclaws's existing bridge layer. Honest framing on "real-time" in the docs: yfinance is 15-20min delayed for free tier, so polling more often than every 5-10 min is wasted effort; three scheduling options documented (manual, external cron, /monitor integration). New universe.py ships hardcoded S&P 100 (~7-8% drift/year, refresh quarterly) + 11 SPDR Select sector ETFs + curated top-10 holdings per sector ETF for sector_rotation. The discovery layer also fixes a real gap in the system prompt: the LLM didn't know what /trading discover etc. existed, so when users asked "can you find me good stocks" it confabulated; the dynamic _render_commands_block from earlier session now picks up the new subcommands automatically. Tests: 21 new pytest cases in test_trading_discovery.py covering universe resolution, factor scan + score with stubbed yfinance, insider cluster threshold logic, momentum-quality intersection, sector rotation top-sector picking, orchestrator multi-source merge + bonus, anomaly triple-check (volume/gap/vol-regime), ranker factor+discovery combination, monitor alert rendering + dispatch + end-to-end scan with stubbed market data. 960 passing (939 baseline + 21 new), zero regressions; golden system-prompt fixture regenerated. Honest disclaimer in PLUGIN.md and trading.md: discovery reduces search cost, not generates alpha — the named factors (momentum, value, quality) are well-known and largely priced in by quant funds; what users get is a 100-ticker → 15-ticker triage list to spend tokens on, plus structured discipline (anomaly detection, stop monitoring, earnings calendar) that's hard to do by hand. Form 4 transaction direction is NOT yet parsed from XML (we count filings, not buys vs sales); URLs included so user verifies in 5 seconds. Insider direction parsing is on the roadmap but requires reliable XML scraping of SEC archives across version drift.
May 8, 2026: /trading v3 — paper-trade tracker, calibration, managed $X portfolios, alt-data, MV optimizer, ML stacker, walk-forward, broker abstraction. A two-stage upgrade that turns the trading module from "ask LLM about a stock" into a measurable research substrate. Stage 1 (the discipline layer): every /trading analyze recommendation is auto-recorded as a paper trade (~/.cheetahclaws/trading/paper_trades.db) — long and short signals account correctly. /trading calibration aggregates closed trades by confidence + signal and reports hit rate + mean return + a t-stat vs zero baseline; if 30+ closed trades show HIGH conviction not outperforming LOW, the agent's confidence label is noise and the diagnosis fires. /trading verify enforces hard risk rules (single-name 5% / sector 25% / total exposure 80% / stop 1-10% / earnings blackout 3 days → cap 2.5%) reading the live paper book — fixes the "LLM forgets its own rules" problem. The analyze prompt now auto-injects macro context (SPY/QQQ trend + VIX regime + 10y headwind, 30-min cached), earnings calendar warnings (🚨 if reporting within 7 days), and the current paper-book exposure so the LLM doesn't double-down on a sector already at 30%. /trading walkforward runs rolling out-of-sample chunks with a STABLE/MIXED/FRAGILE/INCONCLUSIVE verdict, replacing the dishonest aggregate backtest. /trading scan does a coarse heuristic sweep (RSI / 50d / 200d) over the watchlist before spending tokens on a real analyze. Stage 2 (the autonomous + alpha-research layer): /trading review runs a multi-agent debate on existing positions and emits structured ACTION ID=… DECISION=HOLD|ADD|TRIM|EXIT … rows for each. /trading manage start hundred 100 creates a virtual $100 portfolio backed by a SQLite-cleanly-namespaced PaperBroker; /trading manage step hundred runs one mean-variance rebalance cycle (scipy SLSQP, long-only, single-name + sector caps), /trading manage report hundred prints a markdown PnL report with equity curve — this is the canonical "I give the agent $100, check in a week" workflow. /trading optimize exposes the same MV solver standalone. The alt-data layer auto-injects three sources LLM analysis can actually add value on: SEC EDGAR Form 4 insider transactions (urllib, no API key, free), LLM-scored yfinance news headlines via the auxiliary cheap model (-10..+10 per headline aggregated to BULLISH/MIXED/BEARISH), and Google Trends search interest (soft-fails if pytrends not installed). The broker layer has a tiny BrokerBackend protocol with two backends — PaperBroker works out of the box, IBKRBroker is a stub with full setup docs (pip install ib_insync + IB Gateway config + connect()); the abstraction means switching from paper to live is one line when the user is ready. /trading ml train builds a LightGBM (or sklearn GradientBoostingClassifier fallback) classifier on closed paper trades — features are LLM signal one-hot + confidence ordinal + position size + stop / take profit + sector one-hot, label is "did this trade beat zero"; reports cross-validated AUC and feature importance, persists to ~/.cheetahclaws/trading/ml/stacker.pkl. The _CMD_META registry is also auto-populated from modular/-loaded commands now (closed a pre-existing bug where /trading, /video, /voice, /tts were callable but invisible to /help, tab-completion, and the system-prompt slash-command index — the LLM literally couldn't see its own subcommands). Tests: 46 new pytest cases across test_trading_pipeline.py and test_trading_advanced.py covering paper-trader CRUD, long/short PnL math, Phase-5 parser permissiveness, calibration aggregation, verifier 8-branch enforcement, macro/earnings/insider/sentiment/trends soft-fail behavior, MV optimizer constraints, broker buy/sell/avg-cost round-trip, IBKR stub setup-required diagnostic, end-to-end $100→step→status→report lifecycle with mocked quotes, ML feature engineering + train + predict, and the position-review prompt format. 939 passing (893 baseline + 46 new), zero regressions; golden system-prompt fixture regenerated. Also fixed a banner-rendering bug where the welcome box's right border was missing on every middle line (cheetahclaws.py now computes inner width from plain-text length and pads each row to close with │ regardless of model-name length). Honest disclaimer in the docs and PLUGIN.md: this is a research and discipline tool, not a money printer — public-data + LLM analysis does not have predictive edge over quant funds; the value is information aggregation, programmatic risk discipline, and empirical accountability. Run paper for ≥3 months with green calibration + walk-forward before considering an IBKR live account; small accounts (<$1k) have unfavorable fixed-cost economics in real life regardless of strategy.
May 7, 2026: /theme slash command — 15 console color presets + post-merge UX fixes (PRs #92, follow-up). Adds a curated palette system to ui/render.py and a new /theme command:
- /theme lists all 15 presets (default, dracula, nord, gruvbox, solarized, tokyo-night, catppuccin, matrix, synthwave, midnight, ocean, monokai, cheetah, mono, none); each row renders an info / ok / warn / err swatch in the row's own theme colors so the listing is a real palette preview, not 15 identical lines in the current theme.
- /theme <name> mutates the shared C ANSI dict in-place so every existing clr() / info() / ok() / warn() / err() call site (~25 files) switches palette without touching any call site, and persists the choice via save_config() so the next launch re-applies it (early in cheetahclaws.py:main(), before the first output).
- Per-theme color roles. Each palette declares 4 semantic colors — accent (info / cyan / blue), ok (success / green / diff +), warn (yellow / magenta), err (red / diff -) — plus a Rich code style. Picking 4 hexes per theme means info() and ok() are always visually distinguishable, and render_diff keeps semantic colors (green = add, red = remove) under every theme. The original PR collapsed cyan/green/blue to a single accent color, making info() and ok() indistinguishable and turning diff additions into the accent color (purple under dracula, yellow under gruvbox, magenta under synthwave) — the follow-up split them apart.
- CODE_THEME is now actually consumed. _make_renderable() in ui/render.py passes code_theme=CODE_THEME to rich.markdown.Markdown, so Rich code-block syntax highlighting tracks the active theme (the original PR set CODE_THEME but never plumbed it through — it was dead code).
- none theme is genuinely uncolored (clears every key in C, including reset, to "" so clr() returns plain text). mono is genuinely grayscale (4 distinct gray levels for accent/ok/warn/err — the original PR hardcoded C["red"] = "\033[38;5;196m" regardless of theme, breaking both).
- Tests: 9 new pytest cases (tests/test_theme.py) covering schema validation, unknown-theme rejection, info/ok distinguishability across all themes, diff-color distinguishability, none-as-plain-text, CODE_THEME tracking, apply_theme idempotency across state, and the Rich Markdown code_theme round-trip. 893 passing, zero regressions on the 884 pre-existing.
May 7, 2026 (v3.5.78): Research lab Phase A — unattended multi-day research; WeChat smart-reply + /draft; reliability + UX hardening. Three feature areas + a reliability pass.
- Phase A: research lab as a 24/7 workflow (research/lab/{resume,iterate,backlog,daemon}.py). The orchestrator in v3.5.77 was single-shot; this commit makes it iterative + queueable.
  - /lab resume <run_id> [<stage>] — rebuilds LabState from SQLite (reconstruct_state): RQs from the rq artifact, survey/outline/results/draft from their lab_artifacts rows, latest experiment code from experiment_code_v<N>, latest sandbox result from lab_experiments (synthesises a SandboxResult since the original tempdir is gone), skip_experiment from PI's "skip experiment" decision message. Optional <stage> rewinds: artifacts produced at or after the target stage are intentionally dropped from the in-memory state so the orchestrator regenerates them; earlier artifacts are kept. Old artifact versions remain in storage for audit.
  - /lab iterate <run_id> — 3-reviewer self-review (default lab_iterate_reviewers=3) reads the final report artifact, scores 4 dimensions on 1-10 (novelty, rigor, clarity, evidence). Aggregated per-dim mean → overall = mean of 4 dims. The lowest-scoring dim picks which stage to rewind to via DIMENSION_TO_STAGE: novelty→QUESTIONING, rigor→IMPLEMENTATION, clarity→DRAFTING, evidence→EXPERIMENT. Loops until score ≥ target_score (default 7.0), max_iterations (default 5), plateau (|delta| < 0.3 for 2 consecutive), or run budget. Every iteration audited in new lab_iterations SQLite table (target, score, breakdown, delta, revise_stage, status). Score parser is permissive (regex matches \d+(?:\.\d+)?) + clamped to [0,10] so 11/10 doesn't poison the average.
  - /lab backlog add <topic> [--iterate] [--target=N] [--max=N] [--prio=N] + list / remove / clear — new lab_backlog SQLite table (auto-incrementing id, priority desc + added_at asc ordering). claim_next_backlog() is atomic (SELECT...LIMIT 1 + UPDATE...status='running' in one txn) so two daemons against the same DB don't double-process. Parser rejects unknown tokens after the flag block (/lab backlog add "..." --max=5 start no longer silently appends start to the topic).
  - /lab daemon start / stop / status — singleton-protected single-worker loop (run_backlog_worker) that claims items, runs run_one_lab_session, optionally runs iterate_until_converged, marks done|failed. On startup the daemon calls reset_running_backlog() to unstick rows a previous crashed daemon left in running. Stop is cooperative — current stage finishes before exit.
  - /lab models — prints the resolved per-role model (PI / questioner / surveyor / designer / engineer / analyst / writer / reviewer × 3 / lay_reader = 11 roles) + which env-var drove the choice + ● for explicit overrides + warning when reviewers span <N model families (same-source rubber-stamping kills the meta-loop signal).
  - Human-readable output paths. ~/.cheetahclaws/research_papers/<run_id>/ was opaque — replaced with ~/.cheetahclaws/research_papers/<YYYY-MM-DD>_<HH-MM>_<topic-slug>_<run_id_short>/ (e.g. 2026-05-08_14-30_post-transformer-architectures-comparative-survey-2026_b16036de/). Slug is ASCII-alnum + hyphen, ≤60 chars cut at word boundary; CJK-only topics fall back to untitled (run_id-suffix still keeps it unique). /lab migrate-paths [--apply] is idempotent, dry-run by default, never overwrites existing targets, lists unknown legacy dirs separately.
- WeChat smart-reply panel (bridges/wechat_smart_reply.py, ..._store.py). Inbound from a whitelisted contact triggers the auxiliary cheap model to draft 3 candidate replies → push panel to filehelper (文件传输助手) with a 2-letter ID like [AA]. User replies with 1/2/3 to send, freeform text to customise, x to skip, q to list pending panels, AA 1 to address a specific panel. Confirmed sends append to wx_reply_history and feed style mimicking on subsequent panels. SQLite at ~/.cheetahclaws/wx_smart_reply.db (auto-fallback to in-memory on init failure); contacts JSON at ~/.cheetahclaws/wx_contacts.json (mtime-hot-reloaded; missing file = empty store). 6 new config keys (wechat_smart_reply, _whitelist, _groups, _groups_at_only, _timeout_s, wechat_self_uid). Architectural fix: bot owner's own uid is auto-recorded on first non-filehelper, non-group inbound, and is_smart_reply_target() always returns False for that uid — so your own messages reach the agent even if you accidentally put yourself in the whitelist (which the iLink ClawBot architecture makes easy to do).
- /draft <message> slash command (commands/advanced.py, bridges/draft_cache.py). Semi-automatic reply path for the iLink-ClawBot architecture where the bot is a separate account (so the bot can't see your main-account inbound). Auxiliary model drafts 3 candidates; optionally tone-conditioned via @<uid_or_label> against wx_contacts.json. When invoked from a bridge channel, candidates are echoed back to the originating WeChat / Telegram / Slack uid + stashed in bridges.draft_cache (per-uid, 10-min TTL, one-shot). The bridge inbound handler (in bridges/wechat.py) checks digit-only replies against the cache before the smart-reply path, so 1/2/3 after a /draft returns just the chosen line — no agent invocation, no smart-reply panel triggered.
- Reliability + UX hardening.
  - research/http.py 429-aware backoff: separate schedules for 5xx/timeout (0.5/1/2/4s) vs 429 (10/30/60/120s); honours Retry-After headers (seconds or HTTP-date form), capped at 180s. Default retry budget bumped 2 → 4 (academic APIs hit 429 routinely on busy queries). _parse_retry_after + _backoff_seconds helpers covered by 8 new pytest cases.
  - Surveyor grounding (research/lab/orchestrator.py:_stage_survey). Before invoking the surveyor LLM we now run research.aggregator.research() on topic + selected_RQ (academic + tech buckets, top 30 hits, no model-synthesis). Top hits formatted as [N] (source) Title / URL / snippet blocks (≤8KB), passed as context, prompt instructs surveyor to cite from this list rather than memory. Search hits persisted as survey_search_hits artifact for reviewer-replay determinism. On any aggregator failure (no Tavily/Brave/etc. key, all sources 429, network down) surveyor logs a diagnostic note ([grounding skipped] aggregator returned 0 results, per-source: arxiv: 429, tavily: KEY_MISSING, ...) and falls back to the original prompt-only path. Reduces fabricated citations significantly on tested topics; verifier still catches the rest.
  - _dedupe_self_repeat in _invoke() trims trailing self-repetition emitted by cheap / quantised models (text == text+text exact-halves match, or first-200-chars recur in back half with ≥80% normalised match). Sanity floor: never trim below 30% of original length. Why this matters: gpt-5-nano on the lab baseline produced PI rationale messages and RQ lists that appeared twice concatenated; without dedup these doubled inputs went into every downstream prompt, eating context and confusing the surveyor / writer. _extract_numbered similarly dedupes by content (first 80 chars whitespace-collapsed lower-case) so a 1..5\n1..5 re-emission keeps 5 unique items, not 10.
  - Verifier hard timeout (research/lab/verifier.py:verify_citations). Per-citation hard wall-clock cap (default 30s) enforced via concurrent.futures.ThreadPoolExecutor + future.result(timeout) so a hung urlopen() is interrupted at the Python level — socket-level timeout alone doesn't fire on slow-loris servers (we observed an 11-minute hang on arxiv in the field). Fresh single-worker pool per citation + shutdown(wait=False) so a hung worker doesn't queue-block subsequent citations (it leaks as a daemon thread, dies with the process). Stage-level cap (default 5 min) — citations not processed when budget exhausts get marked verification_skipped so finalization still produces a report. progress_cb(i, n, status) wired to a verifier message in the run log so /lab logs <run_id> shows [3/12] verified, [5/12] hard timeout etc.
  - REPL ergonomics. /lab daemon start + /lab start print the eventual report.md path up front (no more "where did my report go?" friction). Stage transitions stream live to the terminal as the orchestrator runs (↳ /lab daemon ► [run_id] survey). /lab status <run_id> shows both new-format and legacy lab_xxx/ paths so users can find old reports without manual digging. /config parses JSON-style values (lists, dicts, signed numbers, quoted strings) — /config wechat_smart_reply_whitelist=["wxid_..."] is no longer silently saved as a string. Leading whitespace before / is stripped at the REPL loop so « /lab daemon start» (paste with a stray space) hits the slash dispatcher instead of being routed to the agent — saves the user from a confusing failure on local cheap models that hallucinate tool-call syntax as text when asked to "run /lab daemon start".
- Tests: 884 passing in 95 seconds (842 unit/integration + 22 e2e), zero regressions on the prior 669-test baseline. ~80 new pytest cases covering: iteration scoring (parser permissiveness + clamp + dim averaging + weakest-dim routing), state reconstruction (full + rewind, all 9 stages), backlog CRUD + atomic claim + reset-running, daemon singleton semantics, verifier per-citation + stage-level timeouts, slug edge cases (Chinese, max-len, word boundary), _dedupe_self_repeat (exact halves, prefix recurrence, sanity floor, no-op clean text), _extract_numbered dedupe, self-uid bypass for smart-reply, draft cache one-shot + TTL, Retry-After parsing (seconds + HTTP-date + None), backlog parser strict mode.
May 7, 2026 (v3.5.77): MCP HTTP/SSE transport + OAuth 2.0 PKCE, .env loader, ANTHROPIC_ENDPOINT corporate-proxy override, AskUserQuestion UI polish (#88, #89). Three loosely related improvements landed together:
- MCP HTTP / Streamable-HTTP / SSE transport (mcp_client/client.py). HttpTransport now handles three response shapes: plain JSON, Streamable-HTTP (POST returns an SSE stream — read first data: line), and bidirectional SSE with a session endpoint. The default Accept header is application/json, text/event-stream because servers like sap-jira return 406 when only one is advertised. On a 401 from the resource URL, an OAuthSession is initialised lazily, the access token is injected as Authorization: Bearer <token>, and the httpx.Client is rebuilt under a dedicated _oauth_lock so two concurrent 401-retries can't race on close+create. Server-name sanitization on MCPManager keys lets hyphenated names like github-tools resolve correctly through the qualified mcp__server__tool path; add_server, connect_server, and reload_server all sanitize the lookup key the same way the parser does.
- OAuth 2.0 PKCE flow (mcp_client/oauth.py). Full MCP Authorization spec: RFC 9728 resource-server discovery (tries /.well-known/oauth-protected-resource/<path> then the bare path), RFC 8414 AS metadata, RFC 7591 dynamic client registration when no client_id is configured, Authorization Code + PKCE (S256) with a local 127.0.0.1 callback HTTP server, refresh-token rotation, and atomic token persistence to ~/.cheetahclaws/mcp_oauth.json written via a .tmp swap then os.replace, with the file at mode 0600 and the parent directory at 0700. The redirect-URI port is picked once and reused for both registration and the callback (otherwise strict OAuth servers reject redirect_uri mismatch). Scope is sourced from the AS's advertised scopes_supported — preferring mcp if listed, otherwise the first one, otherwise the scope parameter is omitted entirely so servers without an mcp scope no longer reject with invalid_scope. state mismatch and error query params surface as runtime errors; the callback browser tab confirms auth completion.
- REPL surface (commands/advanced.py, cheetahclaws.py). New /mcp add <name> --transport http <url> and --transport sse <url> for one-line HTTP/SSE registration; explicit /mcp list subcommand; tool descriptions in /mcp output now wrap at 72 cols via textwrap.wrap instead of a hard 60-char slice. /help advertises the new subcommands.
- .env + ANTHROPIC_ENDPOINT (cheetahclaws.py, config.py, providers.py, commands/core.py). _load_env() parses <repo>/.env at the very top of cheetahclaws.py — before any other import reads os.environ — supporting both K=V and K="quoted V", ignoring # comments, and using os.environ.setdefault so existing shell vars always win. config.py reads ANTHROPIC_ENDPOINT from os.environ and unconditionally writes it to cfg["anthropic_endpoint"] (env var beats persisted JSON), defaulting to https://api.anthropic.com when neither is set. providers.py passes base_url=cfg["anthropic_endpoint"] to anthropic.Anthropic; the /doctor and onboarding probes hit f"{_ant_base}/v1/messages" via the same value. Net effect: a corporate proxy can replace api.anthropic.com cleanly across streaming, health checks, and onboarding without touching ~/.cheetahclaws/config.json. MCP HTTP headers values now also pass through os.path.expandvars, so "Authorization": "Bearer $GITHUB_TOKEN" works after the .env loader has populated os.environ.
- AskUserQuestion UI polish (agent.py, ui/render.py, tools/interaction.py, cheetahclaws.py). AskUserQuestion is now in the auto-approve set alongside EnterPlanMode/ExitPlanMode — it's an interactive tool by definition, the permission gate was redundant. print_tool_start and print_tool_end early-return for AskUserQuestion so the spinner and → N lines (M chars) summary don't appear; _tool_desc adds a short preview of the first question. The question itself is rendered through clr() with Markdown stripped (**bold**, `code`, *italic* removed in that order so ***x*** collapses correctly), option indices are cyan, descriptions dim. The REPL prompt now prints a full-width ─ rule via os.get_terminal_size() (80-char fallback) before each input, matching Claude Code's visual rhythm.
May 5, 2026: Telegram bridge — file round-trip + clickable permission prompts (fixes #84). Two missing code paths in bridges/telegram.py produced both halves of the issue: (1) the bridge only had _tg_send (text via sendMessage), so --accept-all made no difference — when the model claimed it had "sent a file" it was just text, and there was no multipart sendDocument helper, no inbound document handler, and no way for the agent to emit a file; (2) permission prompts arrived as text containing [y/N/a(ccept-all)] that looked clickable but weren't, because the poll loop only listened for message updates and there was no inline-keyboard rendering path. Patch:
- _tg_send_document(token, chat_id, file_path, caption=None) — multipart/form-data upload assembled by hand because urllib's JSON-only path can't carry binary bodies. 49 MB ceiling (Telegram's hard limit is 50 MB; the headroom catches encoding overhead). Six explicit failure modes, each surfaces a specific error to the chat: missing file, stat failure, empty file, oversize, network exception, API ok: false (description forwarded verbatim).
- Inbound document handler in _tg_poll_loop — downloads via getFile, sanitizes filename to [A-Za-z0-9._-]_ to keep the save path safe, writes to /workspace if mounted (Docker scenario) or tempfile.gettempdir() otherwise, echoes the saved path back to chat, and submits a path-aware prompt to the agent ("I just uploaded a file at <path>. Please review it." — overridden by caption if present).
- !sendfile <absolute_path> — explicit user-driven send, runs in a daemon thread so the poll loop doesn't block on uploads. Strips backticks/quotes around the path.
- Auto-send on Write — _bg_runner._on_tool_start records the in-flight file_path for Write calls; _on_tool_end mails it (FIFO-paired so parallel writes match correctly). Skipped when the result starts with Error: or Denied:. De-duplicated per turn via a _sent_files: set[str] so the agent retrying the same path doesn't double-mail.
- Permission UX across every channel ([approve][reject] is now actually pickable everywhere). Issue #84 also flagged that permission prompts looked like buttons but weren't; fixed in the same patch and extended cross-bridge so the experience is consistent regardless of where the user is. ask_input_interactive(prompt, config, options=[(label, value), …]) is the new contract; ask_permission_interactive passes [("✅ Approve", "y"), ("❌ Reject", "n"), ("✅✅ Accept all", "a")] and every channel renders an interactive picker:
  - Telegram — real inline_keyboard buttons. callback_data is cc:<prompt_id>:<value> where prompt_id is a fresh 8-char id; _tg_poll_loop's allowed_updates widened to ["message", "callback_query"] and the new _handle_callback_query(token, chat_id, cb, session_ctx) performs auth check (chat_id match), answerCallbackQuery to clear the click spinner, prompt-id validation (stale clicks on older prompts silently dropped so two rapid permission prompts cannot bleed into each other), editMessageText appending ✓ Selected: <value> for visible scroll-back, and finally fires tg_input_event. Markdown failure on the prompt body falls back to a no-parse_mode keyboard send; total failure falls back to plain text — and the menu block embedded in the prompt body keeps that path usable.
  - Slack / WeChat — numbered menu rendered into the message body (Slack header ❓ Input Required, WeChat header ❓ 需要输入). The message reads [1] ✅ Approve (reply 1 or y) etc.; the user replies with the digit, the canonical letter, or any label word (approve / reject / accept / all). All three reply forms normalize to the canonical value before the caller sees them.
  - Terminal — same numbered menu printed above the input cursor, same digit / letter / label-word reply normalization.
  - Web (chat API) — untouched; the existing browser approval UI handles this. The cross-bridge wiring lives in three pure helpers in tools/interaction.py: _format_menu_block(options) (numbered text rendering), _build_value_map(options) (digit + canonical-value + label-word lookup table, first-write-wins on collisions), and _resolve_choice(raw, value_map) (whitespace-trimmed, case-insensitive lookup; pass-through for unknown replies so free-text fallback still works). Backward-compatible: every existing ask_input_interactive caller (and there are many — /checkpoint, /session, /agent, /config, /voice) passes no options= and gets exactly the same free-text behavior as before. New RuntimeContext fields: tg_callback_prompt_id: str and tg_callback_message_id: int.
- Tests — 49 new pytest cases. tests/test_telegram_bridge.py (27): urllib.request.urlopen and _tg_api mocked; threading.Thread monkeypatched to a synchronous stub for the auto-send hook; an end-to-end test drives ask_input_interactive(options=…) from a worker thread, simulates a click via _handle_callback_query, and asserts the worker returns the clicked value. Coverage: text splitting + Markdown fallback, multipart body assertions (chat_id, UTF-8 caption, filename, raw bytes), all six _tg_send_document failure paths, four _bg_runner Write variants, four _tg_send_keyboard paths, five _handle_callback_query paths, two end-to-end click variants. tests/test_options_menu.py (22): the three pure helpers (rendering / value map / resolution; including emoji-stripped label tokens, case-insensitivity, whitespace, non-string defensive paths, first-write-wins on collisions), plus per-bridge worker-thread end-to-end for Slack (4 reply forms), WeChat (2 reply forms), terminal (digit / label / canonical / no-options regression). Full suite: 718 passed in 43s, no regressions on the 669 pre-existing tests.
May 3, 2026: Research Lab — autonomous multi-agent paper writing with sandboxed experiments + web UI. New /lab slash command (CLI) and /lab page (web) drive 9 specialised agents through 9 stages — questioning, literature survey, outline, code drafting, sandboxed Python execution, analysis, paper drafting with reviewer iteration, citation verification, finalisation — until convergence or budget exhaustion. Output is a Markdown report with verified citations + BibTeX bundle + (when the topic admits experiments) the engineer's runnable script and any plots produced. Targets arXiv-grade preprint quality, not 顶会; honest about the LLM-substrate ceiling. Branch: feature/research-lab (PR pending).
- 9 agents, deliberately heterogeneous models. PI / Questioner / Surveyor / Designer / Engineer / Analyst / Writer / Reviewer × 3 / Lay Reader. The reviewer pool defaults to 3 different provider families (Claude / GPT / Gemini, etc.) when API keys are available, to reduce the same-source rubber-stamping that plagues homogeneous multi-agent debate. Per-role model overrides via lab_role_override.
- 9-stage state machine: QUESTIONING → SURVEY → OUTLINE → IMPLEMENTATION → EXPERIMENT → ANALYSIS → DRAFTING → VERIFICATION → FINALIZATION. Each producer-stage is followed by reviewer-author iteration with a 2/3-reviewers-pass quorum (default), max 5 rounds force-advance, and a "0/3 for 3 rounds → redesign" early bail. PI breaks ties.
- Real experiments via subprocess sandbox. research/lab/sandbox.py runs the Engineer's Python script with a 180 s timeout, 4-min CPU rlimit, 2 GB AS rlimit, dedicated workspace cwd, and MPLBACKEND=Agg so matplotlib plots without a display. On non-zero exit the Engineer is fed the stderr and revises (max 3 attempts). The Analyst then parses RESULT: {...} JSON lines from stdout into a Results section so the Writer doesn't get to invent numbers. v0 isolation only — not a hostile-code boundary; Docker is Phase 2.5.
- Citation verifier — three APIs, four states. Each citation in the final draft is checked against arXiv → Semantic Scholar → CrossRef in priority order. Jaccard title similarity ≥ 0.55 + last-name set overlap ≥ 0.5 to count as verified. The four-state outcome (verified | ambiguous | not_found | verification_skipped) explicitly distinguishes "we found this isn't real" from "we couldn't reach the network"; the latter never gets recorded as a fabrication signal.
- SQLite persistence at ~/.cheetahclaws/research_lab.db (separate file from the daemon's sessions.db so neither interferes with the other). Six tables: lab_runs, lab_stages, lab_messages, lab_artifacts, lab_budget, lab_experiments. State survives a cheetahclaws restart in principle; auto-resume is Phase 2.5.
- Web UI at /lab. Single vanilla-JS page (no build step, no React) that talks to /api/lab/* JSON endpoints; auto-polls every 5 s while a run is open; renders the final report inline with a mini Markdown renderer; auto dark/light mode. The HTTP layer (web/lab_api.py) slots into the existing stdlib HTTP server with one dispatcher branch — no FastAPI/Flask dep.
- Realistic positioning, stated explicitly. Sakana AI Scientist, Stanford Agent Lab, and similar prior work all hit a ceiling near rejection-line ICLR; this lab inherits that ceiling. The product target is arXiv-grade preprint, not 顶会, and the docs say so up front. Dominant residual failure mode is fabricated citations passing title-match but with subtle author/year errors — the verifier catches most but human review of references is non-optional.
- What's deliberately not in v0 (tracked in docs/guides/research-lab.md): multi-tenant isolation, GPU pool, Docker sandbox, LaTeX rendering, reference-manager integration, plagiarism check, real-time SSE updates, billing, /lab resume.
- Tests: 54 cases for the lab (storage, convergence, verifier, sandbox, orchestrator-with-stubbed-LLM end-to-end, web routes), full suite 701 passing on feature/research-lab. Pricing — single run typically $2-5 (survey-style, no experiments) to $5-15 (with sklearn-scale experiments) using Claude Sonnet + GPT-4o + Gemini Pro mixed.
May 2, 2026: Daemon foundation lands — cheetahclaws serve is real. F-1 of the 9-PR roadmap merged via PR #80, on top of a re-landed spike (PR #81) that the RFC 0001 contract code lives in. End users see no new feature yet — F-1 ships the headless daemon plus its cheetahclaws daemon {status, stop, logs, rotate-token} control surface, but no service runs inside it; that's F-2..F-8.
- Recap of how the spike landed. PR #77 (the spike) merged then immediately reverted (3183fc6) to avoid pre-empting @mxh1999's foundation design. Once @mxh1999 opened PR #80 explicitly built on top of the spike, the revert was undone via PR #81 (Re-land daemon spike for #80 (un-revert 3183fc6)) so the F-1 PR could merge cleanly without a delete-vs-modify conflict. End state on main: spike + foundation + verified.
- What's runnable now. cheetahclaws serve --listen tcp://127.0.0.1:8765 --print-token boots the daemon; cheetahclaws daemon status reports pid / transport / address / uptime / system.ping outcome from the discovery file at ~/.cheetahclaws/daemon.json; cheetahclaws daemon stop calls system.shutdown over RPC and falls back to SIGTERM / TerminateProcess; cheetahclaws daemon logs [-n N] tails ~/.cheetahclaws/logs/daemon.log; cheetahclaws daemon rotate-token regenerates the TCP bearer token. The legacy cheetahclaws spike-daemon ... from the spike-notes is preserved as a backward-compat alias.
- Behavior change worth flagging. /healthz, /readyz, /metrics are now auth-gated by default per RFC 0001 §3 — the spike returned them unauthenticated as a stub. Prometheus / external scrapers opt out via cheetahclaws serve --unauthenticated-metrics (off by default; documented as a deliberate weakening with a one-line warning at startup).
- Polish nits surfaced during smoke and fixed in a follow-up. (1) daemon.json now optionally records a token_path field when serve --token-path overrides the default, so cheetahclaws daemon status / stop / rotate-token find the token the daemon is actually using instead of failing 401 against the default location. (2) python -m daemon.cli --help (and the cheetahclaws spike-daemon --help alias) now print a usage banner and exit 0 instead of unknown subcommand: --help / exit 2; unknown subcommands also include the banner so users see how to recover. (3) The serve-mode startup prints (token: …, cheetahclaws daemon listening on …, audit log: …) now flush=True so they appear immediately when stdout is redirected to a file under & — previously they sat in Python's 4KB block buffer until the daemon exited. (4) tests/e2e_daemon_skeleton.py token-length floor raised from 32 to 40 so an accidental shrink to 16 raw bytes (~22 chars) would break loudly.
- Tests: 669 passing on main (637 unit + 22 daemon-only e2e + 10 polish-fix unit tests). pytest tests/ -q.
- What's NOT in F-1, intentional. No agent.run integration (session.send exposes only the demo echo.ping from the spike + the contract system.ping from the foundation). No bridges in the daemon (Telegram/Slack/WeChat are F-6/7/8). No SQLite event store (in-memory ring from the spike survives until F-2). No cost guardrails (F-9). No subprocess-per-agent runner (F-4). macOS peer-cred still left as TODO(macos) in daemon/auth.py.
Apr 30, 2026: [spike] daemon foundation reference scaffolding — validates RFC #74 end-to-end.
- What landed. A new daemon/ package (~1.1k LoC across 9 files, plus 360 lines of pytest) that implements the contract surface defined in docs/RFC/0001-daemon-design-note.md: ThreadedTCPServer / ThreadedUnixServer, JSON-RPC 2.0 dispatcher on POST /rpc, SSE on GET /events?since=<id>, Linux SO_PEERCRED peer-cred + bearer-token auth, audit log, client_id mint/persist/resume, originator-only permission answer, 30 min interactive timeout with permission.refresh_timeout. Three demo methods (echo.ping, permission.demo, permission.answer) prove the model without dragging agent.run integration into the spike scope.
- Why a spike before the foundation PR. RFC #74 was merged with 9 must-fix follow-ups from review (threading model, SSE heartbeat, client_id lifecycle, session.send semantics, API versioning, event retention, audit-log default-flip, interactive timeout, macOS peer-cred). Rather than re-litigate them in a doc PR, the spike puts every "✓" item from the review matrix into runnable code; mxh1999's foundation PR can then rebuild on the contract or replace the throwaway parts. Coverage matrix in docs/RFC/0001-spike-notes.md.
- What it deliberately is not. No agent.run wiring, no bridge migration, no SQLite event store, no cost guardrails, no agent-runner subprocess, no metrics. macOS SO_PEERCRED is punted with a TODO(macos) — the spike runs Linux-only.
- Surprises worth flagging for the foundation PR (full list in spike notes): stdlib HTTPServer defaults request_queue_size=5 — long-lived SSE connections cause new TCP connects to wait on a full-second SYN retransmit; bumped to 256. BaseHTTPRequestHandler defaults to HTTP/1.0, so curl --no-buffer won't print SSE bytes until the connection closes; EventSource and http.client (used by tests + daemon.spike_client) are unaffected. SO_PEERCRED ucred struct format is iII (signed pid, unsigned uid/gid), not the older docs' 3i. Originator persistence is whole-file rewrites for the spike — foundation PR should swap for the SQLite originator schema.
- How to try it. cheetahclaws spike-daemon serve --listen tcp://127.0.0.1:8765 --print-token, then python -m daemon.spike_client --target tcp://127.0.0.1:8765 ping (or watch for SSE tail, or request / answer for the originator-routing demo). Token can be passed via $CHEETAHCLAWS_TOKEN to avoid argparse's --token <value> quirk on tokens starting with -.
- Tests: 13 cases covering #1, 2, 3, 4, 6, 7, 8, 9 from the review matrix; all green. pytest tests/test_daemon_spike.py -v. Branch: feature/daemon-spike (draft PR).
Apr 30, 2026: Docker / home-server support, terminal AskUserQuestion deadlock fix, Ollama tool-call payload fix.
- Docker (#73) — new Dockerfile, docker-compose.yml, .env.example, .dockerignore at the repo root, plus a full walkthrough at docs/guides/docker.md. Targets the home-server / DGX-Spark scenario: web UI + Telegram bridge running together in one container, talking to an Ollama instance on the host via host.docker.internal:11434, with ./workspace bind-mounted so files can be shared over Samba to your phone or other PCs. Container runs as a non-root cheetah user; UID/GID inherit from the host (${UID:-1000}:${GID:-1000}) so files in ./workspace stay owned by you. tini for clean PID-1 signal handling, healthcheck on /api/config, EXPOSE 8080. README's "Web UI" section gets a Docker subsection; "Documentation" table gets a new row.
- --web auto-starts bridges — previously --web only spun up the HTTP server and sys.exit-ed, skipping the ~/.cheetahclaws/config.json Telegram / WeChat / Slack auto-start block in the REPL bootstrap. New helper _start_headless_bridges(config) creates a shared AgentState, wires session_ctx.run_query to a minimal headless driver around agent.run(), then starts every configured bridge as a daemon thread in the same process. Docker users get browser UI + phone bridge from a single command; non-Docker --web users on a remote box get the same. No new flag — same --web, just complete.
- AskUserQuestion deadlock fix (#69) — the previous queue + threading.Event design assumed a separate consumer thread would drain _pending_questions and event.set() the agent thread, but the consumer (drain_pending_questions) ran after run() returned. Since run() blocked inside _ask_user_question's event.wait(timeout=300), the drain never reached, the terminal froze for 300 seconds, and Ctrl-C was swallowed by Event.wait. Bridges (Telegram / WeChat / Slack / Web) only worked because their listener threads called event.set() externally; the terminal had no equivalent. Fix: _ask_user_question is now synchronous — prints the prompt and reads input directly via ask_input_interactive, which already routes correctly for terminal and every bridge. Removed _pending_questions, _ask_lock, and drain_pending_questions; removed the post-turn drain in cheetahclaws.py. Tests rewritten to mock builtins.input over the new sync path.
- Ollama tool-call payload fix (#71) — for assistant turns that carry only tool_calls and no visible text, messages_to_openai emitted content: null (m.get("content") or None). OpenAI accepts that, but Ollama's OpenAI-compat endpoint rejects it with HTTP 400 invalid message content type: <nil>. Switched to or "" — empty string is accepted by every OpenAI-compat backend we target. The same 400 used to fall into ErrorCategory.UNKNOWN, which is retryable, so the same broken payload was retried 3× and burned the circuit-breaker budget → 120 s cooldown blocking the entire session even though the request body, not the network, was the problem. New INVALID_REQUEST category matches BadRequest / 400 / invalid.?message.?content / malformed.?request and is non-retryable; urllib.error.HTTPError with code=400 maps to it explicitly; and the hint surfaces a pointer to issue #71 plus a /clear suggestion when the error string contains invalid message content type.
- Tests: 589 passing. Includes rewritten TestAskUserQuestion (free-text answer, option selection by number, 0 → freetext fallback) and quick verification that _start_headless_bridges is a no-op without bridge config and starts a Telegram thread when telegram_token + telegram_chat_id are present.
Apr 24, 2026: Multi-model prompt adaptation — single shared default.md baseline + tiny per-family overlays. Routing by model family, not provider/runtime. DeepSeek v4 thinking-mode protocol.
- Single base + small overlay design. prompts/base/default.md is the shared baseline for every model — general prompt-engineering guidance (be concise, parallel tool calls, minimal scope, stop conditions, safe-vs-unsafe action list, etc.) applies to all families. Family-specific quirks live in prompts/overlays/<family>.md and are appended only when the model has an authoritative, vendor-documented quirk.
- Three overlays ship today (each cites the vendor guide it's based on):
  - claude.md — XML-tag preference for structured output (Anthropic prompt engineering guide).
  - gemini.md — explicit "Agentic Mode (Active)" framing + 4-step explore→verify→act→report loop (Gemini 3 prompting guide).
  - openai-reasoning.md — only matches o1 / o3 / o4 / gpt-5-codex; suppresses "Let me think step by step…" narration since reasoning is internal (OpenAI reasoning best practices).
- Routing by model family, not by provider/runtime. A Qwen-3 model gets the same prompt whether served by Alibaba DashScope, Ollama, vLLM, or OpenRouter. pick_base_prompt(provider, model_id) matches on the last path segment of the model id, case-insensitive. Tested by test_runtime_is_irrelevant_for_family_routing.
- Overlay-admission policy. Every overlay must (a) cite a vendor prompting guide URL in a top-of-file  comment (enforced by test_overlay_cites_source), (b) not duplicate anything already in default.md, (c) stay ≤ 20 lines (enforced by test_overlay_under_line_cap). The unified default.md itself caps at 150 lines (enforced by test_base_prompt_under_line_cap).
- DeepSeek v4 thinking-mode protocol. Streams delta.reasoning_content as ThinkingChunk; round-trips reasoning_content through AssistantTurn → neutral history → messages_to_openai so v4's spec is satisfied when an assistant turn carries tool_calls. config["thinking"] is tri-state — None (default, server-default ON), True (explicit ON), False (explicit OFF, injects extra_body={"thinking":{"type":"disabled"}}). Bumps DeepSeek context window to 128K and registers deepseek-v4-pro / deepseek-v4-flash.
- Tests: 73 prompt-related cases, 578 unit tests total, all green. New regression guards: test_dead_family_base_files_are_gone (no per-family base files), test_overlay_cites_source (every overlay grounded in vendor docs), test_env_block_separates_platform_from_git_info (locks the Platform: Linux- Git branch: whitespace fix).
- Architecture refactor lineage. Builds on PR #63 (which split SYSTEM_PROMPT_TEMPLATE into per-family files) and consolidates back to single base + overlays after benchmarking showed the per-family duplication was net negative for non-flagship models. See prompts/README.md for full design rationale.
Apr 20, 2026 (v3.5.76): Research pipeline — 20 sources, time-range filter, cross-platform heat table, citations analysis, saved reports, Chinese platforms (B站 · 微博 · 小红书 · 知乎), /monitor trend-tracking, one-click /ssj wizard, entity extraction, multi-query expansion, side-by-side compare
- /research <topic> — fans out to 20 sources in parallel: arXiv · Semantic Scholar · OpenAlex · HuggingFace Papers · alphaXiv · Google Scholar · HackerNews · GitHub · Reddit · StackOverflow · Google News · Polymarket · SEC EDGAR · Tavily · Brave · Twitter/X · 知乎 · B站 · 微博 · 小红书. 13 sources work zero-config; 7 optional (need keys or cookies).
- Engagement-weighted ranking — each source's native signal (HN points, GitHub stars, Reddit upvotes, citations, HF upvotes, B站播放, 微博赞, 小红书赞, Twitter likes, Polymarket USD volume) is log-normalized against a per-source calibration to a shared 0-1 scale. Blended with a 14-day-half-life recency bonus. Cross-source dedup by URL keeps the highest-engagement entry on duplicates.
- Time range filter — --range 1d|3d|7d|14d|30d|60d|90d|6m|1y|2y|5y|all (or natural 30days, 6months, 2years) and explicit --since YYYY-MM-DD --until YYYY-MM-DD. Each source translates the window to its native filter: arXiv submittedDate:[...], Semantic Scholar year=LO-HI, OpenAlex from_publication_date:..., HN numericFilters=created_at_i>..., GitHub pushed:>..., Reddit t=hour|day|week|month|year|all, StackOverflow fromdate=/todate=, Google News after:/before:, SEC EDGAR dateRange=custom, Tavily start_published_date, Brave freshness=pd|pw|pm|py, Twitter v2 start_time/end_time, Google Scholar client-side year filter, HuggingFace / Bilibili / Weibo client-side. Polymarket and Zhihu have no date filter API and are documented as exceptions.
- Cross-platform attention table — every brief renders a Markdown table: per-platform result count · top engagement label · median result age · domain. Skipped/failed sources appear too with clear reasons. The LLM synthesis prompt copies this table verbatim and adds 2-3 sentences comparing attention distribution (academic-heavy vs. social-heavy vs. news-heavy).
- Publication trend sparkline + 12-month bar chart — a compact Unicode sparkline (▁▂▃▄▅▆▇█) across the last 24 months in the brief header; a full per-month bar chart lower down. Built from ALL dated results across academic/news/social sources, giving a single-glance view of where the buzz has moved.
- Notable-citer analysis (--citations) — secondary Semantic Scholar calls on top academic results, pulling citing-paper authors and filtering to those with ≥10k total citations (configurable via --citation-threshold). Surfaces a table with name · affiliation · total cites · h-index · which papers they cited. Adds 2-10 API calls per run; recommended to pair with SEMANTIC_SCHOLAR_API_KEY to escape the anonymous 100-req/5-min limit.
- Entity extraction — offline, zero-LLM pattern-matching that scans every pulled result for frequent named entities across four categories: models (GPT-5, Claude-Opus-5, Llama-4, Gemini-2.5-Pro, GLM-5.1, Qwen-3, DeepSeek-V3, Grok, Mistral, Phi, Yi, Kimi, …), benchmarks (MMLU, MMLU-Pro, GSM8K, MATH, HumanEval, HumanEval+, SWE-bench, LiveCodeBench, MMMU, MathVista, GAIA, AgentBench, WebArena, Arena-Hard, FrontierMath, ARC-AGI, GPQA-Diamond, HLE, C-Eval, CMMLU, RULER, LongBench, …), orgs (OpenAI, Anthropic, Google DeepMind, Meta, xAI, Mistral AI, DeepSeek, Moonshot, Alibaba, Zhipu, Tencent, ByteDance, Hugging Face, NVIDIA, 01.AI, AI2, Mila, Stanford, MIT, Berkeley, CMU, Tsinghua, …), and people (from academic result author fields). Counts dedupe within a single result so one spammy abstract doesn't skew the ranking. Renders as a "Top mentioned entities" section directly beneath the heat table — one glance answers "what's everyone talking about?" without the LLM round-trip.
- Multi-query expansion (--expand or --expand N) — asks the active model to propose 2-6 sibling subqueries (different angles — theory vs. tooling vs. industry deployment vs. controversy — not paraphrases), then runs each in parallel across all sources with proportionally reduced per-source limits. Results merge into the main pipeline (dedup + rank + synth). Example: /research --expand "frontier LLM benchmarks" auto-expands to LLM evaluation methodology, benchmark saturation and contamination, capability measurement frontier models, human preference benchmarks evaluation. Coverage jumps several-fold for broad topics.
- Side-by-side compare — /research compare "topic A" vs "topic B" [vs "topic C"] runs 2 or 3 independent research queries in parallel and produces a unified comparative brief: verdict at a glance · side-by-side heat tables · shared themes · unique strengths per topic · open questions. Citations use prefixed [A-N] / [B-N] / [C-N] markers so readers can trace every claim back to the right topic's evidence pool. Falls back to a deterministic no-LLM rendering with all three heat tables + entity tables when no model is configured.
- Auto-save to ~/.cheetahclaws/research_reports/ — every /research and /research compare run writes two files: <YYYY-MM-DD_HHMMSS>-<slug>.md (rendered brief) + .json sidecar (serialized Brief + notable citers + entities). Opt out with --no-save. Explicit export via --save-as PATH. New /reports command: list (50 most recent) · open <id> (print) · path <id> (print file path) · delete <id>.
- Weekly trend tracking via /monitor — new topic prefix research:<query> (or research:<range>:<query> — e.g. research:30d:RLHF) dispatches to the full 20-source pipeline each scheduled run. Supports daily/weekly/12h/... schedules and --telegram/--slack/console channels. Each invocation: pulls all 20 sources · filters by the subscription's time window · renders the cross-platform heat table + sparkline as the first digest item · writes a full report · pushes to configured channels. Subscribe via /subscribe research:<topic> weekly or the /monitor wizard's new "Trend tracker" option.
- /ssj wizard integration — 3 new menu items for zero-flag operation:
  - 16. 🔍 Research — asks topic → time range (1-5) → citations y/N → runs /research with right flags
  - 17. 📊 Trend Track — asks topic → tracking window → frequency → creates the /subscribe research:<range>:<topic> subscription
  - 18. 📁 Reports — opens /reports browser
- Chinese platform sources (4 of them):
  - Bilibili (B站) — zero-config search-all endpoint; returns video + article results with 播放/点赞/弹幕/评论 engagement. [video · 11:55] 彻底搞懂 Transformer · 54,209 播放 · 2,430 赞 · 78 弹幕.
  - 知乎 Zhihu — v4 search_v3 API, requires ZHIHU_COOKIE (browser-extracted d_c0; z_c0); returns answers / articles / questions with 赞/评论/关注 engagement.
  - 微博 Weibo — m.weibo.cn getIndex endpoint, requires WEIBO_COOKIE (browser-extracted SUB; SUBP); returns posts with 赞/转/评 engagement. Parses relative Chinese time forms (刚刚, 5分钟前, 2小时前, 今天 HH:MM, MM-DD).
  - 小红书 Xiaohongshu — edith.xiaohongshu.com notes search, requires XHS_COOKIE (+ often XHS_X_S); returns notes with 赞/评/收藏 engagement. Note: Xiaohongshu anti-bot is aggressive; cookies may expire hourly. Fallback: use --sources tavily with <query> site:xiaohongshu.com.
- Architecture:
  - research/ package: __init__.py, types.py, time_range.py, http.py, cache.py (24h SQLite at ~/.cheetahclaws/research_cache.db, keyed on source + query + limit + time range), classifier.py (keyword-based topic→domain routing, zero latency, zero LLM), ranker.py, aggregator.py, synthesizer.py, citations.py, entities.py, reports.py, sources/ (20 modules).
  - tools/research.py: exposes Research tool to agent (13 parameters: topic, domains, sources, limit, time_range, since, until, analyze_citations, citation_threshold, expand, save_as, auto_save, synthesize, use_cache).
  - commands/research_cmd.py: /research (with compare subcommand) and /reports.
  - monitor/fetchers.py: fetch_research() bridges /monitor subscriptions to the research pipeline.
  - commands/advanced.py: SSJ menu entries 16/17/18 delegate to the right /research / /subscribe / /reports command line.
- Tests (tests/test_research.py) — 88 tests across 23 sections covering: types, classifier routing, engagement ranker, cross-source dedup, SQLite cache roundtrip + TTL expiry, each of the 20 sources (happy path + schema-shift resilience + missing-key skip behavior), aggregator parallel fan-out + failure isolation + cache integration, synthesizer LLM path + deterministic no-LLM fallback, heat table + sparkline + trend rendering, citations helper, time-range preset + ISO parsing + per-source native mapping, reports save/load/delete/path, Chinese platform parsing (including Zhihu answer/article/question shapes, Weibo relative-date parser, Xiaohongshu localized count parsing), monitor research: prefix dispatch + range-prefix form, entity extraction across all four categories + dedup-within-result guarantee, multi-query expansion producing distinct cache keys, compare mode running 2-3 parallel queries + correct prefixed citation markers.
- Packaging — pyproject.toml adds research and research.sources to the editable packages list so installed binaries can import the new module.
- Version bumped to 3.5.76.

Apr 18, 2026 (v3.5.75): External plugin discovery via CHEETAHCLAWS_PLUGIN_PATH + safer dependency management; end-to-end prompt-cache token tracking across providers

PluginScope.EXTERNAL — new scope for plugins discovered in-place (never copied to ~/.cheetahclaws/plugins/). Complements existing USER and PROJECT scopes. Use case: shared team/company plugin directories mounted at a common path.
CHEETAHCLAWS_PLUGIN_PATH env var — colon-separated (os.pathsep) list of directories scanned for plugin subdirs. Each immediate subdirectory that has a plugin.json or PLUGIN.md is surfaced as an external plugin. No new manifest format — reuses the existing PluginManifest.from_plugin_dir() loader. Missing or empty path segments are ignored; hidden directories (.git, .DS_Store, etc.) are skipped.
Default disabled — external plugins land in /plugin list as [external] disabled. User must run /plugin enable <name> once to activate. Enable state persists to ~/.cheetahclaws/plugins.json under a new external_enabled: {name: bool} map, so it survives restarts without the plugin being installed.
No silent pip install — unlike the original proposal in #49, cheetahclaws never installs plugin dependencies from an import-failure fallback. Dependency installation happens only at explicit user-consent points: /plugin install (existing flow), or the first /plugin enable of an external plugin that declares dependencies. The model cannot trick the runtime into mutating the Python environment.
Dependency check uses importlib.metadata.distribution() — new _missing_dependencies(deps) helper keys off the PyPI distribution name, not find_spec(name). This fixes the PyPI-vs-import-name trap that breaks common packages: Pillow (imports as PIL), PyYAML (imports as yaml), opencv-python (cv2), scikit-learn (sklearn), beautifulsoup4 (bs4). The old find_spec("pillow") approach returned None for installed Pillow and would loop-install forever.
Safety guards — uninstall_plugin on an EXTERNAL entry only drops the enable-state record; it never shutil.rmtrees the user's source directory. update_plugin refuses external plugins with "update the source directory directly" instead of attempting git pull. Malformed plugin.json files are logged to stderr and skipped, so one bad manifest can't crash /plugin list.
Dedupe on name collision — if a plugin name exists in both installed (USER/PROJECT) and external scopes, the installed entry wins. Within external scopes, the earliest directory in CHEETAHCLAWS_PLUGIN_PATH wins (consistent with $PATH semantics).
Tests (tests/test_plugin_external.py) — 16 tests covering: env var parsing with empty/nonexistent segments, plugin.json and PLUGIN.md discovery, hidden-directory skip, malformed-JSON resilience, path-order priority, installed-shadows-external dedupe, enable/disable persistence round-trip, PEP 508 requirement parsing (package[extra]>=1.0 → package), and a regression test for the PyPI-vs-import-name bug.
New public export — from plugin import PLUGIN_PATH_ENV gives the env var name for use in tooling/docs.
Not changed: existing USER/PROJECT install flow, plugin.json/PLUGIN.md manifest format, /plugin command subcommands. Fully backward compatible — unset CHEETAHCLAWS_PLUGIN_PATH and the system behaves exactly as before.
Fix (tool-history integrity for OpenAI-compatible providers) — resolves #57: after long sessions, DeepSeek (and other OpenAI-compatible endpoints) started rejecting requests with "Messages with role 'tool' must be a response to a preceding message with 'tool_calls'" (HTTP 400), only recoverable by rebooting which lost all context. Root cause: compaction.find_split_point() chose a split index by token count alone, so a split could land between an assistant(tool_calls) message and its tool response messages, leaving orphaned tool entries in the kept half. Three-layer defense:
- compaction._respect_tool_pairs(messages, split) — post-processes the split index: if the last message in the old half is an assistant with tool_calls, advances the split forward past all consecutive tool responses; also skips any standalone tool message the split would land on. Falls back to returning 0 (skip compaction this turn) if no safe split exists — the threshold will re-trigger next turn.
- compaction.sanitize_history(messages) — single-pass O(n) invariant enforcer. Tracks pending tool_call_ids from the most recent assistant(tool_calls) in a rolling set; drops any tool message whose tool_call_id is not in the set (orphan), and strips unanswered tool_calls entries from assistant messages when a non-tool message intervenes. If all tool_calls on an assistant are stripped, the tool_calls key is removed entirely and content is normalized to a non-null string (required by the OpenAI schema). Does not mutate input.
- agent.run() — calls sanitize_history after every maybe_compact and before each stream() call. Any divergence (from compaction, crashed tool execution, checkpoint restore, or future code paths) is caught before it reaches the provider; emits a history_sanitized warn-log with the number of messages removed so regressions are visible.
- Why three layers instead of one: the split-point fix prevents the primary source of orphans; the sanitizer is a defense-in-depth net that keeps the invariant regardless of where history corruption originates; the agent-loop wiring ensures the net is actually applied. No user-visible behavior change on well-formed histories — test_well_formed_history_unchanged pins this.
- Tests (tests/test_compaction.py) — 15 new tests across three classes (TestFindSplitPoint.test_split_never_splits_tool_pair, TestRespectToolPairs$ \times 4, $TestSanitizeHistory × 7) covering split-boundary edge cases (split at every ratio from 0.2 to 0.5, multi-tool-call blocks, standalone orphan tool at split), sanitizer correctness (well-formed history unchanged, orphan drop, partial and full unanswered-tool_calls stripping, unanswered at end of list, wrong tool_call_id drop), and an input-immutability guarantee.

End-to-end prompt-cache token tracking (closes #43) — cache hit/miss counters now flow from provider → AgentState → checkpoint snapshots across every supported provider family. Two new default-0 fields cache_read_tokens / cache_write_tokens on AssistantTurn; AgentState.total_cache_read_tokens / total_cache_write_tokens accumulate via getattr(..., 0) so providers that never set the fields still work. Extraction centralized into two helpers in providers.py: _anthropic_cache_tokens(usage) reads cache_read_input_tokens + cache_creation_input_tokens; _openai_cached_read_tokens(usage) walks prompt_tokens_details.cached_tokens. Both coerce missing / None to 0 — older SDKs, non-cached calls, Bedrock-over-litellm wrappers all fall through instead of raising AttributeError. Provider coverage:

Family	Cache read	Cache write	Mechanism
Anthropic (`stream_anthropic`)	✓	✓	Both fields on `final.usage` when prompt-caching beta is active
OpenAI-schema (`stream_openai_compat` — OpenAI, Gemini, Kimi, Qwen, Zhipu, DeepSeek, MiniMax, Groq, xAI, any compatible endpoint)	✓	0 (by design)	OpenAI's schema has no separate "cache creation" counter; caching is implicit on their side
Ollama (`stream_ollama`)	0	0	No prompt-caching in Ollama today
Any future / custom provider	0 (default)	0 (default)	`getattr(event, "cache_read_tokens", 0)` no-op fallback

Persistence: checkpoint/store.make_snapshot writes token_snapshot["cache_read"] / ["cache_write"]; /checkpoint <id> (and /rewind) restores them alongside input/output totals so counters stay in lock-step with whatever snapshot the user rewound to. Structured logging: api_call_done records now include cache_read_tokens / cache_write_tokens alongside in_tokens / out_tokens. Note: not yet surfaced in /cost or /status output — the tracking layer landed first, a follow-up will expose it in the user-facing commands.

Tests (tests/test_cache_tokens.py) — 14 tests across 5 layers: AssistantTurn field defaults + explicit values; AgentState accumulation across increments; real make_snapshot on tmp_path with all four token fields; Anthropic + OpenAI extraction helpers against synthetic usage objects (populated / missing / None); end-to-end agent.run with a scripted stream — single-turn propagation and multi-turn accumulation; plus a test_rewind_restores_cache_tokens_from_snapshot regression test that asserts the round-trip. tests/e2e_checkpoint.py updated to keep the scripted rewind path in sync with production code.
Version bumped to 3.5.75.

Apr 16, 2026 (v3.5.74): Web UI production hardening — persistence, multi-user auth, ops endpoints, JS module split, pytest suite
- SQLite persistence (web/db.py, web/models.py) — SQLAlchemy-backed store with 4 tables: users, chat_sessions, messages, api_credentials. Sessions + message history now survive server restarts (previously in-memory only, lost on restart). DB file at ~/.cheetahclaws/web.db (0600). Config key CHEETAHCLAWS_WEB_DB overrides the path.
- Multi-user auth (web/auth.py) — replaced single generated password with full accounts: bcrypt password hashing (passlib) + stateless JWT cookies (PyJWT, HS256, 7-day TTL). JWT signing secret persisted to ~/.cheetahclaws/web_secret (0600) so logins survive restarts. New endpoints: POST /api/auth/register (first user becomes admin), POST /api/auth/login, POST /api/auth/logout, GET /api/auth/whoami, GET /api/auth/bootstrap (first-run routing). Legacy POST /api/auth kept for the terminal password page.
- Session CRUD — new PATCH /api/sessions/{id} to rename, DELETE /api/sessions/{id} to remove, GET /api/sessions/{id}/export to download conversation as Markdown. Auto-titling from first user message. Cross-user isolation enforced even on in-memory cache hits (one session hit patched after smoke test revealed the leak).
- Structured JSON logging (web/logging_setup.py) — logging + custom JSON formatter emits one record per line to stderr, e.g. {"ts":..., "level":"info", "logger":"web.server", "msg":"req", "method":"POST", "path":"/api/auth/login", "status":200, "dur_ms":259, "user_id":1}. Every HTTP response auto-logs method/path/status/dur_ms/user_id/peer. Level controlled by CHEETAHCLAWS_LOG_LEVEL env (default INFO).
- Ops endpoints — GET /health returns {ok, db, uptime_s} (503 if DB unreachable); GET /metrics returns Prometheus v0.0.4 text with cheetahclaws_{uptime_seconds, requests_total, requests_4xx, requests_5xx, auth_logins_total, auth_logins_failed, auth_registrations_total, users_total, ws_connections_total}. Unauthenticated so Prometheus/k8s probes can hit them.
- JS module split (web/static/js/) — monolithic 1813-line chat.html → 552 lines of HTML + 9 vanilla JS modules (chat.js core class, util.js, auth.js, sidebar.js, tools.js, approval.js, settings.js, welcome.js, init.js) loaded via plain <script src> tags. Prototype-mixin pattern (Object.assign(ChatApp.prototype, {...})) keeps app.foo() call sites unchanged. No bundler, no build step.
- ETag + conditional caching — JS/CSS/HTML served with Cache-Control: no-cache, must-revalidate + weak ETag (mtime-size). Browser gets 304 when unchanged, fresh content after any edit. Binary assets keep 24h cache. Path traversal blocked by resolved-path is_relative_to check.
- pytest suite (tests/test_web_api.py) — 21 end-to-end HTTP tests using httpx: bootstrap/register/login/whoami/logout, sessions CRUD + export + markdown, cross-user isolation, persistence after cache clear, /health, /metrics counter deltas, CORS preflight, auth gating of every endpoint. Spins the real server in a thread on a random port, DB truncated between tests. Runs in ~5s. pytest tests/test_web_api.py.
- Sidebar UX — chat sessions now show title + relative time ("just now", "12m ago", "3d ago") + message count + busy dot. Search box filters by title/id on the client. Right-click (or long-press) gives a context menu: Rename / Export Markdown / Delete. Footer shows current username + Sign out link.
- Register-or-login on first visit — chat UI now calls /api/auth/bootstrap on load; if no user exists it shows a "Create your first account" form (first registration becomes admin), otherwise the "Sign in" form. Username + password instead of a single server-generated password.
- Theme: light default + system auto — :root now carries the light palette; @media (prefers-color-scheme: dark) swaps in the dark palette when the user hasn't explicitly chosen a theme. Toggle button cycles system → light → dark → system, icon reflects the effective theme, title tooltip spells out the current mode. Inline pre-paint script in <head> sets data-theme before first paint to avoid FOUC.
- Auto port selection — cheetahclaws --web (no --port) now tries 8080 first; on EADDRINUSE it binds :0 and lets the kernel pick a free port, banner reports the real URL. Explicit --port N binds exactly N or fails loudly (user intent preserved). --port argparse default changed from 8080 → None as a sentinel.
- Favicon + MIME polish — web/static/favicon.{png,ico} cropped from docs/media/logos/logo-5.png (leaping cheetah, transparent background, multi-size ICO 16/32/48). Served from root as /favicon.ico for browser defaults. MIME table extended with .ico (image/vnd.microsoft.icon), .svg, .jpg, .woff, .woff2.
- Welcome dashboard rebalanced — old 5-card "Bridges & Media" row (ragged in 2×2 grid) split into two 4-card sections: Bridges (Telegram · WeChat · Slack · Monitor) and Multi-Modal Media (Voice Input · Vision · Copy Output · Export). /cwd added to Development Tools. Tagline changed to "Personal AI Assistant · Support Any Model · Autonomous 24/7".
- Bridges commands in Chat UI — /telegram, /wechat (+/weixin alias), /slack, /voice now registered in web/api.py's slash registry (previously only the terminal REPL had them), so clicking the dashboard cards actually runs the command.
- New extras — pip install 'cheetahclaws[web]' installs sqlalchemy>=2.0, passlib[bcrypt]>=1.7.4, PyJWT>=2.8.0. CLI-only installs remain dependency-free. [all] extra updated.
- Version bumped to 3.5.74.
Apr 16, 2026 (v3.5.73): Web UI — browser-based Chat UI + structured event API
- Web Chat UI (web/chat.html) — cheetahclaws --web now serves a rich browser-based chat interface at /chat alongside the existing PTY terminal at /. Features: real-time streaming via Server-Sent Events (SSE), collapsible tool cards with status badges, inline permission approval buttons (Allow/Deny), activity indicator (spinner + state labels for Thinking/Running/Processing), Markdown rendering with XSS sanitization (marked.js bundled), dark/light theme toggle with localStorage persistence, mobile-responsive layout with sidebar overlay.
- Structured event API (web/api.py) — new ChatSession class bridges agent.run() generator to WebSocket/SSE event streams following the same pattern as the Telegram/Slack/WeChat bridges. Events: text_chunk, thinking_chunk, tool_start, tool_end, permission_request, permission_response, turn_done, command_result, interactive_menu, input_request, status, error. Event buffer with replay for late-joining subscribers.
- 8 new API endpoints — POST /api/prompt (submit prompt or slash command), WS /api/events (real-time event stream), POST /api/approve (permission response), GET /api/sessions (list sessions), GET /api/sessions/{id} (session details + message history), GET/PATCH /api/config (read/write config), GET /api/models (list all 11 providers and models), POST /api/auth (login, sets HttpOnly cookie).
- Settings panel — click ⚙ to open: model selector grouped by 11 providers (Anthropic, OpenAI, Gemini, Ollama, DeepSeek, Qwen, etc.), permission mode dropdown, thinking/verbose toggles, max tokens input, per-provider API key management with status indicators, quick action buttons (Compact/Status/Cost/Context), terminal link for fallback.
- Slash command support in Chat UI — all 45+ commands work. Quick commands (/status, /help, /model, /context) return results instantly via POST response. Long-running commands (/brainstorm, /worker, /plan, /agent) stream events in real-time via SSE (server keeps HTTP connection open). /ssj renders a clickable 12-item interactive menu. /brainstorm (no args) shows a topic input box before starting.
- SSJ sub-commands — /ssj debate, /ssj commit, /ssj readme, /ssj scan, /ssj propose, /ssj review now run directly as agent queries without showing the interactive menu. The menu only appears for /ssj (no args).
- Feature dashboard — welcome page shows 24 feature cards organized in 6 categories (Core, Agent Features, Session & Memory, Multi-Model, Development Tools, Bridges & Media) with 7 clickable quick-command chips.
- Security hardening — hmac.compare_digest() for timing-safe token comparison, XSS sanitization (HTML tags escaped before Markdown rendering), CORS restricted to request Origin echo (no wildcard), HttpOnly + SameSite=Strict cookies, auth checked before WebSocket upgrade, _BufferedSocket wrapper replaces fragile sock.recv monkey-patching.
- Session management — chat sessions with idle timeout (30 min), background reaper for orphaned sessions, session list in sidebar with message count and busy indicator, click to switch, "+" to create new.
- Web bridge integration — RuntimeContext extended with web_input_event, web_input_value, in_web_turn fields. tools/interaction.py routes permission prompts to web bridge via threading.Event synchronization. commands/advanced.py detects web turns and skips interactive prompts (uses defaults like Telegram bridge).
- Thread-safe stdout streaming — _ThreadLocalStdout intercepts print() only from the target command thread, broadcasts as text_chunk events. Other threads unaffected.
- pyproject.toml packaging — web package added to packages list, *.js, *.css, *.html added to package-data. Static assets (xterm.min.js, marked.min.js, chat.html) correctly included in pip install distributions.
- Docs — new Web UI Guide (304 lines): quick start, full feature list, settings panel, API reference with JSON examples for all 8 endpoints and 12 event types, architecture notes, troubleshooting. README updated with Web UI section, feature table entry, CLI options, and examples.
- Version bumped to 3.5.73.
Apr 15, 2026 (v3.5.72): Trading agent, error classifier, parallel tools, prompt injection detection, SQLite sessions, tool cache, auxiliary model, safe stdio
- Trading agent module (modular/trading/) — AI-powered multi-agent trading analysis and backtesting system. 5-phase analysis pipeline: data collection (technical indicators, fundamentals, news) → Bull/Bear researcher debate with BM25 memory → research judge recommendation → risk management panel (aggressive/conservative/neutral 3-way debate) → portfolio manager final decision (BUY/OVERWEIGHT/HOLD/UNDERWEIGHT/SELL). 4 built-in backtest strategies (dual MA, RSI mean reversion, Bollinger breakout, MACD crossover) with equity and crypto engines. 7 AI tools (GetMarketData, GetPrice, GetTechnicalIndicators, GetFundamentals, GetNews, RunBacktest, TradingMemory). 11 pure-Python technical indicators. Data source fallback chains (yfinance → coingecko → akshare). Post-trade reflection mechanism feeds lessons back into BM25 memory. SSJ integration as option 14 with guided sub-menu. Supports US/HK/A-share stocks and 20+ cryptos. Install: pip install "cheetahclaws[trading]".
- Error classifier (error_classifier.py) — centralized API error taxonomy (auth, billing, rate_limit, context_overflow, model_not_found, overloaded, connection, timeout) with per-category recovery hints, retryability, and backoff multipliers. Replaces fragile string matching in agent.py and cheetahclaws.py.
- Parallel tool execution (agent.py) — when the LLM returns multiple tool calls, concurrent_safe=True tools (Read, Glob, Grep, WebSearch, etc.) now run in parallel via ThreadPoolExecutor (up to 8 workers). Write tools remain sequential. Permission checks are still serial.
- Prompt injection detection (context.py) — CLAUDE.md files are scanned for 8 threat patterns (e.g., "ignore previous instructions", "system prompt override", credential exfiltration via curl/echo) before injection into the system prompt. Detected files are excluded with a security warning.
- SQLite session store + full-text search (session_store.py) — sessions are now saved to SQLite (WAL mode) alongside JSON files. FTS5 index enables /search <query> to find past conversations by content. Auto-imports legacy history.json on first search.
- Tool result cache (tool_registry.py) — read-only tools cache results by sha256(name + params), LRU eviction at 64 entries. Write tools (Write, Edit, Bash, NotebookEdit) invalidate the cache automatically. Eliminates redundant file reads in agent loops.
- Auxiliary model routing (auxiliary.py) — side tasks (context compression, summarization) now route to a fast/cheap model (Gemini Flash, GPT-4o-mini, etc.) instead of the primary model. Auto-detects from available API keys. Configurable via auxiliary_model in config.
- Auto-discovery tool loading (tools/__init__.py) — extension modules loaded via _EXTENSION_MODULES list + __import__() loop instead of manual import statements. Adding a new extension is one line.
- Safe stdio wrapper (cheetahclaws.py) — sys.stdout/sys.stderr wrapped with _SafeWriter that silently handles BrokenPipeError and closed file descriptors. Prevents crashes when terminal disconnects during bridge/daemon operation.
- One-line installer (scripts/install.sh) — curl -fsSL .../install.sh | bash handles platform detection (Linux/macOS/WSL2/Termux), Python/git/pip checks, clone, install, and PATH setup. First run triggers the setup wizard automatically.
- Contributing section in README with quick-start commands for contributors, linking to CONTRIBUTING.md and Plugin Authoring Guide.
- Browser tool (tools/browser.py) — WebBrowse renders JavaScript pages with headless Chromium (via playwright). Supports extract, screenshot, and click actions with CSS selectors. Solves dynamic/SPA pages that WebFetch can't handle. Optional: pip install cheetahclaws[browser].
- Email tools (tools/email.py) — ReadEmail (IMAP) reads inbox with search by sender/subject; SendEmail (SMTP) sends emails with threading support. Zero external deps (Python stdlib). Configure with /config email_address=....
- File tools (tools/files.py) — ReadPDF extracts text from PDFs (pymupdf); ReadImage does OCR on images (pytesseract, 99 languages); ReadSpreadsheet reads Excel/CSV/TSV with formatted table output. Optional: pip install cheetahclaws[files].
- [all] extra — pip install cheetahclaws[all] installs every optional dependency (voice, vision, autosuggest, browser, files, OCR).
- Version bumped to 3.5.72.
Apr 15, 2026 (v3.5.71): Plugin docs, example template, config namespace fix, typing-time autosuggest
- Plugin authoring guide (docs/guides/plugin-authoring.md) — full guide for building third-party plugins: tools (TOOL_DEFS), commands (COMMAND_DEFS), skills, MCP servers, manifest format, testing, publishing checklist, and common mistakes.
- Example plugin template (examples/example-plugin/) — copy-and-edit starter with working tools (ExampleSearch, ExampleStatus), command (/example with subcommands), skill, and plugin.json manifest.
- Fix config namespace collision — renamed config.py to config.py to avoid conflict with system config namespace packages. pip install -e . followed by cheetahclaws from outside the project directory no longer crashes with ImportError.
- Typing-time autosuggest (PR #38 by @honghua) — optional prompt_toolkit integration for inline ghost suggestions and keyboard-selectable completion menu while typing slash commands. Install with pip install cheetahclaws[autosuggest]. Falls back to readline when not installed. Env var CHEETAH_PT_INPUT=0 to opt out.
- Python 3.10-3.13 compat fix (PR #38) — Path.read_text(newline=) in tools/fs.py replaced with portable open() helper (the newline= kwarg is 3.14+ only).
- Version bumped to 3.5.71.
Apr 14, 2026 (v3.5.70): Setup wizard, Ollama UX, context indicator, and session robustness
- Interactive setup wizard (commands/core.py, cheetahclaws.py) — cheetahclaws --setup or /setup launches a guided setup: pick from 6 providers (Ollama, Anthropic, OpenAI, Gemini, DeepSeek, custom), auto-detect env vars, set API key, verify connection. Auto-triggers on first run (no config.json). API key missing warning now suggests --setup.
- Ollama UX improvements — /model now shows live local Ollama models (via /api/tags) instead of a hardcoded list. /model ollama triggers the interactive model picker. Connection failures and 404 errors now give actionable messages ("Is Ollama running?", "Pull it with: ollama pull ..."). Tool-calling fallback message clarified.
- Context usage in prompt — the REPL prompt now shows context window usage as a percentage: dim when <40%, yellow at 40-70%, red at >=70%. Users can see when compaction is approaching without running /context.
- Session save/resume robustness — atomic writes (write-to-temp + rename) prevent corruption on crash. /load and /resume now catch corrupted JSON with friendly error messages and suggest daily backups. History file corruption no longer blocks auto-save.
- Version from pyproject.toml — VERSION is now read dynamically from pyproject.toml (single source of truth), no more hardcoded version drift. Falls back to importlib.metadata when installed as a package.
- /doctor enhanced — added internet connectivity check and pyte dependency check; optional vs required deps now distinguished ([FAIL] for missing required deps).
- Fix mcp namespace collision — renamed internal mcp/ package to mcp_client/ to avoid conflict with the official mcp pip package (Anthropic MCP SDK). Previously, pip install . followed by cheetahclaws crashed with ImportError: cannot import name 'MCPClient'.
- Version bumped to 3.5.70.
Apr 14, 2026 (v3.5.69): Actionable error messages, dependency sync, and contributor guide
- Actionable API error messages (cheetahclaws.py) — the REPL error handler now detects 6 common failure modes (invalid API key, network timeout, Ollama not running, rate limit, model not found, insufficient credits) and prints a specific hint alongside the error instead of a generic message. The proactive watcher background thread no longer dumps raw Python tracebacks to stdout — errors are routed through logging_utils instead.
- Dependency sync (pyproject.toml, requirements.txt) — pyte>=0.8.0 added to pyproject.toml core dependencies (was only in requirements.txt, causing import failures after pip install .). requirements.txt rewritten to mirror pyproject.toml as single source of truth, with optional deps (sounddevice, Pillow) clearly marked.
- CONTRIBUTING.md — new contributor guide covering project structure, architecture (config vs RuntimeContext, tool/plugin/hooks systems), development conventions, and a PR checklist. Addresses recurring PR issues where contributors misunderstood the plugin loader (TOOL_DEFS vs register_tool()), hooks system (no event-based hooks), and runtime state management.
- Version bumped to 3.5.69.
Apr 14, 2026 (v3.5.68): CI/CD, config/runtime separation, and module reorganization
- GitHub Actions CI (.github/workflows/ci.yml) — added automated testing on every push and PR: pytest across Python 3.10–3.13, plus a package smoke test that installs via pip install . and verifies all modules are importable. No more silent packaging regressions.
- Config/runtime separation (runtime.py) — runtime state (_proactive_thread, _pending_image, _plan_file, bridge turn flags, etc.) moved out of the config dict into RuntimeContext fields. The config dict now holds only serializable user configuration. Added runtime.get_ctx(config) helper for easy access. Migrated 18 files; 327 tests pass.
- Tool module reorganization — 7 top-level tools_*.py files consolidated into a tools/ package (tools/security.py, tools/fs.py, tools/shell.py, tools/web.py, tools/notebook.py, tools/diagnostics.py, tools/interaction.py). All existing from tools import ... code continues to work unchanged via tools/__init__.py.
- Version bumped to 3.5.68.
Apr 14, 2026 (v3.5.67): Packaging fix, /config safety, and readline completion fix
- Fix ModuleNotFoundError on pip install / uv tool install (pyproject.toml) — 16 missing top-level modules (logging_utils, agent_runner, tools_fs, tools_shell, etc.) and the monitor package were not declared in pyproject.toml, causing No module named 'logging_utils' and similar crashes after installation (#36). All runtime modules are now correctly packaged.
- /config no longer exposes secrets (commands/config_cmd.py) — the /config display now filters out sensitive keys (api_key, telegram_token, wechat_token, and any key ending in _key, _token, or _secret) as well as internal runtime keys (prefixed with _). Previously, /config crashed with TypeError on non-serializable threading.Thread objects and leaked credentials.
- Readline completion condition fix (cheetahclaws.py) — changed "/" in line to line.startswith("/") in the completer and display hook, preventing false matches on non-slash input containing /. Completion menu now redisplays the prompt line correctly after showing matches.
- Packaging fix, /config safety, and readline completion fix — Fixed ModuleNotFoundError on install (#36), secrets filtering in /config, readline completion.
- Version bumped to 3.5.67.
Apr 12, 2026 (v3.5.66): Auto max_tokens cap per model + tool robustness fixes
- Automatic max_tokens capping (providers.py) — a new resolve_max_tokens() function automatically caps max_tokens to the model's actual limit before every API call, eliminating BadRequestError: max_tokens cannot be greater than max_model_len errors when using vLLM or other bounded local endpoints. Priority: (1) per-model hard limit from a built-in table of 30+ known models; (2) for custom provider, GET /v1/models is queried at first call and the max_model_len field is used (result cached per base URL); (3) provider-level context_limit // 2 as a conservative fallback. The user's configured value is always treated as an upper bound — never increased.
- KeyError: 'file_path' in agent tool calls (tools.py) — when a model (e.g. Qwen) generates a malformed tool call omitting the required file_path parameter for Read / Write / Edit, the agent runner now returns a descriptive error string ("Error: missing required parameter 'file_path'") instead of crashing with an unhandled KeyError. The agent can then self-correct on the next iteration.
- KeyError: 'white' in /agent SSJ wizard (ui/render.py) — "white": "\033[37m" added to the ANSI color table C; the agent wizard's summary box used clr(name, 'white') which crashed on startup.
- Version bumped to 3.5.66.
Apr 12, 2026 (v3.5.65): /agent SSJ entry, bridge-compatible wizard, and /monitor interactive wizard fix
- SSJ entry 14 — 🤖 Agent (commands/advanced.py) — The SSJ power menu now has a 14th option that launches the /agent interactive wizard directly, covering all four autonomous templates (Research Assistant, Auto Bug Fixer, Paper Writer, Auto Coder).
- Bridge-compatible agent wizard (commands/agent_cmd.py) — The wizard's input helper _ask() now routes through ask_input_interactive() so it works correctly over Telegram, Slack, and WeChat bridges (previously used bare input() which is terminal-only).
- /monitor interactive wizard input fix (all three bridges) — When the /monitor wizard sends a menu to a bridge and waits for user input, the next message from the user was incorrectly treated as a new AI query. Each bridge's poll loop now checks session_ctx.tg/slack/wx_input_event before dispatching to the AI — wizard replies are correctly routed back to the waiting prompt.
- Version bumped to 3.5.65.
Apr 12, 2026 (v3.5.64): /monitor — AI subscription system & /agent task template system for auto research
- /monitor wizard — typing /monitor with no arguments launches an interactive setup wizard: live subscription list, numbered menu (add subscription / run now / start-stop scheduler / remove / configure notifications), zero memorization required. Works in terminal and all three bridges.
- monitor/ package — fetchers.py (arxiv RSS + weekend API fallback · Yahoo Finance · CoinGecko · Reuters/BBC/AP RSS · DuckDuckGo), summarizer.py (AI summarization via providers.stream()), notifier.py (Telegram / Slack / console delivery), scheduler.py (background daemon, daily / 6h / 30m schedules), store.py (persistent subscriptions at ~/.cheetahclaws/monitor_subscriptions.json).
- /subscribe <topic> [schedule] [--telegram] [--slack] — subscribe to ai_research, stock_TSLA, crypto_BTC, world_news, or custom:<query>. Schedule defaults to daily; delivery defaults to configured channels.
- /agent wizard — /agent with no args launches the autonomous agent wizard (Research Assistant / Auto Bug Fixer / Paper Writer / Auto Coder / Custom); walks through template-specific questions, confirms, then starts the loop in a background thread.
- agent_runner.py — isolated AgentState per runner, calls agent.run() per iteration, auto-approves permissions, pushes iteration summaries via active bridge, persists to ~/.cheetahclaws/agents/<name>/log.jsonl.
- 4 built-in agent templates (agent_templates/): research_assistant, auto_bug_fixer, paper_writer, auto_coder — Markdown-driven program.md style (inspired by Karpathy's autoresearch).
- Job queue & remote control (all three bridges) — persistent job registry (jobs.py, ~/.cheetahclaws/jobs.json); new bridge commands: !jobs / !j (dashboard), !job <id> (detail), !retry <id> (re-run failed job), !cancel [id] (stop job); per-bridge queue (FIFO when AI is busy); on_tool_start / on_tool_end hooks wired in all three bridges for live step tracking.
- Version bumped to 3.5.64.
Apr 12, 2026 (v3.5.63): Phone bridge: PTY permission prompt now responds correctly to digit inputs
- Ink SelectInput fix (bridges/interactive_session.py) — Claude Code's permission prompts (e.g. "❯ 1. Yes 2. Yes, don't ask again 3. No") are rendered by Ink's SelectInput which only responds to arrow-key + Enter events, not digit key presses. Sending 2 from the phone previously had no effect (or misrouted to the wrong option) because the raw digit was written verbatim to the PTY.
- Automatic arrow-key translation — send_input() now detects when the pyte screen shows a numbered ❯ 1. menu and maps the digit to the correct ANSI escape sequence: 1 → Enter (cursor already on item 1), 2 → ↓ + Enter, 3 → ↓↓ + Enter, and so on up to 9. The translation fires only when the screen shows the menu pattern and the input is a single digit; all other inputs are forwarded unchanged.
- Version bumped to 3.5.63.
Apr 12, 2026 (v3.5.62): /agent — autonomous research & coding loop (task template system)
- /agent wizard — typing /agent with no arguments launches an interactive setup wizard: numbered menu (Research Assistant / Auto Bug Fixer / Paper Writer / Auto Coder / Custom), template-specific follow-up questions, summary & confirm. Zero memorization required.
- agent_runner.py — core autonomous loop engine. Each AgentRunner owns an isolated AgentState, runs agent.run() per iteration, auto-grants permissions (configurable), reports iteration summaries via bridge (Telegram/Slack/WeChat) or terminal, persists results to ~/.cheetahclaws/agents/<name>/log.jsonl.
- 4 built-in task templates (agent_templates/): research_assistant (read papers → notes → related work), auto_bug_fixer (run tests → fix → commit), paper_writer (outline → draft section by section), auto_coder (tasks.md → implement → test → commit).
- Custom templates — drop any .md file following the program.md pattern (inspired by Karpathy's autoresearch) and launch with /agent start /path/to/template.md.
- /agent start <template> [args] — power-user direct launch with --name, --interval, --no-auto-approve flags.
- /agent stop/list/status/templates — full lifecycle management from terminal or phone bridge.
- Phone control — !agent list, !agent stop <name>, !agent status <name> work in all three bridges for remote monitoring while agents run overnight.
- Version bumped to 3.5.62.
Apr 12, 2026 (v3.5.61): Phone vibe-coding: interactive PTY session robustness improvements
- !exit with accidental space (bridges/telegram.py, slack.py, wechat.py) — exit detection now normalises whitespace so ! exit, ! quit, ! stop (with a space after !) all correctly terminate the PTY session.
- /exit and /quit intercepted — these were previously forwarded verbatim into the running process (e.g. Claude Code), causing confusion. They are now caught by the bridge before routing and cleanly end the session.
- Input acknowledgement — every keystroke forwarded to a PTY session immediately echoes back ⌨ <text> so the user knows their input was received, even before the process produces output.
- !ping / !screen / !refresh — new meta-commands (also tolerating a space: ! ping) that force the current pyte screen state to be re-rendered and sent regardless of the deduplication cache.
- Dedup reset on input (interactive_session.py) — send_input() now clears _last_sent after writing to the PTY, guaranteeing the next output flush is always delivered even if screen content appears unchanged.
- force_flush() method (interactive_session.py) — public method that resets the dedup cache and immediately re-renders and sends the visible screen; used by !ping.
- Version bumped to 3.5.61.
Apr 12, 2026 (v3.5.60): Production reliability, maintainability, and product completeness improvements
- Structured logging (logging_utils.py) — newline-delimited JSON log output with error/warn/info/debug level filtering, thread-safe file or stderr sink, and configure_from_config() for zero-boilerplate setup. All API calls, retries, tool events, and bridge lifecycle events now emit structured log events with session_id correlation.
- Circuit breaker (circuit_breaker.py) — per-provider three-state machine (CLOSED → OPEN → HALF_OPEN) with rolling failure window (default: 5 failures in 60 s) and exponential cooldown (default: 120 s). providers.py wraps every streaming call; agent.py catches CircuitOpenError and returns a user-visible message without retrying.
- Quota control (quota.py) — four enforcement limits (session_token_budget, session_cost_budget, daily_token_budget, daily_cost_budget) checked before every API call. Daily accumulation persisted to ~/.cheetahclaws/quota/YYYY-MM-DD.json; in-memory counters are thread-safe. All nine new config keys registered in config.py DEFAULTS.
- Explicit bootstrap (bootstrap.py) — startup sequence made visible and testable: ① configure logging → ② import tool registry → ③ start health-check server. cheetahclaws.py calls _bootstrap(config) once after load_config(); all steps are idempotent.
- tools.py split — the 1,400-line tools.py decomposed into seven focused sub-modules: tools_security.py (path safety, bash whitelist), tools_fs.py (read/write/edit/diff/glob), tools_shell.py (bash/grep/process-tree), tools_web.py (webfetch/websearch), tools_notebook.py (NotebookEdit), tools_diagnostics.py (GetDiagnostics), tools_interaction.py (AskUserQuestion, SleepTimer, drain_pending_questions). tools.py remains as a thin re-export shim — all from tools import X calls continue to work unchanged.
- Session file versioning (commands/session.py) — saved session files now include "_version": 1. _migrate_session() upgrades v0 → v1 on load/resume; future schema changes can add new migration steps without breaking existing saves.
- Health-check HTTP server (health.py) — optional daemon thread started via health_check_port config key. Three endpoints: GET /healthz (always 200, uptime + active sessions), GET /readyz (503 if any circuit breaker is open), GET /metrics (full JSON: uptime, model, sessions, circuit states, daily token/cost usage).
- Bridge auto-reconnect (bridges/telegram.py, bridges/slack.py, bridges/wechat.py) — each poll loop now returns an exit reason ("stopped" for clean shutdown, "auth_error" for invalid token). A supervisor wrapper (_tg/slack/wx_supervisor) catches unexpected crashes and restarts the poll loop with exponential backoff (2 s → 4 s → … → 120 s). Auth errors stop the bridge immediately without reconnect. All bridges log bridge_crash / bridge_auth_error events via logging_utils.
- /help completeness — 13 commands that were registered but missing from the help docstring are now shown: /resume, /status, /compact, /init, /export, /copy, /doctor, /checkpoint, /rewind, /plan, /brainstorm, /worker, /image. Product tagline updated to reflect CheetahClaws's current scope.
- Version bumped to 3.5.60.
Apr 12, 2026 (v3.5.59): Modular architecture refactoring — monolith → layered packages
- cheetahclaws.py split — the 5,100-line monolith has been decomposed into focused packages. cheetahclaws.py is now a ~1,300-line REPL entry-point; all bridge, UI, and command logic lives in dedicated modules.
- ui/render.py — ANSI color helpers (clr, info, ok, warn, err) and Rich Live streaming renderer extracted into a standalone package; imported by every module that needs terminal output.
- bridges/ — Telegram (telegram.py), WeChat (wechat.py), and Slack (slack.py) bridge implementations moved out of cheetahclaws.py into their own sub-package.
- commands/ — REPL slash-command handlers extracted into session.py (session load/save/export), config_cmd.py (/config, /status, /doctor), core.py (/clear, /compact, /cost, /verbose, /thinking, /image, /model), checkpoint_plan.py (/checkpoint, /rewind, /plan), and advanced.py (/brainstorm, /worker, /ssj and related).
- runtime.py — RuntimeContext singleton — live session references (run_query, handle_slash, agent_state, tg_send, slack_send, wx_send) that were previously injected into the config dict under _underscore keys are now a typed @dataclass singleton (runtime.ctx). One process → one ctx → no key collisions, no dict sprawl. Per-bridge synchronous input events (tg_input_event/value, slack_input_event/value, wx_input_event/value) are also stored here, eliminating the last threading-Event race in config.
- Packaging fixes (pyproject.toml) — runtime added to py-modules; ui, bridges, commands, modular, modular.video, modular.voice, video added to packages so all new layers are included in pip install .. package-data added for modular/video/PLUGIN.md and modular/voice/PLUGIN.md.
- pytest config — asyncio_default_fixture_loop_scope = "function" added to silence pytest-asyncio deprecation warnings; python_files extended to collect e2e_*.py alongside test_*.py (267 tests now collected by default).
- Version bumped to 3.5.59.
Apr 11, 2026 (v3.5.58): Slack bridge via Slack Web API
- Slack bridge (/slack) (cheetahclaws.py) — /slack <xoxb-token> <channel_id> connects cheetahclaws to a Slack channel using the Slack Web API (no external packages required — stdlib urllib only). Polls conversations.history every 2 seconds for new messages; sends responses via chat.postMessage. A "⏳ Thinking…" placeholder is posted immediately and then updated in-place with the real reply when the model finishes.
- Slash command passthrough — send /cost, /model gpt-4o, /clear, etc. from Slack and they execute in cheetahclaws; results are sent back to the same channel.
- Interactive menu routing — permission prompts and interactive menus are routed to Slack; your next message is used as the selection input.
- Auth check on start — auth.test is called before starting the poll loop; invalid or revoked tokens are caught immediately with a clear error message.
- Auto-start — slack_token + slack_channel saved to ~/.cheetahclaws/config.json; bridge starts automatically on every subsequent launch.
- /slack stop / /slack logout / /slack status — full lifecycle control; /stop sent from Slack also stops the bridge gracefully.
- WeChat / Slack auto-start banner flags — the startup banner now shows wechat and slack flags when the respective bridges are configured (previously only telegram was shown).
Apr 11, 2026 (v3.5.57): WeChat bridge, tmux integration, shell escape, max_tokens fix, new OpenAI models
- WeChat bridge (/wechat) (cheetahclaws.py) — /wechat login authenticates with WeChat by scanning a QR code (same iLink Bot API used by the official WeixinClawBot / @tencent-weixin/openclaw-weixin plugin). After a one-time scan, token + base_url are saved to ~/.cheetahclaws/config.json and the bridge auto-starts on every subsequent launch. The bridge runs a long-poll loop (POST /ilink/bot/getupdates, 35-second window) in a daemon thread — normal timeouts are handled transparently and do not trigger backoff or reconnect.
- context_token echo — the iLink protocol requires each reply to include the sender's latest context_token. The bridge caches this per user_id in memory and echoes it automatically on every outbound message.
- Typing indicator — a sendtyping request is sent every 4 seconds while the model processes, keeping the WeChat chat responsive.
- Slash command passthrough — send /cost, /model gpt-4o, /clear, etc. from WeChat and they execute in cheetahclaws; results are sent back to the same WeChat conversation.
- Session expiry handling — errcode -14 (session expired) clears saved credentials and prompts re-authentication on the next /wechat call.
- Message deduplication — message_id / seq dedup prevents double-processing on reconnect.
- /wechat stop / /wechat logout / /wechat status — full lifecycle control from the terminal or from WeChat itself (/stop).
- Bug fix: max_tokens rejected by gpt-5-nano / o4-mini / o3 (providers.py) — newer OpenAI models have removed the legacy max_tokens parameter and require max_completion_tokens instead. Any request using max_tokens with these models was returning a 400 error and exhausting all retries. The OpenAI provider now unconditionally sends max_completion_tokens; all other OpenAI-compatible providers (Ollama, vLLM, Gemini, Kimi, …) continue to use max_tokens, which their servers expect.
- New models listed — gpt-5, gpt-5-nano, gpt-5-mini, o3, o4-mini added to the known OpenAI model list so they appear in /model suggestions and get the correct token-cap from the provider config.
- Native tmux integration (tmux_tools.py) — 11 tmux tools for the AI agent: TmuxListSessions, TmuxNewSession, TmuxSplitWindow, TmuxSendKeys, TmuxCapture, TmuxListPanes, TmuxSelectPane, TmuxKillPane, TmuxNewWindow, TmuxListWindows, TmuxResizePane. Auto-detected at startup — tools register only when tmux (Linux/macOS) or psmux (Windows) is found; zero impact if absent. The AI can now run long-lived commands in visible panes that outlive the Bash tool's timeout, read output on demand with TmuxCapture, and build autonomous monitoring loops. System prompt is automatically extended with tmux usage guidance when the binary is present.
- Shell escape (cheetahclaws.py) — type ! followed by any shell command (!git status, !ls -la, !python --version) to execute it directly without AI involvement. Output prints inline; control returns to the prompt immediately.
Apr 10, 2026 (v3.5.56): Retry mechanism, improved token estimator, plan-context fix after force compaction
- Retry with exponential backoff (agent.py) — the provider stream loop now retries up to 3 times on any API error instead of crashing the session. Context-too-long errors trigger an immediate force compaction and retry; overloaded/rate-limit errors use longer backoff (4 s, 8 s, 16 s); all other errors use standard backoff (2 s, 4 s, 8 s). After exhausting retries a graceful inline message is shown — the session is never killed.
- Improved token estimator (compaction.py) — estimate_tokens now uses chars / 2.8 (was chars / 3.5) to better account for code-heavy content, adds 4 tokens per message for framing overhead, and applies a 10 % safety buffer. The old divisor underestimated real token counts, causing compaction to skip when it should have triggered and leading to context-overflow crashes.
- Force-compact safety net (cheetahclaws.py) — run_query now catches any uncaught error and shows a friendly message instead of crashing the REPL. Context-too-long errors are handled first with a force compaction + retry.
- Bug fix: plan context preserved after force compaction (agent.py) — _force_compact now restores the plan file context into state.messages after calling compact_messages, matching the behavior of maybe_compact. Previously, force compaction in plan mode silently dropped the plan file content from context.
- Bug fix: removed dead context-error handler (cheetahclaws.py) — the is_context_err block inside run_query's outer except was unreachable because context-too-long exceptions are already caught and handled inside agent.py's retry loop. The dead code has been removed.
- Remote Ollama support (providers.py) — the Ollama provider base URL can now be overridden via the OLLAMA_BASE_URL environment variable or the ollama_base_url config key, replacing the hardcoded localhost:11434 default. This enables connecting to a remote Ollama instance (e.g. inside Docker or on another machine) without switching to the generic OpenAI-compatible provider.
- Readline resilience in containerised environments (cheetahclaws.py) — setup_readline now catches PermissionError and OSError when loading history from a read-only or bind-mounted home directory. The atexit write-history callback is also wrapped in a try/except so shutdown errors are swallowed silently instead of printing noisy tracebacks.
Apr 08, 2026 (v3.5.55): Modular ecosystem, TTS Content Factory, CJK voice auto-detect, readline ANSI fix
- Modular ecosystem (modular/) — new plug-and-play module folder. Each submodule (modular/video/, modular/voice/) is self-contained with its own cmd.py exporting a COMMAND_DEFS dict. The registry auto-discovers all modules at startup; missing modules degrade gracefully without affecting the rest of the system. Existing video/ and voice/ imports continue to work via backward-compat shims.
- TTS Content Factory (/tts) — new command for AI-powered text-to-speech generation. Interactive wizard: choose a voice style (narrator, newsreader, storyteller, ASMR, motivational, documentary, children, podcast, meditation, custom), duration, TTS engine (Gemini → ElevenLabs → Edge, best available), and individual voice. In AI mode the active model writes the script; in custom-text mode you paste your own. Output: .mp3 audio + _script.txt companion file. Also accessible as option 12 in /ssj.
- CJK auto-voice detection — Edge TTS with an English voice silently skips Chinese/Japanese/Korean characters (only reads the Latin parts). The TTS backend now detects CJK-heavy text and auto-switches to zh-CN-XiaoxiaoNeural when a non-CJK voice is selected, ensuring the full text is synthesized.
- Edge TTS long-text chunking — Edge TTS silently truncates text beyond ~3 000 chars. The pipeline now splits text into ≤ 2 000-char chunks at sentence boundaries, synthesizes each chunk independently, and concatenates with ffmpeg — audio now always covers the complete script.
- Readline ANSI fix (#29 / #31) — ANSI color codes in input() prompts now wrapped with \001…\002 (RL_PROMPT_START/END_IGNORE) so readline accounts for them as zero-width. Fixes cursor drift and duplicate-line content when scrolling REPL history.
- SSJ Developer Mode extended — SSJ menu now includes option 11 (🎬 Video factory, conditional) and option 12 (🎙 TTS factory, conditional), matching the modular availability flags.
Apr 07, 2026 (v3.5.54): Video factory major upgrade: custom script mode, PIL subtitle engine, web image search, wizard UX overhaul: Idea → Story → Final AI Video. Inspired by Kevin, with sincere thanks for his great help and inspiration in making this project better.
- Custom script mode — new content mode in /video wizard. Choose "Custom script" to paste your own narration text: TTS reads it aloud, and the same text is automatically shown as subtitles (timed proportionally). No AI story generation step, no Whisper required. Ideal for product promos, personal narrations, and multilingual content.
- PIL subtitle rendering engine — subtitles are now rendered with Pillow (PIL) + NotoSansSC font instead of the libass filter. This fixes non-Latin characters (Chinese, Japanese, Korean, Cyrillic, Arabic) showing as black boxes. The pipeline uses a two-pass approach: fast -c:v copy assembly, then PIL-rendered PNG overlays burned in via ffmpeg filter_complex. Falls back to no subtitles if PIL fails — never crashes the pipeline.
- Subtitle source selection — new wizard step to choose subtitle mode: Auto (Whisper transcription), Story text (burn script/story as subtitles — works for all languages, no Whisper needed), Custom text (paste your own), or None.
- Text-to-SRT from plain text — text_to_srt() splits any plain text into natural subtitle chunks (word-wrap for Latin, punctuation+character-wrap for CJK) and distributes timing proportionally across the audio duration. Works for all languages, offline.
- Free web image search — /video now searches for relevant stock photos from Pexels → Wikimedia Commons → Lorem Picsum when no source images or Gemini Web session are available. AI-generated search queries (model-driven) improve relevance. Always produces images — never fails.
- AI-powered source image selection — when a source folder contains more images than needed, the model reads filenames and story content to select the most relevant ones. Keyword-scoring fallback when the model is unavailable.
- Wizard UX overhaul — full step-loop wizard with b=back, q=quit at every step. All options have Auto as default (Enter = Auto). Custom language input (type any language name + Whisper code). Style list shows before prompting. Custom output path step. Detects content language from topic text automatically.
- Audio/video sync fix — _audio_duration() now parses ffmpeg -i stderr Duration output for accurate measurement. Previously used a file-size estimate at 128kbps, causing 2.7× overestimate for Edge TTS (which outputs at 48kbps). Videos now always match audio length.
- Source materials — --source <dir> pre-loads images, audio, video, and text files. Images are used directly; audio/video narration replaces TTS; text files are summarised and injected as story context.
Apr 07, 2026 (v3.5.53): Telegram photo/voice support, process-tree kill on Bash timeout, Windows shell hints, worker fix
- Telegram photo vision — send a photo to the Telegram bridge and CheetahClaws will describe it using the active vision model (GPT-4o, Gemini 2.0 Flash, Claude, etc.). The bot downloads the highest-resolution version, encodes it as Base64, and routes it through the same _pending_image path as /img. Caption text (or a default "describe this image" prompt) is forwarded alongside the image.
- Telegram voice/audio STT — send a voice message or audio file to the Telegram bridge and CheetahClaws transcribes it automatically. OGG voice notes are converted to PCM via ffmpeg and passed to the local Whisper backend; falls back to the OpenAI Whisper API when ffmpeg is unavailable. The transcription is echoed back to the chat before being submitted as a query.
- Process-tree kill on Bash timeout — when a Bash command times out, CheetahClaws now kills the entire child process tree instead of only the shell. On Unix, os.killpg sends SIGKILL to the process group; on Windows, taskkill /F /T terminates all child processes. GUI apps (e.g. PyQt games launched by the agent) no longer leave zombie processes after a timeout. The internal implementation uses start_new_session=True instead of preexec_fn=os.setsid for thread safety.
- Worker runs all pending tasks by default — /worker previously processed only 1 task per session (a bug). It now runs all pending tasks by default. The --workers N flag still limits the batch size when needed.
- Windows shell hints in system prompt — non-Claude models now receive a Windows-specific shell cheat-sheet in the system prompt (type vs cat, dir /s /b vs find, del vs rm, etc.) so the agent generates correct commands on Windows without manual guidance.
- Bash timeout hints — the Bash tool description now advises the model to use timeout=120–300 for slow commands (npm install, npx, pip install, builds), reducing spurious 30-second timeouts on package operations.
- Bug fix: background event prompt shows actual cwd — the yellow re-prompt printed after a background event completed was hardcoded to [claude-code-local]; it now shows the real working-directory name ([{cwd.name}]), consistent with the main REPL prompt.
Apr 06, 2026 (v3.5.53): Telegram interactive menus, /img alias, /voice device, OpenAI/Gemini vision support
- Telegram interactive menus fixed — slash commands with interactive input (e.g. /ollama, /permission, /checkpoint) were blocking the Telegram poll loop, making it impossible to respond to the menu prompts. Slash commands now run in a daemon thread (like regular queries), keeping the poll loop free. All interactive menus (ask_input_interactive) work correctly over Telegram.
- /img alias — /img is now an alias for /image, for faster clipboard-image workflows.
- /voice device — new subcommand to list all available input microphones and select one interactively. The selected device index is persisted in the session config and shown in /voice status. Useful on systems with multiple audio interfaces (e.g. USB headset + built-in mic).
- Vision support for OpenAI / Gemini models — /img (and /image) now sends images in the OpenAI multipart image_url format to cloud vision models (GPT-4o, Gemini 2.0 Flash, etc.), in addition to the existing Ollama native format. No configuration change needed — the correct format is selected automatically based on the active provider.
- Bug fix: threading race condition — _in_telegram_turn is now tracked via threading.local() per-slash-runner thread instead of a shared config key, eliminating a race condition that could corrupt the flag when a regular message arrived while an interactive slash command was waiting for input.
Apr 06, 2026 (v3.5.52): Checkpoint system, plan mode, compact, and utility commands, support MiniMax Models, fix telegram bugs With sincere thanks for Xiaohan's great help in making this project better.
- Checkpoint system (checkpoint/ package): auto-snapshots conversation state and file changes after every turn. /checkpoint lists all snapshots; /checkpoint <id> rewinds both files and conversation history to any previous state; /checkpoint clear removes all snapshots for the session. /rewind is an alias. 100-snapshot sliding window; initial snapshot captured at session start. Throttling: skips when nothing changed. File backups use copy-on-write; snapshots capture post-edit state.
- Plan mode: /plan <desc> enters a read-only analysis mode — Claude may only read the codebase and write to a dedicated plan file (.nano_claude/plans/<session_id>.md). All other writes are silently blocked with a helpful message. /plan shows the current plan; /plan done exits plan mode and restores original permissions; /plan status reports whether plan mode is active. Two new agent tools — EnterPlanMode and ExitPlanMode — let Claude autonomously enter and exit plan mode for complex multi-file tasks; both are auto-approved in all permission modes.
- /compact [focus]: manually trigger conversation compaction at any time. An optional focus string guides the LLM summarizer on what context to preserve. Auto-compact and manual compact both restore plan file context after compaction.
- Utility commands: /init creates a CLAUDE.md template in the current directory; /export [filename] exports the conversation as Markdown (default) or JSON; /copy copies the last assistant response to the clipboard (Windows/macOS/Linux); /status shows version, model, provider, permissions, session ID, token usage, and context %; /doctor diagnoses installation health (Python version, git, API key + live connectivity test, optional deps, CLAUDE.md presence, checkpoint disk usage, permission mode).
Apr 06, 2026 (v3.5.51): Project renamed from Nano Claude Code to CheetahClaws
- The project has been rebranded from Nano Claude Code to CheetahClaws — a more distinctive name that captures the spirit of the tool: a sharp, agile coding assistant. The Cl in CheetahClaws is a subtle nod to Claude.
- CLI command: nano_claude → cheetahclaws
- PyPI package: nano-claude-code → cheetahclaws
- Config directory: ~/.nano_claude/ → ~/.clawnest/ → ~/.cheetahclaws/
- Main entry point: nano_claude.py → cheetahclaws.py
- All documentation, GitHub URLs, and internal references updated accordingly.
- Added CheetahClaws vs OpenClaw comparison section to README.
Apr 06, 2026 (v3.5.53): Telegram interactive menus, /img alias, /voice device, OpenAI/Gemini vision support
- Telegram interactive menus fixed — slash commands with interactive input (e.g. /ollama, /permission, /checkpoint) were blocking the Telegram poll loop, making it impossible to respond to the menu prompts. Slash commands now run in a daemon thread (like regular queries), keeping the poll loop free. All interactive menus (ask_input_interactive) work correctly over Telegram.
- /img alias — /img is now an alias for /image, for faster clipboard-image workflows.
- /voice device — new subcommand to list all available input microphones and select one interactively. The selected device index is persisted in the session config and shown in /voice status. Useful on systems with multiple audio interfaces (e.g. USB headset + built-in mic).
- Vision support for OpenAI / Gemini models — /img (and /image) now sends images in the OpenAI multipart image_url format to cloud vision models (GPT-4o, Gemini 2.0 Flash, etc.), in addition to the existing Ollama native format. No configuration change needed — the correct format is selected automatically based on the active provider.
- Bug fix: threading race condition — _in_telegram_turn is now tracked via threading.local() per-slash-runner thread instead of a shared config key, eliminating a race condition that could corrupt the flag when a regular message arrived while an interactive slash command was waiting for input.
Apr 06, 2026 (v3.5.52): Checkpoint system, plan mode, compact, and utility commands, support MiniMax Models, fix telegram bugs
- Checkpoint system (checkpoint/ package): auto-snapshots conversation state and file changes after every turn. /checkpoint lists all snapshots; /checkpoint <id> rewinds both files and conversation history to any previous state; /checkpoint clear removes all snapshots for the session. /rewind is an alias. 100-snapshot sliding window; initial snapshot captured at session start. Throttling: skips when nothing changed. File backups use copy-on-write; snapshots capture post-edit state.
- Plan mode: /plan <desc> enters a read-only analysis mode — Claude may only read the codebase and write to a dedicated plan file (.nano_claude/plans/<session_id>.md). All other writes are silently blocked with a helpful message. /plan shows the current plan; /plan done exits plan mode and restores original permissions; /plan status reports whether plan mode is active. Two new agent tools — EnterPlanMode and ExitPlanMode — let Claude autonomously enter and exit plan mode for complex multi-file tasks; both are auto-approved in all permission modes.
- /compact [focus]: manually trigger conversation compaction at any time. An optional focus string guides the LLM summarizer on what context to preserve. Auto-compact and manual compact both restore plan file context after compaction.
- Utility commands: /init creates a CLAUDE.md template in the current directory; /export [filename] exports the conversation as Markdown (default) or JSON; /copy copies the last assistant response to the clipboard (Windows/macOS/Linux); /status shows version, model, provider, permissions, session ID, token usage, and context %; /doctor diagnoses installation health (Python version, git, API key + live connectivity test, optional deps, CLAUDE.md presence, checkpoint disk usage, permission mode).
Apr 06, 2026 (v3.5.51): Project renamed from Nano Claude Code to CheetahClaws
- The project has been rebranded from Nano Claude Code to CheetahClaws — a more distinctive name that captures the spirit of the tool: a sharp, agile coding assistant. The Cl in CheetahClaws is a subtle nod to Claude.
- CLI command: nano_claude → cheetahclaws
- PyPI package: nano-claude-code → cheetahclaws
- Config directory: ~/.nano_claude/ → ~/.clawnest/ → ~/.cheetahclaws/
- Main entry point: nano_claude.py → cheetahclaws.py
- All documentation, GitHub URLs, and internal references updated accordingly.
- Added CheetahClaws vs OpenClaw comparison section to README.
00.29 PM, Apr 06, 2026 (v3.5.5): SSJ Developer Mode, Telegram Bridge, Worker Command, and UX improvements
- /ssj — SSJ Developer Mode: Interactive power menu with 10 workflow options: Brainstorm, TODO viewer, Worker, Expert Debate, Propose Improvements, Code Review, README generator, Commit helper, Git Diff Scan, and Idea-to-Tasks Promotion. Menu stays open between actions and supports /command passthrough (e.g. /exit works from inside SSJ).
- /worker command: Auto-implements pending tasks from brainstorm_outputs/todo_list.txt one by one. Supports selecting specific tasks with comma-separated numbers (e.g. 1,4,6), a custom todo file path (--path /other/todo.md), and a worker count limit (--workers 3). If you accidentally pass a brainstorm .md output file, Worker detects it and offers to redirect to todo_list.txt — or to generate it first from the brainstorm file and then run Worker automatically. Each task gets a dedicated prompt that reads code, implements the change, and marks it done.
- /telegram — Telegram Bot Bridge: Receives messages via Telegram Bot API and routes them through the model, sending responses back to the chat. Auto-starts on launch if configured. Only responds to the authorized chat_id. Supports slash command passthrough (/cost, /model, etc.), shows a typing indicator while the model processes, and can be stopped remotely by sending /stop in Telegram.
- Brainstorm → TODO pipeline: After brainstorm synthesis, automatically generates brainstorm_outputs/todo_list.txt with prioritized checkbox tasks. TODO viewer (SSJ option 2) shows only pending tasks as numbered (completed tasks shown with ✓ without a number).
- Expert Debate improvements: SSJ option 4 now prompts for the number of debate agents (default 2, minimum 2); rounds are auto-calculated as (agents × 2 − 1). The debate result is saved to the same directory as the debated file (<stem>_debate_HHMMSS.md). An animated per-round per-expert spinner (⚔️ Round 2/3 — Expert 1 thinking...) keeps the terminal lively throughout the debate.
- Brainstorm spinner: Animated spinner with random phrases while brainstorm agents are thinking.
- Force quit: 3× Ctrl+C within 2 seconds triggers os._exit(1) — kills the process immediately regardless of blocking I/O.
- Interactive Ollama Model Picker — when a request fails with 404 (model not found), cheetahclaws queries the local Ollama API (/api/tags) and presents a numbered model selector to switch models and retry without restarting. Cancelling aborts gracefully without crashing the REPL.
- Windows file handling — _read, _write, and _edit in tools.py now force UTF-8 encoding and newline="". _edit detects pure-CRLF files (every \n is part of \r\n) and restores line endings after edit; mixed-line-ending files are left as-is to avoid corruption.
- /brainstorm command — /brainstorm [topic] runs a multi-persona AI debate. The model first generates N expert personas tailored to the topic (geopolitics → analysts & diplomats; software → architects & engineers; etc.). Agent count is chosen interactively at runtime (2–100, default 5). Results are saved to brainstorm_outputs/ and synthesized by the main agent.
- Rich Live SSH fix — Rich's in-place Live streaming is now automatically disabled in SSH sessions (SSH_CLIENT/SSH_TTY detected) where ANSI cursor-up breaks and causes repeated output lines. Override with /config rich_live=true/false.
- threading.RLock — replaced threading.Lock with RLock to support re-entrant calls from brainstorm synthesis and Ollama retry paths.
05:39 PM, Apr 05, 2026 (v3.5.4): Reasoning, Rendering, and Packaging Improvements, Enhanced Memory System, Native vision support for local Ollama models, Bracketed Paste Mode, Rich Tab Completion
- Bracketed Paste Mode — replaced the old timing-based multi-line paste detection with the standard terminal Bracketed Paste Mode protocol. Pasted text of any length (code blocks, long prompts, multi-paragraph instructions) is now collected as a single turn with zero latency and no blank-line artifacts. Falls back to a 60 ms timing window for terminals that don't support BPM. Bracketed paste mode is cleanly disabled on REPL exit.
- Rich Tab Completion with descriptions — pressing Tab after / now shows every command with a one-line description and a hint of its subcommands. Typing /plugin then Tab lists all subcommands (install, uninstall, enable, …). Auto-completes to the unique match when only one command matches the prefix. Subcommands supported for /mcp, /plugin, /tasks, /cloudsave, /voice, /permissions, /proactive, and /memory.
- Model name bug fix — --model ollama/qwen3.5:35b no longer gets corrupted to ollama/qwen3.5/35b. The startup colon-to-slash conversion now only fires when the left side of : is a known provider name and no / is already present, preserving Ollama's model:tag format.
- Native vision support for local Ollama models (llava, gemma4, llama3.2-vision): new /image [prompt] command captures the current clipboard image, encodes it to Base64, and attaches it to the next prompt. Install Pillow with pip install cheetahclaws[vision]; Linux users also need xclip (sudo apt install xclip).
- Enhanced Memory System — added confidence / source / last_used_at / conflict_group metadata to every memory entry; conflict detection on MemorySave warns before overwriting; MemorySearch re-ranks results by confidence × recency (30-day decay) and updates last_used_at on hits; new /memory consolidate command runs a lightweight AI analysis of the current session and auto-saves up to 3 long-term insights (user preferences, feedback corrections, project decisions) at 0.8 confidence — never overwrites higher-confidence user memories.
- Post-merge fixes — removed a debug debug_payload.json file write that was firing on every OpenAI-compatible API call (left over from PR #11 development). Also fixed ANSI dim color not being reset after the thinking block ends, which caused subsequent text to appear dim in non-Rich terminals. Bumped pyproject.toml version to 3.5.4, and moved sounddevice to the optional voice extra (pip install cheetahclaws[voice]).
- Native Ollama reasoning + terminal rendering fix — local reasoning models (deepseek-r1, qwen3, gemma4) now stream their <think> blocks to the terminal. Ollama exposes thoughts in msg["thinking"], but cheetahclaws was previously dropping them; this is now fixed by yielding ThinkingChunk from the Ollama adapter. Also fixed a Windows CMD/PowerShell rendering issue where token-by-token ANSI dim resets caused thoughts to print vertically, and corrected flush_response() so it runs once at the end instead of on every thinking token. Enable with /verbose and /thinking.
- uv support — added pyproject.toml; install with uv tool install . to make the cheetahclaws command available globally from anywhere in an isolated environment, without manual PATH setup.
00:41 PM, Apr 05, 2026: v3.5.3 add structured session history — Structured session history: on every exit, sessions are saved to daily/YYYY-MM-DD/ (capped at session_daily_limit, default 5 per day) and appended to a master history.json (capped at session_history_limit, default 100). Each session file now includes session_id and saved_at metadata. /load groups sessions by date with time, ID, and turn-count display; supports multi-select (1,2,3) to merge sessions and H to load the full history with token-count confirmation. Both limits are configurable via /config.
00:41 PM, Apr 05, 2026: v3.5.3 fix session — Structured session history: on every exit, sessions are saved to daily/YYYY-MM-DD/ (capped at session_daily_limit, default 5 per day) and appended to a master history.json (capped at session_history_limit, default 100). Each session file now includes session_id and saved_at metadata. /load groups sessions by date with time, ID, and turn-count display; supports multi-select (1,2,3) to merge sessions and H to load the full history with token-count confirmation. Both limits are configurable via /config.
09:34 AM, Apr 05, 2026: v3.5.3 — Added GitHub Gist cloud sync: /cloudsave setup <token> to configure, /cloudsave to upload the current session to a private Gist, /cloudsave auto on to sync automatically on /exit, /cloudsave list to browse cloud sessions, and /cloudsave load <id> to restore from the cloud. Uses stdlib urllib — no new dependencies. Also added version number (e.g., v3.5.2) in the startup banner: The startup banner now displays the current version number (v3.5.2) in green, making it easy to identify which version is running at a glance.
08:58 AM, Apr 05, 2026: v3.5.2 — Introduced /proactive [duration] command: a background daemon thread watches for user inactivity and automatically wakes the agent up after the specified interval (e.g. /proactive 5m), enabling continuous monitoring loops without user intervention. /proactive with no args now shows current status; /proactive off disables it explicitly. Proactive polling state is stored in config (no module-level globals). Watcher exceptions are logged via traceback instead of silently swallowed. Also fixed duplicated output in Rich-enabled terminals by buffering text during streaming and rendering Markdown once via rich.live.Live — updates happen in-place for a true streaming Markdown experience.
10:51 PM, Apr 04, 2026: v3.05_fix04 — Fixed a crash on /model and config save commands caused by the newly introduced _run_query_callback being serialized to JSON; also added SleepTimer usage
guidance to the system prompt so the agent knows when to invoke background timers proactively.
10:28 PM, Apr 04, 2026: v3.05_fix03 — Added a native SleepTimer tool that lets the agent schedule background timers and autonomously wake itself up after a delay — no user prompt required. Paired with a threading.Lock to prevent output collisions when background and foreground calls overlap. Also includes cross-platform fixes: Windows ANSI color support, CRLF-aware Edit tool matching, an interactive numbered menu for /load, native Ollama streaming via /api/chat, and auto-capping max_tokens per provider to prevent API errors.
08:31 PM, Apr 04, 2026: v3.05_fix — Autosave + /resume: session is automatically saved to mr_sessions/session_latest.json on /exit, /quit, Ctrl+C, and Ctrl+D. Run /resume to restore the last session instantly, or /resume <file> to load a specific file from mr_sessions/, and better support for api and local Ollama models (specifically gemma4), along with Windows compatibility enhancements, session management UX improvements, and cross-platform reliability fixes for the Edit tool.
00:41 AM, Apr 04, 2026: v3.05 — Voice input (voice/ package): sounddevice → arecord → SoX recording backends, faster-whisper → openai-whisper → OpenAI API STT backends. Smart keyterm extraction from git branch + project name + recent files passed as Whisper initial_prompt for coding-domain accuracy. /voice, /voice status, /voice lang <code> REPL commands. Works fully offline with no API key. 29 new tests (~11.6K lines of Python).
10:29 PM, Apr 03, 2026: v3.04 — Expanded tool coverage: NotebookEdit (edit Jupyter .ipynb cells — replace/insert/delete with full JSON round-trip) and GetDiagnostics (LSP-style diagnostics via pyright/mypy/flake8/tsc/shellcheck). Also fixed a pre-existing schema-index bug in _register_builtins by switching to name-based lookup (~10.5K lines of Python).
06:00 PM, Apr 03, 2026: v3.03 — Task management system (task/ package): TaskCreate / TaskUpdate / TaskGet / TaskList tools with sequential IDs, dependency edges (blocks/blocked_by), metadata, persistence to .cheetahclaws/tasks.json, thread-safe store, /tasks REPL command, 37 new tests (~9500 lines of Python).
02:50 PM, Apr 03, 2026: v3.02 — Plugin system (plugin/ package): install/uninstall/enable/disable/update via /plugin CLI, recommendation engine (keyword+tag matching), multi-scope (user/project), git-based marketplace. AskUserQuestion tool: interactive mid-task user prompts with numbered options and free-text input (~8500 lines of Python).
10:00 AM, Apr 03, 2026: v3.01 — MCP (Model Context Protocol) support: mcp/ package, stdio + SSE + HTTP transports, auto tool discovery, /mcp command, 34 new tests (~7000 lines of Python).
12:20 PM, Apr 02, 2026: v3.0 — Multi-agent packages (multi_agent/), memory package (memory/), skill package (skill/) with built-in skills, argument substitution, fork/inline execution, AI memory search, git worktree isolation, agent type definitions (~5000 lines of Python), see update.
10:00 AM, Apr 02, 2026: v2.0 — Context compression, memory, sub-agents, skills, diff view, tool plugin system (~3400 lines of Python Code).
01:47 PM, Apr 01, 2026: Support VLLM inference (~2000 lines of Python Code).
11:30 AM, Apr 01, 2026: Support more closed-source models and open-source models: Claude, GPT, Gemini, Kimi, Qwen, Zhipu, DeepSeek, and local open-source models via Ollama or any OpenAI-compatible endpoint. (~1700 lines of Python Code).
09:50 AM, Apr 01, 2026: Support more closed-source models: Claude, GPT, Gemini. (~1300 lines of Python Code).
08:23 AM, Apr 01, 2026: Release the initial version of CheetahClaws (~900 lines of Python Code).