DesktopCtl

May 18, 2026

Local CLI for AI agents to observe and control your computer via screen, mouse, and keyboard. Bring your own AI - any model, even without vision.

Runs fully local. No screenshots sent to the cloud.

Learn more at https://desktopctl.com

https://github.com/user-attachments/assets/4321b23e-6706-4792-a911-89e13766ebc0

Why DesktopCtl

  • Local-first runtime. No cloud dependency
  • Bring your own AI: works with any desktop AI agent
  • GPU-accelerated text recognition and computer vision
  • Selector-first automation (--text, --token) with coordinate fallback
  • Agent-friendly explicit waits and post-action verification
  • Stable JSON contracts for agent integrations

Architecture

DesktopCtl is split into two binaries:

  • DesktopCtl.app (desktopctld): daemon that owns perception, state, execution, and verification
  • desktopctl: stateless CLI surface for actions and queries over local IPC

Repository layout:

  • src/desktop/core - shared protocol and types
  • src/desktop/daemon - daemon runtime
  • src/desktop/cli - CLI client

Current Scope

  • macOS-first
  • OCR-first perception pipeline
  • Tokenized screen output for agent grounding
  • Deterministic CLI primitives for click/type/wait flows
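
The wait-style flows above can be approximated from a script by polling the tokenized screen output. This is a minimal sketch, not DesktopCtl's own wait primitive (whose syntax is not shown in this README). It assumes that `desktopctl screen tokenize --active-window <id>` prints output containing the visible text verbatim. `wait_for_text` and its parameters are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: poll the tokenized screen until some text appears.
# Assumption: the tokenize output contains on-screen text verbatim.
wait_for_text() {
  wft_win_id="$1"          # window id to observe
  wft_needle="$2"          # text expected to appear on screen
  wft_attempts="${3:-10}"  # maximum number of polls
  wft_delay="${4:-1}"      # seconds to sleep between polls

  wft_i=0
  while [ "$wft_i" -lt "$wft_attempts" ]; do
    wft_out="$(desktopctl screen tokenize --active-window "$wft_win_id")"
    # Quoting the needle inside the pattern makes this a literal substring match.
    case "$wft_out" in
      *"$wft_needle"*) return 0 ;;
    esac
    wft_i=$((wft_i + 1))
    sleep "$wft_delay"
  done

  echo "timed out waiting for text: $wft_needle" >&2
  return 1
}
```

For example, `wait_for_text "$win_id" "Shopping list" 5 1` polls up to five times, one second apart, and returns a non-zero status if the text never shows up, so an agent can branch on the result instead of sleeping for a fixed duration.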

Prerequisites

  • macOS (current support target)
  • Rust toolchain (cargo)
  • just command runner
  • Accessibility permission for DesktopCtl.app
  • Screen Recording permission for DesktopCtl.app
  • jq (used in the Quick Start example to parse JSON output)

Quick Start

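# Install DesktopCtl.
make install
# Open Notes and capture the JSON response.
raw="$(desktopctl app open Notes --json)"
# Extract the window id (empty if the field is missing).
win_id="$(printf '%s' "$raw" | jq -r '.result.window_id // empty')"
# Press Cmd+F, then type into that window.
desktopctl keyboard press cmd+f --active-window "$win_id" --no-observe
desktopctl keyboard type "Shopping list" --active-window "$win_id" --no-observe
# Print the tokenized screen for that window.
desktopctl screen tokenize --active-window "$win_id"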
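
The Quick Start continues even when no window id comes back, which would pass an empty value to `--active-window`. A fail-fast wrapper might look like the sketch below. It relies only on the `desktopctl app open <app> --json` command and the `.result.window_id` field shown above, and it needs `jq` on the PATH. `open_app_window` is a hypothetical helper name, not part of the CLI.

```shell
# Hypothetical helper: open an app and print its window id, or fail loudly.
open_app_window() {
  oaw_app="$1"

  # Propagate a failure from the CLI itself.
  oaw_raw="$(desktopctl app open "$oaw_app" --json)" || {
    echo "desktopctl app open failed for: $oaw_app" >&2
    return 1
  }

  # Same jq filter as the Quick Start: empty output when the field is missing.
  oaw_id="$(printf '%s' "$oaw_raw" | jq -r '.result.window_id // empty')"
  if [ -z "$oaw_id" ]; then
    echo "no window_id in response for: $oaw_app" >&2
    return 1
  fi

  printf '%s\n' "$oaw_id"
}
```

Usage: `win_id="$(open_app_window Notes)" || exit 1`, after which the keyboard and screen commands can use `"$win_id"` knowing it is non-empty.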

Status / Roadmap

  • Status: active development, with macOS-first CLI and daemon workflows already usable.
  • Improved reliability for text/token-driven actions and verification loops, with stable machine-readable error codes.
  • Upcoming CLI: doctor, richer window/app introspection, and --explain failure output.
  • Better local computer vision and semantic UI tokenization.
  • Multi-platform support.