OpenAdapt Architecture Evolution

January 17, 2026

Version: 3.0 | Date: January 2026 | Status: Living Document


Executive Summary

This document traces the evolution of OpenAdapt from its original alpha vision through the modern modular implementation, synthesizing state-of-the-art GUI agent research into a unified framework. OpenAdapt's core innovation is demonstration-conditioned automation: "show, don't tell."


Table of Contents

  1. Original Alpha Vision
  2. The Abstraction Ladder
  3. Core Innovation: Demo-Conditioned Agents
  4. Modern Architecture
  5. SOTA GUI Agent Integration
  6. Package Responsibilities
  7. Feedback Loops
  8. Implementation Status
  9. Architecture Evolution Diagrams
  10. Future Directions

1. Original Alpha Vision

The Three-Stage Pipeline (2023)

OpenAdapt was conceived as a three-stage pipeline for AI-first process automation:

+=====================+     +=====================+     +=====================+
|                     |     |                     |     |                     |
|      RECORDING      | --> |      ANALYSIS       | --> |       REPLAY        |
|                     |     |                     |     |                     |
|  Capture human      |     |  Convert to         |     |  Generate and       |
|  demonstrations:    |     |  tokenized format   |     |  replay synthetic   |
|  - Screenshots      |     |  for LMM            |     |  input via model    |
|  - User input       |     |  processing         |     |  completions        |
|                     |     |                     |     |                     |
+=====================+     +=====================+     +=====================+

Original Design Goals

From the legacy README:

"The goal is similar to that of Robotic Process Automation (RPA), except that we use Large Multimodal Models instead of conventional RPA tools."

Key Differentiators (Alpha):

  1. Model Agnostic - Works with any LMM
  2. Auto-Prompted - Learns from demonstration, not user prompts
  3. Grounded in Existing Processes - Mitigates hallucinations
  4. Universal GUI Support - Desktop, web, and virtualized (Citrix)
  5. Open Source - MIT license

Legacy Monolithic Implementation

The alpha codebase (legacy/openadapt/) implemented:

openadapt/
  record.py           # Screenshot/event capture
  replay.py           # Strategy-based playback
  models.py           # Recording, ActionEvent, Screenshot, WindowEvent
  events.py           # Event aggregation/processing
  strategies/
    base.py           # BaseReplayStrategy abstract class
    naive.py          # Direct literal replay
    stateful.py       # GPT-4 + OS-level window data
    vanilla.py        # Full VLM reasoning per step
    visual.py         # FastSAM segmentation
    visual_browser.py # DOM-based segments
  adapters/
    anthropic.py      # Claude API integration
    openai.py         # GPT API integration
    replicate.py      # Open-source model hosting
  privacy/
    base.py           # Scrubbing provider interface
    providers/        # Presidio, AWS Comprehend, Private AI

The Strategy Pattern (Original)

The original architecture used a BaseReplayStrategy abstract class:

from abc import ABC, abstractmethod

# Recording, Screenshot, WindowEvent, ActionEvent, and playback are
# defined elsewhere in the legacy openadapt package.


class BaseReplayStrategy(ABC):
    """Base class for implementing replay strategies."""

    def __init__(self, recording: Recording) -> None:
        self.recording = recording
        self.action_events = []
        self.screenshots = []
        self.window_events = []

    @abstractmethod
    def get_next_action_event(
        self,
        screenshot: Screenshot,
        window_event: WindowEvent,
    ) -> ActionEvent:
        """Get the next action based on current observation."""

    def run(self) -> None:
        """Execute the replay loop."""
        while True:
            # Observe: capture the current screen and active window.
            screenshot = Screenshot.take_screenshot()
            window_event = WindowEvent.get_active_window_event()
            # Decide: the concrete strategy picks the next action.
            action_event = self.get_next_action_event(screenshot, window_event)
            if action_event:
                # Act: synthesize the corresponding input.
                playback.play_action_event(action_event, ...)

This pattern evolved into the modern policy/grounding separation.

Alpha Data Model

from __future__ import annotations  # allow forward references in annotations

from PIL import Image


class Recording:
    """Container for a demonstration session."""
    id: int
    timestamp: float
    task_description: str
    action_events: list[ActionEvent]
    screenshots: list[Screenshot]
    window_events: list[WindowEvent]

class ActionEvent:
    """A single user action (click, type, scroll, etc.)."""
    name: str                    # "click", "type", "scroll", "press", "release"
    timestamp: float
    screenshot: Screenshot       # Screenshot just before action
    window_event: WindowEvent    # Active window state
    mouse_x: int                 # Mouse coordinates
    mouse_y: int
    key_char: str                # Keyboard input
    key_name: str
    element_state: dict          # Accessibility info

class Screenshot:
    """A captured screen image."""
    timestamp: float
    png_data: bytes
    image: Image.Image

2. The Abstraction Ladder

Core Concept: Progressive Abstraction

OpenAdapt processes demonstrations through ascending levels of abstraction, enabling generalization and transfer learning.

+=========================================================================+
|                                                                         |
|  Level 4: GOAL (Task Specification)                              FUTURE |
|  "Say hello to the customer"                                            |
|                                                                         |
|           ^                                                             |
|           | Goal Composition (high-level planning)                      |
|           |                                                             |
+=========================================================================+
|                                                                         |
|  Level 3: SEMANTIC (Intent Recognition)                        RESEARCH |
|  { action: "greet", target: "user" }                                    |
|                                                                         |
|           ^                                                             |
|           | Process Mining (discover patterns)                          |
|           |                                                             |
+=========================================================================+
|                                                                         |
|  Level 2: TEMPLATE (Parameterized Actions)                      PARTIAL |
|  { type: "hi <firstname>" }                                             |
|                                                                         |
|           ^                                                             |
|           | Anonymization (extract parameters)                          |
|           |                                                             |
+=========================================================================+
|                                                                         |
|  Level 1: SYMBOLIC (Semantic Actions)                       IMPLEMENTED |
|  { type: "hi bob" }                                                     |
|                                                                         |
|           ^                                                             |
|           | Reduction (aggregate consecutive events)                    |
|           |                                                             |
+=========================================================================+
|                                                                         |
|  Level 0: LITERAL (Raw Events)                              IMPLEMENTED |
|  { press: "h" }, { press: "i" }, { press: " " }, { press: "b" }, ...    |
|                                                                         |
+=========================================================================+

Abstraction Level Details

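| Level | Name     | Representation                        | Transformation         | Status      |
+-------+----------+---------------------------------------+------------------------+-------------+
| 0     | Literal  | Raw keypresses, mouse coords          | None (raw capture)     | Implemented |
| 1     | Symbolic | Aggregated actions (type "hello")     | Event reduction        | Implemented |
| 2     | Template | Parameterized (type "<greeting>")     | Regex extraction       | Partial     |
| 3     | Semantic | Intent-level (greet user)             | LLM intent recognition | Research    |
| 4     | Goal     | Task description ("Welcome customer") | Goal composition       | Future      |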

Why Abstraction Matters

| Level    | Enables                    | Example Use Case               |
+----------+----------------------------+--------------------------------+
| Literal  | Exact replay, debugging    | Audit trails, regression tests |
| Symbolic | Human-readable logs        | Training data visualization    |
| Template | Parameterized replay       | Same task, different data      |
| Semantic | Cross-application transfer | Greeting in any messaging app  |
| Goal     | Natural language control   | "Greet the next customer"      |

Current Implementation

Literal to Symbolic (openadapt-capture):

  • Event aggregation in events.py
  • Consecutive keypresses become type actions
  • Mouse drags become drag actions
  • Click sequences become doubleclick or tripleclick

Symbolic to Template (Partial):

  • Regex-based parameter extraction
  • User-defined placeholders
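
As a rough sketch of regex-based parameter extraction, the snippet below turns a symbolic string into a template with a named placeholder and fills it back in with new data. The function names and the shape of the `placeholders` mapping are assumptions made for this example, not the project's actual interface.

```python
import re


def to_template(text: str, placeholders: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace the first capture group of each pattern with a named placeholder.

    `placeholders` maps a placeholder name to a regex with exactly one
    capture group. Returns the templated text and the extracted values.
    """
    params: dict[str, str] = {}
    for name, pattern in placeholders.items():
        match = re.search(pattern, text)
        if match:
            params[name] = match.group(1)
            start, end = match.span(1)
            text = text[:start] + f"<{name}>" + text[end:]
    return text, params


def fill_template(template: str, params: dict[str, str]) -> str:
    """Substitute concrete values back into a template."""
    for name, value in params.items():
        template = template.replace(f"<{name}>", value)
    return template


template, params = to_template("hi bob", {"firstname": r"^hi (\w+)$"})
replayed = fill_template(template, {"firstname": "alice"})
```

With this input, `template` becomes "hi <firstname>" and `replayed` becomes "hi alice", which is the "same task, different data" use case described for the Template level.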

Template to Semantic (Research):

  • LLM-based intent recognition
  • Pattern library discovery

Semantic to Goal (Future):

  • Process mining algorithms
  • Cross-demo pattern extraction

3. Core Innovation: Demo-Conditioned Agents

The Fundamental Differentiator

OpenAdapt's core insight is demonstration-conditioned automation: "show, don't tell."

+-------------------------------------------------------------------+
|                     TRADITIONAL APPROACH                           |
+-------------------------------------------------------------------+
|                                                                    |
|  User: "Click the submit button"                                   |
|                                                                    |
|  Agent: [Which submit button? What context? What state?]           |
|         [Multiple submit buttons on page?]                         |
|         [Different applications have different buttons]            |
|                                                                    |
|  Result: AMBIGUOUS -> Requires prompt engineering                  |
|                                                                    |
+-------------------------------------------------------------------+

+-------------------------------------------------------------------+
|                    DEMO-CONDITIONED APPROACH                       |
+-------------------------------------------------------------------+
|                                                                    |
|  User: [Records clicking the blue "Submit Order" button            |
|         after filling out form fields]                             |
|                                                                    |
|  Agent: [Learns full context:                                      |
|          - Form state before action                                |
|          - Button appearance and location                          |
|          - Preceding actions in sequence                           |
|          - Window/application context]                             |
|                                                                    |
|  Result: GROUNDED -> No prompt engineering needed                  |
|                                                                    |
+-------------------------------------------------------------------+

Why Demo-Conditioning Works

  1. Captures Implicit Knowledge: Users demonstrate things they can't easily verbalize
  2. Grounded in Reality: Actions tied to actual UI states, not abstract descriptions
  3. Reduces Ambiguity: Visual context eliminates interpretation errors
  4. Lower Barrier: No prompt engineering skills required

Empirical Results

Demo conditioning improves first-action accuracy:

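| Approach         | First-Action Accuracy | Notes                           |
+------------------+-----------------------+---------------------------------+
| Prompt-only      | ~33%                  | Ambiguity in action selection   |
| Demo-conditioned | ~100%                 | Full context from demonstration |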

The "Show, Don't Tell" Principle

# Traditional: Prompt-driven
agent.execute("Click the submit button")
# -> Which submit button? What state? What context?

# Demo-Conditioned: Demonstration-driven
demo = capture_demonstration()  # User clicks specific submit button
agent = train_policy(demo)      # Agent learns the full context
agent.execute(new_context)      # Agent adapts to variations

4. Modern Architecture

Evolution: Monolith to Meta-Package

ALPHA (2023-2024)                        MODERN (2025+)
+====================+                   +====================+
|                    |                   |     openadapt      |
|     openadapt      |                   |    (meta-pkg)      |
|    (monolithic)    |                   +=========+=========+
|                    |                             |
|  - record.py       |           +-----------------+-----------------+
|  - replay.py       |           |         |       |       |         |
|  - strategies/     |      +----+----+ +--+--+ +--+--+ +--+--+ +----+----+
|  - models.py       |      |capture  | | ml  | |evals| |viewer| |optional |
|  - adapters/       |      +---------+ +-----+ +-----+ +------+ +---------+
|  - privacy/        |
|  - visualize.py    |      + grounding, retrieval, privacy
|                    |
+====================+

The Modern Three-Phase Architecture

Building on the alpha vision, the modern architecture formalizes three phases:

+=======================+     +=======================+     +=======================+
||                     ||     ||                     ||     ||                     ||
||     DEMONSTRATE     || --> ||       LEARN         || --> ||      EXECUTE        ||
||                     ||     ||                     ||     ||                     ||
||  (Observation       ||     ||  (Policy            ||     ||  (Agent             ||
||   Collection)       ||     ||   Acquisition)      ||     ||   Deployment)       ||
||                     ||     ||                     ||     ||                     ||
||  Packages:          ||     ||  Packages:          ||     ||  Packages:          ||
||  - capture          ||     ||  - ml               ||     ||  - evals            ||
||  - privacy          ||     ||  - retrieval        ||     ||  - grounding        ||
||                     ||     ||                     ||     ||                     ||
+=======================+     +=======================+     +=======================+

Phase 1: DEMONSTRATE (Observation Collection)

Purpose: Capture rich trajectories from human demonstrations.

Inputs:

  • User performs task in target application
  • Optional: Task description, success criteria, audio narration

Outputs:

  • Screenshot sequences (PNG/JPEG)
  • Input events (mouse, keyboard, touch)
  • Accessibility tree snapshots (a11y)
  • Window metadata (title, bounds, process)
  • Audio transcription (optional)

Packages: openadapt-capture, openadapt-privacy

Phase 2: LEARN (Policy Acquisition)

Purpose: Transform demonstrations into executable agent policies.

Three Learning Paths:

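| Path                   | Mechanism                     | Advantage               | Package               |
+------------------------+-------------------------------+-------------------------+-----------------------+
| A: Retrieval-Augmented | Index demos, retrieve similar | No training needed      | openadapt-retrieval   |
| B: Fine-Tuning         | Train VLM on demo dataset     | Specialized performance | openadapt-ml          |
| C: Process Mining      | Extract reusable patterns     | Cross-task transfer     | openadapt-ml (future) |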
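
Path A (retrieval-augmented) can be sketched without any training step: embed each demo, embed the current task, and return the most similar demos. The example below uses hand-written three-dimensional vectors in place of real CLIP or SigLIP embeddings, and `retrieve_top_k` is a hypothetical helper rather than the openadapt-retrieval API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query: list[float],
    index: dict[str, list[float]],
    k: int = 2,
) -> list[tuple[str, float]]:
    """Return the k demos whose embeddings are most similar to the query."""
    scored = [(demo_id, cosine_similarity(query, emb)) for demo_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real demo embeddings.
demo_index = {
    "submit_order_form": [0.9, 0.1, 0.0],
    "send_greeting": [0.0, 0.2, 0.9],
    "export_report": [0.1, 0.9, 0.1],
}
query_embedding = [1.0, 0.0, 0.1]
results = retrieve_top_k(query_embedding, demo_index, k=2)
```

The retrieved demos would then be passed to the policy as context. A production index would typically delegate the search to a vector library instead of scoring every demo in Python.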

Phase 3: EXECUTE (Agent Deployment)

Purpose: Run trained/conditioned agents autonomously.

Execution Loop:

while not task_complete:
    1. OBSERVE    - Capture screenshot + a11y tree
    2. GROUND     - Localize UI elements (SoM, OmniParser)
    3. PLAN       - VLM reasoning with demo context
    4. ACT        - Execute via input synthesis
    5. EVALUATE   - Check success, decide next step

Packages: openadapt-evals, openadapt-grounding, openadapt-ml
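
The execution loop above can be written as a small driver that takes each stage as a callable. This is a sketch under stated assumptions: the `run_agent` function, the `StepResult` type, and the dictionary-shaped observations and actions are all hypothetical, and the lambdas are toy stand-ins for real capture, grounding, and VLM calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    done: bool      # True when the task should stop
    success: bool   # True when the task reached its goal


def run_agent(
    observe: Callable[[], dict],
    ground: Callable[[dict], list],
    plan: Callable[[dict, list, list], dict],
    act: Callable[[dict], None],
    evaluate: Callable[[dict, dict], StepResult],
    max_steps: int = 10,
) -> bool:
    """Run OBSERVE -> GROUND -> PLAN -> ACT -> EVALUATE until done or out of steps."""
    history: list = []
    for _ in range(max_steps):
        observation = observe()                        # 1. OBSERVE
        elements = ground(observation)                 # 2. GROUND
        action = plan(observation, elements, history)  # 3. PLAN
        act(action)                                    # 4. ACT
        history.append(action)
        result = evaluate(observation, action)         # 5. EVALUATE
        if result.done:
            return result.success
    return False  # step budget exhausted


# Toy stand-ins: the task counts as complete after three actions.
executed: list = []
finished = run_agent(
    observe=lambda: {"screenshot": b"", "a11y": {}},
    ground=lambda obs: [{"id": 1, "bbox": (0, 0, 10, 10)}],
    plan=lambda obs, elements, history: {"type": "click", "target": elements[0]["id"]},
    act=executed.append,
    evaluate=lambda obs, action: StepResult(done=len(executed) == 3, success=True),
)
```

Passing the stages in as callables keeps the loop independent of any particular grounding provider or model adapter, and the `max_steps` budget guards against a loop that never reports completion.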


5. SOTA GUI Agent Integration

Policy/Grounding Separation

From Claude Computer Use, UFO, and SeeAct research:

+====================+          +====================+
|                    |          |                    |
|      POLICY        |    -->   |     GROUNDING      |
|                    |          |                    |
|   "What to do"     |          |   "Where to do"    |
|                    |          |                    |
| - Observation      |          | - Element          |
|   encoding         |          |   detection        |
| - Action           |          | - Coordinate       |
|   selection        |          |   mapping          |
| - History          |          | - Bounding         |
|   context          |          |   boxes            |
|                    |          |                    |
+====================+          +====================+

OpenAdapt Implementation:

  • Policy: openadapt-ml adapters (Claude, GPT-4V, Qwen-VL)
  • Grounding: openadapt-grounding providers (OmniParser, Florence2, Gemini)

Set-of-Mark (SoM) Prompting

From Microsoft's Set-of-Mark paper:

Original Screenshot              SoM-Annotated Screenshot
+---------------------+          +---------------------+
|  [Login]   [Help]   |          |   [1]      [2]      |
|                     |    ->    |                     |
|  Email: [________]  |          |  Email: [3]         |
|  Pass:  [________]  |          |  Pass:  [4]         |
|         [Submit]    |          |         [5]         |
+---------------------+          +---------------------+

Prompt: "Enter email in element [3], password in [4], click [5]"

OpenAdapt Implementation: openadapt-grounding.SoMPrompt
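
To make the SoM flow concrete, the sketch below numbers detected elements in reading order and maps a chosen mark back to a click coordinate. The `Element` type and both helper functions are hypothetical and are not the `SoMPrompt` interface; the bounding boxes are invented to mirror the login form in the diagram.

```python
from dataclasses import dataclass


@dataclass
class Element:
    label: str
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(elements: list[Element]) -> dict[int, Element]:
    """Number elements in reading order: top-to-bottom, then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    return {mark: element for mark, element in enumerate(ordered, start=1)}


def mark_to_point(marks: dict[int, Element], mark: int) -> tuple[int, int]:
    """Resolve a mark chosen by the model to the center of its bounding box."""
    left, top, right, bottom = marks[mark].bbox
    return ((left + right) // 2, (top + bottom) // 2)


# Detected elements in arbitrary order, as a detector might return them.
elements = [
    Element("Submit", (90, 120, 170, 140)),
    Element("Help", (120, 10, 170, 30)),
    Element("Login", (10, 10, 70, 30)),
    Element("Email", (70, 60, 170, 80)),
    Element("Pass", (70, 90, 170, 110)),
]
marks = assign_marks(elements)
click_point = mark_to_point(marks, 5)
```

With these boxes the marks come out as 1 Login, 2 Help, 3 Email, 4 Pass, 5 Submit, so a model answer of "click [5]" resolves to the center of the Submit button.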

Safety Gates

From responsible AI patterns:

+------------------+     +------------------+     +------------------+
|                  |     |                  |     |                  |
|     OBSERVE      | --> |    VALIDATE      | --> |      ACT         |
|                  |     |                  |     |                  |
|  Get current     |     |  - Check bounds  |     |  Execute if      |
|  state           |     |  - Verify perms  |     |  validated       |
|                  |     |  - Rate limit    |     |                  |
+------------------+     +--------+---------+     +------------------+
                                 |
                                 v (rejected)
                         +------------------+
                         |    ESCALATE      |
                         |  Human review    |
                         +------------------+

Status: Planned in openadapt-evals safety module.
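
Since the safety module is only planned, the following is a speculative sketch of what the VALIDATE and ESCALATE steps in the diagram could look like. The `SafetyGate` class, its fields, and the `gated_step` helper are assumptions made for illustration and do not describe existing openadapt-evals code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SafetyGate:
    screen_size: tuple[int, int]
    allowed_actions: set[str]
    max_actions_per_second: float = 5.0
    _last_action_time: float = field(default=float("-inf"), init=False)

    def validate(self, action: dict, now: Optional[float] = None) -> tuple[bool, str]:
        """Check bounds, permissions, and rate limit, in that order."""
        now = time.monotonic() if now is None else now
        x, y = action.get("x", 0), action.get("y", 0)
        width, height = self.screen_size
        if not (0 <= x < width and 0 <= y < height):
            return False, "out of bounds"
        if action["type"] not in self.allowed_actions:
            return False, "permission denied"
        if now - self._last_action_time < 1.0 / self.max_actions_per_second:
            return False, "rate limited"
        self._last_action_time = now
        return True, "ok"


def gated_step(
    gate: SafetyGate,
    action: dict,
    act: Callable[[dict], None],
    escalate: Callable[[dict, str], None],
    now: Optional[float] = None,
) -> bool:
    """Execute the action if validated, otherwise hand it to human review."""
    ok, reason = gate.validate(action, now)
    if ok:
        act(action)
    else:
        escalate(action, reason)
    return ok


gate = SafetyGate(screen_size=(1920, 1080), allowed_actions={"click", "type"})
executed: list = []
escalated: list = []
steps = [
    ({"type": "click", "x": 100, "y": 200}, 0.0),   # accepted
    ({"type": "click", "x": 5000, "y": 200}, 1.0),  # outside the screen
    ({"type": "delete_file"}, 2.0),                 # action type not allowed
    ({"type": "click", "x": 10, "y": 10}, 2.05),    # accepted
    ({"type": "click", "x": 10, "y": 10}, 2.1),     # too soon after the last one
]
for action, timestamp in steps:
    gated_step(gate, action, executed.append, lambda a, r: escalated.append(r), now=timestamp)
```

The explicit `now` argument exists only to make the rate limit deterministic in examples and tests; in normal use the gate would read the monotonic clock itself.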

Research Alignment

| Research Paper                        | Key Contribution            | OpenAdapt Integration           |
+---------------------------------------+-----------------------------+---------------------------------+
| Claude Computer Use (Anthropic, 2024) | Production VLM agent API    | API adapter in openadapt-ml     |
| UFO (Microsoft, 2024)                 | Windows agent architecture  | Prompt patterns adopted         |
| OSWorld (HKU, 2024)                   | Cross-platform benchmark    | Benchmark adapter planned       |
| Set-of-Mark (Microsoft, 2023)         | Visual grounding via labels | Core grounding mode             |
| OmniParser (Microsoft, 2024)          | Pure-vision UI parsing      | Provider in openadapt-grounding |
| SeeAct (OSU, 2024)                    | Grounded action generation  | Action space design             |
| WebArena (CMU, 2023)                  | Web automation benchmark    | Benchmark adapter in progress   |
| AppAgent (Tencent, 2024)              | Mobile GUI agent            | Mobile support planned          |

6. Package Responsibilities

Package-to-Phase Mapping

+===============================================================================+
|                           DEMONSTRATE PHASE                                    |
+===============================================================================+
| Package           | Responsibility             | Key Exports                  |
+-------------------+----------------------------+------------------------------+
| openadapt-capture | GUI recording, storage     | Recorder, CaptureSession     |
|                   |                            | Action, Screenshot, Trajectory|
+-------------------+----------------------------+------------------------------+
| openadapt-privacy | PII/PHI scrubbing          | Scrubber, Redactor           |
|                   | (integrates at capture)    | PrivacyFilter                |
+===============================================================================+

+===============================================================================+
|                             LEARN PHASE                                        |
+===============================================================================+
| Package             | Responsibility           | Key Exports                  |
+---------------------+--------------------------+------------------------------+
| openadapt-ml        | Model training,          | Trainer, AgentPolicy         |
|                     | inference, adapters      | QwenVLAdapter, ClaudeAdapter |
+---------------------+--------------------------+------------------------------+
| openadapt-retrieval | Demo embedding,          | DemoIndex, Embedder          |
|                     | similarity search        | SearchResult                 |
+===============================================================================+

+===============================================================================+
|                            EXECUTE PHASE                                       |
+===============================================================================+
| Package              | Responsibility          | Key Exports                  |
+----------------------+-------------------------+------------------------------+
| openadapt-evals      | Benchmark evaluation,   | BenchmarkAdapter, ApiAgent   |
|                      | metrics collection      | evaluate_agent_on_benchmark  |
+----------------------+-------------------------+------------------------------+
| openadapt-grounding  | UI element detection,   | ElementDetector, SoMPrompt   |
|                      | coordinate mapping      | OmniParser, GeminiGrounder   |
+===============================================================================+

+===============================================================================+
|                           CROSS-CUTTING                                        |
+===============================================================================+
| Package           | Responsibility             | Key Exports                  |
+-------------------+----------------------------+------------------------------+
| openadapt-viewer  | HTML visualization,        | PageBuilder, HTMLBuilder     |
|                   | trajectory replay          | TrajectoryViewer             |
+-------------------+----------------------------+------------------------------+
| openadapt         | Unified CLI,               | cli.main, lazy imports       |
| (meta-package)    | dependency coordination    |                              |
+===============================================================================+

Package Dependency Matrix

                    capture  ml  evals  viewer  grounding  retrieval  privacy
openadapt-capture      -     -    -       -        -          -          O
openadapt-ml           R     -    -       -        O          O          -
openadapt-evals        -     R    -       O        O          O          -
openadapt-viewer       O     O    O       -        -          -          O
openadapt-grounding    -     -    -       -        -          -          -
openadapt-retrieval    R     -    -       -        -          -          -
openadapt-privacy      -     -    -       -        -          -          -

Legend: R = Required, O = Optional, - = None
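
The required (R) entries of the matrix form a small dependency graph, and a topological sort gives a valid install order. The sketch below encodes only the R cells from the matrix above; the `REQUIRED` mapping and the `required_closure` helper are illustrative names, while `graphlib.TopologicalSorter` is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Package -> set of packages it requires (the "R" cells of the matrix).
REQUIRED: dict[str, set[str]] = {
    "openadapt-capture": set(),
    "openadapt-ml": {"openadapt-capture"},
    "openadapt-evals": {"openadapt-ml"},
    "openadapt-viewer": set(),
    "openadapt-grounding": set(),
    "openadapt-retrieval": {"openadapt-capture"},
    "openadapt-privacy": set(),
}

# Dependencies always appear before the packages that need them.
install_order = list(TopologicalSorter(REQUIRED).static_order())


def required_closure(package: str) -> set[str]:
    """All packages transitively required by `package`."""
    seen: set[str] = set()
    stack = [package]
    while stack:
        for dep in REQUIRED[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


evals_needs = required_closure("openadapt-evals")
```

Here `evals_needs` contains openadapt-ml and, through it, openadapt-capture. Optional (O) dependencies are left out on purpose, since they would be installed only on request.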

7. Feedback Loops

System-Level Feedback Architecture

                              DEMONSTRATE
                                   |
                                   | Human demonstrations
                                   v
+-----------------------------> LEARN <----------------------------+
|                                  |                               |
|                                  | Trained policies              |
|   +-----------------------------|---------------------+          |
|   |                             v                     |          |
|   |   +----------------->  EXECUTE  <--------------+  |          |
|   |   |                       |                    |  |          |
|   |   | Retry on              | Success/Failure   |  |          |
|   |   | recoverable           | outcomes          |  |          |
|   |   | errors                v                   |  |          |
|   |   |               +-------+-------+           |  |          |
|   |   |               |               |           |  |          |
|   |   +---------------+   EVALUATE    +-----------+  |          |
|   |   (Loop 1: Retry) |               |              |          |
|   |                   +-------+-------+              |          |
|   |                           |                      |          |
|   |                           | Execution traces     |          |
|   |                           v                      |          |
|   |                   Demo library grows             |          |
|   |                           |                      |          |
|   +---------------------------+                      |          |
|   (Loop 2: Library Growth)                           |          |
|                                                      |          |
|   Failure analysis identifies gaps                   |          |
|                    |                                 |          |
|                    v                                 |          |
|           Human correction                           |          |
|                    |                                 |          |
+--------------------+                                 |          |
(Loop 3: Human-in-Loop)                                |          |
                                                       |          |
                     Self-improvement loop             |          |
                     (execution traces -> training)    |          |
                              |                        |          |
                              +------------------------+          |
                              (Loop 4: Self-Improvement)          |
                                                                  |
             Benchmark-driven development                         |
             (eval results -> architecture improvements)          |
                              |                                   |
                              +-----------------------------------+
                              (Loop 5: Benchmark-Driven)

Feedback Loop Details

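| Loop | Name             | Trigger               | Outcome                  | Status      |
+------+------------------+-----------------------+--------------------------+-------------+
| 1    | Retry            | Recoverable error     | Re-attempt action        | Implemented |
| 2    | Library Growth   | Successful execution  | New demo added           | Implemented |
| 3    | Human-in-Loop    | Unrecoverable failure | Human correction -> demo | Implemented |
| 4    | Self-Improvement | Execution traces      | Fine-tuning              | Research    |
| 5    | Benchmark-Driven | Eval metrics          | Architecture changes     | Active      |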
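
Loops 1 to 3 can be combined in a single wrapper around task execution. The sketch below is an assumption-laden illustration: `RecoverableError`, `run_with_feedback`, and the list-based demo library are hypothetical, and it treats an exhausted retry budget as the unrecoverable failure that triggers human correction.

```python
from typing import Callable


class RecoverableError(Exception):
    """Hypothetical error for failures worth retrying, e.g. a slow page load."""


def run_with_feedback(
    task: str,
    execute: Callable[[str], list],
    demo_library: list,
    request_correction: Callable[[str], list],
    max_attempts: int = 3,
) -> str:
    """Retry on recoverable errors, then grow the demo library either way."""
    for _ in range(max_attempts):
        try:
            trace = execute(task)
        except RecoverableError:
            continue                    # Loop 1: retry
        demo_library.append(trace)      # Loop 2: successful trace joins the library
        return "success"
    # Loop 3: retries exhausted, so a human demonstrates the correct behavior.
    demo_library.append(request_correction(task))
    return "corrected"


# Toy executor that fails twice before succeeding.
attempts = {"count": 0}


def flaky_execute(task: str) -> list:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RecoverableError("element not found yet")
    return [f"trace for {task}"]


library: list = []
outcome = run_with_feedback(
    "submit order",
    flaky_execute,
    library,
    request_correction=lambda task: [f"human demo for {task}"],
)
```

Both exits append to the library, which is what lets later runs retrieve either the agent's own successful trace or the human correction.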

8. Implementation Status

What's Implemented vs Future Work

+==============================================================================+
|                    IMPLEMENTED (Solid)                                        |
+==============================================================================+
| Component                | Package          | Notes                          |
+--------------------------+------------------+--------------------------------+
| Screen capture           | capture          | macOS, Windows, Linux          |
| Event recording          | capture          | Mouse, keyboard, touch         |
| Event aggregation        | capture          | Literal -> Symbolic            |
| A11y tree capture        | capture          | macOS, Windows                 |
| Demo storage             | capture          | JSON/Parquet/PNG               |
| Privacy scrubbing        | privacy          | Presidio, AWS Comprehend       |
| Demo embedding           | retrieval        | CLIP, SigLIP                   |
| Vector indexing          | retrieval        | FAISS, Annoy                   |
| Similarity search        | retrieval        | Top-k retrieval                |
| API model adapters       | ml               | Claude, GPT-4V, Gemini         |
| Element detection        | grounding        | OmniParser, Florence2          |
| SoM annotation           | grounding        | Numbered element labels        |
| WAA benchmark            | evals            | Full integration               |
| Mock benchmark           | evals            | Testing infrastructure         |
| HTML visualization       | viewer           | Trajectory replay              |
| Unified CLI              | openadapt        | capture/train/eval/view        |
+==============================================================================+

+==============================================================================+
|                    IN PROGRESS (Dashed)                                       |
+==============================================================================+
| Component                | Package          | Notes                          |
+--------------------------+------------------+--------------------------------+
| Training pipeline        | ml               | Qwen-VL fine-tuning            |
| LoRA adapters            | ml               | Parameter-efficient training   |
| Template extraction      | capture          | Regex-based parameterization   |
| WebArena benchmark       | evals            | Browser automation             |
| Training dashboard       | viewer           | Loss/metrics visualization     |
| Audio transcription      | capture          | Whisper integration            |
+==============================================================================+

+==============================================================================+
|                    FUTURE WORK (Dotted)                                       |
+==============================================================================+
| Component                | Package          | Notes                          |
+--------------------------+------------------+--------------------------------+
| Process mining           | ml (future)      | Semantic action discovery      |
| Goal composition         | ml (future)      | High-level task planning       |
| Self-improvement         | ml (future)      | Training on execution traces   |
| OSWorld benchmark        | evals            | Cross-platform desktop         |
| Multi-agent collaboration| ml (future)      | Agent coordination             |
| Active learning          | ml (future)      | Human feedback integration     |
| Mobile platform          | capture          | iOS, Android                   |
| Safety gates             | evals            | Action validation layer        |
+==============================================================================+

Abstraction Ladder Implementation Status

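| Level | Name     | Status      | Implementation                 |
+-------+----------+-------------+--------------------------------+
| 0     | Literal  | Implemented | Raw event recording in capture |
| 1     | Symbolic | Implemented | Event aggregation in capture   |
| 2     | Template | Partial     | Regex extraction in capture    |
| 3     | Semantic | Research    | LLM intent recognition         |
| 4     | Goal     | Future      | Process mining                 |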

9. Architecture Evolution Diagrams

Era 1: Alpha Monolith (2023)

+=========================================================================+
|                        ALPHA ARCHITECTURE (2023)                         |
+=========================================================================+
|                                                                          |
|   +------------------------------------------------------------------+  |
|   |                       openadapt (monolithic)                      |  |
|   +------------------------------------------------------------------+  |
|   |                                                                   |  |
|   |   +-------------+    +-------------+    +-------------+          |  |
|   |   |   record    | -> |  visualize  | -> |   replay    |          |  |
|   |   +-------------+    +-------------+    +-------------+          |  |
|   |         |                  |                  |                  |  |
|   |         v                  v                  v                  |  |
|   |   +-------------+    +-------------+    +------------------+     |  |
|   |   |   models    |    |  plotting   |    |   strategies/    |     |  |
|   |   | - Recording |    | - HTML gen  |    |   - base.py      |     |  |
|   |   | - ActionEvt |    |             |    |   - naive.py     |     |  |
|   |   | - Screenshot|    |             |    |   - vanilla.py   |     |  |
|   |   | - WindowEvt |    |             |    |   - visual.py    |     |  |
|   |   +-------------+    +-------------+    +------------------+     |  |
|   |         |                                       |                |  |
|   |         v                                       v                |  |
|   |   +-------------+                       +---------------+        |  |
|   |   |    db/      |                       |   adapters/   |        |  |
|   |   | - SQLite    |                       | - anthropic   |        |  |
|   |   | - CRUD ops  |                       | - openai      |        |  |
|   |   +-------------+                       | - replicate   |        |  |
|   |                                         +---------------+        |  |
|   +------------------------------------------------------------------+  |
|                                                                          |
+=========================================================================+

Characteristics:
- Single repository, single package
- Tightly coupled components
- Strategy pattern for replay variants
- SQLite + Alembic migrations
- Prompts embedded in code
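
As a rough illustration of the strategy pattern noted above, the sketch below separates a shared replay loop from the variant-specific choice of the next action. The class and method names are assumptions for illustration and do not reproduce the legacy `strategies/` modules; screen capture and input synthesis are replaced by placeholders so the example runs anywhere.

```python
from abc import ABC, abstractmethod


class BaseReplayStrategy(ABC):
    """Shared replay loop; subclasses decide what the next action is."""

    def __init__(self, recorded_actions, max_steps=100):
        self.recorded_actions = list(recorded_actions)
        self.max_steps = max_steps
        self.played = []

    def take_screenshot(self):
        # Placeholder: a real implementation would capture the screen.
        return None

    @abstractmethod
    def get_next_action(self, screenshot):
        """Return the next action, or None when the replay is finished."""

    def play(self, action):
        # Placeholder: log the action instead of synthesizing real input.
        self.played.append(action)

    def run(self):
        for _ in range(self.max_steps):
            action = self.get_next_action(self.take_screenshot())
            if action is None:
                break
            self.play(action)
        return self.played


class NaiveReplayStrategy(BaseReplayStrategy):
    """Replays the recorded actions verbatim and ignores the screenshot."""

    def __init__(self, recorded_actions, max_steps=100):
        super().__init__(recorded_actions, max_steps)
        self._cursor = 0

    def get_next_action(self, screenshot):
        if self._cursor >= len(self.recorded_actions):
            return None
        action = self.recorded_actions[self._cursor]
        self._cursor += 1
        return action


result = NaiveReplayStrategy(["click ok", "type hello"]).run()
```

A model-backed variant would override only `get_next_action`, which is what makes the variants interchangeable while the loop stays in one place.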

Era 2: Transition (2024)

+=========================================================================+
|                     TRANSITION ARCHITECTURE (2024)                       |
+=========================================================================+
|                                                                          |
|   Legacy codebase frozen -> /legacy/                                     |
|                                                                          |
|   New modular packages designed:                                         |
|                                                                          |
|   +-------------+  +-------------+  +-------------+  +-------------+    |
|   |   capture   |  |     ml      |  |    evals    |  |   viewer    |    |
|   +-------------+  +-------------+  +-------------+  +-------------+    |
|   |   privacy   |  |  retrieval  |  |  grounding  |                     |
|   +-------------+  +-------------+  +-------------+                     |
|                                                                          |
|   Key changes:                                                           |
|   - Separate PyPI packages                                               |
|   - Lazy imports for optional deps                                       |
|   - Unified CLI in meta-package                                          |
|   - Policy/grounding separation                                          |
|   - Benchmark-first development                                          |
|                                                                          |
+=========================================================================+
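
One common way to implement "lazy imports for optional deps" is to defer the import until a feature is actually used and to raise an actionable error when the package is missing. The helper below is a minimal sketch under that assumption; the `lazy_import` name and the `openadapt[...]` extras syntax in the message are hypothetical and not taken from the OpenAdapt packages.

```python
import importlib


def lazy_import(module_name: str, extra: str):
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The extras name is an assumption used only to make the hint concrete.
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'openadapt[{extra}]'"
        ) from exc


# An installed module is imported normally.
json_mod = lazy_import("json", "core")

# A missing module yields the install hint instead of a bare traceback.
try:
    lazy_import("definitely_not_installed_pkg_xyz", "ml")
    message = ""
except ImportError as err:
    message = str(err)
```

With this approach, importing the meta-package stays cheap and heavy dependencies are only loaded by the commands that need them.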

Era 3: Modern Meta-Package (2025+)

+=========================================================================+
|                    MODERN ARCHITECTURE (2025+)                           |
+=========================================================================+
|                                                                          |
|                         +------------------+                             |
|                         |    User Layer    |                             |
|                         |   CLI / Web UI   |                             |
|                         +--------+---------+                             |
|                                  |                                       |
|                                  v                                       |
|                         +------------------+                             |
|                         |    openadapt     |                             |
|                         |   (meta-package) |                             |
|                         +--------+---------+                             |
|                                  |                                       |
|         +------------------------+------------------------+              |
|         |            |           |           |            |              |
|         v            v           v           v            v              |
|   +---------+  +---------+  +---------+  +---------+  +--------+        |
|   | capture |  |   ml    |  |  evals  |  | viewer  |  |optional|        |
|   +---------+  +---------+  +---------+  +---------+  +--------+        |
|       |            |            |            |            |              |
|       v            v            v            v            v              |
|   +---------------------------------------------------------------+     |
|   |                      Shared Interfaces                         |     |
|   |  - Trajectory format (JSON/Parquet)                           |     |
|   |  - Action space specification                                  |     |
|   |  - Observation schema                                          |     |
|   |  - Benchmark protocols                                         |     |
|   +---------------------------------------------------------------+     |
|                                  |                                       |
|                                  v                                       |
|   +---------------------------------------------------------------+     |
|   |                      Model Layer                               |     |
|   |  +----------+  +----------+  +----------+  +----------+       |     |
|   |  |  Claude  |  |  GPT-4o  |  |  Gemini  |  | Qwen-VL  |       |     |
|   |  +----------+  +----------+  +----------+  +----------+       |     |
|   +---------------------------------------------------------------+     |
|                                                                          |
+=========================================================================+
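
To make the shared trajectory interface more concrete, here is a minimal sketch of a JSON-serializable trajectory, following the glossary definition of a trajectory as a sequence of (observation, action) pairs. The field names (`screenshot_path`, `a11y_tree`, `target`, and so on) are assumptions for illustration; the actual schema is defined by the packages themselves.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Observation:
    screenshot_path: str
    a11y_tree: dict = field(default_factory=dict)


@dataclass
class Action:
    type: str          # e.g. "click" or "type"
    target: str = ""   # element identifier, if any
    text: str = ""     # text payload for "type" actions


@dataclass
class Step:
    observation: Observation
    action: Action


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, payload: str) -> "Trajectory":
        data = json.loads(payload)
        steps = [
            Step(Observation(**s["observation"]), Action(**s["action"]))
            for s in data["steps"]
        ]
        return cls(task=data["task"], steps=steps)


traj = Trajectory(
    task="Save the document",
    steps=[Step(Observation("step0.png"), Action("click", target="File > Save"))],
)
restored = Trajectory.from_json(traj.to_json())
```

A round trip through JSON yields an equal object, which is the property a shared interchange format needs so that capture, training, and evaluation can exchange data.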

Full System Architecture (Mermaid)

flowchart TB
    subgraph User["User Layer"]
        CLI[openadapt CLI]
        UI[Desktop/Web GUI]
    end

    subgraph Phase1["DEMONSTRATE"]
        direction TB
        REC[Recorder<br/>openadapt-capture]
        SCRUB[Privacy Scrubber<br/>openadapt-privacy]
        STORE[(Demo Library)]

        REC --> SCRUB
        SCRUB --> STORE
    end

    subgraph Phase2["LEARN"]
        direction TB

        subgraph Retrieval["Retrieval Path"]
            EMB[Embedder]
            IDX[Vector Index]
            SEARCH[Similarity Search]
        end

        subgraph Training["Training Path"]
            LOADER[Data Loader]
            TRAINER[Model Trainer]
            CKPT[(Checkpoints)]
        end

        subgraph Mining["Process Mining"]
            ABSTRACT[Abstractor]
            PATTERNS[Pattern Library]
        end
    end

    subgraph Phase3["EXECUTE"]
        direction TB
        OBS[1. OBSERVE]
        GROUND[2. GROUND<br/>openadapt-grounding]
        PLAN[3. PLAN<br/>Demo-Conditioned]
        ACT[4. ACT]
        EVAL[5. EVALUATE<br/>openadapt-evals]

        OBS --> GROUND
        GROUND --> PLAN
        PLAN --> ACT
        ACT --> EVAL
        EVAL -->|retry| OBS
    end

    subgraph Models["Model Layer"]
        direction LR
        CLAUDE[Claude]
        GPT[GPT-4o]
        GEMINI[Gemini]
        QWEN[Qwen-VL]
    end

    subgraph Viewer["Cross-Cutting: Viewer"]
        VIZ[Visualization]
        REPLAY[Replay]
        DASH[Dashboard]
    end

    %% User interactions
    CLI --> REC
    UI --> REC
    CLI --> TRAINER
    CLI --> EVAL

    %% Demo flow
    STORE --> EMB
    STORE --> LOADER
    STORE --> ABSTRACT

    EMB --> IDX
    IDX --> SEARCH
    LOADER --> TRAINER
    TRAINER --> CKPT
    ABSTRACT --> PATTERNS

    %% Execution flow (demo-conditioning)
    SEARCH -->|demo context| PLAN
    CKPT -->|policy| PLAN
    PATTERNS -.->|templates| PLAN

    %% Model connections
    PLAN --> Models
    GROUND --> Models

    %% Viewer connections
    STORE -.-> VIZ
    CKPT -.-> DASH
    EVAL -.-> DASH

    %% Feedback loops
    EVAL -->|success trace| STORE
    EVAL -->|failure| User

    %% Styling
    classDef userLayer fill:#E74C3C,stroke:#A93226,color:#fff
    classDef phase1 fill:#3498DB,stroke:#1A5276,color:#fff
    classDef phase2 fill:#27AE60,stroke:#1E8449,color:#fff
    classDef phase3 fill:#9B59B6,stroke:#6C3483,color:#fff
    classDef models fill:#F39C12,stroke:#B7950B,color:#fff
    classDef viewer fill:#1ABC9C,stroke:#148F77,color:#fff
    classDef future fill:#95A5A6,stroke:#707B7C,color:#fff,stroke-dasharray: 5 5

    class CLI,UI userLayer
    class REC,SCRUB,STORE phase1
    class EMB,IDX,SEARCH,LOADER,TRAINER,CKPT phase2
    class ABSTRACT,PATTERNS future
    class OBS,GROUND,PLAN,ACT,EVAL phase3
    class CLAUDE,GPT,GEMINI,QWEN models
    class VIZ,REPLAY,DASH viewer
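
The retrieval path in the diagram (Embedder, Vector Index, Similarity Search) can be sketched in a few lines. This is a toy illustration only: the bag-of-words `embed` function stands in for a learned embedding model, and `DemoIndex` is a hypothetical name, not the openadapt-retrieval API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedder: bag-of-words counts (stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class DemoIndex:
    """In-memory index mapping demo ids to embedded task descriptions."""

    def __init__(self):
        self._entries = []

    def add(self, demo_id: str, description: str) -> None:
        self._entries.append((demo_id, embed(description)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [demo_id for demo_id, _ in ranked[:k]]


index = DemoIndex()
index.add("demo-invoice", "create an invoice in the accounting app")
index.add("demo-email", "send an email with an attachment")
top = index.search("send email to a customer")
```

The ids returned by `search` correspond to the "demo context" edge from SEARCH to PLAN: the retrieved demonstration is what conditions the planner.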

Execution Loop Evolution

ALPHA: Strategy-Based                    MODERN: Policy/Grounding
================================         ================================

+------------------+                     +------------------+
|  BaseReplay      |                     |     OBSERVE      |
|  Strategy        |                     |  (Screenshot +   |
|                  |                     |   A11y tree)     |
|  while True:     |                     +--------+---------+
|    screenshot =  |                              |
|      take()      |                              v
|    action =      |                     +------------------+
|      get_next()  |      ------>        |     GROUND       |
|    play(action)  |                     |  (Element detect |
|                  |                     |   + SoM annotate)|
+------------------+                     +--------+---------+
                                                  |
                                                  v
                                         +------------------+
                                         |      PLAN        |
                                         |  (VLM reasoning  |
                                         |   + demo context)|
                                         +--------+---------+
                                                  |
                                                  v
                                         +------------------+
                                         |       ACT        |
                                         |  (Input synth +  |
                                         |   safety check)  |
                                         +--------+---------+
                                                  |
                                                  v
                                         +------------------+
                                         |    EVALUATE      |
                                         |  (Success check  |
                                         |   + feedback)    |
                                         +------------------+
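
The modern loop on the right can be sketched as a function that takes each stage as a callable. The signature, the `is_safe` hook, and the toy environment below are assumptions made for illustration; real stages would call a screenshot capturer, an element detector, a VLM, and an input synthesizer.

```python
def run_episode(observe, ground, plan, act, evaluate, demo_context,
                is_safe=lambda action: True, max_steps=10):
    """Run observe -> ground -> plan -> act -> evaluate until success or budget."""
    for _ in range(max_steps):
        observation = observe()                          # 1. OBSERVE
        elements = ground(observation)                   # 2. GROUND
        action = plan(observation, elements, demo_context)  # 3. PLAN
        if not is_safe(action):                          # safety check before acting
            return False
        act(action)                                      # 4. ACT
        if evaluate(observe()):                          # 5. EVALUATE
            return True
        # Not done yet: loop back to OBSERVE (the "retry" edge).
    return False


# Toy environment: the task succeeds after three clicks on "Next".
state = {"clicks": 0}
success = run_episode(
    observe=lambda: dict(state),
    ground=lambda obs: [{"id": 1, "label": "Next"}],
    plan=lambda obs, elements, demo: {"type": "click", "element_id": elements[0]["id"]},
    act=lambda action: state.update(clicks=state["clicks"] + 1),
    evaluate=lambda obs: obs["clicks"] >= 3,
    demo_context=["click Next three times"],
)
```

Compared with the alpha loop, the decision is split into grounding (what is on screen) and planning (what to do), and every step ends with an explicit success check.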

Package Responsibility Diagram

flowchart LR
    subgraph demonstrate["DEMONSTRATE Phase"]
        CAP[openadapt-capture]
        PRV[openadapt-privacy]
    end

    subgraph learn["LEARN Phase"]
        RET[openadapt-retrieval]
        ML[openadapt-ml]
    end

    subgraph execute["EXECUTE Phase"]
        GRD[openadapt-grounding]
        EVL[openadapt-evals]
    end

    subgraph crosscut["Cross-Cutting"]
        VWR[openadapt-viewer]
    end

    subgraph meta["Meta-Package"]
        OA[openadapt CLI]
    end

    %% CLI orchestrates all
    OA --> CAP
    OA --> ML
    OA --> EVL
    OA --> VWR

    %% Phase 1 flow
    CAP -->|raw data| PRV
    PRV -->|scrubbed data| ML
    PRV -->|scrubbed data| RET

    %% Phase 2 flow
    RET -->|embeddings| EVL
    ML -->|checkpoints| EVL

    %% Phase 3 flow
    GRD -->|coordinates| EVL
    GRD -->|coordinates| ML

    %% Viewer integration
    CAP -.->|demos| VWR
    ML -.->|training| VWR
    EVL -.->|results| VWR

    %% Styling
    classDef phase1 fill:#3498DB,stroke:#1A5276,color:#fff
    classDef phase2 fill:#27AE60,stroke:#1E8449,color:#fff
    classDef phase3 fill:#9B59B6,stroke:#6C3483,color:#fff
    classDef cross fill:#1ABC9C,stroke:#148F77,color:#fff
    classDef meta fill:#2C3E50,stroke:#1A252F,color:#fff

    class CAP,PRV phase1
    class RET,ML phase2
    class GRD,EVL phase3
    class VWR cross
    class OA meta

Feedback Loop Diagram

flowchart TB
    subgraph Loop1["Loop 1: Demo Library Growth"]
        D1[Human Demo] --> L1[Learn]
        L1 --> E1[Execute]
        E1 --> |success| S1[Store as Demo]
        E1 --> |failure| F1[Failure Analysis]
        F1 --> |new demo| D1
        S1 --> L1
    end

    subgraph Loop2["Loop 2: Self-Improvement"]
        E2[Execute] --> T2[Trace]
        T2 --> |success trace| FT2[Fine-tune]
        FT2 --> E2
    end

    subgraph Loop3["Loop 3: Benchmark-Driven"]
        B3[Benchmark] --> M3[Metrics]
        M3 --> A3[Architecture Improvement]
        A3 --> B3
    end

    Loop1 -.->|traces| Loop2
    Loop2 -.->|models| Loop1
    Loop1 -.->|metrics| Loop3
    Loop3 -.->|improvements| Loop1

    %% Styling
    classDef loop1 fill:#3498DB,stroke:#1A5276,color:#fff
    classDef loop2 fill:#27AE60,stroke:#1E8449,color:#fff,stroke-dasharray: 5 5
    classDef loop3 fill:#9B59B6,stroke:#6C3483,color:#fff

    class D1,L1,E1,S1,F1 loop1
    class E2,T2,FT2 loop2
    class B3,M3,A3 loop3
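
Loop 1 can be summarized in code as a single routing decision on each execution outcome. The sketch below is hypothetical (`DemoLibrary` and `record_outcome` are illustrative names): successful traces are stored as new demos, and failed tasks are queued so that a human can provide a fresh demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class DemoLibrary:
    demos: list = field(default_factory=list)
    needs_human_demo: list = field(default_factory=list)

    def record_outcome(self, task: str, trace: list, success: bool) -> None:
        """Route an execution outcome back into the library."""
        if success:
            # Success: the trace itself becomes a demo for future retrieval.
            self.demos.append({"task": task, "trace": trace, "source": "agent"})
        else:
            # Failure: flag the task so a human can record a new demo.
            self.needs_human_demo.append(task)


library = DemoLibrary()
library.record_outcome("export report", ["click File", "click Export"], success=True)
library.record_outcome("reconcile ledger", ["click Ledger"], success=False)
```

Tagging agent-generated demos with a `source` field is a design choice in this sketch; it keeps human and agent demonstrations distinguishable if they later need different treatment.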

10. Future Directions

Near-Term (Q1 2026)

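| Priority | Goal                             | Package           | Status      |
|----------|----------------------------------|-------------------|-------------|
| P0       | PyPI releases for all packages   | all               | In progress |
| P0       | WAA baseline metrics established | evals             | Pending     |
| P1       | Fine-tuning pipeline validated   | ml                | In progress |
| P1       | Demo conditioning in evals       | evals + retrieval | Pending     |
| P2       | docs.openadapt.ai launched       | docs              | Pending     |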

Medium-Term (2026)

| Goal                  | Description                                         |
|-----------------------|-----------------------------------------------------|
| Process Mining        | Automatic extraction of semantic actions from demos |
| Self-Improvement      | Training on successful execution traces             |
| Multi-Benchmark       | WebArena + OSWorld integration                      |
| Enterprise Deployment | Production deployment guides                        |

Long-Term (2026+)

| Goal               | Description                             |
|--------------------|-----------------------------------------|
| Cross-App Transfer | Demos from Excel help with Google Sheets |
| Multi-Agent        | Coordinated agents for complex workflows |
| Active Learning    | Agents request human help strategically  |
| Mobile Platforms   | iOS and Android capture/replay           |

Research Questions

  1. Abstraction Discovery: Can we automatically extract semantic actions from literal event sequences?
  2. Transfer Learning: How much does demo conditioning help across different applications?
  3. Explanation: How do we make agent decisions interpretable to users?
  4. Safety: What guardrails prevent harmful autonomous actions?

Appendix A: Glossary

| Term       | Definition                                                 |
|------------|------------------------------------------------------------|
| A11y Tree  | Accessibility tree: structured UI element representation   |
| Demo       | Recorded human demonstration (trajectory)                  |
| Grounding  | Mapping text/intent to specific UI coordinates             |
| LoRA       | Low-Rank Adaptation: efficient fine-tuning method          |
| Policy     | Decision function mapping observations to actions          |
| SoM        | Set-of-Mark: visual grounding via numbered labels          |
| Trajectory | Sequence of (observation, action) pairs                    |
| VLM        | Vision-Language Model                                      |
| WAA        | Windows Agent Arena benchmark                              |

Appendix C: Version History

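| Version | Date      | Changes                                                    |
|---------|-----------|------------------------------------------------------------|
| 3.0     | Jan 2026  | Alpha vision synthesis, evolution diagrams, SOTA alignment |
| 2.0     | Jan 2026  | Comprehensive redesign, modular architecture               |
| 1.0     | Dec 2025  | Initial modular architecture                               |
| 0.x     | 2023-2024 | Legacy monolithic design                                   |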

This document is maintained as part of the OpenAdapt project. For updates, see the GitHub repository.