openadapt-evals

January 17, 2026 · View on GitHub

Benchmark evaluation infrastructure for GUI automation agents.

Repository: OpenAdaptAI/openadapt-evals

Installation

pip install openadapt[evals]
# or
pip install openadapt-evals

Overview

The evals package provides:

Benchmark adapters for standardized evaluation
API agent implementations (Claude, GPT-4V)
Evaluation runners and metrics
Mock environments for testing

CLI Commands

Run Evaluation

# Evaluate a trained policy
openadapt eval run --checkpoint training_output/model.pt --benchmark waa

# Evaluate an API agent
openadapt eval run --agent api-claude --benchmark waa

Options:

--checkpoint - Path to trained policy checkpoint
--agent - Agent type (api-claude, api-gpt4v, custom)
--benchmark - Benchmark name (waa, osworld, etc.)
--tasks - Number of tasks to evaluate (default: all)
--output - Output directory for results

Run Mock Evaluation

Test your setup without running actual benchmarks:

openadapt eval mock --tasks 10

List Available Benchmarks

openadapt eval benchmarks

Supported Benchmarks

Benchmark	Description	Tasks
`waa`	Windows Agent Arena	154
`osworld`	OSWorld	369
`webarena`	WebArena	812
`mock`	Mock benchmark for testing	Configurable

API Agents

Claude Agent

export ANTHROPIC_API_KEY=your-key-here
openadapt eval run --agent api-claude --benchmark waa

GPT-4V Agent

export OPENAI_API_KEY=your-key-here
openadapt eval run --agent api-gpt4v --benchmark waa

Python API

from openadapt_evals import ApiAgent, BenchmarkAdapter, evaluate_agent_on_benchmark

# Create an API agent
agent = ApiAgent.claude()

# Or load a trained policy
from openadapt_ml import AgentPolicy
agent = AgentPolicy.from_checkpoint("model.pt")

# Run evaluation
results = evaluate_agent_on_benchmark(
    agent=agent,
    benchmark="waa",
    num_tasks=10
)

print(f"Success rate: {results.success_rate:.2%}")
print(f"Average steps: {results.avg_steps:.1f}")

Evaluation Loop

flowchart TB
    subgraph Agent["Agent Under Test"]
        POLICY[Agent Policy]
        API[API Agent]
    end

    subgraph Benchmark["Benchmark System"]
        ADAPTER[Benchmark Adapter]
        MOCK[Mock Adapter]
        LIVE[Live Adapter]
    end

    subgraph Tasks["Task Execution"]
        TASK[Get Task]
        OBS[Observe State]
        ACT[Execute Action]
        CHECK[Check Success]
    end

    subgraph Metrics["Metrics"]
        SUCCESS[Success Rate]
        STEPS[Avg Steps]
        TIME[Execution Time]
    end

    POLICY --> ADAPTER
    API --> ADAPTER
    ADAPTER --> MOCK
    ADAPTER --> LIVE

    MOCK --> TASK
    LIVE --> TASK
    TASK --> OBS
    OBS --> POLICY
    OBS --> API
    POLICY --> ACT
    API --> ACT
    ACT --> CHECK
    CHECK -->|next| TASK
    CHECK -->|done| SUCCESS
    CHECK --> STEPS
    CHECK --> TIME

Key Exports

Export	Description
`ApiAgent`	API-based agent (Claude, GPT-4V)
`BenchmarkAdapter`	Benchmark interface
`MockAdapter`	Mock benchmark for testing
`evaluate_agent_on_benchmark`	Agent evaluation function
`EvalResults`	Evaluation results container

Metrics

Metric	Description
Success Rate	Percentage of tasks completed successfully
Average Steps	Mean number of steps per task
Execution Time	Total and per-task timing
Error Rate	Percentage of tasks that errored

openadapt-ml - Learn policies to evaluate
openadapt-capture - Collect demonstrations