openadapt-evals

January 17, 2026

Benchmark evaluation infrastructure for GUI automation agents.

Repository: OpenAdaptAI/openadapt-evals

Installation

pip install openadapt[evals]
# or
pip install openadapt-evals

Overview

The evals package provides:

  • Benchmark adapters for standardized evaluation
  • API agent implementations (Claude, GPT-4V)
  • Evaluation runners and metrics
  • Mock environments for testing

CLI Commands

Run Evaluation

# Evaluate a trained policy
openadapt eval run --checkpoint training_output/model.pt --benchmark waa

# Evaluate an API agent
openadapt eval run --agent api-claude --benchmark waa

Options:

  • --checkpoint - Path to trained policy checkpoint
  • --agent - Agent type (api-claude, api-gpt4v, custom)
  • --benchmark - Benchmark name (waa, osworld, etc.)
  • --tasks - Number of tasks to evaluate (default: all)
  • --output - Output directory for results
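For scripted runs, the options above can be assembled programmatically. The sketch below uses a hypothetical helper, `build_eval_command`, which is not part of openadapt-evals; it only arranges the flags documented above into an argument list. It assumes `--checkpoint` and `--agent` are alternatives, as the two example commands suggest.

```python
import shlex
from typing import List, Optional


def build_eval_command(
    benchmark: str,
    agent: Optional[str] = None,
    checkpoint: Optional[str] = None,
    tasks: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build an `openadapt eval run` argument list from the documented flags."""
    # Assumption: exactly one of --agent / --checkpoint selects what is evaluated.
    if (agent is None) == (checkpoint is None):
        raise ValueError("pass exactly one of agent or checkpoint")

    cmd = ["openadapt", "eval", "run"]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    else:
        cmd += ["--agent", agent]
    cmd += ["--benchmark", benchmark]
    if tasks is not None:  # leaving it out keeps the documented default (all tasks)
        cmd += ["--tasks", str(tasks)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_eval_command("waa", agent="api-claude", tasks=10, output="results/")
print(shlex.join(cmd))
# openadapt eval run --agent api-claude --benchmark waa --tasks 10 --output results/
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths that contain spaces.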

Run Mock Evaluation

Test your setup without running actual benchmarks:

openadapt eval mock --tasks 10

List Available Benchmarks

openadapt eval benchmarks

Supported Benchmarks

Benchmark   Description                  Tasks
waa         Windows Agent Arena          154
osworld     OSWorld                      369
webarena    WebArena                     812
mock        Mock benchmark for testing   Configurable

API Agents

Claude Agent

export ANTHROPIC_API_KEY=your-key-here
openadapt eval run --agent api-claude --benchmark waa

GPT-4V Agent

export OPENAI_API_KEY=your-key-here
openadapt eval run --agent api-gpt4v --benchmark waa
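Because a missing key typically only surfaces once the first API call fails, a small preflight check can save a wasted run. The sketch below is illustrative: `REQUIRED_KEYS` and `missing_api_key` are hypothetical names, and the mapping simply restates the two environment variables shown above.

```python
import os
from typing import Mapping, Optional

# Environment variable each API agent needs, per the commands above.
REQUIRED_KEYS = {
    "api-claude": "ANTHROPIC_API_KEY",
    "api-gpt4v": "OPENAI_API_KEY",
}


def missing_api_key(agent: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset key for `agent`, or None if nothing is missing."""
    env = os.environ if env is None else env
    key = REQUIRED_KEYS.get(agent)
    if key is None:
        return None  # agent not listed here, so there is nothing to check
    # An empty string is treated the same as an unset variable.
    return None if env.get(key) else key


# An explicit dict stands in for the real environment to keep the demo deterministic.
print(missing_api_key("api-claude", env={}))                          # ANTHROPIC_API_KEY
print(missing_api_key("api-gpt4v", env={"OPENAI_API_KEY": "sk-..."}))  # None
```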

Python API

from openadapt_evals import ApiAgent, evaluate_agent_on_benchmark

# Create an API agent
agent = ApiAgent.claude()

# Or load a trained policy instead (uncomment to replace the API agent above)
# from openadapt_ml import AgentPolicy
# agent = AgentPolicy.from_checkpoint("model.pt")

# Run evaluation
results = evaluate_agent_on_benchmark(
    agent=agent,
    benchmark="waa",
    num_tasks=10
)

print(f"Success rate: {results.success_rate:.2%}")
print(f"Average steps: {results.avg_steps:.1f}")

Evaluation Loop

flowchart TB
    subgraph Agent["Agent Under Test"]
        POLICY[Agent Policy]
        API[API Agent]
    end

    subgraph Benchmark["Benchmark System"]
        ADAPTER[Benchmark Adapter]
        MOCK[Mock Adapter]
        LIVE[Live Adapter]
    end

    subgraph Tasks["Task Execution"]
        TASK[Get Task]
        OBS[Observe State]
        ACT[Execute Action]
        CHECK[Check Success]
    end

    subgraph Metrics["Metrics"]
        SUCCESS[Success Rate]
        STEPS[Avg Steps]
        TIME[Execution Time]
    end

    POLICY --> ADAPTER
    API --> ADAPTER
    ADAPTER --> MOCK
    ADAPTER --> LIVE

    MOCK --> TASK
    LIVE --> TASK
    TASK --> OBS
    OBS --> POLICY
    OBS --> API
    POLICY --> ACT
    API --> ACT
    ACT --> CHECK
    CHECK -->|next| TASK
    CHECK -->|done| SUCCESS
    CHECK --> STEPS
    CHECK --> TIME
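The loop in the diagram can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's implementation: `ToyTask`, `ToyAdapter`, `ToyAgent`, and `run_eval` are hypothetical names, the "environment" is just a counter, and the real `BenchmarkAdapter` interface may differ.

```python
import time
from dataclasses import dataclass
from typing import List


@dataclass
class ToyTask:
    task_id: str
    target: int          # counter value that counts as success
    max_steps: int = 10  # step budget before the task is abandoned


class ToyAdapter:
    """Hypothetical stand-in for a benchmark adapter: the 'GUI' is a counter."""

    def __init__(self, tasks: List[ToyTask]):
        self.tasks = tasks
        self.state = 0

    def reset(self, task: ToyTask) -> int:
        self.state = 0
        return self.state            # initial observation

    def step(self, action: int) -> int:
        self.state += action         # execute the action
        return self.state            # next observation

    def check_success(self, task: ToyTask) -> bool:
        return self.state == task.target


class ToyAgent:
    """Hypothetical agent: always increments the counter by one."""

    def act(self, observation: int, task: ToyTask) -> int:
        return 1


def run_eval(agent: ToyAgent, adapter: ToyAdapter) -> dict:
    successes, steps_per_task = 0, []
    start = time.perf_counter()
    for task in adapter.tasks:                                # Get Task
        obs = adapter.reset(task)                             # Observe State
        steps = 0
        while steps < task.max_steps and not adapter.check_success(task):
            obs = adapter.step(agent.act(obs, task))          # Execute Action
            steps += 1
        successes += adapter.check_success(task)              # Check Success
        steps_per_task.append(steps)
    return {
        "success_rate": successes / len(adapter.tasks),
        "avg_steps": sum(steps_per_task) / len(steps_per_task),
        "total_time_s": time.perf_counter() - start,
    }


tasks = [ToyTask("t1", target=3), ToyTask("t2", target=5), ToyTask("t3", target=20)]
report = run_eval(ToyAgent(), ToyAdapter(tasks))
print(f"Success rate: {report['success_rate']:.2%}")  # Success rate: 66.67%
print(f"Average steps: {report['avg_steps']:.1f}")    # Average steps: 6.0
```

The third task cannot be reached within its step budget, so it counts as a failure while still contributing its 10 steps to the average.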

Key Exports

Export                        Description
ApiAgent                      API-based agent (Claude, GPT-4V)
BenchmarkAdapter              Benchmark interface
MockAdapter                   Mock benchmark for testing
evaluate_agent_on_benchmark   Agent evaluation function
EvalResults                   Evaluation results container

Metrics

Metric           Description
Success Rate     Percentage of tasks completed successfully
Average Steps    Mean number of steps per task
Execution Time   Total and per-task timing
Error Rate       Percentage of tasks that errored
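
As a worked example of how these four metrics relate, the sketch below computes them from per-task records. `TaskRecord` and `summarize` are hypothetical names used for illustration. The sketch assumes an errored task is counted separately from a task that ran to completion but failed; the package's `EvalResults` may define these differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class TaskRecord:
    success: bool    # task completed successfully
    errored: bool    # task raised an error before finishing
    steps: int       # number of actions taken
    seconds: float   # wall-clock time for the task


def summarize(records: List[TaskRecord]) -> dict:
    if not records:
        raise ValueError("no task records to summarize")
    n = len(records)
    total_time = sum(r.seconds for r in records)
    return {
        "success_rate": sum(r.success for r in records) / n,
        "error_rate": sum(r.errored for r in records) / n,
        "avg_steps": mean(r.steps for r in records),
        "total_time_s": total_time,
        "avg_time_s": total_time / n,
    }


records = [
    TaskRecord(success=True, errored=False, steps=4, seconds=12.0),
    TaskRecord(success=False, errored=False, steps=10, seconds=30.0),
    TaskRecord(success=False, errored=True, steps=1, seconds=3.0),
    TaskRecord(success=True, errored=False, steps=5, seconds=15.0),
]
summary = summarize(records)
print(f"Success rate: {summary['success_rate']:.2%}")  # Success rate: 50.00%
print(f"Error rate: {summary['error_rate']:.2%}")      # Error rate: 25.00%
print(f"Average steps: {summary['avg_steps']:.1f}")    # Average steps: 5.0
```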