QwenClawBench
June 10, 2026
A real-user-distribution benchmark for OpenClaw agents — built for robust evaluation at scale
QwenClawBench is a real-user-distribution benchmark for evaluating OpenClaw agents. It was originally built as an internal benchmark during the development of Qwen3.6-Plus, and has since been optimized and open-sourced.
Why QwenClawBench?
QwenClawBench contains 100 tasks across 8 core domains, each with an isolated simulated workspace. Domains are carefully chosen to reflect real OpenClaw usage patterns, and assets are designed to simulate authentic working scenarios.
Reproducing robust large-scale evaluations on OpenClaw is nontrivial, as results depend heavily on infrastructure reliability. We have built in the following features to address this:
- Docker Isolation: Each task runs in a dedicated Docker container, ensuring a consistent and reproducible environment
- Concurrent Execution: Tasks run in parallel across multiple containers, significantly reducing total evaluation time
- Anomaly Detection: Infrastructure failures (API errors, container crashes, timeouts) are flagged explicitly rather than silently folded into scores, so you always know which results to trust
- Resumable Runs: Interrupted runs can be resumed from where they left off, skipping already-completed healthy tasks to obtain stable results
Quick Start
Requirements
- Python >= 3.10
- Docker
Install dependencies:
pip install "pyyaml>=6.0.1" "tqdm>=4.0"
Pull the OpenClaw Docker image:
docker pull ghcr.io/openclaw/openclaw:main
Configuration
- Place the benchmark data under data/<dataset_name>/ (default: qwenclawbench-v1.1-100), with the following structure:
data/qwenclawbench-v1.1-100/
├── tasks/ # task_*.md files
└── assets/ # per-task asset directories
- Configure the model provider in openclaw_config/openclaw.json. See openclaw_config/openclaw.json.example for reference, or copy your existing ~/.openclaw/openclaw.json.
- Set API credentials in openclaw_config/.env. See openclaw_config/.env.example; required variables are marked in the example file.
Run Evaluation
# Set variables
DATASET="qwenclawbench-v1.1-100"
RUNS=3
CONCURRENCY=10
LOGDIR="logs/$DATASET"
# Run evaluation: 10 parallel containers, 3 runs per task (scores are averaged)
./scripts/run.sh --model dashscope/qwen3.6-plus \
--dataset $DATASET \
--runs $RUNS \
--concurrency $CONCURRENCY \
--output-dir ./results/$DATASET/qwen3.6-plus \
--log-file $LOGDIR/qwen-3.6-plus.log
Interrupted runs can be resumed without re-running completed tasks. Anomalous runs can also be selectively retried:
# Resume an interrupted run — use the same --output-dir as the original run
./scripts/run.sh --model dashscope/qwen3.6-plus \
--dataset $DATASET \
--runs $RUNS \
--concurrency $CONCURRENCY \
--output-dir ./results/$DATASET/qwen3.6-plus \
--log-file $LOGDIR/qwen-3.6-plus.log
# Resume and rerun all anomalous tasks
./scripts/run.sh --model dashscope/qwen3.6-plus \
--dataset $DATASET \
--runs $RUNS \
--concurrency $CONCURRENCY \
--output-dir ./results/$DATASET/qwen3.6-plus \
--log-file $LOGDIR/qwen-3.6-plus.log \
--rerun-anomalous
# Force a clean start, discarding any existing results
./scripts/run.sh --model dashscope/qwen3.6-plus \
--dataset $DATASET \
--runs $RUNS \
--concurrency $CONCURRENCY \
--output-dir ./results/$DATASET/qwen3.6-plus \
--log-file $LOGDIR/qwen-3.6-plus.log \
--no-resume
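The resume behavior above can be pictured as a filter over the (task, run) pairs that still need to execute. The following is a minimal sketch, not the logic in scripts/benchmark.py: `RunRecord`, `select_pending`, and their field names are hypothetical, and it assumes that a plain resume keeps anomalous results unless --rerun-anomalous is given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One stored result; all field names here are hypothetical."""
    task_id: str
    run_index: int
    anomalous: bool  # True if an infrastructure failure was flagged


def select_pending(task_ids, runs, existing, rerun_anomalous=False, no_resume=False):
    """Return the (task_id, run_index) pairs that still need to execute."""
    if no_resume:
        # Clean start: ignore everything already stored.
        existing = []
    done = set()
    for record in existing:
        if record.anomalous and rerun_anomalous:
            continue  # treat flagged runs as not done so they are retried
        done.add((record.task_id, record.run_index))
    return [(t, r) for t in task_ids for r in range(runs) if (t, r) not in done]
```

With two stored runs for one task, one of them anomalous, a plain resume schedules nothing, `rerun_anomalous=True` schedules only the flagged run, and `no_resume=True` schedules both.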
Tasks
Task Category Distribution
QwenClawBench emphasizes realism and complexity, covering 100 tasks across 8 domains.
| Category | Count | Description |
|---|---|---|
| Workflow and Agent Orchestration | 21 | Workflow orchestration, skill creation, cron jobs, multi-agent coordination |
| System Operations and Administration | 20 | System ops, environment configuration, troubleshooting, workspace management |
| Knowledge and Memory Management | 15 | Knowledge base construction, memory system design, document management, context retrieval |
| Finance and Quantitative Trading | 10 | Quant strategy backtesting, arbitrage monitoring, trade analysis, position management |
| Data Analysis and Modeling | 10 | Statistical analysis, data processing, quality auditing, regression modeling |
| Security and Vulnerability Management | 9 | Security auditing, credential management, injection defense, privacy compliance |
| Communication and Scheduling | 8 | Message notifications, schedule planning, timed reminders, task scheduling |
| Research and Information Retrieval | 7 | Competitive analysis, literature retrieval, technical research, SEO keyword research |
Task Structure
Each task is a Markdown file (data/<dataset_name>/tasks/task_*.md) with a YAML frontmatter header and structured body sections.
Frontmatter Metadata:
| Field | Description |
|---|---|
| `id` | Unique task identifier |
| `name` | Short task title |
| `category` / `subcategory` | Task category and subcategory |
| `grading_type` | Scoring mode: `automated` (pure automated), `llm_judge` (pure LLM review), `hybrid` (combined) |
| `grading_weights` | Weight allocation between automated and LLM judge scoring in hybrid mode |
| `timeout_seconds` | Task execution timeout |
| `workspace_files` | Initial workspace file mappings |
Body Sections:
| Section | Content |
|---|---|
| `## Prompt` | User instructions for the agent: the specific task the agent needs to complete |
| `## Expected Behavior` | Detailed description of expected behavior, also serves as reference context for the LLM judge |
| `## Grading Criteria` | Scoring checklist (`- [ ]` format) |
| `## Automated Checks` | Automated scoring code (Python), defines a `grade(transcript, workspace_path) -> dict` function |
| `## LLM Judge Rubric` | LLM judge scoring dimensions with detailed descriptions for each score tier |
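To illustrate the layout described in the two tables above, here is a minimal sketch of splitting a task file into its frontmatter text and body sections. It is not the parser in scripts/lib_tasks.py: `split_task_file` is a hypothetical helper, and it assumes the YAML header is delimited by a pair of `---` lines.

```python
import re


def split_task_file(text):
    """Split a task file into (frontmatter_text, {section_title: body})."""
    # Assumption: the YAML header sits between two '---' lines at the top.
    match = re.match(r"^---[ \t]*\n(.*?)\n---[ \t]*\n(.*)$", text, re.DOTALL)
    if match is None:
        raise ValueError("no frontmatter block found")
    frontmatter, body = match.group(1), match.group(2)

    sections = {}
    # Splitting on level-2 headings yields: preamble, title, content, title, ...
    parts = re.split(r"^## +(.+?)[ \t]*$", body, flags=re.MULTILINE)
    for title, content in zip(parts[1::2], parts[2::2]):
        sections[title] = content.strip()
    return frontmatter, sections
```

The frontmatter text can then be passed to `yaml.safe_load` from PyYAML to obtain fields such as `grading_type`. Note that this simple split would also break on a line starting with `## ` inside a code block, so a real parser needs to handle fenced code.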
Asset Directories:
Each task has a corresponding directory under data/<dataset_name>/assets/<task_id>/ containing the initial workspace files (code, configs, data, logs, etc.). These files are copied into the Docker container's workspace before the task runs.
Scoring Mechanism
QwenClawBench supports three scoring modes: automated, llm_judge, and hybrid.
Automated:
A Python function grade(transcript, workspace_path) embedded in the task definition performs deterministic, rule-based checks on the agent's deliverables — verifying output files, command results, and workspace state. The final score is the mean across all checked dimensions.
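As an illustration, a `grade` function for a task might look like the sketch below. The deliverable name `report.md`, the dimension names, and the `automated_score` helper are hypothetical, and the return format (one score in [0.0, 1.0] per dimension) is an assumption about the dict described above rather than a documented contract.

```python
from pathlib import Path


def grade(transcript, workspace_path):
    """Return one score in [0.0, 1.0] per checked dimension."""
    # This sketch inspects only the workspace; the transcript is unused.
    workspace = Path(workspace_path)
    report = workspace / "report.md"  # hypothetical deliverable
    text = report.read_text(encoding="utf-8") if report.is_file() else ""
    return {
        "report_exists": 1.0 if report.is_file() else 0.0,
        "report_has_summary": 1.0 if "## Summary" in text else 0.0,
    }


def automated_score(dimension_scores):
    """Mean across all checked dimensions (0.0 if nothing was checked)."""
    if not dimension_scores:
        return 0.0
    return sum(dimension_scores.values()) / len(dimension_scores)
```

A workspace that contains `report.md` without a summary section would pass one of the two checks and score 0.5.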
LLM Judge:
A judge model (claude-opus-4.5 by default) reviews the agent's action transcript and scores performance across multiple rubric dimensions, each from 0.0 to 1.0. The final score is a weighted average across dimensions.
Hybrid:
Both methods run independently and are combined via grading_weights. Typically, automated checks verify concrete deliverables against ground-truth rules, while the LLM judge evaluates the quality and coherence of the agent's reasoning trajectory.
In practice, we observed that on some tasks agents failed to produce correct deliverables yet still received high LLM judge scores. LLM judges primarily assess trajectory quality, are inherently less stable than rule-based checks, and are more easily gamed by fluent but incorrect outputs. To prevent this, we apply penalized scoring:
When the automated score falls below 0.75, the LLM judge contribution is zeroed out — the assumption being that a model that failed basic deliverable checks should not receive credit from the judge regardless of how well it reasoned. Use --simple-scoring to disable this and fall back to a plain weighted average.
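The rule can be sketched as follows. This is an illustration, not the code in scripts/lib_grading.py: `hybrid_score` is a hypothetical function, the weight keys `automated` and `llm_judge` are assumed names for the entries in grading_weights, and normalizing by the sum of the weights is also an assumption.

```python
def hybrid_score(automated, llm_judge, weights, threshold=0.75, simple_scoring=False):
    """Combine automated and LLM judge scores, each in [0.0, 1.0]."""
    w_auto = weights["automated"]  # assumed key name
    w_llm = weights["llm_judge"]   # assumed key name
    if not simple_scoring and automated < threshold:
        # Penalized scoring: failed deliverable checks earn no judge credit.
        llm_judge = 0.0
    return (w_auto * automated + w_llm * llm_judge) / (w_auto + w_llm)
```

With equal weights, an automated score of 0.5 and a judge score of 1.0 give 0.25 under penalized scoring, compared with 0.75 for the plain weighted average.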
Project Structure
QwenClawBench/
├── README.md
├── scripts/
│ ├── benchmark.py # Main evaluation entry point
│ ├── lib_tasks.py # Task loading and parsing
│ ├── lib_grading.py # Scoring engine (automated + LLM judge + hybrid)
│ ├── lib_docker.py # Docker container management
│ ├── lib_agent.py # OpenClaw agent interaction
│ ├── lib_anomalies.py # Anomaly detection
│ └── run.sh # Helper run script
├── openclaw_config/
│ ├── openclaw.json # OpenClaw model and provider configuration
│ └── .env # API credentials (should not be committed to VCS)
└── data/
    └── <dataset>/
        ├── tasks/ # 100 task_*.md task definition files
        └── assets/ # 100 task asset directories (initial workspace files)
Acknowledgments
QwenClawBench is built on top of the PinchBench framework. We also acknowledge other open-source contributions from the community, such as Claw-Eval, ZClawBench, and WildClawBench.
Citation
If you use QwenClawBench in your research, please cite:
@misc{qwenclawbench1.1,
title = {{QwenClawBench}: Real-user-distribution benchmark for OpenClaw agents},
url = {https://github.com/SKYLENAGE-AI/QwenClawBench},
author = {{Qwen Team} and {Alibaba Data}},
month = {April},
year = {2026}
}
License
MIT — see LICENSE for details.