Running CUA-RAG in Windows Agent Arena

January 28, 2026

Prerequisites

Pull the latest WindowsAgentArena code in the WAA submodule
Follow the instructions in WindowsAgentArena/internal/LOCALDEV.md to:
- Prepare and build the Docker image
- Create the winarena conda environment

After building the image, create a clean backup copy:

cp -rf cua_skill/WindowsAgentArena/src/win-arena-container/vm/storage \
         cua_skill/WindowsAgentArena/src/win-arena-container/vm/storage_gold

If you already have a downloaded storage image, you can instead name it storage_gold and place it in the same directory.
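If the working storage directory is ever modified by a run, the backup can be copied back by hand. The following is a minimal sketch of that manual restore, assuming storage and storage_gold sit side by side in the vm directory as in the cp command above. `restore_from_gold` is a hypothetical helper, not a script in the repository, and the `--use_gold_image` option of the run script may already cover this case.

```shell
# Hypothetical helper (not part of the repository): replace the working
# storage directory with a fresh copy of the storage_gold backup.
restore_from_gold() {
    local vm_dir="$1"

    # Refuse to run without a directory argument, so rm -rf never sees "/storage".
    if [ -z "$vm_dir" ]; then
        echo "usage: restore_from_gold <vm_dir>" >&2
        return 1
    fi

    local gold="$vm_dir/storage_gold"
    local work="$vm_dir/storage"

    # Leave the working copy untouched if there is no backup to restore from.
    if [ ! -d "$gold" ]; then
        echo "error: no backup found at $gold" >&2
        return 1
    fi

    # Drop the working copy, then recreate it from the backup.
    rm -rf "$work"
    cp -r "$gold" "$work"
}
```

For example: `restore_from_gold cua_skill/WindowsAgentArena/src/win-arena-container/vm`.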

Switch to the rag branch in cua_skill for the latest features

Create a .env file in the ./agent directory with the following content:

UITARS_V1_BEARER_KEY="your_uitars_key"
AZURE_AD_TOKEN=""

Note: Leave AZURE_AD_TOKEN empty initially
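Before continuing, it can help to confirm that the .env file defines both keys. The sketch below only checks that each key name appears at the start of a line; it does not validate the values. `check_env_file` is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper (not part of the repository): verify that the .env
# file exists and defines both keys listed above. Values are not validated.
check_env_file() {
    local env_file="$1"
    local key
    local missing=0

    if [ ! -f "$env_file" ]; then
        echo "missing file: $env_file" >&2
        return 1
    fi

    for key in UITARS_V1_BEARER_KEY AZURE_AD_TOKEN; do
        # A key counts as present if a line starts with KEY=, even when the value is empty.
        if ! grep -q "^${key}=" "$env_file"; then
            echo "missing key: $key" >&2
            missing=1
        fi
    done

    return "$missing"
}
```

For example: `check_env_file ./agent/.env && echo ".env looks complete"`.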

Log in to Azure with the Cognitive Services scope:

az login --scope https://cognitiveservices.azure.com/.default --use-device-code

Navigate to the evaluation directory and run the token refresh script:
```
cd cua_skill/evaluation/WindowsAgentArena
./refresh_token.sh
```
Run the script inside a screen session and keep that session running in the background.

Changes in cua_skill/agent/ are automatically synced to: cua_skill/WindowsAgentArena/src/win-arena-container/client/mm_agents/rag_cua

File mappings:

Configure model settings in agent/config_rag.json:

| Setting | Description |
| --- | --- |
| `planner.model_class` | Planner model to use: `"gpt"` or `"qwen"`. |
| `rag.rel_action_sample_path` | Set the action sampling percentage (e.g., `"mm_agents/rag_cua/sample_actions/0percent.json"`). Leave empty (`""`) to allow all actions. The available options are in `cua_skill/agent/sample_actions`. |
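To see which values `rag.rel_action_sample_path` can take, the files in cua_skill/agent/sample_actions can be listed in the form the config expects. This is a rough sketch: `list_sample_action_paths` is a hypothetical helper, and the `mm_agents/rag_cua/sample_actions/` prefix is inferred from the example value above rather than confirmed against the code.

```shell
# Hypothetical helper (not part of the repository): print each JSON file in
# the given sample_actions directory as a candidate config value.
# Assumption: config values use the mm_agents/rag_cua/sample_actions/ prefix,
# as in the 0percent.json example.
list_sample_action_paths() {
    local dir="$1"
    local f

    for f in "$dir"/*.json; do
        # Skip the unexpanded pattern when the directory has no JSON files.
        [ -e "$f" ] || continue
        echo "mm_agents/rag_cua/sample_actions/$(basename "$f")"
    done
}
```

For example: `list_sample_action_paths cua_skill/agent/sample_actions`.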

sudo bash ./run_cua_rag.sh <test_json_filename> [options]

- `--use_gold_image`: Use the clean backup copy of the storage image
- `--clean_mode`: Reset environment between each test case (recommended)
- `--reset_image`: Remove current storage and regenerate from setup.iso by running:
```
sudo "./run-local.sh" --prepare-image true
```

# Run with clean environment for each test case (recommended)
sudo bash ./run_cua_rag.sh "test_one.json" --use_gold_image --clean_mode
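When several test files need to be run back to back, the same invocation can be wrapped in a loop. The sketch below is one possible wrapper under two assumptions: `run_test_files` and the `RUNNER` variable are hypothetical names introduced here, and run_cua_rag.sh is assumed to exit with a non-zero status on failure, which the source does not state.

```shell
# Hypothetical wrapper (not part of the repository): run each test file with
# the recommended flags and stop at the first failing run.
# RUNNER can be overridden, e.g. RUNNER=echo for a dry run.
run_test_files() {
    local runner="${RUNNER:-sudo bash ./run_cua_rag.sh}"
    local test_json

    for test_json in "$@"; do
        echo "=== $test_json ==="
        # $runner is left unquoted on purpose so it splits into command and arguments.
        $runner "$test_json" --use_gold_image --clean_mode || return 1
    done
}
```

Setting `RUNNER=echo` before calling `run_test_files test_one.json` prints the command arguments without starting a run.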

Tips:

- Edit test_one.json to select which tasks to run.
- If using a downloaded storage image, rename it to storage_gold and pass --use_gold_image.
- Using --clean_mode is recommended to avoid display errors and ensure test isolation.