AGUVIS

January 14, 2025

📑 Paper | 🌐 Project Page | 💾 AGUVIS Data Collection

Introduction

AGUVIS is a unified, pure vision-based framework for autonomous GUI agents that operate across platforms (web, desktop, and mobile). Unlike previous approaches that rely on textual representations, AGUVIS uses purely visual observations and a consistent action space, so that it generalizes better across platforms.

Key Features & Contributions

🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training

Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.

https://github.com/user-attachments/assets/83f2c281-961c-4e2d-90dd-8cb1857adfb6

Getting Started

Clone the repository:

git clone git@github.com:xlang-ai/aguvis.git
cd aguvis

Create and activate a conda environment:

conda create -n aguvis python=3.10
conda activate aguvis

Install PyTorch and dependencies:

conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
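
After installation, a quick sanity check can confirm that the environment is usable before downloading any data. The snippet below is a minimal sketch and not part of the repository: the helper name `check_environment` is hypothetical, and the package list simply mirrors the `conda install` command above.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio")):
    """Return a dict mapping each package name to whether it can be imported.

    Hypothetical helper: the default package list mirrors the conda command
    in this README, not a requirements file from the repository.
    """
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")

    # Only query CUDA when torch is actually installed.
    if importlib.util.find_spec("torch") is not None:
        import torch

        print("CUDA available:", torch.cuda.is_available())
```

Run it inside the activated `aguvis` environment. A `MISSING` entry, or `CUDA available: False` on a GPU machine, suggests the PyTorch installation step should be revisited.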

Data Preparation

Stage 1: Grounding
- Download the dataset from aguvis-stage1
- Place the data according to the structure defined in data/stage1.yaml
Stage 2: Planning and Reasoning
- Download the dataset from aguvis-stage2
- Place the data according to the structure defined in data/stage2.yaml

Training

Configure your training settings:
- Open scripts/train.sh
- Set the SFT_TASK variable to specify your training stage
Start training:

bash scripts/train.sh
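
If you switch between training stages often, the `SFT_TASK` edit can be scripted instead of done by hand. The following is a sketch under assumptions: it treats `scripts/train.sh` as containing a plain `SFT_TASK=...` assignment, and the values `stage1` and `stage2` are placeholders rather than names confirmed by the repository. The helper `set_shell_var` is hypothetical.

```python
import re
import shlex


def set_shell_var(script_text, name, value):
    """Rewrite the first `NAME=...` assignment found in a shell script's text.

    Hypothetical helper; it only handles simple one-line assignments,
    optionally prefixed with `export`.
    """
    pattern = re.compile(
        rf"^([ \t]*(?:export[ \t]+)?{re.escape(name)}=).*$", re.MULTILINE
    )
    # shlex.quote adds quotes only when the value needs them.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + shlex.quote(value), script_text, count=1
    )
    if count == 0:
        raise KeyError(f"{name} is not assigned in the script")
    return new_text


# In-memory example; "stage1" and "stage2" are placeholder values.
example = 'SFT_TASK="stage2"\necho "Training ${SFT_TASK}"\n'
print(set_shell_var(example, "SFT_TASK", "stage1"))
```

To apply this to the real script, read `scripts/train.sh` into a string, pass it through `set_shell_var`, and write the result back. Check the script first to see which values `SFT_TASK` actually accepts.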

Model Checkpoints

Aguvis-7B-720P: Hugging Face
Cooking... 🧑‍🍳

Inference

Configure your inference settings:
- Open scripts/inference.sh
- Set the MODEL_PATH variable to specify your model path
- Set the IMAGE_PATH variable to specify your image path
- Set the INSTRUCTION variable to specify your instruction
- Set the PREVIOUS_ACTIONS variable to specify your previous actions or leave it empty
- Set the LOW_LEVEL_INSTRUCTION variable to specify your low-level instruction or leave it empty
Start inference:

bash scripts/inference.sh
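
Instructions usually contain spaces and punctuation, so the values placed in `scripts/inference.sh` need shell quoting. The sketch below renders the five variables listed above as quoted assignments that can be pasted into the script. The function name `render_inference_settings` and all example values are hypothetical, and how the script consumes these variables is not shown in this README.

```python
import shlex


def render_inference_settings(model_path, image_path, instruction,
                              previous_actions="", low_level_instruction=""):
    """Render the inference variables as shell-safe `NAME=value` lines.

    Variable names come from this README; the two optional ones default
    to empty strings, matching "or leave it empty" above.
    """
    settings = {
        "MODEL_PATH": model_path,
        "IMAGE_PATH": image_path,
        "INSTRUCTION": instruction,
        "PREVIOUS_ACTIONS": previous_actions,
        "LOW_LEVEL_INSTRUCTION": low_level_instruction,
    }
    return "\n".join(f"{key}={shlex.quote(value)}" for key, value in settings.items())


# Placeholder values for illustration only.
print(render_inference_settings(
    model_path="/path/to/checkpoint",
    image_path="screenshot.png",
    instruction="Open the settings page",
))
```

Empty optional values are rendered as `''`, which the shell reads as an empty string.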

Checklist

Data
- ✅ Stage 1: Grounding Dataset
- ✅ Stage 2: Planning and Reasoning Trajectories
Code
- ✅ Training Pipeline
- 🚧 Model Weights and Configurations
- 🚧 Inference Scripts
- 🚧 Evaluation Toolkit

Citation

If you find this work helpful, please cite it as:

@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}
