AGUVIS
January 14, 2025 · View on GitHub
Paper | Project Page | AGUVIS Data Collection
Introduction
AGUVIS is a unified, purely vision-based framework for autonomous GUI agents that can operate across platforms (web, desktop, mobile). Unlike previous approaches that rely on textual representations, AGUVIS pairs purely vision-based observations with a consistent action space to ensure better generalization across platforms.
Key Features & Contributions
- Pure Vision Framework: First fully autonomous pure-vision GUI agent capable of performing tasks independently, without relying on closed-source models
- Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
- Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
- Two-Stage Training: Novel training pipeline focusing on GUI grounding, followed by planning and reasoning
- Inner Monologue: Explicit planning and reasoning capabilities integrated into model training
Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.
https://github.com/user-attachments/assets/83f2c281-961c-4e2d-90dd-8cb1857adfb6
Mobile Tasks (Android World)
https://github.com/user-attachments/assets/9a0147b2-e966-4500-8494-8e64d4b1b890
Web Browsing Tasks (Mind2Web-Live)
https://github.com/user-attachments/assets/f78b2263-5145-4ada-9556-a3173eb71144
Computer-use Tasks (OSWorld)
https://github.com/user-attachments/assets/d1083c7d-992b-4cf4-8b07-3c9065821179
Getting Started
Installation
- Clone the repository:
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
- Create and activate a conda environment:
conda create -n aguvis python=3.10
conda activate aguvis
- Install PyTorch and dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
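As an optional sanity check before moving on, you can confirm that the environment resolved a CUDA-enabled PyTorch build:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"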
Data Preparation
- Stage 1: Grounding
  - Download the dataset from aguvis-stage1 (a download sketch follows this list)
  - Place the data according to the structure defined in data/stage1.yaml
- Stage 2: Planning and Reasoning
  - Download the dataset from aguvis-stage2
  - Place the data according to the structure defined in data/stage2.yaml
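If the datasets are hosted on the Hugging Face Hub (the links above point there), they can be fetched with huggingface-cli. This is a sketch only: the repo ids and target directories below are assumptions for illustration, so follow the links above for the authoritative locations.
pip install -U "huggingface_hub[cli]"
# Repo ids are assumed from the dataset names; adjust to match the actual Hub pages.
huggingface-cli download xlangai/aguvis-stage1 --repo-type dataset --local-dir data/aguvis-stage1
huggingface-cli download xlangai/aguvis-stage2 --repo-type dataset --local-dir data/aguvis-stage2
After downloading, arrange the files to match the layouts declared in data/stage1.yaml and data/stage2.yaml.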
Training
- Configure your training settings (an illustrative snippet follows this list):
  - Open scripts/train.sh
  - Set the SFT_TASK variable to specify your training stage
- Start training:
  bash scripts/train.sh
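For concreteness, here is a minimal sketch of the edit described above. The values assigned to SFT_TASK are placeholders; the accepted task names are defined in scripts/train.sh and the data/*.yaml configs, so check those files for the real ones.
# Inside scripts/train.sh (sketch only -- "stage1"/"stage2" are assumed names):
SFT_TASK="stage1"    # GUI grounding first; rerun with the stage-2 task afterwards
# Then launch training from the repo root:
bash scripts/train.sh
Running the grounding stage first and then rerunning with the stage-2 task mirrors the two-stage pipeline described in the introduction.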
Model Checkpoints
- Aguvis-7B-720P: Hugging Face
- Cooking... 🧑‍🍳
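To fetch the released checkpoint locally, a huggingface-cli sketch; the repo id here is assumed from the checkpoint name above, so confirm it on the linked Hugging Face page.
# Assumed repo id; verify against the Hugging Face link above.
huggingface-cli download xlangai/Aguvis-7B-720P --local-dir checkpoints/aguvis-7b-720p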
Inference
- Configure your inference settings (a filled-in example follows this list):
  - Open scripts/inference.sh
  - Set the MODEL_PATH variable to specify your model path
  - Set the IMAGE_PATH variable to specify your image path
  - Set the INSTRUCTION variable to specify your instruction
  - Set the PREVIOUS_ACTIONS variable to specify your previous actions, or leave it empty
  - Set the LOW_LEVEL_INSTRUCTION variable to specify your low-level instruction, or leave it empty
- Start inference:
  bash scripts/inference.sh
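As a concrete but hypothetical example, the variables inside scripts/inference.sh might be filled in like this; every value below is a placeholder, not a shipped example.
# Inside scripts/inference.sh -- all values are placeholders:
MODEL_PATH="checkpoints/aguvis-7b-720p"   # path to the downloaded checkpoint
IMAGE_PATH="screenshot.png"               # the GUI screenshot to act on
INSTRUCTION="Open the settings page"      # high-level task instruction
PREVIOUS_ACTIONS=""                       # empty when starting a fresh task
LOW_LEVEL_INSTRUCTION=""                  # empty unless forcing a specific step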
Checklist
- Data
  - ✅ Stage 1: Grounding Dataset
  - ✅ Stage 2: Planning and Reasoning Trajectories
- Code
  - ✅ Training Pipeline
  - 🚧 Model Weights and Configurations
  - 🚧 Inference Scripts
  - 🚧 Evaluation Toolkit
Citation
If you find this work helpful, please cite it as:
@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}