AGUVIS

January 14, 2025 ยท View on GitHub

๐Ÿ“‘ Paper ย ย  | ย ย  ๐ŸŒ Project Page ย ย  | ย ย  ๐Ÿ’พ AGUVIS Data Collection ย ย 

Introduction

AGUVIS is a unified pure vision-based framework for autonomous GUI agents that can operate across various platforms (web, desktop, mobile). Unlike previous approaches that rely on textual representations, AGUVIS leverages unified purely vision-based observations and a consistent action space to ensure better generalization across different platforms.

Key Features & Contributions

  • ๐Ÿ” Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
  • ๐Ÿ”„ Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
  • ๐Ÿ“Š Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
  • ๐Ÿง  Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
  • ๐Ÿ’ญ Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training

Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.

https://github.com/user-attachments/assets/83f2c281-961c-4e2d-90dd-8cb1857adfb6

Mobile Tasks (Android World)

https://github.com/user-attachments/assets/9a0147b2-e966-4500-8494-8e64d4b1b890

Web Browsing Tasks (Mind2Web-Live)

https://github.com/user-attachments/assets/f78b2263-5145-4ada-9556-a3173eb71144

Computer-use Tasks (OSWorld)

https://github.com/user-attachments/assets/d1083c7d-992b-4cf4-8b07-3c9065821179

Getting Started

Installation

  1. Clone the repository:
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
  1. Create and activate a conda environment:
conda create -n aguvis python=3.10
conda activate aguvis
  1. Install PyTorch and dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .

Data Preparation

  1. Stage 1: Grounding

  2. Stage 2: Planning and Reasoning

Training

  1. Configure your training settings:

    • Open scripts/train.sh
    • Set the SFT_TASK variable to specify your training stage
  2. Start training:

bash scripts/train.sh

Model Checkpoints

  • Aguvis-7B-720P: Hugging Face
  • Cooking... ๐Ÿง‘โ€๐Ÿณ

Inference

  1. Configure your inference settings:

    • Open scripts/inference.sh
    • Set the MODEL_PATH variable to specify your model path
    • Set the IMAGE_PATH variable to specify your image path
    • Set the INSTRUCTION variable to specify your instruction
    • Set the PREVIOUS_ACTIONS variable to specify your previous actions or leave it empty
    • Set the LOW_LEVEL_INSTRUCTION variable to specify your low-level instruction or leave it empty
  2. Start inference:

bash scripts/inference.sh

Checklist

  • Data
    • โœ… Stage 1: Grounding Dataset
    • โœ… Stage 2: Planning and Reasoning Trajectories
  • Code
    • โœ… Training Pipeline
    • ๐Ÿšง Model Weights and Configurations
    • ๐Ÿšง Inference Scripts
    • ๐Ÿšง Evaluation Toolkit

Citation

If this work is helpful, please kindly cite as:

@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}