May 23, 2025

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection


| ๐Ÿ  Homepage | ๐Ÿ“š Arxiv | ๐Ÿค— Paper | ๐Ÿค— Collection |

This is the repo for the paper "InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection". In this work, we develop a GUI agent based on a multimodal large language model that enables enhanced task automation on computing devices. The agent is trained with a two-stage supervised fine-tuning pipeline: the first stage builds fundamental GUI understanding skills, and the second stage cultivates advanced reasoning. By integrating hierarchical reasoning and expectation-reflection reasoning into training, the agent acquires native reasoning abilities for GUI interactions.

🔥 News

InfiGUIAgent

We are in the process of uploading key artifacts from our paper to our 🤗 Hugging Face Collection.

Regarding the full model release: portions of our training data come from third-party sources with licensing restrictions, so we are currently sanitizing the dataset and retraining the final model to ensure full compliance while maintaining performance.

Stay tuned for updates! 🔜