May 23, 2025

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection


| ๐Ÿ  Homepage | ๐Ÿ“š Arxiv | ๐Ÿค— Paper | ๐Ÿค— Collection |

This is the repo for the paper "InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection". In this work, we develop a GUI agent based on a multimodal large language model that enables enhanced task automation on computing devices. The agent is trained with a two-stage supervised fine-tuning pipeline: the first stage builds fundamental GUI understanding skills, and the second stage cultivates advanced reasoning. By integrating hierarchical reasoning and expectation-reflection reasoning into training, the agent acquires native reasoning abilities for GUI interactions.

🔥 News

InfiGUIAgent

We are in the process of uploading key artifacts from our paper to our 🤗 Hugging Face Collection.

Regarding the full model release: portions of our training data come from third-party sources with licensing restrictions, so we are currently sanitizing the dataset and retraining the final model to ensure full compliance while maintaining performance.

Stay tuned for updates! 🔜