MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

July 30, 2025

๐ŸŒ Homepage | | ๐Ÿค— MMAT-1M | ๐Ÿ“– MMAT-1M arXiv

Update

  • [2025-07-30] The MMAT-1M arXiv paper has been updated.
  • [2025-07-24] Our paper titled "A Large Reasoning Dataset for Multimodal Agent Tuning" has been accepted by ICCV 2025! 🎉
  • [2025-07-17] The MMAT-1M dataset, a million-scale multimodal agent tuning dataset, has been released! 🔥 😆

Introduction

MMAT-1M

MMAT-1M is a new million-scale multimodal agent tuning dataset designed to unlock the full potential of multimodal large language models in Chain-of-Thought (CoT) reasoning, reflection, and dynamic tool utilization. It addresses the current lack of large-scale, high-quality agent tuning resources in the multimodal domain. The dataset is constructed through a novel four-stage data engine. First, publicly available multimodal datasets with question-answer pairs are curated. Second, GPT-4o generates rationales for these pairs and dynamically integrates API calls and Retrieval-Augmented Generation (RAG) information via a multi-turn paradigm. Third, the rationales are refined through reflection to ensure logical consistency and accuracy, yielding a multi-turn dialogue dataset with both Rationale and Reflection (RR). Finally, the multi-turn dialogues can optionally be compressed into a One-turn Rationale and Reflection (ORR) format for efficiency.
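The final, optional stage (compressing a multi-turn RR dialogue into the one-turn ORR format) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Turn` class, the role names (`user`, `assistant`, `tool`), the `[tool result]` prefix, and the `compress_to_one_turn` helper are hypothetical and do not reflect the released MMAT-1M schema or the authors' actual compression procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str     # hypothetical roles: "user", "assistant", or "tool"
    content: str


def compress_to_one_turn(question: str, turns: List[Turn], answer: str) -> List[Turn]:
    """Fold a multi-turn rationale/reflection dialogue into one question/response pair.

    Assumed convention: assistant turns carry rationale or reflection text,
    and tool turns carry API or retrieval results.
    """
    parts = []
    for turn in turns:
        if turn.role == "tool":
            # Inline the tool output so the single response stays self-contained.
            parts.append(f"[tool result] {turn.content}")
        elif turn.role == "assistant":
            parts.append(turn.content)
    parts.append(f"Answer: {answer}")
    return [Turn("user", question), Turn("assistant", "\n".join(parts))]


# Hypothetical RR dialogue for a single question-answer pair.
rr_dialogue = [
    Turn("assistant", "Rationale: I need the text on the sign, so I call OCR."),
    Turn("tool", "OCR: 'EXIT 12'"),
    Turn("assistant", "Reflection: The OCR output matches the visible sign."),
]
orr = compress_to_one_turn("What exit number is shown?", rr_dialogue, "12")
print(orr[1].content)
```

In this sketch the trade-off is between inference cost and structure: the one-turn form needs a single model call per sample, while the multi-turn form keeps tool calls as separate, explicit steps.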


Dataset Creation

MMAT-1M is meticulously designed to challenge and evaluate multimodal models with complex reasoning. For more detailed information, please refer to the MMAT-1M dataset page on Hugging Face.

License & Disclaimer

MMAT-1M is a composite multimodal dataset derived from multiple sources with heterogeneous licenses. By using this dataset, you agree to comply with the following terms and all applicable source license conditions.

1. Source License Compliance

MMAT-1M integrates data under the following licenses:

| Source Dataset | License | Key Restrictions |
|---|---|---|
| Visual CoT | Apache 2.0 | Requires attribution and license notice. |
| LLaVA-CoT | Apache 2.0 | Same as above. |
| The Cauldron | Subset-specific licenses + CC-BY-4.0 for derived prompts | Commercial use may require separate permissions. |
| TabMWP | CC BY-NC-SA 4.0 | Non-commercial only, share-alike. |
| Infoseek | Apache 2.0 | Attribution required. |
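
When preparing a training mix, it can help to track these license terms per source. The sketch below encodes the sources listed above in a plain Python mapping and lists those whose stated restrictions touch on commercial use. The `commercial_caution` flag and the `sources_needing_review` helper are hypothetical names introduced for illustration. The flag is one reading of the "Key Restrictions" entries, not a field shipped with the dataset.

```python
from typing import Dict, List

# License strings copied from the list above. The boolean flag is an
# illustrative reading of the stated restrictions (an assumption).
SOURCE_LICENSES: Dict[str, dict] = {
    "Visual CoT": {"license": "Apache 2.0", "commercial_caution": False},
    "LLaVA-CoT": {"license": "Apache 2.0", "commercial_caution": False},
    "The Cauldron": {
        "license": "Subset-specific licenses + CC-BY-4.0 for derived prompts",
        "commercial_caution": True,  # "Commercial use may require separate permissions."
    },
    "TabMWP": {
        "license": "CC BY-NC-SA 4.0",
        "commercial_caution": True,  # "Non-commercial only, share-alike."
    },
    "Infoseek": {"license": "Apache 2.0", "commercial_caution": False},
}


def sources_needing_review(licenses: Dict[str, dict]) -> List[str]:
    """Return source names whose terms should be checked before commercial use."""
    return sorted(name for name, info in licenses.items() if info["commercial_caution"])


print(sources_needing_review(SOURCE_LICENSES))
```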

2. Reporting Violations

If you suspect license non-compliance in MMAT-1M, please contact us.

3. Limitation of Liability

The MMAT-1M team:

  • Does not warrant the legal status of individual data samples.
  • Is not liable for misuse of dataset components beyond their original license terms.

Citation

BibTeX:

@inproceedings{Gao2025MMAT1M,
  title={MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning},
  author={Tianhong Gao and Yannian Fu and Weiqun Wu and Haixiao Yue and Shanshan Liu and Gang Zhang},
  booktitle={Proceedings of ICCV},
  year={2025},
}