April 4, 2025
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Yuhao Dong*,1 Zuyan Liu*,2,3 Hai-Long Sun2,4 Jingkang Yang1
Winston Hu2 Yongming Rao2,3,✉ Ziwei Liu1,✉
1S-Lab, NTU 2Tencent 3Tsinghua University 4Nanjing University
* Equal Contribution ✉ Corresponding Author
📢 News
- [04/2025] Insight-V is selected as a Highlight paper at CVPR 2025!
- [02/2025] Insight-V is accepted to CVPR 2025!
- [11/2024] 🔧🔨 Training & inference scripts released! Try Insight-V on your own!
- [11/2024] 🔥🚀 Introducing Insight-V! An early attempt to explore long-chain visual reasoning with MLLMs.
- [Paper]: Detailed introduction of Insight-V, including the structured long-chain data generation pipeline and the effective multi-agent system design.
- [Checkpoints]: We release model checkpoints built on LLaVA-NeXT-LLaMA3 and on our base model.
🚀 Introducing Insight-V
Main idea of Insight-V
Insight-V is an early effort to explore long-chain visual reasoning with MLLMs.
Insight-V offers 1) a scalable data generation pipeline for long-chain, high-quality reasoning data, 2) a multi-agent system that decomposes visual reasoning tasks into reasoning and summarization, and 3) a two-stage training pipeline to enhance visual reasoning capabilities. Together, these contributions address key challenges in visual reasoning, providing a solid foundation for future research in MLLM reasoning.
Overview of Data Generation Pipeline
Reasoning processes are generated progressively by a reasoning generator and then fed into a multi-granularity assessment system to ensure high-quality reasoning.
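The progressive generate-then-assess loop described above can be sketched as follows. This is a minimal illustration, not the released pipeline: `generate_step` and `assess` are hypothetical stand-ins for the reasoning generator and the multi-granularity assessment system, and the `<answer>` completion marker is an assumed convention.

```python
def generate_reasoning(question, generate_step, assess, max_steps=8):
    """Progressively grow a reasoning chain one step at a time, then keep
    only chains that pass a multi-granularity quality assessment.

    `generate_step(question, chain)` returns the next reasoning step given
    the steps so far; `assess(question, chain)` scores the finished chain
    (e.g. chain-level plus step-level checks). Both are hypothetical hooks.
    """
    chain = []
    for _ in range(max_steps):
        step = generate_step(question, chain)
        chain.append(step)
        if step.endswith("<answer>"):  # generator signals it has concluded
            break
    # Discard chains that fail the assessment so only high-quality
    # reasoning data survives into the training set.
    return chain if assess(question, chain) else None
```

In the actual system the assessment is multi-granularity (judging both the whole chain and individual steps); here it is collapsed into a single predicate for brevity.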
Overview of Multi-Agent System
We derive a multi-agent system from a single model: by decomposing the task into reasoning and summarization, the two agents collaborate to enhance overall reasoning capability.
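The two-agent decomposition can be summarized as a simple sketch. This is an illustrative assumption about the inference flow, not the released code: `reason` and `summarize` stand in for calls to the two fine-tuned agents derived from the same base model.

```python
def insight_v_answer(question, image, reason, summarize):
    """Two-agent inference: a reasoning agent produces a long reasoning
    chain, and a summarization agent selects the final answer conditioned
    on the question, the image, and that chain.

    `reason` and `summarize` are hypothetical callables wrapping the two
    agents (e.g. two prompted copies of the same fine-tuned MLLM).
    """
    chain = reason(question, image)              # long-chain reasoning agent
    answer = summarize(question, image, chain)   # summary agent judges the chain
    return answer
```

Splitting the roles lets the summarization agent answer robustly even when parts of the reasoning chain are flawed, which is the motivation given for the decomposition.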
✅ TODO List
- Release the paper on arXiv.
- Release Insight-V models.
- Release demo code for generation.
- Release all training and inference code.
- Release evaluation code for visual reasoning benchmarks.
- Release Insight-V SFT data.
- Extend Insight-V to stronger MLLMs.
📃 Main Results
Results on Visual Reasoning Benchmarks
Results on Other Image Benchmarks
Qualitative Results
Citation
If you find Insight-V useful for your research or applications, please cite our paper with the following BibTeX:
```bibtex
@article{dong2024insight,
  title={Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models},
  author={Dong, Yuhao and Liu, Zuyan and Sun, Hai-Long and Yang, Jingkang and Hu, Winston and Rao, Yongming and Liu, Ziwei},
  journal={arXiv preprint arXiv:2411.14432},
  year={2024}
}
```