April 4, 2025
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Yuhao Dong*,1 Zuyan Liu*,2,3 Hai-Long Sun2,4 Jingkang Yang1
Winston Hu2 Yongming Rao2,3,✉ Ziwei Liu1,✉
1S-Lab, NTU 2Tencent 3Tsinghua University 4Nanjing University
* Equal Contribution ✉ Corresponding Author
📢 News
- [04/2025] Insight-V is selected as a Highlight paper at CVPR 2025!
- [02/2025] Insight-V is accepted to CVPR 2025!
- [11/2024] 🔧🔨 Training & inference scripts released! Try Insight-V on your own!
- [11/2024] 🔥🚀 Introducing Insight-V! An early attempt to explore long-chain visual reasoning with MLLMs.
- [Paper]: Detailed introduction of Insight-V, including the structured long-chain data generation pipeline and the effective multi-agent system design.
- [Checkpoints]: We release model checkpoints built on LLaVA-NeXT-LLaMA3 and on our base model.
🚀 Introducing Insight-V
Main idea of Insight-V
Insight-V is an early effort to explore long-chain visual reasoning with MLLMs.
Insight-V offers 1) a scalable data generation pipeline for long-chain, high-quality reasoning data, 2) a multi-agent system that decomposes visual reasoning tasks into reasoning and summarization, and 3) a two-stage training pipeline to enhance visual reasoning capabilities. Together, these contributions address key challenges in visual reasoning, providing a solid foundation for future research in MLLM reasoning.
Overview of Data Generation Pipeline
Reasoning processes are generated progressively by a reasoning generator and then fed into a multi-granularity assessment system to ensure high-quality reasoning.
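The progressive generate-then-assess loop described above can be sketched as follows. This is a minimal illustration, not the released pipeline: `generate_step` and `assess` are hypothetical stand-ins for the reasoning generator and the multi-granularity assessment system, and the `<answer>` completion marker is an assumed convention.

```python
def generate_reasoning(question, generate_step, assess, max_steps=8):
    """Progressively grow a reasoning chain one step at a time, then keep
    only chains that pass a multi-granularity quality assessment.

    `generate_step(question, chain)` returns the next reasoning step given
    the steps so far; `assess(question, chain)` scores the finished chain
    (e.g. chain-level plus step-level checks). Both are hypothetical hooks.
    """
    chain = []
    for _ in range(max_steps):
        step = generate_step(question, chain)
        chain.append(step)
        if step.endswith("<answer>"):  # generator signals it has concluded
            break
    # Discard chains that fail the assessment so only high-quality
    # reasoning data survives into the training set.
    return chain if assess(question, chain) else None
```

In the actual system the assessment is multi-granularity (judging both the whole chain and individual steps); here it is collapsed into a single predicate for brevity.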
Overview of Multi-Agent System
We derive a multi-agent system from a single model: by decomposing the task into reasoning and summarization, the two agents collaborate to enhance overall reasoning capability.
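The two-agent decomposition can be summarized as a simple sketch. This is an illustrative assumption about the inference flow, not the released code: `reason` and `summarize` stand in for calls to the two fine-tuned agents derived from the same base model.

```python
def insight_v_answer(question, image, reason, summarize):
    """Two-agent inference: a reasoning agent produces a long reasoning
    chain, and a summarization agent selects the final answer conditioned
    on the question, the image, and that chain.

    `reason` and `summarize` are hypothetical callables wrapping the two
    agents (e.g. two prompted copies of the same fine-tuned MLLM).
    """
    chain = reason(question, image)              # long-chain reasoning agent
    answer = summarize(question, image, chain)   # summary agent judges the chain
    return answer
```

Splitting the roles lets the summarization agent answer robustly even when parts of the reasoning chain are flawed, which is the motivation given for the decomposition.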
✅ TODO List
- Release the paper on arXiv.
- Release Insight-V models.
- Release demo code for generation.
- Release all training and inference code.
- Release evaluation code for visual reasoning benchmarks.
- Release Insight-V SFT data.
- Extend Insight-V to stronger MLLMs.
📃 Main Results
Results on Visual Reasoning Benchmarks
Results on Other Image Benchmarks
Qualitative Results
Citation
If you find Insight-V useful for your research or applications, please cite our paper with the following BibTeX:
```bibtex
@article{dong2024insight,
  title={Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models},
  author={Dong, Yuhao and Liu, Zuyan and Sun, Hai-Long and Yang, Jingkang and Hu, Winston and Rao, Yongming and Liu, Ziwei},
  journal={arXiv preprint arXiv:2411.14432},
  year={2024}
}
```