Best Practice

September 14, 2024 · View on GitHub

We strongly recommend using VLMEevalKit for its useful features and ready-to-use LVLM implementations.

MMIU

Quick Start | HomePage | arXiv | Dataset | Citation

This repository is the official implementation of MMIU.

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng^*, Jin Wang^*, Chuanhao Li^*, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang^#, Wenqi Shao^#
^* MFQ, WJ and LCH contribute equally.
^# SWQ (shaowenqi@pjlab.org.cn) and ZKP (zhangkaipeng@pjlab.org.cn) are correponding authors.

💡 News

2024/08/13: We have released the codes.
2024/08/08: We have released the dataset at https://huggingface.co/datasets/FanqingM/MMIU-Benchmark 🔥🔥🔥
2024/08/05: The datasets and codes are coming soon! 🔥🔥🔥
2024/08/05: The technical report of MMIU is released! And check our project page! 🔥🔥🔥

Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. overview

Evaluation Results Overview

The closed-source proprietary model GPT-4o from OpenAI has taken a leading position in MMIU, surpassing other models such as InternVL2-pro, InternVL1.5-chat, Claude3.5-Sonnet, and Gemini1.5 flash. Note that the open-source models InternVL2-pro.
Some powerful LVLMs like InternVL1.5 and GLM4V whose pre-training data do not contain multi-image content even outperform many multi-image models which undergo multi-image supervised fine-tuning (SFT), indicating the strong capacity in single-image understanding is the foundation of multi-image comprehension.
By comparing performance at the level of image relationships, we conclude that LVLM excels at understanding semantic content in multi-image scenarios but has weaker performance in comprehending temporal and spatial relationships in multi-image contexts.
The analysis based on the task map reveals that models perform better on high-level understanding tasks such as video captioning which are in-domain tasks, but struggle with 3D perception tasks such as 3D detection and temporal reasoning tasks such as image ordering which are out-of-domain tasks.
By task learning difficulty analysis, tasks involving ordering, retrieval and massive images cannot be overfitted by simple SFT, suggesting that additional pre-training data or training techniques should be incorporated for improvement.

🏆 Leaderboard

Rank	Model	Score
1	GPT4o	55.72
2	Gemini	53.41
3	Claude3	53.38
4	InternVL2	50.30
5	Mantis	45.58
6	Gemini1.0	40.25
7	internvl1.5-chat	37.39
8	Llava-interleave	32.37
9	idefics2_8b	27.80
10	glm-4v-9b	27.02
11	deepseek_vl_7b	24.64
12	XComposer2_1.8b	23.46
13	deepseek_vl_1.3b	23.21
14	flamingov2	22.26
15	llava_next_vicuna_7b	22.25
16	XComposer2	21.91
17	MiniCPM-Llama3-V-2_5	21.61
18	llava_v1.5_7b	19.19
19	sharegpt4v_7b	18.52
20	sharecaptioner	16.10
21	qwen_chat	15.92
22	monkey-chat	13.74
23	idefics_9b_instruct	12.84
24	qwen_base	5.16
-	Frequency Guess	31.5
-	Random Guess	27.4

🚀 Quick Start

Here, we mainly use the VLMEvalKit framework for testing, with some separate tests as well. Specifically, for multi-image models, we include the following models:

transformers == 33.0

XComposer2
XComposer2_1.8b
qwen_base
idefics_9b_instruct
qwen_chat
flamingov2

transformers == 37.0

deepseek_vl_1.3b
deepseek_vl_7b

transformers == 40.0

idefics2_8b

For single-image models, we include the following:

transformers == 33.0

sharecaptioner
monkey-chat

transformers == 37.0

sharegpt4v_7b
llava_v1.5_7b
glm-4v-9b

transformers == 40.0

llava_next_vicuna_7b
MiniCPM-Llama3-V-2_5

We use the VLMEvalKit framework for testing. You can refer to the code in VLMEvalKit/test_models.py. Additionally, for closed-source models, please replace the following part of the code by following the example here:

response = model.generate(tmp) # tmp = image_paths + [question]

For other open-source models, we have provided reference code for Mantis and InternVL1.5-chat. For LLava-Interleave, please refer to the original repository.

💐 Acknowledgement

We expressed sincerely gratitude for the projects listed following:

VLMEvalKit provides useful out-of-box tools and implements many adavanced LVLMs. Thanks for their selfless dedication.
The Team of InternVL for apis.

📧 Contact

If you have any questions, feel free to contact Fanqing Meng with mengfanqing33@gmail.com

🖊️ Citation

If you feel MMIU useful in your project or research, please kindly use the following BibTeX entry to cite our paper. Thanks!

@article{meng2024mmiu,
  title={MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models},
  author={Meng, Fanqing and Wang, Jin and Li, Chuanhao and Lu, Quanfeng and Tian, Hao and Liao, Jiaqi and Zhu, Xizhou and Dai, Jifeng and Qiao, Yu and Luo, Ping and others},
  journal={arXiv preprint arXiv:2408.02718},
  year={2024}
}