Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

July 29, 2024

[Paper][Dataset][Code]

Abstract

The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce the Multi-Image Relational Benchmark (MIRB), designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs have been shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe MIRB can serve as a testbed for developing the next generation of multi-modal models.

Updates

  • [2024/07] We have integrated MIRB into lmms-eval. You can also evaluate your model on our benchmark from there.

Environment

conda create -n MIRB python==3.10 -y
conda activate MIRB
pip install -r requirements.txt
# optional
# pip install flash-attn --no-build-isolation --no-cache-dir

You should now be able to run most of the models. Some models, such as LLaVA, VILA, and Qwen-VL, have additional requirements of their own, so check their setup instructions before running them.

Data

Download the dataset from Hugging Face into ./MIR, then unzip ./MIR/images.zip.
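
If the unzip tool is not available, the extraction step can be done from Python instead. The following is a minimal sketch using only the standard library. The helper name unzip_images is hypothetical, and extracting into ./MIR (next to the archive) is an assumption, since the destination folder is not stated here.

```python
import zipfile
from pathlib import Path


def unzip_images(zip_path="./MIR/images.zip", dest_dir="./MIR"):
    """Extract the image archive into dest_dir and return the archive member names.

    Both default paths are assumptions based on the instructions above.
    """
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)
    if not zip_path.is_file():
        raise FileNotFoundError(
            f"{zip_path} not found; download the dataset into {dest_dir} first"
        )
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling unzip_images() with no arguments uses the paths given above and raises FileNotFoundError if the archive has not been downloaded yet.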

Inference

Quick Start:

python inference.py --engine phi3-vision idefics2-8b --dataset codeu analogy

Results are saved in the results folder.

Evaluation

python evaluate.py --engine phi3-vision idefics2-8b --dataset codeu analogy
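
Inference and evaluation take the same --engine and --dataset arguments, so the two steps can be chained from one script. The sketch below is one possible wrapper, not part of the repository: build_command and run_pipeline are hypothetical helpers, and only the two flags shown above are used. With dry_run=True the commands are printed rather than executed.

```python
import subprocess
import sys


def build_command(script, engines, datasets):
    """Build the argument list for one script using the flags shown above."""
    return [sys.executable, script, "--engine", *engines, "--dataset", *datasets]


def run_pipeline(engines, datasets, dry_run=True):
    """Run inference.py, then evaluate.py, with identical arguments."""
    commands = [
        build_command(script, engines, datasets)
        for script in ("inference.py", "evaluate.py")
    ]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline if a step fails.
            subprocess.run(cmd, check=True)
    return commands
```

For example, run_pipeline(["phi3-vision", "idefics2-8b"], ["codeu", "analogy"], dry_run=False) would run the Quick Start inference command followed by the matching evaluation command.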

Results

Models                Knowledge  Reasoning  Perception  Multi-Hop  Average
Random Chance             20.80      37.62       21.42       0.00    23.02
LLaVA-v1.5-7B             48.86      27.14       37.89       0.00    28.47
LLaVA-Next-7B             48.40      29.35       41.56       0.00    29.83
LLaVA-Next-13B            48.44      29.85       40.22       0.00    29.38
Qwen-VL-Chat              19.23      13.87       24.44       0.00    14.38
InternLM-XComposer2       54.74      37.23       37.22       0.81    32.50
VILA-2.7B                 53.27      31.01       48.33       0.00    33.15
VILA-7B                   63.66      35.31       47.11       0.00    36.52
Emu2-Chat                 40.40      24.51       44.00       0.00    27.23
IDEFICS1-9B               45.89      23.49       36.89       0.00    26.57
IDEFICS2-8B               61.26      31.83       39.00       0.00    33.02
Mantis-IDEFICS2           58.73      33.78       46.78       0.00    34.82
LongVA-7B                 66.63      35.31       48.89       0.00    37.71
Phi-3-Vision              60.19      34.49       46.22       0.00    35.23
InternLM-XC2d5            67.67      39.48       51.33      11.43    42.48
GPT-4V                    75.66      50.59       49.67      36.29    53.05
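
How the Average column is computed is not stated here. For the GPT-4V and LLaVA-v1.5-7B rows it matches the unweighted mean of the four category scores, so the sketch below assumes that definition; other rows may have been computed differently. The function name mean_score is hypothetical.

```python
def mean_score(knowledge, reasoning, perception, multi_hop):
    """Unweighted mean of the four category scores, rounded to two decimals.

    Assumption: this is how Average is derived; the source does not confirm it.
    """
    return round((knowledge + reasoning + perception + multi_hop) / 4, 2)


gpt4v = mean_score(75.66, 50.59, 49.67, 36.29)
llava_v15 = mean_score(48.86, 27.14, 37.89, 0.00)
print(gpt4v, llava_v15)  # prints: 53.05 28.47
```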

Citations

@article{zhao2024mirb,
  author    = {Bingchen Zhao and Yongshuo Zong and Letian Zhang and Timothy Hospedales},
  title     = {Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning},
  journal   = {arXiv preprint},
  year      = {2024},
}