MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

May 15, 2025

Run Luo¹,²*, Haonan Zhang³*, Longze Chen¹,²*, Ting-En Lin³*,
Xiong Liu³, Yuchuan Wu³, Min Yang¹,²🌟, Yongbin Li³🌟,
Minzheng Wang², Pengpeng Zeng⁴, Lianli Gao⁵, Heng Tao Shen⁴,
Yunshui Li¹,², Xiaobo Xia⁶, Fei Huang³, Jingkuan Song⁴🌟

* Equal contribution 🌟 Corresponding author

¹ Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
² University of Chinese Academy of Sciences
³ Alibaba Group ⁴ Tongji University ⁵ Independent Researcher ⁶ The University of Sydney

[📖 arXiv Paper] [📊 Dataset] [🏆 Models]

MMEvol is a pioneering method that brings Evol-Instruct to the multimodal domain, increasing the diversity and complexity of multimodal instruction data. Unlike previous methods such as VILA2, MIMIC-IT, and MMInstruct, it evolves data iteratively in a simple, fully automated way, which lifts the limits that earlier pipelines place on data complexity and diversity. MMEvol imposes no restrictions on data format or task type and requires no intricate processing, so a limited set of image instruction data can be evolved rapidly, round after round, into high-quality multimodal data that strengthens the resulting multimodal models. It can also be combined with data flow-driven methods such as VILA2, MIMIC-IT, and MMInstruct for more robust data construction. We invite everyone to try it!

🔥 Update

[16/05]🔥A paper (MAmmoTH-VL) that applies our method at large scale has been accepted to the ACL 2025 main conference!
[11/10]🔥MMEvol is coming! We release the code, models, and data for MMEvol!
[09/09]🔥MMEvol is coming! We release the paper for MMEvol!

📷 Setup

Please follow the instructions below to install the required packages.

Clone this repository

git clone https://github.com/RainBowLuoCS/MMEvol.git
cd MMEvol

Install Package

conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation
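
After installation, a quick import check can confirm that the environment is usable. The sketch below is not part of the repository. It assumes that the editable install exposes a package named `llava` (inferred from the `llava/eval/...` paths used later in this README) and that flash-attn is imported as `flash_attn`.

```python
import importlib.util


def check_packages(names):
    """Return {package_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Package names are assumptions: "llava" from the repo's directory layout,
    # "flash_attn" as the import name of the flash-attn package.
    for name, found in check_packages(["torch", "llava", "flash_attn"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it inside the activated conda environment; any package reported as MISSING points to an install step that needs to be repeated.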

📷 Hyperparameters

The hyperparameters used for pretraining (PT) and finetuning (FT) are provided below.

Stage	Global Batch Size	LLM lr	Projector lr	Vision Tower lr	Epochs	Max length	Weight decay
PT	256	0	1e-3	0	1	4096	0
FT	128	2e-5	2e-5	2e-6	1	4096	0
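
The global batch size is the product of the per-device batch size, the number of devices, and the gradient accumulation steps. The sketch below shows how the accumulation steps could be derived for a given machine. The GPU count and per-device batch sizes are illustrative assumptions, not values taken from the training scripts.

```python
def grad_accum_steps(global_batch, num_gpus, per_device_batch):
    """Gradient accumulation steps needed to reach the target global batch size."""
    per_step = num_gpus * per_device_batch
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by {per_step} samples per step"
        )
    return global_batch // per_step


# Global batch sizes come from the table above; 8 GPUs and the per-device
# batch sizes are hypothetical examples.
print(grad_accum_steps(256, num_gpus=8, per_device_batch=16))  # PT -> 2
print(grad_accum_steps(128, num_gpus=8, per_device_batch=4))   # FT -> 4
```

If your hardware differs, adjust the per-device batch size and accumulation steps so that their product with the GPU count still matches the global batch size in the table.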

🔍 Model

The pretrained projector weights and instruction-tuned checkpoints are listed below.

Model	Pretrained Projector	Base LLM	PT Data	IT Data	Download
MMEvol-Qwen2-7B	mm_projector	Qwen2-7B	LLaVA-Pretrain	MMEvol	ckpt
MMEvol-LLaMA3-8B	mm_projector	LLaMA3-8B	LLaVA-Pretrain	MMEvol	ckpt

Performance

Supported by VLMEvalKit (OpenCompass)

Model	MME_C	MMStar	HallBench	MathVista_mini	MMMU_val	AI2D	POPE	BLINK	RWQA
MMEvol-LLaMA3-8B	47.8	50.1	62.3	50.0	40.8	73.9	86.8	46.4	62.6
MMEvol-Qwen2-7B	55.8	51.6	64.1	52.4	45.1	74.7	87.8	47.7	63.9

Not supported by VLMEvalKit (VQADataSet)

Model	VQA_v2	GQA	MIA	MMSInst
MMEvol-LLaMA3-8B	83.4	65.0	78.8	32.3
MMEvol-Qwen2-7B	83.1	65.5	77.6	41.8

💡Preparation

Dataset

Please follow LLaVA to prepare the corresponding images and data.

data structure

datasets
├── json
│   ├── allava_vflan.json
│   ├── arxivqa.json
│   ├── cambrain_math_code.json
│   ├── data_engine.json
│   ├── shargpt_40k.json
│   ├── tabmwp.json
│   ├── wizardlm_143k.json
│   ├── mmevol_seed_no_evol_163k.json
│   ├── mmevol_evol_480k.json
│   └── mix_evol_sft.json
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
├── alfword
│   ├── alf-image-id-0
│   ├── alf-image-id-1
│   ├── alf-image-id-2
│   ├── alf-image-id-3
│   └── alf-image-id-4
├── allava_vflan
│   └── images
├── arxivqa
│   └── images
├── chartqa
│   ├── test
│   ├── train
│   └── val
├── coco
│   ├── train2014 
│   ├── train2017
│   ├── val2014
│   └── val2017
├── clevr
│   ├── CLEVR_GoGenT_v1.0
│   └── CLEVR_v1.0
├── data_engine
│   ├── partI
│   ├── partII 
│   └── partIII
├── design2code
│   └── images  
├── docvqa
│   ├── test
│   ├── train
│   └── val
├── dvqa
│   └── images
├── geo170k
│   ├── images/geo3k
│   └── images/geoqa_plus
├── geoqa+
│   └── images 
├── gpt4v-dataset
│   └── images 
├── gqa
│   └── images 
├── hfdata
│   └── ....
├── llava
│   └── llava_pretrain/images
├── llavar
│   └── finetune
├── mathvision
│   └── images
├── ocr_vqa
│   └── images
├── Q-Instruct-DB
│   ├── livefb_liveitw_aigc
│   └── spqa_koniq
├── sam
│   └── images
├── scienceqa
│   └── images
├── share_textvqa
│   └── images
├── synthdog-en
│   └── images
├── tabmwp
│   └── tables
├── textbookqa
│   └── tqa_train_val_test
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── vizwiz
│   └── train
├── web-celebrity
│   └── images
├── web-landmark
│   └── images
└── wikiart
    └── images
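
Before launching a long training run, it can help to confirm that the layout above is in place. The following is a minimal sketch rather than a repository script. It assumes the data root is a local directory named `datasets`, and the list of paths is only a small sample of the tree.

```python
from pathlib import Path


def missing_paths(root, relative_paths):
    """Return the entries of relative_paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


# A few entries taken from the tree above; extend the list to cover the
# subsets you actually train on.
EXPECTED = [
    "json/mmevol_seed_no_evol_163k.json",
    "json/mmevol_evol_480k.json",
    "json/mix_evol_sft.json",
    "coco/train2017",
    "vg/VG_100K",
]

if __name__ == "__main__":
    for path in missing_paths("datasets", EXPECTED):
        print(f"missing: datasets/{path}")
```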

mmevol_evol_480k.json is the 480k evolved data produced from the seed data mmevol_seed_no_evol_163k.json. For instruction tuning (IT), you can freely combine it with other data such as allava_vflan.json, or directly use our mixed mix_evol_sft.json.
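
Combining annotation files into a custom mix can be sketched as follows. This is an illustration, not the script used to build mix_evol_sft.json. It assumes that each file holds a JSON list of samples (as in LLaVA-style annotation files), and the shuffle is an optional design choice.

```python
import json
import random
from pathlib import Path


def load_records(path):
    """Load one annotation file, assumed to hold a JSON list of samples."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path} does not contain a JSON list")
    return records


def mix_records(record_lists, seed=0):
    """Concatenate several sample lists and shuffle them reproducibly."""
    mixed = [sample for records in record_lists for sample in records]
    random.Random(seed).shuffle(mixed)
    return mixed


def build_mix(json_dir, file_names, out_name, seed=0):
    """Merge the named files under json_dir into out_name; return the sample count."""
    json_dir = Path(json_dir)
    mixed = mix_records([load_records(json_dir / n) for n in file_names], seed)
    with open(json_dir / out_name, "w", encoding="utf-8") as f:
        json.dump(mixed, f, ensure_ascii=False)
    return len(mixed)
```

For example, `build_mix("datasets/json", ["mmevol_evol_480k.json", "allava_vflan.json"], "my_mix.json")` would write a combined file, where `my_mix.json` is a hypothetical output name that you then reference as the data path in the training script.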

📈 Train

Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here and organize the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).

bash scripts/v1_6/train/llama3/pretrain.sh
bash scripts/v1_6/train/qwen2/pretrain.sh

Visual Instruction Tuning

Please make sure you download and organize the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).

bash scripts/v1_6/train/llama3/finetune.sh
bash scripts/v1_6/train/qwen2/finetune.sh

📈 Evaluation

Ensure that your api_base and key are correctly configured before evaluation.

opencompass

First, enter the vlmevalkit directory and install all dependencies:

cd vlmevalkit
pip install -r requirements.txt

Then, run script/run_inference.sh, which receives three input parameters in sequence: MODELNAME, DATALIST, and MODE. MODELNAME represents the name of the model, DATALIST represents the datasets used for inference, and MODE represents evaluation mode:

chmod +x ./script/run_inference.sh
./script/run_inference.sh $MODELNAME $DATALIST $MODE

The two available choices for MODELNAME are listed in vlmeval/config.py:

ungrouped = {
    'MMEvol-Llama3-V-1_6': partial(LLaVA_Llama3_V, model_path="checkpoints/xxx/checkpoint-14000"),
    'MMEvol-Qwen2-V-1_6': partial(LLaVA_Qwen2_V, model_path="checkpoints/xxx/checkpoint-14000"),
}
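
Each entry binds a checkpoint path to a model wrapper class with `functools.partial`, so pointing an entry at your own checkpoint only requires editing `model_path`. The sketch below illustrates the pattern in isolation. `DummyModel`, the registry key, and the checkpoint path are hypothetical stand-ins, and how VLMEvalKit consumes the dictionary internally is not shown here.

```python
from functools import partial


class DummyModel:
    """Hypothetical stand-in for a wrapper class such as LLaVA_Llama3_V."""

    def __init__(self, model_path):
        self.model_path = model_path


# Same shape as the dictionary above: name -> constructor with arguments pre-bound.
registry = {
    "my-model": partial(DummyModel, model_path="checkpoints/my-run/checkpoint-100"),
}

# Nothing is constructed until the stored partial is called.
model = registry["my-model"]()
print(model.model_path)  # checkpoints/my-run/checkpoint-100
```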

All available choices for DATALIST are listed in vlmeval/utils/dataset_config.py. When evaluating on a single dataset, pass the dataset name directly without quotation marks. When evaluating on multiple datasets, separate the dataset names with spaces and wrap the whole list in quotation marks:

DATALIST="MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK"
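
The quoting matters because the script takes exactly three positional arguments, so a multi-dataset list must arrive as a single argument. If you drive the script from Python, a sketch like the following (not part of the repository; `build_command` is a hypothetical helper) keeps the dataset list in one argument:

```python
import shlex


def build_command(model_name, datasets, mode="all"):
    """Build the argv list for run_inference.sh; datasets are joined into ONE argument."""
    return ["./script/run_inference.sh", model_name, " ".join(datasets), mode]


argv = build_command("MMEvol-Llama3-V-1_6", ["MME", "MMStar", "POPE"])
print(shlex.join(argv))
# ./script/run_inference.sh MMEvol-Llama3-V-1_6 'MME MMStar POPE' all
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from inside the vlmevalkit directory, which avoids shell quoting issues altogether.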

To run inference and score each benchmark directly, set MODE=all. If only inference results are required, set MODE=infer. To reproduce the results in the Performance table (the columns from MME to RealWorldQA), run the script with the following settings:

# run on all 9 datasets
./script/run_inference.sh MMEvol-Llama3-V-1_6 "MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK" all

# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh MMEvol-Llama3-V-1_6 MME all
# MMMU_DEV_VAL
./script/run_inference.sh MMEvol-Llama3-V-1_6 MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh MMEvol-Llama3-V-1_6 MathVista_MINI all
.....

# NOTE: BLINK should be evaluated separately with llava/eval/blink_eval.py.
python llava/eval/blink_eval.py

vqadataset

For the VQA and GQA datasets, please follow LLaVA for evaluation.

For MIA and MMSInst, first download the datasets and then run the following scripts for evaluation:

cd mmevol
# test
python llava/eval/model_vqa_mia.py
python llava/eval/model_vqa_mminst.py
# eval
python llava/eval/mia_eval.py
python llava/eval/mminst_eval.py

👀 Visualization

This dataset was generated by Tongyi-ConvAI for multimodal supervised fine-tuning and was used to train the Evol-Llama3-8B-Instruct and Evol-Qwen2-7B models reported in our paper. To create it, we first selected a 163K seed instruction tuning dataset for Evol-Instruct, then enhanced data quality through an iterative process that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. The result is a more complex and diverse image-text instruction dataset, which in turn gives MLLMs stronger capabilities. Below we show the detailed data distribution of SEED-163K, the seed data prepared for the multi-round evolution described above. More details can be found in our paper.
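
The iterative process can be pictured as a loop over rounds in which each sample is rewritten along one of the three evolution directions. The sketch below is a toy illustration only, not the authors' pipeline: `evolve` stands in for a call to a multimodal LLM, and the number of rounds, the random choice of direction, the sample fields, and the example image path are all assumptions made for the example.

```python
import random

DIRECTIONS = ["fine-grained perception", "cognitive reasoning", "interaction"]


def evolve(sample, direction):
    """Placeholder for an MLLM call that rewrites the instruction; here it only tags it."""
    return {
        "image": sample["image"],
        "instruction": f"{sample['instruction']} [evolved: {direction}]",
        "round": sample["round"] + 1,
    }


def run_evolution(seed_data, rounds=3, seed=0):
    """Apply `rounds` evolution steps to every seed sample."""
    rng = random.Random(seed)
    data = [dict(sample, round=0) for sample in seed_data]
    for _ in range(rounds):
        data = [evolve(sample, rng.choice(DIRECTIONS)) for sample in data]
    return data


# Hypothetical seed sample; the image path is made up for illustration.
seed_data = [{"image": "coco/train2017/example.jpg", "instruction": "Describe the image."}]
for sample in run_evolution(seed_data):
    print(sample["round"], sample["instruction"])
```

The paper describes the actual prompts and any quality filtering applied between rounds; this loop only shows the overall shape of a multi-round evolution.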


Schedule

Release MMEvol-10M
Release training & evaluation code
Release model weight
Release evolved dataset MMEvol-480K

Citation

If you find this repo useful for your research, please consider citing the paper:

@article{luo2024mmevol,
  title={Mmevol: Empowering multimodal large language models with evol-instruct},
  author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
  journal={arXiv preprint arXiv:2409.05840},
  year={2024}
}

Contact

If you have any questions, please reach out to the following contacts for help:

Run Luo — r.luo@siat.ac.cn
Haonan Zhang — zchiowal@gmail.com

Acknowledgement

- LLaVA: the codebase we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use LLaVA-NeXT.

- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!