MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
May 15, 2025
Run Luo1,2*, Haonan Zhang3*, Longze Chen1,2*, Ting-En Lin3*,
Xiong Liu3, Yuchuan Wu3, Min Yang1,2†, Yongbin Li3†,
Minzheng Wang2, Pengpeng Zeng4, Lianli Gao5, Heng Tao Shen4,
Yunshui Li1,2, Xiaobo Xia6, Fei Huang3, Jingkuan Song4†
* Equal contribution † Corresponding author
1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Alibaba Group
4 Tongji University
5 Independent Researcher
6 The University of Sydney
Update
- [16/05] 🔥 A paper (MAmmoTH-VL) that applies our method at large scale has been accepted to the ACL 2025 main conference!
- [11/10] 🔥 MMEvol is here! We have released the code, models, and data for MMEvol.
- [09/09] 🔥 MMEvol is coming! We have released the MMEvol paper.
Setup
Please follow the instructions below to install the required packages.
- Clone this repository
git clone https://github.com/RainBowLuoCS/MMEvol.git
cd MMEvol
- Install Package
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
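After installation, a quick sanity check can confirm the environment before launching training. The snippet below is only a sketch: the helper name `missing_modules` is hypothetical, and it assumes that the training dependencies expose the import names `torch` and `flash_attn`.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The setup steps above create a Python 3.10 environment.
    print("python", f"{sys.version_info.major}.{sys.version_info.minor}")
    # Assumed import names for the training dependencies (not stated in this README).
    missing = missing_modules(["torch", "flash_attn"])
    print("missing:", missing or "none")
```

If anything is reported as missing, re-running the corresponding `pip install` step inside the activated conda environment is the first thing to try.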
Hyperparameters
The hyperparameters used for pretraining (PT) and finetuning (FT) are provided below.
| Stage | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|---|---|
| PT | 256 | 0 | 1e-3 | 0 | 1 | 4096 | 0 |
| FT | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
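The table can be read as a small configuration. The sketch below copies its values into a dictionary and derives two things from them: which modules are updated in each stage, and how many gradient-accumulation steps are needed to reach the global batch size. It assumes that a learning rate of 0 means the module is frozen; the function names, the GPU count, and the per-device batch sizes are illustrative and not taken from the training scripts.

```python
from typing import Dict, List

# Values copied from the hyperparameter table above.
STAGES: Dict[str, Dict[str, float]] = {
    "PT": {"global_batch_size": 256, "llm_lr": 0.0, "projector_lr": 1e-3, "vision_tower_lr": 0.0},
    "FT": {"global_batch_size": 128, "llm_lr": 2e-5, "projector_lr": 2e-5, "vision_tower_lr": 2e-6},
}

LR_KEYS = ("llm_lr", "projector_lr", "vision_tower_lr")


def trainable_modules(stage: str) -> List[str]:
    """List modules with a non-zero learning rate (assumption: lr 0 means frozen)."""
    cfg = STAGES[stage]
    return [key[: -len("_lr")] for key in LR_KEYS if cfg[key] > 0]


def grad_accum_steps(stage: str, num_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps so that num_gpus * per_device_batch * steps == global batch."""
    global_batch = int(STAGES[stage]["global_batch_size"])
    per_step = num_gpus * per_device_batch
    if per_step <= 0 or global_batch % per_step != 0:
        raise ValueError(f"{global_batch} is not divisible by {per_step}")
    return global_batch // per_step


if __name__ == "__main__":
    print(trainable_modules("PT"))        # only the projector has a non-zero lr
    print(grad_accum_steps("FT", 8, 4))   # hypothetical setup: 8 GPUs, 4 samples each
```

When the hardware changes, adjusting the accumulation steps rather than the global batch size keeps a run comparable to the settings in the table.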
Model
Here are the pretrained weights and the instruction-tuned weights.
| Model | Pretrained Projector | Base LLM | PT Data | IT Data | Download |
|---|---|---|---|---|---|
| MMEvol-Qwen2-7B | mm_projector | Qwen2-7B | LLaVA-Pretrain | MMEvol | ckpt |
| MMEvol-LLaMA3-8B | mm_projector | LLaMA3-8B | LLaVA-Pretrain | MMEvol | ckpt |
Performance
Supported by VLMEvalKit (OpenCompass)
| Model | MME_C | MMStar | HallBench | MathVista_mini | MMMU_val | AI2D | POPE | BLINK | RWQA |
|---|---|---|---|---|---|---|---|---|---|
| MMEvol-LLaMA3-8B | 47.8 | 50.1 | 62.3 | 50.0 | 40.8 | 73.9 | 86.8 | 46.4 | 62.6 |
| MMEvol-Qwen2-7B | 55.8 | 51.6 | 64.1 | 52.4 | 45.1 | 74.7 | 87.8 | 47.7 | 63.9 |
Not Supported by VLMEvalKit (VQADataSet)
| Model | VQA_v2 | GQA | MIA | MMSInst |
|---|---|---|---|---|
| MMEvol-LLaMA3-8B | 83.4 | 65.0 | 78.8 | 32.3 |
| MMEvol-Qwen2-7B | 83.1 | 65.5 | 77.6 | 41.8 |
Preparation
Dataset
Please follow LLaVA to prepare the corresponding images and data.
data structure
datasets
├── json
│   ├── allava_vflan.json
│   ├── arxivqa.json
│   ├── cambrain_math_code.json
│   ├── data_engine.json
│   ├── shargpt_40k.json
│   ├── tabmwp.json
│   ├── wizardlm_143k.json
│   ├── mmevol_seed_no_evol_163k.json
│   ├── mmevol_evol_480k.json
│   └── mix_evol_sft.json
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
├── alfword
│   ├── alf-image-id-0
│   ├── alf-image-id-1
│   ├── alf-image-id-2
│   ├── alf-image-id-3
│   └── alf-image-id-4
├── allava_vflan
│   └── images
├── arxivqa
│   └── images
├── chartqa
│   ├── test
│   ├── train
│   └── val
├── coco
│   ├── train2014
│   ├── train2017
│   ├── val2014
│   └── val2017
├── clevr
│   ├── CLEVR_GoGenT_v1.0
│   └── CLEVR_v1.0
├── data_engine
│   ├── partI
│   ├── partII
│   └── partIII
├── design2code
│   └── images
├── docvqa
│   ├── test
│   ├── train
│   └── val
├── dvqa
│   └── images
├── geo170k
│   ├── images/geo3k
│   └── images/geoqa_plus
├── geoqa+
│   └── images
├── gpt4v-dataset
│   └── images
├── gqa
│   └── images
├── hfdata
│   └── ....
├── llava
│   └── llava_pretrain/images
├── llavar
│   └── finetune
├── mathvision
│   └── images
├── ocr_vqa
│   └── images
├── Q-Instruct-DB
│   ├── livefb_liveitw_aigc
│   └── spqa_koniq
├── sam
│   └── images
├── scienceqa
│   └── images
├── share_textvqa
│   └── images
├── synthdog-en
│   └── images
├── tabmwp
│   └── tables
├── textbookqa
│   └── tqa_train_val_test
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── vizwiz
│   └── train
├── web-celebrity
│   └── images
├── web-landmark
│   └── images
└── wikiart
    └── images
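Because the layout has many folders, it can help to check it before launching a long training run. The following is a minimal sketch, not part of the repository: `missing_dirs` is a hypothetical helper, and `EXPECTED_SUBDIRS` lists only a few of the folders from the tree above as an example.

```python
from pathlib import Path
from typing import Iterable, List, Union

# A small, illustrative subset of the folders listed in the tree above.
EXPECTED_SUBDIRS = (
    "json",
    "coco/train2017",
    "gqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
)


def missing_dirs(root: Union[str, Path], expected: Iterable[str] = EXPECTED_SUBDIRS) -> List[str]:
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "datasets" is the root folder name used in the tree above.
    for rel in missing_dirs("datasets"):
        print("missing:", rel)
```

Extending `EXPECTED_SUBDIRS` to cover only the sources you actually train on keeps the check fast and avoids false alarms for data you have chosen to skip.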
mmevol_evol_480k.json contains the 480K evolved samples derived from the seed data in mmevol_seed_no_evol_163k.json. You can freely combine it with other data such as allava_vflan.json for instruction tuning (IT), or directly use our mixed mix_evol_sft.json for training.
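Combining several annotation files into a custom mixture can be sketched as follows. This is an illustration only: it assumes each JSON file holds a top-level list of samples (as in LLaVA-style annotation files), the function name `mix_sft_data` and the output file name are hypothetical, and it is not the procedure used to build mix_evol_sft.json.

```python
import json
import random
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def mix_sft_data(json_paths: Iterable[PathLike], output_path: PathLike, seed: int = 0) -> int:
    """Concatenate several annotation files, shuffle them, and write one mixed file."""
    merged = []
    for path in json_paths:
        with open(path, "r", encoding="utf-8") as f:
            records = json.load(f)
        # Assumption: every file stores a top-level list of samples.
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a list of samples")
        merged.extend(records)
    # A fixed seed keeps the mixture reproducible across runs.
    random.Random(seed).shuffle(merged)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)


if __name__ == "__main__":
    sources = ["datasets/json/mmevol_evol_480k.json", "datasets/json/allava_vflan.json"]
    if all(Path(p).is_file() for p in sources):
        total = mix_sft_data(sources, "datasets/json/my_mix.json")  # hypothetical output name
        print(f"wrote {total} samples")
```

The training script's data path would then point at the mixed file instead of mix_evol_sft.json.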
Train
Pretrain
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions and organize the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).
bash scripts/v1_6/train/llama3/pretrain.sh
bash scripts/v1_6/train/qwen2/pretrain.sh
Visual Instruction Tuning
Please make sure you download and organize the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).
bash scripts/v1_6/train/llama3/finetune.sh
bash scripts/v1_6/train/qwen2/finetune.sh
Evaluation
Ensure that your api_base and key are correctly configured before evaluation.
opencompass
First, enter the vlmevalkit directory and install all dependencies:
cd vlmevalkit
pip install -r requirements.txt
Then, run script/run_inference.sh, which receives three input parameters in sequence: MODELNAME, DATALIST, and MODE. MODELNAME represents the name of the model, DATALIST represents the datasets used for inference, and MODE represents evaluation mode:
chmod +x ./script/run_inference.sh
./script/run_inference.sh "$MODELNAME" "$DATALIST" "$MODE"
The two available choices for MODELNAME are listed in vlmeval/config.py:
ungrouped = {
'MMEvol-Llama3-V-1_6': partial(LLaVA_Llama3_V, model_path="checkpoints/xxx/checkpoint-14000"),
'MMEvol-Qwen2-V-1_6': partial(LLaVA_Qwen2_V, model_path="checkpoints/xxx/checkpoint-14000"),
}
All available choices for DATALIST are listed in vlmeval/utils/dataset_config.py. When evaluating on a single dataset, pass the dataset name directly without quotation marks; when evaluating on multiple datasets, separate the dataset names with spaces and enclose the whole list in quotation marks:
DATALIST="MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK"
To score each benchmark directly, set MODE=all. If only inference results are required, set MODE=infer. To reproduce the results in the Performance table above (columns MME_C through RWQA), run the script with the following settings:
# run on all 9 datasets
./script/run_inference.sh MMEvol-Llama3-V-1_6 "MME MMMU_DEV_VAL MathVista_MINI RealWorldQA MMStar AI2D_TEST HallusionBench POPE BLINK" all
# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh MMEvol-Llama3-V-1_6 MME all
# MMMU_DEV_VAL
./script/run_inference.sh MMEvol-Llama3-V-1_6 MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh MMEvol-Llama3-V-1_6 MathVista_MINI all
.....
# NOTE: BLINK must be evaluated separately with llava/eval/blink_eval.py.
python llava/eval/blink_eval.py
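The calling convention described above (model name, then a single space-separated dataset argument, then the mode) can also be driven from Python. The helper below is a hypothetical sketch: `build_inference_cmd` is not part of the repository, and the allowed values are simply the model names and modes mentioned in this section.

```python
from typing import List, Sequence

# Model names and modes as listed in this section.
MODEL_NAMES = ("MMEvol-Llama3-V-1_6", "MMEvol-Qwen2-V-1_6")
MODES = ("all", "infer")


def build_inference_cmd(
    model: str,
    datasets: Sequence[str],
    mode: str = "all",
    script: str = "./script/run_inference.sh",
) -> List[str]:
    """Build the argument list: script, MODELNAME, DATALIST, MODE."""
    if model not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {model}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if not datasets:
        raise ValueError("at least one dataset is required")
    # Several datasets are passed as ONE argument, separated by spaces.
    return [script, model, " ".join(datasets), mode]


if __name__ == "__main__":
    cmd = build_inference_cmd("MMEvol-Qwen2-V-1_6", ["MME", "POPE"], "infer")
    print(cmd)
    # To execute it from inside the vlmevalkit directory:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as an argument list avoids shell quoting problems, since the dataset list reaches the script as a single parameter.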
vqadataset
For the VQA_v2 and GQA datasets, please follow LLaVA for evaluation.
For MIA and MMSInst, first download the datasets and then run the following scripts for evaluation:
cd mmevol
# test
python llava/eval/model_vqa_mia.py
python llava/eval/model_vqa_mminst.py
# eval
python llava/eval/mia_eval.py
python llava/eval/mminst_eval.py
Visualization
Tongyi-ConvAI generated this dataset for multimodal supervised fine-tuning. It was used to train Evol-Llama3-8B-Instruct and Evol-Qwen2-7B, as reported in our paper. To create it, we first selected a 163K seed instruction-tuning dataset for Evol-Instruct, then improved data quality through an iterative process that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. The result is a more complex and diverse image-text instruction dataset, which in turn gives MLLMs stronger capabilities. Below we showcase the detailed data distribution of SEED-163K, the seed set prepared for the multi-round evolution described above. More details can be found in our paper.
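The iterative process described above can be pictured as a simple loop. The sketch below is a loose illustration under stated assumptions, not the authors' pipeline: it assumes one evolution direction is picked at random per sample and round, that a rejected candidate falls back to the previous version, and that three rounds are run. `evolve_fn` and `keep_fn` are hypothetical placeholders for a model-based rewriting step and a quality filter.

```python
import random
from typing import Callable, Dict, List, Optional

Sample = Dict[str, str]

# The three evolution directions named in the paragraph above.
EVOLUTION_TYPES = ("fine_grained_perception", "cognitive_reasoning", "interaction")


def evolve_dataset(
    seed_data: List[Sample],
    evolve_fn: Callable[[Sample, str], Sample],
    keep_fn: Callable[[Sample, Sample], bool],
    rounds: int = 3,  # assumption: the number of rounds is not stated in this README
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """Run several evolution rounds, keeping a candidate only if keep_fn accepts it."""
    rng = rng or random.Random(0)
    current = list(seed_data)
    for _ in range(rounds):
        next_round = []
        for sample in current:
            kind = rng.choice(EVOLUTION_TYPES)   # assumption: random choice per sample
            candidate = evolve_fn(sample, kind)  # placeholder for the rewriting step
            next_round.append(candidate if keep_fn(sample, candidate) else sample)
        current = next_round
    return current


if __name__ == "__main__":
    # Toy stand-ins so the loop can be run without any model.
    def toy_evolve(sample: Sample, kind: str) -> Sample:
        return {"instruction": sample["instruction"] + " [" + kind + "]"}

    seed = [{"instruction": "Describe the image."}]
    print(evolve_dataset(seed, toy_evolve, lambda old, new: True))
```

In a real pipeline the two callables would wrap model calls and filtering rules; those details are described in the paper rather than here.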
Schedule
- Release MMEvol-10M
- Release training & evaluation code
- Release model weight
- Release evolved dataset MMEvol-480K
Citation
If you find this repo useful for your research, please consider citing the paper
@article{luo2024mmevol,
title={Mmevol: Empowering multimodal large language models with evol-instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={arXiv preprint arXiv:2409.05840},
year={2024}
}
Contact
If you have any questions, please reach out to the following contacts for help:
- Run Luo: r.luo@siat.ac.cn
- Haonan Zhang: zchiowal@gmail.com
Acknowledgement
- LLaVA: the codebase we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use LLaVA-NeXT.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!