Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
May 23, 2025 ยท View on GitHub
Install
- Install Package
conda create -n llava python=3.10 -y
conda activate llava
pip install -e .
Please also refer to requirements.txt
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
-
Download
vq_ds16_c2i.ptto./llava/model/multimodal_encoder/from llamaGen -
Install pytorch-fid to do evaluation on image generation
Experiments on Synthetic Dataset
-
Generate the training and testing dataset using
./generate_data_smart_watch.ipynb -
Use the
finetune_lora*.shin./scripts/v1_5/to do training and evaluation. To test the model with affine-transformation, use./llava/model/multimodal_encoder/affine_transformation_generation.ipynbto generate the transformation first and then change the finetune_lora bash file accordingly.
Experiments on LLaVA-1.5 Dataset
-
Prepare the ShareGPT4V dataset following ShareGPT4V. You do not need to download the images from SAM, since we will bypass them in preprocessing to save time. Images in ShareGPT4V dataset include images in LLaVA-1.5 dataset. Also download the text part of LLaVA-1.5 dataset from LLaVA-1.5
-
Use
./scripts/v1_5/caption-to-image-generation.ipynbto transform the ShareGPT4V image caption data into text-to-image generation data, and append to the original LLaVA-1.5 dataset. -
Use the
pretrain*.shandfinetune*.shin./scripts/v1_5/to do the two-stage training. You can refer to LLaVA-1.5 for more details. -
Install the lmms-eval to do evaluation. Please replace the
lmms-eval/lmms_eval/models/llava.pyin the lmms-eval code with./lmms-eval/llava.py. Example command:
python3 -m accelerate.commands.launch \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained="./scripts/v1_5/checkpoints/llava-v1.5-7b/checkpoint-6761" \
--tasks pope,textvqa,mmvet,vizwiz_vqa,gqa,mmbench_en,mme,mmstar \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_v1.5-7b \
--output_path ./logs/
Bug Fixing
A few matters need attention:
- Please manually add
do_sample:truein in vicuna's generation_config.json file, according to this issue - We use
zero2settings in visual instruction tuning, becausezero3may cause some unknow timeout error. Please set"overlap_comm": falseto avoid zero loss error, according to this issue
Acknowledgement
- LLaVA: the codebase we built upon.