Transferable Visual Prompting for Multimodal Large Language Models

December 20, 2024

Installation

  1. Create the virtual environment for the project.
cd Transferable_VP_MLLM
conda create -n transvp python=3.11
conda activate transvp
pip install -r requirements.txt
  2. Prepare the model weights.

Put the model weights under ./model_weights

  • MiniGPT-4: Follow the MiniGPT-4 instructions and prepare MiniGPT-4-Vicuna-V0-7B
  • InstructBLIP: Follow the LAVIS instructions and prepare InstructBLIP-Vicuna-7b-v1.1
  • BLIP2: Follow the LAVIS instructions and prepare BLIP2-FlanT5-xl
  • VPGTrans: Follow the MiniGPT-4 instructions and prepare Vicuna-v0-7B as the LLM
  • BLIVA: Follow the BLIVA instructions and prepare BLIVA-Vicuna-7B
  • VisualGLM-6B: No special operation needed.
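
Before launching a run, it can help to confirm that the weights directory exists and to see what it contains. The following is a minimal sketch, not part of the repository. The helper name `list_weight_dirs` is hypothetical, and because this README does not state the layout inside ./model_weights, the function only reports which entries are present rather than checking for particular names.

```python
from pathlib import Path


def list_weight_dirs(root="./model_weights"):
    """Return the sorted names of entries found under the weights directory.

    Hypothetical helper: the expected sub-directory names are not stated in
    this README, so nothing is validated beyond the directory's existence.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        raise FileNotFoundError(f"weights directory not found: {root_path}")
    # Hidden entries (for example .gitkeep) are skipped.
    return sorted(p.name for p in root_path.iterdir() if not p.name.startswith("."))
```

Calling `list_weight_dirs()` from the repository root returns the entry names, or raises `FileNotFoundError` if ./model_weights has not been created yet.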

To Reproduce the Results

  1. On CIFAR10
python transfer_cls.py --dataset cifar10 --model_name minigpt-4 --target_models instructblip blip2 --learning_rate 10 --fca 0.005 --tse 0.001 --epochs 1
  2. Inference with a model

Specify the path to the checkpoint if you want to evaluate on the dataset with a trained prompt. A reproducible checkpoint is placed in save/checkpoint_best.pth.
python transfer_cls.py --dataset cifar10 --model_name minigpt-4 --evaluate --checkpoint $PATH_TO_PROMPT
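
When running several configurations, the training and evaluation commands above can be assembled programmatically. The sketch below is a convenience written for illustration, not part of the repository. `build_command` is a hypothetical helper, and it uses only the flags that appear in the commands above, with the values shown there as defaults.

```python
import shlex


def build_command(dataset="cifar10", model_name="minigpt-4", target_models=(),
                  checkpoint=None, learning_rate=10, fca=0.005, tse=0.001,
                  epochs=1):
    """Build an argument list for transfer_cls.py (hypothetical helper).

    If `checkpoint` is given, the evaluation form is produced; otherwise the
    training form. Only flags shown in this README are used.
    """
    cmd = ["python", "transfer_cls.py",
           "--dataset", dataset, "--model_name", model_name]
    if checkpoint is not None:
        # Evaluation with a trained prompt.
        cmd += ["--evaluate", "--checkpoint", str(checkpoint)]
        return cmd
    if target_models:
        cmd += ["--target_models", *target_models]
    cmd += ["--learning_rate", str(learning_rate), "--fca", str(fca),
            "--tse", str(tse), "--epochs", str(epochs)]
    return cmd


train_cmd = build_command(target_models=("instructblip", "blip2"))
eval_cmd = build_command(checkpoint="save/checkpoint_best.pth")
print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

Either list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues when a checkpoint path contains spaces.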

BibTeX

If you find this work helpful, please cite it with the BibTeX entry below.

@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Yichi and Dong, Yinpeng and Zhang, Siyuan and Min, Tianzan and Su, Hang and Zhu, Jun},
    title     = {Exploring the Transferability of Visual Prompting for Multimodal Large Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26562-26572}
}