MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

September 6, 2024 · View on GitHub

[Paper] [ArXiv] [Code]

Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.

What's New 🥳

  • (Sep 6, 2024) We released the implementation and scripts of MADTP. [Code] (Checkpoints and logs will come soon.) 🚩

  • (Feb 27, 2024) MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer was accepted by CVPR 2024. [Paper] [ArXiv] 🎉

Installation

The code is tested with PyTorch 1.11.0, CUDA 11.3.1, and Python 3.8.13. The dependencies can be installed with:

conda env create -f environment.yml
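
After the environment is created, activate it and check that the installed versions match the tested ones above. The environment name below is a placeholder; use the name defined in environment.yml:

conda activate madtp  # placeholder name; use the env name defined in environment.yml
python -c "import torch; print(torch.__version__, torch.version.cuda)"  # expect 1.11.0 and 11.3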

Supported Tasks, Models, and Datasets

Type | Supported Tasks | Supported Models | Supported Datasets
--- | --- | --- | ---
Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2
Multi-modal | Image Caption | BLIP (instructions) | COCO Caption
Multi-modal | Visual Question Answer | BLIP (instructions) | VQAv2
Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k
Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k

Visual Reasoning on the NLVR2 Dataset

  • Dataset & Annotation

    Download the NLVR2 dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations (covering the Visual Reasoning, Image Caption, VQA, Image-Text Retrieval, and Text-Image Retrieval tasks) from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.
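
    A minimal sketch of this setup, assuming the downloaded archives are named nlvr2.zip and annotation.zip (hypothetical names; the actual file names depend on the download source):

    # hypothetical archive names; target folders follow the expected folder structure below
    mkdir -p datasets/vision annotation
    unzip nlvr2.zip -d datasets/vision/NLVR2        # point image_root in configs/nlvr.yaml here
    unzip annotation.zip -d annotation              # point annotation in configs/nlvr.yaml here
    grep -nE 'image_root|annotation' configs/nlvr.yaml   # verify which paths the config expects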

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/model_base_nlvr_nlvr2_p0.5_compressed.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    
  • Resources

    Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script
    --- | --- | --- | --- | --- | ---
    0.3 | Download | Link | Download | Download | Link
    0.5 | Download | Link | Download | Download | Link
    0.6 | Download | Link | Download | Download | Link
    0.7 | Download | Link | Download | Download | Link
    0.8 | Download | Link | Download | Download | Link

Image Caption on the COCO Caption Dataset

  • Dataset & Annotation

    Download the COCO Caption dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --evaluate \
    --pretrained output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_caption_capfilt_large.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
    

Visual Question Answer on the VQAv2 Dataset

  • Dataset & Annotation

    Download the VQAv2 dataset and the Visual Genome dataset, unzip them under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio (note that the script generates an answer file, vqa_result.json, which must be submitted to the official evaluation server to obtain results):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --evaluate \
    --pretrained output/vqa_vqa2_compression_p0.5/model_base_vqa_capfilt_large_vqa2_p0.5_compressed.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
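
    The exact location of vqa_result.json under the output directory depends on the script; a quick way to locate it after evaluation (a sketch):

    find output/vqa_vqa2_compression_p0.5 -name vqa_result.json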
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --p 0.5 --epoch 3 \
    --pretrained pretrained/model_base_vqa_capfilt_large.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
    

Image-Text and Text-Image Retrieval on the COCO Dataset

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --evaluate \
    --pretrained output/retrieval_coco_compression_p0.5/model_base_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_retrieval_coco.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5
    

Image-Text and Text-Image Retrieval on the Flickr30K Dataset

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --evaluate \
    --pretrained output/retrieval_flickr_compression_p0.5/model_base_retrieval_flickr_p0.5_compressed.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/model_base_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_p0.5
    

Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_coco_clip_compression_p0.5/clip_large_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/clip_large_retrieval_coco.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5
    

Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_flickr_clip_compression_p0.5/checkpoint_best.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/clip_large_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5
    

Common Issues

1. Evaluation with single GPU

  • Both BLIP and CLIP models can be evaluated on a single GPU. For example, to evaluate the BLIP model compressed at a 0.5 reduction ratio on the NLVR2 dataset:

    python compress_nlvr_dtp.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/checkpoint_best.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    

2. Compress with single GPU

  • Both BLIP and CLIP models can be compressed on a single GPU. For example, to compress the BLIP model at a 0.5 reduction ratio on the NLVR2 dataset:

    python compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    

3. Other issues

You can post them on the Issues page.

Expected Folder Structures

├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip
├── compress_caption_dtp.py
├── compress_nlvr_dtp.py
├── compress ...
├── configs
├── data
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2
│       ├── ...
├── log
├── models
├── output
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...
├── transform
└── utils.py
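
A minimal sketch to create the top-level skeleton shown above (datasets, annotations, and checkpoints still need to be downloaded into it):

mkdir -p annotation datasets/vision/{coco,flickr,NLVR2} pretrained output log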

Acknowledgments

This code is built upon BLIP, CLIP, UPop, and timm (https://github.com/huggingface/pytorch-image-models/tree/main/timm). We thank the original authors for their open-source work.

Citation

If you find this work useful, please consider citing the corresponding paper:

@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}