MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
September 6, 2024 · View on GitHub
Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.
What's New 🥳
- (Sep 6, 2024) We released the implementation and scripts of MADTP. (Note that checkpoints and logs will come soon.) [Code]
- (Feb 27, 2024) MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer was accepted by CVPR 2024. [Paper] [ArXiv]
Installation
The code has been tested with PyTorch 1.11.0, CUDA 11.3.1, and Python 3.8.13. The dependencies can be installed by:
```bash
conda env create -f environment.yml
```
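After creating the environment, activate it before running any of the scripts below. The environment name used here is an assumption; check the `name:` field in `environment.yml` for the actual value:

```bash
# Assumed environment name; see environment.yml for the real one.
conda activate madtp
```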
Supported Tasks, Models, and Datasets
| Type | Supported Tasks | Supported Models | Supported Datasets |
|---|---|---|---|
| Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
| Multi-modal | Image Caption | BLIP (instructions) | COCO Caption |
| Multi-modal | Visual Question Answering | BLIP (instructions) | VQAv2 |
| Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
| Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
Visual Reasoning on the NLVR2 Dataset
- Dataset & Annotation

  Download the NLVR2 dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations (including annotations for the Visual Reasoning, Image Caption, VQA, Image-Text Retrieval, and Text-Image Retrieval tasks) from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for the expected folder structures. (A sketch of the relevant config fields appears after this list.)
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with a 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_nlvr.py --evaluate \
  --pretrained output/nlvr_nlvr2_compression_p0.5/model_base_nlvr_nlvr2_p0.5_compressed.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct compression at a 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --p 0.5 --epoch 15 \
  --pretrained pretrained/model_base_nlvr.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
- Resources

  | Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script |
  |---|---|---|---|---|---|
  | 0.3 | Download | Link | Download | Download | Link |
  | 0.5 | Download | Link | Download | Download | Link |
  | 0.6 | Download | Link | Download | Download | Link |
  | 0.7 | Download | Link | Download | Download | Link |
  | 0.8 | Download | Link | Download | Download | Link |
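For reference, the dataset and annotation locations are plain path fields in each task's YAML config. The snippet below is only a sketch for the NLVR2 task, assuming the folder layout described above; the keys follow the `image_root` and `annotation` fields mentioned in these instructions, and the exact paths should be verified against `./configs/nlvr.yaml`:

```yaml
# Hypothetical excerpt of ./configs/nlvr.yaml -- verify keys and paths
# against the actual config file before running.
image_root: './datasets/vision/NLVR2/'   # root of the unzipped NLVR2 images
annotation: './annotation/'              # folder containing the all-in-one annotations
```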
Image Caption on the COCO Caption Dataset
- Dataset & Annotation

  Download the COCO Caption dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for the expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with a 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --evaluate \
  --pretrained output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth \
  --config ./configs/caption_coco.yaml \
  --output_dir output/caption_coco_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct compression at a 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --p 0.5 --epoch 5 \
  --pretrained pretrained/model_base_caption_capfilt_large.pth \
  --config ./configs/caption_coco.yaml \
  --output_dir output/caption_coco_compression_p0.5
  ```
Visual Question Answering on the VQAv2 Dataset
- Dataset & Annotation

  Download the VQAv2 dataset and the Visual Genome dataset, unzip them under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for the expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with a 0.5 reduce ratio (note that the script will generate the answer file `vqa_result.json`, which should be submitted to the official server to obtain evaluation results):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --evaluate \
  --pretrained output/vqa_vqa2_compression_p0.5/model_base_vqa_capfilt_large_vqa2_p0.5_compressed.pth \
  --config ./configs/vqa.yaml \
  --output_dir output/vqa_vqa2_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct compression at a 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --p 0.5 --epoch 3 \
  --pretrained pretrained/model_base_vqa_capfilt_large.pth \
  --config ./configs/vqa.yaml \
  --output_dir output/vqa_vqa2_compression_p0.5
  ```
Image-Text and Text-Image Retrieval on the COCO Dataset
- Dataset & Annotation

  Download the COCO dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for the expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with a 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --evaluate \
  --pretrained output/retrieval_coco_compression_p0.5/model_base_retrieval_coco_p0.5_compressed.pth \
  --config ./configs/retrieval_coco.yaml \
  --output_dir output/retrieval_coco_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct compression at a 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --p 0.5 --epoch 5 \
  --pretrained pretrained/model_base_retrieval_coco.pth \
  --config ./configs/retrieval_coco.yaml \
  --output_dir output/retrieval_coco_compression_p0.5
  ```
Image-Text and Text-Image Retrieval on the Flickr30K Dataset
- Dataset & Annotation

  Download the Flickr30k dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for the expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with a 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr.py --evaluate \
  --pretrained output/retrieval_flickr_compression_2x/model_base_retrieval_flickr_2x_compressed.pth \
  --config ./configs/retrieval_flickr.yaml \
  --output_dir output/retrieval_flickr_compression_2x
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct compression at a 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --p 0.5 --epoch 10 \
  --pretrained pretrained/model_base_retrieval_flickr.pth \
  --config ./configs/retrieval_flickr.yaml \
  --output_dir output/retrieval_flickr_compression_p0.5
  ```
Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP
- Dataset & Annotation

  Download the COCO dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for the expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with a 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
  --pretrained output/retrieval_coco_clip_compression_p0.5/clip_large_retrieval_coco_p0.5_compressed.pth \
  --config ./configs/retrieval_coco_clip.yaml \
  --output_dir output/retrieval_coco_clip_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct compression at a 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 5 \
  --pretrained pretrained/clip_large_retrieval_coco.pth \
  --config ./configs/retrieval_coco_clip.yaml \
  --output_dir output/retrieval_coco_clip_compression_p0.5
  ```
Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP
- Dataset & Annotation

  Download the Flickr30k dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for the expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with a 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
  --pretrained output/retrieval_flickr_clip_compression_p0.5/checkpoint_best.pth \
  --config ./configs/retrieval_flickr_clip.yaml \
  --output_dir output/retrieval_flickr_clip_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct compression at a 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 10 \
  --pretrained pretrained/clip_large_retrieval_flickr.pth \
  --config ./configs/retrieval_flickr_clip.yaml \
  --output_dir output/retrieval_flickr_clip_compression_p0.5
  ```
Common Issues
1. Evaluation with a single GPU
- Both BLIP and CLIP models can be evaluated on a single GPU. As an example, to evaluate the 2x-compressed BLIP model on the NLVR2 dataset:

  ```bash
  python compress_nlvr_dtp.py --evaluate \
  --pretrained output/nlvr_nlvr2_compression_p0.5/checkpoint_best.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
2. Compression with a single GPU
- Both BLIP and CLIP models can also be compressed on a single GPU. As an example, to compress the BLIP model by half on the NLVR2 dataset:

  ```bash
  python compress_nlvr_dtp.py --p 0.5 --epoch 15 \
  --pretrained pretrained/model_base_nlvr.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
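On a multi-GPU machine, the single-GPU commands above can be pinned to one device with the standard `CUDA_VISIBLE_DEVICES` environment variable. A minimal sketch, reusing the compression command above (device index 0 is just an illustration):

```bash
# Restrict the run to GPU 0; any free device index works.
CUDA_VISIBLE_DEVICES=0 python compress_nlvr_dtp.py --p 0.5 --epoch 15 \
  --pretrained pretrained/model_base_nlvr.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
```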
3. Other issues
You can post them on the Issues page.
Expected Folder Structures
```
├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   └── ...
├── clip
├── compress_caption_dtp.py
├── compress_nlvr_dtp.py
├── compress ...
├── configs
├── data
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2
│       └── ...
├── log
├── models
├── output
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   └── ...
├── transform
└── utils.py
```
Acknowledgments
This code is built upon BLIP, CLIP, UPop, and [timm](https://github.com/huggingface/pytorch-image-models/tree/main/timm). We thank the original authors for their open-source work.
Citation
If you find this work useful, please consider citing the corresponding paper:
```bibtex
@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```