MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

September 6, 2024 · View on GitHub

[Paper] [ArXiv] [Code]

Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.

What's New 🥳

  • (Sep 6, 2024) We released the implementation and scripts of MADTP. [Code] (Checkpoints and logs will come soon.) 🚩

  • (Feb 27, 2024) MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer was accepted by CVPR 2024. [Paper] [ArXiv] 🎉

Installation

The code is tested with PyTorch 1.11.0, CUDA 11.3.1, and Python 3.8.13. The dependencies can be installed with:

conda env create -f environment.yml
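
After the environment is created, activate it and check that the installed versions match the tested ones above. The environment name below is a placeholder; use the name defined in environment.yml:

conda activate madtp  # placeholder name; use the env name defined in environment.yml
python -c "import torch; print(torch.__version__, torch.version.cuda)"  # expect 1.11.0 and 11.3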

Supported Tasks, Models, and Datasets

Type | Supported Tasks | Supported Models | Supported Datasets
--- | --- | --- | ---
Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2
Multi-modal | Image Caption | BLIP (instructions) | COCO Caption
Multi-modal | Visual Question Answer | BLIP (instructions) | VQAv2
Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k
Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k

Visual Reasoning on the NLVR2 Dataset

  • Dataset & Annotation

    Download the NLVR2 dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations (covering the Visual Reasoning, Image Caption, VQA, Image-Text Retrieval, and Text-Image Retrieval tasks) from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.
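
    A minimal sketch of this setup, assuming the downloaded archives are named nlvr2.zip and annotation.zip (hypothetical names; the actual file names depend on the download source):

    # hypothetical archive names; target folders follow the expected folder structure below
    mkdir -p datasets/vision annotation
    unzip nlvr2.zip -d datasets/vision/NLVR2        # point image_root in configs/nlvr.yaml here
    unzip annotation.zip -d annotation              # point annotation in configs/nlvr.yaml here
    grep -nE 'image_root|annotation' configs/nlvr.yaml   # verify which paths the config expects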

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/model_base_nlvr_nlvr2_p0.5_compressed.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    
  • Resources

    Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script
    --- | --- | --- | --- | --- | ---
    0.3 | Download | Link | Download | Download | Link
    0.5 | Download | Link | Download | Download | Link
    0.6 | Download | Link | Download | Download | Link
    0.7 | Download | Link | Download | Download | Link
    0.8 | Download | Link | Download | Download | Link

Image Caption on the COCO Caption Dataset

  • Dataset & Annotation

    Download the COCO Caption dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --evaluate \
    --pretrained output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_caption_capfilt_large.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
    

Visual Question Answer on the VQAv2 Dataset

  • Dataset & Annotation

    Download the VQAv2 dataset and the Visual Genome dataset, unzip them under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio (note that the script generates an answer file, vqa_result.json, which must be submitted to the official evaluation server to obtain results):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --evaluate \
    --pretrained output/vqa_vqa2_compression_p0.5/model_base_vqa_capfilt_large_vqa2_p0.5_compressed.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
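
    The exact location of vqa_result.json under the output directory depends on the script; a quick way to locate it after evaluation (a sketch):

    find output/vqa_vqa2_compression_p0.5 -name vqa_result.json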
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --p 0.5 --epoch 3 \
    --pretrained pretrained/model_base_vqa_capfilt_large.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
    

Image-Text and Text-Image Retrieval on the COCO Dataset

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --evaluate \
    --pretrained output/retrieval_coco_compression_p0.5/model_base_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_retrieval_coco.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5
    

Image-Text and Text-Image Retrieval on the Flickr30K Dataset

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --evaluate \
    --pretrained output/retrieval_flickr_compression_p0.5/model_base_retrieval_flickr_p0.5_compressed.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/model_base_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_p0.5
    

Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_coco_clip_compression_p0.5/clip_large_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/clip_large_retrieval_coco.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5
    

Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in config accordingly. See here for the expected folder structures.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument of the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduction ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_flickr_clip_compression_p0.5/checkpoint_best.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5
    
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in config accordingly. For example, to compress at a 0.5 reduction ratio on 8 A100 GPUs (80GB):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/clip_large_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5
    

Common Issues

1. Evaluation with single GPU

  • Both BLIP and CLIP models can be evaluated on a single GPU. For example, to evaluate the BLIP model compressed at a 0.5 reduction ratio on the NLVR2 dataset:

    python compress_nlvr_dtp.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/checkpoint_best.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    

2. Compress with single GPU

  • Both BLIP and CLIP models can be compressed on a single GPU. For example, to compress the BLIP model at a 0.5 reduction ratio on the NLVR2 dataset:

    python compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
    

3. Other issues

You can post them on the Issues page.

Expected Folder Structures

├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip
├── compress_caption_dtp.py
├── compress_nlvr_dtp.py
├── compress ...
├── configs
├── data
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2
│       ├── ...
├── log
├── models
├── output
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...
├── transform
└── utils.py
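
A minimal sketch to create the top-level skeleton shown above (datasets, annotations, and checkpoints still need to be downloaded into it):

mkdir -p annotation datasets/vision/{coco,flickr,NLVR2} pretrained output log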

Acknowledgments

This code is built upon BLIP, CLIP, UPop, and timm (https://github.com/huggingface/pytorch-image-models/tree/main/timm). We thank the original authors for their open-source work.

Citation

If you find this work useful, please consider citing the corresponding paper:

@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}