MultiCapCLIP

August 8, 2024 · View on GitHub

Data used in our ACL'23 paper:

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, and Yuexian Zou

ACL Anthology, arXiv

Data

Our data follows the structure shown below:

MultiCapCLIP/
    data
    ├── checkpoints                 # off-the-shelf models
    │   ├── ViT-B-16.pt
    │   ├── bert-base-multilingual-cased
    │   │   ├── ...
    │   │   └── vocab.txt
    │   └── bert-base-uncased                       
    │       ├── config.json
    │       ├── pytorch_model.bin
    │       ├── tokenizer.json
    │       ├── tokenizer_config.json
    │       └── vocab.txt
    ├── annotations                 # for evaluations or supervised training (finetuning)
    │   └── $dataset   
    │       └── $lang               # annotations in a specific language
    │           ├── subsets         # for semi-supervised training
    │           │    ├─ 0.1%_0.json # a 0.1% subset of train.json 
    │           │    ├─ 0.1%_1.json # different seed
    │           │    ├─ 0.1%_2.json # totally 3 seeds
    │           │    ├─ ...
    │           │    └─ 10%_2.json  # a 10% subset of train.json 
    │           ├── train.json          
    │           ├── val.json
    │           ├── val_gt.json
    │           ├── test.json     
    │           └── test_gt.json  
    ├── corpus                      # for text-only training
    │   ├── $dataset.txt            # one caption per line
    │   └── $dataset_$lang.tsv      # English-$lang pairs (separated by `\t`) per line
    ├── feats                       # speedup training and inference
    │   ├── vit-b-16                # CLIP's ViT-B/16 English text embeddings
    │   │   └── ...
    │   └── vit-b-16_image          # CLIP's ViT-B/16 image embeddings
    │       └── ...
    ├── concepts                    # for concept prompting
    │   └── ...
    ├── related_caption_ids         # indexes of captions of similar semantics
    │   └── vit-b-16
    │       └── ...
    ├── stanford-corenlp-4.5.2      # segment English, German, and French sentences
    │   └── ...               
    └── ...                         # folders that store raw images and videos

You can download our full data from Google Drive or Baidu Netdisk (extraction code: huk0).
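After downloading or preparing the data, a quick layout check can catch a misplaced folder early. The following is a minimal sketch that only looks for the top-level subfolders named in the tree above; the helper `missing_subfolders` is hypothetical and not part of this repo.

```python
from pathlib import Path

# Top-level subfolders taken from the directory tree above.
EXPECTED_SUBFOLDERS = (
    "checkpoints",
    "annotations",
    "corpus",
    "feats",
    "concepts",
    "related_caption_ids",
    "stanford-corenlp-4.5.2",
)


def missing_subfolders(data_root):
    """Return the expected subfolders that do not exist under `data_root`."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

Calling `missing_subfolders("data")` from the repo root returns an empty list when everything is in place. A missing stanford-corenlp-4.5.2 folder is not necessarily a problem, since it is downloaded automatically during evaluation.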

The data folder contains the following subfolders:

  • The checkpoints folder contains pre-trained weights, configs, and vocab files of off-the-shelf models (e.g., CLIP's ViT-B-16.pt and Hugging Face's bert-base-uncased and bert-base-multilingual-cased). We download these files in advance to avoid network issues and set the related configurations in configs/adapt.yaml and configs/finetune.yaml.
  • The annotations folder contains subfolders named after datasets, which hold the training, validation, and testing json files for supervised finetuning. Each json file is a list of dictionaries, and each dictionary looks like, e.g., {image: path_relative_to_the_root, caption: caption_of_this_image}. Please refer to ZeroNLG/data for more details.
  • The corpus folder contains .txt or .tsv files that store the (parallel) corpus for CLIP-based autoencoding or translating.
  • The feats folder contains .pkl files that store the features of English texts and images/videos.
  • The concepts folder contains concept files (.txt) extracted from the English captions of each dataset and language.
  • The related_caption_ids folder records indexes of captions of similar semantics for each caption.
  • The stanford-corenlp-4.5.2 folder has files for segmenting English, German, and French sentences. See utils/eval.py for details.
  • Other folders store the raw images or videos, e.g., data/MSCOCO/train2014/*.jpg (see the variable image_video_root in configs and the structure below).
    data
    ├── MSCOCO
    │   ├── train2014
    │   │   └── *.jpg
    │   └── val2014
    │       └── *.jpg
    ├── Flickr30k
    │   └── flickr30k-images   
    │       └── *.jpg
    ├── MSRVTT
    │   └── all_videos   
    │       ├── video0.mp4
    │       ├── ...
    │       └── video9999.mp4
    └── VATEX
        └── all_videos   
            ├── video0.mp4
            ├── ...
            └── video34990.mp4
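The annotation format described above (a list of dictionaries with a media path and a caption) can be sanity-checked with a few lines of Python. This is a sketch under assumptions: the function name `check_annotations` is hypothetical, the key `image` comes from the example dictionary above, and some files may use other keys, so ZeroNLG/data remains the authoritative reference.

```python
import json
import os


def check_annotations(json_path, media_root=None, media_key="image"):
    """Load an annotation file and check each record's layout.

    Returns (records, missing), where `missing` lists the relative media
    paths that were not found under `media_root` (empty if it is None).
    """
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{json_path}: expected a JSON list of dictionaries")

    missing = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"{json_path}: record {index} is not a dictionary")
        # `media_key` defaults to "image"; this is an assumption taken from
        # the example record, not a guarantee for every dataset.
        for key in (media_key, "caption"):
            if key not in record:
                raise ValueError(f"{json_path}: record {index} lacks the key {key!r}")
        if media_root is not None:
            media_path = os.path.join(media_root, record[media_key])
            if not os.path.exists(media_path):
                missing.append(record[media_key])
    return records, missing
```

Passing the folder that the paths are relative to as `media_root` (compare image_video_root in the configs) additionally reports files that are referenced but absent on disk.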
    

Here are official or shared links to download raw images or videos:

| Datasets | Official Link | Shared Link (Others) | Shared Link (Ours) |
|---|---|---|---|
| MSCOCO | Link | Link | N/A |
| Flickr30k | Link | Link | N/A |
| MSRVTT | Link (expired) | Link | N/A |
| VATEX | Link | N/A | Onedrive, PKU Yun (37.3G) |

Note:

From-Scratch Preparation

1. Follow ZeroNLG/data to prepare annotations.

2. Download MSRVTT-CN, which contains translated Chinese captions, from HuiGuanLab/nrccr.

Note: If the original link is expired, you can download MSRVTT-CN from our link (given above).

3. Prepare Corpus

Download en_core_web_sm for concept extraction. Note that we use version 3.4.1.

wget https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl

pip install en_core_web_sm-3.4.1-py3-none-any.whl
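To confirm that the installed model matches the version mentioned above, the package metadata can be queried from Python. This is a small sketch using only the standard library (Python 3.8+); the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_expected(package="en_core_web_sm", expected="3.4.1"):
    """True only when `package` is installed at exactly the `expected` version."""
    return installed_version(package) == expected
```

`matches_expected()` returns False both when the model is missing and when a different version is installed; `installed_version("en_core_web_sm")` tells the two cases apart.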

Then run

python pretreatments/prepare_corpus.py --dataset coco
python pretreatments/prepare_corpus.py --dataset msrvtt
python pretreatments/prepare_corpus.py --dataset vatex
python pretreatments/prepare_corpus.py --dataset flickr30k

Note:

  • You should download data/corpus/flickr30k_de.tsv from our link before this step.
  • We find that the 145K English and German training captions in Multi30K's task2 are not one-to-one mappings. Therefore, we obtain data/corpus/flickr30k_de.tsv by translating German captions into English captions via Google Translate.
  • We do not use the 29K English-German training pairs in Multi30K's task1 because that set is smaller than the 145K set. For fair comparisons with fully-supervised models trained on 145K image-German pairs, we carry out text-only training on the same scale of texts.
  • Besides the corpora, this step also yields the concepts folder.
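The parallel corpus files (data/corpus/$dataset_$lang.tsv) hold one tab-separated pair per line. A minimal reader could look like the sketch below. It assumes the English sentence is in the first column and the target-language sentence in the second, which the "English-$lang pairs" description suggests but does not guarantee; `read_parallel_corpus` is a hypothetical helper, not a function from this repo.

```python
def read_parallel_corpus(path):
    """Yield (english, target) pairs from a tab-separated corpus file.

    Assumption: column 1 is English and column 2 is the target language.
    """
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                raise ValueError(
                    f"{path}:{line_no}: expected 2 tab-separated fields, got {len(parts)}"
                )
            yield parts[0], parts[1]
```

For example, `pairs = list(read_parallel_corpus("data/corpus/flickr30k_de.tsv"))` loads every pair, and the ValueError points to the first malformed line if a caption itself contains a tab.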

4. Prepare subsets

Generate subsets of size 0.01% (if applicable), 0.1%, 1%, 10% using three different seeds for semi-supervised training:

python pretreatments/prepare_subsets.py --dataset coco
python pretreatments/prepare_subsets.py --dataset msrvtt

We highly recommend using the same subsets as ours (given in the above download links) for fair comparisons.
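For intuition, seeded subset sampling can be sketched as follows. This is not the code in pretreatments/prepare_subsets.py: the rounding rule, the use of Python's `random` module, and both function names are assumptions, so the sketch will generally not reproduce our released subsets.

```python
import random


def sample_subset(records, percent, seed):
    """Draw a reproducible `percent`% sample of `records` (at least one item)."""
    rng = random.Random(seed)  # a private generator keeps the draw reproducible
    k = max(1, round(len(records) * percent / 100))
    return rng.sample(records, k)


def subset_filename(percent, seed):
    """Build a name in the `0.1%_0.json` style shown in the data tree."""
    return f"{percent:g}%_{seed}.json"
```

Looping over `percent in (0.1, 1, 10)` and `seed in (0, 1, 2)` would give nine subsets per dataset, matching the naming pattern in data/annotations/$dataset/$lang/subsets.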

5. Prepare features

Extract English text embeddings in advance as follows to avoid extracting them from the frozen CLIP on-the-fly:

python pretreatments/extract_text_embs.py data/corpus/coco.txt
python pretreatments/extract_text_embs.py data/corpus/msrvtt.txt
python pretreatments/extract_text_embs.py data/corpus/msrvtt_zh.tsv
python pretreatments/extract_text_embs.py data/corpus/vatex.txt
python pretreatments/extract_text_embs.py data/corpus/vatex_zh.tsv
python pretreatments/extract_text_embs.py data/corpus/flickr30k.txt
python pretreatments/extract_text_embs.py data/corpus/flickr30k_de.tsv
python pretreatments/extract_text_embs.py data/corpus/flickr30k_fr.tsv

Extract image embeddings in advance as follows to avoid extracting them from the frozen CLIP on-the-fly:

python pretreatments/extract_image_embs.py --dataset coco
python pretreatments/extract_image_embs.py --dataset msrvtt
python pretreatments/extract_image_embs.py --dataset vatex
python pretreatments/extract_image_embs.py --dataset flickr30k

You can download the extracted features from our link to save time.
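Since the feats folder stores its features as .pkl files, a generic inspection helper can confirm that a file was written or downloaded correctly. The sketch below makes no assumption about the internal layout of those files; it only reports what it finds, and `describe_pickle` is a hypothetical name. Unpickling executes code embedded in the file, so only open files from a source you trust.

```python
import pickle


def describe_pickle(path):
    """Summarize the top-level object stored in a pickle file."""
    with open(path, "rb") as f:
        obj = pickle.load(f)

    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["len"] = len(obj)
    if isinstance(obj, dict) and obj:
        # Peek at one entry to see what the values look like.
        key = next(iter(obj))
        value = obj[key]
        summary["sample_key"] = key
        summary["sample_value_type"] = type(value).__name__
        shape = getattr(value, "shape", None)  # present on array-like values
        if shape is not None:
            summary["sample_value_shape"] = tuple(shape)
    return summary
```

Running it on one file under data/feats/vit-b-16 and one under data/feats/vit-b-16_image gives a quick view of how many entries each holds before training starts.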

6. The stanford-corenlp-4.5.2 folder will be downloaded automatically during evaluation.

You can download it from our link to avoid network issues.

Citation

Please star this repo and cite the following papers if you find our data useful for your research:

@inproceedings{yang-etal-2023-multicapclip,
    title = "{M}ulti{C}ap{CLIP}: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning",
    author = "Yang, Bang and Liu, Fenglin and Wu, Xian and Wang, Yaowei and Sun, Xu and Zou, Yuexian",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.664",
    doi = "10.18653/v1/2023.acl-long.664",
    pages = "11908--11922",
}

@article{Yang2023ZeroNLG,
   title={{Z}ero{NLG}: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation},
   author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
   journal={arXiv preprint arXiv:2303.06458},
   year={2023}
}