Generating the data
September 22, 2023
We detail the steps to reproduce our multilingual instruction mix and evaluation data. Note that most scripts must first be updated with the path to your copy of the raw data they use.
License
Use of the data has to comply with the licenses of the original datasets used to generate this data.
Translations are produced with NLLB, so use must also comply with its license.
- MSCOCO: CC BY 4.0 for annotations, Flickr Terms of Use for images
- BLIP captions (Web CapFilt): BSD 3
- LLaVA: CC BY NC 4.0. Use should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use
- VQAv2: CC BY 4.0
- A-OKVQA: Apache 2.0
- ImageNet: Non-Commercial, Babel-ImageNet: BabelNet Non-Commercial License
Training Instruction Mix
You will need to download the respective 'raw' data from the websites of MSCOCO (including images), A-OKVQA, LLaVA, ImageNet and the BLIP captions.
BLIP Web CapFilt
- In pretrain, run filter.py and download_images.py to sample captions from the full data and download the images.
- Run generate_train.py to generate an English intermediate file, translate_train.py to generate the translations, and generate_train.py again for the final data file.
Because the sampling is random, it cannot be reproduced exactly, so we include our result after the first step (caption sampling) here. This file also includes the image URLs, which you can use to download the images. As of June 2023, all links were still available.
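If you resample the captions yourself, a fixed seed at least makes your own runs repeatable. The following is a minimal sketch of seeded sampling, not the logic of filter.py: the function name `sample_captions`, the record layout, and the seed value are assumptions made for illustration.

```python
import random


def sample_captions(captions, k, seed=42):
    """Draw k captions without replacement, repeatable for a given seed."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(captions, k)


# Hypothetical records; the layout of the real caption file may differ.
captions = [
    {"url": f"https://example.com/{i}.jpg", "caption": f"caption {i}"}
    for i in range(1000)
]
subset = sample_captions(captions, k=10)
```

The same seed only yields the same subset when the input list has the same content and order, so this does not recover the sample shipped with the repository.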
Caption Matching
To generate the image-caption matching data, first run hard_match.py to generate the English examples (this takes a while), then translate_match_train.py to generate the translations, and finally generate_match_train.py to produce the final file.
Other tasks
Run generate_train.py once to generate an intermediate file, translate_train.py to generate the translations, and generate_train.py again for the final data files.
The translation step is not needed for A-OKVQA, and the second generate_train.py run is not needed for LLaVA.
For ImageNet examples, you need the label file from https://github.com/gregor-ge/Babel-ImageNet.
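The generate, translate, generate order described above can be scripted so that a failing step stops the run. This is a sketch under assumptions: it supposes the scripts take no command-line arguments (their paths are edited inside the files) and are run from the folder that contains them. The helper name `run_steps` is hypothetical.

```python
import subprocess
import sys

# Order taken from the text above; adjust per task (for example, drop the
# translation step for A-OKVQA or the second generation step for LLaVA).
STEPS = ["generate_train.py", "translate_train.py", "generate_train.py"]


def run_steps(steps=STEPS, run=subprocess.run):
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in steps:
        # check=True raises CalledProcessError on a non-zero exit code.
        run([sys.executable, script], check=True)
```

Calling `run_steps()` from the task folder would execute the three steps in order. The `run` parameter exists only so the order can be checked without launching the real scripts.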
Combine
Run pretrain/merge_train.py to combine the different files into one task mix file.
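Conceptually, combining amounts to concatenating the per-task example lists into one file. The sketch below illustrates that idea only and is not the content of merge_train.py: it assumes each task file holds a JSON list of examples, and the function name `merge_task_files` is hypothetical.

```python
import json


def merge_task_files(paths, out_path):
    """Concatenate JSON lists of examples into one mix file and return its size."""
    mix = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            # Assumption: every input file is a JSON list of example dicts.
            mix.extend(json.load(f))
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps translated, non-Latin text readable.
        json.dump(mix, f, ensure_ascii=False)
    return len(mix)
```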
Evaluation
Note for captioning: the XM3600 and xFlickrCo scripts also generate files used by the pycocoeval library for evaluation. Those files contain coco in their name.
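For orientation, COCO-style caption reference files usually pair a list of images with a list of caption annotations. The sketch below builds such a structure from a plain mapping. It follows the public COCO caption annotation layout, not necessarily the exact keys the scripts here write, and the function name `to_coco_captions` is hypothetical.

```python
def to_coco_captions(references):
    """Build a COCO-style caption dict from a {image_id: [captions]} mapping."""
    images, annotations = [], []
    for image_id, captions in references.items():
        images.append({"id": image_id})
        for caption in captions:
            annotations.append({
                "image_id": image_id,
                "id": len(annotations),  # running index, unique per annotation
                "caption": caption,
            })
    return {"images": images, "annotations": annotations}


coco_refs = to_coco_captions({1: ["a cat", "a kitten"], 2: ["a dog"]})
```

Some evaluation tool versions expect additional top-level keys, so compare the result with a file produced by the scripts before relying on it.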
IGLUE (xGQA, XVNLI, MaRVL, xFlickrCo)
Download the data from the IGLUE repository along with the images and run the scripts in the folders.
XM3600 and MaXM
Download the raw data and images from the CrossModal3600 and MaXM repositories and run the respective scripts in the folders.
POPE and CHAIR
Clone the POPE repository (https://github.com/AoiDragon/POPE) and then run the respective scripts.