Generating the data

September 22, 2023

We detail the steps to reproduce our multilingual instruction mix and evaluation data. Note that most scripts must first be updated with the path to your copy of the raw data they use.

License

Use of the data has to comply with the licenses of the original datasets used to generate this data.

Translations are produced with NLLB, so use must also comply with the NLLB license.

MSCOCO: CC BY 4.0 for annotations, Flickr Terms of Use for images
BLIP captions (Web CapFilt): BSD 3
LLaVA: CC BY NC 4.0. Use should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use
VQAv2: CC BY 4.0
A-OKVQA: Apache 2.0
ImageNet: Non-Commercial, Babel-ImageNet: BabelNet Non-Commercial License

Training Instruction Mix

You will need to download the respective 'raw' data from the websites of MSCOCO (including images), A-OKVQA, LLaVA, ImageNet and the BLIP captions.

BLIP Web CapFilt

1. In pretrain, run filter.py and download_images.py to sample captions from the full data and download the images.
2. Run generate_train.py to generate an English intermediate file, translate_train.py to generate the translations, and generate_train.py again for the final data file.

Because the sampling is random and cannot be reproduced exactly, we include our result after step 1 here. This file also includes the image URLs, which you can use to download the images. As of June 2023, all links were still available.
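If you start from the provided file rather than re-running the sampling, the images can be fetched with a few lines of Python. The following is a minimal sketch, not the repository's download_images.py: it assumes you have already extracted the image URLs into a plain list of strings, and the function names and the hash-based file naming are hypothetical choices made for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Iterable, List


def target_path(url: str, out_dir: Path) -> Path:
    """Derive a stable local file name from the URL (hypothetical naming scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    suffix = Path(url.split("?")[0]).suffix or ".jpg"
    return out_dir / f"{digest}{suffix}"


def download_images(urls: Iterable[str], out_dir: Path, timeout: float = 10.0) -> List[str]:
    """Download each URL once and return the URLs that could not be fetched."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        dest = target_path(url, out_dir)
        if dest.exists():
            continue  # already downloaded on an earlier run
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                dest.write_bytes(response.read())
        except (OSError, ValueError):
            # OSError covers network and HTTP errors; ValueError covers malformed URLs.
            failed.append(url)
    return failed
```

Returning the failed URLs instead of raising lets a long download continue past dead links, which become more likely as the source pages age.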

Caption Matching

To generate the image-caption matching data, first run hard_match.py to generate the English examples (this takes a while), then translate_match_train.py to generate the translations, and finally generate_match_train.py to produce the final file.

Other tasks

Run generate_train.py once to generate an intermediate file, translate_train.py to generate the translations, and generate_train.py again for the final data files.

The translation step is not needed for A-OKVQA, and the second generate_train.py run is not needed for LLaVA.
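The order of the steps and the two exceptions can be written down as a small driver. This is only a sketch under assumptions: the script names come from this README, but the helper names are hypothetical, the scripts are assumed to take no command-line arguments (paths are edited inside them), and for A-OKVQA we assume both generate_train.py passes are still run once the translation step is dropped.

```python
import subprocess
import sys
from typing import List

# Default order described above: intermediate file, translations, final file.
FULL_STEPS = ["generate_train.py", "translate_train.py", "generate_train.py"]


def steps_for(dataset: str) -> List[str]:
    """Return the ordered script names for one dataset."""
    name = dataset.lower()
    if name == "a-okvqa":
        # No translation step; keeping both generate passes is an assumption.
        return [s for s in FULL_STEPS if s != "translate_train.py"]
    if name == "llava":
        # The second generate_train.py run is not needed.
        return FULL_STEPS[:2]
    return list(FULL_STEPS)


def run_steps(dataset: str, dry_run: bool = True) -> List[List[str]]:
    """Print the commands (dry run) or execute them in order."""
    commands = [[sys.executable, script] for script in steps_for(dataset)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return commands
```

Calling run_steps("LLaVA") with the default dry_run=True only prints the two commands, which is a cheap way to check the order before starting the long-running translation step.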

For ImageNet examples, you need the label file from https://github.com/gregor-ge/Babel-ImageNet.
