Data.md

October 12, 2024

Data

Data file name	Size
open-llava-next_instruct_mix1M.json	1.64 GB
vqa_collection.zip	30.20 GB

We have made every effort to align our training data with that of LLaVA-NeXT. However, we could not access the tens of thousands of real user interaction samples that LLaVA-NeXT collected, so we substituted 200K samples from ALLaVA-Instruct-VFLAN-4V. Additionally, because TextVQA is included in the training data of most existing LMMs, we retain it to enable fair comparisons with them.

Dataset

The dataset builds on sharegpt4v_mix665k and adds ALLaVA-Instruct-VFLAN-4V, DocVQA, SynthDoG-EN, ChartQA, DVQA, AI2D, and GeoQA+, for a total of 1M image-text pairs.
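To see how the 1M pairs are distributed across sources, the annotation file can be grouped by the top-level directory of each image path. The following is a minimal sketch, assuming the JSON is a list of LLaVA-style records whose optional `image` field holds a relative path such as `coco/train2017/...`; that record layout and the helper names `load_records` and `count_by_source` are assumptions, not something this document specifies.

```python
import json
from collections import Counter
from pathlib import PurePosixPath


def load_records(json_path):
    """Load the annotation file (assumed to be a JSON list of dicts)."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_by_source(records):
    """Count records per top-level image directory.

    Records without an "image" field (assumed to be text-only samples)
    are counted under "<no image>".
    """
    counts = Counter()
    for record in records:
        image = record.get("image")
        if not image:
            counts["<no image>"] += 1
            continue
        # "coco/train2017/x.jpg" -> "coco"
        counts[PurePosixPath(image).parts[0]] += 1
    return counts
```

For example, `count_by_source(load_records("data/open-llava-next/open-llava-next_instruct_mix1M.json")).most_common()` would list the sources from largest to smallest, provided the file follows the assumed layout.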

Prepare Images

First, download all images we used.

LAION-CC-SBU-558K: images.zip
COCO: train2017
WebData: images. For academic use only.
SAM: images
GQA: images
OCR-VQA: download script. We save all files as .jpg.
TextVQA: trainvalimages
VisualGenome: part1, part2
A collection of several VQA datasets: DocVQA, SynthDoG-EN, ChartQA, DVQA, AI2D, and GeoQA+.
ALLaVA-Instruct-VFLAN-4V: image_191-task_1k
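
Once the archives are downloaded, unpacking them into their target folders can be sketched as follows. This assumes the download is a `.zip` file (some sources may ship other formats). The helper name `extract_archive` and any paths passed to it are placeholders rather than part of this repository.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract a zip archive into dest_dir and return the number of files.

    dest_dir is created if it does not exist. Directory entries inside
    the archive are not counted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sum(1 for info in archive.infolist() if not info.is_dir())
```

A call such as `extract_archive("downloads/train2017.zip", "data/coco")` would unpack into `data/coco`. Whether the archive already contains a top-level `train2017` folder depends on the archive, so check the result against the expected layout.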

Then, organize the data as follows:

Open-LLaVA-NeXT
├── ...
├── data
│   ├── llava
│   │   ├── llava_pretrain
│   │   │   ├── images
│   ├── coco
│   │   ├── train2017
│   ├── sam
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── open-llava-next
│   │   ├── open-llava-next_instruct_mix1M.json
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
│   ├── allava_vflan
│   │   ├── images
│   │   │   ├── images_191task_1k
│   ├── share_textvqa
│   │   ├── images
│   ├── ai2d
│   │   ├── images
│   ├── chatqa
│   │   ├── train
│   │   │   ├── png
│   ├── docvqa
│   │   ├── train
│   │   │   ├── documents
│   ├── dvqa
│   │   ├── images
│   ├── geoqa+
│   │   ├── images
│   ├── synthdog-en
│   │   ├── images
├── ...
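
Before training, it can help to confirm that the layout matches the tree above. The sketch below lists the paths from that tree and reports any that are missing. The function name `find_missing` and the idea of running such a check are illustrative additions, and the path strings are copied verbatim from the tree, including its spelling of each folder.

```python
from pathlib import Path

# Paths relative to the data directory, copied from the tree above.
EXPECTED_PATHS = [
    "llava/llava_pretrain/images",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "open-llava-next/open-llava-next_instruct_mix1M.json",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "allava_vflan/images/images_191task_1k",
    "share_textvqa/images",
    "ai2d/images",
    "chatqa/train/png",
    "docvqa/train/documents",
    "dvqa/images",
    "geoqa+/images",
    "synthdog-en/images",
]


def find_missing(data_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Running `find_missing("data")` from the repository root returns an empty list when every folder and the annotation file are in place. It only checks that the paths exist, not that the image sets are complete.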

Reference

Open-LLaVA-NeXT: https://github.com/xiaoachen98/Open-LLaVA-NeXT/blob/master/docs/Data.md