Data.md
October 12, 2024
Data
| Data file name | Size |
|---|---|
| open-llava-next_instruct_mix1M.json | 1.64 GB |
| vqa_collection.zip | 30.20 GB |
We have made every effort to align our training data with that of LLaVA-NeXT. However, we could not access the tens of thousands of real user interaction samples that LLaVA-NeXT collected, so we used 200K samples from ALLaVA-Instruct-VFLAN-4V as a substitute. Additionally, since TextVQA is included in the training data of most existing LMMs, we chose to retain it to enable fair comparisons with other LMMs.
Dataset
The dataset builds on sharegpt4v_mix665k and adds ALLaVA-Instruct-VFLAN-4V, DocVQA, SynthDoG-EN, ChartQA, DVQA, AI2D, and GeoQA+, totaling 1M image-text pairs.
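To check how the mix is distributed across sources, a minimal sketch like the following could count records by the top-level directory of each image path. It assumes the annotation file is a JSON list of records with an optional `image` field holding a relative path (as in LLaVA-style instruction files); that schema is not stated in this document. The function names `count_by_source` and `summarize` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path, PurePosixPath


def count_by_source(records):
    """Count records by the first path component of their "image" field."""
    counts = Counter()
    for record in records:
        image = record.get("image")
        # Records without an image path are grouped under a separate label.
        source = PurePosixPath(image).parts[0] if image else "<text-only>"
        counts[source] += 1
    return counts


def summarize(json_path):
    """Load the annotation file and print per-source counts, largest first."""
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)  # assumed schema: a list of dicts
    for source, n in count_by_source(records).most_common():
        print(f"{source:20s} {n}")


if __name__ == "__main__":
    # Assumed location, following the directory layout in this document.
    path = Path("data/open-llava-next/open-llava-next_instruct_mix1M.json")
    if path.exists():
        summarize(path)
```

Because the file is 1.64 GB, `json.load` reads it fully into memory, so this is best run on a machine with enough RAM to spare.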
Prepare Images
First, download all images we used.
- LAION-CC-SBU-558K: images.zip
- COCO: train2017
- WebData: images. Only for academic usage.
- SAM: images
- GQA: images
- OCR-VQA: download script. We save all files as `.jpg`.
- TextVQA: trainvalimages
- VisualGenome: part1, part2
- A collection of several VQA datasets: DocVQA, SynthDoG-EN, ChartQA, DVQA, AI2D, and GeoQA+.
- ALLaVA-Instruct-VFLAN-4V: image_191-task_1k
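For the OCR-VQA note above, one possible way to end up with only `.jpg` files is to re-encode anything the download script saved under another extension. The sketch below is an assumption about how that could be done, not the authors' procedure: it uses the third-party Pillow library, and the helper names `jpg_target` and `convert_to_jpg` are hypothetical.

```python
from pathlib import Path


def jpg_target(path):
    """Return the .jpg path that a downloaded image would be saved under."""
    return Path(path).with_suffix(".jpg")


def convert_to_jpg(src_dir):
    """Re-encode every non-.jpg file in src_dir as JPEG (requires Pillow)."""
    from PIL import Image  # third-party dependency, imported lazily

    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file() or src.suffix.lower() == ".jpg":
            continue
        with Image.open(src) as im:
            # JPEG has no alpha channel or palette mode, so normalize to RGB.
            im.convert("RGB").save(jpg_target(src), "JPEG")


if __name__ == "__main__":
    # Assumed location, following the directory layout in this document.
    images = Path("data/ocr_vqa/images")
    if images.is_dir():
        convert_to_jpg(images)
```

The originals are left in place, so they can be removed once the converted files have been checked.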
Then, organize the data as follows:
Open-LLaVA-NeXT
├── ...
├── data
│ ├── llava
│ │ ├── llava_pretrain
│ │ │ ├── images
│ ├── coco
│ │ ├── train2017
│ ├── sam
│ │ ├── images
│ ├── gqa
│ │ ├── images
│ ├── ocr_vqa
│ │ ├── images
│ ├── textvqa
│ │ ├── train_images
│ ├── vg
│ │ ├── VG_100K
│ │ ├── VG_100K_2
│ ├── open-llava-next
│ │ ├── open-llava-next_instruct_mix1M.json
│ ├── web-celebrity
│ │ ├── images
│ ├── web-landmark
│ │ ├── images
│ ├── wikiart
│ │ ├── images
│ ├── allava_vflan
│ │ ├── images
│ │ │ ├── images_191task_1k
│ ├── share_textvqa
│ │ ├── images
│ ├── ai2d
│ │ ├── images
│ ├── chatqa
│ │ ├── train
│ │ │ ├── png
│ ├── docvqa
│ │ ├── train
│ │ │ ├── documents
│ ├── dvqa
│ │ ├── images
│ ├── geoqa+
│ │ ├── images
│ ├── synthdog-en
│ │ ├── images
├── ...
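Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming the script is given the path to the `data` directory; the directory names are copied verbatim from the tree, and the names `EXPECTED_DIRS`, `EXPECTED_FILES`, and `missing_paths` are hypothetical.

```python
from pathlib import Path

# Directories copied verbatim from the tree above, relative to `data`.
EXPECTED_DIRS = [
    "llava/llava_pretrain/images",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "allava_vflan/images/images_191task_1k",
    "share_textvqa/images",
    "ai2d/images",
    "chatqa/train/png",
    "docvqa/train/documents",
    "dvqa/images",
    "geoqa+/images",
    "synthdog-en/images",
]
EXPECTED_FILES = ["open-llava-next/open-llava-next_instruct_mix1M.json"]


def missing_paths(data_root):
    """Return the expected directories and files that are absent."""
    root = Path(data_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    for rel in missing_paths("data"):
        print(f"missing: data/{rel}")
```

This only checks that the paths exist; it does not verify that each directory contains the complete set of images.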
Reference
Open-LLaVA-NeXT: https://github.com/xiaoachen98/Open-LLaVA-NeXT/blob/master/docs/Data.md