TroL: Traversal of Layers for Large Language and Vision Models [[ArXiv]](https://arxiv.org/abs/2406.12246)
June 19, 2024
📰 News
Thanks to the Huggingface staff, each user can use a free ZeroGPU (NVIDIA A100), but queries are limited, so if inference gets stuck, please wait a few minutes. (The local demo is much faster than this online GPU space; a checkpoint-download sketch follows the list below.)
- TroL-1.8B is now available in 🤗 Huggingface Models. (local demo readme included)
- TroL-3.8B is now available in 🤗 Huggingface Models. (local demo readme included)
- TroL-7B is now available in 🤗 Huggingface Models. (local demo readme included)
- The online TroL demo is now available in 🤗 Huggingface Spaces. (You can choose the model size.)
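For the local demo, the checkpoints can be fetched from Huggingface ahead of time. The snippet below is a minimal sketch using `huggingface_hub`; the repo id `BK-Lee/TroL-1.8B` is an assumption not stated in this README, so replace it with the model card you actually use (1.8B, 3.8B, or 7B).

```python
# Minimal sketch: pre-download a TroL checkpoint for the local demo.
# The repo id below is an assumption (not stated in this README); replace it
# with the actual Huggingface model card for the size you want.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="BK-Lee/TroL-1.8B")  # assumed repo id
print(f"Checkpoint files downloaded to: {local_dir}")
```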
Official PyTorch implementation of the technical part of Traversal of Layers (TroL), which improves performance on numerous vision-language benchmarks while keeping the model size efficient. This code was developed from scratch, so I have been trying to improve its readability and simplicity compared with LLaVA, whose code is structured in a relatively complex way.
💡 Highlighted Images
Figure 1. TroL Layer. New Propagation.
Figure 2. Structure of TroL-Mixer.
Figure 3. Performances across numerous model sizes.
Figure 4. Comparison with Closed-source LLVMs.
Figure 5. Investigating where layer traversing (reusing layers) mostly happens.
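As a rough illustration of what Figures 1 and 2 depict, the sketch below reuses a single transformer block twice and blends the first-pass and second-pass hidden states with a learned per-token gate. This is a simplified, hypothetical PyTorch approximation of the layer-traversing idea, not the official TroL-Mixer implementation; the module structure and the linear gate are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TroLLayerSketch(nn.Module):
    """Hypothetical sketch of layer traversing: one shared block is applied
    twice, and a learned gate mixes the two passes (cf. Figures 1-2). This is
    an illustration, not the official TroL implementation."""

    def __init__(self, block: nn.Module, hidden_size: int):
        super().__init__()
        self.block = block                     # a single shared transformer block
        self.gate = nn.Linear(hidden_size, 1)  # hypothetical TroL-Mixer-style gate

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        first = self.block(hidden_states)      # ordinary forward pass
        second = self.block(first)             # traverse the same block again
        w = torch.sigmoid(self.gate(first))    # per-token mixing ratio in [0, 1]
        return w * second + (1.0 - w) * first  # blend re-traversed and original states

# Usage with a stand-in block on (batch, tokens, hidden) activations:
block = nn.Sequential(nn.Linear(64, 64), nn.GELU())
layer = TroLLayerSketch(block, hidden_size=64)
out = layer(torch.randn(2, 16, 64))
```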
📊 Results
Open-source LLVMs with Standard Model Size
| LLVMs | SQA-IMG | POPE | MME | MMB | MathVista | SEED-IMG | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|---|
| Yi-VL-6B | 71.7 | 82.5 | 1915 | 64.2 | 29.7 | 67.5 | 32.1 | 51.9 |
| LLaVA-NeXT-7B | 70.1 | 86.5 | 1851 | 69.6 | 34.6 | 70.2 | 43.9 | 72.3 |
| MM1-7B | 72.6 | 86.6 | 1858 | 72.3 | 35.9 | 70.9 | 42.1 | - |
| TroL-1.8B | 87.5 | 88.6 | 2038 | 76.1 | 45.4 | 69.0 | 45.1 | 69.7 |
| TroL-3.8B | 90.8 | 86.5 | 1980 | 79.2 | 55.1 | 70.5 | 51.1 | 76.6 |
| TroL-7B | 92.8 | 87.8 | 2308 | 83.5 | 51.8 | 75.3 | 54.7 | 92.8 |
Open-source LLVMs with Large Model Sizes
| LLVMs | AI2D | ChartQA | MME | MMB | MathVista | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|
| InternVL1.5-40B | 79.0 | 68.0 | 2175 | 82.2 | 47.7 | 48.9 | - |
| InternVL1.5-26B | 80.7 | 83.8 | 2188 | 82.2 | 53.5 | 62.8 | - |
| MM1-30B | - | - | 2069 | 75.1 | 39.4 | 48.7 | - |
| MiniGemini-34B | - | - | 2105 | 79.6 | 38.9 | 53.0 | - |
| MiniGemini-HD-34B | - | - | 2141 | 80.6 | 43.3 | 59.3 | - |
| LLaVA-NeXT-34B | 74.9 | 68.7 | 2030 | 79.3 | 46.0 | 57.4 | 88.8 |
| LLaVA-NeXT-8B | 71.6 | 69.5 | 1972 | 72.1 | 37.5 | - | 80.1 |
| LLaVA-NeXT-72B | 77.4 | 77.0 | 2159 | 80.5 | 46.6 | - | 89.2 |
| LLaVA-NeXT-110B | 80.4 | 80.4 | 2201 | 80.5 | 49.0 | - | 90.4 |
| TroL-1.8B | 68.9 | 64.0 | 2038 | 76.1 | 45.4 | 45.1 | 69.7 |
| TroL-3.8B | 73.6 | 73.8 | 1980 | 79.2 | 55.1 | 51.1 | 76.6 |
| TroL-7B | 78.5 | 71.2 | 2308 | 83.5 | 51.8 | 54.7 | 92.8 |
Closed-source LLVMs
| LLVMs | SQA-IMG | AI2D | ChartQA | MME | MMB | MathVista | SEED-IMG | MMStar |
|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Plus | 71.6 | 75.9 | 78.1 | 2183 | 67.0 | 43.3 | 72.7 | 39.7 |
| Gemini-Pro | 80.1 | 73.9 | 74.1 | 1933 | 73.6 | 45.2 | 70.7 | 41.6 |
| GPT-4V | 84.6 | 78.2 | 78.5 | 1927 | 77.0 | 49.9 | 69.1 | 46.1 |
| TroL-1.8B | 87.5 | 68.9 | 64.0 | 2038 | 76.1 | 45.4 | 69.0 | 45.5 |
| TroL-3.8B | 90.8 | 73.6 | 73.8 | 1980 | 79.2 | 55.1 | 70.5 | 46.5 |
| TroL-7B | 92.8 | 78.5 | 71.2 | 2308 | 83.5 | 51.8 | 75.3 | 51.3 |
📚 Visual Instruction Tuning Dataset Description for TroL
Total: 2273830 (2.3M)
------------------------------
* Real-World Image: 755k
* Real-World Text: 143k
* Document & Chart & Diagram & Sign & Symbol: 627k
* Math: 747k
- Math with Vision: 180k
- Math with Text only: 566k
------------------------------
- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [without a few samples of OCR-VQA] (664703, 664k)
- ALLAVA4V-Text (143000, 143k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)
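The per-dataset counts above add up exactly to the stated total, which the following quick check confirms (numbers copied verbatim from the list):

```python
# Sanity check: per-dataset sample counts from the list above.
counts = {
    "ShareGPT4V-Caption": 91021,
    "ShareGPT4V-Instruction": 664703,
    "ALLAVA4V-Text": 143000,
    "MiniGemini-Instruction": 27670,
    "DocDownstream": 574268,
    "DocReason": 25877,
    "GLLaVA-Align": 60252,
    "GLLaVA-QA": 117205,
    "MathVision": 3040,
    "MathInstruct": 262040,
    "MathPlus": 304754,
}
assert sum(counts.values()) == 2273830  # matches "Total: 2273830 (2.3M)"
```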
We collect the following nine datasets. For MiniGemini, we selectively use data samples only for DocVQA, ChartQA, DVQA, and AI2D, so you do not need to download all of the MiniGemini data samples.
- ShareGPT4V [link]
- ALLAVA4V-Text [link]
- MiniGemini [link]
- DocDownstream [link]
- DocReason [link]
- GLLaVA [link]
- MathVision [link]
- MathInstruct [link]
- MathPlus [link]
Gathered Dataset Layout
TroL_Dataset_Path
├── llava                     # ShareGPT4V
│   └── llava_pretrain
│       └── images
├── coco                      # ShareGPT4V
│   └── train2017
├── sam                       # ShareGPT4V
│   └── images
├── gqa                       # ShareGPT4V
│   └── images
├── ocr_vqa                   # ShareGPT4V
│   └── images
├── textvqa                   # ShareGPT4V
│   └── train_images
├── vg                        # ShareGPT4V
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa             # ShareGPT4V
│   └── images
├── web-celebrity             # ShareGPT4V
│   └── images
├── web-landmark              # ShareGPT4V
│   └── images
├── wikiart                   # ShareGPT4V
│   └── images
├── docvqa                    # MiniGemini
│   └── images
├── chartqa                   # MiniGemini
│   └── train
│       └── images
├── dvqa                      # MiniGemini
│   └── images
├── ai2d                      # MiniGemini
│   └── images
├── imgs                      # DocDownstream & DocReason
│   ├── ChartQA
│   ├── DUE_Benchmark
│   │   ├── DeepForm
│   │   ├── DocVQA
│   │   ├── InfographicsVQA
│   │   ├── KleisterCharity
│   │   ├── TabFact
│   │   └── WikiTableQuestions
│   ├── TextCaps
│   ├── TextVQA
│   └── VisualMRC
├── geo3k                     # GLLaVA
│   └── train
├── geoqa_plus                # GLLaVA
├── images                    # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json                # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json  # ShareGPT4V-Instruction
├── Evol-Instruct-GPT4-Turbo-143K.json                          # ALLAVA4V-Text
├── train.jsonl                                                 # DocDownstream
├── detailed_explanation.jsonl                                  # DocReason
├── minigemini_instruction.json                                 # MiniGemini-Instruction
├── gllava_align.parquet                                        # GLLaVA-Align
├── gllava_qa.parquet                                           # GLLaVA-QA
├── mathvision.parquet                                          # MathVision
├── MathInstruct.json                                           # MathInstruct
└── mathplus.parquet                                            # MathPlus
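Once gathered, the annotation files at the root of the layout can be loaded with standard tooling. The sketch below is illustrative only; the file names are taken verbatim from the layout above, and `pandas` is assumed for the parquet files.

```python
import json
import pandas as pd

root = "TroL_Dataset_Path"

# MathInstruct annotations (plain JSON)
with open(f"{root}/MathInstruct.json") as f:
    math_instruct = json.load(f)

# DocDownstream annotations (JSONL: one sample per line)
with open(f"{root}/train.jsonl") as f:
    doc_downstream = [json.loads(line) for line in f]

# GLLaVA-QA annotations (parquet)
gllava_qa = pd.read_parquet(f"{root}/gllava_qa.parquet")

print(len(math_instruct), len(doc_downstream), len(gllava_qa))
```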
🏆 Evaluation Benchmarks
These are the evaluation datasets. Once you have downloaded them all, they should be placed in the folder according to the directory layout below.
- Q-Bench [link]
- SQA-IMG [link]
- AI2D [link]
- ChartQA [link]
- SEED [link]
- POPE [link]
- HallusionBench [link]
- MME [link]
- MathVista [link]
- MMB [link]
- MM-Vet [link]
- LLaVA-W [link]
- MMStar [link]
- MathVerse [link]
- VisualWebBench [link]
Evaluation Dataset Directory Layout
Evaluation_Dataset_Path
├── LLVisionQA-QBench              # Q-Bench
├── ScienceQA                      # SQA-IMG
├── ai2d                           # AI2D
├── chartqa                        # ChartQA
├── SEED-Bench                     # SEED-IMG
├── POPE                           # POPE
├── HallusionBench                 # HallusionBench
├── MME_Benchmark_release_version  # MME
├── MathVista                      # MathVista
├── MMBench                        # MMB
├── mm-vet                         # MM-Vet
├── llava-bench-in-the-wild        # LLaVA Bench in the Wild
├── MMStar                         # MMStar
├── MathVerse                      # MathVerse
└── VisualWebBench                 # VisualWebBench
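Before running evaluation, it can help to verify the layout is complete. Below is a small illustrative check; the folder names are copied from the layout above.

```python
from pathlib import Path

# Verify that every evaluation benchmark folder from the layout is present.
root = Path("Evaluation_Dataset_Path")
expected = [
    "LLVisionQA-QBench", "ScienceQA", "ai2d", "chartqa", "SEED-Bench",
    "POPE", "HallusionBench", "MME_Benchmark_release_version", "MathVista",
    "MMBench", "mm-vet", "llava-bench-in-the-wild", "MMStar", "MathVerse",
    "VisualWebBench",
]
missing = [name for name in expected if not (root / name).is_dir()]
print("missing:", missing or "none")
```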