# Getting Started with VCoder
December 25, 2023
This document provides a brief introduction to using VCoder LLaVA-1.5. Our code is based on the original LLaVA; please check out their repo for more information.
## Training
### Download LLaVA-1.5 checkpoints
We add our VCoder to a pretrained LLaVA-1.5 model and train on the COST dataset.
#### LLaVA-1.5-7b

```shell
# Download the projector weights and store them inside the outputs folder
git lfs install
mkdir outputs
cd outputs
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
```
#### LLaVA-1.5-13b

```shell
# Download the projector weights and store them inside the outputs folder
git lfs install
mkdir outputs
cd outputs
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
```
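Before training, it is worth checking that the projector weights actually downloaded rather than being left as `git lfs` pointer stubs. A minimal sketch; the folder names simply match the Hugging Face repos cloned above:

```shell
# Verify that the cloned projector folders exist and are non-empty.
# Folder names match the repos cloned above; check only the variant you need.
missing=0
for d in outputs/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 \
         outputs/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5; do
  if [ -d "$d" ] && [ -n "$(ls -A "$d" 2>/dev/null)" ]; then
    echo "ok: $d"
  else
    echo "missing or empty: $d"
    missing=$((missing + 1))
  fi
done
echo "$missing folder(s) need attention"
```

If a folder exists but only contains tiny text pointer files, re-run `git lfs pull` inside it to fetch the actual weights.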
We provide training code for two variants of VCoder. We train all our models on 8 A100s.
### Only Trained for Object Identification and Counting
- Run `bash scripts/vcoder_train.sh` to train either of the following variants on the COST dataset:
  - VCoder LLaVA-1.5-7b: We train the model for 2 epochs. The training time is ~8 hours.
  - VCoder LLaVA-1.5-13b: We train the model for 2 epochs. The training time is ~14 hours.
- Remember to set the model variant in `scripts/vcoder_train.sh` before training.
### Trained for Object Identification, Counting and Depth Order Prediction
**Note:** These are the models we use in our demo.
- Run `bash scripts/vcoder_ds_train.sh` to train either of the following variants on the combination of the COST dataset and General Question Answering (for regularization) datasets:
  - VCoder-DS LLaVA-1.5-7b: We train the model for 1 epoch. The training time is ~17 hours.
  - VCoder-DS LLaVA-1.5-13b: We train the model for 1 epoch. The training time is ~30 hours.
- Remember to set the model variant in `scripts/vcoder_ds_train.sh` before training.
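The variant is selected inside the training script itself. The sketch below is illustrative only: `MODEL_SIZE` is an assumed variable name, not necessarily what `scripts/vcoder_ds_train.sh` actually uses, so open the script and edit whichever variable controls the checkpoint paths.

```shell
# Illustrative only: pick the variant before launching training.
# MODEL_SIZE is an assumed name; edit the real variable inside
# scripts/vcoder_ds_train.sh (or scripts/vcoder_train.sh).
MODEL_SIZE=7b   # or 13b
echo "training VCoder-DS LLaVA-1.5-${MODEL_SIZE}"
# bash scripts/vcoder_ds_train.sh
```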
## Evaluation
We evaluate our models on the COST val dataset, using evaluators we wrote ourselves.
### Object Identification and Counting
We evaluate on the semantic, instance, and panoptic object perception tasks.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/cost.sh
```

Remember to set the model variant in `scripts/v1_5/eval/cost.sh` before evaluating.
### Depth Order Identification for Objects
We evaluate on the depth object perception task.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/cost_depth.sh
```

Remember to set the model variant in `scripts/v1_5/eval/cost_depth.sh` before evaluating.
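The commands above make all 8 GPUs visible; the eval scripts shard work across whatever devices are listed (an assumption carried over from the upstream LLaVA-1.5 eval scripts). To run on fewer GPUs, shorten the device list:

```shell
# Run the evaluation on 4 GPUs instead of 8 by shortening the device list.
# One worker per visible device is an assumption based on the upstream
# LLaVA-1.5 eval scripts; no other change should be needed.
GPUS="0,1,2,3"
NUM_GPUS=$(echo "$GPUS" | awk -F',' '{print NF}')
echo "launching $NUM_GPUS workers"
# CUDA_VISIBLE_DEVICES=$GPUS bash scripts/v1_5/eval/cost_depth.sh
```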
### General Question-Answering
- We follow the same evaluation setting as LLaVA-1.5.
- Download and unzip the eval files from Google Drive to `./playground/data/eval`. This also provides a general structure for all datasets.

```shell
# pip3 install gdown
cd playground/data/eval
gdown https://drive.google.com/uc?id=1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy
unzip eval.zip
```
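The archive unpacks into one folder per benchmark. If you want to sanity-check paths before downloading, the sketch below recreates just the top-level layout assumed by the sections that follow (folder names are taken from the paths in this guide; the real contents must still come from `eval.zip`):

```shell
# Recreate only the top-level benchmark folders referenced below; the
# actual annotation files still have to come from eval.zip.
mkdir -p playground/data/eval/vqav2 \
         playground/data/eval/gqa/data \
         playground/data/eval/vizwiz \
         playground/data/eval/pope \
         playground/data/eval/MME \
         playground/data/eval/mmbench
ls playground/data/eval
```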
#### VQAv2
- Download `test2015` and put it under `./playground/data/eval/vqav2`.
- Multi-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
  ```

- Submit the results to the evaluation server.
#### GQA
- Download the data and evaluation scripts following the official instructions and put them under `./playground/data/eval/gqa/data`.
- Multi-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
  ```
#### VizWiz
- Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz`.
- Single-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
  ```

- Submit the results to the evaluation server.
#### POPE
- Download `coco` from POPE and put it under `./playground/data/eval/pope`.
- Single-GPU inference and evaluation:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
  ```
#### MME
- Download the data following the official instructions.
- Download the images to `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
- Single-GPU inference and evaluation:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
  ```
#### MMBench
- Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench`.
- Single-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
  ```

- Submit the results to the evaluation server.