# Getting Started with VCoder
December 25, 2023
This document provides a brief introduction to using VCoder LLaVA-1.5. Our code is based on the original LLaVA; please check out their repo for more information.
## Training
### Download LLaVA-1.5 checkpoints
We add our VCoder to a pretrained LLaVA-1.5 model and train on the COST dataset.
#### LLaVA-1.5-7b

```shell
# Download the projector weights and store them inside the outputs folder
git lfs install
mkdir outputs
cd outputs
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
```
#### LLaVA-1.5-13b

```shell
# Download the projector weights and store them inside the outputs folder
git lfs install
mkdir outputs
cd outputs
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
```
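Before training, it is worth checking that the projector weights actually downloaded rather than being left as `git lfs` pointer stubs. A minimal sketch; the folder names simply match the Hugging Face repos cloned above:

```shell
# Verify that the cloned projector folders exist and are non-empty.
# Folder names match the repos cloned above; check only the variant you need.
missing=0
for d in outputs/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 \
         outputs/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5; do
  if [ -d "$d" ] && [ -n "$(ls -A "$d" 2>/dev/null)" ]; then
    echo "ok: $d"
  else
    echo "missing or empty: $d"
    missing=$((missing + 1))
  fi
done
echo "$missing folder(s) need attention"
```

If a folder exists but only contains tiny text pointer files, re-run `git lfs pull` inside it to fetch the actual weights.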
We provide training code for two variants of VCoder. We train all our models on 8 A100s.
### Only Trained for Object Identification and Counting
- Run `bash scripts/vcoder_train.sh` to train either of the following variants on the COST dataset:
  - VCoder LLaVA-1.5-7b: We train the model for 2 epochs. The training time is ~8 hours.
  - VCoder LLaVA-1.5-13b: We train the model for 2 epochs. The training time is ~14 hours.
- Remember to set the model variant in `scripts/vcoder_train.sh` before training.
### Trained for Object Identification, Counting and Depth Order Prediction
**Note:** These are the models we use in our demo.
- Run `bash scripts/vcoder_ds_train.sh` to train either of the following variants on the combination of the COST dataset and General Question Answering (for regularization) datasets:
  - VCoder-DS LLaVA-1.5-7b: We train the model for 1 epoch. The training time is ~17 hours.
  - VCoder-DS LLaVA-1.5-13b: We train the model for 1 epoch. The training time is ~30 hours.
- Remember to set the model variant in `scripts/vcoder_ds_train.sh` before training.
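The variant is selected inside the training script itself. The sketch below is illustrative only: `MODEL_SIZE` is an assumed variable name, not necessarily what `scripts/vcoder_ds_train.sh` actually uses, so open the script and edit whichever variable controls the checkpoint paths.

```shell
# Illustrative only: pick the variant before launching training.
# MODEL_SIZE is an assumed name; edit the real variable inside
# scripts/vcoder_ds_train.sh (or scripts/vcoder_train.sh).
MODEL_SIZE=7b   # or 13b
echo "training VCoder-DS LLaVA-1.5-${MODEL_SIZE}"
# bash scripts/vcoder_ds_train.sh
```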
## Evaluation
We evaluate our models on the COST val dataset, using evaluators we wrote ourselves.
### Object Identification and Counting
We evaluate on the semantic, instance, and panoptic object perception tasks.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/cost.sh
```

Remember to set the model variant in `scripts/v1_5/eval/cost.sh` before evaluating.
### Depth Order Identification for Objects
We evaluate on the depth object perception task.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/cost_depth.sh
```

Remember to set the model variant in `scripts/v1_5/eval/cost_depth.sh` before evaluating.
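The commands above make all 8 GPUs visible; the eval scripts shard work across whatever devices are listed (an assumption carried over from the upstream LLaVA-1.5 eval scripts). To run on fewer GPUs, shorten the device list:

```shell
# Run the evaluation on 4 GPUs instead of 8 by shortening the device list.
# One worker per visible device is an assumption based on the upstream
# LLaVA-1.5 eval scripts; no other change should be needed.
GPUS="0,1,2,3"
NUM_GPUS=$(echo "$GPUS" | awk -F',' '{print NF}')
echo "launching $NUM_GPUS workers"
# CUDA_VISIBLE_DEVICES=$GPUS bash scripts/v1_5/eval/cost_depth.sh
```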
### General Question-Answering
- We follow the same evaluation setting as LLaVA-1.5.
- Download and unzip the eval files from Google Drive to `./playground/data/eval`. This also provides a general structure for all datasets.

```shell
# pip3 install gdown
cd playground/data/eval
gdown https://drive.google.com/uc?id=1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy
unzip eval.zip
```
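The archive unpacks into one folder per benchmark. If you want to sanity-check paths before downloading, the sketch below recreates just the top-level layout assumed by the sections that follow (folder names are taken from the paths in this guide; the real contents must still come from `eval.zip`):

```shell
# Recreate only the top-level benchmark folders referenced below; the
# actual annotation files still have to come from eval.zip.
mkdir -p playground/data/eval/vqav2 \
         playground/data/eval/gqa/data \
         playground/data/eval/vizwiz \
         playground/data/eval/pope \
         playground/data/eval/MME \
         playground/data/eval/mmbench
ls playground/data/eval
```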
#### VQAv2
- Download `test2015` and put it under `./playground/data/eval/vqav2`.
- Multi-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
  ```

- Submit the results to the evaluation server.
#### GQA
- Download the data and evaluation scripts following the official instructions and put them under `./playground/data/eval/gqa/data`.
- Multi-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
  ```
#### VizWiz
- Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz`.
- Single-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
  ```

- Submit the results to the evaluation server.
#### POPE
- Download `coco` from POPE and put it under `./playground/data/eval/pope`.
- Single-GPU inference and evaluation:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
  ```
#### MME
- Download the data following the official instructions.
- Download the images to `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
- Single-GPU inference and evaluation:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
  ```
#### MMBench
- Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench`.
- Single-GPU inference:

  ```shell
  CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
  ```

- Submit the results to the evaluation server.