Evaluation
December 14, 2023
In ShareGPT4V, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that the inference process stays consistent with the real-time outputs of the chat demo.
Currently, we mostly use the official toolkit or server for evaluation.
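As a minimal sketch of what the greedy-decoding choice means in practice, the helper below maps a temperature and a beam count to keyword arguments in the style of a Hugging Face generate() call (do_sample, num_beams, temperature). The helper names generation_kwargs and is_greedy, and the convention that temperature 0 selects greedy decoding, are assumptions made for illustration; they are not taken from the ShareGPT4V scripts.

```python
def generation_kwargs(temperature: float = 0.0, num_beams: int = 1) -> dict:
    """Build keyword arguments for a Hugging Face style generate() call.

    Hypothetical helper: temperature 0 with a single beam is treated as
    greedy decoding (always pick the most likely next token).
    """
    if num_beams < 1:
        raise ValueError("num_beams must be >= 1")
    kwargs = {"do_sample": temperature > 0, "num_beams": num_beams}
    if temperature > 0:
        # A temperature only matters when sampling is enabled.
        kwargs["temperature"] = temperature
    return kwargs


def is_greedy(kwargs: dict) -> bool:
    """True when the settings disable both sampling and beam search."""
    return not kwargs["do_sample"] and kwargs["num_beams"] == 1


if __name__ == "__main__":
    print(generation_kwargs())
```

With the defaults, no sampling and a single beam are requested, so repeated runs on the same input should produce the same output, which is the property the evaluation relies on.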
Evaluate on Custom Datasets
You can evaluate ShareGPT4V models on your custom datasets by converting your dataset to ShareGPT4V's jsonl format and evaluating it with model_vqa.py.
Below we provide a general guideline for evaluating datasets with some common formats.
- Short-answer (e.g. VQAv2, MME).
<question>
Answer the question using a single word or phrase.
- Option-only for multiple-choice (e.g. MMBench, SEED-Bench).
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
- Natural QA (e.g. LLaVA-Bench, MM-Vet).
No postprocessing is needed.
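The prompt formats above can be sketched as small helpers, together with a writer that emits one JSON object per line. The prompt suffixes are copied from the guideline; the function names and the record keys question_id, image, and text are assumptions for illustration, so check them against what model_vqa.py actually reads before relying on them.

```python
import json
import string
from typing import Iterable, List, Tuple

SHORT_ANSWER_SUFFIX = "Answer the question using a single word or phrase."
OPTION_SUFFIX = "Answer with the option's letter from the given choices directly."


def short_answer_prompt(question: str) -> str:
    """Short-answer format (e.g. VQAv2, MME)."""
    return f"{question}\n{SHORT_ANSWER_SUFFIX}"


def multiple_choice_prompt(question: str, options: List[str]) -> str:
    """Option-only format (e.g. MMBench, SEED-Bench), lettered A., B., ..."""
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("too many options to letter with A-Z")
    lines = [question]
    lines += [f"{letter}. {option}"
              for letter, option in zip(string.ascii_uppercase, options)]
    lines.append(OPTION_SUFFIX)
    return "\n".join(lines)


def write_question_file(records: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (image, prompt) pairs as JSON Lines.

    The keys below are assumed, not confirmed by this document.
    """
    with open(path, "w", encoding="utf-8") as f:
        for question_id, (image, prompt) in enumerate(records):
            row = {"question_id": question_id, "image": image, "text": prompt}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

For natural QA datasets the question can be passed through unchanged, since no fixed suffix is required.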
Scripts
Before preparing task-specific data, you MUST first download ShareGPT4V's playgroud.zip. It contains custom annotations, scripts, and the prediction files from ShareGPT4V-7B. Extract it to ./projects/ShareGPT4V/playground. This also provides a general structure for all datasets.
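As a quick sanity check before running any script, the sketch below reports which per-benchmark directories are still missing under ./playground/data/eval. The directory names are the ones used in the benchmark sections of this guide; the function name missing_eval_dirs is hypothetical, and whether a given directory ships with the archive or has to be created by hand is not something this check can tell.

```python
from pathlib import Path
from typing import List

# Relative directories named in the per-benchmark sections of this guide.
EXPECTED_EVAL_DIRS = [
    "vqav2",
    "gqa/data",
    "vizwiz",
    "scienceqa",
    "textvqa",
    "MME",
    "mmbench",
    "seed_bench",
    "llava-bench-in-the-wild",
    "mmvet",
    "qbench",
]


def missing_eval_dirs(eval_root: str = "./playground/data/eval") -> List[str]:
    """Return the expected benchmark directories that do not exist yet."""
    root = Path(eval_root)
    return [name for name in EXPECTED_EVAL_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    for name in missing_eval_dirs():
        print(f"missing: {name}")
```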
VQAv2
- Download test2015 and put it under ./playground/data/eval/vqav2.
- Multi-GPU inference.
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/sharegpt4v/eval/vqav2.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:8 bash scripts/sharegpt4v/eval/vqav2.sh
- Submit the results to the evaluation server: ./playground/data/eval/vqav2/answers_upload.
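The multi-GPU commands in this guide expose several devices through CUDA_VISIBLE_DEVICES. How the shell scripts divide the work is not described here; one common pattern, shown purely as an illustrative sketch, is to split the question list into one contiguous chunk per visible GPU and merge the per-chunk answers afterwards. The function names visible_gpus and split_chunks are hypothetical and are not claimed to match the repository's scripts.

```python
import math
import os
from typing import List


def visible_gpus(default: str = "0") -> List[str]:
    """Parse CUDA_VISIBLE_DEVICES into a list of device ids (as strings)."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", default)
    return [gpu.strip() for gpu in raw.split(",") if gpu.strip()]


def split_chunks(items: list, num_chunks: int) -> List[list]:
    """Split items into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    if not items:
        return []
    size = math.ceil(len(items) / num_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    gpus = visible_gpus()
    questions = list(range(10))  # stand-in for loaded question records
    for gpu, chunk in zip(gpus, split_chunks(questions, len(gpus))):
        print(f"GPU {gpu}: {len(chunk)} questions")
```

Because the chunks are contiguous, concatenating the per-chunk answer files in chunk order restores the original question order.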
GQA
- Download the data and evaluation scripts following the official instructions and put them under ./playground/data/eval/gqa/data. You may need to modify eval.py to work around the missing assets in the GQA v1.2 release.
- Multi-GPU inference.
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/sharegpt4v/eval/gqa.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:8 bash scripts/sharegpt4v/eval/gqa.sh
VizWiz
- Download test.json and extract test.zip to test. Put them under ./playground/data/eval/vizwiz.
- Single-GPU inference.
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/sharegpt4v/eval/vizwiz.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:1 bash scripts/sharegpt4v/eval/vizwiz.sh
- Submit the results to the evaluation server: ./playground/data/eval/vizwiz/answers_upload.
ScienceQA
- Under ./playground/data/eval/scienceqa, download images, pid_splits.json, and problems.json from the data/scienceqa folder of the ScienceQA repo.
- Single-GPU inference and evaluation.
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/sharegpt4v/eval/sqa.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:1 bash scripts/sharegpt4v/eval/sqa.sh
TextVQA
- Download TextVQA_0.5.1_val.json and the images, and extract them to ./playground/data/eval/textvqa.
- Single-GPU inference and evaluation.
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/sharegpt4v/eval/textvqa.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:1 bash scripts/sharegpt4v/eval/textvqa.sh
MME
- Download the data following the official instructions.
- Download the images to MME_Benchmark_release_version.
- Put the official eval_tool and MME_Benchmark_release_version under ./playground/data/eval/MME.
- Single-GPU inference and evaluation.
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/sharegpt4v/eval/mme.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:1 bash scripts/sharegpt4v/eval/mme.sh
MMBench-EN
- Download mmbench_dev_20230712.tsv and put it under ./playground/data/eval/mmbench.
- Multi-GPU inference.
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/sharegpt4v/eval/mmbench_en.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:8 bash scripts/sharegpt4v/eval/mmbench_en.sh
- Submit the results to the evaluation server: ./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712.
MMBench-CN
- Download mmbench_dev_cn_20231003.tsv and put it under ./playground/data/eval/mmbench.
- Multi-GPU inference.
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/sharegpt4v/eval/mmbench_cn.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:8 bash scripts/sharegpt4v/eval/mmbench_cn.sh
- Submit the results to the evaluation server: ./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.
SEED-Bench-Image
- Follow the official instructions to download the images and put them under ./playground/data/eval/seed_bench/SEED-Bench-image.
- Multi-GPU inference and evaluation.
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/sharegpt4v/eval/seed.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:8 bash scripts/sharegpt4v/eval/seed.sh
- Optionally, submit the results to the leaderboard: ./playground/data/eval/seed_bench/answers_upload using the official Jupyter notebook.
LLaVA-Bench-in-the-Wild
- Extract the contents of llava-bench-in-the-wild to ./playground/data/eval/llava-bench-in-the-wild.
- Single-GPU inference and evaluation.
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/sharegpt4v/eval/llavabench.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:1 bash scripts/sharegpt4v/eval/llavabench.sh
MM-Vet
- Extract mm-vet.zip to ./playground/data/eval/mmvet.
- Single-GPU inference.
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/sharegpt4v/eval/mmvet.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:1 bash scripts/sharegpt4v/eval/mmvet.sh
- Evaluate the predictions in ./playground/data/eval/mmvet/results using the official Jupyter notebook.
Q-Bench
- Download llvisionqa_dev.json (for dev-subset) and llvisionqa_test.json (for test-subset). Put them under ./playground/data/eval/qbench.
- Download and extract the images, and put all of them directly under ./playground/data/eval/qbench/images_llviqionqa.
- Single-GPU inference.
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/sharegpt4v/eval/qbench.sh
# for slurm inference
srun -p YOUR_PARTITION --gres gpu:1 bash scripts/sharegpt4v/eval/qbench.sh
For now, we only support dev evaluation on your local machine.