Evaluation for full-parameter tuning models

July 24, 2024

We recommend evaluating with RGBD input on MME and SpatialBench, since both include depth-related questions.

Please set conv-mode to minicpm, phi3 or llama to match your MODEL_TYPE (phi-3, llama3-8b, qwen1.5-0.5b or qwen1.5-1.8b).

SpatialBench

Download SpatialBench and put it under ./eval/spatial_bench.
The script uses RGBD by default. To test with RGB only, comment out --depth in the script.

sh script/eval/full/spatial_bench.sh

MME & MME-Depth

Refer to the MME GitHub repository to download the benchmark dataset, and put MME_Benchmark_release_version under eval/mme.
Update MODEL_TYPE and TARGET_DIR accordingly.
Run one of the following commands to evaluate with RGB or RGBD:

CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mme.sh # RGB
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mme_depth.sh # RGBD

The responses and scores can be found in eval/mme/answers_upload.

MMBench

Refer to the MMBench GitHub repository to download the benchmark dataset. We support MMBench-Dev, MMBench-Test, MMBench-Dev (cn) and MMBench-Test (cn). Please note that only the files downloaded via the legacy link are supported. Put MMBench_DEV_EN_legacy.tsv, MMBench_TEST_EN_legacy.tsv, MMBench_DEV_CN_legacy.tsv or MMBench_TEST_CN_legacy.tsv under eval/mmbench.
Update SPLIT, LANG (en/cn), MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmbench.sh

The response file can be found in eval/mmbench/answers_upload. You can submit the Excel file to the submission link to obtain the evaluation scores.
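The four supported files share one naming pattern, so the file name for a given split and language can be derived rather than typed by hand. The sketch below is only an illustration: `mmbench_tsv_name` is a hypothetical helper, and its `dev`/`test` and `en`/`cn` arguments are assumptions that may not match the exact values the repository's SPLIT and LANG variables expect.

```shell
# Hypothetical helper (not part of the repository): build the legacy TSV
# file name from a split (dev/test) and a language (en/cn).
mmbench_tsv_name() {
    _split=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    _lang=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    case "$_split" in
        DEV|TEST) ;;
        *) echo "unknown split: $1" >&2; return 1 ;;
    esac
    case "$_lang" in
        EN|CN) ;;
        *) echo "unknown language: $2" >&2; return 1 ;;
    esac
    printf 'MMBench_%s_%s_legacy.tsv\n' "$_split" "$_lang"
}

# Example: name of the Chinese dev-split file expected under eval/mmbench.
mmbench_tsv_name dev cn
```

Rejecting unknown values up front makes a mistyped split or language fail immediately instead of surfacing later as a missing-file error.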

SEED-Bench-1

Refer to the SEED-Bench Instruction to download the images and videos, then put the images under eval/seed-bench/SEED-Bench-image and the videos under eval/seed-bench/SEED-Bench-video. Next, extract the middle frame from each downloaded video by running:
```
pip install av decord
python eval/seed-bench/extract_video_frames.py
```
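Before launching the evaluation, it may be worth confirming that the two download folders named above are in place. The following is a minimal sketch; `report_missing_dirs` is a hypothetical helper, not something the repository provides.

```shell
# Hypothetical helper: print each argument that is not an existing
# directory, and return non-zero if at least one is missing.
report_missing_dirs() {
    _missing=0
    for _dir in "$@"; do
        if [ ! -d "$_dir" ]; then
            echo "missing: $_dir"
            _missing=1
        fi
    done
    return "$_missing"
}

# Usage, from the repository root:
#   report_missing_dirs eval/seed-bench/SEED-Bench-image eval/seed-bench/SEED-Bench-video
```

The function only checks that the folders exist; it does not verify their contents.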
Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/seedbench.sh

The response file can be found in eval/seed-bench/answers_upload and the scores can be found in eval/seed-bench/scores.

VQAv2

Download COCO 2015 Test images and put test2015 under eval/vqav2. Then:

tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz -C eval/vqav2 && rm eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz && tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz -C eval/vqav2 && rm eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz
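The one-liner above repeats the same extract-then-delete step for each archive. As a rough equivalent, the step can be wrapped in a small function; `extract_and_remove` is a hypothetical name introduced here for illustration, and it drops the `-v` flag to keep the output quiet.

```shell
# Hypothetical helper: extract each .tar.gz into a destination directory
# and delete the archive only if its extraction succeeded.
# Arguments: <destination dir> <archive> [<archive> ...]
extract_and_remove() {
    _dest=$1
    shift
    for _archive in "$@"; do
        tar -zxf "$_archive" -C "$_dest" && rm "$_archive" || return 1
    done
}

# Usage with the two archives named above, from the repository root:
#   extract_and_remove eval/vqav2 \
#       eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz \
#       eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz
```

Stopping at the first failure leaves the remaining archives untouched, so the call can be repeated once the problem is fixed.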

Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/vqav2.sh

The response file can be found in eval/vqav2/answers_upload. You can submit the JSON response file to the submission link (Test-Dev Phase) to obtain the evaluation scores.

GQA & GQA-Depth

Download the GQA images, unzip the archive and put images under eval/gqa. Then:

tar -zxvf eval/gqa/testdev_balanced_questions.tar.gz -C eval/gqa && rm eval/gqa/testdev_balanced_questions.tar.gz

Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/gqa.sh # RGB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/gqa_depth.sh # RGBD

POPE

Download COCO 2014 Val images and put val2014 under eval/pope. Then, refer to the POPE GitHub repository to download the benchmark dataset and put the three JSON files under eval/pope/coco.
Update MODEL_TYPE and TARGET_DIR accordingly.

CUDA_VISIBLE_DEVICES=0 sh script/eval/full/pope.sh

We report the F1-score averaged over the three categories (random, popular and adversarial).
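To make the reported number concrete, the sketch below computes F1 from precision and recall using the standard definition, F1 = 2PR / (P + R), and then takes the arithmetic mean across categories. The helper names `f1_score` and `mean_score` are hypothetical, the numbers are placeholders rather than real results, and the repository's own script may compute or round the value differently.

```shell
# Hypothetical helper: F1 = 2PR / (P + R), printed with four decimals.
# Returns 0 when both precision and recall are 0.
f1_score() {
    awk -v p="$1" -v r="$2" \
        'BEGIN { printf "%.4f\n", ((p + r > 0) ? 2 * p * r / (p + r) : 0) }'
}

# Hypothetical helper: arithmetic mean of the scores passed as arguments.
mean_score() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.4f\n", sum / NR }'
}

# Placeholder precision/recall pairs for random, popular and adversarial.
# These are illustrative values only, not measured results.
mean_score "$(f1_score 0.90 0.80)" "$(f1_score 0.85 0.75)" "$(f1_score 0.80 0.70)"
```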