Evaluation for full-parameter tuning models
July 24, 2024
We recommend evaluating with RGBD on MME and SpatialBench, as they include some depth-related questions.
Please change conv-mode to minicpm/phi3/llama for MODEL_TYPE = phi-3/llama3-8b/qwen1.5-0.5b/qwen1.5-1.8b.
SpatialBench
- Download SpatialBench and put it under ./eval/spatial_bench.
- The script uses RGBD by default. To test with RGB, comment out --depth in the script.
sh script/eval/full/spatial_bench.sh
MME & MME-Depth
- Refer to MME GitHub to download the benchmark dataset and put MME_Benchmark_release_version under eval/mme.
- Update MODEL_TYPE and TARGET_DIR accordingly.
- Run one of the following lines, depending on whether you want to evaluate with RGB or RGBD:
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mme.sh # RGB
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mme_depth.sh # RGBD
The responses and scores can be found in eval/mme/answers_upload.
MMBench
- Refer to MMBench GitHub to download the benchmark dataset. We support MMBench-Dev, MMBench-Test, MMBench-Dev (cn) and MMBench-Test (cn). Please note that only the files downloaded via the legacy link are supported. Put MMBench_DEV_EN_legacy.tsv, MMBench_TEST_EN_legacy.tsv, MMBench_DEV_CN_legacy.tsv or MMBench_TEST_CN_legacy.tsv under eval/mmbench.
- Update SPLIT, LANG (en/cn), MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmbench.sh
The response file can be found in eval/mmbench/answers_upload. You can submit the Excel file via the submission link to obtain the evaluation scores.
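As a pre-flight check, the legacy file name listed above can be derived from a split and a language. The following is a minimal sketch, not part of the repository: `legacy_tsv_path` is a hypothetical helper, and the exact values that mmbench.sh expects for SPLIT and LANG are not shown here, so "dev"/"test" and "en"/"cn" are assumptions used only for illustration.

```python
from pathlib import Path


def legacy_tsv_path(split: str, lang: str, root: str = "eval/mmbench") -> Path:
    """Build the expected legacy TSV path (hypothetical helper).

    The file name pattern follows the four names listed in this README:
    MMBench_{DEV|TEST}_{EN|CN}_legacy.tsv
    """
    split = split.lower()
    lang = lang.lower()
    if split not in {"dev", "test"}:
        raise ValueError(f"unsupported split: {split!r}")
    if lang not in {"en", "cn"}:
        raise ValueError(f"unsupported language: {lang!r}")
    return Path(root) / f"MMBench_{split.upper()}_{lang.upper()}_legacy.tsv"


if __name__ == "__main__":
    # Report whether the file for MMBench-Dev (English) is in place.
    path = legacy_tsv_path("dev", "en")
    print(path, "found" if path.is_file() else "missing")
```

Running it from the repository root before launching the evaluation shows whether the benchmark file is where the script will look for it.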
SEED-Bench-1
- Refer to SEED-Bench Instruction to download the images and videos, and put the images under eval/seed-bench/SEED-Bench-image and the videos under eval/seed-bench/SEED-Bench-video. Then, extract the middle frames from the downloaded videos by running:
  pip install av decord
  python eval/seed-bench/extract_video_frames.py
- Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/seedbench.sh
The response file can be found in eval/seed-bench/answers_upload and the scores can be found in eval/seed-bench/scores.
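For readers curious what the middle-frame extraction step above amounts to, here is a rough sketch. It is not the repository's extract_video_frames.py, whose logic is not shown here. The function names are hypothetical, "middle" is assumed to mean index `num_frames // 2`, and Pillow is an extra assumption beyond the `av` and `decord` packages installed above.

```python
def middle_frame_index(num_frames: int) -> int:
    """Return the index of the middle frame (assumed to be num_frames // 2)."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    return num_frames // 2


def save_middle_frame(video_path: str, out_path: str) -> None:
    """Decode one video and save its middle frame as an image.

    Imports are done lazily so that middle_frame_index can be used
    without decord or Pillow installed.
    """
    from decord import VideoReader  # installed via `pip install av decord`
    from PIL import Image  # assumption: Pillow is available

    reader = VideoReader(video_path)
    frame = reader[middle_frame_index(len(reader))].asnumpy()
    Image.fromarray(frame).save(out_path)
```

Use the repository script for actual evaluation; this sketch only illustrates the idea.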
VQAv2
- Download COCO 2015 Test images and put test2015 under eval/vqav2. Then:
  tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz -C eval/vqav2 && rm eval/vqav2/bunny_vqav2_mscoco_test2015.tar.gz && tar -zxvf eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz -C eval/vqav2 && rm eval/vqav2/bunny_vqav2_mscoco_test-dev2015.tar.gz
- Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/vqav2.sh
The response file can be found in eval/vqav2/answers_upload. You can submit the JSON response file via the submission link (Test-Dev Phase) to obtain the evaluation scores.
GQA & GQA-Depth
- Download the images of GQA, unzip them and put images under eval/gqa. Then:
  tar -zxvf eval/gqa/testdev_balanced_questions.tar.gz -C eval/gqa && rm eval/gqa/testdev_balanced_questions.tar.gz
- Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/gqa.sh # RGB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash script/eval/full/gqa_depth.sh # RGBD
POPE
- Download COCO 2014 Val images and put val2014 under eval/pope. Then, refer to POPE GitHub to download the benchmark dataset and put the three json files under eval/pope/coco.
- Update MODEL_TYPE and TARGET_DIR accordingly.
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/pope.sh
We report the averaged F1-score of three categories (random, popular and adversarial).
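To make the reported number concrete, the sketch below computes an F1-score per category and then takes the unweighted mean over the three categories. It is an illustration only, not the repository's scoring code: the function names are hypothetical, "yes" is assumed to be the positive class, and the toy predictions are made up.

```python
def f1_score(predictions, labels, positive="yes"):
    """F1 for binary yes/no answers, treating `positive` as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and l == positive for p, l in pairs)
    fp = sum(p == positive and l != positive for p, l in pairs)
    fn = sum(p != positive and l == positive for p, l in pairs)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def averaged_f1(per_category):
    """Unweighted mean of per-category F1 scores.

    `per_category` maps a category name to a (predictions, labels) pair.
    """
    scores = [f1_score(preds, labels) for preds, labels in per_category.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up toy answers, for illustration only.
    toy = {
        "random": (["yes", "no"], ["yes", "no"]),
        "popular": (["yes", "no", "yes", "no"], ["yes", "yes", "no", "no"]),
        "adversarial": (["no", "no"], ["yes", "no"]),
    }
    print(f"averaged F1: {averaged_f1(toy):.3f}")
```

Averaging per-category scores, rather than pooling all answers, gives each of the three categories equal weight.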