VLRMBench

June 27, 2025 ยท View on GitHub

This is the official repository for the paper:

"VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models" has been accepted by ICCV 2025.


๐Ÿ“ฆ Dataset

  • Benchmark .jsonl files are located in the benchmark_data/ directory.
  • Images can be downloaded from this link and should be extracted to:
meta_data/Image

Each file in the benchmark_data/ folder corresponds to one specific evaluation task in VLRMBench:

File NameTask NameAbbreviation
step_correctness.jsonlStep CorrectnessSC
redundant_det.jsonlRedundant DetectionRD
most_confidence.jsonlConfidence MisdirectionCM
existence_hallucination.jsonlExistence HallucinationEH
attribute_hallucination.jsonlAttribute HallucinationAH
detail_error.jsonlDetail ErrorDE
location_error.jsonlSpatial RelationshipSR
image_ref_error.jsonlImage ConfusionIRE
multi_solution.jsonlMulti-SolutionMS
foresight.jsonlForecasting FutureFF
error_reason_analysis.jsonlError Reason AnalysisERA
error_correction.jsonlError CorrectionEC

Each .jsonl file contains multiple entries (one per line), each representing a benchmark instance for that task. Fields may vary depending on the task type.

๐Ÿ” Evaluation

Important: Please make sure to update your API keys and dataset paths before running evaluations.

1. Configuration

Modify the model, dataset paths, and API credentials in the following files:

  • model_eval/run_vllm.sh
  • model_eval/run_vllm_api_eval_with_metrices.sh
  • model_eval/vllm_localapi_eval.py
  • model_eval/run_online_api_eval_with_metrices.sh
  • model_eval/online_api_eval.py

2. Local Model Evaluation

Start your VLLM server using:

bash model_eval/run_vllm.sh

Then run:

bash model_eval/run_vllm_api_eval_with_metrices.sh

This will evaluate the local Vision-Language model using the VLRMBench benchmark.

3. Online Model Evaluation

To evaluate remote models via API (e.g., OpenAI, Claude, Gemini), run:

bash model_eval/run_online_api_eval_with_metrices.sh

4. Get Evaluation Results

After running evaluations, calculate metrics using the appropriate scripts:

Binary Classification Tasks (SC, RD, CM, EH, AH, DE, SR, IRE)

python model_eval/get_sc_mc_rd_eval_res.py

Multi-Solution Task (MS)

python model_eval/get_ms_eval_res.py

Forecasting Future Task (FF)

python model_eval/get_fores_eval_res.py

Generation Tasks (ERA, EC)

  1. Run judge evaluation:
python model_eval/get_ec_era_eval_res.py
  1. Calculate win rates:
python model_eval/get_ec_era_eval_res_after_judger.py

Note: Update task_name variable in each script to match your evaluation task. For ERA/EC tasks, configure the judge model API settings in get_ec_era_eval_res.py.


๐Ÿ“œ Citation

If you use this benchmark or codebase in your research, please cite:

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
arXiv: 2503.07478


๐Ÿ“ฌ Contact

For questions or collaborations, feel free to open an issue or contact the authors directly.