VCBench: Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

April 29, 2025

This is the official repo for VCBench, a comprehensive benchmark designed for assessing multimodal mathematical reasoning tasks with explicit visual dependencies.

🔥 Update

[2025-04-10]: 🚀 The paper, code, and data for VCBench are now online. Check out this link for details.

🎯 Overview

Dataset

The VCBench dataset consists of 1,720 question-answer pairs $(Q, A^*)$ and 6,697 images.

The question-answer pairs and corresponding images can be found here.

We assessed 24 state-of-the-art LVLMs across 17 distinct task categories within VCBench, which together evaluate five distinct model competencies: temporal reasoning, geometric reasoning, logical reasoning, spatial reasoning, and pattern recognition.

Although typical human performance on these tasks is near-perfect, the best-performing visual models were unable to exceed 50% accuracy. This underscores how much work remains in integrating visual and mathematical reasoning at the elementary level, and highlights the need for further research on models that can handle multimodal, visually dependent reasoning tasks.

🤖 Automatic Evaluation

For model evaluation, please refer to evaluation.
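
As a rough illustration of how per-category accuracy can be computed, here is a minimal sketch. It is not the repo's evaluation script: the function name, the input layout, and the case-insensitive exact-match rule are all assumptions made for this example, and the official evaluation code remains the reference.

```python
from collections import defaultdict

def accuracy_by_category(predictions, references):
    """Hypothetical helper, not part of the VCBench codebase.

    predictions: {question_id: predicted_answer}
    references:  {question_id: (category, gold_answer)}
    Returns {category: accuracy in [0, 1]}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, (category, gold) in references.items():
        total[category] += 1
        # Assumed scoring rule: exact match after trimming whitespace and
        # ignoring case. A missing prediction counts as wrong.
        pred = predictions.get(qid, "")
        if pred.strip().lower() == gold.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}
```

Questions with no prediction are scored as incorrect here, so that a partial results file cannot inflate a category's accuracy.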

🏆 Leaderboard

The leaderboard for VCBench is updated continuously, and we welcome submissions of your LVLMs!

To have your own LVLM evaluated, please provide us with a JSONL file in which each line includes the question-id and your model's final response. The submit.jsonl file shows the expected submission format. Once your file is ready, please contact us at gasolsun36@gmail.com to submit your results and update the leaderboard.
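
A results file of this kind can be produced with a few lines of Python. The sketch below is only an illustration: the field names `question_id` and `response` and both helper functions are hypothetical, so check submit.jsonl for the exact keys before submitting.

```python
import json

def write_submission(responses, path):
    """Write one JSON object per line (JSONL).

    responses: {question_id: final_response}
    The keys "question_id" and "response" are placeholders; replace them
    with the field names used in submit.jsonl.
    """
    with open(path, "w", encoding="utf-8") as f:
        for question_id, response in responses.items():
            record = {"question_id": question_id, "response": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_submission(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with `read_submission` before sending it is a cheap way to confirm that every line is valid JSON.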

📧 Contact

📝 Citation

If you find our work helpful for your research, please consider giving a star and citation.

@misc{wong2025vcbench,
  author    = {Zhikai Wang and Jiashuo Sun and Wenqi Zhang and Zhiqiang Hu and Xin Li and Fan Wang and Deli Zhao},
  title     = {Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency},
  year      = {2025},
  eprint    = {2504.18589},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url       = {https://arxiv.org/abs/2504.18589}
}