MTVQA
May 15, 2025
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks focus on high-resource languages such as English and Chinese. Despite pioneering works that expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA: it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Furthermore, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages. A comprehensive evaluation of numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset shows that there is still large room for performance improvement, underscoring the value of the dataset. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We hope that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension.
| Project Page | Paper | Dataset | Leaderboard |
News
- 2025.05.16: MTVQA is accepted by ACL 2025!
- 2025.03.25: The Elice team from Korea tests their MLLM Helpy-V Reasoning on MTVQA. Helpy-V Reasoning ranks second among all models, and notably, its performance in Korean text comprehension is far superior to previous SOTA models. Congratulations to the Elice team!
- 2024.12.12: InternVL2.5 tests its performance on MTVQA. The InternVL2.5 78B model outperforms Qwen2-VL 72B and achieves SOTA performance. Congratulations to the InternVL2.5 team!
- 2024.09.29: The BlueLM team from VIVO tests their BlueLM-V-3B on MTVQA. BlueLM-V-3B achieves performance comparable to GPT-4o and ranks third among all the SOTA MLLMs!
- 2024.09.09: We test GPT-4o mini on MTVQA, and it performs exceptionally well among the leading lightweight MLLMs!
- 2024.09.04: InternVL2 tests its performance on MTVQA. The InternVL2 76B model outperforms GPT-4V. Thanks to the InternVL2 team.
- 2024.08.30: Qwen2-VL 72B is released, outperforming GPT-4o and achieving the best performance overall. Congratulations!
- 2024.07.23: MTVQA is now supported in VLMEvalKit.
- 2024.07.23: MTVQA is now supported in OpenCompass.
- 2024.06.04: We are excited to launch MTVQA, the first multilingual visual text comprehension evaluation benchmark for MLLMs! MTVQA includes 9 widely used but low-resource languages, i.e., AR, DE, FR, IT, JA, KO, RU, TH, and VI.
- 2024.06.04: GPT-4o achieves the best performance overall, and MiniCPM-V2.5 achieves the best performance among open-source models!
Data
| RawData (Google Drive) | Hugging Face Dataset |
Evaluation
The test code for evaluating models in the paper can be found in scripts.
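The scoring rule used by the official scripts is not described in this README, so the following is only a minimal sketch of how per-language accuracy could be computed. It assumes a containment-style rule (a prediction counts as correct when the normalized reference answer appears inside the normalized prediction) and a micro-averaged `AVG`. The record fields `lang`, `prediction`, and `answer`, the helper names, and the sample records are all hypothetical, and the actual scripts may differ.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not the official one)."""
    return " ".join(text.lower().split())


def is_correct(prediction: str, answer: str) -> bool:
    """Assumed rule: the normalized answer must occur inside the normalized prediction."""
    ref = normalize(answer)
    return bool(ref) and ref in normalize(prediction)


def score_by_language(records):
    """Return accuracy (in percent) per language plus a micro-averaged 'AVG'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        lang = rec["lang"]
        totals[lang] += 1
        hits[lang] += int(is_correct(rec["prediction"], rec["answer"]))
    scores = {lang: 100.0 * hits[lang] / totals[lang] for lang in totals}
    # Assumption: AVG is computed over all samples, not as a mean of language scores.
    scores["AVG"] = 100.0 * sum(hits.values()) / sum(totals.values())
    return scores


# Hypothetical records, for illustration only (not taken from the dataset).
records = [
    {"lang": "DE", "prediction": "Der Preis ist 12 Euro.", "answer": "12 Euro"},
    {"lang": "DE", "prediction": "Berlin", "answer": "München"},
    {"lang": "FR", "prediction": "sortie", "answer": "Sortie"},
]
scores = score_by_language(records)
print(scores)
```

A containment rule is lenient toward verbose model outputs, which is one reason it is a common choice for text-centric VQA; for official numbers, rely on the provided scripts rather than this sketch.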
If you want to add your results to the MTVQA leaderboard, feel free to email us directly at tangjingqun@bytedance.com, haoliu.0128@bytedance.com or can.huang@bytedance.com.
Leaderboard
| Models | Open-Source | AR | DE | FR | IT | JA | KO | RU | TH | VI | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| InternVL2.5 78B 🥇 |  | 15.9 | 39.0 | 🏆45.6 | 🏆42.9 | 21.1 | 33.9 | 12.2 | 🏆23.8 | 41.5 | 🏆31.9 |
| Helpy-V Reasoning 🥈 |  | 16.6 | 34.7 | 40.0 | 39.6 | 20.2 | 🏆43.5 | 13.6 | 13.0 | 46.6 | 31.6 |
| Qwen2-VL 72B 🥉 |  | 🏆20.7 | 36.5 | 44.1 | 42.8 | 21.6 | 37.4 | 🏆15.6 | 17.7 | 41.6 | 30.9 |
| GPT-4o |  | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
| BlueLM-V-3B |  | 17.3 | 🏆39.5 | 44.7 | 32.2 | 🏆23.5 | 34.0 | 9.2 | 20.3 | 22.9 | 27.0 |
| Claude3 Opus |  | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
| Qwen2-VL 7B |  | 15.5 | 32.1 | 41.6 | 38.9 | 17.8 | 30.6 | 13.0 | 10.8 | 30.0 | 25.6 |
| GPT-4o mini |  | 16.9 | 33.0 | 41.2 | 32.1 | 18.5 | 27.4 | 11.5 | 19.9 | 29.1 | 25.5 |
| Gemini Ultra |  | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 |
| InternVL2 76B |  | 9.5 | 31.3 | 35.7 | 35.2 | 11.1 | 14.3 | 11.9 | 10.0 | 26.9 | 22.0 |
| GPT-4V |  | 11.5 | 31.5 | 40.4 | 32.3 | 11.5 | 16.7 | 10.3 | 15.0 | 28.9 | 22.0 |
| QwenVL Max |  | 7.7 | 31.4 | 37.6 | 30.2 | 18.6 | 25.4 | 10.4 | 4.8 | 23.5 | 21.1 |
| Claude3 Sonnet |  | 10.5 | 28.9 | 35.6 | 31.8 | 13.9 | 22.2 | 11.0 | 15.2 | 20.8 | 21.1 |
| QwenVL Plus |  | 4.8 | 28.8 | 33.7 | 27.1 | 12.8 | 19.9 | 9.4 | 5.6 | 18.1 | 17.8 |
| MiniCPM-V2.5 |  | 6.1 | 29.6 | 35.7 | 26.0 | 12.1 | 13.1 | 5.7 | 12.6 | 15.3 | 17.3 |
| InternVL-V1.5 |  | 3.4 | 27.1 | 31.4 | 27.1 | 9.9 | 9.0 | 4.9 | 8.7 | 12.4 | 14.9 |
| GLM4V |  | 0.3 | 30.0 | 34.1 | 30.1 | 3.4 | 5.7 | 3.0 | 3.5 | 12.3 | 13.6 |
| TextSquare |  | 3.7 | 27.0 | 30.8 | 26.7 | 3.2 | 7.2 | 6.7 | 5.2 | 12.4 | 13.6 |
| Mini-Gemini-HD-34B |  | 2.2 | 25.0 | 29.2 | 25.5 | 6.1 | 8.6 | 4.1 | 4.3 | 11.8 | 13.0 |
| Xcomposer2-4KHD |  | 2.0 | 20.6 | 23.2 | 21.6 | 5.6 | 7.7 | 4.1 | 6.1 | 10.1 | 11.2 |
| Llava-Next-34B |  | 3.3 | 24.0 | 28.0 | 22.3 | 3.6 | 6.1 | 2.6 | 0.4 | 9.8 | 11.1 |
| TextMonkey |  | 2.0 | 18.1 | 19.9 | 22.1 | 4.6 | 7.2 | 3.2 | 0.9 | 11.1 | 9.9 |
| MiniCPM-V2.0 |  | 1.3 | 12.7 | 14.9 | 17.0 | 3.7 | 5.6 | 2.2 | 2.2 | 6.8 | 7.4 |
| mPLUG-DocOwl 1.5 |  | 1.0 | 13.9 | 14.9 | 18.2 | 2.9 | 5.0 | 2.0 | 0.9 | 6.4 | 7.2 |
| YI-VL-34B |  | 1.7 | 13.5 | 15.7 | 12.1 | 4.8 | 5.2 | 0.8 | 3.5 | 4.1 | 6.8 |
| DeepSeek-VL |  | 0.6 | 14.2 | 15.3 | 15.2 | 2.9 | 3.8 | 1.6 | 0.9 | 5.2 | 6.6 |
Citation
If you wish to refer to the baseline results published here, please use the following BibTeX entry:
@misc{tang2024mtvqa,
title={MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering},
author={Jingqun Tang and Qi Liu and Yongjie Ye and Jinghui Lu and Shu Wei and Chunhui Lin and Wanqing Li and Mohamad Fitri Faiz Bin Mahmood and Hao Feng and Zhen Zhao and Yanjie Wang and Yuliang Liu and Hao Liu and Xiang Bai and Can Huang},
year={2024},
eprint={2405.11985},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Licence
Bias, Risks, and Limitations
Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided "as is" and we make no warranty or representation to you with respect to it and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.