May 31, 2025
VidText: Towards Comprehensive Evaluation for Video Text Understanding
This is the official repo of “VidText: Towards Comprehensive Evaluation for Video-Text Understanding” (arXiv 2505.22810).
:bell: News:
License
Our dataset is under the CC-BY-NC-SA-4.0 license.
:warning: If you need to access and use our dataset, you must understand and agree to the following: this dataset is for research purposes only and cannot be used for commercial or any other purposes. The user bears full responsibility for any consequences arising from other uses or from dissemination.
We do not own the copyright of any raw video files. Currently, we provide video access to researchers under the condition of acknowledging the above license. For the video data used, we respect and acknowledge any copyrights of the video authors. Therefore, for the movies, TV series, documentaries, and cartoons used in the dataset, we have reduced the resolution, clipped the length, adjusted dimensions, etc. of the original videos to minimize the impact on the rights of the original works.
If the original authors of the related works still believe that the videos should be removed, please contact the authors or raise an issue directly.
Introduction
We introduce VidText, a comprehensive benchmark designed explicitly for systematic evaluation of multimodal large language models (MLLMs) in video text understanding. VidText encompasses a diverse set of videos of varying lengths across 27 fine-grained genres, covering multiple languages and scenarios, and includes 8 designed tasks spanning both perception and reasoning dimensions. These tasks challenge MLLMs to leverage textual cues appearing dynamically within videos at various granularities—from holistic video-level understanding to instance-level grounding.
Our extensive evaluation of 18 state-of-the-art multimodal LLMs, including proprietary models such as Gemini 1.5 Pro and GPT-4o and prominent open-source models such as VideoLLaMA3, Qwen2.5-VL, and InternVL2.5, reveals substantial difficulties in effectively utilizing textual information in video contexts. Even the highest-performing model achieves an average accuracy of only 45.3% across tasks, underscoring significant limitations in current MLLMs' OCR integration, temporal grounding, and multi-step reasoning capabilities.
We anticipate that VidText will serve as a crucial catalyst for future research, driving improvements in multilingual text spotting, video-level reasoning, and the integration of visual and textual modalities, thereby substantially advancing the community's capabilities in comprehensive video-text understanding.

Leaderboard on VidText
| Method | Size | Avg. | HolisticOCR | HolisticReasoning | LocalOCR | LocalReasoning | TextLocalization | TemporalCausalReasoning | TextTracking | SpatialReasoning |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | -- | 45.3 | 34.8 | 43.6 | 50.2 | 50.1 | 48.7 | 47.0 | 40.3 | 47.9 |
| GPT-4o | -- | 40.2 | 29.5 | 38.9 | 46.0 | 43.3 | 45.5 | 42.5 | 36.2 | 39.8 |
| VideoLLaMA3 | 7B | 39.9 | 23.5 | 31.5 | 39.2 | 41.2 | 47.3 | 55.6 | 31.1 | 50.0 |
| InternVL2.5 | 78B | 39.8 | 40.2 | 37.4 | 29.0 | 50.4 | 30.5 | 48.5 | 29.9 | 52.3 |
| Qwen2.5-VL (72B) | 72B | 38.5 | 40.1 | 49.3 | 35.9 | 28.2 | 28.7 | 52.5 | 31.1 | 42.1 |
| LLaVA-OV | 72B | 36.1 | 20.1 | 28.1 | 41.3 | 49.4 | 9.9 | 54.6 | 31.8 | 53.4 |
| Oryx-1.5 | 32B | 35.4 | 35.3 | 33.9 | 30.8 | 48.5 | 26.7 | 45.2 | 26.0 | 36.4 |
| Gemini 1.5 Flash | -- | 34.7 | 26.3 | 34.0 | 40.2 | 42.4 | 28.9 | 40.0 | 30.7 | 35.4 |
| Qwen2.5-VL (7B) | 7B | 31.9 | 35.9 | 36.0 | 37.0 | 26.5 | 26.5 | 35.4 | 22.4 | 35.2 |
| Qwen2-VL (7B) | 7B | 30.3 | 27.0 | 34.0 | 37.5 | 23.7 | 11.2 | 42.4 | 24.6 | 42.1 |
| GPT-4-Turbo | -- | 29.7 | 22.9 | 28.7 | 36.7 | 36.5 | 15.8 | 39.4 | 24.3 | 33.6 |
| VideoChatFlash | 7B | 29.2 | 13.6 | 13.3 | 1.0 | 50.1 | 45.1 | 42.4 | 23.3 | 44.3 |
| MiniCPM-V2.6 | 7B | 26.5 | 29.2 | 21.2 | 11.4 | 42.9 | 13.3 | 30.3 | 20.5 | 43.2 |
| Video-XL-Pro | 3B | 22.5 | 10.9 | 22.9 | 30.4 | 15.6 | 18.7 | 27.9 | 20.9 | 32.9 |
| Qwen2.5-VL (3B) | 3B | 21.1 | 11.4 | 23.2 | 28.5 | 17.8 | 18.7 | 15.4 | 18.3 | 35.3 |
| LongVA | 7B | 19.2 | 4.8 | 5.6 | 3.2 | 46.9 | 4.5 | 28.3 | 29.6 | 30.5 |
| LongVU | 3B | 17.0 | 5.8 | 20.4 | 15.4 | 17.0 | 15.6 | 15.9 | 15.4 | 30.5 |
| ShareGPT4Video | 8B | 16.4 | 2.5 | 2.6 | 0.8 | 43.5 | 0.0 | 27.3 | 28.0 | 26.1 |
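
The Avg. column appears to be the unweighted mean of the eight per-task scores. This is an observation from the numbers in the table, not a definition stated in this README. The minimal Python sketch below checks that assumption for the top three rows; the names `ROWS`, `task_average`, and `check_rows` are illustrative only and are not part of the VidText codebase.

```python
from statistics import mean

# Scores copied from the leaderboard rows above, in column order:
# HolisticOCR, HolisticReasoning, LocalOCR, LocalReasoning,
# TextLocalization, TemporalCausalReasoning, TextTracking, SpatialReasoning.
# Each entry maps a method name to (reported Avg., per-task scores).
ROWS = {
    "Gemini 1.5 Pro": (45.3, [34.8, 43.6, 50.2, 50.1, 48.7, 47.0, 40.3, 47.9]),
    "GPT-4o": (40.2, [29.5, 38.9, 46.0, 43.3, 45.5, 42.5, 36.2, 39.8]),
    "VideoLLaMA3": (39.9, [23.5, 31.5, 39.2, 41.2, 47.3, 55.6, 31.1, 50.0]),
}


def task_average(scores):
    """Unweighted mean of the per-task scores, rounded to one decimal place."""
    return round(mean(scores), 1)


def check_rows(rows, tol=0.05):
    """Return, per method, whether the mean of its task scores is within
    `tol` of the reported Avg. value (assumption: Avg. is a plain mean)."""
    return {
        name: abs(mean(scores) - reported) <= tol
        for name, (reported, scores) in rows.items()
    }


if __name__ == "__main__":
    for name, ok in check_rows(ROWS).items():
        reported, scores = ROWS[name]
        print(f"{name}: reported={reported}, recomputed={task_average(scores)}, match={ok}")
```

The same check can be extended to the remaining rows by adding them to `ROWS`. An unweighted mean gives every task equal influence on the ranking, regardless of how many samples each task contains.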
VidText Benchmark
Before you access our dataset, we kindly ask you to thoroughly read and understand the license outlined above. If you cannot agree to these terms, we request that you refrain from downloading our video data.
The annotation file is readily accessible here. For the raw videos, you can access them via this 🤗 HF Link.
VidText includes 8 tasks spanning holistic and local OCR, text grounding, and multimodal reasoning, to comprehensively evaluate video-level, clip-level, and instance-level video text understanding. Examples of the tasks are displayed below.

Evaluation
Please refer to our evaluation folder for more details.
Citation
If you find this repository useful, please consider giving it a star :star: and citing our paper:
@article{VidText,
title={VidText: Towards Comprehensive Evaluation for Video-Text Understanding},
author={Yang, Zhoufaran and Shu, Yan and Yang, Zhifei and Zhang, Yan and Li, Yu and Lu, Keyang and Zeng, Gangyan and Liu, Shaohui and Zhou, Yu and Sebe, Nicu},
journal={arXiv preprint arXiv:2505.22810},
year={2025}
}