The Evaluation Suite of Xiaomi MiMo-VL
August 8, 2025
To promote rigorous, reproducible, and thinking-oriented evaluation of Vision-Language Models (VLMs), we open-source our evaluation suite for MiMo-VL and beyond.
Built on top of the excellent lmms-eval framework, we introduce several improvements in model integration, evaluation protocol, and task coverage to better support the next generation of reasoning-capable VLMs.
News
[25/08/08] We updated our evaluation framework alongside the release of MiMo-VL-7B-SFT-2508 and MiMo-VL-7B-RL-2508. New features include:
- Additional GUI action benchmarks, AndroidControl and CAGUI (evaluated with `--model mimo_agent`)
- Additional benchmarks for video spatial reasoning (VSI-Bench), physics reasoning (PhysReason), multi-modal long-context understanding (MMLongBench), and multi-modal instruction following (MM-IFEval)
- Support for no_think evaluation via the model argument `disable_thinking_user=True`
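As a rough illustration, a no_think run might be launched as below. Only `disable_thinking_user=True` comes from the release notes; the model key, checkpoint id, and task name are placeholders and may differ in the actual suite:

```shell
# Illustrative invocation only: model key, checkpoint id, and task
# name are placeholders; disable_thinking_user=True is the documented
# switch for no_think evaluation.
python -m lmms_eval \
    --model mimo_vl \
    --model_args pretrained=XiaomiMiMo/MiMo-VL-7B-RL-2508,disable_thinking_user=True \
    --tasks mmmu_val \
    --batch_size 1
```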
Key Features
1. MiVLLM: A vLLM-based Model Wrapper for MiMo-VL
We introduce MiVLLM, a new model class tailored to MiMo-VL and based on the original VLLM class in lmms-eval. Compared to the original implementation, it:
- Greatly improves data loading efficiency
- Enables fine-grained control over image and video preprocessing
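One kind of fine-grained preprocessing control is capping the pixel budget of input images before they reach the vision encoder, rounding each side to the patch size. The sketch below illustrates the idea only; the function name, defaults, and patch size are assumptions, not the suite's actual implementation:

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=1280 * 28 * 28, multiple=28):
    """Scale (width, height) down so that width * height <= max_pixels,
    rounding each side down to a multiple of the vision patch size.

    Illustrative sketch only: parameter names, the default budget, and
    the patch size of 28 are assumptions, not MiVLLM's actual values.
    """
    # Uniform scale factor that brings the area under the budget.
    scale = math.sqrt(max_pixels / (width * height))
    scale = min(scale, 1.0)  # never upscale small images
    # Snap each side down to the patch grid, but keep at least one patch.
    new_w = max(multiple, int(width * scale // multiple) * multiple)
    new_h = max(multiple, int(height * scale // multiple) * multiple)
    return new_w, new_h
```

Exposing a knob like `max_pixels` per task lets high-resolution benchmarks (e.g. document VQA) keep detail while bounding memory for video frames.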
2. Adaptation to Thinking VLMs
The original lmms-eval tasks were designed for non-thinking VLMs: they prompt directly for short answers and compare outputs without post-processing. We redesign this process to support reasoning-intensive models:
- Introduce a unified `\boxed{}` output format using the prompt: `Put your final answer in \boxed{}.`
- Extend `max_new_tokens` to 32768 to allow the model to reason before answering
- Automatically extract predictions from the final `\boxed{}` output
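Extracting the final `\boxed{}` answer needs to cope with nested braces (e.g. `\boxed{\frac{1}{2}}`) and with multiple boxes in a long reasoning trace. A minimal sketch of such an extractor, assuming the last box holds the final answer (the function name is ours, not the suite's):

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`,
    balancing nested braces; None if no complete box is found.

    Illustrative sketch; not the suite's actual post-processor.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)  # take the last box as the final answer
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:  # matching close of \boxed{ reached
                return "".join(out)
        out.append(c)
        i += 1
    return None  # unbalanced braces: no complete box
```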
3. Refined Open-ended Evaluation Metrics
For open-ended tasks such as DocVQA, InfoVQA, ChartQA, and OCRBench, we calculate accuracy using GPT-4o as the evaluator. This improves the fidelity of evaluation for free-form answers and better reflects model capabilities.
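An LLM-as-judge metric of this kind has two deterministic pieces around the GPT-4o call: assembling the grading prompt and mapping the judge's reply to a score. A hedged sketch of those two pieces, with wording that is illustrative rather than the suite's exact template:

```python
def build_judge_prompt(question, reference, prediction):
    """Assemble the grading prompt sent to the judge model.

    The wording here is illustrative only, not the suite's template.
    """
    return (
        "You are grading an answer to a visual question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )


def parse_judge_verdict(reply):
    """Map the judge's free-text reply to a 0/1 accuracy score."""
    return 1.0 if reply.strip().lower().startswith("correct") else 0.0
```

Per-example scores are then averaged into task accuracy; unlike exact string match, the judge can credit paraphrases such as "twelve" for a reference answer of "12".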
4. 20+ New Tasks for Comprehensive Evaluation
We contribute over 20 new evaluation tasks covering:
- General vision-language understanding
- Math and logic reasoning
- GUI understanding and grounding
- Video understanding and reasoning
A complete list of supported tasks is available here.
Usage
Installation
git clone https://github.com/XiaomiMiMo/lmms-eval
cd lmms-eval
pip install -e . && pip uninstall -y opencv-python-headless
pip install -r requirements.txt
Evaluation Script
bash mimovl_docs/eval_mimo_vl.sh
Reproduction of MiMo-VL-7B-SFT results in our technical report can be found here.
Citations
@misc{coreteam2025mimovl,
title={MiMo-VL Technical Report},
author={{Xiaomi LLM-Core Team}},
year={2025},
url={https://github.com/XiaomiMiMo/MiMo-VL},
}
@misc{mimovleval2025,
title={The Evaluation Suite of Xiaomi MiMo-VL},
author={LLM-Core Xiaomi},
year={2025},
url={https://github.com/XiaomiMiMo/lmms-eval}
}