MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
April 10, 2025
[2025/04/07] 🔥 We are proud to open-source MME-Unify, a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes:
- A Standardized Traditional Task Evaluation: We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies.
- A Unified Task Assessment: We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning.
- A Comprehensive Model Benchmarking: We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini-2-Flash-exp, alongside specialized understanding models (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3/2).
Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively.
Dataset Examples
Evaluation Pipeline
Prompt
The common prompts used in our evaluation for the different tasks can be found in:
MME-Unify/Prompt.txt
Dataset
You can download the images from our Hugging Face repository. The final structure should look like this:
MME-Unify
├── CommonSense_Questions
├── Conditional_Image_to_Video_Generation
├── Fine-Grained_Image_Reconstruction
├── Math_Reasoning
├── Multiple_Images_and_Text_Interlaced
├── Single_Image_Perception_and_Understanding
├── Spot_Diff
├── Text-Image_Editing
├── Text-Image_Generation
├── Text-to-Video_Generation
├── Video_Perception_and_Understanding
└── Visual_CoT
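After downloading, it can be handy to confirm that every task folder is in place before running anything. The following is a minimal sketch, not part of the MME-Unify repository: the function name `missing_task_dirs` is hypothetical, and the root path `MME-Unify` is assumed to be relative to the current working directory. The folder names are copied from the listing above.

```python
from pathlib import Path

# Folder names copied from the directory listing in this README.
EXPECTED_TASK_DIRS = [
    "CommonSense_Questions",
    "Conditional_Image_to_Video_Generation",
    "Fine-Grained_Image_Reconstruction",
    "Math_Reasoning",
    "Multiple_Images_and_Text_Interlaced",
    "Single_Image_Perception_and_Understanding",
    "Spot_Diff",
    "Text-Image_Editing",
    "Text-Image_Generation",
    "Text-to-Video_Generation",
    "Video_Perception_and_Understanding",
    "Visual_CoT",
]


def missing_task_dirs(root):
    """Return the expected task folders that are absent under `root`.

    Hypothetical helper for illustration; it only checks that the
    directories exist, not that their contents are complete.
    """
    root = Path(root)
    return [name for name in EXPECTED_TASK_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumed location of the downloaded data; adjust to your setup.
    missing = missing_task_dirs("MME-Unify")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected task folders are present.")
```

An empty result means all twelve folders were found; any names returned point to an incomplete download or a different root path.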
You can find the QA pairs in:
MME-Unify/Unify_Dataset
and the structure should look like this:
Unify_Dataset
├── Understanding
├── Generation
└── Unify_Capability
    ├── Auxiliary_Lines
    ├── Common_Sense_Question
    ├── Image_Editing_and_Explaning
    ├── SpotDiff
    └── Visual_CoT
Evaluate
To extract answers and calculate scores, we collect the model responses in a JSON file. We provide an example template, output_test_template.json. Once you have prepared your model responses in this format, please refer to the evaluation scripts in:
MME-Unify/evaluate
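To illustrate the general flow (save responses to JSON, extract an answer, compute a score), here is a rough sketch. It is not the repository's evaluation code: the record fields (`id`, `answer`, `response`), the file name `my_model_output.json`, and the helpers `extract_choice` and `accuracy` are all hypothetical, and it assumes multiple-choice questions with option letters A to D. The actual field names should follow output_test_template.json, and the scripts in MME-Unify/evaluate remain the reference for scoring.

```python
import json
import re


def extract_choice(response):
    """Return the first standalone option letter (A-D) in a response, or None.

    Deliberately simple: a capital "A" used as an article would also match,
    so a real extractor needs stricter rules.
    """
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else None


def accuracy(records):
    """Fraction of records whose extracted choice equals the reference answer."""
    if not records:
        return 0.0
    correct = sum(extract_choice(r["response"]) == r["answer"] for r in records)
    return correct / len(records)


# Hypothetical records; real field names should follow output_test_template.json.
records = [
    {"id": "example-1", "answer": "B", "response": "The answer is B."},
    {"id": "example-2", "answer": "C", "response": "(D) because ..."},
]

# Save the model responses, then reload them the way a scoring script might.
with open("my_model_output.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

with open("my_model_output.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(f"accuracy = {accuracy(loaded):.2f}")
```

Keeping the raw response in the file, rather than only the extracted letter, makes it possible to re-run extraction later with different rules.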
Dataset License
License:
MME-Unify is only used for academic research. Commercial use in any form is prohibited.
The copyright of all images belongs to the image owners.
If there is any infringement in MME-Unify, please email yifanzhang.cs@gmail.com and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify MME-Unify in whole or in part.
You must strictly comply with the above restrictions.
Please send an email to yifanzhang.cs@gmail.com.
Citation
If you find this work useful for your research and applications, please cite the related paper using this BibTeX:
@article{xie2025mme,
title={MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models},
author={Xie, Wulin and Zhang, Yi-Fan and Fu, Chaoyou and Shi, Yang and Nie, Bingyan and Chen, Hongkai and Zhang, Zhang and Wang, Liang and Tan, Tieniu},
journal={arXiv preprint arXiv:2504.03641},
year={2025}
}
Related Works
Explore our related research:
- [SliME] Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
- [VITA] VITA: Towards Open-Source Interactive Omni Multimodal LLM
- [Long-VITA] Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [Video-MME] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- [MME-RealWorld] Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- [MM-RLHF] MM-RLHF: The Next Step Forward in Multimodal LLM Alignment