March 14, 2026

Awesome Unified Multimodal Models

Figure 1: Timeline of Publicly Available and Unavailable Unified Multimodal Models. The models are categorized by their release years, from 2023 to 2025. Models underlined in the diagram represent any-to-any multimodal models, capable of handling inputs or outputs beyond text and image, such as audio, video, and speech. The timeline highlights the rapid growth in this field.

🔥 We are hiring!

We are looking for both interns and full-time researchers to join our team, focusing on multimodal understanding, generation, reasoning, AI agents, and unified multimodal models. If you are interested in exploring these exciting areas, please reach out to us at qingguo.cqg@alibaba-inc.com.

👉 What is This Repo for?

This repository provides a comprehensive collection of resources related to unified multimodal models, featuring:

A survey of advances, challenges, and timelines for unified models
Categorized lists of diffusion-based, autoregressive (MLLM), and hybrid architectures for unified image–text understanding and generation
Benchmarks for evaluating multimodal comprehension, image generation, and interleaved image–text tasks
Representative datasets covering multimodal understanding, text-to-image synthesis, image editing, and interleaved interactions

The collection is designed to help researchers and practitioners explore, compare, and build state-of-the-art unified multimodal systems.

Text-and-Image Unified Models

Figure 2: Classification of Unified Multimodal Understanding and Generation Models. The models are divided into three main categories based on their backbone architecture: Diffusion, MLLM (AR), and MLLM (AR + Diffusion). Each category is further subdivided according to the encoding strategy employed, including Pixel Encoding, Semantic Encoding, Learnable Query Encoding, and Hybrid Encoding. We illustrate the architectural variations within these categories and their corresponding encoder-decoder configurations.
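The classification in Figure 2 can be sketched as a small lookup structure. This is only an illustrative sketch: the names `Backbone`, `Encoding`, `TAXONOMY`, and `section_label` are hypothetical and not part of any released code, and the mapping mirrors the subsections listed in this README rather than the full figure (the Pseudo and Joint variants of hybrid encoding are collapsed into one `HYBRID` member for brevity).

```python
from enum import Enum


class Backbone(Enum):
    """Backbone categories named in the Figure 2 caption."""
    DIFFUSION = "Diffusion"
    MLLM_AR = "MLLM (AR)"
    MLLM_AR_DIFFUSION = "MLLM (AR + Diffusion)"


class Encoding(Enum):
    """Encoding strategies named in the Figure 2 caption."""
    PIXEL = "Pixel Encoding"
    SEMANTIC = "Semantic Encoding"
    LEARNABLE_QUERY = "Learnable Query Encoding"
    HYBRID = "Hybrid Encoding"


# Assumption: this mapping follows the subsections that appear in this README.
# The Diffusion list has no encoding subsections here, hence the empty tuple.
TAXONOMY = {
    Backbone.DIFFUSION: (),
    Backbone.MLLM_AR: (
        Encoding.PIXEL,
        Encoding.SEMANTIC,
        Encoding.LEARNABLE_QUERY,
        Encoding.HYBRID,
    ),
    Backbone.MLLM_AR_DIFFUSION: (Encoding.PIXEL, Encoding.HYBRID),
}


def section_label(backbone, encoding=None):
    """Return a readable label, rejecting pairs this README does not list."""
    if encoding is None:
        return backbone.value
    if encoding not in TAXONOMY[backbone]:
        raise ValueError(f"{encoding.value} is not listed under {backbone.value}")
    return f"{backbone.value} / {encoding.value}"


if __name__ == "__main__":
    print(section_label(Backbone.MLLM_AR, Encoding.SEMANTIC))
    print(section_label(Backbone.DIFFUSION))
```

Keeping the backbone and the encoding as separate enums makes it easy to tag a model with both axes independently, which is how the tables that follow are organized.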

Diffusion

Name	Title	Venue	Date	Code	Demo
UniModel	UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation	arXiv	2025/11/21	-	-
Lavida-O	Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation	arXiv	2025/09/24	Github	-
Muddit	Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model	arXiv	2025/05/23	Github	-
FUDOKI	FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities	arXiv	2025/05/20	-	-
MMaDA	MMaDA: Multimodal Large Diffusion Language Models	arXiv	2025/05/21	Github	Demo
UniDisc	Unified Multimodal Discrete Diffusion	arXiv	2025/03/20	Github	-
Dual Diffusion	Dual Diffusion for Unified Image Generation and Understanding	arXiv	2024/12/31	Github	-
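For readers who want to filter or sort these lists programmatically, the rows can be treated as simple records. The sketch below assumes the plain-text layout used here (tab-separated cells in the order Name, Title, Venue, Date, Code, Demo, with `-` meaning "not available"); `ModelEntry` and `parse_row` are hypothetical helper names, and the three sample rows are copied from the Diffusion table above.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class ModelEntry:
    name: str
    title: str
    venue: str
    released: date
    has_code: bool
    has_demo: bool


def parse_row(line: str) -> ModelEntry:
    """Parse one tab-separated table row (assumed column order, see lead-in)."""
    cells = [c.strip() for c in line.rstrip("\n").split("\t")]
    cells += ["-"] * (6 - len(cells))  # pad rows that omit trailing cells
    name, title, venue, day, code, demo = cells[:6]
    return ModelEntry(
        name=name,
        title=title,
        venue=venue,
        released=datetime.strptime(day, "%Y/%m/%d").date(),
        has_code=code not in ("", "-"),
        has_demo=demo not in ("", "-"),
    )


# Sample rows copied from the Diffusion table; tabs are written explicitly.
ROWS = [
    "\t".join(["FUDOKI", "FUDOKI: Discrete Flow-based Unified Understanding and "
               "Generation via Kinetic-Optimal Velocities", "arXiv", "2025/05/20", "-", "-"]),
    "\t".join(["MMaDA", "MMaDA: Multimodal Large Diffusion Language Models",
               "arXiv", "2025/05/21", "Github", "Demo"]),
    "\t".join(["UniDisc", "Unified Multimodal Discrete Diffusion",
               "arXiv", "2025/03/20", "Github", "-"]),
]

entries = [parse_row(row) for row in ROWS]
newest_first = sorted(entries, key=lambda e: e.released, reverse=True)
with_code = [e.name for e in newest_first if e.has_code]

if __name__ == "__main__":
    for entry in newest_first:
        print(entry.released.isoformat(), entry.name)
    print("With code:", with_code)
```

Parsing the date into a `date` object, rather than comparing strings, keeps the sort correct even if a row were written with a different zero-padding convention.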

MLLM AR

b-1: Pixel Encoding

Name	Title	Venue	Date	Code	Demo
Emu3.5	Emu3.5: Native Multimodal Models are World Learners	arXiv	2025/10/30	Github	-
Uni-X	Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models	ICLR	2025/09/29	Github	-
OneCAT	OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation	arXiv	2025/09/03	Github	-
Selftok	Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning	arXiv	2025/05/12	Github	-
TokLIP	TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation	arXiv	2025/05/08	Github	-
Harmon	Harmonizing Visual Representations for Unified Multimodal Understanding and Generation	arXiv	2025/03/27	Github	Demo
UGen	UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning	arXiv	2025/03/27	-	-
SynerGen-VL	SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding	arXiv	2024/12/12	-	-
Liquid	Liquid: Language Models are Scalable and Unified Multi-modal Generators	arXiv	2024/12/05	Github	Demo
Orthus	Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads	arXiv	2024/11/28	Github	-
MMAR	MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling	arXiv	2024/10/14	-	-
Emu3	Emu3: Next-Token Prediction is All You Need	arXiv	2024/09/27	Github	Demo
ANOLE	ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation	arXiv	2024/07/08	Github	-
Chameleon	Chameleon: Mixed-Modal Early-Fusion Foundation Models	arXiv	2024/05/16	Github	-
LWM	World Model on Million-Length Video And Language With Blockwise RingAttention	ICLR	2024/02/13	Github	-

b-2: Semantic Encoding

Name	Title	Venue	Date	Code	Demo
MammothModa2	MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation	arXiv	2025/11/23	GitHub	-
Ming-UniVision	Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer	arXiv	2025/10/08	GitHub	-
Bifrost-1	Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents	arXiv	2025/08/08	GitHub	-
Qwen-Image	Qwen-Image Technical Report	arXiv	2025/08/04	GitHub	Demo
X-Omni	X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again	arXiv	2025/07/29	GitHub	Demo
Ovis-U1	Ovis-U1 Technical Report	arXiv	2025/06/28	GitHub	Demo
UniCode$^2$	UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation	arXiv	2025/06/20	-	-
OmniGen2	OmniGen2: Exploration to Advanced Multimodal Generation	arXiv	2025/06/18	Github	Demo
Tar	Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations	arXiv	2025/06/18	Github	Demo
UniFork	UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation	arXiv	2025/06/17	Github	-
UniWorld	UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation	arXiv	2025/06/03	Github	-
Pisces	Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation	arXiv	2025/06/10	Github	-
DualToken	DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies	arXiv	2025/03/18	Github	-
UniTok	UniTok: A Unified Tokenizer for Visual Generation and Understanding	arXiv	2025/02/27	Github	Demo
QLIP	QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation	arXiv	2025/02/05	Github	-
MetaMorph	MetaMorph: Multimodal Understanding and Generation via Instruction Tuning	arXiv	2024/12/18	Github	-
ILLUME	ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance	arXiv	2024/12/09	-	-
PUMA	PUMA: Empowering Unified MLLM with Multi-granular Visual Generation	arXiv	2024/10/17	Github	-
VILA-U	VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation	ICLR	2024/09/06	Github	Demo
Mini-Gemini	Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models	arXiv	2024/03/27	Github	Demo
MM-Interleaved	MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer	arXiv	2024/01/18	Github	-
VL-GPT	VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation	arXiv	2023/12/14	Github	-
Emu2	Generative Multimodal Models are In-Context Learners	CVPR	2023/12/10	Github	Demo
DreamLLM	DreamLLM: Synergistic Multimodal Comprehension and Creation	ICLR	2023/09/20	Github	-
LaVIT	Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization	ICLR	2023/09/09	Github	-
Emu	Emu: Generative Pretraining in Multimodality	ICLR	2023/07/11	Github	Demo

b-3: Learnable Query Encoding

Name	Title	Venue	Date	Code	Demo
UniPic 2.0	Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model	arXiv	2025/09/04	Github	-
TBAC-UniImage	TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning	arXiv	2025/08/11	Github	-
UniLIP	UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing	arXiv	2025/07/31	-	-
OpenUni	OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation	arXiv	2025/05/23	Github	-
BLIP3-o	BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset	arXiv	2025/05/14	Github	-
Ming-Lite-Uni	Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction	arXiv	2025/05/05	Github	-
Nexus-Gen	Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing	arXiv	2025/04/30	Github	Demo
MetaQueries	Transfer between Modalities with MetaQueries	arXiv	2025/04/08	-	-
SEED-X	SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation	arXiv	2024/04/22	Github	Demo
SEED-LLaMA	Making LLaMA SEE and Draw with SEED Tokenizer	ICLR	2023/10/02	Github	Demo
SEED	Planting a SEED of Vision in Large Language Model	arXiv	2023/07/16	Github	Demo

b-4: Hybrid Encoding (Pseudo)

Name	Title	Venue	Date	Code	Demo
Skywork UniPic	Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation	arXiv	2025/08/05	Github	Demo
MindOmni	MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO	arXiv	2025/05/13	Github	Demo
UniFluid	Unified Autoregressive Visual Generation and Understanding with Continuous Tokens	arXiv	2025/03/17	-	-
OmniMamba	OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models	arXiv	2025/03/11	Github	-
Janus-Pro	Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling	arXiv	2025/01/29	Github	Demo
Janus	Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation	arXiv	2024/10/17	Github	Demo

b-5: Hybrid Encoding (Joint)

Name	Title	Venue	Date	Code	Demo
Show-o2	Show-o2: Improved Native Unified Multimodal Models	arXiv	2025/06/15	Github	-
UniToken	UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding	CVPRW	2025/04/06	Github	-
VARGPT-v1.1	VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning	arXiv	2025/04/03	Github	-
ILLUME+	ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement	arXiv	2025/04/02	Github	-
SemHiTok	SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation	arXiv	2025/03/06	-	-
VARGPT	VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model	arXiv	2025/01/21	Github	-
TokenFlow	TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation	CVPR	2024/12/04	Github	-
MUSE-VL	MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding	arXiv	2024/11/26	-	-

MLLM AR-Diffusion

c-1: Pixel Encoding

Name	Title	Venue	Date	Code	Demo
TUNA	Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models	arXiv	2025/12/01	Github	-
LMFusion	LMFusion: Adapting Pretrained Language Models for Multimodal Generation	arXiv	2024/12/19	-	-
MonoFormer	MonoFormer: One Transformer for Both Diffusion and Autoregression	arXiv	2024/09/24	Github	-
Show-o	Show-o: One Single Transformer to Unify Multimodal Understanding and Generation	ICLR	2024/08/22	Github	Demo
Transfusion	Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model	ICLR	2024/08/20	Github	-

c-2: Hybrid Encoding (Pseudo)

Name	Title	Venue	Date	Code	Demo
EMMA	EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture	arXiv	2025/12/15	Github	-
HBridge	HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation	arXiv	2025/11/25	-	-
LightFusion	LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation	arXiv	2025/10/27	Github	-
BAGEL	Emerging Properties in Unified Multimodal Pretraining	arXiv	2025/05/20	Github	Demo
Mogao	Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation	arXiv	2025/05/08	-	-
JanusFlow	JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation	arXiv	2024/11/12	Github	Demo

Any-to-Any Multimodal Models

Name	Title	Venue	Date	Code	Demo
LongCat-Flash-Omni	LongCat-Flash-Omni Technical Report	arXiv	2025/11/28	Github	-
Ming-Flash-Omni	Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation	arXiv	2025/10/28	Github	-
Qwen3-Omni	Qwen3-Omni Technical Report	arXiv	2025/09/22	Github	-
Ming-Omni	Ming-Omni: A Unified Multimodal Model for Perception and Generation	arXiv	2025/06/09	Github	-
M2-omni	M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance	arXiv	2025/02/26	-	-
OmniFlow	OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows	CVPR	2024/12/02	Github	-
Spider	Spider: Any-to-Many Multimodal LLM	arXiv	2024/11/14	Github	-
MIO	MIO: A Foundation Model on Multimodal Tokens	arXiv	2024/09/26	Github	-
X-VILA	X-VILA: Cross-Modality Alignment for Large Language Model	arXiv	2024/05/29	-	-
AnyGPT	AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling	arXiv	2024/02/19	Github	-
Video-LaVIT	Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	ICML	2024/02/05	Github	-
Unified-IO 2	Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action	CVPR	2023/12/28	Github	-
NExT-GPT	NExT-GPT: Any-to-Any Multimodal LLM	ICML	2023/09/11	Github	-

Benchmark for Evaluation

Benchmarks on Understanding Tasks

Name	Paper	Venue	Date	Code
General-Bench	On Path to Multimodal Generalist: General-Level and General-Bench	ICML	2025/05/07	Github
MM-Vet v2	MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities	arXiv	2024/08/01	Github
OwlEval	mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	arXiv	2024/04/27	Github
oVQA	Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy	ICLR	2024/02/11	Github
SEED-Bench-2	SEED-Bench-2: Benchmarking Multimodal Large Language Models	arXiv	2023/11/28	Github
MMMU	MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI	CVPR	2023/11/27	Github
MM-Vet	MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities	ICML	2023/08/04	Github
SEED-Bench	SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	CVPR	2023/07/30	Github
MMBench	MMBench: Is Your Multi-modal Model an All-around Player?	ECCV	2023/07/12	Github
LAMM	LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark	NeurIPS	2023/06/11	Github
HaluEval	HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models	EMNLP	2023/05/19	Github
GQA	GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering	CVPR	2019/02/25	Github
VQA	VQA: Visual Question Answering	ICCV	2015/05/03	ProjectPage

Benchmarks on Image Generation Tasks

Name	Paper	Venue	Date	Code
GenExam	GenExam: A Multidisciplinary Text-to-Image Exam	arXiv	2025/09/18	Github
CVTG	TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes	arXiv	2025/08/05	Github
OneIG-Bench	OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation	arXiv	2025/06/26	Github
ComplexBench-Edit	ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies	arXiv	2025/06/15	Github
EditInspector	EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits	arXiv	2025/06/11	-
ByteMorph-Bench	ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions	arXiv	2025/06/03	Github
RefEdit-Bench	RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions	arXiv	2025/06/03	Github
WISE	WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation	arXiv	2025/05/27	Github
ImgEdit-Bench	ImgEdit: A Unified Image Editing Dataset and Benchmark	arXiv	2025/05/26	Github
MMIG-Bench	MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models	arXiv	2025/05/26	Github
KRIS-Bench	KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models	NeurIPS 2025	2025/05/22	Github
CompBench	CompBench: Benchmarking Complex Instruction-guided Image Editing	arXiv	2025/05/18	-
WorldGenBench	WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation	arXiv	2025/05/02	HuggingFace
GEdit-Bench	Step1X-Edit: A Practical Framework for General Image Editing	arXiv	2025/04/28	Github
DreamBench++	DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation	ICLR	2025/03/09	Github
T2I-CompBench++	T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation	TPAMI	2025/03/08	Github
IE-Bench	IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment	arXiv	2025/01/17	-
AnyEdit	AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea	CVPR	2024/11/24	Github
I2EBench	I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing	NeurIPS	2024/08/26	Github
ConceptMix	ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty	NeurIPS	2024/08/26	Github
GenAI-Bench	GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation	CVPR	2024/06/19	Github
Commonsense-T2I	Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?	COLM	2024/06/11	Github
HQ-Edit	HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	ICLR	2024/04/15	Github
VQAScore	Evaluating Text-to-Visual Generation with Image-to-Text Generation	ECCV	2024/04/01	Github
FlashEval	FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models	CVPR	2024/03/25	Github
DPG-Bench	ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment	arXiv	2024/03/08	Github
Reason-Edit	SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models	CVPR	2023/12/11	Github
Emu Edit	Emu Edit: Precise Image Editing via Recognition and Generation Tasks	CVPR	2023/11/16	HuggingFace
HEIM	Holistic Evaluation of Text-To-Image Models	NeurIPS	2023/11/07	Github
DSG-1k	Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation	ICLR	2023/10/27	Github
GenEval	GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment	NeurIPS	2023/10/17	Github
EditVal	EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods	arXiv	2023/10/03	Github
T2I-CompBench	T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation	NeurIPS	2023/07/12	Github
DreamSim	DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data	NeurIPS	2023/06/15	Github
MagicBrush	MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing	NeurIPS	2023/06/16	Github
MultiGen-20M	UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild	NeurIPS	2023/05/18	Github
HRS-Bench	HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models	ICCV	2023/04/11	Github
TIFA	TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering	ICCV	2023/03/21	Github
EditBench	Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting	CVPR	2022/12/13	ProjectPage
PartiPrompts	Scaling Autoregressive Models for Content-Rich Text-to-Image Generation	TMLR	2022/06/22	Github
DrawBench	Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding	NeurIPS	2022/05/23	ProjectPage
PaintSkills	DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models	ICCV	2022/02/08	Github

Benchmarks on Interleaved / Compositional / Other Tasks

Name	Paper	Venue	Date	Code
VTBench	VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation	arXiv	2025/05/19	Github
UniBench	UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation	arXiv	2025/05/15	Github
OpenING	OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation	CVPR	2024/11/27	Github
ISG	Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment	ICLR	2024/11/26	Github
MMIE	MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models	ICLR	2024/10/14	Github
InterleavedBench	Holistic Evaluation for Interleaved Text-and-Image Generation	EMNLP	2024/06/20	HuggingFace
OpenLEAF	OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation	MM	2023/10/01	-

Dataset

Multimodal Understanding

Dataset	Samples	Paper	Venue	Date
Honey-Data-15M	15M	Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs	arXiv	2025/11/11
Infinity-MM	40M	Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data	arXiv	2024/10/24
LLaVA-OneVision	4.8M	LLaVA-OneVision: Easy Visual Task Transfer	TMLR	2024/08/06
Cambrian-10M	10M	Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs	NeurIPS	2024/06/24
ShareGPT4V	100K	ShareGPT4V: Improving Large Multi-Modal Models with Better Captions	ECCV	2023/11/21
CapsFusion-120M	120M	CapsFusion: Rethinking Image-Text Data at Scale	CVPR	2023/10/31
GRIT	20M	Kosmos-2: Grounding Multimodal Large Language Models to the World	ICLR	2023/06/26
DataComp	1.4B	DataComp: In Search of the Next Generation of Multimodal Datasets	NeurIPS	2023/04/27
LAION-COCO	600M	LAION COCO: 600M Synthetic Captions from LAION2B-en	-	2022/09/15
COYO	747M	COYO-700M: Image-Text Pair Dataset	-	2022/08/31
LAION	5.9B	LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models	NeurIPS	2022/03/31
Wukong	100M	Wukong: A 100 Million Large-Scale Chinese Cross-Modal Pre-Training Benchmark	NeurIPS	2022/02/14
RedCaps	12M	RedCaps: Web-Curated Image-Text Data Created by the People, for the People	NeurIPS	2021/11/22

Text-to-Image

Dataset	Samples	Paper	Venue	Date
FLUX-Reason-6M	6M	FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark	arXiv	2025/09/11
Echo-4o-Image	106K	Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation	arXiv	2025/08/13
Poster100K	100K	PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework	arXiv	2025/06/12
Text-Render-2M	2M	PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework	arXiv	2025/06/12
ShareGPT-4o-Image	45K	ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation	arXiv	2025/06/22
BLIP3o-60k	60K	BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset	arXiv	2025/05/14
TextAtlas5M	5M	TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation	arXiv	2025/02/11
EliGen TrainSet	500K	EliGen: Entity-Level Controlled Image Generation with Regional Attention	arXiv	2025/01/02
PD12M	12M	Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms	arXiv	2024/10/30
SFHQ-T2I	122K	-	-	2024/10/06
text-to-image-2M	2M	-	-	2024/09/13
DenseFusion	1M	DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception	NeurIPS	2024/07/11
Megalith	10M	-	-	2024/07/01
PixelProse	16M	From Pixels to Prose: A Large Dataset of Dense Image Captions	arXiv	2024/06/14
DOCCI	15K	DOCCI: Descriptions of Connected and Contrasting Images	ECCV	2024/04/30
CosmicMan-HQ 1.0	6M	CosmicMan: A Text-to-Image Foundation Model for Humans	CVPR	2024/04/01
AnyWord-3M	3M	AnyText: Multilingual Visual Text Generation and Editing	ICLR	2023/11/06
JourneyDB	4M	JourneyDB: A Benchmark for Generative Image Understanding	NeurIPS	2023/07/03
RenderedText	12M	-	-	2023/06/30
Mario-10M	10M	TextDiffuser: Diffusion Models as Text Painters	NeurIPS	2023/05/18
SAM	11M	Segment Anything	ICCV	2023/04/05
LAION-Aesthetics	120M	LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models	NeurIPS	2022/08/16
CC-12M	12M	Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts	CVPR	2021/02/17

Image Editing

Dataset	Samples	Paper	Venue	Date
Pico-Banana-400K	400K	Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing	arXiv	2025/10/22
X2Edit	3.7M	X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning	arXiv	2025/08/11
ShareGPT-4o-Image (Editing)	46K	ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation	arXiv	2025/06/22
ByteMorph-6M	6M	ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions	arXiv	2025/06/03
ImgEdit	1.2M	ImgEdit: A Unified Image Editing Dataset and Benchmark	arXiv	2025/05/26
RefEdit	18K	RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions	arXiv	2025/06/03
AnyEdit	2.5M	AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea	CVPR	2024/11/24
OmniEdit	1.2M	OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision	ICLR	2024/11/11
PromptFix	1M	PromptFix: You Prompt and We Fix the Photo	NeurIPS	2024/09/19
UltraEdit	4M	UltraEdit: Instruction-based Fine-Grained Image Editing at Scale	NeurIPS	2024/07/07
EditWorld	8.6K	EditWorld: Simulating World Dynamics for Instruction-Following Image Editing	arXiv	2024/06/23
SEED-Data-Edit	3.7M	SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing	arXiv	2024/05/07
HQ-Edit	197K	HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	arXiv	2024/04/15
HIVE	1.1M	HIVE: Harnessing Human Feedback for Instructional Visual Editing	arXiv	2023/07/08
MagicBrush	10K	MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing	NeurIPS	2023/06/16
InstructP2P	313K	InstructPix2Pix: Learning to Follow Image Editing Instructions	CVPR	2022/11/17

Interleaved Image-Text

Dataset	Samples	Paper	Venue	Date
OmniCorpus	8B	OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text	ICLR	2024/10/22
CoMM	227K	CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation	CVPR	2024/06/15
OBELICS	141M	OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents	NeurIPS	2023/06/21
Multimodal C4	101.2M	Multimodal C4: An Open, Billion-Scale Corpus of Images Interleaved with Text	NeurIPS	2023/04/14

Other Text-Image-to-Image

Dataset	Samples	Paper	Venue	Date
Echo-4o-Image	73K	Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation	arXiv	2025/08/13
MetaQuery Instruct 2.4M	2.4M	Transfer between Modalities with MetaQueries	arXiv	2025/06/24
Graph200K	200K	VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning	arXiv	2025/03/30
SynCD	95K	Generating Multi-Image Synthetic Data for Text-to-Image Customization	arXiv	2025/02/03
X2I-subject-driven	2.5M	OmniGen: Unified Image Generation	arXiv	2024/12/14
Subjects200K	200K	OminiControl: Minimal and Universal Control for Diffusion Transformer	arXiv	2024/11/22
MultiGen-20M	20M	UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild	NeurIPS	2023/05/18
LAION-Face	50M	General Facial Representation Learning in a Visual-Linguistic Manner	CVPR	2021/12/06

Applications and Opportunities

Name	Title	Venue	Date	Code	Demo
UniGame	UniGame: Turning a Unified Multimodal Model Into Its Own Adversary	arXiv	2025/11/24	Github	-
UniCTokens	UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens	arXiv	2025/05/20	Github	-
Fair-UMLLM	On Fairness of Unified Multimodal Large Language Model for Image Generation	arXiv	2025/02/05	-	-
T2I-R1	T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT	arXiv	2025/01/29	Github	-

Citation

If you find this repo helpful for your research, please cite our paper:

@article{zhang2025unified,
  title={Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities},
  author={Zhang, Xinjie and Guo, Jintao and Zhao, Shanshan and Fu, Minghao and Duan, Lunhao and Hu, Jiakui and Chng, Yong Xien and Wang, Guo-Hua and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu},
  journal={arXiv preprint arXiv:2505.02567},
  year={2025}
}