March 14, 2026

Awesome Unified Multimodal Models

📚Survey • 🤗 HF Repo

Figure 1: Timeline of Publicly Available and Unavailable Unified Multimodal Models. The models are categorized by their release years, from 2023 to 2025. Models underlined in the diagram represent any-to-any multimodal models, capable of handling inputs or outputs beyond text and image, such as audio, video, and speech. The timeline highlights the rapid growth in this field.

🔥 We are hiring!

We are looking for both interns and full-time researchers to join our team, focusing on multimodal understanding, generation, reasoning, AI agents, and unified multimodal models. If you are interested in exploring these exciting areas, please reach out to us at qingguo.cqg@alibaba-inc.com.

👉 What is This Repo for?

This repository provides a comprehensive collection of resources related to unified multimodal models, featuring:

  • A survey of advances, challenges, and timelines for unified models
  • Categorized lists of diffusion-based, autoregressive (MLLM), and hybrid architectures for unified image–text understanding and generation
  • Benchmarks for evaluating multimodal comprehension, image generation, and interleaved image–text tasks
  • Representative datasets covering multimodal understanding, text-to-image synthesis, image editing, and interleaved interactions

It is designed to help researchers and practitioners explore, compare, and build state-of-the-art unified multimodal systems.

Awesome Papers & Datasets

Text-and-Image Unified Models

Figure 2: Classification of Unified Multimodal Understanding and Generation Models. The models are divided into three main categories based on their backbone architecture: Diffusion, MLLM (AR), and MLLM (AR + Diffusion). Each category is further subdivided according to the encoding strategy employed, including Pixel Encoding, Semantic Encoding, Learnable Query Encoding, and Hybrid Encoding. We illustrate the architectural variations within these categories and their corresponding encoder-decoder configurations.
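The taxonomy used to organize the lists below can be written down as a small lookup structure. This is only an illustrative sketch: the names `Backbone`, `TAXONOMY`, and `categories` are hypothetical and not part of the repository; the category labels are taken from the section headings of this README.

```python
from enum import Enum


class Backbone(Enum):
    """Backbone families named in Figure 2 (labels copied from the caption)."""
    DIFFUSION = "Diffusion"
    MLLM_AR = "MLLM (AR)"
    MLLM_AR_DIFFUSION = "MLLM (AR + Diffusion)"


# Encoding strategies listed under each backbone in this README.
# The Diffusion section has no encoding sub-sections, hence the empty list.
TAXONOMY = {
    Backbone.DIFFUSION: [],
    Backbone.MLLM_AR: [
        "Pixel Encoding",
        "Semantic Encoding",
        "Learnable Query Encoding",
        "Hybrid Encoding (Pseudo)",
        "Hybrid Encoding (Joint)",
    ],
    Backbone.MLLM_AR_DIFFUSION: [
        "Pixel Encoding",
        "Hybrid Encoding (Pseudo)",
    ],
}


def categories():
    """Return flat 'backbone / encoding' labels in listing order."""
    labels = []
    for backbone, encodings in TAXONOMY.items():
        if not encodings:
            labels.append(backbone.value)
        for encoding in encodings:
            labels.append(f"{backbone.value} / {encoding}")
    return labels


if __name__ == "__main__":
    for label in categories():
        print(label)
```

A structure like this could be used, for example, to tag each paper entry with its category when building a searchable index of the list.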

Diffusion

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| UniModel | UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation | arXiv | 2025/11/21 | - | - |
| Lavida-O | Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation | arXiv | 2025/09/24 | Github | - |
| Muddit | Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model | arXiv | 2025/05/23 | Github | - |
| FUDOKI | FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities | arXiv | 2025/05/20 | - | - |
| MMaDA | MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025/05/21 | Github | Demo |
| UniDisc | Unified Multimodal Discrete Diffusion | arXiv | 2025/03/20 | Github | - |
| Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12/31 | Github | - |

MLLM AR

b-1: Pixel Encoding

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| Emu3.5 | Emu3.5: Native Multimodal Models are World Learners | arXiv | 2025/10/30 | Github | - |
| Uni-X | Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models | ICLR | 2025/09/29 | Github | - |
| OneCAT | OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation | arXiv | 2025/09/03 | Github | - |
| Selftok | Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | arXiv | 2025/05/12 | Github | - |
| TokLIP | TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation | arXiv | 2025/05/08 | Github | - |
| Harmon | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | arXiv | 2025/03/27 | Github | Demo |
| UGen | UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning | arXiv | 2025/03/27 | - | - |
| SynerGen-VL | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | 2024/12/12 | - | - |
| Liquid | Liquid: Language Models are Scalable and Unified Multi-modal Generators | arXiv | 2024/12/05 | Github | Demo |
| Orthus | Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | 2024/11/28 | Github | - |
| MMAR | MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | arXiv | 2024/10/14 | - | - |
| Emu3 | Emu3: Next-Token Prediction is All You Need | arXiv | 2024/09/27 | Github | Demo |
| ANOLE | ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | 2024/07/08 | Github | - |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | arXiv | 2024/05/16 | Github | - |
| LWM | World Model on Million-Length Video And Language With Blockwise RingAttention | ICLR | 2024/02/13 | Github | - |

b-2: Semantic Encoding

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| MammothModa2 | MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation | arXiv | 2025/11/23 | GitHub | - |
| Ming-UniVision | Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer | arXiv | 2025/10/08 | GitHub | - |
| Bifrost-1 | Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents | arXiv | 2025/08/08 | GitHub | - |
| Qwen-Image | Qwen-Image Technical Report | arXiv | 2025/08/04 | GitHub | Demo |
| X-Omni | X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again | arXiv | 2025/07/29 | GitHub | Demo |
| Ovis-U1 | Ovis-U1 Technical Report | arXiv | 2025/06/28 | GitHub | Demo |
| UniCode² | UniCode²: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation | arXiv | 2025/06/20 | - | - |
| OmniGen2 | OmniGen2: Exploration to Advanced Multimodal Generation | arXiv | 2025/06/18 | Github | Demo |
| Tar | Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations | arXiv | 2025/06/18 | Github | Demo |
| UniFork | UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation | arXiv | 2025/06/17 | Github | - |
| UniWorld | UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | arXiv | 2025/06/03 | Github | - |
| Pisces | Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation | arXiv | 2025/06/10 | Github | - |
| DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | 2025/03/18 | Github | - |
| UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | 2025/02/27 | Github | Demo |
| QLIP | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | arXiv | 2025/02/05 | Github | - |
| MetaMorph | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv | 2024/12/18 | Github | - |
| ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | 2024/12/09 | - | - |
| PUMA | PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | arXiv | 2024/10/17 | Github | - |
| VILA-U | VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR | 2024/09/06 | Github | Demo |
| Mini-Gemini | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024/03/27 | Github | Demo |
| MM-Interleaved | MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv | 2024/01/18 | Github | - |
| VL-GPT | VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | 2023/12/14 | Github | - |
| Emu2 | Generative Multimodal Models are In-Context Learners | CVPR | 2023/12/10 | Github | Demo |
| DreamLLM | DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023/09/20 | Github | - |
| LaVIT | Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | ICLR | 2023/09/09 | Github | - |
| Emu | Emu: Generative Pretraining in Multimodality | ICLR | 2023/07/11 | Github | Demo |

b-3: Learnable Query Encoding

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| UniPic 2.0 | Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model | arXiv | 2025/09/04 | Github | - |
| TBAC-UniImage | TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning | arXiv | 2025/08/11 | Github | - |
| UniLIP | UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing | arXiv | 2025/07/31 | - | - |
| OpenUni | OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv | 2025/05/23 | Github | - |
| BLIP3-o | BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025/05/14 | Github | - |
| Ming-Lite-Uni | Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction | arXiv | 2025/05/05 | Github | - |
| Nexus-Gen | Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing | arXiv | 2025/04/30 | Github | Demo |
| MetaQueries | Transfer between Modalities with MetaQueries | arXiv | 2025/04/08 | - | - |
| SEED-X | SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024/04/22 | Github | Demo |
| SEED-LLaMA | Making LLaMA SEE and Draw with SEED Tokenizer | ICLR | 2023/10/02 | Github | Demo |
| SEED | Planting a SEED of Vision in Large Language Model | arXiv | 2023/07/16 | Github | Demo |

b-4: Hybrid Encoding (Pseudo)

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| Skywork UniPic | Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation | arXiv | 2025/08/05 | Github | Demo |
| MindOmni | MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | arXiv | 2025/05/13 | Github | Demo |
| UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv | 2025/03/17 | - | - |
| OmniMamba | OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv | 2025/03/11 | Github | - |
| Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | 2025/01/29 | Github | Demo |
| Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | 2024/10/17 | Github | Demo |

b-5: Hybrid Encoding (Joint)

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| Show-o2 | Show-o2: Improved Native Unified Multimodal Models | arXiv | 2025/06/15 | Github | - |
| UniToken | UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | CVPRW | 2025/04/06 | Github | - |
| VARGPT-v1.1 | VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv | 2025/04/03 | Github | - |
| ILLUME+ | ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv | 2025/04/02 | Github | - |
| SemHiTok | SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation | arXiv | 2025/03/06 | - | - |
| VARGPT | VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | 2025/01/21 | Github | - |
| TokenFlow | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR | 2024/12/04 | Github | - |
| MUSE-VL | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv | 2024/11/26 | - | - |

MLLM AR-Diffusion

c-1: Pixel Encoding

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| TUNA | Tuna: Taming Unified Visual Representations for Native Unified Multimodal Models | arXiv | 2025/12/01 | Github | - |
| LMFusion | LMFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv | 2024/12/19 | - | - |
| MonoFormer | MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv | 2024/09/24 | Github | - |
| Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR | 2024/08/22 | Github | Demo |
| Transfusion | Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR | 2024/08/20 | Github | - |

c-2: Hybrid Encoding (Pseudo)

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| EMMA | EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture | arXiv | 2025/12/15 | Github | - |
| HBridge | HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation | arXiv | 2025/11/25 | - | - |
| LightFusion | LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation | arXiv | 2025/10/27 | Github | - |
| BAGEL | Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025/05/20 | Github | Demo |
| Mogao | Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation | arXiv | 2025/05/08 | - | - |
| JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | 2024/11/12 | Github | Demo |

Any-to-Any Multimodal Models

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| LongCat-Flash-Omni | LongCat-Flash-Omni Technical Report | arXiv | 2025/11/28 | Github | - |
| Ming-Flash-Omni | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | arXiv | 2025/10/28 | Github | - |
| Qwen3-Omni | Qwen3-Omni Technical Report | arXiv | 2025/09/22 | Github | - |
| Ming-Omni | Ming-Omni: A Unified Multimodal Model for Perception and Generation | arXiv | 2025/06/09 | Github | - |
| M2-omni | M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance | arXiv | 2025/02/26 | - | - |
| OmniFlow | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | CVPR | 2024/12/02 | Github | - |
| Spider | Spider: Any-to-Many Multimodal LLM | arXiv | 2024/11/14 | Github | - |
| MIO | MIO: A Foundation Model on Multimodal Tokens | arXiv | 2024/09/26 | Github | - |
| X-VILA | X-VILA: Cross-Modality Alignment for Large Language Model | arXiv | 2024/05/29 | - | - |
| AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024/02/19 | Github | - |
| Video-LaVIT | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ICML | 2024/02/05 | Github | - |
| Unified-IO 2 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR | 2023/12/28 | Github | - |
| NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | ICML | 2023/09/11 | Github | - |

Benchmark for Evaluation

Benchmarks on Understanding Tasks

| Name | Paper | Venue | Date | Code |
|---|---|---|---|---|
| General-Bench | On Path to Multimodal Generalist: General-Level and General-Bench | ICML | 2025/05/07 | Github |
| MM-Vet v2 | MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | arXiv | 2024/08/01 | Github |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2024/04/27 | Github |
| oVQA | Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | ICLR | 2024/02/11 | Github |
| SEED-Bench-2 | SEED-Bench-2: Benchmarking Multimodal Large Language Models | arXiv | 2023/11/28 | Github |
| MMMU | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2023/11/27 | Github |
| MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | ICML | 2023/08/04 | Github |
| SEED-Bench | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR | 2023/07/30 | Github |
| MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | ECCV | 2023/07/12 | Github |
| LAMM | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | NeurIPS | 2023/06/11 | Github |
| HaluEval | HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | EMNLP | 2023/05/19 | Github |
| GQA | GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | CVPR | 2019/02/25 | Github |
| VQA | VQA: Visual Question Answering | ICCV | 2015/05/03 | ProjectPage |

Benchmarks on Image Generation Tasks

| Name | Paper | Venue | Date | Code |
|---|---|---|---|---|
| GenExam | GenExam: A Multidisciplinary Text-to-Image Exam | arXiv | 2025/09/18 | Github |
| CVTG | TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes | arXiv | 2025/08/05 | Github |
| OneIG-Bench | OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation | arXiv | 2025/06/26 | Github |
| ComplexBench-Edit | ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | arXiv | 2025/06/15 | Github |
| EditInspector | EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits | arXiv | 2025/06/11 | - |
| ByteMorph-Bench | ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | arXiv | 2025/06/03 | Github |
| RefEdit-Bench | RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions | arXiv | 2025/06/03 | Github |
| WISE | WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | arXiv | 2025/05/27 | Github |
| ImgEdit-Bench | ImgEdit: A Unified Image Editing Dataset and Benchmark | arXiv | 2025/05/26 | Github |
| MMIG-Bench | MMGen-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models | arXiv | 2025/05/26 | Github |
| KRIS-Bench | KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models | NeurIPS 2025 | 2025/05/22 | Github |
| CompBench | CompBench: Benchmarking Complex Instruction-guided Image Editing | arXiv | 2025/05/18 | - |
| WorldGenBench | WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation | arXiv | 2025/05/02 | HuggingFace |
| GEdit-Bench | Step1X-Edit: A Practical Framework for General Image Editing | arXiv | 2025/04/28 | Github |
| DreamBench++ | DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | ICLR | 2025/03/09 | Github |
| T2I-CompBench++ | T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation | TPAMI | 2025/03/08 | Github |
| IE-Bench | IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment | arXiv | 2025/01/17 | - |
| AnyEdit | AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea | CVPR | 2024/11/24 | Github |
| I2EBench | I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing | NeurIPS | 2024/08/26 | Github |
| ConceptMix | ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty | NeurIPS | 2024/08/26 | Github |
| GenAI-Bench | GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation | CVPR | 2024/06/19 | Github |
| Commonsense-T2I | Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | COLM | 2024/06/11 | Github |
| HQ-Edit | HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing | ICLR | 2024/04/15 | Github |
| VQAScore | Evaluating Text-to-Visual Generation with Image-to-Text Generation | ECCV | 2024/04/01 | Github |
| FlashEval | FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models | CVPR | 2024/03/25 | Github |
| DPG-Bench | ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | arXiv | 2024/03/08 | Github |
| Reason-Edit | SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models | CVPR | 2023/12/11 | Github |
| Emu Edit | Emu Edit: Precise Image Editing via Recognition and Generation Tasks | CVPR | 2023/11/16 | HuggingFace |
| HEIM | Holistic Evaluation of Text-To-Image Models | NeurIPS | 2023/11/07 | Github |
| DSG-1k | Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | ICLR | 2023/10/27 | Github |
| GenEval | GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment | NeurIPS | 2023/10/17 | Github |
| EditVal | EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | arXiv | 2023/10/03 | Github |
| T2I-CompBench | T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation | NeurIPS | 2023/07/12 | Github |
| DreamSim | DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data | NeurIPS | 2023/06/15 | Github |
| MagicBrush | MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | NeurIPS | 2023/06/16 | Github |
| MultiGen-20M | UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild | NeurIPS | 2023/05/18 | Github |
| HRS-Bench | HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models | ICCV | 2023/04/11 | Github |
| TIFA | TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | ICCV | 2023/03/21 | Github |
| EditBench | Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting | CVPR | 2022/12/13 | ProjectPage |
| PartiPrompts | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | TMLR | 2022/06/22 | Github |
| DrawBench | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | NeurIPS | 2022/05/23 | ProjectPage |
| PaintSkills | DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | ICCV | 2022/02/08 | Github |

Benchmarks on Interleaved / Compositional / Other Tasks

| Name | Paper | Venue | Date | Code |
|---|---|---|---|---|
| VTBench | VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation | arXiv | 2025/05/19 | Github |
| UniBench | UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | arXiv | 2025/05/15 | Github |
| OpenING | OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation | CVPR | 2024/11/27 | Github |
| ISG | Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment | ICLR | 2024/11/26 | Github |
| MMIE | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | ICLR | 2024/10/14 | Github |
| InterleavedBench | Holistic Evaluation for Interleaved Text-and-Image Generation | EMNLP | 2024/06/20 | HuggingFace |
| OpenLEAF | OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | MM | 2023/10/01 | - |

Dataset

Multimodal Understanding

| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| Honey-Data-15M | 15M | Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs | arXiv | 2025/11/11 |
| Infinity-MM | 40M | Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data | arXiv | 2024/10/24 |
| LLaVA-OneVision | 4.8M | LLaVA-OneVision: Easy Visual Task Transfer | TMLR | 2024/08/06 |
| Cambrian-10M | 10M | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | NeurIPS | 2024/06/24 |
| ShareGPT4V | 100K | Sharegpt4v: Improving large multi-modal models with better captions | ECCV | 2023/11/21 |
| CapsFusion-120M | 120M | Capsfusion: Rethinking image-text data at scale | CVPR | 2023/10/31 |
| GRIT | 20M | Kosmos-2: Grounding multimodal large language models to the world | ICLR | 2023/06/26 |
| DataComp | 1.4B | DATACOMP: In search of the next generation of multimodal datasets | NeurIPS | 2023/04/27 |
| Laion-COCO | 600M | Laion coco: 600m synthetic captions from laion2b-en | - | 2022/09/15 |
| COYO | 747M | Coyo-700m: Image-text pair dataset | - | 2022/08/31 |
| Laion | 5.9B | Laion-5b: An open large-scale dataset for training next generation image-text models | NeurIPS | 2022/03/31 |
| Wukong | 100M | Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark | NeurIPS | 2022/02/14 |
| RedCaps | 12M | Redcaps: Web-curated image-text data created by the people, for the people | NeurIPS | 2021/11/22 |

Text-to-Image

| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| FLUX-Reason-6M | 6M | FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark | arXiv | 2025/09/11 |
| Echo-4o-Image | 106K | Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation | arXiv | 2025/08/13 |
| Poster100K | 100K | PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework | arXiv | 2025/06/12 |
| Text-Render-2M | 2M | PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework | arXiv | 2025/06/12 |
| ShareGPT-4o-Image | 45K | ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | arXiv | 2025/06/22 |
| BLIP3o-60k | 60K | BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025/05/14 |
| TextAtlas5M | 5M | TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation | arXiv | 2025/02/11 |
| EliGen TrainSet | 500K | EliGen: Entity-Level Controlled Image Generation with Regional Attention | arXiv | 2025/01/02 |
| PD12M | 12M | Public domain 12m: A highly aesthetic image-text dataset with novel governance mechanisms | arXiv | 2024/10/30 |
| SFHQ-T2I | 122K | - | - | 2024/10/06 |
| text-to-image-2M | 2M | - | - | 2024/09/13 |
| DenseFusion | 1M | Densefusion-1m: Merging vision experts for comprehensive multimodal perception | NeurIPS | 2024/07/11 |
| Megalith | 10M | - | - | 2024/07/01 |
| PixelProse | 16M | From pixels to prose: A large dataset of dense image captions | arXiv | 2024/06/14 |
| DOCCI | 15K | DOCCI: Descriptions of Connected and Contrasting Images | ECCV | 2024/04/30 |
| CosmicMan-HQ 1.0 | 6M | Cosmicman: A text-to-image foundation model for humans | CVPR | 2024/04/01 |
| AnyWord-3M | 3M | Anytext: Multilingual visual text generation and editing | ICLR | 2023/11/06 |
| JourneyDB | 4M | JourneyDB: A Benchmark for Generative Image Understanding | NeurIPS | 2023/07/03 |
| RenderedText | 12M | - | - | 2023/06/30 |
| Mario-10M | 10M | Textdiffuser: Diffusion models as text painters | NeurIPS | 2023/05/18 |
| SAM | 11M | Segment Anything | ICCV | 2023/04/05 |
| LAION-Aesthetics | 120M | Laion-5b: An open large-scale dataset for training next generation image-text models | NeurIPS | 2022/08/16 |
| CC-12M | 12M | Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts | CVPR | 2021/02/17 |

Image Editing

| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| Pico-Banana-400K | 400K | Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing | arXiv | 2025/10/22 |
| X2Edit | 3.7M | X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning | arXiv | 2025/08/11 |
| ShareGPT-4o-Image (Editing) | 46K | ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | arXiv | 2025/06/22 |
| ByteMorph-6M | 6M | ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | arXiv | 2025/06/03 |
| ImgEdit | 1.2M | ImgEdit: A Unified Image Editing Dataset and Benchmark | arXiv | 2025/05/26 |
| RefEdit | 18K | RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model for Referring Expression | arXiv | 2025/04/03 |
| AnyEdit | 2.5M | Anyedit: Mastering unified high-quality image editing for any idea | CVPR | 2024/11/24 |
| OmniEdit | 1.2M | Omniedit: Building image editing generalist models through specialist supervision | ICLR | 2024/11/11 |
| PromptFix | 1M | PromptFix: You Prompt and We Fix the Photo | NeurIPS | 2024/09/19 |
| UltraEdit | 4M | Ultraedit: Instruction-based fine-grained image editing at scale | NeurIPS | 2024/07/07 |
| EditWorld | 8.6K | EditWorld: Simulating World Dynamics for Instruction-Following Image Editing | arXiv | 2024/06/23 |
| SEED-Data-Edit | 3.7M | Seed-data-edit technical report: A hybrid dataset for instructional image editing | arXiv | 2024/05/07 |
| HQ-Edit | 197K | Hq-edit: A high-quality dataset for instruction-based image editing | arXiv | 2024/04/15 |
| HIVE | 1.1M | HIVE: Harnessing Human Feedback for Instructional Visual Editing | arXiv | 2023/07/08 |
| Magicbrush | 10K | Magicbrush: A manually annotated dataset for instruction-guided image editing | NeurIPS | 2023/06/16 |
| InstructP2P | 313K | Instructpix2pix: Learning to follow image editing instructions | CVPR | 2022/11/17 |

Interleaved Image-Text

| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| OmniCorpus | 8B | OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | ICLR | 2024/10/22 |
| CoMM | 227K | Comm: A coherent interleaved image-text dataset for multimodal understanding and generation | CVPR | 2024/06/15 |
| OBELICS | 141M | Obelics: An open web-scale filtered dataset of interleaved image-text documents | NeurIPS | 2023/06/21 |
| Multimodal C4 | 101.2M | Multimodal c4: An open, billion-scale corpus of images interleaved with text | NeurIPS | 2023/04/14 |

Other Text-Image-to-Image

| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| Echo-4o-Image | 73K | Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation | arXiv | 2025/08/13 |
| MetaQuery Instruct 2.4M | 2.4M | Transfer between Modalities with MetaQueries | arXiv | 2025/06/24 |
| Graph200K | 200K | VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning | arXiv | 2025/03/30 |
| SynCD | 95K | Generating multi-image synthetic data for text-to-image customization | arXiv | 2025/02/03 |
| X2I-subject-driven | 2.5M | OmniGen: Unified Image Generation | arXiv | 2024/12/14 |
| Subjects200K | 200K | Ominicontrol: Minimal and universal control for diffusion transformer | arXiv | 2024/11/22 |
| MultiGen-20M | 20M | Unicontrol: A unified diffusion model for controllable visual generation in the wild | NeurIPS | 2023/05/18 |
| LAION-Face | 50M | General facial representation learning in a visual-linguistic manner | CVPR | 2021/12/06 |

Applications and Opportunities

| Name | Title | Venue | Date | Code | Demo |
|---|---|---|---|---|---|
| UniGame | UniGame: Turning a Unified Multimodal Model Into Its Own Adversary | arXiv | 2025/11/24 | Github | - |
| UniCTokens | UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens | arXiv | 2025/05/20 | Github | - |
| Fair-UMLLM | On Fairness of Unified Multimodal Large Language Model for Image Generation | arXiv | 2025/02/05 | - | - |
| T2I-R1 | T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | arXiv | 2025/01/29 | Github | - |

Citation

If you find this repo helpful for your research, please cite our paper:

@article{zhang2025unified,
  title={Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities},
  author={Zhang, Xinjie and Guo, Jintao and Zhao, Shanshan and Fu, Minghao and Duan, Lunhao and Hu, Jiakui and Chng, Yong Xien and Wang, Guo-Hua and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu},
  journal={arXiv preprint arXiv:2505.02567},
  year={2025}
}