Awesome-Multimodal-Large-Language-Models

May 1, 2026

✨ Highlights of NJU-MiG

🔥🔥 Surveys of MLLMs | 💬 WeChat (MLLM discussion group)

  • 🌟 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
    arXiv 2025, Paper, Project

  • 🌟 A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges
    arXiv 2025, Paper, Project

  • A Survey on Multimodal Large Language Models
    NSR 2024, Paper, Project


🔥🔥 VITA Series Omni MLLMs | 💬 WeChat (VITA discussion group)

  • VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
    NeurIPS 2025 Highlight, Paper, Project

  • VITA: Towards Open-Source Interactive Omni Multimodal LLM
    arXiv 2024, Paper, Project

  • VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
    NeurIPS 2025, Paper, Project


🔥🔥 MME Series MLLM Benchmarks

  • 🔥 Video-MME-v2: Towards the Next Stage in Video Understanding Evaluation
    arXiv 2026, Paper, Project, Dataset, Leaderboard

  • 🌟 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
    arXiv 2025, Paper, Project

  • MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
    NeurIPS 2025 DB Highlight, Paper, Dataset, Eval Tool, ✒️ Citation

  • Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
    CVPR 2025, Paper, Project, Dataset


Table of Contents

  • Awesome Papers
      • Multimodal Instruction Tuning (& Latest Works)
      • Multimodal Hallucination
      • Multimodal In-Context Learning
      • Multimodal Chain-of-Thought

Awesome Papers

Multimodal Instruction Tuning (& Latest Works)

| Title | Venue | Date | Code | Demo |
|:------|:------|:-----|:-----|:-----|
| DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence | DeepSeek | 2026-04-24 | Huggingface | - |
| Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model | Blog | 2026-04-22 | Huggingface | Demo |
| Xiaomi MiMo-V2.5 | Blog | 2026-04-22 | - | Demo |
| Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding | arXiv | 2026-04-06 | Github | Demo |
| Introducing Muse Spark: Scaling Towards Personal Superintelligence | Blog | 2026-04-08 | - | Demo |
| VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing | arXiv | 2026-04-03 | Github | Local Demo |
| Gemma 4: Byte for byte, the most capable open models | Blog | 2026-04-02 | - | Demo |
| Qwen3.6-Plus: Towards Real World Agents | Blog | 2026-04-02 | - | - |
| Qwen3.5-Omni: Scaling Up, Toward Native Omni-Modal AGI | Blog | 2026-03-30 | - | Demo |
| Xiaomi MiMo-V2-Omni | Blog | 2026-03-18 | - | - |
| InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing | arXiv | 2026-03-10 | Github | Local Demo |
| Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | arXiv | 2026-03-06 | Github | - |
| Beyond Language Modeling: An Exploration of Multimodal Pretraining | arXiv | 2026-03-03 | - | - |
| Gemini 3.1 Pro: A smarter model for your most complex tasks | Blog | 2026-02-19 | - | - |
| Qwen3.5: Towards Native Multimodal Agents | Blog | 2026-02-16 | Github | Demo |
| MiniCPM-o 4.5 | Blog | 2026-02-06 | Github | Demo |
| Kimi K2.5: Visual Agentic Intelligence | arXiv | 2026-02-02 | Github | - |
| DeepSeek-OCR 2: Visual Causal Flow | DeepSeek | 2026-01-27 | Github | - |
| Seed1.8 Model Card: Towards Generalized Real-World Agency | Bytedance Seed | 2025-12-18 | - | - |
| Introducing GPT-5.2 | OpenAI | 2025-12-11 | - | - |
| Introducing Mistral 3 | Blog | 2025-12-02 | Huggingface | - |
| Qwen3-VL Technical Report | arXiv | 2025-11-26 | Github | Demo |
| Emu3.5: Native Multimodal Models are World Learners | arXiv | 2025-10-30 | Github | - |
| VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting | arXiv | 2025-10-21 | Github | Local Demo |
| DeepSeek-OCR: Contexts Optical Compression | arXiv | 2025-10-21 | Github | - |
| OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | arXiv | 2025-10-17 | Github | - |
| NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching | arXiv | 2025-10-16 | - | - |
| InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | arXiv | 2025-10-15 | Github | - |
| VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | arXiv | 2025-10-10 | Github | - |
| LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | arXiv | 2025-10-09 | Github | Demo |
| Qwen3-Omni Technical Report | arXiv | 2025-09-22 | Github | Demo |
| InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | arXiv | 2025-08-27 | Github | Demo |
| MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone | - | 2025-08-26 | Github | Demo |
| Thyme: Think Beyond Images | arXiv | 2025-08-18 | Github | Demo |
| Introducing GPT-5 | OpenAI | 2025-08-07 | - | - |
| dots.vlm1 | rednote-hilab | 2025-08-06 | Github | Demo |
| Step3: Cost-Effective Multimodal Intelligence | StepFun | 2025-07-31 | Github | Demo |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | arXiv | 2025-07-02 | Github | Demo |
| DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | arXiv | 2025-06-30 | Github | - |
| Qwen VLo: From "Understanding" the World to "Depicting" It | Qwen | 2025-06-26 | - | Demo |
| MMSearch-R1: Incentivizing LMMs to Search | arXiv | 2025-06-25 | Github | - |
| Show-o2: Improved Native Unified Multimodal Models | arXiv | 2025-06-18 | Github | - |
| Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities | Google | 2025-06-17 | - | - |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | arXiv | 2025-06-16 | Github | - |
| MiMo-VL Technical Report | arXiv | 2025-06-04 | Github | - |
| OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv | 2025-05-29 | Github | - |
| Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025-05-23 | Github | Demo |
| MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025-05-21 | Github | Demo |
| UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | arXiv | 2025-05-20 | - | - |
| BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025-05-14 | Github | Local Demo |
| Seed1.5-VL Technical Report | arXiv | 2025-05-11 | - | - |
| Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models | arXiv | 2025-05-08 | Github | - |
| VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | arXiv | 2025-05-06 | Github | Local Demo |
| Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | arXiv | 2025-04-23 | Github | - |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | arXiv | 2025-04-21 | Github | - |
| An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | arXiv | 2025-04-21 | Github | - |
| InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | arXiv | 2025-04-14 | Github | Demo |
| Introducing GPT-4.1 in the API | OpenAI | 2025-04-14 | - | - |
| Kimi-VL Technical Report | arXiv | 2025-04-10 | Github | Demo |
| The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation | Meta | 2025-04-05 | Hugging Face | - |
| Qwen2.5-Omni Technical Report | Qwen | 2025-03-26 | Github | Demo |
| Addendum to GPT-4o System Card: Native image generation | OpenAI | 2025-03-25 | - | - |
| Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation | arXiv | 2025-03-17 | Github | - |
| Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision | arXiv | 2025-03-07 | - | - |
| Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs | arXiv | 2025-03-03 | Hugging Face | Demo |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | arXiv | 2025-02-19 | Github | - |
| Qwen2.5-VL Technical Report | arXiv | 2025-02-19 | Github | Demo |
| Baichuan-Omni-1.5 Technical Report | Tech Report | 2025-01-26 | Github | Local Demo |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | arXiv | 2025-01-10 | Github | - |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv | 2025-01-03 | Github | - |
| QVQ: To See the World with Wisdom | Qwen | 2024-12-25 | Github | Demo |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | arXiv | 2024-12-13 | Github | - |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | arXiv | 2024-12-13 | - | - |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | arXiv | 2024-12-12 | Github | Local Demo |
| StreamChat: Chatting with Streaming Video | arXiv | 2024-12-11 | Coming soon | - |
| CompCap: Improving Multimodal Large Language Models with Composite Captions | arXiv | 2024-12-06 | - | - |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | arXiv | 2024-12-06 | Github | - |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | arXiv | 2024-12-06 | Github | Demo |
| NVILA: Efficient Frontier Visual Language Models | arXiv | 2024-12-05 | Github | Demo |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | arXiv | 2024-12-04 | Github | - |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | arXiv | 2024-11-27 | Github | - |
| ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | arXiv | 2024-11-27 | Github | Local Demo |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | arXiv | 2024-10-22 | Github | Demo |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | arXiv | 2024-10-09 | Github | - |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | arXiv | 2024-10-04 | Github | Local Demo |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | CVPR | 2024-09-26 | Github | Demo |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | arXiv | 2024-09-25 | Huggingface | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | Github | Demo |
| ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding | ICLR | 2024-09-05 | Github | Local Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | Github | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | Github | Demo |
| LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | arXiv | 2024-08-28 | Github | - |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | Github | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | Github | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | Github | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | Github | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | arXiv | 2024-07-10 | Github | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | Github | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | Github | Local Demo |
| DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | AAAI | 2024-06-27 | Github | - |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | Github | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | Github | Local Demo |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | ICML | 2024-06-22 | Github | - |
| TroL: Traversal of Layers for Large Language and Vision Models | EMNLP | 2024-06-18 | Github | Local Demo |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | Github | Local Demo |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | CVPR | 2024-06-17 | Github | Local Demo |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL | 2024-06-15 | Github | Demo |
| Comparison Visual Instruction Tuning | arXiv | 2024-06-13 | Github | Local Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | Github | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | Github | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | Github | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | Github | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | Github | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | Github | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | Github | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | Github | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | Github | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | Github | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | Github | - |
| BRAVE: Broadening the visual encoding of vision-language models | ECCV | 2024-04-10 | - | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | Github | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | Github | - |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | NeurIPS | 2024-04-04 | Github | Local Demo |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| LITA: Language Instructed Temporal-Localization Assistant | arXiv | 2024-03-27 | Github | Local Demo |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | Github | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | Github | Local Demo |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | arXiv | 2024-03-08 | Github | Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | Github | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | Github | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | Github | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | Github | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | Github | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | Github | - |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | ICML | 2024-02-12 | Github | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | Github | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | Github | - |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | NeurIPS | 2024-02-03 | Github | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | Github | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | Github | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | Github | Demo |
| Yi-VL | - | 2024-01-23 | Github | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ACL | 2024-01-04 | Github | Local Demo |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | Github | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| VILA: On Pre-training for Visual Language Models | CVPR | 2023-12-13 | Github | Local Demo |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | Github | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | Github | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | Github | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | Github | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | Github | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | Github | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Github | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | ICLR | 2023-10-20 | Github | - |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | Github | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | GitHub | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |

Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
|:------|:------|:-----|:-----|:-----|
| Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | arXiv | 2024-10-04 | Github | - |
| Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | arXiv | 2024-10-03 | Github | - |
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | arXiv | 2024-09-20 | Link | - |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | arXiv | 2024-08-01 | - | - |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | ECCV | 2024-07-31 | Github | - |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | ICML | 2024-06-24 | Github | - |
| AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | arXiv | 2024-06-18 | Github | - |
| CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | arXiv | 2024-06-04 | Coming soon | - |
| Mitigating Object Hallucination via Data Augmented Contrastive Tuning | arXiv | 2024-05-28 | Coming soon | - |
| VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | arXiv | 2024-05-24 | Coming soon | - |
| Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | arXiv | 2024-04-22 | - | - |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | arXiv | 2024-03-27 | - | - |
| What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | arXiv | 2024-03-20 | Github | - |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | arXiv | 2024-03-13 | - | - |
| Debiasing Multimodal Large Language Models | arXiv | 2024-03-08 | Github | - |
| HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | arXiv | 2024-03-01 | Github | - |
| IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | arXiv | 2024-02-28 | - | - |
| Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | arXiv | 2024-02-22 | Github | - |
| Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | arXiv | 2024-02-18 | Github | - |
| The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | arXiv | 2024-02-06 | Github | - |
| Unified Hallucination Detection for Multimodal Large Language Models | arXiv | 2024-02-05 | Github | - |
| A Survey on Hallucination in Large Vision-Language Models | arXiv | 2024-02-01 | - | - |
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | arXiv | 2024-01-18 | - | - |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | arXiv | 2023-12-12 | Github | - |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | arXiv | 2023-12-06 | Github | - |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | arXiv | 2023-12-04 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | CVPR | 2023-11-29 | Github | - |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | CVPR | 2023-11-28 | Github | - |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | arXiv | 2023-11-28 | Github | Coming soon |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | arXiv | 2023-11-27 | - | - |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | arXiv | 2023-11-22 | Github | - |
| An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | arXiv | 2023-11-13 | Github | - |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | arXiv | 2023-11-02 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | arXiv | 2023-10-09 | - | - |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | arXiv | 2023-10-03 | Github | - |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | ICLR | 2023-10-01 | Github | - |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | arXiv | 2023-09-07 | - | - |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | arXiv | 2023-09-05 | - | - |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | arXiv | 2023-08-29 | Github | - |
| VIGC: Visual Instruction Generation and Correction | arXiv | 2023-08-24 | Github | Demo |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08-11 | - | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ICLR | 2023-06-26 | Github | Demo |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP | 2023-05-17 | Github | - |

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
|:------|:------|:-----|:-----|:-----|
| Visual In-Context Learning for Large Vision-Language Models | arXiv | 2024-02-18 | - | - |
| RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model | RSS | 2024-02-16 | Github | - |
| Can MLLMs Perform Text-to-Image In-Context Learning? | arXiv | 2024-02-02 | Github | - |
| Generative Multimodal Models are In-Context Learners | CVPR | 2023-12-20 | Github | Demo |
| Hijacking Context in Large Multi-modal Models | arXiv | 2023-12-07 | - | - |
| Towards More Unified In-context Visual Understanding | arXiv | 2023-12-05 | - | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | arXiv | 2023-09-14 | Github | Demo |
| Link-Context Learning for Multimodal LLMs | arXiv | 2023-08-15 | Github | Demo |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | arXiv | 2023-08-02 | Github | Demo |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | arXiv | 2023-07-27 | Github | Local Demo |
| Generative Pretraining in Multimodality | ICLR | 2023-07-11 | Github | Demo |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| Exploring Diverse In-Context Configurations for Image Captioning | NeurIPS | 2023-05-24 | Github | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | Github | - |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

TitleVenueDateCodeDemo
Star
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
arXiv2024-11-21Github-
Star
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
arXiv2024-04-24GithubLocal Demo
Star
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
arXiv2024-03-25GithubLocal Demo
Star
Compositional Chain-of-Thought Prompting for Large Multimodal Models
CVPR2023-11-27Github-
Star
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
NeurIPS2023-10-25Github-
Star
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
arXiv2023-06-27GithubDemo
Star
Explainable Multimodal Emotion Reasoning
arXiv2023-06-27Github-
Star
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
arXiv2023-05-24Github-
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and PredictionarXiv2023-05-23--
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question AnsweringarXiv2023-05-05--
Star
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
arXiv2023-05-04GithubDemo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal InfillingsarXiv2023-05-03Coming soon-
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv2023-04-19GithubDemo
Chain of Thought Prompt Tuning in Vision Language ModelsarXiv2023-04-16Coming soon-
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv2023-03-20GithubDemo
Star
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arXiv2023-03-08GithubDemo
Star
Multimodal Chain-of-Thought Reasoning in Language Models
arXiv2023-02-02Github-
Star
Visual Programming: Compositional visual reasoning without training
CVPR2022-11-18GithubLocal Demo
Star
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
NeurIPS2022-09-20Github-

LLM-Aided Visual Reasoning

TitleVenueDateCodeDemo
Star
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
arXiv2025-06-12GithubLocal Demo
Star
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
arXiv2024-03-27Github-
Star
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs
arXiv2023-12-21GithubLocal Demo
Star
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
arXiv2023-11-01GithubDemo
MM-VID: Advancing Video Understanding with GPT-4V(ision)arXiv2023-10-30--
Star
ControlLLM: Augment Language Models with Tools by Searching on Graphs
arXiv2023-10-26Github-
Star
Woodpecker: Hallucination Correction for Multimodal Large Language Models
arXiv2023-10-24GithubDemo
Star
MindAgent: Emergent Gaming Interaction
arXiv2023-09-18Github-
Star
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
arXiv2023-06-28GithubDemo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language ModelsarXiv2023-06-15--
Star
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
arXiv2023-06-14Github-
AVIS: Autonomous Visual Information Seeking with Large Language ModelsarXiv2023-06-13--
Star
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
arXiv2023-05-30GithubDemo
Mindstorms in Natural Language-Based Societies of MindarXiv2023-05-26--
Star
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
arXiv2023-05-24Github-
Star
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
arXiv2023-05-24GithubLocal Demo
Star
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
arXiv2023-05-10Github-
Star
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
arXiv2023-05-04GithubDemo
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv2023-04-19GithubDemo
Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
arXiv2023-03-30GithubDemo
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv2023-03-20GithubDemo
Star
ViperGPT: Visual Inference via Python Execution for Reasoning
arXiv2023-03-14GithubLocal Demo
Star
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
arXiv2023-03-12GithubLocal Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information ExtractionICCV2023-03-09--
Star
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arXiv2023-03-08GithubDemo
Star
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR2023-03-03Github-
Star
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
CVPR2022-12-21GithubDemo
Star
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
arXiv2022-11-28Github-
Star
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
CVPR2022-11-21Github-
Star
Visual Programming: Compositional visual reasoning without training
CVPR2022-11-18GithubLocal Demo
Star
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
arXiv2022-04-01Github-

Foundation Models

TitleVenueDateCodeDemo
Introducing GPT-5OpenAI2025-08-07--
Star
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
arXiv2025-01-22GithubDemo
Star
Emu3: Next-Token Prediction is All You Need
arXiv2024-09-27GithubLocal Demo
Llama 3.2: Revolutionizing edge AI and vision with open, customizable modelsMeta2024-09-25-Demo
Pixtral-12BMistral2024-09-17--
Star
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
arXiv2024-08-16Github-
The Llama 3 Herd of ModelsarXiv2024-07-31--
Chameleon: Mixed-Modal Early-Fusion Foundation ModelsarXiv2024-05-16--
Hello GPT-4oOpenAI2024-05-13--
The Claude 3 Model Family: Opus, Sonnet, HaikuAnthropic2024-03-04--
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of contextGoogle2024-02-15--
Gemini: A Family of Highly Capable Multimodal ModelsGoogle2023-12-06--
Fuyu-8B: A Multimodal Architecture for AI AgentsBlog2023-10-17HuggingfaceDemo
Star
Unified Model for Image, Video, Audio and Language Tasks
arXiv2023-07-30GithubDemo
PaLI-3 Vision Language Models: Smaller, Faster, StrongerarXiv2023-10-13--
GPT-4V(ision) System CardOpenAI2023-09-25--
Star
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
arXiv2023-09-09Github-
Multimodal Foundation Models: From Specialists to General-Purpose AssistantsarXiv2023-09-18--
Star
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
NeurIPS2023-07-13Github-
Star
Generative Pretraining in Multimodality
arXiv2023-07-11GithubDemo
Star
Kosmos-2: Grounding Multimodal Large Language Models to the World
arXiv2023-06-26GithubDemo
Star
Transfer Visual Prompt Generator across LLMs
arXiv2023-05-02GithubDemo
GPT-4 Technical ReportarXiv2023-03-15--
PaLM-E: An Embodied Multimodal Language ModelarXiv2023-03-06-Demo
Star
Prismer: A Vision-Language Model with An Ensemble of Experts
arXiv2023-03-04GithubDemo
Star
Language Is Not All You Need: Aligning Perception with Language Models
arXiv2023-02-27Github-
Star
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
arXiv2023-01-30GithubDemo
Star
VIMA: General Robot Manipulation with Multimodal Prompts
ICML2022-10-06GithubLocal Demo
Star
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
NeurIPS2022-06-17Github-
Star
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
ICLR2022-06-15Github-
Star
Language Models are General-Purpose Interfaces
arXiv2022-06-13Github-

Evaluation

TitleVenueDatePage
Stars
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
arXiv2024-12-18Github
Stars
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
arXiv2024-11-21Github
Stars
OmniBench: Towards The Future of Universal Omni-Language Models
arXiv2024-09-23Github
Stars
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
arXiv2024-08-23Github
Stars
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
TPAMI2023-10-17Github
Stars
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
arXiv2024-06-29Github
Stars
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
arXiv2024-06-28Github
Stars
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
arXiv2024-06-26Github
Stars
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
arXiv2024-04-15Github
Stars
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
arXiv2024-05-31Github
Stars
Benchmarking Large Multimodal Models against Common Corruptions
NAACL2024-01-22Github
Stars
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
arXiv2024-01-11Github
Stars
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
arXiv2023-12-19Github
Stars
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
arXiv2023-12-05Github
Star
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
arXiv2023-11-27Github
Star
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
arXiv2023-11-24Github
Star
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
arXiv2023-11-23Github
VLM-Eval: A General Evaluation on Video Large Language ModelsarXiv2023-11-20Coming soon
Star
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
arXiv2023-11-06Github
Star
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
arXiv2023-11-09Github
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the LeadarXiv2023-11-05-
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical ImagingarXiv2023-10-31-
Star
An Early Evaluation of GPT-4V(ision)
arXiv2023-10-25Github
Star
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation
arXiv2023-10-25Github
Star
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
CVPR2023-10-23Github
Star
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
ICLR2023-10-03Github
Star
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
arXiv2023-10-02Github
Star
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
arXiv2023-10-01Github
Star
Can We Edit Multimodal Large Language Models?
arXiv2023-10-12Github
Star
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
arXiv2023-10-10Github
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)arXiv2023-09-29-
Star
TouchStone: Evaluating Vision-Language Models by Language Models
arXiv2023-08-31Github
Star
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
arXiv2023-08-31Github
Star
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
arXiv2023-08-07Github
Star
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
arXiv2023-08-07Github
Star
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
arXiv2023-08-04Github
Star
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
CVPR2023-07-30Github
Star
MMBench: Is Your Multi-modal Model an All-around Player?
arXiv2023-07-12Github
Star
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
arXiv2023-06-23Github
Star
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
arXiv2023-06-15Github
Star
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
arXiv2023-06-11Github
Star
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
arXiv2023-06-08Github
Star
On The Hidden Mystery of OCR in Large Multimodal Models
arXiv2023-05-13Github

Multimodal RLHF

TitleVenueDateCodeDemo
Star
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
arXiv2025-05-09Github-
Star
Aligning Multimodal LLM with Human Preference: A Survey
arXiv2025-03-23Github-
Star
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv2025-02-14Github-
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference OptimizationarXiv2024-10-09--
Star
Silkie: Preference Distillation for Large Visual Language Models
arXiv2023-12-17Github-
Star
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
arXiv2023-12-01GithubDemo
Star
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv2023-09-25GithubDemo
Star
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
arXiv2024-08-22Github-

Others

TitleVenueDateCodeDemo
Star
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
arXiv2024-11-17Github-
Star
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
arXiv2024-02-03Github-
Star
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
arXiv2023-12-21GithubLocal Demo
Star
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
arXiv2023-12-07Github-
Star
Planting a SEED of Vision in Large Language Model
arXiv2023-07-16Github-
Star
Can Large Pre-trained Models Help Vision Models on Perception Tasks?
arXiv2023-06-01Github-
Star
Contextual Object Detection with Multimodal Large Language Models
arXiv2023-05-29GithubDemo
Star
Generating Images with Multimodal Language Models
arXiv2023-05-26Github-
Star
On Evaluating Adversarial Robustness of Large Vision-Language Models
arXiv2023-05-26Github-
Star
Grounding Language Models to Images for Multimodal Inputs and Outputs
ICML2023-01-31GithubDemo

Awesome Datasets

Datasets of Pre-Training for Alignment

NamePaperTypeModalities
ShareGPT4VideoShareGPT4Video: Improving Video Understanding and Generation with Better CaptionsCaptionVideo-Text
COYO-700MCOYO-700M: Image-Text Pair DatasetCaptionImage-Text
ShareGPT4VShareGPT4V: Improving Large Multi-Modal Models with Better CaptionsCaptionImage-Text
AS-1BThe All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open WorldHybridImage-Text
InternVidInternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and GenerationCaptionVideo-Text
MS-COCOMicrosoft COCO: Common Objects in ContextCaptionImage-Text
SBU CaptionsIm2Text: Describing Images Using 1 Million Captioned PhotographsCaptionImage-Text
Conceptual CaptionsConceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image CaptioningCaptionImage-Text
LAION-400MLAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text PairsCaptionImage-Text
VG CaptionsVisual Genome: Connecting Language and Vision Using Crowdsourced Dense Image AnnotationsCaptionImage-Text
Flickr30kFlickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence ModelsCaptionImage-Text
AI-CapsAI Challenger : A Large-scale Dataset for Going Deeper in Image UnderstandingCaptionImage-Text
Wukong CaptionsWukong: A 100 Million Large-scale Chinese Cross-modal Pre-training BenchmarkCaptionImage-Text
GRITKosmos-2: Grounding Multimodal Large Language Models to the WorldCaptionImage-Text-Bounding-Box
Youku-mPLUGYouku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and BenchmarksCaptionVideo-Text
MSR-VTTMSR-VTT: A Large Video Description Dataset for Bridging Video and LanguageCaptionVideo-Text
Webvid10MFrozen in Time: A Joint Video and Image Encoder for End-to-End RetrievalCaptionVideo-Text
WavCapsWavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal ResearchCaptionAudio-Text
AISHELL-1AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baselineASRAudio-Text
AISHELL-2AISHELL-2: Transforming Mandarin ASR Research Into Industrial ScaleASRAudio-Text
VSDial-CNX-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign LanguagesASRImage-Audio-Text

Datasets of Multimodal Instruction Tuning

NamePaperLinkNotes
Inst-IT DatasetInst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction TuningLinkAn instruction-tuning dataset which contains fine-grained multi-level annotations for 21k videos and 51k images
E.T. Instruct 164KE.T. Bench: Towards Open-Ended Event-Level Video-Language UnderstandingLinkAn instruction-tuning dataset for time-sensitive video understanding
MSQAMulti-modal Situated Reasoning in 3D ScenesLinkA large-scale dataset for multi-modal situated reasoning in 3D scenes
MM-EvolMMEvol: Empowering Multimodal Large Language Models with Evol-InstructLinkAn instruction dataset with rich diversity
UNK-VQAUNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large ModelsLinkA dataset designed to teach models to refrain from answering unanswerable questions
VEGAVEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large ModelsLinkA dataset for enhancing model capabilities in comprehension of interleaved information
ALLaVA-4VALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language ModelLinkVision and language caption and instruction dataset generated by GPT4V
IDKVisually Dehallucinative Instruction Generation: Know What You Don't KnowLinkDehallucinative visual instruction for "I Know" hallucination
CAP2QAVisually Dehallucinative Instruction GenerationLinkImage-aligned visual instruction dataset
M3DBenchM3DBench: Let's Instruct Large Models with Multi-modal 3D PromptsLinkA large-scale 3D instruction tuning dataset
ViP-LLaVA-InstructMaking Large Multimodal Models Understand Arbitrary Visual PromptsLinkA mixture of LLaVA-1.5 instruction data and the region-level visual prompting data
LVIS-Instruct4VTo See is to Believe: Prompting GPT-4V for Better Visual Instruction TuningLinkA visual instruction dataset via self-instruction from GPT-4V
ComVintWhat Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction TuningLinkA synthetic instruction dataset for complex visual reasoning
SparklesDialogue✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following ModelsLinkA machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, designed to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns
StableLLaVAStableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue DataLinkA cheap and effective approach to collect visual instruction tuning data
M-HalDetectDetecting and Preventing Hallucinations in Large Vision Language ModelsComing soonA dataset used to train and benchmark models for hallucination detection and prevention
MGVLIDChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning-A high-quality instruction-tuning dataset including image-text and region-text pairs
BuboGPTBuboGPT: Enabling Visual Grounding in Multi-Modal LLMsLinkA high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data
SVITSVIT: Scaling up Visual Instruction TuningLinkA large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs
mPLUG-DocOwlmPLUG-DocOwl: Modularized Multimodal Large Language Model for Document UnderstandingLinkAn instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding
PF-1MVisual Instruction Tuning with Polite FlamingoLinkA collection of 37 vision-language datasets with responses rewritten by Polite Flamingo.
ChartLlamaChartLlama: A Multimodal LLM for Chart Understanding and GenerationLinkA multi-modal instruction-tuning dataset for chart understanding and generation
LLaVARLLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image UnderstandingLinkA visual instruction-tuning dataset for Text-rich Image Understanding
MotionGPTMotionGPT: Human Motion as a Foreign LanguageLinkAn instruction-tuning dataset covering multiple human motion-related tasks
LRV-InstructionMitigating Hallucination in Large Multi-Modal Models via Robust Instruction TuningLinkVisual instruction tuning dataset for addressing hallucination issue
Macaw-LLMMacaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text IntegrationLinkA large-scale multi-modal instruction dataset featuring multi-turn dialogues
LAMM-DatasetLAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and BenchmarkLinkA comprehensive multi-modal instruction tuning dataset
Video-ChatGPTVideo-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language ModelsLink100K high-quality video instruction dataset
MIMIC-ITMIMIC-IT: Multi-Modal In-Context Instruction TuningLinkMultimodal in-context instruction tuning
M3ITM3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction TuningLinkLarge-scale, broad-coverage multimodal instruction tuning dataset
LLaVA-MedLLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One DayComing soonA large-scale, broad-coverage biomedical instruction-following dataset
GPT4ToolsGPT4Tools: Teaching Large Language Model to Use Tools via Self-instructionLinkTool-related instruction datasets
MULTISChatBridge: Bridging Modalities with Large Language Model as a Language CatalystComing soonMultimodal instruction tuning dataset covering 16 multimodal tasks
DetGPTDetGPT: Detect What You Need via ReasoningLinkInstruction-tuning dataset with 5000 images and around 30000 query-answer pairs
PMC-VQAPMC-VQA: Visual Instruction Tuning for Medical Visual Question AnsweringComing soonLarge-scale medical visual question-answering dataset
VideoChatVideoChat: Chat-Centric Video UnderstandingLinkVideo-centric multimodal instruction dataset
X-LLMX-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign LanguagesLinkChinese multimodal instruction dataset
LMEyeLMEye: An Interactive Perception Network for Large Language ModelsLinkA multi-modal instruction-tuning dataset
cc-sbu-alignMiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language ModelsLinkA multimodal aligned dataset for improving the model's usability and generation fluency
LLaVA-Instruct-150KVisual Instruction TuningLinkMultimodal instruction-following data generated by GPT
MultiInstructMultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction TuningLinkThe first multimodal instruction tuning benchmark dataset

Datasets of In-Context Learning

NamePaperLinkNotes
MICMMICL: Empowering Vision-language Model with Multi-Modal In-Context LearningLinkA manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs.
MIMIC-ITMIMIC-IT: Multi-Modal In-Context Instruction TuningLinkMultimodal in-context instruction dataset

Datasets of Multimodal Chain-of-Thought

NamePaperLinkNotes
EMERExplainable Multimodal Emotion ReasoningComing soonA benchmark dataset for explainable emotion reasoning task
EgoCOTEmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of ThoughtComing soonLarge-scale embodied planning dataset
VIPLet’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and PredictionComing soonAn inference-time dataset that can be used to evaluate VideoCOT
ScienceQALearn to Explain: Multimodal Reasoning via Thought Chains for Science Question AnsweringLinkLarge-scale multi-choice dataset, featuring multimodal science questions and diverse domains

Datasets of Multimodal RLHF

NamePaperLinkNotes
VLFeedbackSilkie: Preference Distillation for Large Visual Language ModelsLinkA vision-language feedback dataset annotated by AI

Benchmarks for Evaluation

NamePaperLinkNotes
Inst-IT BenchInst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction TuningLinkA benchmark to evaluate fine-grained instance-level understanding in images and videos
M3CoTM3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-ThoughtLinkA multi-domain, multi-step benchmark for multimodal CoT
MMGenBenchMMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation PerspectiveLinkA benchmark that gauges the performance of constructing image-generation prompts given an input image
MiCEvalMiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning StepsLinkA multimodal CoT benchmark to evaluate MLLMs' reasoning capabilities
LiveXivLiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers ContentLinkA live benchmark based on arXiv papers
TemporalBenchTemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video ModelsLinkA benchmark for evaluation of fine-grained temporal understanding
OmniBenchOmniBench: Towards The Future of Universal Omni-Language ModelsLinkA benchmark that evaluates models' capabilities of processing visual, acoustic, and textual inputs simultaneously
MME-RealWorldMME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?LinkA challenging benchmark that involves real-life scenarios
VELOCITIVELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?LinkA video benchmark that evaluates perception and binding capabilities
MMRSeeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading QuestionsLinkA benchmark for measuring MLLMs' understanding capability and robustness to leading questions
CharXivCharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMsLinkChart understanding benchmark curated by human experts
Video-MMEVideo-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video AnalysisLinkA comprehensive evaluation benchmark of Multi-modal LLMs in video analysis
VL-ICL BenchVL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context LearningLinkA benchmark for M-ICL evaluation, covering a wide spectrum of tasks
TempCompassTempCompass: Do Video LLMs Really Understand Videos?LinkA benchmark to evaluate the temporal perception ability of Video LLMs
GVLQAGITA: Graph to Visual and Textual Integration for Vision-Language Graph ReasoningLinkA benchmark for evaluation of graph reasoning capabilities
CoBSATCan MLLMs Perform Text-to-Image In-Context Learning?LinkA benchmark for text-to-image ICL
VQAv2-IDKVisually Dehallucinative Instruction Generation: Know What You Don't KnowLinkA benchmark for assessing "I Know" visual hallucination
Math-VisionMeasuring Multimodal Mathematical Reasoning with MATH-Vision DatasetLinkA diverse mathematical reasoning benchmark
SciMMIRSciMMIR: Benchmarking Scientific Multi-modal Information RetrievalLink
CMMMUCMMMU: A Chinese Massive Multi-discipline Multimodal Understanding BenchmarkLinkA Chinese benchmark involving reasoning and knowledge across multiple disciplines
MMCBenchBenchmarking Large Multimodal Models against Common CorruptionsLinkA benchmark for examining self-consistency under common corruptions
MMVPEyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMsLinkA benchmark for assessing visual capabilities
TimeITTimeChat: A Time-sensitive Multimodal Large Language Model for Long Video UnderstandingLinkA video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks.
ViP-BenchMaking Large Multimodal Models Understand Arbitrary Visual PromptsLinkA benchmark for visual prompts
M3DBenchM3DBench: Let's Instruct Large Models with Multi-modal 3D PromptsLinkA 3D-centric benchmark
Video-BenchVideo-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language ModelsLinkA benchmark for video-MLLM evaluation
Charting-New-TerritoriesCharting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMsLinkA benchmark for evaluating geographic and geospatial capabilities
MLLM-BenchMLLM-Bench, Evaluating Multi-modal LLMs using GPT-4VLinkGPT-4V evaluation with per-sample criteria
BenchLMMBenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal ModelsLinkA benchmark for assessment of the robustness against different image styles
MMC-BenchmarkMMC: Advancing Multimodal Chart Understanding with Large-scale Instruction TuningLinkA comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts
MVBenchMVBench: A Comprehensive Multi-modal Video Understanding BenchmarkLinkA comprehensive multimodal benchmark for video understanding
BingoHolistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference ChallengesLinkA benchmark for hallucination evaluation that focuses on two common types
MagnifierBenchOtterHD: A High-Resolution Multi-modality ModelLinkA benchmark designed to probe models' fine-grained perception abilities
HallusionBenchHallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality ModelsLinkAn image-context reasoning benchmark for evaluation of hallucination
PCA-EVALTowards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and BeyondLinkA benchmark for evaluating multi-domain embodied decision-making.
MMHal-BenchAligning Large Multimodal Models with Factually Augmented RLHFLinkA benchmark for hallucination evaluation
MathVistaMathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal ModelsLinkA benchmark that challenges both visual and math reasoning capabilities
SparklesEval✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following ModelsLinkA GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria.
ISEKAILink-Context Learning for Multimodal LLMsLinkA benchmark comprising exclusively of unseen generated image-label pairs designed for link-context learning
M-HalDetectDetecting and Preventing Hallucinations in Large Vision Language ModelsComing soonA dataset used to train and benchmark models for hallucination detection and prevention
I4Empowering Vision-Language Models to Follow Interleaved Vision-Language InstructionsLinkA benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions
SciGraphQASciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific GraphsLinkA large-scale chart-visual question-answering dataset
MM-VetMM-Vet: Evaluating Large Multimodal Models for Integrated CapabilitiesLinkAn evaluation benchmark that examines large multimodal models on complicated multimodal tasks
SEED-BenchSEED-Bench: Benchmarking Multimodal LLMs with Generative ComprehensionLinkA benchmark for evaluation of generative comprehension in MLLMs
MMBenchMMBench: Is Your Multi-modal Model an All-around Player?LinkA systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models
LynxWhat Matters in Training a GPT4-Style Language Model with Multimodal Inputs?LinkA comprehensive evaluation benchmark including both image and video tasks
GAVIEMitigating Hallucination in Large Multi-Modal Models via Robust Instruction TuningLinkA benchmark to evaluate the hallucination and instruction following ability
MMEMME: A Comprehensive Evaluation Benchmark for Multimodal Large Language ModelsLinkA comprehensive MLLM Evaluation benchmark
LVLM-eHubLVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language ModelsLinkAn evaluation platform for MLLMs
LAMM-BenchmarkLAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and BenchmarkLinkA benchmark for evaluating the quantitative performance of MLLMs on various2D/3D vision tasks
M3ExamM3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language ModelsLinkA multilingual, multimodal, multilevel benchmark for evaluating MLLM
OwlEvalmPLUG-Owl: Modularization Empowers Large Language Models with MultimodalityLinkDataset for evaluation on multiple capabilities

### Others

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | A multimodal dialogue dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset that focuses on recognizing visual entities on Wikipedia from images in the wild |