Awesome-Multimodal-Large-Language-Models

May 1, 2026

✨ Highlights of NJU-MiG

🔥🔥 Surveys of MLLMs | 💬 WeChat (MLLM discussion group)

  • 🌟 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
    arXiv 2025, Paper, Project

  • 🌟 A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges
    arXiv 2025, Paper, Project

  • A Survey on Multimodal Large Language Models
    NSR 2024, Paper, Project


🔥🔥 VITA Series Omni MLLMs | 💬 WeChat (VITA discussion group)

  • VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
    NeurIPS 2025 Highlight, Paper, Project

  • VITA: Towards Open-Source Interactive Omni Multimodal LLM
    arXiv 2024, Paper, Project

  • VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
    NeurIPS 2025, Paper, Project


🔥🔥 MME Series MLLM Benchmarks

  • 🔥 Video-MME-v2: Towards the Next Stage in Video Understanding Evaluation
    arXiv 2026, Paper, Project, Dataset, Leaderboard

  • 🌟 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
    arXiv 2025, Paper, Project

  • MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
    NeurIPS 2025 DB Highlight, Paper, Dataset, Eval Tool, ✒️ Citation

  • Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
    CVPR 2025, Paper, Project, Dataset


Table of Contents

  • Awesome Papers
      • Multimodal Instruction Tuning (& Latest Works)
      • Multimodal Hallucination
      • Multimodal In-Context Learning
      • Multimodal Chain-of-Thought

Awesome Papers

Multimodal Instruction Tuning (& Latest Works)

| Title | Venue | Date | Code | Demo |
|:------|:------|:-----|:-----|:-----|
| DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence | DeepSeek | 2026-04-24 | Huggingface | - |
| Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model | Blog | 2026-04-22 | Huggingface | Demo |
| Xiaomi MiMo-V2.5 | Blog | 2026-04-22 | - | Demo |
| Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding | arXiv | 2026-04-06 | Github | Demo |
| Introducing Muse Spark: Scaling Towards Personal Superintelligence | Blog | 2026-04-08 | - | Demo |
| VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing | arXiv | 2026-04-03 | Github | Local Demo |
| Gemma 4: Byte for byte, the most capable open models | Blog | 2026-04-02 | - | Demo |
| Qwen3.6-Plus: Towards Real World Agents | Blog | 2026-04-02 | - | - |
| Qwen3.5-Omni: Scaling Up, Toward Native Omni-Modal AGI | Blog | 2026-03-30 | - | Demo |
| Xiaomi MiMo-V2-Omni | Blog | 2026-03-18 | - | - |
| InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing | arXiv | 2026-03-10 | Github | Local Demo |
| Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | arXiv | 2026-03-06 | Github | - |
| Beyond Language Modeling: An Exploration of Multimodal Pretraining | arXiv | 2026-03-03 | - | - |
| Gemini 3.1 Pro: A smarter model for your most complex tasks | Blog | 2026-02-19 | - | - |
| Qwen3.5: Towards Native Multimodal Agents | Blog | 2026-02-16 | Github | Demo |
| MiniCPM-o 4.5 | Blog | 2026-02-06 | Github | Demo |
| Kimi K2.5: Visual Agentic Intelligence | arXiv | 2026-02-02 | Github | - |
| DeepSeek-OCR 2: Visual Causal Flow | DeepSeek | 2026-01-27 | Github | - |
| Seed1.8 Model Card: Towards Generalized Real-World Agency | Bytedance Seed | 2025-12-18 | - | - |
| Introducing GPT-5.2 | OpenAI | 2025-12-11 | - | - |
| Introducing Mistral 3 | Blog | 2025-12-02 | Huggingface | - |
| Qwen3-VL Technical Report | arXiv | 2025-11-26 | Github | Demo |
| Emu3.5: Native Multimodal Models are World Learners | arXiv | 2025-10-30 | Github | - |
| VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting | arXiv | 2025-10-21 | Github | Local Demo |
| DeepSeek-OCR: Contexts Optical Compression | arXiv | 2025-10-21 | Github | - |
| OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | arXiv | 2025-10-17 | Github | - |
| NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching | arXiv | 2025-10-16 | - | - |
| InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | arXiv | 2025-10-15 | Github | - |
| VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | arXiv | 2025-10-10 | Github | - |
| LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | arXiv | 2025-10-09 | Github | Demo |
| Qwen3-Omni Technical Report | arXiv | 2025-09-22 | Github | Demo |
| InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | arXiv | 2025-08-27 | Github | Demo |
| MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone | - | 2025-08-26 | Github | Demo |
| Thyme: Think Beyond Images | arXiv | 2025-08-18 | Github | Demo |
| Introducing GPT-5 | OpenAI | 2025-08-07 | - | - |
| dots.vlm1 | rednote-hilab | 2025-08-06 | Github | Demo |
| Step3: Cost-Effective Multimodal Intelligence | StepFun | 2025-07-31 | Github | Demo |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | arXiv | 2025-07-02 | Github | Demo |
| DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | arXiv | 2025-06-30 | Github | - |
| Qwen VLo: From "Understanding" the World to "Depicting" It | Qwen | 2025-06-26 | - | Demo |
| MMSearch-R1: Incentivizing LMMs to Search | arXiv | 2025-06-25 | Github | - |
| Show-o2: Improved Native Unified Multimodal Models | arXiv | 2025-06-18 | Github | - |
| Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities | Google | 2025-06-17 | - | - |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | arXiv | 2025-06-16 | Github | - |
| MiMo-VL Technical Report | arXiv | 2025-06-04 | Github | - |
| OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv | 2025-05-29 | Github | - |
| Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025-05-23 | Github | Demo |
| MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025-05-21 | Github | Demo |
| UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | arXiv | 2025-05-20 | - | - |
| BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025-05-14 | Github | Local Demo |
| Seed1.5-VL Technical Report | arXiv | 2025-05-11 | - | - |
| Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models | arXiv | 2025-05-08 | Github | - |
| VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | arXiv | 2025-05-06 | Github | Local Demo |
| Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | arXiv | 2025-04-23 | Github | - |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | arXiv | 2025-04-21 | Github | - |
| An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | arXiv | 2025-04-21 | Github | - |
| InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | arXiv | 2025-04-14 | Github | Demo |
| Introducing GPT-4.1 in the API | OpenAI | 2025-04-14 | - | - |
| Kimi-VL Technical Report | arXiv | 2025-04-10 | Github | Demo |
| The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation | Meta | 2025-04-05 | Hugging Face | - |
| Qwen2.5-Omni Technical Report | Qwen | 2025-03-26 | Github | Demo |
| Addendum to GPT-4o System Card: Native image generation | OpenAI | 2025-03-25 | - | - |
| Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation | arXiv | 2025-03-17 | Github | - |
| Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision | arXiv | 2025-03-07 | - | - |
| Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs | arXiv | 2025-03-03 | Hugging Face | Demo |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | arXiv | 2025-02-19 | Github | - |
| Qwen2.5-VL Technical Report | arXiv | 2025-02-19 | Github | Demo |
| Baichuan-Omni-1.5 Technical Report | Tech Report | 2025-01-26 | Github | Local Demo |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | arXiv | 2025-01-10 | Github | - |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv | 2025-01-03 | Github | - |
| QVQ: To See the World with Wisdom | Qwen | 2024-12-25 | Github | Demo |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | arXiv | 2024-12-13 | Github | - |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | arXiv | 2024-12-13 | - | - |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | arXiv | 2024-12-12 | Github | Local Demo |
| StreamChat: Chatting with Streaming Video | arXiv | 2024-12-11 | Coming soon | - |
| CompCap: Improving Multimodal Large Language Models with Composite Captions | arXiv | 2024-12-06 | - | - |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | arXiv | 2024-12-06 | Github | - |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | arXiv | 2024-12-06 | Github | Demo |
| NVILA: Efficient Frontier Visual Language Models | arXiv | 2024-12-05 | Github | Demo |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | arXiv | 2024-12-04 | Github | - |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | arXiv | 2024-11-27 | Github | - |
| ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | arXiv | 2024-11-27 | Github | Local Demo |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | arXiv | 2024-10-22 | Github | Demo |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | arXiv | 2024-10-09 | Github | - |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | arXiv | 2024-10-04 | Github | Local Demo |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | CVPR | 2024-09-26 | Github | Demo |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | arXiv | 2024-09-25 | Huggingface | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | Github | Demo |
| ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding | ICLR | 2024-09-05 | Github | Local Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | Github | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | Github | Demo |
| LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | arXiv | 2024-08-28 | Github | - |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | Github | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | Github | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | Github | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | Github | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | arXiv | 2024-07-10 | Github | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | Github | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | Github | Local Demo |
| DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | AAAI | 2024-06-27 | Github | - |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | Github | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | Github | Local Demo |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | ICML | 2024-06-22 | Github | - |
| TroL: Traversal of Layers for Large Language and Vision Models | EMNLP | 2024-06-18 | Github | Local Demo |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | Github | Local Demo |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | CVPR | 2024-06-17 | Github | Local Demo |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL | 2024-06-15 | Github | Demo |
| Comparison Visual Instruction Tuning | arXiv | 2024-06-13 | Github | Local Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | Github | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | Github | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | Github | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | Github | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | Github | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | Github | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | Github | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | Github | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | Github | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | Github | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | Github | - |
| BRAVE: Broadening the visual encoding of vision-language models | ECCV | 2024-04-10 | - | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | Github | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | Github | - |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | NeurIPS | 2024-04-04 | Github | Local Demo |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| LITA: Language Instructed Temporal-Localization Assistant | arXiv | 2024-03-27 | Github | Local Demo |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | Github | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | Github | Local Demo |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | arXiv | 2024-03-08 | Github | Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | Github | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | Github | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | Github | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | Github | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | Github | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | Github | - |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | ICML | 2024-02-12 | Github | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | Github | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | Github | - |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | NeurIPS | 2024-02-03 | Github | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | Github | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | Github | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | Github | Demo |
| Yi-VL | - | 2024-01-23 | Github | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ACL | 2024-01-04 | Github | Local Demo |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | Github | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| VILA: On Pre-training for Visual Language Models | CVPR | 2023-12-13 | Github | Local Demo |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | Github | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | Github | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | Github | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | Github | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | Github | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | Github | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Github | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | ICLR | 2023-10-20 | Github | - |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | Github | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | GitHub | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |

Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
|:------|:------|:-----|:-----|:-----|
| Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | arXiv | 2024-10-04 | Github | - |
| Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | arXiv | 2024-10-03 | Github | - |
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | arXiv | 2024-09-20 | Link | - |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | arXiv | 2024-08-01 | - | - |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | ECCV | 2024-07-31 | Github | - |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | ICML | 2024-06-24 | Github | - |
| AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | arXiv | 2024-06-18 | Github | - |
| CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | arXiv | 2024-06-04 | Coming soon | - |
| Mitigating Object Hallucination via Data Augmented Contrastive Tuning | arXiv | 2024-05-28 | Coming soon | - |
| VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | arXiv | 2024-05-24 | Coming soon | - |
| Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | arXiv | 2024-04-22 | - | - |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | arXiv | 2024-03-27 | - | - |
| What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | arXiv | 2024-03-20 | Github | - |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | arXiv | 2024-03-13 | - | - |
| Debiasing Multimodal Large Language Models | arXiv | 2024-03-08 | Github | - |
| HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | arXiv | 2024-03-01 | Github | - |
| IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | arXiv | 2024-02-28 | - | - |
| Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | arXiv | 2024-02-22 | Github | - |
| Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | arXiv | 2024-02-18 | Github | - |
| The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | arXiv | 2024-02-06 | Github | - |
| Unified Hallucination Detection for Multimodal Large Language Models | arXiv | 2024-02-05 | Github | - |
| A Survey on Hallucination in Large Vision-Language Models | arXiv | 2024-02-01 | - | - |
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | arXiv | 2024-01-18 | - | - |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | arXiv | 2023-12-12 | Github | - |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | arXiv | 2023-12-06 | Github | - |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | arXiv | 2023-12-04 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | CVPR | 2023-11-29 | Github | - |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | CVPR | 2023-11-28 | Github | - |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | arXiv | 2023-11-28 | Github | Coming soon |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | arXiv | 2023-11-27 | - | - |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | arXiv | 2023-11-22 | Github | - |
| An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | arXiv | 2023-11-13 | Github | - |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | arXiv | 2023-11-02 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | arXiv | 2023-10-09 | - | - |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | arXiv | 2023-10-03 | Github | - |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | ICLR | 2023-10-01 | Github | - |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | arXiv | 2023-09-07 | - | - |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | arXiv | 2023-09-05 | - | - |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | arXiv | 2023-08-29 | Github | - |
| VIGC: Visual Instruction Generation and Correction | arXiv | 2023-08-24 | Github | Demo |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08-11 | - | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ICLR | 2023-06-26 | Github | Demo |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP | 2023-05-17 | Github | - |

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
|:------|:------|:-----|:-----|:-----|
| Visual In-Context Learning for Large Vision-Language Models | arXiv | 2024-02-18 | - | - |
| RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model | RSS | 2024-02-16 | Github | - |
| Can MLLMs Perform Text-to-Image In-Context Learning? | arXiv | 2024-02-02 | Github | - |
| Generative Multimodal Models are In-Context Learners | CVPR | 2023-12-20 | Github | Demo |
| Hijacking Context in Large Multi-modal Models | arXiv | 2023-12-07 | - | - |
| Towards More Unified In-context Visual Understanding | arXiv | 2023-12-05 | - | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | arXiv | 2023-09-14 | Github | Demo |
| Link-Context Learning for Multimodal LLMs | arXiv | 2023-08-15 | Github | Demo |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | arXiv | 2023-08-02 | Github | Demo |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | arXiv | 2023-07-27 | Github | Local Demo |
| Generative Pretraining in Multimodality | ICLR | 2023-07-11 | Github | Demo |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| Exploring Diverse In-Context Configurations for Image Captioning | NeurIPS | 2023-05-24 | Github | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | Github | - |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

TitleVenueDateCodeDemo
Star
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
arXiv2024-11-21Github-
Star
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
arXiv2024-04-24GithubLocal Demo
Star
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models
arXiv2024-03-25GithubLocal Demo
Star
Compositional Chain-of-Thought Prompting for Large Multimodal Models
CVPR2023-11-27Github-
Star
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
NeurIPS2023-10-25Github-
Star
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
arXiv2023-06-27GithubDemo
Star
Explainable Multimodal Emotion Reasoning
arXiv2023-06-27Github-
Star
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
arXiv2023-05-24Github-
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and PredictionarXiv2023-05-23--
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question AnsweringarXiv2023-05-05--
Star
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
arXiv2023-05-04GithubDemo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal InfillingsarXiv2023-05-03Coming soon-
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv2023-04-19GithubDemo
Chain of Thought Prompt Tuning in Vision Language ModelsarXiv2023-04-16Coming soon-
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv2023-03-20GithubDemo
Star
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arXiv2023-03-08GithubDemo
Star
Multimodal Chain-of-Thought Reasoning in Language Models
arXiv2023-02-02Github-
Star
Visual Programming: Compositional visual reasoning without training
CVPR2022-11-18GithubLocal Demo
Star
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
NeurIPS2022-09-20Github-

LLM-Aided Visual Reasoning

TitleVenueDateCodeDemo
Star
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
arXiv2025-06-12GithubLocal Demo
Star
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
arXiv2024-03-27Github-
Star
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs
arXiv2023-12-21GithubLocal Demo
Star
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
arXiv2023-11-01GithubDemo
MM-VID: Advancing Video Understanding with GPT-4V(ision)arXiv2023-10-30--
Star
ControlLLM: Augment Language Models with Tools by Searching on Graphs
arXiv2023-10-26Github-
Star
Woodpecker: Hallucination Correction for Multimodal Large Language Models
arXiv2023-10-24GithubDemo
Star
MindAgent: Emergent Gaming Interaction
arXiv2023-09-18Github-
Star
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
arXiv2023-06-28GithubDemo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language ModelsarXiv2023-06-15--
Star
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
arXiv2023-06-14Github-
AVIS: Autonomous Visual Information Seeking with Large Language ModelsarXiv2023-06-13--
Star
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
arXiv2023-05-30GithubDemo
Mindstorms in Natural Language-Based Societies of MindarXiv2023-05-26--
Star
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
arXiv2023-05-24Github-
Star
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
arXiv2023-05-24GithubLocal Demo
Star
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
arXiv2023-05-10Github-
Star
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
arXiv2023-05-04GithubDemo
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv2023-04-19GithubDemo
Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
arXiv2023-03-30GithubDemo
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv2023-03-20GithubDemo
Star
ViperGPT: Visual Inference via Python Execution for Reasoning
arXiv2023-03-14GithubLocal Demo
Star
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
arXiv2023-03-12GithubLocal Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information ExtractionICCV2023-03-09--
Star
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arXiv2023-03-08GithubDemo
Star
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR2023-03-03Github-
Star
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
CVPR2022-12-21GithubDemo
Star
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
arXiv2022-11-28Github-
Star
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
CVPR2022-11-21Github-
Star
Visual Programming: Compositional visual reasoning without training
CVPR2022-11-18GithubLocal Demo
Star
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
arXiv2022-04-01Github-

Foundation Models

TitleVenueDateCodeDemo
Introducing GPT-5OpenAI2025-08-07--
Star
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
arXiv2025-01-22GithubDemo
Star
Emu3: Next-Token Prediction is All You Need
arXiv2024-09-27GithubLocal Demo
Llama 3.2: Revolutionizing edge AI and vision with open, customizable modelsMeta2024-09-25-Demo
Pixtral-12BMistral2024-09-17--
Star
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
arXiv2024-08-16Github-
The Llama 3 Herd of ModelsarXiv2024-07-31--
Chameleon: Mixed-Modal Early-Fusion Foundation ModelsarXiv2024-05-16--
Hello GPT-4oOpenAI2024-05-13--
The Claude 3 Model Family: Opus, Sonnet, HaikuAnthropic2024-03-04--
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of contextGoogle2024-02-15--
Gemini: A Family of Highly Capable Multimodal ModelsGoogle2023-12-06--
Fuyu-8B: A Multimodal Architecture for AI AgentsBlog2023-10-17HuggingfaceDemo
Star
Unified Model for Image, Video, Audio and Language Tasks
arXiv2023-07-30GithubDemo
PaLI-3 Vision Language Models: Smaller, Faster, StrongerarXiv2023-10-13--
GPT-4V(ision) System CardOpenAI2023-09-25--
Star
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
arXiv2023-09-09Github-
Multimodal Foundation Models: From Specialists to General-Purpose AssistantsarXiv2023-09-18--
Star
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
NeurIPS2023-07-13Github-
Star
Generative Pretraining in Multimodality
arXiv2023-07-11GithubDemo
Star
Kosmos-2: Grounding Multimodal Large Language Models to the World
arXiv2023-06-26GithubDemo
Star
Transfer Visual Prompt Generator across LLMs
arXiv2023-05-02GithubDemo
GPT-4 Technical ReportarXiv2023-03-15--
PaLM-E: An Embodied Multimodal Language ModelarXiv2023-03-06-Demo
Star
Prismer: A Vision-Language Model with An Ensemble of Experts
arXiv2023-03-04GithubDemo
Star
Language Is Not All You Need: Aligning Perception with Language Models
arXiv2023-02-27Github-
Star
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
arXiv2023-01-30GithubDemo
Star
VIMA: General Robot Manipulation with Multimodal Prompts
ICML2022-10-06GithubLocal Demo
Star
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
NeurIPS2022-06-17Github-
Star
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
ICLR2022-06-15Github-
Star
Language Models are General-Purpose Interfaces
arXiv2022-06-13Github-

Evaluation

TitleVenueDatePage
Stars
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
arXiv2024-12-18Github
Stars
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
arXiv2024-11-21Github
Stars
OmniBench: Towards The Future of Universal Omni-Language Models
arXiv2024-09-23Github
Stars
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
arXiv2024-08-23Github
Stars
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
TPAMI2023-10-17Github
Stars
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
arXiv2024-06-29Github
Stars
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
arXiv2024-06-28Github
Stars
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
arXiv2024-06-26Github
Stars
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
arXiv2024-04-15Github
Stars
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
arXiv2024-05-31Github
Stars
Benchmarking Large Multimodal Models against Common Corruptions
NAACL2024-01-22Github
Stars
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
arXiv2024-01-11Github
Stars
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
arXiv2023-12-19Github
Stars
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
arXiv2023-12-05Github
Star
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
arXiv2023-11-27Github
Star
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
arXiv2023-11-24Github
Star
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V
arXiv2023-11-23Github
VLM-Eval: A General Evaluation on Video Large Language ModelsarXiv2023-11-20Coming soon
Star
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
arXiv2023-11-06Github
Star
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
arXiv2023-11-09Github
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the LeadarXiv2023-11-05-
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical ImagingarXiv2023-10-31-
Star
An Early Evaluation of GPT-4V(ision)
arXiv2023-10-25Github
Star
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation
arXiv2023-10-25Github
Star
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
CVPR2023-10-23Github
Star
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
ICLR2023-10-03Github
Star
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations
arXiv2023-10-02Github
Star
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
arXiv2023-10-01Github
Star
Can We Edit Multimodal Large Language Models?
arXiv2023-10-12Github
Star
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
arXiv2023-10-10Github
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)arXiv2023-09-29-
Star
TouchStone: Evaluating Vision-Language Models by Language Models
arXiv2023-08-31Github
Star
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
arXiv2023-08-31Github
Star
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs
arXiv2023-08-07Github
Star
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
arXiv2023-08-07Github
Star
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
arXiv2023-08-04Github
Star
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
CVPR2023-07-30Github
Star
MMBench: Is Your Multi-modal Model an All-around Player?
arXiv2023-07-12Github
Star
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
arXiv2023-06-23Github
Star
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
arXiv2023-06-15Github
Star
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
arXiv2023-06-11Github
Star
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
arXiv2023-06-08Github
Star
On The Hidden Mystery of OCR in Large Multimodal Models
arXiv2023-05-13Github

Multimodal RLHF

TitleVenueDateCodeDemo
Star
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
arXiv2025-05-09Github-
Star
Aligning Multimodal LLM with Human Preference: A Survey
arXiv2025-03-23Github-
Star
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv2025-02-14Github-
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference OptimizationarXiv2024-10-09--
Star
Silkie: Preference Distillation for Large Visual Language Models
arXiv2023-12-17Github-
Star
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
arXiv2023-12-01GithubDemo
Star
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv2023-09-25GithubDemo
Star
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
arXiv2024-08-22Github-

Others

TitleVenueDateCodeDemo
Star
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
arXiv2024-11-17Github-
Star
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
arXiv2024-02-03Github-
Star
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
arXiv2023-12-21GithubLocal Demo
Star
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
arXiv2023-12-07Github-
Star
Planting a SEED of Vision in Large Language Model
arXiv2023-07-16Github-
Star
Can Large Pre-trained Models Help Vision Models on Perception Tasks?
arXiv2023-06-01Github-
Star
Contextual Object Detection with Multimodal Large Language Models
arXiv2023-05-29GithubDemo
Star
Generating Images with Multimodal Language Models
arXiv2023-05-26Github-
Star
On Evaluating Adversarial Robustness of Large Vision-Language Models
arXiv2023-05-26Github-
Star
Grounding Language Models to Images for Multimodal Inputs and Outputs
ICML2023-01-31GithubDemo

Awesome Datasets

Datasets of Pre-Training for Alignment

NamePaperTypeModalities
ShareGPT4VideoShareGPT4Video: Improving Video Understanding and Generation with Better CaptionsCaptionVideo-Text
COYO-700MCOYO-700M: Image-Text Pair DatasetCaptionImage-Text
ShareGPT4VShareGPT4V: Improving Large Multi-Modal Models with Better CaptionsCaptionImage-Text
AS-1BThe All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open WorldHybridImage-Text
InternVidInternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and GenerationCaptionVideo-Text
MS-COCOMicrosoft COCO: Common Objects in ContextCaptionImage-Text
SBU CaptionsIm2Text: Describing Images Using 1 Million Captioned PhotographsCaptionImage-Text
Conceptual CaptionsConceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image CaptioningCaptionImage-Text
LAION-400MLAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text PairsCaptionImage-Text
VG CaptionsVisual Genome: Connecting Language and Vision Using Crowdsourced Dense Image AnnotationsCaptionImage-Text
Flickr30kFlickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence ModelsCaptionImage-Text
AI-CapsAI Challenger : A Large-scale Dataset for Going Deeper in Image UnderstandingCaptionImage-Text
Wukong CaptionsWukong: A 100 Million Large-scale Chinese Cross-modal Pre-training BenchmarkCaptionImage-Text
GRITKosmos-2: Grounding Multimodal Large Language Models to the WorldCaptionImage-Text-Bounding-Box
Youku-mPLUGYouku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and BenchmarksCaptionVideo-Text
MSR-VTTMSR-VTT: A Large Video Description Dataset for Bridging Video and LanguageCaptionVideo-Text
Webvid10MFrozen in Time: A Joint Video and Image Encoder for End-to-End RetrievalCaptionVideo-Text
WavCapsWavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal ResearchCaptionAudio-Text
AISHELL-1AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baselineASRAudio-Text
AISHELL-2AISHELL-2: Transforming Mandarin ASR Research Into Industrial ScaleASRAudio-Text
VSDial-CNX-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign LanguagesASRImage-Audio-Text

Datasets of Multimodal Instruction Tuning

NamePaperLinkNotes
Inst-IT DatasetInst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction TuningLinkAn instruction-tuning dataset which contains fine-grained multi-level annotations for 21k videos and 51k images
E.T. Instruct 164KE.T. Bench: Towards Open-Ended Event-Level Video-Language UnderstandingLinkAn instruction-tuning dataset for time-sensitive video understanding
MSQAMulti-modal Situated Reasoning in 3D ScenesLinkA large-scale dataset for multi-modal situated reasoning in 3D scenes
MM-EvolMMEvol: Empowering Multimodal Large Language Models with Evol-InstructLinkAn instruction dataset with rich diversity
UNK-VQAUNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large ModelsLinkA dataset designed to teach models to refrain from answering unanswerable questions
VEGAVEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large ModelsLinkA dataset for enhancing model capabilities in comprehension of interleaved information
ALLaVA-4VALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language ModelLinkVision and language caption and instruction dataset generated by GPT4V
IDKVisually Dehallucinative Instruction Generation: Know What You Don't KnowLinkDehallucinative visual instruction for "I Know" hallucination
CAP2QAVisually Dehallucinative Instruction GenerationLinkImage-aligned visual instruction dataset
M3DBenchM3DBench: Let's Instruct Large Models with Multi-modal 3D PromptsLinkA large-scale 3D instruction tuning dataset
ViP-LLaVA-InstructMaking Large Multimodal Models Understand Arbitrary Visual PromptsLinkA mixture of LLaVA-1.5 instruction data and the region-level visual prompting data
LVIS-Instruct4VTo See is to Believe: Prompting GPT-4V for Better Visual Instruction TuningLinkA visual instruction dataset via self-instruction from GPT-4V
ComVintWhat Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction TuningLinkA synthetic instruction dataset for complex visual reasoning
SparklesDialogue✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following ModelsLinkA machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, designed to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns
StableLLaVAStableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue DataLinkA cheap and effective approach to collect visual instruction tuning data
M-HalDetectDetecting and Preventing Hallucinations in Large Vision Language ModelsComing soonA dataset used to train and benchmark models for hallucination detection and prevention
MGVLIDChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning-A high-quality instruction-tuning dataset including image-text and region-text pairs
BuboGPTBuboGPT: Enabling Visual Grounding in Multi-Modal LLMsLinkA high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data
SVITSVIT: Scaling up Visual Instruction TuningLinkA large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs
mPLUG-DocOwlmPLUG-DocOwl: Modularized Multimodal Large Language Model for Document UnderstandingLinkAn instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding
PF-1MVisual Instruction Tuning with Polite FlamingoLinkA collection of 37 vision-language datasets with responses rewritten by Polite Flamingo.
ChartLlamaChartLlama: A Multimodal LLM for Chart Understanding and GenerationLinkA multi-modal instruction-tuning dataset for chart understanding and generation
LLaVARLLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image UnderstandingLinkA visual instruction-tuning dataset for Text-rich Image Understanding
MotionGPTMotionGPT: Human Motion as a Foreign LanguageLinkAn instruction-tuning dataset covering multiple human motion-related tasks
LRV-InstructionMitigating Hallucination in Large Multi-Modal Models via Robust Instruction TuningLinkVisual instruction tuning dataset for addressing hallucination issue
Macaw-LLMMacaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text IntegrationLinkA large-scale multi-modal instruction dataset featuring multi-turn dialogues
LAMM-DatasetLAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and BenchmarkLinkA comprehensive multi-modal instruction tuning dataset
Video-ChatGPTVideo-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language ModelsLink100K high-quality video instruction dataset
MIMIC-ITMIMIC-IT: Multi-Modal In-Context Instruction TuningLinkMultimodal in-context instruction tuning
M3ITM3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction TuningLinkLarge-scale, broad-coverage multimodal instruction tuning dataset
LLaVA-MedLLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One DayComing soonA large-scale, broad-coverage biomedical instruction-following dataset
GPT4ToolsGPT4Tools: Teaching Large Language Model to Use Tools via Self-instructionLinkTool-related instruction datasets
MULTISChatBridge: Bridging Modalities with Large Language Model as a Language CatalystComing soonMultimodal instruction tuning dataset covering 16 multimodal tasks
DetGPTDetGPT: Detect What You Need via ReasoningLinkInstruction-tuning dataset with 5000 images and around 30000 query-answer pairs
PMC-VQAPMC-VQA: Visual Instruction Tuning for Medical Visual Question AnsweringComing soonLarge-scale medical visual question-answering dataset
VideoChatVideoChat: Chat-Centric Video UnderstandingLinkVideo-centric multimodal instruction dataset
X-LLMX-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign LanguagesLinkChinese multimodal instruction dataset
LMEyeLMEye: An Interactive Perception Network for Large Language ModelsLinkA multi-modal instruction-tuning dataset
cc-sbu-alignMiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language ModelsLinkA multimodal aligned dataset for improving the model's usability and generation fluency
LLaVA-Instruct-150KVisual Instruction TuningLinkMultimodal instruction-following data generated by GPT
MultiInstructMultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction TuningLinkThe first multimodal instruction tuning benchmark dataset

Datasets of In-Context Learning

NamePaperLinkNotes
MICMMICL: Empowering Vision-language Model with Multi-Modal In-Context LearningLinkA manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs.
MIMIC-ITMIMIC-IT: Multi-Modal In-Context Instruction TuningLinkMultimodal in-context instruction dataset

Datasets of Multimodal Chain-of-Thought

NamePaperLinkNotes
EMERExplainable Multimodal Emotion ReasoningComing soonA benchmark dataset for explainable emotion reasoning task
EgoCOTEmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of ThoughtComing soonLarge-scale embodied planning dataset
VIPLet’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and PredictionComing soonAn inference-time dataset that can be used to evaluate VideoCOT
ScienceQALearn to Explain: Multimodal Reasoning via Thought Chains for Science Question AnsweringLinkLarge-scale multi-choice dataset, featuring multimodal science questions and diverse domains

Datasets of Multimodal RLHF

NamePaperLinkNotes
VLFeedbackSilkie: Preference Distillation for Large Visual Language ModelsLinkA vision-language feedback dataset annotated by AI

Benchmarks for Evaluation

NamePaperLinkNotes
Inst-IT BenchInst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction TuningLinkA benchmark to evaluate fine-grained instance-level understanding in images and videos
M3CoTM3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-ThoughtLinkA multi-domain, multi-step benchmark for multimodal CoT
MMGenBenchMMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation PerspectiveLinkA benchmark that gauges the performance of constructing image-generation prompts given an input image
MiCEvalMiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning StepsLinkA multimodal CoT benchmark to evaluate MLLMs' reasoning capabilities
LiveXivLiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers ContentLinkA live benchmark based on arXiv papers
TemporalBenchTemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video ModelsLinkA benchmark for evaluation of fine-grained temporal understanding
OmniBenchOmniBench: Towards The Future of Universal Omni-Language ModelsLinkA benchmark that evaluates models' capabilities of processing visual, acoustic, and textual inputs simultaneously
MME-RealWorldMME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?LinkA challenging benchmark that involves real-life scenarios
VELOCITIVELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?LinkA video benchmark that evaluates perception and binding capabilities
MMRSeeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading QuestionsLinkA benchmark for measuring MLLMs' understanding capability and robustness to leading questions
CharXivCharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMsLinkChart understanding benchmark curated by human experts
Video-MMEVideo-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video AnalysisLinkA comprehensive evaluation benchmark of Multi-modal LLMs in video analysis
VL-ICL BenchVL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context LearningLinkA benchmark for M-ICL evaluation, covering a wide spectrum of tasks
TempCompassTempCompass: Do Video LLMs Really Understand Videos?LinkA benchmark to evaluate the temporal perception ability of Video LLMs
GVLQAGITA: Graph to Visual and Textual Integration for Vision-Language Graph ReasoningLinkA benchmark for evaluation of graph reasoning capabilities
CoBSATCan MLLMs Perform Text-to-Image In-Context Learning?LinkA benchmark for text-to-image ICL
VQAv2-IDKVisually Dehallucinative Instruction Generation: Know What You Don't KnowLinkA benchmark for assessing "I Know" visual hallucination
Math-VisionMeasuring Multimodal Mathematical Reasoning with MATH-Vision DatasetLinkA diverse mathematical reasoning benchmark
SciMMIRSciMMIR: Benchmarking Scientific Multi-modal Information RetrievalLink
CMMMUCMMMU: A Chinese Massive Multi-discipline Multimodal Understanding BenchmarkLinkA Chinese benchmark involving reasoning and knowledge across multiple disciplines
MMCBenchBenchmarking Large Multimodal Models against Common CorruptionsLinkA benchmark for examining self-consistency under common corruptions
MMVPEyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMsLinkA benchmark for assessing visual capabilities
TimeITTimeChat: A Time-sensitive Multimodal Large Language Model for Long Video UnderstandingLinkA video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks.
ViP-BenchMaking Large Multimodal Models Understand Arbitrary Visual PromptsLinkA benchmark for visual prompts
M3DBenchM3DBench: Let's Instruct Large Models with Multi-modal 3D PromptsLinkA 3D-centric benchmark
Video-BenchVideo-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language ModelsLinkA benchmark for video-MLLM evaluation
Charting-New-TerritoriesCharting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMsLinkA benchmark for evaluating geographic and geospatial capabilities
MLLM-BenchMLLM-Bench, Evaluating Multi-modal LLMs using GPT-4VLinkGPT-4V evaluation with per-sample criteria
BenchLMMBenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal ModelsLinkA benchmark for assessment of the robustness against different image styles
MMC-BenchmarkMMC: Advancing Multimodal Chart Understanding with Large-scale Instruction TuningLinkA comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts
MVBenchMVBench: A Comprehensive Multi-modal Video Understanding BenchmarkLinkA comprehensive multimodal benchmark for video understanding
BingoHolistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference ChallengesLinkA benchmark for hallucination evaluation that focuses on two common types
MagnifierBenchOtterHD: A High-Resolution Multi-modality ModelLinkA benchmark designed to probe models' fine-grained perception abilities
HallusionBenchHallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality ModelsLinkAn image-context reasoning benchmark for evaluation of hallucination
PCA-EVALTowards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and BeyondLinkA benchmark for evaluating multi-domain embodied decision-making.
MMHal-BenchAligning Large Multimodal Models with Factually Augmented RLHFLinkA benchmark for hallucination evaluation
MathVistaMathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal ModelsLinkA benchmark that challenges both visual and math reasoning capabilities
SparklesEval✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following ModelsLinkA GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria.
ISEKAILink-Context Learning for Multimodal LLMsLinkA benchmark comprising exclusively of unseen generated image-label pairs designed for link-context learning
M-HalDetectDetecting and Preventing Hallucinations in Large Vision Language ModelsComing soonA dataset used to train and benchmark models for hallucination detection and prevention
I4Empowering Vision-Language Models to Follow Interleaved Vision-Language InstructionsLinkA benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions
SciGraphQASciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific GraphsLinkA large-scale chart-visual question-answering dataset
MM-VetMM-Vet: Evaluating Large Multimodal Models for Integrated CapabilitiesLinkAn evaluation benchmark that examines large multimodal models on complicated multimodal tasks
SEED-BenchSEED-Bench: Benchmarking Multimodal LLMs with Generative ComprehensionLinkA benchmark for evaluation of generative comprehension in MLLMs
MMBenchMMBench: Is Your Multi-modal Model an All-around Player?LinkA systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models
LynxWhat Matters in Training a GPT4-Style Language Model with Multimodal Inputs?LinkA comprehensive evaluation benchmark including both image and video tasks
GAVIEMitigating Hallucination in Large Multi-Modal Models via Robust Instruction TuningLinkA benchmark to evaluate the hallucination and instruction following ability
MMEMME: A Comprehensive Evaluation Benchmark for Multimodal Large Language ModelsLinkA comprehensive MLLM Evaluation benchmark
LVLM-eHubLVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language ModelsLinkAn evaluation platform for MLLMs
LAMM-BenchmarkLAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and BenchmarkLinkA benchmark for evaluating the quantitative performance of MLLMs on various2D/3D vision tasks
M3ExamM3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language ModelsLinkA multilingual, multimodal, multilevel benchmark for evaluating MLLM
OwlEvalmPLUG-Owl: Modularization Empowers Large Language Models with MultimodalityLinkDataset for evaluation on multiple capabilities

### Others

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | A multimodal dialogue dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset that focuses on recognizing visual entities on Wikipedia from images in the wild |