| DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence | DeepSeek | 2026-04-24 | Huggingface | - |
| Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model | Blog | 2026-04-22 | Huggingface | Demo |
| Xiaomi MiMo-V2.5 | Blog | 2026-04-22 | - | Demo |
| Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding | arXiv | 2026-04-06 | Github | Demo |
| Introducing Muse Spark: Scaling Towards Personal Superintelligence | Blog | 2026-04-08 | - | Demo |
| VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing | arXiv | 2026-04-03 | Github | Local Demo |
| Gemma 4: Byte for byte, the most capable open models | Blog | 2026-04-02 | - | Demo |
| Qwen3.6-Plus: Towards Real World Agents | Blog | 2026-04-02 | - | - |
| Qwen3.5-Omni: Scaling Up, Toward Native Omni-Modal AGI | Blog | 2026-03-30 | - | Demo |
| Xiaomi MiMo-V2-Omni | Blog | 2026-03-18 | - | - |
| InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing | arXiv | 2026-03-10 | Github | Local Demo |
| Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | arXiv | 2026-03-06 | Github | - |
| Beyond Language Modeling: An Exploration of Multimodal Pretraining | arXiv | 2026-03-03 | - | - |
| Gemini 3.1 Pro: A smarter model for your most complex tasks | Blog | 2026-02-19 | - | - |
| Qwen3.5: Towards Native Multimodal Agents | Blog | 2026-02-16 | Github | Demo |
| MiniCPM-o 4.5 | Blog | 2026-02-06 | Github | Demo |
| Kimi K2.5: Visual Agentic Intelligence | arXiv | 2026-02-02 | Github | - |
| DeepSeek-OCR 2: Visual Causal Flow | DeepSeek | 2026-01-27 | Github | - |
| Seed1.8 Model Card: Towards Generalized Real-World Agency | Bytedance Seed | 2025-12-18 | - | - |
| Introducing GPT-5.2 | OpenAI | 2025-12-11 | - | - |
| Introducing Mistral 3 | Blog | 2025-12-02 | Huggingface | - |
| Qwen3-VL Technical Report | arXiv | 2025-11-26 | Github | Demo |
| Emu3.5: Native Multimodal Models are World Learners | arXiv | 2025-10-30 | Github | - |
| VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting | arXiv | 2025-10-21 | Github | Local Demo |
| DeepSeek-OCR: Contexts Optical Compression | arXiv | 2025-10-21 | Github | - |
| OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | arXiv | 2025-10-17 | Github | - |
| NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching | arXiv | 2025-10-16 | - | - |
| InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | arXiv | 2025-10-15 | Github | - |
| VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | arXiv | 2025-10-10 | Github | - |
| LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | arXiv | 2025-10-09 | Github | Demo |
| Qwen3-Omni Technical Report | arXiv | 2025-09-22 | Github | Demo |
| InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | arXiv | 2025-08-27 | Github | Demo |
| MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone | - | 2025-08-26 | Github | Demo |
| Thyme: Think Beyond Images | arXiv | 2025-08-18 | Github | Demo |
| Introducing GPT-5 | OpenAI | 2025-08-07 | - | - |
| dots.vlm1 | rednote-hilab | 2025-08-06 | Github | Demo |
| Step3: Cost-Effective Multimodal Intelligence | StepFun | 2025-07-31 | Github | Demo |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | arXiv | 2025-07-02 | Github | Demo |
| DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | arXiv | 2025-06-30 | Github | - |
| Qwen VLo: From "Understanding" the World to "Depicting" It | Qwen | 2025-06-26 | - | Demo |
| MMSearch-R1: Incentivizing LMMs to Search | arXiv | 2025-06-25 | Github | - |
| Show-o2: Improved Native Unified Multimodal Models | arXiv | 2025-06-18 | Github | - |
| Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities | Google | 2025-06-17 | - | - |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | arXiv | 2025-06-16 | Github | - |
| MiMo-VL Technical Report | arXiv | 2025-06-04 | Github | - |
| OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv | 2025-05-29 | Github | - |
| Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025-05-23 | Github | Demo |
| MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025-05-21 | Github | Demo |
| UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | arXiv | 2025-05-20 | - | - |
| BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025-05-14 | Github | Local Demo |
| Seed1.5-VL Technical Report | arXiv | 2025-05-11 | - | - |
| Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models | arXiv | 2025-05-08 | Github | - |
| VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | arXiv | 2025-05-06 | Github | Local Demo |
| Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | arXiv | 2025-04-23 | Github | - |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | arXiv | 2025-04-21 | Github | - |
| An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | arXiv | 2025-04-21 | Github | - |
| InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | arXiv | 2025-04-14 | Github | Demo |
| Introducing GPT-4.1 in the API | OpenAI | 2025-04-14 | - | - |
| Kimi-VL Technical Report | arXiv | 2025-04-10 | Github | Demo |
| The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation | Meta | 2025-04-05 | Hugging Face | - |
| Qwen2.5-Omni Technical Report | Qwen | 2025-03-26 | Github | Demo |
| Addendum to GPT-4o System Card: Native image generation | OpenAI | 2025-03-25 | - | - |
| Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation | arXiv | 2025-03-17 | Github | - |
| Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision | arXiv | 2025-03-07 | - | - |
| Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs | arXiv | 2025-03-03 | Hugging Face | Demo |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | arXiv | 2025-02-19 | Github | - |
| Qwen2.5-VL Technical Report | arXiv | 2025-02-19 | Github | Demo |
| Baichuan-Omni-1.5 Technical Report | Tech Report | 2025-01-26 | Github | Local Demo |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | arXiv | 2025-01-10 | Github | - |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv | 2025-01-03 | Github | - |
| QVQ: To See the World with Wisdom | Qwen | 2024-12-25 | Github | Demo |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | arXiv | 2024-12-13 | Github | - |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | arXiv | 2024-12-13 | - | - |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | arXiv | 2024-12-12 | Github | Local Demo |
| StreamChat: Chatting with Streaming Video | arXiv | 2024-12-11 | Coming soon | - |
| CompCap: Improving Multimodal Large Language Models with Composite Captions | arXiv | 2024-12-06 | - | - |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | arXiv | 2024-12-06 | Github | - |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | arXiv | 2024-12-06 | Github | Demo |
| NVILA: Efficient Frontier Visual Language Models | arXiv | 2024-12-05 | Github | Demo |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | arXiv | 2024-12-04 | Github | - |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | arXiv | 2024-11-27 | Github | - |
| ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | arXiv | 2024-11-27 | Github | Local Demo |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | arXiv | 2024-10-22 | Github | Demo |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | arXiv | 2024-10-09 | Github | - |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | arXiv | 2024-10-04 | Github | Local Demo |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | CVPR | 2024-09-26 | Github | Demo |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | arXiv | 2024-09-25 | Huggingface | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | Github | Demo |
| ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding | ICLR | 2024-09-05 | Github | Local Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | Github | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | Github | Demo |
| LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | arXiv | 2024-08-28 | Github | - |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | Github | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | Github | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | Github | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | Github | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | arXiv | 2024-07-10 | Github | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | Github | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | Github | Local Demo |
| DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | AAAI | 2024-06-27 | Github | - |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | Github | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | Github | Local Demo |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | ICML | 2024-06-22 | Github | - |
| TroL: Traversal of Layers for Large Language and Vision Models | EMNLP | 2024-06-18 | Github | Local Demo |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | Github | Local Demo |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | CVPR | 2024-06-17 | Github | Local Demo |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL | 2024-06-15 | Github | Demo |
| Comparison Visual Instruction Tuning | arXiv | 2024-06-13 | Github | Local Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | Github | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | Github | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | Github | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | Github | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | Github | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | Github | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | Github | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | Github | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | Github | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | Github | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | Github | - |
| BRAVE: Broadening the visual encoding of vision-language models | ECCV | 2024-04-10 | - | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | Github | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | Github | - |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | NeurIPS | 2024-04-04 | Github | Local Demo |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| LITA: Language Instructed Temporal-Localization Assistant | arXiv | 2024-03-27 | Github | Local Demo |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | Github | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | Github | Local Demo |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | arXiv | 2024-03-08 | Github | Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | Github | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | Github | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | Github | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | Github | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | Github | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | Github | - |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | ICML | 2024-02-12 | Github | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | Github | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | Github | - |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | NeurIPS | 2024-02-03 | Github | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | Github | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | Github | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | Github | Demo |
| Yi-VL | - | 2024-01-23 | Github | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ACL | 2024-01-04 | Github | Local Demo |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | Github | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| VILA: On Pre-training for Visual Language Models | CVPR | 2023-12-13 | Github | Local Demo |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | Github | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | Github | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | Github | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | Github | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | Github | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | Github | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Github | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | ICLR | 2023-10-20 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | Github | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | Github | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |