May 22, 2026

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

💡 You're very welcome to join our discussion (on WeChat or Slack) on the topic of multimodal reasoning.
📌 Please let us know about any related work we may have missed; see CONTRIBUTING.md for how to suggest a paper.

MCoT cover figure

🎇 Introduction

Multimodal chain-of-thought (MCoT) reasoning has garnered attention for its ability to enhance step-by-step reasoning in multimodal contexts, particularly within multimodal large language models (MLLMs). Current MCoT research explores various methodologies to address the challenges posed by images, videos, speech, audio, 3D data, and structured data, achieving success in fields such as robotics, healthcare, and autonomous driving. However, despite these advancements, the field lacks a comprehensive review that addresses the numerous remaining challenges.

To fill this gap, we present the first systematic survey of MCoT reasoning, elucidating the foundational concepts and definitions pertinent to this area. Our work includes a detailed taxonomy and an analysis of existing methodologies across different applications, as well as insights into current challenges and future research directions aimed at fostering the development of multimodal reasoning.

MCoT research timeline

Updates

2025-05-20: We uploaded the Chinese-language version. Enjoy!
2025-04-25: We reached 500 stars! Thank you all!
2025-03-18: We released the Awesome-MCoT repo and survey.

📕 Table of Contents

🎖 MCoT Datasets and Benchmarks
🎊 Multimodal Reasoning via RL
✨ MCoT Over Various Modalities
🔥 MCoT Methodologies
🎨 Applications with MCoT Reasoning
🚀 Useful Links
❤️ Citation
⭐️ Star History

🎖 MCoT Datasets and Benchmarks

"MC" and "Open" refer to multiple-choice and open-ended answer formats.
"T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.

Tab-1: Datasets for MCoT Training with Rationale.

Datasets	Year	Task	Domain	Modality	Format	Samples
ScienceQA	2022	VQA	Science	T, I	MC	21K
A-OKVQA	2022	VQA	Common	T, I	MC	25K
EgoCoT	2023	VideoQA	Common	T, V	Open	200M
VideoCoT	2024	VideoQA	Human Action	T, V	Open	22K
VideoEspresso	2024	VideoQA	Common	T, V	Open	202,164
EMMA-X	2024	Robot Manipulation	Indoor	T, V	Robot Actions	60K
M3CoT	2024	VQA	Science, Math, Common	T, I	MC	11.4K
MAVIS	2024	ScienceQA	Math	T, I	MC and Open	834K
LLaVA-CoT-100k	2024	VQA	Common, Science	T, I	MC and Open	100K
MAmmoTH-VL	2024	Diverse	Diverse	T, I	MC and Open	12M
Mulberry-260k	2024	Diverse	Diverse	T, I	MC and Open	260K
MM-Verify	2025	MathQA	Math	T, I	MC and Open	59,772
VisualPRM400K	2025	ScienceQA	Math, Science	T, I	MC and Open	400K
R1-OneVision	2025	Diverse	Diverse	T, I	MC and Open	155K

Tab-2: Benchmarks for MCoT Evaluation without Rationale.

Datasets	Year	Task	Domain	Modality	Format	Samples
MMMU	2023	VQA	Arts, Science	T, I	MC and Open	11.5K
SEED	2023	VQA	Common	T, I	MC	19K
MathVista	2023	ScienceQA	Math	T, I	MC and Open	6,141
MathVerse	2024	ScienceQA	Math	T, I	MC and Open	15K
Math-Vision	2024	ScienceQA	Math	T, I	MC and Open	3,040
MeViS	2023	Referring VOS	Common	T, V	Dense Mask	2K
VSIBench	2024	VideoQA	Indoor	T, V	MC and Open	5K
HallusionBench	2024	VQA	Common	T, I	Yes-No	1,129
AV-Odyssey	2024	AVQA	Common	T, V, A	MC	4,555
AVHBench	2024	AVQA	Common	T, V, A	Open	5,816
RefAVS-Bench	2024	Referring AVS	Common	T, V, A	Dense Mask	4,770
MMAU	2024	AQA	Common	T, A	MC	10K
AVTrustBench	2025	AVQA	Common	T, V, A	MC and Open	600K
MIG-Bench	2025	Multi-image Grounding	Common	T, I	BBox	5.89K
MedAgentsBench	2025	MedicalQA	Medical	T, I	MC and Open	862
OSWorld	2024	Agent	Real Comp. Env.	T, I	Agent Action	369
AgentClinic	2024	MedicalQA	Medical	T, I	Open	335

Tab-3: Benchmarks for MCoT Evaluation with Rationale.

Datasets	Year	Task	Domain	Modality	Format	Samples
CoMT	2024	VQA	Common	T, I	MC	3,853
OmniBench	2024	VideoQA	Common	T, I, A	MC	1,142
WorldQA	2024	VideoQA	Common	T, V, A	Open	1,007
MiCEval	2024	VQA	Common	T, I	Open	643
OlympiadBench	2024	ScienceQA	Math, Physics	T, I	Open	8,476
MME-CoT	2025	VQA	Science, Math, Common	T, I	MC and Open	1,130
EMMA	2025	VQA	Science	T, I	MC and Open	2,788
VisualProcessBench	2025	ScienceQA	Math, Science	T, I	MC and Open	2,866

🎊 Multimodal Reasoning via RL

The following table summarizes the techniques used by RL-trained MLLMs for better long-MCoT reasoning, where "T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.
In summary, RL can elicit complex reasoning and "aha moments" even without SFT. This suggests that iterative self-improvement and rule-based approaches can strengthen model capabilities and support more advanced, autonomous multimodal reasoning systems.

Model	Foundational LLMs	Modality	Learning	Cold Start	Algorithm	Aha-moment
DeepSeek-R1-Zero	DeepSeek-V3	T	RL	❌	GRPO	✅
DeepSeek-R1	DeepSeek-V3	T	SFT+RL	✅	GRPO	-
LLaVA-Reasoner	LLaMA3-LLaVA-NEXT-8B	T, I	SFT+RL	✅	DPO	-
Insight-V	DeepSeek-V3	T, I	SFT+RL	✅	DPO	-
Multimodal-Open-R1	Qwen2-VL-7B-Instruct	T, I	RL	❌	GRPO	❌
R1-OneVision	Qwen2.5-VL-7B-Instruct	T, I	SFT	-	-	-
R1-V	Qwen2.5-VL	T, I	RL	❌	GRPO	❌
VLM-R1	Qwen2.5-VL	T, I	RL	❌	GRPO	❌
LMM-R1	Qwen2.5-VL-Instruct-3B	T, I	RL	❌	PPO	❌
Curr-ReFT	Qwen2.5-VL-3B	T, I	RL+SFT	❌	GRPO	-
Seg-Zero	Qwen2.5-VL-3B + SAM2	T, I	RL	❌	GRPO	❌
MM-Eureka	InternVL2.5-Instruct-8B	T, I	SFT+RL	✅	RLOO	-
MM-Eureka-Zero	InternVL2.5-Pretrained-38B	T, I	RL	❌	GRPO	✅
VisualThinker-R1-Zero	Qwen2-VL-2B	T, I	RL	❌	GRPO	✅
Easy-R1	Qwen2.5-VL	T, I	RL	❌	GRPO	-
Open-R1-Video	Qwen2-VL-7B	T, I, V	RL	❌	GRPO	❌
R1-Omni	HumanOmni-0.5B	T, I, V, A	SFT+RL	✅	GRPO	-
VisRL	Qwen2.5-VL-7B	T, I	SFT+RL	✅	DPO	-
R1-VL	Qwen2-VL-7B	T, I	RL	❌	StepGRPO	-
OpenVLThinker	Qwen2.5-VL-7B-Instruct	T, I	SFT+RL	✅	GRPO	-
EchoInk-R1	Qwen2.5-Omni-7B	T, I, A	RL	❌	GRPO	✅
Web-CogReasoner	Qwen2.5-VL-7B	T, I	SFT	-	-	-
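
Most rows in the table use GRPO with rule-based rewards. The snippet below is a minimal sketch of that idea, not the training code of any listed model: it scores a group of sampled responses with a hypothetical rule (a format check plus exact-match accuracy) and then normalizes each reward against its group's mean and standard deviation. The tag names `<think>` and `<answer>`, the reward weights, and the use of the population standard deviation are all assumptions made for illustration.

```python
import re
import statistics
from typing import List

# Assumed output format: reasoning in <think>...</think>, final answer in <answer>...</answer>.
_FORMAT = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", flags=re.DOTALL
)


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward: 0.5 for valid format, plus 1.0 for a correct answer."""
    match = _FORMAT.fullmatch(response)
    if match is None:
        return 0.0  # malformed output earns nothing
    format_reward = 0.5
    accuracy_reward = 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to the same prompt form one group.
    group = [
        "<think>2 + 2 = 4</think><answer>4</answer>",
        "<think>guessing</think><answer>5</answer>",
        "The answer is 4",
        "<think>count the dots</think> <answer> 4 </answer>",
    ]
    rewards = [rule_based_reward(r, "4") for r in group]
    for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because advantages are computed relative to the group, no separate value model is needed in this sketch; responses that beat the group average get a positive signal and the rest a negative one.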

✨ MCoT Over Various Modalities

MCoT Reasoning Over Image

2026 · 15 papers

2025 · 25 papers

2024 · 30 papers

2023 · 10 papers

MCoT Reasoning Over Video

2026 · 2 papers

2025 · 4 papers

2024 · 9 papers

2023 · 2 papers

MCoT Reasoning Over 3D

2025 · 3 papers

2024 · 3 papers

2023 · 1 paper

Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models

MCoT Reasoning Over Audio and Speech

2025 · 6 papers

ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
EchoInk-R1
R1-AQA: Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data
Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models

2024 · 2 papers

2023 · 2 papers

MCoT Reasoning Over Table and Chart

2025 · 2 papers

2024 · 2 papers

2023 · 1 paper

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

2026 · 1 paper

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

2025 · 1 paper

R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

2024 · 4 papers

🔥 MCoT Methodologies

Rationale Construction

MCoT reasoning methodologies primarily concern how rationales are constructed, and fall into three types: prompt-based, plan-based, and learning-based methods.

Prompt-based MCoT reasoning employs carefully designed prompts, including instructions or in-context demonstrations, to guide models in generating rationales during inference, typically in zero-shot or few-shot settings.

Plan-based MCoT reasoning enables models to dynamically explore and refine thoughts during the reasoning process.

Learning-based MCoT reasoning embeds rationale construction within the training or fine-tuning process, requiring models to explicitly learn reasoning skills alongside multimodal inputs.
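
As a rough illustration of the prompt-based category, the sketch below assembles a zero-shot or few-shot MCoT prompt from an instruction and optional in-context demonstrations. The `Demo` type, the `build_mcot_prompt` function, the field labels, and the image placeholders such as `<image_1>` are hypothetical; real MLLMs each define their own way of interleaving images with text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    image_ref: str   # placeholder token standing in for an image input
    question: str
    rationale: str
    answer: str


def build_mcot_prompt(
    question: str,
    image_ref: str,
    demos: Sequence[Demo] = (),
    instruction: str = "Let's think step by step.",
) -> str:
    """Return a zero-shot prompt (no demos) or a few-shot prompt (with demos)."""
    parts = []
    for demo in demos:
        # Each demonstration shows the model a full question -> rationale -> answer chain.
        parts.append(
            f"Image: {demo.image_ref}\n"
            f"Question: {demo.question}\n"
            f"Rationale: {demo.rationale}\n"
            f"Answer: {demo.answer}"
        )
    # The query ends with an open "Rationale:" slot so the model reasons before answering.
    parts.append(
        f"Image: {image_ref}\n"
        f"Question: {question}\n"
        f"{instruction}\n"
        f"Rationale:"
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = Demo(
        image_ref="<image_1>",
        question="How many apples are on the table?",
        rationale="There are two red apples and one green apple, so three in total.",
        answer="3",
    )
    print(build_mcot_prompt("What color is the car?", "<image_2>", demos=[demo]))
```

Passing no demonstrations yields the zero-shot setting; adding them yields the few-shot setting, with no change to model weights in either case.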

Structural Reasoning

The proposed structural reasoning framework aims to enhance the controllability and interpretability of the rationale generation process. The structured formats can be categorized into three types: asynchronous modality modeling, defined procedure staging, and autonomous procedure staging.
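
To make defined procedure staging more concrete, here is a minimal sketch that checks a model output against a fixed sequence of stages and splits it into one field per stage. The stage names, the tag syntax, and the `parse_staged_rationale` helper are hypothetical placeholders, chosen only to show how a predefined staging makes a rationale easier to control and inspect.

```python
import re
from typing import Dict, Sequence

# Hypothetical fixed stage order; a real method would define its own stages.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_staged_rationale(text: str, stages: Sequence[str] = STAGES) -> Dict[str, str]:
    """Split a staged rationale into its stages, enforcing the predefined order."""
    parsed: Dict[str, str] = {}
    cursor = 0
    for stage in stages:
        pattern = re.compile(rf"<{stage}>(.*?)</{stage}>", flags=re.DOTALL)
        # Search only after the previous stage, so out-of-order stages are rejected.
        match = pattern.search(text, cursor)
        if match is None:
            raise ValueError(f"missing or out-of-order stage: {stage}")
        parsed[stage] = match.group(1).strip()
        cursor = match.end()
    return parsed


if __name__ == "__main__":
    output = (
        "<SUMMARY>Count the objects in the image.</SUMMARY>\n"
        "<CAPTION>A table with three cups.</CAPTION>\n"
        "<REASONING>Each cup is distinct, so the count is three.</REASONING>\n"
        "<CONCLUSION>3</CONCLUSION>"
    )
    for stage, content in parse_staged_rationale(output).items():
        print(f"{stage}: {content}")
```

A parser like this can also serve as a format check during data filtering or evaluation, since any output that skips or reorders a stage raises an error.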

🎨 Applications with MCoT Reasoning

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
MedCoT: Medical Chain of Thought via Hierarchical Expert
Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning
TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos
Open Set Video HOI detection from Action-centric Chain-of-Look Prompting
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
S-Chain: Structured Visual Chain-of-Thought For Medicine
M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Cancer Diagnosis

Chain-of-Exemplar: Enhancing Distractor Generation for Multimodal Educational Question Generation
X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis
Chain of Empathy: Enhancing Empathetic Response of Large Language Models Based on Psychotherapy Models
Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis

Multimodal Generation

Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal LLMs

🚀 Useful Links

Survey

❤️ Citation

If this work is useful to you, we would greatly appreciate it if you starred the repository and cited the survey:

@article{wang2025multimodal,
  title={Multimodal chain-of-thought reasoning: A comprehensive survey},
  author={Wang, Yaoting and Wu, Shengqiong and Zhang, Yuecheng and Yan, Shuicheng and Liu, Ziwei and Luo, Jiebo and Fei, Hao},
  journal={arXiv preprint arXiv:2503.12605},
  year={2025}
}
