May 22, 2026

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

💡 You're very welcome to join our discussion (on WeChat or Slack) on the topic of multimodal reasoning.
📌 Please feel free to ping us about any related work we may have missed; see CONTRIBUTING.md for how to suggest a paper.

MCoT cover figure

🎇 Introduction

Multimodal chain-of-thought (MCoT) reasoning has garnered attention for its ability to enhance step-by-step reasoning in multimodal contexts, particularly within multimodal large language models (MLLMs). Current MCoT research explores various methodologies to address the challenges posed by images, videos, speech, audio, 3D data, and structured data, achieving success in fields such as robotics, healthcare, and autonomous driving. However, despite these advancements, the field lacks a comprehensive review that addresses the numerous remaining challenges.

To fill this gap, we present the first systematic survey of MCoT reasoning, elucidating the foundational concepts and definitions pertinent to this area. Our work includes a detailed taxonomy and an analysis of existing methodologies across different applications, as well as insights into current challenges and future research directions aimed at fostering the development of multimodal reasoning.

MCoT research timeline


Updates

2025-05-20: We uploaded the Chinese-language version. Enjoy!
2025-04-25: We reached 500 stars! Thank you all!
2025-03-18: We released the Awesome-MCoT repo and survey.


📕 Table of Contents


🎖 MCoT Datasets and Benchmarks

  • "MC" and "Open" refer to multiple-choice and open-ended answer formats.
  • "T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.

Tab-1: Datasets for MCoT Training with Rationale.

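| Datasets | Year | Task | Domain | Modality | Format | Samples |
|---|---|---|---|---|---|---|
| ScienceQA | 2022 | VQA | Science | T, I | MC | 21K |
| A-OKVQA | 2022 | VQA | Common | T, I | MC | 25K |
| EgoCoT | 2023 | VideoQA | Common | T, V | Open | 200M |
| VideoCoT | 2024 | VideoQA | Human Action | T, V | Open | 22K |
| VideoEspresso | 2024 | VideoQA | Common | T, V | Open | 202,164 |
| EMMA-X | 2024 | Robot Manipulation | Indoor | T, V | Robot Actions | 60K |
| M3CoT | 2024 | VQA | Science, Math, Common | T, I | MC | 11.4K |
| MAVIS | 2024 | ScienceQA | Math | T, I | MC and Open | 834K |
| LLaVA-CoT-100k | 2024 | VQA | Common, Science | T, I | MC and Open | 834K |
| MAmmoTH-VL | 2024 | Diverse | Diverse | T, I | MC and Open | 12M |
| Mulberry-260k | 2024 | Diverse | Diverse | T, I | MC and Open | 260K |
| MM-Verify | 2025 | MathQA | Math | T, I | MC and Open | 59,772 |
| VisualPRM400K | 2025 | ScienceQA | Math, Science | T, I | MC and Open | 400K |
| R1-OneVision | 2025 | Diverse | Diverse | T, I | MC and Open | 155K |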

Tab-2: Benchmarks for MCoT Evaluation without Rationale.

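| Datasets | Year | Task | Domain | Modality | Format | Samples |
|---|---|---|---|---|---|---|
| MMMU | 2023 | VQA | Arts, Science | T, I | MC and Open | 11.5K |
| SEED | 2023 | VQA | Common | T, I | MC | 19K |
| MathVista | 2023 | ScienceQA | Math | T, I | MC and Open | 6,141 |
| MathVerse | 2024 | ScienceQA | Math | T, I | MC and Open | 15K |
| Math-Vision | 2024 | ScienceQA | Math | T, I | MC and Open | 3,040 |
| MeViS | 2023 | Referring VOS | Common | T, V | Dense Mask | 2K |
| VSIBench | 2024 | VideoQA | Indoor | T, V | MC and Open | 5K |
| HallusionBench | 2024 | VQA | Common | T, I | Yes-No | 1,129 |
| AV-Odyssey | 2024 | AVQA | Common | T, V, A | MC | 4,555 |
| AVHBench | 2024 | AVQA | Common | T, V, A | Open | 5,816 |
| RefAVS-Bench | 2024 | Referring AVS | Common | T, V, A | Dense Mask | 4,770 |
| MMAU | 2024 | AQA | Common | T, A | MC | 10K |
| AVTrustBench | 2025 | AVQA | Common | T, V, A | MC and Open | 600K |
| MIG-Bench | 2025 | Multi-image Grounding | Common | T, I | BBox | 5.89K |
| MedAgentsBench | 2025 | MedicalQA | Medical | T, I | MC and Open | 862 |
| OSWorld | 2024 | Agent | Real Comp. Env. | T, I | Agent Action | 369 |
| AgentClinic | 2024 | MedicalQA | Medical | T, I | Open | 335 |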

Tab-3: Benchmarks for MCoT Evaluation with Rationale.

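| Datasets | Year | Task | Domain | Modality | Format | Samples |
|---|---|---|---|---|---|---|
| CoMT | 2024 | VQA | Common | T, I | MC | 3,853 |
| OmniBench | 2024 | VideoQA | Common | T, I, A | MC | 1,142 |
| WorldQA | 2024 | VideoQA | Common | T, V, A | Open | 1,007 |
| MiCEval | 2024 | VQA | Common | T, I | Open | 643 |
| OlympiadBench | 2024 | ScienceQA | Maths, Physics | T, I | Open | 8,476 |
| MME-CoT | 2025 | VQA | Science, Math, Common | T, I | MC and Open | 1,130 |
| EMMA | 2025 | VQA | Science | T, I | MC and Open | 2,788 |
| VisualProcessBench | 2025 | ScienceQA | Math, Science | T, I | MC and Open | 2,866 |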

🎊 Multimodal Reasoning via RL

  • The following table summarizes the techniques that MLLMs trained with RL use to improve long-MCoT reasoning, where "T", "I", "V", and "A" represent Text, Image, Video, and Audio, respectively.
  • In summary, RL can unlock complex reasoning and "aha moments" without SFT, showing its potential to enhance model capabilities through iterative self-improvement and rule-based approaches, and paving the way for more advanced and autonomous multimodal reasoning systems.

| Model | Foundational LLMs | Modality | Learning | Cold Start | Algorithm | Aha-moment |
|---|---|---|---|---|---|---|
| Deepseek-R1-Zero | Deepseek-V3 | T | RL | | GRPO | |
| Deepseek-R1 | Deepseek-V3 | T | SFT+RL | | GRPO | - |
| LLaVA-Reasoner | LLaMA3-LLaVA-NEXT-8B | T, I | SFT+RL | | DPO | - |
| Insight-V | Deepseek-V3 | T, I | SFT+RL | | DPO | - |
| Multimodal-Open-R1 | Qwen2-VL-7B-Instruct | T, I | RL | | GRPO | |
| R1-OneVision | Qwen2.5-VL-7B-Instruct | T, I | SFT | - | - | - |
| R1-V | Qwen2.5-VL | T, I | RL | | GRPO | |
| VLM-R1 | Qwen2.5-VL | T, I | RL | | GRPO | |
| LMM-R1 | Qwen2.5-VL-Instruct-3B | T, I | RL | | PPO | |
| Curr-ReFT | Qwen2.5-VL-3B | T, I | RL+SFT | | GRPO | - |
| Seg-Zero | Qwen2.5-VL-3B + SAM2 | T, I | RL | | GRPO | |
| MM-Eureka | InternVL2.5-Instruct-8B | T, I | SFT+RL | | RLOO | - |
| MM-Eureka-Zero | InternVL2.5-Pretrained-38B | T, I | RL | | GRPO | |
| VisualThinker-R1-Zero | Qwen2-VL-2B | T, I | RL | | GRPO | |
| Easy-R1 | Qwen2.5-VL | T, I | RL | | GRPO | - |
| Open-R1-Video | Qwen2-VL-7B | T, I, V | RL | | GRPO | |
| R1-Omni | HumanOmni-0.5B | T, I, V, A | SFT+RL | | GRPO | - |
| VisRL | Qwen2.5-VL-7B | T, I | SFT+RL | | DPO | - |
| R1-VL | Qwen2-VL-7B | T, I | RL | | StepGRPO | - |
| OpenVLThinker | Qwen2.5-VL-7B-Instruct | T, I | SFT+RL | | GRPO | - |
| EchoInk-R1 | Qwen2.5-Omni-7B | T, I, A | RL | | GRPO | |
| Web-CogReasoner | Qwen2.5-VL-7B | T, I | SFT | - | - | - |
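
As a rough illustration of the rule-based, GRPO-style training that many models in the table above rely on, the sketch below scores a group of sampled responses with a simple rule and normalizes the rewards within the group. It is not taken from any listed model: the `<think>`/`<answer>` tag format, the reward values, and the function names are assumptions for illustration only, and the advantage formula is the group-relative normalization commonly associated with GRPO.

```python
import re
import statistics
from typing import List


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule: 0.5 for the expected format, plus 1.0 for a correct answer."""
    match = re.search(
        r"<think>.*?</think>\s*<answer>(.*?)</answer>", response, re.DOTALL
    )
    if match is None:
        return 0.0  # no reward when the assumed tag format is missing
    reward = 0.5
    if match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Three made-up responses to one prompt whose gold answer is "4".
responses = [
    "<think>Two pairs of apples.</think> <answer>4</answer>",
    "<think>I count three.</think> <answer>3</answer>",
    "The answer is 4.",
]
rewards = [rule_based_reward(r, "4") for r in responses]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed relative to the group mean, no learned value model is needed in this sketch; responses that beat their siblings get a positive signal and the rest get a negative one.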

✨ MCoT Over Various Modalities

MCoT Reasoning Over Image

2026  ·  15 papers
2025  ·  25 papers
2024  ·  30 papers
2023  ·  10 papers

MCoT Reasoning Over Video

2026  ·  2 papers
2025  ·  4 papers
2024  ·  9 papers
2023  ·  2 papers

MCoT Reasoning Over 3D

2025  ·  3 papers
2024  ·  3 papers
2023  ·  1 paper

MCoT Reasoning Over Audio and Speech

2025  ·  6 papers
2024  ·  2 papers
2023  ·  2 papers

MCoT Reasoning Over Table and Chart

2025  ·  2 papers
2024  ·  2 papers
2023  ·  1 paper

Cross-modal CoT Reasoning

2026  ·  1 paper
2025  ·  1 paper
2024  ·  4 papers

🔥 MCoT Methodologies

Rationale Construction

MCoT reasoning methodologies primarily concern how rationales are constructed and fall into three types: prompt-based, plan-based, and learning-based methods.

  1. Prompt-based MCoT reasoning employs carefully designed prompts, including instructions or in-context demonstrations, to guide models in generating rationales during inference, typically in zero-shot or few-shot settings.
  2. Plan-based MCoT reasoning enables models to dynamically explore and refine thoughts during the reasoning process.
  3. Learning-based MCoT reasoning embeds rationale construction within the training or fine-tuning process, requiring models to explicitly learn reasoning skills alongside multimodal inputs.
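
To make the prompt-based category more concrete, here is a minimal sketch of how a zero-shot or few-shot MCoT prompt might be assembled. Everything in it is an assumption for illustration: the `Demo` class, the `build_mcot_prompt` function, the `<image>` placeholder, and the "Let's think step by step." trigger are not prescribed by the survey, and real MLLMs define their own image-token and chat-template conventions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demo:
    """A hypothetical in-context demonstration: question, rationale, answer."""
    question: str
    rationale: str
    answer: str


def build_mcot_prompt(
    question: str,
    demos: Sequence[Demo] = (),
    image_token: str = "<image>",
) -> str:
    """Assemble a prompt that asks for a rationale before the answer.

    With no demos this is a zero-shot prompt; with demos it is few-shot.
    `image_token` is only a placeholder for wherever the image is injected.
    """
    parts: List[str] = []
    for d in demos:
        parts.append(
            f"{image_token}\nQuestion: {d.question}\n"
            f"Rationale: {d.rationale}\nAnswer: {d.answer}"
        )
    # The final block leaves the rationale open for the model to continue.
    parts.append(
        f"{image_token}\nQuestion: {question}\n"
        "Rationale: Let's think step by step."
    )
    return "\n\n".join(parts)


zero_shot = build_mcot_prompt("How many apples are on the table?")
few_shot = build_mcot_prompt(
    "How many apples are on the table?",
    demos=[Demo("What color is the car?", "The car body is painted red.", "Red")],
)
```

The few-shot variant simply prepends worked examples whose rationales show the model the step-by-step style it is expected to imitate.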

Structural Reasoning

The proposed structural reasoning framework aims to enhance the controllability and interpretability of the rationale generation process. The structured formats can be categorized into three types: asynchronous modality modeling, defined procedure staging, and autonomous procedure staging.

Information Enhancing

Enhancing multimodal inputs facilitates comprehensive reasoning through the integration of expert tools and internal or external knowledge.

Objective Granularity

Multimodal Rationale

The reasoning processes adopt either text-only or multimodal rationales.

Test-time Scaling


🎨 Applications with MCoT Reasoning

Embodied AI

Agentic System

Autonomous Driving

Medical and Healthcare

Social and Human

Multimodal Generation


🚀 Useful Links

Survey


❤️ Citation

We would be honored if this work helps you, and we would greatly appreciate it if you would consider starring the repo and citing the survey:

@article{wang2025multimodal,
  title={Multimodal chain-of-thought reasoning: A comprehensive survey},
  author={Wang, Yaoting and Wu, Shengqiong and Zhang, Yuecheng and Yan, Shuicheng and Liu, Ziwei and Luo, Jiebo and Fei, Hao},
  journal={arXiv preprint arXiv:2503.12605},
  year={2025}
}

⭐️ Star History

Star History Chart