PyTorch Implementation of MVoT
April 12, 2025
This is the PyTorch implementation of the MVoT paper: Imagine While Reasoning in Space: Multimodal Visualization-of-Thought
Authors: Chengzu Li*, Wenshan Wu*, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, Furu Wei. (* equal contribution)
Overview
Multimodal Visualization-of-Thought (MVoT) is a novel reasoning strategy that enables reasoning with native multimodal traces in the context of spatial reasoning. It generates explicit visual thoughts through autoregressive generation and optimizes visualization quality via a token discrepancy loss.
MVoT boosts spatial reasoning performance across a variety of tasks and models, and establishes new possibilities for complex tasks where visual thinking effectively complements verbal reasoning.
Key Contributions
- Multimodal Visualization-of-Thought: a natively multimodal reasoning paradigm that unifies text and vision within the reasoning traces.
- Token Discrepancy Loss: a loss for autoregressive MLLMs that bridges the gap between separately trained tokenizers.
- Strong empirical performance across multiple dynamic spatial reasoning tasks.
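To make the token discrepancy loss more concrete, here is a minimal, dependency-free sketch. It assumes the loss weights the model's predicted probability for each visual codebook entry by that entry's embedding distance (mean squared error) to the ground-truth token's embedding, so probability mass placed on visually dissimilar tokens is penalized more. The function names, the averaging over positions, and the toy codebook are illustrative assumptions, not the repository's code; refer to the paper for the exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def token_discrepancy_loss(logits, target_ids, codebook):
    """Hypothetical sketch of a token discrepancy loss.

    logits:     one list of N logits per image-token position
    target_ids: ground-truth codebook index per position
    codebook:   N embedding vectors of the visual tokenizer
    """
    loss = 0.0
    for pos_logits, target in zip(logits, target_ids):
        probs = softmax(pos_logits)
        # Distance from the ground-truth embedding to every codebook entry.
        distances = [mse(codebook[target], emb) for emb in codebook]
        # Expected distance under the predicted distribution.
        loss += sum(d * p for d, p in zip(distances, probs))
    # Averaging over positions is an assumption made for this sketch.
    return loss / len(target_ids)
```

In this sketch the loss is zero when all probability mass sits on the ground-truth token and grows as mass moves toward entries whose embeddings are farther away. A real implementation would compute the same quantity with batched tensor operations.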
Installation
git clone https://github.com/chengzu-li/MVoT.git
cd MVoT
conda create -n mvot python=3.10
conda activate mvot
pip install torch==2.4.0
pip install -r requirements.txt --user
Usage
Data
Only partial datasets are released, for debugging and to illustrate the data format. For any further inquiries about the data, please contact the authors.
unzip data_samples.zip
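Since the released samples are mainly meant to illustrate the data format, it can help to look at what the archive contains before extracting it. The sketch below uses only Python's standard `zipfile` module. The helper name `summarize_archive` is hypothetical, and nothing is assumed about the internal layout of `data_samples.zip`.

```python
import zipfile


def summarize_archive(path):
    """Return (number of files, sorted top-level entry names) for a zip archive."""
    with zipfile.ZipFile(path) as zf:
        # Skip explicit directory entries, which end with a slash.
        files = [name for name in zf.namelist() if not name.endswith("/")]
    top_level = sorted({name.split("/", 1)[0] for name in files})
    return len(files), top_level


# Example usage (run from the repository root):
#   count, top_level = summarize_archive("data_samples.zip")
#   print(count, top_level)
```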
Training MVoT with Anole
Single Node with Multiple GPUs
torchrun --nproc_per_node <GPU_NUM> train.py \
--model anole \
--data interleaved_maze \
--decoder_type anole \
--image_seq_length 1024 \
--input_format anole \
--output <OUTPUT_PATH> \
--note <NOTE> \
--report_to none \
--do_train \
--do_eval
Multiple Nodes with Multiple GPUs
torchrun --nnodes <NODE_NUM> --nproc_per_node <GPU_NUM_PER_NODE> --node_rank $RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT train.py \
--model anole \
--data frozenlake \
--decoder_type anole \
--image_seq_length 1024 \
--input_format anole \
--output <OUTPUT_PATH> \
--note <NOTE> \
--report_to none \
--do_train \
--do_eval
- Set <GPU_NUM>, <NODE_NUM>, <GPU_NUM_PER_NODE>, <OUTPUT_PATH>, and <NOTE> according to your setup. For multi-node runs, $RANK, $MASTER_ADDR, and $MASTER_PORT must also be set on each node.
Citation
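If you launch many runs, the two commands above can be assembled programmatically. The sketch below is a hypothetical helper (`build_torchrun_cmd` is not part of the repository) that only reuses the flags shown above. The output path and note passed in the usage comment are placeholders.

```python
import shlex


def build_torchrun_cmd(gpus_per_node, data, output, note, nnodes=1,
                       node_rank=None, master_addr=None, master_port=None):
    """Build the torchrun argument list for train.py from the flags shown above."""
    cmd = ["torchrun"]
    if nnodes > 1:
        # Multi-node runs need the rendezvous details for every node.
        if node_rank is None or master_addr is None or master_port is None:
            raise ValueError("multi-node runs need node_rank, master_addr, master_port")
        cmd += ["--nnodes", str(nnodes)]
    cmd += ["--nproc_per_node", str(gpus_per_node)]
    if nnodes > 1:
        cmd += ["--node_rank", str(node_rank),
                "--master_addr", str(master_addr),
                "--master_port", str(master_port)]
    cmd += ["train.py",
            "--model", "anole",
            "--data", data,
            "--decoder_type", "anole",
            "--image_seq_length", "1024",
            "--input_format", "anole",
            "--output", output,
            "--note", note,
            "--report_to", "none",
            "--do_train", "--do_eval"]
    return cmd


# Example usage with placeholder values:
#   cmd = build_torchrun_cmd(4, "interleaved_maze", "outputs/run1", "debug")
#   print(shlex.join(cmd))               # inspect the command
#   # subprocess.run(cmd, check=True)    # launch it (requires `import subprocess`)
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting; `shlex.join` is used only for display.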
If you find this work useful, please consider citing:
@misc{li2025imaginereasoningspacemultimodal,
title={Imagine while Reasoning in Space: Multimodal Visualization-of-Thought},
author={Chengzu Li and Wenshan Wu and Huanyu Zhang and Yan Xia and Shaoguang Mao and Li Dong and Ivan Vulić and Furu Wei},
year={2025},
eprint={2501.07542},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.07542},
}