Awesome Spatial Reasoning with MVLMs
January 25, 2026
This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).
Feel free to open a Pull Request to add new work!
📑 Table of Contents
Introduction
In this survey, we provide a comprehensive review of existing tasks in multimodal spatial reasoning with large models, categorizing recent progress at the frontier of multimodal large language models (MLLMs) and introducing open benchmarks for evaluating these models. We start by reviewing general spatial reasoning, with a focus on post-training techniques, explainability, and architecture. Beyond classical 2D scenarios, we systematically review spatial relationship reasoning, scene and layout reasoning, and visual question answering and grounding in 3D space.
We also discuss recent advances in embodied AI tasks, such as vision-language navigation and action models. In addition, we cover audio and egocentric video, which bring distinct and emerging forms of spatial understanding with novel sensors. We believe this survey establishes a solid foundation and offers valuable insights into the critical field of multimodal spatial reasoning.
Existing reasoning surveys are listed in Reasoning_survey.md.
Papers
3D Vision
Embodied AI
General MLLM
Video / Audio / Egocentric
Spatial Benchmark
Resources
Workshops and Tutorials
TBD
Contributing
Contributions are welcome! To contribute:
- Fork this repository
- Add your paper or resource to the appropriate Markdown file, or create a new one
- Update the link list in README.md if needed
- Submit a Pull Request 🎉
Citation
If you find this project helpful, please cite:
@article{zheng2025multimodal,
title={Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks},
author={Zheng, Xu and Dongfang, Zihao and Jiang, Lutao and Zheng, Boyuan and Guo, Yulong and Zhang, Zhenquan and Albanese, Giuliano and Yang, Runyi and Ma, Mengjiao and Zhang, Zixin and others},
journal={arXiv preprint arXiv:2510.25760},
url={https://arxiv.org/abs/2510.25760},
year={2025}
}
License
This project is licensed under the MIT License — see the LICENSE file for details.