General_MLLM
November 21, 2025
| Title | Authors | Venue/Date | Paper Link | Code | Entire/Partial | Modal | Remarks |
|---|---|---|---|---|---|---|---|
| Benchmark and Dataset | |||||||
| Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation | Liao et al. | Arxiv 2025 (Oct) | paper | https://github.com/KangLiao929/Puffin | Entire | Image-Text-Camera | |
| From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | Wang et al. | NeurIPS 2025 | paper | https://anywhere-3d.github.io/ | Entire | Image-Text | |
| Video-R1: Reinforcing Video Reasoning in MLLMs | Feng et al. | Arxiv 2025 (Mar) | paper | https://github.com/tulerfeng/Video-R1 | Partial | Image-Text | |
| Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding | Imran Kabir et al. | Arxiv 2025 (Mar) | paper | https://github.com/Imran2205/LogicRAG | Entire | Video-Text | |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | Peiran Wu et al. | Arxiv 2025 (Mar) | paper | / | Entire | Video-Text | |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Ziyue Wang et al. | Arxiv 2025 (Mar) | paper | https://github.com/THUNLP-MT/EscapeCraft | Entire | Image-Text | |
| ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models | Jonathan Roberts et al. | Arxiv 2025 (Feb) | paper | Github | Entire | Image-Text | |
| LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations | Mingjie Xu et al. | WACV 2025 | paper | https://github.com/Endlinc/LLaVA-SpaceSGG | Entire | Graph-Desc/QA/Conv | |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Hongyu Li et al. | Arxiv 2025 (Jan) | paper | https://github.com/appletea233/LLaVA-ST | Entire | Video-Text(QA) | |
| Thinking in space: How multimodal large language models see, remember, and recall spaces | Yang et al. | CVPR 2025 | paper | code | Entire | Video-Text(QA) | |
| Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models | Xingrui Wang et al. | CVPR 2025 | paper | https://github.com/XingruiWang/Spatial457 | Entire | Image-Text | |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | Liao et al. | Arxiv 2025 (Apr) | paper | https://github.com/zhijie-group/R1-Zero-VSI | Entire | Video-Text(QA) | |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | Chengzu Li et al. | Arxiv 2025 (Jan) | paper | / | Entire | Image-Text | |
| MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | Huanqia Cai et al. | Arxiv 2025 (Feb) | paper | Github | Entire | Image-Text | Doubtful |
| CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs | Siyu Wang et al. | AAAI 2025 | paper | Github | Entire | CAD-Text | Doubtful |
| GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs | Navid Rajabi et al. | NIPS 2024 Workshop | paper | / | Entire | Image-Text(QA) | |
| DriveLM: Driving with Graph Visual Question Answering | Chonghao Sima et al. | ECCV 2024 | paper | https://github.com/OpenDriveLab/DriveLM | Entire | Image/Graph-Text(QA) | |
| Spatial Task-Explicity Matters in Prompting Large Multimodal Models for Spatial Planning | Ivan Majic et al. | GeoAI 2024 | paper | https://github.com/ivan-majic/llm_modality_reasoning | Entire | Image-Text | |
| A Benchmark Dataset for Evaluating Spatial Perception in Multimodal Large Models | Li Xuan et al. | IOTMMIM 24 | paper | / | Entire | Image-Text | |
| PUZZLEVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns | Yew Ken Chia et al. | ACL 2024 | paper | https://github.com/declare-lab/LLM-PuzzleTest | Entire | Image-Text | Doubtful |
| SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | An-Chieh Cheng et al. | NIPS 2024 | paper | Github | Entire | Image-Text(QA) | |
| SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models | Arijit Ray et al. | Arxiv 2024 (Dec) | paper | Huggingface | Entire | Image-Text(QA) | |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | Xingyu Fu et al. | Arxiv 2024 (Apr) | paper | Github | Entire | Image-Text(QA) | |
| Does Spatial Cognition Emerge in Frontier Models? | Ramakrishnan et al. | Arxiv 2024 (Oct) | paper | / | Entire | Image-Text | |
| Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models | Wang et al. | NeurIPS 2024 | paper | code | Entire | Image-Text | |
| CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Feng et al. | Arxiv 2024 (Jun) | paper | code | Partial | Map/Image/Geo-Text | |
| DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving | Guo et al. | Arxiv 2024 (Nov) | paper | code | Entire | Image-Text(QA) | |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | Boyuan Chen et al. | Arxiv 2024 (Jan) | paper | Github | Entire | Image-Text | |
| EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models | Mengfei Du et al. | ACL 2024 | paper | https://github.com/mengfeidu/EmbSpatial-Bench | Entire | Image-Text(QA) | |
| AirVista: Empowering UAVs with 3D Spatial Reasoning Abilities Through a Multimodal Large Language Model Agent | Fei Lin et al. | ITSC 2024 | paper | / | Entire | Image-Text | |
| What's "up" with vision-language models? Investigating their struggle with spatial reasoning | Amita Kamath et al. | EMNLP 2023 | paper | https://github.com/amitakamath/whatsup_vlms | Entire | Image-Text | |
| Visual Spatial Reasoning | Liu et al. | TACL Volume 11 2023 | paper | code | Entire | Image-Text | |
| VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena | Letitia Parcalabescu et al. | ACL 2022 | paper | https://github.com/Heidelberg-NLP/VALSE | Partial | Image-Text | |
| Things not written in text: Exploring spatial commonsense from visual signals | Xiao Liu et al. | ACL 2022 | paper | https://github.com/xxxiaol/spatial-commonsense | Entire | Image-Text | |
| SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning | Roshanak Mirzaee et al. | NAACL 2021 | paper | https://github.com/HLR/SpartQA_generation | Entire | Text | |
| 2.5D Visual Relationship Detection | Yu-Chuan Su et al. | Arxiv 2021 (Apr) | paper | https://github.com/google-research-datasets/2.5vrd | Entire | Image-Text | |
| PIP: Physical Interaction Prediction via Mental Simulation with Span Selection | Jiafei Duan et al. | Arxiv 2021 (Sep) | paper | / | Entire | Video-Text(Classify) | |
| TVQA+: Spatio-temporal grounding for video question answering | Jie Lei et al. | ACL 2020 | paper | https://github.com/jayleicn/TVQAplus | Entire | Video-Text | |
| Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D | Ankit Goyal et al. | NIPS 2020 | paper | https://github.com/princeton-vl/Rel3D | Entire | Image-Text | |
| SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition | Kaiyu Yang et al. | ICCV 2019 | paper | https://github.com/princeton-vl/SpatialSense | Entire | Image-Text | |
| Acquiring Common Sense Spatial Knowledge through Implicit Spatial Templates | Guillem Collell et al. | AAAI 2018 | paper | https://github.com/gcollell/spatial-commonsense | Entire | Image-Text | |
| Visual Genome: Connecting language and vision using crowdsourced dense image annotations | Ranjay Krishna et al. | IJCV 2017 | paper | Code | Entire | Image-Text | |
| Scene Graph | |||||||
| MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs | Jihyung Kil et al. | NIPS 2024 | paper | Github | Partial | Image-Text(QA) | |
| Stating the Obvious: Extracting Visual Common Sense Knowledge | Mark Yatskar et al. | NAACL 2016 | paper | / (extract from COCO) | Entire | Text | |