General_MLLM

November 21, 2025

Title	Authors	Venue/Date	Paper Link	Code	Entire/Partial	Modal	Remarks
Benchmark and Dataset
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation	Liao et al.	Arxiv 2025 (Oct)	paper	https://github.com/KangLiao929/Puffin	Entire	Image-Text-Camera
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes	Wang et al.	NeurIPS 2025	paper	https://anywhere-3d.github.io/	Entire	Image-Text
Video-R1: Reinforcing Video Reasoning in MLLMs	Feng et al.	Arxiv 2025 (Mar)	paper	https://github.com/tulerfeng/Video-R1	Partial	Image-Text
Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding	Imran Kabir et al.	Arxiv 2025 (Mar)	paper	https://github.com/Imran2205/LogicRAG	Entire	Video-Text
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos	Peiran Wu et al.	Arxiv 2025 (Mar)	paper	/	Entire	Video-Text
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game	Ziyue Wang et al.	Arxiv 2025 (Mar)	paper	https://github.com/THUNLP-MT/EscapeCraft	Entire	Image-Text
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models	Jonathan Roberts et al.	Arxiv 2025 (Feb)	paper	Github	Entire	Image-Text
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations	Mingjie Xu et al.	WACV 2025	paper	https://github.com/Endlinc/LLaVA-SpaceSGG	Entire	Graph-Desc/QA/Conv
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding	Hongyu Li et al.	Arxiv 2025 (Jan)	paper	https://github.com/appletea233/LLaVA-ST	Entire	Video-Text(QA)
Thinking in space: How multimodal large language models see, remember, and recall spaces	Yang et al.	CVPR 2025	paper	code	Entire	Video-Text(QA)
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models	Xingrui Wang et al.	CVPR 2025	paper	https://github.com/XingruiWang/Spatial457	Entire	Image-Text
Improved Visual-Spatial Reasoning via R1-Zero-Like Training	Liao et al.	Arxiv 2025 (Apr)	paper	https://github.com/zhijie-group/R1-Zero-VSI	Entire	Video-Text(QA)
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought	Chengzu Li et al.	Arxiv 2025 (Jan)	paper	/	Entire	Image-Text
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models	Huanqia Cai et al.	Arxiv 2025 (Feb)	paper	Github	Entire	Image-Text	Doubtful
CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs	Siyu Wang et al.	AAAI 2025	paper	Github	Entire	CAD-Text	Doubtful
GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs	Navid Rajabi et al.	NeurIPS 2024 Workshop	paper	/	Entire	Image-Text(QA)
DriveLM: Driving with Graph Visual Question Answering	Chonghao Sima et al.	ECCV 2024	paper	https://github.com/OpenDriveLab/DriveLM	Entire	Image/Graph-Text(QA)
Spatial Task-Explicity Matters in Prompting Large Multimodal Models for Spatial Planning	Ivan Majic et al.	GeoAI 2024	paper	https://github.com/ivan-majic/llm_modality_reasoning	Entire	Image-Text
A Benchmark Dataset for Evaluating Spatial Perception in Multimodal Large Models	Li Xuan et al.	IOTMMIM 24	paper	/	Entire	Image-Text
PUZZLEVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns	Yew Ken Chia et al.	ACL 2024	paper	https://github.com/declare-lab/LLM-PuzzleTest	Entire	Image-Text	Doubtful
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models	An-Chieh Cheng et al.	NeurIPS 2024	paper	Github	Entire	Image-Text(QA)
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models	Arijit Ray et al.	Arxiv 2024 (Dec)	paper	Huggingface	Entire	Image-Text(QA)
BLINK: Multimodal Large Language Models Can See but Not Perceive	Xingyu Fu et al.	Arxiv 2024 (Apr)	paper	Github	Entire	Image-Text(QA)
Does Spatial Cognition Emerge in Frontier Models?	Ramakrishnan et al.	Arxiv 2024 (Oct)	paper	/	Entire	Image-Text
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models	Wang et al.	NeurIPS 2024	paper	code	Entire	Image-Text
CityGPT: Empowering Urban Spatial Cognition of Large Language Models	Feng et al.	Arxiv 2024 (Jun)	paper	code	Partial	Map/Image/Geo-Text
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving	Guo et al.	Arxiv 2024 (Nov)	paper	code	Entire	Image-Text(QA)
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities	Boyuan Chen et al.	Arxiv 2024 (Jan)	paper	Github	Entire	Image-Text
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models	Mengfei Du et al.	ACL 2024	paper	https://github.com/mengfeidu/EmbSpatial-Bench	Entire	Image-Text(QA)
AirVista: Empowering UAVs with 3D Spatial Reasoning Abilities Through a Multimodal Large Language Model Agent	Fei Lin et al.	ITSC 2024	paper	/	Entire	Image-Text
What's "up" with vision-language models? Investigating their struggle with spatial reasoning	Amita Kamath et al.	EMNLP 2023	paper	https://github.com/amitakamath/whatsup_vlms	Entire	Image-Text
Visual Spatial Reasoning	Liu et al.	TACL Volume 11 2023	paper	code	Entire	Image-Text
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena	Letitia Parcalabescu et al.	ACL 2022	paper	https://github.com/Heidelberg-NLP/VALSE	Partial	Image-Text
Things not written in text: Exploring spatial commonsense from visual signals	Xiao Liu et al.	ACL 2022	paper	https://github.com/xxxiaol/spatial-commonsense	Entire	Image-Text
SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning	Roshanak Mirzaee et al.	NAACL 2021	paper	https://github.com/HLR/SpartQA_generation	Entire	Text
2.5D Visual Relationship Detection	Yu-Chuan Su et al.	Arxiv 2021 (Apr)	paper	https://github.com/google-research-datasets/2.5vrd	Entire	Image-Text
PIP: Physical Interaction Prediction via Mental Simulation with Span Selection	Jiafei Duan et al.	Arxiv 2021 (Sep)	paper	/	Entire	Video-Text(Classify)
TVQA+: Spatio-temporal grounding for video question answering	Jie Lei et al.	ACL 2020	paper	https://github.com/jayleicn/TVQAplus	Entire	Video-Text
Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D	Ankit Goyal et al.	NeurIPS 2020	paper	https://github.com/princeton-vl/Rel3D	Entire	Image-Text
SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition	Kaiyu Yang et al.	ICCV 2019	paper	https://github.com/princeton-vl/SpatialSense	Entire	Image-Text
Acquiring Common Sense Spatial Knowledge through Implicit Spatial Templates	Guillem Collell et al.	AAAI 2018	paper	https://github.com/gcollell/spatial-commonsense	Entire	Image-Text
Visual Genome: Connecting language and vision using crowdsourced dense image annotations	Ranjay Krishna et al.	IJCV 2017	paper	Code	Entire	Image-Text
Scene Graph
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs	Jihyung Kil et al.	NeurIPS 2024	paper	Github	Partial	Image-Text(QA)
Stating the Obvious: Extracting Visual Common Sense Knowledge	Mark Yatskar et al.	NAACL 2016	paper	/ (extract from COCO)	Entire	Text