General_MLLM

November 21, 2025

| Title | Authors | Venue/Date | Paper Link | Code | Entire/Partial | Modal | Remarks |
|---|---|---|---|---|---|---|---|
| Benchmark and Dataset | | | | | | | |
| Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation | Liao et al. | arXiv 2025 (Oct) | paper | https://github.com/KangLiao929/Puffin | Entire | Image-Text-Camera | |
| From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | Wang et al. | NeurIPS 2025 | paper | https://anywhere-3d.github.io/ | Entire | Image-Text | |
| Video-R1: Reinforcing Video Reasoning in MLLMs | Feng et al. | arXiv 2025 (Mar) | paper | https://github.com/tulerfeng/Video-R1 | Partial | Image-Text | |
| Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding | Imran Kabir et al. | arXiv 2025 (Mar) | paper | https://github.com/Imran2205/LogicRAG | Entire | Video-Text | |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | Peiran Wu et al. | arXiv 2025 (Mar) | paper | / | Entire | Video-Text | |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Ziyue Wang et al. | arXiv 2025 (Mar) | paper | https://github.com/THUNLP-MT/EscapeCraft | Entire | Image-Text | |
| ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models | Jonathan Roberts et al. | arXiv 2025 (Feb) | paper | GitHub | Entire | Image-Text | |
| LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations | Mingjie Xu et al. | WACV 2025 | paper | https://github.com/Endlinc/LLaVA-SpaceSGG | Entire | Graph-Desc/QA/Conv | |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Hongyu Li et al. | arXiv 2025 (Jan) | paper | https://github.com/appletea233/LLaVA-ST | Entire | Video-Text(QA) | |
| Thinking in space: How multimodal large language models see, remember, and recall spaces | Yang et al. | CVPR 2025 | paper | code | Entire | Video-Text(QA) | |
| Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models | Xingrui Wang et al. | CVPR 2025 | paper | https://github.com/XingruiWang/Spatial457 | Entire | Image-Text | |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | Liao et al. | arXiv 2025 (Apr) | paper | https://github.com/zhijie-group/R1-Zero-VSI | Entire | Video-Text(QA) | |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | Chengzu Li et al. | arXiv 2025 (Jan) | paper | / | Entire | Image-Text | |
| MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | Huanqia Cai et al. | arXiv 2025 (Feb) | paper | GitHub | Entire | Image-Text | Doubtful |
| CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs | Siyu Wang et al. | AAAI 2025 | paper | GitHub | Entire | CAD-Text | Doubtful |
| GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs | Navid Rajabi et al. | NeurIPS 2024 Workshop | paper | / | Entire | Image-Text(QA) | |
| DriveLM: Driving with Graph Visual Question Answering | Chonghao Sima et al. | ECCV 2024 | paper | https://github.com/OpenDriveLab/DriveLM | Entire | Image/Graph-Text(QA) | |
| Spatial Task-Explicity Matters in Prompting Large Multimodal Models for Spatial Planning | Ivan Majic et al. | GeoAI 2024 | paper | https://github.com/ivan-majic/llm_modality_reasoning | Entire | Image-Text | |
| A Benchmark Dataset for Evaluating Spatial Perception in Multimodal Large Models | Li Xuan et al. | IOTMMIM 2024 | paper | / | Entire | Image-Text | |
| PUZZLEVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns | Yew Ken Chia et al. | ACL 2024 | paper | https://github.com/declare-lab/LLM-PuzzleTest | Entire | Image-Text | Doubtful |
| SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | An-Chieh Cheng et al. | NeurIPS 2024 | paper | GitHub | Entire | Image-Text(QA) | |
| SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models | Arijit Ray et al. | arXiv 2024 (Dec) | paper | Hugging Face | Entire | Image-Text(QA) | |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | Xingyu Fu et al. | arXiv 2024 (Apr) | paper | GitHub | Entire | Image-Text(QA) | |
| Does Spatial Cognition Emerge in Frontier Models? | Ramakrishnan et al. | arXiv 2024 (Oct) | paper | / | Entire | Image-Text | |
| Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models | Wang et al. | NeurIPS 2024 | paper | code | Entire | Image-Text | |
| CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Feng et al. | arXiv 2024 (Jun) | paper | code | Partial | Map/Image/Geo-Text | |
| DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving | Guo et al. | arXiv 2024 (Nov) | paper | code | Entire | Image-Text(QA) | |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | Boyuan Chen et al. | arXiv 2024 (Jan) | paper | GitHub | Entire | Image-Text | |
| EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models | Mengfei Du et al. | ACL 2024 | paper | https://github.com/mengfeidu/EmbSpatial-Bench | Entire | Image-Text(QA) | |
| AirVista: Empowering UAVs with 3D Spatial Reasoning Abilities Through a Multimodal Large Language Model Agent | Fei Lin et al. | ITSC 2024 | paper | / | Entire | Image-Text | |
| What's "up" with vision-language models? Investigating their struggle with spatial reasoning | Amita Kamath et al. | EMNLP 2023 | paper | https://github.com/amitakamath/whatsup_vlms | Entire | Image-Text | |
| Visual Spatial Reasoning | Liu et al. | TACL Volume 11, 2023 | paper | code | Entire | Image-Text | |
| VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena | Letitia Parcalabescu et al. | ACL 2022 | paper | https://github.com/Heidelberg-NLP/VALSE | Partial | Image-Text | |
| Things not written in text: Exploring spatial commonsense from visual signals | Xiao Liu et al. | ACL 2022 | paper | https://github.com/xxxiaol/spatial-commonsense | Entire | Image-Text | |
| SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning | Roshanak Mirzaee et al. | NAACL 2021 | paper | https://github.com/HLR/SpartQA_generation | Entire | Text | |
| 2.5D Visual Relationship Detection | Yu-Chuan Su et al. | arXiv 2021 (Apr) | paper | https://github.com/google-research-datasets/2.5vrd | Entire | Image-Text | |
| PIP: Physical Interaction Prediction via Mental Simulation with Span Selection | Jiafei Duan et al. | arXiv 2021 (Sep) | paper | / | Entire | Video-Text(Classify) | |
| TVQA+: Spatio-temporal grounding for video question answering | Jie Lei et al. | ACL 2020 | paper | https://github.com/jayleicn/TVQAplus | Entire | Video-Text | |
| Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D | Ankit Goyal et al. | NeurIPS 2020 | paper | https://github.com/princeton-vl/Rel3D | Entire | Image-Text | |
| SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition | Kaiyu Yang et al. | ICCV 2019 | paper | https://github.com/princeton-vl/SpatialSense | Entire | Image-Text | |
| Acquiring Common Sense Spatial Knowledge through Implicit Spatial Templates | Guillem Collell et al. | AAAI 2018 | paper | https://github.com/gcollell/spatial-commonsense | Entire | Image-Text | |
| Visual Genome: Connecting language and vision using crowdsourced dense image annotations | Ranjay Krishna et al. | IJCV 2017 | paper | code | Entire | Image-Text, Scene Graph | |
| MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs | Jihyung Kil et al. | NeurIPS 2024 | paper | GitHub | Partial | Image-Text(QA) | |
| Stating the Obvious: Extracting Visual Common Sense Knowledge | Mark Yatskar et al. | NAACL 2016 | paper | / (extracted from COCO) | Entire | Text | |