September 19, 2025

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

"What I cannot create, I do not understand." (Richard Feynman)

Parker Liu¹,*, Chenxin Li¹,*, Zhengxin Li², Yipeng Wu², Wuyang Li³, Zhiqin Yang¹, Zhenyuan Zhang⁴, Yunlong Lin⁵, Sirui Han⁴, Brandon Y. Feng⁶
¹CUHK, ²TJU, ³EPFL, ⁴HKUST, ⁵XMU, ⁶MIT
NeurIPS DB 2025

🌟 Motivation & Useful Findings

Inspired by Richard Feynman's aphorism (see the header), we propose a new perspective for evaluating VLMs' spatial visual understanding via a pretext task: how well can they "recreate this scene"?
We found that the goal of scene reconstruction leads VLMs to spontaneously estimate key attributes (object ID, localization, color, material, object relations, etc.) in an inverse-rendering fashion, which is critical for understanding what they see.
VLMs show surprising potential for human-like reflection during this "recreation" game: when fed their recreated scenes, they compare them with the originals and update their understanding of the scene (the key attributes they estimate). We expect this multi-round feedback iteration to unlock more possibilities for improving existing VLMs in both understanding and generation performance.
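
The recreate-compare-update loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the benchmark's implementation: `SceneObject`, `Scene`, `refine_scene`, and the `estimate`, `render`, and `revise` callables are hypothetical names standing in for the VLM call and the rendering step, and the fields cover only some of the attributes listed above (object relations are left out for brevity).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneObject:
    """One estimated object (hypothetical schema mirroring the listed attributes)."""
    object_id: str
    position: Tuple[float, float, float]
    color: str
    material: str


@dataclass
class Scene:
    objects: List[SceneObject] = field(default_factory=list)


def refine_scene(
    image,
    estimate: Callable[[object], Scene],
    render: Callable[[Scene], object],
    revise: Callable[[object, object, Scene], Scene],
    rounds: int = 3,
) -> List[Scene]:
    """Run the recreate-compare-update loop; return the scene after each round."""
    scene = estimate(image)  # initial inverse-rendering guess
    history = [scene]
    for _ in range(rounds):
        recreated = render(scene)  # e.g. a render of the current estimate
        # The model sees the original and the recreation, then updates its estimate.
        scene = revise(image, recreated, scene)
        history.append(scene)
    return history
```

Keeping the per-round history makes it possible to check whether later rounds actually improve the estimate rather than assuming they do.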

🎨 Pipeline Overview

[Figure: pipeline overview]

🛠️ Environment setup

(1) Create Environment:

conda create --name ir3d python=3.10
conda activate ir3d

(2) Install vLLM:

pip install vllm

(3) Install Blender on Linux:

snap install blender --classic

(4) Install SAM (Segment Anything):

pip install git+https://github.com/facebookresearch/segment-anything.git
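
After the four steps above, a quick sanity check can confirm that the pieces are visible from the `ir3d` environment. The sketch below is not part of the repository. It assumes the Python packages import as `vllm` and `segment_anything`, and that the snap install puts a `blender` executable on `PATH`. `check_environment` is a hypothetical helper name.

```python
import importlib.util
import shutil


def check_environment(modules=("vllm", "segment_anything"), executables=("blender",)):
    """Return a dict mapping each requirement to True if it was found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        status[exe] = shutil.which(exe) is not None
    return status


if __name__ == "__main__":
    for requirement, found in check_environment().items():
        print(f"{requirement}: {'ok' if found else 'MISSING'}")
```

This only checks that each requirement can be located; it does not import the packages or verify versions.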

📚 Dataset setup

Download our processed data: IR3D-bench-data.

Inverse Rendering

Task prompt

The inverse-rendering prompt is in prompts/vlm_estimate_params.txt, and the GPT-4o scoring prompt is in prompts/gpt4o_as_evaluator.txt.

Open-source Models

Set --model-type to one of the model names defined in main_vllm.py.

python main_vllm.py --model-type "model-name"

Latest Proprietary Models

Set --model_name to the API model you need, such as "gpt-4o" or "grok-3".

python main_api.py \
    --image_dir /path/to/images \
    --result_dir /output/path \
    --prompt_path prompts/vlm_estimate_params.txt \
    --model_name "model-name"
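
To run several API models in a row, the main_api.py command above can be wrapped in a small driver. The sketch below is an illustration only: `build_api_command` and `run_models` are hypothetical helpers, the flags are exactly those shown above, and writing each model's results to its own sub-directory of the output path is an assumption, not something the repository prescribes.

```python
import shlex
import subprocess
import sys


def build_api_command(model_name, image_dir, result_dir,
                      prompt_path="prompts/vlm_estimate_params.txt"):
    """Build the argv list for one main_api.py run, using the flags shown above."""
    return [
        sys.executable, "main_api.py",
        "--image_dir", image_dir,
        "--result_dir", result_dir,
        "--prompt_path", prompt_path,
        "--model_name", model_name,
    ]


def run_models(model_names, image_dir, result_root, dry_run=True):
    """Run (or just print) one command per model; return the argv lists."""
    commands = []
    for name in model_names:
        # Assumption: one result sub-directory per model, to keep outputs apart.
        cmd = build_api_command(name, image_dir, f"{result_root}/{name}")
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop on the first failing run
    return commands
```

For example, `run_models(["gpt-4o", "grok-3"], "/path/to/images", "/output/path")` prints the two commands without executing them; pass `dry_run=False` from the repository root to actually launch the runs.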

Eval

bash cal_metric.sh "/output/path" "/path/to/images" "GPU_ID"

🎈 Acknowledgement

Thanks to the following fantastic repos: SAM, vLLM, CLEVR dataset, Blender.

📒 Citation

If you find our work helpful, please consider citing:

@article{liu2025ir3d,
  title={IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering},
  author={Liu, Parker and Li, Chenxin and Li, Zhengxin and Wu, Yipeng and Li, Wuyang and Yang, Zhiqin and Zhang, Zhenyuan and Lin, Yunlong and Han, Sirui and Feng, Brandon Y},
  journal={arXiv preprint arXiv:2506.23329},
  year={2025}
}