🌟 Motivation & Useful Findings

September 19, 2025

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

"What I cannot create, I do not understand." - Richard Feynman



Parker Liu1,*, Chenxin Li1,*, Zhengxin Li2, Yipeng Wu2, Wuyang Li3, Zhiqin Yang1, Zhenyuan Zhang4, Yunlong Lin5, Sirui Han4, Brandon Y. Feng6
1CUHK, 2TJU, 3EPFL, 4HKUST, 5XMU, 6MIT
NeurIPS DB 2025

🌟 Motivation & Useful Findings

  1. Inspired by Richard Feynman's aphorism (quoted in the header), we propose a new way to evaluate the spatial visual understanding of VLMs through a pretext task: how well can they "recreate this scene"?
  2. We found that the goal of scene reconstruction leads VLMs to spontaneously estimate key attributes (object ID, localization, color, material, object relations, etc.) in an inverse-rendering fashion, which is critical for understanding what they see.
  3. VLMs show surprising potential for human-like reflection during this "recreation" game: when fed their recreated scenes, they compare them with the originals and update their understanding of the scene (the key attributes they estimate). We expect this multi-round feedback loop to unlock further improvements to existing VLMs in both understanding and generation.
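
The reflection loop in item 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `SceneObject`, `refine_scene`, and the three callables (`estimate`, `render`, `revise`) are hypothetical names, and the fields only mirror some of the key attributes listed in item 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SceneObject:
    """Hypothetical container for the attributes a VLM estimates per object."""
    shape: str
    color: str
    material: str
    position: Tuple[float, float, float]


Scene = List[SceneObject]


def refine_scene(
    image: object,
    estimate: Callable[[object], Scene],
    render: Callable[[Scene], object],
    revise: Callable[[object, object, Scene], Scene],
    rounds: int = 3,
) -> Scene:
    """Estimate scene parameters, then repeatedly render and revise them."""
    scene = estimate(image)  # initial inverse-rendering guess
    for _ in range(rounds):
        recreated = render(scene)  # assumed: a render of the current guess
        scene = revise(image, recreated, scene)  # compare with original, update
    return scene
```

In practice `estimate` and `revise` would wrap VLM calls and `render` would wrap a renderer such as Blender; here they are left as plain callables so the loop structure stays visible.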

🎨 Pipeline Overview

[Figure: pipeline overview]

๐Ÿ› ๏ธ Environment setup

(1) Create Environment:

conda create --name ir3d python=3.10
conda activate ir3d

(2) Install vLLM:

pip install vllm

(3) Install Blender on Linux:

snap install blender --classic

(4) Install SAM:

pip install git+https://github.com/facebookresearch/segment-anything.git
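
After the four steps above, a quick sanity check can confirm that the pieces are visible from the active environment. The sketch below is an assumption-laden helper, not part of the repository: `check_environment` is a hypothetical name, and it assumes the packages import as `vllm` and `segment_anything` and that the Blender binary is reachable on `PATH` as `blender`.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which of the tools installed above are visible from this Python."""
    return {
        # Python packages: find_spec returns None when a module is not importable.
        "vllm": importlib.util.find_spec("vllm") is not None,
        "segment_anything": importlib.util.find_spec("segment_anything") is not None,
        # Blender is a standalone binary, so look for it on PATH instead.
        "blender": shutil.which("blender") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it inside the activated `ir3d` environment; any `MISSING` entry points to the setup step that should be repeated.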

📚 Dataset setup

Download our processed data: IR3D-bench-data.

Inverse Rendering

Task prompt

The inverse-rendering prompt is in prompts/vlm_estimate_params.txt, and the GPT-4o scoring prompt is in prompts/gpt4o_as_evaluator.txt.

Open-source Models

Set the model name, as defined in main_vllm.py, to the model you want to run.

python main_vllm.py --model-type "model-name"

Latest Proprietary Models

Set the model name as needed, for example "gpt-4o" or "grok-3".

python main_api.py \
    --image_dir /path/to/images \
    --result_dir /output/path \
    --prompt_path prompts/vlm_estimate_params.txt \
    --model_name "model-name"
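
To run the main_api.py command above for several models in a row, the argument list can be built programmatically. This is a minimal sketch: `build_api_command` is a hypothetical helper, the per-model result sub-directory is an assumption rather than a repository convention, and only the flags shown in the command above are used.

```python
import shlex
from typing import List


def build_api_command(
    image_dir: str, result_dir: str, prompt_path: str, model_name: str
) -> List[str]:
    """Build the main_api.py invocation as an argument list (no shell quoting needed)."""
    return [
        "python", "main_api.py",
        "--image_dir", image_dir,
        "--result_dir", result_dir,
        "--prompt_path", prompt_path,
        "--model_name", model_name,
    ]


if __name__ == "__main__":
    for model in ["gpt-4o", "grok-3"]:
        cmd = build_api_command(
            "/path/to/images",
            f"/output/path/{model}",  # assumed layout: one result folder per model
            "prompts/vlm_estimate_params.txt",
            model,
        )
        print(shlex.join(cmd))  # preview; pass cmd to subprocess.run to execute
```

Passing the list to `subprocess.run(cmd, check=True)` would execute each run; printing first makes it easy to verify paths before spending API calls.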

Eval

bash cal_metric.sh "/output/path" "/path/to/images" "GPU_ID"

🎈 Acknowledgement

Thanks to the following fantastic repos: SAM, vLLM, CLEVR dataset, Blender.

📒 Citation

If you find our work helpful, please consider citing:

@article{liu2025ir3d,
  title={IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering},
  author={Liu, Parker and Li, Chenxin and Li, Zhengxin and Wu, Yipeng and Li, Wuyang and Yang, Zhiqin and Zhang, Zhenyuan and Lin, Yunlong and Han, Sirui and Feng, Brandon Y},
  journal={arXiv preprint arXiv:2506.23329},
  year={2025}
}