Chat-Scene
April 12, 2026
We build a multi-modal large language model for 3D scene understanding, excelling in tasks such as 3D grounding, captioning, and question answering.
Ranked 1st on the ScanRefer Benchmark (Sept. 2024)
Ranked 1st on the Scan2Cap Benchmark (Sept. 2024)
News
[2026.03] Chat-Scene++ has been accepted by TPAMI 2026! [paper] [data]
[2024.09] Chat-Scene has been accepted by NeurIPS 2024! [paper]
[2024.08] We release Chat-Scene, capable of processing both 3D point clouds and 2D multi-view images for improved 3D scene understanding, leading to significant advancements in grounding and captioning performance.
[2024.04] We release a refined implementation (v2.1), which achieves better performance on grounding, captioning, and QA tasks. The code is available in branch v2.1.
[2023.12] We release Chat-3D v2 [paper], introducing object identifiers for enhanced object referencing and grounding in 3D scenes. The original code is available in branch v2.0.
[2023.08] We release Chat-3D [paper] [code], an LLM-based dialogue system for 3D scenes.
Chat-Scene vs Chat-3D v2
- Performance Comparison

  | Model | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
  |---|---|---|---|---|---|---|---|---|---|
  | v2.0 | 35.9 | 30.4 | - | - | 28.1 | 15.5 | 77.1 | 7.3 | - |
  | v2.1 | 42.5 | 38.4 | 45.1 | 41.6 | 63.9 | 31.8 | 87.6 | 14.0 | 54.7 |
  | Chat-Scene | 55.5 | 50.2 | 57.1 | 52.4 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |

  *The v2.1 and Chat-Scene results are based on single models without task-specific finetuning.
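For readers unfamiliar with the grounding metrics, the sketch below illustrates one common reading of Acc@0.25 and Acc@0.5: the percentage of predictions whose 3D IoU with the ground-truth box reaches the threshold. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and a non-strict comparison; the function names are hypothetical, and this is not the benchmark's official evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; zero if any side is empty or inverted."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    overlap: Box = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    inter = box_volume(overlap)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Percentage of predictions whose IoU with the paired ground truth >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)
```

Under this reading, a prediction that overlaps the target loosely can count at the 0.25 threshold but fail at 0.5, which is why the Acc@0.5 column is always the lower of the two.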
- Main Changes
  New features in Chat-Scene
  - Introduce a 2D token for each object, with 2D representations extracted from multi-view images using DINOv2.
  - Enable processing of 2D video using a tracking-based detector when 3D input is unavailable.
  New features in v2.1 (Chat-Scene is built upon v2.1)
  - LLM backbone: Vicuna v0 -> Vicuna v1.5 + LoRA.
  - Training scheme: three-stage training -> one-stage joint training.
  - Detector: PointGroup -> Mask3D.
  - Code optimization:
    - Batch size: 1 -> 32.
    - Simplified training and evaluation processes.
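As background for the "+ LoRA" change listed above, here is a minimal pure-Python sketch of the general low-rank adaptation idea: the pretrained weight W stays frozen, and a trainable low-rank product B A, scaled by alpha / r, is added to its output. The function names and the tiny matrices are illustrative assumptions and are not taken from this repository's code.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x).

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    rank = len(a)
    base = matvec(w, x)                # output of the frozen layer
    update = matvec(b, matvec(a, x))   # low-rank correction
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


# With B initialised to zeros, the adapted layer reproduces the frozen layer.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
b_zero = [[0.0], [0.0]]
print(lora_forward(w, a, b_zero, [1.0, 2.0], alpha=2.0))  # [1.0, 2.0]
```

Only A and B would be updated during finetuning in this scheme, so the number of trainable parameters grows with the rank r rather than with the full size of W.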
Preparation
- Prepare the environment:

      conda create -n chat-scene python=3.9.17
      conda activate chat-scene
      conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
      pip install -r requirements.txt

- Download LLM backbone:
  - We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.
  - Change the llama_model_path in run.sh to the path of vicuna-7b-v1.5.
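If you prefer to script the path change rather than edit run.sh by hand, the following is one possible sketch. It assumes run.sh contains a plain shell assignment of the form `llama_model_path=...` on its own line; the helper name `set_shell_variable` and the example path are hypothetical.

```python
import re


def set_shell_variable(script_text: str, name: str, value: str) -> str:
    """Return script_text with the first `name=...` assignment replaced by name="value"."""
    pattern = re.compile(
        rf"^(\s*(?:export\s+)?{re.escape(name)}=).*$", re.MULTILINE
    )
    # A function replacement keeps backslashes in `value` from being read as escapes.
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{value}"', script_text, count=1
    )
    if count == 0:
        raise KeyError(f"no assignment to {name!r} found")
    return new_text


# Example on an in-memory script (the contents are made up for illustration).
example = 'epoch=3\nllama_model_path="old/path"\n'
print(set_shell_variable(example, "llama_model_path", "/models/vicuna-7b-v1.5"))

# To apply it to the real file, something like:
# from pathlib import Path
# path = Path("scripts/run.sh")
# path.write_text(set_shell_variable(path.read_text(), "llama_model_path", "/models/vicuna-7b-v1.5"))
```

The helper raises an error instead of silently appending a new line, so a renamed or missing variable is noticed before training starts.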
- Annotations and extracted features:
  Please follow the instructions in preprocess.
Training and Inference
- Training
  - Modify run.sh:

        train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
        val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
        evaluate=False

    Explanation of "train_tag" and "val_tag"
    - Use # to separate different datasets.
    - Datasets:
  - Run:

        bash scripts/run.sh
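The "#"-separated format of train_tag and val_tag described above can be illustrated with a short sketch. How the repository itself parses these strings is an assumption here, and `parse_tags` is a hypothetical helper; the tag values are the ones shown in the training configuration.

```python
from typing import List


def parse_tags(tag_string: str) -> List[str]:
    """Split a '#'-separated tag string into dataset names, dropping empty parts."""
    return [part.strip() for part in tag_string.split("#") if part.strip()]


train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref"

train_sets = parse_tags(train_tag)
val_sets = parse_tags(val_tag)

# Datasets used for training only, with no matching validation split listed.
train_only = [name for name in train_sets if name not in val_sets]
print(train_only)  # ['nr3d_caption', 'obj_align']
```

To train or evaluate on a subset, shorten the corresponding string, for example `val_tag="scanrefer#scanqa"`.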
- Inference
  - Modify run.sh (we provide the pretrained checkpoint on Hugging Face):

        val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
        evaluate=True
        pretrained_path="/path/to/pretrained_model.pth"

  - Run:

        bash scripts/run.sh
Citation
If you find this project useful in your research, please consider citing:
@article{huang2026chat,
title={Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM},
author={Huang, Haifeng and Chen, Yilun and Wang, Zehan and Pang, Jiangmiao and Zhao, Zhou},
journal={arXiv preprint arXiv:2603.27507},
year={2026}
}
@article{huang2024chat,
title={Chat-scene: Bridging 3d scene and large language models with object identifiers},
author={Huang, Haifeng and Chen, Yilun and Wang, Zehan and Huang, Rongjie and Xu, Runsen and Wang, Tai and Liu, Luping and Cheng, Xize and Zhao, Yang and Pang, Jiangmiao and others},
journal={Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada},
year={2024}
}
@article{wang2023chat,
title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
journal={arXiv preprint arXiv:2308.08769},
year={2023}
}
Stay tuned for our project.
If you have any questions or suggestions, feel free to open an issue or drop us an email (huanghaifeng317@gmail.com).
Acknowledgement
We thank the following open-source projects:
(Multi-modal) LLMs: LLaMA, Vicuna, VideoChat, LEO
3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer
Detectors: PointGroup, Mask3D, DEVA