Chat-Scene

April 12, 2026

We build a multi-modal large language model for 3D scene understanding, excelling in tasks such as 3D grounding, captioning, and question answering.

🔥 Ranked 1st on the ScanRefer Benchmark (Sept. 2024)

🔥 Ranked 1st on the Scan2Cap Benchmark (Sept. 2024)

News

[2026.03] 🔥 Chat-Scene++ has been accepted by TPAMI 2026! [paper] [data]

[2024.09] 🔥 Chat-Scene has been accepted by NeurIPS 2024! [paper]

[2024.08] We release Chat-Scene, capable of processing both 3D point clouds and 2D multi-view images for improved 3D scene understanding, leading to significant advancements in grounding and captioning performance.

[2024.04] We release a refined implementation (v2.1), which achieves better performance on grounding, captioning, and QA tasks. The code is available in branch v2.1.

[2023.12] We release Chat-3D v2 [paper], introducing object identifiers for enhanced object referencing and grounding in 3D scenes. The original code is available in branch v2.0.

[2023.08] We release Chat-3D [paper] [code], an LLM-based dialogue system for 3D scenes.

🔥 Chat-Scene vs Chat-3D v2

Performance Comparison

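| Model | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| v2.0 | 35.9 | 30.4 | - | - | 28.1 | 15.5 | 77.1 | 7.3 | - |
| v2.1 | 42.5 | 38.4 | 45.1 | 41.6 | 63.9 | 31.8 | 87.6 | 14.0 | 54.7 |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.4 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |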

*The v2.1 and Chat-Scene results are based on single models without task-specific finetuning.
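
To make the gap between v2.1 and Chat-Scene easier to read, the sketch below recomputes the per-metric differences from the table above. The dictionary keys and the `gains` helper are illustrative names chosen for this example; they are not part of the repository.

```python
# Scores copied from the performance table above (v2.1 vs. Chat-Scene).
V21 = {
    "ScanRefer Acc@0.25": 42.5, "ScanRefer Acc@0.5": 38.4,
    "Multi3dRefer F1@0.25": 45.1, "Multi3dRefer F1@0.5": 41.6,
    "Scan2Cap CIDEr@0.5": 63.9, "Scan2Cap B-4@0.5": 31.8,
    "ScanQA CIDEr": 87.6, "ScanQA B-4": 14.0,
    "SQA3D EM": 54.7,
}
CHAT_SCENE = {
    "ScanRefer Acc@0.25": 55.5, "ScanRefer Acc@0.5": 50.2,
    "Multi3dRefer F1@0.25": 57.1, "Multi3dRefer F1@0.5": 52.4,
    "Scan2Cap CIDEr@0.5": 77.1, "Scan2Cap B-4@0.5": 36.3,
    "ScanQA CIDEr": 87.7, "ScanQA B-4": 14.3,
    "SQA3D EM": 54.6,
}


def gains(new, old):
    """Return new - old for every metric, rounded to one decimal place."""
    return {name: round(new[name] - old[name], 1) for name in old}


if __name__ == "__main__":
    for name, delta in gains(CHAT_SCENE, V21).items():
        print(f"{name:22s} {delta:+.1f}")
```

The largest differences are on the grounding and captioning metrics (for example +13.0 on ScanRefer Acc@0.25), while the ScanQA and SQA3D scores stay within 0.3 points of v2.1.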

Main Changes
New features in Chat-Scene
- Introduce a 2D token for each object, with 2D representations extracted from multi-view images using DINOv2.
- Enable processing of 2D video using a tracking-based detector when 3D input is unavailable.
New features in v2.1 (Chat-Scene is built upon v2.1)
- LLM backbone: Vicuna v0 -> Vicuna v1.5 + LoRA.
- Training scheme: three-stage training -> one-stage joint training.
- Detector: PointGroup -> Mask3D.
- Code Optimization:
  - batch size: 1 -> 32.
  - Simplified training and evaluation processes.

🔨 Preparation

Prepare the environment:

```
conda create -n chat-scene python=3.9.17
conda activate chat-scene
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
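
After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch using only the standard library; `EXPECTED`, `installed_version`, and `check_pins` are hypothetical names for this example, and it assumes the installed packages expose standard Python package metadata.

```python
from importlib import metadata

# Versions pinned by the conda command above.
EXPECTED = {"torch": "2.2.1", "torchvision": "0.17.1", "torchaudio": "2.2.1"}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected, lookup=installed_version):
    """Map each package to (found_version, ok).

    Local build suffixes such as "+cu118" are ignored when comparing.
    """
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        ok = found is not None and found.split("+")[0] == wanted
        report[package] = (found, ok)
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_pins(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{package}: expected {EXPECTED[package]}, found {found} [{status}]")
```

Run it inside the activated `chat-scene` environment; any `MISMATCH` line suggests the conda step did not install the pinned build.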

Download LLM backbone:
- We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.
- Change the llama_model_path in run.sh to the path of vicuna-7b-v1.5.
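
Before launching a run, a quick sanity check on the path you put in `llama_model_path` can save a failed start. The sketch below is only a heuristic: it assumes the download is a standard Hugging Face model folder, which contains a `config.json` file, and it does not verify the weights themselves. The helper name `looks_like_hf_checkpoint` is hypothetical.

```python
import sys
from pathlib import Path


def looks_like_hf_checkpoint(model_dir):
    """Heuristic: a Hugging Face model folder contains a config.json file."""
    path = Path(model_dir)
    return path.is_dir() and (path / "config.json").is_file()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_model_dir.py /path/to/vicuna-7b-v1.5
    candidate = sys.argv[1]
    if looks_like_hf_checkpoint(candidate):
        print(f"{candidate} looks like a Hugging Face model folder")
    else:
        print(f"{candidate} is missing or has no config.json")
```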
Annotations and extracted features:

Please follow the instructions in preprocess.

🤖 Training and Inference

Training
- Modify run.sh:
```
train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=False
```
  Explanation of "train_tag" and "val_tag"
  - Use # to separate different datasets.
  - Datasets:
    - scanrefer: ScanRefer Dataset
    - scan2cap: Scan2Cap Dataset
    - scanqa: ScanQA Dataset
    - sqa3d: SQA3D Dataset
    - multi3dref: Multi3dRefer Dataset
    - nr3d_caption: A captioning dataset derived from Nr3D.
    - obj_align: A dataset derived from ScanRefer to align the object identifiers with object tokens.
- Run: bash scripts/run.sh
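
The `#`-separated tag convention described above can be sketched as a small validator that catches typos in dataset names before a long run starts. This is an illustration of the convention, not the repository's own parsing code; `KNOWN_DATASETS` and `parse_tag` are hypothetical names, and silently dropping empty entries is an assumption of this sketch.

```python
# Dataset names listed in the explanation above.
KNOWN_DATASETS = {
    "scanrefer", "scan2cap", "scanqa", "sqa3d",
    "multi3dref", "nr3d_caption", "obj_align",
}


def parse_tag(tag):
    """Split a '#'-separated tag string and reject unknown dataset names."""
    names = [name for name in tag.split("#") if name]
    unknown = [name for name in names if name not in KNOWN_DATASETS]
    if unknown:
        raise ValueError(f"unknown dataset(s): {', '.join(unknown)}")
    return names


if __name__ == "__main__":
    train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    val_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    print("train:", parse_tag(train_tag))
    print("val:  ", parse_tag(val_tag))
```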

Inference

Modify run.sh (we provide the pretrained checkpoint on Hugging Face):

```
val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=True
pretrained_path="/path/to/pretrained_model.pth"
```

Run: bash scripts/run.sh
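
Switching between training and inference comes down to changing a few assignments in run.sh. If you prefer to script that rather than edit by hand, the sketch below rewrites a `name=value` line in the script text. It assumes each variable is assigned at the start of its own line, as in the snippets above; the actual layout of run.sh may differ, and `set_shell_var` is a hypothetical helper.

```python
import re


def set_shell_var(script_text, name, value):
    """Replace the first line-initial `name=...` assignment in a shell script."""
    pattern = re.compile(rf"^{re.escape(name)}=.*$", flags=re.MULTILINE)
    # A callable replacement avoids interpreting backslashes in `value`.
    new_text, count = pattern.subn(lambda _: f"{name}={value}", script_text, count=1)
    if count == 0:
        raise KeyError(f"{name} is not assigned in the script")
    return new_text


if __name__ == "__main__":
    # Stand-in for the contents of run.sh; only the assignment lines matter here.
    script = 'val_tag="scanrefer"\nevaluate=False\npretrained_path=""\n'
    script = set_shell_var(script, "evaluate", "True")
    script = set_shell_var(script, "pretrained_path", '"/path/to/pretrained_model.pth"')
    print(script)
```

Pass the value exactly as it should appear in the script, including any quotes.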

📄 Citation

If you find this project useful in your research, please consider citing:

```
@article{huang2026chat,
  title={Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM},
  author={Huang, Haifeng and Chen, Yilun and Wang, Zehan and Pang, Jiangmiao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2603.27507},
  year={2026}
}
@article{huang2024chat,
  title={Chat-scene: Bridging 3d scene and large language models with object identifiers},
  author={Huang, Haifeng and Chen, Yilun and Wang, Zehan and Huang, Rongjie and Xu, Runsen and Wang, Tai and Liu, Luping and Cheng, Xize and Zhao, Yang and Pang, Jiangmiao and others},
  journal={Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada},
  year={2024}
}
@article{wang2023chat,
  title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
  author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2308.08769},
  year={2023}
}
```

Stay tuned for our project. 🔥

If you have any questions or suggestions, feel free to open an issue or drop us an email (huanghaifeng317@gmail.com).

😊 Acknowledgement

Thanks to the following open-source projects:

(Multi-modal) LLMs: LLaMA, Vicuna, VideoChat, LEO

3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer

Detectors: PointGroup, Mask3D, DEVA

Representations: ULIP, Uni3D, DINOv2

3D Models: vil3dref, OpenScene