Chat-Scene

April 12, 2026

We build a multi-modal large language model for 3D scene understanding that excels at tasks such as 3D grounding, captioning, and question answering.

🔥 Ranked 1st on the ScanRefer Benchmark (Sept. 2024)

🔥 Ranked 1st on the Scan2Cap Benchmark (Sept. 2024)

News

[2026.03] 🔥 Chat-Scene++ has been accepted by TPAMI 2026! [paper] [data]

[2024.09] 🔥 Chat-Scene has been accepted by NeurIPS 2024! [paper]

[2024.08] We release Chat-Scene, capable of processing both 3D point clouds and 2D multi-view images for improved 3D scene understanding, leading to significant advancements in grounding and captioning performance.

[2024.04] We release a refined implementation (v2.1), which achieves better performance on grounding, captioning, and QA tasks. The code is available in branch v2.1.

[2023.12] We release Chat-3D v2 [paper], introducing object identifiers for enhanced object referencing and grounding in 3D scenes. The original code is available in branch v2.0.

[2023.08] We release Chat-3D [paper] [code], an LLM-based dialogue system for 3D scenes.

🔥 Chat-Scene vs Chat-3D v2

  • Performance Comparison

    Model        ScanRefer            Multi3dRefer        Scan2Cap              ScanQA         SQA3D
                 Acc@0.25   Acc@0.5   F1@0.25   F1@0.5    CIDEr@0.5   B-4@0.5   CIDEr   B-4    EM
    v2.0         35.9       30.4      -         -         28.1        15.5      77.1    7.3    -
    v2.1         42.5       38.4      45.1      41.6      63.9        31.8      87.6    14.0   54.7
    Chat-Scene   55.5       50.2      57.1      52.4      77.1        36.3      87.7    14.3   54.6

    *The v2.1 and Chat-Scene results are based on single models without task-specific finetuning.
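For readers new to the grounding columns: Acc@0.25 and Acc@0.5 are commonly computed as the fraction of predictions whose 3D IoU with the ground-truth box reaches the threshold. The sketch below illustrates that idea under the assumption of axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax). The function names are hypothetical and are not taken from this repository's evaluation code.

```python
from typing import Sequence

# Assumed box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth reaches the threshold."""
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)  # overlaps half of gt along x
preds = [shifted, gt]
gts = [gt, gt]
print(accuracy_at_iou(preds, gts, 0.25), accuracy_at_iou(preds, gts, 0.5))
```

In this toy case the shifted box has an IoU of one third with the ground truth, so it counts at the 0.25 threshold but not at 0.5.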

  • Main Changes

    New features in Chat-Scene
    • Introduce a 2D token for each object, with 2D representations extracted from multi-view images using DINOv2.

    • Enable processing of 2D video using a tracking-based detector when 3D input is unavailable.

    New features in v2.1 (Chat-Scene is built upon v2.1)
    • LLM backbone: Vicuna v0 -> Vicuna v1.5 + LoRA.

    • Training scheme: three-stage training -> one-stage joint training.

    • Detector: PointGroup -> Mask3D.

    • Code Optimization:

      • batch size: 1 -> 32.
      • Simplified training and evaluation processes.

🔨 Preparation

  • Prepare the environment:

    conda create -n chat-scene python=3.9.17
    conda activate chat-scene
    conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
    pip install -r requirements.txt
    
  • Download LLM backbone:

    • We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.

    • Change the llama_model_path in run.sh to the path of vicuna-7b-v1.5.

  • Annotations and extracted features:

    Please follow the instructions in preprocess.
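After the environment step, a quick check like the one below can confirm that the installed packages match the versions pinned in the conda command. This is a minimal sketch rather than part of the repository: `check_versions` and `EXPECTED` are hypothetical names, and it assumes the packages are discoverable under the distribution names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata

# Versions pinned by the conda command above; keys are assumed distribution names.
EXPECTED = {"torch": "2.2.1", "torchvision": "0.17.1", "torchaudio": "2.2.1"}


def check_versions(expected):
    """Return {package: status}; status is 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Some builds carry a local suffix such as "+cu118"; compare the public part only.
        public = found.split("+")[0]
        report[name] = "ok" if public == wanted else f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for name, status in check_versions(EXPECTED).items():
        print(f"{name}: {status}")
```

Run it inside the activated `chat-scene` environment. Any line that does not report `ok` suggests the environment differs from the pinned setup.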

🤖 Training and Inference

  • Training

    • Modify run.sh:

      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      
      Explanation of "train_tag" and "val_tag"
      • Use # to separate different datasets.

      • Datasets:

        • scanrefer: ScanRefer Dataset
        • scan2cap: Scan2Cap Dataset
        • scanqa: ScanQA Dataset
        • sqa3d: SQA3D Dataset
        • multi3dref: Multi3dRefer Dataset
        • nr3d_caption: A captioning dataset derived from Nr3D.
        • obj_align: A dataset derived from ScanRefer that aligns the object identifiers with object tokens.
    • Run: bash scripts/run.sh
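
Because a typo in "train_tag" or "val_tag" is easy to miss, it can help to validate the string before launching a run. The sketch below splits a tag on # and checks each name against the dataset list above. It illustrates the tag format only. `parse_tag` and `KNOWN_DATASETS` are hypothetical names, and the repository may parse these strings differently.

```python
# Dataset names listed in this README.
KNOWN_DATASETS = {
    "scanrefer", "scan2cap", "scanqa", "sqa3d",
    "multi3dref", "nr3d_caption", "obj_align",
}


def parse_tag(tag: str) -> list:
    """Split a '#'-separated tag into dataset names, rejecting unknown names."""
    names = [name.strip() for name in tag.split("#") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_DATASETS]
    if unknown:
        raise ValueError(f"unknown dataset name(s): {unknown}")
    return names


train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
print(parse_tag(train_tag))
```

A misspelled entry such as `scanrefr` raises a `ValueError` instead of silently changing which datasets are used.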

  • Inference

    • Modify run.sh (we provide the pretrained checkpoint on Hugging Face):

      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=True
      pretrained_path="/path/to/pretrained_model.pth"
      
    • Run: bash scripts/run.sh
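
If you switch between training and inference often, the edits to run.sh can be scripted. The sketch below rewrites `name=value` assignments in the text of a shell script. It assumes each variable is assigned on its own line, as in the snippets above. The actual layout of run.sh may differ, and `set_shell_vars` is a hypothetical helper, not part of the repository.

```python
import re


def set_shell_vars(script_text: str, updates: dict) -> str:
    """Replace `name=value` assignment lines; raise KeyError if a name is absent."""
    for name, value in updates.items():
        pattern = re.compile(rf"^([ \t]*){re.escape(name)}=.*$", flags=re.MULTILINE)
        # A function replacement avoids backslash handling in the new value.
        script_text, count = pattern.subn(
            lambda m, n=name, v=value: f"{m.group(1)}{n}={v}", script_text
        )
        if count == 0:
            raise KeyError(name)
    return script_text


# Stand-in for the contents of run.sh (assumed layout).
example = 'val_tag="scanrefer"\nevaluate=False\npretrained_path=""\n'
updated = set_shell_vars(
    example,
    {"evaluate": "True", "pretrained_path": '"/path/to/pretrained_model.pth"'},
)
print(updated)
```

To apply it to the real file, read the script into a string, pass it through the helper, and write the result back after reviewing the diff.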

📄 Citation

If you find this project useful in your research, please consider citing:

@article{huang2026chat,
  title={Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM},
  author={Huang, Haifeng and Chen, Yilun and Wang, Zehan and Pang, Jiangmiao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2603.27507},
  year={2026}
}
@article{huang2024chat,
  title={Chat-scene: Bridging 3d scene and large language models with object identifiers},
  author={Huang, Haifeng and Chen, Yilun and Wang, Zehan and Huang, Rongjie and Xu, Runsen and Wang, Tai and Liu, Luping and Cheng, Xize and Zhao, Yang and Pang, Jiangmiao and others},
  journal={Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada},
  year={2024}
}
@article{wang2023chat,
  title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
  author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2308.08769},
  year={2023}
}

Stay tuned for our project. 🔥

If you have any questions or suggestions, feel free to open an issue or drop us an email (huanghaifeng317@gmail.com).

😊 Acknowledgement

Thanks to the following open-source projects:

(Multi-modal) LLMs: LLaMA, Vicuna, VideoChat, LEO

3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer

Detectors: PointGroup, Mask3D, DEVA

Representations: ULIP, Uni3D, DINOv2

3D Models: vil3dref, OpenScene