3DGraphLLM

August 4, 2025

arXiv · Hugging Face

In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph, which serves as input for LLMs to perform 3D vision-language tasks.

News

[2025.6] We are pleased to announce that our paper has been accepted for poster presentation at ICCV 2025! 🎉

[2024.12] We release 3DGraphLLM pre-training on GT instance segmentation scene graphs

[2024.12] We release 3DGraphLLM paper code

🔥 Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks

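| Method | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |
| 3DGraphLLM Vicuna-1.5 | 58.6 | 53.0 | 61.9 | 57.3 | 79.2 | 34.7 | 91.2 | 13.7 | 55.1 |
| 3DGraphLLM LLAMA3-8B | 62.4 | 56.6 | 64.7 | 59.9 | 81.0 | 36.5 | 88.8 | 15.9 | 55.9 |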

🔨 Preparation

  • Prepare the environment:

    conda create -n 3dgraphllm python=3.9.17
    conda activate 3dgraphllm
    conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
    pip install -r requirements.txt
    
  • If you don't have root permissions to install Java (needed by the pycocoeval scripts for metrics such as BLEU and CIDEr), install it with conda:

    conda install -c conda-forge openjdk

  • Download LLM backbone:

    • We use LLAMA3-8B-Instruct in our experiments, which can be downloaded from Hugging Face.

    • Change the llama_model_path in config.py to the path of LLAMA3-8B-Instruct.

  • Annotations and extracted features:

    Please follow the instructions in preprocess.
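
Before moving on to training, it can help to confirm that the tools this section relies on are reachable from the active environment. The sketch below is an illustration only: `report_tool` is a hypothetical helper, not part of the repository, and it only checks that `python` and `java` are on the PATH.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the repository):
# report whether a command can be found on PATH.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

report_tool python   # interpreter from the activated conda environment
report_tool java     # required by the caption metric scripts (BLEU, CIDEr)
```

To also confirm the PyTorch install, one could run `python -c "import torch; print(torch.__version__, torch.cuda.is_available())"` inside the activated environment.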

🤖 Training and Inference

  • Pre-training on GT instance segmentation scene graphs.

    • Modify scripts/run_gt_pretrain.sh:

      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      
      Explanation of "train_tag" and "val_tag":
      • Use # to separate different datasets.

      • Datasets:

        • scanrefer: ScanRefer Dataset
        • scan2cap: Scan2Cap Dataset
        • scanqa: ScanQA Dataset
        • sqa3d: SQA3D Dataset
        • multi3dref: Multi3dRefer Dataset
        • nr3d_caption: A captioning dataset derived from Nr3D.
        • obj_align: A dataset derived from ScanRefer that aligns object identifiers with object tokens.
    • Run: bash scripts/run_gt_pretrain.sh

  • Training

    • Modify scripts/run.sh:
      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      pretrained_path="outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth"
      
    • Run: bash scripts/run.sh
  • Inference

    • Modify scripts/run.sh:

      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=True
      pretrained_path="/path/to/pretrained_model.pth"
      
    • Run: bash scripts/run.sh
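
Because `train_tag` and `val_tag` are plain `#`-separated strings, a typo in a dataset name is easy to miss when editing the scripts. The following is a minimal sketch of a pre-flight check, assuming the only valid names are the seven listed in this README; `unknown_tags` is a hypothetical helper and is not part of the repository's scripts.

```shell
#!/usr/bin/env sh
# Dataset names as listed in this README (assumed to be the full set).
known_tags="scanrefer scan2cap scanqa sqa3d multi3dref nr3d_caption obj_align"

# Hypothetical helper: print every tag in a '#'-separated string
# that is not one of the known dataset names.
# The body runs in a subshell so the IFS change does not leak out.
unknown_tags() (
    IFS='#'
    for tag in $1; do
        case " $known_tags " in
            *" $tag "*) ;;          # recognized dataset name
            *) echo "$tag" ;;       # unrecognized: report it
        esac
    done
)

train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
bad=$(unknown_tags "$train_tag")
if [ -n "$bad" ]; then
    echo "unknown datasets: $bad"
else
    echo "all tags recognized"
fi
```

A check like this could be pasted near the top of a launch script so that a misspelled tag is reported before a long run starts.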

🚀 Demo

  • Run: bash demo/run_demo.sh. You will be prompted to ask different queries about Scene 435 of ScanNet.

📪 Contact

If you have any questions about the project, please open an issue in this repository or send an email to Tatiana Zemskova.

📑 Citation

If you find this work helpful, please consider citing it:

@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding}, 
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450}, 
}

😊 Acknowledgement

Thanks to the following open-source projects:

Chat-Scene