
July 25, 2025

✨ VDG-Uni3DSeg

All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Zongyan Han1, Mohamed El Amine Boudjoghra2, Jiahua Dong1, Jinhong Wang1, Rao Muhammad Anwer1

1 Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), 2 Technical University of Munich

📄 [Paper]

💡 Introduction

Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by their sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding.

[Figure: Semantic predictions]

🚀 Installation & Data Preparation

🛠️ 1. Installation & Data Prep

Please refer to the official OneFormer3D repository to set up the environment and prepare the datasets.

🌐 2. Internet Images

  1. To obtain internet images, we use the google-images-download tool. Please follow the instructions below to install it:
    mkdir internet_image && cd internet_image
    git clone https://github.com/ultralytics/google-images-download
    cd google-images-download
    pip install -r requirements.txt
    
  2. Once installed, you can download images by running the following command. For example, to download 20 images for the class "sofa" (e.g., for the S3DIS dataset), use:
    python3 bing_scraper.py --search sofa --limit 20 --download --output_directory 'images/s3dis'

Tip: We've already collected images for you! Find them here.

  3. Next, extract image features using the CLIP model. Use the commands below to install CLIP and extract features:
    pip install git+https://github.com/openai/CLIP.git
    python get_image_features.py

Or download pre-extracted features here.
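
Downloading images for a whole dataset means repeating the bing_scraper.py command above once per class. The following is a minimal sketch of how those commands could be generated, not a script shipped with this repository. The class list and the helper names `build_scraper_command` and `build_all_commands` are hypothetical; only the flags shown in the command above are used.

```python
import shlex

# Hypothetical class list for illustration; substitute the real class names
# of the dataset you are preparing.
CLASSES = ["sofa", "chair", "table"]


def build_scraper_command(class_name, limit=20, output_dir="images/s3dis"):
    """Return the bing_scraper.py invocation for one class as an argument list."""
    return [
        "python3", "bing_scraper.py",
        "--search", class_name,
        "--limit", str(limit),
        "--download",
        "--output_directory", output_dir,
    ]


def build_all_commands(class_names, limit=20, output_dir="images/s3dis"):
    """Return one command (argument list) per class name."""
    return [build_scraper_command(name, limit, output_dir) for name in class_names]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before anything is executed.
    for cmd in build_all_commands(CLASSES):
        print(shlex.join(cmd))
```

To actually run the downloads, each argument list can be passed to `subprocess.run(cmd, check=True)` from inside the google-images-download directory.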

📝 3. Generating LLM Descriptions

We follow this method, replacing GPT with the open-source Llama3.1-8B-Instruct model.

  1. Get the Model:
    # Download weights & tokenizer from HuggingFace
    # Place them in:
    ./llama_model/Llama3.1-8B-Instruct
    
  2. Generate Descriptions:
    torchrun --nproc_per_node 1 generate_descriptors_llama.py
    
    Descriptions 👉 ./class_description/descriptors

We've pre-generated descriptors for ScanNet, ScanNet200, and S3DIS. Download the bundle here.

  3. Extract Text Features:
    python get_text_features_llama.py
    
    CLIP embeddings 👉 ./class_description/clip_embedding

Pre-extracted text features are also available here.
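
Before starting training, it can help to confirm that the model weights, descriptors, and text features are where the steps above put them. The sketch below is an illustrative check and not part of the repository. It assumes that each path named above is a directory relative to the repository root, and the helper name `missing_dirs` is hypothetical.

```python
from pathlib import Path

# Paths named in the steps above. Treating each one as a directory relative
# to the repository root is an assumption.
EXPECTED_DIRS = [
    "llama_model/Llama3.1-8B-Instruct",
    "class_description/descriptors",
    "class_description/clip_embedding",
]


def missing_dirs(repo_root="."):
    """Return the expected directories that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # A directory that exists but holds no files is reported as well.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_dirs():
        print(f"missing or empty: {rel}")
```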

🏋️‍♂️ Training and Evaluation

Below we provide training and testing commands for three datasets: ScanNet, ScanNet200, and S3DIS.

📦 ScanNet

# Train
python tools/train.py configs/vdguni_scannet.py

# Fix checkpoint before evaluation
python tools/fix_spconv_checkpoint.py --in-path work_dirs/vdguni_scannet/epoch_512.pth --out-path work_dirs/vdguni_scannet/final.pth

# Evaluation
python tools/test.py configs/vdguni_scannet.py work_dirs/vdguni_scannet/final.pth

🧱 ScanNet200

# Train
python tools/train.py configs/vdguni_scannet200.py
# Evaluation
python tools/test.py configs/vdguni_scannet200.py work_dirs/vdguni_scannet200/epoch_512.pth

🏫 S3DIS

We train on Areas 1–4 and 6 and test on Area 5. Modify train_area / test_area in the config to change splits.
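
The default split holds out one area for testing and trains on the rest. As a rough sketch, the values for another held-out area could be derived as follows. The key names `train_area` and `test_area` come from the sentence above, but the value format (a list of integers and a single integer) is an assumption, so check the config file for the exact form it expects. The helper `make_split` is hypothetical.

```python
# S3DIS consists of six areas, numbered 1 to 6.
S3DIS_AREAS = (1, 2, 3, 4, 5, 6)


def make_split(test_area):
    """Hold out one area for testing and train on the remaining five."""
    if test_area not in S3DIS_AREAS:
        raise ValueError(f"test_area must be one of {S3DIS_AREAS}, got {test_area}")
    train_area = [area for area in S3DIS_AREAS if area != test_area]
    # The dictionary layout is illustrative only; the real config may differ.
    return {"train_area": train_area, "test_area": test_area}


if __name__ == "__main__":
    # The default split described above: train on 1, 2, 3, 4, 6 and test on 5.
    print(make_split(5))
```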

# Train
python tools/train.py configs/vdguni_s3dis.py

# Fix checkpoint before evaluation
python tools/fix_spconv_checkpoint.py --in-path work_dirs/vdguni_s3dis/epoch_512.pth --out-path work_dirs/vdguni_s3dis/final.pth

# Evaluation
python tools/test.py configs/vdguni_s3dis.py work_dirs/vdguni_s3dis/final.pth
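
The three datasets follow the same train, fix checkpoint, evaluate pattern, with only the dataset name changing. The sketch below generates the commands listed above from that pattern. It is an illustration rather than a repository script: the helper name `pipeline_commands` is hypothetical, and the `configs/vdguni_<dataset>.py` and `work_dirs/vdguni_<dataset>` naming is inferred from the commands in this section. Pass `fix_checkpoint=False` to mirror the ScanNet200 listing, which evaluates `epoch_512.pth` directly.

```python
def pipeline_commands(dataset, epoch=512, fix_checkpoint=True):
    """Return the train / (optional) fix / test commands as argument lists."""
    config = f"configs/vdguni_{dataset}.py"
    work_dir = f"work_dirs/vdguni_{dataset}"
    raw_ckpt = f"{work_dir}/epoch_{epoch}.pth"

    commands = [["python", "tools/train.py", config]]

    if fix_checkpoint:
        # Convert the checkpoint before evaluation, as in the ScanNet and
        # S3DIS listings above.
        eval_ckpt = f"{work_dir}/final.pth"
        commands.append([
            "python", "tools/fix_spconv_checkpoint.py",
            "--in-path", raw_ckpt,
            "--out-path", eval_ckpt,
        ])
    else:
        eval_ckpt = raw_ckpt

    commands.append(["python", "tools/test.py", config, eval_ckpt])
    return commands


if __name__ == "__main__":
    # Print the commands for review; nothing is executed here.
    for cmd in pipeline_commands("s3dis"):
        print(" ".join(cmd))
```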

📊 Model Zoo & Results

Note: Because of random initialization, training results may vary slightly, and matching the paper's numbers may require running several seeds. Config files are available in configs/.

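| Dataset | mAP25 | mAP50 | mAP | mIoU | PQ | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ScanNet | 86.5 | 78.5 | 59.3 | 76.2 | 71.5 | model / log |
| ScanNet200 | 45.1 | 40.0 | 29.5 | 29.7 | 31.3 | model / log |
| S3DIS | 80.6 | 74.1 | 60.1 | 71.5 | 66.3 | model / log |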
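
Since results vary across seeds, one way to compare a set of runs against the table is to summarize them and check the best run against the reported value. The sketch below assumes you have already collected one metric value per seed. The per-seed scores are made-up placeholders rather than measured results, the 0.5-point tolerance is an arbitrary choice, and the helper names are hypothetical. Only the reference value 59.3 (ScanNet mAP) comes from the table.

```python
from statistics import mean, stdev

# ScanNet mAP as reported in the results table.
REPORTED_SCANNET_MAP = 59.3


def summarize_runs(scores):
    """Return the mean, sample standard deviation, and best score of several runs."""
    return {
        "mean": mean(scores),
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "best": max(scores),
    }


def within_tolerance(score, reference, tol=0.5):
    """Check whether a score lies within tol points of the reference value."""
    return abs(score - reference) <= tol


if __name__ == "__main__":
    # Placeholder values for illustration only; replace with your own per-seed mAP.
    seed_scores = [58.7, 59.1, 59.4]
    summary = summarize_runs(seed_scores)
    print(summary)
    print(within_tolerance(summary["best"], REPORTED_SCANNET_MAP))
```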

🖼️ Example Semantic Segmentation

[Figure: Semantic predictions]

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{han2025all,
  title={All in One: Visual-Description-Guided Unified Point Cloud Segmentation},
  author={Han, Zongyan and Boudjoghra, Mohamed El Amine and Dong, Jiahua and Wang, Jinhong and Anwer, Rao Muhammad},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

Acknowledgements

We gratefully acknowledge the following open-source projects that our work builds upon:

We thank the authors for their contributions to the community.