README.md

July 25, 2025

✨ VDG-Uni3DSeg

All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Zongyan Han¹, Mohamed El Amine Boudjoghra², Jiahua Dong¹, Jinhong Wang¹, Rao Muhammad Anwer¹

¹ Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), ² Technical University of Munich

📄 [Paper]

💡 Introduction

Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by their sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding.

Semantic predictions

🚀 Installation & Data Preparation

🛠️ 1. Installation & Data Prep

Please refer to the official OneFormer3D repository to set up the environment and prepare the datasets.

🌐 2. Internet Images

To obtain internet images, we use the google-images-download tool. Please follow the instructions below to install it:

```
mkdir internet_image && cd internet_image
git clone https://github.com/ultralytics/google-images-download
cd google-images-download
pip install -r requirements.txt
```

Once installed, you can download images by running the following command. For example, to download 20 images for the class "sofa" (e.g., for the S3DIS dataset), use:

```
python3 bing_scraper.py --search sofa --limit 20 --download --output_directory 'images/s3dis'
```

Tip: We’ve already collected images for you! Find them here.
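If you prefer to collect the images yourself, the single-class command above can be repeated for every class. The following is a minimal sketch, assuming it is run from inside the `google-images-download` directory. The class list is only an illustrative subset of S3DIS, and `build_download_command` and `download_all` are hypothetical helpers, not part of the scraper.

```python
import shlex
import subprocess

# Illustrative subset of S3DIS class names; replace with the full class list.
S3DIS_CLASSES = ["sofa", "chair", "table", "bookcase"]


def build_download_command(class_name, limit=20, output_dir="images/s3dis"):
    """Return the bing_scraper.py invocation shown above for one class."""
    return [
        "python3", "bing_scraper.py",
        "--search", class_name,
        "--limit", str(limit),
        "--download",
        "--output_directory", output_dir,
    ]


def download_all(classes, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for name in classes:
        cmd = build_download_command(name)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    download_all(S3DIS_CLASSES)  # dry run: only prints the commands
```

Calling `download_all(S3DIS_CLASSES, dry_run=False)` would actually launch the scraper once per class.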

Next, extract image features using the CLIP model. Use the commands below to install CLIP and extract features:

```
pip install git+https://github.com/openai/CLIP.git
python get_image_features.py
```

Or download pre-extracted features here.
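For orientation, here is a rough sketch of what a CLIP image-feature extraction step can look like. It is not the repository's `get_image_features.py`. It assumes one sub-directory of images per class (for example `images/s3dis/sofa/`) and the `ViT-B/32` CLIP variant, both of which are assumptions, and the helper names are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def group_images_by_class(root):
    """Map each class sub-directory under `root` to its sorted image paths."""
    groups = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            images = sorted(
                p for p in class_dir.iterdir()
                if p.suffix.lower() in IMAGE_SUFFIXES
            )
            if images:
                groups[class_dir.name] = images
    return groups


def encode_images(image_paths, model_name="ViT-B/32"):
    """Return L2-normalised CLIP embeddings, one row per image."""
    # Heavy dependencies are imported lazily so the helper above stays
    # usable without them. The model variant is an assumption.
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load(model_name, device=device)
    batch = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in image_paths]
    ).to(device)
    with torch.no_grad():
        features = model.encode_image(batch)
    return features / features.norm(dim=-1, keepdim=True)
```

A typical use would be to call `group_images_by_class("images/s3dis")` and then `encode_images` once per class. For larger image sets, load the model once and reuse it rather than reloading it on every call.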

📝 3. Generating LLM Descriptions

We follow this method, replacing GPT with the open-source Llama3.1-8B-Instruct model.

Get the Model:

```
# Download weights & tokenizer from HuggingFace
# Place them in:
./llama_model/Llama3.1-8B-Instruct
```
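One possible way to fetch the files is the `huggingface_hub` client. This is a sketch under assumptions: the repo id `meta-llama/Llama-3.1-8B-Instruct` is assumed (this README only names the model), the repository is gated and needs an access token, and the file layout expected by `generate_descriptors_llama.py` is not confirmed here, so the downloaded files may need rearranging.

```python
from pathlib import Path

# Assumed Hugging Face repo id; the README only names "Llama3.1-8B-Instruct".
REPO_ID = "meta-llama/Llama-3.1-8B-Instruct"
# Target directory taken from the README.
TARGET_DIR = Path("llama_model") / "Llama3.1-8B-Instruct"


def download_llama(repo_id=REPO_ID, target_dir=TARGET_DIR):
    """Download the model snapshot into the directory the README expects."""
    # Imported lazily; requires `pip install huggingface_hub` and an access
    # token for the gated Llama repository.
    from huggingface_hub import snapshot_download

    target_dir.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=str(target_dir))
```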

Generate Descriptions:
```
torchrun --nproc_per_node 1 generate_descriptors_llama.py
```
Descriptions 👉 ./class_description/descriptors

We’ve pre-generated descriptors for ScanNet, ScanNet200, and S3DIS. Download the bundle here.

Extract Text Features:
```
python get_text_features_llama.py
```
CLIP embeddings 👉 class_description/clip_embedding

Pre-extracted text features are also available here.
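As a rough illustration of the text-feature step, the sketch below turns per-class descriptions into prompts and encodes them with CLIP. It is not the repository's `get_text_features_llama.py`. The descriptor format (a mapping from class name to a list of description strings), the prompt template, the `ViT-B/32` variant, and averaging over a class's prompts are all assumptions made for the example.

```python
def build_prompts(descriptors, template="{name}: {description}"):
    """Turn {class: [description, ...]} into {class: [prompt, ...]}."""
    return {
        name: [template.format(name=name, description=d.strip()) for d in descs]
        for name, descs in descriptors.items()
    }


def encode_class_prompts(prompts, model_name="ViT-B/32"):
    """Return one L2-normalised CLIP text embedding per class."""
    # Imported lazily so build_prompts works without CLIP installed.
    import clip
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load(model_name, device=device)
    class_features = {}
    with torch.no_grad():
        for name, texts in prompts.items():
            # truncate=True clips prompts longer than CLIP's context length.
            tokens = clip.tokenize(texts, truncate=True).to(device)
            feats = model.encode_text(tokens)
            feats = feats / feats.norm(dim=-1, keepdim=True)
            # Assumption: average the prompt embeddings, then renormalise.
            mean = feats.mean(dim=0)
            class_features[name] = mean / mean.norm()
    return class_features
```

For example, `build_prompts({"sofa": ["a padded seat with armrests"]})` yields a single prompt for the class `sofa`, which `encode_class_prompts` would map to one embedding.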

🏋️‍♂️ Training and Evaluation

Below we provide training and testing commands for three datasets: ScanNet, ScanNet200, and S3DIS.

📦 ScanNet

Backbone: SpConv
Init checkpoint: SSTNet
📥 Download and place in work_dirs/tmp/

```
# Train
python tools/train.py configs/vdguni_scannet.py

# Fix checkpoint before evaluation
python tools/fix_spconv_checkpoint.py --in-path work_dirs/vdguni_scannet/epoch_512.pth --out-path work_dirs/vdguni_scannet/final.pth

# Evaluation
python tools/test.py configs/vdguni_scannet.py work_dirs/vdguni_scannet/final.pth
```

🧱 ScanNet200

Backbone: MinkowskiEngine
Init checkpoint: Mask3D
📥 Download and place in work_dirs/tmp/

```
# Train
python tools/train.py configs/vdguni_scannet200.py

# Evaluation
python tools/test.py configs/vdguni_scannet200.py work_dirs/vdguni_scannet200/epoch_512.pth
```

🏫 S3DIS

Backbone: [SpConv](https://github.com/traveller59/spconv)
Pretrained on: Structured3D + ScanNet
📥 Download and place in work_dirs/tmp/

We train on Areas 1–4 and 6 and test on Area 5. To use a different split, modify train_area / test_area in the config.
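As a sketch, the split fields for the default setting might look as follows. This assumes the config follows the OneFormer3D convention of a list for `train_area` and a single index for `test_area`; check `configs/vdguni_s3dis.py` for the actual form.

```python
# Assumed layout of the split fields in configs/vdguni_s3dis.py.
train_area = [1, 2, 3, 4, 6]  # areas used for training
test_area = 5                 # held-out area used for evaluation

# Sanity check: the evaluation area must not leak into training.
assert test_area not in train_area
```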

```
# Train
python tools/train.py configs/vdguni_s3dis.py

# Fix checkpoint before evaluation
python tools/fix_spconv_checkpoint.py --in-path work_dirs/vdguni_s3dis/epoch_512.pth --out-path work_dirs/vdguni_s3dis/final.pth

# Evaluation
python tools/test.py configs/vdguni_s3dis.py work_dirs/vdguni_s3dis/final.pth
```
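The three recipes above share the same shape: train, fix the checkpoint where the commands call for it, then evaluate. The sketch below only assembles those commands for a given dataset. `build_pipeline` is a hypothetical helper, and the paths are copied from the commands in this README.

```python
# Whether the commands above run tools/fix_spconv_checkpoint.py
# before evaluation for each dataset.
NEEDS_SPCONV_FIX = {"scannet": True, "scannet200": False, "s3dis": True}


def build_pipeline(dataset, epoch=512):
    """Return the train / (fix) / test commands for one dataset as argv lists."""
    config = f"configs/vdguni_{dataset}.py"
    work_dir = f"work_dirs/vdguni_{dataset}"
    checkpoint = f"{work_dir}/epoch_{epoch}.pth"
    commands = [["python", "tools/train.py", config]]
    if NEEDS_SPCONV_FIX[dataset]:
        fixed = f"{work_dir}/final.pth"
        commands.append([
            "python", "tools/fix_spconv_checkpoint.py",
            "--in-path", checkpoint, "--out-path", fixed,
        ])
        checkpoint = fixed  # evaluate the fixed checkpoint
    commands.append(["python", "tools/test.py", config, checkpoint])
    return commands


if __name__ == "__main__":
    for dataset in NEEDS_SPCONV_FIX:
        for cmd in build_pipeline(dataset):
            print(" ".join(cmd))
```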

📊 Model Zoo & Results

Note: Results may vary slightly between runs because of random initialization, so you may need to train with several seeds to match the numbers reported in the paper. Config files are available in configs/.

| Dataset | mAP₂₅ | mAP₅₀ | mAP | mIoU | PQ | Download |
|---|---|---|---|---|---|---|
| ScanNet | 86.5 | 78.5 | 59.3 | 76.2 | 71.5 | model \| log |
| ScanNet200 | 45.1 | 40.0 | 29.5 | 29.7 | 31.3 | model \| log |
| S3DIS | 80.6 | 74.1 | 60.1 | 71.5 | 66.3 | model \| log |
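Since several seeds may be needed to reproduce these numbers, one way to organise the runs is to give each seed its own work directory. The sketch below assumes `tools/train.py` is the standard MMDetection3D/MMEngine training script, which accepts `--work-dir` and `--cfg-options` and honours the `randomness.seed` config key. That has not been verified against this repository, and `build_seeded_runs` is a hypothetical helper.

```python
def build_seeded_runs(dataset, seeds):
    """Return one training command per seed, each with its own work directory."""
    config = f"configs/vdguni_{dataset}.py"
    return [
        [
            "python", "tools/train.py", config,
            # Assumed MMEngine-style flags; verify against tools/train.py.
            "--work-dir", f"work_dirs/vdguni_{dataset}_seed{seed}",
            "--cfg-options", f"randomness.seed={seed}",
        ]
        for seed in seeds
    ]


if __name__ == "__main__":
    for cmd in build_seeded_runs("scannet", [0, 1, 2]):
        print(" ".join(cmd))
```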

🖼️ Example Semantic Segmentation

Semantic predictions

Citation

If you find this work useful for your research, please cite our paper:

```
@inproceedings{han2025all,
  title={All in One: Visual-Description-Guided Unified Point Cloud Segmentation},
  author={Han, Zongyan and Boudjoghra, Mohamed El Amine and Dong, Jiahua and Wang, Jinhong and Anwer, Rao Muhammad},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```

Acknowledgements

We gratefully acknowledge the following open-source projects that our work builds upon:

We thank the authors for their contributions to the community.