July 25, 2025
VDG-Uni3DSeg
All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Zongyan Han1, Mohamed El Amine Boudjoghra2, Jiahua Dong1, Jinhong Wang1, Rao Muhammad Anwer1
1 Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), 2 Technical University of Munich
[Paper]
Introduction
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by their sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding.
Installation & Data Preparation
1. Installation & Data Prep
Please refer to the official OneFormer3D repository to set up the environment and prepare the datasets.
2. Internet Images
- To obtain internet images, we use the google-images-download tool. Please follow the instructions below to install it:
mkdir internet_image && cd internet_image
git clone https://github.com/ultralytics/google-images-download
cd google-images-download
pip install -r requirements.txt
- Once installed, you can download images by running the following command. For example, to download 20 images for the class "sofa" (e.g., for the S3DIS dataset), use:
python3 bing_scraper.py --search sofa --limit 20 --download --output_directory 'images/s3dis'
Tip: We've already collected images for you! Find them here.
- Next, extract image features using the CLIP model. Use the commands below to install CLIP and extract features:
pip install git+https://github.com/openai/CLIP.git
python get_image_features.py
Or download pre-extracted features here.
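The per-class download command shown above can be scripted for a whole dataset. The sketch below is an illustration, not part of this repository: it builds one `bing_scraper.py` invocation per class using only the flags that appear in the example command. The helper names `build_scraper_commands` and `download_all` and the `dry_run` switch are hypothetical, and the class list assumes the standard 13-class S3DIS label set is used as the search terms.

```python
import subprocess

# The 13 S3DIS semantic classes (assumed here to double as search terms).
S3DIS_CLASSES = [
    "ceiling", "floor", "wall", "beam", "column", "window", "door",
    "table", "chair", "sofa", "bookcase", "board", "clutter",
]


def build_scraper_commands(classes, limit=20, output_dir="images/s3dis"):
    """Return one bing_scraper.py argv list per class name."""
    return [
        ["python3", "bing_scraper.py", "--search", name, "--limit", str(limit),
         "--download", "--output_directory", output_dir]
        for name in classes
    ]


def download_all(classes, dry_run=True, **kwargs):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_scraper_commands(classes, **kwargs):
        print(" ".join(cmd))
        if not dry_run:
            # Must be run from inside the google-images-download checkout.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    download_all(S3DIS_CLASSES)
```

By default the script only prints the commands; pass `dry_run=False` from inside the `google-images-download` directory to actually run them.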
3. Generating LLM Descriptions
We follow this method, replacing GPT with the open-source Llama3.1-8B-Instruct model.
- Get the Model:
# Download weights & tokenizer from HuggingFace
# Place them in: ./llama_model/Llama3.1-8B-Instruct
- Generate Descriptions:
torchrun --nproc_per_node 1 generate_descriptors_llama.py
Descriptions: ./class_description/descriptors
We've pre-generated descriptors for ScanNet, ScanNet200, and S3DIS. Download the bundle here.
- Extract Text Features:
python get_text_features_llama.py
CLIP embeddings: class_description/clip_embedding
Pre-extracted text features are also available here.
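Before training, it can help to confirm that the directories named in the steps above exist and are populated. The following is a minimal sketch under the assumption that all three paths are relative to the repository root; the function name `missing_assets` and the labels are hypothetical and not part of this repository.

```python
from pathlib import Path

# Paths named in the steps above, assumed relative to the repository root.
EXPECTED_DIRS = {
    "Llama weights and tokenizer": "llama_model/Llama3.1-8B-Instruct",
    "LLM descriptors": "class_description/descriptors",
    "CLIP text embeddings": "class_description/clip_embedding",
}


def missing_assets(root="."):
    """Return labels of expected directories that are absent or empty."""
    missing = []
    for label, rel in EXPECTED_DIRS.items():
        path = Path(root) / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(label)
    return missing


if __name__ == "__main__":
    for label in missing_assets():
        print(f"missing or empty: {label}")
```

The check only looks at whether each directory exists and contains something; it does not validate file contents.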
Training and Evaluation
Below we provide training and testing commands for three datasets: ScanNet, ScanNet200, and S3DIS.
ScanNet
# Train
python tools/train.py configs/vdguni_scannet.py
# Fix checkpoint before evaluation
python tools/fix_spconv_checkpoint.py --in-path work_dirs/vdguni_scannet/epoch_512.pth --out-path work_dirs/vdguni_scannet/final.pth
# Evaluation
python tools/test.py configs/vdguni_scannet.py work_dirs/vdguni_scannet/final.pth
ScanNet200
- Backbone: MinkowskiEngine
- Init checkpoint: Mask3D
Download and place in work_dirs/tmp/
# Train
python tools/train.py configs/vdguni_scannet200.py
# Evaluation
python tools/test.py configs/vdguni_scannet200.py work_dirs/vdguni_scannet200/epoch_512.pth
S3DIS
- Backbone: [SpConv](https://github.com/traveller59/spconv)
- Pretrained on: Structured3D + ScanNet
Download and place in work_dirs/tmp/
We train on Areas 1–4 and 6, and test on Area 5. Modify train_area / test_area in the config to change splits.
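To evaluate on a different held-out area, the train and test areas must stay disjoint. The sketch below derives a leave-one-area-out split for the six S3DIS areas. The helper name `s3dis_split` is hypothetical, and the value types the config expects for `train_area` and `test_area` (a list and an integer here) are an assumption, so check them against configs/vdguni_s3dis.py before copying the values over.

```python
ALL_AREAS = (1, 2, 3, 4, 5, 6)  # S3DIS is divided into six areas


def s3dis_split(test_area=5):
    """Return (train_area, test_area) for a leave-one-area-out split."""
    if test_area not in ALL_AREAS:
        raise ValueError(f"test_area must be one of {ALL_AREAS}")
    train_area = [a for a in ALL_AREAS if a != test_area]
    return train_area, test_area


train_area, test_area = s3dis_split(5)
print(train_area, test_area)  # [1, 2, 3, 4, 6] 5
```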
# Train
python tools/train.py configs/vdguni_s3dis.py
# Fix checkpoint before evaluation
python tools/fix_spconv_checkpoint.py --in-path work_dirs/vdguni_s3dis/epoch_512.pth --out-path work_dirs/vdguni_s3dis/final.pth
# Evaluation
python tools/test.py configs/vdguni_s3dis.py work_dirs/vdguni_s3dis/final.pth
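The three recipes above follow the same pattern: train, optionally fix the SpConv checkpoint, then test. As a rough sketch, the helper below reproduces those command lines from a dataset name. The function name `pipeline_commands` and the `NEEDS_SPCONV_FIX` set are hypothetical. They encode only what the commands above show: ScanNet and S3DIS go through tools/fix_spconv_checkpoint.py, while ScanNet200 is tested directly on epoch_512.pth.

```python
# Datasets whose checkpoints the commands above pass through
# tools/fix_spconv_checkpoint.py before evaluation.
NEEDS_SPCONV_FIX = {"scannet", "s3dis"}


def pipeline_commands(dataset, epoch=512):
    """Return the train / (fix) / test commands for one dataset as argv lists."""
    config = f"configs/vdguni_{dataset}.py"
    work_dir = f"work_dirs/vdguni_{dataset}"
    checkpoint = f"{work_dir}/epoch_{epoch}.pth"
    commands = [["python", "tools/train.py", config]]
    if dataset in NEEDS_SPCONV_FIX:
        final = f"{work_dir}/final.pth"
        commands.append(["python", "tools/fix_spconv_checkpoint.py",
                         "--in-path", checkpoint, "--out-path", final])
        checkpoint = final  # evaluate the fixed checkpoint
    commands.append(["python", "tools/test.py", config, checkpoint])
    return commands


if __name__ == "__main__":
    for name in ("scannet", "scannet200", "s3dis"):
        for cmd in pipeline_commands(name):
            print(" ".join(cmd))
```

Printing the commands first makes it easy to compare them with the ones listed above before handing them to `subprocess.run`.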
Model Zoo & Results
Note: Because of random initialization, training results may vary slightly, and matching the paper's numbers may require running multiple seeds. Config files are available in configs/.
| Dataset | mAP25 | mAP50 | mAP | mIoU | PQ | Download |
|---|---|---|---|---|---|---|
| ScanNet | 86.5 | 78.5 | 59.3 | 76.2 | 71.5 | model / log |
| ScanNet200 | 45.1 | 40.0 | 29.5 | 29.7 | 31.3 | model / log |
| S3DIS | 80.6 | 74.1 | 60.1 | 71.5 | 66.3 | model / log |
Example Semantic Segmentation
Citation
If you find this work useful for your research, please cite our paper:
@inproceedings{han2025all,
title={All in One: Visual-Description-Guided Unified Point Cloud Segmentation},
author={Han, Zongyan and Boudjoghra, Mohamed El Amine and Dong, Jiahua and Wang, Jinhong and Anwer, Rao Muhammad},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}
Acknowledgements
We gratefully acknowledge the following open-source projects that our work builds upon:
We thank the authors for their contributions to the community.