README.md
July 12, 2025 · View on GitHub
Skip the data preparation
- We’ve provided the prepared data in Google Drive. Download the zip file and unzip it:
Or simply download all the unzipped files and place them in the annotations/ directory. You’ll then be ready to run and test the code.unzip annotations.zip -d annotations
Prepare data
-
Download the ScanNet dataset by following the ScanNet instructions.
-
Extract object masks using a pretrained 3D detector:
- Use Mask3D for instance segmentation. We used the checkpoint pretrained on ScanNet200.
- The complete predicted results (especially the masks) for the train/validation sets are too large to share (~40GB). We’ve shared the post-processed results:
- Unzip the
mask3d_inst_seg.tar.gzfile. - Each file under
mask3d_inst_segcontains the predicted results for a single scene, including a list of segmented instances with their labels and segmented indices.
- Unzip the
-
Process object masks and prepare annotations:
- If you use Mask3D for instance segmentation, set the
segment_result_dirin run_prepare.sh to the output directory of Mask3D. - If you use the downloaded
mask3d_inst_segdirectly, setsegment_result_dirto None and setinst_seg_dirto the path ofmask3d_inst_seg. - Run:
bash preprocess/run_prepare.sh
- If you use Mask3D for instance segmentation, set the
-
Extract 3D features using a pretrained 3D encoder:
- Follow Uni3D to extract 3D features for each instance. We used the pretrained model uni3d-g.
- We've also provided modified code for feature extraction in this forked repository. Set the
data_dirhere to the path to${processed_data_dir}/pcd_all(processed_data_diris an intermediate directory set inrun_prepare.sh). After preparing the environment, runbash scripts/inference.sh.
-
Extract 2D features using a pretrained 2D encoder:
-
We followed OpenScene's code to calculate the mapping between 3D points and 2D image pixels. This allows each object to be projected onto multi-view images. Based on the projected masks on the images, we extract and merge DINOv2 features from multi-view images for each object.
-
[TODO] Detailed implementation will be released.
-