README.md

July 12, 2025

Skip the data preparation

We’ve provided the prepared data on Google Drive. Download the zip file and unzip it:
```
unzip annotations.zip -d annotations
```
Alternatively, download the unzipped files and place them in the annotations/ directory. You can then run and test the code.
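If the unzip command is not available, the same step can be done from Python. This is a minimal sketch using only the standard library; the function name unpack_annotations is ours and is not part of this repository. It mirrors the command above by extracting annotations.zip into annotations/ and returning the extracted file names so you can check the result.

```python
import shutil
from pathlib import Path


def unpack_annotations(archive="annotations.zip", out_dir="annotations"):
    """Extract `archive` into `out_dir` and list the extracted files.

    Hypothetical helper (not part of the repository). The archive format
    is inferred from the file extension by shutil.unpack_archive.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(out))
    # Relative POSIX-style paths make the listing easy to compare.
    return sorted(
        p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file()
    )
```

Calling unpack_annotations() from the directory that holds annotations.zip should leave the files under annotations/.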

Prepare data

Download the ScanNet dataset by following the ScanNet instructions.
Extract object masks using a pretrained 3D detector:
- Use Mask3D for instance segmentation. We used the checkpoint pretrained on ScanNet200.
- The complete predicted results (especially the masks) for the train/validation sets are too large to share (~40 GB), so we’ve shared the post-processed results instead:
  - Extract the mask3d_inst_seg.tar.gz file.
  - Each file under mask3d_inst_seg contains the predicted results for a single scene, including a list of segmented instances with their labels and segmented indices.
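To illustrate how a per-scene result can be used, the sketch below gathers the points of each segmented instance from the scene point cloud. The record layout is an assumption for illustration only: we assume each instance is a dict with a "label" entry and a "segments" entry holding point indices. Check the actual files for the real field names and file format before relying on this.

```python
import numpy as np


def instance_points(scene_points, instances):
    """Return one (label, points) pair per segmented instance.

    scene_points: (N, 3) array with the scene point cloud.
    instances: list of dicts with HYPOTHETICAL keys "label" and
        "segments" (indices into scene_points); the real files may
        use different names.
    """
    result = []
    for inst in instances:
        idx = np.asarray(inst["segments"], dtype=np.int64)
        # Index the scene points with the instance's point indices.
        result.append((inst["label"], scene_points[idx]))
    return result
```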
Process object masks and prepare annotations:
- If you use Mask3D for instance segmentation, set the segment_result_dir in run_prepare.sh to the output directory of Mask3D.
- If you use the downloaded mask3d_inst_seg directly, set segment_result_dir to None and set inst_seg_dir to the path of mask3d_inst_seg.
- Run: bash preprocess/run_prepare.sh
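The two configurations above can be summarized as a small decision rule. The following sketch is not part of run_prepare.sh; the function name and the treatment of the string "None" as unset are assumptions made for illustration. It only encodes which directory the masks are read from in each case.

```python
from pathlib import Path


def resolve_mask_source(segment_result_dir, inst_seg_dir):
    """Decide where instance masks come from (hypothetical helper).

    Assumption: an unset value may appear as None, "" or the string
    "None", since the setting lives in a shell script.
    """
    unset = (None, "", "None")
    if segment_result_dir not in unset:
        # Raw Mask3D output is available and will be post-processed.
        return ("mask3d_output", Path(segment_result_dir))
    if inst_seg_dir in unset:
        raise ValueError(
            "Set segment_result_dir (Mask3D output) or inst_seg_dir "
            "(downloaded mask3d_inst_seg)."
        )
    # Use the downloaded, already post-processed results directly.
    return ("downloaded", Path(inst_seg_dir))
```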
Extract 3D features using a pretrained 3D encoder:
- Follow Uni3D to extract 3D features for each instance. We used the pretrained model uni3d-g.
- We’ve also provided modified code for feature extraction in this forked repository. Set data_dir to ${processed_data_dir}/pcd_all, where processed_data_dir is the intermediate directory set in run_prepare.sh. After preparing the environment, run bash scripts/inference.sh.
Extract 2D features using a pretrained 2D encoder:
- We followed OpenScene's code to calculate the mapping between 3D points and 2D image pixels. This allows each object to be projected onto multi-view images. Based on the projected masks on the images, we extract and merge DINOv2 features from multi-view images for each object.
- [TODO] Detailed implementation will be released.
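Until the detailed implementation is released, the idea can be sketched roughly as follows. This is not our released code and not OpenScene's code: it is a minimal NumPy illustration that assumes a pinhole camera, a world-to-camera extrinsic matrix, a depth map for the occlusion check, and a per-pixel feature map (for example, DINOv2 features resized to the image resolution). The function names, the dictionary keys, the depth tolerance, and the simple averaging used for merging are all assumptions.

```python
import numpy as np


def project_points(points_world, world_to_cam, intrinsics, depth, depth_tol=0.05):
    """Project (N, 3) world points into one view.

    Returns (N, 2) integer pixels as (row, col) and an (N,) visibility
    mask. A point is visible if it is in front of the camera, falls
    inside the image, and matches the depth map within depth_tol.
    """
    n = points_world.shape[0]
    homog = np.concatenate([points_world, np.ones((n, 1))], axis=1)
    cam = (world_to_cam @ homog.T).T[:, :3]
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid division by zero
    u = intrinsics[0, 0] * cam[:, 0] / safe_z + intrinsics[0, 2]  # column
    v = intrinsics[1, 1] * cam[:, 1] / safe_z + intrinsics[1, 2]  # row
    cols = np.round(u).astype(int)
    rows = np.round(v).astype(int)
    h, w = depth.shape
    inside = in_front & (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    visible = np.zeros(n, dtype=bool)
    idx = np.nonzero(inside)[0]
    # Occlusion check: the measured depth must agree with the point depth.
    visible[idx] = np.abs(depth[rows[idx], cols[idx]] - z[idx]) < depth_tol
    return np.stack([rows, cols], axis=1), visible


def merge_object_features(object_points, views):
    """Average features over an object's visible pixels, then over views.

    Each view is a dict with ASSUMED keys: "world_to_cam" (4, 4),
    "intrinsics" (3, 3), "depth" (H, W) and "features" (H, W, C).
    Returns a (C,) vector, or None if the object is never visible.
    """
    per_view = []
    for view in views:
        pix, visible = project_points(
            object_points, view["world_to_cam"], view["intrinsics"], view["depth"]
        )
        if not visible.any():
            continue
        rows, cols = pix[visible, 0], pix[visible, 1]
        per_view.append(view["features"][rows, cols].mean(axis=0))
    if not per_view:
        return None
    return np.mean(per_view, axis=0)
```

Plain averaging is the simplest way to merge views; weighting each view by the number of visible pixels is a natural alternative.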