
November 26, 2025

The original implementation of the TMM 2025 paper "PointCloud-Text Matching: Benchmark Dataset and Baseline". 🎉🎉🎉

Abstract

In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching (PTM), which aims to identify the exact cross-modal instance that matches a given point-cloud query or text query. PTM has potential applications in various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there is a lack of suitable and targeted datasets for PTM in practice. To address this issue, we present a new PTM benchmark dataset, namely SceneDepict-3D2T. We observe that the data poses significant challenges due to its inherent characteristics, such as the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which render existing cross-modal matching methods ineffective for PTM. To overcome these challenges, we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa). RoMa consists of two key modules: a Dual Attention Perception module (DAP) and a Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages token-level and feature-level attention mechanisms to adaptively focus on useful local and global features, and aggregate them into common representations, thereby reducing the adverse impact of noise and ambiguity. To handle noisy correspondence, RNCL enhances robustness against mismatching by dividing negative pairs into clean and noisy subsets and assigning them forward and reverse optimization directions, respectively. We conduct extensive experiments on our benchmarks and demonstrate the superiority of our RoMa.

Task


We introduce a novel instance-level retrieval task: PointCloud-Text Matching (PTM). Unlike PointCloud-Text Retrieval (PTR), PTM requires instance-level alignment, which is more challenging and more realistic: real-world applications need precise, relevant information to align point clouds with texts.

Dataset 📕


Existing descriptions in most datasets (e.g., ScanRefer, Nr3D, SQA) primarily focus on portraying a single object for visual grounding and captioning, and a few others (e.g., LLM-3D-Scene) describe several objects in isolation within the corresponding scenes. To evaluate the PTM task more reasonably and comprehensively, we construct a new benchmark dataset for PTM, namely SceneDepict-3D2T. Our SceneDepict-3D2T dataset adopts ScanNet as point-cloud data (the pre-extracted grid features we used can be found in ), and the newly generated text data and existing text data can be found in .

In addition, we believe that the more detailed and comprehensive SceneDepict-3D2T can help further advance pointcloud-text alignment in recent pretraining tasks.

Please place all data files in the data folder.

Preprocessed point cloud data files used for ScanRefer and Nr3D:

pt2vec_200_random_train.npy
pt2vec_200_random_val.npy
pt2vec_200_random_pos_train.npy
pt2vec_200_random_pos_val.npy

Preprocessed point cloud data files used for the 3D-LLM dataset:

3d_llm_grid_train.npy
3d_llm_grid_val.npy
3d_llm_pos_train.npy
3d_llm_pos_val.npy

Preprocessed point cloud data files used for the proposed SceneDepict-3D2T dataset:

3D_Text_Retrv_grid_train.npy
3D_Text_Retrv_grid_val.npy
3D_Text_Retrv_pos_train.npy
3D_Text_Retrv_pos_val.npy

The corresponding text data files are the .jsonl files named after each dataset.
Please download them from the appropriate location in Google Drive as needed.
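
A minimal sketch of how these files might be loaded, assuming the .npy files hold plain arrays readable by `numpy.load` and each line of a .jsonl file is one JSON object. The helper names `load_features` and `load_jsonl` are hypothetical and not part of this repository; the array shapes and JSON field names are not specified here, so inspect them after loading.

```python
import json
from pathlib import Path

import numpy as np


def load_features(path):
    """Load a preprocessed feature array from a .npy file (hypothetical helper)."""
    # allow_pickle keeps its default (False), so only plain arrays are read.
    # Assumption: the files were saved as plain numeric arrays.
    return np.load(Path(path))


def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, one per non-empty line."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `load_features("./data/3D_Text_Retrv_grid_val.npy").shape` shows how many scenes and grid features the validation split contains.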

Data directory structure:

./data/
    datafiles_put_into_this_dir
    ···
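
Before training, it can help to confirm that the expected files are in place. The sketch below checks the data folder against the file names listed above; the function name `missing_files` and the grouping labels are illustrative assumptions, and the .jsonl text files are not checked because their exact names are not listed here.

```python
from pathlib import Path

# File names copied from the lists above, grouped by dataset.
EXPECTED_FILES = {
    "ScanRefer / Nr3D": [
        "pt2vec_200_random_train.npy",
        "pt2vec_200_random_val.npy",
        "pt2vec_200_random_pos_train.npy",
        "pt2vec_200_random_pos_val.npy",
    ],
    "3D-LLM": [
        "3d_llm_grid_train.npy",
        "3d_llm_grid_val.npy",
        "3d_llm_pos_train.npy",
        "3d_llm_pos_val.npy",
    ],
    "SceneDepict-3D2T": [
        "3D_Text_Retrv_grid_train.npy",
        "3D_Text_Retrv_grid_val.npy",
        "3D_Text_Retrv_pos_train.npy",
        "3D_Text_Retrv_pos_val.npy",
    ],
}


def missing_files(data_dir="./data", groups=EXPECTED_FILES):
    """Return {dataset: [missing file names]} for files not found in data_dir."""
    root = Path(data_dir)
    report = {}
    for dataset, names in groups.items():
        missing = [name for name in names if not (root / name).is_file()]
        if missing:
            report[dataset] = missing
    return report


if __name__ == "__main__":
    for dataset, names in missing_files().items():
        print(f"{dataset}: missing {', '.join(names)}")
```

Only the files for the dataset you plan to train on need to be present, so a non-empty report for the other datasets is harmless.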

Baseline Framework


The pipeline of our proposed method. (a) shows the pipeline of our RoMa, which involves two modules: Dual Attention Perception (DAP) and Robust Negative Contrastive Learning (RNCL). In DAP, comprehensive common representations are extracted from both modalities and then matched into negative pairs. In RNCL, these negative pairs are adaptively optimized in both forward and reverse directions based on pairwise similarities, enhancing the robustness and discrimination of the common representations. (b) is a schematic illustration of DAP in the point-cloud modality; it operates similarly for the text modality. Query and Value are obtained from the features through a fully connected layer (FC), while the Generic-Key is general and learnable for the whole dataset. The Query is combined with the token-level and feature-level Generic-Keys to obtain dual attention. The features and attentions are then aggregated into common representations.
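
To make the DAP description more concrete, here is one possible reading of dual attention pooling with learnable generic keys, written in NumPy. It is a sketch under assumptions, not the implementation in this repository: the function name `dual_attention_pool`, the scaling, the use of softmax over tokens for both attentions, and the sum used for aggregation are all illustrative choices.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dual_attention_pool(feats, w_q, w_v, key_token, key_feature):
    """Pool (n, d_in) token features into one L2-normalized (d,) vector.

    Assumed shapes: w_q and w_v are (d_in, d); both generic keys are (d,).
    """
    q = feats @ w_q  # Query from an FC layer (bias omitted), shape (n, d)
    v = feats @ w_v  # Value from an FC layer (bias omitted), shape (n, d)
    d = q.shape[1]

    # Token-level attention: one weight per token, shape (n,).
    token_attn = softmax(q @ key_token / np.sqrt(d), axis=0)

    # Feature-level attention: one weight per token and channel, shape (n, d).
    feature_attn = softmax(q * key_feature, axis=0)

    # Aggregate values with both attentions (assumed: simple sum of the two).
    pooled = token_attn @ v + (feature_attn * v).sum(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-8)


rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))  # 5 tokens with 8-dim input features
rep = dual_attention_pool(
    feats,
    rng.normal(size=(8, 4)),
    rng.normal(size=(8, 4)),
    rng.normal(size=4),
    rng.normal(size=4),
)
print(rep.shape)
```

In a trained model the projection weights and generic keys would be learned parameters; random values are used here only to show the shapes involved.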

Requirements ⚙️

The training and testing sets used in this work have already been fully preprocessed, so only a few external libraries are required:

python 3.8.16
open3d 0.17.0
pytorch 1.12.1
torchvision 0.13.1
numpy 1.24.3
transformers
tensorboard_logger

Train and test 🎯

To run the BiGRU branch as the text encoder:

cd BiGRU

# ScanRefer
bash ./sh/train_GRU_scanrefer.sh

# Nr3D
bash ./sh/train_GRU_nr3d.sh

# 3D-LLM
bash ./sh/train_GRU_3DLLM.sh

# SceneDepict-3D2T
bash ./sh/train_GRU_our_data.sh

To run the BERT branch as the text encoder:

cd BERT

# ScanRefer
bash ./sh/train_BERT_scanrefer.sh

# Nr3D
bash ./sh/train_BERT_nr3d.sh

# 3D-LLM
bash ./sh/train_BERT_3DLLM.sh

# SceneDepict-3D2T
bash ./sh/train_BERT_our_data.sh

Text data augmentation is not recommended for datasets whose descriptions carry relatively little information, such as ScanRefer, Nr3D, and 3D-LLM. For these datasets, the corresponding calls pass False as the last argument of process_caption:

BiGRU: BiGRU/lib/datasets/image_caption.py, line 208: target_i = process_caption(self.vocab, caption, False)
BERT: BERT/lib/datasets/image_caption.py, line 208: target_i = process_caption(self.tokenizer, caption_tokens, False)

For datasets with richer information like our proposed SceneDepict-3D2T, text data augmentation is recommended.
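
For readers unfamiliar with text data augmentation, the sketch below shows one common form: randomly masking or deleting tokens in a caption. It is an illustration only and is not taken from process_caption in this repository; the function name `augment_tokens`, the default `drop_prob`, and the `<mask>` token are all hypothetical.

```python
import random


def augment_tokens(tokens, drop_prob=0.2, mask_token="<mask>", rng=None):
    """Randomly perturb a token list (illustrative, not the repository's code).

    Each token is perturbed with probability drop_prob; a perturbed token is
    replaced by mask_token half of the time and deleted otherwise.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if rng.random() < drop_prob:
            if rng.random() < 0.5:
                out.append(mask_token)  # mask the token
            # otherwise the token is deleted
        else:
            out.append(tok)  # keep the token unchanged
    # Never return an empty caption; fall back to the original tokens.
    return out if out else list(tokens)


caption = "a wooden table stands next to the window".split()
print(augment_tokens(caption, rng=random.Random(0)))
```

Short, single-object descriptions lose much of their content when even one token is removed, which is consistent with the recommendation to augment only the richer SceneDepict-3D2T texts.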

Results 🪧

The code reports R@1, R@5, R@10, and R@30. We obtained the following results:

[Figures: results]
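
For reference, R@K (Recall at K) is the percentage of queries whose correct match appears among the top K retrieved candidates. The sketch below computes it from a square similarity matrix, assuming candidate i is the single correct match for query i. The function name `recall_at_k` is hypothetical, and the evaluation code in this repository may handle several texts per scene differently.

```python
import numpy as np


def recall_at_k(sim, ks=(1, 5, 10, 30)):
    """Recall@K in percent for a square (num_queries, num_candidates) matrix.

    Assumption: candidate i is the only correct match for query i.
    """
    sim = np.asarray(sim, dtype=float)
    idx = np.arange(sim.shape[0])
    gt_scores = sim[idx, idx]  # similarity of each query to its true match
    # Rank of the true match = number of candidates scoring strictly higher.
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    return {k: 100.0 * float((ranks < k).mean()) for k in ks}


sim = np.array([[0.1, 0.9],
                [0.2, 0.8]])
print(recall_at_k(sim, ks=(1, 5)))  # {1: 50.0, 5: 100.0}
```

Passing `sim.T` gives the recall for the opposite retrieval direction, for example text-to-point-cloud when `sim` is indexed by point-cloud queries.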

Our work follows the Image-Text Retrieval paradigm, preprocessing large 3D point clouds into lightweight features. This provides the advantage of fast training and testing.

If you want to train using the original data, please refer to http://www.scan-net.org/.

Reference 🤗

If this paper is helpful for your research, please cite:

@article{feng2025pointcloud,
  title={Pointcloud-text matching: Benchmark dataset and baseline},
  author={Feng, Yanglin and Qin, Yang and Peng, Dezhong and Zhu, Hongyuan and Peng, Xi and Hu, Peng},
  journal={IEEE Transactions on Multimedia},
  year={2025},
  publisher={IEEE}
}

In addition, our team has developed an interactive retrieval framework based on this work (NeurIPS 2025). You are welcome to use and cite it:

@inproceedings{fenginteractive,
  title={Interactive Cross-modal Learning for Text-3D Scene Retrieval},
  author={Feng, Yanglin and Li, Yongxiang and Sun, Yuan and Qin, Yang and Peng, Dezhong and Hu, Peng},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}

Feel free to reach out for discussion or collaboration: fcyzfyl@163.com