CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
April 20, 2024 · View on GitHub
This is the official implementation for our ICLR 2024 paper
"CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding"
:books: Synopsis
To achieve this objective, we formulate the 3D visual grounding problem as a sequence-to-sequence (Seq2Seq) task. As illustrated in the architecture above, the The input sequence comprises 3D objects from the scene and an utterance describing a specific object. In contrast to existing architectures, our model predicts both the target object and a chain of anchors on the output side.
Model Zoo
MVT-Nr3D: (Weights and logs)
| Method-Data Percentage | 10% | 40% | 70% | 100% |
|---|---|---|---|---|
| MVT Baseline | 27.56 | 41.64 | 51.93 | 55.22 |
| MVT + Pseudo Labels | 38.16 | 54.83 | 58.81 | 60.36 |
| MVT + GT Labels | 37.23 | 53.89 | 62.95 | 64.36 |
MVT-Sr3D: (Weights and logs)
| Method-Data Percentage | 10% | 40% | 70% | 100% |
|---|---|---|---|---|
| MVT Baseline | 48.75 | 65.03 | 65.08 | 66.05 |
| MVT + GT Labels | 66.37 | 72.78 | 73.53 | 73.23 |
Installation and Data Preparation
Please refer the installation and data preparation from referit3d.
We adopt bert-base-uncased from huggingface, which can be installed using pip as follows:
pip install transformers
you can download the pretrained weight in this page, and put them into a folder, noted as PATH_OF_BERT.
Training
- To train on Sr3d dataset, use the following command
python /CoT3D_VG/refering_codes/MVT-3DVG/train_referit3d.py \
-scannet-file $PATH_OF_SCANNET_FILE$\
-referit3D-file $PATH_OF_REFERIT3D_FILE$ \
--bert-pretrain-path $PATH_OF_BERT$ \
--log-dir logs/COT3DRef_sr3d \
--n-workers 16 \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 24 \
--encoder-layer-num 3 \
--decoder-layer-num 4 \
--decoder-nhead-num 8 \
--gpu "0" \
--view_number 4 \
--rotate_number 4 \
--label-lang-sup True \
--anchors "cot" \
--max-test-objects 52 \
--cot_type "cross" \
--predict_lang_anchors True \
--lang_filter_objs False \
--visaug_shuffle_mode 'none' \
--visaug_extracted_obj_path '$PATH_OF_SR3D_DATA' \
--visaug_pc_augment True \
--train_data_percent 1.0
- To train on NR3d dataset, just add the following commands:
--shuffle_objects_percentage 0 \
--visaug_pc_augment False \
--train_data_percent 1.0 \
--max_num_anchors 7 \
--dropout-rate 0.15 \
--textaug_paraphrase_percentage 0 \
--target_aug_percentage 0 \
--gaussian_latent False \
--distractor_aux_loss_flag True \
--train_data_repeatation 1 \
--augment-with-sr3d '$PATH_OF_SR3D_DATA'
- And Change the "--visaug_extracted_obj_path"
--visaug_extracted_obj_path '$PATH_OF_NR3D_DATA' \
- To Train on Both SR3D and NR3d datasets, use the following command:
python /CoT3D_VG/refering_codes/MVT-3DVG/train_referit3d.py \
-scannet-file $PATH_OF_SCANNET_FILE$\
-referit3D-file $PATH_OF_REFERIT3D_FILE$ \
--bert-pretrain-path $PATH_OF_BERT$ \
--log-dir logs/MVT_nr3d \
--n-workers 16 \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 64 \
--encoder-layer-num 3 \
--decoder-layer-num 4 \
--decoder-nhead-num 8 \
--gpu "0" \
--view_number 4 \
--rotate_number 4 \
--label-lang-sup True \
--anchors "cot" \
--max-test-objects 52 \
--cot_type "cross" \
--predict_lang_anchors True \
--lang_filter_objs False \
--visaug_shuffle_mode 'none' \
--shuffle_objects_percentage 0 \
--visaug_extracted_obj_path '$PATH_OF_NR3D_DATA' \
--visaug_pc_augment False \
--train_data_percent 1.0 \
--max_num_anchors 7 \
--dropout-rate 0.15 \
--textaug_paraphrase_percentage 0 \
--target_aug_percentage 0 \
--gaussian_latent False \
--distractor_aux_loss_flag True \
--train_data_repeatation 1 \
--augment-with-sr3d '$PATH_OF_SR3D_DATA'
Validation
- After each epoch of the training, the program will automatically evaluate the performance of the current model. Our code will save the last model in the training as last_model.pth, and save the best model following the original Referit3D's repo as best_model.pth.
Test
-
At test time, the analyze_predictions will run following the original code of Referit3D.
-
The analyze_predictions will test the model multiple times, each time using a different random seed. With different random seeds, the sampled point clouds of each object are different. The average accuracy and std will be reported.
-
To test on either Nr3d or Sr3d dataset, use the following commands
python referit3d/scripts/train_referit3d.py \
--mode evaluate \
-scannet-file $PATH_OF_SCANNET_FILE$ \
-referit3D-file $PATH_OF_REFERIT3D_FILE$ \
--bert-pretrain-path $PATH_OF_BERT$ \
--log-dir logs/MVT_nr3d \
--resume-path $the_path_to_the_model.pth$ \
--n-workers 8 \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 24 \
--encoder-layer-num 3 \
--decoder-layer-num 4 \
--decoder-nhead-num 8 \
--gpu "0" \
--view_number 4 \
--rotate_number 4 \
--label-lang-sup True
- To test on joint trained model, add the following argument to the above command
--augment-with-sr3d sr3d_dataset_file.csv
:bouquet: Credits
The project is built based on the following repository:
:telephone: Contact us
eslam.abdelrahman@kaust.edu.sa
:mailbox_with_mail: Citation
@article{bakr2023cot3dref,
title={CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding},
author={Bakr, Eslam Mohamed and Ayman, Mohamed and Ahmed, Mahmoud and Slim, Habib and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2310.06214},
year={2023}
}