March 3, 2024

TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

Zhihao Zhang1*, Shengcao Cao2*, Yuxiong Wang2

1XJTU, 2UIUC
* Equal Contribution

CVPR 2024

We introduce TriAdapter Multi-Modal Learning (TAMM), a novel two-stage learning approach based on three synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting the visual representations of CLIP for synthetic image-text pairs. Subsequently, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces, one focusing on visual attributes and the other on semantic understanding, which ensures more comprehensive and effective multi-modal pre-training.
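
To make the two-stage design more concrete, here is a dependency-free sketch of the data flow only. It assumes a residual-style adapter (a small transform blended with the original feature) for the CLIP Image Adapter and two independent projection heads for the Dual Adapters. The function names, the blend ratio `alpha`, and the toy weights are all hypothetical and are not taken from the TAMM code, whose actual layers may differ.

```python
# Illustrative only: toy data flow for the three adapters, in plain Python.
# All names, shapes, and weights here are hypothetical.

def linear(x, weight):
    """Multiply a feature vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def residual_adapter(feature, weight, alpha=0.5):
    """Blend an adapted feature with the original one (assumed residual design)."""
    adapted = linear(feature, weight)
    return [alpha * a + (1 - alpha) * f for a, f in zip(adapted, feature)]

def dual_adapters(shape_feature, w_visual, w_semantic):
    """Project one 3D feature into two sub-spaces: visual and semantic."""
    return linear(shape_feature, w_visual), linear(shape_feature, w_semantic)

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

# Stage 1: adapt a CLIP image feature computed from a rendered view.
clip_image_feature = [2.0, 4.0]
adapted_image = residual_adapter(clip_image_feature, swap, alpha=0.5)

# Stage 2: split the 3D encoder output into two sub-space embeddings.
visual_emb, semantic_emb = dual_adapters([1.0, 3.0], identity, swap)
```

Under this reading, the visual embedding would be matched against the adapted image features and the semantic embedding against text features during pre-training.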

Schedule

We are committed to open-sourcing TAMM-related materials, including:

  • Evaluation code
  • Evaluation data
  • Pretraining code
  • Pretrained checkpoints
  • Downstream tasks implementation

Installation

Clone this repository and install the required packages:

conda create -n tamm python=3.9
conda activate tamm
conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine
conda install -c dglteam/label/cu113 dgl
pip install huggingface_hub tqdm
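
After installation, a quick check that the packages can be found may save a failed launch later. The sketch below uses only the standard library; the import names in `REQUIRED_MODULES` are assumed from the install commands above, and `missing_modules` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names assumed for the packages installed above.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio",
    "MinkowskiEngine", "dgl", "huggingface_hub", "tqdm",
]

def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only confirms that each module can be located; it does not verify versions or CUDA compatibility.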

Model Zoo

| Model | Training Data | Objaverse-LVIS Top1 (Top5) | ModelNet40 Top1 (Top5) | ScanObjectNN Top1 (Top5) |
|---|---|---|---|---|
| PointBert | Ensembled w/o LVIS | 43.5 (72.3) | 86.2 (97.9) | 55.9 (88.2) |
| PointBert | Ensembled | 51.9 (81.3) | 86.1 (97.8) | 57.0 (86.8) |
| PointBert | ShapeNet | 13.7 (29.2) | 73.2 (91.8) | 54.3 (83.6) |
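
For readers unfamiliar with the Top1 (Top5) notation, the sketch below shows how top-k accuracy is commonly computed, assuming the reported numbers are percentages. The helper `top_k_accuracy` and the toy scores are hypothetical and only illustrate the metric; they are not the evaluation code of this repository.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)

# Toy example: 4 samples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.4, 0.35, 0.25],
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

A prediction counts as correct at Top5 if the true category is anywhere among the five best-scoring categories, which is why Top5 is never lower than Top1.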

Pre-training

  1. Please refer to here for pre-training dataset preparation, and put the prepared dataset in the data folder. The folder should look like this:
├── data
│   ├── objaverse_processed
│   │   ├── merged_for_training_all
│   │   │   ├── ...
│   ├── meta_data
  2. Run stage 1 of pre-training with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=29001 main.py --config configs/clip_image_adapter_training.yaml
  3. Run stage 2 of pre-training with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=29001 main.py --config configs/pointbert.yaml
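
Before launching either stage, it may help to confirm that the data folder matches the listing in step 1. The sketch below is a minimal check using only the standard library. It assumes `meta_data` sits directly under `data`, as the indentation of the listing suggests; `missing_data_dirs` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# Relative paths taken from the folder listing in step 1 (layout assumed).
EXPECTED_DIRS = [
    "objaverse_processed/merged_for_training_all",
    "meta_data",
]

def missing_data_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]

if __name__ == "__main__":
    missing = missing_data_dirs("data")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Data folder layout looks complete.")
```

Run it from the repository root so that the relative path `data` resolves correctly.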

Inference

Run the zero-shot evaluation with the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=29001 test.py --config configs/Pre-training/pointbert.yaml --resume /path/to/pre-trained-models
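
As background, zero-shot classification with CLIP-aligned embeddings is usually done by comparing a shape embedding with the text embedding of each category name and picking the closest one. The sketch below shows that generic recipe with toy 2D vectors. It is not a description of what `test.py` does internally (for example, how the two Dual Adapter sub-spaces are combined), and all names and values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def zero_shot_classify(shape_embedding, text_embeddings):
    """Return the category whose text embedding is most similar to the shape."""
    return max(text_embeddings,
               key=lambda name: cosine(shape_embedding, text_embeddings[name]))

# Toy text embeddings, one per category name (values are made up).
text_embeddings = {
    "chair": [1.0, 0.0],
    "table": [0.0, 1.0],
    "lamp": [-1.0, 0.0],
}

prediction = zero_shot_classify([0.9, 0.2], text_embeddings)
```

No category-specific training is involved: adding a new category only requires a text embedding for its name.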

Acknowledgement

TAMM is built using the awesome OpenCLIP, ULIP, OpenShape and Uni3D.

Citation

@article{zhang2024tamm,
  title={TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding},
  author={Zhang, Zhihao and Cao, Shengcao and Wang, Yu-Xiong},
  journal={arXiv preprint arXiv:2402.18490},
  year={2024}
}