README.md

March 3, 2024 · View on GitHub

TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

Zhihao Zhang^1*, Shengcao Cao^2*, Yuxiong Wang²

¹XJTU, ²UIUC
^* Equal Contribution

CVPR 2024

overview

We introduce TriAdapter Multi-Modal Learning (TAMM) -- a novel two-stage learning approach based on three synergetic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images, by adapting the visual representations of CLIP for synthetic image-text pairs. Subsequently, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces: one focusing on visual attributes and the other for semantic understanding, which ensure a more comprehensive and effective multi-modal pre-training.

Schedule

We are committed to open-sourcing TAMM related materials, including:

Installation

Clone this repository and install the required packages:

conda create -n tamm python=3.9
conda activate tamm
conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine
conda install -c dglteam/label/cu113 dgl
pip install huggingface_hub tqdm

Model Zoo

Model	Training Data	Objaverse-LVIS Top1 (Top5)	ModelNet40 Top1 (Top5)	ScanObjectNN Top1 (Top5)
PointBert	Ensembled w/o LVIS	43.5 (72.3)	86.2 (97.9)	55.9 (88.2)
PointBert	Ensembled	51.9 (81.3)	86.1 (97.8)	57.0 (86.8)
PointBert	ShapeNet	13.7 (29.2)	73.2 (91.8)	54.3 (83.6)

Pre-training

Please refer to here for pre-train dataset preparation and put it in the data folder. The folder should look like this:

├── data
│   ├── objaverse_processed
│   │   ├── merged_for_training_all
│   │   │   ├── ...
│   meta_data

Run the pre-training stage1 by the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3  torchrun  --nproc_per_node=4 --master_port=29001 main.py --config configs/clip_image_adapter_training.yaml

Run the pre-training stage2 by the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3  torchrun  --nproc_per_node=4 --master_port=29001 main.py --config configs/pointbert.yaml

Inference

Run the zero-shot evaluation by the following command

CUDA_VISIBLE_DEVICES=0,1,2,3  torchrun  --nproc_per_node=4 --master_port=29001 test.py --config configs/Pre-training/pointbert.yaml --resume /path/to/pre-trained-models

Acknowledgement

TAMM is built using the awesome OpenCLIP, ULIP, OpenShape and Uni3D.

Citation

@article{zhang2024tamm,
  title={TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding},
  author={Zhang, Zhihao and Cao, Shengcao and Wang, Yu-Xiong},
  journal={arXiv preprint arXiv:2402.18490},
  year={2024}
}