CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation (CVPR'26)

April 22, 2026

Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee


Abstract

Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines.
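The optimal transport-based alignment loss mentioned above is typically computed from an entropic-regularized transport plan. As a purely illustrative sketch (not the paper's actual loss), the classic Sinkhorn-Knopp iteration between two uniform marginals can be written in a few lines:

```python
import math

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Sinkhorn-Knopp sketch: entropic-regularized transport plan between
    two uniform marginals, given a cost matrix (list of lists)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # alternate scaling: u <- a / (K v), v <- b / (K^T u)
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # transport plan P = diag(u) K diag(v); its row/column sums match the marginals
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# toy 2x2 cost matrix: matching identical "features" is cheap
P = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

In the full method, the cost matrix would come from source/target feature distances; the transport plan then weights the cross-domain alignment.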

Setup

1. Create the conda environment

conda env create -f environment.yml
conda activate clipoint3d

The environment pins Python 3.9.20 and installs all dependencies, including PyTorch 2.5.1 with CUDA 12 support.

2. Install Dassl

Follow the installation instructions in Dassl.pytorch/README.md. The relevant steps from their guide are:

cd Dassl.pytorch/

# Install dependencies
pip install -r requirements.txt

# Install this library (no need to re-build if the source code is modified)
python setup.py develop

cd ..

3. CLIP model weights

CLIP weights are downloaded automatically on first use via the clip library. Ensure you have internet access on the first run, or pre-download the ViT-B/16 weights.

Dataset Setup

PointDA

Download the PointDA-10 dataset and place it under PointDA_data/:

PointDA_data/
├── shapenet/
├── modelnet/
└── scannet/

GraspNet

Download the GraspNet point cloud data and place it under GraspNetPointClouds/:

GraspNetPointClouds/
├── synthetic/
├── kinect/
└── realsense/

Training

Single experiment

python train.py \
  --config-file configs/trainers/trainer_200.yaml \
  --dataset-config-file configs/datasets/pointda_shapenet_modelnet.yaml \
  --output-dir experiments/run1 \
  --seed 42 \
  --use_sinkhorn_loss \
  --use_entropy_loss \
  --use_confidence_sampling
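To sweep all six PointDA source/target pairs, the repo ships train_single.sh; a hedged dry-run sketch of such a loop (it echoes the commands rather than executing them; drop the echo to run, and note the config naming pointda_<source>_<target>.yaml described below):

```shell
# Dry-run sweep over all six PointDA domain pairs (illustrative; the
# provided train_single.sh plays a similar role).
PAIRS="shapenet_modelnet shapenet_scannet modelnet_shapenet modelnet_scannet scannet_shapenet scannet_modelnet"
for pair in $PAIRS; do
  echo python train.py \
    --config-file configs/trainers/trainer_200.yaml \
    --dataset-config-file "configs/datasets/pointda_${pair}.yaml" \
    --output-dir "experiments/${pair}" \
    --seed 42 \
    --use_sinkhorn_loss --use_entropy_loss --use_confidence_sampling
done
```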

Key arguments

Argument                   Default                                          Description
--config-file              configs/trainers/trainer.yaml                    Trainer configuration
--dataset-config-file      configs/datasets/pointda_shapenet_modelnet.yaml  Dataset configuration
--output-dir               test_runs_with_sinkhorn                          Output directory for checkpoints and logs
--root                     PointDA_data                                     Path to dataset root
--seed                     42                                               Random seed (positive = fixed)
--source-domains           (none)                                           Override source domains
--target-domains           (none)                                           Override target domains
--use_sinkhorn_loss        off                                              Optimal transport loss between source/target
--use_entropy_loss         off                                              Entropy minimization on target predictions
--use_align_loss           off                                              Direct feature alignment loss
--use_prototype_loss       off                                              Prototype-based domain alignment
--use_kl_loss              off                                              KL divergence loss
--use_w1_loss              off                                              Wasserstein-1 distance loss
--use_confidence_sampling  off                                              Sample target points by prediction confidence
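The abstract describes --use_confidence_sampling as entropy-guided view sampling: keep the projected views whose predictions are most confident. A minimal stdlib sketch of that idea (assumed logic, not the repo's implementation):

```python
import math

def view_entropy(logits):
    """Shannon entropy of the softmax distribution over class logits."""
    mx = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_confident_views(view_logits, k=2):
    """Keep the indices of the k views with the lowest prediction entropy."""
    ranked = sorted(range(len(view_logits)),
                    key=lambda i: view_entropy(view_logits[i]))
    return ranked[:k]

# three hypothetical depth-map views of one point cloud, 4 classes each
views = [[2.5, 0.1, 0.0, 0.2],   # peaked -> low entropy
         [0.5, 0.4, 0.6, 0.5],   # nearly uniform -> high entropy
         [3.0, 0.0, 0.1, 0.1]]   # most peaked
keep = select_confident_views(views, k=2)
```

Low-entropy (peaked) predictions are treated as more reliable, so the near-uniform view is dropped.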

Output is saved to <output-dir>/<model>/<source>/<target>/.

Configuration

Configs use YACS and are split into two files:

  • Trainer config (configs/trainers/): Model architecture, optimizer, batch size, learning rate, number of context tokens. The recommended config is trainer_200.yaml.
  • Dataset config (configs/datasets/): Dataset name, source and target domain names. Named as pointda_<source>_<target>.yaml or graspnet_<source>_<target>.yaml.

You can also override any config value directly from the command line using YACS syntax at the end of the command:

python train.py ... OPTIM.LR 0.001 DATALOADER.TRAIN_X.BATCH_SIZE 32
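Conceptually, each YACS override is a dotted-key/value pair merged onto the nested config. A simplified stdlib stand-in for YACS's merge_from_list (illustrative only; real YACS also type-checks and casts each value against the default):

```python
def merge_overrides(cfg: dict, opts: list) -> dict:
    """Apply ["A.B", "val", ...] style overrides to a nested dict,
    mimicking the shape of YACS's merge_from_list."""
    for key, value in zip(opts[::2], opts[1::2]):
        node = cfg
        *parents, leaf = key.split(".")
        for p in parents:
            node = node.setdefault(p, {})
        # YACS would cast `value` to the default's type; we assign as-is
        node[leaf] = value
    return cfg

cfg = {"OPTIM": {"LR": 0.002}, "DATALOADER": {"TRAIN_X": {"BATCH_SIZE": 64}}}
merge_overrides(cfg, ["OPTIM.LR", "0.001",
                      "DATALOADER.TRAIN_X.BATCH_SIZE", "32"])
```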

Key trainer config options:

MODEL:
  NAME: CLIPoint3D
  BACKBONE:
    NAME: "ViT-B/16"   # CLIP backbone

OPTIM:
  NAME: "sgd"
  LR: 0.002
  MAX_EPOCH: 200
  LR_SCHEDULER: "cosine"

TRAINER:
  MODEL:
    N_CTX: 4       # Number of learnable context tokens in prompts
    PREC: "fp32"   # Precision: fp32, fp16, or amp

Project Structure

clipoint3d/
├── train.py                  # Entry point
├── trainer.py                # Trainer class with loss implementations
├── environment.yml           # Conda environment spec
├── train_single.sh           # Run all PointDA domain pairs
├── train_graspnet.sh         # Run all GraspNet domain pairs
├── ablations.sh              # Ablation study runs
├── models/
│   ├── model.py              # Main model (PointNet + CLIP + cross-attention)
│   ├── pointnet.py           # PointNet 3D encoder
│   ├── prompt_learner.py     # Learnable text prompt module
│   ├── text_encoder.py       # CLIP text encoder wrapper
│   ├── image_encoder.py      # CLIP image encoder wrapper
│   ├── cross_attention.py    # Cross-modal attention module
│   └── lora.py               # LoRA parameter-efficient fine-tuning
├── clip/                     # CLIP model integration
├── utils/
│   ├── config_defaults.py    # YACS config defaults
│   ├── dataloader.py         # Data loading utilities
│   ├── loss.py               # Domain adaptation loss functions
│   ├── render.py             # Point cloud -> multi-view image renderer
│   └── peft_utils.py         # Parameter-efficient fine-tuning helpers
├── configs/
│   ├── datasets/             # Dataset YAML configs
│   └── trainers/             # Trainer YAML configs
├── Dassl.pytorch/            # Domain adaptation framework
├── PointDA_data/             # PointDA dataset (ShapeNet/ModelNet/ScanNet)
└── GraspNetPointClouds/      # GraspNet dataset

Citation

@article{singha2026clipoint3d,
  title={CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation},
  author={Singha, Mainak and Mehrotra, Sarthak and Casari, Paolo and Chaudhuri, Subhasis and Ricci, Elisa and Banerjee, Biplab},
  journal={arXiv preprint arXiv:2602.20409},
  year={2026}
}