Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models

September 7, 2025

Implementation of the ICML 2025 paper Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models by Shizhan Gong, Yankai Jiang, Qi Dou, and Farzan Farnia.

Setup

We recommend installing the environment with conda:

cd KUEA
conda create --name myenv python=3.11
conda activate myenv
pip install -r requirements.txt

Alignment Fine-tuning

Run the following command for alignment fine-tuning:

python -m train.align_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet \
--imagenet_root /path/to/imagenet2012 --template std --output_normalize False --steps 40000 --warmup 2800 \
--batch_size 64 --loss l2 --loss_clean l2 --opt adamw --lr 1e-5 --wd 1e-4 --inner_loss l2 --wandb False \
--output_dir /path/to/checkpoint --clean_weight 1. --penalty_weight 0.5 --kernel_dino polynomial \
--kernel_clip polynomial --gamma 0.0032 --coef0 0.191623 --experiment_name exp_1 --log_freq 1 --eval_freq 10

--imagenet_root should point to the directory of the ImageNet dataset, and --output_dir specifies the directory where the fine-tuned checkpoint is stored. --gamma and --coef0 are the initial parameters of the polynomial kernel computed on CLIP representations. We pre-compute them by sampling several images from the training data and minimizing the L2 distance between the kernel matrices of CLIP and DINOv2.
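The pre-computation of --gamma and --coef0 described above can be illustrated with a minimal sketch. It is not the authors' procedure: the grid search stands in for whatever optimizer is actually used, the kernel degree of 3 is an assumption, the embeddings are synthetic, and the function names polynomial_kernel and fit_kernel_params are hypothetical. The target kernel is built from known parameters purely so that the recovery can be checked; in practice it would be the kernel matrix of DINOv2 embeddings for the same sampled images.

```python
import numpy as np


def polynomial_kernel(x, gamma, coef0, degree=3):
    """Kernel matrix K[i, j] = (gamma * <x_i, x_j> + coef0) ** degree.

    degree=3 is an assumption for this sketch, not a value from the repository.
    """
    return (gamma * (x @ x.T) + coef0) ** degree


def fit_kernel_params(clip_feats, target_kernel, gammas, coef0s, degree=3):
    """Grid-search (gamma, coef0) that minimize the squared L2 distance
    between the CLIP polynomial kernel and a fixed target kernel matrix."""
    best = (None, None, np.inf)
    for gamma in gammas:
        for coef0 in coef0s:
            k_clip = polynomial_kernel(clip_feats, gamma, coef0, degree)
            dist = float(np.sum((k_clip - target_kernel) ** 2))
            if dist < best[2]:
                best = (float(gamma), float(coef0), dist)
    return best


# Synthetic stand-ins for CLIP embeddings of a few sampled training images.
rng = np.random.default_rng(0)
clip_feats = rng.normal(size=(16, 8))

# Hypothetical target: a kernel with known parameters, used here in place of
# the DINOv2 kernel matrix so the fit can be verified.
target_kernel = polynomial_kernel(clip_feats, gamma=0.02, coef0=0.5)

gammas = np.linspace(0.01, 0.05, 5)  # candidate gamma values
coef0s = np.linspace(0.0, 1.0, 5)    # candidate coef0 values
gamma, coef0, dist = fit_kernel_params(clip_feats, target_kernel, gammas, coef0s)
print(gamma, coef0, dist)
```

The values found this way would then be passed to the training command through --gamma and --coef0.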

Evaluation

We use CLIP-Benchmark to evaluate the fine-tuned models.

To evaluate a model, first go to the CLIP_benchmark directory:

cd CLIP_benchmark

Edit the file benchmark/models.txt to include the model to evaluate:

ViT-L-14-336,openai
ViT-L-14-336,directory/to/finetuned/models.pt

The first element specifies the architecture of the model, and the second element specifies the saved checkpoint. Use openai as the second element to evaluate the original CLIP model. Then run the corresponding bash script:

./bash/run_benchmark_clean.sh # zero-shot classification
./bash/run_benchmark_lp.sh # linear probing
./bash/run_benchmark_rt.sh # image-text retrieval

Before running, edit the SAVE_DIR field of the corresponding script, which specifies the directory where the evaluation results are saved.
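As a quick sanity check on the benchmark/models.txt entries described earlier in this section, the sketch below splits each line into its architecture and checkpoint parts. The helper parse_models is hypothetical and only illustrates the "architecture,checkpoint" layout; it is not how CLIP-Benchmark itself reads the file.

```python
def parse_models(text):
    """Split each non-empty 'architecture,checkpoint' line into a tuple."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first comma only, so the checkpoint path is kept whole.
        arch, pretrained = line.split(",", 1)
        entries.append((arch.strip(), pretrained.strip()))
    return entries


# Same two entries as in the models.txt example of this section.
example = """ViT-L-14-336,openai
ViT-L-14-336,directory/to/finetuned/models.pt
"""

entries = parse_models(example)
for arch, pretrained in entries:
    # 'openai' selects the original CLIP weights; anything else is a path.
    kind = "original CLIP weights" if pretrained == "openai" else "fine-tuned checkpoint"
    print(f"{arch}: {kind} ({pretrained})")
```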

Fine-tuning of LLaVA

The script to fine-tune LLaVA is adapted from the LLaVA repository. We use the following commands to perform LoRA fine-tuning:

cd LLaVA
./scripts/v1_5/finetune_task_lora.sh

Remember to edit the --vision_tower field of the script so that it points to the directory of the checkpoints produced by the alignment fine-tuning.

Evaluation of LLaVA

We use the evaluation tool provided by the Prismatic library to evaluate LLaVA.

Pre-trained checkpoints

The pretrained checkpoints for the CLIP vision encoder can be downloaded from OneDrive.

ViT-L-14-224

ViT-L-14-336

BibTeX

If you find this work helpful, you can cite our paper as follows:

@article{gong2025kernel,
  title={Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models},
  author={Gong, Shizhan and Jiang, Yankai and Dou, Qi and Farnia, Farzan},
  journal={arXiv preprint arXiv:2506.02557},
  year={2025}
}

Contact

For any questions, please contact szgong22@cse.cuhk.edu.hk