Post-pre-training for Modality Alignment in Vision-Language Foundation Models (CVPR2025)
July 25, 2025 ยท View on GitHub
Requirements
Software Requirements
- CUDA >= 12.3
Python Requirements
- Please see
apptainer/config.def
Preparations
Post-pre-training Dataset: COCO Caption (2017)
- Download the dataset from here
- Install the dataset into
./dataset/coco/
Evaluation Dataset: ImageNet
- Download the dataset from here
- Install the dataset into
./dataset/imagenet/
Example
Run Post-pre-training of CLIP-Refine on COCO Caption
python3 main/train.py --config_path config/01_post-pre-training/clip-refine.yaml
Evaluate Zero-shot Performance on ImageNet
python3 main/test.py --config_path config/01_post-pre-training/clip-refine.yaml
Citation
@inproceedings{Yamaguchi_CVPR25_CLIP-Refine,
title={Post-pre-training for Modality Alignment in Vision-Language Foundation Models},
author={Yamaguchi, Shin'ya and Feng, Dewei and Kanai, Sekitoshi and Adachi, Kazuki and Chijiwa, Daiki},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}