March 19, 2026
👁️ un2CLIP:
Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Yinqi Li¹,², Jiahe Zhao¹,², Hong Chang¹,², Ruibing Hou¹, Shiguang Shan¹,², Xilin Chen¹,²
¹Institute of Computing Technology, Chinese Academy of Sciences
²University of Chinese Academy of Sciences
ArXiv · HuggingFace · OpenReview
unCLIP provides an encoding-decoding tool for observing which features are disregarded by CLIP.
Our un2CLIP further leverages this framework to improve CLIP, aiming to recapture the disregarded features.
Installation
Clone this repository and create a conda environment with the following commands:
git clone git@github.com:LiYinqi/un2CLIP.git
cd un2CLIP
conda env create -f environment.yaml
conda activate un2clip
Pretrained Checkpoints
Our models are released on HuggingFace 🤗.
| CLIP Model | Resolution | MMVP-VLM (Original) | MMVP-VLM (Ours) | Link |
|---|---|---|---|---|
| OpenAI CLIP ViT-L-14 | 224 | 19.3 | 32.6 | openai_vit_l_14_224.ckpt |
| OpenAI CLIP ViT-L-14 | 336 | 20.0 | 30.4 | openai_vit_l_14_336.ckpt |
| OpenCLIP ViT-H-14 | 224 | 28.9 | 36.3 | openclip_vit_h_14_224.ckpt |
| SigLIP ViT-SO-14 | 384 | 37.0 | 41.5 | siglip_vit_so_14_384.ckpt |
We assume the checkpoints are saved in the ./pretrained_models directory with their original names.
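Before running the evaluation, it can help to confirm that the expected files are actually in place. The following is a minimal sketch and not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the file names are copied from the table above.

```python
from pathlib import Path

# File names as listed in the checkpoint table above.
CHECKPOINT_NAMES = [
    "openai_vit_l_14_224.ckpt",
    "openai_vit_l_14_336.ckpt",
    "openclip_vit_h_14_224.ckpt",
    "siglip_vit_so_14_384.ckpt",
]


def missing_checkpoints(ckpt_dir="./pretrained_models"):
    """Return the checkpoint file names that are not found in ckpt_dir.

    Hypothetical helper for illustration; it only checks file presence
    and does not validate the checkpoint contents.
    """
    root = Path(ckpt_dir)
    return [name for name in CHECKPOINT_NAMES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```

An empty result means all four released checkpoints are present under their original names.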
MMVP-VLM Evaluation
- Download the MMVP-VLM benchmark and place it in a local directory.
- Run the evaluation script for each CLIP model by specifying different --un2clip_ckpt_path arguments. For example, to evaluate OpenAI CLIP ViT-L-14 at 224 resolution, run:
python eval_mmvpvlm.py \
--benchmark_dir "YOUR_MMVP_VLM_PATH" \
--un2clip_ckpt_path "./pretrained_models/openai_vit_l_14_224.ckpt"
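To evaluate all released checkpoints in one pass, the command above can be wrapped in a small driver. This is a sketch under stated assumptions rather than a script shipped with the repository: `build_eval_command` and `run_all` are hypothetical names, and only the two flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path

# File names as listed in the checkpoint table.
CHECKPOINT_NAMES = [
    "openai_vit_l_14_224.ckpt",
    "openai_vit_l_14_336.ckpt",
    "openclip_vit_h_14_224.ckpt",
    "siglip_vit_so_14_384.ckpt",
]


def build_eval_command(benchmark_dir, ckpt_path):
    """Build the eval_mmvpvlm.py invocation as an argument list."""
    return [
        sys.executable,
        "eval_mmvpvlm.py",
        "--benchmark_dir", str(benchmark_dir),
        "--un2clip_ckpt_path", str(ckpt_path),
    ]


def run_all(benchmark_dir, ckpt_dir="./pretrained_models"):
    """Run the evaluation once per checkpoint, skipping files that are absent."""
    for name in CHECKPOINT_NAMES:
        ckpt_path = Path(ckpt_dir) / name
        if not ckpt_path.is_file():
            print(f"Skipping {name}: not found in {ckpt_dir}")
            continue
        # check=True stops the loop if one evaluation exits with an error.
        subprocess.run(build_eval_command(benchmark_dir, ckpt_path), check=True)


if __name__ == "__main__":
    # Placeholder path, as in the command above; replace with your own.
    run_all("YOUR_MMVP_VLM_PATH")
```

Passing the command as a list avoids shell quoting issues when the benchmark path contains spaces.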
Training
Preparation
(1) Pretrained unCLIP models
Download the pretrained unCLIP models from the Stable unCLIP Hugging Face page and place them in a local directory, e.g., ./unclip_ckpts.
Note: The official checkpoints above and the corresponding GitHub repos are currently no longer publicly available (see discussion). If you have not downloaded them previously, you may need to obtain them from alternative sources. You can also view the original official link via the Wayback Machine.
Additionally, download the ViT-L-14_stats.th file using the script below.
This file is required for running the "OpenAI CLIP ViT-L-14" experiments because the original Stable unCLIP model uses it.
# Source: https://github.com/Stability-AI/stablediffusion/blob/main/doc/UNCLIP.MD
mkdir -p ./unclip_ckpts/karlo_models && cd ./unclip_ckpts/karlo_models
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/0b62380a75e56f073e2844ab5199153d/ViT-L-14_stats.th
cd ../..
(2) Training dataset
Download the CC3M dataset following the instructions from img2dataset or from pixparse/cc3m-wds (community-maintained).
The desired structure is as follows:
dataset/
└── CC3M/
    ├── train/
    │   └── images/
    │       ├── 000000000.jpg
    │       ├── 000000001.jpg
    │       └── ...
    └── val/
        └── images/
            ├── 00000000.jpg
            ├── 00000002.jpg
            └── ...
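A quick way to check that a download matches this layout is to count the images in each split. The snippet below is an illustrative sketch only: `count_images` is a hypothetical helper, and it assumes the images are stored as .jpg files directly inside each images/ directory, as in the tree above.

```python
from pathlib import Path


def count_images(dataset_root="dataset"):
    """Count .jpg files under CC3M/<split>/images for the train and val splits.

    Raises FileNotFoundError if an expected directory is missing.
    """
    counts = {}
    for split in ("train", "val"):
        image_dir = Path(dataset_root) / "CC3M" / split / "images"
        if not image_dir.is_dir():
            raise FileNotFoundError(f"Expected directory not found: {image_dir}")
        counts[split] = sum(1 for _ in image_dir.glob("*.jpg"))
    return counts


if __name__ == "__main__":
    try:
        print(count_images())
    except FileNotFoundError as err:
        print(err)
```

A zero count for either split usually means the images were extracted to a different subdirectory or use a different file extension.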
Run
Training can be started with the following example command:
python main.py \
--gpus=0,1,2,3,4,5,6,7 \
--base="$CONFIG" \
--allow_tf32 \
--allow_bf16
- Configs are provided in the ./configs/ directory.
- By default, we use TF32 (--allow_tf32) and BF16 (--allow_bf16) to speed up the training process on NVIDIA Ampere+ GPUs. You may disable them if your hardware does not support these features.
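Whether to pass the two precision flags can be decided from the GPU's compute capability. The sketch below is one possible way to do this and is not part of the repository: `precision_flags` and `build_train_command` are hypothetical helpers, and the code assumes that a major compute capability of 8 or higher corresponds to NVIDIA Ampere or newer. PyTorch is queried only when the script is run directly and is treated as optional.

```python
def precision_flags(compute_capability_major):
    """Return the optional speed-up flags for the given major compute capability.

    Assumption: Ampere corresponds to compute capability 8.x, so TF32 and
    BF16 are enabled only when the major version is at least 8.
    """
    if compute_capability_major >= 8:
        return ["--allow_tf32", "--allow_bf16"]
    return []


def build_train_command(config_path, gpus, compute_capability_major):
    """Assemble the main.py training command as an argument list."""
    cmd = ["python", "main.py", f"--gpus={gpus}", f"--base={config_path}"]
    return cmd + precision_flags(compute_capability_major)


if __name__ == "__main__":
    try:
        import torch  # optional; used only to query the first GPU

        if torch.cuda.is_available():
            major = torch.cuda.get_device_capability(0)[0]
        else:
            major = 0
    except ImportError:
        major = 0  # without PyTorch, fall back to omitting both flags
    # "$CONFIG" is the same placeholder used in the command above.
    print(" ".join(build_train_command("$CONFIG", "0,1,2,3,4,5,6,7", major)))
```

On older GPUs this prints the training command without the two flags, which matches the advice to disable them when the hardware lacks support.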
Acknowledgments
This project is built upon the Stable unCLIP project under the Stable Diffusion Version 2 repo.
Parts of the code from the original Stable Diffusion repo are also referenced in building the training pipeline.
We thank the authors for their excellent work and for open-sourcing the code and models.
Citation
If you find this code or project useful, please consider giving a star ⭐ or citing:
@inproceedings{li2025un2clip,
title = {{un$^2$CLIP}: Improving {CLIP}'s Visual Detail Capturing Ability via Inverting {unCLIP}},
author = {Yinqi Li and Jiahe Zhao and Hong Chang and Ruibing Hou and Shiguang Shan and Xilin Chen},
year = {2025},
booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}