
March 19, 2026

🔄️ un²CLIP:
Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Yinqi Li¹,², Jiahe Zhao¹,², Hong Chang¹,², Ruibing Hou¹, Shiguang Shan¹,², Xilin Chen¹,²

¹Institute of Computing Technology, Chinese Academy of Sciences

²University of Chinese Academy of Sciences

ArXiv · HuggingFace · OpenReview

unCLIP provides an encoding-decoding tool for observing which features are disregarded by CLIP.

Our un²CLIP further leverages this framework to improve CLIP, aiming to recapture the disregarded features.

Installation

Clone this repository and create a conda environment with the following commands:

git clone git@github.com:LiYinqi/un2CLIP.git
cd un2CLIP

conda env create -f environment.yaml
conda activate un2clip

Pretrained Checkpoints

Our models are released on HuggingFace🤗.

CLIP Model	Resolution	MMVP-VLM (Original)	MMVP-VLM (Ours)	Link
OpenAI CLIP ViT-L-14	224	19.3	32.6	openai_vit_l_14_224.ckpt
OpenAI CLIP ViT-L-14	336	20.0	30.4	openai_vit_l_14_336.ckpt
OpenCLIP ViT-H-14	224	28.9	36.3	openclip_vit_h_14_224.ckpt
SigLIP ViT-SO-14	384	37.0	41.5	siglip_vit_so_14_384.ckpt

We assume the checkpoints are saved in the ./pretrained_models directory with their original names.
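
Before running anything that loads a checkpoint, it can help to confirm which files are actually in place. The following is a minimal sketch and not part of the repository: the helper name `find_missing_checkpoints` is hypothetical, and the file names are copied from the table above.

```python
from pathlib import Path

# File names as listed in the checkpoint table above.
CHECKPOINT_NAMES = [
    "openai_vit_l_14_224.ckpt",
    "openai_vit_l_14_336.ckpt",
    "openclip_vit_h_14_224.ckpt",
    "siglip_vit_so_14_384.ckpt",
]


def find_missing_checkpoints(ckpt_dir="./pretrained_models"):
    """Return the expected checkpoint file names that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in CHECKPOINT_NAMES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = find_missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All four checkpoints found.")
```

You only need the checkpoints for the models you plan to evaluate, so a non-empty result is not necessarily an error.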

MMVP-VLM Evaluation

Download the MMVP-VLM benchmark and place it in a local directory.
Run the evaluation script for each CLIP model by passing the corresponding checkpoint to --un2clip_ckpt_path. For example, to evaluate OpenAI CLIP ViT-L-14 at 224 resolution, run:

python eval_mmvpvlm.py \
  --benchmark_dir "YOUR_MMVP_VLM_PATH" \
  --un2clip_ckpt_path "./pretrained_models/openai_vit_l_14_224.ckpt"
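
To evaluate all four released checkpoints in one go, the command above can be generated per checkpoint. This is a sketch under the assumption that only --un2clip_ckpt_path changes between models, as the description above suggests; `build_eval_commands` is a hypothetical helper, not something the repository provides.

```python
import shlex

# File names as listed in the Pretrained Checkpoints table.
CHECKPOINT_NAMES = [
    "openai_vit_l_14_224.ckpt",
    "openai_vit_l_14_336.ckpt",
    "openclip_vit_h_14_224.ckpt",
    "siglip_vit_so_14_384.ckpt",
]


def build_eval_commands(benchmark_dir, ckpt_dir="./pretrained_models"):
    """Build one eval_mmvpvlm.py argument list per released checkpoint."""
    commands = []
    for name in CHECKPOINT_NAMES:
        commands.append([
            "python", "eval_mmvpvlm.py",
            "--benchmark_dir", benchmark_dir,
            "--un2clip_ckpt_path", f"{ckpt_dir}/{name}",
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_eval_commands("YOUR_MMVP_VLM_PATH"):
        # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```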

Training

Preparation

(1) Pretrained unCLIP models

Download the pretrained unCLIP models from the Stable unCLIP Hugging Face page and place them in a local directory, e.g., ./unclip_ckpts.

Note: The official checkpoints above and the corresponding GitHub repositories are currently no longer publicly available (see discussion). If you did not download them previously, you may need to obtain them from alternative sources. The original official page can also be viewed via the Wayback Machine.

Additionally, download the ViT-L-14_stats.th file using the commands below. This file is required for the "OpenAI CLIP ViT-L-14" experiments because the original Stable unCLIP model uses it.

# Source: https://github.com/Stability-AI/stablediffusion/blob/main/doc/UNCLIP.MD
mkdir -p ./unclip_ckpts/karlo_models && cd ./unclip_ckpts/karlo_models
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/0b62380a75e56f073e2844ab5199153d/ViT-L-14_stats.th
cd ../..

(2) Training dataset

Download the CC3M dataset following the instructions from img2dataset, or from pixparse/cc3m-wds (community-maintained).

The desired structure is as follows:

dataset/
└── CC3M/
    ├── train/
    │   └── images/
    │       ├── 000000000.jpg
    │       ├── 000000001.jpg
    │       └── ...
    └── val/
        └── images/
            ├── 00000000.jpg
            ├── 00000002.jpg
            └── ...
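
A quick way to check that the images ended up in the layout above is to count the files in each split. This is a minimal sketch: the function name `count_cc3m_images` and the default root `dataset` are assumptions for illustration, and the check only looks at file extensions, not at whether the images decode.

```python
from pathlib import Path


def count_cc3m_images(dataset_root="dataset"):
    """Count .jpg files under CC3M/train/images and CC3M/val/images."""
    counts = {}
    for split in ("train", "val"):
        image_dir = Path(dataset_root) / "CC3M" / split / "images"
        if not image_dir.is_dir():
            raise FileNotFoundError(f"Expected directory not found: {image_dir}")
        # Count only files directly inside images/, matching the flat layout above.
        counts[split] = sum(
            1 for p in image_dir.iterdir()
            if p.is_file() and p.suffix.lower() == ".jpg"
        )
    return counts


if __name__ == "__main__":
    try:
        print(count_cc3m_images())
    except FileNotFoundError as err:
        print(err)
```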

Run

Training can be started with the following example command:

python main.py \
  --gpus=0,1,2,3,4,5,6,7 \
  --base="$CONFIG" \
  --allow_tf32 \
  --allow_bf16

Configs are provided in the ./configs/ directory.
We use TF32 (--allow_tf32) and BF16 (--allow_bf16) by default to speed up training on NVIDIA Ampere or newer GPUs. You can omit these flags if your hardware does not support them.
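
The choice of flags can be made explicit when assembling the command. The sketch below is illustrative only: `precision_flags` and `build_train_command` are hypothetical helpers, the config path is a placeholder rather than a file name from the repository, and the rule "compute capability 8.0 or higher" is used as a stand-in for "Ampere or newer".

```python
def precision_flags(compute_capability_major):
    """Return the optional speed-up flags suitable for a GPU generation.

    Assumption: TF32 and BF16 need compute capability 8.0+ (Ampere or newer).
    """
    if compute_capability_major >= 8:
        return ["--allow_tf32", "--allow_bf16"]
    return []


def build_train_command(config_path, gpu_ids, compute_capability_major):
    """Assemble the main.py invocation shown above as an argument list."""
    gpus = ",".join(str(i) for i in gpu_ids)
    return (
        ["python", "main.py", f"--gpus={gpus}", f"--base={config_path}"]
        + precision_flags(compute_capability_major)
    )


if __name__ == "__main__":
    # "configs/your_config.yaml" is a placeholder; pick a real file from ./configs/.
    # With PyTorch installed, the major version could be read from
    # torch.cuda.get_device_capability()[0].
    print(" ".join(build_train_command("configs/your_config.yaml", range(8), 8)))
```

The `--gpus` value follows the format of the example command; whether a single-GPU value needs different syntax is not covered here.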

Acknowledgments

This project is built upon the Stable unCLIP project under the Stable Diffusion Version 2 repo.

Parts of the code from the original Stable Diffusion repo are also referenced in building the training pipeline.

We thank the authors for their excellent work and for open-sourcing the code and models.

Citation

If you find this code or project useful, please consider giving a star⭐ or citing:

@inproceedings{li2025un2clip,
  title     = {{un$^2$CLIP}: Improving {CLIP}'s Visual Detail Capturing Ability via Inverting {unCLIP}},
  author    = {Yinqi Li and Jiahe Zhao and Hong Chang and Ruibing Hou and Shiguang Shan and Xilin Chen},
  year      = {2025},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}
