
March 19, 2026

🔄️ un2CLIP:
Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Yinqi Li1,2, Jiahe Zhao1,2, Hong Chang1,2, Ruibing Hou1, Shiguang Shan1,2, Xilin Chen1,2

1Institute of Computing Technology, Chinese Academy of Sciences

2University of Chinese Academy of Sciences

ArXiv · HuggingFace · OpenReview

unCLIP provides an encoding-decoding tool for observing which features are disregarded by CLIP.

Our un2CLIP further leverages this framework to improve CLIP, aiming to recapture the disregarded features.

Installation

Clone this repository and create a conda environment with the following commands:

git clone git@github.com:LiYinqi/un2CLIP.git
cd un2CLIP

conda env create -f environment.yaml
conda activate un2clip

Pretrained Checkpoints

Our models are released on HuggingFace 🤗.

| CLIP Model | Resolution | MMVP-VLM (Original) | MMVP-VLM (Ours) | Link |
|---|---|---|---|---|
| OpenAI CLIP ViT-L-14 | 224 | 19.3 | 32.6 | openai_vit_l_14_224.ckpt |
| OpenAI CLIP ViT-L-14 | 336 | 20.0 | 30.4 | openai_vit_l_14_336.ckpt |
| OpenCLIP ViT-H-14 | 224 | 28.9 | 36.3 | openclip_vit_h_14_224.ckpt |
| SigLIP ViT-SO-14 | 384 | 37.0 | 41.5 | siglip_vit_so_14_384.ckpt |

We assume the checkpoints are saved in the ./pretrained_models directory with their original names.
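Before running anything that loads a checkpoint, it can help to confirm that the files are where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the file names are simply copied from the table above.

```python
from pathlib import Path

# File names copied from the checkpoint table above.
EXPECTED_CHECKPOINTS = [
    "openai_vit_l_14_224.ckpt",
    "openai_vit_l_14_336.ckpt",
    "openclip_vit_h_14_224.ckpt",
    "siglip_vit_so_14_384.ckpt",
]


def missing_checkpoints(ckpt_dir="./pretrained_models"):
    """Return the expected checkpoint names that are not regular files in ckpt_dir.

    Hypothetical helper for illustration; it only checks file presence,
    not file integrity.
    """
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All four checkpoints found.")
```

If you only plan to evaluate one model, a non-empty result is expected and harmless; the check is only meant to catch renamed or misplaced files.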

MMVP-VLM Evaluation

  1. Download the MMVP-VLM benchmark and place it in a local directory.

  2. Run the evaluation script once per CLIP model, passing the corresponding checkpoint via the --un2clip_ckpt_path argument. For example, to evaluate OpenAI CLIP ViT-L-14 at 224 resolution, run:

python eval_mmvpvlm.py \
  --benchmark_dir "YOUR_MMVP_VLM_PATH" \
  --un2clip_ckpt_path "./pretrained_models/openai_vit_l_14_224.ckpt"
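To evaluate every downloaded checkpoint in turn, the command above can be generated in a loop. This is a rough sketch under stated assumptions: `build_eval_commands` is a hypothetical helper, and it uses only the two flags shown above (`--benchmark_dir` and `--un2clip_ckpt_path`). It assumes every `.ckpt` file in the directory is an un2CLIP checkpoint.

```python
import shlex
from pathlib import Path


def build_eval_commands(benchmark_dir, ckpt_dir="./pretrained_models"):
    """Build one eval_mmvpvlm.py command per .ckpt file in ckpt_dir, sorted by name.

    Hypothetical helper: it only assembles argument lists and does not run them.
    """
    commands = []
    for ckpt in sorted(Path(ckpt_dir).glob("*.ckpt")):
        commands.append([
            "python", "eval_mmvpvlm.py",
            "--benchmark_dir", str(benchmark_dir),
            "--un2clip_ckpt_path", str(ckpt),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for inspection; replace the placeholder path first.
    for cmd in build_eval_commands("YOUR_MMVP_VLM_PATH"):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the printed commands look right.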

Training

Preparation

(1) Pretrained unCLIP models

Download the pretrained unCLIP models from the Stable unCLIP Hugging Face page and place them in a local directory, e.g., ./unclip_ckpts.

Note: The official checkpoints above and the corresponding GitHub repos are currently no longer publicly available (see discussion). If you did not download them previously, you may need to find alternative sources. The original official link can still be viewed via the Wayback Machine.

Additionally, download the ViT-L-14_stats.th file using the script below. This file is required for the "OpenAI CLIP ViT-L-14" experiments because the original Stable unCLIP model uses it.

# Source: https://github.com/Stability-AI/stablediffusion/blob/main/doc/UNCLIP.MD
mkdir -p ./unclip_ckpts/karlo_models && cd ./unclip_ckpts/karlo_models
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/0b62380a75e56f073e2844ab5199153d/ViT-L-14_stats.th
cd ../..

(2) Training dataset

Download the CC3M dataset following the instructions from image2dataset or from pixparse/cc3m-wds (community-maintained).

The desired structure is as follows:

dataset/
└── CC3M/
    ├── train/
    │   └── images/
    │       ├── 000000000.jpg
    │       ├── 000000001.jpg
    │       └── ...
    └── val/
        └── images/
            ├── 00000000.jpg
            ├── 00000002.jpg
            └── ...
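A quick way to confirm the layout above before launching a long training run is to count the images in each split. The sketch below is illustrative only: `count_cc3m_images` is a hypothetical helper, and it assumes the images are stored as `.jpg` files directly inside each `images/` folder, as in the tree.

```python
from pathlib import Path


def count_cc3m_images(root="dataset/CC3M"):
    """Return {split: number of .jpg files} for train/ and val/.

    A value of None marks a split whose images/ folder does not exist.
    Hypothetical helper for illustration; it does not validate the images.
    """
    counts = {}
    for split in ("train", "val"):
        image_dir = Path(root) / split / "images"
        if image_dir.is_dir():
            counts[split] = sum(1 for _ in image_dir.glob("*.jpg"))
        else:
            counts[split] = None
    return counts


if __name__ == "__main__":
    for split, n in count_cc3m_images().items():
        status = "folder missing" if n is None else f"{n} images"
        print(f"{split}: {status}")
```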

Run

Training can be started with the following example command:

python main.py \
  --gpus=0,1,2,3,4,5,6,7 \
  --base="$CONFIG" \
  --allow_tf32 \
  --allow_bf16

  • Configs are provided in the ./configs/ directory.

  • By default, we use TF32 (--allow_tf32) and BF16 (--allow_bf16) to speed up the training process on NVIDIA Ampere+ GPUs. You may disable them if your hardware does not support these features.
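The choice of whether to pass the two precision flags can be scripted. The following is a sketch under assumptions, not the repository's own logic: both function names are hypothetical, `configs/example.yaml` is a placeholder rather than a real config name, and the hardware check treats CUDA compute capability 8.0 or higher (Ampere and newer) as the threshold, falling back to `False` when PyTorch or a GPU is unavailable.

```python
import shlex


def build_train_command(config, gpu_ids, allow_tf32=True, allow_bf16=True):
    """Assemble the main.py training command shown above as an argument list."""
    cmd = [
        "python", "main.py",
        "--gpus=" + ",".join(str(i) for i in gpu_ids),
        "--base=" + str(config),
    ]
    if allow_tf32:
        cmd.append("--allow_tf32")
    if allow_bf16:
        cmd.append("--allow_bf16")
    return cmd


def gpu_is_ampere_or_newer():
    """Best-effort check: True only if PyTorch sees a CUDA GPU with capability >= 8.0."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(0)
    return major >= 8


if __name__ == "__main__":
    fast = gpu_is_ampere_or_newer()
    # "configs/example.yaml" is a placeholder; pick a real file from ./configs/.
    cmd = build_train_command("configs/example.yaml", range(8),
                              allow_tf32=fast, allow_bf16=fast)
    print(shlex.join(cmd))
```

The check inspects only device 0, so it assumes all GPUs in the job are of the same generation.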

Acknowledgments

This project is built upon the Stable unCLIP project under the Stable Diffusion Version 2 repo.

Parts of the code from the original Stable Diffusion repo are also referenced in building the training pipeline.

We thank the authors for their excellent work and for open-sourcing the code and models.

Citation

If you find this code or project useful, please consider giving a star ⭐ or citing:

@inproceedings{li2025un2clip,
  title     = {{un$^2$CLIP}: Improving {CLIP}'s Visual Detail Capturing Ability via Inverting {unCLIP}},
  author    = {Yinqi Li and Jiahe Zhao and Hong Chang and Ruibing Hou and Shiguang Shan and Xilin Chen},
  year      = {2025},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems}
}