README.md

October 11, 2025 · View on GitHub

[ICCV2025] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

This is official implementation of USP.

arch

Converge much faster just with weight initialization from pretrain.

If you find USP useful in your research or applications, please consider giving a star ⭐ and citing using the following BibTeX:

@article{chu2025usp,
  title={Usp: Unified self-supervised pretraining for image generation and understanding},
  author={Chu, Xiangxiang and Li, Renda and Wang, Yong},
  journal={arXiv preprint arXiv:2503.06132},
  year={2025}
}

Catalog

【4.21】Upload image generation finetuning weights
Pre-training code
(ImageNet SFT and linear probe finetuning code)

Finetuning Weights

Uploaded image generation finetuning weights in Hugging Face

All weights were pretrained for 1600 epochs and then finetuned for 400 K steps.

Using the above weights and following the inference and evaluation procedures outlined in GENERATION.md, we obtained the following evaluation results:

Model Name	Pretrain	Finetuning	FID	IS	sFID
DiT_B-2	1600 epochs	400 K steps	27.22	50.47	7.60
DiT_L-2	1600 epochs	400 K steps	15.05	80.11	6.41
DiT_XL-2	1600 epochs	400 K steps	9.64	112.93	6.30
SiT_B-2	1600 epochs	400 K steps	22.10	61.59	5.88
SiT_XL-2	1600 epochs	400 K steps	7.35	128.50	5.00

Our method is somewhat orthogonal to other DINO based acceleration methods.

Model	Params	Steps	FID (↓)	IS (↑)
SiT-XL/2	130M	400K	16.97	77.50
USP	130M	400K	7.38	127.96
REPA	130M	400K	7.9	122.6
USP + REPA	130M	400K	6.26	139.84
VAVAE	130M	64 Epochs	5.18/2.15†	132.4/245.1†
USP + VAVAE	130M	64 Epochs	4.2/1.81†	144/261.0†

Table: Results Combined with External-Model-Based Methods. †: w/ CFG=10.0.

Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.

README.md

[ICCV2025] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Catalog

Finetuning Weights

Introduction

Pretraining

Downstream Task

Generation

Acknowledgement