
October 11, 2025

[ICCV2025] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

arXiv

This is the official implementation of USP.

[Figure: arch]

With weight initialization from pretraining alone, training converges much faster.

[Figure: converge]
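As a rough illustration of what "weight initialization from pretraining" can look like, the sketch below copies pretrained parameters into a model wherever the name and shape both match, and leaves everything else at its original initialization. This is not the repository's loading code: NumPy arrays stand in for tensors, and all parameter names are hypothetical. In PyTorch, a similar effect is commonly achieved with `model.load_state_dict(state, strict=False)`.

```python
import numpy as np

def init_from_pretrained(model_state, pretrained_state):
    """Copy pretrained arrays into model_state where name and shape both match.

    Returns (loaded, skipped) lists of parameter names for inspection.
    """
    loaded, skipped = [], []
    for name, param in model_state.items():
        src = pretrained_state.get(name)
        if src is not None and src.shape == param.shape:
            model_state[name] = src.copy()
            loaded.append(name)
        else:
            skipped.append(name)  # keeps its original initialization
    return loaded, skipped

# Hypothetical parameter names and shapes, for illustration only.
pretrained = {
    "blocks.0.attn.qkv.weight": np.ones((12, 4)),
    "patch_embed.proj.weight": np.ones((4, 16)),
    "decoder.pred.weight": np.ones((16, 4)),    # pretraining-only head, unused
}
model = {
    "blocks.0.attn.qkv.weight": np.zeros((12, 4)),
    "patch_embed.proj.weight": np.zeros((4, 16)),
    "t_embedder.mlp.weight": np.zeros((4, 8)),  # diffusion-only module
}

loaded, skipped = init_from_pretrained(model, pretrained)
print("loaded:", loaded)
print("skipped:", skipped)
```

Checking the `skipped` list is a cheap way to confirm that only modules absent from pretraining were left untouched.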

If you find USP useful in your research or applications, please consider giving a star ⭐ and citing using the following BibTeX:

@article{chu2025usp,
  title={Usp: Unified self-supervised pretraining for image generation and understanding},
  author={Chu, Xiangxiang and Li, Renda and Wang, Yong},
  journal={arXiv preprint arXiv:2503.06132},
  year={2025}
}

Catalog

  • 【4.21】Upload image generation finetuning weights
  • Pre-training code
  • (ImageNet SFT and linear probe finetuning code)

Finetuning Weights

The image generation finetuning weights are available on Hugging Face.

All weights were pretrained for 1600 epochs and then finetuned for 400 K steps.

Using the above weights and following the inference and evaluation procedures outlined in GENERATION.md, we obtained the following evaluation results:

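| Model Name | Pretrain | Finetuning | FID | IS | sFID |
| --- | --- | --- | --- | --- | --- |
| DiT_B-2 | 1600 epochs | 400 K steps | 27.22 | 50.47 | 7.60 |
| DiT_L-2 | 1600 epochs | 400 K steps | 15.05 | 80.11 | 6.41 |
| DiT_XL-2 | 1600 epochs | 400 K steps | 9.64 | 112.93 | 6.30 |
| SiT_B-2 | 1600 epochs | 400 K steps | 22.10 | 61.59 | 5.88 |
| SiT_XL-2 | 1600 epochs | 400 K steps | 7.35 | 128.50 | 5.00 |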

Our method is somewhat orthogonal to other DINO-based acceleration methods, so it can be combined with them:

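| Model | Params | Steps | FID (↓) | IS (↑) |
| --- | --- | --- | --- | --- |
| SiT-XL/2 | 130M | 400K | 16.97 | 77.50 |
| USP | 130M | 400K | 7.38 | 127.96 |
| REPA | 130M | 400K | 7.9 | 122.6 |
| USP + REPA | 130M | 400K | 6.26 | 139.84 |
| VAVAE | 130M | 64 Epochs | 5.18/2.15† | 132.4/245.1† |
| USP + VAVAE | 130M | 64 Epochs | 4.2/1.81† | 144/261.0† |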

Table: Results Combined with External-Model-Based Methods. †: w/ CFG=10.0.
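To make the gains in the table easier to compare, the short snippet below computes the relative FID reduction for each pairing. It only restates the numbers reported above (the values without CFG for the VAVAE rows). The helper name is ours, not part of the repository.

```python
def relative_fid_reduction(baseline_fid, new_fid):
    """Percentage by which new_fid is lower than baseline_fid."""
    return 100.0 * (baseline_fid - new_fid) / baseline_fid

# (baseline FID, FID with USP), taken from the table above.
pairs = {
    "USP vs SiT-XL/2": (16.97, 7.38),
    "USP + REPA vs REPA": (7.9, 6.26),
    "USP + VAVAE vs VAVAE (no CFG)": (5.18, 4.2),
}

reductions = {name: relative_fid_reduction(base, new)
              for name, (base, new) in pairs.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower FID")
```

Under these numbers, adding USP lowers FID by roughly 57% over the plain baseline and by roughly 19 to 21% on top of REPA or VAVAE.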

Introduction

Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
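The core idea of masked latent modeling can be sketched in a few lines: split a VAE latent into patches, hide a random subset, and score the reconstruction only on the hidden patches. The code below is a minimal NumPy sketch under assumed settings, not the training code of this repository. The latent shape (4, 32, 32), patch size 2, and mask ratio 0.75 are illustrative assumptions, and a zero array stands in for the network's prediction.

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a (C, H, W) latent into (N, patch_size * patch_size * C) patches."""
    c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = latent.reshape(c, gh, patch_size, gw, patch_size)
    x = x.transpose(1, 3, 2, 4, 0)  # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array in which True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))  # stand-in for a VAE latent (assumed shape)
patches = patchify(latent, patch_size=2)   # shape (256, 16)
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # ratio is an assumption

visible = patches[~mask]       # what the encoder would see
pred = np.zeros_like(patches)  # placeholder for the model's reconstruction
loss = masked_mse(pred, patches, mask)
print(patches.shape, visible.shape, int(mask.sum()))
```

In a real setup the encoder would process only `visible`, a lightweight decoder would predict all patches, and the pretrained encoder weights would then initialize the diffusion backbone.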

Pretraining

Please refer to PRETRAIN.md

Downstream Task

Generation

Please refer to GENERATION.md

Acknowledgement

Our code is based on MAE, DiT, SiT, and VisionLLaMA. We thank the authors for their great work.