DiffiT: Diffusion Vision Transformers for Image Generation
March 9, 2026
Official PyTorch implementation of DiffiT: Diffusion Vision Transformers for Image Generation.
For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing
DiffiT (Diffusion Vision Transformers) is a generative model that combines the expressive power of diffusion models with Vision Transformers (ViTs), introducing Time-dependent Multihead Self Attention (TMSA) for fine-grained control over the denoising at each timestep. DiffiT achieves SOTA performance on class-conditional ImageNet generation at multiple resolutions, notably an FID score of 1.73 on ImageNet-256.
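The core idea of TMSA is that queries, keys, and values are each formed from two linear projections, one applied to the spatial tokens and one applied to a time-embedding token, so the attention pattern itself can adapt to the denoising timestep. A minimal sketch of that mechanism is below; the module and layer names are illustrative assumptions, and details of the official implementation (e.g. relative position bias) are omitted:

```python
import torch
import torch.nn as nn

class TMSASketch(nn.Module):
    """Minimal sketch of Time-dependent Multihead Self-Attention (TMSA).

    NOTE: This is an illustrative reconstruction, not the repo's official
    implementation. Relative position bias and other details are omitted.
    """
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        # Separate projections for spatial tokens and the time token.
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_time = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) spatial tokens; t_emb: (B, C) timestep embedding.
        B, N, C = x.shape
        # q, k, v = spatial projection + time projection (broadcast over tokens),
        # making the attention weights a function of the diffusion timestep.
        qkv = self.qkv_spatial(x) + self.qkv_time(t_emb).unsqueeze(1)
        qkv = qkv.reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = attn.softmax(dim=-1) @ v            # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The key difference from standard ViT self-attention is the added `qkv_time` term: without it, the block reduces to ordinary multihead self-attention.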


🔥 News 🔥
- [03.08.2026] 🔥🔥 DiffiT code and pretrained models are released!
- [07.01.2024] 🔥🔥 DiffiT has been accepted to ECCV 2024!
- [04.02.2024] Updated manuscript now available on arXiv!
- [12.04.2023] 🔥 Paper is published on arXiv!
Models
ImageNet-256
| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|---|---|---|---|---|---|
| DiffiT | ImageNet | 256x256 | 1.73 | 276.49 | model |
ImageNet-512
| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|---|---|---|---|---|---|
| DiffiT | ImageNet | 512x512 | 2.67 | 252.12 | model |
Getting Started: Sampling & Evaluation
This repository provides the code for the DiffiT model, pretrained model checkpoints, and everything needed to sample images and compute FID scores to reproduce the results reported in our paper.
Sampling Images
Image sampling is performed using sample.py. To reproduce the reported numbers, use the commands below.
ImageNet-256:
python sample.py \
--log_dir $LOG_DIR \
--cfg_scale 4.4 \
--model_path $MODEL \
--image_size 256 \
--model Diffit \
--num_sampling_steps 250 \
--num_samples 50000 \
--cfg_cond True
ImageNet-512:
python sample.py \
--log_dir $LOG_DIR \
--cfg_scale 1.49 \
--model_path $MODEL \
--image_size 512 \
--model Diffit \
--num_sampling_steps 250 \
--num_samples 50000 \
--cfg_cond True
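The `--cfg_scale` flag controls classifier-free guidance, which blends conditional and unconditional noise predictions at each sampling step (a larger scale at 256x256, 4.4, than at 512x512, 1.49). A hedged sketch of how such a scale is typically applied follows; the function name and signature are illustrative, not the repo's actual sampler API:

```python
def cfg_denoise(model, x, t, y, cfg_scale):
    """Illustrative classifier-free guidance step (not the repo's API).

    Blends the conditional and unconditional noise predictions:
        eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    so cfg_scale = 1.0 recovers the purely conditional prediction and
    larger values push samples toward the class condition y.
    """
    eps_cond = model(x, t, y)        # prediction with class label y
    eps_uncond = model(x, t, None)   # prediction with the null label
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```

With `cfg_scale=1.0` the unconditional term cancels, which is why scales above 1 (as in the commands above) are what actually strengthen the conditioning.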
We also provide ready-to-use Slurm scripts for convenience:
- `slurm_sample_256.sh` samples 50K images at 256x256 resolution
- `slurm_sample_512.sh` samples 50K images at 512x512 resolution
Computing FID
Once images have been sampled, you can compute the FID and other metrics using the provided `eval_run.sh` script. Our evaluation pipeline exactly follows the protocol from openai/guided-diffusion/evaluations.
bash eval_run.sh
Expected Results
Running the above sampling and evaluation commands should yield the following metrics:
ImageNet-256:
| Inception Score | FID | sFID | Precision | Recall |
|---|---|---|---|---|
| 276.49 | 1.73 | 4.54 | 0.8024 | 0.6205 |
ImageNet-512:
| Inception Score | FID | sFID | Precision | Recall |
|---|---|---|---|---|
| 252.13 | 2.67 | 4.99 | 0.8277 | 0.5500 |
Note: Small variations in the reported numbers are expected depending on the device used for sampling and due to numerical precision differences.
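FID itself is the Fréchet distance between two Gaussians fitted to Inception features of real and sampled images. As a reference for the metric reported above, here is a short sketch of that final formula (the feature-extraction stage, handled by the guided-diffusion evaluation suite, is omitted):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians, as used for FID:

        FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})

    mu*, sigma* are the mean and covariance of Inception features for
    the real and generated image sets (computation of those statistics
    is not shown here).
    """
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

This also explains the note above: the matrix square root and the feature statistics are both sensitive to numerical precision, so small run-to-run and device-to-device variations in FID are normal.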
Citation
@inproceedings{hatamizadeh2025diffit,
title={Diffit: Diffusion vision transformers for image generation},
author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
booktitle={European Conference on Computer Vision},
pages={37--55},
year={2025},
organization={Springer}
}
Star History
Licenses
Copyright © 2026, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.
The pre-trained models are shared under CC-BY-NC-SA-4.0. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
Acknowledgement
We gratefully acknowledge the authors of Guided-Diffusion, DiT and MDT for making their excellent codebases publicly available.