June 4, 2026
(CVPR 2026) STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution
[Project Page] [Paper] [Supp]
Junyang Chen, Jiangxin Dong, Long Sun, Yixin Yang, Jinshan Pan
IMAG Lab, Nanjing University of Science and Technology
If you find STCDiT helpful, please star the GitHub repo. Thanks!
You are welcome to visit our website, an information service platform dedicated to low-level vision: https://lowlevelcv.com/
😊 You may also want to check out our related work:
- FaithDiff (CVPR 2025) Paper | Code
  Unleashing diffusion priors with feature alignment and joint VAE–LDM optimization for faithful SR.
- CODSR (CVPR 2026) Paper | Code
  A one-step diffusion SR framework enabling region-discriminative activation of generative priors and precise semantic grounding.
🚩 New Features/Updates
- ✅ June 4, 2026. Released training code and added support for DDP inference.
- ✅ April 16, 2026. STCDiT achieved 3rd place in the human subjective evaluation track of NTIRE 2026 UGC VSR without any additional training.
- ✅ April 15, 2026. Released the enhanced results of STCDiT on VideoLQ and SportsLQ.
- ✅ April 15, 2026. Released SportsLQ, which includes 20 sports event videos at 720p resolution.
- ✅ April 15, 2026. Released the testing code and pre-trained models.
- ✅ November 24, 2025. Created the repository.
⚡ To do
- Release the Gradio Demo and ComfyUI Integration.
- ✅ Release the training code. Note that STCDiT-Tiny can be trained on 4×24 GB GPUs with the same training settings as in the paper.
- ✅ Release the testing code and pre-trained model. Note that STCDiT-Tiny can run inference on a single 24 GB GPU.
📷 Real-World Enhancement Results
🚀 How to evaluate
Environment
conda create -n STCDiT python=3.10.19 -y
conda activate STCDiT
pip install -r ./requirements_for_STCDiT.txt
conda create -n Qwen python=3.10.19 -y
conda activate Qwen
pip install -r ./requirements_for_Qwen.txt
Note: If the FlashAttention installation fails, download a prebuilt .whl file that matches your Python, PyTorch, and CUDA versions and install it via pip.
Download Dependent Models
- STCDiT and STCDiT-Tiny
- Wan2.1-i2v-14B
- Wan2.1-t2v-1.3B
- Qwen2.5-VL-7B-Instruct
- Put them in the ./model_checkpoints folder. For download instructions, refer to download.sh.
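Before running inference, it can help to confirm that every model actually landed in `./model_checkpoints`. The following is a minimal sketch of such a check, not part of the repository. The subfolder names in `EXPECTED_MODELS` are hypothetical placeholders; replace them with the names that download.sh produces on your machine.

```python
from pathlib import Path

# Hypothetical subfolder names (assumption): adjust to match what download.sh creates.
EXPECTED_MODELS = [
    "STCDiT",
    "Wan2.1-I2V-14B",
    "Wan2.1-T2V-1.3B",
    "Qwen2.5-VL-7B-Instruct",
]


def missing_models(root, expected=EXPECTED_MODELS):
    """Return the expected entries that are absent or empty under root."""
    root = Path(root)
    missing = []
    for name in expected:
        path = root / name
        # A model counts as present only if its folder exists and is non-empty.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_models("./model_checkpoints")
    if absent:
        print("Missing or empty:", ", ".join(absent))
    else:
        print("All expected model folders found.")
```

Running the script from the repository root prints the folders that still need to be downloaded, if any.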
Val Dataset
- SportsLQ: Modelscope
- Enhanced results of STCDiT on VideoLQ and SportsLQ: Modelscope.
- For download instructions, refer to download.sh.
Inference Script
Note: Please modify line 3 in ./Inference/test_STCDiT_large.py and ./Inference/test_STCDiT_tiny.py to point to your local directory path.
# Step 1: Generate Captions with Qwen2.5-VL
conda activate Qwen
bash ./Qwen2.5-VL/inference.sh
# Step 2: Run Video Super-Resolution with STCDiT
conda activate STCDiT
# STCDiT-Large with the Wan2.1-I2V-14B base model. If you observe frequent texture flickering, set `cfg_scale=1`.
bash ./Inference/test_STCDiT_large.sh
# STCDiT-Tiny with Wan2.1-T2V-1.3B base model (a single 24 GB GPU is sufficient)
bash ./Inference/test_STCDiT_tiny.sh
# For multi-GPU inference, please use the corresponding DDP inference script
bash ./Inference_ddp/test_STCDiT_large.sh # or bash ./Inference_ddp/test_STCDiT_tiny.sh
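For intuition about multi-GPU inference, the sketch below shows one common way to divide a list of input videos among processes: each process takes a round-robin slice based on its rank. This is an illustration only and is not taken from the repository; the DDP scripts may split the work differently. The function name `shard_for_rank` and the file names are hypothetical. `RANK` and `WORLD_SIZE` are the environment variables that `torchrun` exports to each process.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Return the round-robin slice of items assigned to one process."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort first so every process sees the same order before slicing.
    return sorted(items)[rank::world_size]


if __name__ == "__main__":
    # torchrun exports RANK and WORLD_SIZE; fall back to a single process otherwise.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    # Hypothetical input file names, for illustration only.
    videos = ["clip_03.mp4", "clip_01.mp4", "clip_02.mp4", "clip_04.mp4"]
    print(rank, shard_for_rank(videos, rank, world_size))
```

Because the slices are disjoint and together cover the whole list, each video is processed exactly once regardless of the number of GPUs.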
⚡ How to train
Training Script
# Stage 1: Preprocess videos and text into `.pth` files to reduce GPU memory consumption during training.
bash ./dataset/data_preparation.sh
# Stage 2: Training
bash ./train/train_process_i2v_large.sh # or bash ./train/train_process_t2v_tiny.sh
# After Stage 2 training, enter the checkpoint folder and merge the sharded checkpoint into a single fp32 weight file.
cd ./outputs/checkpoint
python zero_to_fp32.py ./ ./STCDiT.bin --exclude_frozen_parameters
Tips for video data preparation
- We recommend using DOVER and MUSIQ quality scores together with optical-flow-based motion magnitude to select high-quality videos for training.
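The filtering step can be sketched as a simple threshold test over precomputed per-video scores. The code below is a hedged illustration, not the authors' pipeline: it assumes the DOVER, MUSIQ, and optical-flow scores have already been computed by their respective tools, and every threshold is a placeholder whose scale depends on the implementation you use. The upper bound on motion is also an assumption, added to drop extremely fast or shaky clips.

```python
from dataclasses import dataclass


@dataclass
class VideoScores:
    path: str
    dover: float   # DOVER quality score (higher is better)
    musiq: float   # MUSIQ quality score (higher is better)
    motion: float  # mean optical-flow magnitude, e.g. in pixels per frame


def keep_video(v, min_dover=0.6, min_musiq=50.0, min_motion=1.0, max_motion=30.0):
    """Return True if a clip passes all quality and motion thresholds.

    All default thresholds are placeholders (assumptions), not values from the paper.
    """
    good_quality = v.dover >= min_dover and v.musiq >= min_musiq
    # Reject near-static clips and (by assumption) clips with extreme motion.
    good_motion = min_motion <= v.motion <= max_motion
    return good_quality and good_motion


def filter_videos(videos, **thresholds):
    """Keep only the clips that pass keep_video with the given thresholds."""
    return [v for v in videos if keep_video(v, **thresholds)]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    clips = [
        VideoScores("a.mp4", dover=0.8, musiq=60.0, motion=5.0),
        VideoScores("b.mp4", dover=0.3, musiq=70.0, motion=5.0),
        VideoScores("c.mp4", dover=0.9, musiq=65.0, motion=0.1),
    ]
    print([v.path for v in filter_videos(clips)])
```

In practice, it is worth inspecting the score distributions of your own dataset before fixing the thresholds, since each metric is reported on a different scale.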
BibTeX
@inproceedings{chen_STCDiT,
title={STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution},
author={Chen, Junyang and Dong, Jiangxin and Sun, Long and Yang, Yixin and Pan, Jinshan},
booktitle={CVPR},
year={2026}
}
Contact
If you have any questions, please feel free to reach out to me at jychen9811@gmail.com.
Acknowledgments
Our project is based on DiffSynth-Studio and Wan 2.1. Thanks to the authors for their awesome work.