June 4, 2026

(CVPR 2026) STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution

[Project Page]   [Paper]   [Supp]

Junyang Chen, Jiangxin Dong, Long Sun, Yixin Yang, Jinshan Pan
IMAG Lab, Nanjing University of Science and Technology

If STCDiT is helpful to you, please star the GitHub repo. Thanks!

Visit our website, an information service platform focused on low-level vision: https://lowlevelcv.com/


😊 You may also want to check out our related work:

  1. FaithDiff (CVPR 2025) Paper | Code
    Unleashing diffusion priors with feature alignment and joint VAE–LDM optimization for faithful SR.

  2. CODSR (CVPR 2026) Paper | Code
    A one-step diffusion SR framework enabling region-discriminative activation of generative priors and precise semantic grounding.

🚩 New Features/Updates

  • ✅ June 4, 2026. Released training code and added support for DDP inference.
  • ✅ April 16, 2026. STCDiT achieved 3rd place in the human subjective evaluation track of NTIRE 2026 UGC VSR without any additional training.
  • ✅ April 15, 2026. Released enhanced results of STCDiT on VideoLQ and SportsLQ.
  • ✅ April 15, 2026. Released SportsLQ, which includes 20 sports event videos at 720p resolution.
  • ✅ April 15, 2026. Released testing code and pre-trained model.
  • ✅ November 24, 2025. Created the repository.

To do

  • Release the Gradio Demo and ComfyUI Integration.
  • ✅ Release the training code. Note that STCDiT-tiny can be trained on 4×24 GB GPUs with the same training settings as in the paper.
  • ✅ Release the testing code and pre-trained model. Note that STCDiT-tiny can run inference on a single 24 GB GPU.

📷 Real-World Enhancement Results


🚀 How to evaluate

Environment

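# Environment for STCDiT
conda create -n STCDiT python=3.10.19 -y
conda activate STCDiT
pip install -r ./requirements_for_STCDiT.txt

# Separate environment for Qwen2.5-VL captioning
conda create -n Qwen python=3.10.19 -y
conda activate Qwen
pip install -r ./requirements_for_Qwen.txt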

Note: If the FlashAttention installation fails, download a prebuilt .whl file that matches your Python, PyTorch, and CUDA versions and install it with pip.
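To pick a matching wheel, it helps to know the exact versions in the active environment. The following is a minimal sketch and not part of the repository: `wheel_tags` is a hypothetical helper name, and the PyTorch fields are only filled in when PyTorch is already installed. Prebuilt wheel filenames typically encode these values, but check the naming used by the release you download.

```python
import platform
import sys


def wheel_tags():
    """Collect the version details that prebuilt wheel filenames usually encode."""
    tags = {
        # CPython tag, for example "cp310" for Python 3.10
        "python": f"cp{sys.version_info.major}{sys.version_info.minor}",
        # Platform, for example "linux_x86_64"
        "platform": f"{platform.system().lower()}_{platform.machine().lower()}",
        # Filled in below only if PyTorch can be imported
        "torch": None,
        "cuda": None,
        "cxx11_abi": None,
    }
    try:
        import torch
    except ImportError:
        return tags
    tags["torch"] = torch.__version__
    tags["cuda"] = torch.version.cuda  # None for CPU-only builds
    tags["cxx11_abi"] = torch.compiled_with_cxx11_abi()
    return tags


if __name__ == "__main__":
    for name, value in wheel_tags().items():
        print(f"{name}: {value}")
```

Run it inside the environment where the installation failed and compare the printed values with the tags in the wheel filename.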

Download Dependent Models

Val Dataset

  • SportsLQ: ModelScope
  • Enhanced results of STCDiT on VideoLQ and SportsLQ: ModelScope
  • For download instructions, refer to download.sh.

Inference Script

Note: Edit line 3 of ./Inference/test_STCDiT_large.py and ./Inference/test_STCDiT_tiny.py so that it points to your local directory path.

# Step 1: Generate Captions with Qwen2.5-VL
conda activate Qwen
bash ./Qwen2.5-VL/inference.sh

# Step 2: Run Video Super-Resolution with STCDiT
conda activate STCDiT

# STCDiT-Large with the Wan2.1-I2V-14B base model. If you observe frequent texture flickering, set `cfg_scale=1`.
bash ./Inference/test_STCDiT_large.sh

# STCDiT-Tiny with Wan2.1-T2V-1.3B base model (a single 24 GB GPU is sufficient)
bash ./Inference/test_STCDiT_tiny.sh

# For multi-GPU inference, please use the corresponding DDP inference script
bash ./Inference_ddp/test_STCDiT_large.sh  # or bash ./Inference_ddp/test_STCDiT_tiny.sh
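The DDP scripts are the reference for multi-GPU inference, and their internals are not described in this README. As a rough mental model only, data-parallel inference usually gives each process its own subset of the input videos. The sketch below assumes a round-robin split and uses hypothetical names (`shard_for_rank` and the example file names). It reads the `RANK` and `WORLD_SIZE` environment variables that `torchrun` sets for each process.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Return the items handled by one process under a round-robin split."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort first so every process sees the same order before slicing.
    return sorted(items)[rank::world_size]


if __name__ == "__main__":
    # torchrun exports RANK and WORLD_SIZE; fall back to a single process otherwise.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # Hypothetical input list, for illustration only.
    videos = ["clip_03.mp4", "clip_01.mp4", "clip_02.mp4", "clip_04.mp4", "clip_05.mp4"]
    print(f"rank {rank} of {world_size}:", shard_for_rank(videos, rank, world_size))
```

With two processes, rank 0 would receive clips 01, 03, and 05 and rank 1 would receive clips 02 and 04, so no video is processed twice.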

⚡ How to train

Training Script

# Stage 1: Preprocess videos and text into `.pth` files to reduce GPU memory consumption during training.
bash ./dataset/data_preparation.sh

# Stage 2: Training
bash ./train/train_process_i2v_large.sh # or bash ./train/train_process_t2v_tiny.sh

# After Stage 2 training, enter the checkpoint folder and convert the sharded checkpoint into a single fp32 weights file.
cd ./outputs/checkpoint
python zero_to_fp32.py ./ ./STCDiT.bin --exclude_frozen_parameters
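The conversion script appears to be DeepSpeed's `zero_to_fp32.py`, which by default locates the checkpoint through a `latest` file in the checkpoint folder. Assuming that layout, the sketch below checks that the file and the sub-folder it names both exist before a long conversion is started. `resolve_checkpoint_tag` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def resolve_checkpoint_tag(checkpoint_dir):
    """Return the sub-folder named in the `latest` file, or None if unusable."""
    root = Path(checkpoint_dir)
    latest = root / "latest"
    if not latest.is_file():
        return None
    # The file holds a single tag name such as "global_step1000".
    tag = latest.read_text().strip()
    if not tag:
        return None
    tag_dir = root / tag
    return tag_dir if tag_dir.is_dir() else None


if __name__ == "__main__":
    tag_dir = resolve_checkpoint_tag("./outputs/checkpoint")
    print(tag_dir if tag_dir else "no usable checkpoint found")
```

If the helper returns `None`, the `latest` file is missing or points to a folder that does not exist, and the conversion is likely to fail for the same reason.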

Tips for video data preparation

  • We recommend filtering candidate videos with DOVER, MUSIQ, and optical-flow-based motion magnitude so that only high-quality videos are used for training.
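The filtering tip above can be sketched as a simple threshold test over precomputed scores. This is an illustration under stated assumptions, not the authors' pipeline: the scores are assumed to have been computed beforehand by the respective tools, every threshold is a placeholder, the upper motion bound is an added assumption, and `VideoScores` and `filter_videos` are hypothetical names. Score scales depend on the tool and version you use.

```python
from dataclasses import dataclass


@dataclass
class VideoScores:
    path: str
    dover: float   # DOVER video quality score (assumed: higher is better)
    musiq: float   # mean per-frame MUSIQ score (assumed: higher is better)
    motion: float  # mean optical-flow magnitude, in pixels per frame


def filter_videos(videos, min_dover, min_musiq, min_motion, max_motion):
    """Keep clips that pass both quality thresholds and fall in the motion range."""
    kept = []
    for v in videos:
        good_quality = v.dover >= min_dover and v.musiq >= min_musiq
        good_motion = min_motion <= v.motion <= max_motion
        if good_quality and good_motion:
            kept.append(v)
    return kept


# Placeholder scores and thresholds, for illustration only.
candidates = [
    VideoScores("clip_a.mp4", dover=0.82, musiq=68.0, motion=3.5),
    VideoScores("clip_b.mp4", dover=0.40, musiq=70.0, motion=4.0),  # low DOVER
    VideoScores("clip_c.mp4", dover=0.85, musiq=66.0, motion=0.1),  # nearly static
]
selected = filter_videos(candidates, min_dover=0.7, min_musiq=60.0,
                         min_motion=1.0, max_motion=20.0)
print([v.path for v in selected])  # → ['clip_a.mp4']
```

Requiring all criteria at once is the strictest combination. A weighted score would be an alternative if too few clips survive.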

BibTeX

@inproceedings{chen_STCDiT,
title={STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution},
author={Chen, Junyang and Dong, Jiangxin and Sun, Long and Yang, Yixin and Pan, Jinshan},
booktitle={CVPR},
year={2026}
}

Contact

If you have any questions, please feel free to reach out to me at jychen9811@gmail.com.


Acknowledgments

Our project is based on DiffSynth-Studio and Wan 2.1. Thanks to their authors for the awesome work.