
March 11, 2026

OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution

¹Software Engineering Institute, East China Normal University | ²vivo Mobile Communication Co. Ltd, Hangzhou, China | ^*Work done during internship at vivo | ^†Corresponding author

:boom: Highlights

Compared with the paper, this repo has been further optimized by:

Replacing ~~LPIPS Loss (natively supports 224 resolution)~~ with the proposed DINOv3-ConvNeXt DISTS Loss (natively supports 1k or higher resolution) for structural perception.
Developing a DINOv3-ConvNeXt Multi-level Discriminator Head (natively supports 1k or higher resolution) for GAN training.

:boom: News

If you find OMGSR helpful, please consider giving this repo a :star:.

2026.3.11: :hugs: We will add support for Z-Image (6B) and Longcat-Image (6B).
2025.10.14: :hugs: The latest version is released.
2025.8.16: The training code is released.
2025.8.15: The inference code and weights are released.
2025.8.12: The arXiv paper is released.
2025.8.6: This repo is released.

:eyes: Visualization

Please click the images for detailed visualization.

1. RealLQ250x4 (256->1k Resolution) Complete Results

2. RealSRx8 (128->1k Resolution) Complete Results

3. DrealSRx8 (128->1k Resolution) Complete Results

OMGSR-S-512 Results

1. RealLQ250x4 (256->1k Resolution) Complete Results

2. RealLQ200x4 (256->1k Resolution) Complete Results

3. RealSRx4 (128->512 Resolution) Complete Results

4. DrealSRx4 (128->512 Resolution) Complete Results

Average Optimal Mid-timestep via Signal-to-Noise Ratio (SNR)

1. Pre-trained Noisy Latent Representation

$\text{DDPM}: \mathbf{z}_t = \sqrt{\bar{\alpha}_t} \mathbf{z}_H + \sqrt{1-\bar{\alpha}_t} \epsilon. \quad \text{FM}: \mathbf{z}_t = (1 - \sigma_t) \mathbf{z}_H + \sigma_t \epsilon.$

2. SNR of Pre-trained Noisy Latent Representation

$\text{DDPM}: \texttt{SNR}(\mathbf{z}_t)=\frac{\bar{\alpha}_t \cdot \mathbb{E}[\mathbf{z}_{H}^2]}{(1 - \bar{\alpha}_t) \cdot\mathbb{E}[\epsilon^2]}=\frac{\bar{\alpha}_t \cdot \mathbb{E}[\mathbf{z}_H^2]}{1 - \bar{\alpha}_t}. \quad \text{FM}: \texttt{SNR}(\mathbf{z}_t)=\frac{(1 - \sigma_t)^2 \cdot \mathbb{E}[\mathbf{z}_{H}^2]}{\sigma_t^2 \cdot \mathbb{E}[\epsilon^2]}=\frac{(1 - \sigma_t)^2 \cdot \mathbb{E}[\mathbf{z}_H^2]}{\sigma_t^2}.$

3. SNR of Low-Quality (LQ) Image Latent Representation

$\texttt{SNR}(\mathbf{z}_L) = \frac{\mathbb{E}[\mathbf{z}_H^2]}{\mathbb{E}[(\mathbf{z}_L - \mathbf{z}_H)^2]}$
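
As an illustrative aside (a derivation sketched here, not quoted from the paper), equating the pre-trained SNR with the LQ SNR for a single pair shows which noise level the LQ latent corresponds to. The shorthand $m = \mathbb{E}[(\mathbf{z}_L - \mathbf{z}_H)^2]$ is introduced only for this note, and $\mathbb{E}[\mathbf{z}_H^2] > 0$ and $m > 0$ are assumed so that the signal power cancels:

$$
\text{DDPM}: \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t} = \frac{1}{m} \;\Rightarrow\; \bar{\alpha}_t = \frac{1}{1+m}. \quad \text{FM}: \frac{(1-\sigma_t)^2}{\sigma_t^2} = \frac{1}{m} \;\Rightarrow\; \sigma_t = \frac{\sqrt{m}}{1+\sqrt{m}}.
$$

For example, $m = 0.25$ gives $\bar{\alpha}_t = 0.8$ for DDPM and $\sigma_t = 1/3$ for FM. This is a per-sample match only; the dataset-level timestep is obtained by averaging the SNR gap over all pairs.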

4. Compute Average Optimal Mid-timestep

$t^\ast = \arg \min_t \frac{1}{N}\sum_{i=1}^N \left|\text{SNR}(\mathbf{z}_t^{(i)}) - \text{SNR}(\mathbf{z}_L^{(i)})\right|, \quad \text{Dataset:} \\{(\mathbf{z}_L^{(i)}, \mathbf{z}_H^{(i)})\\}_N$
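
The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `mid_timestep/`: latents are flattened lists of floats, all function names are hypothetical, and the linear schedule `sigma_t = t / T` used in the toy example is an assumption rather than the schedule of any particular pre-trained model.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Latent = Sequence[float]  # a flattened latent, for illustration only


def mean_square(x: Latent) -> float:
    """Empirical E[x^2] over the elements of one latent."""
    return sum(v * v for v in x) / len(x)


def snr_lq(z_l: Latent, z_h: Latent) -> float:
    """SNR(z_L) = E[z_H^2] / E[(z_L - z_H)^2]."""
    residual = [a - b for a, b in zip(z_l, z_h)]
    return mean_square(z_h) / mean_square(residual)


def snr_ddpm(alpha_bar_t: float, z_h: Latent) -> float:
    """SNR(z_t) for DDPM, using E[eps^2] = 1."""
    return alpha_bar_t * mean_square(z_h) / (1.0 - alpha_bar_t)


def snr_fm(sigma_t: float, z_h: Latent) -> float:
    """SNR(z_t) for flow matching, using E[eps^2] = 1."""
    return (1.0 - sigma_t) ** 2 * mean_square(z_h) / sigma_t ** 2


def average_optimal_timestep(
    pairs: List[Tuple[Latent, Latent]],
    snr_at: Callable[[int, Latent], float],
    timesteps: Iterable[int],
) -> int:
    """Return the t that minimizes the mean |SNR(z_t) - SNR(z_L)| over (z_L, z_H) pairs."""

    def mean_gap(t: int) -> float:
        gaps = (abs(snr_at(t, z_h) - snr_lq(z_l, z_h)) for z_l, z_h in pairs)
        return sum(gaps) / len(pairs)

    return min(timesteps, key=mean_gap)


if __name__ == "__main__":
    T = 1000  # assumed number of timesteps
    z_h = [1.0, -1.0, 1.0, -1.0]   # toy HQ latent, E[z_H^2] = 1
    z_l = [1.5, -1.5, 0.5, -0.5]   # toy LQ latent, residual power 0.25
    # Assumed linear FM schedule sigma_t = t / T; t = 0 and t = T are skipped
    # to avoid a zero or infinite SNR.
    t_star = average_optimal_timestep(
        [(z_l, z_h)], lambda t, z: snr_fm(t / T, z), range(1, T)
    )
    print(t_star)
```

In the toy example the LQ SNR is 4, so the search lands on the integer timestep closest to sigma_t = 1/3, which is 333. With real data, `pairs` would hold VAE latents of LQ/HQ images, and `snr_at` would look up the scheduler's actual `alpha_bar_t` or `sigma_t`.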

5. Mid-timestep Script

You can run the script:

# OMGSR-S-512
python mid_timestep/mid_timestep_sd.py --dataset_txt_or_dir_paths /path1/to/images /path2/to/images

# OMGSR-F-1024
python mid_timestep/mid_timestep_flux.py --dataset_txt_or_dir_paths /path1/to/images /path2/to/images

In this repo, we use mid-timestep 273 for OMGSR-S-512 and 244 for OMGSR-F-1024.
The value does not need to be exact: a mid-timestep close to the recommended one also works.
Note that the mid-timestep must be the same during training and inference.
The optimal mid-timestep depends on the degradation configuration of the dataset.

:wrench: Environment

# git clone this repository
git clone https://github.com/wuer5/OMGSR.git
cd OMGSR
# create an environment
conda create -n OMGSR python=3.10
conda activate OMGSR
pip install --upgrade pip
pip install -r requirements.txt

:rocket: Quick Inference

1. Download the pre-trained models from HuggingFace

Download SD2.1-base for OMGSR-S-512.
Download FLUX.1-dev for OMGSR-F-1024.

2. Download the OMGSR LoRA adapter weights

Download the OMGSR-S-512 LoRA Adapter Weight (rename it to omgsr-s-512-adapter) into the folder adapters (create the folder if it does not exist).
Download the OMGSR-F-1024 LoRA Adapter Weight (rename it to omgsr-f-1024-adapter) into the folder adapters (create the folder if it does not exist).

3. Prepare your testing data

Put your test images (.png, .jpg, or .jpeg) in the folder tests.
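
Before starting inference, it can help to confirm that the layout from steps 2 and 3 is in place. The helper below is a hypothetical pre-flight check, not part of this repo; it only assumes the folder names `adapters` and `tests` and the adapter names given above.

```python
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # formats listed in this README


def check_inference_layout(root: Path, adapter_name: str) -> List[Path]:
    """Return the test images found under root/tests, or raise if something is missing."""
    adapter = root / "adapters" / adapter_name
    if not adapter.exists():
        raise FileNotFoundError(f"missing adapter: {adapter}")

    tests_dir = root / "tests"
    if not tests_dir.is_dir():
        raise FileNotFoundError(f"missing folder: {tests_dir}")

    # Match extensions case-insensitively and ignore any other files.
    images = sorted(
        p for p in tests_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not images:
        raise FileNotFoundError(f"no .png/.jpg/.jpeg images in {tests_dir}")
    return images
```

For example, `check_inference_layout(Path("."), "omgsr-s-512-adapter")` run from the repository root returns the sorted list of test images, or raises with the path that is missing.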

4. Start inference

For OMGSR-S-512:

bash infer_omgsr_s.sh

For OMGSR-F-1024:

bash infer_omgsr_f.sh

:hugs: Training

1. Prepare your training datasets

Download the training datasets LSDIR and FFHQ (first 10k images), following the settings in our paper, or use your own custom datasets.

Then edit dataset_txt_or_dir_paths in configs/xxx.yml, for example:

dataset_txt_or_dir_paths: [path1, path2, ...]

Note that each of path1, path2, ... can be either a .txt file (listing the paths of training images) or a folder (containing the training images). Supported image formats are png, jpg, and jpeg.
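
To make the two accepted entry types concrete, here is a minimal sketch of how such a list could be expanded into image paths. It is an assumption-laden illustration, not the repo's actual data loader: the function name is hypothetical, folders are scanned non-recursively, and a .txt file is assumed to hold one image path per line.

```python
from pathlib import Path
from typing import Iterable, List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def resolve_image_paths(entries: Iterable[str]) -> List[Path]:
    """Expand folder and .txt entries into a flat list of image paths."""
    images: List[Path] = []
    for entry in entries:
        path = Path(entry)
        if path.is_dir():
            # Folder entry: take the files directly inside it (no recursion).
            candidates = sorted(path.iterdir())
        elif path.suffix.lower() == ".txt":
            # Text entry: one image path per line, blank lines skipped.
            lines = path.read_text(encoding="utf-8").splitlines()
            candidates = [Path(line.strip()) for line in lines if line.strip()]
        else:
            raise ValueError(f"expected a folder or a .txt file: {entry}")
        # Keep only the supported image formats.
        images.extend(p for p in candidates if p.suffix.lower() in IMAGE_SUFFIXES)
    return images
```

Calling `resolve_image_paths(["/path1/to/images", "/path2/to/list.txt"])` would return the images in the folder followed by those listed in the text file, in entry order.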

2. Download the DINOv3-ConvNeXt

Download DINOv3-ConvNeXt-Large into the folder dinov3_gan/dinov3_weights (create the folder if it does not exist).

3. Start training

Start to train OMGSR-S-512:

bash train_omgsr_s_512.sh

Start to train OMGSR-F-1024:

bash train_omgsr_f_1024.sh

:book: Citation

If OMGSR is helpful to you, please consider citing our paper.

@misc{wu2025omgsrneedmidtimestepguidance,
      title={OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution}, 
      author={Zhiqiang Wu and Zhaomang Sun and Tong Zhou and Bingtao Fu and Ji Cong and Yitong Dong and Huaqi Zhang and Xuan Tang and Mingsong Chen and Xian Wei},
      year={2025},
      eprint={2508.08227},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.08227}, 
}

:thumbsup: Acknowledgement

The dinov3_gan folder in this project is modified from Vision-aided GAN and DINOv3. Thanks to the authors for their awesome work.

:email: Contact

If you have any questions, please contact 51265902095@stu.ecnu.edu.cn.