March 19, 2026
(TCSVT2025) OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution
Overview framework

Dependencies and Installation
# clone this repository
git clone https://github.com/chaixinning/OmniScaleSR.git
cd OmniScaleSR
# create an environment
conda create -n omniscalesr python=3.10
conda activate omniscalesr
pip install -r requirements.txt
Quick Inference
Step 1: Download the pretrained models
- Download the pretrained SD-2-base models from HuggingFace or GoogleDrive.
- Download the OmniScaleSR models from GoogleDrive.
- Download the pre-x4 model from GoogleDrive and rename it to 'seemoredetail_4x.pth'.
- Download the RAM model from HuggingFace.
- Download the DAPE model from GoogleDrive.
You can put the models into preset/models.
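Before running inference, it can help to confirm that the downloads landed where the commands expect them. The following is a minimal sketch, not part of the repository: the helper `missing_models` is hypothetical, and the entry names are assumptions taken from the paths used in the test command plus the renamed pre-x4 checkpoint. The RAM checkpoint is left out because its file name is not given here.

```python
from pathlib import Path

# Assumed layout: these names mirror the paths in the test command and the
# renamed pre-x4 checkpoint. Adjust them to wherever you saved the files.
EXPECTED = [
    "stable-diffusion-2-base",  # SD-2-base directory
    "OmniScaleSR",              # OmniScaleSR weights directory
    "seemoredetail_4x.pth",     # renamed pre-x4 model
    "DAPE.pth",                 # DAPE model
]


def missing_models(root, expected=EXPECTED):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_models("preset/models")
    if missing:
        print("Missing under preset/models:", ", ".join(missing))
    else:
        print("All expected model files found.")
```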
Step 2: Prepare testing data
Put the test images in the input_images directory.
Step 3: Run the test command
# x16 SR
CUDA_VISIBLE_DEVICES=4 python test_omniscalesr.py \
--upscale 16 \
--pretrained_model_path 'xxx/stable-diffusion-2-base/' \
--omniscalesr_model_path 'xxx/OmniScaleSR/' \
--ram_ft_path 'xxx/DAPE.pth' \
--image_path ./input_images \
--output_dir ./output_images/x16 \
--start_point noise \
--num_inference_steps 50 \
--guidance_scale 7.5 \
--process_size 512
You can change the target SR scale by modifying the --upscale parameter.
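When several scales are needed, the command can be assembled programmatically. The sketch below only rebuilds the argument list shown in this step and varies `--upscale` and the output directory; the helper name `build_test_command` and the example scales are illustrative assumptions, and the `xxx/...` placeholders are kept as they appear in the command.

```python
import shlex


def build_test_command(upscale, image_path="./input_images"):
    """Build the argument list for one test_omniscalesr.py run."""
    return [
        "python", "test_omniscalesr.py",
        "--upscale", str(upscale),
        "--pretrained_model_path", "xxx/stable-diffusion-2-base/",
        "--omniscalesr_model_path", "xxx/OmniScaleSR/",
        "--ram_ft_path", "xxx/DAPE.pth",
        "--image_path", image_path,
        # One output folder per scale, following the ./output_images/x16 pattern.
        "--output_dir", f"./output_images/x{upscale}",
        "--start_point", "noise",
        "--num_inference_steps", "50",
        "--guidance_scale", "7.5",
        "--process_size", "512",
    ]


if __name__ == "__main__":
    for scale in (4, 8, 16):  # example scales, chosen only for illustration
        print(shlex.join(build_test_command(scale)))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run` directly.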
Train
Step 1: Download the pretrained models
- Download the pretrained SD-2-base models.
- Download the DAPE model from GoogleDrive.
Step 2: Prepare real-world ASSR training data
(1) Generate LR images:
python utils_data/make_paired_data_ASRealSR.py \
--gt_path PATH1 \
--save_dir 'ASRealSR_data/PATH_1' \
--epoch 2
- --gt_path: path to the ground-truth (GT) images, e.g., xxx/LSDIR
- --save_dir: directory where the paired images are saved
- --epoch: number of epochs of paired data to generate
Arbitrary-resolution degradation is performed on-the-fly in the dataloader during training, ensuring that all images within each mini-batch have a consistent spatial resolution.
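One way to obtain that per-batch consistency is to draw a single scale factor for the whole mini-batch rather than one per image. The sketch below illustrates only that idea and is not the repository's dataloader: the function name, the scale range of 1 to 16, and the rounding rule are all assumptions, and the GT size of 512 is taken from the `--resolution=512` training flag.

```python
import random


def sample_batch_lr_sizes(batch_size, gt_size=512, scale_range=(1.0, 16.0), rng=None):
    """Draw ONE scale for the whole batch and return (scale, list of LR sizes).

    Because every image shares the same scale, every LR image in the batch
    has the same spatial size and the batch can be stacked into one tensor.
    """
    rng = rng or random.Random()
    scale = rng.uniform(*scale_range)          # assumed sampling range
    lr_side = max(1, round(gt_size / scale))   # assumed rounding rule
    return scale, [(lr_side, lr_side)] * batch_size


if __name__ == "__main__":
    scale, sizes = sample_batch_lr_sizes(batch_size=4, rng=random.Random(0))
    print(f"scale={scale:.2f}, LR sizes={sizes}")
```

Sampling per batch trades some scale diversity within a step for the ability to batch without padding or resizing.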
(2) Generate image captions:
# generate image captions using BLIP-2
python utils_data/BLIP2_generation_ASRealSR.py --start_gpu 0 --all_gpu 1
Step 3: Train OmniScaleSR
CUDA_VISIBLE_DEVICES="1" accelerate launch --num_processes 1 \
--main_process_port 10000 train_omniscalesr.py \
--pretrained_model_name_or_path="xxx/stable-diffusion-2-base/" \
--ram_ft_path 'xxx/DAPE.pth' \
--enable_xformers_memory_efficient_attention \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=5e-5 \
--train_batch_size=1 \
--gradient_accumulation_steps=2 \
--null_text_ratio=0.5 \
--dataloader_num_workers=0 \
--checkpointing_steps=20000 \
--data_root TRAIN_DATA_ROOT \
--output_dir='./checkpoints' \
--max_train_steps 160000
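As a quick sanity check on these settings, the effective batch size and the number of saved checkpoints can be worked out from the flags. This is a small sketch under two assumptions: the effective batch size follows the usual accelerate convention (per-device batch size times gradient accumulation steps times number of processes), and the script saves a checkpoint whenever the global step is a multiple of `--checkpointing_steps`. Both helper names are hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes):
    """Samples contributing to one optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def checkpoint_steps(max_train_steps, checkpointing_steps):
    """Global steps at which a checkpoint would be written (assumed rule)."""
    return list(range(checkpointing_steps, max_train_steps + 1, checkpointing_steps))


# Values from the training command: batch size 1, accumulation 2, 1 process.
print(effective_batch_size(1, 2, 1))          # 2
# 160000 training steps with a checkpoint every 20000 steps.
print(len(checkpoint_steps(160000, 20000)))   # 8
```

When training on more GPUs, raising `--num_processes` scales the effective batch size by the same factor, so the learning rate may need revisiting.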
Acknowledgments
This project is based on diffusers and BasicSR. Some code is borrowed from SeeSR and SeemoRe. We thank the authors for their excellent work.
Contact
If you have any questions, please feel free to contact: chaixinning@sjtu.edu.cn
License
This project is released under the MIT License.
Citations
If our code helps your research or work, please consider citing our paper. The BibTeX entry is:
@ARTICLE{chai2025omniscalesr,
author={Chai, Xinning and Cheng, Zhengxue and Zhang, Yuhong and Zhang, Hengsheng and Qin, Yingsheng and Yang, Yucai and Xie, Rong and Song, Li},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
title={OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution},
year={2025},
volume={},
number={},
pages={1-1},
doi={10.1109/TCSVT.2025.3642578}}