November 6, 2025 · View on GitHub
Diffusion-based Blind Text Image Super-Resolution (CVPR2024)
Yuzhe Zhang1 | Jiawei Zhang2 | Hao Li2 | Zhouxia Wang3 | Luwei Hou2 | Dongqing Zou2 | Liheng Bian1
1Beijing Institute of Technology, 2SenseTime Research, 3The University of Hong Kong
📢 News
- 🚀Training Code has been released, enjoy.
- 🌟Added a Q&A section discussing this work, plus notes on input image limitations and dataset production (Data_Produce).
- 2024.12 🚀Some bugs and stability issues have been fixed. Please use the latest model for inference.
- 2024.05 🚀Inference code has been released, enjoy.
- 2024.04 🚀Official repository of DiffTSR.
- 2024.03 🌟The implementation code will be released shortly.
- 2024.03 ❤️Accepted by CVPR2024.
💬 Q&A
Please Read Before Trying.
🇬🇧 English Q&A: a summary of details you may want to know, for your reference
-
Generalization of DiffTSR to Real-World Scenarios
A: DiffTSR takes various real-world degradations into account during training, inheriting the complex degradation modeling from BSRGAN and Real-ESRGAN. Moreover, the "Blind" in "Blind Text Image Super-Resolution" specifically refers to the restoration of images with unknown degradations, which is targeted at real-world scenarios.
-
Does the Unet in IDM use Stable-Diffusion weights?
A: No. The Unet in IDM is trained from scratch and does not load any pre-trained weights. Additionally, the structure of IDM differs from the Unet of any existing diffusion model. However, the VAE loads the pre-trained weights of the `ldm` f4 VAE, which was pre-trained on the Open-Image dataset, and is then fine-tuned on this project's CTR-TSR-Train dataset for 100,000 iterations with a batch size of 16. TDM and MoM were also trained from scratch without using any pre-trained models. For detailed training settings, please refer to Supplementary Material Section 1.4.
-
What are the input size and requirements for the DiffTSR model? Does the input need to be resized?
A: The LR input of the model needs to be uniformly resized to `width=512` and `height=128`. Additionally, since this project only considers single-line text input, the input image must contain only one line of text. Both IDM and TDM are designed specifically for single-line text, and multi-line text input will result in distortion and incorrect results.
-
Inference on a single image is very slow. What are the possible solutions?
A: Since this project is based on diffusion, processing a single image requires `T` iterations (default `T=200`). To improve inference speed, you may consider:
- Reducing `T`. Because the sampler is DDIM, the model still performs well at `T=20`.
- Quantizing the DiffTSR model; see existing repositories on diffusion model quantization.
- Using the project's Baseline model. It may slightly reduce quality, but it provides an approximately 2× speed-up while remaining acceptable in most scenarios.
- Distilling IDM or training a smaller IDM. Text scenarios may not need a model as heavy as those used for general image generation.
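The effect of reducing `T` can be sketched as follows. This is only an illustration of the uniform-stride timestep selection commonly used with DDIM samplers, not the repository's sampler: the function name `ddim_timesteps` and the 1000-step training schedule are assumptions made for the example.

```python
def ddim_timesteps(num_train_steps: int, num_sample_steps: int) -> list[int]:
    """Return an evenly strided subset of training timesteps, noisiest first."""
    if not 1 <= num_sample_steps <= num_train_steps:
        raise ValueError("num_sample_steps must be in [1, num_train_steps]")
    stride = num_train_steps // num_sample_steps
    steps = list(range(0, num_train_steps, stride))[:num_sample_steps]
    return steps[::-1]  # sampling runs from the noisiest step down to 0


# Hypothetical schedule: 1000 training steps (an assumption, not from the repo).
default_steps = ddim_timesteps(1000, 200)  # T=200, the documented default
fast_steps = ddim_timesteps(1000, 20)      # T=20, the suggested faster setting

# Each sampling step needs one pass through the denoising networks,
# so the sampling cost scales with the number of selected steps.
speedup = len(default_steps) / len(fast_steps)
print(f"T=200 uses {len(default_steps)} steps, T=20 uses {len(fast_steps)} steps "
      f"(about {speedup:.0f}x fewer)")
```

Under these assumptions, going from `T=200` to `T=20` cuts the number of denoising iterations by a factor of ten; the actual wall-clock gain also depends on fixed costs such as VAE decoding.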
-
How is the loss function set when training IDM? How is the text recognition loss implemented?
A: When training IDM, two loss functions are used:
- L2 loss: used for predicting noise.
- OCR loss: applied to the text recognized from the predicted clean `x^0`.
Specifically:
- The L2 loss is the standard loss used in diffusion models. It minimizes the difference between the Unet output and the noise map, which enables the Unet to estimate noise.
- The OCR loss is computed by first obtaining `z^(t-1)` from `z_t`, then deriving `z^0`, and subsequently decoding `z^0` to obtain `x^0`. The decoded `x^0` is fed into a frozen TransOCR model to obtain the embedding of the text in `x^0`. The cross-entropy loss is then computed between the predicted text embedding (`pred-text-embedding`) and the ground-truth text embedding (`gt-text-embedding`). The OCR loss is additionally weighted by `weight=0.02`.
For more details, see Issue.
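The way the two terms combine can be sketched in plain Python. This is a toy illustration under stated assumptions, not the training code: the helper names are hypothetical, the OCR term is modeled as per-character logits compared against ground-truth class indices, and averaging over characters is an assumption. Only the `0.02` weight comes from the answer above.

```python
import math

OCR_WEIGHT = 0.02  # weight on the OCR loss, as stated in the Q&A


def l2_loss(pred_noise: list[float], true_noise: list[float]) -> float:
    """Mean squared error between the Unet output and the sampled noise map."""
    return sum((p - n) ** 2 for p, n in zip(pred_noise, true_noise)) / len(true_noise)


def cross_entropy(logits: list[float], target: int) -> float:
    """Cross-entropy of one character prediction, via a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def idm_loss(pred_noise, true_noise, char_logits, char_targets, ocr_weight=OCR_WEIGHT):
    """L2 term plus the weighted OCR term (averaged over characters, an assumption)."""
    ocr = sum(cross_entropy(l, t) for l, t in zip(char_logits, char_targets))
    ocr /= len(char_targets)
    return l2_loss(pred_noise, true_noise) + ocr_weight * ocr


# Toy values: 4 noise entries, 2 characters over a 3-symbol alphabet.
loss = idm_loss(
    pred_noise=[0.1, -0.2, 0.0, 0.3],
    true_noise=[0.0, -0.2, 0.1, 0.3],
    char_logits=[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    char_targets=[0, 2],
)
print(f"L_IDM = {loss:.4f}")
```

In actual training these quantities would be tensors, and the character logits would come from the frozen TransOCR model applied to the decoded `x^0`.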
-
What are the loss functions used during training?
A: DiffTSR training consists of three stages, each using a different combination of loss functions:
- Training IDM: the Unet is trained from scratch with loss `L_IDM`, which includes the L2 loss and the OCR loss.
- Training TDM: the Transformer is trained from scratch with loss `L_TDM`, following Multinomial Diffusion Section 4.
- Training the entire DiffTSR: IDM and TDM are frozen, and only MoM is trained, with loss `L_MoM = L_IDM + L_TDM * weight`.
For detailed symbol definitions and theoretical derivations, see Supplementary Material Section 1 and Algorithm 1 DiffTSR Training.
To be continued...
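The three-stage schedule can be summarized as a small data structure. This is a sketch for orientation only: the `Stage` class and `mom_loss` helper are hypothetical names, the "frozen" column lists only modules the Q&A explicitly describes as frozen, and the value of `weight` is not given in this README, so the example value below is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trained: tuple[str, ...]  # modules updated in this stage
    frozen: tuple[str, ...]   # modules the Q&A explicitly states are frozen
    loss: str


STAGES = (
    Stage("IDM", trained=("IDM Unet",), frozen=("TransOCR",),
          loss="L_IDM (L2 loss and OCR loss)"),
    Stage("TDM", trained=("TDM Transformer",), frozen=(),
          loss="L_TDM"),
    Stage("DiffTSR", trained=("MoM",), frozen=("IDM", "TDM"),
          loss="L_MoM = L_IDM + L_TDM * weight"),
)


def mom_loss(l_idm: float, l_tdm: float, weight: float) -> float:
    """Stage-3 objective: L_MoM = L_IDM + L_TDM * weight."""
    return l_idm + l_tdm * weight


for stage in STAGES:
    frozen = ", ".join(stage.frozen) or "none stated"
    print(f"{stage.name}: train {', '.join(stage.trained)}; frozen: {frozen}; loss: {stage.loss}")

# Placeholder numbers, chosen only to show how the terms combine.
example = mom_loss(l_idm=1.0, l_tdm=2.0, weight=0.5)
```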
🌟 Input data and dataset production guideline
When using this model, the input data should meet the following requirements:
- The number of characters must not exceed 24.
- The text must be arranged horizontally.
- Single-line text only.
- The aspect ratio must satisfy $1 \leq \mathrm{ratio}(W/H) \leq 8$.
- During dataset cleaning, additionally ensure that the LR and HR images are not misaligned.
Future improvements aim to achieve more flexible, faster, and more general text image enhancement.
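The numeric requirements above can be checked before inference or during dataset cleaning with a small helper. This is a minimal sketch, not part of the repository: `check_sample` is a hypothetical name, and it covers only the character count, single-line, and aspect-ratio rules. Horizontal layout and LR/HR alignment need separate checks.

```python
TARGET_SIZE = (512, 128)         # (width, height) the LR input is resized to
MAX_CHARS = 24                   # maximum number of characters
MIN_RATIO, MAX_RATIO = 1.0, 8.0  # allowed range for W/H


def check_sample(width: int, height: int, text: str) -> list[str]:
    """Return the violated requirements; an empty list means the sample is usable."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    problems = []
    if len(text) > MAX_CHARS:
        problems.append(f"{len(text)} characters exceeds the limit of {MAX_CHARS}")
    if "\n" in text:
        problems.append("label spans multiple lines; only single-line text is supported")
    ratio = width / height
    if not MIN_RATIO <= ratio <= MAX_RATIO:
        problems.append(f"aspect ratio W/H = {ratio:.2f} is outside "
                        f"[{MIN_RATIO:g}, {MAX_RATIO:g}]")
    return problems


# Made-up samples: (width, height, text label).
samples = [(512, 128, "DiffTSR"), (100, 200, "x" * 25), (900, 100, "a\nb")]
for width, height, text in samples:
    issues = check_sample(width, height, text)
    if issues:
        print(f"{width}x{height}: rejected ({'; '.join(issues)})")
    else:
        print(f"{width}x{height}: ok, resize to {TARGET_SIZE[0]}x{TARGET_SIZE[1]}")
```

A sample that passes would then be resized to 512×128 before being fed to the model, for example with Pillow's `Image.resize((512, 128))`.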
🔥 TODO
- Attach the detailed implementation and supplementary material.
- Add inference code and checkpoints for blind text image SR.
- Add training code and scripts.
👁️ Gallery
🛠️ Try
Dependencies and Installation
- PyTorch >= 1.7.0
- CUDA >= 11.0
# git clone this repository
git clone https://github.com/YuzheZhang-1999/DiffTSR
cd DiffTSR
# create new anaconda env
conda env create -f environment.yaml
conda activate DiffTSR
Download the checkpoint
Please download the checkpoint file from the URL below to the ./ckpt/ folder.
-
[BaiduDisk] [Password: vk9n]
Inference
python inference_DiffTSR.py
# check the code for more detail
Training
# cd ./train
# check the README.md file for training details
# Please note that you need to carefully review the training sh file and the configuration yaml.
# Some of the configurations need to be modified according to your data and file paths.
🔎 Overview of DiffTSR

Abstract
Recovering degraded low-resolution text images is challenging, especially for Chinese text images with complex strokes and severe degradation in real-world scenarios. Ensuring both text fidelity and style realness is crucial for high-quality text image super-resolution. Recently, diffusion models have achieved great success in natural image synthesis and restoration due to their powerful data distribution modeling abilities and data generation capabilities. In this work, we propose an Image Diffusion Model (IDM) to restore text images with realistic styles. For diffusion models, they are not only suitable for modeling realistic image distribution but also appropriate for learning text distribution. Since text prior is important to guarantee the correctness of the restored text structure according to existing arts, we also propose a Text Diffusion Model (TDM) for text recognition which can guide IDM to generate text images with correct structures. We further propose a Mixture of Multi-modality module (MoM) to make these two diffusion models cooperate with each other in all the diffusion steps. Extensive experiments on synthetic and real-world datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution (DiffTSR) can restore text images with more accurate text structures as well as more realistic appearances simultaneously.
Visual performance comparison overview
Blind text image super-resolution results between different methods on synthetic and real-world text images. Our method can restore text images with high text fidelity and style realness under complex strokes, severe degradation, and various text styles.
📷 More Visual Results
🎓Citations
@inproceedings{zhang2024diffusion,
title={Diffusion-based Blind Text Image Super-Resolution},
author={Zhang, Yuzhe and Zhang, Jiawei and Li, Hao and Wang, Zhouxia and Hou, Luwei and Zou, Dongqing and Bian, Liheng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={25827--25836},
year={2024}
}
🎫 License
This project is released under the Apache 2.0 license.
Acknowledgement
Thanks to these awesome works:







