Predicting the Original Appearance of Damaged Historical Documents
July 15, 2025

Gallery • HDR28K • Model Zoo • Dataset Zoo • Installation • Inference • Evaluation
Highlight

- We introduce a Historical Document Repair (HDR) task, which endeavors to predict the original appearance of damaged historical document images.
- We build a large-scale historical document repair dataset, termed HDR28K, which includes 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradation.
- We propose a Diffusion-based Historical Document Repair method (DiffHDR), which augments the DDPM framework with semantic and spatial information.
News
- 2025.07.15: We propose a novel historical document restoration method, AutoHDR. Welcome to try our demo!
- 2025.03.20: The Historical Document Repair dataset HDR28K is released!
- 2024.12.17: Release inference code.
- 2024.12.10: Our paper is accepted by AAAI 2025.
Model Zoo
| Model | Checkpoint | Status |
|---|---|---|
| DiffHDR | GoogleDrive / BaiduYun:x62f | Released |
Dataset Zoo
| Dataset | Download | Status |
|---|---|---|
| HDR28K | BaiduYun:upm9 | Released |
The dataset file structure is as follows:
- character_missing
  - test
    - char_mask_images
    - content_images
    - degraded_images
    - original_images
  - train
    - char_mask_images
    - content_images
    - degraded_images
    - original_images
- ink erosion
  - similar to 'character_missing'
- paper damage
  - similar to 'character_missing'
- test_image_only_damage
  - hole_M5_image_2000_32_467_544_979_degrade0.png
  - ......
NOTE: The test_image_only_damage directory contains the ground-truth (gt) images after the non-damaged regions have been replaced by the target.
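The layout above lends itself to a small file-pairing helper. The following is a minimal sketch and not part of the repository: `list_samples` is a hypothetical name, and it assumes that a sample uses the same file name in all four sub-folders of a split, which the dataset description does not confirm.

```python
from pathlib import Path

# Sub-folder names taken from the structure listed above.
SUBDIRS = ("char_mask_images", "content_images", "degraded_images", "original_images")


def list_samples(split_dir):
    """Return one dict per sample, mapping each sub-folder name to a file path.

    Assumption (not confirmed by the dataset docs): a sample has the same
    file name in every sub-folder. Files missing from any sub-folder are skipped.
    """
    split_dir = Path(split_dir)
    samples = []
    for degraded in sorted((split_dir / "degraded_images").iterdir()):
        if not degraded.is_file():
            continue
        # Build the candidate path for this file name in each sub-folder.
        paths = {name: split_dir / name / degraded.name for name in SUBDIRS}
        if all(p.is_file() for p in paths.values()):
            samples.append(paths)
    return samples
```

For instance, calling `list_samples` on the `character_missing/train` folder of your extracted copy would return the matched mask, content, degraded, and original paths for that split. If the actual file naming differs between sub-folders, the matching rule needs to be adapted.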
Installation
Prerequisites (Recommended)
- Linux
- Python 3.9
- PyTorch 1.13.1
- CUDA 11.7
Environment Setup
Clone this repo:
git clone https://github.com/yeungchenwa/HDR.git
Step 0: Download and install Miniconda from the official website.
Step 1: Create a conda environment and activate it.
conda create -n diffhdr python=3.9 -y
conda activate diffhdr
Step 2: Install the matching version of PyTorch, following the official PyTorch installation instructions.
# Suggested
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
Step 3: Install the required packages.
pip install -r requirements.txt
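After the steps above, it can help to confirm that the environment matches the recommended prerequisites. The following is a minimal sketch and not a script shipped with the repository: `check_environment` is a hypothetical helper that only reports versions and does not enforce them.

```python
import importlib.util
import sys


def check_environment():
    """Collect version info to compare against the recommended setup
    (Python 3.9, PyTorch 1.13.1, CUDA 11.7)."""
    info = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    # Only import torch if it is installed, so the check also works before Step 2.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` on a GPU machine, the usual suspects are a CPU-only PyTorch wheel or a driver that does not support the CUDA build of the installed wheel.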
Inference
Use DiffHDR to repair damaged historical documents (example damaged images, mask images, and content images are provided in /examples):
sh scripts/inference.sh
- device: CUDA or CPU used for inference.
- image_path: The damaged image path.
- mask_image_path: The masked image path.
- content_image_path: The content image path.
- save_dir: The directory for saving the repaired image.
- content_mask_guidance_scale: The guidance scale of the content image and masked image.
- degraded_guidance_scale: The guidance scale of the damaged image.
- ckpt_path: The unet checkpoint path.
- num_inference_steps: The number of inference steps.
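Before launching a run, the parameters listed above can be sanity-checked so that a missing file surfaces immediately rather than partway through inference. The following is a minimal sketch under stated assumptions: `InferenceArgs` and its `problems` method are hypothetical and not part of the repository, the field names simply mirror the parameter names above, and how the values reach `scripts/inference.sh` depends on that script and is not shown.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class InferenceArgs:
    """Hypothetical container mirroring the parameter names listed above."""

    device: str
    image_path: str
    mask_image_path: str
    content_image_path: str
    save_dir: str
    content_mask_guidance_scale: float
    degraded_guidance_scale: float
    ckpt_path: str
    num_inference_steps: int

    def problems(self):
        """Return a list of human-readable issues; an empty list means the values look usable."""
        issues = []
        # Every input path must exist before inference starts.
        for field in ("image_path", "mask_image_path", "content_image_path", "ckpt_path"):
            value = getattr(self, field)
            if not Path(value).exists():
                issues.append(f"{field} does not exist: {value}")
        if self.num_inference_steps <= 0:
            issues.append("num_inference_steps must be a positive integer")
        return issues
```

The sketch deliberately sets no default values, since the source does not state recommended guidance scales or step counts.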
HDR28K

Evaluation
Coming soon ...
Acknowledgement
Copyright
- This repository can only be used for non-commercial research purposes.
- For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
- Copyright 2024, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.
Citation
@inproceedings{yang2024fontdiffuser,
title={Predicting the Original Appearance of Damaged Historical Documents},
author={Yang, Zhenhua and Peng, Dezhi and Shi, Yongxin and Zhang, Yuyi and Liu, Chongyu and Jin, Lianwen},
booktitle={Proceedings of the AAAI conference on artificial intelligence},
year={2025}
}