Predicting the Original Appearance of Damaged Historical Documents

July 15, 2025

🖼️ Gallery • 📊 HDR28K • 🔥 Model Zoo • 🔥 Dataset Zoo • 🚧 Installation • 📺 Inference • 📏 Evaluation

🌟 Highlight

  • We introduce the Historical Document Repair (HDR) task, which aims to predict the original appearance of damaged historical document images.
  • We build a large-scale historical document repair dataset, HDR28K, which contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations.
  • 🔥🔥🔥 We propose a diffusion-based historical document repair method (DiffHDR), which augments the DDPM framework with semantic and spatial information.

📰 News

  • 2025.07.15: 🎉 We propose a novel historical document restoration method, AutoHDR. You are welcome to try our demo!
  • 2025.03.20: 🎉🎉 The Historical Document Repair dataset HDR28K is released!
  • 2024.12.17: Inference code released.
  • 2024.12.10: 🎉🎉 Our paper was accepted to AAAI 2025.

🔥 Model Zoo

| Model | Checkpoint | Status |
| --- | --- | --- |
| DiffHDR | GoogleDrive / BaiduYun:x62f | Released |

🔥 Dataset Zoo

| Dataset | Download | Status |
| --- | --- | --- |
| HDR28K | BaiduYun:upm9 | Released |

The dataset is organized as follows:

- character_missing
  - test
    - char_mask_images
    - content_images
    - degraded_images
    - original_images
  - train
    - char_mask_images
    - content_images
    - degraded_images
    - original_images
- ink erosion
  - similar to 'character_missing'
- paper damage
  - similar to 'character_missing'
- test_image_only_damage
  - hole_M5_image_2000_32_467_544_979_degrade0.png
  - ......
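The layout above suggests that each split keeps four parallel folders per sample. The sketch below gathers the four files that belong to one sample. It assumes, without confirmation from the dataset itself, that a sample uses the same `.png` filename in all four folders; `collect_samples` is a hypothetical helper, not part of the released code.

```python
from pathlib import Path
from typing import Dict, List

# Sub-folder names taken from the directory listing above.
SUBDIRS = ("char_mask_images", "content_images", "degraded_images", "original_images")


def collect_samples(root: str, damage_type: str = "character_missing",
                    split: str = "train") -> List[Dict[str, Path]]:
    """Return one dict per sample, mapping each sub-folder name to a file path.

    Assumption: the four folders share filenames, so the degraded image
    name is enough to locate the other three files.
    """
    split_dir = Path(root) / damage_type / split
    samples = []
    for degraded in sorted((split_dir / "degraded_images").glob("*.png")):
        sample = {name: split_dir / name / degraded.name for name in SUBDIRS}
        # Keep only samples for which all four files are present.
        if all(path.is_file() for path in sample.values()):
            samples.append(sample)
    return samples
```

If the filenames turn out to differ between folders, the lookup inside the loop is the only part that needs to change.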

NOTE: The test_image_only_damage directory contains the ground-truth (GT) images obtained by replacing the non-damaged region of $x_r$ with the target $x_{target}$.
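Read literally, the note describes a mask-based composite. As a sketch, assume a binary mask $M$ that equals 1 inside the damaged region and 0 elsewhere, and write $\hat{x}$ for the stored image. Both symbols are introduced here for illustration and are not taken from the dataset description:

$$
\hat{x} = M \odot x_r + (1 - M) \odot x_{target}
$$

Here $\odot$ is element-wise multiplication, so only the damaged region of $x_r$ is kept and every other pixel comes from $x_{target}$.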

🚧 Installation

  • Linux
  • Python 3.9
  • PyTorch 1.13.1
  • CUDA 11.7

Environment Setup

Clone this repo:

git clone https://github.com/yeungchenwa/HDR.git

Step 0: Download and install Miniconda from the official website.

Step 1: Create a conda environment and activate it.

conda create -n diffhdr python=3.9 -y
conda activate diffhdr

Step 2: Install the matching version of PyTorch.

# Suggested
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

Step 3: Install the required packages.

pip install -r requirements.txt
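After the steps above, a quick check can confirm that the installed package versions match the suggested ones. The following is a minimal sketch using only the standard library; `check_versions` and `EXPECTED` are hypothetical names, and the version numbers are copied from the suggested `pip install` command.

```python
from importlib import metadata

# Versions taken from the suggested install command above.
EXPECTED = {"torch": "1.13.1", "torchvision": "0.14.1", "torchaudio": "0.13.1"}


def check_versions(expected=EXPECTED):
    """Return a dict mapping each package to 'ok', 'missing', or 'mismatch (...)'."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        # Drop local build tags such as "+cu117" before comparing.
        base = installed.split("+")[0]
        report[package] = "ok" if base == wanted else f"mismatch ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_versions().items():
        print(f"{package}: {status}")
```

This only compares version strings. It does not verify that CUDA 11.7 is usable on the machine.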

📺 Inference

Use DiffHDR to repair damaged historical documents (example damaged images, mask images, and content images are provided in /examples):

sh scripts/inference.sh
  • device: The device used for inference (CUDA or CPU).
  • image_path: The path to the damaged image.
  • mask_image_path: The path to the mask image.
  • content_image_path: The path to the content image.
  • save_dir: The directory in which the repaired image is saved.
  • content_mask_guidance_scale: The guidance scale for the content image and mask image.
  • degraded_guidance_scale: The guidance scale for the damaged image.
  • ckpt_path: The path to the UNet checkpoint.
  • num_inference_steps: The number of inference steps.
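Before editing scripts/inference.sh, it can help to collect these parameters in one place and sanity-check them. The sketch below is a hypothetical helper, not part of the repository: the field names mirror the list above, while the `InferenceConfig` class, its `problems` method, and the specific checks are assumptions made for illustration. How the script actually consumes these values is not shown here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class InferenceConfig:
    """Hypothetical container for the parameters listed above."""
    device: str
    image_path: str
    mask_image_path: str
    content_image_path: str
    save_dir: str
    content_mask_guidance_scale: float
    degraded_guidance_scale: float
    ckpt_path: str
    num_inference_steps: int

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means no issue was found."""
        issues = []
        # Every input file must exist before inference can start.
        for name in ("image_path", "mask_image_path", "content_image_path", "ckpt_path"):
            value = getattr(self, name)
            if not Path(value).exists():
                issues.append(f"{name} does not exist: {value}")
        if self.num_inference_steps <= 0:
            issues.append("num_inference_steps must be positive")
        return issues
```

Calling `problems()` on a config with a mistyped path reports the offending field by name, which is usually quicker to read than a stack trace from inside the model-loading code.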

📊 HDR28K

📏 Evaluation

Coming soon ...

💙 Acknowledgement

📇 Citation

@inproceedings{yang2024fontdiffuser,
  title={Predicting the Original Appearance of Damaged Historical Documents},
  author={Yang, Zhenhua and Peng, Dezhi and Shi, Yongxin and Zhang, Yuyi and Liu, Chongyu and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2025}
}
