March 16, 2025
TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control [NeurIPS 2024]

TODOs
- Release ScenePair benchmark dataset and code of model;
- Release checkpoints and inference code;
- Release training pipeline;
1 Installation
1.1 Code Preparation
# Clone the repo
$ git clone https://github.com/weichaozeng/TextCtrl.git
$ cd TextCtrl/
# Install required packages
$ conda create --name textctrl python=3.8
$ conda activate textctrl
$ pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu116
$ pip install -r requirement.txt
1.2 Checkpoints Preparation
Download the checkpoints from Link_1 and Link_2. The file structure should be set as follows:
TextCtrl/
├── weights/
│ ├── model.pth # weight of style encoder and unet
│ ├── text_encoder.pth # weight of pretrained glyph encoder
│ ├── style_encoder.pth # weight of pretrained style encoder
│ ├── vision_model.pth # monitor weight
│ ├── ocr_model.pth # ocr weight
│ ├── vgg19.pth # vgg weight
│ ├── vitstr_base_patch16_224.pth # vitstr weight
│ └── sd/ # pretrained weight of stable-diffusion-v1-5
│     ├── vae/
│     ├── unet/
│     └── scheduler/
├── README.md
├── ...
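Before running anything, it can help to confirm that the downloaded checkpoints landed where the tree above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and it assumes that `vae/`, `unet/` and `scheduler/` sit inside `weights/sd/`.

```python
from pathlib import Path

# Paths copied from the tree above, relative to the repository root.
EXPECTED_FILES = [
    "weights/model.pth",
    "weights/text_encoder.pth",
    "weights/style_encoder.pth",
    "weights/vision_model.pth",
    "weights/ocr_model.pth",
    "weights/vgg19.pth",
    "weights/vitstr_base_patch16_224.pth",
]
# Assumption: the three Stable Diffusion folders are nested under weights/sd/.
EXPECTED_DIRS = [
    "weights/sd/vae",
    "weights/sd/unet",
    "weights/sd/scheduler",
]


def missing_checkpoints(repo_root):
    """Return the expected checkpoint paths that are absent under repo_root."""
    root = Path(repo_root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (root / p).is_dir()]
    return missing


if __name__ == "__main__":
    # Run from the TextCtrl/ directory; prints nothing when the layout is complete.
    for path in missing_checkpoints("."):
        print(f"missing: {path}")
```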
2 Inference
2.1 Data Preparation
The inference data should follow the file structure of example/:
TextCtrl/
├── example/
│ ├── i_s/ # source cropped text images
│ ├── i_s.txt # filename and text label of source images in i_s/
│ └── i_t.txt # filename and text label of target images
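A quick consistency check between the images and the two label files can catch mismatches before inference. This is a sketch under an assumption that the README does not confirm: each line of `i_s.txt` and `i_t.txt` is taken to be a filename followed by whitespace and the text label. The helper names `read_labels` and `check_inference_dir` are hypothetical.

```python
from pathlib import Path


def read_labels(txt_path):
    """Parse 'filename label' lines into a dict (the line format is an assumption)."""
    labels = {}
    for line in Path(txt_path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        # Everything after the first whitespace run is treated as the label.
        labels[parts[0]] = parts[1] if len(parts) > 1 else ""
    return labels


def check_inference_dir(dataset_dir):
    """Return a list of human-readable problems found in dataset_dir."""
    root = Path(dataset_dir)
    source = set(read_labels(root / "i_s.txt"))
    target = set(read_labels(root / "i_t.txt"))
    images = {p.name for p in (root / "i_s").iterdir() if p.is_file()}
    problems = []
    for name in sorted(images - source):
        problems.append(f"{name}: no source label in i_s.txt")
    for name in sorted(images - target):
        problems.append(f"{name}: no target label in i_t.txt")
    for name in sorted((source | target) - images):
        problems.append(f"{name}: listed in a label file but missing from i_s/")
    return problems
```

Calling `check_inference_dir("example/")` from the repository root returns an empty list when every image has both labels and every labelled filename exists.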
2.2 Edit Arguments
Edit the arguments in inference.py, especially:
parser.add_argument("--ckpt_path", type=str, default="weights/model.pth")
parser.add_argument("--dataset_dir", type=str, default="example/")
parser.add_argument("--output_dir", type=str, default="example_result/")
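Because these options are declared with `argparse`, they can usually be overridden on the command line instead of editing the defaults. The sketch below rebuilds only the three arguments quoted above to show the behaviour. It assumes, without having checked, that inference.py parses `sys.argv` in the standard way; `my_data/` and `my_result/` are placeholder paths.

```python
import argparse


def build_parser():
    """Recreate the three arguments quoted from inference.py."""
    parser = argparse.ArgumentParser(description="TextCtrl inference arguments (sketch)")
    parser.add_argument("--ckpt_path", type=str, default="weights/model.pth")
    parser.add_argument("--dataset_dir", type=str, default="example/")
    parser.add_argument("--output_dir", type=str, default="example_result/")
    return parser


# An explicit list stands in for command-line arguments here.
args = build_parser().parse_args(["--dataset_dir", "my_data/", "--output_dir", "my_result/"])

# Options that were not passed keep their defaults.
print(args.ckpt_path, args.dataset_dir, args.output_dir)
# prints: weights/model.pth my_data/ my_result/
```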
2.3 Generate Images
The inference results can be found in example_result/ after running:
$ PYTHONPATH=.../TextCtrl/ python inference.py
2.4 Inference Results
| Source Images | Target Text | Infer Results | Reference GT |
|---|---|---|---|
| (image) | "Private" | (image) | (image) |
| (image) | "First" | (image) | (image) |
| (image) | "RECORDS" | (image) | (image) |
| (image) | "Sunset" | (image) | (image) |
| (image) | "Network" | (image) | (image) |
3 Training
3.1 Data Preparation
Training relies on synthetic data generated by SRNet-Datagen, with some modifications to produce the required elements. The file structure should be set as follows:
Syn_data/
├── fonts/
│ ├── arial.ttf
│ └── .../
├── train/
│ ├── train-50k-1/
│ ├── train-50k-2/
│ ├── train-50k-3/
│ └── train-50k-4/
│     ├── i_s/
│     ├── mask_s/
│     ├── i_s.txt
│     ├── t_f/
│     ├── mask_t/
│     ├── i_t.txt
│     ├── t_t/
│     ├── t_b/
│     └── font.txt
└── eval/
    └── eval-1k/
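Since the synthetic data is generated separately, a short report on each training split can reveal missing folders or uneven sample counts early. This is only a sketch: the helper names are hypothetical, and it assumes that every `train-50k-*` folder shares the layout shown for `train-50k-4/` and that `font.txt` is a plain file.

```python
from pathlib import Path

# Folder and file names taken from the tree above.
IMAGE_DIRS = ["i_s", "mask_s", "t_f", "mask_t", "t_t", "t_b"]
LABEL_FILES = ["i_s.txt", "i_t.txt", "font.txt"]


def summarize_split(split_dir):
    """Return (missing entries, per-folder file counts) for one split folder."""
    split = Path(split_dir)
    missing = [d for d in IMAGE_DIRS if not (split / d).is_dir()]
    missing += [f for f in LABEL_FILES if not (split / f).is_file()]
    counts = {
        d: sum(1 for p in (split / d).iterdir() if p.is_file())
        for d in IMAGE_DIRS
        if (split / d).is_dir()
    }
    return missing, counts


def check_train_root(syn_data_dir):
    """Print a one-line report for every split under <syn_data_dir>/train/."""
    for split in sorted((Path(syn_data_dir) / "train").iterdir()):
        if not split.is_dir():
            continue
        missing, counts = summarize_split(split)
        # Differing counts suggest that some samples lack one of their elements.
        uneven = len(set(counts.values())) > 1
        print(f"{split.name}: missing={missing} counts={counts} uneven={uneven}")
```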
3.2 Text Style Pretraining
$ cd prestyle/
# Modify the directory paths in the config file
$ cd configs/
$ vi StyleTrain.yaml
# Start pretraining
$ cd ..
$ python train.py
3.3 Text Glyph Pretraining
$ cd preglyph/
# Modify the directory paths in the config file
$ cd configs/
$ vi GlyphTrain.yaml
# Start pretraining
$ cd ..
$ python pretrain.py
3.4 Prior Guided Training
$ cd TextCtrl/
# Modify the directory paths in the config file
$ cd configs/
$ vi train.yaml
# Start training
$ cd ..
$ python train.py
4 Evaluation
4.1 Data Preparation
Download the ScenePair dataset from Link and unzip the files. The structure of each folder is as follows:
├── ScenePair/
│ ├── i_s/ # source cropped text images
│ ├── t_f/ # target cropped text images
│ ├── i_full/ # full-size images
│ ├── i_s.txt # filename and text label of images in i_s/
│ ├── i_t.txt # filename and text label of images in t_f/
│ ├── i_s_full.txt # filename, text label, corresponding full-size image name and location information of images in i_s/
│ └── i_t_full.txt # filename, text label, corresponding full-size image name and location information of images in t_f/
4.2 Generate Images
Before evaluation, generate the edited images for the method under test from the ScenePair dataset and save them in a '.../result_folder/' under the same filenames. Results of some methods on the ScenePair dataset are provided here.
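Because results are matched by filename, it is worth checking the result folder against a reference folder before computing any metric. The sketch below uses a hypothetical helper, `compare_filenames`, and assumes that the filenames in `ScenePair/t_f/` are the ones the results must reproduce.

```python
from pathlib import Path


def compare_filenames(result_dir, gt_dir):
    """Return (missing, extra): names absent from result_dir, and names found only in it."""
    results = {p.name for p in Path(result_dir).iterdir() if p.is_file()}
    targets = {p.name for p in Path(gt_dir).iterdir() if p.is_file()}
    return sorted(targets - results), sorted(results - targets)
```

Both lists are empty when the two folders contain exactly the same filenames.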
4.3 Style Fidelity
SSIM, PSNR, MSE and FID are used to evaluate the style fidelity of the edited results, following qqqyd/MOSTEL.
$ cd evaluation/
$ python evaluation.py --target_path .../result_folder/ --gt_path .../ScenePair/t_f/
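For intuition about two of these metrics, MSE is the mean squared pixel difference and PSNR is `10 * log10(MAX^2 / MSE)`. The sketch below works on flat lists of pixel values and is not the code in evaluation.py; the value range that script uses (for example [0, 1] or [0, 255]) is not stated here, so the `max_value` default is an assumption.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy 4-pixel "images"; the squared differences are 4, 0, 1 and 16.
edited = [52, 55, 61, 66]
reference = [50, 55, 60, 70]
print(mse(edited, reference))  # prints 5.25
print(round(psnr(edited, reference), 2))
```

Lower MSE and higher PSNR both indicate that the edited image is closer to the reference.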
4.4 Text Accuracy
ACC and NED are used to evaluate the text accuracy of the edited results, using the official code and checkpoint from clovaai/deep-text-recognition-benchmark.
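To make the two metrics concrete, the sketch below computes word accuracy and a normalized edit distance score over recognized strings. It follows one common convention, averaging `1 - ED / max(len(pred), len(gt))` over all pairs. It is not the official evaluation code, and any case folding or character filtering applied there is not reproduced.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def accuracy_and_ned(predictions, targets):
    """Return (word accuracy, mean of 1 - ED / max length) over paired strings."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and the same length")
    correct = 0
    ned_sum = 0.0
    for pred, gt in zip(predictions, targets):
        correct += pred == gt
        longest = max(len(pred), len(gt))
        # Two empty strings count as a perfect match.
        ned_sum += 1.0 if longest == 0 else 1.0 - edit_distance(pred, gt) / longest
    n = len(targets)
    return correct / n, ned_sum / n


# "Frist" differs from "First" by two substitutions, so its score is 1 - 2/5.
acc, ned = accuracy_and_ned(["Private", "Frist"], ["Private", "First"])
print(acc, round(ned, 2))  # prints 0.5 0.8
```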
Related Resources
Many thanks to these great projects: lksshw/SRNet, youdao-ai/SRNet-Datagen, qqqyd/MOSTEL, UCSB-NLP-Chang/DiffSTE, ZYM-PKU/UDiffText, TencentARC/MasaCtrl, unilm/textdiffuser, tyxsspa/AnyText.
Citation
@article{zeng2024textctrl,
title={TextCtrl: Diffusion-based scene text editing with prior guidance control},
author={Zeng, Weichao and Shu, Yan and Li, Zhenhang and Yang, Dongbao and Zhou, Yu},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={138569--138594},
year={2024}
}














