
March 16, 2025

TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control [NeurIPS 2024]

[Figure: TextCtrl_model]

TODOs

Release ScenePair benchmark dataset and code of model;
Release checkpoints and inference code;
Release training pipeline;

1 Installation

1.1 Code Preparation

# Clone the repo
$ git clone https://github.com/weichaozeng/TextCtrl.git
$ cd TextCtrl/
# Install required packages
$ conda create --name textctrl python=3.8
$ conda activate textctrl
$ pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu116
$ pip install -r requirement.txt

1.2 Checkpoints Preparation

Download the checkpoints from Link_1 and Link_2. The file structure should be set as follows:

TextCtrl/
├── weights/
│   ├── model.pth                      # weight of style encoder and unet 
│   ├── text_encoder.pth               # weight of pretrained glyph encoder
│   ├── style_encoder.pth              # weight of pretrained style encoder
│   ├── vision_model.pth               # monitor weight
│   ├── ocr_model.pth                  # ocr weight
│   ├── vgg19.pth                      # vgg weight
│   ├── vitstr_base_patch16_224.pth    # vitstr weight
│   └── sd/                            # pretrained weight of stable-diffusion-v1-5
│       ├── vae/
│       ├── unet/
│       └── scheduler/ 
├── README.md
├── ...
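A missing checkpoint typically only surfaces once a script tries to load it, so a quick layout check before running anything may save time. The sketch below lists the entries from the tree above and reports which ones are absent; the function name `missing_weights` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Entries taken from the tree above, relative to the repository root.
REQUIRED_FILES = [
    "weights/model.pth",
    "weights/text_encoder.pth",
    "weights/style_encoder.pth",
    "weights/vision_model.pth",
    "weights/ocr_model.pth",
    "weights/vgg19.pth",
    "weights/vitstr_base_patch16_224.pth",
]
REQUIRED_DIRS = [
    "weights/sd/vae",
    "weights/sd/unet",
    "weights/sd/scheduler",
]


def missing_weights(repo_root):
    """Return the expected weight paths that are absent under repo_root."""
    root = Path(repo_root)
    missing = [p for p in REQUIRED_FILES if not (root / p).is_file()]
    missing += [p for p in REQUIRED_DIRS if not (root / p).is_dir()]
    return missing
```

Calling `missing_weights(".")` from the repository root returns an empty list when the layout is complete. It only checks that the paths exist, not that the files are valid checkpoints.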

2 Inference

2.1 Data Preparation

Inference data should follow the file structure of example/:

TextCtrl/
├── example/
│   ├── i_s/                # source cropped text images
│   ├── i_s.txt             # filename and text label of source images in i_s/
│   └── i_t.txt             # filename and text label of target images
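Since i_s.txt and i_t.txt both pair a filename with a text label, it is worth checking that they describe the same set of images. The sketch below assumes each line has the form `filename label`, separated by whitespace; that line format is an assumption, so compare it against the files in example/ before relying on it. The helper names and sample labels are illustrative.

```python
def parse_labels(text):
    """Map filename -> text label, assuming one 'filename label' pair per line."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split once on whitespace so the label itself may contain spaces.
        parts = line.split(maxsplit=1)
        labels[parts[0]] = parts[1] if len(parts) > 1 else ""
    return labels


def unpaired(source_labels, target_labels):
    """Return filenames that appear in only one of the two label files."""
    return sorted(set(source_labels) ^ set(target_labels))


# Hypothetical file contents, for illustration only.
src = parse_labels("0.png Private\n1.png First\n")
tgt = parse_labels("0.png Public\n2.png Second\n")
print(unpaired(src, tgt))
```

In practice the two strings would come from reading example/i_s.txt and example/i_t.txt.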

2.2 Edit Arguments

Edit the arguments in inference.py, especially:

parser.add_argument("--ckpt_path", type=str, default="weights/model.pth")
parser.add_argument("--dataset_dir", type=str, default="example/")
parser.add_argument("--output_dir", type=str, default="example_result/")
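Because these are ordinary `argparse` defaults, they can presumably also be overridden on the command line instead of editing the file, assuming inference.py parses `sys.argv` in the usual way. The sketch below mirrors only the three arguments quoted above (the real script may define more) and shows how an explicit flag replaces its default; `build_parser` and the path `my_result/` are illustrative.

```python
import argparse


def build_parser():
    # Mirrors only the three arguments quoted above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--ckpt_path", type=str, default="weights/model.pth")
    parser.add_argument("--dataset_dir", type=str, default="example/")
    parser.add_argument("--output_dir", type=str, default="example_result/")
    return parser


# Passing a flag overrides its default; omitted flags keep theirs.
args = build_parser().parse_args(["--output_dir", "my_result/"])
print(args.ckpt_path, args.dataset_dir, args.output_dir)
```

Under that assumption, the equivalent shell invocation would append `--output_dir my_result/` to the inference command.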

2.3 Generate Images

After running the following command, the inference results can be found in example_result/:

$ PYTHONPATH=.../TextCtrl/ python inference.py

2.4 Inference Results

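Source Images	Target Text	Infer Results	Reference GT
[image]	"Private"	[image]	[image]
[image]	"First"	[image]	[image]
[image]	"RECORDS"	[image]	[image]
[image]	"Sunset"	[image]	[image]
[image]	"Network"	[image]	[image]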

3 Training

3.1 Data Preparation

Training relies on synthetic data generated by SRNet-Datagen, with some modifications to produce the required elements. The file structure should be set as follows:

Syn_data/
├── fonts/
│   ├── arial.ttf
│   └── ...
├── train/
│   ├── train-50k-1/                    
│   ├── train-50k-2/            
│   ├── train-50k-3/              
│   └── train-50k-4/                     
│       ├── i_s/
│       ├── mask_s/
│       ├── i_s.txt
│       ├── t_f/
│       ├── mask_t/
│       ├── i_t.txt
│       ├── t_t/
│       ├── t_b/
│       └── font.txt
└── eval/
    └── eval-1k/

3.2 Text Style Pretraining

$ cd prestyle/
# Modify the path of dir in the config file
$ cd configs/
$ vi StyleTrain.yaml
# Start pretraining
$ cd ..
$ python train.py

3.3 Text Glyph Pretraining

$ cd preglyph/
# Modify the path of dir in the config file
$ cd configs/
$ vi GlyphTrain.yaml
# Start pretraining
$ cd ..
$ python pretrain.py

3.4 Prior Guided Training

$ cd TextCtrl/
# Modify the path of dir in the config file
$ cd configs/
$ vi train.yaml
# Start training
$ cd ..
$ python train.py

4 Evaluation

4.1 Data Preparation

Download the ScenePair dataset from Link and unzip the files. The structure of each folder is as follows:

├── ScenePair/
│   ├── i_s/                # source cropped text images
│   ├── t_f/                # target cropped text images
│   ├── i_full/             # full-size images
│   ├── i_s.txt             # filename and text label of images in i_s/
│   ├── i_t.txt             # filename and text label of images in t_f/
│   ├── i_s_full.txt        # filename, text label, corresponding full-size image name and location information of images in i_s/
│   └── i_t_full.txt        # filename, text label, corresponding full-size image name and location information of images in t_f/

Before evaluation, generate the edited images for the method under evaluation from the ScenePair dataset and save them in '.../result_folder/', keeping the filenames of the corresponding ScenePair images. Results of some methods on the ScenePair dataset are provided here.
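Since the evaluation depends on matching filenames, a quick comparison of the result folder against the ground-truth folder can catch missing or misnamed outputs early. This is a minimal sketch; `compare_filenames` is a hypothetical helper, not part of the repository, and it compares names only, not image contents.

```python
from pathlib import Path


def compare_filenames(result_dir, gt_dir):
    """Return (missing, extra): names in gt_dir absent from result_dir, and vice versa."""
    results = {p.name for p in Path(result_dir).iterdir() if p.is_file()}
    gts = {p.name for p in Path(gt_dir).iterdir() if p.is_file()}
    return sorted(gts - results), sorted(results - gts)
```

For example, `compare_filenames(".../result_folder/", ".../ScenePair/t_f/")` should return two empty lists when every ground-truth image has a correspondingly named result.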

4.3 Style Fidelity

SSIM, PSNR, MSE and FID are used to evaluate the style fidelity of the edited results, with reference to qqqyd/MOSTEL.

$ cd evaluation/
$ python evaluation.py --target_path .../result_folder/ --gt_path .../ScenePair/t_f/
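For intuition about two of these metrics, MSE is the mean squared pixel difference and PSNR is derived from it as 10 · log10(MAX² / MSE). The sketch below is a dependency-free illustration of those two standard definitions on flattened pixel sequences; it is not the code in evaluation.py, and the assumption that pixels are scaled to [0, 1] (`max_value=1.0`) may differ from the actual script. SSIM and FID are omitted.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(pred) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

Lower MSE and higher PSNR both indicate that the edited image is closer to the ground truth.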

4.4 Text Accuracy

ACC and NED are used to evaluate the text accuracy of the edited results, with the official code and checkpoint in clovaai/deep-text-recognition-benchmark.
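To make the two metrics concrete: ACC is the fraction of recognized strings that exactly match the target text, and NED is based on the normalized edit distance between them. The sketch below is an illustration under stated assumptions, not the benchmark's code: it takes NED as 1 minus the Levenshtein distance divided by the longer string's length, and compares strings case-sensitively. The official implementation may normalize or preprocess differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def accuracy_and_ned(predictions, targets):
    """Return (ACC, NED) averaged over paired predicted and target strings."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("need two non-empty lists of equal length")
    correct = 0
    ned_sum = 0.0
    for pred, gt in zip(predictions, targets):
        correct += pred == gt
        longest = max(len(pred), len(gt))
        ned_sum += 1.0 if longest == 0 else 1.0 - edit_distance(pred, gt) / longest
    n = len(targets)
    return correct / n, ned_sum / n
```

Here `predictions` would be the text recognized from the edited images and `targets` the labels from i_t.txt.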

Many thanks to these great projects: lksshw/SRNet, youdao-ai/SRNet-Datagen, qqqyd/MOSTEL, UCSB-NLP-Chang/DiffSTE, ZYM-PKU/UDiffText, TencentARC/MasaCtrl, unilm/textdiffuser, tyxsspa/AnyText.

Citation

@article{zeng2024textctrl,
  title={TextCtrl: Diffusion-based scene text editing with prior guidance control},
  author={Zeng, Weichao and Shu, Yan and Li, Zhenhang and Yang, Dongbao and Zhou, Yu},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={138569--138594},
  year={2024}
}
