FastGECToR

August 2, 2023

1. Introduction

A faster and simpler implementation of GECToR – Grammatical Error Correction: Tag, Not Rewrite [1], with AMP (automatic mixed precision) and distributed training support via DeepSpeed.

Note: To make the code faster and more readable, we removed the allennlp dependency and rewrote the related code.

2. Requirements

  1. Install PyTorch with CUDA support: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  2. Install NVIDIA Apex with CUDA and C++ extensions
  3. Install the remaining packages: pip install -r ./requirements.txt

3. Data Processing

  1. Tokenize your data (one sentence per line, words separated by spaces)
  2. Generate edits from the parallel sentences: bash scripts/prepare_data.sh
  3. (Optional) Define your own target vocab (data/vocabulary/labels.txt)
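
To give a feel for what "generating edits from parallel sentences" means, the sketch below derives word-level edit operations from one tokenized source/target pair using Python's standard difflib module. It is an illustration only: the function name word_edits is hypothetical, and scripts/prepare_data.sh may align sentences and encode labels differently.

```python
from difflib import SequenceMatcher


def word_edits(source: str, target: str):
    """Return (operation, source_words, target_words) for one sentence pair.

    Both inputs are assumed to be tokenized already (words separated by spaces).
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged words need no edit
        # op is one of "replace", "delete", "insert"
        edits.append((op, src[i1:i2], tgt[j1:j2]))
    return edits


if __name__ == "__main__":
    print(word_edits("She go to school yesterday",
                     "She went to school yesterday"))
```

A sequence tagger such as GECToR predicts one edit label per source token, so an alignment like this is the raw material that the data preparation step turns into per-token labels.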

4. Configuration

  • We use DeepSpeed configs to support distributed training and fp16/bf16 data types. See the DeepSpeed config JSON documentation for details.
  • We use max_num_tokens to limit the length of a sequence after tokenization, instead of max_len (word level), because the word count does not match the length of input_ids and can be misleading.
  • Use --segmented 1 if your input file has spaces between words. With --segmented 0, the input is split into characters.
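
The effect of the --segmented flag can be pictured with the small sketch below. The helper split_units is hypothetical and only mirrors the behaviour described above; in particular, dropping whitespace in character mode is an assumption, not something confirmed by the repository code.

```python
def split_units(line: str, segmented: bool):
    """Split a line into words (segmented input) or single characters."""
    line = line.strip()
    if segmented:
        # Input already has spaces between words: split on whitespace.
        return line.split()
    # Unsegmented input: treat every non-space character as one unit
    # (assumption: whitespace itself is not kept as a unit).
    return [ch for ch in line if not ch.isspace()]


if __name__ == "__main__":
    print(split_units("I like it", segmented=True))
    print(split_units("我喜欢", segmented=False))
```

Either way, the model's tokenizer may still break a unit into several subword tokens, which is why the sequence limit is expressed in tokens (max_num_tokens) rather than in words.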

5. Training

  • Edit deepspeed_config.json according to your config params. lr, train_batch_size, and gradient_accumulation_steps are inherited from the DeepSpeed config file.
bash scripts/train.sh
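
As a rough picture of how those three values can be picked up from a DeepSpeed config, here is a minimal sketch. It assumes the learning rate is stored under optimizer.params.lr, which is the standard DeepSpeed layout; the helper read_training_params and the sample values are hypothetical, and the training script in this repository may read the file differently.

```python
import json

# Illustrative config text; the numbers are placeholders, not recommendations.
EXAMPLE_CONFIG_TEXT = """
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 4,
  "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}}
}
"""


def read_training_params(config: dict) -> dict:
    """Pull lr, train_batch_size and gradient_accumulation_steps out of a config."""
    optimizer_params = config.get("optimizer", {}).get("params", {})
    return {
        "lr": optimizer_params.get("lr"),
        "train_batch_size": config.get("train_batch_size"),
        "gradient_accumulation_steps": config.get("gradient_accumulation_steps"),
    }


EXAMPLE_CONFIG = json.loads(EXAMPLE_CONFIG_TEXT)

if __name__ == "__main__":
    print(read_training_params(EXAMPLE_CONFIG))
```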

* Performance Tuning

  • Suppose you want to train a GECToR model with bert-base-uncased (110M parameters), with the max sequence length set to 256 in all cases. There are several configurations to consider for better performance and efficiency.

  • The baseline is a single GPU without any tricks, which gives the following statistics.

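    global batch size | n_gpus | MaxMemAllocated (CUDA) | GPU Mem Usage (NVIDIA-SMI)
    ------------------|--------|------------------------|---------------------------
    8                 | 1      | 3.3GB                  | 5880MiB
    16                | 1      | 5.33GB                 | 7610MiB
    32                | 1      | 9.28GB                 | 11712MiB
    64                | 1      | 17.28GB                | 20344MiB
    128               | 1      | 33.25GB                | 36654MiB
    256               | 1      | 65.21GB                | 69864MiB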
  • As you can see, the maximum batch size is limited by GPU memory. The simplest way to get a larger batch size is gradient accumulation, which accumulates gradients over several steps and updates the weights at a given interval. This reduces memory usage considerably, because only one smaller batch is processed per step.

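    global batch size | effective batch size | gradient accumulation steps | n_gpus | MaxMemAllocated (CUDA) | GPU Mem Usage (NVIDIA-SMI)
    ------------------|----------------------|-----------------------------|--------|------------------------|---------------------------
    256               | 256                  | 1                           | 1      | 65.21GB                | 69864MiB
    256               | 128                  | 2                           | 1      | 33.68GB                | 36654MiB
    256               | 64                   | 4                           | 1      | 17.71GB                | 20152MiB
    256               | 32                   | 8                           | 1      | 9.7GB                  | 12344MiB
    256               | 16                   | 16                          | 1      | 5.76GB                 | 8018MiB
    256               | 8                    | 32                          | 1      | 3.72GB                 | 5872MiB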
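
Why gradient accumulation can stand in for a large batch is easiest to see on a toy problem. The sketch below uses a one-parameter model y = w * x with a mean squared error loss (both chosen purely for illustration, not taken from this repository) and checks that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1 / number_of_steps."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch size must be divisible by the micro-batch size")
    steps = len(xs) // micro_batch_size
    total = 0.0
    for s in range(steps):
        lo, hi = s * micro_batch_size, (s + 1) * micro_batch_size
        # Only this slice would need to be held in memory at once.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 9)]
    ys = [2.0 * x + 1.0 for x in xs]
    print(grad_mse(0.5, xs, ys), accumulated_grad(0.5, xs, ys, micro_batch_size=2))
```

The two printed numbers agree up to floating-point rounding, while each accumulation step only touches a micro-batch of 2 samples instead of all 8.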
  • Another way to train with a large batch size is data parallelism, which replicates the model and splits each data batch across DP ranks to reduce the memory consumed per GPU.

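    global batch size | n_gpus | MaxMemAllocated (CUDA) | Per GPU Mem Usage (NVIDIA-SMI)
    ------------------|--------|------------------------|-------------------------------
    256               | 1      | 65.21GB                | 69864MiB
    256               | 2      | 33.25GB                | 37038MiB
    256               | 4      | 17.28GB                | 21160MiB
    256               | 8      | 9.28GB                 | 12616MiB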
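
The batch sizes in these tables follow the relation DeepSpeed uses between its batch settings: train_batch_size = micro batch per GPU × gradient_accumulation_steps × number of GPUs. The helper below (a hypothetical name, written only to make the arithmetic explicit) computes how many samples each GPU processes per forward/backward pass.

```python
def micro_batch_per_gpu(train_batch_size: int,
                        gradient_accumulation_steps: int,
                        n_gpus: int) -> int:
    """Samples processed by one GPU in a single forward/backward pass."""
    denom = gradient_accumulation_steps * n_gpus
    if train_batch_size % denom != 0:
        raise ValueError(
            "train_batch_size must be divisible by "
            "gradient_accumulation_steps * n_gpus"
        )
    return train_batch_size // denom


if __name__ == "__main__":
    # (global batch, accumulation steps, GPUs) combinations from the tables.
    for setup in [(256, 1, 1), (256, 32, 1), (256, 1, 8)]:
        print(setup, "->", micro_batch_per_gpu(*setup))
```

With 8 GPUs and no accumulation, each GPU sees 32 samples per step, which is consistent with the 9.28GB figure reported for a single-GPU batch size of 32 in the first table.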
  • Memory usage can be reduced further. For example, you can train with FP16 for efficiency at the cost of lower precision. DeepSpeed’s ZeRO optimizations can also be used, alone or together with FP16, in distributed training. Note that for small models, higher ZeRO stages may not help. In most cases, ZeRO stage 1 (optimizer state partitioning) is enough.

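    global batch size | n_gpus | use fp16 | use zero1 | MaxMemAllocated (CUDA) | Per GPU Mem Usage (NVIDIA-SMI)
    ------------------|--------|----------|-----------|------------------------|-------------------------------
    256               | 1      | False    | False     | 65.21GB                | 69864MiB
    256               | 1      | True     | False     | 35.18GB                | 38594MiB
    256               | 8      | False    | False     | 9.28GB                 | 12616MiB
    256               | 8      | True     | False     | 5.71GB                 | 9066MiB
    256               | 8      | False    | True      | 8.59GB                 | 12172MiB
    256               | 8      | True     | True      | 4.64GB                 | 7610MiB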
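
For orientation, the snippet below builds a config dictionary that switches on FP16 and ZeRO stage 1 and prints it as JSON. The keys used (train_batch_size, gradient_accumulation_steps, optimizer, fp16.enabled, zero_optimization.stage) are standard DeepSpeed config fields; the function build_config, the AdamW optimizer choice, and the learning rate are placeholders, and the deepspeed_config.json shipped with this repository may contain additional settings.

```python
import json


def build_config(train_batch_size: int,
                 gradient_accumulation_steps: int,
                 lr: float,
                 use_fp16: bool,
                 zero_stage: int) -> dict:
    """Assemble a minimal DeepSpeed-style config dictionary."""
    return {
        "train_batch_size": train_batch_size,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        # Optimizer type and lr are illustrative placeholders.
        "optimizer": {"type": "AdamW", "params": {"lr": lr}},
        "fp16": {"enabled": use_fp16},
        # Stage 1 partitions optimizer states across data-parallel ranks.
        "zero_optimization": {"stage": zero_stage},
    }


if __name__ == "__main__":
    config = build_config(256, 1, lr=1e-5, use_fp16=True, zero_stage=1)
    print(json.dumps(config, indent=2))
```

This corresponds to the last row of the table above (fp16 and zero1 both enabled); setting use_fp16=False or zero_stage=0 gives the other combinations.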
  • There are other strategies for maximizing hardware utilization and performance. See https://www.deepspeed.ai/ for details.

6. Inference

  • Edit deepspeed_config.json according to your config params
bash scripts/predict.sh

Reference

[1] Omelianchuk, K., Atrasevych, V., Chernodub, A., & Skurzhanskyi, O. (2020). GECToR – Grammatical Error Correction: Tag, Not Rewrite. arXiv:2005.12592 [cs]. http://arxiv.org/abs/2005.12592