LAMB Optimizer (TensorFlow)

January 17, 2020

This is a simple implementation of the LAMB optimizer, which was introduced in the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes".

An earlier version of the paper was titled "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes".

Update: an official implementation of the LAMB optimizer is now available: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py

Notes

  • This is NOT an official implementation.
  • The LAMB algorithm changed slightly between arXiv v1 and v3.
  • This implementation follows v3, which was the latest version as of June 2019.
  • Some unclear parts (such as the scaling function) were clarified by consulting the original authors.

Algorithm

The LAMB optimizer was originally designed for large-batch training of neural networks, but it can also be used with small batch sizes, as indicated by the authors.

algorithm.png
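
To make the update rule concrete, here is a minimal NumPy sketch of one LAMB step for a single layer. It is an illustration under stated assumptions, not the code in optimization.py: the function name `lamb_step` and its arguments are hypothetical, the scaling function is assumed to be the identity, and the bias-correction step is included as an assumption, since the versions of the paper differ in the details.

```python
import numpy as np


def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One LAMB update for a single layer's weights (illustrative sketch)."""
    # Adam-style first and second moment estimates.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # Bias correction (assumption: not every version of the paper uses it).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Adam direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio; the scaling function is assumed to be identity.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    w = w - lr * trust_ratio * update
    return w, m, v


w = np.array([3.0, 4.0])  # ||w|| = 5
g = np.array([0.5, -2.0])
w_new, m, v = lamb_step(w, g, np.zeros(2), np.zeros(2), t=1, lr=0.1)
step_norm = np.linalg.norm(w_new - w)  # approximately lr * ||w||
```

Because the update is rescaled by the trust ratio, the length of each layer's step is roughly `lr * ||w||` regardless of the gradient magnitude, which is the property that makes the method robust to large batches.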

Usage

The implementation is based on the BERT repository, which uses AdamWeightDecayOptimizer (defined in optimization.py) for pre-training and fine-tuning.

  • Use LAMBOptimizer like any regular TensorFlow optimizer, in the same way as Adam or AdamWeightDecayOptimizer.
  • The LAMB optimizer is defined in optimization.py.
  • There is nothing special to tune other than the initial learning_rate.

Results on MNIST

  • I don't have a TPU Pod to test its scalability on BERT with large batches 😂, so I tested it on MNIST to verify its effectiveness.
  • All optimizers use an initial learning rate of 0.001 (the default setting), which was NOT scaled to the batch size (scaling may bring further gains, but that is left for you to test).
  • All experiments were run on an NVIDIA Tesla T4.
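
The experiments here keep the learning rate fixed, but for anyone who wants to try scaling it with the batch size, one common heuristic is square-root scaling. The sketch below is only an illustration: the helper name `sqrt_scaled_lr` and the choice of 64 as the reference batch size are assumptions, and this scaling was not used for any result in this README.

```python
import math


def sqrt_scaled_lr(base_lr, base_batch, batch):
    """Scale a learning rate by the square root of the batch-size ratio."""
    # Hypothetical helper: base_lr is the rate tuned at base_batch.
    return base_lr * math.sqrt(batch / base_batch)


# Example: a rate of 0.001 tuned at batch 64, rescaled for batch 1024.
lr_1024 = sqrt_scaled_lr(0.001, 64, 1024)
```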

Here are the numbers for five classical neural networks (MLP, CNN, Bi-RNN, Bi-GRU, Bi-LSTM) with three optimizers (Adam, AdamW, LAMB).

I only list results for batch sizes {64, 128, 1024, 16384}. For full results, please see FULL_RESULTS.md.

Batch=64

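| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
|---|---|---|---|---|---|---|
| Adam | 97.03 | 98.93 | 96.24 | 98.92 | 99.04 | Just ordinary Adam |
| AdamW | 97.11 | 99.01 | 96.50 | 99.11 | 99.04 | Used in BERT |
| LAMB | 98.27 | 99.33 | 97.73 | 98.83 | 98.94 | New optimizer for large batch |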

Batch=128

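| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
|---|---|---|---|---|---|---|
| Adam | 96.38 | 98.76 | 97.73 | 99.08 | 99.09 | Just ordinary Adam |
| AdamW | 96.57 | 98.72 | 98.05 | 98.96 | 99.00 | Used in BERT |
| LAMB | 97.90 | 99.20 | 98.04 | 98.87 | 98.76 | New optimizer for large batch |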

Batch=1024

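| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
|---|---|---|---|---|---|---|
| Adam | 93.05 | 97.92 | 98.10 | 98.94 | 98.67 | Just ordinary Adam |
| AdamW | 93.67 | 98.00 | 98.19 | 98.86 | 98.82 | Used in BERT |
| LAMB | 97.68 | 98.82 | 98.27 | 98.61 | 98.47 | New optimizer for large batch |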

Batch=16384

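| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
|---|---|---|---|---|---|---|
| Adam | 88.46 | 95.06 | 95.98 | 97.81 | 97.74 | Just ordinary Adam |
| AdamW | 91.46 | 96.57 | 96.34 | 98.45 | 98.39 | Used in BERT |
| LAMB | 93.23 | 97.89 | 93.76 | 87.60 | 80.36 | New optimizer for large batch |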

Several Conclusions

Note: These conclusions are based only on the results above.

  • LAMB outperforms Adam and AdamW in most cases and gives consistent results across batch sizes.
  • LAMB shows a clear advantage over Adam and AdamW at large batch sizes on the MLP and CNN, which suggests good scalability.
  • LAMB fails to outperform Adam and AdamW on the more complex RNN-based models (Bi-GRU, Bi-LSTM), regardless of batch size.

Reproducibility

Check mnist_tensorflow.ipynb for details.

Note: Runs on GPU/TPU will not produce exactly the same results, even with a fixed random seed.

Issues

For help or issues, please submit a GitHub issue.