
November 5, 2020

EAdam Optimizer

arXiv: EAdam Optimizer: How ε Impact Adam

Introduction

Through experiments, we find that simply changing the position of epsilon can yield better performance than Adam. Based on this finding, we propose a new variant of Adam called EAdam, which requires no extra hyper-parameters or computational cost. We also discuss the relationships and differences between our method and Adam. We perform a thorough evaluation of EAdam against popular and recent optimization methods, including Adam, RAdam and AdaBelief, on different deep learning tasks. We focus on the following tasks: image classification on CIFAR-10 and CIFAR-100, language modeling on Penn Treebank and object detection on PASCAL VOC.
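To make the change of position concrete, here is a minimal pure-Python sketch of one scalar update step for each optimizer. It assumes the standard Adam recursion, and it assumes that EAdam adds epsilon to the second-moment accumulator at every step and then divides by the plain square root. The names `adam_step`, `eadam_step` and `run` are illustrative and are not taken from this repository's code.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update: eps is added after the square root."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def eadam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar EAdam-style update (assumed form): eps is accumulated into v."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad + eps  # the only change vs. Adam
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / math.sqrt(v_hat)    # no eps outside the root
    return theta, m, v

def run(step, n_steps=2000):
    """Minimise f(x) = x^2 (gradient 2x) from x = 1.0 with the given rule."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        theta, m, v = step(theta, 2.0 * theta, m, v, t)
    return theta

print("Adam :", run(adam_step))
print("EAdam:", run(eadam_step))
```

Both functions take the same arguments and hyper-parameters, which mirrors the claim that the variant needs no extra hyper-parameters. The toy quadratic is only a smoke test and says nothing about performance on real tasks.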

Algorithm

According to the update formulas in the algorithms, $v_t$ can be expressed in terms of the gradients at all previous timesteps. Applying the bias correction step then gives the adaptive stepsizes of Adam and EAdam, which the following discussion writes in terms of $G_t$.
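As a sketch of the quantities involved, assume the standard Adam recursion $v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$ with $v_0=0$, and assume that EAdam uses $v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2+\epsilon$. The superscript labels are notation introduced here for illustration. Unrolling both recursions gives

$$
v_t^{\text{Adam}}=(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}g_i^2,
\qquad
v_t^{\text{EAdam}}=(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}g_i^2+\epsilon\sum_{i=1}^{t}\beta_2^{t-i}.
$$

Dividing by $1-\beta_2^t$ and using $\sum_{i=1}^{t}\beta_2^{t-i}=(1-\beta_2^t)/(1-\beta_2)$ gives

$$
\hat{v}_t^{\text{Adam}}=G_t,
\qquad
\hat{v}_t^{\text{EAdam}}=G_t+\frac{\epsilon}{1-\beta_2},
\qquad
G_t=\frac{1-\beta_2}{1-\beta_2^t}\sum_{i=1}^{t}\beta_2^{t-i}g_i^2.
$$

Under these assumptions, the adaptive stepsizes are

$$
\frac{\alpha}{\sqrt{G_t}+\epsilon^{'}}\ \text{(Adam)},
\qquad
\frac{\alpha}{\sqrt{G_t+\epsilon/(1-\beta_2)}}\ \text{(EAdam)}.
$$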
We first let $\epsilon^{'}=\epsilon=10^{-8}$ and analyse how the stepsizes of Adam and EAdam differ when training deep networks. At the beginning of training, the elements of $G_t$ are far larger than $\epsilon^{'}$ and $\epsilon$, so the stepsizes in Adam and EAdam can both be approximated as $\alpha/\sqrt{G_t}$. In this case, the stepsize is determined by $G_t$. Later, the elements of $G_t$ may become small enough that $\epsilon^{'}$ or $\epsilon$ has a noticeable effect. In this case, the stepsize is determined by both $G_t$ and $\epsilon^{'}$ ($\epsilon$). It is easy to see that this case happens earlier in EAdam because $\epsilon$ is added to $G_t$ rather than to $\sqrt{G_t}$. Finally, the elements of $G_t$ may become far smaller than $\epsilon^{'}$ or $\epsilon$, and the stepsizes become approximately $\alpha/\epsilon^{'}$ in Adam and $\alpha\sqrt{1-\beta_2}/\sqrt{\epsilon}$ in EAdam. In this case, EAdam takes a smaller stepsize than Adam.
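The three regimes can be checked numerically. The sketch below assumes the stepsize forms $\alpha/(\sqrt{G_t}+\epsilon^{'})$ for Adam and $\alpha/\sqrt{G_t+\epsilon/(1-\beta_2)}$ for EAdam, with $\alpha=10^{-3}$, $\beta_2=0.999$ and $\epsilon^{'}=\epsilon=10^{-8}$. The function names and the sample values of $G_t$ are chosen here for illustration.

```python
import math

def adam_stepsize(G, alpha=1e-3, eps_prime=1e-8):
    """Adam's per-element stepsize: eps' is added to sqrt(G_t)."""
    return alpha / (math.sqrt(G) + eps_prime)

def eadam_stepsize(G, alpha=1e-3, eps=1e-8, beta2=0.999):
    """EAdam's per-element stepsize (assumed form): eps/(1-beta2) is added to G_t."""
    return alpha / math.sqrt(G + eps / (1 - beta2))

# Large G_t, intermediate G_t, tiny G_t, and the limit G_t = 0.
for G in (1.0, 1e-5, 1e-12, 0.0):
    print(f"G_t={G:g}  Adam={adam_stepsize(G):.3e}  EAdam={eadam_stepsize(G):.3e}")
```

With these defaults, both stepsizes are close to $\alpha$ when $G_t=1$. At $G_t=0$, Adam's stepsize is $\alpha/\epsilon^{'}=10^{5}$, while EAdam's is about $0.32$.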
We can see that EAdam essentially adds a constant multiple of $\epsilon$ to $G_t$ before the square root operation. However, this operation is not equivalent to adding a fixed constant $\epsilon^{'}$ to $\sqrt{G_t}$. In other words, for a known $\epsilon$ we cannot find a fixed constant $\epsilon^{'}$ such that $\sqrt{G_t}+\epsilon^{'}=\sqrt{G_t+\epsilon/(1-\beta_2)}$, for the following reason. Suppose that $\sqrt{G_t}+\epsilon^{'}=\sqrt{G_t+\epsilon/(1-\beta_2)}$ holds for a known $\epsilon$. Then we have

$\epsilon^{'}=\sqrt{G_t+\epsilon/(1-\beta_2)}-\sqrt{G_t}$

Because $G_t$ is constantly updated, $\epsilon^{'}$ also changes with $G_t$ during the iterative process. Therefore, $\epsilon^{'}$ is not fixed. From this interpretation, the change in EAdam can be seen as adopting an adaptive $\epsilon$ rather than the constant used in Adam. To sum up, this subsection gives some intuitive comparisons and explanations for EAdam. However, explaining theoretically why EAdam performs better may be difficult and is worth further study.
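A short numerical illustration of why no fixed $\epsilon^{'}$ works: the sketch below evaluates $\epsilon^{'}=\sqrt{G_t+\epsilon/(1-\beta_2)}-\sqrt{G_t}$ for several values of $G_t$, assuming $\epsilon=10^{-8}$ and $\beta_2=0.999$. The function name `implied_eps_prime` and the sample values are hypothetical and chosen only for illustration.

```python
import math

def implied_eps_prime(G, eps=1e-8, beta2=0.999):
    """eps' that would make sqrt(G) + eps' equal sqrt(G + eps / (1 - beta2))."""
    return math.sqrt(G + eps / (1 - beta2)) - math.sqrt(G)

# The implied eps' is not a constant: it depends on the current G_t.
for G in (1.0, 1e-2, 1e-4, 1e-6, 0.0):
    print(f"G_t={G:g}  implied eps'={implied_eps_prime(G):.3e}")
```

The implied value grows as $G_t$ shrinks, from roughly $5\times10^{-6}$ at $G_t=1$ to roughly $3.2\times10^{-3}$ at $G_t=0$, which is the sense in which EAdam behaves like Adam with an adaptive $\epsilon$.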

Experiments

We have not yet tuned the hyper-parameters carefully or repeated the experiments; this will be supplemented in the future.

Code is based on:

CIFAR10 and CIFAR100

Experiments are based on torch 1.4.0.
Parameter settings for all methods are shown in the following table.

| lr | beta1 | beta2 | eps | weight decay | batch size |
|---|---|---|---|---|---|
| 1e-3 | 0.9 | 0.999 | 1e-8 | 5e-4 | 128 |

Results:


Penn Treebank

Experiments are based on torch 1.1.0.
Parameter settings are shown in the following table.

| model | lr | beta1 | beta2 | eps | weight decay | batch size |
|---|---|---|---|---|---|---|
| 1-layer LSTM | 1e-3 | 0.9 | 0.999 | 1e-8 (EAdam and AdaBelief: 1e-16) | 1.2e-6 | 20 |
| 2-layer LSTM | 1e-2 (RAdam: 1e-3) | 0.9 | 0.999 | 1e-8 (EAdam and AdaBelief: 1e-16) | 1.2e-6 | 20 |
| 2-layer LSTM | 1e-2 (RAdam: 1e-3) | 0.9 | 0.999 | 1e-8 (EAdam and AdaBelief: 1e-16) | 1.2e-6 | 20 |

Results:

PASCAL VOC

Experiments are based on torch 1.6.0, torchvision 0.7.0 and mmcv-full 1.1.6.
Parameter settings for all methods are shown in the following table.

| lr | beta1 | beta2 | eps | weight decay | batch size |
|---|---|---|---|---|---|
| 1e-4 | 0.9 | 0.999 | 1e-8 | 1e-4 | 2 |

Results:


Plan

We will tune the hyper-parameters carefully and repeat the experiments in the future. We may add extra experiments, including image classification on ImageNet and object detection on COCO. More experimental data will be published in this repository in the future.