PyTorch Image Classification

February 17, 2019

This is a fork of the original PyTorch Image Classification repository.

The following papers are implemented using PyTorch.

ResNet (1512.03385)
ResNet-preact (1603.05027)
WRN (1605.07146)
DenseNet (1608.06993)
PyramidNet (1610.02915)
ResNeXt (1611.05431)
shake-shake (1705.07485)
LARS (1708.03888, 1801.03137)
Cutout (1708.04552)
Random Erasing (1708.04896)
SENet (1709.01507)
Mixup (1710.09412)
Dual-Cutout (1802.07426)
RICAP (1811.09030)

Requirements

Python >= 3.6
PyTorch >= 1.0.0
torchvision
tensorboardX (optional)

Usage

$ ./main.py --arch resnet_preact --depth 56 --outdir results

Use Cutout

$ ./main.py --arch resnet_preact --depth 56 --outdir results --use_cutout
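As a rough illustration of what the Cutout option refers to (1708.04552), the sketch below zeroes out a randomly placed square in an image. It is framework-free and uses nested lists for readability; the function name `cutout` and its parameters are assumptions made for this example, not this repository's implementation.

```python
import random


def cutout(image, size, rng=random):
    """Return a copy of `image` (a list of rows) with a size x size square zeroed.

    The square is centered on a random pixel and clipped at the borders,
    so the masked area can be smaller than size x size near the edges.
    """
    height, width = len(image), len(image[0])
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, left = max(0, cy - size // 2), max(0, cx - size // 2)
    bottom = min(height, cy - size // 2 + size)
    right = min(width, cx - size // 2 + size)
    out = [row[:] for row in image]  # copy, so the input stays untouched
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = 0
    return out


image = [[1] * 32 for _ in range(32)]
masked = cutout(image, size=16, rng=random.Random(0))
num_masked = sum(row.count(0) for row in masked)
```

In a real pipeline the mask would be applied per sample on image tensors after the other augmentations.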

Use RandomErasing

$ ./main.py --arch resnet_preact --depth 56 --outdir results --use_random_erasing

Use Mixup

$ ./main.py --arch resnet_preact --depth 56 --outdir results --use_mixup
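For readers unfamiliar with mixup (1710.09412), the sketch below blends two examples and their one-hot labels with a coefficient drawn from Beta(alpha, alpha). The helper `mixup` and the toy inputs are hypothetical and only illustrate the idea; the repository's own code may differ.

```python
import random


def mixup(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy "images" (flattened) with one-hot labels for a 2-class problem.
x, y, lam = mixup([0.0, 1.0, 2.0], [1.0, 0.0],
                  [2.0, 1.0, 0.0], [0.0, 1.0],
                  alpha=1.0, rng=random.Random(0))
```

With alpha=1 the coefficient is uniform on [0, 1]; smaller alpha values push it toward 0 or 1, so the mixed samples stay closer to one of the originals.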

Use cosine annealing

$ ./main.py --arch wrn --outdir results --scheduler cosine
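Cosine annealing without restarts (1608.03983) decays the learning rate from its initial value along half a cosine period. A minimal sketch, assuming a minimum learning rate of 0 and an arbitrary length of 160 epochs; `cosine_lr` is a hypothetical helper, not a function of this repository.

```python
import math


def cosine_lr(base_lr, step, total_steps, min_lr=0.0):
    """Learning rate at `step` for a single cosine cycle without restarts."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, quarter points, and end of training.
schedule = [cosine_lr(0.2, epoch, 160) for epoch in (0, 40, 80, 120, 160)]
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the last step.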

Results on Kuzushiji-49

Comparison of models and batch sizes

Model	Batch size	Balanced accuracy	# of epochs	Training time
DenseNet-100 (k=12)	1536	96.03	1000	34h27m
DenseNet-100 (k=12)	1536	97.32	1500	47h39m
Shake-Shake-26 2x96d	512	97.41	1000	47h21m
Shake-Shake-26 2x96d	1024	97.57	1000	41h14m

Comparison of different settings when using Shake-Shake model

Model	Batch size	Balanced accuracy	# of epochs	Training time
Shake-Shake-26 2x96d	1024	97.64	1100	47h25m
Shake-Shake-26 2x96d *	2048	97.72	1100	21h45m
Shake-Shake-26 2x96d *	2048	98.00	1800	34h25m
Shake-Shake-26 2x96d (cutout 14)	1024	98.10	1100	47h3m
Shake-Shake-26 2x96d (mixup alpha=1)	1024	97.42	1100	47h14m
Shake-Shake-26 2x96d (cutout 14) *	2048	98.16	1100	23h27m
Shake-Shake-26 2x96d (cutout 14) *	2048	98.29	1800	36h15m

* run on eight Tesla V100 GPUs; other experiments were run on four Tesla P100 GPUs
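The tables above report balanced accuracy rather than plain accuracy. Balanced accuracy is commonly defined as the mean of per-class recall, which keeps rare classes from being swamped by frequent ones. The sketch below follows that common definition; `balanced_accuracy` and the toy labels are illustrative assumptions, and the exact evaluation code used for these results is not shown in this document.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy labels: class 0 has four samples, class 2 only one.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]
score = balanced_accuracy(y_true, y_pred)  # (3/4 + 2/2 + 0/1) / 3
```

Plain accuracy on the same toy data would be 5/7, higher than the balanced score, because the single misclassified sample of class 2 counts as a full class there.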

Here are the training arguments used to achieve the best balanced accuracy.

python train.py --dataset K49 --arch shake_shake --depth 26 --base_channels 96 --shake_forward True --shake_backward True --shake_image True --seed 7 --outdir results/k49/shake_shake_26_2x96d_cutout14/04 --epochs 1800 --scheduler cosine --base_lr 0.2 --batch_size 2048 --use_cutout --cutout_size 14

Results on CIFAR-10

Results using almost the same settings as the papers

Model	Test Error (median of 3 runs)	Test Error (in paper)	Training Time
VGG-like (depth 15, w/ BN, channel 64)	7.29	N/A	1h20m
ResNet-110	6.52	6.43 (best), 6.61 +/- 0.16	3h06m
ResNet-preact-110	6.47	6.37 (median of 5 runs)	3h05m
ResNet-preact-164 bottleneck	5.90	5.46 (median of 5 runs)	4h01m
ResNet-preact-1001 bottleneck		4.62 (median of 5 runs), 4.69 +/- 0.20
WRN-28-10	4.03	4.00 (median of 5 runs)	16h10m
WRN-28-10 w/ dropout		3.89 (median of 5 runs)
DenseNet-100 (k=12)	3.87 (1 run)	4.10 (1 run)	24h28m*
DenseNet-100 (k=24)		3.74 (1 run)
DenseNet-BC-100 (k=12)	4.69	4.51 (1 run)	15h20m
DenseNet-BC-250 (k=24)		3.62 (1 run)
DenseNet-BC-190 (k=40)		3.46 (1 run)
PyramidNet-110 (alpha=84)	4.40	4.26 +/- 0.23	11h40m
PyramidNet-110 (alpha=270)	3.92 (1 run)	3.73 +/- 0.04	24h12m*
PyramidNet-164 bottleneck (alpha=270)	3.44 (1 run)	3.48 +/- 0.20	32h37m*
PyramidNet-272 bottleneck (alpha=200)		3.31 +/- 0.08
ResNeXt-29 4x64d	3.89	~3.75 (from Figure 7)	31h17m
ResNeXt-29 8x64d	3.97 (1 run)	3.65 (average of 10 runs)	42h50m*
ResNeXt-29 16x64d		3.58 (average of 10 runs)
shake-shake-26 2x32d (S-S-I)	3.68	3.55 (average of 3 runs)	33h49m
shake-shake-26 2x64d (S-S-I)	2.88 (1 run)	2.98 (average of 3 runs)	78h48m
shake-shake-26 2x96d (S-S-I)	2.90 (1 run)	2.86 (average of 5 runs)	101h32m*

Notes

Differences from the papers in training settings:
- Trained WRN-28-10 with batch size 64 (128 in paper).
- Trained DenseNet-BC-100 (k=12) with batch size 32 and initial learning rate 0.05 (batch size 64 and initial learning rate 0.1 in paper).
- Trained ResNeXt-29 4x64d with a single GPU, batch size 32 and initial learning rate 0.025 (8 GPUs, batch size 128 and initial learning rate 0.1 in paper).
- Trained shake-shake models with a single GPU (2 GPUs in paper).
- Trained shake-shake-26 2x64d (S-S-I) with batch size 64 and initial learning rate 0.1.
Test errors reported above are the ones at the last epoch.
Experiments with only 1 run were done on a different computer from the one used for experiments with 3 runs.
A GeForce GTX 980 was used in these experiments.

VGG-like

$ python -u main.py --arch vgg --seed 7 --outdir results/vgg_15_BN_64/00

ResNet

$ python -u main.py --arch resnet --depth 110 --block_type basic --seed 7 --outdir results/resnet_basic_110/00

ResNet-preact

$ python -u main.py --arch resnet_preact --depth 110 --block_type basic --seed 7 --outdir results/resnet_preact_basic_110/00

$ python -u main.py --arch resnet_preact --depth 164 --block_type bottleneck --seed 7 --outdir results/resnet_preact_bottleneck_164/00

WRN

$ python -u main.py --arch wrn --depth 28 --widening_factor 10 --seed 7 --outdir results/wrn_28_10/00

DenseNet

$ python -u main.py --arch densenet --depth 100 --block_type bottleneck --growth_rate 12 --compression_rate 0.5 --batch_size 32 --base_lr 0.05 --seed 7 --outdir results/densenet_BC_100_12/00

PyramidNet

$ python -u main.py --arch pyramidnet --depth 110 --block_type basic --pyramid_alpha 84 --seed 7 --outdir results/pyramidnet_basic_110_84/00

$ python -u main.py --arch pyramidnet --depth 110 --block_type basic --pyramid_alpha 270 --seed 7 --outdir results/pyramidnet_basic_110_270/00

ResNeXt

$ python -u main.py --arch resnext --depth 29 --cardinality 4 --base_channels 64 --batch_size 32 --base_lr 0.025 --seed 7 --outdir results/resnext_29_4x64d/00

$ python -u main.py --arch resnext --depth 29 --cardinality 8 --base_channels 64 --batch_size 64 --base_lr 0.05 --seed 7 --outdir results/resnext_29_8x64d/00

shake-shake

$ python -u main.py --arch shake_shake --depth 26 --base_channels 32 --shake_forward True --shake_backward True --shake_image True --seed 7 --outdir results/shake_shake_26_2x32d_SSI/00

$ python -u main.py --arch shake_shake --depth 26 --base_channels 64 --shake_forward True --shake_backward True --shake_image True --batch_size 64 --base_lr 0.1 --seed 7 --outdir results/shake_shake_26_2x64d_SSI/00

$ python -u main.py --arch shake_shake --depth 26 --base_channels 96 --shake_forward True --shake_backward True --shake_image True --seed 7 --outdir results/shake_shake_26_2x96d_SSI/00

Results

Model	Test Error (1 run)	# of Epochs	Training Time
WRN-28-10, Cutout 16	3.19	200	16h23m*
WRN-28-10, mixup (alpha=1)	3.32	200	6h35m
WRN-28-10, RICAP (beta=0.3)	2.83	200	6h35m
WRN-28-10, Dual-Cutout (alpha=0.1)	2.87	200	12h42m
WRN-28-10, Cutout 16	3.07	400	13h10m
WRN-28-10, mixup (alpha=1)	3.04	400	13h08m
WRN-28-10, RICAP (beta=0.3)	2.71	400	13h08m
WRN-28-10, Dual-Cutout (alpha=0.1)	2.76	400	25h20m
shake-shake-26 2x64d, Cutout 16	2.64	1800	78h55m*
shake-shake-26 2x64d, mixup (alpha=1)	2.63	1800	35h56m
shake-shake-26 2x64d, RICAP (beta=0.3)	2.29	1800	35h10m
shake-shake-26 2x64d, Dual-Cutout (alpha=0.1)		1800
shake-shake-26 2x96d, Cutout 16	2.50	1800	60h20m
shake-shake-26 2x96d, mixup (alpha=1)	2.36	1800	60h20m
shake-shake-26 2x96d, RICAP (beta=0.3)	2.10	1800	60h20m
shake-shake-26 2x96d, Dual-Cutout (alpha=0.1)	2.41	1800	113h09m

Note

Results reported in the table are the test errors at the last epoch.
All models are trained using cosine annealing with initial learning rate 0.2.
A GeForce GTX 1080 Ti was used in these experiments, except the ones marked with *, which were run on a GeForce GTX 980.
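RICAP (1811.09030) patches together crops from four images and mixes their labels in proportion to the patch areas. The sketch below only computes the patch sizes and label weights for one sample, assuming the boundary position is drawn from Beta(beta, beta) along each axis as described in the paper; `ricap_weights` is a hypothetical helper and the image cropping itself is omitted.

```python
import random


def ricap_weights(width, height, beta=0.3, rng=random):
    """Return the four patch sizes and their label weights (area ratios)."""
    w = int(round(width * rng.betavariate(beta, beta)))
    h = int(round(height * rng.betavariate(beta, beta)))
    # Top-left, top-right, bottom-left, bottom-right patches.
    sizes = [(w, h), (width - w, h), (w, height - h), (width - w, height - h)]
    weights = [pw * ph / (width * height) for pw, ph in sizes]
    return sizes, weights


sizes, weights = ricap_weights(32, 32, beta=0.3, rng=random.Random(0))
```

The weights always sum to 1, so the mixed label remains a valid probability distribution.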

python -u main.py --arch wrn --depth 28 --outdir results/wrn_28_10_cutout16 --epochs 200 --scheduler cosine --base_lr 0.1 --batch_size 64 --seed 17 --use_cutout --cutout_size 16

python -u main.py --arch shake_shake --depth 26 --base_channels 64 --outdir results/shake_shake_26_2x64d_SSI_cutout16 --epochs 300 --scheduler cosine --base_lr 0.1 --batch_size 64 --seed 17 --use_cutout --cutout_size 16

Results on FashionMNIST

Model	Test Error (1 run)	# of Epochs	Training Time
ResNet-preact-20, widening factor 4, Cutout 12	4.17	200	1h32m
ResNet-preact-20, widening factor 4, Cutout 14	4.11	200	1h32m
ResNet-preact-50, Cutout 12	4.45	200	57m
ResNet-preact-50, Cutout 14	4.38	200	57m
ResNet-preact-50, widening factor 4, Cutout 12	4.07	200	3h37m
ResNet-preact-50, widening factor 4, Cutout 14	4.13	200	3h39m
shake-shake-26 2x32d (S-S-I), Cutout 12	4.08	400	3h41m
shake-shake-26 2x32d (S-S-I), Cutout 14	4.05	400	3h39m
shake-shake-26 2x96d (S-S-I), Cutout 12	3.72	400	13h46m
shake-shake-26 2x96d (S-S-I), Cutout 14	3.85	400	13h39m
shake-shake-26 2x96d (S-S-I), Cutout 12	3.65	800	26h42m
shake-shake-26 2x96d (S-S-I), Cutout 14	3.60	800	26h42m

Model	Test Error (median of 3 runs)	# of Epochs	Training Time
ResNet-preact-20	5.04	200	26m
ResNet-preact-20, Cutout 6	4.84	200	26m
ResNet-preact-20, Cutout 8	4.64	200	26m
ResNet-preact-20, Cutout 10	4.74	200	26m
ResNet-preact-20, Cutout 12	4.68	200	26m
ResNet-preact-20, Cutout 14	4.64	200	26m
ResNet-preact-20, Cutout 16	4.49	200	26m
ResNet-preact-20, RandomErasing	4.61	200	26m
ResNet-preact-20, Mixup	4.92	200	26m
ResNet-preact-20, Mixup	4.64	400	52m

Note

Results reported in the tables are the test errors at the last epoch.
All models are trained using cosine annealing with initial learning rate 0.2.
The following data augmentations are applied to the training data:
- Images are padded with 4 pixels on each side, and 28x28 patches are randomly cropped from the padded images.
- Images are randomly flipped horizontally.
GeForce GTX 1080 Ti was used in these experiments.
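The two augmentations listed above (pad by 4 pixels, random 28x28 crop, random horizontal flip) can be sketched without any framework as follows. Zero padding and a flip probability of 0.5 are assumptions made for this example; the `augment` helper is hypothetical and works on nested lists purely for illustration.

```python
import random


def augment(image, pad=4, crop=28, rng=random):
    """Zero-pad, take a random crop x crop patch, then flip horizontally with p=0.5."""
    height, width = len(image), len(image[0])
    padded_w = width + 2 * pad
    padded = [[0] * padded_w for _ in range(pad)]
    padded += [[0] * pad + row[:] + [0] * pad for row in image]
    padded += [[0] * padded_w for _ in range(pad)]
    # Top-left corner of the crop inside the padded image.
    top = rng.randint(0, height + 2 * pad - crop)
    left = rng.randint(0, padded_w - crop)
    patch = [row[left:left + crop] for row in padded[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    return patch


patch = augment([[1] * 28 for _ in range(28)], rng=random.Random(0))
```

The output has the same 28x28 size as the input, shifted by up to 4 pixels in each direction.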

Results on MNIST

Model	Test Error (median of 3 runs)	# of Epochs	Training Time
ResNet-preact-20	0.40	100	12m
ResNet-preact-20, Cutout 6	0.32	100	12m
ResNet-preact-20, Cutout 8	0.25	100	12m
ResNet-preact-20, Cutout 10	0.27	100	12m
ResNet-preact-20, Cutout 12	0.26	100	12m
ResNet-preact-20, Cutout 14	0.26	100	12m
ResNet-preact-20, Cutout 16	0.25	100	12m
ResNet-preact-20, Mixup (alpha=1)	0.40	100	12m
ResNet-preact-20, Mixup (alpha=0.5)	0.38	100	12m
ResNet-preact-20, widening factor 4, Cutout 14	0.26	100	45m
ResNet-preact-50, Cutout 14	0.29	100	28m
ResNet-preact-50, widening factor 4, Cutout 14	0.25	100	1h50m
shake-shake-26 2x96d (S-S-I), Cutout 14	0.24	100	3h22m

Note

Results reported in the table are the test errors at the last epoch.
All models are trained using cosine annealing with initial learning rate 0.2.
GeForce GTX 1080 Ti was used in these experiments.

Results on Kuzushiji-MNIST

Model	Test Error (median of 3 runs)	# of Epochs	Training Time
ResNet-preact-20, Cutout 14	0.82 (best 0.67)	200	24m
ResNet-preact-20, widening factor 4, Cutout 14	0.72 (best 0.67)	200	1h30m
PyramidNet-110-270, Cutout 14	0.72 (best 0.70)	200	10h05m
shake-shake-26 2x96d (S-S-I), Cutout 14	0.66 (best 0.63)	200	6h46m

Note

Results reported in the table are the test errors at the last epoch.
All models are trained using cosine annealing with initial learning rate 0.2.
GeForce GTX 1080 Ti was used in these experiments.

Experiments

Experiment on residual units, learning rate scheduling, and data augmentation

In this experiment, the effects of the following on classification accuracy are investigated:

PyramidNet-like residual units
Cosine annealing of learning rate
Cutout
Random Erasing
Mixup
Preactivation of shortcuts after downsampling

ResNet-preact-56 is trained on CIFAR-10 with initial learning rate 0.2 in this experiment.

Note

The PyramidNet paper (1610.02915) showed that removing the first ReLU in residual units and adding BN after the last convolution in residual units both improve classification accuracy.
The SGDR paper (1608.03983) showed that cosine annealing improves classification accuracy even without restarts.

Results

PyramidNet-like units work.
- It might be better not to preactivate shortcuts after downsampling when using PyramidNet-like units.
Cosine annealing slightly improves accuracy.
Cutout, RandomErasing, and Mixup all work great.
- Mixup needs longer training.

Model	Test Error (median of 5 runs)	Training Time
w/ 1st ReLU, w/o last BN, preactivate shortcut after downsampling	6.45	95 min
w/ 1st ReLU, w/o last BN	6.47	95 min
w/o 1st ReLU, w/o last BN	6.14	89 min
w/ 1st ReLU, w/ last BN	6.43	104 min
w/o 1st ReLU, w/ last BN	5.85	98 min
w/o 1st ReLU, w/ last BN, preactivate shortcut after downsampling	6.27	98 min
w/o 1st ReLU, w/ last BN, Cosine annealing	5.72	98 min
w/o 1st ReLU, w/ last BN, Cutout	4.96	98 min
w/o 1st ReLU, w/ last BN, RandomErasing	5.22	98 min
w/o 1st ReLU, w/ last BN, Mixup (300 epochs)	5.11	191 min

preactivate shortcut after downsampling

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, true, true]' --remove_first_relu false --add_last_bn false --seed 7 --outdir results/experiments/00_preact_after_downsampling/00

w/ 1st ReLU, w/o last BN

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu false --add_last_bn false --seed 7 --outdir results/experiments/01_w_relu_wo_bn/00

w/o 1st ReLU, w/o last BN

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu true --add_last_bn false --seed 7 --outdir results/experiments/02_wo_relu_wo_bn/00

w/ 1st ReLU, w/ last BN

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu false --add_last_bn true --seed 7 --outdir results/experiments/03_w_relu_w_bn/00

w/o 1st ReLU, w/ last BN

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu true --add_last_bn true --seed 7 --outdir results/experiments/04_wo_relu_w_bn/00

w/o 1st ReLU, w/ last BN, preactivate shortcut after downsampling

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, true, true]' --remove_first_relu true --add_last_bn true --seed 7 --outdir results/experiments/05_preact_after_downsampling/00

w/o 1st ReLU, w/ last BN, cosine annealing

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu true --add_last_bn true --scheduler cosine --seed 7 --outdir results/experiments/06_cosine_annealing/00

w/o 1st ReLU, w/ last BN, Cutout

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu true --add_last_bn true --use_cutout --seed 7 --outdir results/experiments/07_cutout/00

w/o 1st ReLU, w/ last BN, RandomErasing

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu true --add_last_bn true --use_random_erasing --seed 7 --outdir results/experiments/08_random_erasing/00

w/o 1st ReLU, w/ last BN, Mixup

$ python -u main.py --arch resnet_preact --depth 56 --block_type basic --base_lr 0.2 --preact_stage '[true, false, false]' --remove_first_relu true --add_last_bn true --use_mixup --seed 7 --outdir results/experiments/09_mixup/00

Experiments on label smoothing, Mixup, RICAP, and Dual-Cutout

Results on CIFAR-10

Model	Test Error (median of 3 runs)	# of Epochs	Training Time
ResNet-preact-20	7.60	200	24m
ResNet-preact-20, label smoothing (epsilon=0.001)	7.41	200	25m
ResNet-preact-20, label smoothing (epsilon=0.1)	7.53	200	25m
ResNet-preact-20, mixup (alpha=1)	7.24	200	26m
ResNet-preact-20, RICAP (beta=0.3), w/ random crop	6.88	200	28m
ResNet-preact-20, RICAP (beta=0.3)	6.77	200	28m
ResNet-preact-20, Dual-Cutout 16 (alpha=0.1)	6.24	200	45m
ResNet-preact-20	7.05	400	49m
ResNet-preact-20, label smoothing (epsilon=0.001)	7.05	400	49m
ResNet-preact-20, label smoothing (epsilon=0.1)	7.13	400	49m
ResNet-preact-20, mixup (alpha=1)	6.66	400	51m
ResNet-preact-20, RICAP (beta=0.3), w/ random crop	6.30	400	56m
ResNet-preact-20, RICAP (beta=0.3)	6.19	400	56m
ResNet-preact-20, Dual-Cutout 16 (alpha=0.1)	5.55	400	1h36m

Note

Results reported in the table are the test errors at the last epoch.
All models are trained using cosine annealing with initial learning rate 0.2.
GeForce GTX 1080 Ti was used in these experiments.
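Label smoothing, as introduced in the Inception paper (1512.00567), replaces the one-hot target with a mixture of the one-hot vector and the uniform distribution. A minimal sketch of the smoothed target, assuming that formulation; `smooth_labels` is a hypothetical helper and not necessarily how this repository implements it.

```python
def smooth_labels(target, num_classes, epsilon):
    """Smoothed target: (1 - epsilon) * one_hot + epsilon / num_classes."""
    off = epsilon / num_classes
    return [1.0 - epsilon + off if k == target else off
            for k in range(num_classes)]


# CIFAR-10 has 10 classes; epsilon=0.1 is one of the values in the table above.
smoothed = smooth_labels(target=3, num_classes=10, epsilon=0.1)
```

With epsilon=0.1 and 10 classes, the true class gets probability 0.91 and every other class gets 0.01.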

Experiments on batch size and learning rate

The following experiments are done on the CIFAR-10 dataset using a GeForce GTX 1080 Ti.
Results reported in the tables are the test errors at the last epoch.

Linear scaling rule for learning rate

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	3.2	cosine	200	10.57	22m
ResNet-preact-20	2048	1.6	cosine	200	8.87	21m
ResNet-preact-20	1024	0.8	cosine	200	8.40	21m
ResNet-preact-20	512	0.4	cosine	200	8.22	20m
ResNet-preact-20	256	0.2	cosine	200	8.61	22m
ResNet-preact-20	128	0.1	cosine	200	8.09	24m
ResNet-preact-20	64	0.05	cosine	200	8.22	28m
ResNet-preact-20	32	0.025	cosine	200	8.00	43m
ResNet-preact-20	16	0.0125	cosine	200	7.75	1h17m
ResNet-preact-20	8	0.006125	cosine	200	7.70	2h32m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	3.2	multistep	200	28.97	22m
ResNet-preact-20	2048	1.6	multistep	200	9.07	21m
ResNet-preact-20	1024	0.8	multistep	200	8.62	21m
ResNet-preact-20	512	0.4	multistep	200	8.23	20m
ResNet-preact-20	256	0.2	multistep	200	8.40	21m
ResNet-preact-20	128	0.1	multistep	200	8.28	24m
ResNet-preact-20	64	0.05	multistep	200	8.13	28m
ResNet-preact-20	32	0.025	multistep	200	7.58	43m
ResNet-preact-20	16	0.0125	multistep	200	7.93	1h18m
ResNet-preact-20	8	0.006125	multistep	200	8.31	2h34m
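The initial learning rates in the two tables above appear to follow the linear scaling rule (1706.02677), in which the learning rate grows in proportion to the batch size. A small sketch, assuming a reference point of learning rate 0.1 at batch size 128, which matches most rows of the tables; `scaled_lr` is a hypothetical helper.

```python
def scaled_lr(batch_size, ref_lr=0.1, ref_batch_size=128):
    """Scale the learning rate linearly with the batch size."""
    return ref_lr * batch_size / ref_batch_size


lrs = {bs: scaled_lr(bs) for bs in (4096, 1024, 256, 128, 32)}
```

Doubling the batch size doubles the learning rate, so batch size 4096 maps to 3.2 and batch size 32 maps to 0.025 under this assumption.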

Linear scaling + longer training

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	3.2	cosine	400	8.97	44m
ResNet-preact-20	2048	1.6	cosine	400	7.85	43m
ResNet-preact-20	1024	0.8	cosine	400	7.20	42m
ResNet-preact-20	512	0.4	cosine	400	7.83	40m
ResNet-preact-20	256	0.2	cosine	400	7.65	42m
ResNet-preact-20	128	0.1	cosine	400	7.09	47m
ResNet-preact-20	64	0.05	cosine	400	7.17	44m
ResNet-preact-20	32	0.025	cosine	400	7.24	2h11m
ResNet-preact-20	16	0.0125	cosine	400	7.26	4h10m
ResNet-preact-20	8	0.006125	cosine	400	7.02	7h53m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	3.2	cosine	800	8.14	1h29m
ResNet-preact-20	2048	1.6	cosine	800	7.74	1h23m
ResNet-preact-20	1024	0.8	cosine	800	7.15	1h31m
ResNet-preact-20	512	0.4	cosine	800	7.27	1h25m
ResNet-preact-20	256	0.2	cosine	800	7.22	1h26m
ResNet-preact-20	128	0.1	cosine	800	6.68	1h35m
ResNet-preact-20	64	0.05	cosine	800	7.18	2h20m
ResNet-preact-20	32	0.025	cosine	800	7.03	4h16m
ResNet-preact-20	16	0.0125	cosine	800	6.78	8h37m
ResNet-preact-20	8	0.006125	cosine	800	6.89	16h47m

Effect of initial learning rate

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	3.2	cosine	200	10.57	22m
ResNet-preact-20	4096	1.6	cosine	200	10.32	22m
ResNet-preact-20	4096	0.8	cosine	200	10.71	22m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	2048	3.2	cosine	200	11.34	21m
ResNet-preact-20	2048	2.4	cosine	200	8.69	21m
ResNet-preact-20	2048	2.0	cosine	200	8.81	21m
ResNet-preact-20	2048	1.6	cosine	200	8.73	22m
ResNet-preact-20	2048	0.8	cosine	200	9.62	21m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	1024	3.2	cosine	200	9.12	21m
ResNet-preact-20	1024	2.4	cosine	200	8.42	22m
ResNet-preact-20	1024	2.0	cosine	200	8.38	22m
ResNet-preact-20	1024	1.6	cosine	200	8.07	22m
ResNet-preact-20	1024	1.2	cosine	200	8.25	21m
ResNet-preact-20	1024	0.8	cosine	200	8.08	22m
ResNet-preact-20	1024	0.4	cosine	200	8.49	22m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	512	3.2	cosine	200	8.51	21m
ResNet-preact-20	512	1.6	cosine	200	7.73	20m
ResNet-preact-20	512	0.8	cosine	200	7.73	21m
ResNet-preact-20	512	0.4	cosine	200	8.22	20m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	256	3.2	cosine	200	9.64	22m
ResNet-preact-20	256	1.6	cosine	200	8.32	22m
ResNet-preact-20	256	0.8	cosine	200	7.45	21m
ResNet-preact-20	256	0.4	cosine	200	7.68	22m
ResNet-preact-20	256	0.2	cosine	200	8.61	22m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	128	1.6	cosine	200	9.03	24m
ResNet-preact-20	128	0.8	cosine	200	7.54	24m
ResNet-preact-20	128	0.4	cosine	200	7.28	24m
ResNet-preact-20	128	0.2	cosine	200	7.96	24m
ResNet-preact-20	128	0.1	cosine	200	8.09	24m
ResNet-preact-20	128	0.05	cosine	200	8.81	24m
ResNet-preact-20	128	0.025	cosine	200	10.07	24m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	64	0.4	cosine	200	7.42	35m
ResNet-preact-20	64	0.2	cosine	200	7.52	36m
ResNet-preact-20	64	0.1	cosine	200	7.78	37m
ResNet-preact-20	64	0.05	cosine	200	8.22	28m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	32	0.2	cosine	200	7.64	1h05m
ResNet-preact-20	32	0.1	cosine	200	7.25	1h08m
ResNet-preact-20	32	0.05	cosine	200	7.45	1h07m
ResNet-preact-20	32	0.025	cosine	200	8.00	43m

Good learning rate + longer training

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	1.6	cosine	800	8.36	1h33m
ResNet-preact-20	2048	1.6	cosine	800	7.53	1h27m
ResNet-preact-20	1024	1.6	cosine	800	7.30	1h30m
ResNet-preact-20	1024	0.8	cosine	800	7.42	1h30m
ResNet-preact-20	512	1.6	cosine	800	6.69	1h26m
ResNet-preact-20	512	0.8	cosine	800	6.77	1h26m
ResNet-preact-20	256	0.8	cosine	800	6.84	1h28m
ResNet-preact-20	128	0.4	cosine	800	6.86	1h35m
ResNet-preact-20	128	0.2	cosine	800	7.05	1h38m
ResNet-preact-20	128	0.1	cosine	800	6.68	1h35m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	1.6	cosine	1600	8.25	3h10m
ResNet-preact-20	2048	1.6	cosine	1600	7.34	2h50m
ResNet-preact-20	1024	1.6	cosine	1600	6.94	2h52m
ResNet-preact-20	512	1.6	cosine	1600	6.99	2h44m
ResNet-preact-20	256	0.8	cosine	1600	6.95	2h50m
ResNet-preact-20	128	0.4	cosine	1600	6.64	3h09m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	4096	1.6	cosine	3200	9.52	6h15m
ResNet-preact-20	2048	1.6	cosine	3200	6.92	5h42m
ResNet-preact-20	1024	1.6	cosine	3200	6.96	5h43m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (1 run)	Training Time
ResNet-preact-20	2048	1.6	cosine	6400	7.45	11h44m

LARS

The original papers (1708.03888, 1801.03137) use polynomial-decay learning rate scheduling, but cosine annealing is used in these experiments.
In this implementation, the LARS coefficient is not used, so the learning rate should be adjusted accordingly.
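To make the role of the coefficient concrete, here is a framework-free sketch of a single LARS-style update for one layer, following the layer-wise trust ratio from 1708.03888. It omits momentum for brevity, and setting `coefficient=1.0` is only one reading of "the LARS coefficient is not used"; the `lars_step` helper and its parameters are assumptions for illustration, not this repository's optimizer.

```python
import math


def lars_step(weights, grads, lr, weight_decay=0.0, coefficient=1.0, eps=1e-9):
    """One momentum-free LARS-style update for a single layer (flat lists)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    # Layer-wise trust ratio: large gradients get a proportionally smaller step.
    trust_ratio = coefficient * w_norm / (g_norm + weight_decay * w_norm + eps)
    return [w - lr * trust_ratio * (g + weight_decay * w)
            for w, g in zip(weights, grads)]


layer = [3.0, 4.0]  # weight norm is 5
small = lars_step(layer, [0.03, 0.04], lr=0.02)
large = lars_step(layer, [30.0, 40.0], lr=0.02)
```

Both calls move the weights by almost the same amount even though the gradients differ by a factor of 1000, because the step length is set by the weight norm rather than the gradient norm. With the coefficient fixed at 1, each step is about `lr` times the weight norm, which is why small values such as 0.02 are plausible here.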

$ python -u train.py --dataset CIFAR10 --arch resnet_preact --depth 20 --block_type basic --seed 7 --scheduler cosine --optimizer lars --base_lr 0.02 --batch_size 4096 --epochs 200 --outdir results/experiment00/00

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (median of 3 runs)	Training Time
ResNet-preact-20	4096	0.005	cosine	200	14.31	22m
ResNet-preact-20	4096	0.01	cosine	200	9.33	22m
ResNet-preact-20	4096	0.015	cosine	200	8.47	22m
ResNet-preact-20	4096	0.02	cosine	200	8.21	22m
ResNet-preact-20	4096	0.03	cosine	200	8.46	22m
ResNet-preact-20	4096	0.04	cosine	200	9.58	22m

Model	batch size	initial lr	lr schedule	# of Epochs	Test Error (median of 3 runs)	Training Time
ResNet-preact-20	4096	0.02	cosine	200	8.21	22m
ResNet-preact-20	4096	0.02	cosine	400	7.53	44m
ResNet-preact-20	4096	0.02	cosine	800	7.48	1h29m
ResNet-preact-20	4096	0.02	cosine	1600	7.37 (1 run)	2h58m

References

Model architecture

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. link, arXiv:1512.03385
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Identity Mappings in Deep Residual Networks." In European Conference on Computer Vision (ECCV). 2016. arXiv:1603.05027, Torch implementation
Zagoruyko, Sergey, and Nikos Komodakis. "Wide Residual Networks." Proceedings of the British Machine Vision Conference (BMVC), 2016. arXiv:1605.07146, Torch implementation
Huang, Gao, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. "Densely Connected Convolutional Networks." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. link, arXiv:1608.06993, Torch implementation
Han, Dongyoon, Jiwhan Kim, and Junmo Kim. "Deep Pyramidal Residual Networks." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. link, arXiv:1610.02915, Torch implementation, Caffe implementation, PyTorch implementation
Xie, Saining, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. "Aggregated Residual Transformations for Deep Neural Networks." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. link, arXiv:1611.05431, Torch implementation
Gastaldi, Xavier. "Shake-Shake regularization of 3-branch residual networks." In International Conference on Learning Representations (ICLR) Workshop, 2017. link, arXiv:1705.07485, Torch implementation
Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-Excitation Networks." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141. link, arXiv:1709.01507, Caffe implementation

Regularization, data augmentation

Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. "Rethinking the Inception Architecture for Computer Vision." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. link, arXiv:1512.00567
DeVries, Terrance, and Graham W. Taylor. "Improved Regularization of Convolutional Neural Networks with Cutout." arXiv preprint arXiv:1708.04552 (2017). arXiv:1708.04552, PyTorch implementation
Abu-El-Haija, Sami. "Proportionate Gradient Updates with PercentDelta." arXiv preprint arXiv:1708.07227 (2017). arXiv:1708.07227
Zhong, Zhun, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. "Random Erasing Data Augmentation." arXiv preprint arXiv:1708.04896 (2017). arXiv:1708.04896, PyTorch implementation
Zhang, Hongyi, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. "mixup: Beyond Empirical Risk Minimization." In International Conference on Learning Representations (ICLR), 2018. link, arXiv:1710.09412
Kawaguchi, Kenji, Yoshua Bengio, Vikas Verma, and Leslie Pack Kaelbling. "Towards Understanding Generalization via Analytical Learning Theory." arXiv preprint arXiv:1802.07426 (2018). arXiv:1802.07426, PyTorch implementation
Takahashi, Ryo, Takashi Matsubara, and Kuniaki Uehara. "Data Augmentation using Random Image Cropping and Patching for Deep CNNs." Proceedings of The 10th Asian Conference on Machine Learning (ACML), 2018. link, arXiv:1811.09030

Large batch

Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." In International Conference on Learning Representations (ICLR), 2017. link, arXiv:1609.04836
Goyal, Priya, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint arXiv:1706.02677 (2017). arXiv:1706.02677
You, Yang, Igor Gitman, and Boris Ginsburg. "Large Batch Training of Convolutional Networks." arXiv preprint arXiv:1708.03888 (2017). arXiv:1708.03888
Gitman, Igor, Deepak Dilipkumar, and Ben Parr. "Convergence Analysis of Gradient Descent Algorithms with Proportional Updates." arXiv preprint arXiv:1801.03137 (2018). arXiv:1801.03137, TensorFlow implementation
Shallue, Christopher J., Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. "Measuring the Effects of Data Parallelism on Neural Network Training." arXiv preprint arXiv:1811.03600 (2018). arXiv:1811.03600

Others

Loshchilov, Ilya, and Frank Hutter. "SGDR: Stochastic Gradient Descent with Warm Restarts." In International Conference on Learning Representations (ICLR), 2017. link, arXiv:1608.03983, Lasagne implementation
Recht, Benjamin, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. "Do CIFAR-10 Classifiers Generalize to CIFAR-10?" arXiv preprint arXiv:1806.00451 (2018). arXiv:1806.00451
He, Tong, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. "Bag of Tricks for Image Classification with Convolutional Neural Networks." arXiv preprint arXiv:1812.01187 (2018). arXiv:1812.01187