ModelZoo for Pytorch

November 30, 2020

This is a model zoo project for PyTorch. In this repo I will implement some basic classification models that perform well on ImageNet, train them as fairly as possible, and try my best to reach SOTA results on ImageNet. This repo only considers FP16 training.

Usage

Environment

OS: Ubuntu 18.04
CUDA: 10.1, CuDNN: 7.6
Devices: I use 8 * RTX 2080 Ti (8 * V100 should be much better /cry). This project trains in FP16 precision, so FP16-friendly devices such as the RTX series or V100 are recommended. If you want to fully reproduce my results, you should use the same batch size as I do.

Requirement

PyTorch: >= 1.6.0 (needs torch.cuda.amp, available from version 1.6)
TorchToolbox: stable version. Helper functions that make your code simpler and more readable. It is optional; if you don't want to use it, write the helpers yourself.
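
As an illustration of why torch.cuda.amp is required, here is a minimal sketch of one mixed-precision training step. It is not taken from this repo's training script: the tiny model, the random data, and the `train_step` helper are hypothetical stand-ins, and the sketch falls back to FP32 when no GPU is present. Newer PyTorch releases prefer the `torch.amp` namespace; the `torch.cuda.amp` names below are the ones shipped with 1.6.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

# Fall back to plain FP32 on machines without a GPU so the sketch still runs.
use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Hypothetical stand-in model; the repo itself trains ImageNet classifiers.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
scaler = GradScaler(enabled=use_amp)


def train_step(images, targets):
    """Run one optimization step, using FP16 autocast when CUDA is available."""
    images, targets = images.to(device), targets.to(device)
    optimizer.zero_grad()
    with autocast(enabled=use_amp):
        loss = criterion(model(images), targets)
    # Scale the loss so small FP16 gradients do not underflow; scaler.step()
    # unscales the gradients before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


# Random tensors stand in for a batch of images and labels.
loss_value = train_step(torch.randn(4, 3, 8, 8), torch.randint(0, 10, (4,)))
```

The loss scaling is what makes FP16 practical here: without it, small gradients can round to zero in half precision.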

LMDB Dataset

Not required.

If you run into an IO bottleneck, use an LMDB-format dataset. A good approach is to try both formats and see which is faster.

I provide a conversion script here.

Train script

python distribute_train_script.py --params

Here is an example:

python distribute_train_script.py --data-path /s4/piston/ImageNet --batch-size 256 --dtype float16 \
                                  -j 48 --epochs 360 --lr 2.6 --warmup-epochs 5 --label-smoothing \
                                  --no-wd --wd 0.00003 --model GhostNet --log-interval 150 --model-info \
                                  --dist-url tcp://127.0.0.1:26548 --world-size 1 --rank 0

ToDo

Resume training
~~Try Nvidia-DALI~~
Multi-node(distributed) training by ~~Apex or BytePS~~ Pytorch
I may try AutoAugment. This project aims to train models ourselves in order to observe and learn; it's impossible for me to train this myself, and just copying it feels meaningless.

Baseline models

model	epochs	dtype	batch size*	gpus	lr	tricks	Params(M)/FLOPs	top1/top5	params/logs
resnet50	120	FP16	128	8	0.4	-	25.6/4.1G	77.36/-	Google Drive
resnet101	120	FP16	128	8	0.4	-	44.7/7.8G	79.13/94.38	Google Drive
resnet50v2	120	FP16	128	8	0.4	-	25.6/4.1G	77.06/93.44	Google Drive
resnet101v2	120	FP16	128	8	0.4	-	44.6/7.8G	78.90/94.39	Google Drive
ResNext50_32x4d	120	FP16	128	8	0.4	-	25.1/4.2G	79.00/94.39
RegNetX4_0GF	120	FP16	128	8	0.4	-	22.2/4.0G	78.40/94.04
RegNetY4_0GF	120	FP16	128	8	0.4	-	22.1/4.0G	79.22/94.57
RegNetY6_4GF	120	FP16	128	8	0.4	-	31.2/6.4G	79.69/94.82
ResNeST50	120	FP16	128	8	0.4	-	27.5/4.1G	78.62/94.28
mobilenetv1	150	FP16	256	8	0.4	-	4.3/572.2M	72.17/90.70	Google Drive
mobilenetv2	150	FP16	256	8	0.4	-	3.5/305.3M	71.94/90.59	Google Drive
mobilenetv3 Large	360	FP16	256	8	2.6	Label smoothing No decay bias Dropout	5.5/219M	75.64/92.61	Google Drive
mobilenetv3 Small	360	FP16	256	8	2.6	Label smoothing No decay bias Dropout	3.0/57.8M	67.83/87.78
GhostNet1.3	360	FP16	400	8	2.6	Label smoothing No decay bias Dropout	7.4/230.4M	75.78/92.77	Google Drive

I use Nesterov SGD and cosine lr decay with 5 warmup epochs by default [2][3] (to save time); this setup is more common and effective.
*Batch size is the number of samples each GPU holds. The total batch size is (batch size * gpus).
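
The two notes above can be sketched in a few lines. This is a minimal illustration, not the repo's scheduler: the `lr_at` helper is hypothetical, the linear warmup shape is an assumption (the source only says "5 warmup epochs"), and the schedule is evaluated per epoch for brevity, whereas a real training loop would more likely update it per iteration.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        # Assumed warmup shape: ramp linearly up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Total batch size for the resnet50 baseline row: per-GPU batch * gpus.
per_gpu_batch, gpus = 128, 8
total_batch = per_gpu_batch * gpus

# Settings from the resnet50 baseline row: 120 epochs, lr 0.4, 5 warmup epochs.
epochs, warmup_epochs, base_lr = 120, 5, 0.4
schedule = [lr_at(e, epochs, warmup_epochs, base_lr) for e in range(epochs)]
```

With these settings the learning rate reaches 0.4 at the end of warmup and then decays smoothly to nearly zero by the last epoch.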

Optimized Models(with tricks)

In progress.

Ablation Study on Tricks

Many tricks for improving accuracy have appeared in recent years. (If you have another idea, please open an issue.) I want to verify them in a fair way.

Tricks: RandomRotation, OctConv[14], Drop out, Label Smoothing[4], Sync BN, ~~SwitchNorm[6]~~, Mixup[17], no decay bias[7], Cutout[5], Relu6[18], ~~swish activation[10]~~, Stochastic Depth[9], Lookahead Optimizer[11], Pre-active(ResnetV2)[12], ~~DCNv2[13]~~, LIP[16].

Struck-through tricks ran out of memory on my devices.

Special: zero-initializing the last BN, which I just call 'Zero γ', applies only to post-activation models.
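
A minimal sketch of what 'Zero γ' does, assuming "the last BN" means the final BatchNorm of a residual branch. `TinyResidualBlock` is a hypothetical toy block, not one of this repo's models. With the last BN weight set to zero, the residual branch outputs zero at initialization, so the block starts out as ReLU applied to its input.

```python
import torch
import torch.nn as nn


class TinyResidualBlock(nn.Module):
    """Hypothetical post-activation block: conv-BN-ReLU-conv-BN, add, ReLU."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()
        # Zero gamma: the last BN of the residual branch starts with weight 0,
        # so the branch contributes nothing at initialization.
        nn.init.zeros_(self.bn2.weight)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Post-activation: ReLU comes after the shortcut addition.
        return self.relu(out + x)


torch.manual_seed(0)
block = TinyResidualBlock(4)
x = torch.randn(2, 4, 8, 8)
y = block(x)
# At initialization the block reduces to ReLU(x).
starts_as_identity = torch.equal(y, torch.relu(x))
```

In a pre-activation (ResNetV2-style) block the BN layers sit before the convolutions, so there is no final BN on the residual branch to zero out, which is presumably why the trick is limited to post-activation models.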

I'll only use 120 epochs and a 128*8 batch size to train them. I know some tricks may need longer training or a larger batch size, but that would not be fair to the others. Think of the results as performance under the current settings.

model	epochs	dtype	batch size*	gpus	lr	tricks	degree	top1/top5	improve	params/logs
resnet50	120	FP16	128	8	0.4	-	-	77.36/-	baseline	Google Drive
resnet50	120	FP16	128	8	0.4	Label smoothing	smoothing=0.1	77.78/93.80	+0.42	Google Drive
resnet50	120	FP16	128	8	0.4	No decay bias	-	77.28/93.61	-0.08	Google Drive
resnet50	120	FP16	128	8	0.4	Sync BN	-	77.31/93.49	-0.05	Google Drive
resnet50	120	FP16	128	8	0.4	Mixup	alpha=0.2	77.49/93.73	+0.13	missing
resnet50	120	FP16	128	8	0.4	RandomRotation	degree=15	76.64/93.28	-1.15	Google Drive
resnet50	120	FP16	128	8	0.4	Cutout	read code	77.44/93.62	+0.08	Google Drive
resnet50	120	FP16	128	8	0.4	Dropout	rate=0.3	77.11/93.58	-0.25	Google Drive
resnet50	120	FP16	128	8	0.4	Lookahead-SGD	-	77.23/93.39	-0.13	Google Drive
resnet50v2	120	FP16	128	8	0.4	pre-active	-	77.06/93.44	-0.30	Google Drive
oct_resnet50	120	FP16	128	8	0.4	OctConv	alpha=0.125	-	-
resnet50	120	FP16	128	8	0.4	Relu6		77.28/93.5	-0.08	Google Drive
resnet50	120	FP16	128	8	0.4		-	77.00/-	DDP baseline
resnet50	120	FP16	128	8	0.4	Gradient Centralization	Conv only	77.40/93.57	+0.40
resnet50	120	FP16	128	8	0.4	Zero γ		77.24/-	+0.24
resnet50	120	FP16	128	8	0.4	No decay bias		77.74/93.77	+0.74
resnet50	120	FP16	128	8	0.4	RandAugment	n=2,m=9	76.44/93.18	-0.96
resnet50	120	FP16	128	8	0.4	AutoAugment		76.50/93.23	-0.50

Training for more epochs may give better results for Mixup, Cutout, and Dropout.
AutoAugment and RandAugment may do better with 180 epochs of training.
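
To make the "degree" column concrete, here is a minimal pure-Python sketch of how label smoothing (smoothing=0.1) and Mixup (alpha=0.2) build soft targets. The helper names are hypothetical and this is not the repo's implementation. The smoothing formula follows one common convention, spreading the smoothing mass uniformly over all classes; other variants spread it only over the wrong classes.

```python
import random


def smooth_one_hot(label, num_classes, smoothing=0.1):
    """Soft target: (1 - smoothing) * one-hot + smoothing / num_classes."""
    uniform = smoothing / num_classes
    return [uniform + (1.0 - smoothing) * (1.0 if k == label else 0.0)
            for k in range(num_classes)]


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their targets with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_x, mixed_y, lam


# Label smoothing on a 4-class toy problem: class 2 keeps most of the mass.
target = smooth_one_hot(label=2, num_classes=4, smoothing=0.1)

# Mixup of two toy "images" (2 values each) with plain one-hot targets.
rng = random.Random(0)
x, y, lam = mixup([1.0, 0.0], smooth_one_hot(0, 4, 0.0),
                  [0.0, 1.0], smooth_one_hot(1, 4, 0.0),
                  alpha=0.2, rng=rng)
```

With a small alpha such as 0.2, the Beta distribution puts most of its mass near 0 and 1, so most mixed samples stay close to one of the two originals.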

Citation

@misc{ModelZoo.pytorch,
  title = {Basic deep conv neural network reproduce and explore},
  author = {X.Yang},
  URL = {https://github.com/PistonY/ModelZoo.pytorch},
  year = {2019}
  }