Pretrained models

May 20, 2021

We provide a collection of models trained with semantic softmax on the ImageNet-21K-P dataset. All results are at an input resolution of 224.
To allow a proper comparison between the models, we also provide throughput metrics.

Backbone	ImageNet-21K-P semantic top-1 Accuracy [%]	ImageNet-1K top-1 Accuracy [%]	Maximal batch size	Maximal training speed (img/sec)	Maximal inference speed (img/sec)
MobilenetV3_large_100	73.1	78.0	488	1210	5980
OFA_flops_595m_s	75.0	81.0	288	500	3240
ResNet50	75.6	82.0	320	720	2760
Mixer-B-16	76.3	82.3	160	420	1420
TResNet-M	76.4	83.1	520	670	2970
TResNet-L (V2)	76.7	83.9	240	300	1460
ViT-B-16	77.6	84.4	160	340	1140

To initialize the different models and properly load the weights, use this file.

Use the following model names (--model_name): tresnet_m, tresnet_l, ofa_flops_595m_s, resnet50, vit_base_patch16_224, mobilenetv3_large_100
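As an illustration of how the model name might be passed on the command line, here is a minimal sketch using argparse. Only the `--model_name` flag and the six names come from this README; the `build_parser` helper and the parser layout are hypothetical and are not taken from the repository's scripts.

```python
import argparse

# Model names listed in this README; the repository's scripts may accept others.
MODEL_NAMES = [
    "tresnet_m",
    "tresnet_l",
    "ofa_flops_595m_s",
    "resnet50",
    "vit_base_patch16_224",
    "mobilenetv3_large_100",
]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the --model_name flag comes from the README.
    parser = argparse.ArgumentParser(description="Select a pretrained backbone")
    parser.add_argument(
        "--model_name",
        choices=MODEL_NAMES,  # reject names that are not in the list above
        required=True,
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--model_name", "resnet50"])
print(args.model_name)
```

Restricting the flag with `choices` makes a misspelled model name fail at parse time, before any weights are loaded.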

Notes

Maximal training and inference speeds were measured on an NVIDIA V100 GPU, at 90% of the maximal batch size.
The ViT model benefits greatly from O2 mixed-precision training and inference. O1 mixed-precision speeds (torch.autocast) are lower.
We are still optimizing the ViT hyperparameters for ImageNet-1K training. Accuracy will probably improve in the future.
Our ofa_flops_595m model differs slightly from the original model: we converted all hard-sigmoids to regular sigmoids, since they are faster on both CPU and GPU and give better scores. Hence we renamed the model to 'ofa_flops_595m_s'.
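To make the hard-sigmoid to sigmoid change in the last note concrete, the sketch below compares the two activations on a few inputs in plain Python. It assumes the hard-sigmoid has the common piecewise-linear form relu6(x + 3) / 6 (the definition used by PyTorch's `nn.Hardsigmoid`); the exact variant in the original OFA model is not stated here, and the function names are illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    # Regular logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x: float) -> float:
    # Assumed piecewise-linear form: relu6(x + 3) / 6.
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


# Print both activations side by side for a handful of sample inputs.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  hard_sigmoid={hard_sigmoid(x):.3f}  sigmoid={sigmoid(x):.3f}")
```

Under this assumption the two functions agree at zero and diverge elsewhere: the hard-sigmoid saturates exactly at 0 and 1 once |x| >= 3, while the regular sigmoid only approaches those values.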