Pretrained models

May 20, 2021

We provide a collection of models trained with semantic softmax on the ImageNet-21K-P dataset. All results are reported at an input resolution of 224.
To allow a fair comparison between the models, we also provide throughput metrics.

| Backbone | ImageNet-21K-P semantic top-1 accuracy [%] | ImageNet-1K top-1 accuracy [%] | Maximal batch size | Maximal training speed (img/sec) | Maximal inference speed (img/sec) |
|---|---|---|---|---|---|
| MobilenetV3_large_100 | 73.1 | 78.0 | 488 | 1210 | 5980 |
| OFA_flops_595m_s | 75.0 | 81.0 | 288 | 500 | 3240 |
| ResNet50 | 75.6 | 82.0 | 320 | 720 | 2760 |
| Mixer-B-16 | 76.3 | 82.3 | 160 | 420 | 1420 |
| TResNet-M | 76.4 | 83.1 | 520 | 670 | 2970 |
| TResNet-L (V2) | 76.7 | 83.9 | 240 | 300 | 1460 |
| ViT-B-16 | 77.6 | 84.4 | 160 | 340 | 1140 |
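The Notes section states that the speed figures were taken at 90% of the maximal batch size. The following is a minimal sketch of how such a measurement could be set up; it is not the authors' benchmarking code. The helper names `benchmark_batch_size` and `images_per_second`, the `step` callable, and the iteration counts are all assumptions made for illustration.

```python
import time


def benchmark_batch_size(max_batch_size, percent=90):
    """Batch size used for timing: a percentage of the largest batch that fits."""
    return max_batch_size * percent // 100


def images_per_second(step, batch_size, n_iters=20, n_warmup=5):
    """Time `step(batch_size)` and return the number of images processed per second.

    `step` is a hypothetical callable that runs one pass over a batch
    (forward only for inference, forward + backward for training). On a GPU
    it should block until the work has finished, for example by calling
    torch.cuda.synchronize(), otherwise the timing is not meaningful.
    """
    for _ in range(n_warmup):  # warm-up iterations are not timed
        step(batch_size)
    start = time.perf_counter()
    for _ in range(n_iters):
        step(batch_size)
    elapsed = time.perf_counter() - start
    return n_iters * batch_size / elapsed


def dummy_step(batch_size):
    """Stand-in for real model work, so the sketch runs without a GPU."""
    time.sleep(0.001)


# 488 is the maximal batch size listed for MobilenetV3_large_100.
bs = benchmark_batch_size(488)
print(bs, images_per_second(dummy_step, bs, n_iters=5, n_warmup=1))
```

Replacing `dummy_step` with a function that runs the real model on a batch of 224x224 inputs would give figures comparable in kind, though not necessarily in value, to those in the table.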

To initialize the different models and properly load the weights, use this file.

Use the following model names (--model_name): tresnet_m, tresnet_l, ofa_flops_595m_s, resnet50, vit_base_patch16_224, mobilenetv3_large_100
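As an illustration of how the `--model_name` flag could be restricted to the names listed above, here is a small argparse sketch. It is not the repository's actual argument parser; `build_parser` and `MODEL_NAMES` are hypothetical names, and marking the flag as required is an assumption.

```python
import argparse

# Model names exactly as listed in this README.
MODEL_NAMES = [
    "tresnet_m",
    "tresnet_l",
    "ofa_flops_595m_s",
    "resnet50",
    "vit_base_patch16_224",
    "mobilenetv3_large_100",
]


def build_parser():
    """Build a parser that rejects any model name not in MODEL_NAMES."""
    parser = argparse.ArgumentParser(description="Hypothetical model selection sketch")
    parser.add_argument("--model_name", choices=MODEL_NAMES, required=True)
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--model_name", "vit_base_patch16_224"])
print(args.model_name)
```

With `choices` set, a misspelled name fails immediately with a usage message instead of surfacing later as a weight-loading error.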

Notes

  • Maximal training and inference speeds were measured on an NVIDIA V100 GPU, at 90% of the maximal batch size.
  • The ViT model benefits greatly from O2 mixed-precision training and inference. O1 mixed-precision speeds (torch.autocast) are lower.
  • We are still optimising the ViT hyperparameters for ImageNet-1K training. Accuracy will probably be higher in the future.
  • Our ofa_flops_595m model is slightly different from the original model: we converted all hard-sigmoids to regular sigmoids, since they are faster on both CPU and GPU and give better scores. Hence we renamed the model to 'ofa_flops_595m_s'.
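To make the last note concrete, the sketch below compares a regular sigmoid with a hard-sigmoid in plain Python. It assumes the common `relu6(x + 3) / 6` form of the hard-sigmoid (the definition PyTorch uses for `Hardsigmoid`); this README does not state which form the original model used, and the function names are illustrative only.

```python
import math


def sigmoid(x):
    """Regular (smooth) sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise-linear approximation, assumed here to be relu6(x + 3) / 6."""
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


# Print both activations side by side at a few sample points.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  hard_sigmoid={hard_sigmoid(x):.3f}")
```

The two functions agree at zero and differ most in the tails, where the hard-sigmoid saturates exactly at 0 and 1, so weights trained with one activation are not interchangeable with the other.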