Main Results of VMamba Series

May 26, 2024

Classification on ImageNet-1K

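| name | pretrain | resolution | acc@1 | #params | FLOPs | TP. | Train TP. | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1K | 224x224 | 81.2 | 28M | 4.5G | 1244 | 987 | -- |
| Swin-S | ImageNet-1K | 224x224 | 83.2 | 50M | 8.7G | 718 | 642 | -- |
| Swin-B | ImageNet-1K | 224x224 | 83.5 | 88M | 15.4G | 458 | 496 | -- |
| Vanilla-VMamba-T | ImageNet-1K | 224x224 | 82.2 | 23M | 4.5G 5.6G | 638 | 195 | config/log/ckpt |
| Vanilla-VMamba-S | ImageNet-1K | 224x224 | 83.5 | 44M | 9.1G 11.2G | 359 | 111 | config/log/ckpt |
| Vanilla-VMamba-B | ImageNet-1K | 224x224 | 83.7 | 76M | 15.2G 18.0G | 268 | 84 | config/log/ckpt |
| VMamba-T[s2l5] | ImageNet-1K | 224x224 | 82.5 | 31M | 4.9G | 1340 | 464 | config/log/ckpt |
| VMamba-S[s2l15] | ImageNet-1K | 224x224 | 83.6 | 50M | 8.7G | 877 | 314 | config/log/ckpt |
| VMamba-B[s2l15] | ImageNet-1K | 224x224 | 83.9 | 89M | 15.4G | 646 | 247 | config/log/ckpt |
| VMamba-T[s1l8] | ImageNet-1K | 224x224 | 82.6 | 30M | 4.9G | 1686 | 571 | config/log/ckpt |
| VMamba-S[s1l20] | ImageNet-1K | 224x224 | 83.3 | 49M | 8.6G | 1106 | 390 | config/log/ckpt |
| VMamba-B[s1l20] | ImageNet-1K | 224x224 | 83.8 | 87M | 15.2G | 827 | 313 | config/log/ckpt |
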
  • Models in this subsection are trained from scratch with random or manual initialization. The hyper-parameters are inherited from Swin, except for drop_path_rate and EMA. All models are trained with EMA except Vanilla-VMamba-T.
  • TP. (throughput) and Train TP. (training throughput) are measured on an A100 GPU paired with an AMD EPYC 7542 CPU, with batch size 128. Train TP. is tested with mix-resolution and excludes the time spent in the optimizer.
  • FLOPs and parameter counts now include the head (previous versions excluded it, so the numbers are slightly higher than before).
  • We calculate FLOPs with the algorithm @albertgu provides, which yields larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm).
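
The throughput figures above are samples processed per second at a fixed batch size. A minimal sketch of how such a number can be measured is shown below. The helper name `measure_throughput`, its parameters, and the warm-up and iteration counts are assumptions for illustration and are not taken from this repository. On a GPU, a synchronization callback (for example `torch.cuda.synchronize`, assuming PyTorch) should be passed as `sync` so that asynchronous kernels are included in the timing.

```python
import time


def measure_throughput(step, batch_size, warmup=10, iters=50, sync=None):
    """Return samples per second for a callable `step` that processes one batch.

    `step` is a hypothetical zero-argument callable (e.g. one forward pass).
    `sync` is an optional callable that blocks until pending work is finished.
    """
    # Warm-up iterations are excluded from the timing.
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()
    # Guard against a zero interval for extremely cheap steps.
    elapsed = max(time.perf_counter() - start, 1e-12)

    return batch_size * iters / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace with a real model call.
    def dummy_step():
        sum(range(10_000))

    print(f"{measure_throughput(dummy_step, batch_size=128):.1f} samples/s")
```

To approximate a training throughput that excludes the optimizer, `step` would run only the forward and backward passes and leave out the optimizer update.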

Object Detection on COCO

| Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 48M | 267G | MaskRCNN@1x | 42.7 | 65.2 | 46.8 | 39.3 | 62.2 | 42.2 | -- |
| Swin-S | 69M | 354G | MaskRCNN@1x | 44.8 | 66.6 | 48.9 | 40.9 | 63.4 | 44.2 | -- |
| Swin-B | 107M | 496G | MaskRCNN@1x | 46.9 | -- | -- | 42.3 | -- | -- | -- |
| Vanilla-VMamba-T | 42M | 262G 286G | MaskRCNN@1x | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | 357G 400G | MaskRCNN@1x | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-B | 96M | 482G 540G | MaskRCNN@1x | 48.6 | 70.0 | 53.1 | 43.3 | 67.1 | 46.7 | config/log/ckpt |
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@1x | 47.4 | 69.5 | 52.0 | 42.7 | 66.3 | 46.0 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@1x | 48.7 | 70.0 | 53.4 | 43.7 | 67.3 | 47.0 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x | 49.2 | 71.4 | 54.0 | 44.1 | 68.3 | 47.7 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x[bs8] | 49.2 | 70.9 | 53.9 | 43.9 | 67.7 | 47.6 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@1x | 47.3 | 69.3 | 52.0 | 42.7 | 66.4 | 45.9 | config/log/ckpt |
| Swin-T | 48M | 267G | MaskRCNN@3x | 46.0 | 68.1 | 50.3 | 41.6 | 65.1 | 44.9 | -- |
| Swin-S | 69M | 354G | MaskRCNN@3x | 48.2 | 69.8 | 52.8 | 43.2 | 67.0 | 46.1 | -- |
| Vanilla-VMamba-T | 42M | 262G 286G | MaskRCNN@3x | 48.5 | 70.0 | 52.7 | 43.2 | 66.9 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | 357G 400G | MaskRCNN@3x | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 | config/log/ckpt |
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@3x | 48.9 | 70.6 | 53.6 | 43.7 | 67.7 | 46.8 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@3x | 49.9 | 70.9 | 54.7 | 44.20 | 68.2 | 47.7 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@3x | 48.8 | 70.4 | 53.50 | 43.7 | 67.4 | 47.0 | config/log/ckpt |

  • Models in this subsection are initialized from the models trained for classification.
  • We now calculate FLOPs with the algorithm @albertgu provides, which yields larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm).
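
As a rough illustration of what a scan-aware FLOPs estimate can look like, the sketch below charges a fixed number of operations per (batch, position, channel, state) element of the selective scan. The function name `selective_scan_flops`, its parameters, and the default per-element constant of 9 are assumptions for illustration only. They are not taken from this document and may not match the counter actually used to produce the tables.

```python
def selective_scan_flops(B, L, D, N, with_D=True, with_Z=False, per_element=9):
    """Estimate FLOPs of one selective scan (hypothetical counter).

    B: batch size, L: sequence length, D: channels, N: state dimension.
    per_element: assumed operations per (B, L, D, N) element.
    """
    flops = per_element * B * L * D * N
    if with_D:
        # Skip connection: one multiply-add term per output element.
        flops += B * D * L
    if with_Z:
        # Optional gating branch: one extra term per output element.
        flops += B * D * L
    return flops


if __name__ == "__main__":
    # Example sizes are arbitrary and chosen only to show the call.
    print(selective_scan_flops(B=1, L=56 * 56, D=192, N=16))
```

Because the estimate grows with the product of sequence length, channel count, and state dimension, the scan contributes more at high input resolutions such as those used for detection and segmentation.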

Semantic Segmentation on ADE20K

| Backbone | Input | #params | FLOPs | Segmentor | mIoU(SS) | mIoU(MS) | configs/logs/logs(ms)/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 512x512 | 60M | 945G | UperNet@160k | 44.4 | 45.8 | -- |
| Swin-S | 512x512 | 81M | 1039G | UperNet@160k | 47.6 | 49.5 | -- |
| Swin-B | 512x512 | 121M | 1188G | UperNet@160k | 48.1 | 49.7 | -- |
| Vanilla-VMamba-T | 512x512 | 55M | 939G 964G | UperNet@160k | 47.3 | 48.3 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-S | 512x512 | 76M | 1037G 1081G | UperNet@160k | 49.5 | 50.5 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-B | 512x512 | 110M | 1167G 1226G | UperNet@160k | 50.0 | 51.3 | config/log/log(ms)/ckpt |
| VMamba-T[s2l5] | 512x512 | 62M | 948G | UperNet@160k | 48.3 | 48.6 | config/log/log(ms)/ckpt |
| VMamba-S[s2l15] | 512x512 | 82M | 1028G | UperNet@160k | 50.6 | 51.2 | config/log/log(ms)/ckpt |
| VMamba-B[s2l15] | 512x512 | 122M | 1170G | UperNet@160k | 51.0 | 51.6 | config/log/log(ms)/ckpt |
| VMamba-T[s1l8] | 512x512 | 62M | 949G | UperNet@160k | 47.9 | 48.8 | config/log/log(ms)/ckpt |

  • Models in this subsection are initialized from the models trained for classification.
  • We now calculate FLOPs with the algorithm @albertgu provides, which yields larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm).