ScalableViT
December 15, 2022 · View on GitHub
This is the code of paper "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer".
It currently includes code and models for the following tasks:
Introduction
ScalableViT (Scalable Vision Transformer) inculdes Scalable Self-Attention (SSA) and Interactive Window-based Self-Attention (IWSA) mechanisms. SSA leverages two scaling factors to release dimensions of , , and matrices. IWSA establishes interaction between non-overlapping regions by re-merging independent tokens and aggregating spatial information from adjacent windows. By stacking the SSA and IWSA alternately, ScalableViT-S achieves $83.1 %$ acc on ImageNet-1K.

Main results
Image Classification on ImageNet
| Model | #Param.(M) | FLOPs(G) | top1-acc |
|---|---|---|---|
| ScalableViT-S | 32.4 | 4.2 | 83.1 |
| ScalableViT-B | 81.9 | 8.6 | 84.1 |
| ScalableViT-L | 104.9 | 14.7 | 84.4 |
Object Detection on COCO
RetinaNet
| Backbone | Pretrain | Lr Schd | #Param.(M) | FLOPs(G) | bbox mAP |
|---|---|---|---|---|---|
| ScalableViT-S | ImageNet-1K | 1x | 36.4 | 238 | 45.2 |
| ScalableViT-S | ImageNet-1K | 3x | 36.4 | 238 | 47.8 |
| ScalableViT-B | ImageNet-1K | 1x | 85.6 | 330 | 45.8 |
| ScalableViT-B | ImageNet-1K | 3x | 85.6 | 330 | 48.0 |
| ScalableViT-L | ImageNet-1K | 1x | 112.6 | 457 | 46.8 |
Mask R-CNN
| Backbone | Pretrain | Lr Schd | #Param.(M) | FLOPs(G) | bbox mAP | mask mAP |
|---|---|---|---|---|---|---|
| ScalableViT-S | ImageNet-1K | 1x | 46.3 | 256 | 45.8 | 41.7 |
| ScalableViT-S | ImageNet-1K | 3x | 46.3 | 256 | 48.7 | 43.6 |
| ScalableViT-B | ImageNet-1K | 1x | 94.9 | 349 | 46.6 | 42.1 |
| ScalableViT-B | ImageNet-1K | 3x | 94.9 | 349 | 48.9 | 43.6 |
| ScalableViT-L | ImageNet-1K | 1x | 121.4 | 477 | 47.6 | 42.9 |
Semantic Segmentation on ADE20K
Semantic FPN
| Backbone | Method | Crop Size | Lr Schd | #Param.(M) | FLOPs(G) | mIoU |
|---|---|---|---|---|---|---|
| ScalableViT-S | Semantic FPN | 512x512 | 80K | 30.4 | 174 | 44.9 |
| ScalableViT-B | Semantic FPN | 512x512 | 80K | 79.0 | 270 | 48.4 |
| ScalableViT-L | Semantic FPN | 512x512 | 80K | 105.5 | 402 | 49.4 |
UperNet
| Backbone | Method | Crop Size | Lr Schd | #Param.(M) | FLOPs(G) | mIoU | mIoU (ms+flip) |
|---|---|---|---|---|---|---|---|
| ScalableViT-S | UperNet | 512x512 | 160K | 56.5 | 931 | 48.5 | 49.4 |
| ScalableViT-B | UperNet | 512x512 | 160K | 107.0 | 1029 | 49.5 | 50.4 |
| ScalableViT-L | UperNet | 512x512 | 160K | 135.5 | 1162 | 49.7 | 50.7 |
Citation
@article{ScalableViT,
title={ScalableViT: Rethinking the context-oriented generalization of vision transformer},
author={Yang, Rui and Ma, Hailong and Wu, Jie and Tang, Yansong and Xiao, Xuefeng and Zheng, Min and Li, Xiu},
journal={arXiv preprint arXiv:2203.10790},
year={2022}
}