ScalableViT

December 15, 2022

This is the code for the paper "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer".

It currently includes code and models for the tasks reported under Main results: image classification, object detection, and semantic segmentation.

Introduction

ScalableViT (Scalable Vision Transformer) includes two mechanisms: Scalable Self-Attention (SSA) and Interactive Window-based Self-Attention (IWSA). SSA leverages two scaling factors to release the dimensions of the $query$, $key$, and $value$ matrices. IWSA establishes interaction between non-overlapping regions by re-merging independent $value$ tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA alternately, ScalableViT-S achieves $83.1\%$ top-1 accuracy on ImageNet-1K.
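To make the role of the two scaling factors in SSA more concrete, here is a minimal NumPy sketch of one plausible reading. It is not the code in this repository. It assumes a spatial factor `r_n` that shrinks the number of key/value tokens and a channel factor `r_c` that scales the width of the query and key. The function name, the random projection weights, and the average-pooling spatial reduction are all hypothetical stand-ins for the learned layers.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scalable_self_attention(x, r_n, r_c, rng):
    """Single-head attention with scaled key/value length and query/key width.

    x   : (N, C) array of input tokens
    r_n : assumed spatial scaling factor, 0 < r_n <= 1
    r_c : assumed channel scaling factor, r_c > 0
    rng : numpy Generator used for the stand-in projection weights
    """
    n, c = x.shape
    n_s = max(1, int(n * r_n))  # number of key/value tokens after scaling
    c_s = max(1, int(c * r_c))  # channel width of query and key after scaling

    # Hypothetical random projections standing in for learned weights.
    w_q = rng.standard_normal((c, c_s)) / np.sqrt(c)
    w_k = rng.standard_normal((c, c_s)) / np.sqrt(c)
    w_v = rng.standard_normal((c, c)) / np.sqrt(c)

    # Assumed spatial reduction: average consecutive groups of tokens.
    group = n // n_s
    x_s = x[: n_s * group].reshape(n_s, group, c).mean(axis=1)

    q = x @ w_q    # (N, c_s)
    k = x_s @ w_k  # (n_s, c_s)
    v = x_s @ w_v  # (n_s, C)

    attn = softmax(q @ k.T / np.sqrt(c_s), axis=-1)  # (N, n_s)
    return attn @ v, attn  # output keeps the input shape (N, C)
```

Under these assumptions the attention map has shape `(N, N * r_n)` instead of `(N, N)`, which is where the savings come from, while the output keeps the input shape so that blocks can be stacked.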

Architecture

Main results

Image Classification on ImageNet

Model	#Param.(M)	FLOPs(G)	Top-1 acc (%)
ScalableViT-S	32.4	4.2	83.1
ScalableViT-B	81.9	8.6	84.1
ScalableViT-L	104.9	14.7	84.4

Object Detection on COCO

RetinaNet

Backbone	Pretrain	Lr Schd	#Param.(M)	FLOPs(G)	bbox mAP
ScalableViT-S	ImageNet-1K	1x	36.4	238	45.2
ScalableViT-S	ImageNet-1K	3x	36.4	238	47.8
ScalableViT-B	ImageNet-1K	1x	85.6	330	45.8
ScalableViT-B	ImageNet-1K	3x	85.6	330	48.0
ScalableViT-L	ImageNet-1K	1x	112.6	457	46.8

Mask R-CNN

Backbone	Pretrain	Lr Schd	#Param.(M)	FLOPs(G)	bbox mAP	mask mAP
ScalableViT-S	ImageNet-1K	1x	46.3	256	45.8	41.7
ScalableViT-S	ImageNet-1K	3x	46.3	256	48.7	43.6
ScalableViT-B	ImageNet-1K	1x	94.9	349	46.6	42.1
ScalableViT-B	ImageNet-1K	3x	94.9	349	48.9	43.6
ScalableViT-L	ImageNet-1K	1x	121.4	477	47.6	42.9

Semantic Segmentation on ADE20K

Semantic FPN

Backbone	Method	Crop Size	Lr Schd	#Param.(M)	FLOPs(G)	mIoU
ScalableViT-S	Semantic FPN	512x512	80K	30.4	174	44.9
ScalableViT-B	Semantic FPN	512x512	80K	79.0	270	48.4
ScalableViT-L	Semantic FPN	512x512	80K	105.5	402	49.4

UperNet

Backbone	Method	Crop Size	Lr Schd	#Param.(M)	FLOPs(G)	mIoU	mIoU (ms+flip)
ScalableViT-S	UperNet	512x512	160K	56.5	931	48.5	49.4
ScalableViT-B	UperNet	512x512	160K	107.0	1029	49.5	50.4
ScalableViT-L	UperNet	512x512	160K	135.5	1162	49.7	50.7

Citation

@article{ScalableViT,
  title={ScalableViT: Rethinking the context-oriented generalization of vision transformer},
  author={Yang, Rui and Ma, Hailong and Wu, Jie and Tang, Yansong and Xiao, Xuefeng and Zheng, Min and Li, Xiu},
  journal={arXiv preprint arXiv:2203.10790},
  year={2022}
}