Updates

October 27, 2022

  • (2022/08/09) Application examples for polyp segmentation (Polyp-PVT) and vision-language modeling.
  • (2021/06/21) Code of PVTv2 is released! PVTv2 significantly improves on PVTv1 and outperforms Swin Transformer under ImageNet-1K pre-training.

Pyramid Vision Transformer

The image is from Transformers: Revenge of the Fallen.

This repository contains the official implementation of PVTv1 and PVTv2 for image classification, object detection, and semantic segmentation.
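The core idea behind PVT is spatial-reduction attention (SRA): keys and values are computed from a spatially downsampled copy of the token grid, cutting attention cost by the square of the reduction ratio. The sketch below is illustrative only and is not code from this repository: random matrices stand in for the learned linear projections, and average pooling replaces the model's learned strided convolution and LayerNorm.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_reduction_attention(x, H, W, num_heads=2, sr_ratio=2, seed=0):
    """Illustrative SRA forward pass with random weights.

    x: (N, H*W, C) token sequence from one pyramid stage.
    Keys/values come from a spatially reduced copy of x, so the
    attention matrix is (H*W) x (H*W / sr_ratio**2) per head.
    """
    rng = np.random.default_rng(seed)
    N, L, C = x.shape
    assert L == H * W and C % num_heads == 0
    d = C // num_heads

    # Random projections stand in for the learned q/k/v/output linears.
    Wq, Wk, Wv, Wo = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(4))

    # Spatial reduction: average-pool the token grid by sr_ratio
    # (the real model uses a strided conv followed by LayerNorm).
    r = sr_ratio
    g = x.reshape(N, H, W, C)[:, :H - H % r, :W - W % r]
    g = g.reshape(N, H // r, r, W // r, r, C).mean(axis=(2, 4))
    x_red = g.reshape(N, (H // r) * (W // r), C)

    def heads(t, W_):  # project and split into heads: (N, h, len, d)
        p = t @ W_
        return p.reshape(N, t.shape[1], num_heads, d).transpose(0, 2, 1, 3)

    q, k, v = heads(x, Wq), heads(x_red, Wk), heads(x_red, Wv)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))  # (N, h, L, L/r^2)
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(N, L, C)
    return out @ Wo

x = np.random.default_rng(1).standard_normal((2, 16 * 16, 64))
y = spatial_reduction_attention(x, 16, 16, num_heads=2, sr_ratio=4)
print(y.shape)  # (2, 256, 64): output length/width match the input tokens
```

With `sr_ratio=4`, each query attends to 16 reduced tokens instead of 256, a 16x reduction in attention cost; this is what lets PVT apply global attention to the high-resolution early stages of the pyramid.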

Model Zoo

Image Classification

For classification configs & weights, see >>>here<<<.

  • PVTv2 on ImageNet-1K
| Method          | Size | Acc@1 | #Params (M) |
|-----------------|------|-------|-------------|
| PVTv2-B0        | 224  | 70.5  | 3.7         |
| PVTv2-B1        | 224  | 78.7  | 14.0        |
| PVTv2-B2-Linear | 224  | 82.1  | 22.6        |
| PVTv2-B2        | 224  | 82.0  | 25.4        |
| PVTv2-B3        | 224  | 83.1  | 45.2        |
| PVTv2-B4        | 224  | 83.6  | 62.6        |
| PVTv2-B5        | 224  | 83.8  | 82.0        |
  • PVTv1 on ImageNet-1K
| Method     | Size | Acc@1 | #Params (M) |
|------------|------|-------|-------------|
| PVT-Tiny   | 224  | 75.1  | 13.2        |
| PVT-Small  | 224  | 79.8  | 24.5        |
| PVT-Medium | 224  | 81.2  | 44.2        |
| PVT-Large  | 224  | 81.7  | 61.4        |
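The variants above trade accuracy against parameter count, so picking a backbone usually means choosing the most accurate model that fits a parameter budget. The helper below is not part of this repository; it is a small illustration using the PVTv2 numbers from the table above.

```python
# ImageNet-1K top-1 accuracy and #params (M), copied from the PVTv2 table above.
pvtv2 = {
    "PVTv2-B0": (70.5, 3.7),
    "PVTv2-B1": (78.7, 14.0),
    "PVTv2-B2-Linear": (82.1, 22.6),
    "PVTv2-B2": (82.0, 25.4),
    "PVTv2-B3": (83.1, 45.2),
    "PVTv2-B4": (83.6, 62.6),
    "PVTv2-B5": (83.8, 82.0),
}

def best_under_budget(zoo, max_params_m):
    """Return the most accurate variant whose parameter count fits the budget."""
    fits = {name: (acc, p) for name, (acc, p) in zoo.items() if p <= max_params_m}
    return max(fits, key=lambda name: fits[name][0]) if fits else None

print(best_under_budget(pvtv2, 25))  # PVTv2-B2-Linear
print(best_under_budget(pvtv2, 50))  # PVTv2-B3
```

Note that PVTv2-B2-Linear slightly outperforms PVTv2-B2 on ImageNet-1K despite having fewer parameters, which makes it an attractive default at that scale.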

Object Detection

For detection configs & weights, see >>>here<<<.

  • PVTv2 on COCO

Baseline Detectors

| Method     | Backbone | Pretrain    | Lr schd | Aug | box AP | mask AP |
|------------|----------|-------------|---------|-----|--------|---------|
| RetinaNet  | PVTv2-b0 | ImageNet-1K | 1x      | No  | 37.2   | -       |
| RetinaNet  | PVTv2-b1 | ImageNet-1K | 1x      | No  | 41.2   | -       |
| RetinaNet  | PVTv2-b2 | ImageNet-1K | 1x      | No  | 44.6   | -       |
| RetinaNet  | PVTv2-b3 | ImageNet-1K | 1x      | No  | 45.9   | -       |
| RetinaNet  | PVTv2-b4 | ImageNet-1K | 1x      | No  | 46.1   | -       |
| RetinaNet  | PVTv2-b5 | ImageNet-1K | 1x      | No  | 46.2   | -       |
| Mask R-CNN | PVTv2-b0 | ImageNet-1K | 1x      | No  | 38.2   | 36.2    |
| Mask R-CNN | PVTv2-b1 | ImageNet-1K | 1x      | No  | 41.8   | 38.8    |
| Mask R-CNN | PVTv2-b2 | ImageNet-1K | 1x      | No  | 45.3   | 41.2    |
| Mask R-CNN | PVTv2-b3 | ImageNet-1K | 1x      | No  | 47.0   | 42.5    |
| Mask R-CNN | PVTv2-b4 | ImageNet-1K | 1x      | No  | 47.5   | 42.7    |
| Mask R-CNN | PVTv2-b5 | ImageNet-1K | 1x      | No  | 47.4   | 42.5    |

Advanced Detectors

| Method             | Backbone        | Pretrain    | Lr schd | Aug | box AP | mask AP |
|--------------------|-----------------|-------------|---------|-----|--------|---------|
| Cascade Mask R-CNN | PVTv2-b2-Linear | ImageNet-1K | 3x      | Yes | 50.9   | 44.0    |
| Cascade Mask R-CNN | PVTv2-b2        | ImageNet-1K | 3x      | Yes | 51.1   | 44.4    |
| ATSS               | PVTv2-b2-Linear | ImageNet-1K | 3x      | Yes | 48.9   | -       |
| ATSS               | PVTv2-b2        | ImageNet-1K | 3x      | Yes | 49.9   | -       |
| GFL                | PVTv2-b2-Linear | ImageNet-1K | 3x      | Yes | 49.2   | -       |
| GFL                | PVTv2-b2        | ImageNet-1K | 3x      | Yes | 50.2   | -       |
| Sparse R-CNN       | PVTv2-b2-Linear | ImageNet-1K | 3x      | Yes | 48.9   | -       |
| Sparse R-CNN       | PVTv2-b2        | ImageNet-1K | 3x      | Yes | 50.1   | -       |
  • PVTv1 on COCO
| Detector   | Backbone  | Pretrain    | Lr schd | box AP | mask AP |
|------------|-----------|-------------|---------|--------|---------|
| RetinaNet  | PVT-Tiny  | ImageNet-1K | 1x      | 36.7   | -       |
| RetinaNet  | PVT-Small | ImageNet-1K | 1x      | 40.4   | -       |
| Mask R-CNN | PVT-Tiny  | ImageNet-1K | 1x      | 36.7   | 35.1    |
| Mask R-CNN | PVT-Small | ImageNet-1K | 1x      | 40.4   | 37.8    |
| DETR       | PVT-Small | ImageNet-1K | 50ep    | 34.7   | -       |

Semantic Segmentation

For segmentation configs & weights, see >>>here<<<.

For PVTv2 + segmentation, see >>>here<<<.

  • PVTv1 on ADE20K
| Method       | Backbone   | Pretrain    | Iters | mIoU |
|--------------|------------|-------------|-------|------|
| Semantic FPN | PVT-Tiny   | ImageNet-1K | 40K   | 35.7 |
| Semantic FPN | PVT-Small  | ImageNet-1K | 40K   | 39.8 |
| Semantic FPN | PVT-Medium | ImageNet-1K | 40K   | 41.6 |
| Semantic FPN | PVT-Large  | ImageNet-1K | 40K   | 42.1 |

Polyp Segmentation

Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers. pdf | code

Vision-Language Modeling

Masked Vision-Language Transformer in Fashion. pdf | code

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Citation

If you use this code in your research, please cite:

PVTv1

@inproceedings{wang2021pyramid,
  title={Pyramid vision transformer: A versatile backbone for dense prediction without convolutions},
  author={Wang, Wenhai and Xie, Enze and Li, Xiang and Fan, Deng-Ping and Song, Kaitao and Liang, Ding and Lu, Tong and Luo, Ping and Shao, Ling},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={568--578},
  year={2021}
}

PVTv2

@article{wang2021pvtv2,
  title={{PVTv2}: Improved baselines with pyramid vision transformer},
  author={Wang, Wenhai and Xie, Enze and Li, Xiang and Fan, Deng-Ping and Song, Kaitao and Liang, Ding and Lu, Tong and Luo, Ping and Shao, Ling},
  journal={Computational Visual Media},
  volume={8},
  number={3},
  pages={1--10},
  year={2022},
  publisher={Springer}
}

Contact

This repo is currently maintained by Wenhai Wang (@whai362), Enze Xie (@xieenze), and Zhe Chen (@czczup).