
This project provides the source code for the object detection part of the Vision Longformer paper. It is based on detectron2.

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

The classification part of the code and checkpoints can be found here.

Updates

  • 03/29/2021: First version of the Vision Longformer paper posted on arXiv.
  • 05/17/2021: Performance improved by adding relative positional bias, inspired by the Swin Transformer! First version of the object detection code released.

Usage

Here is an example command for evaluating a pretrained Vision Longformer small model on COCO:

python -m pip install -e .

ln -s /mnt/data_storage datasets

DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 1 --eval-only --config configs/msvit_maskrcnn_fpn_1x_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True \
    MODEL.WEIGHTS /mnt/model_storage/msvit_det/visionlongformer/vilsmall/maskrcnn1x/model_final.pth
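
The `MODEL.TRANSFORMER.MSVIT.ARCH` string describes the backbone stage by stage: stages are separated by `_`, and each stage is a comma-separated list of single-letter fields with integer values (for ViL-Small, `l` is the stage index, `h` the number of heads, `d` the embedding dimension, `n` the number of blocks, and `p` the patch/downsampling size; the remaining fields control attention-related options such as the number of global tokens). The snippet below is a minimal, hypothetical helper for inspecting such a string; it is not part of this repository.

```python
# Hypothetical helper for inspecting an MSVIT.ARCH string (not part of this repo).
def parse_msvit_arch(arch: str):
    """Split an arch string like 'l1,h3,d96,...' into one dict per stage."""
    stages = []
    for stage_str in arch.split("_"):        # stages are separated by '_'
        fields = {}
        for item in stage_str.split(","):    # each field is a letter + integer
            fields[item[0]] = int(item[1:])
        stages.append(fields)
    return stages

arch = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_"
        "l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_"
        "l4,h12,d768,n1,s1,g0,p2,f7,a0")
for stage in parse_msvit_arch(arch):
    # e.g. {'l': 1, 'h': 3, 'd': 96, 'n': 1, 's': 1, 'g': 1, 'p': 4, 'f': 7, 'a': 0}
    print(stage)
```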

Here is an example command for training the Vision Longformer small model on COCO:

python -m pip install -e .

ln -s /mnt/data_storage datasets

# convert the classification checkpoint into a detection checkpoint for initialization
python3 converter.py --source_model "/mnt/model_storage/msvit/visionlongformer/small1281_relative/model_best.pth" \
    --output_model msvit_pretrain.pth --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0"

# train with the converted detection checkpoint as initialization
DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 8 --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
    MODEL.WEIGHTS msvit_pretrain.pth MODEL.TRANSFORMER.DROP_PATH 0.2 \
    MODEL.TRANSFORMER.MSVIT.ATTN_TYPE longformerhand \
    MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
    SOLVER.AMP.ENABLED True SOLVER.BASE_LR 1e-4 SOLVER.WEIGHT_DECAY 0.1 \
    TEST.EVAL_PERIOD 7330 SOLVER.IMS_PER_BATCH 16
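
The conversion step above repacks the ImageNet classification weights into the layout the Detectron2-based detector expects before fine-tuning. As a rough illustration only (the key prefixes `head.` and `backbone.` are assumptions for this sketch and do not necessarily match the actual converter.py), such a conversion typically drops the classification head and re-prefixes the transformer weights for the detection backbone:

```python
# Illustrative sketch of a classification -> detection checkpoint conversion.
# The "head." / "backbone." key names are assumptions for this example only;
# use the provided converter.py for real conversions.
import torch

def convert_cls_to_det(source_path: str, output_path: str) -> None:
    ckpt = torch.load(source_path, map_location="cpu")
    # Classification checkpoints often wrap the weights under 'model' or 'state_dict'.
    state_dict = ckpt.get("model", ckpt.get("state_dict", ckpt))

    converted = {}
    for name, tensor in state_dict.items():
        if name.startswith("head."):              # drop the ImageNet classifier head
            continue
        converted["backbone." + name] = tensor    # prefix assumed by the detection model

    # Detectron2's checkpointer accepts checkpoints of the form {"model": state_dict}.
    torch.save({"model": converted}, output_path)
```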

Model Zoo on COCO

Vision Longformer with relative positional bias

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | checkpoints | log |
|----------|--------|----------|-----------|---------|---------|----------|---------|-------|-------------|-----|
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 41.4 | 38.1 | 26.9M | 145.6G | ckpt, config | log |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.1 | 3x | 44.2 | 40.6 | 26.9M | 145.6G | ckpt, config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 1x | 44.9 | 41.1 | 45.0M | 218.3G | ckpt, config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 3x | 47.1 | 42.7 | 45.0M | 218.3G | ckpt, config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.2 | 1x | 47.6 | 43.0 | 60.1M | 293.8G | ckpt, config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 48.9 | 44.2 | 60.1M | 293.8G | ckpt, config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 1x | 48.6 | 43.6 | 76.1M | 384.4G | ckpt, config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 49.6 | 44.5 | 76.1M | 384.4G | ckpt, config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.05 | 1x | 40.8 | -- | 16.64M | 182.7G | ckpt, config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.1 | 3x | 43.6 | -- | 16.64M | 182.7G | ckpt, config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.1 | 1x | 44.2 | -- | 35.68M | 254.8G | ckpt, config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.2 | 3x | 45.9 | -- | 35.68M | 254.8G | ckpt, config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.2 | 1x | 46.8 | -- | 50.77M | 330.4G | ckpt, config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 47.9 | -- | 50.77M | 330.4G | ckpt, config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 1x | 47.8 | -- | 66.74M | 420.9G | ckpt, config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 48.6 | -- | 66.74M | 420.9G | ckpt, config | log |

See more fine-grained results in Tables 6 and 7 of the Vision Longformer paper. We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2, 0.3].
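
If you want to reproduce such a drop path search, it can be scripted on top of train_net.py. The loop below is only a sketch: the sweep values mirror the range above, but the output directory layout is purely illustrative.

```python
# Illustrative sweep over drop path rates using the training command from the Usage section.
import os
import subprocess

ARCH = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0")
env = dict(os.environ, DETECTRON2_DATASETS="datasets")

for drop_path in [0.05, 0.1, 0.2, 0.3]:
    subprocess.run(
        [
            "python", "train_net.py", "--num-gpus", "8",
            "--config", "configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml",
            "MODEL.WEIGHTS", "msvit_pretrain.pth",
            "MODEL.TRANSFORMER.DROP_PATH", str(drop_path),
            "MODEL.TRANSFORMER.MSVIT.ARCH", ARCH,
            "SOLVER.AMP.ENABLED", "True",
            "OUTPUT_DIR", f"output/vil_small_dp{drop_path}",  # illustrative layout
        ],
        check=True,
        env=env,
    )
```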

Comparison of various efficient attention mechanisms with absolute positional embedding (Small size)

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | Memory | checkpoints | log |
|----------|--------|----------|-----------|---------|---------|----------|---------|-------|--------|-------------|-----|
| srformer/64 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.4 | 34.6 | 73.3M | 224.1G | 7.1G | ckpt, config | log |
| srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 39.9 | 37.3 | 51.5M | 268.3G | 13.6G | ckpt, config | log |
| Partial srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.4 | 39.0 | 46.8M | 352.1G | 22.6G | ckpt, config | log |
| global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 34.8 | 33.4 | 45.2M | 226.4G | 7.6G | ckpt, config | log |
| Partial global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.5 | 39.2 | 45.1M | 326.5G | 20.1G | ckpt, config | log |
| performer | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.1 | 34.3 | 45.0M | 251.5G | 8.4G | ckpt, config | log |
| Partial performer | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 42.3 | 39.1 | 45.0M | 343.7G | 20.0G | ckpt, config | log |
| ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.9 | 39.6 | 45.0M | 218.3G | 7.4G | ckpt, config | log |
| Partial ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 43.3 | 39.8 | 45.0M | 326.8G | 19.5G | ckpt, config | log |

We use weight decay 0.05 for all experiments, but search for the best drop path rate in [0.05, 0.1, 0.2].