get pretrained model

August 18, 2024

A Unified Visual Parameter-Efficient Transfer Learning Benchmark

· Paper · Benchmark · Homepage · Document ·

🔥 News and Updates

  • ✅ [2024/07/25] Visual PEFT Benchmark starts releasing dataset, code, etc.

  • ✅ [2024/06/20] Visual PEFT Benchmark homepage is created.

  • ✅ [2024/06/01] Visual PEFT Benchmark repo is created.

Introduction

Parameter-efficient transfer learning (PETL) methods show promise in adapting a pre-trained model to various downstream tasks while training only a few parameters. In the computer vision (CV) domain, numerous PETL algorithms have been proposed, but their direct employment or comparison remains inconvenient. To address this challenge, we construct a Unified Visual PETL Benchmark (V-PETL Bench) for the CV domain by selecting 30 diverse, challenging, and comprehensive datasets from image recognition, video action recognition, and dense prediction tasks. On these datasets, we systematically evaluate 25 dominant PETL algorithms and open-source a modular and extensible codebase for a fair evaluation of these algorithms.

⚙️ Getting Started

👉 Data Preparation

1. Image Classification Dataset

  • Fine-Grained Visual Classification tasks (FGVC)

    FGVC comprises 5 fine-grained visual classification datasets. The datasets can be downloaded from their official links. We split the training data when a public validation set is not available. The split datasets can be found here: Download Link.

  • Visual Task Adaptation Benchmark (VTAB)

    VTAB comprises 19 diverse visual classification datasets. We have processed all the datasets, and the data can be downloaded here: Download Link. For the specific processing procedures and tips, please see VTAB_SETUP.

2. Video Action Recognition Dataset

  • Kinetics-400

    1. Download the dataset from Download Link or Download Link.

    2. Preprocess the dataset by resizing the short edge of each video to 320px. You can refer to MMAction2 Data Benchmark.

    3. Generate the annotations needed by the dataloader (each line has the form "<video_id> <video_class>"). The annotations usually include train.csv, val.csv and test.csv. Each *.csv file is formatted as follows:

      video_1.mp4  label_1
      video_2.mp4  label_2
      video_3.mp4  label_3
      ...
      video_N.mp4  label_N
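The short-edge resize in step 2 can be sketched as a small size calculation. This is only an illustration: the function name `short_edge_resize` is hypothetical, and rounding the long edge to an even number is an assumption made here because many video codecs expect even dimensions. The actual resizing would be done by a tool such as the one described in the MMAction2 reference above.

```python
from typing import Tuple


def short_edge_resize(width: int, height: int, short_edge: int = 320) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter side set to `short_edge`.

    The aspect ratio is preserved; the longer side is rounded to the nearest
    even number (an assumption of this sketch, not a benchmark requirement).
    """
    if width <= 0 or height <= 0 or short_edge <= 0:
        raise ValueError("width, height and short_edge must be positive")

    def to_even(value: float) -> int:
        return max(2, 2 * round(value / 2))

    if width <= height:
        # Portrait or square: the width is the short edge.
        return short_edge, to_even(height * short_edge / width)
    # Landscape: the height is the short edge.
    return to_even(width * short_edge / height), short_edge


if __name__ == "__main__":
    print(short_edge_resize(1280, 720))
    print(short_edge_resize(720, 1280))
```

The resulting size can then be passed to whichever video tool you use for the actual re-encoding.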
      

  • Something-Something V2 (SSv2)

    1. Download the dataset from Download Link.

    2. Preprocess the dataset by converting the videos from .webm to .mp4 while keeping the original height of 240px. You can refer to MMAction2 Data Benchmark.

    3. Generate the annotations needed by the dataloader (each line has the form "<video_id> <video_class>"). The annotations usually include train.csv, val.csv and test.csv. Each *.csv file is formatted as follows:

      video_1.mp4  label_1
      video_2.mp4  label_2
      video_3.mp4  label_3
      ...
      video_N.mp4  label_N
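The annotation format above can be generated with a short script. The sketch below is a minimal example under stated assumptions: it assumes the videos are stored as `<video_root>/<class_name>/<video>.mp4`, it maps class names to integer labels in sorted order, and it separates the two fields with a single space. The function names are hypothetical, and the dataloader in this repository may expect a different layout or label encoding.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def build_annotation_lines(video_root: Path) -> Tuple[List[str], Dict[str, int]]:
    """Collect "<video_path> <label>" lines for every .mp4 under video_root/<class>/."""
    class_names = sorted(p.name for p in video_root.iterdir() if p.is_dir())
    # Assumption: labels are integer indices of the sorted class names.
    class_to_label = {name: idx for idx, name in enumerate(class_names)}

    lines = []
    for name in class_names:
        for video in sorted((video_root / name).glob("*.mp4")):
            relative = video.relative_to(video_root).as_posix()
            lines.append(f"{relative} {class_to_label[name]}")
    return lines, class_to_label


def write_annotation_file(lines: List[str], out_path: Path) -> None:
    """Write one annotation per line, e.g. to train.csv, val.csv or test.csv."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(lines) + "\n")
```

A typical call would be `lines, mapping = build_annotation_lines(Path("videos/train"))` followed by `write_annotation_file(lines, Path("annotations/train.csv"))`, where both paths are placeholders.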
      

3. Dense Prediction Dataset

  • MS-COCO

    MS-COCO is available from this Download Link.

  • ADE20K

    The training and validation sets of ADE20K can be downloaded from this Download Link. The test set can also be downloaded from this Download Link.

  • PASCAL VOC

    Pascal VOC 2012 can be downloaded from this Download Link. In addition, most recent works on the Pascal VOC dataset use extra augmentation data, which can be found at this Download Link.

👉 Pre-trained Model Preparation

  • Download the ViT-B/16 pretrained model and place it in /path/to/pretrained_models.
mkdir pretrained_models

wget https://storage.googleapis.com/vit_models/imagenet21k/ViT-B_16.npz
  • Alternatively, download the Swin-B pretrained model. Note that you also need to rename the downloaded Swin-B checkpoint from swin_base_patch4_window7_224_22k.pth to Swin-B_16.pth.
mkdir pretrained_models

wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224_22k.pth

mv swin_base_patch4_window7_224_22k.pth Swin-B_16.pth
  • Another option is to download the pretrained models from the links below and place them in /path/to/pretrained_models.
| Pre-trained Backbone | Pre-trained Objective | Pre-trained Dataset | Checkpoint |
| --- | --- | --- | --- |
| ViT-B/16 | Supervised | ImageNet-21K | Download Link |
| ViT-L/16 | Supervised | ImageNet-21K | Download Link |
| ViT-H/16 | Supervised | ImageNet-21K | Download Link |
| Swin-B | Supervised | ImageNet-22K | Download Link |
| Swin-L | Supervised | ImageNet-22K | Download Link |
| ViT-B (VideoMAE) | Self-Supervised | Kinetics-400 | Download Link |
| Video Swin-B | Supervised | Kinetics-400 | Download Link |
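For scripted setups, the download step can also be done from Python so that the file lands directly in the target folder. The sketch below is a minimal example, not part of the repository: the helper names `checkpoint_path` and `fetch_checkpoint` are hypothetical, the default folder name `pretrained_models` follows the commands above, and only the ViT-B/16 URL given in this README is used.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URL taken from the wget command in this README.
VIT_B16_URL = "https://storage.googleapis.com/vit_models/imagenet21k/ViT-B_16.npz"


def checkpoint_path(url: str, dest_dir: str = "pretrained_models",
                    rename: Optional[str] = None) -> Path:
    """Return the local path for a checkpoint, optionally under a new file name."""
    filename = rename or Path(urlparse(url).path).name
    return Path(dest_dir) / filename


def fetch_checkpoint(url: str, dest_dir: str = "pretrained_models",
                     rename: Optional[str] = None) -> Path:
    """Download `url` into `dest_dir` unless the file is already present."""
    target = checkpoint_path(url, dest_dir, rename)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

Calling `fetch_checkpoint(VIT_B16_URL)` would store the file as `pretrained_models/ViT-B_16.npz`; the `rename` argument covers cases such as the Swin-B checkpoint, which has to be renamed after download.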

💻 Structure of the V-PETL Bench (key files are marked with 👉)

  • ImageClassification/configs: handles config parameters for the experiments.

    • 👉 ImageClassification/config/vtab/cifar100.yaml: the main configuration for the experiments, with an explanation for each dataset.

    • .....

  • ImageClassification/dataloader: loads and sets up the input datasets.

    • ImageClassification/dataloader/transforms: Image transformations.

    • ImageClassification/dataloader/loader: Constructs the data loader for the given dataset.

  • ImageClassification/models: handles backbone archs and heads for different fine-tuning protocols

    • 👉ImageClassification/models/vision_transformer_adapter.py: contains the same backbones as the vit_backbones folder, adapted for Adapter.

    • 👉ImageClassification/models/vision_transformer_sct.py: contains the same backbones as the vit_backbones folder, adapted for SCT.

    • .....

  • 👉ImageClassification/train: a folder containing the training files.

    • 👉ImageClassification/train/train_model_adapter.py: call this file to train and evaluate a model with a specified transfer type, in this case Adapter.

    • 👉ImageClassification/train/train_model_sct.py: call this file to train and evaluate a model with a specified transfer type, in this case SCT.

    • .....

  • ImageClassification/scripts: a folder containing the run scripts.

    • ImageClassification/scripts/run_vit_adapter.sh: runs the Adapter method on all datasets at once.

    • ImageClassification/scripts/run_vit_sct.sh: runs the SCT method on all datasets at once.

    • .....

  • ImageClassification/Visualize: Visualization Tools.

    • ImageClassification/Visualize/AttentionMap.py: Attention map visualization.

    • ImageClassification/Visualize/TSNE.py: T-SNE visualization.

  • ImageClassification/utils: Create logger, Set seed, etc.

  ❗️Note❗️: If you want to create your own PETL algorithm, pay attention to `ImageClassification/models`.

🐌 Quick Start

This is an example of how to set up V-PETL Bench locally.

To get a local copy up and running, follow these simple steps.

👉 Install V-PETL Bench

git clone https://github.com/synbol/Parameter-Efficient-Transfer-Learning-Benchmark.git

👉 Environment Setup

V-PETL Bench is built on PyTorch, together with torchvision, torchaudio, timm, and other packages.

  • To install the required packages, you can create a conda environment.
conda create --name v-petl-bench python=3.8
  • Activate conda environment.
conda activate v-petl-bench
  • Use pip to install required packages.
cd Parameter-Efficient-Transfer-Learning-Benchmark

pip install -r requirements.txt
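After installation, a quick check that the core packages are importable can save a failed training run later. The sketch below is a small convenience script, not part of the repository: the helper name `missing_packages` is hypothetical, and the package list only covers the libraries named above, not everything in requirements.txt.

```python
import importlib.util
from typing import Iterable, List

# Core packages named in this README; requirements.txt may list more.
REQUIRED_PACKAGES = ("torch", "torchvision", "torchaudio", "timm")


def missing_packages(names: Iterable[str] = REQUIRED_PACKAGES) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All core packages are importable.")
```

Run it inside the activated `v-petl-bench` environment; any names it reports can be installed with pip before training.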

👉 Training and Evaluation

Training and Evaluation Demo

  • We provide a specific training and evaluation demo, taking LoRA on VTAB Cifar100 as an example.
import sys
sys.path.append("Parameter-Efficient-Transfer-Learning-Benchmark")

import timm
import torch
from timm.scheduler.cosine_lr import CosineLRScheduler

from ImageClassification import utils
from ImageClassification.dataloader import vtab
from ImageClassification.train import train
# get lora methods (this import makes the LoRA ViT variants available to timm)
from ImageClassification.models import vision_transformer_lora  # noqa: F401


# path to save model and logs
exp_base_path = '../output'
utils.mkdirss(exp_base_path)

# create logger
logger = utils.create_logger(log_path=exp_base_path, log_name='training')

# dataset config parameters
config = utils.get_config('model_lora', 'vtab', 'cifar100')

# get vtab dataset (set data_path to your local VTAB-1k directory)
data_path = '/home/ma-user/work/haozhe/synbol/vtab-1k'
train_dl, test_dl = vtab.get_data(
    data_path, 'cifar100', logger, evaluate=False,
    train_aug=config['train_aug'], batch_size=config['batch_size'])

# get pretrained model and replace the classification head
model = timm.models.create_model(
    'vit_base_patch16_224_in21k_lora',
    checkpoint_path='./released_models/ViT-B_16.npz',
    drop_path_rate=0.1, tuning_mode='lora')
model.reset_classifier(config['class_num'])

# training parameters: only the LoRA matrices and the head are trained,
# all other parameters are frozen
trainable = []
for n, p in model.named_parameters():
    if 'linear_a' in n or 'linear_b' in n or 'head' in n:
        trainable.append(p)
        logger.info(str(n))
    else:
        p.requires_grad = False
opt = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=5e-2)
scheduler = CosineLRScheduler(
    opt, t_initial=config['epochs'], warmup_t=config['warmup_epochs'],
    lr_min=1e-5, warmup_lr_init=1e-6, cycle_decay=0.1)

# cross-entropy loss function
criterion = torch.nn.CrossEntropyLoss()

# training
model = train.train(config, model, criterion, train_dl, opt, scheduler, logger, config['epochs'], 'vtab', 'cifar100')

# evaluation
eval_acc = train.test(model, test_dl, 'vtab')
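The demo combines a linear warmup with a cosine decay of the learning rate. The following sketch shows the general shape of such a schedule in plain Python, using the learning-rate values from the demo (1e-4 base, 1e-5 minimum, 1e-6 warmup start). It is a simplified illustration and not timm's `CosineLRScheduler` implementation, so the exact per-epoch values may differ; the epoch counts (10 warmup, 100 total) are placeholders, since the real values come from the config file.

```python
import math


def warmup_cosine_lr(epoch: int,
                     base_lr: float = 1e-4,
                     lr_min: float = 1e-5,
                     warmup_lr_init: float = 1e-6,
                     warmup_epochs: int = 10,    # placeholder, read from config in the demo
                     total_epochs: int = 100):   # placeholder, read from config in the demo
    """Learning rate at `epoch`: linear warmup, then cosine decay to `lr_min`."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_lr_init up to base_lr.
        return warmup_lr_init + (base_lr - warmup_lr_init) * epoch / warmup_epochs
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    return lr_min + 0.5 * (base_lr - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 5, 10, 55, 100):
        print(e, warmup_cosine_lr(e))
```

The schedule starts at the warmup value, reaches the base learning rate when warmup ends, and decays smoothly to the minimum by the final epoch.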

Call V-PETL Training and Evaluation file

  • You can train with a PETL algorithm on a dataset.
python train/train_model_sct.py --dataset cifar100 --task vtab --lr 0.012 --wd 0.6 --eval True --dpr 0.1 --tuning_mode $tuning_mode --model_type $model_type --model $model --model_checkpoint $model_checkpoint
  • Alternatively, you can train with a PETL algorithm on all datasets.
bash scripts/run_model_sct.sh

🎯 Results and Checkpoints

Benchmark results of image classification on FGVC

  • We evaluate 13 PETL algorithms on five datasets with ViT-B/16 models pre-trained on ImageNet-21K.

  • To obtain the checkpoint, please download it at Download Link.

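| Method | CUB-200-2011 | NABirds | Oxford Flowers | Stanford Dogs | Stanford Cars | Mean | Params. | PPT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full fine-tuning | 87.3 | 82.7 | 98.8 | 89.4 | 84.5 | 88.54 | 85.8M | - |
| Linear probing | 85.3 | 75.9 | 97.9 | 86.2 | 51.3 | 79.32 | 0 M | 0.79 |
| Adapter | 87.1 | 84.3 | 98.5 | 89.8 | 68.6 | 85.66 | 0.41M | 0.84 |
| AdaptFormer | 88.4 | 84.7 | 99.2 | 88.2 | 81.9 | 88.48 | 0.46M | 0.87 |
| Prefix Tuning | 87.5 | 82.0 | 98.0 | 74.2 | 90.2 | 86.38 | 0.36M | 0.85 |
| U-Tuning | 89.2 | 85.4 | 99.2 | 84.1 | 92.1 | 90.00 | 0.36M | 0.89 |
| BitFit | 87.7 | 85.2 | 99.2 | 86.5 | 81.5 | 88.02 | 0.10M | 0.88 |
| VPT-Shallow | 86.7 | 78.8 | 98.4 | 90.7 | 68.7 | 84.66 | 0.25M | 0.84 |
| VPT-Deep | 88.5 | 84.2 | 99.0 | 90.2 | 83.6 | 89.10 | 0.85M | 0.86 |
| SSF | 89.5 | 85.7 | 99.6 | 89.6 | 89.2 | 90.72 | 0.39M | 0.89 |
| LoRA | 85.6 | 79.8 | 98.9 | 87.6 | 72.0 | 84.78 | 0.77M | 0.82 |
| GPS | 89.9 | 86.7 | 99.7 | 92.2 | 90.4 | 91.78 | 0.66M | 0.90 |
| HST | 89.2 | 85.8 | 99.6 | 89.5 | 88.2 | 90.46 | 0.78M | 0.88 |
| LAST | 88.5 | 84.4 | 99.7 | 86.0 | 88.9 | 89.50 | 0.66M | 0.87 |
| SNF | 90.2 | 87.4 | 99.7 | 89.5 | 86.9 | 90.74 | 0.25M | 0.90 |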
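The PPT column is not defined in this README. The sketch below assumes it denotes a parameter-performance trade-off score of the form accuracy × exp(−log10(P / C + 1)), where P is the number of trainable parameters and C is a normalization constant of 10M parameters. Both the formula and the constant are assumptions made here; they reproduce the reported values for most rows above (for example Linear probing, Adapter and LoRA), but this is an illustration rather than the benchmark's official implementation.

```python
import math


def ppt(accuracy_percent: float, params_millions: float,
        c_millions: float = 10.0) -> float:
    """Assumed trade-off score: accuracy scaled down as trainable parameters grow.

    accuracy_percent: task performance in percent, e.g. 85.66
    params_millions:  trainable parameters in millions, e.g. 0.41
    c_millions:       normalization constant in millions (assumed to be 10)
    """
    if params_millions < 0 or c_millions <= 0:
        raise ValueError("params_millions must be >= 0 and c_millions > 0")
    penalty = math.exp(-math.log10(params_millions / c_millions + 1.0))
    return (accuracy_percent / 100.0) * penalty


if __name__ == "__main__":
    # Mean accuracy and parameter counts taken from the FGVC table above.
    for name, acc, params in [("Linear probing", 79.32, 0.0),
                              ("Adapter", 85.66, 0.41),
                              ("LoRA", 84.78, 0.77)]:
        print(f"{name}: {ppt(acc, params):.2f}")
```

Under this definition a method with no trainable backbone parameters keeps its accuracy as its score, while methods with more trainable parameters are penalized progressively.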

Benchmark results of image classification on VTAB

  • Benchmark results on VTAB. We evaluate 18 PETL algorithms on 19 datasets with ViT-B/16 models pre-trained on ImageNet-21K.

  • To obtain the checkpoint, please download it at Download Link.

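| Method | CIFAR-100 | Caltech101 | DTD | Flowers102 | Pets | SVHN | Sun397 | Patch Camelyon | EuroSAT | Resisc45 | Retinopathy | Clevr/count | Clevr/distance | DMLab | KITTI/distance | dSprites/loc | dSprites/ori | SmallNORB/azi | SmallNORB/ele | Mean | Params. | PPT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full fine-tuning | 68.9 | 87.7 | 64.3 | 97.2 | 86.9 | 87.4 | 38.8 | 79.7 | 95.7 | 84.2 | 73.9 | 56.3 | 58.6 | 41.7 | 65.5 | 57.5 | 46.7 | 25.7 | 29.1 | 65.57 | 85.8M | - |
| Linear probing | 63.4 | 85.0 | 63.2 | 97.0 | 86.3 | 36.6 | 51.0 | 78.5 | 87.5 | 68.6 | 74.0 | 34.3 | 30.6 | 33.2 | 55.4 | 12.5 | 20.0 | 9.6 | 19.2 | 52.94 | 0M | 0.53 |
| Adapter | 69.2 | 90.1 | 68.0 | 98.8 | 89.9 | 82.8 | 54.3 | 84.0 | 94.9 | 81.9 | 75.5 | 80.9 | 65.3 | 48.6 | 78.3 | 74.8 | 48.5 | 29.9 | 41.6 | 71.44 | 0.16M | 0.71 |
| VPT-Shallow | 77.7 | 86.9 | 62.6 | 97.5 | 87.3 | 74.5 | 51.2 | 78.2 | 92.0 | 75.6 | 72.9 | 50.5 | 58.6 | 40.5 | 67.1 | 68.7 | 36.1 | 20.2 | 34.1 | 64.85 | 0.08M | 0.65 |
| VPT-Deep | 78.8 | 90.8 | 65.8 | 98.0 | 88.3 | 78.1 | 49.6 | 81.8 | 96.1 | 83.4 | 68.4 | 68.5 | 60.0 | 46.5 | 72.8 | 73.6 | 47.9 | 32.9 | 37.8 | 69.43 | 0.56M | 0.68 |
| BitFit | 72.8 | 87.0 | 59.2 | 97.5 | 85.3 | 59.9 | 51.4 | 78.7 | 91.6 | 72.9 | 69.8 | 61.5 | 55.6 | 32.4 | 55.9 | 66.6 | 40.0 | 15.7 | 25.1 | 62.05 | 0.10M | 0.61 |
| LoRA | 67.1 | 91.4 | 69.4 | 98.8 | 90.4 | 85.3 | 54.0 | 84.9 | 95.3 | 84.4 | 73.6 | 82.9 | 69.2 | 49.8 | 78.5 | 75.7 | 47.1 | 31.0 | 44.0 | 72.25 | 0.29M | 0.71 |
| AdaptFormer | 70.8 | 91.2 | 70.5 | 99.1 | 90.9 | 86.6 | 54.8 | 83.0 | 95.8 | 84.4 | 76.3 | 81.9 | 64.3 | 49.3 | 80.3 | 76.3 | 45.7 | 31.7 | 41.1 | 72.32 | 0.16M | 0.72 |
| SSF | 69.0 | 92.6 | 75.1 | 99.4 | 91.8 | 90.2 | 52.9 | 87.4 | 95.9 | 87.4 | 75.5 | 75.9 | 62.3 | 53.3 | 80.6 | 77.3 | 54.9 | 29.5 | 37.9 | 73.10 | 0.21M | 0.72 |
| NOAH | 69.6 | 92.7 | 70.2 | 99.1 | 90.4 | 86.1 | 53.7 | 84.4 | 95.4 | 83.9 | 75.8 | 82.8 | 68.9 | 49.9 | 81.7 | 81.8 | 48.3 | 32.8 | 44.2 | 73.25 | 0.43M | 0.72 |
| SCT | 75.3 | 91.6 | 72.2 | 99.2 | 91.1 | 91.2 | 55.0 | 85.0 | 96.1 | 86.3 | 76.2 | 81.5 | 65.1 | 51.7 | 80.2 | 75.4 | 46.2 | 33.2 | 45.7 | 73.59 | 0.11M | 0.73 |
| FacT | 70.6 | 90.6 | 70.8 | 99.1 | 90.7 | 88.6 | 54.1 | 84.8 | 96.2 | 84.5 | 75.7 | 82.6 | 68.2 | 49.8 | 80.7 | 80.8 | 47.4 | 33.2 | 43.0 | 73.23 | 0.07M | 0.73 |
| RepAdapter | 72.4 | 91.6 | 71.0 | 99.2 | 91.4 | 90.7 | 55.1 | 85.3 | 95.9 | 84.6 | 75.9 | 82.3 | 68.0 | 50.4 | 79.9 | 80.4 | 49.2 | 38.6 | 41.0 | 73.84 | 0.22M | 0.72 |
| Hydra | 72.7 | 91.3 | 72.0 | 99.2 | 91.4 | 90.7 | 55.5 | 85.8 | 96.0 | 86.1 | 75.9 | 83.2 | 68.2 | 50.9 | 82.3 | 80.3 | 50.8 | 34.5 | 43.1 | 74.21 | 0.28M | 0.73 |
| LST | 59.5 | 91.5 | 69.0 | 99.2 | 89.9 | 79.5 | 54.6 | 86.9 | 95.9 | 85.3 | 74.1 | 81.8 | 61.8 | 52.2 | 81.0 | 71.7 | 49.5 | 33.7 | 45.2 | 71.70 | 2.38M | 0.65 |
| DTL | 69.6 | 94.8 | 71.3 | 99.3 | 91.3 | 83.3 | 56.2 | 87.1 | 96.2 | 86.1 | 75.0 | 82.8 | 64.2 | 48.8 | 81.9 | 93.9 | 53.9 | 34.2 | 47.1 | 74.58 | 0.04M | 0.75 |
| HST | 76.7 | 94.1 | 74.8 | 99.6 | 91.1 | 91.2 | 52.3 | 87.1 | 96.3 | 88.6 | 76.5 | 85.4 | 63.7 | 52.9 | 81.7 | 87.2 | 56.8 | 35.8 | 52.1 | 75.99 | 0.78M | 0.74 |
| GPS | 81.1 | 94.2 | 75.8 | 99.4 | 91.7 | 91.6 | 52.4 | 87.9 | 96.2 | 86.5 | 76.5 | 79.9 | 62.6 | 55.0 | 82.4 | 84.0 | 55.4 | 29.7 | 46.1 | 75.18 | 0.22M | 0.74 |
| LAST | 66.7 | 93.4 | 76.1 | 99.6 | 89.8 | 86.1 | 54.3 | 86.2 | 96.3 | 86.8 | 75.4 | 81.9 | 65.9 | 49.4 | 82.6 | 87.9 | 46.7 | 32.3 | 51.5 | 74.15 | 0.66M | 0.72 |
| SNF | 84.0 | 94.0 | 72.7 | 99.3 | 91.3 | 90.3 | 54.9 | 87.2 | 97.3 | 85.5 | 74.5 | 82.3 | 63.8 | 49.8 | 82.5 | 75.8 | 49.2 | 31.4 | 42.1 | 74.10 | 0.25M | 0.73 |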

Benchmark results of video action recognition on SSv2 and HMDB51.

  • Benchmark results on SSv2 and HMDB51. We evaluate 5 PETL algorithms with ViT-B from VideoMAE and Video Swin Transformer.

  • To obtain the checkpoint, please download it at Download Link.

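| Method | Model | Pre-training | Params. | SSv2 (Top1) | SSv2 (PPT) | HMDB51 (Top1) | HMDB51 (PPT) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Full fine-tuning | ViT-B | Kinetics 400 | 85.97 M | 53.97% | - | 46.41% | - |
| Frozen | ViT-B | Kinetics 400 | 0 M | 29.23% | 0.29 | 49.84% | 0.50 |
| AdaptFormer | ViT-B | Kinetics 400 | 1.19 M | 59.02% | 0.56 | 55.69% | 0.53 |
| BAPAT | ViT-B | Kinetics 400 | 2.06 M | 57.78% | 0.53 | 57.18% | 0.53 |
| Full fine-tuning | Video Swin-B | Kinetics 400 | 87.64 M | 50.99% | - | 68.07% | - |
| Frozen | Video Swin-B | Kinetics 400 | 0 M | 24.13% | 0.24 | 71.28% | 0.71 |
| LoRA | Video Swin-B | Kinetics 400 | 0.75 M | 38.34% | 0.37 | 62.12% | 0.60 |
| BitFit | Video Swin-B | Kinetics 400 | 1.09 M | 45.94% | 0.44 | 68.26% | 0.65 |
| AdaptFormer | Video Swin-B | Kinetics 400 | 1.56 M | 40.80% | 0.38 | 68.66% | 0.64 |
| Prefix-tuning | Video Swin-B | Kinetics 400 | 6.37 M | 39.46% | 0.32 | 56.13% | 0.45 |
| BAPAT | Video Swin-B | Kinetics 400 | 6.18 M | 53.36% | 0.43 | 71.93% | 0.58 |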

Benchmark results of dense prediction on COCO

  • Benchmark results on COCO. We evaluate 9 PETL algorithms with Swin-B models pre-trained on ImageNet-22K.

  • Checkpoints: Coming Soon.

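| Swin-B | Params. | Memory | COCO ($\mathrm{AP_{Box}}$) | COCO (PPT) | COCO ($\mathrm{AP_{Mask}}$) | COCO (PPT) |
| --- | --- | --- | --- | --- | --- | --- |
| Full fine-tuning | 86.75 M | 17061 MB | 51.9% | - | 45.0% | - |
| Frozen | 0.00 M | 7137 MB | 43.5% | 0.44 | 38.6% | 0.39 |
| Bitfit | 0.20 M | 13657 MB | 47.9% | 0.47 | 41.9% | 0.42 |
| LN TUNE | 0.06 M | 12831 MB | 48.0% | 0.48 | 41.4% | 0.41 |
| Partial-1 | 12.60 M | 7301 MB | 49.2% | 0.35 | 42.8% | 0.30 |
| Adapter | 3.11 M | 12557 MB | 50.9% | 0.45 | 43.8% | 0.39 |
| LoRA | 3.03 M | 11975 MB | 51.2% | 0.46 | 44.3% | 0.40 |
| AdaptFormer | 3.11 M | 13186 MB | 51.4% | 0.46 | 44.5% | 0.40 |
| LoRand | 1.20 M | 13598 MB | 51.0% | 0.49 | 43.9% | 0.42 |
| E$^3$VA | 1.20 M | 7639 MB | 50.5% | 0.48 | 43.8% | 0.42 |
| Mona | 4.16 M | 13996 MB | 53.4% | 0.46 | 46.0% | 0.40 |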

Benchmark results of dense prediction on PASCAL VOC and ADE20K.

  • Benchmark results on PASCAL VOC and ADE20K. We evaluate 9 PETL algorithms with Swin-L models pre-trained on ImageNet-22K.

  • Checkpoints: Coming Soon.

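| Swin-L | Params. | Memory (VOC) | Pascal VOC ($\mathrm{AP_{Box}}$) | Pascal VOC (PPT) | ADE20K ($\mathrm{mIoU}$) | ADE20K (PPT) |
| --- | --- | --- | --- | --- | --- | --- |
| Full fine-tuning | 198.58 M | 15679 MB | 83.5% | - | 52.10% | - |
| Frozen | 0.00 M | 3967 MB | 83.6% | 0.84 | 46.84% | 0.47 |
| Bitfit | 0.30 M | 10861 MB | 85.7% | 0.85 | 48.37% | 0.48 |
| LN TUNE | 0.09 M | 10123 MB | 85.8% | 0.86 | 47.98% | 0.48 |
| Partial-1 | 28.34 M | 3943 MB | 85.4% | 0.48 | 47.44% | 0.27 |
| Adapter | 4.66 M | 10793 MB | 87.1% | 0.74 | 50.78% | 0.43 |
| LoRA | 4.57 M | 10127 MB | 87.5% | 0.74 | 50.34% | 0.43 |
| AdaptFormer | 4.66 M | 11036 MB | 87.3% | 0.74 | 50.83% | 0.43 |
| LoRand | 1.31 M | 11572 MB | 86.8% | 0.82 | 50.76% | 0.48 |
| E$^3$VA | 1.79 M | 4819 MB | 86.5% | 0.81 | 49.64% | 0.46 |
| Mona | 5.08 M | 11958 MB | 87.3% | 0.73 | 51.36% | 0.43 |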

📝 Citation

  • If you find our survey and repository useful for your research, please cite it below:
@article{xin2024bench,
  title={V-PETL Bench: A Unified Visual Parameter-Efficient Transfer Learning Benchmark},
  author={Yi Xin and Siqi Luo and Xuyang Liu and Haodi Zhou and Xinyu Cheng and Christina Luoluo Lee and Junlong Du and Yuntao Du and Haozhe Wang and MingCai Chen and Ting Liu and Guimin Hu and Zhongwei Wan and Rongchao Zhang and Aoxue Li and Mingyang Yi and Xiaohong Liu},
  year={2024}
}