VideoMAEv2 Model Zoo

October 8, 2024

Please fill out the VideoMAE V2 Download Request Form; the download link for the VideoMAE V2 model weights is shown after submission. The form asks about your organization and how you plan to use the model, so that we can better understand our users' needs and improve our future work.

The weights of the distilled models can be downloaded directly from the Distillation section.

Pre-train

| Model | Config | Dataset | Encoder Masking | Decoder Masking | Epoch | #Frame |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-giant | vit_g_hybrid_pt_1200e | UnlabeledHybrid | tube (90%) | running cell (50%) | 1200 | 16 |
  • We set different sampling intervals for videos from different sources in UnlabeledHybrid: 2 for SSv2 and 4 for the other datasets.
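
The sampling intervals above can be illustrated with a minimal sketch. It assumes that a clip is built by taking every `step`-th frame from a start index; the helper names and the dataset keys are hypothetical and are not taken from the VideoMAE V2 code base.

```python
# Hypothetical helpers: names and dataset keys are illustrative only.
SAMPLING_INTERVAL = {"ssv2": 2}  # interval stated for SSv2
DEFAULT_INTERVAL = 4             # interval stated for the other sources


def clip_frame_indices(source, start=0, num_frames=16):
    """Return indices of `num_frames` frames spaced by the source's interval."""
    step = SAMPLING_INTERVAL.get(source.lower(), DEFAULT_INTERVAL)
    return [start + i * step for i in range(num_frames)]


def clip_span(source, num_frames=16):
    """Number of raw video frames covered by one sampled clip."""
    indices = clip_frame_indices(source, 0, num_frames)
    return indices[-1] - indices[0] + 1
```

Under these assumptions, a 16-frame clip spans 31 raw frames for SSv2 and 61 raw frames for the other sources, so the smaller interval keeps SSv2 clips within a shorter temporal window.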

Fine-tune

| Model | Config | Dataset | Pre-train | Post-pre-train | #Frame | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-giant | vit_g_hybrid_pt_1200e_k710_ft | K710 | UnlabeledHybrid | None | 16x5x3 | 83.8 | 96.4 |
| ViT-giant | vit_g_hybrid_pt_1200e_k400_ft | K400 | UnlabeledHybrid | None | 16x5x3 | 87.2 | 97.4 |
| ViT-giant | vit_g_hybrid_pt_1200e_k710_it_k400_ft | K400 | UnlabeledHybrid | K710 | 16x5x3 | 88.4 | 98.0 |
| ViT-giant | vit_g_hybrid_pt_1200e_k710_it_k600_ft | K600 | UnlabeledHybrid | K710 | 16x5x3 | 88.8 | 98.2 |
| ViT-giant | vit_g_hybrid_pt_1200e_ssv2_ft | SSv2 | UnlabeledHybrid | None | 16x2x3 | 77.0 | 95.9 |
| ViT-giant | vit_g_hybrid_pt_1200e_k710_it_ucf101_ft | UCF101 | UnlabeledHybrid | K710 | 16x5x3 | 99.6 | 100.0 |
| ViT-giant | vit_g_hybrid_pt_1200e_k710_it_hmdb51_ft | HMDB51 | UnlabeledHybrid | K710 | 16x5x3 | 88.1 | 98.5 |
  • We report the fine-tuning accuracy for sparse sampling on SSv2 and for dense sampling on the other datasets.
  • #Frame = #input_frame x #clip x #crop.
  • All inputs have a resolution of $224^{2}$.
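
As a small illustration of the `#Frame = #input_frame x #clip x #crop` convention, the sketch below splits a value such as `16x5x3` into its parts and derives the number of test views per video. The function name and the returned keys are hypothetical and exist only for this example.

```python
def parse_frame_spec(spec):
    """Split a '#Frame' string such as '16x5x3' into its components.

    Hypothetical helper for illustration; not part of the VideoMAE V2 code.
    """
    input_frames, clips, crops = (int(part) for part in spec.lower().split("x"))
    views = clips * crops  # views evaluated per video
    return {
        "input_frames": input_frames,
        "clips": clips,
        "crops": crops,
        "views": views,
        "total_frames": input_frames * views,  # frames processed per video
    }
```

For example, `parse_frame_spec("16x5x3")` gives 15 views and 240 frames per video, while the SSv2 setting `16x2x3` gives 6 views and 96 frames.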

Distillation

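| Model | Dataset | Teacher Model | #Frame | K710 Top-1 | K400 Top-1 | K600 Top-1 | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-small | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 77.6 | 83.7 | 83.1 | vit_s_k710_dl_from_giant.pth |
| fine-tuning accuracy | | | 16x7x3 | -- | 84.0 | 84.6 | -- |
| ViT-base | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 81.5 | 86.6 | 85.9 | vit_b_k710_dl_from_giant.pth |
| fine-tuning accuracy | | | 16x7x3 | -- | 87.1 | 87.4 | |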
  • We initialize the parameters of the student model with the model obtained after the post-pre-train stage.
  • The fine-tuning accuracy is the accuracy achieved by further fine-tuning the distilled model for several epochs on the specified dataset.