Pre-trained VideoMAE Models

March 7, 2024

For all experiments on APT, we use ViT models pre-trained with VideoMAE on Kinetics-400.

The following tables list the available checkpoints.

Note that we use the pre-trained checkpoints, not the fine-tuned ones.

Kinetics-400

| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| VideoMAE | no | ViT-S | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 800 | 16x5x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 80.0 | 94.4 |
| VideoMAE | no | ViT-B | 800 | 16x5x3 | same as above | TODO | 81.0 | 94.8 |
| VideoMAE | no | ViT-B | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 86.6 | 97.1 |

Something-Something V2

| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| VideoMAE | no | ViT-S | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 800 | 16x2x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 69.6 | 92.0 |
| VideoMAE | no | ViT-B | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 70.8 | 92.4 |

UCF101

| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| VideoMAE | no | ViT-B | 3200 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 91.3 | 98.5 |

Note:

  • We report the results of VideoMAE fine-tuned with I3D dense sampling on Kinetics-400 and with TSN uniform sampling on Something-Something V2.
  • #Frame = #input_frame x #clip x #crop.
  • #input_frame is the number of frames fed to the model during the test phase.
  • #crop means spatial crops (e.g., 3 for left/right/center crop).
  • #clip means temporal clips (e.g., 5 means sampling five temporal clips with different start indices).
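
To make the #Frame notation concrete, here is a minimal Python sketch that parses a string such as 16x5x3 and derives the number of test-time views (clips x crops) and the total number of frames processed per video. The names FrameSpec and parse_frame_spec are hypothetical helpers introduced for illustration only; they are not part of the VideoMAE or APT code.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for the '#input_frame x #clip x #crop' notation."""
    input_frames: int  # frames fed to the model per view
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip

    @property
    def views(self) -> int:
        # Each video is evaluated once per (clip, crop) pair.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # Total frames processed per video across all views.
        return self.input_frames * self.views


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string such as '16x5x3' into its three components."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    input_frames, clips, crops = (int(p) for p in parts)
    return FrameSpec(input_frames, clips, crops)


k400 = parse_frame_spec("16x5x3")  # Kinetics-400 setting from the table
ssv2 = parse_frame_spec("16x2x3")  # Something-Something V2 setting
print(k400.views, k400.total_frames)  # 15 240
print(ssv2.views, ssv2.total_frames)  # 6 96
```

Under this reading, the Kinetics-400 setting evaluates 15 views per video and the Something-Something V2 setting evaluates 6.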