VideoMAE Model Zoo
August 8, 2022
Kinetics-400
| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| VideoMAE | no | ViT-B | 800 | 16x5x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 80.0 | 94.4 |
| VideoMAE | no | ViT-B | 800 | 16x5x3 | same as above | TODO | 81.0 | 94.8 |
| VideoMAE | no | ViT-B | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 85.2 | 96.8 |
Something-Something V2
| Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| VideoMAE | no | ViT-B | 800 | 16x2x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 69.6 | 92.0 |
| VideoMAE | no | ViT-B | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 70.8 | 92.4 |
Note:
- We report the results of VideoMAE fine-tuned with I3D dense sampling on Kinetics-400 and uniform sampling on Something-Something V2, respectively.
- #Frame = #input_frame x #clip x #crop.
- #input_frame is the number of frames fed to the model per clip during the test phase.
- #crop is the number of spatial crops (e.g., 3 for left/right/center crops).
- #clip is the number of temporal clips (e.g., 5 means sampling five clips with different start indices).
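
To make the #Frame notation concrete, here is a minimal sketch that parses a value such as `16x5x3` and works out how many clip/crop combinations and how many frames in total it implies per test video. The names `FrameSpec` and `parse_frame_spec`, and the term "views", are hypothetical helpers for illustration only and are not part of the VideoMAE codebase.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """One '#Frame' table entry: #input_frame x #clip x #crop."""
    input_frames: int  # frames fed to the model per clip
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip

    @property
    def views(self) -> int:
        # Number of (clip, crop) combinations evaluated per video.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # Frames processed per video across all views.
        return self.input_frames * self.clips * self.crops


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string like '16x5x3' into a FrameSpec (hypothetical helper)."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '#input_frame x #clip x #crop', got {spec!r}")
    input_frames, clips, crops = (int(p) for p in parts)
    return FrameSpec(input_frames, clips, crops)


k400 = parse_frame_spec("16x5x3")   # Kinetics-400 rows
ssv2 = parse_frame_spec("16x2x3")   # Something-Something V2 rows
print(k400.views, k400.total_frames)  # 15 240
print(ssv2.views, ssv2.total_frames)  # 6 96
```

Under this reading, a Kinetics-400 entry of 16x5x3 corresponds to 15 forward passes of 16 frames each per test video, and a Something-Something V2 entry of 16x2x3 corresponds to 6.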