Pre-trained VideoMAE Models

March 7, 2024

For all experiments on APT, we use ViT models pre-trained with VideoMAE on Kinetics-400.

The following tables list the available checkpoints.

Note that we use the pre-trained checkpoints, not the fine-tuned ones.

Kinetics-400

Method	Extra Data	Backbone	Epoch	#Frame	Pre-train	Fine-tune	Top-1	Top-5
VideoMAE	no	ViT-S	1600	16x5x3	script/log/checkpoint	script/log/checkpoint	79.0	93.8
VideoMAE	no	ViT-B	800	16x5x3	script/log/checkpoint	script/log/checkpoint (w/o repeated aug)	80.0	94.4
VideoMAE	no	ViT-B	800	16x5x3	same as above	TODO	81.0	94.8
VideoMAE	no	ViT-B	1600	16x5x3	script/log/checkpoint	script/log/checkpoint	81.5	95.1
VideoMAE	no	ViT-L	1600	16x5x3	script/log/checkpoint	script/log/checkpoint	85.2	96.8
VideoMAE	no	ViT-H	1600	16x5x3	script/log/checkpoint	script/log/checkpoint	86.6	97.1

Something-Something V2

Method	Extra Data	Backbone	Epoch	#Frame	Pre-train	Fine-tune	Top-1	Top-5
VideoMAE	no	ViT-S	2400	16x2x3	script/log/checkpoint	script/log/checkpoint	66.8	90.3
VideoMAE	no	ViT-B	800	16x2x3	script/log/checkpoint	script/log/checkpoint (w/o repeated aug)	69.6	92.0
VideoMAE	no	ViT-B	2400	16x2x3	script/log/checkpoint	script/log/checkpoint	70.8	92.4

UCF101

Method	Extra Data	Backbone	Epoch	#Frame	Pre-train	Fine-tune	Top-1	Top-5
VideoMAE	no	ViT-B	3200	16x5x3	script/log/checkpoint	script/log/checkpoint	91.3	98.5

Note:

We report the results of VideoMAE fine-tuned with I3D dense sampling on Kinetics-400 and with TSN uniform sampling on Something-Something V2.
#Frame = #input_frame x #clip x #crop.
#input_frame is the number of frames fed to the model per clip during the test phase.
#crop is the number of spatial crops (e.g., 3 for left/right/center crops).
#clip is the number of temporal clips (e.g., 5 means five clips are sampled repeatedly with different start indices).
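
As a rough illustration of the #Frame notation, the sketch below splits an entry such as 16x5x3 into its three factors and derives the number of test views per video. The names `FrameSpec` and `parse_frame_spec` are hypothetical helpers introduced only for this example; they are not part of the VideoMAE or APT code.

```python
from typing import NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for one '#Frame' table entry."""

    input_frames: int  # frames fed to the model per view
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip

    @property
    def views_per_video(self) -> int:
        # Each (clip, crop) pair is one forward pass at test time.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # #Frame = #input_frame x #clip x #crop
        return self.input_frames * self.clips * self.crops


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string like '16x5x3' into (input_frames, clips, crops)."""
    parts = spec.strip().lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    input_frames, clips, crops = (int(p) for p in parts)
    return FrameSpec(input_frames, clips, crops)


if __name__ == "__main__":
    k400 = parse_frame_spec("16x5x3")   # Kinetics-400 rows
    ssv2 = parse_frame_spec("16x2x3")   # Something-Something V2 rows
    print(k400, k400.views_per_video, k400.total_frames)
    print(ssv2, ssv2.views_per_video, ssv2.total_frames)
```

Under this reading, a 16x5x3 entry corresponds to 15 views of 16 frames each per video, and a 16x2x3 entry corresponds to 6 views.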