Model Zoo

August 24, 2020

Action Recognition

For action recognition, models are trained on Kinetics-400 unless otherwise specified. Our version of Kinetics-400 contains 240436 training videos and 19796 testing videos. We also train TSN on UCF101, initialized with ImageNet-pretrained weights, and provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repos (including VMZ and kinetics_i3d); the others are trained by us.

For data preprocessing, we find that resizing the short edge of each video to 256px is generally a better choice than resizing to a fixed width and height of 340x256, since the aspect ratio is preserved. Most of our Kinetics-400 models are trained on videos whose short edge is resized to 256px. However, some legacy Kinetics-400 models are trained on videos resized to a fixed width and height (340x256). We use the mark $^{340\times256}$ to indicate a legacy model.
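To make the difference between the two resizing strategies concrete, here is a minimal sketch of the size arithmetic only. The function names are hypothetical and are not part of this repo; actual frame resizing would be done by whatever video or image library the data pipeline uses.

```python
def resize_short_edge(width, height, short_edge=256):
    """Return (new_width, new_height) with the shorter side scaled to `short_edge`.

    The aspect ratio is preserved up to integer rounding.
    Hypothetical helper for illustration, not an API of this repo.
    """
    scale = short_edge / min(width, height)
    return round(width * scale), round(height * scale)


def resize_fixed(width, height, target=(340, 256)):
    """Return the fixed legacy size, ignoring the source aspect ratio."""
    return target


if __name__ == "__main__":
    for w, h in [(1280, 720), (640, 480), (480, 640)]:
        keep = resize_short_edge(w, h)
        fixed = resize_fixed(w, h)
        print(f"{w}x{h}: short-edge -> {keep[0]}x{keep[1]}, fixed -> {fixed[0]}x{fixed[1]}")
```

For a 1280x720 source, the short-edge strategy yields 455x256 and keeps the 16:9 shape, while the fixed strategy squeezes the frame to 340x256. A portrait 480x640 video becomes 256x341 rather than being forced into a landscape frame.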

If you cannot reproduce our testing results because your copy of the dataset differs from ours, please submit a request at get validation data.

TSN

Kinetics

Modality	Pretrained	Backbone	Input	Top-1	Top-5	Download
RGB	ImageNet	ResNet50	3seg	70.6	89.4	model $^{340\times256}$
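The Top-1 and Top-5 columns in the tables on this page report classification accuracy in percent. As a reminder of what those metrics measure, here is a minimal pure-Python sketch. The function name and the toy scores are illustrative assumptions and do not reproduce the evaluation code used for these models.

```python
def top_k_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    `scores` is a list of per-class score lists, `labels` a list of class indices.
    Hypothetical helper for illustration only.
    """
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


if __name__ == "__main__":
    # Toy data: 4 samples, 3 classes.
    scores = [
        [0.1, 0.7, 0.2],
        [0.5, 0.3, 0.2],
        [0.2, 0.3, 0.5],
        [0.6, 0.3, 0.1],
    ]
    labels = [1, 1, 0, 0]
    print(top_k_accuracy(scores, labels, k=1))
    print(top_k_accuracy(scores, labels, k=2))
```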

UCF101

Modality	Pretrained	Backbone	Input	Top-1	Download
RGB	ImageNet	BNInception	3seg	86.4	model
TV-L1	ImageNet	BNInception	3seg	87.7	model

C3D

Sports-1M

Modality	Pretrained	Backbone	Input	Top-1	Download
RGB	None	C3D	16x1	N/A	model*

* Converted from C3D-v1.0 in Caffe and TGAN in Chainer.

UCF101

Modality	Pretrained	Backbone	Input	Top-1	Download
RGB	Sports-1M	C3D	16x1	82.26	model*

* Converted from C3D-v1.0 in Caffe and TGAN in Chainer.

I3D

Modality	Pretrained	Backbone	Input	Top-1	Top-5	Download
RGB	ImageNet	Inception-V1	64x1	71.1	89.3	model*
RGB	ImageNet	ResNet50	32x2	72.9	90.8	model $^{340\times256}$
Flow	ImageNet	Inception-V1	64x1	63.4	84.9	model*
Two-Stream	ImageNet	Inception-V1	64x1	74.2	91.3	/

* Converted from kinetics_i3d in TensorFlow.

SlowOnly

Modality	Pretrained	Backbone	Input	Top-1	Top-5	Download
RGB	None	ResNet50	4x16	72.9	90.9	model
RGB	ImageNet	ResNet50	4x16	73.8	90.9	model
RGB	None	ResNet50	8x8	74.8	91.9	model
RGB	ImageNet	ResNet50	8x8	75.7	92.2	model
RGB	None	ResNet101	8x8	76.5	92.7	model
RGB	ImageNet	ResNet101	8x8	76.8	92.8	model
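This page does not define the Input column. One common reading, assumed here, is clip length x frame interval, so 8x8 would mean 8 frames taken every 8th frame and 4x16 would mean 4 frames taken every 16th frame. Under that assumption, the frame indices of one clip could be sketched as follows. The function name, the `start` parameter, and the wrap-around behavior are all illustrative choices, not this repo's sampler.

```python
def sample_frame_indices(num_frames, clip_len, frame_interval, start=0):
    """Indices of `clip_len` frames spaced `frame_interval` apart.

    Indices that run past the end of the video wrap around to the start;
    this is an illustrative choice, not necessarily what the repo does.
    """
    return [(start + i * frame_interval) % num_frames for i in range(clip_len)]


if __name__ == "__main__":
    print(sample_frame_indices(300, clip_len=8, frame_interval=8))
    print(sample_frame_indices(300, clip_len=4, frame_interval=16))
    print(sample_frame_indices(50, clip_len=4, frame_interval=16, start=10))
```

Under this reading, 4x16 and 8x8 both span roughly 64 source frames, but 8x8 samples them twice as densely.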

SlowFast

Modality	Pretrained	Backbone	Input	Top-1	Top-5	Download
RGB	None	ResNet50	4x16	75.4	92.1	model
RGB	ImageNet	ResNet50	4x16	75.9	92.3	model

R(2+1)D

Modality	Pretrained	Backbone	Input	Top-1	Top-5	Download
RGB	None	ResNet34	8x8	63.7	85.9	model
RGB	IG-65M	ResNet34	8x8	74.4	91.7	model
RGB	None	ResNet34	32x2	71.8	90.4	model
RGB	IG-65M	ResNet34	32x2	80.3	94.7	model

CSN

Modality	Pretrained	Backbone	Input	Top-1	Top-5	Download
RGB	IG-65M	irCSN-152	32x2	82.6	95.7	model*
RGB	IG-65M	ipCSN-152	32x2	82.7	95.6	model*

OmniSource

Modality	Pretrained	Backbone	Input	Top-1 (Baseline / OmniSource ($\Delta$))	Top-5 (Baseline / OmniSource ($\Delta$))	Download
RGB	ImageNet	ResNet50	3seg	70.6 / 73.6 (+3.0)	89.4 / 91.0 (+1.6)	Baseline $^{340\times256}$ / OmniSource $^{340\times256}$
RGB	IG-1B	ResNet50	3seg	73.1 / 75.7 (+2.6)	90.4 / 91.9 (+1.5)	Baseline / OmniSource
RGB	Scratch	ResNet50	4x16	72.9 / 76.8 (+3.9)	90.9 / 92.5 (+1.6)	Baseline / OmniSource
RGB	Scratch	ResNet101	8x8	76.5 / 80.4 (+3.9)	92.7 / 94.4 (+1.7)	Baseline / OmniSource

Transfer Learning

Model	Modality	Pretrained	Backbone	Input	UCF101	HMDB51	Download (split1)
I3D	RGB	Kinetics	I3D	64x1	94.8	72.6	UCF101 / HMDB51
I3D	Flow	Kinetics	I3D	64x1	96.6	79.2	UCF101 / HMDB51
I3D	Two-Stream	Kinetics	I3D	64x1	97.8	80.8	/

Action Detection

For action detection, we release models trained on THUMOS14.

SSN

Modality	Pretrained	Backbone	mAP@0.10	mAP@0.20	mAP@0.30	mAP@0.40	mAP@0.50	Download
RGB	ImageNet	BNInception	43.09%	37.95%	32.56%	25.71%	18.33%	model
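The mAP@0.10 through mAP@0.50 columns are thresholded on the temporal IoU between a predicted segment and a ground-truth segment, as is standard for THUMOS14. A minimal sketch of that overlap measure, with a hypothetical function name and made-up segments:

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two (start, end) segments in seconds."""
    start_p, end_p = pred
    start_g, end_g = gt
    intersection = max(0.0, min(end_p, end_g) - max(start_p, start_g))
    union = (end_p - start_p) + (end_g - start_g) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    iou = temporal_iou((10.0, 20.0), (15.0, 25.0))
    # Thresholds at which this prediction would count as a match.
    matched = [t for t in (0.1, 0.2, 0.3, 0.4, 0.5) if iou >= t]
    print(round(iou, 4), matched)
```

A prediction covering 10 s to 20 s against a ground truth of 15 s to 25 s overlaps for 5 s out of a 15 s union, giving an IoU of about 0.33. It would count as correct at thresholds 0.10 to 0.30 but not at 0.40 or 0.50, which is why mAP falls as the threshold rises.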

Spatial Temporal Action Detection

For spatial temporal action detection, we release models trained on AVA.

Modality	Model	Pretrained	Backbone	mAP@0.5	Download
RGB	Fast-RCNN	Kinetics	NL-I3D R50	21.2	model