Model Zoo

August 24, 2020

Action Recognition

For action recognition, models are trained on Kinetics-400 unless otherwise specified. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. We also train TSN on UCF101, initialized with ImageNet-pretrained weights, and provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repos (including VMZ and kinetics_i3d); the others are trained by ourselves.

For data preprocessing, we find that resizing the short edge of each video to 256px is generally a better choice than resizing to a fixed width and height of 340x256, since the aspect ratio is preserved. Most of our Kinetics-400 models are trained on videos whose short edges are resized to 256px. However, some legacy Kinetics-400 models are trained on videos with a fixed width and height (340x256). We use the mark $^{340\times256}$ to indicate that a model is legacy.
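The difference between the two resizing strategies can be illustrated with a small sketch. The helper name `resize_short_edge` and the sample resolutions are illustrative assumptions and are not part of this repository's preprocessing code; the sketch only computes target frame sizes and does not touch any video data.

```python
def resize_short_edge(width, height, short_edge=256):
    """Return (new_width, new_height) with the shorter side scaled to
    `short_edge` and the aspect ratio preserved (hypothetical helper)."""
    scale = short_edge / min(width, height)
    return round(width * scale), round(height * scale)


FIXED_SIZE = (340, 256)  # legacy setting: every video is forced to 340x256

# Sample source resolutions, chosen only for illustration.
for w, h in [(1280, 720), (640, 480), (720, 1280)]:
    new_w, new_h = resize_short_edge(w, h)
    print(f"{w}x{h}: short-edge -> {new_w}x{new_h} "
          f"(ratio {new_w / new_h:.2f}), "
          f"fixed -> {FIXED_SIZE[0]}x{FIXED_SIZE[1]} "
          f"(ratio {FIXED_SIZE[0] / FIXED_SIZE[1]:.2f}), "
          f"source ratio {w / h:.2f}")
```

With the fixed setting, a 16:9 video and a portrait video both end up at the same 340x256 shape, so their content is stretched differently; the short-edge setting keeps each video's own ratio.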

If you cannot reproduce our testing results because of dataset misalignment, please submit a request at get validation data.

TSN

Kinetics

| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 | model$^{340\times256}$ |

UCF101

| Modality | Pretrained | Backbone | Input | Top-1 | Download |
| --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | BNInception | 3seg | 86.4 | model |
| TV-L1 | ImageNet | BNInception | 3seg | 87.7 | model |

C3D

Sports-1M

| Modality | Pretrained | Backbone | Input | Top-1 | Download |
| --- | --- | --- | --- | --- | --- |
| RGB | None | C3D | 16x1 | N/A | model* |

* Converted from C3D-v1.0 in Caffe and TGAN in Chainer.

UCF101

| Modality | Pretrained | Backbone | Input | Top-1 | Download |
| --- | --- | --- | --- | --- | --- |
| RGB | Sports-1M | C3D | 16x1 | 82.26 | model* |

* Converted from C3D-v1.0 in Caffe and TGAN in Chainer.

I3D

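| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | Inception-V1 | 64x1 | 71.1 | 89.3 | model* |
| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 | model$^{340\times256}$ |
| Flow | ImageNet | Inception-V1 | 64x1 | 63.4 | 84.9 | model* |
| Two-Stream | ImageNet | Inception-V1 | 64x1 | 74.2 | 91.3 | / |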

* Converted from kinetics_i3d in TensorFlow.

SlowOnly

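| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | None | ResNet50 | 4x16 | 72.9 | 90.9 | model |
| RGB | ImageNet | ResNet50 | 4x16 | 73.8 | 90.9 | model |
| RGB | None | ResNet50 | 8x8 | 74.8 | 91.9 | model |
| RGB | ImageNet | ResNet50 | 8x8 | 75.7 | 92.2 | model |
| RGB | None | ResNet101 | 8x8 | 76.5 | 92.7 | model |
| RGB | ImageNet | ResNet101 | 8x8 | 76.8 | 92.8 | model |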

SlowFast

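| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | None | ResNet50 | 4x16 | 75.4 | 92.1 | model |
| RGB | ImageNet | ResNet50 | 4x16 | 75.9 | 92.3 | model |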

R(2+1)D

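| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 | model |
| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 | model |
| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 | model |
| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 | model |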

CSN

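| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | IG-65M | irCSN-152 | 32x2 | 82.6 | 95.7 | model* |
| RGB | IG-65M | ipCSN-152 | 32x2 | 82.7 | 95.6 | model* |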

OmniSource

| Modality | Pretrained | Backbone | Input | Top-1 (Baseline / OmniSource (Δ)) | Top-5 (Baseline / OmniSource (Δ)) | Download |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | ResNet50 | 3seg | 70.6 / 73.6 (+ 3.0) | 89.4 / 91.0 (+ 1.6) | Baseline$^{340\times256}$ / OmniSource$^{340\times256}$ |
| RGB | IG-1B | ResNet50 | 3seg | 73.1 / 75.7 (+ 2.6) | 90.4 / 91.9 (+ 1.5) | Baseline / OmniSource |
| RGB | Scratch | ResNet50 | 4x16 | 72.9 / 76.8 (+ 3.9) | 90.9 / 92.5 (+ 1.6) | Baseline / OmniSource |
| RGB | Scratch | ResNet101 | 8x8 | 76.5 / 80.4 (+ 3.9) | 92.7 / 94.4 (+ 1.7) | Baseline / OmniSource |
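The Δ values in the OmniSource table are the OmniSource accuracy minus the baseline accuracy, in percentage points. The short sketch below recomputes them from the Top-1 figures listed above; the row labels and the `gain` helper are illustrative names, not identifiers from this repository.

```python
# (baseline, OmniSource) Top-1 accuracies copied from the OmniSource table.
# The dictionary keys are informal labels chosen for this sketch.
top1 = {
    "ResNet50 3seg, ImageNet": (70.6, 73.6),
    "ResNet50 3seg, IG-1B": (73.1, 75.7),
    "ResNet50 4x16, Scratch": (72.9, 76.8),
    "ResNet101 8x8, Scratch": (76.5, 80.4),
}


def gain(baseline, omnisource):
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(omnisource - baseline, 1)


for name, (baseline, omnisource) in top1.items():
    print(f"{name}: {baseline} -> {omnisource} (+ {gain(baseline, omnisource)})")
```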

Transfer Learning

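| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 | Download (split1) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 | UCF101 / HMDB51 |
| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 | UCF101 / HMDB51 |
| I3D | TwoStream | Kinetics | I3D | 64x1 | 97.8 | 80.8 | / |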

Action Detection

For action detection, we release models trained on THUMOS14.

SSN

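| Modality | Pretrained | Backbone | mAP@0.10 | mAP@0.20 | mAP@0.30 | mAP@0.40 | mAP@0.50 | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | BNInception | 43.09% | 37.95% | 32.56% | 25.71% | 18.33% | model |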

Spatial Temporal Action Detection

For spatial temporal action detection, we release models trained on AVA.

| Modality | Model | Pretrained | Backbone | mAP@0.5 | Download |
| --- | --- | --- | --- | --- | --- |
| RGB | Fast-RCNN | Kinetics | NL-I3D R50 | 21.2 | model |