For action recognition, unless otherwise specified, models are trained on Kinetics-400. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. For TSN, we also train it on UCF101, initialized with ImageNet-pretrained weights. We also provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repos (including VMZ and kinetics_i3d); the others are trained by ourselves.
For data preprocessing, we find that resizing the short edge of each video to 256px is generally a better choice than resizing the video to a fixed width and height of 340x256, since the aspect ratio is preserved. Most of our Kinetics-400 models are trained on videos whose short edges are resized to 256px. However, some legacy Kinetics-400 models are trained on videos with a fixed width and height (340x256). We use the mark 340×256 to indicate that a model is legacy.
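
The difference between the two preprocessing choices can be illustrated with a small sketch. The helper `resize_short_edge` below is hypothetical (it is not part of this repository) and only computes the target frame size; it assumes plain rounding of the longer side, which may differ from the tool actually used to resize the videos.

```python
from typing import Tuple


def resize_short_edge(width: int, height: int, short_edge: int = 256) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter side set to `short_edge`.

    Hypothetical helper for illustration: the longer side is scaled by the
    same factor and rounded, so the aspect ratio is (approximately) kept.
    """
    if width <= 0 or height <= 0 or short_edge <= 0:
        raise ValueError("width, height and short_edge must be positive")
    if width <= height:
        # Portrait or square: the width is the short edge.
        return short_edge, round(height * short_edge / width)
    # Landscape: the height is the short edge.
    return round(width * short_edge / height), short_edge


if __name__ == "__main__":
    # A 1280x720 video keeps its ~16:9 shape with short-edge resizing,
    # whereas a fixed 340x256 target would squeeze it to ~4:3.
    print(resize_short_edge(1280, 720))
    print(resize_short_edge(480, 640))
```

With a fixed 340x256 target, every video ends up with the same aspect ratio regardless of its original shape, which is the distortion that short-edge resizing avoids.
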
If you cannot reproduce our testing results due to dataset misalignment, please submit a request at get validation data.
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
|---|---|---|---|---|---|---|
| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 | model (340×256) |
| Modality | Pretrained | Backbone | Input | Top-1 | Download |
|---|---|---|---|---|---|
| RGB | ImageNet | BNInception | 3seg | 86.4 | model |
| TV-L1 | ImageNet | BNInception | 3seg | 87.7 | model |
| Modality | Pretrained | Backbone | Input | Top-1 | Download |
|---|---|---|---|---|---|
| RGB | None | C3D | 16x1 | N/A | model* |

\* Converted from C3D-v1.0 in Caffe and TGAN in Chainer.

| Modality | Pretrained | Backbone | Input | Top-1 | Download |
|---|---|---|---|---|---|
| RGB | Sports-1M | C3D | 16x1 | 82.26 | model* |

\* Converted from C3D-v1.0 in Caffe and TGAN in Chainer.

| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
|---|---|---|---|---|---|---|
| RGB | ImageNet | Inception-V1 | 64x1 | 71.1 | 89.3 | model* |
| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 | model (340×256) |
| Flow | ImageNet | Inception-V1 | 64x1 | 63.4 | 84.9 | model* |
| Two-Stream | ImageNet | Inception-V1 | 64x1 | 74.2 | 91.3 | / |

\* Converted from kinetics_i3d in TensorFlow.

| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
|---|---|---|---|---|---|---|
| RGB | None | ResNet50 | 4x16 | 72.9 | 90.9 | model |
| RGB | ImageNet | ResNet50 | 4x16 | 73.8 | 90.9 | model |
| RGB | None | ResNet50 | 8x8 | 74.8 | 91.9 | model |
| RGB | ImageNet | ResNet50 | 8x8 | 75.7 | 92.2 | model |
| RGB | None | ResNet101 | 8x8 | 76.5 | 92.7 | model |
| RGB | ImageNet | ResNet101 | 8x8 | 76.8 | 92.8 | model |
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
|---|---|---|---|---|---|---|
| RGB | None | ResNet50 | 4x16 | 75.4 | 92.1 | model |
| RGB | ImageNet | ResNet50 | 4x16 | 75.9 | 92.3 | model |
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
|---|---|---|---|---|---|---|
| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 | model |
| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 | model |
| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 | model |
| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 | model |
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 | Download |
|---|---|---|---|---|---|---|
| RGB | IG-65M | irCSN-152 | 32x2 | 82.6 | 95.7 | model* |
| RGB | IG-65M | ipCSN-152 | 32x2 | 82.7 | 95.6 | model* |
| Modality | Pretrained | Backbone | Input | Top-1 (Baseline / OmniSource (Δ)) | Top-5 (Baseline / OmniSource (Δ)) | Download |
|---|---|---|---|---|---|---|
| RGB | ImageNet | ResNet50 | 3seg | 70.6 / 73.6 (+ 3.0) | 89.4 / 91.0 (+ 1.6) | Baseline (340×256) / OmniSource (340×256) |
| RGB | IG-1B | ResNet50 | 3seg | 73.1 / 75.7 (+ 2.6) | 90.4 / 91.9 (+ 1.5) | Baseline / OmniSource |
| RGB | Scratch | ResNet50 | 4x16 | 72.9 / 76.8 (+ 3.9) | 90.9 / 92.5 (+ 1.6) | Baseline / OmniSource |
| RGB | Scratch | ResNet101 | 8x8 | 76.5 / 80.4 (+ 3.9) | 92.7 / 94.4 (+ 1.7) | Baseline / OmniSource |
| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 | Download (split1) |
|---|---|---|---|---|---|---|---|
| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 | UCF101 / HMDB51 |
| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 | UCF101 / HMDB51 |
| I3D | Two-Stream | Kinetics | I3D | 64x1 | 97.8 | 80.8 | / |
For action detection, we release models trained on THUMOS14.
| Modality | Pretrained | Backbone | mAP@0.10 | mAP@0.20 | mAP@0.30 | mAP@0.40 | mAP@0.50 | Download |
|---|---|---|---|---|---|---|---|---|
| RGB | ImageNet | BNInception | 43.09% | 37.95% | 32.56% | 25.71% | 18.33% | model |
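
As a reading aid for the mAP@0.10 to mAP@0.50 columns, the sketch below assumes these thresholds are temporal IoU thresholds between predicted and ground-truth segments, which is the usual convention on THUMOS14. The function `temporal_iou` and the `(start, end)` segment format are illustrative assumptions, not this repository's evaluation code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds; assumed format


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def thresholds_met(pred: Segment, gt: Segment, thresholds: List[float]) -> List[float]:
    """Return the thresholds at which `pred` would count as overlapping `gt`."""
    iou = temporal_iou(pred, gt)
    return [t for t in thresholds if iou >= t]


if __name__ == "__main__":
    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
    # Overlap of 5 s over a union of 15 s gives an IoU of one third.
    print(thresholds_met((0.0, 10.0), (5.0, 15.0), thresholds))
```

Under this assumption, a detection that passes at a low threshold may fail at a stricter one, which is consistent with the mAP values decreasing from left to right in the table.
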
For spatio-temporal action detection, we release models trained on AVA.
| Modality | Model | Pretrained | Backbone | mAP@0.5 | Download |
|---|---|---|---|---|---|
| RGB | Fast-RCNN | Kinetics | NL-I3D R50 | 21.2 | model |