TSM

September 6, 2023

TSM: Temporal Shift Module for Efficient Video Understanding

Abstract

The explosive growth in video streaming makes it challenging to perform video understanding with high accuracy at low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of a 3D CNN while maintaining the complexity of a 2D CNN. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time, low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves low latencies of 13ms and 35ms, respectively, for online video recognition.
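The shift operation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function name `temporal_shift`, the `(N, T, C, H, W)` layout, the zero padding at the clip boundaries, and the choice of shifting one `shift_div`-th of the channels in each direction are all assumptions made for the example.

```python
import numpy as np


def temporal_shift(x: np.ndarray, shift_div: int = 8) -> np.ndarray:
    """Shift a fraction of the channels along the temporal axis.

    x has shape (N, T, C, H, W). The layout, the zero padding, and the
    1/shift_div fraction per direction are assumptions for this sketch.
    """
    fold = x.shape[2] // shift_div
    out = np.zeros_like(x)
    # First `fold` channels: each frame receives features from the next frame.
    out[:, :-1, :fold] = x[:, 1:, :fold]
    # Next `fold` channels: each frame receives features from the previous frame.
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    # Remaining channels are copied unchanged.
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out


# Toy clip: 1 video, 4 frames, 8 channels, 1x1 spatial size.
clip = np.arange(4 * 8, dtype=np.float32).reshape(1, 4, 8, 1, 1)
shifted = temporal_shift(clip, shift_div=4)
```

The operation only moves data between frames, so it adds no parameters and no multiply-adds, which is the sense in which the abstract speaks of temporal modeling at zero computation and zero parameters.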

Results and Models

Kinetics-400

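| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x1x8 | 224x224 | 8 | ResNet50 | ImageNet | 73.18 | 90.56 | 8 clips x 10 crop | 32.88G | 23.87M | config | ckpt | log |
| 1x1x8 | 224x224 | 8 | ResNet50 | ImageNet | 73.22 | 90.22 | 8 clips x 10 crop | 32.88G | 23.87M | config | ckpt | log |
| 1x1x16 | 224x224 | 8 | ResNet50 | ImageNet | 75.12 | 91.55 | 16 clips x 10 crop | 65.75G | 23.87M | config | ckpt | log |
| 1x1x8 (dense) | 224x224 | 8 | ResNet50 | ImageNet | 73.38 | 90.78 | 8 clips x 10 crop | 32.88G | 23.87M | config | ckpt | log |
| 1x1x8 | 224x224 | 8 | ResNet50 (NonLocalDotProduct) | ImageNet | 74.49 | 91.15 | 8 clips x 10 crop | 61.30G | 31.68M | config | ckpt | log |
| 1x1x8 | 224x224 | 8 | ResNet50 (NonLocalGauss) | ImageNet | 73.66 | 90.99 | 8 clips x 10 crop | 59.06G | 28.00M | config | ckpt | log |
| 1x1x8 | 224x224 | 8 | ResNet50 (NonLocalEmbedGauss) | ImageNet | 74.34 | 91.23 | 8 clips x 10 crop | 61.30G | 31.68M | config | ckpt | log |
| 1x1x8 | 224x224 | 8 | MobileNetV2 | ImageNet | 68.71 | 88.32 | 8 clips x 3 crop | 3.269G | 2.736M | config | ckpt | log |
| 1x1x16 | 224x224 | 8 | MobileOne-S4 | ImageNet | 74.38 | 91.71 | 16 clips x 10 crop | 48.65G | 13.72M | config | ckpt | log |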
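
To make the top1 acc, top5 acc, and testing protocol columns concrete, the sketch below averages class scores over all views of a video (for example, 8 clips x 10 crops gives 80 views) and then computes top-k accuracy. Averaging the per-view scores is an assumption about how the views are fused, and `topk_accuracy` and the array shapes are hypothetical choices for illustration. The scores are made up.

```python
import numpy as np


def topk_accuracy(view_scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """view_scores: (num_videos, num_views, num_classes); labels: (num_videos,)."""
    # Fuse the views of each video by averaging their class scores (assumed).
    video_scores = view_scores.mean(axis=1)
    # Indices of the k highest-scoring classes for each video.
    topk = np.argsort(video_scores, axis=1)[:, -k:]
    # A video counts as correct if its label is among those k classes.
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Two videos, two views each, three classes (made-up scores).
scores = np.array([
    [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],  # fused: class 0 ranks first
    [[0.1, 0.6, 0.3], [0.2, 0.3, 0.5]],  # fused: class 1 first, class 2 second
])
labels = np.array([0, 2])

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
```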

Something-Something V2

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x1x8 | 224x224 | 8 | ResNet50 | ImageNet | 62.72 | 87.70 | 8 clips x 3 crop | 32.88G | 23.87M | config | ckpt | log |
| 1x1x16 | 224x224 | 8 | ResNet50 | ImageNet | 64.16 | 88.61 | 16 clips x 3 crop | 65.75G | 23.87M | config | ckpt | log |
| 1x1x8 | 224x224 | 8 | ResNet101 | ImageNet | 63.70 | 88.28 | 8 clips x 3 crop | 62.66G | 42.86M | config | ckpt | log |

  1. The gpus column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py; this parameter automatically scales the learning rate according to the actual batch size and the original batch size.
  2. The Kinetics400 validation set we used consists of 19796 videos. These videos are available at Kinetics400-Validation. The corresponding data list (each line is of the format 'video_id, num_frames, label_index') and the label map are also available.
  3. The MobileOne backbone supports reparameterization during inference. You can use the provided reparameterize tool to convert the checkpoint and switch to the deploy config file.
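
The learning-rate adjustment described in note 1 can be illustrated with the linear scaling rule, in which the learning rate is multiplied by the ratio of the actual batch size to the original batch size. This is a sketch under that assumption rather than the training tool's code: `scale_lr` is a hypothetical helper, the base learning rate of 0.02 is an illustrative value rather than one read from a config, and the original batch size of 8 x 16 = 128 assumes that `8xb16` in the config file names denotes 8 GPUs with 16 videos per GPU.

```python
def scale_lr(base_lr: float, base_batch_size: int,
             num_gpus: int, videos_per_gpu: int) -> float:
    """Linearly rescale a learning rate to a new total batch size (assumed rule)."""
    actual_batch_size = num_gpus * videos_per_gpu
    return base_lr * actual_batch_size / base_batch_size


# Assumed original setup: 8 GPUs x 16 videos per GPU.
base_batch_size = 8 * 16

# Halving the number of GPUs halves the learning rate.
lr_4gpu = scale_lr(0.02, base_batch_size, num_gpus=4, videos_per_gpu=16)

# Doubling the videos per GPU doubles the learning rate.
lr_8gpu_b32 = scale_lr(0.02, base_batch_size, num_gpus=8, videos_per_gpu=32)
```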

For more details on data preparation, you can refer to Kinetics400.
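
As an illustration of the data list format mentioned above ('video_id, num_frames, label_index' on each line), the following sketch parses such lines into typed records. `VideoEntry` and `parse_data_list` are hypothetical names, and the sample lines are made up rather than taken from the actual list.

```python
from typing import List, NamedTuple


class VideoEntry(NamedTuple):
    video_id: str
    num_frames: int
    label_index: int


def parse_data_list(text: str) -> List[VideoEntry]:
    """Parse lines of the form 'video_id, num_frames, label_index'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split from the right so a comma inside video_id would not break parsing.
        video_id, num_frames, label_index = line.rsplit(",", 2)
        entries.append(
            VideoEntry(video_id.strip(), int(num_frames), int(label_index))
        )
    return entries


# Made-up sample lines in the documented format.
sample = "video_a, 300, 12\nvideo_b, 250, 7\n"
entries = parse_data_list(sample)
```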

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TSM model on the Kinetics-400 dataset with a fixed seed and deterministic mode enabled.

python tools/train.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
     --seed=0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TSM model on the Kinetics-400 dataset and dump the results to a pkl file.

python tools/test.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
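
Once the command above has written result.pkl, the file can be inspected from Python. The sketch below assumes the dump is an ordinary pickle file, as the extension suggests, and deliberately makes no assumption about the structure of the stored object; `load_results` and `describe` are hypothetical helpers.

```python
import pickle
from pathlib import Path
from typing import Any


def load_results(path: str) -> Any:
    """Load a pickled object. Only unpickle files from sources you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def describe(results: Any) -> str:
    """Summarize the loaded object without assuming a particular structure."""
    size = len(results) if hasattr(results, "__len__") else "n/a"
    return f"type={type(results).__name__}, len={size}"
```

For example, `print(describe(load_results("result.pkl")))` prints the type and length of the dumped object, which is a reasonable first step before writing any analysis code against it.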

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@inproceedings{lin2019tsm,
  title={TSM: Temporal Shift Module for Efficient Video Understanding},
  author={Lin, Ji and Gan, Chuang and Han, Song},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2019}
}
@article{Nonlocal2018,
  author =   {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
  title =    {Non-local Neural Networks},
  journal =  {CVPR},
  year =     {2018}
}