TSM

September 6, 2023

TSM: Temporal Shift Module for Efficient Video Understanding

Abstract

The explosive growth in video streaming creates challenges for performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNNs while maintaining the complexity of 2D CNNs. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves low latencies of 13ms and 35ms, respectively, for online video recognition.
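The channel shift described in the abstract can be illustrated with a small sketch. This is not the implementation used by the configs in this directory: it is a pure-Python toy that drops the spatial dimensions, and the `temporal_shift` function, its `shift_div` parameter, and the zero padding at the clip boundaries are assumptions made for illustration.

```python
def temporal_shift(frames, shift_div=8):
    """Shift a fraction of channels along time for one clip.

    frames: list of T frames, each a list of C per-channel values
            (spatial dimensions are omitted to keep the sketch small).
    shift_div: hypothetical parameter; 1/shift_div of the channels move
               in each temporal direction, the rest stay in place.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div
    # Start from zeros so positions without a temporal neighbor are zero-padded.
    out = [[0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1  # second group: read from the previous frame
            else:
                src = t      # remaining channels are left untouched
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


# Three frames, eight channels; each value encodes its origin as 10 * t + c.
clip = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip, shift_div=4)
for row in shifted:
    print(row)
```

Because the operation only moves existing values between frames, it adds no multiplications and no learnable parameters, which is the sense in which the abstract speaks of zero computation and zero parameters.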

Results and Models

Kinetics-400

frame sampling strategy	resolution	gpus	backbone	pretrain	top1 acc	top5 acc	testing protocol	FLOPs	params	config	ckpt	log
1x1x8	224x224	8	ResNet50	ImageNet	73.18	90.56	8 clips x 10 crop	32.88G	23.87M	config	ckpt	log
1x1x8	224x224	8	ResNet50	ImageNet	73.22	90.22	8 clips x 10 crop	32.88G	23.87M	config	ckpt	log
1x1x16	224x224	8	ResNet50	ImageNet	75.12	91.55	16 clips x 10 crop	65.75G	23.87M	config	ckpt	log
1x1x8 (dense)	224x224	8	ResNet50	ImageNet	73.38	90.78	8 clips x 10 crop	32.88G	23.87M	config	ckpt	log
1x1x8	224x224	8	ResNet50 (NonLocalDotProduct)	ImageNet	74.49	91.15	8 clips x 10 crop	61.30G	31.68M	config	ckpt	log
1x1x8	224x224	8	ResNet50 (NonLocalGauss)	ImageNet	73.66	90.99	8 clips x 10 crop	59.06G	28.00M	config	ckpt	log
1x1x8	224x224	8	ResNet50 (NonLocalEmbedGauss)	ImageNet	74.34	91.23	8 clips x 10 crop	61.30G	31.68M	config	ckpt	log
1x1x8	224x224	8	MobileNetV2	ImageNet	68.71	88.32	8 clips x 3 crop	3.269G	2.736M	config	ckpt	log
1x1x16	224x224	8	MobileOne-S4	ImageNet	74.38	91.71	16 clips x 10 crop	48.65G	13.72M	config	ckpt	log

Something-Something V2

frame sampling strategy	resolution	gpus	backbone	pretrain	top1 acc	top5 acc	testing protocol	FLOPs	params	config	ckpt	log
1x1x8	224x224	8	ResNet50	ImageNet	62.72	87.70	8 clips x 3 crop	32.88G	23.87M	config	ckpt	log
1x1x16	224x224	8	ResNet50	ImageNet	64.16	88.61	16 clips x 3 crop	65.75G	23.87M	config	ckpt	log
1x1x8	224x224	8	ResNet101	ImageNet	63.70	88.28	8 clips x 3 crop	62.66G	42.86M	config	ckpt	log

The gpus column indicates the number of GPUs we used to get the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set --auto-scale-lr when calling tools/train.py. This parameter auto-scales the learning rate according to the actual batch size and the original batch size.
The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at Kinetics400-Validation. The corresponding data list (each line is of the format 'video_id, num_frames, label_index') and the label map are also available.
The MobileOne backbone supports reparameterization during inference. You can use the provided reparameterize tool to convert the checkpoint and switch to the deploy config file.
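To make the --auto-scale-lr note above concrete, the sketch below assumes the common linear scaling rule, where the learning rate is multiplied by the ratio of the actual batch size to the original one. Whether the option applies exactly this rule depends on the training framework. The function name and the base learning rate of 0.01 are hypothetical, and the base batch size of 8 x 16 is read off the `8xb16` part of the config name as an assumption.

```python
def scale_learning_rate(base_lr, base_batch_size, num_gpus, videos_per_gpu):
    """Linear scaling rule (an assumption about what auto-scaling does)."""
    actual_batch_size = num_gpus * videos_per_gpu
    return base_lr * actual_batch_size / base_batch_size


# Hypothetical setup: a config tuned for 8 GPUs x 16 videos per GPU,
# now run on 4 GPUs with the same number of videos per GPU.
base_batch_size = 8 * 16
scaled_lr = scale_learning_rate(
    base_lr=0.01,  # hypothetical value, not taken from any config here
    base_batch_size=base_batch_size,
    num_gpus=4,
    videos_per_gpu=16,
)
print(scaled_lr)
```

Under this rule, halving the total batch size halves the learning rate, so the result here is half of the hypothetical base value.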

For more details on data preparation, you can refer to Kinetics400.
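The data list format mentioned above ('video_id, num_frames, label_index') can be read with a few lines of Python. This is a minimal sketch: the `VideoRecord` type, the `parse_data_list` helper, and the two sample lines are made up for illustration and do not come from the actual Kinetics400 validation list.

```python
from typing import Iterable, List, NamedTuple


class VideoRecord(NamedTuple):
    video_id: str
    num_frames: int
    label_index: int


def parse_data_list(lines: Iterable[str]) -> List[VideoRecord]:
    """Parse lines of the form 'video_id, num_frames, label_index'."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split from the right so a comma inside video_id would be tolerated.
        video_id, num_frames, label_index = (
            field.strip() for field in line.rsplit(",", 2)
        )
        records.append(VideoRecord(video_id, int(num_frames), int(label_index)))
    return records


# Hypothetical sample lines in the documented format.
sample = ["abc123, 300, 5", "xyz789, 250, 17", ""]
records = parse_data_list(sample)
for record in records:
    print(record.video_id, record.num_frames, record.label_index)
```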

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TSM model on the Kinetics-400 dataset with the deterministic option enabled.

python tools/train.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
     --seed=0 --deterministic

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TSM model on the Kinetics-400 dataset and dump the results to a pkl file.

python tools/test.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl

For more details, you can refer to the Test part in the Training and Test Tutorial.
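Once a file such as result.pkl has been written by the --dump option, it can be opened with the standard pickle module. The layout of the dumped object is not described here, so the sketch below only reports the top-level type and length. The `summarize_dump` helper is hypothetical, and the file it loads is a stand-in created on the spot with placeholder content, not real test output.

```python
import os
import pickle
import tempfile


def summarize_dump(path):
    """Load a pickled results file and report its top-level type and size."""
    with open(path, "rb") as f:
        results = pickle.load(f)
    size = len(results) if hasattr(results, "__len__") else None
    return type(results).__name__, size


# Stand-in for a real dump: the structure below is a placeholder only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "result.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump([{"placeholder": 0}, {"placeholder": 1}], f)
    summary = summarize_dump(demo_path)

print(summary)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. If the dumped objects belong to a library such as PyTorch, that library has to be importable when the file is loaded.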

Citation

@inproceedings{lin2019tsm,
  title={TSM: Temporal Shift Module for Efficient Video Understanding},
  author={Lin, Ji and Gan, Chuang and Han, Song},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2019}
}

@article{Nonlocal2018,
  author =   {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
  title =    {Non-local Neural Networks},
  journal =  {CVPR},
  year =     {2018}
}