SA-LSTM

January 19, 2020

This project is an attempt to implement SA-LSTM, proposed in "Describing Videos by Exploiting Temporal Structure" [1] (ICCV 2015).
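
For readers new to the model, the core idea of SA-LSTM in [1] is temporal soft attention: at each decoding step, the per-frame features are weighted by relevance and their weighted sum is fed to the LSTM decoder. Below is a minimal pure-Python sketch of that weighting step. It is not this repository's code. The function names are hypothetical, and the plain dot-product score is a simplifying assumption (the paper scores each frame with a small learned network).

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_feats, hidden):
    """Return (weights, context) for one decoding step.

    frame_feats: list of per-frame feature vectors (lists of floats).
    hidden: decoder hidden state, assumed here to have the same size as
        a frame feature so that a dot product can serve as the score.
    """
    scores = [sum(f * h for f, h in zip(feat, hidden)) for feat in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    # The context vector is the attention-weighted sum of frame features.
    context = [
        sum(w * feat[d] for w, feat in zip(weights, frame_feats))
        for d in range(dim)
    ]
    return weights, context


# Three frames with 2-dimensional toy features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = temporal_attention(frames, hidden=[2.0, 0.0])
```

In this toy example the first and third frames align with the hidden state equally well, so they receive equal weight and the second frame receives the least.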

Environment

Ubuntu 16.04
CUDA 9.0
cuDNN 7.3.1
Nvidia Geforce GTX Titan Xp 12GB

Requirements

Java 8
Python 2.7.12
  - PyTorch 1.0
  - Other Python libraries specified in requirements.txt

How to use

Step 1. Setup python virtual environment

$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt

Step 2. Prepare Data

Extract features from the network you want to use and place them at <PROJECT ROOT>/<DATASET>/features/<DATASET>_<NETWORK>.hdf5. I extracted the VGG19, ResNet-101, ResNet-152, and Inception-v4 features from here, the R(2+1)D features from here, and the 3D-ResNext features from here.
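
The expected file location can be spelled out with a small helper. This is only a sketch of the path pattern stated above. The function name `feature_path` and the network name string in the example are hypothetical and not part of this repository.

```python
import os


def feature_path(project_root, dataset, network):
    """Build <PROJECT ROOT>/<DATASET>/features/<DATASET>_<NETWORK>.hdf5."""
    fname = "{}_{}.hdf5".format(dataset, network)
    return os.path.join(project_root, dataset, "features", fname)


# Example with a hypothetical network name string.
path = feature_path(".", "MSVD", "InceptionV4")
print(path)
```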

Dataset	ResNet-101	Inception-v4	3D-ResNext-101
MSVD	link	link	link
MSR-VTT	link	link	link

After changing model of <DATASET>SplitConfig in config.py to match the features you prepared above, split the dataset according to the official splits using the following commands:
```
(.env) $ python -m splits.MSVD
(.env) $ python -m splits.MSR-VTT
```
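
For orientation, the split sizes commonly used in the video captioning literature are 1,200 / 100 / 670 videos (train / val / test) for MSVD and 6,513 / 497 / 2,990 for MSR-VTT. The sketch below partitions an ordered list of video IDs by such sizes. It is an illustration only: the `split_ids` helper, the assumption that the splits are contiguous in ID order, and the `video<N>` ID format are assumptions, and the `splits` modules in this repository may work differently.

```python
def split_ids(video_ids, n_train, n_val, n_test):
    """Cut an ordered ID list into contiguous train/val/test parts."""
    if n_train + n_val + n_test != len(video_ids):
        raise ValueError("split sizes must add up to the number of videos")
    train = video_ids[:n_train]
    val = video_ids[n_train:n_train + n_val]
    test = video_ids[n_train + n_val:]
    return train, val, test


# MSR-VTT has 10,000 clips; IDs are assumed to look like "video0" ... "video9999".
msrvtt_ids = ["video{}".format(i) for i in range(10000)]
train, val, test = split_ids(msrvtt_ids, 6513, 497, 2990)
print("{} {} {}".format(len(train), len(val), len(test)))
```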

Step 3. Prepare Evaluation Codes

Clone the evaluation code from the official coco-caption repo.

(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption

Step 4. Train

Run

(.env) $ python train.py

You can change some hyperparameters by modifying config.py.

Step 5. Inference

Set the checkpoint path by changing ckpt_fpath of EvalConfig in config.py.
Run
```
(.env) $ python run.py
```

I select the checkpoint that achieves the best CIDEr score on the validation set and report its test scores. All experiments are run 5 times and the scores are averaged. For SqueezeNet [7], I ran into a memory issue because the feature vector has 86528 dimensions.
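
The selection protocol described above can be sketched as follows. The data layout, the function names, and every number in this block are made-up placeholders for illustration, not results or code from this repository.

```python
def pick_best_checkpoint(checkpoints):
    """Return the checkpoint with the highest validation CIDEr."""
    return max(checkpoints, key=lambda ckpt: ckpt["val"]["CIDEr"])


def average_scores(score_dicts):
    """Average each metric over several runs."""
    metrics = list(score_dicts[0].keys())
    n = float(len(score_dicts))
    return {m: sum(s[m] for s in score_dicts) / n for m in metrics}


# Placeholder numbers: two runs, each with two saved checkpoints.
runs = [
    [
        {"val": {"CIDEr": 0.50}, "test": {"CIDEr": 0.40, "BLEU4": 0.30}},
        {"val": {"CIDEr": 0.75}, "test": {"CIDEr": 0.60, "BLEU4": 0.50}},
    ],
    [
        {"val": {"CIDEr": 0.50}, "test": {"CIDEr": 0.40, "BLEU4": 0.30}},
        {"val": {"CIDEr": 0.25}, "test": {"CIDEr": 0.45, "BLEU4": 0.35}},
    ],
]

# Per run, keep the test scores of the best-on-validation checkpoint,
# then average those test scores across runs.
test_scores = [pick_best_checkpoint(run)["test"] for run in runs]
reported = average_scores(test_scores)
```

Note that in the second placeholder run the checkpoint with the higher test CIDEr is not the one selected, because selection only looks at validation scores.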

MSVD

Model	Features	Trained on	BLEU4	CIDEr	METEOR	ROUGE_L
SA-LSTM [1]	GoogLeNet [2] & 3D conv.		41.92	51.67	29.6	-
SA-LSTM [3]	Inception-v4 [4]	ImageNet	45.3	76.2	31.9	64.2

Ours	AlexNet [9]	ImageNet	36.3	34.9	26.7	63.4
Ours	GoogLeNet [10]	ImageNet	36.0	38.8	25.0	57.1
Ours	VGG19 [5]	ImageNet	46.4	68.3	31.2	67.4
Ours	ResNet-152 [6]	ImageNet	50.8	79.5	33.3	69.8
Ours	ResNext-101 [11]	ImageNet	50.0	77.2	33.0	63.4
Ours	Inception-v4 [4]	ImageNet	50.2	79.0	33.3	69.7
Ours	R(2+1)D [8]	Sports1M, finetuned on Kinetics	51.2	77.8	33.4	70.1
Ours	3D-ResNext-101 [12]	Kinetics	49.2	82.3	33.1	70.0

MSR-VTT

Model	Features	Trained on	BLEU4	CIDEr	METEOR	ROUGE_L
SA-LSTM [3]	Inception-v4	ImageNet	36.3	39.9	25.5	58.3

Ours	AlexNet [9]	ImageNet	31.3	29.8	23.3	54.5
Ours	GoogLeNet [10]	ImageNet	26.5	26.0	22.4	58.4
Ours	VGG19 [5]	ImageNet	34.9	37.4	24.6	56.3
Ours	ResNet-152 [6]	ImageNet	36.4	41.3	25.5	57.6
Ours	ResNext-101 [11]	ImageNet	36.5	41.9	25.7	57.8
Ours	Inception-v4 [4]	ImageNet	36.2	40.9	25.3	57.3
Ours	R(2+1)D [8]	Sports1M, finetuned on Kinetics	36.7	41.4	25.4	57.7
Ours	3D-ResNext-101 [12]	Kinetics	38.1	42.6	25.4	58.5

References

[1] Yao, Li, et al. "Describing videos by exploiting temporal structure." Proceedings of the IEEE international conference on computer vision. 2015.

[2] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[3] Wang, Bairui, et al. "Reconstruction Network for Video Captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[4] Szegedy, Christian, et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.

[5] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[6] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[7] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size." arXiv preprint arXiv:1602.07360 (2016).

[8] Tran, Du, et al. "A closer look at spatiotemporal convolutions for action recognition." Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018.

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[10] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[11] Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[12] Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. "Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?." Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018.