SA-LSTM

January 19, 2020

This project attempts to implement SA-LSTM, the model proposed in "Describing Videos by Exploiting Temporal Structure" [1] (ICCV 2015).

Environment

  • Ubuntu 16.04
  • CUDA 9.0
  • cuDNN 7.3.1
  • Nvidia Geforce GTX Titan Xp 12GB

Requirements

  • Java 8
  • Python 2.7.12
    • PyTorch 1.0
    • Other python libraries specified in requirements.txt

How to use

Step 1. Setup python virtual environment

$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt

Step 2. Prepare Data

  1. Extract features with the network you want to use, and place them at <PROJECT ROOT>/<DATASET>/features/<DATASET>_<NETWORK>.hdf5. I extracted VGG19, ResNet-101, ResNet-152, and Inception-v4 features from here, R(2+1)D features from here, and 3D-ResNext features from here.

    Dataset    ResNet-101    Inception-v4    3D-ResNext-101
    MSVD       link          link            link
    MSR-VTT    link          link            link
  2. After setting model of <DATASET>SplitConfig in config.py to the network chosen above, split the dataset according to the official splits with the following commands:

    (.env) $ python -m splits.MSVD
    (.env) $ python -m splits.MSR-VTT
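
As a small illustration of the path convention in item 1, the sketch below builds the expected feature file location. The helper name feature_fpath and the example network label are hypothetical and not part of this repository; the exact <NETWORK> string should match whatever config.py expects.

```python
import os


def feature_fpath(project_root, dataset, network):
    """Build <PROJECT ROOT>/<DATASET>/features/<DATASET>_<NETWORK>.hdf5.

    Hypothetical helper for illustration only; the repository may
    construct this path differently.
    """
    fname = "{}_{}.hdf5".format(dataset, network)
    return os.path.join(project_root, dataset, "features", fname)


if __name__ == "__main__":
    # Example values are assumptions, not names taken from config.py.
    print(feature_fpath(".", "MSVD", "ResNet152"))
```

Checking that this path exists before running the split scripts is a cheap way to catch a misnamed feature file early.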
    

Step 3. Prepare Evaluation Code

Clone the evaluation code from the official coco-caption repository.

(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption

Step 4. Train

Run

(.env) $ python train.py

You can change some hyperparameters by modifying config.py.

Step 5. Inference

  1. Set the checkpoint path by changing ckpt_fpath of EvalConfig in config.py.
  2. Run
    (.env) $ python run.py
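
For orientation, the setting from item 1 might look roughly like the sketch below. Only the names EvalConfig and ckpt_fpath come from this README; the class layout, the placeholder path, and the checkpoint_exists helper are assumptions made for illustration, and the real config.py may be organized differently.

```python
import os


class EvalConfig(object):
    # Placeholder value (assumption); point this at a checkpoint
    # produced by train.py.
    ckpt_fpath = "checkpoints/example.ckpt"


def checkpoint_exists(config):
    """Return True if the configured checkpoint file is present on disk.

    Hypothetical helper, not part of the repository.
    """
    return os.path.isfile(config.ckpt_fpath)


if __name__ == "__main__":
    if not checkpoint_exists(EvalConfig):
        print("Checkpoint not found: {}".format(EvalConfig.ckpt_fpath))
```

A check like this before launching run.py avoids loading the model only to fail on a mistyped path.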
    

Results

I select the checkpoint that achieves the best CIDEr score on the validation set and report its test scores. All experiments are run 5 times and the scores are averaged. I ran into a memory issue with SqueezeNet [7] because its feature vector has 86528 dimensions, so it does not appear in the tables below.
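
The selection and averaging procedure described above can be sketched as follows. The function names, the dictionary layout, and the numbers in the demo are all made up for illustration; this is not the code used by train.py or run.py.

```python
def select_best_checkpoint(val_scores):
    """Return the checkpoint name with the highest validation CIDEr.

    val_scores maps a checkpoint name to a dict of metric -> score.
    """
    return max(val_scores, key=lambda name: val_scores[name]["CIDEr"])


def average_runs(runs):
    """Average each metric over a list of per-run score dicts."""
    metrics = runs[0].keys()
    return {m: sum(run[m] for run in runs) / float(len(runs)) for m in metrics}


if __name__ == "__main__":
    # Made-up numbers for illustration only; not results from this project.
    val_scores = {
        "epoch_10": {"CIDEr": 70.0, "BLEU4": 45.0},
        "epoch_20": {"CIDEr": 75.0, "BLEU4": 44.0},
    }
    print(select_best_checkpoint(val_scores))

    # Test scores of the selected checkpoint from two hypothetical runs.
    test_runs = [{"CIDEr": 78.0}, {"CIDEr": 80.0}]
    print(average_runs(test_runs))
```

Selecting on validation CIDEr and reporting only the test scores of that checkpoint keeps the test set out of model selection.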

  • MSVD

    Model        Features                  Trained on                       BLEU4  CIDEr  METEOR  ROUGE_L
    SA-LSTM [1]  GoogLeNet [2] & 3D conv.                                   41.92  51.67  29.6    -
    SA-LSTM [3]  Inception-v4 [4]          ImageNet                         45.3   76.2   31.9    64.2
    Ours         AlexNet [9]               ImageNet                         36.3   34.9   26.7    63.4
    Ours         GoogleNet [10]            ImageNet                         36.0   38.8   25.0    57.1
    Ours         VGG19 [5]                 ImageNet                         46.4   68.3   31.2    67.4
    Ours         ResNet-152 [6]            ImageNet                         50.8   79.5   33.3    69.8
    Ours         ResNext-101 [11]          ImageNet                         50.0   77.2   33.0    63.4
    Ours         Inception-v4 [4]          ImageNet                         50.2   79.0   33.3    69.7
    Ours         R(2+1)D [8]               Sports1M, finetuned on Kinetics  51.2   77.8   33.4    70.1
    Ours         3D-ResNext-101 [12]       Kinetics                         49.2   82.3   33.1    70.0
  • MSR-VTT

    Model        Features                  Trained on                       BLEU4  CIDEr  METEOR  ROUGE_L
    SA-LSTM [3]  Inception-v4              ImageNet                         36.3   39.9   25.5    58.3
    Ours         AlexNet [9]               ImageNet                         31.3   29.8   23.3    54.5
    Ours         GoogleNet [10]            ImageNet                         26.5   26.0   22.4    58.4
    Ours         VGG19 [5]                 ImageNet                         34.9   37.4   24.6    56.3
    Ours         ResNet-152 [6]            ImageNet                         36.4   41.3   25.5    57.6
    Ours         ResNext-101 [11]          ImageNet                         36.5   41.9   25.7    57.8
    Ours         Inception-v4 [4]          ImageNet                         36.2   40.9   25.3    57.3
    Ours         R(2+1)D [8]               Sports1M, finetuned on Kinetics  36.7   41.4   25.4    57.7
    Ours         3D-ResNext-101 [12]       Kinetics                         38.1   42.6   25.4    58.5

References

[1] Yao, Li, et al. "Describing videos by exploiting temporal structure." Proceedings of the IEEE international conference on computer vision. 2015.

[2] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[3] Wang, Bairui, et al. "Reconstruction Network for Video Captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[4] Szegedy, Christian, et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." AAAI. Vol. 4. 2017.

[5] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[6] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[7] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size." arXiv preprint arXiv:1602.07360 (2016).

[8] Tran, Du, et al. "A closer look at spatiotemporal convolutions for action recognition." Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018.

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[10] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[11] Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[12] Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. "Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?." Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018.