Translating Videos to Natural Language Using Deep Recurrent Neural Networks

October 7, 2015

Paper: NAACL-HLT 2015 (PDF)

Download Model: NAACL15_VGG_MEAN_POOL_MODEL (220MB)

Project Page

Description

The model is an improved version of the mean pooled model described in the NAACL-HLT 2015 paper. It uses video frame features from the 16-layer VGG model (VGG-16) and is trained only on the YouTube video dataset.
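As a rough illustration of what "mean pooled" means here, the sketch below averages per-frame feature vectors into a single video-level vector. It is not the authors' code: the function name `mean_pool_frames`, the 4096-dimensional feature size, and the random stand-in features are assumptions made for the example, and real features would come from running VGG-16 on sampled video frames.

```python
import numpy as np


def mean_pool_frames(frame_features):
    """Average per-frame feature vectors into one video-level vector.

    frame_features: array-like of shape (num_frames, feature_dim).
    Hypothetical helper for illustration, not part of the released code.
    """
    feats = np.asarray(frame_features, dtype=np.float32)
    if feats.ndim != 2 or feats.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, feature_dim) array")
    # Mean over the frame axis collapses the video to a single vector.
    return feats.mean(axis=0)


# Stand-in data: 30 frames with 4096-dimensional features (assumed size).
rng = np.random.default_rng(0)
frames = rng.random((30, 4096), dtype=np.float32)

video_vector = mean_pool_frames(frames)
print(video_vector.shape)  # (4096,)
```

The pooled vector has the same dimensionality regardless of how many frames the video contains, which is what lets a fixed-size input be passed to the recurrent sentence generator.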

Translating Videos to Natural Language Using Deep Recurrent Neural Networks
S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, K. Saenko
North American Chapter of the Association for Computational Linguistics – Human Language Technologies
NAACL-HLT 2015

Please consider citing the above paper if you use this model.

Performance

The model achieves a METEOR score of 27.7% on the test set of the YouTube (MSVD) video dataset (see Table 2 in the Sequence to Sequence - Video to Text paper).

Caffe compatibility

The models are currently supported by the recurrent branch of the Caffe fork by Jeff Donahue and Subhashini Venugopalan, but are not yet compatible with the master branch of Caffe.

Training

More details on the code and data can be found on this Project Page.

The prototxts for the network and solver can also be found here: https://github.com/vsubhashini/caffe/tree/recurrent/examples/youtube