Towards Long-Form Video Understanding
July 6, 2021 · View on GitHub
Chao-Yuan Wu, Philipp Krähenbühl, CVPR 2021
[Paper] [Project Page] [Dataset]
Citation
@inproceedings{lvu2021,
Author = {Chao-Yuan Wu and Philipp Kr\"{a}henb\"{u}hl},
Title = {{Towards Long-Form Video Understanding}},
Booktitle = {{CVPR}},
Year = {2021}}
Overview
This repo implements Object Transformers for long-form video understanding.
Getting Started
Please organize data/ as follows
data
|_ ava
|_ features
|_ instance_meta
|_ lvu_1.0
ava, features, and instance_meta could be found at this Google Drive folder.
lvu_1.0 can be found at here.
Please also download pre-trained weights at this Google Drive folder and put them in pretrained_models/.
Pre-training
bash run_pretrain.sh
This pretrains on a small demo dataset data/instance_meta/instance_meta_pretrain_demo.pkl as an example. Please follow its file format if you'd like to pretrain on a larger dataset (e.g., latest full version of MovieClips).
Training and evaluating on AVA v2.2
bash run_ava.sh
This should achieve 31.0 mAP.
Training and evaluating on LVU tasks
bash run.sh [1-9]
The argument selects a task to run on. Please see run.py for details.
Acknowledgment
This implementation largely borrows from Huggingface Transformers. Please consider citing it if you use this repo.