November 22, 2024

ECCV 2024 Efficient Pre-training for Localized Instruction Generation of Videos

Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

arXiv Dataset

Abstract

In this work, we propose the Sieve & Swap technique to automatically generate high-quality pre-training data for the recipe domain: (i) Sieve filters irrelevant transcripts, and (ii) Swap acquires high-quality text by replacing transcripts with human-written instructions from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve & Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data.
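The two stages described above can be pictured with a toy sketch. This is only an illustration of the filter-then-replace idea, not the method used in the paper: the token-overlap (Jaccard) similarity, the `threshold` value, and the names `tokens`, `jaccard`, and `sieve_and_swap` are all assumptions made for this example.

```python
import re


def tokens(text):
    """Lowercase word set used for a crude similarity measure (assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def sieve_and_swap(transcripts, instructions, threshold=0.2):
    """Toy filter-then-replace pass (hypothetical, not the paper's pipeline).

    Sieve: drop transcript segments with no sufficiently similar instruction.
    Swap: replace each kept segment with its best-matching instruction.
    """
    if not instructions:
        return []
    curated = []
    for segment in transcripts:
        best = max(instructions, key=lambda step: jaccard(segment, step))
        if jaccard(segment, best) >= threshold:
            curated.append(best)
    return curated


transcripts = [
    "hey guys welcome back to my channel",
    "so now we chop the onions really finely",
    "then you add the onions to the hot pan with oil",
]
instructions = [
    "Chop the onions finely.",
    "Add the onions to a hot pan with oil.",
]
curated = sieve_and_swap(transcripts, instructions)
```

In this toy run, the greeting segment is sieved out and the two cooking segments are swapped for the matching human-written steps. A real pipeline would need a far stronger relevance and matching model than word overlap.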

sieve and swap approach

Dataset

Sieve & Swap: Our curated dataset, along with processed features, can be downloaded from :hugs: Hugging Face. More details are available in data.md.

Raw Pre-Training: HowTo100M, RecipeNLG

Downstream Task: YouCook2, Tasty

Code

Coming Soon!

:page_facing_up: Citation

If you find this project useful in your research, please consider citing:


@inproceedings{batra2025efficient,
  title={Efficient Pre-training for Localized Instruction Generation of Procedural Videos},
  author={Batra, Anil and Moltisanti, Davide and Sevilla-Lara, Laura and Rohrbach, Marcus and Keller, Frank},
  booktitle={European Conference on Computer Vision},
  pages={347--363},
  year={2025},
  organization={Springer}
}

Licenses

This code is released under the MIT License. The licenses for datasets used in the paper are available at the following links: HowTo100M, YouCook2, and Tasty.

:dizzy: Acknowledgement

Thanks to the following open-source projects:

PDVC, VidChapter, Lightning-Hydra-Template.