SSVP-SLT: Self-supervised Video Pretraining for Sign Language Translation
This repository contains research code for the paper Towards Privacy-Aware Sign Language Translation at Scale.
SSVP-SLT uses masked autoencoding (MAE) on anonymized, unannotated videos as self-supervised pretraining to learn continuous sign language representations at scale. The learned representations are then transferred to the supervised, gloss-free sign language translation task. SSVP-SLT outperforms prior state-of-the-art methods on the ASL-to-English How2Sign benchmark by over 3 BLEU points in both the finetuned and zero-shot settings.
Installation
We provide installation instructions in INSTALL.md.
Usage
1. Preparing the data
We describe how to prepare the datasets in DATASETS.md.
2. Pretraining
- MAE pretraining instructions are in pretraining/README.md.
- Joint MAE & CLIP/FLIP pretraining instructions are in pretraining_clip/README.md.
3. Sign Language Translation (SLT)
Instructions for feature extraction and SLT training and evaluation are in translation/README.md.
DailyMoth-70h
We release the DailyMoth-70h (DM-70) dataset as part of this project. DailyMoth-70h is released under a CC-BY-NC 4.0 license.
An overview of the data, along with download and data preparation instructions, is in DATASETS.md.
Alternatively, download the files manually via these links:
| Subset | Link | md5 |
|---|---|---|
| Raw videos | download | 875ffe4eeac3a37e50b4202c2b4996d2 |
| Blurred clips | download | a2819c7b06a8b38eb7686e4dc90a7433 |
| Unblurred clips | download | 3e69046f6cf415cec89c3544d0523325 |
| Manifest files | download | 69e500cc5cfad3133c4b589428865472 |
Note: See our paper for detailed information on the DailyMoth-70h dataset.
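If you download the files manually, you may want to check them against the md5 values in the table before unpacking. The following is a minimal sketch using only the Python standard library; it is not part of this repository. The checksums are copied from the table, but the local file names are hypothetical placeholders, so replace them with the names of the files you actually downloaded.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the table above.
# ASSUMPTION: the file names used as keys are hypothetical placeholders;
# replace them with the names of the files you downloaded.
EXPECTED_MD5 = {
    "raw_videos.tar.gz": "875ffe4eeac3a37e50b4202c2b4996d2",
    "blurred_clips.tar.gz": "a2819c7b06a8b38eb7686e4dc90a7433",
    "unblurred_clips.tar.gz": "3e69046f6cf415cec89c3544d0523325",
    "manifests.tar.gz": "69e500cc5cfad3133c4b589428865472",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5sum(path) == expected.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        path = Path(name)
        if not path.exists():
            print(f"{name}: not found, skipping")
            continue
        status = "OK" if verify(path, expected) else "MISMATCH"
        print(f"{name}: {status}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case re-downloading the file is the simplest fix.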
Translation using SONAR
To try ASL translation using the massively multilingual and multimodal SONAR sentence embedding space, please see here.
Citing our work
If you find our work useful in your research, please consider citing:
@inproceedings{rust-etal-2024-towards,
title = "Towards Privacy-Aware Sign Language Translation at Scale",
author = "Rust, Phillip and Shi, Bowen and Wang, Skyler and Camgoz, Necati Cihan and Maillard, Jean",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.467",
pages = "8624--8641",
}
References
This codebase is heavily influenced by the mae and mae_st repositories. Our models are based on code from Hiera, HF Transformers, OpenCLIP, and Fairseq.
License
This project is primarily under the CC-BY-NC 4.0 license; see LICENSE for details. Portions of the project are available under separate license terms: Transformers is licensed under the Apache-2.0 license and OpenCLIP is licensed under the OpenCLIP license.