README.md

December 29, 2024 · View on GitHub

SSVP-SLT: Self-supervised Video Pretraining for Sign Language Translation

This repository contains research code for the paper Towards Privacy-Aware Sign Language Translation at Scale.

SSVP-SLT Overview

SSVP-SLT relies on masked autoencoding (MAE) on anonymized and unannotated videos as a form of self-supervised pretraining to learn continuous sign language representations at scale. The learned representations are transferred to the supervised gloss-free sign language translation task. SSVP-SLT outperforms prior SOTA methods on the ASL-to-English How2Sign benchmark in the finetuned and zero-shot settings by over 3 BLEU points.

Installation

We provide installation instructions in INSTALL.md.

Usage

1. Preparing the data

We describe how to prepare the datasets in DATASETS.md.

2. Pretraining

MAE pretraining instructions are in pretraining/README.md.
Joint MAE & CLIP/FLIP pretraining instructions are in pretraining_clip/README.md.

3. Sign Language Translation (SLT)

Instructions for feature extraction and SLT training and evaluation are in translation/README.md.

DailyMoth-70h

We release the DailyMoth-70h (DM-70) dataset as part of this project. DailyMoth-70h is released under a CC-BY-NC 4.0 license.

You can find an overview of the data and download and data preparation instructions in DATASETS.md.

Alternatively, download the files manually via these links:

Subset	Link	md5
Raw videos	download	`875ffe4eeac3a37e50b4202c2b4996d2`
Blurred clips	download	`a2819c7b06a8b38eb7686e4dc90a7433`
Unblurred clips	download	`3e69046f6cf415cec89c3544d0523325`
Manifest files	download	`69e500cc5cfad3133c4b589428865472`

Note

Check out our paper for detailed information on the DailyMoth-70h dataset.

Translation using SONAR

In order to try ASL translation using the massively multilingual and multimodal SONAR sentence embedding space, please see here.

Citing our work

If you find our work useful in your research, please consider citing:

@inproceedings{rust-etal-2024-towards,
    title = "Towards Privacy-Aware Sign Language Translation at Scale",
    author = "Rust, Phillip and Shi, Bowen and Wang, Skyler and Camgoz, Necati Cihan and Maillard, Jean",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.467",
    pages = "8624--8641",
}

References

This codebase is heavily influenced by the mae and mae_st repositories. Our models are based on code from Hiera, HF Transformers, OpenCLIP, and Fairseq.

License

This project is primarily under the CC-BY-NC 4.0 license; see LICENSE for details. Portions of the project are available under separate license terms: Transformers is licensed under the Apache-2.0 license and OpenCLIP is licensed under the OpenCLIP license.