
August 17, 2023


Visual Speech Recognition for Multiple Languages

📘Introduction | 🛠️Preparation | 📊Benchmark | 🔮Inference | 🐯Model zoo | 📝License

Authors

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.

Update

2023-07-26: We have released our training recipe for real-time AV-ASR, see here.

2023-06-16: We have released our training recipe for AutoAVSR, see here.

2023-03-27: We have released our AutoAVSR models for LRS3, see here.

Introduction

This is the repository for Visual Speech Recognition for Multiple Languages, the successor of End-to-End Audio-Visual Speech Recognition with Conformers. With this repository, you can achieve WERs of 19.1%, 1.0%, and 0.9% for visual, audio, and audio-visual speech recognition (VSR, ASR, and AV-ASR) on LRS3.

Tutorial

We provide a Colab tutorial showing how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.

Demo

English -> Mandarin -> Spanish

French -> Portuguese -> Italian

Youtube | Bilibili

Preparation

  1. Clone the repository and enter it locally:

```shell
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
```

  2. Set up the environment:

```shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```

  3. Install pytorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:

```shell
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
```

  4. Download and extract a pre-trained model and/or language model from the model zoo to:

  • ./benchmarks/${dataset}/models

  • ./benchmarks/${dataset}/language_models

  5. [For VSR and AV-ASR] Install the RetinaFace or MediaPipe tracker.
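After downloading and extracting models from the model zoo, a quick sanity check can confirm the files landed in the expected directories. A minimal sketch (the dataset name LRS3 is just an example; only the benchmarks/${dataset}/models and benchmarks/${dataset}/language_models layout comes from the steps above):

```python
from pathlib import Path

def missing_model_dirs(root, dataset):
    """Return the expected model directories that do not exist yet."""
    expected = [
        Path(root) / "benchmarks" / dataset / "models",
        Path(root) / "benchmarks" / dataset / "language_models",
    ]
    return [str(p) for p in expected if not p.is_dir()]

# Example: report anything missing for a hypothetical LRS3 checkout.
print(missing_model_dirs(".", "LRS3"))
```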

Benchmark evaluation

```shell
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
```
  • [config_filename] is the model configuration path, located in ./configs.

  • [labels_filename] is the labels path, located in ${lipreading_root}/benchmarks/${dataset}/labels.

  • [data_dir] and [landmarks_dir] are the directories for original dataset and corresponding landmarks.

  • gpu_idx=-1 can be added to switch from cuda:0 to cpu.
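When scripting many evaluations, the key=value overrides above can be composed programmatically. A hedged sketch (the helper name is hypothetical; only the eval.py arguments come from this section):

```python
def build_eval_command(config_filename, labels_filename, data_dir,
                       landmarks_dir=None, gpu_idx=None):
    """Assemble the eval.py invocation as an argv list for subprocess.run."""
    cmd = [
        "python", "eval.py",
        f"config_filename={config_filename}",
        f"labels_filename={labels_filename}",
        f"data_dir={data_dir}",
    ]
    if landmarks_dir is not None:
        cmd.append(f"landmarks_dir={landmarks_dir}")
    if gpu_idx is not None:  # gpu_idx=-1 switches from cuda:0 to cpu
        cmd.append(f"gpu_idx={gpu_idx}")
    return cmd
```

Passing the result to subprocess.run(cmd, check=True) would then launch one evaluation per configuration.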

Speech prediction

```shell
python infer.py config_filename=[config_filename] data_filename=[data_filename]
```
  • data_filename is the path to the audio/video file.

  • detector=mediapipe can be added to switch from RetinaFace to MediaPipe tracker.
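The detector override above makes it easy to fall back to MediaPipe when RetinaFace is not installed. A sketch (the retinaface module name used for the availability check is an assumption; only the infer.py arguments and the detector=mediapipe override come from this section):

```python
import importlib.util

def build_infer_command(config_filename, data_filename, detector="retinaface"):
    """Assemble the infer.py invocation, adding the MediaPipe override when
    requested or when the RetinaFace package cannot be imported."""
    cmd = [
        "python", "infer.py",
        f"config_filename={config_filename}",
        f"data_filename={data_filename}",
    ]
    if detector == "mediapipe" or importlib.util.find_spec("retinaface") is None:
        cmd.append("detector=mediapipe")
    return cmd
```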

Mouth ROIs cropping

```shell
python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
```
  • dst_filename is the path where the cropped mouth will be saved.
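To crop a whole directory of videos rather than a single file, one crop_mouth.py command can be generated per clip. A minimal sketch (the directory arguments and helper name are hypothetical; only crop_mouth.py's arguments come from this section):

```python
from pathlib import Path

def build_crop_commands(src_dir, dst_dir, ext=".mp4"):
    """One crop_mouth.py invocation per video, mirroring filenames into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    return [
        ["python", "crop_mouth.py",
         f"data_filename={video}",
         f"dst_filename={dst / video.name}"]
        for video in sorted(src.glob(f"*{ext}"))
    ]
```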

Model zoo

Overview

We support a number of datasets for speech recognition:

AutoAVSR models

Lip Reading Sentences 3 (LRS3)

| Components | WER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| - | 19.1 | GoogleDrive or BaiduDrive (key: dqsy) | 891 |
| **Audio-only** | | | |
| - | 1.0 | GoogleDrive or BaiduDrive (key: dvf2) | 860 |
| **Audio-visual** | | | |
| - | 0.9 | GoogleDrive or BaiduDrive (key: sai5) | 1540 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: t9ep) | 191 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |

VSR for multiple languages models

Lip Reading Sentences 2 (LRS2)

| Components | WER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| - | 26.1 | GoogleDrive or BaiduDrive (key: 48l1) | 186 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 53rc) | 9358 |
Lip Reading Sentences 3 (LRS3)

| Components | WER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| - | 32.3 | GoogleDrive or BaiduDrive (key: 1b1s) | 186 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |
Chinese Mandarin Lip Reading (CMLR)

| Components | CER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| - | 8.0 | GoogleDrive or BaiduDrive (key: 7eq1) | 195 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: k8iv) | 187 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 1ret) | 3721 |
CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)

| Components | WER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| Spanish | 44.5 | GoogleDrive or BaiduDrive (key: m35h) | 186 |
| Portuguese | 51.4 | GoogleDrive or BaiduDrive (key: wk2h) | 186 |
| French | 58.6 | GoogleDrive or BaiduDrive (key: t1hf) | 186 |
| **Language models** | | | |
| Spanish | - | GoogleDrive or BaiduDrive (key: 0mii) | 180 |
| Portuguese | - | GoogleDrive or BaiduDrive (key: l6ag) | 179 |
| French | - | GoogleDrive or BaiduDrive (key: 6tan) | 179 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: vsic) | 3040 |
GRID

| Components | WER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| Overlapped | 1.2 | GoogleDrive or BaiduDrive (key: d8d2) | 186 |
| Unseen | 4.8 | GoogleDrive or BaiduDrive (key: ttsh) | 186 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 16l9) | 1141 |

You can include data_ext=.mpg in your command line to match the video file extension in the GRID dataset.

Lombard GRID

| Components | WER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| Unseen (Front Plain) | 4.9 | GoogleDrive or BaiduDrive (key: 38ds) | 186 |
| Unseen (Side Plain) | 8.0 | GoogleDrive or BaiduDrive (key: k6m0) | 186 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: cusv) | 309 |

You can include data_ext=.mov in your command line to match the video file extension in the Lombard GRID dataset.

TCD-TIMIT

| Components | WER | url | size (MB) |
|------------|-----|-----|-----------|
| **Visual-only** | | | |
| Overlapped | 16.9 | GoogleDrive or BaiduDrive (key: jh65) | 186 |
| Unseen | 21.8 | GoogleDrive or BaiduDrive (key: n2gr) | 186 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: bnm8) | 930 |

Citation

If you use the AutoAVSR models or training code, please consider citing the following paper:

@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels}, 
  year={2023},
}

If you use the VSR models for multiple languages, please consider citing the following paper:

@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}

License

Please note that this code may only be used for comparative or benchmarking purposes. Code supplied under the License may be used for non-commercial purposes only.

Contact

[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)