StyleDubber
October 25, 2024
This package contains the accompanying code for the following paper:
"StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing", which appeared as a long paper in the Findings of ACL 2024.

News
TODOs
- Release StyleDubber's training and inference code.
- Release pretrained weights.
- Release the raw data and preprocessed data features of the GRID dataset.
- Metrics Testing Scripts (SECS, WER_Whisper).
- Release Demo Pages.
- Release the preprocessed data features of the V2C-Animation dataset.
- Update README.md.
- Upload the dataset to Google Drive.
Dataset
- GRID BaiduDrive (code: GRID) / GoogleDrive
├── Lip_Grid_Gray
│   └── [GRID's Lip Region Images in Gray-scale]
├── Lip_Grid_Color
│   └── [GRID's Lip Region Images in RGB]
├── Grid_resample_ABS (available on GoogleDrive)
│   └── [22050 Hz Ground Truth Audio Files in .wav] (The original data of GRID is 25 kHz)
├── Grid_lip_Feature
│   └── [Lip Feature extracted from Lip_Grid_Gray via Lipreading_using_Temporal_Convolutional_Networks]
├── Grid_Face_Image
│   └── [GRID's Face Region Images]
├── Grid_dataset_Raw
│   └── [GRID's raw data from Website]
├── Grad_eachframe
│   └── [Each frame files of Grid dataset]
├── Gird_FaceVAFeature
│   └── [Face Feature extracted from Grid_Face_Image via EmoFAN]
└── 0_Grid_Wav_22050_Abs_Feature (available on GoogleDrive)
    └── [Contains all the data features for train and inference in the GRID dataset]
Note: If you only want to train StyleDubber on the GRID dataset, you only need to download 0_Grid_Wav_22050_Abs_Feature (preprocessed data features) and Grid_resample_ABS (ground-truth waveforms used for testing). If you want to plot or display the data, use it for other tasks (lip reading, ASV, etc.), or re-preprocess the features in your own way, download whichever of the remaining files you need.
- V2C-Animation dataset (chenqi-Denoise2) BaiduDrive (code: k9mb) / GoogleDrive
├── Phoneme_level_Feature (available on GoogleDrive)
│   └── [Contains all the data features for train and inference in the V2C-Animation dataset]
└── GT_Wav (available on GoogleDrive)
    └── [22050 Hz ground truth Audio Files in .wav]
Note: For training on V2C-Animation, you need to download the files in Phoneme_level_Feature (Preprocessed data features) and GT_Wav (Ground truth waveform used for testing).
Other visual images (e.g., face and lip regions) in intermediate processes can be accessed at HPMDubbing.
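Before training, it can help to confirm that the folders named in the two notes above are actually in place. The following is a minimal sketch, not part of the repository: the folder names come from the notes, while the helper `missing_folders` and the download root passed to it are assumptions for illustration.

```python
from pathlib import Path

# Folder names are taken from the dataset notes above.
# The dictionary keys ("GRID", "V2C") are labels chosen for this sketch.
REQUIRED_FOLDERS = {
    "GRID": ["0_Grid_Wav_22050_Abs_Feature", "Grid_resample_ABS"],
    "V2C": ["Phoneme_level_Feature", "GT_Wav"],
}


def missing_folders(root, dataset):
    """Return the required folders that are not present under `root`."""
    root = Path(root)
    return [
        name
        for name in REQUIRED_FOLDERS[dataset]
        if not (root / name).is_dir()
    ]


if __name__ == "__main__":
    # "./data/GRID" is a placeholder; point it at your own download directory.
    print(missing_folders("./data/GRID", "GRID"))
```

An empty list means both folders needed for training and testing were found.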
Quick Q&A: HPMDubbing also has pre-processed features. Are they the same? Can I use them to train StyleDubber?
No, you need to download the features again to train StyleDubber. HPMDubbing needs frame-level features with a hop length of 220 and a window length of 880 for its upsampling scheme.
StyleDubber currently supports only phoneme-level features, and we changed the hop length to 256 and the window length to 1024 during pre-processing.
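To see why the two feature sets are not interchangeable, the sketch below compares how many spectrogram frames each hop length produces for the same audio. The hop and window lengths and the 22050 Hz sampling rate come from this README; the frame-count formula assumes a centered STFT (a common convention) and is not confirmed to match the repository's pre-processing.

```python
SAMPLE_RATE = 22050  # Hz, the sampling rate of the provided ground-truth audio


def frames_per_second(hop_length, sample_rate=SAMPLE_RATE):
    """Frame rate of a spectrogram computed with the given hop length."""
    return sample_rate / hop_length


def num_frames(num_samples, hop_length):
    """Frame count under the assumed centered-STFT convention: 1 + floor(n / hop)."""
    return 1 + num_samples // hop_length


# (name, hop length, window length) as stated in the Q&A above.
for name, hop, win in [("HPMDubbing", 220, 880), ("StyleDubber", 256, 1024)]:
    print(
        f"{name}: hop={hop}, win={win}, "
        f"{frames_per_second(hop):.2f} frames/s, "
        f"{num_frames(SAMPLE_RATE, hop)} frames for 1 s of audio"
    )
```

Because the frame counts differ for the same clip, features extracted with one setting do not line up with targets extracted with the other.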
Checkpoints
We provide pre-trained checkpoints for the GRID and V2C-Animation datasets:
- GRID: https://pan.baidu.com/s/1Mj3MN4TuAEc7baHYNqwbYQ (code: y8kb), Google Drive
- V2C-Animation dataset (chenqi-Denoise2): https://pan.baidu.com/s/1hZBUszTaxCTNuHM82ljYWg (code: n8p5), Google Drive
Environment
We use Python 3.8.18 and CUDA 11.5; other compatible versions may also work.
Both training and inference are implemented in PyTorch and run on a GeForce RTX 4090 GPU.
conda create -n style_dubber python=3.8.18
conda activate style_dubber
pip install -r requirements.txt
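After installing, a quick check like the one below can confirm which Python, PyTorch, and CUDA versions the environment actually picked up. This is a sketch rather than a script shipped with the repository; the helper name `describe_environment` is an assumption, and it reports `torch` as `None` if PyTorch is not installed.

```python
import importlib.util
import sys


def describe_environment():
    """Collect the interpreter version and, if available, PyTorch/CUDA details."""
    info = {"python": "%d.%d.%d" % sys.version_info[:3]}
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment.
        info["torch"] = None
        return info
    import torch

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    info["cuda_version"] = torch.version.cuda  # None for CPU-only builds
    return info


print(describe_environment())
```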
Train Your Own Model
Replace the paths in preprocess_config (see "./ModelConfig_V2C/model_config/MovieAnimation/config_all.txt") with your own paths.
To train on the V2C-Animation dataset (153 cartoon speakers), run:
python train_StyleDubber_V2C.py
Replace the paths in preprocess_config (see "./ModelConfig_GRID/model_config/GRID/config_all.txt") with your own paths.
To train on the GRID dataset (33 real-world speakers), run:
python train_StyleDubber_GRID.py
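If the paths in a config file share a common root, replacing that root in one pass may be quicker than editing each entry by hand. The sketch below assumes the config is plain text in which the old root appears as a literal string; the exact layout of config_all.txt is not described here, and the helper `replace_path_prefix` and both prefixes in the usage note are hypothetical.

```python
from pathlib import Path


def replace_path_prefix(config_file, old_prefix, new_prefix):
    """Replace every literal occurrence of old_prefix; return how many were found."""
    path = Path(config_file)
    text = path.read_text(encoding="utf-8")
    count = text.count(old_prefix)
    if count:
        # Only rewrite the file when something actually changes.
        path.write_text(text.replace(old_prefix, new_prefix), encoding="utf-8")
    return count
```

For example, `replace_path_prefix("./ModelConfig_GRID/model_config/GRID/config_all.txt", "/old/data/root", "/my/data/root")` would rewrite the file in place, so keep a backup and review the result before training.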
Inference Wav
There are three dubbing settings in the paper. The first is the same as in V2C-Net (Chen et al., 2022a) and uses the target audio from the test set as the reference audio. However, this is impractical in real-world applications. We therefore design two new, more realistic settings: "Dub 2.0" uses non-ground-truth audio of the same speaker as the reference audio; "Dub 3.0" uses the audio of unseen characters (from another dataset) as the reference audio.
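The difference between the three settings comes down to which clip is used as the reference audio. The sketch below illustrates that selection logic only; it is not the repository's code. The clip dictionaries with "id" and "speaker" keys, the helper `pick_reference`, and the random choice among valid candidates are all assumptions made for illustration.

```python
import random


def pick_reference(setting, target, same_dataset_clips, other_dataset_clips, rng=random):
    """Choose a reference clip for one test utterance.

    setting 1: the target (ground-truth) audio itself.
    setting 2 ("Dub 2.0"): another clip of the same speaker, not the target.
    setting 3 ("Dub 3.0"): a clip of an unseen character from another dataset.
    """
    if setting == 1:
        return target
    if setting == 2:
        candidates = [
            c for c in same_dataset_clips
            if c["speaker"] == target["speaker"] and c["id"] != target["id"]
        ]
    elif setting == 3:
        candidates = [
            c for c in other_dataset_clips
            if c["speaker"] != target["speaker"]
        ]
    else:
        raise ValueError(f"unknown setting: {setting}")
    if not candidates:
        raise ValueError("no valid reference clip for this setting")
    return rng.choice(candidates)
```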

Inference Setting1: V2C & GRID
python 0_evaluate_V2C_Setting1.py --restore_step <checkpoint_step>
or
python 0_evaluate_GRID_Setting1.py --restore_step <checkpoint_step>
Inference Setting2: V2C
python 0_evaluate_V2C_Setting2.py --restore_step <checkpoint_step>
Inference Setting3: V2C
python 0_evaluate_V2C_Setting3.py --restore_step <checkpoint_step>
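To evaluate several checkpoints in a row, the commands above can be generated programmatically. This is a sketch under the assumption that it is run from the repository root; the script names and the `--restore_step` flag are the ones listed above, while the helpers `build_eval_command` and `run_all` are hypothetical.

```python
import subprocess
import sys

# (dataset, setting) -> evaluation script, as listed in this README.
SCRIPTS = {
    ("V2C", 1): "0_evaluate_V2C_Setting1.py",
    ("GRID", 1): "0_evaluate_GRID_Setting1.py",
    ("V2C", 2): "0_evaluate_V2C_Setting2.py",
    ("V2C", 3): "0_evaluate_V2C_Setting3.py",
}


def build_eval_command(dataset, setting, restore_step):
    """Return the argument list for one evaluation run."""
    script = SCRIPTS.get((dataset, setting))
    if script is None:
        raise ValueError(f"no evaluation script listed for {dataset} setting {setting}")
    return [sys.executable, script, "--restore_step", str(restore_step)]


def run_all(dataset, setting, steps):
    """Run the evaluation script once per checkpoint step, stopping on failure."""
    for step in steps:
        subprocess.run(build_eval_command(dataset, setting, step), check=True)
```

Calling `run_all("V2C", 2, steps)` with your own list of checkpoint steps would run the Setting2 script once per step.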
Output Result
- Word Error Rate (WER)
Please download the pre-trained whisper-large-v3 model (used for the V2C-Animation dataset) and the whisper-base model (used for the GRID dataset), and install jiwer:
pip install jiwer
For Setting1 and Setting2, run:
python Dub_Metric/WER_Whisper/Setting_test.py -p <Generated_wav_path> -t <GT_Wav_Path>
Note: To test on the GRID dataset, replace model = whisper.load_model("large-v3") with model = whisper.load_model("base") (see line 102 in ./Dub_Metric/WER_Whisper/Setting_test.py).
For Setting3 (V2C only), run:
python Dub_Metric/WER_Whisper/Setting3_test.py -p <Generated_wav_path> -t <GT_Wav_Path>
Quick Q&A: Why does V2C use whisper-large-v3, while GRID uses whisper-base?
Given the difficulty of the V2C-Animation dataset, the ACL ARR reviewer suggested using whisper_large to make the results more convincing. After comparison, we chose whisper-large-v3 as the WER testing benchmark. For inference speed and memory reasons, the GRID dataset still uses whisper-base as the test benchmark to calculate WER (22%), which is close to the VDTTS (Hassid et al., 2022) result (26%) in Table 2 (GRID evaluation), so this is sufficient to ensure a fair comparison.
- SPK-SIM / SECS (Speaker Encoder Cosine Similarity)
Please download wav2mel.pt and dvector.pt and save in ./ckpts.
For Setting1: Please run:
python Dub_Metric/SECS/Setting1.py -p <Generated_wav_path> -t <GT_Wav_Path>
For Setting2: Please run:
python Dub_Metric/SECS/Setting2_V2C.py -p <Generated_wav_path> -t <GT_Wav_Path>
or:
python Dub_Metric/SECS/Setting2_GRID.py -p <Generated_wav_path> -t <GT_Wav_Path>
For Setting3 (only for V2C): Please run:
python Dub_Metric/SECS/Setting3.py -p <Generated_wav_path> -t <GT_Wav_Path>
- MCD-DTW and MCD-DTW-SL
The MCD-DTW and MCD-DTW-SL are calculated by running 0_evaluate_V2C_Setting*.py and 0_evaluate_GRID_Setting*.py, see the Inference Wav section.
- Sim-O & Sim-R by WavLM-TDNN
- EMO-ACC
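For readers unfamiliar with the first two metrics listed above, the sketch below shows the arithmetic behind them in plain Python. It is not the repository's implementation: the provided scripts compute WER with jiwer on Whisper transcripts and SECS on speaker embeddings from wav2mel.pt and dvector.pt. The lowercase, whitespace-split normalization and the toy inputs in the usage note are assumptions.

```python
import math


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-norm vector")
    return dot / norm
```

A lower WER means the generated speech is transcribed closer to the script, and a cosine similarity closer to 1 means the two speaker embeddings point in nearly the same direction. For instance, dropping one word from a six-word reference gives a WER of 1/6.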
Citing
If you find our work useful, please consider citing:
@inproceedings{cong-etal-2024-styledubber,
title = "{S}tyle{D}ubber: Towards Multi-Scale Style Learning for Movie Dubbing",
author = "Cong, Gaoxiang and
Qi, Yuankai and
Li, Liang and
Beheshti, Amin and
Zhang, Zhedong and
Hengel, Anton and
Yang, Ming-Hsuan and
Yan, Chenggang and
Huang, Qingming",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
pages = "6767--6779",
}
Acknowledgments
We would like to thank the authors of previous related projects for generously sharing their code and insights: CDFSE_FastSpeech2, Multimodal Transformer, SMA, Meta-StyleSpeech, and FastSpeech2.