DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

February 3, 2023


DiffSpeech (TTS)

1. Preparation

Data Preparation

a) Download and extract the LJ Speech dataset, then create a symbolic link to the dataset folder: ln -s /xxx/LJSpeech-1.1/ data/raw/

b) Download and unpack the ground-truth durations extracted by MFA: tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/

c) Run the following script to binarize the dataset for training/inference.

export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.

Vocoder Preparation

We provide a pre-trained HifiGAN vocoder model. Please unzip the file into the checkpoints directory before training your acoustic model.
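For example, assuming the downloaded archive is named hifigan_lj.zip (a placeholder name; use the actual file from the release page), unpacking could look like:

mkdir -p checkpoints
unzip hifigan_lj.zip -d checkpoints/
# The vocoder checkpoint folder should now sit under checkpoints/.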

2. Training Example

First, you need a pre-trained FastSpeech 2 checkpoint. You can use the pre-trained model we provide, or train FastSpeech 2 from scratch by running:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_lj_1 --reset

Then, to train DiffSpeech, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset

Remember to adjust the "fs2_ckpt" parameter in usr/configs/lj_ds_beta6.yaml to point to your FastSpeech 2 checkpoint path.
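For example, if you trained FastSpeech 2 with --exp_name fs2_lj_1 as above, the entry might look like the following (the step number in the file name is illustrative; use the checkpoint your run actually saved):

# usr/configs/lj_ds_beta6.yaml (excerpt)
fs2_ckpt: checkpoints/fs2_lj_1/model_ckpt_steps_160000.ckpt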

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset --infer

We also provide:

  • the pre-trained model of DiffSpeech;
  • the individual pre-trained model of FastSpeech 2 used by the shallow diffusion mechanism in DiffSpeech.

Remember to put the pre-trained models in the checkpoints directory.
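The expected layout is sketched below; the experiment directory names must match the --exp_name values and config entries you use, and the vocoder directory name depends on the archive you downloaded:

checkpoints/
├── lj_ds_beta6_1213/    # pre-trained DiffSpeech acoustic model
├── fs2_lj_1/            # pre-trained FastSpeech 2 (shallow diffusion)
└── hifigan_lj/          # pre-trained HifiGAN vocoder (placeholder name)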

Mel Visualization

Along the vertical axis, DiffSpeech occupies mel bins [0-80] and FastSpeech 2 occupies mel bins [80-160].

DiffSpeech vs. FastSpeech 2

[Figure: stacked mel-spectrogram comparisons of DiffSpeech and FastSpeech 2]
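A stacked comparison figure like the ones above can be produced with a short matplotlib sketch. This is a minimal illustration, not the repository's plotting code; mel_diffspeech and mel_fs2 are placeholder arrays standing in for the two models' inference outputs:

import numpy as np
import matplotlib.pyplot as plt

# Placeholder mel-spectrograms of shape [80 bins, T frames]; in practice these
# come from DiffSpeech and FastSpeech 2 inference on the same utterance.
T = 400
mel_diffspeech = np.random.rand(80, T)
mel_fs2 = np.random.rand(80, T)

# Stack vertically: bins 0-80 = DiffSpeech, bins 80-160 = FastSpeech 2.
stacked = np.concatenate([mel_diffspeech, mel_fs2], axis=0)

plt.figure(figsize=(10, 4))
plt.imshow(stacked, origin='lower', aspect='auto')
plt.axhline(y=80, color='white', linewidth=0.8)  # boundary between the two models
plt.ylabel('mel bins (0-80: DiffSpeech, 80-160: FastSpeech 2)')
plt.xlabel('frames')
plt.savefig('diffspeech_vs_fastspeech2.png')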