(Ongoing) Zero-shot TTS based on VITS

September 21, 2022

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Note

  1. This repository aims to implement a VITS-based zero-shot TTS system with several different style/speaker conditioning methods.
  2. To keep secondary factors out of the comparison, the style representation is extracted simply by jointly training a reference encoder taken from StyleSpeech. Specifically, (a) no pretrained model (e.g., Link1, Link2) is used as the reference encoder, and (b) neither meta-learning nor a speaker verification loss is applied during training.
  3. The LibriTTS dataset (train-clean-100 and train-clean-360) is used for training.
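
| Model                  | Text Encoder    | Flow                | Posterior Encoder   | Vocoder        |
|------------------------|-----------------|---------------------|---------------------|----------------|
| master (YourTTS)       | Output addition | Global conditioning | Global conditioning | Input addition |
| transfer (TransferTTS) | None            | Global conditioning | None                | None           |
| s1 (Proposed)          | SC-CNN          | Global conditioning | Global conditioning | Input addition |
| s2 (Proposed)          | SC-CNN          | SC-CNN              | SC-CNN              | TBD            |
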
  • master
  • transfer
  • s1
  • s2
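
The conditioning entries listed for each model above can be illustrated with a minimal, dependency-free sketch. It assumes that "input/output addition" means broadcast-adding one utterance-level style vector to every time frame of a module's input or output, and that "global conditioning" refers to the WaveNet-style gated unit used in upstream VITS. The function names are hypothetical, the learned projection of the style vector is omitted (the vector is assumed to be already projected to the right size), and SC-CNN is not covered. This is not the repository's actual code.

```python
import math


def add_style(frames, style):
    """Broadcast-add one style vector to every time frame.

    frames: list of T frames, each a list of C floats.
    style:  list of C floats (assumed already projected to C channels).
    """
    if any(len(frame) != len(style) for frame in frames):
        raise ValueError("style vector must match the channel dimension")
    return [[h + s for h, s in zip(frame, style)] for frame in frames]


def gated_global_conditioning(frames, cond):
    """WaveNet-style gated unit with a time-invariant condition.

    frames: list of T frames, each with 2*C pre-activation channels.
    cond:   list of 2*C floats, the same vector for every frame.
    Returns T frames with C channels: tanh(first half) * sigmoid(second half).
    """
    out = []
    for frame in add_style(frames, cond):
        half = len(frame) // 2
        out.append([
            math.tanh(a) * (1.0 / (1.0 + math.exp(-b)))
            for a, b in zip(frame[:half], frame[half:])
        ])
    return out
```

In both cases the style vector is constant over time, so the conditioning changes the overall character of the hidden sequence without carrying any frame-level information.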

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository
  3. Install the Python requirements listed in requirements.txt.
    1. You may need to install espeak first: apt-get install espeak
  4. Download datasets
  5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
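
Before starting a long training run, it may help to confirm that the prerequisites above are in place. The following is a rough sketch, not part of the repository: the helper names are hypothetical, and it assumes that the build leaves a compiled extension module somewhere under the monotonic_align directory and that the phonemizer backend is reachable as an `espeak` executable on PATH.

```python
import shutil
import sys
from importlib.machinery import EXTENSION_SUFFIXES
from pathlib import Path


def find_built_extensions(root="monotonic_align"):
    """Return compiled extension modules found anywhere under `root`."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and any(p.name.endswith(s) for s in EXTENSION_SUFFIXES)
    )


def check_prerequisites(root="monotonic_align"):
    """Map each prerequisite to True/False without raising."""
    return {
        "python>=3.6": sys.version_info >= (3, 6),
        "espeak on PATH": shutil.which("espeak") is not None,
        "monotonic_align built": bool(find_built_extensions(root)),
    }


if __name__ == "__main__":
    # Run from the repository root.
    for name, ok in check_prerequisites().items():
        print(("OK      " if ok else "MISSING ") + name)
```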

Training Example

python train_zs.py -c configs/libritts_base.json -m libritts_base
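
To see what a configuration file will set before launching the command above, a small helper such as the following can list its contents. This is a sketch only: `summarize_config` is a hypothetical name, and the only assumption about the file passed with `-c` is that it is plain JSON with a top-level object; no particular section names are assumed.

```python
import json
from pathlib import Path


def summarize_config(path):
    """Load a JSON config and list the keys of each top-level section.

    Sections that are objects are reduced to a sorted list of their keys;
    any other top-level value is returned unchanged.
    """
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    return {
        section: sorted(value) if isinstance(value, dict) else value
        for section, value in cfg.items()
    }


if __name__ == "__main__":
    config_path = Path("configs/libritts_base.json")
    if config_path.is_file():
        for section, keys in summarize_config(config_path).items():
            print(section, "->", keys)
```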

Inference Example

See inference.ipynb