ch-tts-llasa-rl-grpo

June 2, 2025

Installation

Follow the verl installation instructions.

Data Preprocessing

Push Dummy Dataset

This step defines a dummy dataset and pushes it to the Hugging Face Hub.

Make local hdfs directory

This step creates a local hdfs directory and writes the dataset to it:

python3 examples/data_preprocess/tts.py \
  --data_source Seungyoun/dummy_llasa_tts_text \
  --local_dir ~/data/llasa-tts-rl-grpo

Training

This tutorial uses three A6000 GPUs: two for training and one for the Whisper NLL calculation.

The GRPO reward function is defined as follows.


Reward Calculation

The reward is calculated based on two key metrics: the Character Error Rate (CER) and the Negative Log-Likelihood (NLL) obtained from Whisper. The formula is given by:

$$\text{Reward} = \frac{\lambda_c + \lambda_n}{\frac{\lambda_c}{U_{CER}} + \frac{\lambda_n}{U_{NLL}}}$$

where:

  • CER Utility:

    $$U_{CER} = 1 - \tanh(\beta_c \cdot CER)$$
  • NLL Utility:

    $$U_{NLL} = e^{-\frac{NLL}{\tau_n}}$$

Explanation of Variables:

  • CER: Character Error Rate (difference between the ground truth and Whisper's transcript).
  • NLL: Negative Log-Likelihood from Whisper (a measure of speech synthesis quality).
  • $\beta_c$, $\tau_n$: Parameters controlling the sensitivity to CER and NLL, respectively.
  • $\lambda_c$, $\lambda_n$: Weights determining the relative importance of CER and NLL.

For CER ≥ 0 and NLL ≥ 0, both utilities lie in (0, 1], so the reward, which is their weighted harmonic mean, also lies in (0, 1]. Higher values indicate better quality.
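
The reward formula can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function name `tts_reward` and the default values of `beta_c`, `tau_n`, `lambda_c`, and `lambda_n` are placeholder assumptions, since the values used in training are not stated here.

```python
import math


def tts_reward(cer, nll, beta_c=3.0, tau_n=3.0, lambda_c=1.0, lambda_n=1.0):
    """Weighted harmonic mean of the CER and NLL utilities.

    All default parameter values are hypothetical placeholders.
    """
    u_cer = 1.0 - math.tanh(beta_c * cer)  # CER utility, 1 when CER = 0
    u_nll = math.exp(-nll / tau_n)         # NLL utility, 1 when NLL = 0
    # In floating point a utility can round to exactly 0 for very bad samples;
    # the harmonic mean tends to 0 in that limit, so return 0 instead of dividing.
    if u_cer <= 0.0 or u_nll <= 0.0:
        return 0.0
    return (lambda_c + lambda_n) / (lambda_c / u_cer + lambda_n / u_nll)
```

A perfect sample (CER = 0, NLL = 0) scores exactly 1. Because a harmonic mean is pulled toward the smaller of its inputs, a sample cannot score well by doing well on only one of the two metrics.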

Launch Whisper server

The Whisper server computes the NLL using the Whisper model.

CUDA_VISIBLE_DEVICES=2 \
python3 tts/whisper_server.py \
  --port 8001 \
  --model large-v3

Then set the server address:

WHISPER_SERVER=http://localhost:8001
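
How the NLL is aggregated is not spelled out here. One common convention, assumed in the sketch below, is the mean of the negative token log-probabilities that Whisper assigns to the reference text given the synthesized audio. The helper `mean_nll` and its input list are hypothetical and do not describe the interface of `tts/whisper_server.py`.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood over the reference tokens.

    token_logprobs: natural-log probabilities, one per reference token
    (assumed to come from a teacher-forced pass of the ASR model).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)
```

Averaging rather than summing keeps the value comparable across utterances of different lengths, which matters if a single `tau_n` is applied to every sample.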

Launch Training

nohup bash ./examples/grpo_trainer/run_llasa_tts_grpo.sh > verl_grpo_1b.log 2>&1 &

Results

[Figure: CER results chart]

We performed continual training of a Korean TTS model starting from the LLASA-1B checkpoint and evaluated its performance using our internal dataset.

The results clearly indicate an improvement when applying GRPO:

  • LLasa1B + 15K Korean: CER = 0.0266
  • LLasa1B + 15K Korean + GRPO: CER = 0.0204

The chart shows that GRPO lowers the Character Error Rate (CER) from 0.0266 to 0.0204, a relative reduction of about 23%, indicating enhanced synthesis quality.
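
For reference, CER compares the ground-truth text with Whisper's transcript. A minimal sketch of one common way to compute it follows, assuming the usual Levenshtein edit distance normalized by the reference length. The exact normalization and text preprocessing used for the numbers above are not stated, and `character_error_rate` is a hypothetical helper.

```python
def character_error_rate(reference, hypothesis):
    """Levenshtein distance between the two strings divided by len(reference)."""
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        # Convention assumed here: empty reference is an error unless both are empty.
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Python strings are indexed by Unicode code point, so a precomposed Hangul syllable counts as one character in this sketch. Note that CER computed this way can exceed 1 when the transcript is much longer than the reference.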

Acknowledgements

This work builds upon veRL and LLASA.