README.md

April 12, 2026

🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

Chunbo Hao1,2 · Junjie Zheng2 · Guobin Ma1 · Yuepeng Jiang1 · Huakang Chen1 · Wenjie Tian1 · Gongyu Chen2 · Zihao Chen2 · Lei Xie1

1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
2 AI Lab, GiantNetwork, China

🎥 Demo Video

YingMusic-Singer-Plus demo, with narration voiceover provided by VoiceSculptor. Click a badge below to watch the demo video:

YouTube | Bilibili

📖 Introduction

YingMusic-Singer-Plus is a fully diffusion-based singing voice synthesis model that enables melody-controllable singing voice editing with flexible lyric manipulation, requiring no manual alignment or precise phoneme annotation.

Given only three inputs (an optional timbre reference, a melody-providing singing clip, and modified lyrics), YingMusic-Singer-Plus synthesizes high-fidelity singing voices at 44.1 kHz while faithfully preserving the original melody.

YingMusic-Singer-Plus Architecture

Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.

✨ Key Features

  • Annotation-free: No manual lyric-MIDI alignment required at inference
  • Flexible lyric manipulation: Supports 6 editing types: partial changes, full changes, insertion, deletion, translation (CN↔EN), and code-switching
  • Strong melody preservation: CKA-based melody alignment loss + GRPO-based optimization
  • Bilingual: Unified IPA tokenizer for both Chinese and English
  • High fidelity: 44.1 kHz stereo output via Stable Audio 2 VAE
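
The melody alignment loss mentioned above is based on CKA (centered kernel alignment). As a rough illustration of the similarity measure itself, and not of the training loss used in this project, the following sketch computes linear CKA between two feature matrices that share the same number of frames. The function names, the toy feature values, and the pure-Python implementation are assumptions made for this example.

```python
import math

def center_columns(x):
    """Subtract the per-column mean from a matrix given as a list of rows."""
    n = len(x)
    means = [sum(row[j] for row in x) / n for j in range(len(x[0]))]
    return [[v - m for v, m in zip(row, means)] for row in x]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between x (n x p) and y (n x q); each row is one frame."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows (frames)")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(yc, xc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator

# Toy features (hypothetical values): 4 frames, 2 dimensions each.
melody_feats = [[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.5]]
# Same features rotated by 90 degrees and scaled by 3.
rotated_scaled = [[-3.0 * b, 3.0 * a] for a, b in melody_feats]
unrelated = [[1.0, 0.0], [-1.0, 0.3], [1.0, -0.2], [-1.0, 0.1]]

print(round(linear_cka(melody_feats, melody_feats), 6))
print(round(linear_cka(melody_feats, rotated_scaled), 6))
print(round(linear_cka(melody_feats, unrelated), 6))
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling of either feature space, which is why the rotated and scaled copy scores the same as the original. This property makes it a plausible choice for comparing representations that live in different spaces.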

🚀 Quick Start

Option 1: Install from Scratch

# We strongly recommend uv for faster dependency resolution.
uv venv --python 3.10
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

# If you are in CN, use the USTC mirror for faster downloads:
uv pip install -r requirements.txt -i https://mirrors.ustc.edu.cn/pypi/simple

# Alternatively, conda is also supported:
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
pip install uv
uv pip install -r requirements.txt

# If you are in CN:
uv pip install -r requirements.txt -i https://mirrors.ustc.edu.cn/pypi/simple

Option 2: Pre-built Environment

Conda

  1. Download and install Miniconda from https://repo.anaconda.com/miniconda/ for your platform. Verify with conda --version.
  2. Download the pre-built environment package for your setup from the table below.
  3. Navigate to your Conda envs/ directory and create a folder named YingMusic-Singer-Plus.
  4. Move the downloaded package into that folder and extract it:
   tar -xvf <package_name>

uv

  1. Install uv via pip install uv or follow the official instructions.
  2. Download the pre-built environment package for your setup from the table below.
  3. Extract the package and activate the environment:
   tar -xvf <package_name>
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate

| CPU Architecture | GPU | OS | Type | Download |
| --- | --- | --- | --- | --- |
| AMD64 | NVIDIA | Linux | Conda | Coming soon |
| AMD64 | NVIDIA | Linux | uv | Coming soon |
| AMD64 | NVIDIA | Windows | uv | Coming soon |

Option 3: Docker

Build the image:

# Docker image names must be lowercase.
docker build -t yingmusic-singer-plus .

🎵 Inference

Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

Option 2: Local Gradio App (same as online demo)

python app_local.py

Option 3: Command-line Inference

python infer.py \
    --ref_audio examples/hf_space/melody_control/melody_control_ZH_02_timbre.wav \
    --melody_audio examples/hf_space/melody_control/melody_control_ZH_02_melody.wav \
    --ref_text "就让你|在别人怀里|快乐" \
    --target_text "Missing you in my mind|missing you in my heart" \
    --output output/melody_control_zh_missing_you.wav

Enable vocal separation and accompaniment mixing:

python infer.py \
    --ref_audio examples/hf_space/lyric_edit/SingEdit_EN_01.wav \
    --melody_audio examples/hf_space/lyric_edit/SingEdit_EN_01.wav \
    --ref_text "can you tell my heart is speaking|my eyes will give you clues" \
    --target_text "can you spot the moon is grinning|my lips will show you hints" \
    --separate_vocals \
    --mix_accompaniment \
    --output output/lyric_edit_en_moon_grinning.wav
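
When scripting many single-clip runs, it can help to assemble the infer.py command programmatically. The sketch below builds the argument list from the flags shown above; the helper name build_infer_command is hypothetical, and only flags that appear in this README are used. It prints the command rather than running it.

```python
import shlex
import sys

def build_infer_command(ref_audio, melody_audio, ref_text, target_text, output,
                        separate_vocals=False, mix_accompaniment=False):
    """Return an argv list for infer.py using only the flags shown in this README."""
    cmd = [
        sys.executable, "infer.py",
        "--ref_audio", ref_audio,
        "--melody_audio", melody_audio,
        "--ref_text", ref_text,
        "--target_text", target_text,
        "--output", output,
    ]
    # Optional switches from the second command-line example.
    if separate_vocals:
        cmd.append("--separate_vocals")
    if mix_accompaniment:
        cmd.append("--mix_accompaniment")
    return cmd

cmd = build_infer_command(
    ref_audio="examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    melody_audio="examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    ref_text="can you tell my heart is speaking|my eyes will give you clues",
    target_text="can you spot the moon is grinning|my lips will show you hints",
    output="output/lyric_edit_en_moon_grinning.wav",
    separate_vocals=True,
    mix_accompaniment=True,
)

# Print a shell-quoted version for inspection.
print(shlex.join(cmd))
```

To execute the command, pass the list to subprocess.run(cmd, check=True) from the repository root. Passing a list avoids shell-quoting problems with the "|" separators and spaces in the lyric strings.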

Option 4: Batch Inference

Note: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using src/third_party/MusicSourceSeparationTraining/inference_api.py.

The input JSONL file must contain one JSON object per line. The record below is expanded across several lines for readability only; in the actual file it must be written on a single line:

{
    "id": "lyric_edit_en_moon_grinning",
    "melody_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    "gen_text": "can you spot the moon is grinning|my lips will show you hints",
    "timbre_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    "timbre_ref_text": "can you tell my heart is speaking|my eyes will give you clues"
}

Then run batch inference:

python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
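
Because each record has to occupy exactly one line, it is usually safer to generate the JSONL file with a script than to edit it by hand. The following sketch writes records using the five fields from the example above. The helper write_jsonl, the output file name input.jsonl, and the check that all five fields are present are assumptions made for this example; the batch script may accept records with fewer fields.

```python
import json
import os
import tempfile

# Assumption: every record carries the same five fields as the README example.
EXPECTED_KEYS = ("id", "melody_ref_path", "gen_text", "timbre_ref_path", "timbre_ref_text")

def write_jsonl(records, path):
    """Write one compact JSON object per line, keeping non-ASCII lyrics readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            missing = [k for k in EXPECTED_KEYS if k not in rec]
            if missing:
                raise KeyError(f"record {rec.get('id')!r} is missing {missing}")
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [
    {
        "id": "lyric_edit_en_moon_grinning",
        "melody_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
        "gen_text": "can you spot the moon is grinning|my lips will show you hints",
        "timbre_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
        "timbre_ref_text": "can you tell my heart is speaking|my eyes will give you clues",
    }
]

out_path = os.path.join(tempfile.gettempdir(), "input.jsonl")
write_jsonl(records, out_path)

# Read the file back to confirm there is one record per line.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines), "record(s) written to", out_path)
```

The resulting file can then be passed to the batch script through --input_path.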

Multi-process inference on LyricEditBench (melody control); the test set is downloaded automatically:

python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8

Multi-process inference on LyricEditBench (singing edit):

python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8

๐Ÿ—๏ธ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
| --- | --- |
| VAE | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| Melody Extractor | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| IPA Tokenizer | Converts Chinese and English lyrics into a unified phoneme sequence with sentence-level alignment |
| DiT-based CFM | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

Total parameters: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
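
The 2048× downsampling factor fixes the temporal resolution of the latent sequence that the CFM backbone operates on. As a back-of-the-envelope check, the sketch below derives the latent frame rate, estimates the number of latent frames for a clip of a given duration, and re-adds the parameter counts listed above. Rounding up for a partial final frame is an assumption; the actual padding behavior of the VAE is not described here.

```python
import math

SAMPLE_RATE_HZ = 44_100      # output sample rate stated in this README
DOWNSAMPLE_FACTOR = 2048     # temporal downsampling of the VAE

def latent_frame_rate():
    """Latent frames per second of audio."""
    return SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR

def latent_frames(duration_s):
    """Estimated latent sequence length for a clip of duration_s seconds."""
    # Assumption: a partial trailing frame is padded, so round up.
    return math.ceil(duration_s * SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR)

# Parameter counts in millions, as listed above.
params_m = {"CFM": 453.6, "VAE": 156.1, "Melody Extractor": 117.6}
total_m = round(sum(params_m.values()), 1)

print(f"{latent_frame_rate():.2f} latent frames per second")
print(latent_frames(10.0), "latent frames for a 10 s clip")
print(total_m, "M parameters in total")
```

At roughly 21.5 latent frames per second, a 10-second clip maps to a sequence of only a couple of hundred frames, which keeps the sequence length short for the DiT backbone.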

📊 LyricEditBench

We introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation, built on GTSinger. The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

Results

Table 2: Comparison with the baseline model on LyricEditBench across task types (Table 1 of the paper) and languages. Metrics (M) are P: PER, S: SIM, F: F0-CORR, and V: VS, as detailed in Section 3 of the paper. Best results are in bold.

LyricEditBench Results
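
For readers unfamiliar with the metric abbreviations, the sketch below shows one common way to compute two of them, assuming PER denotes phoneme error rate and F0-CORR denotes a correlation between F0 contours. PER is computed as the phoneme-level edit distance divided by the reference length, and F0-CORR as the Pearson correlation between two equal-length contours. This is an illustration under those assumptions, not the benchmark's evaluation script; the phoneme recognizer, F0 extractor, and treatment of unvoiced frames behind the reported numbers are not described in this README, and the toy sequences are made up.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)

def pearson_corr(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)

# Toy phoneme sequences (hypothetical labels): one substitution out of five.
ref_phonemes = ["m", "ih", "s", "ih", "ng"]
hyp_phonemes = ["m", "ih", "s", "ah", "ng"]
per = phoneme_error_rate(ref_phonemes, hyp_phonemes)

# Toy F0 contours in Hz: the second is the first shifted up by one octave.
f0_source = [220.0, 233.1, 246.9, 261.6, 246.9]
f0_generated = [2.0 * f for f in f0_source]
f0_corr = pearson_corr(f0_source, f0_generated)

print("PER:", per)
print("F0 correlation:", round(f0_corr, 6))
```

Lower PER indicates that the edited lyrics are more intelligible, while a higher F0 correlation indicates that the generated contour follows the source melody more closely.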

๐Ÿ™ Acknowledgements

This work builds upon the following open-source projects:

📄 License

The code and model weights in this project are licensed under CC BY 4.0, except for the following:

The VAE model weights and inference code (in src/YingMusic-Singer/utils/stable-audio-tools) are derived from Stable Audio Open by Stability AI, and are licensed under the Stability AI Community License.

โœ‰๏ธ Contact Us

If you are interested in leaving a message to our work, feel free to email cbhao@mail.nwpu.edu.cn or lxie@nwpu.edu.cn

Youโ€™re welcome to join our WeChat group for technical discussions, updates.


WeChat Group QR Code WeChat Group QR Code
