April 12, 2026
YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
Chunbo Hao1,2 · Junjie Zheng2 · Guobin Ma1 · Yuepeng Jiang1 · Huakang Chen1 · Wenjie Tian1 · Gongyu Chen2 · Zihao Chen2 · Lei Xie1
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
2 AI Lab, GiantNetwork, China
Demo Video
Demo of YingMusic-Singer-Plus, with narration voiceover provided by VoiceSculptor. Click a badge below to watch the demo video:
| YouTube | Bilibili |
|---|---|
![]() | ![]() |
Introduction
YingMusic-Singer-Plus is a fully diffusion-based singing voice synthesis model that enables melody-controllable singing voice editing with flexible lyric manipulation, requiring no manual alignment or precise phoneme annotation.
Given only three inputs (an optional timbre reference, a melody-providing singing clip, and modified lyrics), YingMusic-Singer-Plus synthesizes high-fidelity singing voices at 44.1 kHz while faithfully preserving the original melody.
Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.
Key Features
- Annotation-free: No manual lyric-MIDI alignment required at inference
- Flexible lyric manipulation: Supports six editing types (partial changes, full changes, insertion, deletion, CN↔EN translation, and code-switching)
- Strong melody preservation: CKA-based melody alignment loss + GRPO-based optimization
- Bilingual: Unified IPA tokenizer for both Chinese and English
- High fidelity: 44.1 kHz stereo output via Stable Audio 2 VAE
Quick Start
Option 1: Install from Scratch
# We strongly recommend uv for faster dependency resolution.
uv venv --python 3.10
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
# If you are in CN, use the USTC mirror for faster downloads:
uv pip install -r requirements.txt -i https://mirrors.ustc.edu.cn/pypi/simple
# Alternatively, conda is also supported:
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
pip install uv
uv pip install -r requirements.txt
# If you are in CN:
uv pip install -r requirements.txt -i https://mirrors.ustc.edu.cn/pypi/simple
Option 2: Pre-built Environment
Conda
- Download and install Miniconda from https://repo.anaconda.com/miniconda/ for your platform. Verify with conda --version.
- Download the pre-built environment package for your setup from the table below.
- Navigate to your Conda envs/ directory and create a folder named YingMusic-Singer-Plus.
- Move the downloaded package into that folder and extract it:
tar -xvf <package_name>
uv
- Install uv via pip install uv, or follow the official instructions.
- Download the pre-built environment package for your setup from the table below.
- Extract the package and activate the environment:
tar -xvf <package_name>
source .venv/bin/activate # On Windows: .venv\Scripts\activate
| CPU Architecture | GPU | OS | Type | Download |
|---|---|---|---|---|
| AMD64 | NVIDIA | Linux | Conda | Coming soon |
| AMD64 | NVIDIA | Linux | uv | Coming soon |
| AMD64 | NVIDIA | Windows | uv | Coming soon |
Option 3: Docker
Build the image (Docker requires image names to be lowercase):
docker build -t yingmusic-singer-plus .
Inference
Option 1: Online Demo (HuggingFace Space)
Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.
Option 2: Local Gradio App (same as online demo)
python app_local.py
Option 3: Command-line Inference
python infer.py \
--ref_audio examples/hf_space/melody_control/melody_control_ZH_02_timbre.wav \
--melody_audio examples/hf_space/melody_control/melody_control_ZH_02_melody.wav \
--ref_text "就让你|在别人怀里|快乐" \
--target_text "Missing you in my mind|missing you in my heart" \
--output output/melody_control_zh_missing_you.wav
Enable vocal separation and accompaniment mixing:
python infer.py \
--ref_audio examples/hf_space/lyric_edit/SingEdit_EN_01.wav \
--melody_audio examples/hf_space/lyric_edit/SingEdit_EN_01.wav \
--ref_text "can you tell my heart is speaking|my eyes will give you clues" \
--target_text "can you spot the moon is grinning|my lips will show you hints" \
--separate_vocals \
--mix_accompaniment \
--output output/lyric_edit_en_moon_grinning.wav
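When scripting many single-file runs, it can be convenient to assemble the command above in Python. The following is a minimal sketch that uses only the flags shown in this README. The helper name `build_infer_command` is hypothetical and not part of the repository, and the sketch assumes `infer.py` is launched from the repository root.

```python
import shlex
from typing import List


def build_infer_command(
    ref_audio: str,
    melody_audio: str,
    ref_text: str,
    target_text: str,
    output: str,
    separate_vocals: bool = False,
    mix_accompaniment: bool = False,
) -> List[str]:
    """Assemble an argv list for infer.py (hypothetical helper).

    Only flags that appear in this README are emitted. Lyrics are passed
    through unchanged, including the "|" separators used in the examples.
    """
    cmd = [
        "python", "infer.py",
        "--ref_audio", ref_audio,
        "--melody_audio", melody_audio,
        "--ref_text", ref_text,
        "--target_text", target_text,
    ]
    # The two switches below are boolean flags in the README example.
    if separate_vocals:
        cmd.append("--separate_vocals")
    if mix_accompaniment:
        cmd.append("--mix_accompaniment")
    cmd += ["--output", output]
    return cmd


example = build_infer_command(
    ref_audio="examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    melody_audio="examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    ref_text="can you tell my heart is speaking|my eyes will give you clues",
    target_text="can you spot the moon is grinning|my lips will show you hints",
    output="output/lyric_edit_en_moon_grinning.wav",
    separate_vocals=True,
    mix_accompaniment=True,
)
# Print a shell-quoted version for inspection; nothing is executed here.
print(shlex.join(example))
```

To actually launch the run, the list can be passed to `subprocess.run(example, check=True)`. Passing a list rather than a single string avoids shell-quoting problems with lyrics that contain spaces or the `|` character.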
Option 4: Batch Inference
Note: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using src/third_party/MusicSourceSeparationTraining/inference_api.py.
The input JSONL file should contain one JSON object per line. The fields are shown below, pretty-printed for readability; in the actual file, each object must sit on a single line:
{
"id": "lyric_edit_en_moon_grinning",
"melody_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
"gen_text": "can you spot the moon is grinning|my lips will show you hints",
"timbre_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
"timbre_ref_text": "can you tell my heart is speaking|my eyes will give you clues"
}
python batch_infer.py \
--input_type jsonl \
--input_path /path/to/input.jsonl \
--output_dir /path/to/output \
--ckpt_path /path/to/ckpts \
--num_gpus 4
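The JSONL input described above can be generated programmatically. The sketch below writes one compact JSON object per line using the five field names from the example. The helper `write_manifest` is hypothetical, and treating all five fields as mandatory is an assumption made for illustration (the introduction describes the timbre reference as optional).

```python
import json
from typing import Dict, Iterable

# Field names taken from the example object in this README.
# Assumption: every entry carries all five fields.
EXPECTED_KEYS = (
    "id",
    "melody_ref_path",
    "gen_text",
    "timbre_ref_path",
    "timbre_ref_text",
)


def write_manifest(entries: Iterable[Dict[str, str]], path: str) -> int:
    """Write entries as JSONL (one object per line); return the line count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            missing = [k for k in EXPECTED_KEYS if k not in entry]
            if missing:
                raise ValueError(
                    f"entry {entry.get('id', '?')!r} is missing keys: {missing}"
                )
            # ensure_ascii=False keeps Chinese lyrics readable in the file;
            # json.dumps without indent guarantees a single line per object.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count


example_entry = {
    "id": "lyric_edit_en_moon_grinning",
    "melody_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    "gen_text": "can you spot the moon is grinning|my lips will show you hints",
    "timbre_ref_path": "examples/hf_space/lyric_edit/SingEdit_EN_01.wav",
    "timbre_ref_text": "can you tell my heart is speaking|my eyes will give you clues",
}

# Example (writes input.jsonl in the current directory):
# write_manifest([example_entry], "input.jsonl")
```

The resulting file can then be passed to the batch command through `--input_path`.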
Multi-process inference on LyricEditBench (melody control); the test set is downloaded automatically:
python inference_mp.py \
--input_type lyric_edit_bench_melody_control \
--output_dir path/to/LyricEditBench_melody_control \
--ckpt_path ASLP-lab/YingMusic-Singer-Plus \
--num_gpus 8
Multi-process inference on LyricEditBench (singing edit):
python inference_mp.py \
--input_type lyric_edit_bench_sing_edit \
--output_dir path/to/LyricEditBench_sing_edit \
--ckpt_path ASLP-lab/YingMusic-Singer-Plus \
--num_gpus 8
Model Architecture
YingMusic-Singer-Plus consists of four core components:
| Component | Description |
|---|---|
| VAE | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| Melody Extractor | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| IPA Tokenizer | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| DiT-based CFM | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |
Total parameters: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
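As a quick sanity check on the figures above, the 2048× downsampling of 44.1 kHz audio implies a latent rate of roughly 21.5 frames per second. The sketch below works through that arithmetic and re-adds the parameter counts. All function names are hypothetical, and rounding the frame count up is an assumption, since the VAE's actual padding behaviour is not described here.

```python
import math

SAMPLE_RATE_HZ = 44_100    # output sample rate stated in this README
DOWNSAMPLE_FACTOR = 2048   # VAE temporal downsampling from the table above
PARAMS_M = {"CFM": 453.6, "VAE": 156.1, "Melody Extractor": 117.6}


def latent_frame_rate(sample_rate: int = SAMPLE_RATE_HZ,
                      factor: int = DOWNSAMPLE_FACTOR) -> float:
    """Latent frames per second of audio."""
    return sample_rate / factor


def latent_frames(duration_s: float,
                  sample_rate: int = SAMPLE_RATE_HZ,
                  factor: int = DOWNSAMPLE_FACTOR) -> int:
    """Approximate latent sequence length for a clip.

    Assumption: a partial final frame is padded, hence the round-up.
    """
    return math.ceil(duration_s * sample_rate / factor)


def total_params_m(parts=PARAMS_M) -> float:
    """Sum of component parameter counts, in millions."""
    return round(sum(parts.values()), 1)


print(f"{latent_frame_rate():.2f} latent frames per second")  # 21.53
print(latent_frames(10.0))   # 216 for a 10-second clip under the round-up assumption
print(total_params_m())      # 727.3
```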
LyricEditBench
We introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation, built on GTSinger. The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.
Results
Table 2: Comparison with the baseline model on LyricEditBench across task types (listed in Table 1) and languages. Metrics (M) are P: PER, S: SIM, F: F0-CORR, and V: VS, detailed in Section 3. Best results are in bold.
Acknowledgements
This work builds upon the following open-source projects:
- F5-TTS: DiT-based CFM backbone
- Stable Audio 2: VAE architecture
- SOME: Melody Extractor
- DiffRhythm: Sentence-level alignment strategy
- GTSinger: Benchmark base corpus
- Emilia: TTS pretraining data
License
The code and model weights in this project are licensed under CC BY 4.0, except for the following:
The VAE model weights and inference code (in src/YingMusic-Singer/utils/stable-audio-tools) are derived from Stable Audio Open by Stability AI, and are licensed under the Stability AI Community License.
Contact Us
If you would like to leave a message about our work, feel free to email cbhao@mail.nwpu.edu.cn or lxie@nwpu.edu.cn.
You're welcome to join our WeChat group for technical discussions and updates.