ComfyUI_RH_VoxCPM

April 20, 2026

ComfyUI custom nodes for VoxCPM — Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning.

Run this node online: RunningHub (CN) | RunningHub (Global)

GitHub Repository: HM-RunningHub/ComfyUI_RH_VoxCPM

✨ Features

  • Voice Design: Create unique voices from text descriptions (gender, age, tone, emotion, pace)
  • Controllable Cloning: Clone a voice with optional style guidance via reference audio
  • Ultimate Cloning: Reproduce every vocal nuance through audio continuation (VoxCPM2 only)
  • LoRA Fine-tuning: Load custom LoRA weights for personalized voice generation
  • LoRA / Full Training: Train VoxCPM LoRA (or full fine-tune) directly from a ComfyUI workflow, reusing the upstream training loop
  • Auto ASR: Automatically recognize reference audio text via FunASR SenseVoiceSmall when reference_audio_text is empty
  • Reference Denoising: Optional ZipEnhancer denoising for reference audio before cloning

🛠️ Installation

Method 1: Clone from GitHub

cd ComfyUI/custom_nodes
git clone https://github.com/HM-RunningHub/ComfyUI_RH_VoxCPM.git
cd ComfyUI_RH_VoxCPM
pip install -r requirements.txt

Method 2: ComfyUI Manager

Search for ComfyUI_RH_VoxCPM in ComfyUI Manager and install.

📦 Model Download & Installation

VoxCPM Models (required, pick one)

| Model | Params | Size | Recommended |
| --- | --- | --- | --- |
| VoxCPM2 | 2B | ~4.6 GB | ✅ Best quality |
| VoxCPM1.5 | 800M | ~1.9 GB | Good balance |
| VoxCPM-0.5B | 640M | ~1.5 GB | Lightweight |

Method 1: Download from Hugging Face

huggingface-cli download openbmb/VoxCPM2 --local-dir ComfyUI/models/voxcpm/VoxCPM2

Method 2: Download from ModelScope (for users in China)

pip install modelscope
modelscope download --model openbmb/VoxCPM2 --local_dir ComfyUI/models/voxcpm/VoxCPM2

Model Directory Structure

ComfyUI/
└── models/
    └── voxcpm/
        ├── VoxCPM2/                # Main model (required)
        │   ├── config.json
        │   ├── model.safetensors
        │   ├── audiovae.pth
        │   ├── tokenizer.json
        │   ├── tokenizer_config.json
        │   └── special_tokens_map.json
        ├── loras/                  # LoRA weights (optional)
        │   └── my_custom_voice.pth
        └── speech_zipenhancer_ans_multiloss_16k_base/  # Denoiser (optional)
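
After downloading, you can sanity-check the layout before loading the model in ComfyUI. The required file names below come from the tree above; the `COMFYUI_ROOT` path and the `check_model_dir` helper are illustrative, not part of the plugin:

```python
# Quick sanity check for the expected VoxCPM model layout.
# File names are taken from the directory tree above; adjust COMFYUI_ROOT
# to point at your ComfyUI install.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")  # assumption: relative to your working directory

REQUIRED = [
    "config.json",
    "model.safetensors",
    "audiovae.pth",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]

def check_model_dir(model_name: str = "VoxCPM2") -> list[str]:
    """Return the required files missing from models/voxcpm/<model_name>."""
    model_dir = COMFYUI_ROOT / "models" / "voxcpm" / model_name
    return [f for f in REQUIRED if not (model_dir / f).is_file()]

missing = check_model_dir()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Model directory looks complete.")
```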

SenseVoiceSmall (required for auto ASR)

# From ModelScope
modelscope download --model iic/SenseVoiceSmall --local_dir ComfyUI/models/SenseVoice/SenseVoiceSmall

ZipEnhancer (optional, for reference audio denoising)

# From ModelScope
modelscope download --model iic/speech_zipenhancer_ans_multiloss_16k_base --local_dir ComfyUI/models/voxcpm/speech_zipenhancer_ans_multiloss_16k_base

🚀 Usage

Example Workflows

Download example workflows from the examples/ directory and import into ComfyUI:

  1. Basic Workflow — Single-speaker speech generation with voice design / cloning
  2. Multi-Speaker Workflow — Dialogue generation with a fixed set of 5 speakers and per-speaker voice control
  3. LoRA Training Workflow — Build a tiny dataset from two audio clips and run a LoRA fine-tune

Notes:

  • RunningHub VoxCPM Multi-Speaker is the fixed 5-speaker version
  • RunningHub VoxCPM Multi-Speaker (Dynamic Audio) uses the same script format but grows reference-audio inputs automatically
  • If the dynamic inputs do not appear after updating the plugin, refresh the ComfyUI frontend page or reopen the workflow

Three Modes

  • Voice Design: Fill control_instruction (e.g. "A warm young woman"), leave reference_audio empty. The model creates a brand-new voice from your description alone.
  • Controllable Cloning: Upload reference_audio, keep ultimate_clone OFF. Use control_instruction to steer emotion, pace, and style while preserving the reference timbre.
  • Ultimate Cloning: Upload reference_audio, turn ultimate_clone ON. The model treats the reference as a spoken prefix and continues from it, faithfully reproducing every vocal detail. control_instruction is ignored in this mode. If reference_audio_text is empty, ASR will auto-recognize it.
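
The mode is determined purely by which inputs you fill in. A minimal sketch of that selection logic, assuming hypothetical names (`select_mode` is illustrative, not the node's actual code):

```python
# Illustrative sketch of how the three modes follow from the node inputs.
# Function and parameter names mirror the docs but are hypothetical.
from typing import Optional

def select_mode(control_instruction: str = "",
                reference_audio: Optional[object] = None,
                ultimate_clone: bool = False) -> str:
    if reference_audio is None:
        # No reference audio: a brand-new voice is created from the description.
        return "voice_design"
    if ultimate_clone:
        # Reference is treated as a spoken prefix; control_instruction is ignored.
        return "ultimate_cloning"
    # Reference timbre is preserved; control_instruction steers style.
    return "controllable_cloning"

print(select_mode(control_instruction="A warm young woman"))  # voice_design
```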

📝 Node Reference

RunningHub VoxCPM Load Model

Load VoxCPM/VoxCPM2 model from local directory with optional LoRA weights.

| Input | Type | Description |
| --- | --- | --- |
| model_name | COMBO | Model directory under models/voxcpm/ |
| optimize | BOOLEAN | Enable torch.compile optimization (default: off) |
| lora_name | COMBO | LoRA weights under models/voxcpm/loras/ (optional, default: None) |

RunningHub VoxCPM Generate Speech

Generate speech with voice design, controllable cloning, or ultimate cloning.

| Input | Type | Description |
| --- | --- | --- |
| model | VOXCPM_MODEL | Model from Load Model node |
| text | STRING | Target text to synthesize |
| cfg_value | FLOAT | Guidance scale (default: 2.0) |
| inference_steps | INT | LocDiT flow-matching steps (default: 10) |
| seed | INT | Random seed for reproducibility |
| control_instruction | STRING | Voice description for voice design mode (optional) |
| reference_audio | AUDIO | Reference audio for cloning (optional) |
| ultimate_clone | BOOLEAN | Enable ultimate cloning mode (default: off) |
| reference_audio_text | STRING | Transcript of reference audio; auto ASR if empty (optional) |
| normalize_text | BOOLEAN | Text normalization (default: off) |
| denoise_reference | BOOLEAN | Denoise reference audio via ZipEnhancer (default: off) |
| max_len | INT | Maximum token length during generation (default: 4096) |
| retry_badcase | BOOLEAN | Auto-retry when output quality is poor (default: on) |

RunningHub VoxCPM Multi-Speaker

Generate multi-speaker dialogue from a tagged script. Supports up to 5 speakers with individual voice control.

| Input | Type | Description |
| --- | --- | --- |
| model | VOXCPM_MODEL | Model from Load Model node |
| script | STRING | Tagged script, e.g. [spk1]Hello[spk2]Hi there |
| cfg_value | FLOAT | Guidance scale (default: 2.0) |
| inference_steps | INT | LocDiT flow-matching steps (default: 10) |
| seed | INT | Random seed for reproducibility |
| audio_1 ~ audio_5 | AUDIO | Reference audio for each speaker (optional) |
| control_1 ~ control_5 | STRING | Voice description for each speaker (optional) |
| normalize_text | BOOLEAN | Text normalization (default: off) |
| denoise_reference | BOOLEAN | Denoise reference audio via ZipEnhancer (default: off) |
| max_len | INT | Maximum token length during generation (default: 4096) |
| retry_badcase | BOOLEAN | Auto-retry when output quality is poor (default: on) |

RunningHub VoxCPM Multi-Speaker (Dynamic Audio)

Generates multi-speaker dialogue with a dynamic number of reference-audio inputs. The script still uses [spk1]...[spk2]... tags, and per-speaker control instructions are merged into a single multiline input using the same tag format. The node shows 2 reference-audio inputs by default and automatically adds the next one once all current inputs are connected, with no fixed upper limit. At execution time, audio_1 maps to spk1, audio_2 to spk2, and so on, so tags like spk10 and spk20 work as well.

Usage tips:

  • You need to connect all currently visible audio_* inputs before the next one is added
  • This auto-growth behavior depends on the frontend extension script; if it does not update after installing a new version, refresh the page
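
The [spkN] tag format and the slot-order mapping described above can be sketched as follows; the tag syntax is from the docs, but this parser itself is illustrative, not the node's internal implementation:

```python
# Illustrative parser for the [spkN] tagged-script format, showing how
# each segment maps to a speaker slot (spkN -> audio_N by slot order).
import re

def parse_script(script: str) -> list[tuple[int, str]]:
    """Split '[spk1]Hello[spk2]Hi' into [(1, 'Hello'), (2, 'Hi')]."""
    segments = []
    for match in re.finditer(r"\[spk(\d+)\](.*?)(?=\[spk\d+\]|$)", script, re.S):
        speaker = int(match.group(1))  # multi-digit tags like spk10 work too
        text = match.group(2).strip()
        if text:
            segments.append((speaker, text))
    return segments

print(parse_script("[spk1]Hello[spk2]Hi there[spk10]Morning"))
# [(1, 'Hello'), (2, 'Hi there'), (10, 'Morning')]
```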

| Input | Type | Description |
| --- | --- | --- |
| model | VOXCPM_MODEL | Model from Load Model node |
| script | STRING | Tagged script, e.g. [spk1]Hello[spk2]Hi there |
| speaker_controls | STRING | Multiline tagged controls, e.g. [spk1]Sichuan accent\n[spk2]Adult female, northeastern accent |
| cfg_value | FLOAT | Guidance scale (default: 2.0) |
| inference_steps | INT | LocDiT flow-matching steps (default: 10) |
| seed | INT | Random seed for reproducibility |
| audio_1 ~ audio_N | AUDIO | Dynamic reference-audio inputs mapped to spk1 ~ spkN by slot order; starts with 2, auto-grows when filled, no fixed upper limit |
| normalize_text | BOOLEAN | Text normalization (default: off) |
| denoise_reference | BOOLEAN | Denoise reference audio via ZipEnhancer (default: off) |
| max_len | INT | Maximum token length during generation (default: 4096) |
| retry_badcase | BOOLEAN | Auto-retry when output quality is poor (default: on) |

🎓 Training Nodes (LoRA / Full Fine-tuning)

⚠️ The training nodes rely on the upstream training modules (voxcpm.training.*). They pull in transformers / datasets / safetensors / argbind via requirements.txt and require a full VoxCPM source tree: either install the full voxcpm package, or place a checkout next to this plugin (e.g. ComfyUI/custom_nodes/VoxCPM/src/voxcpm/training/) or inside <plugin>/voxcpm/src/.

Typical workflow:

  1. Dataset Entry wraps a single (audio, text) pair into a training sample.
  2. Dataset Build aggregates samples into a train.jsonl manifest (an existing jsonl path also works).
  3. Train LoRA / Train Full runs the training loop. Artifacts are written to ComfyUI/output/voxcpm_train/<name>_<timestamp>/; with copy_to_loras_dir enabled LoRA weights are also copied to ComfyUI/models/voxcpm/loras/ so the Load Model node picks them up after a frontend refresh.
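
Since step 3 also accepts an existing jsonl path, you can assemble a train.jsonl by hand. A minimal sketch, with the caveat that the field names ("audio", "text") are assumptions about the manifest schema, not confirmed upstream keys; compare against a manifest actually produced by Dataset Build before relying on it:

```python
# Illustrative sketch of writing a train.jsonl manifest by hand.
# WARNING: the keys "audio" and "text" are assumed, not verified against
# the upstream voxcpm.training schema.
import json
from pathlib import Path

def write_manifest(pairs: list[tuple[str, str]], out_path: str) -> int:
    """Write one JSON object per line; return the number of samples."""
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as f:
        for audio_path, transcript in pairs:
            f.write(json.dumps({"audio": audio_path, "text": transcript},
                               ensure_ascii=False) + "\n")
    return len(pairs)

n = write_manifest([("clips/a.wav", "Hello there."),
                    ("clips/b.wav", "Second sample.")], "train.jsonl")
print(f"wrote {n} samples")
```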

RunningHub VoxCPM Dataset Entry

| Input | Type | Description |
| --- | --- | --- |
| audio | AUDIO | Training clip |
| text | STRING | Optional transcript for the clip; if left blank, FunASR SenseVoiceSmall auto-transcribes the audio |
| dataset_id | INT | Optional dataset id for multi-dataset training (default: 0) |
| ref_audio | AUDIO | Optional voice-style reference audio. When provided, it is written to the manifest as ref_audio and used by the training pipeline for voice conditioning (requires voxcpm built after 2026-04) |

Returns entry (feed into Dataset Build) and text (the transcript actually used, handy for preview/reuse). Auto-ASR requires the SenseVoiceSmall model under models/SenseVoice/SenseVoiceSmall.

RunningHub VoxCPM Dataset Build

| Input | Type | Description |
| --- | --- | --- |
| entry_1, entry_2 | VOXCPM_DATA_ENTRY | At least two samples |
| entry_3 ~ entry_8 | VOXCPM_DATA_ENTRY | Additional samples (optional) |
| extra_manifest | STRING | Path to an existing jsonl to append (optional) |
| sample_rate | INT | Sample rate to save WAVs at; match the base model AudioVAE (default: 16000) |
| dataset_name | STRING | Output directory prefix |

Outputs manifest_path (path to train.jsonl) and num_samples.

RunningHub VoxCPM Train LoRA

| Input | Type | Description |
| --- | --- | --- |
| model_name | COMBO | Base model directory under models/voxcpm/ |
| train_manifest | STRING | Training manifest (jsonl) path (use Dataset Build output) |
| output_name | STRING | Output name prefix (the final folder is suffixed with a timestamp) |
| num_iters | INT | Total training steps (default: 500) |
| batch_size | INT | Per-step batch size (default: 1) |
| grad_accum_steps | INT | Gradient accumulation steps (default: 1) |
| learning_rate | FLOAT | Learning rate (default: 1e-4) |
| lora_rank | INT | LoRA rank (default: 32) |
| lora_alpha | INT | LoRA alpha (default: 32) |
| val_manifest | STRING | Optional validation manifest |
| warmup_steps | INT | Warmup steps (default: 100) |
| weight_decay | FLOAT | Weight decay (default: 0.01) |
| max_grad_norm | FLOAT | Gradient clipping; 0 = disabled (default: 1.0) |
| num_workers | INT | Data loader workers (default: 2) |
| log_interval | INT | Log interval in steps (default: 10) |
| save_interval | INT | Checkpoint interval; 0 = save only at the end (default: 0) |
| lora_dropout | FLOAT | LoRA dropout (default: 0.0) |
| enable_lm | BOOLEAN | Apply LoRA to the LM (default: on) |
| enable_dit | BOOLEAN | Apply LoRA to the DiT (default: on) |
| enable_proj | BOOLEAN | Apply LoRA to projection layers (default: off) |
| copy_to_loras_dir | BOOLEAN | Copy final LoRA to models/voxcpm/loras/ (default: on) |

Outputs lora_path (folder containing lora_weights.safetensors + lora_config.json) and info (summary string).

RunningHub VoxCPM Train Full

Mirrors the LoRA node without LoRA-specific inputs. ⚠️ Full fine-tuning is memory-heavy; prefer the LoRA node for voice adaptation.

📄 License

This project is licensed under the Apache License 2.0.

🙏 Acknowledgements

This project is based on VoxCPM, developed by OpenBMB / ModelBest.