ComfyUI VoxCPMTTS Node

December 11, 2025 · View on GitHub

A clean, efficient ComfyUI custom node for VoxCPM text-to-speech (TTS). This implementation provides high-quality speech generation and voice cloning using the VoxCPM1.5 model.

VoxCPMTTS_v2.0.0

Features

  • 🎯 High-Quality TTS: Generate natural-sounding speech from text
  • 🎭 Voice Cloning: Clone any voice using a reference audio sample
  • 🔄 Auto-Transcription: Automatic speech recognition for reference audio
  • ⚡ Multi-Device Support: CUDA, MPS, and CPU compatibility
  • 🎛️ Fine-Tuned Control: Adjustable guidance scale, inference steps, and more
  • 🔊 Audio Post-Processing: Built-in fade-in to reduce artifacts

Installation

Method 1: ComfyUI Manager (Recommended)

  1. Open ComfyUI Manager
  2. Search for "VoxCPMTTS"
  3. Install the node

Method 2: Manual Installation

  1. Navigate to your ComfyUI custom nodes directory:
cd ComfyUI/custom_nodes/
  2. Clone this repository:
git clone https://github.com/1038lab/ComfyUI-VoxCPMTTS.git
  3. Install dependencies:
cd ComfyUI-VoxCPMTTS
pip install -r requirements.txt
  4. Restart ComfyUI

Dependencies

The node will automatically install required dependencies on first use:

  • huggingface_hub>=0.20.0
  • einops>=0.6.0
  • pydantic>=2.0.0
  • wetext>=0.1.0
  • faster-whisper

Model Download

VoxCPM1.5 (the default; the older VoxCPM-0.5B is no longer used) will be automatically downloaded to ComfyUI/models/TTS/VoxCPM1.5/ on first use: https://huggingface.co/openbmb/VoxCPM1.5

Usage

Text-to-Speech

  1. Add the VoxCPMTTS node to your workflow
  2. Input your text in the text field
  3. Adjust parameters as needed:
    • cfg_value: Controls adherence to prompt (1.0-10.0, default: 2.0)
    • inference_steps: Quality vs speed tradeoff (1-100, default: 10)
    • max_length: Maximum token length (256-8192, default: 4096)
  4. Connect the audio output to your desired destination

Voice Cloning

  1. Connect a reference audio to the reference_audio input
  2. Optionally provide reference_text (transcript of the reference audio)
    • If left empty, the node will automatically transcribe the audio
  3. The generated speech will mimic the reference voice characteristics

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| text | STRING | "Hello, this is VoxCPMTTS." | Text to synthesize |
| cfg_value | FLOAT | 2.0 | Guidance scale (higher = more prompt adherence) |
| inference_steps | INT | 10 | Diffusion steps (higher = better quality) |
| max_length | INT | 4096 | Maximum token length |
| normalize | BOOLEAN | True | Enable text normalization |
| seed | INT | -1 | Random seed (-1 for random) |
| device | COMBO | auto | Device selection (auto/cuda/mps/cpu) |
| reference_audio | AUDIO | - | Reference audio for voice cloning |
| reference_text | STRING | "" | Reference audio transcript |
| fade_in_ms | INT | 20 | Fade-in duration (0-1000 ms) |
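
The fade_in_ms post-processing amounts to a linear gain ramp over the first few milliseconds of the output. A minimal NumPy sketch of the idea (apply_fade_in is a hypothetical helper for illustration, not the node's actual code):

```python
import numpy as np

def apply_fade_in(audio: np.ndarray, sample_rate: int = 16000, fade_in_ms: int = 20) -> np.ndarray:
    """Apply a linear fade-in over the first fade_in_ms milliseconds."""
    n = min(len(audio), int(sample_rate * fade_in_ms / 1000))
    out = audio.copy()
    if n > 0:
        # Ramp gain from 0.0 to 1.0 across the fade window.
        out[:n] *= np.linspace(0.0, 1.0, n)
    return out
```

At the node's 16 kHz output rate, the default 20 ms fade covers the first 320 samples.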

Outputs

  • REFERENCE_TEXT: Transcribed or provided reference text
  • AUDIO: Generated speech audio with 16kHz sample rate

Environment Variables

Set these environment variables to customize behavior:

# ASR model size (tiny, small, medium, large)
export VOXCPM_ASR_MODEL=small

# Maximum retry attempts for bad cases
export VOXCPM_RETRY_MAX=2
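
A sketch of how such variables are typically consumed, with the documented defaults as fallbacks (the node's actual parsing may differ):

```python
import os

# Fall back to the documented defaults when the variables are unset.
asr_model = os.environ.get("VOXCPM_ASR_MODEL", "small")
retry_max = int(os.environ.get("VOXCPM_RETRY_MAX", "2"))
```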

Device Selection

  • auto: Automatically selects the best available device
  • cuda: Force CUDA if available
  • mps: Force MPS (Apple Silicon) if available
  • cpu: Force CPU processing
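
The fallback order can be sketched as follows (pick_device is a hypothetical helper illustrating the selection logic, not the node's actual code; availability flags stand in for the usual backend checks):

```python
def pick_device(preference: str = "auto", cuda_ok: bool = False, mps_ok: bool = False) -> str:
    """Resolve auto/cuda/mps/cpu, falling back to CPU when a backend is unavailable."""
    if preference == "auto":
        # Prefer CUDA, then MPS, then CPU.
        return "cuda" if cuda_ok else ("mps" if mps_ok else "cpu")
    if preference == "cuda" and cuda_ok:
        return "cuda"
    if preference == "mps" and mps_ok:
        return "mps"
    return "cpu"
```

Forcing cuda or mps on a machine without that backend quietly degrades to CPU rather than failing.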

Example Workflows

Basic TTS Workflow

[Text Input] → [VoxCPMTTS] → [Audio Output]

Voice Cloning Workflow

[Reference Audio] → [VoxCPMTTS] ← [Target Text]
                         ↓
                   [Cloned Audio]

Batch Processing Workflow

[Text List] → [VoxCPMTTS] → [Audio Batch] → [Save Audio]

Performance Tips

Memory Optimization

  • Use lower inference_steps for faster generation
  • Choose appropriate max_length for your text
  • Use CPU device if GPU memory is limited

Quality Settings

  • Fast: cfg_value=1.5, inference_steps=5
  • Balanced: cfg_value=2.0, inference_steps=10 (default)
  • High Quality: cfg_value=3.0, inference_steps=20

Voice Cloning Tips

  • Use high-quality reference audio (16kHz+)
  • Reference audio should be 3-30 seconds long
  • Clear speech with minimal background noise works best
  • Provide accurate reference text when possible
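
The tips above can be turned into a quick pre-flight check on a reference clip (check_reference is a hypothetical helper based on these guidelines, not part of the node):

```python
def check_reference(num_samples: int, sample_rate: int) -> list:
    """Return a list of issues with a reference clip, per the tips above."""
    issues = []
    if sample_rate < 16000:
        issues.append("sample rate below 16 kHz")
    duration = num_samples / sample_rate
    if not 3.0 <= duration <= 30.0:
        issues.append("duration %.1fs outside the 3-30 s range" % duration)
    return issues
```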

Troubleshooting

Common Issues

Model download fails

  • Check internet connection
  • Ensure sufficient disk space (~1.2GB)
  • Try clearing the download cache

Out of memory errors

  • Reduce max_length
  • Lower inference_steps
  • Switch to CPU device
  • Close other GPU-intensive applications

Poor voice cloning quality

  • Ensure reference audio is clear and high-quality
  • Verify reference text accuracy
  • Try different cfg_value settings
  • Use reference audio from the same speaker

ASR transcription errors

  • Install faster-whisper for better performance
  • Provide manual reference_text instead
  • Use clearer reference audio

Debug Mode

Enable verbose logging by setting:

export COMFYUI_LOG_LEVEL=DEBUG

Model Information

This node uses the VoxCPM1.5 model developed by OpenBMB: https://huggingface.co/openbmb/VoxCPM1.5

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

  • OpenBMB for the VoxCPM model
  • ComfyUI community
  • All contributors and users

Support

If you encounter any issues or have questions:

  • Open an issue on GitHub
  • Check the troubleshooting section above
  • Join the ComfyUI community discussions

Star this repository if you find it useful! ⭐