ComfyUI-AceStep SFT
April 13, 2026 ยท View on GitHub
A modular node suite for ComfyUI that implements AceStep 1.5 SFT (Supervised Fine-Tuning), a state-of-the-art music generation model. It starts from the official AceStep workflow and extends it with stronger conditioning control and practical ComfyUI-oriented quality options.
SFT = Supervised Fine-Tuning: A specialized version of AceStep optimized for generating superior quality audio through supervised training.
๐ Overview
This package provides eight nodes under audio/AceStep SFT:
| Node | Purpose |
|---|---|
| AceStep 1.5 SFT Model Loader | Loads the diffusion model, CLIP text encoders, and VAE |
| AceStep 1.5 SFT Lora Loader | Applies a LoRA to MODEL + CLIP (chainable) |
| AceStep 1.5 SFT TextEncode | Encodes caption, lyrics, and metadata into conditioning |
| AceStep 1.5 SFT Generate | Diffusion sampler + optional VAE decode |
| AceStep 1.5 SFT Preview Audio | Audio playback with waveform spectrum visualizer |
| AceStep 1.5 SFT Save Audio | Save audio (FLAC/MP3/Opus) with waveform visualizer |
| AceStep 1.5 SFT Get Music Infos | AI-powered audio analysis (tags, BPM, key/scale) |
| AceStep 1.5 SFT Turbo Tag Adapter | Rewrites Turbo-oriented tags into SFT-friendly tags (BETA) |
Modular Architecture
The workflow is split into dedicated nodes for maximum flexibility:
Model Loader โ (model, clip, vae)
โ โ โ
โ Lora Loader (optional, chainable)
โ โ โ โ
โ โ TextEncode โ
โ โ โ โ โ
โผ โผ โผ โผ โผ
Generate (model, positive, negative, vae)
โ โ
Preview Audio Save Audio
Example Configuration

๐ฏ Key Features
โจ Advanced Guidance
The node supports three classifier-free guidance modes, each with unique characteristics:
-
APG (Adaptive Projected Guidance) โญ Recommended
- Dynamic adaptation via momentum buffering
- Gradient clipping with adaptive thresholds
- Orthogonal projection to eliminate unwanted noise
- AceStep SFT Default - best quality and stability balance
-
ADG (Angle-based Dynamic Guidance)
- Angle-based guidance between conditions
- Operates in velocity space (flow matching)
- Ideal for aggressive style distortion
-
Standard CFG
- Traditional Classifier-Free Guidance
- Simple and predictable implementation
- Useful as a comparison baseline
๐ต Intelligent Metadata Processing
- Auto-Duration: Automatically estimates music duration by analyzing lyric structure
- LLM Encoding: Use Qwen LLM (0.6B or 1.7B/4B) to generate semantic audio codes
- Auto Values: BPM, Time Signature, and Key/Scale automatic (model decides)
- Multilingual Support: Over 23 languages supported
๐ง AI Music Analyzer
- Audio Tag Extraction: Uses the native ACE-Step Transcriber to extract lyric, vocal, and song-structure tags from audio
- BPM Detection: Automatic tempo detection via librosa
- Key/Scale Detection: Detects musical key and scale (e.g. "G minor")
- JSON Output: Structured
music_infosoutput with all analysis results
๐ Audio Preview & Save with Waveform Visualizer
Both Preview Audio and Save Audio nodes feature:
- Interactive waveform spectrum display directly on the node (dark background with amplitude bars)
- Play/Pause button with click-to-seek on the waveform
- Time display showing current position and total duration
Save Audio additionally supports:
- Multiple formats: FLAC (lossless), MP3, and Opus
- Quality options: V0, 64k, 96k, 128k, 192k, 320k
- Auto-incrementing filenames with configurable prefix
๐ Audio Refinement (img2img)
- Latent-based Refinement: Use
denoise < 1.0withlatent_or_audioconnected to refine existing audio - Accepts AUDIO or LATENT: Connect any audio or latent output for img2img-style editing
- Batch Generation: Generate multiple variations in parallel
๐ง Extended Conditioning Control
- Split Text/Lyric Guidance: Independent
guidance_scale_textandguidance_scale_lyric - Omega Scale: Mean-preserving output reweighting to approximate AceStep scheduler behavior
- ERG Approximation: Node-local prompt energy reweighting via
erg_scale - Guidance Interval Decay: Smoothly decay guidance inside the active interval
๐๏ธ AceStep LoRA Workflow
- Direct LoRA Application: The Lora Loader takes MODEL + CLIP, applies the LoRA via
comfy.sd.load_lora_for_models(), and outputs the modified MODEL + CLIP - Chainable: Stack multiple Lora Loaders in sequence
- Separate strengths: Independent
strength_modelandstrength_clip - DoRA support: Full DoRA (Weight-Decomposed Low-Rank Adaptation) support with automatic
dora_scaledimension fix - Local
Loras/folder: Drop LoRA files directly into the node'sLoras/folder โ they are automatically registered at startup - Auto PEFT/DoRA conversion: PEFT-format LoRAs (
adapter_config.json+adapter_model.safetensors) placed inLoras/are automatically converted to ComfyUI format on first startup
๐ ๏ธ Latent Post-processing
- Latent Shift: Additive anti-clipping correction
- Latent Rescale: Multiplicative scaling for dynamic control
๐ฆ Installation
Prerequisites
- ComfyUI installed and functional
- CUDA/GPU or equivalent (modern processors)
- Recommended for better output quality (based on practical testing): use the merged SFT+Turbo model.
- Required model files:
- Diffusion model (DiT):
acestep_v1.5_sft.safetensors - Text Encoders:
qwen_0.6b_ace15.safetensors,qwen_1.7b_ace15.safetensors(or 4B) - VAE:
ace_1.5_vae.safetensors
- Diffusion model (DiT):
Download Model Files
Download the required models from HuggingFace:
- Diffusion Model (Recommended: merged SFT+Turbo):
-
Alternative Diffusion Model (official SFT):
-
Text Encoders (choose any versions):
- Text Encoders Collection
qwen_0.6b_ace15.safetensors(caption processing)qwen_1.7b_ace15.safetensorsorqwen_4b_ace15.safetensors(audio code generation)
- Text Encoders Collection
-
VAE (Audio codec):
Installation Steps
- Clone the repository to your custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/jeankassio/ComfyUI-AceStep_SFT.git
- Place model files in the appropriate directories:
ComfyUI/models/diffusion_models/ # AceStep 1.5 SFT model
ComfyUI/models/text_encoders/ # Qwen encoders
ComfyUI/models/vae/ # VAE
ComfyUI/models/loras/ # Optional AceStep 1.5 LoRAs
- (Optional) Place LoRAs in the local folder:
ComfyUI/custom_nodes/ComfyUI-AceStep_SFT/Loras/ # Local LoRA folder
You can place LoRAs here in any of these formats:
- ComfyUI format: Single
.safetensorsfile (ready to use) - PEFT/DoRA format: A folder containing
adapter_config.json+adapter_model.safetensors(auto-converted on startup) - Nested zip artifact: If your zip extracted a folder-inside-folder, the node detects this and fixes it automatically
- Restart ComfyUI - the nodes will appear under
audio/AceStep SFT
๐งฉ Available Nodes
AceStep 1.5 SFT Model Loader
Loads the AceStep 1.5 diffusion model, dual CLIP text encoders, and audio VAE.
Inputs:
diffusion_model: AceStep 1.5 diffusion model (.safetensors)text_encoder_1: Qwen3-0.6B encoder (caption processing)text_encoder_2: Qwen3 LLM (1.7B or 4B, audio code generation)vae_name: AceStep 1.5 audio VAE
Outputs:
model: MODEL โ connect to Lora Loader or Generateclip: CLIP โ connect to Lora Loader or TextEncodevae: VAE โ connect to Generate
AceStep 1.5 SFT Lora Loader
Applies a LoRA directly to the MODEL and CLIP. Multiple Lora Loaders can be chained.
Inputs:
model: MODEL from Model Loader or previous Lora Loaderclip: CLIP from Model Loader or previous Lora Loaderlora_name: LoRA file fromComfyUI/models/lorasor the localLoras/folderstrength_model: strength applied to the diffusion modelstrength_clip: strength applied to the text encoder stack
Outputs:
model: MODEL โ connect to next Lora Loader or Generateclip: CLIP โ connect to next Lora Loader or TextEncode
Supported LoRA Formats
| Format | What to place in Loras/ | Action |
|---|---|---|
ComfyUI .safetensors | Single file | Used directly |
| PEFT/DoRA directory | Folder with adapter_config.json + adapter_model.safetensors | Auto-converted to *_comfyui.safetensors on startup |
| Nested zip artifact | Folder containing a .safetensors inside | Auto-extracted to root on startup |
AceStep 1.5 SFT TextEncode
Encodes caption, lyrics, and metadata into positive and negative conditioning for the Generate node.
Inputs:
clip: CLIP from Model Loader or Lora Loadercaption: Text description of the music (genre, mood, instruments)lyrics: Song lyrics or[Instrumental]instrumental: Force instrumental modeseed,duration,bpm,timesignature,language,keyscale- Optional:
generate_audio_codes,lm_cfg_scale,lm_temperature,lm_top_p,lm_top_k,lm_min_p,lm_negative_prompt - Optional style overrides:
style_tags,style_bpm,style_keyscale(from Music Analyzer)
Outputs:
positive: CONDITIONING โ connect to Generatenegative: CONDITIONING โ connect to Generate
AceStep 1.5 SFT Generate
Diffusion sampler + optional VAE decoder. Requires MODEL and conditioning inputs.
Inputs:
model: MODEL from Model Loader or Lora Loaderpositive: CONDITIONING from TextEncodenegative: CONDITIONING from TextEncode- Sampling:
seed,steps,cfg,sampler_name,scheduler,denoise,duration,infer_method,guidance_mode - Optional:
vae(for audio output),latent_or_audio(for img2img),batch_size - Optional post-processing:
latent_shift,latent_rescale,fade_in_duration,fade_out_duration,voice_boost,use_tiled_vae - Optional guidance:
apg_eta,apg_momentum,apg_norm_threshold,guidance_interval,guidance_interval_decay,min_guidance_scale,guidance_scale_text,guidance_scale_lyric,omega_scale,erg_scale,cfg_interval_start,cfg_interval_end,shift
Outputs:
model: MODEL (passthrough for chaining)vae: VAE (passthrough for chaining)positive: CONDITIONING (passthrough)negative: CONDITIONING (passthrough)latent: LATENT (raw diffusion output)audio: AUDIO (decoded audio, only when VAE is connected)
AceStep 1.5 SFT Preview Audio
Previews audio with an interactive waveform spectrum visualizer directly on the node.
Inputs:
audio: AUDIO to preview
Features:
- Interactive waveform display with play/pause button
- Click-to-seek on the waveform
- Current time / total duration display
AceStep 1.5 SFT Save Audio
Saves audio to disk with an interactive waveform spectrum visualizer.
Inputs:
audio: AUDIO to savefilename_prefix: Filename prefix (supports subfolder paths, e.g.audio/AceStep)format: FLAC, MP3, or Opusquality(optional): V0, 64k, 96k, 128k, 192k, 320k (for MP3/Opus)
Features:
- Auto-incrementing filenames (e.g.
AceStep_00001_.flac,AceStep_00002_.flac) - Waveform visualizer with play/pause and seek
- Metadata embedding (prompt, workflow)
AceStep 1.5 SFT Get Music Infos
AI-powered audio analysis node that extracts descriptive tags, BPM, and key/scale from audio input.
Inputs:
audio: Audio input to analyzeget_tags/get_bpm/get_keyscale: Enable/disable each analysismax_new_tokens: Maximum tokens for transcription outputaudio_duration: Max seconds of audio to analyzetemperature,top_p,top_k,repetition_penalty,seed: Generation parametersunload_model: Free VRAM after analysisuse_flash_attn: Enable Flash Attention 2 (if compatible)
Outputs:
tags: Comma-separated descriptive tags (STRING)bpm: Detected BPM (INT)keyscale: Key and scale e.g. "G minor" (STRING)music_infos: JSON with all results (STRING)
AceStep 1.5 SFT Turbo Tag Adapter
Rewrites Turbo-oriented music tags into shorter SFT-friendly prompt tags.
Inputs:
turbo_tags: Turbo-style tags or captionadaptation_strength: conservative / balanced / aggressivekeep_unknown_tags: Keep tags that were not explicitly mappedadd_sft_bias_tags: Add extra SFT-oriented anchor tags
Outputs:
sft_tags: Adapted comma-separated tags (STRING)notes: Conversion notes (STRING)suggested_cfg: Suggested CFG value (FLOAT)suggested_steps: Suggested steps value (INT)
๐๏ธ Node Parameters
Generate - Required Parameters
| Parameter | Range | Description |
|---|---|---|
| model | MODEL | AceStep 1.5 diffusion model from Model Loader or Lora Loader |
| positive | CONDITIONING | Positive conditioning from TextEncode |
| negative | CONDITIONING | Negative conditioning from TextEncode |
| seed | 0 - | Seed for reproducibility |
| steps | 1 - 200 | Diffusion inference steps (default: 50) |
| cfg | 1.0 - 20.0 | Classifier-free guidance scale (default: 7.0) |
| sampler_name | - | Sampler (euler, dpmpp, etc.) |
| scheduler | - | Scheduler (normal, karras, etc.) |
| denoise | 0.0 - 1.0 | Denoising strength (1.0 = fresh, < 1.0 = editing) |
| duration | 0.0 - 600.0 | Duration in seconds (0 = auto) |
| infer_method | ode/sde | ODE = deterministic, SDE = stochastic |
| guidance_mode | apg/adg/standard_cfg | Guidance type (default: apg) |
Generate - Optional Parameters
Batch Generation
- batch_size (1-16): Number of audios to generate in parallel
Audio Input
- vae: VAE from Model Loader (required for audio output)
- latent_or_audio: Base input for refinement (img2img). Accepts AUDIO or LATENT
Latent Post-processing
- latent_shift (-0.2-0.2, default: 0.0): Additive shift (anti-clipping)
- latent_rescale (0.5-1.5, default: 1.0): Multiplicative scaling
- fade_in_duration / fade_out_duration (0.0-10.0, default: 0.0): Optional linear fades
- use_tiled_vae (default: True): Uses tiled VAE for long audio / low VRAM
- voice_boost (-12.0-12.0, default: 0.0): Output gain in dB
APG Configuration
- apg_eta (-10.0-10.0, default: 0.0): Parallel component retention
- apg_momentum (-1.0-1.0, default: -0.75): Momentum buffer coefficient
- apg_norm_threshold (0.0-15.0, default: 2.5): Norm threshold for gradient clipping
Extended Guidance Controls
- guidance_interval (-1.0-1.0, default: 0.5): Centered guidance interval width
- guidance_interval_decay (0.0-1.0, default: 0.0): Linear decay inside interval
- min_guidance_scale (0.0-30.0, default: 3.0): Lower bound with decay
- guidance_scale_text (-1.0-30.0, default: -1.0): Text-only guidance (split)
- guidance_scale_lyric (-1.0-30.0, default: -1.0): Lyric-only guidance (split)
- omega_scale (-8.0-8.0, default: 0.0): Mean-preserving reweighting
- erg_scale (-0.9-2.0, default: 0.0): Prompt energy reweighting
- cfg_interval_start / cfg_interval_end (0.0-1.0): Schedule fraction range
- shift (0.0-5.0, default: 3.0): Timestep schedule shift
TextEncode - Parameters
| Parameter | Range | Description |
|---|---|---|
| clip | CLIP | CLIP from Model Loader or Lora Loader |
| caption | text | Music description (genre, mood, instruments) |
| lyrics | text | Song lyrics or [Instrumental] |
| instrumental | boolean | Force instrumental mode |
| seed | 0 - | Seed |
| duration | 0.0 - 600.0 | Duration in seconds (0 = auto from lyrics) |
| bpm | 0 - 300 | Beats per minute (0 = auto) |
| timesignature | auto/2/3/4/6 | Time signature numerator |
| language | - | Lyric language (en, ja, zh, es, pt, etc.) |
| keyscale | auto/... | Key and scale (e.g. "C major") |
TextEncode - Optional LLM Configuration
- generate_audio_codes (default: True): Enable LLM audio code generation
- lm_cfg_scale (0.0-100.0, default: 2.0): LLM CFG scale
- lm_temperature (0.0-2.0, default: 0.85): LLM sampling temperature
- lm_top_p (0.0-2000.0, default: 0.9): Nucleus sampling
- lm_top_k (0-100, default: 0): Top-k sampling
- lm_min_p (0.0-1.0, default: 0.0): Minimum probability
- lm_negative_prompt: Negative prompt for LLM CFG
TextEncode - Style Overrides (from Music Analyzer)
- style_tags: Appended to caption when connected
- style_bpm: Overrides bpm when > 0
- style_keyscale: Overrides keyscale when not empty
๐จ Workflow Examples
Example 1: Basic Generation
Model Loader:
diffusion_model: "acestep_v1.5_sft.safetensors"
text_encoder_1: "qwen_0.6b_ace15.safetensors"
text_encoder_2: "qwen_1.7b_ace15.safetensors"
vae_name: "ace_1.5_vae.safetensors"
โ model, clip, vae
TextEncode:
clip: (from Model Loader)
caption: "upbeat electronic dance music with synthesizers"
lyrics: [Instrumental]
instrumental: True
duration: 60.0
โ positive, negative
Generate:
model: (from Model Loader)
positive: (from TextEncode)
negative: (from TextEncode)
vae: (from Model Loader)
cfg: 7.0, steps: 50, guidance_mode: "apg"
โ audio
Preview Audio:
audio: (from Generate)
Example 2: With LoRA
Model Loader โ model, clip, vae
โ model, clip
Lora Loader:
lora_name: "ace-step15-style1.safetensors"
strength_model: 0.7
strength_clip: 0.0
โ model, clip
โ model, clip
Lora Loader:
lora_name: "Ace-Step1.5-TechnoRain.safetensors"
strength_model: 0.35
strength_clip: 0.0
โ model, clip
TextEncode (clip from last Lora Loader) โ positive, negative
Generate (model from last Lora Loader, vae from Model Loader) โ audio
Save Audio (format: mp3, quality: 320k)
Example 3: Audio Refinement (img2img)
Generate:
latent_or_audio: (existing audio)
denoise: 0.7 (preserves 30% of source)
duration: 0 (uses input duration)
โ Refines audio while preserving original characteristics
Example 4: Music Analysis โ Generation
Music Analyzer:
audio: (input audio file)
โ tags, bpm, keyscale
TextEncode:
style_tags: (from Music Analyzer)
style_bpm: (from Music Analyzer)
style_keyscale: (from Music Analyzer)
โ positive, negative
Generate โ Save Audio (format: flac)
๐ Troubleshooting
Audio Distortion/Clipping
Solution: Use negative latent_shift (e.g., -0.1) to reduce amplitude before VAE decoding
High Variance Results
Solution: Increase apg_norm_threshold (e.g., 3.0-4.0) for more gradient clipping
Lower Than Expected Quality
Solution:
- Use
guidance_mode: "apg"(recommended) - Start from
steps: 50,cfg: 7.0,sampler_name: "euler",scheduler: "normal",infer_method: "ode"
LoRA Sounds Deformed or Overcooked
Solution:
- Lower
strength_modelfirst, e.g.0.2to0.6 - Set
strength_clipto0.0unless the LoRA explicitly targets the text encoders - Compare
guidance_mode: "standard_cfg"vs"apg"for that LoRA - Avoid stacking multiple strong LoRAs at full strength
LoRA Dimension Mismatch Error (The size of tensor a must match...)
Cause: DoRA LoRAs store dora_scale as a 1D tensor [N]. ComfyUI's weight_decompose expects [N,1].
Solution: This is automatically fixed by the Lora Loader โ all dora_scale tensors are unsqueezed to 2D [N,1] at load time.
PEFT/DoRA LoRA Not Showing in Dropdown
Solution:
- Place the PEFT folder (containing
adapter_config.json+adapter_model.safetensors) insideComfyUI-AceStep_SFT/Loras/ - Restart ComfyUI โ the conversion runs automatically on startup
- Check the console for
[AceStep SFT] Converted PEFT/DoRA โ ComfyUI: ...message - The converted file appears as
*_comfyui.safetensorsin the dropdown
Slow Generation
Solution: Reduce batch_size, lower steps to ~20, or use "karras" scheduler
๐ Guidance Modes Comparison
| Aspect | APG | ADG | Standard CFG |
|---|---|---|---|
| Quality | โญโญโญโญโญ | โญโญโญโญ | โญโญโญ |
| Stability | โญโญโญโญโญ | โญโญโญโญ | โญโญ |
| Dynamics | Natural | Aggressive | Predictable |
| Computation | Normal | Normal | Minimal |
| Recommended | โ Yes | For extreme styles | Baseline |
๐๏ธ Quality Tips
- Use
guidance_mode=apgwithsteps=50to64for best quality - For img2img refinement, start with
denoise=0.5to0.7to preserve the original character - Mild vocal hiss is usually a generation artifact; APG and slightly higher step counts generally help more than raw
cfg - Simplify overly dense or contradictory tags for cleaner results
๐ Technical References
- AceStep 1.5: ICML 2024 (Learning Universal Features for Efficient Audio Generation)
- Flow Matching: Liphardt et al. 2024 (Generative Modeling by Estimating Gradients of the Data Distribution)
- APG/ADG: Techniques aligned with official AceStep paper
- ComfyUI: Modular node graph architecture for batch generation
๐ License
MIT License - Feel free to use in personal or commercial projects
๐ค Contributing
Issues and PRs are welcome! Please:
- Fork the repository
- Create a feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
โ ๏ธ Important Notes
- Recommended maximum duration: 240 seconds (GPU memory)
- Maximum batch size: Depends on your GPU (start with 1-2)
- SFT models: These models are specific to Supervised Fine-Tuning - not tested with non-SFT models
- Rights and attribution: Respect model and dataset usage rights
Built on the AceStep SFT workflow and extended with modular nodes, advanced guidance, waveform visualization, and quality controls for ComfyUI.
For bugs, questions, or suggestions: open an issue on the repository! ๐ต
ComfyUI-AceStep SFT
An all-in-one node for ComfyUI that implements AceStep 1.5 SFT (Supervised Fine-Tuning), a state-of-the-art music generation model. It starts from the official AceStep workflow and extends it with stronger conditioning control and practical ComfyUI-oriented quality options.
SFT = Supervised Fine-Tuning: A specialized version of AceStep optimized for generating superior quality audio through supervised training.
๐ Overview
This package currently provides four nodes under audio/AceStep SFT:
- AceStep 1.5 SFT Generate: all-in-one generation, editing, and decoding
- AceStep 1.5 SFT Music Analyzer: AI-powered audio analysis (tags, BPM, key/scale)
- AceStep 1.5 SFT Lora Loader: chainable LoRA stack builder for AceStep 1.5 SFT
- AceStep 1.5 SFT Turbo Tag Adapter: rewrites Turbo-oriented tags into shorter SFT-friendly prompt tags
The AceStepSFTGenerate node encapsulates the entire music generation workflow:
- Latent Creation - Generates initial latents or loads from
latent_or_audioinput - Text Encoding - Processes captions, lyrics, and metadata via multiple CLIP encoders
- Diffusion Sampling - Runs the diffusion model with advanced guidance control
- Audio Decoding - Converts latents to high-quality audio via VAE
Example Configuration

๐ฏ Key Features
โจ Advanced Guidance
The node supports three classifier-free guidance modes, each with unique characteristics:
-
APG (Adaptive Projected Guidance) โญ Recommended
- Dynamic adaptation via momentum buffering
- Gradient clipping with adaptive thresholds
- Orthogonal projection to eliminate unwanted noise
- AceStep SFT Default - best quality and stability balance
-
ADG (Angle-based Dynamic Guidance)
- Angle-based guidance between conditions
- Operates in velocity space (flow matching)
- Ideal for aggressive style distortion
- Adaptive clipping based on angle between x0_cond and x0_uncond
-
Standard CFG
- Traditional Classifier-Free Guidance
- Simple and predictable implementation
- Useful as a comparison baseline
๐ต Intelligent Metadata Processing
- Auto-Duration: Automatically estimates music duration by analyzing lyric structure
- LLM Encoding: Use Qwen LLM (0.6B or 1.7B/4B) to generate semantic audio codes
- Auto Values: BPM, Time Signature, and Key/Scale automatic (model decides)
- Multilingual Support: Over 23 languages supported
๐ง AI Music Analyzer
- Audio Tag Extraction: Uses the native ACE-Step Transcriber to extract lyric, vocal, and song-structure tags from audio
- BPM Detection: Automatic tempo detection via librosa
- Key/Scale Detection: Detects musical key and scale (e.g. "G minor")
- JSON Output: Structured
music_infosoutput with all analysis results - Generation Parameters: Control temperature, top_p, top_k, repetition_penalty, and seed
- Auto Model Download: Models are downloaded on first use (~1-7 GB each)
Native Analysis Model:
| Model | Size | Type | Best For |
|---|---|---|---|
| ACE-Step-Transcriber | 22.4 GB download | Audio-to-Text | Native ACE-Step 1.5 transcription for lyrics, singing voice, structure tags, and instrument hints |
This node is now dedicated to the native ACE-Step-Transcriber workflow. It uses the model's native prompt format, structured transcription output, and derives tags from language, lyrics, section markers such as verse/chorus/bridge, and optional instrument annotations.
๐ Audio Refinement (img2img)
- Latent-based Refinement: Use
denoise < 1.0withlatent_or_audioconnected to refine existing audio - Accepts AUDIO or LATENT: Connect any audio or latent output for img2img-style editing
- Batch Generation: Generate multiple variations in parallel
๐ง Extended Conditioning Control
- Split Text/Lyric Guidance: Independent
guidance_scale_textandguidance_scale_lyric - Omega Scale: Mean-preserving output reweighting to approximate AceStep scheduler behavior
- ERG Approximation: Node-local prompt energy reweighting via
erg_scale - Guidance Interval Decay: Smoothly decay guidance inside the active interval
๐๏ธ AceStep LoRA Workflow
- Chainable LoRA Loader: Stack one or more AceStep LoRAs before generation
- Separate strengths: Independent
strength_modelandstrength_clip - Single Generate input: Final LoRA stack plugs into the
lorainput on Generate - Local
Loras/folder: Drop LoRA files directly into the node'sLoras/folder โ they are automatically registered at startup - Auto PEFT/DoRA conversion: PEFT-format LoRAs (
adapter_config.json+adapter_model.safetensors) placed inLoras/are automatically converted to ComfyUI format on first startup - DoRA support: Full DoRA (Weight-Decomposed Low-Rank Adaptation) support with automatic
dora_scaledimension fix for ComfyUI compatibility
๐ ๏ธ Latent Post-processing
- Latent Shift: Additive anti-clipping correction
- Latent Rescale: Multiplicative scaling for dynamic control
๐ฆ Installation
Prerequisites
- ComfyUI installed and functional
- CUDA/GPU or equivalent (modern processors)
- Recommended for better output quality (based on practical testing): use the merged SFT+Turbo model.
- Required model files:
- Diffusion model (DiT):
acestep_v1.5_sft.safetensors - Text Encoders:
qwen_0.6b_ace15.safetensors,qwen_1.7b_ace15.safetensors(or 4B) - VAE:
ace_1.5_vae.safetensors
- Diffusion model (DiT):
Download Model Files
Download the required models from HuggingFace:
- Diffusion Model (Recommended: merged SFT+Turbo):
-
Alternative Diffusion Model (official SFT):
-
Text Encoders (choose any versions):
- Text Encoders Collection
qwen_0.6b_ace15.safetensors(caption processing)qwen_1.7b_ace15.safetensorsorqwen_4b_ace15.safetensors(audio code generation)
- Text Encoders Collection
-
VAE (Audio codec):
Installation Steps
- Clone the repository to your custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/jeankassio/ComfyUI-AceStep_SFT.git
- Place model files in the appropriate directories:
ComfyUI/models/diffusion_models/ # AceStep 1.5 SFT model
ComfyUI/models/text_encoders/ # Qwen encoders
ComfyUI/models/vae/ # VAE
ComfyUI/models/loras/ # Optional AceStep 1.5 LoRAs
- (Optional) Place LoRAs in the local folder:
ComfyUI/custom_nodes/ComfyUI-AceStep_SFT/Loras/ # Local LoRA folder
You can place LoRAs here in any of these formats:
- ComfyUI format: Single
.safetensorsfile (ready to use) - PEFT/DoRA format: A folder containing
adapter_config.json+adapter_model.safetensors(auto-converted on startup) - Nested zip artifacts: If your zip extracted a folder-inside-folder, the node detects this and fixes it automatically
- Restart ComfyUI - the node will appear under
audio/AceStep SFT
๐งฉ Available Nodes
AceStep 1.5 SFT Generate
Main all-in-one node for text-to-music generation, latent-based audio refinement, and VAE decoding.
AceStep 1.5 SFT Music Analyzer
AI-powered audio analysis node that extracts descriptive tags, BPM, and key/scale from audio input.
Inputs:
audio: Audio input to analyzemodel: AI model selection (9 models, auto-downloaded)get_tags/get_bpm/get_keyscale: Enable/disable each analysismax_new_tokens: Maximum tokens for generative modelsaudio_duration: Max seconds of audio to analyzetemperature,top_p,top_k,repetition_penalty,seed: Generation parametersunload_model: Free VRAM after analysisuse_flash_attn: Enable Flash Attention 2 (if compatible)
Outputs:
tags: Comma-separated descriptive tags (STRING)bpm: Detected BPM as string e.g. "129bpm" (STRING)keyscale: Key and scale e.g. "G minor" (STRING)music_infos: JSON with all results (STRING)
AceStep 1.5 SFT Lora Loader
Chainable utility node that builds a LoRA stack for AceStep 1.5 SFT.
Inputs:
lora_name: LoRA file fromComfyUI/models/lorasor the localLoras/folderstrength_model: strength applied to the diffusion modelstrength_clip: strength applied to the text encoder stacklora(optional): upstream AceStep LoRA stack
Output:
lora: connect to another Lora Loader or directly into Generate
Supported LoRA Formats
| Format | What to place in Loras/ | Action |
|---|---|---|
ComfyUI .safetensors | Single file | Used directly |
| PEFT/DoRA directory | Folder with adapter_config.json + adapter_model.safetensors | Auto-converted to *_comfyui.safetensors on startup |
| Nested zip artifact | Folder containing a .safetensors inside | Auto-extracted to root on startup |
The auto-conversion handles:
- Key remapping:
lora_A/lora_Bโlora_down/lora_up - DoRA support:
lora_magnitude_vectorโdora_scale(with correct 2D shape) - Per-layer alpha injection from
adapter_config.json(supportsalpha_patternandrank_pattern)
๐๏ธ Node Parameters
Required Parameters
| Parameter | Range | Description |
|---|---|---|
| diffusion_model | - | Path to DiT model (AceStep 1.5 SFT) |
| text_encoder_1 | - | Qwen3 0.6B Encoder (caption processing) |
| text_encoder_2 | - | Qwen3 1.7B/4B Encoder (audio code generation) |
| vae_name | - | AceStep 1.5 VAE |
| caption | - | Text description of music (genre, mood, instruments) |
| lyrics | - | Song lyrics or [Instrumental] |
| instrumental | boolean | Force instrumental mode (overrides lyrics) |
| seed | 0 - | Seed for reproducibility |
| steps | 1 - 200 | Diffusion inference steps (default: 50 for ACE-Step 1.5 SFT) |
| cfg | 1.0 - 20.0 | Classifier-free guidance scale (default: 7.0; typical 7.0-9.0 for ACE-Step 1.5) |
| sampler_name | - | Sampler (euler, dpmpp, etc.) |
| scheduler | - | Scheduler (normal, karras, exponential, etc.; default: normal) |
| denoise | 0.0 - 1.0 | Denoising strength (1.0 = fresh generation, < 1.0 = editing) |
| infer_method | ode/sde | ODE keeps the selected sampler behavior; SDE remaps default Euler/Heun choices to a stochastic sampler |
| guidance_mode | apg/adg/standard_cfg | Guidance type (default: apg) |
| duration | 0.0 - 600.0 | Duration in seconds (default: 60.0, 0 = auto) |
| bpm | 0 - 300 | Beats per minute (0 = auto, model decides) |
| timesignature | auto/2/3/4/6 | Time signature numerator |
| language | - | Lyric language (en, ja, zh, es, pt, etc.) |
| keyscale | auto/... | Key and scale (e.g., "C major" or "D minor") |
Optional Parameters
Batch Generation
- batch_size (1-16): Number of audios to generate in parallel
Audio Input
- latent_or_audio: Base input for refinement (img2img). Accepts AUDIO or LATENT. Use
denoise < 1.0to refine this input. Withduration=0, duration is derived from the connected input. - lora: AceStep LoRA stack from one or more
AceStep 1.5 SFT Lora Loadernodes
LLM Configuration (Audio Code Generation)
- generate_audio_codes (default: True): Enable/disable LLM audio code generation for semantic structure
- lm_cfg_scale (0.0-100.0, default: 2.0): LLM classifier-free guidance scale
- lm_temperature (0.0-2.0, default: 0.85): LLM sampling temperature
- lm_top_p (0.0-2000.0, default: 0.9): Nucleus sampling parameter
- lm_top_k (0-100, default: 0): Top-k sampling
- lm_min_p (0.0-1.0, default: 0.0): Minimum probability threshold
- lm_negative_prompt: Negative prompt for LLM CFG
Latent Post-processing
- latent_shift (-0.2-0.2, default: 0.0): Additive shift (anti-clipping)
- latent_rescale (0.5-1.5, default: 1.0): Multiplicative scaling
- normalize_peak (default: False): Legacy hard normalization to 0 dBFS after VAE decode
- enable_normalization (default: True): Peak-normalize output to a target dBFS level
- normalization_db (-10.0-0.0, default: -1.0): Target peak level when normalization is enabled
- fade_in_duration / fade_out_duration (0.0-10.0, default: 0.0): Optional linear fades after normalization
- use_tiled_vae (default: True): Uses tiled VAE encode/decode for better long-audio and low-VRAM robustness
- voice_boost (-12.0-12.0, default: 0.0): Simple output gain in dB before normalization
APG Configuration
- apg_momentum (-1.0-1.0, default: -0.75): Momentum buffer coefficient
- apg_norm_threshold (0.0-10.0, default: 2.5): Norm threshold for gradient clipping
Extended Guidance Controls
- guidance_interval (-1.0-1.0, default: 0.5): Official centered guidance interval control
- guidance_interval_decay (0.0-1.0, default: 0.0): Linear decay inside the active guidance interval
- min_guidance_scale (0.0-30.0, default: 3.0): Lower bound when interval decay is enabled
- guidance_scale_text (-1.0-30.0, default: -1.0): Text-only guidance scale,
-1inheritscfg - guidance_scale_lyric (-1.0-30.0, default: -1.0): Lyric-only delta guidance scale,
-1inheritscfg - omega_scale (-8.0-8.0, default: 0.0): Mean-preserving output reweighting
- erg_scale (-0.9-2.0, default: 0.0): Prompt/lyric conditioning energy reweighting
Guidance Interval
- cfg_interval_start (0.0-1.0, default: 0.0): Start applying guidance at this schedule fraction
- cfg_interval_end (0.0-1.0, default: 1.0): Stop applying guidance at this schedule fraction
Custom Timesteps
- shift (1.0-5.0, default: 3.0): Schedule shift (3.0 = Gradio default)
- custom_timesteps: Custom comma-separated timesteps (overrides steps, shift, scheduler)
๐ How It Works - Technical Foundation
1. Latent Pipeline
The node automatically manages latent creation or reuse:
โโ If latent_or_audio provided:
โ โโ AUDIO: Resamples to VAE SR (48kHz), normalizes channels, encodes via VAE
โ โโ LATENT: Uses directly as latent_image
โ โโ Duration derived from input when duration=0
โ
โโ If no latent_or_audio:
โโ Creates zero latent (pure noise) [batch_size, 64, latent_length]
Automatic Sizing: Duration in seconds is converted to latent length via:
latent_length = max(10, round(duration * vae_sample_rate / 1920))
2. Auto-Duration Estimation
When duration <= 0, the node analyzes lyric structure:
[Intro/Outro] = 8 beats (~1 bar 4/4)
[Instrumental/Solo] = 16 beats (~2 bars 4/4)
Verse/Chorus โ ~2 beats per 2 words (typical singing rate)
Section transitions = 4 beats
Empty lines = 2 beats (pause)
Result: duration = beats * (60 / bpm)
3. Metadata Processing
Metadata (bpm, duration, key/scale, time sig) are encoded in multiple representations:
- Structured YAML (Chain-of-Thought):
bpm: 120
caption: "upbeat electronic dance"
duration: 120
keyscale: "G major"
language: "en"
timesignature: 4
- LLM Template (for audio code generation via Qwen):
<|im_start|>system
# Instruction
Generate audio semantic tokens...
<|im_end|>
<|im_start|>user
# Caption
upbeat electronic dance
# Lyric
[Verse 1]...
<|im_end|>
<|im_start|>assistant
<think>
{YAML above}
</think>
<|im_end|>
- Qwen3-0.6B Template (direct metadata):
# Instruction
# Caption
upbeat electronic dance
# Metas
- bpm: 120
- timesignature: 4
- keyscale: G major
- duration: 120 seconds
<|endoftext|>
4. Guidance Strategy
APG (Adaptive Projected Guidance) - Recommended
# Phase 1: Compute conditional difference
diff = pred_cond - pred_uncond
# Phase 2: Apply smooth momentum
if momentum_buffer:
diff = momentum * running_avg + diff
# Phase 3: Norm clipping
norm = ||diff||โ
scale = min(1, norm_threshold / norm)
diff = diff * scale
# Phase 4: Orthogonal decomposition
diff_parallel = projection of diff onto pred_cond
diff_orthogonal = diff - diff_parallel
# Phase 5: Final guidance
guidance = pred_cond + (cfg_scale - 1) * (diff_orthogonal + eta * diff_parallel)
Why It Works:
- Orthogonal projection removes collinear components that amplify noise
- Momentum smooths large jumps between timesteps
- Adaptive clipping prevents gradient explosion
- Result: cleaner and more stable audio
ADG (Angle-based Dynamic Guidance)
# Based on cosine angles between x0_cond and x0_uncond
# Dynamically adjusts guidance based on alignment
# Uses trigonometry for aggressive style deformation
5. Latent Refinement (img2img)
When latent_or_audio is connected with denoise < 1.0, the node operates in img2img mode:
- The input audio is encoded via VAE (or the latent is used directly)
- A fraction of noise is added based on
denoisestrength - The diffusion model refines the noisy latent while preserving the original structure
๐๏ธ Quality Tips
- Use
guidance_mode=apgwithsteps=50to64for best quality - For img2img refinement, start with
denoise=0.5to0.7to preserve the original character - Mild vocal hiss is usually a generation artifact; APG and slightly higher step counts generally help more than raw
cfg - Simplify overly dense or contradictory tags for cleaner results
๐ Guidance Modes Comparison
| Aspect | APG | ADG | Standard CFG |
|---|---|---|---|
| Quality | โญโญโญโญโญ | โญโญโญโญ | โญโญโญ |
| Stability | โญโญโญโญโญ | โญโญโญโญ | โญโญ |
| Dynamics | Natural | Aggressive | Predictable |
| Computation | Normal | Normal | Minimal |
| Recommended | โ Yes | For extreme styles | Baseline |
๐จ Workflow Examples
Example 1: Quality Baseline (Recommended)
AceStepSFTGenerate:
caption: "upbeat electronic dance music with synthesizers"
lyrics: [Instrumental]
instrumental: True
duration: 60.0
cfg: 7.0
steps: 50
sampler_name: "euler"
scheduler: "normal"
guidance_mode: "apg"
โ Generates a strong 60s ACE-Step 1.5 SFT baseline render
Example 2: Audio Refinement (img2img)
AceStepSFTGenerate:
latent_or_audio: (mixer output)
caption: "make it more orchestral"
denoise: 0.7 (preserves 30% of source)
duration: 0 (uses input duration)
โ Refines audio while preserving original characteristics
Example 3: Batch Generation with Varied Seeds
AceStepSFTGenerate:
batch_size: 4
seed: 42 (varies automatically)
โ Creates 4 variations with similar characteristics
Example 4: Chained LoRAs
AceStep 1.5 SFT Lora Loader:
lora_name: "Ace-Step1.5/ace-step15-style1.safetensors"
strength_model: 0.7
strength_clip: 0.0
โ
AceStep 1.5 SFT Lora Loader:
lora_name: "Ace-Step1.5/Ace-Step1.5-TechnoRain.safetensors"
strength_model: 0.35
strength_clip: 0.0
โ
AceStep 1.5 SFT Generate:
lora: (stack output)
Note: AceStep LoRAs are now supported directly by this package. If a specific LoRA produces unstable audio, start by lowering strength_model and compare apg against standard_cfg.
Example 5: Music Analysis โ Generation Pipeline
AceStepSFTMusicAnalyzer:
audio: (input audio file)
model: "Qwen2-Audio-7B-Instruct"
โ tags: "dancehall beat, powerful bassline, vocal samples, melancholic"
โ bpm: "129bpm"
โ keyscale: "G minor"
โ
AceStepSFTGenerate:
caption: (tags from analyzer)
bpm: 129
keyscale: "G minor"
โ Generates new music matching the analyzed style
๐ Troubleshooting
Audio Distortion/Clipping
Solution: Use negative latent_shift (e.g., -0.1) to reduce amplitude before VAE decoding
High Variance Results
Solution: Increase apg_norm_threshold (e.g., 3.0-4.0) for more gradient clipping
Lower Than Expected Quality
Solution:
- Use
guidance_mode: "apg"(recommended) - Start from
steps: 50,cfg: 7.0,sampler_name: "euler",scheduler: "normal",infer_method: "ode" - Keep
enable_normalization: Truewithnormalization_db: -1.0for cleaner final level management
LoRA Sounds Deformed or Overcooked
Solution:
- Lower
strength_modelfirst, e.g.0.2to0.6 - Set
strength_clipto0.0unless the LoRA explicitly targets the text encoders - Compare
guidance_mode: "standard_cfg"vs"apg"for that LoRA - Avoid stacking multiple strong LoRAs at full strength
LoRA Dimension Mismatch Error (The size of tensor a must match...)
Cause: DoRA LoRAs store dora_scale as a 1D tensor [N]. ComfyUI's weight_decompose divides it by weight_norm [N,1], which causes PyTorch to broadcast [1,N]/[N,1] โ [N,N] instead of the expected [N,1].
Solution: This is automatically fixed by the node โ all dora_scale tensors are unsqueezed to 2D [N,1] at load time. If you still see this error, ensure you are using the latest version of this node.
PEFT/DoRA LoRA Not Showing in Dropdown
Solution:
- Place the PEFT folder (containing
adapter_config.json+adapter_model.safetensors) insideComfyUI-AceStep_SFT/Loras/ - Restart ComfyUI โ the conversion runs automatically on startup
- Check the console for
[AceStep SFT] Converted PEFT/DoRA โ ComfyUI: ...message - The converted file appears as
*_comfyui.safetensorsin the dropdown
Slow Generation
Solution: Reduce batch_size, lower steps to ~20, or use "karras" scheduler
๐ Technical References
- AceStep 1.5: ICML 2024 (Learning Universal Features for Efficient Audio Generation)
- Flow Matching: Liphardt et al. 2024 (Generative Modeling by Estimating Gradients of the Data Distribution)
- APG/ADG: Techniques aligned with official AceStep paper
- ComfyUI: Modular node graph architecture for batch generation
๐ License
MIT License - Feel free to use in personal or commercial projects
๐ค Contributing
Issues and PRs are welcome! Please:
- Fork the repository
- Create a feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
โ ๏ธ Important Notes
- Recommended maximum duration: 240 seconds (GPU memory)
- Maximum batch size: Depends on your GPU (start with 1-2)
- SFT models: These models are specific to Supervised Fine-Tuning - not tested with non-SFT models
- Rights and attribution: Respect model and dataset usage rights
Built on the AceStep SFT workflow and extended with advanced guidance and quality controls for ComfyUI.
For bugs, questions, or suggestions: open an issue on the repository! ๐ต