ComfyUI-AceStep SFT

April 13, 2026 ยท View on GitHub

License: MIT Python 3.8+

A modular node suite for ComfyUI that implements AceStep 1.5 SFT (Supervised Fine-Tuning), a state-of-the-art music generation model. It starts from the official AceStep workflow and extends it with stronger conditioning control and practical ComfyUI-oriented quality options.

SFT = Supervised Fine-Tuning: A specialized version of AceStep optimized for generating superior quality audio through supervised training.

๐Ÿ“‹ Overview

This package provides eight nodes under audio/AceStep SFT:

NodePurpose
AceStep 1.5 SFT Model LoaderLoads the diffusion model, CLIP text encoders, and VAE
AceStep 1.5 SFT Lora LoaderApplies a LoRA to MODEL + CLIP (chainable)
AceStep 1.5 SFT TextEncodeEncodes caption, lyrics, and metadata into conditioning
AceStep 1.5 SFT GenerateDiffusion sampler + optional VAE decode
AceStep 1.5 SFT Preview AudioAudio playback with waveform spectrum visualizer
AceStep 1.5 SFT Save AudioSave audio (FLAC/MP3/Opus) with waveform visualizer
AceStep 1.5 SFT Get Music InfosAI-powered audio analysis (tags, BPM, key/scale)
AceStep 1.5 SFT Turbo Tag AdapterRewrites Turbo-oriented tags into SFT-friendly tags (BETA)

Modular Architecture

The workflow is split into dedicated nodes for maximum flexibility:

Model Loader โ†’ (model, clip, vae)
       โ”‚            โ”‚        โ”‚
       โ”‚   Lora Loader (optional, chainable)
       โ”‚     โ”‚    โ”‚          โ”‚
       โ”‚     โ”‚  TextEncode   โ”‚
       โ”‚     โ”‚   โ”‚    โ”‚      โ”‚
       โ–ผ     โ–ผ   โ–ผ    โ–ผ      โ–ผ
      Generate (model, positive, negative, vae)
         โ”‚         โ”‚
    Preview Audio  Save Audio

Example Configuration

AceStep SFT Node Configuration

๐ŸŽฏ Key Features

โœจ Advanced Guidance

The node supports three classifier-free guidance modes, each with unique characteristics:

  • APG (Adaptive Projected Guidance) โญ Recommended

    • Dynamic adaptation via momentum buffering
    • Gradient clipping with adaptive thresholds
    • Orthogonal projection to eliminate unwanted noise
    • AceStep SFT Default - best quality and stability balance
  • ADG (Angle-based Dynamic Guidance)

    • Angle-based guidance between conditions
    • Operates in velocity space (flow matching)
    • Ideal for aggressive style distortion
  • Standard CFG

    • Traditional Classifier-Free Guidance
    • Simple and predictable implementation
    • Useful as a comparison baseline

๐ŸŽต Intelligent Metadata Processing

  • Auto-Duration: Automatically estimates music duration by analyzing lyric structure
  • LLM Encoding: Use Qwen LLM (0.6B or 1.7B/4B) to generate semantic audio codes
  • Auto Values: BPM, Time Signature, and Key/Scale automatic (model decides)
  • Multilingual Support: Over 23 languages supported

๐ŸŽง AI Music Analyzer

  • Audio Tag Extraction: Uses the native ACE-Step Transcriber to extract lyric, vocal, and song-structure tags from audio
  • BPM Detection: Automatic tempo detection via librosa
  • Key/Scale Detection: Detects musical key and scale (e.g. "G minor")
  • JSON Output: Structured music_infos output with all analysis results

๐Ÿ”Š Audio Preview & Save with Waveform Visualizer

Both Preview Audio and Save Audio nodes feature:

  • Interactive waveform spectrum display directly on the node (dark background with amplitude bars)
  • Play/Pause button with click-to-seek on the waveform
  • Time display showing current position and total duration

Save Audio additionally supports:

  • Multiple formats: FLAC (lossless), MP3, and Opus
  • Quality options: V0, 64k, 96k, 128k, 192k, 320k
  • Auto-incrementing filenames with configurable prefix

๐Ÿ”„ Audio Refinement (img2img)

  • Latent-based Refinement: Use denoise < 1.0 with latent_or_audio connected to refine existing audio
  • Accepts AUDIO or LATENT: Connect any audio or latent output for img2img-style editing
  • Batch Generation: Generate multiple variations in parallel

๐Ÿง  Extended Conditioning Control

  • Split Text/Lyric Guidance: Independent guidance_scale_text and guidance_scale_lyric
  • Omega Scale: Mean-preserving output reweighting to approximate AceStep scheduler behavior
  • ERG Approximation: Node-local prompt energy reweighting via erg_scale
  • Guidance Interval Decay: Smoothly decay guidance inside the active interval

๐ŸŽš๏ธ AceStep LoRA Workflow

  • Direct LoRA Application: The Lora Loader takes MODEL + CLIP, applies the LoRA via comfy.sd.load_lora_for_models(), and outputs the modified MODEL + CLIP
  • Chainable: Stack multiple Lora Loaders in sequence
  • Separate strengths: Independent strength_model and strength_clip
  • DoRA support: Full DoRA (Weight-Decomposed Low-Rank Adaptation) support with automatic dora_scale dimension fix
  • Local Loras/ folder: Drop LoRA files directly into the node's Loras/ folder โ€” they are automatically registered at startup
  • Auto PEFT/DoRA conversion: PEFT-format LoRAs (adapter_config.json + adapter_model.safetensors) placed in Loras/ are automatically converted to ComfyUI format on first startup

๐Ÿ› ๏ธ Latent Post-processing

  • Latent Shift: Additive anti-clipping correction
  • Latent Rescale: Multiplicative scaling for dynamic control

๐Ÿ“ฆ Installation

Prerequisites

  • ComfyUI installed and functional
  • CUDA/GPU or equivalent (modern processors)
  • Recommended for better output quality (based on practical testing): use the merged SFT+Turbo model.
  • Required model files:
    • Diffusion model (DiT): acestep_v1.5_sft.safetensors
    • Text Encoders: qwen_0.6b_ace15.safetensors, qwen_1.7b_ace15.safetensors (or 4B)
    • VAE: ace_1.5_vae.safetensors

Download Model Files

Download the required models from HuggingFace:

  1. Diffusion Model (Recommended: merged SFT+Turbo):
  1. Alternative Diffusion Model (official SFT):

  2. Text Encoders (choose any versions):

    • Text Encoders Collection
      • qwen_0.6b_ace15.safetensors (caption processing)
      • qwen_1.7b_ace15.safetensors or qwen_4b_ace15.safetensors (audio code generation)
  3. VAE (Audio codec):

Installation Steps

  1. Clone the repository to your custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/jeankassio/ComfyUI-AceStep_SFT.git
  1. Place model files in the appropriate directories:
ComfyUI/models/diffusion_models/     # AceStep 1.5 SFT model
ComfyUI/models/text_encoders/        # Qwen encoders
ComfyUI/models/vae/                  # VAE
ComfyUI/models/loras/                # Optional AceStep 1.5 LoRAs
  1. (Optional) Place LoRAs in the local folder:
ComfyUI/custom_nodes/ComfyUI-AceStep_SFT/Loras/   # Local LoRA folder

You can place LoRAs here in any of these formats:

  • ComfyUI format: Single .safetensors file (ready to use)
  • PEFT/DoRA format: A folder containing adapter_config.json + adapter_model.safetensors (auto-converted on startup)
  • Nested zip artifact: If your zip extracted a folder-inside-folder, the node detects this and fixes it automatically
  1. Restart ComfyUI - the nodes will appear under audio/AceStep SFT

๐Ÿงฉ Available Nodes

AceStep 1.5 SFT Model Loader

Loads the AceStep 1.5 diffusion model, dual CLIP text encoders, and audio VAE.

Inputs:

  • diffusion_model: AceStep 1.5 diffusion model (.safetensors)
  • text_encoder_1: Qwen3-0.6B encoder (caption processing)
  • text_encoder_2: Qwen3 LLM (1.7B or 4B, audio code generation)
  • vae_name: AceStep 1.5 audio VAE

Outputs:

  • model: MODEL โ€” connect to Lora Loader or Generate
  • clip: CLIP โ€” connect to Lora Loader or TextEncode
  • vae: VAE โ€” connect to Generate

AceStep 1.5 SFT Lora Loader

Applies a LoRA directly to the MODEL and CLIP. Multiple Lora Loaders can be chained.

Inputs:

  • model: MODEL from Model Loader or previous Lora Loader
  • clip: CLIP from Model Loader or previous Lora Loader
  • lora_name: LoRA file from ComfyUI/models/loras or the local Loras/ folder
  • strength_model: strength applied to the diffusion model
  • strength_clip: strength applied to the text encoder stack

Outputs:

  • model: MODEL โ€” connect to next Lora Loader or Generate
  • clip: CLIP โ€” connect to next Lora Loader or TextEncode

Supported LoRA Formats

FormatWhat to place in Loras/Action
ComfyUI .safetensorsSingle fileUsed directly
PEFT/DoRA directoryFolder with adapter_config.json + adapter_model.safetensorsAuto-converted to *_comfyui.safetensors on startup
Nested zip artifactFolder containing a .safetensors insideAuto-extracted to root on startup

AceStep 1.5 SFT TextEncode

Encodes caption, lyrics, and metadata into positive and negative conditioning for the Generate node.

Inputs:

  • clip: CLIP from Model Loader or Lora Loader
  • caption: Text description of the music (genre, mood, instruments)
  • lyrics: Song lyrics or [Instrumental]
  • instrumental: Force instrumental mode
  • seed, duration, bpm, timesignature, language, keyscale
  • Optional: generate_audio_codes, lm_cfg_scale, lm_temperature, lm_top_p, lm_top_k, lm_min_p, lm_negative_prompt
  • Optional style overrides: style_tags, style_bpm, style_keyscale (from Music Analyzer)

Outputs:

  • positive: CONDITIONING โ€” connect to Generate
  • negative: CONDITIONING โ€” connect to Generate

AceStep 1.5 SFT Generate

Diffusion sampler + optional VAE decoder. Requires MODEL and conditioning inputs.

Inputs:

  • model: MODEL from Model Loader or Lora Loader
  • positive: CONDITIONING from TextEncode
  • negative: CONDITIONING from TextEncode
  • Sampling: seed, steps, cfg, sampler_name, scheduler, denoise, duration, infer_method, guidance_mode
  • Optional: vae (for audio output), latent_or_audio (for img2img), batch_size
  • Optional post-processing: latent_shift, latent_rescale, fade_in_duration, fade_out_duration, voice_boost, use_tiled_vae
  • Optional guidance: apg_eta, apg_momentum, apg_norm_threshold, guidance_interval, guidance_interval_decay, min_guidance_scale, guidance_scale_text, guidance_scale_lyric, omega_scale, erg_scale, cfg_interval_start, cfg_interval_end, shift

Outputs:

  • model: MODEL (passthrough for chaining)
  • vae: VAE (passthrough for chaining)
  • positive: CONDITIONING (passthrough)
  • negative: CONDITIONING (passthrough)
  • latent: LATENT (raw diffusion output)
  • audio: AUDIO (decoded audio, only when VAE is connected)

AceStep 1.5 SFT Preview Audio

Previews audio with an interactive waveform spectrum visualizer directly on the node.

Inputs:

  • audio: AUDIO to preview

Features:

  • Interactive waveform display with play/pause button
  • Click-to-seek on the waveform
  • Current time / total duration display

AceStep 1.5 SFT Save Audio

Saves audio to disk with an interactive waveform spectrum visualizer.

Inputs:

  • audio: AUDIO to save
  • filename_prefix: Filename prefix (supports subfolder paths, e.g. audio/AceStep)
  • format: FLAC, MP3, or Opus
  • quality (optional): V0, 64k, 96k, 128k, 192k, 320k (for MP3/Opus)

Features:

  • Auto-incrementing filenames (e.g. AceStep_00001_.flac, AceStep_00002_.flac)
  • Waveform visualizer with play/pause and seek
  • Metadata embedding (prompt, workflow)

AceStep 1.5 SFT Get Music Infos

AI-powered audio analysis node that extracts descriptive tags, BPM, and key/scale from audio input.

Inputs:

  • audio: Audio input to analyze
  • get_tags / get_bpm / get_keyscale: Enable/disable each analysis
  • max_new_tokens: Maximum tokens for transcription output
  • audio_duration: Max seconds of audio to analyze
  • temperature, top_p, top_k, repetition_penalty, seed: Generation parameters
  • unload_model: Free VRAM after analysis
  • use_flash_attn: Enable Flash Attention 2 (if compatible)

Outputs:

  • tags: Comma-separated descriptive tags (STRING)
  • bpm: Detected BPM (INT)
  • keyscale: Key and scale e.g. "G minor" (STRING)
  • music_infos: JSON with all results (STRING)

AceStep 1.5 SFT Turbo Tag Adapter

Rewrites Turbo-oriented music tags into shorter SFT-friendly prompt tags.

Inputs:

  • turbo_tags: Turbo-style tags or caption
  • adaptation_strength: conservative / balanced / aggressive
  • keep_unknown_tags: Keep tags that were not explicitly mapped
  • add_sft_bias_tags: Add extra SFT-oriented anchor tags

Outputs:

  • sft_tags: Adapted comma-separated tags (STRING)
  • notes: Conversion notes (STRING)
  • suggested_cfg: Suggested CFG value (FLOAT)
  • suggested_steps: Suggested steps value (INT)

๐ŸŽ›๏ธ Node Parameters

Generate - Required Parameters

ParameterRangeDescription
modelMODELAceStep 1.5 diffusion model from Model Loader or Lora Loader
positiveCONDITIONINGPositive conditioning from TextEncode
negativeCONDITIONINGNegative conditioning from TextEncode
seed0 - 2642^{64}Seed for reproducibility
steps1 - 200Diffusion inference steps (default: 50)
cfg1.0 - 20.0Classifier-free guidance scale (default: 7.0)
sampler_name-Sampler (euler, dpmpp, etc.)
scheduler-Scheduler (normal, karras, etc.)
denoise0.0 - 1.0Denoising strength (1.0 = fresh, < 1.0 = editing)
duration0.0 - 600.0Duration in seconds (0 = auto)
infer_methodode/sdeODE = deterministic, SDE = stochastic
guidance_modeapg/adg/standard_cfgGuidance type (default: apg)

Generate - Optional Parameters

Batch Generation

  • batch_size (1-16): Number of audios to generate in parallel

Audio Input

  • vae: VAE from Model Loader (required for audio output)
  • latent_or_audio: Base input for refinement (img2img). Accepts AUDIO or LATENT

Latent Post-processing

  • latent_shift (-0.2-0.2, default: 0.0): Additive shift (anti-clipping)
  • latent_rescale (0.5-1.5, default: 1.0): Multiplicative scaling
  • fade_in_duration / fade_out_duration (0.0-10.0, default: 0.0): Optional linear fades
  • use_tiled_vae (default: True): Uses tiled VAE for long audio / low VRAM
  • voice_boost (-12.0-12.0, default: 0.0): Output gain in dB

APG Configuration

  • apg_eta (-10.0-10.0, default: 0.0): Parallel component retention
  • apg_momentum (-1.0-1.0, default: -0.75): Momentum buffer coefficient
  • apg_norm_threshold (0.0-15.0, default: 2.5): Norm threshold for gradient clipping

Extended Guidance Controls

  • guidance_interval (-1.0-1.0, default: 0.5): Centered guidance interval width
  • guidance_interval_decay (0.0-1.0, default: 0.0): Linear decay inside interval
  • min_guidance_scale (0.0-30.0, default: 3.0): Lower bound with decay
  • guidance_scale_text (-1.0-30.0, default: -1.0): Text-only guidance (split)
  • guidance_scale_lyric (-1.0-30.0, default: -1.0): Lyric-only guidance (split)
  • omega_scale (-8.0-8.0, default: 0.0): Mean-preserving reweighting
  • erg_scale (-0.9-2.0, default: 0.0): Prompt energy reweighting
  • cfg_interval_start / cfg_interval_end (0.0-1.0): Schedule fraction range
  • shift (0.0-5.0, default: 3.0): Timestep schedule shift

TextEncode - Parameters

ParameterRangeDescription
clipCLIPCLIP from Model Loader or Lora Loader
captiontextMusic description (genre, mood, instruments)
lyricstextSong lyrics or [Instrumental]
instrumentalbooleanForce instrumental mode
seed0 - 2642^{64}Seed
duration0.0 - 600.0Duration in seconds (0 = auto from lyrics)
bpm0 - 300Beats per minute (0 = auto)
timesignatureauto/2/3/4/6Time signature numerator
language-Lyric language (en, ja, zh, es, pt, etc.)
keyscaleauto/...Key and scale (e.g. "C major")

TextEncode - Optional LLM Configuration

  • generate_audio_codes (default: True): Enable LLM audio code generation
  • lm_cfg_scale (0.0-100.0, default: 2.0): LLM CFG scale
  • lm_temperature (0.0-2.0, default: 0.85): LLM sampling temperature
  • lm_top_p (0.0-2000.0, default: 0.9): Nucleus sampling
  • lm_top_k (0-100, default: 0): Top-k sampling
  • lm_min_p (0.0-1.0, default: 0.0): Minimum probability
  • lm_negative_prompt: Negative prompt for LLM CFG

TextEncode - Style Overrides (from Music Analyzer)

  • style_tags: Appended to caption when connected
  • style_bpm: Overrides bpm when > 0
  • style_keyscale: Overrides keyscale when not empty

๐ŸŽจ Workflow Examples

Example 1: Basic Generation

Model Loader:
  diffusion_model: "acestep_v1.5_sft.safetensors"
  text_encoder_1: "qwen_0.6b_ace15.safetensors"
  text_encoder_2: "qwen_1.7b_ace15.safetensors"
  vae_name: "ace_1.5_vae.safetensors"
  โ†’ model, clip, vae

TextEncode:
  clip: (from Model Loader)
  caption: "upbeat electronic dance music with synthesizers"
  lyrics: [Instrumental]
  instrumental: True
  duration: 60.0
  โ†’ positive, negative

Generate:
  model: (from Model Loader)
  positive: (from TextEncode)
  negative: (from TextEncode)
  vae: (from Model Loader)
  cfg: 7.0, steps: 50, guidance_mode: "apg"
  โ†’ audio

Preview Audio:
  audio: (from Generate)

Example 2: With LoRA

Model Loader โ†’ model, clip, vae
  โ†“ model, clip
Lora Loader:
  lora_name: "ace-step15-style1.safetensors"
  strength_model: 0.7
  strength_clip: 0.0
  โ†’ model, clip
  โ†“ model, clip
Lora Loader:
  lora_name: "Ace-Step1.5-TechnoRain.safetensors"
  strength_model: 0.35
  strength_clip: 0.0
  โ†’ model, clip

TextEncode (clip from last Lora Loader) โ†’ positive, negative
Generate (model from last Lora Loader, vae from Model Loader) โ†’ audio
Save Audio (format: mp3, quality: 320k)

Example 3: Audio Refinement (img2img)

Generate:
  latent_or_audio: (existing audio)
  denoise: 0.7 (preserves 30% of source)
  duration: 0 (uses input duration)
  โ†’ Refines audio while preserving original characteristics

Example 4: Music Analysis โ†’ Generation

Music Analyzer:
  audio: (input audio file)
  โ†’ tags, bpm, keyscale

TextEncode:
  style_tags: (from Music Analyzer)
  style_bpm: (from Music Analyzer)
  style_keyscale: (from Music Analyzer)
  โ†’ positive, negative

Generate โ†’ Save Audio (format: flac)

๐Ÿ› Troubleshooting

Audio Distortion/Clipping

Solution: Use negative latent_shift (e.g., -0.1) to reduce amplitude before VAE decoding

High Variance Results

Solution: Increase apg_norm_threshold (e.g., 3.0-4.0) for more gradient clipping

Lower Than Expected Quality

Solution:

  1. Use guidance_mode: "apg" (recommended)
  2. Start from steps: 50, cfg: 7.0, sampler_name: "euler", scheduler: "normal", infer_method: "ode"

LoRA Sounds Deformed or Overcooked

Solution:

  1. Lower strength_model first, e.g. 0.2 to 0.6
  2. Set strength_clip to 0.0 unless the LoRA explicitly targets the text encoders
  3. Compare guidance_mode: "standard_cfg" vs "apg" for that LoRA
  4. Avoid stacking multiple strong LoRAs at full strength

LoRA Dimension Mismatch Error (The size of tensor a must match...)

Cause: DoRA LoRAs store dora_scale as a 1D tensor [N]. ComfyUI's weight_decompose expects [N,1].

Solution: This is automatically fixed by the Lora Loader โ€” all dora_scale tensors are unsqueezed to 2D [N,1] at load time.

PEFT/DoRA LoRA Not Showing in Dropdown

Solution:

  1. Place the PEFT folder (containing adapter_config.json + adapter_model.safetensors) inside ComfyUI-AceStep_SFT/Loras/
  2. Restart ComfyUI โ€” the conversion runs automatically on startup
  3. Check the console for [AceStep SFT] Converted PEFT/DoRA โ†’ ComfyUI: ... message
  4. The converted file appears as *_comfyui.safetensors in the dropdown

Slow Generation

Solution: Reduce batch_size, lower steps to ~20, or use "karras" scheduler

๐Ÿ“Š Guidance Modes Comparison

AspectAPGADGStandard CFG
Qualityโญโญโญโญโญโญโญโญโญโญโญโญ
Stabilityโญโญโญโญโญโญโญโญโญโญโญ
DynamicsNaturalAggressivePredictable
ComputationNormalNormalMinimal
Recommendedโœ… YesFor extreme stylesBaseline

๐ŸŽš๏ธ Quality Tips

  • Use guidance_mode=apg with steps=50 to 64 for best quality
  • For img2img refinement, start with denoise=0.5 to 0.7 to preserve the original character
  • Mild vocal hiss is usually a generation artifact; APG and slightly higher step counts generally help more than raw cfg
  • Simplify overly dense or contradictory tags for cleaner results

๐Ÿ“š Technical References

  • AceStep 1.5: ICML 2024 (Learning Universal Features for Efficient Audio Generation)
  • Flow Matching: Liphardt et al. 2024 (Generative Modeling by Estimating Gradients of the Data Distribution)
  • APG/ADG: Techniques aligned with official AceStep paper
  • ComfyUI: Modular node graph architecture for batch generation

๐Ÿ“ License

MIT License - Feel free to use in personal or commercial projects

๐Ÿค Contributing

Issues and PRs are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

โš ๏ธ Important Notes

  • Recommended maximum duration: 240 seconds (GPU memory)
  • Maximum batch size: Depends on your GPU (start with 1-2)
  • SFT models: These models are specific to Supervised Fine-Tuning - not tested with non-SFT models
  • Rights and attribution: Respect model and dataset usage rights

Built on the AceStep SFT workflow and extended with modular nodes, advanced guidance, waveform visualization, and quality controls for ComfyUI.

For bugs, questions, or suggestions: open an issue on the repository! ๐ŸŽต

ComfyUI-AceStep SFT

License: MIT Python 3.8+

An all-in-one node for ComfyUI that implements AceStep 1.5 SFT (Supervised Fine-Tuning), a state-of-the-art music generation model. It starts from the official AceStep workflow and extends it with stronger conditioning control and practical ComfyUI-oriented quality options.

SFT = Supervised Fine-Tuning: A specialized version of AceStep optimized for generating superior quality audio through supervised training.

๐Ÿ“‹ Overview

This package currently provides four nodes under audio/AceStep SFT:

  • AceStep 1.5 SFT Generate: all-in-one generation, editing, and decoding
  • AceStep 1.5 SFT Music Analyzer: AI-powered audio analysis (tags, BPM, key/scale)
  • AceStep 1.5 SFT Lora Loader: chainable LoRA stack builder for AceStep 1.5 SFT
  • AceStep 1.5 SFT Turbo Tag Adapter: rewrites Turbo-oriented tags into shorter SFT-friendly prompt tags

The AceStepSFTGenerate node encapsulates the entire music generation workflow:

  1. Latent Creation - Generates initial latents or loads from latent_or_audio input
  2. Text Encoding - Processes captions, lyrics, and metadata via multiple CLIP encoders
  3. Diffusion Sampling - Runs the diffusion model with advanced guidance control
  4. Audio Decoding - Converts latents to high-quality audio via VAE

Example Configuration

AceStep SFT Node Configuration

๐ŸŽฏ Key Features

โœจ Advanced Guidance

The node supports three classifier-free guidance modes, each with unique characteristics:

  • APG (Adaptive Projected Guidance) โญ Recommended

    • Dynamic adaptation via momentum buffering
    • Gradient clipping with adaptive thresholds
    • Orthogonal projection to eliminate unwanted noise
    • AceStep SFT Default - best quality and stability balance
  • ADG (Angle-based Dynamic Guidance)

    • Angle-based guidance between conditions
    • Operates in velocity space (flow matching)
    • Ideal for aggressive style distortion
    • Adaptive clipping based on angle between x0_cond and x0_uncond
  • Standard CFG

    • Traditional Classifier-Free Guidance
    • Simple and predictable implementation
    • Useful as a comparison baseline

๐ŸŽต Intelligent Metadata Processing

  • Auto-Duration: Automatically estimates music duration by analyzing lyric structure
  • LLM Encoding: Use Qwen LLM (0.6B or 1.7B/4B) to generate semantic audio codes
  • Auto Values: BPM, Time Signature, and Key/Scale automatic (model decides)
  • Multilingual Support: Over 23 languages supported

๐ŸŽง AI Music Analyzer

  • Audio Tag Extraction: Uses the native ACE-Step Transcriber to extract lyric, vocal, and song-structure tags from audio
  • BPM Detection: Automatic tempo detection via librosa
  • Key/Scale Detection: Detects musical key and scale (e.g. "G minor")
  • JSON Output: Structured music_infos output with all analysis results
  • Generation Parameters: Control temperature, top_p, top_k, repetition_penalty, and seed
  • Auto Model Download: Models are downloaded on first use (~1-7 GB each)

Native Analysis Model:

ModelSizeTypeBest For
ACE-Step-Transcriber22.4 GB downloadAudio-to-TextNative ACE-Step 1.5 transcription for lyrics, singing voice, structure tags, and instrument hints

This node is now dedicated to the native ACE-Step-Transcriber workflow. It uses the model's native prompt format, structured transcription output, and derives tags from language, lyrics, section markers such as verse/chorus/bridge, and optional instrument annotations.

๐Ÿ”„ Audio Refinement (img2img)

  • Latent-based Refinement: Use denoise < 1.0 with latent_or_audio connected to refine existing audio
  • Accepts AUDIO or LATENT: Connect any audio or latent output for img2img-style editing
  • Batch Generation: Generate multiple variations in parallel

๐Ÿง  Extended Conditioning Control

  • Split Text/Lyric Guidance: Independent guidance_scale_text and guidance_scale_lyric
  • Omega Scale: Mean-preserving output reweighting to approximate AceStep scheduler behavior
  • ERG Approximation: Node-local prompt energy reweighting via erg_scale
  • Guidance Interval Decay: Smoothly decay guidance inside the active interval

๐ŸŽš๏ธ AceStep LoRA Workflow

  • Chainable LoRA Loader: Stack one or more AceStep LoRAs before generation
  • Separate strengths: Independent strength_model and strength_clip
  • Single Generate input: Final LoRA stack plugs into the lora input on Generate
  • Local Loras/ folder: Drop LoRA files directly into the node's Loras/ folder โ€” they are automatically registered at startup
  • Auto PEFT/DoRA conversion: PEFT-format LoRAs (adapter_config.json + adapter_model.safetensors) placed in Loras/ are automatically converted to ComfyUI format on first startup
  • DoRA support: Full DoRA (Weight-Decomposed Low-Rank Adaptation) support with automatic dora_scale dimension fix for ComfyUI compatibility

๐Ÿ› ๏ธ Latent Post-processing

  • Latent Shift: Additive anti-clipping correction
  • Latent Rescale: Multiplicative scaling for dynamic control

๐Ÿ“ฆ Installation

Prerequisites

  • ComfyUI installed and functional
  • CUDA/GPU or equivalent (modern processors)
  • Recommended for better output quality (based on practical testing): use the merged SFT+Turbo model.
  • Required model files:
    • Diffusion model (DiT): acestep_v1.5_sft.safetensors
    • Text Encoders: qwen_0.6b_ace15.safetensors, qwen_1.7b_ace15.safetensors (or 4B)
    • VAE: ace_1.5_vae.safetensors

Download Model Files

Download the required models from HuggingFace:

  1. Diffusion Model (Recommended: merged SFT+Turbo):
  1. Alternative Diffusion Model (official SFT):

  2. Text Encoders (choose any versions):

    • Text Encoders Collection
      • qwen_0.6b_ace15.safetensors (caption processing)
      • qwen_1.7b_ace15.safetensors or qwen_4b_ace15.safetensors (audio code generation)
  3. VAE (Audio codec):

Installation Steps

  1. Clone the repository to your custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/jeankassio/ComfyUI-AceStep_SFT.git
  1. Place model files in the appropriate directories:
ComfyUI/models/diffusion_models/     # AceStep 1.5 SFT model
ComfyUI/models/text_encoders/        # Qwen encoders
ComfyUI/models/vae/                  # VAE
ComfyUI/models/loras/                # Optional AceStep 1.5 LoRAs
  1. (Optional) Place LoRAs in the local folder:
ComfyUI/custom_nodes/ComfyUI-AceStep_SFT/Loras/   # Local LoRA folder

You can place LoRAs here in any of these formats:

  • ComfyUI format: Single .safetensors file (ready to use)
  • PEFT/DoRA format: A folder containing adapter_config.json + adapter_model.safetensors (auto-converted on startup)
  • Nested zip artifacts: If your zip extracted a folder-inside-folder, the node detects this and fixes it automatically
  1. Restart ComfyUI - the node will appear under audio/AceStep SFT

๐Ÿงฉ Available Nodes

AceStep 1.5 SFT Generate

Main all-in-one node for text-to-music generation, latent-based audio refinement, and VAE decoding.

AceStep 1.5 SFT Music Analyzer

AI-powered audio analysis node that extracts descriptive tags, BPM, and key/scale from audio input.

Inputs:

  • audio: Audio input to analyze
  • model: AI model selection (9 models, auto-downloaded)
  • get_tags / get_bpm / get_keyscale: Enable/disable each analysis
  • max_new_tokens: Maximum tokens for generative models
  • audio_duration: Max seconds of audio to analyze
  • temperature, top_p, top_k, repetition_penalty, seed: Generation parameters
  • unload_model: Free VRAM after analysis
  • use_flash_attn: Enable Flash Attention 2 (if compatible)

Outputs:

  • tags: Comma-separated descriptive tags (STRING)
  • bpm: Detected BPM as string e.g. "129bpm" (STRING)
  • keyscale: Key and scale e.g. "G minor" (STRING)
  • music_infos: JSON with all results (STRING)

AceStep 1.5 SFT Lora Loader

Chainable utility node that builds a LoRA stack for AceStep 1.5 SFT.

Inputs:

  • lora_name: LoRA file from ComfyUI/models/loras or the local Loras/ folder
  • strength_model: strength applied to the diffusion model
  • strength_clip: strength applied to the text encoder stack
  • lora (optional): upstream AceStep LoRA stack

Output:

  • lora: connect to another Lora Loader or directly into Generate

Supported LoRA Formats

FormatWhat to place in Loras/Action
ComfyUI .safetensorsSingle fileUsed directly
PEFT/DoRA directoryFolder with adapter_config.json + adapter_model.safetensorsAuto-converted to *_comfyui.safetensors on startup
Nested zip artifactFolder containing a .safetensors insideAuto-extracted to root on startup

The auto-conversion handles:

  • Key remapping: lora_A/lora_B โ†’ lora_down/lora_up
  • DoRA support: lora_magnitude_vector โ†’ dora_scale (with correct 2D shape)
  • Per-layer alpha injection from adapter_config.json (supports alpha_pattern and rank_pattern)

๐ŸŽ›๏ธ Node Parameters

Required Parameters

ParameterRangeDescription
diffusion_model-Path to DiT model (AceStep 1.5 SFT)
text_encoder_1-Qwen3 0.6B Encoder (caption processing)
text_encoder_2-Qwen3 1.7B/4B Encoder (audio code generation)
vae_name-AceStep 1.5 VAE
caption-Text description of music (genre, mood, instruments)
lyrics-Song lyrics or [Instrumental]
instrumentalbooleanForce instrumental mode (overrides lyrics)
seed0 - 2642^{64}Seed for reproducibility
steps1 - 200Diffusion inference steps (default: 50 for ACE-Step 1.5 SFT)
cfg1.0 - 20.0Classifier-free guidance scale (default: 7.0; typical 7.0-9.0 for ACE-Step 1.5)
sampler_name-Sampler (euler, dpmpp, etc.)
scheduler-Scheduler (normal, karras, exponential, etc.; default: normal)
denoise0.0 - 1.0Denoising strength (1.0 = fresh generation, < 1.0 = editing)
infer_methodode/sdeODE keeps the selected sampler behavior; SDE remaps default Euler/Heun choices to a stochastic sampler
guidance_modeapg/adg/standard_cfgGuidance type (default: apg)
duration0.0 - 600.0Duration in seconds (default: 60.0, 0 = auto)
bpm0 - 300Beats per minute (0 = auto, model decides)
timesignatureauto/2/3/4/6Time signature numerator
language-Lyric language (en, ja, zh, es, pt, etc.)
keyscaleauto/...Key and scale (e.g., "C major" or "D minor")

Optional Parameters

Batch Generation

  • batch_size (1-16): Number of audios to generate in parallel

Audio Input

  • latent_or_audio: Base input for refinement (img2img). Accepts AUDIO or LATENT. Use denoise < 1.0 to refine this input. With duration=0, duration is derived from the connected input.
  • lora: AceStep LoRA stack from one or more AceStep 1.5 SFT Lora Loader nodes

LLM Configuration (Audio Code Generation)

  • generate_audio_codes (default: True): Enable/disable LLM audio code generation for semantic structure
  • lm_cfg_scale (0.0-100.0, default: 2.0): LLM classifier-free guidance scale
  • lm_temperature (0.0-2.0, default: 0.85): LLM sampling temperature
  • lm_top_p (0.0-2000.0, default: 0.9): Nucleus sampling parameter
  • lm_top_k (0-100, default: 0): Top-k sampling
  • lm_min_p (0.0-1.0, default: 0.0): Minimum probability threshold
  • lm_negative_prompt: Negative prompt for LLM CFG

Latent Post-processing

  • latent_shift (-0.2-0.2, default: 0.0): Additive shift (anti-clipping)
  • latent_rescale (0.5-1.5, default: 1.0): Multiplicative scaling
  • normalize_peak (default: False): Legacy hard normalization to 0 dBFS after VAE decode
  • enable_normalization (default: True): Peak-normalize output to a target dBFS level
  • normalization_db (-10.0-0.0, default: -1.0): Target peak level when normalization is enabled
  • fade_in_duration / fade_out_duration (0.0-10.0, default: 0.0): Optional linear fades after normalization
  • use_tiled_vae (default: True): Uses tiled VAE encode/decode for better long-audio and low-VRAM robustness
  • voice_boost (-12.0-12.0, default: 0.0): Simple output gain in dB before normalization

APG Configuration

  • apg_momentum (-1.0-1.0, default: -0.75): Momentum buffer coefficient
  • apg_norm_threshold (0.0-10.0, default: 2.5): Norm threshold for gradient clipping

Extended Guidance Controls

  • guidance_interval (-1.0-1.0, default: 0.5): Official centered guidance interval control
  • guidance_interval_decay (0.0-1.0, default: 0.0): Linear decay inside the active guidance interval
  • min_guidance_scale (0.0-30.0, default: 3.0): Lower bound when interval decay is enabled
  • guidance_scale_text (-1.0-30.0, default: -1.0): Text-only guidance scale, -1 inherits cfg
  • guidance_scale_lyric (-1.0-30.0, default: -1.0): Lyric-only delta guidance scale, -1 inherits cfg
  • omega_scale (-8.0-8.0, default: 0.0): Mean-preserving output reweighting
  • erg_scale (-0.9-2.0, default: 0.0): Prompt/lyric conditioning energy reweighting

Guidance Interval

  • cfg_interval_start (0.0-1.0, default: 0.0): Start applying guidance at this schedule fraction
  • cfg_interval_end (0.0-1.0, default: 1.0): Stop applying guidance at this schedule fraction

Custom Timesteps

  • shift (1.0-5.0, default: 3.0): Schedule shift (3.0 = Gradio default)
  • custom_timesteps: Custom comma-separated timesteps (overrides steps, shift, scheduler)

๐Ÿ” How It Works - Technical Foundation

1. Latent Pipeline

The node automatically manages latent creation or reuse:

โ”œโ”€ If latent_or_audio provided:
โ”‚  โ”œโ”€ AUDIO: Resamples to VAE SR (48kHz), normalizes channels, encodes via VAE
โ”‚  โ”œโ”€ LATENT: Uses directly as latent_image
โ”‚  โ””โ”€ Duration derived from input when duration=0
โ”‚
โ””โ”€ If no latent_or_audio:
   โ””โ”€ Creates zero latent (pure noise) [batch_size, 64, latent_length]

Automatic Sizing: Duration in seconds is converted to latent length via:

latent_length = max(10, round(duration * vae_sample_rate / 1920))

2. Auto-Duration Estimation

When duration <= 0, the node analyzes lyric structure:

[Intro/Outro] = 8 beats (~1 bar 4/4)
[Instrumental/Solo] = 16 beats (~2 bars 4/4)  
Verse/Chorus โ†’ ~2 beats per 2 words (typical singing rate)
Section transitions = 4 beats
Empty lines = 2 beats (pause)

Result: duration = beats * (60 / bpm)

3. Metadata Processing

Metadata (bpm, duration, key/scale, time sig) are encoded in multiple representations:

  1. Structured YAML (Chain-of-Thought):
bpm: 120
caption: "upbeat electronic dance"
duration: 120
keyscale: "G major"
language: "en"
timesignature: 4
  1. LLM Template (for audio code generation via Qwen):
<|im_start|>system
# Instruction
Generate audio semantic tokens...
<|im_end|>
<|im_start|>user
# Caption
upbeat electronic dance

# Lyric
[Verse 1]...
<|im_end|>
<|im_start|>assistant
<think>
{YAML above}
</think>

<|im_end|>
  1. Qwen3-0.6B Template (direct metadata):
# Instruction
# Caption
upbeat electronic dance

# Metas
- bpm: 120
- timesignature: 4
- keyscale: G major
- duration: 120 seconds
<|endoftext|>

4. Guidance Strategy

# Phase 1: Compute conditional difference
diff = pred_cond - pred_uncond

# Phase 2: Apply smooth momentum
if momentum_buffer:
    diff = momentum * running_avg + diff

# Phase 3: Norm clipping
norm = ||diff||โ‚‚
scale = min(1, norm_threshold / norm)
diff = diff * scale

# Phase 4: Orthogonal decomposition
diff_parallel = projection of diff onto pred_cond
diff_orthogonal = diff - diff_parallel

# Phase 5: Final guidance
guidance = pred_cond + (cfg_scale - 1) * (diff_orthogonal + eta * diff_parallel)

Why It Works:

  • Orthogonal projection removes collinear components that amplify noise
  • Momentum smooths large jumps between timesteps
  • Adaptive clipping prevents gradient explosion
  • Result: cleaner and more stable audio

ADG (Angle-based Dynamic Guidance)

# Based on cosine angles between x0_cond and x0_uncond
# Dynamically adjusts guidance based on alignment
# Uses trigonometry for aggressive style deformation

5. Latent Refinement (img2img)

When latent_or_audio is connected with denoise < 1.0, the node operates in img2img mode:

  • The input audio is encoded via VAE (or the latent is used directly)
  • A fraction of noise is added based on denoise strength
  • The diffusion model refines the noisy latent while preserving the original structure

๐ŸŽš๏ธ Quality Tips

  • Use guidance_mode=apg with steps=50 to 64 for best quality
  • For img2img refinement, start with denoise=0.5 to 0.7 to preserve the original character
  • Mild vocal hiss is usually a generation artifact; APG and slightly higher step counts generally help more than raw cfg
  • Simplify overly dense or contradictory tags for cleaner results

๐Ÿ“Š Guidance Modes Comparison

AspectAPGADGStandard CFG
Qualityโญโญโญโญโญโญโญโญโญโญโญโญ
Stabilityโญโญโญโญโญโญโญโญโญโญโญ
DynamicsNaturalAggressivePredictable
ComputationNormalNormalMinimal
Recommendedโœ… YesFor extreme stylesBaseline

๐ŸŽจ Workflow Examples

AceStepSFTGenerate:
  caption: "upbeat electronic dance music with synthesizers"
  lyrics: [Instrumental]
  instrumental: True
  duration: 60.0
  cfg: 7.0
  steps: 50
  sampler_name: "euler"
  scheduler: "normal"
  guidance_mode: "apg"
  โ†’ Generates a strong 60s ACE-Step 1.5 SFT baseline render

Example 2: Audio Refinement (img2img)

AceStepSFTGenerate:
  latent_or_audio: (mixer output)
  caption: "make it more orchestral"
  denoise: 0.7 (preserves 30% of source)
  duration: 0 (uses input duration)
  โ†’ Refines audio while preserving original characteristics

Example 3: Batch Generation with Varied Seeds

AceStepSFTGenerate:
  batch_size: 4
  seed: 42 (varies automatically)
  โ†’ Creates 4 variations with similar characteristics

Example 4: Chained LoRAs

AceStep 1.5 SFT Lora Loader:
  lora_name: "Ace-Step1.5/ace-step15-style1.safetensors"
  strength_model: 0.7
  strength_clip: 0.0
  โ†“
AceStep 1.5 SFT Lora Loader:
  lora_name: "Ace-Step1.5/Ace-Step1.5-TechnoRain.safetensors"
  strength_model: 0.35
  strength_clip: 0.0
  โ†“
AceStep 1.5 SFT Generate:
  lora: (stack output)

Note: AceStep LoRAs are now supported directly by this package. If a specific LoRA produces unstable audio, start by lowering strength_model and compare apg against standard_cfg.

Example 5: Music Analysis โ†’ Generation Pipeline

AceStepSFTMusicAnalyzer:
  audio: (input audio file)
  model: "Qwen2-Audio-7B-Instruct"
  โ†’ tags: "dancehall beat, powerful bassline, vocal samples, melancholic"
  โ†’ bpm: "129bpm"
  โ†’ keyscale: "G minor"
  โ†“
AceStepSFTGenerate:
  caption: (tags from analyzer)
  bpm: 129
  keyscale: "G minor"
  โ†’ Generates new music matching the analyzed style

๐Ÿ› Troubleshooting

Audio Distortion/Clipping

Solution: Use negative latent_shift (e.g., -0.1) to reduce amplitude before VAE decoding

High Variance Results

Solution: Increase apg_norm_threshold (e.g., 3.0-4.0) for more gradient clipping

Lower Than Expected Quality

Solution:

  1. Use guidance_mode: "apg" (recommended)
  2. Start from steps: 50, cfg: 7.0, sampler_name: "euler", scheduler: "normal", infer_method: "ode"
  3. Keep enable_normalization: True with normalization_db: -1.0 for cleaner final level management

LoRA Sounds Deformed or Overcooked

Solution:

  1. Lower strength_model first, e.g. 0.2 to 0.6
  2. Set strength_clip to 0.0 unless the LoRA explicitly targets the text encoders
  3. Compare guidance_mode: "standard_cfg" vs "apg" for that LoRA
  4. Avoid stacking multiple strong LoRAs at full strength

LoRA Dimension Mismatch Error (The size of tensor a must match...)

Cause: DoRA LoRAs store dora_scale as a 1D tensor [N]. ComfyUI's weight_decompose divides it by weight_norm [N,1], which causes PyTorch to broadcast [1,N]/[N,1] โ†’ [N,N] instead of the expected [N,1].

Solution: This is automatically fixed by the node โ€” all dora_scale tensors are unsqueezed to 2D [N,1] at load time. If you still see this error, ensure you are using the latest version of this node.

PEFT/DoRA LoRA Not Showing in Dropdown

Solution:

  1. Place the PEFT folder (containing adapter_config.json + adapter_model.safetensors) inside ComfyUI-AceStep_SFT/Loras/
  2. Restart ComfyUI โ€” the conversion runs automatically on startup
  3. Check the console for [AceStep SFT] Converted PEFT/DoRA โ†’ ComfyUI: ... message
  4. The converted file appears as *_comfyui.safetensors in the dropdown

Slow Generation

Solution: Reduce batch_size, lower steps to ~20, or use "karras" scheduler

๐Ÿ“š Technical References

  • AceStep 1.5: ICML 2024 (Learning Universal Features for Efficient Audio Generation)
  • Flow Matching: Liphardt et al. 2024 (Generative Modeling by Estimating Gradients of the Data Distribution)
  • APG/ADG: Techniques aligned with official AceStep paper
  • ComfyUI: Modular node graph architecture for batch generation

๐Ÿ“ License

MIT License - Feel free to use in personal or commercial projects

๐Ÿค Contributing

Issues and PRs are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

โš ๏ธ Important Notes

  • Recommended maximum duration: 240 seconds (GPU memory)
  • Maximum batch size: Depends on your GPU (start with 1-2)
  • SFT models: These models are specific to Supervised Fine-Tuning - not tested with non-SFT models
  • Rights and attribution: Respect model and dataset usage rights

Built on the AceStep SFT workflow and extended with advanced guidance and quality controls for ComfyUI.

For bugs, questions, or suggestions: open an issue on the repository! ๐ŸŽต