ACE-Step Quickstart
April 3, 2026 ยท View on GitHub
In this example, we'll be training the ACE-Step audio generation model. SimpleTuner currently supports the original ACE-Step v1 3.5B path plus forward-compatible LoRA training for the ACE-Step v1.5 bundle.
Overview
ACE-Step is a transformer-based flow-matching audio model designed for high-quality synthesis. In SimpleTuner:
basetargets the original ACE-Step v1 3.5B training path.v15-turbo,v15-base, andv15-sfttarget the ACE-Step v1.5 bundle variants loaded fromACE-Step/Ace-Step1.5.
Hardware Requirements
ACE-Step is a 3.5B parameter model, making it relatively lightweight compared to large image generation models like Flux.
- Minimum: NVIDIA GPU with 12GB+ VRAM (e.g., 3060, 4070).
- Recommended: NVIDIA GPU with 24GB+ VRAM (e.g., 3090, 4090, A10G) for larger batch sizes.
- Mac: Supported via MPS on Apple Silicon (Requires ~36GB+ Unified Memory).
Storage Requirements
โ ๏ธ Disk Usage Warning: The VAE cache for audio models can be substantial. For example, a single 60-second audio clip can result in a ~89MB cached latent file. This caching strategy is used to drastically reduce VRAM requirements during training. Ensure you have sufficient disk space for your dataset's cache.
๐ก Tip: For larger datasets, you can use the
--vae_cache_disableoption to disable writing embeddings to disk. This will implicitly enable on-demand caching, which saves disk space but will increase training time and memory usage as encodings are performed during the training loop.
๐ก Tip: Using
int8-quantoquantization allows training on GPUs with less VRAM (e.g., 12GB-16GB) with minimal quality loss.
Prerequisites
Ensure you have a working Python 3.10+ environment.
pip install simpletuner
Configuration
It is recommended to keep your configurations organized. We'll create a dedicated folder for this demo.
mkdir -p config/acestep-training-demo
Critical Settings
SimpleTuner currently supports these ACE-Step flavours:
base: original ACE-Step v1 3.5Bv15-turbo,v15-base,v15-sft: ACE-Step v1.5 bundle variants
Use the matching config for your target variant.
Ready-made example presets are available at:
simpletuner/examples/ace_step-v1-0.peft-lorasimpletuner/examples/ace_step-v1-5.peft-lora
You can launch them directly with simpletuner train example=ace_step-v1-0.peft-lora or simpletuner train example=ace_step-v1-5.peft-lora.
ACE-Step v1 example
Create config/acestep-training-demo/config.json with these values:
View example config
{
"model_family": "ace_step",
"model_type": "lora",
"model_flavour": "base",
"pretrained_model_name_or_path": "ACE-Step/ACE-Step-v1-3.5B",
"resolution": 0,
"mixed_precision": "bf16",
"base_model_precision": "int8-quanto",
"data_backend_config": "config/acestep-training-demo/multidatabackend.json"
}
ACE-Step v1.5 example
For ACE-Step v1.5, keep model_family: "ace_step" but select a v1.5 flavour and point the checkpoint root at the shared v1.5 bundle:
View example config
{
"model_family": "ace_step",
"model_type": "lora",
"model_flavour": "v15-base",
"pretrained_model_name_or_path": "ACE-Step/Ace-Step1.5",
"trust_remote_code": true,
"resolution": 0,
"mixed_precision": "bf16",
"base_model_precision": "int8-quanto",
"data_backend_config": "config/acestep-training-demo/multidatabackend.json"
}
Validation Settings
Add these to your config.json to monitor progress:
validation_prompt: A text description of the audio you want to generate (e.g., "A catchy pop song with upbeat drums").validation_lyrics: (Optional) Lyrics for the model to sing.validation_audio_duration: Duration in seconds for validation clips (default: 30.0).validation_guidance: Guidance scale (default: ~3.0 - 5.0).validation_step_interval: How often to generate samples (e.g., every 100 steps).
โน๏ธ ACE-Step v1.5 note: SimpleTuner now supports built-in v1.5 validation renders for prompt + optional lyrics conditioning. Loading the upstream v1.5 repository still requires
trust_remote_code: true, and more advanced upstream editing/inference workflows are not exposed through the SimpleTuner validation pipeline yet.
Advanced Experimental Features
Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance.
- Scheduled Sampling (Rollout): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.
โ ๏ธ These features increase the computational overhead of training.
Dataset Configuration
ACE-Step requires an audio-specific dataset configuration.
Option 1: Demo Dataset (Hugging Face)
For a quick start, you can use the prepared ACEStep-Songs preset.
Create config/acestep-training-demo/multidatabackend.json:
View example config
[
{
"id": "acestep-demo-data",
"type": "huggingface",
"dataset_type": "audio",
"dataset_name": "Yi3852/ACEStep-Songs",
"metadata_backend": "huggingface",
"caption_strategy": "huggingface",
"cache_dir_vae": "cache/vae/{model_family}/acestep-demo-data"
},
{
"id": "text-embeds",
"dataset_type": "text_embeds",
"default": true,
"type": "local",
"cache_dir": "cache/text/{model_family}"
}
]
See caption_strategy options and requirements in DATALOADER.md.
Option 2: Local Audio Files
Create config/acestep-training-demo/multidatabackend.json:
View example config
[
{
"id": "my-audio-dataset",
"type": "local",
"dataset_type": "audio",
"instance_data_dir": "datasets/my_audio_files",
"caption_strategy": "textfile",
"metadata_backend": "discovery",
"disabled": false
},
{
"id": "text-embeds",
"dataset_type": "text_embeds",
"default": true,
"type": "local",
"cache_dir": "cache/text/{model_family}"
}
]
Data Structure
Place your audio files in datasets/my_audio_files. SimpleTuner supports a wide range of formats including:
- Lossless:
.wav,.flac,.aiff,.alac - Lossy:
.mp3,.ogg,.m4a,.aac,.wma,.opus
โน๏ธ Note: To support formats like MP3, AAC, and WMA, you must have FFmpeg installed on your system.
For captions and lyrics, place corresponding text files next to your audio files:
- Audio:
track_01.wav - Caption (Prompt):
track_01.txt(Contains the text description, e.g., "A slow jazz ballad") - Lyrics (Optional):
track_01.lyrics(Contains the lyrics text)
Example dataset layout
datasets/my_audio_files/
โโโ track_01.wav
โโโ track_01.txt
โโโ track_01.lyrics
๐ก Advanced: If your dataset uses a different naming convention (e.g.
_lyrics.txt), you can customize this in your dataset config.
View custom lyrics filename example
"audio": {
"lyrics_filename_format": "{filename}_lyrics.txt"
}
โ ๏ธ Note on Lyrics: If a
.lyricsfile is not found for a sample, the lyric embeddings will be zeroed out. ACE-Step expects lyric conditioning; training heavily on data without lyrics (instrumentals) may require more training steps for the model to learn to generate high-quality instrumental audio with zeroed lyric inputs.
Training
Start the training run by specifying your environment:
simpletuner train env=acestep-training-demo
This command tells SimpleTuner to look for config.json inside config/acestep-training-demo/.
๐ก Tip (Continue Training): To continue fine-tuning from an existing LoRA (e.g. the official ACE-Step checkpoints or community adapters), use the
--init_loraoption:simpletuner train env=acestep-training-demo --init_lora=/path/to/existing_lora.safetensors
Training the Lyrics Embedder (upstream-style)
โน๏ธ Version note:
lyrics_embedder_traincurrently applies to the ACE-Step v1 training path. The v1.5 forward-compatible LoRA path in SimpleTuner is decoder-only.
The upstream ACE-Step trainer fine-tunes the lyrics embedder alongside the denoiser. To mirror that behaviour in SimpleTuner (full or standard LoRA only):
- Enable it:
lyrics_embedder_train: true - Optional overrides (otherwise the main optimizer/scheduler are reused):
lyrics_embedder_lrlyrics_embedder_optimizerlyrics_embedder_lr_scheduler
Example snippet:
View example config
{
"lyrics_embedder_train": true,
"lyrics_embedder_lr": 5e-5,
"lyrics_embedder_optimizer": "torch-adamw",
"lyrics_embedder_lr_scheduler": "cosine_with_restarts"
}
Troubleshooting
- Validation Errors: Ensure you are not trying to use image-centric validation features like
num_validation_images> 1 (conceptually mapped to batch size for audio) or image-based metrics (CLIP score). - Memory Issues: If running OOM, try reducing
train_batch_sizeor enablinggradient_checkpointing.
Migrating from Upstream Trainer
If you are coming from the original ACE-Step training scripts, here is how the parameters map to SimpleTuner's config.json:
| Upstream Parameter | SimpleTuner config.json | Default / Notes |
|---|---|---|
--learning_rate | learning_rate | 1e-4 |
--num_workers | dataloader_num_workers | 8 |
--max_steps | max_train_steps | 2000000 |
--every_n_train_steps | checkpointing_steps | 2000 |
--precision | mixed_precision | "fp16" or "bf16" (use "no" for fp32) |
--accumulate_grad_batches | gradient_accumulation_steps | 1 |
--gradient_clip_val | max_grad_norm | 0.5 |
--shift | flow_schedule_shift | 3.0 (Specific to ACE-Step) |
Converting Raw Data
If you have raw audio/text/lyrics files and want to use the Hugging Face dataset format (as used by the upstream convert2hf_dataset.py tool), you can use the resulting dataset directly in SimpleTuner.
The upstream converter produces a dataset with tags and norm_lyrics columns. To use these, configure your backend like this:
View example config
{
"type": "huggingface",
"dataset_type": "audio",
"dataset_name": "path/to/converted/dataset",
"config": {
"audio_caption_fields": ["tags"],
"lyrics_column": "norm_lyrics"
}
}