Load the default speaker profile

May 20, 2025 ยท View on GitHub

Interface Documentation

import outetts

# Initialize the interface
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        # For llama.cpp backend
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16
        # For transformers backend
        # backend=outetts.Backend.HF,
    )
)

# Load the default speaker profile
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

# Or create your own speaker profiles in seconds and reuse them instantly
# speaker = interface.create_speaker("path/to/audio.wav")
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Generate speech
output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        generation_type=outetts.GenerationType.CHUNKED,
        speaker=speaker,
        sampler_config=outetts.SamplerConfig(
            temperature=0.4
        ),
    )
)

# Save to file
output.save("output.wav")

Configuration Options

Model Configuration

You can configure the model in two ways:

config = outetts.ModelConfig.auto_config(
    model=outetts.Models.VERSION_1_0_SIZE_1B,
    backend=outetts.Backend.LLAMACPP,
    quantization=outetts.LlamaCppQuantization.FP16
)

2. Manual Configuration

config = outetts.ModelConfig(
    model_path="OuteAI/Llama-OuteTTS-1.0-1B",
    tokenizer_path="OuteAI/Llama-OuteTTS-1.0-1B",
    interface_version=outetts.InterfaceVersion.V3,
    backend=outetts.Backend.HF,
    additional_model_config={
        "attn_implementation": "flash_attention_2"  # Enable flash attention if compatible
    },
    device="cuda",
    dtype=torch.bfloat16
)

Available Models

ModelDescription
VERSION_0_1_SIZE_350MOuteTTS v0.1 (350M parameters)
VERSION_0_2_SIZE_500MOuteTTS v0.2 (500M parameters)
VERSION_0_3_SIZE_500MOuteTTS v0.3 (500M parameters)
VERSION_0_3_SIZE_1BOuteTTS v0.3 (1B parameters)
VERSION_1_0_SIZE_1BLlama-OuteTTS v1.0 (1B parameters)
VERSION_1_0_SIZE_0_6BOuteTTS v1.0 (0.6B parameters)

Backend Options

BackendDescription
Backend.HFHugging Face Transformers backend
Backend.LLAMACPPLLAMA.cpp backend (for GGUF models)
Backend.EXL2EXL2 backend
Backend.EXL2ASYNCEXL2 backend with asynchronous batch processing
Backend.VLLMVLLM backend for efficient batch processing (Experimental)
Backend.LLAMACPP_SERVERDirect connection to a running LLAMA.cpp server (synchronous stream)
Backend.LLAMACPP_ASYNC_SERVERDirect connection to a running LLAMA.cpp server (asynchronous batch)

Note on VLLM Backend: The Backend.VLLM option is currently experimental and may sometimes result in missing audio chunks or static noise during generation. To enable this backend, you must set the environment variable OUTETTS_ALLOW_VLLM=1 before running your application.

Interface Versions

OuteTTS supports different interface versions for different model generations:

Interface VersionSupported Models
InterfaceVersion.V1OuteTTS v0.1 models (VERSION_0_1_SIZE_350M)
InterfaceVersion.V2OuteTTS v0.2 and v0.3 models (VERSION_0_2_SIZE_500M, VERSION_0_3_SIZE_500M, VERSION_0_3_SIZE_1B)
InterfaceVersion.V3OuteTTS v1.0 models (VERSION_1_0_SIZE_1B)

LLAMA.cpp Quantization Options

For the LLAMA.cpp backend, you can choose from multiple quantization options:

quantization=outetts.LlamaCppQuantization.FP16  # No quantization
quantization=outetts.LlamaCppQuantization.Q8_0  # 8-bit quantization
# Other options: Q6_K, Q5_K_S, Q5_K_M, Q5_1, Q5_0, Q4_K_S, Q4_K_M, Q4_1, Q4_0, Q3_K_S, Q3_K_M, Q3_K_L, Q2_K

Speaker Profiles

Using Default Speakers

# List available default speakers
interface.print_default_speakers()

# Load a default speaker
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

Creating Custom Speaker Profiles

To create a speaker profile, point to your audio file. The system will use the Whisper model to transcribe the audio and audio codec model to generate a profile. Once done, save the speaker to a JSON file so you can reuse it later.

Note

This step only needs to be run once per speaker. It is meant for initial profile creation.

speaker = interface.create_speaker("path/to/audio.wav")
interface.save_speaker(speaker, "my_speaker.json")

You can then load your custom speaker from the JSON file at any time:

speaker = interface.load_speaker("my_speaker.json")

For older model versions, you can also provide a transcript:

speaker = interface.create_speaker(
    audio_path="path/to/audio.wav",
    transcript="What is being said in the audio",
    whisper_model="turbo",
    whisper_device="cuda"
)

Decoding and Saving Speaker Audio

For V3 interface models, you can decode and save the speaker audio:

interface.decode_and_save_speaker(speaker, "speaker_audio.wav")

Generation Options

Generation Types

TypeDescription
GenerationType.REGULARStandard generation for shorter texts
GenerationType.CHUNKEDSplits text into chunks for better processing of longer texts (recommended for single-text generation)
GenerationType.GUIDED_WORDSExperimental guided generation by words
GenerationType.STREAMStreaming generation (Not Implemented)
GenerationType.BATCHOptimized batch generation for processing multiple text chunks simultaneously (requires specific backends)

Sampler Configuration

You can customize the generation parameters:

sampler_config = outetts.SamplerConfig(
    temperature=0.4,
    repetition_penalty=1.1,
    top_k=40,
    top_p=0.9,
    min_p=0.05,
    mirostat=False,
    mirostat_tau=5,
    mirostat_eta=0.1
)

Complete Generation Example

output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing today? I hope you're having a wonderful day!",
        speaker=speaker,
        generation_type=outetts.GenerationType.CHUNKED,
        sampler_config=outetts.SamplerConfig(
            temperature=0.4,
            top_k=40,
            top_p=0.9
        ),
        max_length=8192
    )
)

# Play the audio
output.play()

# Save to file
output.save("output.wav")

Batch Generation

For backends optimized for batch processing (EXL2ASYNC, VLLM, LLAMACPP_ASYNC_SERVER), you can leverage GenerationType.BATCH. This mode processes multiple text chunks in parallel, which can significantly improve throughput for longer texts or when generating multiple audio outputs.

When using these backends, GenerationType.BATCH is the only supported type and will be automatically selected even if a different type is specified in GenerationConfig.

from outetts import Interface, ModelConfig, GenerationConfig, Backend, GenerationType

if __name__ == "__main__":
    # Initialize the interface with a batch-capable backend
    interface = Interface(
        ModelConfig(
            model_path="OuteAI/Llama-OuteTTS-1.0-0.6B-FP8", # Replace with your model path
            tokenizer_path="OuteAI/Llama-OuteTTS-1.0-0.6B", # Replace with your tokenizer path
            backend=Backend.VLLM
            # For EXL2, use backend=Backend.EXL2ASYNC + exl2_cache_seq_multiply={should be same as max_batch_size in GenerationConfig}
            # For LLAMACPP_ASYNC_SERVER, use backend=Backend.LLAMACPP_ASYNC_SERVER and provide server_host in GenerationConfig
        )
    )

    # Load your speaker profile
    speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL") # Or load/create custom speaker

    # Generate speech using BATCH type
    # Note: For EXL2ASYNC, VLLM, LLAMACPP_ASYNC_SERVER, BATCH is automatically selected.
    output = interface.generate(
        GenerationConfig(
            text="This is a longer text that will be automatically split into chunks and processed in batches.",
            speaker=speaker,
            generation_type=GenerationType.BATCH, # Can often be omitted for batch backends
            max_batch_size=32,       # Adjust based on your GPU memory and server capacity
            dac_decoding_chunk=2048, # Adjust chunk size for DAC decoding
            # If using LLAMACPP_ASYNC_SERVER, add:
            # server_host="http://localhost:8000" # Replace with your server address
        )
    )

    # Save to file
    output.save("output_batch.wav")

Advanced Configuration Options

Passing Custom Settings to Backends

OuteTTS allows you to pass custom settings directly to the underlying model backends:

Custom Model Initialization Parameters

The additional_model_config parameter lets you pass any backend-specific settings when initializing the model:

interface = outetts.Interface(
    config=outetts.ModelConfig(
        model_path="OuteAI/Llama-OuteTTS-1.0-1B",
        # ... other settings
        additional_model_config={
            "attn_implementation": "flash_attention_2",  # HF-specific setting
            "device_map": "auto"  # For multi-GPU setups
            # Any other backend-specific settings
        }
    )
)

These parameters are passed directly to the model initialization function of the corresponding backend.

Custom Generation Parameters

You can customize the generation process further using these options:

output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, this is a test",
        speaker=speaker,
        # Standard generation options
        generation_type=outetts.GenerationType.CHUNKED,
        sampler_config=outetts.SamplerConfig(temperature=0.4),
        max_length=8192,
        # Custom backend-specific generation settings
        additional_gen_config={
            "frequency_penalty": 1.0,
            "presence_penalty": 0.5,
            # Any other backend-specific generation parameters
        }
    )
)

The additional_gen_config dictionary is passed directly to the model's generation function, allowing you to use any parameters supported by the specific backend.

Advanced Usage

Using with Hugging Face Models and Flash Attention

import outetts
import torch

interface = outetts.Interface(
    config=outetts.ModelConfig(
        model_path="OuteAI/Llama-OuteTTS-1.0-1B",
        tokenizer_path="OuteAI/Llama-OuteTTS-1.0-1B",
        interface_version=outetts.InterfaceVersion.V3,
        backend=outetts.Backend.HF,
        additional_model_config={
            "attn_implementation": "flash_attention_2"  # Enable flash attention if compatible
        },
        device="cuda",
        dtype=torch.bfloat16
    )
)

Determining Compatible Data Type

The library can automatically detect the best dtype for your hardware:

from outetts.models.config import get_compatible_dtype
import torch

dtype = get_compatible_dtype()  # Returns torch.bfloat16, torch.float16, or torch.float32