AI Provider Guides

Synthesize speech, transcribe audio, or run live voice sessions. Voice providers are separate from LLM providers — they handle audio I/O rather than text generation.

Text-to-Speech (TTS)

OpenAI TTS

Highest-quality text-to-speech

🎙️ Voices: alloy, echo, fable, onyx, nova, shimmer
🎵 Models: tts-1 (fast) and tts-1-hd (high quality)
🎼 Formats: MP3, WAV, OGG, Opus
🔑 Auth: API Key (OPENAI_API_KEY)

Setup Guide →

ElevenLabs

Best multilingual and voice-cloning TTS

🌍 Supports 30+ languages with natural prosody
🎭 Custom voice cloning from short audio samples
🎼 Formats: MP3, WAV (raw PCM, surfaced as pcm16), Opus (Ogg container)
🔑 Auth: API Key (ELEVENLABS_API_KEY)

Setup Guide →

Google TTS

1M characters/month free tier

💰 Generous free tier for standard voices
🌍 380+ voices across 50+ languages
🎼 Formats: MP3, WAV, OGG
🔑 Auth: Service Account

Setup Guide →

Azure TTS

Enterprise TTS with full SSML support

🏢 Fine-grained prosody control via SSML
🌍 400+ neural voices, 140+ languages
🎼 Formats: MP3, WAV (PCM), Opus (Ogg container)
🔑 Auth: API Key + Region

Setup Guide →

Fish Audio

Low-cost TTS with 15s voice cloning

💰 ~80% cheaper than ElevenLabs
🎭 15-second reference audio → custom voice
🌍 14 languages
🎼 Formats: MP3, WAV, PCM16 (raw)
🔑 Auth: API Key (FISH_AUDIO_API_KEY)

Setup Guide →

Cartesia

Low-latency Sonic models — synchronous + streaming

⚡ Sub-second turnaround on the synchronous /tts/bytes endpoint
🌊 Separate WebSocket streaming flow via CartesiaStream (voice server)
🎭 Voice cloning via dashboard upload
🎼 Formats: MP3 (44.1 kHz), WAV (PCM s16le @ 44.1 kHz), PCM16 (raw @ 24 kHz)
🔑 Auth: API Key (CARTESIA_API_KEY)

Setup Guide →
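
As a quick pre-flight check, the API key variables named on the cards above can be scanned before choosing a TTS provider. The sketch below is illustrative only: the provider labels and the `configuredTtsProviders` helper are assumptions, not part of the NeuroLink SDK. Google TTS and Azure TTS are left out because their cards do not name a single API key variable.

```typescript
// Environment variable names come from the cards above. The provider labels
// and this helper are hypothetical, not NeuroLink identifiers.
const TTS_KEY_VARS: Record<string, string> = {
  "OpenAI TTS": "OPENAI_API_KEY",
  "ElevenLabs": "ELEVENLABS_API_KEY",
  "Fish Audio": "FISH_AUDIO_API_KEY",
  "Cartesia": "CARTESIA_API_KEY",
};

type Env = Record<string, string | undefined>;

// Returns the providers whose API key is present and non-empty in `env`.
function configuredTtsProviders(env: Env): string[] {
  return Object.entries(TTS_KEY_VARS)
    .filter(([, varName]) => (env[varName] ?? "").trim() !== "")
    .map(([provider]) => provider);
}

// In a Node.js application you would pass process.env instead of a literal.
const found = configuredTtsProviders({
  ELEVENLABS_API_KEY: "example-key",
  OPENAI_API_KEY: "", // empty values count as missing
});
console.log(found); // ["ElevenLabs"]
```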

Speech-to-Text (STT)

Whisper (OpenAI)

Highest transcription accuracy

🎯 Best-in-class accuracy on diverse audio
🌍 Multilingual with automatic language detection
🎼 Formats: WAV, MP3, M4A, FLAC, OGG, OPUS, WEBM, MP4, MPEG, MPGA
🔑 Auth: API Key (OPENAI_API_KEY)

Setup Guide →

Deepgram

Real-time streaming transcription via WebSocket

⚡ Sub-300 ms word-level results over WebSocket
🌊 REST batch and WebSocket streaming modes
🎼 Formats: WAV, MP3, OGG, FLAC
🔑 Auth: API Key (DEEPGRAM_API_KEY)

Setup Guide →

Google STT

125+ languages with speaker diarization

🌍 Best fit for existing Google Cloud users
👥 Speaker diarization and multi-channel audio
🎼 Formats: WAV, FLAC, MP3, OGG
🔑 Auth: API Key (GOOGLE_AI_API_KEY / GEMINI_API_KEY) or Service Account (GOOGLE_APPLICATION_CREDENTIALS)

Setup Guide →

Azure STT

Enterprise STT with custom model training

🏢 Batch transcription and custom model support
🔒 Compliance controls for regulated industries
🎼 Formats: WAV (PCM), Ogg/Opus — convert MP3 to WAV first
🔑 Auth: API Key + Region

Setup Guide →
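
The format lists above differ per provider, and Azure STT in particular needs MP3 converted to WAV first. A minimal sketch of that check follows, assuming the file extension reflects the actual encoding. The provider keys and the `needsConversion` helper are hypothetical names used for illustration.

```typescript
// Accepted input formats, copied from the STT cards above.
// The provider keys are illustrative, not NeuroLink identifiers.
type SttProvider = "whisper" | "deepgram" | "google-stt" | "azure-stt";

const STT_FORMATS: Record<SttProvider, readonly string[]> = {
  whisper: ["wav", "mp3", "m4a", "flac", "ogg", "opus", "webm", "mp4", "mpeg", "mpga"],
  deepgram: ["wav", "mp3", "ogg", "flac"],
  "google-stt": ["wav", "flac", "mp3", "ogg"],
  "azure-stt": ["wav", "ogg", "opus"],
};

// True when the file extension is not in the provider's accepted list,
// meaning the audio should be transcoded before upload.
function needsConversion(provider: SttProvider, fileName: string): boolean {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  return !STT_FORMATS[provider].includes(ext);
}

console.log(needsConversion("azure-stt", "meeting.mp3")); // true
console.log(needsConversion("whisper", "meeting.mp3")); // false
```

The extension check is deliberately simple; a production check would inspect the container and codec rather than trust the file name.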

Realtime Voice

Realtime providers maintain a persistent bidirectional WebSocket connection, enabling low-latency spoken conversation with the AI model.

OpenAI Realtime

Low-latency bidirectional voice over WebSocket

⚡ Full-duplex audio stream with GPT-4o
🎵 Voice activity detection (VAD) built-in
🎼 Formats: WAV, Opus
🔑 Auth: API Key (OPENAI_API_KEY)

Setup Guide →

Gemini Live

Google's native realtime voice API

⚡ Native multimodal realtime session with Gemini
🎵 Supports audio + video input simultaneously
🎼 Formats: WAV, Opus
🔑 Auth: API Key (GOOGLE_AI_API_KEY or GEMINI_API_KEY)

Setup Guide →

🎬 Video Generation

Image-to-video and text-to-video providers (use via output: { mode: "video" }):

Vertex Veo 3.1 (default) — --videoProvider vertex
Kling (PiAPI) — --videoProvider kling (details)
Runway (Gen-3 Alpha / Gen-4 Turbo) — --videoProvider runway
Replicate — Wan-Alpha + many others — --videoProvider replicate (guide)

See the Video Generation feature page for the full SDK / CLI surface.

👤 Avatar / Lip-Sync Generation

Talking-head video synthesis from a portrait image + audio (use via output: { mode: "avatar" }):

D-ID — --avatarProvider d-id (text-driven via Microsoft voices, or audio-driven)
HeyGen — --avatarProvider heygen (HeyGen avatar catalog id required)
Replicate (MuseTalk) — --avatarProvider replicate or musetalk (guide)

See docs/provider-integration/21-adding-new-modality.md for the architectural pattern.

🎵 Music / Sound Generation

Music + sound-effect generation (use via output: { mode: "music" }):

Beatoven.ai — --musicProvider beatoven (royalty-free background music)
ElevenLabs Music — --musicProvider elevenlabs-music (short SFX / loops up to 22s; same ELEVENLABS_API_KEY as TTS)
Lyria 3 Pro (Google) — --musicProvider lyria
Replicate (MusicGen) — --musicProvider replicate or musicgen (guide)
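
The video, avatar, and music sections above each list the values their CLI flag accepts. A small validator built from those lists might look like the sketch below. The `isKnownProvider` helper and the flag-keyed map are assumptions for illustration and do not describe how the NeuroLink CLI parses its arguments.

```typescript
// Flag values copied from the video, avatar, and music lists above.
// This map and helper are hypothetical, not part of the NeuroLink CLI.
const PROVIDERS_BY_FLAG = {
  videoProvider: ["vertex", "kling", "runway", "replicate"],
  avatarProvider: ["d-id", "heygen", "replicate", "musetalk"],
  musicProvider: ["beatoven", "elevenlabs-music", "lyria", "replicate", "musicgen"],
} as const;

type ProviderFlag = keyof typeof PROVIDERS_BY_FLAG;

// True when `value` is one of the documented options for the given flag.
function isKnownProvider(flag: ProviderFlag, value: string): boolean {
  const allowed: readonly string[] = PROVIDERS_BY_FLAG[flag];
  return allowed.includes(value);
}

console.log(isKnownProvider("videoProvider", "kling")); // true
console.log(isKnownProvider("videoProvider", "musetalk")); // false
```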

Quick Comparison

| Provider | Free Tier | Enterprise | GDPR | Latency | Best For |
| --- | --- | --- | --- | --- | --- |
| Anthropic | Limited | ✅ | ✅ | Low | Reasoning, coding, Claude |
| Hugging Face | ✅ | ❌ | ✅ | Medium | Open source, experimentation |
| Google AI | ✅ | ✅ | ✅ | Low | Free tier, Gemini |
| Mistral AI | ❌ | ✅ | ✅ | Low | EU compliance, cost |
| OpenRouter | ✅ | ✅ | Varies | Low | Multi-model, automatic failover |
| OpenAI Compatible | Varies | ✅ | Varies | Varies | Flexibility, local deployment |
| LiteLLM | ❌ | ✅ | Varies | Low | Multi-provider, unified API |
| Azure OpenAI | ❌ | ✅ | ✅ | Low | Enterprise, Microsoft ecosystem |
| Vertex AI | ❌ | ✅ | ✅ | Low | Enterprise, GCP ecosystem |
| AWS Bedrock | ❌ | ✅ | ✅ | Low | Enterprise, AWS ecosystem |
| DeepSeek | ❌ | ✅ | ❌ | Low | Cost-effective reasoning, R1 model |
| NVIDIA NIM | ❌ | ✅ | Varies | Low | NVIDIA-hosted or self-hosted LLMs |
| LM Studio | ✅ (Local) | ❌ | ✅ | Varies | Local GUI model management |
| llama.cpp | ✅ (Local) | ❌ | ✅ | Varies | High-performance local GGUF inference |
| OpenAI TTS | ❌ | ✅ | ✅ | Low | High-quality TTS (tts-1-hd) |
| ElevenLabs | ❌ | ✅ | Varies | Low | Multilingual TTS, voice cloning |
| Google TTS | ✅ | ✅ | ✅ | Low | Cost-effective TTS, 1M chars free |
| Azure TTS | ❌ | ✅ | ✅ | Low | Enterprise TTS, SSML support |
| Fish Audio | ❌ | ✅ | Varies | Low | Low-cost TTS, voice cloning, 14 langs |
| Cartesia | ❌ | ✅ | Varies | Low | Low-latency Sonic models |
| Whisper | ❌ | ✅ | ✅ | Low | Best STT accuracy |
| Deepgram | ❌ | ✅ | Varies | Low | Real-time STT streaming (WebSocket) |
| Google STT | ❌ | ✅ | ✅ | Low | STT for GCP users, 125+ languages |
| Azure STT | ❌ | ✅ | ✅ | Low | Enterprise STT, custom models |
| OpenAI Realtime | ❌ | ✅ | ✅ | Low | Realtime bidirectional voice |
| Gemini Live | ❌ | ✅ | ✅ | Low | Realtime voice + video (Gemini) |
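
If the comparison is kept as data, a shortlist can be derived from it programmatically. The sketch below encodes a handful of rows from the table above and filters on the Free Tier and GDPR columns. The `ComparisonRow` shape and the `shortlist` helper are assumptions made for illustration, and "Varies" is treated as not meeting a GDPR requirement.

```typescript
type TriState = "yes" | "no" | "varies";

interface ComparisonRow {
  provider: string;
  freeTier: boolean;
  gdpr: TriState;
}

// A few rows copied from the Quick Comparison table; extend as needed.
const SAMPLE_ROWS: ComparisonRow[] = [
  { provider: "Google AI", freeTier: true, gdpr: "yes" },
  { provider: "Mistral AI", freeTier: false, gdpr: "yes" },
  { provider: "OpenRouter", freeTier: true, gdpr: "varies" },
  { provider: "DeepSeek", freeTier: false, gdpr: "no" },
  { provider: "Google TTS", freeTier: true, gdpr: "yes" },
];

// Keeps rows that satisfy every requested requirement.
// A "varies" GDPR entry is conservatively treated as not compliant.
function shortlist(
  rows: ComparisonRow[],
  need: { freeTier?: boolean; gdpr?: boolean },
): string[] {
  return rows
    .filter((r) => !need.freeTier || r.freeTier)
    .filter((r) => !need.gdpr || r.gdpr === "yes")
    .map((r) => r.provider);
}

console.log(shortlist(SAMPLE_ROWS, { freeTier: true, gdpr: true }));
// ["Google AI", "Google TTS"]
```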

Setup Strategies

Strategy 1: Free Tier First (Recommended for Development)

=== "SDK Usage"

    ```typescript
    import { NeuroLink } from "@juspay/neurolink";

    // Try the free-tier provider first; fall back to OpenAI when its quota runs out.
    const ai = new NeuroLink({
      providers: [
        {
          name: "google-ai",
          priority: 1,
          config: { apiKey: process.env.GOOGLE_AI_API_KEY },
          quotas: { daily: 1500 },
        },
        {
          name: "openai",
          priority: 2,
          config: { apiKey: process.env.OPENAI_API_KEY },
        },
      ],
      failoverConfig: { enabled: true, fallbackOnQuota: true },
    });

    const result = await ai.generate({
      input: { text: "Hello world" },
    });
    ```

=== "CLI Usage"

    ```bash
    # Set up environment variables
    export GOOGLE_AI_API_KEY="your-key"
    export OPENAI_API_KEY="your-key"

    # Use with automatic failover
    npx @juspay/neurolink generate "Hello world" \
      --provider google-ai
    ```

Strategy 2: Multi-Region Enterprise

```typescript
const ai = new NeuroLink({
  providers: [
    {
      name: "azure-us",
      region: "us-east",
      config: {
        /* Azure US */
      },
    },
    {
      name: "azure-eu",
      region: "eu-west",
      config: {
        /* Azure EU */
      },
    },
    {
      name: "bedrock-us",
      region: "us-east",
      config: {
        /* Bedrock US */
      },
    },
  ],
  loadBalancing: "latency-based",
});
```
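
One common way to implement latency-based balancing is to keep an exponentially weighted moving average of response times per provider and route to the lowest. The sketch below shows that idea only. It is not a description of how NeuroLink implements `loadBalancing: "latency-based"`, and the `LatencyTracker` class and its smoothing factor are assumptions.

```typescript
// Tracks a moving average of latency per provider and picks the fastest.
class LatencyTracker {
  private readonly averages = new Map<string, number>();
  private readonly alpha: number;

  // alpha is the weight of the newest sample (0 < alpha <= 1).
  constructor(alpha = 0.2) {
    this.alpha = alpha;
  }

  record(provider: string, latencyMs: number): void {
    const prev = this.averages.get(provider);
    const next =
      prev === undefined
        ? latencyMs
        : this.alpha * latencyMs + (1 - this.alpha) * prev;
    this.averages.set(provider, next);
  }

  // Providers with no samples yet are treated as 0 ms so they get tried first.
  fastest(candidates: string[]): string | undefined {
    let best: string | undefined;
    let bestAvg = Infinity;
    for (const name of candidates) {
      const avg = this.averages.get(name) ?? 0;
      if (avg < bestAvg) {
        best = name;
        bestAvg = avg;
      }
    }
    return best;
  }
}

const tracker = new LatencyTracker();
tracker.record("azure-us", 120);
tracker.record("azure-eu", 80);
tracker.record("bedrock-us", 200);
console.log(tracker.fastest(["azure-us", "azure-eu", "bedrock-us"])); // "azure-eu"
```

A moving average reacts to sustained slowdowns without flipping providers on a single slow request.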

Strategy 3: GDPR Compliance (EU Data Residency)

```typescript
const ai = new NeuroLink({
  providers: [
    {
      name: "mistral",
      priority: 1,
      config: { apiKey: process.env.MISTRAL_API_KEY },
    },
    {
      name: "azure-eu",
      priority: 2,
      config: {
        /* Azure EU region */
      },
    },
  ],
  compliance: {
    framework: "GDPR",
    dataResidency: "EU",
  },
});
```

Next Steps

1. Choose a provider based on your requirements (free tier, compliance, region)
2. Follow the setup guide to get your API key
3. Configure NeuroLink with the provider
4. Test the integration with a simple request
5. Add failover for production reliability

Multi-Provider Failover - High availability patterns
Cost Optimization - Reduce costs by 80-95%
Compliance & Security - GDPR, SOC2, HIPAA
Load Balancing - Distribution strategies
Voice Providers Comparison - TTS, STT, and Realtime capability matrix
Voice Provider Selection - Choosing the right voice provider