voice-agent-starter

May 31, 2026

A self-hosted, full-duplex voice agent loop with swappable streaming STT, LLM, and TTS.

A working starter for production voice agents. The browser captures microphone audio, the server runs a duplex pipeline with voice activity detection, a streaming LLM answers, and TTS audio is streamed back to the browser in chunks as it is synthesised. Barge-in cancels the in-flight LLM and TTS streams the moment you start speaking, and the LLM can call server-side tools mid-turn through function-call passthrough.
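
One common way to get this kind of barge-in cancellation is to give every turn its own AbortController and hand its signal to both the LLM and the TTS stream. The sketch below assumes that approach. `TurnController` and `speak` are hypothetical names for illustration, not the repository's API.

```typescript
// Hypothetical sketch: one AbortController per turn, shared by the LLM and TTS streams.
class TurnController {
  private current: AbortController | null = null;

  // Begin a new turn and return the signal the LLM and TTS requests should honour.
  begin(): AbortSignal {
    this.current?.abort(); // a new turn always supersedes the previous one
    this.current = new AbortController();
    return this.current.signal;
  }

  // Called when the VAD hears the user while the agent is thinking or speaking.
  // Returns true only if there was a live turn to cancel.
  bargeIn(): boolean {
    if (this.current === null || this.current.signal.aborted) return false;
    this.current.abort();
    return true;
  }
}

// Example consumer: stop emitting audio chunks as soon as the turn is aborted.
// Returns the number of chunks that were actually sent.
function speak(
  chunks: string[],
  signal: AbortSignal,
  onChunk: (chunk: string) => void,
): number {
  let sent = 0;
  for (const chunk of chunks) {
    if (signal.aborted) break;
    onChunk(chunk);
    sent++;
  }
  return sent;
}
```

The same signal can also be passed to `fetch` through its `signal` option, so an aborted turn closes the underlying HTTP stream as well.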

Every layer is a pluggable adapter behind a small interface, so you can swap Whisper.cpp for Deepgram, OpenTTS for ElevenLabs, or Groq for OpenAI without touching the pipeline. The default speech stack is fully self-hosted and open source with no per-minute provider fees: Whisper.cpp for STT and OpenTTS Coqui XTTS v2 for TTS. The default LLM is Groq Llama 4, a hosted API that needs GROQ_API_KEY.

Quickstart

git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter
pnpm install
cp .env.example .env
pnpm dev

Open http://localhost:3000, click Start, and grant microphone access. The web client connects to the server on port 3001 over a WebSocket and streams PCM frames into the pipeline. The state machine, barge-in, and tool calls all run without any provider keys; set keys or point the self-hosted URLs at running servers to get real transcripts and audio.
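
Browsers deliver microphone samples as 32-bit floats in the range -1 to 1, so a client that streams PCM16 has to convert them first. The following is a minimal sketch of that conversion. `floatToPcm16` is a hypothetical helper name, and the repository's web client may do this differently.

```typescript
// Hypothetical helper: convert Web Audio float samples (-1..1) to 16-bit signed PCM.
function floatToPcm16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative samples scale by 32768 and positive ones by 32767 to fill the int16 range.
    out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return out;
}
```

The `buffer` of the returned `Int16Array` can be sent as a binary WebSocket message.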

Architecture

graph TD
  Browser[Browser microphone]
  Browser -->|WebSocket PCM16| SRV[Fastify server]
  SRV --> ORC[Orchestrator state machine]
  ORC --> VAD[RMS voice activity detector]
  ORC --> STT[Streaming STT]
  ORC --> LLM[Streaming LLM]
  ORC --> TOOLS[Tool registry]
  ORC --> TTS[Chunked TTS]
  LLM -->|tool_call| TOOLS
  TOOLS -->|result| LLM
  TTS -->|audio frames| SRV
  SRV --> Browser

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class STT,LLM,TTS ext

The orchestrator owns one voice session and runs an IDLE to LISTEN to THINK to SPEAK state machine. Full design notes and the state-transition table are in ARCHITECTURE.md.
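
As a rough illustration, a four-state loop like this can be written as a transition table plus a pure `next` function. The event names and the exact transitions below are assumptions made for the sketch. The authoritative table is the one in ARCHITECTURE.md.

```typescript
type State = "IDLE" | "LISTEN" | "THINK" | "SPEAK";
type VoiceEvent = "speech_start" | "speech_end" | "first_token" | "playback_done";

// Assumed transition table, for illustration only.
const transitions: Record<State, Partial<Record<VoiceEvent, State>>> = {
  IDLE: { speech_start: "LISTEN" },
  LISTEN: { speech_end: "THINK" },
  THINK: { first_token: "SPEAK", speech_start: "LISTEN" }, // barge-in while thinking
  SPEAK: { playback_done: "IDLE", speech_start: "LISTEN" }, // barge-in while speaking
};

// Pure transition function: events that do not apply leave the state unchanged.
function next(state: State, event: VoiceEvent): State {
  return transitions[state][event] ?? state;
}
```

Keeping the transition function pure makes the loop easy to unit test without sockets or audio.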

Latency budget

| Stage | P50 target | Notes |
| --- | --- | --- |
| Mic to VAD | 30ms | RMS VAD on PCM frames |
| STT first partial | 250ms | Whisper.cpp growing-window transcription |
| LLM first token | 250ms | Groq Llama 4 on the LPU stack |
| TTS first audio chunk | 250ms | OpenTTS XTTS v2, first sentence |
| Total user-perceived | ~800ms | first audible response, self-hosted |
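
The total is the sum of the per-stage targets, assuming the stages run strictly one after another. The arithmetic, with the figures copied from the table, looks like this:

```typescript
// P50 targets in milliseconds, copied from the latency budget table.
const stageTargetsMs: Array<[string, number]> = [
  ["Mic to VAD", 30],
  ["STT first partial", 250],
  ["LLM first token", 250],
  ["TTS first audio chunk", 250],
];

// 30 + 250 + 250 + 250 = 780, which the table rounds to ~800ms.
const totalMs = stageTargetsMs.reduce((sum, stage) => sum + stage[1], 0);
```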

What is in the box

  • Duplex orchestrator (apps/server/src/pipeline/orchestrator.ts): an IDLE to LISTEN to THINK to SPEAK state machine that owns one voice session, hands the first LLM token straight to TTS, handles barge-in by aborting the LLM and TTS streams mid-flight, and runs function-call passthrough.
  • Function-call passthrough (apps/server/src/pipeline/tools.ts): the registered tools are advertised to the LLM, the server executes the matching handler when the model calls one, and the result is fed back so the model answers with grounded data. Ships with get_time and add_numbers and a clean seam for your own tools.
  • Voice activity detector (apps/server/src/pipeline/vad.ts): an RMS-based VAD with a clean seam for dropping in silero-vad-onnx for real workloads.
  • Pluggable adapters for STT (Whisper.cpp, Deepgram, OpenAI Whisper), LLM (Groq, SarmaLink-AI, OpenAI), and TTS (OpenTTS, Cartesia, ElevenLabs), each selected by an environment variable through a small registry. The three OpenAI-compatible LLM adapters share one streaming SSE reader.
  • Fastify 5 server exposing a /health endpoint that reports the active providers and a /voice WebSocket, plus a Next.js 15 web client that captures audio and renders transcripts.
  • Real end-to-end tests: the full pipeline driven through fake adapters and a fake socket, covering the loop, barge-in cancellation, and function-call passthrough, plus per-adapter unit tests. CI runs lint, typecheck, build, and tests on every push and pull request.
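
To make the RMS-based voice activity detector concrete, here is a minimal sketch of the technique: compute the root-mean-square energy of each PCM16 frame, compare it with a threshold, and wait for a few quiet frames before declaring the end of speech. The class name, the threshold, and the hangover length are illustrative assumptions, not values taken from vad.ts.

```typescript
// Root-mean-square energy of one PCM16 frame, normalised to the 0..1 range.
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Hypothetical detector: the default threshold and hangover are assumptions.
class RmsVad {
  private speaking = false;
  private quietFrames = 0;
  private threshold: number;
  private hangoverFrames: number;

  constructor(threshold = 0.02, hangoverFrames = 10) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  // Returns "start" or "end" when speech begins or ends, otherwise null.
  push(frame: Int16Array): "start" | "end" | null {
    if (rms(frame) >= this.threshold) {
      this.quietFrames = 0;
      if (!this.speaking) {
        this.speaking = true;
        return "start";
      }
      return null;
    }
    // Require several consecutive quiet frames so short pauses do not end the turn.
    if (this.speaking && ++this.quietFrames >= this.hangoverFrames) {
      this.speaking = false;
      this.quietFrames = 0;
      return "end";
    }
    return null;
  }
}
```

A fixed energy threshold is sensitive to background noise, which is why a model-based detector is the usual upgrade for real workloads.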

When to use this

  • You want to add voice to a product and do not want to build a streaming pipeline, barge-in handling, and a provider abstraction from scratch.
  • You want a stack you can run fully self-hosted with no per-minute provider fees, and the option to swap in hosted providers per layer later.
  • You need the LLM to call server-side functions mid-conversation and feed the results back into the same turn.
  • You want to A/B test STT, LLM, or TTS providers without rewriting the pipeline around each one.
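
For the function-call case, the core of the passthrough is a lookup from tool name to handler, plus serialising the result so it can go back to the model. The sketch below assumes the OpenAI-compatible convention where tool arguments arrive as a JSON string. The registry shape, `runToolCall`, and the `a` and `b` parameter names are hypothetical, and the handlers are synchronous only to keep the example short.

```typescript
type ToolHandler = (args: Record<string, unknown>) => unknown;

interface ToolCall {
  name: string;
  arguments: string; // JSON-encoded arguments, as sent by the model
}

// Hypothetical registry keyed by tool name, mirroring the two tools the starter ships.
const tools = new Map<string, ToolHandler>();
tools.set("get_time", () => new Date().toISOString());
tools.set("add_numbers", (args) => Number(args["a"]) + Number(args["b"]));

// Execute one tool call and return the string that would be fed back to the model.
// Errors are returned as data so the model can recover within the same turn.
function runToolCall(call: ToolCall): string {
  const handler = tools.get(call.name);
  if (!handler) return JSON.stringify({ error: `unknown tool: ${call.name}` });
  try {
    const args = JSON.parse(call.arguments || "{}") as Record<string, unknown>;
    return JSON.stringify({ result: handler(args) });
  } catch (err) {
    return JSON.stringify({ error: String(err) });
  }
}
```

Adding a tool in this sketch is one more `tools.set` call with a name and a handler.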

When not to use this

  • You need a finished consumer product. This is a starter, not a turnkey app, and the default VAD and transport are deliberately simple.
  • You are building a one-shot, push-to-talk transcription tool. The full-duplex machinery here is overhead you would not need.
  • You need word-level interim STT results today. The default Whisper.cpp adapter surfaces window-level partials; wire the Deepgram streaming SDK for finer granularity.

Configuration

| Env var | Purpose | Default |
| --- | --- | --- |
| STT_PROVIDER | whispercpp, deepgram, or whisper | whispercpp |
| LLM_PROVIDER | groq, sarmalink, or openai | groq |
| TTS_PROVIDER | opentts, cartesia, or elevenlabs | opentts |
| GROQ_API_KEY | for the Groq Llama 4 LLM adapter | unset |
| WHISPERCPP_URL | running whisper-server for STT | http://localhost:8090 |
| OPENTTS_URL | running OpenTTS server for TTS | http://localhost:5500 |

See .env.example for the full list, including the hosted-provider keys.
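
Provider selection from these variables can be sketched as a small lookup with a default per layer. The allowed values and defaults come from the configuration table. The function name and the choice to throw on an unrecognised value are assumptions, and the real registry may behave differently.

```typescript
type Layer = "stt" | "llm" | "tts";

// Allowed values and defaults, taken from the configuration table.
const providers = {
  stt: { envVar: "STT_PROVIDER", allowed: ["whispercpp", "deepgram", "whisper"], fallback: "whispercpp" },
  llm: { envVar: "LLM_PROVIDER", allowed: ["groq", "sarmalink", "openai"], fallback: "groq" },
  tts: { envVar: "TTS_PROVIDER", allowed: ["opentts", "cartesia", "elevenlabs"], fallback: "opentts" },
};

// Hypothetical resolver: unset or empty falls back to the default, unknown values fail fast.
function resolveProvider(layer: Layer, env: Record<string, string | undefined>): string {
  const { envVar, allowed, fallback } = providers[layer];
  const value = env[envVar];
  if (value === undefined || value === "") return fallback;
  if (allowed.indexOf(value) === -1) {
    throw new Error(`${envVar}=${value} is not one of: ${allowed.join(", ")}`);
  }
  return value;
}
```

On the server this would be called with `process.env`, for example `resolveProvider("tts", process.env)`.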

Swapping adapters

Each adapter is a single TypeScript file. Drop a new adapter into apps/server/src/adapters/<layer>/<provider>.ts implementing the layer's interface, register it in the registry, and set the matching environment variable. No other changes are needed. An OpenAI-compatible LLM adapter is a handful of lines because it reuses the shared SSE reader in apps/server/src/adapters/llm/sse.ts.
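
To show what a shared SSE reader has to do, here is a minimal sketch of an incremental parser for the OpenAI chat-completions streaming format, where each event is a `data:` line carrying a JSON chunk and the stream ends with `data: [DONE]`. `SseTextReader` is a hypothetical name. This is not the code in sse.ts.

```typescript
// Hypothetical incremental parser for OpenAI-style chat completion streams.
class SseTextReader {
  private buffer = "";
  done = false;

  // Feed one network chunk and get back the content deltas it completed.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""), so keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const deltas: string[] = [];
    for (const raw of lines) {
      const line = raw.trim();
      if (!line.startsWith("data:")) continue; // skip blank separators and comments
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        this.done = true;
        break;
      }
      const parsed = JSON.parse(payload) as {
        choices?: Array<{ delta?: { content?: string } }>;
      };
      const text = parsed.choices?.[0]?.delta?.content;
      if (text) deltas.push(text);
    }
    return deltas;
  }
}
```

Buffering the trailing partial line matters because network chunks can split an event in the middle of its JSON payload.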

Documentation

Full architecture notes, a sequence diagram, real-world examples, and a troubleshooting guide live in the project wiki. Design reference is in ARCHITECTURE.md, the plan is in ROADMAP.md, and changes are in CHANGELOG.md.

License

MIT. Built by Sarma Linux.


More open source by Sarma

Part of a portfolio of production-shaped open-source repositories built and maintained by Sarma.

| Repository | What it is |
| --- | --- |
| Sarmalink-ai | Multi-provider OpenAI-compatible AI gateway with 14-engine failover and intent-based plugin auto-routing |
| agent-orchestrator | Durable multi-agent workflows in TypeScript with deterministic replay and Inspector UI |
| voice-agent-starter | Self-hosted full-duplex voice agent loop. Pluggable streaming STT / LLM / TTS, barge-in, function-call passthrough |
| ai-eval-runner | Evals as code. Python, DuckDB, FastAPI viewer, regression mode for CI |
| mcp-server-toolkit | Production Model Context Protocol server starter (Python / FastAPI) |
| local-llm-router | OpenAI-compatible proxy that routes to Ollama or cloud providers based on policy |
| rag-over-pdf | Minimal end-to-end RAG starter for PDF corpora |
| receipt-scanner | Vision OCR for receipts with Zod-validated JSON output |
| webhook-to-email | Webhook receiver that forwards events to email via Resend |
| k8s-ops-toolkit | Helm chart for shipping Next.js to Kubernetes with full observability stack |
| terraform-stack | Vercel + Supabase + Cloudflare + DigitalOcean modules in one Terraform repo |
| staff-portal | Open-source HR / ops portal for leave, attendance, expenses, and kiosk mode |

Engineering essays at sarmalinux.com/blog. All projects at sarmalinux.com/open-source.