FunASR Deployment Matrix

May 25, 2026

Use this page to choose the shortest deployment path for a product, demo, benchmark, or internal workflow. Start with the smallest surface that satisfies the job, then move to heavier runtimes only when throughput, latency, or integration requirements demand it.

Quick decision table

| Path | Best for | Start here | Operational notes |
| --- | --- | --- | --- |
| Colab notebook | Browser smoke tests, first evaluation, shareable demos | Colab quickstart | No local setup; first run downloads model files, GPU runtime is faster. |
| Python API | Notebooks, offline jobs, first model evaluation | README quick start | Lowest ceremony; caller owns batching, retries, and files. |
| OpenAI-compatible API | Private speech API, agents, Dify/LangChain/AutoGen-style clients | OpenAI API example | Easiest integration for apps that already support OpenAI audio APIs. |
| Docker Compose API | Reproducible local smoke test or small internal service | OpenAI API Docker docs | CPU by default; adapt the image before using CUDA in containers. |
| Kubernetes API | Internal speech API for cluster services | Kubernetes template | Starts as private ClusterIP; add auth, TLS, network policy, and GPU scheduling before broader exposure. |
| Runtime WebSocket service | Live captions, meetings, call-center streams | Runtime service docs | Use when partial results, endpointing, or long-lived audio streams matter. |
| vLLM acceleration | Higher-throughput LLM-based ASR with Fun-ASR-Nano | vLLM guide | Use for LLM decoder throughput; does not apply to non-autoregressive Paraformer. |
| MCP server | Claude/Cursor/desktop agent speech tools | MCP example | Good when the ASR result should be exposed as a local tool. |
| Subtitle generator | SRT/VTT from long audio or video | Subtitle example | Use verbose segments and speaker labels when readability matters. |
| Batch ASR script | Archives, meetings, datasets, repeated offline runs | Batch example | Add queueing, manifests, and retry logs for production use. |
| Triton runtime | Specialized high-performance serving | Triton runtime docs | Heavier setup; choose when your team already operates Triton/GPU serving. |

Common choices

I want to try FunASR in five minutes

Use the Colab quickstart when you want a browser-only smoke test, or use the Python API from the README for local work. Both are short routes for checking installation, model download, device selection, and basic output shape. If you are unsure which model to start with, use the model selection guide.

I want a local replacement for cloud transcription

Use the OpenAI-compatible API. It exposes /v1/audio/transcriptions, /v1/models, /health, and Swagger docs. Start with sensevoice, run examples/openai_api/smoke_test.sh or examples/openai_api/smoke_test.py, then connect existing SDK or HTTP clients using client recipes and JavaScript/TypeScript recipes. For browser upload or microphone demos, use the Gradio browser demo. For Dify, n8n, HTTP nodes, or webhook workers, follow the workflow recipes. For API gateways, developer portals, and schema-driven imports, use the OpenAPI spec. Before sharing the service, review the security and gateway guide.
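
For clients that cannot use an SDK, the transcription call can be sketched with the Python standard library alone. This is a minimal sketch, not the project's client code: it assumes the endpoint accepts the same multipart form fields as the OpenAI audio API (`file` and `model`) and that the service listens on `http://localhost:8000`. The helper name `build_transcription_request` is hypothetical.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, filename, audio_bytes, model="sensevoice"):
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions.

    Assumption: the server accepts OpenAI-style form fields "model" and "file".
    """
    boundary = uuid.uuid4().hex
    # Text part for the model alias, then the header of the binary file part.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Placeholder bytes stand in for a real audio file read from disk.
request = build_transcription_request("http://localhost:8000", "sample.wav", b"RIFF....")
```

Against a running server, passing `request` to `urllib.request.urlopen` should send the upload; the shape of the response body depends on the requested response format, so inspect it before parsing.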

I want a repeatable container demo

Use examples/openai_api/docker-compose.yml for a CPU-mode smoke test:

cd examples/openai_api
cp .env.example .env
docker compose up --build

Keep CPU mode until you have a CUDA-capable PyTorch/FunASR image. After that, set FUNASR_DEVICE=cuda and verify with the same smoke test. On systems without bash or curl, run the Python smoke test instead: python examples/openai_api/smoke_test.py --base-url http://localhost:8000
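
Because the first start may spend time downloading model files, it can help to wait for the service before running the smoke test. The following is a minimal sketch of a readiness poll. It assumes `/health` answers with HTTP 200 once the service is ready, which this page does not state explicitly; the function names, attempt count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


def wait_for_health(base_url, attempts=30, delay=2.0, probe=http_status, sleep=time.sleep):
    """Poll <base_url>/health; return the attempt number that succeeded, else None.

    Assumption: a ready service responds to /health with HTTP 200.
    """
    url = base_url.rstrip("/") + "/health"
    for attempt in range(1, attempts + 1):
        if probe(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling `wait_for_health("http://localhost:8000")` before the smoke test separates "container still starting" from "service is broken". The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.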

I want an internal Kubernetes service

Use the Kubernetes template for a private ClusterIP OpenAI-compatible API with persistent model cache, /health probes, and a port-forward smoke-test path. Keep the CPU default until you have a CUDA-capable image and cluster GPU scheduling in place.

I need streaming or live captioning

Use the runtime WebSocket service. Validate chunk size, VAD, endpointing, punctuation, speaker diarization, reconnect behavior, and client backpressure with real audio before production rollout.
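
When validating chunk size, it is useful to reproduce the client's framing offline before involving the network. The sketch below splits raw PCM into fixed-duration chunks. It assumes 16 kHz, 16-bit mono PCM and a 600 ms chunk purely for illustration; take the real audio format and chunk duration from the runtime service docs.

```python
def iter_pcm_chunks(pcm, sample_rate=16000, sample_width=2, chunk_ms=600):
    """Yield consecutive chunks of raw mono PCM, each covering chunk_ms of audio.

    The final chunk may be shorter than the others.
    Assumption: sample_rate, sample_width, and chunk_ms are illustrative defaults.
    """
    samples_per_chunk = sample_rate * chunk_ms // 1000
    bytes_per_chunk = samples_per_chunk * sample_width
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


# One second of silence: 16000 samples * 2 bytes = 32000 bytes.
one_second = bytes(32000)
chunk_sizes = [len(chunk) for chunk in iter_pcm_chunks(one_second)]
# chunk_sizes is [19200, 12800]: one full 600 ms chunk plus a 400 ms remainder.
```

Replaying recorded audio through the same framing, with a pause of one chunk duration between sends, approximates a live microphone and exposes endpointing and backpressure behavior that a single bulk upload hides.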

I need more LLM-based ASR throughput

Use the vLLM path for Fun-ASR-Nano. Benchmark with your own audio distribution and watch GPU memory, tensor parallel size, first-token latency, and warmup time.

Readiness checklist

  • Pick a model alias and pin it in deployment notes.
  • Record FunASR version, model version, device, CUDA/PyTorch version, Docker image tag, and command line.
  • Run a short public smoke sample and at least one realistic private sample.
  • Log audio duration, model, device, latency, response format, and error type for every request.
  • Add upload-size limits, authentication, TLS, and rate limits before exposing an API outside a trusted network; use the security and gateway guide to plan the boundary.
  • For streaming, test silence, noise, overlapping speakers, long sessions, reconnects, and slow clients.
  • For benchmark claims, include input duration, hardware, batch size, model, runtime path, and whether model download/warmup time is excluded.
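
The per-request logging item above can be sketched as one JSON line per request. The field names here are hypothetical, not a FunASR logging schema. The real-time factor (latency divided by audio duration) is a common ASR summary metric added for convenience; it is not something this checklist requires.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RequestLog:
    """One log record per request; field names are illustrative."""
    audio_seconds: float
    model: str
    device: str
    latency_seconds: float
    response_format: str
    error_type: Optional[str] = None

    @property
    def real_time_factor(self):
        """Latency divided by audio duration; None when duration is not positive."""
        if self.audio_seconds <= 0:
            return None
        return self.latency_seconds / self.audio_seconds


def to_json_line(record):
    """Serialize a record as a single JSON line with stable key order."""
    payload = asdict(record)
    rtf = record.real_time_factor
    payload["real_time_factor"] = None if rtf is None else round(rtf, 4)
    return json.dumps(payload, sort_keys=True)


line = to_json_line(RequestLog(60.0, "sensevoice", "cpu", 3.0, "json"))
```

Keeping failures in the same stream, with `error_type` set, makes it possible to compute error rates and latency percentiles from one file.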

When to open an issue

Use Deployment Help for runtime, Docker, vLLM, Triton, Android, browser, or agent integration problems. Include your deployment path, exact command/config, logs, model, device, and audio characteristics.