Kubernetes Deployment for the FunASR OpenAI-Compatible API
May 25, 2026 ยท View on GitHub
This folder provides a CPU-first Kubernetes template for the examples/openai_api server. Use it when you want an internal OpenAI-compatible speech endpoint reachable by agents, web backends, workflow engines, or batch workers inside a cluster.
The manifest is intentionally conservative:
ClusterIPservice by default, not a publicLoadBalancer.FUNASR_DEVICE=cpuby default so the image matches the portable Dockerfile.- A persistent cache volume mounted at
/root/.cacheso model downloads survive pod restarts. /healthstartup, readiness, and liveness probes.- A memory-backed
/dev/shmvolume for PyTorch and audio preprocessing.
1. Build and push the image
From examples/openai_api:
docker build -t registry.example.com/speech/funasr-api:cpu-latest .
docker push registry.example.com/speech/funasr-api:cpu-latest
Edit kustomization.yaml if you use a different registry, repository, or tag.
2. Deploy
kubectl create namespace speech --dry-run=client -o yaml | kubectl apply -f -
kubectl -n speech apply -k .
kubectl -n speech rollout status deploy/funasr-api --timeout=15m
Model download and first load can take several minutes. The startupProbe allows up to 10 minutes before Kubernetes restarts the container.
3. Smoke test
Keep the service private and verify it through port-forward first:
kubectl -n speech port-forward svc/funasr-api 8000:8000
From another terminal in examples/openai_api:
python smoke_test.py --base-url http://localhost:8000
For in-cluster clients, use http://funasr-api.speech.svc.cluster.local:8000 as the direct HTTP base URL and http://funasr-api.speech.svc.cluster.local:8000/v1 as the OpenAI SDK base URL.
4. Tune for your cluster
| Setting | Default | When to change it |
|---|---|---|
FUNASR_MODEL | sensevoice | Use another alias after checking /v1/models. |
FUNASR_DEVICE | cpu | Set to cuda only after building a CUDA-capable image and configuring GPU scheduling. |
| PVC size | 20Gi | Increase when caching multiple models or large model revisions. |
| Memory request | 8Gi | Tune after observing startup and real audio workloads. |
| Startup probe | 10 minutes | Increase if your registry, model hub, or storage backend is slow. |
GPU notes
The example Dockerfile is CPU-first. For GPU clusters you need to adapt the image to CUDA-capable PyTorch/FunASR dependencies, then add your cluster's GPU scheduling fields, for example:
resources:
limits:
nvidia.com/gpu: "1"
nodeSelector:
nvidia.com/gpu.present: "true"
Exact GPU labels, runtime classes, and device plugin configuration vary by Kubernetes distribution. Keep the service private until authentication, TLS, upload-size limits, and rate limits are in place.
Operational checks
- Use
/healthfor readiness and/v1/modelsto confirm model aliases. - Log model alias, device, audio duration, response format, latency, and error text.
- Start with one replica because the cache PVC is
ReadWriteOnce; scale horizontally with a registry image, per-pod cache, or a shared read-only model cache after measuring memory and startup time. - Put authentication and network policy in front of the service before exposing it outside a trusted namespace.
- For Dify, n8n, or web backends inside the same cluster, point them at the Kubernetes service name instead of
localhost.