FunASR vLLM Inference Engine Guide

June 9, 2026 · View on GitHub

Benchmark

Test set: 184 files, 11,541 seconds total. Models: Fun-ASR-Nano / GLM-ASR-Nano.

Model	Engine	VAD	RTFx	CER	Notes
Fun-ASR-Nano	PyTorch	dynamic	21	8.06%	Baseline
Fun-ASR-Nano	vLLM batch	dynamic	340	8.20%	16x speedup
Fun-ASR-Nano	Offline service (no SPK)	dynamic	102	8.14%
Fun-ASR-Nano	Offline service (+SPK)	dynamic	46	8.19%	SPK off by default
GLM-ASR-Nano	vLLM batch	fixed	265	12.93%	No long-audio support

vLLM matches PyTorch CER exactly (delta < 0.2%) while achieving 16–340x speedup.

Installation & Environment
vLLM Engine Architecture
Offline SDK Inference
Streaming SDK Inference
Offline Speech Recognition Service
Streaming Speech Recognition Service
Dynamic VAD
API Reference
FAQ

1. Installation & Environment

pip install funasr>=1.3.0
pip install vllm>=0.12.0
pip install safetensors tiktoken websockets regex fastapi uvicorn python-multipart

cd /path/to/FunASR && pip install -e .

Hardware: GPU ≥ 8 GB VRAM, CUDA ≥ 11.8. 16 GB+ recommended.

2. vLLM Engine Architecture

Overall Architecture

FunASR's vLLM integration splits the ASR model into two independently running components:

┌──────────────────────────────────────────────────────────────┐
│                  FunASR + vLLM Inference Architecture          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────── PyTorch (single GPU) ───────────┐          │
│  │                                                │          │
│  │  Audio ──→ Frontend ──→ Audio Encoder ──→ Adaptor         │
│  │            (fbank)      (SenseVoice/     (Transformer/    │
│  │                          Whisper)         MLP)            │
│  │                              │                            │
│  │                              ▼                            │
│  │                     Audio Embeddings                      │
│  │                              │                            │
│  │  Text Prompt ──→ Tokenize ──→ Embed                      │
│  │  (system/user/                  │                         │
│  │   hotwords/language)            │                         │
│  │                                 ▼                         │
│  │                          [Concat Embeddings]              │
│  └─────────────────────────────────┼─────────────┘          │
│                                    │                         │
│                                    ▼ EmbedsPrompt            │
│  ┌─────────────── vLLM Engine ────────────────────┐          │
│  │                                                │          │
│  │   PagedAttention + Continuous Batching         │          │
│  │   KV Cache management + CUDA Graph             │          │
│  │   Tensor Parallel (multi-GPU)                  │          │
│  │                                                │          │
│  │   Qwen3-0.6B / Llama-2B (LLM decoding)        │          │
│  │                                                │          │
│  └────────────────────┬───────────────────────────┘          │
│                       │                                      │
│                       ▼                                      │
│                Generated Text                                │
│                       │                                      │
│  ┌────────────────────┼──────────────────────────┐           │
│  │  (Optional) CTC Decoder ──→ Forced Alignment  │           │
│  │           ──→ Character-level timestamps       │           │
│  └───────────────────────────────────────────────┘           │
└──────────────────────────────────────────────────────────────┘

Why vLLM?

Feature	PyTorch generate()	vLLM
KV Cache management	Fixed allocation, wastes memory	PagedAttention, on-demand allocation
Batching	Manual padding required	Continuous Batching, automatic scheduling
CUDA optimization	None	CUDA Graph + operator fusion
Multi-GPU parallelism	Manual implementation	Tensor Parallel with one-line config
Throughput	RTFx ~20	RTFx 340+

Supported Models

Model	LLM component	Audio encoder	vLLM speedup
Fun-ASR-Nano	Qwen3-0.6B	SenseVoice	✓ 21.7x
GLM-ASR-Nano	Llama-2B	Whisper-like	✓ 7.6x
LLMASR	Qwen/Vicuna	Whisper	✓
Paraformer	No LLM	—	✗ Non-autoregressive
SenseVoice	No LLM	—	✗ Encoder-decoder

Key Implementation Details

Weight separation: LLM weights are extracted from model.pt and converted to HuggingFace format for vLLM loading
EmbedsPrompt: Audio embeddings and text embeddings are concatenated and fed to vLLM as a single prompt embedding
use_low_frame_rate: Fun-ASR-Nano's adaptor output must be truncated to the correct token count via a formula (critical for consistency)
Batch encode: Multiple audio files pass through extract_fbank → audio_encoder → audio_adaptor in a single forward pass
CTC timestamps: Encoder output is retained; after text generation, forced alignment yields character-level timing

3. Offline SDK Inference

Best suited for large-scale audio transcription and offline batch processing. vLLM's batching capability provides the greatest advantage in this scenario.

Design Principles

Offline SDK inference splits the ASR pipeline into two stages executed independently:

$ ┌─────────────────────────────────────────────────────────────────────┐ │ \text{Stage} 1: \text{Audio} \text{Encoding} (\text{PyTorch}, \text{single} \text{GPU}) │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ \text{Audio} \text{file} \text{list} ──→ \text{Group} (\text{batch} \text{of} 8) ──→ \text{Frontend} (\text{Fbank}) │ │ │ │ │ │ │ ▼ │ │ │ \text{SenseVoice} \text{Encoder} │ │ │ │ │ │ │ ▼ │ │ │ \text{Audio} \text{Adaptor} │ │ │ (\text{dim} \text{transform} + \text{LFR} \text{trunc}) │ │ │ │ │ │ └─── \text{Shared} \text{text} \text{prompt} \text{encoding} ────┐ ▼ │ │ (\text{system}/\text{hotwords}/\text{language}) │ \text{audio\_embeds} │ │ │ │ │ │ │ ▼ │ ▼ │ │ \text{prefix\_emb} ──→ [\text{concat}: \text{prefix} | \text{audio} | \text{suffix}] │ │ │ │ │ ▼ │ │ \text{EmbedsPrompt} (\text{N} \text{samples}) │ └──────────────────────────────────────────────────┼─────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ \text{Stage} 2: \text{LLM} \text{Decoding} (\text{vLLM}, \text{multi}-\text{GPU} \text{Tensor} \text{Parallel}) │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ \text{EmbedsPrompt} \times \text{N} ──→ \text{vLLM} \text{Continuous} \text{Batching} │ │ (\text{PagedAttention} + \text{CUDA} \text{Graph}) │ │ │ │ │ ▼ │ │ \text{Generated} \text{token\_ids} \times \text{N} │ │ │ │ │ ▼ │ │ \text{Decode} + \text{post}-\text{processing} (\text{strip} \text{special} \text{tokens}) │ │ │ │ │ ▼ │ │ (\text{Optional}) \text{CTC} \text{Forced} \text{Alignment} → \text{char} \text{timestamps}│ └─────────────────────────────────────────────────────────────────────┘ $

Key design decisions:

Weight separation: On first run, weights with the llm.* prefix are extracted from model.pt and saved in HuggingFace safetensors format for vLLM (cached in the Qwen3-0.6B-vllm/ directory)
Embedding concatenation: The text prompt is encoded through the LLM's embed_tokens layer into embeddings, then concatenated with the audio adaptor output along the sequence dimension: [prefix_emb | audio_emb | suffix_emb], and submitted to vLLM as an EmbedsPrompt
Low Frame Rate truncation: Adaptor output must be truncated to the correct length using: fake_token_len = ((((fbank_len - 3 + 2) // 2 - 3 + 2) // 2) - 1) // 2 + 1, ensuring consistency with the PyTorch training pipeline
Batch audio encoding: Multiple audio files are grouped in batches of 8 through the encoder + adaptor forward pass, reducing GPU kernel launch overhead
Shared text prompt: When hotwords and language are identical within a batch, prefix_emb and suffix_emb are computed only once
CTC timestamps: Encoder output is preserved; after LLM text generation, forced alignment produces character-level timestamps

Why faster than PyTorch generate()?

Dimension	PyTorch	vLLM
KV Cache	Fixed pre-allocation (wastes memory)	PagedAttention on-demand allocation
Batching	Manual padding alignment	Continuous Batching auto-scheduling
CUDA	Sequential per-sample execution	CUDA Graph + operator fusion
Multi-GPU	Manual implementation	Tensor Parallel one-line config
Result	RTFx ~20	RTFx 340+ (16x speedup)

Universal Interface (Recommended)

from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    hub="ms",                    # or "hf"
    tensor_parallel_size=2,      # multi-GPU parallel
    gpu_memory_utilization=0.8,
)

results = model.generate(
    ["audio1.wav", "audio2.wav"],
    language="中文",
    hotwords=["张三", "北京"],
)
for r in results:
    print(f"[{r['key']}] {r['text']}")

Direct Interface

from funasr.models.fun_asr_nano.inference_vllm import FunASRNanoVLLM

engine = FunASRNanoVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    tensor_parallel_size=4,
)

results = engine.generate(
    inputs="wav.scp",  # supports scp/jsonl/file lists
    hotwords=["开放时间"],
    language="中文",
    max_new_tokens=512,
)

Command Line

cd examples/industrial_data_pretraining/fun_asr_nano

# Single file
python demo_vllm.py --input audio.wav --language 中文

# Batch + multi-GPU
python demo_vllm.py --input wav.scp --tensor-parallel-size 4 --batch-size 32

# With hotwords + save results
python demo_vllm.py --input audio.wav --hotwords 张三 北京 --output results.jsonl

4. Streaming SDK Inference

Processes audio in 720 ms chunks incrementally, outputting progressively stable recognition results. Suited for SDK-integrated real-time subtitle scenarios.

Design Principles

Audio stream (720 ms chunks)
    │ Cumulative re-encoding (each chunk covers all audio from the start)
    ▼
┌──────────────────────────┐
│ Stage 1: First 10 chunks │  ← No prev_text; batch generation
│ Identify stable output   │
└──────────┬───────────────┘
           ▼
┌──────────────────────────┐
│ Stage 2: Subsequent      │  ← Use stable output as prev_text
└──────────┬───────────────┘
           ▼
Each chunk: [fixed region (confirmed)] + [8-char unfixed (may change)]

Usage

from funasr.models.fun_asr_nano.inference_vllm_streaming import FunASRNanoStreamingVLLM

engine = FunASRNanoStreamingVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    chunk_ms=720,
    rollback_chars=8,
)

for result in engine.streaming_generate("audio.wav", language="中文"):
    if result["is_final"]:
        print(f"Final: {result['text']}")
    else:
        print(f"[{result['audio_duration_ms']:.0f}ms] Confirmed: {result['fixed_text']}")

Output Characteristics

Accumulated audio	Output quality
< 1.5 s	Empty or noise
1.5–3.0 s	Partially correct
> 3.0 s	Accurate output

Note: repetition_penalty=1.3 is hardcoded internally to prevent short-chunk repetition degradation.

5. Offline Speech Recognition Service

5.1 Service Architecture

Client                                  serve_vllm.py
  │                                        │
  │── HTTP / OpenAI / WebSocket ─────────→│
  │                                        │
  │                                   ┌────┴────────────────────────┐
  │                                   │ 1. Receive complete audio    │
  │                                   │ 2. Dynamic VAD (≤60 s/seg)  │
  │                                   │ 3. vLLM batch all segments  │
  │                                   │ 4. CTC timestamps (per-char)│
  │                                   │ 5. Speaker diarization (opt)│
  │                                   └────┬────────────────────────┘
  │                                        │
  │←── JSON result ───────────────────────│

Characteristics:

Processes audio only after it arrives in full — ideal for file transcription
Dynamic VAD preserves long segments (≤60 s), reducing boundary-cut losses
Batch inference over all VAD segments maximizes throughput
Automatically outputs character-level timestamps
Speaker diarization is off by default; clients can enable it

5.2 Starting the Service

CUDA_VISIBLE_DEVICES=0 python examples/industrial_data_pretraining/fun_asr_nano/serve_vllm.py \
    --port 8899 \
    --model FunAudioLLM/Fun-ASR-Nano-2512 \
    --gpu-memory-utilization 0.5

5.3 Protocol 1: HTTP REST — `POST /asr`

The most feature-complete interface, supporting speaker diarization, timestamps, and hotwords.

Request: multipart/form-data

Parameter	Type	Default	Description
`file`	file	required	Audio file (wav/mp3/flac)
`language`	string	None	Language ("中文" / "English" / ...), None for auto
`hotwords`	string	""	Hotwords, comma-separated
`spk`	bool	false	Enable speaker diarization
`timestamp`	bool	true	Output character-level timestamps

Response:

{
    "text": "Full transcription text",
    "segments": [
        {
            "text": "Segment text",
            "start": 1.7,
            "end": 14.8,
            "speaker": "SPK0",
            "words": [
                {"word": "砸", "start": 2.02, "end": 2.08},
                {"word": "了", "start": 2.26, "end": 2.32}
            ]
        }
    ],
    "duration": 227.4,
    "processing_time": 3.422,
    "rtf": 0.015
}

Client examples:

# cURL
curl -X POST http://localhost:8899/asr \
    -F "file=@meeting.wav" -F "language=中文" -F "spk=true"

# Python requests
import requests
resp = requests.post("http://localhost:8899/asr",
    files={"file": open("audio.wav", "rb")},
    data={"language": "中文", "spk": "true"})
result = resp.json()

// JavaScript fetch
const form = new FormData();
form.append("file", audioBlob, "audio.wav");
form.append("language", "中文");
form.append("spk", "true");
const resp = await fetch("http://localhost:8899/asr", { method: "POST", body: form });
const result = await resp.json();

5.4 Protocol 2: OpenAI Whisper Compatible — `POST /v1/audio/transcriptions`

Compatible with the OpenAI Whisper API standard; works directly with the OpenAI SDK.

Request: multipart/form-data

Parameter	Type	Default	Description
`file`	file	required	Audio file
`model`	string	"fun-asr-nano"	Model name (compatibility field)
`language`	string	None	Language
`response_format`	string	"json"	"json" / "text" / "verbose_json"
`timestamp_granularities`	string	"word"	"word" / "segment"
`spk`	bool	false	Speaker diarization (FunASR extension)

Response (verbose_json):

{
    "task": "transcribe",
    "language": "zh",
    "duration": 5.17,
    "text": "我一直没有照顾孩子，但是我想要抚养权。",
    "segments": [
        {
            "id": 0, "start": 0.0, "end": 5.15,
            "text": "我一直没有照顾孩子，但是我想要抚养权。",
            "words": [{"word": "我", "start": 0.42, "end": 0.48}, ...]
        }
    ]
}

Client examples:

# OpenAI SDK (recommended)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8899/v1", api_key="none")
result = client.audio.transcriptions.create(
    model="fun-asr-nano",
    file=open("audio.wav", "rb"),
    response_format="verbose_json",
)
print(result.text)

# cURL
curl -X POST http://localhost:8899/v1/audio/transcriptions \
    -F "file=@audio.wav" -F "model=fun-asr-nano" -F "response_format=verbose_json"

5.5 Protocol 3: WebSocket — `ws://host:port/ws`

WebSocket interface for the offline service. Send complete audio, then receive results. Speaker clustering is performed automatically on STOP, and results include the spk field.

Client → Server:

Message	Description
`"START"`	Begin session
`"LANGUAGE:中文"`	Set language (optional)
`"HOTWORDS:word1,word2"`	Set hotwords (optional)
`[binary]`	PCM16 16 kHz mono audio data
`"STOP"`	End session; request recognition result

Server → Client:

{"event": "started"}
{"event": "language_set", "language": "中文"}
{"sentences": [{"text":"...","start":..,"end":..}], "is_final": true, "duration_ms": 5170}
{"event": "stopped"}

Client example:

import asyncio, websockets, json, numpy as np, soundfile as sf

async def offline_ws(audio_path):
    audio, sr = sf.read(audio_path)
    pcm = (audio * 32768).astype(np.int16)

    async with websockets.connect("ws://localhost:8899/ws") as ws:
        await ws.send("START")
        await ws.recv()
        await ws.send("LANGUAGE:中文")
        await ws.recv()

        # Send complete audio
        await ws.send(pcm.tobytes())
        await ws.send("STOP")

        # Receive result
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                for s in data["sentences"]:
                    print(f"[{s['start']/1000:.1f}s] {s['text']}")
                break

asyncio.run(offline_ws("audio.wav"))

6. Streaming Speech Recognition Service

6.1 Service Architecture

Client (microphone / audio stream)     serve_realtime_ws.py
  │                                      │
  │── WebSocket PCM16 16 kHz ──────────→│
  │   (~100 ms per frame, continuous)    │
  │                                      │
  │                                 ┌────┴─────────────────────────┐
  │                                 │ Real-time loop:               │
  │                                 │  ├─ Dynamic VAD (60 ms chunk) │
  │                                 │  ├─ Endpoint → vLLM decode    │
  │                                 │  ├─ No endpoint → partial     │
  │                                 │  └─ Streaming SPK assignment  │
  │                                 └────┬─────────────────────────┘
  │                                      │
  │←── JSON real-time push ─────────────│

Characteristics:

Audio arrives frame by frame; processing starts immediately
Natural sentence segmentation based on VAD endpoints
Confirmed segment text is locked and never changes; partial text updates in real time
Streaming speaker assignment + global re-clustering on STOP
First-word latency ~480 ms

6.2 Starting the Service

CUDA_VISIBLE_DEVICES=0 python examples/industrial_data_pretraining/fun_asr_nano/serve_realtime_ws.py \
    --port 10095 --language 中文 --hotword-file hotword_list

6.3 WebSocket Protocol

Connection: ws://host:10095

Client → Server:

Message	Format	Description
Start	`"START"`	Initialize session
Hotwords	`"HOTWORDS:word1,word2"`	Optional
Language	`"LANGUAGE:中文"`	Optional
Audio	`binary`	PCM16 16 kHz mono
End	`"STOP"`	Final decode + SPK re-clustering

Server → Client:

{"event": "started"}
{"sentences": [{"text":"你好","start":300,"end":1200,"spk":0}], "partial": "世界", "is_final": false}
{"sentences": [...], "is_final": true}
{"event": "stopped"}

Fields: sentences[] = locked segments, partial = text being spoken (may change), is_final = true after STOP.

Sequence diagram:

Client              Server
  │── START ───────→│
  │←─ started ──────│
  │── [audio] ─────→│
  │←─ {partial} ────│
  │── [audio] ─────→│
  │←─ {sentences+partial} ─│  (VAD cut a sentence)
  │── STOP ────────→│
  │←─ {is_final:true} ────│
  │←─ stopped ─────│

6.4 Client Usage

Python CLI:

python client_python.py --server ws://localhost:10095 --mic
python client_python.py --server ws://localhost:10095 --file audio.wav

Browser: Open client_mic.html

Custom Python:

import asyncio, websockets, numpy as np, json

async def stream(audio_path):
    import soundfile as sf
    audio, sr = sf.read(audio_path)
    pcm = (audio * 32768).astype(np.int16)

    async with websockets.connect("ws://localhost:10095") as ws:
        await ws.send("START")
        await ws.recv()

        for i in range(0, len(pcm), 1600):
            await ws.send(pcm[i:i+1600].tobytes())
            await asyncio.sleep(0.05)

        await ws.send("STOP")
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                for s in data["sentences"]:
                    print(f"[{s['start']/1000:.1f}s] {s['text']}")
                break

asyncio.run(stream("audio.wav"))

7. Dynamic VAD

fsmn-vad enables dynamic silence thresholds by default. Offline and streaming modes use different configurations.

Accumulated duration	Offline (preserve long segs ≤60 s)	Streaming (balance latency)
≤ 5 s	2000 ms	2000 ms
5–10 s	2000 ms	1500 ms
10–15 s	1000 ms	1000 ms
15–20 s	1000 ms	800 ms
20–30 s	800 ms	800 ms
30–45 s	600 ms	400 ms
45–60 s	200–400 ms	100 ms
> 60 s	100 ms	100 ms

Offline mode favors longer segments to reduce boundary-cut losses; streaming mode tightens faster to reduce latency.

Customization

model.generate(input="audio.wav", silence_schedule=[(5000,1500), (20000,800), (float('inf'),300)])

GLM-ASR does not support long-segment inference; pass dynamic_silence=False when using it.

8. API Reference

Parameter	AutoModelVLLM	serve_vllm.py	serve_realtime_ws.py
model	✓	--model	--model
gpu_memory_utilization	✓	--gpu-memory-utilization	--gpu-memory-utilization
tensor_parallel_size	✓	—	--tensor-parallel-size
max_model_len	✓	--max-model-len	--max-model-len
language	generate() param	API param	--language / LANGUAGE:
hotwords	generate() param	API param	--hotword-file / HOTWORDS:

9. FAQ

Q: Offline or streaming? Complete files → offline (high throughput). Microphone / live stream → streaming (low latency).

Q: Can GLM-ASR use dynamic VAD? It does not support long-segment inference. Use dynamic_silence=False.

Q: Performance impact of SPK? RTFx drops from 102 to 46. CER is unchanged. Disabled by default.

Q: I get empty output on an older GPU (V100, P100)? The vLLM path is intended for NVIDIA Ampere or newer GPUs (compute capability >= 8.0). bf16 requires Ampere+, and fp16 on pre-Ampere GPUs may produce degraded or empty output. For V100, P100, and other pre-Ampere hardware, use the AutoModel (PyTorch) path with demo1.py / demo2.py.

Q: Entry points for custom development? Offline: serve_vllm.process_audio() / FunASRNanoVLLM.generate() Streaming: serve_realtime_ws.RealtimeASRSession

Q: Slow first startup? vLLM initialization takes 60–90 s (KV Cache + CUDA Graph warmup). Subsequent inferences are instant.