ASR (Automatic Speech Recognition)
April 9, 2026 · View on GitHub
Transcribe audio files using speech recognition, powered by sherpa-onnx. All inference runs offline on your machine.
Prerequisites
No external dependencies are required for WAV files. Non-WAV format support via the CLI is deprecated and requires ffmpeg (see COLI_DEP002).
CLI
# Plain text output
coli asr recording.wav
# JSON output
coli asr -j recording.wav
# Select model
coli asr --model whisper recording.wav
# Specify language (sensevoice only)
coli asr --language zh recording.wav
Options
-j, --json Output result in JSON format
--model Model to use: whisper, sensevoice (default: sensevoice)
--language Language for sensevoice: auto, zh, en, ja, ko, yue (default: auto)
coli asr-stream
Stream speech recognition from stdin. Expects raw 16kHz mono s16le PCM audio piped in.
# From microphone (macOS)
ffmpeg -f avfoundation -i :0 -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream
# With VAD
ffmpeg -f avfoundation -i :0 -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream --vad
# JSON output (one JSON object per line)
ffmpeg -f avfoundation -i :0 -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream --vad --json
# From a file
ffmpeg -i podcast.m4a -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream --vad
Options
-j, --json Output each result as a JSON line
--vad Enable voice activity detection
--language <lang> Language for sensevoice: auto, zh, en, ja, ko, yue (default: auto)
--asr-interval-ms <ms> Recognition interval in ms (default: 1000, ignored with --vad)
JSON output example
{
"text": "The tribal chieftain called for the boy.",
"model": "sensevoice-small",
"lang": "<|en|>",
"emotion": "<|NEUTRAL|>",
"event": "<|Speech|>",
"tokens": ["The", " tri", "bal", " chief", "tain", "..."],
"timestamps": [0.9, 1.26, 1.56, 1.8, 2.16, "..."],
"duration": 7.152
}
API
ensureModels(models?)
Download the specified models if not already present. Defaults to ['sensevoice']. Call this before runAsr or streamAsr.
import {ensureModels} from '@marswave/coli';
await ensureModels(); // downloads sensevoice only
await ensureModels(['whisper', 'sensevoice']); // downloads both
readWave(filename)
Read a WAV file and return an AudioData object. Use this to load WAV files for runAsr.
import {ensureModels, readWave, runAsr} from '@marswave/coli';
await ensureModels();
const audio = readWave('/path/to/recording.wav');
await runAsr(audio, {json: false, model: 'sensevoice'});
runAsr(input, options)
Run speech recognition on audio data. Results are printed to stdout.
The input parameter accepts either an AudioData object (recommended) or a file path string (deprecated).
import {ensureModels, runAsr} from '@marswave/coli';
await ensureModels();
// Recommended: pass AudioData directly
await runAsr(
{sampleRate: 16000, samples: myFloat32Array},
{json: false, model: 'sensevoice'},
);
// Deprecated: file path input (requires ffmpeg for non-WAV formats)
await runAsr('recording.m4a', {json: false, model: 'sensevoice'});
Options
| Property | Type | Description |
|---|---|---|
json | boolean | Output JSON (with model name, tokens, timestamps, etc.) instead of plain text |
model | 'whisper' | 'sensevoice' | Which model to use for recognition |
language | SenseVoiceLanguage | Language hint for sensevoice: 'auto', 'zh', 'en', 'ja', 'ko', 'yue' (default: 'auto') |
getModelPath(model)
Returns the local filesystem path for a given model.
import {getModelPath} from '@marswave/coli';
getModelPath('sensevoice');
// => '/Users/you/.coli/models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-int8-2024-07-17'
getModelPath('whisper');
// => '/Users/you/.coli/models/sherpa-onnx-whisper-tiny.en'
modelDisplayNames
A mapping from model key to its human-readable display name.
import {modelDisplayNames} from '@marswave/coli';
modelDisplayNames.sensevoice; // => 'sensevoice-small'
modelDisplayNames.whisper; // => 'whisper-tiny.en'
Streaming API
For streaming speech recognition, use the streaming API. It accepts audio as an async iterable of Float32Array chunks (16 kHz mono PCM) and delivers recognition results via the onResult callback as audio accumulates, using the SenseVoice model.
streamAsr(audio, options)
Stream audio in and receive recognition results incrementally. Call ensureModels() first. If using VAD, also call ensureVadModel().
import {ensureModels, ensureVadModel, streamAsr} from '@marswave/coli';
await ensureModels();
const audioSource = createAudioStream(); // AsyncIterable<Float32Array> of 16 kHz mono PCM
// Interval-based (default) — emits partial results at a fixed interval
await streamAsr(audioSource, {
onResult(result) {
console.log(result.text, result.isFinal ? '(final)' : '(partial)');
},
});
// VAD-based — segments speech automatically, each segment emits a final result
await ensureVadModel();
await streamAsr(audioSource, {
vad: true,
onResult(result) {
console.log(result.text);
},
});
// VAD with custom parameters
await streamAsr(audioSource, {
vad: {threshold: 0.4, minSilenceDuration: 0.3, maxSpeechDuration: 10},
onResult(result) {
console.log(result.text);
},
});
Options
| Property | Type | Description |
|---|---|---|
onResult | (result: AsrStreamResult) => void | Callback invoked with each recognition result |
sampleRate | number | Audio sample rate in Hz (default: 16000) |
language | SenseVoiceLanguage | Language hint for sensevoice: 'auto', 'zh', 'en', 'ja', 'ko', 'yue' (default: 'auto') |
asrIntervalMs | number | Recognition interval in milliseconds (default: 1000). Ignored when using VAD |
vad | boolean | VadOptions | Enable VAD. Pass true for defaults or a VadOptions object |
VadOptions
| Property | Type | Description |
|---|---|---|
threshold | number | Speech detection threshold (default: 0.5) |
minSpeechDuration | number | Minimum speech duration in seconds (default: 0.25) |
minSilenceDuration | number | Minimum silence to end a segment in seconds (default: 0.5) |
maxSpeechDuration | number | Maximum speech segment duration in seconds (default: 15) |
enableExternalBuffer | boolean | Use external buffer for VAD speech segments (default: undefined) |
Result
| Property | Type | Description |
|---|---|---|
text | string | Recognized text so far |
lang | string | Detected language tag |
emotion | string | Detected emotion tag |
event | string | Detected audio event tag |
tokens | string[] | Individual tokens |
timestamps | number[] | Timestamp for each token |
isFinal | boolean | Whether the result is finalized |
Models
On first run, coli automatically downloads required models to ~/.coli/models/:
ASR Models
| Name | Model | Languages |
|---|---|---|
sensevoice (default) | SenseVoice Small int8 | Chinese, English, Japanese, Korean, Cantonese |
whisper | Whisper tiny.en int8 | English |
VAD Model
| Name | Model | Size |
|---|---|---|
silero_vad | Silero VAD (k2-fsa export) | ~629 KB |
streamAsr uses the SenseVoice model for recognition. VAD uses Silero VAD and is downloaded separately via ensureVadModel().
Supported audio formats
The CLI accepts WAV files directly. For the programmatic API, use readWave() to load WAV files into an AudioData object, or provide your own AudioData from any source. Non-WAV file path input is deprecated (see COLI_DEP002).