ASR (Automatic Speech Recognition)

April 9, 2026

Transcribe audio files using speech recognition, powered by sherpa-onnx. All inference runs offline on your machine.

Prerequisites

No external dependencies are required for WAV files. Non-WAV format support via the CLI is deprecated and requires ffmpeg (see COLI_DEP002).

CLI

# Plain text output
coli asr recording.wav

# JSON output
coli asr -j recording.wav

# Select model
coli asr --model whisper recording.wav

# Specify language (sensevoice only)
coli asr --language zh recording.wav

Options

-j, --json     Output result in JSON format
--model        Model to use: whisper, sensevoice (default: sensevoice)
--language     Language for sensevoice: auto, zh, en, ja, ko, yue (default: auto)

coli asr-stream

Run streaming speech recognition on audio piped in through stdin. The input must be raw 16 kHz mono s16le (signed 16-bit little-endian) PCM.

# From microphone (macOS)
ffmpeg -f avfoundation -i :0 -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream

# With VAD
ffmpeg -f avfoundation -i :0 -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream --vad

# JSON output (one JSON object per line)
ffmpeg -f avfoundation -i :0 -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream --vad --json

# From a file
ffmpeg -i podcast.m4a -ar 16000 -ac 1 -f s16le pipe:1 | coli asr-stream --vad

Options

-j, --json              Output each result as a JSON line
--vad                   Enable voice activity detection
--language <lang>       Language for sensevoice: auto, zh, en, ja, ko, yue (default: auto)
--asr-interval-ms <ms>  Recognition interval in ms (default: 1000, ignored with --vad)
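
When another program consumes `coli asr-stream --json`, stdout usually arrives in arbitrary chunks, so lines have to be reassembled before they can be parsed. The following is a minimal sketch of that step. It assumes each result is one complete JSON object terminated by a newline. `JsonLineBuffer` is a hypothetical helper, not part of coli, and the parsed objects are left untyped.

```typescript
// Hypothetical helper (not part of coli): reassembles newline-delimited JSON
// from arbitrary text chunks, e.g. the stdout of `coli asr-stream --json`.
class JsonLineBuffer {
	private pending = '';

	// Feed one chunk of text; returns every JSON object completed by this chunk.
	push(chunk: string): unknown[] {
		this.pending += chunk;
		const lines = this.pending.split('\n');
		// The last element is an unfinished line ('' if the chunk ended on '\n').
		this.pending = lines.pop() ?? '';
		return lines
			.map(line => line.trim())
			.filter(line => line.length > 0)
			.map(line => JSON.parse(line) as unknown);
	}
}
```

Feed it the decoded stdout chunks of the child process and treat each returned object as one recognition result.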

JSON output example

{
	"text": "The tribal chieftain called for the boy.",
	"model": "sensevoice-small",
	"lang": "<|en|>",
	"emotion": "<|NEUTRAL|>",
	"event": "<|Speech|>",
	"tokens": ["The", " tri", "bal", " chief", "tain", "..."],
	"timestamps": [0.9, 1.26, 1.56, 1.8, 2.16, "..."],
	"duration": 7.152
}
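
The `lang`, `emotion`, and `event` fields arrive wrapped in `<|...|>` markers, and `tokens` and `timestamps` are parallel arrays. A small sketch of post-processing this output follows. The `AsrJson` interface only mirrors the example above, and `stripTag` and `alignTokens` are hypothetical helpers. Reading the timestamps as seconds is an assumption based on the `duration` value.

```typescript
// Shape mirrored from the JSON output example; treat it as an assumption.
interface AsrJson {
	text: string;
	lang: string;
	emotion: string;
	event: string;
	tokens: string[];
	timestamps: number[];
}

// Strips the `<|...|>` wrapper, so '<|en|>' becomes 'en'.
// Values without the wrapper are returned unchanged.
function stripTag(tag: string): string {
	const match = /^<\|(.*)\|>$/.exec(tag);
	return match?.[1] ?? tag;
}

// Pairs each token with its timestamp (assumed to be in seconds).
// Extra entries in the longer array are ignored.
function alignTokens(result: AsrJson): Array<{token: string; start: number}> {
	const count = Math.min(result.tokens.length, result.timestamps.length);
	const aligned: Array<{token: string; start: number}> = [];
	for (let i = 0; i < count; i++) {
		aligned.push({token: result.tokens[i], start: result.timestamps[i]});
	}
	return aligned;
}
```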

API

ensureModels(models?)

Download the specified models if not already present. Defaults to ['sensevoice']. Call this before runAsr or streamAsr.

import {ensureModels} from '@marswave/coli';

await ensureModels(); // downloads sensevoice only
await ensureModels(['whisper', 'sensevoice']); // downloads both

readWave(filename)

Read a WAV file and return an AudioData object. Use this to load WAV files for runAsr.

import {ensureModels, readWave, runAsr} from '@marswave/coli';

await ensureModels();

const audio = readWave('/path/to/recording.wav');
await runAsr(audio, {json: false, model: 'sensevoice'});

runAsr(input, options)

Run speech recognition on audio data. Results are printed to stdout.

The input parameter accepts either an AudioData object (recommended) or a file path string (deprecated).

import {ensureModels, runAsr} from '@marswave/coli';

await ensureModels();

// Recommended: pass AudioData directly
await runAsr(
	{sampleRate: 16000, samples: myFloat32Array},
	{json: false, model: 'sensevoice'},
);

// Deprecated: file path input (requires ffmpeg for non-WAV formats)
await runAsr('recording.m4a', {json: false, model: 'sensevoice'});
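
If your audio is 16-bit PCM rather than floats, it has to be converted before it can be passed as AudioData. The sketch below shows one way to do that. The `AudioData` interface mirrors the object literal above. `pcm16ToAudioData` is a hypothetical helper, and it assumes runAsr expects samples normalized to the range [-1, 1], which is the convention sherpa-onnx uses.

```typescript
// Assumed shape, mirroring the object literal passed to runAsr above.
interface AudioData {
	sampleRate: number;
	samples: Float32Array;
}

// Converts signed 16-bit PCM samples to floats in [-1, 1).
// Assumption: runAsr expects normalized float samples.
function pcm16ToAudioData(pcm: Int16Array, sampleRate = 16000): AudioData {
	const samples = new Float32Array(pcm.length);
	for (let i = 0; i < pcm.length; i++) {
		// 32768 is the magnitude of the most negative int16 value.
		samples[i] = pcm[i] / 32768;
	}
	return {sampleRate, samples};
}
```

The returned object can then be passed as the first argument to runAsr.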

Options

Property  Type                      Description
json      boolean                   Output JSON (with model name, tokens, timestamps, etc.) instead of plain text
model     'whisper' | 'sensevoice'  Which model to use for recognition
language  SenseVoiceLanguage        Language hint for sensevoice: 'auto', 'zh', 'en', 'ja', 'ko', 'yue' (default: 'auto')

getModelPath(model)

Returns the local filesystem path for a given model.

import {getModelPath} from '@marswave/coli';

getModelPath('sensevoice');
// => '/Users/you/.coli/models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-int8-2024-07-17'

getModelPath('whisper');
// => '/Users/you/.coli/models/sherpa-onnx-whisper-tiny.en'

modelDisplayNames

A mapping from model key to its human-readable display name.

import {modelDisplayNames} from '@marswave/coli';

modelDisplayNames.sensevoice; // => 'sensevoice-small'
modelDisplayNames.whisper; // => 'whisper-tiny.en'

Streaming API

The streaming API accepts audio as an async iterable of Float32Array chunks (16 kHz mono PCM) and delivers recognition results through the onResult callback as audio accumulates. It uses the SenseVoice model.
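
To try the streaming API with audio that is already in memory, the samples can be split into fixed-duration chunks and exposed as an async iterable. This is a minimal sketch. `chunkSamples` and `toAsyncChunks` are hypothetical helpers, not part of coli, and the 100 ms chunk length is an arbitrary choice.

```typescript
// Yields consecutive views into `samples`, each `chunkMs` long
// (the last one may be shorter). At 16 kHz, 100 ms is 1600 samples.
function* chunkSamples(
	samples: Float32Array,
	sampleRate = 16000,
	chunkMs = 100,
): Generator<Float32Array> {
	const chunkSize = Math.max(1, Math.round((sampleRate * chunkMs) / 1000));
	for (let start = 0; start < samples.length; start += chunkSize) {
		// subarray() creates a view on the same memory; nothing is copied.
		yield samples.subarray(start, start + chunkSize);
	}
}

// Wraps the same chunks in an AsyncIterable<Float32Array>.
async function* toAsyncChunks(
	samples: Float32Array,
	sampleRate = 16000,
	chunkMs = 100,
): AsyncGenerator<Float32Array> {
	for (const chunk of chunkSamples(samples, sampleRate, chunkMs)) {
		yield chunk;
	}
}
```

Going by the documented input type, the result of `toAsyncChunks(samples)` should be usable as the first argument of streamAsr.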

streamAsr(audio, options)

Stream audio in and receive recognition results incrementally. Call ensureModels() first. If using VAD, also call ensureVadModel().

import {ensureModels, ensureVadModel, streamAsr} from '@marswave/coli';

await ensureModels();

// createAudioStream() is a placeholder for your own audio source:
// an AsyncIterable<Float32Array> of 16 kHz mono PCM
const audioSource = createAudioStream();

// Interval-based (default) — emits partial results at a fixed interval
await streamAsr(audioSource, {
	onResult(result) {
		console.log(result.text, result.isFinal ? '(final)' : '(partial)');
	},
});

// VAD-based — segments speech automatically, each segment emits a final result
await ensureVadModel();
await streamAsr(audioSource, {
	vad: true,
	onResult(result) {
		console.log(result.text);
	},
});

// VAD with custom parameters
await streamAsr(audioSource, {
	vad: {threshold: 0.4, minSilenceDuration: 0.3, maxSpeechDuration: 10},
	onResult(result) {
		console.log(result.text);
	},
});

Options

Property       Type                               Description
onResult       (result: AsrStreamResult) => void  Callback invoked with each recognition result
sampleRate     number                             Audio sample rate in Hz (default: 16000)
language       SenseVoiceLanguage                 Language hint for sensevoice: 'auto', 'zh', 'en', 'ja', 'ko', 'yue' (default: 'auto')
asrIntervalMs  number                             Recognition interval in milliseconds (default: 1000). Ignored when using VAD
vad            boolean | VadOptions               Enable VAD. Pass true for defaults or a VadOptions object

VadOptions

Property              Type     Description
threshold             number   Speech detection threshold (default: 0.5)
minSpeechDuration     number   Minimum speech duration in seconds (default: 0.25)
minSilenceDuration    number   Minimum silence to end a segment in seconds (default: 0.5)
maxSpeechDuration     number   Maximum speech segment duration in seconds (default: 15)
enableExternalBuffer  boolean  Use external buffer for VAD speech segments (default: undefined)
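
The `vad` option takes either a boolean or a partial VadOptions object. The sketch below illustrates how the effective settings could be derived from the defaults listed above. How coli merges these values internally is not documented here, so `resolveVad` is a hypothetical helper. It assumes that omitted fields fall back to their defaults.

```typescript
// Fields copied from the VadOptions table; all are optional for callers.
interface VadOptions {
	threshold?: number;
	minSpeechDuration?: number;
	minSilenceDuration?: number;
	maxSpeechDuration?: number;
	enableExternalBuffer?: boolean;
}

// Defaults as listed in the table (enableExternalBuffer has none).
const VAD_DEFAULTS = {
	threshold: 0.5,
	minSpeechDuration: 0.25,
	minSilenceDuration: 0.5,
	maxSpeechDuration: 15,
};

// Hypothetical helper: effective settings for a `vad` option value,
// or undefined when VAD is disabled.
function resolveVad(vad: boolean | VadOptions | undefined) {
	if (vad === undefined || vad === false) return undefined;
	if (vad === true) return {...VAD_DEFAULTS};
	// Caller-supplied fields override the defaults.
	return {...VAD_DEFAULTS, ...vad};
}
```

With the custom parameters from the example above, `resolveVad({threshold: 0.4, minSilenceDuration: 0.3, maxSpeechDuration: 10})` keeps `minSpeechDuration` at 0.25.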

Result

Property    Type      Description
text        string    Recognized text so far
lang        string    Detected language tag
emotion     string    Detected emotion tag
event       string    Detected audio event tag
tokens      string[]  Individual tokens
timestamps  number[]  Timestamp for each token
isFinal     boolean   Whether the result is finalized
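
Because results arrive as a mix of partial and final updates, a caller typically needs to fold them into a single transcript. The sketch below shows one way to do that. It assumes that a partial result replaces the previous partial and that a final result closes the current segment. `createTranscriptCollector` is a hypothetical helper, and joining segments with a space is a simplification that may not suit languages written without spaces.

```typescript
// Subset of the result fields listed above that this sketch needs.
interface AsrStreamResult {
	text: string;
	isFinal: boolean;
}

// Hypothetical helper: keeps finalized segments plus the latest partial text.
function createTranscriptCollector() {
	const finals: string[] = [];
	let partial = '';
	return {
		// Pass this function as the onResult callback.
		onResult(result: AsrStreamResult): void {
			if (result.isFinal) {
				finals.push(result.text);
				partial = '';
			} else {
				partial = result.text;
			}
		},
		// Finalized segments followed by the in-progress partial, if any.
		transcript(): string {
			return [...finals, partial].filter(s => s.length > 0).join(' ');
		},
	};
}
```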

Models

On first run, coli automatically downloads required models to ~/.coli/models/:

ASR Models

Name                  Model                  Languages
sensevoice (default)  SenseVoice Small int8  Chinese, English, Japanese, Korean, Cantonese
whisper               Whisper tiny.en int8   English

VAD Model

Name        Model                       Size
silero_vad  Silero VAD (k2-fsa export)  ~629 KB

streamAsr uses the SenseVoice model for recognition. VAD uses the Silero VAD model, which is downloaded separately via ensureVadModel().

Supported audio formats

The CLI accepts WAV files directly. For the programmatic API, use readWave() to load WAV files into an AudioData object, or provide your own AudioData from any source. Non-WAV file path input is deprecated (see COLI_DEP002).