LocalVQE

June 22, 2026


Local Voice Quality Enhancement — compact neural models for acoustic echo cancellation (AEC), noise suppression (NS), and dereverberation of 16 kHz speech, running on commodity CPUs in real time. Causal and streaming (256-sample hop, 16 ms latency). F32 inference in C++ via GGML; a PyTorch reference is included for research.

A streaming, CPU-tuned derivative of DeepVQE (Indenbom et al., Interspeech 2023).

Models

Speed is the time to process one 16 ms hop on a Ryzen 9 7900 (Zen 4) with 4 threads; RT is the real-time factor, i.e. how many times faster than real time the model runs (higher is faster).

| Version | Does | Params | Size (F32) | Speed | Pick it when |
|---|---|---|---|---|---|
| v1.3 (current) | AEC + NS + dereverb | 4.8 M | ~19 MB | 3.2 ms · 5.0× RT | best joint quality, CPU budget available |
| v1.2 | AEC + NS + dereverb | 1.3 M | ~5 MB | 1.7 ms · 8.9× RT | tight CPU / low-power devices |
| v1.4-AEC | echo only (keeps voice, noise, room) | 203 K | ~3 MB | 0.83 ms · 19× RT | NS is handled elsewhere, or you want the room kept |
| v1.4-AEC 2.7K | echo only, linear filter (no mask) | 2.7 K | ~17 KB | 0.36 ms · 44× RT | lightest echo canceller; echo isn't heavily reverberant |
| v1.1 / v1 | AEC + NS + dereverb | 1.3 M | ~5 MB | | superseded by v1.2 |
  • Joint models (v1.2 / v1.3) clean echo, noise, and reverb in one pass. v1.3 is wider and filters noise better; v1.2 has ~1/4 the parameters and roughly half the per-hop time (1.7 ms vs 3.2 ms at 4 threads).
  • v1.4-AEC removes only the far-end echo and passes voice, room, and background through unchanged. It's a classical adaptive filter followed by a small neural mask. The 2.7K build is that filter alone — cheaper and gentler, but it can't remove heavily reverberant echo the way the mask can.
  • Every model needs a far-end reference signal (a loopback of what your speakers play) in addition to the mic.
  • bf16 GGUFs are ~12 % smaller with identical quality and speed; pick f32 unless download size matters.
  • A separate compact, low-power line uses a ~49 K-parameter GTCRN-AEC backend (a distinct architecture, not the v1.x graph) aimed at lower-power CPUs; it runs at ≈21× realtime on a single Raspberry Pi 5 core. See the "Compact line" section below.

Weight files on Hugging Face

| File | Model |
|---|---|
| localvqe-v1.3-4.8M-f32.gguf / .pt | v1.3 joint (GGUF for inference, .pt for research) |
| localvqe-v1.2-1.3M-f32.gguf / .pt | v1.2 joint |
| localvqe-v1.4-aec-200K-f32.gguf / -bf16.gguf | v1.4-AEC (echo only) |
| localvqe-v1.4-aec-2.7K-f32.gguf | v1.4-AEC front-end only |
| localvqe-v1.1-1.3M-f32.gguf, localvqe-v1-1.3M-f32.gguf | older releases |
| localvqe-pi-v1-49k-f32.gguf, localvqe-pi-aec-v1-49k-f32.gguf | Compact GTCRN-AEC line for lower-power CPUs (full enhance / echo-only) |

v1.4-AEC is GGUF-only (no .pt). GGUF integrity is checked at load time against a built-in SHA256 allowlist (ggml/model_hash.cpp). PyTorch checkpoint hashes:

22d3e2f33bb8b25ec1c6a928cfb741bb631d45bae2b3759684818b101c95878e  localvqe-v1.3-4.8M.pt
ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77  localvqe-v1.2-1.3M.pt
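
The two checkpoint hashes above can be checked after download with the Python standard library. This is a minimal sketch: the helper name `sha256_of` and the assumption that the files sit in the current directory are illustrative and not part of the project.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the list above.
EXPECTED = {
    "localvqe-v1.3-4.8M.pt": "22d3e2f33bb8b25ec1c6a928cfb741bb631d45bae2b3759684818b101c95878e",
    "localvqe-v1.2-1.3M.pt": "ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    for name, expected in EXPECTED.items():
        path = Path(name)  # assumption: checkpoint is in the current directory
        if not path.exists():
            print(f"{name}: not found, skipped")
            continue
        status = "OK" if sha256_of(path) == expected else "MISMATCH"
        print(f"{name}: {status}")
```

Reading in chunks keeps memory use flat regardless of checkpoint size.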

Performance

Full 800-clip eval on the ICASSP 2022 AEC Challenge blind test set (real recordings). AECMOS echo / deg are 1–5 (higher = more echo removed / cleaner speech); blind ERLE is 10·log10(E[mic²]/E[enh²]), only meaningful on far-end-only clips. Unprocessed-mic echo MOS is 2.67 / 2.56 / 1.90 / 2.13 / 5.00 across the five scenarios.
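
As a worked example of the blind ERLE formula, the sketch below computes 10·log10(E[mic²]/E[enh²]) on a synthetic signal. It is not the project's evaluation script; the function name `blind_erle_db` and the `eps` guard are assumptions made for illustration.

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10*log10(E[mic^2] / E[enh^2]).

    `mic` and `enh` are equal-length sequences of float samples.
    `eps` (an assumption of this sketch) avoids division by zero
    when the enhanced signal is all zeros.
    """
    if len(mic) != len(enh) or len(mic) == 0:
        raise ValueError("mic and enh must be non-empty and equal length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(y * y for y in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


# Toy far-end-only clip: the "enhanced" output is the mic scaled by 0.1,
# i.e. 100x less power, which corresponds to 20 dB.
mic = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
enh = [0.1 * x for x in mic]
print(f"{blind_erle_db(mic, enh):.1f} dB")  # prints "20.0 dB"
```

Because the measure only compares input and output power, any near-end speech that survives in the output counts against it, which is why it is reported only for far-end-only clips.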

v1.4-AEC keeps background noise and room by design, so its ERLE and far-end DNSMOS are intentionally lower than those of the joint models (it isn't deleting the ambience):

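| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL |
|---|---|---|---|---|---|
| doubletalk | 115 | 4.20 | 2.45 | | 2.59 |
| doubletalk-with-movement | 185 | 4.19 | 2.45 | | 2.55 |
| farend-singletalk | 107 | 3.80 | 4.99 | 14.6 dB | 1.37 |
| farend-singletalk-with-movement | 193 | 3.86 | 4.95 | 11.1 dB | 1.31 |
| nearend-singletalk | 200 | 4.99 | 3.99 | | 3.08 |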

v1.4-AEC 2.7K (front-end only) — matches or beats the full model's perceptual far-end echo at 1/74 the parameters; the mask's extra work shows up as higher ERLE above, not higher echo MOS:

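| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL |
|---|---|---|---|---|---|
| doubletalk | 115 | 4.00 | 2.79 | | 2.46 |
| doubletalk-with-movement | 185 | 3.90 | 2.92 | | 2.42 |
| farend-singletalk | 107 | 4.06 | 5.00 | 6.5 dB | 1.24 |
| farend-singletalk-with-movement | 193 | 4.05 | 4.97 | 3.9 dB | 1.22 |
| nearend-singletalk | 200 | 4.98 | 3.77 | | 3.03 |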

v1.3 (joint) and v1.2 (joint) — these also delete the background, so their far-end ERLE is much higher and not comparable to v1.4-AEC's:

| Scenario | n | v1.3 echo / deg / ERLE / OVRL | v1.2 echo / deg / ERLE / OVRL |
|---|---|---|---|
| doubletalk | 115 | 4.73 / 2.62 / 8.5 dB / 2.89 | 4.72 / 2.37 / 8.4 dB / 2.83 |
| doubletalk-with-movement | 185 | 4.67 / 2.43 / 8.3 dB / 2.85 | 4.65 / 2.30 / 8.1 dB / 2.79 |
| farend-singletalk | 107 | 3.69 / 4.83 / 50.9 dB / 1.94 | 3.78 / 4.91 / 45.7 dB / 1.80 |
| farend-singletalk-with-movement | 193 | 3.88 / 4.98 / 49.9 dB / 1.96 | 4.12 / 4.96 / 40.6 dB / 1.75 |
| nearend-singletalk | 200 | 5.00 / 4.18 / 2.4 dB / 3.17 | 5.00 / 4.16 / 2.1 dB / 3.17 |

Latency

Per-hop p50 / p99 and RT factor. 16 kHz, 256-sample hop, 16 ms budget.

v1.4-AEC (Ryzen 9 7900, CPU):

| Threads | p50 | p99 | RT |
|---|---|---|---|
| 1 | 1.29 ms | 1.89 ms | 12.2× |
| 4 | 0.83 ms | 1.30 ms | 18.6× |

The 2.7K front-end-only build runs at 0.36 ms p50 (≈44× RT), single-threaded by nature. The adaptive front-end always runs on CPU; the neural stage is too small for GPU offload to pay off, so run v1.4-AEC on CPU.

v1.3 (joint):

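| Hardware | Backend | Threads | p50 | p99 | RT |
|---|---|---|---|---|---|
| Ryzen 9 7900 | CPU | 1 | 9.73 ms | 14.48 ms | 1.58× |
| Ryzen 9 7900 | CPU | 4 | 3.21 ms | 3.42 ms | 4.97× |
| Ryzen 9 7900 + RTX 5070 Ti | Vulkan | | 2.57 ms | 4.21 ms | 6.07× |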

v1.2 (joint):

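| Hardware | Backend | Threads | p50 | p99 | RT |
|---|---|---|---|---|---|
| Ryzen 9 7900 | CPU | 1 | 4.28 ms | 4.85 ms | 3.72× |
| Ryzen 9 7900 | CPU | 4 | 1.65 ms | 2.91 ms | 8.90× |
| Ryzen 9 7900 + RTX 5070 Ti | Vulkan | | 1.96 ms | 3.64 ms | 7.85× |
| Ryzen 7 6800U (laptop) | CPU | 4 | 2.11 ms | 2.77 ms | 7.44× |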

These graphs are small, so threads hit diminishing returns past ~4. The library defaults to min(4, available CPUs) (respects taskset / cgroup limits); override with localvqe_options_set_threads. Run bench-run (below) to reproduce on your hardware.
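
To illustrate what a min(4, available CPUs) default looks like when it honors process affinity, here is a rough Python sketch. It is not the library's implementation: the function name `default_thread_count` is hypothetical, and the sketch only looks at the affinity mask (what `taskset` changes), not at cgroup CPU quotas.

```python
import os


def default_thread_count(cap=4):
    """Return min(cap, CPUs this process may actually run on)."""
    try:
        # Linux: reflects taskset / cpuset restrictions on this process.
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity (e.g. macOS, Windows).
        available = os.cpu_count() or 1
    return max(1, min(cap, available))


print(default_thread_count())
```

Under `taskset -c 0,1` on Linux this would report 2 rather than 4, so the worker pool is not oversubscribed on a restricted process.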

Memory (CPU)

Working set the model adds on top of the ~7 MiB binary baseline:

| Model | Post-load delta | Peak RSS |
|---|---|---|
| v1.3 (4.8 M) | +24.4 MiB | 34.1 MiB |
| v1.2 (1.3 M) | +10.0 MiB | 19.6 MiB |
| v1.4-AEC (203 K) | +6.7 MiB | 17.0 MiB |

Compact line — GTCRN-AEC (for lower-power CPUs)

A separate, much smaller second line of models for lower-power CPUs: a ~49 K-parameter GTCRN-AEC network — a distinct architecture based on GTCRN (Rong et al., ICASSP 2024) — with the project's DSP echo-cancellation front-end. Two variants share the architecture: a full enhancer (echo + NS + dereverb) and an echo-only "keep-noise" build. Runs on any CPU; for single-board ARM, cross-compile for aarch64 with ggml/docker/Dockerfile.arm64 (docker buildx + qemu).

Run it from the CLI exactly like any other model:

./ggml/build/bin/localvqe localvqe-pi-v1-49k-f32.gguf \
    --in-wav mic.wav ref.wav --out-wav enhanced.wav

[Figure: compact-line AEC sweep. A scrolling spectrogram whose processing frontier reveals the cleaned output ~21× faster than real-time playback on a single Raspberry Pi 5 core.]

Whole-clip RTF (processing time ÷ audio duration; lower is faster) on the real ggml graph, benchmarked with test_gtcrn --bench on a Raspberry Pi 5 (Cortex-A76, Ubuntu 24.04) as one example of a low-power target. On-device output is parity-verified against the PyTorch reference to within ~1e-6, and a 16 ms hop takes ~0.78 ms single-threaded. RTF is identical for both variants:

| Threads | 8 s clip | RTF | RT |
|---|---|---|---|
| 1 | 388 ms | 0.048 | ~21× |
| 2 | 219 ms | 0.027 | ~37× |
| 4 | 163 ms | 0.020 | ~49× |
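
The RTF and RT columns follow directly from the clip timings. The short sketch below redoes that arithmetic; the function names are illustrative, and the only inputs are the numbers from the table above.

```python
def rtf(processing_ms, audio_ms):
    """Real-time factor: processing time divided by audio duration."""
    return processing_ms / audio_ms


def speedup(processing_ms, audio_ms):
    """How many times faster than real time (the inverse of RTF)."""
    return audio_ms / processing_ms


CLIP_MS = 8_000  # 8 s clip
HOP_MS = 16      # 256 samples at 16 kHz

# Threads -> wall-clock time for the whole clip, from the table above.
measured_ms = {1: 388, 2: 219, 4: 163}

for threads, ms in measured_ms.items():
    per_hop_ms = ms / (CLIP_MS / HOP_MS)  # 500 hops in an 8 s clip
    print(f"{threads} thread(s): RTF {rtf(ms, CLIP_MS):.4f}, "
          f"{speedup(ms, CLIP_MS):.1f}x real time, "
          f"{per_hop_ms:.3f} ms per hop")
```

The single-thread row works out to about 0.776 ms per hop, consistent with the ~0.78 ms figure quoted for the Raspberry Pi 5.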

Usage

Build

Requires CMake ≥ 3.20 and a C++17 compiler. A Nix flake is provided (nix develop); without Nix, install cmake, gcc/clang, pkg-config, and libsndfile.

git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

Binaries land in ggml/build/bin/. The CPU build produces several libggml-cpu-*.so variants (SSE4.2 → AVX-512) selected at runtime — keep them next to the binary. For GPU, add -DLOCALVQE_VULKAN=ON (the loader falls back to CPU when no Vulkan ICD is present).

Run (CLI)

./ggml/build/bin/localvqe localvqe-v1.3-4.8M-f32.gguf \
    --in-wav mic.wav ref.wav \
    --out-wav enhanced.wav

Both the mic and the far-end reference must be 16 kHz mono PCM. Swap the GGUF to switch models: the command is the same for every version, because the engine reads what to do from the file.
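
Before running the CLI, it can be handy to confirm that both inputs really are 16 kHz mono. The following is a small pre-flight sketch using Python's standard `wave` module; the helper names are hypothetical, and because the required bit depth is not stated here, the sketch does not check it.

```python
import wave


def wav_problems(path, rate=16000, channels=1):
    """Return a list of reasons `path` is not 16 kHz mono (empty if fine)."""
    problems = []
    # The stdlib wave module only opens uncompressed PCM WAV files and
    # raises wave.Error for other encodings (e.g. IEEE-float WAVs).
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != rate:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected {rate}")
        if wf.getnchannels() != channels:
            problems.append(f"{wf.getnchannels()} channels, expected {channels}")
    return problems


def check_pair(mic_path, ref_path):
    """Check both CLI inputs; raise ValueError listing every problem found."""
    issues = []
    for label, path in (("mic", mic_path), ("ref", ref_path)):
        issues += [f"{label}: {p}" for p in wav_problems(path)]
    if issues:
        raise ValueError("; ".join(issues))
```

Calling `check_pair("mic.wav", "ref.wav")` returns silently when both files match and raises with a combined message otherwise.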

Embed (C API)

cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)   # -> liblocalvqe.so

API in ggml/localvqe_api.h:

localvqe_ctx_t ctx = localvqe_new("localvqe-v1.3-4.8M-f32.gguf");
localvqe_process_f32(ctx, mic, ref, n_samples, out);   // whole clip
// or per 256-sample hop for real-time: localvqe_process_frame_f32(...)
localvqe_free(ctx);

// Compact / low-power line (GTCRN): same calls — whole-clip or per-hop streaming.
localvqe_ctx_t pi = localvqe_new("localvqe-pi-v1-49k-f32.gguf");
localvqe_process_f32(pi, mic, ref, n_samples, out);        // whole clip
// localvqe_process_frame_f32(pi, mic, ref, 256, hop_out); // or per 256-sample hop (16 ms latency)
localvqe_free(pi);
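
For per-hop streaming, the caller has to hand over exactly 256 samples of mic and reference per call. The sketch below shows only that framing logic, in Python for brevity. `process_hop` is a hypothetical stand-in for a binding around the per-hop C call, and zero-padding the final partial hop is a choice made for this sketch, not documented library behavior. It also ignores any output delay a real model may introduce.

```python
HOP = 256  # samples per hop at 16 kHz (16 ms)


def iter_hops(mic, ref, hop=HOP):
    """Yield (mic_hop, ref_hop) pairs of exactly `hop` samples each.

    The final partial hop is zero-padded so every downstream call
    sees a full frame (an assumption of this sketch).
    """
    if len(mic) != len(ref):
        raise ValueError("mic and ref must have the same number of samples")
    for start in range(0, len(mic), hop):
        m = list(mic[start:start + hop])
        r = list(ref[start:start + hop])
        pad = hop - len(m)
        yield m + [0.0] * pad, r + [0.0] * pad


def stream(mic, ref, process_hop, hop=HOP):
    """Run `process_hop` on each hop; return output trimmed to len(mic).

    `process_hop(mic_hop, ref_hop)` must return `hop` output samples.
    It is a hypothetical placeholder, not a LocalVQE function.
    """
    out = []
    for m, r in iter_hops(mic, ref, hop):
        out.extend(process_hop(m, r))
    return out[:len(mic)]


# Demo with a placeholder "model" that passes the mic through unchanged.
mic = [float(i) for i in range(600)]
ref = [0.0] * 600
out = stream(mic, ref, lambda m, r: m)
print(len(out), out == mic)  # prints "600 True"
```

In a live setup the same loop would be driven by the audio callback, with mic and loopback buffers accumulated until a full hop is available.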

See ggml/example_purego_test.go for a Go / purego binding.

Benchmark / test

cmake --build ggml/build --target bench-run          # downloads a model + clip, benches
cmake --build ggml/build --target test_regression regression-assets
ctest --test-dir ggml/build --output-on-failure      # SKIPs models not downloaded

bench-run honors -DBENCH_BACKEND=Vulkan -DBENCH_DEVICE=N -DBENCH_ITERS=N set at configure time; bench-list-devices enumerates backends.

OBS Studio plugin

obs-plugin/ wraps liblocalvqe.so as an audio filter — appears as "LocalVQE (AEC + Noise + Dereverb)" in any source's filter list, with the bundled v1.3 GGUF preselected. NS and dereverb work out of the box; for AEC, set a Reference source (usually "Desktop Audio") so the model knows what's playing. Browse to localvqe-v1.4-aec-200K-f32.gguf to switch to echo-only mode.

nix develop .#obs-plugin
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)
cmake --build ggml/build --target regression-assets
cp ggml/build/bench_assets/localvqe-v1.3-4.8M-f32.gguf obs-plugin/data/
cmake -S obs-plugin -B obs-plugin/build -DCMAKE_BUILD_TYPE=Release
cmake --build obs-plugin/build -j$(nproc) && cmake --install obs-plugin/build

The install is self-contained (plugin .so + liblocalvqe.so + the libggml-cpu-*.so variants under ~/.config/obs-studio/plugins/). Tested on Linux; macOS expected to work; Windows implemented but unverified.

PyTorch reference

pytorch/ holds the model definition used to train and export the weights — for verification and research, not end-user inference (use the GGML build).

cd pytorch && pip install -r requirements.txt
python -c "import yaml, torch; from localvqe.model import LocalVQE; \
cfg = yaml.safe_load(open('configs/default.yaml')); \
m = LocalVQE(**cfg['model'], n_freqs=cfg['audio']['n_freqs']); \
print(sum(p.numel() for p in m.parameters()))"

Repository layout

ggml/        C++ streaming inference (GGML graph, CLI, C API, tests)
pytorch/     PyTorch reference (model definition only)
obs-plugin/  OBS Studio audio filter wrapping liblocalvqe.so

Citing

Cite the repository via CITATION.cff (GitHub's "Cite this repository" button produces APA / BibTeX), and the upstream DeepVQE paper:

@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Beltr{\'a}n, Nicolae-C{\u{a}}t{\u{a}}lin
               and Chernov, Mykola and Aichner, Robert},
  booktitle = {Interspeech}, year = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}

The compact GTCRN-AEC line is based on GTCRN — please also cite:

@inproceedings{rong2024gtcrn,
  title     = {GTCRN: A Speech Enhancement Model Requiring Ultralow
               Computational Resources},
  author    = {Rong, Xiaobin and Sun, Tianchi and Zhang, Xu and Hu, Yuxiang
               and Zhu, Changbao and Lu, Jing},
  booktitle = {ICASSP 2024 - 2024 IEEE International Conference on Acoustics,
               Speech and Signal Processing (ICASSP)},
  pages     = {971--975}, year = {2024},
  doi       = {10.1109/ICASSP48485.2024.10448310}
}

Reference implementation: https://github.com/Xiaobin-Rong/gtcrn.

Attribution, safety, license

Weights are trained on data from the ICASSP 2023 DNS Challenge (Microsoft, CC BY 4.0) and fine-tuned on data from the ICASSP 2022/2023 AEC Challenge.

Safety: training data was filtered by DNSMOS, which can misclassify distressed speech (screaming, crying) as noise. LocalVQE may attenuate such signals and must not be relied on for emergency or safety-critical use.

Licensed under Apache 2.0 — see LICENSE.