ViSQOL (Python)

May 27, 2026

A pure Python implementation of Google's ViSQOL (Virtual Speech Quality Objective Listener) for objective audio/speech quality assessment.

ViSQOL compares a reference audio signal with a degraded version and outputs a MOS-LQO (Mean Opinion Score - Listening Quality Objective) score on a scale of 1.0 – 5.0.

Features

  • Two modes: Audio mode (music/general audio at 48 kHz) and Speech mode (speech at 16 kHz)
  • High accuracy: 12/12 conformance tests pass against the official C++ implementation
    • Audio mode: 9 of 10 tests reproduce the C++ MOS exactly (diff = 0.000000); the remaining test differs by 0.000117
    • Speech mode (polynomial): diff = 0.001057
    • Speech mode (lattice TFLite): diff = 0.002341
  • Two speech quality mappers matching C++ ViSQOL:
    • Lattice (default) — deep-lattice TFLite network (--use_lattice_model=true in C++); requires the optional [lattice] extra
    • Polynomial (fallback) — legacy exponential fit (--use_lattice_model=false in C++)
  • Pure Python: no C/C++ compilation required (the optional [lattice] extra adds the Google ai-edge-litert TFLite runtime as a binary wheel)
  • Minimal dependencies: 4 core pip packages (numpy, scipy, soundfile, libsvm-official)
  • Optional Numba acceleration: pip install visqol-python[accel] for JIT-compiled Gammatone filterbank (parallel) and a fused NSIM + DP patch matching kernel
  • Optional pyFFTW backend: pip install visqol-python[fftw] routes alignment / xcorr FFTs through FFTW3; combined with [accel] this gives a ~16× overall speedup over pure Python (RTF 0.036, vs. an estimated 0.093 for C++)
  • Batch & parallel evaluation: measure_batch(parallel=True) for multi-process execution across CPU cores
  • Fully typed: PEP 561 py.typed, strict mypy, ruff-enforced code style

Installation

pip install visqol-python

For C++-default-equivalent speech mode (deep-lattice TFLite mapper):

pip install visqol-python[lattice]   # requires Python ≥ 3.10

For Numba-accelerated Gammatone filtering and the fused NSIM + DP kernel:

pip install visqol-python[accel]

For FFTW3-backed alignment FFTs via pyFFTW:

pip install visqol-python[fftw]

Install everything (lattice + numba + fftw):

pip install visqol-python[all]

Or install from source:

git clone https://github.com/talker93/visqol-python.git
cd visqol-python
pip install -e ".[dev]"

Note on speech mode parity: Without the [lattice] extra, speech mode falls back to the polynomial mapping (equivalent to running C++ ViSQOL with --use_lattice_model=false). The polynomial can over-predict MOS by 1–2 points on degraded speech vs the C++ default. Install [lattice] whenever you need numbers that line up with the C++ default behaviour (see issue #1).
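
To see which speech mapper a given environment will end up with, one option is to probe for the optional runtime before creating the API. The following is a minimal sketch, not part of visqol-python: `has_module` is a hypothetical helper, and the import name `ai_edge_litert` is an assumption about the ai-edge-litert wheel (this README only names the pip package).

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Assumed import name for the ai-edge-litert wheel; adjust if it differs.
LATTICE_RUNTIME = "ai_edge_litert"

if has_module(LATTICE_RUNTIME):
    print("lattice runtime found: speech mode can use the TFLite mapper")
else:
    print("lattice runtime missing: speech mode falls back to the polynomial mapper")
```

A check like this only reports whether the runtime is importable; it does not confirm which mapper the library actually selected.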

Quick Start

Python API

from visqol import VisqolApi

# Audio mode (default) - for music and general audio
api = VisqolApi()
api.create(mode="audio")
result = api.measure("reference.wav", "degraded.wav")
print(f"MOS-LQO: {result.moslqo:.4f}")

# Speech mode - for speech signals
api = VisqolApi()
api.create(mode="speech")
result = api.measure("ref_speech.wav", "deg_speech.wav")
print(f"MOS-LQO: {result.moslqo:.4f}")

Using NumPy Arrays

import numpy as np
import soundfile as sf
from visqol import VisqolApi

ref, sr = sf.read("reference.wav")
deg, _  = sf.read("degraded.wav")

api = VisqolApi()
api.create(mode="audio")
result = api.measure_from_arrays(ref, deg, sample_rate=sr)
print(f"MOS-LQO: {result.moslqo:.4f}")

Batch Evaluation

from visqol import VisqolApi

api = VisqolApi()
api.create(mode="audio")

file_pairs = [
    ("ref1.wav", "deg1.wav"),
    ("ref2.wav", "deg2.wav"),
    ("ref3.wav", "deg3.wav"),
]

# Sequential with progress callback
results = api.measure_batch(
    file_pairs,
    progress_callback=lambda done, total: print(f"{done}/{total}"),
)

# Multi-process parallel (here capped at 4 worker processes)
results = api.measure_batch(file_pairs, parallel=True, max_workers=4)

for pair, result in zip(file_pairs, results):
    if isinstance(result, Exception):
        print(f"{pair}: FAILED — {result}")
    else:
        print(f"{pair}: MOS-LQO = {result.moslqo:.4f}")
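
Because a batch can mix scores and exceptions, it is often convenient to split the two before reporting. A rough sketch of one way to do that follows; `summarize_batch` is a hypothetical helper, and the results are stand-in objects carrying only a `moslqo` attribute so the example runs without audio files.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_batch(pairs, results):
    """Split batch results into scored pairs and failed pairs, plus the mean score."""
    scored, failed = [], []
    for pair, result in zip(pairs, results):
        if isinstance(result, Exception):
            failed.append((pair, result))
        else:
            scored.append((pair, result.moslqo))
    avg = mean(score for _, score in scored) if scored else None
    return scored, failed, avg


# Stand-in data: two successful results and one failure.
pairs = [("ref1.wav", "deg1.wav"), ("ref2.wav", "deg2.wav"), ("ref3.wav", "deg3.wav")]
results = [
    SimpleNamespace(moslqo=4.5),
    FileNotFoundError("deg2.wav"),
    SimpleNamespace(moslqo=3.5),
]

scored, failed, avg = summarize_batch(pairs, results)
print(f"{len(scored)} scored, {len(failed)} failed, mean MOS-LQO = {avg:.4f}")
```

With real data, `results` would be the list returned by `measure_batch`.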

Command Line

# Audio mode (default)
python -m visqol -r reference.wav -d degraded.wav

# Speech mode
python -m visqol -r reference.wav -d degraded.wav --speech_mode

# Verbose output (per-patch details)
python -m visqol -r reference.wav -d degraded.wav -v

CLI options:

| Flag | Description |
| --- | --- |
| -r, --reference | Path to reference WAV file (required) |
| -d, --degraded | Path to degraded WAV file (required) |
| --speech_mode | Use speech mode (16 kHz) |
| --no_lattice_model | Speech mode: disable lattice TFLite mapper, use polynomial fallback |
| --lattice_model | Custom path to lattice .tflite model (speech mode) |
| --unscaled_speech | Don't scale polynomial speech MOS to 5.0 (polynomial only) |
| --model | Custom SVR model file path (audio mode only) |
| --search_window | Search window radius (default: 60) |
| -v, --verbose | Show detailed per-patch results |
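
When driving the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only flags listed above; `build_visqol_cmd` is a hypothetical helper, and passing the window radius as a separate value (`--search_window 60`) is an assumption about the argument syntax. The CLI's output format is not documented here, so the sketch does not try to parse it.

```python
import sys


def build_visqol_cmd(reference, degraded, speech=False, verbose=False, search_window=None):
    """Assemble the argv list for `python -m visqol` from the documented flags."""
    cmd = [sys.executable, "-m", "visqol", "-r", str(reference), "-d", str(degraded)]
    if speech:
        cmd.append("--speech_mode")
    if search_window is not None:
        # Assumed syntax: flag followed by an integer radius.
        cmd += ["--search_window", str(search_window)]
    if verbose:
        cmd.append("-v")
    return cmd


cmd = build_visqol_cmd("reference.wav", "degraded.wav", speech=True, search_window=60)
print(" ".join(cmd[1:]))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the list rather than a shell string avoids quoting problems with file paths that contain spaces.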

Output

The measure() method returns a SimilarityResult object with:

| Field | Description |
| --- | --- |
| moslqo | MOS-LQO score (1.0 – 5.0) |
| vnsim | Mean NSIM across all patches |
| fvnsim | Per-frequency-band mean NSIM |
| fstdnsim | Per-frequency-band std of NSIM |
| fvdegenergy | Per-frequency-band degraded energy |
| patch_sims | List of per-patch similarity details |
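
The per-band fields can be used to see where in the spectrum a degradation is concentrated. As an illustration, the sketch below ranks bands by mean NSIM; it assumes `fvnsim` behaves like a one-dimensional sequence of floats (one value per Gammatone band), `worst_bands` is a hypothetical helper, and the numbers are made up.

```python
def worst_bands(fvnsim, k=3):
    """Return (band_index, nsim) for the k bands with the lowest mean NSIM."""
    ranked = sorted(enumerate(fvnsim), key=lambda item: item[1])
    return ranked[:k]


# Made-up per-band values standing in for result.fvnsim.
example_fvnsim = [0.98, 0.95, 0.71, 0.99, 0.64, 0.90]

for band, nsim in worst_bands(example_fvnsim, k=2):
    print(f"band {band}: mean NSIM = {nsim:.2f}")
```

Lower NSIM means the degraded signal is less similar to the reference in that band.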

Modes

Audio Mode (default)

  • Target sample rate: 48 kHz
  • 32 Gammatone frequency bands (50 Hz – 15 000 Hz)
  • Quality mapping: SVR (Support Vector Regression) model
  • Best for: music, environmental audio, codecs

Speech Mode

  • Target sample rate: 16 kHz
  • 21 Gammatone frequency bands (50 Hz – 8 000 Hz)
  • VAD (Voice Activity Detection) based patch selection
  • Quality mapping (choose one):
    • Deep-lattice TFLite (default) — same mapper as C++ ViSQOL's default --use_lattice_model=true; requires pip install visqol-python[lattice]
    • Exponential polynomial (fallback) — same as C++ --use_lattice_model=false; used automatically when the lattice runtime is not installed
  • Toggle from Python: api.create(mode="speech", use_lattice_model=False)
  • Toggle from CLI: --no_lattice_model
  • Best for: speech, VoIP, telephony

Performance

Measured on Apple M-series, Python 3.13, audio mode on the guitar48_stereo 12.5 s conformance case (3-run average):

| Configuration | RTF | Typical time | Speedup vs pure Python |
| --- | --- | --- | --- |
| Pure Python + NumPy/SciPy | 0.58 | ~7 s | 1.0× |
| + [accel] (Numba JIT) | 0.067 | ~0.84 s | 8.7× |
| + [accel] [fftw] (Numba + FFTW3) | 0.036 | ~0.45 s | 16× |

RTF (Real-Time Factor) < 1.0 means faster than real-time. With Numba + pyFFTW the Python implementation runs at 2.6× the C++ estimated speed (C++ RTF ≈ 0.093).
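
For reference, RTF is processing time divided by the duration of the audio processed. A small sketch of the arithmetic, using the figures quoted in this section, is shown here; `real_time_factor` and `timed` are hypothetical helpers, not part of visqol-python.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time divided by the duration of the audio processed."""
    return processing_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Figures from this section: 12.5 s clip, ~0.45 s with Numba + pyFFTW.
rtf_fast = real_time_factor(0.45, 12.5)
print(f"RTF = {rtf_fast:.3f}")
print(f"speedup vs pure Python (RTF 0.58) = {0.58 / rtf_fast:.1f}x")
print(f"vs C++ estimate (RTF 0.093) = {0.093 / rtf_fast:.1f}x")
```

To measure a real run, wrap the call, for example `timed(api.measure, "reference.wav", "degraded.wav")`, and divide the elapsed time by the clip length in seconds.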

Stage-level breakdown of the v3.6.0 fully-accelerated path:

| Stage | Time | Share |
| --- | --- | --- |
| Gammatone filterbank | 0.179 s | 40% |
| DP Patch matching (fused NSIM kernel) | 0.131 s | 29% |
| Global alignment (pyFFTW rfft/irfft) | 0.091 s | 20% |
| Fine alignment + NSIM | 0.043 s | 10% |
| Other (SPL, postproc, SVR, …) | 0.003 s | < 1% |

Project Structure

visqol-python/
├── visqol/                    # Main package
│   ├── __init__.py            # Package exports & version
│   ├── api.py                 # Public API (VisqolApi)
│   ├── visqol_manager.py      # Pipeline orchestrator
│   ├── visqol_core.py         # Core algorithm
│   ├── audio_utils.py         # Audio I/O & SPL normalization
│   ├── signal_utils.py        # Envelope, cross-correlation
│   ├── analysis_window.py     # Hann window
│   ├── gammatone.py           # ERB + Gammatone filterbank + spectrogram
│   ├── patch_creator.py       # Patch creation (Image + VAD modes)
│   ├── patch_selector.py      # DP-based optimal patch matching
│   ├── alignment.py           # Global alignment via cross-correlation
│   ├── nsim.py                # NSIM similarity metric
│   ├── quality_mapper.py      # SVR & exponential quality mapping
│   ├── numba_accel.py         # Optional Numba JIT kernels (DP, NSIM, Gammatone)
│   ├── __main__.py            # CLI entry point
│   ├── py.typed               # PEP 561 type marker
│   └── model/                 # Bundled SVR model
│       └── libsvm_nu_svr_model.txt
├── tests/                     # Tests & benchmarks (pytest)
│   ├── conftest.py            # Shared fixtures & CLI options
│   ├── test_quick.py          # Smoke tests (no external data needed)
│   ├── test_conformance.py    # Full conformance tests (needs testdata)
│   ├── test_parallel_correctness.py  # Numba parallel correctness tests
│   └── bench_*.py             # Performance benchmarks
├── .github/workflows/
│   ├── ci.yml                 # CI: lint + type-check + matrix test (Python × NumPy)
│   └── publish.yml            # Auto-publish to PyPI on tag push
├── pyproject.toml             # Package metadata & build config
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
└── README.md

Conformance Test Results

Tested against the official C++ ViSQOL v3.3.3 expected values:

| Test case | Mode | Expected MOS | Python MOS | Δ |
| --- | --- | --- | --- | --- |
| strauss_lp35 | Audio | 1.3889 | 1.3889 | 0.000000 |
| steely_lp7 | Audio | 2.2502 | 2.2502 | 0.000000 |
| sopr_256aac | Audio | 4.6823 | 4.6823 | 0.000000 |
| ravel_128opus | Audio | 4.4651 | 4.4651 | 0.000000 |
| moonlight_128aac | Audio | 4.6843 | 4.6843 | 0.000000 |
| harpsichord_96mp3 | Audio | 4.2237 | 4.2237 | 0.000000 |
| guitar_64aac | Audio | 4.3497 | 4.3497 | 0.000000 |
| glock_48aac | Audio | 4.3325 | 4.3325 | 0.000000 |
| contrabassoon_24aac | Audio | 2.3469 | 2.3468 | 0.000117 |
| castanets_identity | Audio | 4.7321 | 4.7321 | 0.000000 |
| speech_CA01 (polynomial) | Speech | 3.3745 | 3.3756 | 0.001057 |
| speech_CA01 (lattice) | Speech | 3.3130 | 3.3153 | 0.002341 |

Both speech values come from running the C++ ViSQOL binary directly with the corresponding --use_lattice_model flag, so they represent ground-truth parity targets.
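
A parity check of this kind boils down to comparing absolute differences against a tolerance. The sketch below does so for a few rows of the table; `check_conformance` is a hypothetical helper and the 0.005 tolerance is chosen for illustration only, not taken from the project's test suite. The table values are rounded to four decimals, so the differences computed here will not match the Δ column exactly.

```python
# (case, expected MOS from C++, MOS from Python), copied from the table above.
CASES = [
    ("strauss_lp35", 1.3889, 1.3889),
    ("contrabassoon_24aac", 2.3469, 2.3468),
    ("speech_CA01 (polynomial)", 3.3745, 3.3756),
    ("speech_CA01 (lattice)", 3.3130, 3.3153),
]

TOLERANCE = 0.005  # hypothetical threshold chosen for this sketch


def check_conformance(cases, tolerance):
    """Return (name, abs_diff) for cases whose MOS difference exceeds the tolerance."""
    return [
        (name, abs(expected - got))
        for name, expected, got in cases
        if abs(expected - got) > tolerance
    ]


failures = check_conformance(CASES, TOLERANCE)
worst = max(CASES, key=lambda case: abs(case[1] - case[2]))
print(f"{len(CASES) - len(failures)}/{len(CASES)} within {TOLERANCE}")
print(f"largest difference: {worst[0]}")
```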

References

  • Google ViSQOL (C++) — the original implementation this project is ported from
  • Hines, A., Gillen, E., Kelly, D., Skoglund, J., Kokaram, A., & Harte, N. (2015). ViSQOLAudio: An Objective Audio Quality Metric for Low Bitrate Codecs. The Journal of the Acoustical Society of America.
  • Chinen, M., Lim, F. S., Skoglund, J., Gureev, N., O'Gorman, F., & Hines, A. (2020). ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric. 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX).

License

Apache License 2.0. See LICENSE for details.

This project is a Python port of Google's ViSQOL, which is also licensed under Apache 2.0.