Pretrained Models in WeSpeaker

December 26, 2025

Besides speaker-related tasks, speaker embeddings can be used in many other tasks that require speaker modeling, such as

  • voice conversion
  • text-to-speech
  • speaker adaptive ASR
  • target speaker extraction

For users who would like to check speaker verification (SV) performance or extract speaker embeddings for the above tasks without having to train a speaker embedding learner themselves, we provide two types of pretrained models.

  1. Checkpoint Model, with suffix .pt: the model trained and saved as a checkpoint by the WeSpeaker Python code. You can use it to reproduce our published results, or as a starting checkpoint to continue training.

  2. Runtime Model, with suffix .onnx: the runtime model exported from the checkpoint model in ONNX format, for use with ONNX Runtime.
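
For the speaker verification use mentioned above, a trial is commonly scored by the cosine similarity between two embeddings. The following is a minimal sketch of that scoring step, assuming the embeddings are already available as plain lists of floats. The function names and the default threshold of 0.5 are hypothetical placeholders, not part of WeSpeaker; in practice the threshold would be tuned on a development set.

```python
import math


def cosine_score(emb_a, emb_b):
    """Return the cosine similarity between two embeddings of equal length."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if the score reaches the (assumed) threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

A higher score means the two utterances are more likely to come from the same speaker; scaling either embedding does not change the score.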

Model License

The pretrained models in WeSpeaker follow the license of their corresponding dataset. For example, the models pretrained on VoxCeleb follow the Creative Commons Attribution 4.0 International License, since that is the license of the VoxCeleb dataset; see https://mm.kaist.ac.kr/datasets/voxceleb/.

ONNX Inference Demo

To use the pretrained model in PyTorch format, please refer directly to the run.sh in the corresponding recipe.

The following toy example extracts speaker embeddings from the ONNX model.

# Download the pretrained model in onnx format and save it as onnx_path
# wav_path is the path to your wave file (16 kHz sample rate)
python wespeaker/bin/infer_onnx.py --onnx_path $onnx_path --wav_path $wav_path

You can easily adapt infer_onnx.py to your application; a speaker diarization example can be found in the voxconverse recipe.
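
Because the example above expects 16 kHz audio, it can help to check a file's sample rate before running inference. The sketch below uses only Python's standard wave module and assumes an uncompressed PCM WAV file. The function names wav_info and check_sample_rate are hypothetical helpers for illustration and are not part of WeSpeaker.

```python
import wave


def wav_info(wav_path):
    """Return (sample_rate_hz, duration_seconds) read from a PCM WAV header."""
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration


def check_sample_rate(wav_path, expected_hz=16000):
    """Raise ValueError if the file is not sampled at the expected rate."""
    rate, _ = wav_info(wav_path)
    if rate != expected_hz:
        raise ValueError(
            f"{wav_path}: sample rate is {rate} Hz, expected {expected_hz} Hz"
        )
    return rate
```

If the check fails, the audio would need to be resampled to 16 kHz before it is passed to the model.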

Model List

A model with the suffix LM has been further fine-tuned using large-margin fine-tuning, which can perform better on long audio, e.g. longer than 3 s.
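
The naming rule above can be turned into a small selection helper. This is only a sketch under stated assumptions: pick_variant is a hypothetical function, the "_LM" suffix follows most of the names in the lists below (some entries use "-LM" instead), and the 3-second default is just the example figure from the sentence above, not a tuned value.

```python
def pick_variant(base, duration_s, available, threshold_s=3.0):
    """Return the model name to use for an utterance of the given duration.

    Prefers the large-margin (LM) variant for audio longer than threshold_s,
    and falls back to whichever variant is actually available.
    """
    lm_name = f"{base}_LM"
    prefer_lm = duration_s > threshold_s
    # Try the preferred variant first, then fall back to the other one.
    order = [lm_name, base] if prefer_lm else [base, lm_name]
    for name in order:
        if name in available:
            return name
    raise KeyError(f"no pretrained variant found for {base!r}")
```

For example, with available = {"ResNet34", "ResNet34_LM", "ResNet152_LM"}, a 10-second utterance with base "ResNet34" selects "ResNet34_LM", while base "ResNet152" always selects "ResNet152_LM" because no other variant is listed.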

modelscope

| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|---|---|---|---|
| VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
| VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
| VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
| VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
| VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
| VoxCeleb | EN | ECAPA512 / ECAPA512_LM / ECAPA512_DINO | ECAPA512 / ECAPA512_LM |
| VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
| VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
| CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
| VoxBlink2 | Multilingual | SimAMResNet34 | SimAMResNet34 |
| VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet34 | SimAMResNet34 |
| VoxBlink2 | Multilingual | SimAMResNet100 | SimAMResNet100 |
| VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet100 | SimAMResNet100 |
| VoxCeleb | EN | W2V-BERT2.0 / W2V-BERT2.0_LM | - |
| VoxCeleb + VoxBlink2 (paper link) | EN | W2V-BERT2.0-MFA / W2V-BERT2.0-MFA-LM | - |

huggingface

| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|---|---|---|---|
| VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
| VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
| VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
| VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
| VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
| VoxCeleb | EN | ECAPA512 / ECAPA512_LM | ECAPA512 / ECAPA512_LM |
| VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
| VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
| CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |