Consistent Ensemble Distillation for Audio Tagging (CED)
March 20, 2026
This repo contains the source code for the ICASSP 2024 paper "Consistent Ensemble Distillation for Audio Tagging".

| Model | Parameters (M) | AS-20K (mAP) | AS-2M (mAP) |
|---|---|---|---|
| CED-Tiny | 5.5 | 36.5 | 48.1 |
| CED-Mini | 9.6 | 38.5 | 49.0 |
| CED-Small | 22 | 41.6 | 49.6 |
| CED-Base | 86 | 44.0 | 50.0 |
- All models work with 16 kHz audio and use 64-dim Mel-spectrograms, making them very fast.
CED-Tiny should be faster than MobileNets on a single x86 CPU (even though MACs/FLOPs would indicate otherwise).
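Since all models expect 16 kHz input, it can help to check a file's sample rate before running inference. The following is a minimal sketch using only the Python standard library. The helper name `is_expected_rate` is hypothetical and not part of this repo, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # the rate stated above for all CED models


def is_expected_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Return True if the PCM WAV file at `path` has the expected sample rate.

    Hypothetical helper for illustration; not part of the CED codebase.
    """
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate() == expected
```

Files that fail this check would need to be resampled to 16 kHz first, for example with a tool such as sox or ffmpeg.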
Pretrained models can be downloaded from Zenodo or Hugging Face.
| Model | Zenodo | Hugging Face |
|---|---|---|
| CED-Tiny | Link | Link |
| CED-Mini | Link | Link |
| CED-Small | Link | Link |
| CED-Base | Link | Link |
Demo
We have an online demo available here for CED-Base.
Inference/Usage
Huggingface Transformers
Inference (Onnx, Recommended)
>>> from optimum.onnxruntime import ORTModelForAudioClassification
>>> model_name = "mispeech/ced-mini"
>>> model = ORTModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True)
>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("/path-to/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> input_name = model.session.get_inputs()[0].name
>>> output = model(**{input_name: audio})
>>> logits = output.logits.squeeze()
>>> for idx in logits.argsort()[-2:][::-1]:
...     print(f"{model.config.id2label[idx]}: {logits[idx]:.4f}")
'Finger snapping: 0.9155'
'Slap: 0.0567'
Or standard Huggingface Transformers
# pip install transformers
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
model_name = "mispeech/ced-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True)
import torchaudio
audio, sampling_rate = torchaudio.load("/path-to/JeD5V5aaaoI_931_932.wav")
assert sampling_rate == 16000
inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
import torch
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = torch.argmax(logits, dim=-1).item()
model.config.id2label[predicted_class_id]
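The snippet above returns only the single best class, while audio tagging is a multi-label task, so several tags per clip are often of interest. Below is a minimal sketch of a top-k lookup in plain Python. The function `top_k_labels` is a hypothetical helper, and the scores and label map are made-up toy values, not real model outputs.

```python
def top_k_labels(scores, id2label, k=3):
    """Return the k highest-scoring (label, score) pairs, best first.

    Hypothetical helper for illustration; not part of the CED codebase.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], scores[i]) for i in ranked[:k]]


# Toy example with made-up scores and a made-up label map.
toy_scores = [0.05, 0.91, 0.02, 0.30]
toy_id2label = {0: "Slap", 1: "Finger snapping", 2: "Speech", 3: "Clapping"}
for label, score in top_k_labels(toy_scores, toy_id2label, k=2):
    print(f"{label}: {score:.4f}")
```

With a real model, one would presumably pass `logits.squeeze().tolist()` and `model.config.id2label`, assuming the label map is keyed by integer class index.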
Locally
To just use the CED models for inference, simply run:
git clone https://github.com/Richermans/CED/
cd CED/
pip3 install -r requirements.txt
python3 inference.py resources/*
Note that higher versions of hdf5 caused some problems for us, so please use version 1.12.1 if possible.
By default, inference uses CED-Mini, which offers a good trade-off between performance and speed.
One can switch the models with the -m flag:
python3 inference.py -m ced_tiny resources/*
python3 inference.py -m ced_mini resources/*
python3 inference.py -m ced_small resources/*
python3 inference.py -m ced_base resources/*
You can also use the models directly from Hugging Face; see here for usage instructions.
Training/Reproducing results
1. Preparing data
First, one needs to download AudioSet. One might use one of our own scripts.
For example, one can put the downloaded files into folders named data/balanced, data/unbalanced, and data/eval, such as:
data/balanced/
├── -0DdlOuIFUI_50.000.wav
├── -0DLPzsiXXE_30.000.wav
├── -0FHUc78Gqo_30.000.wav
├── -0mjrMposBM_80.000.wav
├── -0O3e95y4gE_100.000.wav
…
data/unbalanced/
├── --04kMEQOAs_0.000_10.000.wav
├── --0aJtOMp2M_30.000_40.000.wav
├── --0AzKXCHj8_22.000_32.000.wav
├── --0B3G_C3qc_10.000_20.000.wav
├── --0bntG9i7E_30.000_40.000.wav
…
data/eval/
├── 007P6bFgRCU_10.000_20.000.wav
├── 00AGIhlv-w0_300.000_310.000.wav
├── 00FBAdjlF4g_30.000_40.000.wav
├── 00G2vNrTnCc_10.000_20.000.wav
├── 00KM53yZi2A_30.000_40.000.wav
├── 00XaUxjGuX8_170.000_180.000.wav
├── 0-2Onbywljo_380.000_390.000.wav
Then just generate a .tsv file with:
find data/balanced/ -type f | awk 'BEGIN{print "filename"}{print}' > data/balanced.tsv
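For environments without find and awk, the same file list can be produced in Python. This is a minimal sketch under the assumption that the only requirement is a single `filename` header followed by one path per line. The function name `write_filename_tsv` is hypothetical, and the paths are sorted here, which differs from the order find would produce.

```python
from pathlib import Path


def write_filename_tsv(audio_dir, tsv_path):
    """Write one file path per line under a single `filename` header.

    Hypothetical helper for illustration; mirrors the find/awk one-liner above.
    Returns the number of files listed.
    """
    files = sorted(str(p) for p in Path(audio_dir).rglob("*") if p.is_file())
    with open(tsv_path, "w", encoding="utf-8") as handle:
        handle.write("filename\n")
        for name in files:
            handle.write(name + "\n")
    return len(files)
```

Calling `write_filename_tsv("data/balanced/", "data/balanced.tsv")` should then give a file equivalent to the one generated by the shell command.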
Then dump the data as hdf5 files using scripts/wavlist_to_hdf5.py:
python3 scripts/wavlist_to_hdf5.py data/balanced.tsv data/balanced_train/
This will generate a training datafile data/balanced_train/labels/balanced.tsv.
For the eval data, please use this script to download it.
The resulting eval.tsv should look like this:
filename labels hdf5path
data/eval/--4gqARaEJE.wav 73;361;74;72 data/eval_data/hdf5/eval_0.h5
data/eval/--BfvyPmVMo.wav 419 data/eval_data/hdf5/eval_0.h5
data/eval/--U7joUcTCo.wav 47 data/eval_data/hdf5/eval_0.h5
data/eval/-0BIyqJj9ZU.wav 21;20;17 data/eval_data/hdf5/eval_0.h5
data/eval/-0Gj8-vB1q4.wav 273;268;137 data/eval_data/hdf5/eval_0.h5
data/eval/-0RWZT-miFs.wav 379;307 data/eval_data/hdf5/eval_0.h5
data/eval/-0YUDn-1yII.wav 268;137 data/eval_data/hdf5/eval_0.h5
data/eval/-0jeONf82dE.wav 87;137;89;0;72 data/eval_data/hdf5/eval_0.h5
data/eval/-0nqfRcnAYE.wav 364 data/eval_data/hdf5/eval_0.h5
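To illustrate the format above, the following sketch parses such a file and turns the semicolon-separated label indices into multi-hot target vectors. It assumes the columns are tab-separated and that the indices refer to the 527 AudioSet classes. The function `parse_label_tsv` is hypothetical and not the loader used by this repo.

```python
import csv
import io

NUM_CLASSES = 527  # size of the AudioSet label set (assumed index range)


def parse_label_tsv(handle, num_classes=NUM_CLASSES):
    """Yield (filename, multi_hot) pairs from a tab-separated label file.

    Hypothetical helper for illustration; assumes tab-separated columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        target = [0.0] * num_classes
        for index in row["labels"].split(";"):
            target[int(index)] = 1.0
        yield row["filename"], target


# Two rows copied from the example above, with tabs as the assumed separator.
sample = io.StringIO(
    "filename\tlabels\thdf5path\n"
    "data/eval/--4gqARaEJE.wav\t73;361;74;72\tdata/eval_data/hdf5/eval_0.h5\n"
    "data/eval/--BfvyPmVMo.wav\t419\tdata/eval_data/hdf5/eval_0.h5\n"
)
for filename, target in parse_label_tsv(sample):
    active = [i for i, value in enumerate(target) if value]
    print(filename, active)
```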
2. Download logits
Download the logits used in the paper from Zenodo:
wget "https://zenodo.org/record/8275347/files/logits.zip?download=1" -O logits.zip
unzip logits.zip
This will create:
logits/
└── ensemble5014
    ├── balanced
    │   └── chunk_10
    └── full
        └── chunk_10
3. Train
python3 run.py train trainconfig/balanced_mixup_tiny_T_ensemble5014_chunk10.yaml
Export ONNX and Inference ONNX
python3 export_onnx.py -m ced_tiny
# or ced_mini, ced_small, ced_base
python3 onnx_inference_with_kaldi.py test.wav -m ced_tiny.onnx
python3 onnx_inference_with_torchaudio.py test.wav -m ced_tiny.onnx
Why use Kaldi to calculate Mel features? Because a ready-made C++ implementation is available at https://github.com/csukuangfj/kaldi-native-fbank/tree/master
Training on your own data
This is a label-free framework, meaning that any data can be used for optimization. To use your own data, do the following:
Put your data somewhere and generate a .tsv file with a single header filename, such as:
find some_directory -type f | awk 'BEGIN{print "filename"}{print}' > my_data.tsv
Then dump the corresponding hdf5 file using scripts/wavlist_to_hdf5.py:
python3 scripts/wavlist_to_hdf5.py my_data.tsv my_data_hdf5/
Then run the script save_logits.py as:
torchrun save_logits.py logitconfig/balanced_base_chunk10s_topk20.yaml --train_data my_data_hdf5/labels/my_data.tsv
Finally you can train your own model on that augmented dataset with:
python3 run.py train trainconfig/balanced_mixup_base_T_ensemble5014_chunk10.yaml --logitspath YOUR_LOGITS_PATH --train_data YOUR_TRAIN_DATA.tsv
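The label-free setup works because the student is trained against the stored teacher outputs rather than human annotations. The following is a minimal sketch of one common choice for multi-label distillation: binary cross-entropy between the student's sigmoid outputs and soft teacher probabilities. This is an assumption for illustration; the loss actually used by run.py may differ, and the names `sigmoid` and `soft_bce` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_bce(student_logits, teacher_probs, eps=1e-7):
    """Mean binary cross-entropy of student logits against soft teacher targets.

    Hypothetical sketch; not necessarily the loss implemented in this repo.
    """
    total = 0.0
    for logit, target in zip(student_logits, teacher_probs):
        # Clamp to avoid log(0) for saturated predictions.
        p = min(max(sigmoid(logit), eps), 1.0 - eps)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(student_logits)


# Toy values: a student that agrees with the teacher gets a lower loss.
teacher = [0.9, 0.1]
print(soft_bce([2.2, -2.2], teacher))   # student agrees with teacher
print(soft_bce([-2.2, 2.2], teacher))   # student disagrees with teacher
```

No ground-truth labels appear anywhere in this computation, which is why any unlabeled audio collection can serve as training data once teacher logits have been saved for it.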
Hear-Evaluation
We also submitted the models for the HEAR benchmark evaluation. HEAR uses a simple linear downstream evaluation protocol across 19 tasks. We simply extracted the features from the penultimate layer of each CED model. The repo can be found here.
| Model | Beehive States Avg | Beijing Opera Percussion | CREMA-D | DCASE16 | ESC-50 | FSD50K | GTZAN Genre | GTZAN Music Speech | Gunshot Triangulation | LibriCount | MAESTRO 5hr | Mridangam Stroke | Mridangam Tonic | NSynth Pitch 50hr | NSynth Pitch 5hr | Speech Commands 5hr | Speech Commands Full | Vocal Imitations | VoxLingua107 Top10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ced-tiny | 38.345 | 94.90 | 62.52 | 88.02 | 95.80 | 62.73 | 89.20 | 93.01 | 91.67 | 61.26 | 4.81 | 96.13 | 90.74 | 69.19 | 44.00 | 70.53 | 77.10 | 19.18 | 33.64 |
| ced-mini | 59.17 | 96.18 | 65.26 | 90.66 | 95.35 | 63.88 | 90.30 | 94.49 | 86.01 | 64.02 | 8.29 | 96.56 | 93.32 | 75.20 | 55.60 | 77.38 | 81.96 | 20.37 | 34.67 |
| ced-small | 51.70 | 96.60 | 66.64 | 91.63 | 95.95 | 64.33 | 89.50 | 91.22 | 93.45 | 65.59 | 10.96 | 96.82 | 93.94 | 79.95 | 60.20 | 80.92 | 85.19 | 21.92 | 36.53 |
| ced-base | 48.35 | 96.60 | 69.10 | 92.19 | 96.65 | 65.48 | 88.60 | 94.36 | 89.29 | 67.85 | 14.76 | 97.43 | 96.55 | 82.81 | 68.20 | 86.93 | 89.67 | 22.69 | 38.57 |
Android APKs
Thanks to csukuangfj, there are also pre-compiled Android binaries using sherpa-onnx.
Binaries are available on the k2-fsa sherpa page, https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk.html.
Citation
Please cite our paper if you find this work useful:
@inproceedings{dinkel2023ced,
title={CED: Consistent ensemble distillation for audio tagging},
author={Dinkel, Heinrich and Wang, Yongqing and Yan, Zhiyong and Zhang, Junbo and Wang, Yujun},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2024}
}