Consistent Ensemble Distillation for Audio Tagging (CED)

March 20, 2026

This repository contains the source code for the ICASSP 2024 paper "Consistent Ensemble Distillation for Audio Tagging".

Framework

| Model | Parameters (M) | AS-20K (mAP) | AS-2M (mAP) |
|---|---|---|---|
| CED-Tiny | 5.5 | 36.5 | 48.1 |
| CED-Mini | 9.6 | 38.5 | 49.0 |
| CED-Small | 22 | 41.6 | 49.6 |
| CED-Base | 86 | 44.0 | 50.0 |

  • All models work with 16 kHz audio and use 64-dim Mel-spectrograms, making them very fast. CED-Tiny should be faster than MobileNets on a single x86 CPU (even though MACs/FLOPs would indicate otherwise).
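
Since all models expect 16 kHz input, it can help to check a folder of recordings before running inference. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of this repo, and the `wave` module only reads PCM WAV files, so other encodings would need a different reader.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16000  # the README states that all CED models work with 16 kHz audio


def wav_sample_rate(path):
    """Return the sample rate stored in the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def find_wrong_rate(directory, expected=EXPECTED_RATE):
    """Return (path, rate) pairs for .wav files whose rate differs from `expected`."""
    mismatches = []
    for path in sorted(Path(directory).glob("*.wav")):
        rate = wav_sample_rate(path)
        if rate != expected:
            mismatches.append((path, rate))
    return mismatches
```

Calling `find_wrong_rate("resources")` would list the files that need resampling before they are passed to the models; an empty list means every file is already at 16 kHz.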

Pretrained models can be downloaded from Zenodo or Hugging Face.

| Model | Zenodo | Hugging Face |
|---|---|---|
| CED-Tiny | Link | Link |
| CED-Mini | Link | Link |
| CED-Small | Link | Link |
| CED-Base | Link | Link |

Demo

We have an online demo available here for CED-Base.

Inference/Usage

Hugging Face Transformers

>>> import torch
>>> import torchaudio
>>> from optimum.onnxruntime import ORTModelForAudioClassification

>>> model_name = "mispeech/ced-mini"
>>> model = ORTModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True)

>>> audio, sampling_rate = torchaudio.load("/path-to/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> input_name = model.session.get_inputs()[0].name
>>> # Pass the loaded mono waveform (shape: 1 x samples) instead of random noise.
>>> output = model(**{input_name: audio})
>>> # as_tensor accepts both torch and NumPy outputs.
>>> logits = torch.as_tensor(output.logits).squeeze()
>>> # Print the two highest-scoring classes, best first.
>>> for idx in logits.argsort(descending=True)[:2].tolist():
...     print(f"{model.config.id2label[idx]}: {logits[idx]:.4f}")
Finger snapping: 0.9155
Slap: 0.0567

Or with standard Hugging Face Transformers:

# pip install transformers torch torchaudio
import torch
import torchaudio
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

model_name = "mispeech/ced-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True)

# Load a 16 kHz clip and convert it to model inputs.
audio, sampling_rate = torchaudio.load("/path-to/JeD5V5aaaoI_931_932.wav")
assert sampling_rate == 16000
inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")

# Run the model without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

# Report the single highest-scoring class.
predicted_class_id = torch.argmax(logits, dim=-1).item()
print(model.config.id2label[predicted_class_id])
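
Audio tagging is a multi-label task, so a clip can carry several tags at once, while an argmax only reports one. The sketch below ranks scores and returns the top k labels using plain Python. The helper name `top_k_tags`, the toy scores, and the toy label map are all hypothetical. Whether the model output still needs a sigmoid is an assumption to verify for your checkpoint, so the sigmoid is optional here; it does not change the ranking either way.

```python
import math


def top_k_tags(scores, id2label, k=3, apply_sigmoid=False):
    """Return the k highest-scoring (label, score) pairs, best first."""
    if apply_sigmoid:
        # Map raw logits to independent per-class probabilities.
        scores = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], scores[i]) for i in ranked[:k]]


# Toy scores and label names for illustration only, not real model output.
toy_scores = [0.02, 0.91, 0.05, 0.40]
toy_labels = {0: "Speech", 1: "Finger snapping", 2: "Slap", 3: "Clapping"}
for label, score in top_k_tags(toy_scores, toy_labels, k=2):
    print(f"{label}: {score:.4f}")
```

With the Transformers example above, something like `top_k_tags(logits.squeeze().tolist(), model.config.id2label)` should work, assuming `id2label` is keyed by integer class index.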

Locally

To just use the CED models for inference, simply run:

git clone https://github.com/Richermans/CED/
cd CED/
pip3 install -r requirements.txt
python3 inference.py resources/*

Note that I experienced some problems with higher versions of hdf5, so if possible please use 1.12.1.

By default we use CED-mini here, which offers a good trade-off between performance and speed. One can switch the models with the -m flag:

python3 inference.py -m ced_tiny resources/*
python3 inference.py -m ced_mini resources/*
python3 inference.py -m ced_small resources/*
python3 inference.py -m ced_base resources/*

You can also use the models directly from Hugging Face; see here for usage instructions.

Training/Reproducing results

1. Preparing data

First, one needs to download AudioSet. One might use one of our own scripts.

For example, one can put the downloaded files into folders named data/balanced, data/unbalanced, and data/eval, such as:

data/balanced/
├── -0DdlOuIFUI_50.000.wav
├── -0DLPzsiXXE_30.000.wav
├── -0FHUc78Gqo_30.000.wav
├── -0mjrMposBM_80.000.wav
├── -0O3e95y4gE_100.000.wav


data/unbalanced/
├── --04kMEQOAs_0.000_10.000.wav
├── --0aJtOMp2M_30.000_40.000.wav
├── --0AzKXCHj8_22.000_32.000.wav
├── --0B3G_C3qc_10.000_20.000.wav
├── --0bntG9i7E_30.000_40.000.wav


data/eval/
├── 007P6bFgRCU_10.000_20.000.wav
├── 00AGIhlv-w0_300.000_310.000.wav
├── 00FBAdjlF4g_30.000_40.000.wav
├── 00G2vNrTnCc_10.000_20.000.wav
├── 00KM53yZi2A_30.000_40.000.wav
├── 00XaUxjGuX8_170.000_180.000.wav
├── 0-2Onbywljo_380.000_390.000.wav

Then just generate a .tsv file with:

find data/balanced/ -type f | awk 'BEGIN{print "filename"}{print}' > data/balanced.tsv
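
On systems without `find` and `awk`, the same file list could be produced from Python. This is a minimal sketch with a hypothetical helper name; unlike `find`, it sorts the paths so that the output is reproducible.

```python
from pathlib import Path


def write_filelist(audio_dir, tsv_path):
    """Write every file under `audio_dir` to `tsv_path`, one path per line, below a `filename` header."""
    files = sorted(str(p) for p in Path(audio_dir).rglob("*") if p.is_file())
    with open(tsv_path, "w", encoding="utf-8") as handle:
        handle.write("filename\n")
        for name in files:
            handle.write(name + "\n")
    return len(files)
```

For the layout above, `write_filelist("data/balanced", "data/balanced.tsv")` should give the same single-column file as the shell command, up to line order.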

Then dump the data as hdf5 files using scripts/wavlist_to_hdf5.py:

python3 scripts/wavlist_to_hdf5.py data/balanced.tsv data/balanced_train/

This will generate a training datafile data/balanced_train/labels/balanced.tsv.

For the eval data, please use this script to download it.

The resulting eval.tsv should look like this:

filename	labels	hdf5path
data/eval/--4gqARaEJE.wav	73;361;74;72	data/eval_data/hdf5/eval_0.h5
data/eval/--BfvyPmVMo.wav	419	data/eval_data/hdf5/eval_0.h5
data/eval/--U7joUcTCo.wav	47	data/eval_data/hdf5/eval_0.h5
data/eval/-0BIyqJj9ZU.wav	21;20;17	data/eval_data/hdf5/eval_0.h5
data/eval/-0Gj8-vB1q4.wav	273;268;137	data/eval_data/hdf5/eval_0.h5
data/eval/-0RWZT-miFs.wav	379;307	data/eval_data/hdf5/eval_0.h5
data/eval/-0YUDn-1yII.wav	268;137	data/eval_data/hdf5/eval_0.h5
data/eval/-0jeONf82dE.wav	87;137;89;0;72	data/eval_data/hdf5/eval_0.h5
data/eval/-0nqfRcnAYE.wav	364	data/eval_data/hdf5/eval_0.h5
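
The `labels` column holds semicolon-separated class indices. As a rough sketch of how such rows might be turned into multi-hot targets, the snippet below parses the format with the standard `csv` module. The helper names are hypothetical, and it assumes the indices are zero-based positions in the 527-class AudioSet label set.

```python
import csv
import io

NUM_CLASSES = 527  # size of the AudioSet label set; adjust for other label sets


def read_eval_rows(handle):
    """Yield (filename, label_indices, hdf5path) from a tab-separated eval file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        labels = [int(x) for x in row["labels"].split(";")]
        yield row["filename"], labels, row["hdf5path"]


def multi_hot(label_indices, num_classes=NUM_CLASSES):
    """Turn a list of class indices into a 0/1 target vector."""
    target = [0] * num_classes
    for index in label_indices:
        target[index] = 1
    return target


# Two rows copied from the example above, used here as inline sample data.
sample = (
    "filename\tlabels\thdf5path\n"
    "data/eval/--4gqARaEJE.wav\t73;361;74;72\tdata/eval_data/hdf5/eval_0.h5\n"
    "data/eval/--BfvyPmVMo.wav\t419\tdata/eval_data/hdf5/eval_0.h5\n"
)
rows = list(read_eval_rows(io.StringIO(sample)))
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("data/eval.tsv", newline="")` if that is where your eval.tsv lives.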

2. Download logits

Download the logits used in the paper from Zenodo:

wget "https://zenodo.org/record/8275347/files/logits.zip?download=1" -O logits.zip
unzip logits.zip

This will create:

logits/
└── ensemble5014
    ├── balanced
    │   └── chunk_10
    └── full
        └── chunk_10

3. Train

python3 run.py train trainconfig/balanced_mixup_tiny_T_ensemble5014_chunk10.yaml

ONNX Export and Inference

python3 export_onnx.py -m ced_tiny
# or ced_mini, ced_small, ced_base
python3 onnx_inference_with_kaldi.py test.wav -m ced_tiny.onnx
python3 onnx_inference_with_torchaudio.py test.wav -m ced_tiny.onnx

Why use Kaldi to calculate Mel features? Because a ready-made C++ implementation is available, which can be found here: https://github.com/csukuangfj/kaldi-native-fbank/tree/master

Training on your own data

This is a label-free framework, meaning that any data can be used for optimization. To use your own data, do the following:

Put your data somewhere and generate a .tsv file with a single header filename, such as:

find some_directory -type f | awk 'BEGIN{print "filename"}{print}' > my_data.tsv

Then dump the corresponding hdf5 file using scripts/wavlist_to_hdf5.py:

python3 scripts/wavlist_to_hdf5.py my_data.tsv my_data_hdf5/

Then run the script save_logits.py as:

torchrun save_logits.py logitconfig/balanced_base_chunk10s_topk20.yaml --train_data my_data_hdf5/labels/my_data.tsv

Finally you can train your own model on that augmented dataset with:

python3 run.py train trainconfig/balanced_mixup_base_T_ensemble5014_chunk10.yaml --logitspath YOUR_LOGITS_PATH --train_data YOUR_TRAIN_DATA.tsv
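
Label-free here means that the training targets are the saved teacher logits rather than human annotations. The snippet below sketches that idea in plain Python, assuming a binary cross-entropy objective between student probabilities and sigmoid-activated teacher logits. The objective and the function names are illustrative assumptions, not code taken from run.py.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_target_bce(student_logits, teacher_logits, eps=1e-7):
    """Mean binary cross-entropy between student probabilities and teacher soft targets."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        p = min(max(sigmoid(s), eps), 1.0 - eps)  # student probability, clamped for log()
        q = sigmoid(t)  # teacher probability used as the soft target
        total += -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return total / len(student_logits)
```

The loss is smallest when the student reproduces the teacher's probabilities. Because the targets are soft, that minimum is the teacher's own entropy rather than zero.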

HEAR Evaluation

We also submitted the models for the HEAR benchmark evaluation. HEAR uses a simple linear downstream evaluation protocol across 19 tasks. We simply extracted features from the penultimate layer of all CED models. The repo can be found here.

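| Model | Beehive States Avg | Beijing Opera Percussion | CREMA-D | DCASE16 | ESC-50 | FSD50K | GTZAN Genre | GTZAN Music Speech | Gunshot Triangulation | LibriCount | MAESTRO 5hr | Mridangam Stroke | Mridangam Tonic | NSynth Pitch 50hr | NSynth Pitch 5hr | Speech Commands 5hr | Speech Commands Full | Vocal Imitations | VoxLingua107 Top10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ced-tiny | 38.345 | 94.90 | 62.52 | 88.02 | 95.80 | 62.73 | 89.20 | 93.01 | 91.67 | 61.26 | 4.81 | 96.13 | 90.74 | 69.19 | 44.00 | 70.53 | 77.10 | 19.18 | 33.64 |
| ced-mini | 59.17 | 96.18 | 65.26 | 90.66 | 95.35 | 63.88 | 90.30 | 94.49 | 86.01 | 64.02 | 8.29 | 96.56 | 93.32 | 75.20 | 55.60 | 77.38 | 81.96 | 20.37 | 34.67 |
| ced-small | 51.70 | 96.60 | 66.64 | 91.63 | 95.95 | 64.33 | 89.50 | 91.22 | 93.45 | 65.59 | 10.96 | 96.82 | 93.94 | 79.95 | 60.20 | 80.92 | 85.19 | 21.92 | 36.53 |
| ced-base | 48.35 | 96.60 | 69.10 | 92.19 | 96.65 | 65.48 | 88.60 | 94.36 | 89.29 | 67.85 | 14.76 | 97.43 | 96.55 | 82.81 | 68.20 | 86.93 | 89.67 | 22.69 | 38.57 |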

Android APKs

Thanks to csukuangfj, there are also pre-compiled Android binaries using sherpa-onnx.

Binaries are available on the k2-fsa sherpa page, https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk.html.

Citation

Please cite our paper if you find this work useful:

@inproceedings{dinkel2023ced,
  title={CED: Consistent ensemble distillation for audio tagging},
  author={Dinkel, Heinrich and Wang, Yongqing and Yan, Zhiyong and Zhang, Junbo and Wang, Yujun},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}